CN113434490B

CN113434490B - Quality detection method and device for offline imported data

Info

Publication number: CN113434490B
Application number: CN202010209979.9A
Authority: CN
Inventors: 王子璠
Original assignee: Beijing Jingdong Zhenshi Information Technology Co Ltd
Current assignee: Beijing Jingdong Zhenshi Information Technology Co Ltd
Priority date: 2020-03-23
Filing date: 2020-03-23
Publication date: 2024-04-12
Anticipated expiration: 2040-03-23
Also published as: CN113434490A

Abstract

The invention discloses a quality detection method and device for offline imported data, and relates to the technical field of computers. One embodiment of the method comprises the following steps: acquiring data to be detected generated offline, wherein the data to be detected comprises fields and corresponding field values; performing integrity detection on the first field and the corresponding field value to obtain an integrity rate; performing validity repair on the data after the integrity detection, and then performing validity detection on the second field and the corresponding field value to obtain a validity rate; after the validity detection, performing accuracy detection on the third field and the corresponding field value and uniqueness detection on the fourth field and the corresponding field value, respectively, to obtain an accuracy rate and a uniqueness rate; and calculating the quality score of the data to be detected according to the integrity rate, the validity rate, the accuracy rate and the uniqueness rate. According to the embodiment, appropriate data quality detection indexes can be selected in view of the characteristics of offline data, so as to meet the quality detection requirement of offline imported data.

Description

Quality detection method and device for offline imported data

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for quality detection of offline imported data.

Background

In the big data age, various industries have established many data warehouse systems and accumulated a large amount of data. In order to enable data to effectively support daily work, people pay more and more attention to data quality problems. Currently, a general data quality detection method is often used, for example, data quality monitoring software such as Apache Griffin is used to perform data quality detection.

In the process of implementing the present invention, the inventor finds that at least the following problems exist in the prior art:

In the prior art, a general-purpose data quality detection method is applied to detect data quality, and such a method is suited to online data. Offline imported data, however, is obtained through manual processing, and existing data quality monitoring software cannot provide quality detection for it.

Disclosure of Invention

In view of this, the embodiment of the invention provides a quality detection method and device for offline imported data, which can select a proper data quality detection index in combination with the characteristics of offline data, meet the requirement of quality detection of offline imported data, and enable the data quality detection process to be more reasonable and controllable and the accuracy of detection results to be high.

To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a quality detection method of offline imported data.

A quality detection method of offline imported data comprises the following steps: acquiring data to be detected generated offline, wherein the data to be detected comprises fields and corresponding field values; performing integrity detection on the first field and the corresponding field value to obtain the integrity rate of the data to be detected; performing validity repair on the data after the integrity detection, and then performing validity detection on the second field and the corresponding field value to obtain the validity rate of the data to be detected; after the validity detection, performing accuracy detection on the third field and the corresponding field value and uniqueness detection on the fourth field and the corresponding field value, respectively, to obtain the accuracy rate and the uniqueness rate of the data to be detected; and calculating the quality score of the data to be detected according to the integrity rate, the validity rate, the accuracy rate and the uniqueness rate of the data to be detected.

Optionally, before the validity repair is performed on the data after the integrity detection, the method further comprises: replacing null values in the field values corresponding to the first field according to the appointed filling values so as to carry out integrity repair; and modifying the integrity rate of the data to be detected to 100% after the integrity repair.

Optionally, the validity repair includes: and carrying out left and right space elimination on field values corresponding to all fields of the data after the integrity detection.

Optionally, before the accuracy detection of the third field and the corresponding field value, the method further includes: determining the type of the third field, and detecting the accuracy of the type, wherein the type comprises dimensions and facts.

Optionally, if the third field is a dimension field, the accuracy detecting includes: for each dimension field for accuracy detection, respectively calculating the similarity between each field value of the dimension field and the field value of the corresponding appointed dimension field; counting a first proportion of the number of field values with the similarity smaller than a set similarity threshold value in the number of field values of the dimension field; and taking the average value of the first proportion corresponding to all the dimension fields as dimension accuracy so as to detect the accuracy.

Optionally, if the third field is a fact field, the accuracy detection includes: counting a second proportion of the number of field values meeting a set precision threshold in the field values of the fact fields in the number of field values of the fact fields; taking the average value of the second proportion corresponding to all the fact fields as the fact accuracy so as to carry out accuracy detection.

Optionally, if the third field includes both a dimension field and a fact field, the accuracy detection includes: calculating the dimension accuracy corresponding to the dimension field and the fact accuracy corresponding to the fact field, respectively; and taking a weighted average of the dimension accuracy and the fact accuracy, weighted by the proportions of the number of field values corresponding to the dimension field and the number of field values corresponding to the fact field in the number of field values corresponding to the third field, to obtain the accuracy rate of the data to be detected, so as to perform the accuracy detection.

Optionally, the uniqueness detection includes: concatenating all field values corresponding to the fourth field in each row of data to serve as a retrieval primary key; if multiple rows of data are retrieved by the same retrieval primary key, counting the number of repeated rows; and calculating the uniqueness rate of the data to be detected according to the sum of the numbers of repeated rows corresponding to all retrieval primary keys and the number of rows of the data to be detected, so as to perform the uniqueness detection.

Optionally, calculating the quality score of the data to be detected according to the integrity rate, the validity rate, the accuracy rate and the uniqueness rate of the data to be detected includes: calculating the quality score of the data to be detected as a weighted average of the integrity rate, the validity rate, the accuracy rate and the uniqueness rate of the data to be detected.

According to another aspect of the embodiments of the present invention, there is provided a quality detection apparatus for offline imported data.

A quality detection apparatus for offline imported data, comprising: a data acquisition module, used for acquiring data to be detected generated offline, wherein the data to be detected comprises fields and corresponding field values; a first detection module, used for performing integrity detection on the first field and the corresponding field value to obtain the integrity rate of the data to be detected; a second detection module, used for performing validity repair on the data after the integrity detection and then performing validity detection on the second field and the corresponding field value to obtain the validity rate of the data to be detected; a third detection module, used for performing, after the validity detection, accuracy detection on the third field and the corresponding field value and uniqueness detection on the fourth field and the corresponding field value, respectively, to obtain the accuracy rate and the uniqueness rate of the data to be detected; and a quality evaluation module, used for calculating the quality score of the data to be detected according to the integrity rate, the validity rate, the accuracy rate and the uniqueness rate of the data to be detected.

According to yet another aspect of an embodiment of the present invention, there is provided an electronic device for quality detection of offline imported data.

An electronic device for quality detection of offline imported data, comprising: one or more processors; and the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors realize the quality detection method of the offline imported data.

According to yet another aspect of an embodiment of the present invention, a computer-readable medium is provided.

A computer-readable medium having stored thereon a computer program which, when executed by a processor, implements a quality detection method for offline imported data provided by an embodiment of the present invention.

One embodiment of the above invention has the following advantages or benefits. Data to be detected that is generated offline is acquired, the data comprising fields and corresponding field values; integrity detection is performed on the first field and the corresponding field value to obtain the integrity rate of the data to be detected; validity repair is performed on the data after the integrity detection, and validity detection is then performed on the second field and the corresponding field value to obtain the validity rate; after the validity detection, accuracy detection is performed on the third field and the corresponding field value and uniqueness detection on the fourth field and the corresponding field value to obtain the accuracy rate and the uniqueness rate; and the quality score of the data to be detected is calculated from the integrity rate, the validity rate, the accuracy rate and the uniqueness rate. Quality detection of offline imported data is thereby realized: in view of the characteristics of offline imported data, consistency and timeliness are dropped when the data quality detection indexes are selected, while integrity, validity, accuracy and uniqueness are retained, which meets the quality detection requirement of offline imported data. Meanwhile, because the quality detection indexes are not completely independent, a strict order is imposed on the detection steps, so that the data quality detection process is more reasonable and controllable and the detection result is more accurate.

In addition, for disordered data, accuracy is explicitly divided into dimension accuracy and fact accuracy during accuracy detection, so that the quality of different types of data is detected separately and the data detection is more scientific and reasonable.

Further effects of the above-described non-conventional alternatives are described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of main steps of a quality detection method of offline imported data according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an implementation flow of quality detection of offline imported data according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of main modules of a quality detection apparatus for offline import data according to an embodiment of the present invention;

FIG. 4 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;

FIG. 5 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In general, data quality is summarized as the degree to which the following six indexes are satisfied in a system:

integrity: whether the data is complete or has missing values;

uniqueness: whether the data contains repeated records;

timeliness: how much time elapses between the moment the data is actually generated and the moment it is recorded;

validity: whether the data meets user-defined conditions or falls within a certain range of values;

accuracy: the ability of the data to accurately represent physical-world information;

consistency: whether the values of the same entity are consistent across different systems or data sets.

Offline imported data is data that is mainly processed manually rather than by a computer program. Its characteristics are a single data source, a low timeliness requirement, and a total volume that is not large but is prone to disorder. Because the data source is single, detecting the consistency index is not feasible; because the timeliness requirement is low, timeliness is usually not considered; and because manual processing is error-prone (for example, missing data, leading or trailing spaces around field values that are hard to spot by eye, or non-standard naming), the disordered offline imported data needs to be processed and repaired so that its quality can be detected properly.

For the special scenario of offline imported data, the invention improves on the general data quality detection method and designs a tailored data quality detection method.

When the data detection indexes are selected, consistency and timeliness are dropped, and integrity, validity, accuracy and uniqueness are retained. For disordered data, accuracy is explicitly divided into dimension accuracy, which examines how well values conform to naming conventions, and fact accuracy, which examines the precision of numeric values.

In addition, the indexes are not completely independent, and the quality of certain indexes can influence the detection of other indexes. Therefore, the quality detection method of the offline imported data needs to improve the existing data quality detection method in two ways: first, the detection process has strict sequence requirements; second, the data needs to be properly repaired while being detected. That is, after some indexes are detected and repaired, other indexes can be detected.

In the embodiment of the present invention, for brevity, the total number of records of the offline-generated data to be detected (i.e., the data to be imported) is denoted S, where each data record includes one or more fields and a field value corresponding to each field. Following the order in which the quality indexes are detected, the field sets that the user configures for each quality index are denoted as follows:

F₁: the set of fields on which integrity detection is to be performed;

F₂: the set of fields on which validity detection is to be performed;

F₃: the set of fields on which accuracy detection is to be performed;

F₄: the set of fields on which uniqueness detection is to be performed.

Correspondingly, the detection results of the quality indexes (all recorded as percentages) are denoted P₁, P₂, P₃ and P₄, respectively; the weights of the quality indexes are W₁, W₂, W₃ and W₄; and the final data quality detection result is Q.

In an embodiment of the invention, in the data preparation phase, the data warehouse comprises at least two libraries: a formal library and a test library. The formal library is the database into which offline data is imported after it has been tested, and the test library is the database into which offline data is first imported for testing or repair. A target table for the data import is built in both the formal library and the test library. The storage format of the target table in the test library is defined as text (e.g., text files in txt or csv format), and the user needs to ensure that the data file to be imported matches the definition of the test library's target table, i.e., that its encoding, separator, field order and so on are the same as those of the target table. The storage format of the target table in the formal library is not limited and may be text or a compressed format; compressed formats are more common because they save space.

Fig. 1 is a schematic diagram of main steps of a quality detection method of offline imported data according to an embodiment of the present invention. As shown in fig. 1, the quality detection method of offline import data according to the embodiment of the present invention mainly includes the following steps S101 to S105.

Step S101: acquiring data to be detected generated offline, wherein the data to be detected comprises fields and corresponding field values;

Step S102: performing integrity detection on the first field and the corresponding field value to obtain the integrity rate of the data to be detected;

Step S103: performing validity repair on the data after the integrity detection, and then performing validity detection on the second field and the corresponding field value to obtain the validity rate of the data to be detected;

Step S104: after the validity detection, performing accuracy detection on the third field and the corresponding field value and uniqueness detection on the fourth field and the corresponding field value, respectively, to obtain the accuracy rate and the uniqueness rate of the data to be detected;

Step S105: calculating the quality score of the data to be detected according to the integrity rate, the validity rate, the accuracy rate and the uniqueness rate of the data to be detected.

According to the technical scheme of the invention, the first field, the second field, the third field and the fourth field are respectively a preset field set to be subjected to integrity detection, a preset field set to be subjected to validity detection, a preset field set to be subjected to accuracy detection and a preset field set to be subjected to uniqueness detection, wherein only one field or a plurality of fields can be included in the field sets, and the fields included in each field set can be identical to other field sets, can be completely different from other field sets and can be partially identical to other field sets.

In the embodiment of the invention, when the data to be detected generated offline is acquired, the data to be detected generated offline is uploaded to a target table of a test library, and then the data required by each detection is acquired from the target table and data quality detection is performed. In the data quality detection, for offline data, the integrity, the validity, the accuracy and the uniqueness of the data need to be detected sequentially according to a set sequence, wherein the accuracy and the uniqueness can be executed in parallel or the detection sequence of the two can be not limited.

According to step S102, integrity detection is performed first. The first field F₁ is determined from the preset set of fields to be subjected to integrity detection. For any field f₁ in F₁, the proportion (percentage) of records whose field value is not null out of the total number of records S is counted; the proportions for all fields in F₁ are then averaged, and the result is denoted P₁, which is the integrity rate of the data to be detected.
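The integrity-rate calculation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `integrity_rate`, the representation of records as dictionaries, the sample rows, and the choice to treat only `None` as a null value are all assumptions made for the example.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def integrity_rate(records: List[Record], f1: List[str]) -> float:
    """Return P1: the average, over the fields in F1, of the share of
    records whose value for that field is not null."""
    total = len(records)  # S, the total number of records
    if total == 0 or not f1:
        raise ValueError("need at least one record and one field in F1")
    ratios = []
    for field in f1:
        # Assumption: only None counts as null.
        non_null = sum(1 for r in records if r.get(field) is not None)
        ratios.append(non_null / total)
    return sum(ratios) / len(ratios)


# Hypothetical sample: 5 records, one of which has a null city_name.
rows = [
    {"city_name": "A"},
    {"city_name": "B"},
    {"city_name": None},
    {"city_name": "D"},
    {"city_name": "E"},
]
print(integrity_rate(rows, ["city_name"]))  # 0.8
```

With four non-null values out of five records, the single field contributes 4/5, so P₁ is 80%.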

According to one embodiment of the invention, after the integrity detection of the data, before the validity restoration of the data after the integrity detection is carried out, the integrity restoration can also be carried out, specifically, the null value in the field value corresponding to the first field is replaced according to the designated filling value so as to carry out the integrity restoration; and modifying the integrity rate of the data to be detected to 100% after the integrity repair.
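A minimal Python sketch of the integrity repair just described is given below. The function name `repair_integrity`, the per-field `fill_values` mapping, and the sample rows are hypothetical; as before, only `None` is assumed to represent a null value.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def repair_integrity(
    records: List[Record], f1: List[str], fill_values: Dict[str, str]
) -> Tuple[List[Record], float]:
    """Replace null values of the fields in F1 with the designated filling
    value and return the repaired records together with P1 reset to 1.0."""
    repaired = []
    for record in records:
        fixed = dict(record)  # copy, so the input rows are left untouched
        for field in f1:
            if fixed.get(field) is None:
                fixed[field] = fill_values[field]
        repaired.append(fixed)
    # After the repair no null remains in F1, so the integrity rate is 100%.
    return repaired, 1.0


rows = [{"city_name": "A"}, {"city_name": None}]
fixed_rows, p1 = repair_integrity(rows, ["city_name"], {"city_name": "other"})
print(fixed_rows[1]["city_name"], p1)  # other 1.0
```

Returning a copy rather than modifying the rows in place is a design choice of this sketch; it keeps the originally imported data available for comparison.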

According to step S103, validity detection is performed on the data after the integrity detection. Before validity detection, validity repair must first be performed, for example by removing leading and trailing spaces from the field values of all fields of the data after the integrity detection. It should be noted that this repair is applied to every field, whether or not the field belongs to the preset field set F₂ (i.e., the second field) for validity detection, so that leading and trailing spaces do not distort later steps such as text-length calculation and format checks.
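The validity repair can be sketched in Python as shown below. The function name `repair_validity` and the sample row are hypothetical. Note one assumption: `str.strip()` removes all leading and trailing whitespace (including tabs and newlines), which is slightly broader than removing spaces only; `strip(" ")` could be used instead if only spaces should be removed.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def repair_validity(records: List[Record]) -> List[Record]:
    """Strip leading and trailing whitespace from every string field value,
    regardless of whether the field belongs to F2."""
    return [
        {
            field: value.strip() if isinstance(value, str) else value
            for field, value in record.items()
        }
        for record in records
    ]


rows = [{"city_id": " 4", "city_name": "Springfield "}]
print(repair_validity(rows))  # [{'city_id': '4', 'city_name': 'Springfield'}]
```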

According to an embodiment of the present invention, detection rules may be preset for validity detection, for example rules on string length, format, or value range. For a name, the user may require the length to be greater than 2 and less than 5; for an ID card number, the format should consist of digits with at most one X at the end; for an age, the value should lie between 0 and 150; and so on. Then, for any field f₂ in F₂, the percentage of records that satisfy the rule out of the total number of records S is counted, and the percentages for all fields in F₂ are averaged and denoted P₂, which is the validity rate of the data to be detected.
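As an illustration, the three example rules and the validity-rate calculation might look like this in Python. The rule functions, the `validity_rate` name, the field names and the sample rows are assumptions for this sketch; a null value is counted here as failing its rule.

```python
import re
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[str]]
Rule = Callable[[str], bool]


def name_rule(value: str) -> bool:
    # Length greater than 2 and less than 5.
    return 2 < len(value) < 5


def id_number_rule(value: str) -> bool:
    # Digits only, with at most one X at the end.
    return re.fullmatch(r"[0-9]+X?", value) is not None


def age_rule(value: str) -> bool:
    # An integer between 0 and 150 inclusive.
    return re.fullmatch(r"[0-9]+", value) is not None and 0 <= int(value) <= 150


def validity_rate(records: List[Record], rules: Dict[str, Rule]) -> float:
    """Return P2: the average, over the fields in F2 (the keys of `rules`),
    of the share of records whose value satisfies the field's rule."""
    total = len(records)  # S
    if total == 0 or not rules:
        raise ValueError("need at least one record and one rule")
    ratios = []
    for field, rule in rules.items():
        valid = sum(
            1 for r in records if r.get(field) is not None and rule(r[field])
        )
        ratios.append(valid / total)
    return sum(ratios) / len(ratios)


rows = [{"name": "Tom", "age": "30"}, {"name": "Al", "age": "200"}]
print(validity_rate(rows, {"name": name_rule, "age": age_rule}))  # 0.5
```

In the sample, one of the two names and one of the two ages pass their rules, so each field contributes 50% and P₂ is 50%.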

According to a further embodiment of the present invention, before performing accuracy detection on the third field and the corresponding field value, a type to which the third field belongs may also be determined, and accuracy detection of the corresponding type may be performed, where the type includes a dimension and a fact. Then, accordingly, a certain third field may be a dimension field or a fact field.

When accuracy detection is performed according to step S104, the third field F₃ is determined from the preset set of fields to be subjected to accuracy detection, and each field f₃ in F₃ is specified as either a dimension field or a fact field, since the two kinds of fields are handled differently. The fields in the third field may all be dimension fields, may all be fact fields, or may include both dimension fields and fact fields.

For the dimension field, according to one embodiment of the present invention, if the third field is the dimension field, the accuracy detection includes:

for each dimension field for accuracy detection, calculating the similarity between each field value of the dimension field and the corresponding field value of the designated dimension field;

Counting a first proportion of the number of field values with the similarity smaller than a set similarity threshold value in the number of field values of the dimension field;

and taking the average value of the first proportion corresponding to all the dimension fields as dimension accuracy so as to detect the accuracy.

In an embodiment of the invention, when dimension accuracy detection is performed on a dimension field f₃, the user may specify a dimension field T of a dimension table in the data warehouse (assume T has N field values, with T_i denoting one of them, 1 ≤ i ≤ N) and set a distance threshold D (which defaults to zero when not set). For each field value of f₃, the most similar value is searched for among all field values of T, and the edit distance between the two is taken as the measure d of their similarity:

$$d = \min_{1 \le i \le N} \mathrm{levenshtein}(f_3, T_i)$$

where f₃ here stands for the field value being checked and levenshtein is a function that calculates the edit distance. The edit distance between two strings is the minimum number of editing operations required to convert one into the other. For example, the strings "北京" (Beijing) and "北京市" (Beijing City) can be converted into each other by appending the single character "市" to "北京" or by deleting it from "北京市", so their edit distance is 1.

Then the percentage of field values of the dimension field f₃ that satisfy d ≤ D is counted. Other dimension fields f₃′ are handled in the same way, but the values of T and D should be set according to the actual application.

Since the third fields are all dimension fields, the percentage corresponding to each dimension field can be averaged to obtain the dimension accuracy, and the dimension accuracy is used as the accuracy of the data to be detected.
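The dimension accuracy calculation can be sketched in Python as below. This is an illustrative sketch only: the function names, the tuple layout passed to `dimension_accuracy`, and the sample values are assumptions, and the edit distance is computed with a standard dynamic-programming Levenshtein routine rather than any particular library. Field values are assumed to be non-null strings and each reference list T is assumed to be non-empty.

```python
from typing import List, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn one string into the other."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def dimension_field_ratio(values: Sequence[str], reference: Sequence[str],
                          max_distance: int = 0) -> float:
    """Share of values whose distance d to the closest value of the
    reference dimension field T satisfies d <= D."""
    hits = sum(
        1 for v in values
        if min(levenshtein(v, t) for t in reference) <= max_distance
    )
    return hits / len(values)


def dimension_accuracy(
    fields: List[Tuple[Sequence[str], Sequence[str], int]]
) -> float:
    """Average the per-field ratios over all dimension fields.
    Each entry is (field values, reference values T, threshold D)."""
    ratios = [dimension_field_ratio(v, t, d) for v, t, d in fields]
    return sum(ratios) / len(ratios)


reference = ["北京市", "上海市", "广州市"]
values = ["北京", "上海市", "广州"]
print(levenshtein("北京", "北京市"))                  # 1
print(dimension_field_ratio(values, reference, 1))    # 1.0
```

With the default threshold D = 0 only exact matches count, so in the sample above only "上海市" would pass; raising D to 1 also accepts the two names that are missing the final character.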

According to another embodiment of the present invention, if the third field is a fact field, the accuracy detection includes:

counting a second proportion of the number of field values meeting a set precision threshold in the field values of each fact field for carrying out precision detection in the number of field values of the fact field;

taking the average value of the second proportion corresponding to all the fact fields as the fact accuracy so as to carry out accuracy detection.

In an embodiment of the present invention, when fact accuracy detection is performed on a fact field f₃, a precision threshold A needs to be set (A is a non-negative integer, where zero denotes integer precision). The percentage of field values of the fact field f₃ that meet the precision requirement (i.e., whose precision is greater than or equal to A) is counted. Other fact fields f₃′ are handled in the same way, but the precision threshold A should be set according to the actual application.

Since the third field is a fact field, the percentage corresponding to each fact field may be averaged to obtain a fact accuracy as an accuracy of the data to be detected.
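A possible Python sketch of the fact accuracy calculation follows. It rests on an interpretation that the text does not spell out: the "precision" of a value is taken to be the number of digits after the decimal point in its textual form, so that A = 0 accepts integers. The function names and sample values are hypothetical, and values that cannot be parsed as finite numbers are counted as failing.

```python
from decimal import Decimal, InvalidOperation
from typing import List, Optional, Sequence


def decimal_places(text: str) -> Optional[int]:
    """Number of digits after the decimal point, or None if `text` is not
    a finite number. Assumption: precision is read from the text form."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        return None
    if not value.is_finite():
        return None
    return max(0, -value.as_tuple().exponent)


def fact_field_ratio(values: Sequence[str], precision: int) -> float:
    """Share of values whose precision is at least the threshold A."""
    hits = 0
    for v in values:
        places = decimal_places(v)
        if places is not None and places >= precision:
            hits += 1
    return hits / len(values)


def fact_accuracy(fields: List[tuple]) -> float:
    """Average the per-field ratios over all fact fields.
    Each entry is (field values, precision threshold A)."""
    ratios = [fact_field_ratio(v, a) for v, a in fields]
    return sum(ratios) / len(ratios)


print(decimal_places("12.50"))                      # 2
print(fact_field_ratio(["1.25", "3.10", "7"], 2))   # 2 of 3 values pass
```

Using `Decimal` rather than `float` preserves trailing zeros such as the one in "12.50", which matters when precision is judged from the recorded text.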

According to yet another embodiment of the present invention, if the third field includes both a dimension field and a fact field, the accuracy detection includes:

calculating the dimension accuracy corresponding to the dimension field and the fact accuracy corresponding to the fact field, respectively;

and taking a weighted average of the dimension accuracy and the fact accuracy, weighted by the proportions of the number of field values corresponding to the dimension field and the number of field values corresponding to the fact field in the number of field values corresponding to the third field, to obtain the accuracy rate of the data to be detected, so as to perform the accuracy detection.

In a specific implementation, if the third field includes both dimension fields and fact fields, the percentages corresponding to each dimension field and each fact field may be averaged together, and the result is denoted P₃, which is the accuracy rate of the data to be detected.
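The following Python sketch illustrates why the two descriptions above agree when every field holds the same number of values (S): weighting the dimension accuracy and the fact accuracy by their field-value counts gives the same result as averaging all per-field percentages. The function name `combined_accuracy` and the sample numbers are hypothetical.

```python
import math


def combined_accuracy(dim_accuracy: float, dim_value_count: int,
                      fact_accuracy: float, fact_value_count: int) -> float:
    """Weighted average of dimension accuracy and fact accuracy, weighted by
    the number of field values behind each of them."""
    total = dim_value_count + fact_value_count
    if total == 0:
        raise ValueError("the third field must contain at least one value")
    return (dim_accuracy * dim_value_count
            + fact_accuracy * fact_value_count) / total


# Hypothetical per-field percentages: two dimension fields, one fact field,
# each holding S = 5 values.
S = 5
dim_ratios = [0.8, 1.0]
fact_ratios = [0.6]

weighted = combined_accuracy(
    sum(dim_ratios) / len(dim_ratios), S * len(dim_ratios),
    sum(fact_ratios) / len(fact_ratios), S * len(fact_ratios),
)
plain = sum(dim_ratios + fact_ratios) / len(dim_ratios + fact_ratios)
print(math.isclose(weighted, plain))  # True
```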

In an embodiment of the present invention, the uniqueness detection mainly includes the steps of:

concatenating all field values corresponding to the fourth field in each row of data to serve as a retrieval primary key;

if multiple rows of data are retrieved by the same retrieval primary key, counting the number of repeated rows;

and calculating the uniqueness rate of the data to be detected according to the sum of the numbers of repeated rows corresponding to all retrieval primary keys and the number of rows of the data to be detected, so as to perform the uniqueness detection.

When uniqueness detection is performed according to step S104, the fourth field F₄ is determined from the preset set of fields to be subjected to uniqueness detection. Uniqueness is detected by checking whether all fields in F₄ together can form a joint primary key, that is, whether fixing the values of all fields in F₄ uniquely determines one row of the table. If a given combination of field values of F₄ retrieves l rows in total, the count of repeated rows is increased by (l − 1). Traversing all combinations of field values gives the total number of repeated rows L:

$$L = \sum_{k} (l_k - 1)$$

where l_k is the number of rows retrieved by the k-th distinct combination of field values. The repetition rate is L / S, and the uniqueness index value of the data to be detected is the uniqueness rate:

$$P_4 = 1 - \frac{L}{S}$$
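The uniqueness detection can be sketched in Python as follows. The function names and sample rows are hypothetical. One detail is an assumption added for safety rather than something stated in the text: the field values are joined with a separator character (the ASCII unit separator) so that, for instance, ("ab", "c") and ("a", "bc") do not produce the same retrieval primary key.

```python
from collections import Counter
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def repeated_rows(records: List[Record], f4: List[str],
                  sep: str = "\x1f") -> int:
    """Return L: for each retrieval primary key that matches l rows,
    add (l - 1) to the total number of repeated rows."""
    keys = Counter(
        sep.join(str(record.get(field)) for field in f4) for record in records
    )
    return sum(count - 1 for count in keys.values())


def uniqueness_rate(records: List[Record], f4: List[str]) -> float:
    """Return P4 = 1 - L / S, computed as (S - L) / S."""
    total = len(records)  # S
    if total == 0:
        raise ValueError("need at least one record")
    return (total - repeated_rows(records, f4)) / total


rows = [
    {"city_id": "1", "city_name": "A"},
    {"city_id": "2", "city_name": "B"},
    {"city_id": "2", "city_name": "B"},  # duplicate of the previous row
    {"city_id": "3", "city_name": "C"},
    {"city_id": "4", "city_name": "D"},
]
print(repeated_rows(rows, ["city_id", "city_name"]))    # 1
print(uniqueness_rate(rows, ["city_id", "city_name"]))  # 0.8
```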

According to the technical scheme of the invention, the quality score of the data to be detected is finally calculated as a weighted average of its integrity rate, validity rate, accuracy rate and uniqueness rate. By default, when the user does not set any weights, the weights of the four detection indexes are W₁ = W₂ = W₃ = W₄ = 1. The final data quality detection result is:

$$Q = \frac{\sum_{i=1}^{4} W_i P_i}{\sum_{i=1}^{4} W_i}$$

From the way each P_i is generated, 0 ≤ P_i ≤ 1, and a larger P_i indicates a better corresponding data quality index. It follows that 0 ≤ Q ≤ 1: the closer Q is to 1, the better the overall data quality; the closer Q is to 0, the poorer the overall data quality. If Q meets the user's expectations, the data in the test library target table may be imported into the formal library target table.
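The weighted average can be sketched in Python as below. The function name `quality_score` and the sample rates are hypothetical; the sketch additionally assumes that the weights are non-negative and not all zero, which is what keeps Q between 0 and 1.

```python
from typing import Optional, Sequence


def quality_score(rates: Sequence[float],
                  weights: Optional[Sequence[float]] = None) -> float:
    """Return Q, the weighted average of the index rates P1..P4.
    When no weights are given, every index gets weight 1."""
    if weights is None:
        weights = [1.0] * len(rates)
    if len(weights) != len(rates):
        raise ValueError("one weight is required per rate")
    if any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be non-negative and not all zero")
    return sum(w * p for w, p in zip(weights, rates)) / sum(weights)


# Hypothetical rates: integrity, validity, accuracy, uniqueness.
print(quality_score([1.0, 0.5, 0.5, 1.0]))                # 0.75
print(quality_score([1.0, 0.5, 0.5, 1.0], [2, 1, 1, 0]))  # 0.75
```

Setting a weight to zero, as in the second call, simply removes that index from the score.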

Fig. 2 is a schematic diagram of an implementation flow of quality detection of offline imported data according to an embodiment of the present invention. As shown in fig. 2, in one embodiment of the present invention, the implementation flow for performing quality detection on the offline imported data mainly includes:

1. Data preparation: establish the target tables in the database and adjust the format of the offline data to be imported;

2. Data import: import the data into the test library, so that data quality detection and repair can be performed there;

3. Data integrity detection and repair: detect the data corresponding to the preset fields for integrity detection, and replace null values in the data with a designated filling value to repair its integrity;

4. Data validity repair and detection: remove leading and trailing spaces from the field values of all fields to repair validity, and then perform validity detection on the data corresponding to the preset fields for validity detection;

5. Data accuracy detection: select the corresponding detection mode according to whether the data is dimension data or fact data;

6. Data uniqueness detection: traverse the data using the retrieval primary key obtained by concatenating, for each row, the field values of the preset fields for uniqueness detection, so as to obtain the number of repeated rows and calculate the repetition rate and the uniqueness rate of the data;

7. Generate the data quality detection result;

8. Import the data that meets the quality requirements into the formal library.

One specific embodiment of the present invention is described below. Assume that the offline data to be imported by a user includes three fields, city_id (city ID), city_name (city name) and income, with 5 records in total, as shown in Table 1 below. The underscore in the city_id of the 4th record is actually a space; it is written as an underscore here for visibility.

TABLE 1

After the user imports the data into the target table of the test library, the detection of the data quality is started. The integrity of the data is checked first. The user sets F₁ to city_name. The program detects that 1 of the 5 records has a NULL city_name, so P₁ = 80%. The user then chooses to replace the null value with "Others", after which P₁ is set to 100% and the data shown in Table 2 below is obtained.

TABLE 2

city_id	city_name	income
1	Beijing city	1000.000
2	Tianjin	2000.000
3	Shanghai	3000.000
3_	Others	500.000
4	Chongqing	4000.0000
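The integrity check and repair in this first step can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `completeness_rate` and `fill_nulls` are hypothetical, a null is represented as `None`, and the input rows are assumed to be the Table 2 rows with the 4th city_name still null (the trailing space in the 4th city_id is kept as described).

```python
def completeness_rate(rows, fields):
    """Mean, over the given fields, of the share of rows whose value is not None."""
    per_field = [
        sum(row[field] is not None for row in rows) / len(rows)
        for field in fields
    ]
    return sum(per_field) / len(per_field)


def fill_nulls(rows, fields, fill_value):
    """Return copies of the rows with None replaced by fill_value in the given fields."""
    return [
        {key: (fill_value if key in fields and value is None else value)
         for key, value in row.items()}
        for row in rows
    ]


# Assumed pre-repair data: the 4th record has a null city_name.
rows = [
    {"city_id": "1", "city_name": "Beijing city", "income": "1000.000"},
    {"city_id": "2", "city_name": "Tianjin", "income": "2000.000"},
    {"city_id": "3", "city_name": "Shanghai", "income": "3000.000"},
    {"city_id": "3 ", "city_name": None, "income": "500.000"},
    {"city_id": "4", "city_name": "Chongqing", "income": "4000.0000"},
]
p1_before = completeness_rate(rows, ["city_name"])   # 4 of 5 values present
repaired = fill_nulls(rows, ["city_name"], "Others")
p1_after = completeness_rate(repaired, ["city_name"])
```

The sketch recomputes the rate after the repair, whereas the description simply sets the integrity rate to 100% once the nulls have been filled; both give the same value here.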

The second step is to detect the validity of the data. Before the detection starts, the program eliminates the left and right spaces of the field values of all fields, so "3_" in the 4th record becomes "3", as shown in Table 3 below. The user sets F₂ to city_id and wishes to check whether its values are all purely numeric. The data after validity repair fully meets this requirement, that is, P₂ = 100%.

TABLE 3

city_id	city_name	income
1	Beijing city	1000.000
2	Tianjin	2000.000
3	Shanghai	3000.000
3	Others	500.000
4	Chongqing	4000.0000
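The validity repair and the purely numeric check of this second step can be sketched in Python as follows. The function names `strip_all` and `validity_rate` and the use of a regular expression as the detection rule are assumptions made for illustration; the description only states that the user wants city_id to be purely numeric.

```python
import re


def strip_all(rows):
    """Validity repair: remove leading and trailing whitespace from every string value."""
    return [
        {key: value.strip() if isinstance(value, str) else value
         for key, value in row.items()}
        for row in rows
    ]


def validity_rate(rows, rules):
    """Mean, over the checked fields, of the share of rows matching the field's pattern.

    rules maps a field name to a regular expression that the whole value must match.
    """
    per_field = [
        sum(re.fullmatch(pattern, row[field]) is not None for row in rows) / len(rows)
        for field, pattern in rules.items()
    ]
    return sum(per_field) / len(per_field)


# city_id values before repair; the 4th one carries a trailing space.
rows = [{"city_id": value} for value in ["1", "2", "3", "3 ", "4"]]
digits_only = {"city_id": r"[0-9]+"}

p2_without_repair = validity_rate(rows, digits_only)
p2 = validity_rate(strip_all(rows), digits_only)
```

Running the check on the unrepaired rows shows why the repair comes first: the trailing space alone would make one of the five values fail the numeric rule.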

The third step is to detect the accuracy of the data. The user selects city_name and income as F₃. For city_name, the user does not set a distance threshold but designates the dim_city_name field of the city dimension table in the existing data warehouse, whose de-duplicated field values are shown in Table 4 below (only part is listed).

TABLE 4

dim_city_name
Beijing
Tianjin
Shanghai
Chongqing
……
Others

For the first record, city_name = "Beijing city". The closest value among all values of dim_city_name is "Beijing", and the edit distance between the two is 1, which is greater than the distance threshold. The other city_name values, namely "Tianjin", "Shanghai", "Chongqing" and "Others", each have an identical value in dim_city_name, so their edit distance is 0 and the threshold requirement is met. Thus, 80% of the records meet the threshold requirement for this field.
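The dimension-field check can be sketched in Python with a standard Levenshtein edit distance. Several points are assumptions made for illustration: the function names, the default threshold of 0 when the user sets none, and the "distance less than or equal to threshold" comparison. The strings are the English renderings used in the tables, so the distance computed for the first record differs from the value of 1 quoted in the text, which presumably refers to the original-language strings; the record fails the threshold either way.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # delete char_a
                curr[j - 1] + 1,                   # insert char_b
                prev[j - 1] + (char_a != char_b),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def dimension_accuracy(values, reference, threshold=0):
    """Share of values whose nearest reference value is within the distance threshold."""
    within = sum(
        min(edit_distance(value, ref) for ref in reference) <= threshold
        for value in values
    )
    return within / len(values)


city_names = ["Beijing city", "Tianjin", "Shanghai", "Others", "Chongqing"]
dim_city_name = ["Beijing", "Tianjin", "Shanghai", "Chongqing", "Others"]  # listed part only
city_name_accuracy = dimension_accuracy(city_names, dim_city_name)
```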

Next consider the income field. Assume the user-set precision threshold is a = 3, that is, all values are required to have at least 3 decimal places. All records meet this requirement, so the percentage is 100%. Averaging the percentages of the two fields gives P₃ = 90%.
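The precision check on the fact field can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes the income values are available as text exactly as imported, since converting them to floating-point numbers would discard trailing zeros and make the number of decimal places unrecoverable.

```python
def decimal_places(text):
    """Number of digits after the decimal point in a numeric string (0 if none)."""
    _, _, fraction = text.partition(".")
    return len(fraction)


def fact_accuracy(values, min_places):
    """Share of values written with at least min_places decimal places."""
    return sum(decimal_places(value) >= min_places for value in values) / len(values)


incomes = ["1000.000", "2000.000", "3000.000", "500.000", "4000.0000"]
income_accuracy = fact_accuracy(incomes, 3)

# Combine with the 80% obtained for city_name by a simple mean of the two fields.
p3 = (0.8 + income_accuracy) / 2   # approximately 0.9
```

Both fields contribute the same number of field values here, so a mean weighted by field-value counts would give the same result as the simple mean.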

The fourth step is uniqueness detection. The user sets F₄ to city_id. Here, the city_id values of the 3rd and 4th records are repeated. Note that without the validity repair, the trailing space would affect the uniqueness detection and lead to the erroneous conclusion that no data is repeated, which is clearly unreasonable. The number of repeated rows detected is 1 out of 5 rows, so:

$$P_4 = 1 - \frac{1}{5} = 80\%$$

Finally, the data quality detection index is generated. The user has not entered any weights, so W₁ = W₂ = W₃ = W₄ = 1. Thus:

$$Q = \frac{1 \times 100\% + 1 \times 100\% + 1 \times 90\% + 1 \times 80\%}{1 + 1 + 1 + 1} = 92.5\%$$
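The weighted average behind the final score can be sketched in Python as follows. The function name `quality_score` and its input validation are assumptions for illustration. The four rates are those of this example: 100%, 100% and 90% as stated, and a uniqueness rate of 1 − 1/5 derived from the one repeated row among five.

```python
def quality_score(rates, weights=None):
    """Weighted average of the index values; all weights default to 1."""
    if weights is None:
        weights = [1.0] * len(rates)
    if len(weights) != len(rates):
        raise ValueError("rates and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("the weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, rates)) / total_weight


# Integrity, effective, accuracy and uniqueness rates of the worked example.
q = quality_score([1.0, 1.0, 0.9, 0.8])   # approximately 0.925
```

Passing explicit weights, for example `quality_score([1.0, 0.5], [3, 1])`, shifts the score toward the more heavily weighted index.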

According to another aspect of the present invention, there is also provided a quality detection apparatus for offline imported data. Fig. 3 is a schematic diagram of the main modules of a quality detection apparatus for offline imported data according to an embodiment of the present invention. As shown in fig. 3, the quality detection apparatus 300 for offline imported data according to the embodiment of the present invention mainly includes a data acquisition module 301, a first detection module 302, a second detection module 303, a third detection module 304, and a quality evaluation module 305.

The data acquisition module 301 is configured to acquire data to be detected generated offline, where the data to be detected includes a field and a corresponding field value;

the first detection module 302 is configured to perform integrity detection on the first field and the corresponding field value, so as to obtain an integrity rate of the data to be detected;

the second detection module 303 is configured to perform validity detection on the second field and the corresponding field value after validity repair has been performed on the integrity-checked data, so as to obtain the effective rate of the data to be detected;

the third detection module 304 is configured to perform accuracy detection on the third field and the corresponding field value and perform uniqueness detection on the fourth field and the corresponding field value after the validity detection, so as to obtain an accuracy rate and a uniqueness rate of the data to be detected;

the quality evaluation module 305 is configured to calculate a quality score of the data to be detected according to the integrity rate, the effective rate, the accuracy rate and the unique rate of the data to be detected.

According to an embodiment of the present invention, the quality detection apparatus 300 for offline imported data may further include an integrity restoration module (not shown in the figure) for:

before validity repair is performed on the integrity-checked data, replacing null values in the field values corresponding to the first field with the specified filling value so as to perform integrity repair;

and modifying the integrity rate of the data to be detected to 100% after the integrity repair.

According to another embodiment of the present invention, the second detection module 303 may be specifically configured to:

perform left and right space elimination on the field values corresponding to all fields of the integrity-checked data.

According to a further embodiment of the present invention, the third detection module 304 may be further configured to, prior to performing the accuracy detection on the third field and the corresponding field value:

determine the type of the third field so that accuracy detection is performed according to the type, wherein the type includes dimension and fact.

According to a further embodiment of the present invention, if the third field is a dimension field, the third detection module 304 may be further configured to, when performing accuracy detection:

for each dimension field for accuracy detection, respectively calculating the similarity between each field value of the dimension field and the field value of the corresponding appointed dimension field;

According to a further embodiment of the present invention, if the third field is a fact field, the third detection module 304 may be further configured to, when performing the accuracy detection:

counting a second proportion of the number of field values meeting a set precision threshold in the field values of the fact fields in the number of field values of the fact fields;

According to a further embodiment of the present invention, if the third field includes both a dimension field and a fact field, the third detection module 304 may be further configured to, when performing the accuracy detection:

According to yet another embodiment of the present invention, the third detection module 304 may be further configured to, when performing the uniqueness detection:

Splicing all field values corresponding to the fourth field included in each row of data to be used as a retrieval main key;

if the multi-row data can be determined according to the search main key, counting the number of repeated rows;

and calculating the unique rate of the data to be detected according to the sum of the repeated line numbers corresponding to all the search main keys and the line number of the data to be detected so as to carry out the unique detection.

According to yet another embodiment of the present invention, the quality evaluation module 305 may also be configured to:

and calculating the quality score of the data to be detected by carrying out weighted average on the integrity rate, the effective rate, the accuracy rate and the unique rate of the data to be detected.

According to the technical solution of the embodiment of the invention, data to be detected that is generated offline is acquired, wherein the data to be detected comprises fields and corresponding field values; integrity detection is performed on the first field and the corresponding field value to obtain the integrity rate of the data to be detected; after validity repair is performed on the integrity-checked data, validity detection is performed on the second field and the corresponding field value to obtain the effective rate of the data to be detected; after the validity detection, accuracy detection is performed on the third field and the corresponding field value and uniqueness detection is performed on the fourth field and the corresponding field value, to obtain the accuracy rate and the uniqueness rate of the data to be detected; and the quality score of the data to be detected is calculated from the integrity rate, the effective rate, the accuracy rate and the uniqueness rate. This realizes quality detection of offline imported data. Taking the characteristics of offline imported data into account, consistency and timeliness are dropped when selecting the data quality detection indexes, while integrity, validity, accuracy and uniqueness are retained, which meets the quality detection requirements of offline imported data. Moreover, because the quality detection indexes are not completely independent of one another, a strict order is imposed on the data quality checks, which makes the detection process more reasonable and controllable and the detection result more accurate.
In addition, during accuracy detection, accuracy is explicitly divided into dimension accuracy and fact accuracy to handle disordered data, so that the quality of different types of data is detected separately and the data detection is more scientific and reasonable.

Fig. 4 illustrates an exemplary system architecture 400 to which the quality detection method of offline imported data or the quality detection apparatus of offline imported data of the embodiment of the present invention may be applied.

As shown in fig. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404, and a server 405. The network 404 is used as a medium to provide communication links between the terminal devices 401, 402, 403 and the server 405. The network 404 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

A user may interact with the server 405 via the network 404 using the terminal devices 401, 402, 403 to receive or send messages or the like. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 401, 402, 403.

The terminal devices 401, 402, 403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.

The server 405 may be a server providing various services, such as a background management server (by way of example only) providing support for shopping-type websites browsed by users using the terminal devices 401, 402, 403. The background management server may analyze and otherwise process received data such as a product information query request, and feed the processing result (e.g., target push information or product information, by way of example only) back to the terminal device.

It should be noted that, the method for detecting quality of offline imported data according to the embodiment of the present invention is generally executed by the server 405, and accordingly, the device for detecting quality of offline imported data is generally disposed in the server 405.

It should be understood that the number of terminal devices, networks and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 5, there is illustrated a schematic diagram of a computer system 500 suitable for use in implementing a terminal device or server in accordance with an embodiment of the present invention. The terminal device or server shown in fig. 5 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.

As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU) 501, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.

The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, as well as a speaker and the like; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as needed so that a computer program read therefrom is installed into the storage section 508 as needed.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 509, and/or installed from the removable media 511. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 501.

The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units or modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described units or modules may also be provided in a processor, for example, as: a processor includes a data acquisition module, a first detection module, a second detection module, a third detection module, and a quality assessment module. The names of these units or modules do not in any way limit the unit or module itself, and the data acquisition module may also be described as "a module for acquiring data to be detected generated offline", for example.

As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist alone without being fitted into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform the following: acquiring data to be detected generated offline, wherein the data to be detected comprises fields and corresponding field values; performing integrity detection on the first field and the corresponding field value so as to obtain the integrity rate of the data to be detected; after validity repair is performed on the integrity-checked data, performing validity detection on the second field and the corresponding field value so as to obtain the effective rate of the data to be detected; after the validity detection, respectively performing accuracy detection on a third field and a corresponding field value and uniqueness detection on a fourth field and a corresponding field value so as to obtain the accuracy rate and the uniqueness rate of the data to be detected; and calculating the quality score of the data to be detected according to the integrity rate, the effective rate, the accuracy rate and the uniqueness rate of the data to be detected.

The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims

1. A quality detection method for offline imported data, comprising:

acquiring data to be detected generated offline, wherein the data to be detected comprises fields and corresponding field values;

performing integrity detection on the first field and the corresponding field value so as to obtain the integrity rate of the data to be detected; the integrity detection comprises, for any one field in the first field, counting the proportion of the number of records whose field value is not empty to the total number of records, and averaging the proportions corresponding to all fields in the first field to obtain the integrity rate of the data to be detected;

after validity repair is performed on the integrity-checked data, performing validity detection on the second field and the corresponding field value so as to obtain the effective rate of the data to be detected; the validity detection comprises, for any one field in the second field, counting the percentage of the number of records meeting the detection rule requirement to the total number of records, and averaging the percentages corresponding to all fields included in the second field to obtain the effective rate of the data to be detected;

After the validity detection, respectively carrying out accuracy detection on a third field and a corresponding field value and carrying out uniqueness detection on a fourth field and a corresponding field value so as to obtain the accuracy rate and the uniqueness rate of the data to be detected; wherein if the third field is a dimension field, the accuracy detection includes: for each dimension field for accuracy detection, respectively calculating the similarity between each field value of the dimension field and the field value of the corresponding appointed dimension field; counting a first proportion of the number of field values with the similarity smaller than a set similarity threshold value in the number of field values of the dimension field; taking the average value of the first proportion corresponding to all the dimension fields as dimension accuracy so as to detect the accuracy;

if the third field is a fact field, the accuracy detection includes: counting a second proportion of the number of field values meeting a set precision threshold in the field values of the fact fields in the number of field values of the fact fields; taking the average value of the second proportion corresponding to all the fact fields as the fact accuracy so as to detect the accuracy;

if the third field includes both a dimension field and a fact field, the accuracy detection includes: respectively calculating the dimension accuracy corresponding to the dimension field and the fact accuracy corresponding to the fact field; and performing a weighted average of the dimension accuracy and the fact accuracy, weighted respectively by the proportions of the number of field values of the dimension field and the number of field values of the fact field in the number of field values of the third field, to obtain the accuracy rate of the data to be detected so as to perform the accuracy detection;

the uniqueness detection includes: splicing all field values corresponding to the fourth field included in each row of data to be used as a retrieval main key; if the multi-row data can be determined according to the search main key, counting the number of repeated rows; calculating the unique rate of the data to be detected according to the sum of the repeated line numbers corresponding to all the search main keys and the line number of the data to be detected so as to carry out the unique detection;

and calculating the quality score of the data to be detected according to the integrity rate, the effective rate, the accuracy rate and the unique rate of the data to be detected.

2. The method of claim 1, further comprising, prior to validity repair of the integrity-checked data:

Replacing null values in the field values corresponding to the first field according to the appointed filling values so as to carry out integrity repair;

3. The method of claim 1, wherein the validity repair comprises:

4. The method of claim 1, further comprising, prior to the accuracy detecting of the third field and the corresponding field value:

5. The method of claim 1, wherein calculating a quality score for the data to be detected based on the integrity rate, the effective rate, the accuracy rate, and the unique rate of the data to be detected comprises:

6. A quality inspection device for offline imported data, comprising:

The data acquisition module is used for acquiring data to be detected generated offline, wherein the data to be detected comprises fields and corresponding field values;

the first detection module is used for performing integrity detection on the first field and the corresponding field value so as to obtain the integrity rate of the data to be detected; the integrity detection comprises, for any one field in the first field, counting the proportion of the number of records whose field value is not empty to the total number of records, and averaging the proportions corresponding to all fields in the first field to obtain the integrity rate of the data to be detected;

the second detection module is used for performing validity repair on the integrity-checked data and then performing validity detection on a second field and a corresponding field value so as to obtain the effective rate of the data to be detected; the validity detection comprises, for any one field in the second field, counting the percentage of the number of records meeting the detection rule requirement to the total number of records, and averaging the percentages corresponding to all fields included in the second field to obtain the effective rate of the data to be detected;

the third detection module is used for respectively carrying out accuracy detection on the third field and the corresponding field value and carrying out uniqueness detection on the fourth field and the corresponding field value after the validity detection so as to obtain the accuracy rate and the uniqueness rate of the data to be detected; wherein if the third field is a dimension field, the accuracy detection includes: for each dimension field for accuracy detection, respectively calculating the similarity between each field value of the dimension field and the field value of the corresponding appointed dimension field; counting a first proportion of the number of field values with the similarity smaller than a set similarity threshold value in the number of field values of the dimension field; taking the average value of the first proportion corresponding to all the dimension fields as dimension accuracy so as to detect the accuracy;

And the quality evaluation module is used for calculating the quality score of the data to be detected according to the integrity rate, the effective rate, the accuracy rate and the unique rate of the data to be detected.

7. An electronic device for quality detection of offline imported data, comprising:

one or more processors;

storage means for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-5.

8. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-5.