CN111427928A - Data quality detection method and device - Google Patents


Info

Publication number
CN111427928A
Authority
CN
China
Prior art keywords
data
detected
inspection
strategy
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010223574.0A
Other languages
Chinese (zh)
Inventor
谢良武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JD Digital Technology Holdings Co Ltd
Original Assignee
JD Digital Technology Holdings Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JD Digital Technology Holdings Co Ltd filed Critical JD Digital Technology Holdings Co Ltd
Priority to CN202010223574.0A priority Critical patent/CN111427928A/en
Publication of CN111427928A publication Critical patent/CN111427928A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24564 Applying rules; Deductive queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Quality & Reliability (AREA)
  • General Factory Administration (AREA)

Abstract

The application relates to a data quality detection method and device. The method comprises: determining an inspection strategy corresponding to a data set to be detected, wherein the inspection strategy is used to check whether the data to be detected in the data set to be detected meets a preset data quality requirement; inspecting the data to be detected through the inspection strategy to obtain an inspection result corresponding to the data to be detected; and obtaining a data quality detection result of the data set to be detected according to the inspection result. Compared with the prior art, the technical scheme provided by the embodiments of the application addresses the problems that, in big data quality detection, manually generating sampled data is time-consuming and laborious, detection cost is high, and the verification cycle is long. It automates both the generation of random detection samples and the detection process and is applicable to multiple detection scenarios, thereby reducing labor cost, enlarging the sampling range, and accelerating detection.

Description

Data quality detection method and device
Technical Field
The present application relates to the field of big data technologies, and in particular, to a data quality detection method and apparatus.
Background
In the process of processing big data, the quality of the big data needs to be detected, and in the related art, two methods are used for detecting the quality of the big data:
(1) manual sampling test: developers randomly draw a certain number of samples from the population, check the quality of the drawn samples by manual comparison, and evaluate the quality of the big data from the quality of the samples;
(2) customized detection program: a customized detection program is developed to check the data of every dimension of each individual record in the big data and determine whether the big data meets the quality standard.
In the process of implementing the invention, the applicant finds that the existing method for verifying the data quality has the following problems:
(1) with the manual sampling test, the labor cost is high, the samples are unevenly distributed, the number of inspected samples is small, the verification cycle is long, and the result depends on the individual skill of the inspector;
(2) a customized detection program can only address data quality in a specific scenario and is not general.
In view of the technical problems in the related art, no effective solution is provided at present.
Disclosure of Invention
In order to solve the technical problem or at least partially solve the technical problem, the present application provides a data quality detection method and apparatus.
In a first aspect, an embodiment of the present application provides a data quality detection method, including:
determining an inspection strategy corresponding to a data set to be detected, wherein the inspection strategy is used for checking whether the data to be detected in the data set to be detected meets the preset data quality requirement;
inspecting the data to be detected through the inspection strategy to obtain an inspection result corresponding to the data to be detected;
and obtaining a data quality detection result of the data set to be detected according to the inspection result.
Optionally, as in the foregoing data quality detection method, inspecting the data to be detected through the inspection strategy to obtain an inspection result includes:
randomly selecting candidate data in the data set to be detected to obtain the data to be detected;
performing data quality inspection on the data to be detected through the inspection strategy to obtain an inspection result;
and continuously randomly selecting candidate data in the data set to be detected to obtain the data to be detected, and performing data quality inspection on the data to be detected through the inspection strategy to obtain the inspection result until the total number of the data to be detected obtained through random selection reaches a preset number.
Optionally, as in the foregoing data quality detection method, determining an inspection strategy corresponding to the data set to be detected includes:
determining attribute types corresponding to all field information in the data to be detected;
and determining a corresponding relation between a field inspection strategy and the attribute type in the inspection strategies, wherein the inspection strategies comprise at least one field inspection strategy, and the field inspection strategies are used for inspecting whether field information corresponding to the attribute type meets a preset field quality requirement.
Optionally, as in the foregoing data quality detection method, inspecting the data to be detected through the inspection strategy includes:
determining a standard data format preset by each field inspection strategy;
converting each field information in the data to be detected into standard field information in a standard data format according to the corresponding relation to obtain converted data;
and according to the corresponding relation, checking each field information in the converted data through each field checking strategy.
Optionally, as in the foregoing data quality detection method, before performing data quality inspection on the data to be detected through the inspection strategy, the method further includes:
judging whether the data to be detected meets a preset overall requirement;
executing the next step when the data to be detected meets the preset overall requirement;
and when the data to be detected does not meet the preset overall requirement, judging that the inspection result of the data to be detected is a failure.
Optionally, as in the foregoing data quality detection method, obtaining a data quality detection result of the data set to be detected according to the inspection result includes:
when the number of data quality inspections reaches a preset number, determining the number of normal inspection results among the inspection results, wherein a normal inspection result is an inspection result indicating that the data quality meets the preset requirement;
and obtaining a data quality detection result of the data set to be detected according to the number of normal inspection results and the preset number.
Optionally, as in the foregoing data quality detection method, determining the number of normal inspection results among the inspection results includes:
determining that an inspection result is a normal inspection result when every item of field information in the inspection result meets the corresponding preset data quality requirement;
and obtaining the number from all the normal inspection results.
In a second aspect, an embodiment of the present application provides a data quality detection apparatus, including:
a determining module, configured to determine an inspection strategy corresponding to data in a data set to be detected, wherein the inspection strategy is used for checking whether the data quality of the data meets a preset requirement;
an inspection module, configured to perform data quality inspection on the data to be detected through the inspection strategy to obtain an inspection result;
and a quality acquisition module, configured to obtain the data quality of the data in the data set to be detected according to the inspection result.
In a third aspect, an embodiment of the present application provides an electronic device, including: the system comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
the memory is used for storing a computer program;
the processor is configured to implement the data quality detection method according to any one of the preceding claims when executing the computer program.
In a fourth aspect, the present application provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause the computer to execute the data quality detection method according to any one of the foregoing embodiments.
The embodiments of the application provide a data quality detection method and device. The method comprises: determining an inspection strategy corresponding to a data set to be detected, wherein the inspection strategy is used to check whether the data to be detected in the data set to be detected meets a preset data quality requirement; inspecting the data to be detected through the inspection strategy to obtain an inspection result corresponding to the data to be detected; and obtaining a data quality detection result of the data set to be detected according to the inspection result. Compared with the prior art, the technical scheme provided by the embodiments of the application addresses the problems that, in big data quality detection, manually generating sampled data is time-consuming and laborious, detection cost is high, and the verification cycle is long. It automates both the generation of random detection samples and the detection process and is applicable to multiple detection scenarios, thereby reducing labor cost, enlarging the sampling range, and accelerating detection.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic flowchart of a data quality detection method according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of a data quality detection method according to another embodiment of the present application;
Fig. 3 is a schematic flowchart of a data quality detection method according to another embodiment of the present application;
Fig. 4 is a schematic flowchart of a data quality detection method according to another embodiment of the present application;
Fig. 5 is a block diagram of a data quality detection apparatus according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 shows a data quality detection method provided in an embodiment of the present application, including the following steps S1 to S3:
S1, determining an inspection strategy corresponding to a data set to be detected, wherein the inspection strategy is used for checking whether the data to be detected in the data set to be detected meets the preset data quality requirement;
Specifically, data quality is the basis for guaranteeing data applications, and its evaluation criteria mainly cover four aspects: integrity, consistency, accuracy, and timeliness. Whether the data meets the expected data quality requirement can be judged from these four aspects.
Integrity: integrity refers to whether data is missing. The missing data may be an entire data record or a single field within a record. Incomplete data is far less valuable as a reference, so integrity is the most basic criterion of data quality. Integrity is relatively easy to evaluate and can generally be assessed from the record count and the unique-value count in the data statistics. For example, the daily visit count in a website log is a record count: if the site normally receives about 1000 visits per day and the count suddenly drops to 100 on one day, the data should be checked for missing records. As another example, each region name in a website's regional distribution statistics is a unique value: in this example China is taken to comprise 32 provinces and municipalities, so if fewer than 32 unique values are counted, data is likely missing.
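The two integrity checks described above can be illustrated with a minimal Python sketch. The function names and the 50% drop threshold are assumptions introduced for illustration only; the application does not specify them.

```python
from typing import Iterable, Sequence


def record_count_dropped(history: Sequence[int], today: int, min_ratio: float = 0.5) -> bool:
    """Return True when today's record count is below min_ratio of the historical mean.

    min_ratio is a hypothetical threshold, not a value taken from the application.
    """
    if not history:
        return False  # nothing to compare against
    baseline = sum(history) / len(history)
    return today < baseline * min_ratio


def unique_values_missing(values: Iterable[str], expected_count: int) -> bool:
    """Return True when fewer distinct values are observed than expected."""
    return len(set(values)) < expected_count
```

With a history of about 1000 visits per day, a day with 100 visits is flagged for a follow-up check, while a day with 990 visits is not.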
Consistency: consistency refers to whether data follows a uniform specification and whether the data set keeps a uniform format. Consistency is mainly reflected in whether data records follow the specification and whether the data is logical. Following the specification means that a data item has a specific format; for example, a mobile phone number must be 13 digits, and an IP address must consist of four numbers between 0 and 255 separated by ".". Being logical means that a fixed logical relationship holds among several data items; for example, the PV (Page View, number of page views) of a website must be greater than or equal to its UV (Unique Visitor, number of visitors), and the bounce rate must be between 0 and 1. Data generally has standard coding rules, and checking the consistency of data records is straightforward once those rules are known. For example, when the standard coding of a region is "Beijing" rather than "Beijing City", the non-standard unique value is mapped to the standard one.
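A minimal Python sketch of the consistency rules mentioned above (IP address format, PV versus UV, bounce rate range, and region name mapping) might look as follows. All function names and the alias table are hypothetical and serve only as an illustration.

```python
def is_valid_ipv4(text: str) -> bool:
    """Format rule: four decimal numbers between 0 and 255 separated by dots."""
    parts = text.split(".")
    if len(parts) != 4:
        return False
    return all(
        p.isascii() and p.isdigit() and len(p) <= 3 and int(p) <= 255
        for p in parts
    )


def is_logically_consistent(pv: int, uv: int, bounce_rate: float) -> bool:
    """Logic rule: PV must be >= UV and the bounce rate must lie in [0, 1]."""
    return pv >= uv and 0.0 <= bounce_rate <= 1.0


# Hypothetical alias table mapping non-standard region names to the standard code.
REGION_ALIASES = {"Beijing City": "Beijing"}


def normalize_region(name: str) -> str:
    """Map a non-standard region name to its standard unique value."""
    return REGION_ALIASES.get(name, name)
```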
Accuracy: accuracy refers to whether the information in a data record is abnormal or erroneous. Unlike consistency, data with accuracy problems is not merely inconsistent with the rules. The most common accuracy error is garbled text. Abnormally large or small values are likewise unqualified data. Accuracy problems may affect individual records or the entire data set, for example an order-of-magnitude recording error. Such errors can be audited with maximum and minimum statistics. Data generally follows a normal distribution, so if a small fraction of the data is faulty, it can be identified by comparison with the remaining data. If the erroneous values are not statistically conspicuous, however, they are the hardest to detect: subtle clues must be found through more elaborate statistical analysis and comparison, possibly with data analysis tools. Specific data correction methods are not described here.
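As an illustration of auditing with maximum and minimum values and of comparing a value against the rest of the data, the following Python sketch flags out-of-range values and values far from the mean. The function names and the default of three standard deviations are assumptions made for this sketch, not rules stated in the application.

```python
import statistics
from typing import List, Sequence


def audit_range(values: Sequence[float], lower: float, upper: float) -> List[float]:
    """Return the values that fall outside the allowed [lower, upper] range."""
    return [v for v in values if v < lower or v > upper]


def flag_outliers(values: Sequence[float], k: float = 3.0) -> List[float]:
    """Return values more than k population standard deviations from the mean.

    k = 3.0 is a conventional but hypothetical choice.
    """
    if len(values) < 2:
        return []
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) > k * std]
```

A check of this kind only catches errors that are statistically conspicuous; the harder cases mentioned above would need further analysis.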
Timeliness: timeliness refers to the interval between the time data is generated and the time it can be viewed, also called the data delay. Data analysis itself does not demand high timeliness, but if the data analysis period plus the data preparation time is too long, the conclusions drawn from the analysis may lose their reference value.
The inspection strategy is a strategy for inspecting any data in the data set to be detected. A piece of data may include only one item of field information, for example, that the total number of provinces and municipalities of China is 32; it may also include several items of field information, for example, driver insurance information may include the vehicle model, driver information, insurance records, and so on. The inspection strategy may include only one inspection method or several, and each inspection method generally has corresponding field information.
The data to be detected is selected from the data set to be detected for inspection, and may be selected by random sampling or selected one by one.
In some optional implementation manners, when the data to be detected passes the inspection strategy, it can be determined that the data to be detected meets the preset data quality requirement.
S2, inspecting the data to be detected through the inspection strategy to obtain an inspection result corresponding to the data to be detected.
Specifically, when the data to be detected has several items of field information to be inspected, the inspection result may include several sub-inspection results. Continuing the example above: since driver insurance information may include the vehicle model, the driver information, and the insurance record, sub-inspection results are obtained for the vehicle model, the driver information, and the insurance record, respectively.
S3, obtaining a data quality detection result of the data set to be detected according to the inspection result.
Specifically, when the data set to be detected contains tens or hundreds of millions of data items, inspecting every item would consume a large amount of computing resources, take a long time, and be inefficient. Therefore, part of the data (namely, the data to be detected) is generally selected for sampling inspection, and the data quality detection result of the whole data set to be detected is obtained from the inspection results of the individual data to be detected.
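One way to derive a data set level result from the per-item inspection results is a pass ratio: the number of normal inspection results divided by the preset number of inspections. The application only states that the result is obtained from these two numbers, so the ratio, the function name, and the error handling below are assumptions of this Python sketch.

```python
from typing import Sequence


def dataset_pass_rate(results: Sequence[bool], preset_number: int) -> float:
    """Return normal results / preset number of inspections (assumed formula).

    results holds one boolean per inspected item: True means the item passed.
    """
    if preset_number <= 0:
        raise ValueError("preset_number must be positive")
    if len(results) != preset_number:
        raise ValueError("the number of results must equal preset_number")
    normal = sum(1 for passed in results if passed)
    return normal / preset_number
```

For instance, three normal results out of four inspections give a pass rate of 0.75, which a caller could then compare against whatever acceptance threshold applies.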
When processing big data, samples of a specified size need to be drawn from the big data, and each attribute of each sample is checked against the data quality standard. Because the data volume is large, the data sources are numerous, and the data is collected over a long time, part of the data ceases over time to meet the standard required by the data processing program. Whether the data meets that standard therefore has to be judged during processing; that is, the quality of the data needs to be detected.
A common practice is to select data samples from the big data and verify data quality manually. However, manual inspection takes a long time, covers few samples, and cannot find unknown problems.
The scheme in this embodiment addresses the problems that, in big data quality detection, manually generating sampled data is time-consuming and laborious, detection cost is high, and the verification cycle is long. It automates both the generation of random detection samples and the detection process and is applicable to multiple detection scenarios, thereby reducing labor cost and accelerating detection.
In some embodiments, as in the foregoing data quality inspection method, the step S2 inspects the data to be inspected by the inspection strategy to obtain the inspection result, including the following steps a1 to A3:
A1, randomly selecting candidate data in the data set to be detected to obtain the data to be detected;
Specifically, the data in the data set to be detected are collectively called candidate data, and the data to be detected is the candidate data obtained by random selection.
A2, performing data quality inspection on the data to be detected through the inspection strategy to obtain an inspection result; for data that fails the inspection, the abnormal data is recorded in its inspection result;
A3, continuing to randomly select candidate data in the data set to be detected to obtain data to be detected, and performing data quality inspection on the data to be detected through the inspection strategy to obtain an inspection result, until the total number of data to be detected obtained through random selection reaches a preset number.
One optional implementation method in this embodiment may be:
(11) determining the size M of the preset quantity;
(12) randomly selecting one candidate data from a data set to be detected as data to be detected;
(13) calculating whether the currently selected data to be detected meets the quality standard or not, and adding 1 to the total number of the data to be detected obtained by random selection;
(14) judging whether the total number of the data to be detected obtained by random selection is equal to the size M of the preset number or not;
(15) repeating steps (12), (13) and (14) while the number of sampled data to be detected is smaller than the preset number M;
(16) stopping the sampling inspection when the number of sampled data to be detected is equal to the preset number M.
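Steps (11) to (16) can be sketched in Python as a simple loop. The function name, the `inspect` callback, and the use of sampling with replacement via `random.Random.choice` are assumptions of this sketch; the application does not say whether an item may be selected more than once.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sample_and_inspect(
    candidates: Sequence[T],
    inspect: Callable[[T], bool],
    m: int,
    rng: Optional[random.Random] = None,
) -> List[Tuple[T, bool]]:
    """Randomly select and inspect items until M items have been inspected."""
    if not candidates:
        raise ValueError("the data set to be detected is empty")
    rng = rng or random.Random()
    results: List[Tuple[T, bool]] = []
    total = 0  # total number of data items selected so far
    while total < m:  # steps (14) to (16): stop once the count equals M
        item = rng.choice(candidates)  # step (12): random selection, with replacement
        results.append((item, inspect(item)))  # step (13): inspect and store the result
        total += 1  # step (13): add 1 to the total
    return results
```

Keeping each item together with its result makes it possible to review the failing items later, as the text below suggests.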
Another optional implementation method of this embodiment may be:
(21) determining the size M of the preset quantity;
(22) randomly selecting one candidate data from a data set to be detected as data to be detected;
(23) adding 1 to the total number of the sampled data to be detected, and judging whether the total number of the data to be detected obtained by random selection is equal to the size M of the preset number or not;
(24) calculating whether the currently selected data to be detected meets the quality standard;
(25) repeating steps (22), (23) and (24) while the number of sampled data to be detected is smaller than the preset number M;
(26) stopping the sampling inspection when the number of sampled data to be detected is equal to the preset number M.
Further, after (13) calculating whether each data to be detected meets the quality standard, the method may further include: storing the detection result of each data to be detected; the detection result may include abnormal data, so that the abnormal data can be manually analyzed to find unknown problems in the later period according to the detection result, and meanwhile, the detection strategy (such as a detection function) can be optimized according to the abnormal record, so that the detection can achieve a better effect.
Optionally, random selection and inspection of the data to be detected are performed in the manner of a monkey test (that is, a random test). Owing to the randomness of the monkey test, a candidate data can be randomly selected from the data set to be detected as the data to be detected for inspection. When the number of inspections reaches the preset number, the random selection and the related operations such as inspection stop, so the test stops automatically without supervision. In addition, selection may be based on a certain attribute of the data, for example: selecting randomly according to the time attribute of the data, taking several pieces of data in each time period as data to be detected; or selecting according to data size, taking several pieces of data whose size falls within a certain interval as data to be detected. Data may also be selected randomly in other manners, which are not enumerated here. With this testing approach, data of unsatisfactory quality can be found quickly in a store with a large data volume, such as a big data set, and the overall sampling situation is obtained, so that the sample is enlarged, the verification period is shortened, and more unknown problems can be found.
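Selection by time attribute can be illustrated with the following Python sketch, which groups records into fixed-length time periods and randomly draws a few records from each period. The record shape (timestamp, payload), the function name, and the parameters are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Record = Tuple[int, str]  # (timestamp in seconds, payload): hypothetical record shape


def sample_per_period(
    records: Sequence[Record],
    period_seconds: int,
    per_period: int,
    rng: Optional[random.Random] = None,
) -> List[Record]:
    """Randomly pick up to per_period records from each time period."""
    rng = rng or random.Random()
    buckets: Dict[int, List[Record]] = defaultdict(list)
    for rec in records:
        buckets[rec[0] // period_seconds].append(rec)  # group by time period
    picked: List[Record] = []
    for key in sorted(buckets):
        bucket = buckets[key]
        # rng.sample draws without replacement inside one period
        picked.extend(rng.sample(bucket, min(per_period, len(bucket))))
    return picked
```

Selection by data size could follow the same pattern, with the bucket key computed from the size of each record instead of its timestamp.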
As shown in Fig. 2, in some embodiments, as in the foregoing data quality detection method, the step S1 of determining the inspection strategy corresponding to the data set to be detected includes the following steps S11 and S12:
S11, determining the attribute types corresponding to the field information in the data to be detected.
Specifically, in some cases the data to be detected is composed of different field information, as in the example given for step S2: driver insurance information may include the vehicle model, driver information, insurance records, and so on. Here the vehicle model, driver information, and insurance record are attribute types in this embodiment, and the field information is the specific information of each attribute type. Further, field information may include sub-field information, and one attribute type may include several sub-attribute types. For example, the driver information may further include the driver's name, age, gender, and so on; the name, age, and gender of the driver are sub-attribute types, and their specific values are sub-field information.
S12, determining the correspondence between the field inspection strategies in the inspection strategy and the attribute types, wherein the inspection strategy comprises at least one field inspection strategy, and the field inspection strategy is used for checking whether the field information corresponding to the attribute type meets the preset field quality requirement.
Specifically, different attribute types have different rules. For example, the vehicle model generally includes the manufacturer and the specific product model, such as BMW X5 or Benz 300, so the corresponding field inspection strategy may require a string of more than 2 characters consisting of Chinese characters, English letters, and digits. When the attribute type is the driver's age, the corresponding field inspection strategy is that the age lies between 18 and 120; if the driver's age in the data to be detected is 200 or 5, the data is obviously wrong, which is a data quality problem. When the attribute type corresponding to an inspection strategy is the driver's age, the inspection strategy includes only one field inspection strategy. When the inspection strategy corresponds to the driver information or the driver insurance information, which include several attribute types, different field inspection strategies are needed to inspect the field information of each attribute type and judge whether it meets the preset field quality requirement. Generally, field information that passes the inspection of its corresponding field inspection strategy can be judged to meet the preset field quality requirement.
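The correspondence between attribute types and field inspection strategies can be sketched in Python as a dictionary of check functions. The attribute names, the regular expression for the vehicle model (read here as at least two characters drawn from Chinese characters, English letters, digits, and spaces), and the function names are assumptions of this sketch.

```python
import re
from typing import Callable, Dict

FieldCheck = Callable[[object], bool]


def check_driver_age(value: object) -> bool:
    """Driver age must be an integer between 18 and 120."""
    return isinstance(value, int) and not isinstance(value, bool) and 18 <= value <= 120


# Hypothetical pattern: two or more Chinese characters, letters, digits, or spaces.
_VEHICLE_MODEL = re.compile(r"[0-9A-Za-z\u4e00-\u9fff ]{2,}")


def check_vehicle_model(value: object) -> bool:
    """Vehicle model must be a string matching the assumed character rule."""
    return isinstance(value, str) and _VEHICLE_MODEL.fullmatch(value) is not None


# Correspondence between attribute types and field inspection strategies.
FIELD_STRATEGIES: Dict[str, FieldCheck] = {
    "driver_age": check_driver_age,
    "vehicle_model": check_vehicle_model,
}


def inspect_record(record: Dict[str, object]) -> Dict[str, bool]:
    """Return one sub-inspection result per attribute type with a strategy."""
    return {attr: check(record.get(attr)) for attr, check in FIELD_STRATEGIES.items()}
```

A record with a driver age of 200 fails the age check while its vehicle model "BMW X5" passes, so the record as a whole would not count as a normal inspection result.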
As shown in Fig. 3, in some embodiments, as in the foregoing data quality detection method, the step S2 of inspecting the data to be detected through the inspection strategy includes the following steps S21 to S23:
S21, determining the standard data format preset by each field inspection strategy.
Specifically, the data in the data set to be detected may be imported from multiple sources, so the data formats of the candidate data and of the data to be detected may differ. Inspection with a field inspection strategy, however, requires comparing and judging each item of field information, and when the format of the field information differs from the standard data format, the comparison may fail or be impossible. It is therefore necessary to determine the standard data format preset in each field inspection strategy. For example, when the standard data format of the vehicle model is manufacturer + specific vehicle model and a piece of data to be detected has the form specific vehicle model + manufacturer, the comparison may go wrong. This step can be implemented either by establishing a correspondence between the field inspection strategy and the standard data format or by having the field inspection strategy carry the standard data format.
S22, converting each piece of field information in the data to be detected into standard field information in the standard data format according to the correspondence, to obtain converted data.
Specifically, since each field inspection strategy is preset with a standard data format, and the correspondence maps field inspection strategies to attribute types, once the attribute type of each piece of field information in the data to be detected is determined, the standard data format for that field information can be determined from the attribute type and the correspondence. Each piece of field information can then be converted into the standard data format to obtain standard field information, which yields the converted data.
S23, checking each piece of field information in the converted data through the corresponding field inspection strategy according to the correspondence.
Specifically, the correspondence maps field inspection strategies to attribute types, and each piece of field information has an attribute type, so the field inspection strategy for each attribute type can be determined and each piece of field information in the converted data can be checked by its field inspection strategy.
By the method in this embodiment, the data to be detected is standardized, which can speed up detection and also avoids detection failures caused by mismatched formats, which would otherwise reduce the accuracy of the final data quality detection result.
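Steps S21 to S23 can be sketched as follows, under the assumption that each field inspection strategy carries its own conversion to the standard data format. The `FieldStrategy` class, the `CORRESPONDENCE` table, the manufacturer lookup set and the field names are hypothetical and serve only to make the flow concrete.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

# Hypothetical lookup used to recognize the manufacturer token.
KNOWN_MANUFACTURERS = {"BMW", "Benz"}


def to_standard_vehicle_model(raw: str) -> str:
    """Reorder tokens into the assumed standard format: manufacturer + model."""
    tokens = raw.split()
    makers = [t for t in tokens if t in KNOWN_MANUFACTURERS]
    others = [t for t in tokens if t not in KNOWN_MANUFACTURERS]
    return " ".join(makers + others)


def to_standard_age(raw: Any) -> int:
    """Convert an age given as text or number into an integer."""
    return int(str(raw).strip())


def vehicle_model_ok(value: str) -> bool:
    """Pass when the standardized value starts with a known manufacturer."""
    tokens = value.split()
    return len(tokens) >= 2 and tokens[0] in KNOWN_MANUFACTURERS


@dataclass(frozen=True)
class FieldStrategy:
    to_standard: Callable[[Any], Any]  # S21: the standard format carried by the strategy
    check: Callable[[Any], bool]       # S23: the field inspection itself


# Correspondence between attribute types and field inspection strategies.
CORRESPONDENCE: Dict[str, FieldStrategy] = {
    "vehicle_model": FieldStrategy(to_standard_vehicle_model, vehicle_model_ok),
    "driver_age": FieldStrategy(to_standard_age, lambda age: 18 <= age <= 120),
}


def inspect(record: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, bool]]:
    """S22: convert every known field; S23: check the converted fields."""
    converted = {
        name: CORRESPONDENCE[name].to_standard(raw)
        for name, raw in record.items()
        if name in CORRESPONDENCE
    }
    results = {name: CORRESPONDENCE[name].check(value) for name, value in converted.items()}
    return converted, results
```

In this sketch, a record whose vehicle model arrives as "X5 BMW" is first rewritten to "BMW X5" and only then checked, so the reversed input order no longer causes a comparison failure.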
In some embodiments of the foregoing data quality detection method, before the data quality inspection of the data to be detected through the inspection strategy in step A2, the method further includes the following steps A4 to A6:
A4, judging whether the data to be detected meets the preset overall requirement.
Specifically, the preset overall requirement may be a policy for performing overall judgment on the data to be detected.
A5, when the data to be detected meets the preset overall requirement, executing the next step;
A6, when the data to be detected does not meet the preset overall requirement, determining that the data to be detected fails the inspection, and recording the abnormal data in the inspection result of the failed data.
Specifically, when the data to be detected meets the preset overall requirement, step A2 and the subsequent inspection actions are performed on it. Otherwise, the inspection result is directly determined to be a failure.
Optionally, the preset overall requirement may include an integrity requirement.
Inspecting the data to be detected requires comparing and judging specific field information, and these comparisons consume processing capacity. However, some data to be detected may have missing data. For example, if driver insurance information lacks the driver information, the data obviously does not meet the preset integrity requirement and its data quality is certainly problematic. If, as in the conventional approach, every piece of field information were compared and judged in turn after a piece of data to be detected is obtained, a large amount of processing capacity would be wasted. With the method in this embodiment, overall integrity is judged first and the subsequent actions are performed only if it is satisfied, which effectively reduces useless processing.
In some embodiments, the preset overall requirement may further include a duplication check, so that whether the data to be detected is duplicate data can be determined in advance; if it is, the duplicate data is removed so that it is not inspected again.
When the data to be detected meets the preset integrity requirement, step A2 and the other subsequent inspection actions are performed on it.
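The overall check of steps A4 to A6 can be sketched as a cheap gate that runs before any field-level comparison. The required field names, the status strings and the fingerprinting scheme used to spot duplicates are assumptions made for this sketch.

```python
# Hypothetical set of fields that a complete record must contain.
REQUIRED_FIELDS = ("driver_name", "driver_age", "vehicle_model")


def precheck(record, seen):
    """Return "incomplete", "duplicate" or "ok" for one piece of data.

    `seen` is a set of fingerprints of records that were already accepted.
    """
    # Integrity: every required field must be present and non-empty.
    if any(record.get(name) in (None, "") for name in REQUIRED_FIELDS):
        return "incomplete"
    # Duplication: identical field/value pairs count as the same record.
    fingerprint = tuple(sorted((key, str(value)) for key, value in record.items()))
    if fingerprint in seen:
        return "duplicate"
    seen.add(fingerprint)
    return "ok"
```

Only records for which `precheck` returns "ok" would go on to the field-level inspection of step A2; an "incomplete" record is recorded as failed directly, and a "duplicate" record is dropped.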
As shown in fig. 4, in some embodiments of the foregoing data quality detection method, obtaining the data quality detection result of the data set to be detected according to the inspection result in step S3 includes the following steps S31 and S32:
S31, when the number of data quality inspections reaches a preset number, determining the number of normal inspection results among the inspection results, where a normal inspection result is an inspection result indicating that the data quality meets the preset requirement.
One optional implementation is as follows: when the number of data quality inspections performed through Monkey testing reaches a preset number (for example, 1000), the number of normal inspection results is determined. The preset requirement may be that the data to be detected is completely correct, or that its accuracy reaches a preset value (e.g., 95%).
Determining the correct number of normal inspection results among the inspection results in step S31 includes the following steps B1 and B2:
B1, determining that a normal inspection result is obtained when every piece of field information in the inspection result meets the corresponding preset data quality requirement;
B2, obtaining the correct number from all the normal inspection results.
Specifically, steps B1 and B2 mean that the data quality of the data to be detected is determined to meet the preset data quality requirement only when every piece of field information in its inspection result passes the inspection.
S32, obtaining the data quality detection result of the data set to be detected according to the number of normal inspection results and the preset number.
One optional implementation is as follows: let R be the number of inspection results whose data quality meets the preset requirement, and let Q be the number of all the data to be detected; the data quality detection result is then R/Q.
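A minimal sketch of steps S31 and S32 together with the random selection they rely on is given below. It assumes sampling with replacement, which the application does not specify, and the function name, the `check_record` callback and the `seed` parameter are hypothetical.

```python
import random


def dataset_quality(dataset, check_record, preset_number, seed=None):
    """Estimate data quality as R/Q over `preset_number` random inspections.

    `check_record` returns a dict of per-field pass/fail flags for one record.
    """
    if not dataset or preset_number <= 0:
        raise ValueError("dataset must be non-empty and preset_number positive")
    rng = random.Random(seed)
    normal = 0  # R: inspections in which every field passed (steps B1 and B2)
    for _ in range(preset_number):
        record = rng.choice(dataset)  # assumption: sampling with replacement
        if all(check_record(record).values()):
            normal += 1
    return normal / preset_number  # R/Q, with Q equal to the preset number
```

For example, with a preset number of 1000 the function returns the fraction of the 1000 randomly selected records in which every field passed, which can then be compared against a threshold such as 95%.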
As shown in fig. 5, according to another aspect of the present application, an embodiment further provides a data quality detection apparatus, including:
a determining module 1, configured to determine an inspection strategy corresponding to data in a data set to be detected, where the inspection strategy is used for checking whether the data quality of the data meets a preset requirement;
an inspection module 2, configured to perform data quality inspection on the data to be detected through the inspection strategy to obtain an inspection result;
a quality acquisition module 3, configured to obtain the data quality of the data in the data set to be detected according to the inspection result.
Specifically, the specific process of implementing the functions of each module in the apparatus according to the embodiment of the present invention may refer to the related description in the method embodiment, and is not described herein again.
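Purely as an illustration of how the three modules could cooperate, the following sketch wires them together in Python. The class and method names are hypothetical, and for brevity the sketch inspects every record once instead of sampling randomly.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Strategy = Dict[str, Callable[[object], bool]]


class DeterminingModule:
    """Selects the field inspection strategies that apply to a data set."""

    def __init__(self, available: Strategy):
        self._available = available

    def determine(self, dataset: List[Record]) -> Strategy:
        fields = {name for record in dataset for name in record}
        return {name: check for name, check in self._available.items() if name in fields}


class InspectionModule:
    """Inspects one record; it passes only if every field passes."""

    def inspect(self, record: Record, strategy: Strategy) -> bool:
        return all(name in record and check(record[name]) for name, check in strategy.items())


class QualityAcquisitionModule:
    """Turns the individual inspection results into a data quality figure."""

    def acquire(self, results: List[bool]) -> float:
        return sum(results) / len(results) if results else 0.0


def run_apparatus(dataset: List[Record], available: Strategy) -> float:
    strategy = DeterminingModule(available).determine(dataset)
    inspector = InspectionModule()
    results = [inspector.inspect(record, strategy) for record in dataset]
    return QualityAcquisitionModule().acquire(results)
```

Keeping the three responsibilities in separate classes mirrors the module split of the apparatus, so that a different inspection strategy or scoring rule can be swapped in without touching the other modules.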
According to another embodiment of the present application, an electronic device is also provided. As shown in fig. 6, the electronic device may include a processor 1501, a communication interface 1502, a memory 1503 and a communication bus 1504, where the processor 1501, the communication interface 1502 and the memory 1503 communicate with one another through the communication bus 1504.
The memory 1503 is used for storing a computer program;
the processor 1501 is configured to implement the steps of the above method embodiments when executing the program stored in the memory 1503.
The bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.
Embodiments of the present application also provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the steps of the above-described method embodiments.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A data quality detection method, comprising:
determining a detection strategy corresponding to a data set to be detected, wherein the detection strategy is used for detecting whether the data to be detected in the data set to be detected meets the preset data quality requirement;
inspecting the data to be detected through the inspection strategy to obtain an inspection result corresponding to the data to be detected;
and obtaining a data quality detection result of the data set to be detected according to the detection result.
2. The data quality detection method according to claim 1, wherein the detecting the data to be detected by the detection strategy to obtain a detection result comprises:
randomly selecting candidate data in the data set to be detected to obtain the data to be detected;
performing data quality inspection on the data to be inspected through the inspection strategy to obtain an inspection result;
and continuously randomly selecting candidate data in the data set to be detected to obtain the data to be detected, and performing data quality inspection on the data to be detected through the inspection strategy to obtain the inspection result until the total number of the data to be detected obtained through random selection reaches a preset number.
3. The data quality detection method according to claim 2, wherein the determining the inspection strategy corresponding to the data set to be detected includes:
determining attribute types corresponding to all field information in the data to be detected;
and determining a corresponding relation between a field inspection strategy and the attribute type in the inspection strategies, wherein the inspection strategies comprise at least one field inspection strategy, and the field inspection strategies are used for inspecting whether field information corresponding to the attribute type meets a preset field quality requirement.
4. The data quality detection method according to claim 3, wherein the inspecting the data to be detected by the inspection strategy comprises:
determining a standard data format preset by each field inspection strategy;
converting each field information in the data to be detected into standard field information in a standard data format according to the corresponding relation to obtain converted data;
and according to the corresponding relation, checking each field information in the converted data through each field checking strategy.
5. The data quality detection method according to claim 2, wherein before the data quality inspection of the data to be detected by the inspection strategy, the method further comprises:
judging whether the data to be detected meets the preset overall requirement;
executing the next step when the data to be detected meets the preset overall requirement;
and when the data to be detected does not meet the preset overall requirement, determining that the data to be detected fails the inspection.
6. The data quality detection method according to claim 2, wherein the obtaining of the data quality detection result of the data set to be detected according to the inspection result comprises:
when the number of times of data quality inspection reaches a preset number, determining the number of normal inspection results in the inspection results, wherein the normal inspection results are inspection results representing that the data quality meets the preset requirements;
and obtaining a data quality detection result of the data set to be detected according to the number of the normal detection results and the preset number.
7. The data quality detection method of claim 6, wherein the determining the correct number of normal test results in the test results comprises:
determining a normal inspection result obtained when each field information in the inspection result meets the corresponding preset data quality requirement;
and obtaining the correct number according to all the normal test results.
8. A data quality detection apparatus, comprising:
a determining module, configured to determine an inspection strategy corresponding to data in a data set to be detected, wherein the inspection strategy is used for checking whether the data quality of the data meets a preset requirement;
an inspection module, configured to perform data quality inspection on the data to be detected through the inspection strategy to obtain an inspection result;
and a quality acquisition module, configured to obtain the data quality of the data in the data set to be detected according to the inspection result.
9. An electronic device, comprising: a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory is used for storing a computer program;
the processor, when executing the computer program, is configured to implement the data quality detection method of any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the data quality detection method of any one of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010223574.0A CN111427928A (en) 2020-03-26 2020-03-26 Data quality detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010223574.0A CN111427928A (en) 2020-03-26 2020-03-26 Data quality detection method and device

Publications (1)

Publication Number Publication Date
CN111427928A true CN111427928A (en) 2020-07-17

Family

ID=71548850

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010223574.0A Pending CN111427928A (en) 2020-03-26 2020-03-26 Data quality detection method and device

Country Status (1)

Country Link
CN (1) CN111427928A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395280A (en) * 2021-01-19 2021-02-23 睿至科技集团有限公司 Data quality detection method and system
CN112487453A (en) * 2020-12-07 2021-03-12 马力 Data security sharing method and device based on central coordinator
WO2021147559A1 (en) * 2020-08-31 2021-07-29 平安科技(深圳)有限公司 Service data quality measurement method, apparatus, computer device, and storage medium
CN116680337A (en) * 2023-07-10 2023-09-01 天津云检医学检验所有限公司 Visual processing method, system and storage medium for qPCR detection data
CN118467842A (en) * 2024-06-05 2024-08-09 苏州慕名信息技术有限公司 Data popularization system and method of mobile internet
CN118467842B (en) * 2024-06-05 2024-10-18 苏州慕名信息技术有限公司 Data popularization system and method of mobile internet

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445938A (en) * 2015-08-05 2017-02-22 阿里巴巴集团控股有限公司 Data detection method and apparatus
CN106708909A (en) * 2015-11-18 2017-05-24 阿里巴巴集团控股有限公司 Data quality detection method and apparatus
CN109241043A (en) * 2018-08-13 2019-01-18 蜜小蜂智慧(北京)科技有限公司 A kind of data quality checking method and device
CN109271377A (en) * 2018-08-10 2019-01-25 蜜小蜂智慧(北京)科技有限公司 A kind of data quality checking method and device
CN109656812A (en) * 2018-11-19 2019-04-19 平安科技(深圳)有限公司 Data quality checking method, apparatus and storage medium

Similar Documents

Publication Publication Date Title
CN111427928A (en) Data quality detection method and device
CN109271315B (en) Script code detection method, script code detection device, computer equipment and storage medium
CN112346993B (en) Method, device and equipment for testing information analysis engine
CN115841046B (en) Accelerated degradation test data processing method and device based on wiener process
CN110046086B (en) Expected data generation method and device for test and electronic equipment
CN108985187A (en) A kind of method that automatic quality inspection is realized in self verification of digital archive
CN113468034A (en) Data quality evaluation method and device, storage medium and electronic equipment
CN115952081A (en) Software testing method, device, storage medium and equipment
CN113806343B (en) Evaluation method and system for Internet of vehicles data quality
CN111274056B (en) Self-learning method and device for fault library of intelligent electric energy meter
CN112948262A (en) System test method, device, computer equipment and storage medium
CN110769076B (en) DNS (Domain name System) testing method and system
CN112486841A (en) Method and device for checking data collected by buried point
CN109710651B (en) Data type identification method and device
CN111413952A (en) Robot fault detection method and device, electronic equipment and readable storage medium
CN110795308A (en) Server inspection method, device, equipment and storage medium
CN114077545A (en) Method, device and equipment for acquiring verification data and readable storage medium
CN115309661A (en) Application testing method and device, electronic equipment and readable storage medium
CN110362498B (en) Page hot spot testing method and device and server
CN112580334A (en) File processing method, file processing device, server and storage medium
TWI778634B (en) Method for classifying faults, electronic equipment and storage medium
CN112762976B (en) Automatic method and device for comprehensive test of BMC (baseboard management controller) sensor
CN115576801A (en) Method and device for testing buried point data, electronic device and storage medium
CN117762787A (en) ETL system testing method, device, equipment and medium based on metamorphic test
CN115391214A (en) Test case detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 221, 2 / F, block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant after: Jingdong Technology Holding Co.,Ltd.

Address before: Room 221, 2 / F, block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant before: Jingdong Digital Technology Holding Co.,Ltd.

Address after: Room 221, 2 / F, block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant after: Jingdong Digital Technology Holding Co.,Ltd.

Address before: Room 221, 2 / F, block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant before: JINGDONG DIGITAL TECHNOLOGY HOLDINGS Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20200717