Disclosure of Invention
In order to solve the technical problem or at least partially solve the technical problem, the present application provides a data quality detection method and apparatus.
In a first aspect, an embodiment of the present application provides a data quality detection method, including:
determining an inspection strategy corresponding to a data set to be detected, wherein the inspection strategy is used for checking whether the data to be detected in the data set to be detected meets a preset data quality requirement;
inspecting the data to be detected through the inspection strategy to obtain an inspection result corresponding to the data to be detected;
and obtaining a data quality detection result of the data set to be detected according to the inspection result.
Optionally, as in the foregoing data quality detection method, the inspecting the data to be detected through the inspection strategy to obtain an inspection result includes:
randomly selecting candidate data in the data set to be detected to obtain the data to be detected;
performing data quality inspection on the data to be detected through the inspection strategy to obtain an inspection result;
and continuing to randomly select candidate data in the data set to be detected to obtain data to be detected, and performing data quality inspection on that data through the inspection strategy to obtain an inspection result, until the total number of data to be detected obtained through random selection reaches a preset number.
Optionally, as in the foregoing data quality detection method, the determining an inspection strategy corresponding to the data set to be detected includes:
determining attribute types corresponding to all field information in the data to be detected;
and determining a correspondence between each field inspection strategy in the inspection strategy and the attribute types, wherein the inspection strategy comprises at least one field inspection strategy, and each field inspection strategy is used for checking whether the field information corresponding to its attribute type meets a preset field quality requirement.
Optionally, as in the foregoing data quality detection method, the inspecting the data to be detected through the inspection strategy includes:
determining a standard data format preset by each field inspection strategy;
converting each field information in the data to be detected into standard field information in a standard data format according to the corresponding relation to obtain converted data;
and according to the corresponding relation, checking each field information in the converted data through each field checking strategy.
Optionally, as in the foregoing data quality detection method, before performing data quality inspection on the data to be detected through the inspection strategy, the method further includes:
judging whether the data to be detected meets a preset overall requirement;
executing the next step when the data to be detected meets the preset overall requirement;
and when the data to be detected does not meet the preset overall requirement, determining that the inspection result of the data to be detected is a failed inspection.
Optionally, as in the foregoing data quality detection method, obtaining a data quality detection result of the data set to be detected according to the inspection result includes:
when the number of data quality inspections reaches a preset number, determining the number of normal inspection results among the inspection results, wherein a normal inspection result is an inspection result indicating that the data quality meets the preset requirement;
and obtaining a data quality detection result of the data set to be detected according to the number of normal inspection results and the preset number.
Optionally, as in the foregoing data quality detection method, the determining the number of normal inspection results among the inspection results includes:
determining a normal inspection result, which is obtained when each field information in the inspection result meets the corresponding preset data quality requirement;
and obtaining the number of normal inspection results according to all the normal inspection results.
In a second aspect, an embodiment of the present application provides a data quality detection apparatus, including:
a determining module, used for determining an inspection strategy corresponding to data in a data set to be detected, wherein the inspection strategy is used for checking whether the data quality of the data meets a preset requirement;
an inspection module, used for performing data quality inspection on the data to be detected through the inspection strategy to obtain an inspection result;
and a quality acquisition module, used for acquiring the data quality of the data in the data set to be detected according to the inspection result.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
the memory is used for storing a computer program;
the processor is configured to implement the data quality detection method according to any one of the preceding claims when executing the computer program.
In a fourth aspect, the present application provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause the computer to execute the data quality detection method according to any one of the foregoing embodiments.
The embodiments of the application provide a data quality detection method and a data quality detection apparatus, wherein the method comprises the following steps: determining an inspection strategy corresponding to a data set to be detected, wherein the inspection strategy is used for checking whether the data to be detected in the data set to be detected meets a preset data quality requirement; inspecting the data to be detected through the inspection strategy to obtain an inspection result corresponding to the data to be detected; and obtaining a data quality detection result of the data set to be detected according to the inspection result. Compared with the prior art, the technical solution provided by the embodiments of the application has the following advantages: it can solve the problems that, in big data quality detection, manually generating sampling samples is time-consuming and labor-intensive, the detection cost is high and the verification cycle is long; it can automate both the generation of random detection samples and the detection process, and is applicable to a variety of detection scenarios, thereby reducing labor cost, enlarging the sampling range and accelerating detection.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 shows a data quality detection method provided in an embodiment of the present application, including the following steps S1 to S3:
S1, determining an inspection strategy corresponding to a data set to be detected, wherein the inspection strategy is used for checking whether the data to be detected in the data set to be detected meets a preset data quality requirement;
Specifically, data quality is the basis for guaranteeing data applications, and it is evaluated mainly in four aspects: integrity, consistency, accuracy and timeliness. Whether the data meets the expected data quality requirement can be judged from these four aspects.
Integrity: integrity refers to whether data information is missing. The missing data may be an entire data record or a single field within a record. Incomplete data is of greatly reduced reference value, so integrity is the most basic evaluation criterion of data quality. Integrity is relatively easy to evaluate, and can generally be assessed through the record count and the unique-value count in the data statistics. For example, the daily access volume in a website log is a record count: if the daily access volume is normally about 1000 and suddenly drops to 100 on a certain day, it is necessary to check whether data is missing. As another example, each region name in the regional distribution statistics of a website is a unique value: in this example China comprises 32 provinces and municipalities, so if the number of unique values obtained by the statistics is less than 32, it can be judged that data is likely to be missing.
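The record-count and unique-value checks described above can be sketched as follows. This is a minimal illustration only: the function name `integrity_warnings`, the `region` key and the `min_ratio` threshold are hypothetical and are not specified by the embodiment.

```python
def integrity_warnings(records, region_key, expected_regions,
                       usual_daily_count, min_ratio=0.5):
    """Return warnings for one day's records (hypothetical helper).

    min_ratio is an assumed threshold: a day is flagged when its record
    count falls below min_ratio * usual_daily_count.
    """
    warnings = []
    # Record-count check: a sudden drop suggests whole records are missing.
    if len(records) < usual_daily_count * min_ratio:
        warnings.append(
            f"record count {len(records)} is far below the usual {usual_daily_count}"
        )
    # Unique-value check: fewer distinct regions than expected suggests gaps.
    seen = {r.get(region_key) for r in records if r.get(region_key)}
    if len(seen) < expected_regions:
        warnings.append(
            f"only {len(seen)} of {expected_regions} expected regions present"
        )
    return warnings


# Usage: 100 records from a single region, against a usual volume of 1000.
day = [{"region": "Beijing"}] * 100
for message in integrity_warnings(day, "region", 32, 1000):
    print(message)
```

A warning here only indicates that data is likely missing; as in the text, it is a prompt to investigate rather than a proof.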
Consistency: consistency refers to whether the data follows a uniform specification and whether the data set maintains a uniform format. The consistency of data quality is mainly reflected in whether the data records are well-formed and whether the data is logical. Well-formedness means that a data item has its specific format; for example, a mobile phone number must be a digit string of fixed length (13 digits in this example), and an IP address must consist of four numbers between 0 and 255 separated by ".". Logicality means that a fixed logical relationship exists among multiple data items; for example, the PV (Page View, access volume) of a website must be greater than or equal to its UV (Unique Visitor, number of visitors), and the bounce rate must be between 0 and 1. Data generally has standard coding rules, and the consistency check of data records is relatively simple: a record is consistent as long as it complies with these rules. For example, when the standard code for a region is "Beijing" rather than "Beijing City", the corresponding unique value is mapped to the standard unique value.
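The format and logic rules above can be expressed as small predicate functions. The sketch below is illustrative only: all function names, the `REGION_ALIASES` table and the `PHONE_DIGITS` constant (which simply copies the digit count used in the example) are assumptions rather than part of the described method.

```python
import re

PHONE_DIGITS = 13  # digit count taken from the example in the text (assumption)

def is_valid_phone(value: str) -> bool:
    """Format rule: exactly PHONE_DIGITS ASCII digits."""
    return re.fullmatch(rf"[0-9]{{{PHONE_DIGITS}}}", value) is not None

def is_valid_ipv4(value: str) -> bool:
    """Format rule: four numbers between 0 and 255 separated by '.'."""
    parts = value.split(".")
    return len(parts) == 4 and all(
        p.isascii() and p.isdigit() and 0 <= int(p) <= 255 for p in parts
    )

def is_logically_consistent(pv: int, uv: int, rate: float) -> bool:
    """Logic rule: PV >= UV, and the rate lies between 0 and 1."""
    return pv >= uv and 0.0 <= rate <= 1.0

# Hypothetical alias table mapping non-standard codes to the standard code.
REGION_ALIASES = {"Beijing City": "Beijing"}

def normalize_region(name: str) -> str:
    """Map a region name to its standard unique value, if an alias is known."""
    return REGION_ALIASES.get(name, name)
```

Keeping each rule as a separate predicate makes it straightforward to attach one rule to one field later on.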
Accuracy: accuracy refers to whether the information in the data records is abnormal or erroneous. Unlike consistency problems, accuracy problems are not merely rule violations. The most common accuracy error is garbled text. In addition, abnormally large or abnormally small values are also unqualified data. Accuracy problems may exist in individual records or in the entire data set, for example a recording error in the order of magnitude. Such errors can be audited using the maximum and minimum statistics. Data generally follows a normal distribution, so if some data with a small proportion is problematic, a judgment can be made by comparing it with the proportions of other low-frequency data. Of course, if the erroneous values are not statistically conspicuous, they are the hardest to detect: subtle clues have to be found through complex statistical analysis and comparison, possibly with the help of data analysis tools. Specific data correction methods are not described here.
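As a rough illustration of auditing with maximum and minimum statistics, the sketch below reports the extremes of a column and lists the values that fall outside a plausible range. The helper names and the example bounds are hypothetical.

```python
def summary_extremes(values):
    """Return (minimum, maximum) of a non-empty sequence of numbers."""
    return min(values), max(values)

def out_of_range(values, lower, upper):
    """Return (index, value) pairs lying outside the closed range [lower, upper]."""
    return [(i, v) for i, v in enumerate(values) if not lower <= v <= upper]


# Usage: one value recorded with the wrong order of magnitude.
daily_amounts = [10, 12, 1100, 9]
print(summary_extremes(daily_amounts))      # extremes hint at the outlier
print(out_of_range(daily_amounts, 0, 100))  # the offending entries
```

Errors that stay inside the plausible range are not caught by this kind of check, which matches the caveat in the text.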
Timeliness: timeliness refers to the time interval between the generation of data and the time it can be viewed, also called the data delay. Data analysis itself does not place high demands on timeliness, but if the data analysis period plus the data preparation time is too long, the conclusions drawn from the analysis may lose their reference value.
The inspection strategy is a strategy for inspecting any data in the data set to be detected. A piece of data may include only one piece of field information, for example, that the total number of provinces and municipalities of China is 32; a piece of data may also include multiple pieces of field information, for example, driver insurance information may include the vehicle model, driver information, insurance records and the like. The inspection strategy may include only one inspection method or multiple inspection methods, and each inspection method generally has corresponding field information.
The data to be detected is selected from the data set to be detected for inspection, and may be selected by random sampling or selected one by one.
In some optional implementation manners, when the data to be detected passes the inspection strategy, it can be determined that the data to be detected meets the preset data quality requirement.
S2, inspecting the data to be detected through the inspection strategy to obtain an inspection result corresponding to the data to be detected.
Specifically, when the data to be detected has multiple pieces of field information to be inspected, the inspection result may include multiple sub-inspection results. Continuing the example given in the previous step: since the driver insurance information may include the vehicle model, the driver information and the insurance record, sub-inspection results corresponding to the vehicle model, the driver information and the insurance record are obtained respectively.
S3, obtaining a data quality detection result of the data set to be detected according to the inspection result.
Specifically, when tens of millions or hundreds of millions of data exist in the data set to be detected, each data is detected, a large amount of computing resources are consumed, the processing period is long, and the efficiency is low, so that part of data (namely the data to be detected) is generally selected for sampling detection, and a data quality detection result of the whole data set to be detected is obtained according to the detection result of each data to be detected.
In big data processing, samples of a specified size need to be drawn from the big data, and whether the data quality of each attribute of each sample meets the standard is checked. Because the data volume is large, the data sources are numerous and the data is collected over a long time, some of the data will, as time passes, no longer meet the data processing standard of the data processing program. Therefore, whether the data meets the standard of the data processing program must be judged during data processing; that is, the quality of the data needs to be detected.
A common practice is to select a data sample from the big data and verify the data quality manually. However, manual data quality detection takes a long time, covers only small samples, and cannot find unknown problems.
With the solution in this embodiment, the problems that manually generating sampling samples is time-consuming and labor-intensive, the detection cost is high and the verification cycle is long in big data quality detection can be solved; the generation of random detection samples and the detection process can both be automated, and the solution is applicable to a variety of detection scenarios, thereby reducing labor cost and accelerating detection.
In some embodiments, as in the foregoing data quality detection method, step S2 of inspecting the data to be detected through the inspection strategy to obtain the inspection result includes the following steps A1 to A3:
A1, randomly selecting candidate data in the data set to be detected to obtain the data to be detected;
Specifically, the data in the data set to be detected is collectively called candidate data, and the data to be detected is the candidate data obtained by random selection.
A2, performing data quality inspection on the data to be detected through the inspection strategy to obtain an inspection result; for data that fails the inspection, the abnormal data is recorded in its inspection result;
A3, continuing to randomly select candidate data in the data set to be detected to obtain data to be detected, and performing data quality inspection on that data through the inspection strategy to obtain an inspection result, until the total number of data to be detected obtained through random selection reaches a preset number.
One optional implementation method in this embodiment may be:
(11) determining the size M of the preset quantity;
(12) randomly selecting one candidate data from a data set to be detected as data to be detected;
(13) checking whether the currently selected data to be detected meets the quality standard, and adding 1 to the total number of data to be detected obtained by random selection;
(14) judging whether the total number of data to be detected obtained by random selection is equal to the preset number M;
(15) repeating steps (12), (13) and (14) while the number of sampled data to be detected is smaller than the preset number M;
(16) stopping the sampling inspection when the number of sampled data to be detected is equal to the preset number M.
Another optional implementation method of this embodiment may be:
(21) determining the size M of the preset quantity;
(22) randomly selecting one candidate data from a data set to be detected as data to be detected;
(23) adding 1 to the total number of sampled data to be detected, and judging whether the total number of data to be detected obtained by random selection is equal to the preset number M;
(24) checking whether the currently selected data to be detected meets the quality standard;
(25) repeating steps (22), (23) and (24) while the number of sampled data to be detected is smaller than the preset number M;
(26) stopping the sampling inspection when the number of sampled data to be detected is equal to the preset number M.
Further, after step (13) of checking whether each data to be detected meets the quality standard, the method may further include: storing the inspection result of each data to be detected. The inspection result may include abnormal data, so that the abnormal data can later be analyzed manually to find unknown problems, and the inspection strategy (such as an inspection function) can be optimized according to the abnormal records so that the inspection achieves a better effect.
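The sampling loop of steps (11) to (16), together with the storing of abnormal records, can be sketched as follows. This is only one possible reading: the function name `sample_and_inspect` and the `inspect` callback are hypothetical, and the sketch assumes sampling with replacement, which the embodiment does not specify.

```python
import random

def sample_and_inspect(dataset, inspect, m, rng=None):
    """Randomly inspect m items of a non-empty dataset.

    dataset: sequence of candidate data
    inspect: callable returning True when a record meets the quality standard
    m:       the preset number M                                  (step 11)
    """
    rng = rng or random.Random()
    total = 0        # total number of data to be detected selected so far
    results = []     # one (record, passed) entry per inspection
    abnormal = []    # failed records, kept for later manual analysis
    while total < m:                      # steps (14)-(16): loop until M
        record = rng.choice(dataset)      # step (12): assumed with replacement
        passed = inspect(record)          # step (13): quality check
        total += 1                        # step (13): increment the counter
        results.append((record, passed))
        if not passed:
            abnormal.append(record)
    return results, abnormal


# Usage with a hypothetical age rule (18 to 120) and a fixed seed.
data = [{"age": a} for a in (25, 40, 200, 5, 33)]
results, abnormal = sample_and_inspect(
    data, lambda r: 18 <= r["age"] <= 120, 50, random.Random(0)
)
print(len(results), len(abnormal))
```

The second variant, steps (21) to (26), differs only in incrementing the counter before the quality check; the loop terminates after the same number of inspections.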
Optionally, the random selection and inspection of the data to be detected are performed through a monkey test. Owing to the randomness of the monkey test (namely, a random point test), one candidate data can be randomly selected from the data set to be detected as the data to be detected and then inspected. When the number of inspections reaches the preset number, the random selection of candidate data and the related operations such as inspection are stopped, so that the test stops automatically without supervision. In addition, the selection may be based on a certain attribute of the data. For example, the selection may be made randomly according to the time attribute of the data, with several pieces of data selected as data to be detected in each time period; or the selection may be made according to data size, with several pieces of data selected from the data whose size falls within a certain interval. The data may also be randomly selected in other manners, which are not enumerated here. With this testing mode, data of unsatisfactory quality can be found quickly in a database holding a large amount of data, such as a big data set, and the overall sampling situation is obtained, so that the sample is enlarged, the verification period is shortened, and more unknown problems can be found.
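Selection by time attribute, as mentioned above, might look like the following sketch, which groups records into fixed-length time periods and draws a few at random from each period. The function name, the `timestamp_of` accessor and the numeric timestamps are illustrative assumptions.

```python
import random
from collections import defaultdict

def sample_per_period(records, timestamp_of, period_seconds, per_period, rng=None):
    """Pick up to per_period random records from every time period."""
    rng = rng or random.Random()
    buckets = defaultdict(list)
    for record in records:
        # Records whose timestamps fall in the same period share a bucket.
        buckets[int(timestamp_of(record) // period_seconds)].append(record)
    selected = []
    for key in sorted(buckets):
        group = buckets[key]
        # Without replacement inside a period; small periods yield fewer items.
        selected.extend(rng.sample(group, min(per_period, len(group))))
    return selected


# Usage: 30 records spread over 300 seconds, 2 samples per 100-second period.
logs = [{"ts": t} for t in range(0, 300, 10)]
picked = sample_per_period(logs, lambda r: r["ts"], 100, 2, random.Random(1))
print(len(picked))
```

Selection by data size could follow the same pattern with a size-based bucket key in place of the time-based one.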
As shown in fig. 2, in some embodiments, as in the foregoing data quality detection method, step S1 of determining the inspection strategy corresponding to the data set to be detected includes the following steps S11 and S12:
S11, determining attribute types corresponding to the field information in the data to be detected.
Specifically, this step identifies the attribute types. In some cases the data to be detected is composed of different field information, as in the example given earlier: the driver insurance information may include the vehicle model, driver information, insurance records and the like, where the vehicle model, the driver information and the insurance record are attribute types in this embodiment, and the field information is the specific information under each attribute type. Further, the field information may include sub-field information, and one attribute type may include multiple sub-attribute types. For example, the driver information may further include the driver name, driver age, driver gender and the like; the driver name, driver age and driver gender are sub-attribute types, and their specific information is sub-field information.
S12, determining a correspondence between each field inspection strategy in the inspection strategy and the attribute types, wherein the inspection strategy comprises at least one field inspection strategy, and each field inspection strategy is used for checking whether the field information corresponding to its attribute type meets a preset field quality requirement.
Specifically, different attribute types have different rules. For example, the vehicle model generally includes the manufacturer and the specific product model, such as BMW X5 or Benz 300, so the corresponding field inspection strategy may require a string of more than 2 characters consisting of Chinese characters, English letters and digits. When the attribute type is the driver age, the corresponding field inspection strategy is that the age is between 18 and 120; if the driver age in the data to be detected is 200 or 5, the data is obviously wrong, which is a data quality problem. When the attribute type corresponding to an inspection strategy is the driver age, the inspection strategy comprises only one field inspection strategy. When the inspection strategy corresponds to the driver information or the driver insurance information, which comprise multiple attribute types, different field inspection strategies are needed to inspect the field information of each attribute type separately and judge whether each field information meets the preset field quality requirement. Generally, when field information passes the inspection of its corresponding field inspection strategy, it can be determined that the preset field quality requirement is met.
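The correspondence between attribute types and field inspection strategies can be modeled as a mapping from attribute name to a check function. The sketch below is a minimal illustration: the key names, the `FIELD_STRATEGIES` table and the exact vehicle-model pattern (here also allowing spaces) are assumptions based on the examples in the text.

```python
import re

def check_driver_age(value) -> bool:
    """Field rule from the example: an integer age between 18 and 120."""
    return isinstance(value, int) and not isinstance(value, bool) and 18 <= value <= 120

def check_vehicle_model(value) -> bool:
    """Assumed reading of the rule: more than 2 characters, made up of
    Chinese characters, English letters, digits and spaces."""
    return (
        isinstance(value, str)
        and len(value) > 2
        and re.fullmatch(r"[\u4e00-\u9fffA-Za-z0-9 ]+", value) is not None
    )

# Hypothetical correspondence: attribute type -> field inspection strategy.
FIELD_STRATEGIES = {
    "driver_age": check_driver_age,
    "vehicle_model": check_vehicle_model,
}

def inspect_fields(record, strategies=FIELD_STRATEGIES):
    """Return {attribute_type: passed}; a missing field counts as failed."""
    return {name: check(record.get(name)) for name, check in strategies.items()}


print(inspect_fields({"driver_age": 200, "vehicle_model": "BMW X5"}))
```

An inspection strategy covering only the driver age would simply be a mapping with a single entry.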
As shown in fig. 3, in some embodiments, as in the foregoing data quality detection method, step S2 of inspecting the data to be detected through the inspection strategy includes the following steps S21 to S23:
S21, determining a standard data format preset by each field inspection strategy.
Specifically, the data in the data set to be detected may be imported from multiple sources, so the data formats of the candidate data and of the data to be detected may differ. However, inspection with a field inspection strategy requires comparing and judging each piece of field information, and when the data format of the field information differs from the standard data format, the comparison may fail or may not be possible at all. It is therefore necessary to determine the standard data format preset in each field inspection strategy. For example, when the standard data format of the vehicle model is "manufacturer + specific vehicle model" and a piece of data to be detected is in the form "specific vehicle model + manufacturer", the comparison may go wrong. This step may be implemented either by establishing a correspondence between each field inspection strategy and its standard data format, or by having the field inspection strategy carry the standard data format.
S22, converting each field information in the data to be detected into standard field information in the standard data format according to the correspondence, to obtain converted data.
Specifically, since the field inspection policy is preset with a standard data format, and the corresponding relationship is a relationship corresponding to each other between the field inspection policy and the attribute type, after the attribute type of each field information in the data to be detected is determined, the standard data format corresponding to each field information can be determined according to the attribute type and the corresponding relationship, that is, each field information can be converted according to the standard data format to obtain the standard field information, and then the converted data can be obtained.
S23, inspecting each field information in the converted data through each field inspection strategy according to the correspondence.
Specifically, the corresponding relationship is a relationship between the field inspection policy and the attribute type, and each field information also has a corresponding attribute type, so that the field inspection policy corresponding to each attribute type can be determined, and the purpose of inspecting each field information in the converted data through each field inspection policy is further achieved.
With the method in this embodiment, the data to be detected can be standardized, which increases the inspection speed and also avoids inspection failures caused by format mismatches that would otherwise affect the accuracy of the final data quality detection result.
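Steps S21 to S23 can be sketched as a conversion table applied before the field checks. The sketch assumes a hypothetical list of known manufacturers in order to recognize the "model + manufacturer" ordering; the names `STANDARDIZERS`, `CHECKS` and `convert_and_inspect` are illustrative and not taken from the embodiment.

```python
KNOWN_MANUFACTURERS = ("BMW", "Benz")  # hypothetical lookup list

def to_standard_vehicle_model(raw) -> str:
    """Standard format assumed here: manufacturer first, then the model."""
    tokens = str(raw).split()
    for i, token in enumerate(tokens):
        if token in KNOWN_MANUFACTURERS:
            return " ".join([token] + tokens[:i] + tokens[i + 1:])
    return " ".join(tokens)  # unknown manufacturer: only whitespace is normalized

def to_standard_age(raw):
    """Standard format assumed here: an int; None when conversion fails."""
    try:
        return int(str(raw).strip())
    except ValueError:
        return None

# S21: each attribute type has a standard format, expressed as a converter.
STANDARDIZERS = {"vehicle_model": to_standard_vehicle_model,
                 "driver_age": to_standard_age}

# Field checks that expect standard-format input.
CHECKS = {
    "vehicle_model": lambda v: any(v.startswith(m + " ") for m in KNOWN_MANUFACTURERS),
    "driver_age": lambda v: v is not None and 18 <= v <= 120,
}

def convert_and_inspect(record):
    # S22: convert every known field into its standard format.
    converted = {k: STANDARDIZERS[k](v) for k, v in record.items() if k in STANDARDIZERS}
    # S23: inspect each converted field with its field inspection strategy.
    results = {k: CHECKS[k](v) for k, v in converted.items()}
    return converted, results


print(convert_and_inspect({"vehicle_model": "X5 BMW", "driver_age": " 42 "}))
```

Without the conversion in S22, the vehicle-model check above would reject "X5 BMW" even though the underlying information is valid.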
In some embodiments, as in the foregoing data quality detection method, before step A2 of performing data quality inspection on the data to be detected through the inspection strategy, the method further includes the following steps A4 to A6:
A4, judging whether the data to be detected meets a preset overall requirement.
Specifically, the preset overall requirement may be a strategy for judging the data to be detected as a whole.
A5, when the data to be detected meets the preset overall requirement, executing the next step;
A6, when the data to be detected does not meet the preset overall requirement, determining that the inspection result of the data to be detected is a failed inspection, and recording the abnormal data in the inspection result of the data that failed.
Specifically, when the data to be detected meets the preset overall requirement, step A2 is executed on the data to be detected, followed by the other subsequent inspection actions. Otherwise, the inspection result is directly determined to be a failure.
Optionally, the predetermined overall requirement may include a requirement for integrity.
Inspecting the data to be detected requires comparing and judging specific field information, and these comparisons consume processing capacity. Some data to be detected, however, may have missing data. For example, if the driver information is missing from the driver insurance information, the data obviously does not meet the preset integrity requirement, and its quality is certainly problematic. If, following the conventional approach, the field information were compared and judged one by one after each piece of data to be detected is obtained, a large amount of processing capacity would be wasted. With the method in this embodiment, the integrity of the data as a whole can be judged first, and the subsequent actions are performed only on that basis, which effectively reduces wasted processing.
In some embodiments, the preset overall requirement may further include a duplication requirement, so that whether the data to be detected is duplicate data can be determined in advance; if it is, the duplicate is removed to avoid inspecting the same data repeatedly.
When the data to be detected meets the preset overall requirement, step A2 and the other subsequent inspection actions are performed on it.
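The overall pre-check of steps A4 to A6, including the optional duplicate removal, can be sketched as follows. The required field names, the duplicate key based on `repr`, and the `field_inspect` callback are hypothetical choices made for illustration.

```python
REQUIRED_FIELDS = ("vehicle_model", "driver_info", "insurance_record")  # assumed

def meets_overall_requirement(record, required=REQUIRED_FIELDS):
    """Integrity pre-check: every required field is present and non-empty."""
    return all(record.get(f) not in (None, "", {}, []) for f in required)

def inspect_with_precheck(records, field_inspect):
    """Run the cheap overall check first; field checks only when it passes."""
    seen = set()
    results = []
    for record in records:
        # Duplicate detection on a simple content-based key (an assumption).
        key = tuple(sorted((k, repr(v)) for k, v in record.items()))
        if key in seen:
            continue                      # duplicate: dropped, not inspected
        seen.add(key)
        if not meets_overall_requirement(record):
            results.append((record, False))   # A6: fails without field checks
            continue
        results.append((record, field_inspect(record)))  # A5, then step A2
    return results


# Usage: the record lacking driver information fails before any field check.
full = {"vehicle_model": "BMW X5",
        "driver_info": {"name": "A", "age": 30},
        "insurance_record": ["2020"]}
partial = {"vehicle_model": "BMW X5", "insurance_record": ["2020"]}
print([ok for _, ok in inspect_with_precheck([full, dict(full), partial],
                                            lambda r: True)])
```

The saving comes from the early `continue` statements: incomplete or duplicate data never reaches the per-field comparisons.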
As shown in fig. 4, in some embodiments, as in the foregoing data quality detection method, step S3 of obtaining a data quality detection result of the data set to be detected according to the inspection result includes the following steps S31 and S32:
S31, when the number of data quality inspections reaches a preset number, determining the number of normal inspection results among the inspection results, wherein a normal inspection result is an inspection result indicating that the data quality meets the preset requirement.
One of the optional implementation methods is as follows: when the number of times of data quality inspection reaches a preset number (for example, 1000 times) through Monkey testing, determining the number of normal inspection results; the preset requirement may be that the data to be detected is completely correct or the accuracy reaches a preset value (e.g., 95%).
The step S31 of determining the correct number of normal test results in the test results includes the following steps B1 and B2:
B1, determining a normal inspection result, which is obtained when each field information in the inspection result meets the corresponding preset data quality requirement;
B2, obtaining the correct number according to all the normal inspection results.
Specifically, steps B1 and B2 mean that the data quality of a piece of data to be detected is determined to meet the preset data quality requirement only when every field information in its inspection result passes the inspection.
S32, obtaining a data quality detection result of the data set to be detected according to the number of normal inspection results and the preset number.
One optional implementation is as follows: the number of inspection results whose data quality meets the preset requirement is defined as R, and the number of all the data to be detected is defined as Q; the data quality detection result is then R/Q.
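The computation of R/Q from per-field inspection results can be sketched as below. The representation of an inspection result as a dictionary of field name to pass/fail, and the function names, are assumptions made for the example.

```python
def is_normal(field_results) -> bool:
    """B1: a result is normal only if every field passed.

    Note: an empty dict counts as normal here, because all() of an
    empty sequence is True.
    """
    return all(field_results.values())

def quality_ratio(inspection_results):
    """Return (R, Q, R/Q) for a non-empty list of per-field result dicts."""
    q = len(inspection_results)                 # Q: all inspected data
    if q == 0:
        raise ValueError("no inspection results")
    r = sum(1 for fr in inspection_results if is_normal(fr))  # B2: R
    return r, q, r / q


# Usage: two of four sampled records pass every field check.
sampled = [
    {"driver_age": True, "vehicle_model": True},
    {"driver_age": True, "vehicle_model": False},
    {"driver_age": True, "vehicle_model": True},
    {"driver_age": False, "vehicle_model": False},
]
r, q, ratio = quality_ratio(sampled)
print(r, q, ratio)
```

If the preset requirement is expressed as an accuracy threshold such as 95%, the returned ratio can simply be compared against that value.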
As shown in fig. 5, according to another aspect of the present application, an embodiment of the present application further provides a data quality detection apparatus, including:
a determining module 1, used for determining an inspection strategy corresponding to data in a data set to be detected, wherein the inspection strategy is used for checking whether the data quality of the data meets a preset requirement;
an inspection module 2, used for performing data quality inspection on the data to be detected through the inspection strategy to obtain an inspection result;
and a quality acquisition module 3, used for acquiring the data quality of the data in the data set to be detected according to the inspection result.
Specifically, the specific process of implementing the functions of each module in the apparatus according to the embodiment of the present invention may refer to the related description in the method embodiment, and is not described herein again.
According to another embodiment of the present application, an electronic device is further provided. As shown in fig. 6, the electronic device may include a processor 1501, a communication interface 1502, a memory 1503 and a communication bus 1504, wherein the processor 1501, the communication interface 1502 and the memory 1503 communicate with each other through the communication bus 1504.
A memory 1503 for storing a computer program;
the processor 1501 is configured to implement the steps of the above-described method embodiments when executing the program stored in the memory 1503.
The bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.
Embodiments of the present application also provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the steps of the above-described method embodiments.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.