CN113792033A - Spark-based data quality checking method and device, storage medium and terminal - Google Patents

Spark-based data quality checking method and device, storage medium and terminal

Info

Publication number
CN113792033A
Authority
CN
China
Prior art keywords
data
field
checking
quality
checked
Prior art date
Legal status
Pending
Application number
CN202110926788.9A
Other languages
Chinese (zh)
Inventor
李红兴
蔡抒扬
夏曙东
陈利玲
孙智彬
张志平
Current Assignee
Beijing Transwiseway Information Technology Co Ltd
Original Assignee
Beijing Transwiseway Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Transwiseway Information Technology Co Ltd filed Critical Beijing Transwiseway Information Technology Co Ltd
Priority to CN202110926788.9A priority Critical patent/CN113792033A/en
Publication of CN113792033A publication Critical patent/CN113792033A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a Spark-based data quality checking method and device, a storage medium, and a terminal. The method comprises the following steps: acquiring data partition parameters or screening parameters, and creating a data extraction component from the Spark SQL component and the partition or screening parameters; acquiring a data set to be checked from a data center through the data extraction component and preprocessing it; loading a data checking rule table and determining, from the data checking rule table, the data checking rule corresponding to each field in the preprocessed data set to be checked; performing a quality check on each field according to its data checking rule to generate a checking result for the field; and entering the checking result of each field into a preset report template to generate a data quality checking report for the data set to be checked. By adopting the embodiment of the application, quality checking of the required data can be automated, which improves data checking efficiency and effectively guarantees the accuracy and reliability of the data.

Description

Spark-based data quality checking method and device, storage medium and terminal
Technical Field
The invention relates to the technical field of big data, and in particular to a Spark-based data quality checking method and device, a storage medium, and a terminal.
Background
In enterprise data standardization, organizations expect standardized data to feed value back to the business, which places a premium on data quality. In this process low-quality data is inevitably produced: large-batch data initialization, problems propagated from unprocessed historical data, and data generated by emergency services all degrade data quality. With the rise of big data and deep learning technology, controlling the probability that low-quality data is produced, and finding and handling such data in time, are the measures that researchers wish to realize.
In the prior art, data quality management software generally computes a comprehensive data quality score and drives a workflow for deeply tracking each data quality problem according to that score. Because the weight assigned to the severity of each data quality problem cannot be controlled effectively, and the scoring scheme often cannot be adjusted again once the weights are fixed, the scoring model cannot adapt to the actual volume of service data. As a result, the whole data quality checking approach is inflexible and its accuracy is low.
Disclosure of Invention
The embodiment of the application provides a Spark-based data quality checking method and device, a storage medium and a terminal. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
In a first aspect, an embodiment of the present application provides a method for checking data quality based on Spark, where the method includes:
acquiring data partition parameters or screening parameters, and creating a data extraction component according to the Spark SQL component and the partition parameters or the screening parameters;
acquiring and preprocessing a data set to be checked from a data center according to a data extraction component;
loading a data checking rule table, and determining a data checking rule corresponding to each field in the preprocessed data set to be checked from the data checking rule table;
performing quality check on the corresponding fields according to the data check rule corresponding to each field to generate a check result of each field;
and inputting the checking result of each field into a preset report template, and generating a data quality checking report of the data set to be checked.
Optionally, after the checking result of each field is input into a preset report template and a data quality checking report of the data set to be checked is generated, the method further includes:
sending the data quality check report to a client of a relevant department; wherein
the client of the relevant department at least comprises a DingTalk robot, a TXT file, and a mailbox.
Optionally, the determining, from the data checking rule table, a data checking rule corresponding to each field in the preprocessed data set to be checked includes:
acquiring a data value of each field in the preprocessed data set to be checked;
identifying a data type corresponding to the data value of each field;
and acquiring the data checking rule corresponding to each field from a preset data checking rule table based on the data type.
Optionally, identifying a data type corresponding to the data value of each field includes:
adopting a sliding window algorithm to create a sliding window;
acquiring a plurality of currently existing data types;
binding a plurality of data types with a sliding window to generate a sliding window for judging the data types;
inputting the data value of each field into a sliding window for data type judgment one by one;
and outputting the data type corresponding to the data value of each field.
Optionally, the determining, from the data checking rule table, a data checking rule corresponding to each field in the preprocessed data set to be checked includes:
acquiring a data value of each field in the preprocessed data set to be checked;
determining the quality level corresponding to the data value of each field;
and acquiring the data checking rule corresponding to each field from a preset data checking rule table according to the quality level.
Optionally, determining the quality level corresponding to the data value of each field includes:
initializing a pre-trained data quality level determination model;
inputting the data value of each field into an initialized pre-trained data quality grade determination model;
and outputting the quality level corresponding to the data value of each field.
Optionally, the pre-trained data quality level determination model is generated according to the following steps:
acquiring a plurality of field data;
receiving a data quality level labeled for each field data in a plurality of field data, and generating labeled field data;
inputting the marked field data into a convolutional neural network, and outputting a text feature vector with fixed dimensionality;
calculating a first loss value according to the text feature vector with fixed dimensionality;
establishing a data quality level determination model by using a YOLOV3 neural network;
inputting the marked field data into a data quality level determination model for training, and outputting a second loss value;
summing the first loss value and the second loss value and then averaging to generate a target loss value;
when the target loss value reaches the minimum, a pre-trained data quality level determination model is generated.
In a second aspect, an embodiment of the present application provides a Spark-based data quality checking apparatus, where the apparatus includes:
the data extraction component creation module is used for acquiring data partition parameters or screening parameters and creating a data extraction component according to the Spark SQL component and the partition parameters or the screening parameters;
the data set preprocessing module is used for acquiring and preprocessing a data set to be checked from the data center according to the data extraction component;
the data checking rule determining module is used for loading a data checking rule table and determining a data checking rule corresponding to each field in the preprocessed data set to be checked from the data checking rule table;
the verification result generation module is used for performing quality verification on the corresponding fields according to the data verification rule corresponding to each field to generate the verification result of each field;
and the quality check report generating module is used for inputting the check result of each field into a preset report template and generating a data quality check report of the data set to be checked.
In a third aspect, embodiments of the present application provide a computer storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor and to perform the above-mentioned method steps.
In a fourth aspect, an embodiment of the present application provides a terminal, which may include: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the above-mentioned method steps.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
In the embodiment of the application, a Spark-based data quality checking device first acquires data partition parameters or screening parameters and creates a data extraction component from the Spark SQL component and those parameters. It then acquires a data set to be checked from a data center through the data extraction component and preprocesses it, loads a data checking rule table, and determines from that table the data checking rule corresponding to each field in the preprocessed data set to be checked. Next, it performs a quality check on each field according to its data checking rule to generate a checking result for the field, and finally enters the checking result of each field into a preset report template to generate a data quality checking report for the data set to be checked. Because the big-data Spark SQL component is used to extract the data and the quality check is driven by the data checking rule table, quality checking of the required data can be automated, which improves data checking efficiency and effectively guarantees the accuracy and reliability of the data.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a schematic flowchart of a Spark-based data quality checking method according to an embodiment of the present application;
fig. 2 is a schematic block diagram of a process of a Spark-based data quality checking process according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a training method for a data quality level determination model according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a Spark-based data quality checking apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
The following description and the drawings sufficiently illustrate specific embodiments of the invention to enable those skilled in the art to practice them.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific situation. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified. "And/or" describes the association relationship of the associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The application provides a Spark-based data quality checking method and device, a storage medium, and a terminal to solve the problems in the related art. In the technical scheme provided by the application, because the big-data Spark SQL component is adopted to extract the data and the quality check is performed based on the data checking rule table, automatic quality checking of the required data can be realized, which improves data checking efficiency and effectively guarantees the accuracy and reliability of the data. The following exemplary embodiments describe this in detail.
The following describes in detail a Spark-based data quality checking method provided in an embodiment of the present application with reference to fig. 1 to 3. The method may be implemented by a computer program running on a Spark-based data quality checking device based on the von Neumann architecture. The computer program may be integrated into an application or may run as a separate tool-like application. The Spark-based data quality checking device in this embodiment of the application may be a user terminal, including but not limited to: personal computers, tablet computers, handheld devices, in-vehicle devices, wearable devices, computing devices or other processing devices connected to a wireless modem, and the like. The user terminal may be called different names in different networks, for example: user equipment, access terminal, subscriber unit, subscriber station, mobile station, remote terminal, mobile device, user terminal, wireless communication device, user agent or user equipment, cellular telephone, cordless telephone, Personal Digital Assistant (PDA), terminal equipment in a 5G network or a future evolution network, and the like.
Referring to fig. 1, a schematic flow chart of a Spark-based data quality checking method is provided in an embodiment of the present application. As shown in fig. 1, the method of the embodiment of the present application may include the following steps:
S101, acquiring data partition parameters or screening parameters, and creating a data extraction component according to the Spark SQL component and the partition parameters or the screening parameters;
the data partition parameters or the screening parameters are parameter values corresponding to the currently established data partition conditions or data screening conditions. The Spark SQL component is a component for analyzing and processing data under a Spark ecosystem of a big data programming technology, and can analyze a large amount of data and complex data, so that a user can easily use SQL commands to perform data query.
In the embodiment of the application, when data quality inspection is performed, the current data quality inspection task is determined, the data partition condition or data screening condition of the inspection task is formulated, and the data partition parameter or screening parameter is obtained from that condition. The Spark SQL component is then initialized, the SQL command for the data query is determined from the Spark SQL component, and the data extraction component is generated after the data partition parameter or screening parameter is mapped to and associated with the SQL command for the data query.
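As an illustration only (not the patented implementation), the following PySpark sketch shows how such a data extraction component might be assembled; the table name, parameter shapes, and helper name are assumptions.

```python
from pyspark.sql import SparkSession

def create_extraction_query(table, partition_param=None, filter_param=None):
    """Map the partition parameter or screening parameter onto an SQL command for the data query."""
    query = f"SELECT * FROM {table}"
    if partition_param:                      # e.g. {"dt": "2021-08-12"} selects one partition
        conditions = " AND ".join(f"{col} = '{val}'" for col, val in partition_param.items())
        query += f" WHERE {conditions}"
    elif filter_param:                       # e.g. "speed >= 0 AND speed <= 220"
        query += f" WHERE {filter_param}"
    return query

spark = SparkSession.builder.appName("spark-data-quality-check").getOrCreate()
extraction_sql = create_extraction_query("dw.vehicle_track", partition_param={"dt": "2021-08-12"})
dataset_to_check = spark.sql(extraction_sql)   # output of the "data extraction component"
```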
S102, acquiring and preprocessing a data set to be checked from a data center according to a data extraction component;
the data center can be a traditional database, a data center, and a data warehouse. The preprocessing comprises data cleaning, data integration, data transformation and data specification.
In a possible implementation manner, the data center is connected first, the address of the data center is then mapped into the data extraction component to obtain a mapped data extraction component, the mapped data extraction component is executed to obtain the data set to be checked, and finally the data set to be checked is subjected to data cleaning, data integration, data transformation, and data reduction in sequence to obtain the preprocessed data set to be checked.
Specifically, data cleansing, as the name implies, turns "dirty" data into "clean" data. Dirty data is dirty in both form and content: dirty form includes, for example, missing values and special symbols; dirty content includes, for example, outliers. Data integration merges multiple data sources into one data store; if the data to be analyzed already resides in a single store, no integration is needed. Data transformation converts the data into a form suitable for the software or analysis theory being used. Data reduction, based on an understanding of the mining task and the data content, finds the characteristics of the data that are useful for the target, so as to shrink the data size as far as possible while preserving the original appearance of the data as much as possible. Data reduction lessens the influence of invalid and erroneous data on modeling, shortens processing time, and reduces the space needed to store the data.
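A minimal sketch of this preprocessing chain is given below, assuming the data set has already been loaded as a Spark DataFrame; all column names are illustrative and not taken from the patent.

```python
from pyspark.sql import functions as F

def preprocess(df):
    # Data integration (merging several sources into one store) is assumed to have happened upstream.
    # Data cleaning: drop rows that are entirely null and strip special symbols
    # from a string field (illustrative column "plate_no").
    cleaned = df.dropna(how="all").withColumn(
        "plate_no", F.regexp_replace(F.col("plate_no"), r"[^0-9A-Za-z]", "")
    )
    # Data transformation: normalize a timestamp column that may arrive in
    # milliseconds down to seconds so later time checks use a single unit.
    transformed = cleaned.withColumn(
        "event_ts",
        F.when(F.col("event_ts") > 1_000_000_000_000, (F.col("event_ts") / 1000).cast("long"))
         .otherwise(F.col("event_ts").cast("long")),
    )
    # Data reduction: keep only the fields the current check task needs.
    return transformed.select("plate_no", "event_ts", "speed", "engine_model")
```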
S103, loading a data checking rule table, and determining a data checking rule corresponding to each field in the preprocessed data set to be checked from the data checking rule table;
the data checking rule table is a data checking configuration file, the data checking configuration file is created and generated according to the characteristics of the data values and types of the existing fields and related parameters, the corresponding checking rules can be found from the configuration file according to the characteristics of the data, and the checking rules can be divided into two categories, namely general checking and special checking.
In a possible implementation manner, when determining the data check rule corresponding to each field, first obtaining a data value of each field in the preprocessed data set to be checked, then identifying a data type corresponding to the data value of each field, and finally obtaining the data check rule corresponding to each field from a preset data check rule table based on the data type.
The data type may include byte (byte type), short (short integer type), int (integer type), long (long integer type), float (floating point type), double (double precision floating point type), char (character type), and boolean (boolean type), for example.
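Purely as an illustration, the data checking rule table can be pictured as a configuration mapping keyed by the identified data type; the keys and rule names below are assumptions based on the checks described later, not the patent's actual file format.

```python
# Hypothetical fragment of a data checking rule table keyed by data type.
CHECK_RULE_TABLE = {
    "char":    ["CommonChecks", "RegexChecks", "EnumDistributions", "TimeChecks"],
    "long":    ["CommonChecks", "Statistics", "NumDistributions", "DigitChecks"],
    "double":  ["CommonChecks", "Statistics", "NumDistributions"],
    "boolean": ["CommonChecks", "UniqueValues"],
}

def rules_for_field(data_type):
    """Look up the data checking rules configured for a field of the given data type."""
    return CHECK_RULE_TABLE.get(data_type, ["CommonChecks"])
```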
Specifically, when identifying the data type corresponding to the data value of each field, a sliding window is created by a sliding window algorithm, the plurality of currently existing data types is obtained, and the data types are bound to the sliding window to generate a sliding window for data type judgment; the data value of each field is then input into this sliding window one by one, and the data type corresponding to the data value of each field is output.
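The patent does not give the internals of the sliding-window judgement, so the following is only a rough sketch under assumed parsing rules: the candidate types are bound to a fixed-size window, and each field value is judged as the first type that every value currently in the window satisfies.

```python
from collections import deque

# Candidate data types bound to the window, ordered from narrowest to widest.
CANDIDATE_TYPES = ["boolean", "long", "double", "char"]

def _is_float(value):
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def _parses_as(value, type_name):
    probes = {
        "boolean": lambda v: str(v).lower() in ("true", "false"),
        "long":    lambda v: str(v).lstrip("+-").isdigit(),
        "double":  _is_float,
        "char":    lambda v: True,          # every value can at least be treated as a string
    }
    return probes[type_name](value)

def judge_types(values, window_size=3):
    """Slide a fixed-size window over the field values and output a judged type for each value."""
    window = deque(maxlen=window_size)
    judged = []
    for value in values:
        window.append(value)
        # Take the first candidate type that all values in the window satisfy,
        # which smooths out isolated outliers.
        judged.append(next((t for t in CANDIDATE_TYPES
                            if all(_parses_as(v, t) for v in window)), "char"))
    return judged
```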
In another possible implementation manner, when determining the data check rule corresponding to each field, first obtaining a data value of each field in the preprocessed data set to be checked, then determining a quality level corresponding to the data value of each field, and finally obtaining the data check rule corresponding to each field from a preset data check rule table according to the quality level.
Specifically, when the quality level corresponding to the data value of each field is determined, a pre-trained data quality level determination model is initialized, then the data value of each field is input into the initialized pre-trained data quality level determination model, and finally the quality level corresponding to the data value of each field is output. Wherein the higher the level, the higher the complexity of the corresponding data-checking rule.
Further, when the pre-trained data quality level determination model is generated, a plurality of field data are first acquired, and the data quality level labeled for each field data is received to generate labeled field data. The labeled field data are input into a convolutional neural network, which outputs a text feature vector of fixed dimension, and a first loss value is calculated from this fixed-dimension text feature vector. A data quality level determination model is then created using a YOLOV3 neural network, the labeled field data are input into this model for training, and a second loss value is output. The first loss value and the second loss value are summed and averaged to produce a target loss value, and when the target loss value reaches its minimum, the pre-trained data quality level determination model is generated.
Further, when the target loss value does not reach the minimum value, the step of inputting the labeled field data into the convolutional neural network and outputting the text feature vector with fixed dimension is continuously executed, and the training is stopped until the target loss value reaches the minimum value.
In the embodiment of the application, a user can edit the check configuration file according to the check task, and identical check items can be reused. An automatic start script is provided, so the check can be executed directly once the user has written the configuration file, which improves data checking efficiency.
S104, performing quality check on the corresponding fields according to the data check rule corresponding to each field to generate a check result of each field;
In the embodiment of the application, the quality check applied to a field depends on the data checking rule matched to that field:
  • Common check (common): outputs null-value and empty-string statistics for string-type data.
  • Statistical check (Statistics): outputs maximum, minimum, and mean statistics for numeric-type data.
  • Numerical distribution check (NumDistributions): bins the numeric values and outputs the count for each bin range.
  • Enumeration check (EnumDistributions): performs an enumeration-type check on the data and outputs the count matching each enumeration item.
  • Unique-value check (UniqueValues): checks the unique values of the data and outputs statistics of the unique values.
  • Time check (TimeChecks): checks whether time-type data conforms to a standard string format or a timestamp format, and outputs counts of values whose timestamps match at second precision, match at millisecond precision, or do not match.
  • Numerical type check (DigitChecks): checks which type (long, double, digit) a numeric field matches and outputs the count for each type.
  • Regular matching check (regxrecords): checks whether the data values match a supplied regular expression (such as ^[0-9a-z]+$) and outputs the matching statistics.
  • Combined check (combinatorial checks): checks whether several strongly related field combinations satisfy given conditions, for example a combined check of frame number, vehicle brand, and delivery date (whether the first three characters and the tenth character of the frame number vin correspond to the vehicle brand and the delivery date), a combined check of engine model and emission standard (whether the engine model satisfies the corresponding emission standard), or a combined check of vehicle horsepower and power (power ≈ horsepower × 0.75), and outputs statistics of the records satisfying the conditions.
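As a hedged sketch only, two of the checks above could be expressed as Spark aggregations roughly as follows; the function names and report keys are illustrative and not the patent's implementation.

```python
from pyspark.sql import functions as F

def common_check(df, field):
    """Common check sketch: null-value and empty-string statistics for a string-type field."""
    row = df.agg(
        F.count(F.when(F.col(field).isNull(), 1)).alias("null_count"),
        F.count(F.when(F.trim(F.col(field)) == "", 1)).alias("empty_count"),
        F.count(F.lit(1)).alias("total_count"),
    ).first()
    return row.asDict()

def statistics_check(df, field):
    """Statistics sketch: maximum, minimum, and mean statistics for a numeric field."""
    row = df.agg(
        F.max(field).alias("max_value"),
        F.min(field).alias("min_value"),
        F.mean(field).alias("mean_value"),
    ).first()
    return row.asDict()
```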
In the embodiment of the application, two categories of data inspection schemes, covering nine checks in total, are designed according to the characteristics of the service data, and a data quality inspection report covering the whole data set is provided, so that data problems can be analyzed and decisions made on the basis of the data. A user can run special checks on the data according to the data source and data type. The tool's checking relies on the distributed computing engine Spark and provides combined checking over multiple data items, so massive data can be checked comprehensively and quickly.
And S105, inputting the checking result of each field into a preset report template, and generating a data quality checking report of the data set to be checked.
In the embodiment of the application, a user can customize and combine the inspection rules for data with different degrees of relevance, and the system outputs a data quality report that satisfies the corresponding rules.
In a possible implementation manner, after the checking results are obtained, the checking result of each field is entered into the preset report template to generate the data quality checking report of the data set to be checked, and the report is finally sent to the client of the relevant department, where the client of the relevant department at least comprises a DingTalk robot, a TXT file, and a mailbox.
For example, as shown in fig. 2, fig. 2 is a schematic process diagram of the Spark-based data quality inspection process provided in the present application. Multiple data sources are first accessed to form a data center; data is obtained from the data center through the Spark SQL component and loaded into memory; the configuration file is then read and parsed; the checking rules for the different fields are determined in a loop so that the checking scheme for each field is fixed; for each data checking scheme, the grouped aggregation statistics over the configured field values are evaluated one by one against the conditions and the statistical results are output; finally, the data report results of all checking schemes are merged and sent to a DingTalk notification robot or stored locally in file form.
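A simplified sketch of this last step is given below: per-field results are merged into a plain-text report, stored as a TXT file, and optionally pushed to a DingTalk robot webhook. The webhook URL and the exact message payload are assumptions for illustration.

```python
import json
import urllib.request

def build_report(table_name, field_results):
    """Merge per-field checking results into a plain-text data quality report."""
    lines = [f"Data quality check report for {table_name}"]
    for field, result in field_results.items():
        lines.append(f"  {field}: {result}")
    return "\n".join(lines)

def deliver_report(report_text, txt_path="quality_report.txt", webhook_url=None):
    # Store the report locally as a TXT file.
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(report_text)
    # Optionally push it to a DingTalk robot webhook (hypothetical URL supplied by the caller).
    if webhook_url:
        payload = json.dumps({"msgtype": "text", "text": {"content": report_text}}).encode("utf-8")
        request = urllib.request.Request(webhook_url, data=payload,
                                         headers={"Content-Type": "application/json"})
        urllib.request.urlopen(request)
```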
In the embodiment of the application, a Spark-based data quality checking device first acquires data partition parameters or screening parameters and creates a data extraction component from the Spark SQL component and those parameters. It then acquires a data set to be checked from a data center through the data extraction component and preprocesses it, loads a data checking rule table, and determines from that table the data checking rule corresponding to each field in the preprocessed data set to be checked. Next, it performs a quality check on each field according to its data checking rule to generate a checking result for the field, and finally enters the checking result of each field into a preset report template to generate a data quality checking report for the data set to be checked. Because the big-data Spark SQL component is used to extract the data and the quality check is driven by the data checking rule table, quality checking of the required data can be automated, which improves data checking efficiency and effectively guarantees the accuracy and reliability of the data.
Referring to fig. 3, a flowchart of a training method of a data quality level determination model is provided according to an embodiment of the present application. As shown in fig. 3, the training method of the data quality level determination model includes the following steps:
S201, acquiring a plurality of field data;
S202, receiving the data quality level labeled for each field data in the plurality of field data, and generating labeled field data;
S203, inputting the labeled field data into a convolutional neural network, and outputting a text feature vector with fixed dimensionality;
S204, calculating a first loss value according to the text feature vector with fixed dimensionality;
S205, establishing a data quality level determination model by using a YOLOV3 neural network;
S206, inputting the labeled field data into the data quality level determination model for training, and outputting a second loss value;
S207, summing the first loss value and the second loss value, and then averaging to generate a target loss value;
S208, when the target loss value reaches the minimum value, generating a pre-trained data quality level determination model.
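The following is a heavily hedged sketch of how the target loss in steps S207-S208 might drive training, written with PyTorch as a placeholder framework. The feature network and quality level model are treated as given modules (the patent names a convolutional neural network and a YOLOV3 network), the loss functions are assumptions, and it is assumed that the fixed feature dimension equals the number of quality levels so the first loss can be computed directly from the feature vector.

```python
import torch
import torch.nn as nn

def train_quality_level_model(feature_net, level_model, loader, epochs=50, patience=5):
    params = list(feature_net.parameters()) + list(level_model.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    best_loss, stale_epochs = float("inf"), 0
    for _ in range(epochs):
        for field_batch, level_labels in loader:
            features = feature_net(field_batch)          # fixed-dimension text feature vector (S203)
            loss_1 = criterion(features, level_labels)   # first loss value (S204)
            logits = level_model(field_batch)            # quality level determination model (S205/S206)
            loss_2 = criterion(logits, level_labels)     # second loss value (S206)
            target_loss = (loss_1 + loss_2) / 2          # sum then average (S207)
            optimizer.zero_grad()
            target_loss.backward()
            optimizer.step()
        # Treat the target loss as having reached its minimum once it stops improving (S208).
        if target_loss.item() < best_loss:
            best_loss, stale_epochs = target_loss.item(), 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break
    return level_model
```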
In the embodiment of the application, a Spark-based data quality checking device first acquires data partition parameters or screening parameters and creates a data extraction component from the Spark SQL component and those parameters. It then acquires a data set to be checked from a data center through the data extraction component and preprocesses it, loads a data checking rule table, and determines from that table the data checking rule corresponding to each field in the preprocessed data set to be checked. Next, it performs a quality check on each field according to its data checking rule to generate a checking result for the field, and finally enters the checking result of each field into a preset report template to generate a data quality checking report for the data set to be checked. Because the big-data Spark SQL component is used to extract the data and the quality check is driven by the data checking rule table, quality checking of the required data can be automated, which improves data checking efficiency and effectively guarantees the accuracy and reliability of the data.
The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. For details which are not disclosed in the embodiments of the apparatus of the present invention, reference is made to the embodiments of the method of the present invention.
Referring to fig. 4, a schematic structural diagram of a Spark-based data quality checking apparatus according to an exemplary embodiment of the present invention is shown. The Spark-based data quality checking device can be implemented by software, hardware or a combination of the two to form all or part of the terminal. The device 1 comprises a data extraction component creation module 10, a data set preprocessing module 20, a data checking rule determination module 30, a checking result generation module 40 and a quality checking report generation module 50.
The data extraction component creation module 10 is used for acquiring data partition parameters or screening parameters and creating a data extraction component according to the Spark SQL component and the partition parameters or screening parameters;
the data set preprocessing module 20 is configured to acquire and preprocess a data set to be checked from the data center according to the data extraction component;
the data checking rule determining module 30 is configured to load a data checking rule table, and determine, from the data checking rule table, a data checking rule corresponding to each field in the preprocessed data set to be checked;
the verification result generation module 40 is used for performing quality verification on the corresponding fields according to the data verification rule corresponding to each field to generate the verification result of each field;
and a quality check report generating module 50, configured to input the check result of each field into a preset report template, and generate a data quality check report of the data set to be checked.
It should be noted that, when the Spark-based data quality checking apparatus provided in the foregoing embodiment executes the Spark-based data quality checking method, only the division of the functional modules is used for illustration, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the Spark-based data quality checking device provided in the above embodiment and the Spark-based data quality checking method embodiment belong to the same concept, and details of the implementation process are referred to in the method embodiment, and are not described herein again.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the embodiment of the application, a Spark-based data quality checking device first acquires data partition parameters or screening parameters and creates a data extraction component from the Spark SQL component and those parameters. It then acquires a data set to be checked from a data center through the data extraction component and preprocesses it, loads a data checking rule table, and determines from that table the data checking rule corresponding to each field in the preprocessed data set to be checked. Next, it performs a quality check on each field according to its data checking rule to generate a checking result for the field, and finally enters the checking result of each field into a preset report template to generate a data quality checking report for the data set to be checked. Because the big-data Spark SQL component is used to extract the data and the quality check is driven by the data checking rule table, quality checking of the required data can be automated, which improves data checking efficiency and effectively guarantees the accuracy and reliability of the data.
The present invention also provides a computer readable medium, on which program instructions are stored, and when the program instructions are executed by a processor, the method for checking the data quality based on Spark provided by the above method embodiments is implemented. The present invention also provides a computer program product containing instructions, which when run on a computer, causes the computer to execute the Spark-based data quality checking method of the above-mentioned method embodiments.
Please refer to fig. 5, which provides a schematic structural diagram of a terminal according to an embodiment of the present application. As shown in fig. 5, terminal 1000 can include: at least one processor 1001, at least one network interface 1004, a user interface 1003, memory 1005, at least one communication bus 1002.
Wherein a communication bus 1002 is used to enable connective communication between these components.
The user interface 1003 may include a Display screen (Display) and a Camera (Camera), and the optional user interface 1003 may also include a standard wired interface and a wireless interface.
The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others.
Processor 1001 may include one or more processing cores. The processor 1001 connects various components throughout the electronic device 1000 using various interfaces and lines, and performs various functions of the electronic device 1000 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 1005 and by invoking data stored in the memory 1005. Alternatively, the processor 1001 may be implemented in at least one of the hardware forms of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 1001 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU renders and draws the content to be displayed on the display screen; and the modem handles wireless communications. It is understood that the modem may not be integrated into the processor 1001 but may instead be implemented by a separate chip.
The Memory 1005 may include a Random Access Memory (RAM) or a Read-Only Memory (Read-Only Memory). Optionally, the memory 1005 includes a non-transitory computer-readable medium. The memory 1005 may be used to store an instruction, a program, code, a set of codes, or a set of instructions. The memory 1005 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like; the storage data area may store data and the like referred to in the above respective method embodiments. The memory 1005 may optionally be at least one memory device located remotely from the processor 1001. As shown in fig. 5, the memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a Spark-based data quality check application.
In the terminal 1000 shown in fig. 5, the user interface 1003 is mainly used as an interface for providing input for a user, and acquiring data input by the user; and the processor 1001 may be configured to call the Spark-based data quality check application stored in the memory 1005, and specifically perform the following operations:
acquiring data partition parameters or screening parameters, and creating a data extraction component according to the Spark SQL component and the partition parameters or the screening parameters;
acquiring and preprocessing a data set to be checked from a data center according to a data extraction component;
loading a data checking rule table, and determining a data checking rule corresponding to each field in the preprocessed data set to be checked from the data checking rule table;
performing quality check on the corresponding fields according to the data check rule corresponding to each field to generate a check result of each field;
and inputting the checking result of each field into a preset report template, and generating a data quality checking report of the data set to be checked.
In one embodiment, after the processor 1001 enters the checking result of each field into the preset report template and generates the data quality checking report of the data set to be checked, the following operations are further performed:
sending the data quality check report to a client of a relevant department; wherein
the client of the relevant department at least comprises a DingTalk robot, a TXT file, and a mailbox.
In an embodiment, when the processor 1001 determines, from the data checking rule table, the data checking rule corresponding to each field in the preprocessed data set to be checked, the following operations are specifically performed:
acquiring a data value of each field in the preprocessed data set to be checked;
identifying a data type corresponding to the data value of each field;
and acquiring the data checking rule corresponding to each field from a preset data checking rule table based on the data type.
In one embodiment, when the processor 1001 identifies the data type corresponding to the data value of each field, the following operations are specifically performed:
adopting a sliding window algorithm to create a sliding window;
acquiring a plurality of currently existing data types;
binding a plurality of data types with a sliding window to generate a sliding window for judging the data types;
inputting the data value of each field into a sliding window for data type judgment one by one;
and outputting the data type corresponding to the data value of each field.
In an embodiment, when the processor 1001 determines, from the data checking rule table, a data checking rule corresponding to each field in the preprocessed data set to be checked, specifically performs the following operations:
acquiring a data value of each field in the preprocessed data set to be checked;
determining the quality level corresponding to the data value of each field;
and acquiring the data checking rule corresponding to each field from a preset data checking rule table according to the quality level.
In an embodiment, when determining the quality level corresponding to the data value of each field, the processor 1001 specifically performs the following operations:
initializing a pre-trained data quality level determination model;
inputting the data value of each field into an initialized pre-trained data quality grade determination model;
and outputting the quality level corresponding to the data value of each field.
In one embodiment, the processor 1001, when generating the pre-trained data quality level determination model, specifically performs the following operations:
acquiring a plurality of field data;
receiving a data quality level labeled for each field data in a plurality of field data, and generating labeled field data;
inputting the marked field data into a convolutional neural network, and outputting a text feature vector with fixed dimensionality;
calculating a first loss value according to the text feature vector with fixed dimensionality;
establishing a data quality level determination model by using a YOLOV3 neural network;
inputting the marked field data into a data quality level determination model for training, and outputting a second loss value;
summing the first loss value and the second loss value and then averaging to generate a target loss value;
when the target loss value reaches the minimum, a pre-trained data quality level determination model is generated.
In the embodiment of the application, a Spark-based data quality checking device first acquires data partition parameters or screening parameters and creates a data extraction component from the Spark SQL component and those parameters. It then acquires a data set to be checked from a data center through the data extraction component and preprocesses it, loads a data checking rule table, and determines from that table the data checking rule corresponding to each field in the preprocessed data set to be checked. Next, it performs a quality check on each field according to its data checking rule to generate a checking result for the field, and finally enters the checking result of each field into a preset report template to generate a data quality checking report for the data set to be checked. Because the big-data Spark SQL component is used to extract the data and the quality check is driven by the data checking rule table, quality checking of the required data can be automated, which improves data checking efficiency and effectively guarantees the accuracy and reliability of the data.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing associated hardware. The Spark-based data quality check program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory, or a random access memory.
The above disclosure is only a preferred embodiment of the present application and is not intended to limit the scope of the present application; all equivalent variations and modifications made in accordance with the present application remain within its scope.

Claims (10)

1. A Spark-based data quality checking method is characterized by comprising the following steps:
acquiring data partition parameters or screening parameters, and creating a data extraction component according to the Spark SQL component and the partition parameters or the screening parameters;
acquiring and preprocessing a data set to be checked from a data center according to the data extraction component;
loading a data checking rule table, and determining a data checking rule corresponding to each field in the preprocessed data set to be checked from the data checking rule table;
performing quality check on the corresponding fields according to the data check rule corresponding to each field to generate a check result of each field;
and inputting the checking result of each field into a preset report template, and generating a data quality checking report of the data set to be checked.
2. The method according to claim 1, wherein the checking result of each field is input into a preset report template, and after a data quality checking report of the data set to be checked is generated, the method further comprises:
sending the data quality check report to a client of a relevant department; wherein
the client of the relevant department at least comprises a DingTalk robot, a TXT file, and a mailbox.
3. The method according to claim 1, wherein the determining, from the data checking rule table, the data checking rule corresponding to each field in the preprocessed data set to be checked includes:
acquiring a data value of each field in the preprocessed data set to be checked;
identifying a data type corresponding to the data value of each field;
and acquiring the data checking rule corresponding to each field from a preset data checking rule table based on the data type.
4. The method of claim 3, wherein the identifying the data type corresponding to the data value of each field comprises:
adopting a sliding window algorithm to create a sliding window;
acquiring a plurality of currently existing data types;
binding the plurality of data types with the sliding window to generate a sliding window for data type judgment;
inputting the data value of each field into the sliding window for judging the data type one by one;
and outputting the data type corresponding to the data value of each field.
5. The method according to claim 1, wherein the determining, from the data checking rule table, the data checking rule corresponding to each field in the preprocessed data set to be checked includes:
acquiring a data value of each field in the preprocessed data set to be checked;
determining a quality level corresponding to the data value of each field;
and acquiring the data checking rule corresponding to each field from a preset data checking rule table according to the quality level.
6. The method of claim 5, wherein the determining the quality level corresponding to the data value of each field comprises:
initializing a pre-trained data quality level determination model;
inputting the data value of each field into the initialized pre-trained data quality level determination model;
and outputting the quality level corresponding to the data value of each field.
7. The method of claim 6, wherein generating a pre-trained data quality level determination model comprises:
acquiring a plurality of field data;
receiving the data quality level labeled for each field data in the plurality of field data, and generating labeled field data;
inputting the marked field data into a convolutional neural network, and outputting a text feature vector with fixed dimensionality;
calculating a first loss value according to the text feature vector of the fixed dimension;
establishing a data quality level determination model by using a YOLOV3 neural network;
inputting the marked field data into the data quality level determination model for training, and outputting a second loss value;
summing the first loss value and the second loss value and then averaging to generate a target loss value;
and when the target loss value reaches the minimum value, generating a pre-trained data quality level determination model.
8. A Spark-based data quality verification apparatus, comprising:
the data extraction component creation module is used for acquiring data partition parameters or screening parameters and creating a data extraction component according to the Spark SQL component and the partition parameters or the screening parameters;
the data set preprocessing module is used for acquiring and preprocessing a data set to be checked from a data center according to the data extraction component;
the data checking rule determining module is used for loading a data checking rule table and determining a data checking rule corresponding to each field in the preprocessed data set to be checked from the data checking rule table;
the verification result generation module is used for performing quality verification on the corresponding fields according to the data verification rule corresponding to each field to generate the verification result of each field;
and the quality check report generating module is used for inputting the check result of each field into a preset report template and generating a data quality check report of the data set to be checked.
9. A computer storage medium, characterized in that it stores a plurality of instructions adapted to be loaded by a processor and to carry out the method steps according to any one of claims 1 to 7.
10. A terminal, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1 to 7.
CN202110926788.9A 2021-08-12 2021-08-12 Spark-based data quality checking method and device, storage medium and terminal Pending CN113792033A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110926788.9A CN113792033A (en) 2021-08-12 2021-08-12 Spark-based data quality checking method and device, storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110926788.9A CN113792033A (en) 2021-08-12 2021-08-12 Spark-based data quality checking method and device, storage medium and terminal

Publications (1)

Publication Number Publication Date
CN113792033A true CN113792033A (en) 2021-12-14

Family

ID=78875994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110926788.9A Pending CN113792033A (en) 2021-08-12 2021-08-12 Spark-based data quality checking method and device, storage medium and terminal

Country Status (1)

Country Link
CN (1) CN113792033A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114328700A (en) * 2022-03-16 2022-04-12 上海柯林布瑞信息技术有限公司 Data checking method and device in medical data ETL task

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595563A (en) * 2018-04-13 2018-09-28 林秀丽 A kind of data quality management method and device
CN108875821A (en) * 2018-06-08 2018-11-23 Oppo广东移动通信有限公司 The training method and device of disaggregated model, mobile terminal, readable storage medium storing program for executing
CN111221956A (en) * 2019-12-26 2020-06-02 国网宁夏电力有限公司中卫供电公司 PMS distribution network equipment data quality checking method for power management system
CN111858646A (en) * 2020-07-21 2020-10-30 国网浙江省电力有限公司营销服务中心 Method and system for checking quality data format of electric energy meter
CN112650762A (en) * 2021-03-15 2021-04-13 腾讯科技(深圳)有限公司 Data quality monitoring method and device, electronic equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination