CN118036983A - Scheduling management method and system based on data quality management - Google Patents
- Publication number: CN118036983A (application CN202410230291.7A)
- Authority: CN (China)
- Prior art keywords: scheduling, data quality, management, task, data
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06Q10/06311: Scheduling, planning or task assignment for a person or group
- G06Q10/06315: Needs-based resource requirements planning or analysis
- G06Q10/06395: Quality analysis or management
- G06Q10/103: Workflow collaboration or project management
Abstract
The invention relates to the technical field of data management, in particular to a scheduling management method and system based on data quality management. The scheduling management system based on data quality management comprises a scheduling service module, a task management module, a scheduling instance management module, a task instance management module and a data quality management module. The scheduling service module is responsible for executing task scheduling according to set rules and frequencies, and the task management module handles the creation, dependency relationships and attribute settings of tasks. The scheduling instance management module manages each scheduling content generated according to the scheduling frequency, and the task instance management module focuses on the specific task instances in each scheduling content. The data quality management module is used for maintaining data quality. The invention can improve scheduling efficiency and the accuracy of data processing.
Description
Technical Field
The invention relates to the technical field of data management, in particular to a scheduling management method and system based on data quality management.
Background
Scheduling platforms and data governance platforms in the industry are currently designed with independent architectures, each deployed and executed separately, and they face a series of challenges when handling large data volumes and complex business processes.
At present, although the various open source scheduling tools on the market provide a certain flexibility, they struggle to meet the growing business demands of enterprises: they often have to be integrated with various toolkits, the coupling between the toolkits is high, and a unified management system is lacking, so they cannot support the execution of future scheduling task volumes. In addition, more and more decisions depend on data, and the quality of the data directly determines the effect of business operations and the correctness of decisions. Existing data quality management, however, suffers from core data that is inconsistent, non-unique, non-standard and incomplete, and from inaccurate statistical analysis of data.
Disclosure of Invention
The invention aims to provide a scheduling management system based on data quality management, which improves scheduling efficiency and accuracy of data processing.
The basic scheme provided by the invention is as follows: a scheduling management system based on data quality management comprises a scheduling service module, a task management module, a scheduling instance management module, a task instance management module and a data quality management module; the scheduling service module is used for carrying out task scheduling according to the set scheduling frequency; the task management module is used for managing creation, dependency relationship and attribute setting of various tasks; the scheduling instance management module manages each scheduling content generated according to the scheduling frequency; the task instance management module manages specific task instances in each scheduling content; the data quality management module is used for managing data quality; the data quality management module comprises a data quality scheduling unit which realizes data scheduling according to a blocking execution strategy.
The working principle and beneficial effects of the invention are as follows: the scheduling service module performs task scheduling at the set frequency, and the task management module is responsible for the creation of various tasks and the handling of their dependencies. The scheduling instance management module and the task instance management module are respectively responsible for managing the scheduling contents generated according to the scheduling frequency and the specific task instances in each scheduling. The data quality management module maintains data quality. Combining task scheduling with data quality management improves task execution efficiency and ensures accuracy and reliability during data processing.
Further, the scheduling frequency includes a CRONTAB expression, a fixed interval, a specified date and time, and a manual trigger.
The beneficial effect of this scheme: with diversified scheduling modes, the system can adapt to a variety of business demands and scenarios. Whether tasks are executed periodically or triggered at a specific time, the user can better control when tasks run, which effectively optimizes resource allocation and workflow and improves overall system efficiency and responsiveness.
Further, the scheduling service module comprises a backtracking unit and a parameter configuration unit; the backtracking unit backtracks tasks according to a specified time period, and the parameter configuration unit is used for configuring static and dynamic variables.
The beneficial effect of this scheme: the backtracking unit can backtrack the task according to the designated time period, effectively process historical data or missed tasks, and ensure the integrity and continuity of the data.
Further, the task management module includes specific configuration content for different task types, including but not limited to a Sqoop task, a Hive task, and a quality check task.
The beneficial effect of this scheme: the task management module supports specific configuration content of different task types, so that the system can adapt to various data processing and analysis tasks and meet different types of service requirements.
Further, the scheduling instance management module includes a monitoring unit that uses a Gantt chart to show instance execution status.
The beneficial effect of this scheme: through the Gantt chart, the user can intuitively see the start time, end time, and duration of each task instance, as well as the dependency relationship between them. The graphical representation makes complex task scheduling and execution states clear at a glance, and facilitates understanding and analysis.
Further, the data quality management module comprises a data quality rule management unit for managing data quality verification tasks and rules.
The beneficial effect of this scheme: by defining and executing the data quality check rules, the system is able to ensure that the processed and analyzed data meets predetermined quality criteria, reducing data errors.
Further, the data quality management module also comprises a data quality alarm management unit and a data quality scheduling unit; the data quality alarm management unit is used for processing alarms generated by data quality verification, and comprises alarm notification configuration, statistics of alarm information and alarm processing; the data quality scheduling unit performs data scheduling according to the blocking execution strategy.
The blocking execution strategy comprises a first blocking execution strategy, a second blocking execution strategy and a third blocking execution strategy; the first blocking execution strategy is determined according to the association degree of the data; and the second blocking execution strategy is determined according to the similarity of the services.
The beneficial effect of this scheme: the data quality alarm management unit can immediately give an alarm when a data quality problem occurs, so that related personnel can quickly take measures, and error data is reduced.
Drawings
FIG. 1 is a schematic diagram of a data flow process of a scheduling management system based on data quality management;
FIG. 2 is a schematic diagram of the propagation of baseline alert notifications for critical tasks in a scheduling management system based on data quality management;
FIG. 3 is a Gantt chart of a scheduling management system based on data quality management;
FIG. 4 is a schematic diagram of actual task execution of a scheduling management system based on data quality management;
FIG. 5 is a data quality check standard diagram of a scheduling management system based on data quality management;
FIG. 6 is a flow chart of a data quality alarm management unit of a scheduling management system based on data quality management;
FIG. 7 is a flow chart of data quality event management for a scheduling management system based on data quality management.
Detailed Description
The following is a further detailed description of the embodiments:
Example 1
A scheduling management system based on data quality management can improve scheduling efficiency and accuracy of data processing. The scheduling service module is used for carrying out task scheduling according to the set scheduling rules and frequencies, the task management module is used for managing creation, dependency relationship and attribute setting of various tasks, the scheduling instance management module is used for managing each scheduling content generated according to the scheduling frequencies, the task instance management module is used for managing specific task instances in each scheduling content, and the data quality management module is used for managing data quality.
The scheduling frequency in this embodiment includes a CRONTAB expression, a fixed interval, a specified date and time, and a manual trigger. The CRONTAB expression works like the crontab facility on a Linux server: the user defines the time points at which a task executes, such as a specific time each day, week or month. Fixed-interval scheduling suits tasks that run at a fixed time interval, for example automatically starting the next run a fixed interval after the previous scheduled run completes. Scheduling at a specified date and time is similar to the `at` command on a Linux server and runs a task once at a specific time point. The manual trigger lets the user start a task immediately and is suitable for online testing or emergencies. With these diversified scheduling options, the system can handle periodic and predictable tasks, respond flexibly to sudden and special situations, and improve the efficiency and responsiveness of the whole system.
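The scheduling modes described above can be sketched as follows. This is only an illustration under stated assumptions: the class and function names (`FixedInterval`, `RunOnce`, `cron_matches`) are hypothetical and do not come from the patent, and the cron matcher is deliberately simplified to five fields containing only `*` or a single integer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class FixedInterval:
    """Fixed-interval mode: the next run starts a fixed time after the last one ends."""
    interval: timedelta

    def next_run(self, last_finished: datetime) -> datetime:
        return last_finished + self.interval


@dataclass
class RunOnce:
    """Specified date and time mode: fire once, like the Linux `at` command."""
    at: datetime

    def next_run(self, now: datetime) -> Optional[datetime]:
        # Nothing is left to schedule once the target time has passed.
        return self.at if now <= self.at else None


def cron_matches(expr: str, moment: datetime) -> bool:
    """Simplified cron check: five fields, each either '*' or one integer."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected: minute hour day-of-month month day-of-week")
    # In cron, Sunday is 0; isoweekday() returns 7 for Sunday, hence the modulo.
    actual = [moment.minute, moment.hour, moment.day,
              moment.month, moment.isoweekday() % 7]
    return all(f == "*" or int(f) == a for f, a in zip(fields, actual))
```

A manual trigger needs no schedule object at all: in this sketch it would simply create a run immediately, bypassing the three time-based modes.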
The scheduling service module also comprises an instance concurrency configurator for setting the concurrency of each schedule. In this embodiment, instance concurrency is adjusted dynamically based on task dependencies and resource limits, and the user can set the concurrency of each schedule, that is, the number of scheduling instances that may execute in parallel at the same time. A scheduling instance is each specific scheduling run generated according to the scheduling rule. For example, a schedule set using the CRONTAB expression "0 0" will generate a new scheduling instance every day. The purpose of instance concurrency configuration is to control the number of runs executing simultaneously, so that system resources are used effectively and performance problems caused by resource overload or unfinished task dependencies are avoided. With a properly configured instance concurrency, the next scheduling instance can be prevented from starting while the previous one has not yet finished.
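A minimal sketch of an instance concurrency limit is shown below. The class name `InstanceConcurrencyGate` and its methods are hypothetical, and the sketch covers only a fixed user-set limit, not the dynamic adjustment based on dependencies and resources that the embodiment mentions.

```python
from typing import Set


class InstanceConcurrencyGate:
    """Allows at most `max_concurrent` scheduling instances to run at once."""

    def __init__(self, max_concurrent: int) -> None:
        if max_concurrent < 1:
            raise ValueError("max_concurrent must be at least 1")
        self.max_concurrent = max_concurrent
        self.running: Set[str] = set()

    def try_start(self, instance_id: str) -> bool:
        # Refuse to start a new instance while the limit is reached.
        if len(self.running) >= self.max_concurrent:
            return False
        self.running.add(instance_id)
        return True

    def finish(self, instance_id: str) -> None:
        # Free the slot so that a waiting instance can start.
        self.running.discard(instance_id)
```

With a limit of 1, the instance for the next day cannot start until the previous day's instance has called `finish`, which is the situation the paragraph above describes.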
The scheduling service module also includes a policy manager for managing the priority policy at the scheduling level and for configuring the handling policy applied when a task fails. The priority policy can automatically adjust the execution priority of tasks based on the number of downstream dependencies of each task: the more downstream tasks a task has, the higher its priority. This "downstream priority" strategy ensures that critical tasks are completed in time, so that delays in critical tasks do not reduce the efficiency of the overall workflow.
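The "downstream priority" idea can be sketched as below. The sketch assumes that the downstream count includes indirect (transitive) downstream tasks; the text does not say whether only direct dependents are counted. The function names and the example task names are hypothetical.

```python
from typing import Dict, List, Set


def downstream_count(task: str, downstream: Dict[str, List[str]]) -> int:
    """Number of distinct tasks reachable downstream of `task`."""
    seen: Set[str] = set()
    stack = list(downstream.get(task, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(downstream.get(node, []))
    return len(seen)


def order_by_downstream_priority(ready: List[str],
                                 downstream: Dict[str, List[str]]) -> List[str]:
    """Sort ready tasks so that tasks with more downstream tasks run first."""
    # Ties are broken by task name to keep the order deterministic.
    return sorted(ready, key=lambda t: (-downstream_count(t, downstream), t))


graph = {"extract": ["clean", "audit"], "clean": ["report"],
         "audit": [], "report": [], "export": []}
queue = order_by_downstream_priority(["export", "clean", "extract"], graph)
```

Here `extract` has three downstream tasks, `clean` has one and `export` has none, so they are queued in that order.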
The task failure policy defines how a failed task is handled. For example, the system can be configured to ignore the error and continue executing the subsequent downstream tasks, or to terminate all related upstream and downstream tasks to prevent the error from propagating further. In some cases, it is also possible to suspend the entire schedule or to retry the failed task. In addition, the system can be configured to notify the responsible persons when a task fails, so that they can intervene and resolve the problem promptly.
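The failure-handling options above might be modelled as follows. This is a sketch under assumptions: the enum values, the returned action strings, the `max_retries` parameter, and the choice to pause the schedule once retries are exhausted are all illustrative and are not specified in the text.

```python
from enum import Enum
from typing import Dict, List, Set


class FailurePolicy(Enum):
    IGNORE = "ignore"                        # continue with downstream tasks
    TERMINATE_RELATED = "terminate_related"  # stop related upstream and downstream tasks
    PAUSE_SCHEDULE = "pause_schedule"        # suspend the whole schedule
    RETRY = "retry"                          # re-run the failed task


def related_tasks(task: str, upstream: Dict[str, List[str]],
                  downstream: Dict[str, List[str]]) -> Set[str]:
    """All transitive upstream and downstream tasks of `task`."""
    related: Set[str] = set()
    for graph in (upstream, downstream):
        stack = list(graph.get(task, []))
        while stack:
            node = stack.pop()
            if node not in related:
                related.add(node)
                stack.extend(graph.get(node, []))
    return related


def on_failure(task: str, policy: FailurePolicy,
               upstream: Dict[str, List[str]], downstream: Dict[str, List[str]],
               attempt: int = 1, max_retries: int = 3) -> dict:
    """Return the action to take and the set of tasks to terminate."""
    if policy is FailurePolicy.IGNORE:
        return {"action": "continue", "terminate": set()}
    if policy is FailurePolicy.TERMINATE_RELATED:
        return {"action": "terminate",
                "terminate": related_tasks(task, upstream, downstream)}
    if policy is FailurePolicy.RETRY and attempt < max_retries:
        return {"action": "retry", "terminate": set()}
    # PAUSE_SCHEDULE, or RETRY with no attempts left (assumed fallback).
    return {"action": "pause", "terminate": set()}
```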
In addition, the scheduling service module further comprises a backtracking unit and a parameter configuration unit, the backtracking unit backtracks tasks according to a specified time period, the parameter configuration unit is used for configuring static variables and dynamic variables, and the configuration of the dynamic variables dynamically calculates variable values based on the current scheduling state.
The backtracking unit enables the system to review and re-execute tasks at a certain time point in the past, repair past errors in time and deal with data changes. For example, if a problem is found with data processing for the past week, the user may simply set the week to be a backtracking period, and the system will automatically re-process all tasks within the period. For parallel backtracking operation, the user can also specify the concurrency, i.e. the number of tasks that can be backtracked simultaneously at the same time. Through backtracking processing, consistency and accuracy of data can be ensured.
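As an illustration of backtracking with a concurrency limit, the sketch below splits a backtracking period into batches of dates that could be re-run in parallel. It assumes one run per calendar day; the function name `backtrack_batches` is hypothetical.

```python
from datetime import date, timedelta
from typing import List


def backtrack_batches(start: date, end: date, concurrency: int) -> List[List[date]]:
    """Split the inclusive period [start, end] into batches of at most
    `concurrency` days; each batch can be re-run in parallel."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    return [days[i:i + concurrency] for i in range(0, len(days), concurrency)]


# Re-run one week of data, three days at a time.
batches = backtrack_batches(date(2024, 3, 1), date(2024, 3, 7), concurrency=3)
```

For this one-week period the result is three batches covering three, three and one day respectively.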
In terms of variable and parameter management, the parameter configuration unit supports both static variable configuration and dynamic variable configuration. Static variables suit settings that remain unchanged during scheduling; common examples include, but are not limited to, the T-1 date (i.e., yesterday's date), the beginning and end of the month, and the beginning and end of the week. The values of these variables are typically calculated from the scheduled time, so that tasks can be adjusted or executed for these specific points in time. Dynamic variables are calculated and adjusted according to the current scheduling state, and the user can define complex expressions to calculate their values, similar to SpEL expressions in Spring Boot. The parameter configuration unit can set various variables and parameters and apply these settings to all tasks under the schedule. In practical application scenarios, these parameters and variables are typically substituted into variable values in the subordinate tasks, such as database connections or the quantity of a specific product.
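The date-based static variables mentioned above can be derived from the scheduled date, for instance as in the following sketch. The variable names and the convention that a week starts on Monday are assumptions made for the example.

```python
import calendar
from datetime import date, timedelta
from typing import Dict


def static_variables(schedule_date: date) -> Dict[str, date]:
    """Derive common date variables from the scheduled date."""
    last_day = calendar.monthrange(schedule_date.year, schedule_date.month)[1]
    # weekday() is 0 for Monday, so this steps back to the Monday of the week.
    week_start = schedule_date - timedelta(days=schedule_date.weekday())
    return {
        "t_minus_1": schedule_date - timedelta(days=1),
        "month_start": schedule_date.replace(day=1),
        "month_end": schedule_date.replace(day=last_day),
        "week_start": week_start,
        "week_end": week_start + timedelta(days=6),
    }


variables = static_variables(date(2024, 3, 1))
```

For a schedule dated 1 March 2024 (a Friday in a leap year), `t_minus_1` is 29 February and the week runs from 26 February to 3 March.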
The task management module in this embodiment is used to create and manage a series of tasks, support multiple task types, and declare dependency relationships between tasks. The task management module comprises a task type manager, a task dependency configurator, a task attribute configurator and an alarm manager. The task type manager supports and extends the available task types. The task dependency configurator declares the dependency relationships between tasks. The task attribute configurator configures the fixed attributes of a task. The alarm manager configures and triggers alarms for different scenarios and sends alarm notifications by email, SMS and DingTalk.
The task type manager supports a variety of task types, including but not limited to shell, Sqoop, Hive2, Spark, DataX, stored procedures, branching tasks, quality check tasks, file listening tasks, and database listening tasks. In addition, an interface is reserved to support new task types that may appear in the future, ensuring the extensibility and adaptability of the system. For a Sqoop task, the specific content is the script or command executed by Sqoop; for a Hive task, it is the script or content executed by Hive; for a quality check task, it comprises the rule type (hierarchical detection, single-SQL detection, double-SQL detection), the data source, custom SQL, and the check criteria (consistency, accuracy, etc.).
The dependency relationships between tasks are declared through configuration or by dragging, which improves the flexibility and user-friendliness of task management. Task dependencies follow the lineage of the data, i.e. the order of task execution conforms to the logic of the data flow, which ensures that downstream tasks can correctly obtain the data they require. In this embodiment, whether to configure the scheduling dependencies of tasks based on the lineage of the data tables can be chosen according to business requirements.
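One common way to turn declared dependencies into an execution order is a topological sort; the sketch below uses Python's standard `graphlib` module (Python 3.9+). It is an illustration rather than the mechanism described in the patent, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def execution_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every task runs after all of its upstream tasks.

    `depends_on` maps each task to the set of tasks it depends on.
    graphlib raises CycleError if the declared dependencies contain a cycle.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical lineage: the report table is built from the cleaned table,
# which is built from the imported raw table.
order = execution_order({
    "load_report": {"clean_orders"},
    "clean_orders": {"import_orders"},
})
```

Because this example is a simple chain, the only valid order is `import_orders`, `clean_orders`, `load_report`.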
Each task also has a series of fixed attributes, such as task name, description, retry on failure, responsible person, developer, resource pool, submitting server and path of the content to execute (e.g. the paths of scripts to be executed; the file protocol is not limited to the local file protocol and may also be the HDFS file protocol, a shared file protocol, etc.), priority setting, notification management (covering failure, success, delay, etc.), task resource monitoring, task search, and task relationship display.
The rule for setting the task priority in this embodiment is that the priority of the task level is higher than the priority of the scheduling level, so as to ensure that the critical task is executed with priority.
For critical tasks (highest priority, e.g., regulatory-related tasks), the propagation of baseline alert notifications, shown in FIG. 2, comprises three parts:
1. Create a baseline: designate the tasks to add to the baseline and set the baseline priority and alarm policy parameters.
2. Determine the monitoring range from the baseline task K: the upstream nodes of the baseline task, i.e. the nodes that affect the output of task K, are all included in the monitoring range, such as A, B, E, F and I; downstream nodes of the baseline task are outside the monitoring range, such as M, C, D, G, H, J and L; the critical path is defined as the most time-consuming of all paths affecting task K, such as the path A-B-F-I-K in the figure.
3. Raise a baseline alarm or an event alarm according to the actual running status of the tasks within the monitoring range.
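The monitoring range and critical path of a baseline task can be computed from the dependency graph, as sketched below. The edges and durations in the example are assumptions chosen only to be consistent with the node names in the text; the actual graph of FIG. 2 is not reproduced here, and the function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def monitoring_range(baseline: str, upstream: Dict[str, List[str]]) -> Set[str]:
    """All tasks that directly or indirectly feed the baseline task."""
    seen: Set[str] = set()
    stack = list(upstream.get(baseline, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(upstream.get(node, []))
    return seen


def critical_path(baseline: str, upstream: Dict[str, List[str]],
                  duration: Dict[str, int]) -> List[str]:
    """Most time-consuming path ending at the baseline task (graph must be acyclic)."""
    memo: Dict[str, Tuple[int, List[str]]] = {}

    def longest(task: str) -> Tuple[int, List[str]]:
        if task not in memo:
            best_total, best_path = 0, []
            for parent in upstream.get(task, []):
                total, path = longest(parent)
                if total > best_total:
                    best_total, best_path = total, path
            memo[task] = (best_total + duration[task], best_path + [task])
        return memo[task]

    return longest(baseline)[1]


# Hypothetical edges (task -> its upstream tasks) and durations in minutes.
upstream = {"B": ["A"], "F": ["B", "E"], "I": ["F"], "K": ["I"], "M": ["K"]}
duration = {"A": 10, "B": 20, "E": 5, "F": 15, "I": 10, "K": 5, "M": 5}
watched = monitoring_range("K", upstream)
path = critical_path("K", upstream, duration)
```

With these assumed durations the path through A, B, F and I takes 60 minutes, longer than the 35 minutes through E, so it is the critical path; M is downstream of K and falls outside the monitoring range.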
Alarms in the task management module cover various conditions, such as success alarms, failure alarms, delay alarms, timeout alarms, and alarms for tasks that were expected to run but did not. Alarms can be delivered in many ways, including DingTalk, WeChat, telephone, SMS and email; alarm content can even be pushed to the large operations monitoring screen, and an extensible notification type interface is provided.
The system in this embodiment further includes a task resource monitoring module, which monitors the usage of various resources, such as CPU, memory, disk reads and writes, and IO usage, helping the user optimize resource allocation and the blocking execution strategy.
The scheduling instance management module in this embodiment manages each instance of scheduling content generated according to the scheduling frequency. It comprises a monitoring unit, which uses a Gantt chart to display the lineage dependencies and the critical path of tasks, as shown in FIG. 3. Through the Gantt chart, the user can clearly see the start time and end time of each task and the dependency relationships among tasks. The Gantt chart also makes abnormal conditions in the monitored process, such as task delays or overly long execution times, clear at a glance. The module also provides an instance operator for terminating, suspending, resuming and re-running scheduling instances.
The task instance management module in this embodiment manages the specific task instances in each scheduling content. It includes a log viewer, a link analyzer, and a time consumption analyzer. The log viewer is used for viewing the log information of a task instance. The link analyzer is used for analyzing the upstream link, the downstream link and the slowest link of a task instance. The time consumption analyzer is used for analyzing the execution time of historical tasks.
The log viewer can display a basic running log, providing error logs, warning information, and other critical running data. Through the detailed log information, the user can effectively diagnose problems, understand the execution condition of tasks, and evaluate the performance of tasks. The link analyzer is used for analyzing the upstream and downstream links of the task instance and provides a visual representation of the interdependence relationship between tasks. The user can clearly see through this tool which tasks are pre-or post-conditions of other tasks, thereby helping the user understand the data flow of the overall task stream. The time-consuming analyzer may analyze the time consumed in performing the historical tasks.
In this embodiment, the user may also force the status of the task instance to be successful in some cases; failure or problem tasks may be re-executed, including re-running a single task or its associated upstream and downstream tasks; a backtracking operation can be performed on a single task instance for reprocessing or analyzing past data; and terminating the executing task instance if necessary to prevent error diffusion or resource waste.
In addition, the actual execution time of a task is affected by several factors besides its defined scheduled time. For example, it depends on the scheduled time of the upstream task, the actual completion time of the upstream task, and the resources remaining in the task's execution resource group, as shown in FIG. 4.
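One simple reading of these factors is that a task can start only once its own scheduled time has arrived, all upstream tasks have finished, and its resource group has free capacity. The sketch below encodes that reading; it is an assumption about how the factors combine, not a rule stated in the text, and `earliest_start` is a hypothetical name.

```python
from datetime import datetime
from typing import List


def earliest_start(scheduled: datetime,
                   upstream_finished: List[datetime],
                   resources_free_at: datetime) -> datetime:
    """Earliest moment at which all three start conditions hold."""
    return max([scheduled, resources_free_at, *upstream_finished])


start = earliest_start(
    scheduled=datetime(2024, 3, 1, 2, 0),
    upstream_finished=[datetime(2024, 3, 1, 1, 40), datetime(2024, 3, 1, 2, 25)],
    resources_free_at=datetime(2024, 3, 1, 2, 10),
)
```

In this example the task is scheduled for 02:00 but actually starts at 02:25, when the slower upstream task completes.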
The system also comprises a data quality management module, wherein the data quality management module comprises a data quality rule management unit, and the data quality rule management unit is used for managing data quality check tasks and rules.
In the data quality rule management unit, the user may define the data quality rules through detailed templates including various parameters such as rule name, type, databases and tables involved, specific fields, check logic, etc. In addition, the unit supports the query, deployment, modification, execution and deletion of rules, and provides a function of batch operation so that a user can efficiently manage a large number of rules. The downloading and batch importing functions of the data quality rule templates further improve the convenience and efficiency of operation. In this embodiment, the data quality check criteria are shown in fig. 5, including timeliness, accuracy, integrity, consistency, and validity.
The data quality management module further comprises a data quality alarm management unit. The data quality alarm management unit is used for processing alarms generated by data quality verification, and comprises alarm notification configuration, statistics of alarm information and alarm processing.
Alarm notification configuration sends alarms to the responsible person within specified working hours, for example by email, SMS or DingTalk. The alarm display page provides statistics of alarm information, such as rule name, responsible person, check criteria, execution status, associated schedule, upstream dependent task name and its execution status, check result, go-live date, and execution time. Alarm processing lets the user choose among different handling options, such as modifying the check SQL, repairing the data, reporting a data quality event, taking the quality check rule offline, recording the impact on the scheduling system, or temporarily ignoring the alarm. In addition, the data quality alarm management unit records the assessed impact range (newly launched, no impact for now, controllable impact range, data quality event reported, etc.) and the historical processing record (historical alarm ignored, historical alarm processed, etc.).
The flow of the data quality alarm management unit is as shown in fig. 6:
When a periodic strong-check quality rule detects an anomaly, the system may be configured to automatically block execution of downstream tasks. This prevents data that does not meet quality criteria from propagating to downstream tasks, thereby improving the accuracy and reliability of the overall data processing flow, which matters most for financial reporting and critical business decisions. For less critical data, the quality rule may instead be set as a periodic weak check. Violating such a rule does not immediately block downstream tasks; the violation can be handled through periodic review and analysis.
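The difference between strong and weak rules can be sketched as a traversal of the task dependency graph. The function name, the `"strong"`/`"weak"` labels, and the sample DAG are assumptions made for illustration.

```python
from typing import Dict, List, Set


def blocked_tasks(dag: Dict[str, List[str]], failed_task: str,
                  rule_strength: str) -> Set[str]:
    """Return every downstream task to block after a failed check on failed_task.

    dag maps each task to its direct downstream tasks.
    """
    if rule_strength != "strong":
        return set()  # weak rule: nothing is blocked, the violation is only recorded
    blocked: Set[str] = set()
    stack = list(dag.get(failed_task, []))
    while stack:
        task = stack.pop()
        if task not in blocked:
            blocked.add(task)
            stack.extend(dag.get(task, []))
    return blocked


# Hypothetical DAG: extract -> clean -> {report, features -> model}
dag = {"extract": ["clean"], "clean": ["report", "features"], "features": ["model"]}
```

With this DAG, a failed strong check on `clean` blocks `report`, `features`, and `model`, whereas a failed weak check blocks nothing.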
In addition, when a data quality problem occurs frequently or reaches a certain threshold, the system may be configured to trigger a fusing (circuit-breaker) alert, i.e., to temporarily stop or adjust the relevant data processing flow until the problem is resolved.
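One possible reading of the fusing behavior is a failure counter over a sliding window of recent check results. The class name, the window size, and the threshold semantics below are assumptions; the application does not specify how the threshold is computed.

```python
from collections import deque
from typing import Deque


class QualityFuse:
    """Trips when at least `threshold` of the last `window` checks failed (hypothetical)."""

    def __init__(self, threshold: int, window: int) -> None:
        self.threshold = threshold
        self.results: Deque[bool] = deque(maxlen=window)
        self.tripped = False

    def record(self, passed: bool) -> bool:
        """Record one check result and return whether the fuse is tripped."""
        self.results.append(passed)
        failures = sum(1 for ok in self.results if not ok)
        if failures >= self.threshold:
            self.tripped = True  # stays tripped until reset() is called
        return self.tripped

    def reset(self) -> None:
        """Called once the underlying problem has been resolved."""
        self.results.clear()
        self.tripped = False
```

The fuse stays tripped until `reset()` is called, which mirrors the requirement that processing remains suspended until the problem is resolved.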
In this embodiment, three blocking execution strategies are included.
The first blocking execution strategy is determined according to the association degree of the data. Specifically, the degree to which a downstream business process (e.g., business C) depends on upstream data (e.g., data produced by business A) has a direct impact on the output quality of the downstream business. In order to accurately evaluate the impact of such dependencies on business process accuracy, various methods may be employed to quantify the degree of data association.
For example, the association degree of data is evaluated by analyzing its participation in the business process. The participation can be measured by factors such as the weight of the data in business decisions, the frequency with which it is called in business logic, and the range and degree to which a change in the data affects the business result. Taking customer credit data as an example, if customer credit data provided by business A is used to assess credit risk in business C, the quality of that data directly affects the accuracy of the credit decisions of business C, because it is directly linked to credit assessment and loan approval. In a new-card application process, however, credit data may be only one of many considerations, so its impact is relatively small.
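A weighted sum is one simple way to turn the three factors above into a single association degree. The weights, the 0-to-1 factor scale, the blocking threshold, and the sample values are all assumptions for this sketch; the application does not prescribe a formula.

```python
from typing import Dict

# Assumed weights for the three factors named in the text; they sum to 1.
WEIGHTS = {"decision_weight": 0.5, "call_frequency": 0.3, "change_impact": 0.2}


def association_degree(factors: Dict[str, float]) -> float:
    """Weighted sum of factor scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)


def should_block(factors: Dict[str, float], threshold: float = 0.6) -> bool:
    """Block the downstream process when the association degree reaches the threshold."""
    return association_degree(factors) >= threshold


# Hypothetical scores for how strongly each process depends on credit data.
credit_assessment = {"decision_weight": 0.9, "call_frequency": 0.8, "change_impact": 0.9}
card_application = {"decision_weight": 0.2, "call_frequency": 0.3, "change_impact": 0.1}
```

Under these assumed scores the credit assessment process would be blocked when the credit data fails its check, while the card application process would continue.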
The second blocking execution strategy is determined according to the similarity of the services. Even if two business processes (e.g., A and B) have no direct dependency in the data stream, inherent similarities between them, such as business logic, process flow, applied technology, operating environment, or customer base, may mean that they face similar risks or error patterns.
When business process A presents a problem and business process B is highly similar to it, it can be inferred that B may also have a potential problem. Such inference is based on the structure and environmental factors shared among business processes. For example, if both business processes depend on the same third-party service provider, a failure of that provider may affect both at the same time.
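Service similarity could, for instance, be estimated with the Jaccard index over the attributes two processes share. The choice of Jaccard similarity, the attribute labels, and the sample processes are assumptions; the application names the kinds of shared factors but not a metric.

```python
from typing import Set


def similarity(a: Set[str], b: Set[str]) -> float:
    """Jaccard index: shared attributes divided by all attributes of both processes."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical attribute sets (provider, runtime, customer segment, business logic).
process_a = {"provider:x", "runtime:hive", "segment:retail", "logic:scoring"}
process_b = {"provider:x", "runtime:hive", "segment:retail", "logic:pricing"}
```

Here the two processes share three of five distinct attributes, giving a similarity of 0.6; a configured threshold would then decide whether a failure in A also blocks B.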
The third blocking execution strategy is inferred from the upstream data. When a quality problem occurs in the data of service A, the diagnosis flow is not limited to checking the inside of service A but also extends to the upstream service flows, so that the source of the problem is comprehensively identified.
For example, suppose the quality of the output data of service A suddenly drops, resulting in inaccurate credit risk estimation. Possible upstream factors are then traced without changing the internal logic of service A. It may be found, for instance, that a market data item on which service A depends comes from an upstream data-providing service B. Further analysis shows that the latest update of service B introduced a data processing error, so that incomplete market data was supplied to the model of service A. By correcting the data processing flow in service B, the data quality of service A can be restored and possible credit risk assessment errors prevented.
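The upstream tracing described above can be sketched as a breadth-first walk over failing upstream services. The function name, the graph representation, and the sample services are assumptions; a service is reported as a root cause when its own check failed and none of its direct upstream services failed.

```python
from collections import deque
from typing import Dict, List


def root_causes(upstream: Dict[str, List[str]], check_failed: Dict[str, bool],
                start: str) -> List[str]:
    """Follow failing upstream services from start and return the failing sources."""
    roots: List[str] = []
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        failing_parents = [p for p in upstream.get(node, [])
                           if check_failed.get(p, False)]
        if check_failed.get(node, False) and not failing_parents:
            roots.append(node)  # failed itself, but everything upstream passed
        for parent in failing_parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return roots


# Hypothetical example: service A depends on B and D; B depends on a market feed.
upstream = {"A": ["B", "D"], "B": ["market_feed"], "D": []}
check_failed = {"A": True, "B": True, "D": False, "market_feed": False}
```

With this data the trace starting at service A returns `["B"]`, matching the example in which the error was introduced by an update of service B.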
The data quality management module further comprises a data quality event management unit, and the data quality event management flow is as shown in fig. 7: in this flow, data quality problems are discovered through multiple approaches such as data quality assessment, supervision reporting and statistical analysis, audit assessment, daily business operations, and the like.
Data quality is the degree to which data satisfies the needs of business operation, management, and decision-making. Its measurement standards comprise authenticity, accuracy, integrity, validity, timeliness, consistency, uniqueness, and security. Authenticity means that the data truly reflects the actual condition of the business; accuracy means that the precision of the data meets the business requirement; integrity means that the required key data items are defined in the system and data is collected for all of them; validity means that the data definition accords with business rules, data standards, data models, metadata, or authoritative reference data, and that the data value lies within a specified range; timeliness means that the latest data can be obtained within the time limit of the data demand, or that the data value is refreshed at the required update frequency; consistency reflects whether the data of the same business entity and its attributes have a consistent definition and meaning, and remain the same when recorded multiple times in different systems or in the same system; uniqueness means that only one item (or one group) of key data describes the same business object; security indicates the level or degree to which the data can be accessed.
After a data quality problem is found, it is analyzed in detail to determine its impact. After the analysis is completed, a working solution is proposed and submitted for review to ensure its feasibility and effectiveness. The decision point is whether the data quality problem is judged to be significant. If the problem is determined to be significant, the solution to the significant data quality problem must be approved; if the problem is not significant, the responsible business department specifies a concrete remediation plan for the data quality problem. Through this series of steps, the accuracy and reliability of the data can be ensured and the operational requirements can be met.
The function of the scheduling tool of this embodiment in data streaming (DataOps) is shown in fig. 1: in the preparation phase of data analysis, the data source, the scripts required for processing, the storage paths of the data and the scripts, and the specific type of analysis task are determined. After the preparation is completed, Hive Query Language (HQL) and Shell (sh) scripts conforming to the development specifications are uploaded into a specific directory of the data cluster.
In a big data scheduling platform, a Directed Acyclic Graph (DAG) of data processing is configured, including setting various tasks under the DAG, such as data extraction, processing scripts, quality check rules, data retention policies and alarm policies. The scheduling platform will parse key annotation information contained in the script, such as responsible person, guarantee time, target table, work source table and intermediate table. This key information is then presented on the page for validation by the data developer.
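The application does not give the annotation syntax that the scheduling platform parses. The sketch below assumes a hypothetical convention of SQL comment lines of the form `-- @key: value`; the keys and the sample script are likewise illustrative.

```python
import re
from typing import Dict

# Hypothetical annotation format: "-- @key: value" on its own line.
ANNOTATION = re.compile(r"^--\s*@(\w+)\s*:\s*(.+?)\s*$")


def parse_annotations(script: str) -> Dict[str, str]:
    """Collect key annotation information from the comment lines of a script."""
    found: Dict[str, str] = {}
    for line in script.splitlines():
        match = ANNOTATION.match(line)
        if match:
            found[match.group(1)] = match.group(2)
    return found


hql = """-- @owner: alice
-- @guarantee_time: 06:00
-- @target_table: dw.orders_daily
-- @source_tables: ods.orders, ods.customers
INSERT OVERWRITE TABLE dw.orders_daily SELECT * FROM ods.orders;
"""
```

The parsed dictionary could then be shown on the configuration page so that the data developer can confirm the responsible person, guarantee time, and tables before the DAG goes online.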
After the configuration and synchronization of the big data scheduling platform is completed, the scheduling tool is used to create the corresponding data set workflow and to designate its version. The workflow is saved and brought online, and then enters the running state. While the workflow runs, operation monitoring is performed, and re-run and backtracking operations are carried out as necessary. At the same time, the DAG and the Gantt chart may be viewed to better understand the execution and timing of the workflow.
In the information collection and feedback stage, information about the alarm level, table-level data lineage (full link), history versions, completion time, time consumption, guarantee rate, and the dependencies of the DAG is also collected. In addition, the dependency relationships are checked for errors, the execution priority is determined, and a completion mark is set after the operation completes successfully.
The whole process starts from the preparation work of data analysis. Jobs are then run and monitored with the scheduling tool through the configuration and management of the big data scheduling platform, and finally the detailed information of job execution is collected for analysis and optimization, so that efficient and accurate data processing is ensured.
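Backtracking over a specified time period can be pictured as generating one scheduling instance per date in the period. The daily granularity and the function name are assumptions for this sketch.

```python
from datetime import date, timedelta
from typing import List


def backtrack_dates(start: date, end: date) -> List[date]:
    """Return every business date from start to end, inclusive, to be re-run."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]
```

Each returned date would correspond to one scheduling instance that is re-created and executed in order.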
The foregoing is merely an embodiment of the present application, and specific structures and features that are common knowledge in the art are not described in detail here. A person skilled in the art knows the general technical knowledge of the field as of the filing date or priority date, has access to all of the prior art in the field, and is able to apply the conventional experimental means available at that date; such a person can therefore complete and implement the present application in light of its teaching. It should be noted that modifications and improvements can be made by those skilled in the art without departing from the structure of the present application; these should also be considered within the protection scope of the present application and do not affect the effect of implementing the present application or the utility of the patent. The protection scope of the present application is subject to the content of the claims, and the description of the specific embodiments and the like in the specification can be used to explain the content of the claims.
Claims (9)
1. The scheduling management system based on the data quality management is characterized by comprising a scheduling service module, a task management module, a scheduling instance management module, a task instance management module and a data quality management module; the scheduling service module is used for carrying out task scheduling according to the set scheduling frequency; the task management module is used for managing creation, dependency relationship and attribute setting of various tasks; the scheduling instance management module manages each scheduling content generated according to the scheduling frequency; the task instance management module manages specific task instances in each scheduling content; the data quality management module is used for managing data quality; the data quality management module comprises a data quality scheduling unit which realizes data scheduling according to a blocking execution strategy.
2. The data quality management based scheduling management system of claim 1, wherein the scheduling frequency comprises a crontab expression, a fixed interval, a specified date time, and a manual trigger.
3. The scheduling management system based on data quality management according to claim 2, wherein the scheduling service module comprises a backtracking unit and a parameter configuration unit; and the backtracking unit backtracks tasks according to a specified time period, and the parameter configuration unit is used for configuring static variables and dynamic variables.
4. The data quality management based scheduling management system of claim 1, wherein the task management module includes specific configuration content for different task types including, but not limited to, Sqoop task, Hive task, quality check task.
5. The data quality management-based scheduling management system of claim 1, wherein the scheduling instance management module includes a monitoring unit that uses a Gantt chart to display instance execution status.
6. The data quality management-based scheduling management system of claim 1, wherein the data quality management module includes a data quality rule management unit for managing data quality check tasks and rules.
7. The scheduling management system based on data quality management according to claim 1, wherein the data quality management module further comprises a data quality alarm management unit, and the data quality alarm management unit is configured to process alarms generated by data quality check, including alarm notification configuration, statistics of alarm information, and alarm processing.
8. The data quality management based scheduling management system of claim 7, wherein the blocking execution policy comprises a first blocking execution policy, a second blocking execution policy, and a third blocking execution policy; the first blocking execution strategy is determined according to the association degree of the data; and the second blocking execution strategy is determined according to the similarity of the services.
9. A scheduling management method based on data quality management, characterized in that the method uses the scheduling management system based on data quality management according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410230291.7A CN118036983A (en) | 2024-02-29 | 2024-02-29 | Scheduling management method and system based on data quality management |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410230291.7A CN118036983A (en) | 2024-02-29 | 2024-02-29 | Scheduling management method and system based on data quality management |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118036983A (en) | 2024-05-14 |
Family
ID=90993096
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410230291.7A Pending CN118036983A (en) | 2024-02-29 | 2024-02-29 | Scheduling management method and system based on data quality management |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118036983A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||