CN110928884A - Data refresh method, device and system - Google Patents

Data refresh method, device and system

Info

Publication number
CN110928884A (application); CN110928884B (granted)
Authority
CN
China
Prior art keywords
data
refresh
job
batch
task
Legal status
Granted
Application number
CN201811023618.4A
Other languages
Chinese (zh)
Other versions
CN110928884B (en)
Inventor
庞廓
李升�
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201811023618.4A
Publication of CN110928884A
Application granted
Publication of CN110928884B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a data refresh method, device and system, and relates to the field of data processing. The method comprises the following steps: generating a refresh batch and data jobs according to a refresh configuration file, wherein one data refresh flow is treated as one refresh batch, one data calculation process is treated as one data task, and one execution of a data task is treated as one data job; executing the data jobs based on the upstream and downstream dependencies between data tasks; and replacing the original data based on the job results of the data jobs. The method and device reduce the degree of manual involvement in refreshing offline historical data and can ensure the integrity and accuracy of data updates at every level.

Description

Data refresh method, device and system
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a data refresh method, apparatus, and system.
Background
In a complex data flow, service data is produced by multi-stage offline processing of raw data. When an upstream data source changes for some historical period, all or part of the data flow needs to be refreshed. During such a refresh, each data owner handles the change manually after being notified by the upstream side and then notifies the downstream side level by level. In this level-by-level process, the owners at every level must invest time to verify the correctness of their own data updates.
For example, a data provider needs to modify historical data for a certain period because of business needs or bugs, and notifies the data owner directly responsible for that data. The data owner confirms the affected update range with the upstream side and schedules the data refresh. The data owner then notifies the downstream side and other data consumers to stop using the data that is about to be updated and starts the refresh. After the refresh completes, the data owner notifies the downstream side to update against the modified data and informs other data consumers that the new data is available. Each downstream data owner repeats this process in turn, until the most downstream owner finishes updating and informs the users that the latest data is available.
Disclosure of Invention
The inventors found that for large-scale data updates the labor cost is enormous, because every level of the update is handled manually. For example, when upstream data is updated frequently, the energy of the data owners at every level is consumed by the routine work of refreshing historical data, leaving the data team little capacity for more creative work. Because every level of the update is handled manually, it cannot be fully guaranteed that the upstream and downstream data owners understand the update range consistently, nor that no update is missed. Because each level of the update is the responsibility of its own data owner, the computing resources occupied by the update tasks at different levels vary widely, so reasonable use of computing resources cannot be ensured. Because the update is time-consuming, the data cannot be used for a long time while it is being updated. In addition, because the levels finish their updates at different times, data queried through different channels may mix old and new versions and therefore be inconsistent.
The technical problem to be solved by the present disclosure is to provide a data refresh method, device and system that can ensure the integrity and accuracy of data updates at every level.
According to one aspect of the present disclosure, a data refresh method is provided, including: generating a refresh batch and data jobs according to a refresh configuration file, wherein one data refresh flow is treated as one refresh batch, one data calculation process is treated as one data task, and one execution of a data task is treated as one data job; executing the data jobs based on the upstream and downstream dependencies between data tasks; and replacing the original data based on the job results of the data jobs.
According to another aspect of the present disclosure, a data refresh apparatus is also provided, including: a refresh-configuration parsing module configured to generate a refresh batch and data jobs according to a refresh configuration file, wherein one data refresh flow is treated as one refresh batch, one data calculation process is treated as one data task, and one execution of a data task is treated as one data job; a task-job execution module configured to execute the data jobs based on the upstream and downstream dependencies between data tasks; and a data replacement module configured to replace the original data based on the job results of the data jobs.
According to another aspect of the present disclosure, a data refresh apparatus is also provided, including: a memory; and a processor coupled to the memory, the processor being configured to perform the data refresh method described above based on instructions stored in the memory.
According to another aspect of the present disclosure, a computer-readable storage medium is also provided, on which computer program instructions are stored, which, when executed by a processor, implement the steps of the data refresh method described above.
According to another aspect of the present disclosure, a data refresh system is further provided, which includes a user service device, a data storage device, and the data refresh apparatus described above.
Compared with the prior art, in the embodiments of the present disclosure, a refresh batch and all the data jobs it contains are generated according to the refresh configuration file, the data jobs are then executed based on the upstream and downstream dependencies between data tasks, and the historical data is refreshed according to the job results. This reduces the degree of manual involvement in refreshing offline historical data and ensures the integrity and accuracy of data updates at every level.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
fig. 1 is a flowchart of an embodiment of the data refresh method according to the present disclosure.
FIG. 2 is a schematic diagram of a directed acyclic graph according to the present disclosure.
Fig. 3 is a flowchart of another embodiment of the data refresh method according to the present disclosure.
Fig. 4A is a schematic diagram of the states of a refresh batch according to the present disclosure.
Fig. 4B is a schematic diagram of the states of a data job according to the present disclosure.
Fig. 5 is a schematic diagram of data storage according to the present disclosure.
FIG. 6 is a flowchart of an embodiment of generating a refresh batch and data jobs from a refresh configuration file in the data refresh method according to the present disclosure.
Fig. 7 is a flowchart of an embodiment of selecting the job set for which a data refresh needs to be performed in the data refresh method according to the present disclosure.
FIG. 8 is a flowchart of an embodiment of executing the data jobs in the job set in the data refresh method according to the present disclosure.
Fig. 9 is a flowchart of an embodiment of replacing original data based on job results in the data refresh method according to the present disclosure.
Fig. 10 is a flowchart of an embodiment of replacing the data in the online table with the data in the cache table in the data refresh method according to the present disclosure.
Fig. 11 is a schematic diagram of replacing the data in the online table with the data in the cache table in the data refresh method according to the present disclosure.
Fig. 12 is a flowchart of an embodiment of summarizing error alarms in the data refresh method according to the present disclosure.
Fig. 13 is a schematic structural diagram of an embodiment of a data refresh apparatus according to the present disclosure.
Fig. 14 is a schematic structural diagram of another embodiment of the data refresh apparatus according to the present disclosure.
Fig. 15 is a schematic structural diagram of still another embodiment of the data refresh apparatus according to the present disclosure.
Fig. 16 is a schematic structural diagram of yet another embodiment of the data refresh apparatus according to the present disclosure.
FIG. 17 is a schematic structural diagram of an embodiment of a data refresh system according to the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
Fig. 1 is a flowchart of an embodiment of the data refresh method according to the present disclosure.
In step 110, a refresh batch and data jobs are generated according to the refresh configuration file, wherein one data refresh flow is treated as one refresh batch, one data calculation process is treated as one data task, and one execution of a data task is treated as one data job. The refresh configuration file includes a source task name, a refresh task time range, refresh task parameters, and the like.
In step 120, the data jobs are executed based on the upstream and downstream dependencies between data tasks. The data tasks have upstream and downstream dependencies; a directed acyclic graph can be constructed from these dependencies, and the data jobs are executed based on the directed acyclic graph. As shown in Fig. 2, if the upstream historical data source of task B is updated for a certain period, tasks B, D, E, F, and G need to recalculate the historical data for that period.
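As a concrete illustration of this propagation, the sketch below computes which tasks must recompute when one upstream task's data changes, assuming the dependencies are available as (upstream, downstream) name pairs; the task names mirror Fig. 2 but are otherwise invented for the example.

```python
from collections import defaultdict

def affected_tasks(dependencies, changed_task):
    """Return the changed task plus every task downstream of it.

    dependencies: iterable of (upstream_name, downstream_name) pairs.
    """
    children = defaultdict(list)
    for up, down in dependencies:
        children[up].append(down)

    affected, stack = set(), [changed_task]
    while stack:
        task = stack.pop()
        if task in affected:
            continue
        affected.add(task)
        stack.extend(children[task])
    return affected

# Dependency pairs loosely modeled on Fig. 2 (names are assumptions).
deps = [("A", "C"), ("B", "D"), ("B", "E"), ("D", "F"), ("E", "G")]
print(affected_tasks(deps, "B"))   # {'B', 'D', 'E', 'F', 'G'}
```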
In step 130, the original data is replaced based on the job results of the data jobs. That is, the original historical data is replaced by the job results, namely the data obtained through multi-level data calculation, so that the historical data is updated to the newly calculated data.
In this embodiment, the refresh batch and all the data jobs it contains are generated according to the refresh configuration file, the data jobs are then executed based on the upstream and downstream dependencies between data tasks, and the original data is replaced according to the job results. This reduces the degree of manual involvement in refreshing offline historical data and ensures the integrity and accuracy of data updates at every level.
Fig. 3 is a flowchart of another embodiment of the data refresh method according to the present disclosure.
In step 310, the refresh configuration file is parsed into a refresh batch and data jobs, each carrying its own state. Data calculation is generally performed as distributed computation with MapReduce or Spark and takes a relatively long time, so the calculation process needs to be abstracted before it can be scheduled. Each data calculation process is abstracted as a data task, one historical data flow that needs to be refreshed is abstracted as a refresh batch, one execution of a data task is a data job, and both the refresh batch and the data jobs have states.
In one embodiment, as shown in Fig. 4A, the states of a refresh batch include a generating state, a waiting state, a frozen state, an executing state, an execution-success state, and a replacement-complete state. The generating state is the initial state: the jobs are generated one by one by parsing the source-task configuration of the refresh batch and the upstream/downstream relationships of the tasks, where the job corresponding to the most upstream task is in the ready state and the remaining jobs are in the waiting state. If job generation fails, the batch fails directly and alarm information is recorded; if it succeeds, the test mark is checked: a test batch enters the frozen state, and a non-test batch enters the waiting state. A refresh batch in the waiting state means that the entire batch is waiting to be executed; if the batch has no conflict with any executing batch, i.e., contains no identical data job, it enters the executing state. A refresh batch in the frozen state means that the job states in the batch no longer advance, which is used for testing and for user control. Ready jobs in an executing batch can enter the process pool and start running, and an executing batch can also be paused manually into the frozen state. A refresh batch in the execution-success state means that all jobs in the batch have executed successfully. A refresh batch in the replacement-complete state means that all results of the batch have been replaced online and data consumers can query the updated data.
In one embodiment, as shown in Fig. 4B, the states of a data job include a waiting state, a ready state, an executing state, an execution-failed state, a retry-failed state, an execution-success state, a replacing state, a replacement-complete state, and a replacement-failed state. A data job in the waiting state is waiting for its upstream dependencies to finish; once all upstream jobs are complete, it enters the ready state. A data job in the ready state can be executed: when the entire batch is in the executing state and the system configuration and resources allow, the job enters the executing state. A data job in the executing state is performing its data calculation; if the calculation succeeds the job enters the execution-success state, otherwise the execution-failed state. A data job in the execution-failed state has a chance to be re-executed: if a retry succeeds it enters the execution-success state, and once the specified number of failures is reached it enters the retry-failed state and is not retried again. A data job in the retry-failed state still fails after several retries; at that point the cause is probably no longer the cluster environment or some other transient issue, and the data owner responsible for the job needs to troubleshoot it. A data job in the execution-success state has finished its task and its data sits in the cache table; when the current time allows replacement, it enters the replacing state. A data job in the replacing state is replacing data online; normally the jobs of one batch are replaced together to keep the data consistent, but data that is urgently needed can be replaced ahead of time. A data job in the replacement-complete state means that the data corresponding to the job has been replaced online and is available for user queries. A data job in the replacement-failed state means that the data files in the cache table are problematic and cannot be replaced online.
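For reference, the two state sets described above can be written down as simple enumerations; this is only a naming sketch, not the patent's actual implementation, and the identifiers are assumptions.

```python
from enum import Enum, auto

class BatchState(Enum):
    # States of a refresh batch (Fig. 4A).
    GENERATING = auto()
    WAITING = auto()
    FROZEN = auto()
    EXECUTING = auto()
    EXECUTION_SUCCESS = auto()
    REPLACEMENT_COMPLETE = auto()

class JobState(Enum):
    # States of a data job (Fig. 4B).
    WAITING = auto()
    READY = auto()
    EXECUTING = auto()
    EXECUTION_FAILED = auto()
    RETRY_FAILED = auto()
    EXECUTION_SUCCESS = auto()
    REPLACING = auto()
    REPLACEMENT_COMPLETE = auto()
    REPLACEMENT_FAILED = auto()
```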
At step 320, a directed acyclic graph is constructed based on upstream and downstream dependencies between data tasks.
In step 330, the job set for which a data refresh needs to be performed is selected from the data jobs corresponding to all data tasks.
In step 340, the data jobs in the job set are executed based on the directed acyclic graph. Each level of data calculation reads data from several partitions of several tables, performs operations such as filtering, grouping, and aggregation, and writes the result to one or more table partitions.
In step 350, the job results of the data jobs that have completed their data tasks are stored in a cache table. As shown in Fig. 5, the underlying data is mainly stored in HDFS (the Hadoop Distributed File System) in the form of Hive tables (Hive being a Hadoop-based data warehouse tool), separately from metadata such as the partition information of each table. Because data calculation is slow, a cache table is created for every offline Hive online table so that online queries are not affected; the cache table is an external table, and its data and metadata are independent.
In step 360, the data in the online table is replaced with the data in the cache table. All data calculation is performed against the cache table, and the results are then swapped into the online Hive table as required. For example, the data of the online table is deleted and, at the same time, the data in the cache table is moved to the data address corresponding to the online table.
In this embodiment, the results of the data calculation performed by the jobs are not written directly to the online table. Instead, a cache table is created over the online Hive table in the underlying data store, and the cache-table data is replaced online at a scheduled time. This keeps the historical data available while it is being refreshed, that is, users can keep using the old version of the data during the update, and data consistency is ensured.
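A minimal sketch of the cache-table idea follows: for each online Hive table, an external table with the same schema is created at its own HDFS location, so job results can be written there without touching the online data. The table names, columns, and paths below are assumptions for illustration, not the patent's actual schema.

```python
# Illustrative HiveQL for one online table and its cache (external) counterpart.
ONLINE_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS dm_sales (
    sku_id STRING,
    amount DOUBLE
)
PARTITIONED BY (dt STRING)
STORED AS ORC
LOCATION '/warehouse/online/dm_sales'
"""

CACHE_TABLE_DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS dm_sales_cache (
    sku_id STRING,
    amount DOUBLE
)
PARTITIONED BY (dt STRING)
STORED AS ORC
LOCATION '/warehouse/cache/dm_sales'
"""
```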
FIG. 6 is a flowchart of an embodiment of generating a refresh batch and data jobs from a refresh configuration file in the data refresh method according to the present disclosure.
In step 610, system configuration information such as the number of retries on job failure, the timeout, the size of the job-execution process pool, and the size of the data-replacement process pool is read from the database.
In step 620, the upstream and downstream dependencies of the tasks are read from the database and a directed acyclic graph is constructed. The raw dependencies in the database are pairs of upstream task name and downstream task name; from these pairs the directed acyclic graph is built and a topologically sorted sequence of task names is produced.
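A minimal sketch of this construction is shown below, assuming the dependency rows come back as (upstream, downstream) name pairs; it builds adjacency lists and produces a topological order with Kahn's algorithm.

```python
from collections import defaultdict, deque

def build_dag(pairs):
    """Build adjacency lists and in-degrees from (upstream, downstream) pairs."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    for up, down in pairs:
        children[up].append(down)
        indegree[down] += 1
        indegree.setdefault(up, 0)
    return children, indegree

def topological_order(pairs):
    """Return a task-name sequence in which every upstream precedes its downstream."""
    children, indegree = build_dag(pairs)
    queue = deque(task for task, degree in indegree.items() if degree == 0)
    order = []
    while queue:
        task = queue.popleft()
        order.append(task)
        for child in children[task]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if len(order) != len(indegree):
        raise ValueError("dependency cycle detected; the task graph must be acyclic")
    return order
```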
In step 630, it is checked whether the designated storage location contains refresh configuration files submitted by users. If so, each configuration file is parsed in the subsequent steps; if not, the parsing phase ends.
In step 640, a refresh configuration file is retrieved from the designated storage location.
In step 650, the refresh configuration file is parsed into a refresh batch. The refresh-batch information is essentially the same as the information in the configuration file, with some default configuration values filled in. As shown in Table 1, the batch information map includes run_tasks (source task name : history dates to refresh : parameters), run_status (batch status), priority (batch priority), cause (reason for the refresh), run_mode (refresh mode: run immediately or at a specified time), switch_mode (data replacement mode: replace online immediately after the job tasks finish, or at a specified time), switch_ahead_tasks (the list of tasks whose data needs to be replaced online ahead of time), and except_tasks (the list of tasks that are on the task dependency graph but do not need to run this time).
Field               Meaning
run_tasks           source task name : history dates to refresh : parameters
run_status          batch status
priority            batch priority
cause               reason for the refresh
run_mode            refresh mode (run immediately or at a specified time)
switch_mode         data replacement mode (replace online immediately after execution or at a specified time)
switch_ahead_tasks  list of tasks whose data is replaced online ahead of time
except_tasks        list of tasks on the dependency graph that are skipped this time
TABLE 1
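To make the batch information concrete, a hypothetical refresh configuration in JSON form is sketched below. The field names follow Table 1, while the task names, dates, and values are invented for illustration; run_status is omitted because it is filled in by the scheduler rather than by the user, and the on-disk format actually used is not specified here.

```python
import json

# Hypothetical refresh configuration; field names follow Table 1, values are invented.
refresh_config = {
    "run_tasks": [
        # source task name : history dates to refresh : parameters
        {"task": "dm_sales_daily",
         "dates": ["2018-07-01", "2018-07-31"],
         "params": {"region": "north"}},
    ],
    "priority": 5,
    "cause": "upstream order data corrected for July 2018",
    "run_mode": "immediate",           # or a scheduled start time
    "switch_mode": "scheduled:02:00",  # replace online at a quiet hour
    "switch_ahead_tasks": ["dm_sales_daily"],
    "except_tasks": ["dm_sales_experimental"],
}

print(json.dumps(refresh_config, indent=2, ensure_ascii=False))
```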
At step 660, data jobs are generated in turn.
In step 670, it is checked whether all jobs have been generated according to the directed acyclic graph. If all of them were generated successfully, step 680 is executed; otherwise step 6120 is executed.
At step 680, the configuration file is moved to the parse-success directory.
In step 690, it is checked whether the batch is marked as a test batch. If not, step 6100 is executed; if so, step 6110 is executed.
In step 6100, the refresh batch state is changed to the waiting state.
In step 6110, the refresh batch state is changed to the frozen state.
In step 6120, the refresh batch and data job information already generated are removed from the database, the configuration file is moved to the parse-failure directory, and the error information is written to the log.
In this embodiment, the data refresh request submitted by the user is parsed, based on the refresh configuration file, into batches and jobs with their various states so that the subsequent operations can be carried out.
Fig. 7 is a flowchart of an embodiment of selecting the job set for which a data refresh needs to be performed in the data refresh method according to the present disclosure.
In step 710, the job set in the active state is determined. The active job set includes the data jobs of refresh batches whose data jobs are being executed and the data jobs of refresh batches whose data jobs have all been completed.
In step 720, a waiting refresh batch is taken, i.e., a refresh batch waiting for its data jobs to be executed.
In step 730, a data job of the current batch is taken.
In step 740, it is checked whether the current data job conflicts with any job in the active job set; two jobs conflict if their task name, refresh date, and parameters are all identical. If there is no conflict, i.e., the current data job does not intersect the active job set, step 750 is executed; otherwise step 780 is executed.
In step 750, it is checked whether all jobs of the current batch have been traversed. If so, step 760 is executed; otherwise step 730 is executed.
In step 760, the state of the current batch is changed to the executing state, i.e., its data jobs are being executed.
In step 770, all jobs of the current batch are added to the active job set.
In step 780, it is checked whether all waiting refresh batches have been traversed. If so, step 790 is executed; otherwise step 720 is executed.
In step 790, all executing batches are examined in descending order of priority, the set of batches with the highest priority is selected, and all executable jobs are taken from it. An executable job is one whose state is ready or execution failed. That is, the data jobs of the highest-priority refresh batches form the job set for which the data refresh is to be performed.
In this embodiment, a group of jobs that can be executed at the current time and should be executed first is selected from all the jobs that need to be refreshed, and the data jobs in this job set are then executed.
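The conflict test of step 740 reduces to comparing (task name, refresh date, parameters) triples. The sketch below, whose batch and job dictionaries are assumed data structures, promotes a waiting batch only when none of its jobs collide with the active set and then picks the executable jobs of the highest-priority executing batches.

```python
def job_key(job):
    # Two jobs conflict when task name, refresh date, and parameters all match.
    return (job["task"], job["date"], tuple(sorted(job["params"].items())))

def promote_waiting_batches(waiting_batches, active_jobs):
    """Move non-conflicting waiting batches into the executing state (sketch)."""
    active_keys = {job_key(j) for j in active_jobs}
    for batch in waiting_batches:
        keys = {job_key(j) for j in batch["jobs"]}
        if keys.isdisjoint(active_keys):
            batch["status"] = "executing"
            active_jobs.extend(batch["jobs"])
            active_keys |= keys

def executable_jobs(executing_batches):
    """Ready or failed jobs of the highest-priority executing batches."""
    if not executing_batches:
        return []
    top = max(batch["priority"] for batch in executing_batches)
    return [job for batch in executing_batches if batch["priority"] == top
            for job in batch["jobs"]
            if job["status"] in ("ready", "execution_failed")]
```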
FIG. 8 is a flowchart of an embodiment of executing the data jobs in the job set in the data refresh method according to the present disclosure.
In step 810, a process pool of suitable size is created, and the task timeout, the number of failure retries, and similar parameters are set according to the system configuration in the database.
In step 820, the database is queried for the script addresses corresponding to all job tasks in the ready state.
In step 830, the task script packages are downloaded from the network disk and decompressed into the designated directory.
In step 840, a data job that can execute its data task, i.e., a job in the ready state, is taken, and it is checked whether the corresponding task is within its executable time range. If so, step 850 is executed; otherwise step 870 is executed.
In step 850, it is checked whether the task corresponding to the current data job stays within the parallelism allowed for that task. If so, step 860 is executed; otherwise step 870 is executed.
In step 860, the state of the current data job is changed to executing, and the job is submitted to the process pool to start running.
In step 870, it is checked whether all ready jobs have been examined. If so, step 880 is executed; otherwise step 840 is executed.
In step 880, the results of all executed data jobs are collected and the job states are updated. If a job executed successfully, its state is changed to execution success; otherwise it is changed to execution failed. If a job fails and it was already in the execution-failed state before this run, it is checked whether the maximum number of failure retries has been reached: if so, the state is changed to retry failed; otherwise the state remains execution failed and the job's failure-retry count is incremented by 1.
In step 890, all batches whose jobs entered execution are checked; if all jobs of a batch are in the execution-success or replacement-complete state, the batch state is changed to execution success.
In step 8100, each job in the waiting state is checked; if all of its upstream jobs are in the execution-success, replacing, or replacement-complete state, the job state is changed to ready.
In this embodiment, the parallelism, the allowed running time, and other parameters of each refresh task are configured before the data jobs are executed, so that the computing resources can be used reasonably.
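A minimal sketch of this execution loop is given below, using Python's standard process pool. The job function, the state and retry bookkeeping, and the field names are simplified assumptions rather than the scheduler's real interfaces.

```python
from concurrent.futures import ProcessPoolExecutor

def run_job(script_path, params):
    """Placeholder for launching one data-calculation script (assumption)."""
    # A real implementation would spawn the MapReduce/Spark job and wait for it.
    return {"script": script_path, "params": params}

def execute_ready_jobs(jobs, pool_size=4, timeout_s=3600, max_retries=3):
    """Run ready jobs in a process pool and update their states (sketch)."""
    with ProcessPoolExecutor(max_workers=pool_size) as pool:
        futures = {pool.submit(run_job, job["script"], job["params"]): job
                   for job in jobs if job["status"] == "ready"}
        for future, job in futures.items():
            try:
                future.result(timeout=timeout_s)
                job["status"] = "execution_success"
            except Exception:
                # Covers job errors and timeouts alike in this simplified sketch.
                job["retries"] = job.get("retries", 0) + 1
                job["status"] = ("retry_failed" if job["retries"] >= max_retries
                                 else "execution_failed")
```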
Fig. 9 is a flowchart of an embodiment of replacing original data based on job results in the data refresh method according to the present disclosure.
In step 910, a process pool of suitable size is created according to the system configuration.
In step 920, it is checked whether there is a refresh batch in the executing or execution-success state that has not yet been examined. If so, one such refresh batch is taken and step 930 is executed; otherwise the process ends. The executing state means that the batch's data jobs are being executed, and the execution-success state means that its data jobs have all executed successfully.
In step 930, it is checked whether the current batch is in the execution-success state. If so, step 940 is executed; otherwise step 960 is executed.
In step 940, it is checked whether the current batch is within its replaceable time range. If so, step 950 is executed; otherwise step 970 is executed.
In step 950, all jobs of the current batch whose state is execution success are marked as replaceable jobs, and step 990 is executed.
In step 960, it is checked whether the current batch is in the executing state. If so, step 970 is executed; otherwise step 920 is executed.
In step 970, it is checked whether the current batch contains tasks that need to be replaced ahead of time. If so, step 980 is executed; otherwise step 920 is executed.
In step 980, the corresponding tasks in the current batch are marked as tasks to be replaced ahead of time, and their jobs whose state is execution success are marked as replaceable jobs.
In step 990, the replaceable jobs are submitted to the process pool for replacement.
In step 9100, the process waits for the replacement jobs in the process pool to finish and updates the states of all replaced jobs; for example, a successful replacement is changed to the replacement-complete state and a failed replacement to the replacement-failed state.
In step 9110, the state of the current batch is updated: if all of its jobs are in the replacement-complete state, the batch state is changed to replacement complete.
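The selection logic of steps 930-980 can be summarized as in the sketch below; the batch and job fields are assumptions, and the now argument stands for the current scheduling time.

```python
def replaceable_jobs(batch, now):
    """Pick the jobs of one batch whose cache data may be replaced online now (sketch)."""
    done = [job for job in batch["jobs"] if job["status"] == "execution_success"]
    if batch["status"] == "execution_success":
        start, end = batch["replace_window"]
        if start <= now <= end:
            # Whole batch finished and the replacement window is open: replace everything.
            return done
    if batch["status"] in ("execution_success", "executing"):
        # Otherwise only tasks marked for ahead-of-time replacement are swapped in early.
        ahead = set(batch.get("switch_ahead_tasks", []))
        return [job for job in done if job["task"] in ahead]
    return []
```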
The specific implementation of step 990 is shown in Fig. 10.
In step 1010, it is checked whether the job result data that needs to be replaced online belongs to a Hive table. If so, step 1020 is executed; otherwise success is returned. The underlying data is mainly stored in HDFS in the form of Hive tables and is then served for business data queries by other real-time query engines or databases. An offline Hive table contains time partitions and other business partitions; the data in a Hive table cannot be modified with a simple update operation and can only be recalculated and updated at the granularity of a partition.
In step 1020, the corresponding table partitions are marked as not queryable.
In step 1030, the addresses corresponding to the table partitions affected by the current job and the addresses of the HDFS files are resolved.
In step 1040, the corresponding partitions of the online table are dropped.
In step 1050, the data files corresponding to the online table partitions are deleted, and the data files in the cache table are moved to the locations corresponding to the online table.
In step 1060, the corresponding partitions of the cache table are dropped, and the data is then mounted to the corresponding partitions of both the online table and the cache table.
In step 1070, it is checked whether the sizes of the files pointed to by the corresponding partitions of the online table and the cache table are consistent. If so, step 1080 is executed; otherwise the process ends in failure.
In step 1080, it is determined whether the online data needs to be pushed to a real-time database such as MySQL, MemSQL, or ClickHouse. If so, step 1090 is executed; otherwise the process ends successfully.
In step 1090, the corresponding old data in the real-time database is deleted, and the partition data corresponding to the online table is exported and loaded into the real-time database.
In step 10100, it is checked whether the updated data in the real-time database is consistent with the data in the Hive table. If so, the process ends successfully; otherwise it ends in failure.
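Steps 1040-1060 amount to swapping HDFS directories and re-mounting partitions. A hedged sketch using the standard hive -e and hdfs dfs command-line tools is shown below; the table names, partition column, and paths are assumptions, and a production system would also perform the size check of step 1070 and proper error handling.

```python
import subprocess

ONLINE = "dm_sales"
CACHE = "dm_sales_cache"
ONLINE_LOC = "/warehouse/online/dm_sales"
CACHE_LOC = "/warehouse/cache/dm_sales"

def sh(cmd):
    # Run one shell command and raise if it fails (no retries in this sketch).
    subprocess.run(cmd, shell=True, check=True)

def swap_partition(dt):
    """Move the refreshed partition dt from the cache table onto the online table."""
    part = f"(dt='{dt}')"
    online_dir = f"{ONLINE_LOC}/dt={dt}"
    cache_dir = f"{CACHE_LOC}/dt={dt}"
    # Steps 1040-1050: drop the online partition, remove its files, move in the new files.
    sh(f'hive -e "ALTER TABLE {ONLINE} DROP IF EXISTS PARTITION {part}"')
    sh(f"hdfs dfs -rm -r -f {online_dir}")
    sh(f"hdfs dfs -mv {cache_dir} {online_dir}")
    # Step 1060: drop the cache partition and mount both tables on the new location.
    sh(f'hive -e "ALTER TABLE {CACHE} DROP IF EXISTS PARTITION {part}"')
    sh(f'hive -e "ALTER TABLE {ONLINE} ADD PARTITION {part} LOCATION \'{online_dir}\'"')
    sh(f'hive -e "ALTER TABLE {CACHE} ADD PARTITION {part} LOCATION \'{online_dir}\'"')
```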
In one embodiment, because data calculation is slow, a cache table is created for every offline Hive online table so that online queries are not affected; the cache table is an external table whose data and metadata are independent. All data calculation is performed against the cache table, and the results are then swapped into the online Hive table and loaded into other real-time tables as required.
As shown in Fig. 11, during normal operation with no refresh in progress, the data sits in the online table and the cache table holds no data, but the partitions of the cache table also point to the data addresses of the online table. During a data refresh, the job results are written to the HDFS addresses corresponding to the cache table; at this point the data in the cache table is new and the data in the online table is old. Downstream jobs can read the cache table directly for their own downstream refresh. Once all jobs have written their new data to the cache table, a unified replacement is performed: the data corresponding to the online table is deleted, the new data in the cache table is moved to the data addresses corresponding to the online table, and finally the data is mounted to the corresponding partitions of both the online table and the cache table, returning to the initial state with the data updated.
The replacement onto the online table determines when the refreshed new data becomes visible to users. Although the replacement itself is brief, the data being replaced is marked as not queryable shortly before and after the replacement to ensure the correctness of user queries. If the front end happens to query exactly this data, the user can be reminded that the data is being updated and to try again later.
In this embodiment, users can keep using the old version of the data while it is being updated, and in addition the data provided through all channels is switched from the old version to the new version at the same time, avoiding inconsistencies in the data obtained by users.
In another embodiment of the present disclosure, if the current time falls in the error-summary reporting period, the recorded error information is processed in a unified manner. The recorded error information includes failure to generate the refresh batch and data jobs according to the refresh configuration file, failure to replace the data in the online table with the data in the cache table, failure of data job execution, and the like. The specific implementation is shown in Fig. 12.
In step 1210, it is checked whether the current time is within the error-summary reporting period. If so, step 1220 is executed; otherwise the subsequent steps are not performed.
In step 1220, all unprocessed historical error information is aggregated.
In step 1230, the alarm information of the current period is recorded and printed to the log.
In step 1240, it is checked whether the error information list of the current period is empty. If it is not empty, step 1250 is executed; otherwise the process ends.
In step 1250, the error information generated in the current period is recorded and printed to the log.
In step 1260, an error alarm is sent by short message and mail to the person responsible for the data refresh task, and the process periodically returns to step 1210.
In this embodiment, because a lot of data is refreshed at the same time, the normal logs and error alarms of all jobs would otherwise be mixed together, making errors difficult to troubleshoot. Unified log processing stores the log of each job in a separate file for front-end display, records the error alarms raised during job execution, and finally consolidates the alarms once per scheduling period.
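The consolidation step can be as simple as grouping the buffered error records once per reporting period and emitting a single alarm per owner. The sketch below assumes an in-memory error list and a pluggable notifier; both, like the field names, are illustrative.

```python
import time
from collections import defaultdict

# Each record: {"time": epoch seconds, "owner": ..., "job": ..., "message": ...}
ERRORS = []

def report_errors(notify, period_start, period_end):
    """Summarize the errors of one scheduling period and alert each owner once (sketch)."""
    current = [e for e in ERRORS if period_start <= e["time"] < period_end]
    if not current:
        return
    by_owner = defaultdict(list)
    for err in current:
        by_owner[err["owner"]].append(f'{err["job"]}: {err["message"]}')
    for owner, lines in by_owner.items():
        # notify could send a short message or a mail; here it is just a callable.
        notify(owner, "data refresh errors in this period:\n" + "\n".join(lines))

# Example usage with a stand-in notifier (SMS/mail integration is not shown).
report_errors(lambda owner, text: print(owner, text),
              period_start=time.time() - 3600, period_end=time.time())
```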
Fig. 13 is a schematic structural diagram of an embodiment of a data refresh apparatus according to the present disclosure. The data refresh apparatus is a scheduling system and includes a refresh-configuration parsing module 1310, a task-job execution module 1320, and a data replacement module 1330.
The refresh-configuration parsing module 1310 is configured to generate a refresh batch and data jobs according to a refresh configuration file, where one data refresh flow is treated as one refresh batch, each data calculation is treated as one data task, and the calculation process corresponding to one data task in the refresh batch is treated as one data job. The refresh configuration file includes a source task name, a refresh task time range, refresh task parameters, and the like.
The refresh-configuration parsing module 1310 is also configured to construct a directed acyclic graph based on the upstream and downstream dependencies between data tasks.
The task-job execution module 1320 is configured to execute the data jobs based on the upstream and downstream dependencies between data tasks, for example, based on the directed acyclic graph.
The data replacement module 1330 is configured to replace the original data based on the job results of the data jobs; that is, the original historical data is replaced, and the historical data is updated to the data obtained through multi-level data calculation.
In this embodiment, the refresh batch and all the data jobs it contains are generated according to the refresh configuration file, the data jobs are then executed based on the upstream and downstream dependencies between data tasks, and the historical data is replaced according to the job results. This reduces the degree of manual involvement in refreshing offline historical data and ensures the integrity and accuracy of data updates at every level.
Fig. 14 is a schematic structural diagram of another embodiment of the data refresh apparatus according to the present disclosure. The data refresh apparatus includes a refresh-configuration parsing module 1410, an execution-plan parsing module 1420, a task-job execution module 1430, a data cache module 1440, and a data replacement module 1450.
The refresh-configuration parsing module 1410 is configured to parse the refresh configuration file into a refresh batch and data jobs with their various states, and to construct a directed acyclic graph based on the upstream and downstream dependencies between data tasks.
The execution-plan parsing module 1420 is configured to select, from the data jobs corresponding to all data tasks, the job set for which a data refresh needs to be performed. Specifically, the execution-plan parsing module 1420 determines the job set in the active state; if all data jobs of a refresh batch waiting to execute its data jobs do not intersect the active job set, it changes the state of that batch to executing; it then selects the refresh batch with the highest priority from all batches whose data jobs are being executed, and takes the data jobs of that highest-priority batch as the job set for which the data refresh is performed. The active job set includes the data jobs of refresh batches whose data jobs are being executed and the data jobs of refresh batches whose data jobs have all been completed.
The task-job execution module 1430 is configured to execute the data jobs in the job set based on the directed acyclic graph and to determine the job results of the data jobs that have completed their data tasks. For example, a data job in the job set is executed if its corresponding data task is within its executable time range and does not exceed the task parallelism.
The data cache module 1440 is configured to store the job results of the data jobs that have completed their data tasks in a cache table.
The data replacement module 1450 is configured to replace the data in the online table with the data in the cache table. For example, if a refresh batch whose data tasks have been completed is within its replaceable time range, the job results of that batch stored in the cache table replace the data in the online table; or, if there is a refresh batch whose data jobs are being executed and which contains tasks to be replaced ahead of time, the data in the online table is replaced with the job results of that batch stored in the cache table.
In one embodiment, after the data replacement module 1450 replaces the data in the online table with the data in the cache table, it checks whether the size of the data files in the cache table is consistent with the size of the data files in the online table; if so, the data replacement is judged successful, otherwise it is judged failed.
In one embodiment, the data in the online table is deleted and, at the same time, the data in the cache table is moved to the data address corresponding to the online table.
In the above embodiment, the results of the data calculation performed by the jobs are not written directly to the online table. Instead, a cache table is created over the online Hive table in the underlying data store, and the cache-table data is replaced online at a scheduled time. This keeps the historical data available while it is being refreshed, that is, users can keep using the old version of the data during the update, and data consistency is ensured.
In another embodiment of the present disclosure, the data refresh apparatus further includes an error-summary reporting module 1460 configured to process the recorded error information in a unified manner when the current time falls in the error-summary reporting period. The recorded error information includes failure to generate the refresh batch and data jobs according to the refresh configuration file, failure to replace the data in the online table with the data in the cache table, and failure of data job execution.
In this embodiment, because a lot of data is refreshed at the same time, the normal logs and error alarms of all jobs would otherwise be mixed together, making errors difficult to troubleshoot. Unified log processing stores the log of each job in a separate file for front-end display, records the error alarms raised during job execution, and finally consolidates the alarms once per scheduling period.
Fig. 15 is a schematic structural diagram of still another embodiment of the data refresh apparatus according to the present disclosure. The data refresh apparatus includes a memory 1510 and a processor 1520. The memory 1510 may be a magnetic disk, a flash memory, or any other non-volatile storage medium and is used to store the instructions of the embodiments corresponding to Figs. 1-12. The processor 1520, coupled to the memory 1510, may be implemented as one or more integrated circuits, such as a microprocessor or microcontroller, and is configured to execute the instructions stored in the memory.
In one embodiment, as shown in Fig. 16, the data refresh apparatus 1600 includes a memory 1610 and a processor 1620. The processor 1620 is coupled to the memory 1610 via a bus 1630. The data refresh apparatus 1600 may further be connected to an external storage device 1650 through a storage interface 1640 in order to access external data, and may further be connected to a network or another computer system (not shown) through a network interface 1660, which will not be described in detail here.
In this embodiment, data and instructions are stored in the memory and processed by the processor, which reduces the degree of manual involvement in refreshing offline historical data, ensures the integrity and accuracy of data updates at every level, and makes reasonable use of computing resources. In addition, during the update, users can keep using the old version of the data, and the data provided through all channels is switched from the old version to the new version at the same time, avoiding inconsistencies in the data obtained by users.
FIG. 17 is a schematic structural diagram of an embodiment of a data refresh system according to the present disclosure. The system includes a user service device 1710, a data refresh apparatus 1720, and a data storage device 1730. The data refresh apparatus 1720 is the scheduling system described in detail in the embodiments above.
The user service device 1710 is mainly a web service that provides a simple operation and display interface for users; users operate at this layer, and how specific tasks are scheduled and executed is transparent to them. The user service device 1710 mainly includes three sub-modules: a task submission module, a task monitoring module, and a system configuration module.
The task submission module collects the users' historical-data refresh requests, generates the refresh configuration files, and submits them to the scheduling system. The refresh task information includes the source task name, the time range of the tasks to re-run, and the re-run task parameters. The task monitoring module displays the execution states of the refresh batches and jobs to users and enforces access control, so that a data owner can modify the execution state of his or her own data tasks. The system configuration module configures the upstream and downstream dependencies, the corresponding script address, the parameter information, and so on for each task in the refresh system; the system configuration is stored in a real-time database such as MySQL. This module also configures the storage of each task's execution script, which is generally kept in a cheaper storage system such as HDFS or a network disk.
The data storage device 1730 is used to store the data of the Hive online tables, the Hive cache tables, and the other real-time database tables.
In this embodiment, a refresh of the most upstream data drives the refresh of all related data, which reduces manual involvement and guarantees the reliability of the refresh at the system level. In addition, the configurable execution framework allows computing resources to be allocated reasonably across the whole system. Moreover, the cache table is used to store the data calculation results, which are uniformly replaced online at a scheduled time, so the data remains usable while the historical data is being refreshed and data consistency is ensured.
In another embodiment, a computer-readable storage medium has stored thereon computer program instructions which, when executed by a processor, implement the steps of the method in the corresponding embodiments of fig. 1-12. As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, apparatus, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Thus far, the present disclosure has been described in detail. Some details that are well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.
Although some specific embodiments of the present disclosure have been described in detail by way of example, it should be understood by those skilled in the art that the foregoing examples are for purposes of illustration only and are not intended to limit the scope of the present disclosure. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the present disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (32)

1. A data refresh method, comprising:
generating a refresh batch and data jobs according to a refresh configuration file, wherein one data refresh flow is treated as one refresh batch, one data calculation process is treated as one data task, and one execution of a data task is treated as one data job;
executing the data jobs based on upstream and downstream dependencies between data tasks; and
replacing original data based on job results of the data jobs.
2. The data refresh method according to claim 1, further comprising:
constructing a directed acyclic graph based on the upstream and downstream dependencies between the data tasks; and
executing the data jobs based on the directed acyclic graph.
3. The data refresh method according to claim 2, further comprising:
selecting, from the data jobs corresponding to all the data tasks, a job set for which a data refresh needs to be performed; and
executing the data jobs in the job set.
4. The data refresh method according to claim 3, wherein selecting the job set for which the data refresh needs to be performed from the data jobs corresponding to all the data tasks comprises:
determining a job set in an active state;
if all data jobs in a refresh batch waiting to execute its data jobs do not intersect the job set in the active state, changing the state of that refresh batch to executing its data jobs;
selecting a refresh batch with the highest priority from all refresh batches whose data jobs are being executed; and
taking the data jobs in the refresh batch with the highest priority as the job set for which the data refresh needs to be performed.
5. The data refresh method according to claim 4, wherein
the job set in the active state comprises the data jobs of refresh batches whose data jobs are being executed and the data jobs of refresh batches whose data jobs have all been completed.
6. The data refresh method according to claim 3, wherein executing the data jobs in the job set comprises:
executing a data job in the job set if the data task corresponding to the data job is within its executable time range and does not exceed the task parallelism.
7. The data refresh method according to any one of claims 1-6, wherein replacing the original data based on the job results of the data jobs comprises:
storing the job results of the data jobs that have completed their data tasks in a cache table; and
replacing the data in an online table with the data in the cache table.
8. The data refresh method according to claim 7, further comprising:
determining whether a refresh batch whose data tasks have been completed is within a replaceable time range; and
if so, replacing the data in the online table with the job results of the refresh batch stored in the cache table.
9. The data refresh method according to claim 7, further comprising:
determining whether there is a refresh batch whose data jobs are being executed and which needs to perform a data replacement task in advance; and
if so, replacing the data in the online table with the job results of the refresh batch stored in the cache table.
10. The data refresh method according to claim 7, wherein replacing the data in the online table with the data in the cache table further comprises:
determining whether the size of the data files in the cache table is consistent with the size of the data files in the online table; if so, the data replacement succeeds, otherwise the data replacement fails.
11. The data refresh method according to claim 10, wherein
the data in the online table is deleted and, at the same time, the data in the cache table is moved to the data address corresponding to the online table.
12. The data refresh method according to any one of claims 1-6, wherein the refresh configuration file comprises refresh task data information including a source task name, a refresh task time range, and refresh task parameters.
13. The data refresh method according to any one of claims 1-6, further comprising:
if the current time is within an error-summary reporting period, processing the recorded error information in a unified manner.
14. The data refresh method according to claim 13, wherein
the recorded error information comprises at least one of: failure to generate the refresh batch and the data jobs according to the refresh configuration file, failure to replace the data in the online table with the data in the cache table, and failure of data job execution.
15. A data re-brushing apparatus, comprising:
a re-brushing configuration file parsing module configured to generate re-brushing batches and data jobs according to a re-brushing configuration file, wherein one data re-brushing flow serves as one re-brushing batch, one data calculation process serves as one data task, and one execution of a data task serves as one data job;
a task job execution module configured to execute the data jobs based on upstream-downstream dependencies between data tasks;
and a data replacement module configured to replace original data based on the job result of the data job.
16. The data re-brushing apparatus according to claim 15, wherein
the re-brushing configuration file parsing module is further configured to build a directed acyclic graph based on the upstream-downstream dependencies between the data tasks;
and the task job execution module is further configured to execute the data jobs based on the directed acyclic graph.
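To illustrate claim 16, the dependency graph can be executed in topological order; the sketch below uses Kahn's algorithm over a hypothetical mapping of each data task to its upstream tasks (the patent does not mandate this particular traversal):

    from collections import defaultdict, deque

    def topological_order(upstreams: dict) -> list:
        """Order data tasks so that every task runs only after all of its
        upstream tasks; `upstreams` maps each task name to the set of task
        names it depends on, and every task must appear as a key."""
        indegree = {task: len(ups) for task, ups in upstreams.items()}
        downstream = defaultdict(set)
        for task, ups in upstreams.items():
            for up in ups:
                downstream[up].add(task)

        ready = deque(task for task, deg in indegree.items() if deg == 0)
        order = []
        while ready:
            task = ready.popleft()
            order.append(task)
            for nxt in downstream[task]:
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    ready.append(nxt)

        if len(order) != len(upstreams):
            raise ValueError("cycle detected: the dependency graph is not acyclic")
        return order

    # Example: ods -> dwd -> ads; the data jobs of 'ads_report' wait for both.
    print(topological_order({
        "ods_order": set(),
        "dwd_order": {"ods_order"},
        "ads_report": {"dwd_order", "ods_order"},
    }))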
17. The data re-brushing apparatus according to claim 16, further comprising:
an execution plan parsing module configured to select, from the data jobs corresponding to all the data tasks, the job set on which data re-brushing needs to be performed;
wherein the task job execution module is further configured to execute the data jobs in the job set.
18. The data re-brushing apparatus according to claim 17, wherein
the execution plan parsing module is configured to determine a job set in an active state; change the state of a re-brushing batch that is waiting to execute data jobs to executing data jobs if none of its data jobs intersects with the job set in the active state; select the re-brushing batch with the highest priority from all re-brushing batches that are executing data jobs; and take the data jobs in the re-brushing batch with the highest priority as the job set on which data re-brushing needs to be performed.
19. The data re-brushing apparatus according to claim 18, wherein
the job set in the active state includes the data jobs in re-brushing batches that are executing data jobs and the data jobs in re-brushing batches that have completed all data jobs.
20. The data re-brushing apparatus according to claim 17, wherein
the task job execution module is further configured to execute a data job in the job set if the data task corresponding to the data job is within its executable time range and the task parallelism is not exceeded.
21. The data re-brushing apparatus according to any one of claims 15-20, further comprising:
a data caching module configured to store the job result of the data job that has completed its data task in a cache table;
wherein the data replacement module is further configured to replace the data in the online table with the data in the cache table.
22. The data re-brushing apparatus according to claim 21, wherein
the data replacement module is further configured to determine whether a re-brushing batch that has completed execution of its data tasks is within the replaceable time range, and if so, replace the data in the online table with the job result of the re-brushing batch stored in the cache table.
23. The data re-brushing apparatus according to claim 21, wherein
the data replacement module is further configured to determine whether there is a re-brushing batch that is executing data jobs and needs to execute a data replacement task in advance, and if so, replace the data in the online table with the job result of that re-brushing batch stored in the cache table.
24. The data re-brushing apparatus according to claim 21, wherein
the data replacement module is further configured to, after replacing the data in the online table with the data in the cache table, determine whether the size of the data file in the cache table is consistent with the size of the data file in the online table; if so, confirm that the data replacement succeeded, and otherwise, confirm that the data replacement failed.
25. The data re-brushing apparatus according to claim 22, wherein
the data replacement module is further configured to delete the data in the online table and, at the same time, move the data in the cache table to the data address corresponding to the online table.
26. The data re-brushing apparatus according to any one of claims 15-20, wherein the re-brushing configuration file comprises re-brushing task data information including a source task name, a re-brushing task time range, and a re-brushing task parameter.
27. The data re-brushing apparatus according to any one of claims 15-20, further comprising:
an error summary reporting module configured to process the recorded error information in a unified manner if the current time is within the error summary reporting time period.
28. The data re-brushing apparatus according to claim 27, wherein
the recorded error information comprises at least one of: a failure to generate a re-brushing batch and data jobs according to the re-brushing configuration file, a failure to replace the data in the online table with the data in the cache table, and a failure to execute a data job.
29. A data re-brushing apparatus, comprising:
a memory; and
a processor coupled to the memory, the processor being configured to perform the data re-brushing method according to any one of claims 1 to 14 based on instructions stored in the memory.
30. A computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the data re-brushing method according to any one of claims 1 to 14.
31. A data re-brushing system, comprising a user service device, a data storage device, and a data re-brushing apparatus according to any one of claims 15 to 29.
32. The data re-brushing system according to claim 31, wherein the user service device comprises:
a task submission module configured to generate the re-brushing configuration file according to a data re-brushing request of a user;
a task monitoring module configured to show the execution states of the re-brushing batches and the data jobs to the user;
and a system configuration module configured to configure system resources.
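Finally, the task submission module of claim 32 can be imagined as the piece that turns a user's re-brushing request into the configuration file consumed by the apparatus of claim 15; the JSON layout and function name below are illustrative assumptions only:

    import json

    def submit_rebrush_request(source_task: str, start_date: str, end_date: str,
                               parameters: dict, config_path: str) -> None:
        """Write a re-brushing configuration file for one source task; the
        parsing module later expands it into a re-brushing batch and data jobs."""
        config = {
            "tasks": [{
                "source_task_name": source_task,
                "time_range": [start_date, end_date],
                "parameters": parameters,
            }],
        }
        with open(config_path, "w", encoding="utf-8") as fh:
            json.dump(config, fh, ensure_ascii=False, indent=2)
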
CN201811023618.4A 2018-09-04 Data re-brushing method, device and system Active CN110928884B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811023618.4A CN110928884B (en) 2018-09-04 Data re-brushing method, device and system

Publications (2)

Publication Number Publication Date
CN110928884A (en) 2020-03-27
CN110928884B (en) 2024-05-17

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100162248A1 (en) * 2008-12-22 2010-06-24 Microsoft Corporation Complex dependency graph with bottom-up constraint matching for batch processing
US20110087637A1 (en) * 2009-10-09 2011-04-14 International Business Machines Corporation Method and System for Database Recovery
US20150379061A1 (en) * 2013-02-20 2015-12-31 Quick Eye Technologies Inc. Managing changes to information
CN106528853A (en) * 2016-11-28 2017-03-22 中国工商银行股份有限公司 Data interaction management device and cross-database data interaction processing device and method
CN106897311A (en) * 2015-12-21 2017-06-27 财团法人工业技术研究院 Database batch rekeying method, data convert daily record production method and storage device
CN107315761A (en) * 2017-04-17 2017-11-03 阿里巴巴集团控股有限公司 A kind of data-updating method, data query method and device

Similar Documents

Publication Publication Date Title
US10554771B2 (en) Parallelized replay of captured database workload
US10698892B2 (en) Order-independent multi-record hash generation and data filtering
US10216584B2 (en) Recovery log analytics with a big data management platform
CN107239335B (en) Job scheduling system and method for distributed system
US10956422B2 (en) Integrating event processing with map-reduce
CN110245023B (en) Distributed scheduling method and device, electronic equipment and computer storage medium
CN106708740B (en) Script testing method and device
CN107016480B (en) Task scheduling method, device and system
CN104423960A (en) Continuous project integration method and continuous project integration system
US11226985B2 (en) Replication of structured data records among partitioned data storage spaces
US10949218B2 (en) Generating an execution script for configuration of a system
US11256608B2 (en) Generating test plans for testing computer products based on product usage data
CN112835924A (en) Real-time computing task processing method, device, equipment and storage medium
CN116009428A (en) Industrial data monitoring system and method based on stream computing engine and medium
CN110908793A (en) Long-time task execution method, device, equipment and readable storage medium
CN111782201A (en) Method and device for realizing linkage of service codes and layout topological graph
CN110928884B (en) Data re-brushing method, device and system
CN110928884A (en) Data re-brushing method, device and system
CN113220530B (en) Data quality monitoring method and platform
CN116010452A (en) Industrial data processing system and method based on stream type calculation engine and medium
WO2021037684A1 (en) System for persisting application program data objects
Klein et al. Quality attribute-guided evaluation of NoSQL databases: an experience report
Fördős et al. CRDTs for the configuration of distributed Erlang systems
US20220382236A1 (en) Shared automated execution platform in cloud
CN112667597B (en) Algorithm model full life cycle management tool system and implementation method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant