Detailed Description
To make the purpose, technical solutions and advantages of the present application clearer, the technical solutions in the present application will be clearly and completely described below with reference to the accompanying drawings in some embodiments of the present application, and it is obvious that the described embodiments are some, but not all embodiments of the present application. The components of the present application, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, as presented in the figures, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art based on a part of the embodiments in the present application without any creative effort belong to the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
In a scenario where an ETL system integrates data from a source device and loads it into a destination device, the ETL system may, for example, proceed in batch order and integrate the multiple data items belonging to the same batch together.
Because the process of the ETL system may crash, or the server running the ETL system may go down, the ETL system may be restarted; after a restart, data needs to be obtained from the source device again, integrated, and loaded to the destination device.
To avoid re-integrating and reloading all data from the source device to the destination device, the ETL system may record last-run metadata (last run metadata) while extracting data from the source device; when the ETL system is restarted, it may re-integrate the data of the batch and load it to the destination device according to the recorded last-run metadata.
However, the ETL system may be restarted while a batch is being integrated. If, after the restart, the ETL system reloads that batch based on the recorded last-run metadata, the part of the batch that was already loaded before the restart may be loaded a second time; if the ETL system instead starts loading from the next batch, the part of the batch that had not yet been processed may never be processed, so that the destination device loses part of the data. In either case, the data processing accuracy is low.
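As a non-limiting illustration of the above problem, the following minimal Python sketch simulates a batch of five records of which three were loaded before the restart. The function name `replay`, the policy names, and the record identifiers are hypothetical and are introduced only for this illustration.

```python
def replay(batch, loaded_before_crash, policy):
    """Return what the destination holds after a crash and a restart.

    policy "reload_batch": the interrupted batch is processed again in full.
    policy "skip_batch":   processing resumes from the next batch.
    """
    # Records that reached the destination before the crash.
    destination = list(batch[:loaded_before_crash])
    if policy == "reload_batch":
        destination.extend(batch)  # every record of the batch is loaded again
    elif policy == "skip_batch":
        pass  # the rest of the interrupted batch is never processed
    else:
        raise ValueError(policy)
    return destination


batch = ["r1", "r2", "r3", "r4", "r5"]
reloaded = replay(batch, 3, "reload_batch")
skipped = replay(batch, 3, "skip_batch")

# Records loaded more than once under the "reload_batch" policy.
duplicates = sorted(r for r in set(reloaded) if reloaded.count(r) > 1)
# Records that never reach the destination under the "skip_batch" policy.
missing = sorted(set(batch) - set(skipped))
```

Under these assumptions, reloading the batch duplicates the three records that were already loaded, and skipping to the next batch loses the two records that were not.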
To address at least some of the above drawbacks, some possible embodiments of the present application provide the following: for a plurality of target data extracted from the source device, the ETL system creates and stores data operation information corresponding to each target data; the ETL system then performs conversion processing on each target data and loads each converted target data to the destination device; and, for each converted target data that is successfully loaded to the destination device, the ETL system deletes the data operation information corresponding to that target data. In this way, when the ETL system is restarted, it can re-integrate and load only the target data corresponding to the data operation information that remains stored, without re-integrating and loading all data of the batch, so that data which has already been loaded successfully is not loaded again. Nor is any part of the batch discarded, so that the destination device does not lose data, and the data processing accuracy is improved.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 shows a schematic architecture diagram of an ETL system provided in the present application, which may include a data extraction module (Extractor Manager), a data transformation module (Transformer Manager), and a data loading module (Loader Manager); the data processing method provided by the application can be realized by executing the corresponding functional modules of the data extraction module, the data conversion module and the data loading module.
Referring to fig. 2, fig. 2 shows a schematic block diagram of a server 100 provided in the present application, where the server 100 may include a memory 101, a processor 102 and a communication interface 103, and the memory 101, the processor 102 and the communication interface 103 are electrically connected to each other directly or indirectly to implement data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines.
The memory 101 may be used for storing software programs and modules, such as program instructions/modules corresponding to the ETL system provided in the present application, and the processor 102 executes the software programs and modules stored in the memory 101 to execute various functional applications and data processing, so as to operate the ETL system provided in the present application, and further execute the steps of the data processing method provided in the present application. The communication interface 103 may be used for communicating signaling or data with other node devices.
The memory 101 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The processor 102 may be an integrated circuit chip having signal processing capabilities. The processor 102 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
It will be appreciated that the configuration shown in fig. 2 is merely illustrative and that the server 100 may include more or fewer components than shown in fig. 2 or have a different configuration than shown in fig. 2. The components shown in fig. 2 may be implemented in hardware, software, or a combination thereof.
The data processing method provided by the present application is schematically described below with a server operating the ETL system shown in fig. 1 as an exemplary execution subject.
Referring to fig. 3, fig. 3 shows a schematic flow chart of a data processing method provided in the present application, which may include the following steps in some possible embodiments:
Step 201, for a plurality of target data extracted from a source device, creating and storing data operation information corresponding to each target data.
Step 203, converting each target data, and loading each converted target data to the destination device.
Step 205, deleting the data operation information corresponding to the target data, with respect to the converted target data successfully loaded to the destination device.
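As a non-limiting illustration, steps 201, 203 and 205 can be sketched in Python as follows. The dictionaries `operation_info` and `destination`, the function names, the record keys, and the upper-casing conversion are hypothetical stand-ins; the present application does not prescribe any of them.

```python
def transform(record):
    # Placeholder conversion; the actual conversion is application-specific.
    return record.upper()


def run_batch(records, operation_info, destination):
    # Step 201: create and store operation information for each target data.
    for key, record in records.items():
        operation_info[key] = record

    for key, record in records.items():
        # Step 203: convert the target data and load it to the destination.
        destination[key] = transform(record)
        # Step 205: the load succeeded, so delete the operation information.
        del operation_info[key]


operation_info = {}
destination = {}
run_batch({"b1-0": "alice", "b1-1": "bob"}, operation_info, destination)
```

When the batch completes without interruption, `operation_info` is empty again; any entry still present after an interruption identifies target data that has not yet been loaded.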
In some embodiments, referring to fig. 1, while integrating data from a source device and loading it to a destination device, the ETL system may obtain data from the source device in batches for integrated processing; for the plurality of target data extracted, the ETL system may create and store data operation information corresponding to each target data, where the data operation information may be used to indicate the target data that the ETL system is currently processing.
For example, in some possible embodiments of the present application, the ETL system may create and store a data snapshot corresponding to each target data as the data operation information corresponding to that target data; or, the ETL system may create and store, according to the data batch number of each target data and its serial number within the data batch, a data verification code corresponding to each target data as the data operation information corresponding to that target data; the implementation manner of the data operation information is not limited in the present application.
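As a non-limiting illustration of a data verification code derived from a batch number and a serial number, the following Python sketch hashes the two values. The use of SHA-256, the `batch:serial` key layout, and the truncation to 16 hexadecimal characters are assumptions made for this sketch only; the present application does not specify how the code is computed.

```python
import hashlib


def verification_code(batch_number: int, serial_number: int) -> str:
    # The separator keeps (1, 12) and (11, 2) from producing the same key.
    key = f"{batch_number}:{serial_number}".encode("utf-8")
    # Truncated digest, used here only as a compact identifier.
    return hashlib.sha256(key).hexdigest()[:16]


code_a = verification_code(1, 12)
code_b = verification_code(11, 2)
code_a_again = verification_code(1, 12)
```

The same batch number and serial number always give the same code, so the code can be recomputed after a restart and compared with the stored data operation information.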
Next, the ETL system may perform conversion processing on each target data, and load each converted target data to the destination device.
It should be noted that, in some possible scenarios, the ETL system may enable multiple threads to process each target data in parallel during the process of performing the integrated processing of the data. Therefore, in the process of executing step 203, the conversion processes of each target data may be independent from each other, that is, the conversion processing processes of any two target data do not have a necessary sequence, and for one target data, after the ETL system creates and stores the data operation information corresponding to the target data, the ETL system may execute the operation of converting the target data, and load the converted target data to the destination device.
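As a non-limiting illustration of processing each target data independently on multiple threads, the following Python sketch uses a thread pool. The worker count, the lock, the record keys, and the doubling conversion are hypothetical choices made for this sketch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

lock = threading.Lock()   # guards the two shared dictionaries
operation_info = {}
destination = {}


def process(key, record):
    # Create and store the operation information for this target data.
    with lock:
        operation_info[key] = record
    # Conversion is independent of every other target data.
    converted = record * 2
    # Load, then delete the operation information for this target data.
    with lock:
        destination[key] = converted
        del operation_info[key]
    return key


records = {f"b1-{i}": i for i in range(8)}
with ThreadPoolExecutor(max_workers=4) as pool:
    done = list(pool.map(lambda kv: process(*kv), records.items()))
```

No ordering between records is relied upon: each record only requires that its own operation information be stored before its own conversion begins.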
Then, for each converted target data that is successfully loaded to the destination device, the ETL system may delete the data operation information corresponding to the target data to indicate that the corresponding target data has been successfully converted and loaded to the destination device.
Thus, in some embodiments of the present application, when the ETL system is restarted, the ETL system can re-integrate and load only the target data corresponding to the data operation information that remains stored, without re-integrating and loading all the data of the batch, so that data which has already been loaded successfully is not loaded again. Nor is any part of the batch discarded, so that the destination device does not lose data, and the data processing accuracy is improved.
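As a non-limiting illustration of resuming from the remaining data operation information, the following Python sketch interrupts a batch of four records after two of them have been loaded and then runs again. The `Crash` exception, the `crash_after` parameter, and the record keys are hypothetical devices used only to simulate a restart.

```python
class Crash(Exception):
    """Stands in for a process crash or server downtime."""


def run(operation_info, destination, crash_after=None):
    # Work only on target data whose operation information is still stored.
    for n, (key, record) in enumerate(list(operation_info.items())):
        if crash_after is not None and n == crash_after:
            raise Crash()
        destination.append((key, record.upper()))  # convert and load
        del operation_info[key]                    # delete after a successful load


operation_info = {"b1-0": "a", "b1-1": "b", "b1-2": "c", "b1-3": "d"}
destination = []

try:
    run(operation_info, destination, crash_after=2)  # interrupted mid-batch
except Crash:
    pass

remaining_after_crash = sorted(operation_info)
run(operation_info, destination)                     # restart
loaded_keys = [key for key, _ in destination]
```

In this sketch only the two records whose operation information survived the interruption are processed after the restart, so each record reaches the destination exactly once.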
In some possible embodiments, as shown in fig. 1, the architecture of the ETL system may include a data extraction module, a data conversion module, and a data loading module; in addition, a first blocking Queue (Block Queue) may be created for the data extraction module, and a second blocking Queue may be created for the data conversion module, where the first blocking Queue may be used to store the data extracted by the data extraction module, and the second blocking Queue may be used to store the data converted by the data conversion module.
Based on this, in the course of executing step 201, the ETL system may extract a plurality of target data from the source device by the data extraction module, and add each target data to the first blocking queue after the data extraction module creates and stores the data operation information corresponding to each target data.
Also, in the process of the ETL system executing step 203, the data conversion module may perform the operation of converting each target data, that is, the data conversion module obtains target data from the first blocking queue, performs conversion processing on each obtained target data, and adds the converted target data to the second blocking queue.
In addition, in the process of executing step 203, the ETL system may further perform, by the data loading module, the operation of loading each converted target data to the destination device, that is, the data loading module obtains the converted target data from the second blocking queue and loads each converted target data to the destination device.
In addition, in the process of the ETL system executing step 205, the data loading module may delete, for the converted target data that is successfully loaded to the destination device, the data operation information corresponding to that target data.
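As a non-limiting illustration of the three modules connected by two blocking queues, the following Python sketch runs each module on its own thread. The queue capacity, the `SENTINEL` end marker, the lock, and the record keys are assumptions of this sketch and are not taken from the present application.

```python
import queue
import threading

SENTINEL = object()                   # hypothetical end-of-data marker
first_queue = queue.Queue(maxsize=2)  # filled by the data extraction module
second_queue = queue.Queue(maxsize=2) # filled by the data conversion module
operation_info = {}
destination = {}
lock = threading.Lock()


def extractor(source):
    for key, record in source:
        with lock:
            operation_info[key] = record   # store before enqueueing
        first_queue.put((key, record))     # blocks while the queue is full
    first_queue.put(SENTINEL)


def transformer():
    while True:
        item = first_queue.get()           # blocks while the queue is empty
        if item is SENTINEL:
            second_queue.put(SENTINEL)
            break
        key, record = item
        second_queue.put((key, record.upper()))


def loader():
    while True:
        item = second_queue.get()
        if item is SENTINEL:
            break
        key, converted = item
        with lock:
            destination[key] = converted   # load to the destination
            del operation_info[key]        # delete after a successful load


source = [("b1-0", "x"), ("b1-1", "y"), ("b1-2", "z")]
threads = [
    threading.Thread(target=extractor, args=(source,)),
    threading.Thread(target=transformer),
    threading.Thread(target=loader),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The bounded queues let a faster module wait for a slower one instead of accumulating unprocessed data without limit.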
In some embodiments, for the data operation information maintained by the ETL system, the ETL system may create a temporary storage area (Staging Area), store the created data operation information in the temporary storage area, and delete the data operation information from the temporary storage area in the process of executing step 205.
In addition, as can be seen from the architecture of the ETL system shown in fig. 1, the process of integrating data from the source device and loading it to the destination device may include two internal processing flows, namely extracting data from the source device into the first blocking queue, and obtaining data from the first blocking queue, converting it, and adding the converted data to the second blocking queue; the process may further include one external processing flow, namely loading the data in the second blocking queue to the destination device.
Therefore, when the ETL system is restarted, it may need to resume the operation of obtaining data from the first blocking queue for conversion, and it may also need to resume the operation of obtaining data from the second blocking queue for loading to the destination device.
Based on this, in order to improve the granularity of data processing, in some embodiments of the present application, the data operation information may include a first data snapshot and a second data snapshot, where the first data snapshot may be used to indicate that the corresponding target data is in a state of having been extracted from the source device to the first blocking queue, and the second data snapshot may be used to indicate that the corresponding target data is in a state of having been converted and saved to the second blocking queue.
Thus, in the process of executing step 201, the ETL system can create and store a first data snapshot corresponding to each target data to indicate that the corresponding target data is in a state of having been extracted from the source device to the first blocking queue.
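As a non-limiting illustration of a temporary storage area whose contents survive a restart, the following Python sketch keeps the data operation information in an SQLite file. The class name `StagingArea`, the table layout, and the choice of SQLite are assumptions of this sketch; the present application does not state how the temporary storage area is implemented.

```python
import os
import sqlite3
import tempfile


class StagingArea:
    """Hypothetical file-backed store for data operation information."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS operation_info "
            "(key TEXT PRIMARY KEY, payload TEXT NOT NULL)"
        )
        self.conn.commit()

    def save(self, key, payload):
        self.conn.execute(
            "INSERT OR REPLACE INTO operation_info VALUES (?, ?)",
            (key, payload),
        )
        self.conn.commit()

    def delete(self, key):
        self.conn.execute("DELETE FROM operation_info WHERE key = ?", (key,))
        self.conn.commit()

    def remaining(self):
        rows = self.conn.execute(
            "SELECT key, payload FROM operation_info ORDER BY key"
        )
        return dict(rows.fetchall())

    def close(self):
        self.conn.close()


path = os.path.join(tempfile.mkdtemp(), "staging.db")
area = StagingArea(path)
area.save("b1-0", "alice")
area.save("b1-1", "bob")
area.delete("b1-0")   # "b1-0" was loaded successfully
area.close()          # simulate the ETL system stopping

reopened = StagingArea(path)   # simulate the restart
left_over = reopened.remaining()
reopened.close()
```

Reopening the same file after the simulated restart returns only the entry that had not been deleted.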
Next, in the process of executing step 203, the ETL system may perform conversion processing on each target data, create and store a second data snapshot corresponding to each target data after the conversion is successful, delete a first data snapshot corresponding to the target data, and load each converted target data to the destination device.
That is, when the ETL system obtains target data from the first blocking queue and performs conversion processing on each target data successfully, the ETL system may create and store a second data snapshot corresponding to the target data that is successfully converted, so as to indicate that the target data that is successfully converted is currently in a state of being subjected to data conversion and being stored in the second blocking queue; and when the ETL system successfully converts the target data and adds the converted target data to the second blocking queue, the ETL system may delete the first data snapshot corresponding to the target data.
In addition, in the process of executing step 205, the ETL system may delete, for the converted target data that is successfully loaded to the destination device, the second data snapshot corresponding to the target data that is successfully converted, so as to indicate that the data conversion loading is completed.
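As a non-limiting illustration of how the first data snapshot and the second data snapshot track a single target data through steps 201, 203 and 205, consider the following Python sketch. The dictionaries, the function names, and the stage labels are hypothetical.

```python
first_snapshots = {}    # extracted to the first blocking queue, not yet converted
second_snapshots = {}   # converted and saved to the second blocking queue
destination = {}


def on_extracted(key, record):
    # Step 201: create and store the first data snapshot.
    first_snapshots[key] = record


def on_converted(key, converted):
    # Step 203: store the second data snapshot, then delete the first one.
    second_snapshots[key] = converted
    del first_snapshots[key]


def load(key):
    # Step 203 (load), then step 205: delete the second data snapshot.
    destination[key] = second_snapshots[key]
    del second_snapshots[key]


def stage(key):
    if key in second_snapshots:
        return "converted"
    if key in first_snapshots:
        return "extracted"
    if key in destination:
        return "loaded"
    return "unknown"


history = [stage("b1-0")]
on_extracted("b1-0", "a")
history.append(stage("b1-0"))
on_converted("b1-0", "A")
history.append(stage("b1-0"))
load("b1-0")
history.append(stage("b1-0"))
```

At any moment the snapshot that exists for a target data tells which processing flow still has to be completed for it.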
It should be noted that, in the scenario shown in fig. 1, all the first data snapshots and all the second data snapshots that are saved in the above-mentioned manner may be saved in the same operation space of the temporary storage area, or may be saved in different operation spaces.
And based on the configured first data snapshot and second data snapshot, when the ETL system is restarted, the ETL system may read all the first data snapshots and all the second data snapshots saved in the temporary storage area, and resume processing of the target data.
For the stored first data snapshots, the ETL system may perform conversion processing on target data corresponding to each stored first data snapshot; that is to say, for the saved first data snapshot, when the ETL system is restarted, the ETL system may read the saved first data snapshot from the temporary storage area, where the saved first data snapshot may be used to indicate that the corresponding data is in a state of being extracted from the source device to the first blocking queue, and then the ETL system may obtain, from the first blocking queue, the target data corresponding to each saved first data snapshot, perform conversion processing on the obtained target data, and continue to perform the subsequent steps of creating and saving the corresponding second data snapshot.
In addition, for the saved second data snapshots, the ETL system may load the target data corresponding to each saved second data snapshot to the destination device; that is to say, for the saved second data snapshot, when the ETL system is restarted, the ETL system may read the saved second data snapshot from the temporary storage area, where the saved second data snapshot may be used to indicate that the corresponding target data is in a state of being subjected to data conversion and saved to the second blocking queue, and then the ETL system may obtain the target data corresponding to each saved second data snapshot from the second blocking queue, and load the obtained target data to the destination device, so as to complete the integrated loading of the data.
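As a non-limiting illustration of resuming from the saved first and second data snapshots, consider the following Python sketch. It assumes that each snapshot carries the data it describes, so that the data can be processed again after a restart; the function name `recover` and the handling of a key present in both snapshot sets are likewise assumptions of this sketch and are not discussed in the present application.

```python
def recover(first_snapshots, second_snapshots, destination, transform):
    # Saved second snapshots: conversion is done, only loading is outstanding.
    for key in list(second_snapshots):
        destination[key] = second_snapshots[key]
        del second_snapshots[key]
        # Assumption: if a first snapshot for the same key was not yet
        # deleted before the restart, it is dropped here.
        first_snapshots.pop(key, None)

    # Saved first snapshots: extracted, but not yet converted.
    for key in list(first_snapshots):
        converted = transform(first_snapshots[key])
        second_snapshots[key] = converted   # create the second snapshot
        del first_snapshots[key]            # delete the first snapshot
        destination[key] = converted        # load to the destination
        del second_snapshots[key]           # delete the second snapshot


first = {"b1-2": "c", "b1-3": "d"}   # state read from the temporary storage area
second = {"b1-1": "B"}
dest = {"b1-0": "A"}                 # loaded before the restart
recover(first, second, dest, str.upper)
```

In this sketch the record loaded before the restart is left untouched, the already converted record is only loaded, and the two unconverted records are converted and then loaded.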
In addition, based on the same inventive concept as the above data processing method provided by the present application, the present application also provides an ETL system as shown in fig. 1, where the ETL system is used to load data from a source device to a destination device; wherein:
the ETL system is specifically configured to create and store, for a plurality of target data extracted from the source device, data operation information corresponding to each of the target data;
the ETL system is further configured to perform conversion processing on each target data and load each converted target data to the destination device;
the ETL system is further configured to, for the converted target data that is successfully loaded to the destination device, delete the data operation information corresponding to the target data.
Optionally, as a possible implementation, the data operation information includes a first data snapshot and a second data snapshot;
when the ETL system creates and stores the data operation information corresponding to each target data, the ETL system is specifically configured to:
creating and storing a first data snapshot corresponding to each target data;
when the ETL system performs conversion processing on each target data and loads each converted target data to the destination device, the ETL system is specifically configured to:
performing conversion processing on each target data, creating and storing a second data snapshot corresponding to each target data after the conversion is successful, and deleting a first data snapshot corresponding to the target data;
loading each converted target data to the destination device;
when the ETL system deletes, for the converted target data that is successfully loaded to the destination device, the data operation information corresponding to the target data, the ETL system is specifically configured to:
for the converted target data that is successfully loaded to the destination device, deleting the second data snapshot corresponding to the converted target data.
Optionally, as a possible implementation manner, when the ETL system is restarted, the ETL system is further configured to:
for the stored first data snapshots, performing conversion processing on the target data corresponding to each stored first data snapshot;
and, for the stored second data snapshots, loading the target data corresponding to each stored second data snapshot to the destination device.
Optionally, as a possible implementation manner, the ETL system includes a data extraction module, a data conversion module, and a data loading module;
the data extraction module is configured to extract a plurality of target data from the source device, create and store data operation information corresponding to each target data, and then add each target data to a first blocking queue;
the data conversion module is configured to obtain target data from the first blocking queue, perform conversion processing on each obtained target data, and add the converted target data to a second blocking queue;
the data loading module is configured to obtain the converted target data from the second blocking queue and load each converted target data to the destination device; and,
for the converted target data that is successfully loaded to the destination device, to delete the data operation information corresponding to the target data.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to some embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s).
It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in some embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, or the portion thereof that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the method according to some embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disk.
The above description is only a few examples of the present application and is not intended to limit the present application, and those skilled in the art will appreciate that various modifications and variations can be made in the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.