CN114297273A - Data extraction method and device - Google Patents


Info

Publication number
CN114297273A
Authority
CN
China
Prior art keywords
data
extraction
extracted
data block
message
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111466573.XA
Other languages
Chinese (zh)
Inventor
林鹏程
叶姣荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dt Dream Technology Co Ltd
Original Assignee
Hangzhou Dt Dream Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dt Dream Technology Co Ltd filed Critical Hangzhou Dt Dream Technology Co Ltd
Priority to CN202111466573.XA priority Critical patent/CN114297273A/en
Publication of CN114297273A publication Critical patent/CN114297273A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data extraction method and a data extraction device, which are applied to an ETL tool; the method comprises the following steps: responding to a starting instruction of a data exchange task aiming at data to be extracted, and sending a data extraction request aiming at the data to be extracted to a data source end; the data source end is based on a distributed file system and realizes data storage by taking a data block as a unit; receiving and recording data block information of a plurality of data blocks corresponding to the data to be extracted, which is returned by the data source end; respectively receiving an extraction starting message and an extraction ending message corresponding to each data block in the plurality of data blocks; and if an extraction starting message and an extraction ending message corresponding to each data block in the plurality of data blocks are received, determining that the data to be extracted is extracted, and determining that the data exchange task is successfully executed.

Description

Data extraction method and device
Technical Field
The present application relates to the field of data exchange technologies, and in particular, to a data extraction method and apparatus, an electronic device, and a machine-readable storage medium.
Background
ETL (Extract-Transform-Load) can be used to describe the process of extracting, transforming, and loading data from a data source to a destination. The ETL is an important link of a Business Intelligence (BI) project, and can integrate scattered, disordered and non-uniform data in an enterprise to provide an analysis basis for the decision of the enterprise.
In practical applications, an ETL tool can be used to create and execute a data exchange task, thereby exchanging data between a data source end and a destination end; the task start time, the task end time and the task execution result of the data exchange task can be obtained from the log recorded by the ETL tool.
In the related art, if the data exchange task is interrupted due to network abnormality or the like, after the data exchange task is restarted, part of data which has been exchanged needs to be deleted and the data exchange task needs to be executed again, which results in time waste.
Disclosure of Invention
The application provides a data extraction method, which is applied to an ETL tool; the ETL tool is used for executing a data exchange task of extracting, converting and loading data stored by a data source end to a destination end; the data source end is based on a distributed file system and realizes data storage by taking a data block as a unit; the method comprises the following steps:
responding to a starting instruction of a data exchange task aiming at data to be extracted, and sending a data extraction request aiming at the data to be extracted to the data source end; the data source end obtains data block information of a plurality of data blocks corresponding to the data to be extracted, and returns the data block information to the ETL tool;
receiving and recording data block information of a plurality of data blocks corresponding to the data to be extracted, which is returned by the data source end;
respectively receiving an extraction starting message and an extraction ending message corresponding to each data block in the plurality of data blocks; wherein the extraction start message is used for indicating to start extracting the data stored in the corresponding data block; the extraction end message is used for indicating that the extraction of the data stored in the corresponding data block is finished;
and if an extraction starting message and an extraction ending message corresponding to each data block in the plurality of data blocks are received, determining that the data to be extracted is extracted, and determining that the data exchange task is successfully executed.
Optionally, if an extraction start message and an extraction end message corresponding to each of the data blocks are received, determining that the data to be extracted has been extracted includes:
if an extraction starting message and an extraction ending message corresponding to any data block in the data blocks are received, determining that the data block is extracted;
and if the data blocks are extracted completely, determining that the data to be extracted is extracted completely.
Optionally, the method further includes:
and if the extraction starting message corresponding to any one of the data blocks is received and the extraction ending message corresponding to the data block is not received, determining that the data block is not extracted completely.
Optionally, the method further includes:
in response to receiving the extraction start message corresponding to the data block, determining whether an extraction end message corresponding to the data block is received within a preset time length;
if the extraction ending message corresponding to the data block is received within the preset time length, determining that the data block is extracted;
and if the extraction ending message corresponding to the data block is not received within the preset time length, determining that the data block is abnormal in extraction.
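The timeout check described above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the names `BlockStatus` and `classify_block`, the use of seconds as the time unit, and the intermediate "in progress" state for a block whose deadline has not yet passed are all assumptions made for the example.

```python
from enum import Enum
from typing import Optional


class BlockStatus(Enum):
    EXTRACTED = "extracted"
    IN_PROGRESS = "in progress"   # assumed state: deadline not yet reached
    ABNORMAL = "abnormal"


def classify_block(start_time: float, end_time: Optional[float],
                   now: float, timeout_s: float) -> BlockStatus:
    """Classify one data block from the arrival times of its messages (seconds)."""
    if end_time is not None and end_time - start_time <= timeout_s:
        # The end message arrived within the preset duration.
        return BlockStatus.EXTRACTED
    if end_time is None and now - start_time <= timeout_s:
        # No end message yet, but the preset duration has not elapsed.
        return BlockStatus.IN_PROGRESS
    # No end message was received within the preset duration.
    return BlockStatus.ABNORMAL
```

In this sketch an end message that arrives after the preset duration is still treated as abnormal, which is one possible reading of the condition above.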
Optionally, the method further includes:
and if the extraction of part of the data blocks in the plurality of data blocks is abnormal, determining that the extraction of the data to be extracted is abnormal, thereby determining that the execution of the data exchange task is abnormal.
Optionally, the method further includes:
if the data exchange task is determined to be abnormally executed, periodically sending a recovery extraction request aiming at the data to be extracted to the data source end according to a preset period; wherein, the recovery extraction request carries the data block information of the data block which is not extracted in the plurality of data blocks; and the data source end continues to extract the data to be extracted based on the data block information of the data block which is not extracted.
Optionally, the method further includes:
if the data exchange task is determined to be abnormally executed, sending a recovery extraction request aiming at the data to be extracted to the data source end in response to a restart instruction aiming at the data exchange task; wherein, the recovery extraction request carries the data block information of the data block which is not extracted in the plurality of data blocks; and the data source end continues to extract the data to be extracted based on the data block information of the data block which is not extracted.
Optionally, the data source end includes a big data platform.
The application also provides a data extraction device which is applied to the ETL tool; the ETL tool is used for executing a data exchange task of loading data stored by the data source end to the destination end after extraction, cleaning and conversion; the data source end is based on a distributed file system and realizes data storage by taking a data block as a unit; the device comprises:
the sending unit is used for responding to a starting instruction of a data exchange task aiming at data to be extracted, and sending a data extraction request aiming at the data to be extracted to the data source end; the data source end obtains data block information of a plurality of data blocks corresponding to the data to be extracted, and returns the data block information to the ETL tool;
the receiving unit is used for receiving and recording data block information of a plurality of data blocks corresponding to the data to be extracted, which is returned by the data source end;
the receiving unit is further configured to receive an extraction start message and an extraction end message corresponding to each of the plurality of data blocks respectively; wherein the extraction start message is used for indicating to start extracting the data stored in the corresponding data block; the extraction end message is used for indicating that the extraction of the data stored in the corresponding data block is finished;
and the determining unit is used for determining that the data to be extracted is extracted completely if an extraction starting message and an extraction ending message corresponding to each data block in the plurality of data blocks are received, so that the data exchange task is successfully executed.
The application also provides an electronic device, which comprises a communication interface, a processor, a memory and a bus, wherein the communication interface, the processor and the memory are mutually connected through the bus;
the memory stores machine-readable instructions, and the processor executes the method by calling the machine-readable instructions.
The present application also provides a machine-readable storage medium having stored thereon machine-readable instructions which, when invoked and executed by a processor, implement the above-described method.
Through the above embodiments, when the ETL tool executes a data exchange task for data to be extracted, it acquires data block information of the plurality of data blocks corresponding to the data to be extracted and receives an extraction start message and an extraction end message for each of those data blocks. From these it can determine the extraction result of each data block, and in turn the extraction result of the data to be extracted and the execution result of the data exchange task. The ETL tool can therefore track the execution progress of the data exchange task, which makes resuming from a breakpoint possible.
Subsequently, if the data exchange task is executed abnormally, the ETL tool may determine, based on the received data block information, extraction start messages and extraction end messages, which of the data blocks corresponding to the data to be extracted have not been fully extracted, and send a recovery extraction request for the data to be extracted to the data source end, so that the data source end can continue the extraction based on the data block information of the unextracted data blocks carried in the recovery extraction request. Extraction thus resumes from the breakpoint: the data blocks already extracted need not be deleted, and the data to be extracted need not be extracted again from the start, which improves the execution efficiency of the data exchange task after a restart.
Drawings
FIG. 1 is a schematic block diagram of a data exchange system according to the related art;
FIG. 2 is a flow diagram of a data extraction method according to an exemplary embodiment;
FIG. 3 is a hardware structure diagram of an electronic device in which a data extraction apparatus is located according to an exemplary embodiment;
FIG. 4 is a block diagram of a data extraction apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described herein. In some other embodiments, the method may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps for description in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.
In order to make those skilled in the art better understand the technical solution in the embodiment of the present disclosure, the following briefly describes the related art of data exchange related to the embodiment of the present disclosure.
ETL (Extract-Transform-Load) can be used to describe the process of extracting, transforming, and loading data from a data source to a destination. The ETL is an important link of a Business Intelligence (BI) project, and can integrate scattered, disordered and non-uniform data in an enterprise to provide an analysis basis for the decision of the enterprise.
In practical application, an ETL tool can be used to create and execute a data exchange task, so as to implement data exchange between a data source end and a data destination end. Referring to fig. 1, fig. 1 is a schematic diagram of a data exchange system in the related art. As shown in fig. 1, the data exchange system may include at least a data source 102, an ETL tool 104, and a destination 106.
The data source 102 and destination 106 may comprise different types of databases. For example, the data source 102 may specifically include different business systems corresponding to an enterprise; the destination 106 may specifically include, but is not limited to, a data warehouse or the like.
The ETL tool 104 may be deployed in a cluster manner, so as to improve the data exchange capability of the ETL tool and support access to more data sources. As shown in fig. 1, the ETL tool 104 may specifically include a control node and several working nodes; the control node may uniformly manage the plurality of working nodes, and distribute a data exchange task to one or more working nodes, so that the corresponding working nodes perform an ETL process of data exchange.
In the system architecture shown in fig. 1, the process of implementing data exchange between a data source end and a destination end may specifically include the following. A user can create a data exchange task for target data through the ETL tool 104, and can also configure data extraction rules, data conversion rules, and the like for the data exchange task as required. When the created data exchange task is started, the ETL tool 104 may establish connections with the data source 102 and the destination 106, and may extract the target data from the data source 102 according to the preconfigured data extraction rules. The ETL tool 104 may then convert the extracted data according to the preconfigured data conversion rules. Finally, the ETL tool 104 may load the converted data to the destination 106, that is, send the converted data to the destination 106 so that the destination 106 persists the received data to disk.
In the above data exchange process, the ETL tool 104 may also record the task start time, the task end time, and the task execution result of the data exchange task through a log.
It can be seen that in the embodiment illustrated above, the log information recorded by the ETL tool only shows whether the data exchange task is being executed and whether it eventually succeeded. If the data exchange task is interrupted by an abnormality in the network environment, data extraction, data conversion, data loading, data transmission or the like, the task can only be restarted; that is, the data that has already been extracted, converted and loaded must be deleted and the whole data exchange process executed again. In scenarios where the amount of data to be exchanged is large, re-executing an interrupted task wastes correspondingly more time.
In view of this, for data extraction in the ETL process, the present specification aims to provide a technical solution that an ETL tool can determine whether each data block in the data to be extracted completes extraction.
When the method is implemented, the ETL tool can respond to a starting instruction of a data exchange task aiming at data to be extracted, and sends a data extraction request aiming at the data to be extracted to a data source end; the data source end is based on a distributed file system and realizes data storage by taking a data block as a unit; further, data block information of a plurality of data blocks corresponding to the data to be extracted, which is returned by the data source end, may be received and recorded;
further, an extraction start message and an extraction end message corresponding to each of the data blocks may be received respectively; wherein the extraction start message is used for indicating to start extracting the data stored in the corresponding data block; the extraction end message is used for indicating that the extraction of the data stored in the corresponding data block is finished; and if an extraction starting message and an extraction ending message corresponding to each data block in the plurality of data blocks are received, determining that the data to be extracted is extracted, and determining that the data exchange task is successfully executed.
Therefore, in the technical solution of this specification, when executing a data exchange task for data to be extracted, the ETL tool acquires data block information of the plurality of data blocks corresponding to the data to be extracted and receives an extraction start message and an extraction end message for each of those data blocks. From these it can determine the extraction result of each data block, and in turn the extraction result of the data to be extracted and the execution result of the data exchange task. The ETL tool can therefore track the execution progress of the data exchange task, which makes resuming from a breakpoint possible.
Subsequently, if the data exchange task is executed abnormally, the ETL tool may determine, based on the received data block information, extraction start messages and extraction end messages, which of the data blocks corresponding to the data to be extracted have not been fully extracted, and send a recovery extraction request for the data to be extracted to the data source end, so that the data source end can continue the extraction based on the data block information of the unextracted data blocks carried in the recovery extraction request. Extraction thus resumes from the breakpoint: the data blocks already extracted need not be deleted, and not all data blocks corresponding to the data to be extracted need to be extracted again, which improves the execution efficiency of the data exchange task after a restart.
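As an illustration of how the recovery extraction request might be assembled, the following sketch selects the data blocks for which a start message and an end message have not both been received. The function name `build_recovery_request` and the dictionary layout of the request are hypothetical; this specification does not prescribe a wire format.

```python
def build_recovery_request(locations, started, ended):
    """Return a hypothetical recovery extraction request payload."""
    # A block counts as extracted only if both its start and end messages arrived;
    # every other block is carried in the request, in the originally recorded order.
    pending = [loc for loc in locations
               if not (loc in started and loc in ended)]
    return {"type": "recovery_extraction", "blocks": pending}


request = build_recovery_request(
    locations=["location 1", "location 2", "location 3"],
    started={"location 1", "location 2"},
    ended={"location 1"},
)
```

Here location 2 was started but never finished and location 3 was never started, so both are carried in the request, while location 1 is left untouched.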
In addition, those skilled in the art can understand that, on the basis of the technical scheme disclosed in this specification for extracting data in units of data blocks, data conversion and data loading in the ETL process can be further performed in units of data blocks, and a conversion result and a loading result of each data block can be recorded, so that the execution progress of each data block corresponding to each link in a data exchange task can be known through an ETL tool, and the breakpoint resume capability can be enabled.
The present application is described below with reference to specific embodiments and specific application scenarios.
Referring to fig. 2, fig. 2 is a flowchart illustrating a data extraction method applied to an ETL tool according to an exemplary embodiment, where the method performs the following steps:
step 202: responding to a starting instruction of a data exchange task aiming at data to be extracted, and sending a data extraction request aiming at the data to be extracted to a data source end; the data source end obtains data block information of a plurality of data blocks corresponding to the data to be extracted and returns the data block information to the ETL tool;
step 204: receiving and recording data block information of a plurality of data blocks corresponding to the data to be extracted, which is returned by the data source end;
step 206: respectively receiving an extraction starting message and an extraction ending message corresponding to each data block in the plurality of data blocks; wherein the extraction start message is used for indicating to start extracting the data stored in the corresponding data block; the extraction end message is used for indicating that the extraction of the data stored in the corresponding data block is finished;
step 208: and if an extraction starting message and an extraction ending message corresponding to each data block in the plurality of data blocks are received, determining that the data to be extracted is extracted, and determining that the data exchange task is successfully executed.
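Steps 204 to 208 can be sketched on the ETL side as a small bookkeeping structure. This is a minimal example under stated assumptions: the class name `ExtractionTracker`, its method names, and the use of location strings as block identifiers are illustrative and are not taken from this specification.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionTracker:
    """Hypothetical ETL-side record of per-block extraction progress."""
    locations: list                                  # block info recorded in step 204
    started: set = field(default_factory=set)
    ended: set = field(default_factory=set)

    def on_start(self, location: str) -> None:
        # Step 206: an extraction start message arrived for this block.
        self.started.add(location)

    def on_end(self, location: str) -> None:
        # Step 206: an extraction end message arrived for this block.
        self.ended.add(location)

    def block_done(self, location: str) -> bool:
        # A block is extracted only if both messages were received.
        return location in self.started and location in self.ended

    def task_succeeded(self) -> bool:
        # Step 208: every recorded block has both a start and an end message.
        return all(self.block_done(loc) for loc in self.locations)


tracker = ExtractionTracker(["location 1", "location 2"])
tracker.on_start("location 1")
tracker.on_end("location 1")
tracker.on_start("location 2")
partial = tracker.task_succeeded()    # location 2 has no end message yet
tracker.on_end("location 2")
complete = tracker.task_succeeded()
```

Keeping the start and end records separate, rather than a single "done" flag, preserves the distinction between a block that was never started and one that was started but not finished.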
In this specification, the data source end may implement data storage in units of data blocks (blocks) based on a distributed file system.
For example, the data source end may specifically include a database or a big data platform implemented based on Hadoop, Spark, or the like; and, the data source side may store and manage data in units of data blocks.
In order to facilitate better understanding of the technical solutions in the embodiments of the present specification, the following describes the technical solutions of the present specification by taking the data source end as a big data platform implemented based on a Hadoop Distributed File System (HDFS).
The system architecture of the HDFS will be briefly described. The system architecture of the HDFS may generally include an HDFS client (HDFS Client), a name node (NameNode), and data nodes (DataNode). The HDFS client can be used for accessing and managing the HDFS; it can interact with the NameNode to acquire data block information of data to be extracted, and with the DataNodes to read or write data. The NameNode can be used for managing the namespace of the HDFS, managing data block mapping information and configuring the replication policy, and can also process read/write requests initiated by the HDFS client. The DataNodes may be used to store the actual data blocks. In particular, the HDFS has a master/slave (Master/Slave) architecture, and may generally include one NameNode and a plurality of DataNodes.
It should be noted that, in one or more embodiments shown in this specification, describing the data source end as a big data platform implemented based on the HDFS is merely exemplary and does not limit this specification; in practical applications, the technical solution in this specification can also be applied to scenarios where the data source end stores data in units of data blocks based on another distributed file system.
In this specification, the destination may be configured to store data obtained by performing data extraction and data conversion on data stored in the data source. For example, the destination may specifically include, but is not limited to, a relational database, a non-relational database, a data warehouse, and the like, and this specification is not limited thereto.
In this specification, the ETL tool may be configured to create and execute one or more data exchange tasks, that is, the ETL tool may be configured to extract (extract), convert (transform), and load (load) data stored by the data source to the destination.
For example, common ETL tools include Kettle, DataStage, Informatica, and the like. In practical applications, those skilled in the art may adopt or develop different ETL tools as required, and this specification is not limited in this respect.
In this specification, the ETL tool may send a data extraction request for data to be extracted to the data source end in response to a start instruction of a data exchange task for the data to be extracted.
The data to be extracted may include structured data and/or unstructured data stored in the data source end, and may also include file data such as a txt file, an excel file, and an xml file, which is not limited in this specification.
The start instruction of the data exchange task for the data to be extracted may be understood as follows: a user may create the data exchange task through the ETL tool, and may also configure a data source end, a destination end, the data to be extracted, data extraction rules, data cleansing rules, data loading rules, and the like for the data exchange task; when the data exchange task has been created or is started, the ETL tool may detect the start instruction for the data exchange task and begin executing it.
For example, the ETL tool needs to execute a data exchange task of extracting, converting, and loading data to be extracted, which is stored in the HDFS file format in the data source end, to the destination end; the ETL tool may detect a start instruction for the data exchange task when the data exchange task is initiated; in response to the start instruction, the ETL tool may establish a connection with the data source end, and send a data extraction request for the data to be extracted to the data source end; the data extraction request may carry data table information of the data to be extracted, a data extraction rule, and the like.
The data extraction request may include a full data extraction request or an incremental data extraction request. When the method is implemented, before the ETL tool sends a data extraction request for the data to be extracted to the data source, a data range corresponding to the data to be extracted may be determined.
For example, in a scenario of full data extraction, in response to a start instruction of the data exchange task, the ETL tool may determine a data range corresponding to data to be extracted according to relevant configuration information of the data task, such as data table information, and may send a full data extraction request for the data to be extracted to the data source end.
For another example, in a scenario of incremental data extraction, the ETL tool may first determine, by means of a timestamp or the like, which incremental data needs to be exchanged, that is, determine the data range corresponding to the data to be extracted, and may then send an incremental data extraction request for the incremental data to the data source end.
It should be noted that, regarding the specific implementation manner of determining the incremental data, a person skilled in the art may flexibly select different specific implementation manners according to different system architectures, functions, and the like of the data exchange system, and the present specification is not limited thereto. For example, the data source may store the generated incremental data in a new data block, and may also provide data information corresponding to the incremental data to the ETL tool. For another example, a timestamp field may be added to the data stored by the data source for indicating the creation time or the update time of the corresponding data, and the ETL tool may determine the incremental data by comparing the current system time with the timestamp field.
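One possible form of the timestamp-based approach is sketched below. It assumes, purely for illustration, that each record carries an `updated_at` field and that the ETL tool remembers the time of the previous extraction; the function name `select_incremental` and the comparison against a previous-extraction time are assumptions, since this specification leaves the way incremental data is determined open.

```python
from datetime import datetime


def select_incremental(rows, last_extracted_at: datetime):
    """Keep rows whose 'updated_at' is later than the previous extraction time."""
    return [row for row in rows if row["updated_at"] > last_extracted_at]


rows = [
    {"id": 1, "updated_at": datetime(2021, 12, 1, 8, 0)},
    {"id": 2, "updated_at": datetime(2021, 12, 2, 9, 30)},
]
incremental = select_incremental(rows, last_extracted_at=datetime(2021, 12, 1, 12, 0))
```

Only the record updated after the previous extraction time is selected as incremental data in this example.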
In this specification, in response to receiving a data extraction request for the data to be extracted, which is sent by the ETL tool, the data source end may obtain data block information of a plurality of data blocks corresponding to the data to be extracted, and return the data block information to the ETL tool.
In practical applications, the data block information may include address information (location) of the data block.
For example, in response to receiving the data extraction request sent by the ETL tool, the data source end may create an HDFS client, call the open method of a DistributedFileSystem object, and open the HDFS file corresponding to the data to be extracted; the DistributedFileSystem can call the NameNode through RPC (remote procedure call) to obtain the address information (location) of the plurality of data blocks (blocks) corresponding to the data to be extracted, as returned by the NameNode; further, the data source end may return the acquired data block information to the ETL tool.
In this specification, the ETL tool may receive and record data block information of a plurality of data blocks corresponding to the data to be extracted, where the data block information is returned by the data source.
Specifically, the ETL tool may generate a data block information list for recording data block information of a plurality of received data blocks corresponding to the data to be extracted. The data block information list is shown in table 1 as an example.
Data block information
location 1
location 2
……
location n
TABLE 1
For example, as shown in table 1, the data block information of n data blocks corresponding to the data to be extracted is location 1, location 2, … …, and location n, respectively, in sequence, where n is an integer greater than 1. The ETL tool may receive data block information of n data blocks corresponding to the data to be extracted, which is returned by the data source, and record the obtained data block information in a data block information list shown in table 1.
In practical applications, the order in which the ETL tool records the data block information of the plurality of data blocks is the order in which the data source end acquired it. Those skilled in the art will understand that the data block information is ordered according to the Hadoop network topology, that is, data blocks located closer to the HDFS client are listed first; this is not described in further detail here.
It should be noted that, in the related art, after the data source end acquires the data block information of the data blocks corresponding to the data to be extracted from the NameNode, the data source end generally only interacts with the DataNode according to the data block information to read the corresponding data from the DataNode, that is, the data source end does not return the acquired data block information to the ETL tool; therefore, in the related art, the ETL tool cannot know the data block information corresponding to the data to be extracted. In the technical solution of the present specification, the ETL tool may request the data source end to acquire data block information corresponding to the data to be extracted, and use the data block information as a basis for determining the execution progress of the data exchange task in units of data blocks.
In this specification, after acquiring data block information of a plurality of data blocks corresponding to the data to be extracted, the data source end may sequentially establish connection with a data node corresponding to each data block information, and read data from the data node until all data corresponding to the plurality of data blocks are read, or the data exchange task is executed abnormally.
Specifically, after acquiring the data block information of the data blocks corresponding to the data to be extracted, the data source end may, for each data block, establish a connection with the data node corresponding to that block's data block information and send an extraction start message corresponding to the data block to the ETL tool. The data source end may then read the data stored in the data block and send the read data to the ETL tool. When the data stored in the data block has been completely read, the data source end may close the connection with the data node and send an extraction end message corresponding to the data block to the ETL tool.
The extraction starting message may be used to instruct to start extracting data stored in the corresponding data block; the extraction end message may be used to indicate that the extraction of the data stored in the corresponding data block is completed. In practical applications, the extraction start message and the extraction end message may specifically include timestamps corresponding to the extraction start time and the extraction end time, and may also include other identifiers for distinguishing the start state from the end state, such as: "yes" and "no", "1" and "0", and so on; the specific implementation forms of the extraction start message and the extraction end message may be set by those skilled in the art according to requirements, and this specification is not limited.
For example, after acquiring the data block information of the n data blocks corresponding to the data to be extracted, the HDFS client may call a read method, establish a connection with the DataNode corresponding to the data block information location 1, and send an extraction start message "t11" corresponding to location 1 to the ETL tool; further, the HDFS client may read the data stored in the DataNode corresponding to location 1 and send the read data to the ETL tool; further, when the data stored in the DataNode corresponding to location 1 has been completely read, the HDFS client may close the connection with the DataNode and send an extraction end message "t12" corresponding to location 1 to the ETL tool; then, the HDFS client may establish connections in turn with the data nodes corresponding to the remaining data block information and repeat the above data reading process, which is not described in detail here.
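The per-block sequence described above (start message, data, end message) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `extract_blocks`, the `read_block` callback, and the `(kind, location, payload)` message tuple are hypothetical and are not defined by this specification or by any HDFS client API.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical message shape: (kind, location, payload).
Message = Tuple[str, str, object]


def extract_blocks(
    locations: List[str],
    read_block: Callable[[str], Iterable[bytes]],
    clock: Callable[[], float] = time.time,
) -> Iterator[Message]:
    """Yield start/data/end messages for each data block, in list order."""
    for location in locations:
        # A connection to the data node for this block would be opened here.
        yield ("start", location, clock())
        for chunk in read_block(location):
            yield ("data", location, chunk)
        # The connection would be closed here, once the block is fully read.
        yield ("end", location, clock())
```

If `read_block` raises partway through a block, no end message is produced for that block, which is the situation the ETL tool later treats as an incomplete extraction.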
In this specification, the ETL tool may receive an extraction start message and an extraction end message corresponding to each of the several data blocks, respectively.
Specifically, the ETL tool may receive the extraction start message and the extraction end message corresponding to each data block, as well as the data read from each data block; further, the ETL tool may record the received extraction start messages and extraction end messages in the data block information list, as illustrated in Table 2.

| Data block information | Extraction start message | Extraction end message | Data block extraction result |
| --- | --- | --- | --- |
| location 1 | t11 | t12 | Completed |
| …… | …… | …… | Completed |
| location k | tk1 | / | Incomplete |
| location k+1 | / | / | Incomplete |
| …… | / | / | Incomplete |
| location n | / | / | Incomplete |

Table 2
Continuing with the above example, the ETL tool may receive the extraction start message "t11" and the extraction end message "t12" corresponding to location 1 sent by the data source end, and record them in the data block information list as shown in Table 2; similarly, the ETL tool may receive and record the extraction start message "tk1" corresponding to location k. It can be understood that, in Table 2, the ETL tool has received the extraction start message and the extraction end message for each piece of data block information listed before location k, has received the extraction start message but not the extraction end message for location k, and has received neither message for any piece of data block information listed after location k.
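As a rough illustration of how the ETL tool might keep such a list, the sketch below records start and end messages per data block. The names `BlockRecord` and `BlockLedger` are hypothetical, and treating each message as a timestamp is only one of the forms the specification allows.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class BlockRecord:
    start: Optional[float] = None  # extraction start message, e.g. t11
    end: Optional[float] = None    # extraction end message, e.g. t12


class BlockLedger:
    """Data block information list kept in the order returned by the source."""

    def __init__(self, locations: List[str]) -> None:
        # dict preserves insertion order, so rows stay in the received order.
        self.records: Dict[str, BlockRecord] = {loc: BlockRecord() for loc in locations}

    def on_start(self, location: str, timestamp: float) -> None:
        self.records[location].start = timestamp

    def on_end(self, location: str, timestamp: float) -> None:
        self.records[location].end = timestamp

    def rows(self) -> List[Tuple[str, Optional[float], Optional[float]]]:
        return [(loc, rec.start, rec.end) for loc, rec in self.records.items()]
```

A row whose start or end is still `None` corresponds to a "/" cell in the data block information list.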
It should be noted that, in the related art, in a process of reading data from a data node corresponding to data block information by the data source, only the read data is usually returned, that is, the ETL tool does not receive an extraction start message and an extraction end message corresponding to each data block; therefore, the ETL tool cannot know the extraction progress of each data block corresponding to the data to be extracted, and can only know the execution result of the entire data exchange task. In the technical solution of the present specification, the ETL tool may request to acquire, from the data source end, an extraction start message and an extraction end message corresponding to each data block, so as to determine an extraction progress of each data block according to whether the extraction start message and the extraction end message are received.
In practical applications, after the ETL tool extracts the data corresponding to each data block from the data source, the ETL tool may further perform data conversion on the extracted data in units of data blocks. The specific implementation manner of the data conversion is not limited in the specification; for example, the ETL tool may perform one or more data conversion operations such as data filtering, data cleaning, data calculation, data verification, data merging, and field mapping on the extracted data in units of data blocks, and may further record a conversion start message, a conversion end message, and a conversion result respectively corresponding to each data block.
In addition, in practical applications, the ETL tool may further send the converted data to the destination and load the data in units of data blocks. The specific implementation of data loading is not particularly limited in this specification. For example, the ETL tool may extract and cache the data corresponding to one or more data blocks and then write that data to the destination. For another example, the ETL tool may first send the extracted data to the destination and, after receiving the extraction end message corresponding to any data block, send a commit message corresponding to that data block to the destination, so that the destination persists the received data corresponding to the data block to disk. For another example, the ETL tool may send the extracted data to the destination so that the destination persists it to disk and returns to the ETL tool a loading start message, a loading end message, a loading result, and the like corresponding to each data block.
In this specification, if the ETL tool receives an extraction start message and an extraction end message corresponding to each of the several data blocks, it may be determined that the data to be extracted has completed extraction, and thus it may be determined that the data exchange task has been successfully performed.
In practical applications, successful execution of the data exchange task can be understood broadly to mean that all of the data blocks corresponding to the data to be extracted have been extracted from the data source end and that all data conversion and data loading have been completed. Where only the data extraction stage of the data exchange task is of concern, as in this specification, it can be understood narrowly as follows: provided that the subsequent data conversion and data loading complete normally, the data exchange task is executed successfully if the data to be extracted has been completely extracted, and is executed abnormally if it has not.
For example, if the ETL tool receives an extraction start message and an extraction end message corresponding to the data block information location 1, … …, and location n of the n data blocks corresponding to the data to be extracted, it may be determined that all the n data blocks corresponding to the data to be extracted have been extracted, that is, the data to be extracted has been extracted, and in a case that subsequent data conversion and data loading are normally implemented, it may be determined that the data exchange task has been successfully executed.
In one illustrated embodiment, determining that the data to be extracted has completed extraction if an extraction start message and an extraction end message corresponding to each of the data blocks are received includes: if an extraction start message and an extraction end message corresponding to any one of the data blocks are received, determining that the data block has been extracted; and if all of the data blocks have been extracted, determining that the data to be extracted has been extracted.
For example, as shown in Table 2, if the ETL tool receives the extraction start message "t11" and the extraction end message "t12" corresponding to location 1, it can determine that the data block extraction result corresponding to location 1 is "completed"; if the data block extraction results of all n data blocks corresponding to the data to be extracted are "completed", it can determine that the data to be extracted has been extracted.
In another illustrated embodiment, the method further comprises: and if the extraction starting message corresponding to any one of the data blocks is received and the extraction ending message corresponding to the data block is not received, determining that the data block is not extracted completely.
For example, as shown in Table 2, if the ETL tool receives the extraction start message "tk1" corresponding to location k but does not receive the extraction end message corresponding to location k, it may determine that the data block extraction result corresponding to location k is "incomplete".
In addition, if neither the extraction start message nor the extraction end message corresponding to a certain data block is received, it may be determined that the extraction of the data block is not completed; such a data block may also be described as one whose extraction has not started, and this specification is not particularly limited in this respect.
For example, as shown in Table 2, if the ETL tool receives neither the extraction start message nor the extraction end message corresponding to location n, it may determine that the data block extraction result corresponding to location n is "incomplete".
In practical applications, the data block extraction result in the data block information list shown in Table 2 may be initialized to "incomplete", and the ETL tool may update the result for a data block to "completed" in response to receiving the extraction end message corresponding to that data block. Alternatively, in response to determining that the data exchange task has failed, the ETL tool may determine the data block extraction result for each data block from the information recorded for that data block in the data block information list shown in Table 2.
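The second approach, deriving each data block extraction result from the recorded messages, might look like the following minimal sketch. The function names and the `(start, end)` tuple layout are assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: location -> (start message, end message), None if not received.
Messages = Tuple[Optional[float], Optional[float]]


def block_result(start: Optional[float], end: Optional[float]) -> str:
    """A block is completed only when both messages have been received."""
    return "completed" if start is not None and end is not None else "incomplete"


def unfinished_blocks(table: Dict[str, Messages]) -> List[str]:
    """Locations whose extraction is incomplete, in recorded order."""
    return [loc for loc, (start, end) in table.items()
            if block_result(start, end) == "incomplete"]


def extraction_completed(table: Dict[str, Messages]) -> bool:
    """The data to be extracted is done when no block is left unfinished."""
    return not unfinished_blocks(table)
```

In this sketch a block that was never started and a block that was started but not finished both map to "incomplete", matching the description above.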
In the above illustrated embodiment, for any data block, the ETL tool may determine whether the data block is abnormal according to the time interval between the reception of the extraction start message and the extraction end message corresponding to the data block.
When implemented, the method may further comprise: the ETL tool may determine, in response to receiving the extraction start message corresponding to the data block, whether an extraction end message corresponding to the data block is received within a preset time period; if the extraction ending message corresponding to the data block is received within the preset time length, determining that the data block is extracted; if the extraction ending message corresponding to the data block is not received within the preset time length, determining that the data block is abnormal in extraction; if the extraction of part of the data blocks in the data blocks is abnormal, the extraction of the data to be extracted can be determined to be abnormal, so that the execution of the data exchange task can be determined to be abnormal.
For example, after the ETL tool receives the extraction start message "tk1" corresponding to location k, if the extraction end message "tk2" corresponding to location k is received within a preset time period, it can be determined that the data block extraction result corresponding to location k is "completed"; otherwise, it may be determined that the data block extraction result corresponding to location k is "incomplete" and that the data block corresponding to location k has an extraction exception.
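A minimal sketch of this timeout check is shown below. The function name `check_block`, the "pending" state for a block whose preset time period has not yet elapsed, and the use of receipt times in seconds are assumptions added for illustration.

```python
from typing import Optional


def check_block(
    start_received_at: float,
    end_received_at: Optional[float],
    now: float,
    timeout_s: float,
) -> str:
    """Classify one block from the receipt times of its start and end messages."""
    if end_received_at is not None and end_received_at - start_received_at <= timeout_s:
        return "completed"   # end message arrived within the preset time period
    if end_received_at is None and now - start_received_at <= timeout_s:
        return "pending"     # still inside the preset time period, keep waiting
    return "abnormal"        # no end message in time: extraction exception
```

For instance, `check_block(10.0, None, now=16.0, timeout_s=5.0)` classifies the block as abnormal because six seconds have passed without an end message.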
In this specification, if the ETL tool has not received both an extraction start message and an extraction end message for every one of the data blocks, it may determine that the data to be extracted has not been completely extracted, and thus that the data exchange task was executed abnormally. In that case, the ETL tool may send a recovery extraction request for the data to be extracted to the data source end; the recovery extraction request may carry the data block information of the data blocks, among the plurality of data blocks, that have not been completely extracted, so that the data source end can continue to extract the data to be extracted based on that data block information.
For example, as shown in Table 2, the ETL tool has not received both an extraction start message and an extraction end message for every one of the n data blocks corresponding to the data to be extracted. It may therefore determine that the data blocks corresponding to location k, location k+1, ……, and location n have not been completely extracted, that is, that the data to be extracted has not been completely extracted, and thus that the data exchange task for the data to be extracted was executed abnormally. Further, if the data exchange task needs to be resumed, the ETL tool may send a recovery extraction request for the data to be extracted to the data source end, where the recovery extraction request may carry the data block information of the data blocks that have not been completely extracted; in response to the recovery extraction request, the data source end may continue to read data starting from the DataNode corresponding to location k.
The data block information carried in the recovery extraction request may include data block information of all unfinished extraction data blocks in the plurality of data blocks corresponding to the data to be extracted, and may also include data block information of a certain data block which is extracted abnormally from the plurality of data blocks corresponding to the data to be extracted.
For example, as shown in Table 2, the recovery extraction request may carry the data block information location k, location k+1, ……, and location n. For another example, the recovery extraction request may carry only the data block information location k, at which the data extraction was interrupted.
Correspondingly, the data source end may continue to extract the data blocks that have not been completely extracted directly from the data block information carried in the recovery extraction request; alternatively, the data source end may first reacquire the data block information of the data blocks corresponding to the data to be extracted, and then use the data block information carried in the recovery extraction request to determine the entry in the reacquired data block information from which extraction should continue.
For example, as shown in Table 2, the recovery extraction request carries the data block information location k, location k+1, ……, and location n; then, in response to the recovery extraction request, the data source end does not need to reacquire the data block information of the data blocks corresponding to the data to be extracted, and may continue to extract data according to the data block information location k, location k+1, ……, and location n carried in the recovery extraction request.
For another example, as shown in Table 2, the recovery extraction request carries only the data block information location k, at which the data extraction was interrupted; then, in response to the recovery extraction request, the data source end may first reacquire the data block information of the n data blocks corresponding to the data to be extracted, and then continue to extract starting from location k in the reacquired data block information.
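The two request variants and the corresponding handling at the data source end can be sketched as follows. The function names, the `only_first` flag, and the representation of a request as a plain list of locations are hypothetical; the sketch also assumes the reacquired list still contains the interrupted location.

```python
from typing import List, Optional, Set


def build_resume_request(
    locations: List[str], completed: Set[str], only_first: bool = False
) -> List[str]:
    """ETL side: carry every unfinished block, or only the interrupted one."""
    pending = [loc for loc in locations if loc not in completed]
    return pending[:1] if only_first else pending


def resolve_plan(request: List[str], reacquired: Optional[List[str]] = None) -> List[str]:
    """Source side: decide which blocks to read, in order."""
    if not request:
        return []
    if reacquired is None:
        # The request already lists every unfinished block; use it directly.
        return list(request)
    # Only the interrupted block was sent: continue from it in the fresh list.
    return reacquired[reacquired.index(request[0]):]
```

Both variants lead to the same read plan when the block list has not changed between the original task and the restart.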
It should be noted that, in the above illustrated embodiment, if the data exchange task is executed abnormally, the ETL tool may determine, based on the received data block information, extraction start messages, and extraction end messages, which of the data blocks corresponding to the data to be extracted have not been completely extracted, and send a recovery extraction request for the data to be extracted to the data source end, so that the data source end can continue the extraction based on the data block information carried in the recovery extraction request. This provides a breakpoint resume capability: the data blocks that have already been extracted do not need to be deleted, and the data to be extracted does not need to be extracted again from the beginning, which improves the execution efficiency of the data exchange task after a restart.
In one embodiment shown, the breakpoint resume capability may be automatically enabled if the data exchange task performs abnormally. When implemented, the method further comprises: and if the data exchange task is determined to be abnormally executed, periodically sending a recovery extraction request aiming at the data to be extracted to the data source end according to a preset period.
The preset period is the automatic restart period at which the ETL tool attempts to re-execute the abnormal data exchange task; it may be configured by a person skilled in the art as required, and this specification is not limited in this respect.
For example, suppose the user has configured an automatic restart period of 2 minutes for the data exchange task of the data to be extracted. If the data exchange task is determined to be abnormally executed, the ETL tool may send a recovery extraction request for the data to be extracted to the data source end every 2 minutes, until the data exchange task is restarted successfully or the maximum number of automatic restarts is exceeded.
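A rough sketch of such a periodic retry loop is given below. The function name `retry_resume`, the boolean return value of the `send_resume_request` callback, and the injectable `sleep` parameter are assumptions made so that the example is self-contained and testable.

```python
import time
from typing import Callable, Optional


def retry_resume(
    send_resume_request: Callable[[], bool],
    period_s: float,
    max_restarts: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Send a recovery extraction request once per period.

    Returns the attempt number that succeeded, or None if the
    maximum number of automatic restarts was used up.
    """
    for attempt in range(1, max_restarts + 1):
        sleep(period_s)  # wait one automatic restart period
        if send_resume_request():
            return attempt
    return None
```

With a 2-minute period, a call such as `retry_resume(send, period_s=120, max_restarts=5)` would try at most five times, where `send` stands for whatever function actually issues the request.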
In another embodiment shown, the breakpoint resume capability may be manually enabled after troubleshooting if the data exchange task performs abnormally. When implemented, the method further comprises: and if the data exchange task is determined to be abnormally executed, responding to a restarting instruction aiming at the data exchange task, and sending a recovery extraction request aiming at the data to be extracted to the data source end.
For example, if the data exchange task for the data to be extracted is executed abnormally, the user may manually restart the data exchange task after network recovery or other troubleshooting; the ETL tool may send a recovery extraction request for the data to be extracted to the data source end in response to detecting a restart instruction for the data exchange task.
It should be noted that, in the above-described embodiments, the data exchange task takes a long time when the amount of data to be exchanged is large, for example when the data source end includes a big data platform or the data extraction request is a full data extraction request. In such scenarios, if the data exchange task is executed abnormally, a data extraction method with the breakpoint resume capability can greatly reduce the execution time of the data exchange task.
It should be noted that, in one or more of the above-mentioned embodiments, if part of the data in the data block currently being extracted has already been extracted and written to the destination before the data exchange task is determined to be abnormal, then after the data exchange task is restarted, the ETL tool may perform data extraction, data conversion, data loading, and other operations on that data block as a whole, which may cause the previously written part of the data to be written to the destination again. For example, as shown in Table 2, the ETL tool receives the extraction start message "tk1" corresponding to location k, and part of the data corresponding to location k has been extracted and written to the destination, but the ETL tool does not receive the extraction end message "tk2" corresponding to location k; it therefore determines that the data block extraction result corresponding to location k is "incomplete" and that the data exchange task was executed abnormally. After the data exchange task is restarted, the ETL tool may extract all the data corresponding to location k again and write it to the destination again, so that the part of the data corresponding to location k that was already written is written a second time, resulting in duplicate data at the destination.
Based on this, in practical applications, in order to avoid a situation that data is repeated in the destination terminal after restarting the data exchange task, the ETL tool may also make corresponding improvement measures for a data loading phase in the data exchange task, and several optional improvement measures are exemplarily described below.
(1) The ETL tool may write the data corresponding to the one or more data blocks into the destination after extracting the data corresponding to the one or more data blocks and completing the caching; that is, only the whole extracted data block continues to be loaded with subsequent data, and the data block causing the data exchange task to be abnormal does not complete data extraction and cache, and the ETL tool does not send the data block to the destination, so that the destination does not have a data duplication condition after restarting the data exchange task.
(2) The ETL tool may first send the extracted data to the destination and, after receiving the extraction end message corresponding to any data block, send a commit message corresponding to that data block to the destination, so that the destination persists the received data corresponding to the data block to disk. The destination therefore persists only data blocks that have been extracted in full. For the data block that caused the data exchange task to become abnormal, data extraction is not completed and the ETL tool does not receive the corresponding extraction end message, so no commit message for that data block is sent to the destination and the destination does not persist the data it received for that data block.
(3) The ETL tool may send the extracted data to the destination so that the destination persists it to disk, and the destination may return the last row of data of a data block to the ETL tool after the loading of the entire data block is completed; further, when restarting the data exchange task, the ETL tool may also send an instruction to the destination, so that the destination, in response to the instruction, deletes the data that was persisted after the last row of the last data block that was successfully persisted.
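Measure (2) above can be illustrated with a small in-memory sketch of a destination that stages received rows per data block and persists them only on commit. The `Destination` class, its method names, and the `discard_uncommitted` step performed on restart are hypothetical and are not prescribed by this specification.

```python
from typing import Dict, List


class Destination:
    """Stages rows per data block; persists a block only when it is committed."""

    def __init__(self) -> None:
        self.staged: Dict[str, List[str]] = {}
        self.persisted: List[str] = []

    def receive(self, location: str, row: str) -> None:
        # Data arrives before the extraction end message, so it is only staged.
        self.staged.setdefault(location, []).append(row)

    def commit(self, location: str) -> None:
        # Called after the ETL tool receives the block's extraction end message.
        self.persisted.extend(self.staged.pop(location, []))

    def discard_uncommitted(self) -> None:
        # Assumed restart step: drop partial data of blocks never committed.
        self.staged.clear()
```

Because the interrupted block is never committed, extracting it again in full after a restart does not leave duplicate rows in the persisted data.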
According to the above technical solution, when the ETL tool executes a data exchange task for data to be extracted, it can determine the extraction result of each data block by acquiring the data block information of the data blocks corresponding to the data to be extracted and receiving the extraction start message and the extraction end message corresponding to each of those data blocks, and can in turn determine the extraction result of the data to be extracted and the execution result of the data exchange task. The ETL tool thereby obtains the execution progress of the data exchange task, which makes a breakpoint resume capability possible.
Subsequently, if the data exchange task is executed abnormally, the ETL tool may determine, based on the received data block information, extraction start messages, and extraction end messages, which of the data blocks corresponding to the data to be extracted have not been completely extracted, and send a recovery extraction request for the data to be extracted to the data source end, so that the data source end can continue the extraction based on the data block information carried in the recovery extraction request. This provides the breakpoint resume capability: the data blocks that have already been extracted do not need to be deleted, and not all data blocks corresponding to the data to be extracted need to be extracted again, which improves the execution efficiency of the data exchange task after a restart.
Corresponding to the embodiment of the data extraction method, the specification also provides an embodiment of a data extraction device.
Referring to fig. 3, fig. 3 is a hardware structure diagram of an electronic device in which a data extraction apparatus is located according to an exemplary embodiment. At the hardware level, the device includes a processor 302, an internal bus 304, a network interface 306, a memory 308, and a non-volatile memory 310, and may of course also include hardware required for other services. One or more embodiments of the present description may be implemented in software, for example by the processor 302 reading a corresponding computer program from the non-volatile memory 310 into the memory 308 and then executing it. Of course, besides a software implementation, one or more embodiments of this specification do not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logic units and may also be hardware or logic devices.
Referring to fig. 4, fig. 4 is a block diagram of a data extraction apparatus according to an exemplary embodiment. The data extraction device can be applied to the electronic device shown in fig. 3 to implement the technical solution of the present specification. Wherein, the data extraction device may include:
a sending unit 402, configured to send a data extraction request for data to be extracted to the data source end in response to a start instruction of a data exchange task for the data to be extracted; the data source end obtains data block information of a plurality of data blocks corresponding to the data to be extracted, and returns the data block information to the ETL tool;
a receiving unit 404, configured to receive and record data block information of a plurality of data blocks corresponding to the data to be extracted, where the data block information is returned by the data source end;
the receiving unit 404 is further configured to receive an extraction start message and an extraction end message corresponding to each of the plurality of data blocks; wherein the extraction start message is used for indicating to start extracting the data stored in the corresponding data block; the extraction end message is used for indicating that the extraction of the data stored in the corresponding data block is finished;
a determining unit 406, configured to determine that the data to be extracted has been extracted if an extraction start message and an extraction end message corresponding to each of the data blocks are received, so as to determine that the data exchange task has been successfully executed.
In this embodiment, the determining unit 406 is specifically configured to:
if an extraction starting message and an extraction ending message corresponding to any data block in the data blocks are received, determining that the data block is extracted;
and if the data blocks are extracted completely, determining that the data to be extracted is extracted completely.
In this embodiment, the determining unit 406 is further configured to:
and if the extraction starting message corresponding to any one of the data blocks is received and the extraction ending message corresponding to the data block is not received, determining that the data block is not extracted completely.
In this embodiment, the determining unit 406 is further specifically configured to:
in response to receiving the extraction start message corresponding to the data block, determining whether an extraction end message corresponding to the data block is received within a preset time length;
if the extraction ending message corresponding to the data block is received within the preset time length, determining that the data block is extracted;
and if the extraction ending message corresponding to the data block is not received within the preset time length, determining that the data block is abnormal in extraction.
In this embodiment, the determining unit 406 is further specifically configured to:
and if the extraction of part of the data blocks in the plurality of data blocks is abnormal, determining that the extraction of the data to be extracted is abnormal, thereby determining that the execution of the data exchange task is abnormal.
In this embodiment, the sending unit 402 is further configured to:
if the data exchange task is determined to be abnormally executed, periodically sending a recovery extraction request aiming at the data to be extracted to the data source end according to a preset period; wherein, the recovery extraction request carries the data block information of the data block which is not extracted in the plurality of data blocks; and the data source end continues to extract the data to be extracted based on the data block information of the data block which is not extracted.
In this embodiment, the sending unit 402 is further configured to:
if the data exchange task is determined to be abnormally executed, sending a recovery extraction request aiming at the data to be extracted to the data source end in response to a restart instruction aiming at the data exchange task; wherein, the recovery extraction request carries the data block information of the data block which is not extracted in the plurality of data blocks; and the data source end continues to extract the data to be extracted based on the data block information of the data block which is not extracted.
In this embodiment, the data source includes a big data platform.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are only illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in the specification. One of ordinary skill in the art can understand and implement it without inventive effort.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present description to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments herein. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
The above description is only for the purpose of illustrating the preferred embodiments of the one or more embodiments of the present disclosure, and is not intended to limit the scope of the one or more embodiments of the present disclosure, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the one or more embodiments of the present disclosure should be included in the scope of the one or more embodiments of the present disclosure.

Claims (11)

1. A data extraction method is characterized by being applied to an ETL tool; the ETL tool is used for executing a data exchange task of extracting, converting and loading data stored by a data source end to a destination end; the data source end is based on a distributed file system and realizes data storage by taking a data block as a unit; the method comprises the following steps:
responding to a starting instruction of a data exchange task aiming at data to be extracted, and sending a data extraction request aiming at the data to be extracted to the data source end; the data source end obtains data block information of a plurality of data blocks corresponding to the data to be extracted, and returns the data block information to the ETL tool;
receiving and recording data block information of a plurality of data blocks corresponding to the data to be extracted, which is returned by the data source end;
respectively receiving an extraction starting message and an extraction ending message corresponding to each data block in the plurality of data blocks; wherein the extraction start message is used for indicating to start extracting the data stored in the corresponding data block; the extraction end message is used for indicating that the extraction of the data stored in the corresponding data block is finished;
and if an extraction starting message and an extraction ending message corresponding to each data block in the plurality of data blocks are received, determining that the data to be extracted is extracted, and determining that the data exchange task is successfully executed.
2. The method according to claim 1, wherein the determining that the data to be extracted has completed extraction if an extraction start message and an extraction end message corresponding to each of the data blocks are received comprises:
if an extraction starting message and an extraction ending message corresponding to any data block in the data blocks are received, determining that the data block is extracted;
and if the data blocks are extracted completely, determining that the data to be extracted is extracted completely.
3. The method of claim 2, further comprising:
and if the extraction starting message corresponding to any one of the data blocks is received and the extraction ending message corresponding to the data block is not received, determining that the data block is not extracted completely.
4. The method of claim 3, further comprising:
in response to receiving the extraction start message corresponding to the data block, determining whether an extraction end message corresponding to the data block is received within a preset time length;
if the extraction ending message corresponding to the data block is received within the preset time length, determining that the data block is extracted;
and if the extraction ending message corresponding to the data block is not received within the preset time length, determining that the data block is abnormal in extraction.
5. The method of claim 4, further comprising:
and if the extraction of part of the data blocks in the plurality of data blocks is abnormal, determining that the extraction of the data to be extracted is abnormal, thereby determining that the execution of the data exchange task is abnormal.
6. The method of claim 1, further comprising:
if the data exchange task is determined to be abnormally executed, periodically sending a recovery extraction request aiming at the data to be extracted to the data source end according to a preset period; wherein, the recovery extraction request carries the data block information of the data block which is not extracted in the plurality of data blocks; and the data source end continues to extract the data to be extracted based on the data block information of the data block which is not extracted.
7. The method of claim 1, further comprising:
if the data exchange task is determined to be abnormally executed, sending a recovery extraction request aiming at the data to be extracted to the data source end in response to a restart instruction aiming at the data exchange task; wherein, the recovery extraction request carries the data block information of the data block which is not extracted in the plurality of data blocks; and the data source end continues to extract the data to be extracted based on the data block information of the data block which is not extracted.
8. The method of claim 1, wherein the data source comprises a big data platform.
9. A data extraction device is characterized by being applied to an ETL tool; the ETL tool is used for executing a data exchange task of loading data stored by the data source end to the destination end after extraction, cleaning and conversion; the data source end is based on a distributed file system and realizes data storage by taking a data block as a unit; the device comprises:
the sending unit is used for responding to a starting instruction of a data exchange task aiming at data to be extracted, and sending a data extraction request aiming at the data to be extracted to the data source end; the data source end obtains data block information of a plurality of data blocks corresponding to the data to be extracted, and returns the data block information to the ETL tool;
the receiving unit is used for receiving and recording data block information of a plurality of data blocks corresponding to the data to be extracted, which is returned by the data source end;
the receiving unit is further configured to receive an extraction start message and an extraction end message corresponding to each of the plurality of data blocks respectively; wherein the extraction start message is used for indicating to start extracting the data stored in the corresponding data block; the extraction end message is used for indicating that the extraction of the data stored in the corresponding data block is finished;
and the determining unit is used for determining that the data to be extracted is extracted completely if an extraction starting message and an extraction ending message corresponding to each data block in the plurality of data blocks are received, so that the data exchange task is successfully executed.
10. An electronic device is characterized by comprising a communication interface, a processor, a memory and a bus, wherein the communication interface, the processor and the memory are connected with each other through the bus;
the memory stores machine-readable instructions, and the processor executes the method of any one of claims 1 to 8 by calling the machine-readable instructions.
11. A machine-readable storage medium having stored thereon machine-readable instructions which, when invoked and executed by a processor, carry out the method of any of claims 1 to 8.
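For illustration only, the per-block bookkeeping recited in claims 1 to 5 can be sketched as a small state tracker. This is a minimal sketch under stated assumptions, not the claimed implementation: the class name `ExtractionTracker`, the state names, the use of string block identifiers, and timestamps passed in as plain numbers are all hypothetical choices made here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Iterable, List


class BlockState(Enum):
    PENDING = "pending"        # block info recorded, no start message yet
    EXTRACTING = "extracting"  # start message received, end message outstanding
    DONE = "done"              # start and end messages both received in time
    ABNORMAL = "abnormal"      # no end message within the preset time length


@dataclass
class ExtractionTracker:
    timeout_s: float  # assumed unit for the "preset time length"
    states: Dict[str, BlockState] = field(default_factory=dict)
    started_at: Dict[str, float] = field(default_factory=dict)

    def record_blocks(self, block_ids: Iterable[str]) -> None:
        """Record the block info returned by the data source end."""
        for block_id in block_ids:
            self.states[block_id] = BlockState.PENDING

    def on_start(self, block_id: str, now: float) -> None:
        self.states[block_id] = BlockState.EXTRACTING
        self.started_at[block_id] = now

    def on_end(self, block_id: str, now: float) -> None:
        # Only a block whose start message was seen can complete.
        if self.states.get(block_id) is not BlockState.EXTRACTING:
            return
        late = now - self.started_at[block_id] > self.timeout_s
        self.states[block_id] = BlockState.ABNORMAL if late else BlockState.DONE

    def check_timeouts(self, now: float) -> None:
        """Mark blocks whose end message is overdue as abnormal."""
        for block_id, state in list(self.states.items()):
            overdue = now - self.started_at.get(block_id, now) > self.timeout_s
            if state is BlockState.EXTRACTING and overdue:
                self.states[block_id] = BlockState.ABNORMAL

    def unextracted(self) -> List[str]:
        """Blocks whose info a recovery extraction request would carry."""
        return [b for b, s in self.states.items() if s is not BlockState.DONE]

    def task_status(self) -> str:
        if any(s is BlockState.ABNORMAL for s in self.states.values()):
            return "abnormal"
        if self.states and all(s is BlockState.DONE for s in self.states.values()):
            return "success"
        return "running"
```

In this sketch the task is reported as successful only when every recorded block has reached the done state, and a single abnormal block makes the whole task abnormal, mirroring claims 1 and 5.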
CN202111466573.XA 2021-12-03 2021-12-03 Data extraction method and device Pending CN114297273A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111466573.XA CN114297273A (en) 2021-12-03 2021-12-03 Data extraction method and device

Publications (1)

Publication Number Publication Date
CN114297273A (en) 2022-04-08

Family

ID=80964713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111466573.XA Pending CN114297273A (en) 2021-12-03 2021-12-03 Data extraction method and device

Country Status (1)

Country Link
CN (1) CN114297273A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination