WO2017028690A1 - An ETL-based file processing method and system - Google Patents

An ETL-based file processing method and system

Info

Publication number
WO2017028690A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
row
segmentation
row data
text data
Prior art date
Application number
PCT/CN2016/093495
Other languages
English (en)
French (fr)
Inventor
罗海伟
陈守元
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 filed Critical 阿里巴巴集团控股有限公司
Publication of WO2017028690A1 publication Critical patent/WO2017028690A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/254Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Definitions

  • the present application relates to the field of data processing technologies, and in particular, to an ETL-based file processing method, and an ETL-based file processing system.
  • EAI Enterprise Application Integration
  • ETL, the abbreviation of Extraction-Transformation-Loading, i.e. the process of extracting, transforming, and loading data, is an important part of building a data warehouse.
  • ETL is a process of loading and transforming data of a business system into a data warehouse after being extracted and cleaned. The purpose is to integrate scattered, disorderly, and non-uniform data in the enterprise to provide an analysis basis for enterprise decision-making.
  • At present, synchronization tools for file systems, such as sync-disk applications, operate at file granularity and mainly complete file-to-file synchronization between terminals; they do not meet the needs of data warehouse ETL.
  • the technical problem to be solved by the embodiments of the present application is to provide an ETL-based file processing method for improving the speed of file synchronization in the ETL process and maximizing data synchronization efficiency.
  • the embodiment of the present application further provides an ETL-based file processing system to ensure implementation and application of the foregoing method.
  • an ETL-based file processing method, the method including: acquiring a plurality of file objects from a source end; for each file object, performing data segmentation within the file to obtain a plurality of text data blocks; and, after segmentation of the plurality of file objects is completed, concurrently writing all text data blocks corresponding to the plurality of file objects to a destination end.
  • the step of acquiring a plurality of file objects from the source includes:
  • reading structured information from a source end, the structured information including a plurality of file objects;
  • the structured information is segmented in units of a single file object to obtain a plurality of file objects.
  • the step of performing data segmentation in the file for each file object to obtain a plurality of text data blocks comprises:
  • the file object is segmented according to the plurality of segmentation positions to obtain a plurality of text data blocks.
  • the file object comprises a plurality of row data records
  • the step of determining a plurality of segmentation positions for each file object comprises: acquiring, for each file object, the size of the file object; determining an average size of the row data records; calculating the quotient of the file object size and the average size of the row data records to obtain the number of row data records; calculating the quotient of the number of row data records and the preset number of text data blocks to obtain the number of row data records owned by each text data block; determining a plurality of initial segmentation positions of the file object according to the number of row data records owned by each text data block; if an initial segmentation position is not the position where a row separator is located, adjusting the segmentation position to the position where the row separator is located; and performing deduplication processing on the adjusted initial segmentation positions to obtain the plurality of segmentation positions.
  • the step of adjusting the segmentation position to the position where the row separator is located includes:
  • if the initial segmentation position is not the position where a row separator is located, probing forward or rolling back to the position of the row separator closest to the initial segmentation position;
  • the segmentation position is determined as the position of that closest row separator.
  • the step of determining an average size of the row data record comprises:
  • the size of the first row data record is taken as the average size of the row data records; or
  • the size of the last row data record is taken as the average size of the row data records; or
  • the size of a randomly selected row data record is taken as the average size of the row data records; or
  • the average of the sizes of the row data records of the first N rows is calculated and taken as the average size of the row data records.
  • the step of concurrently writing all text data blocks corresponding to the plurality of file objects to the destination end after segmentation of the plurality of file objects is completed includes: respectively converting the format of the text data blocks into an intermediate-state format; writing the corresponding intermediate-state text data blocks to the destination end using multiple threads, multiple processes, or distributed multiple machines; and, at the destination end, converting the intermediate-state text data blocks into the format required by the destination end.
  • the text data block includes one or more row data records, and each row data record includes a column separator; the step of respectively converting the format of the text data blocks into the intermediate-state format after the plurality of file objects are segmented includes:
  • each row data record of each text data block is cut according to the column separator to obtain one or more column records; a corresponding preset data type is then added to each column record to obtain the intermediate-state format;
  • the preset data type includes at least one or more of the following types: a string STRING, a long integer LONG, a Boolean BOOLEAN, a double precision floating point DOUBLE, and a date DATE.
  • the method further includes: generating dirty data when conversion of the format of a text data block into the intermediate-state format is unsuccessful, or when conversion of an intermediate-state text data block into the format required by the destination end is unsuccessful; and generating an error report when the dirty data exceeds a preset threshold.
  • the file object is a file object capable of random reading and writing, and the file object may include at least one or more of the following: a local file, an open storage service (OSS) file, a secure file transfer protocol (SFTP) file, and a distributed file system (HDFS) file.
  • the embodiment of the present application further provides an ETL-based file processing system, where the system includes:
  • a file object obtaining module configured to acquire a plurality of file objects from a source
  • a file segmentation module configured to perform data segmentation in a file for each file object to obtain a plurality of text data blocks
  • a writing module configured to concurrently write all the text data blocks corresponding to the plurality of file objects to the destination end after the dividing of the plurality of file objects is completed.
  • the file object obtaining module includes:
  • a structured information reading submodule configured to read structured information from a source, where the structured information includes a plurality of file objects;
  • a structured information segmentation sub-module configured to segment the structured information in units of a single file object to obtain a plurality of file objects.
  • the file segmentation module comprises:
  • a segmentation position determining sub-module configured to determine a plurality of segmentation positions for each file object;
  • a segmentation sub-module configured to segment the file object according to the plurality of segmentation positions to obtain a plurality of text data blocks.
  • the file object includes a plurality of row data records
  • the segmentation location determining submodule comprises:
  • a file size obtaining unit configured to acquire a size of the file object for each file object
  • a row size determining unit configured to determine an average size of the row data record
  • a first calculating unit configured to calculate the quotient of the file object size and the average size of the row data records, to obtain the number of row data records;
  • a second calculating unit configured to calculate the quotient of the number of row data records and the preset number of text data blocks, to obtain the number of row data records owned by each text data block;
  • An initial segmentation location determining unit configured to determine a plurality of initial segmentation locations of the file object according to the number of row data records that the text data block has;
  • an adjusting unit configured to adjust the cutting position to a position where the line separator is located when the initial splitting position is not the position where the line separator is located;
  • a deduplication unit configured to perform deduplication processing on the adjusted initial segmentation positions to obtain a plurality of segmentation positions.
  • the adjusting unit is further configured to:
  • if the initial segmentation position is not the position where a row separator is located, probing forward or rolling back to the position of the row separator closest to the initial segmentation position;
  • the segmentation position is determined as the position of that closest row separator.
  • the line size determining unit is further configured to:
  • the size of the first row data record is taken as the average size of the row data records; or
  • the size of the last row data record is taken as the average size of the row data records; or
  • the size of a randomly selected row data record is taken as the average size of the row data records; or
  • the average of the sizes of the row data records of the first N rows is calculated and taken as the average size of the row data records.
  • the writing module comprises:
  • a first format conversion submodule configured to convert the format of the text data block into an intermediate state format after the plurality of file objects are segmented
  • a data writing sub-module configured to write the corresponding intermediate-state text data blocks to the destination end using multiple threads, multiple processes, or distributed multiple machines;
  • a second format conversion sub-module configured to convert, at the destination end, the intermediate-state text data blocks into the format required by the destination end.
  • the text data block includes one or more row data records
  • the row data record includes a column separator
  • the first format conversion submodule comprises:
  • a cutting unit configured to cut each row data record of each text data block according to the column separator after the segmentation of the plurality of file objects, to obtain one or more column records
  • a data type adding unit configured to add a corresponding preset data type to the column record, respectively, to obtain the intermediate state format.
  • the preset data type includes at least one or more of the following types: a string STRING, a long integer LONG, a Boolean BOOLEAN, a double precision floating point DOUBLE, and a date DATE.
  • the system further comprises:
  • a dirty data generating module configured to generate dirty data when conversion of the format of a text data block into the intermediate-state format is unsuccessful, or when conversion of an intermediate-state text data block into the format required by the destination end is unsuccessful;
  • An error report generation module is configured to generate an error report when the dirty data exceeds a preset threshold.
  • the file object is a file object capable of random reading and writing, and the file object may include at least one or more of the following objects: a local file, an open storage service (OSS) file, a secure file transfer protocol (SFTP) file, and a distributed file system (HDFS) file.
  • the embodiments of the present application include the following advantages:
  • In the embodiments of the present application, when a plurality of file objects are acquired from the source end, the data inside each file object is segmented, in units of file objects, to obtain text data blocks; after all file objects have been internally segmented, all text data blocks are concurrently written to the destination end. Because the segmentation granularity is fine-grained segmentation within a file, concurrent writing can improve the speed of file synchronization in the ETL process and maximize data synchronization efficiency.
  • In addition, because the embodiments of the present application segment files according to their internal structure, the approach is not limited to a particular file system; it can complete data synchronization for multiple file systems and therefore has strong versatility.
  • FIG. 1 is a flow chart of the steps of Embodiment 1 of an ETL-based file processing method according to the present application;
  • FIG. 2 is a flow chart of the steps of Embodiment 2 of an ETL-based file processing method according to the present application;
  • FIG. 3a is a schematic diagram of segmentation positions in Embodiment 2 of an ETL-based file processing method according to the present application;
  • FIG. 3b is a first schematic diagram of segmentation position adjustment in Embodiment 2 of an ETL-based file processing method according to the present application;
  • FIG. 4a is a second schematic diagram of segmentation position adjustment in Embodiment 2 of an ETL-based file processing method according to the present application;
  • FIG. 4b is a schematic diagram of deduplication results in Embodiment 2 of an ETL-based file processing method according to the present application;
  • FIG. 5 is a schematic diagram of text data blocks in Embodiment 2 of an ETL-based file processing method according to the present application;
  • FIG. 6 is a structural block diagram of an embodiment of an ETL-based file processing system of the present application.
  • Referring to FIG. 1, a flow chart of the steps of Embodiment 1 of an ETL-based file processing method according to the present application is shown, which may include the following steps:
  • Step 101 Acquire multiple file objects from a source end
  • Step 102 Perform data segmentation in the file for each file object to obtain a plurality of text data blocks.
  • Step 103 After the segmentation of the plurality of file objects is completed, all the text data blocks corresponding to the plurality of file objects are concurrently written into the destination end.
  • In the embodiments of the present application, when multiple file objects are acquired from the source end, the data inside each file object is segmented, with the file object as the unit, to obtain text data blocks; after all file objects have been internally segmented, all text data blocks are concurrently written to the destination end. Because the segmentation granularity is fine-grained segmentation within a file, concurrent writing can improve the speed of file synchronization in the ETL process and maximize data synchronization efficiency.
  • In addition, because the embodiments of the present application segment files according to their internal structure, the approach is not limited to a particular file system; it can complete data synchronization for multiple file systems and therefore has strong versatility.
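  • For illustration only, the following is a minimal Python sketch of the flow of steps 101 to 103, assuming the source end is simply a local directory; the helper names and the line-count splitting are simplifications for readability rather than the byte-offset algorithm detailed in Embodiment 2 below, and each block is written to its own destination file.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def split_file_object(path, m):
    """Step 102 (simplified): segment one file object's rows into m text data blocks."""
    with open(path, "r", encoding="utf-8") as f:
        rows = f.readlines()
    per_block = max(1, len(rows) // m)
    return [rows[i:i + per_block] for i in range(0, len(rows), per_block)]

def write_block(block_id, rows, dest_dir):
    """Step 103: each text data block is written to its own destination file."""
    with open(os.path.join(dest_dir, f"block_{block_id}.txt"), "w", encoding="utf-8") as f:
        f.writelines(rows)

def etl_file_sync(source_dir, dest_dir, m=4):
    # Step 101: acquire the file objects from the source end (here: a local directory).
    file_objects = [os.path.join(source_dir, name) for name in os.listdir(source_dir)]

    # Step 102: segment inside every file object before any writing starts.
    all_blocks = []
    for path in file_objects:
        all_blocks.extend(split_file_object(path, m))

    # Step 103: only after all file objects are segmented, write all blocks concurrently.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(write_block, i, rows, dest_dir)
                   for i, rows in enumerate(all_blocks)]
        for future in futures:
            future.result()
```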
  • Referring to FIG. 2, a flow chart of the steps of Embodiment 2 of an ETL-based text data processing method of the present application is shown, which may include the following steps:
  • Step 201 Read structured information from a source, where the structured information includes multiple file objects.
  • The embodiments of the present application can be applied to ETL scenarios. There may be multiple source ends, which may include, for example and without limitation: SFTP (Secure File Transfer Protocol), a local file system (Local File), OSS (Open Storage Service, which can be understood as a storage disk), HDFS (Hadoop Distributed File System), and so on.
  • It should be noted that the source end system of the present application generally does not provide database transaction guarantees; the user needs to ensure data consistency (with respect to additions, deletions, modifications, and the like) while the data is being read, that is, the user should try not to modify the content of the structured information during data synchronization.
  • The information read from the source end may include structured data and/or semi-structured data. Structured data may be database data whose rows and columns are strictly predefined, with explicit schema constraints; semi-structured data refers to character data with explicit row separators and column separators, which can be abstracted into a two-dimensional table structure in which each record row includes one or more columns of data and every row has the same number of columns.
  • the structured information may include a plurality of file objects, wherein the file object may be an object capable of constructing a file model.
  • the file object in the embodiment of the present application may be a file object capable of random reading and writing, that is, the file object of the present application may start reading from any byte.
  • the write operation to the file object can be gradually added at the end of the file object.
  • the file object may include at least one or more of the following objects: a local file, an OSS file, an SFTP file, and an HDFS file.
  • file object of the embodiment of the present application can support different coding types and different compression types.
  • Step 202: The structured information is segmented in units of a single file object to obtain multiple file objects.
  • After the structured information is obtained from the source end, the embodiments of the present application may perform a first-level, coarse-grained file segmentation. This segmentation is file-level segmentation, and it may be performed in units of a single file object to obtain multiple file objects.
  • the segmentation refers to dividing a task into multiple subtasks, and multiple subtasks can be executed concurrently. After all the subtasks are completed, the overall task is also completed. By performing the subtasks concurrently, the running time of the job can be shortened.
  • the structured information read from the source includes 10 file objects, which can be divided into 10 file objects in units of a single file object.
  • the structured information of one day is obtained from the source end, and the structured information is stored in units of hours, and the file can be divided into 24 file objects in one day.
  • It should be noted that if the structured information read from the source end is too small, the segmentation logic of the present application need not be executed; for example, for a 100 KB file object, it is not necessary to segment it.
  • The segmentation logic of the present application targets source-end data whose structured information size exceeds a set threshold. Therefore, before performing step 202, it may first be determined whether the size of the structured information is greater than the set threshold; if it is greater than or equal to the threshold, step 202 may be performed; otherwise, step 202 is not performed.
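  • A minimal sketch, assuming a simple FileObject record and an arbitrary 10 MB threshold, of the first-level coarse-grained segmentation with the size check performed before step 202:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FileObject:
    name: str
    size: int  # bytes

def coarse_split(structured_info: List[FileObject], threshold: int = 10 * 1024 * 1024):
    """Step 202 (sketch): segment the structured information in units of a single
    file object, but only when its total size reaches the configured threshold."""
    total_size = sum(f.size for f in structured_info)
    if total_size < threshold:
        # Too small to be worth segmenting (e.g. a single 100 KB file): keep as one unit.
        return [structured_info]
    # One sub-task per file object; the sub-tasks can then be executed concurrently.
    return [[f] for f in structured_info]
```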
  • Step 203 Perform data segmentation in the file for each file object to obtain a plurality of text data blocks.
  • After the data obtained from the source end has undergone the first-level, coarse-grained segmentation to obtain a plurality of file objects, the embodiments of the present application further perform a second-level, fine-grained file segmentation for each file object. This segmentation is segmentation inside the file; after it is completed, each file object yields multiple text data blocks.
  • step 203 may include the following sub-steps:
  • Sub-step S11 determining a plurality of segmentation positions for each file object
  • the sub-step S11 further includes the following sub-steps:
  • Sub-step S111 obtaining, for each file object, the file object size
  • attribute information of the file object can be obtained, wherein the attribute information can include a file object size totalSize.
  • Sub-step S112 determining an average size of the row data record
  • the data record of the row has information such as a row separator, a column separator, and the like.
  • each row data record in the file object may have a different size.
  • the embodiments of the present application can obtain the average size lineSize of the row data records and use this average size lineSize as one of the bases for segmentation. In a specific implementation, lineSize can be determined according to characteristics such as the size distribution of the row data records in the file object.
  • the average size of the row data record can be determined by referring to the following ways:
  • the size of the first row data record is taken as the average size of the row data records; or
  • the size of the last row data record is taken as the average size of the row data records; or
  • the size of a randomly selected row data record is taken as the average size of the row data records; or
  • the average of the sizes of the row data records of the first N rows is calculated and taken as the average size of the row data records.
  • the size of each row data record in the file object is the same, or substantially the same, the size of the first row data or the last row of data or a random row of data may be directly selected as the average size of the row data record.
  • If the size distribution of the row data records in the file object varies periodically, for example with a period of 50 rows, the average size of every 50 rows of data is taken as the average size of the row data records; of course, besides the average value, the maximum, the minimum, or the median of the row data record sizes within the period may also be taken as the average size of the row data records.
  • In addition, the first N rows mentioned in the embodiments of the present application may, besides the above-mentioned variation period, be a number of rows pre-configured by the user or a number of rows read at random, which is not limited in the embodiments of the present application.
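  • The lineSize strategies above could be sketched as follows; for brevity the sketch reads the whole file, whereas a real implementation would sample only as many rows as the chosen strategy needs:

```python
import random

def estimate_line_size(path, strategy="first_n", n=50, encoding="utf-8"):
    """Return an estimated average row size (lineSize) in bytes for one file object."""
    with open(path, "r", encoding=encoding) as f:
        rows = f.readlines()
    if not rows:
        return 0
    sizes = [len(row.encode(encoding)) for row in rows]
    if strategy == "first":      # size of the first row data record
        return sizes[0]
    if strategy == "last":       # size of the last row data record
        return sizes[-1]
    if strategy == "random":     # size of a randomly selected row data record
        return random.choice(sizes)
    # Default: average of the sizes of the first N rows.
    sample = sizes[:n]
    return sum(sample) // len(sample)
```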
  • Sub-step S113 calculating a quotient of the file object size and an average size of the row data record, to obtain the number of the row data records;
  • Sub-step S114 calculating the quotient of the number of row data records and the preset number of text data blocks, to obtain the number of row data records owned by each text data block;
  • the number m of text data blocks that the user wants to segment can be pre-configured.
  • the configuration may be: the user directly inputs the number of text data blocks that need to be configured; Or, if the flow rate of the text data block is 1 Mbps and the user needs 10 Mbps, it is divided into 10 copies.
  • the number of row data records owned by each text data block can be obtained as number/m.
  • Sub-step S115 determining a plurality of initial segmentation locations of the file object according to the number of row data records owned by the text data block;
  • After the number of row data records owned by each text data block is determined, a plurality of initial segmentation positions (point) of the file object may be determined; the initial segmentation positions point may be: 0, 1*number/m, 2*number/m, 3*number/m, and so on.
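  • Putting sub-steps S111 to S115 together, the initial segmentation positions could be computed as in the sketch below; interpreting the positions 0, 1*number/m, 2*number/m, ... as byte offsets (multiples of rows-per-block times lineSize) is an assumption made here for illustration:

```python
def initial_split_points(total_size, line_size, m):
    """Sub-steps S113-S115 (sketch): derive m initial segmentation offsets."""
    number = total_size // line_size        # S113: estimated count of row data records
    rows_per_block = max(1, number // m)    # S114: rows owned by each text data block
    # S115: positions 0, 1*number/m, 2*number/m, ... expressed here as byte offsets.
    return [k * rows_per_block * line_size for k in range(m)]

# Example: totalSize = 1_000_000 bytes, lineSize = 100 bytes, m = 4 blocks
# -> number = 10000, rows_per_block = 2500, points = [0, 250000, 500000, 750000]
```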
  • Sub-step S116 if the initial segmentation position is not the position where the row separator is located, the segmentation position is adjusted to the position where the row separator is located;
  • the embodiment of the present application can fine-tune the segmentation position to ensure that the segmentation position is at the end of the line or at the beginning of the line.
  • the initial segmentation position may be adjusted in such a manner that if the initial segmentation position is not the position where the row separator is located, the segmentation position is adjusted to the position where the row separator is located.
  • In a preferred embodiment of the present application, sub-step S116 may further be: if the initial segmentation position is not the position where a line separator is located, probing forward or rolling back to the position of the line separator closest to the initial segmentation position, and determining the segmentation position as the position of that closest line separator.
  • As shown in the segmentation position diagram of FIG. 3a, the initial segmentation position obtained according to sub-step S115 may fall in the middle of a row data record; if segmentation were performed at this initial position, the row data record would be cut apart and left incomplete. Therefore, in one implementation, the embodiments of the present application may probe forward for the line separator until the end of the line or the end of the file is reached; if a line separator is detected, the initial segmentation position is updated to the position of that line separator, as shown in FIG. 3b.
  • In another implementation, similar to the forward probing described above, the embodiments of the present application may also probe for the line separator by rolling backward until the beginning of the line or the start of the file is reached; if a line separator is detected, the initial segmentation position is updated to the position of that line separator. It should be noted that the backward rollback approach places higher requirements on the file system than forward probing, and it is only available for file systems that support backward rollback.
  • Sub-step S117 performing de-duplication processing on the adjusted initial segmentation position to obtain a plurality of segmentation positions.
  • As shown in the segmentation position adjustment diagram of FIG. 4a, in a specific implementation an excessively long row data record may span multiple text data blocks; in that case the processing of sub-step S116 may yield multiple duplicate segmentation positions. Only one of each set of duplicate segmentation positions is kept and the other duplicates are deleted; the deduplication result is as shown in the deduplication result diagram of FIG. 4b.
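  • A sketch of sub-steps S116 and S117, using forward probing for the next line separator and then deduplicating the adjusted positions; the chunked seek-and-read loop is one possible implementation, not the one mandated by the application:

```python
def adjust_to_separator(f, pos, separator=b"\n", chunk=4096):
    """Sub-step S116 (sketch): probe forward from byte offset pos to the nearest line separator."""
    f.seek(pos)
    offset = pos
    while True:
        data = f.read(chunk)
        if not data:                            # reached the end of the file
            return offset
        i = data.find(separator)
        if i >= 0:                              # stop just after the detected separator
            return offset + i + len(separator)
        offset += len(data)

def final_split_points(path, initial_points):
    """Sub-steps S116-S117: adjust every initial position, then deduplicate."""
    with open(path, "rb") as f:
        adjusted = [p if p == 0 else adjust_to_separator(f, p) for p in initial_points]
    # An over-long row can make several points collapse onto the same separator;
    # keep only one of each duplicate position, in ascending order.
    return sorted(set(adjusted))
```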
  • Sub-step S12 the file object is segmented according to the plurality of segmentation positions to obtain a plurality of text data blocks.
  • the file objects may be segmented according to the plurality of segmentation positions to obtain corresponding plurality of text data blocks. For example, after performing internal segmentation of the file according to step 203, a plurality of text data blocks as shown in the text data block diagram of FIG. 5 can be obtained.
  • Step 204 After the segmentation of the plurality of file objects is completed, all the text data blocks corresponding to the plurality of file objects are concurrently written into the destination end.
  • all the text data blocks can be concurrently written to the destination.
  • multiple text data blocks can be read in parallel by using a distributed multi-machine, and/or multi-process, and/or multi-threaded manner, thereby improving data synchronization speed and efficiency.
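  • One way to realize the concurrent write of step 204 with multiple threads is sketched below (the multi-process and distributed variants follow the same pattern); the block byte ranges are assumed to come from the segmentation positions obtained above:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def copy_block(src_path, start, end, dest_path, chunk=64 * 1024):
    """Copy one text data block [start, end) from the source file into its own destination file."""
    with open(src_path, "rb") as src, open(dest_path, "wb") as dst:
        src.seek(start)
        remaining = end - start
        while remaining > 0:
            data = src.read(min(chunk, remaining))
            if not data:
                break
            dst.write(data)
            remaining -= len(data)

def write_blocks_concurrently(src_path, points, dest_dir, workers=8):
    """Step 204 (sketch): write every block in parallel, one destination file per block."""
    file_size = os.path.getsize(src_path)
    ranges = list(zip(points, points[1:] + [file_size]))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(copy_block, src_path, start, end,
                               os.path.join(dest_dir, f"part_{i}"))
                   for i, (start, end) in enumerate(ranges)]
        for future in futures:
            future.result()
```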
  • step 204 may include the following sub-steps:
  • Sub-step S21 respectively converting the format of the text data block into an intermediate state format
  • In the embodiments of the present application, a relay mechanism is defined, and data is synchronized between the source end data and the destination end data through this relay mechanism. The relay mechanism first converts the format of the text data blocks obtained by segmenting the source data into an intermediate-state format, where the intermediate-state format is neither the format of the source data nor the format of the destination data; it is a transitional state from the source data format to the destination data format.
  • the sub-step S21 may include the following sub-steps:
  • Sub-step S211 for each row data record of each text data block, cutting according to the column separator to obtain one or more column records;
  • the embodiment of the present application may divide the row data record according to the column separator to obtain one or more column records.
  • the row data record is: 1, 2, 3, abc, 2015-07-16 00:00:00, China; after cutting according to the column separator ",", the resulting column record cut results are: 1 2 3 abc 2015-07-16 00:00:00 China, a total of 6 columns.
  • Sub-step S212 adding a corresponding preset data type to the column record, respectively, to obtain the intermediate state format.
  • Because the row data records store only character data, each column of data has lost its specific type (integer, floating-point number, date, null, and so on); therefore, after multiple column records are obtained through sub-step S211, a preset data type can be added to the column records according to the user's pre-configured information.
  • As a preferred example of the embodiments of the present application, the preset data type may include at least one or more of the following types: string STRING, long integer LONG, Boolean BOOLEAN, double-precision floating point DOUBLE, and date DATE. The data types may also include the byte type BYTES, but this type is not involved in the text reading process.
  • the preset data type can be analogized to a relational database table, and each column of the table has a type, as shown in Table 1 below.
  • In such a configuration, index represents the index of the column within a row of data, with subscripts starting from 0; type represents the data type; and for the date type, a format can be configured to represent the date format.
  • It should be noted that the embodiments of the present application involve the reading of character data rather than binary data (for example, MP3 files or pictures); for binary data, Base64 encoding may be used to convert it into the string type STRING for processing.
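  • Sub-steps S211 and S212 could be sketched as follows; the column configuration mirrors the index/type/format description of Table 1, but its concrete key names and the chosen columns are assumptions for illustration:

```python
from datetime import datetime

# Hypothetical column configuration in the spirit of Table 1: index is the column's position
# within a row (starting from 0), type is the preset data type, and DATE columns may carry a
# format string. The concrete key names are assumptions, not syntax from the application.
COLUMN_CONFIG = [
    {"index": 0, "type": "LONG"},
    {"index": 3, "type": "STRING"},
    {"index": 4, "type": "DATE", "format": "%Y-%m-%d %H:%M:%S"},
    {"index": 5, "type": "STRING"},
]

CASTERS = {
    "STRING": str,
    "LONG": int,
    "DOUBLE": float,
    "BOOLEAN": lambda value: value.strip().lower() in ("true", "1"),
}

def to_intermediate(row, column_separator=",", config=COLUMN_CONFIG):
    """Sub-steps S211-S212 (sketch): cut a row by the column separator, then attach preset types."""
    columns = row.rstrip("\n").split(column_separator)   # S211: one or more column records
    typed_columns = []
    for spec in config:                                  # S212: add the corresponding preset type
        raw = columns[spec["index"]]
        if spec["type"] == "DATE":
            value = datetime.strptime(raw, spec["format"])
        else:
            value = CASTERS[spec["type"]](raw)
        typed_columns.append((spec["type"], value))
    return typed_columns

# The example row "1,2,3,abc,2015-07-16 00:00:00,China" yields, for the configured columns,
# [("LONG", 1), ("STRING", "abc"), ("DATE", datetime(2015, 7, 16, 0, 0)), ("STRING", "China")].
```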
  • Sub-step S22 writing the text data block of the intermediate state format to the destination end
  • Sub-step S23 in the destination end, converting the intermediate text data block into a format required by the destination end.
  • After the intermediate-state format of a text data block is obtained, the text data block in the intermediate-state format can be written to the destination end; at the destination end, the intermediate-state text data block can be converted into the data format type required by the destination end. That is, the above-mentioned preset data types serve as intermediate data types and play a relaying role, and the relationship among the source end, the destination end, and the preset data types is: multiple source-end data types -> 5 preset data types -> multiple destination-end data types.
  • According to the relay mechanism of the embodiments of the present application, the synchronization framework may adopt a plug-in mode: whether a source-end data type or a destination-end data type is added, the relay mechanism can be used to quickly complete the data type conversion. This enriches the forms of data exchange between different storage systems and is highly scalable.
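  • The plug-in idea, where every reader maps its native types onto the preset types and every writer maps them back, might be organized as in the sketch below; the registry functions and plugin names are entirely hypothetical:

```python
# Relay sketch: N source type systems -> 5 preset types -> M destination type systems.
# Adding a new source or destination only requires registering one new plugin; the preset
# types in the middle never change.
READER_PLUGINS = {}   # source name      -> fn(raw column)         -> (preset_type, value)
WRITER_PLUGINS = {}   # destination name -> fn(preset_type, value) -> destination value

def reader_plugin(name):
    def register(fn):
        READER_PLUGINS[name] = fn
        return fn
    return register

def writer_plugin(name):
    def register(fn):
        WRITER_PLUGINS[name] = fn
        return fn
    return register

@reader_plugin("sftp_text")
def read_text_column(raw):
    # A plain text source only knows strings; finer typing comes from the user's column config.
    return ("STRING", raw)

@writer_plugin("warehouse_table")
def write_warehouse_column(preset_type, value):
    return value if preset_type == "STRING" else str(value)
```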
  • In one implementation, during the above format conversion using the relay mechanism, the embodiments of the present application may further include the following steps:
  • If the format conversion of sub-step S21 and/or sub-step S23 fails, the corresponding column record can be treated as dirty data, for example when "abc" cannot be converted to a number or to a date; if the amount of dirty data exceeds the dirty data limit, an error can be reported and the synchronization task for that text data block ends.
  • As an example, the dirty data limit may include: 1. a dirty data count, that is, the task reports an error when the number of dirty data records exceeds a specified count; 2. a dirty data percentage, that is, the task reports an error when the dirty data exceeds a specified percentage.
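  • A sketch of the dirty data accounting described above, supporting both an absolute count limit and a percentage limit; the class and method names are illustrative:

```python
class DirtyDataLimitExceeded(Exception):
    pass

class DirtyDataTracker:
    """Counts records that failed conversion to or from the intermediate-state format."""

    def __init__(self, max_count=None, max_percent=None):
        self.max_count = max_count        # limit 1: absolute number of dirty records
        self.max_percent = max_percent    # limit 2: dirty records as a percentage of all records
        self.dirty = 0
        self.total = 0

    def record(self, conversion_ok):
        self.total += 1
        if not conversion_ok:
            self.dirty += 1
        if self.max_count is not None and self.dirty > self.max_count:
            raise DirtyDataLimitExceeded(f"{self.dirty} dirty records exceed limit {self.max_count}")
        if self.max_percent is not None and 100.0 * self.dirty / self.total > self.max_percent:
            raise DirtyDataLimitExceeded(f"dirty data ratio exceeds {self.max_percent}%")
```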
  • Further, after multiple text data blocks are written to the destination end, if only one file were established at the destination end to store them, multiple threads (or processes) could not write concurrently to the same file without data mutual-exclusion problems. To avoid this, the destination end can establish a file corresponding to each text data block for storing and processing that text data block; that is, one text data block corresponds to one file at the destination end, and each text data block is written into a different file at the destination end.
  • the file in the destination may include a file identifier.
  • In one implementation, the file identifier may be named in the following manner: the user specifies a file prefix (i.e., a specified prefix), the embodiments of the present application generate a random 32-character UUID (Universally Unique Identifier), and the specified prefix is concatenated with the UUID to form the file identifier.
  • In a specific implementation, there may be three modes of writing text data blocks to the destination end. The first write mode is truncate: historical data having the same specified prefix is cleaned before the text data block is written. The second write mode is noConflict: when the text data block is written, if historical data with the same specified prefix is found, an error report is generated and data synchronization exits. The third write mode is append: data is appended, and the text data block is written normally regardless of whether historical data with the same specified prefix exists.
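  • The prefix-plus-UUID naming and the three write modes can be sketched as follows, assuming a directory-based destination end:

```python
import os
import uuid

def destination_file(dest_dir, prefix, mode="append"):
    """Create the destination file path for one text data block: <prefix>_<32-character UUID>."""
    existing = [name for name in os.listdir(dest_dir) if name.startswith(prefix)]
    if mode == "truncate":
        for name in existing:                 # clean historical data with the same specified prefix
            os.remove(os.path.join(dest_dir, name))
    elif mode == "noConflict":
        if existing:                          # same prefix found: report an error and abort the sync
            raise RuntimeError(f"historical data with prefix '{prefix}' already exists")
    elif mode != "append":                    # append: write regardless of historical data
        raise ValueError(f"unknown write mode: {mode}")
    return os.path.join(dest_dir, f"{prefix}_{uuid.uuid4().hex}")
```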
  • In the embodiments of the present application, the structured information is segmented into a plurality of text data blocks through coarse-grained file-level segmentation and fine-grained segmentation inside the file; the segmentation focuses on the logical two-dimensional table structure inside the file and is therefore a general synchronization solution for semi-structured file data.
  • the embodiments of the present application can support multiple source data types, and have strong versatility and scalability.
  • Referring to FIG. 6, a structural block diagram of an embodiment of an ETL-based file processing system of the present application is shown, which may specifically include the following modules:
  • a file object obtaining module 601, configured to acquire a plurality of file objects from a source end;
  • the file segmentation module 602 is configured to perform data segmentation in the file for each file object to obtain a plurality of text data blocks.
  • the writing module 603 is configured to concurrently write all the text data blocks corresponding to the plurality of file objects to the destination end after the plurality of file objects are segmented.
  • the file object obtaining module 601 may include the following sub-modules:
  • a structured information reading submodule configured to read structured information from a source, where the structured information includes a plurality of file objects;
  • a structured information segmentation sub-module configured to segment the structured information in units of a single file object to obtain a plurality of file objects.
  • the file slicing module 602 may include the following sub-modules:
  • a segmentation position determining sub-module configured to determine a plurality of segmentation positions for each file object;
  • a segmentation sub-module configured to segment the file object according to the plurality of segmentation positions to obtain a plurality of text data blocks.
  • the file object includes a plurality of row data records
  • the segmentation location determining submodule includes:
  • a file size obtaining unit configured to acquire a size of the file object for each file object
  • a row size determining unit configured to determine an average size of the row data record
  • a first calculating unit configured to calculate a quotient of the file object size and an average size of the row data record, to obtain a quantity of the row data record
  • a second calculating unit configured to calculate the quotient of the number of row data records and the preset number of text data blocks, to obtain the number of row data records owned by each text data block;
  • An initial segmentation location determining unit configured to determine a plurality of initial segmentation locations of the file object according to the number of row data records that the text data block has;
  • an adjusting unit configured to adjust the cutting position to a position where the line separator is located when the initial splitting position is not the position where the line separator is located;
  • a deduplication unit configured to perform deduplication processing on the adjusted initial segmentation positions to obtain a plurality of segmentation positions.
  • the adjusting unit is further configured to:
  • if the initial segmentation position is not the position where a row separator is located, probing forward or rolling back to the position of the row separator closest to the initial segmentation position;
  • the segmentation position is determined as the position of that closest row separator.
  • the line size determining unit is further configured to:
  • the size of the first row data record is taken as the average size of the row data records; or
  • the size of the last row data record is taken as the average size of the row data records; or
  • the size of a randomly selected row data record is taken as the average size of the row data records; or
  • the average of the sizes of the row data records of the first N rows is calculated and taken as the average size of the row data records.
  • the writing module 603 may include the following sub-modules:
  • a first format conversion submodule configured to convert the format of the text data block into an intermediate state format after the plurality of file objects are segmented
  • a data writing sub-module configured to write the corresponding intermediate-state text data blocks to the destination end using multiple threads, multiple processes, or distributed multiple machines;
  • a second format conversion sub-module configured to convert, at the destination end, the intermediate-state text data blocks into the format required by the destination end.
  • the text data block includes one or more row data records
  • the row data record includes a column separator
  • the first format conversion submodule includes:
  • a cutting unit configured to cut each row data record of each text data block according to the column separator after the segmentation of the plurality of file objects, to obtain one or more column records
  • a data type adding unit configured to add a corresponding preset data type to the column record, respectively, to obtain the intermediate state format.
  • the preset data type includes at least one or more of the following types: string STRING, long integer LONG, Boolean BOOLEAN, double-precision floating point DOUBLE, and date DATE.
  • the system further includes:
  • a dirty data generating module configured to generate dirty data when conversion of the format of a text data block into the intermediate-state format is unsuccessful, or when conversion of an intermediate-state text data block into the format required by the destination end is unsuccessful;
  • An error report generation module is configured to generate an error report when the dirty data exceeds a preset threshold.
  • the file object is a file object capable of random reading and writing, and the file object may include at least one or more of the following objects: a local file, an OSS file, an SFTP file, and an HDFS file.
  • It will be appreciated by those skilled in the art that the embodiments of the present application may be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
  • The embodiments of the present application are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to the embodiments of the present application. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions.
  • These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions may also be loaded onto a computer or another programmable data processing terminal device, such that a series of operational steps are performed on the computer or other programmable terminal device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An ETL-based file processing method and system, wherein the method includes: acquiring multiple file objects from a source end (101); for each file object, performing data segmentation within the file to obtain multiple text data blocks (102); and, after segmentation of the multiple file objects is completed, concurrently writing all text data blocks corresponding to the multiple file objects to a destination end (103). The method and system can improve the speed of file synchronization in the ETL process and maximize data synchronization efficiency.

Description

一种基于ETL的文件处理方法及系统
本申请要求2015年08月14日递交的申请号为201510502163.4、发明名称为“一种基于ETL的文件处理方法及系统”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及数据处理技术领域,特别是涉及一种基于ETL的文件处理方法,以及一种基于ETL的文件处理系统。
背景技术
随着企业信息化建设的发展,越来越多的企业建立了众多的信息系统,以帮助企业进行内外部业务的处理和管理工作。但是随着信息系统的增加,各自孤立工作的信息系统造成了大量的冗余数据和业务人员的重复劳动。企业应用集成(EAI,Enterprise Application Integration)应运而生,而ETL是实现数据集成的主要技术。
ETL,Extraction-Transformation-Loading的缩写,即数据抽取(Extract)、转换(Transform)、加载(Load)的过程,它是构建数据仓库的重要环节。ETL是将业务系统的数据经过抽取、清洗转换之后加载到数据仓库的过程,目的是将企业中的分散、零乱、标准不统一的数据整合到一起,为企业的决策提供分析依据。
目前,针对文件系统的同步工具如同步盘类应用,是以文件为单位,主要完成各个终端间的文件到文件的同步,并不适应数据仓库ETL的需要。
因此,目前需要本领域技术人员迫切解决的一个技术问题就是:如何提出一种基于ETL的文件处理机制,用以提高ETL过程中文件同步的速度,最大化数据同步效率。
发明内容
本申请实施例所要解决的技术问题是提供一种基于ETL的文件处理方法,用以提高ETL过程中文件同步的速度,最大化数据同步效率。
相应的,本申请实施例还提供了一种基于ETL的文件处理系统,用以保证上述方法的实现及应用。
为了解决上述问题,本申请实施例公开了一种基于ETL的文件处理方法,所述的方 法包括:
从源端获取多个文件对象;
针对每个文件对象,进行文件内的数据切分,得到多个文本数据块;
当所述多个文件对象切分完成后,将所述多个文件对象对应的所有文本数据块并发写入目的端。
优选地,所述从源端获取多个文件对象的步骤包括:
从源端读取结构化信息,所述结构化信息包括多个文件对象;
将所述结构化信息以单个文件对象为单位进行切分,得到多个文件对象。
优选地,所述针对每个文件对象,进行文件内的数据切分,得到多个文本数据块的步骤包括:
针对每个文件对象,确定多个切分位置;
按照所述多个切分位置对所述文件对象进行切分,得到多个文本数据块。
优选地,所述文件对象包括多个行数据记录,所述针对每个文件对象,确定多个切分位置的步骤包括:
针对每个文件对象,获取所述文件对象的大小;
确定所述行数据记录的平均大小;
计算所述文件对象大小与所述行数据记录的平均大小的商值,得到所述行数据记录的数量;
计算预设的文本数据块的数量与所述行数据记录的数量的商值,得到每个文本数据块所拥有的行数据记录的数量;
依据所述文本数据块所具有的行数据记录的数量,确定所述文件对象的多个初始切分位置;
若所述初始切分位置不是行分隔符所在的位置,则将所述切分位置调整为行分隔符所在的位置;
对所述调整后的初始切分位置进行去重处理,得到多个切分位置。
优选地,所述若所述初始切分位置不是行分隔符所在的位置,则将所述切分位置调整为行分隔符所在的位置的步骤包括:
若所述初始切分位置不是行分隔符所在的位置,则向前探测或向后回退到与所述初始切分位置最接近的行分隔符的位置;
将所述切分位置确定为所述最接近的行分隔符的位置。
优选地,所述确定所述行数据记录的平均大小的步骤包括:
将第一个行数据记录的大小作为所述行数据记录的平均大小;
或者,
将最后一个行数据记录的大小作为所述行数据记录的平均大小;
或者,
随机选取一个行数据记录的大小作为所述行数据记录的平均大小;
或者,
计算前N行的行数据记录的大小平均值作为所述行数据记录的平均大小。
优选地,所述当所述多个文件对象切分完成后,将所述多个文件对象对应的所有文本数据块文本数据块并发写入目的端的步骤包括:
当所述多个文件对象切分完成后,分别将所述文本数据块的格式转换为中间态格式;
采用多线程或者多进程或者分布式多机分别将对应的中间态格式的文本数据块写入目的端;
在所述目的端中,将所述中间态的文本数据块转换为所述目的端所需的格式。
优选地,所述文本数据块包括一条或多条行数据记录,所述行数据记录包括列分隔符,所述当所述多个文件对象切分完成后,分别将所述文本数据块的格式转换为中间态格式的步骤包括:
当所述多个文件对象切分完成后,针对每个文本数据块的每条行数据记录,按照所述列分隔符进行切割,得到一个或多个列记录;
分别为所述列记录添加对应的预设数据类型,得到所述中间态格式。
优选地,所述预设数据类型至少包括如下类型的一种或多种:字符串STRING,长整型LONG,布尔型BOOLEAN,双精度浮点型DOUBLE,日期DATE。
优选地,所述方法还包括:
若所述文本数据块的格式转换所述中间态格式不成功,或者,所述中间态的文本数据块转换所述目的端所需的格式不成功,则产生脏数据;
若所述脏数据超出预设阈值,则生成错误报告。
优选地,所述文件对象为能够进行随机读写的文件对象,所述文件对象至少可以包括如下对象的一种或多种:本地文件、开放存储服务OSS文件、安全文件传送协议SFTP文件、分布式文件系统HDFS文件。
本申请实施例还提供了一种基于ETL的文件处理系统,所述的系统包括:
文件对象获取模块,用于从源端获取多个文件对象;
文件切分模块,用于针对每个文件对象,进行文件内的数据切分,得到多个文本数据块;
写入模块,用于在所述多个文件对象切分完成后,将所述多个文件对象对应的所有文本数据块并发写入目的端。
优选地,所述文件对象获取模块包括:
结构化信息读取子模块,用于从源端读取结构化信息,所述结构化信息包括多个文件对象;
结构化信息切分子模块,用于将所述结构化信息以单个文件对象为单位进行切分,得到多个文件对象。
优选地,所述文件切分模块包括:
切分位置确定子模块,用于针对每个文件对象,确定多个切分位置;
切分子模块,用于按照所述多个切分位置对所述文件对象进行切分,得到多个文本数据块。
优选地,所述文件对象包括多个行数据记录,所述切分位置确定子模块包括:
文件大小获取单元,用于针对每个文件对象,获取所述文件对象的大小;
行大小确定单元,用于确定所述行数据记录的平均大小;
第一计算单元,用于计算所述文件对象大小与所述行数据记录的平均大小的商值,得到所述行数据记录的数量;
第二计算单元,用于计算预设的文本数据块的数量与所述行数据记录的数量的商值,得到每个文本数据块所拥有的行数据记录的数量;
初始切分位置确定单元,用于依据所述文本数据块所具有的行数据记录的数量,确定所述文件对象的多个初始切分位置;
调整单元,用于在所述初始切分位置不是行分隔符所在的位置时,则将所述切分位置调整为行分隔符所在的位置;
去重单元,用于对所述调整后的初始切分位置进行去重处理,得到多个切分位置。
优选地,所述调整单元还用于:
若所述初始切分位置不是行分隔符所在的位置,则向前探测或向后回退到与所述初始切分位置最接近的行分隔符的位置;
将所述切分位置确定为所述最接近的行分隔符的位置。
优选地,所述行大小确定单元还用于:
将第一个行数据记录的大小作为所述行数据记录的平均大小;
或者,
将最后一个行数据记录的大小作为所述行数据记录的平均大小;
或者,
随机选取一个行数据记录的大小作为所述行数据记录的平均大小;
或者,
计算前N行的行数据记录的大小平均值作为所述行数据记录的平均大小。
优选地,所述写入模块包括:
第一格式转换子模块,用于在所述多个文件对象切分完成后,分别将所述文本数据块的格式转换为中间态格式;
数据写入子模块,用于采用多线程或者多进程或者分布式多机分别将对应的中间态格式的文本数据块写入目的端;
第二格式转换子模块,用于在所述目的端中,将所述中间态的文本数据块转换为所述目的端所需的格式。
优选地,所述文本数据块包括一条或多条行数据记录,所述行数据记录包括列分隔符,所述第一格式转换子模块包括:
切割单元,用于在所述多个文件对象切分完成后,针对每个文本数据块的每条行数据记录,按照所述列分隔符进行切割,得到一个或多个列记录;
数据类型添加单元,用于分别为所述列记录添加对应的预设数据类型,得到所述中间态格式。
优选地,所述预设数据类型至少包括如下类型的一种或多种:字符串STRING,长整型LONG,布尔型BOOLEAN,双精度浮点型DOUBLE,日期DATE。
优选地,所述系统还包括:
脏数据产生模块,用于在所述文本数据块的格式转换所述中间态格式不成功时,或者,所述中间态的文本数据块转换所述目的端所需的格式不成功时,产生脏数据;
错误报告生成模块,用于在所述脏数据超出预设阈值时,生成错误报告。
优选地,所述文件对象为能够进行随机读写的文件对象,所述文件对象至少可以包括如下对象的一种或多种:本地文件、开放存储服务OSS文件、安全文件传送协议SFTP 文件、分布式文件系统HDFS文件。
与背景技术相比,本申请实施例包括以下优点:
在本申请实施例中,当从源端获取多个文件对象时,以文件对象为单位,进行文件内部的数据切分,得到文本数据块,并在所有文件对象内部切分完成后,将所有的文本数据块并发写入目的端中,切分粒度为文件内部的细粒度切分,则进行并发写入时,能够提高ETL过程中文件同步的速度,最大化数据同步效率。
另外,由于本申请实施例是以文件内部的结构进行切分,并不限于某种文件系统,能够完成多种文件系统的数据同步,通用性强。
附图说明
图1是本申请的一种基于ETL的文件处理方法实施例一的步骤流程图;
图2是本申请的一种基于ETL的文件处理方法实施例二的步骤流程图;
图3a是本申请的一种基于ETL的文件处理方法实施例二的切分位置示意图;
图3b是本申请的一种基于ETL的文件处理方法实施例二的切分位置调整示意图一;
图4a是本申请的一种基于ETL的文件处理方法实施例二的切分位置调整示意图二;
图4b是本申请的一种基于ETL的文件处理方法实施例二的去重结果示意图;
图5是本申请的一种基于ETL的文件处理方法实施例二的文本数据块示意图;
图6是本申请的一种基于ETL的文件处理系统实施例的结构框图。
具体实施方式
为使本申请的上述目的、特征和优点能够更加明显易懂,下面结合附图和具体实施方式对本申请作进一步详细的说明。
参照图1,示出了本申请的一种基于ETL的文件处理方法实施例一的步骤流程图,可以包括如下步骤:
步骤101,从源端获取多个文件对象;
步骤102,针对每个文件对象,进行文件内的数据切分,得到多个文本数据块;
步骤103,当所述多个文件对象切分完成后,将所述多个文件对象对应的所有文本数据块并发写入目的端。
在本申请实施例中,当从源端获取多个文件对象时,以文件对象为单位,进行 文件内部的数据切分,得到文本数据块,并在所有文件对象内部切分完成后,将所有的文本数据块并发写入目的端中,切分粒度为文件内部的细粒度切分,则进行并发写入时,能够提高ETL过程中文件同步的速度,最大化数据同步效率。
另外,由于本申请实施例是以文件内部的结构进行切分,并不限于某种文件系统,能够完成多种文件系统的数据同步,通用性强。
参照图2,示出了本申请的一种基于ETL的文本数据处理方法实施例二的步骤流程图,可以包括如下步骤:
步骤201,从源端读取结构化信息,所述结构化信息包括多个文件对象;
本申请实施例可以应用于ETL的场景,源端可以有多个,例如,可以包括但不限于:SFTP(Secure File Transfer Protocol,安全文件传送协议)、本地文件系统Local File、OSS(open storage services,开放存储服务,可理解为存储盘)、HDFS(Hadoop Distributed File System,分布式文件系统)等。
需要说明的是,本申请的源端系统一般不具有数据库事务相关的保证,需要用户自己保证数据读取时的数据一致性(数据的增、删、改等)问题,即用户尽量保证在数据同步过程中,不要修改结构化信息的内容。
从源端读取的非结构化信息可以包括结构化数据和/或半结构化数据,其中,结构化数据可以为数据库,其数据行列有严格的预定义,有明确的Schema约束;半结构化数据是指有明确行分隔符、列分隔符的字符数据,可抽象为一个二维表结构,每个记录行包括1到多列数据,每行数据的列数目是相同的。
在本申请实施例中,结构化信息可以包括多个文件对象,其中,文件对象可以为能够构建出文件模型的对象。另外,为了使得后续的数据切分和并发写入能够顺利执行,本申请实施例的文件对象可以为能够进行随机读写的文件对象,也即,本申请的文件对象可以从任意字节开始读取,对于文件对象的写入操作,可以在文件对象的最后逐渐追加。
作为本申请实施例的一种示例,文件对象至少可以包括如下对象的一种或多种:本地文件、OSS文件、SFTP文件、HDFS文件。
需要说明的是,本申请实施例的文件对象可以支持不同的编码类型,以及不同的压缩类型。
步骤202,将所述结构化信息以单个文件对象为单位进行切分,得到多个文件对 象;
从源端获取结构化信息以后,本申请实施例可以进行第一层次粗粒度的文件切分,该切分是文件级别的切分,切分的方式可以为以单个文件对象为单位进行切分,得到多个文件对象。其中,切分是指将一个任务划分成多个子任务,可以并发执行多个子任务,所有子任务完成后整体任务也完成了,通过切分并发执行子任务可以缩短作业的运行时间。
例如,从源端读取的结构化信息包括10个文件对象,以单个文件对象为单位,则可以切分成10份文件对象。
再如,从源端获取一天的结构化信息,该结构化信息以小时为单位进行数据存储,则一天可以切分成24个文件对象。
需要说明的是,如果从源端读取的结构化信息的大小过小,则可以不执行本申请的切分逻辑,比如读取一个100KB的文件对象,没有必要对其进行切分。本申请的切分逻辑针对的是结构化信息的大小超时设定阈值的源端数据,因此,在执行步骤202之前,可以首先判断结构化信息的大小是否大于设定阈值,若大于或等于设定阈值,则可以执行步骤202,否则,不执行步骤202。
步骤203,针对每个文件对象,进行文件内的数据切分,得到多个文本数据块;
从源端获得的数据经过第一层次粗粒度的切分,得到多个文件对象以后,本申请实施例针对每个文件对象,进一步执行第二层次细粒度的文件切分,该切分是文件内部的切分,该切分完成后,每个文件对象可以得到多个文本数据块。
在本申请实施例的一种优选实施例中,步骤203可以包括如下子步骤:
子步骤S11,针对每个文件对象,确定多个切分位置;
在对文件对象进行切分前,首先需要确定切分位置,在本申请实施例的一种优选实施例中,子步骤S11进一步包括如下子步骤:
子步骤S111,针对每个文件对象,获取所述文件对象大小;
对于每个单独的文件对象,可以获得该文件对象的属性信息,其中,属性信息可以包括文件对象大小totalSize。
子步骤S112,确定所述行数据记录的平均大小;
对于结构化数据和/或半结构化数据而言,其都可以包括行数据记录,其中,行数据记录每一行的数据记录,则对结构化信息进行切分后的文件对象也是由行数据记录组成的,该行数据记录具有行分隔符、列分隔符等信息。
然而,文件对象中的每条行数据记录可能存在大小不一的情况,为了尽量保障后续切分的行数据记录的完整,本申请实施例可以获取行数据记录的平均大小lineSize,以该平均大小lineSize作为切分的依据之一。在具体实现中,lineSize可以根据文件对象中的行数据记录的大小分布等特点来确定。
在一种实施方式中,可以参考如下几种方式来确定行数据记录的平均大小:
将第一个行数据记录的大小作为所述行数据记录的平均大小;
或者,
将最后一个行数据记录的大小作为所述行数据记录的平均大小;
或者,
随机选取一个行数据记录的大小作为所述行数据记录的平均大小;
或者,
计算前N行的行数据记录的大小平均值作为所述行数据记录的平均大小。
具体来说,如果文件对象中每条行数据记录的大小都是相同的,或者基本相同,则可以直接选取第一行数据或最后一行数据或者随机一行数据的大小作为行数据记录的平均大小。
如果文件对象中的行数据记录的大小分布是周期变化的,比如变化周期是50行,则取每50行数据的大小平均值作为行数据记录的平均大小,当然,除了可以取大小平均值作为行数据记录的平均大小以外,还可以取该变化周期内的行数据记录的大小的最大值,或者,最小值,或者,中位数等作为行数据记录的平均大小。另外,本申请实施例述及的前N行除了上述变化周期外,还可以是用户预先配置的行数,或随机读取的行数,本申请实施例对此无需加以限制。
需要说明的是,上述确定行数据记录的平均大小的几种方式仅仅是本申请实施例的示例,本领域技术人员根据需要采用其他方式来确定行数据记录的平均大小来达到本申请的目的均是可以的。
子步骤S113,计算所述文件对象大小与所述行数据记录的平均大小的商值,得到所述行数据记录的数量;
可以根据totalSize及lineSize估算出文件对象中所拥有的行数据记录的数量number=totalSize/lineSize。
子步骤S114,计算预设的文本数据块的数量与所述行数据记录的数量的商值,得到每个文本数据块所拥有的行数据记录的数量;
在本申请实施例中,可以预先配置用户想要切分获得的文本数据块的数量m,在一种优选实施方式中,配置的方式可以为:用户直接输入需要配置的文本数据块的数量;或者,若文本数据块的流速是1Mbps,用户需要10Mbps,那么切分成10份。
根据文本数据块的数量m与行数据记录的数量number,可以得到每个文本数据块所拥有的行数据记录的数量为number/m。
子步骤S115,依据所述文本数据块所拥有的行数据记录的数量,确定所述文件对象的多个初始切分位置;
当确定每个文本数据块所拥有的行数据记录的数量以后,则可以确定文件对象的多个初始切分位置point,该初始切分位置point可以为:0,1*number/m,2*number/m,3*number/m,依次类推。
子步骤S116,若所述初始切分位置不是行分隔符所在的位置,则将所述切分位置调整为行分隔符所在的位置;
为了避免按照初始切分位置point进行切分时将一个完整的行数据记录隔断,本申请实施例可以对切分位置进行微调,以保证切分位置是在行尾或行首。在具体实现中,对初始切分位置调整的方式可以为:若初始切分位置不是行分隔符所在的位置,则将所述切分位置调整为行分隔符所在的位置。
在本申请实施例的一种优选实施例中,子步骤S116进一步可以为:若所述初始切分位置不是行分隔符所在的位置,则向前探测或向后回退到与所述初始切分位置最接近的行分隔符的位置;将所述切分位置确定为所述最接近的行分隔符的位置。
如图3a的切分位置示意图所示,根据子步骤S115得到的初始切分位置可能会位于行数据记录的中部,如果按照此初始切分位置进行切分,会把行数据记录隔断,导致行数据记录不完整。因此,在一种实施方式中,本申请实施例可以采用向前探测的方式,向前探测行分隔符,直到到达行尾或者文件结束,若探测到行分隔符,则将该初始切分位置更新为行分隔符所在的位置,如图3b所示的切分位置调整示意图一所示。
在另一种实施方式中,与前述向前探测类似,本申请实施例还可以采用向后回退的方式来探测行分隔符,直到回退到行首或者文件开始,若探测到行分隔符,则将该初始切分位置更新为行分隔符所在的位置。需要说明的是,向后回退的方式是较于向前探测的方式要求会更高一点,并且,其只提供给能够支持向后回退的文件系统使用。
子步骤S117,对所述调整后的初始切分位置进行去重处理,得到多个切分位置。
如图4a所示的切分位置调整示意图二所示,在具体实现中,可能会出现超长的行数据记录跨越多个文本数据块的情况,这种情况按照子步骤S116的处理后可以得到多个重复的切分位置,此时,可以只保留一个该重复的切分位置,删除其他重复的切分位置,去重结果如图4b的去重结果示意图所示。
子步骤S12,按照所述多个切分位置对所述文件对象进行切分,得到多个文本数据块。
获得多个切分位置以后,可以按照该多个切分位置进行文件对象的切分,得到对应的多个文本数据块。例如,按照步骤203进行文件内部切分后,可以得到如图5的文本数据块示意图所示的多个文本数据块。
步骤204,当所述多个文件对象切分完成后,将所述多个文件对象对应的所有文本数据块并发写入目的端。
第二层次细粒度的切分获得全部文本数据块以后,可以将所有文本数据块并发写入目的端。在具体实现中,可以采取分布式多机,和/或,多进程,和/或,多线程的方式并行读取多个文本数据块,从而提高了数据同步速度和效率。
在具体实现中,由于分布式系统需要处理的数据量是庞大的,但是系统能够容纳的分布式机器或进程或线程数量是有限的,因此,可以使用一个进程或线程或分布式机器来处理多个文本数据块,并限定该进程或线程或分布式机器处理一个文本数据块的数据流量,以达到流量控制的目的。
在本申请实施例的一种优选实施例中,步骤204可以包括如下子步骤:
子步骤S21,分别将所述文本数据块的格式转换为中间态格式;
应用于本申请实施例,定义了一种中转机制,源端数据与目的端数据通过该中转机制进行数据同步。该中转机制首先是将从源端数据切分得到的文本数据块的格式转换成中间态的格式,其中,中间态的格式既不是源端数据的格式,也不是目的端数据的格式,其是源端数据格式到目的端数据格式的一种过渡状态。
在本申请实施例的一种优选实施例中,子步骤S21可以包括如下子步骤:
子步骤S211,针对每个文本数据块的每条行数据记录,按照所述列分隔符进行切割,得到一个或多个列记录;
本申请实施例可以根据结构化信息的特点,将行数据记录按照列分隔符进行分割,得到一个或多个列记录。例如,行数据记录为:1,2,3,abc,2015-07-16 00:00:00, 中国;其按照列分隔符“,”进行切割后,得到的列记录切割结果为:1 2 3 abc 2015-07-16 00:00:00中国,共6列。
子步骤S212,分别为所述列记录添加对应的预设数据类型,得到所述中间态格式。
由于行数据记录内存储的都是字符数据,每一列数据已经丢失了具体的类型(整数、浮点数、日期、null)等,因此,当通过子步骤S211获得多个列记录后,可以根据用户预先的配置信息,为该多个列记录添加预设数据类型。
作为本申请实施例的一种优选示例,预设数据类型至少可以包括如下类型的一种或多种:字符串STRING,长整型LONG,布尔型BOOLEAN,双精度浮点型DOUBLE,日期DATE,当然,该数据类型还可以包括字节类型Bytes,只是在文本读取过程中,不涉及该类型
该预设数据类型可类比于关系数据库表,表的每一列有一个类型,如下表1所示。
Figure PCTCN2016093495-appb-000001
表1
作为一种示例,一个数据类型的配置实例如下代码所示:
Figure PCTCN2016093495-appb-000002
Figure PCTCN2016093495-appb-000003
其中,index表示列在一行数据的索引,下标从0开始,type表示类型,对于date类型可配置format表示日期格式。
需要说明的是,本申请实施例中涉及的都是字符数据的读取,而不是二进制数据(例如,mp3、图片)的读取,针对二进制数据的情况,可以使用Base64编码,转换为字符串类型STRING进行处理。
子步骤S22,将所述中间态格式的文本数据块写入目的端;
子步骤S23,在所述目的端中,将所述中间态的文本数据块转换为所述目的端所需的格式。
获得文本数据块的中间态格式后,可以将该中间态格式的文本数据块写入目的端,则在目的端,可以将该中间态的文本数据块转换为目的端所需的数据格式类型,即,上述的预设数据类型作为一种中间态的数据类型,起到了中转作用,则源端、目的端和预设数据类型的关系是:多种源端数据类型->5种预设数据类型->多种目的端类型。
根据本申请实施例的中转机制,同步框架可以采用插件式的模式,无论是增加了源端的数据类型还是目的端的数据类型,都可以采用本申请实施例的中转机制快捷的完成数据类型的转换,丰富了不同存储系统间数据交换的形式,并且具有很强的拓展性。
在一种实施方式中,在上述采用中转机制的格式转换过程中,本申请实施例还 可以包括如下步骤:
若所述文本数据块的格式转换所述中间态格式不成功,或者,所述中间态的文本数据块转换所述目的端所需的格式不成功,则产生脏数据;若所述脏数据超出预设阈值,则生成错误报告。
具体来说,若上述子步骤S21和/或子步骤S23的格式转换失败,则对应的列记录可以作为脏数据处理,比如abc转换为数字、abc转换为日期等错误情况,若脏数据的数量超出脏数据限制后,可以报告错误,该文本数据块的同步任务结束。
其中,作为一种示例,脏数据限制可以包括:1、脏数据条数,即超出指定条数的脏数据时任务报错;2、脏数据百分比,即超出指定百分比的脏数据时,任务报告错误。
进一步的,当多个文本数据块写入目的端后,在目的端,若只建立一个文件用于存储多个文本数据块,则不能够做到多个线程(进程)并发写入同一个文件,导致数据互斥问题的产生。为了避免数据互斥,目的端可以建立与文本数据块一一对应的文件,用于进行文本数据块的存储和处理,即一个文本数据块对应目的端中的一个文件,每个文本数据块写入目的端不同的文件中。
目的端中的文件可以包括文件标识,在一种实施方式中,该文件标识的命名方式可以为:用户指定一个文件前缀(即指定前缀),本申请实施例生产一个随机的32位UUID(Universally Unique Identifier,通用唯一识别码),该指定前缀与UUID拼接在一起形成文件标识。
在具体实现中,向目的端写入文本数据块的写入方式可以有3种,第一种写入方式是truncate:表示写入文本数据块前,清理有相同指定前缀的历史数据;第二种写入方式是noConflict:表示写入文本数据块时,如果发现有相同指定前缀的历史数据,则生成错误报告,并退出数据同步;第三种写入方式是append:表示追加数据,不管是否有相同指定前缀的历史数据,都正常写入文本数据块。
在本申请实施例中,通过粗粒度的文件级别切分以及细粒度的文件内部切分,将结构化信息切分成多个文本数据块,切分是关注文件内部的逻辑二维表结构,是一种通用的半结构化文件数据的通用同步解决方案
另外,本申请实施例可以支持多种源端数据类型,通用性及拓展性强。
需要说明的是,对于方法实施例,为了简单描述,故将其都表述为一系列的动 作组合,但是本领域技术人员应该知悉,本申请实施例并不受所描述的动作顺序的限制,因为依据本申请实施例,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作并不一定是本申请实施例所必须的。
参照图6,示出了本申请一种基于ETL的文件处理系统的结构框图,具体可以包括如下模块:
文件对象获取模块601,用于从源端获取多个文件对象;
文件切分模块602,用于针对每个文件对象,进行文件内的数据切分,得到多个文本数据块;
写入模块603,用于在所述多个文件对象切分完成后,将所述多个文件对象对应的所有文本数据块并发写入目的端。
在本申请实施例的一种优选实施例中,所述文件对象获取模块601可以包括如下子模块:
结构化信息读取子模块,用于从源端读取结构化信息,所述结构化信息包括多个文件对象;
结构化信息切分子模块,用于将所述结构化信息以单个文件对象为单位进行切分,得到多个文件对象。
在本申请实施例的一种优选实施例中,所述文件切分模块602可以包括如下子模块:
切分位置确定子模块,用于针对每个文件对象,确定多个切分位置;
切分子模块,用于按照所述多个切分位置对所述文件对象进行切分,得到多个文本数据块。
在本申请实施例的一种优选实施例中,所述文件对象包括多个行数据记录,所述切分位置确定子模块包括:
文件大小获取单元,用于针对每个文件对象,获取所述文件对象的大小;
行大小确定单元,用于确定所述行数据记录的平均大小;
第一计算单元,用于计算所述文件对象大小与所述行数据记录的平均大小的商值,得到所述行数据记录的数量;
第二计算单元,用于计算预设的文本数据块的数量与所述行数据记录的数量的 商值,得到每个文本数据块所拥有的行数据记录的数量;
初始切分位置确定单元,用于依据所述文本数据块所具有的行数据记录的数量,确定所述文件对象的多个初始切分位置;
调整单元,用于在所述初始切分位置不是行分隔符所在的位置时,则将所述切分位置调整为行分隔符所在的位置;
去重单元,用于对所述调整后的初始切分位置进行去重处理,得到多个切分位置。
在本申请实施例的一种优选实施例中,所述调整单元还用于:
若所述初始切分位置不是行分隔符所在的位置,则向前探测或向后回退到与所述初始切分位置最接近的行分隔符的位置;
将所述切分位置确定为所述最接近的行分隔符的位置。
在本申请实施例的一种优选实施例中,所述行大小确定单元还用于:
将第一个行数据记录的大小作为所述行数据记录的平均大小;
或者,
将最后一个行数据记录的大小作为所述行数据记录的平均大小;
或者,
随机选取一个行数据记录的大小作为所述行数据记录的平均大小;
或者,
计算前N行的行数据记录的大小平均值作为所述行数据记录的平均大小。
在本申请实施例的一种优选实施例中,所述写入模块603可以包括如下子模块:
第一格式转换子模块,用于在所述多个文件对象切分完成后,分别将所述文本数据块的格式转换为中间态格式;
数据写入子模块,用于采用多线程或者多进程或者分布式多机分别将对应的中间态格式的文本数据块写入目的端;
第二格式转换子模块,用于在所述目的端中,将所述中间态的文本数据块转换为所述目的端所需的格式。
在本申请实施例的一种优选实施例中,所述文本数据块包括一条或多条行数据记录,所述行数据记录包括列分隔符,所述第一格式转换子模块包括:
切割单元,用于在所述多个文件对象切分完成后,针对每个文本数据块的每条行数据记录,按照所述列分隔符进行切割,得到一个或多个列记录;
数据类型添加单元,用于分别为所述列记录添加对应的预设数据类型,得到所述中间态格式。
在本申请实施例的一种优选实施例中,所述预设数据类型至少包括如下类型的一种或多种:字符串STRING,长整型LONG,布尔型BOOLEAN,双精度浮点型DOUBLE,日期DATE。
在本申请实施例的一种优选实施例中,所述系统还包括:
脏数据产生模块,用于在所述文本数据块的格式转换所述中间态格式不成功时,或者,所述中间态的文本数据块转换所述目的端所需的格式不成功时,产生脏数据;
错误报告生成模块,用于在所述脏数据超出预设阈值时,生成错误报告。
在本申请实施例的一种优选实施例中,所述文件对象为能够进行随机读写的文件对象,所述文件对象至少可以包括如下对象的一种或多种:本地文件、OSS文件、SFTP文件、HDFS文件。
对于系统实施例而言,由于其与上述方法实施例基本相似,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。
本说明书中的各个实施例均采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似的部分互相参见即可。
本领域内的技术人员应明白,本申请实施例的实施例可提供为方法、装置、或计算机程序产品。因此,本申请实施例可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请实施例可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。
本申请实施例是参照根据本申请实施例的方法、终端设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序操作指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序操作指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理终端设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理终端设备的处理器执行的操作指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序操作指令也可存储在能引导计算机或其他可编程数据处理终端 设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的操作指令产生包括操作指令装置的制造品,该操作指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序操作指令也可装载到计算机或其他可编程数据处理终端设备上,使得在计算机或其他可编程终端设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程终端设备上执行的操作指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
尽管已描述了本申请实施例的优选实施例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些实施例做出另外的变更和修改。所以,所附权利要求意欲解释为包括优选实施例以及落入本申请实施例范围的所有变更和修改。
最后,还需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者终端设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者终端设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者终端设备中还存在另外的相同要素。
以上对本申请所提供的一种基于ETL的文件处理方法及系统进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (22)

  1. 一种基于ETL的文件处理方法,其特征在于,所述的方法包括:
    从源端获取多个文件对象;
    针对每个文件对象,进行文件内的数据切分,得到多个文本数据块;
    当所述多个文件对象切分完成后,将所述多个文件对象对应的所有文本数据块并发写入目的端。
  2. 根据权利要求1所述的方法,其特征在于,所述从源端获取多个文件对象的步骤包括:
    从源端读取结构化信息,所述结构化信息包括多个文件对象;
    将所述结构化信息以单个文件对象为单位进行切分,得到多个文件对象。
  3. 根据权利要求1或2所述的方法,其特征在于,所述针对每个文件对象,进行文件内的数据切分,得到多个文本数据块的步骤包括:
    针对每个文件对象,确定多个切分位置;
    按照所述多个切分位置对所述文件对象进行切分,得到多个文本数据块。
  4. The method according to claim 3, characterized in that the file object comprises a plurality of row data records, and the step of, for each file object, determining a plurality of segmentation positions comprises:
    for each file object, acquiring the size of the file object;
    determining the average size of the row data records;
    calculating the quotient of the file object size and the average size of the row data records to obtain the number of row data records;
    calculating the quotient of the preset number of text data blocks and the number of row data records to obtain the number of row data records owned by each text data block;
    determining a plurality of initial segmentation positions of the file object according to the number of row data records owned by each text data block;
    if an initial segmentation position is not the position of a row separator, adjusting the segmentation position to the position of a row separator; and
    deduplicating the adjusted initial segmentation positions to obtain the plurality of segmentation positions.
  5. The method according to claim 4, characterized in that the step of, if the initial segmentation position is not the position of a row separator, adjusting the segmentation position to the position of a row separator comprises:
    if the initial segmentation position is not the position of a row separator, probing forward or falling back to the position of the row separator closest to the initial segmentation position; and
    determining the segmentation position as the position of the closest row separator.
  6. The method according to claim 4, characterized in that the step of determining the average size of the row data records comprises:
    taking the size of the first row data record as the average size of the row data records;
    or,
    taking the size of the last row data record as the average size of the row data records;
    or,
    randomly selecting the size of one row data record as the average size of the row data records;
    or,
    calculating the average of the sizes of the first N row data records as the average size of the row data records.
  7. The method according to claim 1, characterized in that the step of, after the segmentation of the plurality of file objects is completed, concurrently writing all text data blocks corresponding to the plurality of file objects to the destination comprises:
    after the segmentation of the plurality of file objects is completed, converting the format of each text data block into an intermediate format;
    writing the corresponding text data blocks in the intermediate format to the destination by means of multiple threads, multiple processes, or multiple distributed machines, respectively; and
    at the destination, converting the text data blocks in the intermediate format into the format required by the destination.
  8. The method according to claim 7, characterized in that the text data block comprises one or more row data records, the row data record comprises column separators, and the step of, after the segmentation of the plurality of file objects is completed, converting the format of each text data block into the intermediate format comprises:
    after the segmentation of the plurality of file objects is completed, cutting each row data record of each text data block according to the column separators to obtain one or more column records; and
    adding the corresponding preset data types to the column records respectively to obtain the intermediate format.
  9. The method according to claim 8, characterized in that the preset data types comprise at least one or more of the following types: character string STRING, long integer LONG, Boolean BOOLEAN, double-precision floating point DOUBLE, and date DATE.
  10. The method according to claim 7, characterized by further comprising:
    if the conversion of the format of a text data block into the intermediate format fails, or the conversion of a text data block in the intermediate format into the format required by the destination fails, generating dirty data; and
    if the dirty data exceeds a preset threshold, generating an error report.
  11. The method according to claim 1, 2, 4, 5, 6, 7, 8 or 9, characterized in that the file object is a file object on which random reads and writes can be performed, and the file object may include at least one or more of the following: a local file, an Open Storage Service (OSS) file, a Secure File Transfer Protocol (SFTP) file, and a distributed file system (HDFS) file.
  12. An ETL-based file processing system, characterized in that the system comprises:
    a file object acquisition module, configured to acquire a plurality of file objects from a source;
    a file segmentation module, configured to perform, for each file object, data segmentation within the file to obtain a plurality of text data blocks; and
    a writing module, configured to concurrently write, after the segmentation of the plurality of file objects is completed, all text data blocks corresponding to the plurality of file objects to a destination.
  13. The system according to claim 12, characterized in that the file object acquisition module comprises:
    a structured information reading sub-module, configured to read structured information from the source, the structured information comprising a plurality of file objects; and
    a structured information segmentation sub-module, configured to segment the structured information in units of a single file object to obtain the plurality of file objects.
  14. The system according to claim 12 or 13, characterized in that the file segmentation module comprises:
    a segmentation position determination sub-module, configured to determine, for each file object, a plurality of segmentation positions; and
    a segmentation sub-module, configured to segment the file object according to the plurality of segmentation positions to obtain a plurality of text data blocks.
  15. The system according to claim 14, characterized in that the file object comprises a plurality of row data records, and the segmentation position determination sub-module comprises:
    a file size acquisition unit, configured to acquire, for each file object, the size of the file object;
    a row size determination unit, configured to determine the average size of the row data records;
    a first calculation unit, configured to calculate the quotient of the file object size and the average size of the row data records to obtain the number of row data records;
    a second calculation unit, configured to calculate the quotient of the preset number of text data blocks and the number of row data records to obtain the number of row data records owned by each text data block;
    an initial segmentation position determination unit, configured to determine a plurality of initial segmentation positions of the file object according to the number of row data records owned by each text data block;
    an adjustment unit, configured to adjust, when an initial segmentation position is not the position of a row separator, the segmentation position to the position of a row separator; and
    a deduplication unit, configured to deduplicate the adjusted initial segmentation positions to obtain a plurality of segmentation positions.
  16. The system according to claim 15, characterized in that the adjustment unit is further configured to:
    if the initial segmentation position is not the position of a row separator, probe forward or fall back to the position of the row separator closest to the initial segmentation position; and
    determine the segmentation position as the position of the closest row separator.
  17. The system according to claim 15, characterized in that the row size determination unit is further configured to:
    take the size of the first row data record as the average size of the row data records;
    or,
    take the size of the last row data record as the average size of the row data records;
    or,
    randomly select the size of one row data record as the average size of the row data records;
    or,
    calculate the average of the sizes of the first N row data records as the average size of the row data records.
  18. The system according to claim 12, characterized in that the writing module comprises:
    a first format conversion sub-module, configured to convert, after the segmentation of the plurality of file objects is completed, the format of each text data block into an intermediate format;
    a data writing sub-module, configured to write the corresponding text data blocks in the intermediate format to the destination by means of multiple threads, multiple processes, or multiple distributed machines, respectively; and
    a second format conversion sub-module, configured to convert, at the destination, the text data blocks in the intermediate format into the format required by the destination.
  19. The system according to claim 18, characterized in that the text data block comprises one or more row data records, the row data record comprises column separators, and the first format conversion sub-module comprises:
    a cutting unit, configured to cut, after the segmentation of the plurality of file objects is completed, each row data record of each text data block according to the column separators to obtain one or more column records; and
    a data type adding unit, configured to add the corresponding preset data types to the column records respectively to obtain the intermediate format.
  20. The system according to claim 19, characterized in that the preset data types comprise at least one or more of the following types: character string STRING, long integer LONG, Boolean BOOLEAN, double-precision floating point DOUBLE, and date DATE.
  21. The system according to claim 18, characterized by further comprising:
    a dirty data generation module, configured to generate dirty data when the conversion of the format of a text data block into the intermediate format fails, or when the conversion of a text data block in the intermediate format into the format required by the destination fails; and
    an error report generation module, configured to generate an error report when the dirty data exceeds a preset threshold.
  22. The system according to claim 12, 13, 15, 16, 17, 18, 19 or 20, characterized in that the file object is a file object on which random reads and writes can be performed, and the file object may include at least one or more of the following: a local file, an Open Storage Service (OSS) file, a Secure File Transfer Protocol (SFTP) file, and a distributed file system (HDFS) file.
PCT/CN2016/093495 2015-08-14 2016-08-05 一种基于etl的文件处理方法及系统 WO2017028690A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510502163.4 2015-08-14
CN201510502163.4A CN106469152A (zh) 2015-08-14 2015-08-14 一种基于etl的文件处理方法及系统

Publications (1)

Publication Number Publication Date
WO2017028690A1 true WO2017028690A1 (zh) 2017-02-23

Family

ID=58051898

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/093495 WO2017028690A1 (zh) 2015-08-14 2016-08-05 一种基于etl的文件处理方法及系统

Country Status (2)

Country Link
CN (1) CN106469152A (zh)
WO (1) WO2017028690A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177040A (zh) * 2021-04-29 2021-07-27 东北大学 铝/铜板带材生产全流程大数据清洗与分析方法

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408468A (zh) * 2018-08-24 2019-03-01 阿里巴巴集团控股有限公司 文件处理方法和装置、计算设备及存储介质
CN111061927B (zh) * 2018-10-16 2023-06-20 阿里巴巴集团控股有限公司 数据处理方法、装置及电子设备
CN109299352B (zh) * 2018-11-14 2022-02-01 百度在线网络技术(北京)有限公司 搜索引擎中网站数据的更新方法、装置和搜索引擎
CN111435346B (zh) * 2019-01-14 2023-12-19 阿里巴巴集团控股有限公司 离线数据的处理方法、装置及设备
CN110162401A (zh) * 2019-05-24 2019-08-23 广州中望龙腾软件股份有限公司 Dwg文件并行读取方法、电子设备和存储介质
CN114356212B (zh) * 2021-11-23 2024-06-14 阿里巴巴(中国)有限公司 数据处理方法、系统及计算机可读存储介质
CN114584556A (zh) * 2022-03-14 2022-06-03 中国工商银行股份有限公司 文件传输方法和装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101582064A (zh) * 2008-05-15 2009-11-18 阿里巴巴集团控股有限公司 一种大数据量数据处理方法及系统
CN103970874A (zh) * 2014-05-14 2014-08-06 浪潮(北京)电子信息产业有限公司 一种实现Hadoop文件处理的方法及装置
CN104376082A (zh) * 2014-11-18 2015-02-25 中国建设银行股份有限公司 一种把数据源文件中的数据导入到数据库中的方法
CN104615736A (zh) * 2015-02-10 2015-05-13 上海创景计算机系统有限公司 基于数据库的大数据快速解析存储方法

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101155296B (zh) * 2006-09-29 2011-05-25 中国科学技术大学 数据传输的方法
CN102999537B (zh) * 2011-09-19 2017-01-18 阿里巴巴集团控股有限公司 一种数据迁移系统和方法
CN104699723B (zh) * 2013-12-10 2018-10-19 北京神州泰岳软件股份有限公司 数据交换适配器、异构系统之间数据同步系统和方法


Also Published As

Publication number Publication date
CN106469152A (zh) 2017-03-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16836554

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16836554

Country of ref document: EP

Kind code of ref document: A1