CN106469152A

CN106469152A - ETL-based file processing method and system

Info

Publication number: CN106469152A
Application number: CN201510502163.4A
Authority: CN
Inventors: 罗海伟; 陈守元
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2015-08-14
Filing date: 2015-08-14
Publication date: 2017-03-01
Also published as: WO2017028690A1

Abstract

Embodiments of the present application provide an ETL-based file processing method and system. The method includes: obtaining multiple file objects from a source; for each file object, splitting the data within the file to obtain multiple text data blocks; and, after all of the file objects have been split, concurrently writing all text data blocks corresponding to the file objects to a destination. The application can increase the speed of file synchronization during ETL and maximize data synchronization efficiency.

Description

An ETL-based file processing method and system

Technical field

The present application relates to the technical field of data processing, and in particular to an ETL-based file processing method and an ETL-based file processing system.

Background art

With the development of enterprise informatization, more and more enterprises have built numerous information systems to help them process and manage internal and external business. As the number of information systems grows, however, systems that each work in isolation produce large amounts of redundant data and duplicated effort for business personnel. Enterprise information integration (EAI, Enterprise Application Integration) emerged in response, and ETL is the main technology for achieving data integration.

ETL is the abbreviation of Extraction-Transformation-Loading, that is, the process of data extraction (Extract), transformation (Transform), and loading (Load), and it is an important step in building a data warehouse. ETL extracts the data of business systems, cleans and transforms it, and loads it into a data warehouse. Its purpose is to integrate the scattered, disordered, and non-uniform data in an enterprise so as to provide a basis for analysis in enterprise decision-making.

At present, synchronization tools for file systems, such as synchronization-disk applications, operate in units of files and mainly perform file-to-file synchronization between terminals; they are not suited to the needs of data warehouse ETL.

Therefore, a technical problem that those skilled in the art urgently need to solve is how to provide an ETL-based file processing mechanism that increases the speed of file synchronization during ETL and maximizes data synchronization efficiency.

Summary of the invention

The technical problem to be solved by the embodiments of the present application is to provide an ETL-based file processing method that increases the speed of file synchronization during ETL and maximizes data synchronization efficiency.

Correspondingly, the embodiments of the present application also provide an ETL-based file processing system to ensure the implementation and application of the above method.

To solve the above problem, an embodiment of the present application discloses an ETL-based file processing method, the method including:

obtaining multiple file objects from a source;

for each file object, splitting the data within the file to obtain multiple text data blocks;

after the multiple file objects have been split, concurrently writing all text data blocks corresponding to the multiple file objects to a destination.

Preferably, the step of obtaining multiple file objects from the source includes:

reading structured information from the source, the structured information including multiple file objects;

splitting the structured information in units of a single file object to obtain multiple file objects.

Preferably, the step of splitting, for each file object, the data within the file to obtain multiple text data blocks includes:

for each file object, determining multiple split positions;

splitting the file object according to the multiple split positions to obtain multiple text data blocks.

Preferably, the file object includes multiple row data records, and the step of determining, for each file object, multiple split positions includes:

for each file object, obtaining the size of the file object;

determining the average size of the row data records;

calculating the quotient of the size of the file object and the average size of the row data records to obtain the number of row data records;

calculating the quotient of the number of row data records and a preset number of text data blocks to obtain the number of row data records in each text data block;

determining multiple initial split positions of the file object according to the number of row data records in each text data block;

if an initial split position is not at the position of a line separator, adjusting the split position to the position of a line separator;

deduplicating the adjusted initial split positions to obtain multiple split positions.

Preferably, the step of adjusting the split position to the position of a line separator if the initial split position is not at the position of a line separator includes:

if the initial split position is not at the position of a line separator, probing forward or rolling back to the position of the line separator closest to the initial split position;

determining the split position to be the position of that closest line separator.

Preferably, the step of determining the average size of the row data records includes:

using the size of the first row data record as the average size of the row data records;

or,

using the size of the last row data record as the average size of the row data records;

or,

using the size of a randomly selected row data record as the average size of the row data records;

or,

calculating the average size of the first N row data records and using it as the average size of the row data records.

Preferably, the step of concurrently writing, after the multiple file objects have been split, all text data blocks corresponding to the multiple file objects to the destination includes:

after the multiple file objects have been split, converting the format of each text data block to an intermediate-state format;

writing the text data blocks in the intermediate-state format to the destination using multiple threads, multiple processes, or multiple distributed machines, respectively;

at the destination, converting the intermediate-state text data blocks to the format required by the destination.

Preferably, the text data block includes one or more row data records, the row data records include column separators, and the step of converting, after the multiple file objects have been split, the format of each text data block to the intermediate-state format includes:

after the multiple file objects have been split, splitting each row data record of each text data block according to the column separator to obtain one or more column records;

adding a corresponding preset data type to each column record to obtain the intermediate-state format.

Preferably, the preset data types include at least one or more of the following types: string STRING, long integer LONG, Boolean BOOLEAN, double-precision floating point DOUBLE, and date DATE.

Preferably, the method further includes:

if converting the format of a text data block to the intermediate-state format fails, or converting an intermediate-state text data block to the format required by the destination fails, producing dirty data;

if the dirty data exceeds a preset threshold, generating an error report.
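The dirty-data handling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the names `convert_all`, `max_dirty`, and `DirtyDataLimitExceeded` are hypothetical, and treating the preset threshold as a count of failed records (rather than, for example, a ratio) is an assumption.

```python
class DirtyDataLimitExceeded(Exception):
    """Hypothetical error raised when too many records fail conversion."""


def convert_all(rows, convert, max_dirty):
    """Convert every row; collect failures as dirty data.

    Assumption: the preset threshold is a count of dirty records.
    """
    clean, dirty = [], []
    for row in rows:
        try:
            clean.append(convert(row))
        except (ValueError, IndexError) as exc:
            # A failed conversion produces one dirty record.
            dirty.append((row, str(exc)))
            if len(dirty) > max_dirty:
                # Stands in for the "error report" of the text.
                raise DirtyDataLimitExceeded(
                    f"{len(dirty)} dirty records exceed the limit of {max_dirty}"
                )
    return clean, dirty
```

Raising as soon as the limit is crossed stops a job early when the input is badly malformed; a real system might instead finish the run and report afterwards.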

Preferably, the file object is a file object that supports random reading and writing, and the file object may include at least one or more of the following: a local file, an Open Storage Service (OSS) file, a Secure File Transfer Protocol (SFTP) file, and a Hadoop Distributed File System (HDFS) file.

An embodiment of the present application also provides an ETL-based file processing system, the system including:

a file object acquisition module, configured to obtain multiple file objects from a source;

a file splitting module, configured to split, for each file object, the data within the file to obtain multiple text data blocks;

a writing module, configured to concurrently write, after the multiple file objects have been split, all text data blocks corresponding to the multiple file objects to a destination.

Preferably, the file object acquisition module includes:

a structured information reading submodule, configured to read structured information from the source, the structured information including multiple file objects;

a structured information splitting submodule, configured to split the structured information in units of a single file object to obtain multiple file objects.

Preferably, the file splitting module includes:

a split position determination submodule, configured to determine multiple split positions for each file object;

a splitting submodule, configured to split the file object according to the multiple split positions to obtain multiple text data blocks.

Preferably, the file object includes multiple row data records, and the split position determination submodule includes:

a file size acquisition unit, configured to obtain, for each file object, the size of the file object;

a row size determination unit, configured to determine the average size of the row data records;

a first calculation unit, configured to calculate the quotient of the size of the file object and the average size of the row data records to obtain the number of row data records;

a second calculation unit, configured to calculate the quotient of the number of row data records and a preset number of text data blocks to obtain the number of row data records in each text data block;

an initial split position determination unit, configured to determine multiple initial split positions of the file object according to the number of row data records in each text data block;

an adjustment unit, configured to adjust the split position to the position of a line separator when an initial split position is not at the position of a line separator;

a deduplication unit, configured to deduplicate the adjusted initial split positions to obtain multiple split positions.

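Preferably, the adjustment unit is further configured to:

if the initial split position is not at the position of a line separator, probe forward or roll back to the position of the line separator closest to the initial split position;

determine the split position to be the position of that closest line separator.

Preferably, the row size determination unit is further configured to:

use the size of the first row data record as the average size of the row data records;

or,

use the size of the last row data record as the average size of the row data records;

or,

use the size of a randomly selected row data record as the average size of the row data records;

or,

calculate the average size of the first N row data records and use it as the average size of the row data records.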

Preferably, the writing module includes:

a first format conversion submodule, configured to convert, after the multiple file objects have been split, the format of each text data block to an intermediate-state format;

a data writing submodule, configured to write the text data blocks in the intermediate-state format to the destination using multiple threads, multiple processes, or multiple distributed machines, respectively;

a second format conversion submodule, configured to convert, at the destination, the intermediate-state text data blocks to the format required by the destination.

Preferably, the text data block includes one or more row data records, the row data records include column separators, and the first format conversion submodule includes:

a cutting unit, configured to split, after the multiple file objects have been split, each row data record of each text data block according to the column separator to obtain one or more column records;

a data type adding unit, configured to add a corresponding preset data type to each column record to obtain the intermediate-state format.

Preferably, the system further includes:

a dirty data generation module, configured to produce dirty data when converting the format of a text data block to the intermediate-state format fails, or when converting an intermediate-state text data block to the format required by the destination fails;

an error report generation module, configured to generate an error report when the dirty data exceeds a preset threshold.

Compared with the background art, the embodiments of the present application have the following advantages:

In the embodiments of the present application, when multiple file objects are obtained from the source, the data inside each file is split in units of file objects to obtain text data blocks, and after the internal splitting of all file objects is complete, all text data blocks are written concurrently to the destination. Because the splitting granularity is the fine-grained splitting of file contents, concurrent writing can increase the speed of file synchronization during ETL and maximize data synchronization efficiency.

In addition, because the embodiments of the present application split according to the internal structure of a file and are not limited to a particular file system, they can synchronize data across multiple file systems and are highly general.

Brief description of the drawings

Fig. 1 is a flow chart of the steps of Embodiment 1 of an ETL-based file processing method of the present application;

Fig. 2 is a flow chart of the steps of Embodiment 2 of an ETL-based file processing method of the present application;

Fig. 3a is a schematic diagram of split positions in Embodiment 2 of an ETL-based file processing method of the present application;

Fig. 3b is a first schematic diagram of split position adjustment in Embodiment 2 of an ETL-based file processing method of the present application;

Fig. 4a is a second schematic diagram of split position adjustment in Embodiment 2 of an ETL-based file processing method of the present application;

Fig. 4b is a schematic diagram of the deduplication result in Embodiment 2 of an ETL-based file processing method of the present application;

Fig. 5 is a schematic diagram of text data blocks in Embodiment 2 of an ETL-based file processing method of the present application;

Fig. 6 is a structural block diagram of an embodiment of an ETL-based file processing system of the present application.

Detailed description

To make the above objects, features, and advantages of the present application more apparent and easier to understand, the application is described in further detail below with reference to the drawings and specific embodiments.

Referring to Fig. 1, a flow chart of the steps of Embodiment 1 of an ETL-based file processing method of the present application is shown, which may include the following steps:

Step 101, obtaining multiple file objects from a source;

Step 102, for each file object, splitting the data within the file to obtain multiple text data blocks;

Step 103, after the multiple file objects have been split, concurrently writing all text data blocks corresponding to the multiple file objects to a destination.

Referring to Fig. 2, a flow chart of the steps of Embodiment 2 of an ETL-based file processing method of the present application is shown, which may include the following steps:

Step 201, reading structured information from a source, the structured information including multiple file objects;

The embodiments of the present application can be applied to ETL scenarios. There may be many kinds of source, including but not limited to: SFTP (Secure File Transfer Protocol), the local file system (Local File), OSS (Open Storage Service, which can be understood as a storage disk), HDFS (Hadoop Distributed File System), and so on.

It should be noted that the source systems of the present application generally provide no database-transaction guarantees, so users must themselves ensure data consistency (with respect to insertion, deletion, and modification of data) while the data is being read; that is, users should, as far as possible, not change the content of the structured information during data synchronization.

The structured information read from the source may include structured data and/or semi-structured data. Structured data may be a database, whose rows and columns are strictly predefined and subject to explicit schema constraints. Semi-structured data refers to character data with explicit line separators and column separators; it can be abstracted as a two-dimensional table structure in which each record row contains one or more columns and every row has the same number of columns.

In the embodiments of the present application, the structured information may include multiple file objects, where a file object may be any object from which a file model can be constructed. In addition, so that the subsequent data splitting and concurrent writing can proceed smoothly, the file object may be one that supports random reading and writing; that is, a file object can be read starting from any byte, and write operations can append incrementally at the end of the file object.

As an example of the embodiments of the present application, the file object may include at least one or more of the following: a local file, an OSS file, an SFTP file, and an HDFS file.

It should be noted that the file objects of the embodiments of the present application may support different encoding types and different compression types.

Step 202, splitting the structured information in units of a single file object to obtain multiple file objects;

After the structured information is obtained from the source, the embodiments of the present application may perform a first-level, coarse-grained file split. This is a file-level split, which may be carried out in units of a single file object to obtain multiple file objects. Here, splitting means dividing one task into multiple subtasks that can be executed concurrently; the overall task is complete once all subtasks are complete, and executing the split subtasks concurrently shortens the running time of the job.

For example, if the structured information read from the source includes 10 file objects, splitting in units of a single file object yields 10 file objects.

As another example, if one day of structured information is obtained from the source and that information is stored in units of hours, the day can be split into 24 file objects.

It should be noted that if the structured information read from the source is too small, the splitting logic of the present application need not be executed; for example, there is no need to split a 100 KB file object. The splitting logic is intended for source data whose structured information exceeds a set threshold in size. Therefore, before step 202 is executed, it may first be determined whether the size of the structured information is greater than or equal to the set threshold; if so, step 202 is executed; otherwise, step 202 is not executed.
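The file-level split and its size check can be sketched as follows. This is only an illustration under stated assumptions: `FileObject`, `split_by_file`, and `min_total_size` are hypothetical names, and handling an input below the threshold as a single unsplit task is an assumption, since the text only says that step 202 is skipped.

```python
from dataclasses import dataclass


@dataclass
class FileObject:
    name: str
    size: int  # size in bytes


def split_by_file(file_objects, min_total_size):
    """First-level split: one subtask per file object.

    Assumption: input smaller than the threshold is kept as one task.
    """
    total = sum(f.size for f in file_objects)
    if total < min_total_size:
        return [list(file_objects)]      # too small, step 202 is skipped
    return [[f] for f in file_objects]   # one subtask per file object


hourly = [FileObject(f"hour-{h:02d}.log", 5_000_000) for h in range(24)]
print(len(split_by_file(hourly, min_total_size=1_000_000)))  # 24
```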

Step 203, for each file object, splitting the data within the file to obtain multiple text data blocks;

After the data obtained from the source has gone through the first-level coarse-grained split into multiple file objects, the embodiments of the present application further perform, for each file object, a second-level fine-grained file split. This is a split inside the file; once it is complete, each file object yields multiple text data blocks.

In a preferred embodiment of the embodiments of the present application, step 203 may include the following sub-steps:

Sub-step S11, for each file object, determining multiple split positions;

Before a file object is split, the split positions must first be determined. In a preferred embodiment of the embodiments of the present application, sub-step S11 further includes the following sub-steps:

Sub-step S111, for each file object, obtaining the size of the file object;

For each individual file object, the attribute information of the file object can be obtained, where the attribute information may include the file object size totalSize.

Sub-step S112, determining the average size of the row data records;

Structured data and/or semi-structured data may include row data records, where a row data record is the data record of one row. A file object obtained by splitting the structured information is therefore also composed of row data records, and a row data record carries information such as line separators and column separators.

However, the row data records in a file object may differ in size. To keep the row data records intact in the subsequent split as far as possible, the embodiments of the present application may obtain the average size lineSize of the row data records and use it as one of the bases for splitting. In a specific implementation, lineSize may be determined from characteristics such as the size distribution of the row data records in the file object.

In one embodiment, the average size of the row data records may be determined in any of the following ways:

using the size of the first row data record as the average size of the row data records;

or,

using the size of the last row data record as the average size of the row data records;

or,

using the size of a randomly selected row data record as the average size of the row data records;

or,

calculating the average size of the first N row data records and using it as the average size of the row data records.

Specifically, if all row data records in the file object have the same or essentially the same size, the size of the first row, the last row, or a random row can be used directly as the average size of the row data records.

If the size distribution of the row data records in the file object varies periodically, for example with a period of 50 rows, the average size of every 50 rows of data is used as the average size of the row data records. Besides the average, the maximum, minimum, or median size of the row data records within one period may also be used as the average size of the row data records. In addition, the first N rows referred to in the embodiments of the present application may be, besides the period above, a number of rows configured in advance by the user or a randomly chosen number of rows; the embodiments of the present application impose no limitation on this.
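The "first N rows" estimate can be sketched as follows. This is a minimal illustration: `estimate_line_size` is a hypothetical name, and it assumes a binary stream whose rows end with a single `\n` byte, with the separator counted as part of each row.

```python
import io


def estimate_line_size(stream, n=50):
    """Average size in bytes of the first n rows (separator included).

    Assumption: `stream` is a binary file-like object with "\n" line ends.
    """
    sizes = []
    for _ in range(n):
        line = stream.readline()
        if not line:          # fewer than n rows in the file
            break
        sizes.append(len(line))
    return sum(sizes) / len(sizes) if sizes else 0


sample = io.BytesIO(b"aa\nbbbb\n")
print(estimate_line_size(sample, n=2))  # 4.0
```

Setting `n=1` reduces this to the "first row" variant; the other variants (last row, random row, median over a period) would need seeking or a different aggregate.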

It should be noted that the above ways of determining the average size of the row data records are only examples of the embodiments of the present application; those skilled in the art may determine the average size of the row data records in other ways as needed to achieve the purpose of the present application.

Sub-step S113, calculating the quotient of the size of the file object and the average size of the row data records to obtain the number of row data records;

From totalSize and lineSize, the number of row data records in the file object can be estimated as number = totalSize / lineSize.

Sub-step S114, calculating the quotient of the number of row data records and the preset number of text data blocks to obtain the number of row data records in each text data block;

In the embodiments of the present application, the number m of text data blocks that the user wants the split to produce can be configured in advance. In a preferred embodiment, the configuration may be done as follows: the user directly enters the required number of text data blocks; or, if the throughput of one text data block is 1 Mbps and the user needs 10 Mbps, the file is split into 10 parts.

From the number m of text data blocks and the number of row data records number, the number of row data records in each text data block is obtained as number / m.

Sub-step S115, determining multiple initial split positions of the file object according to the number of row data records in each text data block;

Once the number of row data records in each text data block has been determined, the multiple initial split positions of the file object can be determined. These initial split positions may be: 0, 1*number/m, 2*number/m, 3*number/m, and so on.
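The arithmetic of sub-steps S113 to S115 can be sketched as follows. The function name is hypothetical, and one step is an assumption rather than something stated in the text: the positions k*number/m are row indices, so the sketch multiplies them by lineSize to obtain byte offsets that can be used for seeking.

```python
def initial_split_positions(total_size, line_size, m):
    """Estimate m initial split offsets (in bytes) for one file object."""
    if line_size <= 0 or m <= 0:
        raise ValueError("line_size and m must be positive")
    number = total_size / line_size   # estimated number of row data records
    rows_per_block = number / m       # rows each text data block should hold
    # Assumption: row index k * rows_per_block maps to a byte offset by
    # multiplying by the average row size.
    return [int(k * rows_per_block * line_size) for k in range(m)]


print(initial_split_positions(1000, 10, 4))  # [0, 250, 500, 750]
```

Because lineSize is only an estimate, these offsets usually land in the middle of a row, which is why they still need to be adjusted to a line separator.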

Sub-step S116, if an initial split position is not at the position of a line separator, adjusting the split position to the position of a line separator;

To avoid cutting a complete row data record in two when splitting at an initial split position, the embodiments of the present application may fine-tune the split position so that it falls at the end or the start of a row. In a specific implementation, the initial split position may be adjusted as follows: if the initial split position is not at the position of a line separator, the split position is adjusted to the position of a line separator.

In a preferred embodiment of the embodiments of the present application, sub-step S116 may further be: if the initial split position is not at the position of a line separator, probing forward or rolling back to the position of the line separator closest to the initial split position; and determining the split position to be the position of that closest line separator.

As shown in the split position schematic diagram of Fig. 3a, an initial split position obtained in sub-step S115 may lie in the middle of a row data record. Splitting at that position would cut the row data record in two and leave it incomplete. Therefore, in one embodiment, forward probing may be used: the line separator is searched for in the forward direction until the end of the row or the end of the file is reached, and if a line separator is found, the initial split position is updated to the position of that line separator, as shown in the first split position adjustment schematic diagram of Fig. 3b.

In another embodiment, similar to the forward probing above, backward rollback may be used to search for the line separator, rolling back until the start of the row or the start of the file is reached; if a line separator is found, the initial split position is updated to the position of that line separator. It should be noted that backward rollback places higher requirements on the file system than forward probing, and it is available only for file systems that support backward rollback.
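Forward probing can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the application: `align_forward` is a hypothetical name, the stream is a seekable binary file-like object, the line separator is a single byte, and a split position is taken to be the offset of the first byte after a separator (that is, the start of a row).

```python
import io


def align_forward(stream, pos, line_sep=b"\n", chunk_size=4096):
    """Move pos forward to the start of the next row, or to end of file.

    Assumption: line_sep is a single byte, so it cannot straddle chunks.
    """
    if pos <= 0:
        return 0
    # Start one byte early: if that byte is a separator, pos already
    # sits at the start of a row and is returned unchanged.
    offset = pos - 1
    stream.seek(offset)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return offset                # end of file, no separator found
        idx = chunk.find(line_sep)
        if idx != -1:
            return offset + idx + 1      # first byte after the separator
        offset += len(chunk)


data = io.BytesIO(b"aaa\nbbbbb\ncc\n")
print(align_forward(data, 6))  # 10
```

A backward variant would read chunks that end at `pos` and search with `rfind`, which requires cheap backward seeks on the underlying storage.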

Sub-step S117, deduplicating the adjusted initial split positions to obtain multiple split positions.

As shown in the second split position adjustment schematic diagram of Fig. 4a, in a specific implementation a row data record may be so long that it spans multiple text data blocks. In that case, the processing of sub-step S116 can produce several identical split positions. Only one of the repeated split positions is then retained and the other repeats are deleted; the result is shown in the deduplication result schematic diagram of Fig. 4b.
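Deduplication of the adjusted positions is a one-liner in most languages; the sketch below uses a hypothetical helper name and additionally sorts the result, which is an assumption that simplifies the later split.

```python
def dedupe_positions(positions):
    """Keep one copy of each adjusted split position, in ascending order."""
    return sorted(set(positions))


# A very long row made three initial positions collapse onto offset 10.
print(dedupe_positions([0, 10, 10, 10, 25]))  # [0, 10, 25]
```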

Sub-step S12, splitting the file object according to the multiple split positions to obtain multiple text data blocks.

After the multiple split positions are obtained, the file object can be split at those positions to obtain the corresponding text data blocks. For example, after the in-file split of step 203, the multiple text data blocks shown in the text data block schematic diagram of Fig. 5 can be obtained.
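Turning split positions into text data blocks can be sketched as follows. The helper names `block_ranges` and `read_block` are hypothetical, and the sketch assumes the positions are sorted byte offsets that each mark the start of a row.

```python
import io


def block_ranges(positions, total_size):
    """Pair consecutive split positions into (start, end) byte ranges."""
    bounds = [p for p in positions if p < total_size] + [total_size]
    return [(s, e) for s, e in zip(bounds, bounds[1:]) if e > s]


def read_block(stream, start, end):
    """Read one text data block using random access."""
    stream.seek(start)
    return stream.read(end - start)


raw = b"aaa\nbbbbb\ncc\n"
ranges = block_ranges([0, 4, 10], len(raw))
print(ranges)  # [(0, 4), (4, 10), (10, 13)]
print([read_block(io.BytesIO(raw), s, e) for s, e in ranges])
```

Because each block is just a byte range, workers can read their blocks independently, which is what makes the random-access requirement on file objects useful.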

Step 204, after the multiple file objects have been split, concurrently writing all text data blocks corresponding to the multiple file objects to the destination.

After the second-level fine-grained split has produced all of the text data blocks, they can be written concurrently to the destination. In a specific implementation, multiple distributed machines, and/or multiple processes, and/or multiple threads may be used to read the text data blocks in parallel, thereby improving the speed and efficiency of data synchronization.

In a specific implementation, the amount of data a distributed system has to process is huge, while the number of distributed machines, processes, or threads the system can accommodate is limited. Therefore, one process, thread, or distributed machine may handle multiple text data blocks, and the data traffic with which that process, thread, or distributed machine handles a single text data block may be limited, so as to achieve flow control.
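The multi-threaded variant of concurrent writing can be sketched with the Python standard library as follows. `write_blocks_concurrently` and `write_fn` are hypothetical names; a bounded pool means each worker handles several blocks in turn, and per-block flow control is not shown.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def write_blocks_concurrently(blocks, write_fn, max_workers=4):
    """Write every block with a bounded pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for all writes and re-raises any writer exception.
        return list(pool.map(write_fn, blocks))


# Usage with an in-memory stand-in for the destination.
sink, lock = [], threading.Lock()


def write_to_sink(block):
    with lock:                 # the stand-in destination is not thread-safe
        sink.append(block)
    return len(block)


print(write_blocks_concurrently([b"a", b"bb", b"ccc"], write_to_sink, 2))
```

A multi-process variant could swap in `ProcessPoolExecutor`, provided the blocks and the write function can be pickled.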

In a preferred embodiment of the embodiments of the present application, step 204 may include the following sub-steps:

Sub-step S21, converting the format of each text data block to an intermediate-state format;

In the embodiments of the present application, a relay mechanism is defined, through which source data and destination data are synchronized. The mechanism first converts the format of the text data blocks obtained by splitting the source data into an intermediate-state format. The intermediate-state format is neither the format of the source data nor the format of the destination data; it is a transitional state between the source data format and the destination data format.

In a preferred embodiment of the embodiments of the present application, sub-step S21 may include the following sub-steps:

Sub-step S211, splitting each row data record of each text data block according to the column separator to obtain one or more column records;

According to the characteristics of the structured information, the embodiments of the present application may split a row data record by its column separator to obtain one or more column records. For example, given the row data record 1,2,3,abc,2015-07-16 00:00:00,China, splitting it by the column separator "," yields the column records 1, 2, 3, abc, 2015-07-16 00:00:00, and China, 6 columns in total.
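Splitting a text data block into column records can be sketched as follows. `split_columns` is a hypothetical name, and the sketch assumes that the separators never occur inside a field (no quoting or escaping).

```python
def split_columns(block_text, line_sep="\n", col_sep=","):
    """Split a text data block into rows, and each row into column records.

    Assumption: separators never appear inside a field.
    """
    return [line.split(col_sep)
            for line in block_text.split(line_sep) if line]


rows = split_columns("1,2,3,abc,2015-07-16 00:00:00,China\n")
print(rows[0])
# ['1', '2', '3', 'abc', '2015-07-16 00:00:00', 'China']
```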

Sub-step S212, respectively described row record adds corresponding preset data type, obtains institute State intermediate state form.

Due to row data record memory storage is all character data, and each column data lost tool Type (integer, floating number, date, null) of body etc., therefore, when by sub-step S211 After obtaining multiple row records, can be that the plurality of row record adds according to user's configuration information in advance Plus preset data type.

As a kind of preferred exemplary of the embodiment of the present application, preset data type at least can include as One or more of Types Below：Character string STRING, long LONG, Boolean type BOOLEAN, Double-precision floating point type DOUBLE, date DATE, certainly, this data type can also include word Section type B ytes, simply during Reading text, is not related to the type

The preset data types can be compared to a relational database table in which each column has one type, as shown in Table 1 below.

First column	Second column	Third column	Fourth column	Fifth column
STRING	LONG	BOOLEAN	DOUBLE	DATE
Abc	123	true	1.001	1989-04-27
abc	456	false	2.002	2007-09-01
Abc123	789	true	3.003	2014-07-07

Table 1

As an example, a configuration instance of a data type is shown in the following code:

Here, index is the position of the column within a data row, with subscripts starting from 0; type is the data type; and for the date type, format can be configured to specify the date format.
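A hypothetical configuration of this kind, together with the type assignment of sub-step S212, might look like the sketch below. Only the field names index, type and format come from the description; the names COLUMN_CONFIG, CONVERTERS and to_intermediate, the Python strptime-style date pattern, and the choice of representing an intermediate-state record as a list of (type, value) pairs are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical user configuration: one entry per column to synchronize.
COLUMN_CONFIG = [
    {"index": 0, "type": "LONG"},
    {"index": 3, "type": "STRING"},
    {"index": 4, "type": "DATE", "format": "%Y-%m-%d %H:%M:%S"},
]


def _to_boolean(text):
    lowered = text.strip().lower()
    if lowered not in ("true", "false"):
        raise ValueError(f"not a boolean: {text!r}")
    return lowered == "true"


# One converter per preset data type; fmt is only used by DATE.
CONVERTERS = {
    "STRING": lambda text, fmt: text,
    "LONG": lambda text, fmt: int(text),
    "BOOLEAN": lambda text, fmt: _to_boolean(text),
    "DOUBLE": lambda text, fmt: float(text),
    "DATE": lambda text, fmt: datetime.strptime(text, fmt),
}


def to_intermediate(columns, config):
    """Attach a preset data type to each configured column record."""
    typed = []
    for entry in config:
        raw = columns[entry["index"]]
        value = CONVERTERS[entry["type"]](raw, entry.get("format"))
        typed.append((entry["type"], value))
    return typed


columns = ["1", "2", "3", "abc", "2015-07-16 00:00:00", "China"]
record = to_intermediate(columns, COLUMN_CONFIG)
```

A conversion that raises ValueError here (for example LONG applied to "abc") corresponds to the dirty data case discussed later in the description.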

It should be noted that the embodiment of the present application concerns the reading of character data rather than binary data (for example, mp3 files or pictures). Binary data can be encoded with Base64 and then processed as the string type STRING.
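The Base64 handling of binary data mentioned above can be sketched with the Python standard library. The helper names binary_to_string and string_to_binary are hypothetical, and the sample payload is arbitrary.

```python
import base64


def binary_to_string(payload):
    """Encode binary data as Base64 text so it can travel as STRING."""
    return base64.b64encode(payload).decode("ascii")


def string_to_binary(text):
    """Recover the original bytes at the destination."""
    return base64.b64decode(text)


encoded = binary_to_string(b"\x00\xffID3")
restored = string_to_binary(encoded)
```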

Sub-step S22: write the text data block in the intermediate-state format to the destination;

Sub-step S23: at the destination, convert the text data block in the intermediate state to the format required by the destination.

After the intermediate-state format of a text data block is obtained, the text data block in the intermediate-state format can be written to the destination, and at the destination it can be converted to the data format type that the destination requires. That is, the above preset data types act as intermediate-state data types and play a transfer role, so the relationship among the source, the destination and the preset data types is: multiple source data types -> 5 preset data types -> multiple destination data types.

With the transfer mechanism of the embodiment of the present application, the synchronization framework can adopt a plug-in pattern. Whether a source data type or a destination data type is added, the transfer mechanism can complete the data type conversion efficiently, which enriches the forms of data exchange between different storage systems and provides strong extensibility.
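One way to picture the plug-in pattern is a pair of registries, one for source readers and one for destination writers, that only ever exchange intermediate-state records. The sketch below is an assumption about how such a framework could be organized, not the framework of the application; the registry names, the decorator, and the two toy formats "csv_text" and "tsv_text" are all hypothetical.

```python
READERS = {}  # source type -> function producing intermediate-state records
WRITERS = {}  # destination type -> function consuming intermediate-state records


def register(registry, name):
    """Decorator that adds a plug-in to the given registry."""
    def decorator(func):
        registry[name] = func
        return func
    return decorator


@register(READERS, "csv_text")
def read_csv_text(text):
    # Source plug-in: emits (preset type, value) pairs for each row.
    for line in text.splitlines():
        name, amount = line.split(",")
        yield [("STRING", name), ("LONG", int(amount))]


@register(WRITERS, "tsv_text")
def write_tsv_text(records):
    # Destination plug-in: only needs to understand the preset types.
    return "\n".join("\t".join(str(value) for _, value in rec) for rec in records)


def synchronize(source_type, destination_type, payload):
    return WRITERS[destination_type](READERS[source_type](payload))


result = synchronize("csv_text", "tsv_text", "abc,123\ndef,456")
```

Because readers and writers meet only at the intermediate state, adding a new source or destination means registering one new plug-in rather than one converter per source and destination pair.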

In one embodiment, during the format conversion that uses the above transfer mechanism, the embodiment of the present application can further include the following steps:

If converting the format of the text data block to the intermediate-state format fails, or converting the text data block in the intermediate state to the format required by the destination fails, dirty data is produced; if the dirty data exceeds a preset threshold, an error report is generated.

Specifically, if the format conversion in sub-step S21 and/or sub-step S23 fails, the corresponding column record can be treated as dirty data. Typical error cases are converting abc to a number or converting abc to a date. Once the quantity of dirty data exceeds the dirty data limit, an error can be reported and the synchronization task for the text data block ends.

As an example, the dirty data limit can include: 1. a dirty data count, that is, the task reports an error when the dirty data exceeds a specified number of records; 2. a dirty data percentage, that is, the task reports an error when the dirty data exceeds a specified percentage.
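The two dirty data limits can be sketched as a small counter. The class and function names are hypothetical, and the sketch assumes that the count limit is checked after every record while the percentage limit is checked once at the end of the task; the description does not say when the percentage is evaluated.

```python
class DirtyDataLimitExceeded(Exception):
    """Raised when dirty data exceeds the configured limit."""


class DirtyDataCounter:
    def __init__(self, max_count=None, max_percent=None):
        self.max_count = max_count      # limit 1: number of dirty records
        self.max_percent = max_percent  # limit 2: percentage of dirty records
        self.total = 0
        self.dirty = 0

    def add(self, is_dirty):
        self.total += 1
        if is_dirty:
            self.dirty += 1
        if self.max_count is not None and self.dirty > self.max_count:
            raise DirtyDataLimitExceeded(f"{self.dirty} dirty records")

    def finish(self):
        if self.max_percent is not None and self.total:
            percent = 100.0 * self.dirty / self.total
            if percent > self.max_percent:
                raise DirtyDataLimitExceeded(f"{percent:.1f}% dirty records")


def convert_longs(values, counter):
    """Convert column records to LONG, counting failures as dirty data."""
    converted = []
    for text in values:
        try:
            value = int(text)
        except ValueError:
            counter.add(True)  # for example, "abc" cannot become a number
            continue
        counter.add(False)
        converted.append(value)
    counter.finish()
    return converted


clean = convert_longs(["1", "abc", "3"], DirtyDataCounter(max_count=1))
```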

Further, when the multiple text data blocks are written to the destination, if the destination created only a single file to store all of them, multiple threads (or processes) could not write to that same file concurrently, which would cause a data mutual-exclusion problem. To avoid data mutual exclusion, the destination can create files in one-to-one correspondence with the text data blocks for storing and processing them. That is, one text data block corresponds to one file at the destination, and each text data block is written to a different destination file.
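The one-file-per-block idea can be sketched with a thread pool. This is a minimal illustration that assumes the destination is a local directory; the function names and the block_N.txt naming are hypothetical and chosen only to keep the example short.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_block(directory, block_id, block):
    """Write one text data block to its own destination file."""
    path = os.path.join(directory, f"block_{block_id}.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(block)
    return path


def write_blocks_concurrently(directory, blocks, workers=4):
    # Each task owns a distinct file, so no lock on a shared file is needed.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(write_block, directory, i, block)
                   for i, block in enumerate(blocks)]
        return [future.result() for future in futures]


with tempfile.TemporaryDirectory() as demo_dir:
    paths = write_blocks_concurrently(demo_dir, ["1,2\n", "3,4\n", "5,6\n"])
    written = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            written.append(handle.read())
```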

A file at the destination can include a file identifier. In one embodiment, the file identifier can be named as follows: the user specifies a file prefix (the specified prefix), the embodiment of the present application generates a random 32-character UUID (Universally Unique Identifier), and the specified prefix and the UUID are concatenated to form the file identifier.
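A sketch of this naming scheme using Python's standard uuid module follows. The function name and the sample prefix are hypothetical, and the sketch assumes the 32 characters are the hexadecimal form of a random (version 4) UUID.

```python
import uuid


def make_file_identifier(specified_prefix):
    """Concatenate the user-specified prefix with a random UUID in hex form."""
    return specified_prefix + uuid.uuid4().hex  # .hex has 32 hexadecimal characters


first = make_file_identifier("orders_")
second = make_file_identifier("orders_")
```

Because every identifier carries a fresh UUID, blocks written by different threads do not collide on file names even though they share the prefix.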

In a specific implementation, there can be three modes for writing text data blocks to the destination. The first mode is truncate: before the text data blocks are written, historical data with the same specified prefix is cleared. The second mode is noConflict: when the text data blocks are written, if historical data with the same specified prefix is found, an error report is generated and data synchronization exits. The third mode is append: data is appended, and the text data blocks are written normally regardless of whether historical data with the same specified prefix exists.
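The three write modes can be sketched against a local directory standing in for the destination. Only the mode names truncate, noConflict and append come from the description; the function names, the use of FileExistsError as the error report, and the sample prefix are assumptions.

```python
import os
import tempfile
import uuid


def prepare_destination(directory, prefix, mode):
    """Apply the write mode to historical files that share the prefix."""
    existing = [name for name in os.listdir(directory) if name.startswith(prefix)]
    if mode == "truncate":
        for name in existing:
            os.remove(os.path.join(directory, name))
    elif mode == "noConflict":
        if existing:
            raise FileExistsError(f"historical data with prefix {prefix!r} exists")
    elif mode != "append":
        raise ValueError(f"unknown write mode: {mode}")


def write_blocks(directory, prefix, blocks, mode):
    prepare_destination(directory, prefix, mode)
    for block in blocks:
        path = os.path.join(directory, prefix + uuid.uuid4().hex)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(block)


with tempfile.TemporaryDirectory() as demo_dir:
    write_blocks(demo_dir, "orders_", ["a\n", "b\n"], "append")
    write_blocks(demo_dir, "orders_", ["c\n"], "append")
    count_after_append = len(os.listdir(demo_dir))
    write_blocks(demo_dir, "orders_", ["d\n"], "truncate")
    count_after_truncate = len(os.listdir(demo_dir))
```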

In the embodiment of the present application, structured information is cut into multiple text data blocks through coarse-grained file-level cutting and fine-grained cutting inside files. The cutting focuses on the logical two-dimensional table structure inside the file, which makes this a general synchronization solution for common semi-structured file data.

In addition, the embodiment of the present application can support multiple source data types and has strong generality and extensibility.

It should be noted that, for brevity, the method embodiments are all expressed as a series of combined actions. Those skilled in the art should understand, however, that the embodiment of the present application is not limited by the described order of actions, because according to the embodiment of the present application some steps can be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by the embodiment of the present application.

Referring to Fig. 6, a structural block diagram of a document handling system based on ETL of the present application is shown, which can specifically include the following modules:

a file object acquisition module 601, configured to obtain multiple file objects from a source;

a file cutting module 602, configured to, for each file object, cut the data in the file to obtain multiple text data blocks;

a writing module 603, configured to, after the cutting of the multiple file objects is complete, concurrently write all text data blocks corresponding to the multiple file objects to a destination.

In a preferred embodiment of the present application, the file object acquisition module 601 can include the following submodules:

In a preferred embodiment of the present application, the file cutting module 602 can include the following submodules:

In a preferred embodiment of the present application, the file object includes multiple row data records, and the cutting position determination submodule includes:

In a preferred embodiment of the present application, the adjustment unit is further configured to:

In a preferred embodiment of the present application, the row size determining unit is further configured to:

or,

In a preferred embodiment of the present application, the writing module 603 can include the following submodules:

In a preferred embodiment of the present application, the text data block includes one or more row data records, the row data record includes a column separator, and the first format conversion submodule includes:

In a preferred embodiment of the present application, the preset data types include at least one or more of the following types: string type STRING, long integer type LONG, Boolean type BOOLEAN, double-precision floating-point type DOUBLE, and date type DATE.

In a preferred embodiment of the present application, the system further includes:

In a preferred embodiment of the present application, the file object is a file object that supports random read and write, and the file object can include at least one or more of the following objects: a local file, an OSS file, an SFTP file, and an HDFS file.

Because the system embodiment is basically similar to the method embodiment above, its description is relatively brief; for related details, refer to the corresponding description of the method embodiment.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments can be referred to one another.

Those skilled in the art should understand that the embodiments of the present application can be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the present application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) that contain computer-usable program code.

The embodiments of the present application are described with reference to flowcharts and/or block diagrams of the method, the terminal device (system), and the computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program operation instructions. These computer program operation instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, so that the operation instructions executed by the processor of the computer or other programmable data processing terminal device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program operation instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal device to work in a specific manner, so that the operation instructions stored in the computer-readable memory produce an article of manufacture that includes an operation instruction apparatus. The operation instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program operation instructions can also be loaded onto a computer or other programmable data processing terminal device, so that a series of operation steps is performed on the computer or other programmable terminal device to produce a computer-implemented process. The operation instructions executed on the computer or other programmable terminal device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present application have been described, those skilled in the art can make other changes and modifications to these embodiments once they learn the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present application.

Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article or terminal device that includes a series of elements includes not only those elements but also other elements that are not expressly listed, or elements inherent to such a process, method, article or terminal device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or terminal device that includes the element.

The document handling method based on ETL and the document handling system based on ETL provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. At the same time, for those of ordinary skill in the art, the specific implementations and the scope of application may change according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims

1. A document handling method based on ETL, characterized in that the method comprises:

obtaining multiple file objects from a source;

2. The method according to claim 1, characterized in that the step of obtaining multiple file objects from the source comprises:

3. The method according to claim 1 or 2, characterized in that the step of, for each file object, cutting the data in the file to obtain multiple text data blocks comprises:

for each file object, determining multiple cutting positions;

4. The method according to claim 3, characterized in that the file object comprises multiple row data records, and the step of, for each file object, determining multiple cutting positions comprises:

for each file object, obtaining the size of the file object;

determining the average size of the row data records;

5. The method according to claim 4, characterized in that the step of adjusting the cutting position to the position of a line separator if the initial cutting position is not the position of a line separator comprises:

6. The method according to claim 4, characterized in that the step of determining the average size of the row data records comprises:

or,

7. The method according to claim 1, characterized in that the step of concurrently writing all text data blocks corresponding to the multiple file objects to the destination after the cutting of the multiple file objects is complete comprises:

8. The method according to claim 7, characterized in that the text data block comprises one or more row data records, the row data record comprises a column separator, and the step of converting the format of each text data block to the intermediate-state format after the cutting of the multiple file objects is complete comprises:

9. The method according to claim 8, characterized in that the preset data types comprise at least one or more of the following types: string type STRING, long integer type LONG, Boolean type BOOLEAN, double-precision floating-point type DOUBLE, and date type DATE.

10. The method according to claim 7, characterized in that the method further comprises:

11. The method according to claim 1 or 2 or 4 or 5 or 6 or 7 or 8 or 9, characterized in that the file object is a file object that supports random read and write, and the file object can comprise at least one or more of the following objects: a local file, an Open Storage Service (OSS) file, a Secure File Transfer Protocol (SFTP) file, and a distributed file system HDFS file.

12. A document handling system based on ETL, characterized in that the system comprises:

13. The system according to claim 12, characterized in that the file object acquisition module comprises:

14. The system according to claim 12 or 13, characterized in that the file cutting module comprises:

15. The system according to claim 14, characterized in that the file object comprises multiple row data records, and the cutting position determination submodule comprises:

16. The system according to claim 15, characterized in that the adjustment unit is further configured to:

17. The system according to claim 15, characterized in that the row size determining unit is further configured to:

or,

18. The system according to claim 12, characterized in that the writing module comprises:

19. The system according to claim 18, characterized in that the text data block comprises one or more row data records, the row data record comprises a column separator, and the first format conversion submodule comprises:

20. The system according to claim 19, characterized in that the preset data types comprise at least one or more of the following types: string type STRING, long integer type LONG, Boolean type BOOLEAN, double-precision floating-point type DOUBLE, and date type DATE.

21. The system according to claim 18, characterized in that the system further comprises:

22. The system according to claim 12 or 13 or 15 or 16 or 17 or 18 or 19 or 20, characterized in that the file object is a file object that supports random read and write, and the file object can comprise at least one or more of the following objects: a local file, an Open Storage Service (OSS) file, a Secure File Transfer Protocol (SFTP) file, and a distributed file system HDFS file.