CN106469152A - ETL-based file processing method and system - Google Patents
- Publication number
- CN106469152A (application CN201510502163.4A)
- Authority
- CN
- China
- Prior art keywords
- file
- row
- file object
- data record
- cutting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An embodiment of the present application provides an ETL-based file processing method and system. The method includes: obtaining multiple file objects from a source; for each file object, cutting the data within the file to obtain multiple text data blocks; and, after all of the file objects have been cut, concurrently writing all text data blocks corresponding to the file objects to a destination. The application can increase the speed of file synchronization during ETL and maximize data synchronization efficiency.
Description
Technical field
The present application relates to the technical field of data processing, and in particular to an ETL-based file processing method and an ETL-based file processing system.
Background
As enterprise informatization develops, more and more enterprises run numerous information systems to help them process and manage internal and external business. But as the number of information systems grows, systems that each work in isolation produce large amounts of redundant data and duplicated work for business staff. Enterprise application integration (EAI) emerged in response, and ETL is the main technology for achieving data integration.
ETL is short for Extraction-Transformation-Loading, that is, the process of extracting (Extract), transforming (Transform), and loading (Load) data; it is an important step in building a data warehouse. ETL extracts data from business systems, cleans and transforms it, and loads it into the data warehouse. The aim is to integrate an enterprise's scattered, messy, and inconsistently standardized data so as to provide a basis for analysis in enterprise decision-making.
At present, file-system synchronization tools such as sync-disk applications take the file as their unit and mainly synchronize files between terminals file by file; they are not suited to the needs of data warehouse ETL.
Therefore, a technical problem that those skilled in the art urgently need to solve is how to provide an ETL-based file processing mechanism that increases the speed of file synchronization during ETL and maximizes data synchronization efficiency.
Summary of the invention
The technical problem to be solved by the embodiments of the present application is to provide an ETL-based file processing method that increases the speed of file synchronization during ETL and maximizes data synchronization efficiency.
Correspondingly, the embodiments of the present application also provide an ETL-based file processing system to ensure the implementation and application of the above method.
To solve the above problem, an embodiment of the present application discloses an ETL-based file processing method, the method including:
obtaining multiple file objects from a source;
for each file object, cutting the data within the file to obtain multiple text data blocks;
after all of the file objects have been cut, concurrently writing all text data blocks corresponding to the file objects to a destination.
Preferably, the step of obtaining multiple file objects from the source includes:
reading structured information from the source, the structured information including multiple file objects;
cutting the structured information in units of a single file object to obtain the multiple file objects.
Preferably, the step of cutting, for each file object, the data within the file to obtain multiple text data blocks includes:
for each file object, determining multiple cut positions;
cutting the file object according to the multiple cut positions to obtain multiple text data blocks.
Preferably, the file object includes multiple row data records, and the step of determining multiple cut positions for each file object includes:
for each file object, obtaining the size of the file object;
determining the average size of the row data records;
calculating the quotient of the file object size and the average size of the row data records to obtain the number of row data records;
calculating the quotient of the number of row data records and the preset number of text data blocks to obtain the number of row data records in each text data block;
determining multiple initial cut positions of the file object according to the number of row data records in each text data block;
if an initial cut position is not at the position of a row separator, adjusting the cut position to the position of a row separator;
deduplicating the adjusted initial cut positions to obtain the multiple cut positions.
Preferably, the step of adjusting the cut position to the position of a row separator if the initial cut position is not at the position of a row separator includes:
if the initial cut position is not at the position of a row separator, probing forward or rolling back to the position of the row separator closest to the initial cut position;
setting the cut position to the position of that closest row separator.
Preferably, the step of determining the average size of the row data records includes:
taking the size of the first row data record as the average size of the row data records;
or,
taking the size of the last row data record as the average size of the row data records;
or,
taking the size of a randomly selected row data record as the average size of the row data records;
or,
calculating the mean size of the first N row data records and taking it as the average size of the row data records.
Preferably, the step of concurrently writing, after all of the file objects have been cut, all text data blocks corresponding to the file objects to the destination includes:
after all of the file objects have been cut, converting the format of each text data block into an intermediate-state format;
writing the corresponding text data blocks in the intermediate-state format to the destination using multiple threads, multiple processes, or multiple distributed machines;
at the destination, converting the intermediate-state text data blocks into the format required by the destination.
Preferably, the text data block includes one or more row data records, each row data record includes column separators, and the step of converting, after all of the file objects have been cut, the format of each text data block into the intermediate-state format includes:
after all of the file objects have been cut, for each row data record of each text data block, splitting the record at the column separators to obtain one or more column records;
adding a corresponding preset data type to each column record to obtain the intermediate-state format.
Preferably, the preset data types include at least one or more of the following: string STRING, long integer LONG, Boolean BOOLEAN, double-precision floating point DOUBLE, and date DATE.
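As a rough illustration of the conversion described above, the following Python sketch splits one row data record at its column separator and tags each column record with one of the five preset types. The function name to_intermediate, the CONVERTERS table, the (type, value) tuple layout, and the ISO date format are assumptions made for this example; the application does not specify how the intermediate state is represented.

```python
from datetime import date
from typing import Any, List, Tuple

# Hypothetical parsers for the five preset types named in the text.
CONVERTERS = {
    "STRING": str,
    "LONG": int,
    "BOOLEAN": lambda s: {"true": True, "false": False}[s.strip().lower()],
    "DOUBLE": float,
    "DATE": date.fromisoformat,  # assumes ISO dates such as 2015-08-14
}


def to_intermediate(row: str, column_sep: str,
                    column_types: List[str]) -> List[Tuple[str, Any]]:
    """Split one row at the column separator and tag each column with its type."""
    fields = row.split(column_sep)
    if len(fields) != len(column_types):
        raise ValueError("column count does not match the declared types")
    return [(t, CONVERTERS[t](f)) for t, f in zip(column_types, fields)]
```

A value that cannot be parsed raises ValueError or KeyError here, which a caller could treat as a failed conversion.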
Preferably, the method further includes:
if converting the format of a text data block into the intermediate-state format fails, or converting an intermediate-state text data block into the format required by the destination fails, producing dirty data;
if the dirty data exceeds a preset threshold, generating an error report.
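The dirty-data rule can be sketched as follows. This is only one possible reading: the threshold is taken to be a count of failed rows, and the "error report" is modelled as a raised exception. The names convert_rows, max_dirty, and DirtyDataError are hypothetical and do not come from the application.

```python
class DirtyDataError(Exception):
    """Stands in for the error report generated when too many rows are dirty."""


def convert_rows(rows, convert, max_dirty):
    """Convert each row; collect failures as dirty data and stop past the threshold."""
    clean, dirty = [], []
    for row in rows:
        try:
            clean.append(convert(row))
        except (ValueError, KeyError):
            dirty.append(row)  # conversion failed: record the row as dirty data
            if len(dirty) > max_dirty:
                raise DirtyDataError(
                    f"{len(dirty)} dirty rows exceed the threshold of {max_dirty}")
    return clean, dirty
```

Counting rows is the simplest choice; a ratio of dirty rows to total rows would be an equally plausible threshold.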
Preferably, the file object is a file object that supports random reads and writes, and may include at least one or more of the following: a local file, an Open Storage Service (OSS) file, a Secure File Transfer Protocol (SFTP) file, and a Hadoop Distributed File System (HDFS) file.
An embodiment of the present application also provides an ETL-based file processing system, the system including:
a file object acquisition module, configured to obtain multiple file objects from a source;
a file cutting module, configured to cut, for each file object, the data within the file to obtain multiple text data blocks;
a writing module, configured to concurrently write, after all of the file objects have been cut, all text data blocks corresponding to the file objects to a destination.
Preferably, the file object acquisition module includes:
a structured information reading submodule, configured to read structured information from the source, the structured information including multiple file objects;
a structured information cutting submodule, configured to cut the structured information in units of a single file object to obtain the multiple file objects.
Preferably, the file cutting module includes:
a cut position determination submodule, configured to determine multiple cut positions for each file object;
a cutting submodule, configured to cut the file object according to the multiple cut positions to obtain multiple text data blocks.
Preferably, the file object includes multiple row data records, and the cut position determination submodule includes:
a file size acquiring unit, configured to obtain, for each file object, the size of the file object;
a row size determining unit, configured to determine the average size of the row data records;
a first computing unit, configured to calculate the quotient of the file object size and the average size of the row data records to obtain the number of row data records;
a second computing unit, configured to calculate the quotient of the number of row data records and the preset number of text data blocks to obtain the number of row data records in each text data block;
an initial cut position determining unit, configured to determine multiple initial cut positions of the file object according to the number of row data records in each text data block;
an adjustment unit, configured to adjust the cut position to the position of a row separator when an initial cut position is not at the position of a row separator;
a deduplication unit, configured to deduplicate the adjusted initial cut positions to obtain the multiple cut positions.
Preferably, the adjustment unit is further configured to:
if the initial cut position is not at the position of a row separator, probe forward or roll back to the position of the row separator closest to the initial cut position;
set the cut position to the position of that closest row separator.
Preferably, the row size determining unit is further configured to:
take the size of the first row data record as the average size of the row data records;
or,
take the size of the last row data record as the average size of the row data records;
or,
take the size of a randomly selected row data record as the average size of the row data records;
or,
calculate the mean size of the first N row data records and take it as the average size of the row data records.
Preferably, the writing module includes:
a first format conversion submodule, configured to convert, after all of the file objects have been cut, the format of each text data block into an intermediate-state format;
a data writing submodule, configured to write the corresponding text data blocks in the intermediate-state format to the destination using multiple threads, multiple processes, or multiple distributed machines;
a second format conversion submodule, configured to convert, at the destination, the intermediate-state text data blocks into the format required by the destination.
Preferably, the text data block includes one or more row data records, each row data record includes column separators, and the first format conversion submodule includes:
a splitting unit, configured to split, after all of the file objects have been cut, each row data record of each text data block at the column separators to obtain one or more column records;
a data type adding unit, configured to add a corresponding preset data type to each column record to obtain the intermediate-state format.
Preferably, the preset data types include at least one or more of the following: string STRING, long integer LONG, Boolean BOOLEAN, double-precision floating point DOUBLE, and date DATE.
Preferably, the system further includes:
a dirty data generation module, configured to produce dirty data when converting the format of a text data block into the intermediate-state format fails, or when converting an intermediate-state text data block into the format required by the destination fails;
an error report generation module, configured to generate an error report when the dirty data exceeds a preset threshold.
Preferably, the file object is a file object that supports random reads and writes, and may include at least one or more of the following: a local file, an Open Storage Service (OSS) file, a Secure File Transfer Protocol (SFTP) file, and a Hadoop Distributed File System (HDFS) file.
Compared with the background art, the embodiments of the present application have the following advantages:
In the embodiments of the present application, when multiple file objects are obtained from the source, the data inside each file is cut, taking the file object as the unit, to obtain text data blocks; once the internal cutting of all file objects is complete, all text data blocks are written to the destination concurrently. Because the cut is a fine-grained cut inside the file, the concurrent write can increase the speed of file synchronization during ETL and maximize data synchronization efficiency.
In addition, because the embodiments cut according to the internal structure of the file and are not tied to any particular file system, they can synchronize data across multiple file systems and are highly general.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of embodiment one of an ETL-based file processing method of the present application;
Fig. 2 is a flow chart of the steps of embodiment two of an ETL-based file processing method of the present application;
Fig. 3a is a cut-position schematic of embodiment two of an ETL-based file processing method of the present application;
Fig. 3b is cut-position adjustment schematic one of embodiment two of an ETL-based file processing method of the present application;
Fig. 4a is cut-position adjustment schematic two of embodiment two of an ETL-based file processing method of the present application;
Fig. 4b is a deduplication result schematic of embodiment two of an ETL-based file processing method of the present application;
Fig. 5 is a text data block schematic of embodiment two of an ETL-based file processing method of the present application;
Fig. 6 is a structural block diagram of an embodiment of an ETL-based file processing system of the present application.
Detailed description
To make the above objects, features, and advantages of the present application easier to understand, the application is described in further detail below with reference to the drawings and specific embodiments.
Referring to Fig. 1, a flow chart of the steps of embodiment one of an ETL-based file processing method of the present application is shown, which may include the following steps:
Step 101: obtain multiple file objects from a source;
Step 102: for each file object, cut the data within the file to obtain multiple text data blocks;
Step 103: after all of the file objects have been cut, concurrently write all text data blocks corresponding to the file objects to a destination.
In the embodiments of the present application, when multiple file objects are obtained from the source, the data inside each file is cut, taking the file object as the unit, to obtain text data blocks; once the internal cutting of all file objects is complete, all text data blocks are written to the destination concurrently. Because the cut is a fine-grained cut inside the file, the concurrent write can increase the speed of file synchronization during ETL and maximize data synchronization efficiency.
In addition, because the embodiments cut according to the internal structure of the file and are not tied to any particular file system, they can synchronize data across multiple file systems and are highly general.
Referring to Fig. 2, a flow chart of the steps of embodiment two of an ETL-based file processing method of the present application is shown, which may include the following steps:
Step 201: read structured information from the source, the structured information including multiple file objects;
The embodiments of the present application can be applied to ETL scenarios. There can be many kinds of source, including but not limited to: SFTP (Secure File Transfer Protocol), the local file system (Local File), OSS (Open Storage Service, which can be understood as a storage disk), HDFS (Hadoop Distributed File System), and so on.
It should be noted that the source systems of the present application generally offer no database-transaction guarantees. Users must themselves ensure data consistency (against insertions, deletions, and modifications) while the data is being read; that is, users should as far as possible avoid changing the content of the structured information during data synchronization.
The structured information read from the source can include structured data and/or semi-structured data. Structured data can be a database, whose rows and columns are strictly predefined and subject to an explicit schema constraint. Semi-structured data refers to character data with explicit row separators and column separators; it can be abstracted as a two-dimensional table in which each record row contains one or more columns and every row has the same number of columns.
In the embodiments of the present application, structured information can include multiple file objects, where a file object can be any object on which a file model can be built. In addition, so that the subsequent data cutting and concurrent writing can proceed smoothly, the file object can be one that supports random reads and writes: it can be read starting from any byte, and writes can be appended progressively at the end of the file object.
As an example of the embodiments of the present application, the file object may include at least one or more of the following: a local file, an OSS file, an SFTP file, and an HDFS file.
It should be noted that the file objects of the embodiments of the present application can support different encoding types and different compression types.
Step 202: cut the structured information in units of a single file object to obtain multiple file objects;
After the structured information is obtained from the source, the embodiment can perform a first-level, coarse-grained file cut. This is a file-level cut that takes a single file object as the unit and yields multiple file objects. Here, cutting means dividing one task into multiple subtasks that can be executed concurrently; the overall task is complete once all subtasks are complete, and executing the subtasks concurrently shortens the job's run time.
For example, if the structured information read from the source includes 10 file objects, cutting in units of a single file object yields 10 file objects.
As another example, if one day of structured information is obtained from the source and the data is stored by the hour, the day can be cut into 24 file objects.
It should be noted that if the structured information read from the source is too small, the cutting logic of the present application need not be executed; for instance, there is no need to cut a 100 KB file object. The cutting logic targets source data whose structured information exceeds a set threshold in size. Therefore, before step 202 is executed, it can first be determined whether the size of the structured information is greater than or equal to the set threshold; if so, step 202 is executed, and otherwise it is not.
Step 203: for each file object, cut the data within the file to obtain multiple text data blocks;
After the data obtained from the source has been through the first-level, coarse-grained cut and yielded multiple file objects, the embodiment further performs a second-level, fine-grained file cut on each file object. This cut is internal to the file; when it completes, each file object yields multiple text data blocks.
In a preferred embodiment of the embodiments of the present application, step 203 can include the following sub-steps:
Sub-step S11: for each file object, determine multiple cut positions;
Before a file object is cut, the cut positions must first be determined. In a preferred embodiment of the embodiments of the present application, sub-step S11 further includes the following sub-steps:
Sub-step S111: for each file object, obtain the size of the file object;
For each single file object, the attribute information of the file object can be obtained, where the attribute information can include the file object size totalSize.
Sub-step S112: determine the average size of the row data records;
Structured and/or semi-structured data can include row data records, a row data record being the data record of one row. The file objects obtained by cutting the structured information are likewise made up of row data records, and the row data records carry information such as row separators and column separators.
However, the row data records in a file object may differ in size. To keep the row data records as intact as possible in the subsequent cut, the embodiment can obtain the average size lineSize of the row data records and use it as one of the bases for cutting. In a specific implementation, lineSize can be determined from characteristics such as the size distribution of the row data records in the file object.
In one implementation, the average size of the row data records can be determined in any of the following ways:
taking the size of the first row data record as the average size of the row data records;
or,
taking the size of the last row data record as the average size of the row data records;
or,
taking the size of a randomly selected row data record as the average size of the row data records;
or,
calculating the mean size of the first N row data records and taking it as the average size of the row data records.
Specifically, if all row data records in the file object are the same size, or roughly the same size, the size of the first row, the last row, or a random row can be taken directly as the average size of the row data records.
If the size distribution of the row data records in the file object varies periodically, for example with a period of 50 rows, the mean size of each 50 rows of data can be taken as the average size of the row data records. Of course, besides the mean, the maximum, minimum, or median size of the row data records within the period can also be taken as the average size of the row data records. Moreover, N in the first N rows mentioned in the embodiments can be, besides the period of variation above, a row count configured in advance by the user or a row count chosen at random; the embodiments place no limit on this.
It should be noted that the above ways of determining the average size of the row data records are only examples of the embodiments of the present application; those skilled in the art may determine the average size of the row data records in other ways as needed to achieve the purpose of the present application.
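The "first N rows" option can be sketched in Python as follows. This is a minimal illustration, assuming the row separator is a newline byte and that each row's size includes its separator; the function name average_row_size is hypothetical.

```python
import io


def average_row_size(stream, n):
    """Estimate lineSize as the mean byte length of the first n rows.

    Assumes rows end with b"\\n", which is what readline() splits on
    for binary streams; each row's length includes that separator.
    """
    sizes = []
    for _ in range(n):
        line = stream.readline()
        if not line:  # fewer than n rows in the file
            break
        sizes.append(len(line))
    if not sizes:
        raise ValueError("the file object contains no rows")
    return sum(sizes) / len(sizes)


sample = io.BytesIO(b"aa\nbbbb\ncc\n")
line_size = average_row_size(sample, 2)  # mean of 3 and 5 bytes
```

Passing n=1 reduces this to the "first row" option listed above.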
Sub-step S113: calculate the quotient of the file object size and the average size of the row data records to obtain the number of row data records;
From totalSize and lineSize, the number of row data records in the file object can be estimated as number = totalSize / lineSize.
Sub-step S114: calculate the quotient of the number of row data records and the preset number of text data blocks to obtain the number of row data records in each text data block;
In the embodiments of the present application, the number m of text data blocks that the user wants the cut to produce can be configured in advance. In a preferred embodiment, the user can directly enter the required number of text data blocks; alternatively, it can be derived from throughput: if one text data block flows at 1 Mbps and the user needs 10 Mbps, the file is cut into 10 parts.
From the number m of text data blocks and the number number of row data records, the number of row data records in each text data block is number / m.
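The two quotients can be worked through in a few lines of Python. The function name rows_per_block is hypothetical, and the use of integer division is an assumption of this sketch; the text only writes "/".

```python
def rows_per_block(total_size: int, line_size: int, m: int) -> int:
    """Estimate how many row data records each of the m text data blocks holds."""
    if line_size <= 0 or m <= 0:
        raise ValueError("line_size and m must be positive")
    number = total_size // line_size  # number = totalSize / lineSize
    return number // m                # rows in each block = number / m


# A 1,000,000-byte file with 100-byte rows holds about 10,000 rows,
# so 10 blocks receive about 1,000 rows each.
per_block = rows_per_block(1_000_000, 100, 10)
```

Because lineSize is only an average, the result is an estimate rather than an exact row count.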
Sub-step S115: determine multiple initial cut positions of the file object according to the number of row data records in each text data block;
Once the number of row data records in each text data block has been determined, the multiple initial cut positions of the file object can be determined. These initial cut positions can be: 0, 1*number/m, 2*number/m, 3*number/m, and so on.
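The sequence of initial cut positions can be generated as below. The function name initial_cut_positions is hypothetical. The text lists the positions in units of rows; multiplying by the average row size to approximate byte offsets is an assumption added by this sketch.

```python
from typing import List


def initial_cut_positions(number: int, m: int, line_size: int = 1) -> List[int]:
    """Return the m initial cut positions 0, 1*number/m, 2*number/m, ...

    With line_size=1 the positions are row indices, as listed in the text.
    Passing the average row size scales them to approximate byte offsets
    (that scaling is this sketch's assumption).
    """
    if m <= 0:
        raise ValueError("m must be positive")
    return [(k * number // m) * line_size for k in range(m)]


row_positions = initial_cut_positions(100, 4)       # in rows
byte_positions = initial_cut_positions(100, 4, 10)  # in bytes, 10-byte rows
```

These positions are only starting points; they generally fall mid-row and still need the adjustment to a row separator.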
Sub-step S116: if an initial cut position is not at the position of a row separator, adjust the cut position to the position of a row separator;
To avoid cutting through a complete row data record when cutting at the initial cut positions, the embodiment can fine-tune the cut positions so that each falls at the end or the start of a row. In a specific implementation, the adjustment can be: if the initial cut position is not at the position of a row separator, adjust the cut position to the position of a row separator.
In a preferred embodiment of the embodiments of the present application, sub-step S116 can further be: if the initial cut position is not at the position of a row separator, probe forward or roll back to the position of the row separator closest to the initial cut position, and set the cut position to the position of that closest row separator.
As shown in the cut-position schematic of Fig. 3a, an initial cut position obtained in sub-step S115 may fall in the middle of a row data record; cutting there would truncate the row data record and leave it incomplete. Therefore, in one implementation, the embodiment can probe forward for a row separator until it reaches the end of the row or the end of the file; if a row separator is detected, the initial cut position is updated to the position of that row separator, as shown in cut-position adjustment schematic one of Fig. 3b.
In another implementation, analogous to forward probing, the embodiment can instead roll back to detect a row separator, until it reaches the start of the row or the start of the file; if a row separator is detected, the initial cut position is updated to the position of that row separator.
It should be noted that rolling back is more demanding than probing forward, and is available only on file systems that support rolling back.
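Forward probing can be sketched as follows for a seekable binary stream. This is one possible reading, in which the adjusted cut position is the byte offset just past the row separator, that is, the start of the next row. The function name adjust_forward and the chunk_size parameter are hypothetical.

```python
import io


def adjust_forward(stream, pos, row_sep=b"\n", chunk_size=4096):
    """Move pos forward to the start of the next row, or to the end of the file.

    A position that already sits right after a row separator is kept.
    """
    if pos <= 0:
        return 0
    # Begin one separator-length early so an aligned position is recognised.
    start = max(pos - len(row_sep), 0)
    stream.seek(start)
    buf = b""
    while True:
        data = stream.read(chunk_size)
        if not data:
            return start + len(buf)  # no separator found: end of file
        buf += data
        idx = buf.find(row_sep)
        if idx != -1:
            return start + idx + len(row_sep)


data = io.BytesIO(b"aaa\nbbb\nccc\n")
adjusted = adjust_forward(data, 5)  # 5 is inside "bbb"; the next row starts at 8
```

Rolling back would scan in the opposite direction from pos and needs a source that can seek to earlier offsets.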
Sub-step S117: deduplicate the adjusted initial cut positions to obtain multiple cut positions.
As shown in cut-position adjustment schematic two of Fig. 4a, in a specific implementation an overlong row data record may span multiple text data blocks. In that case the processing of sub-step S116 yields several identical cut positions; only one of the repeated cut positions is then kept and the other repeats are deleted. The result of deduplication is shown in the deduplication result schematic of Fig. 4b.
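Deduplication is short enough to show directly. In this sketch the positions are plain integers and the result is returned in ascending order; the function name dedupe_positions is hypothetical.

```python
def dedupe_positions(positions):
    """Drop repeated cut positions and return the rest in ascending order."""
    return sorted(set(positions))


# One overlong row pushed three initial positions to the same separator.
cuts = dedupe_positions([0, 8, 8, 8, 20])
```

The file is then cut into fewer blocks than the preset number m, which is the expected outcome when a single row spans several blocks.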
Sub-step S12: cut the file object according to the multiple dicing positions to
obtain multiple text data blocks.
After the multiple dicing positions are obtained, the file object can be cut
according to them to obtain the corresponding text data blocks. For example,
after the file-internal cutting of step 203, the multiple text data blocks shown
in the text data block schematic diagram of Fig. 5 can be obtained.
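The cutting itself can be sketched as slicing between consecutive dicing
positions. This is an illustration under assumptions: the file object is held in
memory as bytes, positions are byte offsets, and `cut_blocks` is a hypothetical
name.

```python
def cut_blocks(data: bytes, positions: list[int]) -> list[bytes]:
    """Slice data into text data blocks at the given dicing positions."""
    # Drop duplicates and positions at or beyond the file boundaries.
    inner = [p for p in sorted(set(positions)) if 0 < p < len(data)]
    bounds = [0] + inner + [len(data)]
    # Each pair of neighbouring bounds delimits one block.
    return [data[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]
```

Concatenating the returned blocks in order reproduces the original content, so
no row data record is lost or duplicated by the cut.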
Step 204: after the cutting of the multiple file objects is completed,
concurrently write all text data blocks corresponding to the multiple file
objects to the destination.
After the second-level, fine-grained cutting has produced all the text data
blocks, all of them can be written to the destination concurrently. In a
specific implementation, distributed multi-machine, and/or multi-process,
and/or multi-thread modes can be used to read multiple text data blocks in
parallel, thereby improving the speed and efficiency of data synchronization.
In a specific implementation, the amount of data a distributed system needs to
process is huge, while the number of distributed machines, processes or threads
the system can accommodate is limited. Therefore, one process, thread or
distributed machine can be used to process multiple text data blocks, and the
data traffic with which that process, thread or distributed machine processes
one text data block can be limited, so as to achieve flow control.
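The multi-thread variant with a simple form of flow control can be sketched as
below. Everything here is an assumption made for illustration: the destination
is modelled as an in-memory list, `write_block` and `write_all` are hypothetical
names, and the throttle is a plain sleep proportional to block size rather than
the flow-control scheme of the application.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def write_block(block: bytes, sink: list, max_bytes_per_sec=None) -> int:
    """Write one text data block, optionally throttled to a byte rate."""
    if max_bytes_per_sec:
        # Crude flow control: delay in proportion to the block size.
        time.sleep(len(block) / max_bytes_per_sec)
    sink.append(block)  # stand-in for the real destination write
    return len(block)


def write_all(blocks, sink, workers=2, max_bytes_per_sec=None) -> int:
    """Write all blocks with a bounded pool; each worker handles many blocks."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sizes = list(
            pool.map(lambda b: write_block(b, sink, max_bytes_per_sec), blocks)
        )
    return sum(sizes)
```

Bounding `workers` reflects the limited number of threads the system can
accommodate, so a single worker ends up processing several text data blocks in
turn.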
In a preferred embodiment of the embodiments of the present application, step
204 may include the following sub-steps:
Sub-step S21: convert the format of each text data block into an intermediate
state format.
The embodiment of the present application defines a transfer mechanism through
which source data and destination data are synchronized. The transfer mechanism
first converts the format of the text data blocks obtained by cutting the
source data into an intermediate state format. The intermediate state format is
neither the format of the source data nor the format of the destination data;
it is a transitional state from the source data format to the destination data
format.
In a preferred embodiment of the embodiments of the present application,
sub-step S21 may include the following sub-steps:
Sub-step S211: for each row data record of each text data block, cut the
record according to the column separator to obtain one or more column records.
Based on the characteristics of structured information, the embodiment of the
present application can split a row data record by the column separator to
obtain one or more column records. For example, for the row data record
1,2,3,abc,2015-07-16 00:00:00,China, cutting by the column separator ","
yields the column records 1, 2, 3, abc, 2015-07-16 00:00:00 and China, 6
columns in total.
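The worked example above can be reproduced with a short sketch. The function
name `split_row` is hypothetical, and the sketch assumes a plain separator with
no quoting or escaping of "," inside a field.

```python
def split_row(row: str, column_sep: str = ",") -> list[str]:
    """Split one row data record into column records by the column separator."""
    # Strip the trailing line separator before splitting into columns.
    return row.rstrip("\r\n").split(column_sep)
```

Calling `split_row("1,2,3,abc,2015-07-16 00:00:00,China\n")` returns the six
column records of the example, all still as character data.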
Sub-step S212: add a corresponding preset data type to each column record to
obtain the intermediate state format.
Because everything stored in a row data record is character data, each column
has lost its specific type (integer, floating-point number, date, null, etc.).
Therefore, after multiple column records are obtained in sub-step S211, preset
data types can be added to them according to configuration information supplied
by the user in advance.
As a preferred example of the embodiments of the present application, the
preset data types may include at least one or more of the following: string
STRING, long integer LONG, Boolean BOOLEAN, double-precision floating point
DOUBLE, and date DATE. The data types may of course also include the byte type
Bytes, although that type is not involved when reading text.
The preset data types are analogous to a relational database table in which
each column has one type, as shown in Table 1 below.
First column | Second column | Third column | Fourth column | Fifth column |
STRING | LONG | BOOLEAN | DOUBLE | DATE |
Abc | 123 | true | 1.001 | 1989-04-27 |
abc | 456 | false | 2.002 | 2007-09-01 |
Abc123 | 789 | true | 3.003 | 2014-07-07 |
Table 1
As an example, a configuration instance of the data types is shown in the
following code:
Here, index is the index of the column within a row of data, with subscripts
starting from 0; type is the data type; and for the date type, format can be
configured to specify the date format.
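The original configuration listing is not reproduced in this text. Purely as a
hypothetical illustration of the three fields just described (index, type and
format), a configuration and the conversion it drives might look like the
sketch below. The structure of `COLUMNS`, the function `to_typed`, and the use
of Python strptime patterns for format are all assumptions, not the syntax of
the application.

```python
from datetime import datetime

# Hypothetical configuration: one entry per column to be synchronized.
COLUMNS = [
    {"index": 0, "type": "LONG"},
    {"index": 3, "type": "STRING"},
    {"index": 4, "type": "DATE", "format": "%Y-%m-%d %H:%M:%S"},
]


def to_typed(fields, columns):
    """Attach a preset data type to each configured column record."""
    typed = []
    for col in columns:
        raw = fields[col["index"]]  # index is zero-based
        kind = col["type"]
        if kind == "STRING":
            value = raw
        elif kind == "LONG":
            value = int(raw)
        elif kind == "DOUBLE":
            value = float(raw)
        elif kind == "BOOLEAN":
            if raw.lower() not in ("true", "false"):
                raise ValueError(f"not a boolean: {raw!r}")
            value = raw.lower() == "true"
        elif kind == "DATE":
            value = datetime.strptime(raw, col["format"])
        else:
            raise ValueError(f"unknown type: {kind}")
        typed.append((kind, value))
    return typed
```

A conversion that fails raises ValueError, which a caller could count as dirty
data.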
It should be noted that the embodiments of the present application deal only
with reading character data, not binary data (for example, mp3 files or
pictures). Binary data can be Base64-encoded and then processed as the string
type STRING.
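A minimal sketch of carrying binary content as a STRING value through Base64,
using the Python standard library (the helper names are hypothetical):

```python
import base64


def binary_to_string(payload: bytes) -> str:
    """Encode binary data as Base64 text so it can travel as a STRING column."""
    return base64.b64encode(payload).decode("ascii")


def string_to_binary(text: str) -> bytes:
    """Recover the original bytes from the Base64 text."""
    return base64.b64decode(text)
```

The encoding is lossless, so `string_to_binary(binary_to_string(x))` returns
`x` for any byte string.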
Sub-step S22: write the text data blocks in the intermediate state format to
the destination.
Sub-step S23: at the destination, convert the text data blocks in the
intermediate state format into the format required by the destination.
After the intermediate state format of a text data block is obtained, the text
data block in that format can be written to the destination and then converted,
at the destination, into the data format type the destination requires. In
other words, the preset data types above act as intermediate data types that
serve a transfer function, and the relationship among the source, the
destination and the preset data types is:
Multiple source data types -> 5 preset data types -> Multiple destination types.
With the transfer mechanism of the embodiments of the present application, the
synchronization framework can adopt a plug-in pattern. Whether a source data
type or a destination data type is added, the transfer mechanism can complete
the data type conversion efficiently, which enriches the forms of data exchange
between different storage systems and provides strong extensibility.
In one embodiment, the format conversion using the transfer mechanism described
above may further include the following steps:
If converting the format of a text data block into the intermediate state
format fails, or converting a text data block in the intermediate state format
into the format required by the destination fails, dirty data is produced;
If the dirty data exceeds a preset threshold, an error report is generated.
Specifically, if the format conversion in sub-step S21 and/or sub-step S23
fails, the corresponding column record can be treated as dirty data. Error
cases include, for example, converting abc to a number or converting abc to a
date. Once the quantity of dirty data exceeds the dirty data limit, an error
can be reported and the synchronization task for that text data block ends.
As an example, the dirty data limit may include: 1. a dirty data count, meaning
the task reports an error when the number of dirty data records exceeds a
specified count; 2. a dirty data percentage, meaning the task reports an error
when the dirty data exceeds a specified percentage.
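The two kinds of limit can be sketched as a small counter. The class and method
names are hypothetical, and the choice to check the percentage only when
`finish` is called (rather than after every record) is an assumption made so
that a single early failure does not immediately count as 100% dirty.

```python
class DirtyDataLimitExceeded(Exception):
    """Raised when dirty data goes beyond the configured limit."""


class DirtyDataCounter:
    def __init__(self, max_count=None, max_ratio=None):
        self.max_count = max_count  # limit 1: number of dirty records
        self.max_ratio = max_ratio  # limit 2: fraction of dirty records
        self.total = 0
        self.dirty = 0

    def record(self, ok: bool) -> None:
        """Register one converted record; fail fast on the count limit."""
        self.total += 1
        if not ok:
            self.dirty += 1
        if self.max_count is not None and self.dirty > self.max_count:
            raise DirtyDataLimitExceeded(f"{self.dirty} dirty records")

    def finish(self) -> None:
        """Check the percentage limit once the records seen so far are final."""
        if self.max_ratio is not None and self.total:
            if self.dirty / self.total > self.max_ratio:
                raise DirtyDataLimitExceeded(
                    f"{self.dirty}/{self.total} records are dirty"
                )
```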
Further, when multiple text data blocks are written to the destination, if the
destination sets up only one file to store them, multiple threads (processes)
cannot write that same file concurrently, which gives rise to a data mutual
exclusion problem. To avoid data mutual exclusion, the destination can set up
files in one-to-one correspondence with the text data blocks for storing and
processing them. That is, one text data block corresponds to one file at the
destination, and each text data block is written to a different destination
file.
A file at the destination may include a file identifier. In one embodiment, the
file identifier is named as follows: the user specifies a file prefix (the
specified prefix), the embodiment of the present application generates a random
32-digit UUID (Universally Unique Identifier), and the specified prefix and the
UUID are concatenated to form the file identifier.
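A minimal sketch of this naming scheme, assuming the 32 digits are the
hexadecimal form of a random (version 4) UUID; the function name `make_file_id`
is hypothetical:

```python
import uuid


def make_file_id(prefix: str) -> str:
    """Concatenate the user-specified prefix with a random 32-hex-digit UUID."""
    return prefix + uuid.uuid4().hex
```

Because every text data block gets its own identifier, concurrent writers never
target the same destination file.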
In a specific implementation, there can be three modes for writing text data
blocks to the destination. The first write mode is truncate: before the text
data blocks are written, historical data with the same specified prefix is
cleared. The second write mode is noConflict: when the text data blocks are
written, if historical data with the same specified prefix is found, an error
report is generated and data synchronization exits. The third write mode is
append: data is appended, and the text data blocks are written normally
regardless of whether historical data with the same specified prefix exists.
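The three write modes can be sketched as a preparation step that runs before
any text data block is written. This is an illustration under assumptions: the
destination is a local directory, historical data means files whose names start
with the specified prefix, and `prepare_destination` is a hypothetical name.

```python
import os


def prepare_destination(directory: str, prefix: str, mode: str) -> list:
    """Apply the write mode; return the names of removed historical files."""
    existing = sorted(n for n in os.listdir(directory) if n.startswith(prefix))
    if mode == "truncate":
        # Clear historical data that carries the same specified prefix.
        for name in existing:
            os.remove(os.path.join(directory, name))
        return existing
    if mode == "noConflict":
        # Refuse to continue when historical data already exists.
        if existing:
            raise FileExistsError(f"files with prefix {prefix!r} already exist")
        return []
    if mode == "append":
        # Keep historical data and write new files alongside it.
        return []
    raise ValueError(f"unknown write mode: {mode}")
```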
In the embodiments of the present application, structured information is cut
into multiple text data blocks through coarse-grained file-level cutting and
fine-grained file-internal cutting. The cutting focuses on the logical
two-dimensional table structure inside the file, which makes it a general
synchronization solution for general semi-structured file data.
In addition, the embodiments of the present application can support multiple
source data types, and offer strong versatility and extensibility.
It should be noted that, for brevity, the method embodiments are each described
as a series of combined actions. Those skilled in the art should understand,
however, that the embodiments of the present application are not limited by the
described order of actions, because according to the embodiments of the present
application some steps can be performed in other orders or simultaneously.
Those skilled in the art should also understand that the embodiments described
in the specification are preferred embodiments, and the actions involved are
not necessarily required by the embodiments of the present application.
Referring to Fig. 6, a structural block diagram of an ETL-based document
handling system of the present application is shown, which may specifically
include the following modules:
File object acquisition module 601, for obtaining multiple file objects from source;
File cutting module 602, for for each file object, carrying out the data cutting in file,
Obtain multiple text data blocks;
Writing module 603, for concurrently writing, after the cutting of the multiple
file objects is completed, all text data blocks corresponding to the multiple
file objects to the destination.
In a kind of preferred embodiment of the embodiment of the present application, described file object acquisition module 601
Following submodule can be included:
Structured message reading submodule, for reading structured message, described structuring from source
Information includes multiple file objects;
Structured message cutting submodule, for by described structured message with single file object being
Unit carries out cutting, obtains multiple file objects.
In a kind of preferred embodiment of the embodiment of the present application, described file cutting module 602 is permissible
Including following submodule:
Dicing position determination sub-module, for for each file object, determining multiple dicing position;
Cutting submodule, for cutting being carried out to described file object according to the plurality of dicing position,
Obtain multiple text data blocks.
In a preferred embodiment of the embodiments of the present application, the
file object includes multiple row data records, and the dicing position
determination sub-module includes:
File size acquiring unit, for obtaining, for each file object, the size of the
file object;
Row size determining unit, for determining the average size of the row data
records;
First computing unit, for calculating the quotient of the file object size and
the average size of the row data records to obtain the quantity of row data
records;
Second computing unit, for calculating the quotient of the quantity of preset
text data blocks and the quantity of row data records to obtain the quantity of
row data records each text data block has;
Initial dicing position determining unit, for determining multiple initial
dicing positions of the file object according to the quantity of row data
records each text data block has;
Adjustment unit, for adjusting an initial dicing position to the position of a
line separator when the initial dicing position is not the position of a line
separator;
Duplicate removal unit, for performing de-duplication on the adjusted initial
dicing positions to obtain multiple dicing positions.
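The arithmetic performed by the first computing unit, the second computing unit
and the initial dicing position determining unit can be sketched as follows.
This is one possible reading, not the implementation of the application: rows
per block is taken here as the estimated row count divided by the preset block
count, positions are byte offsets, and the name `initial_positions` is
hypothetical.

```python
def initial_positions(file_size: int, avg_row_size: int, block_count: int):
    """Estimate initial dicing positions (byte offsets) for one file object."""
    # First computing unit: estimated number of row data records.
    total_rows = max(1, file_size // avg_row_size)
    # Second computing unit: rows each text data block should hold.
    rows_per_block = max(1, total_rows // block_count)
    # Initial positions: evenly spaced by the estimated block size in bytes.
    step = rows_per_block * avg_row_size
    return list(range(step, file_size, step))[: block_count - 1]
```

For a 1000-byte file with an average row size of 10 bytes and 4 preset blocks,
this gives the offsets 250, 500 and 750. These are only estimates, which is why
the adjustment unit and the duplicate removal unit follow.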
In a preferred embodiment of the embodiments of the present application, the
adjustment unit is further used for:
If the initial dicing position is not the position of a line separator,
detecting forward or rolling back to the position of the line separator closest
to the initial dicing position;
Determining the dicing position as the position of that closest line separator.
In a preferred embodiment of the embodiments of the present application, the row
size determining unit is further used for:
Taking the size of the first row data record as the average size of the row
data records;
Or,
Taking the size of the last row data record as the average size of the row
data records;
Or,
Randomly selecting the size of one row data record as the average size of the
row data records;
Or,
Calculating the average size of the first N row data records as the average
size of the row data records.
In a preferred embodiment of the embodiments of the present application, the
writing module 603 may include the following sub-modules:
First format conversion sub-module, for converting, after the cutting of the
multiple file objects is completed, the format of each text data block into an
intermediate state format;
Data writing sub-module, for writing the corresponding text data blocks in the
intermediate state format to the destination using multiple threads, multiple
processes or distributed multiple machines;
Second format conversion sub-module, for converting, at the destination, the
text data blocks in the intermediate state format into the format required by
the destination.
In a preferred embodiment of the embodiments of the present application, the
text data block includes one or more row data records, the row data record
includes a column separator, and the first format conversion sub-module
includes:
Cutter unit, for cutting, after the cutting of the multiple file objects is
completed, each row data record of each text data block according to the
column separator to obtain one or more column records;
Data type adding unit, for adding a corresponding preset data type to each
column record to obtain the intermediate state format.
In a preferred embodiment of the embodiments of the present application, the
preset data types include at least one or more of the following types: string
STRING, long integer LONG, Boolean BOOLEAN, double-precision floating point
DOUBLE, and date DATE.
In a preferred embodiment of the embodiments of the present application, the
system further includes:
Dirty data generation module, for producing dirty data when converting the
format of a text data block into the intermediate state format fails, or when
converting a text data block in the intermediate state format into the format
required by the destination fails;
Error report generation module, for generating an error report when the dirty
data exceeds a preset threshold.
In a preferred embodiment of the embodiments of the present application, the
file object is a file object that supports random reading and writing, and the
file object may include at least one or more of the following objects: local
file, OSS file, SFTP file, HDFS file.
Because the system embodiment is basically similar to the method embodiment
described above, its description is relatively brief; for related details,
refer to the corresponding part of the method embodiment.
The embodiments in this specification are described in a progressive manner.
Each embodiment focuses on its differences from the other embodiments, and for
the parts that are the same or similar, the embodiments can be referred to one
another.
Those skilled in the art should appreciate that the embodiments of the present
application may be provided as a method, an apparatus or a computer program
product. Accordingly, the embodiments of the present application may take the
form of an entirely hardware embodiment, an entirely software embodiment, or an
embodiment combining software and hardware aspects. Moreover, the embodiments
of the present application may take the form of a computer program product
implemented on one or more computer-usable storage media (including but not
limited to disk storage, CD-ROM, optical storage, etc.) containing
computer-usable program code.
The embodiments of the present application are described with reference to
flowcharts and/or block diagrams of the method, terminal device (system) and
computer program product according to the embodiments of the present
application. It should be understood that each flow and/or block in the
flowcharts and/or block diagrams, and combinations of flows and/or blocks in
the flowcharts and/or block diagrams, can be implemented by computer program
instructions. These computer program instructions may be provided to a
processor of a general-purpose computer, special-purpose computer, embedded
processor or other programmable data processing terminal device to produce a
machine, so that the instructions executed by the processor of the computer or
other programmable data processing terminal device produce an apparatus for
implementing the functions specified in one or more flows of the flowcharts
and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable
memory capable of directing a computer or other programmable data processing
terminal device to work in a specific manner, so that the instructions stored
in the computer-readable memory produce an article of manufacture including an
instruction apparatus, and the instruction apparatus implements the functions
specified in one or more flows of the flowcharts and/or one or more blocks of
the block diagrams.
These computer program instructions may also be loaded onto a computer or other
programmable data processing terminal device, so that a series of operation
steps is executed on the computer or other programmable terminal device to
produce computer-implemented processing. The instructions executed on the
computer or other programmable terminal device thus provide steps for
implementing the functions specified in one or more flows of the flowcharts
and/or one or more blocks of the block diagrams.
Although preferred embodiments of the embodiments of the present application
have been described, those skilled in the art can make further changes and
modifications to these embodiments once they learn of the basic inventive
concept. The appended claims are therefore intended to be construed as
including the preferred embodiments and all changes and modifications that fall
within the scope of the embodiments of the present application.
Finally, it should also be noted that relational terms such as first and second
are used herein only to distinguish one entity or operation from another, and
do not necessarily require or imply any such actual relationship or order
between these entities or operations. Moreover, the terms "include", "comprise"
and any other variants thereof are intended to cover non-exclusive inclusion,
so that a process, method, article or terminal device that includes a series of
elements includes not only those elements but also other elements not expressly
listed, or elements inherent to such a process, method, article or terminal
device. Without further limitation, an element defined by the phrase
"including a ..." does not exclude the existence of other identical elements in
the process, method, article or terminal device that includes the element.
The ETL-based document handling method and system provided by the present
application have been described in detail above. Specific examples are used
herein to explain the principles and implementations of the present
application, and the description of the above embodiments is intended only to
help in understanding the method of the present application and its core idea.
At the same time, for those of ordinary skill in the art, changes may be made
to the specific implementations and the scope of application according to the
idea of the present application. In summary, the content of this specification
should not be construed as limiting the present application.
Claims (22)
1. A document handling method based on ETL, characterized in that the method
includes:
Obtaining multiple file objects from a source;
For each file object, performing data cutting within the file to obtain
multiple text data blocks;
After the cutting of the multiple file objects is completed, concurrently
writing all text data blocks corresponding to the multiple file objects to a
destination.
2. The method according to claim 1, characterized in that the step of obtaining
multiple file objects from the source includes:
Reading structured information from the source, the structured information
including multiple file objects;
Cutting the structured information in units of single file objects to obtain
multiple file objects.
3. method according to claim 1 and 2 it is characterised in that described for each literary composition
Part object, carries out the data cutting in file, and the step obtaining multiple text data blocks includes:
For each file object, determine multiple dicing position;
According to the plurality of dicing position, cutting is carried out to described file object, obtain multiple textual data
According to block.
4. The method according to claim 3, characterized in that the file object
includes multiple row data records, and the step of determining, for each file
object, multiple dicing positions includes:
For each file object, obtaining the size of the file object;
Determining the average size of the row data records;
Calculating the quotient of the file object size and the average size of the
row data records to obtain the quantity of row data records;
Calculating the quotient of the quantity of preset text data blocks and the
quantity of row data records to obtain the quantity of row data records each
text data block has;
Determining multiple initial dicing positions of the file object according to
the quantity of row data records each text data block has;
If an initial dicing position is not the position of a line separator,
adjusting the dicing position to the position of a line separator;
Performing de-duplication on the adjusted initial dicing positions to obtain
multiple dicing positions.
5. The method according to claim 4, characterized in that the step of
adjusting the dicing position to the position of a line separator if the
initial dicing position is not the position of a line separator includes:
If the initial dicing position is not the position of a line separator,
detecting forward or rolling back to the position of the line separator closest
to the initial dicing position;
Determining the dicing position as the position of that closest line separator.
6. The method according to claim 4, characterized in that the step of
determining the average size of the row data records includes:
Taking the size of the first row data record as the average size of the row
data records;
Or,
Taking the size of the last row data record as the average size of the row
data records;
Or,
Randomly selecting the size of one row data record as the average size of the
row data records;
Or,
Calculating the average size of the first N row data records as the average
size of the row data records.
7. The method according to claim 1, characterized in that the step of
concurrently writing, after the cutting of the multiple file objects is
completed, all text data blocks corresponding to the multiple file objects to
the destination includes:
After the cutting of the multiple file objects is completed, converting the
format of each text data block into an intermediate state format;
Writing the corresponding text data blocks in the intermediate state format to
the destination using multiple threads, multiple processes or distributed
multiple machines;
At the destination, converting the text data blocks in the intermediate state
format into the format required by the destination.
8. The method according to claim 7, characterized in that the text data block
includes one or more row data records, the row data record includes a column
separator, and the step of converting, after the cutting of the multiple file
objects is completed, the format of each text data block into the intermediate
state format includes:
After the cutting of the multiple file objects is completed, cutting each row
data record of each text data block according to the column separator to
obtain one or more column records;
Adding a corresponding preset data type to each column record to obtain the
intermediate state format.
9. The method according to claim 8, characterized in that the preset data types
include at least one or more of the following types: string STRING, long
integer LONG, Boolean BOOLEAN, double-precision floating point DOUBLE, and date
DATE.
10. The method according to claim 7, characterized by further including:
If converting the format of a text data block into the intermediate state
format fails, or converting a text data block in the intermediate state format
into the format required by the destination fails, producing dirty data;
If the dirty data exceeds a preset threshold, generating an error report.
11. The method according to claim 1 or 2 or 4 or 5 or 6 or 7 or 8 or 9,
characterized in that the file object is a file object that supports random
reading and writing, and the file object may include at least one or more of
the following objects: local file, open storage service OSS file, secure file
transfer protocol SFTP file, distributed file system HDFS file.
12. A document handling system based on ETL, characterized in that the system
includes:
File object acquisition module, for obtaining multiple file objects from a
source;
File cutting module, for performing, for each file object, data cutting within
the file to obtain multiple text data blocks;
Writing module, for concurrently writing, after the cutting of the multiple
file objects is completed, all text data blocks corresponding to the multiple
file objects to a destination.
13. The system according to claim 12, characterized in that the file object
acquisition module includes:
Structured message reading submodule, for reading structured message, described structuring from source
Information includes multiple file objects;
Structured message cutting submodule, for by described structured message with single file object being
Unit carries out cutting, obtains multiple file objects.
14. The system according to claim 12 or 13, characterized in that the file
cutting module includes:
Dicing position determination sub-module, for for each file object, determining multiple dicing position;
Cutting submodule, for cutting being carried out to described file object according to the plurality of dicing position,
Obtain multiple text data blocks.
15. The system according to claim 14, characterized in that the file object
includes multiple row data records, and the dicing position determination
sub-module includes:
File size acquiring unit, for obtaining, for each file object, the size of the
file object;
Row size determining unit, for determining the average size of the row data
records;
First computing unit, for calculating the quotient of the file object size and
the average size of the row data records to obtain the quantity of row data
records;
Second computing unit, for calculating the quotient of the quantity of preset
text data blocks and the quantity of row data records to obtain the quantity of
row data records each text data block has;
Initial dicing position determining unit, for determining multiple initial
dicing positions of the file object according to the quantity of row data
records each text data block has;
Adjustment unit, for adjusting an initial dicing position to the position of a
line separator when the initial dicing position is not the position of a line
separator;
Duplicate removal unit, for performing de-duplication on the adjusted initial
dicing positions to obtain multiple dicing positions.
16. The system according to claim 15, characterized in that the adjustment unit
is further used for:
If the initial dicing position is not the position of a line separator,
detecting forward or rolling back to the position of the line separator closest
to the initial dicing position;
Determining the dicing position as the position of that closest line separator.
17. The system according to claim 15, characterized in that the row size
determining unit is further used for:
Taking the size of the first row data record as the average size of the row
data records;
Or,
Taking the size of the last row data record as the average size of the row
data records;
Or,
Randomly selecting the size of one row data record as the average size of the
row data records;
Or,
Calculating the average size of the first N row data records as the average
size of the row data records.
18. The system according to claim 12, characterized in that the writing module
includes:
First format conversion sub-module, for converting, after the cutting of the
multiple file objects is completed, the format of each text data block into an
intermediate state format;
Data writing sub-module, for writing the corresponding text data blocks in the
intermediate state format to the destination using multiple threads, multiple
processes or distributed multiple machines;
Second format conversion sub-module, for converting, at the destination, the
text data blocks in the intermediate state format into the format required by
the destination.
19. The system according to claim 18, characterized in that the text data block
includes one or more row data records, the row data record includes a column
separator, and the first format conversion sub-module includes:
Cutter unit, for cutting, after the cutting of the multiple file objects is
completed, each row data record of each text data block according to the
column separator to obtain one or more column records;
Data type adding unit, for adding a corresponding preset data type to each
column record to obtain the intermediate state format.
20. The system according to claim 19, characterized in that the preset data types include at least one or more of the following types: string STRING, long integer LONG, Boolean BOOLEAN, double-precision floating point DOUBLE, date DATE.
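As a non-limiting illustration of claims 19 and 20, a row data record can be cut into column records and each one tagged with a preset type. The `CONVERTERS` table, the function name `to_intermediate_row`, the comma delimiter, and the `YYYY-MM-DD` date layout are all assumptions for this sketch; the claims name only the five types.

```python
from datetime import date, datetime

# Hypothetical mapping from the preset type names to Python converters.
CONVERTERS = {
    "STRING": str,
    "LONG": int,
    "BOOLEAN": lambda s: {"true": True, "false": False}[s.strip().lower()],
    "DOUBLE": float,
    "DATE": lambda s: datetime.strptime(s.strip(), "%Y-%m-%d").date(),
}

def to_intermediate_row(row, column_types, delimiter=","):
    """Cut one row into column records and attach a preset type to each.

    Returns a list of (type_name, value) pairs. Raises ValueError or
    KeyError when a field cannot be converted to its declared type.
    """
    fields = row.split(delimiter)
    if len(fields) != len(column_types):
        raise ValueError("column count does not match the column format")
    return [(type_name, CONVERTERS[type_name](field))
            for type_name, field in zip(column_types, fields)]
```

For example, `to_intermediate_row("alice,42", ["STRING", "LONG"])` yields the tagged pairs `("STRING", "alice")` and `("LONG", 42)`.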
21. The system according to claim 18, characterized in that it further comprises:
Dirty data generation module, for producing dirty data when conversion of the text data block into the intermediate-state format fails, or when conversion of the intermediate-state text data block into the format required by the destination fails;
Error report generation module, for generating an error report when the dirty data exceeds a preset threshold.
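The dirty data handling of claim 21 might be sketched as below. The class name `DirtyDataCollector`, the interpretation of the threshold as a record count, and the plain-text report layout are assumptions; the claim only requires that failed conversions produce dirty data and that an error report is generated once a preset threshold is exceeded.

```python
class DirtyDataCollector:
    """Collects records that fail conversion (illustrative sketch)."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed to be a count of records
        self.dirty = []

    def convert(self, record, converter):
        """Apply converter; on failure, store the record as dirty data."""
        try:
            return converter(record)
        except (ValueError, KeyError, TypeError) as exc:
            self.dirty.append((record, repr(exc)))
            return None

    def error_report(self):
        """Return a report string only if the threshold is exceeded."""
        if len(self.dirty) <= self.threshold:
            return None
        lines = [f"{len(self.dirty)} dirty records exceed "
                 f"threshold {self.threshold}"]
        lines += [f"  {rec!r}: {err}" for rec, err in self.dirty]
        return "\n".join(lines)
```

The same collector can wrap both conversion stages, so failures on the way into the intermediate-state format and on the way out of it count toward one threshold.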
22. The system according to claim 12 or 13 or 15 or 16 or 17 or 18 or 19 or 20, characterized in that the file object is a file object that supports random reading and writing, and the file object includes at least one or more of the following objects: local file, Open Storage Service (OSS) file, Secure File Transfer Protocol (SFTP) file, Hadoop Distributed File System (HDFS) file.
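Random access is what allows a block to be read by byte range without scanning the file from the start. The following sketch assumes a seekable binary file-like object; `io.BytesIO` stands in for a local, OSS, SFTP, or HDFS handle, and `read_block` is a hypothetical helper name.

```python
import io

def read_block(fileobj, start, end):
    """Read the bytes in [start, end) from a seekable binary file object."""
    fileobj.seek(start)
    return fileobj.read(end - start)

# io.BytesIO is used here only as a stand-in for a real file object.
source = io.BytesIO(b"row1\nrow2\nrow3\n")
blocks = [read_block(source, s, e) for s, e in [(0, 5), (5, 10), (10, 15)]]
```

Each `(start, end)` pair here ends just after a line separator, so every block holds whole rows.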
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510502163.4A CN106469152A (en) | 2015-08-14 | 2015-08-14 | A kind of document handling method based on ETL and system |
PCT/CN2016/093495 WO2017028690A1 (en) | 2015-08-14 | 2016-08-05 | File processing method and system based on etl |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510502163.4A CN106469152A (en) | 2015-08-14 | 2015-08-14 | A kind of document handling method based on ETL and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106469152A true CN106469152A (en) | 2017-03-01 |
Family
ID=58051898
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510502163.4A Pending CN106469152A (en) | 2015-08-14 | 2015-08-14 | A kind of document handling method based on ETL and system |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN106469152A (en) |
WO (1) | WO2017028690A1 (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109299352A (en) * | 2018-11-14 | 2019-02-01 | 百度在线网络技术(北京)有限公司 | The update method of website data, device and search engine in search engine |
CN109408468A (en) * | 2018-08-24 | 2019-03-01 | 阿里巴巴集团控股有限公司 | Document handling method and device calculate equipment and storage medium |
CN110162401A (en) * | 2019-05-24 | 2019-08-23 | 广州中望龙腾软件股份有限公司 | The parallel read method of DWG file, electronic equipment and storage medium |
CN111061927A (en) * | 2018-10-16 | 2020-04-24 | 阿里巴巴集团控股有限公司 | Data processing method and device and electronic equipment |
CN111435346A (en) * | 2019-01-14 | 2020-07-21 | 阿里巴巴集团控股有限公司 | Offline data processing method, device and equipment |
CN114356212A (en) * | 2021-11-23 | 2022-04-15 | 阿里巴巴(中国)有限公司 | Data processing method, system and computer readable storage medium |
CN114584556A (en) * | 2022-03-14 | 2022-06-03 | 中国工商银行股份有限公司 | File transmission method and device |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113177040A (en) * | 2021-04-29 | 2021-07-27 | 东北大学 | Full-process big data cleaning and analyzing method for aluminum/copper plate strip production |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101155296A (en) * | 2006-09-29 | 2008-04-02 | 中国科学技术大学 | Method for transmitting data |
CN102999537A (en) * | 2011-09-19 | 2013-03-27 | 阿里巴巴集团控股有限公司 | System and method for data migration |
CN104699723A (en) * | 2013-12-10 | 2015-06-10 | 北京神州泰岳软件股份有限公司 | Data exchange adapter and system and method for synchronizing data among heterogeneous systems |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101582064B (en) * | 2008-05-15 | 2011-12-21 | 阿里巴巴集团控股有限公司 | Method and system for processing enormous data |
CN103970874A (en) * | 2014-05-14 | 2014-08-06 | 浪潮(北京)电子信息产业有限公司 | Method and device for processing Hadoop files |
CN104376082B (en) * | 2014-11-18 | 2019-06-18 | 中国建设银行股份有限公司 | A method of the data in data source file are imported into database |
CN104615736B (en) * | 2015-02-10 | 2017-10-27 | 上海创景计算机系统有限公司 | Big data fast resolving storage method based on database |
- 2015-08-14: CN application CN201510502163.4A, published as CN106469152A, status Pending
- 2016-08-05: WO application PCT/CN2016/093495, published as WO2017028690A1, status Application Filing
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109408468A (en) * | 2018-08-24 | 2019-03-01 | 阿里巴巴集团控股有限公司 | Document handling method and device calculate equipment and storage medium |
CN111061927A (en) * | 2018-10-16 | 2020-04-24 | 阿里巴巴集团控股有限公司 | Data processing method and device and electronic equipment |
CN111061927B (en) * | 2018-10-16 | 2023-06-20 | 阿里巴巴集团控股有限公司 | Data processing method and device and electronic equipment |
CN109299352A (en) * | 2018-11-14 | 2019-02-01 | 百度在线网络技术(北京)有限公司 | The update method of website data, device and search engine in search engine |
CN109299352B (en) * | 2018-11-14 | 2022-02-01 | 百度在线网络技术(北京)有限公司 | Method and device for updating website data in search engine and search engine |
CN111435346A (en) * | 2019-01-14 | 2020-07-21 | 阿里巴巴集团控股有限公司 | Offline data processing method, device and equipment |
CN111435346B (en) * | 2019-01-14 | 2023-12-19 | 阿里巴巴集团控股有限公司 | Offline data processing method, device and equipment |
CN110162401A (en) * | 2019-05-24 | 2019-08-23 | 广州中望龙腾软件股份有限公司 | The parallel read method of DWG file, electronic equipment and storage medium |
CN114356212A (en) * | 2021-11-23 | 2022-04-15 | 阿里巴巴(中国)有限公司 | Data processing method, system and computer readable storage medium |
CN114584556A (en) * | 2022-03-14 | 2022-06-03 | 中国工商银行股份有限公司 | File transmission method and device |
Also Published As
Publication number | Publication date |
---|---|
WO2017028690A1 (en) | 2017-02-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170301 |