CN106250380B - Custom partitioning method for Hadoop file system data - Google Patents

Custom partitioning method for Hadoop file system data

Info

Publication number
CN106250380B
CN106250380B CN201510320303.6A
Authority
CN
China
Prior art keywords
data
input data
block
file system
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510320303.6A
Other languages
Chinese (zh)
Other versions
CN106250380A (en)
Inventor
亢永敢
赵改善
杨祥森
孙成龙
许自龙
段文超
杨文广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Petroleum and Chemical Corp
Sinopec Geophysical Research Institute
Original Assignee
China Petroleum and Chemical Corp
Sinopec Geophysical Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Petroleum and Chemical Corp, Sinopec Geophysical Research Institute filed Critical China Petroleum and Chemical Corp
Priority to CN201510320303.6A priority Critical patent/CN106250380B/en
Publication of CN106250380A publication Critical patent/CN106250380A/en
Application granted granted Critical
Publication of CN106250380B publication Critical patent/CN106250380B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/11File system administration, e.g. details of archiving or snapshots
    • G06F16/113Details of archiving
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/1858Parallel file systems, i.e. file systems supporting multiple processors

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A custom partitioning method for Hadoop file system data is proposed, comprising: sorting input data; partitioning the sorted input data according to preset data-partition parameters to obtain data blocks, wherein partitioning the sorted input data comprises recording the start position and end position of each data block within the sorted input data in block information corresponding to that data block; and, based on the block information, reading the corresponding data blocks from the sorted input data for parallel processing.

Description

Custom partitioning method for Hadoop file system data
Technical field
The invention belongs to the field of parallel file system data management within computer science, and in particular relates to a custom partitioning method for Hadoop file system data.
Background art
The Hadoop Distributed File System (HDFS) is an open-source counterpart of the Google File System (GFS). It is a fault-tolerant distributed file system suited to deployment on large numbers of inexpensive machines. HDFS provides high-throughput data access and supports the storage of large files, making it well suited to applications on large-scale data sets. HDFS is a subproject of Hadoop; it provides scalable, high-throughput storage of large files for upper-layer Hadoop applications and is the foundation of Hadoop cloud computing.
Fig. 1 is a schematic diagram of the structure of HDFS in the prior art. The basic structure of HDFS follows a master/slave model. An HDFS cluster contains one namenode, a primary server that manages the file namespace and regulates client access to files, together with a number of datanodes, usually one per machine, each of which manages the storage of its node. HDFS exposes the file namespace and allows user data to be stored in the form of files.
Internally, HDFS divides a file into one or more blocks, which are stored on a set of datanodes. The namenode performs namespace operations on files and directories, such as open, close and rename, and determines the mapping of blocks to datanodes. The datanodes serve read and write requests from file system clients; they also create, delete and replicate blocks under instruction from the namenode.
HDFS is designed to support large files, and the programs that run on it are likewise aimed at processing large data sets. Such programs write data once and then issue one or many read requests, and those reads are expected to proceed at streaming speed; that is, HDFS supports a write-once, read-many access model for files. The typical block size in HDFS is 64 MB, so an HDFS file is cut into multiple 64 MB blocks. This fixed partitioning limits the application domains of Hadoop: in prestack seismic migration, for example, one set of input data must be processed under several different partitioning schemes, which the fixed data partitioning of HDFS cannot satisfy.
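As a rough illustration of the fixed partitioning described above, the following Python sketch (hypothetical; HDFS itself is implemented in Java, and this is not its code) computes the byte ranges of the fixed 64 MB blocks of a file. Any record that happens to straddle one of these boundaries is split across blocks regardless of its content, which is exactly the limitation the invention addresses:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # the 64 MB default block size cited in the text

def fixed_block_ranges(file_size, block_size=BLOCK_SIZE):
    """Return the (start, end) byte ranges of the fixed-size blocks of a file."""
    return [(off, min(off + block_size, file_size))
            for off in range(0, file_size, block_size)]

# A 150 MB file is cut into blocks of 64 MB, 64 MB and 22 MB; the cut points
# depend only on byte offsets, never on where a seismic trace begins or ends.
ranges = fixed_block_ranges(150 * 1024 * 1024)
print(len(ranges))  # 3
```

Once the file is imported, these block boundaries are frozen; re-partitioning requires rewriting the data, which motivates the descriptive scheme below.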
Summary of the invention
On the basis of the fixed data partitioning of HDFS, the present invention proposes a descriptive, user-defined partitioning method, realizing custom, descriptive partitioning of data in the HDFS file system. It solves the problem that HDFS adopts a fixed partitioning of the entity data in its data-partition access model and therefore cannot adapt to varying data-access requirements, and it improves the versatility and flexibility of HDFS data file access.
One aspect of the present invention proposes a custom partitioning method for Hadoop file system data, comprising: sorting the input data; partitioning the sorted input data according to preset data-partition parameters to obtain data blocks, wherein partitioning the sorted input data comprises recording the start position and end position of each data block within the sorted input data in block information corresponding to that data block; and, based on the block information, reading the corresponding data blocks from the sorted input data for parallel processing.
According to another embodiment of the present invention, a custom partitioning apparatus for Hadoop file system data is proposed, comprising: a component for sorting the input data; a component for partitioning the sorted input data according to preset data-partition parameters to obtain data blocks, wherein partitioning the sorted input data comprises recording the start position and end position of each data block within the sorted input data in block information corresponding to that data block; and a component for reading, based on the block information, the corresponding data blocks from the sorted input data for parallel processing.
The aspects of the present invention improve the HDFS file-access method, improve the versatility and flexibility of HDFS data file access, and provide a more efficient file storage service for the popularization and application of Hadoop technology.
Brief description of the drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following more detailed description of exemplary embodiments of the disclosure taken in conjunction with the accompanying drawings, in which identical reference labels generally denote identical parts.
Fig. 1 shows a schematic diagram of the structure of HDFS in the prior art.
Fig. 2 shows a flow chart of a custom partitioning method for Hadoop file system data according to an embodiment of the present invention.
Detailed description
Preferred embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the disclosure, it should be appreciated that the disclosure may be realized in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Fig. 2 shows a flow chart of the custom partitioning method for Hadoop file system data according to an embodiment of the present invention. In this embodiment, the method comprises:
Step 201: sort the input data;
Step 202: partition the sorted input data according to preset data-partition parameters to obtain data blocks, wherein partitioning the sorted input data comprises recording the start position and end position of each data block within the sorted input data in block information corresponding to that data block;
Step 203: based on the block information, read the corresponding data blocks from the sorted input data for parallel processing.
In the present embodiment, the sorted input data is partitioned according to preset data-partition parameters and block information is recorded, and data blocks are then read according to that block information. This is a descriptive, user-defined partitioning scheme, distinct from the traditional fixed partitioning of the entity data. It solves the problem that HDFS adopts a fixed partitioning of the entity data in its data-partition access model and cannot adapt to varying data-access requirements, improves the versatility and flexibility of HDFS data file access, and extends the application range of Hadoop cloud computing.
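Steps 201-203 can be sketched as follows in Python (an illustrative model only, not the patent's HDFS implementation; all names are invented for illustration). The key point is that step 202 produces only (start, end) descriptors over the sorted data, which step 203 then uses to read blocks:

```python
def descriptive_partition(records, sort_key, block_size):
    """Sort records (step 201), then describe blocks as (start, end) index
    ranges instead of physically splitting the data (step 202)."""
    ordered = sorted(records, key=sort_key)
    block_info = [(s, min(s + block_size, len(ordered)))
                  for s in range(0, len(ordered), block_size)]
    return ordered, block_info

def read_block(ordered, info):
    """Read one data block from the sorted data via its descriptor (step 203)."""
    start, end = info
    return ordered[start:end]

data = [5, 2, 9, 1, 7, 3]
ordered, info = descriptive_partition(data, sort_key=lambda x: x, block_size=2)
blocks = [read_block(ordered, b) for b in info]
print(blocks)  # [[1, 2], [3, 5], [7, 9]]
```

Because the blocks exist only as descriptors, changing `block_size` (or any other partition parameter) re-partitions the data without touching its storage.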
Data sorting
The purpose of sorting the input data is to supply regularized input data for the subsequent descriptive partitioning, guaranteeing the continuity of the data partitions. It also allows the block information in the subsequent partitioning step to be reduced to the start- and end-position information of the data, making the parallel processing more efficient.
Those skilled in the art will understand that the principle by which the input data is sorted can be chosen freely as needed. In one example, the input data may be sorted according to the characteristics of the parallel processing to be performed (for example, the order in which the parallel processing requires the input data).
In one example, the input data may first be classified before sorting, so that data with the same attribute are gathered together, and then sorted. This processing further achieves an attribute-wise classification and ordering of the data.
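The classify-then-sort preprocessing can be sketched as follows (hypothetical Python; the attribute and sort keys stand in for whatever the parallel processing requires):

```python
from collections import defaultdict

def classify_then_sort(records, attr_key, sort_key):
    """Group records by a shared attribute, then sort within each class,
    so that data with the same attribute is contiguous and ordered."""
    classes = defaultdict(list)
    for rec in records:
        classes[attr_key(rec)].append(rec)
    out = []
    for attr in sorted(classes):                 # deterministic class order
        out.extend(sorted(classes[attr], key=sort_key))
    return out

records = [("b", 3), ("a", 2), ("b", 1), ("a", 9)]
print(classify_then_sort(records, attr_key=lambda r: r[0], sort_key=lambda r: r[1]))
# [('a', 2), ('a', 9), ('b', 1), ('b', 3)]
```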
In one example, the input data of the present embodiment may be data that has been stored after the fixed partitioning of the entity data by the Hadoop file system. That is, this embodiment can be built as a secondary partitioning on top of the original fixed partitioning. Those skilled in the art will understand, however, that the present embodiment can also be used to replace the fixed partitioning in the Hadoop file system.
Data partitioning
The present embodiment partitions the sorted input data according to preset data-partition parameters, realizing user-defined partitioning. In addition, the present embodiment records the start position and end position of each data block within the sorted input data in block information corresponding to that data block, realizing descriptive partitioning. After such custom, descriptive partitioning, each data block corresponds to its block information while the entity data remains unchanged. With the storage of the entity data left untouched, the data can therefore be re-partitioned arbitrarily at any time according to the demands of the parallel processing.
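A minimal sketch of the "entity data remains unchanged" property (hypothetical Python): two runs can describe different partitionings of the same stored data without rewriting anything, because a partitioning is only a list of (start, end) descriptors:

```python
def make_block_info(n_records, max_per_block):
    """Build block descriptors (start, end) over n_records stored items;
    the stored data itself is never touched."""
    return [(s, min(s + max_per_block, n_records))
            for s in range(0, n_records, max_per_block)]

data = list(range(10))                   # the entity data, stored once
run_a = make_block_info(len(data), 4)    # one processing run wants blocks of 4
run_b = make_block_info(len(data), 3)    # a later run wants blocks of 3
print(run_a)  # [(0, 4), (4, 8), (8, 10)]
print(run_b)  # [(0, 3), (3, 6), (6, 9), (9, 10)]
```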
In one example, the partition parameters are the parameters required for partitioning; they can be set freely by the user as needed, so as to satisfy the various partitioning schemes the user requires. That the partition parameters can be set freely is one embodiment of the "user-defined partitioning" the present embodiment achieves.
Parallel processing
In the present embodiment, the corresponding data blocks are read from the sorted input data based on the block information, for parallel processing. This approach is not constrained by the block-wise storage of the entity data, so data blocks can be accessed and processed in real time during a processing run.
In one example, the method may further comprise, before the parallel processing: starting parallel processing units according to the number of data blocks obtained, wherein one parallel processing unit may be started for each data block. After the parallel processing units are started, each unit reads its corresponding data block from the sorted input data based on the block information, for parallel processing.
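One-unit-per-block parallel processing might be sketched as follows, using Python threads as a stand-in for the patent's parallel processing units (an assumption for illustration; a real deployment would use Hadoop tasks):

```python
from concurrent.futures import ThreadPoolExecutor

def process_blocks_in_parallel(ordered, block_info, worker):
    """Start one processing unit per data block; each unit reads its block
    from the sorted data via the recorded (start, end) positions."""
    with ThreadPoolExecutor(max_workers=max(1, len(block_info))) as pool:
        futures = [pool.submit(worker, ordered[s:e]) for s, e in block_info]
        return [f.result() for f in futures]

ordered = [1, 2, 3, 5, 7, 9]
block_info = [(0, 2), (2, 4), (4, 6)]
print(process_blocks_in_parallel(ordered, block_info, worker=sum))
# [3, 8, 16]
```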
Data reduction
In one example, the method of the present embodiment may further comprise a data-reduction step that reduces the results of the parallel processing. The reduction may be performed for each parallel task as it completes, or the results of all parallel tasks may be reduced after they have all completed.
The reduction principle can be determined from the partitioning principle. For example, if the partitioning principle places input data of the same attribute into one data block, the reduction principle may combine the output results of the individual data blocks into one data result. After the reduction is complete, the processing result is output.
Those skilled in the art will understand that, in application scenarios where no reduction operation is necessary, the present embodiment may omit the reduction operation of this example.
Application example
Hereinafter, an application example of the embodiment of the present invention is given, taking seismic gather data as the input data. Those skilled in the art should understand that this application example is intended only to aid understanding of the present invention, and none of its details is intended to limit the invention.
Data sorting
Common seismic gather data include common-shot gather data and common-midpoint (CMP) gather data. A common-shot gather is gather data in which every trace was acquired from the same shot; common-midpoint gather data is gather data in which every trace shares the same midpoint between source and receiver.
Before seismic gather data are processed in parallel, the sorting according to the embodiment of the present invention can be carried out, optionally preceded by classification. In this application example, classification can be realized by gather extraction, i.e., an extraction operation turns the gather data (such as common-shot gather data or CMP gather data) into the required gather form. For example, in an application scenario that mainly needs to solve the partitioning of common-offset data, the required gather form is the common-offset gather. In common-offset gather data, every trace of the gather has the same offset (the distance from the shot to the receiver), so traces with identical offsets can be classified together to form common-offset gather data.
The classified gather data can then be sorted. For the common-offset gathers obtained by classification, the sorting may order the common-offset gather data by the magnitude of the offset value.
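The extraction of common-offset gathers and their ordering by offset can be sketched as follows (hypothetical Python; trace records are simplified to (offset, trace-id) pairs):

```python
def to_common_offset_gathers(traces):
    """Classify traces with identical offsets into one gather, then order
    the gathers by ascending offset value."""
    gathers = {}
    for offset, trace in traces:
        gathers.setdefault(offset, []).append(trace)
    return [(off, gathers[off]) for off in sorted(gathers)]

traces = [(400, "t1"), (200, "t2"), (400, "t3"), (200, "t4"), (600, "t5")]
print(to_common_offset_gathers(traces))
# [(200, ['t2', 't4']), (400, ['t1', 't3']), (600, ['t5'])]
```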
Data partitioning
The sorted seismic gather data can be partitioned according to preset data-partition parameters. In this application example, again taking the partitioning of common-offset data as an example, the data-partition parameters may include, but are not limited to, one or more of: the minimum offset value, the maximum offset value, the offset class interval, and the maximum number of traces in each data block. These parameters can be supplied by the user to determine the partitioning scheme. The start position and end position of each data block within the sorted input data are recorded in block information corresponding to that data block, thereby realizing custom, descriptive partitioning of the seismic gather data.
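Partitioning with the four parameters named above (minimum offset, maximum offset, offset class interval, maximum traces per block) might look as follows. This is one Python interpretation of those parameters, made for illustration, not the patented implementation:

```python
def offset_block_info(traces, min_off, max_off, interval, max_traces):
    """Describe blocks over offset-sorted traces using the four partition
    parameters named in the text (this reading of them is an assumption).
    Traces are (offset, trace_id) pairs; each block is a (start, end)
    index range over the sorted, offset-filtered trace list."""
    ordered = sorted(t for t in traces if min_off <= t[0] <= max_off)
    blocks, start = [], 0
    for i in range(1, len(ordered) + 1):
        end_of_data = i == len(ordered)
        # A block closes when the next trace falls into a new offset class...
        new_class = (not end_of_data and
                     (ordered[i][0] - min_off) // interval !=
                     (ordered[start][0] - min_off) // interval)
        # ...or when it has reached the maximum trace count.
        full = i - start == max_traces
        if end_of_data or new_class or full:
            blocks.append((start, i))
            start = i
    return ordered, blocks

ordered, blocks = offset_block_info(
    [(700, "f"), (100, "a"), (210, "c"), (150, "b"), (320, "e"), (260, "d")],
    min_off=100, max_off=400, interval=100, max_traces=10)
print(blocks)  # one block per 100-unit offset class; the 700 trace is filtered out
```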
It should be noted that, for ease of description, this application example is described using the "common offset" principle as an example. Those skilled in the art will understand, however, that the principle of data partitioning and processing is not limited to "common offset" and can follow whatever principle the data processing actually requires; shot-domain processing, for example, requires the data to be partitioned into shot gathers. In the field of seismic data processing, the main partitioning principles include offset partitioning, CMP-gather partitioning and shot-gather partitioning.
Parallel processing
Based on the block information, the corresponding data blocks can be read from the sorted input seismic gather data for parallel processing. The original processing mode of the Hadoop file system is fixed partitioning: when input data is stored into the file system it has already been split into multiple stored blocks, and this block storage is frozen once the input data is imported. In seismic data processing, however, and especially in prestack migration, the required partitioning of the data varies: on each program run the user may need a different data-partitioning scheme. In this application example, the custom, descriptive block-parallel processing of the embodiment of the present invention solves the problem that Hadoop's fixed block storage cannot adapt to seismic data processing; during processing the data can be partitioned arbitrarily in real time according to the user's definition, meeting the specific demands of seismic data processing.
Data reduction
This application example may also include reduction, and the reduction varies with the partitioning principle. Still taking common-offset processing as an example, the reduction mode can be determined from the offset grouping: the data of the same offset group (i.e., the same data block) can be reduced by stacking, i.e., for the multiple parallel tasks of one offset group the corresponding values of the results are added to produce one result; the data of different offset groups can be reduced by combination, i.e., the results of the parallel tasks for different offset groups are assembled together. For a shot-domain processing program, the final reduction may likewise be a stacking reduction. After the reduction is complete, the processing result is output.
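The two reduction modes just described, stacking within an offset group and combination across groups, can be sketched as follows (hypothetical Python; group labels and sample lists are invented for illustration):

```python
def stack_reduce(group_results):
    """Reduce per-block results: results belonging to the same offset group
    are stacked by adding corresponding sample values; results of different
    groups are simply combined side by side."""
    stacked = {}
    for group, samples in group_results:
        if group in stacked:
            stacked[group] = [a + b for a, b in zip(stacked[group], samples)]
        else:
            stacked[group] = list(samples)
    return stacked

partial = [("off200", [1.0, 2.0]), ("off400", [0.5, 0.5]), ("off200", [3.0, 4.0])]
print(stack_reduce(partial))
# {'off200': [4.0, 6.0], 'off400': [0.5, 0.5]}
```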
According to another embodiment of the present invention, a custom partitioning apparatus for Hadoop file system data is proposed, comprising: a component for sorting the input data; a component for partitioning the sorted input data according to preset data-partition parameters to obtain data blocks, wherein partitioning the sorted input data comprises recording the start position and end position of each data block within the sorted input data in block information corresponding to that data block; and a component for reading, based on the block information, the corresponding data blocks from the sorted input data for parallel processing.
In one example, the apparatus may further comprise a component for classifying the input data before it is sorted, so that data with the same attribute are gathered together.
In one example, the input data may be data stored after the fixed partitioning of the entity data by the Hadoop file system.
In one example, the apparatus may further comprise: a component for starting parallel processing units according to the number of data blocks obtained, wherein one parallel processing unit may be started for each data block; and a component for reading, using the started parallel processing units and based on the block information, the corresponding data blocks from the sorted input data for parallel processing.
In one example, the apparatus may further comprise a component for reducing the processing results of the parallel processing.
In one example, the input data may be seismic gather data.
In one example, sorting the input data may comprise: classifying seismic gather data with identical offsets together to form common-offset gather data; and sorting the common-offset gather data by the magnitude of the offset value.
In one example, partitioning the sorted input data according to preset data-partition parameters may comprise partitioning the sorted common-offset gather data according to one or more of the following data-partition parameters: the minimum offset value, the maximum offset value, the offset class interval, and the maximum number of traces in each data block.
In one example, the apparatus may further comprise a component for adding the corresponding values of the results of the multiple parallel tasks for the same data block, to realize the reduction.
In one example, the apparatus may further comprise a component for combining the results of the parallel tasks for different data blocks, to realize the reduction.
The embodiment of the present invention proposes a custom, descriptive data-partitioning mechanism, realizing custom, descriptive partitioning of data in the HDFS file system. On the basis of the fixed data-partitioning mechanism of HDFS, descriptive data information is used to partition the data in a user-defined way; without changing the storage of the entity data, arbitrary partitioning of the data is achieved. This flexible partitioning extends the data-management capability and application fields of the HDFS file system.
In seismic data processing scenarios, applying the custom, descriptive partitioning mechanism of the embodiment of the present invention enables Hadoop parallel processing of prestack seismic migration and improves the capacity to process massive data.
The present disclosure may be a system, a method and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to carry out aspects of the present disclosure.
The computer-readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer-readable storage medium includes: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer-readable program instructions described herein can be downloaded to respective computing/processing devices from a computer-readable storage medium, or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.
Computer program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages such as the "C" programming language or similar languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to carry out aspects of the present disclosure.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowchart and/or block diagram. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the functions/acts specified in one or more blocks of the flowchart and/or block diagram.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer-implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in one or more blocks of the flowchart and/or block diagram.
The flowchart and block diagrams in the figures illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
The embodiments of the present disclosure have been described above; the foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (8)

1. A custom data partitioning method for Hadoop file system data, comprising:
sorting input data;
partitioning the sorted input data into data blocks according to preset data partitioning parameters, wherein partitioning the sorted input data comprises: recording the start position and the end position of each data block within the sorted input data in block information corresponding to that data block; and
reading, based on the block information, the corresponding data blocks from the sorted input data for parallel processing;
wherein the input data is data stored in the Hadoop file system after the raw data has been partitioned into fixed-size blocks;
wherein partitioning the sorted input data according to the preset data partitioning parameters comprises:
partitioning the sorted common-offset gather data according to one or more of the following data partitioning parameters: a minimum offset value, a maximum offset value, an offset class interval, and a maximum number of traces per data block.
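The partitioning step of claim 1 can be sketched as follows. This is a minimal illustration only, written in Python for clarity rather than against the Hadoop APIs the patent targets; the function name `partition_traces`, the `(offset, trace)` tuple layout, and the block-info dictionary keys are all assumptions, not part of the claimed method. A block is closed whenever the offset class interval boundary or the per-block trace limit is reached, and each block's start and end positions are recorded in its block information:

```python
def partition_traces(traces, min_offset, max_offset, bin_size, max_traces_per_block):
    """Partition sorted (offset, trace) records into data blocks.

    Mirrors the parameters named in claim 1: minimum/maximum offset,
    offset class interval (bin_size), and maximum traces per block.
    Each block's start and end indices are recorded in its block info.
    """
    blocks = []           # block info: {"start": i, "end": j, "bin": k}
    start = None
    current_bin = None
    for i, (offset, _) in enumerate(traces):
        if offset < min_offset or offset > max_offset:
            continue  # trace falls outside the configured offset range
        b = int((offset - min_offset) // bin_size)
        # Close the current block on a bin boundary or trace-count limit.
        if start is None or b != current_bin or i - start >= max_traces_per_block:
            if start is not None:
                blocks.append({"start": start, "end": i - 1, "bin": current_bin})
            start, current_bin = i, b
    if start is not None:
        blocks.append({"start": start, "end": len(traces) - 1, "bin": current_bin})
    return blocks
```

For example, five traces sorted by offset with a class interval of 100 would yield one block per occupied offset bin, each carrying its own start/end record.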
2. The custom data partitioning method for Hadoop file system data according to claim 1, further comprising:
before sorting the input data, classifying the input data so that data with the same attributes are grouped together.
3. The custom data partitioning method for Hadoop file system data according to claim 1, further comprising:
starting parallel processing units according to the number of data blocks obtained, wherein one parallel processing unit can be started for each data block; and
using the started parallel processing units to read, based on the block information, the corresponding data blocks from the sorted input data for parallel processing.
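The per-block parallelism of claim 3 can be illustrated with a short sketch. A Python thread pool stands in here for Hadoop's parallel processing units (in an actual MapReduce job, each block would typically back one map task); the block-info dictionaries with `start`/`end` keys and the `worker` callback are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def process_blocks(data, block_infos, worker):
    """Start one parallel unit per data block, as in claim 3.

    Each unit reads its own block from the sorted input using the
    start/end positions recorded in the block information, then
    applies the worker function to it.
    """
    def run(info):
        block = data[info["start"]:info["end"] + 1]  # read via block info
        return worker(block)

    with ThreadPoolExecutor(max_workers=len(block_infos)) as pool:
        return list(pool.map(run, block_infos))
```

Because each unit only touches the slice described by its block info, the blocks can be processed independently and in any order.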
4. The custom data partitioning method for Hadoop file system data according to claim 1, further comprising:
performing reduction processing on the results of the parallel processing.
5. The custom data partitioning method for Hadoop file system data according to claim 1, wherein
the input data is seismic trace gather data.
6. The custom data partitioning method for Hadoop file system data according to claim 5, wherein sorting the input data comprises:
grouping together seismic trace gather data with identical offsets to form common-offset gather data; and
sorting the common-offset gather data by the magnitude of the offset value.
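The two sorting sub-steps of claim 6 can be sketched as follows, again as an illustrative Python fragment (the trace representation as `(offset, sample)` pairs is an assumption; real seismic traces would carry full headers and sample arrays):

```python
from collections import defaultdict

def build_common_offset_gathers(traces):
    """Group traces with identical offsets into common-offset gathers
    (first sub-step of claim 6), then return the gathers sorted by
    offset value (second sub-step)."""
    gathers = defaultdict(list)
    for offset, sample in traces:
        gathers[offset].append(sample)
    return [(off, gathers[off]) for off in sorted(gathers)]
```

The resulting list is exactly the sorted input that the partitioning step of claim 1 consumes: contiguous runs of equal-offset traces in ascending offset order.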
7. The custom data partitioning method for Hadoop file system data according to claim 5, wherein the corresponding values of the results of multiple parallel processings of the same data block are added together to perform the reduction processing.
8. The custom data partitioning method for Hadoop file system data according to claim 5, wherein the results of the individual parallel processings of different data blocks are combined to perform the reduction processing.
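The two reduction variants of claims 7 and 8 can be sketched side by side. These are minimal illustrations under the assumption that each parallel result is a list of numeric values; the actual reduction in a Hadoop job would run in the reduce phase:

```python
def reduce_same_block(results):
    """Claim 7: add the corresponding values of several parallel
    results computed for the same data block (element-wise sum)."""
    return [sum(vals) for vals in zip(*results)]

def reduce_across_blocks(results):
    """Claim 8: combine (concatenate) the results produced by the
    parallel units that handled different data blocks."""
    combined = []
    for block_result in results:
        combined.extend(block_result)
    return combined
```

Element-wise addition merges redundant computations over one block, while concatenation stitches the per-block outputs back into a single result set.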
CN201510320303.6A 2015-06-12 2015-06-12 The customized method of partition of Hadoop file system data Active CN106250380B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510320303.6A CN106250380B (en) 2015-06-12 2015-06-12 The customized method of partition of Hadoop file system data

Publications (2)

Publication Number Publication Date
CN106250380A CN106250380A (en) 2016-12-21
CN106250380B true CN106250380B (en) 2019-08-23

Family

ID=57626402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510320303.6A Active CN106250380B (en) 2015-06-12 2015-06-12 The customized method of partition of Hadoop file system data

Country Status (1)

Country Link
CN (1) CN106250380B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109655911A (en) * 2017-10-11 2019-04-19 中国石油化工股份有限公司 Seismic data visualization system and method based on WebService
CN110954941B (en) * 2018-09-26 2021-08-24 中国石油化工股份有限公司 Automatic first arrival picking method and system
CN114463962A (en) * 2020-10-21 2022-05-10 中国石油化工股份有限公司 Intelligent node data acquisition method, electronic device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102231155A (en) * 2011-06-03 2011-11-02 中国石油集团川庆钻探工程有限公司地球物理勘探公司 Method for managing and organizing three-dimensional seismic data
CN102508902A (en) * 2011-11-08 2012-06-20 西安电子科技大学 Block size variable data blocking method for cloud storage system
CN103428494A (en) * 2013-08-01 2013-12-04 浙江大学 Image sequence coding and recovering method based on cloud computing platform
WO2014209375A1 (en) * 2013-06-28 2014-12-31 Landmark Graphics Corporation Smart grouping of seismic data in inventory trees

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Feng Xiang, "Research on a Hadoop-based distributed storage strategy for seismic data" (基于Hadoop的地震数据分布式存储策略的研究), China Master's Theses Full-text Database, Information Science and Technology, 2015-02-15, chapters 2-5


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant