CN103699627A - Dummy file parallel data block positioning method based on Hadoop cluster - Google Patents
- Publication number
- CN103699627A CN103699627A CN201310712421.2A CN201310712421A CN103699627A CN 103699627 A CN103699627 A CN 103699627A CN 201310712421 A CN201310712421 A CN 201310712421A CN 103699627 A CN103699627 A CN 103699627A
- Authority
- CN
- China
- Prior art keywords
- file
- data block
- data
- super large
- hadoop
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
Abstract
The invention discloses a parallel data block positioning method for ultra-large files based on a Hadoop cluster, by which multiple data blocks in an ultra-large file can be located in the Hadoop cluster using Map/Reduce software. The method mainly comprises the steps of: building the cluster environment; constructing a sequence file; modifying the user program source code; and calling the user program in streaming mode. By controlling the number and order of the Maps, the mapping relations between Map processes and data blocks are determined, the data blocks in the file can be located, and specified data blocks are processed by specified Map processes, so that the data blocks of an ultra-large file can be processed simply and in parallel.
Description
Technical field
The present invention relates to a parallel data block positioning method for ultra-large files (over one hundred GB) on a cluster based on Hadoop (a software framework for the distributed processing of massive data), and belongs to the field of big data processing.
Background art
In the field of high-performance parallel computing, the MapReduce framework for massive data processing is widely applied. With clusters of cheap commodity computers it delivers large-scale data computing power that formerly only expensive large servers possessed, and it outperforms traditional high-performance computing schemes in stability, scalability, and other respects. The MapReduce model is now applied to astronomical computation, mass storage analysis, virus database storage, web retrieval services, and similar areas, resolving the contradiction between explosive data growth and insufficient computer storage and computing capacity. In actual development, programming languages vary widely; the Streaming technology allows programs written in any programming language to be used in Hadoop MapReduce, which makes it convenient to port existing programs to the Hadoop platform and greatly reduces porting cost.
HDFS (the Hadoop Distributed File System) features high fault tolerance. It stores data dispersed across many machines in the form of one or more replicas, can store massive data with high reliability, and provides fast, scalable access to the data, suiting a write-once, read-many access pattern. A file on HDFS is divided into multiple blocks of the block size, each serving as an independent storage unit; the system default block size is 64 MB, and the user can also customize the block size.
In high-performance computing there is a class of problems that repeatedly process the same ultra-large file: each pass processes one contiguous segment of data in the large file starting at a different offset, and the computations are mutually independent, with no dependency relations among them. When this computation model is ported to the Hadoop platform, the storage model generally adopts the HDFS file system and the computation model generally adopts Hadoop Streaming, so the port can be done quickly with no, or very few, source code changes. Each map process in Hadoop then processes one contiguous segment of the large file starting at a different offset; this model requires the developer to control both the number of map processes and the offset, within the large file, of the data each map process handles, so that multiple map tasks can locate multiple data blocks in the ultra-large file in parallel.
Generally the number of map processes is determined by the size of the input file and the HDFS block size, that is, by the number of blocks the input file occupies in HDFS; by default it cannot be directly controlled or interfered with. The Hadoop API provides a corresponding interface, org.apache.hadoop.mapred.JobConf.setNumMapTasks(int n), which can influence the number of map processes, but the official documentation glosses this function with "Note: This is only a hint to the framework" — that is, the setNumMapTasks() method is only a hint to the Hadoop framework and does not play a decisive role. In other words, even if this value is set, it does not necessarily produce the desired effect.
Although a single map process can access a file at any offset, even when the file is stored in the HDFS file system with its blocks distributed across different nodes, the same does not hold for multiple map processes: on the one hand, the system provides no corresponding interface to distinguish individual map processes; on the other hand, in many situations the offset addresses of the data each process must handle within the large file are irregular. Under these circumstances it is infeasible to specify, for each map process separately, the offset of the data it should process in the large file.
In summary, it is infeasible to control the number of map processes, take the large file directly as the program input, and have each map process accurately locate, by itself, the offset of its own data block within the large file.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the deficiencies of the prior art and provide a parallel data block positioning method for ultra-large files based on a Hadoop cluster. By constructing a sequence file to control the number of map processes and the way the map processes operate, different map processes are each made to process one contiguous segment of data at a different offset.
The technical solution of the present invention is as follows:
A parallel data block positioning method for ultra-large files based on a Hadoop cluster comprises the following steps:
Step 1: set up a Hadoop cluster by building the Hadoop environment and configuring the key HDFS properties;
Step 2: construct a specific sequence file. The content of the sequence file is integer numbers, one integer per line; the value of each line is the offset, within the ultra-large file, of the data block to be processed by one map process, and the number of lines of the sequence file equals both the number of map processes and the number of data blocks to be processed.
The sequence file can be constructed by any one of the following methods:
(a) use Microsoft Office Excel to generate the specific sequence one value per row, then copy it into a text file;
(b) in the command mode of the vim editor, type commands to generate the specific sequence;
(c) manually produce the required irregular sequence, or generate the specific sequence with a batch file or script file.
Step 3: modify the user program so that it can receive the data passed by the standard input stream, convert the data into an integer, and use the integer to set, in the program, the start address of the data block to be read from the ultra-large file;
Step 4: in the streaming mode of the Hadoop cluster, invoke the modified user program of step 3 together with the sequence file of step 2 to complete the parallel data block positioning; the positioning is accomplished by combining the data block offsets of step 2 with the start address of the ultra-large file set in step 3.
Each computer node in the cluster has its own independent CPU, memory, local hard disk, and operating system.
The computer nodes in the cluster are interconnected by Ethernet or InfiniBand (a cable-switching interconnect technology supporting many concurrent links).
Compared with the prior art, the beneficial effects of the present invention are:
(1) The present invention adopts streaming technology: a program only has to follow the standard input/output format to meet Hadoop's requirements, so a stand-alone program can be used on the cluster after only minor revision, which improves working efficiency and eases testing.
(2) The streaming technology adopted by the present invention supports non-Java languages; in practical engineering the most suitable development language can be chosen according to the project's needs and then run on the Hadoop platform through streaming, so execution by this method is more efficient.
(3) Constructing a sequence file to precisely control the number of map processes can further improve execution efficiency and control cluster load balancing.
(4) The present invention uses a sequence file to position within the large file, so that each map process handles data at a different start position in the large file; the control is simple and easy to operate.
Brief description of the drawings
Fig. 1 is the flow diagram of the method of the present invention for parallel data block positioning in ultra-large files based on a Hadoop cluster.
Embodiment
The specific embodiment of the present invention is described in further detail below in conjunction with the accompanying drawing.
By controlling the quantity and order of the Maps, the present invention determines the mapping relations between map processes and multiple data blocks, can locate the position of each data block within the file, and lets a specified Map process handle a specified data block. As shown in Fig. 1, the concrete steps of the present invention are as follows:
Step 1: set up a Hadoop cluster by building the Hadoop environment and configuring the key HDFS properties;
The Hadoop environment is deployed on 4 computer nodes: 1 computer node is the name node (namenode) and the other 3 are data nodes (datanodes) that share the storage task; the HDFS replica count is 3. Each computer node in the cluster has its own independent CPU, memory, local hard disk, and operating system, and the nodes are interconnected by Ethernet or InfiniBand (a cable-switching interconnect technology supporting many concurrent links).
The key HDFS properties include:
(a) configure the default file system, defining the host system name and the port on which the name node operates;
(b) configure the directory list in which the name node stores its permanent metadata;
(c) configure the directory list in which the data nodes store data blocks;
(d) configure the replica count for data stored on the data nodes.
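In modern Hadoop configuration files the four properties above correspond to entries like the following (a sketch only; the host name, port, and directory paths are illustrative assumptions, not values given in this specification):

```xml
<!-- core-site.xml: (a) default file system = host system name + namenode port -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode-host:9000</value>
</property>

<!-- hdfs-site.xml: (b) namenode permanent metadata directory list -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/hdfs/name</value>
</property>

<!-- (c) datanode block storage directory list -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/hdfs/data</value>
</property>

<!-- (d) replica count (3 in this embodiment) -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```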
Step 2: construct a specific sequence file. The content of the sequence file is integer numbers, one integer per line; the value of each line is the offset, within the ultra-large file, of the data block to be processed by one map process, and the number of lines of the sequence file equals both the number of map processes and the number of data blocks to be processed.
The sequence file can be constructed by any one of the following methods:
(a) use Microsoft Office Excel to generate the specific sequence one value per row, then copy it into a text file;
(b) in the command mode of the vim editor, type commands to generate the specific sequence;
(c) manually produce the required irregular sequence, or generate the specific sequence with a batch file or script file.
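As a concrete illustration of method (c), a short script can emit one offset per line. This is a sketch, not part of the patent: the file name `offsets.txt`, the 64 MB spacing, and the count of 4 blocks are illustrative assumptions.

```python
# Generate a sequence file: one integer per line, each integer being the
# byte offset in the ultra-large file of the data block that one map
# process will handle. The line count equals the number of map processes.
BLOCK_SIZE = 64 * 1024 * 1024   # HDFS default block size (64 MB)
NUM_BLOCKS = 4                  # number of map processes / data blocks

with open("offsets.txt", "w") as f:
    for i in range(NUM_BLOCKS):
        f.write("%d\n" % (i * BLOCK_SIZE))
```

Irregular offsets, such as those mentioned in the background section, can be emitted the same way from any list of integers.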
Step 3: modify the user program so that it can receive the data passed by the standard input stream, convert the data into an integer, and use the integer to set, in the program, the start address of the data block to be read from the ultra-large file;
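The modification described in step 3 can be sketched as a streaming mapper in Python. This is an illustrative sketch: the function names, the tab-splitting of `NLineInputFormat` key/value lines, and the fixed read length are assumptions, not details fixed by the patent.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # bytes one map process handles (assumption)

def parse_offset(line):
    """Convert one standard-input line into an integer offset (step 3).
    NLineInputFormat hands streaming mappers "key<TAB>value" lines, so the
    last tab-separated field is kept; a bare integer line also works."""
    return int(line.strip().split("\t")[-1])

def process_block(path, offset, length=BLOCK_SIZE):
    """Set the start address from the offset, then read one contiguous block."""
    with open(path, "rb") as f:
        f.seek(offset)          # the offset from the sequence file
        return f.read(length)

# In the real mapper, offsets arrive on standard input from Hadoop streaming:
#   import sys
#   for line in sys.stdin:
#       data = process_block("/path/to/bigfile", parse_offset(line))
#       ...independent processing of this block...
```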
Step 4: in the streaming mode of the Hadoop cluster (streaming: the technology for calling the map and reduce functions from non-Java languages), invoke the modified user program of step 3 together with the sequence file of step 2 to complete the parallel data block positioning. Positioning is accomplished by combining the data block offsets in the sequence file of step 2 with the start address set in step 3; the input option of the Hadoop job is set to the sequence file generated in step 2, and the inputformat option is set so that the number of map processes is determined by the number of lines of the input sequence file.
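The invocation in step 4 corresponds to a streaming command along these lines. Everything below — the jar path, file names, mapper name, and output directory — is a placeholder, not taken from the patent; `NLineInputFormat` with one line per map is a standard way to make the map count equal the sequence file's line count, which is what the inputformat option here requires.

```python
# Assemble the Hadoop streaming command line for the positioning job.
offsets_file = "offsets.txt"   # sequence file from step 2 (placeholder name)
mapper = "mapper.py"           # modified user program from step 3 (placeholder)

cmd = [
    "hadoop", "jar", "hadoop-streaming.jar",
    # one map task per line of the sequence file:
    "-D", "mapreduce.input.lineinputformat.linespermap=1",
    "-inputformat", "org.apache.hadoop.mapred.lib.NLineInputFormat",
    "-input", offsets_file,    # the -input option is the sequence file
    "-output", "out",          # placeholder output directory
    "-mapper", mapper,
    "-file", mapper,
    "-numReduceTasks", "0",    # positioning needs only the map side
]
print(" ".join(cmd))
```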
Content not described in detail in this specification belongs to technology well known to those skilled in the art.
Claims (3)
1. A parallel data block positioning method for ultra-large files based on a Hadoop cluster, characterized by comprising the following steps:
Step 1: set up a Hadoop cluster by building the Hadoop environment and configuring the key HDFS properties;
Step 2: construct a specific sequence file; the content of the sequence file is integer numbers, one integer per line; the value of each line is the offset, within the ultra-large file, of the data block to be processed by one map process, and the number of lines of the sequence file equals both the number of map processes and the number of data blocks to be processed;
the sequence file being constructed by any one of the following methods:
(a) use Microsoft Office Excel to generate the specific sequence one value per row, then copy it into a text file;
(b) in the command mode of the vim editor, type commands to generate the specific sequence file;
(c) manually generate the required irregular sequence file;
(d) generate the specific sequence file with a batch file or script file;
Step 3: modify the user program so that it can receive the data passed by the standard input stream, convert the data into an integer, and use the integer to set, in the program, the start address of the data block to be read from the ultra-large file;
Step 4: in the streaming mode of the Hadoop cluster, invoke the modified user program of step 3 together with the sequence file of step 2 to complete the parallel data block positioning, the positioning being accomplished by parallelizing the positioning program through the streaming mode of the Hadoop cluster and by combining the data block offsets in the sequence file of step 2 with the start address of the ultra-large file set in step 3.
2. The parallel data block positioning method for ultra-large files based on a Hadoop cluster according to claim 1, characterized in that each computer node in the cluster has its own independent CPU, memory, local hard disk, and operating system.
3. The parallel data block positioning method for ultra-large files based on a Hadoop cluster according to claim 1, characterized in that the computer nodes in the cluster are interconnected by Ethernet or InfiniBand.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310712421.2A CN103699627B (en) | 2013-12-20 | 2013-12-20 | A kind of super large file in parallel data block localization method based on Hadoop clusters |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310712421.2A CN103699627B (en) | 2013-12-20 | 2013-12-20 | A kind of super large file in parallel data block localization method based on Hadoop clusters |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103699627A true CN103699627A (en) | 2014-04-02 |
CN103699627B CN103699627B (en) | 2017-03-15 |
Family
ID=50361155
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310712421.2A Active CN103699627B (en) | 2013-12-20 | 2013-12-20 | A kind of super large file in parallel data block localization method based on Hadoop clusters |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103699627B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105808354A (en) * | 2016-03-10 | 2016-07-27 | 西北大学 | Method for establishing temporary Hadoop environment by utilizing WLAN (Wireless Local Area Network) |
CN106339473A (en) * | 2016-08-29 | 2017-01-18 | 北京百度网讯科技有限公司 | Method and device for copying file |
CN108874897A (en) * | 2018-05-23 | 2018-11-23 | 新华三大数据技术有限公司 | Data query method and device |
CN110851399A (en) * | 2019-09-22 | 2020-02-28 | 苏州浪潮智能科技有限公司 | Method and system for optimizing file data block transmission efficiency of distributed file system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102332027A (en) * | 2011-10-15 | 2012-01-25 | 西安交通大学 | Mass non-independent small file associated storage method based on Hadoop |
US20130311480A1 (en) * | 2012-04-27 | 2013-11-21 | International Business Machines Corporation | Sensor data locating |
2013
- 2013-12-20 CN CN201310712421.2A patent/CN103699627B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102332027A (en) * | 2011-10-15 | 2012-01-25 | 西安交通大学 | Mass non-independent small file associated storage method based on Hadoop |
US20130311480A1 (en) * | 2012-04-27 | 2013-11-21 | International Business Machines Corporation | Sensor data locating |
Non-Patent Citations (3)
Title |
---|
GUIBIN: "In-depth analysis of how to control the number of Maps in Hadoop", http://blog.csdn.net/strongerbit/article/details/7440111 * |
SUPERCHARLES: "Data blocks in Hadoop HDFS and splits of Map tasks", http://www.linuxidc.com/Linux/201205/ * |
WANG Yongzhou et al.: "A data placement strategy in HDFS", Computer Technology and Development * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105808354A (en) * | 2016-03-10 | 2016-07-27 | 西北大学 | Method for establishing temporary Hadoop environment by utilizing WLAN (Wireless Local Area Network) |
CN106339473A (en) * | 2016-08-29 | 2017-01-18 | 北京百度网讯科技有限公司 | Method and device for copying file |
CN108874897A (en) * | 2018-05-23 | 2018-11-23 | 新华三大数据技术有限公司 | Data query method and device |
CN108874897B (en) * | 2018-05-23 | 2019-09-13 | 新华三大数据技术有限公司 | Data query method and device |
CN110851399A (en) * | 2019-09-22 | 2020-02-28 | 苏州浪潮智能科技有限公司 | Method and system for optimizing file data block transmission efficiency of distributed file system |
CN110851399B (en) * | 2019-09-22 | 2022-11-25 | 苏州浪潮智能科技有限公司 | Method and system for optimizing file data block transmission efficiency of distributed file system |
Also Published As
Publication number | Publication date |
---|---|
CN103699627B (en) | 2017-03-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ikeda et al. | Provenance for Generalized Map and Reduce Workflows. | |
CN111324610A (en) | Data synchronization method and device | |
CN102708203A (en) | Database dynamic management method based on XML metadata | |
CN106471501A (en) | The method of data query, the storage method data system of data object | |
CN103810272A (en) | Data processing method and system | |
CN110795499A (en) | Cluster data synchronization method, device and equipment based on big data and storage medium | |
CN105447051A (en) | Database operation method and device | |
CN102799679A (en) | Hadoop-based massive spatial data indexing updating system and method | |
CN104536987A (en) | Data query method and device | |
CN103699627A (en) | Dummy file parallel data block positioning method based on Hadoop cluster | |
CN106055678A (en) | Hadoop-based panoramic big data distributed storage method | |
KR101790766B1 (en) | Method, device and terminal for data search | |
Singh et al. | Spatial data analysis with ArcGIS and MapReduce | |
CN103501341A (en) | Method and device for establishing Web service | |
Barkhordari et al. | Atrak: a MapReduce-based data warehouse for big data | |
CN103809915B (en) | The reading/writing method of a kind of disk file and device | |
US11698911B2 (en) | System and methods for performing updated query requests in a system of multiple database engine | |
Li et al. | Research of distributed database system based on Hadoop | |
JP2014041501A (en) | Fast reading method for batch processing target data and batch management system | |
CN106484379B (en) | A kind of processing method and processing device of application | |
Xu et al. | A unified computation engine for big data analytics | |
Pan et al. | An open sharing pattern design of massive power big data | |
Gao et al. | On the power of combiner optimizations in mapreduce over MPI workflows | |
Xu et al. | A PaaS based metadata-driven ETL framework | |
Han et al. | Design and Implementation of Big Data Management Platform for Android Applications |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |