CN103699627B - Parallel data block positioning method for very large files based on a Hadoop cluster - Google Patents

Parallel data block positioning method for very large files based on a Hadoop cluster

Info

Publication number
CN103699627B
CN103699627B CN201310712421.2A CN201310712421A
Authority
CN
China
Prior art keywords
file
data block
super large
large file
hadoop clusters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310712421.2A
Other languages
Chinese (zh)
Other versions
CN103699627A (en)
Inventor
孙彦猛
苏丽
刘文俊
张博为
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aerospace Long March Launch Vehicle Technology Co Ltd
Beijing Institute of Telemetry Technology
Original Assignee
Aerospace Long March Launch Vehicle Technology Co Ltd
Beijing Institute of Telemetry Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aerospace Long March Launch Vehicle Technology Co Ltd and Beijing Institute of Telemetry Technology
Priority to CN201310712421.2A
Publication of CN103699627A
Application granted
Publication of CN103699627B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a parallel data block positioning method for very large files based on a Hadoop cluster: a method that uses the Map/Reduce software of a Hadoop cluster to position multiple data blocks within a very large file. The main steps are setting up the cluster environment, constructing a sequence file, modifying the user program source code, and invoking the user program in streaming mode. By controlling the number and order of the Map processes, the invention determines the mapping between Map processes and data blocks, locates each data block's position within the file, and lets a designated Map process handle a designated data block; in this way parallel processing of the multiple data blocks of a very large file is achieved with ease.

Description

Parallel data block positioning method for very large files based on a Hadoop cluster
Technical field
The present invention relates to a parallel data block positioning method for very large files (over one hundred GB) based on a Hadoop cluster (Hadoop being a software framework for the distributed processing of massive data), and belongs to the field of big data processing.
Background technology
In the field of high-performance parallel computing the MapReduce framework for massive data processing is widely applied: a cluster of inexpensive commodity computers can provide the large-scale data computing capability that previously only expensive large servers offered, and it is superior to traditional high-performance computing schemes in stability and scalability. The MapReduce model is now applied to ephemeris computation, mass storage analysis, virus signature storage, web retrieval services and similar areas, relieving the contradiction between the explosive growth of data and the insufficient storage and computing capacity of computers. In actual development the programming languages in use are varied, and the Streaming technique allows developers to use programs written in any language within Hadoop MapReduce, which makes it convenient to port existing programs to the Hadoop platform and greatly reduces porting cost.
HDFS (the Hadoop Distributed File System) is highly fault tolerant: data are stored across multiple machines in the form of one or more replicas, so the system can hold massive amounts of data with high reliability and provides fast, scalable access, suited to a write-once, read-many access pattern. A file on HDFS is divided into multiple blocks that serve as independent storage units; the system default block size is 64 MB, and the user may also specify the block size.
In high-performance computing there is a class of problems in which the same very large file is processed repeatedly: each pass processes a contiguous segment of data starting at a different offset within the large file, and the passes are mutually independent, with no dependency between them. When porting this computation model to the Hadoop platform, the storage model typically uses the HDFS file system and the computation model typically uses Hadoop Streaming, so the program can be ported quickly with little or no change to its source code. Each map process in Hadoop then handles a contiguous segment of the large file starting at a different offset; this model requires the developer to control the number of map processes and the offset, within the large file, of the data each map process handles, so that multiple map tasks position the multiple data blocks of the very large file in parallel.
Normally the number of map processes is determined by the input file size and the HDFS block size, i.e. by the number of blocks the input file occupies in HDFS; by default it cannot be controlled directly. The Hadoop API provides a corresponding interface, org.apache.hadoop.mapred.JobConf.setNumMapTasks(int n), which can influence the number of map processes, but the official documentation annotates this function with "Note: This is only a hint to the framework", meaning that setNumMapTasks() is merely a hint to the Hadoop framework and is not decisive. In other words, even if this value is set, the actual number of map processes may differ from it.
Although a single map process can access any offset within a file, even when that file is stored in the HDFS file system with its blocks distributed across different nodes, the situation is different for multiple map processes: on the one hand the system provides no interface to distinguish the individual map processes, and on the other hand the offset addresses of the data to be processed in each pass are in many cases irregular within the large file. Under these circumstances it is therefore infeasible to specify separately, for each map process, the offset within the large file of the data it is to process.
In summary, with the existing mechanisms it is infeasible to control the number of map processes, use the large file directly as the program input, and have each map process accurately locate, on its own, the offset within the large file of the data block it needs to process.
Content of the invention
The technical problem to be solved by the present invention is: to overcome the deficiencies of the prior art and provide a parallel data block positioning method for very large files based on a Hadoop cluster which, through a method of controlling the number of map processes by constructing a sequence file and a method of operating the map processes, achieves the effect that different map processes each process a contiguous segment of data starting at a different offset.
The technical solution of the present invention is as follows:
A parallel data block positioning method for very large files based on a Hadoop cluster comprises the following steps:
Step 1: set up the Hadoop cluster by building the Hadoop environment and configure the key HDFS attributes;
Step 2: construct a specific sequence file; the content of the sequence file consists of integers, one integer per line, where the value on each line is the offset of the data block, within the very large file, that the corresponding map process is to handle, and the number of lines in the sequence file equals both the number of map processes and the number of data blocks to be processed;
The sequence file is constructed by any one of the following methods:
(a) generate the specific sequence using Microsoft Office Excel and copy it, row by row, into a text file;
(b) in vim editor command mode, type a command that generates the specific sequence;
(c) manually produce the required irregular sequence, or generate the specific sequence with a batch file or script file;
Step 3: modify the user program so that it can receive the data passed on the standard input stream, convert that data into an integer, and use the integer in the program as the start address of the data block to be read from the very large file;
Step 4: invoke, in the streaming mode of the Hadoop cluster, the user program modified in step (3) together with the sequence file of step (2) to complete the parallel data block positioning; the positioning is achieved by combining the data offsets from step (2) with the start address of the very large file from step (3).
Each computer node in the cluster has its own CPU, memory, local hard disk and operating system.
The computer nodes in the cluster are interconnected by Ethernet or InfiniBand (a "switched cable" technology supporting multiple concurrent links).
Compared with the prior art, the present invention has the following advantages:
(1) With the streaming technique, the present invention only requires programming that follows the standard input/output format to meet Hadoop's requirements, so a stand-alone program can run on the cluster after only slight changes, which improves working efficiency and makes testing easy.
(2) The streaming technique adopted by the present invention supports non-Java languages, so in a practical project the most suitable development language can be chosen according to the engineering requirements and the program can then run on the Hadoop platform through streaming, yielding higher execution efficiency.
(3) By constructing the sequence file, the present invention precisely controls the number of map processes, which improves execution efficiency and helps balance the load across the cluster.
(4) The present invention uses the sequence file to carry out the positioning within the large file, so that each map process handles data starting at a different position in the large file; the control is simple and easy to operate.
Description of the drawings
Fig. 1 is a flow chart of the parallel data block positioning method for very large files based on a Hadoop cluster according to the present invention.
Specific embodiment
The specific embodiment of the present invention is described in further detail below with reference to the accompanying drawing.
By controlling the number and order of the Map processes, the present invention determines the mapping between the Map processes and the multiple data blocks, locates each data block's position within the file, and lets a designated Map process handle a designated data block. As shown in Fig. 1, the specific steps of the present invention are as follows:
Step 1: set up the Hadoop cluster by building the Hadoop environment and configure the key HDFS attributes;
The Hadoop environment is deployed on 4 computer nodes, of which 1 is the name node (namenode) and the other 3 are data nodes (datanode) that share the storage task; the HDFS replica count is 3. Each computer node in the cluster has its own CPU, memory, local hard disk and operating system, and the nodes are interconnected by Ethernet or InfiniBand (a "switched cable" technology supporting multiple concurrent links).
The key HDFS attributes include the following (a hedged configuration sketch follows this list):
(a) configure the default file system, i.e. define the host name and the port number on which the name node operates;
(b) configure the list of directories in which the name node stores its persistent metadata;
(c) configure the list of directories in which the data nodes store data blocks;
(d) configure the number of replicas kept for the data stored on the data nodes;
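For illustration only, the following is a minimal sketch of writing these four attributes into Hadoop site files from Python. The property names fs.defaultFS, dfs.namenode.name.dir, dfs.datanode.data.dir and dfs.replication are the standard Hadoop 2.x configuration keys assumed to correspond to items (a)-(d); the host name and directory paths are placeholders, since the patent names no concrete keys or paths.

```python
# Hedged sketch: the patent does not name concrete configuration keys, so the
# property names below are the standard Hadoop 2.x keys assumed to correspond
# to items (a)-(d) above; host name and paths are placeholders.
import xml.etree.ElementTree as ET

def write_site_xml(path, properties):
    """Write a Hadoop *-site.xml file from a dict of property name -> value."""
    conf = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.ElementTree(conf).write(path, encoding="utf-8", xml_declaration=True)

# (a) default file system: name node host and working port (hypothetical host)
write_site_xml("core-site.xml", {"fs.defaultFS": "hdfs://namenode-host:9000"})

# (b)-(d) name node metadata dirs, data node block dirs, replica count of 3
write_site_xml("hdfs-site.xml", {
    "dfs.namenode.name.dir": "/data/hadoop/name",
    "dfs.datanode.data.dir": "/data/hadoop/data",
    "dfs.replication": "3",
})
```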
Step 2: construct a specific sequence file; the content of the sequence file consists of integers, one integer per line, where the value on each line is the offset of the data block, within the very large file, that the corresponding map process is to handle, and the number of lines in the sequence file equals both the number of map processes and the number of data blocks to be processed;
The sequence file is constructed by any one of the following methods (a minimal generation sketch follows this list):
(a) generate the specific sequence using Microsoft Office Excel and copy it, row by row, into a text file;
(b) in vim editor command mode, type a command that generates the specific sequence;
(c) manually produce the required irregular sequence, or generate the specific sequence with a batch file or script file;
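As one concrete realization of the script-based method (c), the following minimal Python sketch generates a sequence file of offsets. The 64 MB block size, the block count and the file name offsets.txt are assumed example values; in practice the offsets may be any (even irregular) positions at which the map processes should start reading.

```python
# Hedged sketch of method (c): generate the sequence file with a script.
# BLOCK_SIZE, NUM_BLOCKS and the output file name are example assumptions.
BLOCK_SIZE = 64 * 1024 * 1024   # bytes per data block to be processed
NUM_BLOCKS = 8                  # also the number of map processes

with open("offsets.txt", "w") as seq:
    for i in range(NUM_BLOCKS):
        # one integer per line: the byte offset of the i-th data block
        seq.write(f"{i * BLOCK_SIZE}\n")
```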
Step 3: modify the user program so that it can receive the data passed on the standard input stream, convert that data into an integer, and use the integer in the program as the start address of the data block to be read from the very large file;
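A minimal sketch of such a modified user program (the streaming mapper) is given below, under assumptions not stated in the patent: the very large file is reachable on every node via a path passed as a command-line argument, each map process handles a fixed-length segment (SEGMENT_SIZE) starting at the received offset, and the actual processing of the block is application specific.

```python
#!/usr/bin/env python
# Hedged sketch of the modified user program (streaming mapper). Assumptions
# not taken from the patent: the very large file is reachable on every node
# via the path given as argv[1], and each map process handles a fixed-length
# segment (SEGMENT_SIZE) starting at the offset it receives on stdin.
import sys

SEGMENT_SIZE = 64 * 1024 * 1024  # assumed length of the segment each map handles

def main():
    big_file_path = sys.argv[1]
    for line in sys.stdin:               # Hadoop Streaming feeds one record per line
        line = line.strip()
        if not line:
            continue
        # The framework may prepend a key and a tab; keep only the last
        # tab-separated field, which is the offset read from the sequence file.
        offset = int(line.split("\t")[-1])
        with open(big_file_path, "rb") as f:
            f.seek(offset)               # position at the data block's start address
            block = f.read(SEGMENT_SIZE)
        # application-specific processing of `block` would go here; emit a
        # key/value line so the framework has output to collect
        sys.stdout.write(f"{offset}\t{len(block)}\n")

if __name__ == "__main__":
    main()
```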
Step 4: invoke, through the streaming mode of the Hadoop cluster (streaming lets the map and reduce functions be written in a non-Java language), the user program modified in step (3) together with the sequence file of step (2) to complete the parallel data block positioning; the positioning is achieved by combining the data offsets from step (2) with the start address of the very large file from step (3). The input option of the Hadoop job is set to the sequence file, and the inputformat option is set so that the number of map processes is determined by the number of lines of the input sequence file.
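For illustration, one way such a streaming invocation could be issued is sketched below in Python; the streaming jar path, the HDFS paths, the mapper file name and the use of NLineInputFormat (so that each line of the sequence file drives one map task) are assumptions rather than details given by the patent.

```python
# Hedged sketch: launching the modified user program (mapper.py above) through
# Hadoop Streaming. The jar path, HDFS paths and the choice of NLineInputFormat
# are assumptions; the patent only states that the input is the sequence file
# and that the map count follows its line count.
import subprocess

cmd = [
    "hadoop", "jar", "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar",
    "-D", "mapred.reduce.tasks=0",                    # map-only job
    "-D", "mapred.line.input.format.linespermap=1",   # one line -> one map task
    "-inputformat", "org.apache.hadoop.mapred.lib.NLineInputFormat",
    "-input", "/user/demo/offsets.txt",               # the sequence file on HDFS
    "-output", "/user/demo/out",
    "-mapper", "python mapper.py /shared/bigfile.dat",  # modified program + big-file path
    "-file", "mapper.py",                             # ship the mapper to the nodes
]
subprocess.check_call(cmd)
```

With an invocation of this kind the number of map tasks follows the number of lines in the sequence file, matching the relationship stated in step 2.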
Content not described in detail in this specification belongs to techniques well known to those skilled in the art.

Claims (3)

1. A parallel data block positioning method for very large files based on a Hadoop cluster, characterized in that it comprises the following steps:
Step 1: set up the Hadoop cluster by building the Hadoop environment and configure the key HDFS attributes;
Step 2: construct a specific sequence file; the content of the specific sequence file consists of integers, one integer per line, where the value on each line is the offset of the data block, within the very large file, that the corresponding map process is to handle, and the number of lines in the sequence file equals both the number of map processes and the number of data blocks to be processed;
The sequence file is constructed by any one of the following methods:
(a) generate the specific sequence using Microsoft Office Excel and copy it, row by row, into a text file;
(b) in vim editor command mode, type a command that generates the specific sequence file;
(c) manually generate the required irregular sequence file;
(d) generate the specific sequence file with a batch file or script file;
Step 3: modify the user program so that it can receive the data passed on the standard input stream, convert that data into an integer, and use the integer in the program as the start address of the data block to be read from the very large file;
Step 4: invoke, in the streaming mode of the Hadoop cluster, the user program modified in step 3 together with the sequence file of step 2 to complete the parallel data block positioning; the method of completing the parallel data block positioning is: parallelize the positioning program by using the streaming mode of the Hadoop cluster, and complete the data positioning from the very-large-file offsets in the sequence file of step 2 and the start address of the very large file in step 3.
2. The parallel data block positioning method for very large files based on a Hadoop cluster as claimed in claim 1, characterized in that each computer node in the cluster has its own CPU, memory, local hard disk and operating system.
3. The parallel data block positioning method for very large files based on a Hadoop cluster as claimed in claim 1, characterized in that the computer nodes in the cluster are interconnected by Ethernet or InfiniBand.
CN201310712421.2A 2013-12-20 2013-12-20 Parallel data block positioning method for very large files based on a Hadoop cluster Active CN103699627B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310712421.2A CN103699627B (en) 2013-12-20 2013-12-20 Parallel data block positioning method for very large files based on a Hadoop cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310712421.2A CN103699627B (en) 2013-12-20 2013-12-20 Parallel data block positioning method for very large files based on a Hadoop cluster

Publications (2)

Publication Number Publication Date
CN103699627A CN103699627A (en) 2014-04-02
CN103699627B true CN103699627B (en) 2017-03-15

Family

ID=50361155

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310712421.2A Active CN103699627B (en) 2013-12-20 2013-12-20 Parallel data block positioning method for very large files based on a Hadoop cluster

Country Status (1)

Country Link
CN (1) CN103699627B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808354B (en) * 2016-03-10 2019-02-15 西北大学 The method for setting up interim Hadoop environment using wlan network
CN106339473A (en) * 2016-08-29 2017-01-18 北京百度网讯科技有限公司 Method and device for copying file
CN108874897B (en) * 2018-05-23 2019-09-13 新华三大数据技术有限公司 Data query method and device
CN110851399B (en) * 2019-09-22 2022-11-25 苏州浪潮智能科技有限公司 Method and system for optimizing file data block transmission efficiency of distributed file system


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103377269B (en) * 2012-04-27 2016-12-28 国际商业机器公司 Sensing data localization method and device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102332027A (en) * 2011-10-15 2012-01-25 西安交通大学 Mass non-independent small file associated storage method based on Hadoop

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Data blocks in Hadoop HDFS and the splitting of Map tasks; supercharles; http://www.linuxidc.com/Linux/201205/; 2012-05-26; pp. 1-2 *
A data placement strategy in HDFS; Wang Yongzhou et al.; Computer Technology and Development; 2013-05-31; vol. 23, no. 5; pp. 90-92, 96 *
In-depth analysis of how to control the number of Maps in Hadoop; guibin; http://blog.csdn.net/strongerbit/article/details/7440111; 2012-04-09; pp. 1-3 *

Also Published As

Publication number Publication date
CN103699627A (en) 2014-04-02

Similar Documents

Publication Publication Date Title
Dayal et al. Flexpath: Type-based publish/subscribe system for large-scale science analytics
Zheng et al. Scaling embedded in-situ indexing with deltaFS
Abbasi et al. Extending i/o through high performance data services
CN103699627B (en) Parallel data block positioning method for very large files based on a Hadoop cluster
CN106570113B (en) Mass vector slice data cloud storage method and system
CN105706092A (en) Methods and systems of four-valued simulation
Ferraro Petrillo et al. Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics
CN104536987A (en) Data query method and device
US20240004853A1 (en) Virtual data source manager of data virtualization-based architecture
Singh et al. Spatial data analysis with ArcGIS and MapReduce
US8949255B1 (en) Methods and apparatus for capture and storage of semantic information with sub-files in a parallel computing system
Zhang et al. Towards optimized scheduling for data‐intensive scientific workflow in multiple datacenter environment
Thao Nguyen et al. Efficient MPI‐AllReduce for large‐scale deep learning on GPU‐clusters
CN104299170B (en) Intermittent energy source mass data processing method
WO2022061878A1 (en) Blockchain transaction processing systems and methods
US11960616B2 (en) Virtual data sources of data virtualization-based architecture
US11263026B2 (en) Software plugins of data virtualization-based architecture
Serbanescu et al. Architecture of distributed data aggregation service
Yuan et al. Dynamic data replication based on local optimization principle in data grid
Wang Distributed Machine Learning with Python: Accelerating model training and serving with distributed systems
Poyraz et al. Application-specific I/O optimizations on petascale supercomputers
Gao et al. On the power of combiner optimizations in mapreduce over MPI workflows
Lin et al. Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data-and Compute-intensive
CN106484379B (en) A kind of processing method and processing device of application
Yan et al. CADRE: A Cloud-Based Data Service for Big Bibliographic Data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant