CN103699627A - Dummy file parallel data block positioning method based on Hadoop cluster - Google Patents

Dummy file parallel data block positioning method based on Hadoop cluster

Info

Publication number
CN103699627A
Authority
CN
China
Prior art keywords
file
data block
data
super large
hadoop
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310712421.2A
Other languages
Chinese (zh)
Other versions
CN103699627B (en)
Inventor
孙彦猛
苏丽
刘文俊
张博为
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aerospace Long March Launch Vehicle Technology Co Ltd
Beijing Institute of Telemetry Technology
Original Assignee
Aerospace Long March Launch Vehicle Technology Co Ltd
Beijing Institute of Telemetry Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aerospace Long March Launch Vehicle Technology Co Ltd, Beijing Institute of Telemetry Technology filed Critical Aerospace Long March Launch Vehicle Technology Co Ltd
Priority to CN201310712421.2A priority Critical patent/CN103699627B/en
Publication of CN103699627A publication Critical patent/CN103699627A/en
Application granted granted Critical
Publication of CN103699627B publication Critical patent/CN103699627B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Abstract

The invention discloses a parallel data block positioning method for super-large files based on a Hadoop cluster, with which multiple data blocks in a super-large file can be located in the Hadoop cluster using Map/Reduce software. The method mainly comprises the steps of building the cluster environment, constructing a sequence file, modifying the user program source code, and calling the user program in streaming mode. By controlling the number and order of the Map tasks, the mapping relations between Map processes and data blocks are determined, the data blocks in the file can be located, specified data blocks are processed by specified Map processes, and the data blocks of a super-large file can be processed in parallel simply and conveniently.

Description

A super-large file parallel data block positioning method based on a Hadoop cluster
Technical field
The present invention relates to a parallel data block positioning method for super-large files (over one hundred GB) based on a Hadoop cluster (Hadoop being a software framework for the distributed processing of massive data), and belongs to the field of big data processing.
Background technology
In the field of high-performance parallel computing, the MapReduce framework for processing massive data is widely used. With a cluster of inexpensive commodity computers it provides large-scale computing power that previously only expensive large servers could offer, and it outperforms traditional high-performance computing schemes in stability and scalability. The MapReduce model is now applied to astronomical computation, mass-storage analysis, virus database storage, web retrieval services and other areas, easing the contradiction between the explosive growth of data and the shortage of computer storage and computing capacity. In actual development, many programming languages are used; the Streaming technique allows programs written in any programming language to be used in Hadoop MapReduce, which makes it convenient to port existing programs to the Hadoop platform and greatly reduces the porting cost.
The HDFS (Hadoop Distributed File System) of Hadoop is highly fault tolerant. It stores data dispersed across many machines in the form of one or more replicas, can store massive data with high reliability, and provides fast, scalable access to the data; it is suited to a write-once, read-many access pattern. A file on HDFS is divided into multiple blocks of the block size, each serving as an independent storage unit; the default block size is 64 MB, and the user can also set the block size.
In high-performance computing there is a class of problems in which the same super-large file is processed repeatedly: each run processes one continuous segment of the large file starting at a different offset, and the individual computations are independent of each other, with no dependency relationship. When this computing model is ported to the Hadoop platform, the storage model generally adopts the HDFS file system and the computing model generally adopts Hadoop Streaming, so the program can be ported quickly with no, or only minimal, changes to the source code. Each map process in Hadoop then processes one continuous segment of the large file starting at a different offset. This model requires the developer to control the number of map processes and the offset, within the large file, of the data each map process handles, so that multiple map tasks can locate multiple data blocks of the super-large file in parallel.
Normally the number of map processes is determined by the size of the input file and the HDFS block size, i.e. by the number of blocks the input file occupies in HDFS, and under the default conditions it cannot be directly controlled or interfered with. The Hadoop API provides a corresponding interface, org.apache.hadoop.mapred.JobConf.setNumMapTasks(int n), which can influence the number of map processes, but the official documentation explains this function with "Note: This is only a hint to the framework"; that is, the setNumMapTasks() method is merely a hint to the Hadoop framework and does not play a decisive role. In other words, even if this value is set, it will not necessarily produce the desired effect.
Although a single map process can access a file at any offset, even when the file is stored in the HDFS file system with its blocks distributed over different nodes, the situation is different for multiple map processes: on the one hand, the system provides no interface for distinguishing the individual map processes; on the other hand, in many cases the offsets within the large file of the data each process must handle are irregular. Under the present circumstances it is therefore infeasible to specify separately, for each map process, the offset of the data block it should process in the large file.
In summary, it is infeasible to control the number of map processes, feed the large file directly as the program input, and have each map process accurately locate the offset of the data block it must process in the large file.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the deficiencies of the prior art and provide a super-large file parallel data block positioning method based on a Hadoop cluster, which controls the number of map processes and the way the map processes operate by constructing a sequence file, thereby achieving the effect that different map processes each process a continuous segment of data starting at a different offset.
The technical solution of the present invention is as follows:
A super-large file parallel data block positioning method based on a Hadoop cluster comprises the following steps:
Step 1: set up a Hadoop cluster by building the Hadoop environment and configuring the key HDFS attributes;
Step 2: construct a specific sequence file; the content of the specific sequence file is integer numbers, each integer occupying one line; the value of each line in the sequence file is the offset, within the super-large file, of the data block to be processed by one map process, and the number of lines in the sequence file equals the number of map processes and the number of data blocks to be processed;
The sequence file is constructed by any one of the following methods:
(a) use Microsoft Office Excel to generate the specific sequence by rows, then copy it by rows into a text file;
(b) under the vim editor command mode, type commands to generate the specific sequence;
(c) manually generate the required irregular sequence, or generate the specific sequence by means of a batch file or script file;
Step 3: modify the user program so that it can receive data passed on the standard input stream, convert the data into an integer, and use this integer to set, in the program, the start address of the data block to be read from the super-large file;
Step 4: in the streaming mode of the Hadoop cluster, call the user program modified in step (3) together with the sequence file from step (2) to complete the positioning of the parallel data blocks; the positioning of the parallel data is completed by means of the data block offsets from step (2) and the start address within the super-large file from step (3).
Each computer node in the cluster has an independent CPU, memory, local hard disk and operating system.
Each computer node in the cluster is interconnected via Ethernet or InfiniBand (a "switched cable" technology supporting multiple concurrent links).
Compared with the prior art, the beneficial effects of the present invention are:
(1) The present invention adopts the streaming technique, so a program only needs to follow the standard input/output format to meet Hadoop's requirements; a stand-alone program can therefore be used on the cluster after only minor modification, which raises working efficiency and is convenient for testing.
(2) The streaming technique adopted by the present invention supports non-Java languages; in a practical project the most suitable development language can be chosen according to the engineering requirements and then run on the Hadoop platform through the streaming technique, so the method achieves higher execution efficiency.
(3) The sequence file constructed by the present invention accurately controls the number of map processes, which improves execution efficiency more effectively and controls the load balancing of the cluster.
(4) The present invention uses the sequence file to realize positioning within the large file, so that each map process handles data starting at a different position in the large file; the control is simple and easy to operate.
Brief description of the drawings
Fig. 1 is a flow chart of the super-large file parallel data block positioning method based on a Hadoop cluster according to the present invention.
Embodiment
The specific embodiments of the present invention are described below in further detail with reference to the accompanying drawing.
By controlling the number and order of the Map tasks, the present invention determines the mapping relations between map processes and multiple data blocks, can locate the positions of the data blocks in the file, and allows a specified Map process to process a specified data block. As shown in Fig. 1, the concrete steps of the present invention are as follows:
Step 1: set up a Hadoop cluster by building the Hadoop environment and configuring the key HDFS attributes;
The Hadoop environment is deployed on 4 computer nodes, of which 1 computer node is the name node (namenode) and the other 3 are data nodes (datanode) that share the storage tasks; the number of HDFS stored copies is 3. Each computer node in the cluster has an independent CPU, memory, local hard disk and operating system; the computer nodes in the cluster are interconnected via Ethernet or InfiniBand (a "switched cable" technology supporting multiple concurrent links).
The key HDFS attributes include the following (an illustrative configuration sketch is given after this list):
(a) configure the default file system, defining the host system name and the port number on which the name node works;
(b) configure the list of directories in which the name node stores the permanent metadata;
(c) configure the list of directories in which the data nodes store the data blocks;
(d) configure the number of replicas for the data stored on the data nodes;
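Purely as an illustration (the host name, port number and local paths below are assumptions, not part of the invention), attributes (a) to (d) correspond to configuration entries of the following kind in a Hadoop 1.x style installation, placed in core-site.xml and hdfs-site.xml:

    <!-- core-site.xml: default file system, i.e. host name and port of the name node (attribute (a)) -->
    <property><name>fs.default.name</name><value>hdfs://namenode-host:9000</value></property>

    <!-- hdfs-site.xml: directory list for the name node's permanent metadata (attribute (b)) -->
    <property><name>dfs.name.dir</name><value>/data/dfs/name</value></property>

    <!-- hdfs-site.xml: directory list in which the data nodes store blocks (attribute (c)) -->
    <property><name>dfs.data.dir</name><value>/data/dfs/data</value></property>

    <!-- hdfs-site.xml: number of replicas kept for each block (attribute (d)) -->
    <property><name>dfs.replication</name><value>3</value></property>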
Step 2: construct a specific sequence file; the content of the specific sequence file is integer numbers, each integer occupying one line; the value of each line in the sequence file is the offset, within the super-large file, of the data block to be processed by one map process, and the number of lines in the sequence file equals the number of map processes and the number of data blocks to be processed;
The sequence file is constructed by any one of the following methods (a small generation sketch is given after this list):
(a) use Microsoft Office Excel to generate the specific sequence by rows, then copy it by rows into a text file;
(b) under the vim editor command mode, type commands to generate the specific sequence;
(c) manually generate the required irregular sequence, or generate the specific sequence by means of a batch file or script file;
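As an example of method (c), a small script can write one offset per line; the sketch below is illustrative only and assumes evenly spaced 64 MB offsets and the file name offsets.txt, whereas in practice the offsets may be irregular and the file name is arbitrary:

    # generate_offsets.py - illustrative sketch only; the spacing, count and file name are assumptions
    BLOCK = 64 * 1024 * 1024      # example spacing of 64 MB between data blocks
    NUM_BLOCKS = 8                # one line per map process / data block to be processed

    with open("offsets.txt", "w") as f:
        for i in range(NUM_BLOCKS):
            f.write("%d\n" % (i * BLOCK))   # each line holds the byte offset of one data block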
Step 3: modify the user program so that it can receive the data passed on the standard input stream, convert the data into an integer, and use this integer to set, in the program, the start address of the data block to be read from the super-large file;
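A minimal sketch of such a modified user program, written here in Python as a Hadoop Streaming mapper, is given below; the path of the super-large file and the block length are assumptions introduced only for illustration and would be replaced by the user's own values:

    #!/usr/bin/env python
    # mapper.py - illustrative sketch of the user program modified as in step 3
    import sys

    BIG_FILE = "/shared/superlarge.dat"   # assumed path of the super-large file (e.g. on a shared file system)
    BLOCK_LEN = 64 * 1024 * 1024          # assumed length of the continuous data segment to process

    for line in sys.stdin:                       # each map process receives its line(s) of the sequence file
        value = line.strip().split("\t")[-1]     # streaming may prepend a key and a tab; keep the last field
        if not value:
            continue
        offset = int(value)                      # convert the text passed on standard input to an integer
        with open(BIG_FILE, "rb") as f:
            f.seek(offset)                       # position at the start address of the data block
            block = f.read(BLOCK_LEN)            # read the continuous segment this map process handles
        # ... user-specific processing of 'block' would go here ...
        sys.stdout.write("%d\t%d\n" % (offset, len(block)))   # emit a key/value pair as the map output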
Step 4: in the streaming mode of the Hadoop cluster (streaming is the technique that allows the map and reduce functions to be written and called in non-Java languages), call the user program modified in step (3) together with the sequence file from step (2) to complete the positioning of the parallel data blocks. The positioning of the parallel data is completed by means of the data block offsets from step (2) and the start address within the super-large file from step (3): the input option of the Hadoop cluster is set to the sequence file generated in step (2), and the inputformat option is set so that the number of map processes is determined by the number of lines of the input sequence file.
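One possible invocation of this kind is sketched below; it assumes a Hadoop 1.x style installation and uses NLineInputFormat so that each line of the sequence file drives exactly one map process (the jar location, HDFS paths and file names are illustrative only and not part of the invention):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -D mapred.line.input.format.linespermap=1 \
        -D mapred.reduce.tasks=0 \
        -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
        -input /user/hadoop/offsets.txt \
        -output /user/hadoop/result \
        -mapper "python mapper.py" \
        -file mapper.py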
The content not described in detail in the specification of the present invention belongs to the techniques well known to those skilled in the art.

Claims (3)

1. A super-large file parallel data block positioning method based on a Hadoop cluster, characterized by comprising the following steps:
Step 1: set up a Hadoop cluster by building the Hadoop environment and configuring the key HDFS attributes;
Step 2: construct a specific sequence file; the content of the specific sequence file is integer numbers, each integer occupying one line; the value of each line in the sequence file is the offset, within the super-large file, of the data block to be processed by one map process, and the number of lines in the sequence file equals the number of map processes and the number of data blocks to be processed;
The sequence file is constructed by any one of the following methods:
(a) use Microsoft Office Excel to generate the specific sequence by rows, then copy it by rows into a text file;
(b) under the vim editor command mode, type commands to generate the specific sequence file;
(c) manually generate the required irregular sequence file;
(d) generate the specific sequence file by means of a batch file or script file;
Step 3: modify the user program so that it can receive the data passed by the standard input stream, convert the data into an integer, and use this integer to set, in the program, the start address of the data block to be read from the super-large file;
Step 4: in the streaming mode of the Hadoop cluster, call the user program modified in step (3) together with the sequence file from step (2) to complete the positioning of the parallel data blocks; the positioning of the parallel data is completed by using the streaming mode of the Hadoop cluster to parallelize the positioning program, and by means of the data block offsets in the sequence file of step (2) and the start address within the super-large file of step (3).
2. The super-large file parallel data block positioning method based on a Hadoop cluster according to claim 1, characterized in that each computer node in the cluster has an independent CPU, memory, local hard disk and operating system.
3. The super-large file parallel data block positioning method based on a Hadoop cluster according to claim 1, characterized in that the computer nodes in the cluster are interconnected via Ethernet or InfiniBand.
CN201310712421.2A 2013-12-20 2013-12-20 A kind of super large file in parallel data block localization method based on Hadoop clusters Active CN103699627B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310712421.2A CN103699627B (en) 2013-12-20 2013-12-20 A kind of super large file in parallel data block localization method based on Hadoop clusters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310712421.2A CN103699627B (en) 2013-12-20 2013-12-20 A kind of super large file in parallel data block localization method based on Hadoop clusters

Publications (2)

Publication Number Publication Date
CN103699627A true CN103699627A (en) 2014-04-02
CN103699627B CN103699627B (en) 2017-03-15

Family

ID=50361155

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310712421.2A Active CN103699627B (en) 2013-12-20 2013-12-20 A kind of super large file in parallel data block localization method based on Hadoop clusters

Country Status (1)

Country Link
CN (1) CN103699627B (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102332027A (en) * 2011-10-15 2012-01-25 西安交通大学 Mass non-independent small file associated storage method based on Hadoop
US20130311480A1 (en) * 2012-04-27 2013-11-21 International Business Machines Corporation Sensor data locating

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GUIBIN: "In-depth analysis of how to control the number of Maps in Hadoop" (in Chinese), HTTP://BLOG.CSDN.NET/STRONGERBIT/ARTICLE/DETAILS/7440111 *
SUPERCHARLES: "Data blocks in Hadoop HDFS and the splits of Map tasks" (in Chinese), HTTP://WWW.LINUXIDC.COM/LINUX/201205/ *
WANG Yongzhou et al.: "A data placement strategy in HDFS" (in Chinese), Computer Technology and Development *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808354A (en) * 2016-03-10 2016-07-27 西北大学 Method for establishing temporary Hadoop environment by utilizing WLAN (Wireless Local Area Network)
CN106339473A (en) * 2016-08-29 2017-01-18 北京百度网讯科技有限公司 Method and device for copying file
CN108874897A (en) * 2018-05-23 2018-11-23 新华三大数据技术有限公司 Data query method and device
CN108874897B (en) * 2018-05-23 2019-09-13 新华三大数据技术有限公司 Data query method and device
CN110851399A (en) * 2019-09-22 2020-02-28 苏州浪潮智能科技有限公司 Method and system for optimizing file data block transmission efficiency of distributed file system
CN110851399B (en) * 2019-09-22 2022-11-25 苏州浪潮智能科技有限公司 Method and system for optimizing file data block transmission efficiency of distributed file system

Also Published As

Publication number Publication date
CN103699627B (en) 2017-03-15

Similar Documents

Publication Publication Date Title
Ikeda et al. Provenance for Generalized Map and Reduce Workflows.
CN111324610A (en) Data synchronization method and device
CN102708203A (en) Database dynamic management method based on XML metadata
CN106471501A (en) The method of data query, the storage method data system of data object
CN103810272A (en) Data processing method and system
CN110795499A (en) Cluster data synchronization method, device and equipment based on big data and storage medium
CN105447051A (en) Database operation method and device
CN102799679A (en) Hadoop-based massive spatial data indexing updating system and method
CN104536987A (en) Data query method and device
CN103699627A (en) Dummy file parallel data block positioning method based on Hadoop cluster
CN106055678A (en) Hadoop-based panoramic big data distributed storage method
KR101790766B1 (en) Method, device and terminal for data search
Singh et al. Spatial data analysis with ArcGIS and MapReduce
CN103501341A (en) Method and device for establishing Web service
Barkhordari et al. Atrak: a MapReduce-based data warehouse for big data
CN103809915B (en) The reading/writing method of a kind of disk file and device
US11698911B2 (en) System and methods for performing updated query requests in a system of multiple database engine
Li et al. Research of distributed database system based on Hadoop
JP2014041501A (en) Fast reading method for batch processing target data and batch management system
CN106484379B (en) A kind of processing method and processing device of application
Xu et al. A unified computation engine for big data analytics
Pan et al. An open sharing pattern design of massive power big data
Gao et al. On the power of combiner optimizations in mapreduce over MPI workflows
Xu et al. A PaaS based metadata-driven ETL framework
Han et al. Design and Implementation of Big Data Management Platform for Android Applications

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant