CN103699627B - Parallel data block positioning method for very large files based on a Hadoop cluster - Google Patents

Parallel data block positioning method for very large files based on a Hadoop cluster

Info

Publication number
CN103699627B
CN103699627B CN201310712421.2A CN201310712421A
Authority
CN
China
Prior art keywords
file
data block
super large
large file
hadoop clusters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310712421.2A
Other languages
Chinese (zh)
Other versions
CN103699627A (en)
Inventor
孙彦猛
苏丽
刘文俊
张博为
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aerospace Long March Launch Vehicle Technology Co Ltd
Beijing Institute of Telemetry Technology
Original Assignee
Aerospace Long March Launch Vehicle Technology Co Ltd
Beijing Institute of Telemetry Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aerospace Long March Launch Vehicle Technology Co Ltd and Beijing Institute of Telemetry Technology
Priority to CN201310712421.2A
Publication of CN103699627A
Application granted
Publication of CN103699627B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a parallel data block positioning method for very large files based on a Hadoop cluster: a method that uses the Map/Reduce software of a Hadoop cluster to position multiple data blocks within a very large file. The main steps are setting up the cluster environment, constructing a sequence file, modifying the user program source code, and invoking the user program in streaming mode. By controlling the number and order of the Map processes, the invention determines the mapping between Map processes and data blocks, locates each data block's position within the file, and lets a designated Map process handle a designated data block; in this way parallel processing of the multiple data blocks of a very large file is achieved with ease.

Description

Parallel data block positioning method for very large files based on a Hadoop cluster
Technical field
The present invention relates to a parallel data block positioning method for very large files (over one hundred GB) based on a Hadoop cluster (Hadoop being a software framework for the distributed processing of massive data), and belongs to the field of big data processing.
Background technology
In the field of high-performance parallel computing the MapReduce framework for massive data processing is widely applied: a cluster of inexpensive commodity computers can provide the large-scale data computing capability that previously only expensive large servers offered, and it is superior to traditional high-performance computing schemes in stability and scalability. The MapReduce model is now applied to ephemeris computation, mass storage analysis, virus signature storage, web retrieval services and similar areas, relieving the contradiction between the explosive growth of data and the insufficient storage and computing capacity of computers. In actual development the programming languages in use are varied, and the Streaming technique allows developers to use programs written in any language within Hadoop MapReduce, which makes it convenient to port existing programs to the Hadoop platform and greatly reduces porting cost.
HDFS (the Hadoop Distributed File System) is highly fault tolerant: data are stored across multiple machines in the form of one or more replicas, so the system can hold massive amounts of data with high reliability and provides fast, scalable access, suited to a write-once, read-many access pattern. A file on HDFS is divided into multiple blocks that serve as independent storage units; the system default block size is 64 MB, and the user may also specify the block size.
In high-performance computing there is a class of problems in which the same very large file is processed repeatedly: each pass processes a contiguous segment of data starting at a different offset within the large file, and the passes are mutually independent, with no dependency between them. When porting this computation model to the Hadoop platform, the storage model typically uses the HDFS file system and the computation model typically uses Hadoop Streaming, so the program can be ported quickly with little or no change to its source code. Each map process in Hadoop then handles a contiguous segment of the large file starting at a different offset; this model requires the developer to control the number of map processes and the offset, within the large file, of the data each map process handles, so that multiple map tasks position the multiple data blocks of the very large file in parallel.
Normally the number of map processes is determined by the input file size and the HDFS block size, i.e. by the number of blocks the input file occupies in HDFS; by default it cannot be controlled directly. The Hadoop API provides a corresponding interface, org.apache.hadoop.mapred.JobConf.setNumMapTasks(int n), which can influence the number of map processes, but the official documentation annotates this function with "Note: This is only a hint to the framework", meaning that setNumMapTasks() is merely a hint to the Hadoop framework and is not decisive. In other words, even if this value is set, the actual number of map processes may differ from it.
Although a single map process can access any offset within a file, even when that file is stored in the HDFS file system with its blocks distributed across different nodes, the situation is different for multiple map processes: on the one hand the system provides no interface to distinguish the individual map processes, and on the other hand the offset addresses of the data to be processed in each pass are in many cases irregular within the large file. Under these circumstances it is therefore infeasible to specify separately, for each map process, the offset within the large file of the data it is to process.
In summary, with the existing mechanisms it is infeasible to control the number of map processes, use the large file directly as the program input, and have each map process accurately locate, on its own, the offset within the large file of the data block it needs to process.
Content of the invention
The technical problem to be solved by the present invention is: to overcome the deficiencies of the prior art and provide a parallel data block positioning method for very large files based on a Hadoop cluster which, through a method of controlling the number of map processes by constructing a sequence file and a method of operating the map processes, achieves the effect that different map processes each process a contiguous segment of data starting at a different offset.
The technical solution of the present invention is as follows:
A parallel data block positioning method for very large files based on a Hadoop cluster comprises the following steps:
Step 1: set up the Hadoop cluster by building the Hadoop environment and configure the key HDFS attributes;
Step 2: construct a specific sequence file; the content of the sequence file consists of integers, one integer per line, where the value on each line is the offset of the data block, within the very large file, that the corresponding map process is to handle, and the number of lines in the sequence file equals both the number of map processes and the number of data blocks to be processed;
The sequence file is constructed by any one of the following methods:
(a) generate the specific sequence using Microsoft Office Excel and copy it, row by row, into a text file;
(b) in vim editor command mode, type a command that generates the specific sequence;
(c) manually produce the required irregular sequence, or generate the specific sequence with a batch file or script file;
Step 3: modify the user program so that it can receive the data passed on the standard input stream, convert that data into an integer, and use the integer in the program as the start address of the data block to be read from the very large file;
Step 4: invoke, in the streaming mode of the Hadoop cluster, the user program modified in step (3) together with the sequence file of step (2) to complete the parallel data block positioning; the positioning is achieved by combining the data offsets from step (2) with the start address of the very large file from step (3).
Each computer node in the cluster has its own CPU, memory, local hard disk and operating system.
The computer nodes in the cluster are interconnected by Ethernet or InfiniBand (a "switched cable" technology supporting multiple concurrent links).
Compared with the prior art, the present invention has the following advantages:
(1) With the streaming technique, the present invention only requires programming that follows the standard input/output format to meet Hadoop's requirements, so a stand-alone program can run on the cluster after only slight changes, which improves working efficiency and makes testing easy.
(2) The streaming technique adopted by the present invention supports non-Java languages, so in a practical project the most suitable development language can be chosen according to the engineering requirements and the program can then run on the Hadoop platform through streaming, yielding higher execution efficiency.
(3) By constructing the sequence file, the present invention precisely controls the number of map processes, which improves execution efficiency and helps balance the load across the cluster.
(4) The present invention uses the sequence file to carry out the positioning within the large file, so that each map process handles data starting at a different position in the large file; the control is simple and easy to operate.
Description of the drawings
Fig. 1 is a flow chart of the parallel data block positioning method for very large files based on a Hadoop cluster according to the present invention.
Specific embodiment
The specific embodiment of the present invention is described in further detail below with reference to the accompanying drawing.
By controlling the number and order of the Map processes, the present invention determines the mapping between the Map processes and the multiple data blocks, locates each data block's position within the file, and lets a designated Map process handle a designated data block. As shown in Fig. 1, the specific steps of the present invention are as follows:
Step 1: set up the Hadoop cluster by building the Hadoop environment and configure the key HDFS attributes;
The Hadoop environment is deployed on 4 computer nodes, of which 1 is the name node (namenode) and the other 3 are data nodes (datanode) that share the storage task; the HDFS replica count is 3. Each computer node in the cluster has its own CPU, memory, local hard disk and operating system, and the nodes are interconnected by Ethernet or InfiniBand (a "switched cable" technology supporting multiple concurrent links).
The key HDFS attributes include the following (a hedged configuration sketch follows this list):
(a) configure the default file system, i.e. define the host name and the port number on which the name node operates;
(b) configure the list of directories in which the name node stores its persistent metadata;
(c) configure the list of directories in which the data nodes store data blocks;
(d) configure the number of replicas kept for the data stored on the data nodes;
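For illustration only, the following is a minimal sketch of writing these four attributes into Hadoop site files from Python. The property names fs.defaultFS, dfs.namenode.name.dir, dfs.datanode.data.dir and dfs.replication are the standard Hadoop 2.x configuration keys assumed to correspond to items (a)-(d); the host name and directory paths are placeholders, since the patent names no concrete keys or paths.

```python
# Hedged sketch: the patent does not name concrete configuration keys, so the
# property names below are the standard Hadoop 2.x keys assumed to correspond
# to items (a)-(d) above; host name and paths are placeholders.
import xml.etree.ElementTree as ET

def write_site_xml(path, properties):
    """Write a Hadoop *-site.xml file from a dict of property name -> value."""
    conf = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.ElementTree(conf).write(path, encoding="utf-8", xml_declaration=True)

# (a) default file system: name node host and working port (hypothetical host)
write_site_xml("core-site.xml", {"fs.defaultFS": "hdfs://namenode-host:9000"})

# (b)-(d) name node metadata dirs, data node block dirs, replica count of 3
write_site_xml("hdfs-site.xml", {
    "dfs.namenode.name.dir": "/data/hadoop/name",
    "dfs.datanode.data.dir": "/data/hadoop/data",
    "dfs.replication": "3",
})
```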
Step 2: construct a specific sequence file; the content of the sequence file consists of integers, one integer per line, where the value on each line is the offset of the data block, within the very large file, that the corresponding map process is to handle, and the number of lines in the sequence file equals both the number of map processes and the number of data blocks to be processed;
The sequence file is constructed by any one of the following methods (a minimal generation sketch follows this list):
(a) generate the specific sequence using Microsoft Office Excel and copy it, row by row, into a text file;
(b) in vim editor command mode, type a command that generates the specific sequence;
(c) manually produce the required irregular sequence, or generate the specific sequence with a batch file or script file;
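As one concrete realization of the script-based method (c), the following minimal Python sketch generates a sequence file of offsets. The 64 MB block size, the block count and the file name offsets.txt are assumed example values; in practice the offsets may be any (even irregular) positions at which the map processes should start reading.

```python
# Hedged sketch of method (c): generate the sequence file with a script.
# BLOCK_SIZE, NUM_BLOCKS and the output file name are example assumptions.
BLOCK_SIZE = 64 * 1024 * 1024   # bytes per data block to be processed
NUM_BLOCKS = 8                  # also the number of map processes

with open("offsets.txt", "w") as seq:
    for i in range(NUM_BLOCKS):
        # one integer per line: the byte offset of the i-th data block
        seq.write(f"{i * BLOCK_SIZE}\n")
```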
Step 3: modify the user program so that it can receive the data passed on the standard input stream, convert that data into an integer, and use the integer in the program as the start address of the data block to be read from the very large file;
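A minimal sketch of such a modified user program (the streaming mapper) is given below, under assumptions not stated in the patent: the very large file is reachable on every node via a path passed as a command-line argument, each map process handles a fixed-length segment (SEGMENT_SIZE) starting at the received offset, and the actual processing of the block is application specific.

```python
#!/usr/bin/env python
# Hedged sketch of the modified user program (streaming mapper). Assumptions
# not taken from the patent: the very large file is reachable on every node
# via the path given as argv[1], and each map process handles a fixed-length
# segment (SEGMENT_SIZE) starting at the offset it receives on stdin.
import sys

SEGMENT_SIZE = 64 * 1024 * 1024  # assumed length of the segment each map handles

def main():
    big_file_path = sys.argv[1]
    for line in sys.stdin:               # Hadoop Streaming feeds one record per line
        line = line.strip()
        if not line:
            continue
        # The framework may prepend a key and a tab; keep only the last
        # tab-separated field, which is the offset read from the sequence file.
        offset = int(line.split("\t")[-1])
        with open(big_file_path, "rb") as f:
            f.seek(offset)               # position at the data block's start address
            block = f.read(SEGMENT_SIZE)
        # application-specific processing of `block` would go here; emit a
        # key/value line so the framework has output to collect
        sys.stdout.write(f"{offset}\t{len(block)}\n")

if __name__ == "__main__":
    main()
```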
Step 4: invoke, through the streaming mode of the Hadoop cluster (streaming lets the map and reduce functions be written in a non-Java language), the user program modified in step (3) together with the sequence file of step (2) to complete the parallel data block positioning; the positioning is achieved by combining the data offsets from step (2) with the start address of the very large file from step (3). The input option of the Hadoop job is set to the sequence file, and the inputformat option is set so that the number of map processes is determined by the number of lines of the input sequence file.
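For illustration, one way such a streaming invocation could be issued is sketched below in Python; the streaming jar path, the HDFS paths, the mapper file name and the use of NLineInputFormat (so that each line of the sequence file drives one map task) are assumptions rather than details given by the patent.

```python
# Hedged sketch: launching the modified user program (mapper.py above) through
# Hadoop Streaming. The jar path, HDFS paths and the choice of NLineInputFormat
# are assumptions; the patent only states that the input is the sequence file
# and that the map count follows its line count.
import subprocess

cmd = [
    "hadoop", "jar", "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar",
    "-D", "mapred.reduce.tasks=0",                    # map-only job
    "-D", "mapred.line.input.format.linespermap=1",   # one line -> one map task
    "-inputformat", "org.apache.hadoop.mapred.lib.NLineInputFormat",
    "-input", "/user/demo/offsets.txt",               # the sequence file on HDFS
    "-output", "/user/demo/out",
    "-mapper", "python mapper.py /shared/bigfile.dat",  # modified program + big-file path
    "-file", "mapper.py",                             # ship the mapper to the nodes
]
subprocess.check_call(cmd)
```

With an invocation of this kind the number of map tasks follows the number of lines in the sequence file, matching the relationship stated in step 2.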
Content not described in detail in this specification belongs to techniques well known to those skilled in the art.

Claims (3)

1. A parallel data block positioning method for very large files based on a Hadoop cluster, characterized in that it comprises the following steps:
Step 1: set up the Hadoop cluster by building the Hadoop environment and configure the key HDFS attributes;
Step 2: construct a specific sequence file; the content of the specific sequence file consists of integers, one integer per line, where the value on each line is the offset of the data block, within the very large file, that the corresponding map process is to handle, and the number of lines in the sequence file equals both the number of map processes and the number of data blocks to be processed;
The sequence file is constructed by any one of the following methods:
(a) generate the specific sequence using Microsoft Office Excel and copy it, row by row, into a text file;
(b) in vim editor command mode, type a command that generates the specific sequence file;
(c) manually generate the required irregular sequence file;
(d) generate the specific sequence file with a batch file or script file;
Step 3: modify the user program so that it can receive the data passed on the standard input stream, convert that data into an integer, and use the integer in the program as the start address of the data block to be read from the very large file;
Step 4: invoke, in the streaming mode of the Hadoop cluster, the user program modified in step 3 together with the sequence file of step 2 to complete the parallel data block positioning; the method of completing the parallel data block positioning is: parallelize the positioning program by using the streaming mode of the Hadoop cluster, and complete the data positioning from the very-large-file offsets in the sequence file of step 2 and the start address of the very large file in step 3.
2. The parallel data block positioning method for very large files based on a Hadoop cluster as claimed in claim 1, characterized in that each computer node in the cluster has its own CPU, memory, local hard disk and operating system.
3. The parallel data block positioning method for very large files based on a Hadoop cluster as claimed in claim 1, characterized in that the computer nodes in the cluster are interconnected by Ethernet or InfiniBand.
CN201310712421.2A 2013-12-20 2013-12-20 Parallel data block positioning method for very large files based on a Hadoop cluster Active CN103699627B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310712421.2A CN103699627B (en) 2013-12-20 2013-12-20 Parallel data block positioning method for very large files based on a Hadoop cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310712421.2A CN103699627B (en) 2013-12-20 2013-12-20 Parallel data block positioning method for very large files based on a Hadoop cluster

Publications (2)

Publication Number Publication Date
CN103699627A CN103699627A (en) 2014-04-02
CN103699627B true CN103699627B (en) 2017-03-15

Family

ID=50361155

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310712421.2A Active CN103699627B (en) 2013-12-20 2013-12-20 Parallel data block positioning method for very large files based on a Hadoop cluster

Country Status (1)

Country Link
CN (1) CN103699627B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808354B (en) * 2016-03-10 2019-02-15 西北大学 The method for setting up interim Hadoop environment using wlan network
CN106339473A (en) * 2016-08-29 2017-01-18 北京百度网讯科技有限公司 Method and device for copying file
CN108874897B (en) * 2018-05-23 2019-09-13 新华三大数据技术有限公司 Data query method and device
CN110851399B (en) * 2019-09-22 2022-11-25 苏州浪潮智能科技有限公司 Method and system for optimizing file data block transmission efficiency of distributed file system


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103377269B (en) * 2012-04-27 2016-12-28 国际商业机器公司 Sensing data localization method and device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102332027A (en) * 2011-10-15 2012-01-25 西安交通大学 Mass non-independent small file associated storage method based on Hadoop

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Data blocks in Hadoop HDFS and the splitting of Map tasks; supercharles; http://www.linuxidc.com/Linux/201205/; 2012-05-26; pp. 1-2 *
A data placement strategy in HDFS; Wang Yongzhou et al.; Computer Technology and Development; 2013-05-31; vol. 23, no. 5; pp. 90-92, 96 *
In-depth analysis of how to control the number of Maps in Hadoop; guibin; http://blog.csdn.net/strongerbit/article/details/7440111; 2012-04-09; pp. 1-3 *

Also Published As

Publication number Publication date
CN103699627A (en) 2014-04-02

Similar Documents

Publication Publication Date Title
Dayal et al. Flexpath: Type-based publish/subscribe system for large-scale science analytics
Zheng et al. Scaling embedded in-situ indexing with deltaFS
Abbasi et al. Extending i/o through high performance data services
CN103699627B (en) Parallel data block positioning method for very large files based on a Hadoop cluster
CN106570113B (en) Mass vector slice data cloud storage method and system
CN105706092A (en) Methods and systems of four-valued simulation
Ferraro Petrillo et al. Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics
CN104536987A (en) Data query method and device
US20240004853A1 (en) Virtual data source manager of data virtualization-based architecture
Singh et al. Spatial data analysis with ArcGIS and MapReduce
US8949255B1 (en) Methods and apparatus for capture and storage of semantic information with sub-files in a parallel computing system
Zhang et al. Towards optimized scheduling for data‐intensive scientific workflow in multiple datacenter environment
Thao Nguyen et al. Efficient MPI‐AllReduce for large‐scale deep learning on GPU‐clusters
CN104299170B (en) Intermittent energy source mass data processing method
WO2022061878A1 (en) Blockchain transaction processing systems and methods
US11960616B2 (en) Virtual data sources of data virtualization-based architecture
US11263026B2 (en) Software plugins of data virtualization-based architecture
Serbanescu et al. Architecture of distributed data aggregation service
Yuan et al. Dynamic data replication based on local optimization principle in data grid
Wang Distributed Machine Learning with Python: Accelerating model training and serving with distributed systems
Poyraz et al. Application-specific I/O optimizations on petascale supercomputers
Gao et al. On the power of combiner optimizations in mapreduce over MPI workflows
Lin et al. Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data-and Compute-intensive
CN106484379B (en) A kind of processing method and processing device of application
Yan et al. CADRE: A Cloud-Based Data Service for Big Bibliographic Data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant