CN103699627A - Dummy file parallel data block positioning method based on Hadoop cluster - Google Patents
- Publication number
- CN103699627A CN103699627A CN201310712421.2A CN201310712421A CN103699627A CN 103699627 A CN103699627 A CN 103699627A CN 201310712421 A CN201310712421 A CN 201310712421A CN 103699627 A CN103699627 A CN 103699627A
- Authority
- CN
- China
- Prior art keywords
- file
- data block
- data
- super large
- hadoop
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
Abstract
The invention discloses a parallel data block positioning method for ultra-large files based on a Hadoop cluster, by which multiple data blocks in an ultra-large file can be located in the Hadoop cluster using Map/Reduce software. The method mainly comprises the steps of: building the cluster environment; constructing a sequence file; modifying the user program source code; and calling the user program in streaming mode. By controlling the number and order of the Maps, the mapping relations between Map processes and data blocks are determined, the data blocks in the file can be located, and specified data blocks are processed by specified Map processes, so that the data blocks of an ultra-large file can be processed simply and in parallel.
Description
Technical field
The present invention relates to a parallel data block positioning method for ultra-large files (over one hundred GB) on a cluster based on Hadoop (a software framework for the distributed processing of massive data), and belongs to the field of big data processing.
Background art
In the field of high-performance parallel computing, the MapReduce framework for massive data processing is widely applied. With clusters of cheap commodity computers it delivers large-scale data computing power that formerly only expensive large servers possessed, and it outperforms traditional high-performance computing schemes in stability, scalability, and other respects. The MapReduce model is now applied to astronomical computation, mass storage analysis, virus database storage, web retrieval services, and similar areas, resolving the contradiction between explosive data growth and insufficient computer storage and computing capacity. In actual development, programming languages vary widely; the Streaming technology allows programs written in any programming language to be used in Hadoop MapReduce, which makes it convenient to port existing programs to the Hadoop platform and greatly reduces porting cost.
HDFS (the Hadoop Distributed File System) features high fault tolerance. It stores data dispersed across many machines in the form of one or more replicas, can store massive data with high reliability, and provides fast, scalable access to the data, suiting a write-once, read-many access pattern. A file on HDFS is divided into multiple blocks of the block size, each serving as an independent storage unit; the system default block size is 64 MB, and the user can also customize the block size.
In high-performance computing there is a class of problems that repeatedly process the same ultra-large file: each pass processes one contiguous segment of data in the large file starting at a different offset, and the computations are mutually independent, with no dependency relations among them. When this computation model is ported to the Hadoop platform, the storage model generally adopts the HDFS file system and the computation model generally adopts Hadoop Streaming, so the port can be done quickly with no, or very few, source code changes. Each map process in Hadoop then processes one contiguous segment of the large file starting at a different offset; this model requires the developer to control both the number of map processes and the offset, within the large file, of the data each map process handles, so that multiple map tasks can locate multiple data blocks in the ultra-large file in parallel.
Generally the number of map processes is determined by the size of the input file and the HDFS block size, that is, by the number of blocks the input file occupies in HDFS; by default it cannot be directly controlled or interfered with. The Hadoop API provides a corresponding interface, org.apache.hadoop.mapred.JobConf.setNumMapTasks(int n), which can influence the number of map processes, but the official documentation glosses this function with "Note: This is only a hint to the framework" — that is, the setNumMapTasks() method is only a hint to the Hadoop framework and does not play a decisive role. In other words, even if this value is set, it does not necessarily produce the desired effect.
Although a single map process can access a file at any offset, even when the file is stored in the HDFS file system with its blocks distributed across different nodes, the same does not hold for multiple map processes: on the one hand, the system provides no corresponding interface to distinguish individual map processes; on the other hand, in many situations the offset addresses of the data each process must handle within the large file are irregular. Under these circumstances it is infeasible to specify, for each map process separately, the offset of the data it should process in the large file.
In summary, it is infeasible to control the number of map processes, take the large file directly as the program input, and have each map process accurately locate, by itself, the offset of its own data block within the large file.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the deficiencies of the prior art and provide a parallel data block positioning method for ultra-large files based on a Hadoop cluster. By constructing a sequence file to control the number of map processes and the way the map processes operate, different map processes are each made to process one contiguous segment of data at a different offset.
The technical solution of the present invention is as follows:
A parallel data block positioning method for ultra-large files based on a Hadoop cluster comprises the following steps:
Step 1: set up a Hadoop cluster by building the Hadoop environment and configuring the key HDFS properties;
Step 2: construct a specific sequence file. The content of the sequence file is integer numbers, one integer per line; the value of each line is the offset, within the ultra-large file, of the data block to be processed by one map process, and the number of lines of the sequence file equals both the number of map processes and the number of data blocks to be processed.
The sequence file can be constructed by any one of the following methods:
(a) use Microsoft Office Excel to generate the specific sequence one value per row, then copy it into a text file;
(b) in the command mode of the vim editor, type commands to generate the specific sequence;
(c) manually produce the required irregular sequence, or generate the specific sequence with a batch file or script file.
Step 3: modify the user program so that it can receive the data passed by the standard input stream, convert the data into an integer, and use the integer to set, in the program, the start address of the data block to be read from the ultra-large file;
Step 4: in the streaming mode of the Hadoop cluster, invoke the modified user program of step 3 together with the sequence file of step 2 to complete the parallel data block positioning; the positioning is accomplished by combining the data block offsets of step 2 with the start address of the ultra-large file set in step 3.
Each computer node in the cluster has its own independent CPU, memory, local hard disk, and operating system.
The computer nodes in the cluster are interconnected by Ethernet or InfiniBand (a cable-switching interconnect technology supporting many concurrent links).
Compared with the prior art, the beneficial effects of the present invention are:
(1) The present invention adopts streaming technology: a program only has to follow the standard input/output format to meet Hadoop's requirements, so a stand-alone program can be used on the cluster after only minor revision, which improves working efficiency and eases testing.
(2) The streaming technology adopted by the present invention supports non-Java languages; in practical engineering the most suitable development language can be chosen according to the project's needs and then run on the Hadoop platform through streaming, so execution by this method is more efficient.
(3) Constructing a sequence file to precisely control the number of map processes can further improve execution efficiency and control cluster load balancing.
(4) The present invention uses a sequence file to position within the large file, so that each map process handles data at a different start position in the large file; the control is simple and easy to operate.
Brief description of the drawings
Fig. 1 is the flow diagram of the method of the present invention for parallel data block positioning in ultra-large files based on a Hadoop cluster.
Embodiment
The specific embodiment of the present invention is described in further detail below in conjunction with the accompanying drawing.
By controlling the quantity and order of the Maps, the present invention determines the mapping relations between map processes and multiple data blocks, can locate the position of each data block within the file, and lets a specified Map process handle a specified data block. As shown in Fig. 1, the concrete steps of the present invention are as follows:
Step 1: set up a Hadoop cluster by building the Hadoop environment and configuring the key HDFS properties;
The Hadoop environment is deployed on 4 computer nodes: 1 computer node is the name node (namenode) and the other 3 are data nodes (datanodes) that share the storage task; the HDFS replica count is 3. Each computer node in the cluster has its own independent CPU, memory, local hard disk, and operating system, and the nodes are interconnected by Ethernet or InfiniBand (a cable-switching interconnect technology supporting many concurrent links).
The key HDFS properties include:
(a) configure the default file system, defining the host system name and the port on which the name node operates;
(b) configure the directory list in which the name node stores its permanent metadata;
(c) configure the directory list in which the data nodes store data blocks;
(d) configure the replica count for data stored on the data nodes.
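In modern Hadoop configuration files the four properties above correspond to entries like the following (a sketch only; the host name, port, and directory paths are illustrative assumptions, not values given in this specification):

```xml
<!-- core-site.xml: (a) default file system = host system name + namenode port -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode-host:9000</value>
</property>

<!-- hdfs-site.xml: (b) namenode permanent metadata directory list -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/hdfs/name</value>
</property>

<!-- (c) datanode block storage directory list -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/hdfs/data</value>
</property>

<!-- (d) replica count (3 in this embodiment) -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```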
Step 2: construct a specific sequence file. The content of the sequence file is integer numbers, one integer per line; the value of each line is the offset, within the ultra-large file, of the data block to be processed by one map process, and the number of lines of the sequence file equals both the number of map processes and the number of data blocks to be processed.
The sequence file can be constructed by any one of the following methods:
(a) use Microsoft Office Excel to generate the specific sequence one value per row, then copy it into a text file;
(b) in the command mode of the vim editor, type commands to generate the specific sequence;
(c) manually produce the required irregular sequence, or generate the specific sequence with a batch file or script file.
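As a concrete illustration of method (c), a short script can emit one offset per line. This is a sketch, not part of the patent: the file name `offsets.txt`, the 64 MB spacing, and the count of 4 blocks are illustrative assumptions.

```python
# Generate a sequence file: one integer per line, each integer being the
# byte offset in the ultra-large file of the data block that one map
# process will handle. The line count equals the number of map processes.
BLOCK_SIZE = 64 * 1024 * 1024   # HDFS default block size (64 MB)
NUM_BLOCKS = 4                  # number of map processes / data blocks

with open("offsets.txt", "w") as f:
    for i in range(NUM_BLOCKS):
        f.write("%d\n" % (i * BLOCK_SIZE))
```

Irregular offsets, such as those mentioned in the background section, can be emitted the same way from any list of integers.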
Step 3: modify the user program so that it can receive the data passed by the standard input stream, convert the data into an integer, and use the integer to set, in the program, the start address of the data block to be read from the ultra-large file;
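The modification described in step 3 can be sketched as a streaming mapper in Python. This is an illustrative sketch: the function names, the tab-splitting of `NLineInputFormat` key/value lines, and the fixed read length are assumptions, not details fixed by the patent.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # bytes one map process handles (assumption)

def parse_offset(line):
    """Convert one standard-input line into an integer offset (step 3).
    NLineInputFormat hands streaming mappers "key<TAB>value" lines, so the
    last tab-separated field is kept; a bare integer line also works."""
    return int(line.strip().split("\t")[-1])

def process_block(path, offset, length=BLOCK_SIZE):
    """Set the start address from the offset, then read one contiguous block."""
    with open(path, "rb") as f:
        f.seek(offset)          # the offset from the sequence file
        return f.read(length)

# In the real mapper, offsets arrive on standard input from Hadoop streaming:
#   import sys
#   for line in sys.stdin:
#       data = process_block("/path/to/bigfile", parse_offset(line))
#       ...independent processing of this block...
```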
Step 4: in the streaming mode of the Hadoop cluster (streaming: the technology for calling the map and reduce functions from non-Java languages), invoke the modified user program of step 3 together with the sequence file of step 2 to complete the parallel data block positioning. Positioning is accomplished by combining the data block offsets in the sequence file of step 2 with the start address set in step 3; the input option of the Hadoop job is set to the sequence file generated in step 2, and the inputformat option is set so that the number of map processes is determined by the number of lines of the input sequence file.
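The invocation in step 4 corresponds to a streaming command along these lines. Everything below — the jar path, file names, mapper name, and output directory — is a placeholder, not taken from the patent; `NLineInputFormat` with one line per map is a standard way to make the map count equal the sequence file's line count, which is what the inputformat option here requires.

```python
# Assemble the Hadoop streaming command line for the positioning job.
offsets_file = "offsets.txt"   # sequence file from step 2 (placeholder name)
mapper = "mapper.py"           # modified user program from step 3 (placeholder)

cmd = [
    "hadoop", "jar", "hadoop-streaming.jar",
    # one map task per line of the sequence file:
    "-D", "mapreduce.input.lineinputformat.linespermap=1",
    "-inputformat", "org.apache.hadoop.mapred.lib.NLineInputFormat",
    "-input", offsets_file,    # the -input option is the sequence file
    "-output", "out",          # placeholder output directory
    "-mapper", mapper,
    "-file", mapper,
    "-numReduceTasks", "0",    # positioning needs only the map side
]
print(" ".join(cmd))
```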
Content not described in detail in this specification belongs to technology well known to those skilled in the art.
Claims (3)
1. A parallel data block positioning method for ultra-large files based on a Hadoop cluster, characterized by comprising the following steps:
Step 1: set up a Hadoop cluster by building the Hadoop environment and configuring the key HDFS properties;
Step 2: construct a specific sequence file; the content of the sequence file is integer numbers, one integer per line; the value of each line is the offset, within the ultra-large file, of the data block to be processed by one map process, and the number of lines of the sequence file equals both the number of map processes and the number of data blocks to be processed;
the sequence file being constructed by any one of the following methods:
(a) use Microsoft Office Excel to generate the specific sequence one value per row, then copy it into a text file;
(b) in the command mode of the vim editor, type commands to generate the specific sequence file;
(c) manually generate the required irregular sequence file;
(d) generate the specific sequence file with a batch file or script file;
Step 3: modify the user program so that it can receive the data passed by the standard input stream, convert the data into an integer, and use the integer to set, in the program, the start address of the data block to be read from the ultra-large file;
Step 4: in the streaming mode of the Hadoop cluster, invoke the modified user program of step 3 together with the sequence file of step 2 to complete the parallel data block positioning, the positioning being accomplished by parallelizing the positioning program through the streaming mode of the Hadoop cluster and by combining the data block offsets in the sequence file of step 2 with the start address of the ultra-large file set in step 3.
2. The parallel data block positioning method for ultra-large files based on a Hadoop cluster according to claim 1, characterized in that each computer node in the cluster has its own independent CPU, memory, local hard disk, and operating system.
3. The parallel data block positioning method for ultra-large files based on a Hadoop cluster according to claim 1, characterized in that the computer nodes in the cluster are interconnected by Ethernet or InfiniBand.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310712421.2A CN103699627B (en) | 2013-12-20 | 2013-12-20 | A kind of super large file in parallel data block localization method based on Hadoop clusters |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310712421.2A CN103699627B (en) | 2013-12-20 | 2013-12-20 | A kind of super large file in parallel data block localization method based on Hadoop clusters |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103699627A true CN103699627A (en) | 2014-04-02 |
CN103699627B CN103699627B (en) | 2017-03-15 |
Family
ID=50361155
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310712421.2A Active CN103699627B (en) | 2013-12-20 | 2013-12-20 | A kind of super large file in parallel data block localization method based on Hadoop clusters |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103699627B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105808354A (en) * | 2016-03-10 | 2016-07-27 | 西北大学 | Method for establishing temporary Hadoop environment by utilizing WLAN (Wireless Local Area Network) |
CN106339473A (en) * | 2016-08-29 | 2017-01-18 | 北京百度网讯科技有限公司 | Method and device for copying file |
CN108874897A (en) * | 2018-05-23 | 2018-11-23 | 新华三大数据技术有限公司 | Data query method and device |
CN110851399A (en) * | 2019-09-22 | 2020-02-28 | 苏州浪潮智能科技有限公司 | Method and system for optimizing file data block transmission efficiency of distributed file system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102332027A (en) * | 2011-10-15 | 2012-01-25 | 西安交通大学 | Mass non-independent small file associated storage method based on Hadoop |
US20130311480A1 (en) * | 2012-04-27 | 2013-11-21 | International Business Machines Corporation | Sensor data locating |
2013
- 2013-12-20 CN CN201310712421.2A patent/CN103699627B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102332027A (en) * | 2011-10-15 | 2012-01-25 | 西安交通大学 | Mass non-independent small file associated storage method based on Hadoop |
US20130311480A1 (en) * | 2012-04-27 | 2013-11-21 | International Business Machines Corporation | Sensor data locating |
Non-Patent Citations (3)
Title |
---|
GUIBIN: "In-depth analysis of how to control the number of Maps in Hadoop", http://blog.csdn.net/strongerbit/article/details/7440111 * |
SUPERCHARLES: "Data blocks in Hadoop HDFS and splits of Map tasks", http://www.linuxidc.com/Linux/201205/ * |
WANG Yongzhou et al.: "A data placement strategy in HDFS", Computer Technology and Development * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105808354A (en) * | 2016-03-10 | 2016-07-27 | 西北大学 | Method for establishing temporary Hadoop environment by utilizing WLAN (Wireless Local Area Network) |
CN106339473A (en) * | 2016-08-29 | 2017-01-18 | 北京百度网讯科技有限公司 | Method and device for copying file |
CN108874897A (en) * | 2018-05-23 | 2018-11-23 | 新华三大数据技术有限公司 | Data query method and device |
CN108874897B (en) * | 2018-05-23 | 2019-09-13 | 新华三大数据技术有限公司 | Data query method and device |
CN110851399A (en) * | 2019-09-22 | 2020-02-28 | 苏州浪潮智能科技有限公司 | Method and system for optimizing file data block transmission efficiency of distributed file system |
CN110851399B (en) * | 2019-09-22 | 2022-11-25 | 苏州浪潮智能科技有限公司 | Method and system for optimizing file data block transmission efficiency of distributed file system |
Also Published As
Publication number | Publication date |
---|---|
CN103699627B (en) | 2017-03-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ikeda et al. | Provenance for Generalized Map and Reduce Workflows. | |
CN111324610A (en) | Data synchronization method and device | |
CN102708203A (en) | Database dynamic management method based on XML metadata | |
CN106471501A (en) | The method of data query, the storage method data system of data object | |
CN103810272A (en) | Data processing method and system | |
CN110795499A (en) | Cluster data synchronization method, device and equipment based on big data and storage medium | |
CN105447051A (en) | Database operation method and device | |
CN102799679A (en) | Hadoop-based massive spatial data indexing updating system and method | |
CN104536987A (en) | Data query method and device | |
CN103699627A (en) | Dummy file parallel data block positioning method based on Hadoop cluster | |
CN106055678A (en) | Hadoop-based panoramic big data distributed storage method | |
KR101790766B1 (en) | Method, device and terminal for data search | |
Singh et al. | Spatial data analysis with ArcGIS and MapReduce | |
CN103501341A (en) | Method and device for establishing Web service | |
Barkhordari et al. | Atrak: a MapReduce-based data warehouse for big data | |
CN103809915B (en) | The reading/writing method of a kind of disk file and device | |
US11698911B2 (en) | System and methods for performing updated query requests in a system of multiple database engine | |
Li et al. | Research of distributed database system based on Hadoop | |
JP2014041501A (en) | Fast reading method for batch processing target data and batch management system | |
CN106484379B (en) | A kind of processing method and processing device of application | |
Xu et al. | A unified computation engine for big data analytics | |
Pan et al. | An open sharing pattern design of massive power big data | |
Gao et al. | On the power of combiner optimizations in mapreduce over MPI workflows | |
Xu et al. | A PaaS based metadata-driven ETL framework | |
Han et al. | Design and Implementation of Big Data Management Platform for Android Applications |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |