CN104408047A - Method for uploading text file to HDFS (hadoop distributed file system) in multi-machine parallel mode based on NFS (network file system) file server - Google Patents

Method for uploading text file to HDFS (hadoop distributed file system) in multi-machine parallel mode based on NFS (network file system) file server

Publication number
CN104408047A
Authority
CN
China
Prior art keywords
node
file
uploading
hdfs
uploaded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410584207.8A
Other languages
Chinese (zh)
Inventor
房体盈
辛国茂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Langchao Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-10-28
Filing date
2014-10-28
Publication date
2015-03-11
Application filed by Langchao Electronic Information Industry Co Ltd
Priority to CN201410584207.8A
Publication of CN104408047A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G06F16/1824 Distributed file systems implemented using Network-attached Storage [NAS] architecture
    • G06F16/183 Provision of network file services by network file servers, e.g. by using NFS, CIFS
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention provides a method for uploading text files to HDFS (Hadoop Distributed File System) in multi-machine parallel mode based on an NFS (Network File System) file server. The method comprises the following steps: selecting N hosts from an HDFS cluster, then choosing any one node as the master node and the other N-1 nodes as slave nodes; obtaining, on the master node, the files in the to-be-uploaded directory of the NFS file server; and uploading each file in parallel, with all N selected machines participating and each host responsible for uploading one contiguous data block of 1/N of the file's size. Parallel upload is thereby achieved and the upload speed is increased.

Description

A method for uploading text files to HDFS in multi-machine parallel mode based on an NFS file server
Technical field
The present invention relates to the technical field of big data storage, and specifically to a method for uploading text files to HDFS in multi-machine parallel mode based on an NFS file server.
Background
With the development of computer networks, the era of massive data has arrived. IDC predicts that global data usage will grow 44-fold by 2020, reaching 35.2 ZB.
Conventional technologies (including traditional relational databases) are not adequate for storing, analyzing, managing, and mining data sets of this size, and how best to analyze and understand this data is an urgent task facing everyone. Among existing technologies and tools, the most mature and most successful big data solution is the Hadoop storage and computing framework together with the components built on top of it. For the large volume of text files generated every day, the problem currently faced is how to upload them to HDFS quickly for subsequent processing. To solve the problem of uploading text files quickly, this document proposes a method for uploading text files to HDFS in multi-machine parallel mode based on an NFS file server.
HDFS uses a three-replica mechanism by default. When a user writes data into HDFS through a client and a DataNode runs on that client machine, the NameNode gives priority to placing one replica of the written data on the client's own DataNode, and the other two replicas are stored on other DataNodes in the cluster. As a result, if only one client is writing, only 3 DataNodes in the cluster are working while the other DataNodes sit idle, and the performance of the whole cluster cannot be brought into play.
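The idle-node argument above can be made concrete with a rough upper bound. The following is a simplified sketch and not part of the patent text: it assumes each client writes one block at a time to `replication` DataNodes and that the DataNode sets used by different clients do not overlap. The function name `max_busy_datanodes` is hypothetical.

```python
def max_busy_datanodes(writing_clients: int, cluster_datanodes: int,
                       replication: int = 3) -> int:
    """Upper bound on DataNodes receiving data at one moment.

    Assumption (not from the source): every client writes a single block
    at a time, and no DataNode serves two clients simultaneously.
    """
    if writing_clients < 0 or cluster_datanodes < 0 or replication < 1:
        raise ValueError("arguments must be non-negative, replication >= 1")
    # Each writing client keeps `replication` DataNodes busy, capped by
    # the number of DataNodes that exist in the cluster.
    return min(cluster_datanodes, writing_clients * replication)
```

Under these assumptions, a single client in a 12-DataNode cluster keeps at most 3 DataNodes busy, while 4 parallel clients can keep all 12 busy.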
Summary of the invention
The object of the present invention is to provide a method for uploading text files to HDFS in multi-machine parallel mode based on an NFS file server.
The object of the invention is achieved as follows. N hosts are selected from the HDFS cluster; any one node is chosen as the master node and the other N-1 nodes serve as slave nodes. On the master node, the files under the to-be-uploaded directory of the NFS file server are obtained. Each file is uploaded in parallel: all N selected machines participate in the upload, and each host is responsible for uploading one contiguous data block of 1/N of the file's size. This achieves parallel upload and thereby increases upload speed. The specific steps are:
1) On the master node, the MainPut program computes, for each of the N nodes, the start and end byte offsets of the data block that node is to upload, and starts the BlockPut program on the N nodes so that they upload in parallel. On the first run, an executable program BlockPut, which is responsible for uploading the data block assigned to its node, is installed on each node; a command is then sent to each slave node to start BlockPut;
2) On each node, the BlockPut program uploads its assigned data block to HDFS: BlockPut opens an input stream (InputStream) on the file to be uploaded, positions the InputStream at the start byte, creates a separate file on HDFS, and writes the bytes from the start offset to the end offset into that separate HDFS file.
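The offset computation that step 1) assigns to MainPut can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the source does not say how a file size that is not divisible by N is handled, so here the leftover bytes are spread over the first ranges. The function name `split_byte_ranges` and the end-exclusive convention are hypothetical.

```python
from typing import List, Tuple


def split_byte_ranges(file_size: int, n: int) -> List[Tuple[int, int]]:
    """Split [0, file_size) into n contiguous (start, end) byte ranges.

    `end` is exclusive. Assumption (not from the source): when file_size
    is not a multiple of n, the first `file_size % n` ranges get one
    extra byte each.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    base, extra = divmod(file_size, n)
    ranges = []
    start = 0
    for i in range(n):
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges
```

For a 10-byte file and N = 3, this yields the ranges (0, 4), (4, 7), and (7, 10). The ranges are pure byte ranges; the source does not describe any alignment to line boundaries.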
The directory to be uploaded is mounted at the same default directory on all N nodes.
N is not greater than the number of parallel clients at which the NFS file server reaches its maximum read bandwidth.
The beneficial effects of the present invention are as follows: N nodes in the cluster are explicitly chosen as clients, and a file is divided into N data blocks that are uploaded simultaneously, with each client responsible for one block. Each block is saved as an independent file on HDFS, so the performance of the whole cluster can be used to the greatest extent. Uploading a text file in parallel blocks brings the performance of the cluster fully into play and improves upload efficiency.
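The two constraints above, the bound on N and the one-file-per-block layout on HDFS, can be sketched as follows. This is an illustration under assumptions: the source only requires that N not exceed the NFS saturation point, so taking the largest allowed value is a choice made here, and the `.part-00000` naming scheme, the function names, and the example paths are all hypothetical.

```python
import posixpath
from typing import List


def choose_n(available_hosts: int, nfs_saturation_clients: int) -> int:
    """Pick N no larger than the client count that saturates NFS reads.

    Assumption: use as many hosts as the bound allows.
    """
    if available_hosts < 1 or nfs_saturation_clients < 1:
        raise ValueError("both arguments must be at least 1")
    return min(available_hosts, nfs_saturation_clients)


def part_paths(hdfs_dir: str, file_name: str, n: int) -> List[str]:
    """Build one HDFS target path per block (naming scheme is hypothetical)."""
    return [posixpath.join(hdfs_dir, f"{file_name}.part-{i:05d}")
            for i in range(n)]
```

For example, with 10 candidate hosts and an NFS server whose read bandwidth saturates at 4 parallel clients, `choose_n(10, 4)` gives N = 4, and `part_paths` then produces four independent target files.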
Brief description of the drawings
Fig. 1 is a framework diagram of the multi-machine parallel upload process.
Embodiment
The method of the present invention is described in detail below with reference to the accompanying drawing.
N hosts are selected from the HDFS cluster; any one node is chosen as the master node and the other N-1 nodes serve as slave nodes. On the master node, the files under the to-be-uploaded directory of the NFS file server are obtained. Each file is uploaded in parallel: all N selected machines participate in the upload, and each host is responsible for uploading one contiguous data block of 1/N of the file's size, which achieves parallel upload and thereby increases upload speed. The overall flow of the method of the present invention for uploading text files to HDFS in multi-machine parallel mode based on an NFS file server is:
1) On the master node, the MainPut program computes, for each of the N nodes, the start and end byte offsets of the data block that node is to upload, and starts the BlockPut program on the N nodes so that they upload in parallel. On the first run, an executable program BlockPut, which is responsible for uploading the data block assigned to its node, is installed on each node; a command is then sent to each slave node to start BlockPut;
2) On each node, the BlockPut program uploads its assigned data block to HDFS: BlockPut opens an input stream (InputStream) on the file to be uploaded, positions the InputStream at the start byte, creates a separate file on HDFS, and writes the bytes from the start offset to the end offset into that separate HDFS file.
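The per-node copy described in step 2) can be sketched with generic binary streams. This is a minimal illustration, not the patent's BlockPut program: the destination here is any writable binary stream, whereas a real deployment would write through an HDFS client, which is outside this sketch. The function name `copy_byte_range`, the end-exclusive range, and the chunk size are assumptions.

```python
import io
from typing import BinaryIO


def copy_byte_range(src: BinaryIO, dst: BinaryIO, start: int, end: int,
                    chunk_size: int = 1 << 20) -> int:
    """Copy bytes [start, end) from src to dst and return the byte count.

    src must be seekable (for example a file on the NFS mount); dst stands
    in for the separate output file created for this block.
    """
    if start < 0 or end < start:
        raise ValueError("require 0 <= start <= end")
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    src.seek(start)              # position the input stream at the start byte
    remaining = end - start
    copied = 0
    while remaining > 0:
        chunk = src.read(min(chunk_size, remaining))
        if not chunk:            # source ended before `end` was reached
            break
        dst.write(chunk)
        copied += len(chunk)
        remaining -= len(chunk)
    return copied


if __name__ == "__main__":
    source = io.BytesIO(b"0123456789")
    target = io.BytesIO()
    copy_byte_range(source, target, 4, 7)
    print(target.getvalue())
```

Reading in bounded chunks keeps memory use independent of the block size, which matters when each 1/N block is itself large.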
Everything other than the technical features described in the specification is known technology to those skilled in the art.

Claims (3)

1. A method for uploading text files to HDFS in multi-machine parallel mode based on an NFS file server, characterized in that N hosts are selected from the HDFS cluster, any one node is chosen as the master node, and the other N-1 nodes serve as slave nodes; on the master node, the files under the to-be-uploaded directory of the NFS file server are obtained; each file is uploaded in parallel, that is, all N selected machines participate in the upload and each host is responsible for uploading one contiguous data block of 1/N of the file's size, thereby achieving parallel upload and increasing upload speed; the specific steps are:
1) On the master node, the MainPut program computes, for each of the N nodes, the start and end byte offsets of the data block that node is to upload, and starts the BlockPut program on the N nodes so that they upload in parallel; on the first run, an executable program BlockPut, which is responsible for uploading the data block assigned to its node, is installed on each node, and a command is then sent to each slave node to start BlockPut;
2) On each node, the BlockPut program uploads its assigned data block to HDFS: BlockPut opens an input stream (InputStream) on the file to be uploaded, positions the InputStream at the start byte, creates a separate file on HDFS, and writes the bytes from the start offset to the end offset into that separate HDFS file.
2. The method according to claim 1, characterized in that the directory to be uploaded is mounted at the same default directory on the N nodes.
3. The method according to claim 1, characterized in that N is not greater than the number of parallel clients at which the NFS file server reaches its maximum read bandwidth.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410584207.8A CN104408047A (en) 2014-10-28 2014-10-28 Method for uploading text file to HDFS (hadoop distributed file system) in multi-machine parallel mode based on NFS (network file system) file server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410584207.8A CN104408047A (en) 2014-10-28 2014-10-28 Method for uploading text file to HDFS (hadoop distributed file system) in multi-machine parallel mode based on NFS (network file system) file server

Publications (1)

Publication Number Publication Date
CN104408047A true CN104408047A (en) 2015-03-11

Family

ID=52645679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410584207.8A Pending CN104408047A (en) 2014-10-28 2014-10-28 Method for uploading text file to HDFS (hadoop distributed file system) in multi-machine parallel mode based on NFS (network file system) file server

Country Status (1)

Country Link
CN (1) CN104408047A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105357317A (en) * 2015-12-07 2016-02-24 金蝶软件(中国)有限公司 Data uploading method and system based on multi-client polling queuing
CN105357280A (en) * 2015-10-19 2016-02-24 福建新大陆软件工程有限公司 Hadoop distributed file system (HDFS) based file tracing file transfer protocol (FTP) system
CN105610899A (en) * 2015-12-10 2016-05-25 浪潮(北京)电子信息产业有限公司 Text file parallel uploading method and device
CN106339473A (en) * 2016-08-29 2017-01-18 北京百度网讯科技有限公司 Method and device for copying file
CN107800691A (en) * 2017-10-12 2018-03-13 云巅(上海)网络科技有限公司 The system and method for building application program on demand and accessing data trnascription is realized based on distributed storage mechanism
CN108280214A (en) * 2017-02-02 2018-07-13 马志强 Quick I/O systems applied to distributed genetic group analysis
CN109325002A (en) * 2018-09-03 2019-02-12 北京京东金融科技控股有限公司 Text file processing method, device, system, electronic equipment, storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030110237A1 (en) * 2001-12-06 2003-06-12 Hitachi, Ltd. Methods of migrating data between storage apparatuses
CN101227460A (en) * 2007-01-19 2008-07-23 秦晨 Method for uploading and downloading distributed document and apparatus and system thereof
CN103530388A (en) * 2013-10-22 2014-01-22 浪潮电子信息产业股份有限公司 Performance improving data processing method in cloud storage system
CN103544285A (en) * 2013-10-28 2014-01-29 华为技术有限公司 Data loading method and device
CN103970881A (en) * 2014-05-16 2014-08-06 浪潮(北京)电子信息产业有限公司 Method and system for achieving file uploading
CN103971066A (en) * 2014-05-20 2014-08-06 浪潮电子信息产业股份有限公司 Verification method for integrity of big data migration in HDFS

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yang Feng et al.: "Hadoop-based massive agricultural data resource management platform", Computer Engineering *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105357280A (en) * 2015-10-19 2016-02-24 福建新大陆软件工程有限公司 Hadoop distributed file system (HDFS) based file tracing file transfer protocol (FTP) system
CN105357280B (en) * 2015-10-19 2019-02-19 福建新大陆软件工程有限公司 A kind of file based on HDFS is traced to the source FTP system
CN105357317A (en) * 2015-12-07 2016-02-24 金蝶软件(中国)有限公司 Data uploading method and system based on multi-client polling queuing
CN105357317B (en) * 2015-12-07 2019-06-07 金蝶软件(中国)有限公司 A kind of data uploading method and system based on multi-client repeating query queuing
CN105610899A (en) * 2015-12-10 2016-05-25 浪潮(北京)电子信息产业有限公司 Text file parallel uploading method and device
CN105610899B (en) * 2015-12-10 2019-09-24 浪潮(北京)电子信息产业有限公司 A kind of parallel method for uploading of text file and device
CN106339473A (en) * 2016-08-29 2017-01-18 北京百度网讯科技有限公司 Method and device for copying file
CN108280214A (en) * 2017-02-02 2018-07-13 马志强 Quick I/O systems applied to distributed genetic group analysis
CN107800691A (en) * 2017-10-12 2018-03-13 云巅(上海)网络科技有限公司 The system and method for building application program on demand and accessing data trnascription is realized based on distributed storage mechanism
CN109325002A (en) * 2018-09-03 2019-02-12 北京京东金融科技控股有限公司 Text file processing method, device, system, electronic equipment, storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150311
