CN102075584B - Distributed file system and access method thereof - Google Patents

Distributed file system and access method thereof

Info

Publication number
CN102075584B
Authority
CN
China
Prior art keywords
data
random access
packet
access request
tcp
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201110033439.0A
Other languages
Chinese (zh)
Other versions
CN102075584A (en
Inventor
廖浩均
韩冀中
戴娇
周薇
路远征
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Application filed by Institute of Computing Technology of CAS
Priority to CN201110033439.0A
Publication of CN102075584A
Application granted
Publication of CN102075584B
Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Hadoop Distributed File System (HDFS) and an access method and system thereof. The method comprises the following steps: the distributed file system receives an access request sent by a client and sets the packet size according to the actual situation; the start offset of the requested data and the length of the requested data are obtained, and the number of chunks required is calculated; and the chunks are packed into packets and sent to the client. With this system and method, random access performance can be optimized while the sequential access performance of HDFS is preserved.

Description

Distributed file system and access method thereof
Technical field
The present invention relates to the field of network communication, and in particular to a distributed file system (Hadoop Distributed File System, HDFS) and an access method thereof.
Background
With the rapid development of high-performance applications and computing demand, a single high-performance computer can no longer solve certain ultra-large-scale application problems, such as spatial joins and nearest-neighbor queries (ANN) over multiple data sets. Many computers must therefore be joined into a computer cluster to solve large-scale application problems together. Parallel programming technology can effectively exploit the computing power of parallel computers, especially cluster computers; it is the bridge between hardware and software, and the interface between the low-level implementation and the high-level abstraction of parallel computing.
Hadoop is an open-source system based on Google's cluster system. It consists of a distributed file system (HDFS) and the MapReduce programming model.
In the HDFS data access method of the prior art, under sequential access, when a user requests a portion of data, HDFS sends that data to the user together with the data that immediately follows it. Because HDFS mainly supports sequential access, the extra data transmitted in advance will be accessed shortly afterwards. At that point the data no longer has to be fetched from HDFS and can be read directly from the client's local cache.
Under random access, if the offset of the random access within a block differs from the offset of the previous random access by no more than a threshold (the HDFS default is 128K), the sequential access scheme described above is used: sequential access continues from the previous random access, the data is passed to the client, and the client trims out the data it needs according to the offset. If the offset is not within the threshold, the TCP connection is closed and a new TCP connection is started.
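The prior-art rule just described can be sketched in a few lines of Python. This is only an illustration of the decision as the text states it, not HDFS source code: the function and class names are hypothetical, and reading "differ by no more than a threshold" as a forward distance from the previous offset is an assumption made because the stream is said to continue from the previous access.

```python
from dataclasses import dataclass

# Threshold quoted in the text as the HDFS default (128K).
SEEK_THRESHOLD = 128 * 1024


@dataclass
class SeekDecision:
    reuse_stream: bool     # True: keep reading sequentially on the open connection
    bytes_to_discard: int  # bytes the client trims before reaching the wanted offset


def prior_art_seek(prev_offset: int, new_offset: int,
                   threshold: int = SEEK_THRESHOLD) -> SeekDecision:
    """Decide how the prior-art client handles a random access within a block.

    Assumption: the offsets "differ within a threshold" when the new offset
    lies at most `threshold` bytes ahead of the previous one.
    """
    distance = new_offset - prev_offset
    if 0 <= distance <= threshold:
        # Continue the sequential stream; the client discards the gap.
        return SeekDecision(reuse_stream=True, bytes_to_discard=distance)
    # Outside the threshold: close the TCP connection and open a new one.
    return SeekDecision(reuse_stream=False, bytes_to_discard=0)
```

A jump of 100K ahead keeps the stream and discards 100K of unwanted data, while a jump of 200K forces a reconnect. Both outcomes carry a cost, which is the defect the following paragraphs describe.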
The HDFS data access method of the prior art therefore has the following defects:
Unlike sequential access, under random access the data that follows the data currently accessed is unlikely to be accessed next. The optimization designed for sequential access is therefore unsuitable for random access and causes unnecessary loss of network bandwidth. In addition, frequently establishing TCP connections costs time.
Summary of the invention
The object of the present invention is to provide an HDFS and an access method thereof that optimize random access performance while preserving HDFS sequential access performance.
To achieve this object, an access method for a distributed file system is provided, comprising the following steps:
Step 100. The distributed file system receives an access request sent by a client and sets the packet size according to the actual situation;
Step 200. The start offset of the data of the access request and the length of the requested data are obtained, and the number of chunks required is calculated;
Step 300. The chunks are packed into packets and sent to the client.
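Steps 200 and 300 come down to simple arithmetic on the offset and length. The following Python sketch is a hedged illustration rather than the patented implementation: the 512-byte chunk size, the assumption that chunks are aligned to multiples of the chunk size, and the function names are all introduced here, since the text gives none of them.

```python
from typing import List

CHUNK_SIZE = 512  # assumed chunk size in bytes; the text does not state one


def chunks_needed(offset: int, length: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Step 200: number of chunks that cover bytes [offset, offset + length)."""
    if length <= 0:
        return 0
    first = offset // chunk_size                 # chunk holding the first byte
    last = (offset + length - 1) // chunk_size   # chunk holding the last byte
    return last - first + 1


def pack_chunks(first_chunk: int, count: int, chunks_per_packet: int) -> List[range]:
    """Step 300: group consecutive chunk indices into packets."""
    end = first_chunk + count
    packets = []
    for start in range(first_chunk, end, chunks_per_packet):
        packets.append(range(start, min(start + chunks_per_packet, end)))
    return packets
```

For example, a 100-byte read at offset 1000 touches chunks 1 and 2, so `chunks_needed(1000, 100)` is 2. With 8 chunks per packet, 10 chunks starting at chunk 1 are sent as one full packet and one partial packet.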
Step 100 comprises the following steps:
Step 110. Determine whether the access request is a random access request; if so, perform step 120; otherwise, perform step 130;
Step 120. Set the packet size to a size smaller than the packet size of the original distributed file system;
Step 130. Set the packet size to the packet size of the original distributed file system.
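Steps 110 to 130 amount to a two-way choice. A minimal Python sketch follows, assuming the request type is already known as a boolean flag (the text does not say how a request is classified as random) and borrowing the 64K and 4K example sizes that the embodiment gives later in the description. The function and constant names are illustrative only.

```python
ORIGINAL_PACKET_SIZE = 64 * 1024  # example original packet size from the embodiment
REDUCED_PACKET_SIZE = 4 * 1024    # example reduced packet size from the embodiment


def choose_packet_size(is_random_access: bool,
                       original: int = ORIGINAL_PACKET_SIZE,
                       reduced: int = REDUCED_PACKET_SIZE) -> int:
    """Steps 110-130: pick the packet size for one access request."""
    if reduced >= original:
        # Step 120 requires a size smaller than the original packet size.
        raise ValueError("reduced packet size must be smaller than the original")
    # Step 110: branch on the request type.
    if is_random_access:
        return reduced   # step 120
    return original      # step 130
```

Keeping the original size as the default for sequential requests is what lets the method leave sequential access performance unchanged.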
The method further comprises the following step:
Step 400. After each random access, the client's TCP connection is kept open. On the next random access, if the data offset of the current random access request lies on the DataNode that holds the data of the previous random access request, the previous TCP connection continues to be used and no new TCP connection needs to be established.
Step 400 comprises the following steps:
Step 410. Obtain the start offset address of the data of the current random access request;
Step 420. Determine whether the data is on the DataNode that holds the data of the previous random access request; if so, perform step 430; otherwise, perform step 440;
Step 430. Keep the previous TCP connection and continue the data transfer;
Step 440. Close the previous TCP connection and establish a new TCP connection for the data transfer.
A distributed file system is also provided to achieve the object of the present invention. The system comprises:
a device for receiving the access request sent by the client to the distributed file system and setting the packet size according to the actual situation;
a device for obtaining the start offset of the data of the access request and the length of the requested data, and calculating the number of chunks required;
a device for packing the chunks into packets and sending them to the client.
The device for receiving the access request sent by the client to the distributed file system and setting the packet size according to the actual situation comprises:
a device for determining whether the access request is a random access request;
a device for setting, for a random access request, the packet size to a size smaller than the packet size of the original distributed file system;
a device for setting, for a sequential access request, the packet size to the packet size of the original distributed file system.
The system further comprises:
a device for keeping the client's TCP connection open after each random access and, on the next random access, continuing to use the previous TCP connection without establishing a new one if the data offset of the current random access request lies on the DataNode that holds the data of the previous random access request.
The device for keeping the client's TCP connection open after each random access and continuing to use it on the next random access comprises:
a device for obtaining the start offset address of the data of the current random access request;
a device for determining whether the data is on the DataNode that holds the data of the previous random access request;
a device for keeping the previous TCP connection and continuing the data transfer when the data is on the previous DataNode;
a device for closing the previous TCP connection and establishing a new TCP connection for the data transfer when the data is not on the previous DataNode.
The beneficial effects of the present invention are:
1. The present invention supports random access by changing the size of the packet, the minimum transmission unit, thereby reducing unnecessary network transmission;
2. When the present invention determines that the data offset of the current request lies on the same DataNode, the previous TCP connection continues to be used and no new TCP connection needs to be established, which saves the time overhead of establishing a TCP connection.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the access method for a distributed file system according to the present invention;
Fig. 2 is a flow chart of the steps for setting the packet size according to the actual situation in the present invention;
Fig. 3 is a flow chart of the steps for continuing to use the previous TCP connection after a random access in the present invention;
Fig. 4 is a structural diagram of the distributed file system according to the present invention.
Embodiments
To make the object, technical solution, and advantages of the present invention clearer, the distributed file system and access method of the present invention are described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
The HDFS and access method of the present invention support random access by changing the size of the packet, the minimum transmission unit, thereby reducing unnecessary network transmission. When the data offset of the current request is determined to lie on the same DataNode, the previous TCP connection continues to be used and no new TCP connection needs to be established, which saves the time overhead of establishing a TCP connection.
The access method is described in detail below. Fig. 1 is a flow chart of the steps of the access method for a distributed file system according to the present invention. As shown in Fig. 1, the method comprises the following steps:
Step 100. The distributed file system receives an access request sent by a client and sets the size of the packet according to the actual situation;
A packet is the minimum transmission unit and consists of several chunks. In the present invention the packet size can change dynamically according to the user's needs; in other words, the packet size is changed by changing the number of chunks the packet contains. If the user needs sequential access, the packet size can be set larger when the TCP connection is established; if the user needs random access, the packet size can be set smaller when the TCP connection is established.
Fig. 2 is a flow chart of the steps for setting the packet size according to the actual situation in the present invention. As shown in Fig. 2, step 100 comprises the following steps:
Step 110. Determine whether the access request is a random access request; if so, perform step 120; otherwise, perform step 130;
Step 120. Set the packet size to a size smaller than the packet size of the original distributed file system;
The present invention reduces the packet size used in current HDFS. For example, if the original packet size is 64K, the packet can be set to 4K. In this way less data is wasted during random reads and writes, and network bandwidth is saved as well.
Step 130. Set the packet size to the packet size of the original distributed file system.
If the user wishes to perform sequential access, the packet size is set larger, for example 64K, when the TCP connection between the user and the DataNode is established.
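To make the saving described for step 120 concrete, consider a worked example. It assumes a random read of 1 KB that falls entirely within one packet and that each packet is transmitted whole; the 1 KB request size is chosen purely for illustration and does not come from the text.

```latex
\begin{aligned}
\text{wasted bytes (64K packet)} &= 64\,\text{KB} - 1\,\text{KB} = 63\,\text{KB},\\
\text{wasted bytes (4K packet)}  &= 4\,\text{KB} - 1\,\text{KB} = 3\,\text{KB},\\
\frac{\text{bytes sent (64K packet)}}{\text{bytes sent (4K packet)}} &= \frac{64\,\text{KB}}{4\,\text{KB}} = 16 .
\end{aligned}
```

Under these assumptions the smaller packet sends one sixteenth of the data for the same request. A long sequential read gains nothing from the smaller size, which is why step 130 keeps the original packet size for sequential access.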
Step 200. The start offset of the data of the access request and the length of the requested data are obtained, and the number of chunks required is calculated;
Step 300. The chunks are packed into packets and sent to the client.
Step 400. After each random access, the client's TCP connection is kept open. On the next random access, if the data offset of the current random access request lies on the DataNode that holds the data of the previous random access request, the previous TCP connection continues to be used and no new TCP connection needs to be established.
For random accesses that read data on the same DataNode, the present invention needs only one TCP connection; the connection does not have to be closed and re-established.
Fig. 3 is a flow chart of the steps for continuing to use the previous TCP connection after a random access in the present invention. As shown in Fig. 3, step 400 comprises the following steps:
Step 410. Obtain the start offset address of the data of the current random access request;
When a client requests access to a file, it contacts the NameNode remotely via RPC. The NameNode stores the location information of all blocks that make up the file and, for each block, returns the address of the DataNode on which the block resides. The client then sends a TCP connection request to the DataNode address it obtained.
Step 420. Determine whether the data is on the DataNode that holds the data of the previous random access request; if so, perform step 430; otherwise, perform step 440;
Step 430. Keep the previous TCP connection and continue the data transfer;
Step 440. Close the previous TCP connection and establish a new TCP connection for the data transfer.
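Steps 410 to 440 can be sketched as a small client-side state machine. The Python below is a simulation under stated assumptions, not Hadoop code: `FakeNameNode` and `RandomReadClient` are hypothetical names, the 64 MB block size is assumed, each block is taken to live on a single DataNode, and no real sockets are opened (a counter stands in for TCP connection setup).

```python
from typing import Dict, List, Optional

BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size; not stated in the text


class FakeNameNode:
    """Stand-in for the NameNode: maps each block of a file to a DataNode address."""

    def __init__(self, block_locations: Dict[str, List[str]],
                 block_size: int = BLOCK_SIZE) -> None:
        self.block_locations = block_locations
        self.block_size = block_size

    def datanode_for(self, path: str, offset: int) -> str:
        # Look up the DataNode holding the block that contains `offset`.
        return self.block_locations[path][offset // self.block_size]


class RandomReadClient:
    """Keeps the last DataNode connection and reuses it when possible."""

    def __init__(self, namenode: FakeNameNode) -> None:
        self.namenode = namenode
        self.current_datanode: Optional[str] = None
        self.connections_opened = 0

    def seek(self, path: str, offset: int) -> str:
        target = self.namenode.datanode_for(path, offset)  # step 410
        if target == self.current_datanode:                # step 420
            return "reused"                                # step 430
        # Step 440: drop the old connection and count a new one.
        self.current_datanode = target
        self.connections_opened += 1
        return "reconnected"
```

With a file whose first two blocks sit on one DataNode and whose third block sits on another, three random reads (one per block) open only two connections, because the second read reuses the first connection.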
As one embodiment, for HDFS sequential access, the present invention comprises the following steps:
Step 1. The distributed file system receives a sequential access request sent by a client and sets the packet size to 64K;
Step 2. Obtain the start offset of the requested data and the length of the requested data;
Step 3. Determine how many chunks are needed;
Step 4. Pack the chunks into packets;
Step 5. Send the packets to the client.
As another embodiment, for HDFS random access, the present invention comprises the following steps:
Step 1. The distributed file system receives a random access request sent by a client and sets the packet size to 4K;
Step 2. Obtain the start offset of the requested data and the length of the requested data;
Step 3. Determine how many chunks are needed;
Step 4. Pack the chunks into packets;
Step 5. Send the packets to the client.
Corresponding to the HDFS access method of the present invention, a distributed file system is also provided. Fig. 4 is a structural diagram of the distributed file system according to the present invention. As shown in Fig. 4, the system comprises:
Setting module 1, for receiving the access request sent by the client to the distributed file system and setting the packet size according to the actual situation;
Information acquisition module 2, for obtaining the start offset of the data of the access request and the length of the requested data, and calculating the number of chunks required;
Sending module 3, for packing the chunks into packets and sending them to the client;
Connection module 4, for keeping the client's TCP connection open after each random access and, on the next random access, continuing to use the previous TCP connection without establishing a new one if the data offset of the current random access request lies on the DataNode that holds the data of the previous random access request.
The setting module 1 comprises:
Judging module 11, for determining whether the access request is a random access request;
Random access module 12, for setting the packet size to a size smaller than the packet size of the original distributed file system;
Sequential access module 13, for setting the packet size to the packet size of the original distributed file system.
The connection module 4 comprises:
Address acquisition module 41, for obtaining the start offset address of the data of the current random access request;
Judging submodule 42, for determining whether the data is on the DataNode that holds the data of the previous random access request; if so, the connection keeping module 43 is triggered; otherwise, the connection disconnecting module 44 is triggered;
Connection keeping module 43, for keeping the previous TCP connection and continuing the data transfer when the data is on the previous DataNode;
Connection disconnecting module 44, for closing the previous TCP connection and establishing a new TCP connection for the data transfer when the data is not on the previous DataNode.
The beneficial effects of the present invention are:
1. The present invention supports random access by changing the size of the packet, the minimum transmission unit, thereby reducing unnecessary network transmission;
2. When the present invention determines that the data offset of the current request lies on the same DataNode, the previous TCP connection continues to be used and no new TCP connection needs to be established, which saves the time overhead of establishing a TCP connection.
Other aspects and features of the present invention will be apparent to those skilled in the art from the description of the specific embodiments in conjunction with the drawings.
Specific embodiments of the present invention have been described and illustrated above. These embodiments should be considered exemplary only and do not limit the invention; the invention should be interpreted according to the appended claims.

Claims (4)

1. An access method for a distributed file system, characterized in that the method comprises the following steps:
Step 100. The distributed file system receives an access request sent by a client and sets the packet size according to the actual situation;
wherein step 100 comprises the following steps:
Step 110. Determine whether the access request is a random access request; if so, perform step 120; otherwise, perform step 130;
Step 120. Set the packet size to a size smaller than the packet size in the original distributed file system;
Step 130. Set the packet size to the packet size in the original distributed file system;
wherein the packet is the minimum transmission unit and is made up of data blocks, and the packet size is changed by changing the number of data blocks in the packet;
Step 200. The start offset of the data of the access request and the length of the requested data are obtained, and the number of data blocks required is calculated;
Step 300. The data blocks are packed into packets and sent to the client;
Step 400. After each random access, the client's TCP connection is kept open; on the next random access, if the data offset of the current random access request lies on the DataNode that holds the data of the previous random access request, the previous TCP connection continues to be used and no new TCP connection needs to be established.
2. The access method for a distributed file system according to claim 1, characterized in that step 400 comprises the following steps:
Step 410. Obtain the start offset address of the data of the current random access request;
Step 420. Determine whether the data is on the DataNode that holds the data of the previous random access request; if so, perform step 430; otherwise, perform step 440;
Step 430. Keep the previous TCP connection and continue the data transfer;
Step 440. Close the previous TCP connection and establish a new TCP connection for the data transfer.
3. A distributed file system, characterized in that the system comprises:
a device for receiving the access request sent by the client to the distributed file system and setting the packet size according to the actual situation;
wherein the device for receiving the access request sent by the client to the distributed file system and setting the packet size according to the actual situation comprises:
a device for determining whether the access request is a random access request;
a device for setting, for a random access request, the packet size to a size smaller than the packet size in the original distributed file system;
a device for setting, for a sequential access request, the packet size to the packet size in the original distributed file system;
wherein the packet is the minimum transmission unit and is made up of data blocks, and the packet size is changed by changing the number of data blocks in the packet;
a device for obtaining the start offset of the data of the access request and the length of the requested data, and calculating the number of data blocks required;
a device for packing the data blocks into packets and sending them to the client;
a device for keeping the client's TCP connection open after each random access and, on the next random access, continuing to use the previous TCP connection without establishing a new one if the data offset of the current random access request lies on the DataNode that holds the data of the previous random access request.
4. The distributed file system according to claim 3, characterized in that the device for keeping the client's TCP connection open after each random access and, on the next random access, continuing to use the previous TCP connection without establishing a new one if the data offset of the current random access request lies on the DataNode that holds the data of the previous random access request comprises:
a device for obtaining the start offset address of the data of the current random access request;
a device for determining whether the data is on the DataNode that holds the data of the previous random access request;
a device for keeping the previous TCP connection and continuing the data transfer when the data is on the previous DataNode;
a device for closing the previous TCP connection and establishing a new TCP connection for the data transfer when the data is not on the previous DataNode.
CN201110033439.0A 2011-01-30 2011-01-30 Distributed file system and access method thereof Expired - Fee Related CN102075584B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110033439.0A CN102075584B (en) 2011-01-30 2011-01-30 Distributed file system and access method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110033439.0A CN102075584B (en) 2011-01-30 2011-01-30 Distributed file system and access method thereof

Publications (2)

Publication Number Publication Date
CN102075584A CN102075584A (en) 2011-05-25
CN102075584B true CN102075584B (en) 2014-08-06

Family

ID=44033925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110033439.0A Expired - Fee Related CN102075584B (en) 2011-01-30 2011-01-30 Distributed file system and access method thereof

Country Status (1)

Country Link
CN (1) CN102075584B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103092927B (en) * 2012-12-29 2016-01-20 华中科技大学 File rapid read-write method under a kind of distributed environment
CN107168891B (en) * 2014-07-23 2020-08-14 华为技术有限公司 I/O feature identification method and device
CN105468643B (en) * 2014-09-09 2019-05-03 博雅网络游戏开发(深圳)有限公司 The access method and system of distributed file system
CN104902009B (en) * 2015-04-27 2018-02-02 浙江大学 A kind of distributed memory system based on erasable coding and chain type backup
CN107995147B (en) * 2016-10-27 2021-05-14 中国电信股份有限公司 Metadata encryption and decryption method and system based on distributed file system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101689182A (en) * 2007-06-29 2010-03-31 微软公司 Efficient updates for distributed file systems

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011002451A1 (en) * 2009-06-30 2011-01-06 Hewlett-Packard Development Company, L.P. Optimizing file block communications in a virtual distributed file system
CN101917490B (en) * 2010-09-16 2014-01-01 北京开心人信息技术有限公司 Method and system for reading cache data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Optimization of HDFS download efficiency (HDFS下载效率的优化); 曹宁 et al.; Journal of Computer Applications (《计算机应用》); 20100831; Vol. 30 (No. 8); pp. 2060-2065 *

Also Published As

Publication number Publication date
CN102075584A (en) 2011-05-25

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140806

Termination date: 20190130
