CN111708738A - Method and system for realizing data inter-access between the hadoop file system hdfs and the object storage s3
- Publication number
- CN111708738A (application CN202010482343.1A)
- Authority
- CN
- China
- Prior art keywords
- hdfs
- file system
- hadoop
- ceph
- object storage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
- G06F16/164—File meta data generation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a method and a system for realizing data inter-access between the hadoop file system hdfs and the object storage s3. The method comprises the following steps: configuring a hadoop big data environment containing the file system hdfs and a distributed storage software (ceph) environment containing the object storage s3, the two environments communicating through the namenode node on the hadoop side and the ceph-mon node on the ceph side; interfacing with the file system hdfs through the namenode node, and with the object storage s3 through the ceph-mon node; acquiring an external data access instruction; and performing data access between the file system hdfs and the object storage s3 according to the data access instruction. The invention breaks the isolation barrier that prevents files from being read and accessed across different file systems, and realizes mutual access, reading and coexistence of hdfs data and s3 data, so that the two storages complement each other's advantages, promoting the development of big data and expanding its range of application fields.
Description
Technical Field
The invention relates to the field of computer data interaction, and in particular to a method and a system for realizing data inter-access between the hadoop file system hdfs and the object storage s3.
Background
Hadoop is an open-source big data framework developed by the Apache foundation; it is a software platform for developing and running applications over large-scale data. Its three core components are the distributed file system hdfs, the job scheduling and cluster resource management framework yarn, and the distributed programming framework MapReduce. These components address the three core problems of a big data framework: how to store massive data, how to schedule computing-resource tasks, and how to run computations over massive data. hdfs is a distributed file system designed for streaming access to and processing of massive large files, and it does not strictly comply with the POSIX standard. Owing to its high fault tolerance and high bandwidth, it is well suited to deployment on large numbers of inexpensive machines and to large-scale hadoop big data applications.
By relaxing full POSIX compliance, hdfs achieves streaming reads of massive large files, and by moving computation to the data it supports large data scale, high file concurrency and large node counts, securing its benchmark position in the field of big data computing. Its disadvantages are equally obvious. First, it is not suitable for low-latency (e.g. ms-level) data access: the hdfs file system favors large concurrent IO, and its support for applications with high IOPS requirements is not good enough. In addition, its read-write support for large numbers of small files is poor, so it is unsuitable for many-small-file workloads such as image processing.
Among current mainstream distributed storage systems, the object storage system s3 can make up for these disadvantages of hdfs. Object storage uses a distinctive storage model that differs from both file storage and block storage: its file interface is REST-style, and its underlying file layout is a flattened structure based on key-value pairs. This flat data organization can handle massive, highly concurrent small-file access, eliminates the dependence on a central metadata service, and can provide high IOPS, which suits the characteristics of the current big data era very well. With the rapid development of the internet, data volumes are growing exponentially; using the object storage s3 together with the hdfs file system can accelerate the development of big data regardless of data model, data scale, or whether the data is structured.
However, hdfs and s3 are two completely different designs that use completely different file access and reading interfaces, and a barrier to file read access exists between the two file systems; the prior art therefore has problems and needs further improvement.
Disclosure of Invention
To remedy the defects of the prior art, the invention provides a method and a system for realizing data inter-access between the hadoop file system hdfs and the object storage s3.
To achieve this purpose, the specific technical scheme of the invention is as follows:
A method for realizing data inter-access between the hadoop file system hdfs and the object storage s3 comprises the following steps:
configuring a hadoop big data environment containing the file system hdfs and a distributed storage software (ceph) environment containing the object storage s3, the two environments communicating through the namenode node on the hadoop side and the ceph-mon node on the ceph side;
interfacing with the file system hdfs through the namenode node, and with the object storage s3 through the ceph-mon node;
acquiring an external data access instruction;
and performing data access between the file system hdfs and the object storage s3 according to the data access instruction.
Preferably, acquiring the external data access instruction includes: when the client hadoop-client in the hadoop big data environment writes a large file, the file is processed by the namenode node, written to a datanode, and then stored in the hdfs file system.
Preferably, acquiring the external data access instruction includes: when the client hadoop-client in the hadoop big data environment writes a small file, the namenode computation invokes the ceph-mon information, and the file is then written into the object storage through the s3 interface.
Preferably, acquiring the external data access instruction includes: when MapReduce is executed, files are copied between the hdfs file system and the s3 object storage by computing metadata information through the namenode.
Further, when MapReduce is executed, the calculation result can be stored in hdfs or in the object storage s3 according to a user-defined selection.
The invention also provides a system for realizing data inter-access between the hadoop file system hdfs and the object storage s3, comprising: a hadoop big data environment containing the file system hdfs, and a distributed storage software (ceph) environment containing the object storage s3.
The hadoop big data environment and the distributed storage software ceph environment communicate through the namenode node and the ceph-mon node; the file system hdfs is interfaced through the namenode node, and the object storage s3 is interfaced through the ceph-mon node.
The invention breaks the isolation barrier that prevents files from being read and accessed across different file systems, and realizes mutual access, reading and coexistence of hdfs data and s3 data, so that the two storages complement each other's advantages, promoting the development of big data and expanding its range of application fields.
Drawings
FIG. 1 is a flowchart of a method for realizing data inter-access between the hadoop file system hdfs and the object storage s3 according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a system for realizing data inter-access between the hadoop file system hdfs and the object storage s3 according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art may understand and implement the present invention, embodiments of the invention are further described below with reference to the accompanying drawings.
Referring to fig. 1, the present invention provides a method for realizing data inter-access between the hadoop file system hdfs and the object storage s3, comprising the steps of:
S11, configuring a hadoop big data environment containing the file system hdfs and a distributed storage software (ceph) environment containing the object storage s3, the two environments communicating through the namenode node on the hadoop side and the ceph-mon node on the ceph side;
S12, interfacing with the file system hdfs through the namenode node, and with the object storage s3 through the ceph-mon node;
S13, acquiring an external data access instruction;
S14, performing data access between the file system hdfs and the object storage s3 according to the data access instruction.
In step S13, acquiring the external data access instruction includes the following cases:
(1) when the client hadoop-client in the hadoop big data environment writes a large file, the file is processed by the namenode node, written to a datanode node, and then stored in the hdfs file system;
(2) when the client hadoop-client in the hadoop big data environment writes a small file, the namenode computation invokes the ceph-mon information, and the file is then written into the object storage through the s3 interface;
(3) when MapReduce is executed, files are copied between the hdfs file system and the s3 object storage by computing metadata information through the namenode. The computed results may also be stored in hdfs or in the object storage s3 service according to a user-defined selection.
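The three cases above amount to a size-based dispatch rule plus a user-selected result target. The sketch below is illustrative: the 64 MiB threshold is an assumption, not a value from the patent, which does not specify where the large/small boundary lies.

```python
# Illustrative sketch of the dispatch in cases (1)-(3): large files go through
# the namenode to hdfs datanodes; small files go through the s3 interface into
# object storage. The 64 MiB cut-off is a hypothetical threshold.
SMALL_FILE_THRESHOLD = 64 * 1024 * 1024  # bytes; assumed, not from the patent

def choose_backend(file_size: int) -> str:
    """Return which storage backend a write of `file_size` bytes targets."""
    if file_size >= SMALL_FILE_THRESHOLD:
        return "hdfs"  # namenode -> datanode -> hdfs file system
    return "s3"        # namenode -> ceph-mon -> s3 object storage

def result_target(user_choice: str) -> str:
    """Case (3): a MapReduce result goes to either backend by user selection."""
    if user_choice not in ("hdfs", "s3"):
        raise ValueError("target must be 'hdfs' or 's3'")
    return user_choice
```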
Referring to fig. 2, the present invention further provides a system for implementing data exchange between hadoop file system hdfs and object storage s3, including: a hadoop big data environment containing a file system hdfs, and a distributed storage software ceph environment containing an object storage s 3.
The method comprises the steps that a hadoop big data environment and a distributed storage software ceph environment are communicated with a ceph-mon node through a namenode node; the file system hdfs is interfaced through the namenode node and the object store s3 is interfaced through the ceph-mon node.
For the access system, consider the original situation when the client hadoop-client in a hadoop big data environment writes a large number of small files: regardless of how small the data is, the hadoop cluster records roughly 150 bytes of metadata per block in the memory of the metadata (namenode) node. When hundreds of millions of small files are written simultaneously, each file occupies at least one block, and the metadata node needs about 20 GB of memory for this bookkeeping alone. This severely limits hadoop cluster performance.
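The memory pressure described above can be checked with a quick back-of-the-envelope calculation, assuming about 150 bytes of namenode metadata per object and one block per small file (figures as stated in the text; actual per-object overhead varies by Hadoop version).

```python
# Estimate namenode heap consumed by small-file metadata, using the ~150 bytes
# per file/block entry cited in the text. One-object-per-file is an assumption.
METADATA_BYTES_PER_OBJECT = 150

def namenode_memory_gib(num_small_files: int) -> float:
    """Estimated namenode memory (GiB) for metadata of `num_small_files`."""
    return num_small_files * METADATA_BYTES_PER_OBJECT / 2**30

# ~140 million small files already need on the order of 20 GiB of heap:
print(round(namenode_memory_gib(140_000_000), 1))
```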
With the present method, however, a large number of small files are written by the client directly into the object storage s3. Using object storage's flattened, key-value-pair-based layout eliminates the metadata dependence and breaks down the barrier between the hdfs file system and the object storage system; masses of previously stored small files can be copied directly into the object storage, freeing the original hadoop cluster's space without losing any files, so a win-win effect is achieved.
In the invention, a hadoop big data environment and a distributed storage software ceph environment are actually deployed, with ceph providing an s3 interface to support the object storage service. Since the hadoop-aws module provides AWS integration by default and ceph provides an AWS-compatible object storage s3 interface, communication between the hadoop big data environment and the distributed storage software ceph environment is possible.
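To make the hadoop-aws module talk to a Ceph RADOS Gateway instead of Amazon, the usual approach is to point the s3a connector's endpoint at the gateway and enable path-style addressing. The property names below are the standard `fs.s3a.*` keys of the hadoop-aws module; the endpoint host, credentials and SSL setting are placeholders, not values from the patent.

```python
# Hadoop core-site.xml properties (as a dict) that one would typically set so
# that s3a:// URIs resolve to a Ceph RGW endpoint rather than AWS S3.
# Host, port and credentials are placeholders.
s3a_conf = {
    "fs.s3a.endpoint": "http://ceph-rgw.example.com:7480",
    "fs.s3a.access.key": "CEPH_ACCESS_KEY",
    "fs.s3a.secret.key": "CEPH_SECRET_KEY",
    # RGW buckets are usually addressed path-style, not virtual-host-style:
    "fs.s3a.path.style.access": "true",
    "fs.s3a.connection.ssl.enabled": "false",
}

def to_core_site_xml(conf: dict) -> str:
    """Render the dict as <property> entries of a core-site.xml fragment."""
    entries = "".join(
        f"  <property><name>{k}</name><value>{v}</value></property>\n"
        for k, v in conf.items()
    )
    return f"<configuration>\n{entries}</configuration>\n"

print(to_core_site_xml(s3a_conf))
```

With such a configuration in place, tools like `hadoop distcp` (or `hdfs dfs -cp`) can move data between `hdfs://` and `s3a://` paths in either direction, which matches the copy path described for MapReduce above.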
Through the technical scheme of the invention, hadoop big data applications can write files to both hdfs and the object storage, transfer hdfs file data into the object storage and object storage data into hdfs, and let different file systems coexist with mutual file access. This satisfies the need to place files of different sizes on different storages and the need for high bandwidth and high IO to coexist under hadoop big data applications, accelerating the development of the big data application industry.
The above-mentioned embodiments express only several embodiments of the present invention, and their description is specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (6)
1. A method for realizing data inter-access between the hadoop file system hdfs and the object storage s3, characterized by comprising the following steps:
configuring a hadoop big data environment containing the file system hdfs and a distributed storage software (ceph) environment containing the object storage s3, the two environments communicating through the namenode node on the hadoop side and the ceph-mon node on the ceph side;
interfacing with the file system hdfs through the namenode node, and with the object storage s3 through the ceph-mon node;
acquiring an external data access instruction;
and performing data access between the file system hdfs and the object storage s3 according to the data access instruction.
2. The method of claim 1, wherein acquiring the external data access instruction comprises:
when the client hadoop-client in the hadoop big data environment writes a large file, the file is processed by the namenode node, written to a datanode, and then stored in the hdfs file system.
3. The method of claim 1, wherein acquiring the external data access instruction comprises:
when the client hadoop-client in the hadoop big data environment writes a small file, the namenode computation invokes the ceph-mon information, and the file is then written into the object storage through the s3 interface.
4. The method of claim 1, wherein acquiring the external data access instruction comprises:
when MapReduce is executed, files are copied between the hdfs file system and the s3 object storage by computing metadata information through the namenode.
5. The method of claim 4, wherein, when MapReduce is executed, the calculation result can be stored in hdfs or in the object storage s3 according to a user-defined selection.
6. A system for realizing data inter-access between the hadoop file system hdfs and the object storage s3, characterized by comprising: a hadoop big data environment containing the file system hdfs, and a distributed storage software (ceph) environment containing the object storage s3,
wherein the hadoop big data environment and the distributed storage software ceph environment communicate through the namenode node and the ceph-mon node; the file system hdfs is interfaced through the namenode node, and the object storage s3 is interfaced through the ceph-mon node.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010482343.1A CN111708738B (en) | 2020-05-29 | 2020-05-29 | Method and system for realizing interaction of hadoop file system hdfs and object storage s3 data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010482343.1A CN111708738B (en) | 2020-05-29 | 2020-05-29 | Method and system for realizing interaction of hadoop file system hdfs and object storage s3 data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111708738A true CN111708738A (en) | 2020-09-25 |
CN111708738B CN111708738B (en) | 2023-11-03 |
Family
ID=72538444
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010482343.1A Active CN111708738B (en) | 2020-05-29 | 2020-05-29 | Method and system for realizing interaction of hadoop file system hdfs and object storage s3 data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111708738B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112965950A (en) * | 2021-03-09 | 2021-06-15 | 浪潮云信息技术股份公司 | Method for realizing storage of stream data write-in object |
CN113127420A (en) * | 2021-03-30 | 2021-07-16 | 山东英信计算机技术有限公司 | Metadata request processing method, device, equipment and medium |
CN114185490A (en) * | 2021-12-06 | 2022-03-15 | 深圳市瑞驰信息技术有限公司 | Method for realizing data exchange between glusterfs file system and object storage s3 |
US20220100878A1 (en) * | 2020-09-25 | 2022-03-31 | EMC IP Holding Company LLC | Facilitating an object protocol based access of data within a multiprotocol environment |
CN114500485A (en) * | 2022-01-28 | 2022-05-13 | 北京沃东天骏信息技术有限公司 | Data processing method and device |
WO2022116766A1 (en) * | 2020-12-04 | 2022-06-09 | 中兴通讯股份有限公司 | Data storage system and construction method therefor |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130151884A1 (en) * | 2011-12-09 | 2013-06-13 | Promise Technology, Inc. | Cloud data storage system |
CN103581332A (en) * | 2013-11-15 | 2014-02-12 | 武汉理工大学 | HDFS framework and pressure decomposition method for NameNodes in HDFS framework |
US20170154039A1 (en) * | 2015-11-30 | 2017-06-01 | International Business Machines Corporation | Enabling a hadoop file system with posix compliance |
US9928203B1 (en) * | 2015-07-15 | 2018-03-27 | Western Digital | Object storage monitoring |
CN109033250A (en) * | 2018-07-06 | 2018-12-18 | 内蒙古大学 | A kind of high availability object storage method for supporting large data files access service |
CN109783438A (en) * | 2018-12-05 | 2019-05-21 | 南京华讯方舟通信设备有限公司 | Distributed NFS system and its construction method based on librados |
CN110287150A (en) * | 2019-05-16 | 2019-09-27 | 中国科学院信息工程研究所 | A kind of large-scale storage systems meta-data distribution formula management method and system |
CN110688674A (en) * | 2019-09-23 | 2020-01-14 | 中国银联股份有限公司 | Access butt-joint device, system and method and device applying access butt-joint device |
CN110750458A (en) * | 2019-10-22 | 2020-02-04 | 恩亿科(北京)数据科技有限公司 | Big data platform testing method and device, readable storage medium and electronic equipment |
- 2020-05-29: CN application CN202010482343.1A filed; granted as patent CN111708738B (Active)
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130151884A1 (en) * | 2011-12-09 | 2013-06-13 | Promise Technology, Inc. | Cloud data storage system |
CN103581332A (en) * | 2013-11-15 | 2014-02-12 | 武汉理工大学 | HDFS framework and pressure decomposition method for NameNodes in HDFS framework |
US9928203B1 (en) * | 2015-07-15 | 2018-03-27 | Western Digital | Object storage monitoring |
US20170154039A1 (en) * | 2015-11-30 | 2017-06-01 | International Business Machines Corporation | Enabling a hadoop file system with posix compliance |
CN109033250A (en) * | 2018-07-06 | 2018-12-18 | 内蒙古大学 | A kind of high availability object storage method for supporting large data files access service |
CN109783438A (en) * | 2018-12-05 | 2019-05-21 | 南京华讯方舟通信设备有限公司 | Distributed NFS system and its construction method based on librados |
CN110287150A (en) * | 2019-05-16 | 2019-09-27 | 中国科学院信息工程研究所 | A kind of large-scale storage systems meta-data distribution formula management method and system |
CN110688674A (en) * | 2019-09-23 | 2020-01-14 | 中国银联股份有限公司 | Access butt-joint device, system and method and device applying access butt-joint device |
CN110750458A (en) * | 2019-10-22 | 2020-02-04 | 恩亿科(北京)数据科技有限公司 | Big data platform testing method and device, readable storage medium and electronic equipment |
Non-Patent Citations (1)
Title |
---|
陈涛: "UMStor Hadapter:大数据与对象存储的柳暗花明", 《HTTPS://MP.WEIXIN.QQ.COM/S/-NSD9WOG5BNOV0RTWGEABW,UMCLOUD优铭云》, pages 1 - 11 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220100878A1 (en) * | 2020-09-25 | 2022-03-31 | EMC IP Holding Company LLC | Facilitating an object protocol based access of data within a multiprotocol environment |
US11928228B2 (en) * | 2020-09-25 | 2024-03-12 | EMC IP Holding Company LLC | Facilitating an object protocol based access of data within a multiprotocol environment |
WO2022116766A1 (en) * | 2020-12-04 | 2022-06-09 | 中兴通讯股份有限公司 | Data storage system and construction method therefor |
CN112965950A (en) * | 2021-03-09 | 2021-06-15 | 浪潮云信息技术股份公司 | Method for realizing storage of stream data write-in object |
CN113127420A (en) * | 2021-03-30 | 2021-07-16 | 山东英信计算机技术有限公司 | Metadata request processing method, device, equipment and medium |
CN114185490A (en) * | 2021-12-06 | 2022-03-15 | 深圳市瑞驰信息技术有限公司 | Method for realizing data exchange between glusterfs file system and object storage s3 |
CN114500485A (en) * | 2022-01-28 | 2022-05-13 | 北京沃东天骏信息技术有限公司 | Data processing method and device |
Also Published As
Publication number | Publication date |
---|---|
CN111708738B (en) | 2023-11-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111708738B (en) | Method and system for realizing interaction of hadoop file system hdfs and object storage s3 data | |
US11157457B2 (en) | File management in thin provisioning storage environments | |
US20160292249A1 (en) | Dynamic replica failure detection and healing | |
US10331669B2 (en) | Fast query processing in columnar databases with GPUs | |
CN110413685B (en) | Database service switching method, device, readable storage medium and computer equipment | |
US10356150B1 (en) | Automated repartitioning of streaming data | |
US9218136B2 (en) | Cloud scale directory services | |
US11157456B2 (en) | Replication of data in a distributed file system using an arbiter | |
US20240220334A1 (en) | Data processing method in distributed system, and related system | |
CN113204520B (en) | Remote sensing data rapid concurrent read-write method based on distributed file system | |
CN105808451B (en) | Data caching method and related device | |
US10565202B2 (en) | Data write/import performance in a database through distributed memory | |
US20150186269A1 (en) | Managing memory | |
CN107102898B (en) | Memory management and data structure construction method and device based on NUMA (non Uniform memory Access) architecture | |
AU2021268828B2 (en) | Secure data replication in distributed data storage environments | |
CN115525618A (en) | Storage cluster, data storage method, system and storage medium | |
CN111782647A (en) | Block data storage method, system, medium and equipment of EOS network | |
CN115604290B (en) | Kafka message execution method, device, equipment and storage medium | |
US9251100B2 (en) | Bitmap locking using a nodal lock | |
CN107844258A (en) | Data processing method, client, node server and distributed file system | |
US11593498B2 (en) | Distribution of user specific data elements in a replication environment | |
US11526490B1 (en) | Database log performance | |
US8688857B1 (en) | Filtering messages based on pruning profile generated from pruning profile schema | |
CN116521066A (en) | Fragment transfer method and device | |
CN116701349A (en) | Data migration method, system, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |