CN103500089A - Small file storage system suitable for Mapreduce calculation model - Google Patents

Small file storage system suitable for Mapreduce calculation model

Publication number
CN103500089A
Authority
CN
China
Legal status
Pending
Application number
CN201310430402.0A
Other languages
Chinese (zh)
Inventor
王雷
王鲁俊
龙翔
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
2013-09-18
Filing date
2013-09-18
Publication date
2014-01-08
Application filed by Beihang University
Priority to CN201310430402.0A
Publication of CN103500089A
Legal status: Pending

Abstract

The invention discloses a method (shown in Fig. 1) for merging small files into a large file online in Hadoop's HDFS, which reduces the number of map tasks started in MapReduce. The invention mainly provides a new interface for uploading small files, together with a corresponding input format class. With the upload interface and the input class, small files can be stored and processed online.

Description

A small file storage system suitable for the MapReduce computation model
Technical field
The present invention relates to the fields of MapReduce and small file storage, and in particular to a small file storage system suitable for the MapReduce computation model.
Background
Hadoop is a distributed architecture developed by Doug Cutting and his development team at Yahoo. Guided by the papers Google published on GFS and MapReduce, the team used Java to build an implementation of MapReduce similar to Google's, namely Hadoop, together with a distributed file system, HDFS.
The small file problem has gradually drawn attention in both academia and industry. The well-known social networking site Facebook has stored 260 billion pictures with a total capacity of more than 20 PB, and the vast majority of these files are smaller than 64 MB. Most data accessed on the Internet consists of small files with a high access frequency.
Sean Quinlan, the technical lead for GFS, mentioned in an interview about GFS that one of BigTable's application scenarios targets small files. A report on the Small Files Problem published by Cloudera, a well-known Hadoop company, also points out that Hadoop has problems when processing massive numbers of small files.
Processing such small files causes serious problems for the performance and scalability of HDFS. First, massive numbers of small files produce a large amount of metadata. Because the metadata of every directory and file in HDFS is kept in the memory of the NameNode, a large number of small files inevitably reduces the storage efficiency and capacity of the whole storage system. For example, if the system holds 10 million small files and each small file occupies one block, the NameNode needs about 3 GB of memory, so the memory size of the NameNode severely restricts the expansion of the cluster. Second, accessing a large number of small files is far slower than accessing a few large files, because the reader must keep jumping from one DataNode to another, which is an inefficient data access pattern. Third, the number of map tasks differs greatly between large files and small files. For example, one 1 GB file is divided into 16 blocks of 64 MB, whereas 10000 files of 100 KB (about 1 GB in total) each need their own map task, so the final MapReduce job may run hundreds of times longer than the job on the single 1 GB file. Although Hadoop uses techniques such as JVM reuse, these still do not solve the problems well.
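The two figures in the paragraph above can be checked with a short calculation. This is only a sketch: the 150 bytes per NameNode object is a commonly quoted rule of thumb and an assumption here, not a number taken from this document, and the map task count assumes the default behaviour of one task per block with no split spanning two files.

```python
import math

BYTES_PER_OBJECT = 150          # assumed rule of thumb, not from the source
BLOCK_SIZE = 64 * 1024 ** 2     # 64 MB HDFS block, as in the text


def namenode_memory_bytes(num_files, blocks_per_file=1):
    """Estimate NameNode memory: one object per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


def map_tasks(file_sizes):
    """One map task per block, and at least one per file."""
    return sum(max(1, math.ceil(size / BLOCK_SIZE)) for size in file_sizes)


print(namenode_memory_bytes(10_000_000) / 1e9)   # 3.0 (GB)
print(map_tasks([1024 ** 3]))                    # 16
print(map_tasks([100 * 1024] * 10_000))          # 10000
```

Under these assumptions the same volume of data needs 16 map tasks as one large file and 10000 map tasks as small files, which is the gap the invention aims to close.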
Hadoop itself provides Hadoop Archive (HAR) for merging small files into large files. A HAR file works by building a layered file system on top of HDFS. A HAR file is created with Hadoop's archive command, which actually runs a MapReduce job to pack the small files into the HAR file.
Summary of the invention
The present invention designs a method for merging and storing small files online, and provides support suited to the MapReduce computation process.
First, when the first file is uploaded to a Hadoop directory that holds small files, the system creates a large file of 64 MB (referred to as a block). The content of the small file is written starting at file offset 0. At the end of the block the system writes the number of small files currently stored in the block (1), followed by the small file's name, its offset within the block, and its size. For each later upload to this directory, the content of the small file is written at the beginning of the block's current blank region; its name, offset within the block, and size are written at the end of the blank region; and the small file count at the end of the block is updated. In other words, the contents of small files are stored one after another from the beginning of the block, the index information of the small files is stored one after another from the end of the block, and the small file count is updated on every write.
The storage layout is shown in Fig. 1.
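The layout described above can be sketched in a few lines. The patent does not specify how the count and the index entries are encoded, so the byte format below (a 4-byte count and fixed 80-byte entries) and the class name SmallFileBlock are hypothetical choices made only for illustration.

```python
import struct

BLOCK_SIZE = 64 * 1024 * 1024        # 64 MB, as in the description
COUNT = struct.Struct(">I")          # hypothetical: count in the last 4 bytes
ENTRY = struct.Struct(">64sQQ")      # hypothetical entry: name, offset, size


class SmallFileBlock:
    def __init__(self, block_size=BLOCK_SIZE):
        self.buf = bytearray(block_size)
        self.count = 0
        self.data_end = 0                            # start of the blank region
        self.index_start = block_size - COUNT.size   # end of the blank region

    def append(self, name, content):
        raw_name = name.encode("utf-8")
        if len(raw_name) > 64:
            raise ValueError("file name too long for the fixed-width entry")
        if len(content) + ENTRY.size > self.index_start - self.data_end:
            raise ValueError("block is full")
        offset = self.data_end
        # Content goes at the beginning of the blank region.
        self.buf[offset:offset + len(content)] = content
        self.data_end += len(content)
        # The index entry goes at the end of the blank region.
        self.index_start -= ENTRY.size
        ENTRY.pack_into(self.buf, self.index_start, raw_name, offset, len(content))
        # The count at the very end of the block is updated on every write.
        self.count += 1
        COUNT.pack_into(self.buf, len(self.buf) - COUNT.size, self.count)
        return offset


# A 1 KB block keeps the demonstration small.
block = SmallFileBlock(block_size=1024)
first = block.append("a.txt", b"hello")
second = block.append("b.txt", b"world!")
print(first, second, block.count)
```

Contents grow upward from offset 0 while index entries grow downward from the tail, so a block is full when the two regions would meet.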
When MapReduce reads these small files, it first parses the index information recorded for them in the block, then organizes them into key-value pairs that are processed in map. An input class that can read small files stored under this merging scheme is therefore required.
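The reading step can be sketched as a generator that turns one block into key-value pairs. The on-disk encoding is not given in the patent, so the 4-byte count and the fixed 80-byte index entries used here are hypothetical, as is the function name iter_small_files.

```python
import struct

COUNT = struct.Struct(">I")          # hypothetical: count in the last 4 bytes
ENTRY = struct.Struct(">64sQQ")      # hypothetical entry: name, offset, size


def iter_small_files(block):
    """Yield (file name, content) pairs in upload order."""
    count_pos = len(block) - COUNT.size
    (count,) = COUNT.unpack_from(block, count_pos)
    for i in range(1, count + 1):
        # The i-th uploaded file has its entry i entries before the count.
        raw_name, offset, size = ENTRY.unpack_from(block, count_pos - i * ENTRY.size)
        name = raw_name.rstrip(b"\0").decode("utf-8")
        yield name, bytes(block[offset:offset + size])


# Build a tiny 256-byte block by hand to exercise the reader.
demo = bytearray(256)
demo[0:5] = b"hello"
demo[5:11] = b"world!"
ENTRY.pack_into(demo, 252 - 80, b"a.txt", 0, 5)
ENTRY.pack_into(demo, 252 - 160, b"b.txt", 5, 6)
COUNT.pack_into(demo, 252, 2)

for key, value in iter_small_files(demo):
    print(key, value)
```

Each yielded pair corresponds to one record handed to map, with the file name as the key and the file content as the value.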
The MapReduce framework in Hadoop relies on InputFormat to supply data and on OutputFormat to output data; every MapReduce program performs its input and output through these classes. Hadoop provides a series of InputFormat and OutputFormat classes to make development easier. TextInputFormat reads plain text files: the file is split into lines ending in LF or CR, the key is the position of each line (its offset, of type LongWritable), and the value is the content of the line, of type Text. KeyValueTextInputFormat also reads text files: if a line contains the separator (tab by default), the part before the separator is the key and the remainder is the value; if there is no separator, the whole line is the key and the value is empty. SequenceFileInputFormat reads sequence files, the binary files Hadoop uses to store data in a user-defined format. It has two subclasses: SequenceFileAsBinaryInputFormat, which reads keys and values as BytesWritable, and SequenceFileAsTextInputFormat, which reads keys and values as Text.
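The record shapes produced by the two text formats can be imitated outside Hadoop. The functions below are a plain Python imitation of the behaviour described above, not Hadoop code; the names text_records and key_value_records are made up for this sketch.

```python
def text_records(data):
    """Imitate TextInputFormat: key is the byte offset of the line, value is the line."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip(b"\r\n").decode("utf-8")
        offset += len(line)


def key_value_records(data, separator="\t"):
    """Imitate KeyValueTextInputFormat: split each line at the first separator."""
    for _, line in text_records(data):
        key, _, value = line.partition(separator)
        # With no separator, partition leaves the whole line in key and value empty.
        yield key, value


sample = b"k1\tv1\nno-separator\nk2\tv2\tmore\n"
print(list(text_records(sample)))
print(list(key_value_records(sample)))
```

The third line shows that only the first tab splits the record, so the value may itself contain tabs.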
The present invention requires a custom input class, SmallBulkInputFormat, which reads the small files out of the large block file and passes each small file to the map operation as one key-value pair (a pattern that is very common in fields such as large-scale image processing).
Brief description of the drawings
Fig. 1 is a schematic diagram of the layout of small files within a block.
Embodiment
Step 1: Improve the flow by which HDFS reads and writes small files.
When Hadoop writes small files, several large 64 MB block files are first generated in advance. When the NameServer receives a client's request to write a file, it selects a DataServer according to load balancing to receive the write request and sends the information of that DataServer to the client. The client then calls the improved write function interface (implemented identically to the original write function interface; only the function name differs). After the DataServer receives the write request, it first selects one of the preallocated 64 MB files, writes the content of the small file in the request into this large file, and records the index information of the small file in the file.
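The write flow in Step 1 can be modelled with two small classes. The patent only says that a DataServer is chosen "according to load balancing", so the policy used here (fewest bytes stored) is an assumption, and every name in the sketch (DataServer, NameServer, client_write) is illustrative rather than a real HDFS API. For brevity the sketch ignores the space that index entries take inside a block.

```python
class DataServer:
    def __init__(self, name, num_blocks, block_size):
        self.name = name
        self.block_size = block_size
        # Preallocated blocks, each tracked by bytes used and its index entries.
        self.blocks = [{"used": 0, "index": []} for _ in range(num_blocks)]

    def load(self):
        return sum(block["used"] for block in self.blocks)

    def write(self, filename, content):
        # Pick the first preallocated block with enough room left.
        for block_id, block in enumerate(self.blocks):
            if block["used"] + len(content) <= self.block_size:
                offset = block["used"]
                block["index"].append((filename, offset, len(content)))
                block["used"] += len(content)
                return block_id, offset
        raise RuntimeError("no preallocated block has room")


class NameServer:
    def __init__(self, data_servers):
        self.data_servers = data_servers

    def choose(self):
        # Assumed policy: the server with the fewest bytes stored.
        return min(self.data_servers, key=lambda server: server.load())


def client_write(name_server, filename, content):
    server = name_server.choose()             # NameServer returns the target
    block_id, offset = server.write(filename, content)
    return server.name, block_id, offset


ns = NameServer([DataServer("ds1", 2, 100), DataServer("ds2", 2, 100)])
results = [
    client_write(ns, "a", b"x" * 60),
    client_write(ns, "b", b"y" * 60),
    client_write(ns, "c", b"z" * 50),
]
print(results)
```

The third write lands in the second block of ds1 because its first block no longer has 50 free bytes.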
Step 2: Provide the new input class.
First, define SmallBulkInputFormat, which inherits from FileInputFormat. The core code is as follows:
[Image: core code listing, not reproduced in this text]
Step 3: Developers use the new input class.
Developers write files with the new write interface function and set the input format of the Job to the SmallBulkInputFormat class.

Claims (2)

1. An online HDFS small file storage method, characterized in that small files are stored online, rather than packed offline into an archive file as in the HAR approach. The invention provides a new interface function for uploading small files, which is used to store small files online.
2. A new input format, SmallBulkInputFormat, characterized in that: by using this input format class, the small files created through the new small file upload interface can be passed to the map operation one by one as key-value pairs.
CN201310430402.0A 2013-09-18 2013-09-18 Small file storage system suitable for Mapreduce calculation model Pending CN103500089A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310430402.0A CN103500089A (en) 2013-09-18 2013-09-18 Small file storage system suitable for Mapreduce calculation model

Publications (1)

Publication Number Publication Date
CN103500089A true CN103500089A (en) 2014-01-08

Family

ID=49865304

Country Status (1)

Country Link
CN (1) CN103500089A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102222092A (en) * 2011-06-03 2011-10-19 复旦大学 Massive high-dimension data clustering method for MapReduce platform
CN102902716A (en) * 2012-08-27 2013-01-30 苏州两江科技有限公司 Storage system based on Hadoop distributed computing platform

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
张春明 等: "一种Hadoop小文件存储和读取的方法", 《计算机应用与软件》, vol. 29, no. 11, 15 November 2012 (2012-11-15) *
江柳: "HDFS下小文件存储优化相关技术研究", 《中国优秀硕士学位论文全文数据库信息科技辑》, no. 9, 15 September 2011 (2011-09-15) *
洪旭升 等: "基于MapFile的HDFS小文件存储效率问题", 《计算机系统应用》, vol. 21, no. 11, 15 November 2012 (2012-11-15) *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970874A (en) * 2014-05-14 2014-08-06 浪潮(北京)电子信息产业有限公司 Method and device for processing Hadoop files
CN104331428A (en) * 2014-10-20 2015-02-04 暨南大学 Storage and access method of small files and large files
CN104331428B (en) * 2014-10-20 2017-07-04 暨南大学 The storage of a kind of small documents and big file and access method
CN105139281A (en) * 2015-08-20 2015-12-09 北京中电普华信息技术有限公司 Method and system for processing big data of electric power marketing
CN106708606B (en) * 2015-11-17 2020-07-07 阿里巴巴集团控股有限公司 Data processing method and device based on MapReduce
CN106708606A (en) * 2015-11-17 2017-05-24 阿里巴巴集团控股有限公司 MapReduce based data processing method and MapReduce based data processing device
WO2017084509A1 (en) * 2015-11-17 2017-05-26 阿里巴巴集团控股有限公司 Mapreduce-based data processing method and device
CN106855872A (en) * 2015-12-08 2017-06-16 山东商务职业学院 The method for quickly retrieving of the mass picture based on Hadoop platform
CN106855861A (en) * 2015-12-09 2017-06-16 北京金山安全软件有限公司 File merging method and device and electronic equipment
CN107045422A (en) * 2016-02-06 2017-08-15 华为技术有限公司 Distributed storage method and equipment
WO2017133216A1 (en) * 2016-02-06 2017-08-10 华为技术有限公司 Distributed storage method and device
US11301154B2 (en) 2016-02-06 2022-04-12 Huawei Technologies Co., Ltd. Distributed storage method and device
US11809726B2 (en) 2016-02-06 2023-11-07 Huawei Technologies Co., Ltd. Distributed storage method and device
CN107948334A (en) * 2018-01-09 2018-04-20 无锡华云数据技术服务有限公司 Data processing method based on distributed memory system
CN110018997A (en) * 2019-03-08 2019-07-16 中国农业科学院农业信息研究所 A kind of mass small documents storage optimization method based on HDFS
CN110018997B (en) * 2019-03-08 2021-07-23 中国农业科学院农业信息研究所 Mass small file storage optimization method based on HDFS
CN110321329A (en) * 2019-06-18 2019-10-11 中盈优创资讯科技有限公司 Data processing method and device based on big data
CN110457265A (en) * 2019-08-20 2019-11-15 上海商汤智能科技有限公司 Data processing method, device and storage medium
CN111221472A (en) * 2019-12-26 2020-06-02 天津中科曙光存储科技有限公司 Multi-block allocation strategy optimization method and system for disk space allocation
CN111221472B (en) * 2019-12-26 2023-08-25 天津中科曙光存储科技有限公司 Multi-block allocation strategy optimization method and system for disk space allocation
CN113568877A (en) * 2020-04-28 2021-10-29 杭州海康威视数字技术股份有限公司 File merging method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140108
