CN103500089A - Small file storage system suitable for Mapreduce calculation model - Google Patents
- Publication number
- CN103500089A (application number CN201310430402.0A)
- Authority
- CN
- China
- Prior art keywords
- small documents
- file
- mapreduce
- hadoop
- small
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method (shown in Fig. 1) for merging small files into a large file online in Hadoop's HDFS, which reduces the number of map tasks started by MapReduce. The invention mainly provides a new interface for uploading small files, together with a corresponding input format class. With the upload interface and the input class, small files can be stored and processed online.
Description
Technical field
The present invention relates to the fields of MapReduce and small-file storage, and specifically to a small-file storage system suited to the MapReduce computation model.
Background art
Hadoop is a distributed framework developed by Doug Cutting and his development team at Yahoo. Guided by the papers Google published on GFS and MapReduce, the team used Java to build an implementation similar to Google's MapReduce, namely Hadoop, together with a distributed file system, HDFS.
The small-file problem has gradually drawn attention in both academia and industry. The well-known social network Facebook has stored 260 billion photos with a total capacity of more than 20 PB, and the overwhelming majority of these files are smaller than 64 MB. Most of the data accessed on the internet consists of small files with a high access frequency.
Sean Quinlan, the technical lead for GFS, mentioned in an interview about GFS that one of BigTable's application scenarios targets small files. A report on the Small File Problem published by Cloudera, a well-known Hadoop company, also points out that Hadoop has problems handling massive numbers of small files.
Handling such small files causes serious problems for the performance and scalability of HDFS. First, massive numbers of small files produce a large amount of metadata. Because the metadata of every directory and file in HDFS is held in the memory of the NameNode, a large number of small files reduces the storage efficiency and storage capacity of the whole system. For example, if the system holds 10 million small files, each occupying one block, the NameNode needs about 3 GB of memory, so the NameNode's memory size severely limits how far the cluster can scale. Second, accessing a large number of small files is far slower than accessing a few large files, because the reader must keep jumping from one DataNode to another, which is an inefficient data access pattern. Third, the number of map tasks differs greatly between large and small files. For example, a 1 GB file is split into 16 blocks of 64 MB, whereas 10,000 files of 100 KB each (about 1 GB in total) each need their own map task, so the resulting MapReduce job may take hundreds of times longer than the job over the single 1 GB file. Although Hadoop offers mitigations such as JVM reuse, they do not solve these problems well.
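The two figures in the paragraph above can be reproduced with a short calculation. This is only a sketch: the value of 150 bytes per NameNode object is an assumption (a commonly quoted rule of thumb, not a number given in this patent), and the function names are hypothetical.

```python
# Back-of-the-envelope figures for the small-file problem.
# Assumption: every file and every block costs about 150 bytes of
# NameNode memory (rule of thumb, not stated in the patent text).
BYTES_PER_NAMENODE_OBJECT = 150
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB HDFS block, as in the text


def namenode_memory_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Estimate NameNode memory: one object per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_NAMENODE_OBJECT


def map_tasks_for_large_file(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """A large file gets one map task per block (ceiling division)."""
    return -(-file_size // block_size)


def map_tasks_for_small_files(num_files: int) -> int:
    """Each file smaller than a block gets its own map task."""
    return num_files


if __name__ == "__main__":
    # 10 million single-block files, reported in GB of NameNode memory
    print(namenode_memory_bytes(10_000_000) / 1e9)
    # map tasks for one 1 GB file versus 10,000 small files
    print(map_tasks_for_large_file(1024 ** 3))
    print(map_tasks_for_small_files(10_000))
```

Under these assumptions the 10 million files cost 20 million NameNode objects, which matches the roughly 3 GB quoted above, and the map-task counts are 16 versus 10,000.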
Hadoop itself provides Hadoop Archive (HAR) for merging small files into large files. A HAR file works by building a layered file system on top of HDFS. It is created with Hadoop's archive command, which actually runs a MapReduce job to pack the small files into the HAR file.
Summary of the invention
The present invention provides a method for merging and storing small files online, together with support for processing the merged files in MapReduce.
First, when the first file is uploaded to a Hadoop directory that holds small files, the system creates a large file of 64 MB (referred to here as a block). The content of the small file is written starting at file offset 0. At the end of the block, the system writes the number of small files currently stored (1), followed by the small file's name, its offset within the block, and its size. When further files are uploaded to this directory, the content of each small file is written at the start of the block's remaining free space, its file name, offset within the block, and size are written at the end of the free space, and the small-file count at the end of the block is updated. In other words, small-file contents are stored sequentially from the beginning of the block, the retrieval information for each small file is stored sequentially from the end of the block, and the small-file count is kept up to date.
The storage layout is shown in Fig. 1.
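The layout described above can be sketched in a few lines of Python. The patent does not specify field widths or byte order, so everything concrete here is an assumption made for illustration: a 4-byte big-endian count in the last bytes of the block, fixed 72-byte index entries (64-byte NUL-padded name, 4-byte offset, 4-byte size) growing backwards from the count, and the hypothetical class name `SmallFileBlock`.

```python
import struct

COUNT = struct.Struct(">I")      # assumed: small-file count in the last 4 bytes
ENTRY = struct.Struct(">64sII")  # assumed: name (NUL-padded), offset, size


class SmallFileBlock:
    """One large file: contents grow from the front, index from the back."""

    def __init__(self, block_size: int = 64 * 1024 * 1024):
        self.buf = bytearray(block_size)  # zero-filled, so the count starts at 0

    def count(self) -> int:
        return COUNT.unpack_from(self.buf, len(self.buf) - COUNT.size)[0]

    def _entry_pos(self, i: int) -> int:
        # Entry 0 sits right before the count; later entries sit further forward.
        return len(self.buf) - COUNT.size - (i + 1) * ENTRY.size

    def entry(self, i: int):
        name, offset, size = ENTRY.unpack_from(self.buf, self._entry_pos(i))
        return name.rstrip(b"\0").decode("utf-8"), offset, size

    def append(self, name: str, content: bytes) -> bool:
        """Store one small file; return False if the free space is too small."""
        raw = name.encode("utf-8")
        if len(raw) > 64:
            raise ValueError("file name longer than the assumed 64-byte field")
        n = self.count()
        data_start = 0
        if n:
            _, last_offset, last_size = self.entry(n - 1)
            data_start = last_offset + last_size
        index_start = self._entry_pos(n)
        if data_start + len(content) > index_start:
            return False  # content and index would collide
        self.buf[data_start:data_start + len(content)] = content
        ENTRY.pack_into(self.buf, index_start, raw, data_start, len(content))
        COUNT.pack_into(self.buf, len(self.buf) - COUNT.size, n + 1)
        return True

    def read(self, i: int) -> bytes:
        _, offset, size = self.entry(i)
        return bytes(self.buf[offset:offset + size])
```

For example, appending `a.txt` (5 bytes) and then `b.txt` (6 bytes) to an empty block places the contents at offsets 0 and 5 and sets the trailing count to 2. A real implementation would write to an HDFS file rather than an in-memory buffer.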
When MapReduce reads these small files, it first parses the retrieval information recorded for them, then organizes the files into key-value pairs that are processed in map. An input class that reads small files stored under this merging scheme is therefore required.
The MapReduce framework in Hadoop relies on InputFormat to supply data and on OutputFormat to write results; every MapReduce program performs its input and output through these classes. Hadoop ships a series of InputFormat and OutputFormat classes to simplify development. TextInputFormat reads plain text files: the file is split into lines terminated by LF or CR, the key is the position of each line (its offset, of type LongWritable), and the value is the content of the line, of type Text. KeyValueTextInputFormat also reads text files: if a line contains the separator (a tab by default), it is split into two parts, the first being the key and the remainder the value; if there is no separator, the whole line is the key and the value is empty. SequenceFileInputFormat reads sequence files, the binary file format Hadoop uses to store data in a user-defined format. It has two subclasses: SequenceFileAsBinaryInputFormat reads keys and values as BytesWritable, and SequenceFileAsTextInputFormat reads keys and values as Text.
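The key and value rules described for TextInputFormat and KeyValueTextInputFormat can be illustrated with a small Python emulation. This is not Hadoop code: the function names are hypothetical, and plain Python integers and strings stand in for LongWritable and Text.

```python
from typing import Iterator, Tuple


def text_records(data: bytes) -> Iterator[Tuple[int, str]]:
    """Emulates TextInputFormat: key = byte offset of the line, value = the line."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip(b"\r\n").decode("utf-8")
        offset += len(line)  # advance past the line and its terminator


def key_value_records(data: bytes, sep: bytes = b"\t") -> Iterator[Tuple[str, str]]:
    """Emulates KeyValueTextInputFormat: split each line at the first separator."""
    for line in data.splitlines():
        # partition() returns the whole line as the key and an empty value
        # when the separator is absent, matching the rule described above.
        key, _, value = line.partition(sep)
        yield key.decode("utf-8"), value.decode("utf-8")


if __name__ == "__main__":
    print(list(text_records(b"ab\ncde\n")))
    print(list(key_value_records(b"k1\tv1\nnokey\n")))
```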
The present invention requires a custom input class, SmallBulkInputFormat, which reads the small files out of the large block file and passes each small file to the map operation as one key-value pair (a pattern that is very common in fields such as bulk image processing).
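What such an input class has to do can be sketched in Python as a generator that yields one (file name, content) pair per small file. This is not the patent's SmallBulkInputFormat code. The on-disk format is an assumption made for illustration: a 4-byte big-endian count in the last bytes of the block, preceded by fixed 72-byte index entries (64-byte NUL-padded name, 4-byte offset, 4-byte size) that grow backwards. `build_demo_block` is a hypothetical helper that only exists to produce test input.

```python
import struct
from typing import Iterator, List, Tuple

COUNT = struct.Struct(">I")      # assumed: small-file count in the last 4 bytes
ENTRY = struct.Struct(">64sII")  # assumed: name (NUL-padded), offset, size


def iter_small_files(block: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (file name, content) for each small file, in upload order."""
    end = len(block) - COUNT.size
    (count,) = COUNT.unpack_from(block, end)
    for i in range(count):
        pos = end - (i + 1) * ENTRY.size  # entries grow backwards from the count
        raw_name, offset, size = ENTRY.unpack_from(block, pos)
        yield raw_name.rstrip(b"\0").decode("utf-8"), block[offset:offset + size]


def build_demo_block(files: List[Tuple[str, bytes]], block_size: int = 1024) -> bytes:
    """Hypothetical helper: pack files into a block using the assumed layout."""
    buf = bytearray(block_size)
    end = block_size - COUNT.size
    cursor = 0
    for i, (name, content) in enumerate(files):
        buf[cursor:cursor + len(content)] = content
        ENTRY.pack_into(buf, end - (i + 1) * ENTRY.size,
                        name.encode("utf-8"), cursor, len(content))
        cursor += len(content)
    COUNT.pack_into(buf, end, len(files))
    return bytes(buf)


if __name__ == "__main__":
    demo = build_demo_block([("a.jpg", b"\x01\x02"), ("b.jpg", b"xyz")])
    for name, content in iter_small_files(demo):
        print(name, len(content))  # each pair would be handed to one map() call
```

In a Hadoop job the same logic would live in a RecordReader returned by the input format, with the file name as the key and the file bytes as the value.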
Brief description of the drawings
Fig. 1 is a schematic diagram of the layout of small files within a block.
Detailed description
Step 1: Improve the HDFS flow for reading and writing small files.
Before small files are written, Hadoop first generates several large 64 MB block files in advance. When the NameServer receives a client's request to write a file, it selects a DataServer according to load balancing to handle the write request and sends that DataServer's information to the client. The client then calls the improved write function interface (its implementation is the same as that of the original write function interface; only the function name differs). When the DataServer receives the write request, it first selects one of the preallocated 64 MB files, writes the content of the small file carried in the request into this large file, and records the small file's retrieval information in the same file.
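The write flow in Step 1 can be modeled with a small in-memory Python sketch. It only tracks bookkeeping (which server, which preallocated block, which offset) and does not store file bytes. The patent does not say how load balancing works, so choosing the DataServer with the fewest stored bytes is an assumption, as are the class names, method names, and the default of two preallocated blocks per server.

```python
from typing import List, Tuple


class DataServer:
    def __init__(self, name: str, num_blocks: int = 2,
                 block_size: int = 64 * 1024 * 1024):
        self.name = name
        self.block_size = block_size
        # Preallocated large files: bytes used plus (filename, offset, size) index.
        self.blocks = [{"used": 0, "index": []} for _ in range(num_blocks)]

    @property
    def load(self) -> int:
        """Total bytes stored on this server (assumed load metric)."""
        return sum(b["used"] for b in self.blocks)

    def write_small(self, filename: str, content: bytes) -> Tuple[int, int]:
        """Place the file in the first preallocated block with enough room."""
        for block_id, block in enumerate(self.blocks):
            if block["used"] + len(content) <= self.block_size:
                offset = block["used"]
                block["index"].append((filename, offset, len(content)))
                block["used"] += len(content)
                return block_id, offset
        raise OSError("no preallocated block has room for this file")


class NameServer:
    def __init__(self, servers: List[DataServer]):
        self.servers = list(servers)

    def choose(self) -> DataServer:
        # Assumed policy: least-loaded server; ties go to the first one listed.
        return min(self.servers, key=lambda s: s.load)


def client_write_small(ns: NameServer, filename: str,
                       content: bytes) -> Tuple[str, int, int]:
    """Client side: ask the NameServer for a DataServer, then write to it."""
    server = ns.choose()
    block_id, offset = server.write_small(filename, content)
    return server.name, block_id, offset
```

With two servers whose blocks hold 10 bytes each, four writes of 5, 3, 4, and 6 bytes land on ds1, ds2, ds2, and then the second block of ds1, which shows both the balancing rule and the fallback to the next preallocated block.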
Step 2: Provide the new input class.
First, SmallBulkInputFormat is defined as a subclass of FileInputFormat; the core code is as follows:
Step 3: The developer uses the new input class.
The developer writes files with the new write interface function and sets the Job's input format to the SmallBulkInputFormat class.
Claims (2)
1. An online HDFS small-file storage method, characterized in that small files are stored online rather than through the offline archive approach used by HAR. The invention provides a new interface function for uploading small files, which is used to store small files online.
2. A new input format, SmallBulkInputFormat, characterized in that, by using this input format class, map operations can be performed on the small files created through the new small-file upload interface, with each small file handled as one key-value pair.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310430402.0A CN103500089A (en) | 2013-09-18 | 2013-09-18 | Small file storage system suitable for Mapreduce calculation model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310430402.0A CN103500089A (en) | 2013-09-18 | 2013-09-18 | Small file storage system suitable for Mapreduce calculation model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103500089A true CN103500089A (en) | 2014-01-08 |
Family
ID=49865304
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310430402.0A Pending CN103500089A (en) | 2013-09-18 | 2013-09-18 | Small file storage system suitable for Mapreduce calculation model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103500089A (en) |
2013
- 2013-09-18 CN CN201310430402.0A patent/CN103500089A/en active Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102222092A (en) * | 2011-06-03 | 2011-10-19 | 复旦大学 | Massive high-dimension data clustering method for MapReduce platform |
CN102902716A (en) * | 2012-08-27 | 2013-01-30 | 苏州两江科技有限公司 | Storage system based on Hadoop distributed computing platform |
Non-Patent Citations (3)
Title |
---|
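Zhang Chunming et al.: "A method for storing and reading small files in Hadoop", Computer Applications and Software, vol. 29, no. 11, 15 November 2012 (2012-11-15) * |
Jiang Liu: "Research on techniques for optimizing small-file storage in HDFS", China Master's Theses Full-text Database, Information Science and Technology, no. 9, 15 September 2011 (2011-09-15) * |
Hong Xusheng et al.: "The storage efficiency problem of small files in HDFS based on MapFile", Computer Systems & Applications, vol. 21, no. 11, 15 November 2012 (2012-11-15) * |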
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103970874A (en) * | 2014-05-14 | 2014-08-06 | 浪潮(北京)电子信息产业有限公司 | Method and device for processing Hadoop files |
CN104331428A (en) * | 2014-10-20 | 2015-02-04 | 暨南大学 | Storage and access method of small files and large files |
CN104331428B (en) * | 2014-10-20 | 2017-07-04 | 暨南大学 | The storage of a kind of small documents and big file and access method |
CN105139281A (en) * | 2015-08-20 | 2015-12-09 | 北京中电普华信息技术有限公司 | Method and system for processing big data of electric power marketing |
CN106708606B (en) * | 2015-11-17 | 2020-07-07 | 阿里巴巴集团控股有限公司 | Data processing method and device based on MapReduce |
CN106708606A (en) * | 2015-11-17 | 2017-05-24 | 阿里巴巴集团控股有限公司 | MapReduce based data processing method and MapReduce based data processing device |
WO2017084509A1 (en) * | 2015-11-17 | 2017-05-26 | 阿里巴巴集团控股有限公司 | Mapreduce-based data processing method and device |
CN106855872A (en) * | 2015-12-08 | 2017-06-16 | 山东商务职业学院 | The method for quickly retrieving of the mass picture based on Hadoop platform |
CN106855861A (en) * | 2015-12-09 | 2017-06-16 | 北京金山安全软件有限公司 | File merging method and device and electronic equipment |
CN107045422A (en) * | 2016-02-06 | 2017-08-15 | 华为技术有限公司 | Distributed storage method and equipment |
WO2017133216A1 (en) * | 2016-02-06 | 2017-08-10 | 华为技术有限公司 | Distributed storage method and device |
US11301154B2 (en) | 2016-02-06 | 2022-04-12 | Huawei Technologies Co., Ltd. | Distributed storage method and device |
US11809726B2 (en) | 2016-02-06 | 2023-11-07 | Huawei Technologies Co., Ltd. | Distributed storage method and device |
CN107948334A (en) * | 2018-01-09 | 2018-04-20 | 无锡华云数据技术服务有限公司 | Data processing method based on distributed memory system |
CN110018997A (en) * | 2019-03-08 | 2019-07-16 | 中国农业科学院农业信息研究所 | A kind of mass small documents storage optimization method based on HDFS |
CN110018997B (en) * | 2019-03-08 | 2021-07-23 | 中国农业科学院农业信息研究所 | Mass small file storage optimization method based on HDFS |
CN110321329A (en) * | 2019-06-18 | 2019-10-11 | 中盈优创资讯科技有限公司 | Data processing method and device based on big data |
CN110457265A (en) * | 2019-08-20 | 2019-11-15 | 上海商汤智能科技有限公司 | Data processing method, device and storage medium |
CN111221472A (en) * | 2019-12-26 | 2020-06-02 | 天津中科曙光存储科技有限公司 | Multi-block allocation strategy optimization method and system for disk space allocation |
CN111221472B (en) * | 2019-12-26 | 2023-08-25 | 天津中科曙光存储科技有限公司 | Multi-block allocation strategy optimization method and system for disk space allocation |
CN113568877A (en) * | 2020-04-28 | 2021-10-29 | 杭州海康威视数字技术股份有限公司 | File merging method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20140108 |