WO2015178943A1

WO2015178943A1 - Eliminating file duplication in a file system

Info

Publication number: WO2015178943A1
Application number: PCT/US2014/047044
Authority: WO
Inventors: Ramesh Kannan KARUPPUSAMY; Rajkumar Kannan
Original assignee: Hewlett-Packard Development Company, L.P.
Priority date: 2014-05-22
Filing date: 2014-07-17
Publication date: 2015-11-26

Abstract

Some examples describe a method for eliminating file duplication in a file system. A checksum of a file is generated during transition of the file to a retained state. The generated checksum is stored in a database. The database is queried to identify duplicate checksums of the file. Files corresponding to the duplicate checksums are deleted according to a predefined single instancing limit.

Description

ELIMINATING FILE DUPLICATION IN A FILE SYSTEM

Background

[001] File systems have evolved over time to meet the increased data storage requirements of organizations. From earlier local file systems to more recent scale-out file systems such as a Storage Area Network (SAN) file system, file systems have tried to keep pace not only with the continuous need of organizations for additional storage but also with more complex requirements such as remote file access, data sharing, data security, etc. According to an estimate, over a thousand types of file systems are currently available.

Brief Description of the Drawings

[002] For a better understanding of the solution, embodiments will now be described, purely by way of example, with reference to the accompanying drawings, in which:

[003] FIG. 1 is a block diagram of an example computing device that facilitates eliminating file duplication in a file system;

[004] FIG. 2 is a block diagram of an example computing environment that facilitates eliminating file duplication in a file system;

[005] FIG. 3 is a flowchart of an example method for eliminating file duplication in a file system;

[006] FIG. 4 is a flowchart of an example method for eliminating file duplication in a file system; and

[007] FIG. 5 is a block diagram of an example system that facilitates eliminating file duplication in a file system.

Detailed Description of the Invention

[008] The demand for large-scale storage systems for unstructured data has increased enormously over the years. Enterprises are demanding storage systems that can store billions of files (with file sizes ranging from a few kilobytes to petabytes) along with a file system that can handle such storage systems with minimal cost and complexity. Earlier file systems, which are still generally used by small organizations, were local file systems that were co-hosted with one or more applications on a server. When additional storage capacity was needed, a local file system was typically scaled up by adding one or more disks to the existing system. It was soon realized that although this initially met the requirement for additional storage, it eventually lowered the performance of the storage system, since accessing data stored on the additional storage space placed an additional load on the existing processing and bandwidth resources. Thus, a scale-out file system was designed to obviate the limitations of a scale-up file system.

[009] Scale-out file systems or shared file systems can overcome constraints of a scale-up file system by providing, for instance, shared storage devices to multiple clients. In a scale-out file system, storage capacity of a system may be expanded by including additional storage resources (for example, additional disks) according to a user's requirement. Each storage resource in a scale-out storage system may include its own processing and bandwidth resource.

[0010] In any file system, files may inevitably get duplicated, and storage is wasted on multiple copies of the same file. The issue becomes even more challenging in a retention-enabled file system that allows users to apply retention settings on a file such that the file may be retained in the system for a period set by an administrator for the file. The retention feature may allow users to retain files for up to a hundred years or more. When a file is retained, it can neither be modified nor deleted. Even after the retention period expires, the file cannot be modified, but it may become eligible for deletion. This state of the file is called WORM (Write Once Read Many). Thus, a user may not be able to delete duplicate files once they become "retained".

[0011] The present disclosure describes a mechanism for eliminating duplicate files in a file system. In an example, the present disclosure may describe generating a checksum of a file during transition of the file from a normal state to a retained state in a file system. The generated checksum of the file may be stored in a database, and the database may be periodically queried to identify any duplicate checksums of the file. Upon identification of a duplicate checksum of the file, the file corresponding to the duplicate checksum may be deleted according to a predefined single instancing limit. Since a file corresponding to a duplicate checksum represents a duplicate of the file, deleting the same eliminates a duplicate file from the file system.

[0012] FIG. 1 is a block diagram of an example computing device 100 that facilitates eliminating file duplication in a file system. Computing device 100 may be a server, a desktop computer, a notebook computer, a tablet computer, and the like. In an example, computing device 100 may be a file server 100. File server 100 may include a processor 102 and a machine-readable storage medium 104.

[0013] Processor 102 may be any type of Central Processing Unit (CPU), microprocessor, or processing logic that interprets and executes machine-readable instructions stored in machine-readable storage medium 104.

[0014] Machine-readable storage medium 104 may be a random access memory (RAM) or another type of dynamic storage device that may store information and machine-readable instructions that may be executed by processor 102. For example, machine-readable storage medium 104 may be Synchronous DRAM (SDRAM), Double Data Rate (DDR), Rambus DRAM (RDRAM), Rambus RAM, etc. or storage memory media such as a floppy disk, a hard disk, a CD-ROM, a DVD, a pen drive, and the like. In an example, machine-readable storage medium 104 may be a non-transitory machine-readable medium.

[0015] In an example, machine-readable storage medium 104 may store a file system 106, a database 108 and an elimination module 110. The term "module" may refer to a software component (machine-readable instructions), a hardware component or a combination thereof. A module may include, by way of example, components such as software components, processes, tasks, co-routines, functions, attributes, procedures, drivers, firmware, data, databases, data structures, Application Specific Integrated Circuits (ASIC) and other computing devices. A module may reside on a volatile or non-volatile storage medium (e.g. 104) and may be configured to interact with a processor (e.g. 102) of a computing device (e.g. 100).

[0016] In general, file system 106 may be used for storage and retrieval of data from a storage device. Typically, each piece of data is called a "file". File system 106 may be a local file system or a scale-out file system such as a shared file system or a network file system. Examples of a shared file system may include a Storage Area Network (SAN) file system or a cluster file system. Examples of a network file system may include a distributed file system or a distributed parallel file system. File system 106 may allow a user to apply retention settings on a file such that the file is retained in the system for a period set by the user. When a file transitions from a normal state to a retained state (i.e. upon application of retention settings), a checksum (hash) of the file may be generated using a hash algorithm and stored in a database (for example, 108). A duplicate checksum of a file may be generated when retention settings are applied to a duplicate of the file. Thus, in case there are multiple duplicates of a file, and retention settings are applied to each of them, a checksum may be generated for each duplicate of the file. Such duplicate checksums, wherein each duplicate checksum represents a duplicate of a file, may be stored in database 108. Some non-limiting examples of hash algorithms that may be used for generating a checksum of a file (or duplicate of a file) may include SHA, SHA-1, MD2, MD4, and MD5. In an instance, file system 106 may generate a notification event during transition of a file from a normal state to a retained state. A retained file may neither be modified nor deleted for a specified period depending upon the applied retention settings.
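The checksum step described above can be illustrated with a minimal sketch. It assumes Python's standard hashlib module; the helper name file_checksum and the chunk size are hypothetical and are not part of the disclosure. SHA-1 is chosen only because it is among the algorithms listed above.

```python
import hashlib

def file_checksum(path, algorithm="sha1", chunk_size=1 << 20):
    """Return the hex digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large files are never fully loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with identical content yield the same digest under this sketch, which is what allows duplicates to be found by comparing stored checksums rather than file contents.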

[0017] Database 108 may be a repository that stores an organized collection of data. In an example, database 108 may store a checksum of a file. The checksum of a file may be generated during transition of the file to a retained state i.e. once retention settings are applied to the file. Database 108 may also store duplicate checksums of a file, wherein each duplicate checksum represents a duplicate of a file. Apart from the generated checksum, the database 108 may also store other attributes of a file such as, but not limited to, a unique ID of the file, file path, etc.
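As an illustration only, the stored attributes and the duplicate lookup might be modelled as follows. The sketch assumes an SQLite database accessed through Python's standard sqlite3 module; the table name retained_files, its columns, and the helper names are hypothetical, and the disclosure does not prescribe a particular database engine or schema.

```python
import sqlite3

def open_checksum_db(location=":memory:"):
    """Create (if needed) a hypothetical table of retained-file attributes."""
    conn = sqlite3.connect(location)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS retained_files ("
        " file_id TEXT PRIMARY KEY,"
        " path TEXT NOT NULL,"
        " checksum TEXT NOT NULL)"
    )
    # An index on the checksum column keeps duplicate lookups cheap.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_checksum ON retained_files(checksum)"
    )
    return conn

def record_checksum(conn, file_id, path, checksum):
    """Store the unique ID, path, and checksum of one retained file."""
    conn.execute(
        "INSERT INTO retained_files VALUES (?, ?, ?)", (file_id, path, checksum)
    )
    conn.commit()

def duplicate_checksums(conn):
    """Map each checksum that occurs more than once to its occurrence count."""
    rows = conn.execute(
        "SELECT checksum, COUNT(*) FROM retained_files"
        " GROUP BY checksum HAVING COUNT(*) > 1"
    )
    return {checksum: count for checksum, count in rows}
```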

[0018] In an example, database 108 may be a distributed database that provides high query rates and high-throughput updates using a batching process. Database 108 may use a pipelined architecture that provides access to update batches at various points through processing. In an instance, database 108 may be based on a batched update model, which decouples update processing from read-only queries (i.e. query processing task). In this model, the updates may be batched and processed in the background, and do not interfere with the foreground query workload. Database 108 may allow different stages of the updates in the pipeline to be queried independently. Queries that could use slightly out-of-date data may use only the final output of the pipeline, which may correspond to the completely ingested and indexed data. Queries that require even fresher results may access data at any stage in the pipeline. Database 108 may be a metadata database that stores metadata related to unstructured data. Examples of unstructured data may include documents, audio, video, images, files, body of an e-mail message, Web page, or word-processor document. In an example, database 108 may be integrated into file system 106.

[0019] Elimination module 110 may include instructions to query database 108 to identify duplicate checksums of a file. In the event duplicate checksums of a file are identified, elimination module 110 may include instructions to delete files corresponding to the duplicate checksums of the file according to a predefined single instancing limit. The single instancing limit acts as a limit for removing duplicates of a file. In other words, it may specify the number of master copies of a file that may be retained in a file system. The single instancing limit may be a system-defined limit or a user-defined limit. In an example, the single instancing limit may be based on the number of duplicate checksums of a file identified upon querying a database (for example, 108). In such case, the single instancing limit may specify the number of master copies of a file to be retained in a file system (for example, 106). For instance, the single instancing limit may be a ratio between master copies of a file to be retained in a file system and duplicates of the file present in the file system. For example, if a ratio of 1:3 is specified, one master copy of a file may be retained for every three duplicates of the file identified in the system. In another example, if the given ratio is 2:5, two master copies of a file may be retained for every five duplicates of the file identified in the system. Elimination module 110 may continue deleting duplicates of a file until the single instancing limit is satisfied. In this regard, a special Application Programming Interface (API) may be used by the elimination module 110 to delete duplicate files from a file system (for example, 106).
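The ratio form of the single instancing limit can be made concrete with a small calculation. The sketch below is one possible reading, not the disclosed method: it assumes that a ratio of masters:per_duplicates is applied proportionally to the number of duplicates found, that fractional results are rounded up, and that at least one copy is always kept. The function name copies_to_retain and these rounding rules are assumptions.

```python
def copies_to_retain(duplicates_found, masters, per_duplicates):
    """Number of master copies to keep under an assumed masters:per_duplicates ratio."""
    if masters <= 0 or per_duplicates <= 0:
        raise ValueError("ratio terms must be positive")
    if duplicates_found <= 0:
        return 0
    # Integer ceiling of duplicates_found * masters / per_duplicates (assumed rounding).
    needed = (duplicates_found * masters + per_duplicates - 1) // per_duplicates
    # Never keep fewer than one copy or more copies than actually exist.
    return min(duplicates_found, max(1, needed))
```

Under these assumptions, six duplicates at a ratio of 1:3 leave two master copies, and five duplicates at a ratio of 2:5 also leave two.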

[0020] In an example, deletion of files corresponding to duplicate checksums (i.e. deleting duplicates of a file) by elimination module 110 may include determining whether a first duplicate checksum of the file matches a second duplicate checksum of the file. If the first duplicate checksum of the file matches the second duplicate checksum of the file, elimination module 110 may release the data blocks pointed to by the file path of the second duplicate checksum of the file. Elimination module 110 may then enable the file path of the second duplicate checksum of the file to point to the data blocks of the file path of the first duplicate checksum of the file.
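The disclosure performs this repointing inside the file system and does not specify the interface used. Purely as a user-space analogue, and assuming a file system that supports hard links, a similar effect can be sketched with Python's standard os module. The function name repoint_to_master and the temporary-name suffix are hypothetical; a real retained (WORM) file could not be altered this way from user space.

```python
import os

def repoint_to_master(master_path, duplicate_path):
    """Make duplicate_path refer to the same data as master_path (hard-link analogue)."""
    temp_path = duplicate_path + ".dedup-tmp"  # hypothetical temporary name
    # Create a second name for the master's data.
    os.link(master_path, temp_path)
    # Swap the new name over the duplicate; the duplicate's old data is freed
    # once no other name or open handle refers to it.
    os.replace(temp_path, duplicate_path)
```

After the call, both paths remain visible to users, but only one copy of the data occupies storage.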

[0021] In an example, prior to deleting files corresponding to duplicate checksums (i.e. duplicates of a file), elimination module 110 may determine that the content of a file corresponding to the first duplicate checksum of a file matches the content of a file corresponding to a second duplicate checksum of the file. Said determination may be performed to guard against checksum collisions: because of collisions in the checksum generation algorithm, two files with different contents may produce the same checksum.
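A content check of this kind might look like the following sketch, which assumes Python's standard filecmp module. The helper name is_true_duplicate is hypothetical. Passing shallow=False makes filecmp.cmp compare file contents rather than relying only on the files' stat signatures.

```python
import filecmp

def is_true_duplicate(first_path, second_path):
    """Return True only if both files have identical contents."""
    # shallow=False forces a content comparison, so a checksum collision
    # between two different files is not mistaken for a duplicate.
    return filecmp.cmp(first_path, second_path, shallow=False)
```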

[0022] FIG. 2 is a block diagram of an example computing environment 200 that facilitates eliminating file duplication in a file system. Computing environment 200 may include client systems 202, 204, and 206, a file server 208, and a storage device 210. The number of client systems 202, 204, and 206, file server 208, and storage device 210 shown in FIG. 2 is for the purpose of illustration only and their number may vary in other implementations. In an example, computing environment 200 may represent a scale-out file system.

[0023] Client systems 202, 204, and 206 may each be a computing device such as a desktop computer, a notebook computer, a tablet computer, a mobile phone, a personal digital assistant (PDA), a server, and the like. In an example, client systems 202, 204, and 206 may host one or more applications that may use a file system on file server 208 for data storage and retrieval. Client systems 202, 204, and 206 may communicate with file server 208 via a computer network 212. Computer network 212 may be a wireless or wired network. Computer network 212 may include, for example, a Local Area Network (LAN), a Wireless Local Area Network (WLAN), a Metropolitan Area Network (MAN), a Storage Area Network (SAN), a Campus Area Network (CAN), or the like. Further, computer network 212 may be a public network (for example, the Internet) or a private network (for example, an intranet).

[0024] File server 208 may include a non-transitory machine-readable storage medium 214 that may store machine-executable instructions. In an example, file server 208 may be similar to file server 100 described earlier. Accordingly, components of file server 208 that are similarly named and illustrated in file server 100 may be considered similar. For the sake of brevity, components or reference numerals of FIG. 2 having a same or similarly described function in FIG. 1 are not being described in connection with FIG. 2. Said components or reference numerals may be considered alike.

[0025] In an example, machine-readable storage medium 214 may store a file system 106, a database 108, an elimination module 110, a hash generator module 216, a journal writer 218, and a journal scanner 220.

[0026] A hash generator module 216 may include instructions to generate a checksum of a file when the file transitions from a normal state to a retained state. In an instance, when a file transitions to a retained state, a notification event may be generated by file system 106. This notification event acts as a cue for hash generator module 216 to generate a checksum of a file that transitions to a retained state. The generated checksum may be sent to a journal writer 218 (present in the file system kernel module) which may include instructions to generate a journal for the checksum generation.

[0027] Journal scanner 220 may include instructions to process a journal generated by journal writer 218. Upon processing of a journal for checksum generation, journal scanner 220 may insert the generated checksum into database 108. Journal scanner 220 may also insert various file attributes such as, but not limited to, a unique ID of the file, file path, etc. in database 108.
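The hand-off from journal writer to journal scanner can be sketched as an append-only log that is later replayed into the database. This is an assumption-laden illustration: the journal is modelled as a JSON-lines text file, the database as SQLite, and the names write_journal_entry, scan_journal, and retained_files are hypothetical. The disclosure places the journal writer in the file system kernel module and does not define a journal format.

```python
import json
import sqlite3

def write_journal_entry(journal_path, file_id, path, checksum):
    """Append one checksum-generation record to the journal (assumed JSON lines)."""
    record = {"file_id": file_id, "path": path, "checksum": checksum}
    with open(journal_path, "a", encoding="utf-8") as journal:
        journal.write(json.dumps(record) + "\n")

def scan_journal(journal_path, conn):
    """Replay journal records into the database; return the number of records read."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS retained_files ("
        " file_id TEXT PRIMARY KEY, path TEXT, checksum TEXT)"
    )
    processed = 0
    with open(journal_path, encoding="utf-8") as journal:
        for line in journal:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # INSERT OR IGNORE keeps a repeated scan from duplicating rows.
            conn.execute(
                "INSERT OR IGNORE INTO retained_files VALUES (?, ?, ?)",
                (record["file_id"], record["path"], record["checksum"]),
            )
            processed += 1
    conn.commit()
    return processed
```

Separating the write path (journal append) from the database insert mirrors the decoupling described above, in which the checksum is journaled first and ingested into database 108 afterwards.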

[0028] Storage device 210 may be used to store and retrieve data stored by file system 106. Some non-limiting examples of storage device 210 may include a Direct Attached Storage (DAS) device, a Network Attached Storage (NAS) device, a tape drive, a magnetic tape drive, or a combination of these devices. Storage device 210 may be directly coupled to file server 208 or may communicate with file server 208 via a computer network 222. Such a computer network 222 may be similar to the computer network 212 described above. In an example, computer network 222 may be a Storage Area Network (SAN).

[0029] FIG. 3 is a flowchart of an example method 300 for eliminating file duplication in a file system. The method 300, which is described below, may at least partially be executed on a computing device 100 of FIG. 1 or file server 208 of FIG. 2. However, other computing devices may be used as well. At block 302, a checksum of a file may be generated during transition of the file from a normal state to a retained state. At block 304, the generated checksum may be stored in a database (for example, 108). At block 306, the database (for example, 108) may be queried to identify duplicate checksums of the file. At block 308, upon identification of duplicate checksums of the file, files corresponding to the duplicate checksums may be deleted according to a predefined single instancing limit.

[0030] FIG. 4 is a flowchart of an example method 400 for eliminating file duplication in a file system. The method 400, which is described below, may at least partially be executed on a computing device 100 of FIG. 1 or file server 208 of FIG. 2. However, other computing devices may be used as well. At block 402, an event may be generated by a file system during transition of a file from a normal state to a retained state. At block 404, upon generation of the event, a hash generator module (for example, 216) may generate a checksum of the file transitioned to the retained state. At block 406, a journal writer (for example, 218) may generate a journal for the checksum generation. At block 408, a journal scanner (for example, 220) may process the journal and store the generated checksum in a database (for example, 108). The journal scanner (for example, 220) may store various attributes of the file in the database (for example, 108) as well. At block 410, the database (for example, 108) may be queried for paths and checksums stored in the database. At block 412, the path and checksum present in the first row (or "initial row") of the query results may be stored, and a file reduction counter may be set to zero. At block 414, a determination may be made whether another row is present in the query results. If no further rows are present, the method may stop at block 416. At block 418, if another row is present in the query results, the checksum of this next row (or "current row") may be compared with the stored checksum of the initial row to determine whether they match. If the checksum of the current row matches the checksum of the initial row, a determination may be made (at block 420) whether the file reduction counter is less than the single instancing limit. If the checksum of the current row matches the checksum of the initial row, and the file reduction counter is less than the single instancing limit, the data blocks pointed to by the path of the current row may be released, and the path of the current row may be enabled to point to the data blocks of the path of the initial row at block 422. The file reduction counter may also be incremented by one at this point, and the method 400 may move to block 414. If, at block 418, the checksum of the current row does not match the checksum of the initial row, or, at block 420, the file reduction counter is not less than the single instancing limit, the path and checksum of the current row may be preserved, and the method 400 may move to block 412.

[0031] FIG. 5 is a block diagram of an example system 500 that facilitates eliminating file duplication in a file system. System 500 includes a processor 502 and a machine-readable storage medium 504 communicatively coupled through a system bus. In an example, system 500 may be analogous to computing device 100 of FIG. 1 or file server 208 of FIG. 2. Processor 502 may be any type of Central Processing Unit (CPU), microprocessor, or processing logic that interprets and executes machine-readable instructions stored in machine-readable storage medium 504. Machine-readable storage medium 504 may be a random access memory (RAM) or another type of dynamic storage device that may store information and machine-readable instructions that may be executed by processor 502. For example, machine-readable storage medium 504 may be Synchronous DRAM (SDRAM), Double Data Rate (DDR), Rambus DRAM (RDRAM), Rambus RAM, etc. or a storage memory medium such as a floppy disk, a hard disk, a CD-ROM, a DVD, a pen drive, and the like. In an example, machine-readable storage medium 504 may be a non-transitory machine-readable medium. Machine-readable storage medium 504 may store instructions 506, 508, 510, and 512. In an example, instructions 506 may be executed by processor 502 to generate a checksum of a file during transition of the file from a normal state to a retained state. Instructions 508 may be executed by processor 502 to store the generated checksum of the file in a database embedded within the file system. Instructions 510 may be executed by processor 502 to query the database to identify duplicate checksums of the file. Instructions 512 may be executed by processor 502 to delete files corresponding to the duplicate checksums according to a predefined single instancing limit.

[0032] For the purpose of simplicity of explanation, the example methods of FIGS. 3 and 4 are shown as executing serially; however, it is to be understood and appreciated that the present and other examples are not limited by the illustrated order. The example systems of FIGS. 1, 2 and 5, and methods of FIGS. 3 and 4 may be implemented in the form of a computer program product including computer-executable instructions, such as program code, which may be run on any suitable computing device in conjunction with a suitable operating system (for example, Microsoft Windows, Linux, UNIX, and the like). Embodiments within the scope of the present solution may also include program products comprising non-transitory computer-readable media for carrying or having computer-executable instructions or data structures stored thereon. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer. By way of example, such computer-readable media can comprise RAM, ROM, EPROM, EEPROM, CD-ROM, magnetic disk storage or other storage devices, or any other medium which can be used to carry or store desired program code in the form of computer-executable instructions and which can be accessed by a general purpose or special purpose computer. The computer-readable instructions can also be accessed from memory and executed by a processor.

[0033] It may be noted that the above-described examples of the present solution are for the purpose of illustration only. Although the solution has been described in conjunction with a specific embodiment thereof, numerous modifications may be possible without materially departing from the teachings and advantages of the subject matter described herein. Other substitutions, modifications and changes may be made without departing from the spirit of the present solution. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive.
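The row-scanning loop of method 400 (blocks 412 through 422 of FIG. 4) can be sketched as below. The sketch only plans the repointing and does not touch any file. It assumes that the query results arrive ordered by checksum, which the description implies but does not state, and the function name plan_single_instancing is hypothetical.

```python
def plan_single_instancing(rows, single_instancing_limit):
    """Return (duplicate_path, master_path) pairs for rows of (path, checksum).

    Rows are assumed to be ordered by checksum (an assumption of this sketch).
    """
    repoints = []
    iterator = iter(rows)
    try:
        # Block 412: remember the initial row and reset the counter.
        initial_path, initial_checksum = next(iterator)
    except StopIteration:
        return repoints  # no rows at all
    counter = 0
    for path, checksum in iterator:  # block 414: another row present?
        # Blocks 418 and 420: same checksum and still under the limit?
        if checksum == initial_checksum and counter < single_instancing_limit:
            # Block 422: the current path would be repointed to the initial row's data.
            repoints.append((path, initial_path))
            counter += 1
        else:
            # Back to block 412: the current row becomes the new initial row.
            initial_path, initial_checksum = path, checksum
            counter = 0
    return repoints
```

For example, with four rows sharing one checksum and a limit of two, the second and third paths are repointed to the first, and the fourth row starts a new master copy.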

Claims

Claims:

1. A method for eliminating file duplication in a file system, comprising:

generating a checksum of a file during transition of the file to a retained state;

storing the generated checksum of the file in a database;

querying the database to identify duplicate checksums of the file; and

deleting files corresponding to the duplicate checksums according to a predefined single instancing limit.

2. The method of claim 1, wherein deleting the files corresponding to the duplicate checksums according to the predefined single instancing limit comprises:

determining that a first duplicate checksum of the file matches with a second duplicate checksum of the file;

in response to the determination, releasing data blocks pointed to by path of the second duplicate checksum of the file; and

enabling the path of the second duplicate checksum of the file to point to data blocks of path of the first duplicate checksum of the file.

3. The method of claim 2, further comprising determining that content of a file corresponding to the first duplicate checksum of the file matches with content of a file corresponding to the second duplicate checksum of the file.

4. The method of claim 1, wherein the predefined single instancing limit is a system defined limit or a user defined limit.

5. The method of claim 1, wherein the predefined single instancing limit is based on number of identified duplicate checksums of the file.

6. A system, comprising:

a database to store a checksum of a file, wherein the checksum of the file is generated during transition of the file to a retained state in a scale-out file system; and

an elimination module to query the database to identify a duplicate checksum of the file; and

in response to the identification, delete a file corresponding to the duplicate checksum according to a predefined single instancing limit.

7. The system of claim 6, further comprising a hash generator module to generate the checksum of the file during transition of the file to the retained state in the scale-out file system.

8. The system of claim 6, further comprising a journal scanner to store the checksum of the file in the database.

9. The system of claim 6, wherein the database is integrated into the scale-out file system.

10. A non-transitory machine-readable storage medium comprising instructions executable by a processor to:

generate a checksum of a file during transition of the file from a normal state to a retained state in a file system;

store the generated checksum of the file in a database embedded within the file system;

query the database to identify duplicate checksums of the file; and

delete files corresponding to the duplicate checksums according to a predefined single instancing limit.

11. The storage medium of claim 10, wherein the instructions to delete comprise instructions to:

determine, from the identified duplicate checksums of the file, that a first duplicate checksum of the file matches with a second duplicate checksum of the file;

in response to the determination, release data blocks pointed by a path of the second duplicate checksum of the file; and

enable the path of the second duplicate checksum of the file to point to data blocks of the path of the first duplicate checksum of the file.

12. The storage medium of claim 10, wherein the database is a metadata database that stores metadata related to unstructured data.

13. The storage medium of claim 10, wherein the database is based on decoupling of an update process in the database from a query processing task of the database.

14. The storage medium of claim 10, wherein the database is to allow pipelining of updates and independent querying of the pipelined updates.

15. The storage medium of claim 10, wherein the file system is a Network Attached Storage (NAS) file system.