WO2022054984A1 - Method for processing files through network attached disks - Google Patents

Method for processing files through network attached disks

Info

Publication number
WO2022054984A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
disk
data
files
computers
Application number
PCT/KR2020/012185
Other languages
French (fr)
Inventor
Han Gyoo Kim
Original Assignee
Han Gyoo Kim
Application filed by Han Gyoo Kim
Priority to PCT/KR2020/012185
Publication of WO2022054984A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/17 Details of further file system functions
    • G06F 16/176 Support for shared access to files; File sharing support
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0638 Organizing or formatting or addressing of data
    • G06F 3/064 Management of blocks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0638 Organizing or formatting or addressing of data
    • G06F 3/0643 Management of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0662 Virtualisation aspects
    • G06F 3/0665 Virtualisation aspects at area level, e.g. provisioning of virtual or logical volumes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/18 File system types
    • G06F 16/182 Distributed file systems
    • G06F 16/1824 Distributed file systems implemented using Network-attached Storage [NAS] architecture

Definitions

  • This invention generally relates to file sharing methods in distributed computer systems. More specifically, it relates to a method for sharing files in distributed computer systems where each computer uses only its local file system.
  • Unlike a conventional disk attached to an internal bus of a computer, a network attached disk is a storage device that provides a block storage device to a host computer through a network.
  • Initially, disks were provided over a Fibre Channel network called a SAN (Storage Area Network); technologies that use the more common Ethernet to provide storage space to computers were developed later and are now widely used.
  • Network attached disks that are connected via Ethernet include iSCSI disks, NetDisks, AoE disks, and FCoE disks.
  • iSCSI stands for "Internet Small Computer Systems Interface." IBM and Cisco started developing it in 1998, and it was adopted as a standard in 2000. iSCSI provides disks to computers through a network and is based on the Internet communications protocol. Ximeta, Inc., a company based in the U.S., developed and launched NetDisk using a proprietary protocol in 2002.
  • FCoE stands for "Fibre Channel over Ethernet." Its development started in 2003, and its protocol was standardized in 2007. FCoE is a protocol technology that connects storage apparatus to a computer through Ethernet instead of Fibre Channel.
  • AoE stands for "ATA over Ethernet." It was announced in 2004. AoE is a protocol that connects devices using the AT/ATAPI (AT Attachment Packet Interface) interface standard, such as regular hard disk drives (HDDs) and SSDs (solid-state drives), to Ethernet to provide an HDD or SSD to a computer. Unlike the aforementioned iSCSI technology, which connects SCSI (Small Computer System Interface) disks to Ethernet, AoE connects the relatively cheaper AT/ATAPI disks to Ethernet and provides them to a computer.
  • Network attached disk technologies including SAN, iSCSI, NetDisk, FCoE, and AoE provide storage devices through a network to the computers connected to that network, but no technology exists that allows multiple computers to share files stored on the network attached disks such that each computer uses only its own local file system, without coordination among the local file systems of the computers in the network.
  • Although there is high demand for sharing files among the computers of distributed data processing systems such as Hadoop, network attached disks are not currently used in such distributed systems because they cannot be shared as a storage device under each computer's local file system. More specifically, merely connecting a network attached storage device to the network so that the computers in the network access it as their local storage device does not provide sharing of the files stored on the device, because files on a storage device are created and stored through the "file system" software layer of the operating system. A file created and stored on a storage device by one computer's local file system cannot be properly accessed by the other computers' local file systems, because those file systems have no way to obtain the metadata of the file, which are stored and maintained by the particular local file system that created the file.
  • Metadata contain the information on which blocks of the storage device a file is stored. Metadata generated by a local file system are not shared with other independent local file systems, so there is no way for those file systems to access a file created and stored by another local file system.
  • The present invention aims to provide technology for network attached disks to be used to share files while each computer uses only its own local file system, without the need for cooperation among the local file systems of the computers in the network.
  • Under the conventional method of sharing files among the computers of distributed systems such as Hadoop, each computer reads a file stored on its internal disk into its buffer cache in main memory and transfers the file over the network to other computers; the computer that receives the file stores it in its own buffer cache in main memory and then saves it to its internal disk. This transmission occupies a large portion of the main memory of both sender and receiver that could otherwise have been allocated to processing their jobs.
  • Another purpose of the present invention is to improve the job processing throughput of the whole system by eliminating the main memory occupation involved in file transfers for sharing data between computers and by allocating the saved main memory to other job processing, thus increasing the effective amount of main memory available to the computers.
  • The present invention relates to a method for multiple computers to share data files using network attached storage devices, where each computer uses only its own local file system to create and share files without damaging their integrity.
  • The disk mentioned in the present invention refers not only to hard disk drives (HDDs) and SSDs (solid-state drives) but also to any nonvolatile block storage device, such as USB drives and SD (secure digital) cards.
  • Although the embodiments of the present invention are explained with reference to Hadoop systems, it is obvious that the file sharing method of the present invention can be used for sharing file data among multiple computers in other distributed systems besides Hadoop.
  • With the file sharing method via network attached disks disclosed herein, a computer's main memory is no longer occupied for transmitting file data among computers as in conventional distributed systems such as Hadoop; therefore the data processing throughput of the distributed system as a whole is greatly improved.
  • FIG. 1 is a diagram of the constituents of a Hadoop system.
  • FIG. 2 is a diagram of how the intermediate result files are transmitted and received between mappers and reducers in a Hadoop system.
  • FIG. 3 is a diagram of how multiple computers share files by sharing network attached disks.
  • FIG. 4 is a diagram of what software components of multiple computers are required to share network attached disks.
  • FIG. 5 is a diagram that describes a method where each computer uses only its own local file system to share files among computers without the coordination at the level of local file systems of multiple computers.
  • FIG. 6 is an illustration of disk partitions according to LBA (logical block addressing).
  • FIG. 7 is an illustration of how files are shared among computers by using network attached disks as simple block storage devices without mounting a file system on the device.
  • FIG. 8 is an illustration of how a conventional internal storage device is used as a virtual network attached storage device for file sharing.
  • FIG. 1 shows a Hadoop system consisting of data nodes (1-1, 1-2, 1-3, 1-4), up to several thousand computers connected through a network (2) to process big data.
  • In a Hadoop system, the data nodes (1-1, 1-2, 1-3, 1-4), which are the computers that process data, have one or more mappers (20-1, 20-2, 20-3, 20-4, 20-5) and reducers (21-1, 21-2, 21-3, 21-4, 21-5, 21-6), functions that process data by dividing the load among the multiple data nodes.
  • A mapper is a function for deriving intermediate data. Data to be processed are stored in data blocks (16-1, 16-2, 16-3) distributed among multiple data nodes via the Hadoop file system. Each mapper processes the data in the data blocks of its own data node, so the whole data set is processed by multiple mappers (20-1, 20-2, 20-3, 20-4, 20-5) in parallel.
  • Each mapper produces intermediate data as files (11-1, 11-2, 11-3, 11-4, 11-5, 11-6) and saves them as intermediate files (12-1, 12-2, 12-3, 12-4, 12-5, 12-6) on its own data node through the local file system (10-1, 10-2, 10-3, 10-4) of that data node.
  • The intermediate files (11-1, 11-2, 11-3, 11-4, 11-5, 11-6) produced by mappers are initially created in the buffer caches of each data node's operating system before being saved on local disks as files (12-1, 12-2, 12-3, 12-4, 12-5, 12-6). The temporary files in the buffer caches, which reside in the main memory of each data node (3-1, 3-2, 3-3, 3-4), and the files saved on the disks are the same files with identical data; nevertheless, in all drawings of the present invention, the intermediate files in buffer caches are numbered 11-1, 11-2, and so on, and the intermediate files saved on disks are numbered 12-1, 12-2, and so on, in order to distinguish the files in the buffer caches in main memory (3-1, 3-2, 3-3, 3-4) from the files saved on the disks of the data nodes.
  • The buffer cache, sometimes called the page cache, is a portion of main memory controlled by the operating system that keeps copies of disk blocks so that it can serve as a transparent cache for blocks of the hard disk drive. Using the buffer cache in main memory to temporarily store copies of disk blocks, which is called disk buffering, results in quicker file accesses, because main memory is faster to access than secondary storage devices such as hard disk drives.
  • Intermediate data files produced by the mappers are transmitted to the reducers (21-1, 21-2, 21-3, 21-4, 21-5, 21-6). A reducer is a function that gathers and processes intermediate data files. Like mappers, reducers are distributed among multiple data nodes, and there may be many reducers on each data node processing a single task. Mappers transmit intermediate files (12-1, 12-2, 12-3, 12-4, 12-5, 12-6) to reducers (21-1, 21-2, 21-3, 21-4, 21-5, 21-6), and each reducer puts together the intermediate files it has received from multiple mappers and processes them.
  • In this manner, the final result files (14-1, 14-2, 14-3) that each reducer has processed are saved as files (15-1, 15-2, 15-3) on local disks through the local file system of each data node, and these final data files (15-1, 15-2, 15-3) constitute the final results of the job.
  • The reason for distributing numerous mappers and reducers among the data nodes is to increase data processing speed by distributing the vast amount of total data among multiple data nodes and processing it in parallel.
  • FIG. 2 shows the typical process by which an intermediate file produced by a mapper is transmitted to a reducer through the HTTP (hypertext transfer protocol) communication protocol.
  • For example, intermediate file p (11-7), placed in the buffer cache in main memory (3-1) by mapper j (20-1) of data node M (1-1), is saved as file p (12-7) on the disk (6-1); it is then transmitted via the HTTP interface program (22-1) through the network stack (4-1) of the operating system of data node M (1-1) and the NIC (Network Interface Card, or Network Interface Chip) (5-1), the network interface hardware, and is sent to data node N (1-2), on which reducer h (21-1) is waiting to receive the intermediate file p (11-9, 12-8).
  • In the course of transferring the intermediate file, the intermediate file p (12-7) on the disk (6-1) cannot be transmitted directly to the network (2) via the HTTP interface program (22-1); it must first be loaded into main memory (3-1) as intermediate file p (11-8) before it can be transmitted.
  • The intermediate file p (11-8) transmitted in this manner passes through the NIC (5-2) of data node N (1-2), is loaded into the main memory (3-2) of data node N (1-2) as intermediate file p (11-9) through the network stack (4-2) of the operating system and the HTTP interface program (22-2), and only then is saved as file p (12-8) on the reducer's local disk (6-2) through the local file system (10-2). Afterwards, it is loaded again as intermediate file p (11-10) into main memory (3-2) before reducer h (21-1) processes the data of file p (11-10).
  • In FIG. 2, file p (11-7) produced by mapper j (20-1), each file p (12-7, 12-8) saved on the disks (6-1, 6-2) of the two data nodes (1-1, 1-2), and each file p (11-8, 11-9, 11-10) loaded into main memory are the same file with identical data; however, in order to distinguish where each file instance resides, the files in main memory are numbered 11-7, 11-8, 11-9, 11-10, and the files saved on the disks are numbered 12-7, 12-8, in a manner similar to the numbering in FIG. 1.
  • Each data node transmits intermediate files produced by mappers to reducers in other data nodes. During this process, each data node occupies its main memory (3-1, 3-2) as it transmits and receives intermediate files, due to the network communication involved. Because of this overhead, the throughput of the data processing job is reduced: the main memory occupied in transferring data files over the network would otherwise have been allocated to data processing jobs.
  • The present invention relates to a method for improving the throughput and performance of big data processing systems such as the Hadoop system. More specifically, the present invention provides for data nodes, or computers, to share intermediate files by storing them on storage devices that are attached to a network and shared among the data nodes. The method of sharing data files presented in this invention eliminates the main memory occupation by the network stack and buffer cache required for transferring data files between data nodes, thus improving overall data processing throughput and performance.
  • FIG. 3 is a diagram that shows the system of the present invention, with multiple data nodes, or computers, sharing network attached disks.
  • To avoid the occupation of main memory that occurs when intermediate files are transferred over the network, the present invention uses a network attached disk (7-1) to share the files. As mentioned above, network attached disks include iSCSI (Internet Small Computer Systems Interface) disks, NetDisks, and AoE (ATA over Ethernet) disks.
  • By attaching a network attached disk to a network (2) to which many data nodes (1-1, 1-2) are connected, the present invention provides the network attached disk (7-1) as a local disk to each data node (1-1, 1-2), as shown in FIG. 3. That is, the network attached disk (7-1) is provided as a local disk to both data node N (1-1) and data node M (1-2).
  • In FIG. 3, mapper j (20-7) saves the intermediate data file h (11-1), through its local file system (10-1), as file h (12-1) on the network attached disk (7-1) via the network stack (4-1) and NIC (5-1).
  • File h (11-1) produced by mapper j (20-7) refers to the file in the buffer cache of the operating system of data node M (1-1), where mapper j (20-7) is running, and file h (12-1), identical in data to file h (11-1), refers to the file saved on the network attached disk (7-1).
  • Reducer k (21-7), which is executed on another data node N (1-2), processes the data of intermediate data file h (12-1) on the network attached disk (7-1) by loading it as intermediate data file h (11-2) into the buffer cache in main memory through its own local file system (10-2).
  • In the conventional method illustrated by FIG. 2, the mapper's data file stored on its local disk has to be loaded into the buffer cache, occupying memory equal to the size of the file; another occupation of the same amount of memory occurs when the data file is copied from the buffer cache to the network stack for transfer to the reducer; and exactly the same memory occupations occur on the reducer's computer, where the same file transfer procedure runs in reverse order for the reducer to store the data file on its local disk.
  • On the contrary, the file sharing method of the present invention in FIG. 3 has data nodes directly access files stored on network attached disks to share them, instead of one data node transmitting files to another over the network; thus no memory occupation is entailed by the transfer of data files between data nodes. In the conventional method, the memory occupied by transferring a file over the network is twice the size of the data file. The method of the present invention can therefore allocate that much memory to data processing jobs, increasing the effective amount of memory of the system compared to the conventional method.
  • FIG. 4 is a block diagram of how multiple data nodes (1-1, 1-2, 1-3), or computers, share multiple network attached disks (7-1, 7-2, 7-3, 7-4, 7-5).
  • Each of the data nodes connected to the network (2) recognizes the network attached disks (7-1, 7-2, 7-3, 7-4, 7-5) as its own local disks through the device driver software modules (13-1, 13-2, 13-3), mounted on each data node, that control the network attached disks.
  • As for the device drivers (13-1, 13-2, 13-3), Linux operating systems provide initiator device driver software for iSCSI devices, and Windows, VMware, and various other operating systems also provide initiator software so that network attached disks can be recognized and used as local disks.
  • In a computer system that uses network attached disks, the device driver software accesses and controls the network attached disks through the network interface (NIC), while files on network attached disks are read and written through the local file system, similarly to files on the computer's internal disks.
  • However, multiple data nodes cannot independently read or write files on network attached disks without breaching the integrity of the file data. If each data node reads and writes files through its own local file system without cooperating with the other data nodes, no data node sharing the network attached disks knows which sectors of those disks the other data nodes are writing on. Therefore, when multiple data nodes write different data over the same disk sector of a shared network attached disk, the integrity of the file data is compromised. In other words, because the individual local file systems of the data nodes do not work in collaboration, the metadata, which record which files are stored on which blocks of the disk, end up differing from one data node to another, and a consistent file system on the shared network attached disks cannot be kept. For this reason, network attached disks such as iSCSI, AoE, FCoE, and NetDisk are not by themselves enough for the computers of a network to share files stored on them.
  • FIG. 5 is an illustration of the file sharing method of the present invention, where each computer (1-11, 1-12, 1-13) shares files by using only its own local file system (10-11, 10-12, 10-13), without cooperating with the other computers' local file systems.
  • A disk partition, or disk volume, is a collection of contiguous blocks of the sectors of a storage device, such as a hard disk or an SSD (solid-state drive). A single physical hard disk or SSD can be divided into several partitions. Each partition is recognized as an independent storage volume by the operating system, and a single file system is loaded on an individual disk partition. Therefore, the explanations in the present invention do not need to distinguish between disks and disk partitions.
  • In FIG. 5, each computer (1-11, 1-12, 1-13) mounts the network attached disks or disk partitions (7-1, 7-2, 7-3) on its local file system as its local disks, with both read and write privileges.
  • Mounting a disk or disk partition is the process of logically attaching the disk to a directory within the file system structure; after mounting, files can be read and written on the mounted disk through the local file system.
  • Mounting commands and related software tools are provided in all operating systems. In Linux operating systems, for example, mounting a network attached disk to the directory /networkdisk1 of the local file system (10-11) of computer 1 (1-11) is done with the mount command. The snippet below shows the network attached disk of FIG. 5, named /dev/sda, being mounted to the /networkdisk1 directory:

    sudo mount /dev/sda /networkdisk1
  • In FIG. 5, dotted lines (100 - 108) are used to emphasize that each network attached disk (7-1, 7-2, 7-3) is mounted not on a single computer but on multiple computers at the same time, with both read and write privileges.
  • In the file sharing method of the present invention, files (12-10, 12-11, 12-12, 12-13, 12-14, 12-15, 12-16, 12-17) are created a priori on the network attached disks (7-1, 7-2, 7-3) before each computer (1-11, 1-12) accesses the network attached disks (7-1, 7-2, 7-3) and starts reading and writing files on them.
  • Creating files of desired sizes before storing actual data in them can be done in Linux operating systems, for example, with the fallocate command.
  • fallocate is a command provided by Linux operating systems; it stands for "file allocate." fallocate creates a file that takes up a given amount of disk space before data is saved to it. Below is an example of using the fallocate command to create a file named file1.txt that is 10 MB in size.
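  • Assuming the standard util-linux syntax of fallocate, the example would be:

    fallocate -l 10M file1.txt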
  • The fallocate command secures disk space that can contain data of the specified size before the actual data is saved. Here, the file1.txt file is created by securing 10 MB of disk blocks before valid data is actually stored in it.
  • As many pre-allocated files as desired (12-10, 12-11, 12-12, 12-13, 12-14, 12-15, 12-16, 12-17) can be created in advance on the network attached disks (7-1, 7-2, 7-3).
  • The pre-allocated files (12-10, 12-11, 12-12, 12-13, 12-14, 12-15, 12-16, 12-17) are assigned to individual computers (1-11, 1-12, 1-13) in an exclusive way; that is, a pre-allocated file is assigned to one computer and never to the others. For example, two files (12-10, 12-11) can be assigned to computer 1-11 only, three files (12-12, 12-13, 12-14) to computer 1-12 only, and another three files (12-15, 12-16, 12-17) to computer 1-13 only. Each computer can write data to the pre-allocated files exclusively assigned to it and never writes to pre-allocated files that are not assigned to it, whereas all computers can read data from any of the pre-allocated files.
  • A computer can thus share data files with other computers: only the one computer to which a pre-allocated file is exclusively assigned can write data to it, while the other computers cannot update the data in the file but can read it.
  • The reason for providing pre-allocated files before a computer starts to write data to them is to maintain the integrity of the file system while individual computers write and read data to and from the files assigned exclusively to each of them.
  • The integrity of a file system is maintained when metadata such as inodes (index nodes) remain consistent for the files stored in the file system.
  • The inode is a data structure in a Unix-style file system that records which disk blocks store which portions of the data in a file.
  • Pre-allocating files and assigning them exclusively to different computers guarantees that metadata such as the inodes of the files remain the same: the set of fixed disk blocks allocated a priori to a pre-allocated file is never replaced with other disk blocks as long as the size of the data stored in the file does not exceed the pre-allocated disk space, and only one particular computer can update the data in a given pre-allocated file. Therefore, file sharing in the present invention is achieved such that one computer can update the data in a pre-allocated file assigned exclusively to it while the other computers can read the data from the file, without compromising file system integrity.
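  • The program referenced in the following paragraphs can be sketched with POSIX system calls as below; the parenthesized line numbers match the references that follow, the path /shared/file1.txt and the array data_buf come from the surrounding text, and declarations and error handling are omitted for brevity.

    /* assumes #include <fcntl.h> and <unistd.h>; char data_buf[4000] holds the data */
    (1) fd = open("/shared/file1.txt", O_WRONLY);  /* open the pre-allocated shared file */
    (2) write(fd, data_buf, 4000);                 /* write 4000 bytes from the beginning */
    (3) fsync(fd);                                 /* force the file data onto the disk */
    (4) fd_dev = open("/dev/sda", O_RDONLY);       /* open the shared disk device file */
    (5) fsync(fd_dev);                             /* force the metadata onto the disk */
    (6) close(fd);                                 /* terminate use of the file */
    (7) close(fd_dev);                             /* terminate use of the disk device */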
  • Line (3) of the program above invokes the fsync system call, which synchronizes the data written to file1.txt, forcing data that may still be in the buffer cache to be stored physically on the disk; doing so allows the other computers that share the file, i.e., file1.txt in this example, to read the correct data when they read it.
  • In addition, the relevant metadata of the file must be reflected and stored on the disk; for this, the disk device shared by the mapper and the reducers on which the file is located, /dev/sda in the example, has to be synchronized, as shown in line (5) of the program above.
  • The open system call is invoked on the disk device to obtain the file descriptor of the disk device file, as shown in line (4) of the program.
  • The close system call is invoked, as in lines (6) and (7), to terminate the use of the file and the disk device.
  • Because the file is created in advance by means of the fallocate command before a process running on a computer actually writes data to it, the size of the actual data, when written, normally differs from the size designated when the file was created by the fallocate command. Therefore, the process that has written data to the pre-allocated file must notify the processes running on the other computers that want to read the data of the exact data size written, so that those processes can read the actual data written to the file. Note that when a process writes data to a file, the data are written consecutively from the very beginning of the file towards its end.
  • The other processes on the other computers therefore only need to know the data size of the file, so that they can read data from the beginning of the file consecutively up to the size of the actual data written. For example, in the program above, 4000 bytes of data were written to the file1.txt file, but because file1.txt was created by securing 10 MB of disk space in advance, the size of file1.txt is 10 MB.
  • data_buf in the program below is an array of 4000 bytes, and the third argument, 4000, of the read(fd, data_buf, 4000) system call in line (2) specifies reading up to 4000 bytes of the file /shared/file1.txt, starting from the beginning of the file.
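  • A sketch of this reading program, under the same assumptions as the writing program above:

    /* assumes #include <fcntl.h> and <unistd.h>; char data_buf[4000] */
    (1) fd = open("/shared/file1.txt", O_RDONLY);  /* open the shared file read-only */
    (2) read(fd, data_buf, 4000);                  /* read the 4000 bytes actually written */
    (3) close(fd);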
  • If the data to be saved exceed the size of a single pre-allocated file, the writing process may use two or more pre-allocated files to save all of the data and then concatenate them into a single file. For example, if two files of 1 MB each were allocated but 1.5 MB of data need to be saved, 1 MB of data is written to one of the allocated files and the remaining 0.5 MB to the other, and then the two files are concatenated. In Linux operating systems, files can be concatenated with the cat command: file2 is appended to the end of file1, and the concatenated file is named file1. If a large amount of data is first written to fill up file1 and the rest of the data is written to file2, the two files can be concatenated in this manner.
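  • In Linux, for example, the appending form of the cat command for this step would be:

    cat file2 >> file1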
  • Files can be pre-allocated one at a time on demand, or multiple files can be pre-allocated at once. Whether the pre-allocated files are created dynamically or in advance, the process that writes data to a pre-allocated file should invoke the aforementioned fsync system call on the file, so that the file data and metadata are synchronized and saved to the network attached disk before processes on other computers read the data from the file.
  • Any process that wants to read the file but has no exclusive right to write to it must flush, or clear, the possibly stale data and metadata of the file that may be held in the buffer cache of its local file system before actually reading the file, because its local buffer cache might not hold the correctly updated data and metadata of a file that has just been written by a process on another computer.
  • Operating systems including Linux provide options to flush their buffer caches.
  • The snippet below shows a way for a process to clear its buffer cache, dentries, and inodes.
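  • In Linux, for example, this is commonly done as follows (root privileges assumed):

    sync                                  # write dirty pages back to disk first
    echo 3 > /proc/sys/vm/drop_caches     # drop the page cache, dentries, and inodes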
  • An inode is a data structure that represents a file.
  • A dentry is a data structure that represents a directory entry.
  • The buffer cache, or page cache, can contain memory mappings of blocks on disk, which means anything the OS may hold in memory from a file.
  • Writing 3 to /proc/sys/vm/drop_caches clears the buffer cache, dentries, and inodes on Linux systems without killing any application or service. It causes the kernel to look for files on the disk rather than in the cache.
  • In summary, the data integrity of files is maintained even though each computer uses only its own local file system to write, read, and share the files. This is achieved by allocating files a priori on the network attached disks before actual data are written to them; assigning the pre-allocated files to the computers in a mutually exclusive way, so that each pre-allocated file is assigned to one and only one computer, which alone can write data to it and then synchronizes the file data and metadata from its buffer cache to the network attached disk; and having the processes on the other computers clear their local buffer caches so that they read the correctly updated file data from the network attached disk.
  • The discussion so far has shown a method of saving files on network attached disks and sharing them with other computers through the local file system of each computer.
  • FIG. 6 shows the disk sectors that comprise a disk partition in the linear model of LBA (logical block addressing).
  • While the physical structure of a hard disk consists of cylinders, heads, and sectors, a disk is often represented by the LBA model, which numbers the sectors starting from sector 0 (zero).
  • In the LBA model, a disk or disk partition is shown as a series of contiguous disk sectors of uniform size, typically 512 bytes each. All disk devices are equipped with logic that converts an LBA disk sector number into the corresponding cylinder, head, and sector numbers of the physical disk, so the desired sector can be accessed by indicating its LBA sector number.
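  • As an illustration of that conversion logic, the classic mapping from a cylinder/head/sector triple $(C, H, S)$, with $S$ counted from 1, to an LBA sector number is $LBA = (C \cdot N_{heads} + H) \cdot N_{sectors} + (S - 1)$, where $N_{heads}$ is the number of heads per cylinder and $N_{sectors}$ is the number of sectors per track.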
  • FIG. 7 is a diagram that shows how to share data on a network attached disk among multiple computers without a local file system mounted on the disk; instead, the network attached disk is treated simply as a block storage device, without loading any file system on it. This differs from the method explained in FIG. 5, where data are saved and shared using files created through the local file system, only in that FIG. 7 uses the network attached disk (7-1) as a device file.
  • "Bundle" is a term defined in the present invention to denote a series of contiguous disk sectors.
  • Each bundle (30-1, 30-2, 30-3, 30-4) of FIG. 7 consists of a series of contiguous disk sectors, and the entirety of a disk or disk partition can be regarded as a group of bundles.
  • The disk partition (7-1) of FIG. 7 consists of a series of bundles from the first bundle, bundle 0 (30-1), to the last, bundle n (30-4).
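  • The example program discussed next can be sketched with POSIX system calls, the line numbers matching the references below; the 4000-byte array buf is an assumption, and the 11th sector corresponds to LBA sector 10, i.e., byte offset 10 × 512:

    /* assumes #include <fcntl.h> and <unistd.h>; char buf[4000] */
    (1) fd = open("/dev/sda", O_RDWR);             /* open the disk partition as a device file */
    (2) lseek(fd, 10 * 512, SEEK_SET);             /* seek to the 11th sector of the partition */
    (3) write(fd, buf, 4000);                      /* save 4000 bytes within bundle 0 */
    (4) lseek(fd, 10 * 512, SEEK_SET);             /* seek back to the same absolute offset */
    (5) read(fd, buf, 4000);                       /* read back the 4000 bytes written in line (3) */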
  • SEEK_SET is the whence value specifying that the offset is not relative but an absolute byte address counted from the very beginning of the disk partition (7-1). Therefore, line (3) commands the 4,000 bytes of data in the array buf to be saved starting from the 11th sector of the disk partition (7-1). Program line (5) reads the 4,000 bytes of data that were written in line (3).
  • Lines (4) and (5) of the example program above can serve as the program executed when a process in one computer, e.g., computer k (1-25), wants to read the data saved by another computer, e.g., computer 1 (1-21).
  • The following example program reads 4,000 bytes of data from the disk partition (7-1), starting from the 11th sector of the disk, which lies within bundle 0 (30-1).
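  • A sketch of that reading program, under the same assumptions as above:

    /* assumes #include <fcntl.h> and <unistd.h>; char buf[4000] */
    (1) fd = open("/dev/sda", O_RDONLY);           /* open the shared device read-only */
    (2) lseek(fd, 10 * 512, SEEK_SET);             /* seek to the 11th sector */
    (3) read(fd, buf, 4000);                       /* read the 4000 bytes within bundle 0 */
    (4) close(fd);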
  • The data sharing method of FIG. 7 prevents computers from arbitrarily writing on disk sectors and damaging the data by allocating bundles in an exclusive manner: a bundle is allocated to one and only one computer with the exclusive right to write on it, each computer stores data only on disk sectors in the bundles over which it has that exclusive writing privilege, and the other computers may read the data in the bundle.
  • So far, network attached disks, including iSCSI disks and others, have been used so that the computers connected to the network can share data among themselves. The network attached disks of the present invention can also be virtual network attached disks, or volumes, installed on conventional disks attached directly to the internal system bus of a computer, instead of physical network attached disk devices.
  • A virtual disk, virtual volume, or logical volume is a virtual device that provides an area of usable storage capacity on one or more physical disk drives in a computer system. The disk is described as logical or virtual because it does not actually exist as a single physical entity. All operating systems provide commands or tools to make a virtual disk or virtual volume on the physical disk devices of a computer system.
  • Device driver software for network attached disk devices can make any logical or virtual disk or volume into a network attached disk, that is, a "virtual" network attached disk.
  • Taking the implementation of a virtual iSCSI disk shown in FIG. 8 as an example, we first create a virtual disk (6-1) on one or more conventional internal disks and then make the virtual disk (6-1) an iSCSI target (6-1) using the iSCSI target driver (31-1) software provided by operating systems.
  • The iSCSI software initiator (31-2) enables the computer (1-2) to access iSCSI devices over Ethernet.
  • The iSCSI software target (31-1) enables a computer (1-1) to export local storage to be accessed by other iSCSI initiators (31-2) using the iSCSI protocol defined in RFC 3720. It is therefore obvious that the network attached disk of the present invention can be a "virtual" network attached disk virtualized on conventional internal disks; indeed, using virtual network attached disks, such as virtual iSCSI targets, is the norm in today's computer system practice.
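  • As a sketch of how such a setup is commonly built on Linux, assuming the LIO targetcli tool on the target side and open-iscsi on the initiator side (all names, paths, the IQN, and the address below are illustrative assumptions):

    # on the exporting computer: back a virtual disk with a file and expose it as an iSCSI target
    targetcli backstores/fileio create vdisk1 /srv/vdisk1.img 10G
    targetcli iscsi/ create iqn.2020-09.example.host:vdisk1
    targetcli iscsi/iqn.2020-09.example.host:vdisk1/tpg1/luns create /backstores/fileio/vdisk1
    # (access control configuration omitted)

    # on each sharing computer: discover the target and log in; the disk then
    # appears as a local block device such as /dev/sdX
    iscsiadm -m discovery -t sendtargets -p 192.168.0.10
    iscsiadm -m node --login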

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for sharing files among multiple computers is disclosed: a method for computers to save and share files without damaging them, where each computer uses only its own local file system, without the need for coordination between the file systems of the multiple computers. More specifically, a method is disclosed for sharing files between computers through network attached disks in distributed computing systems that process big data, such as the Hadoop system.

Description

METHOD FOR PROCESSING FILES THROUGH NETWORK ATTACHED DISKS
This invention in general relates to file sharing methods in distributed computer systems. More specifically, this invention relates to a method that shares files in distributed computer systems where each computer uses only its local file system.
Unlike a conventional disk that is attached to an internal bus of a computer, a network attached disk, which provides block storage device to a host computer through a network, is a type of storage device that is attached via network to a computer. Technology that uses Fibre Channel communication protocol network called SAN (Storage Area Network) to provide disks was used initially, but technologies that use the more common Ethernet to provide storage space to computers was later developed and is being widely used.
Network attached disks that are connected via Ethernet include iSCSI disk, NetDisk disk, AoE disk, and FCoE disk. iSCSI stands for "Internet Small Computer Systems Interface." IBM and Cisco started developing it in 1998, and it was selected as a standard in 2000. iSCSI provides disks to computers through a network and is based on Internet communications protocol. Ximeta, Inc., a company based in the U.S., developed and launched NetDisk using proprietary protocol in 2002. FCoE stands for "Fibre Channel over Ethernet." Its development started in 2003, and its protocol was standardized in 2007. FCoE is a protocol technology that connects storage apparatus to a computer through Ethernet instead of Fibre Channel. AoE stands for "ATA over Ethernet." It was announced in 2004. AoE is a protocol that connects devices that use AT/ATAPI (AT Attachment Packet Interface) interface standard, such as regular hard disk drives (HDDs) and SSDs (solid-state drives), to Ethernet to provide HDD or SSD to a computer. Unlike the aforementioned iSCSI technology, which connects SCSI (Small Computer System Interface) disks to Ethernet, AoE technology connects the relatively cheaper AT/ATAPI disks to Ethernet and provides them to a computer.
Network attached disk technologies including SAN, iSCSI, NetDisk, FCoE, AoE, etc., provide storage devices through network to computers connected to the network, but no technology exists that allows multiple computers to share files stored on the network attached disks such that each computer in the network uses only its own local file system without the need for coordination between their local file systems of multiple computers in the network.
Although there is high demand for sharing files among multiple computers of distributed data processing systems such as Hadoop, network attached disks are not currently used in such distributed systems because they cannot be shared as a storage device under each computer's local file system. More specifically, mere connecting a network attached storage device to the network such that the computers in the network access the network attached storage device as their local storage device does not provide the sharing of files stored on the network attached device because the files on the storage device are to be created and stored through "file system" software layer in the operating system. A file created and stored on a storage device by a computer's local file system cannot be properly accessed by other computers' local file systems because the other computers' local file systems have no way to get the metadata of the file whose data are stored and maintained by the particular local file system that created and stored the file on the device. Metadata contain the information on which blocks of the storage device the file is stored. Metadata generated by a local file system are not shared with other independent local file systems, so there is no way the local file systems access the file created and stored by another local file system. The present invention purposes to provide technology for network attached disks to be used to share files while each computer only uses its own local file system without the need for cooperation among the local file systems of the computers in the network.
Under the conventional method of sharing files among computers of distributed systems such as Hadoop, each computer reads a file stored on its internal disk into its buffer cache in main memory and transfers the file over the network to other computers, and the computer that has received the file stores it on its own buffer cache in its main memory and then saves it to its internal disk; in the process of file transmission from a computer to the other, the process occupies a large portion of main memory of the sender and receiver that could have been allocated for processing their jobs instead of just transferring the file.
Another purpose of the present invention is to improve job processing throughput of the whole system by eliminating the occupation of main memory involved in the file transfer for sharing data between the computers, and by allocating the very amount of the saved main memory for other job processing of the computers, thus increasing the effective amount of main memory for the computers.
The present invention relates to the method for multiple computers to share data files using network attached storage devices, where each computer uses only its own local file system to create and share without damaging the integrity of files.
The disk mentioned in the present invention refers not only to hard disk drives (HDDs) and SSDs (solid state drives) but also to any nonvolatile block storage devices, such as USB drives and SD (secure digital) cards.
Although the embodiments of the present invention are explained with reference to Hadoop systems, it is obvious that the file sharing method of the present invention can be used for sharing file data among multiple computers in other distributed systems besides the Hadoop system.
If the file sharing method via network attached disks as disclosed herein according to the present invention, a computer's main memory is no longer occupied for transmitting file data among computers as in conventional distributed systems such as Hadoop; therefore the data processing throughput of distributed systems as a whole is greatly improved.
FIG. 1 is a diagram of the constituents of a Hadoop system.
FIG. 2 is a diagram of how the intermediate result files are transmitted and received between mappers and reducers in a Hadoop system.
FIG. 3 is a diagram of how multiple computers share files by sharing network attached disks.
FIG. 4 is a diagram of what software components of multiple computers are required to share network attached disks.
FIG. 5 is a diagram that describes a method where each computer uses only its own local file system to share files among computers without the coordination at the level of local file systems of multiple computers.
FIG. 6 is an illustration of disk partitions according to LBA (logical block addressing).
FIG. 7 is an illustration of how files are shared among computers by using network attached disks as simple block storage devices without mounting file system on the device.
FIG. 8 is an illustration of how a conventional internal storage device is used as a virtual network attached storage device for file sharing.
FIG. 1 shows a Hadoop system consisting of data nodes (1-1, 1-2, 1-3, 1-4), which can be consisted of up to several thousands of computers that have been connected through network (2) to process big data. In Hadoop system, data nodes (1-1, 1-2, 1-3, 1-4), which are computers that process data, have one or more mappers (20-1, 20-2, 20-3, 20-4, 20-5) and reducers (21-1, 21-2, 21-3, 21-4, 21-5, 21-6), which are functions for processing data by dividing up the load among the multiple data nodes. A mapper is a function for deriving intermediate data. Data to be processed are stored in the data blocks (16-1, 16-2, 16-3) distributed among multiple data nodes via Hadoop file system. Each mapper processes data in the data blocks of its own data nodes, and thus the whole data are processed by multiple mappers (20-1, 20-2, 20-3, 20-4, 20-5) in parallel.
Each mapper that processes data produces intermediate data into files (11-1, 11-2, 11-3, 11-4, 11-5, 11-6) and saves them as intermediate files (12-1, 12-2, 12-3, 12-4, 12-5, 12-6) on its own data nodes through the local file system (10-1, 10-2, 10-3, 10-4) of each data node. The intermediate files (11-1, 11-2, 11-3, 11-4, 11-5, 11-6) that are produced by mappers are initially produced on buffer caches in the operating system of each data node before they are saved on local disks as files (12-1, 12-2, 12-3, 12-4, 12-5, 12-6); while the temporary files produced on buffer caches that are located on the main memory of each data node (3-1, 3-2, 3-3, 3-4) and the files saved on the disks are the same files whose data are identical, in all drawings of the present invention the intermediate files on buffer caches will be numbered 11-1, 11-2, and so on, and the intermediate files that are saved on disks will be numbered 12-1, 12-2, and so on, in order to distinguish the files that are on buffer caches on the main memory (3-1, 3-2, 3-3, 3-4) from the files that are saved on the disks of the data nodes.
Buffer cache, sometimes called page cache, is a portion of main memory controlled by an operating system that keeps the copies of disk blocks so that the buffer cache can be used as a transparent cache for blocks of the hard disk drive. Using buffer cache on main memory to temporarily store copies of disk blocks, which is called disk buffering, results in quicker file accesses because it is faster to access the main memory than secondary storage devices such as hard disk drive.
Intermediate data files produced by the mappers are transmitted to reducers (21-1, 21-2, 21-3, 21-4, 21-5, 21-6). A reducer is a function that gathers and processes intermediate data files. Like mappers, reducers are distributed among multiple data nodes, and it is possible for there to be many reducers on each data node to process a single task. Mappers transmit intermediate files (12-1, 12-2, 12-3, 12-4, 12-5, 12-6) to reducers (21-1, 21-2, 21-3, 21-4, 21-5, 21-6), and each reducer puts together intermediate files that it has received from multiple mappers and processes the intermediate files. In this manner, final result files (14-1, 14-2, 14-3) that each reducer has processed are saved as files (15-1, 15-2, 15-3) on local disks through the local file system of each of the data nodes, and these final data files (15-1, 15-2, 15-3) constitutes the final results of the job.
The reason for distributing numerous mappers and reducers among the data nodes is to increase data processing speed by distributing vast amounts of total data among multiple data nodes and processing them in parallel.
FIG. 2 shows the typical process by which an intermediate file produced by a mapper is transmitted to a reducer through the HTTP (hyper text transport protocol) communication protocol. For example, intermediate file p (11-7) residing in buffer cache on the main memory (3-1) by mapper j (20-1) of data node M (1-1) is saved as file p (12-7) on the disk (6-1); is transmitted via the HTTP interface program (22-1) through the network stack (4-1) of the operating system of data node M (1-1) and NIC (Network Interface Card, or Network Interface Chip) (5-1), which is the network interface hardware; and is sent to data node N (1-2), on which reducer h (21-1) is waiting to receive the intermediate file p (11-9, 12-8).
In the course of transferring the intermediate file, the intermediate file p (12-7) that is on the disk (6-1) cannot be transmitted directly to the network (2) via the HTTP interface program (22-1), but the intermediate file p (12-7) on the disk (6-1) must be loaded on the main memory (3-1) as intermediate file p (11-8) before it can be transmitted. The intermediate file p (11-8) that has been transmitted in this manner is saved on the local disk (6-2) of the reducer as file p (12-8) on its local disk (6-2) only after it is loaded on the main memory (3-2) of data node N (1-2) as intermediate file p (11-9) through the network stack (4-2) of the operating system, the HTTP interface program (22-2), and the local file system (10-2) after passing through NIC (Network Interface Card, or Network Interface Chip) (5-2) of data node N (1-2; afterwards, it is loaded again as intermediate file p (11-10) on the main memory (3-2) before the reducer h (21-1) then processes the data of file p (11-10. In FIG. 2, file p (11-7) produced by mapper j (20-1), each file p (12-7, 12-8) saved on the disks (6-1, 6-2) of the two data nodes (1-1, 1-2), and each file p (11-8, 11-9, 11-10) that is loaded on the main memory are the same files with identical data, but in order to distinguish the location where each file instance resides, files on the main memory are numbered 11-7, 11-8, 11-9, 11-10, and files saved on the disks are numbered 12-7, 12-8, in a manner similar to how they were numbered in FIG. 1.
Each data node transmits intermediate files produced by mappers to reducers in other data nodes. During this process, each data node occupies its main memory (3-1, 3-2) as it transmits and receives intermediate files due to the network communication involved in the process. In turn, due to this overhead, throughput of data processing job becomes reduced because the amount of main memory occupied in this process of transferring data file over the network would otherwise have been allocated to data processing jobs instead of being occupied in the process of transferring the file.
The present invention relates to the method for improving the throughput and performance of big data processing systems such as the Hadoop system. More specifically, the present invention provides for data nodes, or computers, to share intermediate files by storing the intermediate files on storage devices that are attached to a network and are shared among data nodes. The method of sharing data file presented in this invention eliminates the occupation of main memory allocated to network stack and buffer cache required for transferring the data file between data nodes, thus improving the overall data processing job throughput and performance.
FIG. 3 is a diagram that shows the system of the present invention of multiple data nodes, or computers, that share network attached disks. In order to avoid the occupation of the main memory that occurs when intermediate files are transferred over network, the present invention uses a network attached disk (7-1) to share the files. As mentioned above, network attached disks include iSCSI (Internet Small Computer Systems Interface) disks, NetDisks, and AoE (ATA over Ethernet) disks. By attaching network attached disk to a network (2) to which many data nodes (1-1, 1-2) are connected, the present invention uses the network attached disk (7-1) as a local disk to each data node (1-1, 1-2) as shown in FIG. 3. That is, the network attached disk (7-1) is provided as a local disk to both of data node N (1-1) and data node M (1-2).
In FIG. 3, mapper j (20-7) saves the intermediate data file h (11-1) through its local file system (10-1) as file h (12-1) on the network attached disk (7-1) through network stack (4-1) and NIC (5-1). File h (11-1) produced by mapper j (20-7) refers to the file that is on the buffer cache of the operating system of data node M (1-1) where mapper j (20-7) is running, and file h (12-1) which is identical in data to file h (11-1) refers to the file that is saved on the network attached disk (7-1). Reducer k (21-7), which is being executed on another data node N (1-2), processes the data of intermediate data file h (12-1), which is on the network attached disk (7-1), by loading it as intermediate data file h (11-2) on the buffer cache on the main memory of its own local file system (10-2).
In the conventional method illustrated by FIG. 2, mapper's data file stored in its local disk has to be loaded on buffer cache thus entailing the occupation of memory to the size of the file, and then another memory occupation of same amount of memory due to the copy of data file from buffer cache to network stack occurs in order to be transferred to reducer, and then the exactly same amount of memory occupations occurs in reducer's computer because the same file transfer procedures are required only in reverse order for the reducer to store the data file on its local disk.
On the contrary, the file sharing method of the present invention in FIG. 3 has data nodes directly access files stored on network attached disks to share the files, instead of one data node transmitting files to another data node through a network, thus no memory occupations are entailed for the transfer of data files between the data nodes. The amount of occupied memory in conventional method due to the file transfer over the network is twice the size of data files. The method of present invention, therefore, allocates this much of memory to the jobs for processing data thus increases the effective amount of memory of the system compared to the conventional method.
FIG. 4 is a block diagram of how multiple data nodes (1-1, 1-2, 1-3), or computers, share multiple network attached disks (7-1, 7-2, 7-3, 7-4, 7-5). Each of the data nodes that are connected to network (2) recognizes network attached disks (7-1, 7-2, 7-3, 7-4, 7-5) as its own local disks through device driver software modules (13-1, 13-2, 13-3) that control the network attached disks mounted on each data node. As for device drivers (13-1, 13-2, 13-3), Linux operating systems provide initiator device driver software for iSCSI devices, and Windows, VMware, and various other operating systems also provide initiator software so that network attached disks can be recognized and used as local disks. In a computer system that uses network attached disks, the device driver software for network attached disks accesses and controls the network attached disks through the network interface (NIC), though files on network attached disks are read and written through the local file system similarly to those on the computer's internal disks.
However, multiple data nodes cannot independently read and write files on network attached disks without breaching the integrity of the file data. If each data node reads and writes files through its own local file system without cooperating with the other data nodes, no data node that shares the network attached disks knows which sectors of those disks the other data nodes are writing to. Therefore, when multiple data nodes write different data over the same disk sector of a shared network attached disk, the integrity of the file data is compromised. In other words, because the individual local file systems of the data nodes do not work in collaboration, the metadata, which records which files are stored on which blocks of the disk, ends up differing from one data node to the next, so a consistent file system on the shared network attached disks cannot be maintained. For this reason, network attached disks such as iSCSI, AoE, FCoE, and NetDisk are not by themselves enough for the computers on a network to share files stored on them.
FIG. 5 is an illustration of the file sharing method of the present invention, where each computer (1-11, 1-12, 1-13) shares files using only its own local file system, without cooperating with the other computers' local file systems (10-11, 10-12, 10-13). A disk partition, or disk volume, is a collection of contiguous blocks of sectors on a storage device such as a hard disk or an SSD (solid-state drive). A single physical hard disk or SSD can be divided into several partitions. Each partition is recognized as an independent storage volume by the operating system, and a single file system is loaded on each disk partition. Therefore, the explanations in the present invention do not need to distinguish between disks and disk partitions.
In FIG. 5, each computer (1-11, 1-12, 1-13) mounts the network attached disks or disk partitions (7-1, 7-2, 7-3) on its local file system as local disks with both read and write privileges. Mounting a disk or disk partition is the process of logically attaching the disk to a directory within the file system structure; after mounting, files can be read and written on the mounted disk through the local file system. Mounting commands and related software tools are provided in all operating systems. In Linux operating systems, for example, mounting a network attached disk to the directory /networkdisk1 of the local file system (10-11) of computer 1 (1-11) is done with the mount command. The snippet below shows a network attached disk of FIG. 5, named /dev/sda, being mounted to the /networkdisk1 directory.
sudo mount /dev/sda /networkdisk1
In FIG. 5, dotted lines (100 - 108) emphasize the fact that each network attached disk (7-1, 7-2, 7-3) is mounted not on a single computer but on multiple computers at the same time, with both read and write privileges. In the file sharing method of the present invention, files (12-10, 12-11, 12-12, 12-13, 12-14, 12-15, 12-16, 12-17) are created a priori on the network attached disks (7-1, 7-2, 7-3) before each computer (1-11, 1-12) accesses the network attached disks (7-1, 7-2, 7-3) and starts reading and writing files on them. Creating files of desired sizes before storing actual data in them can be done in Linux operating systems, for example, with the fallocate command. fallocate, which stands for "file allocate," creates a file that takes up a specified amount of disk space before any data is saved on it. Below is an example of using the fallocate command to create a file named file1.txt that is 10 MB in size.
fallocate -l 10M file1.txt
The fallocate command secures disk space that can hold data of the specified size before the actual data is saved. For instance, in the above example, the file1.txt file is created by securing 10 MB of disk blocks before any valid data is stored in it. In this manner, as many pre-allocated files as desired (12-10, 12-11, 12-12, 12-13, 12-14, 12-15, 12-16, 12-17) can be created on the network attached disks (7-1, 7-2, 7-3) in advance.
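For illustration, the same pre-allocation can also be performed from within a program by using the posix_fallocate system call; the following is a minimal C sketch, in which the path /networkdisk1/file1.txt and the 10 MB size are assumptions made for this example rather than values taken from the figures.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Create (or open) the file that will later hold shared data. */
    int fd = open("/networkdisk1/file1.txt", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Reserve 10 MB of disk blocks before any data is written,
       the programmatic counterpart of "fallocate -l 10M file1.txt". */
    int err = posix_fallocate(fd, 0, 10 * 1024 * 1024);
    if (err != 0) {
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}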
Afterwards, the pre-allocated files (12-10, 12-11, 12-12, 12-13, 12-14, 12-15, 12-16, 12-17) are assigned to the individual computers (1-11, 1-12, 1-13) in an exclusive way; that is, a pre-allocated file is assigned to one computer and never to the others. For example, two files 12-10, 12-11 can be assigned to computer 1-11 only, three files 12-12, 12-13, 12-14 to computer 1-12 only, and another three files 12-15, 12-16, 12-17 to computer 1-13 only. Each computer may write data only to the pre-allocated files exclusively assigned to it and never writes to pre-allocated files assigned to others, whereas all computers may read data from any of the pre-allocated files. By assigning and writing the pre-allocated files in this exclusive way, a computer can share data files with the other computers: only one computer can write data to a pre-allocated file assigned exclusively to it, and the other computers cannot update the data in that file but can read it.
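The present invention does not prescribe any particular data structure for recording this exclusive assignment; purely for illustration, the following C sketch shows one hypothetical way a static ownership table could record which computer may write each pre-allocated file, with all file names and computer numbers assumed for the example.

#include <stddef.h>
#include <string.h>

/* Illustrative sketch only: one pre-allocated file, one owning computer. */
struct assignment {
    const char *path;  /* pre-allocated file on the network attached disk */
    int owner;         /* the one and only computer allowed to write it */
};

static const struct assignment table[] = {
    { "/shared/file1.txt", 1 },  /* writable by computer 1 only */
    { "/shared/file2.txt", 1 },
    { "/shared/file3.txt", 2 },  /* writable by computer 2 only */
};

/* Any computer may read any file; only the owner may write it. */
static int may_write(int computer, const char *path)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(table[i].path, path) == 0)
            return table[i].owner == computer;
    return 0;  /* files not in the table are not writable */
}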
The reason the files are pre-allocated before a computer starts to write data to them is to maintain the integrity of the file system when the individual computers write and read data on and from the files assigned exclusively to each of them. The integrity of a file system is maintained when metadata such as inodes (index nodes) remain consistent for the files stored in the file system. An inode is a data structure in a Unix-style file system that records which disk blocks store which portions of a file's data. Pre-allocating the files and assigning them exclusively to different computers guarantees that metadata such as the inodes of the files remain unchanged, because the fixed set of disk blocks allocated a priori to a pre-allocated file is never replaced with other disk blocks as long as the size of the data stored in the file does not exceed the pre-allocated disk space, and because only one computer in particular can update the data in any given pre-allocated file. Sharing a file in the present invention is therefore achieved in such a way that one computer updates the data in a pre-allocated file assigned exclusively to it while the other computers read the data from that file, without compromising the file system integrity.
The following snippet shows how a mapper process on a Linux system opens the pre-allocated file assigned to it and writes data to the file that it shares with reducer processes. The program assumes that 4000 bytes of data are held in an array buf and that file1.txt is the name of the pre-allocated file located in the directory /shared.
char buf[4000];
(1)int fd = open("/shared/file1.txt", O_RDWR);   /* open the shared pre-allocated file for reading and writing */
(2)write(fd, buf, 4000);                         /* write the 4000 bytes of data to the file */
(3)fsync(fd);                                    /* force the file data from the buffer cache onto the disk */
(4)int fd_device = open("/dev/sda", O_RDWR);     /* open the disk device file that holds the file */
(5)fsync(fd_device);                             /* force the file's metadata onto the disk */
(6)close(fd);
(7)close(fd_device);
After the writing of data to the file is complete, the file data and the relevant metadata must be synchronized to the disk so that other computers that want to read the file can access correctly synchronized data. fd in line (1) of the above program is the file descriptor returned by the operating system as the result of opening the file file1.txt with read/write privileges; to open a file with both read and write privileges, the O_RDWR flag must be specified as the second argument of the open system call. In line (2), the write system call is invoked to write the data to the opened file, file1.txt. Line (3) invokes the fsync system call, which synchronizes the data written to file1.txt, forcing it to be stored physically on the disk instead of remaining in the buffer cache; doing so allows the other computers that share the file, file1.txt in this example, to read the correct data. In addition to the file's data, the relevant metadata of the file must also be reflected and stored on the disk; for this, the disk device shared by the mapper and the reducers on which the file resides, /dev/sda in the example, has to be synchronized, as shown in line (5). Because the operating system treats the disk device itself as a file, the open system call is invoked on the disk device in line (4) to obtain a file descriptor for the disk device file. After the file data and the disk device are synchronized, the close system call is invoked in lines (6) and (7) to finish using the file and the disk device.
In the present invention, because the file is created in advance with the fallocate command before the process running on a computer actually writes data to it, the size of the data actually written will normally differ from the size designated when the file was created by fallocate. The process that has written data to a pre-allocated file must therefore notify the processes running on the other computers of the exact size of the data written to the file, so that those processes can read the actual data. Note that when a process writes data to a file, the data are written consecutively from the very beginning of the file towards its end; the processes on the other computers therefore only need to know the data size in order to read from the beginning of the file up to the size of the actual data written. For example, in the program above, 4000 bytes of data were written to file1.txt, but because file1.txt was created by securing 10 MB of disk space in advance, the size of file1.txt is 10 MB. Although the nominal size of file1.txt is 10 MB, only 4000 bytes of valid data are saved in it; for this reason, the valid data size of 4000 should be made known to the processes on the other computers that want to read the valid data, so that they read up to 4000 bytes instead of 10 MB. Below is a snippet of a program that reads the 4000 bytes of data from the file1.txt file created in the example program above. data_buf is an array of 4000 bytes, and the third argument, 4000, of the read(fd, data_buf, 4000) system call in line (2) specifies reading up to 4000 bytes of /shared/file1.txt starting from the beginning of the file.
char data_buf[4000];
(1)int fd = open("/shared/file1.txt", O_RDONLY);   /* open the shared file read-only */
(2)read(fd, data_buf, 4000);                       /* read only the 4000 bytes of valid data */
(3)close(fd);
If data larger than the originally designated size of a pre-allocated file needs to be saved, the writing process may use two or more pre-allocated files to hold all of the data and then concatenate them into a single file. For example, if two files of 1 MB each were allocated but 1.5 MB of data needs to be saved, 1 MB of data is written to one of the allocated files, the remaining 0.5 MB is written to the other, and the two files are then concatenated. In Linux operating systems, files can be concatenated with the cat command. Below is an example of forming a single file by appending one file to the end of another using the cat command; file2 is appended to the end of file1, and the concatenated file is named file1. In other words, if a large amount of data first fills up file1 and the rest of the data is written to file2, the two files can be concatenated in this manner.
cat file2 >> file1
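For illustration, the same concatenation can be performed from within a program; the following is a minimal C sketch that appends file2 to the end of file1, as the cat command above does, under the assumption that both files reside in the current directory.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int src = open("file2", O_RDONLY);            /* file holding the remaining data */
    int dst = open("file1", O_WRONLY | O_APPEND); /* file to be extended */
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    char buf[4096];
    ssize_t n;
    while ((n = read(src, buf, sizeof buf)) > 0)
        write(dst, buf, (size_t)n);  /* append each chunk to the end of file1 */

    fsync(dst);  /* as before, force the appended data onto the disk */
    close(src);
    close(dst);
    return 0;
}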
In the present invention, files can be pre-allocated one at a time on demand as they are needed, or multiple files can be pre-allocated at once. Whether the pre-allocated files are created dynamically or in advance, the process that writes data to a pre-allocated file should invoke the aforementioned fsync system call on the file so that the file data and metadata are synchronized and saved to the network attached disk before the processes on the other computers read the data from the file.
Once a process writes data to a pre-allocated file and synchronizes the data and the relevant metadata to the network attached disk, all processes on the other computers can read the data from the file. However, those processes cannot know whether the file on the network attached disk has in fact been written by the process holding the exclusive write right, because, although the network attached disks are shared by all the computers, each computer uses its own local file system to read and write files on them, without any cooperation among the local file systems of the individual computers. Therefore, any process that wants to read the file but has no exclusive right to write to it must flush, or clear, the possibly stale data and metadata of the file from the buffer cache of its local file system before actually reading the file, because that buffer cache might not hold the correctly updated data and metadata of a file that has just been written by a process on another computer.
Operating systems including Linux provide ways to flush the buffer cache. For example, the snippet below shows one way for a process to clear its buffer cache, dentries, and inodes. An inode is a data structure that represents a file; a dentry is a data structure that represents a directory entry. The buffer cache, or page cache, can contain memory mappings of any blocks on disk, which means anything the OS may hold in memory from a file. In lines (1) and (2) of the snippet below, writing 3 to /proc/sys/vm/drop_caches clears the buffer cache, dentries, and inodes on Linux systems without killing any application or service; it causes the kernel to look for file data on the disk rather than in the cache. Depending on the version or implementation of the operating system, it may also be necessary to invoke the BLKFLSBUF ioctl() system call to completely clear the buffer cache, as shown in lines (3) and (4).
(1)int fd_drop_cache = open("/proc/sys/vm/drop_caches", O_RDWR);   /* open the kernel cache-control file */
(2)write(fd_drop_cache, "3", 1);                                   /* drop the page cache, dentries, and inodes */
(3)int fd_dev = open("/dev/sda", O_RDONLY);                        /* open the shared disk device */
(4)ioctl(fd_dev, BLKFLSBUF);                                       /* flush the buffer cache of the device */
To summarize the file sharing method of FIG. 5 of the present invention: data integrity of the files is maintained even though each computer uses only its own local file system to write, read, and share the files. This is achieved by allocating the files a priori on the network attached disks before actual data are written to them; by assigning the pre-allocated files to the computers in a mutually exclusive way, so that each pre-allocated file is assigned to one and only one computer, which alone can write data to the file and then synchronizes the file data and metadata from its buffer cache to the network attached disk; and by having the processes on the other computers clear their local buffer caches so that they read the correctly updated file data from the network attached disk.
The present invention has so far shown a method of saving files on network attached disks and sharing them with other computers through the local file system of each computer. However, it is also possible to share files among multiple computers while maintaining their integrity by using the network attached disks simply as block storage devices, without mounting a local file system.
FIG. 6 shows the disk sectors that comprise a disk partition in the linear model of LBA (logical block addressing). The physical structure of a hard disk consists of cylinders, heads, and sectors; a disk is often represented by the LBA model, which numbers the sectors starting from sector 0 (zero). In other words, a disk or disk partition is shown as a series of contiguous disk sectors of a uniform size, typically 512 bytes each. All disk devices are equipped with logic that converts an LBA disk sector number into the corresponding cylinder, head, and sector numbers of the physical disk, so the desired sector can be accessed by indicating its LBA sector number.
As shown in FIG. 6, a disk location can be indicated with a byte address, called an offset, counted from the start of a disk partition. For example, an offset value of 2,000 indicates the byte located at the 2,001st position from the start of the disk partition, because byte addresses start from 0, not 1. Each disk sector is typically 512 bytes; an offset of 2,000 therefore indicates the 465th byte of the 4th sector from the start of the disk partition. Note that 2,001 = 512 * 3 + 465.
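The offset arithmetic above can be stated compactly in code; the following minimal C sketch reproduces the example calculation for offset 2,000 with 512-byte sectors.

#include <stdio.h>

int main(void)
{
    const unsigned sector_size = 512;
    unsigned offset = 2000;                         /* byte address from the start of the partition */
    unsigned sector = offset / sector_size;         /* 3: zero-based index of the 4th sector */
    unsigned byte_in_sector = offset % sector_size; /* 464: zero-based, i.e., the 465th byte */
    printf("sector index %u, byte index %u\n", sector, byte_in_sector);
    return 0;
}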
FIG. 7 is a diagram that shows how to share data on a network attached disk among multiple computers without a local file system mounted on the disk; instead, the network attached disk is treated simply as a block storage device, with no file system loaded on it. This differs from the method explained in FIG. 5, where data are saved and shared through files created by the local file system, only in that FIG. 7 uses the network attached disk (7-1) as a device file.
The term "bundle" here is a term defined in the present invention to represent a series of contiguous disk sectors. Each bundle (30-1, 30-2, 30-3, 30-4) of FIG. 7 consists of a series of contiguous disk sectors. The entirety of a disk or disk partition can be regarded as a group of bundles. For example, disk partition (7-1) of FIG. 7 consists of a series of bundles starting from the first bundle 0 (30-1) to the last bundle n (30-4).
Only one particular computer is given the exclusive writing privilege over a given bundle, and the other computers are restricted to read-only access to that bundle; only the computer with the exclusive writing privilege writes data to the bundles assigned exclusively to it. In this manner, if different computers are each given the exclusive writing privilege over different bundles, each computer can save data in the bundles over which it holds that privilege and the other computers can read the data in those bundles; this allows file data to be shared by two or more computers without corrupting the data by overwriting any sector of the disk (7-1).
For example, if bundle 0 (30-1) is assigned to computer 1 (1-21) with the exclusive writing privilege, only computer 1 (1-21) saves data in bundle 0 (30-1) and all the other computers 2 to k (1-22, 1-25) can read the saved data, enabling all the computers 1 to k (1-21, 1-22, 1-25) to share the data saved by computer 1 (1-21).
Below is an example program snippet that illustrates saving data in an allocated bundle. In the program, it is assumed that the name of partition (7-1) is /dev/sda and that bundle 0 (30-1) consists of 50 consecutive sectors starting from byte number 0 of the disk (7-1).
#define SECTOR_SIZE 512
int size = 4000;
char buf[size];                  /* buffer holding the 4000 bytes of data */
int offset = SECTOR_SIZE * 10;   /* byte address of the 11th sector: 5120 */
memset(buf, '0', size);          /* fill the buffer with test data */
(1)int fd = open("/dev/sda", O_RDWR);
(2)lseek(fd, offset, SEEK_SET);  /* move to byte 5120 of the partition */
(3)write(fd, buf, size);         /* save 4000 bytes starting at the 11th sector */
(4)lseek(fd, offset, SEEK_SET);  /* move back to the start of the written data */
(5)read(fd, buf, size);          /* read back the 4000 bytes written in line (3) */
In line (1) of the above program, the name of the whole disk partition, /dev/sda, is passed to the open system call as the first argument. The operating system treats not only files created through the file system but also the storage device itself as a file, i.e., a device file, and allows conventional system calls, including the write and read system calls, to be used on the device file just as on a file created through the local file system. lseek() in line (2) is a system call that repositions the file offset at which the opened file is read or written; its second argument sets the starting point of the file's input and output operations. In the above example program, the value of the offset is 512 * 10 bytes, or 5,120 bytes. SEEK_SET is the whence value specifying that the offset is not a relative value but an absolute byte address counted from the very beginning of disk partition (7-1). Therefore, line (3) commands the 4,000 bytes of data in the array buf to be saved starting from the 11th sector of the disk partition (7-1). Line (5) reads back the 4,000 bytes of data that were written in line (3). In order to read the data saved starting at byte address 5,120, the lseek system call is invoked in line (4) to move the offset back to where the 4,000 bytes of data saved by line (3) begin; the read system call in line (5) is then invoked to read the 4,000 bytes of data that had been written in line (3), beginning at byte 5,120 of the disk partition (7-1).
Lines (4) and (5) of the example program above can serve as the program executed when a process on one computer, e.g., computer k (1-25), wants to read the data saved by another computer, e.g., computer 1 (1-21). The following example program reads 4,000 bytes of data from the disk partition (7-1), starting from the 11th sector of the disk, which lies within bundle 0 (30-1).
#define SECTOR_SIZE 512
int size = 4000;
char buf[size];                  /* buffer to receive the shared data */
int offset = SECTOR_SIZE * 10;   /* byte address of the 11th sector: 5120 */
memset(buf, '0', size);
(1)int fd = open("/dev/sda", O_RDONLY);   /* open the shared partition read-only */
(2)lseek(fd, offset, SEEK_SET);           /* move to byte 5120 of the partition */
(3)read(fd, buf, size);                   /* read the 4000 bytes saved by computer 1 */
The data sharing method of FIG. 7 prevents computers from arbitrarily writing on disk sectors and damaging the data by allocating the bundles in an exclusive manner: a bundle is allocated to one and only one computer with the exclusive right to write to it, each computer stores data only on disk sectors in the bundles over which it holds the exclusive writing privilege, and the other computers are allowed to read the data in those bundles.
In the present invention, network attached disks, including iSCSI disks and others, are used so that the computers connected to the network can share data among themselves. However, instead of physical network attached disks, "virtual" network attached disks or volumes built on conventional disks attached directly to the internal system bus of a computer can be used. A virtual disk, virtual volume, or logical volume is a virtual device that provides an area of usable storage capacity on one or more physical disk drives in a computer system; the disk is described as logical or virtual because it does not actually exist as a single physical entity. All operating systems provide commands or tools to make a virtual disk or virtual volume on the physical disk devices of a computer system. In addition, device driver software for network attached disk devices can turn any logical or virtual disk or volume into a network attached disk, that is, a "virtual" network attached disk. As an example of implementing a virtual iSCSI disk, as shown in FIG. 8, we first create a virtual disk (6-1) on one or more conventional internal disks and then make the virtual disk (6-1) an iSCSI disk target (6-1) using the iSCSI target driver (31-1) software provided by the operating system. The iSCSI software initiator (31-2) enables the computer (1-2) to access iSCSI devices over Ethernet, and the iSCSI software target (31-1) enables a computer (1-1) to export local storage to be accessed by other iSCSI initiators (31-2) using the iSCSI protocol defined in RFC 3720. It is therefore evident that the network attached disk of the present invention can be a "virtual" network attached disk virtualized on conventional internal disks; indeed, using a "virtual" network attached disk, such as a virtual iSCSI target, is the norm in today's computer system practices.

Claims (7)

  1. A method for sharing file data among multiple computers by sharing the network attached disk volume on which the file data are stored.
  2. The file sharing method of claim 1, wherein files are shared among the aforementioned multiple computers by pre-allocating files on the aforementioned network attached disk volume, assigning the pre-allocated files to the computers in order that each of the files is assigned to one and only one computer that has the exclusive right to write data on the file assigned to it, and allowing the other computers to read the file.
  3. The file sharing method of claim 1, wherein files are shared among the aforementioned multiple computers by pre-allocating files on the aforementioned network attached disk volume, where the aforementioned network attached disk volume is a virtual network attached disk volume virtualized on conventional disk volumes directly attached to the internal bus of the computer, assigning the pre-allocated files to the computers in order that each of the files is assigned to one and only one computer that has the exclusive right to write data on the file assigned to it, and allowing the other computers to read the file.
  4. The file sharing method of claim 1, wherein data are shared among the aforementioned multiple computers by pre-allocating bundles of disk blocks on the aforementioned network attached disk volume, assigning the pre-allocated bundles of disk blocks to the computers in order that each bundle of disk blocks is assigned to one and only one computer that has the exclusive right to write data on the bundle of disk blocks assigned to it, and allowing other computers to read the bundle of disk blocks.
  5. The file sharing method of claim 1, wherein data are shared among the aforementioned multiple computers by pre-allocating bundles of disk blocks on the aforementioned network attached disk volume, where the aforementioned network attached disk volume is a virtual network attached disk volume virtualized on conventional disk volumes directly attached to the internal bus of the computer, assigning the pre-allocated bundles of disk blocks to the computers in order that each bundle of disk blocks is assigned to one and only one computer that has the exclusive right to write data on the bundle of disk blocks assigned to it, and allowing the other computers to read the bundle of disk blocks.
  6. The file sharing method of claim 1, wherein files are shared among the aforementioned multiple computers by sharing a network attached disk volume whose storage device is a hard disk, an SSD (solid-state drive), flash memory, RAID (Redundant Array of Inexpensive/Independent Disks), or JBOD (just a bunch of disks).
  7. The file sharing method of claim 1, wherein files are shared among the aforementioned multiple computers by sharing a virtual network attached disk volume virtualized on disks directly attached to the internal system bus, whose storage device is a hard disk, an SSD (solid-state drive), flash memory, RAID (Redundant Array of Inexpensive/Independent Disks), or JBOD (just a bunch of disks).
PCT/KR2020/012185 2020-09-09 2020-09-09 Method for processing files through network attached disks WO2022054984A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/KR2020/012185 WO2022054984A1 (en) 2020-09-09 2020-09-09 Method for processing files through network attached disks

Publications (1)

Publication Number Publication Date
WO2022054984A1 true WO2022054984A1 (en) 2022-03-17

Family

ID=80630341

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/012185 WO2022054984A1 (en) 2020-09-09 2020-09-09 Method for processing files through network attached disks

Country Status (1)

Country Link
WO (1) WO2022054984A1 (en)


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 20953386; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 Ep: pct application non-entry in european phase
Ref document number: 20953386; Country of ref document: EP; Kind code of ref document: A1