WO2022054984A1 - Method for processing files through network attached disks - Google Patents

Method for processing files through network attached disks

Info

Publication number
WO2022054984A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
disk
data
files
computers
Application number
PCT/KR2020/012185
Other languages
French (fr)
Inventor
Han Gyoo Kim
Original Assignee
Han Gyoo Kim
Application filed by Han Gyoo Kim
Priority to PCT/KR2020/012185
Publication of WO2022054984A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/17 Details of further file system functions
    • G06F 16/176 Support for shared access to files; File sharing support
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0638 Organizing or formatting or addressing of data
    • G06F 3/064 Management of blocks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0638 Organizing or formatting or addressing of data
    • G06F 3/0643 Management of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0662 Virtualisation aspects
    • G06F 3/0665 Virtualisation aspects at area level, e.g. provisioning of virtual or logical volumes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/18 File system types
    • G06F 16/182 Distributed file systems
    • G06F 16/1824 Distributed file systems implemented using Network-attached Storage [NAS] architecture

Definitions

  • This invention generally relates to file sharing methods in distributed computer systems. More specifically, it relates to a method for sharing files in distributed computer systems where each computer uses only its local file system.
  • Unlike a conventional disk attached to an internal bus of a computer, a network attached disk is a storage device that provides a block storage device to a host computer through a network.
  • Initially, disks were provided over a Fibre Channel network called a SAN (Storage Area Network); technologies that use the more common Ethernet to provide storage space to computers were developed later and are now widely used.
  • Network attached disks that are connected via Ethernet include iSCSI disks, NetDisks, AoE disks, and FCoE disks.
  • iSCSI stands for "Internet Small Computer Systems Interface." IBM and Cisco started developing it in 1998, and it was adopted as a standard in 2000. iSCSI provides disks to computers through a network and is based on the Internet communications protocol. Ximeta, Inc., a company based in the U.S., developed and launched NetDisk using a proprietary protocol in 2002.
  • FCoE stands for "Fibre Channel over Ethernet." Its development started in 2003, and its protocol was standardized in 2007. FCoE is a protocol technology that connects storage apparatus to a computer through Ethernet instead of Fibre Channel.
  • AoE stands for "ATA over Ethernet." It was announced in 2004. AoE is a protocol that connects devices using the AT/ATAPI (AT Attachment Packet Interface) interface standard, such as regular hard disk drives (HDDs) and SSDs (solid-state drives), to Ethernet to provide an HDD or SSD to a computer. Unlike the aforementioned iSCSI technology, which connects SCSI (Small Computer System Interface) disks to Ethernet, AoE connects the relatively cheaper AT/ATAPI disks to Ethernet and provides them to a computer.
  • Network attached disk technologies including SAN, iSCSI, NetDisk, FCoE, and AoE provide storage devices through a network to the computers connected to that network, but no technology exists that allows multiple computers to share files stored on the network attached disks such that each computer uses only its own local file system, without coordination among the local file systems of the computers in the network.
  • Although there is high demand for sharing files among the computers of distributed data processing systems such as Hadoop, network attached disks are not currently used in such distributed systems because they cannot be shared as a storage device under each computer's local file system. More specifically, merely connecting a network attached storage device to the network so that the computers in the network access it as their local storage device does not provide sharing of the files stored on the device, because files on a storage device are created and stored through the "file system" software layer of the operating system. A file created and stored on a storage device by one computer's local file system cannot be properly accessed by the other computers' local file systems, because those file systems have no way to obtain the metadata of the file, which are stored and maintained by the particular local file system that created the file.
  • Metadata contain the information on which blocks of the storage device a file is stored. Metadata generated by a local file system are not shared with other independent local file systems, so there is no way for those file systems to access a file created and stored by another local file system.
  • The present invention aims to provide technology for network attached disks to be used to share files while each computer uses only its own local file system, without the need for cooperation among the local file systems of the computers in the network.
  • Under the conventional method of sharing files among the computers of distributed systems such as Hadoop, each computer reads a file stored on its internal disk into its buffer cache in main memory and transfers the file over the network to other computers; the computer that receives the file stores it in its own buffer cache in main memory and then saves it to its internal disk. This transmission occupies a large portion of the main memory of both sender and receiver that could otherwise have been allocated to processing their jobs.
  • Another purpose of the present invention is to improve the job processing throughput of the whole system by eliminating the main memory occupation involved in file transfers for sharing data between computers and by allocating the saved main memory to other job processing, thus increasing the effective amount of main memory available to the computers.
  • The present invention relates to a method for multiple computers to share data files using network attached storage devices, where each computer uses only its own local file system to create and share files without damaging their integrity.
  • The disk mentioned in the present invention refers not only to hard disk drives (HDDs) and SSDs (solid-state drives) but also to any nonvolatile block storage device, such as USB drives and SD (secure digital) cards.
  • Although the embodiments of the present invention are explained with reference to Hadoop systems, it is obvious that the file sharing method of the present invention can be used for sharing file data among multiple computers in other distributed systems besides Hadoop.
  • With the file sharing method via network attached disks disclosed herein, a computer's main memory is no longer occupied for transmitting file data among computers as in conventional distributed systems such as Hadoop; therefore the data processing throughput of the distributed system as a whole is greatly improved.
  • FIG. 1 is a diagram of the constituents of a Hadoop system.
  • FIG. 2 is a diagram of how the intermediate result files are transmitted and received between mappers and reducers in a Hadoop system.
  • FIG. 3 is a diagram of how multiple computers share files by sharing network attached disks.
  • FIG. 4 is a diagram of what software components of multiple computers are required to share network attached disks.
  • FIG. 5 is a diagram that describes a method where each computer uses only its own local file system to share files among computers without the coordination at the level of local file systems of multiple computers.
  • FIG. 6 is an illustration of disk partitions according to LBA (logical block addressing).
  • FIG. 7 is an illustration of how files are shared among computers by using network attached disks as simple block storage devices without mounting a file system on the device.
  • FIG. 8 is an illustration of how a conventional internal storage device is used as a virtual network attached storage device for file sharing.
  • FIG. 1 shows a Hadoop system consisting of data nodes (1-1, 1-2, 1-3, 1-4), up to several thousand computers connected through a network (2) to process big data.
  • In a Hadoop system, the data nodes (1-1, 1-2, 1-3, 1-4), which are the computers that process data, have one or more mappers (20-1, 20-2, 20-3, 20-4, 20-5) and reducers (21-1, 21-2, 21-3, 21-4, 21-5, 21-6), functions that process data by dividing the load among the multiple data nodes.
  • A mapper is a function for deriving intermediate data. Data to be processed are stored in data blocks (16-1, 16-2, 16-3) distributed among multiple data nodes via the Hadoop file system. Each mapper processes the data in the data blocks of its own data node, so the whole data set is processed by multiple mappers (20-1, 20-2, 20-3, 20-4, 20-5) in parallel.
  • Each mapper produces intermediate data as files (11-1, 11-2, 11-3, 11-4, 11-5, 11-6) and saves them as intermediate files (12-1, 12-2, 12-3, 12-4, 12-5, 12-6) on its own data node through the local file system (10-1, 10-2, 10-3, 10-4) of that data node.
  • The intermediate files (11-1, 11-2, 11-3, 11-4, 11-5, 11-6) produced by mappers are initially created in the buffer caches of each data node's operating system before being saved on local disks as files (12-1, 12-2, 12-3, 12-4, 12-5, 12-6). The temporary files in the buffer caches, which reside in the main memory of each data node (3-1, 3-2, 3-3, 3-4), and the files saved on the disks are the same files with identical data; nevertheless, in all drawings of the present invention, the intermediate files in buffer caches are numbered 11-1, 11-2, and so on, and the intermediate files saved on disks are numbered 12-1, 12-2, and so on, in order to distinguish the files in the buffer caches in main memory (3-1, 3-2, 3-3, 3-4) from the files saved on the disks of the data nodes.
  • The buffer cache, sometimes called the page cache, is a portion of main memory controlled by the operating system that keeps copies of disk blocks so that it can serve as a transparent cache for blocks of the hard disk drive. Using the buffer cache in main memory to temporarily store copies of disk blocks, which is called disk buffering, results in quicker file accesses, because main memory is faster to access than secondary storage devices such as hard disk drives.
  • Intermediate data files produced by the mappers are transmitted to the reducers (21-1, 21-2, 21-3, 21-4, 21-5, 21-6). A reducer is a function that gathers and processes intermediate data files. Like mappers, reducers are distributed among multiple data nodes, and there may be many reducers on each data node processing a single task. Mappers transmit intermediate files (12-1, 12-2, 12-3, 12-4, 12-5, 12-6) to reducers (21-1, 21-2, 21-3, 21-4, 21-5, 21-6), and each reducer puts together the intermediate files it has received from multiple mappers and processes them.
  • In this manner, the final result files (14-1, 14-2, 14-3) that each reducer has processed are saved as files (15-1, 15-2, 15-3) on local disks through the local file system of each data node, and these final data files (15-1, 15-2, 15-3) constitute the final results of the job.
  • The reason for distributing numerous mappers and reducers among the data nodes is to increase data processing speed by distributing the vast amount of total data among multiple data nodes and processing it in parallel.
  • FIG. 2 shows the typical process by which an intermediate file produced by a mapper is transmitted to a reducer through the HTTP (hypertext transfer protocol) communication protocol.
  • For example, intermediate file p (11-7), placed in the buffer cache in main memory (3-1) by mapper j (20-1) of data node M (1-1), is saved as file p (12-7) on the disk (6-1); it is then transmitted via the HTTP interface program (22-1) through the network stack (4-1) of the operating system of data node M (1-1) and the NIC (Network Interface Card, or Network Interface Chip) (5-1), the network interface hardware, and is sent to data node N (1-2), on which reducer h (21-1) is waiting to receive the intermediate file p (11-9, 12-8).
  • In the course of transferring the intermediate file, the intermediate file p (12-7) on the disk (6-1) cannot be transmitted directly to the network (2) via the HTTP interface program (22-1); it must first be loaded into main memory (3-1) as intermediate file p (11-8) before it can be transmitted.
  • The intermediate file p (11-8) transmitted in this manner passes through the NIC (5-2) of data node N (1-2), is loaded into the main memory (3-2) of data node N (1-2) as intermediate file p (11-9) through the network stack (4-2) of the operating system and the HTTP interface program (22-2), and only then is saved as file p (12-8) on the reducer's local disk (6-2) through the local file system (10-2). Afterwards, it is loaded again as intermediate file p (11-10) into main memory (3-2) before reducer h (21-1) processes the data of file p (11-10).
  • In FIG. 2, file p (11-7) produced by mapper j (20-1), each file p (12-7, 12-8) saved on the disks (6-1, 6-2) of the two data nodes (1-1, 1-2), and each file p (11-8, 11-9, 11-10) loaded into main memory are the same file with identical data; however, in order to distinguish where each file instance resides, the files in main memory are numbered 11-7, 11-8, 11-9, 11-10, and the files saved on the disks are numbered 12-7, 12-8, in a manner similar to the numbering in FIG. 1.
  • Each data node transmits intermediate files produced by mappers to reducers in other data nodes. During this process, each data node occupies its main memory (3-1, 3-2) as it transmits and receives intermediate files, due to the network communication involved. Because of this overhead, the throughput of the data processing job is reduced: the main memory occupied in transferring data files over the network would otherwise have been allocated to data processing jobs.
  • The present invention relates to a method for improving the throughput and performance of big data processing systems such as the Hadoop system. More specifically, the present invention provides for data nodes, or computers, to share intermediate files by storing them on storage devices that are attached to a network and shared among the data nodes. The method of sharing data files presented in this invention eliminates the main memory occupation by the network stack and buffer cache required for transferring data files between data nodes, thus improving overall data processing throughput and performance.
  • FIG. 3 is a diagram that shows the system of the present invention, with multiple data nodes, or computers, sharing network attached disks.
  • To avoid the occupation of main memory that occurs when intermediate files are transferred over the network, the present invention uses a network attached disk (7-1) to share the files. As mentioned above, network attached disks include iSCSI (Internet Small Computer Systems Interface) disks, NetDisks, and AoE (ATA over Ethernet) disks.
  • By attaching a network attached disk to a network (2) to which many data nodes (1-1, 1-2) are connected, the present invention provides the network attached disk (7-1) as a local disk to each data node (1-1, 1-2), as shown in FIG. 3. That is, the network attached disk (7-1) is provided as a local disk to both data node N (1-1) and data node M (1-2).
  • In FIG. 3, mapper j (20-7) saves the intermediate data file h (11-1), through its local file system (10-1), as file h (12-1) on the network attached disk (7-1) via the network stack (4-1) and NIC (5-1).
  • File h (11-1) produced by mapper j (20-7) refers to the file in the buffer cache of the operating system of data node M (1-1), where mapper j (20-7) is running, and file h (12-1), identical in data to file h (11-1), refers to the file saved on the network attached disk (7-1).
  • Reducer k (21-7), which is executed on another data node N (1-2), processes the data of intermediate data file h (12-1) on the network attached disk (7-1) by loading it as intermediate data file h (11-2) into the buffer cache in main memory through its own local file system (10-2).
  • In the conventional method illustrated by FIG. 2, the mapper's data file stored on its local disk has to be loaded into the buffer cache, occupying memory equal to the size of the file; another occupation of the same amount of memory occurs when the data file is copied from the buffer cache to the network stack for transfer to the reducer; and exactly the same memory occupations occur on the reducer's computer, where the same file transfer procedure runs in reverse order for the reducer to store the data file on its local disk.
  • On the contrary, the file sharing method of the present invention in FIG. 3 has data nodes directly access files stored on network attached disks to share them, instead of one data node transmitting files to another over the network; thus no memory occupation is entailed by the transfer of data files between data nodes. In the conventional method, the memory occupied by transferring a file over the network is twice the size of the data file. The method of the present invention can therefore allocate that much memory to data processing jobs, increasing the effective amount of memory of the system compared to the conventional method.
  • FIG. 4 is a block diagram of how multiple data nodes (1-1, 1-2, 1-3), or computers, share multiple network attached disks (7-1, 7-2, 7-3, 7-4, 7-5).
  • Each of the data nodes connected to the network (2) recognizes the network attached disks (7-1, 7-2, 7-3, 7-4, 7-5) as its own local disks through the device driver software modules (13-1, 13-2, 13-3), mounted on each data node, that control the network attached disks.
  • As for the device drivers (13-1, 13-2, 13-3), Linux operating systems provide initiator device driver software for iSCSI devices, and Windows, VMware, and various other operating systems also provide initiator software so that network attached disks can be recognized and used as local disks.
  • In a computer system that uses network attached disks, the device driver software accesses and controls the network attached disks through the network interface (NIC), while files on network attached disks are read and written through the local file system, similarly to files on the computer's internal disks.
  • However, multiple data nodes cannot independently read or write files on network attached disks without breaching the integrity of the file data. If each data node reads and writes files through its own local file system without cooperating with the other data nodes, no data node sharing the network attached disks knows which sectors of those disks the other data nodes are writing on. Therefore, when multiple data nodes write different data over the same disk sector of a shared network attached disk, the integrity of the file data is compromised. In other words, because the individual local file systems of the data nodes do not work in collaboration, the metadata, which record which files are stored on which blocks of the disk, end up differing from one data node to another, and a consistent file system on the shared network attached disks cannot be kept. For this reason, network attached disks such as iSCSI, AoE, FCoE, and NetDisk are not by themselves enough for the computers of a network to share files stored on them.
  • FIG. 5 is an illustration of the file sharing method of the present invention, where each computer (1-11, 1-12, 1-13) shares files by using only its own local file system (10-11, 10-12, 10-13), without cooperating with the other computers' local file systems.
  • A disk partition, or disk volume, is a collection of contiguous blocks of the sectors of a storage device, such as a hard disk or an SSD (solid-state drive). A single physical hard disk or SSD can be divided into several partitions. Each partition is recognized as an independent storage volume by the operating system, and a single file system is loaded on an individual disk partition. Therefore, the explanations in the present invention do not need to distinguish between disks and disk partitions.
  • In FIG. 5, each computer (1-11, 1-12, 1-13) mounts the network attached disks or disk partitions (7-1, 7-2, 7-3) on its local file system as its local disks, with both read and write privileges.
  • Mounting a disk or disk partition is the process of logically attaching the disk to a directory within the file system structure; after mounting, files can be read and written on the mounted disk through the local file system.
  • Mounting commands and related software tools are provided in all operating systems. In Linux operating systems, for example, mounting a network attached disk to the directory /networkdisk1 of the local file system (10-11) of computer 1 (1-11) is done with the mount command. The snippet below shows the network attached disk of FIG. 5, named /dev/sda, being mounted to the /networkdisk1 directory:

    sudo mount /dev/sda /networkdisk1
  • In FIG. 5, dotted lines (100 - 108) are used to emphasize that each network attached disk (7-1, 7-2, 7-3) is mounted not on a single computer but on multiple computers at the same time, with both read and write privileges.
  • In the file sharing method of the present invention, files (12-10, 12-11, 12-12, 12-13, 12-14, 12-15, 12-16, 12-17) are created a priori on the network attached disks (7-1, 7-2, 7-3) before each computer (1-11, 1-12) accesses the network attached disks (7-1, 7-2, 7-3) and starts reading and writing files on them.
  • Creating files of desired sizes before storing actual data in them can be done in Linux operating systems, for example, with the fallocate command.
  • fallocate is a command provided by Linux operating systems; it stands for "file allocate." fallocate creates a file that takes up a given amount of disk space before data is saved to it. Below is an example of using the fallocate command to create a file named file1.txt that is 10 MB in size.
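  • Assuming the standard util-linux syntax of fallocate, the example would be:

    fallocate -l 10M file1.txt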
  • The fallocate command secures disk space that can contain data of the specified size before the actual data is saved. Here, the file1.txt file is created by securing 10 MB of disk blocks before valid data is actually stored in it.
  • As many pre-allocated files as desired (12-10, 12-11, 12-12, 12-13, 12-14, 12-15, 12-16, 12-17) can be created in advance on the network attached disks (7-1, 7-2, 7-3).
  • The pre-allocated files (12-10, 12-11, 12-12, 12-13, 12-14, 12-15, 12-16, 12-17) are assigned to individual computers (1-11, 1-12, 1-13) in an exclusive way; that is, a pre-allocated file is assigned to one computer and never to the others. For example, two files (12-10, 12-11) can be assigned to computer 1-11 only, three files (12-12, 12-13, 12-14) to computer 1-12 only, and another three files (12-15, 12-16, 12-17) to computer 1-13 only. Each computer can write data to the pre-allocated files exclusively assigned to it and never writes to pre-allocated files that are not assigned to it, whereas all computers can read data from any of the pre-allocated files.
  • A computer can thus share data files with other computers: only the one computer to which a pre-allocated file is exclusively assigned can write data to it, while the other computers cannot update the data in the file but can read it.
  • The reason for providing pre-allocated files before a computer starts to write data to them is to maintain the integrity of the file system while individual computers write and read data to and from the files assigned exclusively to each of them.
  • The integrity of a file system is maintained when metadata such as inodes (index nodes) remain consistent for the files stored in the file system.
  • The inode is a data structure in a Unix-style file system that records which disk blocks store which portions of the data in a file.
  • Pre-allocating files and assigning them exclusively to different computers guarantees that metadata such as the inodes of the files remain the same: the set of fixed disk blocks allocated a priori to a pre-allocated file is never replaced with other disk blocks as long as the size of the data stored in the file does not exceed the pre-allocated disk space, and only one particular computer can update the data in a given pre-allocated file. Therefore, file sharing in the present invention is achieved such that one computer can update the data in a pre-allocated file assigned exclusively to it while the other computers can read the data from the file, without compromising file system integrity.
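  • The program referenced in the following paragraphs can be sketched with POSIX system calls as below; the parenthesized line numbers match the references that follow, the path /shared/file1.txt and the array data_buf come from the surrounding text, and declarations and error handling are omitted for brevity.

    /* assumes #include <fcntl.h> and <unistd.h>; char data_buf[4000] holds the data */
    (1) fd = open("/shared/file1.txt", O_WRONLY);  /* open the pre-allocated shared file */
    (2) write(fd, data_buf, 4000);                 /* write 4000 bytes from the beginning */
    (3) fsync(fd);                                 /* force the file data onto the disk */
    (4) fd_dev = open("/dev/sda", O_RDONLY);       /* open the shared disk device file */
    (5) fsync(fd_dev);                             /* force the metadata onto the disk */
    (6) close(fd);                                 /* terminate use of the file */
    (7) close(fd_dev);                             /* terminate use of the disk device */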
  • Line (3) of the program above invokes the fsync system call, which synchronizes the data written to file1.txt, forcing data that may still be in the buffer cache to be stored physically on the disk; doing so allows the other computers that share the file, i.e., file1.txt in this example, to read the correct data when they read it.
  • In addition, the relevant metadata of the file must be reflected and stored on the disk; for this, the disk device shared by the mapper and the reducers on which the file is located, /dev/sda in the example, has to be synchronized, as shown in line (5) of the program above.
  • The open system call is invoked on the disk device to obtain the file descriptor of the disk device file, as shown in line (4) of the program.
  • The close system call is invoked, as in lines (6) and (7), to terminate the use of the file and the disk device.
  • Because the file is created in advance by means of the fallocate command before a process running on a computer actually writes data to it, the size of the actual data, when written, normally differs from the size designated when the file was created by the fallocate command. Therefore, the process that has written data to the pre-allocated file must notify the processes running on the other computers that want to read the data of the exact data size written, so that those processes can read the actual data written to the file. Note that when a process writes data to a file, the data are written consecutively from the very beginning of the file towards its end.
  • The other processes on the other computers therefore only need to know the data size of the file, so that they can read data from the beginning of the file consecutively up to the size of the actual data written. For example, in the program above, 4000 bytes of data were written to the file1.txt file, but because file1.txt was created by securing 10 MB of disk space in advance, the size of file1.txt is 10 MB.
  • data_buf in the program below is an array of 4000 bytes, and the third argument, 4000, of the read(fd, data_buf, 4000) system call in line (2) specifies reading up to 4000 bytes of the file /shared/file1.txt, starting from the beginning of the file.
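  • A sketch of this reading program, under the same assumptions as the writing program above:

    /* assumes #include <fcntl.h> and <unistd.h>; char data_buf[4000] */
    (1) fd = open("/shared/file1.txt", O_RDONLY);  /* open the shared file read-only */
    (2) read(fd, data_buf, 4000);                  /* read the 4000 bytes actually written */
    (3) close(fd);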
  • If the data to be saved exceed the size of a single pre-allocated file, the writing process may use two or more pre-allocated files to save all of the data and then concatenate them into a single file. For example, if two files of 1 MB each were allocated but 1.5 MB of data need to be saved, 1 MB of data is written to one of the allocated files and the remaining 0.5 MB to the other, and then the two files are concatenated. In Linux operating systems, files can be concatenated with the cat command: file2 is appended to the end of file1, and the concatenated file is named file1. If a large amount of data is first written to fill up file1 and the rest of the data is written to file2, the two files can be concatenated in this manner.
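  • In Linux, for example, the appending form of the cat command for this step would be:

    cat file2 >> file1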
  • Files can be pre-allocated one at a time on demand, or multiple files can be pre-allocated at once. Whether the pre-allocated files are created dynamically or in advance, the process that writes data to a pre-allocated file should invoke the aforementioned fsync system call on the file, so that the file data and metadata are synchronized and saved to the network attached disk before processes on other computers read the data from the file.
  • Any process that wants to read the file but has no exclusive right to write to it must flush, or clear, the possibly stale data and metadata of the file that may be held in the buffer cache of its local file system before actually reading the file, because its local buffer cache might not hold the correctly updated data and metadata of a file that has just been written by a process on another computer.
  • Operating systems including Linux provide options to flush their buffer caches.
  • The snippet below shows a way for a process to clear its buffer cache, dentries, and inodes.
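  • In Linux, for example, this is commonly done as follows (root privileges assumed):

    sync                                  # write dirty pages back to disk first
    echo 3 > /proc/sys/vm/drop_caches     # drop the page cache, dentries, and inodes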
  • An inode is a data structure that represents a file.
  • A dentry is a data structure that represents a directory entry.
  • The buffer cache, or page cache, can contain memory mappings of blocks on disk, which means anything the OS may hold in memory from a file.
  • Writing 3 to /proc/sys/vm/drop_caches clears the buffer cache, dentries, and inodes on Linux systems without killing any application or service. It causes the kernel to look for files on the disk rather than in the cache.
  • In summary, the data integrity of files is maintained even though each computer uses only its own local file system to write, read, and share the files. This is achieved by allocating files a priori on the network attached disks before actual data are written to them; assigning the pre-allocated files to the computers in a mutually exclusive way, so that each pre-allocated file is assigned to one and only one computer, which alone can write data to it and then synchronizes the file data and metadata from its buffer cache to the network attached disk; and having the processes on the other computers clear their local buffer caches so that they read the correctly updated file data from the network attached disk.
  • The discussion so far has shown a method of saving files on network attached disks and sharing them with other computers through the local file system of each computer.
  • FIG. 6 shows the disk sectors that comprise a disk partition in the linear model of LBA (logical block addressing).
  • While the physical structure of a hard disk consists of cylinders, heads, and sectors, a disk is often represented by the LBA model, which numbers the sectors starting from sector 0 (zero).
  • In the LBA model, a disk or disk partition is shown as a series of contiguous disk sectors of uniform size, typically 512 bytes each. All disk devices are equipped with logic that converts an LBA disk sector number into the corresponding cylinder, head, and sector numbers of the physical disk, so the desired sector can be accessed by indicating its LBA sector number.
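  • As an illustration of that conversion logic, the classic mapping from a cylinder/head/sector triple $(C, H, S)$, with $S$ counted from 1, to an LBA sector number is $LBA = (C \cdot N_{heads} + H) \cdot N_{sectors} + (S - 1)$, where $N_{heads}$ is the number of heads per cylinder and $N_{sectors}$ is the number of sectors per track.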
  • FIG. 7 is a diagram that shows how to share data on a network attached disk among multiple computers without a local file system mounted on the disk; instead, the network attached disk is treated simply as a block storage device, without loading any file system on it. This differs from the method explained in FIG. 5, where data are saved and shared using files created through the local file system, only in that FIG. 7 uses the network attached disk (7-1) as a device file.
  • "Bundle" is a term defined in the present invention to denote a series of contiguous disk sectors.
  • Each bundle (30-1, 30-2, 30-3, 30-4) of FIG. 7 consists of a series of contiguous disk sectors, and the entirety of a disk or disk partition can be regarded as a group of bundles.
  • The disk partition (7-1) of FIG. 7 consists of a series of bundles from the first bundle, bundle 0 (30-1), to the last, bundle n (30-4).
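  • The example program discussed next can be sketched with POSIX system calls, the line numbers matching the references below; the 4000-byte array buf is an assumption, and the 11th sector corresponds to LBA sector 10, i.e., byte offset 10 × 512:

    /* assumes #include <fcntl.h> and <unistd.h>; char buf[4000] */
    (1) fd = open("/dev/sda", O_RDWR);             /* open the disk partition as a device file */
    (2) lseek(fd, 10 * 512, SEEK_SET);             /* seek to the 11th sector of the partition */
    (3) write(fd, buf, 4000);                      /* save 4000 bytes within bundle 0 */
    (4) lseek(fd, 10 * 512, SEEK_SET);             /* seek back to the same absolute offset */
    (5) read(fd, buf, 4000);                       /* read back the 4000 bytes written in line (3) */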
  • SEEK_SET is the whence value specifying that the offset is not relative but an absolute byte address counted from the very beginning of the disk partition (7-1). Therefore, line (3) commands the 4,000 bytes of data in the array buf to be saved starting from the 11th sector of the disk partition (7-1). Program line (5) reads the 4,000 bytes of data that were written in line (3).
  • Lines (4) and (5) of the example program above can serve as the program executed when a process in one computer, e.g., computer k (1-25), wants to read the data saved by another computer, e.g., computer 1 (1-21).
  • The following example program reads 4,000 bytes of data from the disk partition (7-1), starting from the 11th sector of the disk, which lies within bundle 0 (30-1).
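  • A sketch of that reading program, under the same assumptions as above:

    /* assumes #include <fcntl.h> and <unistd.h>; char buf[4000] */
    (1) fd = open("/dev/sda", O_RDONLY);           /* open the shared device read-only */
    (2) lseek(fd, 10 * 512, SEEK_SET);             /* seek to the 11th sector */
    (3) read(fd, buf, 4000);                       /* read the 4000 bytes within bundle 0 */
    (4) close(fd);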
  • The data sharing method of FIG. 7 prevents computers from arbitrarily writing on disk sectors and damaging the data by allocating bundles in an exclusive manner: a bundle is allocated to one and only one computer with the exclusive right to write on it, each computer stores data only on disk sectors in the bundles over which it has that exclusive writing privilege, and the other computers may read the data in the bundle.
  • So far, network attached disks, including iSCSI disks and others, have been used so that the computers connected to the network can share data among themselves. The network attached disks of the present invention can also be virtual network attached disks, or volumes, installed on conventional disks attached directly to the internal system bus of a computer, instead of physical network attached disk devices.
  • A virtual disk, virtual volume, or logical volume is a virtual device that provides an area of usable storage capacity on one or more physical disk drives in a computer system. The disk is described as logical or virtual because it does not actually exist as a single physical entity. All operating systems provide commands or tools to make a virtual disk or virtual volume on the physical disk devices of a computer system.
  • Device driver software for network attached disk devices can make any logical or virtual disk or volume into a network attached disk, that is, a "virtual" network attached disk.
  • Taking the implementation of a virtual iSCSI disk shown in FIG. 8 as an example, we first create a virtual disk (6-1) on one or more conventional internal disks and then make the virtual disk (6-1) an iSCSI target (6-1) using the iSCSI target driver (31-1) software provided by operating systems.
  • The iSCSI software initiator (31-2) enables the computer (1-2) to access iSCSI devices over Ethernet.
  • The iSCSI software target (31-1) enables a computer (1-1) to export local storage to be accessed by other iSCSI initiators (31-2) using the iSCSI protocol defined in RFC 3720. It is therefore obvious that the network attached disk of the present invention can be a "virtual" network attached disk virtualized on conventional internal disks; indeed, using virtual network attached disks, such as virtual iSCSI targets, is the norm in today's computer system practice.
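  • As a sketch of how such a setup is commonly built on Linux, assuming the LIO targetcli tool on the target side and open-iscsi on the initiator side (all names, paths, the IQN, and the address below are illustrative assumptions):

    # on the exporting computer: back a virtual disk with a file and expose it as an iSCSI target
    targetcli backstores/fileio create vdisk1 /srv/vdisk1.img 10G
    targetcli iscsi/ create iqn.2020-09.example.host:vdisk1
    targetcli iscsi/iqn.2020-09.example.host:vdisk1/tpg1/luns create /backstores/fileio/vdisk1
    # (access control configuration omitted)

    # on each sharing computer: discover the target and log in; the disk then
    # appears as a local block device such as /dev/sdX
    iscsiadm -m discovery -t sendtargets -p 192.168.0.10
    iscsiadm -m node --login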

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for sharing files among multiple computers is disclosed: a method for computers to save and share files without damaging them, where each computer uses only its own local file system, without the need for coordination between the file systems of the multiple computers. More specifically, a method is disclosed for sharing files between computers through network attached disks in distributed computing systems that process big data, such as the Hadoop system.

Description

METHOD FOR PROCESSING FILES THROUGH NETWORK ATTACHED DISKS
This invention in general relates to file sharing methods in distributed computer systems. More specifically, this invention relates to a method that shares files in distributed computer systems where each computer uses only its local file system.
Unlike a conventional disk that is attached to an internal bus of a computer, a network attached disk, which provides block storage device to a host computer through a network, is a type of storage device that is attached via network to a computer. Technology that uses Fibre Channel communication protocol network called SAN (Storage Area Network) to provide disks was used initially, but technologies that use the more common Ethernet to provide storage space to computers was later developed and is being widely used.
Network attached disks that are connected via Ethernet include iSCSI disk, NetDisk disk, AoE disk, and FCoE disk. iSCSI stands for "Internet Small Computer Systems Interface." IBM and Cisco started developing it in 1998, and it was selected as a standard in 2000. iSCSI provides disks to computers through a network and is based on Internet communications protocol. Ximeta, Inc., a company based in the U.S., developed and launched NetDisk using proprietary protocol in 2002. FCoE stands for "Fibre Channel over Ethernet." Its development started in 2003, and its protocol was standardized in 2007. FCoE is a protocol technology that connects storage apparatus to a computer through Ethernet instead of Fibre Channel. AoE stands for "ATA over Ethernet." It was announced in 2004. AoE is a protocol that connects devices that use AT/ATAPI (AT Attachment Packet Interface) interface standard, such as regular hard disk drives (HDDs) and SSDs (solid-state drives), to Ethernet to provide HDD or SSD to a computer. Unlike the aforementioned iSCSI technology, which connects SCSI (Small Computer System Interface) disks to Ethernet, AoE technology connects the relatively cheaper AT/ATAPI disks to Ethernet and provides them to a computer.
Network attached disk technologies including SAN, iSCSI, NetDisk, FCoE, AoE, etc., provide storage devices through network to computers connected to the network, but no technology exists that allows multiple computers to share files stored on the network attached disks such that each computer in the network uses only its own local file system without the need for coordination between their local file systems of multiple computers in the network.
Although there is high demand for sharing files among multiple computers of distributed data processing systems such as Hadoop, network attached disks are not currently used in such distributed systems because they cannot be shared as a storage device under each computer's local file system. More specifically, mere connecting a network attached storage device to the network such that the computers in the network access the network attached storage device as their local storage device does not provide the sharing of files stored on the network attached device because the files on the storage device are to be created and stored through "file system" software layer in the operating system. A file created and stored on a storage device by a computer's local file system cannot be properly accessed by other computers' local file systems because the other computers' local file systems have no way to get the metadata of the file whose data are stored and maintained by the particular local file system that created and stored the file on the device. Metadata contain the information on which blocks of the storage device the file is stored. Metadata generated by a local file system are not shared with other independent local file systems, so there is no way the local file systems access the file created and stored by another local file system. The present invention purposes to provide technology for network attached disks to be used to share files while each computer only uses its own local file system without the need for cooperation among the local file systems of the computers in the network.
Under the conventional method of sharing files among computers of distributed systems such as Hadoop, each computer reads a file stored on its internal disk into its buffer cache in main memory and transfers the file over the network to other computers, and the computer that has received the file stores it on its own buffer cache in its main memory and then saves it to its internal disk; in the process of file transmission from a computer to the other, the process occupies a large portion of main memory of the sender and receiver that could have been allocated for processing their jobs instead of just transferring the file.
Another purpose of the present invention is to improve job processing throughput of the whole system by eliminating the occupation of main memory involved in the file transfer for sharing data between the computers, and by allocating the very amount of the saved main memory for other job processing of the computers, thus increasing the effective amount of main memory for the computers.
The present invention relates to the method for multiple computers to share data files using network attached storage devices, where each computer uses only its own local file system to create and share without damaging the integrity of files.
The disk mentioned in the present invention refers not only to hard disk drives (HDDs) and SSDs (solid state drives) but also to any nonvolatile block storage devices, such as USB drives and SD (secure digital) cards.
Although the embodiments of the present invention are explained with reference to Hadoop systems, it is obvious that the file sharing method of the present invention can be used for sharing file data among multiple computers in other distributed systems besides the Hadoop system.
If the file sharing method via network attached disks as disclosed herein according to the present invention, a computer's main memory is no longer occupied for transmitting file data among computers as in conventional distributed systems such as Hadoop; therefore the data processing throughput of distributed systems as a whole is greatly improved.
FIG. 1 is a diagram of the constituents of a Hadoop system.
FIG. 2 is a diagram of how the intermediate result files are transmitted and received between mappers and reducers in a Hadoop system.
FIG. 3 is a diagram of how multiple computers share files by sharing network attached disks.
FIG. 4 is a diagram of what software components of multiple computers are required to share network attached disks.
FIG. 5 is a diagram that describes a method where each computer uses only its own local file system to share files among computers without the coordination at the level of local file systems of multiple computers.
FIG. 6 is an illustration of disk partitions according to LBA (logical block addressing).
FIG. 7 is an illustration of how files are shared among computers by using network attached disks as simple block storage devices without mounting file system on the device.
FIG. 8 is an illustration of how a conventional internal storage device is used as a virtual network attached storage device for file sharing.
FIG. 1 shows a Hadoop system consisting of data nodes (1-1, 1-2, 1-3, 1-4), which can be consisted of up to several thousands of computers that have been connected through network (2) to process big data. In Hadoop system, data nodes (1-1, 1-2, 1-3, 1-4), which are computers that process data, have one or more mappers (20-1, 20-2, 20-3, 20-4, 20-5) and reducers (21-1, 21-2, 21-3, 21-4, 21-5, 21-6), which are functions for processing data by dividing up the load among the multiple data nodes. A mapper is a function for deriving intermediate data. Data to be processed are stored in the data blocks (16-1, 16-2, 16-3) distributed among multiple data nodes via Hadoop file system. Each mapper processes data in the data blocks of its own data nodes, and thus the whole data are processed by multiple mappers (20-1, 20-2, 20-3, 20-4, 20-5) in parallel.
Each mapper that processes data produces intermediate data into files (11-1, 11-2, 11-3, 11-4, 11-5, 11-6) and saves them as intermediate files (12-1, 12-2, 12-3, 12-4, 12-5, 12-6) on its own data nodes through the local file system (10-1, 10-2, 10-3, 10-4) of each data node. The intermediate files (11-1, 11-2, 11-3, 11-4, 11-5, 11-6) that are produced by mappers are initially produced on buffer caches in the operating system of each data node before they are saved on local disks as files (12-1, 12-2, 12-3, 12-4, 12-5, 12-6); while the temporary files produced on buffer caches that are located on the main memory of each data node (3-1, 3-2, 3-3, 3-4) and the files saved on the disks are the same files whose data are identical, in all drawings of the present invention the intermediate files on buffer caches will be numbered 11-1, 11-2, and so on, and the intermediate files that are saved on disks will be numbered 12-1, 12-2, and so on, in order to distinguish the files that are on buffer caches on the main memory (3-1, 3-2, 3-3, 3-4) from the files that are saved on the disks of the data nodes.
Buffer cache, sometimes called page cache, is a portion of main memory controlled by an operating system that keeps the copies of disk blocks so that the buffer cache can be used as a transparent cache for blocks of the hard disk drive. Using buffer cache on main memory to temporarily store copies of disk blocks, which is called disk buffering, results in quicker file accesses because it is faster to access the main memory than secondary storage devices such as hard disk drive.
Intermediate data files produced by the mappers are transmitted to reducers (21-1, 21-2, 21-3, 21-4, 21-5, 21-6). A reducer is a function that gathers and processes intermediate data files. Like mappers, reducers are distributed among multiple data nodes, and it is possible for there to be many reducers on each data node to process a single task. Mappers transmit intermediate files (12-1, 12-2, 12-3, 12-4, 12-5, 12-6) to reducers (21-1, 21-2, 21-3, 21-4, 21-5, 21-6), and each reducer puts together intermediate files that it has received from multiple mappers and processes the intermediate files. In this manner, final result files (14-1, 14-2, 14-3) that each reducer has processed are saved as files (15-1, 15-2, 15-3) on local disks through the local file system of each of the data nodes, and these final data files (15-1, 15-2, 15-3) constitutes the final results of the job.
The reason for distributing numerous mappers and reducers among the data nodes is to increase data processing speed by distributing vast amounts of total data among multiple data nodes and processing them in parallel.
FIG. 2 shows the typical process by which an intermediate file produced by a mapper is transmitted to a reducer through the HTTP (hyper text transport protocol) communication protocol. For example, intermediate file p (11-7) residing in buffer cache on the main memory (3-1) by mapper j (20-1) of data node M (1-1) is saved as file p (12-7) on the disk (6-1); is transmitted via the HTTP interface program (22-1) through the network stack (4-1) of the operating system of data node M (1-1) and NIC (Network Interface Card, or Network Interface Chip) (5-1), which is the network interface hardware; and is sent to data node N (1-2), on which reducer h (21-1) is waiting to receive the intermediate file p (11-9, 12-8).
In the course of transferring the intermediate file, the intermediate file p (12-7) that is on the disk (6-1) cannot be transmitted directly to the network (2) via the HTTP interface program (22-1), but the intermediate file p (12-7) on the disk (6-1) must be loaded on the main memory (3-1) as intermediate file p (11-8) before it can be transmitted. The intermediate file p (11-8) that has been transmitted in this manner is saved on the local disk (6-2) of the reducer as file p (12-8) on its local disk (6-2) only after it is loaded on the main memory (3-2) of data node N (1-2) as intermediate file p (11-9) through the network stack (4-2) of the operating system, the HTTP interface program (22-2), and the local file system (10-2) after passing through NIC (Network Interface Card, or Network Interface Chip) (5-2) of data node N (1-2; afterwards, it is loaded again as intermediate file p (11-10) on the main memory (3-2) before the reducer h (21-1) then processes the data of file p (11-10. In FIG. 2, file p (11-7) produced by mapper j (20-1), each file p (12-7, 12-8) saved on the disks (6-1, 6-2) of the two data nodes (1-1, 1-2), and each file p (11-8, 11-9, 11-10) that is loaded on the main memory are the same files with identical data, but in order to distinguish the location where each file instance resides, files on the main memory are numbered 11-7, 11-8, 11-9, 11-10, and files saved on the disks are numbered 12-7, 12-8, in a manner similar to how they were numbered in FIG. 1.
Each data node transmits intermediate files produced by mappers to reducers in other data nodes. During this process, each data node occupies its main memory (3-1, 3-2) as it transmits and receives intermediate files due to the network communication involved in the process. In turn, due to this overhead, throughput of data processing job becomes reduced because the amount of main memory occupied in this process of transferring data file over the network would otherwise have been allocated to data processing jobs instead of being occupied in the process of transferring the file.
The present invention relates to the method for improving the throughput and performance of big data processing systems such as the Hadoop system. More specifically, the present invention provides for data nodes, or computers, to share intermediate files by storing the intermediate files on storage devices that are attached to a network and are shared among data nodes. The method of sharing data file presented in this invention eliminates the occupation of main memory allocated to network stack and buffer cache required for transferring the data file between data nodes, thus improving the overall data processing job throughput and performance.
FIG. 3 is a diagram that shows the system of the present invention of multiple data nodes, or computers, that share network attached disks. In order to avoid the occupation of the main memory that occurs when intermediate files are transferred over network, the present invention uses a network attached disk (7-1) to share the files. As mentioned above, network attached disks include iSCSI (Internet Small Computer Systems Interface) disks, NetDisks, and AoE (ATA over Ethernet) disks. By attaching network attached disk to a network (2) to which many data nodes (1-1, 1-2) are connected, the present invention uses the network attached disk (7-1) as a local disk to each data node (1-1, 1-2) as shown in FIG. 3. That is, the network attached disk (7-1) is provided as a local disk to both of data node N (1-1) and data node M (1-2).
In FIG. 3, mapper j (20-7) saves the intermediate data file h (11-1) through its local file system (10-1) as file h (12-1) on the network attached disk (7-1) through network stack (4-1) and NIC (5-1). File h (11-1) produced by mapper j (20-7) refers to the file that is on the buffer cache of the operating system of data node M (1-1) where mapper j (20-7) is running, and file h (12-1) which is identical in data to file h (11-1) refers to the file that is saved on the network attached disk (7-1). Reducer k (21-7), which is being executed on another data node N (1-2), processes the data of intermediate data file h (12-1), which is on the network attached disk (7-1), by loading it as intermediate data file h (11-2) on the buffer cache on the main memory of its own local file system (10-2).
In the conventional method illustrated by FIG. 2, mapper's data file stored in its local disk has to be loaded on buffer cache thus entailing the occupation of memory to the size of the file, and then another memory occupation of same amount of memory due to the copy of data file from buffer cache to network stack occurs in order to be transferred to reducer, and then the exactly same amount of memory occupations occurs in reducer's computer because the same file transfer procedures are required only in reverse order for the reducer to store the data file on its local disk.
On the contrary, the file sharing method of the present invention in FIG. 3 has data nodes directly access files stored on network attached disks to share the files, instead of one data node transmitting files to another data node through a network, thus no memory occupations are entailed for the transfer of data files between the data nodes. The amount of occupied memory in conventional method due to the file transfer over the network is twice the size of data files. The method of present invention, therefore, allocates this much of memory to the jobs for processing data thus increases the effective amount of memory of the system compared to the conventional method.
FIG. 4 is a block diagram of how multiple data nodes (1-1, 1-2, 1-3), or computers, share multiple network attached disks (7-1, 7-2, 7-3, 7-4, 7-5). Each of the data nodes that are connected to network (2) recognizes network attached disks (7-1, 7-2, 7-3, 7-4, 7-5) as its own local disks through device driver software modules (13-1, 13-2, 13-3) that control the network attached disks mounted on each data node. As for device drivers (13-1, 13-2, 13-3), Linux operating systems provide initiator device driver software for iSCSI devices, and Windows, VMware, and various other operating systems also provide initiator software so that network attached disks can be recognized and used as local disks. In a computer system that uses network attached disks, the device driver software for network attached disks accesses and controls the network attached disks through the network interface (NIC), though files on network attached disks are read and written through the local file system similarly to those on the computer's internal disks.
However, multiple data nodes cannot independently read and write files on network attached disks without breaching the integrity of the file data. If each data node reads and writes files through its own local file system without cooperating with the other data nodes, no data node that shares the network attached disks knows which sectors of those disks the other data nodes are writing to. Therefore, when multiple data nodes write different data over the same disk sector of a shared network attached disk, the integrity of the file data is compromised. In other words, because the individual local file systems of the data nodes do not work in collaboration, the metadata, which records which files are stored on which blocks of the disk, ends up differing from one data node to the next, so a consistent file system on the shared network attached disks cannot be maintained. For this reason, network attached disks such as iSCSI, AoE, FCoE, and NetDisk are not by themselves enough for the computers on a network to share files stored on them.
FIG. 5 is an illustration of the file sharing method of the present invention, where each computer (1-11, 1-12, 1-13) shares files using only its own local file system, without cooperating with the other computers' local file systems (10-11, 10-12, 10-13). A disk partition, or disk volume, is a collection of contiguous blocks of sectors on a storage device such as a hard disk or an SSD (solid-state drive). A single physical hard disk or SSD can be divided into several partitions. Each partition is recognized as an independent storage volume by the operating system, and a single file system is loaded on each disk partition. Therefore, the explanations in the present invention do not need to distinguish between disks and disk partitions.
In FIG. 5, each computer (1-11, 1-12, 1-13) mounts the network attached disks or disk partitions (7-1, 7-2, 7-3) on its local file system as local disks with both read and write privileges. Mounting a disk or disk partition is the process of logically attaching the disk to a directory within the file system structure; after mounting, files can be read and written on the mounted disk through the local file system. Mounting commands and related software tools are provided in all operating systems. In Linux operating systems, for example, mounting a network attached disk to the directory /networkdisk1 of the local file system (10-11) of computer 1 (1-11) is done with the mount command. The snippet below shows a network attached disk of FIG. 5, named /dev/sda, being mounted to the /networkdisk1 directory.
sudo mount /dev/sda /networkdisk1
In FIG. 5, dotted lines (100 - 108) emphasize the fact that each network attached disk (7-1, 7-2, 7-3) is mounted not on a single computer but on multiple computers at the same time, with both read and write privileges. In the file sharing method of the present invention, files (12-10, 12-11, 12-12, 12-13, 12-14, 12-15, 12-16, 12-17) are created a priori on the network attached disks (7-1, 7-2, 7-3) before each computer (1-11, 1-12) accesses the network attached disks (7-1, 7-2, 7-3) and starts reading and writing files on them. Creating files of desired sizes before storing actual data in them can be done in Linux operating systems, for example, with the fallocate command. fallocate, which stands for "file allocate," creates a file that takes up a specified amount of disk space before any data is saved on it. Below is an example of using the fallocate command to create a file named file1.txt that is 10 MB in size.
fallocate -l 10M file1.txt
The fallocate command secures disk space that can hold data of the specified size before the actual data is saved. For instance, in the above example, the file1.txt file is created by securing 10 MB of disk blocks before any valid data is stored in it. In this manner, as many pre-allocated files as desired (12-10, 12-11, 12-12, 12-13, 12-14, 12-15, 12-16, 12-17) can be created on the network attached disks (7-1, 7-2, 7-3) in advance.
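For illustration, the same pre-allocation can also be performed from within a program by using the posix_fallocate system call; the following is a minimal C sketch, in which the path /networkdisk1/file1.txt and the 10 MB size are assumptions made for this example rather than values taken from the figures.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Create (or open) the file that will later hold shared data. */
    int fd = open("/networkdisk1/file1.txt", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Reserve 10 MB of disk blocks before any data is written,
       the programmatic counterpart of "fallocate -l 10M file1.txt". */
    int err = posix_fallocate(fd, 0, 10 * 1024 * 1024);
    if (err != 0) {
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}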
Afterwards, the pre-allocated files (12-10, 12-11, 12-12, 12-13, 12-14, 12-15, 12-16, 12-17) are assigned to the individual computers (1-11, 1-12, 1-13) in an exclusive way; that is, a pre-allocated file is assigned to one computer and never to the others. For example, two files 12-10, 12-11 can be assigned to computer 1-11 only, three files 12-12, 12-13, 12-14 to computer 1-12 only, and another three files 12-15, 12-16, 12-17 to computer 1-13 only. Each computer may write data only to the pre-allocated files exclusively assigned to it and never writes to pre-allocated files assigned to others, whereas all computers may read data from any of the pre-allocated files. By assigning and writing the pre-allocated files in this exclusive way, a computer can share data files with the other computers: only one computer can write data to a pre-allocated file assigned exclusively to it, and the other computers cannot update the data in that file but can read it.
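The present invention does not prescribe any particular data structure for recording this exclusive assignment; purely for illustration, the following C sketch shows one hypothetical way a static ownership table could record which computer may write each pre-allocated file, with all file names and computer numbers assumed for the example.

#include <stddef.h>
#include <string.h>

/* Illustrative sketch only: one pre-allocated file, one owning computer. */
struct assignment {
    const char *path;  /* pre-allocated file on the network attached disk */
    int owner;         /* the one and only computer allowed to write it */
};

static const struct assignment table[] = {
    { "/shared/file1.txt", 1 },  /* writable by computer 1 only */
    { "/shared/file2.txt", 1 },
    { "/shared/file3.txt", 2 },  /* writable by computer 2 only */
};

/* Any computer may read any file; only the owner may write it. */
static int may_write(int computer, const char *path)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(table[i].path, path) == 0)
            return table[i].owner == computer;
    return 0;  /* files not in the table are not writable */
}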
The reason the files are pre-allocated before a computer starts to write data to them is to maintain the integrity of the file system when the individual computers write and read data on and from the files assigned exclusively to each of them. The integrity of a file system is maintained when metadata such as inodes (index nodes) remain consistent for the files stored in the file system. An inode is a data structure in a Unix-style file system that records which disk blocks store which portions of a file's data. Pre-allocating the files and assigning them exclusively to different computers guarantees that metadata such as the inodes of the files remain unchanged, because the fixed set of disk blocks allocated a priori to a pre-allocated file is never replaced with other disk blocks as long as the size of the data stored in the file does not exceed the pre-allocated disk space, and because only one computer in particular can update the data in any given pre-allocated file. Sharing a file in the present invention is therefore achieved in such a way that one computer updates the data in a pre-allocated file assigned exclusively to it while the other computers read the data from that file, without compromising the file system integrity.
The following snippet shows how a mapper process on a Linux system opens the pre-allocated file assigned to it and writes data to the file that it shares with reducer processes. The program assumes that 4000 bytes of data are held in an array buf and that file1.txt is the name of the pre-allocated file located in the directory /shared.
char buf[4000];
(1)int fd = open("/shared/file1.txt", O_RDWR);   /* open the shared pre-allocated file for reading and writing */
(2)write(fd, buf, 4000);                         /* write the 4000 bytes of data to the file */
(3)fsync(fd);                                    /* force the file data from the buffer cache onto the disk */
(4)int fd_device = open("/dev/sda", O_RDWR);     /* open the disk device file that holds the file */
(5)fsync(fd_device);                             /* force the file's metadata onto the disk */
(6)close(fd);
(7)close(fd_device);
After the writing of data to the file is complete, the file data and the relevant metadata must be synchronized to the disk so that other computers that want to read the file can access correctly synchronized data. fd in line (1) of the above program is the file descriptor returned by the operating system as the result of opening the file file1.txt with read/write privileges; to open a file with both read and write privileges, the O_RDWR flag must be specified as the second argument of the open system call. In line (2), the write system call is invoked to write the data to the opened file, file1.txt. Line (3) invokes the fsync system call, which synchronizes the data written to file1.txt, forcing it to be stored physically on the disk instead of remaining in the buffer cache; doing so allows the other computers that share the file, file1.txt in this example, to read the correct data. In addition to the file's data, the relevant metadata of the file must also be reflected and stored on the disk; for this, the disk device shared by the mapper and the reducers on which the file resides, /dev/sda in the example, has to be synchronized, as shown in line (5). Because the operating system treats the disk device itself as a file, the open system call is invoked on the disk device in line (4) to obtain a file descriptor for the disk device file. After the file data and the disk device are synchronized, the close system call is invoked in lines (6) and (7) to finish using the file and the disk device.
In the present invention, because the file is created in advance with the fallocate command before the process running on a computer actually writes data to it, the size of the data actually written will normally differ from the size designated when the file was created by fallocate. The process that has written data to a pre-allocated file must therefore notify the processes running on the other computers of the exact size of the data written to the file, so that those processes can read the actual data. Note that when a process writes data to a file, the data are written consecutively from the very beginning of the file towards its end; the processes on the other computers therefore only need to know the data size in order to read from the beginning of the file up to the size of the actual data written. For example, in the program above, 4000 bytes of data were written to file1.txt, but because file1.txt was created by securing 10 MB of disk space in advance, the size of file1.txt is 10 MB. Although the nominal size of file1.txt is 10 MB, only 4000 bytes of valid data are saved in it; for this reason, the valid data size of 4000 should be made known to the processes on the other computers that want to read the valid data, so that they read up to 4000 bytes instead of 10 MB. Below is a snippet of a program that reads the 4000 bytes of data from the file1.txt file created in the example program above. data_buf is an array of 4000 bytes, and the third argument, 4000, of the read(fd, data_buf, 4000) system call in line (2) specifies reading up to 4000 bytes of /shared/file1.txt starting from the beginning of the file.
char data_buf[4000];
(1)int fd = open("/shared/file1.txt", O_RDONLY);   /* open the shared file read-only */
(2)read(fd, data_buf, 4000);                       /* read only the 4000 bytes of valid data */
(3)close(fd);
If data larger than the originally designated size of a pre-allocated file needs to be saved, the writing process may use two or more pre-allocated files to hold all of the data and then concatenate them into a single file. For example, if two files of 1 MB each were allocated but 1.5 MB of data needs to be saved, 1 MB of data is written to one of the allocated files, the remaining 0.5 MB is written to the other, and the two files are then concatenated. In Linux operating systems, files can be concatenated with the cat command. Below is an example of forming a single file by appending one file to the end of another using the cat command; file2 is appended to the end of file1, and the concatenated file is named file1. In other words, if a large amount of data first fills up file1 and the rest of the data is written to file2, the two files can be concatenated in this manner.
cat file2 >> file1
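For illustration, the same concatenation can be performed from within a program; the following is a minimal C sketch that appends file2 to the end of file1, as the cat command above does, under the assumption that both files reside in the current directory.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int src = open("file2", O_RDONLY);            /* file holding the remaining data */
    int dst = open("file1", O_WRONLY | O_APPEND); /* file to be extended */
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    char buf[4096];
    ssize_t n;
    while ((n = read(src, buf, sizeof buf)) > 0)
        write(dst, buf, (size_t)n);  /* append each chunk to the end of file1 */

    fsync(dst);  /* as before, force the appended data onto the disk */
    close(src);
    close(dst);
    return 0;
}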
In the present invention, files can be pre-allocated one at a time on demand as they are needed, or multiple files can be pre-allocated at once. Whether the pre-allocated files are created dynamically or in advance, the process that writes data to a pre-allocated file should invoke the aforementioned fsync system call on the file so that the file data and metadata are synchronized and saved to the network attached disk before the processes on the other computers read the data from the file.
Once a process writes data to a pre-allocated file and synchronizes the data and the relevant metadata to the network attached disk, all processes on the other computers can read the data from the file. However, those processes cannot know whether the file on the network attached disk has in fact been written by the process holding the exclusive write right, because, although the network attached disks are shared by all the computers, each computer uses its own local file system to read and write files on them, without any cooperation among the local file systems of the individual computers. Therefore, any process that wants to read the file but has no exclusive right to write to it must flush, or clear, the possibly stale data and metadata of the file from the buffer cache of its local file system before actually reading the file, because that buffer cache might not hold the correctly updated data and metadata of a file that has just been written by a process on another computer.
Operating systems including Linux provide ways to flush the buffer cache. For example, the snippet below shows one way for a process to clear its buffer cache, dentries, and inodes. An inode is a data structure that represents a file; a dentry is a data structure that represents a directory entry. The buffer cache, or page cache, can contain memory mappings of any blocks on disk, which means anything the OS may hold in memory from a file. In lines (1) and (2) of the snippet below, writing 3 to /proc/sys/vm/drop_caches clears the buffer cache, dentries, and inodes on Linux systems without killing any application or service; it causes the kernel to look for file data on the disk rather than in the cache. Depending on the version or implementation of the operating system, it may also be necessary to invoke the BLKFLSBUF ioctl() system call to completely clear the buffer cache, as shown in lines (3) and (4).
(1)int fd_drop_cache = open("/proc/sys/vm/drop_caches", O_RDWR);   /* open the kernel cache-control file */
(2)write(fd_drop_cache, "3", 1);                                   /* drop the page cache, dentries, and inodes */
(3)int fd_dev = open("/dev/sda", O_RDONLY);                        /* open the shared disk device */
(4)ioctl(fd_dev, BLKFLSBUF);                                       /* flush the buffer cache of the device */
To summarize the file sharing method of FIG. 5 of the present invention: data integrity of the files is maintained even though each computer uses only its own local file system to write, read, and share the files. This is achieved by allocating the files a priori on the network attached disks before actual data are written to them; by assigning the pre-allocated files to the computers in a mutually exclusive way, so that each pre-allocated file is assigned to one and only one computer, which alone can write data to the file and then synchronizes the file data and metadata from its buffer cache to the network attached disk; and by having the processes on the other computers clear their local buffer caches so that they read the correctly updated file data from the network attached disk.
The present invention has so far shown a method of saving files on network attached disks and sharing them with other computers through the local file system of each computer. However, it is also possible to share files among multiple computers while maintaining their integrity by using the network attached disks simply as block storage devices, without mounting a local file system.
FIG. 6 shows the disk sectors that comprise a disk partition in the linear model of LBA (logical block addressing). The physical structure of a hard disk consists of cylinders, heads, and sectors; a disk is often represented by the LBA model, which numbers the sectors starting from sector 0 (zero). In other words, a disk or disk partition is shown as a series of contiguous disk sectors of a uniform size, typically 512 bytes each. All disk devices are equipped with logic that converts an LBA disk sector number into the corresponding cylinder, head, and sector numbers of the physical disk, so the desired sector can be accessed by indicating its LBA sector number.
As shown in FIG. 6, a disk location can be indicated with a byte address, called an offset, counted from the start of a disk partition. For example, an offset value of 2,000 indicates the byte located at the 2,001st position from the start of the disk partition, because byte addresses start from 0, not 1. Each disk sector is typically 512 bytes; an offset of 2,000 therefore indicates the 465th byte of the 4th sector from the start of the disk partition. Note that 2,001 = 512 * 3 + 465.
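The offset arithmetic above can be stated compactly in code; the following minimal C sketch reproduces the example calculation for offset 2,000 with 512-byte sectors.

#include <stdio.h>

int main(void)
{
    const unsigned sector_size = 512;
    unsigned offset = 2000;                         /* byte address from the start of the partition */
    unsigned sector = offset / sector_size;         /* 3: zero-based index of the 4th sector */
    unsigned byte_in_sector = offset % sector_size; /* 464: zero-based, i.e., the 465th byte */
    printf("sector index %u, byte index %u\n", sector, byte_in_sector);
    return 0;
}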
FIG. 7 is a diagram that shows how to share data on a network attached disk among multiple computers without a local file system mounted on the disk; instead, the network attached disk is treated simply as a block storage device, with no file system loaded on it. This differs from the method explained in FIG. 5, where data are saved and shared through files created by the local file system, only in that FIG. 7 uses the network attached disk (7-1) as a device file.
The term "bundle" here is a term defined in the present invention to represent a series of contiguous disk sectors. Each bundle (30-1, 30-2, 30-3, 30-4) of FIG. 7 consists of a series of contiguous disk sectors. The entirety of a disk or disk partition can be regarded as a group of bundles. For example, disk partition (7-1) of FIG. 7 consists of a series of bundles starting from the first bundle 0 (30-1) to the last bundle n (30-4).
Only one particular computer is given the exclusive writing privilege over a given bundle, and the other computers are restricted to read-only access to that bundle; only the computer with the exclusive writing privilege writes data to the bundles assigned exclusively to it. In this manner, if different computers are each given the exclusive writing privilege over different bundles, each computer can save data in the bundles over which it holds that privilege and the other computers can read the data in those bundles; this allows file data to be shared by two or more computers without corrupting the data by overwriting any sector of the disk (7-1).
For example, if bundle 0 (30-1) is assigned to computer 1 (1-21) with the exclusive writing privilege, only computer 1 (1-21) saves data in bundle 0 (30-1) and all the other computers 2 to k (1-22, 1-25) can read the saved data, enabling all the computers 1 to k (1-21, 1-22, 1-25) to share the data saved by computer 1 (1-21).
Below is an example program snippet that illustrates saving data in an allocated bundle. In the program, it is assumed that the name of partition (7-1) is /dev/sda and that bundle 0 (30-1) consists of 50 consecutive sectors starting from byte number 0 of the disk (7-1).
#define SECTOR_SIZE 512
int size = 4000;
char buf[size];                  /* buffer holding the 4000 bytes of data */
int offset = SECTOR_SIZE * 10;   /* byte address of the 11th sector: 5120 */
memset(buf, '0', size);          /* fill the buffer with test data */
(1)int fd = open("/dev/sda", O_RDWR);
(2)lseek(fd, offset, SEEK_SET);  /* move to byte 5120 of the partition */
(3)write(fd, buf, size);         /* save 4000 bytes starting at the 11th sector */
(4)lseek(fd, offset, SEEK_SET);  /* move back to the start of the written data */
(5)read(fd, buf, size);          /* read back the 4000 bytes written in line (3) */
In line (1) of the above program, the name of the whole disk partition, /dev/sda, is passed to the open system call as the first argument. The operating system treats not only files created through the file system but also the storage device itself as a file, i.e., a device file, and allows conventional system calls, including the write and read system calls, to be used on the device file just as on a file created through the local file system. lseek() in line (2) is a system call that repositions the file offset at which the opened file is read or written; its second argument sets the starting point of the file's input and output operations. In the above example program, the value of the offset is 512 * 10 bytes, or 5,120 bytes. SEEK_SET is the whence value specifying that the offset is not a relative value but an absolute byte address counted from the very beginning of disk partition (7-1). Therefore, line (3) commands the 4,000 bytes of data in the array buf to be saved starting from the 11th sector of the disk partition (7-1). Line (5) reads back the 4,000 bytes of data that were written in line (3). In order to read the data saved starting at byte address 5,120, the lseek system call is invoked in line (4) to move the offset back to where the 4,000 bytes of data saved by line (3) begin; the read system call in line (5) is then invoked to read the 4,000 bytes of data that had been written in line (3), beginning at byte 5,120 of the disk partition (7-1).
Lines (4) and (5) of the example program above can serve as the program executed when a process on one computer, e.g., computer k (1-25), wants to read the data saved by another computer, e.g., computer 1 (1-21). The following example program reads 4,000 bytes of data from the disk partition (7-1), starting from the 11th sector of the disk, which lies within bundle 0 (30-1).
#define SECTOR_SIZE 512
int size = 4000;
char buf[size];                  /* buffer to receive the shared data */
int offset = SECTOR_SIZE * 10;   /* byte address of the 11th sector: 5120 */
memset(buf, '0', size);
(1)int fd = open("/dev/sda", O_RDONLY);   /* open the shared partition read-only */
(2)lseek(fd, offset, SEEK_SET);           /* move to byte 5120 of the partition */
(3)read(fd, buf, size);                   /* read the 4000 bytes saved by computer 1 */
The data sharing method of FIG. 7 prevents computers from arbitrarily writing on disk sectors and damaging the data by allocating the bundles in an exclusive manner: a bundle is allocated to one and only one computer with the exclusive right to write to it, each computer stores data only on disk sectors in the bundles over which it holds the exclusive writing privilege, and the other computers are allowed to read the data in those bundles.
In the present invention, network attached disks, including iSCSI disks and others, are used so that the computers connected to the network can share data among themselves. However, instead of physical network attached disks, "virtual" network attached disks or volumes built on conventional disks attached directly to the internal system bus of a computer can be used. A virtual disk, virtual volume, or logical volume is a virtual device that provides an area of usable storage capacity on one or more physical disk drives in a computer system; the disk is described as logical or virtual because it does not actually exist as a single physical entity. All operating systems provide commands or tools to make a virtual disk or virtual volume on the physical disk devices of a computer system. In addition, device driver software for network attached disk devices can turn any logical or virtual disk or volume into a network attached disk, that is, a "virtual" network attached disk. As an example of implementing a virtual iSCSI disk, as shown in FIG. 8, we first create a virtual disk (6-1) on one or more conventional internal disks and then make the virtual disk (6-1) an iSCSI disk target (6-1) using the iSCSI target driver (31-1) software provided by the operating system. The iSCSI software initiator (31-2) enables the computer (1-2) to access iSCSI devices over Ethernet, and the iSCSI software target (31-1) enables a computer (1-1) to export local storage to be accessed by other iSCSI initiators (31-2) using the iSCSI protocol defined in RFC 3720. It is therefore evident that the network attached disk of the present invention can be a "virtual" network attached disk virtualized on conventional internal disks; indeed, using a "virtual" network attached disk, such as a virtual iSCSI target, is the norm in today's computer system practices.

Claims (7)

  1. A method for sharing file data among multiple computers by sharing the network attached disk volume on which the file data are stored.
  2. The file sharing method of claim 1, wherein files are shared among the aforementioned multiple computers by pre-allocating files on the aforementioned network attached disk volume, assigning the pre-allocated files to the computers in order that each of the files is assigned to one and only one computer that has the exclusive right to write data on the file assigned to it, and allowing the other computers to read the file.
  3. The file sharing method of claim 1, wherein files are shared among the aforementioned multiple computers by pre-allocating files on the aforementioned network attached disk volume, where the aforementioned network attached disk volume is a virtual network attached disk volume virtualized on conventional disk volumes directly attached to the internal bus of the computer, assigning the pre-allocated files to the computers in order that each of the files is assigned to one and only one computer that has the exclusive right to write data on the file assigned to it, and allowing the other computers to read the file.
  4. The file sharing method of claim 1, wherein data are shared among the aforementioned multiple computers by pre-allocating bundles of disk blocks on the aforementioned network attached disk volume, assigning the pre-allocated bundles of disk blocks to the computers in order that each bundle of disk blocks is assigned to one and only one computer that has the exclusive right to write data on the bundle of disk blocks assigned to it, and allowing other computers to read the bundle of disk blocks.
  5. The file sharing method of claim 1, wherein data are shared among the aforementioned multiple computers by pre-allocating bundles of disk blocks on the aforementioned network attached disk volume, where the aforementioned network attached disk volume is a virtual network attached disk volume virtualized on conventional disk volumes directly attached to the internal bus of the computer, assigning the pre-allocated bundles of disk blocks to the computers in order that each bundle of disk blocks is assigned to one and only one computer that has the exclusive right to write data on the bundle of disk blocks assigned to it, and allowing the other computers to read the bundle of disk blocks.
  6. The file sharing method of claim 1, wherein files are shared among the aforementioned multiple computers by sharing a network attached disk volume whose storage device is a hard disk, an SSD (solid-state drive), flash memory, RAID (Redundant Array of Inexpensive/Independent Disks), or JBOD (just a bunch of disks).
  7. The file sharing method of claim 1, wherein files are shared among the aforementioned multiple computers by sharing a virtual network attached disk volume virtualized on disks directly attached to the internal system bus, whose storage device is a hard disk, an SSD (solid-state drive), flash memory, RAID (Redundant Array of Inexpensive/Independent Disks), or JBOD (just a bunch of disks).
PCT/KR2020/012185 2020-09-09 2020-09-09 Method for processing files through network attached disks WO2022054984A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/KR2020/012185 WO2022054984A1 (en) 2020-09-09 2020-09-09 Method for processing files through network attached disks

Publications (1)

Publication Number Publication Date
WO2022054984A1 true WO2022054984A1 (en) 2022-03-17

Family

ID=80630341

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/012185 WO2022054984A1 (en) 2020-09-09 2020-09-09 Method for processing files through network attached disks

Country Status (1)

Country Link
WO (1) WO2022054984A1 (en)


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 20953386; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 Ep: pct application non-entry in european phase
Ref document number: 20953386; Country of ref document: EP; Kind code of ref document: A1