AU2005200656B2 - User-level data-storage system - Google Patents

User-level data-storage system

Info

Publication number
AU2005200656B2
Authority
AU
Australia
Prior art keywords
file
data
application
nodes
uds
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2005200656A
Other versions
AU2005200656A1 (en)
Inventor
Joachim Worringen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp filed Critical NEC Corp
Publication of AU2005200656A1
Application granted
Publication of AU2005200656B2
Assigned to NEC CORPORATION (Request for Assignment; Assignor: NEC EUROPE LTD.)
Ceased
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1858 Parallel file systems, i.e. file systems supporting multiple processors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Multi Processors (AREA)

Description

S&F Ref: 707705

AUSTRALIA
PATENTS ACT 1990
COMPLETE SPECIFICATION FOR A STANDARD PATENT

Name and Address of Applicant: NEC Europe Ltd., of Rathausallee 10, 53757 Sankt Augustin, Germany
Actual Inventor(s): Joachim Worringen
Address for Service: Spruson & Ferguson, St Martins Tower, Level 35, 31 Market Street, Sydney NSW 2000 (CCN 3710000177)
Invention Title: User-level data-storage system

The following statement is a full description of this invention, including the best method of performing it known to me/us:

User-level Data-storage System

The present invention relates to a data storage and accessing system for use with one or more parallel processing system(s). More particularly, the invention relates to a User-level Data-storage System (UDS) to increase the performance of an application which performs parallel file input/output.

While the computational power and communication performance of many high-performance computer systems (HPC) is at a very high level which allows efficient execution of massively parallel applications, it can be observed that the effective input and output (I/O) bandwidth for such applications falls behind this level of performance. With current solutions, the actual performance potential of installed I/O hardware is not well exploited.

To achieve a sufficient accumulated bandwidth for parallel applications even with the available I/O infrastructure, users and system operators often employ "work-arounds". A typical way is to split or partition the input for an application into one or more sub-files for each process of a parallel application. Each process then reads or writes only this sub-file using a sequential I/O interface like Fortran I/O, while no other process accesses this sub-file. The problem with this approach is the static mapping of the sub-files to the processes and the high pre- and post-processing overhead. Also, to achieve the maximum performance, it is often necessary to spread the sub-files across multiple file systems, which adds even more complexity. If these file systems cannot be accessed from all compute nodes, for instance when using the fast local file systems on the compute nodes, the application will even fail to run if the distribution of the files does not match the placement of the processes. Because the placement of the processes is performed by the scheduling system and is thus arbitrary, it is very hard to achieve optimal performance this way.

Distributed file systems such as the Network File System (NFS), the Andrew File System (AFS) or the Global File System (GFS) are only designed to provide distributed access to files from multiple client machines. However, distributed file systems are not designed for the high-bandwidth concurrent writes that parallel applications typically require.
Commercially available parallel file systems, like PFS, PIOFS and GPFS, are often available only for specific platforms. As described for instance in US-A-6 032 216, such a system comprises a shared disk file system running on multiple computers, each having their own instance of an operating system and being coupled for parallel data-sharing access to files residing on network-attached shared disks. A metadata node manages file metadata for parallel read and write actions. Metadata tokens are used for controlled access to the metadata and for initial selection and changing of the metadata node.

A parallel file system for Linux clusters, called the Parallel Virtual File System (PVFS), has the primary goal of providing high-speed access to file data for parallel applications based on Linux. PVFS provides a Linux-cluster-wide consistent name space, enables user-controlled striping of data across disks on different I/O nodes, and allows existing binaries to operate on PVFS files without the need for recompiling. PVFS is designed as a client-server system with multiple servers, called I/O daemons. I/O daemons are statically assigned to run on separate predetermined nodes in the cluster, called I/O nodes, which have disks attached to them. Each PVFS file is striped across the disks on the I/O nodes. At the time the file system is installed, the user specifies which nodes in the cluster will serve as I/O nodes. Application processes interact with PVFS via a client library. Application processes communicate directly with the PVFS manager via relatively slow TCP when performing operations such as opening, creating, closing, and removing files. When an application opens a file, the manager returns to the application the locations of the predetermined static I/O nodes on which the file data is stored.

The present invention provides a solution for the above-discussed problems. The system according to the present invention is not a kernel-operated file system, but instead operates in user space, using the available file systems in the best possible way to achieve a high accumulated parallel I/O bandwidth.

The system according to the present invention, in the following also named User-level Data Storage (UDS), will feature high I/O bandwidth for parallel applications, preferably using an MPI-IO interface which allows scaling with the number of processes involved. MPI-IO is the part of the Message Passing Interface (MPI) standard that specifies how to access files in parallel within an MPI application. For file access which does not need high performance, like for managing tasks, other ways of file access may be used. The system is able to benefit from any given I/O infrastructure, without any dependency on the actual file system used to store the data. Its flexibility allows the user to choose any storage location visible in the system, shared or non-shared between the nodes, which best fits the needs of his specific application.

The object of the invention is achieved with the features of the claims.

The system according to the present invention is a system that allows the user, in particular of a high-performance computing system, to increase the performance of his application which performs parallel file I/O. A parallel application is split into a plurality of individual applications (APs), which are intended to run as separate processes, possibly on different nodes.
The basic concept to increase performance is that a file is striped into file fragments across fast, e.g. local, storage devices of a plurality of compute nodes in the system. The plurality, or part of the plurality, of the compute nodes is intended to run the individual applications of the parallel application. The system of the present invention is not a conventional file system, but a way to store and access data in a user-controlled manner. Therefore, the file fragments are in principle still accessible through the respective native file systems.

Because it is necessary to also access parts of a file, i.e. the file fragments, which are not located on any of the nodes currently running processes, a "third-party" transfer model is used. Preferably, one separate I/O process (IOP) is launched on each node which holds a part of the file, i.e. a file fragment, to be accessed. For some specific situations it may also be desirable to launch more than one separate I/O process on a single node. The running processes, i.e. the individual applications (APs) and the IOPs, communicate preferably via MPI messages, wherein MPI messages provide substantially the fastest way to communicate between processes in an HPC system, and UDS is able to exploit this performance. The effective bandwidth for accessing such a file can thus reach the accumulated bandwidth of the independent file systems, as the bandwidth of MPI messages is typically significantly higher than the bandwidth of I/O operations.
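To make this third-party transfer model more concrete, the following C sketch (intended to be run with two MPI processes) illustrates the general request/reply pattern between an application process and an I/O process over MPI. It only illustrates the communication style; the message layout, tags and fragment file name are assumptions made for this sketch and are not the actual UDS protocol.

    /* Schematic AP <-> IOP exchange over MPI: rank 0 plays an application
     * process requesting a byte range, rank 1 plays an I/O process serving it
     * from a local fragment file. Message layout and file name are assumed. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define TAG_REQ  1
    #define TAG_DATA 2

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                       /* application process (AP) */
            long req[2] = { 0, 64 };           /* requested offset, length */
            char data[64];
            MPI_Send(req, 2, MPI_LONG, 1, TAG_REQ, MPI_COMM_WORLD);
            MPI_Recv(data, 64, MPI_BYTE, 1, TAG_DATA, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("AP received %ld bytes starting at offset %ld\n",
                   req[1], req[0]);
        } else if (rank == 1) {                /* I/O process (IOP) */
            long req[2];
            char data[64];
            MPI_Recv(req, 2, MPI_LONG, 0, TAG_REQ, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            memset(data, 0, sizeof(data));     /* keep the sketch runnable
                                                  even without a fragment   */
            FILE *f = fopen("fragment.0", "rb");   /* assumed fragment name */
            if (f) {
                fseek(f, req[0], SEEK_SET);
                size_t got = fread(data, 1, (size_t)req[1], f);
                (void)got;
                fclose(f);
            }
            MPI_Send(data, (int)req[1], MPI_BYTE, 0, TAG_DATA, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }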
As a user-level system, the way UDS stores data can be fully customized by the user. This is possible because UDS is not a fixed file system, but operates on top of any available file system. The user can tell UDS which file systems and which nodes should be used to store a specific file, whereas the next file in the namespace can again be stored on a totally different set of file systems. These file systems can be distributed across any subset of the compute nodes in the system that can execute MPI applications. The file is split (striped) across all specified file systems block by block, with a user-definable block size (stripe size) which best fits the characteristics of the application, the file system and the underlying storage devices. It is not required that the file systems used to stripe a file are shared between any two compute nodes. This allows exploiting the performance of node-local file systems. However, each file, hereinafter also referred to as a UDS file, is still always globally visible and accessible to a UDS-aware application, no matter on which compute node it is running. This combines high-bandwidth local file access with the usability of a shared file system. To these two characteristics, UDS adds the high-performance inter-process communication of MPI, tight integration of file access into the MPI library and the flexibility of a user-level system.

Because UDS splits up files and stores the file fragments in distinct file systems (or distinct locations in the same file system), it operates in a separate namespace, which is managed by a distribution information means, hereinafter also referred to as the UDS nameserver. For this purpose, the nameserver maintains a database which keeps a record and some global state for each UDS file. Therefore, files that are created via UDS are located inside this namespace, i.e. all information on the file distribution is stored in the database of the nameserver, whereas the physical file fragments are stored on the distributed storage disks. In consequence, it is not possible to directly open a file outside this namespace (e.g. a file located in a user's home directory) via UDS. To do this, the file first needs to be imported into the UDS namespace. The UDS environment offers different ways to perform such an import. Likewise, a file located inside the UDS namespace cannot be directly accessed using programs which are not aware of UDS; e.g. it is not possible to perform a copy operation on a UDS file using the standard Unix cp command. The file first needs to be exported from the UDS namespace, or be accessed within the UDS namespace using a UDS command line tool like udscp, which is the equivalent of cp for UDS.

The system according to the present invention provides the highest I/O performance and good usability by the following means: (i) Locality, which allows making use of the fastest storage devices in the system. This may be a local hard disk in the nodes or any other storage device which is accessible for a process of an MPI application; (ii) High effective file access bandwidth: a file is striped across a number of nodes or file systems, and each stripe unit is operated by a separate I/O process.
Because the bandwidth of the concurrent accesses of the I/O processes to the independent file systems sums up to the application's I/O bandwidth, a high effective bandwidth results; (iii) High inter-process bandwidth: Because all inter-process communication is preferably done via MPI messages, very low latency and high bandwidth are available, and a large fraction of the I/O bandwidth achieved by the I/O processes on the, e.g. local, file systems can be delivered to the application. There is no obligatory need for the TCP/IP protocol in the data communication path; (iv) Flexibility: Since any valid path on the compute nodes can be chosen to hold the file fragments, the user has full flexibility in how to store the data, like using an extremely fast memory storage device for scratch files or a persistent storage device for other file types. It also allows a tradeoff between maximum performance and improved external accessibility of the files.

The system according to the present invention supports at least three ways to specify where to store the file fragments which make up a file in UDS. They are determined by the name of the file when it is created in the UDS storage system. This creation can be done from within an MPI application (calling MPI_File_open()) or via an external tool. A first possible way specifies a file via a home path; e.g. a path starting with '~' or '~username' specifies the current or given user's home directory on each compute node used to store a file fragment. Usually, if the home directories are mounted from a remote server on each compute node, this will result in low bandwidth and should only be used for small files. However, some systems are configured to give the user a home directory on a local file system on each node. In such cases, high bandwidth can be achieved. A second way relates to an explicit path; e.g. a path starting with '/' specifies an actual path which will be used on all compute nodes. This means that the specified path has to be valid on all these nodes, but not necessarily pointing to a local file system. Only if the path leads to a local file system will high bandwidth be achieved. A third way relates to an implicit path, e.g. paths which start with a meta sign and a system-dependent string, like '#local/' or '#fcraid/'. These virtual paths will then be resolved by UDS using a more complex (custom-specified or even dynamic) mapping of nodes to file systems.

There are substantially two available interfaces for accessing files, namely the MPI-IO interface and the "command line" tools. The complete MPI-IO interface as defined in the MPI-2 standard is supported. By this, the user can create and open files, read and write data from an open file with various read and write functions, gather information about the file and manipulate its size, close a file and delete a file. Operations like listing the contents of a directory, renaming, moving or copying a file may not be directly supported by this interface. To perform file operations via MPI-IO, the user needs to run an MPI application on the system.

When a file is opened via the MPI_File_open() command as provided by MPI, the following actions are performed by UDS: 1) Preferably, a single individual application AP of the group of processes which opens the file (the group leader) contacts the UDS nameserver, supplying the list of nodes on which processes of the group are running.
The communication between the group leader and the nameserver is preferably performed via TCP/IP using an encrypted socket or other means of encryption, which ensures authenticity at a file-system security level. 2) The nameserver maps the given filename to its local meta representation of the complete UDS name space, which is stored in a hierarchy of directories and files on a local file system. Each UDS file is represented by a meta file in this hierarchy which contains all information needed to access the real, distributed UDS file and its attributes. 3) If the nameserver finds a matching meta file, meaning that the file already exists, it passes the necessary information to the group leader. This information consists for the most part of a list of <node,path> tuples which describes the location of the file fragments. 4) If the nameserver does not find a matching meta file, the file does not exist. In case the application wants to create the file, the nameserver creates the matching meta file and passes back a list of <node,path> tuples based on the specified path and the list of nodes that were contained in the initial open request of the group leader. 5) The group leader passes the response to all processes in the group (the APs), which then collectively spawn the required I/O processes. Preferably, exactly one separate I/O process (IOP) is spawned per node in the list. However, for some reasons it may be advantageous to run more than one separate I/O process on a node. All communication between the APs and the IOPs is preferably performed via MPI, which ensures high performance. 6) The IOPs open (or create) the local file fragments and report back to the APs on the success of this operation. If all file fragments can be accessed, the IOPs wait for incoming read/write (or other) requests of the application processes. 7) Each file open operation is approved or cancelled by the group leader, which collects the results of the operation of all APs and IOPs. If an error occurs, for instance if at least one IOP cannot access a file fragment, this problem is reported to the UDS nameserver. If the MPI_File_open operation related to a file which already existed, the nameserver will mark the file as corrupt. The administrator can then try to recover the related file fragments. If the MPI_File_open operation was meant to create a new file, the path specification was probably not valid on all nodes, and the meta file that had been created is removed again.

When data is accessed, e.g. by read or write operations, the following steps are performed in UDS: 1) Based on the position of the access in the file and the striping pattern of the file fragments, which was determined when the file was created, each individual application AP can calculate which set of IOPs is needed to service the request (a sketch of this calculation follows this list). 2) If the AP is performing an individual file access, it sends requests to each of the corresponding IOPs and checks for their responses in a blocking or non-blocking way. 3) For collective file accesses, i.e. all APs perform a joint, synchronized file access, one AP per compute node communicates with a number (not necessarily all) of the IOPs on behalf of the other application processes on the node. The data from or towards the IOPs is then redistributed inside the APs, using intra- and inter-node MPI communication. Such optimizations are only possible with collective file access; this is the reason why collective file access should be used whenever possible.
4) The transfers processed by the IOPs preferably comprise MPI communication and file access.
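The following minimal C sketch illustrates the kind of calculation referred to in step 1, assuming the round-robin striping described later in this specification (fixed stripe unit, fixed stripe factor). The function name and parameters are illustrative and not part of any UDS interface.

    #include <stdio.h>

    /* Map a byte range of a striped file onto the IOPs (fragments) that hold
     * it. Assumed layout, as described for UDS files: the file is divided
     * into blocks of stripe_unit bytes, distributed round-robin over
     * stripe_factor fragments. */
    static void fragments_for_access(long offset, long length,
                                     long stripe_unit, int stripe_factor)
    {
        long end   = offset + length;        /* one past the last byte */
        long block = offset / stripe_unit;   /* first block touched    */

        while (block * stripe_unit < end) {
            int  iop          = (int)(block % stripe_factor); /* IOP/fragment index */
            long local_block  = block / stripe_factor;        /* block index there  */
            long local_offset = local_block * stripe_unit;    /* offset in fragment */
            printf("block %ld -> IOP %d, fragment offset %ld\n",
                   block, iop, local_offset);
            block++;
        }
    }

    int main(void)
    {
        /* Example: access 1 MiB starting at offset 3 MiB of a file striped
         * with a 1 MiB stripe unit over 3 fragments: block 3 lives on IOP 0. */
        fragments_for_access(3L << 20, 1L << 20, 1L << 20, 3);
        return 0;
    }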
These MPI communication and file access operations can be performed in a pipelined manner to increase throughput. Additionally, the IOPs can use asynchronous I/O operations (if available on the given file system) to overlap MPI communication and file accesses.

The second available interface for accessing files is provided by "command line" tools. The UDS file system sits on top of a native file system and is not visible as such to conventional Unix commands which perform file operations like directory listing (ls), removing (rm), copying (cp) or making directories (mkdir). These tools can only operate on the distributed file fragments that make up a UDS file, but have no knowledge of this distribution. By accident, a user could remove some of the file fragments without knowing that a file fragment belongs to a distributed file. For this reason, UDS will name the file fragments in a way which makes it clear to the user that each is part of a UDS file. Furthermore, UDS will include a range of command line utilities that offer these standard functions to manage the files in the file system. Next to these standard functions, additional tools will be available for importing or exporting a file from the UDS file system, e.g. copying a file from a standard file system visible on a frontend to a specified location inside the UDS file system and vice versa, retrieving UDS-specific information on a file, or even changing the distribution or striping of a UDS file.

The present invention will be further described with reference to the accompanying drawings, wherein like reference numerals refer to like parts in the several views, and wherein:

Fig. 1 shows a configuration of a first embodiment of a data storage and accessing system according to the present invention;
Fig. 2 shows a configuration of another embodiment according to the present invention;
Fig. 3 shows a configuration of a further embodiment according to the present invention;
Fig. 4 shows examples of persistent components on compute nodes according to the present invention;
Fig. 5 illustrates an application accessing a newly created file;
Fig. 6 illustrates a possible situation when an application opens an existing file;
Fig. 7 shows the situation when a file is imported into the UDS namespace by using a command line tool;
Fig. 8 shows the situation when a command line tool queries a nameserver for a file status;
Fig. 9 shows the situation when a command line tool probes a file; and
Fig. 10 indicates a striped file across multiple nodes.

The architecture of a data storing and accessing system according to the present invention is illustrated in Fig. 1, showing the communication and data flow between the UDS components for a typical setup of a high-performance computing center. Fig. 1 shows two typical scenarios for a high-performance computer system with six compute nodes 1-1 to 1-6, each with a local storage device 5-1 to 5-6. The parallel application AP1 consists of the two individual applications 3-1 (AP1#0) and 3-2 (AP1#1) running on nodes 1-1 and 1-2. These two individual applications or processes have just created a new file, which is striped across the two nodes 1-1 and 1-2 running the application. The file consists of two file fragments, each stored on the local storage device 5-1 and 5-2, respectively. To access the file, i.e. the two file fragments, two separate I/O processes, IOPs 4-1 (IOP1#0) and 4-2 (IOP1#1), are spawned. This setup usually delivers the best possible bandwidth.
A second scenario is shown with application AP2, consisting of AP2#0, #1 and #2. They have opened a file which already existed. It was created on nodes 1-3 and 1-4. To access this file, i.e. the two file fragments stored on the storage devices 5-3 and 5-4, two separate I/O processes, IOP2#0 and IOP2#1, were spawned on these nodes 1-3 and 1-4. The location of the parts of the file, the file fragments, can be chosen freely by the user, and administrator rights are not obligatorily necessary. The management of UDS, namely the persistent storage of the file system name space, is done by the nameserver 6, running preferably on an external system. This nameserver provides to an application, i.e. to at least one of the individual applications 3-x, the information on how to access a file that was stored before by another application run.
Another hardware setup according to the present invention is shown in Fig. 2. A number of n compute nodes 1-1 to 1-n serve to run a parallel application and are connected by a high-bandwidth, low-latency IXS switch. Preferably, each node 1-1 to 1-n is equipped with a local storage system 5-1 to 5-y which is only visible on this node, wherein y is equal to n if exactly one storage system is provided on each node. Access to the local storage device, which is for instance performed via SFS (Super-UX File System), provides high bandwidth even for small blocks, including caching. Additionally, all nodes 1-1 to 1-n can access a number of large remote shared storage arrays 5-(y+1) to 5-l via fibre channel (FC) links, using for example a Global network File System (GFS) offered by the storage nodes. Access to the remote GFS file systems is usually significantly slower than SFS, especially for small data blocks. However, these storage arrays can also be directly accessed from the storage nodes, which may also serve as frontend nodes for interactive user access. The direct connection between the frontend systems 6-1 to 6-p and the nodes 1-1 to 1-n is preferably a low-bandwidth TCP/IP link like a Gb Ethernet switch. On at least one of these frontend nodes 6-x a UDS nameserver is running.

The first way to exploit such a system via UDS is to store files and/or file fragments on the local SFS file systems, e.g. memory or disk based. To specify this, an explicit UDS path, like '/uds' or '/xmu', which would exist on all nodes, may be used. If UDS is configured accordingly, an implicit path can also be used with the same performance. A home path is not recommended for large-scale I/O, as the home directories are usually mounted via NFS. Going this way with implicit or explicit paths provides the highest performance for access via MPI-IO. However, importing or exporting a file is slower, as the files need to be staged on the shared GFS file systems. In a setup which does not feature shared storage between the frontend nodes 6-1 to 6-p and the compute nodes 1-1 to 1-n, the import/export would need to be performed via the TCP/IP link between these two entities, resulting in low bandwidth values for import/export operations.

The remote storage 5-(y+1) to 5-l via GFS can also be used for file storage with UDS. However, this usually means that the mapping of a compute node onto a file system is not necessarily 1:1 any more. A fixed round-robin mapping of compute nodes to file systems will provide sub-optimal performance for all cases in which an application runs on a number of compute nodes which are mapped to the same file system. Instead, UDS has to try to distribute the file evenly across the available file systems, using a dynamic mapping of compute nodes to file systems. This can mean one file system per compute node, but other distributions, e.g. more than one file system per node and IOP, or a setup in which only a subset of the nodes runs an IOP, are also possible. Whether it is effective to use more file systems than (active) compute nodes depends on the capability of the remote file system to perform asynchronous I/O, or requires additional functionality (threads) in the I/O processes.

UDS will also support legacy I/O applications as well as possible. However, changing the standard I/O calls (open, read, write, close) into the corresponding MPI-IO equivalents (MPI_File_open(), MPI_File_read(), MPI_File_write(), MPI_File_close()) is mandatory. Usually, this is a very simple task.
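As an illustration, the sketch below shows such a conversion for the common legacy pattern of one output file per process, using only the MPI-IO equivalents named above. The UDS path, buffer contents and sizes are arbitrary examples, and error handling is omitted.

    /* Minimal sketch: a legacy per-process output file, rewritten with the
     * MPI-IO equivalents named in the text. Path and sizes are illustrative. */
    #include <mpi.h>
    #include <stdio.h>

    #define COUNT 1024

    int main(int argc, char **argv)
    {
        double   buf[COUNT];
        MPI_File fh;
        int      rank;
        char     fname[256];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < COUNT; i++)
            buf[i] = rank + i;

        /* Legacy code: fd = open("out.<rank>", O_CREAT|O_WRONLY, 0644);
         *              write(fd, buf, sizeof(buf));  close(fd);
         * MPI-IO:      the same per-process file, opened with MPI_COMM_SELF,
         *              but named with a UDS path so that the fragment lands
         *              on the storage chosen by the user. */
        snprintf(fname, sizeof(fname), "uds:/uds/example/out.%d", rank);
        MPI_File_open(MPI_COMM_SELF, fname,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write(fh, buf, COUNT, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }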
A typical I/O technique in a legacy application is to have each process access an independent file. For read access, these files or file fragments need to be placed in a location which is reachable for a process. Without UDS, this means that the files or file fragments are either placed on a single remote file system, which then is typically not able to provide sufficient bandwidth for all processes accessing their file, or need to be copied to different file systems, which can deliver a higher accumulated bandwidth for the concurrent file accesses. For write access, the user needs to take care after the program run to collect all individual files or file fragments from the different file systems and possibly join them into a single file. This manual distribution can become very complex, especially if not all processes can access all file systems, as is the case when using the local file systems of the compute nodes. This causes many users to use the slower approach. For such applications, UDS cannot increase the bandwidth compared with the optimal manual distribution of the files. However, it can vastly simplify the procedure to reach a similar performance. This is achieved by importing the directory with the input files in a way that distributes the complete files (no striping applied) across the available file systems. The application can then access all files using a single implicit (virtual) UDS path specification. Although the distribution that is performed during the import operation cannot provide optimal locality (placing the files on the node on which a process runs) for most cases, the third-party-transfer concept ensures that each process can access the file with either MPI or local I/O bandwidth (whichever is smaller).

The architecture of Fig. 3 illustrates a typical setup of a high-performance computing center. The embodiment of Fig. 3 comprises four compute nodes 1-1 to 1-4 which run the parallel applications. They offer a number of CPUs, memory and local storage devices 5-1 to 5-4. The access bandwidth to these storage devices 5-1 to 5-4 is very high, even for smaller blocks, as the file system is able to perform efficient caching. However, the access can only be performed by processes running on the respective node. Typically, only a small fraction of the local storage capacity is used for the operating system; using UDS, the full capacity can be used for globally visible and accessible files. The compute nodes 1-1 to 1-4 are connected via a high-bandwidth, low-latency message-passing interconnect 8, usable through MPI messages. A frontend 10 is used for login, data transfer, editing and compiling, and possibly visualization. This is a system with limited user-accessible storage, e.g. the users' home directories. The frontend 10 is connected to the compute nodes 1-1 to 1-4 via (Gigabit) Ethernet. An external storage system 11 holds the users' larger data files and usually different scratch spaces. This storage system 11 consists of a large number of independent disk drive arrays 5-5. It is attached to the compute nodes 1-1 to 1-4 and the frontend node 10 via multiple fibre channel (FC) connections 9. However, the maximum bandwidth that a single process on a compute node can achieve via this connection 9 is lower than for the local storage devices 5-1 to 5-4 of the compute nodes 1-1 to 1-4, especially for small access sizes. Next to these components, which every computer center uses, a separate system is used to run the UDS nameserver 6.
The nameserver 6 manages all the distributed UDS files and provides its clients with the information necessary to retrieve the fragments that make up a file. Although the nameserver 6 could be run on the frontend 10 or even on one of the compute nodes 1-1 to 1-4, it is advisable to run it on a separate system. Without a running UDS nameserver 6, no UDS file can be opened and no information on the UDS namespace is accessible. Therefore, it is important to increase the availability of the system running the UDS nameservers. To achieve this, it is preferred to run multiple, e.g. redundant, nameservers 6-1 and 6-2 on separate hosts, as shown in the embodiment in Fig. 3. These hosts 6-1 and 6-2 need to access a common database, which is preferably realized via a separate storage system 12 with high-availability characteristics. The nameserver 6 is connected to the compute nodes 1-1 to 1-4 and the frontend 10 via (Gigabit) Ethernet. Another benefit of not running the nameserver 6 on one of the compute nodes is the reduction of compute load not related to the applications running on the node. It is also possible that external hosts 13, like users' workstations, have access to the rest of the system only via the (Gigabit) Ethernet network.

Fig. 4 shows the same architecture as Fig. 3, but with some additional software components of UDS. The persistent components preferably run constantly under a special user account or root and are depicted in Fig. 4 as 7-1 to 7-4. Other components of UDS preferably run on demand and can be seen in the scenario examples of Fig. 5 to Fig. 9. The UDS nameserver 6 manages the UDS namespace. For this purpose, it maintains a database which keeps a record for each UDS file and its file fragments. Additionally, some global state is kept persistent in this database. More than one nameserver 6-1 and 6-2, as shown in Fig. 4 to Fig. 10, can be used to keep UDS available even if one nameserver should fail for some reason. The nameserver 6 is queried each time a UDS file is created, re-opened, closed, moved or deleted, or if information on the current status of a UDS file is required. The nameserver 6 is not required for the effective access to the file data. The separate I/O processes 4-x (IOPs) are in fact part of the user's MPI application (the APs), but are not visible to the user, wherein x is a natural number. The separate I/O processes are spawned by a UDS component inside the MPI library at the moment when the application opens a UDS file. At least one separate I/O process 4-x is launched on each node on which fragments of the UDS file are located. In some specific situations it may also be preferred that more than one separate I/O process is launched on a node, or that a separate I/O process is launched on a node on which the desired file fragment is not located itself but on a shared disk which is connected to this node. For a new file, these are (by default) the nodes on which the application, i.e. the individual applications 3-x, is running. The separate I/O processes communicate with the application processes via MPI messages. The UDS daemons 7-x (udsd) run on the compute nodes 1-x as a service process. They are required to provide information on UDS file fragments to the nameserver 6 and the UDS command line tools 8. Additionally, they are used to perform a limited set of actions on the fragments, like deleting, copying etc.
The daemons preferably run as resident processes under root, but can also be started on demand, e.g. under the respective user's account, if resident processes are not wanted. The UDS command line tools 8 allow the user to manage the UDS namespace also from external, e.g. non-compute, hosts 13. The command line tools 8 perform directory listings and provide file status information, and are also used to exchange files between the UDS namespace and any external file system. The command line tools 8 can be run on any external node 10, 13 that can connect to the nameserver 6 and the compute nodes 1-x, e.g. via TCP/IP.

The example scenarios given in Fig. 5 through Fig. 9 show the active UDS components, their relations and the file access activities involved in completing a certain task. These activities are explained below to illustrate the interaction of the UDS components for typical tasks performed by a user.

An MPI application accessing a newly created UDS file is illustrated in Fig. 5. An MPI application of a user, e.g. larry, consists in this example of six processes or individual applications 3-1 to 3-6, running on compute nodes 1-2, 1-3 and 1-4. The application opens a file named uds:/uds/larry/output_data.dat. Because the nameserver 6 does not know this file, a new file is created. The new file will preferably be striped across all nodes 1-2 to 1-4 on which the application is currently running. However, it is also possible, if desired, to stripe the file across any desired node, as shown for instance in Fig. 1. The filename implies that the file is stored in the path /uds/larry, mapped to a local file system on each compute node 1-2, 1-3 and 1-4. Therefore, this directory needs to exist and be accessible for the user on all nodes involved. All I/O activities are performed by the separate I/O processes 4-x (IOPs), which have been spawned by UDS when the file was opened. Once the separate I/O processes 4-x have created the file fragments on the nodes 1-2 to 1-4, they perform all file accesses on behalf of the application processes 3-x. The data exchange between the individual applications 3-x and the separate I/O processes 4-x is done via MPI messages, both inside a node and between processes on different nodes.
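A minimal sketch of what such an application could look like is given below: all processes collectively create the file under its UDS name and write their portions with a collective MPI-IO call, the access pattern recommended above. The data layout and sizes are illustrative assumptions.

    #include <mpi.h>

    #define COUNT 4096  /* doubles written per process; illustrative */

    int main(int argc, char **argv)
    {
        double   buf[COUNT];
        MPI_File fh;
        int      rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < COUNT; i++)
            buf[i] = 0.001 * (rank * COUNT + i);

        /* Collective create/open: UDS spawns the IOPs and creates the file
         * fragments under /uds/larry on the nodes running this application
         * (cf. the Fig. 5 scenario). */
        MPI_File_open(MPI_COMM_WORLD, "uds:/uds/larry/output_data.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Collective write: each rank writes a contiguous slice at its own
         * offset, allowing one AP per node to bundle traffic to the IOPs. */
        MPI_File_write_at_all(fh, (MPI_Offset)rank * COUNT * sizeof(double),
                              buf, COUNT, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }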
The situation when an MPI application is opening an existing UDS file is illustrated in Fig. 6. In this example, an MPI application consisting of two processes 3-1 and 3-2 is running on compute node 1-1. It is currently opening an existing UDS file which was previously created by another run of an application. Process 3-1 of the application connects to the nameserver 6-2 to get the information on how to access the file. The nameserver 6-2 tells the individual application 3-1 that the file consists of a single fragment, which is located on compute node 1-2. To access this file, UDS spawns the separate I/O process 4-1 on this compute node 1-2, which accesses the file and exchanges the data with the application 3-1, 3-2 using MPI messages.

The use of a command line tool 8, like importing a file into the UDS namespace, is illustrated in Fig. 7. In this example, user harry wants to transfer an input file, which is currently located in its native format on the external storage system (filename /ext/home/harry/input.dat), into the UDS namespace. As the user has to repeatedly read from this large file, the user wants it to be located as close to the application processes as possible. Therefore, it is preferably located on the local disks 5-x. Because the user typically runs his application 3-x on the two compute nodes 1-1 and 1-4, he decides to stripe the file across two nodes, too. To achieve this, he calls the command 8-1: udscp -p -f 2 -n 1-1,1-4 /ext/home/harry/input.dat uds:/uds/harry from the frontend node 10. The option -f 2 sets the number of file fragments (striping factor) to two. The copy command (udscp) of the command line tools contacts the nameserver 6 (6-1), which tells it the nodes to be used. Then, the copy command (udscp) connects to the daemons 7-1 and 7-4 (udsd) on each of these nodes 1-1 and 1-4, which spawn child processes 70-1 and 70-4 as service processes to serve this user's requests, wherein the service processes 70-1 and 70-4 run in the requesting user's context. To perform the requested action, the service processes 70-1 and 70-4 transfer the respective parts of the input file to the specified location on the local disks 5-1 and 5-4 (/uds/harry). Because the user provided the -p option, the service processes 70-1 and 70-4 both directly access the input file and read it in parallel via the storage network. The service processes do not have to exchange any data with each other during this import. If the input file is not directly accessible to the service processes, for instance if it was stored on a local disk of the frontend node, the copy command 8-1 (udscp) would send the file to the service process via the same connection used for sending the request to the daemon 7-x.

The situation when a command line tool 8 is querying the nameserver 6 is illustrated in Fig. 8. Another user wants to have status information on a specific UDS file. The user uses a different UDS command line tool 8-1 (udsls) for this purpose. The command 8-1 (udsls) contacts the nameserver 6 and delivers the information to the user. If the user had added the option -n to his command, udsls would verify the information received from the nameserver against the actual state of the file fragments on the nodes. This verification would be done by contacting the UDS daemons.

Another example of using a command line tool 8 for probing a UDS file is shown in Fig. 9.
Like in the previous example, the user indicated that he does not require the returned information to be the most current information available; it was sufficient to return the information that the nameserver 6 has available in its database. However, if an application currently had the file open to append data, some information, like the file size returned by the nameserver 6, would not match the actual accumulated size of the fragments on the nodes. To get this up-to-date information, the command udsls would contact the daemons 7-x on the respective nodes 1-x. This case is shown for the UDS command line tool 8-1, which is executed for instance on the external system 13. Next to the connection to the nameserver 6-1, this tool also has a connection to a daemon-generated service process 70-1 on compute node 1-1. This service process gathers up-to-date information on the file fragment stored on the local disk 5-1 and delivers it back to udsls.

It has to be noted that the usage of UDS is not at all limited to the example configuration above and the activities presented. Due to its flexibility, the user could also create UDS files on the external storage system of this configuration, or exploit any other configuration not covered by these examples.

In one preferred embodiment of the present invention, it is an object to provide a scalable I/O bandwidth for accessing a single file with an MPI application. The bandwidth should therefore scale with the size of the application. In this context, the size of an application is considered to be the number of processes, also called individual applications, used to run the application. Typically, this is directly related to the size of the problem processed by this application and to the amount of I/O that this application has to perform. Using an increasing number of processes for an application means using a larger number of compute nodes, as there is often a fixed system-dependent ratio between the number of processes and the number of nodes used to run the processes. To scale the I/O bandwidth with the number of nodes 1-x, each UDS file is distributed across an arbitrary number of nodes in a regular way: the file is preferably divided linearly into blocks of fixed size, and the first block of data is stored on the first node, the second block on the second node and so on. For n nodes holding blocks of the file, the (n+1)th block is again stored on the first node, employing a round-robin distribution pattern. This type of distribution is called striping. The size of the blocks is called the stripe unit, and the number of nodes used to distribute the file is called the stripe factor. Fig. 10 illustrates the striped distribution of a single file 20 to a number of three nodes 1-1 to 1-3, starting from node 1-1. The resulting partial files 20-1, 20-2 and 20-3 on each node are the fragments or file fragments.

Because each node 1-x accesses its blocks of data independently from the other nodes, the bandwidth of all these concurrent accesses adds up to the total bandwidth available to the application. However, for full exploitation of this effect, it is necessary that each node uses a storage system that is independent from all other nodes. This can be achieved by using the local disks of the nodes, which are not visible outside the node. If the nodes 1-x use storage devices 11 that are shared between nodes, the effective bandwidth as experienced by the application may be less than the bandwidth that a single node experiences times the number of nodes.
The degree of this degradation depends on the characteristics of the storage device used and on the way that the application accesses the file. In the worst case, the multiple nodes may access the file with an accumulated bandwidth that is not higher than the bandwidth experienced by a single node. In this preferred embodiment, UDS creates a new file on all nodes on which the MPI application is running. The best location for the file is on the local disks 5-1 to 5-4 of the nodes. However, UDS is not bound to create a new file at any predefined location on the nodes. Instead, because the fragments are just regular files, they can be placed on any file system, mapped to a storage device underneath, which is visible on the nodes. The user has total freedom to place a UDS file wherever he wants, according to his requirements of I/O bandwidth, capacity, external accessibility and other factors, e.g. limited persistence of the file system contents due to system policies. To assist the user in the specification of the storage locations for a file, UDS supports at least three different types of path specifications, as described in detail above.

From a user's view, using UDS from an MPI application is very simple and only requires the use of a UDS file name. All MPI objects are still supported when using UDS. The MPI applications which will use UDS are still normal MPI applications and are started using the mpirun or mpiexec startup commands as usual. To access UDS files from MPI, it is only required to provide a file name to the related MPI function, like MPI_File_open(), which implies a UDS path specification as discussed in detail above. A new file will then be striped across all nodes 1-x running processes which are part of the communicator being passed to MPI_File_open(). If an implicit path was used, the file will be striped across all nodes 1-1 to 1-n which are associated with this implicit path. It is also possible to explicitly specify the nodes to be used, or the striping factor. If an existing file is opened, it will preferably remain striped across those nodes on which it was striped when it was created. UDS ensures that the data is efficiently exchanged between the nodes on which the individual applications 3-x of the application 3 are running and the nodes which store the file. Additional file information can be provided by the application 3 to the MPI library using MPI_Info objects when opening a file. Vice versa, the MPI library provides information about an open UDS file which can be retrieved using MPI_Info objects as well.

MPI-IO defines a common data format for the single, MPI-typed elements written to a file. This data format can be chosen by using the external32 data representation. However, for a single UDS namespace which is accessed from more than one system running MPI applications based on an MPI library, UDS is able to manage for instance the endian-related data format of the files automatically, to provide each application a correct view of the data in the file even if using the default data representation.

UDS is able to manage different data formats, wherein a particular management will be described in further detail. According to one embodiment, UDS sets the byte-order setting (big or little endian) of UDS files at the moment of creation through the MPI-IO interface. When a file is first created, the endian type (big or little) used for data written to this file from this application is the native endian type of the machine used.
This means the contents of a newly created file become an exact binary copy of the data in memory. The endian type of a UDS file can be determined by using the command line tool udsstat. Preferably, it is not advised to change the endian type of a UDS file, as this will lead to data corruption if used inconsistently. However, the command line tool udsadmin allows setting the endian type of a UDS file. Some platforms support an extended width mode which increases the size of the integral and floating point data types to at least 8 bytes. If a new file is created and the data representation is set to external32 before any data has been written to the file, UDS will store the related data format characteristics as well. In both cases, UDS manages this extended data format in the same way as it handles different endian formats. Likewise, these file attributes can be retrieved and set externally via the command line tools udsstat and udsadmin. For UDS files which are not created through the MPI-IO interface, but via the external interface, these file attributes may need to be specified explicitly.

When an existing UDS file is opened via MPI-IO for reading or writing, the endianness of the data is automatically transformed to match both the required representation of the data in the memory of the machine on which the application is running and the stored endian setting of the UDS file. This applies to all predefined multi-byte MPI datatypes, and also to the corresponding elements of derived MPI datatypes. This means that if the native endianness of the machine is different from the endian setting of the file, the byte order of all data elements will be changed on all read and write operations. If both the endianness of the machine and the endian setting of the file are identical, the data is directly transferred between memory and file. Naturally, the file access bandwidth will be higher for file access which does not involve byte reordering.

It is possible to override the implicit data transformation by enforcing a certain endian type for a file. This can be done by passing an appropriate MPI_Info object to MPI_File_open() or MPI_File_set_info(). This object has to contain the key endian_file, which supports the values big, little and native. The values big and little cause all data written to be transformed into the respective endian format before it is stored in the file. Likewise, data read from the file will be assumed to have the indicated endian format and will be transformed into the native endianness of the machine before it is stored in memory. For the value native, the equivalent operations are performed based on the native endian format of the machine. As using this method explicitly overrides the endian format of the file as managed by UDS, UDS will not change the persistent setting of the endian format of the file with the nameserver. This implies the risk of corrupting data of a file if the explicit data transformation is used incorrectly. The implicit endian transformation applies, in this embodiment, only to files which are accessed through the MPI-IO interface using the native data representation. If a file is accessed via the external interface to UDS, the data is read or written to or from memory without any transformation. If the file is accessed through the MPI-IO interface via a data representation other than native, the data is transformed only according to the data representation used, without any implicit endian transformation.
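A minimal sketch of such an override is shown below. The key name (written here as endian_file) and its values follow the description above, although the exact spelling of the key is garbled in this copy of the specification; all MPI calls used are standard MPI-IO, and the file name is an arbitrary example.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        MPI_Info info;

        MPI_Init(&argc, &argv);

        /* Hint object enforcing a big-endian on-disk representation, as
         * described in the text; the key (assumed spelling: endian_file)
         * supports the values "big", "little" and "native". */
        MPI_Info_create(&info);
        MPI_Info_set(info, "endian_file", "big");

        MPI_File_open(MPI_COMM_WORLD, "uds:/uds/example/data.bin",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);

        /* ... reads and writes are now byte-swapped as needed against the
         *     enforced big-endian file format ... */

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }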
If data is written to the file through such a non-native data representation, UDS will mark the endian setting of this file as unknown. The same is true for UDS files which have not been created through the MPI-IO interface, but via the external interface. Accessing UDS files with unknown endian setting through the MPI-IO interface will make the read/write operation generate the error MPI_ERR_FORMAT if the native data representation is in place and the read/write operation uses an MPI datatype which requires an endian transformation. The data will be transferred between memory and file without any data transformation being performed. This means that, if the application knows how to handle the data correctly, it may continue despite this error condition.

If the endian type of the nodes on which the processes of a single MPI application run is not the same for all nodes, the endian setting of this file will be unknown from the beginning if the native data representation is used. In this case, it is recommended to use the external32 data representation, which will set the endian type to big.

The detailed description above is intended only to illustrate certain preferred embodiments of the present invention. It is in no way intended to limit the scope of the invention as set out in the claims.

Claims (36)

1. A method for storing and accessing a data file (2) in one or more parallel processing system(s), the system(s) comprising a plurality of nodes (1-1, 1-2, ..., 1-n), wherein the method comprises the steps:

a) splitting a data file (2) accessible to an application (3) into a plurality of file fragments (2-1, 2-2, ..., 2-m),

b) storing said file fragments (2-1, 2-2, ..., 2-m) in a plurality of storage means (5-1, 5-2, ..., 5-l) accessible within the parallel processing system, wherein the splitting and the distributed storing of the file fragments on the storage means is individually defined by a user;

c) splitting the application (3) to be used for parallel file input/output into a plurality of individual applications (3-1, 3-2, ..., 3-k), wherein each of said individual applications (3-1, 3-2, ..., 3-k) may run on an individual node (1-1, 1-2, ..., 1-n),

d) providing information on the distribution of the distributed stored file fragments (2-1, 2-2, ..., 2-m) by at least one distribution information means (6) to at least one of the individual applications (3-1, 3-2, ..., 3-k),

e) launching at least one separate I/O process (4-1, 4-2, ..., 4-o) by a plurality of processes of said individual application(s) (3-1, 3-2, ..., 3-k) dependent on the distribution of the stored file fragments
(2-1, 2-2, ..., 2-m), wherein said I/O processes perform access to the file fragments (2-1, 2-2, ..., 2-m) on the storage means (5-1, 5-2, ..., 5-l) and each of said at least one I/O process(es) (4-1, 4-2, ..., 4-o) communicates for the transfer of data and control information with the corresponding individual application (3-1, 3-2, ..., 3-k) via Message Passing Interface (MPI) messages, wherein the data storage and accessing method is workable on top of a native available file system of the nodes (1-1, 1-2, ..., 1-n).

2. The method according to claim 1, wherein at least one separate I/O process (4-1, 4-2, ..., 4-o) is launched by an individual application (3-1, 3-2, ..., 3-k) on a node (1-1, 1-2, ..., 1-n), the I/O process comprising access to the storage means (5-1, 5-2, ..., 5-l) storing the file fragment (2-1, 2-2, ..., 2-m) which is accessed by the application (3).
3. The method according to claim 1 or 2, wherein the data are transformed when they are transferred from the memory (RAM) of the application process to the file fragment located on the storage means.
4. The method according to claims 1, 2 or 3, wherein the data are transformed when they are transferred from the file fragment located on storage means to the memory (RAM) of the application process.
5. The method according to claim 3 or 4, wherein the data is transformed for files which are accessed through the MPI-IO interface.
6. The method according to any of claims 3 to 5, wherein the method provides an implicit data transformation step between different data type representations.
7. The method according to any one of claims 3 to 6, wherein a user can explicitly specify a data transformation via a/the MPI-IO interface.
8. The method according to any one of claims 1 to 7, wherein the native endian data type of the machine on which the application is running is used when creating a new file (2) with a/the MPI-IO interface.
9. The method according to any one of claims 3 to 8, wherein when an existing file (2) is opened via MPI-IO for reading or writing, and in case the endian type of the data in the file does not match the endian type of the data in the memory, the endianness of the data is automatically transformed on read and write access.
10. The method according to any one of claims 1 to 9, wherein the byte order of all data elements will be changed on all read and write operations when the native endian of the machine is different from the endian setting of the file.
11. The method according to any one of claims 1 to 10, wherein the data is directly transferred between the memory of the application process and the file when the native endianness of the machine is identical to the endian setting of the file.
12. The method according to any one of claims 1 to 11, wherein file attributes of different data type representations are retrieved and set externally via command line tools.
13. The method according to any one of claims 1 to 12, wherein when accessing a file via an external interface, the data is read or written to or from the memory without any transformation.
14. The method according to any one of claims 1 to 13, wherein the data type representation may be one of the group of: endian type big, endian type little, external32, or any other data type representation defined by a specific combination of the byte order (endian) and the binary representation of numerical values.
15. The method according to any of claims 1 to 14, wherein each individual application (3-1, 3-2, ..., 3-k) may be allowed to access the whole data file (2).
16. The method according to any of claims 1 to 15, wherein only one separate I/O process (4-1, 4-2, ..., 4-o) is launched by the application on a node (1-1, 1-2, ..., 1-n), the I/O process provides access to storage means (5-1, 5-2, ..., 5-l) storing the file fragment (2-1, 2-2, ..., 2-m) which is read or written by the application (3).
17. The method according to any of claims 1 to 16, wherein the set of nodes (1-1, 1-2, ..., 1-n) and file system location on each node is not predetermined by a data storage and accessing system, but can instead be freely defined by the user via the file name which comprises a path specification which may be resolved either by the client, the distribution information means (6) or by the I/O processes.
18. The method according to any of claims 1 to 17, wherein at least one of the storage means (5-1, 5-2, ..., 5-l) is a local storage device of a node (1-1, 1-2, ..., 1-n).
19. The method according to any of claims 1 to 18, wherein at least one of the storage means (5-1, 5-2, ..., 5-l) is a shared storage device of a plurality of nodes (1-1, 1-2, ..., 1-n).
20. The method according to any of claims 1 to 19, wherein the storage location of the file fragments (2-1, 2-2, ..., 2-m) is determined by selecting a named distribution pattern which is pre-defined by the user or system administrator.
21. The method according to any of claims 1 to 20, wherein the file access of the application (3) is controlled by the application (3) itself and the individual applications (3-1, 3-2, ..., 3-k) themselves.
22. The method according to any of claims 1 to 21, wherein the method makes the data files (2), which are composed of a plurality of file fragments (2-1, 2-2, ..., 2-m), visible to all processes on computing nodes which are able to communicate with the distribution information means (6) and all nodes comprising the storage means storing any of the file fragments (2-1, 2-2, ..., 2-m).
23. The method according to any of claims 1 to 22, wherein the information of the distribution of the file fragments is provided by a plurality of distribution information means (6-1, 6-2, ..., 6-p).
24. The method according to any of claims 1 to 23, wherein the method comprises providing service programs (8) to access, list, delete and modify the data file (2) by global commands.
25. The method according to any of claims 1 to 24, wherein the application (3) accessing a file may run on any subset of said plurality of nodes which form the parallel processing system, wherein a node can be a parallel computer, or on an individual node.
26. The method according to any of claims 1 to 25, wherein each node (1-1, 1-2, ..., 1-n) is able to execute said individual applications (3-1, 3-2, ..., 3-k) and to provide I/O access to a desired file fragment (2-1, 2-2, ..., 2-m).
27. The method according to any of claims 1 to 26, wherein the method runs the distribution information means (6) on one or more of said nodes (1-1, 1-2, ... , 1-n) or on at least one additional node.
28. The method according to any of claims 1 to 27, in which a database is maintained on a distribution information means (6) which keeps a record of each file (2) and the corresponding file fragments (2-1, 2-2, ..., 2-m).
29. The method according to any of claims 1 to 28, wherein a system name space is provided by a storage information means (6) for managing the storage and access of the files (2).
30. The method according to any of claims 1 to 29, wherein on each node (1-1, 1-2, ..., 1-n) a daemon (7-1, 7-2, ..., 7-n) is provided for providing information of the file fragments to the distribution information means (6).
31. The method according to claim 30, wherein each daemon runs as a resident process or can be started on demand under the respective user account.
32. The method according to any of claims 1 to 31, wherein the communication between the application (3) or the individual applications (3-1, 3-2, ..., 3-k) and the distribution information means (6) is performed via TCP/IP.
33. The method according to claim 32, wherein encrypted sockets for the TCP/IP communication are used.
34. The method according to any of claims 1 to 33, wherein a filename of a file (2) in a local meta representation is mapped on the distribution information means (6) which contains all information to access the distributed file fragments (2-1, 2-2, ..., 2-m).
35. A data storage and accessing method substantially as herein described with reference to the accompanying drawings.
36. A data storage and accessing system programmed to carry out the method according to any of claims 1 to 35.
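Again as a purely illustrative aid outside the claims, the sketch below shows one way the interaction recited in claim 1 above could look from the side of a single application process: a separate I/O process is launched with MPI_Comm_spawn, and data as well as control information (here simply a file offset) are transferred to it as MPI messages. The executable name uds_io_server and the message layout are assumptions made only for this example.

    #include <mpi.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        MPI_Comm io_comm;                      /* intercommunicator to the I/O process */
        char     payload[] = "fragment data";  /* data destined for a file fragment    */
        long     file_offset = 0;              /* control information: target offset   */

        MPI_Init(&argc, &argv);

        /* Launch one separate I/O process for this application process.  The
         * executable name is a placeholder; spawn errors are ignored for brevity. */
        MPI_Comm_spawn("uds_io_server", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &io_comm, MPI_ERRCODES_IGNORE);

        /* Control information and data are sent to the I/O process as ordinary MPI
         * messages; the I/O process would then write the data into the file fragment
         * stored on the storage means it is responsible for. */
        MPI_Send(&file_offset, 1, MPI_LONG, 0, 0, io_comm);
        MPI_Send(payload, (int)strlen(payload) + 1, MPI_CHAR, 0, 1, io_comm);

        MPI_Comm_disconnect(&io_comm);
        MPI_Finalize();
        return 0;
    }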
AU2005200656A 2004-04-23 2005-02-14 User-level data-storage system Ceased AU2005200656B2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE102004019854.3 2004-04-23
DE102004019854A DE102004019854B4 (en) 2004-04-23 2004-04-23 Data storage and access system

Publications (2)

Publication Number Publication Date
AU2005200656A1 AU2005200656A1 (en) 2005-11-10
AU2005200656B2 true AU2005200656B2 (en) 2010-04-08

Family

ID=34428969

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2005200656A Ceased AU2005200656B2 (en) 2004-04-23 2005-02-14 User-level data-storage system

Country Status (4)

Country Link
AU (1) AU2005200656B2 (en)
DE (1) DE102004019854B4 (en)
FR (1) FR2870951B1 (en)
GB (1) GB2413410B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102312632B1 (en) 2014-06-11 2021-10-15 삼성전자주식회사 Electronic apparatus and file storaging method thereof
CN110955515A (en) * 2019-10-21 2020-04-03 量子云未来(北京)信息科技有限公司 File processing method and device, electronic equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1998024025A1 (en) * 1996-11-27 1998-06-04 1Vision Software, L.L.C. File directory and file navigation system
AU6778601A (en) * 2000-06-26 2002-01-08 International Business Machines Corporation Data management application programming interface for a parallel file system
EP1298536A1 (en) * 2001-10-01 2003-04-02 Partec AG Distributed file system and method of operating a distributed file system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Cettei, M. M. et al. "Support for parallel out of core applications on Beowulf workstations" *
Dibble, P. C. et al. "Bridge: A high performance file system for parallel processors" *
Garcia, F. et al. "A parallel and fault tolerant file system based on NFS servers" *

Also Published As

Publication number Publication date
FR2870951A1 (en) 2005-12-02
AU2005200656A1 (en) 2005-11-10
DE102004019854B4 (en) 2010-09-16
GB2413410B (en) 2007-01-03
GB0504446D0 (en) 2005-04-06
DE102004019854A1 (en) 2005-11-17
GB2413410A (en) 2005-10-26
FR2870951B1 (en) 2012-10-12

Similar Documents

Publication Publication Date Title
US7962609B2 (en) Adaptive storage block data distribution
US6732104B1 (en) Uniform routing of storage access requests through redundant array controllers
US7636801B1 (en) Coordination of quality of service in a multi-layer virtualized storage environment
US7386662B1 (en) Coordination of caching and I/O management in a multi-layer virtualized storage environment
US7562110B2 (en) File switch and switched file system
US7818515B1 (en) System and method for enforcing device grouping rules for storage virtualization
EP2282276B1 (en) System for client connection distribution according to system load
CA2405405C (en) Storage virtualization in a storage area network
US8225057B1 (en) Single-system configuration for backing-up and restoring a clustered storage system
CN106168884B (en) Access the computer system of object storage system
US20070038697A1 (en) Multi-protocol namespace server
US20070239655A1 (en) Method and system for executing directory-basis migration in a global name space
US20040210724A1 (en) Block data migration
US20050251522A1 (en) File system architecture requiring no direct access to user data from a metadata manager
JP2003216474A (en) Network storage virtualization method
JPH07295871A (en) Database access efficiency improvement method and system
CN1723434A (en) Apparatus and method for a scalable network attach storage system
AU2003300350A1 (en) Metadata based file switch and switched file system
JP2004227127A (en) Program having multiple pieces of environmental information, and information processor having the program
WO2006020045A1 (en) Role-based node specialization within a distributed processing system
US20080288498A1 (en) Network-attached storage devices
US8627446B1 (en) Federating data between groups of servers
US7069276B2 (en) Computer system
US7627650B2 (en) Short-cut response for distributed services
EP1588360B1 (en) System and method for distributed block level storage

Legal Events

Date Code Title Description
FGA Letters patent sealed or granted (standard patent)
PC Assignment registered

Owner name: NEC CORPORATION

Free format text: FORMER OWNER WAS: NEC EUROPE LTD.

MK14 Patent ceased section 143(a) (annual fees not paid) or expired