IES70671B2 - Data processing system - Google Patents

Data processing system

Info

Publication number
IES70671B2
IES70671B2 · IES960500A
Authority
IE
Ireland
Prior art keywords
data
data item
file
requested
database
Prior art date
Application number
Inventor
Anthony Stafford
Original Assignee
Anthony Stafford
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anthony Stafford filed Critical Anthony Stafford
Priority to IES960500 priority Critical patent/IES960500A2/en
Publication of IES70671B2 publication Critical patent/IES70671B2/en
Publication of IES960500A2 publication Critical patent/IES960500A2/en
Priority to ES9701496A priority patent/ES2147689B1/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A client/server data processing system includes at least one application running on a client PC 12 and a database 16 on the server 10. The database 16 has a hierarchical data structure wherein each data item in the database includes a pointer to any child or sibling data item of that data item. When a particular data item is requested by the application, both the requested data item and at least one subordinate data item of the requested item are read out, the requested data item being returned to the application and the subordinate data item being stored in a cache for subsequent use by the application if requested.

Description

DATA PROCESSING SYSTEM

This invention relates to a data processing system.
According to the present invention there is provided a data processing system including at least one application and a database having a hierarchical data structure containing data for use by the application, wherein each data item in the database includes a pointer to any child or sibling data item of that data item, and wherein the system includes means operable when a particular data item is requested by the application for reading out both the requested data item and at least one subordinate data item of the requested item, the requested data item being returned to the application and the subordinate data item being stored in a cache for subsequent use by the application if requested.
An embodiment of the invention will now be described, by way of example, with reference to the accompanying drawings, in which: Fig. 1 is a block diagram of a client/server system in which the invention may be used, Fig. 2 illustrates the main components of a database in the system of Fig. 1, Fig. 3 illustrates the operation of the system according to the embodiment, and Figs. 4a and 4b show an example of the data organisation in a database of Fig. 1.
In the embodiment the invention is described in relation to a system running the commercially available on-line analytical processing (OLAP) product known as Essbase, developed by Arbor Software Corporation. However, the invention is not limited thereto.
Essbase operates in a client-server computing environment on a local area network, Fig. 1. The Essbase Data Analysis Server runs on an IBM OS/2, Microsoft Windows NT or UNIX based server 10 - all multi-tasking operating systems which can efficiently manage multiple simultaneous requests for data. This environment enables multiple users of an executive information system (EIS) to retrieve and analyse centralised data on their personal computers 12.
For the purposes of this specification the Essbase product is regarded as one which services read-only requests for data as this is the mode most often used to serve EIS requirements. Typically the Essbase server 10 will be a dedicated high-end Intel Pentium based machine (often with multiple microprocessors) equipped with large capacities of hard disk space and physical memory. The PCs 12 running the EIS applications will use either Microsoft Windows 3.1 or Microsoft Windows 95 as their operating system. As a user interacts with the EIS application to retrieve data, requests are sent over the network 14 to the Essbase server 10 and data is retrieved and returned to the client PCs to facilitate their data access and data analysis requirements.
The Essbase Data Analysis Server running on the server 10 contains all multidimensional database information. The server also handles all data control, data storage, security and database management functions. When Essbase is running, a server task 24 (Fig. 3) acts as a coordinator for all user data requests. This task directs requests for Essbase database records. Databases in Essbase (shown schematically at 16 in Fig. 1) are logically grouped into so-called applications. An application is simply a collection of Essbase databases logically grouped together. For example, a sales analysis system may consist of a ’Sales' application which comprises two databases, Orders and Payments. Data resides in a combination of memory and disk on the server 10.
The Essbase server software installation creates the following directory structure on the server:

    ESSBASE
      \APP (for applications)
      \BIN (for Essbase software files)

The physical contents of an Essbase database are stored in a sub-directory of the \APP directory. This will contain a .PAG file 18 (page file, where data is stored) and a .IND file 20 (the index to the .PAG file) which is always loaded into physical memory 22 on the server 10 at startup, as seen in Fig. 2. These two files together form the Essbase database. Large Essbase applications can have .PAG files exceeding 10Gb in size.
The basic unit of storage in an Essbase database is the block. This is of variable size and will contain one or more numeric values, ultimately of interest to the end user of the EIS. The developer of an Essbase database will have labelled the various characteristics or dimensions of the database as either dense or sparse depending on the nature of the data. Dense dimensions form the blocks and determine the size of the block, while sparse dimensions determine the number of blocks which will exist. When Essbase writes a block to the .PAG file it compresses it using a simple algorithm which looks for four or more of the same data values in the same sequence within the block. This is termed run-length encoding (RLE) and is used to compress any repetitive values - any value that repeats four or more times consecutively.
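The compression step can be pictured with a minimal sketch in C: runs of four or more identical consecutive values are collapsed into a count/value pair, while shorter runs pass through as literals. The one-byte tag format used here is an assumption made for illustration; the actual Essbase block layout is not set out in this specification.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum { TAG_LITERAL = 0, TAG_RUN = 1 };

/* Encode n doubles from src into dst; returns the number of bytes written.
 * dst must be large enough (worst case: 9 bytes per input value). */
size_t rle_encode(const double *src, size_t n, unsigned char *dst)
{
    size_t i = 0, out = 0;
    while (i < n) {
        size_t run = 1;                        /* length of the current run */
        while (i + run < n && src[i + run] == src[i])
            run++;
        if (run >= 4) {                        /* long enough to compress */
            uint32_t count = (uint32_t)run;
            dst[out++] = TAG_RUN;
            memcpy(dst + out, &count, sizeof count);
            out += sizeof count;
            memcpy(dst + out, &src[i], sizeof src[i]);
            out += sizeof src[i];
        } else {                               /* too short: emit literals */
            for (size_t j = 0; j < run; j++) {
                dst[out++] = TAG_LITERAL;
                memcpy(dst + out, &src[i + j], sizeof src[i]);
                out += sizeof src[i];
            }
        }
        i += run;
    }
    return out;
}

int main(void)
{
    /* 11 values, 88 bytes raw; the two runs compress to 13 bytes each. */
    double block[] = { 5, 5, 5, 5, 5, 1, 2, 7, 7, 7, 7 };
    unsigned char buf[128];
    size_t len = rle_encode(block, sizeof block / sizeof block[0], buf);
    printf("%zu bytes raw -> %zu bytes encoded\n", sizeof block, len);
    return 0;
}
```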
The .PAG file is seen by the operating system as a normal data file. To the Essbase system, however, it is seen as a number of discrete 'blocks' of data with each block and data item having its own unique offset in the .IND file (per database). The .IND file is also a physical disk file. Due to its access hit-rate and also its compact size it is loaded into physical memory on the server at application startup and accessed from there.
Referring to Fig. 3, Essbase is started on the server 10 when the ESSBASE.EXE program (Essbase task 24) runs. This is normally scheduled to occur automatically when the server starts. The Essbase task handles the starting of applications. Its first task is to check which applications/databases are to be started. For each of these an application server process 26 is created. The index for each database is loaded into physical memory from the .IND file where it will remain for the duration of the Essbase session. Thereafter, as client requests for data arrive over the network they are dispatched by the Essbase task 24 to the appropriate application server process 26 (the one which contains the database for which data has been requested). This process will analyse the request and, using its index in memory, will construct a list of physical blocks to be retrieved from the .PAG file located on disk.
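The index load at startup can be pictured with a brief sketch in C, reading the whole .IND file into a heap buffer where it remains for the session; the function name and error handling are illustrative, not part of the specification.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read an entire .IND file into one heap buffer; returns NULL on error. */
void *load_index(const char *path, long *size)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;
    fseek(f, 0, SEEK_END);
    *size = ftell(f);            /* index files are compact, so the whole */
    fseek(f, 0, SEEK_SET);       /* file fits comfortably in physical RAM */
    void *buf = malloc(*size);
    if (buf && fread(buf, 1, *size, f) != (size_t)*size) {
        free(buf);
        buf = NULL;
    }
    fclose(f);
    return buf;
}
```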
In a conventional Essbase system a routine (disk access module 28) is then called to read the .PAG file and buffer the data from disk into an area of physical memory. Control then returns from the disk access module 28 to the application process 26. This passes a pointer to the memory location of the retrieved data back to the Essbase task 24 which transfers this data back to the requesting user's PC 12 over the local area network 14.
When the disk access module 28 is called, it is passed a string of physical file segments to be retrieved from the .PAG file. The list of segments to be retrieved has been built by an internal Essbase data retrieval module in conjunction with the .IND file, which contains the index or roadmap used to map an end-user query onto the physical file. The disk access module 28 is called with a list of disk retrieve requests (as data to satisfy a user's query may not be located contiguously on disk). This list could for example take the form 23678:900,253:2,13445678:6700.
This request will involve the disk access module 28 retrieving 900 bytes from offset 23,678 in the file, 2 bytes from offset 253 and finally 6700 bytes from offset 13,445,678 in the file. This data is concatenated together into one physical data item and returned by the disk access module.
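A minimal sketch in C of how such a request list might be parsed and serviced, assuming the "offset:length" string format shown above; standard C file I/O stands in for the actual disk access code, and all names are illustrative.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct { long offset; size_t length; } Segment;

/* Parse a comma-separated "offset:length" list into segs[]. */
size_t parse_segments(const char *req, Segment *segs, size_t max)
{
    size_t n = 0;
    while (*req && n < max) {
        char *end;
        segs[n].offset = strtol(req, &end, 10);
        if (*end != ':')
            break;
        segs[n].length = (size_t)strtol(end + 1, &end, 10);
        n++;
        if (*end != ',')
            break;
        req = end + 1;
    }
    return n;
}

/* Read every segment from the page file and concatenate the data. */
unsigned char *read_segments(FILE *pag, const Segment *segs, size_t n,
                             size_t *total)
{
    size_t pos = 0;
    *total = 0;
    for (size_t i = 0; i < n; i++)
        *total += segs[i].length;
    unsigned char *buf = malloc(*total);
    if (!buf)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        fseek(pag, segs[i].offset, SEEK_SET);
        pos += fread(buf + pos, 1, segs[i].length, pag);
    }
    *total = pos;                       /* bytes actually read */
    return buf;
}
```

Parsing the example string above yields three segments totalling 7,602 bytes, read in the order given and returned as one contiguous item.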
For example, the original request by the user may have been the sales figure for stereos in Europe for 1994. The internal Essbase data retrieval module maps this into a list of physical segments to be retrieved, as described above.

The disk access module 28 used by Essbase fulfills its requirements in that when given a list of physical blocks to be retrieved from the .PAG file (this information coming from the .IND file) it accesses disk and buffers the required data blocks into physical memory for subsequent transfer back to a client PC 12 over the local area network 14 by the Essbase task 24.
There are however a number of limitations with the way the conventional disk access module 28 achieves its goal:

1. Its use of essentially MS-DOS handle-based file I/O functions, albeit in Windows mode. When called it opens (returning a file handle) the specified .PAG file in read/write mode and retrieves the blocks of data required by changing the file pointer (equivalent to MS-DOS interrupt 0x21 function 0x42). Its use of read/write mode means access to any single database is single threaded. That is, if three users request data from a single database concurrently the read/write nature of the disk access module will mean they will operate on a queued basis with inherent delays in receiving data. In other words, the data access tasks 30 (Fig. 3) by which the disk access module 28 reads the data from the .PAG files 18 are queued. This is true regardless of whether all three only requested read access to the database.

2. It retrieves the minimum amount of data required, which may not always be the most efficient. For example, suppose an EIS user requests the figure for sales in the European market in one of their data requests. The disk access module will be called to access the data in the .PAG file. Moments later the user may request a drill-down (a term describing, for example, the expansion of a sales figure for Europe into its constituent parts, e.g. Ireland, UK, France etc.). The disk access module 28 will again be called on to service this disk access request. Although the data for this query may have been accessed only moments before, the .PAG file will again have to be opened and, following the example described above, the data within Sales for Europe will be accessed.
To overcome these limitations, the present embodiment of the invention incorporates a new disk access module 32. To this end a new .DLL called DSESSB32.DLL has been compiled. This .DLL contains a routine NewEssbDisk which is executed when the .DLL is loaded into memory.
This routine overwrites the EssDHand routine in memory (the original disk access module code) with the new code contained in DSESSB32.DLL.
Thus when an application process 26 calls EssDHand it no longer calls the old routine 28 but now uses the new routine 32. The disk access module 32 differs from the conventional module 28 in a number of significant ways:

1. The new module 32 takes advantage of Windows NT's multi-threaded architecture which allows asynchronous file I/O. That is, the system can be instructed to read from disk while the rest of the code continues to execute in parallel. Unlike the conventional disk access module, which first reads all of the blocks required into an area of memory and then decompresses them into an area of memory, the new module uses specific Windows 32-bit function calls which permit asynchronous I/O as described above.
Thus, when the module is called it will multi-task the tasks of reading the required blocks into memory from disk and decompressing the data returned for transfer back to the client. By multi-tasking these two distinct operations a significant performance benefit is achieved as the true multi-tasking capabilities of Windows NT are being utilised. Currently, while the system is reading the blocks from disk there is additional processing capability which is not being utilised to achieve timely data transfer back to the end user. The Windows NT ReadFile function call is used in OVERLAPPED mode to achieve the functionality described here.

2. As mentioned previously, the disk access module 28 operates in read/write mode - as well as accepting data read requests it can also handle data update requests. In our experience of implementing EIS applications, only a very few specialised EIS developments will ever require write access to the Essbase database directly from the client's EIS application. By taking the view that all access to the Essbase database will be of a read-only nature, the disk access module 32 can be implemented in such a way that it will handle data requests from multiple users of the same database concurrently, thus removing a major potential bottleneck. Unlike the conventional disk access module, which queues client requests for access to the same database, the new module initiates a separate NT worker process for each data request received. This process uses the Win32 file functions which make it possible to open the same file two or more times. Every time a file is opened a new file handle is returned to the calling process. Essentially, the process can read the .PAG file as if it were the only task accessing the file even though there may be two or more other processes also retrieving data from the file. Thus multiple requests for data from the same database can be handled concurrently, which will make EIS data available to users in a more time-efficient manner.
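The following is a minimal sketch of the two Win32 techniques just described, assuming illustrative file names, offsets and buffer sizes: the .PAG file is opened read-only with FILE_SHARE_READ, so any number of worker processes can hold their own handle to it, and ReadFile is issued in OVERLAPPED mode so that decompression of a previous block can proceed in parallel with the disk transfer.

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Read-only, shared access: every opener gets its own file handle. */
    HANDLE h = CreateFileA("SALES.PAG", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    char buf[4096];
    OVERLAPPED ov = {0};
    ov.Offset = 23678;                  /* block's offset in the file */
    ov.hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);

    /* Issue the read; with FILE_FLAG_OVERLAPPED the call returns at
     * once and the I/O completes in the background. */
    if (!ReadFile(h, buf, sizeof buf, NULL, &ov) &&
        GetLastError() != ERROR_IO_PENDING)
        return 1;

    /* ... decompress the previously fetched block here, in parallel
     * with the disk transfer ... */

    DWORD got = 0;
    GetOverlappedResult(h, &ov, &got, TRUE);    /* wait for completion */
    printf("read %lu bytes\n", (unsigned long)got);

    CloseHandle(ov.hEvent);
    CloseHandle(h);
    return 0;
}
```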
3. A further function of the new disk access module 32 is to implement data caching. A global area of memory is maintained as a data cache 34, Fig. 3, which will contain the most recently accessed database blocks. The size of this cache will depend on the total amount of physical memory installed on the server. When the new disk access module 32 is called by an application server process, rather than go directly to disk and read in the requested blocks, a cache index (which will be maintained by the disk access module 32) is used to check if the requested block is in the cache. If it is, the data is retrieved from there, otherwise the .PAG file on disk will be accessed. In this way a substantial time saving can be effected for retrieval of frequently accessed data blocks, as fast physical memory rather than slower magnetic disk will be accessed.
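A minimal sketch of such a block cache in C, keyed on a block's offset in the .PAG file; the direct-mapped table and its replace-on-collision eviction are assumptions made for brevity, as the cache organisation is not specified.

```c
#include <stdio.h>
#include <stdlib.h>

#define CACHE_SLOTS 1024

typedef struct {
    long   offset;        /* block's offset in the .PAG file */
    void  *data;          /* the cached block; NULL = empty slot */
    size_t size;
} CacheEntry;

static CacheEntry cache[CACHE_SLOTS];   /* the global data cache 34 */

static CacheEntry *slot_for(long offset)
{
    return &cache[(unsigned long)offset % CACHE_SLOTS];
}

/* Retrieval tries the cache first and falls back to the .PAG file. */
void *get_block(FILE *pag, long offset, size_t size)
{
    CacheEntry *e = slot_for(offset);
    if (e->data && e->offset == offset)
        return e->data;                  /* hit: fast physical memory */

    void *blk = malloc(size);            /* miss: go to disk */
    if (!blk)
        return NULL;
    fseek(pag, offset, SEEK_SET);
    if (fread(blk, 1, size, pag) != size) {
        free(blk);
        return NULL;
    }
    free(e->data);                       /* evict the slot's occupant */
    e->offset = offset;
    e->data   = blk;
    e->size   = size;
    return blk;
}
```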
4. As discussed, the conventional disk access module 28 retrieves only the minimum amount of data per request, i.e. the data explicitly requested, even though subordinate data may be requested within a very short time, requiring a further disk access. Therefore, the new disk access module 32 implements predictive data access in conjunction with the data caching described above. To explain this new function, the data organisation of the database will first be described in further detail.
Each database has a hierarchical data structure, of which a simple example is shown diagrammatically in Fig. 4a. At the highest level the database has data items for sales figures for Asia, Europe and America. One level below, the data item Asia has a child data item Japan, while Europe has a child data item Italy, with sibling data items France and Ireland. Finally, data item America has a child data item California, which itself has a child item Orange County with a sibling data item Oakland. Europe and America are themselves sibling data items of Asia.
This structure is organised in the database as shown in Fig. 4b. The page (.PAG) file 18 is shown on the right and the related index (.IND) file 20 on the left. The page file 18 contains respective data items 40 containing the sales figures for the various territories identified. The index file 20 contains entries IND1, IND2 ... IND10 pointing to the respective data items 40 as indicated by the arrows directed horizontally from left to right.
The last eight bytes of each data item 40 contain pointers to child and sibling data items, if any. In particular, bytes 1-4 contain the index address of any child data item, while bytes 5-8 contain the index address of any sibling data item. In any case where there is no child or sibling data item the relevant four bytes will contain blank characters. In Fig. 4b, these last 8 bytes are indicated by the two narrow strips 42, 44 at the bottom of each labelled entry 40, the upper strip 42 representing bytes 1-4 and the lower strip 44 representing bytes 5-8. Where a particular strip is shaded, it means there is no child or sibling information contained in those four bytes.
Thus, the first four bytes 42 of the data item Asia point to IND4 which in turn points to its child data item Japan, and the last four bytes 44 of the data item Asia point to IND2 which in turn points to its sibling data item Europe. Similarly, the first four bytes 42 of the data item Europe point to IND5 which in turn points to its child data item Italy, and the last four bytes 44 of the data item Europe point to IND3 which in turn points to its sibling data item America. In a similar manner, by following the non-horizontal arrows pointing generally right to left it will be seen that the entire data structure of Fig. 4a is represented.
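In C terms, the trailer carried by each data item and the index entries it refers to might be modelled as follows; the field names, the four-byte integer layout and the use of 0 in place of the "blank" four bytes are assumptions for illustration.

```c
#include <stdint.h>
#include <string.h>

#define NO_ENTRY 0u          /* stands in for the blank four bytes */

typedef struct {
    uint32_t child_index;    /* bytes 1-4: .IND entry of the child item */
    uint32_t sibling_index;  /* bytes 5-8: .IND entry of the sibling item */
} ItemTrailer;

/* One entry of the .IND file, held in physical memory: where the data
 * item lives in the .PAG file and how long it is. */
typedef struct {
    long   pag_offset;
    size_t length;
} IndexEntry;

/* Extract the trailer from a raw data item read from the .PAG file. */
ItemTrailer item_trailer(const unsigned char *item, size_t length)
{
    ItemTrailer t;
    memcpy(&t.child_index,   item + length - 8, 4);
    memcpy(&t.sibling_index, item + length - 4, 4);
    return t;
}
```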
As mentioned above, the conventional disk access module 28 merely retrieves the data item explicitly requested by the application process 26, even though in many if not most cases the user will immediately request to drill down through the data to retrieve subordinate data items. As used herein, a subordinate data item is a child of the requested data item or any other data item below the selected data item in the hierarchy which can be reached via the child. Thus, in the case of the simple data structure shown in Fig. 4a, California, Orange County and Oakland are subordinate data items to America, and Italy, France and Ireland are subordinate data items to Europe. Similarly, Orange County and Oakland are subordinate to California.
By contrast, the new disk access module 32 is designed so that when a particular data item is requested, not only is the requested data item returned to the application, but all its subordinate data items are also read out and stored in the cache 34 (or as many as can be accommodated in the available storage space in the cache). Here they are available to the application, should the latter request them, without requiring another disk access. This predictive data caching will speed up data access in situations where the user wishes to 'drill down' on a data item having previously accessed the data item alone. This function is readily implemented by any competent programmer. All that is necessary is for the data access module to read the last 8 bytes of each data item retrieved and, via the index file, access the associated child and/or sibling data item until all subordinate data items have been read to the cache (which will occur when a data item is reached without any child or sibling data), or until the cache is full.
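A minimal sketch of this traversal, building on the hypothetical helpers sketched earlier (get_block, item_trailer and a cache_full test, declared here so the fragment stands alone): from the requested item the child pointer is followed, and within that subtree every child and sibling pointer in turn, until a leaf is reached or the cache is full.

```c
#include <stdio.h>
#include <stdint.h>

/* Assumed helpers from the earlier sketches. */
typedef struct { uint32_t child_index, sibling_index; } ItemTrailer;
typedef struct { long pag_offset; size_t length; } IndexEntry;
#define NO_ENTRY 0u

extern void *get_block(FILE *pag, long offset, size_t size);
extern ItemTrailer item_trailer(const unsigned char *item, size_t length);
extern int cache_full(void);    /* assumed: true when the cache 34 is full */

/* Cache the item behind one index entry, then recurse on its pointers. */
static void prefetch(FILE *pag, const IndexEntry *index, uint32_t ind)
{
    if (ind == NO_ENTRY || cache_full())
        return;
    const IndexEntry *e = &index[ind];
    unsigned char *item = get_block(pag, e->pag_offset, e->length);
    if (!item)
        return;
    ItemTrailer t = item_trailer(item, e->length);
    prefetch(pag, index, t.child_index);     /* descend to the children */
    prefetch(pag, index, t.sibling_index);   /* then across the siblings */
}

/* Service a request: return the item itself, then warm the cache with
 * its subordinates.  Only the child pointer is followed at the top
 * level - siblings of the requested item are not subordinate to it. */
unsigned char *request_item(FILE *pag, const IndexEntry *index, uint32_t ind)
{
    const IndexEntry *e = &index[ind];
    unsigned char *item = get_block(pag, e->pag_offset, e->length);
    if (item)
        prefetch(pag, index, item_trailer(item, e->length).child_index);
    return item;
}
```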
If desired, instead of reading all the subordinate data items to the cache, it is possible to selectively read subordinate data items to the cache, for example if experience teaches that a particular subordinate data item is rarely requested following an initial request. This can be achieved by marking only those data items which are to be read to the cache.

Claims (4)

1. A data processing system including at least one application and a database having a hierarchical data structure containing data for use by the application, wherein each data item in the database includes a pointer to any child or sibling data item of that data item, and wherein the system includes means operable when a particular data item is requested by the application for reading out both the requested data item and at least one subordinate data item of the requested item, the requested data item being returned to the application and the subordinate data item being stored in a cache for subsequent use by the application if requested.
2. A data processing system as claimed in claim 1, wherein data items are stored in a first file and are accessed through an index file which contains respective pointers to the data items in the first file, and wherein each data item in the first file contains the address(es) in the index file of any child or sibling data items of that data item.
3. A data processing system as claimed in claim 1 or 2, wherein the system is a client/server system with the application running on a client and the database on the server.
4. A data processing system substantially as described with reference to the accompanying drawings.
IES960500 1996-07-08 1996-07-08 Data processing system IES960500A2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
IES960500 IES960500A2 (en) 1996-07-08 1996-07-08 Data processing system
ES9701496A ES2147689B1 (en) 1996-07-08 1997-07-04 DATA PROCESSING SYSTEM.

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
IES960500 IES960500A2 (en) 1996-07-08 1996-07-08 Data processing system

Publications (2)

Publication Number Publication Date
IES70671B2 true IES70671B2 (en) 1996-12-11
IES960500A2 IES960500A2 (en) 1996-12-11

Family

ID=11041210

Family Applications (1)

Application Number Title Priority Date Filing Date
IES960500 IES960500A2 (en) 1996-07-08 1996-07-08 Data processing system

Country Status (2)

Country Link
ES (1) ES2147689B1 (en)
IE (1) IES960500A2 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5115504A (en) * 1988-11-01 1992-05-19 Lotus Development Corporation Information management system
US5276874A (en) * 1989-08-11 1994-01-04 Digital Equipment Corporation Method for creating a directory tree in main memory using an index file in secondary memory
US5404505A (en) * 1991-11-01 1995-04-04 Finisar Corporation System for scheduling transmission of indexed and requested database tiers on demand at varying repetition rates

Also Published As

Publication number Publication date
IES960500A2 (en) 1996-12-11
ES2147689B1 (en) 2001-04-01
ES2147689A1 (en) 2000-09-16

Similar Documents

Publication Publication Date Title
US5745888A (en) Advanced file server apparatus and method
CA2219037C (en) Interface layer for navigation system
JP2708331B2 (en) File device and data file access method
US6401097B1 (en) System and method for integrated document management and related transmission and access
US7333993B2 (en) Adaptive file readahead technique for multiple read streams
AU739236B2 (en) File system interface to a database
US9817604B2 (en) System and method for storing data and accessing stored data
US7333988B2 (en) Method for constructing and caching a chain of file identifiers and enabling inheritance of resource properties in file systems
US7467282B2 (en) Migrating a traditional volume to a virtual volume in a storage system
US5774715A (en) File system level compression using holes
US5574903A (en) Method and apparatus for handling request regarding information stored in a file system
US7860907B2 (en) Data processing
US6658436B2 (en) Logical view and access to data managed by a modular data and storage management system
US20050097142A1 (en) Method and apparatus for increasing efficiency of data storage in a file system
US7305537B1 (en) Method and system for I/O scheduler activations
US20080016107A1 (en) Data processing
US7003646B2 (en) Efficiency in a memory management system
US8176087B2 (en) Data processing
US8977657B2 (en) Finding lost objects in a file system having a namespace
US20240184745A1 (en) File-level snapshot access service
US7167867B1 (en) Self-describing file system
IES70671B2 (en) Data processing system
US20040103392A1 (en) Saving and retrieving archive data
US20060161536A1 (en) Directory with an associated query

Legal Events

Date Code Title Description
MK9A Patent expired