GB2470943A - Converting a large data file to a spreadsheet format to allow processing using remote procedure calls

Info

Publication number: GB2470943A
Application number: GB0910038A
Authority: GB
Inventors: Joseph Kilbride
Original assignee: T B I REFUNDS IPR Ltd
Current assignee: T B I REFUNDS IPR Ltd
Priority date: 2009-06-11
Filing date: 2009-06-11
Publication date: 2010-12-15
Also published as: GB0910038D0

Abstract

Processing a large data file containing a number of data sets, each of which has a plurality of data items, between a server and a remote client. The large data file is manipulated at the client by formatting the data file (5) into a spreadsheet format and arranging the data sets (7) into a spreadsheet. The spreadsheet is loaded into a calculation component (9), which scans each data set (11) and transmits a plurality of data items to the server, e.g. by performing an embedded SQL remote procedure call. At the server, each data set is processed in a data set equation using at least some of the data items and coefficients (15) retrieved from a data table (13) in a database. The data set equation is processed (19) and the data set result is transmitted back to the calculation component on the client, which dynamically updates the spreadsheet. The client then combines the data set results (23) into a data file result, which is exported.

Description

"A method of processing a large data file"

Introduction

This invention relates to a method of processing a large data file. More specifically, this invention relates to a method of processing a large data file between a server and a client, the large data file containing a large number of data sets each having a plurality of data items and requiring a number of distinct processing steps to be carried out on the data sets.

There is a constant desire to improve performance and reduce the processing times taken to process large data files. Numerous techniques have been proposed to reduce the processing times. One commonly used technique is to provide a faster processor that has the ability to perform a larger number of operations per time period. Another commonly used technique involves providing additional processors and spreading the processing load across a number of processors. Although effective in improving performance and reducing processing times, both of the above methods require additional capital expenditure which is undesirable.

It is an object of the present invention to provide a method of processing a large data file that overcomes at least some of the problems with the known methods.

Statements of Invention

According to the invention there is provided a computer implemented method of processing a large data file between a server and a remote client, the large data file containing a plurality of data sets, each data set comprising a plurality of data items, and in which each data set requires a calculation to be performed using the plurality of data items in the data set to produce a data set result, and in which a further calculation is performed using a plurality of the data set results to produce a data file result, the method comprising the steps of:

(a) the client structuring the plurality of data sets in the large data file in a spreadsheet format, the spreadsheet format comprising a plurality of rows and a plurality of columns, each data set populating a row and a plurality of columns of the spreadsheet format;

(b) the client loading the data file in spreadsheet format into a calculation component;

(c) for each data set, the calculation component scanning the data set and transmitting a plurality of data items to the server;

(d) for each data set, the server querying a data table stored in a database for appropriate coefficients to use in a data set equation based on one or more of the received data items from the data set;

(e) for each data set, the server retrieving those coefficients from the data table, inserting those coefficients and one or more data items of the data set into the data set equation;

(f) the server calculating a data set result at a database level by processing the data set equation;

(g) the server transmitting the data set result to the calculation component;

(h) the calculation component updating the large data file dynamically in the spreadsheet format by inserting the data set result in a column of the spreadsheet format; and

(i) the client combining the data set results into a data file result and exporting the data file result.

By having such a method, it will be possible to significantly reduce the processing time required to process the large data file without employing additional or alternative expensive resources. Firstly, on the client side, the data is structured into a spreadsheet format in which it may be more easily referenced and manipulated. Data received in a number of different formats may be quickly transformed into a single uniform format prior to processing of the large data file. This is due in part to the simplicity of manipulating the data once it is in the spreadsheet format, which speeds up and simplifies the handling of the data. Secondly, the calculations on the data items are performed at a database level by the server processing the data set equation, which improves the processing speed of the method. Finally, the large data file is updated dynamically in the spreadsheet format on the client side, which allows the data set results to be processed on the client side and again improves the method of processing the large data file.

In one embodiment of the invention there is provided a computer implemented method in which the step of transmitting a plurality of data items to the server comprises performing an embedded SQL call. By implementing the method in this manner, the code being executed is not stored on the server database side and the data never has to reside on the server yet all calculations may be carried out on the server side.

In one embodiment of the invention there is provided a computer implemented method in which the step of transmitting a plurality of data items to the server comprises performing a remote procedure call. This is seen as useful as the function may run on the server side as opposed to the client side and the minimum amount of data is transferred between the client and the server side.

In another embodiment of the invention there is provided a computer implemented method in which the step of structuring the data sets in the large data file in a spreadsheet format further comprises rearranging the columns of the spreadsheet into a pre-selected spreadsheet format.

In a further embodiment of the invention there is provided a computer implemented method in which the step of structuring the data sets in the large data file in a spreadsheet format comprises structuring the data sets in an Excel® spreadsheet format.

In another embodiment of the invention there is provided a computer implemented method in which the method comprises the initial step of converting the large data file to a spreadsheet format by changing the file extension of the large data file from a .txt extension to a .xls extension.

In one embodiment of the invention there is provided a computer implemented method in which the method comprises the initial step of converting the large data file to a spreadsheet format by changing the file extension of the large data file from a .csv extension to a .xls extension.
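By way of illustration only, the extension change described in the two preceding embodiments might be sketched as follows in Python. The function name `to_xls_extension` is hypothetical and does not appear in the specification. The sketch only renames the file; the contents remain delimited text, and whether a particular spreadsheet component then parses them correctly is outside its scope.

```python
from pathlib import Path


def to_xls_extension(path: Path) -> Path:
    """Rename a .txt or .csv file to .xls without touching its contents.

    Hypothetical helper: the specification describes the extension change
    but does not give an implementation.
    """
    if path.suffix.lower() not in {".txt", ".csv"}:
        raise ValueError(f"unexpected extension: {path.suffix!r}")
    target = path.with_suffix(".xls")
    path.rename(target)  # moves the file; the bytes on disk are unchanged
    return target
```

A call such as `to_xls_extension(Path("claims.csv"))` would leave a file named `claims.xls` in the same directory.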

In a further embodiment of the invention there is provided a computer implemented method in which the data file result is saved in a spreadsheet file format.

In another embodiment of the invention there is provided a computer implemented method in which the data file result is saved in an Excel® file format.

Detailed Description of the Invention

The invention will now be more clearly understood from the following description of some embodiments thereof given by way of example only with reference to the accompanying drawing, in which:

Fig. 1 is a flow diagram of the method according to the present invention.

Referring to the drawing, the method, indicated generally by the reference numeral 1, comprises the initial step 3 of providing a large data file for processing. The large data file comprises a plurality of data sets, each of which in turn comprises a plurality of data items. Typically, the large data file will comprise of the order of 100 to 100,000 data sets and each data set will comprise of the order of 4 to 100 data items. Once provided, the large data file is formatted in step 5. The formatting step comprises converting the large data file from a .txt format or other format into a spreadsheet format, in this case a Microsoft® Excel® .xls format.

Once formatted, the data sets of the large data file are structured in step 7 which comprises placing the data sets into a spreadsheet format having a plurality of rows and columns. Each data set occupies a row and a plurality of columns of the spreadsheet format and the data items populate a plurality of the columns. The spreadsheet format will depend on the number of data sets and the number of data items contained in the data set having the most data items. For example, if there are 1,000 data sets and the data set with the largest number of data items has 15 data items, the spreadsheet will have at least 1,000 rows and 15 columns. Other additional rows and columns may be provided for headings, numerical identifiers and the like. What is important is that all of the data sets are placed in an ordered manner in the spreadsheet and this will facilitate the manipulation of the data sets. The large data file in the spreadsheet format is then loaded into a calculation component in step 9.
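By way of illustration only, the structuring of step 7 might be sketched as follows in Python. The function name `structure_data_sets`, the `id` column, and the `item_n` headings are assumptions introduced for the example; the specification only requires that each data set occupies a row and that the grid is as wide as the data set with the most data items.

```python
def structure_data_sets(data_sets):
    """Place each data set on its own row, padded to the widest data set.

    Hypothetical sketch: a heading row and a numerical identifier column
    are added, as the description allows but does not mandate.
    """
    width = max((len(ds) for ds in data_sets), default=0)
    header = ["id"] + [f"item_{n}" for n in range(1, width + 1)]
    rows = [header]
    for row_id, data_set in enumerate(data_sets, start=1):
        padding = [None] * (width - len(data_set))  # empty cells for short data sets
        rows.append([row_id] + list(data_set) + padding)
    return rows
```

With 1,000 data sets of at most 15 data items, the returned grid would have 1,001 rows (including the heading row) and 16 columns (including the identifier column).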

In step 11, the data sets are each scanned in order to ascertain the parameters, in this case coefficients, of the calculation that is to be carried out on at least some of the data items from that data set and a plurality of the data items are transmitted to the server.

The information regarding the correct coefficients to use is ascertained from one or more of the data items. In step 13, a data table on the server containing a plurality of coefficients is queried and in step 15 the appropriate coefficients are retrieved from the data table for use in a data set equation. In step 17, the data set equation is populated with the coefficients and one or more of the data items in the data set and in step 19, the data set equation is executed thereby producing a data set result. The result of the data set equation is populated dynamically into the large data file in the calculation component by adding the result to the other data in spreadsheet format in the calculation component. Effectively, another column containing the result is added to the large data file in the calculation component. The data set equation is executed at a database level and this contributes significantly to speeding up the processing of the large data file.
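By way of illustration only, steps 13 to 19 might be sketched as follows. The sketch uses Python's built-in `sqlite3` module as a stand-in for the server database, and the table name `coefficients`, its columns `category`, `a` and `b`, and the linear equation `a * x + b` are all hypothetical; the specification does not disclose the form of the data set equation. The point shown is that the coefficient lookup and the arithmetic happen in a single SQL statement issued from the client, so the result is computed at the database level.

```python
import sqlite3


def make_server():
    """Create an in-memory database holding a hypothetical coefficient table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE coefficients (category TEXT PRIMARY KEY, a REAL, b REAL)"
    )
    conn.executemany(
        "INSERT INTO coefficients VALUES (?, ?, ?)",
        [("front", 2.0, 1.0), ("rear", 0.5, 0.0)],  # made-up values
    )
    return conn


def data_set_result(conn, category, x):
    """Select coefficients by one data item and evaluate the equation in SQL."""
    row = conn.execute(
        "SELECT a * ? + b FROM coefficients WHERE category = ?",
        (x, category),  # data items are bound as parameters of the embedded SQL
    ).fetchone()
    if row is None:
        raise LookupError(f"no coefficients for category {category!r}")
    return row[0]
```

Only the data items needed for the lookup and the equation are passed to the database, and only the single result value is returned.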

In step 21, a check is made to see if all of the data sets have been processed. If all of the data sets have been processed, the method proceeds to step 23. If all the data sets have not been processed, the steps 11 to 21 are repeated for the remaining data sets in the large data file until all data sets in the large data file have been processed. In step 23, the data set results of all the data set equations are combined into a data file result and the data file result is thereafter exported in step 25.
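By way of illustration only, the loop of steps 11 to 25 might be sketched as follows in Python. The names `process_file` and `export_csv` are hypothetical, `compute_result` is a caller-supplied stand-in for the round trip to the server, and combining the data set results by summation and exporting as CSV text are assumptions; the specification leaves both the combining calculation and the export format open.

```python
import csv
import io


def process_file(rows, compute_result):
    """Append a result column to every row, then combine the results.

    rows: list of lists, one data set per row (modified in place).
    compute_result: callable standing in for the server-side calculation.
    """
    for row in rows:                          # steps 11 to 21: one pass per data set
        row.append(compute_result(row))       # result inserted as a new column
    return sum(row[-1] for row in rows)       # step 23 (summation is an assumption)


def export_csv(rows, total):
    """Step 25: serialise the updated rows and the data file result."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerows(rows)
    writer.writerow(["total", total])
    return buffer.getvalue()
```

Because the rows are updated in place, the spreadsheet held by the calculation component always reflects the data sets processed so far.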

Typically, the large data file will be transmitted over a communications network, preferably through the internet, for processing. The large data file will be transmitted by a first party, the client, for partial processing by the second party, the server. The large data file content may be transferred through a web site or other dedicated portal. Once the second party has calculated the data file result, the second party will typically export the data file result to the first party over the communications network or alternatively they will make the result available to the first party by providing a link to the location of the data file result.

In one embodiment of the present invention, the large data file may already be provided in a spreadsheet format, in which case the structuring step may comprise, either alone or in combination, the step of changing the provided spreadsheet format to an Excel® spreadsheet format and the step of reconfiguring the large data file into a uniform spreadsheet format suitable for processing.

It will be readily understood that the present invention could be applied in a wide range of activities where it is necessary to process large data files containing predominantly numeric data in an efficient manner. For example, one could envisage that the present invention could be used in an aerodynamics environment where a plurality of sensors is arranged about a body being tested. Each of the sensors periodically gathers measurements such as wind speed, direction, pressure and the like and the measurements from all of the sensors over a period of time are stored as data items in a large data file. In order to evaluate the aerodynamic performance of the body, the data from each of the sensors has to be individually analysed before the overall aerodynamic performance of the body can be determined. Depending on the location of the sensor about the body (a data item could identify the location), specific coefficients may be used in a data set equation to determine the impact of the measurements at that location on the aerodynamic profile of the body. The correct coefficients for the equation could then be obtained from the data table and those coefficients used along with the measurements from the sensor in the data set equation to determine drag or other property at that point on the body. All of the results could then be combined together to provide an overall result specifying the aerodynamic profile of the body.

Alternatively, it could be seen how the invention could be used in other areas where numerous calculations on numeric data must be carried out such as in an instrument to determine refunds to individuals where the individuals have been abroad and are entitled to a refund of certain taxes that have been paid while abroad. Depending on where the taxes were paid (a data item could identify the location) and when the taxes were paid (another data item could identify the date of payment), the individual may be entitled to a refund of different levels of tax and appropriate coefficients could be retrieved from a data table for insertion into a data set equation. Furthermore, different coefficients may apply to different types of goods or services as different tax rates would apply (again, a data item could be indicative of the good or service). The appropriate refund of tax for each purchase could be ascertained by inserting the data items and appropriate coefficients in a data set equation to obtain a data set result before an overall refund is calculated by combining the data set results.
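By way of illustration only, the refund example might be sketched as follows, again using Python's built-in `sqlite3` module as a stand-in for the server database. The table `refund_rates`, its columns, the rule that the refund equals the tax paid multiplied by a rate, and every rate shown are invented for the example and are not real tax figures. The sketch shows three data items (location, type of goods or services, date of payment) jointly selecting the coefficient.

```python
import sqlite3


def build_rate_table():
    """Create a hypothetical table of refund rates with validity periods."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE refund_rates ("
        "country TEXT, goods_type TEXT, valid_from TEXT, valid_to TEXT, rate REAL)"
    )
    conn.executemany(
        "INSERT INTO refund_rates VALUES (?, ?, ?, ?, ?)",
        [  # made-up rates, for illustration only
            ("FR", "goods", "2008-01-01", "2008-12-31", 0.10),
            ("FR", "goods", "2009-01-01", "2009-12-31", 0.20),
            ("DE", "services", "2009-01-01", "2009-12-31", 0.25),
        ],
    )
    return conn


def refund_for(conn, country, goods_type, paid_on, tax_paid):
    """Data set result: refund for one purchase, computed inside the database."""
    row = conn.execute(
        "SELECT ? * rate FROM refund_rates "
        "WHERE country = ? AND goods_type = ? "
        "AND ? BETWEEN valid_from AND valid_to",  # ISO dates compare as text
        (tax_paid, country, goods_type, paid_on),
    ).fetchone()
    return 0.0 if row is None else row[0]  # assumption: no matching rate, no refund


def total_refund(conn, purchases):
    """Data file result: overall refund combined on the client side."""
    return sum(refund_for(conn, *purchase) for purchase in purchases)
```

Each purchase is passed as a tuple `(country, goods_type, paid_on, tax_paid)`, for example `("FR", "goods", "2009-03-15", 50.0)`.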

Again, the above implementations are only representative of two areas that would particularly benefit from a method according to the invention and many other fields of endeavour would also benefit. The invention has been implemented using PostgreSQL® v8.3.0 for the database on the server side and the Visual Basic® 6 programming environment on the client side, incorporating DynamiCube® version 3.0 and Farpoint Spread® version 3.5 components. The step of transmitting a plurality of data items to the server comprises performing an embedded SQL call. In this way, no software code has to reside on the database server side. In addition to the embedded SQL call, a remote procedure call may be used to transfer one or more data items to the server to reference data stored in the PostgreSQL database. It can be further appreciated that according to the present invention, although the data never resides on the server, a significant portion of the calculations may be carried out on the server side.

Using this implementation, a speed-up of the order of a factor of two has been achieved in processing a large data file, thereby halving the processing time.

What is important is that the invention is particularly useful where there are a large number of calculations to be carried out on a large amount of data and the calculations may be carried out in an efficient manner with the minimum of resources available. By arranging the data sets in a spreadsheet format and then carrying out the necessary calculations at a database level dynamically in the spreadsheet format, this is achieved.

In this specification the terms "comprise, comprises, comprised and comprising" and the terms "include, includes, included and including" are all deemed totally interchangeable and should be afforded the widest possible interpretation.

The invention is in no way limited to the embodiment hereinbefore described but may be varied in both construction and detail within the scope of the specification.