WO2007116210A1

WO2007116210A1 - Data compression

Info

Publication number: WO2007116210A1
Application number: PCT/GB2007/001293
Authority: WO
Inventors: Anthony Charles Lovick
Original assignee: Norwich Union Insurance Limited
Priority date: 2006-04-07
Filing date: 2007-04-03
Publication date: 2007-10-18
Also published as: GB2436880A; GB2436880B; GB0607050D0

Abstract

Methods for compression, decompression and augmentation of a sequence of data records are provided. In particular, a method for electronically compressing and storing a sequence of data records is described, in which each record comprises data fields comprising a given time in the sequence and the position and/or motion of an object at a given time. The method comprises the steps of assigning the records to separate data blocks, each block comprising a single primary record or a group comprising a primary record followed by at least one subsidiary record, with at least one of the blocks being of the type comprising a primary record followed by at least one subsidiary record; and calculating the differences in value between fields of each subsidiary record and values associated with the corresponding fields of the preceding record, and storing relative values related to the calculated differences in a respective compressed subsidiary record; and calculating the differences in value between fields of each primary record after the first and values associated with the corresponding fields of the preceding primary record, and storing relative values related to the calculated differences in a respective compressed primary record, wherein the compressed primary and subsidiary records are stored in electronic memory as a compressed version of the original sequence of data records.

Description

Title: Data Compression

Field of the Invention

The present invention relates to data compression, and in particular, compression of data recording the motion of an object over time.

Background to the Invention

Increasingly, valuable applications are being realised for data recording the motion of physical objects over time. For instance, the data may be used in the calculation of insurance premiums for individual vehicles based on the actual journeys undertaken by the vehicle concerned.

Position data may currently be generated by a device linked to the GPS system, for example, or in future by a device linked to the forthcoming European Galileo positioning system. Large volumes of data may be generated in this way. For example, a GPS device may generate position information once per second. If data is collated from several devices over a long period of time, the resulting volume of data generated can be vast. The costs associated with storage and/or transmission of such large volumes of data will be high.

To address this, some positioning devices may simply discard a large proportion of the data, and only store readings periodically. Other devices carry out preliminary processing of the readings to produce aggregated readings, each of which replaces a block of original readings. However, some of the detail present in the original readings will inevitably be lost with either of these approaches.

There are various existing techniques for data compression. These fall into two broad groups, namely "lossless" and "lossy" compression, depending on whether the data retains the full original detail during compression and then decompression, or becomes approximate in some way.

Typically, lossy methods are applied to audio and visual media, for example in the JPEG, MPEG and MP3 formats.

Lossless methods are more appropriate for numeric data files and tables. One such technique is known as "Lempel-Ziv" compression, which underlies zip format files. This approach involves scanning the file to be compressed for repeating patterns and then compiling a library of patterns and tokens. A significant amount of processing is required to implement this technique. This is a significant drawback where the aim is to achieve compression in a relatively simple and inexpensive mobile device. Similarly, decompression of the compressed data may also be costly in terms of processing demands.

Summary of the Invention

The present invention provides a method for electronically compressing and storing a sequence of data records, each record comprising data fields representing a given time in the sequence and the position, motion, or position and motion of an object at said given time, the method comprising the steps of:

(a) assigning the records to separate data blocks, each block comprising a single primary record or a group comprising a primary record followed by at least one subsidiary record, with at least one of the blocks being of the type comprising a primary record followed by at least one subsidiary record; and

(b) calculating the differences in value between fields of each subsidiary record and values associated with the corresponding fields of the preceding record, and storing relative values related to the calculated differences in a respective compressed subsidiary record; and

(c) calculating the differences in value between fields of each primary record after the first and values associated with the corresponding fields of the preceding primary record, and storing relative values related to the calculated differences in a respective compressed primary record, wherein the compressed primary and subsidiary records are stored in electronic memory as a compressed version of the original sequence of data records.

Data representing the motion of an object such as a vehicle is likely to have data fields which are unchanged over several records, or move gradually and continuously from one record to the next. This introduces the possibility of reducing the field size by recording the difference between the field of a record and a value associated with the corresponding field in the preceding record, in a process sometimes referred to as "deltaing". When handling motion data, the vast majority of these differences are likely to be small compared to the absolute value of the parameter concerned and so storing the difference (or relative value in other words) to a given degree of accuracy can be achieved using a smaller field than storage of the absolute value to the same degree of accuracy. In this way the original data can be compressed with an insignificant loss of detail in the data.
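By way of illustration only, the deltaing process may be sketched as follows. The function names and the sample speed readings are assumptions made for this example and do not appear in the source data.

```python
def delta_encode(values):
    """Keep the first value as an absolute value, then store successive differences."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Rebuild absolute values by accumulating the stored differences."""
    values = []
    total = 0
    for i, d in enumerate(deltas):
        total = d if i == 0 else total + d
        values.append(total)
    return values


# Hypothetical speed readings in units of 0.1 km/h.
speeds = [523, 525, 525, 524, 520]
encoded = delta_encode(speeds)
```

In this sketch every difference after the first value has a magnitude of 4 or less, so each could be held in a far smaller field than the absolute readings.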

This difference may be calculated by subtracting from an absolute value associated with the current field an absolute value associated with the preceding field. In a preferred embodiment of the present invention, the difference is calculated by subtracting from an absolute value associated with the current field an absolute value associated with the preceding field which is derived using stored relative values of preceding fields. This is preferable as, when the compressed records are decompressed, it is the stored relative values that will be used in the decompression calculations to regain absolute values. The accuracy of the values generated by the decompression is therefore increased if the differencing calculations of the compression process mirror the decompression calculations and utilise stored relative values.
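The benefit of differencing against values derived from the stored relative values can be illustrated with a small sketch. It assumes, purely for the example, that differences are rounded to a multiple of a fixed step before storage; the step size, function names and sample values are hypothetical.

```python
def quantise(diff, step):
    """Round diff to the nearest multiple of step, returned as a count of steps."""
    return (diff + step // 2) // step  # integer arithmetic only


def encode_from_originals(values, step):
    """Difference each value against the ORIGINAL preceding value."""
    return [quantise(cur - prev, step) for prev, cur in zip(values, values[1:])]


def encode_from_stored(values, step):
    """Difference each value against the preceding value as the decoder will rebuild it."""
    out = []
    rebuilt = values[0]
    for cur in values[1:]:
        q = quantise(cur - rebuilt, step)
        out.append(q)
        rebuilt += q * step  # mirror the decompression calculation
    return out


def decode(first, counts, step):
    values = [first]
    for q in counts:
        values.append(values[-1] + q * step)
    return values


values = [100, 101, 102, 103, 104, 105, 106]
naive = decode(values[0], encode_from_originals(values, 4), 4)
mirrored = decode(values[0], encode_from_stored(values, 4), 4)
```

With these sample values the first variant never registers the slow rise and ends 6 units away from the original, whereas the variant that mirrors the decompression calculation stays within half a step of every original value.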

Furthermore, according to the method of the invention, records are periodically designated as primary records, and the primary records are compressed by the deltaing process independently from the intervening subsidiary records. If it is desired to decompress the data for example to correlate it with mapping information (for example the identity of the road the vehicle position corresponds to), the present method enables the primary records to be decompressed separately without having to decompress all the data. Correlation of the primary records with map data can provide a satisfactory degree of resolution for most purposes without the processing burden of decompressing all the data, and correlating every record with map data.

In addition, the present technique facilitates further compression where each original record includes both position and motion data fields. Each primary compressed record may store both position and motion fields, whilst the compressed subsidiary rows only store information relating to position or motion. If necessary, the position data can then be derived from the motion data, or vice versa, for each of the subsidiary records in a group, with a quantifiable loss of accuracy which can be kept within acceptable limits if the groups are not unreasonably large, and the data is stored to a reasonable degree of accuracy.
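One possible way of deriving position from motion data within a group is sketched below. The source does not specify the calculation; this example assumes a flat local grid of east and north offsets in metres, headings in degrees clockwise from north, speeds in metres per second and a one second interval, all of which are illustrative assumptions.

```python
import math


def dead_reckon(start_xy, motions, dt=1.0):
    """Derive positions from (heading_degrees, speed_m_per_s) pairs.

    Positions are local east/north offsets in metres. The flat-grid
    assumption is only reasonable over the short span of one group.
    """
    x, y = start_xy
    track = []
    for heading, speed in motions:
        rad = math.radians(heading)
        x += speed * dt * math.sin(rad)  # east component
        y += speed * dt * math.cos(rad)  # north component
        track.append((x, y))
    return track


# Start from the position stored in a primary record, then apply the
# motion fields of three subsidiary records (hypothetical values).
track = dead_reckon((0.0, 0.0), [(0, 10), (0, 10), (90, 5)])
```

Because each group starts again from the position held in its primary record, any error in such a derivation is limited to the records of that group.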

Storing both position and motion fields in the primary compressed records avoids unacceptable levels of drift in absolute values derived when decompressing the compressed records. If only position (or motion) was stored in all the compressed records, the values of motion (or position) derived during decompression would include cumulative errors which increase through the entire sequence of records. With the compressed data structures created using the present method, any cumulative errors are confined to each group and can therefore only develop to a limited extent, as each primary record effectively resets this error to zero.

To reduce processing requirements, the present compression method is susceptible to implementation using only integer arithmetic, rather than floating point calculations.

The first primary record of the sequence may be stored in the compressed version of the original sequence of data records with its data fields representing absolute values of time and position and/or motion. The first primary record should preferably store absolute values for each field, with a high degree of resolution, as this defines the starting point from which the differences recorded in subsequent records are calculated. In the fields of the second primary record, the differences between the original absolute data of that record (subject to any rounding thereof, as discussed below) and the original absolute data for the corresponding fields of the first primary record are stored, and so on for subsequent primary records (except, as noted above, for these subsequent primary records it is preferable to subtract an absolute value associated with the preceding field which is derived using stored relative values of preceding fields, rather than the original value itself).

In a preferred embodiment, if a difference between a field of a current subsidiary record in a current group and the corresponding field of the preceding record calculated in step (b) above is outside a predetermined range for that field, the current subsidiary record is designated as a primary record and assigned to a new separate group together with the following subsidiary records in the current group.
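The assignment of records to groups, including the designation of an out-of-range record as a new primary record, might be sketched as follows. The record representation (a dictionary of integer fields), the function name and the limits are assumptions for illustration. In this sketch a promoted record simply starts a fresh group that may grow to the full group size, which is one possible reading of the step described above.

```python
def assign_groups(records, max_group=20, limits=None):
    """Split records (dicts of integer fields) into groups.

    The first record of each group is its primary record. A new group is
    started when the current group is full, or when any field named in
    `limits` changes by more than its permitted magnitude.
    """
    limits = limits or {}
    groups = []
    for rec in records:
        start_new = not groups or len(groups[-1]) >= max_group
        if not start_new:
            prev = groups[-1][-1]
            start_new = any(abs(rec[f] - prev[f]) > lim for f, lim in limits.items())
        if start_new:
            groups.append([rec])  # rec becomes a primary record
        else:
            groups[-1].append(rec)  # rec is a subsidiary record
    return groups


records = [{"speed": s} for s in [50, 51, 52, 90, 91]]
groups = assign_groups(records, max_group=4, limits={"speed": 7})
```

Here the jump from 52 to 90 exceeds the hypothetical limit of 7, so the fourth record is designated as the primary record of a second group.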

Preferably, each compressed group of records is allocated to a whole number of bytes of memory. Thus, the total size field of a compressed primary record belonging to a group can represent a number of bytes, rather than bits, and so occupy a smaller volume. Furthermore, the hardware used to carry out the compression may require that data is written to memory in chunks equal to a whole number of bytes. The hierarchical structure of the present technique groups together primary and associated subsidiary records, which can conveniently be written to memory together and allocated as a group to a whole number of bytes. If each record were written individually to a respective allocation of whole bytes, a much greater volume of memory would be taken up by the bits unoccupied by the records, in comparison to the number of unoccupied bits associated with each group of records using the present technique.

The present invention further provides a method of electronically decompressing and storing a sequence of data records compressed by a method described herein, comprising the steps of:

(a) calculating values for fields of the second primary record of the sequence by adding the stored relative values to the values of the corresponding fields of the first primary record, and storing absolute values related to the calculated values in a respective decompressed version of the second primary record; and

(b) for each primary record after the second primary record in sequence, calculating values for fields of the current primary record by adding the stored relative values to the stored absolute values of the corresponding fields of the preceding primary record, and storing absolute values related to the calculated values in a respective decompressed version of the current primary record, wherein the decompressed primary records are stored in electronic memory as a decompressed sequence of data records.

Thus, using a compressed file generated by the present method, the primary records can be decompressed without reference to the intervening subsidiary records. Processing may then be carried out using the decompressed primary records only. As a result of this processing, one or more fields containing additional data may be added to each decompressed primary record. The additional data may comprise geographical information for example, such as the number of the road a vehicle was on at the relevant point in time.

To this end, the present invention also provides a method of electronically decompressing, augmenting and storing a sequence of data records compressed by a method described herein, comprising the steps of:

(a) calculating values for fields of the second primary record of the sequence by adding the stored relative values to the values of the corresponding fields of the first primary record, and storing absolute values related to the calculated values in a respective decompressed version of the second primary record;

(b) for each primary record after the second primary record in sequence, calculating values for fields of the current primary record by adding the stored relative values to the stored absolute values of the corresponding fields of the preceding primary record, and storing absolute values related to the calculated values in a respective decompressed version of the current primary record;

(c) adding a further data field to each compressed primary record representing further information related to that primary record to create a corresponding augmented compressed primary record; and

(d) storing in electronic memory the augmented compressed primary records.

The primary records may therefore be recompressed and stored as an augmented compressed file, with or without the intervening compressed subsidiary records. If the augmented compressed primary records are to be stored together with the intervening compressed subsidiary records, this can be readily achieved with little additional processing burden, as the subsidiary records can merely be copied across to the new file unaltered.

Brief Description of the Drawings

Embodiments of the invention will now be described by way of example and with reference to the accompanying drawings wherein:

Figure 1 is a diagrammatic representation of the structure of data compressed according to a method embodying the invention;

Figure 2 is a graph plotting frequency against angular change for heading data from a data sample; and

Figures 3 and 4 show tables defining sets of primary and subsidiary record types, respectively, for use in a method embodying the invention.

Detailed Description

The techniques described herein impose a hierarchy on a sequence of data records. The records are divided into separate groups, with the first record of each group being designated as a primary record and the remaining records as subsidiary records. Whilst in the embodiments described herein, the hierarchy has two levels, namely primary and subsidiary, it will be appreciated that for some applications, it may be appropriate to further sub-divide the groups, to provide three or more levels in the hierarchy.

Figure 1 illustrates the structure of a compressed sequence of data records, compressed using a method embodying the present invention. The records are divided into groups, each group consisting of a primary record 10, followed by an associated series of subsidiary records 20a to 20d. A device linked to the GPS system may typically generate one record per second. When handling data from such a device mounted in a vehicle, it has been found to be preferable to have 20 records in each group, that is one primary record followed by 19 subsidiary records. The primary records therefore offer resolution of one data point every 20 seconds, which is generally adequate for determining the overall route of a vehicle for example.

The fields present in each record are shown in Figure 1. Each primary record comprises the following fields:

type: this indicates the type of primary record employed for a given record, as discussed further below;

timestamp: relates to the time at which the record was created;

latitude and longitude: indicate position;

size: indicates the total size of the subsidiary records associated with the primary record;

heading and speed: indicate motion; and

augmentation: this field is illustrated using dashed lines and represents additional data which may be added to each primary record by processing of the data, as discussed further below.

It will be appreciated that records may include further fields in addition to those listed above, or omit some of these fields, depending on the source data to be compressed and the intended use of the compressed data.

Each compressed subsidiary record usually includes just three of these fields, namely type, heading and speed. Occasionally, a "timestamp" field may be included as is the case for record 20c in Figure 1. Most subsidiary records are likely to be one sampling interval (for example, one second) ahead of the preceding record, and in those cases a separate time field is not necessary, as this interval can be taken to be the default value. However, where a record is more than one sampling interval ahead of the preceding one, this difference can be recorded in a "timestamp" field.

In the compressed data structure, the values stored in the fields timestamp, latitude, longitude, size, heading and speed in each primary record (except the first primary record) are calculated using the deltaing process referred to above. In particular, the values in these fields of the primary records represent the difference between the absolute values of the original corresponding primary record and absolute values associated with the same fields of the preceding primary record.
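The handling of the optional "timestamp" field of the subsidiary records can be illustrated by the following sketch, in which an omitted field is represented by None and the default sampling interval is assumed to be one second. The function name and the sample values are hypothetical.

```python
def rebuild_times(start, time_fields, default_interval=1):
    """Rebuild absolute times for subsidiary records.

    `start` is the absolute time of the primary record. Each entry of
    `time_fields` is None where the timestamp field was omitted, or the
    stored interval since the preceding record where it was included.
    """
    times = []
    t = start
    for field in time_fields:
        t += default_interval if field is None else field
        times.append(t)
    return times


# The third subsidiary record follows a gap of five seconds.
times = rebuild_times(1000, [None, None, 5, None])
```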

Thus, it can be seen that the values stored in the primary records are independent of those of the subsidiary records. Accordingly, the primary records can be decompressed independently of the subsidiary records to regain absolute values for each primary record.
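A minimal sketch of decompressing the primary records alone is given below. It assumes a simplified byte-aligned layout invented for this example: the first primary record holds an absolute time, an absolute latitude and a one-byte size, later primary records hold 16-bit differences and a one-byte size, and each primary record is followed by the stated number of bytes of subsidiary data. The record formats actually used are those of Figures 3 and 4, which this layout does not attempt to reproduce.

```python
import struct

FIRST = struct.Struct("<iiB")  # absolute time, absolute latitude, size in bytes
LATER = struct.Struct("<hhB")  # time difference, latitude difference, size in bytes


def decode_primaries(blob):
    """Return absolute (time, latitude) pairs for the primary records only."""
    primaries = []
    offset = 0
    fmt = FIRST
    time = lat = 0
    while offset < len(blob):
        t, la, size = fmt.unpack_from(blob, offset)
        time, lat = time + t, lat + la  # add stored relative values
        primaries.append((time, lat))
        offset += fmt.size + size  # step over the subsidiary bytes unread
        fmt = LATER
    return primaries


blob = (
    FIRST.pack(1000, 52630000, 3) + b"\xaa\xbb\xcc"
    + LATER.pack(20, 150, 2) + b"\x01\x02"
    + LATER.pack(20, -40, 0)
)
primaries = decode_primaries(blob)
```

The subsidiary bytes are never interpreted in this sketch; the size field alone is used to reach the next primary record.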

The heading and speed fields of the subsidiary records represent the differences between the absolute values of the original corresponding record and the absolute values of the same fields in the preceding original record.

The efficiency of the compression process can be increased by defining a number of different record types or formats for the primary and subsidiary records, respectively. For example, a record which is unchanged relative to the preceding record with regard to the position and motion fields, for example where an associated vehicle is stationary, or only differs by small amounts, can be adequately represented using a much smaller record than a record in which the fields differ more significantly from those of the preceding record (in the case of a primary record, the preceding primary record).

Predefined data formats may be created to optimise compression by analysing historic data samples generated by a device employed for a similar purpose. Alternatively, the data formats may be created on the basis of analysis carried out using the particular sequence of data that is to be compressed, or analysis relying partially on historic data and partially on the data to be compressed. In these cases, details of the data formats tailored to the data to be compressed in this way will need to be stored and associated with the compressed data so that they can be retrieved at the time of decompression. It will be appreciated that use of a large number of data formats will have a penalty in that the type field of each record will need to be larger to enable each format to have an individual identifier or type number, and so in defining the data formats it is appropriate to balance the desire to compress the data itself as efficiently as possible with the need to minimise the space occupied by the data format type fields.

By way of illustration, the following tables show how selection of different ways of numbering the data formats and selection of the data formats to be used can affect the total file size.

Table 1:

In this example the number system option 2 should be selected since the space saving from using a one bit type name for the first row outweighs the losses of increasing the other two to three bits.

Table 2(a):

In this example, splitting the data records between two data formats, including a new more limited but smaller one is worth the penalty of the extra record type size because the additional format accommodates more than 200 rows, hence the total storage size is reduced. A combination of the two methods above, applied to the specific data in question, can be used to optimise the design and number of data formats used.
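The kind of comparison described above can be reproduced with a short calculation. The record counts, field sizes and type number lengths used below are hypothetical values chosen for this sketch and are not the figures of Table 1 or Table 2(a).

```python
def total_bits(formats):
    """Total storage for a list of (record_count, type_bits, payload_bits)."""
    return sum(count * (type_bits + payload_bits) for count, type_bits, payload_bits in formats)


# Hypothetical sample: 900 unchanged records, 60 small changes, 40 large changes.
counts_and_payloads = [(900, 0), (60, 8), (40, 16)]

# Numbering option 1: every format has a two-bit type number.
option_1 = total_bits([(n, 2, p) for n, p in counts_and_payloads])

# Numbering option 2: one bit for the commonest format, three bits for the others.
type_bits_2 = [1, 3, 3]
option_2 = total_bits([(n, t, p) for (n, p), t in zip(counts_and_payloads, type_bits_2)])
```

With these assumed counts, the saving of one bit on each of the 900 commonest records outweighs the extra bit spent on each of the remaining 100 records.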

Analysis of historic data is illustrated by Figure 2, which is a plot based on heading data alone from a real data sample generated by a GPS device mounted on a vehicle. For each change in heading along the X axis, the frequency of occurrence of that value is plotted along the Y axis, which employs a logarithmic scale.

It can be seen that in a large majority of records, the heading is unchanged, or deviates only slightly, as might be expected when tracking a vehicle. Thus, if a data format is defined as designating no change in heading, no storage space is required for the heading field in the compressed record. Similarly, a data format may be defined with a relatively small heading field to record small changes in heading, which will encompass a large proportion of the records.

Imposing the hierarchy of records described herein means that shorter type numbers can be assigned to the subsidiary records which occur more frequently than the primary records, which therefore improves the amount of compression achievable. Without the hierarchy, all the records must be identified using a single set of mutually distinguishable type numbers, which is likely to require the use of larger type numbers, occupying more storage space.

Compression of the longitude and latitude fields may be enhanced by storing second order differences. Thus, rather than store the difference between the value of the field in the current record and its counterpart in the preceding record, the difference between the preceding two records is subtracted from this calculated difference. Thus, the stored values reflect the rate of change of the difference between consecutive fields, which will be smaller in magnitude than the differences themselves, enabling the size of the compressed fields to be reduced further.
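Second order differencing may be sketched as follows. The function names and the sample latitude values (taken to be integers in arbitrary scaled units) are assumptions for this example.

```python
def second_order_encode(values):
    """Keep the first value and first difference, then differences of differences."""
    if len(values) < 2:
        return list(values)
    out = [values[0], values[1] - values[0]]
    for a, b, c in zip(values, values[1:], values[2:]):
        out.append((c - b) - (b - a))
    return out


def second_order_decode(encoded):
    """Rebuild absolute values by accumulating the difference, then the value."""
    if len(encoded) < 2:
        return list(encoded)
    values = [encoded[0], encoded[0] + encoded[1]]
    diff = encoded[1]
    for dd in encoded[2:]:
        diff += dd
        values.append(values[-1] + diff)
    return values


latitudes = [5263000, 5263012, 5263024, 5263037, 5263049]
encoded = second_order_encode(latitudes)
```

For this sample, which corresponds to roughly steady movement, the first order differences are all about 12 whereas the second order differences are 0, 1 and -1.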

Once the values to be stored in each field of a record have been calculated, the most appropriate data format for storage of that record can then be selected from the predefined set of formats. The most appropriate format is the smallest one that can accommodate the values concerned.
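Selection of the smallest suitable format may be sketched as below. The three formats and their field ranges are hypothetical and are not those of Figures 3 and 4; the list is assumed to be ordered from smallest to largest.

```python
# (name, total size in bits, permitted heading range, permitted speed range)
FORMATS = [
    ("unchanged", 1, (0, 0), (0, 0)),
    ("small", 8, (-4, 3), (-2, 1)),
    ("large", 20, (-180, 179), (-64, 63)),
]


def pick_format(heading_delta, speed_delta):
    """Return (name, bits) of the smallest format that can hold both values."""
    for name, bits, (h_lo, h_hi), (s_lo, s_hi) in FORMATS:
        if h_lo <= heading_delta <= h_hi and s_lo <= speed_delta <= s_hi:
            return name, bits
    return None  # no subsidiary format fits; the caller may use a primary record
```

For example, `pick_format(2, -1)` selects the hypothetical 8-bit format, while a speed difference of 100 fits none of the formats and returns None.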

Using the statistical analysis discussed above, a limited number of subsidiary record types can be defined which can be used for the vast majority of records. However, from time to time, a record may be encountered which does not conform to one of the predefined types. Under these circumstances, the record concerned can instead be designated as a primary record. As there will tend to be greater variation between primary records, and the primary record formats must be able to accommodate all possible compressed record ranges, an appropriate record type will be available from the primary record types. In practice, such a situation may arise for example when a vehicle carrying a GPS device enters a tunnel, and no data is received for a minute or more, so that the next record's values differ substantially from those of the preceding one.

It will be appreciated that the size field can only be determined retrospectively after the associated subsidiary records have been compressed and the number of bytes they occupy known. To allow for this, the compressed primary record can be written to a temporary memory or storage buffer until the subsidiary records have been compressed and the size number calculated before the compressed records are written to the final compressed file.
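The buffering step can be illustrated with the following sketch, which assumes byte-aligned records and a one-byte size field purely for simplicity. The function name and the sample bytes are hypothetical.

```python
def write_group(out, primary_fields, subsidiary_records):
    """Append one group to `out` (a bytearray).

    The primary fields are held back until the subsidiary records have been
    compressed, so that the size field can be filled in before writing.
    """
    buffered = b"".join(subsidiary_records)  # compressed subsidiary records
    if len(buffered) > 255:
        raise ValueError("group too large for a one-byte size field")
    out.extend(primary_fields)
    out.append(len(buffered))  # size field, known only at this point
    out.extend(buffered)


stream = bytearray()
write_group(stream, b"\x10\x20", [b"\x01", b"\x02\x03"])
```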

In some applications, it may be desirable to compress the motion data using the mobile positioning device itself. This reduces the amount of data to be transmitted and accordingly the transmission time required to send the data to a central processing station. It also distributes the processing required to compress the data amongst the mobile devices, rather than requiring a central processor to compress a large amount of collated uncompressed data.

Further processing can then be carried out by the central processor on the collated, compressed data. In some cases, it may be desirable to augment or enhance this data, for instance before it is supplied to a third party. For example, when using motion data generated by a vehicle, there may be a requirement to identify the roads along which the vehicle travelled. As the size of a compressed data file would be significantly increased if such augmentation data was added to every record, the hierarchical structure generated by the present technique facilitates augmentation of just a proportion of the records, namely by augmenting the primary records and not the subsidiary records.

The hardware configuration of some mobile positioning devices may require the compressed records to be written to memory from a temporary storage buffer in chunks, the size of which corresponds to a multiple of a given basic unit such as a byte for example. The present hierarchical structure divides records into groups. The compression process may be arranged to "top up" each group so that its size corresponds to a multiple number of bytes, for example by adding the required number of "0" bits at the end of the group.
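A minimal sketch of the topping-up step follows. For readability the bits of a compressed group are represented as a string of "0" and "1" characters, which is an assumption of this example rather than a feature of the method.

```python
def top_up(bits):
    """Append "0" bits so that the group occupies a whole number of bytes."""
    return bits + "0" * (-len(bits) % 8)


def group_to_bytes(bits):
    """Pack a topped-up group of bits into bytes, most significant bit first."""
    padded = top_up(bits)
    if not padded:
        return b""
    return int(padded, 2).to_bytes(len(padded) // 8, "big")
```

For example, a five-bit group "10110" is topped up to "10110000" and written as the single byte 0xB0.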

When decompressing a file compressed in this way, interpretation of these additional bits as a further subsidiary record can be avoided. This is because the processor will recognise that the last byte of a group of records is being processed, as the size field of the relevant parent record indicates the size of the group. The first "0" of the added bits will be recognised as merely topping-up the current byte if the only subsidiary record type number beginning with a "0" denotes a record more than 8 bits long, which could not therefore fit within the last byte (as is the case in the subsidiary records examples described below with reference to Figure 4).

This topping-up process has a further benefit in that the size field of the primary records can be expressed as a number of whole bytes, rather than as a number of bits, making its size smaller. In addition, stepping from one primary record to the next is a simpler process as each step is a whole number of bytes.

The hierarchy imposed by the present invention allows this topping-up process to be made more efficient as, rather than top-up each record as would otherwise be required, each group instead can be topped-up before it is written to memory, significantly reducing the amount of space occupied by topping-up bits overall.
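The saving can be quantified with a short calculation. The record sizes used here, one 30-bit primary record followed by nineteen 5-bit subsidiary records, are hypothetical.

```python
def padding_bits(size_bits):
    """Number of topping-up bits needed to reach a whole number of bytes."""
    return -size_bits % 8


record_sizes = [30] + [5] * 19  # one primary and 19 subsidiary records, in bits

per_record = sum(padding_bits(s) for s in record_sizes)  # top up every record
per_group = padding_bits(sum(record_sizes))  # top up the group once
```

With these assumed sizes, topping up each record individually would add 59 bits to the group, whereas topping up the group as a whole adds 3 bits.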

Prior to compression of data, it may be desirable to carry out some rounding operations on the raw data received from a positioning device, to simplify the compression calculations whilst still retaining the desired level of resolution. To illustrate data format types which may be employed to store compressed motion data to which this pre-processing has been applied, examples of primary and subsidiary record types indicating the ranges of values accepted by each field are shown in Figures 3 and 4, respectively. The "Total Size" field indicates the size of the record concerned in bits.

The record formats illustrated in Figures 3 and 4 were selected having regard to a real set of motion data generated by a vehicle-mounted positioning device. This input data was initially pre-processed in the manner described above.

The compression techniques described herein have been employed using real motion data and successfully compressed the data by a factor of over 50.

Although the embodiments of the invention described with reference to the drawings comprise processes performed in computer apparatus, the invention also extends to computer programs, particularly computer programs on or in a carrier, adapted for putting the invention into practice. The program may be in the form of source code, object code, a code intermediate source and object code such as in partially compiled form, or in any other form suitable for use in the implementation of the processes according to the invention. The carrier may be any entity or device capable of carrying the program.

For example, the carrier may comprise a storage medium, such as a ROM, for example a CD ROM or a semiconductor ROM, or a magnetic recording medium, for example a floppy disc or hard disk. Further, the carrier may be a transmissible carrier such as an electrical or optical signal which may be conveyed via electrical or optical cable or by radio or other means.

When the program is embodied in a signal which may be conveyed directly by a cable or other device or means, the carrier may be constituted by such cable or other device or means. Alternatively, the carrier may be an integrated circuit in which the program is embedded, the integrated circuit being adapted for performing, or for use in the performance of, the relevant processes.

The data compression processes described herein may be implemented using a conventional personal computer, for example. Alternatively, they may be carried out by a vehicle's onboard computer. In other implementations, the compression may be achieved by a central server, or by a processor associated with database hardware.

Claims

1. A method for electronically compressing and storing a sequence of data records, each record comprising data fields representing a given time in the sequence and the position, motion, or position and motion of an object at said given time, the method comprising the steps of:

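(a) assigning the records to separate data blocks, each block comprising a single primary record or a group comprising a primary record followed by at least one subsidiary record, with at least one of the blocks being of the type comprising a primary record followed by at least one subsidiary record;

(b) calculating the differences in value between fields of each subsidiary record and values associated with the corresponding fields of the preceding record, and storing relative values related to the calculated differences in a respective compressed subsidiary record; and

(c) calculating the differences in value between fields of each primary record after the first and values associated with the corresponding fields of the preceding primary record, and storing relative values related to the calculated differences in a respective compressed primary record, wherein the compressed primary and subsidiary records are stored in electronic memory as a compressed version of the original sequence of data records.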

2. A method of claim 1 wherein the compressed primary record in each group includes a data field dependent on the total size of the compressed subsidiary records in its group.

3. A method of claim 1 or claim 2 wherein the first primary record of the sequence is stored in the compressed version of the original sequence of data records with its data fields representing absolute values of time and position, time and motion, or time, position and motion.

4. A method of any preceding claim wherein, if a difference between a field of a current subsidiary record in a current group and the corresponding field of the preceding record calculated in step (b) above is outside a predetermined range for that field, the current subsidiary record is designated as a primary record and assigned to a new separate data block together with the following subsidiary records in the current group.

5. A method of any preceding claim, wherein each compressed group of records is allocated to a whole number of bytes of memory.

6. A method of electronically decompressing and storing a sequence of data records compressed by a method of any preceding claim, comprising the steps of:

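(a) calculating values for fields of the second primary record of the sequence by adding the stored relative values to the values of the corresponding fields of the first primary record, and storing absolute values related to the calculated values in a respective decompressed version of the second primary record; and

(b) for each primary record after the second primary record in sequence, calculating values for fields of the current primary record by adding the stored relative values to the stored absolute values of the corresponding fields of the preceding primary record, and storing absolute values related to the calculated values in a respective decompressed version of the current primary record, wherein the decompressed primary records are stored in electronic memory as a decompressed sequence of data records.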

7. A method of claim 6, including a further step of:

(c) calculating values for fields of each subsidiary record in sequence, by adding the stored relative values to the stored absolute values of the corresponding fields of the preceding record, and storing absolute values related to the calculated values in a respective decompressed version of each subsidiary record.

8. A method of electronically decompressing, augmenting and storing a sequence of data records compressed by a method of any of claims 1 to 5, comprising the steps of:

(a) calculating values for fields of the second primary record of the sequence by adding the stored relative values to the values of the corresponding fields of the first primary record, and storing absolute values related to the calculated values in a respective decompressed version of the second primary record;

(b) for each primary record after the second primary record in sequence, calculating values for fields of the current primary record by adding the stored relative values to the stored absolute values of the corresponding fields of the preceding primary record, and storing absolute values related to the calculated values in a respective decompressed version of the current primary record;

(c) adding a further data field to each compressed primary record representing further information related to that primary record to create a corresponding augmented compressed primary record; and

(d) storing in electronic memory the augmented compressed primary records.

9. A computer program comprising program instructions for causing a computer to perform a method of any preceding claim.

10. A computer program comprising program instructions for causing a computer to perform a method of any of claims 1 to 8 embodied on a record medium, stored in a computer memory, embodied in a read-only memory, or carried on an electrical carrier signal, or other carrier.

11. A computer programmed to perform a method of any of claims 1 to 8.