GB2586226A - Processing and storage of location data

Info

Publication number: GB2586226A
Application number: GB1911299.4A
Authority: GB
Inventors: Farthing Duncan; Wigley Andrew; Haynes Ian
Original assignee: Theoblong Global Ltd
Current assignee: Theoblong Global Ltd
Priority date: 2019-08-07
Filing date: 2019-08-07
Publication date: 2021-02-17
Also published as: AU2020202075A1; GB201911299D0

Abstract

A method of processing location data and then receiving and responding to a query. A dataset (fig. 1, 10) is received, comprising entries each with data (fig. 1, 14) and a location (fig. 1, 16). A bounding shape (fig. 2, 20) is defined that contains all the locations of the dataset and is then subdivided into level 1 shapes (fig. 2, 22); the number of entries in each level 1 shape, and a data summary of the data of each entry in that shape, are calculated. This is then repeated a predetermined number of times, dividing at least one level n shape into level n+1 shapes. Indexes are stored for each level, each index (fig. 3, 24) storing the number of entries and the data summary for shapes at the respective level. When a query defining a user area is received, for each level, the shapes that are contained wholly within the user defined area are determined and added to a query response. For the level defining the smallest shape, the shapes of that level that are partially contained within the user defined area are determined, and the entries contained within the user defined area are determined and added to the query response, which is then outputted (fig. 5, 30).

Description

PROCESSING AND STORAGE OF LOCATION DATA

This invention relates to a method of and system for processing and storing location data and for receiving a query in respect of the location data and responding to the query.

Location data is widely used in many different applications. For example, when making a search on the Internet for a hotel in a foreign city, it is common for an option to be available that allows the hotels being returned by the search request to be displayed on a map that shows the location of each hotel. This allows the user to visualise the search result and is useful for many different reasons. Location data is a broad term that can mean very specifically the location of something, as defined by GPS co-ordinates or a postcode, for example, or can mean data (such as a hotel name) plus the specific location. Any data that includes within it, or consists of, some sort of location information or definition can be considered as location data.

In more complex situations where large amounts of location data are present, the ability to query the location data and return a result that is useful to the end user is a non-trivial task. For example, in the hotel case referred to above, if the hotels are stored with their location as a GPS co-ordinate and the query comprises "show hotels in the county of Yorkshire", the actual process of determining which of the hotels are within the boundary of Yorkshire is a non-trivial task, since there has to be some intermediate processing to map GPS co-ordinates to the boundary of the county of Yorkshire. Normally this means that the location data has to be stored in a database that has specific location processing capabilities, or that the program through which the user is making the search query has to be able to perform a significant amount of specialist tasks in relation to the search query (such as reformatting the query so that it is understood by the database).

This means that if an enterprise has a large amount of location data, they need access either to a specialist location database or a specialist interface to the location data. Both of these things can be expensive in monetary and processing/storage/bandwidth requirements. This creates a significant barrier to a useful access to location data. For example, if a company has over a period of time collected the data of customers with their location, for example in a spreadsheet, there is no way to query this spreadsheet without some specialist software that can translate a query into an operation on the spreadsheet and can present the result to the end user in a meaningful fashion. The only other way of making the data available would be to purchase a specialist database with a location function and turn the data within the spreadsheet into database data.

It is therefore an object of the invention to improve upon the known art.

According to a first aspect of the present invention, there is provided a method of processing and storing location data and of receiving a query in respect of the location data and responding to the query, the method comprising receiving a dataset comprising a plurality of entries, each entry comprising data and a location, defining a bounding shape that contains all the locations of the received dataset, subdividing the bounding shape into a set of level 1 shapes, calculating the number of entries in each level 1 shape, calculating a data summary from the data of each entry in each level 1 shape, repeating the subdividing and calculating for a predetermined number of times, wherein the subdividing comprises dividing at least one level n shape into a set of level n+1 shapes, storing one or more indexes for each level, each index storing the number of entries and the data summary for one or more shapes at the respective level, receiving a query for the received dataset, the query comprising a user defined area, for each level, determining the shapes of that level that are wholly contained within the user defined area and adding each such determined shape to a query response, for the level defining the smallest shape size, determining the shapes of that level that are partially contained within the user defined area, determining the entries contained within the user defined area and adding each such determined entry to the query response, and outputting the query response.

According to a second aspect of the present invention, there is provided a system for processing and storing location data and of receiving a query in respect of the location data and responding to the query, the system comprising a storage device and a processor connected to the storage device, the processor arranged to receive a dataset comprising a plurality of entries, each entry comprising data and a location, define a bounding shape that contains all the locations of the received dataset, subdivide the bounding shape into a set of level 1 shapes, calculate the number of entries in each level 1 shape, calculate a data summary from the data of each entry in each level 1 shape, repeat the subdividing and calculating for a predetermined number of times, wherein the subdividing comprises dividing at least one level n shape into a set of level n+1 shapes, store one or more indexes for each level in the storage device, each index storing the number of entries and the data summary for one or more shapes at the respective level, receive a query for the received dataset, the query comprising a user defined area, for each level, determine the shapes of that level that are wholly contained within the user defined area and add each such determined shape to a query response, for the level defining the smallest shape size, determine the shapes of that level that are partially contained within the user defined area, determine the entries contained within the user defined area and add each such determined entry to the query response, and output the query response.

Owing to the invention, it is possible to provide a method and system that can be used to pre-process location data in a fast and efficient manner that will create a set of indexes that can be used in answering subsequent queries made to the location data, without the requirement of either using a specialist location suitable database or a complex front-end software product. The underlying location data does not need to be changed in any way, since the indexes that are generated as a result of the pre-processing form the basis of the processing of any subsequent queries of the location data. This is a significant improvement over the prior art systems, since no change to the raw data is required and no specialist software is required to make a query over the location data.

Preferably, the bounding shape comprises an x by y rectangle and each level n shape comprises a scaled x by y rectangle. The simplest way in which the indexes can be constructed is to use rectangles as the shapes that are used to define the area covering all of the locations and the sub-divided areas. Each smaller rectangle is an exact scaled down version of the rectangle used for the bounding shape. For example, the bounding shape may be a 3x2 rectangle and when moving down the levels, each shape is subdivided into four scaled down 3x2 rectangles. This provides the simplest and most efficient way of dividing up the physical area represented by the spread of locations contained within the location data. Other shapes such as triangles or hexagons can also be used to define the bounding shape and subshapes.
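The four-way split of a rectangle into scaled copies of itself can be sketched as follows. This is a minimal illustration only: the tuple layout (min_x, min_y, max_x, max_y) and the function name subdivide are assumptions made for the example and do not appear in the specification.

```python
from typing import List, Tuple

# Assumed layout for illustration: (min_x, min_y, max_x, max_y).
Rect = Tuple[float, float, float, float]


def subdivide(rect: Rect) -> List[Rect]:
    """Split a rectangle into four equal rectangles of the same aspect ratio."""
    min_x, min_y, max_x, max_y = rect
    mid_x = (min_x + max_x) / 2
    mid_y = (min_y + max_y) / 2
    return [
        (min_x, mid_y, mid_x, max_y),  # top-left quarter
        (mid_x, mid_y, max_x, max_y),  # top-right quarter
        (min_x, min_y, mid_x, mid_y),  # bottom-left quarter
        (mid_x, min_y, max_x, mid_y),  # bottom-right quarter
    ]


bounding = (0.0, 0.0, 3.0, 2.0)   # the 3x2 bounding shape from the example
level_1 = subdivide(bounding)     # four 1.5x1 rectangles, still in a 3:2 ratio
```

Applying subdivide to each level 1 rectangle in turn would give the sixteen level 2 rectangles, and so on down the levels.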

Ideally, the step of calculating a data summary from the data of each entry in each level 1 shape comprises calculating the average of values contained in the data in the entries and/or calculating the range of values contained in the data in the entries. The data summary for each shape can be constructed by calculating summaries of the individual values present in the data that forms part of each entry. This has the advantage that this supports the presentation and/or processing of relevant information at the level of each individual shape. The bounding shape is broken up into smaller shapes for a number of subdivisions, where for each shape at each level a data summary is available.
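By way of illustration only, a data summary of this kind might be computed as below. The function name summarise and the dictionary keys are hypothetical, and a single numeric value per entry is assumed for simplicity.

```python
from typing import Dict, List, Union

Number = Union[int, float]


def summarise(values: List[float]) -> Dict[str, Number]:
    """Count, sum, average and range of the values falling in one shape."""
    if not values:
        # An empty shape has a count but no meaningful average or range.
        return {"count": 0}
    return {
        "count": len(values),
        "sum": sum(values),
        "average": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical values for three entries in one shape.
summary = summarise([10.0, 20.0, 60.0])
```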

Advantageously, the step of storing one or more indexes for each level, each index storing the number of entries and the data summary for one or more shapes at the respective level comprises storing a separate index for each shape at each level. In the preferred embodiment of the system and method, the pre-processing part of the methodology is configured so that there is an individual index for each shape at each level. This provides the most logically consistent structure of the indexes, with an individual index for every shape at every level. This means that when a query is received and worked through, for every shape that is wholly contained within the location area in the query, there is an individual index present for each shape. This simplifies the processing and the generation of the query response.

Preferably, the method further comprises, with each index, storing each entry contained within the respective index. In this embodiment, the relevant entries of each index are also stored with the index, either directly in the index file or more likely in a separate file that shares a common naming or location with the corresponding index file. Although this will increase the amount of storage required to store the multiple copies of the entries from the original dataset, in general the benefits of storing the entries in this way vastly outweigh the costs of doing so. There are many cloud based third party storage solutions that have a very low cost of data storage and the benefit of being able to immediately access and use the entries for each index means that the original dataset does not have to be mined to find the required entries every time a query is received and a response generated.

Alternatively, only a single copy of the entries from the dataset is stored with the indexes. One way in which this can be implemented is to break up the entries into a series of files that match the different shapes at the lowest level. For example, if the bounding shape is divided into four level 1 rectangles and the four level 1 rectangles are each divided into four level 2 rectangles, then there are sixteen level 2 rectangles: 1-1, 1-2, 1-3, 1-4, 2-1, 2-2, 2-3, 2-4, 3-1, 3-2, 3-3, 3-4, 4-1, 4-2, 4-3 and 4-4. The original dataset is then split into sixteen individual files, with the entries in each file matching the entries that are in the sixteen level 2 rectangles. This reduces the amount of data that has to be stored to a single copy of the dataset but also makes it very easy to grab the necessary contents of the dataset when the user makes a query. If the user draws a shape that includes rectangles 1, 2-1 and 2-3 then the six files (1-1, 1-2, 1-3, 1-4, 2-1 and 2-3) can be used to generate the response to the query without any need to perform any complex tasks to determine which entries fall within the scope of the received query.
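The selection of lowest-level files in the example above can be sketched as follows. The identifier scheme ("1" for a level 1 rectangle, "2-1" for a level 2 rectangle) follows the example in the text; the function name and the fixed fan-out of four are assumptions made for illustration.

```python
from typing import List


def files_for_selection(selected: List[str], lowest_level: int,
                        fanout: int = 4) -> List[str]:
    """Expand each selected shape id to the lowest-level file ids it covers."""
    files: List[str] = []
    for shape_id in selected:
        ids = [shape_id]
        level = shape_id.count("-") + 1  # "1" is level 1, "2-1" is level 2
        while level < lowest_level:
            # Each shape covers `fanout` child shapes at the next level down.
            ids = [f"{i}-{k}" for i in ids for k in range(1, fanout + 1)]
            level += 1
        files.extend(ids)
    return files


# Rectangles 1, 2-1 and 2-3, with level 2 as the lowest level.
needed = files_for_selection(["1", "2-1", "2-3"], lowest_level=2)
```

Here needed holds the six file identifiers named in the example, obtained purely from the identifiers without inspecting any entry.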

Advantageously, the repeating of the subdividing and calculating for a predetermined number of times, wherein the subdividing comprises dividing at least one level n shape into a set of level n+1 shapes, is continually performed until the number of entries in each shape of the current level is below a pre-set number. As described above, the method of pre-processing the dataset splits the locations into a series of level 1 shapes and then each level 1 shape is split into a series of level 2 shapes and so on. So there may be four level 1 shapes and then sixteen level 2 shapes etc. However this process has to be terminated at some point. One way in which this can be done is to set a simple numerical limit, such as proceed to level 5 and then stop. However one advantageous way in which the sub-division can be terminated is to keep a count of the number of entries in each shape at the current level. Once this count falls below a set level (such as 2000) for all shapes at the current level then the sub-division can be terminated. The advantage of this methodology is that the sub-division will terminate once each shape at the lowest level of resolution (the highest numbered level) contains a manageable number of entries, with regard to the presentation (for example display) of the entries.

Equally the subdivision will terminate before the summary data becomes meaningless through the number of entries in the shape becoming too small for useful calculations such as the average of values contained within the data of the different entries.

The process of subdividing each shape at each level can also be terminated at different levels. For example the bounding shape may be split into four level 1 rectangles and the four level 1 rectangles are each divided into four level 2 rectangles, then there are sixteen level 2 rectangles. However, at the next level of subdivision it is not essential that each of the sixteen level 2 rectangles is then subdivided. For example if a specific level 2 rectangle is actually empty, then clearly there is no point in further subdividing that shape.

The process may apply a threshold to each individual shape at the current lowest level, before deciding whether further subdivision of individual shapes is required. If all shapes are subdivided by 4 each time, then the number of shapes at each level is 1-4-16-64 and so on. However if only some of the shapes at each level are subdivided, then the number of shapes at a specific level will be less than the upper limits of 1-4-16-64. There will be fewer shapes and fewer indexes, which is a more efficient subdivision of the location data.
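A minimal sketch of threshold-based subdivision is given below, assuming rectangular shapes split into four, point locations as (x, y) pairs and a hypothetical threshold and level limit. It only counts the shapes created at each level, to show how selective subdivision yields fewer shapes than the 4-16-64 upper limits.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # assumed: (min_x, min_y, max_x, max_y)


def split_points(rect: Rect, points: List[Point]) -> List[Tuple[Rect, List[Point]]]:
    """Split a rectangle into four and place every point in exactly one quarter."""
    min_x, min_y, max_x, max_y = rect
    mid_x, mid_y = (min_x + max_x) / 2, (min_y + max_y) / 2
    quads: List[Tuple[Rect, List[Point]]] = [
        ((min_x, min_y, mid_x, mid_y), []),
        ((mid_x, min_y, max_x, mid_y), []),
        ((min_x, mid_y, mid_x, max_y), []),
        ((mid_x, mid_y, max_x, max_y), []),
    ]
    for x, y in points:
        quads[int(x >= mid_x) + 2 * int(y >= mid_y)][1].append((x, y))
    return quads


def shapes_per_level(bounding: Rect, points: List[Point],
                     threshold: int, max_level: int) -> List[int]:
    """Number of shapes created at each level when only busy shapes are split."""
    counts: List[int] = []
    current = [(bounding, points)]
    for _ in range(max_level):
        nxt = []
        for rect, pts in current:
            if len(pts) >= threshold:  # shapes below the threshold are left alone
                nxt.extend(split_points(rect, pts))
        if not nxt:
            break  # every shape is below the threshold, so stop subdividing
        counts.append(len(nxt))
        current = nxt
    return counts


# Hypothetical data: five clustered points and one isolated point.
pts = [(1, 1), (1, 3), (3, 1), (3, 3), (0.5, 0.5), (7, 7)]
sparse = shapes_per_level((0, 0, 8, 8), pts, threshold=2, max_level=3)
```

With this data only one shape per level holds enough points to be split again, so sparse lists four shapes at each of the three levels rather than 4, 16 and 64.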

Embodiments of the present invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
Figure 1 is a schematic diagram of a dataset of entries,
Figure 2 is a schematic diagram of a map showing the entries,
Figure 3 is a schematic diagram of creation of an index,
Figure 4 is a schematic diagram of an index hierarchy,
Figure 5 is a schematic diagram showing index access,
Figure 6 is a schematic diagram of a map showing location data, and
Figure 7 is a schematic diagram of a system for processing a dataset.

Figure 1 shows a dataset 10, which comprises a plurality of entries 12, each entry comprising data 14 and a location 16. The data 14 could comprise a large number of different fields, with numbers and/or text being contained within the different fields, but here is shown as a simple single number, for ease of understanding. The location 16 can be expressed in any number of different ways, for example using a postcode or GPS co-ordinates or the like, which are sufficient to provide a location 16 for a respective entry 12. Here the dataset 10 is presented as a table of entries 12, with each line in the table representing a different entry 12.

The table 10 could comprise a list of farms in the United Kingdom, for example. Each line 12 in the table 10 would comprise an entry for a specific farm, with the data 14 indicating the size of the farm in hectares. As will be appreciated, a large number of other fields could be present in the data 14, such as number of employees, amount of livestock, type of farming, etc. The location 16 is here expressed as a GPS co-ordinate. The dataset 10 has a plurality of entries 12, each entry 12 being for a farm and each entry 12 providing the size of the farm and the physical location of the farm (which may be expressed as the physical centre of the land making up the farm).

The dataset 10 is pre-processed and stored with a plurality of indexes that are generated from the dataset 10. Here a very simple example is being walked through, to illustrate the concept of the pre-processing and storing of the dataset. In reality, a complete table (for UK farms for example) will have many thousands of different entries 12 and the data 14 that forms part of each entry 12 will have a large number of different fields, depending upon the context and use of the table 10. The pre-processing and storage of the table 10 (and the indexes generated from the table 10) is here explained using a small and simple table 10.

Figure 2 shows a schematic view of the entries 12 from the table 10 of Figure 1 displayed on a map 18. The entries 12 are displayed on the map 18 according to their respective locations 16 contained within each entry. The visualisation of the entries 12 on the map 18 is purely to illustrate the nature of the pre-processing that is performed on the table 10. In fact there is no requirement for any display of the entries from the table 10; all of the pre-processing can be carried out directly on the content of the entries without the necessity of producing a map 18 as shown in Figure 2.

In the upper part of Figure 2, included on the map 18 is a bounding shape 20. Here the bounding shape 20 is a rectangle that is sized so that the shape 20 just contains all of the locations 16 of the entries 12, without containing any excess areas. Effectively, the extent of the locations 16 of the entries 12 is calculated by determining the co-ordinates of a rectangular bounding box sized to just enclose the data. The bounding shape 20 is then divided into a grid of equal sized rectangles 22 (lower part of Figure 2) such that the full extent of the bounding shape 20 is covered, with each rectangle 22 being defined by the co-ordinates of at least two opposing corners.
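The bounding box calculation and its division into an equal grid can be sketched as follows, assuming locations are plain (x, y) pairs and each rectangle is held as two opposing corners in the form (min_x, min_y, max_x, max_y). The function names and sample co-ordinates are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # two opposing corners, flattened


def bounding_box(locations: List[Point]) -> Rect:
    """Smallest axis-aligned rectangle that just encloses every location."""
    xs = [x for x, _ in locations]
    ys = [y for _, y in locations]
    return (min(xs), min(ys), max(xs), max(ys))


def grid(rect: Rect, cols: int, rows: int) -> List[Rect]:
    """Divide a rectangle into cols x rows equal cells covering it fully."""
    min_x, min_y, max_x, max_y = rect
    w = (max_x - min_x) / cols
    h = (max_y - min_y) / rows
    return [
        (min_x + c * w, min_y + r * h, min_x + (c + 1) * w, min_y + (r + 1) * h)
        for r in range(rows)
        for c in range(cols)
    ]


# Hypothetical locations, chosen so the arithmetic is easy to follow.
locations = [(0.0, 0.0), (4.0, 2.0), (1.0, 1.5)]
box = bounding_box(locations)
cells = grid(box, cols=2, rows=2)
```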

For each rectangle 22 relevant attributes for the data 14 contained within the entries 12 falling within the respective rectangle 22 are calculated. This can be the number of entries 12 within the rectangle and a data summary such as sum/average/range of the data 14, and this information forms a level 1 grid (the highest level of data). Each rectangle 22 is then subdivided and the process is repeated for a number of times, with subdivision taking place thereby dividing each level n shape into a set of level n+1 shapes. This subdivision process can be repeated a set number of times or can be terminated when some parameter has been met.
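As an illustration of how a level 1 grid might be populated, the sketch below assigns each entry to a grid cell by index arithmetic and then computes the count and summary per cell. The entry layout (value, x, y), the cell keys (column, row) and the sample farm sizes are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Entry = Tuple[float, float, float]        # assumed: (value, x, y)
Rect = Tuple[float, float, float, float]  # assumed: (min_x, min_y, max_x, max_y)


def level_1_grid(entries: List[Entry], rect: Rect,
                 cols: int, rows: int) -> Dict[Tuple[int, int], dict]:
    """Count and summarise the entries falling in each non-empty grid cell."""
    min_x, min_y, max_x, max_y = rect
    w = (max_x - min_x) / cols  # assumes a box with non-zero width and height
    h = (max_y - min_y) / rows
    values = defaultdict(list)
    for value, x, y in entries:
        # Clamp so points on the far edges fall in the last column or row.
        col = min(int((x - min_x) / w), cols - 1)
        row = min(int((y - min_y) / h), rows - 1)
        values[(col, row)].append(value)
    return {
        cell: {
            "count": len(v),
            "sum": sum(v),
            "average": sum(v) / len(v),
            "range": (min(v), max(v)),
        }
        for cell, v in values.items()
    }


# Hypothetical farm sizes in hectares with (x, y) locations.
farms = [(120.0, 0.5, 0.5), (80.0, 1.0, 0.2), (200.0, 3.5, 1.8), (50.0, 4.0, 2.0)]
grid_summary = level_1_grid(farms, (0.0, 0.0, 4.0, 2.0), cols=2, rows=2)
```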

Figure 3 shows schematically the creation of a spatial index 24. This example assumes that all grids comprise four rectangles. Each index 24 stores the number of entries and the data summary for at least one shape at the respective level. Input data is analysed to detect its spatial extent (geographic parameter), size (number of rows) and spatial distribution (geographic spread). This analysis provides the required parameters to break up the input data and create the multi-level individual indexes 24. Each index 24 contains information in relation to the shapes at the level immediately below, so the first index 24 for the bounding shape 20 contains details of the level 1 shapes.

Figure 4 illustrates some of the indexes 24 that will be created in a process that moves to level 3 as the lowest extent of the spatial subdivision. The indexes 24 form an index hierarchy, but at each level only one branch from the hierarchy is illustrated in order to simplify the indexes 24 being shown in the Figure. At the highest level is a summary index 24a, which includes the co-ordinates of the four first level 1 shapes (which are taken from the highest level shape, which is the bounding shape 20) and also a data summary for each of the four level 1 shapes. No data in the form of the actual entries 12 is stored with this index 24a.

At the next level, there are four level 1 indexes 24b, which are numbered according to their level in a unique fashion, so that each index 24 and the corresponding shape can always be identified. Each level 1-1, 1-2 up to 1-n index 24b contains the co-ordinates of the four first level 2 shapes (for each level 1 shape) and also a data summary for each of these four level 2 shapes. The actual entries 12 from the dataset 10 are stored with each of the indexes 24b.

At the next level, there are sixteen level 2 indexes 24c, although only four are shown in the Figure, being the subdivisions of the index 1-2. Each of these indexes 24c contains the co-ordinates of the four first level 3 shapes (for each level 2 shape) and also a data summary for each of these four level 3 shapes. Again the respective actual entries 12 from the dataset 10 are stored with the index files 24c, which means that at each index level in the hierarchy (apart from the top summary level) there is an entire copy of the dataset 10 stored repeatedly. Although this multiplies the amount of data to be stored by the number of levels, this ensures that for an index for a shape at any level, the entries for that shape are immediately available and do not have to be searched for within the dataset 10. At the lowest level (here level 3) the raw entries 12 from the dataset 10 are stored (in 64 files).
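The arithmetic of this hierarchy can be checked with a short sketch. It assumes every shape is split into four at every level, as in the example, and treats the summary index 24a as level 0; the function name is hypothetical.

```python
from typing import Dict


def index_counts(lowest_level: int, fanout: int = 4) -> Dict[int, int]:
    """Number of shapes (and so indexes or files) at each level, summary as level 0."""
    return {level: fanout ** level for level in range(lowest_level + 1)}


counts = index_counts(3)

# Entries are stored at every level except the summary level, so the dataset
# is held once per level from level 1 down to the lowest level.
copies_of_dataset = len(counts) - 1
```

For a lowest level of 3 this gives one summary index, four level 1 indexes, sixteen level 2 indexes and 64 lowest-level files, with the dataset held three times over.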

An end user is able to perform a spatial index access. The purpose of pre-calculating the indexes 24 is to reduce the cost of computation. When a user requests information about an area within the dataset 10, the following steps are performed. Firstly, for step 1, using the co-ordinates of the level 1 grid, determine which rectangles are fully enclosed by or fully excluded from the area selected by the user. Secondly, for step 2, for each level 1 rectangle that is partially contained within the area selected by the user, repeat step 1 using the level 2 grid. Thirdly, for step 3, repeat step 2 using successively lower grid levels until all grid levels are exhausted.

The next stage (step 4) is, at the lowest grid level, for any grids still not fully enclosed or excluded from the user defined area, calculate which points are included or excluded within the user specified area using standard calculation methods. For those points included in the user area, calculate any relevant attributes for the data contained within it for example, the number of points, the co-ordinates of at least the two opposing corners of the rectangle that encloses the data within the rectangle and the sum/average or other value represented by the data points contained within the rectangle.

The final step is step 5 which is to use the results from step 4 and add the results for all fully enclosed shapes from previous steps. This provides the number of points in each of the rectangles within the fully and partially enclosed shapes, the calculated values for that data and the locations within each rectangle that represents the location of those data points. In this way, a received query which comprises a user defined area is used to generate a query response, using the indexes to determine which data entries are to be contained within the query response.
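Steps 1 to 5 can be sketched as a recursive descent over the pre-computed hierarchy. This is a minimal illustration under stated assumptions, not the implementation of the specification: the user defined area is taken to be an axis-aligned rectangle, each shape is split into four, entries are (value, x, y) tuples, and the Node class, function names and sample data are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # assumed: (min_x, min_y, max_x, max_y)
Entry = Tuple[float, float, float]        # assumed: (value, x, y)


@dataclass
class Node:
    rect: Rect
    entries: List[Entry]
    count: int                      # pre-computed number of entries
    total: float                    # pre-computed sum of the entry values
    children: List["Node"] = field(default_factory=list)


def build(rect: Rect, entries: List[Entry], levels: int) -> Node:
    """Pre-process: summarise this shape, then split it into four."""
    node = Node(rect, entries, len(entries), sum(e[0] for e in entries))
    if levels > 0:
        min_x, min_y, max_x, max_y = rect
        mid_x, mid_y = (min_x + max_x) / 2, (min_y + max_y) / 2
        quads = [(min_x, min_y, mid_x, mid_y), (mid_x, min_y, max_x, mid_y),
                 (min_x, mid_y, mid_x, max_y), (mid_x, mid_y, max_x, max_y)]
        buckets: List[List[Entry]] = [[] for _ in quads]
        for e in entries:
            buckets[int(e[1] >= mid_x) + 2 * int(e[2] >= mid_y)].append(e)
        node.children = [build(q, b, levels - 1) for q, b in zip(quads, buckets)]
    return node


def inside(outer: Rect, inner: Rect) -> bool:
    """True when inner lies wholly within outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def overlaps(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def query(node: Node, area: Rect) -> Tuple[int, float]:
    """Return (number of entries, sum of values) inside the user area."""
    if not overlaps(node.rect, area):
        return 0, 0.0                       # fully excluded shape
    if inside(area, node.rect):
        return node.count, node.total       # fully enclosed: use the summary
    if node.children:                       # partially enclosed: go down a level
        results = [query(child, area) for child in node.children]
        return sum(r[0] for r in results), sum(r[1] for r in results)
    # Lowest level and still partial: test each point individually (step 4).
    hits = [e for e in node.entries
            if area[0] <= e[1] <= area[2] and area[1] <= e[2] <= area[3]]
    return len(hits), sum(e[0] for e in hits)


# Hypothetical entries in an 8x8 bounding shape, indexed down to level 2.
data = [(10.0, 1, 1), (20.0, 3, 1), (30.0, 1, 3),
        (40.0, 3, 3), (50.0, 6, 6), (60.0, 7, 1)]
root = build((0, 0, 8, 8), data, levels=2)
result = query(root, (0, 0, 4.5, 4.5))
```

In this run one level 1 shape lies wholly inside the area, so its four entries are answered from the stored summary alone; point-by-point tests are needed only for the partially covered lowest-level shapes.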

Figure 5 shows schematically a client device 26, such as a desktop computer 26, which is running a browser that allows a user to make a query 28 over the indexed dataset 10. A response 30 is returned. The client 26 requests data by accessing the indexes 24. Computation is handled by the client browser, removing the need for server based computation. Only the requested data (the relevant lines 12 of the table 10) are returned in a highly optimised and compressed manner to reduce transit costs. The relevant indexes 24 are used, according to shapes covered by the user defined area contained within the query 28. According to the process described above, at each level, wholly contained shapes are added to the response 30 as the process works through the levels. Any shape at the lowest level that is only partially contained in the user's defined area has to be calculated in the normal manner.

Figure 6 shows a more complex example of the process described above. Here a map of part of the United Kingdom has farms located thereon as black dots. The outer border of the map defines the bounding shape 20 and there are four level 1 boxes, each of which is split into four level 2 boxes. A user defined area 32 has been received as part of a query in relation to the map shown in the Figure. A user is interested in information relating to the farms that are contained within the area 32. The user has the map displayed to them on their local device and they can then perform a user interface action to define the area 32.

As per the steps defined above, there is then performed a process at each level to determine those shapes that are wholly contained within the area 32. At level 1, the only shape fully contained within the area 32 is cell 2. The process then moves to level 2 and the only shape contained within the area 32 at this level is the cell 1-4. At the lowest level (here level 2) it is then determined which cells are partially contained within the area 32 and these are the cells 1-2, 1-3, 3-2, 4-1 and 4-2. In relation to these cells that are partially contained within the user area 32, standard extraction and/or analysis techniques are used to extract the required information. The completely contained cell data and the extracted cell data for the partially contained cells are then bundled into the query response.

Figure 7 illustrates the hardware used, at the highest level. The input data 10 (the dataset 10 of entries 12) is processed by a processor stage 34 to generate the indexes 24 which are stored with the raw data from the dataset 10 in a storage medium 36. The client device 26 can then access the indexes 24 and the raw data by way of queries that are handled on the client device 26, which runs a browser or suitable API that can read the indexes 24 and obtain the correct raw data as desired.

Claims

1. A method of processing and storing location data and of receiving a query in respect of the location data and responding to the query, the method comprising:
* receiving a dataset comprising a plurality of entries, each entry comprising data and a location,
* defining a bounding shape that contains all the locations of the received dataset,
* subdividing the bounding shape into a set of level 1 shapes,
* calculating the number of entries in each level 1 shape,
* calculating a data summary from the data of each entry in each level 1 shape,
* repeating the subdividing and calculating for a predetermined number of times, wherein the subdividing comprises dividing at least one level n shape into a set of level n+1 shapes,
* storing one or more indexes for each level, each index storing the number of entries and the data summary for one or more shapes at the respective level,
* receiving a query for the received dataset, the query comprising a user defined area,
* for each level, determining the shapes of that level that are wholly contained within the user defined area and adding each such determined shape to a query response,
* for the level defining the smallest shape size, determining the shapes of that level that are partially contained within the user defined area, determining the entries contained within the user defined area and adding each such determined entry to the query response, and
* outputting the query response.
2. A method according to claim 1, wherein the bounding shape comprises an x by y rectangle and each level n shape comprises a scaled x by y rectangle.
3. A method according to claim 1 or 2, wherein the step of defining a bounding shape that contains all the locations of the received dataset comprises defining a bounding shape which is sized to just include all locations.
4. A method according to claim 1, 2 or 3, wherein the step of subdividing the bounding shape into a set of level 1 shapes comprises subdividing the bounding shape so that the entire bounding shape comprises the set of level 1 shapes.
5. A method according to any preceding claim, wherein the step of calculating a data summary from the data of each entry in each level 1 shape comprises calculating the average of values contained in the data in the entries and/or calculating the range of values contained in the data in the entries.
6. A method according to any preceding claim, wherein the step of storing one or more indexes for each level, each index storing the number of entries and the data summary for one or more shapes at the respective level comprises storing a separate index for each shape at each level.
7. A method according to any preceding claim, and further comprising, with each index, storing each entry contained within the respective index.
8. A method according to any preceding claim, wherein the repeating of the subdividing and calculating for a predetermined number of times, wherein the subdividing comprises dividing at least one level n shape into a set of level n+1 shapes, is continually performed until the number of entries in each shape of the current level is below a pre-set number.
9. A system for processing and storing location data and of receiving a query in respect of the location data and responding to the query, the system comprising a storage device and a processor connected to the storage device, the processor arranged to:
* receive a dataset comprising a plurality of entries, each entry comprising data and a location,
* define a bounding shape that contains all the locations of the received dataset,
* subdivide the bounding shape into a set of level 1 shapes,
* calculate the number of entries in each level 1 shape,
* calculate a data summary from the data of each entry in each level 1 shape,
* repeat the subdividing and calculating for a predetermined number of times, wherein the subdividing comprises dividing at least one level n shape into a set of level n+1 shapes,
* store one or more indexes for each level in the storage device, each index storing the number of entries and the data summary for one or more shapes at the respective level,
* receive a query for the received dataset, the query comprising a user defined area,
* for each level, determine the shapes of that level that are wholly contained within the user defined area and add each such determined shape to a query response,
* for the level defining the smallest shape size, determine the shapes of that level that are partially contained within the user defined area, determine the entries contained within the user defined area and add each such determined entry to the query response, and
* output the query response.