WO2019234039A1

WO2019234039A1 - Data processing

Info

Publication number: WO2019234039A1
Application number: PCT/EP2019/064515
Authority: WO
Inventors: Luc VLAMING; David Geier; Thomas Richter; Adrien HAMELIN
Original assignee: Swarm64 AS
Priority date: 2018-06-05
Filing date: 2019-06-04
Publication date: 2019-12-12
Also published as: GB201809174D0

Abstract

A data processing apparatus, data processing method and computer program product are disclosed. The data processing method comprises: identifying a group of existing data clusters having a clustering characteristic, each existing data cluster comprising data entries of a data information system stored together in storage, each data entry having at least one field storing data values; selecting at least one existing data cluster from the group of existing data clusters as an optimisable group of data clusters; and forming a group of optimised data clusters by allocating data entries of the optimisable group of data clusters to each optimised data cluster to improve the clustering characteristic for the group of optimised data clusters compared to the group of existing data clusters. In this way, the distribution of data entries in data clusters may be changed when creating optimized data clusters from those data entries, in order to improve the clusters. This, in turn, improves the performance of the data information system.

Description

DATA PROCESSING

FIELD OF THE INVENTION

The present invention relates to a data processing apparatus, data processing method and computer program product.

BACKGROUND

Data processing apparatus are known. Data processing apparatus store data values (which may be instructions or data) in storage for processing. The data processing apparatus retains or is provided with location information which identifies the location of data stored in storage for subsequent retrieval and processing. A data processing apparatus may operate as a data information system which is provided with data which is required to be stored in storage for subsequent interrogation, such as searching in order to answer a query. Although various techniques for storing data exist, they each have their own shortcomings. Accordingly, it is desired to provide an improved data processing technique.

SUMMARY

According to a first aspect, there is provided a data processing method, comprising: identifying a group of existing data clusters having a clustering characteristic, each existing data cluster comprising data entries of a data information system stored together in storage, each data entry having at least one field storing data values;

selecting at least one existing data cluster from the group of existing data clusters as an optimisable group of data clusters; and forming a group of optimised data clusters by allocating data entries of the optimisable group of data clusters to each optimised data cluster to improve the clustering characteristic for the group of optimised data clusters compared to the group of existing data clusters.

The first aspect recognises that databases typically have two mechanisms to retrieve data that meets the condition of a database query. They can either access all of the data to answer a query, which is commonly referred to as a “table scan” and is typically slow, especially with large datasets; or they can consult an index, an indirect structure that holds information on where the data meeting the query conditions is stored, which allows the data to be retrieved selectively. A query may ask for all data matching a specific value and, provided this value is a key in the index, the index can directly provide all rows matching this query condition. These selective retrievals may create data accesses that are scattered and, due to the nature of accesses in one or more connected computer systems, may fetch a large proportion of data that is not relevant. Also, when ingesting or updating data, modifying the index may lead to scattered and inefficient data accesses. Some databases reduce the use of indexes by storing each column of a database table individually (column-store), so that each column can be evaluated without interrogating an index or loading any additional data beyond the columns required. However, these column-store databases typically encounter the scattered and inefficient data access patterns when assembling all the columns in a row to answer the query. Accordingly, the first aspect recognises that when storing data entries in a data information system, it is often convenient to cluster a number of those data entries together for subsequent storage. For example, clustering them together can help reduce any increase in overhead that may occur if storing the data entries individually. Also, clustering can improve the efficiency of accesses to storage, particularly when the cluster size is related to optimum sizes for accesses to storage.

However, the first aspect also recognises that storing data in this way can lead to clusters having undesirable characteristics. For example, identical or similar data values which may need to be searched may be distributed over a wide range of clusters, each of which may need to be interrogated in response to a search request or enquiry. This, in turn, leads to frequent data accesses needing to be made to retrieve data clusters in order to answer a query. Those accesses slow processing speed substantially, can cause bottlenecks in the infrastructure used to perform the data accesses, require higher performance resources to handle the data access requests and require a higher than desired amount of resources to perform the data processing in order to interrogate all of those data clusters in response to a search enquiry. The first aspect also recognises that if the characteristics of the clusters are controlled to suit the particular implementation of a data information system, then it is possible to optimize the processing speed or performance of the data information system.

Accordingly, a method is provided. The method may be for or be performed by a data processing apparatus. The method may comprise identifying or determining a group or set of data clusters. Each data cluster may have data entries. The data entries may be data entries of a data information system. The data entries may be stored as a block in a storage device. Each of the data entries may have one or more fields. Each of those fields may store one or more data values. The method may comprise selecting or choosing one or more of the data clusters from the group of data clusters. That one or more selected data clusters may comprise or be designated as an optimizable group of data clusters. The method may comprise forming or creating a group or set of optimized data clusters. The optimized data clusters may be formed by allocating data entries from the optimizable group of data clusters. Each data entry of the optimizable group of data clusters may be allocated or assigned to one of the optimized data clusters. In this way, the distribution of data entries in data clusters may be changed when creating optimized data clusters from those data entries, in order to improve the clusters. This, in turn, improves the processing speed and performance of the data information system.

In one embodiment, the group of data clusters may have a characteristic, feature or parameter that can be related to a metric. The data entries may be allocated to the optimized data clusters to improve the characteristic of the group of optimized data clusters when compared to the characteristic of the group of data clusters.

In one embodiment, each data cluster occupies a data range in a search space defined by values of each data entry of each field. Accordingly, each field of a data entry can define a space which may need to be searched by the data information system. The values of those fields of each data entry within the data cluster may define a data range within that search space. For example, consider a simple arrangement where a field stores a numerical value such as temperature. The field may then define a search space or search dimension of temperature. When looking at the values in the temperature field for each data entry within the data cluster it may be determined that a minimum temperature is 10 and a maximum temperature is 25. Accordingly, the data range in the search space defined by the temperature field for that data cluster would be between 10 and 25. It will be appreciated that any values of any type of field (such as text, hierarchy information, image data, etc.) can be mapped into Euclidean space and a range within that space can be established.
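The temperature example above can be sketched as follows. This is only an illustration under stated assumptions: a data cluster is modelled as a non-empty list of dictionaries, and the helper name `data_range` is hypothetical rather than taken from the description.

```python
from typing import Dict, List, Tuple

Entry = Dict[str, float]


def data_range(cluster: List[Entry], field: str) -> Tuple[float, float]:
    """Return the (minimum, maximum) value of `field` over all entries of a cluster."""
    # Assumption: the cluster is non-empty and every entry carries the field.
    values = [entry[field] for entry in cluster]
    return min(values), max(values)


# Hypothetical cluster whose temperature values span 10 to 25.
cluster = [{"temperature": 10.0}, {"temperature": 18.5}, {"temperature": 25.0}]
print(data_range(cluster, "temperature"))  # (10.0, 25.0)
```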

In one embodiment, the search space has ‘n’ dimensions, each dimension being defined by a corresponding ‘n’ field. Accordingly, the search space for the data cluster may be multi-dimensional, depending on the number of fields to be searched or indexed.

In one embodiment, each data cluster has a size which matches a bandwidth-optimised data block transfer size of the storage. Accordingly, the data clusters may be sized to match the data block transfer size of the storage device.

In one embodiment, each data cluster has a size no larger than a bandwidth-optimised data block transfer size of the storage. Accordingly, the size of each data cluster may be set to be the block transfer size or smaller. This helps to ensure that each data cluster can be transferred between the storage and data processing apparatus as efficiently as possible.
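As a rough illustration of sizing a data cluster so that it is no larger than the block transfer size, the arithmetic below assumes fixed-size data entries. The function name and the figures (a 1 MiB transfer size and 100-byte entries) are hypothetical and are not taken from the description.

```python
def max_entries_per_cluster(block_transfer_bytes: int, entry_bytes: int) -> int:
    """Largest number of fixed-size entries that fit within one block transfer."""
    if entry_bytes <= 0:
        raise ValueError("entry size must be positive")
    if block_transfer_bytes < entry_bytes:
        raise ValueError("a single entry does not fit in one block transfer")
    # Integer division leaves any remainder of the block unused.
    return block_transfer_bytes // entry_bytes


# Hypothetical numbers: 1 MiB transfer size, 100-byte entries.
print(max_entries_per_cluster(1024 * 1024, 100))  # 10485
```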

In one embodiment, each data cluster has a size larger than a bandwidth-optimised data block transfer size of said storage.

In one embodiment, each data cluster has a size a multiple of a bandwidth-optimised data block transfer size of said storage.

In one embodiment, one or more data clusters are compressed or stored in compressed form.

In one embodiment, each data cluster has associated metadata which provides at least an indication of the data range in the search space defined by values of each data entry of at least one field. Accordingly, each cluster may have metadata associated therewith. The metadata may provide or indicate the search range or search ranges in the search space which are defined by the values of the data entries of one or more fields within that data cluster.

In one embodiment, the metadata stores at least one additional parameter relating to that data cluster. Accordingly, the metadata may provide additional information relating to the data cluster which may be unrelated to its search ranges.

In one embodiment, the additional parameter comprises a number of entries in that data cluster. Accordingly, the number of data entries within a data cluster may be indicated within the metadata.
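One possible shape for such metadata is sketched below, assuming a non-empty cluster of dictionary entries. The `ClusterMetadata` class and `build_metadata` helper are hypothetical names introduced for illustration; the description does not prescribe a layout.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ClusterMetadata:
    # Per-field (minimum, maximum) data range within the cluster.
    ranges: Dict[str, Tuple[float, float]]
    # Additional parameter: number of data entries in the cluster.
    entry_count: int


def build_metadata(cluster: List[Dict[str, float]], fields: List[str]) -> ClusterMetadata:
    """Derive range and entry-count metadata for one non-empty cluster."""
    ranges = {}
    for field in fields:
        values = [entry[field] for entry in cluster]
        ranges[field] = (min(values), max(values))
    return ClusterMetadata(ranges=ranges, entry_count=len(cluster))


cluster = [
    {"temperature": 10.0, "humidity": 40.0},
    {"temperature": 25.0, "humidity": 35.0},
    {"temperature": 18.0, "humidity": 55.0},
]
meta = build_metadata(cluster, ["temperature", "humidity"])
print(meta.ranges["temperature"], meta.ranges["humidity"], meta.entry_count)
```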

In one embodiment, the optimizable group of data clusters may be determined from the metadata.

In one embodiment, each data entry is an entry in a database. Accordingly, the data entries may relate to entries in a database. The data entries may also relate to entries in data information systems, a relational database, a NoSQL database or the like.

In one embodiment, the clustering characteristic comprises a search selectivity between the existing data clusters. In one embodiment, the clustering characteristic comprises a number of existing data clusters accessed in response to search enquiries. In one embodiment, the clustering characteristic comprises a separation between existing data clusters in the search space. In one embodiment, the clustering characteristic comprises an overlap of data ranges of data values between existing data clusters in the search space. In one embodiment, the clustering characteristic comprises data ranges of data values within each existing data cluster in the search space. Accordingly, depending on implementation, various different characteristics may need to be optimized. In one embodiment, the selectivity of data clusters in response to the search may be a characteristic to be improved. In one embodiment, the number of data clusters which are accessed following a search enquiry may be a characteristic to be optimized. In one embodiment, a separation or distance between the data clusters in search space may be another characteristic to be optimized. In one embodiment, an overlap or commonality in data ranges between data clusters within the search space may be a characteristic to be optimized. In one embodiment, a data range of the data values within the data clusters may be a characteristic to optimize. By optimizing these characteristics, the performance of the data information system when performing searches can be improved.
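Two of the characteristics listed above, the overlap of data ranges between clusters and the number of clusters accessed in response to a search enquiry, can be measured as sketched below. The sketch assumes a single search dimension with closed ranges; the function names and the example ranges are hypothetical.

```python
from typing import List, Tuple

Range = Tuple[float, float]  # closed (minimum, maximum) range in one dimension


def overlaps(a: Range, b: Range) -> bool:
    """True if the two closed ranges share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlapping_pairs(ranges: List[Range]) -> int:
    """Count pairs of clusters whose data ranges overlap."""
    count = 0
    for i in range(len(ranges)):
        for j in range(i + 1, len(ranges)):
            if overlaps(ranges[i], ranges[j]):
                count += 1
    return count


def clusters_accessed(ranges: List[Range], query: Range) -> int:
    """Count clusters whose range intersects the query range."""
    return sum(1 for r in ranges if overlaps(r, query))


existing = [(0, 50), (10, 60), (40, 90)]   # hypothetical existing clusters
optimised = [(0, 30), (31, 60), (61, 90)]  # hypothetical optimised clusters
print(overlapping_pairs(existing), overlapping_pairs(optimised))
print(clusters_accessed(existing, (20, 25)), clusters_accessed(optimised, (20, 25)))
```

In this hypothetical data the optimised group has no overlapping pairs, and the same query touches one cluster instead of two.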

In one embodiment, the selecting comprises selecting the at least one existing data cluster from the group of existing data clusters as the optimisable group of data clusters based on a size of its occupied search space. Accordingly, one or more of the data clusters may be selected from the group or set of data clusters and included in the group or set of optimizable data clusters based on how much of the search space that data cluster occupies, or based on their shape or position or fill level. When data entries can be deleted from data clusters or updated, the fill level can be a characteristic that is useful to incorporate into an error metric, because it is advantageous that data clusters with very low fill levels are merged together. Selecting data clusters on that basis biases the selection towards larger data clusters which are more likely to cause a poor clustering characteristic.

In one embodiment, the selecting comprises selecting the at least one existing data cluster from the group of existing data clusters as the optimisable group of data clusters based on a number of intersections in the search space with other existing data clusters. Accordingly, selecting those data clusters which intersect or overlap with more data clusters than others biases the optimization towards those data clusters which are likely to cause a poor clustering characteristic. In one embodiment, the selecting comprises selecting the at least one existing data cluster from the group of existing data clusters as the optimisable group of data clusters based only on associated metadata. Accordingly, the optimizable group of data clusters may be determined using the stored metadata for those data clusters. This avoids the need to access the data clusters themselves, or perform any searching within the data clusters to make that selection. This significantly improves the performance of the selection.
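A metadata-only selection based on the number of intersections might look like the sketch below. It assumes the metadata holds one (minimum, maximum) pair per search dimension for each cluster; the names `Box`, `intersection_counts` and `select_optimisable`, and the tie-break by identifier, are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Box = Tuple[Tuple[float, float], ...]  # one (min, max) pair per search dimension


def boxes_intersect(a: Box, b: Box) -> bool:
    """True if the two boxes overlap in every dimension."""
    return all(lo_a <= hi_b and lo_b <= hi_a
               for (lo_a, hi_a), (lo_b, hi_b) in zip(a, b))


def intersection_counts(metadata: Dict[str, Box]) -> Dict[str, int]:
    """For each cluster, count how many other clusters it intersects."""
    ids = list(metadata)
    counts = {cid: 0 for cid in ids}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if boxes_intersect(metadata[a], metadata[b]):
                counts[a] += 1
                counts[b] += 1
    return counts


def select_optimisable(metadata: Dict[str, Box], k: int) -> List[str]:
    """Pick the k clusters with the most intersections (ties broken by id)."""
    counts = intersection_counts(metadata)
    return sorted(counts, key=lambda cid: (-counts[cid], cid))[:k]


# Hypothetical metadata for four clusters in a two-dimensional search space.
metadata = {
    "A": ((0, 10), (0, 10)),
    "B": ((5, 15), (5, 15)),
    "C": ((12, 20), (0, 6)),
    "D": ((30, 40), (30, 40)),
}
print(intersection_counts(metadata))
print(select_optimisable(metadata, 2))
```

Only the stored ranges are read here; the cluster contents are never touched.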

In one embodiment, the method comprises generating a group of ideal data clusters from the group of existing data clusters, the group of ideal data clusters being generated within the search space occupied by the group of existing data clusters and wherein the selecting comprises selecting at least one existing data cluster from the group of existing data clusters which intersects a selected ideal data cluster in the search space as the optimisable group of data clusters. Accordingly, an idealised group of data clusters, which, if it existed, would provide improvement to the clustering characteristic, may be created. At least one of the data clusters which, when the ideal data clusters are overlaid in the search space, intersects, covers, falls within or crosses the boundary of a particular or selected ideal data cluster is selected for the optimizable group of data clusters. Selecting a data cluster which deviates from the ideal ensures that a sub-optimal data cluster is selected for optimization.

In one embodiment, the generating comprises generating the group of ideal data clusters from the group of existing data clusters, the group of ideal data clusters being generated within the search space occupied by the group of existing data clusters based on an ideal clustering criteria which would improve the clustering characteristic.

In one embodiment, the ideal clustering criteria comprises an increase in a search selectivity between the existing data clusters. In one embodiment, the ideal clustering criteria comprises a decrease in a number of existing data clusters accessed in response to search enquiries. In one embodiment, the ideal clustering criteria comprises an increase in a separation between existing data clusters in the search space. In one embodiment, the ideal clustering criteria comprises a decrease in an overlap of data ranges of data values between existing data clusters in the search space. In one embodiment, the ideal clustering criteria comprises a decrease in a data range of data values within each or at least one existing data cluster in the search space. In one embodiment, the generating comprises generating the group of ideal data clusters based on an assumed distribution of data entries within the search space within each existing data cluster. Accordingly, the ideal group of clusters may be generated using a simplified assumption that the data entries are distributed within those ideal data clusters in accordance with a particular distribution. This again helps to simplify the generation of the ideal data clusters, which avoids the need to perform data accesses to retrieve the actual data clusters and minimises the processing required.

In one embodiment, the generating comprises generating the group of ideal data clusters using a partitioning algorithm, scheme or process which partitions the search space to have similar numbers of data entries in each ideal data cluster. Accordingly, a partitioning algorithm is employed to partition the search space into regions, each of which has as close as possible to identical numbers of data entries or which occupy a similar amount of space.

In one embodiment, the partitioning algorithm seeks to partition the data entries at least once in each dimension of the search space.

In one embodiment, the partitioning algorithm seeks to provide an equal number of split planes in each dimension of the search space.

In one embodiment, the partitioning algorithm splits more often in certain dimensions based on user-defined settings or automatically learned usage patterns.

In one embodiment, the partitioning algorithm comprises a KD-tree. It will be appreciated that other algorithms such as quadtrees, octrees, BSP trees and the like may be utilised.
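A KD-tree style partition that yields similar numbers of data entries per region can be sketched as a recursive median split that cycles through the dimensions. For simplicity this sketch operates on explicit points, whereas the description contemplates generating the ideal data clusters from an assumed distribution without reading the clusters; `kd_partition` and its parameters are hypothetical names.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kd_partition(points: Sequence[Point], max_entries: int, depth: int = 0) -> List[List[Point]]:
    """Recursively median-split points, cycling dimensions, until each leaf is small enough."""
    if max_entries < 1:
        raise ValueError("max_entries must be at least 1")
    if len(points) <= max_entries:
        return [list(points)]
    dim = depth % len(points[0])  # alternate the split dimension at each level
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2       # median split keeps both halves balanced
    return (kd_partition(ordered[:mid], max_entries, depth + 1)
            + kd_partition(ordered[mid:], max_entries, depth + 1))


# Hypothetical two-dimensional data: eight points on a 4 x 2 grid.
points = [(float(x), float(y)) for x in range(4) for y in range(2)]
leaves = kd_partition(points, max_entries=2)
print([len(leaf) for leaf in leaves])  # [2, 2, 2, 2]
```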

In one embodiment, the method comprises, for each ideal data cluster in the group of ideal data clusters, determining a deviation in occupied search space between that ideal data cluster and existing data clusters intersecting that ideal data cluster and wherein the selecting comprises selecting the selected ideal data cluster based on the deviation.

In one embodiment, the selecting comprises selecting the selected ideal data cluster having maximum deviation. Accordingly, that data cluster which deviates the most from the ideal may be selected. In one embodiment, the method comprises, for each existing data cluster intersecting the selected ideal data cluster, determining a deviation in occupied search space between that existing data cluster and the selected ideal data cluster and wherein the selecting comprises selecting at least one of the existing data clusters intersecting the selected ideal data cluster for inclusion in the optimisable group of data clusters based on the deviation. Accordingly, for every other data cluster which crosses, overlaps or intersects the selected ideal data cluster, a deviation may also be determined.

In one embodiment, the selecting comprises selecting that existing data cluster intersecting the selected ideal data cluster having a maximum deviation for inclusion in the optimisable group of data clusters.
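The description does not fix a formula for the deviation. The sketch below therefore assumes one possible measure, namely the volume of an existing cluster's bounding box that lies outside the selected ideal cluster, and picks the intersecting existing cluster with the largest such value. All names and the example boxes are hypothetical.

```python
from math import prod
from typing import Dict, Optional, Tuple

Box = Tuple[Tuple[float, float], ...]  # one (min, max) pair per search dimension


def volume(box: Box) -> float:
    """Product of the box extents over all dimensions."""
    return prod(hi - lo for lo, hi in box)


def intersection(a: Box, b: Box) -> Optional[Box]:
    """Return the overlapping box, or None if the boxes are disjoint."""
    result = []
    for (lo_a, hi_a), (lo_b, hi_b) in zip(a, b):
        lo, hi = max(lo_a, lo_b), min(hi_a, hi_b)
        if lo > hi:
            return None
        result.append((lo, hi))
    return tuple(result)


def deviation(existing: Box, ideal: Box) -> float:
    """Assumed measure: volume of the existing box lying outside the ideal box."""
    inter = intersection(existing, ideal)
    return volume(existing) - (volume(inter) if inter is not None else 0)


def most_deviating(existing: Dict[str, Box], ideal: Box) -> str:
    """Among clusters intersecting the ideal box, return the one deviating most."""
    candidates = [cid for cid, box in existing.items()
                  if intersection(box, ideal) is not None]
    # Raises ValueError if no existing cluster intersects the ideal cluster.
    return max(candidates, key=lambda cid: deviation(existing[cid], ideal))


ideal = ((0, 10), (0, 10))
existing = {
    "A": ((0, 10), (0, 12)),
    "B": ((5, 30), (5, 30)),
    "C": ((50, 60), (50, 60)),  # does not intersect the ideal cluster
}
print(deviation(existing["A"], ideal), deviation(existing["B"], ideal))
print(most_deviating(existing, ideal))
```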

In one embodiment, the selecting comprises selecting that existing data cluster intersecting the selected ideal data cluster which most occupies the search space.

In one embodiment, the selecting comprises selecting neighbouring existing data clusters to that existing data cluster having the maximum deviation for inclusion in the optimisable group of data clusters. Accordingly, those clusters which neighbour or are proximate to the selected data cluster may be included in the optimizable group. This helps to ensure that clusters near each other which could potentially collide during searches are optimized.

In one embodiment, the neighbouring existing data clusters include existing data clusters which most occupy the search space. Accordingly, those clusters which extend furthest within the search space or occupy the greatest area or volume within search space may be included in the optimizable group. Again, this helps to include clusters which are more likely to fall within a search.

In one embodiment, the neighbouring existing data clusters include existing data clusters which are closest in the search space to that existing data cluster having the maximum deviation. Accordingly, those data clusters which are most proximate to the selected cluster may be included in the optimizable group.

In one embodiment, the neighbouring existing data clusters overlap in the search space with that existing data cluster having the maximum deviation. Accordingly, those clusters which intersect with or share the same space as the selected cluster may be included in the optimizable group. In one embodiment, the selecting comprises selecting that existing data cluster intersecting the selected ideal data cluster which most occupies the search space.

In one embodiment, the forming comprises forming the group of optimised data clusters by allocating data entries of the optimisable group of data clusters to each optimised data cluster using a partitioning algorithm which would improve the clustering characteristic. Accordingly, the data entries within the optimizable group may be allocated to each optimized data cluster by partitioning the data clusters in search space.

In one embodiment, the partitioning algorithm partitions the search space occupied by the group of optimised data clusters to have similar numbers of data entries in each optimised data cluster. Accordingly, the partitioning may seek to balance the number of data entries in each optimized cluster so that each optimized cluster has near identical numbers of data entries.

In one embodiment, the partitioning algorithm seeks to maintain more than a 75% fill average in each optimised data cluster. Accordingly, a minimum fill average may be set for each criteria in order to balance the number of data entries in each data cluster.

That fill average may be a high fill average. In one embodiment, the partitioning algorithm seeks to partition the data entries at least once in each dimension of the search space. In one embodiment, the partitioning algorithm splits more often in certain dimensions based on user-defined settings or automatically learned usage patterns. In one embodiment, the partitioning algorithm seeks to provide an equal number of split planes in each dimension of the search space.

In one embodiment, the optimising clustering criteria seeks to form optimised data clusters which minimise a deviation with respect to the group of ideal data clusters. In one embodiment, the forming comprises allocating the data entries of the optimisable group of data clusters to each optimised data cluster subject to a maximum number of data entries being provided in each optimised data cluster.
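Forming optimised data clusters subject to a maximum number of data entries can be illustrated in one search dimension: the entries of the optimisable clusters are pooled, sorted and dealt out into balanced, contiguous runs. This is a simplified stand-in for the partitioning algorithm rather than the claimed method, and `form_optimised` is a hypothetical name.

```python
import math
from typing import List


def form_optimised(optimisable: List[List[float]], max_entries: int) -> List[List[float]]:
    """Reallocate all entries into balanced clusters of at most max_entries each."""
    if max_entries < 1:
        raise ValueError("max_entries must be at least 1")
    entries = sorted(value for cluster in optimisable for value in cluster)
    if not entries:
        return []
    n_clusters = math.ceil(len(entries) / max_entries)
    base, extra = divmod(len(entries), n_clusters)
    result, start = [], 0
    for i in range(n_clusters):
        size = base + (1 if i < extra else 0)  # spread any remainder evenly
        result.append(entries[start:start + size])
        start += size
    return result


# Hypothetical existing clusters whose ranges (1-9, 2-8, 3-7) all overlap.
existing = [[1, 9, 5], [2, 8, 4], [3, 7, 6]]
optimised = form_optimised(existing, max_entries=3)
print(optimised)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

With distinct values the resulting ranges do not overlap; duplicate values that straddle a boundary would still leave adjacent ranges touching.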

In one embodiment, the selecting comprises selecting overlapping existing data clusters which store data entries having overlapping data ranges within a plurality of fields defining a plurality of the search space dimensions as the optimisable group of data clusters and the forming comprises allocating the data entries from the optimisable group of data clusters to form the optimised data clusters having minimised overlapping data ranges in each search space dimension.

In one embodiment, the forming comprises allocating the data entries from the optimisable group of data clusters to form the optimised data clusters having optimised shape.

In one embodiment, the forming comprises allocating the data entries from the optimisable group of data clusters to form the optimised data clusters having minimised data range overlap in each search space dimension.

In one embodiment, the forming comprises allocating the data entries from the optimisable group of data clusters to form the optimised data clusters having eliminated overlapping data ranges in each search space dimension.

In one embodiment, the forming comprises allocating the data entries from the optimisable group of data clusters to form optimised data clusters having non overlapping optimised data ranges in each search space dimension.

In one embodiment, the forming comprises allocating the data entries from the optimisable group of data clusters to form optimised data clusters whose distance between the non-overlapping optimised data ranges is maximised in each search space dimension.

In one embodiment, the forming comprises partitioning the data entries from the optimisable group of data clusters using a partitioning algorithm. In one embodiment, the partitioning algorithm seeks to maintain more than a 75% fill average in each optimised data cluster.

In one embodiment, the partitioning algorithm seeks to partition the data entries from the optimisable group of data clusters at least once in each search space dimension.

In one embodiment, the partitioning algorithm seeks to provide an equal number of split planes in each search space dimension.

In one embodiment, the partitioning algorithm seeks to partition regions of less dense data value distribution into optimised data clusters having more dense data value distribution.

In one embodiment, the partitioning algorithm comprises a KD-tree algorithm. It will be appreciated that other algorithms such as quadtrees, octrees, BSP trees and the like may be utilised.

In one embodiment, the method comprises storing each optimised data cluster in the storage.

In one embodiment, the method comprises identifying a range of data values for each search space dimension within each data cluster and storing an indicator of each range of data values as the metadata for each corresponding data cluster. Hence, metadata may be stored for each data range to provide an index for each searchable field.

In one embodiment, the method comprises ordering the range in accordance with an ordering indicator for each search dimension.

In one embodiment, the range identifies at least a maximum and minimum data value for that search space dimension within that data cluster.

In one embodiment, the method comprises storing an indicator of the data values for each search space dimension. Such an indicator may be configured to exclude certain patterns such as when applying a Bloom filter. In one embodiment, the method comprises incorporating each metadata into a search tree for all data clusters. Accordingly, the metadata may be incorporated into a search tree to facilitate efficient searching of the metadata of each data cluster.
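A Bloom filter is one way such an indicator can rule out values that are definitely absent from a data cluster. The sketch below is a minimal, generic Bloom filter built on the standard `hashlib` module; the bit count, hash count and class name are arbitrary illustrative choices, not parameters from the description.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 256, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a single Python integer

    def _positions(self, value: object) -> Iterator[int]:
        # Derive several bit positions by salting the hash input with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: object) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: object) -> bool:
        # False means the value is definitely absent; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# Hypothetical indicator for the values stored in one cluster.
indicator = BloomFilter()
for city in ["Berlin", "Oslo", "Paris"]:
    indicator.add(city)
print(indicator.might_contain("Oslo"))  # True
```

A cluster whose indicator returns False for a searched value can be skipped without being read.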

In one embodiment, the search tree comprises a KD-tree. It will be appreciated that other algorithms such as quadtrees, octrees, BSP trees and the like may be utilised.

In one embodiment, the method comprises storing all or parts of the metadata in a compressed form.

In one embodiment, the method comprises storing, with the metadata, a pointer to a location of each corresponding data cluster or clusters in the storage. Accordingly, the metadata may include a pointer. It can make sense to partition data values of data entries across multiple data clusters. In that case it can make sense to store more than one pointer in the metadata. The metadata may include an indication of the location of each data cluster in the storage. The metadata may also include a size indicator in order to identify where each cluster begins.

In one embodiment, the method comprises storing with the metadata an entries counter providing an indication of how many data entries are within each data cluster.

In one embodiment, the method comprises storing with the metadata statistical information about the data entries stored within each data cluster.

In one embodiment, the method comprises selecting a field as a search space dimension based on historic search requests. Accordingly, the fields which are selected to be included in the metadata may be selected actively, based on searches that are being made.

In one embodiment, the method comprises nulling the group of existing data clusters. Accordingly, when the optimized data clusters have been stored, the existing data clusters which they replace are nulled.

In one embodiment, the method comprises iteratively repeating the identifying, selecting and forming. Accordingly, the optimization can be iteratively repeated in order to optimize the data clusters. In one embodiment, the method comprises receiving data entries to be stored in a new data cluster and buffering the data entries until a minimal data cluster size has been reached. Accordingly, individual data entries may be received and buffered until a minimal size of data cluster formed from those received data entries is achieved.

In one embodiment, the minimal data cluster size comprises the bandwidth-optimised data block transfer size of the storage device.
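The buffering of incoming data entries can be sketched as below. The sketch measures the minimal data cluster size as a number of entries rather than bytes, and uses an in-memory list in place of the storage; both are simplifying assumptions, and `ClusterBuffer` is a hypothetical name.

```python
from typing import Any, List


class ClusterBuffer:
    """Buffer incoming entries and emit a new cluster once the minimal size is reached."""

    def __init__(self, min_entries: int) -> None:
        self.min_entries = min_entries
        self.pending: List[Any] = []
        self.stored: List[List[Any]] = []  # stands in for the storage device

    def add(self, entry: Any) -> bool:
        """Buffer one entry; return True if a new cluster was written out."""
        self.pending.append(entry)
        if len(self.pending) >= self.min_entries:
            self.stored.append(self.pending)
            self.pending = []
            return True
        return False


buffer = ClusterBuffer(min_entries=3)
flushes = [buffer.add(value) for value in [7, 1, 4, 9]]
print(flushes)          # [False, False, True, False]
print(buffer.stored)    # [[7, 1, 4]]
print(buffer.pending)   # [9]
```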

In one embodiment, the method comprises deferring the iteratively repeating until the new data cluster has been stored. Accordingly, the optimizing of data clusters may be deferred or its priority reduced while data entries are pending being stored.

In one embodiment, the method comprises receiving a search request for data and interrogating the metadata to identify candidate data clusters whose range of data values encompasses the search request. Accordingly, when a search request is received then the metadata may be searched to identify potential data clusters which may store data values satisfying that search.
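Identifying candidate data clusters from the metadata can be sketched as a range-intersection test per searched field. The sketch scans the metadata linearly instead of using a search tree, and assumes every cluster records a range for every queried field; the function name and the example values are hypothetical.

```python
from typing import Dict, List, Tuple

Range = Tuple[float, float]


def candidate_clusters(metadata: Dict[str, Dict[str, Range]],
                       query: Dict[str, Range]) -> List[str]:
    """Return ids of clusters whose stored ranges intersect the query in every field."""
    hits = []
    for cluster_id, ranges in metadata.items():
        if all(ranges[field][0] <= hi and lo <= ranges[field][1]
               for field, (lo, hi) in query.items()):
            hits.append(cluster_id)
    return hits


# Hypothetical per-cluster metadata: (min, max) for each searchable field.
metadata = {
    "A": {"temperature": (10, 25), "humidity": (30, 50)},
    "B": {"temperature": (26, 40), "humidity": (30, 50)},
    "C": {"temperature": (20, 30), "humidity": (60, 80)},
}
query = {"temperature": (22, 24), "humidity": (35, 45)}
print(candidate_clusters(metadata, query))  # ['A']
```

Clusters B and C are excluded without being read: B falls outside the temperature range and C outside the humidity range.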

In one embodiment, the interrogating the metadata comprises interrogating the search tree.

In one embodiment, the method comprises returning a result of the search request based only on the metadata. Accordingly, it may be possible in some circumstances to return the result of the search based purely on the metadata. For example, the metadata may indicate that no data cluster can contain a data value matching the search criteria, in which case no access to the data clusters is required. Likewise, some searches may relate to data stored within the metadata itself, such as returning a number of entries falling within a search range or matching search criteria. In that case, the answer to the query can be returned again without needing to access the data clusters themselves. It will be appreciated that various different values can be stored in the metadata to enhance such search queries. Should the metadata indicate that matching data values may be present in one or more data clusters, then those data clusters may be interrogated.
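A count query illustrates how some results can come from the metadata alone. In the sketch below each cluster's metadata is assumed to be a (minimum, maximum, entry count) triple for one field: clusters lying wholly inside the query range contribute their stored count, disjoint clusters are skipped, and only partially covered clusters would need to be read. The names are hypothetical.

```python
from typing import Dict, List, Tuple

Meta = Tuple[float, float, int]  # (minimum, maximum, entry count) for one field


def count_in_range(metadata: Dict[str, Meta],
                   query: Tuple[float, float]) -> Tuple[int, List[str]]:
    """Return the count known from metadata and the clusters that still need reading."""
    lo, hi = query
    total = 0
    must_read = []
    for cluster_id, (c_lo, c_hi, entries) in metadata.items():
        if c_hi < lo or c_lo > hi:
            continue                      # no entry can match: skip the cluster
        if lo <= c_lo and c_hi <= hi:
            total += entries              # every entry matches: use the stored count
        else:
            must_read.append(cluster_id)  # partial overlap: cluster must be interrogated
    return total, must_read


# Hypothetical metadata for three clusters.
metadata = {"A": (0, 10, 100), "B": (11, 20, 80), "C": (21, 30, 90)}
print(count_in_range(metadata, (0, 20)))   # (180, [])
print(count_in_range(metadata, (5, 25)))   # (80, ['A', 'C'])
print(count_in_range(metadata, (40, 50)))  # (0, [])
```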

In one embodiment, the method comprises returning an approximate result of the search request based only on the statistical information stored in the metadata. In one embodiment, the method comprises interrogating the candidate data clusters to return a result of the search request.

In one embodiment, the interrogating the candidate data clusters comprises interrogating only the candidate data clusters to return the result of the search request.

In one embodiment, the method comprises performing a join operation between said group of optimised data clusters and another group of optimised data clusters.

In one embodiment, the method comprises performing a join operation between an optimised data cluster within said group of optimised data clusters and an optimised data cluster within said another group of optimised data clusters.

According to a second aspect, there is provided a data processing apparatus, comprising: identification logic operable to identify a group of existing data clusters having a clustering characteristic, each existing data cluster comprising data entries of a data information system stored together in storage, each data entry having at least one field storing data values; selection logic operable to select at least one existing data cluster from the group of existing data clusters as an optimisable group of data clusters; and formation logic operable to form a group of optimised data clusters by allocating data entries of the optimisable group of data clusters to each optimised data cluster to improve the clustering characteristic for the group of optimised data clusters compared to the group of existing data clusters.

In one embodiment, the group of data clusters may have a characteristic, feature or parameter that can be related to a metric.

In one embodiment, each data cluster occupies a data range in a search space defined by values of each data entry of each field.

In one embodiment, the search space has ‘n’ dimensions, each dimension being defined by a corresponding ‘n’ field.

In one embodiment, each data cluster has a size which matches a bandwidth-optimised data block transfer size of the storage. In one embodiment, each data cluster has a size no larger than a bandwidth-optimised data block transfer size of the storage.

In one embodiment, each data cluster has associated metadata which provides at least an indication of the data range in the search space defined by values of each data entry of at least one field.

In one embodiment, the metadata stores at least one additional parameter relating to that data cluster.

In one embodiment, the additional parameter comprises a number of entries in that data cluster.

In one embodiment, the optimizable group of data clusters is determined from the metadata.

In one embodiment, each data entry is an entry in a database.

In one embodiment, the clustering characteristic comprises a search selectivity between the existing data clusters. In one embodiment, the clustering characteristic comprises a number of existing data clusters accessed in response to search enquiries. In one embodiment, the clustering characteristic comprises a separation between existing data clusters in the search space. In one embodiment, the clustering characteristic comprises an overlap of data ranges of data values between existing data clusters in the search space. In one embodiment, the clustering characteristic comprises data ranges of data values within each existing data cluster in the search space. In one embodiment, the selection logic is operable to select the at least one existing data cluster from the group of existing data clusters as the optimisable group of data clusters based on a size of its occupied search space.

In one embodiment, the selection logic is operable to select the at least one existing data cluster from the group of existing data clusters as the optimisable group of data clusters based on a number of intersections in the search space with other existing data clusters.

In one embodiment, the selection logic is operable to select the at least one existing data cluster from the group of existing data clusters as the optimisable group of data clusters based only on associated metadata.

In one embodiment, the identification logic is operable to generate a group of ideal data clusters from the group of existing data clusters, the group of ideal data clusters being generated within the search space occupied by the group of existing data clusters and wherein the selection logic is operable to select at least one existing data cluster from the group of existing data clusters which intersects a selected ideal data cluster in the search space as the optimisable group of data clusters.

In one embodiment, the identification logic is operable to generate the group of ideal data clusters from the group of existing data clusters, the group of ideal data clusters being generated within the search space occupied by the group of existing data clusters based on an ideal clustering criteria which would improve the clustering characteristic.

In one embodiment, the ideal clustering criteria comprises an increase in a search selectivity between the existing data clusters. In one embodiment, the ideal clustering criteria comprises a decrease in a number of existing data clusters accessed in response to search enquiries. In one embodiment, the ideal clustering criteria comprises an increase in a separation between existing data clusters in the search space. In one embodiment, the ideal clustering criteria comprises a decrease in an overlap of data ranges of data values between existing data clusters in the search space. In one embodiment, the ideal clustering criteria comprises a decrease in a data range of data values within each or at least one existing data cluster in the search space. In one embodiment, the identification logic is operable to generate the group of ideal data clusters based on an assumed distribution of data entries within the search space within each existing data cluster.

In one embodiment, the identification logic is operable to generate the group of ideal data clusters using a partitioning algorithm, scheme or process which partitions the search space to have similar numbers of data entries in each ideal data cluster.

In one embodiment, the identification logic is operable, for each ideal data cluster in the group of ideal data clusters, to determine a deviation in occupied search space between that ideal data cluster and existing data clusters intersecting that ideal data cluster and the selection logic is operable to select the selected ideal data cluster based on the deviation.

In one embodiment, the selection logic is operable to select the selected ideal data cluster having maximum deviation.

In one embodiment, the identification logic is operable, for each existing data cluster intersecting the selected ideal data cluster, to determine a deviation in occupied search space between that existing data cluster and the selected ideal data cluster and the selection logic is operable to select at least one of the existing data clusters intersecting the selected ideal data cluster for inclusion in the optimisable group of data clusters based on the deviation. In one embodiment, the selection logic is operable to select that existing data cluster intersecting the selected ideal data cluster having a maximum deviation for inclusion in the optimisable group of data clusters.

In one embodiment, the selection logic is operable to select that existing data cluster intersecting the selected ideal data cluster which most occupies the search space.

In one embodiment, the selection logic is operable to select neighbouring existing data clusters to that existing data cluster having the maximum deviation for inclusion in the optimisable group of data clusters.

In one embodiment, the neighbouring existing data clusters include existing data clusters which most occupy the search space.

In one embodiment, the neighbouring existing data clusters include existing data clusters which are closest in the search space to that existing data cluster having the maximum deviation.

In one embodiment, the neighbouring existing data clusters overlap in the search space with that existing data cluster having the maximum deviation.

In one embodiment, the formation logic is operable to form the group of optimised data clusters by allocating data entries of the optimisable group of data clusters to each optimised data cluster using a partitioning algorithm which would improve the clustering characteristic.

In one embodiment, the partitioning algorithm partitions the search space occupied by the group of optimised data clusters to have similar numbers of data entries in each optimised data cluster.

In one embodiment, the partitioning algorithm seeks to maintain more than a 75% fill average in each optimised data cluster. In one embodiment, the partitioning algorithm seeks to partition the data entries at least once in each dimension of the search space.

In one embodiment, the formation logic is operable to form optimised data clusters which minimise a deviation with respect to the group of ideal data clusters.

In one embodiment, the formation logic is operable to allocate the data entries of the optimisable group of data clusters to each optimised data cluster subject to a maximum number of data entries being provided in each optimised data cluster.

In one embodiment, the selection logic is operable to select overlapping existing data clusters which store data entries having overlapping data ranges within a plurality of fields defining a plurality of the search space dimensions as the optimisable group of data clusters and the formation logic is operable to allocate the data entries from the optimisable group of data clusters to form the optimised data clusters having minimised overlapping data ranges in each search space dimension.

In one embodiment, the formation logic is operable to allocate the data entries from the optimisable group of data clusters to form the optimised data clusters having optimised shape.

In one embodiment, the formation logic is operable to allocate the data entries from the optimisable group of data clusters to form the optimised data clusters having minimised data range overlap in each search space dimension. In one embodiment, the formation logic is operable to allocate the data entries from the optimisable group of data clusters to form the optimised data clusters having eliminated overlapping data ranges in each search space dimension.

In one embodiment, the formation logic is operable to allocate the data entries from the optimisable group of data clusters to form optimised data clusters having non overlapping optimised data ranges in each search space dimension.

In one embodiment, the formation logic is operable to allocate the data entries from the optimisable group of data clusters to form optimised data clusters whose distance between the non-overlapping optimised data ranges is maximised in each search space dimension.

In one embodiment, the formation logic is operable to partition the data entries from the optimisable group of data clusters using a partitioning algorithm.

In one embodiment, the partitioning algorithm seeks to maintain more than a 75% fill average in each optimised data cluster.

In one embodiment, the apparatus comprises storing logic operable to store each optimised data cluster in the storage. In one embodiment, the apparatus comprises metadata logic operable to identify a range of data values for each search space dimension within each data cluster and to store an indicator of each range of data values as the metadata for each corresponding data cluster.

In one embodiment, the metadata logic is operable to order the range in accordance with an ordering indicator for each search dimension.

In one embodiment, the storing logic is operable to store an indicator of the data values for each search space dimension.

In one embodiment, the metadata logic is operable to incorporate each index into a search tree for all data clusters.

In one embodiment, the storing logic is operable to store all or parts of the metadata in a compressed form.

In one embodiment, the metadata logic is operable to store with the metadata a pointer to a location of each corresponding data cluster in the storage.

In one embodiment, the metadata logic is operable to store with the metadata an entries counter providing an indication of how many data entries are within each data cluster.

In one embodiment, the storing logic is operable to store with the metadata statistical information about the data entries stored within each data cluster.

In one embodiment, the metadata logic is operable to select a field as a search space dimension based on historic search requests. In one embodiment, the storing logic is operable to null the group of existing data clusters.

In one embodiment, the identification logic is operable to repeatedly identify a group of existing data clusters, the selection logic is operable to select at least one existing data cluster and the formation logic is operable to form a group of optimised data clusters iteratively.

In one embodiment, the apparatus comprises buffering logic operable to receive data entries to be stored in a new data cluster and to buffer the data entries until a minimal data cluster size has been reached.

In one embodiment, the buffering logic is operable to defer the iteratively repeating until the new data cluster has been stored.

In one embodiment, the apparatus comprises search logic operable to receive a search request for data and to interrogate the metadata to identify candidate data clusters whose range of data values encompasses the search request.

In one embodiment, the search logic is operable to interrogate the search tree.

In one embodiment, the search logic is operable to return a result of the search request based only on the metadata.

In one embodiment, the search logic is operable to return an approximate result of the search request based only on the statistical information stored in the metadata.

In one embodiment, the search logic is operable to interrogate the candidate data clusters to return a result of the search request.

In one embodiment, the search logic is operable to interrogate only the candidate data clusters to return the result of the search request. In one embodiment, the apparatus comprises joining logic operable to perform a join operation between said group of optimised data clusters and another group of optimised data clusters.

In one embodiment, the joining logic is operable to perform a join operation between an optimised data cluster within said group of optimised data clusters and an optimised data cluster within said another group of optimised data clusters.

According to a third aspect, there is provided a computer program product operable, when executed on a computer, to perform the method of the first aspect.

Further particular and preferred aspects are set out in the accompanying independent and dependent claims. Features of the dependent claims may be combined with features of the independent claims as appropriate, and in combinations other than those explicitly set out in the claims.

Where an apparatus feature is described as being operable to provide a function, it will be appreciated that this includes an apparatus feature which provides that function or which is adapted or configured to provide that function.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the present invention will now be described further, with reference to the accompanying drawings, in which:

Figure 1 illustrates a data processing apparatus according to one embodiment;

Figure 2 illustrates the main processing steps performed by the data processing apparatus when receiving data entries according to one embodiment;

Figure 3 illustrates the main processing steps performed by the data processing apparatus when optimising data clusters according to one embodiment;

Figures 4A to 4N illustrate optimising data clusters according to one embodiment;

Figure 5 illustrates the main processing steps performed by the data processing apparatus in response to a query according to one embodiment; and

Figures 6A and 6B illustrate data join operations according to one embodiment.

DESCRIPTION OF THE EMBODIMENTS

Overview

Before describing embodiments in any more detail, first an overview will be provided. Embodiments recognise that the way that data is stored in a data information system may be sub-optimal for efficiently performing operations during data processing since the data is often distributed in storage in a manner that makes an operation inefficient, which reduces the processing speed of the data processing apparatus. Accordingly, embodiments store data in data clusters stored in storage and optimise those data clusters. Each data cluster in storage typically has a maximum size which is determined by an optimal data transfer size between the storage and processing logic performing the processing. While this optimises accesses between the data processor and the storage, the content of the data clusters themselves may be unrelated, and even random. For example, consider the situation where the data information system is an inventory database. For each item there may be a number of different types of data to be stored which need to be associated with that item, such as item identifier, purchase price, purchaser identifier, firmware date, item location, etc. Hence, each data entry may be a row in the database having the fields “item identifier”, “purchase price”, “purchaser identifier”, “firmware date”, “item location”, etc.

As users add items to the inventory, a transaction may be provided to the data processing apparatus which buffers the transactions as a data cluster of entries until that data cluster matches the optimal size for transfer to the storage. As is apparent from this example, the entries are likely to be widely distributed. That is to say, for any data cluster the range of firmware dates is likely to be widely distributed, as are the identifiers. When storing each data cluster, metadata is provided which provides a search index which indicates a range of values stored in each field which may need to be searched. For example, the metadata may provide an indication of the range of firmware dates of the entries within that data cluster and/or the range of values of each identifier of the entries within that data cluster, etc.

When performing an operation on the data clusters stored in storage, it is possible to then interrogate the metadata to see which data clusters cannot include entries which meet the search criteria because their ranges fail to encompass the search criteria. For example, a search may be for items with a purchase price of more than $300 which have a firmware date of more than four years ago. Any data cluster whose metadata indicates that its entries have a purchase price of less than $300, or a firmware date which is less than four years ago, can be ignored. However, any data cluster which cannot be ignored must be retrieved to perform the operation using the data entries. As mentioned above, a characteristic of the data entries in these data clusters is that they are likely to be widely distributed and lack correlation as each transaction is likely to be reasonably random. Accordingly, many data clusters stored in this way may need to be accessed and their entries interrogated in order to perform the operation. It will be appreciated that, even then, a null answer may be returned if there are no entries matching the operation criteria.
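By way of illustration only, the exclusion of data clusters using the metadata in the example above may be sketched as follows. The class `ClusterMeta`, its field names and the sample values are assumptions made for this sketch and do not form part of the embodiments; only the principle of comparing the stored ranges against the search criteria is taken from the description.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ClusterMeta:
    # Hypothetical per-cluster metadata: min/max of each indexed field.
    cluster_id: int
    price_min: float
    price_max: float
    firmware_min: date
    firmware_max: date


def candidate_clusters(metas, min_price, firmware_before):
    """Return ids of clusters that may hold entries with a purchase price
    above min_price and a firmware date earlier than firmware_before."""
    keep = []
    for m in metas:
        if m.price_max <= min_price:
            continue  # no entry in this cluster is priced above the threshold
        if m.firmware_min >= firmware_before:
            continue  # no entry in this cluster has a firmware date old enough
        keep.append(m.cluster_id)
    return keep


metas = [
    ClusterMeta(1, 15.0, 295.0, date(2015, 3, 1), date(2016, 10, 28)),
    ClusterMeta(2, 120.0, 900.0, date(2012, 1, 5), date(2019, 2, 1)),
    ClusterMeta(3, 350.0, 700.0, date(2018, 6, 1), date(2019, 5, 1)),
]
print(candidate_clusters(metas, 300.0, date(2015, 6, 4)))
```

In this sketch the second data cluster remains a candidate even though most of its entries may fail the search criteria, which illustrates why widely distributed data clusters lead to unnecessary accesses to storage.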

Accordingly, embodiments perform optimization on the stored or existing data clusters in order to perform the operation more efficiently by avoiding or reducing the number of accesses to storage required in response to queries. This optimization can be adapted to suit the particular physical and functional constraints or characteristics of the data processing apparatus and its storage. In any event, in general terms, the optimization procedure involves identifying existing data clusters stored in the storage which exhibit characteristics which are likely to lead to poor or reduced search efficiency. For example, if more than a particular number of data clusters have overlapping data ranges, then it is likely that a search will encompass that overlapping range and each of those data clusters may need to be retrieved in order to return a result to that query. Also, it is likely that data clusters which occupy a wider range of data values are more likely to need to be accessed than those occupying a smaller range of data values. Also, when the distance between the space occupied by data clusters increases, the likelihood that additional data clusters will need to be retrieved in response to a search query decreases compared to data clusters where the space between them is smaller. It will be appreciated that, depending on the type of operations being performed, other characteristics of the data clusters may need to be adjusted to suit those types of operations.

Accordingly, for the data clusters mentioned above, search performance can be improved by decreasing the likelihood that data clusters not containing the data required for the operation are returned and by ensuring that a minimal number of data clusters that may contain the data being searched for are returned. By examining existing data clusters within the data information system and optimizing those data clusters by taking data values within those data clusters and forming new data clusters which exhibit better search characteristics, the processing performance of the data processing apparatus can be improved.

Data Processing Apparatus

Figure 1 illustrates a data processing apparatus, generally 100, according to one embodiment. The data processing apparatus 100 has one or more processor cores 120 arranged to execute a sequence of instructions that are applied to data supplied to the processor core 120 over a bus 115. Hereinafter, the term data value will be used to refer to either instructions or data. A memory 150 may be provided for storing the data values required by the processor core 120. A cache 160 may also be provided for storing data values required by the processor core 120, thus increasing the speed of processing since the number of accesses required to the memory 150 is reduced. Data values may also be received from and provided to external devices such as a storage device 110 using input/output logic 140 via the bus 115.

Cluster Storage

Figure 2 illustrates the main processing steps performed by the data processing apparatus 100 when receiving data entries, according to one embodiment. At step S10, a data entry is received. Typically, each data entry is arranged to store data values in one or more fields, which typically store different types of data values, as is common in data information systems.

At step S20, a determination is made of whether sufficient data entries have been received to perform an efficient data transfer with the storage device 110. If insufficient data entries have been received, then processing returns to step S10 where the received data entry is buffered and added to by subsequently received data entries. When it is determined at step S20 that sufficient data entries have been buffered to perform an efficient data transfer to the storage device 110, then processing proceeds to step S30.

At step S30, metadata is generated which provides a search index for one or more fields in each data entry. Metadata may also be stored for other information such as the number of entries in a data cluster, an average value of a field in the data cluster, etc.

It will be appreciated that a search index need not be provided for every field as this increases the size of the metadata and the amount of processing required to generate that metadata. Hence, metadata may instead be generated for the fields which are most commonly searched. The metadata may indicate particular values stored in fields in the data cluster. More typically, the metadata indicates, for each field which requires a search index, the range of values stored by data entries within that field within that data cluster. Such a range may be expressed as a maximum value and a minimum value of data stored in a particular field in that data cluster, as a mid-value and a distance from that mid-value, or in any other way. For example, if the “purchase price” and “firmware date” fields are to be indexed, then a particular data cluster may have metadata indicating that the firmware date of entries in that data cluster ranges from 1 Mar 15 to 28 Oct 16 and the purchase price of entries in that data cluster ranges from $15 to $295.

The data cluster is then stored in the storage device 110. Typically, a pointer is added to the metadata for the data cluster which has been stored in storage to indicate its location in that storage. The metadata is also typically stored at a location in the storage device 110, but a copy may be retained in memory in order to facilitate fast interrogation of the metadata. Preferably, once the cluster and its metadata have been stored then the data cluster can be made available to the data information system for interrogation. However, it will be appreciated that the metadata can be made available for interrogation earlier than this. Processing then returns to step S10 to await further data entries.

Accordingly, it can be seen that as data entries are received they are buffered until they achieve a size that is efficient for storing in the storage device 110, in order to prevent increased, inefficient storage accesses. When the data cluster is ready for storage then search metadata is generated which defines characteristics of the data entries within that data cluster in order to make subsequent searching more efficient. However, it will be appreciated that the data entries in each data cluster may be random, with very little correlation between those data entries, and any such correlation may only be due to particular fortunate circumstances. Accordingly, whilst this technique provides for efficient use of storage system resources, and the metadata helps to exclude data clusters which cannot satisfy a query, the number of data clusters that may need to be retrieved and interrogated to answer a query may still be higher than is necessary.
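By way of illustration only, steps S10 to S30 may be sketched as follows. The class `ClusterWriter`, the use of an in-memory list in place of the storage device 110, the representation of a data entry as a dictionary and the threshold expressed as a number of entries are all assumptions made for this sketch.

```python
class ClusterWriter:
    """Buffers data entries and stores them as data clusters with metadata."""

    def __init__(self, cluster_size, indexed_fields):
        self.cluster_size = cluster_size      # entries per efficient transfer (assumed)
        self.indexed_fields = indexed_fields  # fields given a search index
        self.buffer = []
        self.storage = []    # stands in for the storage device 110
        self.metadata = []   # one metadata record per stored data cluster

    def receive(self, entry):
        # Step S10: receive and buffer the data entry.
        self.buffer.append(entry)
        # Step S20: have sufficient entries been buffered?
        if len(self.buffer) >= self.cluster_size:
            self._store_cluster()

    def _store_cluster(self):
        entries, self.buffer = self.buffer, []
        # Step S30: generate the metadata, here a (min, max) range per
        # indexed field plus an entries counter and a pointer to the cluster.
        meta = {"count": len(entries), "pointer": len(self.storage)}
        for field in self.indexed_fields:
            values = [e[field] for e in entries]
            meta[field] = (min(values), max(values))
        self.storage.append(entries)
        self.metadata.append(meta)


writer = ClusterWriter(cluster_size=3, indexed_fields=["purchase_price"])
for price in (15, 295, 120, 40):
    writer.receive({"purchase_price": price})
print(writer.metadata)
```

In this sketch the fourth data entry remains in the buffer until enough further data entries arrive to form the next data cluster.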

Cluster Optimisation

Figure 3 and Figures 4A to 4N illustrate the main processing steps performed by the data processing apparatus 100 when optimizing data clusters. The data clusters to be optimized may include all of the data clusters stored by the storage device 110 or a subset of those data clusters. The selection may be random or based on some metric such as clusters which are often retrieved but do not answer a query. At step S40, the metadata for data clusters to be optimized is retrieved. Such retrieval may occur from the storage device 110, memory 150 or cache 160, depending on implementation.

As illustrated in Figure 4A, a group of existing data clusters 10 are selected. In this simple example, every existing data cluster is selected. Also in this simple example, the metadata for this group of existing data clusters 10 stores ranges for fields A and B.

This is illustrated schematically in Figure 4A where the ranges for field A are mapped onto the A axis, and the ranges for field B are mapped onto the B axis. It will be appreciated that this can be repeated for multiple fields which would then map into multiple dimensions. For example, the metadata for data cluster 10-1 indicates that the values of data entries within that cluster fall within the range A1A - A1B and within the range B1A - B1B. The metadata for the other data clusters are mapped in a similar way. It will be appreciated that these ranges may be numerical ranges or any other range which forms a metric space, for example one definable in Euclidean space, whose size can be determined (for example a Hamming distance).

As can be seen in Figure 4B, the complete group of existing data clusters 10 occupies a search space 20 bounded by AL - AU on the A axis and BL - BU on the B axis.

Returning now to Figure 3, at step S50, the metadata for this group of existing data clusters 10 is analysed to identify which of these data clusters to optimize. In one embodiment, this is achieved by assuming that each data value within the group of existing data clusters 10 is evenly distributed within the search space 20 in order to select a group of optimizable data clusters. Identifying the group of optimizable data clusters in this way reduces the processing burden and avoids the need to retrieve any of the existing data clusters themselves from the storage device 110 to make that determination.

As shown in Figure 4C, the search space 20 is partitioned using a partitioning algorithm. The partitioning algorithm used will be selected based on the characteristic(s) of the data clusters which it is desired to improve. In this embodiment, the partitioning algorithm initially seeks to place a partition line 25A1 along the A axis, so that the assumed number of data entries in the area 20A occupied by data clusters to one side of the line 25A1 matches the assumed number of data entries in the area 20B on the other side of the line 25A1. As shown in Figure 4D, the area 20A is split along the B axis in a similar manner by the line 25B1 and the area 20B is split in a similar way by the line 25B2.

This process continues until, as illustrated in Figure 4E, the search space 20 has been partitioned into a number of separate regions which equals or exceeds the number of existing data clusters within the search space 20. Typically, the search space is partitioned into 2ⁿ regions. In this example, there were 7 data clusters, and so 8 regions have been formed. These regions represent an ideal partitioning of the search space 20 to meet the required clustering criteria.

It will be appreciated that this technique is often referred to as a KD-tree. It will be appreciated that other partitioning techniques may be used such as, for example, a quad tree, octree, BSP tree and the like. The partitioning into optimized data clusters may be subject to a maximum or minimum filling constraint. The particular partitioning performed is intended to partition the space into an arrangement which would represent an ideal set of clusters that would meet the particular clustering criteria which best suits the search requirements of the data information system. In this example, it is desired to provide no overlap between data clusters and an equal number of splits in each dimension, thereby creating maximum selectivity in each dimension independently.
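By way of illustration only, the partitioning of the search space from the metadata alone may be sketched as follows. The sketch assumes that each existing data cluster is described by a tuple of per-dimension ranges together with an entries counter, that its data entries are evenly distributed over its ranges, and that every range has non-zero extent. The function names and the use of bisection to locate each partition line are choices made for this sketch and are not taken from the embodiments.

```python
def assumed_count(cluster, region):
    """Entries of `cluster` assumed to lie in `region`, taking the entries
    to be evenly distributed over the cluster's ranges."""
    ranges, entries = cluster
    fraction = 1.0
    for (c_lo, c_hi), (r_lo, r_hi) in zip(ranges, region):
        overlap = max(0.0, min(c_hi, r_hi) - max(c_lo, r_lo))
        fraction *= overlap / (c_hi - c_lo)  # assumes c_hi > c_lo
    return entries * fraction


def split(region, clusters, axis, iterations=40):
    """Split `region` along `axis` so that both parts hold (approximately)
    the same assumed number of data entries."""
    lo, hi = region[axis]

    def left_of(cut):
        return region[:axis] + ((lo, cut),) + region[axis + 1:]

    half = sum(assumed_count(c, region) for c in clusters) / 2
    a, b = lo, hi
    for _ in range(iterations):  # bisection on the position of the line
        mid = (a + b) / 2
        if sum(assumed_count(c, left_of(mid)) for c in clusters) < half:
            a = mid
        else:
            b = mid
    cut = (a + b) / 2
    return left_of(cut), region[:axis] + ((cut, hi),) + region[axis + 1:]


def partition(search_space, clusters):
    """Split every region, cycling through the dimensions, until there are
    at least as many regions as existing data clusters."""
    regions, axis = [search_space], 0
    while len(regions) < len(clusters):
        regions = [part for r in regions for part in split(r, clusters, axis)]
        axis = (axis + 1) % len(search_space)
    return regions


# Two clusters side by side on the A axis, holding 30 and 10 entries.
clusters = [(((0, 2), (0, 4)), 30), (((2, 4), (0, 4)), 10)]
regions = partition(((0, 4), (0, 4)), clusters)
print(regions)
```

In this sketch the number of regions doubles on each pass, so seven existing data clusters would yield eight regions, as in the example above.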

As indicated above, the partitioning assumes that the data values within the existing data clusters are distributed in a uniform way. However, as will become apparent, this would often not be the case, but this technique still enables optimizations of the existing data clusters to be performed to provide optimized data clusters in an efficient way which does not require excessive resources. For certain data sets this assumption holds so badly that it can make sense to keep a small set of samples per cluster. In particular, for very skewed data sets, the uniformity assumption is not enough to make the optimization converge. One option in these circumstances is to keep a low number of data entries per data cluster to better approximate the distribution.

In order to select data clusters to be optimised, two different approaches are envisaged. The first approach selects a data cluster for optimisation which is judged to be least aligned with the ideal set of clusters. A second approach selects an ideal data cluster for optimisation based on an error contribution of data clusters falling within that ideal data cluster. Turning now to the first approach, as can be seen in Figure 4F1, an existing data cluster 10-2 is selected. This selection is made by comparing each data cluster within the partitions and selecting the data cluster which least aligns with those partitions (or which deviates the most from those partitions). The existing data cluster which deviates the most is assumed to be the best candidate for optimization.

As shown in Figure 4G1, every data cluster which intersects in search space with the candidate data cluster 10-2 is also selected to create an optimizable group of data clusters 30, with all non-intersecting data clusters being ignored, as illustrated in Figure 4H.
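By way of illustration only, the first approach may be sketched as follows. The description does not fix how the deviation from the partitions is measured; this sketch assumes, purely as an example, that an existing data cluster deviates more the more partitions its ranges intersect. The function names and the sample ranges are likewise hypothetical.

```python
def intersects(r, s):
    """True if two ranges (one (lo, hi) pair per dimension) overlap
    with non-zero extent in every dimension."""
    return all(r_lo < s_hi and s_lo < r_hi
               for (r_lo, r_hi), (s_lo, s_hi) in zip(r, s))


def select_optimisable_group(clusters, partitions):
    """clusters: dict mapping cluster id to its ranges.
    Returns (id of the most deviating cluster, ids of the clusters
    intersecting it, including the cluster itself)."""

    def deviation(ranges):
        # Assumed measure: number of ideal partitions the cluster spans.
        return sum(intersects(ranges, p) for p in partitions)

    worst = max(clusters, key=lambda cid: deviation(clusters[cid]))
    group = [cid for cid, ranges in clusters.items()
             if intersects(ranges, clusters[worst])]
    return worst, group


partitions = [((0, 2), (0, 2)), ((2, 4), (0, 2)),
              ((0, 2), (2, 4)), ((2, 4), (2, 4))]
clusters = {
    1: ((0, 1), (0, 1)),
    2: ((1, 3), (1, 3)),          # spans all four partitions
    3: ((2.5, 3.5), (0.5, 1.5)),  # overlaps cluster 2
    4: ((3.2, 3.9), (3.2, 3.9)),
}
print(select_optimisable_group(clusters, partitions))
```

Only the metadata is consulted in this sketch, so the optimizable group of data clusters is identified without retrieving any data cluster from storage.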

Turning now to the second approach, as can be seen in Figure 4F2, an ideal data cluster 20’ is selected. This ideal data cluster 20’ is selected based on an error measure. For every partition (ideal data cluster) an error measure is computed. For each partition, data clusters falling within that partition are identified and a data cluster error based on the shape, overlap and positional misalignment of each of those data clusters is calculated. Those data cluster errors are then combined for that partition. For example, the ideal data cluster 20’ will have data cluster errors calculated for the two data clusters intersecting that ideal data cluster 20’ and these data cluster errors will be combined to give an error measure for that ideal data cluster 20’. The partition that has the highest error measure is selected, in this example, the ideal data cluster 20’. It will be appreciated that in another embodiment neighbouring partitions may also be selected for various reasons, such as if a wider optimisation is required and/or for faster convergence per iteration.

As can be seen in Figure 4G2, every data cluster which intersects in search space with the ideal data cluster 20’ is selected to create an optimizable group of data clusters 30’, with all non-intersecting data clusters being ignored.

Irrespective of which approach is taken (the following description is based on the first approach, but applies equally to the second approach), the optimizable group of data clusters is then optimized. Returning to Figure 3, at step S60, those existing data clusters within the optimizable group of data clusters 30 are retrieved from the storage device 110 and their data values 200 stored in the entries of the optimizable group of data clusters 30 are mapped onto the search space 20, as illustrated in Figure 4I. It will be appreciated that although in this example the search space 20’ of the optimizable group of data clusters 30 matches the search space 20 of the existing data clusters, as illustrated in Figure 4J, this need not be the case and may instead be a subset of that search space 20.

Returning to Figure 3, at step S70, the search space 20’ of the optimizable group of data clusters 30 is then partitioned in a similar manner to that described above, as illustrated in Figures 4K to 4L. Partitioning ceased after 4 partitions were generated, since the number of data clusters in the optimizable group 30 is also 4.

As illustrated in Figure 4M, optimized data clusters 10’-1 to 10’-4 are formed from the data values falling within each partition area. Metadata describing the range in the search dimensions A and B of each of those optimized data clusters 10’-1 to 10’-4 is generated and the optimized data clusters 10’-1 to 10’-4, together with their metadata, are stored. Once that storage has happened, the existing data clusters within the group of optimizable data clusters 30, together with their metadata, can be nulled and the optimized data clusters 10’-1 to 10’-4 and their metadata can be made available to the data information system at step S80.
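Steps S70 and S80 can be illustrated with a short sketch. It assumes median-style splits that cycle through the dimensions and that there are at least as many data entries as partitions; the function names `kd_partition` and `range_metadata` and the example points are hypothetical and are not taken from the figures.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kd_partition(points: Sequence[Point], n_parts: int, depth: int = 0) -> List[List[Point]]:
    """Split points into n_parts groups, cycling the split dimension per level.

    Assumes len(points) >= n_parts so that no group ends up empty.
    """
    if n_parts <= 1 or len(points) <= 1:
        return [list(points)]
    dim = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[dim])
    left_parts = n_parts // 2
    # Each side gets a share of the points proportional to its partition count.
    cut = len(ordered) * left_parts // n_parts
    return (kd_partition(ordered[:cut], left_parts, depth + 1)
            + kd_partition(ordered[cut:], n_parts - left_parts, depth + 1))


def range_metadata(cluster: Sequence[Point]) -> List[Tuple[float, float]]:
    """Per-dimension (min, max) range, kept as the cluster's metadata."""
    return [(min(vals), max(vals)) for vals in zip(*cluster)]


# Hypothetical data values in dimensions A and B from four existing clusters.
points = [(1.0, 1.0), (2.0, 7.0), (3.0, 3.0), (4.0, 9.0),
          (6.0, 2.0), (7.0, 8.0), (8.0, 4.0), (9.0, 6.0)]
optimized = kd_partition(points, n_parts=4)
metadata = [range_metadata(c) for c in optimized]
```

Partitioning stops once the number of groups equals the requested number of clusters, mirroring the description above. In this example the four resulting ranges do not overlap.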

As can be seen in Figure 4N, the characteristics of the resultant data clusters have been improved, since there are now fewer data clusters, they are spaced further apart and the amount of overlap has been reduced. However, it can be seen that full optimization has not yet occurred and so processing may return to step S40 to continue to optimize the data clusters in an iterative manner.

When building the ideal data cluster model, worst-case data cluster configurations can be encountered for which the runtime complexity becomes quadratic. This happens, for example, if all data clusters overlap with each other, because for every cluster the error computation must consider every other data cluster in the set. In order to create a strict O(n log n) bound on the runtime complexity, the data clusters that have a very negative impact on the overall runtime are filtered out. One possible heuristic can be based on the size of the data clusters, because it is assumed that very large clusters are likely to overlap with very many clusters. In order to filter out these "bad" overlapping clusters, the number of successive KD-tree levels in which the clusters intersect the same split planes is computed. The clusters that intersect split planes of successive levels for a certain or specified number of times are filtered out and handled separately.

Although this loses a bit of precision, this loss seems acceptable when dealing with large datasets.

Updates

Data values stored by data clusters may be changed or updated. For example, using the example mentioned above, the “firmware date” for an entry in a data cluster could be changed from one date to another. Updates can also include deletion of an entry from a data cluster. For example, using the example mentioned above, an item in the inventory database may be deleted. When such updates occur, new metadata is generated for the data cluster reflecting the changed data values within that data cluster. Those changes may then cause that updated data cluster to be selected for optimisation as mentioned above.

Searching

Figure 5 illustrates searching the data clusters according to one embodiment.

At step S90, a search enquiry is received. Typically, the search enquiry will relate, among other fields, to search fields whose data ranges are indicated in the metadata for the data clusters. Should the metadata not contain that information then, depending on implementation, that metadata can be added when optimizing the data clusters.

At step S100, the metadata is interrogated to see if it answers the query.

At step S110, an assessment is made of whether the query is answered. For example, a query may be made for an indication of the total number of data entries in the data clusters. As mentioned above, the metadata for each data cluster may include that as a data item, and so the answer can be returned without needing to interrogate the data clusters themselves. It will be appreciated that other data items relating to the data clusters may also be stored in the metadata. Similarly, an interrogation of the metadata may reveal that no data clusters contain data values which can possibly fall within the search criteria, and so, at step S120, an answer to the query is provided from the metadata alone.

If, instead, it is determined that it is not possible to answer the query from the metadata alone, then those data clusters which intersect with the search criteria are retrieved, the data entries in those data clusters interrogated and the answer to the query provided at step S140.

As an example, consider a search which is bounded in search space by the area A’ in Figure 4A, which is also illustrated in Figure 4N. Prior to optimisation of the data clusters as shown in Figure 4B, the metadata would have indicated that the result to that search could be contained in two data clusters, each of which would need to have been accessed from the storage device 110 in two data accesses (assuming that the size of the data clusters was matched to the data transfer size between the storage and processing logic), then interrogated before returning a null result. After optimisation of the data clusters as shown in Figure 4N, the metadata would have indicated that none of the data clusters can possibly store the result, thereby saving two data accesses and subsequent processing to interrogate those data clusters.
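The saving described in this example can be sketched in a few lines. The sketch assumes range metadata per dimension and uses hypothetical ranges and a hypothetical query box standing in for area A’; the names `may_contain` and `clusters_to_fetch` are illustrative only.

```python
from typing import Dict, List, Tuple

Range = Tuple[float, float]


def may_contain(meta: Dict[str, Range], query: Dict[str, Range]) -> bool:
    """A cluster can hold matches only if its range overlaps the query
    in every queried dimension."""
    return all(meta[d][0] <= hi and lo <= meta[d][1]
               for d, (lo, hi) in query.items())


def clusters_to_fetch(metadata: Dict[str, Dict[str, Range]],
                      query: Dict[str, Range]) -> List[str]:
    """Names of the data clusters that must be read from storage."""
    return [name for name, meta in metadata.items() if may_contain(meta, query)]


# Hypothetical search area in dimensions A and B.
query = {"A": (4.5, 5.5), "B": (4.5, 5.5)}

# Hypothetical metadata before optimisation: two clusters overlap the query.
before = {
    "10-1": {"A": (0.0, 6.0), "B": (0.0, 6.0)},
    "10-2": {"A": (4.0, 10.0), "B": (3.0, 9.0)},
    "10-3": {"A": (12.0, 15.0), "B": (0.0, 3.0)},
}
# Hypothetical metadata after optimisation: tighter, non-overlapping ranges.
after = {
    "10'-1": {"A": (0.0, 4.0), "B": (0.0, 4.0)},
    "10'-2": {"A": (6.0, 10.0), "B": (6.0, 9.0)},
    "10'-3": {"A": (12.0, 15.0), "B": (0.0, 3.0)},
}

fetch_before = clusters_to_fetch(before, query)
fetch_after = clusters_to_fetch(after, query)
```

Before optimisation two data clusters would have to be fetched and interrogated; afterwards the list is empty, so the null result can be returned from the metadata alone.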

Resource Allocation

It will be appreciated that the data processing resources dedicated to the receiving and storing of data clusters as illustrated in Figure 3, the optimization of data clusters as illustrated in Figures 4A to 4N, and the searching of data clusters as illustrated in Figure 5, may be dynamically altered or statically prioritized in order to, for example, prioritize one process over the other and/or to make some processes foreground and others background. Typically, the searching and storing of data clusters are prioritized as foreground processes, with the optimization occurring in the background, as resources become available.

JOIN Operations

Figure 6A illustrates an example JOIN operation on two tables. Table a and Table b are unoptimised and store data values. Table a stores data values for the fields item_id, order_id and part_id. Table b stores data values for the fields item_id, sales_date and sales_id. It is possible to perform a JOIN operation in response to a query. For example, Table a may be JOINed with Table b along a shared field (dimension) which, in this example, is item_id. The result, Table c, contains data values which map order_id and part_id to sales_date and sales_id via item_id. The JOIN operation can be resource-intensive (requiring large amounts of memory) and can slow the processing speed dramatically, particularly as the size of the tables increases.

Figure 6B illustrates an example JOIN operation on two tables according to one embodiment. Table a’ and Table b’ are optimised using the techniques described above. Consequently, Table a’ has optimised data clusters a’-1 to a’-5 and Table b’ has optimised data clusters b’-1 to b’-5. Now individual JOIN operations can be performed using the optimised data clusters. For example, data cluster a’-1 can be JOINed with data cluster b’-1, a’-2 with b’-2, and so on to generate resultant JOINed data clusters.

In this example, five resultant JOINed data clusters will be generated. This approach enables a subset of the data from one table to be JOINed with a subset of the data from another table, which reduces the resources required (reduces the amount of memory utilised) and increases the processing speed dramatically, particularly as the size of the tables increases.
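A cluster-wise JOIN of this kind can be sketched as follows. Two points are assumptions rather than part of the description: cluster pairs are chosen by checking whether their item_id ranges overlap (the description simply pairs a’-1 with b’-1 and so on), and the ranges are recomputed from the rows here, whereas an embodiment would read them from the metadata. The function names and example rows are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def key_range(cluster: List[Row], key: str) -> Tuple[Any, Any]:
    """Range of the join key in a cluster (stands in for stored metadata)."""
    values = [row[key] for row in cluster]
    return min(values), max(values)


def clusterwise_join(a_clusters: List[List[Row]], b_clusters: List[List[Row]],
                     key: str) -> List[List[Row]]:
    """Join only those cluster pairs whose key ranges overlap."""
    result = []
    for a in a_clusters:
        a_lo, a_hi = key_range(a, key)
        for b in b_clusters:
            b_lo, b_hi = key_range(b, key)
            if a_lo > b_hi or b_lo > a_hi:
                continue  # disjoint ranges: this pair cannot produce rows
            # Small per-pair hash index, so memory use is bounded by cluster size.
            index: Dict[Any, List[Row]] = {}
            for row in b:
                index.setdefault(row[key], []).append(row)
            joined = [{**ra, **rb} for ra in a for rb in index.get(ra[key], [])]
            if joined:
                result.append(joined)
    return result


# Hypothetical optimised clusters of Table a' and Table b'.
a_clusters = [
    [{"item_id": 1, "order_id": 10, "part_id": 100},
     {"item_id": 2, "order_id": 11, "part_id": 101}],
    [{"item_id": 7, "order_id": 12, "part_id": 102}],
]
b_clusters = [
    [{"item_id": 1, "sales_date": "2019-01-02", "sales_id": 500},
     {"item_id": 2, "sales_date": "2019-01-03", "sales_id": 501}],
    [{"item_id": 7, "sales_date": "2019-02-01", "sales_id": 502},
     {"item_id": 8, "sales_date": "2019-02-02", "sales_id": 503}],
]
joined_clusters = clusterwise_join(a_clusters, b_clusters, "item_id")
```

Each pair is joined independently, so only two clusters need to be held in memory at any one time, which is the source of the reduced memory use noted above.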

Accordingly, embodiments provide a mechanism to introduce data locality to a dataset incrementally. Embodiments alleviate limitations of existing techniques. In particular, in embodiments:

1) Scattered and inefficient input/output (I/O) data accesses (typically to a storage device) are avoided by clustering data. Access is typically at a granularity level optimized for the I/O systems of the one or more connected computer systems and clustering data ensures that a large proportion of the fetched data is relevant to the query (as opposed to a large proportion being irrelevant in the earlier cases). This can be executed on multiple dimensions at the same time (clustering data along each of these dimensions, co-locating similar data). It is irrelevant for the operability of the embodiments if the dimensions are correlated or not.

2) Any index typically requires the keys and their location to be stored in this index, which, in the typical implementation, increases the data the more keys are defined. This can be a significant amount of resources and thereby creates inefficiencies, such as exceeding the ability to be kept in one of the system’s caches (RAM etc.). Alternatively, an additional index with that key can be defined independently, which requires additional storage and prevents the ability to search on multiple keys together at the same time. With the combination of embodiments such a structure can be very small since it only includes part of the key data, such as key ranges.

3) In a typical index with multiple keys (e.g. n keys) at the same time, the order of the keys is typically predefined and a user may only query between 1 and n keys together in the order they were defined. With the multi-dimensional clustering of embodiments, data can be queried along 1 to n keys independently and in any order.

4) Any lookup structure requires administration to keep it up-to-date. By not forcing clustering of data during ingestion but instead incrementally building it in a stand-alone process, embodiments do not need to do any administration during the ingestion process and can therefore guarantee a stable ingestion performance while guaranteeing availability of all existing data, which is a limitation of existing B-tree-based indexes or other data structures such as Cache Oblivious Look-ahead Array (COLA) or Log Structured Merge Trees (LSM-Tree).

Traditional databases use indices to allow 1-to-1 lookup of rows. When analysing big ranges of data this has the overhead that typically a 4-16 KB chunk of data has to be read for every row, incurring a high read overhead, since typically only a fraction of this data is relevant for the query. The reverse case is true when the index is updated - it may be necessary to read and additionally to overwrite one entire, typically 4-16 KB, block of data just to modify one key and associated pointer to the row, sometimes even multiple blocks.

Some databases may use a COLA. This allows for fast lookup, but must be kept up-to-date during ingestion, thereby incurring an ingestion overhead relative to the total dataset size. A COLA works by ingesting data into a first level of a multi-level structure. The first level typically covers the entire value range of the data that is ingested. When this level has reached a certain fill state, the data in this level is inserted into the next level down, possibly triggering the next level to reach its fill state as well. This level then also cascades downwards etc. Each level separates the value ranges. An example of this is a perfect order. Thereby clustering of data becomes more granular the lower the layer. These structures are typically of an amortized complexity O(log n) for every row inserted, the heavy penalty of triggering large cascades (heavy I/O) being somewhat offset against the clustering precision of data. An LSM-Tree works in a very similar fashion.

“Database Cracking” adaptively builds knowledge about the data contained in the database during the queries. It is used in some column-store databases. It moves the cost of index maintenance from the database changes (ingestion) to the queries (selection). The query processor provides information to the data handling mechanisms to re-arrange the data and execute optimizations such as a partial sorting or partial indexing. This technique is said to improve I/O and query processing speed and to exhibit self-optimizing behaviour.

A database is a tool to persistently store data inserted into it. It typically also has a very predictable behaviour on how these insertions are handled when multiple sources compete for storing data or how long these insertions typically take at an upper bound. The complexity of operating a database system typically limits the total system data throughput to a fraction of the achievable system throughput compared to storing a stream of data with the standard system I/O without using a database.

In embodiments, given one or more connected computer systems operating a database containing one or multiple database tables, a process can organize the data in each table into many independent clusters of data.

a) Such a clustering is made by spatially organising data along multiple, possibly independent, dimensions. At the first insertion, information on the data’s properties, such as ranges, is already obtained and kept. The process of laying out the data along multiple, possibly independent, dimensions is both an independent process and an incremental process. Multiple possibly independent dimensions: the process is capable of organising along multiple dimensions at the same time and in an independent way:

Able to order everything by projecting all types of values to numbers

Able to cope with vastly different scales along each dimension

Able to reach partial optima, able to stop oscillating between reordering steps

A multi-dimensional range distribution that is guaranteed to create a minimum of selectivity along every dimension.

In one embodiment, a KD-tree is used with at least one split in every dimension. In one embodiment, a KD-tree is used with typically equal selectivity on each dimension irrespective of the dimension’s cardinality (cardinality = how many different values in a data column). In one embodiment, the KD-tree is used with a user-defined selectivity scaling for each dimension.

It will be appreciated that using an independent process means that the process of optimising the data is typically undertaken after data was inserted into the database, thereby never blocking the insertion and affording a higher insertion rate. The total amount of data that can be persistently inserted into the database this way is therefore not limited by the time it may take to maintain another optimization structure, such as a database index. When otherwise the maintenance of optimization structures imposes practical limits on the amount of data, embodiments do not impose this limit.

Furthermore, the effort, e.g. in terms of I/O and/or processing resources, can be limited to achieve the desirable balance between insertion and query performance. In the most extreme case, the user or system can omit this process to save resources for an undetermined amount of time, relying solely on the data’s properties obtained during insertion for executing c) below. Therefore, embodiments are capable of guaranteeing a predictable and high data insertion performance. In embodiments, using an incremental process means that the resources dedicated to this process are adjustable as well as constrainable to what is determined, by the user or automatically by the system, to be the optimum trade-off between all system resources. The outcome of each increment is already usable and, typically, already shows an improvement over the prior state with respect to data clustering. This is true even if the increments are still far away from a mathematically-optimal distribution. When the above constraint is set to a value of N bytes, embodiments work with up to N bytes of data at the same time and are still able to improve the dataset for any N greater than or equal to 2 to the power of the number of dimensions, multiplied by the data cluster size. A selection algorithm ensures that there is an improvement at every step or that the system is informed that, at the current state with the current N, no further improvements can be made and thus no further resources are required.

It is observed that incrementally running the above mechanism takes an effort of O(n log n) for n rows to theoretically reach mathematical optimality, but in the embodiments it was observed that only a fraction of this effort is required. Typically, the effort is an approximately constant factor in relation to the data inserted. This contrasts with other implementations (such as an index), where the effort of maintaining it is relative to the total dataset size.

b) Once the data has reached a close-to-optimum distribution, it is desirable to only re-organize the newly-ingested data into clusters. Embodiments choose their increments such that the data furthest away from the desired distribution is incrementally re-organized first. Thereby, for certain distributions, the effort is more correlated to the number of data entries ingested in a time period as opposed to the total dataset size (this is different to the database index, which is determined by data set size).

c) In embodiments, a metadata structure can be kept to limit access only to the relevant clusters that contain data the user asks for. This metadata structure is able to determine this relevance by storing the range along each dimension that is covered in each of the clusters. It will be appreciated that storing a range for each dimension takes only two values per dimension, and it is therefore very small compared to the underlying entries described. In this context it is important to note that range is just one of the possible embodiments. Each range can be evaluated independently. The cluster size is configurable to reach the optimum trade-off between: the I/O size to be fetched at good I/O performance; the corresponding selectivity based on the extent of the clusters along each dimension (hypothetical extent is dimension-root of the total cluster count); and the relative size of the metadata structure and resulting access speed (smaller metadata structures can be kept in the individual cache hierarchies: fast caching storage, RAM, CPU caches etc.).

Ingestion:

In embodiments, data entries such as one or more rows are received by the database. A row typically has a defined set of columns; the value of a column in a row is called a field. These rows are evaluated against a number of dimensions, each dimension being determined by one or more fields in the row (single field, concatenating multiple ones, calculation from multiple fields etc.) or generated, e.g. the total row count. In one embodiment dimensions are set a priori. In another, dimensions are learned from usage patterns. In one embodiment every field in a row is chosen as its own dimension. The row is stored. In order to store the rows, a plurality of rows is buffered until an I/O-optimal size is reached or the database operation requires writing out this buffer. The set of rows having an approximately I/O-optimal size is referred to as a row cluster (data cluster). When a row cluster is created, the information of which data is contained along each dimension in the cluster is extracted and stored in a metadata structure. In another embodiment, more rows than required for a single row cluster are buffered. Then, the distribution of the rows into row clusters already follows the optimization step disclosed below. In one embodiment, the information on which data is contained is the range between the values in each of the dimensions. In another embodiment, the information is a probabilistic data structure, such as a Bloom filter.

In such a probabilistic filter, which could be a bit mask, one or more bits at different positions indicate if a value may be contained. If one of the bits is not set it can be concluded that the value is not contained. The distance between two probabilistic filters can be determined by finding out which bits are equal, and which not (001 to 001 = distance 0, 010 to 101 = distance 3, 010 to 011 = distance 1). In another embodiment, the values in each dimension are given an order, e.g. “aab” after “aaa”, and distance is determined by finding the distance in steps along the order. This order may be implicit from the values (e.g. the given text ordering example) or kept as a dictionary that each part of the database can look up in. In one embodiment, where the cardinality for one or more dimensions is very low, bits indicate the definite presence or absence of a value as opposed to indicating a probable presence. It will be appreciated that storing a range, a bit mask for values contained or a range along an order or similar takes up much less space than the row cluster. Thereby it is typically many factors smaller than an index, which typically at least stores every key (= the field value) and a pointer to that key’s row. In a typical embodiment the metadata also includes information such as how many rows are inside a row cluster. In one embodiment row clusters are compressed and the metadata contains additional information, such as storage sizes, to optimize the I/O access when retrieving the row cluster. In one embodiment row clusters and/or the metadata are compressed using additional hardware that can be configured to execute compression or other processing of the row cluster and/or the metadata in line with the query conditions.
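The bit-mask distance given in the examples above corresponds to counting differing bit positions. A minimal sketch, assuming filters are held as Python integers (the function names are hypothetical):

```python
def filter_distance(a: int, b: int) -> int:
    """Number of bit positions in which two bit-mask filters differ."""
    return bin(a ^ b).count("1")


def may_contain(cluster_mask: int, value_bits: int) -> bool:
    """A value may be present only if every one of its bits is set in the mask."""
    return (cluster_mask & value_bits) == value_bits


# The three distances quoted in the text.
d_same = filter_distance(0b001, 0b001)
d_far = filter_distance(0b010, 0b101)
d_near = filter_distance(0b010, 0b011)
```

How a value is mapped to its bits (for example, by hashing) is outside the scope of this sketch.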

Optimization: The purpose of optimization is to increase selectivity when querying for the data while being able to conduct optimization iteratively (no “all or nothing” case). Row clusters are evaluated for their relative selectivity, i.e. how likely they are to be chosen for retrieval by a query and thus create a cost, versus the probability of including data required by that query. Selectivity can, for example, be approximated by the extent of their range (in one embodiment), by the number of splits per dimension (in one embodiment), by the number of bits set (in another) or the range along the order (in another) or similar. Wider ranges or more bits set - or any other way in which data selectivity is lower - shall be referred to as larger clusters hereafter, narrower ranges or fewer bits set - or any other way in which data selectivity is higher - as smaller clusters. This evaluation can be undertaken for each dimension individually or for multiple dimensions at the same time. In one embodiment, larger clusters are prioritized over smaller clusters. Clusters are also selected to be at a shorter distance to each other. In one embodiment, the rows in multiple larger clusters within a certain, typically close, distance of each other are processed. Close distance can include overlap, which is spatial overlap, range overlap or equal bits set, depending on the embodiment. If overlap is present, rows of smaller clusters that overlap with the aforementioned larger clusters are also included in the processing. In another embodiment, the target distribution with a high selectivity has been identified, for example, by sampling the data. Row clusters are then chosen based on their divergence from the target distribution, the more diverging ones in favour of the less diverging ones. In one embodiment, multiple of the aforementioned selection mechanisms are combined to choose the row clusters.

The rows are processed by re-distributing rows to row clusters such that the larger clusters turn into smaller clusters, the overlap is reduced and the distance between row clusters is increased, while maintaining a certain threshold of rows per row cluster. Thereby the resulting re-distributed row clusters are typically more selective than the original row clusters. In one embodiment this re-distribution is operated with a KD-tree. Each row’s dimensions are inserted as an n-dimensional point in the KD-tree.

The point clusters generated by the split planes created by the KD-tree are then used to obtain the new row clusters. In one embodiment, with a fixed row size, the row clusters are thereby on average at least 75% filled. In another embodiment, with a variable row size, KD-tree splits are adjusted for row size, i.e. instead of splitting by the median value, it is split by the median aggregated row size. In one embodiment, the KD-tree splits at least once per every dimension. In another embodiment, the KD-tree ensures that there is an approximately equal number of split planes in each dimension. It will be appreciated that this will typically result in equal selectivity on each dimension irrespective of the dimension’s cardinality. In another embodiment, the rows are re-distributed into row clusters such that as few bits as possible are set in the probabilistic data structure of each row cluster, i.e. the “distance” (in the sense it was defined before) between rows is minimized and larger clusters with overlap are converted into smaller clusters with no or limited overhead. For example, the row with the bitmask 001 shall be clustered with the row with the bitmask 101 in favour of clustering with the row with bitmask 010, or even 110 (which has greatest distance). In another embodiment, the position along the order is interpreted as a range for the KD-tree splitting of the values. In another embodiment, the distribution is learned by applying an online k-means algorithm to the row (interpreting the row’s dimensions as an n-dimensional point).
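The variable-row-size embodiment can be illustrated by a single split step. This sketch is one possible reading of "median aggregated row size": rows are ordered along the split dimension and the cut is placed where the running total of row sizes reaches half of the overall total. It assumes at least two rows, and the name `split_by_row_size` and the example sizes are hypothetical.

```python
from typing import List, Tuple

# A row is represented as (point in search space, stored size in bytes).
SizedRow = Tuple[Tuple[float, ...], int]


def split_by_row_size(rows: List[SizedRow], dim: int) -> Tuple[List[SizedRow], List[SizedRow]]:
    """Split rows along `dim` so both sides hold roughly half the total bytes.

    Assumes at least two rows are supplied.
    """
    ordered = sorted(rows, key=lambda r: r[0][dim])
    total = sum(size for _, size in ordered)
    cut = len(ordered)
    running = 0
    for i, (_, size) in enumerate(ordered):
        running += size
        if running * 2 >= total:
            cut = i + 1
            break
    cut = min(cut, len(ordered) - 1)  # keep the right-hand side non-empty
    return ordered[:cut], ordered[cut:]


# Hypothetical rows: four small rows and one large row along one dimension.
rows = [((1.0,), 100), ((2.0,), 100), ((3.0,), 100), ((4.0,), 100), ((5.0,), 400)]
left, right = split_by_row_size(rows, dim=0)
left_bytes = sum(size for _, size in left)
right_bytes = sum(size for _, size in right)
```

Here a split by row count would put two or three rows on each side with very unequal byte totals, whereas the size-weighted split places four small rows on one side and the single large row on the other, so both resulting row clusters are equally filled.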

In a typical embodiment, the I/O and/or processing power used for optimization is limited to ensure enough resources are available for other processing, such as ingesting and selecting, and thereby to be able to give performance guarantees. In a typical embodiment, the number of row clusters chosen for optimization is balanced with the upper limit for I/O and/or processing power to yield the right balance between resource usage and creating smaller clusters. Such a balance can be obtained by automatically or manually inspecting how much smaller the clusters became. In one embodiment it may be chosen to delay the optimization temporarily to free resources for other database operations. The database is nevertheless fully capable of ingesting, selecting or executing other tasks during this period (that is, optimization is not required to always run to allow database operation).

Selecting:

The metadata structure is interrogated to determine if a row cluster contains values that match the conditions of the database query. In one embodiment, this is carried out with a range check along each of the dimensions. In another embodiment, this is done by checking if all the relevant bits are set in the probabilistic data structure. In another embodiment, the order is retrieved by finding the dimensions’ ordering or looking up the dimensions’ ordering in the common dictionary, then a range check can inform about the possible presence of this value. Only the row clusters where the metadata structure has indicated a possible presence of rows having met all of the required conditions are retrieved. The metadata structure may be kept in a search tree of some form to accelerate interrogation. In one embodiment the metadata structure is organized as a KD-tree of n-dimensional cubes expressing the range in dimensions. In some embodiments, queries or parts of queries are solely answered by inquiring the metadata structure. For example, when requesting a count for a range and the metadata can identify that none of the data lies within that range, the count is already known to be zero without retrieving any row cluster using I/O. In the reverse example, if the metadata structure can identify that all rows fall within the range, the metadata structure can provide the count directly by summing up the row count stored for each row cluster. In one embodiment, the metadata structure can estimate the result of queries based on the information it holds. For example, when a query requests the row count for half the range of a row cluster, the metadata structure can estimate that half the rows will match the query conditions and return the row count stored in the metadata structure divided by two. In another embodiment, the metadata structure keeps additional statistics, such as samples or information about the data distribution within the row cluster, to reach more precise estimates or even be able to answer specific requests accurately based on these additional statistics kept.
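The count examples above can be sketched for a single dimension. The sketch assumes the metadata holds a range and a row count per row cluster and that values are spread uniformly within a partially covered range; the function name `estimate_count` and the example figures are hypothetical.

```python
from typing import List, Tuple

# Assumed per-cluster metadata: (range low, range high, stored row count).
ClusterMeta = Tuple[float, float, int]


def estimate_count(clusters: List[ClusterMeta], lo: float, hi: float) -> Tuple[float, bool]:
    """Answer a range count from metadata alone.

    Returns (count, exact). The result is exact when every cluster is either
    fully inside or fully outside the queried range.
    """
    total = 0.0
    exact = True
    for c_lo, c_hi, rows in clusters:
        if hi < c_lo or lo > c_hi:
            continue  # no row of this cluster can match
        if lo <= c_lo and c_hi <= hi:
            total += rows  # every row of this cluster matches
            continue
        # Partial overlap: estimate under the uniformity assumption.
        exact = False
        overlap = min(hi, c_hi) - max(lo, c_lo)
        total += rows * overlap / (c_hi - c_lo)
    return total, exact


# Hypothetical metadata for three row clusters.
clusters = [(0.0, 10.0, 100), (10.0, 20.0, 200), (30.0, 40.0, 50)]
none_match = estimate_count(clusters, 50.0, 60.0)
all_match = estimate_count(clusters, 0.0, 20.0)
half_of_second = estimate_count(clusters, 0.0, 15.0)
```

In the third query half of the second cluster's range is covered, so half of its stored row count is added to the full count of the first cluster. When the result is not flagged as exact, an embodiment could either return the estimate or fall back to retrieving the partially covered row clusters.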

General:

In one embodiment, a plurality of dimension mechanisms are combined, for example a range on one dimension with a probabilistic filter on another dimension and an order range filter on two other dimensions. In this case, optimization is undertaken by applying the aforementioned methods for each dimension mechanism individually, possibly interleaving them (e.g. distributing bit-fields in between finding KD-tree splitting planes). In the aforementioned case, selection is undertaken by evaluating each of the dimensions’ metadata to conclude if, for the combination of all dimensions together, the possibility exists that all query conditions are met by a row cluster. This is done with the aforementioned methods. If this possibility exists, the row cluster is retrieved, otherwise not. In another embodiment, with a plurality of dimensions, the dimension mechanisms are combined by finding common measures for the optimization. One typical measure to use is the distance, which has been defined for multiple different dimension mechanisms in this disclosure. The combination of all distances can then subsequently be used for operating on multiple dimensions in accordance with the mechanism for a single dimension disclosed above.

Although illustrative embodiments of the invention have been disclosed in detail herein, with reference to the accompanying drawings, it is understood that the invention is not limited to the precise embodiment and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope of the invention as defined by the appended claims and their equivalents.

Claims

1. A data processing method, comprising:

identifying a group of existing data clusters having a clustering characteristic, each existing data cluster comprising data entries of a data information system stored together in storage, each data entry having at least one field storing data values;

selecting at least one existing data cluster from said group of existing data clusters as an optimisable group of data clusters; and

forming a group of optimised data clusters by allocating data entries of said optimisable group of data clusters to each optimised data cluster to improve said clustering characteristic for said group of optimised data clusters compared to said group of existing data clusters.

2. The method of claim 1, wherein each data cluster occupies a data range in a search space defined by values of each data entry of each field.

3. The method of any preceding claim, wherein each data cluster has a size which matches a bandwidth-optimised data block transfer size of said storage.

4. The method of claims 2 or 3, wherein each data cluster has associated metadata which provides at least an indication of said data range in said search space defined by values of each data entry of at least one field.

5. The method of any preceding claim, wherein said clustering characteristic comprises at least one of:

a search selectivity between said existing data clusters;

a number of existing data clusters accessed in response to search enquiries;

a separation between existing data clusters in said search space;

an overlap of data ranges of data values between existing data clusters in said search space; and

data ranges of data values within each existing data cluster in said search space.

6. The method of any preceding claim, wherein said selecting comprises selecting said at least one existing data cluster from said group of existing data clusters as said optimisable group of data clusters based on a size of its occupied search space.

7. The method of any preceding claim, wherein said selecting comprises selecting said at least one existing data cluster from said group of existing data clusters as said optimisable group of data clusters based on a number of intersections in said search space with other existing data clusters.

8. The method of any one of claims 4 to 7, wherein said selecting comprises selecting said at least one existing data cluster from said group of existing data clusters as said optimisable group of data clusters based only on associated metadata.

9. The method of any preceding claim, comprising:

generating a group of ideal data clusters from said group of existing data clusters, said group of ideal data clusters being generated within said search space occupied by said group of existing data clusters and wherein said selecting comprises selecting at least one existing data cluster from said group of existing data clusters which intersects a selected ideal data cluster in said search space as said optimisable group of data clusters.

10. The method of claim 9, wherein said generating comprises generating said group of ideal data clusters from said group of existing data clusters, said group of ideal data clusters being generated within said search space occupied by said group of existing data clusters based on an ideal clustering criteria which would improve said clustering characteristic.

11. The method of claim 10, wherein said ideal clustering criteria comprises at least one of:

an increase in a search selectivity between said existing data clusters;

a decrease in a number of existing data clusters accessed in response to search enquiries;

an increase in a separation between existing data clusters in said search space;

a decrease in an overlap of data ranges of data values between existing data clusters in said search space; and

a decrease in a data range of data values within each existing data cluster in said search space.

12. The method of any one of claims 9 to 11, wherein said generating comprises generating said group of ideal data clusters based on an assumed distribution of data entries within said search space within each existing data cluster.

13. The method of any one of claims 9 to 12, wherein said generating comprises generating said group of ideal data clusters using a partitioning algorithm which partitions said search space to have similar numbers of data entries in each ideal data cluster.

14. The method of any one of claims 9 to 13, comprising:

for each ideal data cluster in said group of ideal data clusters, determining a deviation in occupied search space between that ideal data cluster and existing data clusters intersecting that ideal data cluster and wherein said selecting comprises selecting said selected ideal data cluster based on said deviation.

15. The method of any preceding claim, wherein said forming comprises forming said group of optimised data clusters by allocating data entries of said optimisable group of data clusters to each optimised data cluster using a partitioning algorithm which would improve said clustering characteristic.

16. The method of any preceding claim, wherein said selecting comprises selecting overlapping existing data clusters which store data entries having overlapping data ranges within a plurality of fields defining a plurality of said search space dimensions as said optimisable group of data clusters and said forming comprises allocating said data entries from said optimisable group of data clusters to form said optimised data clusters having minimised overlapping data ranges in each search space dimension.

17. The method of any preceding claim, comprising storing each optimised data cluster in said storage.

18. The method of any preceding claim, comprising:

identifying a range of data values for each search space dimension within each data cluster; and

storing an indicator of each range of data values as said metadata for each corresponding data cluster.

19. The method of any preceding claim, comprising nulling said group of existing data clusters.

20. The method of any preceding claim, comprising iteratively repeating said identifying, selecting and forming.

21. The method of any preceding claim, comprising:

receiving a search request for data;

interrogating said metadata to identify candidate data clusters whose range of data values encompasses said search request.

22. The method of claim 21, comprising returning a result of said search request based only on said metadata.

23. The method of any preceding claim, comprising performing a join operation between said group of optimised data clusters and another group of optimised data clusters.

24. A data processing apparatus, comprising:

identification logic operable to identify a group of existing data clusters having a clustering characteristic, each existing data cluster comprising data entries of a data information system stored together in storage, each data entry having at least one field storing data values;

selection logic operable to select at least one existing data cluster from said group of existing data clusters as an optimisable group of data clusters; and

formation logic operable to form a group of optimised data clusters by allocating data entries of said optimisable group of data clusters to each optimised data cluster to improve said clustering characteristic for said group of optimised data clusters compared to said group of existing data clusters.

25. A computer program product operable, when executed on a computer, to perform the method of any one of claims 1 to 23.