WO2019234039A1 - Data processing - Google Patents

Info

Publication number
WO2019234039A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
clusters
group
cluster
data clusters
Application number
PCT/EP2019/064515
Other languages
French (fr)
Inventor
Luc VLAMING
David Geier
Thomas Richter
Adrien HAMELIN
Original Assignee
Swarm64 As
Application filed by Swarm64 As
Publication of WO2019234039A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/217 Database tuning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Definitions

  • the present invention relates to a data processing apparatus, data processing method and computer program product.
  • Data processing apparatus store data values (which may be instructions or data) in storage for processing.
  • the data processing apparatus retains or is provided with location information which identifies the location of data stored in storage for subsequent retrieval and processing.
  • a data processing apparatus may operate as a data information system which is provided with data which is required to be stored in storage for subsequent interrogation, such as searching in order to answer a query.
  • a data processing method comprising: identifying a group of existing data clusters having a clustering characteristic, each existing data cluster comprising data entries of a data information system stored together in storage, each data entry having at least one field storing data values;
  • the first aspect recognises that databases typically have two mechanisms to retrieve data that meets the condition of a database query. They can either access all of the data to answer a query, which is commonly referred to as a “table scan” and is typically slow, especially with large datasets; or they can consult an index, an indirect structure that records where the data meeting the query conditions is stored, which allows only the matching data to be retrieved selectively.
  • a query may ask for all data matching a specific value, and given this value is a key in the index, the index can directly provide all rows matching this query condition.
  • These selective retrievals may create data accesses that are scattered and, due to the nature of accesses in one or more connected computer systems, may fetch a large proportion of data that is not relevant.
  • the first aspect recognises that when storing data entries in a data information system, it is often convenient to cluster a number of those data entries together for subsequent storage. For example, clustering them together can help reduce any increase in overhead that may occur if storing the data entries individually.
  • clustering can improve the efficiency of accesses to storage, particularly when the cluster size is related to optimum sizes for accesses to storage.
  • the first aspect also recognises that storing data in this way can lead to clusters having undesirable characteristics. For example, identical or similar data values which may need to be searched may be distributed over a wide range of clusters, each of which may need to be interrogated in response to a search request or enquiry.
  • the first aspect also recognises that if the characteristics of the clusters are controlled to suit the particular implementation of a data information system, then it is possible to optimize the processing speed or performance of the data information system.
  • the method may be for or be performed by a data processing apparatus.
  • the method may comprise identifying or determining a group or set of data clusters.
  • Each data cluster may have data entries.
  • the data entries may be data entries of a data information system.
  • the data entries may be stored as a block in a storage device.
  • Each of the data entries may have one or more fields. Each of those fields may store one or more data values.
  • the method may comprise selecting or choosing one or more of the data clusters from the group of data clusters. That one or more selected data clusters may comprise or be designated as an optimizable group of data clusters.
  • the method may comprise forming or creating a group or set of optimized data clusters.
  • the optimized data clusters may be formed by allocating data entries from the optimizable group of data clusters. Each data entry of the optimizable group of data clusters may be allocated or assigned to one of the optimized data clusters. In this way, the distribution of data entries in data clusters may be changed when creating optimized data clusters from those data entries, in order to improve the clusters. This, in turn, improves the processing speed and performance of the data information system.
  • the group of data clusters may have a characteristic, feature or parameter that can be related to a metric.
  • the data entries may be allocated to the optimized data clusters to improve the characteristic of the group of optimized data clusters when compared to the characteristic of the group of data clusters.
  • each data cluster occupies a data range in a search space defined by values of each data entry of each field.
  • each field of a data entry can define a space which may need to be searched by the data information system.
  • the values of those fields of each data entry within the data cluster may define a data range within that search space. For example, consider a simple arrangement where a field stores a numerical value such as temperature. The field may then define a search space or search dimension of temperature. When looking at the values in the temperature field for each data entry within the data cluster it may be determined that a minimum temperature is 10 and a maximum temperature is 25. Accordingly, the data range in the search space defined by the temperature field for that data cluster would be between 10 and 25. It will be appreciated that any values of any type of field (such as text, hierarchy information, image data, etc.) can be mapped into Euclidean space and a range within that space can be established.
  • the search space has ‘n’ dimensions, each dimension being defined by a corresponding ‘n’ field. Accordingly, the search space for the data cluster may be multi-dimensional, depending on the number of fields to be searched or indexed.
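
The temperature example above can be made concrete with a minimal sketch. It assumes, purely for illustration, that a data entry is a dictionary of numeric field values and that a second field named `humidity` exists; neither the representation nor the names `data_ranges`, `Entry` and `Range` come from the source.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a data entry maps field names to numeric values.
Entry = Dict[str, float]
Range = Tuple[float, float]


def data_ranges(cluster: List[Entry], fields: List[str]) -> Dict[str, Range]:
    """Return the (minimum, maximum) value of each field over a non-empty cluster."""
    return {
        f: (min(e[f] for e in cluster), max(e[f] for e in cluster))
        for f in fields
    }


# Three entries with two searchable fields, i.e. a two-dimensional search space.
cluster = [
    {"temperature": 10.0, "humidity": 40.0},
    {"temperature": 25.0, "humidity": 55.0},
    {"temperature": 18.5, "humidity": 35.0},
]
ranges = data_ranges(cluster, ["temperature", "humidity"])
print(ranges)
```

Each field contributes one dimension, so the cluster occupies an axis-aligned box in the search space; here the temperature side of that box runs from 10 to 25.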
  • each data cluster has a size which matches a bandwidth-optimised data block transfer size of the storage. Accordingly, the data clusters may be sized to match the data block transfer size of the storage device.
  • each data cluster has a size no larger than a bandwidth-optimised data block transfer size of the storage. Accordingly, the size of each data cluster may be set to be the block transfer size or smaller. This helps to ensure that each data cluster can be transferred between the storage and data processing apparatus as efficiently as possible.
  • each data cluster has a size larger than a bandwidth-optimised data block transfer size of said storage.
  • each data cluster has a size a multiple of a bandwidth-optimised data block transfer size of said storage.
  • one or more data clusters are compressed or stored in compressed form.
  • each data cluster has associated metadata which provides at least an indication of the data range in the search space defined by values of each data entry of at least one field. Accordingly, each cluster may have metadata associated therewith. The metadata may provide or indicate the search range or search ranges in the search space which are defined by the values of the data entries of one or more fields within that data cluster.
  • the metadata stores at least one additional parameter relating to that data cluster. Accordingly, the metadata may provide additional information relating to the data cluster which may be unrelated to its search ranges.
  • the additional parameter comprises a number of entries in that data cluster. Accordingly, the number of data entries within a data cluster may be indicated within the metadata.
  • the optimizable group of data clusters may be determined from the metadata.
  • each data entry is an entry in a database.
  • the data entries may relate to entries in a database.
  • the data entries may also relate to entries in data information systems, a relational database, a NOSQL database or the like.
  • the clustering characteristic comprises a search selectivity between the existing data clusters. In one embodiment, the clustering characteristic comprises a number of existing data clusters accessed in response to search enquiries. In one embodiment, the clustering characteristic comprises a separation between existing data clusters in the search space. In one embodiment, the clustering characteristic comprises an overlap of data ranges of data values between existing data clusters in the search space. In one embodiment, the clustering characteristic comprises data ranges of data values within each existing data cluster in the search space. Accordingly, depending on implementation, various different characteristics may need to be optimized. In one embodiment, the selectivity of data clusters in response to the search may be a characteristic to be improved. In one embodiment, the number of data clusters which are accessed following a search enquiry may be a characteristic to be optimized.
  • a separation or distance between the data clusters in search space may be another characteristic to be optimized.
  • an overlap or commonality in data ranges between data clusters within the search space may be a characteristic to be optimized.
  • a data range of the data values within the data clusters may be a characteristic to optimize.
  • the selecting comprises selecting the at least one existing data cluster from the group of existing data clusters as the optimisable group of data clusters based on a size of its occupied search space. Accordingly, one or more of the data clusters may be selected from the group or set of data clusters and included in the group or set of optimizable data clusters based on how much of the search space that data cluster occupies, or based on their shape or position or fill level.
  • the fill level can be a characteristic that is useful to incorporate into an error metric, because it is advantageous that data clusters with very low fill levels are merged together. Selecting data clusters on that basis biases the selection towards larger data clusters which are more likely to cause a poor clustering characteristic.
  • the selecting comprises selecting the at least one existing data cluster from the group of existing data clusters as the optimisable group of data clusters based on a number of intersections in the search space with other existing data clusters. Accordingly, selecting those data clusters which intersect or overlap with more data clusters than others biases the optimization towards those data clusters which are likely to cause a poor clustering characteristic.
  • the selecting comprises selecting the at least one existing data cluster from the group of existing data clusters as the optimisable group of data clusters based only on associated metadata. Accordingly, the optimizable group of data clusters may be determined using the stored metadata for those data clusters. This avoids the need to access the data clusters themselves, or perform any searching within the data clusters to make that selection. This significantly improves the performance of the selection.
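
As a rough illustration of selecting by number of intersections using only metadata, the sketch below counts, for each cluster, how many other clusters overlap its data range in every dimension. The `ClusterMeta` record, its field names and the closed-interval overlap test are assumptions made for this example, not the structures described in the source.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ClusterMeta:
    """Hypothetical per-cluster metadata: data range per field plus an entry count."""
    cluster_id: int
    ranges: Dict[str, Tuple[float, float]]
    entry_count: int


def overlaps(a: ClusterMeta, b: ClusterMeta) -> bool:
    """True if the closed ranges of a and b intersect in every dimension of a."""
    return all(
        a.ranges[d][0] <= b.ranges[d][1] and b.ranges[d][0] <= a.ranges[d][1]
        for d in a.ranges
    )


def intersection_counts(metas: List[ClusterMeta]) -> Dict[int, int]:
    """Count, per cluster, the other clusters whose ranges intersect it."""
    return {
        m.cluster_id: sum(1 for o in metas if o is not m and overlaps(m, o))
        for m in metas
    }


metas = [
    ClusterMeta(1, {"temperature": (0.0, 10.0)}, 100),
    ClusterMeta(2, {"temperature": (5.0, 15.0)}, 80),
    ClusterMeta(3, {"temperature": (12.0, 20.0)}, 90),
]
counts = intersection_counts(metas)
# The cluster with the most intersections is a candidate for the optimisable group.
selected = max(counts, key=counts.get)
print(counts, selected)
```

No cluster contents are read at any point; only the stored ranges are compared, which is the property the paragraph above relies on.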
  • the method comprises generating a group of ideal data clusters from the group of existing data clusters, the group of ideal data clusters being generated within the search space occupied by the group of existing data clusters and wherein the selecting comprises selecting at least one existing data cluster from the group of existing data clusters which intersects a selected ideal data cluster in the search space as the optimisable group of data clusters. Accordingly, an idealised group of data clusters, which, if it existed, would provide improvement to the clustering characteristic, may be created. At least one of the data clusters which, when the ideal data clusters are overlaid in the search space, intersects, covers, falls within or crosses the boundary of a particular or selected ideal data cluster is selected for the optimizable group of data clusters. Selecting a data cluster which deviates from the ideal ensures that a sub-optimal data cluster is selected for optimization.
  • the generating comprises generating the group of ideal data clusters from the group of existing data clusters, the group of ideal data clusters being generated within the search space occupied by the group of existing data clusters based on an ideal clustering criteria which would improve the clustering characteristic.
  • the ideal clustering criteria comprises an increase in a search selectivity between the existing data clusters. In one embodiment, the ideal clustering criteria comprises a decrease in a number of existing data clusters accessed in response to search enquiries. In one embodiment, the ideal clustering criteria comprises an increase in a separation between existing data clusters in the search space. In one embodiment, the ideal clustering criteria comprises a decrease in an overlap of data ranges of data values between existing data clusters in the search space. In one embodiment, the ideal clustering criteria comprises a decrease in a data range of data values within each or at least one existing data cluster in the search space. In one embodiment, the generating comprises generating the group of ideal data clusters based on an assumed distribution of data entries within the search space within each existing data cluster.
  • the ideal group of clusters may be generated using a simplified assumption that the data entries are distributed within those ideal data clusters in accordance with a particular distribution. This again helps to simplify the generation of the ideal data clusters, which avoids the need to perform data accesses to retrieve the actual data clusters and minimises the processing required.
  • the generating comprises generating the group of ideal data clusters using a partitioning algorithm, scheme or process which partitions the search space to have similar numbers of data entries in each ideal data cluster. Accordingly, a partitioning algorithm is employed to partition the search space into regions, each of which has as close as possible to identical numbers of data entries or which occupy a similar amount of space.
  • the partitioning algorithm seeks to partition the data entries at least once in each dimension of the search space.
  • the partitioning algorithm seeks to provide an equal number of split planes in each dimension of the search space.
  • the partitioning algorithm splits more often in certain dimensions based on user-defined settings or automatically learned usage patterns.
  • the partitioning algorithm comprises a KD-tree. It will be appreciated that other algorithms such as quadtrees, octrees, BSP trees and the like may be utilised.
  • the method comprises, for each ideal data cluster in the group of ideal data clusters, determining a deviation in occupied search space between that ideal data cluster and existing data clusters intersecting that ideal data cluster and wherein the selecting comprises selecting the selected ideal data cluster based on the deviation.
  • the selecting comprises selecting the selected ideal data cluster having maximum deviation. Accordingly, that data cluster which deviates the most from the ideal may be selected.
  • the method comprises, for each existing data cluster intersecting the selected ideal data cluster, determining a deviation in occupied search space between that existing data cluster and the selected ideal data cluster and wherein the selecting comprises selecting at least one of the existing data clusters intersecting the selected ideal data cluster for inclusion in the optimisable group of data clusters based on the deviation. Accordingly, for every other data cluster which crosses, overlaps or intersects the selected ideal data cluster, a deviation may also be determined.
  • the selecting comprises selecting that existing data cluster intersecting the selected ideal data cluster having a maximum deviation for inclusion in the optimisable group of data clusters.
  • the selecting comprises selecting that existing data cluster intersecting the selected ideal data cluster which most occupies the search space.
  • the selecting comprises selecting neighbouring existing data clusters to that existing data cluster having the maximum deviation for inclusion in the optimisable group of data clusters. Accordingly, those clusters which neighbour or are proximate to the selected data cluster may be included in the optimizable group. This helps to ensure that clusters near each other which could potentially collide during searches are optimized.
  • the neighbouring existing data clusters include existing data clusters which most occupy the search space. Accordingly, those clusters which extend furthest within the search space or occupy the greatest area or volume within search space may be included in the optimizable group. Again, this helps to include clusters which are more likely to fall within a search.
  • the neighbouring existing data clusters include existing data clusters which are closest in the search space to that existing data cluster having the maximum deviation. Accordingly, those data clusters which are most proximate to the selected cluster may be included in the optimizable group.
  • the neighbouring existing data clusters overlap in the search space with that existing data cluster having the maximum deviation. Accordingly, those clusters which intersect with or share the same space as the selected cluster may be included in the optimizable group. In one embodiment, the selecting comprises selecting that existing data cluster intersecting the selected ideal data cluster which most occupies the search space.
  • the forming comprises forming the group of optimised data clusters by allocating data entries of the optimisable group of data clusters to each optimised data cluster using a partitioning algorithm which would improve the clustering characteristic. Accordingly, the data entries within the optimizable group may be allocated to each optimized data cluster by partitioning the data clusters in search space.
  • the partitioning algorithm partitions the search space occupied by the group of optimised data clusters to have similar numbers of data entries in each optimised data cluster. Accordingly, the partitioning may seek to balance the number of data entries in each optimized cluster so that each optimized cluster has near identical numbers of data entries.
  • the partitioning algorithm seeks to maintain more than a 75% fill average in each optimised data cluster. Accordingly, a minimum fill average may be set for each criteria in order to balance the number of data entries in each data cluster.
  • That fill average may be a high fill average.
  • the partitioning algorithm seeks to partition the data entries at least once in each dimension of the search space. In one embodiment, the partitioning algorithm splits more often in certain dimensions based on user-defined settings or automatically learned usage patterns. In one embodiment, the partitioning algorithm seeks to provide an equal number of split planes in each dimension of the search space.
  • the partitioning algorithm comprises a KD-tree. It will be appreciated that other algorithms such as quadtrees, octrees, BSP trees and the like may be utilised.
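
One way to picture the KD-tree style allocation is the sketch below, which recursively splits the entries of an optimisable group at the median, cycling through the dimensions, until each part fits a maximum cluster capacity. The function names, the tuple representation of entries and the `capacity` parameter are assumptions for illustration. The sketch only reports the resulting fill average; it does not enforce the 75% target mentioned above.

```python
from typing import List, Sequence

# Hypothetical representation: an entry is a point, one coordinate per search dimension.
Point = Sequence[float]


def kd_partition(entries: List[Point], capacity: int, depth: int = 0) -> List[List[Point]]:
    """Split entries at the median, cycling dimensions, until each part fits capacity."""
    if len(entries) <= capacity:
        return [list(entries)]
    dim = depth % len(entries[0])          # alternate the split dimension per level
    ordered = sorted(entries, key=lambda p: p[dim])
    mid = len(ordered) // 2                # median split balances entry counts
    return (kd_partition(ordered[:mid], capacity, depth + 1)
            + kd_partition(ordered[mid:], capacity, depth + 1))


def fill_average(clusters: List[List[Point]], capacity: int) -> float:
    """Average fraction of capacity used across the clusters."""
    return sum(len(c) for c in clusters) / (capacity * len(clusters))


# Sixteen entries on a 4 x 4 grid, pooled from an optimisable group of clusters.
entries = [(float(x), float(y)) for x in range(4) for y in range(4)]
optimised = kd_partition(entries, capacity=4)
print(len(optimised), fill_average(optimised, 4))
```

With this input the search space is split once in each dimension, giving four clusters of four entries whose data ranges do not overlap. Uneven inputs can leave a median split below the desired fill level, so a fuller implementation would need an additional balancing or merging step.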
  • the optimising clustering criteria seeks to form optimised data clusters which minimise a deviation with respect to the group of ideal data clusters.
  • the forming comprises allocating the data entries of the optimisable group of data clusters to each optimised data cluster subject to a maximum number of data entries being provided in each optimised data cluster.
  • the selecting comprises selecting overlapping existing data clusters which store data entries having overlapping data ranges within a plurality of fields defining a plurality of the search space dimensions as the optimisable group of data clusters and the forming comprises allocating the data entries from the optimisable group of data clusters to form the optimised data clusters having minimised
  • the forming comprises allocating the data entries from the optimisable group of data clusters to form the optimised data clusters having optimised shape.
  • the forming comprises allocating the data entries from the optimisable group of data clusters to form the optimised data clusters having minimised data range overlap in each search space dimension.
  • the forming comprises allocating the data entries from the optimisable group of data clusters to form the optimised data clusters having eliminated overlapping data ranges in each search space dimension.
  • the forming comprises allocating the data entries from the optimisable group of data clusters to form optimised data clusters having non overlapping optimised data ranges in each search space dimension.
  • the forming comprises allocating the data entries from the optimisable group of data clusters to form optimised data clusters whose distance between the non-overlapping optimised data ranges is maximised in each search space dimension.
  • the forming comprises partitioning the data entries from the optimisable group of data clusters using a partitioning algorithm.
  • the partitioning algorithm seeks to maintain more than a 75% fill average in each optimised data cluster.
  • the partitioning algorithm seeks to partition the data entries from the optimisable group of data clusters at least once in each search space dimension.
  • the partitioning algorithm seeks to provide an equal number of split planes in each search space dimension.
  • the partitioning algorithm seeks to partition regions of less dense data value distribution into optimised data clusters having more dense data value distribution.
  • the partitioning algorithm comprises a KD-tree algorithm. It will be appreciated that other algorithms such as quadtrees, octrees, BSP trees and the like may be utilised.
  • the method comprises storing each optimised data cluster in the storage.
  • the method comprises identifying a range of data values for each search space dimension within each data cluster and storing an indicator of each range of data values as the metadata for each corresponding data cluster.
  • metadata may be stored for each data range to provide an index for each searchable field.
  • the method comprises ordering the range in accordance with an ordering indicator for each search dimension.
  • the range identifies at least a maximum and minimum data value for that search space dimension within that data cluster.
  • the method comprises storing an indicator of the data values for each search space dimension. Such an indicator may be configured to exclude certain patterns such as when applying a Bloom filter.
  • the method comprises incorporating each metadata into a search tree for all data clusters. Accordingly, the metadata may be incorporated into a search tree to facilitate efficient searching of the metadata of each data cluster.
  • the search tree comprises a KD-tree. It will be appreciated that other algorithms such as quadtrees, octrees, BSP trees and the like may be utilised.
  • the method comprises storing all or parts of the metadata in a compressed form.
  • the method comprises storing, with the metadata, a pointer to a location of each corresponding data cluster or clusters in the storage.
  • the metadata may include a pointer. It may be useful to partition data values of data entries across multiple data clusters, in which case more than one pointer may be stored in the metadata.
  • the metadata may include an indication of the location of each data cluster in the storage.
  • the metadata may also include a size indicator in order to identify where each cluster begins.
  • the method comprises storing with the metadata an entries counter providing an indication of how many data entries are within each data cluster.
  • the method comprises storing with the metadata statistical information about the data entries stored within each data cluster.
  • the method comprises selecting a field as a search space dimension based on historic search requests. Accordingly, the fields which are selected to be included in the metadata may be selected actively, based on searches that are being made.
  • the method comprises nulling the group of existing data clusters. Accordingly, when the optimized data clusters have been stored, the existing data clusters which they replace are nulled.
  • the method comprises iteratively repeating the identifying, selecting and forming. Accordingly, the optimization can be iteratively repeated in order to optimize the data clusters.
  • the method comprises receiving data entries to be stored in a new data cluster and buffering the data entries until a minimal data cluster size has been reached. Accordingly, individual data entries may be received and buffered until a minimal size of data cluster formed from those received data entries is achieved.
  • the minimal data cluster size comprises the bandwidth-optimised data block transfer size of the storage device.
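
The buffering step can be sketched as follows. For simplicity the threshold is expressed as a number of entries rather than the bandwidth-optimised block transfer size in bytes, and the class name `ClusterBuffer` and its attributes are hypothetical.

```python
from typing import Any, List


class ClusterBuffer:
    """Buffer incoming entries and emit a new cluster once a minimum size is reached."""

    def __init__(self, min_entries: int) -> None:
        # Assumption: the minimum is counted in entries, standing in for a byte size.
        self.min_entries = min_entries
        self.pending: List[Any] = []
        self.stored_clusters: List[List[Any]] = []

    def add(self, entry: Any) -> None:
        """Buffer one entry; store a new cluster when enough entries are pending."""
        self.pending.append(entry)
        if len(self.pending) >= self.min_entries:
            self.stored_clusters.append(self.pending)  # stand-in for a write to storage
            self.pending = []


buf = ClusterBuffer(min_entries=2)
for value in [11, 12, 13, 14, 15]:
    buf.add(value)
print(buf.stored_clusters, buf.pending)
```

After five insertions two clusters of two entries have been stored and one entry is still waiting in the buffer.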
  • the method comprises deferring the iteratively repeating until the new data cluster has been stored. Accordingly, the optimizing of data clusters may be deferred or its priority reduced while data entries are pending being stored.
  • the method comprises receiving a search request for data and interrogating the metadata to identify candidate data clusters whose range of data values encompasses the search request. Accordingly, when a search request is received then the metadata may be searched to identify potential data clusters which may store data values satisfying that search.
  • the interrogating the metadata comprises interrogating the search tree.
  • the method comprises returning a result of the search request based only on the metadata.
  • the metadata may indicate that no data cluster can contain a data value matching the search criteria, in which case no access to the data clusters is required.
  • some searches may relate to data stored within the metadata itself, such as returning a number of entries falling within a search range or matching search criteria. In that case, the answer to the query can be returned again without needing to access the data clusters themselves. It will be appreciated that various different values can be stored in the metadata to enhance such search queries. Should the metadata indicate that matching data values may be present in one or more data clusters, then those data clusters may be interrogated.
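
The two metadata-only outcomes described above (no cluster can match, or the answer is already held in the metadata) can be sketched like this. The `ClusterMeta` record and both function names are assumptions for the example. The count is answered from metadata only when every candidate cluster lies entirely inside the queried range, which is one possible reading of the passage rather than a rule stated in the source.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ClusterMeta:
    """Hypothetical per-cluster metadata: data range per field plus an entry count."""
    cluster_id: int
    ranges: Dict[str, Tuple[float, float]]
    entry_count: int


def candidate_clusters(metas: List[ClusterMeta], field: str,
                       lo: float, hi: float) -> List[ClusterMeta]:
    """Clusters whose stored range for field intersects the query range [lo, hi]."""
    return [m for m in metas if m.ranges[field][0] <= hi and lo <= m.ranges[field][1]]


def count_from_metadata(metas: List[ClusterMeta], field: str,
                        lo: float, hi: float) -> Optional[int]:
    """Count matching entries from metadata alone, or None if clusters must be read."""
    total = 0
    for m in candidate_clusters(metas, field, lo, hi):
        c_lo, c_hi = m.ranges[field]
        if not (lo <= c_lo and c_hi <= hi):
            return None  # cluster only partly inside the query: its entries must be read
        total += m.entry_count
    return total


metas = [
    ClusterMeta(1, {"temperature": (10.0, 25.0)}, 100),
    ClusterMeta(2, {"temperature": (30.0, 40.0)}, 50),
    ClusterMeta(3, {"temperature": (20.0, 35.0)}, 70),
]
no_match = candidate_clusters(metas, "temperature", 41.0, 50.0)
full_count = count_from_metadata(metas, "temperature", 5.0, 45.0)
needs_read = count_from_metadata(metas, "temperature", 0.0, 26.0)
print(no_match, full_count, needs_read)
```

A query for temperatures between 41 and 50 touches no cluster at all, a count over 5 to 45 is answered from the entry counters, and a count over 0 to 26 falls back to interrogating the candidate clusters.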
  • the method comprises returning an approximate result of the search request based only on the statistical information stored in the metadata. In one embodiment, the method comprises interrogating the candidate data clusters to return a result of the search request.
  • the interrogating the candidate data clusters comprises
  • the method comprises performing a join operation between said group of optimised data clusters and another group of optimised data clusters.
  • the method comprises performing a join operation between an optimised data cluster within said group of optimised data clusters and an optimised data cluster within said another group of optimised data clusters.
  • a data processing apparatus comprising: identification logic operable to identify a group of existing data clusters having a clustering characteristic, each existing data cluster comprising data entries of a data information system stored together in storage, each data entry having at least one field storing data values; selection logic operable to select at least one existing data cluster from the group of existing data clusters as an optimisable group of data clusters; and formation logic operable to form a group of optimised data clusters by allocating data entries of the optimisable group of data clusters to each optimised data cluster to improve the clustering characteristic for the group of optimised data clusters compared to the group of existing data clusters.
  • the group of data clusters may have a characteristic, feature or parameter that can be related to a metric.
  • each data cluster occupies a data range in a search space defined by values of each data entry of each field.
  • the search space has ‘n’ dimensions, each dimension being defined by a corresponding ‘n’ field.
  • each data cluster has a size which matches a bandwidth-optimised data block transfer size of the storage. In one embodiment, each data cluster has a size no larger than a bandwidth-optimised data block transfer size of the storage.
  • each data cluster has a size larger than a bandwidth-optimised data block transfer size of said storage.
  • each data cluster has a size a multiple of a bandwidth-optimised data block transfer size of said storage.
  • one or more data clusters are compressed or stored in compressed form.
  • each data cluster has associated metadata which provides at least an indication of the data range in the search space defined by values of each data entry of at least one field.
  • the metadata stores at least one additional parameter relating to that data cluster.
  • the additional parameter comprises a number of entries in that data cluster.
  • the optimizable group of data clusters is determined from the metadata.
  • each data entry is an entry in a database.
  • the clustering characteristic comprises a search selectivity between the existing data clusters. In one embodiment, the clustering characteristic comprises a number of existing data clusters accessed in response to search enquiries. In one embodiment, the clustering characteristic comprises a separation between existing data clusters in the search space. In one embodiment, the clustering characteristic comprises an overlap of data ranges of data values between existing data clusters in the search space. In one embodiment, the clustering characteristic comprises data ranges of data values within each existing data cluster in the search space. In one embodiment, the selection logic is operable to select the at least one existing data cluster from the group of existing data clusters as the optimisable group of data clusters based on a size of its occupied search space.
  • the selection logic is operable to select the at least one existing data cluster from the group of existing data clusters as the optimisable group of data clusters based on a number of intersections in the search space with other existing data clusters.
  • the selection logic is operable to select the at least one existing data cluster from the group of existing data clusters as the optimisable group of data clusters based only on associated metadata.
  • the identification logic is operable to generate a group of ideal data clusters from the group of existing data clusters, the group of ideal data clusters being generated within the search space occupied by the group of existing data clusters and wherein the selection logic is operable to select at least one existing data cluster from the group of existing data clusters which intersects a selected ideal data cluster in the search space as the optimisable group of data clusters.
  • the identification logic is operable to generate the group of ideal data clusters from the group of existing data clusters, the group of ideal data clusters being generated within the search space occupied by the group of existing data clusters based on an ideal clustering criteria which would improve the clustering characteristic.
  • the ideal clustering criteria comprises an increase in a search selectivity between the existing data clusters. In one embodiment, the ideal clustering criteria comprises a decrease in a number of existing data clusters accessed in response to search enquiries. In one embodiment, the ideal clustering criteria comprises an increase in a separation between existing data clusters in the search space. In one embodiment, the ideal clustering criteria comprises a decrease in an overlap of data ranges of data values between existing data clusters in the search space. In one embodiment, the ideal clustering criteria comprises a decrease in a data range of data values within each or at least one existing data cluster in the search space. In one embodiment, the identification logic is operable to generate the group of ideal data clusters based on an assumed distribution of data entries within the search space within each existing data cluster.
  • the identification logic is operable to generate the group of ideal data clusters using a partitioning algorithm, scheme or process which partitions the search space to have similar numbers of data entries in each ideal data cluster.
  • the partitioning algorithm seeks to partition the data entries at least once in each dimension of the search space.
  • the partitioning algorithm seeks to provide an equal number of split planes in each dimension of the search space.
  • the partitioning algorithm splits more often in certain dimensions based on user-defined settings or automatically learned usage patterns.
  • the partitioning algorithm comprises a KD-tree. It will be appreciated that other algorithms such as quadtrees, octrees, BSPtrees and the like may be utilised.
  • the identification logic is operable, for each ideal data cluster in the group of ideal data clusters, to determine a deviation in occupied search space between that ideal data cluster and existing data clusters intersecting that ideal data cluster and the selection logic is operable to select the selected ideal data cluster based on the deviation.
  • the selection logic is operable to select the selected ideal data cluster having maximum deviation.
  • the identification logic is operable, for each existing data cluster intersecting the selected ideal data cluster, to determine a deviation in occupied search space between that existing data cluster and the selected ideal data cluster and the selection logic is operable to select at least one of the existing data clusters intersecting the selected ideal data cluster for inclusion in the optimisable group of data clusters based on the deviation.
  • the selection logic is operable to select that existing data cluster intersecting the selected ideal data cluster having a maximum deviation for inclusion in the optimisable group of data clusters.
  • the selection logic is operable to select that existing data cluster intersecting the selected ideal data cluster which most occupies the search space.
  • the selection logic is operable to select neighbouring existing data clusters to that existing data cluster having the maximum deviation for inclusion in the optimisable group of data clusters.
  • the neighbouring existing data clusters include existing data clusters which most occupy the search space.
  • the neighbouring existing data clusters include existing data clusters which are closest in the search space to that existing data cluster having the maximum deviation.
  • the neighbouring existing data clusters overlap in the search space with that existing data cluster having the maximum deviation.
  • the selection logic is operable to select that existing data cluster intersecting the selected ideal data cluster which most occupies the search space.
  • the formation logic is operable to form the group of optimised data clusters by allocating data entries of the optimisable group of data clusters to each optimised data cluster using a partitioning algorithm which would improve the clustering characteristic.
  • the partitioning algorithm partitions the search space occupied by the group of optimised data clusters to have similar numbers of data entries in each optimised data cluster.
  • the partitioning algorithm seeks to maintain more than a 75% fill average in each optimised data cluster. In one embodiment, the partitioning algorithm seeks to partition the data entries at least once in each dimension of the search space.
  • the partitioning algorithm splits more often in certain dimensions based on user-defined settings or automatically learned usage patterns.
  • the partitioning algorithm seeks to provide an equal number of split planes in each dimension of the search space.
  • the partitioning algorithm comprises a KD-tree. It will be appreciated that other algorithms such as quadtrees, octrees, BSPtrees and the like may be utilised.
  • the formation logic is operable to form optimised data clusters which minimise a deviation with respect to the group of ideal data clusters.
  • the formation logic is operable to allocate the data entries of the optimisable group of data clusters to each optimised data cluster subject to a maximum number of data entries being provided in each optimised data cluster.
  • the selection logic is operable to select overlapping existing data clusters which store data entries having overlapping data ranges within a plurality of fields defining a plurality of the search space dimensions as the optimisable group of data clusters and the formation logic is operable to allocate the data entries from the optimisable group of data clusters to form the optimised data clusters having minimised overlapping data ranges in each search space dimension.
  • the formation logic is operable to allocate the data entries from the optimisable group of data clusters to form the optimised data clusters having optimised shape.
  • the formation logic is operable to allocate the data entries from the optimisable group of data clusters to form the optimised data clusters having minimised data range overlap in each search space dimension. In one embodiment, the formation logic is operable to allocate the data entries from the optimisable group of data clusters to form the optimised data clusters having eliminated overlapping data ranges in each search space dimension.
  • the formation logic is operable to allocate the data entries from the optimisable group of data clusters to form optimised data clusters having non overlapping optimised data ranges in each search space dimension.
  • the formation logic is operable to allocate the data entries from the optimisable group of data clusters to form optimised data clusters whose distance between the non-overlapping optimised data ranges is maximised in each search space dimension.
  • the formation logic is operable to partition the data entries from the optimisable group of data clusters using a partitioning algorithm.
  • the partitioning algorithm seeks to maintain more than a 75% fill average in each optimised data cluster.
  • the partitioning algorithm seeks to partition the data entries from the optimisable group of data clusters at least once in each search space dimension.
  • the partitioning algorithm seeks to provide an equal number of split planes in each search space dimension.
  • the partitioning algorithm seeks to partition regions of less dense data value distribution into optimised data clusters having more dense data value distribution.
  • the partitioning algorithm comprises a KD-tree algorithm. It will be appreciated that other algorithms such as quadtrees, octrees, BSPtrees and the like may be utilised.
  • the apparatus comprises storing logic operable to store each optimised data cluster in the storage.
  • the apparatus comprises metadata logic operable to identify a range of data values for each search space dimension within each data cluster and to store an indicator of each range of data values as the metadata for each corresponding data cluster.
  • the metadata logic is operable to order the range in accordance with an ordering indicator for each search dimension.
  • the range identifies at least a maximum and minimum data value for that search space dimension within that data cluster.
  • the storing logic is operable to store an indicator of the data values for each search space dimension.
  • the metadata logic is operable to incorporate each index into a search tree for all data clusters.
  • the search tree comprises a KD-tree. It will be appreciated that other algorithms such as quadtrees, octrees, BSPtrees and the like may be utilised.
  • the storing logic is operable to store all or parts of the metadata in a compressed form.
  • the metadata logic is operable to store with the metadata a pointer to a location of each corresponding data cluster in the storage.
  • the metadata logic is operable to store with the metadata an entries counter providing an indication of how many data entries are within each data cluster.
  • the storing logic is operable to store with the metadata statistical information about the data entries stored within each data cluster.
  • the metadata logic is operable to select a field as a search space dimension based on historic search requests. In one embodiment, the storing logic is operable to null the group of existing data clusters.
  • the identification logic is operable to repeatedly identify a group of existing data clusters, the selection logic is operable to select at least one existing data cluster and the formation logic is operable to form a group of optimised data clusters iteratively.
  • the apparatus comprises buffering logic operable to receive data entries to be stored in a new data cluster and to buffer the data entries until a minimal data cluster size has been reached.
  • the minimal data cluster size comprises the bandwidth-optimised data block transfer size of the storage device.
  • the buffering logic is operable to defer the iteratively repeating until the new data cluster has been stored.
  • the apparatus comprises search logic operable to receive a search request for data and to interrogate the metadata to identify candidate data clusters whose range of data values encompasses the search request.
  • the search logic is operable to interrogate the search tree.
  • the search logic is operable to return a result of the search request based only on the metadata.
  • the search logic is operable to return an approximate result of the search request based only on the statistical information stored in the metadata.
  • the search logic is operable to interrogate the candidate data clusters to return a result of the search request.
  • the search logic is operable to interrogate only the candidate data clusters to return the result of the search request.
  • the apparatus comprises joining logic operable to perform a join operation between said group of optimised data clusters and another group of optimised data clusters.
  • the joining logic is operable to perform a join operation between an optimised data cluster within said group of optimised data clusters and an optimised data cluster within said another group of optimised data clusters.
  • a computer program product operable, when executed on a computer, to perform the method of the first aspect.
  • Figure 1 illustrates a data processing apparatus according to one embodiment
  • Figure 2 illustrates the main processing steps performed by the data processing apparatus when receiving data entries according to one embodiment
  • Figure 3 illustrates the main processing steps performed by the data processing apparatus when optimising data clusters according to one embodiment
  • Figures 4A to 4N illustrate optimising data clusters according to one embodiment
  • Figure 5 illustrates the main processing steps performed by the data processing apparatus in response to a query according to one embodiment
  • Figures 6A and 6B illustrate data join operations according to one embodiment.
  • Embodiments recognise that the way that data is stored in a data information system may be sub-optimal for efficiently performing operations during data processing since the data is often distributed in storage in a manner that makes an operation inefficient, which reduces the processing speed of the data processing apparatus. Accordingly, embodiments store data in data clusters stored in storage and optimise those data clusters. Each data cluster in storage typically has a maximum size which is determined by an optimal data transfer size between the storage and processing logic performing the processing. While this optimises accesses between the data processor and the storage, the content of the data clusters themselves may be unrelated, and even random. For example, consider the situation where the data information system is an inventory database.
  • each data entry may be a row in the database having the fields “item identifier”, “purchase price”, “purchaser identifier”, “firmware date”, “item location”, etc.
  • a transaction may be provided to the data processing apparatus which buffers the transactions as a data cluster of entries until that data cluster matches the optimal size for transfer to the storage.
  • the entries are likely to be widely distributed. That is to say, for any data cluster the range of firmware dates is likely to be widely distributed, as are the identifiers.
  • metadata is provided which provides a search index which indicates a range of values stored in each field which may need to be searched.
  • the metadata may provide an indication of the range of firmware dates of the entries within that data cluster and/or the range of values of each identifier of the entries within that data cluster, etc.
  • a search may be for items with a purchase price of more than $300 which have a firmware date of more than four years ago. Any data clusters whose metadata indicates that their entries have a purchase price of less than $300, or a firmware date which is less than four years ago, can be ignored. However, any data cluster which cannot be ignored must be retrieved to perform the operation using the data entries.
  • a characteristic of the data entries in these data clusters is that they are likely to be widely distributed and lack correlation as each transaction is likely to be reasonably random. Accordingly, many data clusters stored in this way may need to be accessed and their entries interrogated in order to perform the operation. It will be appreciated that, even then, a null answer may be returned if there are no entries matching the operation criteria.
  • embodiments perform optimization on the stored or existing data clusters in order to perform the operation more efficiently by avoiding or reducing the number of accesses to storage required in response to queries.
  • This optimization can be adapted to suit the particular physical and functional constraints or characteristics of the data processing apparatus and its storage.
  • the optimization procedure involves identifying existing data clusters stored in the storage which exhibit characteristics which are likely to lead to poor or reduced search efficiency. For example, if more than a particular number of data clusters have overlapping data ranges, then it is likely that a search will encompass that overlapping range and each of those data clusters may need to be retrieved in order to return a result to that query.
  • search performance can be improved by decreasing the likelihood that data clusters not containing the data required for the operation are returned, so that only a minimal number of data clusters that may contain the data being searched for are returned.
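  • The metadata-based pruning described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the `ClusterMeta` layout, the field names `purchase_price` and `firmware_age_years`, and the use of inclusive bounds are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ClusterMeta:
    """Per-cluster metadata holding a (min, max) range per field (hypothetical layout)."""
    name: str
    ranges: dict  # field name -> (min_value, max_value)


def may_match(meta, predicates):
    """Return False only when the metadata proves that no entry can match.

    predicates maps a field to a (low, high) query interval; None means an
    open bound. Bounds are treated as inclusive, which is a simplification.
    """
    for field, (low, high) in predicates.items():
        cmin, cmax = meta.ranges[field]
        if low is not None and cmax < low:
            return False  # every value in the cluster is below the query range
        if high is not None and cmin > high:
            return False  # every value in the cluster is above the query range
    return True


def candidate_clusters(metas, predicates):
    """Names of the clusters that still have to be retrieved from storage."""
    return [m.name for m in metas if may_match(m, predicates)]


clusters = [
    ClusterMeta("c1", {"purchase_price": (15, 295), "firmware_age_years": (1, 3)}),
    ClusterMeta("c2", {"purchase_price": (310, 900), "firmware_age_years": (5, 9)}),
    ClusterMeta("c3", {"purchase_price": (100, 450), "firmware_age_years": (2, 6)}),
]
query = {"purchase_price": (300, None), "firmware_age_years": (4, None)}
print(candidate_clusters(clusters, query))  # ['c2', 'c3']
```

  • Cluster c3 remains a candidate even though it may contain no matching entry, because its ranges are wide; this is the kind of cluster the optimisation seeks to tighten.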
  • Data Processing Apparatus
  • Figure 1 illustrates a data processing apparatus, generally 100, according to one embodiment.
  • the data processing apparatus 100 has one or more processor cores 120 arranged to execute a sequence of instructions that are applied to data supplied to the processor core 120 over a bus 115.
  • the term data value will be used to refer to either instructions or data.
  • a memory 150 may be provided for storing the data values required by the processor core 120.
  • a cache 160 may also be provided for storing data values required by the processor core 120, thus increasing the speed of processing since the number of accesses required to the memory 150 is reduced.
  • Data values may also be received from and provided to external devices such as a storage device 110 using input/output logic 140 via the bus 115.
  • FIG. 2 illustrates the main processing steps performed by the data processing apparatus 100 when receiving data entries, according to one embodiment.
  • a data entry is received.
  • each data entry is arranged to store data values in one or more fields, which typically store different types of data values, as is common in data information systems.
  • At step S20, a determination is made of whether sufficient data entries have been received to perform an efficient data transfer with the storage device 110. If insufficient data entries have been received, then processing returns to step S10 where the received data entry is buffered and added to by subsequently received data entries. When it is determined at step S20 that sufficient data entries have been buffered to perform an efficient data transfer to the storage device 110, then processing proceeds to step S30.
  • Metadata is generated which provides a search index for one or more fields in each data entry. Metadata may also be stored for other information such as the number of entries in a data cluster, an average value of a field in the data cluster, etc.
  • Metadata may instead be generated for the fields which are most commonly searched.
  • the metadata may indicate particular values stored in fields in the data cluster. More typically, the metadata indicates, for each field which requires a search index, the range of values stored by data entries within that field within that data cluster. One such range may indicate a maximum value and a minimum value of data stored in a particular field in that data cluster, a mid-value and distance from that mid-value, or in any other way.
  • a particular data cluster may have metadata indicating that the firmware date of entries in that data cluster ranges from 1 Mar. 15 to 28 Oct. 16 and the purchase price of entries in that data cluster ranges from $15 to $295.
  • the data cluster is then stored in the storage device 110.
  • a pointer is added to the metadata for the data cluster which has been stored in storage to indicate its location in that storage.
  • the metadata is also typically stored at a location in the storage device 110, but a copy may be retained in memory in order to facilitate fast interrogation of the metadata.
  • the data cluster can be made available to the data information system for interrogation. However, it will be appreciated that the metadata can be made available for interrogation earlier than this. Processing then returns to step S10 to await further data entries.
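  • The receive, buffer and store flow of Figure 2 can be sketched as follows. This is an illustrative sketch under assumptions: the class `ClusterWriter`, the in-memory list standing in for the storage device 110, and the use of an entry count as the "sufficient data" test are all hypothetical choices made for the example.

```python
class ClusterWriter:
    """Buffers incoming entries and writes them out as fixed-size data clusters."""

    def __init__(self, entries_per_cluster):
        # Stands in for the bandwidth-optimised block transfer size (assumption:
        # measured here as a number of entries rather than bytes).
        self.entries_per_cluster = entries_per_cluster
        self.buffer = []
        self.storage = []   # stands in for the storage device 110
        self.metadata = []  # one metadata record per stored data cluster

    def receive(self, entry):
        """Step S10: buffer the entry. Step S20: flush once enough are buffered."""
        self.buffer.append(entry)
        if len(self.buffer) >= self.entries_per_cluster:
            self._flush()

    def _flush(self):
        """From step S30 onward: build metadata, store the cluster, record its location."""
        cluster, self.buffer = self.buffer, []
        fields = cluster[0].keys()
        meta = {
            # Search index: minimum and maximum value per field in this cluster.
            "ranges": {f: (min(e[f] for e in cluster), max(e[f] for e in cluster))
                       for f in fields},
            # Additional parameter: number of entries in this cluster.
            "entries": len(cluster),
        }
        self.storage.append(cluster)
        meta["pointer"] = len(self.storage) - 1  # location of the cluster in storage
        self.metadata.append(meta)


writer = ClusterWriter(entries_per_cluster=3)
for price in [10, 50, 30, 200, 120, 80, 5]:
    writer.receive({"price": price})

print(len(writer.storage), "clusters stored,", len(writer.buffer), "entry still buffered")
print(writer.metadata[1])
```

  • Because entries are flushed in arrival order, the resulting ranges depend only on what happened to arrive together, which is why the ranges of such clusters tend to be wide.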
  • Figure 3 and Figures 4A to 4N illustrate the main processing steps performed by the data processing apparatus 100 when optimizing data clusters.
  • the data clusters to be optimized may include all of the data clusters stored by the storage device 110 or a subset of those data clusters. The selection may be random or based on some metric such as clusters which are often retrieved but do not answer a query.
  • the metadata for data clusters to be optimized is retrieved. Such retrieval may occur from the storage device 110, memory 150 or cache 160, depending on implementation.
  • a group of existing data clusters 10 are selected.
  • every existing data cluster is selected.
  • the metadata for this group of existing data clusters 10 stores ranges for fields A and B.
  • the metadata for data cluster 10-1 indicates that the values of data entries within that cluster fall within the range A1A - A1B and within the range B1A - B1B.
  • the metadata for the other data clusters are mapped in a similar way. It will be appreciated that these ranges may be numerical ranges or any other range which forms a metric space which is, for example, definable in Euclidean space and whose size can be determined (for example a Hamming distance).
  • the complete group of existing data clusters 10 occupies a search space 20 bounded by AL - AU on the A axis and BL - BU on the B axis.
  • the metadata for this group of existing data clusters 10 is analysed to identify which of these data clusters to optimize. In one embodiment, this is achieved by assuming that each data value within the group of existing data clusters 10 is evenly distributed within the search space 20 in order to select a group of optimizable data clusters. Identifying the group of optimizable data clusters in this way reduces the processing burden and avoids the need to retrieve any of the existing data clusters themselves from the storage device 110 to make that determination.
  • the search space 20 is partitioned using a partitioning algorithm.
  • the partitioning algorithm used will be selected based on the
  • the partitioning algorithm initially seeks to place a partition line 25A1 along the A axis, so that the assumed number of data entries in the area 20A occupied by data clusters to one side of the line 25A1 matches the assumed number of data entries in the area 20B on the other side of the line 25A1.
  • the area 20A is split along the B axis in a similar manner by the line 25B1 and the area 20B is split in a similar way by the line 25B2.
  • the search space 20 has been partitioned into a number of separate regions which equals or exceeds the number of existing data clusters within the search space 20.
  • the search space is partitioned into 2^n regions. In this example, there were 7 data clusters, and so 8 regions have been formed. These regions represent an ideal partitioning of the search space 20 to meet the required clustering criteria.
  • this technique is often referred to as a KD-tree.
  • partitioning techniques may be used such as, for example, a quad tree, octree, BSPtree and the like.
  • the partitioning into optimized data clusters may be subject to a maximum or minimum filling constraint.
  • the particular partitioning performed is intended to partition the space into an arrangement which would represent an ideal set of clusters that would meet the particular clustering criteria which best suits the search requirements of the data information system. In this example, it is desired to provide no overlap between data clusters and an equal number of splits in each dimension, thereby creating maximum selectivity in each dimension independently.
  • the partitioning assumes that the data values within the existing data clusters are distributed in a uniform way.
  • this technique still enables optimizations of the existing data clusters to be performed to provide optimized data clusters in an efficient way which does not require excessive resources.
  • this assumption holds so badly that it can make sense to keep a small set of samples per cluster.
  • the uniformity assumption is not enough to make the optimization converge.
  • One option in these circumstances is to keep a low number of data entries per data cluster to better approximate the distribution.
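  • The metadata-only partitioning described above can be sketched as follows. This is a sketch under stated assumptions rather than the patented procedure: each existing data cluster is reduced to a bounding box with non-zero extent plus an entry count, entries are assumed to be uniformly distributed inside each box, split positions are found by bisection, and the split axis simply cycles through the dimensions. All function names are hypothetical.

```python
import math


def overlap_fraction(box, region):
    """Fraction of a box's volume lying inside a region; both are lists of (lo, hi)."""
    frac = 1.0
    for (blo, bhi), (rlo, rhi) in zip(box, region):
        inter = min(bhi, rhi) - max(blo, rlo)
        if inter <= 0:
            return 0.0
        frac *= inter / (bhi - blo)
    return frac


def estimated_count(clusters, region):
    """Assumed number of entries in a region under the uniformity assumption."""
    return sum(count * overlap_fraction(box, region) for box, count in clusters)


def split_position(clusters, region, axis, iterations=40):
    """Bisect for the position that puts half of the assumed entries on each side."""
    lo, hi = region[axis]
    half = estimated_count(clusters, region) / 2
    for _ in range(iterations):
        mid = (lo + hi) / 2
        left = list(region)
        left[axis] = (region[axis][0], mid)
        if estimated_count(clusters, left) < half:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def ideal_partitions(clusters, space):
    """Split the search space into 2^n regions, with 2^n >= number of clusters."""
    levels = math.ceil(math.log2(len(clusters)))
    regions = [list(space)]
    for level in range(levels):
        axis = level % len(space)  # cycle through the dimensions
        next_regions = []
        for r in regions:
            pos = split_position(clusters, r, axis)
            left, right = list(r), list(r)
            left[axis] = (r[axis][0], pos)
            right[axis] = (pos, r[axis][1])
            next_regions += [left, right]
        regions = next_regions
    return regions


# Seven existing clusters over fields A and B: (bounding box, entry count).
clusters = [
    ([(0, 4), (0, 3)], 100), ([(2, 7), (1, 6)], 100), ([(5, 10), (0, 4)], 100),
    ([(1, 5), (4, 9)], 100), ([(6, 10), (5, 10)], 100), ([(3, 8), (7, 10)], 100),
    ([(0, 2), (6, 10)], 100),
]
space = [(0, 10), (0, 10)]
regions = ideal_partitions(clusters, space)
print(len(regions))  # 8
```

  • No data cluster is read from storage here; only the ranges and entry counts held in the metadata are used, which is what keeps this step cheap.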
  • the first approach selects a data cluster for optimisation which is judged to be least aligned with the ideal set of clusters.
  • a second approach selects an ideal data cluster for optimisation based on an error contribution of data clusters falling within that ideal data cluster.
  • an existing data cluster 10-2 is selected. This selection is made by comparing each data cluster within the partitions and selecting the data cluster which least aligns with those partitions (or which deviates the most from those partitions). The existing data cluster which deviates the most is assumed to be the best candidate for optimization.
  • every data cluster which intersects in search space with the candidate data cluster 10-2 is also selected to create an optimizable group of data clusters 30, with all non-intersecting data clusters being ignored, as illustrated in Figure 4H.
  • an ideal data cluster 20’ is selected. This ideal data cluster 20’ is selected based on an error measure. For every partition (ideal data cluster) an error measure is computed. For each partition, data clusters falling within that partition are identified and a data cluster error based on the shape, overlap and positional misalignment of each of those data clusters is calculated. Those data cluster errors are then combined for that partition. For example, the ideal data cluster 20’ will have data cluster errors calculated for the two data clusters intersecting that ideal data cluster 20’ and these data cluster errors will be combined to give an error measure for that ideal data cluster 20’. The partition that has the highest error measure is selected, in this example, the ideal data cluster 20’. It will be appreciated that in another embodiment neighbouring partitions may also be selected for various reasons such as if a wider optimisation is required and/or for faster convergence per iteration.
  • every data cluster which intersects in search space with the ideal data cluster 20’ is selected to create an optimizable group of data clusters 30’, with all non-intersecting data clusters being ignored.
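  • The second selection approach can be sketched as follows. The text does not give a formula for the data cluster error, so the measure used here (the fraction of a cluster's bounding box lying outside the ideal partition, summed over the intersecting clusters) is purely an assumption for illustration, as are the function names and the example coordinates.

```python
def intersects(a, b):
    """True if two boxes (lists of (lo, hi) per dimension) share some volume."""
    return all(min(ahi, bhi) > max(alo, blo)
               for (alo, ahi), (blo, bhi) in zip(a, b))


def fraction_outside(box, region):
    """Hypothetical cluster error: share of the box's volume outside the region."""
    inside = 1.0
    for (blo, bhi), (rlo, rhi) in zip(box, region):
        inside *= max(0.0, min(bhi, rhi) - max(blo, rlo)) / (bhi - blo)
    return 1.0 - inside


def partition_error(region, clusters):
    """Combine the errors of all existing clusters intersecting this ideal cluster."""
    return sum(fraction_outside(box, region)
               for box in clusters.values() if intersects(box, region))


def select_optimizable_group(ideal_regions, clusters):
    """Pick the ideal cluster with the highest error and every cluster intersecting it."""
    worst = max(ideal_regions, key=lambda r: partition_error(r, clusters))
    group = sorted(name for name, box in clusters.items() if intersects(box, worst))
    return worst, group


ideal_regions = [
    [(0, 5), (0, 5)], [(5, 10), (0, 5)], [(0, 5), (5, 10)], [(5, 10), (5, 10)],
]
existing = {
    "10-1": [(0, 5), (0, 5)],   # perfectly aligned with the first ideal cluster
    "10-2": [(4, 10), (1, 4)],  # spills slightly into a neighbouring partition
    "10-3": [(1, 9), (6, 9)],   # straddles two partitions
    "10-4": [(6, 9), (4, 8)],   # straddles two partitions
}
worst, group = select_optimizable_group(ideal_regions, existing)
print(worst, group)  # [(5, 10), (0, 5)] ['10-2', '10-4']
```

  • Only bounding boxes from the metadata are compared, so the selection is made without retrieving any data cluster from storage.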
  • the optimizable group of data clusters are then optimized.
  • those existing data clusters within the optimizable group of data clusters 30 are retrieved from the storage device 110 and their data values 200 stored in the entries of the optimizable group of data clusters 30 are mapped onto the search space 20, as illustrated in Figure 4I.
  • Although the search space 20’ of the optimizable group of data clusters 30 matches the search space 20 of the existing data clusters, as illustrated in Figure 4J, this need not be the case and may instead be a subset of that search space 20.
  • At step S70, the search space 20’ of the optimizable group of data clusters 30 is then partitioned in a similar manner to that described above, as illustrated in Figures 4K to 4L. Partitioning ceased after 4 partitions were generated, since the number of data clusters in the optimizable group 30 is also 4.
  • optimized data clusters 10’-1 to 10’-4 are formed from the data values falling within each partition area. Metadata describing the range in the search dimensions A and B of each of those optimized data clusters 10’-1 to 10’-4 is generated and the optimized data clusters 10’-1 to 10’-4, together with their metadata, are stored. Once that storage has happened, then the existing data clusters within the group of optimizable data clusters 30, together with their metadata, can be nulled and the optimized data clusters 10’-1 to 10’-4 and their metadata can be made available to the data information system at step S80.
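  • Forming the optimized data clusters from the retrieved entries can be sketched as follows. This is a simplified illustration that assumes a median split cycling through the dimensions (a KD-tree style partitioning) and enough entries that no partition ends up empty; the function name and the synthetic entries are hypothetical.

```python
def form_optimised_clusters(entries, n_clusters, dims):
    """Repartition entries into at least n_clusters groups with tight ranges.

    The entries are split at the median along one dimension per level, so each
    group holds a similar number of entries. Returns one record per optimised
    cluster with its entries and the (min, max) metadata for every dimension.
    """
    groups = [list(entries)]
    level = 0
    while len(groups) < n_clusters:
        axis = dims[level % len(dims)]  # cycle through the search dimensions
        next_groups = []
        for g in groups:
            ordered = sorted(g, key=lambda e: e[axis])
            mid = len(ordered) // 2
            next_groups += [ordered[:mid], ordered[mid:]]
        groups = next_groups
        level += 1
    return [
        {"entries": g,
         "metadata": {d: (min(e[d] for e in g), max(e[d] for e in g)) for d in dims}}
        for g in groups
    ]


# Sixteen synthetic entries taken from four existing clusters (values are made up).
entries = [{"A": (7 * i) % 16, "B": (5 * i) % 16} for i in range(16)]
optimised = form_optimised_clusters(entries, n_clusters=4, dims=["A", "B"])
for cluster in optimised:
    print(len(cluster["entries"]), cluster["metadata"])
```

  • After the split, the first two clusters hold only low values of A and the last two only high values of A, and within each pair the ranges of B do not overlap, so a range query on either field touches fewer clusters than before.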
  • worst-case data cluster configurations can be encountered for which the runtime complexity becomes quadratic. This happens for example if all data clusters overlap with each other, because for every cluster the error computation must consider every other data cluster in the set.
  • the data clusters that have a very negative impact on the overall runtime are filtered out.
  • One possible heuristic can be based on the size of the data clusters, because it is assumed that very large clusters are likely to overlap with very many clusters.
  • the number of successive kD-tree levels in which the clusters intersect the same split planes is computed. The clusters that intersect split planes of successive levels for a certain or specified number of times are filtered out and handled separately.
  • Data values stored by data clusters may be changed or updated.
  • the “firmware date” for an entry in a data cluster could be changed from one date to another.
  • Updates can also include deletion of an entry from a data cluster.
  • an item in the inventory database may be deleted.
  • new metadata is generated for the data cluster reflecting the changed data values within that data cluster. Those changes may then cause that updated data cluster to be selected for optimisation as mentioned above.
  • Figure 5 illustrates searching the data clusters according to one embodiment.
  • a search enquiry is received.
  • the search enquiry will relate, among other fields, to search fields whose data ranges are indicated in the metadata for the data clusters. Should the metadata not contain that information then, depending on implementation, that metadata can be added when optimizing the data clusters.
  • At step S100, the metadata is interrogated to see if it answers the query.
  • an assessment is made of whether the query is answered.
  • a query may be made for an indication of the total number of data entries in the data clusters.
  • the metadata for each data cluster may include that as a data item, and so the answer can be returned without needing to interrogate the data clusters themselves. It will be appreciated that other data items relating to the data clusters may also be stored in the metadata. Similarly, an interrogation of the metadata may reveal that no data clusters contain data values which can possibly fall within the search criteria, and so, at step S120, an answer to the query is provided from the metadata alone.
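  • Answering a query from the metadata alone can be sketched as follows. The query dictionary format, the function name and the convention of returning None when the data clusters themselves must be interrogated are assumptions made for this illustration.

```python
def answer_from_metadata(metas, query):
    """Try to answer a query using only per-cluster metadata.

    Returns the answer when the metadata suffices, or None when candidate
    data clusters must be retrieved and interrogated.
    """
    if query["kind"] == "count_all":
        # The entries counter of each cluster gives the total directly.
        return sum(m["entries"] for m in metas)
    if query["kind"] == "range":
        field, low, high = query["field"], query["low"], query["high"]
        candidates = [
            m for m in metas
            if not (m["ranges"][field][1] < low or m["ranges"][field][0] > high)
        ]
        if not candidates:
            return []  # no cluster can contain a match, so the result is empty
    return None


metas = [
    {"ranges": {"purchase_price": (15, 295)}, "entries": 120},
    {"ranges": {"purchase_price": (310, 900)}, "entries": 80},
]
print(answer_from_metadata(metas, {"kind": "count_all"}))  # 200
print(answer_from_metadata(
    metas, {"kind": "range", "field": "purchase_price", "low": 1000, "high": 2000}))  # []
print(answer_from_metadata(
    metas, {"kind": "range", "field": "purchase_price", "low": 200, "high": 400}))  # None
```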
  • the data processing resources dedicated to the receiving and storing of data clusters as illustrated in Figure 2, the optimization of data clusters as illustrated in Figures 3 and 4A to 4N, and the searching of data clusters as illustrated in Figure 5, may be dynamically altered or statically prioritized in order to, for example, prioritize one process over the other and/or to make some processes foreground and others background.
  • the searching and storing of data clusters are prioritized as foreground processes, with the optimization occurring in the background, as resources become available.
  • Figure 6A illustrates an example JOIN operation on two tables.
  • Table a and Table b are unoptimised and store data values.
  • Table a stores data values for the fields item_id, order_id and part_id.
  • Table b stores data values for the fields item_id, sales_date and sales_id. It is possible to perform a JOIN operation in response to a query.
  • Table a may be JOINed with Table b along a shared field (dimension) which, in this example, is item_id.
  • Table c contains data values which map order_id and part_id to sales_date and sales_id via item_id.
  • the JOIN operation can be resource-intensive (requiring large amounts of memory) and can slow the processing speed dramatically, particularly as the size of the tables increases.
  • Figure 6B illustrates an example JOIN operation on two tables according to one embodiment.
  • Table a’ and Table b’ are optimised using the techniques described above. Consequently, table a’ has optimised data clusters a’-1 to a’-5 and table b’ has optimised data clusters b’-1 to b’-5.
  • individual JOIN operations can be performed using the optimised data clusters.
  • data cluster a’-1 can be JOINed with data cluster b’-1, a’-2 with b’-2, and so on to generate resultant JOINed data clusters.
  • embodiments provide a mechanism to introduce data locality to a dataset incrementally.
  • Embodiments alleviate limitations of existing techniques.
  • Scattered and inefficient input/output (I/O) data accesses are avoided by clustering data.
  • Access is typically at a granularity level optimized for the I/O systems of the one or more connected computer systems and clustering data ensures that a large proportion of the fetched data is relevant to the query (as opposed to a large proportion being irrelevant in the earlier cases).
  • This can be executed on multiple dimensions at the same time (clustering data along each of these dimensions, co-locating similar data). It is irrelevant for the operability of the embodiments if the dimensions are correlated or not.
  • Any index typically requires the keys and their location to be stored in this index, which, in the typical
  • such a structure can be very small since it only includes part of the key data, such as key ranges.
  • the order of the keys is typically predefined and a user may only query between 1 and n keys together in the order they were defined.
  • the first level typically covers the entire value range of the data that is ingested. When this level has reached a certain fill state, the data in this level is inserted into the next level down, possibly triggering the next level to reach its fill state as well. This level then also cascades downwards etc. Each level separates the value ranges. An example of this is a perfect order. Thereby clustering of data becomes more granular the lower the layer. These structures are typically of an amortized complexity O(log n) for every row inserted, the heavy penalty of triggering large cascades (heavy I/O) being somewhat offset against the clustering precision of data. An LSM-tree works in a very similar fashion.
  • Database Cracking adaptively builds knowledge about the data contained in the database during the queries. It is used in some column-store databases. It moves the cost of index maintenance from the database changes (ingestion) to the queries (selection).
  • the query processor provides information to the data handling mechanisms to re-arrange the data and execute optimizations such as a partial sorting or partial indexing. This technique is said to improve I/O, query processing speed and to exhibit self-optimizing behaviour.
  • a database is a tool to persistently store data inserted into it, it typically also has a very predictable behaviour on how these insertions are handled when multiple sources compete for storing data or how long these insertions typically take at an upper bound.
  • the complexity of operating a database system typically limits the total system data throughput to a fraction of the achievable system throughput compared to storing a stream of data with the standard system I/O without using a database.
  • a process can organize the data in each table into many independent clusters of data.
  • a clustering is made by spatially organising data along multiple, possibly independent, dimensions.
  • information on the data is already obtained and kept.
  • the process of laying out the data along multiple possibly independent dimensions is both an independent process and an incremental process.
  • Multiple possibly independent dimensions: the process is capable of organising along multiple dimensions at the same time and in an independent way:
  • a KD-tree is used with at least one split in every dimension.
  • the KD-tree is used with a user-defined selectivity scaling for each dimension.
  • the effort e.g. in terms of I/O and/or processing resources
  • the user or system can omit this process to save resources for an undetermined amount of time, solely relying on the data’s properties obtained during insertion for executing c) below. Therefore, embodiments are capable of guaranteeing a predictable and high data insertion performance.
  • using an incremental process means that the resources dedicated to this process are adjustable as well as constrainable to what is, by the user or automatically by the system, determined to be the optimum trade-off between all system resources. The outcome of each increment is already usable and, typically, already shows an improvement to the prior state with respect to data clustering.
  • a metadata structure can be kept to limit access only to the relevant clusters that contain data the user asks for.
  • This metadata structure is able to determine this relevance by storing the range along each dimension that is covered in each of the clusters. It will be appreciated that storing a range for each dimension takes only two values per dimension, and it is therefore very small compared to the underlying entries described. In this context it is important to note that range is just one of the possible embodiments. Each range can be evaluated independently.
  • the cluster size is configurable to reach the optimum trade-off between: the I/O size to be fetched at good I/O performance; the corresponding selectivity based on the extent of the clusters along each dimension (hypothetical extent is dimension-root of the total cluster count); and the relative size of the metadata structure and resulting access speed (smaller metadata structures can be kept in the individual cache hierarchies: fast caching storage, RAM, CPU caches etc.)
  • data entries such as one or more rows are received by the database.
  • a row typically has a defined set of columns, the value of a column in a row is called a field.
  • These rows are evaluated against a number of dimensions, each dimension being determined by one or more fields in the row (single field, concatenating multiple ones, calculation from multiple fields etc.) or generated, e.g. the total row count.
  • dimensions are set a-priori.
  • dimensions are learned from usage pattern.
  • every field in a row is chosen as its own dimension. The row is stored. In order to store the rows, a plurality of rows is buffered until an I/O-optimal size is reached or the database operation requires writing out the data/this buffer.
  • the set of rows having an approximate I/O-optimal size is referred to as a row cluster (data cluster).
  • a row cluster data cluster
  • the information of which data is contained along each dimension in the cluster is extracted and stored in a metadata structure.
  • more rows than required for a single row cluster are buffered.
  • the distribution of the rows into row clusters already follows the optimization step disclosed below.
  • the information which data is contained is the range between the values in each of the dimensions.
  • the information is a probabilistic data structure, such as a bloom filter.
  • a probabilistic filter which could be a bit mask
  • one or more bits at different positions indicate if a value may be contained. If one of the bits is not set it can be concluded that the value is not contained.
  • the values in each dimension are given an order, e.g. “aab” after “aaa”; distance is determined by finding the distance in steps along the order. This order may be implicit from the values (e.g. the given text ordering example) or kept as a dictionary that each part of the database can look up in.
  • the metadata also includes information such as how many rows are inside a row cluster.
  • row clusters are compressed and the metadata contains additional information, such as storage sizes, to optimize the I/O access when retrieving the row cluster.
  • row clusters and/or the metadata are compressed using additional hardware that can be configured to execute compression or other processing of the row cluster and/or the metadata in line with the query conditions.
  • Optimization: the purpose of optimization is to increase selectivity when querying for the data while being able to conduct optimization iteratively (no “all or nothing” case). Row clusters are evaluated for their relative selectivity, i.e. how likely they are to be chosen for retrieval by a query (and thus create a cost) versus the probability of including data required by that query. Selectivity can, for example, be approximated by the extent of their range (in one embodiment), by the number of splits per dimension (in one embodiment), by the number of bits set (in another), by the range along the order (in another), or similar.
  • Larger clusters: wider ranges or more bits set (or any other way in which data selectivity is lower) shall be referred to as larger clusters hereafter; narrower ranges or fewer bits set (or any other way in which data selectivity is higher) as smaller clusters.
  • This evaluation can be undertaken for each dimension individually or for multiple dimensions at the same time.
  • larger clusters are prioritized over smaller clusters.
  • Clusters are also selected to be at a shorter distance to each other.
  • the rows in multiple larger clusters within a certain, typically close, distance of each other are processed. Close distance can include overlap, which is spatial overlap, range overlap or equal bits set, depending on the embodiment. If overlap is present, rows of smaller clusters that overlap with the aforementioned larger clusters are also included in the processing.
  • the target distribution with a high selectivity has been identified, for example, by sampling the data. Row clusters are then chosen based on their divergence from the target distribution, the more diverging ones in favour of the less diverging ones. In one embodiment, multiple of the aforementioned selection mechanisms are combined to choose the row clusters.
  • the rows are processed by re-distributing rows to row clusters such that the larger clusters turn into smaller clusters, the overlap is reduced and the distance between row clusters is increased, while maintaining a certain threshold of rows per row cluster.
  • the resulting re-distributed row clusters are typically more selective than the original row clusters.
  • this re-distribution is operated with a KD-tree.
  • Each row’s dimensions are inserted as an n-dimensional point in the KD-tree.
  • the point clusters generated by the split planes created by the KD-tree are then used to obtain the new row clusters.
  • the row clusters are thereby on average at least 75% filled.
  • KD-tree splits are adjusted for row size, i.e. instead of splitting by the median value, it is split by the median aggregated row size.
  • the KD-tree splits at least once per every dimension.
  • the KD-tree ensures to have an approximately equal number of split planes in each dimension. It will be appreciated that this will typically result in equal selectivity on each dimension irrespective of the dimension’s cardinality.
  • the rows are redistributed into row clusters such that as few bits as possible are set in the probabilistic data structure of each row cluster, i.e. the “distance” (in the sense it was defined before) between rows is minimized and larger clusters with overlap are converted into smaller clusters with no or limited overhead.
  • the row with the bitmask 001 shall be clustered with the row with the bitmask 101 in favour of clustering with the row with bitmask 010, or even 110 (which has greatest distance).
  • the position along the order is interpreted as a range for the KD-tree splitting of the values.
  • the distribution is learned by applying an online k-means algorithm to the row (interpreting the row’s dimensions as an n-dimensional point).
  • the I/O and/or processing power used for optimization is limited to ensure enough resources are available for other processing, such as ingesting and selecting, and thereby be able to give performance guarantees.
  • the number of row clusters chosen for optimization is balanced with the upper limit for I/O and/or processing power to yield the right balance between resource usage and creating smaller clusters. Such a balance can be obtained by automatically or manually inspecting how much smaller the clusters became. In one embodiment it may be chosen to delay the optimization temporarily to free resources for other database operations. The database is nevertheless fully capable of ingesting, selecting or executing other tasks during this period (that is, optimization is not required to run at all times to allow database operation).
  • the metadata structure is interrogated if a row cluster contains values that match the conditions of the database query. In one embodiment, this is carried out with a range check along each of the dimensions. In another embodiment, this is done by checking if all the relevant bits are set in the probabilistic data structure. In another
  • the order is retrieved by finding the dimensions’ ordering or looking up the dimensions’ ordering in the common dictionary, then a range check can inform about the possible presence of this value. Only the row clusters, where the metadata structure has indicated a possible presence of rows having met all of the required conditions, are retrieved.
  • the metadata structure may be kept in a search tree of some form to accelerate interrogation.
  • the metadata structure is organized as a KD-tree of n-dimensional cubes expressing the range in dimensions.
  • queries or parts of queries are solely answered by inquiring the metadata structure. For example, when requesting a count for a range and the metadata can identify that none of the data lies within that range, the count is already known to be zero without retrieving any row cluster using I/O.
  • the metadata structure can identify that all rows fall within the range, the metadata structure can provide the count directly by summing up the row count stored for each row cluster.
  • the metadata structure can estimate the result of queries based on the information it holds. For example, when a query requests the row count for half the range of a row cluster, the metadata structure can estimate that half the rows will match the query conditions and return the row count stored in the metadata structure divided by two.
  • the metadata structure keeps additional statistics, such as samples or information about the data distribution within the row cluster, to reach more precise estimates or even be able to answer specific requests accurately based on these additional statistics kept.
  • a plurality of dimension mechanisms are combined, for example a range on one dimension with a probabilistic filter on another dimension and an order range filter on two other dimensions.
  • optimization is undertaken by applying the aforementioned methods for each dimension mechanism individually, possibly interleaving them (e.g. distributing bit-fields in between finding KD-tree splitting planes).
  • selection is undertaken by evaluating each of the dimensions’ metadata to conclude if for the combination of all dimensions together the possibility exists, that all query conditions are met by a row cluster. This is done with the aforementioned methods. If this possibility exists, the row cluster is retrieved, otherwise not.
  • the dimension mechanisms are combined by finding common measures for the
  • One typical measure to use is the distance, which has been defined for multiple different dimension mechanisms in this disclosure. The combination of all distances can then subsequently be used for operating on multiple dimensions in accordance with the mechanism for a single dimension disclosed above.
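
By way of illustration only, the bitmask example given above (001 preferred with 101 over 010 or 110) is consistent with measuring distance as the number of differing bits. The following minimal Python sketch assumes that reading; the function names `bit_distance` and `nearest` are hypothetical and are not taken from the disclosure, which does not fix a particular distance formula for probabilistic filters.

```python
from typing import Iterable


def bit_distance(a: int, b: int) -> int:
    """Number of bit positions in which two bitmasks differ (assumed measure)."""
    return bin(a ^ b).count("1")


def nearest(row_mask: int, candidates: Iterable[int]) -> int:
    """Return the candidate bitmask at the smallest assumed distance."""
    return min(candidates, key=lambda c: bit_distance(row_mask, c))


row = 0b001
others = [0b101, 0b010, 0b110]
distances = {format(c, "03b"): bit_distance(row, c) for c in others}
best = nearest(row, others)
```

Under this assumed measure, 101 is closest to 001 and 110 is furthest, which matches the ordering stated in the example.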

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data processing apparatus, data processing method and computer program product are disclosed. The data processing method comprises: identifying a group of existing data clusters having a clustering characteristic, each existing data cluster comprising data entries of a data information system stored together in storage, each data entry having at least one field storing data values; selecting at least one existing data cluster from the group of existing data clusters as an optimisable group of data clusters; and forming a group of optimised data clusters by allocating data entries of the optimisable group of data clusters to each optimised data cluster to improve the clustering characteristic for the group of optimised data clusters compared to the group of existing data clusters. In this way, the distribution of data entries in data clusters may be changed when creating optimized data clusters from those data entries, in order to improve the clusters. This, in turn, improves the performance of the data information system.

Description

DATA PROCESSING
FIELD OF THE INVENTION
The present invention relates to a data processing apparatus, data processing method and computer program product.
BACKGROUND
Data processing apparatus are known. Data processing apparatus store data values (which may be instructions or data) in storage for processing. The data processing apparatus retains or is provided with location information which identifies the location of data stored in storage for subsequent retrieval and processing. A data processing apparatus may operate as a data information system which is provided with data which is required to be stored in storage for subsequent interrogation, such as searching in order to answer a query. Although various techniques for storing data exist, they each have their own shortcomings. Accordingly, it is desired to provide an improved data processing technique.
SUMMARY
According to a first aspect, there is provided a data processing method, comprising: identifying a group of existing data clusters having a clustering characteristic, each existing data cluster comprising data entries of a data information system stored together in storage, each data entry having at least one field storing data values;
selecting at least one existing data cluster from the group of existing data clusters as an optimisable group of data clusters; and forming a group of optimised data clusters by allocating data entries of the optimisable group of data clusters to each optimised data cluster to improve the clustering characteristic for the group of optimised data clusters compared to the group of existing data clusters.
The first aspect recognises that databases typically have two mechanisms to retrieve data that meets the condition of a database query. They can either access all of the data to answer a query, which is commonly referred to as a “table scan” and is typically slow, especially with large datasets; or they can consult an index, an indirect structure that holds information on where the data meeting the query conditions is stored, which allows the data to be retrieved selectively. A query may ask for all data matching a specific value, and given this value is a key in the index, the index can directly provide all rows matching this query condition. These selective retrievals may create data accesses that are scattered and, due to the nature of accesses in one or more connected computer systems, may fetch a large proportion of data that is not relevant. Also, when ingesting or updating data, modifying the index may also lead to scattered and inefficient data accesses. Some databases reduce the use of indexes by storing each column of a database table individually (column-store), thereby each column can be evaluated without interrogating an index or loading any additional data beyond the columns required. However, these column-store databases typically encounter the scattered and inefficient data access patterns when assembling all the columns in a row to answer the query. Accordingly, the first aspect recognises that when storing data entries in a data information system, it is often convenient to cluster a number of those data entries together for subsequent storage. For example, clustering them together can help reduce any increase in overhead that may occur if storing the data entries individually. Also, clustering can improve the efficiency of accesses to storage, particularly when the cluster size is related to optimum sizes for accesses to storage.
However, the first aspect also recognises that storing data in this way can lead to clusters having undesirable characteristics. For example, identical or similar data values which may need to be searched may be distributed over a wide range of clusters, each of which may need to be interrogated in response to a search request or enquiry. This, in turn, leads to frequent data accesses needing to be made to retrieve data clusters in order to answer a query, which slows processing speed substantially and can cause bottlenecks in the infrastructure used to perform the data accesses, requires higher performance resources to handle those data access requests and requires a higher than desired amount of resources to perform the data processing in order to interrogate all of those data clusters in response to a search enquiry. The first aspect also recognises that if the characteristics of the clusters are controlled to suit the particular implementation of a data information system, then it is possible to optimize the processing speed or performance of the data information system.
Accordingly, a method is provided. The method may be for or be performed by a data processing apparatus. The method may comprise identifying or determining a group or set of data clusters. Each data cluster may have data entries. The data entries may be data entries of a data information system. The data entries may be stored as a block in a storage device. Each of the data entries may have one or more fields. Each of those fields may store one or more data values. The method may comprise selecting or choosing one or more of the data clusters from the group of data clusters. That one or more selected data clusters may comprise or be designated as an optimizable group of data clusters. The method may comprise forming or creating a group or set of optimized data clusters. The optimized data clusters may be formed by allocating data entries from the optimizable group of data clusters. Each data entry of the optimizable group of data clusters may be allocated or assigned to one of the optimized data clusters. In this way, the distribution of data entries in data clusters may be changed when creating optimized data clusters from those data entries, in order to improve the clusters. This, in turn, improves the processing speed and performance of the data information system.
In one embodiment, the group of data clusters may have a characteristic, feature or parameter that can be related to a metric. The data entries may be allocated to the optimized data clusters to improve the characteristic of the group of optimized data clusters when compared to the characteristic of the group of data clusters.
In one embodiment, each data cluster occupies a data range in a search space defined by values of each data entry of each field. Accordingly, each field of a data entry can define a space which may need to be searched by the data information system. The values of those fields of each data entry within the data cluster may define a data range within that search space. For example, consider a simple arrangement where a field stores a numerical value such as temperature. The field may then define a search space or search dimension of temperature. When looking at the values in the temperature field for each data entry within the data cluster it may be determined that a minimum temperature is 10 and a maximum temperature is 25. Accordingly, the data range in the search space defined by the temperature field for that data cluster would be between 10 and 25. It will be appreciated that any values of any type of field (such as text, hierarchy information, image data, etc) can be mapped into Euclidean space and a range within that space can be established.
In one embodiment, the search space has ‘n’ dimensions, each dimension being defined by a corresponding ‘n’ field. Accordingly, the search space for the data cluster may be multi-dimensional, depending on the number of fields to be searched or indexed.
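
By way of illustration only, the data range occupied by a data cluster in an n-dimensional search space might be derived as in the following minimal Python sketch. The function name `bounding_ranges`, the second field `humidity` and the sample values are assumptions made for the example; only the temperature range of 10 to 25 comes from the description above.

```python
from typing import Dict, List, Sequence, Tuple

Entry = Dict[str, float]
Range = Tuple[float, float]


def bounding_ranges(cluster: List[Entry], fields: Sequence[str]) -> Dict[str, Range]:
    """Return the (minimum, maximum) data range of a cluster for each field."""
    ranges: Dict[str, Range] = {}
    for field in fields:
        values = [entry[field] for entry in cluster]
        ranges[field] = (min(values), max(values))
    return ranges


# Hypothetical data cluster with two searchable fields (two dimensions).
cluster = [
    {"temperature": 18.0, "humidity": 40.0},
    {"temperature": 10.0, "humidity": 55.0},
    {"temperature": 25.0, "humidity": 35.0},
]
ranges = bounding_ranges(cluster, ["temperature", "humidity"])
```

Each additional field adds one dimension to the search space, and the per-field ranges together describe the region of that space occupied by the data cluster.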
In one embodiment, each data cluster has a size which matches a bandwidth-optimised data block transfer size of the storage. Accordingly, the data clusters may be sized to match the data block transfer size of the storage device.
In one embodiment, each data cluster has a size no larger than a bandwidth-optimised data block transfer size of the storage. Accordingly, the size of each data cluster may be set to be the block transfer size or smaller. This helps to ensure that each data cluster can be transferred between the storage and data processing apparatus as efficiently as possible.
In one embodiment, each data cluster has a size larger than a bandwidth-optimised data block transfer size of said storage.
In one embodiment, each data cluster has a size that is a multiple of a bandwidth-optimised data block transfer size of said storage.
In one embodiment, one or more data clusters are compressed or stored in compressed form.
In one embodiment, each data cluster has associated metadata which provides at least an indication of the data range in the search space defined by values of each data entry of at least one field. Accordingly, each cluster may have metadata associated therewith. The metadata may provide or indicate the search range or search ranges in the search space which are defined by the values of the data entries of one or more fields within that data cluster.
In one embodiment, the metadata stores at least one additional parameter relating to that data cluster. Accordingly, the metadata may provide additional information relating to the data cluster which may be unrelated to its search ranges.
In one embodiment, the additional parameter comprises a number of entries in that data cluster. Accordingly, the number of data entries within a data cluster may be indicated within the metadata.
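
By way of illustration only, metadata holding a data range and a number of entries per data cluster might be used to answer a count query without accessing the data clusters themselves, as in the following minimal Python sketch. The class `ClusterMetadata` and the functions `may_contain`, `fully_inside` and `count_from_metadata` are hypothetical names, and the sample ranges and counts are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Range = Tuple[float, float]


@dataclass
class ClusterMetadata:
    ranges: Dict[str, Range]  # data range per field (search dimension)
    row_count: int            # additional parameter: number of data entries


def may_contain(meta: ClusterMetadata, field: str, low: float, high: float) -> bool:
    """True if the cluster's range overlaps the queried range [low, high]."""
    c_low, c_high = meta.ranges[field]
    return c_low <= high and low <= c_high


def fully_inside(meta: ClusterMetadata, field: str, low: float, high: float) -> bool:
    """True if every value in the cluster must lie within [low, high]."""
    c_low, c_high = meta.ranges[field]
    return low <= c_low and c_high <= high


def count_from_metadata(metas: List[ClusterMetadata], field: str,
                        low: float, high: float) -> Optional[int]:
    """Return the exact count if metadata alone suffices, otherwise None."""
    total = 0
    for meta in metas:
        if not may_contain(meta, field, low, high):
            continue  # cluster cannot contribute any entries
        if not fully_inside(meta, field, low, high):
            return None  # partial overlap: the cluster itself must be read
        total += meta.row_count
    return total


metas = [
    ClusterMetadata({"temperature": (10.0, 25.0)}, row_count=100),
    ClusterMetadata({"temperature": (30.0, 40.0)}, row_count=50),
]
all_rows = count_from_metadata(metas, "temperature", 0.0, 50.0)
no_rows = count_from_metadata(metas, "temperature", 26.0, 29.0)
undecided = count_from_metadata(metas, "temperature", 20.0, 35.0)
```

In this sketch a result of `None` signals that at least one data cluster only partially overlaps the queried range, so that data cluster would have to be retrieved to obtain an exact answer.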
In one embodiment, the optimizable group of data clusters may be determined from the metadata.
In one embodiment, each data entry is an entry in a database. Accordingly, the data entries may relate to entries in a database. The data entries may also relate to entries in data information systems, a relational database, a NOSQL database or the like.
In one embodiment, the clustering characteristic comprises a search selectivity between the existing data clusters. In one embodiment, the clustering characteristic comprises a number of existing data clusters accessed in response to search enquiries. In one embodiment, the clustering characteristic comprises a separation between existing data clusters in the search space. In one embodiment, the clustering characteristic comprises an overlap of data ranges of data values between existing data clusters in the search space. In one embodiment, the clustering characteristic comprises data ranges of data values within each existing data cluster in the search space. Accordingly, depending on implementation, various different characteristics may need to be optimized. In one embodiment, the selectivity of data clusters in response to the search may be a characteristic to be improved. In one embodiment, the number of data clusters which are accessed following a search enquiry may be a characteristic to be optimized. In one embodiment, a separation or distance between the data clusters in search space may be another characteristic to be optimized. In one embodiment, an overlap or commonality in data ranges between data clusters within the search space may be a characteristic to be optimized. In one embodiment, a data range of the data values within the data clusters may be a characteristic to optimize. By optimizing these characteristics, the performance of the data information system when performing searches can be improved.
In one embodiment, the selecting comprises selecting the at least one existing data cluster from the group of existing data clusters as the optimisable group of data clusters based on a size of its occupied search space. Accordingly, one or more of the data clusters may be selected from the group or set of data clusters and included in the group or set of optimizable data clusters based on how much of the search space that data cluster occupies, or based on their shape or position or fill level. When data entries can be deleted from data clusters or updated, the fill level can be a characteristic that is useful to incorporate into an error metric, because it is advantageous that data clusters with very low fill levels are merged together. Selecting data clusters on that basis biases the selection towards larger data clusters which are more likely to cause a poor clustering characteristic.
In one embodiment, the selecting comprises selecting the at least one existing data cluster from the group of existing data clusters as the optimisable group of data clusters based on a number of intersections in the search space with other existing data clusters. Accordingly, selecting those data clusters which intersect or overlap with more data clusters than others biases the optimization towards those data clusters which are likely to cause a poor clustering characteristic. In one embodiment, the selecting comprises selecting the at least one existing data cluster from the group of existing data clusters as the optimisable group of data clusters based only on associated metadata. Accordingly, the optimizable group of data clusters may be determined using the stored metadata for those data clusters. This avoids the need to access the data clusters themselves, or perform any searching within the data clusters to make that selection. This significantly improves the performance of the selection.
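
By way of illustration only, a selection based on the number of intersections, using only the data ranges held in metadata, might proceed as in the following minimal Python sketch. The names `overlaps`, `intersection_counts` and `select_optimisable`, the single dimension `t` and the sample ranges are assumptions for the example; the quadratic pairwise comparison is chosen for brevity and is not suggested by the description.

```python
from typing import Dict, List, Tuple

Box = Dict[str, Tuple[float, float]]  # data range per dimension, from metadata


def overlaps(a: Box, b: Box) -> bool:
    """Two clusters intersect if their ranges overlap in every dimension."""
    return all(a[d][0] <= b[d][1] and b[d][0] <= a[d][1] for d in a)


def intersection_counts(boxes: List[Box]) -> List[int]:
    """Count, for each cluster, how many other clusters it intersects."""
    counts = [0] * len(boxes)
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlaps(boxes[i], boxes[j]):
                counts[i] += 1
                counts[j] += 1
    return counts


def select_optimisable(boxes: List[Box], k: int) -> List[int]:
    """Return the indices of the k clusters with the most intersections."""
    counts = intersection_counts(boxes)
    ranked = sorted(range(len(boxes)), key=lambda i: counts[i], reverse=True)
    return ranked[:k]


boxes = [
    {"t": (0.0, 10.0)},
    {"t": (5.0, 15.0)},
    {"t": (12.0, 20.0)},
    {"t": (30.0, 40.0)},
]
counts = intersection_counts(boxes)
chosen = select_optimisable(boxes, k=1)
```

No data cluster is read in this sketch; the selection relies solely on the ranges, in line with the metadata-only embodiment described above.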
In one embodiment, the method comprises generating a group of ideal data clusters from the group of existing data clusters, the group of ideal data clusters being generated within the search space occupied by the group of existing data clusters and wherein the selecting comprises selecting at least one existing data cluster from the group of existing data clusters which intersects a selected ideal data cluster in the search space as the optimisable group of data clusters. Accordingly, an idealised group of data clusters, which, if they existed, would provide improvement to the clustering characteristic, may be created. At least one of the data clusters which, when the ideal data clusters are overlaid in the search space, intersects, covers, falls within or crosses the boundary of a particular or selected ideal data cluster is selected for the optimizable group of data clusters. Selecting a data cluster which deviates from the ideal ensures that a sub-optimal data cluster is selected for optimization.
In one embodiment, the generating comprises generating the group of ideal data clusters from the group of existing data clusters, the group of ideal data clusters being generated within the search space occupied by the group of existing data clusters based on an ideal clustering criteria which would improve the clustering characteristic.
In one embodiment, the ideal clustering criteria comprises an increase in a search selectivity between the existing data clusters. In one embodiment, the ideal clustering criteria comprises a decrease in a number of existing data clusters accessed in response to search enquiries. In one embodiment, the ideal clustering criteria comprises an increase in a separation between existing data clusters in the search space. In one embodiment, the ideal clustering criteria comprises a decrease in an overlap of data ranges of data values between existing data clusters in the search space. In one embodiment, the ideal clustering criteria comprises a decrease in a data range of data values within each or at least one existing data cluster in the search space. In one embodiment, the generating comprises generating the group of ideal data clusters based on an assumed distribution of data entries within the search space within each existing data cluster. Accordingly, the ideal group of clusters may be generated using a simplified assumption that the data entries are distributed within those ideal data clusters in accordance with a particular distribution. This again helps to simplify the generation of the ideal data clusters, which avoids the need to perform data accesses to retrieve the actual data clusters and minimises the processing required.
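The use of an assumed distribution can be illustrated with a minimal sketch. It assumes, purely for illustration, that data entries are uniformly distributed within each existing data cluster's range, so the number of entries falling in any region of the search space can be estimated from the metadata (range and entry count) alone. The names `overlap_fraction` and `estimated_entries` are hypothetical.

```python
from typing import List, Tuple

Box = List[Tuple[float, float]]  # (min, max) per search space dimension


def overlap_fraction(box: Box, region: Box) -> float:
    """Fraction of `box` lying inside `region`, assuming uniform density."""
    fraction = 1.0
    for (lo, hi), (r_lo, r_hi) in zip(box, region):
        overlap = min(hi, r_hi) - max(lo, r_lo)
        if overlap < 0:
            return 0.0  # disjoint in this dimension
        if hi == lo:
            continue  # zero-width range that lies inside the region
        fraction *= overlap / (hi - lo)
    return fraction


def estimated_entries(clusters: List[Tuple[Box, int]], region: Box) -> float:
    """Estimate how many entries fall in `region` from (box, count) metadata."""
    return sum(count * overlap_fraction(box, region) for box, count in clusters)


# Two existing clusters described only by their range and entry count.
clusters = [
    ([(0, 10), (0, 10)], 100),
    ([(5, 15), (0, 10)], 40),
]
left_half = estimated_entries(clusters, [(0, 5), (0, 10)])
whole = estimated_entries(clusters, [(0, 10), (0, 10)])
```

A partitioning scheme could use such estimates to place split planes so that each ideal data cluster holds a similar estimated number of entries, without retrieving the actual data clusters.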
In one embodiment, the generating comprises generating the group of ideal data clusters using a partitioning algorithm, scheme or process which partitions the search space to have similar numbers of data entries in each ideal data cluster. Accordingly, a partitioning algorithm is employed to partition the search space into regions, each of which has as close as possible to identical numbers of data entries or which occupy a similar amount of space.
In one embodiment, the partitioning algorithm seeks to partition the data entries at least once in each dimension of the search space.
In one embodiment, the partitioning algorithm seeks to provide an equal number of split planes in each dimension of the search space.
In one embodiment, the partitioning algorithm splits more often in certain dimensions based on user-defined settings or automatically learned usage patterns.
In one embodiment, the partitioning algorithm comprises a KD-tree. It will be appreciated that other algorithms such as quadtrees, octrees, BSP trees and the like may be utilised.
In one embodiment, the method comprises, for each ideal data cluster in the group of ideal data clusters, determining a deviation in occupied search space between that ideal data cluster and existing data clusters intersecting that ideal data cluster and wherein the selecting comprises selecting the selected ideal data cluster based on the deviation.
In one embodiment, the selecting comprises selecting the selected ideal data cluster having maximum deviation. Accordingly, that data cluster which deviates the most from the ideal may be selected. In one embodiment, the method comprises, for each existing data cluster intersecting the selected ideal data cluster, determining a deviation in occupied search space between that existing data cluster and the selected ideal data cluster and wherein the selecting comprises selecting at least one of the existing data clusters intersecting the selected ideal data cluster for inclusion in the optimisable group of data clusters based on the deviation. Accordingly, for every other data cluster which crosses, overlaps or intersects the selected ideal data cluster, a deviation may also be determined.
In one embodiment, the selecting comprises selecting that existing data cluster intersecting the selected ideal data cluster having a maximum deviation for inclusion in the optimisable group of data clusters.
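The specification does not fix a formula for the deviation. The following sketch assumes one possible measure, namely the volume of search space that an existing data cluster occupies outside the selected ideal data cluster, and selects the intersecting cluster for which that measure is largest. The measure and all names are assumptions for illustration only.

```python
from math import prod
from typing import Dict, List, Tuple

Box = List[Tuple[float, float]]  # (min, max) per search space dimension


def volume(box: Box) -> float:
    """Volume of a box; an empty (inverted) range contributes zero."""
    return prod(max(0.0, hi - lo) for lo, hi in box)


def intersection(a: Box, b: Box) -> Box:
    """Per-dimension intersection; ranges may come out inverted if disjoint."""
    return [(max(lo_a, lo_b), min(hi_a, hi_b))
            for (lo_a, hi_a), (lo_b, hi_b) in zip(a, b)]


def deviation(existing: Box, ideal: Box) -> float:
    """Assumed measure: volume of `existing` that falls outside `ideal`."""
    return volume(existing) - volume(intersection(existing, ideal))


def select_max_deviation(existing: Dict[str, Box], ideal: Box) -> str:
    """Return the id of the existing cluster deviating most from `ideal`."""
    return max(existing, key=lambda cid: deviation(existing[cid], ideal))


ideal = [(0, 10), (0, 10)]
existing = {
    "a": [(0, 10), (0, 10)],   # matches the ideal exactly
    "b": [(5, 25), (0, 10)],   # extends far beyond the ideal
    "c": [(8, 12), (8, 12)],   # small cluster crossing the boundary
}
worst = select_max_deviation(existing, ideal)
```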
In one embodiment, the selecting comprises selecting that existing data cluster intersecting the selected ideal data cluster which most occupies the search space.
In one embodiment, the selecting comprises selecting neighbouring existing data clusters to that existing data cluster having the maximum deviation for inclusion in the optimisable group of data clusters. Accordingly, those clusters which neighbour or are proximate to the selected data cluster may be included in the optimizable group. This helps to ensure that clusters near each other which could potentially collide during searches are optimized.
In one embodiment, the neighbouring existing data clusters include existing data clusters which most occupy the search space. Accordingly, those clusters which extend furthest within the search space or occupy the greatest area or volume within search space may be included in the optimizable group. Again, this helps to include clusters which are more likely to fall within a search.
In one embodiment, the neighbouring existing data clusters include existing data clusters which are closest in the search space to that existing data cluster having the maximum deviation. Accordingly, those data clusters which are most proximate to the selected cluster may be included in the optimizable group.
In one embodiment, the neighbouring existing data clusters overlap in the search space with that existing data cluster having the maximum deviation. Accordingly, those clusters which intersect with or share the same space as the selected cluster may be included in the optimizable group. In one embodiment, the selecting comprises selecting that existing data cluster intersecting the selected ideal data cluster which most occupies the search space.
In one embodiment, the forming comprises forming the group of optimised data clusters by allocating data entries of the optimisable group of data clusters to each optimised data cluster using a partitioning algorithm which would improve the clustering characteristic. Accordingly, the data entries within the optimizable group may be allocated to each optimized data cluster by partitioning the data clusters in search space.
In one embodiment, the partitioning algorithm partitions the search space occupied by the group of optimised data clusters to have similar numbers of data entries in each optimised data cluster. Accordingly, the partitioning may seek to balance the number of data entries in each optimized cluster so that each optimized cluster has near identical numbers of data entries.
In one embodiment, the partitioning algorithm seeks to maintain more than a 75% fill average in each optimised data cluster. Accordingly, a minimum fill average may be set for each criteria in order to balance the number of data entries in each data cluster.
That fill average may be a high fill average. In one embodiment, the partitioning algorithm seeks to partition the data entries at least once in each dimension of the search space. In one embodiment, the partitioning algorithm splits more often in certain dimensions based on user-defined settings or automatically learned usage patterns. In one embodiment, the partitioning algorithm seeks to provide an equal number of split planes in each dimension of the search space.
In one embodiment, the partitioning algorithm comprises a KD-tree. It will be appreciated that other algorithms such as quadtrees, octrees, BSP trees and the like may be utilised.
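A minimal sketch of a KD-tree-style partitioning that balances the number of data entries per optimised data cluster is given below. It is an assumption-laden illustration rather than the claimed method: data entries are modelled as tuples of values, `capacity` stands for a hypothetical maximum number of entries per cluster, and the split dimension simply cycles with the recursion depth.

```python
import math
from typing import List, Optional, Sequence, Tuple

Entry = Tuple[float, ...]


def partition(entries: Sequence[Entry], capacity: int,
              depth: int = 0, k: Optional[int] = None) -> List[List[Entry]]:
    """Split `entries` into k similarly sized clusters of at most `capacity`.

    k defaults to the smallest number of clusters that can hold all entries,
    which keeps the average fill as high as the entry count allows.
    """
    if k is None:
        k = math.ceil(len(entries) / capacity)
    if k <= 1:
        return [list(entries)]
    axis = depth % len(entries[0])          # cycle through the dimensions
    ordered = sorted(entries, key=lambda e: e[axis])
    k_left = k // 2
    cut = round(len(ordered) * k_left / k)  # proportional split plane
    return (partition(ordered[:cut], capacity, depth + 1, k_left)
            + partition(ordered[cut:], capacity, depth + 1, k - k_left))


def fill_average(clusters: List[List[Entry]], capacity: int) -> float:
    """Average fraction of `capacity` used across the clusters."""
    return sum(len(c) for c in clusters) / (len(clusters) * capacity)


# Ten illustrative two-dimensional entries, at most four entries per cluster.
entries = [(i, (i * 7) % 10) for i in range(10)]
clusters = partition(entries, capacity=4)
fill = fill_average(clusters, capacity=4)
```

In this sketch the fill average is bounded by the entry count: five entries with a capacity of four necessarily yield two clusters and a fill of 62.5%, so a 75% target can only be sought, not guaranteed, for small groups.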
In one embodiment, the optimising clustering criteria seeks to form optimised data clusters which minimise a deviation with respect to the group of ideal data clusters. In one embodiment, the forming comprises allocating the data entries of the optimisable group of data clusters to each optimised data cluster subject to a maximum number of data entries being provided in each optimised data cluster.
In one embodiment, the selecting comprises selecting overlapping existing data clusters which store data entries having overlapping data ranges within a plurality of fields defining a plurality of the search space dimensions as the optimisable group of data clusters and the forming comprises allocating the data entries from the optimisable group of data clusters to form the optimised data clusters having minimised overlapping data ranges in each search space dimension.
In one embodiment, the forming comprises allocating the data entries from the optimisable group of data clusters to form the optimised data clusters having optimised shape.
In one embodiment, the forming comprises allocating the data entries from the optimisable group of data clusters to form the optimised data clusters having minimised data range overlap in each search space dimension.
In one embodiment, the forming comprises allocating the data entries from the optimisable group of data clusters to form the optimised data clusters having eliminated overlapping data ranges in each search space dimension.
In one embodiment, the forming comprises allocating the data entries from the optimisable group of data clusters to form optimised data clusters having non overlapping optimised data ranges in each search space dimension.
In one embodiment, the forming comprises allocating the data entries from the optimisable group of data clusters to form optimised data clusters whose distance between the non-overlapping optimised data ranges is maximised in each search space dimension.
In one embodiment, the forming comprises partitioning the data entries from the optimisable group of data clusters using a partitioning algorithm. In one embodiment, the partitioning algorithm seeks to maintain more than a 75% fill average in each optimised data cluster.
In one embodiment, the partitioning algorithm seeks to partition the data entries from the optimisable group of data clusters at least once in each search space dimension.
In one embodiment, the partitioning algorithm seeks to provide an equal number of split planes in each search space dimension.
In one embodiment, the partitioning algorithm seeks to partition regions of less dense data value distribution into optimised data clusters having more dense data value distribution.
In one embodiment, the partitioning algorithm comprises a KD-tree algorithm. It will be appreciated that other algorithms such as quadtrees, octrees, BSP trees and the like may be utilised.
In one embodiment, the method comprises storing each optimised data cluster in the storage.
In one embodiment, the method comprises identifying a range of data values for each search space dimension within each data cluster and storing an indicator of each range of data values as the metadata for each corresponding data cluster. Hence, metadata may be stored for each data range to provide an index for each searchable field.
In one embodiment, the method comprises ordering the range in accordance with an ordering indicator for each search dimension.
In one embodiment, the range identifies at least a maximum and minimum data value for that search space dimension within that data cluster.
In one embodiment, the method comprises storing an indicator of the data values for each search space dimension. Such an indicator may be configured to exclude certain patterns such as when applying a Bloom filter. In one embodiment, the method comprises incorporating each metadata into a search tree for all data clusters. Accordingly, the metadata may be incorporated into a search tree to facilitate efficient searching of the metadata of each data cluster.
In one embodiment, the search tree comprises a KD-tree. It will be appreciated that other algorithms such as quadtrees, octrees, BSP trees and the like may be utilised.
In one embodiment, the method comprises storing all or parts of the metadata in a compressed form.
In one embodiment, the method comprises storing, with the metadata, a pointer to a location of each corresponding data cluster or clusters in the storage. Accordingly, the metadata may include a pointer. It can make sense to partition data values of data entries across multiple data clusters. In that case it can make sense to store more than one pointer in the metadata. The metadata may include an indication of the location of each data cluster in the storage. The metadata may also include a size indicator in order to identify where each cluster begins.
In one embodiment, the method comprises storing with the metadata an entries counter providing an indication of how many data entries are within each data cluster.
In one embodiment, the method comprises storing with the metadata statistical information about the data entries stored within each data cluster.
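By way of illustration, the metadata described above (a range per search space dimension, an entries counter and a pointer to the cluster's location) could be derived from a data cluster as sketched below. The `ClusterMetadata` type, its field names and the use of a byte offset as the pointer are assumptions for this sketch only.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Entry = Tuple[float, ...]


@dataclass
class ClusterMetadata:
    ranges: List[Tuple[float, float]]  # (min, max) per search space dimension
    entry_count: int                   # entries counter
    location: int                      # hypothetical byte offset in storage


def build_metadata(entries: Sequence[Entry], location: int) -> ClusterMetadata:
    """Derive the per-dimension ranges and entry count for one data cluster."""
    dimensions = len(entries[0])
    ranges = [(min(e[d] for e in entries), max(e[d] for e in entries))
              for d in range(dimensions)]
    return ClusterMetadata(ranges=ranges, entry_count=len(entries),
                           location=location)


# A cluster of three entries with two searchable fields.
meta = build_metadata([(3, 40), (1, 20), (7, 30)], location=4096)
```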
In one embodiment, the method comprises selecting a field as a search space dimension based on historic search requests. Accordingly, the fields which are selected to be included in the metadata may be selected actively, based on searches that are being made.
In one embodiment, the method comprises nulling the group of existing data clusters. Accordingly, when the optimized data clusters have been stored, the existing data clusters which they replace are nulled.
In one embodiment, the method comprises iteratively repeating the identifying, selecting and forming. Accordingly, the optimization can be iteratively repeated in order to optimize the data clusters. In one embodiment, the method comprises receiving data entries to be stored in a new data cluster and buffering the data entries until a minimal data cluster size has been reached. Accordingly, individual data entries may be received and buffered until a minimal size of data cluster formed from those received data entries is achieved.
In one embodiment, the minimal data cluster size comprises the bandwidth-optimised data block transfer size of the storage device.
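The buffering of incoming data entries might be sketched as follows. For simplicity the sketch measures the minimal data cluster size as a number of entries rather than as a block transfer size in bytes; that simplification, the class name `ClusterBuffer` and its methods are assumptions.

```python
from typing import List, Optional, Tuple

Entry = Tuple[float, ...]


class ClusterBuffer:
    """Collects entries until a minimal cluster size has been reached."""

    def __init__(self, min_entries: int) -> None:
        self.min_entries = min_entries
        self._pending: List[Entry] = []

    def add(self, entry: Entry) -> Optional[List[Entry]]:
        """Buffer one entry; return a new cluster once enough have arrived."""
        self._pending.append(entry)
        if len(self._pending) < self.min_entries:
            return None
        cluster, self._pending = self._pending, []
        return cluster

    def pending(self) -> int:
        """Number of entries still waiting to be stored."""
        return len(self._pending)


buffer = ClusterBuffer(min_entries=3)
results = [buffer.add((i, i * 2)) for i in range(4)]
```

A scheduler could consult `pending()` to defer or deprioritise optimisation while entries are still waiting to be stored.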
In one embodiment, the method comprises deferring the iteratively repeating until the new data cluster has been stored. Accordingly, the optimizing of data clusters may be deferred or its priority reduced while data entries are pending being stored.
In one embodiment, the method comprises receiving a search request for data and interrogating the metadata to identify candidate data clusters whose range of data values encompasses the search request. Accordingly, when a search request is received then the metadata may be searched to identify potential data clusters which may store data values satisfying that search.
In one embodiment, the interrogating the metadata comprises interrogating the search tree.
In one embodiment, the method comprises returning a result of the search request based only on the metadata. Accordingly, it may be possible in some circumstances to return the result of the search based purely on the metadata. For example, the metadata may indicate that no data cluster can contain a data value matching the search criteria, in which case no access to the data clusters is required. Likewise, some searches may relate to data stored within the metadata itself, such as returning a number of entries falling within a search range or matching search criteria. In that case, the answer to the query can be returned again without needing to access the data clusters themselves. It will be appreciated that various different values can be stored in the metadata to enhance such search queries. Should the metadata indicate that matching data values may be present in one or more data clusters, then those data clusters may be interrogated.
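The metadata-only handling of a range search can be sketched as below. The sketch assumes a flat dictionary of metadata rather than a search tree, and the names `Meta`, `candidate_clusters` and `count_from_metadata` are hypothetical. A count query is answered from the metadata when every candidate cluster lies wholly inside the query range; otherwise the function returns `None` to signal that the candidate clusters themselves must be interrogated.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Box = List[Tuple[float, float]]  # (min, max) per search space dimension


@dataclass
class Meta:
    ranges: Box
    entry_count: int


def intersects(box: Box, query: Box) -> bool:
    return all(lo <= q_hi and q_lo <= hi
               for (lo, hi), (q_lo, q_hi) in zip(box, query))


def contained(box: Box, query: Box) -> bool:
    return all(q_lo <= lo and hi <= q_hi
               for (lo, hi), (q_lo, q_hi) in zip(box, query))


def candidate_clusters(index: Dict[str, Meta], query: Box) -> List[str]:
    """Clusters whose stored range could hold values matching the query."""
    return [cid for cid, meta in index.items()
            if intersects(meta.ranges, query)]


def count_from_metadata(index: Dict[str, Meta], query: Box) -> Optional[int]:
    """Exact match count from metadata alone, or None if clusters are needed."""
    total = 0
    for meta in index.values():
        if not intersects(meta.ranges, query):
            continue
        if not contained(meta.ranges, query):
            return None  # partially covered cluster: its data must be read
        total += meta.entry_count
    return total


index = {
    "a": Meta([(0, 10)], 100),
    "b": Meta([(20, 30)], 50),
    "c": Meta([(40, 50)], 70),
}
```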
In one embodiment, the method comprises returning an approximate result of the search request based only on the statistical information stored in the metadata. In one embodiment, the method comprises interrogating the candidate data clusters to return a result of the search request.
In one embodiment, the interrogating the candidate data clusters comprises interrogating only the candidate data clusters to return the result of the search request.
In one embodiment, the method comprises performing a join operation between said group of optimised data clusters and another group of optimised data clusters.
In one embodiment, the method comprises performing a join operation between an optimised data cluster within said group of optimised data clusters and an optimised data cluster within said another group of optimised data clusters.
According to a second aspect, there is provided a data processing apparatus, comprising: identification logic operable to identify a group of existing data clusters having a clustering characteristic, each existing data cluster comprising data entries of a data information system stored together in storage, each data entry having at least one field storing data values; selection logic operable to select at least one existing data cluster from the group of existing data clusters as an optimisable group of data clusters; and formation logic operable to form a group of optimised data clusters by allocating data entries of the optimisable group of data clusters to each optimised data cluster to improve the clustering characteristic for the group of optimised data clusters compared to the group of existing data clusters.
In one embodiment, the group of data clusters may have a characteristic, feature or parameter that can be related to a metric.
In one embodiment, each data cluster occupies a data range in a search space defined by values of each data entry of each field.
In one embodiment, the search space has ‘n’ dimensions, each dimension being defined by a corresponding ‘n’ field.
In one embodiment, each data cluster has a size which matches a bandwidth-optimised data block transfer size of the storage. In one embodiment, each data cluster has a size no larger than a bandwidth-optimised data block transfer size of the storage.
In one embodiment, each data cluster has a size larger than a bandwidth-optimised data block transfer size of said storage.
In one embodiment, each data cluster has a size a multiple of a bandwidth-optimised data block transfer size of said storage.
In one embodiment, one or more data clusters are compressed or stored in compressed form.
In one embodiment, each data cluster has associated metadata which provides at least an indication of the data range in the search space defined by values of each data entry of at least one field.
In one embodiment, the metadata stores at least one additional parameter relating to that data cluster.
In one embodiment, the additional parameter comprises a number of entries in that data cluster.
In one embodiment, the optimizable group of data clusters is determined from the metadata.
In one embodiment, each data entry is an entry in a database.
In one embodiment, the clustering characteristic comprises a search selectivity between the existing data clusters. In one embodiment, the clustering characteristic comprises a number of existing data clusters accessed in response to search enquiries. In one embodiment, the clustering characteristic comprises a separation between existing data clusters in the search space. In one embodiment, the clustering characteristic comprises an overlap of data ranges of data values between existing data clusters in the search space. In one embodiment, the clustering characteristic comprises data ranges of data values within each existing data cluster in the search space. In one embodiment, the selection logic is operable to select the at least one existing data cluster from the group of existing data clusters as the optimisable group of data clusters based on a size of its occupied search space.
In one embodiment, the selection logic is operable to select the at least one existing data cluster from the group of existing data clusters as the optimisable group of data clusters based on a number of intersections in the search space with other existing data clusters.
In one embodiment, the selection logic is operable to select the at least one existing data cluster from the group of existing data clusters as the optimisable group of data clusters based only on associated metadata.
In one embodiment, the identification logic is operable to generate a group of ideal data clusters from the group of existing data clusters, the group of ideal data clusters being generated within the search space occupied by the group of existing data clusters and wherein the selection logic is operable to select at least one existing data cluster from the group of existing data clusters which intersects a selected ideal data cluster in the search space as the optimisable group of data clusters.
In one embodiment, the identification logic is operable to generate the group of ideal data clusters from the group of existing data clusters, the group of ideal data clusters being generated within the search space occupied by the group of existing data clusters based on an ideal clustering criteria which would improve the clustering characteristic.
In one embodiment, the ideal clustering criteria comprises an increase in a search selectivity between the existing data clusters. In one embodiment, the ideal clustering criteria comprises a decrease in a number of existing data clusters accessed in response to search enquiries. In one embodiment, the ideal clustering criteria comprises an increase in a separation between existing data clusters in the search space. In one embodiment, the ideal clustering criteria comprises a decrease in an overlap of data ranges of data values between existing data clusters in the search space. In one embodiment, the ideal clustering criteria comprises a decrease in a data range of data values within each or at least one existing data cluster in the search space. In one embodiment, the identification logic is operable to generate the group of ideal data clusters based on an assumed distribution of data entries within the search space within each existing data cluster.
In one embodiment, the identification logic is operable to generate the group of ideal data clusters using a partitioning algorithm, scheme or process which partitions the search space to have similar numbers of data entries in each ideal data cluster.
In one embodiment, the partitioning algorithm seeks to partition the data entries at least once in each dimension of the search space.
In one embodiment, the partitioning algorithm seeks to provide an equal number of split planes in each dimension of the search space.
In one embodiment, the partitioning algorithm splits more often in certain dimensions based on user-defined settings or automatically learned usage patterns.
In one embodiment, the partitioning algorithm comprises a KD-tree. It will be appreciated that other algorithms such as quadtrees, octrees, BSP trees and the like may be utilised.
In one embodiment, the identification logic is operable, for each ideal data cluster in the group of ideal data clusters, to determine a deviation in occupied search space between that ideal data cluster and existing data clusters intersecting that ideal data cluster and the selection logic is operable to select the selected ideal data cluster based on the deviation.
In one embodiment, the selection logic is operable to select the selected ideal data cluster having maximum deviation.
In one embodiment, the identification logic is operable, for each existing data cluster intersecting the selected ideal data cluster, to determine a deviation in occupied search space between that existing data cluster and the selected ideal data cluster and the selection logic is operable to select at least one of the existing data clusters intersecting the selected ideal data cluster for inclusion in the optimisable group of data clusters based on the deviation. In one embodiment, the selection logic is operable to select that existing data cluster intersecting the selected ideal data cluster having a maximum deviation for inclusion in the optimisable group of data clusters.
In one embodiment, the selection logic is operable to select that existing data cluster intersecting the selected ideal data cluster which most occupies the search space.
In one embodiment, the selection logic is operable to select neighbouring existing data clusters to that existing data cluster having the maximum deviation for inclusion in the optimisable group of data clusters.
In one embodiment, the neighbouring existing data clusters include existing data clusters which most occupy the search space.
In one embodiment, the neighbouring existing data clusters include existing data clusters which are closest in the search space to that existing data cluster having the maximum deviation.
In one embodiment, the neighbouring existing data clusters overlap in the search space with that existing data cluster having the maximum deviation.
In one embodiment, the selection logic is operable to select that existing data cluster intersecting the selected ideal data cluster which most occupies the search space.
In one embodiment, the formation logic is operable to form the group of optimised data clusters by allocating data entries of the optimisable group of data clusters to each optimised data cluster using a partitioning algorithm which would improve the clustering characteristic.
In one embodiment, the partitioning algorithm partitions the search space occupied by the group of optimised data clusters to have similar numbers of data entries in each optimised data cluster.
In one embodiment, the partitioning algorithm seeks to maintain more than a 75% fill average in each optimised data cluster. In one embodiment, the partitioning algorithm seeks to partition the data entries at least once in each dimension of the search space.
In one embodiment, the partitioning algorithm splits more often in certain dimensions based on user-defined settings or automatically learned usage patterns.
In one embodiment, the partitioning algorithm seeks to provide an equal number of split planes in each dimension of the search space.
In one embodiment, the partitioning algorithm comprises a KD-tree. It will be appreciated that other algorithms such as quadtrees, octrees, BSP trees and the like may be utilised.
In one embodiment, the formation logic is operable to form optimised data clusters which minimise a deviation with respect to the group of ideal data clusters.
In one embodiment, the formation logic is operable to allocate the data entries of the optimisable group of data clusters to each optimised data cluster subject to a maximum number of data entries being provided in each optimised data cluster.
In one embodiment, the selection logic is operable to select overlapping existing data clusters which store data entries having overlapping data ranges within a plurality of fields defining a plurality of the search space dimensions as the optimisable group of data clusters and the formation logic is operable to allocate the data entries from the optimisable group of data clusters to form the optimised data clusters having minimised overlapping data ranges in each search space dimension.
In one embodiment, the formation logic is operable to allocate the data entries from the optimisable group of data clusters to form the optimised data clusters having optimised shape.
In one embodiment, the formation logic is operable to allocate the data entries from the optimisable group of data clusters to form the optimised data clusters having minimised data range overlap in each search space dimension. In one embodiment, the formation logic is operable to allocate the data entries from the optimisable group of data clusters to form the optimised data clusters having eliminated overlapping data ranges in each search space dimension.
In one embodiment, the formation logic is operable to allocate the data entries from the optimisable group of data clusters to form optimised data clusters having non overlapping optimised data ranges in each search space dimension.
In one embodiment, the formation logic is operable to allocate the data entries from the optimisable group of data clusters to form optimised data clusters whose distance between the non-overlapping optimised data ranges is maximised in each search space dimension.
In one embodiment, the formation logic is operable to partition the data entries from the optimisable group of data clusters using a partitioning algorithm.
In one embodiment, the partitioning algorithm seeks to maintain more than a 75% fill average in each optimised data cluster.
In one embodiment, the partitioning algorithm seeks to partition the data entries from the optimisable group of data clusters at least once in each search space dimension.
In one embodiment, the partitioning algorithm seeks to provide an equal number of split planes in each search space dimension.
In one embodiment, the partitioning algorithm seeks to partition regions of less dense data value distribution into optimised data clusters having more dense data value distribution.
In one embodiment, the partitioning algorithm comprises a KD-tree algorithm. It will be appreciated that other algorithms such as quadtrees, octrees, BSP trees and the like may be utilised.
In one embodiment, the apparatus comprises storing logic operable to store each optimised data cluster in the storage. In one embodiment, the apparatus comprises metadata logic operable to identify a range of data values for each search space dimension within each data cluster and to store an indicator of each range of data values as the metadata for each corresponding data cluster.
In one embodiment, the metadata logic is operable to order the range in accordance with an ordering indicator for each search dimension.
In one embodiment, the range identifies at least a maximum and minimum data value for that search space dimension within that data cluster.
In one embodiment, the storing logic is operable to store an indicator of the data values for each search space dimension.
In one embodiment, the metadata logic is operable to incorporate each index into a search tree for all data clusters.
In one embodiment, the search tree comprises a KD-tree. It will be appreciated that other algorithms such as quadtrees, octrees, BSP trees and the like may be utilised.
In one embodiment, the storing logic is operable to store all or parts of the metadata in a compressed form.
In one embodiment, the metadata logic is operable to store with the metadata a pointer to a location of each corresponding data cluster in the storage.
In one embodiment, the metadata logic is operable to store with the metadata an entries counter providing an indication of how many data entries are within each data cluster.
In one embodiment, the storing logic is operable to store with the metadata statistical information about the data entries stored within each data cluster.
In one embodiment, the metadata logic is operable to select a field as a search space dimension based on historic search requests. In one embodiment, the storing logic is operable to null the group of existing data clusters.
In one embodiment, the identification logic is operable to repeatedly identify a group of existing data clusters, the selection logic is operable to select at least one existing data cluster and the formation logic is operable to form a group of optimised data clusters iteratively.
In one embodiment, the apparatus comprises buffering logic operable to receive data entries to be stored in a new data cluster and to buffer the data entries until a minimal data cluster size has been reached.
In one embodiment, the minimal data cluster size comprises the bandwidth-optimised data block transfer size of the storage device.
In one embodiment, the buffering logic is operable to defer the iteratively repeating until the new data cluster has been stored.
In one embodiment, the apparatus comprises search logic operable to receive a search request for data and to interrogate the metadata to identify candidate data clusters whose range of data values encompasses the search request.
In one embodiment, the search logic is operable to interrogate the search tree.
In one embodiment, the search logic is operable to return a result of the search request based only on the metadata.
In one embodiment, the search logic is operable to return an approximate result of the search request based only on the statistical information stored in the metadata.
In one embodiment, the search logic is operable to interrogate the candidate data clusters to return a result of the search request.
In one embodiment, the search logic is operable to interrogate only the candidate data clusters to return the result of the search request.
In one embodiment, the apparatus comprises joining logic operable to perform a join operation between said group of optimised data clusters and another group of optimised data clusters.
In one embodiment, the joining logic is operable to perform a join operation between an optimised data cluster within said group of optimised data clusters and an optimised data cluster within said another group of optimised data clusters.
According to a third aspect, there is provided a computer program product operable, when executed on a computer, to perform the method of the first aspect.
Further particular and preferred aspects are set out in the accompanying independent and dependent claims. Features of the dependent claims may be combined with features of the independent claims as appropriate, and in combinations other than those explicitly set out in the claims.
Where an apparatus feature is described as being operable to provide a function, it will be appreciated that this includes an apparatus feature which provides that function or which is adapted or configured to provide that function.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the present invention will now be described further, with reference to the accompanying drawings, in which:
Figure 1 illustrates a data processing apparatus according to one embodiment;
Figure 2 illustrates the main processing steps performed by the data processing apparatus when receiving data entries according to one embodiment;
Figure 3 illustrates the main processing steps performed by the data processing apparatus when optimising data clusters according to one embodiment;
Figures 4A to 4N illustrate optimising data clusters according to one embodiment;
Figure 5 illustrates the main processing steps performed by the data processing apparatus in response to a query according to one embodiment; and
Figures 6A and 6B illustrate data join operations according to one embodiment.
DESCRIPTION OF THE EMBODIMENTS
Overview
Before describing embodiments in any more detail, first an overview will be provided. Embodiments recognise that the way that data is stored in a data information system may be sub-optimal for efficiently performing operations during data processing since the data is often distributed in storage in a manner that makes an operation inefficient, which reduces the processing speed of the data processing apparatus. Accordingly, embodiments store data in data clusters stored in storage and optimise those data clusters. Each data cluster in storage typically has a maximum size which is determined by an optimal data transfer size between the storage and processing logic performing the processing. While this optimises accesses between the data processor and the storage, the content of the data clusters themselves may be unrelated, and even random. For example, consider the situation where the data information system is an inventory database. For each item there may be a number of different types of data to be stored which need to be associated with that item, such as item identifier, purchase price, purchaser identifier, firmware date, item location, etc. Hence, each data entry may be a row in the database having the fields “item identifier”, “purchase price”, “purchaser identifier”, “firmware date”, “item location”, etc.
As users add items to the inventory, a transaction may be provided to the data processing apparatus which buffers the transactions as a data cluster of entries until that data cluster matches the optimal size for transfer to the storage. As is apparent from this example, the entries are likely to be widely distributed. That is to say, for any data cluster the range of firmware dates is likely to be widely distributed, as are the identifiers. When storing each data cluster, metadata is generated which provides a search index indicating the range of values stored in each field which may need to be searched. For example, the metadata may provide an indication of the range of firmware dates of the entries within that data cluster and/or the range of values of each identifier of the entries within that data cluster, etc.
When performing an operation on the data clusters stored in storage, it is possible to then interrogate the metadata to see which data clusters cannot include entries which meet the search criteria because their ranges fail to encompass the search criteria. For example, a search may be for items with a purchase price of more than $300 which have a firmware date of more than four years ago. Any data cluster whose metadata indicates that all of its entries have a purchase price of less than $300, or a firmware date which is less than four years ago, can be ignored. However, any data cluster which cannot be ignored must be retrieved to perform the operation using the data entries. As mentioned above, a characteristic of the data entries in these data clusters is that they are likely to be widely distributed and lack correlation as each transaction is likely to be reasonably random. Accordingly, many data clusters stored in this way may need to be accessed and their entries interrogated in order to perform the operation. It will be appreciated that, even then, a null answer may be returned if there are no entries matching the operation criteria.
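The exclusion test described above can be sketched as follows. This is a minimal illustration only: the names `ClusterMeta`, `may_match` and `candidate_clusters`, the field names, and the use of inclusive bounds are assumptions made for the example and are not taken from the embodiments.

```python
from dataclasses import dataclass


@dataclass
class ClusterMeta:
    """Hypothetical per-cluster metadata: an inclusive (min, max) range per indexed field."""
    cluster_id: int
    ranges: dict  # field name -> (min, max)


def may_match(meta, criteria):
    """Return False only when the metadata proves no entry can satisfy the criteria.

    criteria maps a field name to (low, high); None means unbounded on that side.
    """
    for field, (low, high) in criteria.items():
        if field not in meta.ranges:
            continue  # field is not indexed, so the cluster cannot be excluded on it
        fmin, fmax = meta.ranges[field]
        if low is not None and fmax < low:
            return False  # every entry lies below the requested range
        if high is not None and fmin > high:
            return False  # every entry lies above the requested range
    return True


def candidate_clusters(metas, criteria):
    """Identifiers of the clusters that still have to be retrieved from storage."""
    return [m.cluster_id for m in metas if may_match(m, criteria)]


# Illustrative data: price of at least 300 and firmware year of 2014 or earlier.
metas = [
    ClusterMeta(1, {"purchase_price": (15, 295), "firmware_year": (2015, 2016)}),
    ClusterMeta(2, {"purchase_price": (250, 900), "firmware_year": (2012, 2018)}),
    ClusterMeta(3, {"purchase_price": (400, 700), "firmware_year": (2017, 2018)}),
]
criteria = {"purchase_price": (300, None), "firmware_year": (None, 2014)}
print(candidate_clusters(metas, criteria))
```

In this illustrative data only cluster 2 remains a candidate, and it must still be retrieved and interrogated because its wide ranges cannot rule out a match.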
Accordingly, embodiments perform optimization on the stored or existing data clusters in order to perform the operation more efficiently by avoiding or reducing the number of accesses to storage required in response to queries. This optimization can be adapted to suit the particular physical and functional constraints or characteristics of the data processing apparatus and its storage. In any event, in general terms, the optimization procedure involves identifying existing data clusters stored in the storage which exhibit characteristics which are likely to lead to poor or reduced search efficiency. For example, if more than a particular number of data clusters have overlapping data ranges, then it is likely that a search will encompass that overlapping range and each of those data clusters may need to be retrieved in order to return a result to that query. Also, it is likely that data clusters which occupy a wider range of data values are more likely to need to be accessed than those occupying a smaller range of data values. Also, when the distance between the space occupied by data clusters increases, the likelihood that additional data clusters will need to be retrieved in response to a search query decreases compared to data clusters where the space between them is smaller. It will be appreciated that, depending on the type of operations being performed, other characteristics of the data clusters may need to be adjusted to suit those types of operations.
Accordingly, for the data clusters mentioned above, search performance can be improved by decreasing the likelihood that data clusters not containing the data required for the operation are returned, so that only a minimal number of data clusters that may contain the data being searched for are returned. By examining existing data clusters within the data information system and optimizing those data clusters by taking data values within those data clusters and forming new data clusters which exhibit better search characteristics, the processing performance of the data processing apparatus can be improved.
Data Processing Apparatus
Figure 1 illustrates a data processing apparatus, generally 100, according to one embodiment. The data processing apparatus 100 has one or more processor cores 120 arranged to execute a sequence of instructions that are applied to data supplied to the processor core 120 over a bus 115. Hereinafter, the term data value will be used to refer to either instructions or data. A memory 150 may be provided for storing the data values required by the processor core 120. A cache 160 may also be provided for storing data values required by the processor core 120, thus increasing the speed of processing since the number of accesses required to the memory 150 is reduced. Data values may also be received from and provided to external devices such as a storage device 110 using input/output logic 140 via the bus 115.
Cluster Storage
Figure 2 illustrates the main processing steps performed by the data processing apparatus 100 when receiving data entries, according to one embodiment. At step S10, a data entry is received. Typically, each data entry is arranged to store data values in one or more fields, which typically store different types of data values, as is common in data information systems.
At step S20, a determination is made of whether sufficient data entries have been received to perform an efficient data transfer with the storage device 110. If insufficient data entries have been received, then processing returns to step S10 where the received data entry is buffered and added to by subsequently received data entries. When it is determined at step S20 that sufficient data entries have been buffered to perform an efficient data transfer to the storage device 110, then processing proceeds to step S30.
At step S30, metadata is generated which provides a search index for one or more fields in each data entry. Metadata may also be stored for other information such as the number of entries in a data cluster, an average value of a field in the data cluster, etc.
It will be appreciated that a search index need not be provided for every field as this increases the size of the metadata and the amount of processing required to generate that metadata. Hence, metadata may instead be generated for the fields which are most commonly searched. The metadata may indicate particular values stored in fields in the data cluster. More typically, the metadata indicates, for each field which requires a search index, the range of values stored by data entries within that field within that data cluster. Such a range may be expressed as a maximum value and a minimum value of data stored in a particular field in that data cluster, as a mid-value and a distance from that mid-value, or in any other way. For example, if the “purchase price” and “firmware date” fields are to be indexed, then a particular data cluster may have metadata indicating that the firmware date of entries in that data cluster ranges from 1.Mar.15 to 28.Oct.16 and the purchase price of entries in that data cluster ranges from $15 to $295.
The data cluster is then stored in the storage device 110. Typically, a pointer is added to the metadata for the data cluster which has been stored in storage to indicate its location in that storage. The metadata is also typically stored at a location in the storage device 110 , but a copy may be retained in memory in order to facilitate fast interrogation of the metadata. Preferably, once the cluster and its metadata have been stored then the data cluster can be made available to the data information system for interrogation. However, it will be appreciated that the metadata can be made available for interrogation earlier than this. Processing then returns to step S10 to await further data entries.
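The receive, buffer, index and store sequence of steps S10 to S30 might be sketched as below. This is an illustrative sketch under stated assumptions: the class `ClusterBuffer` and its attributes are hypothetical, a Python list stands in for the storage device 110, and the cluster size is expressed as an entry count rather than as the transfer size of a real storage device.

```python
class ClusterBuffer:
    """Hypothetical buffering logic: collect entries until a cluster is worth storing."""

    def __init__(self, min_entries, indexed_fields):
        self.min_entries = min_entries        # assumed stand-in for the optimal transfer size
        self.indexed_fields = indexed_fields  # fields that receive a search index
        self.pending = []                     # entries buffered so far (step S10)
        self.storage = []                     # stands in for the storage device
        self.metadata = []                    # in-memory copy of the per-cluster metadata

    def receive(self, entry):
        """Buffer one entry and store a cluster once enough entries are held (step S20)."""
        self.pending.append(entry)
        if len(self.pending) >= self.min_entries:
            self._store_cluster()

    def _store_cluster(self):
        """Generate range metadata (step S30), then store the cluster with a pointer to it."""
        entries, self.pending = self.pending, []
        meta = {
            "ranges": {
                f: (min(e[f] for e in entries), max(e[f] for e in entries))
                for f in self.indexed_fields
            },
            "count": len(entries),          # entries counter
            "location": len(self.storage),  # pointer to the cluster in storage
        }
        self.storage.append(entries)
        self.metadata.append(meta)


buf = ClusterBuffer(min_entries=2, indexed_fields=["purchase_price"])
for price in [10, 30, 20, 5, 7]:
    buf.receive({"purchase_price": price})
print(buf.metadata, len(buf.pending))
```

Because entries are clustered purely in arrival order, the resulting ranges tend to be wide and to overlap, which is the situation the optimisation described later is intended to improve.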
Accordingly, it can be seen that as data entries are received they are buffered until they achieve a size that is efficient for storing in the storage device 110 , in order to prevent increased, inefficient storage accesses. When the data cluster is ready for storage then search metadata is generated which defines characteristics of the data entries within that data cluster in order to make subsequent searching more efficient. However, it will be appreciated that the data entries in each data cluster may be random, with very little correlation between those data entries, and any such correlation may only be due to particular fortunate circumstances. Accordingly, whilst this technique provides for efficient use of storage system resources, and the metadata helps to exclude data clusters which cannot satisfy a query, the number of data clusters that may need to be retrieved and interrogated to answer a query may still be higher than is necessary.
Cluster Optimisation
Figure 3 and Figures 4A to 4N illustrate the main processing steps performed by the data processing apparatus 100 when optimizing data clusters. The data clusters to be optimized may include all of the data clusters stored by the storage device 110 or a subset of those data clusters. The selection may be random or based on some metric such as clusters which are often retrieved but do not answer a query. At step S40, the metadata for data clusters to be optimized is retrieved. Such retrieval may occur from the storage device 110, memory 150 or cache 160, depending on implementation.
As illustrated in Figure 4A, a group of existing data clusters 10 are selected. In this simple example, every existing data cluster is selected. Also in this simple example, the metadata for this group of existing data clusters 10 stores ranges for fields A and B.
This is illustrated schematically in Figure 4A where the ranges for field A are mapped onto the A axis, and the ranges for field B are mapped onto the B axis. It will be appreciated that this can be repeated for multiple fields which would then map into multiple dimensions. For example, the metadata for data cluster 10-1 indicates that the values of data entries within that cluster fall within the range A1A - A1B and within the range B1A - B1B. The metadata for the other data clusters are mapped in a similar way. It will be appreciated that these ranges may be numerical ranges or any other range which forms a metric space which is, for example, definable in Euclidean space and whose size can be determined (for example a Hamming distance).
As can be seen in Figure 4B, the complete group of existing data clusters 10 occupies a search space 20 bounded by AL - AU on the A axis and BL - BU on the B axis.
Returning now to Figure 3, at step S50, the metadata for this group of existing data clusters 10 is analysed to identify which of these data clusters to optimize. In one embodiment, this is achieved by assuming that each data value within the group of existing data clusters 10 is evenly distributed within the search space 20 in order to select a group of optimizable data clusters. Identifying the group of optimizable data clusters in this way reduces the processing burden and avoids the need to retrieve any of the existing data clusters themselves from the storage device 110 to make that determination.
As shown in Figure 4C, the search space 20 is partitioned using a partitioning algorithm. The partitioning algorithm used will be selected based on the characteristic(s) of the data clusters which it is desired to improve. In this embodiment, the partitioning algorithm initially seeks to place a partition line 25A1 along the A axis, so that the assumed number of data entries in the area 20A occupied by data clusters to one side of the line 25A1 matches the assumed number of data entries in the area 20B on the other side of the line 25A1. As shown in Figure 4D, the area 20A is split along the B axis in a similar manner by the line 25B1 and the area 20B is split in a similar way by the line 25B2.
This process continues until, as illustrated in Figure 4E, the search space 20 has been partitioned into a number of separate regions which equals or exceeds the number of existing data clusters within the search space 20. Typically, the search space is partitioned into 2^n regions. In this example, there were 7 data clusters, and so 8 regions have been formed. These regions represent an ideal partitioning of the search space 20 to meet the required clustering criteria.
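A metadata-only partitioning of this kind can be sketched as follows, assuming that the entries of each existing data cluster are spread uniformly over that cluster's bounding box. The function names, the `(lo, hi, count)` box representation and the use of bisection to locate each balanced cut are assumptions made for this sketch; the embodiments do not prescribe how the cut position is found.

```python
def fraction_below(a, b, x):
    """Share of a uniform range [a, b] that lies below x."""
    if b <= a:  # zero-width range: entirely on one side of the cut
        return 1.0 if a < x else 0.0
    return min(max((x - a) / (b - a), 0.0), 1.0)


def balanced_cut(boxes, axis, lo, hi, iterations=60):
    """Bisect for the cut position with half of the assumed entries on each side."""
    half = sum(n for _, _, n in boxes) / 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        below = sum(n * fraction_below(b_lo[axis], b_hi[axis], mid)
                    for b_lo, b_hi, n in boxes)
        if below < half:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def split_boxes(boxes, axis, x):
    """Clip every box at the cut, sharing its assumed entry count proportionally."""
    left, right = [], []
    for lo, hi, n in boxes:
        f = fraction_below(lo[axis], hi[axis], x)
        if f > 0.0:
            new_hi = list(hi)
            new_hi[axis] = min(hi[axis], x)
            left.append((lo, tuple(new_hi), n * f))
        if f < 1.0:
            new_lo = list(lo)
            new_lo[axis] = max(lo[axis], x)
            right.append((tuple(new_lo), hi, n * (1.0 - f)))
    return left, right


def partition(boxes, region_lo, region_hi, levels, depth=0):
    """Recursively cut the region, alternating axes, into 2**levels regions."""
    if levels == 0:
        return [(region_lo, region_hi)]
    axis = depth % len(region_lo)
    x = balanced_cut(boxes, axis, region_lo[axis], region_hi[axis])
    left, right = split_boxes(boxes, axis, x)
    left_hi = list(region_hi)
    left_hi[axis] = x
    right_lo = list(region_lo)
    right_lo[axis] = x
    return (partition(left, region_lo, tuple(left_hi), levels - 1, depth + 1)
            + partition(right, tuple(right_lo), region_hi, levels - 1, depth + 1))


def ideal_regions(boxes):
    """boxes: list of (lo, hi, count) with lo and hi as per-dimension tuples."""
    dims = len(boxes[0][0])
    region_lo = tuple(min(b[0][d] for b in boxes) for d in range(dims))
    region_hi = tuple(max(b[1][d] for b in boxes) for d in range(dims))
    levels = (len(boxes) - 1).bit_length()  # smallest n with 2**n >= number of clusters
    return partition(boxes, region_lo, region_hi, levels)
```

With seven input boxes, `ideal_regions` performs three levels of cuts and returns eight regions, matching the example in the text. Only metadata is consulted, so no data cluster needs to be read from storage at this stage.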
It will be appreciated that this technique is often referred to as a KD-tree. It will be appreciated that other partitioning techniques may be used such as, for example, a quadtree, octree, BSP tree and the like. The partitioning into optimized data clusters may be subject to a maximum or minimum filling constraint. The particular partitioning performed is intended to partition the space into an arrangement which would represent an ideal set of clusters that would meet the particular clustering criteria which best suits the search requirements of the data information system. In this example, it is desired to provide no overlap between data clusters and an equal number of splits in each dimension, thereby creating maximum selectivity in each dimension independently.
As indicated above, the partitioning assumes that the data values within the existing data clusters are distributed in a uniform way. However, as will become apparent, this would often not be the case, but this technique still enables optimizations of the existing data clusters to be performed to provide optimized data clusters in an efficient way which does not require excessive resources. For certain data sets this assumption holds so badly that it can make sense to keep a small set of samples per cluster. In particular, for very skewed data sets, the uniformity assumption is not enough to make the optimization converge. One option in these circumstances is to keep a low number of data entries per data cluster to better approximate the distribution.
In order to select data clusters to be optimised, two different approaches are envisaged. The first approach selects a data cluster for optimisation which is judged to be least aligned with the ideal set of clusters. A second approach selects an ideal data cluster for optimisation based on an error contribution of data clusters falling within that ideal data cluster. Turning now to the first approach, as can be seen in Figure 4F1, an existing data cluster 10-2 is selected. This selection is made by comparing each data cluster within the partitions and selecting the data cluster which least aligns with those partitions (or which deviates the most from those partitions). The existing data cluster which deviates the most is assumed to be the best candidate for optimization.
As shown in Figure 4G1, every data cluster which intersects in search space with the candidate data cluster 10-2 is also selected to create an optimizable group of data clusters 30, with all non-intersecting data clusters being ignored, as illustrated in Figure 4H.
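Collecting the clusters that intersect the candidate needs only the range metadata. A minimal sketch, assuming axis-aligned bounding boxes with closed bounds (so boxes that merely touch count as intersecting) and hypothetical function names:

```python
def boxes_intersect(a, b):
    """a and b are (lo, hi) pairs of per-dimension bounds, treated as closed intervals."""
    return all(a_lo <= b_hi and b_lo <= a_hi
               for a_lo, a_hi, b_lo, b_hi in zip(a[0], a[1], b[0], b[1]))


def optimizable_group(clusters, candidate_id):
    """Identifiers of every cluster whose box intersects the candidate's box.

    clusters maps a cluster identifier to its (lo, hi) bounding box. The candidate
    itself is part of the returned group.
    """
    candidate = clusters[candidate_id]
    return sorted(cid for cid, box in clusters.items()
                  if boxes_intersect(candidate, box))


clusters = {
    1: ((0, 0), (2, 2)),
    2: ((1, 1), (5, 5)),  # candidate
    3: ((4, 4), (6, 6)),
    4: ((7, 0), (8, 1)),  # does not intersect the candidate
}
print(optimizable_group(clusters, 2))
```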
Turning now to the second approach, as can be seen in Figure 4F2, an ideal data cluster 20’ is selected. This ideal data cluster 20’ is selected based on an error measure. For every partition (ideal data cluster) an error measure is computed. For each partition, data clusters falling within that partition are identified and a data cluster error based on the shape, overlap and positional misalignment of each of those data clusters is calculated. Those data cluster errors are then combined for that partition. For example, the ideal data cluster 20’ will have data cluster errors calculated for the two data clusters intersecting that ideal data cluster 20’ and these data cluster errors will be combined to give an error measure for that ideal data cluster 20’. The partition that has the highest error measure is selected, in this example, the ideal data cluster 20’. It will be appreciated that in another embodiment neighbouring partitions may also be selected for various reasons such as if a wider optimisation is required and/or for faster convergence per iteration.
As can be seen in Figure 4G2, every data cluster which intersects in search space with the ideal data cluster 20’ is selected to create an optimizable group of data clusters 30’, with all non-intersecting data clusters being ignored.
Irrespective of which approach is taken (the following description is based on the first approach, but applies equally to the second approach) the optimizable group of data clusters are then optimized. Returning to Figure 3, at step S60, those existing data clusters within the optimizable group of data clusters 30 are retrieved from the storage device 110 and their data values 200 stored in the entries of the optimizable group of data clusters 30 are mapped onto the search space 20, as illustrated in Figure 4I. It will be appreciated that although in this example the search space 20’ of the optimizable group of data clusters 30 matches the search space 20 of the existing data clusters, as illustrated in Figure 4J, this need not be the case and may instead be a subset of that search space 20.
Returning to Figure 3, at step S70 , the search space 20’ of the optimizable group of data clusters 30 is then partitioned in a similar manner to that described above, as illustrated in Figures 4K to 4L. Partitioning ceased after 4 partitions were generated, since the number of data clusters in the optimizable group 30 is also 4.
As illustrated in Figure 4M, optimized data clusters 10’-1 to 10’-4 are formed from the data values falling within each partition area. Metadata describing the range in the search dimensions A and B of each of those optimized data clusters 10’-1 to 10’-4 is generated and the optimized data clusters 10’-1 to 10’-4, together with their metadata, are stored. Once that storage has happened, then the existing data clusters within the group of optimizable data clusters 30, together with their metadata, can be nulled and the optimized data clusters 10’-1 to 10’-4 and their metadata can be made available to the data information system at step S80.
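Steps S60 to S80 operate on the retrieved data values themselves rather than on assumed distributions. The sketch below forms optimised clusters by recursive median cuts over the actual entries and regenerates the range metadata for each. The function names are hypothetical, and the sketch assumes the number of parts is a power of two and that there are at least as many entries as parts.

```python
def kd_partition(points, parts, depth=0):
    """Split points into `parts` groups using median cuts that alternate between axes."""
    if parts == 1:
        return [points]
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return (kd_partition(ordered[:mid], parts // 2, depth + 1)
            + kd_partition(ordered[mid:], parts // 2, depth + 1))


def cluster_metadata(points):
    """Range metadata (min, max) per search dimension, plus an entries counter."""
    dims = len(points[0])
    return {
        "ranges": [(min(p[d] for p in points), max(p[d] for p in points))
                   for d in range(dims)],
        "count": len(points),
    }


# Data values (A, B) retrieved from an optimizable group of four existing clusters.
values = [(1, 1), (2, 5), (3, 2), (4, 6), (5, 1), (6, 7), (7, 3), (8, 8)]
optimised = kd_partition(values, parts=4)
for cluster in optimised:
    print(cluster, cluster_metadata(cluster))
```

Entries that share the cut value may land on either side of a cut, so the ranges of neighbouring optimised clusters can touch at their boundaries even though they no longer overlap over an area.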
As can be seen in Figure 4N, the characteristics of the resultant data clusters have been improved, since there are now fewer data clusters, they are spaced further apart and the amount of overlap has been reduced. However, it can be seen that full optimization has not yet occurred and so processing may return to step S40 to continue to optimize the data clusters in an iterative manner.
When building the ideal data cluster model, worst-case data cluster configurations can be encountered for which the runtime complexity becomes quadratic. This happens for example if all data clusters overlap with each other, because for every cluster the error computation must consider every other data cluster in the set. In order to create a strict O(n log n) bound on the runtime complexity, the data clusters that have a very negative impact on the overall runtime are filtered out. One possible heuristic can be based on the size of the data clusters, because it is assumed that very large clusters are likely to overlap with very many clusters. In order to filter out these "bad" overlapping clusters the number of successive kD-tree levels in which the clusters intersect the same split planes is computed. The clusters that intersect split planes of successive levels for a certain or specified number of times are filtered out and handled separately.
Although this loses a bit of precision, this loss seems acceptable when dealing with large datasets.
Updates
Data values stored by data clusters may be changed or updated. For example, using the example mentioned above, the “firmware date” for an entry in a data cluster could be changed from one date to another. Updates can also include deletion of an entry from a data cluster. For example, using the example mentioned above, an item in the inventory database may be deleted. When such updates occur, new metadata is generated for the data cluster reflecting the changed data values within that data cluster. Those changes may then cause that updated data cluster to be selected for optimisation as mentioned above.
Searching
Figure 5 illustrates searching the data clusters according to one embodiment.
At step S90, a search enquiry is received. Typically, the search enquiry will relate, among other fields, to search fields whose data ranges are indicated in the metadata for the data clusters. Should the metadata not contain that information then, depending on implementation, that metadata can be added when optimizing the data clusters.
At step S100, the metadata is interrogated to see if it answers the query.
At step S110, an assessment is made of whether the query is answered. For example, a query may be made for an indication of the total number of data entries in the data clusters. As mentioned above, the metadata for each data cluster may include that as a data item, and so the answer can be returned without needing to interrogate the data clusters themselves. It will be appreciated that other data items relating to the data clusters may also be stored in the metadata. Similarly, an interrogation of the metadata may reveal that no data clusters contain data values which can possibly fall within the search criteria, and so, at step S120, an answer to the query is provided from the metadata alone.
If, instead, it is determined that it is not possible to answer the query from the metadata alone, then those data clusters which intersect with the search criteria are retrieved, the data entries in those data clusters interrogated and the answer to the query provided at step S140.
As an example, consider a search which is bounded in search space by the area A’ in Figure 4A, which is also illustrated in Figure 4N. Prior to optimisation of the data clusters as shown in Figure 4B, the metadata would have indicated that the result to that search could be contained in two data clusters, each of which would need to have been accessed from the storage device 110 in two data accesses (assuming that the size of the data clusters was matched to the data transfer size between the storage and processing logic), then interrogated before returning a null result. After optimisation of the data clusters as shown in Figure 4N, the metadata would have indicated that none of the data clusters can possibly store the result, thereby saving two data accesses and subsequent processing to interrogate those data clusters.
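The saving described in this example can be reproduced with a small sketch that counts how many clusters a query forces to be read. The data values, the function names and the in-memory lists standing in for stored clusters are all illustrative assumptions; in practice the bounding boxes would be read from stored metadata rather than recomputed.

```python
def bounding_box(entries):
    """Per-dimension (min, max) range of a cluster's entries."""
    dims = len(entries[0])
    return [(min(e[d] for e in entries), max(e[d] for e in entries))
            for d in range(dims)]


def run_query(clusters, query):
    """query holds an inclusive (low, high) range per dimension.

    Returns the matching entries and the number of clusters that had to be read.
    """
    hits, reads = [], 0
    for entries in clusters:
        box = bounding_box(entries)  # assumed to be available as metadata in practice
        if all(lo <= q_hi and q_lo <= hi
               for (lo, hi), (q_lo, q_hi) in zip(box, query)):
            reads += 1  # metadata cannot exclude this cluster, so it is retrieved
            hits.extend(e for e in entries
                        if all(q_lo <= v <= q_hi
                               for v, (q_lo, q_hi) in zip(e, query)))
    return hits, reads


query = [(4, 6), (4, 6)]
before = [[(0, 0), (10, 10)], [(1, 9), (9, 1)]]  # wide, overlapping clusters
after = [[(0, 0), (1, 9)], [(9, 1), (10, 10)]]   # same entries, regrouped
print(run_query(before, query))
print(run_query(after, query))
```

Both layouts hold the same four entries and neither contains a match, but the first layout requires two cluster reads to establish the null result while the second answers it from the ranges alone.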
Resource Allocation
It will be appreciated that the data processing resources dedicated to the receiving and storing of data clusters as illustrated in Figure 2, the optimization of data clusters as illustrated in Figure 3 and Figures 4A to 4N, and the searching of data clusters as illustrated in Figure 5, may be dynamically altered or statically prioritized in order to, for example, prioritize one process over the other and/or to make some processes foreground and others background. Typically, the searching and storing of data clusters are prioritized as foreground processes, with the optimization occurring in the background, as resources become available.
JOIN Operations
Figure 6A illustrates an example JOIN operation on two tables. Table a and Table b are unoptimised and store data values. Table a stores data values for the fields item_id, order_id and part_id. Table b stores data values for the fields item_id, sales_date and sales_id. It is possible to perform a JOIN operation in response to a query. For example, Table a may be JOINed with Table b along a shared field (dimension) which, in this example, is item_id. The result, Table c, contains data values which map order_id and part_id to sales_date and sales_id via item_id. The JOIN operation can be resource-intensive (requiring large amounts of memory) and can slow the processing speed dramatically, particularly as the size of the tables increases.
Figure 6B illustrates an example JOIN operation on two tables according to one embodiment. Table a’ and Table b’ are optimised using the techniques described above. Consequently, table a’ has optimised data clusters a’-1 to a’-5 and table b’ has optimised data clusters b’-1 to b’-5. Now individual JOIN operations can be performed using the optimised data clusters. For example, data cluster a’-1 can be JOINed with data cluster b’-1, a’-2 with b’-2, and so on to generate resultant JOINed data clusters.
In this example, five resultant JOINed data clusters will be generated. This approach enables a subset of the data from one table to be JOINed with a subset of the data from another table, which reduces the resources required (reduces the amount of memory utilised) and increases the processing speed dramatically, particularly as the size of the tables increases.
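A cluster-wise JOIN of this kind can be sketched as below. The sketch makes two assumptions that go beyond the text: clusters are paired by testing whether their item_id ranges overlap (rather than by matching cluster numbers), and each pair is joined with a simple in-memory hash join. The function names are hypothetical.

```python
def join_clusters(cluster_a, cluster_b, key):
    """Hash join of two clusters (lists of dict rows) on a shared key field."""
    index = {}
    for row in cluster_b:
        index.setdefault(row[key], []).append(row)
    return [{**a, **b} for a in cluster_a for b in index.get(a[key], [])]


def key_range(cluster, key):
    """(min, max) of the key within one cluster; normally read from metadata."""
    return min(r[key] for r in cluster), max(r[key] for r in cluster)


def clustered_join(table_a, table_b, key):
    """Join only those cluster pairs whose key ranges overlap."""
    out = []
    for ca in table_a:
        a_lo, a_hi = key_range(ca, key)
        for cb in table_b:
            b_lo, b_hi = key_range(cb, key)
            if a_lo <= b_hi and b_lo <= a_hi:
                out.extend(join_clusters(ca, cb, key))
    return out


table_a = [
    [{"item_id": 1, "order_id": 10}, {"item_id": 2, "order_id": 11}],
    [{"item_id": 5, "order_id": 12}],
]
table_b = [
    [{"item_id": 1, "sales_id": 100}, {"item_id": 2, "sales_id": 101}],
    [{"item_id": 5, "sales_id": 102}, {"item_id": 6, "sales_id": 103}],
]
print(clustered_join(table_a, table_b, "item_id"))
```

Each hash index covers a single cluster rather than a whole table, which is where the reduction in memory use comes from; cluster pairs with disjoint key ranges are skipped without being read.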
Accordingly, embodiments provide a mechanism to introduce data locality to a dataset incrementally. Embodiments alleviate limitations of existing techniques. In particular, in embodiments:
1) Scattered and inefficient input/output (I/O) data accesses (typically to a storage device) are avoided by clustering data. Access is typically at a granularity level optimized for the I/O systems of the one or more connected computer systems and clustering data ensures that a large proportion of the fetched data is relevant to the query (as opposed to a large proportion being irrelevant in the earlier cases). This can be executed on multiple dimensions at the same time (clustering data along each of these dimensions, co-locating similar data). It is irrelevant for the operability of the embodiments whether the dimensions are correlated or not.
2) Any index typically requires the keys and their location to be stored in this index, which, in the typical implementation, increases the amount of data as more keys are defined. This can consume a significant amount of resources and thereby creates inefficiencies, such as the index exceeding the capacity of the system’s caches (RAM etc.). Alternatively, an additional index with that key can be defined independently, which requires additional storage and prevents the ability to search on multiple keys together at the same time. With the combination of embodiments such a structure can be very small since it only includes part of the key data, such as key ranges.
3) In a typical index with multiple keys (e.g. n keys) at the same time, the order of the keys is typically predefined and a user may only query between 1 and n keys together in the order they were defined. With the multi-dimensional clustering of embodiments, data can be queried along 1 to n keys independently and in any order.
4) Any lookup structure requires administration to keep it up-to-date. By not forcing clustering of data during ingestion but instead incrementally building it in a stand-alone process, embodiments do not need to do any administration during the ingestion process and can therefore guarantee a stable ingestion performance while guaranteeing availability of all existing data, which is a limitation of existing B-tree-based indexes or other data structures such as Cache Oblivious Look-ahead Array (COLA) or Log Structured Merge Trees (LSM-Tree).
Traditional databases use indices to allow 1-to-1 lookup of rows. When analysing big ranges of data this has the overhead that typically a 4-16 KB chunk of data has to be read for every row, incurring a high read overhead, since typically only a fraction of this data is relevant for the query. The reverse case is true when the index is updated: it may be necessary to read and additionally to overwrite one entire, typically 4-16 KB, block of data just to modify one key and the associated pointer to the row, sometimes even multiple blocks.
Some databases may use a COLA. This allows for fast lookup, but must be kept up-to-date during ingestion, thereby incurring an ingestion overhead relative to the total dataset size. A COLA works by ingesting data into a first level of a multi-level structure. The first level typically covers the entire value range of the data that is ingested. When this level has reached a certain fill state, the data in this level is inserted into the next level down, possibly triggering the next level to reach its fill state as well. This level then also cascades downwards etc. Each level separates the value ranges. An example of this is a perfect order. Thereby clustering of data becomes more granular the lower the layer. These structures are typically of an amortized complexity O(log n) for every row inserted, the heavy penalty of triggering large cascades (heavy I/O) being somewhat offset against the clustering precision of data. An LSM-Tree works in a very similar fashion.
“Database Cracking” adaptively builds knowledge about the data contained in the database during the queries. It is used in some column-store databases. It moves the cost of index maintenance from the database changes (ingestion) to the queries (selection). The query processor provides information to the data handling mechanisms to re-arrange the data and execute optimizations such as a partial sorting or partial indexing. This technique is said to improve I/O and query processing speed and to exhibit self-optimizing behaviour.
A database is a tool to persistently store data inserted into it. It typically also has a very predictable behaviour regarding how these insertions are handled when multiple sources compete for storing data, and regarding the upper bound on how long these insertions typically take. The complexity of operating a database system typically limits the total system data throughput to a fraction of the throughput achievable when storing a stream of data with the standard system I/O without using a database.
In embodiments, given one or more connected computer systems operating a database containing one or multiple database tables, a process can organize the data in each table into many independent clusters of data.
a) Such a clustering is made by spatially organising data along multiple, possibly independent, dimensions. At the first insertion, information on the data’s properties, such as ranges, is already obtained and kept. The process of laying out the data along multiple possibly independent dimensions is both an independent process and an incremental process. Multiple possibly independent dimensions: the process is capable of organising along multiple dimensions at the same time and in an independent way:
- Able to order everything by projecting all types of values to numbers
- Able to cope with vastly different scales along each dimension
- Able to reach partial optima, able to stop oscillating between reordering steps
- A multi-dimensional range distribution that is guaranteed to create a minimum of selectivity along every dimension.
In one embodiment, a KD-tree is used with at least one split in every dimension. In one embodiment, a KD-tree is used with typically equal selectivity on each dimension irrespective of the dimension’s cardinality (cardinality = how many different values there are in a data column). In one embodiment, the KD-tree is used with a user-defined selectivity scaling for each dimension.
It will be appreciated that using an independent process means that the process of optimising the data is typically undertaken after the data has been inserted into the database, thereby never blocking the insertion and affording a higher insertion rate. The total amount of data that can be persistently inserted into the database this way is therefore not limited by the time it may take to maintain another optimization structure, such as a database index. Where the maintenance of optimization structures would otherwise impose practical limits on the amount of data, embodiments do not impose this limit.
Furthermore, the effort, e.g. in terms of I/O and/or processing resources, can be limited to achieve the desirable balance between insertion and query performance. In the most extreme case, the user or system can omit this process to save resources for an undetermined amount of time, solely relying on the data’s properties obtained during insertion for executing c) below. Therefore, embodiments are capable of guaranteeing a predictable and high data insertion performance.
In embodiments, using an incremental process means that the resources dedicated to this process are adjustable as well as constrainable to what is determined, by the user or automatically by the system, to be the optimum trade-off between all system resources. The outcome of each increment is already usable and, typically, already shows an improvement over the prior state with respect to data clustering. This is true even if the increments are still far away from a mathematically-optimal distribution. When the above constraint is set to a value of N bytes, embodiments work with up to N bytes of data at the same time and are still able to improve the dataset for any N greater than or equal to [2 to the power of dimensions] times the data cluster size. A selection algorithm ensures that there is an improvement at every step, or that the system is informed that at the current state with the current N no further improvements can be made and thus no further resources are required.
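The stated lower bound on the working-set constraint N can be illustrated with a short calculation. The following is a minimal sketch only; the function names and the example figures (3 dimensions, 1 MiB clusters) are assumptions chosen for illustration and do not appear in the disclosure.

```python
def min_working_set_bytes(dimensions: int, cluster_size_bytes: int) -> int:
    """Smallest N for which the text states an improvement is still possible:
    2 to the power of the number of dimensions, times the data cluster size."""
    return (2 ** dimensions) * cluster_size_bytes


def meets_stated_bound(budget_bytes: int, dimensions: int, cluster_size_bytes: int) -> bool:
    """True if the configured budget N is at or above the stated bound."""
    return budget_bytes >= min_working_set_bytes(dimensions, cluster_size_bytes)


if __name__ == "__main__":
    one_mib = 1 << 20
    # With 3 dimensions and 1 MiB clusters the bound is 8 MiB.
    print(min_working_set_bytes(3, one_mib) // one_mib, "MiB")
    print(meets_stated_bound(4 * one_mib, 3, one_mib))
```

Under these assumed figures, a budget of 4 MiB falls below the stated bound while a budget of 8 MiB meets it; the disclosure does not state what happens below the bound.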
It is observed that incrementally running the above mechanism takes an effort of O(n * log n) for n rows to theoretically reach mathematical optimality, but in the embodiments it was observed that only a fraction of this effort is required. Typically the effort is an approximately constant factor in relation to the data inserted. This contrasts with other implementations (such as an index), where the effort of maintaining the structure is relative to the total dataset size.
b) Once the data has reached a close-to-optimum distribution, it is desirable to only re-organize the newly-ingested data into clusters. Embodiments choose their increments such that the data furthest away from the desired distribution is incrementally re-organized first. Thereby, for certain distributions, the effort is more correlated to the number of data entries ingested in a time period as opposed to the total dataset size (this is different to the database index, where the effort is determined by the dataset size).
c) In embodiments, a metadata structure can be kept to limit access only to the relevant clusters that contain data the user asks for. This metadata structure is able to determine this relevance by storing the range along each dimension that is covered in each of the clusters. It will be appreciated that storing a range for each dimension takes only two values per dimension, and it is therefore very small compared to the underlying entries described. In this context it is important to note that a range is just one of the possible embodiments. Each range can be evaluated independently. The cluster size is configurable to reach the optimum trade-off between:
- the I/O size to be fetched at good I/O performance;
- the corresponding selectivity based on the extent of the clusters along each dimension (hypothetical extent is dimension-root of the total cluster count); and
- the relative size of the metadata structure and resulting access speed (smaller metadata structures can be kept in the individual cache hierarchies: fast caching storage, RAM, CPU caches etc.).
Ingestion:
In embodiments, data entries such as one or more rows are received by the database. A row typically has a defined set of columns; the value of a column in a row is called a field. These rows are evaluated against a number of dimensions, each dimension being determined by one or more fields in the row (a single field, a concatenation of multiple fields, a calculation from multiple fields etc.) or generated, e.g. the total row count. In one embodiment dimensions are set a priori. In another, dimensions are learned from usage patterns. In one embodiment every field in a row is chosen as its own dimension. The row is stored. In order to store the rows, a plurality of rows is buffered until an I/O-optimal size is reached or the database operation requires writing out the data/this buffer. The set of rows having an approximately I/O-optimal size is referred to as a row cluster (data cluster). When a row cluster is created, the information on which data is contained along each dimension in the cluster is extracted and stored in a metadata structure. In another embodiment, more rows than required for a single row cluster are buffered. Then, the distribution of the rows into row clusters already follows the optimization step disclosed below. In one embodiment, the information on which data is contained is the range between the values in each of the dimensions. In another embodiment, the information is a probabilistic data structure, such as a bloom filter.
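The buffering of rows into row clusters and the extraction of per-dimension range metadata can be sketched as follows. This is a minimal illustration under stated assumptions: rows are tuples of numbers with every field used as its own dimension, the cluster size is expressed as a row count rather than an I/O size, and the names `ClusterMeta`, `build_meta` and `ingest` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Row = Tuple[float, ...]


@dataclass
class ClusterMeta:
    ranges: List[Tuple[float, float]]  # (min, max) per dimension: two values per dimension
    row_count: int                     # how many rows are inside the row cluster


def build_meta(rows: Sequence[Row]) -> ClusterMeta:
    """Extract the range covered along each dimension by a (non-empty) row cluster."""
    dims = len(rows[0])
    ranges = [(min(r[d] for r in rows), max(r[d] for r in rows)) for d in range(dims)]
    return ClusterMeta(ranges, len(rows))


def ingest(rows: Sequence[Row], rows_per_cluster: int) -> List[Tuple[List[Row], ClusterMeta]]:
    """Buffer rows in arrival order and emit a row cluster whenever the buffer is full.
    No re-ordering happens here; clustering quality is improved later by optimization."""
    clusters: List[Tuple[List[Row], ClusterMeta]] = []
    buffer: List[Row] = []
    for row in rows:
        buffer.append(row)
        if len(buffer) == rows_per_cluster:
            clusters.append((buffer, build_meta(buffer)))
            buffer = []
    if buffer:  # the database operation requires writing out a partly filled buffer
        clusters.append((buffer, build_meta(buffer)))
    return clusters


if __name__ == "__main__":
    for cluster_rows, meta in ingest([(1, 10), (5, 2), (3, 7)], rows_per_cluster=2):
        print(cluster_rows, meta)
```

Because ingestion only appends to a buffer and records minima and maxima, no existing structure has to be updated while rows arrive, which is the property the description relies on for stable ingestion performance.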
In such a probabilistic filter, which could be a bit mask, one or more bits at different positions indicate if a value may be contained. If one of the bits is not set it can be concluded that the value is not contained. The distance between two probabilistic filters can be determined by finding out which bits are equal and which are not (001 to 001 = distance 0, 010 to 101 = distance 3, 010 to 011 = distance 1). In another embodiment, the values in each dimension are given an order, e.g. “aab” after “aaa”, and distance is determined by finding the distance in steps along the order. This order may be implicit from the values (e.g. the given text ordering example) or kept as a dictionary that each part of the database can look up in. In one embodiment, where the cardinality for one or more dimensions is very low, bits indicate the definite presence or absence of a value as opposed to indicating a probable presence.
It will be appreciated that storing a range, a bit mask for values contained or a range along an order or similar takes up much less space than the row cluster. Thereby the metadata is typically many factors smaller than an index, which typically at least stores every key (= the field value) and a pointer to that key’s row. In a typical embodiment the metadata also includes information such as how many rows are inside a row cluster. In one embodiment row clusters are compressed and the metadata contains additional information, such as storage sizes, to optimize the I/O access when retrieving the row cluster. In one embodiment row clusters and/or the metadata are compressed using additional hardware that can be configured to execute compression or other processing of the row cluster and/or the metadata in line with the query conditions.
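The bit-mask distance described above counts the bit positions in which two filters differ, which is a Hamming distance. The sketch below reproduces the three worked examples from the text; representing a filter as a Python integer and the names `filter_distance` and `may_contain` are assumptions made for illustration.

```python
def filter_distance(a: int, b: int) -> int:
    """Number of bit positions in which two bit-mask filters differ (Hamming distance)."""
    return bin(a ^ b).count("1")


def may_contain(cluster_mask: int, value_bits: int) -> bool:
    """A value may be contained only if every bit it maps to is set in the cluster's mask.
    If any of those bits is not set, the value is definitely not contained."""
    return cluster_mask & value_bits == value_bits


if __name__ == "__main__":
    print(filter_distance(0b001, 0b001))  # the text gives distance 0
    print(filter_distance(0b010, 0b101))  # the text gives distance 3
    print(filter_distance(0b010, 0b011))  # the text gives distance 1
    print(may_contain(0b011, 0b010), may_contain(0b011, 0b100))
```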
Optimization:
The purpose of optimization is to increase selectivity when querying for the data while being able to conduct optimization iteratively (no “all or nothing” case). Row clusters are evaluated for their relative selectivity, i.e. how likely they are to be chosen for retrieval by a query, and thus to create a cost, versus the probability of including data required by that query. Selectivity can, for example, be approximated by the extent of their range (in one embodiment), by the number of splits per dimension (in one embodiment), by the number of bits set (in another), by the range along the order (in another) or similar. Wider ranges or more bits set, or any other way in which data selectivity is lower, shall be referred to as larger clusters hereafter; narrower ranges or fewer bits set, or any other way in which data selectivity is higher, as smaller clusters. This evaluation can be undertaken for each dimension individually or for multiple dimensions at the same time. In one embodiment, larger clusters are prioritized over smaller clusters. Clusters are also selected to be at a shorter distance to each other. In one embodiment, the rows in multiple larger clusters within a certain, typically close, distance of each other are processed. Close distance can include overlap, which is spatial overlap, range overlap or equal bits set, depending on the embodiment. If overlap is present, rows of smaller clusters that overlap with the aforementioned larger clusters are also included in the processing. In another embodiment, the target distribution with a high selectivity has been identified, for example, by sampling the data. Row clusters are then chosen based on their divergence from the target distribution, the more diverging ones in preference to the less diverging ones. In one embodiment, multiple of the aforementioned selection mechanisms are combined to choose the row clusters.
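One possible reading of the range-based selection embodiment is sketched below: the largest cluster is taken as a seed and clusters that overlap it are added, largest first, up to a limit. The scoring function (mean width per dimension, normalised by the overall extent so that differently scaled dimensions are comparable), the single-seed strategy and all names are assumptions for illustration; the disclosure does not prescribe them.

```python
from typing import List, Sequence, Tuple

Range = Tuple[float, float]
Box = Sequence[Range]  # one (min, max) range per dimension, taken from the metadata


def overlaps(a: Box, b: Box) -> bool:
    """Two clusters overlap if their ranges intersect in every dimension."""
    return all(lo_a <= hi_b and lo_b <= hi_a for (lo_a, hi_a), (lo_b, hi_b) in zip(a, b))


def relative_size(box: Box, global_box: Box) -> float:
    """Mean fraction of the overall extent covered per dimension (larger = less selective)."""
    total = 0.0
    for (lo, hi), (glo, ghi) in zip(box, global_box):
        span = ghi - glo
        total += (hi - lo) / span if span > 0 else 0.0
    return total / len(box)


def choose_for_optimization(boxes: List[Box], max_clusters: int) -> List[int]:
    """Pick the largest cluster, then add overlapping clusters, larger ones first."""
    dims = len(boxes[0])
    global_box = [(min(b[d][0] for b in boxes), max(b[d][1] for b in boxes))
                  for d in range(dims)]
    order = sorted(range(len(boxes)),
                   key=lambda i: relative_size(boxes[i], global_box), reverse=True)
    seed = order[0]
    chosen = [seed]
    for i in order[1:]:
        if len(chosen) >= max_clusters:
            break
        if overlaps(boxes[seed], boxes[i]):
            chosen.append(i)
    return chosen


if __name__ == "__main__":
    metadata = [
        [(0, 10), (0, 10)],    # large cluster
        [(8, 12), (5, 6)],     # smaller cluster overlapping the large one
        [(20, 21), (20, 21)],  # small, isolated cluster
    ]
    print(choose_for_optimization(metadata, max_clusters=3))
```

The selection uses only the metadata ranges, so no row cluster has to be read in order to decide which clusters to optimise.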
The rows are processed by re-distributing rows to row clusters such that the larger clusters turn into smaller clusters, the overlap is reduced and the distance between row clusters is increased, while maintaining a certain threshold of rows per row cluster. Thereby the resulting re-distributed row clusters are typically more selective than the original row clusters. In one embodiment this re-distribution is operated with a KD-tree. Each row’s dimensions are inserted as an n-dimensional point in the KD-tree. The point clusters generated by the split planes created by the KD-tree are then used to obtain the new row clusters. In one embodiment, with a fixed row size, the row clusters are thereby on average at least 75% filled. In another embodiment, with a variable row size, KD-tree splits are adjusted for row size, i.e. instead of splitting by the median value, the split is made by the median aggregated row size. In one embodiment, the KD-tree splits at least once in every dimension. In another embodiment, the KD-tree ensures an approximately equal number of split planes in each dimension. It will be appreciated that this will typically result in equal selectivity on each dimension irrespective of the dimension’s cardinality. In another embodiment, the rows are re-distributed into row clusters such that as few bits as possible are set in the probabilistic data structure of each row cluster, i.e. the “distance” (in the sense in which it was defined before) between rows is minimized and larger clusters with overlap are converted into smaller clusters with no or limited overhead. For example, the row with the bitmask 001 shall be clustered with the row with the bitmask 101 in preference to clustering with the row with bitmask 010, or even 110 (which has the greatest distance). In another embodiment, the position along the order is interpreted as a range for the KD-tree splitting of the values. In another embodiment, the distribution is learned by applying an online k-means algorithm to the row (interpreting the row’s dimensions as an n-dimensional point).
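The KD-tree re-distribution can be sketched as a recursive median split that cycles through the dimensions. This is a minimal illustration assuming fixed-size rows represented as numeric tuples and a cluster capacity given as a row count; the function name `kd_split` is hypothetical, and the sketch makes no attempt to reproduce the fill-rate figure or the row-size-adjusted splitting mentioned in the text.

```python
from typing import List, Tuple

Row = Tuple[float, ...]


def kd_split(rows: List[Row], max_rows: int, depth: int = 0) -> List[List[Row]]:
    """Re-distribute rows into clusters of at most max_rows rows (max_rows >= 1).

    Each level splits at the median along one dimension and the dimensions are
    used in turn, so every dimension receives a similar number of split planes."""
    if len(rows) <= max_rows:
        return [rows]
    dim = depth % len(rows[0])                      # cycle through the dimensions
    ordered = sorted(rows, key=lambda r: r[dim])
    mid = len(ordered) // 2                         # median split plane
    return (kd_split(ordered[:mid], max_rows, depth + 1)
            + kd_split(ordered[mid:], max_rows, depth + 1))


if __name__ == "__main__":
    # Rows taken from two overlapping, poorly clustered row clusters.
    rows = [(1, 1), (9, 8), (2, 9), (8, 2)]
    for cluster in kd_split(rows, max_rows=2):
        print(cluster)
```

In this example the first split separates low from high values in the first dimension, so the two resulting clusters no longer overlap along that dimension.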
In a typical embodiment, the I/O and/or processing power used for optimization is limited to ensure enough resources are available for other processing, such as ingesting and selecting, and thereby to be able to give performance guarantees. In a typical embodiment, the number of row clusters chosen for optimization is balanced with the upper limit for I/O and/or processing power to yield the right balance between resource usage and creating smaller clusters. Such a balance can be obtained by automatically or manually inspecting how much smaller the clusters became. In one embodiment it may be chosen to delay the optimization temporarily to free resources for other database operations. The database is nevertheless fully capable of ingesting, selecting or executing other tasks during this period (that is, optimization is not required to always run to allow database operation).
Selecting:
The metadata structure is interrogated to determine whether a row cluster contains values that match the conditions of the database query. In one embodiment, this is carried out with a range check along each of the dimensions. In another embodiment, this is done by checking if all the relevant bits are set in the probabilistic data structure. In another embodiment, the order is retrieved by finding the dimensions’ ordering or looking up the dimensions’ ordering in the common dictionary; a range check can then inform about the possible presence of this value. Only the row clusters for which the metadata structure has indicated a possible presence of rows meeting all of the required conditions are retrieved. The metadata structure may be kept in a search tree of some form to accelerate interrogation. In one embodiment the metadata structure is organized as a KD-tree of n-dimensional cubes expressing the range in dimensions.
In some embodiments, queries or parts of queries are answered solely by inquiring the metadata structure. For example, when requesting a count for a range and the metadata can identify that none of the data lies within that range, the count is already known to be zero without retrieving any row cluster using I/O. In the reverse example, if the metadata structure can identify that all rows fall within the range, the metadata structure can provide the count directly by summing up the row count stored for each row cluster. In one embodiment, the metadata structure can estimate the result of queries based on the information it holds. For example, when a query requests the row count for half the range of a row cluster, the metadata structure can estimate that half the rows will match the query conditions and return the row count stored in the metadata structure divided by two. In another embodiment, the metadata structure keeps additional statistics, such as samples or information about the data distribution within the row cluster, to reach more precise estimates or even to be able to answer specific requests accurately based on these additional statistics.
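The range-check embodiment of selection, together with the metadata-only count and the estimate, can be sketched as follows. The sketch is illustrative only: the metadata is held in a plain list rather than a search tree, the estimate assumes rows are spread uniformly within each cluster's range, and the names `ClusterMeta`, `may_match`, `fully_inside`, `count_rows` and `estimate_rows` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Range = Tuple[float, float]


@dataclass
class ClusterMeta:
    ranges: List[Range]  # (min, max) per dimension
    row_count: int


def may_match(meta: ClusterMeta, query: List[Range]) -> bool:
    """Range check: the cluster can hold matching rows only if it overlaps the query
    in every dimension."""
    return all(lo <= q_hi and q_lo <= hi for (lo, hi), (q_lo, q_hi) in zip(meta.ranges, query))


def fully_inside(meta: ClusterMeta, query: List[Range]) -> bool:
    """True if every row of the cluster necessarily satisfies the query."""
    return all(q_lo <= lo and hi <= q_hi for (lo, hi), (q_lo, q_hi) in zip(meta.ranges, query))


def count_rows(metas: List[ClusterMeta], query: List[Range]) -> Tuple[int, List[int]]:
    """Return (rows counted from metadata alone, indices of clusters that must be read)."""
    known, to_read = 0, []
    for i, meta in enumerate(metas):
        if not may_match(meta, query):
            continue                      # pruned: no I/O needed, contributes zero
        if fully_inside(meta, query):
            known += meta.row_count       # answered from the stored row count
        else:
            to_read.append(i)             # partial overlap: the cluster must be retrieved
    return known, to_read


def estimate_rows(meta: ClusterMeta, query: List[Range]) -> float:
    """Estimate matches as row_count times the overlapping fraction of each range."""
    fraction = 1.0
    for (lo, hi), (q_lo, q_hi) in zip(meta.ranges, query):
        overlap = min(hi, q_hi) - max(lo, q_lo)
        if overlap < 0:
            return 0.0
        fraction *= overlap / (hi - lo) if hi > lo else 1.0
    return meta.row_count * fraction


if __name__ == "__main__":
    metas = [ClusterMeta([(0, 10)], 100), ClusterMeta([(20, 30)], 50), ClusterMeta([(40, 50)], 70)]
    print(count_rows(metas, [(5, 35)]))
    print(estimate_rows(metas[0], [(0, 5)]))
```

In the example, the third cluster is pruned, the second is counted from metadata alone, and only the first has to be retrieved; the estimate for half of the first cluster's range is half of its stored row count, matching the example in the text.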
General:
In one embodiment, a plurality of dimension mechanisms are combined, for example a range on one dimension with a probabilistic filter on another dimension and an order range filter on two other dimensions. In this case, optimization is undertaken by applying the aforementioned methods for each dimension mechanism individually, possibly interleaving them (e.g. distributing bit-fields in between finding KD-tree splitting planes). In the aforementioned case, selection is undertaken by evaluating each of the dimensions’ metadata to conclude whether, for the combination of all dimensions together, the possibility exists that all query conditions are met by a row cluster. This is done with the aforementioned methods. If this possibility exists, the row cluster is retrieved, otherwise not. In another embodiment, with a plurality of dimensions, the dimension mechanisms are combined by finding common measures for the optimization. One typical measure to use is the distance, which has been defined for multiple different dimension mechanisms in this disclosure. The combination of all distances can then subsequently be used for operating on multiple dimensions in accordance with the mechanism for a single dimension disclosed above.
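Using distance as a common measure across dimension mechanisms can be sketched as below for a cluster described by one range dimension and one bit-mask dimension. The disclosure does not say how the per-dimension distances are combined; the weighted sum, the use of the gap between two ranges as the range distance, and all names here are assumptions for illustration.

```python
from typing import Tuple

Range = Tuple[float, float]
ClusterDesc = Tuple[Range, int]  # (range on one dimension, bit mask on another)


def range_gap(a: Range, b: Range) -> float:
    """Assumed range distance: the gap between two ranges, zero if they overlap."""
    return max(0.0, max(a[0], b[0]) - min(a[1], b[1]))


def bit_distance(a: int, b: int) -> int:
    """Bit-mask distance as defined in the text: the number of differing bits."""
    return bin(a ^ b).count("1")


def combined_distance(a: ClusterDesc, b: ClusterDesc,
                      weights: Tuple[float, float] = (1.0, 1.0)) -> float:
    """Assumed combination: weighted sum of the per-mechanism distances."""
    return weights[0] * range_gap(a[0], b[0]) + weights[1] * bit_distance(a[1], b[1])


if __name__ == "__main__":
    left = ((0.0, 5.0), 0b010)
    right = ((8.0, 10.0), 0b101)
    print(combined_distance(left, right))
```

The weights stand in for whatever scaling an implementation would need to make a numeric gap and a bit count comparable; choosing them is outside what the text specifies.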
Although illustrative embodiments of the invention have been disclosed in detail herein, with reference to the accompanying drawings, it is understood that the invention is not limited to the precise embodiment and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope of the invention as defined by the appended claims and their equivalents.

Claims

1. A data processing method, comprising:
identifying a group of existing data clusters having a clustering characteristic, each existing data cluster comprising data entries of a data information system stored together in storage, each data entry having at least one field storing data values;
selecting at least one existing data cluster from said group of existing data clusters as an optimisable group of data clusters; and
forming a group of optimised data clusters by allocating data entries of said optimisable group of data clusters to each optimised data cluster to improve said clustering characteristic for said group of optimised data clusters compared to said group of existing data clusters.
2. The method of claim 1, wherein each data cluster occupies a data range in a search space defined by values of each data entry of each field.
3. The method of any preceding claim, wherein each data cluster has a size which matches a bandwidth-optimised data block transfer size of said storage.
4. The method of claims 2 or 3, wherein each data cluster has associated metadata which provides at least an indication of said data range in said search space defined by values of each data entry of at least one field.
5. The method of any preceding claim, wherein said clustering characteristic comprises at least one of:
a search selectivity between said existing data clusters;
a number of existing data clusters accessed in response to search enquiries;
a separation between existing data clusters in said search space;
an overlap of data ranges of data values between existing data clusters in said search space; and
data ranges of data values within each existing data cluster in said search space.
6. The method of any preceding claim, wherein said selecting comprises selecting said at least one existing data cluster from said group of existing data clusters as said optimisable group of data clusters based on a size of its occupied search space.
7. The method of any preceding claim, wherein said selecting comprises selecting said at least one existing data cluster from said group of existing data clusters as said optimisable group of data clusters based on a number of intersections in said search space with other existing data clusters.
8. The method of any one of claims 4 to 7, wherein said selecting comprises selecting said at least one existing data cluster from said group of existing data clusters as said optimisable group of data clusters based only on associated metadata.
9. The method of any preceding claim, comprising:
generating a group of ideal data clusters from said group of existing data clusters, said group of ideal data clusters being generated within said search space occupied by said group of existing data clusters and wherein said selecting comprises selecting at least one existing data cluster from said group of existing data clusters which intersects a selected ideal data cluster in said search space as said optimisable group of data clusters.
10. The method of claim 9, wherein said generating comprises generating said group of ideal data clusters from said group of existing data clusters, said group of ideal data clusters being generated within said search space occupied by said group of existing data clusters based on an ideal clustering criteria which would improve said clustering characteristic.
11. The method of claim 10, wherein said ideal clustering criteria comprises at least one of:
an increase in a search selectivity between said existing data clusters;
a decrease in a number of existing data clusters accessed in response to search enquiries;
an increase in a separation between existing data clusters in said search space;
a decrease in an overlap of data ranges of data values between existing data clusters in said search space; and
a decrease in a data range of data values within each existing data cluster in said search space.
12. The method of any one of claims 9 to 11, wherein said generating comprises generating said group of ideal data clusters based on an assumed distribution of data entries within said search space within each existing data cluster.
13. The method of any one of claims 9 to 12, wherein said generating comprises generating said group of ideal data clusters using a partitioning algorithm which partitions said search space to have similar numbers of data entries in each ideal data cluster.
14. The method of any one of claims 9 to 13, comprising:
for each ideal data cluster in said group of ideal data clusters, determining a deviation in occupied search space between that ideal data cluster and existing data clusters intersecting that ideal data cluster and wherein said selecting comprises selecting said selected ideal data cluster based on said deviation.
15. The method of any preceding claim, wherein said forming comprises forming said group of optimised data clusters by allocating data entries of said optimisable group of data clusters to each optimised data cluster using a partitioning algorithm which would improve said clustering characteristic.
16. The method of any preceding claim, wherein said selecting comprises selecting overlapping existing data clusters which store data entries having overlapping data ranges within a plurality of fields defining a plurality of said search space dimensions as said optimisable group of data clusters and said forming comprises allocating said data entries from said optimisable group of data clusters to form said optimised data clusters having minimised overlapping data ranges in each search space dimension.
17. The method of any preceding claim, comprising storing each optimised data cluster in said storage.
18. The method of any preceding claim, comprising:
identifying a range of data values for each search space dimension within each data cluster; and
storing an indicator of each range of data values as said metadata for each corresponding data cluster.
19. The method of any preceding claim, comprising nulling said group of existing data clusters.
20. The method of any preceding claim, comprising iteratively repeating said identifying, selecting and forming.
21. The method of any preceding claim, comprising:
receiving a search request for data;
interrogating said metadata to identify candidate data clusters whose range of data values encompasses said search request.
22. The method of claim 21, comprising returning a result of said search request based only on said metadata.
23. The method of any preceding claim, comprising performing a join operation between said group of optimised data clusters and another group of optimised data clusters.
24. A data processing apparatus, comprising:
identification logic operable to identify a group of existing data clusters having a clustering characteristic, each existing data cluster comprising data entries of a data information system stored together in storage, each data entry having at least one field storing data values;
selection logic operable to select at least one existing data cluster from said group of existing data clusters as an optimisable group of data clusters; and
formation logic operable to form a group of optimised data clusters by allocating data entries of said optimisable group of data clusters to each optimised data cluster to improve said clustering characteristic for said group of optimised data clusters compared to said group of existing data clusters.
25. A computer program product operable, when executed on a computer, to perform the method of any one of claims 1 to 23.
PCT/EP2019/064515 2018-06-05 2019-06-04 Data processing WO2019234039A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB1809174.4A GB201809174D0 (en) 2018-06-05 2018-06-05 Data processing
GB1809174.4 2018-06-05

Publications (1)

Publication Number Publication Date
WO2019234039A1 (en) 2019-12-12

Family

ID=62975529

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2019/064515 WO2019234039A1 (en) 2018-06-05 2019-06-04 Data processing

Country Status (2)

Country Link
GB (1) GB201809174D0 (en)
WO (1) WO2019234039A1 (en)


Also Published As

Publication number Publication date
GB201809174D0 (en) 2018-07-25


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19728948

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19728948

Country of ref document: EP

Kind code of ref document: A1