WO2016122653A1 - Common search distances in data partitions - Google Patents

Common search distances in data partitions

Info

Publication number
WO2016122653A1
WO2016122653A1 PCT/US2015/013952 US2015013952W WO2016122653A1 WO 2016122653 A1 WO2016122653 A1 WO 2016122653A1 US 2015013952 W US2015013952 W US 2015013952W WO 2016122653 A1 WO2016122653 A1 WO 2016122653A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
processing node
search
local
query
Prior art date
Application number
PCT/US2015/013952
Other languages
English (en)
Inventor
Maria T. GONZALEZ DIAZ
Jun Li
Krishnamurthy Viswanathan
Mijung Kim
Original Assignee
Hewlett Packard Enterprise Development Lp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett Packard Enterprise Development Lp filed Critical Hewlett Packard Enterprise Development Lp
Priority to PCT/US2015/013952 priority Critical patent/WO2016122653A1/fr
Publication of WO2016122653A1 publication Critical patent/WO2016122653A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9014 Indexing; Data structures therefor; Storage structures hash tables

Definitions

  • the data may be time series data that may be, for example, acquired by a sensor.
  • Issues with the sensor may be identified by searching for certain patterns in the time series data.
  • the time series data may be searched for patterns for various other purposes, such as classification, pattern detection, modeling and anomaly detection, as examples.
  • FIG. 1 is a schematic diagram of a multicore machine according to an example implementation.
  • FIG. 2 is a flowchart illustrating a method for assigning data points of a data set to multiple processing nodes of a multicore system, according to an example.
  • FIG. 3 is a block diagram illustrating an example operation of a shuffle-based configuration module that may be used to partition a dataset onto multiple processing nodes in a multicore machine, according to an example.
  • FIG. 4 is a flow chart illustrating a method for generating common search distances based on data points stored in data partitions, according to an example.
  • FIG. 5 is a sequence diagram illustrating a method for handling query requests, according to an example.
  • FIG. 6 is a diagram illustrating a worker thread of a processing node capable of processing concurrent local query requests, according to an example.
  • a “multicore” machine refers to a physical machine that contains a computing component with at least two independent central processing units (CPUs). Each CPU can contain multiple processing cores. A given processing core is a unit that is constructed to read and execute machine executable instructions. In accordance with example implementations, the multicore machine may contain one or multiple CPU semiconductor packages, where each package contains multiple processing cores. Further, multiple processing cores may access or otherwise share the same local memory. As used herein, the term "processing node” may refer to a unit that includes multiple processing cores that share the same local memory to access instructions and/or data.
  • using a multicore machine to process a search query, as disclosed herein, allows relatively time-efficient searching of a relatively large dataset for purposes of identifying and retrieving matches to the search query.
  • Techniques and systems that are disclosed herein may be applied to relatively large volume data, such as high dimensional data (multimedia image or video data, as examples), time series data, or any other suitable data. Identifying and retrieving high volume data that are similar to a data item specified by a search query may be useful for such purposes as classification, pattern detection, modeling, fault diagnosis, and anomaly detection, as well as for other purposes.
  • Performing a relatively time efficient search may allow the construction of better models, better pattern detection, faster fault analysis, more rapid classifications, and more timely detection of anomalies.
  • Other and different advantages may be achieved in accordance with further example implementations.
  • Approaches for using a multicore machine to identify and retrieve matches in a data set that are similar to a search query may involve partitioning the data set across multiple processing nodes in the multicore machine.
  • One approach for partitioning the data set may be to read a block of data from the original data set and store that block of data in partition A, read another block of data from the original data set and store that block of data in partition B, and so on.
  • data points stored in the same block of data may exhibit spatial locality and temporal locality. Accordingly, if the data set is partitioned based on how the data points are stored in the original data set, such spatial and temporal locality will be preserved in individual data partitions. Thus, it is possible that different partitions will be statistically dissimilar. Because the data partitions can be statistically dissimilar, it may be difficult to create a common search model for the different data partitions. Having different search models may make it difficult to determine whether the K closest points to a given query have been retrieved at any given point.
  • locality sensitive hashing (LSH) involves hashing data points to entries of a hash table such that data points that are similar to each other are likely to map to the same entry.
  • similarity may be measured as a function of a hash parameter referred to as a "search distance" (also referred to as "R").
  • an input data point is hashed to generate a key that maps to an entry of a hash table that specifies other data points that are likely to be proximate to the input data point.
  • a key may be a vector of hash values generated by multiple hash functions.
  • the data structure that includes a hash table may be referred to as a hash index.
  • the hash index can include multiple hash tables that each return data points that are likely to be near the input data point as a function of the search distance.
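  • By way of a minimal illustrative sketch only (in Python, with hypothetical names such as LshTable and num_hashes, and assuming Euclidean distance with random-projection hash functions), one hash table of such a hash index could be organized as shown below; the actual hashing scheme used by an implementation may differ.

```python
import numpy as np
from collections import defaultdict

class LshTable:
    """One hash table of a hash index; the key is a vector of hash values."""
    def __init__(self, dim, num_hashes, search_distance, seed=0):
        rng = np.random.default_rng(seed)
        # Random projections and offsets for Euclidean (p-stable) LSH; the
        # bucket width is tied to the search distance R of this index.
        self.a = rng.standard_normal((num_hashes, dim))
        self.b = rng.uniform(0.0, search_distance, size=num_hashes)
        self.w = search_distance
        self.buckets = defaultdict(list)

    def _key(self, point):
        return tuple(np.floor((self.a @ np.asarray(point) + self.b) / self.w).astype(int))

    def insert(self, point_id, point):
        self.buckets[self._key(point)].append(point_id)

    def query(self, point):
        # Candidate ids that are likely to be within distance R of the query point.
        return list(self.buckets.get(self._key(point), []))
```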
  • LSH-based search systems may use multiple hash indexes where each hash index is built using a different search distance (e.g., R0, R1, ..., RX).
  • a search system may then start with the hash index associated with the shortest search distance. If sufficient results are returned (e.g., a result set that exceeds the result size threshold), the search system returns those results and the search is done. Otherwise, the search system uses the hash index associated with the next largest search distance. This process of determining whether a hash index produces a sufficient result size and advancing if not continues until a sufficient search result is obtained.
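  • A minimal sketch of this progressive search is shown below; it assumes a non-empty list of (search distance, hash index) pairs sorted by increasing search distance and a query method like the one sketched above, and the helper names are hypothetical.

```python
def progressive_search(indexes_by_distance, query_point, result_size_threshold):
    """indexes_by_distance: (R, hash_index) pairs sorted by ascending R (non-empty)."""
    candidates = []
    for search_distance, hash_index in indexes_by_distance:
        candidates = hash_index.query(query_point)
        # Stop at the shortest search distance that yields enough results.
        if len(candidates) >= result_size_threshold:
            break
    return candidates
```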
  • one aspect discussed herein may involve assigning a first data point from a data set to a first data partition stored in a first processing node of a multicore machine.
  • the assignment of the first data point to the first data partition may be based on a first hash value derived from a first identifier (e.g., a file name, a sequence identifier, or any other suitable data usable to uniquely reference between different data points) assigned to the first data point.
  • a second data point from the data set may be assigned to a second data partition stored in a second processing node of the multicore machine.
  • the assignment of the second data point to the second data partition may be based on a second hash value derived from a second identifier assigned to the second data point.
  • a first set of hash indexes may be stored in the first processing node.
  • the first set of hash indexes may individually correspond to common search distances.
  • the second set of hash indexes may be stored in the second processing node. The second set of hash indexes may individually correspond to the common search distances.
  • a machine-readable storage device may include instructions that, when executed, cause a processor to distribute a data set across a first processing node and a second processing node according to a distribution pattern that maintains a similar distribution pattern between data points from the data set stored in a first local memory unit of the first processing node and data points from the data set stored in a second local memory of the second processing node.
  • the instructions, when executed, may further cause the processor to calculate common search distances based on an analysis of the distributed data set.
  • the shuffle-based configuration module may also cause the first processing node and the second processing node to create hash indexes with the common search distances.
  • an apparatus may include a first processing node, a second processing node, and a search coordinator module.
  • the first processing node may include a first set of processing cores with access to a first local memory unit.
  • the first local memory unit stores a first data partition and a first hash index constructed with common search distances.
  • the second processing node includes a second set of processing cores with access to a second local memory unit.
  • the second local memory unit stores a second data partition and a second hash index constructed with the common search distances.
  • the search coordinator module may initiate local searches in the first processing node and the second processing node. The local searches perform searches on the first data partition and the second data partition based on progressively increasing the common search distances.
  • the search coordinator module is to notify a client device when cumulative search results from local search results exceed a search result size threshold.
  • a client device may be a computer device or computer system that sends requests to the search coordinator module.
  • the requests may be query requests that request a search of a data set for data points that may be similar to a data item specified by the request.
  • the "search result size threshold" may be a parameter of the search that specifies a determinable number of data points that are to be returned from the request, in some cases, the search result size threshold may represent a number such that the results return a number of data points that are the closest data points in the data set.
  • FIG. 1 is a schematic diagram of a multicore machine 100 according to an example implementation.
  • the multicore machine 100 is a physical machine that is constructed from actual machine executable instructions, or "software," and actual hardware.
  • the hardware of the multicore machine 100 includes processing nodes 110 (processing nodes 110-1...110-S being depicted as examples in FIG. 1), which may be part of, for example, a particular multicore central processing unit (CPU) package or multiple multicore CPU packages.
  • each processing node 110 contains CPU processing cores 112 (processing cores 112-1, 112-2...112-Q being depicted in FIG. 1 for each node 110) and a local memory 114.
  • each processing node 110 contains a memory controller (not shown).
  • the multicore machine 100 may have a non-uniform memory access (NUMA) architecture; and the processing nodes 110 may be NUMA nodes.
  • the persistent memory 130 may store a data set.
  • the data set may store a large volume of data.
  • the data stored in the data set may be represented as multidimensional data points (or simply "data points").
  • the data set 160 is organized as files in a distributed file system.
  • the multicore machine 100 may also include a search system 140 comprising, among other things, a shuffle-based configuration module 142 and a search coordinator module 144.
  • the shuffle-based configuration module 142 may distribute the data set 160 across the processing nodes according to a distribution pattern that maintains a similar distribution pattern between data points from the data set stored in the different local memory units.
  • the processing nodes may generate, and store in local memory, hash indexes derived from the common search distances. As described below, the common search distances may be supplied by the shuffle-based configuration module 142.
  • the shuffle-based configuration module 142 may be configured to issue commands to the processing nodes 110 that initiate index building on the processing nodes 110.
  • these commands may include common search distances derived from an analysis of the data distribution in the data partitions.
  • the search coordinator module 144 may receive a query request, which can specify a given multidimensional data point to be searched as part of a search query.
  • the search coordinator module 144 can then return query output data (also referred to as a query response) that represents a top K number of similarity matches (also herein referred to as the "search size threshold" or the "top K results”) to the search query.
  • the search coordinator module 144 may issue commands that initiate local searches in the processing nodes.
  • the local searches may execute by progressively performing searches on the data partitions based on progressively increasing the common search distances.
  • the search coordinator module 144 may notify the client device when cumulative search results from local search results exceed a search result size threshold.
  • the shuffle-based configuration module 142 and the search coordinator module 144 may be implemented as machine executable instructions ("software") that execute on the multicore machine 100.
  • FIG. 2 is a flowchart illustrating a method 200 for assigning data points of a data set to multiple processing nodes of a multicore system, according to an example.
  • the method 200 may be performed by the modules, components, and systems shown in FIG. 1, and, accordingly, is described herein merely by way of reference thereto. It will be appreciated that the method 200 may, however, be performed on any suitable hardware.
  • the method 200 may begin when, at operation 202, a shuffle-based configuration module assigns a first data point from a data set to a data partition stored in a first processing node of a multicore machine. This assignment may be based on a first hash value the shuffle-based configuration module derives from a first identifier assigned to the first data point. In some cases, the data set may be obtained from a persistent storage device of the multicore machine or received from an external storage device or a communication input/output device.
  • the shuffle-based configuration module may assign a second data point from the data set to a second data partition stored in a second processing node of the multicore machine based on a second hash value derived from a second identifier assigned to the second data point.
  • the first processing node may store a first set of hash indexes in the first processing node.
  • the first set of hash indexes may individually correspond to common search distances.
  • each index may correspond to a common search distance (e.g., R1, R2, ..., RN).
  • the second processing node may store a second set of hash indexes in the second processing node. Similar to the first set of hash indexes, the second set of hash indexes may individually correspond to the common search distances. Accordingly, the first set of hash indexes and the second set of hash indexes may correspond to the same search distances. Example implementations for generating common search distances are described below with reference to FIG. 4.
  • FIG. 3 is a block diagram illustrating an example operation of a shuffle-based configuration module 300 that may be used to partition a dataset onto multiple processing nodes in a multicore machine, according to an example.
  • the shuffle-based configuration module 300 shown in FIG. 3 includes readers 302a-m, a hashing module 306, and writers 304a-n. In some cases, the shuffle-based configuration module 300 may be an implementation of the shuffle-based configuration module 142 of FIG. 1.
  • the readers 302a-m may be threads of execution executing on a processing node of a multicore machine or a machine communicatively coupled to the multicore machine.
  • FIG. 3 shows that there are m readers.
  • Each reader may be configured to read a portion of the data set 160 to which the reader is assigned.
  • a portion can be a block of data if the data set is stored in a distributed file system, such as a Hadoop File System.
  • the number of readers (e.g., m) can be determined by the total number of the blocks occupied by the data set in the Hadoop File System.
  • the writers 304a-n may be threads of execution executing on a processing node of a multicore machine or a machine communicatively coupled to the multicore machine.
  • FIG. 3 shows that there are n writers, which can be different from m.
  • the number n can be determined by dividing the total data size of the data set by the designed partition size.
  • a partition size can be 4 million data points.
  • a partition size can be 1 GB, 4 GB, 12 GB, or any other suitable data size, which can be significantly different from the distributed file system's block size.
  • the writers 304a-n write to the files stored in a distributed file system.
  • a partition can be a file.
  • the hashing module 306 may also be threads of execution executing on a processing node of a multicore machine or a machine communicatively coupled to the multicore machine.
  • the hashing module 306 may be configured to map data points from the readers 302a-m to the writers 304a-n. For example, a reader may read a data point stored in the data set. The data point then may be communicated (via pull or push methods) to the hashing module 306.
  • Hashing module 306 can then determine which destination writer is to process the data point. In some implementations, the data point can have an identifier, such as a file name, a unique sequence identifier in the whole data set, or any other data that uniquely references the data point. Accordingly, the hashing module 306 can map the data point to a writer according to: destination writer = Hash_Function(ID) mod N, where:
  • the Hash_Function may be a hash function. It is to be appreciated that the Hash_Function is different from the LSH hashing scheme discussed above in that Hash_Function (and the hashing module 306 in general) is used to map data points to data partitions rather than generating hash-indexes.
  • ID may be an identifier for the data point.
  • N may be the number of writers.
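  • A minimal illustrative sketch of this mapping is shown below; Python's hashlib is used only as a stand-in for Hash_Function, since the description does not fix a particular hash function, and the identifiers are hypothetical.

```python
import hashlib

def writer_for(point_id: str, num_writers: int) -> int:
    """Map a data point identifier to one of N writers: Hash_Function(ID) mod N."""
    digest = hashlib.sha1(point_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_writers

# Identifiers that are adjacent in the original data set end up scattered
# across partitions, which breaks spatial and temporal locality.
assignments = {pid: writer_for(pid, 4)
               for pid in ("point-000001", "point-000002", "point-000003")}
```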
  • a "data partition,” such as the data partition 310, may be a unit of data storage used to store a subset of the data points that are distributed from the data source 160.
  • local memory 114 of a processing node 110 may store the data partition 310, as, for example, a file.
  • the operations of the shuffle-based configuration module 300 can be implemented by a MapReduce framework.
  • mapping data points to data partitions using techniques similar to techniques shown in FIG. 3 may have many practical applications. For example, in large data sets, data points that are adjacent to each other are likely to have been obtained from the same data source (spatial locality) or collected at time instants that are close to each other (temporal locality). If example implementations simply partition the data set based on how the data set is stored on the file system, such spatial and temporal locality will be preserved on individual partitions and, as a result, different partitions will be statistically dissimilar. However, mapping the data points to partitions using a hashing function, for example, may produce partitions that are statistically similar to the original data set.
  • Having statistically similar data partitions can, in some cases, allow for common search distances to be used when generating the indexes for each partition.
  • Using common search distances on statistically similar partitions can lead to advantages when performing a similarity search on the data partitions. For example, from the perspective of the design of the coordinator, it is much easier to determine when the search results obtained at any point are sufficient to determine the K closest points to a given query if all the data partitions use identical search parameters, particularly the search distances.
  • FIG. 4 is a flow chart illustrating a method 400 for generating common search distances based on data points stored in data partitions, according to an example.
  • FIG. 4 shows that the method 400 may begin at operation 402 when the shuffle-based configuration module selects a determinable number of data points from each data partition. The selection performed at operation 402 may be done at random, in some cases. These data points may be referred to as a representative query set.
  • for each query in the representative query set, the shuffle-based configuration module may: (a) retrieve the K closest points in each of the data partitions; and (b) compute the distances of the closest points in each of the data partitions.
  • the shuffle-based configuration module may estimate the cumulative distribution function for the distance of the K-th closest point based on the K closest distances returned by the partitions for the representative queries.
  • the common search distances can be computed.
  • the R values can be computed in different ways to optimize various objective functions.
  • let F(x) be the estimated cumulative distribution function of the distance of the K-th closest point to a random query.
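  • As one illustration only (the description leaves the exact objective function open, so the quantile choice below is an assumption), the common search distances R0, R1, ... could be taken as quantiles of the empirical distribution underlying F(x):

```python
import numpy as np

def common_search_distances(kth_distances, num_levels):
    """kth_distances: distance of the K-th closest point observed for each
    representative query, pooled across partitions. Returns num_levels values
    R0 < R1 < ... chosen as evenly spaced quantiles of the empirical CDF F(x)."""
    quantiles = np.linspace(1.0 / num_levels, 1.0, num_levels)
    return np.quantile(np.asarray(kth_distances), quantiles).tolist()

# e.g. common_search_distances(kth_distances, num_levels=4) -> [R0, R1, R2, R3]
```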
  • the shuffle-based configuration module may send a command to the processing nodes to generate indexes based on the common search distances.
  • FIG. 5 is a sequence diagram illustrating a method 500 for handling query requests, according to an example.
  • the method 500 may begin at operation 510 when the search coordinator module receives a query search request 510 from a client device 502. Responsive to receiving the query request, the search coordinator module 504 may distribute local query requests 512 to the processing nodes 506a-n, each of which may correspond to a different data partition. After the search coordinator module 504 distributes the local query requests 512, the search coordinator module 504 may block 514 to receive local query results from the processing nodes 506a-n.
  • FIG. 5 shows that the processing nodes 506a-n may perform local searches 516a-n, in parallel, on the data partitions stored in local memory of the respective processing nodes.
  • in some implementations, the processing nodes may perform the local searches using an LSH scheme.
  • the local searches 516a-n may use a first search distance value (e.g., R0).
  • when a processing node completes a local search, the processing node sends a local query response back to the search coordinator module.
  • the processing nodes 506a-n send local query responses 515 independently of each other. It is possible that there are no top-K search results to be found at the given search distance used to perform the search (e.g., R0) for a processing node.
  • in that case, some implementations of the processing node can send a NULL search result to the search coordinator module 504.
  • the processing node may then proceed by performing searches 522a using the next search distance, say, in this case, R1.
  • the search coordinator module blocks processing of the query request (but may, in some cases, continue processing other query requests) until receiving local query responses from each of the processing nodes. Once the local query responses are received, the search coordinator module evaluates the search results found in the local query responses to determine whether the results (e.g., the number of data points within the search results) exceeds the result size threshold.
  • the local query responses 515 may relate to a search distance R0, and, accordingly, the search coordinator module may determine whether there are at least K points within distance R0 across the different local query responses. If not, the search coordinator module 504 blocks 520 for the next set of local query responses 523.
  • if there are at least K such points, the search coordinator module 504 will send a query result response 526 to the client device 502.
  • the search coordinator module 504 may send a local query abandonment command 525 to the processing nodes 506a-n.
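  • A minimal sketch of this coordinator behavior (distribute local queries, gather local query responses per search distance, and stop and abandon once K cumulative results are available) is shown below; the node interface (search_at, abandon) and the item attribute distance are hypothetical, and the sketch runs the nodes sequentially rather than in parallel.

```python
def coordinate_query(nodes, query_point, k, common_search_distances):
    """Sequential sketch of the coordinator; a real system would issue the
    local query requests to the processing nodes in parallel."""
    results = []
    for r in common_search_distances:
        # Gather a local query response for the current search distance from every node.
        for node in nodes:
            results.extend(node.search_at(query_point, r))  # may be empty (a NULL result)
        if len(results) >= k:
            # Enough cumulative results: tell the nodes to abandon any outstanding work.
            for node in nodes:
                node.abandon(query_point)
            break
    # Return the K closest of the gathered candidates to the client.
    return sorted(results, key=lambda item: item.distance)[:k]
```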
  • when a processing node receives a local query abandonment command and the local search is still in progress, the processing node can terminate the search. In some cases, the processing node terminates the local search after the current iteration of the local search completes processing, in order to maintain internal states that are tracked.
  • after a processing node returns a local query response for a given search distance (e.g., R0), the processing node may immediately advance to perform a local search using the next search distance (e.g., R1). In some cases, before starting the local search using the next search distance, the processing node can check whether an early abandonment command for the local query request has been received. If so, the processing node may abandon the local search; otherwise, the processing node may advance to the next search distance. Therefore, the search coordinator module does not restart the processing node for each search distance in a synchronized manner.
  • the early abandonment techniques described herein may be used to more efficiently manage computational resources of a multicore machine. For example, causing a processing node to abandon execution of local searches allows computational resources that would otherwise be occupied processing a local query request, Q1, to be given to a different local query request, Q2. This computational resource transfer can happen earlier than in the case where the processing node terminates itself based on its local search criteria, namely, that K search results are found for the local processing node. Therefore, early abandonment is able to speed up concurrent query processing in these cases.
  • FIG. 5 also shows that when a processing node returns a local query response for a search distance (e.g., R0) to the search coordinator module, the processing node may immediately advance to the next search distance (e.g., R1), if the processing node discovers fewer than K points within distance R0 from the query in its partition.
  • the search coordinator module may be able to discover at least K such points by combining the results from other processing nodes, and thus the query request can be completed early. But this global decision will not be known to the processing nodes until all of the local query results related to R0, from all the processing nodes, have been gathered by the search coordinator module and the resultant decision notified to the processing nodes.
  • in the meantime, some implementations of the processing nodes can process the next global query request, Q2. This might mean processing the appropriate search distance value for Q2 completely independently of the progress Q1 is making.
  • FIG. 6 is a diagram illustrating a worker thread 600 of a processing node capable of processing concurrent local query requests, according to an example.
  • the worker thread 600 may be configured to execute on a processing node of a multicore machine.
  • a processing node may execute multiple worker threads that are each assigned to different incoming local query requests received from a search coordinator module.
  • the worker thread 600 may include a local query queue 602 to store local query requests that are in progress, an early abandonment queue 604 to store early abandonment commands associated with local queries being processed by the worker thread 600, and a local execution engine 606 to perform a distance-based similarity search on the data partition that is local to the processing node executing the worker thread 600.
  • data or logic (e.g., a listener thread) executing on a processing node may route a local query request to the worker thread 600.
  • the worker thread may store the local query request in the local query queue 602 as a local query record, such as one of the local query records 610a-n.
  • a local query record may include, among other things, a query identifier (e.g., Q) to uniquely identify a query from other local queries and a current search distance (e.g., R) to specify the search distance to which the local query should be executed.
  • the local query record may include other data, such as the local top-K search results that have been identified so far, for example.
  • data or logic (e.g., another listener thread or the same listener thread) executing on the processing node may route early abandonment commands to the worker thread 600.
  • the worker thread may store the early abandonment commands in the abandonment queue 604 as an abandonment record, such as abandonment record 612.
  • An abandonment record may include, among other things, a query identifier to uniquely identify a local query from other local queries.
  • the local execution engine 606 operates by retrieving a local query record from the local query queue and then checking whether an abandonment record matches the local query record (e.g., the local query record and the abandonment record include the same query identifier). If there is a match, the local execution engine will drop the local query record and remove the corresponding abandonment record. If not, the local execution engine 606 executes a proximity search using the parameters recorded in the local query record. Once the proximity search completes, the local query response is returned to the search coordinator module and the local query record is pushed to the end of the local query queue 602 after incrementing or otherwise advancing the current search distance of that local query record.
  • the local execution engine 606 may evaluate whether the local query response includes a number of results that exceeds the result size threshold. If so, the local execution engine 606 may drop the local query record from the local query queue 602 instead of pushing the record to the back of the local query queue 602.
  • the local execution engine 606 may re-prioritize the local query queue 602 based on the current search distances of the local query records 610a-n. For example, in some cases, the local execution engine 606 may re-order the local query queue 602 in non-descending order based on the values of the current search distances of the records.
  • processing nodes can be implemented to handle multiple queries concurrently.
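  • A minimal sketch of the worker-thread bookkeeping described above (a local query queue, an abandonment set keyed by query identifier, and a step that advances each surviving local query to its next search distance) is shown below; all names are hypothetical and thread synchronization is omitted.

```python
from collections import deque

class WorkerThread:
    def __init__(self, execution_engine, search_distances, result_size_threshold):
        self.local_query_queue = deque()  # records: [query_id, query_point, distance_index, results]
        self.abandoned = set()            # query ids from early abandonment commands
        self.engine = execution_engine
        self.search_distances = search_distances
        self.threshold = result_size_threshold

    def enqueue_query(self, query_id, query_point):
        self.local_query_queue.append([query_id, query_point, 0, []])

    def enqueue_abandonment(self, query_id):
        self.abandoned.add(query_id)

    def step(self):
        if not self.local_query_queue:
            return
        record = self.local_query_queue.popleft()
        query_id, query_point, distance_index, results = record
        if query_id in self.abandoned:
            self.abandoned.discard(query_id)  # drop both the record and the command
            return
        r = self.search_distances[distance_index]
        results.extend(self.engine.search(query_point, r))
        # A local query response for distance r would be returned to the coordinator here.
        if len(results) < self.threshold and distance_index + 1 < len(self.search_distances):
            record[2] += 1                          # advance to the next search distance
            self.local_query_queue.append(record)   # re-queue the record for later processing
```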

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Methods, systems, and techniques are disclosed for using common search distances across data partitions of a data set stored in a multicore machine. For example, a first data point and a second data point of a data set are assigned, respectively, to a first data partition stored in a first processing node and to a second data partition stored in a second processing node. The assignments of the data points to the data partitions may be based on hash values derived from identifiers assigned to the respective data points. Hash indexes of a first set of hash indexes are stored in the first processing node. The hash indexes of the first set of hash indexes may individually correspond to common search distances. Hash indexes of a second set of hash indexes are stored in the second processing node. The hash indexes of the second set of hash indexes may individually correspond to the common search distances.
PCT/US2015/013952 2015-01-30 2015-01-30 Common search distances in data partitions WO2016122653A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2015/013952 WO2016122653A1 (fr) 2015-01-30 2015-01-30 Common search distances in data partitions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2015/013952 WO2016122653A1 (fr) 2015-01-30 2015-01-30 Common search distances in data partitions

Publications (1)

Publication Number Publication Date
WO2016122653A1 true WO2016122653A1 (fr) 2016-08-04

Family

ID=56544069

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/013952 WO2016122653A1 (fr) 2015-01-30 2015-01-30 Common search distances in data partitions

Country Status (1)

Country Link
WO (1) WO2016122653A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060116989A1 (en) * 2004-11-30 2006-06-01 Srikanth Bellamkonda Efficient data aggregation operations using hash tables
US7168025B1 (en) * 2001-10-11 2007-01-23 Fuzzyfind Corporation Method of and system for searching a data dictionary with fault tolerant indexing
US20100174714A1 (en) * 2006-06-06 2010-07-08 Haskolinn I Reykjavik Data mining using an index tree created by recursive projection of data points on random lines
US20120166401A1 (en) * 2010-12-28 2012-06-28 Microsoft Corporation Using Index Partitioning and Reconciliation for Data Deduplication
US20140201772A1 (en) * 2009-05-29 2014-07-17 Zeev Neumeier Systems and methods for addressing a media database using distance associative hashing


Similar Documents

Publication Publication Date Title
US8510316B2 (en) Database processing system and method
US10423616B2 (en) Using local memory nodes of a multicore machine to process a search query
US20180004751A1 (en) Methods and apparatus for subgraph matching in big data analysis
Yagoubi et al. Dpisax: Massively distributed partitioned isax
US10885031B2 (en) Parallelizing SQL user defined transformation functions
CN103581331B (zh) Virtual machine online migration method and system
US10191998B1 (en) Methods of data reduction for parallel breadth-first search over graphs of connected data elements
US9405782B2 (en) Parallel operation in B+ trees
US10515078B2 (en) Database management apparatus, database management method, and storage medium
US20180300330A1 (en) Proactive spilling of probe records in hybrid hash join
CN106095920A (zh) Distributed indexing method for large-scale high-dimensional spatial data
Jin et al. Querying web-scale information networks through bounding matching scores
Deng et al. Pyramid: A general framework for distributed similarity search on large-scale datasets
WO2016187975A1 (fr) Internal memory defragmentation method and apparatus
CN113377689B (zh) Routing table entry lookup and storage method, and network chip
CN113590332B (zh) Memory management method and apparatus, and memory allocator
CN110008030A (zh) Metadata access method, system and device
CN106484818B (zh) Hierarchical clustering method based on Hadoop and HBase
Ma et al. In-memory distributed indexing for large-scale media data retrieval
Cheng et al. A Multi-dimensional Index Structure Based on Improved VA-file and CAN in the Cloud
CN109325022A (zh) Data processing method and apparatus
CN109213972A (zh) Method, apparatus, device and computer storage medium for determining document similarity
Siddique et al. k-dominant skyline query computation in MapReduce environment
WO2016122653A1 (fr) Common search distances in data partitions
Yagoubi et al. Radiussketch: massively distributed indexing of time series

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15880543

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15880543

Country of ref document: EP

Kind code of ref document: A1