US20210365300A9 - Systems and methods for dynamic partitioning in distributed environments - Google Patents
- Publication number
- US20210365300A9 (application US16/198,133)
- Authority
- US
- United States
- Prior art keywords
- key
- frequency count
- value pairs
- processor
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5077—Logical partitioning of resources; Management or configuration of virtualized resources
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
- G06F16/278—Data partitioning, e.g. horizontal or vertical partitioning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G06F17/30598—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/5017—Task decomposition
Definitions
- the present disclosure relates to dynamic partitioning in distributed computing environments. More particularly, the present disclosure relates to dynamic partitioning of keys based on frequency counters maintained locally and/or globally in the distributed computing environment.
- the integration of data from a plurality of data sources may produce large data sets that need to be managed efficiently and effectively.
- conventional methods of integrating large data sets have performance barriers because of the size of the data sets, which leads to relatively long processing times and relatively large computer resource use.
- in a MapReduce framework, data sets are partitioned into several blocks of data using keys assigned by map task operations, and the blocks are allocated in parallel to reduce task operations.
- a common problem with the MapReduce framework is data skew, which occurs when the workload is non-uniformly distributed.
- when data skew occurs, some computer resources that process a reduce task receive a relatively large workload and require a relatively longer processing time to complete the tasks than other computer resources that process other reduce tasks, which diminishes the benefits of parallelization.
- embodiments of the present disclosure relate to dynamic partitioning of tasks in a distributed computing environment to improve data processing speed.
- Embodiments of the present disclosure include systems, methods, and computer-readable media for dynamic partitioning in distributed computing environments.
- One method includes: receiving, at a processor, a first data set and a second data set; mapping, by the processor, the first data set into a first set of key-value pairs; mapping, by the processor, the second data set into a second set of key-value pairs; estimating, by the processor using a sketch, a frequency count for each key based on the first set of key-value pairs and the second set of key-value pairs; determining, by the processor, whether the estimated frequency count for each key is greater than or equal to a predetermined threshold; and partitioning, by the processor, the key when the estimated frequency count for the key is greater than or equal to the predetermined threshold.
- One system includes a data storage device that stores instructions for dynamic partitioning in distributed computing environments; and a processor configured to execute the instructions to perform a method including: receiving a first data set and a second data set; mapping the first data set into a first set of key-value pairs; mapping the second data set into a second set of key-value pairs; estimating, using a sketch, a frequency count for each key based on the first set of key-value pairs and the second set of key-value pairs; determining whether the estimated frequency count for each key is greater than or equal to a predetermined threshold; and partitioning the key when the estimated frequency count for the key is greater than or equal to the predetermined threshold.
- non-transitory computer-readable media storing instructions that, when executed by a computer, cause the computer to perform a method for dynamic partitioning in distributed computing environments.
- One method of the non-transitory computer-readable medium includes: receiving a first data set and a second data set; mapping the first data set into a first set of key-value pairs; mapping the second data set into a second set of key-value pairs; estimating, using a sketch, a frequency count for each key based on the first set of key-value pairs and the second set of key-value pairs; determining whether the estimated frequency count for each key is greater than or equal to a predetermined threshold; and partitioning the key when the estimated frequency count for the key is greater than or equal to the predetermined threshold.
- FIG. 1 depicts a system implementing a MapReduce framework for dynamic partitioning in a distributed environment, according to embodiments of the present disclosure.
- FIG. 2 depicts an exemplary blocking-based records/events linking using the MapReduce framework, according to embodiments of the present disclosure.
- FIG. 3 depicts an exemplary blocking-based records/events linking using the MapReduce framework that includes a predetermined threshold when mapping data sets, according to embodiments of the present disclosure.
- FIG. 4 depicts a system implementing a MapReduce framework for dynamic partitioning in a distributed environment using a global frequency counter, according to embodiments of the present disclosure.
- FIG. 5 depicts a table of performance results for a MapReduce framework using a global frequency counter, according to embodiments of the present disclosure.
- FIG. 6 depicts a method for dynamic partitioning in a distributed environment, according to embodiments of the present disclosure.
- FIG. 7 depicts another method for dynamic partitioning in a distributed environment, according to embodiments of the present disclosure.
- FIG. 8 is a simplified functional block diagram of a computer configured as a device for executing the methods of FIGS. 6 and 7, according to exemplary embodiments of the present disclosure.
- a data analysis platform may process relatively large amounts of data to learn insights from the data.
- an advertiser may have a relatively large amount of data relating to advertisements and campaigns.
- the data may be stored in a software framework for distributed storage and distributed processing, such as with Hadoop.
- Hadoop may be utilized for distributed processing of the data
- the Hadoop distributed file system (“HDFS”) may be used for organizing communications and storage of the data.
- Clusters and/or nodes may be generated that also utilize HDFS.
- a cluster computing framework, such as Spark, may be arranged to further utilize the HDFS of the Hadoop clusters.
- a Hadoop cluster may allow for the distributed processing of large data sets across clusters of computers using programming models.
- a Hadoop cluster may scale up from single servers to thousands of machines, each offering local computation and storage.
- a MapReduce framework may be provided for accessing and processing data from the distributed computing system.
- a MapReduce framework may be used to process records/events related to a particular unique identifier (e.g., an advertiser id and/or a campaign id) in parallel.
- the workload of processing for a large number of records/events may be divided among a plurality of MapReduce nodes and divided among a plurality of computers within the MapReduce framework.
- FIG. 1 depicts a system implementing a MapReduce framework, according to embodiments of the present disclosure.
- the system includes a cluster 100 of nodes working in parallel. Each node may be a computer, a processor, or a processing unit.
- the cluster 100 includes a master node 102 and a plurality of slave nodes 104, which perform MapReduce tasks and/or other tasks.
- MapReduce tasks include map tasks and reduce tasks.
- a data set received by the cluster 100 may be split into independent chunks of data that are processed by map tasks in parallel.
- the map tasks may produce a set of key-value pairs.
- the MapReduce framework may group the outputs of the map tasks by their respective keys, which may be input into the reduce tasks.
- the grouping of keys may be a time consuming process when the number of map task results is relatively large.
- Reduce tasks may consolidate the outputs from the map tasks into final results.
- the slave nodes 104 may include a plurality of map task nodes 106, a plurality of reduce task nodes 108, and/or a plurality of other tasks N nodes 110.
- the master node 102 may divide a data set into smaller data chunks and distribute the smaller data chunks to the map task nodes 106.
- Each reduce task node 108 may combine the output received from the map task nodes 106 into a single result.
- Each node in the cluster 100 may be coupled to a database 112.
- the results of each stage of the MapReduce tasks may be stored in the database 112, and the nodes in the cluster 100 may obtain the results from the database 112 in order to perform subsequent processing.
- a data set that is received may include a set of records/events that relate to a particular unique identifier (e.g., an advertiser id and/or a campaign id).
- a unique key may be assigned to the data in the data set in order to uniquely identify the data.
- Another data set may also be received from the same data provider and/or a different data provider and include another set of records/events that relate to another particular unique identifier (e.g., another advertiser id and/or another campaign id).
- a unique key may be assigned to the second data of the second data set in order to uniquely identify the second data.
- the set of records/events of the data sets may then be linked by matching records/events of the data sets. For example, a record/event of the data set may be assigned with a first key, and other records/events of the data set with the same first key may be grouped into a block. The records/events of the block may be compared with each other to determine whether the information within the records/events match or do not match.
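- the blocking step described above can be illustrated with a minimal sketch. The field names (`campaign`, `email`, `id`), the exact-equality match rule, and the choice to compare records across the two data sets rather than within one are assumptions made for illustration and are not taken from the disclosure.

```python
from collections import defaultdict
from itertools import product

def build_blocks(records, key_field):
    """Group records by their blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key_field]].append(record)
    return blocks

def link_records(set_a, set_b, key_field, match_field):
    """Compare only records that share a blocking key (hypothetical match rule)."""
    blocks_a = build_blocks(set_a, key_field)
    blocks_b = build_blocks(set_b, key_field)
    matches = []
    for key in blocks_a.keys() & blocks_b.keys():
        # Every record of the block in A is compared with every record in B.
        for rec_a, rec_b in product(blocks_a[key], blocks_b[key]):
            if rec_a[match_field] == rec_b[match_field]:
                matches.append((rec_a["id"], rec_b["id"]))
    return sorted(matches)

set_a = [
    {"id": "a1", "campaign": 1, "email": "x@example.com"},
    {"id": "a2", "campaign": 2, "email": "y@example.com"},
]
set_b = [
    {"id": "b1", "campaign": 1, "email": "x@example.com"},
    {"id": "b2", "campaign": 1, "email": "z@example.com"},
    {"id": "b3", "campaign": 2, "email": "x@example.com"},
]
links = link_records(set_a, set_b, "campaign", "email")
```

- in this sketch, record `b3` shares an email with `a1` but is never compared with it, because the two records fall into different blocks.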
- the MapReduce framework may be used to efficiently process the linking of records/events of data sets.
- the MapReduce framework includes two major tasks, i.e., map and reduce.
- the map task inputs the data of the data set, and assigns a key to a record/event.
- the reduce task receives all values which have the same key, and processes these groups.
- the map and reduce tasks may be simplified by the following algorithmic formulas:
- the map task may output one or more key-value pairs.
- the reduce task may receive a list of values for a particular key, and, after computation, output a new list of values.
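- the two task shapes described above (a map task that emits key-value pairs, and a reduce task that turns a key and its list of values into a new list) may be sketched as follows. The driver `run_mapreduce`, the example map and reduce functions, and the record fields are hypothetical names used only for illustration; this is a single-process sketch, not a distributed implementation.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to each record, group by key, then apply reduce_fn per key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):       # map: record -> (key, value) pairs
            groups[key].append(value)
    # reduce: (key, list of values) -> new list of values
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def map_by_advertiser(record):
    # Hypothetical map task: key each event by its advertiser id.
    yield record["advertiser_id"], record["event"]

def reduce_count(key, values):
    # Hypothetical reduce task: output the number of events for the key.
    return [len(values)]

events = [
    {"advertiser_id": 7, "event": "click"},
    {"advertiser_id": 7, "event": "view"},
    {"advertiser_id": 9, "event": "view"},
]
result = run_mapreduce(events, map_by_advertiser, reduce_count)
```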
- the records/events included in the data sets may be separated into smaller units and distributed to different computing resources that may be run in parallel.
- input data may be processed by map tasks in parallel, the intermediate outputs of the map tasks may be collected locally and grouped based on their respective key values. Based on a partition function (such as a default hashing function and/or a user-defined function), the groups may be allocated to a reduce task depending on their keys. Upon completion of the map tasks and the intermediate results being transferred to the respective reduce task, reduce task operations may begin. The reduce task operations may also be processed in parallel for each key group.
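- a partition function of the kind mentioned above may be sketched as a hash of the key taken modulo the number of reduce tasks. The use of `zlib.crc32` as the hash and the function names are assumptions for illustration; an actual framework may use a different default.

```python
import zlib

def default_partition(key, num_reducers):
    """Map a key to a reduce task index with a process-independent hash."""
    # crc32 is used here because Python's built-in hash() of strings
    # changes between interpreter runs.
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers

def allocate(groups, num_reducers, partition_fn=default_partition):
    """Assign each key group to a reduce task chosen by the partition function."""
    assignment = {reducer: [] for reducer in range(num_reducers)}
    for key in groups:
        assignment[partition_fn(key, num_reducers)].append(key)
    return assignment

groups = {"k1": [1], "k2": [2], "k3": [3]}
assignment = allocate(groups, 4)
# A user-defined partition function can replace the default one.
everything_on_zero = allocate(groups, 2, lambda key, n: 0)
```

- because all values of one key go to the same reduce task, a key with many values produces a large group on one task regardless of how evenly the hash spreads distinct keys.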
- FIG. 2 depicts an exemplary blocking-based records/events linking using the MapReduce framework, according to embodiments of the present disclosure.
- Field A of data sets 202A and 202B may be used as the key, and the records/events B of the respective data sets 202A and 202B may be mapped and then processed by the same reduce task computing resources.
- in the MapReduce framework, data skew occurs when the workload is non-uniformly distributed.
- computer resources that process reduce tasks may receive a relatively large amount of key-value pairs, and may require a relatively longer amount of processing time to complete the reduce tasks compared to other computer resources that process other reduce tasks.
- Such an uneven distribution of key-value pairs may reduce the benefits of parallelization.
- the computing resources needed for reduce task operations 204A, 204B may compare six record/event pairs, but the computing resources needed for reduce task operations 204C may compare ten record/event pairs.
- the MapReduce framework may assign some computing resources for reduce task operations with a larger workload, such as 204C. Data skew occurs because of the imbalanced distribution of block sizes.
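- the effect of the imbalance may be quantified with a simple ratio of the largest workload to the mean workload. The sketch below assumes that reduce task operations 204A and 204B each compare six pairs (the text could also be read as six in total), and the metric itself is an illustrative choice rather than one named in the disclosure.

```python
def skew_ratio(workloads):
    """Ratio of the largest workload to the mean workload (1.0 means balanced)."""
    mean = sum(workloads) / len(workloads)
    return max(workloads) / mean

# Pair counts from the FIG. 2 description, assuming six pairs for each of 204A and 204B.
pairs_per_reducer = {"204A": 6, "204B": 6, "204C": 10}
ratio = skew_ratio(list(pairs_per_reducer.values()))
slowest = max(pairs_per_reducer, key=pairs_per_reducer.get)
```

- since the reduce phase finishes only when the slowest reduce task finishes, a ratio above 1.0 indicates parallel capacity that sits idle while the largest block is processed.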
- each map task operation may maintain a frequency counter per key. The frequency counter per key may be used in conjunction with a predetermined threshold to split a key, create sub-keys, and/or allocate record/event pairs to particular computing resources so that the load on the computing resources is balanced.
- each reduce task operation and/or each stage of a MapReduce operation may maintain a frequency counter per key.
- the frequency counter per key may be used in conjunction with an overall predetermined threshold, a reduce task predetermined threshold, and/or a stage predetermined threshold to split a key, create sub-keys, and/or allocate record/event pairs to particular computing resources so that the load on the computing resources is balanced.
- the data sets may be examined to produce a workload estimation based on a sketch of the data sets 202A and 202B.
- the frequency counter may use various algorithms, such as an algorithm that uses a lossy count and/or an algorithm that uses sketches to count the number of values.
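- as one illustration of a lossy count, the sketch below follows the well-known lossy counting scheme, in which the stream is divided into buckets of width ceil(1/epsilon) and infrequent entries are pruned at each bucket boundary. The class name, the `epsilon` value, and the synthetic stream are assumptions; the disclosure does not specify which lossy counting variant is used.

```python
import math

class LossyCounter:
    """Lossy counting: approximate per-key counts in bounded memory."""

    def __init__(self, epsilon):
        self.width = math.ceil(1 / epsilon)  # bucket width
        self.n = 0                           # items seen so far
        self.entries = {}                    # key -> [count, max_missed]

    def add(self, key):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        if key in self.entries:
            self.entries[key][0] += 1
        else:
            # The key may have been seen and pruned in earlier buckets.
            self.entries[key] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop keys that cannot be frequent.
            self.entries = {k: e for k, e in self.entries.items()
                            if e[0] + e[1] > bucket}

    def estimate(self, key):
        entry = self.entries.get(key)
        return entry[0] if entry else 0

counter = LossyCounter(epsilon=0.1)
# Synthetic stream: one frequent key mixed with keys that appear once.
stream = ["hot" if i % 5 < 3 else f"rare-{i}" for i in range(100)]
for key in stream:
    counter.add(key)
```

- the estimate never exceeds the true count and undercounts by at most epsilon times the stream length, which makes it usable for a threshold test on frequent keys.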
- a sketch may be a data structure that provides space-efficient summaries for large and frequently updated data sets.
- a sketch data structure may estimate a number of values that have been assigned to a certain key for the data set.
- the sketch data structure may be one or more of a count-min sketch, a HyperLogLog, a Bloom filter, a MinHash, and/or a cuckoo filter.
- hash functions may be used to map records/events to frequencies.
- a slave node 104 that processes map tasks 106 may use a frequency counter 114 to estimate, for each column of data being processed, a number of values that are repeated in more than a predetermined fraction of the rows.
- the frequency counter 114 may use a sketch when inputting a stream of records/events of a data set, such as data sets 202A and 202B, one at a time, and the frequency counter 114 may count a frequency of the different types of records/events in the stream.
- the sketch may be used as an estimated frequency of each record/event type.
- the count-min sketch data structure may be a two-dimensional array of cells with w columns and d rows. The values for the parameters w and d may be fixed when the sketch is created, and may be used to determine time and space needs and the probability of error when the sketch is queried for a frequency. Associated with each of the d rows is a separate and independent hash function.
- Each hash function h_i maps a blocking key k into a hashing space of size w.
- Each cell of the two-dimensional array of a sketch may include a counter, and initially, all counters in the array may be set to zero.
- as keys are processed, the counters may be incremented. If a counter of a cell of the two-dimensional array of the sketch is greater than or equal to a predetermined count threshold for the particular key k, then the individual map task may partition (split) the key into two or more sub-keys with the map task operation.
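- a count-min sketch with w columns and d rows, as described above, may be sketched as follows. The class and method names are illustrative, and salted BLAKE2b digests stand in for the d separate hash functions; this is a convenience assumption and not a strictly pairwise-independent hash family.

```python
import hashlib

class CountMinSketch:
    """d rows of w counters; each row has its own hash function."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]  # all counters start at zero

    def _column(self, row, key):
        # Row index used as the salt so that each row hashes differently.
        digest = hashlib.blake2b(
            str(key).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        # Increment one counter per row.
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += count

    def estimate(self, key):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.table[row][self._column(row, key)]
                   for row in range(self.depth))

cms = CountMinSketch(width=64, depth=4)
for _ in range(5):
    cms.add("k1")
cms.add("k2", 2)
k1_estimate = cms.estimate("k1")
```

- the estimate can overstate a key's frequency but never understates it, so a threshold test on the estimate may split a key slightly early but will not miss a key that truly reached the threshold. Larger w reduces the overestimate and larger d reduces the probability of a large overestimate, at the cost of w times d counters.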
- the predetermined count threshold may be a predetermined value and/or a range of values that may be determined empirically and/or dynamically. For example, a dynamically predetermined count threshold may use machine learning to determine a value or a range of values for the predetermined count threshold.
- FIG. 3 depicts an exemplary blocking-based records/events linking using the MapReduce framework that includes a predetermined threshold when mapping data sets, according to embodiments of the present disclosure.
- Field A of data sets 302A and 302B may be used as the blocking key, and the records/events B of the respective data sets 302A and 302B may be mapped, and then the key pairs may be processed by the same reduce task computing resources.
- the predetermined threshold for determining whether a mapper should partition (split) a key may be 4. When the frequency of the key 1 is determined to be 4, the mapper may split the key 1 into keys 1A and 1B.
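- one way a mapper might apply the threshold while streaming records is sketched below. The naming policy (every key starts with suffix A and moves to the next letter each time its count reaches another multiple of the threshold) is an assumption chosen so that key 1 yields sub-keys 1A and 1B; the disclosure does not specify how records are assigned to sub-keys. An exact `Counter` stands in for the sketch-based frequency counter.

```python
from collections import Counter
import string

THRESHOLD = 4  # value used in the FIG. 3 description

def map_with_split(records, threshold=THRESHOLD):
    """Emit (sub_key, value) pairs, moving to a new sub-key at each threshold multiple."""
    counts = Counter()  # exact counter standing in for the sketch
    for key, value in records:
        counts[key] += 1
        index = counts[key] // threshold
        if index >= len(string.ascii_uppercase):
            raise ValueError("too many sub-keys for single-letter suffixes")
        yield f"{key}{string.ascii_uppercase[index]}", value

# Hypothetical records: five values for key 1 and one value for key 2.
records = [(1, "b1"), (1, "b2"), (2, "b3"), (1, "b4"), (1, "b5"), (1, "b6")]
emitted = list(map_with_split(records))
```

- with these inputs, the first three values of key 1 stay on sub-key 1A and the fourth and fifth move to 1B, while key 2 never reaches the threshold and stays on a single sub-key.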
- the computing resources needed for reduce task operations 304A may compare nine record/event pairs.
- the computing resources needed for reduce task operations 304B may compare six record/event pairs.
- the computing resources needed for reduce task operations 304C may compare one record/event pair.
- the computing resources needed for reduce task operations 304D may compare two record/event pairs.
- without the split, the computer resources for reduce task operations 304A and 304B would be combined and may compare fifteen record/event pairs, which is a relatively larger amount of processing compared to the other computing resources needed for reduce task operations 304C and 304D.
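- the pair counts above are consistent with a block whose comparisons number the product of the block's size in each data set. The sizes used below (three records on one side and five on the other, with the five split into three and two) are hypothetical values chosen only because they reproduce the nine, six, and fifteen stated in the text; the actual block sizes of FIG. 3 are not given here.

```python
def comparisons(left_size, right_size):
    """Number of cross-data-set record pairs in one block."""
    return left_size * right_size

# Hypothetical block sizes for key 1 (not stated in the text).
left, right = 3, 5
unsplit = comparisons(left, right)

# Splitting the larger side into sub-keys 1A and 1B; the smaller side
# is assumed to be visible to both sub-keys.
split = [comparisons(left, 3), comparisons(left, 2)]
largest_after_split = max(split)
```

- the total work is unchanged by the split, but the largest single reduce task drops from fifteen comparisons to nine, which is what shortens the reduce phase.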
- each slave node that processes map tasks may include a frequency counter for each key using a sketch, and may partition a key when the frequency counter associated with the key exceeds a predetermined threshold.
- the above described frequency counter may allow for data skew to be mitigated locally at the slave node. In order to further mitigate data skew, the frequency counter for each key may be maintained globally.
- the master node 102 may also include a global frequency counter 116 that maintains a global frequency count for each key.
- the global frequency counter 116 may maintain a sketch, such as a count-min sketch, and the local frequency counters 114 of the slave nodes 104 that include map tasks 106 may retrieve the global frequency count for each key from the global frequency counter 116.
- after retrieving the global frequency counts, the slave nodes 104 may determine an updated frequency count for each key based on the estimated frequency counts for each key and the retrieved global frequency count for each key. The map tasks may then partition (split) their local keys based on the locally updated frequency counts for each key and the predetermined threshold. Upon completion of the map tasks, the local frequency counters 114 may transmit their local updated frequency counts for each key to the global frequency counter 116 of the master node 102.
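- the text above describes slave nodes sending per-key counts to the master. As an alternative illustration only, the sketch below shows a property of count-min sketches that a global counter could exploit: two sketches with the same dimensions and hash functions can be combined by adding their cells. The function names and the salted BLAKE2b hashing are assumptions, and merging whole sketches is not stated in the disclosure.

```python
import hashlib

def column(row, key, width):
    """Column index of a key in a given row (salted hash per row)."""
    digest = hashlib.blake2b(
        str(key).encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % width

def new_sketch(width, depth):
    return [[0] * width for _ in range(depth)]

def add(sketch, key, count=1):
    width = len(sketch[0])
    for row, cells in enumerate(sketch):
        cells[column(row, key, width)] += count

def estimate(sketch, key):
    width = len(sketch[0])
    return min(cells[column(row, key, width)] for row, cells in enumerate(sketch))

def merge(global_sketch, local_sketch):
    """Fold a local sketch into the global one by cell-wise addition."""
    if (len(global_sketch) != len(local_sketch)
            or len(global_sketch[0]) != len(local_sketch[0])):
        raise ValueError("sketch dimensions differ")
    for global_row, local_row in zip(global_sketch, local_sketch):
        for i, value in enumerate(local_row):
            global_row[i] += value

# Two hypothetical slave nodes, each seeing part of the stream.
local_1, local_2 = new_sketch(32, 3), new_sketch(32, 3)
for key in ["hot", "hot", "hot", "a"]:
    add(local_1, key)
for key in ["hot", "hot", "hot", "b"]:
    add(local_2, key)

global_sketch = new_sketch(32, 3)
merge(global_sketch, local_1)
merge(global_sketch, local_2)
```

- the merged sketch is identical to a sketch built from the combined stream, so a key that stays under the threshold on every individual node can still be recognized as frequent globally.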
- FIG. 5 depicts a table of performance results for a MapReduce framework using a global frequency counter, according to embodiments of the present disclosure.
- the environment includes data from 20,599 files having a total size of 2.9 terabytes of data.
- the test was run with 40,855 total map tasks (559 concurrent) and 316 total reduce tasks (279 concurrent) on Hadoop 2.7.1.
- the sketch used for frequency counting was a count-min sketch.
- the various parameters of each performance result are depicted in the table of FIG. 5 .
- the MapReduce framework may be substituted with a Spark framework, and an execution time may be reduced from about 2-3 hours to about 40 minutes.
- a Spark framework implementation may be similar to a MapReduce framework implementation.
- the Spark framework implementation may differ from the MapReduce framework implementation in that (i) data may be processed in a memory to reduce slow down due to disk input/output, (ii) map and reduce stages may not occur separately in order to avoid a total replicated disk write and network transfer, and (iii) a partition/re-partition of sub-keyed data may be done in memory with minimum shuffling.
- FIG. 6 depicts a method for dynamic partitioning in a distributed environment, according to embodiments of the present disclosure.
- the method 600 may begin at step 602 in which a node, such as the master node 102 and/or slave node 104, may receive a first data set and a second data set.
- when a master node receives the first data set and the second data set, the master node may distribute a portion and/or all of the first data set and the second data set to one or more of the slave nodes for distributed processing.
- the slave node may process the portion and/or all of the first data set and the second data set according to one or more tasks handled by the slave node.
- the slave node may perform a map task on the first data set.
- the map task may map the first data set, and may output a first set of key-value pairs based on the first data set.
- a plurality of slave nodes may perform map tasks on a plurality of first data sets in parallel, and the intermediate outputs of the map tasks may be collected locally at each slave node.
- the slave node may perform a map task on the second data set.
- the map task may map the second data set, and may output a second set of key-value pairs based on the second data set.
- a plurality of slave nodes may perform map tasks on a plurality of second data sets in parallel, and the intermediate outputs of the map tasks may be collected locally at each slave node.
- each slave node may estimate, using a sketch, a frequency count for each key based on the first set of key-value pairs and the second set of key-value pairs. For example, the first and second data sets may be examined to produce a workload estimation based on the sketch of the first and second data sets.
- a frequency counter, such as frequency counter 114 of a slave node 104, may use various algorithms, such as an algorithm that uses a lossy count and/or an algorithm that uses sketches to count the number of distinct values in the first and second set of key-value pairs.
- a sketch may be a data structure that provides space-efficient summaries for large and frequently updated data sets.
- a sketch data structure may estimate a number of distinct values that have been assigned to a particular key in a first and second set of key-value pairs.
- the frequency counter may estimate a number of distinct values for each key in the first and second set of key-value pairs.
- the sketch data structure may be a count-min sketch.
- the slave node may determine whether the estimated frequency count for each key is greater than or equal to a predetermined threshold.
- the slave node may partition a key when the frequency count associated with the key is greater than or equal to the predetermined threshold.
- each slave node that processes map tasks may include a frequency counter for each key, and the slave node may partition a key when the frequency count associated with the key exceeds a predetermined threshold.
- the process may continue. For example, the slave node may group the values associated with the keys based on the key. Then, other slave nodes that process reduce tasks may receive a list of values for a particular key, and, after computation, output a new list of values.
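- the steps of method 600 (map both data sets, estimate a frequency count per key, compare against the threshold, partition, then group) may be sketched in a single process as follows. The function names, the `fanout` parameter, and the round-robin assignment of values to sub-keys are assumptions, and an exact `Counter` stands in for the sketch-based estimate. The sketch only balances group sizes; it does not address how records of one data set are made visible to every sub-key of the other, which the text leaves open.

```python
from collections import Counter, defaultdict

def map_records(data_set, key_field):
    """Map task: emit one (key, record) pair per record."""
    return [(record[key_field], record) for record in data_set]

def dynamic_partition(first, second, key_field, threshold, fanout=2):
    """Group mapped records by key, spreading hot keys over `fanout` sub-keys."""
    pairs = map_records(first, key_field) + map_records(second, key_field)
    counts = Counter(key for key, _ in pairs)        # stand-in for the sketch
    hot = {key for key, n in counts.items() if n >= threshold}
    seen = Counter()
    groups = defaultdict(list)
    for key, record in pairs:
        if key in hot:
            sub = seen[key] % fanout                 # round-robin over sub-keys
            seen[key] += 1
        else:
            sub = 0
        groups[(key, sub)].append(record)
    return dict(groups)

# Hypothetical input: key "k1" is frequent, key "k2" is not.
first = [{"key": "k1", "n": i} for i in range(5)] + [{"key": "k2", "n": 5}]
second = [{"key": "k1", "n": 6}]
groups = dynamic_partition(first, second, "key", threshold=4)
sizes = {group_key: len(records) for group_key, records in groups.items()}
```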
- FIG. 7 depicts another method for dynamic partitioning in a distributed environment, according to embodiments of the present disclosure.
- the method 700 may begin at step 702 in which a node, such as the master node 102 and/or slave node 104, may receive a first data set and a second data set.
- a master node may distribute a portion and/or all of the first data set and the second data set to one or more of the slave nodes for distributed processing.
- the slave node may process the portion and/or all of the first data set and the second data set according to one or more tasks handled by the slave node.
- the slave node may perform a map task on the first data set.
- the map task may map the first data set, and may output a first set of key-value pairs based on the first data set.
- a plurality of slave nodes may perform map tasks on a plurality of first data sets in parallel, and the intermediate outputs of the map tasks may be collected locally at each slave node.
- the slave node may perform a map task on the second data set.
- the map task may map the second data set, and may output a second set of key-value pairs based on the second data set.
- a plurality of slave nodes may perform map tasks on a plurality of second data sets in parallel, and the intermediate outputs of the map tasks may be collected locally at each slave node.
- the slave node may retrieve, from a master node, a global frequency count for each key mapped in the first and second set of key value pairs.
- the master node, such as master node 102, may include a global frequency counter, such as global frequency counter 116, which may also maintain a sketch, such as a count-min sketch.
- the frequency counters, such as frequency counters 114, of each slave node may retrieve the global frequency count for each key from the global frequency counter.
- each slave node may estimate, using a sketch, a frequency count for each key based on the first set of key-value pairs and the second set of key-value pairs. For example, the first and second data sets may be examined to produce a workload estimation based on the sketch of the first and second data sets.
- a frequency counter, such as frequency counter 114 of a slave node 104, may use various algorithms, such as an algorithm that uses a lossy count and/or an algorithm that uses sketches to count the number of distinct values in the first and second set of key-value pairs.
- a sketch may be a data structure that provides space-efficient summaries for large and frequently updated data sets.
- a sketch data structure may estimate a number of distinct values that have been assigned to a particular key in a first and second set of key-value pairs.
- the frequency counter may estimate a number of distinct values for each key in the first and second set of key-value pairs.
- the sketch data structure may be a count-min sketch.
- each slave node may determine an updated frequency count for each key based on the retrieved global frequency count for each key and the estimated frequency count for each key. For example, the slave node, for each key, may average the global frequency count for a key and the estimated frequency count for the key, and generate an updated frequency count for the key based on the average.
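- the averaging example above may be sketched as follows. The function names and the sample counts are hypothetical, and treating a key that is missing from the global counts as having a global count of zero is an assumption not stated in the text.

```python
def updated_counts(local_counts, global_counts):
    """Average each key's local estimate with its retrieved global count."""
    return {
        key: (local + global_counts.get(key, 0)) / 2  # missing global count assumed 0
        for key, local in local_counts.items()
    }

def keys_to_split(counts, threshold):
    """Keys whose updated count is greater than or equal to the threshold."""
    return sorted(key for key, count in counts.items() if count >= threshold)

# Hypothetical counts: k1 is rare locally but frequent globally,
# k2 is frequent locally but rare globally.
local_counts = {"k1": 2, "k2": 5}
global_counts = {"k1": 10, "k2": 1}
updated = updated_counts(local_counts, global_counts)
split = keys_to_split(updated, threshold=4)
```

- with these sample values, k1 is split even though its local count is below the threshold, while k2 is not split even though its local count alone would have reached it. Whether that second effect is desirable depends on how the threshold is chosen, which the text leaves to empirical or dynamic determination.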
- the slave node may determine whether the updated frequency count for each key is greater than or equal to a predetermined threshold.
- the slave node may partition a key when the updated frequency count associated with the key is greater than or equal to the predetermined threshold.
- each slave node that processes map tasks may include a frequency counter for each key, and the slave node may partition a key when the updated frequency count associated with the key exceeds a predetermined threshold.
- the slave node may transmit, to the master node, the updated frequency count for each key. Accordingly, the master node may update the global frequency count with the updated frequency count from each slave node.
- the process may continue. For example, the slave node may group the values associated with the keys based on the key. Then, other slave nodes that process reduce tasks may receive a list of values for a particular key, and, after computation, output a new list of values.
- FIG. 8 is a simplified functional block diagram of a computer that may be configured as the nodes, computing device, servers, providers, and/or network elements for executing the methods, according to an exemplary embodiment of the present disclosure.
- any of the nodes, computing device, servers, providers, and/or network may be an assembly of hardware 800 including, for example, a data communication interface 860 for packet data communication.
- the platform may also include a central processing unit (“CPU”) 820, in the form of one or more processors, for executing program instructions.
- the platform typically includes an internal communication bus 810, program storage, and data storage for various data files to be processed and/or communicated by the platform, such as ROM 830 and RAM 840, although the system 800 often receives programming and data via network communications.
- the system 800 also may include input and output ports 850 to connect with input and output devices such as keyboards, mice, touchscreens, monitors, displays, etc.
- the various system functions may be implemented in a distributed fashion on a number of similar platforms, to distribute the processing load.
- the systems may be implemented by appropriate programming of one computer hardware platform.
- Storage type media include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks.
- Such communications may enable loading of the software from one computer or processor into another, for example, from a management server or host computer of the mobile communication network into the computer platform of a server and/or from a server to the mobile device.
- another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links.
- the physical elements that carry such waves, such as wired or wireless links, optical links, or the like, also may be considered as media bearing the software.
- terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
- the presently disclosed methods, devices, and systems are described with exemplary reference to transmitting data, it should be appreciated that the presently disclosed embodiments may be applicable to any environment, such as a desktop or laptop computer, an automobile entertainment system, a home entertainment system, etc. Also, the presently disclosed embodiments may be applicable to any type of Internet protocol.
- the present disclosure is not limited to these particular embodiments.
- the present disclosure may also be used in other distributed computing environments.
Abstract
Description
- The present disclosure relates to dynamic partitioning in distributed computing environments. More particularly, the present disclosure relates to dynamic partitioning of keys based on frequency counters maintained locally and/or globally in the distributed computing environment.
- The integration of data from a plurality of data sources may produce large data sets that need to be managed efficiently and effectively. However, conventional methods of integrating large data sets have performance barriers because of the size of the data sets, which leads to relatively long processing times and relatively large computer resource use.
- Several newer techniques of integrating data sets have been proposed to parallelize the integration process and reduce long processing times based on the MapReduce framework. In the MapReduce framework, data sets are partitioned into several blocks of data using keys assigned by map task operations and allocated in parallel to reduce task operations.
- A common problem with the MapReduce framework is data skew, which occurs when the workload is non-uniformly distributed. When typical data skew occurs, computer resources that process a reduce task receive a relatively large amount of workload and require a relatively longer amount of processing time to complete the tasks compared to other computer resources that process other reduce tasks, which diminishes the benefits of parallelization.
- Thus, embodiments of the present disclosure relate to dynamic partitioning of tasks in a distributed computing environment to improve data processing speed.
- Embodiments of the present disclosure include systems, methods, and computer-readable media for dynamic partitioning in distributed computing environments.
- According to embodiments of the present disclosure, computer-implemented methods are disclosed for dynamic partitioning in distributed computing environments. One method includes: receiving, at a processor, a first data set and a second data set; mapping, by the processor, the first data set into a first set of key-value pairs; mapping, by the processor, the second data set into a second set of key-value pairs; estimating, by the processor using a sketch, a frequency count for each key based on the first set of key-value pairs and the second set of key-value pairs; determining, by the processor, whether the estimated frequency count for each key is greater than or equal to a predetermined threshold; and partitioning, by the processor, the key when the estimated frequency count for the key is greater than or equal to the predetermined threshold.
- According to embodiments of the present disclosure, systems are disclosed for dynamic partitioning in distributed computing environments. One system includes a data storage device that stores instructions for dynamic partitioning in distributed computing environments; and a processor configured to execute the instructions to perform a method including: receiving a first data set and a second data set; mapping the first data set into a first set of key-value pairs; mapping the second data set into a second set of key-value pairs; estimating, using a sketch, a frequency count for each key based on the first set of key-value pairs and the second set of key-value pairs; determining whether the estimated frequency count for each key is greater than or equal to a predetermined threshold; and partitioning the key when the estimated frequency count for the key is greater than or equal to the predetermined threshold.
- According to embodiments of the present disclosure, non-transitory computer-readable media storing instructions that, when executed by a computer, cause the computer to perform a method for dynamic partitioning in distributed computing environments are also disclosed. One method of the non-transitory computer-readable medium includes: receiving a first data set and a second data set; mapping the first data set into a first set of key-value pairs; mapping the second data set into a second set of key-value pairs; estimating, using a sketch, a frequency count for each key based on the first set of key-value pairs and the second set of key-value pairs; determining whether the estimated frequency count for each key is greater than or equal to a predetermined threshold; and partitioning the key when the estimated frequency count for the key is greater than or equal to the predetermined threshold.
- Additional objects and advantages of the disclosed embodiments will be set forth in part in the description that follows, and in part will be apparent from the description, or may be learned by practice of the disclosed embodiments. The objects and advantages of the disclosed embodiments will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims.
- It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the scope of disclosed embodiments, as set forth by the claims.
- The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate various exemplary embodiments and together with the description, serve to explain the principles of the disclosed embodiments.
-
FIG. 1 depicts a system implementing a MapReduce framework for dynamic partitioning in a distributed environment, according to embodiments of the present disclosure;
-
FIG. 2 depicts an exemplary blocking-based records/events linking using the MapReduce framework, according to embodiments of the present disclosure;
-
FIG. 3 depicts an exemplary blocking-based records/events linking using the MapReduce framework that includes a predetermined threshold when mapping data sets, according to embodiments of the present disclosure;
-
FIG. 4 depicts a system implementing a MapReduce framework for dynamic partitioning in a distributed environment using a global frequency counter, according to embodiments of the present disclosure;
-
FIG. 5 depicts a table of performance results for a MapReduce framework using a global frequency counter, according to embodiments of the present disclosure;
-
FIG. 6 depicts a method for dynamic partitioning in a distributed environment, according to embodiments of the present disclosure;
-
FIG. 7 depicts another method for dynamic partitioning in a distributed environment, according to embodiments of the present disclosure; and
-
FIG. 8 is a simplified functional block diagram of a computer configured as a device for executing the methods of FIGS. 6 and 7, according to exemplary embodiments of the present disclosure.
- It is to be understood that the figures are not necessarily drawn to scale, nor are the objects in the figures necessarily drawn to scale in relationship to one another. The figures are depictions that are intended to bring clarity and understanding to various embodiments of apparatuses, systems, and methods disclosed herein. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. Moreover, it should be appreciated that the drawings are not intended to limit the scope of the present teachings in any way.
- The following detailed description refers to the accompanying drawings. The same reference numbers in different drawings may identify the same or similar elements.
- A data analysis platform may process relatively large amounts of data to learn insights from the data. For example, an advertiser may have a relatively large amount of data relating to advertisements and campaigns. To determine the effectiveness and/or improve the effectiveness of an advertisement and/or campaign, the data may be stored in a software framework for distributed storage and distributed processing, such as Hadoop. In particular, Hadoop may be utilized for distributed processing of the data, and the Hadoop distributed file system (“HDFS”) may be used for organizing communications and storage of the data. Clusters and/or nodes may be generated that also utilize HDFS. For example, a cluster computing framework, such as Spark, may be arranged to further utilize the HDFS of the Hadoop clusters. A Hadoop cluster may allow for the distributed processing of large data sets across clusters of computers using programming models. A Hadoop cluster may scale up from single servers to thousands of machines, each offering local computation and storage.
- Accessing and organizing data in a large distributed system may be difficult and require specialized functionality for efficient operations. In one example, a MapReduce framework may be provided for accessing and processing data from the distributed computing system. According to embodiments of the present disclosure, a MapReduce framework may be used to process records/events related to a particular unique identifier (e.g., an advertiser id and/or a campaign id) in parallel. Thus, the workload of processing for a large number of records/events may be divided among a plurality of MapReduce nodes and divided among a plurality of computers within the MapReduce framework.
-
FIG. 1 depicts a system implementing a MapReduce framework, according to embodiments of the present disclosure. The system includes a cluster 100 of nodes working in parallel. Each node may be a computer, a processor, or a processing. The cluster 100 includes a master node 102 and a plurality of slave nodes 104, which perform MapReduce tasks and/or other tasks. As discussed in more detail below, MapReduce tasks include map tasks and reduce tasks. A data set received by the cluster 100 may be split into independent chunks of data that are processed by map tasks in parallel. The map tasks may produce a set of key-value pairs. The MapReduce framework may group the outputs of the map tasks by their respective keys, which may be input into the reduce tasks. The grouping of keys (also referred to as shuffling) may be a time-consuming process when the number of map task results is relatively large. Reduce tasks may consolidate the outputs from the map tasks into final results. The slave nodes 104 may include a plurality of map task nodes 106, a plurality of reduce task nodes 108, and/or a plurality of other tasks N nodes 110. The master node 102 may divide a data set into smaller data chunks and distribute the smaller data chunks to the map task nodes 106. Each reduce task node 108 may combine the output received from the map task nodes 106 into a single result. Each node in the cluster 100 may be coupled to a database 112. The results of each stage of the MapReduce tasks may be stored in the database 112, and the nodes in the cluster 100 may obtain the results from the database 112 in order to perform subsequent processing.
- As discussed above, a data set that is received may include a set of records/events that relate to a particular unique identifier (e.g., an advertiser id and/or a campaign id). When the data set is received, a unique key may be assigned to the data in the data set in order to uniquely identify the data. Another data set may also be received from the same data provider and/or a different data provider and include another set of records/events that relate to another particular unique identifier (e.g., another advertiser id and/or another campaign id). A unique key may be assigned to the second data of the second data set in order to uniquely identify the second data.
- The sets of records/events of the data sets may then be linked by matching records/events of the data sets. For example, a record/event of the data set may be assigned a first key, and other records/events of the data set with the same first key may be grouped into a block. The records/events of the block may be compared with each other to determine whether the information within the records/events matches.
- The MapReduce framework may be used to efficiently process the linking of records/events of data sets. As mentioned above, the MapReduce framework includes two major tasks, i.e., map and reduce. The map task receives the data of the data set as input and assigns a key to each record/event. The reduce task receives all values that have the same key and processes these groups. The map and reduce tasks may be simplified by the following algorithmic formulas:
-
$\mathrm{map} :: (K_1, V_1) \rightarrow \mathrm{list}(K_2, V_2)$
-
$\mathrm{reduce} :: (K_2, \mathrm{list}(V_2)) \rightarrow \mathrm{list}(V_3)$
- For example, the map task may output one or more key-value pairs. The reduce task may receive a list of values for a particular key, and, after computation, output a new list of values. Through mapping and reducing, the records/events included in the data sets may be separated into smaller units and distributed to different computing resources that may be run in parallel.
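- The two signatures above can be made concrete with a small single-process sketch. It is only an illustration of the data flow (map, group by key, reduce), not the framework itself; the names `run_map_reduce`, `block_by_field_a`, and `block_size`, and the record layout with a field `"A"`, are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple


def run_map_reduce(
    records: Iterable[Tuple[Any, Any]],
    map_fn: Callable[[Any, Any], List[Tuple[Any, Any]]],
    reduce_fn: Callable[[Any, List[Any]], List[Any]],
) -> Dict[Any, List[Any]]:
    """Apply map :: (K1, V1) -> list(K2, V2), group by K2, then reduce."""
    groups: Dict[Any, List[Any]] = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)  # grouping step (the "shuffle")
    # reduce :: (K2, list(V2)) -> list(V3)
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}


# Hypothetical map task: use field "A" of each record as the blocking key.
def block_by_field_a(record_id: int, record: dict) -> List[Tuple[str, int]]:
    return [(record["A"], record_id)]


# Hypothetical reduce task: report how many records landed in the block.
def block_size(key: str, record_ids: List[int]) -> List[int]:
    return [len(record_ids)]


records = [(1, {"A": "x"}), (2, {"A": "y"}), (3, {"A": "x"})]
result = run_map_reduce(records, block_by_field_a, block_size)
```

- In a real cluster the grouping step is carried out across nodes rather than in one dictionary, which is why it can become expensive when the number of map outputs is large.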
- In the map phase, input data may be processed by map tasks in parallel, and the intermediate outputs of the map tasks may be collected locally and grouped based on their respective key values. Based on a partition function (such as a default hashing function and/or a user-defined function), the groups may be allocated to a reduce task depending on their keys. Upon completion of the map tasks and the transfer of the intermediate results to the respective reduce tasks, reduce task operations may begin. The reduce task operations may also be processed in parallel for each key group.
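- A hash-based partition function of the kind mentioned above might look like the following sketch. The function names and the choice of CRC32 as the hash are assumptions made for illustration; the text does not specify which hash is used.

```python
import zlib
from typing import Callable, Dict, List


def default_partition(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index (hypothetical default partitioner).

    CRC32 is used here instead of Python's built-in hash() so that the
    assignment is stable from one run to the next.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_loads(
    key_counts: Dict[str, int],
    num_reducers: int,
    partition: Callable[[str, int], int] = default_partition,
) -> List[int]:
    """Sum the number of values each reducer would receive."""
    loads = [0] * num_reducers
    for key, count in key_counts.items():
        loads[partition(key, num_reducers)] += count
    return loads


# One very frequent key and two rare keys (made-up counts).
key_counts = {"k1": 1000, "k2": 10, "k3": 10}
loads = reducer_loads(key_counts, num_reducers=4)
```

- Because all values for a key go to a single reducer, whichever reducer receives `k1` carries at least 1000 of the 1020 values regardless of how the other keys are spread, which is the skew that the disclosure sets out to mitigate.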
- As mentioned above, the data sets may be partitioned into several blocks of data using keys by map tasks, and assigned in parallel to reduce tasks. FIG. 2 depicts an exemplary blocking-based records/events linking using the MapReduce framework, according to embodiments of the present disclosure. Field A of data sets 202A and 202B may be used as the key, and the records/events B of the respective data sets […]
- With the MapReduce framework, data skew occurs when the workload is non-uniformly distributed. When typical data skew occurs, computer resources that process reduce tasks may receive a relatively large number of key-value pairs, and may require a relatively longer amount of processing time to complete the reduce tasks compared to other computer resources that process other reduce tasks. Such an uneven distribution of key-value pairs may reduce the benefits of parallelization. For example, as shown in FIG. 2, the computing resources needed for reduce task operations […] reduce task operations 204C may compare ten record/event pairs.
- When the block size distribution is skewed, the MapReduce framework may assign some computing resources for reduce task operations a larger workload, such as 204C. Data skew occurs because of the imbalanced distribution of block sizes. To alleviate the imbalanced distribution of block sizes, each map task operation may maintain a frequency counter per key. The frequency counter per key may be used in conjunction with a predetermined threshold to do one or more of the following: split a key, create sub-keys, and/or allocate record/event pairs to particular computing resources to ensure that a load of the computer resources is balanced.
- Additionally, to alleviate the imbalanced distribution of block sizes, each reduce task operation and/or each stage of a MapReduce operation may maintain a frequency counter per key. The frequency counter per key may be used in conjunction with an overall predetermined threshold, a reduce task predetermined threshold, and/or a stage predetermined threshold to do one or more of the following: split a key, create sub-keys, and/or allocate record/event pairs to particular computing resources to ensure that a load of the computer resources is balanced.
- In order to estimate a frequency count per key, the data sets may be examined to produce a workload estimation based on a sketch of the data sets […] slave node 104 that processes map tasks 106 may use a frequency counter 114 to estimate a number of values that are repeated in over a predetermined fraction of the rows, for each column of data being processed.
- For example, the frequency counter 114 may use a sketch when inputting a stream of records/events, one at a time, of a data set, such as data set […] frequency counter 114 may count a frequency of the different types of records/events in the stream. The sketch may be used to estimate the frequency of each record/event type. The count-min sketch data structure may be a two-dimensional array of cells with w columns and d rows. The values for the parameters w and d may be fixed when the sketch is created, and may be used to determine time and space needs and the probability of error when the sketch is queried for a frequency. Associated with each of the d rows is a separate and independent hash function. Each hash function $h_i$ maps a blocking key k into a hashing space of size w. The parameters w and d may be set with $w = \lceil e/\varepsilon \rceil$ and $d = \lceil \ln(1/\delta) \rceil$, where the error in answering a query is within an additive factor of $\varepsilon$ with probability at least $1 - \delta$.
- Each cell of the two-dimensional array of a sketch may include a counter, and initially, every counter in the array may be set to zero. When a new record/event of a given type is detected (i.e., a key k is detected), the counters may be incremented: in each row, the counter in the column selected by that row's hash function for k is incremented. If a counter of a cell of the two-dimensional array of the sketch is greater than or equal to a predetermined count threshold for the particular key k, then the individual map task may partition (split) the key into two or more sub-keys with the map task operation. The predetermined count threshold may be a predetermined value and/or a range of values that may be determined empirically and/or dynamically. For example, a dynamically predetermined count threshold may use machine learning to determine a value or a range of values for the predetermined count threshold.
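- The count-min sketch described above can be illustrated with a minimal sketch in Python. The class name, the method names, and the use of salted BLAKE2b digests as the d row hash functions are assumptions for illustration; the disclosure does not name a hash family.

```python
import hashlib
import math


class CountMinSketch:
    """Illustrative count-min sketch with w columns and d rows."""

    def __init__(self, epsilon: float, delta: float) -> None:
        # w = ceil(e / epsilon), d = ceil(ln(1 / delta)); at least one row.
        self.w = math.ceil(math.e / epsilon)
        self.d = max(1, math.ceil(math.log(1.0 / delta)))
        self.table = [[0] * self.w for _ in range(self.d)]

    def _column(self, row: int, key: str) -> int:
        # One hash function per row, obtained by salting with the row index
        # (an assumption; any independent hash family could be used).
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.w

    def add(self, key: str, count: int = 1) -> None:
        """Increment one counter in every row for this key."""
        for row in range(self.d):
            self.table[row][self._column(row, key)] += count

    def estimate(self, key: str) -> int:
        """Return the minimum counter over all rows (never an undercount)."""
        return min(
            self.table[row][self._column(row, key)] for row in range(self.d)
        )


sketch = CountMinSketch(epsilon=0.01, delta=0.01)
for k in ["a"] * 5 + ["b"] * 2:
    sketch.add(k)
```

- With $\varepsilon = 0.01$ and $\delta = 0.01$ the table has 272 columns and 5 rows. Collisions can only inflate a counter, so taking the minimum over the rows gives an estimate that is never below the true count.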
-
FIG. 3 depicts an exemplary blocking-based records/events linking using the MapReduce framework that includes a predetermined threshold when mapping data sets, according to embodiments of the present disclosure. Field A of data sets 302A and 302B may be used as the blocking key, and the records/events B of the respective data sets […] FIG. 2, the predetermined threshold for determining whether a mapper should partition (split) a key may be 4. When the frequency of the key 1 is determined to be 4, the mapper may split the key 1 into keys […] reduce task operations 304A may compare nine record/event pairs, the computing resources needed for reduce task operations 304B may compare six record/event pairs, the computing resources needed for reduce task operations 304C may compare one record/event pair, and the computing resources needed for reduce task operations 304D may compare two record/event pairs. Without the partitioning of the keys, the computer resources for reduce task operations […] reduce task operations […]
- As discussed in detail above, each slave node that processes map tasks may include a frequency counter for each key using a sketch, and may partition a key when the frequency count associated with the key reaches or exceeds a predetermined threshold. The above-described frequency counter may allow for data skew to be mitigated locally at the slave node. In order to further mitigate data skew, the frequency counter for each key may be maintained globally.
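- The local, threshold-based splitting described above might be sketched as follows. This is one possible scheme, not necessarily the one shown in FIG. 3: the function name `assign_sub_keys`, the `key#index` sub-key format, and the rule that each sub-key holds at most `threshold` records are all assumptions. An exact counter is used for brevity; a sketch-based estimate could be substituted.

```python
from collections import Counter
from typing import Iterable, List


def assign_sub_keys(keys: Iterable[str], threshold: int) -> List[str]:
    """Rewrite each key as 'key#i', opening a new sub-key once the
    running count for that key reaches the threshold (hypothetical scheme)."""
    seen: Counter = Counter()
    sub_keys = []
    for key in keys:
        sub_index = seen[key] // threshold  # 0 until `threshold` records seen
        seen[key] += 1
        sub_keys.append(f"{key}#{sub_index}")
    return sub_keys


# Key "1" occurs six times and key "2" twice; the threshold is 4.
stream = ["1", "1", "2", "1", "1", "1", "2", "1"]
sub_key_counts = Counter(assign_sub_keys(stream, threshold=4))
```

- Here the six records with key 1 are spread over two sub-keys of four and two records, while key 2 stays in a single block. The sketch does not show how record/event pairs that span two sub-keys of the same original key would be compared.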
- As shown in FIG. 4, the master node 102 may also include a global frequency counter 116 that maintains a global frequency count for each key. The global frequency counter 116 may maintain a sketch, such as a count-min sketch, and the frequency counters 114 of the slave nodes 104 including map tasks 106 may retrieve the global frequency count for each key from the global frequency counter 116.
- For example, the local frequency counters 114 of the slave nodes 104 including map tasks 106 may retrieve the global frequency count for each key from the global frequency counter 116. Then the slave nodes 104 may determine an updated frequency count for each key based on the estimated frequency counts for each key and the retrieved global frequency count for each key. The map tasks may then partition (split) their local keys based on the locally updated frequency counts for each key and the predetermined threshold. Upon completion of the map tasks, the local frequency counters 114 may transmit their local updated frequency counts for each key to the global frequency counter 116 of the master node 102.
-
FIG. 5 depicts a table of performance results for a MapReduce framework using a global frequency counter, according to embodiments of the present disclosure. The test data included 20,599 files having a total size of 2.9 terabytes. The run used 40,855 total map tasks, with 559 concurrent map tasks, and 316 total reduce tasks, with 279 concurrent reduce tasks, running Hadoop 2.7.1. The sketch used for frequency counting was a count-min sketch. The various parameters of each performance result are depicted in the table of FIG. 5. In another embodiment, the MapReduce framework may be substituted with a Spark framework, and an execution time may be reduced from about 2-3 hours to about 40 minutes. A Spark framework implementation may be similar to a MapReduce framework implementation. The Spark framework implementation may differ from the MapReduce framework implementation in that (i) data may be processed in memory to reduce slowdown due to disk input/output, (ii) map and reduce stages may not occur separately in order to avoid a total replicated disk write and network transfer, and (iii) a partition/re-partition of sub-keyed data may be done in memory with minimum shuffling.
-
FIG. 6 depicts a method for dynamic partitioning in a distributed environment, according to embodiments of the present disclosure. The method 600 may begin at step 602 in which a node, such as the master node 102 and/or slave node 104, may receive a first data set and a second data set. When a master node receives the first data set and the second data set, the master node may distribute a portion and/or all of the first data set and the second data set to one or more of the slave nodes for distributed processing. When a slave node receives the portion and/or all of the first data set and the second data set, the slave node may process the portion and/or all of the first data set and the second data set according to one or more tasks handled by the slave node.
- At step 604, the slave node may perform a map task on the first data set. The map task may map the first data set, and may output a first set of key-value pairs based on the first data set. Additionally, a plurality of slave nodes may perform map tasks on a plurality of first data sets in parallel, and the intermediate outputs of the map tasks may be collected locally at each slave node.
- At step 606, the slave node may perform a map task on the second data set. The map task may map the second data set, and may output a second set of key-value pairs based on the second data set. Additionally, a plurality of slave nodes may perform map tasks on a plurality of second data sets in parallel, and the intermediate outputs of the map tasks may be collected locally at each slave node.
- At step 608, each slave node may estimate, using a sketch, a frequency count for each key based on the first set of key-value pairs and the second set of key-value pairs. For example, the first and second data sets may be examined to produce a workload estimation based on the sketch of the first and second data sets. A frequency counter, such as frequency counter 114 of a slave node 104, may use various algorithms, such as an algorithm that uses a lossy count and/or an algorithm that uses sketches to count the number of distinct values in the first and second sets of key-value pairs. A sketch may be a data structure that provides space-efficient summaries for large and frequently updated data sets. A sketch data structure may estimate a number of distinct values that have been assigned to a particular key in the first and second sets of key-value pairs. The frequency counter may estimate a number of distinct values for each key in the first and second sets of key-value pairs. In one embodiment, the sketch data structure may be a count-min sketch.
- Then at step 610, the slave node may determine whether the estimated frequency count for each key is greater than or equal to a predetermined threshold. At step 612, the slave node may partition a key when the frequency count associated with the key is greater than or equal to the predetermined threshold. For example, each slave node that processes map tasks may include a frequency counter for each key, and the slave node may partition a key when the frequency count associated with the key reaches or exceeds the predetermined threshold.
- After step 612, the process may continue. For example, the slave node may group the values associated with the keys based on the key. Then, other slave nodes that process reduce tasks may receive a list of values for a particular key, and, after computation, output a new list of values.
-
FIG. 7 depicts another method for dynamic partitioning in a distributed environment, according to embodiments of the present disclosure. The method 700 may begin at step 702 in which a node, such as the master node 102 and/or slave node 104, may receive a first data set and a second data set. When a master node receives the first data set and the second data set, the master node may distribute a portion and/or all of the first data set and the second data set to one or more of the slave nodes for distributed processing. When a slave node receives the portion and/or all of the first data set and the second data set, the slave node may process the portion and/or all of the first data set and the second data set according to one or more tasks handled by the slave node.
- At step 704, the slave node may perform a map task on the first data set. The map task may map the first data set, and may output a first set of key-value pairs based on the first data set. Additionally, a plurality of slave nodes may perform map tasks on a plurality of first data sets in parallel, and the intermediate outputs of the map tasks may be collected locally at each slave node.
- At step 706, the slave node may perform a map task on the second data set. The map task may map the second data set, and may output a second set of key-value pairs based on the second data set. Additionally, a plurality of slave nodes may perform map tasks on a plurality of second data sets in parallel, and the intermediate outputs of the map tasks may be collected locally at each slave node.
- At step 708, the slave node may retrieve, from a master node, a global frequency count for each key mapped in the first and second sets of key-value pairs. The master node, such as master node 102, may also include a global frequency counter, such as global frequency counter 116, that maintains a global frequency count for each key. The global frequency counter 116 may also maintain a sketch, such as a count-min sketch. The frequency counters, such as frequency counters 114, of each slave node may retrieve the global frequency count for each key from the global frequency counter.
- At step 710, each slave node may estimate, using a sketch, a frequency count for each key based on the first set of key-value pairs and the second set of key-value pairs. For example, the first and second data sets may be examined to produce a workload estimation based on the sketch of the first and second data sets. A frequency counter, such as frequency counter 114 of a slave node 104, may use various algorithms, such as an algorithm that uses a lossy count and/or an algorithm that uses sketches to count the number of distinct values in the first and second sets of key-value pairs. A sketch may be a data structure that provides space-efficient summaries for large and frequently updated data sets. A sketch data structure may estimate a number of distinct values that have been assigned to a particular key in the first and second sets of key-value pairs. The frequency counter may estimate a number of distinct values for each key in the first and second sets of key-value pairs. In one embodiment, the sketch data structure may be a count-min sketch.
- At step 712, each slave node may determine an updated frequency count for each key based on the retrieved global frequency count for each key and the estimated frequency count for each key. For example, the slave node, for each key, may average the global frequency count for a key and the estimated frequency count for the key, and generate an updated frequency count for the key based on the average.
- Then at step 714, the slave node may determine whether the updated frequency count for each key is greater than or equal to a predetermined threshold. At step 716, the slave node may partition a key when the updated frequency count associated with the key is greater than or equal to the predetermined threshold. For example, each slave node that processes map tasks may include a frequency counter for each key, and the slave node may partition a key when the updated frequency count associated with the key reaches or exceeds the predetermined threshold.
- At step 718, the slave node may transmit, to the master node, the updated frequency count for each key. Accordingly, the master node may update the global frequency count with the updated frequency count from each slave node. After step 718, the process may continue. For example, the slave node may group the values associated with the keys based on the key. Then, other slave nodes that process reduce tasks may receive a list of values for a particular key, and, after computation, output a new list of values.
-
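The exchange in steps 708 through 718 can be pictured with a small in-process sketch. The averaging rule follows the example given for step 712. Everything else is an assumption made for illustration: the class and function names are hypothetical, the master is a plain object rather than a remote node, and the rule that the master keeps the larger of its stored count and the reported count is only one possible reading of "update the global frequency count".

```python
from typing import Dict


class GlobalFrequencyCounter:
    """Hypothetical stand-in for global frequency counter 116."""

    def __init__(self) -> None:
        self.counts: Dict[str, float] = {}

    def retrieve(self, key: str) -> float:
        # Step 708: a key the master has never seen has a count of zero.
        return self.counts.get(key, 0.0)

    def report(self, key: str, updated: float) -> None:
        # Step 718. ASSUMPTION: keep the larger of stored and reported count.
        self.counts[key] = max(self.retrieve(key), updated)


def updated_count(global_count: float, local_estimate: float) -> float:
    """Step 712: average the global count and the local estimate."""
    return (global_count + local_estimate) / 2.0


def process_key(
    master: GlobalFrequencyCounter,
    key: str,
    local_estimate: float,
    threshold: float,
) -> bool:
    """Return True when the key should be partitioned (steps 714-716)."""
    updated = updated_count(master.retrieve(key), local_estimate)
    should_split = updated >= threshold
    master.report(key, updated)
    return should_split


master = GlobalFrequencyCounter()
master.counts["hot"] = 6.0
split_hot = process_key(master, "hot", local_estimate=4.0, threshold=5.0)
split_new = process_key(master, "new", local_estimate=4.0, threshold=5.0)
```

In this run the key `hot` is split because the average of 6 and 4 reaches the threshold of 5, while the key `new` is not, since the master has no prior count for it and the average is 2.
-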
FIG. 8 is a simplified functional block diagram of a computer that may be configured as the nodes, computing device, servers, providers, and/or network elements for executing the methods, according to an exemplary embodiment of the present disclosure. Specifically, in one embodiment, any of the nodes, computing device, servers, providers, and/or network may be an assembly of hardware 800 including, for example, a data communication interface 860 for packet data communication. The platform may also include a central processing unit (“CPU”) 820, in the form of one or more processors, for executing program instructions. The platform typically includes an internal communication bus 810, program storage, and data storage for various data files to be processed and/or communicated by the platform, such as ROM 830 and RAM 840, although the system 800 often receives programming and data via network communications. The system 800 also may include input and output ports 850 to connect with input and output devices such as keyboards, mice, touchscreens, monitors, displays, etc. Of course, the various system functions may be implemented in a distributed fashion on a number of similar platforms, to distribute the processing load. Alternatively, the systems may be implemented by appropriate programming of one computer hardware platform.
- Program aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of executable code and/or associated data that is carried on or embodied in a type of machine-readable medium. “Storage” type media include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer of the mobile communication network into the computer platform of a server and/or from a server to the mobile device. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links, or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
- While the presently disclosed methods, devices, and systems are described with exemplary reference to transmitting data, it should be appreciated that the presently disclosed embodiments may be applicable to any environment, such as a desktop or laptop computer, an automobile entertainment system, a home entertainment system, etc. Also, the presently disclosed embodiments may be applicable to any type of Internet protocol.
- As will be recognized, the present disclosure is not limited to these particular embodiments. For instance, although described in the context of MapReduce, the present disclosure may also be used in other distributed computing environments.
- Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
Claims (21)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/198,133 US11221890B2 (en) | 2016-06-22 | 2018-11-21 | Systems and methods for dynamic partitioning in distributed environments |
US17/122,849 US11442792B2 (en) | 2016-06-22 | 2020-12-15 | Systems and methods for dynamic partitioning in distributed environments |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/189,158 US10162830B2 (en) | 2016-06-22 | 2016-06-22 | Systems and methods for dynamic partitioning in distributed environments |
US16/198,133 US11221890B2 (en) | 2016-06-22 | 2018-11-21 | Systems and methods for dynamic partitioning in distributed environments |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/189,158 Continuation US10162830B2 (en) | 2016-06-22 | 2016-06-22 | Systems and methods for dynamic partitioning in distributed environments |
Related Child Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/122,849 Continuation US11442792B2 (en) | 2016-06-22 | 2020-12-15 | Systems and methods for dynamic partitioning in distributed environments |
Publications (3)
Publication Number | Publication Date |
---|---|
US20200159594A1 (en) | 2020-05-21 |
US20210365300A9 (en) | 2021-11-25 |
US11221890B2 (en) | 2022-01-11 |
Family
ID=79167799
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/198,133 Active US11221890B2 (en) | 2016-06-22 | 2018-11-21 | Systems and methods for dynamic partitioning in distributed environments |
Country Status (1)
Country | Link |
---|---|
US (1) | US11221890B2 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11941002B1 (en) | 2022-03-31 | 2024-03-26 | Amazon Technologies, Inc. | Dynamically sort data |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11290476B1 (en) | 2019-11-29 | 2022-03-29 | Amazon Technologies, Inc. | Time bounded lossy counters for network data |
CN111506399B (en) * | 2020-03-05 | 2024-03-22 | 百度在线网络技术(北京)有限公司 | Task migration method and device, electronic equipment and storage medium |
US11423049B2 (en) * | 2020-05-11 | 2022-08-23 | Google Llc | Execution-time dynamic range partitioning transformations |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7599910B1 (en) * | 1993-11-16 | 2009-10-06 | Hitachi, Ltd. | Method and system of database divisional management for parallel database system |
US7590620B1 (en) * | 2004-06-18 | 2009-09-15 | Google Inc. | System and method for analyzing data records |
US8090754B2 (en) * | 2007-12-07 | 2012-01-03 | Sap Ag | Managing relationships of heterogeneous objects |
US8954967B2 (en) * | 2011-05-31 | 2015-02-10 | International Business Machines Corporation | Adaptive parallel data processing |
US10635644B2 (en) * | 2013-11-11 | 2020-04-28 | Amazon Technologies, Inc. | Partition-based data stream processing framework |
- 2018-11-21: US application US16/198,133, patent US11221890B2 (en), status Active
Also Published As
Publication number | Publication date |
---|---|
US11221890B2 (en) | 2022-01-11 |
US20200159594A1 (en) | 2020-05-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11221890B2 (en) | Systems and methods for dynamic partitioning in distributed environments | |
US10162830B2 (en) | Systems and methods for dynamic partitioning in distributed environments | |
Mohebi et al. | Iterative big data clustering algorithms: a review | |
US10157429B2 (en) | Fast and scalable connected component computation | |
Lin | Mr-apriori: Association rules algorithm based on mapreduce | |
Xu et al. | LogGP: A log-based dynamic graph partitioning method | |
US9953071B2 (en) | Distributed storage of data | |
Mestre et al. | Improving load balancing for mapreduce-based entity matching | |
WO2017118335A1 (en) | Mapping method and device | |
US20150149437A1 (en) | Method and System for Optimizing Reduce-Side Join Operation in a Map-Reduce Framework | |
Yan et al. | Scalable load balancing for mapreduce-based record linkage | |
Liroz-Gistau et al. | Dynamic workload-based partitioning for large-scale databases | |
US20150172369A1 (en) | Method and system for iterative pipeline | |
US11442792B2 (en) | Systems and methods for dynamic partitioning in distributed environments | |
Slagter et al. | SmartJoin: a network-aware multiway join for MapReduce | |
Mestre et al. | Efficient entity matching over multiple data sources with mapreduce | |
Wang et al. | A query-oriented adaptive indexing technique for smart grid big data analytics | |
US11048756B2 (en) | Inserting datasets into database systems utilizing hierarchical value lists | |
CN111767287A (en) | Data import method, device, equipment and computer storage medium | |
CN111143456B (en) | Spark-based Cassandra data import method, device, equipment and medium | |
CN114297260A (en) | Distributed RDF data query method and device and computer equipment | |
US11036678B2 (en) | Optimizing files stored in a distributed file system | |
Khan et al. | Computational performance analysis of cluster-based technologies for big data analytics | |
Espinosa et al. | Analysis and improvement of map-reduce data distribution in read mapping applications | |
Pal et al. | Distributed synthesized association mining for big transactional data |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: OATH (AMERICAS) INC., NEW YORK Free format text: CHANGE OF NAME;ASSIGNOR:AOL ADVERTISING INC.;REEL/FRAME:047877/0896 Effective date: 20170612 Owner name: AOL ADVERTISING INC., NEW YORK Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:KYAW, THU R.;JI, JONATHAN;MUFTI, SAAD;AND OTHERS;SIGNING DATES FROM 20160615 TO 20160616;REEL/FRAME:047565/0246 |
FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
AS | Assignment | Owner name: VERIZON MEDIA INC., VIRGINIA Free format text: CHANGE OF NAME;ASSIGNOR:OATH (AMERICAS) INC.;REEL/FRAME:051999/0720 Effective date: 20200122 |
STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED |
STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
STCF | Information on status: patent grant | Free format text: PATENTED CASE |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
FEPP | Fee payment procedure | Free format text: PETITION RELATED TO MAINTENANCE FEES GRANTED (ORIGINAL EVENT CODE: PTGR); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
AS | Assignment | Owner name: YAHOO ASSETS LLC, VIRGINIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:YAHOO AD TECH LLC (FORMERLY VERIZON MEDIA INC.);REEL/FRAME:058982/0282 Effective date: 20211117 |
STCF | Information on status: patent grant | Free format text: PATENTED CASE |
AS | Assignment | Owner name: ROYAL BANK OF CANADA, AS COLLATERAL AGENT, CANADA Free format text: PATENT SECURITY AGREEMENT (FIRST LIEN);ASSIGNOR:YAHOO ASSETS LLC;REEL/FRAME:061571/0773 Effective date: 20220928 |