WO2017113865A1 - A big data incremental calculation method and apparatus (一种大数据增量计算方法和装置) - Google Patents


Info

Publication number: WO2017113865A1
Authority: WIPO (PCT)
Application number: PCT/CN2016/097946
Other languages: English (en); French (fr)
Inventors: 陈世敏, 杨慧, 张军
Original Assignee: 华为技术有限公司 (Huawei Technologies Co., Ltd.)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval of structured data, e.g. relational data
    • G06F 16/21: Design, administration or maintenance of databases

Definitions

  • The embodiments of the present invention relate to the field of computers, and in particular to a big data incremental calculation method and apparatus.
  • Big data processing plays an increasingly important role in actual production.
  • Many big data calculations performed in actual production are defined in advance, such as statistical reports, recommendation models in recommendation systems, search engine indexes, and web page ranking models.
  • When the input data D is updated by incremental data ΔD, the calculation must be performed again so that the results reflect the updated data.
  • A complete big data calculation must be performed on the entire data set D+ΔD to obtain the latest results; because the amount of data in D is typically very large, a complete recalculation is expensive.
  • Incremental calculation is a better way to address the cost of recalculating the entire data set whenever new data is added: because the updated data ΔD is usually much smaller than the original data D, calculating only its effect can be far cheaper.
  • Existing big data incremental computing solutions adopt coarse-grained, task-level incremental calculation tied to a specific big data computing platform, so the potential efficiency of incremental calculation cannot be fully exploited.
  • Accordingly, an embodiment of the present invention provides a big data incremental calculation method and apparatus: the big data calculation is divided into at least two calculation steps, a data-granularity incremental delivery rule is defined for each calculation step, and the big data incremental calculation is performed by passing increments from step to step.
  • In a first aspect, the present application provides a big data calculation method, where the big data calculation includes at least two calculation steps. The method includes: calculating the incremental output result of the big data calculation according to the incremental data, the incremental delivery rule of each calculation step, and the necessary data that needs to be saved for each calculation step, where the necessary data includes at least one of a complete input and a complete output, and the incremental delivery rule describes, at data granularity, the calculation rule by which each calculation step calculates its incremental output from its incremental input and the necessary data saved for that step; the necessary data of each calculation step is saved according to that step's incremental delivery rule whenever a complete calculation or an incremental calculation is performed; and determining the final calculation result based on the incremental output result and the original output result of the big data calculation.
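  • As a concrete illustration of the incremental delivery rule described above, the following Python sketch shows one possible rule for a word-count step: the step's saved necessary data is its complete output (the current word totals), and the rule maps an incremental input to an incremental output covering only the affected keys. The word-count example and all names are illustrative assumptions, not taken from the patent.

```python
def word_count_delta(delta_in, saved_counts):
    """Hypothetical incremental delivery rule for a word-count step.

    delta_in: incremental input (new records).
    saved_counts: the step's saved necessary data (its complete output).
    Returns the incremental output: new totals for affected keys only.
    """
    delta_out = {}
    for record in delta_in:
        for word in record.split():
            delta_out[word] = delta_out.get(word, 0) + 1
    # Incremental output covers only the keys touched by the increment.
    new_totals = {w: saved_counts.get(w, 0) + c for w, c in delta_out.items()}
    # The saved necessary data is updated as part of the incremental step.
    saved_counts.update(new_totals)
    return new_totals

saved = {"big": 2, "data": 2}
out = word_count_delta(["big data incremental"], saved)
# out == {"big": 3, "data": 3, "incremental": 1}
```

Note that the rule never re-reads the complete input; it only consumes the increment and the saved state, which is what makes the calculation fine-grained.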
  • In a possible implementation, the method further includes: when the necessary data saved for a calculation step has an increment, updating the saved necessary data of that calculation step according to the increment.
  • The complete input of each calculation step refers to the complete input of that step in the calculations before the current incremental calculation.
  • The complete output of each calculation step refers to the complete output of that step in the calculations before the current incremental calculation.
  • The complete input and the complete output can be updated as the incremental calculation proceeds. More specifically, the complete input refers to the complete input as updated after the last complete or incremental calculation of the step, and the complete output refers to the complete output as updated after the last complete or incremental calculation of the step.
  • In another possible implementation, the big data calculation includes at least two calculation phases, and each calculation phase includes at least one calculation step. Calculating the incremental output result of the big data calculation according to the incremental data, the incremental delivery rules of the calculation steps, and the saved necessary data then includes: performing the incremental calculation phase by phase, in phase order, according to the incremental delivery rules of the calculation steps included in each phase and the necessary data saved for those steps, where the incremental input of the first calculation phase is the incremental data, the incremental input of the (i+1)-th calculation phase is the incremental output of the i-th calculation phase, the incremental output of the last calculation phase is the incremental output result of the big data calculation, and i is a positive integer greater than 0.
  • The calculation steps are divided into multiple calculation phases, and each calculation phase may include one or more calculation steps; the calculation steps within a phase may run in parallel, which enhances the flexibility of big data calculation.
  • In a third possible implementation of the first aspect, before a first calculation step among the calculation steps is performed, the incremental calculation cost of the first calculation step is determined; if the incremental calculation cost of the first calculation step is greater than its complete calculation cost, the incremental calculation in the first calculation step and in the calculation steps after it is switched to a complete calculation.
  • "First" herein does not denote an order: the "first calculation step" may be any one of the calculation steps of the big data calculation. The entire calculation can be expressed as a graph, and the critical path of the graph can be found. The time of each operation on the critical path has the greatest impact on the total computation time, so it is best to analyze the operations on the critical path from back to front; if the complete calculation is cheaper for a step, switch to the complete calculation, thereby ensuring the efficiency of the big data calculation.
  • The incremental data, the necessary data required for each calculation step, and the incremental input and incremental output of each calculation step are organized into an overall parallel data set.
  • the overall parallel data set consists of key/value pairs.
  • The overall parallel data set supports read and write operations, and operations on the overall parallel data set support parallel execution by multiple compute nodes.
  • Organizing the data into an overall parallel data set enables fine-grained, data-granularity big data calculation and improves the computational efficiency of big data calculation.
  • The present application further provides a computer readable medium comprising computer-executable instructions; when a processor of a computer executes the instructions, the computer performs the method of the first aspect or of any possible implementation of the first aspect.
  • The present application further provides a computing device, including: a processor, a memory, a bus, and a communication interface. The memory is configured to store execution instructions, and the processor is connected to the memory through the bus; when the computing device runs, the processor executes the execution instructions stored in the memory, causing the computing device to perform the method of the first aspect or of any possible implementation of the first aspect.
  • In a further aspect, the present application provides a big data computing device, where the big data calculation includes at least two calculation steps, the device comprising: a calculation unit, configured to calculate the incremental output result of the big data calculation according to the incremental data, the incremental delivery rule of each calculation step, and the necessary data that needs to be saved for each calculation step, where the necessary data includes at least one of a complete input and a complete output, the incremental delivery rule describes, at data granularity, the calculation rule by which each calculation step calculates its incremental output from its incremental input and the necessary data saved for that step, and the necessary data of each calculation step is saved according to that step's incremental delivery rule when a complete calculation or an incremental calculation is performed; and a determining unit, configured to determine the final calculation result based on the incremental output result and the original output result of the big data calculation.
  • In a possible implementation, the device further includes an updating unit, configured to: when the necessary data saved for a calculation step has an increment, update the saved necessary data of that calculation step according to the increment.
  • The complete input of each calculation step refers to the complete input of that step in the calculations before the current incremental calculation.
  • The complete output of each calculation step refers to the complete output of that step in the calculations before the current incremental calculation.
  • The complete input and the complete output can be updated as the incremental calculation proceeds. More specifically, the complete input refers to the complete input as updated after the last complete or incremental calculation of the step, and the complete output refers to the complete output as updated after the last complete or incremental calculation of the step.
  • In another possible implementation, the big data calculation includes at least two calculation phases, and each calculation phase includes at least one calculation step.
  • The calculation unit is configured to perform the incremental calculation phase by phase, in phase order, according to the incremental delivery rules of the calculation steps included in each calculation phase and the necessary data saved for those steps, where the incremental input of the first calculation phase is the incremental data, the incremental input of the (i+1)-th calculation phase is the incremental output of the i-th calculation phase, the incremental output of the last calculation phase is the incremental output result of the big data calculation, and i is a positive integer greater than 0.
  • The calculation steps are divided into multiple calculation phases, and each calculation phase may include one or more calculation steps; the calculation steps within a phase may run in parallel, which enhances the flexibility of big data calculation.
  • The determining unit is further configured to: before a first calculation step among the calculation steps is performed, determine the incremental calculation cost of the first calculation step; if the incremental calculation cost of the first calculation step is greater than its complete calculation cost, the incremental calculation in the first calculation step and in the calculation steps after it is switched to a complete calculation.
  • "First" herein does not denote an order: the "first calculation step" may be any one of the calculation steps of the big data calculation. The entire calculation can be expressed as a graph, and the critical path of the graph can be found. The time of each operation on the critical path has the greatest impact on the total computation time, so it is best to analyze the operations on the critical path from back to front; if the complete calculation is cheaper for a step, switch to the complete calculation, thereby ensuring the efficiency of the big data calculation.
  • The incremental data, the necessary data required for each calculation step, and the incremental input and incremental output of each calculation step are organized into an overall parallel data set.
  • the overall parallel data set consists of key/value pairs.
  • The overall parallel data set supports read and write operations, and operations on the overall parallel data set support parallel execution by multiple compute nodes.
  • Organizing the data into an overall parallel data set enables fine-grained, data-granularity big data calculation and improves the computational efficiency of big data calculation.
  • In the embodiments of the present invention, the big data calculation is divided into at least two calculation steps; the incremental input, the necessary data, and the incremental output of each calculation step are organized in data-set form; and each calculation step performs the incremental calculation in a fine-grained manner according to its data-granularity incremental delivery rule, thereby improving the efficiency of big data incremental calculation.
  • the embodiment of the present invention describes the basic theory of big data incremental calculation, and provides a general fine-grained big data incremental calculation method, which is applicable to various big data computing systems.
  • FIG. 1 is an exemplary block diagram of a big data computing system according to an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of the hardware structure of a big data incremental computing device according to an embodiment of the invention;
  • FIG. 3(a) and FIG. 3(b) are schematic diagrams of a big data incremental calculation architecture according to an embodiment of the invention;
  • FIG. 4 is an exemplary flowchart of a big data incremental calculation method according to an embodiment of the invention;
  • FIG. 5 is an exemplary structural diagram of a big data incremental calculation method according to an embodiment of the invention;
  • FIG. 6 is a schematic structural diagram of a big data incremental computing device according to an embodiment of the invention.
  • FIG. 1 shows an exemplary block diagram of a big data computing system 100. As shown in FIG. 1, the system 100 includes a client 102, a manager 104, and a plurality of nodes 106.
  • the client 102 runs an application, and the type of the application may be MapReduce, Giraph, Storm, Dryad, Pregel, Spark, Tez/Impala, or Message Passing Interface (MPI). Any type of application can be run on the client 102, and the user can implement a customized application type, thereby implementing a brand new application framework, which is not limited by the embodiment of the present invention.
  • the manager 104 is responsible for resource management and allocation of the entire system. When receiving an application job from the client 102, resources can be allocated for the application job according to the load condition of the entire system.
  • the node 106 is used for big data calculation, and includes resources such as memory, CPU, disk, and network required for performing big data calculation.
  • each node 106 can include a node manager and at least one resource container.
  • the node manager is used to manage the resources of the node 106 and the tasks running on the node 106, and periodically report the resource usage on the node and the running status of each resource container to the manager 104.
  • the resource container is a resource abstraction in the node 106, which can encapsulate multiple types of resources on a node, such as memory, a central processing unit (CPU), a disk, a network, and the like.
  • the resource container may also encapsulate only a part of resources on a certain node, for example, only the memory and the CPU are encapsulated, which is not limited in this embodiment of the present invention.
  • the resource container can run any type of task. For example, a MapReduce application can request a resource container to initiate a map or reduce task, while a Giraph application can request a resource container to run a Giraph task. Users can also implement a custom application type by running a specific task through a resource container to implement a completely new application framework.
  • Resource containers may have multiple specifications, which is not limited by the embodiment of the present invention.
  • The client 102, the manager 104, and the plurality of nodes 106 can communicate through a network, where the network can be the Internet, an intranet, a local area network (LAN), a wireless local area network (WLAN), a storage area network (SAN), etc., or a combination of the above.
  • FIG. 1 merely shows exemplary participants of the system 100 and their interrelationships; the depicted system 100 is therefore greatly simplified. The embodiments of the present invention are described only in general terms, and the implementation is not limited in any way.
  • the client 102, the manager 104, and the node 106 in FIG. 1 may be of any architecture, which is not limited by the embodiment of the present invention.
  • the manager 104 and node 106 shown in FIG. 1 can be implemented by the computing device 200 shown in FIG. 2.
  • FIG. 2 is a simplified logical block diagram of the computing device 200.
  • computing device 200 includes a processor 202, a memory unit 204, an input/output interface 206, a communication interface 208, a bus 210, and a storage device 212.
  • the processor 202, the memory unit 204, the input/output interface 206, the communication interface 208, and the storage device 212 implement communication connections with each other through the bus 210.
  • the processor 202 is a control center of the computing device 200 for executing related programs to implement the technical solutions provided by the embodiments of the present invention.
  • The processor 202 includes one or more CPUs, such as the central processing unit 1 and the central processing unit 2 shown in FIG. 2.
  • the computing device 200 can also include multiple processors 202, each of which can be a single core processor (including one CPU) or a multi-core processor (including multiple CPUs).
  • A component for performing a specific function, for example the processor 202 or the memory unit 204, may be implemented by configuring a general-purpose component to perform the corresponding function, or by a dedicated component that specifically performs that function; this application does not limit this.
  • The processor 202 can be a general-purpose central processing unit, a microprocessor, an application specific integrated circuit (ASIC), or one or more integrated circuits for executing related programs to implement the technical solutions provided by the present application.
  • Processor 202 can be coupled to one or more storage schemes via bus 210.
  • the storage scheme can include a memory unit 204 and a storage device 212.
  • the storage device 212 can be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
  • Memory unit 204 can be a random access memory.
  • The memory unit 204 may be integrated within the processor 202, or may be one or more memory units independent of the processor 202.
  • Program code to be executed by the processor 202, or by a CPU inside the processor 202, is stored in a memory of the computing device 200: it may be stored in the storage device 212 or in the memory unit 204.
  • Program code (e.g., an operating system, an application, a big data computing module, or a communication module) stored in the storage device 212 is copied to the memory unit 204 for execution by the processor 202.
  • The storage device 212 can be a physical hard disk or a partition thereof (including a small computer system interface storage or a global network block device volume), a network storage protocol (including a network or cluster file system such as NFS), a file-based virtual storage device (virtual disk mirroring), or a logical-volume-based storage device.
  • It may include high-speed random access memory, as well as non-volatile memory such as one or more disk memories, flash memories, or other non-volatile memories.
  • The storage device 212 may further include a remote memory separate from the one or more processors 202, such as a network disk accessed through the communication interface 208 over a communication network, where the communication network may be the Internet, an intranet, a local area network (LAN), a wide area network (WAN), a storage area network (SAN), etc., or a combination of the above.
  • The operating system includes various software components and/or drivers for controlling and managing general system tasks (such as memory management, storage device control, and power management) and for facilitating communication between hardware and software components.
  • the input/output interface 206 is for receiving input data and information, and outputting data such as operation results.
  • The communication interface 208 enables communication between the computing device 200 and other devices or communication networks using a transceiver apparatus such as, but not limited to, a transceiver.
  • Bus 210 may include a path for communicating information between various components of computing device 200, such as processor 202, memory unit 204, input/output interface 206, communication interface 208, and storage device 212.
  • the bus 210 can use a wired connection or a wireless communication mode, which is not limited in this application.
  • Although FIG. 2 only shows the processor 202, the memory unit 204, the input/output interface 206, the communication interface 208, the bus 210, and the storage device 212, those skilled in the art will appreciate that in a specific implementation the computing device 200 also includes other devices necessary for proper operation.
  • The computing device 200 can be a general-purpose computer or a special-purpose computing device, including but not limited to a portable computer, a personal desktop computer, a network server, a tablet computer, a mobile phone, a personal digital assistant (PDA), or a combination of two or more of the above; the present application does not limit the specific implementation of the computing device 200.
  • The computing device 200 of FIG. 2 is merely an example; a computing device may include more or fewer components than those shown in FIG. 2, or have a different component configuration. Those skilled in the art will appreciate that, depending on particular needs, the computing device 200 may include additional functional hardware devices, or may include only the components necessary to implement the embodiments of the present invention rather than all of the devices shown in FIG. 2. The various components shown in FIG. 2 can be implemented in hardware, software, or a combination of hardware and software.
  • the hardware structure shown in FIG. 2 and the above description are applicable to various computing devices provided by the embodiments of the present invention, and are applicable to performing various big data calculation methods provided by the embodiments of the present invention.
  • the memory unit 204 of the computing device 200 includes a big data computing module, and the processor 202 executes the big data computing module program code to implement big data computing.
  • the big data computing module can be comprised of one or more operational instructions to cause computing device 200 to perform one or more method steps in accordance with the above description. The specific method steps are described in detail in the following sections of this application.
  • FIG. 3 is a schematic diagram of a big data incremental computing architecture 300.
  • The big data computing model 304 supports two computing modes, a complete calculation and an incremental calculation:
  • Complete calculation: the calculation is performed on the complete input data to obtain the complete output result.
  • Incremental calculation: only the incremental data 308 is calculated to obtain an incremental output result, and the incremental output result is combined with the original output result to obtain the final output result 306, where the incremental data 308 refers to the change of the new input data relative to the original input data.
  • FIG. 4 is an exemplary flowchart of a big data incremental calculation method 400, which is used for big data incremental calculation, where the big data calculation includes at least two calculation steps. As shown in FIG. 4, the method 400 includes:
  • S402: Calculate the incremental output result of the big data calculation according to the incremental data, the incremental delivery rule of each calculation step, and the necessary data that needs to be saved for each calculation step.
  • the necessary data includes at least one of a complete input and a complete output
  • The incremental delivery rule describes, at data granularity, the calculation rule by which each calculation step calculates its incremental output from its incremental input and the necessary data saved for that step.
  • the necessary data to be saved for each calculation step is saved according to the incremental delivery rule of each calculation step when performing the complete calculation or the incremental calculation.
  • The necessary data to be saved for each calculation step includes at least one of the complete input of the step and the complete output of the step. The complete input of each calculation step refers to the complete input of that step in the calculations before the current incremental calculation; the complete output of each calculation step refers to the complete output of that step in the calculations before this incremental calculation. It should be understood that the complete input and the complete output can be updated as the incremental calculation proceeds: when the necessary data saved for a calculation step has an increment, the saved necessary data of that step is updated according to the increment.
  • If the saved complete input has an increment, the complete input is updated based on that increment; if the saved complete output has an increment, the complete output is updated based on the incremental output.
  • The complete input in step S402 refers to the complete input as updated after the last complete or incremental calculation of the step, and the complete output refers to the complete output as updated after the last complete or incremental calculation of the step.
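  • A minimal sketch, under our own naming, of how the saved necessary data might be brought up to date after each incremental calculation, so that the complete input and complete output always reflect the last complete or incremental calculation:

```python
def update_necessary_data(state, delta_input=None, delta_output=None):
    """Merge increments into a step's saved necessary data.

    state: dict holding the saved complete input/output as key/value maps.
    The field names here are illustrative, not the patent's.
    """
    if delta_input:
        state["complete_input"].update(delta_input)
    if delta_output:
        state["complete_output"].update(delta_output)
    return state

state = {"complete_input": {"r1": 10}, "complete_output": {"sum": 10}}
update_necessary_data(state, delta_input={"r2": 5}, delta_output={"sum": 15})
# state now reflects the last incremental calculation:
# complete_input == {"r1": 10, "r2": 5}, complete_output == {"sum": 15}
```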
  • In an implementation, the big data calculation includes at least two calculation phases, each of which includes at least one calculation step. Calculating the incremental output result of the big data calculation according to the incremental data, the incremental delivery rule of each calculation step, and the necessary data to be saved for each calculation step then includes: performing the incremental calculation phase by phase, in phase order, according to the incremental delivery rules of the calculation steps included in each phase and the necessary data saved for those steps, where the incremental input of the first calculation phase is the incremental data, the incremental input of the (i+1)-th calculation phase is the incremental output of the i-th calculation phase, the incremental output of the last calculation phase is the incremental output result of the big data calculation, and i is a positive integer greater than 0.
  • The calculation steps of the big data calculation are divided into multiple calculation phases according to their execution order. Considering that some calculation steps may run in parallel, each calculation phase may include one or more calculation steps.
  • The incremental data 308 is used as the incremental input of the first calculation phase, and the calculation steps in the first calculation phase compute its incremental output based on the incremental data 308 and intermediate data (or without intermediate data; for generality, FIG. 5 shows intermediate data in each calculation phase, but it should be understood that not every calculation phase requires it).
  • The intermediate data contains the necessary data to be saved for each calculation step. The incremental output of the first calculation phase is passed to the second calculation phase as its incremental input; the calculation steps of the second calculation phase compute the incremental output of the second calculation phase from that incremental input and their intermediate data (if any), and pass it to the third calculation phase, and so on, until the calculation steps of the N-th calculation phase compute the incremental output result 502 from their incremental input and intermediate data (if any). The incremental output result 502 is the incremental output of the entire big data calculation.
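  • The phase-by-phase incremental transfer described above can be sketched as follows. The phase functions and data are toy stand-ins (our assumptions), but the chaining mirrors the description: phase i+1 consumes phase i's incremental output.

```python
def run_incremental(phases, incremental_data):
    """phases: list of (step, intermediate) pairs, in phase order, where
    step is a callable (delta_in, intermediate) -> delta_out.
    Each phase may carry its own intermediate (necessary) data, or None."""
    delta = incremental_data  # incremental input of the first phase
    for step, intermediate in phases:
        delta = step(delta, intermediate)  # output feeds the next phase
    return delta  # incremental output result of the whole calculation

# Two toy phases: tokenize new records, then update saved counts.
tokenize = lambda recs, _: [w for r in recs for w in r.split()]

def count(words, saved):
    for w in words:
        saved[w] = saved.get(w, 0) + 1
    return {w: saved[w] for w in set(words)}

saved_counts = {"data": 1}
result = run_incremental([(tokenize, None), (count, saved_counts)],
                         ["big data"])
# result == {"big": 1, "data": 2}
```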
  • "First" herein does not denote an order: the "first calculation step" may be any one of the calculation steps of the big data calculation.
  • The time of the complete calculation is recorded during the first complete calculation, and the time of the incremental transfer calculation is recorded during each incremental calculation. Assuming that incremental calculations are performed at the same time interval and that incremental data updates are uniform, the next incremental calculation takes roughly as long as the last one. Thus, the costs of the incremental calculation and the complete calculation can be estimated from the recorded times, and whether to perform the complete calculation at a certain calculation step can be chosen accordingly.
  • the entire calculation can be expressed as a graph to find the critical path of the graph.
  • the time of each operation on the critical path really has a greater impact on the total computation time. It is better to analyze the operation on each critical path from the back to the front. Once a calculation has a complete calculation, all calculations downstream of it are also fully calculated.
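The recorded-time heuristic and the downstream-propagation rule above can be sketched as follows; the function name and the per-step cost dictionaries are illustrative assumptions, not part of the patented method:

```python
def plan_execution(steps, incr_cost, full_cost):
    """Decide per calculation step whether to run incrementally or completely.

    steps: calculation steps in execution order.
    incr_cost / full_cost: times recorded for each step's incremental and
    complete calculation in earlier runs (assumed representative because
    incremental runs happen at a fixed interval with uniform updates).
    """
    # Per-step preference derived from the recorded times.
    prefer_full = {s: incr_cost[s] > full_cost[s] for s in steps}
    # Once one step switches to a complete calculation, every step
    # downstream of it must also be computed completely.
    plan, seen_full = {}, False
    for s in steps:
        seen_full = seen_full or prefer_full[s]
        plan[s] = 'full' if seen_full else 'incremental'
    return plan
```

A back-to-front scan of the critical path would use the same per-step comparison; this sketch only shows how a complete-calculation decision propagates to the downstream steps.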
  • The incremental data, the necessary data required for each calculation step, and the incremental input and incremental output of each calculation step are organized into Bulk Parallel Data Sets (BPDs).
  • A bulk parallel data set consists of key-value data elements <k, v>. It supports read and write operations, and operations on it support parallel execution by multiple computing nodes.
  • "Parallel" means that one BPD can be divided into multiple subsets BPD0, BPD1, ..., BPDn-1, distributed over multiple machine nodes, with the union of all subsets being the full set. "Bulk parallelism" means that the data in a BPD can be read and written as a whole, operations on the data can be performed in parallel, and a global synchronization is performed when the computation completes.
  • In big data calculation, the data may be divided into multiple subsets to enable parallel computing. For ease of description, the embodiments of the present invention describe the calculation flow of only one subset, but the embodiments are not limited in this respect.
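As a rough illustration of the bulk parallel data set described above, the following sketch partitions <k, v> elements into subsets whose union is the full set and applies an operation to all subsets in parallel, with a barrier acting as the global synchronization. The class and method names are assumptions for illustration; a real system would distribute the subsets across machine nodes rather than threads:

```python
from concurrent.futures import ThreadPoolExecutor

class BPD:
    """Sketch of a Bulk Parallel Data Set: <k, v> elements in n subsets."""

    def __init__(self, elements, n_partitions=4):
        # Partition by hash of the key; the union of subsets is the full set.
        self.subsets = [[] for _ in range(n_partitions)]
        for k, v in elements:
            self.subsets[hash(k) % n_partitions].append((k, v))

    def map_parallel(self, fn):
        """Apply fn to each subset in parallel, then synchronize globally."""
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(fn, self.subsets))  # barrier on exit
        out = BPD([], len(self.subsets))
        out.subsets = results
        return out

    def elements(self):
        """Read the whole data set (the union of all subsets)."""
        return [kv for subset in self.subsets for kv in subset]
```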
  • A BPD can define and support multiple operations, for example: Union(A, B) → A∪B, whose result contains all elements of A and B; Join, which matches data elements with the same k; CartesianProduct; GroupBy, which for each k collects all <k, v> into a list list(v); LocalGroupBy, which performs GroupBy only locally within each subset; Sort, which orders elements within each subset and across subsets; Lookup(k, {<k, v>}) → list(v); and PartitionBy.
  • Users can also extend the computation with user-defined functions. For A = UDFCompute(B), the user implements UDFCompute({<ik, iv>}) → {<ok, ov>}, which for each <ik, iv> produces zero, one, or more <ok, ov>; iv can be changed during the calculation. The system runs as follows: parallel for all <ik, iv> ∈ B do {<ok, ov>} = UDFCompute(<ik, iv>); A = A∪{<ok, ov>}.
  • For A = UDFComputeMulti(B, C, ...), UDFComputeMulti is called for the Join-matched data elements of B, C, ...; it is otherwise similar to UDFCompute.
  • For A = UDFListCombine(B), the user implements UDFListCombine({<ik, list(iv)>}) → {<ik, ov>}, where UDFListCombine is commutative and associative. The system runs as follows: parallel for all <ik, list(iv)> ∈ B do {<ik, ov>} = UDFListCombine(<ik, list(iv)>); A = A∪{<ik, ov>}.
  • For a BPD = {<k, v>}, an incremental data update can be expressed as ΔBPD = {<+/-, k, v>}, where <+, k, v> represents an inserted data element and <-, k, v> represents a deleted data element. Changes to the original input data can be decomposed into inserts and deletes. Corresponding to the subsets BPD0, BPD1, ..., BPDn-1, we likewise obtain ΔBPD0, ΔBPD1, ..., ΔBPDn-1.
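Decomposing a change to the original input into signed insert/delete elements, as described above, can be sketched as follows; the function name is illustrative, and an update is represented as a delete plus an insert:

```python
from collections import Counter

def diff_bpd(old, new):
    """Express the change from `old` to `new` as signed elements:
    ('+', k, v) for an inserted element, ('-', k, v) for a deleted one.
    An updated element is decomposed into a delete plus an insert."""
    old_c, new_c = Counter(old), Counter(new)
    delta = []
    for kv, n in (new_c - old_c).items():   # elements only in `new`: inserts
        delta += [('+', *kv)] * n
    for kv, n in (old_c - new_c).items():   # elements only in `old`: deletes
        delta += [('-', *kv)] * n
    return delta
```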
  • The goal of this scheme is to use fine-grained incremental calculation, performed in units of <k, v>, so as to minimize unnecessary recomputation. In the BPD model, a big data calculation consists of multiple basic BPD calculation steps; each calculation step has one or more input BPDs and, through its calculation, produces an output BPD.
  • When forming deltas, an original data element <k, v> can be regarded as <+, k, v>.
  • For GroupBy and LocalGroupBy, the complete output must be accessed, so the complete output of the step is saved.
  • For Sort, merging or insertion sorting can be applied to obtain a complete sorted result, and the complete output of the step is saved.
  • For UDFComputeMulti, the Join over the input BPDs is computed first, and each Join result is passed as an input parameter to the user-defined UDFComputeMulti function; the calculation is otherwise similar to UDFCompute, incremental computation is easy to implement, and the complete input of the step is saved.
  • Because UDFListCombine is commutative and associative, newly inserted data elements are easily added into the result, which requires access to the complete output. Data elements to be deleted, however, require recomputation, so both the complete input and the complete output of the step must be saved.
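The incremental rule for UDFListCombine can be sketched as below for a generic commutative, associative combine function; the helper name and the dictionary layout of the saved complete input/output are assumptions for illustration:

```python
def incremental_list_combine(combine, full_input, full_output, delta):
    """Incremental rule for a commutative, associative list-combine.

    combine: fn(list_of_values) -> combined value (e.g. sum).
    full_input:  saved complete input, dict k -> list(v).
    full_output: saved complete output, dict k -> combined value.
    delta: signed elements ('+'|'-', k, v).
    Inserts are folded directly into the saved complete output; a delete
    forces recomputation of that key from the updated complete input.
    """
    dirty = set()
    for sign, k, v in delta:
        if sign == '+':
            full_input.setdefault(k, []).append(v)
            if k not in dirty:
                prev = [full_output[k]] if k in full_output else []
                full_output[k] = combine(prev + [v])  # fold-in is valid
        else:  # deletion: the fold-in is not reversible in general
            full_input[k].remove(v)
            dirty.add(k)
    for k in dirty:  # recompute deleted keys from the complete input
        full_output[k] = combine(full_input[k])
    return full_output
```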
  • The embodiments of the present invention only exemplify calculation steps of big data incremental calculation and their incremental transfer rules; there may be other calculation steps and incremental transfer rules, and the embodiments are not limited in this respect.
  • A calculation step rated "good" has an incremental transfer rule that needs to save neither the complete input nor the complete output; a step rated "fair" has a rule that needs to save the necessary data; a step rated "poor" has a rule that needs to save both the complete input and the complete output.
  • A series of optimizations can be applied to the operations rated fair or poor. For example, the Join operation can be optimized (this applies to Join, CartesianProduct, and UDFComputeMulti). For Join(A, B), either of two schemes may be used: for each <+/-, k, v> ∈ ΔA, call Lookup(k, B) to find the matching <k, v>; or send ΔA to the machine node holding each partition of B and perform a local Join.
  • For operations such as Join, UDFComputeMulti, GroupBy, LocalGroupBy, Sort, and UDFListCombine, support for incremental queries can be provided, i.e., the existing Lookup operation is used to allow selective access to the stored necessary-data BPDs. A concrete implementation can create an index on the BPD, or use a distributed key-value store to provide the index function.
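A minimal sketch of the Lookup-style selective access described above, using an in-memory inverted index; a production system could instead build the index inside the BPD or back it with a distributed key-value store, and all names here are illustrative:

```python
from collections import defaultdict

class IndexedBPD:
    """Inverted index over a stored BPD so that incremental rules can
    selectively fetch only the matching elements instead of scanning
    the complete saved data."""

    def __init__(self, elements):
        self.index = defaultdict(list)   # k -> list(v)
        for k, v in elements:
            self.index[k].append(v)

    def lookup(self, k):
        """Lookup(k, BPD) -> list(v): selective access by key."""
        return self.index.get(k, [])

    def insert(self, k, v):
        """Keep the index current as increments arrive."""
        self.index[k].append(v)
```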
  • The big data calculation is divided into at least two calculation steps, and the incremental input, necessary data, and incremental output of each step are organized in the form of data sets. Through each step's data-granularity incremental transfer rule, the incremental calculation is performed in a fine-grained manner, thereby improving the efficiency of big data incremental calculation.
  • Further, the embodiments of the present invention set out the basic theory of big data incremental calculation and provide a general fine-grained incremental computing method applicable to a variety of big data computing systems.
  • FIG. 6 is a schematic structural diagram of a big data computing device 600 according to an embodiment of the present invention. As shown in FIG. 6, the device 600 includes a computing unit 602, a determining unit 604, and an updating unit 606.
  • The calculation unit 602 is configured to calculate an incremental output result of the big data calculation according to the incremental data, the incremental transfer rule of each calculation step, and the necessary data to be saved for each calculation step.
  • The necessary data includes at least one of a complete input and a complete output. The incremental transfer rule describes, at the granularity of data, the rule by which each calculation step computes its incremental output from its incremental input and the necessary data it saves; the necessary data to be saved for each calculation step is saved according to that step's incremental transfer rule when a complete or incremental calculation is performed.
  • The computing unit 602 can be implemented by the processor 202 and the memory unit shown in FIG. 2. More specifically, the processor 202 can execute the big data calculation module in the memory unit 204 to calculate the incremental output result of the big data calculation according to the incremental data, the incremental transfer rule of each calculation step, and the necessary data to be saved for each calculation step.
  • the determining unit 604 is configured to determine a final calculation result according to the incremental output result and the original output result of the big data calculation.
  • the determining unit 604 can be implemented by the processor 202 and the memory unit shown in FIG. 2. More specifically, the big data calculation module in the memory unit 204 can be executed by the processor 202 to determine the final calculation result based on the incremental output result and the original output result of the big data calculation.
  • The updating unit 606 is configured to, when the necessary data to be saved for a calculation step has an increment, update that necessary data according to the increment.
  • The updating unit 606 can be implemented by the processor 202 and the memory unit shown in FIG. 2. More specifically, the processor 202 can execute the big data calculation module in the memory unit 204 to update, when the necessary data to be saved for a calculation step has an increment, that necessary data according to the increment.
  • The big data calculation includes at least two calculation phases, each containing at least one calculation step. That the calculation unit 602 calculates the incremental output result from the incremental data, the incremental transfer rules, and the saved necessary data includes: the calculation unit 602 performs incremental calculation phase by phase, in order, according to the incremental transfer rules of the calculation steps contained in each phase and the necessary data saved by those steps, where the incremental input of the first phase is the incremental data, the incremental input of the (i+1)-th phase is the incremental output of the i-th phase, the incremental output of the last phase is the incremental output result of the big data calculation, and i is a positive integer greater than 0.
  • The determining unit 604 is further configured to determine, before a first calculation step among the calculation steps is executed, the incremental calculation cost of the first calculation step; if this cost is greater than the complete calculation cost, the incremental calculation in the first calculation step and in the calculation steps after it is switched to complete calculation.
  • The incremental data, the necessary data to be saved for each calculation step, and the incremental input and incremental output of each calculation step are organized into bulk parallel data sets; a bulk parallel data set consists of key/value pairs, supports read and write operations, and operations on it support parallel operation by multiple computing nodes.
  • This embodiment is the apparatus embodiment corresponding to the embodiment of FIG. 4; the feature descriptions of the FIG. 4 embodiment apply to this embodiment and are not repeated here.
  • The big data calculation may be performed by multiple devices 600. In a distributed computing system, the computing tasks may be allocated to multiple devices 600 for processing, for example by assigning one device 600 to each calculation step or to a group of calculation steps, with the big data calculation completed through the cooperation of the devices 600. The embodiments of the present invention only describe the functions of the device 600 and place no limitation on the number of devices 600 performing the big data calculation or on how such devices are organized.
  • The big data calculation is divided into at least two calculation steps, and the incremental input, necessary data, and incremental output of each step are organized in the form of data sets. Through each step's data-granularity incremental transfer rule, the incremental calculation is performed in a fine-grained manner, thereby improving the efficiency of big data incremental calculation.
  • Further, the embodiments of the present invention set out the basic theory of big data incremental calculation and provide a general fine-grained incremental computing method applicable to a variety of big data computing systems.
  • The disclosed systems, devices, and methods may be implemented in other manners. The device embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or modules, and may be electrical, mechanical, or in other forms.
  • The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules; they may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • In addition, the functional modules in the embodiments of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
  • An integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.


Abstract

A big data computing method and apparatus that implement big data calculation. The method includes: calculating an incremental output result of the big data calculation according to incremental data, an incremental transfer rule of each calculation step, and necessary data to be saved for each calculation step (S402), where the necessary data includes at least one of a complete input and a complete output, and the necessary data to be saved for each calculation step is saved according to that step's incremental transfer rule when a complete calculation or an incremental calculation is performed; and determining a final calculation result according to the incremental output result and the original output result of the big data calculation (S404). The big data calculation is divided into at least two calculation steps, and through each step's incremental transfer rule the incremental calculation is performed in a fine-grained manner, thereby improving the efficiency of big data incremental calculation.

Description

Big Data Incremental Computing Method and Apparatus

Technical Field

The embodiments of the present invention relate to the field of computers, and in particular, to a big data incremental computing method and apparatus.

Background

Big data processing plays an increasingly important role in actual production. Many big data calculations performed in practice are defined in advance, such as statistical reports, recommendation models in recommendation systems, search engine indexes, and web page ranking models. As updated data keeps arriving, the calculation must be repeated in order to obtain timely results that reflect the new data. Suppose the input of the previous calculation is D and the updated data is ΔD; a complete big data calculation must then be performed on D+ΔD to obtain the latest result. Because the data volume |D| is very large, fully repeating the calculation incurs great overhead in time, energy consumption, and cloud platform rental cost.

To address the need to recompute the entire data set whenever new data is added, incremental computation is a good optimization. Because in general the size of the updated data satisfies |ΔD| << |D|, it is desirable that the calculation mainly processes ΔD, using the original complete data set D little or not at all, thereby reducing computation time, energy consumption, and cost.

In the prior art, big data incremental computing solutions mostly target a specific big data computing platform and perform coarse-grained incremental computation at the granularity of a task, and therefore cannot fully exploit the efficiency of incremental computation.

Summary

In view of this, the embodiments of the present invention provide a big data incremental computing method and apparatus, which divide a big data calculation into at least two calculation steps, define for each calculation step an incremental transfer rule at the granularity of data, and perform big data incremental calculation by incremental transfer computation.
According to a first aspect, this application provides a big data computing method, where the big data calculation includes at least two calculation steps. The method includes: calculating an incremental output result of the big data calculation according to incremental data, an incremental transfer rule of each calculation step, and necessary data to be saved for each calculation step, where the necessary data includes at least one of a complete input and a complete output, the incremental transfer rule describes, at the granularity of data, the rule by which each calculation step computes its incremental output from its incremental input and the necessary data to be saved for that step, and the necessary data of each calculation step is saved according to that step's incremental transfer rule when a complete or incremental calculation is performed; and determining a final calculation result according to the incremental output result and the original output result of the big data calculation.

The big data calculation is divided into at least two calculation steps, and the incremental input, necessary data, and incremental output of each calculation step are organized in the form of data sets. Through the data-granularity incremental transfer rule of each calculation step, the incremental calculation is performed in a fine-grained manner, thereby improving the efficiency of big data incremental calculation.

With reference to the first aspect, in a first possible implementation of the first aspect, the method further includes: when the necessary data to be saved for a calculation step has an increment, updating that necessary data according to the increment.

Here, the complete input of a calculation step refers to its complete input in the calculation performed before the current incremental calculation, and the complete output of a calculation step refers to its complete output in that earlier calculation. The complete input and complete output may be updated as incremental calculations proceed; more specifically, the complete input refers to the complete input as updated after the corresponding step's last complete or incremental calculation, and likewise for the complete output.

With reference to the first aspect or any of the foregoing possible implementations, in a second possible implementation of the first aspect, the big data calculation includes at least two calculation phases, each containing at least one calculation step. Calculating the incremental output result of the big data calculation according to the incremental data, the incremental transfer rules, and the saved necessary data includes: performing incremental calculation phase by phase, in order, according to the incremental transfer rules of the calculation steps contained in each phase and the necessary data saved by those steps, where the incremental input of the first phase is the incremental data, the incremental input of the (i+1)-th phase is the incremental output of the i-th phase, the incremental output of the last phase is the incremental output result of the big data calculation, and i is a positive integer greater than 0.

Dividing the calculation steps into multiple phases, where each phase may contain one or more calculation steps and the steps within a phase may run in parallel, enhances the flexibility of big data calculation.

With reference to the first aspect or any of the foregoing possible implementations, in a third possible implementation of the first aspect, before a first calculation step among the calculation steps is executed, the incremental calculation cost of the first calculation step is determined; if the incremental calculation cost of the first calculation step is greater than the complete calculation cost, the incremental calculation in the first calculation step and in the calculation steps after it is switched to complete calculation.

It should be understood that "first" here does not denote an order; the "first calculation step" may be any calculation step of the big data calculation. The entire calculation can be expressed as a graph and the critical path of the graph found. The time of each operation on the critical path has a large impact on the total computation time; analyzing from back to front, one determines whether complete calculation would be better for each operation on the critical path, and if so, switches to complete calculation, thereby ensuring the efficiency of the big data calculation.

With reference to the first aspect or any of the foregoing possible implementations, in a fourth possible implementation of the first aspect, the incremental data, the necessary data to be saved for each calculation step, and the incremental input and incremental output of each calculation step are all organized into bulk parallel data sets, where a bulk parallel data set consists of key-value pairs, supports read and write operations, and operations on it support parallel operation by multiple computing nodes.

Organizing the data into bulk parallel data sets enables fine-grained big data calculation at the granularity of data and improves computational efficiency.
According to a second aspect, this application provides a computer-readable medium including computer-executable instructions that, when executed by a processor of a computer, cause the computer to perform the method of the first aspect or any possible implementation of the first aspect.

According to a third aspect, this application provides a computing device, including a processor, a memory, a bus, and a communication interface, where the memory is configured to store execution instructions, the processor is connected to the memory through the bus, and when the computing device runs, the processor executes the execution instructions stored in the memory, so that the computing device performs the method of the first aspect or any possible implementation of the first aspect.

According to a fourth aspect, this application provides a big data computing apparatus, where the big data calculation includes at least two calculation steps. The apparatus includes: a calculation unit, configured to calculate an incremental output result of the big data calculation according to incremental data, an incremental transfer rule of each calculation step, and necessary data to be saved for each calculation step, where the necessary data includes at least one of a complete input and a complete output, the incremental transfer rule describes, at the granularity of data, the rule by which each calculation step computes its incremental output from its incremental input and the necessary data to be saved for that step, and the necessary data of each calculation step is saved according to that step's incremental transfer rule when a complete or incremental calculation is performed; and a determining unit, configured to determine a final calculation result according to the incremental output result and the original output result of the big data calculation.

The big data calculation is divided into at least two calculation steps, and the incremental input, necessary data, and incremental output of each calculation step are organized in the form of data sets. Through the data-granularity incremental transfer rule of each calculation step, the incremental calculation is performed in a fine-grained manner, thereby improving the efficiency of big data incremental calculation.
With reference to the fourth aspect, in a first possible implementation of the fourth aspect, the apparatus further includes an updating unit, configured to, when the necessary data to be saved for a calculation step has an increment, update that necessary data according to the increment.

Here, the complete input of a calculation step refers to its complete input in the calculation performed before the current incremental calculation, and the complete output of a calculation step refers to its complete output in that earlier calculation. The complete input and complete output may be updated as incremental calculations proceed; more specifically, the complete input refers to the complete input as updated after the corresponding step's last complete or incremental calculation, and likewise for the complete output.

With reference to the fourth aspect or any of the foregoing possible implementations, in a second possible implementation of the fourth aspect, the big data calculation includes at least two calculation phases, each containing at least one calculation step. That the calculation unit calculates the incremental output result of the big data calculation according to the incremental data, the incremental transfer rules, and the saved necessary data includes: the calculation unit performs incremental calculation phase by phase, in order, according to the incremental transfer rules of the calculation steps contained in each phase and the necessary data saved by those steps, where the incremental input of the first phase is the incremental data, the incremental input of the (i+1)-th phase is the incremental output of the i-th phase, the incremental output of the last phase is the incremental output result of the big data calculation, and i is a positive integer greater than 0.

Dividing the calculation steps into multiple phases, where each phase may contain one or more calculation steps and the steps within a phase may run in parallel, enhances the flexibility of big data calculation.

With reference to the fourth aspect or any of the foregoing possible implementations, in a third possible implementation of the fourth aspect, the determining unit is further configured to determine, before a first calculation step among the calculation steps is executed, the incremental calculation cost of the first calculation step; if the incremental calculation cost of the first calculation step is greater than the complete calculation cost, the incremental calculation in the first calculation step and in the calculation steps after it is switched to complete calculation.

It should be understood that "first" here does not denote an order; the "first calculation step" may be any calculation step of the big data calculation. The entire calculation can be expressed as a graph and the critical path of the graph found. The time of each operation on the critical path has a large impact on the total computation time; analyzing from back to front, one determines whether complete calculation would be better for each operation on the critical path, and if so, switches to complete calculation, thereby ensuring the efficiency of the big data calculation.

With reference to the fourth aspect or any of the foregoing possible implementations, in a fourth possible implementation of the fourth aspect, the incremental data, the necessary data to be saved for each calculation step, and the incremental input and incremental output of each calculation step are all organized into bulk parallel data sets, where a bulk parallel data set consists of key-value pairs, supports read and write operations, and operations on it support parallel operation by multiple computing nodes.

Organizing the data into bulk parallel data sets enables fine-grained big data calculation at the granularity of data and improves computational efficiency.

According to the technical solutions provided by the embodiments of the present invention, the big data calculation is divided into at least two calculation steps, and the incremental input, necessary data, and incremental output of each step are organized in the form of data sets; through each step's data-granularity incremental transfer rule, the incremental calculation is performed in a fine-grained manner, which improves the efficiency of big data incremental calculation. Further, the embodiments set out the basic theory of big data incremental calculation and provide a general fine-grained incremental computing method applicable to a variety of big data computing systems.
Brief Description of the Drawings

To describe the technical solutions of the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings used in describing the embodiments. Evidently, the drawings described below are merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

FIG. 1 is an exemplary block diagram of a big data computing system according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the hardware structure of a big data incremental computing apparatus according to an embodiment of the present invention;

FIG. 3(a) and FIG. 3(b) are schematic diagrams of a big data incremental computing architecture according to an embodiment of the present invention;

FIG. 4 is an exemplary flowchart of a big data incremental computing method according to an embodiment of the present invention;

FIG. 5 is an exemplary architecture diagram of a big data incremental computing method according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a big data incremental computing apparatus according to an embodiment of the present invention.

Detailed Description

The following describes the technical solutions of the embodiments of the present invention with reference to the accompanying drawings.
FIG. 1 shows an exemplary block diagram of a big data computing system 100. As shown in FIG. 1, the system 100 includes a client 102, a manager 104, and multiple nodes 106.

The client 102 runs an application, whose type may be MapReduce, Giraph, Storm, Dryad, Pregel, Spark, Tez/Impala, Message Passing Interface (MPI), or the like. The client 102 may run an application of any type, and a user may also implement a custom application type, thereby realizing an entirely new application framework; the embodiments of the present invention are not limited in this respect.

The manager 104 is responsible for resource management and allocation of the entire system; when receiving an application job from the client 102, it can allocate resources to the job according to the load of the entire system.

The nodes 106 are used for big data calculation and contain the memory, CPU, disk, network, and other resources required for it.

More specifically, each node 106 may contain a node manager and at least one resource container. The node manager manages the resources of the node 106 and the tasks running on it, and periodically reports the node's resource usage and the running state of each resource container to the manager 104.

A resource container is a resource abstraction within a node 106; it may encapsulate multiple types of resources on the node, such as memory, central processing unit (CPU), disk, and network. Optionally, a resource container may encapsulate only part of the resources on a node, for example only memory and CPU; the embodiments of the present invention are not limited in this respect. A resource container can run any type of task. For example, a MapReduce application may request a resource container to start a map or reduce task, while a Giraph application may request one to run a Giraph task. A user may also implement a custom application type and run specific tasks through resource containers, thereby realizing an entirely new application framework.

It should be understood that, because different applications may have different requirements on resource containers, i.e., different applications need different kinds and quantities of resources in their containers, resource containers may come in multiple specifications; the embodiments of the present invention are not limited in this respect.

The client 102, the manager 104, and the nodes 106 may communicate over a network, which may be the Internet, an intranet, local area networks (LANs), wireless local area networks (WLANs), storage area networks (SANs), or the like, or a combination of the above.

It should be understood that the purpose of FIG. 1 is merely to introduce, by way of example, the participants of the system 100 and their mutual relationships. The depicted system 100 is therefore greatly simplified; the embodiments of the present invention describe it only in general terms and place no limitation on its implementation. The client 102, the manager 104, and the nodes 106 in FIG. 1 may be of any architecture, which the embodiments of the present invention do not limit.
The manager 104 and the nodes 106 shown in FIG. 1 may be implemented by the computing device 200 shown in FIG. 2. FIG. 2 is a simplified schematic diagram of the logical structure of the computing device 200. As shown in FIG. 2, the computing device 200 includes a processor 202, a memory unit 204, an input/output interface 206, a communication interface 208, a bus 210, and a storage device 212. The processor 202, the memory unit 204, the input/output interface 206, the communication interface 208, and the storage device 212 communicate with one another through the bus 210.

The processor 202 is the control center of the computing device 200 and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present invention. Optionally, the processor 202 contains one or more CPUs, for example central processor unit 1 and central processor unit 2 shown in FIG. 2. Optionally, the computing device 200 may contain multiple processors 202, each of which may be a single-core processor (containing one CPU) or a multi-core processor (containing multiple CPUs). Unless otherwise stated, in the present invention a component for performing a specific function, such as the processor 202 or the memory unit 204, may be implemented by configuring a general-purpose component to perform the corresponding function, or by a dedicated component that specifically performs that function; this application is not limited in this respect. The processor 202 may be a general-purpose central processing unit, a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for executing related programs to implement the technical solutions provided by this application.

The processor 202 may be connected to one or more storage schemes through the bus 210. A storage scheme may include the memory unit 204 and the storage device 212. The storage device 212 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory unit 204 may be a random access memory. The memory unit 204 may be integrated with the processor 202 or inside the processor 202, or may be one or more storage units independent of the processor 202.

Program code to be executed by the processor 202, or by a CPU inside the processor 202, is stored in a memory of the computing device 200, specifically in the storage device 212 or the memory unit 204. Optionally, program code stored in the storage device 212 (for example an operating system, an application program, a big data calculation module, or a communication module) is copied into the memory unit 204 for execution by the processor 202.

The storage device 212 may be a physical hard disk or a partition thereof (including a small computer system interface storage or a global network block device volume), a network storage protocol (including a network or cluster file system such as the network file system NFS), a file-based virtual storage device (a virtual disk image), or a storage device based on logical volumes. It may include high-speed random access memory and may also include non-volatile memory, for example one or more magnetic disk memories, flash memories, or other non-volatile memories. In some embodiments, the storage device 212 may further include remote memory separate from the one or more processors 202, for example a network disk accessed through the communication interface 208 over a communication network, which may be the Internet, an intranet, LANs, WLANs, SANs, or the like, or a combination of the above.

The operating system (for example Darwin, RTXC, LINUX, UNIX, OS X, WINDOWS, or an embedded operating system such as VxWorks) includes various software components and/or drivers for controlling and managing routine system tasks (for example memory management, storage device control, and power management) and for facilitating communication among the various software and hardware components.

The input/output interface 206 is configured to receive input data and information and to output data such as operation results.

The communication interface 208 uses a transceiver apparatus, for example but not limited to a transceiver, to implement communication between the computing device 200 and other devices or communication networks.

The bus 210 may include a path for transferring information among the components of the computing device 200 (for example the processor 202, the memory unit 204, the input/output interface 206, the communication interface 208, and the storage device 212). Optionally, the bus 210 may use a wired connection or wireless communication; this application is not limited in this respect.

It should be noted that although FIG. 2 shows only the processor 202, the memory unit 204, the input/output interface 206, the communication interface 208, the bus 210, and the storage device 212, in a specific implementation a person skilled in the art should understand that the computing device 200 also contains other devices necessary for normal operation.

The computing device 200 may be a general-purpose computer or a special-purpose computing device, including but not limited to a portable computer, a personal desktop computer, a network server, a tablet computer, a mobile phone, a personal digital assistant (PDA), or any other electronic device, or a combination of two or more of the above; this application places no limitation on the specific implementation form of the computing device 200.

Moreover, the computing device 200 of FIG. 2 is merely one example; the computing device 200 may contain more or fewer components than shown in FIG. 2, or have a different configuration of components. According to specific needs, a person skilled in the art should understand that the computing device 200 may also contain hardware devices implementing other additional functions, or may contain only the devices necessary to implement the embodiments of the present invention rather than all the devices shown in FIG. 2. The various components shown in FIG. 2 may be implemented in hardware, software, or a combination of hardware and software.

The hardware structure shown in FIG. 2 and the above description apply to the various computing devices provided by the embodiments of the present invention and to performing the various big data computing methods provided by the embodiments.

As shown in FIG. 2, the memory unit 204 of the computing device 200 contains a big data calculation module, and the processor 202 executes the program code of this module to implement big data calculation.

The big data calculation module may consist of one or more operation instructions, so that the computing device 200 performs one or more method steps according to the above description. The specific method steps are described in detail in the following parts of this application.
FIG. 3 is a schematic diagram 300 of a big data incremental computing architecture. The big data computing model 304 supports two computing modes, complete calculation and incremental calculation:

1) Complete calculation: the original input data is replaced with new input data 302, and a new output result 306 is recomputed from the new input data 302, where the new input data 302 refers to the full set comprising the original input data and the newly added incremental input data;

2) Incremental calculation: only the incremental data 308 is calculated to obtain an incremental output result, which is combined with the original output result to obtain the final output result 306, where the incremental data 308 refers to the newly added incremental input data, i.e., the change of the new input data 302 relative to the original input data.
FIG. 4 is an exemplary flowchart of a big data incremental computing method 400 according to an embodiment of the present invention, used for big data incremental calculation when there is incremental data; the big data calculation includes at least two calculation steps. As shown in FIG. 4, the method 400 includes:

S402: Calculate an incremental output result of the big data calculation according to incremental data, an incremental transfer rule of each calculation step, and necessary data to be saved for each calculation step.

The necessary data includes at least one of a complete input and a complete output. The incremental transfer rule describes, at the granularity of data, the rule by which each calculation step computes its incremental output from its incremental input and the necessary data it saves; the necessary data of each calculation step is saved according to that step's incremental transfer rule when a complete or incremental calculation is performed.

The necessary data to be saved for a calculation step includes at least one of the step's complete input and complete output. The complete input of a calculation step refers to its complete input in the calculation performed before the current incremental calculation, and likewise for the complete output. It should be understood that the complete input and complete output may be updated as incremental calculations proceed: when the necessary data saved for a step has an increment, that data is updated according to the increment. For example, after a round of incremental calculation, if the saved complete input has an increment, the complete input is updated according to it; if the saved complete output has an increment, the complete output is updated according to it. More specifically, the complete input in step S402 refers to the complete input as updated after the corresponding step's last complete or incremental calculation, and the complete output refers to the complete output as updated after that step's last complete or incremental calculation.

S404: Determine a final calculation result according to the incremental output result and the original output result of the big data calculation.

In a specific implementation, the big data calculation includes at least two calculation phases, each containing at least one calculation step. Calculating the incremental output result according to the incremental data, the incremental transfer rules, and the saved necessary data includes: performing incremental calculation phase by phase, in order, according to the incremental transfer rules of the calculation steps contained in each phase and the necessary data saved by those steps, where the incremental input of the first phase is the incremental data, the incremental input of the (i+1)-th phase is the incremental output of the i-th phase, the incremental output of the last phase is the incremental output result of the big data calculation, and i is a positive integer greater than 0.

As shown in FIG. 5, the calculation steps of the big data calculation are divided into multiple calculation phases by execution order; considering that some steps may run in parallel, each phase may contain one or more calculation steps. As shown in FIG. 5, the incremental data 308 serves as the incremental input of the first phase; the steps of the first phase compute an incremental output from the incremental data 308 and intermediate data (or without intermediate data; for generality, FIG. 5 shows intermediate data in every phase, but it should be understood that FIG. 5 is merely an example and not every phase, or every step within a phase, requires intermediate data), where the intermediate data contains the necessary data saved by each step. The incremental output of the first phase is passed to the second phase as its incremental input; the steps of the second phase compute the second phase's incremental output from its incremental input and intermediate data (or without intermediate data) and pass it to the third phase, and so on, until the steps of the N-th phase compute the incremental output result 502 from their incremental input and intermediate data (or without intermediate data). The incremental output result 502 is the incremental output result of the entire big data calculation.

Optionally, before a first calculation step among the calculation steps is executed, the incremental calculation cost of the first calculation step is determined; if it is greater than the complete calculation cost, the incremental calculation in the first calculation step and in the steps after it is switched to complete calculation. It should be understood that "first" here does not denote an order; the "first calculation step" may be any calculation step of the big data calculation.

In a specific implementation, for each calculation step, the time of the complete calculation is recorded on the first run, and the time of the incremental transfer calculation is recorded during incremental calculation. Assuming incremental calculations are performed at the same time interval and incremental data updates are uniform, the next incremental calculation takes roughly as long as the last one. The costs of incremental and complete calculation can therefore be estimated from the recorded times, and it can be decided whether to start performing the complete calculation at a given step.

More specifically, the entire calculation can be expressed as a graph and the critical path of the graph found. The time of each operation on the critical path has a large impact on the total computation time, so the operations on the critical path are analyzed from back to front to determine whether complete calculation is better. Once one calculation uses complete calculation, all calculations downstream of it also use complete calculation.
In the embodiments of the present invention, the incremental data, the necessary data saved by each calculation step, and the incremental input and incremental output of each calculation step are all organized into Bulk Parallel Data Sets (BPDs). A bulk parallel data set consists of key-value data elements <k,v>, supports read and write operations, and operations on it support parallel operation by multiple computing nodes.

Here, "parallel" means that one BPD can be divided into multiple subsets BPD0, BPD1, ..., BPDn-1 distributed over multiple machine nodes, with the union of all subsets being the full set. "Bulk parallelism" means that the data in a BPD can be read and written as a whole, operations on the data can be performed in parallel, and a global synchronization is performed when the computation completes.

It should be understood that in big data calculation the data may be divided into multiple subsets to enable parallel computing; for ease of description, the embodiments of the present invention describe the calculation flow of only one subset, but the embodiments are not limited in this respect.

A BPD can define and support multiple operations. Examples of the operations supported by a BPD follow:
(1) Union(A,B)→A∪B

The result contains all elements of A and B.

(2) Join({<k,av>},{<k,bv>})→{<k,av|bv>}

Matches data elements with the same k.

(3) CartesianProduct({<ak,av>},{<bk,bv>})→{<ak|bk,av|bv>}

Computes the Cartesian product.

(4) GroupBy({<k,v>})→{<k,list(v)>}

For each k, collects all <k,v> into a list of values list(v).

(5) LocalGroupBy({<k,v>})→{<k,list(v)>}

Performs GroupBy only locally within each subset.

(6) Sort({<k,v>})→{<k,v>}

Sorts so that each subset is internally ordered and the subsets are ordered relative to one another.

(7) Lookup(k,{<k,v>})→list(v)

Finds the list list(v) of values with the same k.

(8) PartitionBy(p, BPD)→{BPD0, BPD1, ..., BPDn-1}

where BPDi = {<k,v> | p(v)=i}.
Examples of user-extensible operations are as follows:

(1) A=UDFCompute(B)

The user implements a function UDFCompute({<ik,iv>})→{<ok,ov>}. For each <ik,iv>, the computation produces zero, one, or more <ok,ov>. The value iv may be changed during the calculation. The system runs as follows:

Parallel for all <ik,iv>∈B do

{<ok,ov>}=UDFCompute(<ik,iv>);

A=A∪{<ok,ov>}.

(2) A=UDFComputeMulti(B,C,…)

UDFComputeMulti is called for the Join-matched data elements of B, C, …; it is otherwise similar to UDFCompute. When every Join match is unique, the input data sets may be changed.

(3) A=UDFListCombine(B)

The user implements a function:

UDFListCombine({<ik,list(iv)>})→{<ik,ov>},

where UDFListCombine is commutative and associative. The system runs as follows:

Parallel for all <ik,list(iv)>∈B do

{<ik,ov>}=UDFListCombine(<ik,list(iv)>);

A=A∪{<ik,ov>}.
It should be understood that the above operation rules are merely examples; the embodiments of the present invention allow other operation rules and are not limited in this respect.

For a BPD={<k,v>}, an incremental data update can be expressed as ΔBPD={<+/-,k,v>}, where <+,k,v> represents an inserted data element and <-,k,v> represents a deleted data element. Changes to the original input data can be decomposed into inserts and deletes. Corresponding to the subsets BPD0, BPD1, ..., BPDn-1, we likewise obtain ΔBPD0, ΔBPD1, ..., ΔBPDn-1.

The goal of this scheme is to use fine-grained incremental calculation, performed in units of <k,v>, so as to minimize unnecessary recomputation. In the BPD model, a big data calculation consists of multiple basic BPD calculation steps; each step has one or more input BPDs and, through its calculation, produces an output BPD.

For each BPD calculation step of the big data analysis calculation, a corresponding incremental transfer calculation step is defined. In this way, given the incremental input of the entire calculation, the incremental result can be computed by following the original calculation steps and applying the corresponding incremental transfer calculation steps.
Examples of the incremental transfer rules of the calculation steps follow:

(1) Union(A,B)→A∪B

For R=Union(A,B), the incremental transfer calculation is ΔR=Union(ΔA,ΔB). The existing Union implementation can be used; neither the complete input nor the complete output needs to be accessed, and neither is saved for this step.

(2) Join({<k,av>},{<k,bv>})→{<k,av|bv>}

For R=Join(A,B), the incremental transfer calculation is ΔR=Join(ΔA,B)∪Join(A,ΔB)∪Join(ΔA,ΔB), where an original data element <k,v> is regarded as <+,k,v> and:

<+,k,av> joins <+,k,bv> → <+,k,av|bv>;

<+,k,av> joins <-,k,bv> → <-,k,av|bv>;

<-,k,av> joins <+,k,bv> → <-,k,av|bv>;

<-,k,av> joins <-,k,bv> → <+,k,av|bv>.

The complete inputs A and B must be accessed, so the complete input of this step is saved.
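The incremental transfer rule for Join, ΔR = Join(ΔA,B) ∪ Join(A,ΔB) ∪ Join(ΔA,ΔB), together with the four sign rules, can be sketched as follows; the function and variable names are illustrative assumptions:

```python
def delta_join(dA, A, dB, B):
    """Compute ΔR for R = Join(A, B) from the deltas dA, dB.

    Delta elements are ('+'|'-', k, v); original elements of A and B are
    regarded as '+'. Matching two inserts (or two deletes) yields an
    insert; mixed signs yield a delete, per the four sign rules.
    """
    def join(xs, ys):
        out = []
        for sx, k1, av in xs:
            for sy, k2, bv in ys:
                if k1 == k2:                       # match on the same k
                    sign = '+' if sx == sy else '-'
                    out.append((sign, k1, (av, bv)))
        return out

    plusA = [('+', k, v) for k, v in A]   # saved complete input A
    plusB = [('+', k, v) for k, v in B]   # saved complete input B
    return join(dA, plusB) + join(plusA, dB) + join(dA, dB)
```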
(3) CartesianProduct({<ak,av>},{<bk,bv>})→{<ak|bk,av|bv>}

For R=CartesianProduct(A,B), the incremental transfer calculation is:

ΔR=CartesianProduct(ΔA,B)∪CartesianProduct(A,ΔB)∪CartesianProduct(ΔA,ΔB)

The complete inputs A and B must be accessed, so the complete input of this step is saved.

(4) GroupBy({<k,v>})→{<k,list(v)>}

For R=GroupBy(A), the incremental transfer calculation is ΔR=UDFCompute(GroupBy(ΔA),R). First, GroupBy(ΔA) is computed to obtain lists of data updates, with all updates for each k placed in the same list. Then the update lists are merged with the original GroupBy result lists to obtain ΔR; the system implements a UDFCompute that performs this merge.

The complete output must be accessed, so the complete output of this step is saved.
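The GroupBy incremental rule, grouping the delta per key and then merging the update lists into the saved complete output, can be sketched as follows; the names are illustrative, and the merge here plays the role of the system-implemented UDFCompute:

```python
def delta_groupby(delta, full_output):
    """Incremental rule for GroupBy.

    delta: signed elements ('+'|'-', k, v), with all updates for a key
    handled together.
    full_output: the saved complete output, dict k -> list(v), which is
    updated in place.
    Returns the set of keys whose lists changed (the incremental output).
    """
    changed = set()
    for sign, k, v in delta:
        lst = full_output.setdefault(k, [])
        if sign == '+':
            lst.append(v)       # merge an inserted value into the list
        else:
            lst.remove(v)       # merge a deleted value out of the list
        changed.add(k)
    return changed
```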
(5) LocalGroupBy({<k,v>})→{<k,list(v)>}

Similar to GroupBy; the complete output must be accessed, so the complete output of this step is saved.

(6) Sort({<k,v>})→{<k,v>}

For R=Sort(A), the sorted output is an ordering of the entire data set, so ΔR is not meaningful here. From ΔA, a merge or insertion sort can be applied to obtain the complete sorted result, and the complete output of this step is saved.

(7) Lookup(k,{<k,v>})→list(v)

For R=Lookup(k,A), when A changes, ΔR=Lookup(k,ΔA). This is easy to implement; neither the complete input nor the complete output needs to be accessed or saved for this step.

(8) PartitionBy(p, BPD)→{BPD0, BPD1, ..., BPDn-1}

The elements {<+/-,k,v>} can be partitioned directly; neither the complete input nor the complete output needs to be accessed or saved for this step.

(9) A=UDFCompute(B)

Directly compute ΔA=UDFCompute(ΔB). This is easy to implement; neither the complete input nor the complete output needs to be accessed or saved for this step.

(10) A=UDFComputeMulti(B,C,…)

Determining the inputs by Join matching is similar to the Join operation above: the Join over the input BPDs is computed first, and each Join result is passed as an input parameter to the user-defined UDFComputeMulti function. The calculation itself is similar to UDFCompute, and incremental calculation is easy to implement; the complete input of this step is saved.

(11) A=UDFListCombine(B)

Because UDFListCombine is commutative and associative, newly inserted data elements are easily added into the result, which requires access to the complete output. Data elements to be deleted, however, require recomputation, so both the complete input and the complete output of this step must be saved.

It should be understood that the embodiments of the present invention only exemplify calculation steps of big data incremental calculation and their incremental transfer rules; other calculation steps and incremental transfer rules are possible, and the embodiments are not limited in this respect.
From the incremental transfer rule of each calculation step, the necessary data required by the different calculation steps can be derived, as shown in Table 1:

Table 1  Necessary data required by different operations

BPD operation      | Complete input | Complete output | Rating
-------------------|----------------|-----------------|-------
Union              |                |                 | Good
Join               | Required       |                 | Fair
CartesianProduct   | Required       |                 | Fair
GroupBy            |                | Required        | Fair
LocalGroupBy       |                | Required        | Fair
Sort               |                | Required        | Fair
Lookup             |                |                 | Good
PartitionBy        |                |                 | Good
UDFCompute         |                |                 | Good
UDFComputeMulti    | Required       |                 | Fair
UDFListCombine     | Required       | Required        | Poor

A calculation step rated Good has an incremental transfer rule that needs to save neither the complete input nor the complete output; a step rated Fair has a rule that needs to save the necessary data; a step rated Poor has a rule that needs to save both the complete input and the complete output.
A series of optimizations can be applied to the operations rated Fair or Poor, for example:

The Join operation can be optimized (applicable to Join, CartesianProduct, and UDFComputeMulti). For Join(A,B), either of the following two schemes may be used:

Lookup: for each <+/-,k,v>∈ΔA, call Lookup(k,B) to find the matching <k,v>; or

send ΔA to the machine node holding each partition of B and then perform a local Join.

The Combine operation can also be optimized (applicable to UDFListCombine). The user can define a reverse operation of Combine, Remove, i.e., Remove(<ik,ok>,iv)→<ik,ok'>, where ok' is the result of removing the iv part from ok. Then the complete input does not need to be accessed; using only the complete output, the incremental result can be computed: Combine is called for <+,k,v> and Remove for <-,k,v>. This can greatly reduce the cost of incremental Combine calculation.

It should be understood that the above descriptions are merely examples of the embodiments of the present invention; the embodiments are not limited in this respect and allow other implementations.

In a specific implementation, the BPDs are stored as required by the incremental transfer operations: for Join, CartesianProduct, UDFComputeMulti, and UDFListCombine, the complete-input BPDs are stored; for GroupBy, LocalGroupBy, Sort, and UDFListCombine, the complete-output BPDs are stored. These BPDs are saved during the first complete calculation and updated at each incremental calculation.

For operations such as Join, UDFComputeMulti, GroupBy, LocalGroupBy, Sort, and UDFListCombine, support for incremental queries can be provided, i.e., the existing Lookup operation is used to allow selective access to the stored necessary-data BPDs. A concrete implementation can create an index on the BPD, or use a distributed key-value store to provide the index function.

For each job, all of the job's calculation steps are recorded, and the job's life cycle is controlled, including three stages: the first complete run, incremental data input, and incremental calculation; the latter two stages may be performed multiple times.

According to the technical solutions provided by the embodiments of the present invention, the big data calculation is divided into at least two calculation steps, and the incremental input, necessary data, and incremental output of each step are organized in the form of data sets; through each step's data-granularity incremental transfer rule, the incremental calculation is performed in a fine-grained manner, which improves the efficiency of big data incremental calculation. Further, the embodiments set out the basic theory of big data incremental calculation and provide a general fine-grained incremental computing method applicable to a variety of big data computing systems.
FIG. 6 is a schematic structural diagram of a big data computing apparatus 600 according to an embodiment of the present invention. As shown in FIG. 6, the apparatus 600 includes a calculation unit 602, a determining unit 604, and an updating unit 606.

The calculation unit 602 is configured to calculate an incremental output result of the big data calculation according to incremental data, an incremental transfer rule of each calculation step, and necessary data to be saved for each calculation step.

The necessary data includes at least one of a complete input and a complete output. The incremental transfer rule describes, at the granularity of data, the rule by which each calculation step computes its incremental output from its incremental input and the necessary data it saves; the necessary data of each calculation step is saved according to that step's incremental transfer rule when a complete or incremental calculation is performed.

In a specific implementation, the calculation unit 602 may be implemented by the processor 202 and the memory unit shown in FIG. 2. More specifically, the processor 202 may execute the big data calculation module in the memory unit 204 to calculate the incremental output result of the big data calculation according to the incremental data, the incremental transfer rules, and the saved necessary data.

The determining unit 604 is configured to determine a final calculation result according to the incremental output result and the original output result of the big data calculation.

In a specific implementation, the determining unit 604 may be implemented by the processor 202 and the memory unit shown in FIG. 2. More specifically, the processor 202 may execute the big data calculation module in the memory unit 204 to determine the final calculation result according to the incremental output result and the original output result of the big data calculation.

The updating unit 606 is configured to, when the necessary data to be saved for a calculation step has an increment, update that necessary data according to the increment.

In a specific implementation, the updating unit 606 may be implemented by the processor 202 and the memory unit shown in FIG. 2. More specifically, the processor 202 may execute the big data calculation module in the memory unit 204 to update, when the necessary data saved for a calculation step has an increment, that necessary data according to the increment.

The big data calculation includes at least two calculation phases, each containing at least one calculation step. That the calculation unit 602 calculates the incremental output result according to the incremental data, the incremental transfer rules, and the saved necessary data includes: the calculation unit 602 performs incremental calculation phase by phase, in order, according to the incremental transfer rules of the calculation steps contained in each phase and the necessary data saved by those steps, where the incremental input of the first phase is the incremental data, the incremental input of the (i+1)-th phase is the incremental output of the i-th phase, the incremental output of the last phase is the incremental output result of the big data calculation, and i is a positive integer greater than 0.

Optionally, the determining unit 604 is further configured to determine, before a first calculation step among the calculation steps is executed, the incremental calculation cost of the first calculation step; if the incremental calculation cost of the first calculation step is greater than the complete calculation cost, the incremental calculation in the first calculation step and in the calculation steps after it is switched to complete calculation.

The incremental data, the necessary data saved by each calculation step, and the incremental input and incremental output of each calculation step are all organized into bulk parallel data sets; a bulk parallel data set consists of key/value pairs, supports read and write operations, and operations on it support parallel operation by multiple computing nodes.

This embodiment is the apparatus embodiment corresponding to the embodiment of FIG. 4; the feature descriptions of the FIG. 4 embodiment apply to this embodiment and are not repeated here.

It should be understood that, in the embodiments of the present invention, the big data calculation may be performed by multiple apparatuses 600. In a distributed computing system, computing tasks may be allocated to multiple apparatuses 600 for processing, for example by assigning one apparatus 600 to each calculation step or to a group of calculation steps, with the big data calculation completed through the cooperation of the apparatuses 600. The embodiments only describe the functions of the apparatus 600 and place no limitation on the number of apparatuses 600 performing the big data calculation or on how such apparatuses are organized.

According to the technical solutions provided by the embodiments of the present invention, the big data calculation is divided into at least two calculation steps, and the incremental input, necessary data, and incremental output of each step are organized in the form of data sets; through each step's data-granularity incremental transfer rule, the incremental calculation is performed in a fine-grained manner, which improves the efficiency of big data incremental calculation. Further, the embodiments set out the basic theory of big data incremental calculation and provide a general fine-grained incremental computing method applicable to a variety of big data computing systems.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the device embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or modules, and may be electrical, mechanical, or in other forms.

The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules; they may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of this embodiment.

In addition, the functional modules in the embodiments of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware, or in the form of hardware plus software functional modules.

An integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Finally, it should be noted that the above embodiments are merely intended to describe the technical solutions of the present invention rather than to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements may be made to some of their technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the protection scope of the technical solutions of the embodiments of the present invention.

Claims (12)

  1. A big data computation method, wherein the big data computation includes at least two computation steps, the method comprising:
    computing an incremental output result of the big data computation according to incremental data, an incremental propagation rule of each computation step, and necessary data that each computation step needs to save, wherein the necessary data includes at least one of a full input and a full output, the incremental propagation rule describes, at data granularity, a rule by which each computation step computes its incremental output according to its incremental input and the necessary data that the computation step needs to save, and the necessary data that each computation step needs to save is saved, during full computation or incremental computation, according to the incremental propagation rule of that computation step; and
    determining a final computation result according to the incremental output result and an original output result of the big data computation.
  2. The method according to claim 1, further comprising:
    when the necessary data that a computation step needs to save has an increment, updating the necessary data that the computation step needs to save according to the increment of that necessary data.
  3. The method according to claim 1 or 2, wherein the big data computation includes at least two computation stages, each computation stage containing at least one of the computation steps, and the computing an incremental output result of the big data computation according to incremental data, an incremental propagation rule of each computation step, and necessary data that each computation step needs to save comprises:
    performing incremental computation stage by stage, in stage order, according to the incremental propagation rules of the computation steps contained in each computation stage and the necessary data that the computation steps contained in each computation stage need to save, wherein the incremental input of the first computation stage is the incremental data, the incremental input of the (i+1)-th computation stage is the incremental output of the i-th computation stage, the incremental output of the last computation stage is the incremental output result of the big data computation, and i is a positive integer.
  4. The method according to any one of claims 1 to 3, wherein, before a first computation step among the computation steps is executed, an incremental computation cost of the first computation step is determined, and if the incremental computation cost of the first computation step is greater than a full computation cost, the first computation step and the computation steps after the first computation step are switched from incremental computation to full computation.
  5. The method according to any one of claims 1 to 4, wherein the incremental data, the necessary data that each computation step needs to save, and the incremental input and incremental output of each computation step are all organized as holistic parallel datasets, a holistic parallel dataset consists of key/value pairs and supports read and write operations, and operations on a holistic parallel dataset support parallel execution by multiple computing nodes.
  6. A big data computing apparatus, wherein the big data computation includes at least two computation steps, the apparatus comprising:
    a computing unit, configured to compute an incremental output result of the big data computation according to incremental data, an incremental propagation rule of each computation step, and necessary data that each computation step needs to save, wherein the necessary data includes at least one of a full input and a full output, the incremental propagation rule describes, at data granularity, a rule by which each computation step computes its incremental output according to its incremental input and the necessary data that the computation step needs to save, and the necessary data that each computation step needs to save is saved, during full computation or incremental computation, according to the incremental propagation rule of that computation step; and
    a determining unit, configured to determine a final computation result according to the incremental output result and an original output result of the big data computation.
  7. The apparatus according to claim 6, further comprising an updating unit, configured to, when the necessary data that a computation step needs to save has an increment, update the necessary data that the computation step needs to save according to the increment of that necessary data.
  8. The apparatus according to claim 6 or 7, wherein the big data computation includes at least two computation stages, each computation stage containing at least one of the computation steps, and the computing unit being configured to compute an incremental output result of the big data computation according to incremental data, an incremental propagation rule of each computation step, and necessary data that each computation step needs to save comprises:
    the computing unit being configured to perform incremental computation stage by stage, in stage order, according to the incremental propagation rules of the computation steps contained in each computation stage and the necessary data that the computation steps contained in each computation stage need to save, wherein the incremental input of the first computation stage is the incremental data, the incremental input of the (i+1)-th computation stage is the incremental output of the i-th computation stage, the incremental output of the last computation stage is the incremental output result of the big data computation, and i is a positive integer.
  9. The apparatus according to any one of claims 6 to 8, wherein the determining unit is further configured to: before a first computation step among the computation steps is executed, determine an incremental computation cost of the first computation step, and if the incremental computation cost of the first computation step is greater than a full computation cost, switch the first computation step and the computation steps after the first computation step from incremental computation to full computation.
  10. The apparatus according to any one of claims 6 to 9, wherein the incremental data, the necessary data that each computation step needs to save, and the incremental input and incremental output of each computation step are all organized as holistic parallel datasets, a holistic parallel dataset consists of key/value pairs and supports read and write operations, and operations on a holistic parallel dataset support parallel execution by multiple computing nodes.
  11. A computer-readable medium, comprising computer-executable instructions, wherein, when a processor of a computer executes the computer-executable instructions, the computer performs the method according to any one of claims 1 to 5.
  12. A computing device, comprising a processor, a memory, and a bus;
    wherein the memory is configured to store execution instructions, the processor is connected to the memory through the bus, and, when the computing device runs, the processor executes the execution instructions stored in the memory, so that the computing device performs the method according to any one of claims 1 to 5.
PCT/CN2016/097946 2015-12-31 2016-09-02 Big data incremental computation method and apparatus WO2017113865A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201511028360.3A CN106933882B (zh) 2015-12-31 2015-12-31 Big data incremental computation method and apparatus
CN201511028360.3 2015-12-31

Publications (1)

Publication Number Publication Date
WO2017113865A1 true WO2017113865A1 (zh) 2017-07-06

Family

ID=59224446

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/097946 WO2017113865A1 (zh) 2015-12-31 2016-09-02 Big data incremental computation method and apparatus

Country Status (2)

Country Link
CN (1) CN106933882B (zh)
WO (1) WO2017113865A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647228A (zh) * 2018-03-28 2018-10-12 China Electric Power Research Institute Co., Ltd. Real-time processing method and system for visible light communication big data

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334599A (zh) * 2018-01-31 2018-07-27 Foshan Jucheng Intellectual Property Service Co., Ltd. Big-data-based analysis system
CN112669984B (zh) * 2020-12-30 2023-09-12 South China Normal University Collaborative progressive monitoring, early-warning, and response method for infectious diseases based on big data and artificial intelligence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102637206A (zh) * 2012-03-21 2012-08-15 Inspur Group Shandong General Software Co., Ltd. Data query method for large data volumes
CN103049556A (zh) * 2012-12-28 2013-04-17 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Fast statistical query method for massive medical data
CN104199942A (zh) * 2014-09-09 2014-12-10 University of Science and Technology of China Incremental computation method and system for time-series data on the Hadoop platform
CN104317738A (zh) * 2014-10-24 2015-01-28 University of Science and Technology of China MapReduce-based incremental computation method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7620634B2 (en) * 2006-07-31 2009-11-17 Microsoft Corporation Ranking functions using an incrementally-updatable, modified naïve bayesian query classifier
CN105138600B (zh) * 2015-08-06 2019-03-26 Sichuan Changhong Electric Co., Ltd. Social network analysis method based on graph structure matching


Also Published As

Publication number Publication date
CN106933882A (zh) 2017-07-07
CN106933882B (zh) 2020-09-29


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16880658

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16880658

Country of ref document: EP

Kind code of ref document: A1