WO2016008317A1 - Data processing method and central node - Google Patents

Data processing method and central node

Info

Publication number
WO2016008317A1
Authority
WO
WIPO (PCT)
Prior art keywords
gpu
function
loop
calculation function
central node
Prior art date
Application number
PCT/CN2015/075703
Other languages
French (fr)
Chinese (zh)
Inventor
刘颖 (Ying Liu)
崔慧敏 (Huimin Cui)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2016008317A1 publication Critical patent/WO2016008317A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs

Definitions

  • Embodiments of the present invention relate to computer technologies, and in particular, to a data processing method and a central node.
  • MapReduce is currently the most popular programming model for big-data processing on large-scale cluster systems.
  • In a homogeneous cluster system (for example, a cluster system consisting of multiple central processing units (CPUs) connected via a network), MapReduce currently uses the Hadoop programming framework. Under the Hadoop programming framework, a programmer only needs to write a Map function and a Reduce function and submit them to the Hadoop program running on the central node of the cluster system.
  • The Hadoop program decomposes the computing task into multiple sub-data blocks (splits) and sends the Map function, the Reduce function, and the splits to the computing nodes that perform the calculation.
  • After a computing node receives the task instruction, it calls the Map function to process the received splits; the Reduce function then sorts and merges the processing results of the Map function and outputs the final result.
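The Map/Reduce flow described above can be sketched minimally. This is a hypothetical word-count illustration in Python; the actual Hadoop code described in this document is written in Java, and all names here (`map_fn`, `reduce_fn`) are invented for illustration.

```python
# Hypothetical illustration of the MapReduce flow described above.
# A split is a list of records; the Map function emits (key, value) pairs
# per record, and the Reduce function sorts and merges them by key.

def map_fn(record):
    # emit one (word, 1) pair per word in the record
    return [(word, 1) for word in record.split()]

def reduce_fn(pairs):
    # sort the Map outputs by key and merge the values
    result = {}
    for key, value in sorted(pairs):
        result[key] = result.get(key, 0) + value
    return result

split = ["a b a", "b c"]          # a sub-data block with two records
pairs = [p for rec in split for p in map_fn(rec)]
final = reduce_fn(pairs)
print(final)  # {'a': 2, 'b': 2, 'c': 1}
```
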
  • However, the Hadoop programming framework in the prior art is only applicable to a homogeneous cluster system and cannot be applied to a hybrid cluster system (for example, a cluster system of CPUs and graphics processing units (GPUs)) for data processing.
  • Embodiments of the present invention provide a data processing method and a central node, so that the Hadoop programming framework is applicable to a hybrid cluster system for data processing.
  • A first aspect of the present invention provides a data processing method, where the method is applied to a Hadoop cluster system, the Hadoop cluster system includes a computing node and a central node, the central node runs a Hadoop program and performs MapReduce job management on the computing node, and the computing node includes a CPU and a GPU having N cores. The method includes:
  • the central node receives a first loop function written by a user according to a MapReduce computing framework provided by the Hadoop program, where the first loop function includes a Map calculation function provided by a user.
  • the first loop function is used to cyclically call the Map calculation function provided by the user;
  • the central node replaces, by using the running Hadoop program, the Map calculation function in the first loop function with a first copy function to generate a second loop function, where the first copy function is used to copy a plurality of data records in the computing node that need to be processed by the GPU from a memory of the computing node to a video memory of the GPU, and the second loop function is used to cyclically execute the first copy function;
  • the central node generates a startup calculation function according to the first loop function, and the Map calculation function in the startup calculation function is used to instruct the GPU to process the data record that the GPU is responsible for processing;
  • the central node generates a second copy function, and the second copy function is configured to copy the calculation result of the plurality of data records by the GPU from the video memory of the GPU to the memory of the computing node.
  • The Map calculation function in the startup calculation function includes an input part, a calculation part, and an output part, where the input part is used to read the data records that the GPU needs to process from the video memory of the GPU, the calculation part is used to process the to-be-processed data records read by the input part, and the output part is used to store the calculation results of the data records processed by the calculation part into the video memory of the GPU.
  • The Map calculation function in the startup calculation function is used to process, in parallel, the plurality of data records that the GPU is responsible for processing, where the multiple cores of the GPU each process at least one of the plurality of data records.
  • The input address of the input part includes an input address of each core of the GPU, so that each core of the GPU reads the data records it needs to process from the GPU's video memory according to its own input address; the output address of the output part includes an output address of each core of the GPU, so that each core of the GPU stores the results of its processed data records into its own output address.
  • the central node generates a startup calculation function, including:
  • the central node modifies an input address in the Map calculation function provided by the user to an input address of each core of the GPU to generate an input address of the input portion;
  • the central node modifies an output address in the Map calculation function provided by the user to an output address of each core of the GPU to generate an output address of the output part;
  • the central node replaces the first loop function on the outer layer of the Map calculation function provided by the user with a third loop function, where the number of loops of the third loop function is the number M of data records that the GPU is responsible for processing;
  • the central node splits the loop in the third loop function into an outer loop and an inner loop, so as to divide the M data records that the GPU is responsible for processing into ⌈M/B⌉ data record blocks executed in parallel, where the number of iterations of the outer loop is ⌈M/B⌉, the number of iterations of the inner loop is B, and each core of the GPU executes one data record block;
  • the central node declares a local variable of the Map calculation function provided by the user as a thread-local variable of the GPU, where each core of the GPU corresponds to one thread-local variable, and each core of the GPU reads the data records it needs to process from the video memory of the GPU through its corresponding thread-local variable.
  • The method further includes: the computing node converts the language of the startup calculation function into a language that the GPU can recognize.
  • the method further includes:
  • the central node sends the first loop function, the second loop function, the second copy function, and the startup calculation function to the computing node, so that the CPU runs the first loop function, the second loop function, and the second copy function, and the GPU runs the startup calculation function.
  • a second aspect of the present invention provides a central node, including:
  • a receiving module configured to receive a first loop function written by a user according to a MapReduce computing framework provided by a Hadoop program, where the first loop function includes a user-provided Map calculation function, and the first loop function is used to cyclically invoke the User-provided Map calculation function;
  • a first generation module configured to replace, by using the running Hadoop program, the Map calculation function in the first loop function with a first copy function to generate a second loop function, where the first copy function is used to copy a plurality of data records in the computing node that need to be processed by the GPU from a memory of the computing node to a video memory of the GPU, and the second loop function is used to cyclically execute the first copy function;
  • a second generation module configured to generate a startup calculation function according to the first loop function, where a Map calculation function in the startup calculation function is used to instruct the GPU to process a data record that the GPU is responsible for processing;
  • a third generation module configured to generate a second copy function, where the second copy function is used to copy the GPU's calculation results for the plurality of data records from the video memory of the GPU to the memory of the computing node.
  • The Map calculation function in the startup calculation function includes an input part, a calculation part, and an output part, where the input part is used to read the data records that the GPU needs to process from the video memory of the GPU, the calculation part is used to process the to-be-processed data records read by the input part, and the output part is used to store the calculation results of the data records processed by the calculation part into the video memory of the GPU.
  • The Map calculation function in the startup calculation function is used to process, in parallel, the plurality of data records that the GPU is responsible for processing, where the multiple cores of the GPU each process at least one of the plurality of data records.
  • The input address of the input part includes an input address of each core of the GPU, so that each core of the GPU reads the data records it needs to process from the GPU's video memory according to its own input address; the output address of the output part includes an output address of each core of the GPU, so that each core of the GPU stores the results of its processed data records into its own output address.
  • the second generating module is specifically configured to:
  • the central node further includes:
  • a conversion module configured to convert the language of the startup calculation function into a language that the GPU can recognize.
  • the central node further includes:
  • a sending module configured to send the first loop function, the second loop function, the second copy function, and the start calculation function to the computing node, so that the CPU runs the first loop a function, the second loop function, and the second copy function, and causing the GPU to run the startup calculation function.
  • In the embodiments of the present invention, the central node generates a second loop function, a startup calculation function, and a second copy function according to a first loop function written by the user using the MapReduce computing framework. The second loop function is used to cyclically call the first copy function to copy a plurality of data records in the computing node that need to be processed by the GPU from the memory of the computing node to the video memory of the GPU; the Map calculation function in the startup calculation function is used to instruct the GPU to process the data records that the GPU is responsible for processing; and the second copy function is used to copy the GPU's calculation results for the data records from the GPU's video memory to the memory of the computing node. In this way, code suitable for running on the GPU is automatically generated from code suitable for running on the CPU, making the Hadoop programming framework applicable to data processing in a hybrid cluster system.
  • FIG. 1 is a flowchart of a data processing method according to Embodiment 1 of the present invention.
  • FIG. 2 is a flowchart of a data processing method according to Embodiment 2 of the present invention.
  • FIG. 3 is a schematic structural diagram of a central node according to Embodiment 3 of the present invention.
  • FIG. 4 is a schematic structural diagram of a central node according to Embodiment 4 of the present invention.
  • FIG. 5 is a schematic structural diagram of a central node according to Embodiment 5 of the present invention.
  • the embodiment of the invention provides a data processing method, which is applied to a Hadoop cluster system.
  • the Hadoop cluster system includes a computing node and a central node.
  • The central node runs a Hadoop program and performs MapReduce job management on the computing node.
  • The computing node includes a CPU and a GPU with N cores. That is, the Hadoop cluster system in this embodiment of the present invention is a hybrid cluster system, and both the CPU and the GPU of a computing node can run the MapReduce program to process data.
  • FIG. 1 is a flowchart of a data processing method according to Embodiment 1 of the present invention. As shown in FIG. 1 , the method in this embodiment may include the following steps:
  • Step 101: The central node receives a first loop function written by a user according to a MapReduce computing framework provided by a Hadoop program, where the first loop function includes a user-provided Map calculation function, and the first loop function is used to cyclically invoke the Map calculation function provided by the user.
  • the first loop function provided by the user is written in the existing Hadoop writing mode, and the first loop function can be directly run on the CPU of the computing node.
  • The computing task to be processed is divided into multiple data blocks (splits), and the data inside each split is further divided into multiple data records (records). The first loop function cyclically calls the user-provided Map calculation function, which processes each data record in turn; the CPU completes the computing task by cyclically calling the user-provided Map calculation function.
  • Step 102: The central node replaces, by using the running Hadoop program, the Map calculation function in the first loop function with a first copy function to generate a second loop function. The first copy function is used to copy a plurality of data records in the computing node that need to be processed by the GPU from the memory of the computing node to the video memory of the GPU, and the second loop function is used to cyclically execute the first copy function.
  • In this embodiment, the GPU and the CPU need to cooperate to process the computing task, but the first loop function is written for the CPU's running environment and can run only on the CPU, not on the GPU. Therefore, the method of this embodiment generates code that can run on the GPU (hereinafter referred to as GPU code), and the GPU code can call the Map calculation function to process the data records.
  • On the CPU side, the variables of the Map calculation function are declared and defined in the Java language and stored in memory; the variables of the Map function mainly include the key and the value. Through these variable declarations, the CPU side reads data from memory for processing. If the user-provided Map calculation function were copied to the GPU without any modification, then when the Map calculation function used a variable during execution, the management program on the GPU would search the variable list on the GPU for that variable. Because the variable is declared only on the CPU, only the Java program executing on the CPU side can access it; the Map calculation function on the GPU therefore cannot find the variable, and the Map calculation function cannot be executed.
  • In addition, the GPU cannot directly access the memory of the computing node. To run the Map calculation function on the GPU, the data in memory must first be copied to the GPU's video memory, which the GPU can access directly. Therefore, the central node replaces the Map calculation function in the first loop function with the first copy function to generate the second loop function. The first copy function copies one data record at a time from the memory of the computing node to the GPU's video memory, and the second loop function calls the first copy function multiple times to copy all the data records that the GPU needs to process into the GPU's video memory.
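The replacement performed in step 102 can be sketched as follows. This is a hypothetical Python simulation: the GPU's video memory is modeled as a plain list, and `first_copy_function` stands in for a real device transfer (in an OpenCL implementation this would be a buffer write such as `clEnqueueWriteBuffer`); all names are invented for illustration.

```python
# Sketch of step 102: the Map call inside the loop is replaced by a copy
# of one data record per iteration into simulated GPU video memory.

host_memory = ["rec0", "rec1", "rec2", "rec3"]   # records in node memory
gpu_video_memory = []                             # stands in for device memory

def first_copy_function(record):
    # copies one data record from node memory to the GPU's video memory
    gpu_video_memory.append(record)

def second_loop_function(records):
    # cyclically executes the first copy function, one record per iteration
    for record in records:
        first_copy_function(record)

second_loop_function(host_memory)
print(gpu_video_memory)  # ['rec0', 'rec1', 'rec2', 'rec3']
```
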
  • Step 103: The central node generates a startup calculation function according to the first loop function, and the Map calculation function in the startup calculation function is used to instruct the GPU to process the data records that the GPU is responsible for processing.
  • the central node generates a startup calculation function for the GPU according to the first loop function submitted by the user.
  • the startup calculation function includes a Map calculation function
  • the GPU processes the data record by calling a Map calculation function in the startup calculation function.
  • The Map calculation function in the startup calculation function may include an input part, a calculation part, and an output part, where the input part is used to read the data records to be processed from the GPU's video memory, the calculation part is used to process the data records read by the input part, and the output part is used to store the calculation results of the processed data records into the video memory of the GPU.
  • Before the computing node processes the data records, it first executes the second loop function to copy the data records that the GPU needs to process from the memory of the computing node to the GPU's video memory.
  • When the computing node executes the Map calculation function of the startup calculation function, the input part first accesses the GPU's video memory to read the data records to be processed; the calculation part then processes the data records read by the input part; and after the calculation part finishes processing a data record, the output part stores the processing result in the GPU's video memory.
  • The calculation part can process multiple data records in parallel. If the N cores of the GPU are idle, they can process multiple data records in parallel: for example, with 2N data records in total, each core can process two data records, and the N cores run in parallel at the same time, which improves processing efficiency. If more data records need to be processed, the GPU can also call the Map calculation function multiple times.
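The input/calculation/output structure just described can be sketched with a per-core simulation. This is hypothetical Python for illustration only: the N cores run serially here (on a real GPU they run in parallel), and `map_calc` is an invented stand-in for the user's Map calculation.

```python
# Sketch of the startup calculation function: each simulated GPU core reads
# its records from video memory (input part), applies the Map calculation
# (calculation part), and stores the results (output part).

N = 4                                    # number of GPU cores
gpu_video_memory = list(range(8))        # 2N records already copied to device

def map_calc(record):
    # stand-in for the user's Map calculation: square the record
    return record * record

def startup_calculation_function(core_id):
    results = []
    # input part: read the two records this core is responsible for
    for idx in range(core_id * 2, core_id * 2 + 2):
        record = gpu_video_memory[idx]
        # calculation part, then output part: store the processed result
        results.append(map_calc(record))
    return results

# all N cores would run concurrently on a real GPU; simulated serially here
gpu_output = [r for core in range(N) for r in startup_calculation_function(core)]
print(gpu_output)  # [0, 1, 4, 9, 16, 25, 36, 49]
```
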
  • Step 104: The central node generates a second copy function, and the second copy function is used to copy the GPU's calculation results for the plurality of data records from the video memory of the GPU to the memory of the computing node.
  • After the GPU processes the data records, the calculation results need to be copied from the GPU's video memory to the memory of the computing node. Therefore, the central node also generates a second copy function, which is used to copy the GPU's calculation results for the plurality of data records from the GPU's video memory to the computing node's memory.
  • The Reduce function sorts and merges the calculation results of the Map calculation function; therefore, the central node also needs to send the Reduce function to the computing node.
  • After the central node generates the second loop function, the startup calculation function, and the second copy function, it sends the first loop function, the second loop function, the second copy function, and the startup calculation function to the computing node. Specifically, the central node transmits the first loop function, the second loop function, and the second copy function to the CPU, so that the CPU runs these three functions, and sends the startup calculation function to the GPU, so that the GPU runs the startup calculation function.
  • When the central node receives a computing task input by the user, it divides the computing task into multiple splits, allocates a corresponding computing node to each split according to a preset scheduling policy, and sends each split to its corresponding computing node; the computing node stores the splits in its memory after receiving them.
  • When a computing node includes a GPU, the GPU and the CPU of the computing node can cooperate to process the received splits; otherwise, the CPU of the computing node processes the received splits on its own.
  • When the CPU and the GPU use different programming languages, the computing node is further configured to convert the language of the startup calculation function into a language that the GPU can recognize. For example, if Java runs on the CPU side and the GPU runs OpenCL, the computing node needs to convert the startup calculation function from the Java language into the OpenCL language.
  • In summary, the central node generates a second loop function, a startup calculation function, and a second copy function according to the first loop function that the user provides using the MapReduce computing framework. The second loop function is used to cyclically invoke the first copy function to copy the data records of the computing node that need to be processed by the GPU from the memory of the computing node to the video memory of the GPU; the Map calculation function in the startup calculation function is used to instruct the GPU to process the data records the GPU is responsible for; and the second copy function is used to copy the GPU's calculation results for multiple data records from the GPU's video memory to the memory of the computing node. Code suitable for running on the GPU is thus generated automatically from code suitable for running on the CPU, making the Hadoop programming framework applicable to data processing in a hybrid cluster system. Because the central node can automatically generate code suitable for running on the GPU according to the first loop function provided by the user, there is no need to change the existing Hadoop writing mode, that is, there is no need to rewrite the Map and Reduce functions, which facilitates the maintenance and porting of legacy code.
  • In the existing mechanism, the computing task is decomposed into multiple sub-data blocks (splits), and the Map function is executed in parallel between splits.
  • A split is generally 64 MB of data, so the parallel granularity is coarse, which does not suit the structural characteristics of the GPU.
  • A split can be divided at a finer granularity to make full use of the structural characteristics of the GPU.
  • The multiple data records included in a split allocated to the GPU are distributed across multiple cores of the GPU for simultaneous parallel processing, which can further improve the processing speed of the computing node.
  • FIG. 2 is a flowchart of a data processing method according to Embodiment 2 of the present invention.
  • This embodiment describes in detail how the central node generates the startup calculation function when the GPU processes, in parallel, the plurality of data records it is responsible for.
  • The Map calculation function in the startup calculation function is used for parallel processing of the plurality of data records processed by the GPU, where L cores of the GPU each process at least one of the plurality of data records, L is an integer greater than or equal to 2 and less than or equal to N, and N is the total number of cores included in the GPU.
  • the method in this embodiment may include the following steps:
  • Step 201 The central node modifies an input address in the Map calculation function provided by the user to an input address of each core of the GPU.
  • the input address of the input portion of the Map calculation function in the startup calculation function includes the input address of each core of the GPU, so that each core of the GPU reads the data record to be processed from the GPU's video memory according to its own input address.
  • the input address in the Map calculation function provided by the user needs to be modified into the input address of each core of the GPU.
  • The startup calculation function needs to run on each core of the GPU; the i-th GPU core executes its corresponding startup calculation function to read and process the data records at the work-buff[index1[i]] address, and each core of the GPU corresponds to one process.
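The per-core addressing just described can be sketched as an index table. The names `work_buff` and `index1` follow the patent's own notation; everything else in this Python sketch (the record values, the `counts` table, `core_input`) is a hypothetical illustration.

```python
# Each core i reads its own records starting at work_buff[index1[i]].

work_buff = [10, 11, 12, 20, 21, 30]   # all cores' records, packed together
index1 = [0, 3, 5]                      # start offset of each core's records
counts = [3, 2, 1]                      # how many records each core owns

def core_input(i):
    # input part of core i: read its own slice of the video memory buffer
    start = index1[i]
    return work_buff[start:start + counts[i]]

per_core = [core_input(i) for i in range(3)]
print(per_core)  # [[10, 11, 12], [20, 21], [30]]
```
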
  • Step 202: The central node modifies the output address in the Map calculation function provided by the user to the output address of each core of the GPU to generate the output address of the output part.
  • The output address of the output part includes the output address of each core of the GPU, so that each core of the GPU stores the results of its processed data records into its own output address.
  • Step 203 The central node replaces the first loop function of the outer layer of the Map calculation function provided by the user with the third loop function, and the number of loops of the third loop function is the number M of data records that the GPU is responsible for processing.
  • Step 204: The central node splits the loop in the third loop function into an outer loop and an inner loop, so as to divide the M data records that the GPU is responsible for processing into ⌈M/B⌉ data record blocks executed in parallel, where the number of iterations of the outer loop is ⌈M/B⌉, the number of iterations of the inner loop is B, and each core of the GPU executes one data record block.
  • Step 205: The central node declares the local variables of the Map calculation function provided by the user as thread-local variables of the GPU, where each core of the GPU corresponds to one thread-local variable, and each core of the GPU reads the data records it needs to process from the GPU's video memory through its corresponding thread-local variable.
  • Steps 203 to 205 are the specific process by which the central node generates the calculation part of the startup calculation function when the GPU processes in parallel the plurality of data records it is responsible for.
  • After the first loop function calls the user-provided Map calculation function to process one data record, it determines whether any data records remain to be processed; if so, it continues to call the user-provided Map calculation function until all data records have been processed. That is, the first loop function is a serial Map calculation function. In this embodiment, the data records need to be allocated to multiple cores of the GPU for parallel processing; therefore, the first loop function cannot be used directly, and the serial Map calculation function needs to be converted into a parallel OpenCL kernel, which is a code segment executed in parallel on the GPU in an OpenCL program, wrapped in the form of a function.
  • The central node replaces the first loop function on the outer layer of the user-provided Map calculation function with the third loop function, where the number of loops of the third loop function is the number M of data records processed by the GPU; the loop conditions of the first loop function and the third loop function are different.
  • After replacing the first loop function outside the Map calculation function with the third loop function, the central node splits the loop in the third loop function into an outer loop and an inner loop, so that the M data records that the GPU is responsible for processing are divided into ⌈M/B⌉ data record blocks executed in parallel; the number of iterations of the outer loop is ⌈M/B⌉ and the number of iterations of the inner loop is B. Taking the inner loop as an OpenCL kernel, a total of ⌈M/B⌉ OpenCL kernels are generated; each core of the GPU runs one OpenCL kernel, and the OpenCL kernels execute in parallel. Each core of the GPU executes one data record block, so ⌈M/B⌉ cores execute in parallel; because the number of inner-loop iterations is B, each core processes B data records by calling the Map calculation function B times.
  • When M/B is an integer, the M data records are divided exactly into M/B data record blocks, and each data record block contains the same number of data records.
  • Otherwise, the number of data record blocks is M/B rounded up, and the number of data records in the last data record block differs from the others: it is the remainder of M/B. For example, when M equals 11 and B equals 5, 11/5 equals 2 with a remainder of 1, so the data records are divided into 3 data record blocks executed in parallel; two cores of the GPU each execute 5 data records, and the last core executes 1 data record.
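The block division just described can be checked with a short computation. The helper name `split_into_blocks` is hypothetical; the arithmetic (⌈M/B⌉ blocks of at most B records) follows the description above.

```python
import math

def split_into_blocks(M, B):
    # divide M data records into ceil(M/B) blocks of at most B records each;
    # the last block holds the remainder when M is not a multiple of B
    num_blocks = math.ceil(M / B)
    blocks = []
    for i in range(num_blocks):
        start = i * B
        blocks.append(list(range(start, min(start + B, M))))
    return blocks

blocks = split_into_blocks(11, 5)
print(len(blocks))               # 3 blocks, as in the M=11, B=5 example
print([len(b) for b in blocks])  # [5, 5, 1]
```
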
  • In the prior art, the variables of the user-provided Map calculation function are local variables that can be shared by all the data records. In this embodiment, each core's variables can be shared only by the data records processed by that core, not by other cores; therefore, the central node needs to declare the local variables of the user-provided Map calculation function as thread-local variables of the GPU.
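The thread-local declaration of step 205 can be sketched with Python's `threading.local`, which gives each thread (standing in for a GPU core) a private copy of a variable. All names (`core_worker`, `acc`) are invented for illustration; the patent's actual mechanism is an OpenCL thread-local declaration, not Python threads.

```python
import threading

# Sketch of step 205: each simulated GPU core gets its own private copy
# of the Map function's local variable via threading.local.

core_local = threading.local()
results = {}

def core_worker(core_id, records):
    core_local.acc = 0          # thread-local variable, private to this core
    for record in records:
        core_local.acc += record
    results[core_id] = core_local.acc

# core i processes records [i, i + 1]; accumulators never interfere
threads = [threading.Thread(target=core_worker, args=(i, [i, i + 1]))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # {0: 1, 1: 3, 2: 5}
```
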
  • In the prior art, parallelism in the Map phase exists only between splits, and the parallel granularity is coarse.
  • In this embodiment, the serial execution mode of the Map function in the existing Hadoop mechanism is changed to a parallel execution mode.
  • The parallelism between splits is preserved, and parallelism between the data records within a split is added; that is, a split running on the GPU is further divided into multiple data record blocks executed in parallel, so that the parallelism of the computing nodes is enhanced and the computation rate is improved.
  • The central node of this embodiment includes: a receiving module 11, a first generating module 12, a second generating module 13, and a third generating module 14.
  • the receiving module 11 is configured to receive a first loop function written by a user according to a MapReduce computing framework provided by a Hadoop program, where the first loop function includes a user-provided Map calculation function, and the first loop function is used to loop Calling the Map calculation function provided by the user;
  • a first generating module 12, configured to replace, by using the running Hadoop program, the Map calculation function in the first loop function with a first copy function to generate a second loop function, where the first copy function is used to copy the plurality of data records in the computing node that need to be processed by the GPU from the memory of the computing node to the video memory of the GPU, and the second loop function is used to cyclically execute the first copy function;
  • the second generation module 13 is configured to generate a startup calculation function according to the first loop function, where the Map calculation function in the startup calculation function is used to instruct the GPU to process the data record that the GPU is responsible for processing;
  • a third generating module 14, configured to generate a second copy function, where the second copy function is used to copy the GPU's calculation results for the plurality of data records from the video memory of the GPU to the memory of the computing node.
  • the Map calculation function in the startup calculation function may include: an input portion, a calculation portion, and an output portion, where the input portion is configured to read, from the video memory of the GPU, the data records that the GPU needs to process, the calculation portion is configured to process the to-be-processed data records read by the input portion, and the output portion is configured to store the calculation results of the data records processed by the calculation portion into the video memory of the GPU.
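The three-part structure (input portion, calculation portion, output portion) can be sketched as follows. This is an illustrative CPU-side simulation in Python, not the embodiment's GPU code; the per-core addressing scheme and the squaring computation are assumptions for illustration:

```python
def startup_map_kernel(core_id, gpu_memory_in, gpu_memory_out, compute):
    """Simulates one GPU core running the Map part of the startup
    calculation function: read from video memory, compute, write back."""
    # input portion: each core reads its record from its own input address
    record = gpu_memory_in[core_id]
    # calculation portion: process the record read by the input portion
    result = compute(record)
    # output portion: store the result at the core's own output address
    gpu_memory_out[core_id] = result

# simulated video memory and a toy Map computation (squaring, an assumption)
video_in = [1, 2, 3, 4]
video_out = [None] * 4
for core in range(4):          # each core handles one record
    startup_map_kernel(core, video_in, video_out, lambda x: x * x)
print(video_out)  # [1, 4, 9, 16]
```

Each simulated core touches only its own input and output addresses, which is what lets the records be processed in parallel without interference.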
  • the central node of this embodiment may be used to implement the technical solution of the method embodiment shown in FIG. 1; the specific implementation manners and technical effects are similar, and details are not described herein again.
  • on the basis of the central node shown in the preceding figure, the central node of this embodiment further includes: a conversion module 15 and a sending module 16. The conversion module 15 is configured to convert the language of the startup calculation function into a language that the GPU can recognize.
  • a sending module 16, configured to send the first loop function, the second loop function, the second copy function, and the startup calculation function to the computing node, so that the CPU runs the first loop function, the second loop function, and the second copy function, and the GPU runs the startup calculation function.
  • the Map calculation function in the startup calculation function is used for parallel processing of the plurality of data records that the GPU is responsible for processing, where the multiple cores of the GPU each process at least one of the plurality of data records that the GPU is responsible for processing.
  • the input address of the input portion includes an input address for each core of the GPU, so that each core of the GPU reads the data record it needs to process from the video memory of the GPU according to its own input address, and the output address of the output portion includes an output address for each core of the GPU, so that each core of the GPU stores the result of its processed data records into its own output address.
  • the second generation module is specifically configured to perform the following operations:
  • split the loop in the third loop function into an outer loop and an inner loop, to divide the M data records that the GPU is responsible for processing into ⌈M/B⌉ data record blocks executed in parallel, where the number of outer loop iterations is ⌈M/B⌉, the number of inner loop iterations is B, and each core of the GPU executes one data record block;
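The outer/inner loop split can be illustrated as a source-level transformation. The following is a sketch under stated assumptions (the toy `process` function and the guard on the partial last block are illustrative, not from the embodiment):

```python
import math

def run_split_loops(records, b, process):
    """Single loop over M records rewritten as an outer loop over ceil(M/b)
    blocks and an inner loop over up to b records; on the GPU each outer
    iteration would be one core's work, so the blocks can run in parallel."""
    m = len(records)
    results = [None] * m
    for block in range(math.ceil(m / b)):      # outer loop: one block per core
        for j in range(b):                     # inner loop: records in the block
            i = block * b + j
            if i < m:                          # guard the partial last block
                results[i] = process(records[i])
    return results

out = run_split_loops(list(range(11)), 5, lambda r: r + 1)
print(out)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

Run serially on a CPU the result is identical to the original single loop; the point of the rewrite is that each outer iteration is independent and can be mapped to a GPU core.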
  • the central node of this embodiment may be used to implement the technical solution of the method embodiment shown in FIG. 1 and FIG. 2 , and the specific implementation manners and technical effects are similar, and details are not described herein again.
  • the central node 200 of this embodiment includes: a processor 21, a memory 22, a communication interface 23, and a system bus 24; the memory 22 and the communication interface 23 are connected to and communicate with the processor 21 through the system bus 24.
  • the communication interface 23 is used for communication with other devices.
  • the memory 22 stores computer execution instructions 221.
  • the processor 21 is configured to run the computer execution instructions 221 to perform the method described below:
  • the second loop function is used to cyclically execute the first copy function;
  • the Map calculation function in the startup calculation function may specifically include: an input portion, a calculation portion, and an output portion, where the input portion is configured to read the data records that the GPU needs to process from the video memory of the GPU, the calculation portion is configured to process the to-be-processed data records read by the input portion, and the output portion is configured to store the calculation results of the data records processed by the calculation portion into the video memory of the GPU.
  • the Map calculation function in the startup calculation function may be used to perform parallel processing on the plurality of data records that the GPU is responsible for processing, where the multiple cores of the GPU each process at least one of the plurality of data records that the GPU is responsible for processing.
  • the input address of the input portion includes an input address for each core of the GPU, so that each core of the GPU reads the data record it needs to process from the video memory of the GPU according to its own input address, and the output address of the output portion includes an output address for each core of the GPU, so that each core of the GPU stores the result of its processed data records into its own output address.
  • when the Map calculation function in the startup calculation function is used for parallel processing of the plurality of data records that the GPU is responsible for processing, the processor 21 generates the startup calculation function by performing the following steps:
  • the central node modifies an input address in the Map calculation function provided by the user to an input address of each core of the GPU to generate an input address of the input portion;
  • the central node modifies the output address in the Map calculation function provided by the user to the output address of each core of the GPU to generate the output address of the output portion;
  • the central node replaces the first loop function at the outer layer of the Map calculation function provided by the user with a third loop function, where the number of loops of the third loop function is the number M of data records that the GPU is responsible for processing;
  • the central node splits the loop in the third loop function into an outer loop and an inner loop, to divide the M data records that the GPU is responsible for processing into ⌈M/B⌉ data record blocks executed in parallel, where the number of outer loop iterations is ⌈M/B⌉, the number of inner loop iterations is B, and each core of the GPU executes one data record block;
  • the central node declares the local variables of the Map calculation function provided by the user as thread-local variables of the GPU, where each core of the GPU corresponds to one thread-local variable, and each core of the GPU reads the data records it needs to process from the video memory of the GPU through its own corresponding thread-local variable.
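The effect of the thread-local declaration can be illustrated with Python's `threading.local`, standing in for the GPU's per-thread storage. This is an analogy only, not GPU code; the ×10 computation and the host threads standing in for cores are assumptions:

```python
import threading

local = threading.local()   # each thread (standing in for a GPU core) gets its own copy
results = {}
lock = threading.Lock()

def core_work(core_id, record):
    # each 'core' keeps its working record in a thread-local variable,
    # so no other core can see or clobber it
    local.record = record
    processed = local.record * 10
    with lock:
        results[core_id] = processed

threads = [threading.Thread(target=core_work, args=(i, i + 1)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # {0: 10, 1: 20, 2: 30, 3: 40}
```

If `record` were an ordinary shared variable instead, concurrent cores could overwrite each other's working data, which is exactly what the declaration in this step prevents.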
  • the processor 21 is further configured to convert the language of the startup calculation function into a language that the GPU can recognize.
  • the communication interface 23 is specifically configured to send the first loop function, the second loop function, the second copy function, and the startup calculation function to the computing node, so that the CPU runs the first loop function, the second loop function, and the second copy function, and the GPU runs the startup calculation function.
  • the central node of this embodiment may be used to implement the technical solution of the method embodiment shown in FIG. 1 and FIG. 2 , and the specific implementation manners and technical effects are similar, and details are not described herein again.
  • the aforementioned program can be stored in a computer readable storage medium.
  • the program, when executed, performs the steps of the foregoing method embodiments; and the foregoing storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.


Abstract

Provided are a data processing method and a central node. The method comprises: according to a first loop function which is provided by a user and written using a MapReduce computing framework, generating, by a central node, a second loop function, a startup calculation function, and a second copy function, wherein the second loop function is used for cyclically invoking a first copy function to copy a plurality of data records in a computing node which need to be processed by a GPU from a memory of the computing node to a video memory of the GPU; a Map calculation function in the startup calculation function is used for instructing the GPU to process the data records which the GPU is responsible for processing; and the second copy function is used for copying the GPU's calculation results for the plurality of data records from the video memory of the GPU to the memory of the computing node. Code suitable for running on a CPU is thus automatically turned into code suitable for running on the GPU, so that the Hadoop programming framework is applicable to data processing in a hybrid cluster system.

Description

Data Processing Method and Central Node

Technical Field
Embodiments of the present invention relate to computer technologies, and in particular, to a data processing method and a central node.
Background
In systems that use large-scale clusters for big data processing, MapReduce is currently the most popular programming model.
In a homogeneous cluster system (for example, a cluster system consisting of multiple central processing units (CPUs) connected via a network), MapReduce currently uses the Hadoop programming framework. Under the Hadoop programming framework, a programmer only needs to write a Map function and a Reduce function and submit them to the Hadoop program running on the central node of the cluster system. When there is a computing task to be processed, the Hadoop program decomposes the computing task into multiple sub data blocks (splits) and sends the Map function, the Reduce function, and the sub data blocks to the computing nodes that need to perform the computation. When a computing node receives a task execution instruction, it calls the Map function to process the received sub data blocks, and the Reduce function then sorts, merges, and otherwise processes the results of the Map function and outputs the final result.
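As a concrete illustration of this flow, the classic word count can be expressed as a Map function, a shuffle/sort stage, and a Reduce function. The following is a minimal sketch in plain Python, not Hadoop API code; the function names are assumptions for illustration:

```python
from collections import defaultdict

def map_fn(record):
    # Map: emit (word, 1) for every word in one data record (a line of text)
    return [(word, 1) for word in record.split()]

def reduce_fn(key, values):
    # Reduce: sum the counts collected for one word
    return key, sum(values)

records = ["big data processing", "data processing with hadoop"]
# shuffle/sort stage: group intermediate pairs by key, as Hadoop does between phases
groups = defaultdict(list)
for record in records:
    for k, v in map_fn(record):
        groups[k].append(v)
result = dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))
print(result)  # {'big': 1, 'data': 2, 'hadoop': 1, 'processing': 2, 'with': 1}
```

In Hadoop the grouping and sorting between the two phases is performed by the framework; the programmer supplies only the Map and Reduce functions.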
However, the Hadoop programming framework in the prior art is applicable only to homogeneous cluster systems, and cannot be applied to hybrid cluster systems (for example, cluster systems mixing CPUs and graphics processing units (GPUs)) for data processing.
Summary
Embodiments of the present invention provide a data processing method and a central node, so that the Hadoop programming framework is applicable to data processing in a hybrid cluster system.
A first aspect of the present invention provides a data processing method, applied to a Hadoop cluster system, where the Hadoop cluster system includes a computing node and a central node, a Hadoop program runs on the central node, the central node performs MapReduce operation management on the computing node, and the computing node includes a CPU and a GPU having N cores. The method includes:
receiving, by the central node, a first loop function written by a user according to a MapReduce computing framework provided by the Hadoop program, where the first loop function includes a Map calculation function provided by the user, and the first loop function is used to cyclically call the Map calculation function provided by the user;
replacing, by the central node by using the running Hadoop program, the Map calculation function in the first loop function with a first copy function to generate a second loop function, where the first copy function is used to copy a plurality of data records in the computing node that need to be processed by the GPU from the memory of the computing node to the video memory of the GPU, and the second loop function is used to cyclically execute the first copy function;
generating, by the central node, a startup calculation function according to the first loop function, where the Map calculation function in the startup calculation function is used to instruct the GPU to process the data records that the GPU is responsible for processing; and
generating, by the central node, a second copy function, where the second copy function is used to copy the GPU's calculation results for the plurality of data records from the video memory of the GPU to the memory of the computing node.
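Taken together, the generated pieces form a copy-in / launch / copy-out pipeline on the computing node. The following schematic simulation uses Python lists standing in for host memory and video memory; all names are assumptions for illustration, not the generated code itself:

```python
def run_generated_pipeline(host_records, map_compute):
    """Simulates the generated code path: the second loop function copies
    records to 'video memory', the startup calculation function processes
    them there, and the second copy function copies the results back."""
    video_in, video_out = [], []

    # second loop function: cyclically call the first copy function per record
    for record in host_records:
        video_in.append(record)          # first copy function: host -> video memory

    # startup calculation function: the GPU processes the records it holds
    for record in video_in:
        video_out.append(map_compute(record))

    # second copy function: video memory -> host memory
    host_results = list(video_out)
    return host_results

print(run_generated_pipeline([1, 2, 3], lambda x: x * 2))  # [2, 4, 6]
```

In the actual embodiment the middle stage runs on the GPU's cores in parallel, while the two copy stages run on the CPU.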
With reference to the first aspect of the present invention, in a first possible implementation manner of the first aspect, the Map calculation function in the startup calculation function includes: an input portion, a calculation portion, and an output portion, where the input portion is used to read, from the video memory of the GPU, the data records that the GPU needs to process, the calculation portion is used to process the to-be-processed data records read by the input portion, and the output portion is used to store the calculation results of the data records processed by the calculation portion into the video memory of the GPU.
With reference to the first aspect of the present invention and the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the Map calculation function in the startup calculation function is used to process in parallel the plurality of data records that the GPU is responsible for processing, where the multiple cores of the GPU each process at least one of the plurality of data records that the GPU is responsible for processing.
With reference to the second possible implementation manner of the first aspect of the present invention, in a third possible implementation manner of the first aspect, when the Map calculation function in the startup calculation function is used to process in parallel the plurality of data records that the GPU is responsible for processing, the input address of the input portion includes an input address for each core of the GPU, so that each core of the GPU reads the data record it needs to process from the video memory of the GPU according to its own input address, and the output address of the output portion includes an output address for each core of the GPU, so that each core of the GPU stores the result of its processed data record into its own output address.
With reference to the third possible implementation manner of the first aspect of the present invention, in a fourth possible implementation manner of the first aspect, the generating, by the central node, a startup calculation function includes:
modifying, by the central node, the input address in the Map calculation function provided by the user to the input address of each core of the GPU to generate the input address of the input portion;
modifying, by the central node, the output address in the Map calculation function provided by the user to the output address of each core of the GPU to generate the output address of the output portion;
replacing, by the central node, the first loop function at the outer layer of the Map calculation function provided by the user with a third loop function, where the number of loops of the third loop function is the number M of data records that the GPU is responsible for processing;
splitting, by the central node, the loop in the third loop function into an outer loop and an inner loop, to divide the M data records that the GPU is responsible for processing into ⌈M/B⌉ data record blocks executed in parallel, where the number of outer loop iterations is ⌈M/B⌉, the number of inner loop iterations is B, and each core of the GPU executes one data record block; and
declaring, by the central node, the local variables of the Map calculation function provided by the user as thread-local variables of the GPU, where each core of the GPU corresponds to one thread-local variable, and each core of the GPU reads the data records it needs to process from the video memory of the GPU through its own corresponding thread-local variable.
With reference to the first aspect of the present invention and the first to fourth possible implementation manners of the first aspect, in a fifth possible implementation manner of the first aspect, the method further includes: converting, by the computing node, the language of the startup calculation function into a language that the GPU can recognize.
With reference to the first aspect of the present invention and the first to fifth possible implementation manners of the first aspect, in a sixth possible implementation manner of the first aspect, the method further includes:
sending, by the central node, the first loop function, the second loop function, the second copy function, and the startup calculation function to the computing node, so that the CPU runs the first loop function, the second loop function, and the second copy function, and the GPU runs the startup calculation function.
A second aspect of the present invention provides a central node, including:
a receiving module, configured to receive a first loop function written by a user according to a MapReduce computing framework provided by a Hadoop program, where the first loop function includes a Map calculation function provided by the user, and the first loop function is used to cyclically call the Map calculation function provided by the user;
a first generating module, configured to replace, by using the running Hadoop program, the Map calculation function in the first loop function with a first copy function to generate a second loop function, where the first copy function is used to copy a plurality of data records in the computing node that need to be processed by the GPU from the memory of the computing node to the video memory of the GPU, and the second loop function is used to cyclically execute the first copy function;
a second generating module, configured to generate a startup calculation function according to the first loop function, where the Map calculation function in the startup calculation function is used to instruct the GPU to process the data records that the GPU is responsible for processing; and
a third generating module, configured to generate a second copy function, where the second copy function is used to copy the GPU's calculation results for the plurality of data records from the video memory of the GPU to the memory of the computing node.
With reference to the second aspect of the present invention, in a first possible implementation manner of the second aspect, the Map calculation function in the startup calculation function includes: an input portion, a calculation portion, and an output portion, where the input portion is used to read, from the video memory of the GPU, the data records that the GPU needs to process, the calculation portion is used to process the to-be-processed data records read by the input portion, and the output portion is used to store the calculation results of the data records processed by the calculation portion into the video memory of the GPU.
With reference to the second aspect of the present invention and the first possible implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the Map calculation function in the startup calculation function is used to process in parallel the plurality of data records that the GPU is responsible for processing, where the multiple cores of the GPU each process at least one of the plurality of data records that the GPU is responsible for processing.
With reference to the second possible implementation manner of the second aspect of the present invention, in a third possible implementation manner of the second aspect, when the Map calculation function in the startup calculation function is used to process in parallel the plurality of data records that the GPU is responsible for processing, the input address of the input portion includes an input address for each core of the GPU, so that each core of the GPU reads the data record it needs to process from the video memory of the GPU according to its own input address, and the output address of the output portion includes an output address for each core of the GPU, so that each core of the GPU stores the result of its processed data record into its own output address.
With reference to the third possible implementation manner of the second aspect of the present invention, in a fourth possible implementation manner of the second aspect, the second generating module is specifically configured to:
modify the input address in the Map calculation function provided by the user to the input address of each core of the GPU to generate the input address of the input portion;
modify the output address in the Map calculation function provided by the user to the output address of each core of the GPU to generate the output address of the output portion;
replace the first loop function at the outer layer of the Map calculation function provided by the user with a third loop function, where the number of loops of the third loop function is the number M of data records that the GPU is responsible for processing;
split the loop in the third loop function into an outer loop and an inner loop, to divide the M data records that the GPU is responsible for processing into ⌈M/B⌉ data record blocks executed in parallel, where the number of outer loop iterations is ⌈M/B⌉, the number of inner loop iterations is B, and each core of the GPU executes one data record block; and
declare the local variables of the Map calculation function provided by the user as thread-local variables of the GPU, where each core of the GPU corresponds to one thread-local variable, and each core of the GPU reads the data records it needs to process from the video memory of the GPU through its own corresponding thread-local variable.
With reference to the second aspect of the present invention and the first to fourth possible implementation manners of the second aspect, in a fifth possible implementation manner of the second aspect, the central node further includes:
a conversion module, configured to convert the language of the startup calculation function into a language that the GPU can recognize.
With reference to the second aspect of the present invention and the first to fifth possible implementation manners of the second aspect, in a sixth possible implementation manner of the second aspect, the central node further includes:
a sending module, configured to send the first loop function, the second loop function, the second copy function, and the startup calculation function to the computing node, so that the CPU runs the first loop function, the second loop function, and the second copy function, and the GPU runs the startup calculation function.
According to the data processing method and the central node in the embodiments of the present invention, the central node generates a second loop function, a startup calculation function, and a second copy function according to a first loop function provided by a user and written using the MapReduce computing framework, where the second loop function is used to cyclically call a first copy function to copy a plurality of data records in the computing node that need to be processed by the GPU from the memory of the computing node to the video memory of the GPU, the Map calculation function in the startup calculation function is used to instruct the GPU to process the data records that the GPU is responsible for processing, and the second copy function is used to copy the GPU's calculation results for the plurality of data records from the video memory of the GPU to the memory of the computing node. Code suitable for running on a CPU is thereby automatically turned into code suitable for running on the GPU, so that the Hadoop programming framework is applicable to data processing in a hybrid cluster system.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from these accompanying drawings without creative effort.
FIG. 1 is a flowchart of a data processing method according to Embodiment 1 of the present invention;

FIG. 2 is a flowchart of a data processing method according to Embodiment 2 of the present invention;

FIG. 3 is a schematic structural diagram of a central node according to Embodiment 3 of the present invention;

FIG. 4 is a schematic structural diagram of a central node according to Embodiment 4 of the present invention;

FIG. 5 is a schematic structural diagram of a central node according to Embodiment 5 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
An embodiment of the present invention provides a data processing method applied to a Hadoop cluster system. The Hadoop cluster system includes computing nodes and a central node; the central node runs a Hadoop program and performs MapReduce operation management on the computing nodes. A computing node contains a CPU and a GPU with N cores; that is, the Hadoop cluster system in this embodiment of the present invention is a hybrid cluster system, and both the CPU and the GPU of a computing node can run MapReduce programs to process data. FIG. 1 is a flowchart of a data processing method according to Embodiment 1 of the present invention. As shown in FIG. 1, the method of this embodiment may include the following steps:
Step 101: The central node receives a first loop function written by a user according to the MapReduce computing framework provided by the Hadoop program, where the first loop function includes a user-provided Map calculation function and is used to cyclically call that user-provided Map calculation function.
The user-provided first loop function is written in the existing Hadoop style and can run directly on the CPU of a computing node. Under the Hadoop mechanism, a computing task is divided into multiple data blocks (splits), and the data inside a split is further divided into multiple data records. The first loop function cyclically calls the user-provided Map calculation function, which processes each data record in sequence; the CPU completes the computing task by calling the user-provided Map calculation function in a loop.
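As a rough illustrative sketch (not the patent's actual code; the names `first_loop`, `user_map`, and the toy records are hypothetical), the first loop function behaves like a serial loop over the records of one split:

```python
def first_loop(split_records, user_map):
    """Serial first loop function: calls the user-provided Map
    calculation function once per data record in the split."""
    results = []
    for record in split_records:       # one record at a time, on the CPU
        key, value = record
        results.append(user_map(key, value))
    return results

# A toy user-provided Map calculation function (hypothetical):
# emit the length of each record's value.
def user_map(key, value):
    return (key, len(value))

out = first_loop([("k1", "hadoop"), ("k2", "gpu")], user_map)
```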
Step 102: Using the running Hadoop program, the central node replaces the Map calculation function in the first loop function with a first copy function to generate a second loop function, where the first copy function is used to copy the multiple data records in the computing node that need to be processed by the GPU from the memory of the computing node into the video memory of the GPU, and the second loop function is used to execute the first copy function cyclically.
In the scenario of this embodiment of the present invention, the GPU and the CPU process the computing task cooperatively. However, the first loop function is written for the CPU's runtime environment: it can run only on the CPU, not on the GPU. The method of this embodiment therefore generates code that can run on the GPU (hereinafter GPU code), and the GPU code can call the Map calculation function to process data records.
When the CPU executes the Map calculation function, it must obtain the function's variable values, which are declared and defined in the Java language on the CPU side and stored in memory. The variables of the Map function mainly include a key and a value. Through the variable declarations, the CPU side reads data from memory for processing. If the user-provided Map calculation function were copied to the GPU and run without any modification, then whenever the function needed a variable during execution, the management program on the GPU would search the GPU's variable list for it. Because the variable is declared only on the CPU, only the Java program executing on the CPU side can access it; the Map calculation function on the GPU therefore cannot find the variable, and the Map calculation function cannot execute.
The above problem shows that the GPU cannot directly access the memory of the computing node. To run the Map calculation function on the GPU, the data in memory must first be copied into the GPU's video memory, which the GPU can access directly. Therefore, the central node replaces the Map calculation function in the first loop function with a first copy function to generate a second loop function: the first copy function copies the data records that the GPU needs to process from the memory of the computing node into the GPU's video memory, and the second loop function executes the first copy function cyclically. The first copy function copies one data record at a time, and by calling it repeatedly the second loop function copies all the data records that the GPU needs to process into the GPU's video memory.
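A minimal sketch of this transformation, with a plain Python list standing in for the GPU's video memory (all names hypothetical): the Map call inside the loop body is replaced by a copy, so the second loop function only stages records into device memory instead of processing them.

```python
def first_copy(record, gpu_vram):
    """First copy function: copies one data record from host
    memory into the (simulated) GPU video memory."""
    gpu_vram.append(record)

def second_loop(host_records, gpu_vram):
    """Second loop function: same loop structure as the first loop
    function, but each iteration calls the copy function instead of
    the user-provided Map calculation function."""
    for record in host_records:
        first_copy(record, gpu_vram)

vram = []                                   # stand-in for GPU video memory
second_loop([("k1", "a"), ("k2", "b")], vram)
```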
Step 103: The central node generates a startup calculation function according to the first loop function, where the Map calculation function in the startup calculation function is used to instruct the GPU to process the data records that the GPU is responsible for.
The central node generates a startup calculation function for the GPU according to the first loop function submitted by the user. The startup calculation function includes a Map calculation function, and the GPU processes data records by calling this Map calculation function. The Map calculation function in the startup calculation function may include an input part, a calculation part, and an output part, where the input part is used to read the data records to be processed from the GPU's video memory, the calculation part is used to process the data records read by the input part, and the output part is used to store the computation results of the processed data records into the GPU's video memory.
Before a computing node processes data records, it first executes the second loop function to copy all the data records that the GPU needs to process from the computing node's memory into the GPU's video memory. When the computing node executes the Map calculation function of the startup calculation function, the input part first accesses the GPU's video memory to read the data records to be processed; the calculation part then processes the records read by the input part by calling the Map calculation function; and after the calculation part finishes processing a data record, the output part stores the processing result in the GPU's video memory.
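The three-part structure described above can be sketched as a host-side simulation (not OpenCL code; `work_buff`, `result_buff`, and the index argument are hypothetical names for device buffers and slots):

```python
def startup_map(work_buff, result_buff, idx, user_map):
    """One invocation of the startup calculation function's Map part."""
    # Input part: read the record to process from (simulated) GPU video memory.
    key, value = work_buff[idx]
    # Calculation part: process the record with the user's Map function.
    result = user_map(key, value)
    # Output part: store the result back into (simulated) GPU video memory.
    result_buff[idx] = result

work = [("k", "abc")]                      # record already staged in video memory
res = [None]                               # result slot in video memory
startup_map(work, res, 0, lambda k, v: (k, len(v)))
```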
When the GPU needs to process multiple data records, the calculation part can process them in parallel. Suppose all N cores of the GPU are idle; the N cores can then process multiple data records in parallel. For example, if there are 2N data records in total, each core can process two of them, with all N cores working simultaneously, and this parallel processing improves efficiency. If there are only a few data records to process, the GPU can instead call the Map function in a loop multiple times.
Step 104: The central node generates a second copy function, where the second copy function is used to copy the GPU's computation results for the multiple data records from the GPU's video memory into the memory of the computing node.
After the GPU finishes processing the data records, their computation results must also be copied from the GPU's video memory back into the memory of the computing node. The central node therefore also generates a second copy function for this purpose. After the computing node has processed all the data records, the Reduce function sorts, merges, and otherwise processes the results of the Map calculation function; the central node therefore also needs to send the Reduce function to the computing node.
After generating the second loop function, the startup calculation function, and the second copy function, the central node sends the first loop function, the second loop function, the second copy function, and the startup calculation function to the computing node. Specifically, the central node sends the first loop function, the second loop function, and the second copy function to the CPU so that the CPU runs them, and sends the startup calculation function to the GPU so that the GPU runs it.
When the central node receives a computing task input by the user, it divides the task into multiple sub-data blocks, assigns each sub-data block a corresponding computing node according to a preset scheduling policy, and sends each sub-data block to its computing node, which stores it in memory upon receipt. When a computing node contains a GPU, its GPU and CPU can cooperatively process the received sub-data block; when it does not, its CPU processes the sub-data block alone.
In the method of this embodiment, when the CPU and the GPU use different programming languages, the computing node is further configured to convert the language of the startup calculation function into a language that the GPU can recognize. For example, if C++ runs on the CPU and Java runs on the GPU, the computing node needs to convert the startup calculation function from C++ into Java.
In this embodiment, the central node generates a second loop function, a startup calculation function, and a second copy function from the user-provided first loop function written with the MapReduce computing framework. The second loop function cyclically calls the first copy function to copy the data records that need GPU processing from the computing node's memory into the GPU's video memory; the Map calculation function in the startup calculation function instructs the GPU to process the data records it is responsible for; and the second copy function copies the GPU's computation results from the GPU's video memory back into the computing node's memory. Code suitable for running on the CPU is thus automatically turned into code suitable for running on the GPU, making the Hadoop programming framework applicable to data processing on a hybrid cluster system. Because the central node can automatically generate GPU-suitable code from the user-provided first loop function, the existing Hadoop style of programming need not change (the Map and Reduce functions need not be rewritten), which facilitates the maintenance and porting of legacy code.
Under the existing Hadoop mechanism, a computing task is decomposed into multiple sub-data blocks (splits), and the Map function runs in parallel across splits. A split is typically 64 MB of data, so this parallelism is coarse-grained and ill-suited to the structure of a GPU, which usually has many cores that can run in parallel. A split can therefore be divided at a finer granularity to fully exploit the GPU's structure. Specifically, the multiple data records in a split assigned to the GPU are distributed across the GPU's cores for simultaneous parallel processing, which can further increase the processing speed of the computing node.
FIG. 2 is a flowchart of a data processing method according to Embodiment 2 of the present invention. Building on Embodiment 1, this embodiment describes in detail how the central node generates the startup calculation function when the GPU processes in parallel the multiple data records it is responsible for. In this embodiment, the Map calculation function in the startup calculation function is used to process in parallel the multiple data records that the GPU is responsible for, with L cores of the GPU each processing at least one of those data records, where L is an integer satisfying 2 ≤ L ≤ N and N is the total number of cores in the GPU. As shown in FIG. 2, the method of this embodiment may include the following steps:
Step 201: The central node modifies the input address in the user-provided Map calculation function into an input address for each core of the GPU.
When the Map calculation function in the startup calculation function processes in parallel the multiple data records that the GPU is responsible for, the input address of the function's input part includes an input address for each core of the GPU, so that each core reads the data records it needs to process from the GPU's video memory according to its own input address.
The user-provided Map calculation function has only one input and one output; its input address therefore needs to be modified into an input address for each core of the GPU. The input address of core i can be expressed as work-buff[index1[i]], i = 0, 1, ..., L-1, where work-buff denotes the address in video memory of the data that the GPU needs to process, and index1[i] indicates that the data at that address is processed by the i-th core. When the multiple data records that the GPU is responsible for are processed in parallel, the startup calculation function must run on every core of the GPU: the i-th GPU core executes its corresponding startup calculation function to read out and process the data record at address work-buff[index1[i]], with each core corresponding to one process.
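The per-core addressing can be pictured with a short sketch (a sequential simulation of what the L parallel core instances do; the buffers, index arrays, and the `upper()` stand-in for the Map function are all hypothetical, chosen to match the `work-buff[index1[i]]` / `Result-buff[index2[i]]` notation):

```python
work_buff = ["r0", "r1", "r2", "r3"]     # records staged in video memory
index1 = [0, 1, 2, 3]                    # input slot assigned to core i
index2 = [0, 1, 2, 3]                    # output slot assigned to core i
result_buff = [None] * 4

def core_instance(i):
    """What the startup calculation function on core i does: read its
    own input address, process, and write its own output address."""
    record = work_buff[index1[i]]
    result_buff[index2[i]] = record.upper()   # stand-in for the Map function

for i in range(4):                       # the real cores run concurrently
    core_instance(i)
```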
Step 202: The central node modifies the output address in the user-provided Map calculation function into an output address for each core of the GPU to generate the output address of the output part.
When the Map calculation function in the startup calculation function processes in parallel the multiple data records that the GPU is responsible for, the output address of the output part includes an output address for each core of the GPU, so that each core stores the results of the data records it has processed at its own output address. The output address of core i can be expressed as Result-buff[index2[i]], i = 0, 1, ..., L-1.
Step 203: The central node replaces the first loop function enclosing the user-provided Map calculation function with a third loop function, where the number of iterations of the third loop function is M, the number of data records that the GPU is responsible for processing.
Step 204: The central node splits the loop in the third loop function into an outer loop and an inner loop, so as to divide the M data records that the GPU is responsible for processing into ⌈M/B⌉ data record blocks that execute in parallel, where the outer loop iterates ⌈M/B⌉ times, the inner loop iterates B times, and each core of the GPU executes one data record block.
Step 205: The central node declares the local variables of the user-provided Map calculation function as thread-local variables of the GPU, where each core of the GPU corresponds to one thread-local variable, and each core reads the data records it needs to process from the GPU's video memory through its own thread-local variable.
Steps 203 to 205 constitute the specific process by which the central node generates the calculation part of the startup calculation function when the GPU processes in parallel the data records it is responsible for.
After the first loop function calls the user-provided Map calculation function to finish one data record, it checks whether any data records remain to be processed; if so, it continues calling the user-provided Map calculation function until all data records have been processed. The first loop function is thus a serial Map calculation function. In this embodiment, the data records must be distributed across multiple GPU cores for processing, so the first loop function cannot be used directly: the serial Map calculation function must be converted into parallel OpenCL kernels, where an OpenCL kernel is a code segment of an OpenCL program, wrapped in function form, that executes in parallel on the GPU. Specifically, the central node replaces the first loop function enclosing the user-provided Map calculation function with a third loop function whose number of iterations is M, the number of data records the GPU is responsible for processing; the loop conditions of the first loop function and the third loop function differ.
After replacing the first loop function outside the Map calculation function with the third loop function, the central node splits the loop in the third loop function into an outer loop and an inner loop, so as to divide the M data records that the GPU is responsible for processing into ⌈M/B⌉ data record blocks that execute in parallel; the outer loop iterates ⌈M/B⌉ times and the inner loop iterates B times. Treating the inner loop as one OpenCL kernel, a total of ⌈M/B⌉ OpenCL kernels are generated; each core of the GPU runs one OpenCL kernel, and the ⌈M/B⌉ OpenCL kernels execute in parallel.
Each core of the GPU executes one data record block, with ⌈M/B⌉ cores executing in parallel. The inner loop iterates B times, i.e., each core processes B data records by calling the Map calculation function B times. When M/B is an integer, the M data records divide exactly into M/B data record blocks containing equal numbers of records. When M/B is not an integer, the number of blocks is M/B rounded up, and the last data record block holds the remainder of M/B records, so its size differs from that of the other blocks. For example, when M equals 11 and B equals 5, 11/5 is 2 with a remainder of 1, so the data records are divided into 3 data record blocks that execute in parallel: two cores each process 5 data records, and a third core processes the remaining 1 data record.
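The block-partitioning arithmetic can be checked with a short sketch (an illustrative helper, not the patent's code): M records are split into ⌈M/B⌉ blocks of at most B records each, one block per GPU core, with the last block holding the remainder.

```python
import math

def partition(records, B):
    """Split M records into ceil(M/B) data record blocks of at most
    B records each, one block per GPU core; the last block may be
    smaller than the others."""
    M = len(records)
    num_blocks = math.ceil(M / B)
    return [records[i * B:(i + 1) * B] for i in range(num_blocks)]

blocks = partition(list(range(11)), 5)   # M = 11, B = 5
# -> 3 blocks: two of 5 records each, plus one holding the remainder of 1
```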
The variables of the user-provided Map calculation function are local variables; when the CPU executes the function, such a variable can be shared by all data records. In this embodiment, however, each core's variables can be shared only by the data records that core processes, not by other cores. The central node therefore needs to declare the local variables of the user-provided Map calculation function as thread-local variables of the GPU.
In the prior art, parallelism in the Map phase exists only between splits, and the granularity is coarse. The method of this embodiment changes the serial execution mode of the Map function in the existing Hadoop mechanism into a parallel execution mode: the parallelism between splits is preserved, while parallelism between the data records within a split is added. That is, a split running on the GPU is further divided into multiple data record blocks that execute in parallel, which strengthens the parallelism of the computing node and increases its computation rate.
FIG. 3 is a schematic structural diagram of a central node according to Embodiment 3 of the present invention. As shown in FIG. 3, the central node of this embodiment includes a receiving module 11, a first generation module 12, a second generation module 13, and a third generation module 14.
The receiving module 11 is configured to receive a first loop function written by a user according to the MapReduce computing framework provided by the Hadoop program, where the first loop function includes a user-provided Map calculation function and is used to cyclically call that user-provided Map calculation function.
The first generation module 12 is configured to use the running Hadoop program to replace the Map calculation function in the first loop function with a first copy function to generate a second loop function, where the first copy function is used to copy the multiple data records in the computing node that need to be processed by the GPU from the memory of the computing node into the video memory of the GPU, and the second loop function is used to execute the first copy function cyclically.
The second generation module 13 is configured to generate a startup calculation function according to the first loop function, where the Map calculation function in the startup calculation function is used to instruct the GPU to process the data records that the GPU is responsible for.
The third generation module 14 is configured to generate a second copy function, where the second copy function is used to copy the GPU's computation results for the multiple data records from the GPU's video memory into the memory of the computing node.
The Map calculation function in the startup calculation function may include an input part, a calculation part, and an output part, where the input part is used to read the data records that the GPU needs to process from the GPU's video memory, the calculation part is used to process the data records read by the input part, and the output part is used to store the computation results of the processed data records into the GPU's video memory.
The central node of this embodiment may be used to execute the technical solution of the method embodiment shown in FIG. 1; the implementation and technical effects are similar and are not described again here.
FIG. 4 is a schematic structural diagram of a central node according to Embodiment 4 of the present invention. As shown in FIG. 4, in addition to the components of the central node shown in FIG. 3, the central node of this embodiment further includes a conversion module 15 and a sending module 16. The conversion module 15 is configured to convert the language of the startup calculation function into a language that the GPU can recognize. The sending module 16 is configured to send the first loop function, the second loop function, the second copy function, and the startup calculation function to the computing node, so that the CPU runs the first loop function, the second loop function, and the second copy function, and the GPU runs the startup calculation function.
In this embodiment, the Map calculation function in the startup calculation function is used to process in parallel the multiple data records that the GPU is responsible for, with multiple cores of the GPU each processing at least one of those data records.
When the Map calculation function in the startup calculation function processes in parallel the multiple data records that the GPU is responsible for, the input address of the input part includes an input address for each core of the GPU, so that each core reads the data records it needs to process from the GPU's video memory according to its own input address, and the output address of the output part includes an output address for each core of the GPU, so that each core stores the results of the data records it has processed at its own output address.
When the Map calculation function in the startup calculation function processes in parallel the multiple data records that the GPU is responsible for, the second generation module is specifically configured to perform the following operations:
modify the input address in the user-provided Map calculation function into an input address for each core of the GPU to generate the input address of the input part; modify the output address in the user-provided Map calculation function into an output address for each core of the GPU to generate the output address of the output part;
将所述用户提供的Map计算函数外层的所述第一循环函数替换为第三循环函数,所述第三循环函数的循环次数为所述GPU负责处理的数据记录的数目M;将所述第三循环函数中的循环拆分为外层循环和内层循环,以将所述GPU负责处理的M个数据记录划分为
Figure PCTCN2015075703-appb-000012
个并行执行的数据记录块,其中,所述外层循环的次数为
Figure PCTCN2015075703-appb-000013
所述内层循环的次数为B,所述GPU的每个核执行一个数据记录块;
Substituting the first loop function of the outer layer of the Map calculation function provided by the user with a third loop function, the number of loops of the third loop function being the number M of data records that the GPU is responsible for processing; The loop in the third loop function is split into an outer loop and an inner loop to divide the M data records that the GPU is responsible for processing into
Figure PCTCN2015075703-appb-000012
Data record blocks executed in parallel, wherein the number of times the outer loop is
Figure PCTCN2015075703-appb-000013
The number of times the inner layer is looped is B, and each core of the GPU executes one data recording block;
将所述用户提供的Map计算函数的局部变量声明为所述GPU的线程局部变量,其中,所述GPU的每个核对应一个线程局部变量,所述GPU的每个核通过自己对应的线程局部变量从所述GPU的显卡中读取需要处理的数据记录。Declaring a local variable of the Map calculation function provided by the user as a thread local variable of the GPU, wherein each core of the GPU corresponds to a thread local variable, and each core of the GPU passes its own corresponding thread local The variable reads the data record that needs to be processed from the graphics card of the GPU.
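The outer/inner loop split described above can be sketched in Python as follows (a simplified host-side model of the transformation; the function name `split_loop` and the list-of-blocks representation are assumptions for illustration):

```python
import math

def split_loop(records, B):
    """Split a single loop over M records into an outer loop of
    ceil(M/B) iterations and an inner loop of B iterations, so the
    records form ceil(M/B) blocks, one block per GPU core."""
    M = len(records)
    num_blocks = math.ceil(M / B)
    blocks = []
    for outer in range(num_blocks):      # outer loop: ceil(M/B) times
        block = []
        for inner in range(B):           # inner loop: B times
            i = outer * B + inner
            if i < M:                    # guard the last, possibly partial block
                block.append(records[i])
        blocks.append(block)
    return blocks

print(split_loop(list(range(10)), B=4))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Each of the resulting ⌈M/B⌉ blocks would be handed to one GPU core; the guard on the last block handles the case where B does not divide M evenly.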
The central node of this embodiment can be used to execute the technical solutions of the method embodiments shown in FIG. 1 and FIG. 2; the implementation and technical effects are similar and are not repeated here.
FIG. 5 is a schematic structural diagram of a central node according to Embodiment 5 of the present invention. As shown in FIG. 5, the central node 200 of this embodiment includes a processor 21, a memory 22, a communication interface 23, and a system bus 24. The memory 22 and the communication interface 23 are connected to and communicate with the processor 21 through the system bus 24; the communication interface 23 is used to communicate with other devices; the memory 22 stores computer-executable instructions 221; and the processor 21 is configured to run the computer-executable instructions 221 to perform the following method:
receive a first loop function written by a user according to the MapReduce computing framework provided by the Hadoop program, where the first loop function includes a user-provided Map calculation function and is used to cyclically invoke that Map calculation function;
use the running Hadoop program to replace the Map calculation function in the first loop function with a first copy function to generate a second loop function, where the first copy function is used to copy the multiple data records that require GPU processing from the memory of the computing node to the video memory of the GPU, and the second loop function is used to cyclically execute the first copy function;
generate a startup calculation function according to the first loop function, where the Map calculation function in the startup calculation function is used to instruct the GPU to process the data records that the GPU is responsible for;
generate a second copy function, which is used to copy the GPU's calculation results for the multiple data records from the GPU's video memory to the memory of the computing node.
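To make the roles of the generated functions concrete, here is a minimal Python sketch of how they would fit together at runtime on one computing node (the dictionaries `host_mem` and `gpu_mem` and the squaring Map are stand-ins; real code would move data with CUDA or OpenCL copies and launch an actual kernel):

```python
# Sketch of the generated function pipeline on one computing node.

host_mem = {"records": [1, 2, 3, 4], "results": None}   # computing node memory
gpu_mem = {"records": None, "results": None}            # GPU video memory

def second_loop_function():
    # Second loop function: cyclically executes the first copy function;
    # here collapsed into one bulk host-memory-to-video-memory copy.
    gpu_mem["records"] = list(host_mem["records"])

def startup_calculation_function():
    # Startup calculation function: the GPU-side Map processes its records.
    gpu_mem["results"] = [r * r for r in gpu_mem["records"]]  # stand-in Map

def second_copy_function():
    # Second copy function: video memory back to computing node memory.
    host_mem["results"] = list(gpu_mem["results"])

second_loop_function()
startup_calculation_function()
second_copy_function()
print(host_mem["results"])  # [1, 4, 9, 16]
```

The CPU-side second loop function stages the records into video memory, the GPU-side startup calculation function applies the Map to them, and the second copy function returns the results to node memory, mirroring the three-step flow described in the text.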
The Map calculation function in the startup calculation function may specifically include an input part, a calculation part, and an output part, where the input part is used to read the data records the GPU needs to process from the GPU's video memory, the calculation part is used to process the data records read by the input part, and the output part is used to store the calculation results of the processed data records into the GPU's video memory.
Optionally, the Map calculation function in the startup calculation function may be used to process in parallel the multiple data records that the GPU is responsible for, where each of the GPU's cores processes at least one of those data records. In that case, the input address of the input part includes an input address for each core of the GPU, so that each core reads the data records it needs to process from the GPU's video memory according to its own input address, and the output address of the output part includes an output address for each core of the GPU, so that each core stores the results of the processed data records to its own output address.
When the Map calculation function in the startup calculation function is used to process in parallel the multiple data records that the GPU is responsible for, the processor 21 generates the startup calculation function through the following steps:

the central node modifies the input address in the user-provided Map calculation function to the input address of each core of the GPU, to generate the input address of the input part;

the central node modifies the output address in the user-provided Map calculation function to the output address of each core of the GPU, to generate the output address of the output part;

the central node replaces the first loop function enclosing the user-provided Map calculation function with a third loop function whose loop count is M, the number of data records the GPU is responsible for processing;

the central node splits the loop in the third loop function into an outer loop and an inner loop, so that the M data records the GPU is responsible for are divided into ⌈M/B⌉ data record blocks executed in parallel, where the outer loop runs ⌈M/B⌉ times, the inner loop runs B times, and each core of the GPU executes one data record block;

the central node declares the local variables of the user-provided Map calculation function as thread-local variables of the GPU, where each core of the GPU corresponds to one thread-local variable through which it reads the data records it needs to process from the GPU's video memory.
Optionally, the processor 21 is further configured to convert the language of the startup calculation function into a language that the GPU can recognize.
In this embodiment, the communication interface 23 may specifically be used to send the first loop function, the second loop function, the second copy function, and the startup calculation function to the computing node, so that the CPU runs the first loop function, the second loop function, and the second copy function, and the GPU runs the startup calculation function.
The central node of this embodiment can be used to execute the technical solutions of the method embodiments shown in FIG. 1 and FIG. 2; the implementation and technical effects are similar and are not repeated here.
A person of ordinary skill in the art will understand that all or part of the steps of the foregoing method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the foregoing method embodiments; the aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (14)

  1. A data processing method, applied to a Hadoop cluster system, the Hadoop cluster system including a computing node and a central node, where a Hadoop program runs on the central node, the central node performs MapReduce operation management on the computing node, and the computing node contains a CPU and a GPU having N cores, the method comprising:
    receiving, by the central node, a first loop function written by a user according to the MapReduce computing framework provided by the Hadoop program, where the first loop function includes a user-provided Map calculation function and is used to cyclically invoke that Map calculation function;
    replacing, by the central node using the running Hadoop program, the Map calculation function in the first loop function with a first copy function to generate a second loop function, where the first copy function is used to copy the multiple data records in the computing node that require GPU processing from the memory of the computing node to the video memory of the GPU, and the second loop function is used to cyclically execute the first copy function;
    generating, by the central node, a startup calculation function according to the first loop function, where the Map calculation function in the startup calculation function is used to instruct the GPU to process the data records that the GPU is responsible for;
    generating, by the central node, a second copy function, which is used to copy the GPU's calculation results for the multiple data records from the GPU's video memory to the memory of the computing node.
  2. The method according to claim 1, wherein the Map calculation function in the startup calculation function includes an input part, a calculation part, and an output part, where the input part is used to read the data records the GPU needs to process from the GPU's video memory, the calculation part is used to process the data records read by the input part, and the output part is used to store the calculation results of the processed data records into the GPU's video memory.
  3. The method according to claim 1 or 2, wherein the Map calculation function in the startup calculation function is used to process in parallel the multiple data records that the GPU is responsible for, where each of the GPU's cores processes at least one of those data records.
  4. The method according to claim 3, wherein, when the Map calculation function in the startup calculation function is used to process the GPU's data records in parallel, the input address of the input part includes an input address for each core of the GPU, so that each core reads the data records it needs to process from the GPU's video memory according to its own input address, and the output address of the output part includes an output address for each core of the GPU, so that each core stores the results of the processed data records to its own output address.
  5. The method according to claim 4, wherein generating, by the central node, the startup calculation function comprises:

    modifying, by the central node, the input address in the user-provided Map calculation function to the input address of each core of the GPU, to generate the input address of the input part;

    modifying, by the central node, the output address in the user-provided Map calculation function to the output address of each core of the GPU, to generate the output address of the output part;

    replacing, by the central node, the first loop function enclosing the user-provided Map calculation function with a third loop function whose loop count is M, the number of data records the GPU is responsible for processing;

    splitting, by the central node, the loop in the third loop function into an outer loop and an inner loop, so that the M data records the GPU is responsible for are divided into ⌈M/B⌉ data record blocks executed in parallel, where the outer loop runs ⌈M/B⌉ times, the inner loop runs B times, and each core of the GPU executes one data record block;

    declaring, by the central node, the local variables of the user-provided Map calculation function as thread-local variables of the GPU, where each core of the GPU corresponds to one thread-local variable through which it reads the data records it needs to process from the GPU's video memory.
  6. The method according to any one of claims 1-5, further comprising: converting, by the computing node, the language of the startup calculation function into a language that the GPU can recognize.
  7. The method according to any one of claims 1-6, further comprising:

    sending, by the central node, the first loop function, the second loop function, the second copy function, and the startup calculation function to the computing node, so that the CPU runs the first loop function, the second loop function, and the second copy function, and the GPU runs the startup calculation function.
  8. A central node, comprising:
    a receiving module, configured to receive a first loop function written by a user according to the MapReduce computing framework provided by a Hadoop program, where the first loop function includes a user-provided Map calculation function and is used to cyclically invoke that Map calculation function;
    a first generation module, configured to use the running Hadoop program to replace the Map calculation function in the first loop function with a first copy function to generate a second loop function, where the first copy function is used to copy the multiple data records in the computing node that require GPU processing from the memory of the computing node to the video memory of the GPU, and the second loop function is used to cyclically execute the first copy function;
    a second generation module, configured to generate a startup calculation function according to the first loop function, where the Map calculation function in the startup calculation function is used to instruct the GPU to process the data records that the GPU is responsible for;
    a third generation module, configured to generate a second copy function, which is used to copy the GPU's calculation results for the multiple data records from the GPU's video memory to the memory of the computing node.
  9. The central node according to claim 8, wherein the Map calculation function in the startup calculation function includes an input part, a calculation part, and an output part, where the input part is used to read the data records the GPU needs to process from the GPU's video memory, the calculation part is used to process the data records read by the input part, and the output part is used to store the calculation results of the processed data records into the GPU's video memory.
  10. The central node according to claim 8 or 9, wherein the Map calculation function in the startup calculation function is used to process in parallel the multiple data records that the GPU is responsible for, where each of the GPU's cores processes at least one of those data records.
  11. The central node according to claim 10, wherein, when the Map calculation function in the startup calculation function is used to process the GPU's data records in parallel, the input address of the input part includes an input address for each core of the GPU, so that each core reads the data records it needs to process from the GPU's video memory according to its own input address, and the output address of the output part includes an output address for each core of the GPU, so that each core stores the results of the processed data records to its own output address.
  12. The central node according to claim 11, wherein the second generation module is specifically configured to:

    modify the input address in the user-provided Map calculation function to the input address of each core of the GPU, to generate the input address of the input part;

    modify the output address in the user-provided Map calculation function to the output address of each core of the GPU, to generate the output address of the output part;

    replace the first loop function enclosing the user-provided Map calculation function with a third loop function whose loop count is M, the number of data records the GPU is responsible for processing;

    split the loop in the third loop function into an outer loop and an inner loop, so that the M data records the GPU is responsible for are divided into ⌈M/B⌉ data record blocks executed in parallel, where the outer loop runs ⌈M/B⌉ times, the inner loop runs B times, and each core of the GPU executes one data record block;

    declare the local variables of the user-provided Map calculation function as thread-local variables of the GPU, where each core of the GPU corresponds to one thread-local variable through which it reads the data records it needs to process from the GPU's video memory.
  13. The central node according to any one of claims 8-12, further comprising:

    a conversion module, configured to convert the language of the startup calculation function into a language that the GPU can recognize.
  14. The central node according to any one of claims 8-13, further comprising:

    a sending module, configured to send the first loop function, the second loop function, the second copy function, and the startup calculation function to the computing node, so that the CPU runs the first loop function, the second loop function, and the second copy function, and the GPU runs the startup calculation function.
PCT/CN2015/075703 2014-07-14 2015-04-01 Data processing method and central node WO2016008317A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410331030.0A CN105335135B (en) 2014-07-14 2014-07-14 Data processing method and central node
CN201410331030.0 2014-07-14

Publications (1)

Publication Number Publication Date
WO2016008317A1 true WO2016008317A1 (en) 2016-01-21

Family

ID=55077886

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/075703 WO2016008317A1 (en) 2014-07-14 2015-04-01 Data processing method and central node

Country Status (2)

Country Link
CN (1) CN105335135B (en)
WO (1) WO2016008317A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110187970A (en) * 2019-05-30 2019-08-30 北京理工大学 A kind of distributed big data parallel calculating method based on Hadoop MapReduce

Families Citing this family (3)

Publication number Priority date Publication date Assignee Title
CN106611037A (en) * 2016-09-12 2017-05-03 星环信息科技(上海)有限公司 Method and device for distributed diagram calculation
CN106506266B (en) * 2016-11-01 2019-05-14 中国人民解放军91655部队 Network flow analysis method based on GPU, Hadoop/Spark mixing Computational frame
CN108304177A (en) * 2017-01-13 2018-07-20 辉达公司 Calculate the execution of figure

Citations (3)

Publication number Priority date Publication date Assignee Title
CN102169505A (en) * 2011-05-16 2011-08-31 苏州两江科技有限公司 Recommendation system building method based on cloud computing
US20120182981A1 (en) * 2011-01-13 2012-07-19 Pantech Co., Ltd. Terminal and method for synchronization
CN103279328A (en) * 2013-04-08 2013-09-04 河海大学 BlogRank algorithm parallelization processing construction method based on Haloop



Also Published As

Publication number Publication date
CN105335135B (en) 2019-01-08
CN105335135A (en) 2016-02-17

Similar Documents

Publication Publication Date Title
EP3667496B1 (en) Distributed computing system, data transmission method and device in distributed computing system
CN110262901B (en) Data processing method and data processing system
CN106991011B (en) CPU multithreading and GPU (graphics processing unit) multi-granularity parallel and cooperative optimization based method
WO2020108303A1 (en) Heterogeneous computing-based task processing method and software-hardware framework system
US9996394B2 (en) Scheduling accelerator tasks on accelerators using graphs
JP6006230B2 (en) Device discovery and topology reporting in combined CPU / GPU architecture systems
CN111309649B (en) Data transmission and task processing method, device and equipment
WO2018045753A1 (en) Method and device for distributed graph computation
US10402223B1 (en) Scheduling hardware resources for offloading functions in a heterogeneous computing system
CN110308984B (en) Cross-cluster computing system for processing geographically distributed data
WO2019047441A1 (en) Communication optimization method and system
CN104536937A (en) Big data appliance realizing method based on CPU-GPU heterogeneous cluster
JP2014206979A (en) Apparatus and method of parallel processing execution
WO2016008317A1 (en) Data processing method and central node
CN112463290A (en) Method, system, apparatus and storage medium for dynamically adjusting the number of computing containers
US11784946B2 (en) Method for improving data flow and access for a neural network processor
WO2023124543A1 (en) Data processing method and data processing apparatus for big data
CN110245024B (en) Dynamic allocation system and method for static storage blocks
US20110209007A1 (en) Composition model for cloud-hosted serving applications
US11467836B2 (en) Executing cross-core copy instructions in an accelerator to temporarily store an operand that cannot be accommodated by on-chip memory of a primary core into a secondary core
US20220036206A1 (en) Containerized distributed rules engine
US20230410428A1 (en) Hybrid gpu-cpu approach for mesh generation and adaptive mesh refinement
KR102026333B1 (en) Method for processing task in respect to distributed file system
CN116402091A (en) Hybrid engine intelligent computing method and device for artificial intelligent chip
JP2020166427A (en) Application execution device and application execution method

Legal Events

Date Code Title Description

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 15822486; Country of ref document: EP; Kind code of ref document: A1)

NENP Non-entry into the national phase (Ref country code: DE)

122 Ep: pct application non-entry in european phase (Ref document number: 15822486; Country of ref document: EP; Kind code of ref document: A1)