CN110415162B - Adaptive graph partitioning method facing heterogeneous fusion processor in big data - Google Patents

Adaptive graph partitioning method facing heterogeneous fusion processor in big data

Info

Publication number
CN110415162B
Authority
CN
China
Prior art keywords
data
graph
cpu
gpu
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910661044.1A
Other languages
Chinese (zh)
Other versions
CN110415162A (en)
Inventor
张峰
杜小勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Renmin University of China
Original Assignee
Renmin University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Renmin University of China
Priority to CN201910661044.1A
Publication of CN110415162A
Application granted
Publication of CN110415162B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to an adaptive graph partitioning method for heterogeneous fused processors in big data, characterized by the following steps: partition general graph loads at fine granularity and assign loads of different degrees of irregularity to different devices; analyze dynamic graph loads and design an adaptive graph partitioning that automatically identifies whether hybrid multi-device operation is needed; and, for multi-device processing of large-scale graph loads, run and process the data in a pipelined fashion. Based on a CPU-GPU integrated architecture and oriented to graph computation programs, the invention constructs an automatic programming framework that meets high performance requirements in heterogeneous and dynamic environments, and studies efficient fine-grained graph partitioning for the new characteristics of large-graph and real-time dynamic-graph computation.

Description

Adaptive graph partitioning method facing heterogeneous fusion processor in big data
Technical Field
The invention relates to an adaptive graph partitioning method for heterogeneous fused processors in big data, and belongs to the field of heterogeneous computing.
Background
In the big data era, traditional CPU processors struggle to meet the processing demands of large-scale loads, and the advent of the GPU has opened a new avenue for big data processing: more and more big data applications are processed on GPUs. Graph computation is a representative big data application and, owing to the complex and irregular relationships among the points of a graph, has long been a hot topic in big data research, so a growing number of researchers now focus on accelerating big-data graph computing applications with GPUs. A traditional discrete GPU does not share a chip with the CPU, so communication must cross PCIe and data transfer is inefficient; this motivated CPU-GPU integrated processors. Integrating a CPU and a GPU on one chip for hybrid computing is an important direction in high-performance architecture research and development, but combining different kinds of devices also brings major challenges in programming and system optimization, especially when graph computation programs run across heterogeneous devices. In the prior art, graph computation simply splits the graph between the CPU and the GPU, which performs poorly on key metrics such as processor resource utilization and memory access. For graph computing applications, enabling the CPU and the GPU to process graph data efficiently in hybrid operation is a pressing scientific problem.
At present, many researchers study how to accelerate irregular graph computation programs on integrated architectures, yet the heterogeneity of the architecture and the irregularity of the load pose great challenges to the efficiency of data partitioning, and the prior art does not solve them well, mainly because: first, much existing work performs only coarse-grained data partitioning and does not consider fine-grained interaction between the CPU and the GPU; second, although some studies implement fine-grained data partitioning, they are generally tied to specific applications, such as Hash Join and MapReduce in the database field, and provide no general automatic transformation for irregular programs based on graphs or sparse matrices.
In addition, fully utilizing the resources of an integrated architecture generally requires a new way of partitioning program tasks. Common partitioning methods are static: they suit only the first iteration, and once the graph changes the partition is not necessarily efficient any more, while repartitioning the graph after every iteration incurs overhead. Moreover, on an integrated architecture in a big data environment, a program cannot execute once the input graph data exceeds a certain scale; the prior art does not solve this problem, and big data loads require an additional slicing design.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide an adaptive graph partitioning method for heterogeneous fused processors in big data, which can analyze large-scale data, build a performance-aware model, and partition at finer granularity by combining the characteristics of the CPU and the GPU with the shared-memory property of the integrated architecture.
In order to achieve this purpose, the invention adopts the following technical scheme: an adaptive graph partitioning method for heterogeneous fused processors in big data, comprising the following steps:
Carry out fine-grained partitioning of general graph loads and assign loads of different degrees of irregularity to different devices, the specific process being: S11: construct a data set for model training and generate sparse matrices; S12: train offline and select feature values; S13: build a performance model for the specific graph application; S14: partition the input graph data provided by the user at fine granularity; S15: using the performance model trained in step S13, compute the performance of each data group on the CPU and on the GPU, and record the number of non-zero elements in each data group divided by the predicted performance as the execution time of that data group on the CPU and on the GPU respectively; S16: from the execution times of each data group on the CPU and the GPU, compute which device each data group should run on so that the total execution time is shortest; S17: the assignment computed for the shortest total execution time is the fine-grained partitioning scheme, and during execution each data group is computed with its own kernel, which reduces the degree of irregularity.
Analyze the dynamic graph load, design adaptive graph partitioning, and automatically identify whether hybrid multi-device operation is needed.
For multi-device processing of large-scale graph loads, run and process the data in a pipelined fashion.
Further, the specific process of building the performance model for the specific graph application is:
S131: for a given graph computation program, run the sparse matrices generated in S11 on the CPU and on the GPU respectively, and record the feature value parameters from S12 and the performance index of the selected graph application when it processes the different matrices on the CPU and the GPU, the performance index being the total number of non-zero elements in the sparse matrix divided by the running time;
S132: using a multilayer neural network model, take the feature value parameters from S131 as input data and the performance indexes of the CPU and the GPU as results, train on these data, and obtain the performance of the CPU and the GPU under different parameter settings.
Further, the dynamic graph load is analyzed, adaptive graph partitioning is designed, and whether hybrid multi-device operation is needed is identified automatically, the specific process being:
S21: judge whether the program involves a dynamic graph load;
S22: for a dynamic-load program, record the number of active nodes in each iteration of the run and the number of active nodes to be processed in the next iteration, and record, after each iteration finishes, the cumulative number of points processed in the graph;
S23: when an iteration round starts, judge the current number of active nodes;
S24: if the number of points to process is larger than the threshold for CPU-GPU hybrid operation, perform hybrid operation to process the data;
S25: when the number of processed points in the graph reaches 90% of the total number of points and the minimum CPU-GPU hybrid-operation threshold of step S24 has not been reached for a set number of consecutive iterations, stop hybrid operation and compute with the CPU only.
Further, for multi-device processing of large-scale graph loads, the data is run and processed in a pipelined fashion, the specific process being:
S31: first compute the memory capacity of the computer equipment in use, denoted s, and judge whether the volume of data to process exceeds s; if so, the data cannot be loaded into memory and processed at once, and step S32 is executed; otherwise, load the data into memory and process it directly;
S32: slice the data into data blocks of size not exceeding s/2; since the data is described by a sparse matrix, accumulate the number of non-zero elements over consecutive rows starting from the first row, and when the count approaches s/2 treat those rows as one data block and begin slicing the next block;
S33: after all data processing finishes, merge all intermediate results to obtain the final result.
Further, the data is sliced into data blocks of size not exceeding s/2: since the data is described by a sparse matrix, the number of non-zero elements over consecutive rows is accumulated starting from the first row; when the count approaches s/2, accumulation stops, those rows are treated as one data block, and slicing of the next block begins, specifically:
(1) two array spaces are opened in computer memory, each of size s/2, and used alternately, i.e. at any time one array space is used for computation while the other is used for data transfer;
(2) when the data transfer time and the computation time are unequal, the shorter step waits for the longer step to finish before continuing; if the data transfer time is longer than the computation time, computation on the corresponding data starts only after the transfer completes, and vice versa;
(3) intermediate computation results are retained after each iteration so that the data can be merged.
Due to the adoption of the above technical scheme, the invention has the following characteristics:
1. the invention provides a novel adaptive load partitioning scheme that can analyze large-scale data, build a performance-aware model, and partition at finer granularity by combining the characteristics of the CPU and the GPU with the shared-memory property of the integrated architecture;
2. the invention makes full use of the various device resources of the integrated architecture for online tuning, and can further optimize graph computation programs with dynamically changing characteristics, so that the system achieves a marked performance improvement over traditional graph partitioning schemes;
3. the invention considers the interaction between graph load characteristics and the different devices; existing methods simply split the graph load into two parts for the CPU and the GPU to process separately, ignoring both the load characteristics and the architectural differences between the GPU and the CPU, whereas the proposed method fully considers the graph and architecture characteristics and assigns each device the load it is suited to process;
4. the invention handles graph loads with dynamic characteristics, which are hard to predict because they change during program execution and for which no good solution currently exists;
5. the invention splits the original load into several separately processable parts, each of which fits into the memory system; through graph partitioning these parts can be loaded into the computer and processed in turn, and the intermediate results are finally gathered, so the characteristics of the different devices can be considered during partitioning and each device can be assigned the load that suits it;
In summary, based on a CPU-GPU integrated architecture and oriented to graph computation programs, the invention constructs an automatic programming framework that meets high performance requirements in heterogeneous and dynamic environments, and studies efficient fine-grained graph partitioning for the new characteristics of large-graph and real-time dynamic-graph computation.
Drawings
FIG. 1 shows the overall design of the system of the present invention;
FIG. 2 is a flow chart of the present invention for fine granularity partitioning of generic graph loads;
FIG. 3 is a flow chart of the present invention for dynamic load partitioning;
FIG. 4 illustrates pipeline processing for large data loads according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The invention exploits the advantage that the CPU and the GPU of an integrated architecture share memory on one chip; centered on graph computation applications and facing the challenges posed by large-graph and dynamic-graph loads, it focuses on fine-grained hybrid multi-device operation on the integrated architecture.
As shown in FIG. 1, the adaptive graph partitioning method for heterogeneous fused processors in big data provided by the present invention specifically includes:
S1, carry out fine-grained partitioning of general graph loads and assign loads of different degrees of irregularity to different devices. A general graph expresses relations between objects and is the basic object of study in graph theory; a graph consists of vertices and the edges connecting them. A general graph load is input data that a graph-problem application can be abstracted into; for example, the user relationship network in social network analysis can be abstracted as a general graph load. Fine-grained partitioning means that, when the points of the graph are allocated to the CPU and GPU devices, the device (CPU or GPU) for each point is considered in turn according to the attributes of each vertex, rather than simply mapping the points of the graph as two parts onto the CPU and the GPU. The specific process of step S1 is:
s11: and constructing a data set for model training to generate a sparse matrix.
In order to quickly generate different training data for different access patterns, this embodiment uses the graph generator in Graph500 to generate all the training data, with the following specific steps:
(1) a sparse matrix is generated.
The generator has 5 parameters: S, A, B, C, D, where the sum of A, B, C, D is 1 and S controls the size of the generated graph, which has 2^S points and 2^(S+4) edges; the other 4 parameters control the distribution of non-zeros in the generated graph. This embodiment sets S to 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 in turn, and for each scale S randomly generates 20 groups of data, each group consisting of 4 positive numbers summing to 1, with the largest assigned to A and the other three assigned to B, C, D. A matrix in which the number of elements with value 0 far exceeds the number of non-zero elements is called a sparse matrix. The generated graph is represented by a sparse matrix whose rows list the points in the graph: if the graph has an edge from point i to point j, the element at position (i, j) of the sparse matrix is non-zero, otherwise it is 0. The matrix has 2^S rows and 2^S columns. The generator splits the matrix at its middle row and column into four equal quadrants, and the four parameters A, B, C, D give the probabilities that a randomly generated non-zero element falls in each of the four quadrants; repeating this process yields the 2^(S+4) non-zero elements and thereby the required sparse matrix. Denote the set of sparse matrices generated in this step as set 1; the number of sparse matrices is m.
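By way of illustration only, the following is a minimal Python sketch of the quadrant-recursion idea behind the generator described above; the parameter names S, A, B, C, D follow the text, while the function name, the use of scipy, and the seed handling are assumptions, and the real Graph500 generator differs in detail (for example, it permutes vertex labels).

```python
import random
from scipy.sparse import coo_matrix

def rmat_generate(S, A, B, C, D, seed=42):
    """Sketch of an R-MAT-style generator: a 2^S x 2^S sparse matrix with
    2^(S+4) non-zeros, each placed by descending S levels of quadrants
    chosen with probabilities A, B, C, D."""
    assert abs(A + B + C + D - 1.0) < 1e-9
    rng = random.Random(seed)
    n = 1 << S
    rows, cols = [], []
    for _ in range(1 << (S + 4)):       # 2^(S+4) non-zero elements
        r = c = 0
        half = n >> 1
        while half >= 1:                # pick a quadrant at each level
            p = rng.random()
            if p < A:                   # top-left: keep r, c
                pass
            elif p < A + B:             # top-right: move right
                c += half
            elif p < A + B + C:         # bottom-left: move down
                r += half
            else:                       # bottom-right: move down and right
                r += half
                c += half
            half >>= 1
        rows.append(r)
        cols.append(c)
    # coo_matrix sums duplicate (r, c) pairs; a real generator would
    # deduplicate or regenerate clashing edges.
    return coo_matrix(([1] * len(rows), (rows, cols)), shape=(n, n))
```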
(2) After the sparse matrices are generated, to increase the variety of sparse matrix types, this embodiment takes the m sparse matrices from step (1) and, for each one, gathers the non-zero elements of each row toward positions around the diagonal, starting from position (0,0); the newly generated set of sparse matrices is denoted set 2 and also contains m matrices. To increase the number of sparse matrices further, for each matrix in set 1 the non-zero elements of each row are moved in order to the frontmost positions of that row (column numbers closest to 0); for example, if in row i the elements of columns 4, 7 and 11 are 1 and all others are 0, then afterwards the elements of columns 0, 1 and 2 of row i are 1 and all others are 0. The newly generated set is denoted set 3 and also contains m matrices. Sets 1, 2 and 3 are merged into the training/testing sparse matrices; step (1) generated m sparse matrices, and after the operations of step (2) the number of test sparse matrices becomes 3m.
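For clarity, a small sketch of the set-3 transform described above (moving each row's non-zeros to the frontmost columns); the analogous set-2 transform gathers them around the diagonal instead. The helper name and the CSR/COO handling are assumptions.

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

def shift_rows_to_front(mat: csr_matrix) -> csr_matrix:
    """Keep each row's non-zero count, but place the non-zeros in
    columns 0, 1, 2, ... of that row (the set-3 transform)."""
    row_nnz = np.diff(mat.indptr)                       # non-zeros per row
    rows = np.repeat(np.arange(mat.shape[0]), row_nnz)  # one entry per non-zero
    cols = np.concatenate([np.arange(k) for k in row_nnz])
    data = np.ones(mat.nnz)
    return coo_matrix((data, (rows, cols)), shape=mat.shape).tocsr()
```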
(3) Classifying matrix types.
Let the numbers of rows and columns of the sparse matrix be m and n. If 80% or more of its non-zero elements are concentrated within a band of width 0.2n around the matrix diagonal, the sparse matrix is marked as type 1. If 80% or more of the non-zero elements are concentrated in the first 20% of columns (columns 0 to 0.2n) or in the last 20% of columns (columns 0.8n to n), it is marked as type 2; all other sparse matrices are classed as type 3.
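A sketch of this three-way rule under the stated 80%/20% thresholds, taking the diagonal band as |row - col| <= 0.2n; the function name is an assumption.

```python
import numpy as np
from scipy.sparse import coo_matrix

def classify_matrix(mat: coo_matrix) -> int:
    """Return 1 (diagonal-banded), 2 (concentrated in the first or last
    20% of columns) or 3 (other), per the 80%/20% rule above."""
    _, n = mat.shape
    r, c, nnz = mat.row, mat.col, mat.nnz
    if np.sum(np.abs(r - c) <= 0.2 * n) >= 0.8 * nnz:
        return 1
    if np.sum(c <= 0.2 * n) >= 0.8 * nnz or np.sum(c >= 0.8 * n) >= 0.8 * nnz:
        return 2
    return 3
```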
S12: train offline and select feature values that reflect the processing performance of the CPU and the GPU on different sparse matrices. Since graph loads are stored in sparse matrix format, these feature values can reflect the processing capability of the CPU and the GPU for different graph loads.
In this embodiment, feature values related to the OpenCL programming model are selected; they reflect the processing performance of the CPU and the GPU on different kinds of sparse matrices (graph loads). A graph load is represented by a sparse matrix, in which the elements with value 0 far outnumber the non-zero elements, and typically only the non-zero elements are stored. A thread is the smallest unit of scheduling and computation in the system, and the sparse matrix is processed by threads.
In this embodiment, the following 10 feature value parameters are selected: 1) the average load per thread, i.e. the number of non-zero elements in the sparse matrix divided by the number of threads; 2) the variance of the distribution of non-zeros across rows, i.e. the variance of the per-row non-zero counts within the selected row group; 3) the number of threads required for the computation; 4) the total task load, i.e. the number of non-zero elements in the sparse matrix; 5) the volume of data read, determined by the user program; 6) the volume of other data written, excluding the sparse matrix, determined by the user program; 7) the amount of computation, i.e. the number of arithmetic operations in the code; 8) the device being run, CPU or GPU, denoted 0 and 1 respectively; 9) the compute-to-memory-access ratio, i.e. the amount of computation divided by the total task load; 10) the matrix type. Of these, 1), 2), 4), 5), 6), 7) and 9) are continuous features, while 3), 8) and 10) are discrete features.
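The sketch below assembles these 10 feature values for one (matrix, device) measurement, assuming CSR storage; the read/write volumes and operation count are passed in because, as stated, they are determined by the user program, and all names and signatures are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix

def extract_features(mat: csr_matrix, n_threads: int, read_bytes: int,
                     write_bytes: int, n_ops: int, device: int,
                     matrix_type: int) -> list:
    """Assemble the 10 feature values; device is 0 for CPU, 1 for GPU."""
    row_nnz = np.diff(mat.indptr)       # non-zeros in each row
    total_nnz = int(mat.nnz)
    return [
        total_nnz / n_threads,          # 1) average load per thread
        float(np.var(row_nnz)),         # 2) variance of non-zeros across rows
        n_threads,                      # 3) number of threads required
        total_nnz,                      # 4) total task load
        read_bytes,                     # 5) volume of data read
        write_bytes,                    # 6) volume of other data written
        n_ops,                          # 7) number of arithmetic operations
        device,                         # 8) device flag: 0 = CPU, 1 = GPU
        n_ops / total_nnz,              # 9) compute-to-memory-access ratio
        matrix_type,                    # 10) matrix type (1, 2 or 3)
    ]
```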
S13: build the performance model for the specific graph application, as follows:
(1) For a given graph computation program, the sparse matrices generated in step S11 are run separately on the CPU and the GPU. While processing each sparse matrix, the 10 feature values from S12 can be computed (for feature 1, for example, the number of non-zero elements in the sparse matrix is divided by the number of threads opened), together with the performance index, i.e. the number of non-zero elements the CPU or the GPU can process per second; the performance index selected by the invention is the total number of non-zero elements in the sparse matrix divided by the running time.
(2) Using a multilayer perceptron (MLP) model, take the 10 feature value parameters from step (1) as input data and the performance indexes of the CPU and the GPU as results, and train on these data to obtain the performance of the CPU and the GPU under different parameter settings. That is, given a set of the 10 feature parameters from S12, the trained model can predict the performance of the CPU and the GPU.
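The patent specifies a multilayer neural network but neither its topology nor a framework; as one plausible realization, the sketch below trains scikit-learn's MLPRegressor on placeholder data shaped like the (10-feature, throughput) pairs described above. The hidden-layer sizes and the placeholder data are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: in the real flow, each row of X holds the 10 feature
# values from S12 for one (matrix, device) run, and y holds the measured
# performance (non-zero elements processed per second).
rng = np.random.default_rng(0)
X = rng.random((600, 10))
y = rng.random(600) * 100.0

model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0),
)
model.fit(X, y)

# Given a data group's features with the device flag set to 0 (CPU) or
# 1 (GPU), the trained model predicts throughput on that device.
print(model.predict(X[:2]))
```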
S14: perform fine-grained partitioning of the input graph provided by the user.
After training completes, when a user inputs a new graph load represented as a sparse matrix, each row is assigned to a data group according to its number of non-zero elements: data group i stores the rows whose non-zero counts lie between 2^(i-1) and 2^i. For example, the first data group contains the rows with between 2^0 and 2^1 non-zero elements. If there are at most 30 data groups, then i is at most 30, and rows whose non-zero counts would give i greater than 30 are placed in the group with i equal to 30. In practical application, the maximum value of i can be set as the situation requires. At the same time, the number of non-zero elements of each data group must be recorded.
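A minimal sketch of this grouping step, assuming CSR storage: a row with nnz non-zeros goes to group i = ceil(log2(nnz)), i.e. the group covering (2^(i-1), 2^i], capped at group 30 as described. The function name is an assumption.

```python
import numpy as np
from scipy.sparse import csr_matrix

def group_rows(mat: csr_matrix, max_group: int = 30):
    """Map each non-empty row to its data group and record, per group,
    the member rows and the total number of non-zero elements."""
    row_nnz = np.diff(mat.indptr)
    groups = {}
    for row, nnz in enumerate(row_nnz):
        if nnz == 0:
            continue                    # empty rows carry no work
        i = min(max(int(np.ceil(np.log2(nnz))), 1), max_group)
        groups.setdefault(i, []).append(row)
    # per-group non-zero totals, needed by the time estimates in S15
    group_nnz = {i: int(row_nnz[rows].sum()) for i, rows in groups.items()}
    return groups, group_nnz
```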
S15: calculate, from the performance model trained in step S13, the performance of each data group on the CPU and on the GPU. Because performance is the number of non-zero elements processed per second, the number of non-zero elements in each data group divided by the predicted performance is recorded as the execution time of that data group on the CPU and on the GPU respectively.
S16: with the execution times of each data group on the CPU and the GPU computed in step S15, determine which device (CPU or GPU) each data group should run on so that the total execution time is shortest. Each data group may choose either the CPU or the GPU; since the number of data groups is limited, the invention can enumerate the total running time of the CPU-GPU integrated processor in all cases. The total running time equals the larger of the total load assigned to the CPU and the total load assigned to the GPU.
S17: the assignment computed in step S16 is the fine-grained partitioning scheme. During execution, one kernel is launched for each data group. A kernel comprises a group of threads, and the rows of the sparse matrix handled by one data group contain similar numbers of non-zero elements (thanks to the grouping of S14); since each thread processes its rows in sequence, threads processing different rows finish at similar times, which matters greatly for GPU parallel performance. Because the non-zero counts of the rows processed by the threads inside a kernel differ little, the fine-grained partitioning scheme reduces the irregularity of the data each kernel processes.
Experiments show that, by analyzing a large number of graph computation loads for modeling research, a multi-device fine-grained partitioning method for graph loads is obtained whose performance on specific types of graph computation programs is 10 to 20 percent better than existing multi-device partitioning methods.
2. Analyze the dynamic graph load, design adaptive graph partitioning, and automatically identify whether hybrid multi-device operation is needed.
The dynamic load problem in graph computation is a hot and difficult topic in the heterogeneous field, because the way dynamic traversal processes such as graph computations change during program execution generally depends on the input data, exhibits poor regularity and is hard to predict; moreover, the program's input data often becomes visible only at runtime, which makes dynamic loads harder still to handle. The dynamic load optimization of the invention comprises dynamic trend analysis and adaptive method selection. The scheme of the invention can sense the change of the dynamic graph load at runtime and adjust the CPU and GPU load split in real time, as follows:
s21: it is determined whether the program is involved in dynamic graph loading.
The user can directly set whether the program is a dynamic load program or not by providing an interface. If the user can not provide the program, the program provided by the user needs to be analyzed, whether an array related to sparse matrix storage accesses different positions in different iteration steps is judged, and if the situation is met, the program is judged to be a dynamic load program.
S22: the points of the graph processed in each iteration of a dynamic-load program differ; the points that need processing in an iteration are called active nodes in the invention. For a dynamic-load program, record the number of active nodes in each iteration of the run and the number of active nodes to be processed in the next round. In general, after the currently active points are processed in an iteration, the graph nodes connected to them are saved as the processing objects of the next iteration, so analyzing the active points of each iteration provides the basis for partitioning the next round's load. Meanwhile, after each iteration finishes, the cumulative number of points processed in the graph must be recorded.
S23: when an iteration round starts, judge the current number of active nodes.
Because in some iterations the number of active points does not reach a level at which the cost of processing on the CPU and the GPU simultaneously is repaid by the performance benefit of hybrid multi-device operation, it is necessary to judge whether the CPU and the GPU need to run in hybrid mode. Likewise, when the number of nodes is too low the parallelism of the GPU cannot be fully exploited, and in that situation the CPU processes alone. The default active-point threshold is 1000: when fewer than 1000 nodes need processing in an iteration, only the CPU is used; this default threshold can be adjusted to the actual situation.
S24: if the number of points to process is larger than the threshold for CPU-GPU hybrid operation, hybrid operation can be performed to process the data. Depending on the number of active points in the iteration there are 2 execution modes:
(1) If the number of active points to be processed in this iteration reaches 80% of all points in the graph, the edge distribution of the active points resembles that of the whole graph, and the fine-grained partitioning scheme of S1 is invoked to solve it.
(2) If the condition in (1) is not met, the points of the graph are divided into two parts by a sampling method according to the number of edges of the different points, i.e. the number of non-zero elements in the corresponding rows of the sparse matrix: the first part holds points with few edges, hence low irregularity, and is processed by the GPU; the second part holds points with many edges and large variation, hence high irregularity, and is processed by the CPU.
S25: when the number of processed points in the graph reaches the set proportion of the total number of points and the minimum CPU-GPU hybrid-operation threshold of step S24 has not been reached for a set number of consecutive iterations, stop hybrid operation and compute with the CPU only. The per-iteration decision of steps S23 to S25 is sketched below.
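Under the defaults stated above (1000 active points for hybrid operation, 80% for reusing the S1 partition, 90% of points processed for the tail phase), the decision can be sketched as follows; the patience parameter and the returned mode labels are assumptions introduced for illustration.

```python
def choose_execution_mode(n_active: int, total_points: int,
                          processed_points: int, idle_rounds: int,
                          hybrid_threshold: int = 1000,
                          dense_ratio: float = 0.8,
                          finish_ratio: float = 0.9,
                          patience: int = 3) -> str:
    """Decide, at the start of an iteration, how to run it."""
    # S25 tail phase: most points done and the hybrid threshold missed
    # for several consecutive iterations, so stay on the CPU.
    if processed_points >= finish_ratio * total_points and idle_rounds >= patience:
        return "cpu_only"
    # S23: too few active nodes to amortize the multi-device cost.
    if n_active < hybrid_threshold:
        return "cpu_only"
    # S24(1): the active set resembles the whole graph, so reuse the
    # fine-grained partitioning of S1.
    if n_active >= dense_ratio * total_points:
        return "fine_grained_hybrid"
    # S24(2): sample degrees and split by irregularity
    # (GPU: low-degree regular rows, CPU: high-degree irregular rows).
    return "sampled_split_hybrid"
```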
3. For multi-device processing of large-scale graph loads, pipelined processing is used: data handling is split into several steps that do not interfere with one another, so different data can be processed in parallel. The process has two steps, 1) data transfer and 2) data processing. The large-scale input data is split into several parts that each fit into memory, and each part then undergoes 1) data transfer and 2) data processing in turn; while one part is being processed, the next part begins its data transfer. The intermediate result of each part is saved during processing, and after every part has been computed the intermediate results are gathered. The specific process is:
s31: firstly, the memory capacity of the currently used computer equipment is calculated and recorded as s. Judging whether the processed data volume exceeds S, if so, determining that the large graph program cannot be loaded into the memory for processing at one time, and executing step S32; otherwise, the data can be loaded into the memory of the computer equipment, and the data can be loaded into the memory for direct processing without executing subsequent steps.
S32: slice the data into blocks of size not exceeding s/2. Since the data is described by a sparse matrix, accumulate the number of non-zero elements over consecutive rows starting from the first row, stop when the count approaches s/2, and treat those rows as one data block; slicing of the next block starts at the same time. The following requirements must be met:
(1) Two array spaces are opened in computer memory, each of size s/2, and they are used alternately: at any time one array space is used for computation while the other receives the incoming data transfer.
(2) When the data transfer time and the computation time are unequal, the shorter step must wait for the longer one to complete before proceeding. If the data transfer time exceeds the computation time, computation on the corresponding data starts only after its transfer completes, and vice versa.
(3) Intermediate computation results must be retained after each iteration so that the data can be merged. The program's intermediate data is usually small and can be kept in memory.
S33: after all data processing finishes, all intermediate results are merged to obtain the final result.
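A minimal double-buffered sketch of this pipeline, assuming the data has already been sliced into blocks of at most s/2 as in S32; load_block, compute_block and merge stand in for the transfer, processing and S33 merging steps, and the use of Python threads is an assumption. On the integrated architecture, load_block would model reading a block from external storage into the shared memory.

```python
import threading

def pipeline_process(blocks, load_block, compute_block, merge):
    """Run transfer and computation in an overlapped, double-buffered
    loop: while one s/2-sized buffer is computed on, the other receives
    the next block; each round is paced by the slower of the two steps."""
    buffers = [None, None]
    results = []

    def transfer(slot, block):
        buffers[slot] = load_block(block)        # fill one buffer

    transfer(0, blocks[0])                       # prime the first buffer
    for k in range(len(blocks)):
        slot = k % 2
        nxt = None
        if k + 1 < len(blocks):                  # stream the next block in
            nxt = threading.Thread(target=transfer,
                                   args=((k + 1) % 2, blocks[k + 1]))
            nxt.start()
        results.append(compute_block(buffers[slot]))  # compute current block
        if nxt is not None:
            nxt.join()                           # wait for the longer step
    return merge(results)                        # S33: merge intermediate results
```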
Finally, it should be noted that the above embodiments only illustrate the technical solutions of the present invention and do not limit its scope of protection. Although the application has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that numerous variations, modifications and equivalents may be made to them upon reading the present application, and all of these fall within the scope of the appended claims.

Claims (4)

1. An adaptive graph partitioning method for heterogeneous fused processors in big data, characterized by comprising the following steps:
carrying out fine-grained partitioning of general graph loads and assigning loads of different degrees of irregularity to different devices, the specific process being: S11: constructing a data set for model training and generating sparse matrices; S12: training offline and selecting feature values; S13: building a performance model for the specific graph application; S14: partitioning the input graph data provided by the user at fine granularity; S15: using the performance model trained in step S13, computing the performance of each data group on the CPU and on the GPU, and recording the number of non-zero elements in each data group divided by the predicted performance as the execution time of that data group on the CPU and on the GPU respectively; S16: from the execution times of each data group on the CPU and the GPU, computing which device each data group should run on so that the total execution time is shortest; S17: the assignment computed for the shortest total execution time being the fine-grained partitioning scheme, wherein during execution each data group is computed with its own kernel, which reduces the degree of irregularity;
analyzing the dynamic graph load, designing adaptive graph partitioning, and automatically identifying whether hybrid multi-device operation is needed, the specific process being:
S21: judging whether the program involves a dynamic graph load;
S22: for a dynamic-load program, recording the number of active nodes in each iteration of the run and the number of active nodes to be processed in the next iteration, and recording, after each iteration finishes, the cumulative number of points processed in the graph;
S23: when an iteration round starts, judging the current number of active nodes;
S24: if the number of points to process is larger than the threshold for CPU-GPU hybrid operation, performing hybrid operation to process the data;
S25: when the number of processed points in the graph reaches the set proportion of the total number of points and the minimum CPU-GPU hybrid-operation threshold of step S24 has not been reached for a set number of consecutive iterations, stopping hybrid operation and using only the CPU for computation;
for multi-device processing of large-scale graph loads, running and processing the data in a pipelined fashion.
2. The adaptive graph partitioning method according to claim 1, wherein the performance model for the specific graph application is built by the following specific process:
S131: for a given graph computation program, running the sparse matrices generated in S11 on the CPU and on the GPU respectively, and recording the feature value parameters from S12 and the performance index of the selected graph application when it processes the different matrices on the CPU and the GPU, the performance index being the total number of non-zero elements in the sparse matrix divided by the running time;
S132: using a multilayer neural network model, taking the feature value parameters from S131 as input data and the performance indexes of the CPU and the GPU as results, training on these data, and obtaining the performance of the CPU and the GPU under different parameter settings.
3. The adaptive graph partitioning method according to claim 1 or 2, wherein for multi-device processing of large-scale graph loads the data is run and processed in a pipelined fashion, the specific process being:
S31: first computing the memory capacity of the computer equipment in use, denoted s, and judging whether the volume of data to process exceeds s; if so, the data cannot be loaded into memory and processed at once and step S32 is executed; otherwise, loading the data into memory and processing it directly;
S32: slicing the data into data blocks of size not exceeding s/2, wherein, since the data is described by a sparse matrix, the number of non-zero elements over consecutive rows is accumulated starting from the first row, and when the count approaches s/2 those rows are treated as one data block and slicing of the next block begins;
S33: after all data processing finishes, merging all intermediate results to obtain the final result.
4. The adaptive graph partitioning method according to claim 3, wherein the data is sliced into data blocks of size not exceeding s/2, and, since the data is described by a sparse matrix, the number of non-zero elements over consecutive rows is accumulated starting from the first row, accumulation stopping when the count approaches s/2, those rows being treated as one data block while slicing of the next block begins, specifically:
(1) two array spaces are opened in computer memory, each of size s/2, and used alternately, i.e. at any time one array space is used for computation while the other is used for data transfer;
(2) when the data transfer time and the computation time are unequal, the shorter step waits for the longer step to finish before continuing; if the data transfer time is longer than the computation time, computation on the corresponding data starts only after the transfer completes, and vice versa;
(3) intermediate computation results are retained after each iteration so that the data can be merged.
CN201910661044.1A 2019-07-22 2019-07-22 Adaptive graph partitioning method facing heterogeneous fusion processor in big data Active CN110415162B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910661044.1A CN110415162B (en) 2019-07-22 2019-07-22 Adaptive graph partitioning method facing heterogeneous fusion processor in big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910661044.1A CN110415162B (en) 2019-07-22 2019-07-22 Adaptive graph partitioning method facing heterogeneous fusion processor in big data

Publications (2)

Publication Number Publication Date
CN110415162A (en) 2019-11-05
CN110415162B (en) 2020-03-31

Family

ID=68362293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910661044.1A Active CN110415162B (en) 2019-07-22 2019-07-22 Adaptive graph partitioning method facing heterogeneous fusion processor in big data

Country Status (1)

Country Link
CN (1) CN110415162B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111858059A * 2020-07-24 2020-10-30 Suzhou Inspur Intelligent Technology Co., Ltd. Graph calculation method, device, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109741421A * 2019-01-23 2019-05-10 Northeastern University Dynamic graph coloring method based on GPU

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103617626A * 2013-12-16 2014-03-05 Wuhan Shitu Spatial Information Technology Co., Ltd. Central processing unit (CPU) and graphics processing unit (GPU)-based remote-sensing image multi-scale heterogeneous parallel segmentation method
CN107357661B * 2017-07-12 2020-07-10 Beihang University Fine-grained GPU resource management method for mixed load
US11373088B2 (en) * 2017-12-30 2022-06-28 Intel Corporation Machine learning accelerator mechanism
CN109871512B * 2019-01-27 2020-05-22 National University of Defense Technology Matrix multiplication acceleration method for heterogeneous fusion system structure


Also Published As

Publication number Publication date
CN110415162A (en) 2019-11-05

Similar Documents

Publication Publication Date Title
KR102011671B1 (en) Method and apparatus for processing query based on heterogeneous computing device
CN110168516B (en) Dynamic computing node grouping method and system for large-scale parallel processing
CN109993299B (en) Data training method and device, storage medium and electronic device
Khorasani et al. Scalable simd-efficient graph processing on gpus
Zhang et al. BoostGCN: A framework for optimizing GCN inference on FPGA
US10783436B2 (en) Deep learning application distribution
CN110633153A (en) Method for realizing neural network model splitting by using multi-core processor and related product
CN111488205B (en) Scheduling method and scheduling system for heterogeneous hardware architecture
Lu et al. Optimizing depthwise separable convolution operations on gpus
WO2022068663A1 (en) Memory allocation method, related device, and computer readable storage medium
CN107908536B (en) Performance evaluation method and system for GPU application in CPU-GPU heterogeneous environment
CN110689121A (en) Method for realizing neural network model splitting by using multi-core processor and related product
CN113821311A (en) Task execution method and storage device
CN109978171B (en) Grover quantum simulation algorithm optimization method based on cloud computing
CN112817730B (en) Deep neural network service batch processing scheduling method and system and GPU
Deng et al. A data and task co-scheduling algorithm for scientific cloud workflows
Wang et al. Exploiting parallelism for CNN applications on 3D stacked processing-in-memory architecture
Senthilkumar et al. A survey on job scheduling in big data
WO2022110860A1 (en) Hardware environment-based data operation method, apparatus and device, and storage medium
Wu et al. Using hybrid MPI and OpenMP programming to optimize communications in parallel loop self-scheduling schemes for multicore PC clusters
Cojean et al. Resource aggregation for task-based cholesky factorization on top of modern architectures
US20210390405A1 (en) Microservice-based training systems in heterogeneous graphic processor unit (gpu) cluster and operating method thereof
Pimpley et al. Towards Optimal Resource Allocation for Big Data Analytics.
CN110415162B (en) Adaptive graph partitioning method facing heterogeneous fusion processor in big data
CN114217930A (en) Accelerator system resource optimization management method based on mixed task scheduling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant