CN108509270B - High-performance parallel implementation method of K-means algorithm on domestic Shenwei 26010 many-core processor - Google Patents
High-performance parallel implementation method of K-means algorithm on domestic Shenwei 26010 many-core processor
- Publication number
- CN108509270B (grant publication of application CN201810188779.2A)
- Authority
- CN
- China
- Prior art keywords
- matrix
- core
- central point
- slave
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5038—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multi Processors (AREA)
Abstract
The invention provides a high-performance parallel implementation method of the K-means algorithm on the domestic Shenwei 26010 many-core processor. Based on the domestic Shenwei 26010 platform, a computation framework that fuses blocked distance-matrix computation with the reduction operation is designed for the clustering stage. The framework divides tasks with a three-layer blocking strategy, and additionally designs a cooperative inter-core data sharing scheme and a cluster label reduction method based on the register communication mechanism, together with optimizations such as double buffering and instruction reordering. For the center-point update stage, the invention designs a dynamically scheduled task division scheme. In tests on real data sets, the invention reaches a maximum floating-point performance of 348.1 GFlops, achieves 47%-84% of the theoretically attainable maximum performance, and obtains a speedup of up to 1.7x (1.3x on average) over the unfused computation scheme.
Description
Technical Field
The invention belongs to the field of parallel acceleration research of clustering algorithms in machine learning, and particularly relates to a high-performance parallel implementation method of a K-means algorithm on a domestic Shenwei 26010 many-core processor.
Background
K-means is a classic distance-based clustering algorithm in unsupervised learning. It divides sample data into different clusters according to a similarity measure between samples, so that the similarity among samples within the same cluster is maximized. Because it is simple, easy to implement and requires no labeled samples, K-means is widely applied in image processing, data mining, text clustering, biology and other fields, and is increasingly used as a preprocessing step for many more complex algorithms. With the advent of the big-data era, the feature dimension of sample data has grown from tens of dimensions to thousands, which in turn places higher demands on computing speed, so research on the parallel acceleration of K-means is of great practical importance.
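As background (a standard formulation, not stated explicitly above): for k clusters, K-means seeks center points c_1, …, c_k that minimize the within-cluster sum of squared distances over the n samples x_i,

```latex
J(c_1,\dots,c_k) \;=\; \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - c_j \rVert^2 .
```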
At present, many parallel K-means implementations target GPU platforms. Among them, the best single-GPU implementation is the algorithm of You Li (Li Y, Zhao K, Chu X, et al. Speeding up K-Means algorithm by GPUs [J]. Journal of Computer & System Sciences, 2013, 79(2): 216-229). The You Li algorithm first observed that, for high-dimensional samples, the clustering stage of K-means is very similar to the computation pattern of the matrix-matrix multiplication (GEMM) routine in the basic linear algebra subprograms (BLAS). The algorithm therefore computes the sample-to-center distances in the clustering stage with a GEMM-style parallelization scheme, stores the results into a distance matrix, and then obtains the cluster label of each sample by reading back the distance matrix and performing a reduction. This scheme makes full use of the powerful computing capability of the GPU. However, the result actually required by the K-means algorithm is the cluster label of each sample; the distance matrix is only intermediate data for deriving the label array. In the You Li algorithm, the distance matrix is written to memory after it is computed and then read back again, which adds extra memory accesses and raises the communication pressure on the system, aggravating in particular the memory-wall problem common to most modern high-performance hardware platforms (CPU, GPU, SW26010, etc.). Furthermore, for lack of fine-grained tuning, its parallel implementation reaches only 15.9% of the machine's peak floating-point performance.
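The GEMM analogy rests on the standard expansion of the squared Euclidean distance (not written out above): its cross term is exactly a matrix-matrix product of the n × d sample matrix with the d × k center-point matrix, while the two norm terms are cheap rank-1 corrections,

```latex
\lVert x_i - c_j \rVert^2 \;=\; \lVert x_i \rVert^2 \;-\; 2\, x_i^{\top} c_j \;+\; \lVert c_j \rVert^2 .
```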
The domestic Shenwei 26010 many-core processor is a heterogeneous master-slave high-performance platform developed independently in China. It consists of 4 core groups connected through a NoC (network on chip) and sharing a 32 GB main memory space. Each core group consists of 1 master core and an 8 × 8 array of slave cores. Both the master core and the slave cores support 256-bit vector floating-point instructions. A slave-core array provides 742.4 GFlops of floating-point computing capability. Each slave core has an independent (private), user-controlled 64 KB LDM (local data memory). A slave core can access main memory directly, or use the DMA interface to transfer data between main memory and its local LDM; the latter has lower latency and delivers a measured memory bandwidth of about 22 GB/s. The slave cores are connected by row/column data buses, so slave cores in the same row or column can share data efficiently through register communication. Each slave core has two pipelines, P0 and P1: scalar integer instructions may execute on either P0 or P1; floating-point instructions execute only on P0; memory accesses, register communication, and control-transfer instructions execute only on P1. For algorithms dominated by floating-point computation, keeping the P0 pipeline fully occupied with floating-point operations therefore yields higher floating-point performance. The software and hardware parameters of the Shenwei 26010 many-core processor are shown in Table 1:
table 1: software and hardware parameters of Shenwei 26010 many-core processor
Meanwhile, there is currently no high-performance K-means implementation on the domestic Shenwei platform, which hinders the application and popularization of machine-learning workloads relying on K-means on this platform.
Based on the Shenwei 26010 platform's need for a high-performance clustering algorithm and the shortcomings of existing parallelization methods, it is necessary to develop an efficient K-means parallelization method on the Shenwei 26010 platform.
Disclosure of Invention
The invention solves the following problem: the existing Shenwei 26010 platform lacks a high-performance clustering algorithm. A parallel K-means implementation framework is provided that fuses blocked distance-matrix computation with cluster label reduction, and optimization methods such as three-layer blocking, cooperative inter-core data sharing, double buffering, instruction reordering and dynamic scheduling are used to fully exploit the computing resources of the hardware platform and improve the computing performance of K-means.
The K-means algorithm comprises (a) a center-point initialization step, (b) a clustering step, (c) an iterative-convergence-value calculation step and (d) a center-point update step. Steps (a) and (b) are performed first; the iterative convergence value is then computed according to step (c) and convergence is tested. If the algorithm has not converged, the main loop formed by steps (b)-(d) is entered again; if it has converged, the current clustering result is returned, the algorithm exits, and the cluster label of each sample is output. The invention mainly parallelizes and optimizes the time-consuming steps (b) and (d). The invention is based on a core group of the domestic Shenwei 26010 processor; each core group consists of one master core and 64 slave cores. For the clustering step, the traditional approach first computes the whole distance matrix and then obtains the cluster labels by reducing that matrix. The framework of the invention instead avoids storing the whole distance matrix, eliminating two read/write passes over it and providing a better memory-access path. The framework adopts a three-layer blocking strategy, and additionally designs a cooperative inter-core data sharing scheme and a cluster label reduction method based on the register communication mechanism; double buffering and instruction reordering further improve performance. For the center-point update step, the invention designs a dynamically scheduled task division scheme.
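A minimal host-side sketch of the iteration structure described above; all helper functions (init_centroids, cluster_and_mse, update_centroids) are hypothetical placeholders for steps (a), (b)+(c) and (d), not the actual interface of the implementation:

```c
extern void   init_centroids(const double *s, int n, int d, int k, double *c);
extern double cluster_and_mse(const double *s, const double *c, int *lab,
                              int n, int d, int k);
extern void   update_centroids(const double *s, const int *lab,
                               int n, int d, int k, double *c);

/* Illustrative sketch of the overall K-means iteration. */
void kmeans(const double *samples, int n, int d, int k,
            double *centroids, int *labels, int max_iter, double eps)
{
    init_centroids(samples, n, d, k, centroids);               /* step (a) */
    double prev_mse = 1e300;
    for (int it = 0; it < max_iter; ++it) {
        /* step (b): clustering; step (c): the mean square error is
         * accumulated as a by-product and used as the convergence value */
        double mse = cluster_and_mse(samples, centroids, labels, n, d, k);
        if (prev_mse - mse < eps)                               /* converged */
            break;
        prev_mse = mse;
        update_centroids(samples, labels, n, d, k, centroids);  /* step (d) */
    }
}
```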
The efficient parallel K-means algorithm on the SW26010 many-core processor is specifically realized as follows:
(1) Clustering stage. In this stage, the nearest center point is found for each sample, and the index of that center point is stored as the sample's cluster label. The method provides a parallel implementation framework that fuses blocked distance-matrix computation with cluster label reduction; the framework parallelizes clustering through a three-layer code structure of interface layer, driver layer and core layer. The interface layer runs on the master core; the driver layer and core layer run on the slave cores. The framework adopts a three-layer blocking algorithm of logical blocking, physical blocking and register blocking: the driver layer performs the logical and physical blocking, and the core layer performs the register blocking. First, the interface layer reads the input data and stores it into an n × d sample matrix, where n is the number of samples and d the sample dimension, and stores the center points into a d × k center-point matrix, where k is the number of clusters (i.e. the number of center points). The driver layer then performs logical and physical blocking. Logical blocking first partitions the sample matrix and the center-point matrix along the n, d and k dimensions to obtain sample matrix blocks and center-point matrix blocks, and traverses the blocks with a for-loop nest ordered n, k, d: the k loop lies inside the n loop, the d loop inside the k loop, and the two-step register-communication-based cluster label reduction function is called inside the k loop. Inside the d loop, a DMA transfer is first initiated to move a sample matrix block and a center-point matrix block from main memory to the LDM of the slave cores, after which a core-layer function is called to compute the distances between the sample matrix block and the center-point matrix block, yielding a distance matrix block. Next, the driver layer further partitions the logically blocked sample matrix block, center-point matrix block and distance matrix block physically, dividing each into 64 smaller blocks called tiles. Each sample matrix tile and center-point matrix tile is transferred to the fast LDM buffer of the corresponding slave core among the 64 slave cores; for example, the i-th tile of the sample block and the i-th tile of the center-point block are transferred to the i-th slave core. At the same time, each slave core allocates space in its LDM for its distance matrix tile and is responsible for computing its values; the driver layer calls a core-layer function to compute the distance matrix tile it is responsible for. In the core layer, when a slave core computes its distance matrix tile, it also needs sample matrix tiles and center-point matrix tiles held by other slave cores; the method therefore designs a cooperative inter-core data sharing scheme. The scheme comprises 8 data-exchange steps; in each step every slave core obtains the data it needs and then calls the register-blocked computation function once.
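The driver-layer loop nest described above can be sketched as follows; the block sizes and all helper functions are hypothetical placeholders, not the actual Athread interface:

```c
/* Driver-layer sketch: logical blocking with the n -> k -> d loop order. */
enum { BN = 512, BK = 64, BD = 128 };               /* illustrative block sizes */

extern void clear_distance_block(void);             /* zero the BN x BK block in LDM  */
extern void dma_get_sample_block(int in, int id);   /* BN x BD block -> slave LDM     */
extern void dma_get_center_block(int id, int ik);   /* BD x BK block -> slave LDM     */
extern void compute_distance_tiles(void);           /* core layer, 64 slave cores     */
extern void reduce_cluster_labels(int in, int ik);  /* two-step register-comm. reduce */
extern void dma_put_labels(int in);                 /* write BN labels to main memory */

void driver_layer(int n, int d, int k)
{
    for (int in = 0; in < n; in += BN) {             /* block of BN samples        */
        for (int ik = 0; ik < k; ik += BK) {         /* block of BK center points  */
            clear_distance_block();
            for (int id = 0; id < d; id += BD) {     /* accumulate over dimensions */
                dma_get_sample_block(in, id);
                dma_get_center_block(id, ik);
                compute_distance_tiles();
            }
            /* fusion: reduce this finished distance block immediately, updating
             * the running per-sample minimum and label, so the block is never
             * written back to main memory */
            reduce_cluster_labels(in, ik);
        }
        dma_put_labels(in);
    }
}
```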
In the computation function, each slave core computes its distance matrix tile with hand-written assembly code. Then, after each slave core finishes computing its distance matrix tile, a cluster label reduction scheme based on register communication obtains the minimum of each row of the distance matrix tile; the global column index corresponding to that minimum is the cluster label of the sample at the corresponding global row index. Finally, the cluster labels are stored back to main memory, completing the clustering operation.
(2) Center-point update stage. This stage updates the center point of each cluster to the weighted average of all samples belonging to that cluster. The clustering step has determined the cluster to which each sample belongs; the center of each of the k clusters is then updated to the weighted average of all samples assigned to it. The method parallelizes this stage by dividing tasks over the center points and designs a work-sharing task scheduling strategy, each slave core being responsible for updating a subset of the assigned center points. When a slave core processes the update of a cluster it is responsible for, it first allocates an array new_centroid of length d and sets it to zero; new_centroid stores the new center point of the cluster. It then reads the cluster label of each sample, judges whether the sample belongs to the cluster currently being processed and, if so, reads the sample into the slave core's LDM and accumulates it into new_centroid. After traversing the cluster label array, new_centroid holds, for each dimension, the accumulated values of all samples belonging to the cluster; dividing each dimension of new_centroid by the number of samples in the cluster gives the new value of the cluster's center point.
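A per-slave-core sketch of the accumulation just described; the dimension bound, the DMA helper and the function name are illustrative assumptions, not the actual code:

```c
#define MAX_D 2048   /* illustrative upper bound on the sample dimension d */

/* Hypothetical DMA helper: copy sample i (d doubles) from main memory to LDM. */
extern void ldm_load_sample(double *dst, int i, int d);

/* Sketch: one slave core recomputes the center of cluster c as the mean of
 * all samples labelled c, following the new_centroid scheme described above. */
void update_one_center(int c, const int *labels, int n, int d, int k,
                       double *centroids /* d x k, column c is rewritten */)
{
    double new_centroid[MAX_D] = {0.0};   /* accumulator in LDM, length d */
    double sample[MAX_D];
    int count = 0;

    for (int i = 0; i < n; ++i) {
        if (labels[i] != c) continue;     /* sample not in this cluster   */
        ldm_load_sample(sample, i, d);    /* DMA the sample into LDM      */
        for (int j = 0; j < d; ++j)
            new_centroid[j] += sample[j];
        ++count;
    }
    if (count > 0)
        for (int j = 0; j < d; ++j)       /* divide by the cluster size   */
            centroids[j * k + c] = new_centroid[j] / count;
}
```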
In step (1), the driver layer first initiates a DMA transfer inside the d loop, moving a sample matrix block and a center-point matrix block from main memory to the LDM of the slave cores, and then calls the core-layer function to compute their distances and obtain a distance matrix block. Taking advantage of asynchronous DMA, the invention designs a double-buffering mechanism that overlaps memory access with computation: while the distance matrix block of the current iteration is being computed, the DMA reads for the sample matrix block and center-point matrix block of the next iteration are already initiated, hiding memory-access time behind computation and improving performance.
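The memory-access/computation overlap can be sketched with two LDM buffers; dma_iget_blocks and dma_wait are hypothetical stand-ins for the asynchronous DMA interface, not the actual Athread calls:

```c
extern void dma_iget_blocks(int buf, int d_block);     /* async get into buffer  */
extern void dma_wait(void);                            /* wait for pending DMA   */
extern void accumulate_distance_block(int buf);        /* core-layer computation */

/* Double-buffering sketch for the d loop: compute on buffer `cur` while the
 * sample/center blocks of the next iteration are prefetched into buffer `nxt`. */
void d_loop_double_buffered(int num_d_blocks)
{
    int cur = 0, nxt = 1;
    dma_iget_blocks(cur, 0);                           /* prime the first buffer */
    dma_wait();
    for (int id = 0; id < num_d_blocks; ++id) {
        if (id + 1 < num_d_blocks)
            dma_iget_blocks(nxt, id + 1);              /* prefetch next blocks   */
        accumulate_distance_block(cur);                /* overlap with the DMA   */
        if (id + 1 < num_d_blocks)
            dma_wait();                                /* next buffer is ready   */
        cur ^= 1; nxt ^= 1;                            /* swap the two buffers   */
    }
}
```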
In step (1), when each slave core computes its distance matrix tile, the cooperative inter-core data sharing scheme works as follows. Data is shared among the 64 slave cores (arranged in an 8 × 8 topology) in 8 steps. In step i (0 ≤ i ≤ 7), the slave core at row i, column i broadcasts its local sample matrix tile to all slave cores in the same row and its local center-point matrix tile to all slave cores in the same column, and the receiving slave cores place the data into their data buffers. At the same time, all slave cores in column i (except the one at row i) broadcast their local sample matrix tiles along their rows, and all slave cores in row i (except the one at column i) broadcast their local center-point matrix tiles along their columns; again the receiving slave cores place the data into their data buffers. After each data-exchange step, every slave core uses the data in its buffer to call the register-blocked computation function and accumulate the distance matrix.
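A sketch of the 8-step exchange as seen from one slave core at position (p, q); the broadcast/receive primitives are hypothetical wrappers over the register-communication bus, not real intrinsics:

```c
extern void   row_bcast(const double *tile);     /* broadcast along own row    */
extern void   col_bcast(const double *tile);     /* broadcast along own column */
extern const double *row_recv(void);             /* receive a row broadcast    */
extern const double *col_recv(void);             /* receive a column broadcast */
extern void   register_blocked_kernel(const double *T, const double *C, double *D);
extern double *local_T, *local_C, *local_D;      /* this core's own tiles      */

/* Cooperative data-sharing sketch for the slave core at (p, q) in 8 x 8. */
void share_and_compute(int p, int q)
{
    for (int i = 0; i < 8; ++i) {                /* 8 data-exchange steps       */
        const double *use_T, *use_C;
        /* step i uses the sample tile of core (p, i) and the center-point tile
         * of core (i, q): cores in column i broadcast T, cores in row i broadcast C */
        if (q == i) { row_bcast(local_T); use_T = local_T; }
        else        { use_T = row_recv(); }
        if (p == i) { col_bcast(local_C); use_C = local_C; }
        else        { use_C = col_recv(); }
        register_blocked_kernel(use_T, use_C, local_D);  /* accumulate local D */
    }
}
```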
In step (1), the register-blocked computation function in the core layer works as follows. The sample matrix tile, center-point matrix tile and distance matrix tile are further divided into 16 × 1 sample matrix panels, 1 × 2 center-point matrix panels and corresponding 16 × 2 distance matrix panels. The core layer is written in assembly code. After register blocking, three for loops arise at the code level, organized in the loop order n, k, d; inside the innermost loop, an innermost assembly function is called that takes a sample matrix panel and a center-point matrix panel and computes their distances to produce a distance matrix panel.
In step (1), after register blocking, the innermost assembly function called inside the innermost loop computes the distances between a sample matrix panel and a center-point matrix panel, producing a distance matrix panel. It proceeds as follows. The elements of the sample matrix panel are held in 4 floating-point vector registers, 4 elements per vector register; the elements of the center-point matrix panel are held in 2 floating-point vector registers, each element being replicated 4 times within one vector register. The Euclidean distance contribution of two elements is computed by first subtracting them and then squaring the difference by multiplication. Therefore 8 registers hold the element-wise differences between the sample matrix panel and the corresponding center-point matrix panel, and 8 registers hold the accumulated Euclidean distances, i.e. the distance matrix elements. In total 22 floating-point vector registers are used in the computation. The Shenwei 26010 processor provides 32 vector registers; apart from these 22, the remaining 10 hold intermediate program variables or serve as loop variables, so the register resources are used rationally and efficiently. In addition, the assembly code is instruction-reordered to further improve performance.
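In plain C (without the hand-written assembly), the 16 × 2 micro-kernel corresponds to the accumulation below; the register mapping described above (4 + 2 + 8 + 8 = 22 vector registers) vectorizes the i loop four doubles at a time:

```c
/* Scalar-C equivalent of the 16 x 2 register-blocked micro-kernel: one panel
 * of 16 samples against one panel of 2 center points, accumulated over the
 * current td dimensions. In the real kernel the i loop is held in 256-bit
 * vector registers and written in assembly. */
static void micro_kernel_16x2(const double *T,   /* 16 x td sample panel,  row-major */
                              const double *C,   /* td x 2 center panel,   row-major */
                              double *D,         /* 16 x 2 distance panel, row-major */
                              int td)
{
    for (int l = 0; l < td; ++l)                 /* dimension loop                   */
        for (int j = 0; j < 2; ++j)              /* 2 center points                  */
            for (int i = 0; i < 16; ++i) {       /* 16 samples (vectorized by 4)     */
                double diff = T[i * td + l] - C[l * 2 + j];
                D[i * 2 + j] += diff * diff;     /* accumulate squared difference    */
            }
}
```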
In step (1), after register blocking, instruction reordering is applied to the innermost assembly function as follows: on the one hand, exploiting the dual-pipeline, dual-issue nature of the Shenwei 26010, computation instructions and memory-access instructions are placed adjacent to each other as much as possible so that they can be issued in the same cycle; on the other hand, dependent instructions are placed as far apart as possible to reduce pipeline stalls.
In step (1), the two-step register-communication-based cluster label reduction function is implemented as follows. In the first step, each slave core reduces its own distance matrix tile, obtaining the minimum of each row together with the column index of that minimum. In the second step, the 8 slave cores of the same row are reduced pairwise in three rounds, finally obtaining, for every row of all distance matrix tiles on that slave-core row, the minimum value and its index. In the first round, the slave cores of columns 1, 3, 5 and 7 pack the results of the first step into vectors and send them via the register-communication function put_r to the slave cores of columns 0, 2, 4 and 6 respectively; after the slave cores of columns 0, 2, 4 and 6 receive the data with get_r, they compare it with their local minima and keep the smaller value and its index. In the second round, the slave cores of columns 2 and 6 send their latest results to the slave cores of columns 0 and 4 respectively, which again compare with their local minima and keep the smaller value and its index. In the third round, the slave core of column 4 sends its latest result to the slave core of column 0; after the slave core of column 0 receives the data and compares it with its local minimum, the index corresponding to the final minimum is the cluster label of the sample in the current row.
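A sketch of the inter-core rounds along one row, for the slave core in column q; put_r/get_r are used here with simplified hypothetical signatures (the real register-communication primitives transfer fixed-size packets):

```c
extern void put_r(int dst_col, double val, int idx);   /* send to a core in the same row */
extern void get_r(double *val, int *idx);              /* receive within the same row    */

/* min_val/min_idx hold this core's per-row minima from the local reduction step. */
void row_label_reduce(int q, double *min_val, int *min_idx, int rows)
{
    for (int stride = 1; stride < 8; stride *= 2) {     /* three rounds: 1, 2, 4             */
        if (q % (2 * stride) == stride) {               /* senders: cols 1,3,5,7 / 2,6 / 4   */
            for (int r = 0; r < rows; ++r)
                put_r(q - stride, min_val[r], min_idx[r]);
        } else if (q % (2 * stride) == 0) {             /* receivers: cols 0,2,4,6 / 0,4 / 0 */
            for (int r = 0; r < rows; ++r) {
                double v; int idx;
                get_r(&v, &idx);
                if (v < min_val[r]) { min_val[r] = v; min_idx[r] = idx; }
            }
        }
    }
    /* after the three rounds, column 0 holds, for every row, the global
     * minimum distance and its column index, i.e. the cluster label */
}
```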
In step (2), the center-point update step, the method parallelizes by dividing tasks over the center points with the following dynamic scheduling strategy. First, a globally shared task pool accessible by every slave core is established. Each task in the pool represents the update of one center point, and the k tasks are numbered 0, 1, …, (k-1) in sequence. Each slave core then claims a task label from the shared pool using the atomic fetch-and-add instruction for mutual exclusion; the atomic instruction guarantees that only one slave core reads and modifies the task label in the global pool at any moment, ensuring consistency. After finishing its current task, a slave core takes the next task from the shared pool, until all tasks in the pool have been taken.
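A sketch of the work-sharing loop on each slave core; fetch_and_add and update_center are hypothetical placeholders for the atomic instruction and the per-cluster update routine:

```c
extern long fetch_and_add(volatile long *addr, long inc);  /* atomic increment     */
extern void update_center(int c);                          /* update center point c */

volatile long next_task = 0;        /* shared task-pool counter in main memory */

/* Dynamic-scheduling sketch: every slave core repeatedly claims the next
 * center-point index until the pool is exhausted. */
void slave_update_centers(int k)
{
    for (;;) {
        long c = fetch_and_add(&next_task, 1);   /* claim task label c     */
        if (c >= k) break;                       /* task pool is exhausted */
        update_center((int)c);
    }
}
```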
Compared with the prior art, the invention has the beneficial effects that:
(1) A computation framework is provided that fuses blocked distance-matrix computation with the reduction operation. The framework avoids storing the whole distance matrix: the distance matrix is computed block by block, and while a block is computed the clusters of the corresponding samples are reduced immediately, thereby eliminating both the storage of the distance matrix and the two memory accesses to it (one read and one write) and providing a better memory-access path.
(2) In addition, the performance of K-means is further greatly improved by the three-layer blocking strategy, the cooperative inter-core data sharing scheme, the register-communication-based cluster label reduction method, and optimizations such as double buffering, instruction reordering and dynamic scheduling.
(3) Tested with 5 data sets from the UCI Machine Learning Repository, the invention reaches up to 46.9% of the machine's peak floating-point performance; compared with the theoretically attainable maximum performance of K-means, it achieves 47%-84% floating-point efficiency; and compared with a traditional unfused implementation, it obtains a speedup of up to 1.7x, 1.3x on average.
Drawings
FIG. 1 is a basic flow chart of the calculation process of the parallel K-means method of the present invention;
FIG. 2 is a schematic diagram of block mapping an input matrix;
FIG. 3 is a diagram of data sharing among the slave cores;
FIG. 4 is a diagram of the algorithm implementation of the cluster label reduction;
FIG. 5 is a graph comparing the performance of the implementation of the method of the present invention with that of the unfused implementation on the Shenwei many-core platform;
FIG. 6 is a graph of the performance effects of the different optimization methods employed by the present invention on the Shenwei many-core platform;
FIG. 7 is a graph comparing the floating-point performance obtained by the present invention on the Shenwei many-core platform with the theoretical value.
Detailed Description
The invention is explained in detail below with reference to the drawings and an exemplary embodiment.
As shown in FIG. 1, the invention relates to a high-performance parallel implementation method of the K-means algorithm on the domestic Shenwei 26010 many-core processor. The K-means computation is an iterative solution process divided into four main steps: center-point initialization, clustering, iterative-convergence-value calculation and center-point update. First, some center-point initialization method assigns initial values to the center-point matrix; initial clustering is then performed to obtain an iterative convergence value and convergence is tested. If the algorithm has not converged, the main loop is entered; if it has converged, the current clustering result is returned. The main loop performs two operations, clustering and center-point update, and exits once the clustering result reaches the convergence criterion. The final output of the algorithm is the cluster label of each sample.
The method initializes the center points randomly and does not address parallelization of more refined initialization algorithms such as K-means++. The mean square error of all samples with respect to their assigned center points is used as the iteration convergence criterion, and its computation is fused into the clustering operation. The correctness of the parallel algorithm is verified by comparison against the serial algorithm.
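Concretely, the convergence value referred to here is the mean of the squared distances from each sample to its assigned center point, which the clustering step accumulates as a by-product (a standard formulation, consistent with the description):

```latex
\mathrm{MSE} \;=\; \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - c_{\ell(i)} \rVert^2,
\qquad \ell(i) \;=\; \arg\min_{1 \le j \le k} \lVert x_i - c_j \rVert^2 .
```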
Among the four steps, the clustering and center-point update operations lie inside the main loop and are executed repeatedly, so they are the hot-spot functions of the K-means algorithm. The invention therefore focuses on parallel acceleration of the clustering and center-point update operations, which are parallelized over multiple threads using the Athread thread library provided by Shenwei.
As shown in FIG. 2, the sample matrix T (n × d) and the center-point matrix C (d × k) are partitioned into sample matrix blocks T(i, l) of size b_n × b_d and center-point matrix blocks C(l, j) of size b_d × b_k, where the indices denote block positions along the row/column directions. At the same time, a temporary buffer of size b_n × b_k is allocated in LDM for the distance matrix block D(i, j). Each distance matrix block D(i, j) is computed by the following formula:

D(i, j) = Σ_l T(i, l) ⊗ C(l, j)    (1)

The symbol ⊗ has the following meaning: if X, Y and Z are matrices of size n × d, d × k and n × k respectively, then Z = X ⊗ Y in (1) denotes the distance computation between the two matrices, i.e. each element Z(i, j), 0 ≤ i < n, 0 ≤ j < k, is obtained by the formula

Z(i, j) = Σ_{m=0}^{d-1} ( X(i, m) − Y(m, j) )²

where X(i, m) and Y(m, j) are the element in row i, column m of matrix X and the element in row m, column j of matrix Y, respectively.
Then each sample matrix block and center-point matrix block is further partitioned into an 8 × 8 grid of smaller tiles, giving sample matrix tiles εT(p, q) of size t_n × t_d, center-point matrix tiles εC(p, q) of size t_d × t_k and distance matrix tiles εD(p, q) of size t_n × t_k. Slave core CPE(p, q) (0 ≤ p < 8, 0 ≤ q < 8) stores εT(p, q), εC(p, q) and εD(p, q) and is responsible for updating its local εD(p, q), i.e.:

εD(p, q) = Σ_{l=0}^{7} εT(p, l) ⊗ εC(l, q)

As the formula shows, a slave core does not locally hold all the data needed to compute its εD(p, q). The invention adopts a cooperative data sharing scheme to exchange data between the slave cores. The slave cores of Shenwei are arranged in an 8 × 8 topology; for simplicity of description, a 3 × 3 slave-core array is used as an example of the sharing process. As shown in FIG. 3, in the initial state each slave core CPE(p, q) (0 ≤ p < 3, 0 ≤ q < 3) stores a local sample matrix tile εT(p, q) and a center-point matrix tile εC(p, q); in the figure the sample matrix tiles, center-point matrix tiles and distance matrix tiles are drawn as dark grey, light grey and black rectangles respectively, solid outlines denoting local tiles and dashed outlines received tiles. Each CPE is responsible for computing its local distance matrix tile εD(p, q), i.e. the accumulated result over the 3 tile pairs:

εD(p, q) = Σ_{l=0}^{2} εT(p, l) ⊗ εC(l, q)

This operation is done in 3 steps; each step first obtains the data of the required tile pair and then accumulates that pair's contribution, i.e. performs εD(p, q) += εT(p, l) ⊗ εC(l, q). Specifically, in step 0 every slave core computes pair 0, i.e. εD(p, q) += εT(p, 0) ⊗ εC(0, q). At this point CPE(0,0) broadcasts its local sample matrix tile εT along its row and its local center-point matrix tile εC along its column; CPE(0, ·) (all slave cores of row 0 except CPE(0,0)) broadcast their local center-point matrix tiles εC along their columns, and CPE(·, 0) (all slave cores of column 0 except CPE(0,0)) broadcast their local sample matrix tiles εT along their rows. After the broadcasts, every slave core has the tiles it needs and performs the computation. In step 1 every slave core computes pair 1, i.e. εD(p, q) += εT(p, 1) ⊗ εC(1, q); now CPE(1,1) broadcasts its local sample matrix tile εT along its row and its local center-point matrix tile εC along its column, CPE(1, ·) (all slave cores of row 1 except CPE(1,1)) broadcast their local center-point matrix tiles εC along their columns, and CPE(·, 1) (all slave cores of column 1 except CPE(1,1)) broadcast their local sample matrix tiles εT along their rows. Step 2 proceeds analogously around CPE(2,2).
When each slave core computes its local εD(p, q), register blocking divides εT, εC and εD into 16 × 1 sample matrix panels, 1 × 2 center-point matrix panels and corresponding 16 × 2 distance matrix panels, denoted γT, γC and γD, and the distance computation is carried out panel by panel. The elements of γT are stored in 4 floating-point vector registers, 4 elements per register; the elements of γC are stored in 2 floating-point vector registers, each element being replicated 4 times within one register. The Euclidean distance contribution of two elements is obtained by first subtracting them and then squaring the difference by multiplication; therefore 8 registers store the element-wise differences between γT and γC, and 8 registers store the resulting Euclidean distances, i.e. the elements of γD. Meanwhile, exploiting the dual-pipeline, dual-issue nature of the Shenwei 26010, the assembly code is instruction-reordered: on the one hand, computation and memory-access instructions are made able to issue simultaneously whenever possible; on the other hand, the instruction order is adjusted so that dependent instructions are placed as far apart as possible, reducing pipeline stalls.
After all slave cores finish computing their local εD(p, q), the 64 slave cores have together completed one block pair of D(i, j), i.e. D(i, j) += T(i, l) ⊗ C(l, j); they then proceed to the next block pair T(i, l+1) ⊗ C(l+1, j), and D(i, j) is complete only when all block pairs have been accumulated (as in equation (1)). To hide the memory-access cost, the invention uses double buffering to overlap memory access with computation: while T(i, l) ⊗ C(l, j) is being computed, the data of T(i, l+1) and C(l+1, j) are prefetched by asynchronous DMA.
After a distance matrix block D(i, j) has been computed, the index of the nearest center point of every sample covered by D(i, j) must be obtained as that sample's cluster label. By the blocking strategy, the distance matrix block D(i, j) is distributed evenly over the 8 × 8 slave-core array; e.g. the LDM of CPE(p, q) holds the distance matrix tile εD(p, q), which stores the distances of some samples to a part of the center points. As shown in FIG. 4, the invention designs a two-step cluster label reduction function based on register communication, implemented as follows. First, a local reduction: each slave core reduces its own distance matrix tile, obtaining the minimum (squared distance) of each row and the corresponding column index (cluster label). Second, an inter-core reduction: the 8 slave cores of the same row are reduced pairwise in three rounds, finally yielding, for every row of all distance matrix tiles on that slave-core row, the minimum value and its index. The figure illustrates the slave cores of row p. In the first round, slave cores CPE(p, 1), CPE(p, 3), CPE(p, 5) and CPE(p, 7) pack the results of the first step into vectors and send them via the register-communication function put_r to slave cores CPE(p, 0), CPE(p, 2), CPE(p, 4) and CPE(p, 6) respectively; each receiver obtains the data with get_r, compares it with its local minimum, and keeps the smaller value and its index. In the second round, CPE(p, 2) and CPE(p, 6) send their latest results to CPE(p, 0) and CPE(p, 4) respectively, which again compare and keep the smaller value and its index. In the third round, CPE(p, 4) sends its latest result to CPE(p, 0); after CPE(p, 0) performs the final comparison, the index corresponding to the smaller value is the cluster label of the sample in the current row. Finally, CPE(p, 0) transfers all obtained cluster labels and the accumulated mean-square-error contributions (the global cluster labels and global mean square error) back to main memory, and this write-back completes the clustering step.
For the center-point update, because the numbers of samples belonging to the different clusters may be very uneven, static task allocation would cause severe load imbalance among the slave cores, so a dynamic scheduling strategy is adopted. First, a globally shared task pool accessible by every slave core is established. Each task in the pool represents the update of one center point, and the k tasks are numbered 0, 1, …, (k-1) in sequence. Each slave core then claims a task label from the shared pool using the atomic fetch-and-add instruction for mutual exclusion; the atomic instruction guarantees that only one slave core reads and modifies the task label in the global pool at any moment, ensuring consistency. After finishing its current task, a slave core takes the next task from the shared pool, until all tasks in the pool have been taken.
The test platform of the invention is one core group of the Shenwei 26010 many-core platform, comprising 1 master core and 64 slave cores. As shown in FIG. 5, when tested on data sets from the UCI Machine Learning Repository, the algorithm provided by the invention outperforms the traditional unfused method, both implementations using the same optimizations such as double buffering and instruction reordering. The performance data of FIG. 5 show that the invention achieves a speedup of up to 1.7x, 1.3x on average, over the unfused implementation.
As shown in FIG. 6, the various optimization methods used in the invention improve performance on 4 sets of randomly generated data. The statistics of FIG. 6 show that the algorithm with only the blocking techniques reaches 103.4-112.8 GFlops. With instruction reordering, the algorithm gains a speedup of up to 2.3x. With double buffering, the invention obtains a further 30% performance improvement. The dynamic scheduling technique has little effect on the overall speedup of the K-means algorithm, but it brings a 40% performance gain to the center-point update step.
FIG. 7 compares, on the UCI test data sets, the floating-point performance obtained by the invention with the maximum performance theoretically attainable by K-means on the platform. The statistics of FIG. 7 show that the invention reaches a computing performance of 348.1 GFlops and achieves 47%-84% floating-point efficiency relative to the theoretical floating-point maximum.
In summary, the invention provides a high-performance parallel implementation method of the K-means algorithm on the domestic Shenwei 26010 many-core processor and proposes a computation framework that fuses blocked distance-matrix computation with the reduction operation. The framework avoids storing the whole distance matrix: the distance matrix is computed block by block and, while each block is computed, the clusters of the corresponding samples are reduced immediately, eliminating both the storage of the distance matrix and the two memory accesses to it (one read and one write). For the domestic Shenwei 26010 platform, the framework divides tasks with a three-layer blocking strategy, designs a cooperative inter-core data sharing scheme to improve data reuse, realizes the cluster label reduction with the register communication mechanism, and uses optimizations such as double buffering, instruction reordering and dynamic scheduling to greatly improve performance.
Meanwhile, based on the domestic Shenwei 26010 processor platform, the invention designs, for the clustering stage, a computation framework fusing blocked distance-matrix computation with the reduction operation; the framework divides tasks with a three-layer blocking strategy and additionally designs a cooperative inter-core data sharing scheme and a register-communication-based cluster label reduction method, together with optimizations such as double buffering and instruction reordering. For the center-point update stage, the invention designs a dynamically scheduled task division scheme. In tests on real data sets, the invention reaches a maximum floating-point performance of 348.1 GFlops, achieves 47%-84% of the theoretically attainable maximum performance, and obtains a speedup of up to 1.7x (1.3x on average) over the unfused computation scheme.
The above examples are provided only for the purpose of describing the present invention, and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalent substitutions and modifications can be made without departing from the spirit and principles of the invention, and are intended to be within the scope of the invention.
Claims (9)
1. A high-performance parallel implementation method of a K-means algorithm on a domestic Shenwei 26010 many-core processor, based on a core group of the domestic Shenwei 26010 processor, each core group consisting of one master core and 64 slave cores, the K-means algorithm comprising the following steps: (a) initializing center points, (b) clustering, (c) calculating an iterative convergence value, and (d) updating the center points; steps (a) and (b) are completed first, the iterative convergence value is then obtained according to step (c) and convergence is judged; if convergence has not occurred, the main loop formed by steps (b) to (d) is entered again; if convergence has occurred, the current clustering result is returned, the method exits, and the cluster label corresponding to each sample is output; the method is characterized in that: it mainly performs parallel optimization on the time-consuming steps (b) and (d);
the (b) clustering step is implemented as: calculating the distance between each sample and all the central points, solving the central point with the closest distance for each sample, storing the subscript of the central point as the cluster label of the sample, and further dividing all the samples into the clusters with the closest distance;
the step (d) of updating the center point comprises the following steps: updating the center point corresponding to each of the clusters to a weighted average of all the samples belonging to the cluster;
the (b) clustering step is realized as follows: establishing a parallel implementation framework for fusing block distance matrix calculation and a cluster label protocol, wherein the framework is implemented in a clustering parallel mode by using a three-layer code design structure of an interface layer, a drive layer and a core layer, the interface layer is a master core end operation, and the drive layer and the core layer are slave core end operations; the framework adopts a three-layer blocking algorithm of logic blocking, physical blocking and register blocking, wherein a driving layer performs logic blocking and physical blocking operations, and a core layer performs register blocking operations, and the method is specifically realized as follows:
(1) firstly, an interface layer reads input data, stores the input data into a sample matrix with dimension of n multiplied by d, wherein n is the number of samples, d is the dimension of the samples, and stores a central point into a central point matrix with dimension of d multiplied by k, wherein k is the number of clusters, namely the number of the central points;
(2) the driver layer then performs logical blocking and physical blocking, namely, the sample matrix and the center-point matrix are respectively divided along the n, d and k dimensions by logical blocking to obtain sample matrix blocks and center-point matrix blocks, and each block is traversed in the for-loop order n, k, d; the k loop is performed inside the n loop, the d loop inside the k loop, and the two-step register-communication-based cluster label reduction function is called; DMA transfers are initiated in the d loop to transfer a sample matrix block and a center-point matrix block from main memory to the LDM of the slave cores, and a core-layer function is called to compute the distances between the sample matrix block and the center-point matrix block to obtain a distance matrix block;
(3) the driver layer then further physically partitions the sample matrix block, the center-point matrix block and the distance matrix block obtained by logical blocking, dividing each into 64 smaller blocks called tiles, namely sample matrix tiles, center-point matrix tiles and distance matrix tiles; each sample matrix tile and center-point matrix tile is DMA-transferred to the fast LDM buffer of the corresponding slave core among the 64 slave cores, e.g. the i-th tile of the sample matrix block and the i-th tile of the center-point matrix block are transferred to the i-th slave core; meanwhile, each slave core allocates memory space for a distance matrix tile in its LDM and is responsible for computing the values of that distance matrix tile, the driver layer calling a core-layer function to compute the distance matrix tile it is responsible for;
(4) in a core layer, when each slave core calculates a responsible distance matrix tile, a sample matrix tile and a central point matrix tile required for calculating the distance matrix tile need to be obtained from other slave cores; calling a calculation function based on a register block method; in the calculation function, each slave core performs calculation of the responsible distance matrix tile in a mode of assembly codes;
(5) calculating a distance matrix tile on each slave core, and solving a minimum value of each row of the distance matrix tile by adopting a register communication-based cluster label reduction method, wherein a global column label corresponding to the minimum value is a cluster label of a sample corresponding to a corresponding global row label;
(6) and finally, storing the cluster label of the sample into a main memory to finish clustering operation.
2. The high-performance parallel implementation method of the K-means algorithm on the domestic Shenwei 26010 many-core processor according to claim 1, wherein the method comprises the following steps: the step (d) of updating the central point is realized as follows:
(1) the clustering step finds a cluster to which each sample belongs, and updates the central point to be a weighted average of all the samples belonging to the cluster according to the number of the central points of the set clusters;
(2) the parallel mode design of task division is carried out on the central point, a task scheduling strategy based on work sharing is adopted, and each slave core is responsible for updating and calculating a part of the distributed central points; when each slave core processes the update of a cluster which is responsible for the slave core, firstly, an array new_centroid with the length of d is created and set to be zero, and new_centroid is used for storing a new central point of the cluster; then reading a cluster label of each sample, judging whether the sample belongs to a cluster processed currently, if the sample belongs to the cluster processed currently, reading the corresponding sample into an LDM space of the slave core, and accumulating the sample into new_centroid;
(3) after traversing the cluster label array, new_centroid stores the accumulated values of the corresponding dimensions of all samples belonging to the cluster, and then the value of each dimension of new_centroid is divided by the number of the samples in the cluster, so as to obtain the new value of the central point of the cluster.
3. The high-performance parallel implementation method of the K-means algorithm on the domestic Shenwei 26010 many-core processor according to claim 1, wherein the method comprises the following steps: the driving layer firstly initiates DMA transmission in the d cycle to obtain a distance matrix block, and based on the asynchronism of the DMA transmission, a dual-buffer mechanism based on access memory-calculation overlapping is adopted, so that when the distance matrix block in the current cycle is calculated, DMA reading operation of a sample matrix block and a central point matrix block in the next cycle is initiated, the access memory and calculation time is covered, and the calculation performance is improved.
4. The high-performance parallel implementation method of the K-means algorithm on the domestic Shenwei 26010 many-core processor according to claim 1, wherein the method comprises the following steps: in step (4) of the clustering step, when each slave core computes the distance matrix tile it is responsible for and needs to obtain from other slave cores the sample matrix tiles and center-point matrix tiles required for that computation, this is realized by a designed cooperative inter-core data sharing method; the inter-core data sharing method shares data among the 64 slave cores, which are arranged in an 8 × 8 topology, in 8 steps; in step i, 0 ≤ i ≤ 7, the slave core at row i, column i broadcasts its local sample matrix tile to all slave cores in the same row and its local center-point matrix tile to all slave cores in the same column, and the receiving slave cores place the corresponding data into their own data buffers; meanwhile, all slave cores in column i except the one at row i broadcast their local sample matrix tiles along their rows, all slave cores in row i except the one at column i broadcast their local center-point matrix tiles along their columns, and the receiving slave cores place the corresponding data into their own data buffers; after each data-transfer step, each slave core uses the data in its data buffer to call the register-blocked computation function to compute the distance matrix.
5. The high-performance parallel implementation method of the K-means algorithm on the domestic Shenwei 26010 many-core processor according to claim 4, wherein the method comprises the following steps: the register-blocked computation function is specifically as follows: dividing the sample matrix tile, the center-point matrix tile and the distance matrix tile into 16 × 1 sample matrix panels, 1 × 2 center-point matrix panels and corresponding 16 × 2 distance matrix panels; the core layer is written in assembly code; after register blocking, 3 for loops arise at the code level, organized in the loop order n, k, d, and inside the innermost loop a sample matrix panel and a center-point matrix panel are passed to an innermost assembly function that computes their distances to produce a distance matrix panel.
6. The high-performance parallel implementation method of the K-means algorithm on the domestic Shenwei 26010 many-core processor of claim 5, wherein the method comprises the following steps: after each slave core adopts the register blocking method, the innermost assembly function called inside the innermost loop computes the distances between a sample matrix panel and a center-point matrix panel to produce a distance matrix panel; the innermost assembly function operates as follows: the elements of the sample matrix panel are stored in 4 floating-point vector registers, 4 elements per vector register; the elements of the center-point matrix panel are stored in 2 floating-point vector registers, each element being replicated 4 times within one vector register; when the Euclidean distance contribution of two elements is computed, the two elements are first subtracted and the difference is then squared by multiplication; 8 registers store the element-wise differences between the sample matrix panel and the corresponding center-point matrix panel, and 8 registers store the resulting Euclidean distances, i.e. the distance matrix elements; 22 floating-point vector registers are used in the computation; the Shenwei 26010 processor has 32 vector registers, and apart from the 22 used, the remaining 10 store intermediate program variables or serve as loop variables, so the register resources are used rationally and efficiently; meanwhile, the assembly code is instruction-reordered to further improve program performance.
7. The high-performance parallel implementation method of the K-means algorithm on the domestic Shenwei 26010 many-core processor of claim 6, wherein the method comprises the following steps: after the register blocking method is adopted, instruction reordering is applied to the innermost assembly function as follows: based on the dual-pipeline, dual-issue nature of the Shenwei 26010, computation instructions and memory-access instructions are placed adjacent to each other as much as possible so that they can be issued simultaneously; meanwhile, dependent instructions are placed as far apart as possible to reduce pipeline stalls.
8. The high-performance parallel implementation method of the K-means algorithm on the domestic Shenwei 26010 many-core processor according to claim 1, wherein the method comprises the following steps: in step (5) of the clustering step, the two-step register-communication-based cluster label reduction function is implemented as follows: first, each slave core reduces the distance matrix tile on that slave core, obtaining the minimum of each row of the distance matrix tile and the column index corresponding to the minimum; second, the 8 slave cores of the same row are reduced pairwise in three rounds, finally obtaining, for every row of all distance matrix tiles on the slave cores of that row, the minimum value and its index; in the first round, the slave cores of columns 1, 3, 5 and 7 pack the results of the first step into vectors and send them via the register communication function put_r to the slave cores of columns 0, 2, 4 and 6 respectively; after receiving the data with get_r, the slave cores of columns 0, 2, 4 and 6 compare it with their local minima and keep the smaller value and its index; in the second round, the slave cores of columns 2 and 6 send their latest results to the slave cores of columns 0 and 4 respectively, which after receiving the data compare it with their local minima and keep the smaller value and its index; in the third round, the slave core of column 4 sends its latest result to the slave core of column 0, which after receiving the data compares it with its local minimum, and the index corresponding to the resulting minimum is the cluster label of the sample in the current row.
9. The high-performance parallel implementation method of the K-means algorithm on the domestic Shenwei 26010 many-core processor of claim 2, wherein: the parallel scheme adopted for the central point update is a dynamic task scheduling strategy: first, a globally shared task pool is created, and every slave core can access it; each task in the pool represents the update of one central point, and the k tasks are labelled 0, 1, ..., (k-1) in sequence; each slave core then obtains a task label from the shared pool through the mutually exclusive atomic instruction fetch-and-add; the atomic instruction guarantees that at any moment only one slave core reads and modifies the task label in the global task pool, thereby ensuring consistency; after a slave core finishes its current task, it takes the next task from the shared pool, until all tasks in the pool have been taken.
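A minimal C sketch of this dynamic scheduling loop is shown below, using the GCC built-in __sync_fetch_and_add as a stand-in for the processor's atomic fetch-and-add; the shared counter, update_center, and the worker function are illustrative assumptions, not the patent's actual code.

```c
/* Globally shared task counter, one task per central point (0 .. k-1).
 * Initialised to 0 on the master core before the slave cores start. */
static volatile long next_task = 0;

/* Illustrative per-task work: recompute one central point from the
 * samples currently assigned to it. */
extern void update_center(long center_id);

/* Worker loop run by every slave core: atomically grab the next task
 * label and process it until the pool is exhausted. */
static void center_update_worker(long k)
{
    for (;;) {
        /* fetch-and-add hands out each task label exactly once */
        long task = __sync_fetch_and_add(&next_task, 1);
        if (task >= k)
            break;                 /* task pool exhausted */
        update_center(task);
    }
}
```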
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810188779.2A CN108509270B (en) | 2018-03-08 | 2018-03-08 | High-performance parallel implementation method of K-means algorithm on domestic Shenwei 26010 many-core processor |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810188779.2A CN108509270B (en) | 2018-03-08 | 2018-03-08 | High-performance parallel implementation method of K-means algorithm on domestic Shenwei 26010 many-core processor |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108509270A CN108509270A (en) | 2018-09-07 |
CN108509270B true CN108509270B (en) | 2020-09-29 |
Family ID: 63376287
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810188779.2A Active CN108509270B (en) | 2018-03-08 | 2018-03-08 | High-performance parallel implementation method of K-means algorithm on domestic Shenwei 26010 many-core processor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108509270B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109445850A (en) * | 2018-09-19 | 2019-03-08 | 成都申威科技有限责任公司 | A kind of matrix transposition method and system based on 26010 processor of Shen prestige |
CN109490895B (en) * | 2018-10-25 | 2020-12-29 | 中国人民解放军海军工程大学 | Interferometric synthetic aperture sonar signal processing system based on blade server |
CN109491791B (en) * | 2018-11-09 | 2021-11-19 | 华东师范大学 | Master-slave enhanced operation method and device of NSGA-II (non-subsampled Gate-associated genetic algorithm-II) based on Shenwei many-core processor |
CN109828790B (en) * | 2019-01-31 | 2020-10-20 | 上海赜睿信息科技有限公司 | Data processing method and system based on Shenwei heterogeneous many-core processor |
CN112181894B (en) * | 2019-07-04 | 2022-05-31 | 山东省计算中心(国家超级计算济南中心) | In-core group adaptive adjustment operation method based on Shenwei many-core processor |
CN110362780B (en) * | 2019-07-17 | 2021-03-23 | 北京航空航天大学 | Large data tensor canonical decomposition calculation method based on Shenwei many-core processor |
WO2023097424A1 (en) * | 2021-11-30 | 2023-06-08 | Intel Corporation | Method and apparatus for fusing layers of different models |
CN114880272B (en) * | 2022-03-31 | 2024-06-07 | 深圳清华大学研究院 | Optimization method and application of global height degree vertex set communication |
CN116881618B (en) * | 2023-08-25 | 2024-06-04 | 之江实验室 | General matrix multiplication calculation optimization method, device and processor |
CN118245117B (en) * | 2024-05-29 | 2024-08-09 | 山东省计算中心(国家超级计算济南中心) | Multi-branch automatic analysis parallel optimization method based on new generation Shenwei many-core processor |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7870136B1 (en) * | 2007-05-24 | 2011-01-11 | Hewlett-Packard Development Company, L.P. | Clustering data with constraints |
CN104360985A (en) * | 2014-10-20 | 2015-02-18 | 浪潮电子信息产业股份有限公司 | Method and device for realizing clustering algorithm based on MIC |
Non-Patent Citations (3)
Title |
---|
"CMP上基于数据集划分的K-means多核优化算法";申彦等;《智能系统学报》;20150831;第10卷(第4期);第3节 * |
"Combining Distributed and Multi-core Programming Techniques to Increase the Performance of K-Means Algorithm";Ilias K. Savvas 等;《2017 IEEE 26th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE)》;20170818;第95-100页 * |
"面向CPU/MIC异构架构的K-Means向量化算法";谭郁松 等;《计算机科学与探索》;20140630;第8卷(第06期);第641-652页 * |
Also Published As
Publication number | Publication date |
---|---|
CN108509270A (en) | 2018-09-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108509270B (en) | High-performance parallel implementation method of K-means algorithm on domestic Shenwei 26010 many-core processor | |
KR102443546B1 (en) | matrix multiplier | |
CN107168683B (en) | GEMM dense matrix multiplication high-performance implementation method on Shenwei 26010 many-core CPU | |
US7926046B2 (en) | Compiler method for extracting and accelerator template program | |
CN109002659B (en) | Fluid machinery simulation program optimization method based on super computer | |
CN110135584B (en) | Large-scale symbolic regression method and system based on adaptive parallel genetic algorithm | |
CN102193830B (en) | Many-core environment-oriented division mapping/reduction parallel programming model | |
CN110362780B (en) | Large data tensor canonical decomposition calculation method based on Shenwei many-core processor | |
CN112446471B (en) | Convolution acceleration method based on heterogeneous many-core processor | |
CN111368484B (en) | Cosmic N-body numerical simulation optimization method and system based on Shenwei architecture | |
WO2022068205A1 (en) | Data storage method and system, and data reading method and system | |
US11816061B2 (en) | Dynamic allocation of arithmetic logic units for vectorized operations | |
CN114970294B (en) | Three-dimensional strain simulation PCG parallel optimization method and system based on Shenwei architecture | |
CN115983348A (en) | RISC-V accelerator system supporting convolution neural network extended instruction | |
WO2016024508A1 (en) | Multiprocessor device | |
CN117828252A (en) | High-performance matrix vector multiplication method based on matrix core | |
CN110008436B (en) | Fast Fourier transform method, system and storage medium based on data stream architecture | |
CN112559032A (en) | Many-core program reconstruction method based on loop segment | |
CN111047035A (en) | Neural network processor, chip and electronic equipment | |
CN112579089B (en) | Heterogeneous many-core data reuse method | |
CN112631610B (en) | Method for eliminating memory access conflict for data reuse of coarse-grained reconfigurable structure | |
González-Domínguez et al. | Design and performance issues of Cholesky and LU solvers using UPCBLAS | |
Chen et al. | Register allocation compilation technique for ASIP in 5G micro base stations | |
CN112463218A (en) | Instruction emission control method and circuit, data processing method and circuit | |
CN113448586A (en) | Integration of automated compiler data flow optimization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||