CN106407561B - Method for dividing parallel GPDT algorithm on multi-core SOC - Google Patents
- Publication number
- CN106407561B (grant of application CN201610832540.5A)
- Authority
- CN
- China
- Prior art keywords
- matrix
- core
- algorithm
- parallel
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/30—Circuit design
- G06F30/39—Circuit design at the physical level
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Abstract
The invention belongs to the technical field of integrated circuit design and relates to a method for partitioning the parallel GPDT algorithm on a multi-core SoC. The parallel GPDT algorithm consists of two nested iteration layers: the inner iteration solves the QP subproblem on the working set, and the outer iteration updates the working set. On the critical path for speed, the bottleneck of the outer loop is the gradient update, and the bottleneck of the inner loop is the matrix-vector product computed after each projection; these two matrix operations must be processed in parallel across the cores. The remaining operations need only be executed serially on the main core, including the gradient projection realized with the Dai-Fletcher algorithm and the working-set update realized by introducing a quicksort algorithm. The vector obtained at the end of the computation contains the support vectors for the GPDT training data.
Description
Technical Field
The invention belongs to the technical field of integrated circuit design, and particularly relates to a method for dividing a parallel GPDT algorithm on a multi-core SoC.
Background
The GPDT algorithm is a decomposition method for the original QP problem proposed by Zanni et al. The number of working-set variables in each iteration is on the order of 10^2 to 10^3, so the algorithm converges after only a few iterations. Although the computation per iteration is large, the complex calculations can be distributed over multiple processors in parallel, yielding a faster training speed.
The original expression of the support vector machine problem is:

min_α F(α) = (1/2) αᵀGα − αᵀ1,  s.t. yᵀα = 0, 0 ≤ α ≤ C1,

where G is the l × l kernel matrix with entries G_ij = y_i y_j K(x_i, x_j), y is the sample label vector and C is the penalty parameter.
The decomposition splits the vector of unknowns α into two parts: a working set, denoted B, and a non-working set, denoted N. The vector of unknowns, the sample label vector and the kernel matrix in the formula are decomposed into the following form:

α = [α_B; α_N],  y = [y_B; y_N],  G = [G_BB, G_BN; G_NB, G_NN].
After simplification, the decomposed QP subproblem takes the following form:

min_{α_B} (1/2) α_Bᵀ G_BB α_B + (G_BN α_N − 1)ᵀ α_B,  s.t. y_Bᵀ α_B = −y_Nᵀ α_N, 0 ≤ α_B ≤ C1.
The QP subproblem is solved in four main steps; the final result is obtained by loop iteration, and the iteration terminates when the KKT (Karush-Kuhn-Tucker) conditions are satisfied.
The specific steps of the algorithm are as follows:
step 1: and (5) initializing.
Will vectorInitialized to 0 and then two integers n are selectedBAnd nCLet 0 be equal to or less than nC≤nB≤1,nCIs even number, slave vectorIn (1) random selection of nBForming a working set B by one element, forming a non-working set N by the rest elements, and enabling the outer layer iteration number k to be 1;
step 2: and solving the QP subproblem.
Step 2.1: initialization
Order toRepresenting an initial gradient, and orderStep down ρ0∈[ρmin,ρmax],ρminAnd ρmaxIs a predetermined value and satisfies 0<ρmin<ρmaxMaking the inner layer iteration number k' equal to 0;
Step 2.2: projection.
Let P_Ω(·) denote the projection onto the feasible region Ω. First check whether the vector u^(k') satisfies the termination condition; if so, end the iteration; otherwise compute the gradient-descent direction using:

d^(k') = P_Ω(u^(k') − ρ_k' ∇f(u^(k'))) − u^(k').
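The projected step above can be sketched in a few lines, assuming for illustration a simple box feasible region [0, C]^n in place of the full region Ω that the Dai-Fletcher method handles (both function names are illustrative, not from the patent):

```python
import numpy as np

def box_projection(v, C):
    """Project v onto the box [0, C]^n -- a simplified stand-in for P_Omega;
    the true feasible region also carries an equality constraint, which the
    Dai-Fletcher projection method accounts for."""
    return np.clip(v, 0.0, C)

def descent_direction(u, grad, rho, C):
    """d = P_Omega(u - rho * grad) - u; d = 0 signals a stationary point."""
    return box_projection(u - rho * grad, C) - u
```

When `u` is already optimal on the box, the direction comes back as the zero vector, which is exactly the termination test of step 2.2.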
step 2.3: matrix multiplication
Step 2.4: line search
Step 2.5: updating
a new gradient descent step size p is then calculatedk′+1Let k '═ k' +1 for the number of iterations, and return to step 2.2.
Step 3: gradient update.
After the update, if α satisfies the KKT conditions, the iteration ends; otherwise proceed to the next step.
Step 4: working-set update.
The following auxiliary problem is solved first:
then, the result is obtainedα corresponding to the non-zero term in (1)iTaken out to form a working setThe maximum number of non-zero terms is ncThen take out the elements from the old working set B and fill inIn (1), up toIn to nBAn element, last orderk is k +1 and then returns to step 2.
The advantage of the GPDT algorithm is that the number of working-set elements solved in each iteration can reach the order of 10^3, which lets the algorithm converge quickly; however, because of the large number of matrix operations, the computation within a single iteration is very heavy.
Disclosure of Invention
The invention aims to provide a method for partitioning the parallel GPDT algorithm on a multi-core SoC, so that the computation time of a single iteration is greatly shortened and the efficiency of the whole training algorithm is improved.
The invention provides a method for partitioning the parallel GPDT algorithm on a multi-core SoC. The general idea is to distribute the n_B elements of the working set B evenly over N processors, each of which holds a local backup of the training data, so that the matrix operations can conveniently be distributed to the N processors. From the basic principle of the algorithm, its parallelism is concentrated in steps 2 and 3, where the matrix operations are concentrated.
The method for dividing the parallel GPDT algorithm on the multi-core SoC comprises two parts: row decomposition and column decomposition; the details are as follows.
Row decomposition method. It comprises three steps: decomposing the matrix by rows, computing in parallel, and splicing the results:
In the initialization of step 2.1, the initial gradient is obtained from the product A·u, where A is an n_B × n_B matrix and u is an n_B × 1 column vector, so the result is also an n_B × 1 column vector. First, the matrix A is decomposed by rows into blocks A_np, each an n_p × n_B matrix; then each core computes A_np · u; finally, the main core splices the per-core results together to obtain the value of A·u.
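The row decomposition can be emulated serially in NumPy as a sketch (illustrative names; `np.array_split` plays the role of distributing row blocks to the N cores, and the list comprehension stands in for the per-core computation):

```python
import numpy as np

def rowwise_matvec(A, u, n_cores=4):
    """Row-decomposition sketch of A @ u: split A into row blocks
    (one block A_np per core), multiply each block by the full vector u,
    then splice the partial results together on the main core."""
    blocks = np.array_split(A, n_cores, axis=0)   # A_np blocks, n_p x n_B each
    partials = [blk @ u for blk in blocks]        # per-core products
    return np.concatenate(partials)               # splice on the main core
```

Because every core needs the whole vector u but only its own rows of A, this scheme maps naturally onto cores that each hold a local copy of the training data.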
Column decomposition method. It comprises three steps: decomposing the matrix by columns, computing in parallel, and accumulating the results:
In the gradient update of step 3, the product G_LB · Δα_B is computed, where G_LB is an l × n_B matrix and Δα_B is an n_B × 1 column vector, so the result is an l × 1 column vector. Since the matrix G_LB has l rows and n_B columns, it is first decomposed by columns into blocks G_np, each l × n_p, and Δα_B is decomposed into the matching row segments Δα_np; then each core computes G_np · Δα_np; finally, the main core accumulates the per-core results to obtain the value of G_LB · Δα_B.
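The column decomposition differs from the row case only in the final reduction: partial results are summed rather than concatenated. A serial NumPy sketch (illustrative names, same emulation caveat as above):

```python
import numpy as np

def colwise_matvec(G, dalpha, n_cores=4):
    """Column-decomposition sketch of G_LB @ dalpha: split G into column
    blocks G_np (l x n_p) and dalpha into the matching row segments,
    multiply per core, then accumulate the l x 1 partials on the main core."""
    col_blocks = np.array_split(G, n_cores, axis=1)           # G_np blocks
    bounds = np.cumsum([b.shape[1] for b in col_blocks])[:-1]
    segments = np.split(dalpha, bounds)                       # row segments
    partials = [blk @ seg for blk, seg in zip(col_blocks, segments)]
    return np.sum(partials, axis=0)                           # accumulate
```

Each core contributes a full-length l × 1 vector, so the main core must add N vectors instead of splicing disjoint segments; that is why the patent distinguishes the two decomposition modes.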
According to the partitioning method, the improved parallel GPDT algorithm (i.e., the parallel GPDT algorithm partitioned on the multi-core SoC) specifically includes the following steps:
step 1: first, a vector is initialized on a primary coreIs 0, two integers n are selectedBAnd nCLet 0 be equal to or less than nC≤nB≤1,nCIs even number, slave vectorIn (1) random selection of nBEach element forms a working set B, and the outer iteration number k is 1.
2.1 On the main core, set the initial gradient ∇f(u^(0)) and the descent step ρ_0 ∈ [ρ_min, ρ_max], where ρ_min and ρ_max are preset values satisfying 0 < ρ_min < ρ_max; set the inner iteration counter k' = 0;
2.2 The row segments of the initial gradient are then computed in parallel on the individual cores, and the partial results are spliced together on the main core:
here A is an n_B × n_B matrix and u is an n_B × 1 column vector, so the result A·u is also an n_B × 1 column vector. First, the matrix A is decomposed by rows into blocks A_np, each an n_p × n_B matrix; then each core computes A_np · u; finally, the main core splices the per-core results together to obtain the initial gradient;
2.3 Complete the projection onto the feasible region Ω on the main core and check whether the vector u^(k') satisfies the termination condition; if so, end the iteration; otherwise compute the gradient-descent direction d^(k');
2.4 Then the row segments of the matrix-vector product z^(k') are computed in parallel on the individual cores, where the matrix A is decomposed by rows in the same way as in step 2.2; the main core then splices the per-core results together to obtain z^(k');
2.5 Then, on the main core, first compute the line-search coefficient λ_k', then compute the new step ρ_{k'+1} and the new iterate u^(k'+1), and set the inner iteration counter k' = k' + 1. Check whether u^(k'+1) satisfies the KKT termination condition; if so, proceed to the next step; otherwise return to step 2.2 and compute a new gradient-descent direction.
Step 3: after the solution of the QP subproblem is obtained, the gradient must be updated. The column segments of the gradient increment are computed in parallel on the individual cores, and the results are then accumulated on the main core to produce the new gradient:
the computation is G_LB · Δα_B, where G_LB is an l × n_B matrix and Δα_B is an n_B × 1 column vector, so the product is an l × 1 column vector. Since the matrix G_LB has l rows and n_B columns, it is first decomposed by columns into blocks G_np, each l × n_p, and Δα_B is decomposed into the matching row segments Δα_np; then each core computes G_np · Δα_np; finally, the main core accumulates the per-core results to obtain the gradient increment.
Step 4: check on the main core whether α satisfies the KKT conditions. If so, the computation ends; otherwise the working set is updated on the main core (see the Background section for details), k = k + 1, and the algorithm returns to step 2.
The parallel GPDT algorithm consists of two nested iteration layers: the inner iteration solves the QP subproblem on the working set B, and the outer iteration updates B. On the critical path for speed, the bottleneck of the outer loop is the gradient update, and the bottleneck of the inner loop is the vector z^(k') computed after each projection; these two matrix operations are distributed over the cores for parallel processing, by row decomposition and by column decomposition respectively. The remaining operations are executed serially on the main core and mainly comprise two parts: the gradient projection, realized with the Dai-Fletcher algorithm, and the update of the working set B, a step that fills the new working set efficiently by introducing a quicksort algorithm.
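Putting the serial and parallel pieces together, the inner loop of steps 2.1-2.5 can be sketched as a deliberately simplified serial model: a fixed step ρ and a plain box projection stand in for the Barzilai-Borwein-style step updates and the Dai-Fletcher projection, and the line-search coefficient is fixed at 1. All names are illustrative:

```python
import numpy as np

def inner_loop(A, b, C, rho, max_iter=200, tol=1e-8):
    """Minimal model of steps 2.1-2.5: minimize 0.5*u^T A u + b^T u over the
    box [0, C]^n by projected gradient descent. In the patent, the product
    A @ u is the row-parallel part; everything else runs on the main core."""
    u = np.zeros(A.shape[0])
    for _ in range(max_iter):
        g = A @ u + b                          # gradient (row-parallel part)
        d = np.clip(u - rho * g, 0.0, C) - u   # projected descent direction
        if np.linalg.norm(d) < tol:            # termination test
            break
        u = u + d                              # line-search coefficient = 1
    return u
```

The only step worth parallelizing in this loop is the `A @ u` product, which matches the patent's claim that the post-projection matrix-vector multiplication dominates the inner-loop critical path.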
Drawings
Fig. 1 Flow of the parallel GPDT algorithm.
Fig. 2 Matrix multiplication with row decomposition.
Fig. 3 Matrix multiplication with column decomposition.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in Fig. 1, the invention distributes three computations over multiple processors for parallel processing: the product A·u inside the initial gradient, the matrix-vector product z^(k') in the inner loop, and the gradient increment in the outer loop. This greatly reduces the time spent on matrix operations in each iteration. The other parts of the algorithm remain serial operations, including the gradient projection and the working-set update. According to Amdahl's law, the speedup of a parallelized algorithm depends not only on the speedup of the parallelizable part but also on its proportion of the total work; therefore, as the training data grows, the share of running time spent in the parallelizable part increases and the overall speedup gradually approaches the speedup of the parallelizable part.
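The Amdahl's-law argument can be checked numerically (a generic illustration, not part of the patent; `f` is the parallelizable fraction of the running time and `n` the number of cores):

```python
def amdahl_speedup(f, n):
    """Overall speedup S = 1 / ((1 - f) + f/n) under Amdahl's law,
    for a parallelizable fraction f executed on n cores."""
    return 1.0 / ((1.0 - f) + f / n)
```

For example, with 8 cores a 90% parallel fraction yields roughly a 4.7x overall speedup, and as f approaches 1 the overall speedup approaches the 8x speedup of the parallel part, which is the trend the paragraph above describes for growing training sets.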
1. The general idea of the parallel partitioning is to distribute the n_B elements of the working set B evenly over N processors. Denote the set of working-set indices assigned to processor p by I_p, p = 1, 2, …, N; after the assignment the sets I_p satisfy:
I_p ∩ I_q = ∅ for p ≠ q, i.e. the index sets assigned to the processors are pairwise disjoint. If the number of working-set elements assigned to each processor is n_p, then n_1 + n_2 + … + n_N = n_B. Each processor stores a local backup of the training data, so the matrix operations can conveniently be distributed to the N processors; the parallelism of the algorithm is concentrated in steps 2 and 3.
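The index assignment can be sketched as follows (illustrative; `np.array_split` produces the even split, with block sizes n_p differing by at most one):

```python
import numpy as np

def partition_working_set(B_indices, n_procs):
    """Split the working-set indices into N pairwise-disjoint sets I_p
    whose sizes n_p sum to n_B (block sizes differ by at most one)."""
    chunks = np.array_split(np.asarray(B_indices), n_procs)
    return [list(map(int, chunk)) for chunk in chunks]
```

The disjointness and the size constraint above are exactly the two conditions the text imposes on the sets I_p.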
2. Parallel computation of the initial gradient. In the parallelized Dai-Fletcher algorithm, the initial gradient is obtained from the product A·u, where A is an n_B × n_B matrix and u is an n_B × 1 column vector, so the result is also an n_B × 1 column vector. According to Fig. 2, the matrix A is decomposed by rows: each processor is assigned a segment of n_p rows of A and multiplies it by the column vector u; the final result is obtained by splicing the partial products together:
3. Parallel computation of the gradient update. The formula for the gradient update is:
Here G_LB is an l × n_B matrix and Δα_B denotes the difference between the vectors α_B of two adjacent iterations; their product is an l × 1 column vector. Because the matrix G_LB has l rows and n_B columns, the partitioning decomposes G_LB by columns, as shown in Fig. 3. Each processor is assigned a column block G_np of G_LB, an l × n_p matrix, and multiplies it by the corresponding row segment Δα_np of the column vector; each partial product is an l-row column vector, so the per-processor results must be accumulated to obtain the final result:
4. The other parts of the algorithm, including the gradient projection and the working-set update, are still executed serially on the main core; the overall flow of the improved parallel GPDT algorithm is shown in Fig. 1.
Claims (1)
1. A method for dividing a parallel GPDT algorithm on a multi-core SoC is characterized by comprising the following specific steps:
step 1: first, a vector is initialized on a primary coreIs 0, two integers n are selectedBAnd nCLet 0 be equal to or less than nC≤nB≤1,nCIs even number, slave vectorIn (1) random selection of nBForming a working set B by the elements, and enabling the outer layer iteration number k to be 1;
2.1 On the main core, set the initial gradient ∇f(u^(0)) and the descent step ρ_0 ∈ [ρ_min, ρ_max], where ρ_min and ρ_max are preset values satisfying 0 < ρ_min < ρ_max; set the inner iteration counter k' = 0;
2.2 The row segments of the initial gradient are then computed in parallel on the individual cores, and the partial results are spliced together on the main core:
here A is an n_B × n_B matrix and u is an n_B × 1 column vector, so the result A·u is also an n_B × 1 column vector. First, the matrix A is decomposed by rows into blocks A_np, each an n_p × n_B matrix; then each core computes A_np · u; finally, the main core splices the per-core results together to obtain the initial gradient;
2.3 Complete the projection onto the feasible region Ω on the main core and check whether the vector u^(k') satisfies the termination condition; if so, end the iteration; otherwise compute the gradient-descent direction d^(k');
2.4 Then the row segments of the matrix-vector product z^(k') are computed in parallel on the individual cores, where the matrix A is decomposed by rows in the same way as in step 2.2; the main core then splices the per-core results together to obtain z^(k');
2.5 Then, on the main core, first compute the line-search coefficient λ_k', then compute the new step ρ_{k'+1} and the new iterate u^(k'+1), and set the inner iteration counter k' = k' + 1. Check whether u^(k'+1) satisfies the KKT termination condition; if so, proceed to the next step; otherwise, return to step 2.2 and compute a new gradient-descent direction;
Step 3: after the solution of the QP subproblem is obtained, the gradient must be updated; the column segments of the gradient increment are computed in parallel on the individual cores, and the results are then accumulated on the main core to produce the new gradient:
the computation is G_LB · Δα_B, where G_LB is an l × n_B matrix and Δα_B is an n_B × 1 column vector, so the product is an l × 1 column vector. Since the matrix G_LB has l rows and n_B columns, it is first decomposed by columns into blocks G_np, each l × n_p, and Δα_B is decomposed into the matching row segments Δα_np; then each core computes G_np · Δα_np; finally, the main core accumulates the per-core results to obtain the value of the gradient increment;
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610832540.5A CN106407561B (en) | 2016-09-19 | 2016-09-19 | Method for dividing parallel GPDT algorithm on multi-core SOC |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN106407561A (en) | 2017-02-15 |
| CN106407561B (en) | 2020-07-03 |
Family
ID=57997635
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610832540.5A Active CN106407561B (en) | 2016-09-19 | 2016-09-19 | Method for dividing parallel GPDT algorithm on multi-core SOC |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106407561B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106897163A (en) * | 2017-03-08 | 2017-06-27 | 郑州云海信息技术有限公司 | A kind of algebra system method for solving and system based on KNL platforms |
EP3654208A1 (en) * | 2017-08-31 | 2020-05-20 | Cambricon Technologies Corporation Limited | Chip device and related products |
CN115619890B (en) * | 2022-12-05 | 2023-04-07 | 山东省计算中心(国家超级计算济南中心) | Tomography method and system for solving linear equation set based on parallel random iteration |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102844762A (en) * | 2010-01-22 | 2012-12-26 | 意法爱立信有限公司 | Secure environment management during switches between different modes of multicore systems |
CN104461467A (en) * | 2013-09-25 | 2015-03-25 | 广州中国科学院软件应用技术研究所 | Method for increasing calculation speed of SMP cluster system through MPI and OpenMP in hybrid parallel mode |
CN104820657A (en) * | 2015-05-14 | 2015-08-05 | 西安电子科技大学 | Inter-core communication method and parallel programming model based on embedded heterogeneous multi-core processor |
CN105550161A (en) * | 2015-12-16 | 2016-05-04 | 浪潮(北京)电子信息产业有限公司 | Parallel logic regression method and system for heterogeneous systems |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7363463B2 (en) * | 2005-05-13 | 2008-04-22 | Microsoft Corporation | Method and system for caching address translations from multiple address spaces in virtual machines |
US20150323975A1 (en) * | 2014-05-12 | 2015-11-12 | Qualcomm Innovation Center, Inc. | SYNCHRONIZATION OF ACTIVITY OF MULTIPLE SUBSYSTEMS IN A SoC TO SAVE STATIC POWER |
- 2016-09-19: application CN201610832540.5A filed in CN; patent CN106407561B, status Active
Non-Patent Citations (4)
Title |
---|
A parallel solver for large quadratic programs in training support vector machines; G. Zanghirati et al.; Parallel Computing; 2003-12-31; full text *
Gradient projection methods for quadratic programs and applications in training support vector machines; T. Serafini et al.; Optimization Methods and Software; 2014-05-31; full text *
Survey of support vector machine algorithms for large-scale problems; Wen Yimin et al.; Computer Science; 2009-07-31; Vol. 36, No. 7; full text *
Research on key technologies of a multi-core SoC platform for wireless security; Cao Dan; China Master's Theses Full-text Database (Information Science and Technology); 2015-08-15; No. 8; full text *
Legal Events

| Code | Title |
|---|---|
| C06 | Publication |
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |