CN112380018A - Method for determining matrix blocking parameters for matrix multiplication based on genetic algorithm

Info

Publication number: CN112380018A
Application number: CN202011374455.1A
Authority: CN
Inventors: 孙成国
Original assignee: Haiguang Information Technology Co Ltd
Current assignee: Haiguang Information Technology Co Ltd
Priority date: 2020-11-30
Filing date: 2020-11-30
Publication date: 2021-02-19

Abstract

The present disclosure provides a method, apparatus, device, and storage medium for determining matrix blocking parameters for matrix multiplication based on a genetic algorithm. The method comprises the following steps: acquiring a matrix blocking parameter boundary; randomly generating a first number of matrix blocking parameters based on the boundary, and taking the first number of matrix blocking parameters as a first number of individuals in the population to be optimized; for each matrix blocking parameter in the first number of matrix blocking parameters, performing matrix blocking by using the parameter and correspondingly performing matrix multiplication, acquiring a performance index of the matrix multiplication based on the parameter, and taking the performance index as the individual fitness of the individual corresponding to the parameter; constructing an optimized population through individual selection, individual crossing and individual variation based on the individual fitness of each individual in the population to be optimized; and, when the optimized population meets a preset condition, selecting the individual with the optimal performance index from the optimized population as the matrix blocking parameter for matrix multiplication.

Description

Method for determining matrix blocking parameters for matrix multiplication based on genetic algorithm

Technical Field

The present disclosure relates to a matrix blocking technique for matrix multiplication, and more particularly, to a method, apparatus, device, and storage medium for determining matrix blocking parameters for matrix multiplication based on a genetic algorithm.

Background

The basic linear algebra subprogram library (BLAS library) is widely used in scientific and engineering computing and is a core operation library for many mathematical software packages and large-scale computing applications. Optimizing its performance is therefore important for improving program performance and fully exploiting a computer's computing capability. The BLAS library comprises three levels of linear algebra functions: level 1 vector-vector operations, level 2 matrix-vector operations, and level 3 matrix-matrix operations. The level 3 functions involve the largest amount of computation and the most intensive memory access; they are the most important core function set in the BLAS library and offer the most room for optimization. General matrix multiplication is the core level 3 function in the BLAS library, and optimizing it can effectively improve the computing performance of the CPU. At present, general-purpose BLAS libraries optimize matrix multiplication by matrix blocking, in which well-chosen matrix blocking parameters make fuller use of the cache (Cache), improve the memory access hit rate, and thereby improve computing performance.

At present, matrix blocking parameters are usually determined by combining theoretical calculation with a large number of manual tests. This approach involves a heavy workload, the matrix blocking parameters must be recalculated for each hardware architecture, and it is difficult to guarantee that the resulting parameters yield optimal computing performance. There is therefore a need for a method that requires little manual work, achieves optimal computing performance, and adaptively determines matrix blocking parameters for different hardware architectures.

Disclosure of Invention

In order to solve the above problems, the genetic algorithm is combined with the determination of the matrix blocking parameters, and genetic iteration is performed on the matrix blocking parameters to obtain the optimal matrix blocking parameters, so that the calculation performance is improved.

According to an embodiment of the present disclosure, there is provided a method of determining matrix blocking parameters for matrix multiplication based on a genetic algorithm, including: acquiring a matrix block parameter boundary for matrix multiplication; randomly generating a first number of matrix partitioning parameters based on the matrix partitioning parameter boundary, and taking the first number of matrix partitioning parameters as a first number of individuals in the population to be optimized; for each matrix block parameter in the first number of matrix block parameters, performing matrix block by using the matrix block parameter and correspondingly performing matrix multiplication, acquiring a performance index of the matrix multiplication based on the matrix block parameter, and taking the performance index as the individual fitness of the individual corresponding to the matrix block parameter; constructing an optimized population through individual selection, individual crossing and individual variation based on the individual fitness of each individual in the population to be optimized; and under the condition that the optimized population meets a preset condition, selecting an individual with the optimal performance index from the optimized population as a matrix blocking parameter for matrix multiplication.

According to the embodiment of the present disclosure, performing matrix blocking by using the matrix blocking parameter, correspondingly performing matrix multiplication, and acquiring the performance index of the matrix multiplication based on the matrix blocking parameter includes: using the matrix blocking parameter as the matrix blocking parameter in a configuration file of a basic linear algebra subprogram library; recompiling and installing the basic linear algebra subprogram library to generate a dynamic link library; generating a first matrix and a second matrix for matrix multiplication, carrying out matrix blocking on the first matrix and the second matrix based on the dynamic link library, and carrying out matrix multiplication on the first matrix and the second matrix; and obtaining a performance index of the dynamic link library for executing the matrix multiplication.

According to the embodiment of the disclosure, the performance index is the number of floating point operations per second that the CPU performs when using the basic linear algebra subprogram library configured with the matrix blocking parameters.

According to the embodiment of the disclosure, the method further comprises determining a matrix blocking kernel parameter based on the actual structure of the registers in the CPU and the maximum data amount that the registers can store, wherein the matrix blocking kernel parameter represents the basic unit of matrix blocking; wherein randomly generating a first number of matrix blocking parameters based on the matrix blocking parameter boundary comprises: randomly generating a first number of matrix blocking parameters based on the matrix blocking parameter boundary and the matrix blocking kernel parameter, wherein each element in each matrix blocking parameter is an integer multiple of the corresponding element in the matrix blocking kernel parameter.

According to an embodiment of the present disclosure, the method further comprises obtaining a preset crossover rate and a preset variation rate for the genetic algorithm; based on the individual fitness of each individual in the population to be optimized, an optimized population is constructed through individual selection, individual crossing and individual variation, and the method comprises the following steps: selecting the individual with the highest individual fitness in the population to be optimized into a first population; selecting individuals with higher individual fitness in the population to be optimized into the first population; randomly grouping the individuals in the first population pairwise to form a plurality of individual pairs, wherein each individual pair corresponds to an individual pair crossing rate, carrying out individual crossing on the corresponding individual pairs of which the individual pair crossing rates are smaller than the preset crossing rate, generating a new individual pair, and selecting the new individual pair into the first population; and determining the individual variation rate of each individual in the first population, carrying out individual variation on the corresponding individual of which the individual variation rate is smaller than the preset variation rate, generating a new individual, and selecting the new individual into the first population to form the optimized population.

According to the embodiment of the disclosure, the predetermined condition is that the number of iterations for constructing the optimized population reaches a preset maximum number of iterations.

According to the embodiment of the disclosure, the predetermined condition is that the individuals having the optimal performance index tend to be stable in the iterative process of constructing the optimized population, that is, the individuals having the optimal performance index in the optimized population obtained through multiple iterations are the same.

According to an embodiment of the present disclosure, there is provided an apparatus for determining matrix blocking parameters for matrix multiplication based on a genetic algorithm, including: a parameter boundary acquisition module for acquiring a matrix block parameter boundary for matrix multiplication; a blocking parameter generation module, configured to randomly generate a first number of matrix blocking parameters based on the matrix blocking parameter boundary, and use the first number of matrix blocking parameters as a first number of individuals in a population to be optimized; the matrix multiplication actual measurement module is used for carrying out matrix blocking and correspondingly executing matrix multiplication on each matrix blocking parameter in the first number of matrix blocking parameters by using the matrix blocking parameter to obtain a performance index of the matrix multiplication based on the matrix blocking parameter, and taking the performance index as the individual fitness of the individual corresponding to the matrix blocking parameter; the genetic algorithm iteration module is used for constructing an optimized population through individual selection, individual crossing and individual variation based on the individual fitness of each individual in the population to be optimized; and the blocking parameter determining module is used for selecting an individual with the optimal performance index from the optimized population as a matrix blocking parameter for matrix multiplication under the condition that the optimized population meets a preset condition.

According to the embodiment of the present disclosure, the matrix multiplication actual measurement module performs matrix blocking by using the blocking parameter and correspondingly performs matrix multiplication to obtain the performance index of the matrix multiplication based on the blocking parameter, including: utilizing the matrix block parameters as matrix block parameters in a configuration file of a basic linear algebra subprogram library; recompiling and installing the basic linear algebra subprogram library to generate a dynamic link library; generating a first matrix and a second matrix for matrix multiplication, carrying out matrix blocking on the first matrix and the second matrix based on the dynamic link library, and carrying out matrix multiplication on the first matrix and the second matrix; and obtaining a performance index of the dynamic link library for executing the matrix multiplication.

According to the embodiment of the disclosure, the device further comprises a kernel parameter setting module, configured to determine a matrix partitioning kernel parameter based on an actual structure of a register in the CPU and a maximum data amount that can be stored in the register, where the matrix partitioning kernel parameter represents a basic unit of a matrix partitioning; wherein the block parameter generation module randomly generates a first number of matrix block parameters based on the matrix block parameter boundary, including: randomly generating a first number of matrix blocking parameters based on the matrix blocking parameter boundary and the matrix blocking kernel parameters, wherein each element in each matrix blocking parameter is an integer multiple of a corresponding element in the matrix blocking kernel parameters.

According to an embodiment of the present disclosure, the apparatus further includes a genetic parameter obtaining module, configured to obtain a preset crossing rate and a preset variation rate for the genetic algorithm; the genetic algorithm iteration module constructs an optimized population through individual selection, individual crossing and individual variation based on the individual fitness of each individual in the population to be optimized, and the genetic algorithm iteration module comprises the following steps: selecting the individual with the highest individual fitness in the population to be optimized into a first population; selecting individuals with higher individual fitness in the population to be optimized into the first population; randomly grouping the individuals in the first population pairwise to form a plurality of individual pairs, wherein each individual pair corresponds to an individual pair crossing rate, carrying out individual crossing on the corresponding individual pairs of which the individual pair crossing rates are smaller than the preset crossing rate, generating a new individual pair, and selecting the new individual pair into the first population; and determining the individual variation rate of each individual in the first population, carrying out individual variation on the corresponding individual of which the individual variation rate is smaller than the preset variation rate, generating a new individual, and selecting the new individual into the first population to form the optimized population.

According to an embodiment of the present disclosure, there is provided an apparatus for determining matrix blocking parameters for matrix multiplication based on a genetic algorithm, including: a processor; and a memory having stored thereon computer-executable instructions for implementing the method as described above when executed by the processor.

According to an embodiment of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer-executable instructions for implementing the method as described above when executed by a processor.

According to an embodiment of the present disclosure, there is provided a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform a method of determining matrix blocking parameters for matrix multiplication based on a genetic algorithm according to an embodiment of the present disclosure.

Embodiments of the present disclosure provide a method, apparatus, device, and storage medium for determining matrix blocking parameters for matrix multiplication based on a genetic algorithm. The method introduces a genetic algorithm into the determination of matrix blocking parameters: the matrix blocking parameters serve as individuals in the genetic algorithm, the performance index of matrix multiplication based on each matrix blocking parameter serves as the individual fitness, and genetic iteration yields an optimal individual that is used as the matrix blocking parameter for matrix multiplication. The scheme involves a small workload and achieves optimal calculation performance, adaptively determines the matrix blocking parameters of matrix multiplication according to the hardware architecture, makes more reasonable use of hardware resources, and improves the calculation performance of the CPU.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly introduced below. It is apparent that the drawings in the following description are only exemplary embodiments of the disclosure, and that other drawings may be derived from those drawings by a person of ordinary skill in the art without inventive effort.

Fig. 1A shows a schematic block diagram of a method of determining matrix blocking parameters for matrix multiplication based on a genetic algorithm according to an embodiment of the present disclosure.

Fig. 1B shows a detailed flow diagram of a method 100 for determining matrix blocking parameters for matrix multiplication based on a genetic algorithm according to an embodiment of the present disclosure.

Fig. 2 exemplarily shows a flowchart for obtaining a performance index for matrix multiplication based on matrix blocking parameters according to an embodiment of the present disclosure.

Fig. 3 shows a flow chart for constructing an optimized population through individual selection, individual crossover, and individual variation according to an embodiment of the present disclosure.

Fig. 4 shows a schematic diagram of an apparatus 400 for determining matrix blocking parameters for matrix multiplication based on a genetic algorithm according to an embodiment of the present disclosure.

Fig. 5 shows a schematic diagram of an apparatus 500 for determining matrix blocking parameters for matrix multiplication according to an embodiment of the present disclosure.

Detailed Description

In order to make the objects, technical solutions and advantages of the present disclosure more apparent, example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.

In the present specification and the drawings, substantially the same or similar steps and elements are denoted by the same or similar reference numerals, and repeated descriptions of the steps and elements will be omitted. Meanwhile, in the description of the present disclosure, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance or order.

In the specification and drawings, elements are described in singular or plural according to embodiments. However, the singular and plural forms are appropriately selected for the proposed cases only for convenience of explanation and are not intended to limit the present disclosure thereto. Thus, the singular may include the plural and the plural may also include the singular, unless the context clearly dictates otherwise.

Optimizing matrix multiplication for different hardware architectures can effectively improve computing performance. Existing general-purpose BLAS libraries optimize matrix multiplication by matrix blocking: the operand matrices are divided into block matrices whose sizes depend on the hardware parameters of the particular architecture, so that the Cache is used more fully, the memory access hit rate is improved, and the computing performance of the CPU is improved. The division of the operand matrices is governed by the matrix blocking parameters, whose setting is crucial to the optimization effect of matrix multiplication.

At present, matrix blocking parameters are usually determined through theoretical calculation, but the theoretical calculation must be supplemented by a large number of manual tests, and new matrix blocking parameters must be calculated and determined again for each hardware architecture. The invention therefore aims to adaptively determine the matrix blocking parameters for matrix multiplication according to the hardware architecture, ensuring optimal calculation performance while reducing the above-mentioned workload.

Accordingly, the present invention proposes a method for determining matrix blocking parameters for matrix multiplication based on genetic algorithm, and embodiments of the present disclosure will be further described with reference to the accompanying drawings.

As shown in fig. 1A, in a method for determining matrix blocking parameters for matrix multiplication based on a genetic algorithm, the matrix blocking parameters are first calculated theoretically for the specific hardware architecture, and the calculated parameters are used to guide the generation of the matrix blocking parameters that participate in the optimization. Next, the algorithm parameters of the genetic algorithm, such as the population size, are set based on the theoretically calculated matrix blocking parameters and on actual requirements. An initial population of the genetic algorithm is then established based on the set algorithm parameters, with matrix blocking parameters as individuals; the number of matrix blocking parameters in the initial population equals the population size. Genetic iteration is then performed on the initial population. Each iteration comprises selection, crossing (crossover), and variation (mutation) operations on the individuals in the population and generates a next-generation population; the iteration ends when the generated population satisfies a preset termination condition. Finally, the optimal individual in the population that satisfies the termination condition is output as the matrix blocking parameter for matrix multiplication.

As shown in fig. 1B, first, in step 101, a matrix blocking parameter boundary for matrix multiplication is acquired.

According to the embodiment of the disclosure, the matrix blocking kernel parameter can be determined based on the actual structure of the registers in the CPU and the maximum data amount that the registers can store, wherein the matrix blocking kernel parameter represents the basic unit of matrix blocking. The matrix blocking parameters for matrix multiplication may then be determined based on the matrix blocking kernel parameter.

According to an embodiment of the present disclosure, for the matrix multiplication A × B = C, where matrix A is of size M × K, matrix B is of size K × N, and matrix C is of size M × N, the matrix blocking parameters of the corresponding matrix multiplication may be represented as [Mc, Kc, Nc]: matrix A is divided into block matrices of size Mc × Kc, matrix B into block matrices of size Kc × Nc, and matrix C into block matrices of size Mc × Nc.

According to the embodiment of the present disclosure, take as an example an x86_64 CPU with a SIMD instruction width of 128 bits. The CPU has 16 128-bit XMM registers, which are allocated to different tasks in the calculation process. In practice, the number of registers available for storing the matrix data that participate in each calculation is limited, so the number of operation data taken from the first matrix and from the second matrix for each calculation is limited by the number of registers. For example, in one implementation of double-precision matrix multiplication, each XMM register can store two double-precision floating point numbers. In one calculation step, 1 operation datum of the first matrix is taken and broadcast into one fixed register, 4 consecutive operation data of the second matrix are taken and stored in two fixed registers, the calculation is performed by a multiply-add instruction, and the result is stored in other registers. In one cycle, 6 operation data of the first matrix (Mc = 6) and 4 operation data of the second matrix (Nc = 4) are taken in succession, and the remaining registers store the intermediate results of the calculation, so that all of the registers are allocated.

In addition, the data participating in the operation are read into the CPU registers from the Cache, and the Cache must store not only the operation data but also intermediate operation results. The number of operation data groups that the Cache can send to the CPU registers at a time is therefore limited. In each block multiplication, the first matrix participating in the operation comprises Kc columns and the second matrix comprises Kc rows, that is, the number of operation data groups is Kc, for example Kc = 4. For double-precision matrix multiplication, the matrix blocking kernel parameters [Mc, Kc, Nc] may then be [6, 4, 4].
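To make the role of [Mc, Kc, Nc] concrete, the following is a minimal sketch of blocked matrix multiplication in plain Python. The function names `naive_matmul` and `blocked_matmul` are illustrative and do not come from the disclosure or from any BLAS library; a real kernel would use SIMD registers as described above rather than Python loops.

```python
def naive_matmul(a, b):
    """Reference product of a (M x K) and b (K x N), both lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            for j in range(n):
                c[i][j] += a[i][p] * b[p][j]
    return c


def blocked_matmul(a, b, mc, kc, nc):
    """Same product, computed block by block.

    A is traversed in Mc x Kc blocks, B in Kc x Nc blocks and C in
    Mc x Nc blocks; edge blocks are clipped with min().
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, mc):
        for p0 in range(0, k, kc):
            for j0 in range(0, n, nc):
                # Multiply one block of A by one block of B and
                # accumulate into the matching block of C.
                for i in range(i0, min(i0 + mc, m)):
                    for p in range(p0, min(p0 + kc, k)):
                        a_ip = a[i][p]
                        for j in range(j0, min(j0 + nc, n)):
                            c[i][j] += a_ip * b[p][j]
    return c
```

The blocking changes only the order in which the data are visited, not the result, which is why the blocking parameters can be tuned purely for performance.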

According to the embodiment of the disclosure, the matrix blocking parameter boundary may include a minimum matrix blocking parameter and a maximum matrix blocking parameter. An excessively small matrix blocking parameter may lead to excessively low calculation performance, while an excessively large one may prevent the corresponding block matrix from being read completely into the Cache, increasing the probability of a Cache miss and degrading performance. Therefore, according to the embodiments of the present disclosure, the matrix blocking parameter boundary for matrix multiplication may be derived from the matrix blocking kernel parameter, with the minimum and maximum matrix blocking parameters determined according to the actual data storage capacity of the Cache and the calculation performance requirement. Each element of the minimum matrix blocking parameter may be an integer multiple of the corresponding element of the matrix blocking kernel parameter, determined based on the minimum required CPU calculation performance, and the maximum matrix blocking parameter may be determined based on the maximum number of operation data that the Cache can actually store.
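One way to reason about the upper boundary is to compare the memory footprint of the blocks with a cache capacity. The sketch below is an illustration under stated assumptions, not the rule used in the disclosure: it assumes double-precision data (8 bytes per element) and that one block of each of A, B and C must be resident at the same time. The helper names are hypothetical, and real libraries often place different blocks in different cache levels.

```python
BYTES_PER_DOUBLE = 8  # assumption: double-precision operands


def block_footprint_bytes(mc, kc, nc, elem_bytes=BYTES_PER_DOUBLE):
    """Bytes for one Mc x Kc block, one Kc x Nc block and one Mc x Nc block."""
    return (mc * kc + kc * nc + mc * nc) * elem_bytes


def fits_in_cache(params, cache_bytes):
    """True if the three blocks for params = [Mc, Kc, Nc] fit in cache_bytes."""
    mc, kc, nc = params
    return block_footprint_bytes(mc, kc, nc) <= cache_bytes


def is_multiple_of_kernel(params, kernel):
    """True if every element is a positive integer multiple of the kernel element."""
    return all(p > 0 and p % q == 0 for p, q in zip(params, kernel))
```

For the kernel parameters [6, 4, 4] the footprint is (24 + 16 + 24) × 8 = 512 bytes; a candidate maximum parameter could be rejected whenever `fits_in_cache` returns False for the chosen cache size.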

In step 102, a first number of matrix blocking parameters may be randomly generated based on the matrix blocking parameter boundary, and the first number of matrix blocking parameters may be used as a first number of individuals in the population to be optimized.

According to embodiments of the present disclosure, a first number of matrix blocking parameters may be randomly generated based on the matrix blocking parameter boundary and the matrix blocking kernel parameter, wherein each element within each matrix blocking parameter is an integer multiple of the corresponding element within the matrix blocking kernel parameter.

According to an embodiment of the present disclosure, the population size of the population to be optimized is the first number.
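Step 102 can be sketched as follows. The boundary values in the usage example and the function names are hypothetical; the only properties taken from the text are that each element lies within the boundary and is an integer multiple of the corresponding kernel element.

```python
import random


def random_individual(kernel, lower, upper, rng):
    """Draw one [Mc, Kc, Nc] whose elements are kernel multiples within bounds."""
    individual = []
    for unit, lo, hi in zip(kernel, lower, upper):
        min_mult = -(-lo // unit)  # smallest multiplier with unit * mult >= lo
        max_mult = hi // unit      # largest multiplier with unit * mult <= hi
        if min_mult > max_mult:
            raise ValueError("no multiple of the kernel element lies within the boundary")
        individual.append(unit * rng.randint(min_mult, max_mult))
    return individual


def initial_population(kernel, lower, upper, size, seed=None):
    """Generate `size` individuals (the 'first number' of blocking parameters)."""
    rng = random.Random(seed)
    return [random_individual(kernel, lower, upper, rng) for _ in range(size)]


# Usage with illustrative bounds (not values from the disclosure):
population = initial_population([6, 4, 4], [12, 8, 8], [600, 512, 4096], size=20, seed=0)
```

Drawing the multiplier rather than the value itself guarantees the integer-multiple property by construction, so no individual has to be rejected after generation.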

In step 103, for each matrix blocking parameter in the first number of matrix blocking parameters, matrix blocking may be performed by using the matrix blocking parameter and matrix multiplication may be performed accordingly, so as to obtain a performance index of matrix multiplication based on the matrix blocking parameter, and the performance index is used as an individual fitness of an individual corresponding to the matrix blocking parameter.

According to an embodiment of the present disclosure, the performance index may be the computing performance of the CPU when using a BLAS library configured with the matrix blocking parameter, for example the number of floating point operations per second (FLOPS) achieved when the CPU performs floating point operations using the BLAS library, which may be expressed in MFLOPS (megaFLOPS, one million floating point operations per second), GFLOPS (gigaFLOPS, one billion floating point operations per second), or the like. However, the performance index of matrix multiplication based on the matrix blocking parameter according to the embodiment of the present disclosure is not limited thereto, and may instead be a quantity such as calculation time or energy efficiency ratio.
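As a worked illustration of the FLOPS index, a product of an M × K matrix and a K × N matrix performs roughly one multiplication and one addition per (i, j, p) triple, that is, about 2·M·N·K floating point operations. The sketch below converts a measured time into GFLOPS; the helper names are illustrative, and `time_call` is only one possible way of timing a run.

```python
import time


def gemm_flop_count(m, n, k):
    """Approximate floating point operations for an M x K by K x N product."""
    return 2 * m * n * k


def gflops(m, n, k, seconds):
    """Performance index in GFLOPS for a product that took `seconds`."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return gemm_flop_count(m, n, k) / seconds / 1e9


def time_call(fn, *args, repeats=3):
    """Best wall-clock time over several runs of fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best
```

For example, a 1000 × 1000 by 1000 × 1000 product that completes in 2 seconds corresponds to 2 × 10^9 / 2 / 10^9 = 1 GFLOPS.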

As an example, fig. 2 exemplarily shows a flowchart of obtaining a performance index for matrix multiplication based on matrix blocking parameters according to an embodiment of the present disclosure.

As shown in fig. 2, in step 201, the matrix blocking parameters are used as the matrix blocking parameters in the configuration file of the basic linear algebra subprogram library. According to the embodiment of the disclosure, the original matrix blocking parameters in that configuration file can be replaced by the matrix blocking parameters under evaluation.

In step 202, the basic linear algebra subprogram library is recompiled and installed to generate a dynamic link library. According to the embodiment of the disclosure, the basic linear algebra subprogram library is recompiled and installed with the configuration file containing the new matrix blocking parameters, and the generated dynamic link library can be used by the CPU to obtain the performance index of the matrix multiplication corresponding to the matrix blocking parameters.

In step 203, a first matrix and a second matrix for matrix multiplication are generated, the first matrix and the second matrix are subjected to matrix blocking based on the dynamic link library, and the first matrix and the second matrix are subjected to matrix multiplication. For example, the first and second matrices may be randomly generated, or may be pre-generated standard test matrices for verifying various matrix blocking parameters.

According to the embodiment of the present disclosure, the CPU may use the dynamic link library to block the first matrix and the second matrix for matrix multiplication according to the matrix blocking parameter, and perform block matrix multiplication accordingly.

In step 204, a performance index for the dynamically linked library to perform matrix multiplication may be obtained. According to the embodiment of the disclosure, the performance index is used as the individual fitness of the individual corresponding to the matrix blocking parameter.
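Steps 201 to 204 can be sketched as a small evaluation routine. Because the disclosure does not name a specific library, configuration format, or build command, the three callables below are hypothetical hooks that the caller must supply; nothing here reflects the interface of a real BLAS distribution. Measuring duplicate individuals only once is an added design choice, motivated by the cost of rebuilding the library.

```python
def evaluate_individual(params, write_config, rebuild_library, run_benchmark):
    """Return the performance index (individual fitness) for one [Mc, Kc, Nc].

    Hypothetical hooks supplied by the caller:
      write_config(params)    -> None; stores the parameters in the config file
      rebuild_library()       -> path of the rebuilt dynamic link library
      run_benchmark(lib_path) -> measured performance index, e.g. GFLOPS
    """
    write_config(params)            # step 201: replace the blocking parameters
    lib_path = rebuild_library()    # step 202: recompile and install
    return run_benchmark(lib_path)  # steps 203-204: multiply and measure


def evaluate_population(population, write_config, rebuild_library, run_benchmark):
    """Fitness for every individual; identical individuals are measured once."""
    cache = {}
    fitness = []
    for params in population:
        key = tuple(params)
        if key not in cache:
            cache[key] = evaluate_individual(
                params, write_config, rebuild_library, run_benchmark
            )
        fitness.append(cache[key])
    return fitness
```

Keeping the build and benchmark behind hooks lets the same loop drive any library and makes the routine testable with stand-in functions.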

Returning to fig. 1B, in step 104, an optimized population may be constructed through individual selection, individual crossing, and individual variation based on the individual fitness of each individual in the population to be optimized.

According to the embodiment of the disclosure, individuals in the population to be optimized are selected based on the individual fitness of each individual, so that the population retains its currently better characteristics while keeping enough diversity of individual characteristics to evolve in a potentially better direction through subsequent operations. The selected individuals then undergo individual crossing and individual variation to generate new individuals, which may have new characteristics different from those of the current population or may recombine existing characteristics of the current population. The selected individuals and the newly generated individuals together form the optimized population. Step 104 will be described below in conjunction with fig. 3.
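A minimal sketch of one round of selection, crossing and variation follows. It mirrors the outline given earlier in the disclosure (keep the fittest individual, pair individuals at random, cross a pair when its random rate is below the preset crossing rate, mutate an individual when its random rate is below the preset variation rate). The concrete choices are assumptions made for illustration: keeping the fitter half, single-point crossing, re-drawing one element on variation, and treating a higher fitness as better.

```python
import random


def next_generation(population, fitness, kernel, lower, upper,
                    crossover_rate, mutation_rate, rng):
    """Build the optimized population from the population to be optimized."""
    ranked = sorted(zip(fitness, population), key=lambda fp: fp[0], reverse=True)

    # Selection (assumed rule): the best individual plus the fitter half.
    keep = max(2, len(ranked) // 2)
    first = [list(ind) for _, ind in ranked[:keep]]

    # Crossing: random pairs; a pair crosses when its draw is below the preset rate.
    order = list(range(len(first)))
    rng.shuffle(order)
    children = []
    for a, b in zip(order[0::2], order[1::2]):
        if rng.random() < crossover_rate:
            cut = rng.randint(1, len(kernel) - 1)  # assumed single-point crossing
            children.append(first[a][:cut] + first[b][cut:])
            children.append(first[b][:cut] + first[a][cut:])
    first.extend(children)

    # Variation: re-draw one element as a kernel multiple within the boundary.
    mutants = []
    for ind in first:
        if rng.random() < mutation_rate:
            pos = rng.randrange(len(kernel))
            unit = kernel[pos]
            mutant = list(ind)
            mutant[pos] = unit * rng.randint(-(-lower[pos] // unit), upper[pos] // unit)
            mutants.append(mutant)
    return first + mutants
```

Because crossing swaps elements at the same position and variation re-draws a kernel multiple within the boundary, every new individual still satisfies the boundary and integer-multiple constraints. The population size is not held constant in this sketch.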

Next, in step 105, an individual with the optimal performance index may be selected from the optimized population as a matrix blocking parameter for matrix multiplication in case the optimized population satisfies a predetermined condition.

According to an embodiment of the present disclosure, the predetermined condition may be that the number of iterations for constructing the optimized population reaches a preset maximum number of iterations. The maximum number of iterations can be preset according to specific performance requirements and the limit on actual calculation time. When the number of iterative updates of the optimized population reaches this maximum, the individual with the optimal performance index in the optimized population at that moment is regarded as the best matrix blocking parameter obtainable within these limits, and is therefore used as the matrix blocking parameter finally used for matrix multiplication.

According to the embodiment of the present disclosure, the predetermined condition may also be that the individuals having the optimal performance index tend to be stable in the iterative process of constructing the optimized population, that is, the individuals having the optimal performance index in the optimized population obtained through multiple iterations are the same. After the optimized population is subjected to repeated iterative updating, the optimized population tends to be stable, and the individual with the highest individual fitness in the optimized population also tends to be stable.

According to the embodiment of the disclosure, the predetermined condition may be that the optimal performance index of the individuals in the optimized population reaches a preset performance index. The preset performance index may be a more ideal performance index, and stopping the iterative process after there is an individual that has reached the preset performance index may reduce the computational workload while ensuring the more ideal computational performance, so that the individual may be used as a matrix partitioning parameter for the final matrix multiplication.
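The three predetermined conditions described above can be combined into a single check, sketched below. The function name `should_stop`, its parameters, the default values, and the representation of the history as a list of `(individual, fitness)` tuples are all assumptions for illustration; the disclosure does not fix how many identical generations count as stable.

```python
def should_stop(iteration, best_history, max_iterations=50,
                stable_generations=5, target_fitness=None):
    """Return True if any of the three stopping conditions holds.

    best_history holds one (best_individual, best_fitness) tuple per generation.
    """
    # Condition 1: the preset maximum number of iterations has been reached.
    if iteration >= max_iterations:
        return True
    # Condition 2: the best individual was the same over the last few generations.
    if len(best_history) >= stable_generations:
        recent = [ind for ind, _ in best_history[-stable_generations:]]
        if all(ind == recent[0] for ind in recent):
            return True
    # Condition 3: the best performance index has reached the preset target.
    if target_fitness is not None and best_history:
        if best_history[-1][1] >= target_fitness:
            return True
    return False
```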

Hereinafter, a method of constructing an optimized population based on a genetic algorithm according to an embodiment of the present disclosure will be described with reference to fig. 3. Fig. 3 shows a flow chart for constructing an optimized population through individual selection, individual crossover, and individual variation according to an embodiment of the present disclosure.

As shown in fig. 3, in step 301, the individual with the highest individual fitness in the population to be optimized may be selected into the first population. The first population is an intermediate population in the process of constructing the optimized population from the population to be optimized, and is used for representing a set of individuals which are selected and changed in the process of constructing the optimized population. According to an embodiment of the present disclosure, after the population optimization process is finished, the first population is taken as an optimized population. The reason for selecting the individual with the highest individual fitness in the population to be optimized into the first population is that the element contained by the individual with the highest individual fitness is likely to be closer to the element contained by the optimal individual, and the selection of the individual is likely to be more favorable for the population to evolve in a more optimal direction.

In step 302, individuals with higher individual fitness in the population to be optimized are selected into the first population. According to embodiments of the present disclosure, the individuals to be selected into the first population may be determined using a modified roulette selection method. In the traditional roulette selection method, the probability of selecting an individual is the ratio of its individual fitness to the total fitness of the population (the sum of the individual fitness of all individuals in the population), i.e., the probability of each individual being selected is proportional to its individual fitness. This method tends to select the individuals whose fitness is close to the highest individual fitness, which tends to cause population optimization to fall into a local optimum, since the individuals with higher individual fitness in the current population to be optimized may all be close to one locally optimal individual. Moreover, a low individual fitness does not mean that every element of the individual is worse than the corresponding elements of individuals with higher individual fitness; subsequent individual crossing and individual variation may improve the poor elements of the individual, so that its individual fitness increases greatly. The traditional roulette selection method can therefore be improved by still favoring individuals with higher individual fitness while increasing the probability that individuals with lower individual fitness are selected.

According to an embodiment of the present disclosure, the sigmoid function $f(x) = 1/(1 + e^{-x})$ may be used to transform the individual fitness, where $x$ represents the individual fitness and $f(x)$ represents the individual fitness after the transformation. This function maps the individual fitness into the interval [0, 1], which narrows the gap between the individual fitness of individuals in the population and thus the difference between their selection probabilities, so that the probability that individuals with lower individual fitness are selected increases and the possibility of falling into a local optimum is reduced.
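The effect of the sigmoid transformation on roulette selection can be seen in a small sketch. The function names and the example fitness values are hypothetical; the sketch only compares the selection probabilities of plain roulette selection with those obtained after the fitness has been passed through the sigmoid function.

```python
import math

def sigmoid(x):
    """f(x) = 1 / (1 + e^(-x)); non-negative fitness values land in [0.5, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def roulette_probabilities(fitness):
    """Traditional roulette: probability proportional to fitness."""
    total = sum(fitness)
    return [f / total for f in fitness]

def sigmoid_roulette_probabilities(fitness):
    """Roulette probabilities computed on sigmoid-transformed fitness."""
    return roulette_probabilities([sigmoid(f) for f in fitness])

# Hypothetical fitness values for two individuals.
plain = roulette_probabilities([1.0, 4.0])
scaled = sigmoid_roulette_probabilities([1.0, 4.0])
```

In this example the weaker individual has a selection probability of 0.2 under plain roulette selection and a noticeably higher one after the transformation, while the stronger individual is still preferred.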

According to embodiments of the present disclosure, the softmax function

$$g(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

may also be used to adjust the probability of an individual being selected, where $x_i$ denotes the individual fitness of the ith individual, $g(x_i)$ denotes the individual selection probability of the ith individual, and the sum runs over all individuals in the population. Compared with the traditional roulette selection method, this function makes the distribution of individual selection probabilities more uniform, while the selection probability of an individual with larger individual fitness remains larger than that of an individual with smaller individual fitness, so that excellent individuals are still more easily selected into the new population while falling into a local optimum is avoided.
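A minimal sketch of softmax-based selection follows. The function names are hypothetical, and subtracting the largest fitness before exponentiation is a common numerical-stability step that is assumed here rather than taken from the disclosure; it does not change the resulting probabilities.

```python
import math
import random

def softmax(fitness):
    """Return selection probabilities g(x_i) = e^(x_i) / sum_j e^(x_j)."""
    peak = max(fitness)  # assumption: shift by the maximum to avoid overflow
    exps = [math.exp(f - peak) for f in fitness]
    total = sum(exps)
    return [e / total for e in exps]

def select(population, fitness, count, rng=random):
    """Draw `count` individuals, with replacement, using softmax probabilities."""
    return rng.choices(population, weights=softmax(fitness), k=count)
```

How flat the resulting distribution is depends on the scale of the fitness values, so an implementation may need to normalize the fitness before applying the function.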

After the individual selection operation is completed, the individuals in the first population are provisionally determined, and individual crossing and individual variation may be performed based on these individuals. According to the embodiment of the disclosure, the preset crossing rate and the preset variation rate for the genetic algorithm can be obtained to guide the individual crossing and individual variation operations, respectively.

In step 303, the individuals in the first population are randomly grouped pairwise to form a plurality of individual pairs, each of which corresponds to an individual pair crossing rate. Individual crossing is performed on each individual pair whose individual pair crossing rate is smaller than the preset crossing rate, generating a new individual pair that is selected into the first population.

According to an embodiment of the present disclosure, for a first population for which individual selection has been completed, individuals in the first population may be randomly grouped pairwise to form a plurality of individual pairs, where each individual pair may correspond to an individual pair crossing rate. Since individual crossing in the population may be random, the individual pair crossing rate of each individual pair may be a random value (e.g., a random value in [0, 1]) generated for that pair. This value is compared with the preset crossing rate, and if the individual pair crossing rate is less than the preset crossing rate, individual crossing is performed on the pair.

According to an embodiment of the present disclosure, the individual crossing may be a single point crossing, i.e., randomly exchanging elements of the same position of two individuals in an individual pair, thereby generating a new individual pair.
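The pairing and crossing of step 303 can be sketched as follows. Individuals are assumed to be tuples of integers, the function name `crossover_population` is hypothetical, and the sketch only returns the newly generated pairs; how they are merged into the first population while keeping its size constant is left to the caller.

```python
import random

def crossover_population(population, crossover_rate, rng):
    """Pair individuals at random and cross the pairs whose random draw is low enough."""
    shuffled = population[:]
    rng.shuffle(shuffled)
    offspring = []
    # Consecutive elements of the shuffled list form the random pairs.
    for first, second in zip(shuffled[0::2], shuffled[1::2]):
        pair_rate = rng.random()  # individual pair crossing rate in [0, 1)
        if pair_rate < crossover_rate:
            pos = rng.randrange(len(first))  # position whose elements are exchanged
            child_a, child_b = list(first), list(second)
            child_a[pos], child_b[pos] = child_b[pos], child_a[pos]
            offspring.append((tuple(child_a), tuple(child_b)))
    return offspring
```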

In step 304, the individual variation rate of each individual in the first population is determined, and individual variation is performed on the corresponding individual of which the individual variation rate is smaller than the preset variation rate, so as to generate a new individual and select the new individual into the first population, thereby forming an optimized population.

According to an embodiment of the present disclosure, the probability of variation occurring for each individual in the population may be random, and thus the individual variation rate of each individual may be a random value (e.g., a random value in [0, 1]) generated for that individual. This value is compared with the preset variation rate, and if the individual variation rate is less than the preset variation rate, individual variation is performed on the individual.

According to the embodiment of the present disclosure, the individual variation may be a binary mutation, in which one element is randomly selected from all elements of the individual to be mutated. The selected element is first divided by the corresponding element of the matrix blocking kernel parameter; the resulting quotient is converted from a decimal number to a binary number, and one randomly chosen bit of the binary number is inverted to generate a new binary number. The new binary number is then converted back to a decimal number and multiplied by the same element of the matrix blocking kernel parameter, which yields a new element and thus a new individual. This mutation operation ensures that each element of the mutated individual remains an integer multiple of the corresponding element of the matrix blocking kernel parameter.
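The binary mutation can be sketched in a few lines. The function name `binary_mutate` and the tuple representation are hypothetical. Since inverting the only set bit of a quotient would give a block size of zero, the sketch keeps the original individual in that case; this guard is an assumption and is not described in the disclosure, which also leaves open how a mutated value outside the parameter boundary is handled.

```python
import random

def binary_mutate(individual, kernel, rng):
    """Invert one random bit of (element / kernel element) for one random element."""
    pos = rng.randrange(len(individual))
    quotient = individual[pos] // kernel[pos]
    # Choose one bit among the binary digits of the quotient and invert it.
    bit = rng.randrange(quotient.bit_length() or 1)
    flipped = quotient ^ (1 << bit)
    if flipped == 0:
        # Assumption: reject a zero block size and keep the individual unchanged.
        return tuple(individual)
    mutated = list(individual)
    mutated[pos] = flipped * kernel[pos]  # still a multiple of the kernel element
    return tuple(mutated)
```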

According to the embodiment of the disclosure, for the first population which has completed individual selection, new individual pairs and individuals are generated in the individual crossing and individual variation processes, and the new individual pairs and individuals are also included in the first population, and the first population which has completed the above operations is the optimized population. The number of individuals in the optimized population is equal to the number of individuals in the population to be optimized, namely the population number is ensured to be unchanged in the processes of individual selection, individual crossing and individual variation.

Each time a new optimized population is generated, whether it satisfies the predetermined condition is judged. If the predetermined condition is satisfied, the individual with the optimal performance index is selected from the optimized population and used as the matrix blocking parameter for matrix multiplication; otherwise, the genetic algorithm operations are performed again on the optimized population, which serves as the new population to be optimized.
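Putting steps 301 to 304 and the iteration together, one possible shape of the loop is sketched below. Several choices are assumptions made to keep the sketch short: softmax-based selection is used for step 302, offspring replace their parents so that the population size stays constant, a fixed number of iterations is the only stopping condition, and `fitness_fn` stands in for the measured performance index. All names are hypothetical.

```python
import math
import random

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def evolve(population, fitness_fn, kernel, crossover_rate, mutation_rate, rng):
    """Build one optimized population of the same size (steps 301 to 304)."""
    scores = [fitness_fn(ind) for ind in population]
    best = max(zip(scores, population))[1]           # step 301: keep the best individual
    chosen = rng.choices(population, weights=softmax(scores),
                         k=len(population) - 1)      # step 302: fitness-based selection
    rng.shuffle(chosen)
    for i in range(0, len(chosen) - 1, 2):           # step 303: pairwise crossing
        if rng.random() < crossover_rate:
            a, b = list(chosen[i]), list(chosen[i + 1])
            pos = rng.randrange(len(a))
            a[pos], b[pos] = b[pos], a[pos]
            chosen[i], chosen[i + 1] = tuple(a), tuple(b)
    for i, ind in enumerate(chosen):                 # step 304: binary mutation
        if rng.random() < mutation_rate:
            pos = rng.randrange(len(ind))
            q = ind[pos] // kernel[pos]
            q ^= 1 << rng.randrange(q.bit_length() or 1)
            if q > 0:  # assumption: skip mutations that would give a zero block size
                chosen[i] = ind[:pos] + (q * kernel[pos],) + ind[pos + 1:]
    return [best] + chosen

def run(population, fitness_fn, kernel, max_iterations=20, seed=0):
    """Iterate until the maximum number of iterations, then return the best individual."""
    rng = random.Random(seed)
    for _ in range(max_iterations):
        population = evolve(population, fitness_fn, kernel, 0.8, 0.1, rng)
    return max(population, key=fitness_fn)
```

Because the best individual of each generation is carried over unchanged, the best fitness in the population never decreases from one iteration to the next in this sketch.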

As shown in fig. 4, an apparatus 400 for determining matrix blocking parameters for matrix multiplication based on a genetic algorithm according to an embodiment of the present disclosure may include: a parameter boundary acquisition module 401, a block parameter generation module 402, a matrix multiplication actual measurement module 403, a genetic algorithm iteration module 404, and a block parameter determination module 405.

According to an embodiment of the present disclosure, the parameter boundary acquisition module 401 may be configured to acquire matrix partition parameter boundaries for matrix multiplication. The matrix blocking parameter boundary may include a minimum matrix blocking parameter and a maximum matrix blocking parameter, and may be determined based on the actual data storage capacity of the Cache and the minimum computation performance requirement, respectively.

The blocking parameter generation module 402 may be configured to randomly generate a first number of matrix blocking parameters based on the matrix blocking parameter boundary, and use the first number of matrix blocking parameters as a first number of individuals in the population to be optimized. According to an embodiment of the present disclosure, the population number of the population to be optimized is the first number.

The matrix multiplication actual measurement module 403 may be configured to, for each matrix blocking parameter in the first number of matrix blocking parameters, perform matrix blocking using the matrix blocking parameter and correspondingly perform matrix multiplication, obtain a performance index of matrix multiplication based on the matrix blocking parameter, and use the performance index as the individual fitness of the individual corresponding to the matrix blocking parameter. According to the embodiment of the disclosure, the original matrix partitioning parameters in the basic linear algebra subprogram library configuration file can be replaced by the matrix partitioning parameters, and the performance index of the matrix multiplication corresponding to the matrix partitioning parameters is calculated based on the replaced basic linear algebra subprogram library.

The genetic algorithm iteration module 404 may be configured to construct an optimized population by individual selection, individual crossing, and individual variation based on individual fitness of each individual in the population to be optimized. According to the embodiment of the disclosure, the individuals in the population to be optimized are reasonably selected based on the individual fitness of each individual in the population to be optimized, then individual crossing and individual variation are carried out on the selected individuals, so that new individuals are generated, and the selected individuals and the newly generated individuals jointly form the optimized population.

The blocking parameter determination module 405 may be configured to select an individual with an optimal performance index from the optimized population as a matrix blocking parameter for matrix multiplication if the optimized population satisfies a predetermined condition. According to the embodiment of the disclosure, a performance index or a maximum number of iterations can be preset based on specific performance requirements and limits such as the actual calculation time, and the best matrix blocking parameter obtained within these limits is used as the matrix blocking parameter for matrix multiplication. According to the embodiment of the disclosure, once the optimized population has become stable, the individual with the highest individual fitness in the stabilized population can also be used as the matrix blocking parameter for matrix multiplication.

According to an embodiment of the present disclosure, the apparatus 400 may further include a kernel parameter setting module configured to determine the matrix blocking kernel parameter based on the actual structure of the register in the CPU and the maximum amount of data that can be stored by the register. According to an embodiment of the present disclosure, the matrix blocking kernel parameter may represent the basic unit of matrix blocking, and thus the matrix blocking parameters for matrix multiplication may be determined based on the matrix blocking kernel parameter. According to an embodiment of the present disclosure, the parameter boundary acquisition module 401 may obtain a matrix blocking parameter boundary based on the matrix blocking kernel parameter determined by the kernel parameter setting module, and the blocking parameter generation module 402 may generate matrix blocking parameters based on the matrix blocking parameter boundary and the matrix blocking kernel parameter.
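The generation of the initial population from the boundary and the kernel parameter can be sketched as follows. The three-element tuples, the function names, and the example values in the usage note are hypothetical; the only property taken from the disclosure is that each element lies within the boundary and is an integer multiple of the corresponding kernel element.

```python
import random

def random_blocking_parameter(lower, upper, kernel, rng):
    """Draw one blocking parameter whose elements are multiples of the kernel elements."""
    individual = []
    for lo, hi, unit in zip(lower, upper, kernel):
        first = -(-lo // unit)  # smallest multiplier with first * unit >= lo
        last = hi // unit       # largest multiplier with last * unit <= hi
        individual.append(rng.randint(first, last) * unit)
    return tuple(individual)

def initial_population(lower, upper, kernel, size, seed=0):
    """Generate `size` random individuals for the population to be optimized."""
    rng = random.Random(seed)
    return [random_blocking_parameter(lower, upper, kernel, rng)
            for _ in range(size)]
```

For example, `initial_population((8, 8, 8), (256, 256, 512), (8, 4, 4), 10)` would yield ten individuals, each a candidate blocking parameter within the given boundary.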

According to an embodiment of the present disclosure, the apparatus 400 may further include a genetic parameter obtaining module configured to obtain a preset crossover rate and a preset variation rate for the genetic algorithm. According to the embodiment of the disclosure, the preset crossing rate and the preset variation rate are used for guiding the individual crossing and individual variation operation.

As shown in fig. 5, an apparatus 500 for determining matrix blocking parameters for matrix multiplication according to an embodiment of the present disclosure may include a processor 501 and a memory 502, which may be interconnected by a bus 503.

The processor 501 may perform various actions and processes according to programs or codes stored in the memory 502. In particular, the processor 501 may be an integrated circuit chip having signal processing capabilities. The processor may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the various methods, steps, flows, and logic blocks disclosed in the embodiments of the disclosure. The general purpose processor may be a microprocessor or any conventional processor or the like, and may be of the X86 architecture, the ARM architecture, or the like.

The memory 502 stores executable instructions that, when executed by the processor 501, implement a method of determining matrix blocking parameters for matrix multiplication based on a genetic algorithm according to an embodiment of the present disclosure. The memory 502 may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. By way of example and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), Synchlink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DR RAM). It should be noted that the memories of the methods described herein are intended to comprise, without being limited to, these and any other suitable types of memory.

Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer-executable instructions that, when executed by a processor, may implement a method of determining matrix blocking parameters for matrix multiplication based on a genetic algorithm according to embodiments of the present disclosure. Similarly, computer-readable storage media in embodiments of the disclosure may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. It should be noted that the memories of the methods described herein are intended to comprise, without being limited to, these and any other suitable types of memory.

Embodiments of the present disclosure also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform a method of determining matrix blocking parameters for matrix multiplication based on a genetic algorithm according to an embodiment of the present disclosure.

It is to be noted that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises at least one executable instruction for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In general, the various example embodiments of this disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic or any combination thereof. Certain aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of embodiments of the disclosure have been illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.

The exemplary embodiments of the present disclosure described in detail above are merely illustrative, and not restrictive. It will be appreciated by those skilled in the art that various modifications and combinations of these embodiments or features thereof may be made without departing from the principles and spirit of the disclosure, and that such modifications are intended to be within the scope of the disclosure.

Claims

1. A method of determining matrix blocking parameters for matrix multiplication based on a genetic algorithm, comprising:

acquiring a matrix block parameter boundary for matrix multiplication;

randomly generating a first number of matrix partitioning parameters based on the matrix partitioning parameter boundary, and taking the first number of matrix partitioning parameters as a first number of individuals in the population to be optimized;

for each matrix block parameter in the first number of matrix block parameters, performing matrix block by using the matrix block parameter and correspondingly performing matrix multiplication, acquiring a performance index of the matrix multiplication based on the matrix block parameter, and taking the performance index as the individual fitness of the individual corresponding to the matrix block parameter;

constructing an optimized population through individual selection, individual crossing and individual variation based on the individual fitness of each individual in the population to be optimized; and

in a case where the optimized population meets a predetermined condition, selecting an individual with the optimal performance index from the optimized population as a matrix blocking parameter for matrix multiplication.

2. The method of claim 1, wherein performing matrix blocking using the matrix blocking parameter and performing matrix multiplication accordingly to obtain a performance metric for matrix multiplication based on the matrix blocking parameter comprises:

utilizing the matrix block parameters as matrix block parameters in a configuration file of a basic linear algebra subprogram library;

recompiling and installing the basic linear algebra subprogram library to generate a dynamic link library;

generating a first matrix and a second matrix for matrix multiplication, carrying out matrix blocking on the first matrix and the second matrix based on the dynamic link library, and carrying out matrix multiplication on the first matrix and the second matrix; and

obtaining the performance index of the dynamic link library for executing the matrix multiplication.

3. The method of claim 2, wherein the performance index is the number of floating point operations per second performed by a CPU using the basic linear algebra subprogram library employing the matrix blocking parameters.

4. The method of claim 1, further comprising:

determining a matrix partitioning kernel parameter based on an actual structure of a register in a CPU and a maximum data amount which can be stored in the register, wherein the matrix partitioning kernel parameter represents a basic unit of a matrix partitioning;

wherein randomly generating a first number of matrix blocking parameters based on the matrix blocking parameter boundary comprises:

randomly generating a first number of matrix blocking parameters based on the matrix blocking parameter boundary and the matrix blocking kernel parameters, wherein each element in each matrix blocking parameter is an integer multiple of a corresponding element in the matrix blocking kernel parameters.

5. The method of claim 1, further comprising:

acquiring a preset cross rate and a preset variation rate for the genetic algorithm;

wherein constructing an optimized population through individual selection, individual crossing and individual variation based on the individual fitness of each individual in the population to be optimized comprises:

selecting the individual with the highest individual fitness in the population to be optimized into a first population;

selecting individuals with higher individual fitness in the population to be optimized into the first population;

randomly grouping the individuals in the first population pairwise to form a plurality of individual pairs, wherein each individual pair corresponds to an individual pair crossing rate, carrying out individual crossing on the corresponding individual pairs of which the individual pair crossing rates are smaller than the preset crossing rate, generating a new individual pair, and selecting the new individual pair into the first population;

and determining the individual variation rate of each individual in the first population, carrying out individual variation on the corresponding individual of which the individual variation rate is smaller than the preset variation rate, generating a new individual, and selecting the new individual into the first population to form the optimized population.

6. The method of claim 1, wherein the predetermined condition is that the number of iterations to construct the optimized population reaches a preset maximum number of iterations.

7. The method of claim 1, wherein the predetermined condition is that the individuals with the best performance index tend to be stable during the iteration of constructing the optimized population, i.e., the individuals with the best performance index in the optimized population obtained from multiple iterations are the same.

8. An apparatus for determining matrix blocking parameters for matrix multiplication based on genetic algorithm, comprising:

a parameter boundary acquisition module for acquiring a matrix block parameter boundary for matrix multiplication;

a blocking parameter generation module, configured to randomly generate a first number of matrix blocking parameters based on the matrix blocking parameter boundary, and use the first number of matrix blocking parameters as a first number of individuals in a population to be optimized;

a matrix multiplication actual measurement module configured to, for each matrix blocking parameter in the first number of matrix blocking parameters, perform matrix blocking using the matrix blocking parameter and correspondingly perform matrix multiplication, obtain a performance index of the matrix multiplication based on the matrix blocking parameter, and take the performance index as the individual fitness of the individual corresponding to the matrix blocking parameter;

a genetic algorithm iteration module configured to construct an optimized population through individual selection, individual crossing and individual variation based on the individual fitness of each individual in the population to be optimized; and

a blocking parameter determination module configured to select an individual with the optimal performance index from the optimized population as a matrix blocking parameter for matrix multiplication in a case where the optimized population meets a predetermined condition.

9. The apparatus of claim 8, wherein the module for actually measuring matrix multiplication performs matrix blocking using the matrix blocking parameter and performs matrix multiplication accordingly, and obtains the performance index of matrix multiplication based on the matrix blocking parameter, including:

10. The apparatus of claim 9, wherein the performance index is the number of floating point operations per second performed by a CPU using the basic linear algebra subprogram library employing the matrix blocking parameters.

11. The apparatus of claim 8, further comprising:

a kernel parameter setting module configured to determine a matrix blocking kernel parameter based on the actual structure of a register in a CPU and the maximum data amount which can be stored in the register, wherein the matrix blocking kernel parameter represents the basic unit of matrix blocking;

wherein the block parameter generation module randomly generates a first number of matrix block parameters based on the matrix block parameter boundary, including:

12. The apparatus of claim 8, further comprising:

a genetic parameter obtaining module for obtaining a preset crossing rate and a preset variation rate for the genetic algorithm;

wherein the genetic algorithm iteration module constructing an optimized population through individual selection, individual crossing and individual variation based on the individual fitness of each individual in the population to be optimized comprises:

13. The apparatus of claim 8, wherein the predetermined condition is that the number of iterations to construct the optimized population reaches a preset maximum number of iterations.

14. The apparatus of claim 8, wherein the predetermined condition is that the individuals with the best performance index tend to be stable during the iteration of constructing the optimized population, i.e., the individuals with the best performance index in the optimized population obtained from multiple iterations are the same.

15. An apparatus for determining matrix blocking parameters for matrix multiplication based on genetic algorithm, comprising:

a processor; and

a memory having stored thereon computer-executable instructions for implementing the method of any one of claims 1-7 when executed by the processor.

16. A computer-readable storage medium having stored thereon computer-executable instructions for implementing the method of any one of claims 1-7 when executed by a processor.