CN111857833A - Intelligent parallel computing processor and intelligent parallel computing processing method - Google Patents

Info

Publication number
CN111857833A
Authority
CN
China
Prior art keywords: fractal, calculation, parallel computing, fractal calculation, parallel
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010689149.0A
Other languages
Chinese (zh)
Inventor
赵永威 (Zhao Yongwei)
支天 (Zhi Tian)
杜子东 (Du Zidong)
陈云霁 (Chen Yunji)
徐志伟 (Xu Zhiwei)
孙凝晖 (Sun Ninghui)
郭崎 (Guo Qi)
Current Assignee
Institute of Computing Technology of CAS
University of Chinese Academy of Sciences
Original Assignee
Institute of Computing Technology of CAS
University of Chinese Academy of Sciences
Application filed by Institute of Computing Technology of CAS and University of Chinese Academy of Sciences

Classifications

    • G06F 9/3869: Concurrent instruction execution using instruction pipelines; implementation aspects, e.g. pipeline latches, pipeline synchronisation and clocking
    • G06F 9/3836: Instruction issuing, e.g. dynamic instruction scheduling or out-of-order instruction execution
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The present disclosure provides a parallel computing intelligent processor and a parallel computing intelligent processing method. The parallel computing intelligent processor includes: at least two fractal calculation subunits, which perform fractal calculation according to a fractal calculation instruction, wherein the structures of the fractal calculation subunits are identical across levels and their number is set according to the program corresponding to the executed fractal calculation; a controller, which generates the fractal calculation instruction according to the number of fractal calculation subunits and the hardware resources and sends it to the fractal calculation subunits; and a reduction operator, which performs a reduction operation on the fractal calculation results, wherein the rate of the reduction operation is directly proportional to the number of processors the fractal calculation subunits have. The total amount of memory used during computation is independent of the number of processors the parallel computing intelligent processor has.

Description

Intelligent parallel computing processor and intelligent parallel computing processing method
Technical Field
The present disclosure relates to the field of intelligent processor architecture technologies, and in particular, to a parallel computing intelligent processor and a parallel computing intelligent processing method.
Background
The computing model provides an abstraction of a computing system that separates programming on the model from actual execution on the computing system. Thus, whether a program is serial or parallel has no necessary connection with whether it is actually executed serially or in parallel on the computing system. Following the form of Flynn's taxonomy, computing models can be divided into four classes:
Serial Code, Serial Execution (SCSE): SCSE contains the most basic serial computing models, such as the Universal Turing Machine (UTM), the von Neumann model (VNM), and the Random Access Machine (RAM). SCSE programming is simple but cannot exploit the high performance offered by modern multi-core, parallel computing systems.
Parallel Code, Parallel Execution (PCPE): PCPE includes the most basic parallel computing models, such as the Parallel Random Access Machine (PRAM), the LogP model, the BSP model, the Multi-BSP model, the Message Passing Interface (MPI), OpenMP, and the like. PCPE exploits the parallel execution capability of hardware and offers high performance, but at the cost of programming difficulty.
Parallel Code, Serial Execution (PCSE): PCSE is common in simulated execution and includes most Hardware Description Languages (HDLs) and their simulators, such as Verilog HDL, as well as simulators of various parallel architectures. With PCSE, a parallel computing system of any scale can be virtualized on a serial computing system, enabling low-cost analysis and verification of large-scale parallel algorithms, parallel computing systems, and digital circuits.
Serial Code, Parallel Execution (SCPE): SCPE is the development trend of modern computing models; MapReduce, widely adopted in the big-data field, is an example. SCPE seeks to combine the easy programming of serial programs with the high-performance advantages of parallel execution.
Studies have shown that parallel programming is more difficult than serial programming, because it must additionally consider communication and synchronization. If a program is classified as serial or parallel simply by whether it has multiple control flows, then under this rule SIMD programs count as serial programs, while most SPMD programs (e.g., MPI programs) count as parallel programs.
The complexity of parallel programming essentially derives from the programming scale-variance of parallel computing models. As the scale of the computing system changes, the required concurrency of the parallel program changes with it, so programs either cannot execute or must be re-tuned to remain optimal. The Multi-BSP model is a typical example of programming-scale dependency: each time the scale of the computing system expands, the model gains a new layer; the new layer introduces not only a new set of model parameters but also a new set of execution resources requiring additional programming control, so a program written for the original model no longer fits the extended model. Programming-scale dependency makes programming sensitive to the scale (concurrency) of the computing system: the larger the parallel computing system, the harder the programming. Programming-scale dependency is prevalent among existing parallel computing models, including PRAM, LogP, BSP, Multi-BSP, SPMD, SIMD, and MapReduce.
Disclosure of Invention
In view of the above drawbacks, the present disclosure provides a parallel computing intelligent processor and a parallel computing intelligent processing method, to at least partially solve the programming-scale dependency prevalent in the prior art and the reduction in parallel efficiency that can occur as the depth of parallel computation increases.
According to an aspect of the present disclosure, there is provided a parallel computing intelligent processor, including: at least two fractal calculation subunits, used for performing fractal calculation according to a fractal calculation instruction, wherein the structures of the fractal calculation subunits are identical across levels, and the number of fractal calculation subunits is set according to the program corresponding to the executed fractal calculation; a controller, used for generating the fractal calculation instruction according to the number of fractal calculation subunits and the hardware resources and sending it to the fractal calculation subunits; and a reduction operator, used for performing a reduction operation on the fractal calculation result of each fractal calculation subunit according to a reduction operation instruction, wherein the rate of the reduction operation is directly proportional to the number of processors the fractal calculation subunits have. The total amount of storage used during computation is independent of the number of processors the parallel computing intelligent processor has.
In some embodiments, the parallel computing intelligent processor further comprises a memory; the memory is used for storing the data required by the fractal calculation and the fractal calculation result.
In some embodiments, the controller is further configured to send the reduction operation instruction to the reduction operator.
In some embodiments, the fractal computation subunits are identical across levels in application load, hardware resources, and execution mode.
In some embodiments, the memory is a temporary storage.
In some embodiments, the maximum capacity of the temporary memory is not limited.
In some embodiments, the programs corresponding to the fractal calculation include a k-decomposition program, a reduction program, and a leaf program, which are executed in series.
In some embodiments, a first temporary memory is arranged inside the controller for storing the fractal calculation instruction; the controller is further used for performing k-decomposition on the fractal calculation instruction according to the k-decomposition program and sending the decomposed fractal calculation instructions to the fractal calculation subunits.
In some embodiments, the reduction operator and the controller are implemented by a serial computing system.
In some embodiments, the memory retrieves the data required for the fractal calculation from a local memory through input/output, and after the fractal calculation or the reduction operation is finished, writes the fractal calculation result or the reduction operation result back to the local memory.
In some embodiments, the fractal calculation subunit is connected to the memory through a data bus.
According to another aspect of the present disclosure, there is provided a parallel computing intelligent processing method based on the above parallel computing intelligent processor, including: the controller selects the scale for executing the parallel computation according to the number of fractal calculation subunits executing the parallel computation and the hardware resources; the controller generates a fractal calculation instruction according to the scale and sends it to the fractal calculation subunits; the fractal calculation subunits execute the computation in parallel according to the fractal calculation instruction, wherein the structures of the fractal calculation subunits are identical across levels and their number is set according to the program corresponding to the executed fractal calculation; and the reduction operator performs a reduction operation on the fractal calculation result of each fractal calculation subunit, wherein the rate of the reduction operation is proportional to the number of processors the fractal calculation subunits have. The total amount of storage used during computation is independent of the number of processors the parallel computing intelligent processor has.
In some embodiments, the method further comprises:
and the memory stores the data required by the fractal calculation and the fractal calculation result.
In some embodiments, the method further comprises:
and the controller sends the reduction operation instruction to the reduction operator.
In some embodiments, the fractal computation subunits are identical across levels in application load, hardware resources, and execution mode.
In some embodiments, the memory is a temporary storage.
In some embodiments, the maximum capacity of the temporary memory is not limited.
In some embodiments, the programs corresponding to the fractal calculation include a k-decomposition program, a reduction program, and a leaf program, which are executed in series.
In some embodiments, the method further comprises:
the controller performs k-decomposition on the fractal calculation instruction according to the k-decomposition program and sends the decomposed fractal calculation instructions to the fractal calculation subunits.
In some embodiments, the method further comprises:
the memory acquires the data required by the fractal calculation from a local memory through input/output, and after the fractal calculation or the reduction operation is finished, writes the fractal calculation result or the reduction operation result back to the local memory.
In some embodiments, the parallel computation comprises a superstep operation.
In some embodiments, the method further comprises: and copying the cross-group communication request in the parallel computing process from the fractal computing subunit to a temporary memory.
In some embodiments, the method further comprises: and writing the cross-group communication request into a target address.
According to a third aspect of the present disclosure, there is provided an electronic device comprising the parallel computing intelligent processor described above.
The invention provides a parallel computing intelligent processor designed on the principle of hierarchical homogeneity: every fractal calculation subunit in the processor is identical across levels in application load, hardware resources, and execution mode. As a result, the parallel computing model has programming-scale independence: the program needs no modification whatsoever when the system scale is expanded, the system can be scaled without limit through the scale of fractal execution, and serial programming with parallel execution is realized, addressing the general problem of parallel programming.
In addition to the hierarchical design of the parallel computing intelligent processor, the method introduces a constraint condition on the parallel computation. Based on this constraint, on top of programming-scale independence, the parallel computing intelligent processor further achieves parallel-efficiency-scale independence, so that parallel computing efficiency does not degrade as the system scale expands.
Drawings
FIG. 1 schematically illustrates a block diagram of a parallel computing intelligent processor of an embodiment of the present disclosure;
FIG. 2 schematically illustrates a data dependency sequence diagram on a dynamic programming matrix of an embodiment of the present disclosure;
FIG. 3 schematically illustrates a fractal execution process using an intelligent processor to simulate a superstep;
FIG. 4 schematically illustrates a flow diagram of a parallel computing intelligent processing method according to an embodiment of the present disclosure;
fig. 5 schematically illustrates a block diagram of a fractal von neumann architecture of an embodiment of the present disclosure.
Detailed Description
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
It should be noted that in the drawings or description, the same drawing reference numerals are used for similar or identical parts. Implementations not depicted or described in the drawings are of a form known to those of ordinary skill in the art. Additionally, while exemplifications of parameters including particular values may be provided herein, it is to be understood that the parameters need not be exactly equal to the respective values, but may be approximated to the respective values within acceptable error margins or design constraints. In addition, directional terms such as "upper", "lower", "front", "rear", "left", "right", and the like, referred to in the following embodiments, are directions only referring to the drawings. Accordingly, the directional terminology used is intended to be in the nature of words of description rather than of limitation.
To address the problem of programming-scale dependency that is ubiquitous among existing parallel computing models, the present disclosure analyzes the fractal von Neumann architecture. Fig. 5 shows a block diagram of the fractal von Neumann architecture of an embodiment of the present disclosure. At each level, the fractal von Neumann architecture resembles the von Neumann architecture, likewise having a controller, memory, and input/output units, except that each level of the fractal von Neumann architecture may have multiple operation units executing in parallel, namely fractal operation units (FFUs) and a local operation unit (LFU). A fractal operation unit can itself be a computer with a fractal von Neumann architecture, so the organization and hardware resource abstraction of every level are the same. Hardware resources at different levels are controlled in the same way, so scale expansion does not require additional programming control for new types of hardware resources.
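As a rough illustration of this recursive structure (a sketch under our own naming, not the patent's), one level of the fractal von Neumann architecture can be modeled as a node whose FFUs are themselves such nodes:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FractalNode:
    # One level of the fractal von Neumann architecture: besides the usual
    # controller / memory / I/O (elided here), it holds FFUs that are
    # themselves FractalNodes; a node with no FFUs acts as a leaf (VNM).
    ffus: List["FractalNode"] = field(default_factory=list)

    def depth(self) -> int:
        # Every level shares the same abstraction, so depth is plain recursion.
        return 0 if not self.ffus else 1 + max(c.depth() for c in self.ffus)

# A two-level machine is simply a node whose FFUs are one-level machines:
leaf = FractalNode()
machine = FractalNode(ffus=[FractalNode(ffus=[leaf]), FractalNode(ffus=[FractalNode()])])
```

Because every level exposes the same abstraction, expanding the machine adds depth without changing how any level is programmed.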
Based on the analysis of machine learning application loads, the fractal von Neumann architecture characterizes machine learning primitives as fractal operations: an operation f(·) is a fractal operation if the execution behavior of f(X) can be completely simulated by a set of decomposed operations g(f(X1), f(X2), …, f(Xk)), where X1, X2, …, Xk together constitute a decomposition of the input X.
The operation g (-) is called a reduction operation. Based on fractal operation, a fractal instruction set is designed by a fractal Von.Neumann architecture, wherein the fractal instruction set comprises a fractal instruction for expressing the fractal operation and a local instruction for expressing the reduction operation. Following the characteristics of hierarchical nature, each level of the system in the fractal von Neumann architecture adopts the same instruction set architecture, has the same task load abstraction and hides the hardware scale under the instruction set architecture.
The fractal von Neumann architecture maps fractal operations to the FFUs for execution, reduction operations to the LFU, the decomposition of fractal instructions to the controller, and the movement of operation data to the input/output unit. Because every level has the same instruction set structure, the same decomposition rules for fractal instructions, and the same mapping between execution and hardware resources, the same program can be executed at every level, achieving programming-scale independence.
The above analysis shows that an important principle by which the fractal von Neumann architecture achieves programming-scale independence is its hierarchically homogeneous architectural design criterion. On this basis, the present disclosure applies the hierarchical homogeneity criterion to the design of a computing model and designs a parallel computing intelligent processor with programming-scale independence, using the fractal computing model as a bridge that connects the computation to be executed with the intelligent processor, thereby achieving programming-scale independence on existing general-purpose hardware and solving the programming problem. To reflect the hierarchical homogeneity, the intelligent processor is characterized from a single level only, not as a whole, as described in detail below.
As shown in fig. 1, the present disclosure provides an intelligent processor 100 including fractal calculation subunits 10, a controller 20, a temporary memory 30, and a reduction operator 40. A fractal calculation subunit 10 may be a fractal calculation unit, or another calculation unit such as a leaf model (VNM). Each fractal calculation subunit 10 is connected to the temporary memory 30 via a data bus, and the time overhead of propagating a single datum on the data bus is g. The temporary memory 30 can access all data required for the fractal operation from an external memory through input/output.
The fractal calculation subunit 10 is configured to perform fractal calculation according to the fractal calculation instruction. The fractal calculation subunits are identical across levels, specifically in application load, hardware resources, and execution mode; that is, each fractal calculation subunit must meet the hierarchically homogeneous architecture design criterion. The number k of fractal calculation subunits 10 is flexible: any integer not less than 2 may be actively chosen for k according to the program corresponding to the fractal operation. In other computing models, the corresponding parameter, the number of processors p, is a fixed parameter of the model, and the program must passively adapt to the value of p.
The controller 20 is configured to send the fractal calculation instruction to the fractal calculation subunits 10 so that they perform the fractal operation. The controller 20 may contain a temporary memory (the first temporary memory) for storing the fractal instruction. The controller and the first temporary memory form a RAM model; the controller 20 may execute a k-decomposition program on the fractal calculation instruction held in the first temporary memory and send the decomposed fractal calculation instructions to the fractal calculation subunits. Since the memory space required by the controller 20 (on the order of KB) is generally much smaller than the space required for data, the memory space inside the controller is ignored in the intelligent processor and only the execution time of the controller 20 is considered. In some embodiments of the present disclosure, the controller 20 may be implemented by a serial computing system.
The temporary memory 30 is used for storing the data required by the fractal operation and the fractal operation result. Because fractal calculation uses temporary storage, data in the temporary memory is not retained after a fractal calculation finishes; therefore, every fractal calculation must include loading all the data it requires and writing its complete result back to the external local memory. In addition, in the fractal calculation model, the maximum capacity of the temporary memory 30 is not limited.
The reduction operator 40 is used for performing a reduction operation on the fractal operation results of the fractal calculation subunits 10 according to the reduction operation instruction, yielding the final result of the fractal calculation model; the rate of the reduction operation is directly proportional to the number of processors the fractal calculation subunits 10 have. The reduction operation instruction may likewise be sent by the controller 20. The reduction operator 40 and the temporary memory 30 form a RAM model: the reduction operator 40 can read the fractal operation results directly from the temporary memory 30 for the reduction operation, processing r basic operations per unit time, where r is an integer greater than a preset value. In some embodiments of the present disclosure, the reduction operator 40 may be implemented by a serial computing system.
The intelligent processor does not consider parameters outside the level; the characterization of a fractal calculation subunit is done entirely by an execution overhead function t(·), regardless of which calculation unit the subunit is or what parameters it has.
The intelligent processor may ignore the synchronization overhead. The intelligent processor synchronizes all functional units before each fractal calculation finishes. Research shows that synchronization overhead has no significant influence on performance analysis in systems corresponding to the BSP and Multi-BSP models, and therefore it has no significant influence on the intelligent processor corresponding to the fractal calculation model either. To keep the performance analysis concise, the synchronization overhead is ignored when building the intelligent processor.
The intelligent processor can measure the execution time of a program. The execution time of a program is the sum of the execution times of the fractal operations it contains, and the execution time of each fractal operation can be calculated bottom-up: first, the execution overhead t(·) is calculated on the fractal calculation subunit corresponding to the leaf model; then the execution overhead T(·) at each level is estimated as the sum of the time for the controller to perform k-decomposition, the time for the data bus to complete communication, the time for the fractal calculation subunits to execute, and the time for the reduction operator to perform the reduction. The calculation proceeds level by level from the smallest scale until the total execution time of the fractal operation is obtained.
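Under the model's assumptions, this bottom-up estimate can be sketched as follows (a sketch with illustrative parameter names; the patent does not give this code):

```python
def execution_time(n, level, k, g, r, t_leaf, t_decompose):
    # Bottom-up estimate of a fractal operation's execution overhead T(.).
    # n: problem size at this level; k: decomposition degree; g: bus time
    # per datum; r: reduction rate (basic operations per unit time).
    if level == 0:
        return t_leaf(n)                      # leaf-model overhead t(.)
    return (t_decompose(n)                    # controller: k-decomposition time
            + g * n                           # data bus: move n data items
            + execution_time(n / k, level - 1, k, g, r, t_leaf, t_decompose)
            # the k subunits run in parallel, so one sub-problem's time suffices
            + n / r)                          # reduction operator: n ops at rate r
```

For example, with unit decomposition cost, g = r = 1, and a linear-time leaf, a one-level machine on a size-8 problem costs 1 + 8 + 4 + 8 = 21 time units.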
A task load program on the fractal calculation subunit consists of serially executed fractal operations. A complete description of a fractal operation includes the following parts:
1. a k-decomposition program written on the RAM model;
2. a reduction program written on the RAM model;
3. a leaf program written on the leaf model.
These three programs describe the execution behaviors of the controller, the reduction operator, and the leaf model, respectively.
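To make the three-part structure concrete, a serial simulation of this execution scheme might look like the following (a sketch; FractalOp and the program names are our own, not the patent's API):

```python
from collections import namedtuple

# Three-program description of a fractal operation (illustrative names):
# k_decompose runs on the controller, reduce_ on the reduction operator,
# and leaf on the leaf model.
FractalOp = namedtuple("FractalOp", ["k_decompose", "reduce_", "leaf"])

def fractal_execute(op, x, depth):
    # Serial simulation of fractal execution: at depth 0 the leaf program
    # runs; otherwise the input is k-decomposed, each part is executed by a
    # (here serially simulated) fractal calculation subunit, and the partial
    # results are combined by the reduction operation g(.).
    if depth == 0:
        return op.leaf(x)
    parts = op.k_decompose(x)
    partials = [fractal_execute(op, p, depth - 1) for p in parts]
    return op.reduce_(partials)

# Example: summation as a fractal operation with k = 2.
sum_op = FractalOp(
    k_decompose=lambda xs: [xs[:len(xs) // 2], xs[len(xs) // 2:]],
    reduce_=lambda partials: partials[0] + partials[1],
    leaf=sum,
)
```

Note that the same three programs run unchanged at any depth, which is the programming-scale independence the text describes.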
The embodiments of the disclosure list a few specific examples of the intelligent processor executing fractal calculation, to show the programming style of the intelligent processor. Specifically, the disclosed embodiments enumerate the programming and execution of four classes of algorithms on the intelligent processor: the simple parallel (embarrassingly parallel) algorithm, the divide-and-conquer algorithm, the dynamic programming algorithm, and the intrinsically serial (inherently serial) algorithm. Intrinsically serial algorithms are difficult to execute efficiently on any parallel computing system, an example being the simulation of a universal Turing machine; the first three classes can be executed efficiently on the intelligent processor. In addition, in the following detailed description, the concrete computation of a model (such as the leaf model) is performed by a fractal calculation subunit that simulates that model.
The simple parallel algorithm is one that is naturally easy to parallelize: the computation decomposes into completely independent parts with no data dependence and no communication between them, and the results need no reduction or only a very simple one. Tasks such as Monte Carlo simulation, image rendering, and vector operations generally belong to this class. The embodiment of the present disclosure takes naive matrix multiplication as an example: its serial algorithm is usually written as a triple nested loop, the range of each loop can be decomposed arbitrarily, and the results need only simple summation reduction or no reduction at all. The algorithm is easily described as a fractal operation; the description comprises the following three programs:
1. k-decomposition program (listing given as an image in the original document)
2. Reduction program (listing given as an image in the original document)
3. Leaf program (listing given as an image in the original document)
The three programs above define matrix multiplication as a fractal operation, so the task load program contains only this one fractal operation (its listing is given as an image in the original document).
When the program is executed on the intelligent processor, the matrix is first decomposed along the n direction; when n reaches 1, the operation degenerates into a "vector times matrix" operation and decomposition proceeds along the m direction; finally, when both n and m reach 1, the operation degenerates into a vector inner product, and decomposition can still continue along the k direction. The k-decomposition program can also actively choose which dimension to decompose according to other rules. In addition, a more efficient blocked matrix multiplication algorithm (e.g., the Strassen algorithm) can be implemented in a similar manner.
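As an illustration of the decomposition described above (a simplified sketch that 2-decomposes only along the n direction and uses trivial concatenation as the reduction; not the patent's listing):

```python
def matmul_leaf(A, B):
    # Leaf program: the plain triple-loop serial algorithm.
    n, kk, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for t in range(kk):
                C[i][j] += A[i][t] * B[t][j]
    return C

def matmul_fractal(A, B, threshold=2):
    # Fractal matrix multiply, 2-decomposed along the n (row) direction only.
    # The reduction is trivial here: row blocks are simply concatenated.
    # Decomposing along the k direction would instead need a summation reduction.
    if len(A) <= threshold:
        return matmul_leaf(A, B)
    half = len(A) // 2
    top = matmul_fractal(A[:half], B, threshold)     # fractal sub-operation 1
    bottom = matmul_fractal(A[half:], B, threshold)  # fractal sub-operation 2
    return top + bottom                              # reduction: concatenation
```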
The divide-and-conquer algorithm is a class of algorithms that are easy to parallelize: the solution process involves solving several subproblems of the same type, the subproblems are relatively independent with no data dependence, and the results require a more complex reduction. The embodiments of the present disclosure choose merge sort as the example of a divide-and-conquer algorithm. Merge sort first segments the array (corresponding to k-decomposition), recursively merge-sorts each segment (corresponding to fractal computation), and then merges the results (corresponding to the reduction operation). This execution pattern conforms to the definition of a fractal operation, so the embodiment of the present disclosure describes merge sort as the fractal operation MergeSort; the description includes the following three programs:
1. k-decomposition procedure
Figure BDA0002587603310000103
2. Reduction procedure
Figure BDA0002587603310000111
3. Leaf procedure
Figure BDA0002587603310000112
Here, merge sort itself is already a fractal operation, so the task-load program contains only a single fractal operation:
Figure BDA0002587603310000113
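The three MergeSort procedures map onto ordinary recursive code. The sketch below is hypothetical (the patent's programs appear only as figures), with `heapq.merge` standing in for the reduction program; the segmentation is the k-decomposition and the recursive calls are the fractal computation.

```python
import heapq

def merge_sort_fractal(a, k=2):
    """MergeSort as a fractal operation: k-decomposition segments the
    array, the fractal computation sorts each segment recursively,
    and the reduction merges the sorted segments."""
    if len(a) <= 1:
        return list(a)                          # leaf program: trivially sorted
    step = -(-len(a) // k)                      # k-decomposition: ceil(len/k) per segment
    parts = [merge_sort_fractal(a[i:i + step], k)
             for i in range(0, len(a), step)]   # fractal computation
    return list(heapq.merge(*parts))            # reduction: merge sorted segments

assert merge_sort_fractal([5, 2, 9, 1, 5, 6]) == [1, 2, 5, 5, 6, 9]
```

Because the k recursive calls are mutually independent, a fractal processor could execute them on separate subunits and run only the merge serially.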
The parallel implementation of a dynamic programming algorithm is generally not as simple as the previous two algorithms, because the iterations have data dependences and must be executed in a certain order. However, the data dependences in a dynamic programming algorithm usually form a partial order, and if their order is handled carefully, some iterations can still be parallelized. The embodiment of the present disclosure takes the string edit distance algorithm, a classic dynamic programming algorithm, as an example; its Bellman equation is:
Figure BDA0002587603310000114
the data dependency order on the dynamic programming matrix is shown in fig. 2. Research shows that data dependency does not exist between data on each diagonal line of the dynamic programming matrix, and the data can be executed in parallel. The embodiment of the disclosure describes the plan on each diagonal as a fractal operation DP-Step:
1. k-decomposition procedure
Figure BDA0002587603310000121
Each DP-Step derives the next diagonal W of the dynamic programming matrix from the two previous diagonals U and V. Since the elements on a diagonal are perfectly parallel, the decomposition can be done by range; the disclosed embodiments select the simplest 2-decomposition.
2. Reduction procedure
Since the DP-Step is perfectly parallel, no reduction is needed and the reduction procedure is empty.
3. Leaf procedure
Figure BDA0002587603310000122
The leaf program evaluates the Bellman equation for each element of W individually, much like the serial algorithm.
The task-load program needs to control the iterative process of the dynamic programming algorithm and therefore contains basic control flow. To simplify the example, assume both input strings have length 1000; the iteration then divides into two stages. In the first stage, a constant is appended at each end of the newly derived diagonal W as an initial condition, so W is larger than U and V; in the second stage no constants are appended, so W is smaller than U and V. The task-load program can be described as:
Figure BDA0002587603310000123
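Assuming the standard edit-distance recurrence D[i][j] = min(D[i-1][j] + 1, D[i][j-1] + 1, D[i-1][j-1] + [s_i != t_j]) (the Bellman equation shown only as a figure above), the DP-Step idea can be sketched as follows. Each new diagonal W is derived solely from the two previous diagonals U and V, and every element of W is independent of the others, so the inner loop could run in parallel; the code itself is a hypothetical serial rendering.

```python
def edit_distance_diagonal(s, t):
    """Edit distance evaluated along anti-diagonals: each DP-Step
    derives diagonal W (i + j == d) from U (d-2) and V (d-1)."""
    n, m = len(s), len(t)
    U = V = []
    for d in range(n + m + 1):
        base1 = max(0, d - 1 - m)          # first row index stored in V
        base2 = max(0, d - 2 - m)          # first row index stored in U
        W = []
        for i in range(max(0, d - m), min(n, d) + 1):  # perfectly parallel
            j = d - i
            if i == 0 or j == 0:
                W.append(max(i, j))        # boundaries D[0][j] = j, D[i][0] = i
            else:
                sub = U[i - 1 - base2] + (s[i - 1] != t[j - 1])  # D[i-1][j-1]
                delete = V[i - 1 - base1] + 1                    # D[i-1][j] + 1
                insert = V[i - base1] + 1                        # D[i][j-1] + 1
                W.append(min(sub, delete, insert))
        U, V = V, W
    return V[-1]                           # D[n][m]

assert edit_distance_diagonal("kitten", "sitting") == 3
```

Note how the diagonals first grow and then shrink as d sweeps the matrix, matching the two stages of the task-load program described above.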
Based on the above examples, it can be seen that the intelligent processor designed by the present disclosure has programming-scale independence: the program requires no modification when the system scale is extended. Although the intelligent processor requires several programs to be written to fully describe an application and the fractal operations it involves, each program is serial, needs no synchronization or interaction with the others during execution, and is independent of the system scale. Such programs can be automatically expanded to parallel execution on intelligent processors of various scales, mainly because the intelligent processor has features similar to a geometric fractal and can scale infinitely through fractal execution. Therefore, during actual execution, the computing system can freely select the scale down to which fractal execution proceeds according to conditions such as the system scale and hardware resource limitations. Specifically, in the example program (taking merge sort as an example), the initial scale is 1000; the computing system can freely select a scale Z between 1 and 1000, and after fractal execution the scale of any task executed on a leaf model is guaranteed not to exceed Z. For example, when Z is 500, the execution process is:
Figure BDA0002587603310000131
This execution can be deployed in parallel on a small-scale computing system with two processors, defining a fractal computation model whose system scale p is 2. If Z is reduced further, the fractal execution process naturally deepens, adding finer details to the execution and producing a zooming effect. For example, when Z is 250, the execution process is:
Figure BDA0002587603310000141
Such an execution can be deployed on a larger computing system with four processors, defining a system scale of p = 4 for the intelligent processor. This ability to scale the execution process at will, adding or hiding execution details and changing parallelism without modifying the program, is called "infinite scaling".
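The "infinite scaling" choice of Z can be illustrated with a small scheduler sketch (hypothetical names): decomposition recurses until every leaf task fits within Z, so shrinking Z deepens the fractal execution and raises the parallelism without changing the program.

```python
def fractal_schedule(size, Z, k=2):
    """Recursively k-decompose a task of the given size until each
    leaf task is no larger than Z; returns the leaf task ranges."""
    def split(lo, hi):
        if hi - lo <= Z:
            return [(lo, hi)]              # leaf: execute directly
        step = -(-(hi - lo) // k)          # k-decomposition with ceil division
        out = []
        for s in range(lo, hi, step):
            out += split(s, min(s + step, hi))
        return out
    return split(0, size)

# Z = 500 -> two leaf tasks (system scale p = 2)
assert fractal_schedule(1000, 500) == [(0, 500), (500, 1000)]
# Z = 250 -> deeper fractal execution, four leaf tasks (p = 4)
assert fractal_schedule(1000, 250) == [(0, 250), (250, 500),
                                       (500, 750), (750, 1000)]
```

The same program text produces both schedules; only the run-time choice of Z differs, which is exactly the scale freedom described above.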
Based on the above discussion, the intelligent processor is clearly not an abstraction of hardware, and it conflicts with computers having a fractal von Neumann architecture in two main respects:
1. The number of fractal calculation subunits in the intelligent processor is determined by the program, whereas the number of fractal execution units in the fractal von Neumann architecture is a hardware parameter.
2. The capacity of the temporary memory in the intelligent processor is unlimited, whereas the capacity of the local memory in the fractal von Neumann architecture is bounded by hardware resources.
Parallel decomposition is introduced to resolve the first contradiction, and serial decomposition to resolve the second; both decomposition techniques simulate fractal execution multiple times on a single layer of the fractal von Neumann architecture until the intelligent processor is scaled to an appropriate size. An appropriate scale provides sufficient concurrency for the fractal execution units while reducing storage occupation enough to satisfy the hardware resource constraints. Reference may be made to a specific implementation.
In addition, the intelligent processor can also execute on parallel computers with other architectures. For example, on a BSP machine, one processor can be made to receive a task, simulating the controller in the intelligent processor: it performs k-decomposition on the task and distributes the decomposed subtasks to k other processors. If idle processors remain in the system, a processor that received a task can simulate the next-layer controller, again k-decomposing its task and distributing it to other idle processors. Thus at most log_k p processors in the system act as simulated controllers, while the other (p - log_k p) processors simulate the leaf model. After a round of fractal calculation subunit computation (e.g., the computation corresponding to the leaf model) finishes, the results are gathered on the processor that distributed the tasks, which then switches from simulating the controller to simulating the reduction operator and reduces the results. For efficient execution, task distribution should send only the reference addresses of data, not the complete operands.
When a task finally reaches the fractal calculation subunit (processor) simulating the leaf model, that processor fetches the data from the referenced location. The data can be uniformly distributed across the processors by hashing, avoiding any single processor becoming a bottleneck when data is read. During task distribution and result reduction, one processor communicates with at most k+1 other processors, so the h-relation constraint the BSP machine imposes on communication can be satisfied. The three steps of k-decomposition, leaf-model execution, and reduction can be pipelined, further improving the efficiency with which the intelligent processor simulates BSP execution.
Since the system scale of fractal execution can be extended and scaled infinitely, the present disclosure further studies the loss of parallel efficiency that may result from extending the computing system's scale and increasing the depth of fractal execution, described in detail below.
The intelligent processor is also a general-purpose parallel computing system, and a wide range of parallel algorithms can run optimally on it. The BSP model is a classic general-purpose parallel computing model on which a general-purpose parallel computing system can be realized; after applying certain constraint conditions, the intelligent processor can optimally simulate the BSP model and thereby gain the ability to optimally execute any BSP program. The basic unit of execution in the BSP model is the superstep, which conforms to the definition of a fractal operation. FIG. 3 schematically illustrates the fractal execution process of simulating a superstep with the intelligent processor; defining the superstep as a fractal operation in the manner of FIG. 3, the process comprises:
1. The v BSP processors are divided into k groups, and the superstep is k-decomposed into k sub-superstep operations, each covering one group of (v/k) BSP processors.
2. The required computation state (all data needed for the computation) is copied from the temporary memory to the fractal submodels.
3. The k sub-superstep operations are executed in parallel on the fractal submodels.
4. The computed state and the cross-group communication requests are copied from the fractal submodels back to the temporary memory.
5. The reduction operator writes the cross-group communication requests to their target addresses.
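The five steps above can be sketched serially in Python (hypothetical names; real hardware would run step 3 in parallel on the fractal submodels and keep the state in the temporary memory):

```python
def simulate_superstep(state, step_fn, v, k):
    """One BSP superstep executed fractally (steps 1-5 in the text).
    state:   list of per-processor states, len(state) == v
    step_fn: (pid, local_state) -> (new_state, [(target_pid, value)])
    """
    size = v // k
    groups = [list(range(g * size, (g + 1) * size)) for g in range(k)]  # 1. k-decomposition
    cross = []                                  # cross-group communication requests
    for group in groups:                        # 3. sub-supersteps (parallel on hardware)
        local = {p: state[p] for p in group}    # 2. copy required state in
        intra = []
        for p in group:
            local[p], msgs = step_fn(p, local[p])
            for tgt, val in msgs:
                (intra if tgt in group else cross).append((tgt, val))
        for tgt, val in intra:                  # intra-group delivery inside the submodel
            local[tgt] = val
        for p in group:
            state[p] = local[p]                 # 4. copy computed state back out
    for tgt, val in cross:                      # 5. reduction writes cross-group requests
        state[tgt] = val
    return state

out = simulate_superstep([0, 1, 2, 3],
                         lambda p, s: (s * 10, [((p + 1) % 4, s * 10 + 1)]),
                         v=4, k=2)
assert out == [31, 1, 11, 21]
```

Only the cross-group messages reach the reduction operator, which is why the hash tables discussed below never exceed the total storage m.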
Here, optimal simulation means that when the system scale is sufficiently large, the difference between the simulated execution time and the ideal execution time can always be bounded within a constant multiplier, and that multiplier is usually small. To achieve optimal simulation, the present disclosure introduces constraint conditions on the intelligent processor, yielding a parallel computing intelligent processor that optimally simulates the BSP model and can run a wide range of parallel algorithms. The specific constraints are described in detail below.
In a feasible implementation of this embodiment, the constraint condition includes two conditions:
(Limited-storage condition) The total amount of storage m used by the BSP model to be simulated is independent of its number of processors v.
(Relaxation condition) The BSP model to be simulated has at least v = p·log^{1+ε} p processors, where p is the number of processors the intelligent processor has and ε is an arbitrary positive real number.
The reason why, after these constraint conditions are introduced, the fractal computation model can optimally simulate the BSP model, realizing a general-purpose parallel computing intelligent processor with parallel efficiency-scale independence, is as follows:
The execution time of each layer of fractal calculation subunits consists of four parts: k-decomposition, data copying, fractal calculation subunit execution, and the reduction operation, namely:
t_i = t_{k-decomposition} + t_{data copying} + t_{i-1} + t_{reduction}    (1)
where i is the layer index of the fractal calculation subunit. Apart from the subunit execution time t_{i-1}, every term in Equation (1) is O(1). Term by term:
k-decomposition: the time required to perform k-decomposition on the BSP superstep is dependent only on k, and is independent of both v and p, and can be considered as O (1).
Data copying: similar to the BSP hashing the storage during the certification process, the fractal computation model hashes the storage space of the simulated BSP model, so that the needed computation state and the cross-group communication request use a hash table as a data structure. Two hash tables each represent a storage portion (calculation state) included in the node and each represent a storage portion not included in the node (inter-group communication), and therefore the sum of the sizes of the two tables does not exceed the full storage space size m. The upper bound of the data copy amount does not exceed copying all used memory m to local storage and then copying back to the outside, whereas the present disclosure has constrained m to be independent of v, so the time of data copy is also considered to be O (1).
Reduction operation: the reduction operator performs the reduction operation of writing the cross-group communication into the target address, i.e. needs to merge multiple hash tables, and the calculation complexity of the operation is linear. However, since the size of the hash table is bounded by m, the time required for the reduction operation may be constrained to be within a constant time, and is considered to be O (1).
Therefore, equation 1 is abbreviated as:
t_i = t_{i-1} + O(1)    (2)
For an arbitrary N-layer fractal calculation model, Equation (2) can be recursively substituted into itself N times to obtain an expression for the total execution time:
t_N = t_0 + O(N)    (3)
where t_0 is the execution time of the leaf-model fractal calculation subunit. The leaf model may be any computational model, and thus may itself be a BSP model. A BSP model can simulate itself in ideal time; since the disclosure has imposed the relaxation condition, the BSP model to be simulated is always larger in scale than the fractal calculation model, so t_0 is an unbounded quantity exceeding the order O(log p). When the number of layers N is a finite constant, the ratio of t_N to the ideal time converges to 1, satisfying the requirement of optimal simulation.
The above demonstrates the role of constraint condition 1; constraint condition 2 is demonstrated below.
Research shows that introducing pipelined execution prevents the intelligent processor's efficiency from gradually degrading as the number of layers grows, so a pipeline mechanism is added. When the next fractal operation has no data dependence on the one currently executing, the overhead introduced by each layer can be hidden, so even if the number of layers grows without bound the overhead does not accumulate to infinity, and Equation (2) remains of order O(1). What obstructs pipelined overhead hiding is data dependence: if each BSP superstep is simulated as a fractal operation according to the demonstration of constraint condition 1, the fractal operations have strictly serial data dependences and cannot be pipelined.
To break the data dependence between fractal operations and enable pipelining, the present disclosure relaxes the BSP model to be simulated, assuming it has v = p·log^{1+ε} p processors, where p is the total number of leaf-model fractal calculation subunits contained in the intelligent processor and ε is an arbitrary positive real number. Each superstep of the BSP model to be simulated originally involves p·log^{1+ε} p processors; the embodiment of the disclosure decomposes it into log^{1+ε} p serially executed sub-supersteps, each involving p processors, with no data dependence between the sub-supersteps.
After this relaxation, the intelligent processor's pipeline breaks at most once per log^{1+ε} p fractal operations; this implementation treats that as one cycle in the worst case. The pipeline depth of fractal execution equals the number of layers of the fractal computation model, which is log p, so the overhead exposed by pipeline startup and flushing within one cycle is always O(log p) (i.e., the portions of the startup and flush stages that do not overlap with the leaf-model computation time). This overhead is amortized over the log^{1+ε} p fractal operations in the cycle, so the overhead each fractal operation incurs from pipeline breaks is infinitesimal, meaning the constant multiplier these overheads impose on the total execution time converges to 1.
From Equation (2), the overhead of each fractal operation executing on each layer of the intelligent processor is O(1); the pipeline can overlap the extra overhead introduced by any number of layers of the fractal computation model, so the overhead remains O(1) even in an intelligent processor with unboundedly many layers. That is, a relaxation can be found that flattens the extra cost of pipeline breaks into an infinitesimal quantity, so that even a strictly serial data dependence in the simulated BSP model can still be hidden by pipelining.
Therefore, after the constraint conditions are introduced, the intelligent processor can optimally simulate the BSP general-purpose parallel computing model, yielding a general-purpose parallel computing intelligent processor with parallel efficiency-scale independence; that is, any parallel algorithm that can be programmed on the BSP model can also be realized on the intelligent processor.
Based on the general-purpose parallel computing intelligent processor, another embodiment of the present disclosure further provides a parallel computing intelligent processing method based on the parallel computing intelligent processor, where the parallel computing intelligent processing method satisfies the constraint condition described above in a computing process, as shown in fig. 4, and the method may include:
S401, the controller selects the scale for executing the parallel computation according to the number of fractal calculation subunits executing the parallel computation and the hardware resources.
As described above, the parallel computing intelligent processor can actively designate any integer not less than 2 as k to suit the parallel computation being executed, and the depth of the parallel computation can be selected according to the number of fractal calculation subunits and the hardware resources of the parallel computing intelligent processor.
Before the parallel computation is performed, the temporary memory generally obtains the data required by the fractal computation from the local memory through input/output and distributes it to each fractal calculation subunit through the data bus.
S402, the controller generates parallel computing instructions according to the execution scale and sends them to the fractal calculation subunits.
Continuing the example above: when the scale of the sorting computation is 500, one fractal calculation subunit sorts elements 0-500 while another sorts elements 500-1000 in parallel; that is, the controller generates one parallel computing instruction for sorting 0-500 and one for sorting 500-1000, and execution proceeds hierarchically. When the execution scale is 250, the parallel computing intelligent processor needs four fractal calculation subunits, sorting 0-250, 250-500, 500-750, and 750-1000 in parallel; the controller accordingly generates four parallel computing instructions, one per range. The controller sends each generated parallel computing instruction to the corresponding fractal calculation subunit. In the disclosed embodiment, the parallel computation may be, for example, the superstep operation described above.
S403, the fractal calculation subunits execute the parallel computation in parallel according to the parallel computing instructions.
Because a temporary memory is used for the parallel computation, each fractal calculation subunit must load all the data its operation requires: the temporary memory obtains the data needed by the fractal computation from the local memory through input/output and distributes it to each fractal calculation subunit through the data bus.
S404, a reduction operation is performed on the results obtained by the parallel computation.
Whether the reduction operation is executed depends on the type of operation; for example, in the dynamic programming algorithm implemented with the fractal computation model introduced above, the DP-Step is perfectly parallel and the reduction program is empty, so no reduction needs to be executed. The rate of the reduction operation is proportional to the number of processors the fractal calculation subunits have.
S405, writing the calculated data back to the local memory.
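The flow S401-S405 can be sketched end to end for the merge-sort example (hypothetical helper names; `leaf_fn` plays the leaf program and `reduce_fn` the reduction operator):

```python
import heapq

def parallel_compute(data, p_available, Z_max, leaf_fn, reduce_fn):
    """Hypothetical sketch of S401-S405.
    S401: choose the execution scale from subunit count and resources
    S402: generate one instruction (sub-range) per fractal subunit
    S403: subunits execute in parallel (simulated serially here)
    S404: reduce the partial results
    S405: the caller writes the result back to local memory
    """
    n = len(data)
    scale = max(-(-n // p_available), min(Z_max, n))       # S401
    instructions = [(i, min(i + scale, n))                 # S402
                    for i in range(0, n, scale)]
    partials = [leaf_fn(data[lo:hi]) for lo, hi in instructions]  # S403
    result = partials[0]
    for part in partials[1:]:                              # S404
        result = reduce_fn(result, part)
    return result                                          # S405 (write-back)

out = parallel_compute(list(range(1000, 0, -1)), p_available=4, Z_max=250,
                       leaf_fn=sorted,
                       reduce_fn=lambda a, b: list(heapq.merge(a, b)))
assert out == list(range(1, 1001))
```

With Z_max = 250 and four subunits the scheduler emits exactly the four sorting instructions described in S402 above; raising Z_max to 500 would reproduce the two-subunit case.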
As is known to those skilled in the art, besides implementing clients and servers purely in computer-readable program code, it is entirely possible, by logically programming the method steps, to make clients and servers realize the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such clients and servers may therefore be regarded as a kind of hardware component, and the means included therein for realizing various functions may also be regarded as structures within the hardware component; or even the means for realizing various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, both for the embodiments of the client and the server, reference may be made to the introduction of embodiments of the method described above.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The method employs a temporary memory: after the parallel computation finishes, the computed data result is copied from the fractal calculation subunits to the temporary memory and then written back to the local memory. In addition, after the operation finishes, the cross-group communication requests generated during the parallel process are copied from the fractal calculation subunits back to the temporary memory and then written to their target addresses. This completes the parallel computation.
For the details of the embodiment of the method, please refer to the above-mentioned model embodiment, which is not described herein again.
In summary, the intelligent processor designed according to the layer-wise homogeneous architecture design criterion gives the fractal computation model programming-scale independence: the program need not be modified when the system scale is extended, execution can be scaled infinitely through the scale of fractal execution, and the computing system can freely select that scale at run time according to the system scale, hardware resource limitations, and other conditions, thereby achieving serial programming with parallel execution and addressing the general problem of parallel programming. Meanwhile, constraint conditions are introduced to constrain the intelligent processor; on the basis of programming-scale independence, these further enable the intelligent processor to optimally simulate a general-purpose parallel computing model, realizing a general-purpose parallel computing intelligent processor with parallel efficiency-scale independence whose parallel computing efficiency does not degrade as the system scale expands.
The above-mentioned embodiments are intended to illustrate the objects, aspects and advantages of the present disclosure in further detail, and it should be understood that the above-mentioned embodiments are only illustrative of the present disclosure and are not intended to limit the present disclosure, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (25)

1. A parallel computing intelligent processor, comprising:
the fractal calculation subunits are used for performing fractal calculation according to a fractal calculation instruction, wherein the fractal calculation subunits have structures of the same hierarchy; the number of fractal calculation subunits is set according to the program corresponding to the executed fractal calculation;
the controller is used for generating a fractal calculation instruction according to the number of the fractal calculation subunits and hardware resources and sending the fractal calculation instruction to the fractal calculation subunits;
the reduction operator is used for performing a reduction operation on the fractal calculation result of each fractal calculation subunit according to a reduction operation instruction, wherein the rate of the reduction operation is proportional to the number of processors the fractal calculation subunits have; the total amount of storage used in the computing process of the parallel computing intelligent processor is independent of the number of processors the parallel computing intelligent processor has.
2. A parallel computing smart processor as claimed in claim 1, further comprising a memory; wherein,
the memory is used for storing the data required by the fractal calculation and the fractal calculation result.
3. The parallel computing intelligent processor of claim 1, wherein the controller is further configured to send the reduction operation instruction to the reduction operator.
4. The parallel computing intelligent processor of claim 1, wherein the fractal calculation subunits are hierarchical in application load, hardware resources, and execution mode.
5. Parallel computing smart processor according to claim 2, characterized in that the memory is a temporary storage.
6. Parallel computing smart processor according to claim 5, characterized in that the maximum capacity of the temporary memory is not limited.
7. The parallel computing intelligent processor of claim 1, wherein the programs corresponding to the fractal calculation comprise a k-decomposition program, a reduction program and a leaf program, and the k-decomposition program, the reduction program and the leaf program are connected in series.
8. The intelligent processor for parallel computing according to claim 7, wherein a first temporary memory is disposed inside the controller, and the first temporary memory is used for storing the fractal computing instruction.
9. The intelligent processor of claim 8, wherein the controller is further configured to perform k-decomposition on the fractal calculation instruction according to the k-decomposition program, and send the decomposed fractal calculation instruction to the fractal calculation subunit.
10. The parallel computing intelligent processor of claim 3, wherein the reduction operator and the controller are implemented by a serial computing system.
11. The intelligent processor for parallel computing according to claim 2, wherein the memory obtains data required by the fractal computation from a local memory through input/output;
and after the fractal calculation or the reduction operation is completed, the fractal calculation result or the reduction operation result is written back to the local memory.
12. Parallel computing smart processor according to any of claims 1 to 11, characterized in that the fractal computing subunit is connected to the memory via a data bus.
13. A parallel computing intelligent processing method based on the parallel computing intelligent processor of any one of claims 1 to 11, characterized by comprising:
the controller selects the scale for executing the parallel computation according to the number of fractal computation subunits for executing the parallel computation and hardware resources;
the controller generates a parallel computing instruction according to the scale and sends the parallel computing instruction to the fractal calculation subunits;
the fractal calculation subunits execute the parallel computation according to the parallel computing instruction, wherein the fractal calculation subunits have structures of the same hierarchy, and the number of fractal calculation subunits is set according to the program corresponding to the executed fractal calculation;
the reduction operator performs a reduction operation on the fractal calculation result of each fractal calculation subunit, wherein the rate of the reduction operation is proportional to the number of processors the fractal calculation subunits have; the total amount of storage used in the computing process of the parallel computing intelligent processor is independent of the number of processors the parallel computing intelligent processor has.
14. The intelligent processing method for parallel computing according to claim 13, further comprising:
the memory storing the data required by the fractal calculation and the fractal calculation result.
15. The intelligent processing method for parallel computing according to claim 13, further comprising:
the controller sending the reduction operation instruction to the reduction arithmetic unit.
16. The intelligent processing method for parallel computing according to claim 13, wherein the fractal computation subunit has a hierarchical nature in application load, hardware resources and execution mode.
17. The intelligent processing method for parallel computing according to claim 14, wherein the memory is a temporary memory.
18. The intelligent processing method for parallel computing according to claim 17, wherein the maximum capacity of the temporary memory is not limited.
19. The intelligent processing method for parallel computing according to claim 13, wherein the programs corresponding to the fractal calculation include a k-decomposition program, a reduction program and a leaf program, and the k-decomposition program, the reduction program and the leaf program are executed in series.
20. The intelligent processing method for parallel computing according to claim 19, further comprising:
executing a k-decomposition on the fractal calculation instruction according to the k-decomposition program, and sending the decomposed fractal calculation instructions to the fractal calculation subunits.
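A minimal sketch of what a k-decomposition of a fractal instruction could look like, under the assumption (introduced here, not stated in the claims) that an instruction is an operation over a contiguous operand range:

```python
def k_decompose(instruction, k):
    """Hypothetical k-decomposition: split one fractal instruction over an
    operand range into k smaller instructions of the same form, which would
    then be dispatched to the fractal calculation subunits."""
    op, start, end = instruction            # e.g. ("sum", 0, 100)
    step = (end - start + k - 1) // k       # ceiling division: at most k parts
    return [(op, s, min(s + step, end)) for s in range(start, end, step)]
```

Because each sub-instruction has the same form as its parent, the same decomposition can be applied recursively at every level of the fractal hierarchy.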
21. The intelligent processing method for parallel computing according to claim 14, further comprising:
the memory acquiring the data required by the fractal calculation from a local memory through input/output; and, after the fractal calculation or the reduction operation is finished, writing the fractal calculation result or the reduction result back to the local memory.
22. The intelligent processing method for parallel computing according to claim 13, wherein the parallel computing comprises a superstep operation.
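The superstep of claim 22 follows the bulk-synchronous parallel (BSP) pattern: local computation, then communication, separated by a global barrier. A sketch, assuming a thread-per-subunit model and the hypothetical names `run_superstep`, `compute`, and `exchange`:

```python
from threading import Barrier, Thread

def run_superstep(local_states, compute, exchange):
    """BSP-style superstep sketch: every subunit computes on its local data,
    all wait at a barrier, and only then are the outputs exchanged."""
    n = len(local_states)
    barrier = Barrier(n)
    outboxes = [None] * n

    def worker(i):
        outboxes[i] = compute(local_states[i])   # local computation phase
        barrier.wait()                           # global synchronisation

    threads = [Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return exchange(outboxes)                    # communication phase
```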
23. The intelligent processing method for parallel computing according to claim 13 or 14, further comprising:
copying a cross-group communication request arising during the parallel computation from the fractal calculation subunit to a temporary memory.
24. The intelligent processing method for parallel computing according to claim 13 or 14, further comprising:
writing the cross-group communication request to its target address.
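Claims 23 and 24 together describe a two-phase handoff: stage the cross-group request in temporary memory, then commit it to the target address. A sketch under assumed data structures (an outbox of `(address, payload)` pairs, a list as scratchpad, a dict as global memory — all introduced here for illustration):

```python
def forward_cross_group_requests(subunit_outbox, scratchpad, global_memory):
    """Sketch of claims 23-24: a cross-group communication request is first
    copied from a fractal calculation subunit into temporary memory, and
    then written to its target address."""
    for target_addr, payload in subunit_outbox:
        scratchpad.append((target_addr, payload))  # claim 23: stage in temporary memory
    for target_addr, payload in scratchpad:
        global_memory[target_addr] = payload       # claim 24: commit to target address
    scratchpad.clear()
```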
25. An electronic device comprising the parallel computing intelligent processor according to any one of claims 1 to 12.
CN202010689149.0A 2020-07-16 2020-07-16 Intelligent parallel computing processor and intelligent parallel computing processing method Pending CN111857833A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010689149.0A CN111857833A (en) 2020-07-16 2020-07-16 Intelligent parallel computing processor and intelligent parallel computing processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010689149.0A CN111857833A (en) 2020-07-16 2020-07-16 Intelligent parallel computing processor and intelligent parallel computing processing method

Publications (1)

Publication Number Publication Date
CN111857833A true CN111857833A (en) 2020-10-30

Family

ID=72983284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010689149.0A Pending CN111857833A (en) 2020-07-16 2020-07-16 Intelligent parallel computing processor and intelligent parallel computing processing method

Country Status (1)

Country Link
CN (1) CN111857833A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461466A (en) * 2013-09-25 2015-03-25 广州中国科学院软件应用技术研究所 Method for increasing computing speed through parallel computing based on MPI and OpenMP hybrid programming model
CN105787865A (en) * 2016-03-01 2016-07-20 西华大学 Fractal image generation and rendering method based on game engine and CPU parallel processing
CN109726806A (en) * 2017-10-30 2019-05-07 上海寒武纪信息科技有限公司 Information processing method and terminal device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YONGWEI ZHAO et al.: "Cambricon-F: Machine Learning Computers with Fractal von Neumann Architecture", 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113268269A (en) * 2021-06-07 2021-08-17 中科计算技术西部研究院 Acceleration method, system and device for dynamic programming algorithm
CN113268269B (en) * 2021-06-07 2022-10-14 中科计算技术西部研究院 Acceleration method, system and device for dynamic programming algorithm

Similar Documents

Publication Publication Date Title
Karloff et al. A model of computation for MapReduce
US20200042856A1 (en) Scheduler for mapping neural networks onto an array of neural cores in an inference processing unit
US8099584B2 (en) Methods for scalably exploiting parallelism in a parallel processing system
Van Tendeloo et al. PythonPDEVS: a distributed Parallel DEVS simulator
KR20220042424A (en) Compiler flow logic for reconfigurable architectures
Cai et al. Tensoropt: Exploring the tradeoffs in distributed dnn training with auto-parallelism
Xiao et al. Plasticity-on-chip design: Exploiting self-similarity for data communications
Shterenlikht et al. Fortran 2008 coarrays
Kaitoua et al. Hadoop extensions for distributed computing on reconfigurable active SSD clusters
Jiang et al. Improving the Performance of Whale Optimization Algorithm through OpenCL‐Based FPGA Accelerator
US8041551B1 (en) Algorithm and architecture for multi-argument associative operations that minimizes the number of components using a latency of the components
CN111857833A (en) Intelligent parallel computing processor and intelligent parallel computing processing method
US20240020537A1 (en) Methodology to generate efficient models and architectures for deep learning
Cheng et al. Pushing the limits of machine design: Automated CPU design with AI
Piñeiro et al. A unified framework to improve the interoperability between HPC and Big Data languages and programming models
Hoppe et al. A modular massively parallel computing environment for three-dimensional multiresolution simulations of compressible flows
Davis et al. Paradigmatic shifts for exascale supercomputing
Stewart A programming model for BSP with partitioned synchronisation
Mondal Big data parallelism: issues in different X-information paradigms
CN111857834A (en) Fractal calculation intelligent processor and fractal calculation intelligent processing method
Lantreibecq et al. Model checking and co-simulation of a dynamic task dispatcher circuit using CADP
Barker Runtime support for load balancing of parallel adaptive and irregular applications
Mukala Arithmetic Deduction Model for High Performance Computing: A Comparative Exploration of Computational Models Paradigms
Fernando et al. Scalable Local Timestepping on Octree Grids
Jindal et al. Efficient development of parallel NLP applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201030)