CN116737159A - Data processing method oriented to processing circuit array and related product - Google Patents

Data processing method oriented to processing circuit array and related product

Info

Publication number: CN116737159A
Application number: CN202210195550.8A
Authority: CN (China)
Prior art keywords: branch, processing circuit, processing, processor, present disclosure
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventor: name withheld at the inventor's request
Current assignee: Anhui Cambricon Information Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Anhui Cambricon Information Technology Co Ltd
Application filed by Anhui Cambricon Information Technology Co Ltd
Priority to CN202210195550.8A
Publication of CN116737159A

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 — Arrangements for software engineering
    • G06F 8/40 — Transformation of program code
    • G06F 8/41 — Compilation
    • G06F 9/00 — Arrangements for program control, e.g. control units
    • G06F 9/06 — Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 — Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 — Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F 9/3885 — Concurrent instruction execution using a plurality of independent parallel functional units
    • G06N 20/00 — Machine learning


Abstract

The present disclosure discloses a data processing method, a computing device, a computer-readable storage medium, and a computer program product for a processor array. The computing device that performs the data processing method may be included in a combined processing device, which may also include an interface device and other processing devices. The computing device interacts with the other processing devices to jointly complete a user-specified computing operation. The combined processing device may further include a storage device connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices. The disclosed scheme provides a data processing scheme for the hardware architecture of a processor array, can fully exploit the parallelism of the hardware, and provides high-performance, high-precision function operations.

Description

Data processing method oriented to processing circuit array and related product
Technical Field
The present disclosure relates generally to the field of intelligent computing, and more particularly to the field of programming. More particularly, the present disclosure relates to a data processing method, a computing device, a computer-readable storage medium, and a computer program product oriented to a processing circuit array.
Background
The basic mathematical function library is one of the most fundamental and important libraries in a computer system and is used in every scenario where data must be processed and computed. Embedded software in fields such as aerospace and avionics typically relies on mathematical library functions (e.g., sin(), tanh()) to implement complex computations. From artificial intelligence, smart cities, and smart healthcare down to everyday life, basic math library functions provide essential data computation support.
As the data volume handled by mathematical library functions grows, improving the performance of vectorized mathematical function computation becomes increasingly important, and vectorized mathematical functions are ever more widely used. The Vector Mathematics (VM) component of Intel's Math Kernel Library (MKL) provides a very rich set of vectorized mathematical functions. The Intel Short Vector Math Library (SVML) is a well-known function library in the industry that provides highly optimized subroutines for evaluating elementary functions and can use the various vector extensions available in Intel processors, but such libraries are proprietary and are optimized only for Intel processors. AMD provides a vectorized libm called the AMD Core Math Library (ACML). The paper "An optimized Cell BE special function library generated by Coconut," IEEE Trans. Comput., vol. 58, no. 8, pp. 1126-1138, Aug. 2009, reported a C implementation of 32 single-precision libm functions tuned for the Cell BE SPU compute engine, developed in an environment named Coconut that supports rapid prototyping of patterns, assembly language fragments, and rapid unit testing. Christoph Lauter published an open-source SIMD vector libm implemented in pure C in "A new open-source SIMD vector libm fully implemented with high-level scalar C," in Proc. 50th Asilomar Conf. Signals, Systems and Computers, 2016, pp. 407-411. The paper by Piparo et al., "Speeding up HEP experiment software with a library of fast and auto-vectorisable mathematical functions," J. Phys.: Conf. Ser., vol. 513, no. 5, 2014, Art. no. 052027, discloses the VDT math library, a math library written to exploit the automatic vectorization capability of the compiler.
These vectorized function libraries are generally implemented in one of two ways: (1) manually splicing or wrapping low-level vector instructions using assembly; or (2) writing a high-level language program and relying on the automatic vectorization provided by a compiler.
On the PuDianNao chip of the existing domestic artificial intelligence processor, vector mathematical functions can be realized only by cyclically calling scalar functions; this approach has low performance and cannot fully exploit the characteristics of the hardware. Moreover, the known vectorized function libraries mentioned above are basically optimized for specific hardware or specific applications.
Therefore, a solution for implementing high-performance, high-precision vector mathematical functions on hardware of the PuDianNao architecture is needed, so as to broaden the application scenarios of the artificial intelligence processor.
Disclosure of Invention
To at least partially solve one or more of the technical problems mentioned in the background, the present disclosure provides solutions in several aspects as follows.
In a first aspect, the present disclosure discloses a data processing method for a processing circuit array for vectorizing scalar functions, the processing circuit array comprising a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, each processing circuit comprising a sub-arithmetic circuit and a sub-storage circuit, the method comprising: receiving source code to be compiled, wherein the source code comprises scalar functions with multi-branch structures; constructing branch conditions of all branches in the multi-branch structure; and controlling whether a corresponding processing circuit in the processing circuit array performs a branching operation or whether an execution result of the branching operation is valid according to the branching condition.
In a second aspect, the present disclosure discloses a computing device for performing a data processing method for a processing circuit array to vectorize scalar functions, the processing circuit array including a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, each processing circuit including a sub-arithmetic circuit and a sub-storage circuit, the computing device comprising: a processor configured to execute program instructions; and a memory configured to store the program instructions, which when loaded and executed by the processor, cause the processor to perform the scalar function vectorization method according to the first aspect of the present disclosure.
In a third aspect, the present disclosure discloses a computer readable storage medium having stored therein program instructions which, when loaded and executed by a processor, cause the processor to perform the scalar function vectorization method according to the first aspect of the present disclosure.
In a fourth aspect, the present disclosure discloses a computer program product comprising a computer program or instructions which, when executed by a processor, implements the scalar function vectorization method of the first aspect of the present disclosure.
According to the above scheme of the present disclosure, scalar functions can be vectorized, so that vector mathematical functions can be implemented efficiently and accurately on hardware architectures such as PuDianNao, fully exploiting the parallel processing characteristics of the hardware circuits.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 illustrates an accelerator architecture of PuDianNao;
FIG. 2 illustrates a typical VLIW instruction architecture;
FIG. 3 illustrates a hardware architecture suitable for SIMT and SIMD programming models;
FIG. 4 illustrates the distinction between SISD, SIMD and SIMT;
FIG. 5 illustrates an exemplary flow chart of a method of scalar function vectorization according to an embodiment of the present disclosure;
FIG. 6 illustrates an exemplary flow chart of a method of scalar function vectorization based on a SIMT programming model in accordance with an embodiment of the present disclosure;
FIG. 7 illustrates one example of an application of an embodiment of the present disclosure;
FIG. 8 illustrates an exemplary flow chart of a method of scalar function vectorization based on a SIMD programming model in accordance with an embodiment of the present disclosure;
FIG. 9 illustrates a block diagram of a hardware configuration of a computing device in which various aspects of embodiments of the present disclosure may be implemented;
FIG. 10 illustrates a block diagram of a combination processing device according to an embodiment of the present disclosure; and
FIG. 11 illustrates a schematic structural diagram of a board card according to an embodiment of the disclosure.
Detailed Description
The following description of the embodiments of the present disclosure is made clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," and the like, as may appear in the claims, specification and drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of the present disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context.
Fig. 1 illustrates the accelerator architecture of PuDianNao.
PuDianNao is a machine learning (ML) accelerator that can simultaneously support seven representative ML techniques: k-means, k-nearest neighbors, naive Bayes, support vector machines, linear regression, classification trees, and deep neural networks. Through in-depth analysis of the computational primitives and locality properties of different ML techniques, PuDianNao can perform up to 1056 GOP/s (e.g., additions and multiplications) in an area of 3.51 mm², consuming only 596 mW. Compared with an NVIDIA K20M GPU (28 nm process), PuDianNao (65 nm process) is 1.20 times faster and reduces energy consumption by a factor of 128.41.
As shown in fig. 1, PuDianNao mainly comprises several functional units (FUs), three data buffers (HotBuf, ColdBuf, and OutputBuf), an instruction buffer (InstBuf), a control module, and a DMA (Direct Memory Access) controller.
The functional units (FUs) are the basic execution units of PuDianNao, and every FU is identical. Specifically, each FU consists of two parts: a machine learning unit (MLU), which provides hardware-customized support for machine learning algorithms, and an arithmetic logic unit (ALU) for conventional computation and control tasks. The MLU adopts a 6-stage pipeline design, so that a given functional requirement can be met by combining the operators and registers of multiple stages. The ALU mainly implements operations not supported by the MLU, i.e., logic that does not occur at high frequency in machine learning algorithms, such as division and conditional assignment. The rationale for this design is the desire for PuDianNao to support, as autonomously as possible, the basic components required by machine learning algorithms. Because PuDianNao is in essence an accelerator, it is embedded as a coprocessor in a host system, and the two cooperate to execute complete computing tasks. If atypical, low-frequency operations of machine learning algorithms were not supported, those operations would have to fall back onto the host CPU, which would increase the coordination overhead between the host system and PuDianNao and degrade the acceleration effect.
Compared with machine learning accelerators previously designed for small-scale machine learning techniques, PuDianNao is more robust to changes in data characteristics or application scenarios, because it provides the user with a range of candidate techniques.
One instruction set in the PuDianNao programming model uses a VLIW (Very Long Instruction Word) architecture. Fig. 2 illustrates a typical VLIW instruction architecture.
As shown, a VLIW instruction performs multiple independent operations per instruction: multiple mutually independent instructions are packed into one very long instruction word, a corresponding number of ALUs complete the series of operations, and the compiler controls instruction scheduling and resolves dependencies between instructions. Because VLIW packs multiple parallel operations into one instruction, the instruction is longer than common RISC and CISC instructions, hence the name very long instruction word.
The computational array of PuDianNao is composed of a number of processing elements (PEs) that execute instructions in lockstep: all PEs execute the same instruction at any given time, and a predicate controls whether a given PE needs to commit the execution results of that instruction. Because multiple PEs execute the same instructions in lockstep, suitable programming models include a single instruction, multiple threads (SIMT) programming model/instruction set similar to CUDA and a single instruction, multiple data (SIMD) programming model/instruction set.
Fig. 3 illustrates a hardware architecture suitable for SIMT and SIMD programming models.
As shown, the hardware architecture may be generally divided into a control circuit 310, a storage circuit 320, and a plurality of processing circuits 330. The plurality of processing circuits 330 are connected in a one-dimensional or multi-dimensional array structure to form one or more processing circuit arrays, which can also be regarded as register array structures, e.g., an array of processing circuits 330 with M rows and N columns. This array structure guarantees parallelism: with an M×N array, M×N elements can be read at a time, and M×N data can be computed at a time. In addition, data can be read by row or by column, as required.
Each processing circuit 330 includes a sub-arithmetic circuit that can perform logical operations, arithmetic operations, and the like, and a sub-storage circuit that can store data, predicate information, and the like.
When using the SIMT programming model, the control circuit 310 fetches and parses SIMT instructions and then sends the parsed SIMT instructions to the plurality of processing circuits 330. The processing circuit array can perform multi-threaded operations according to the parsed SIMT instructions: the predicate registers in the sub-storage circuits are used to delimit branches, the storage registers in the sub-storage circuits hold intermediate results and temporary variables, and the sub-arithmetic circuits perform the operations.
Unlike SISD (single instruction, single data) and SIMD (single instruction, multiple data), one SIMT instruction can process multiple threads and multiple pieces of data. Because the PuDianNao hardware vector circuit can control whether each small core executes a task, a simple mask translator is implemented on this basis: SIMT instructions and their decoders are added to the PuDianNao hardware circuit, thereby wrapping a SIMT programming model on top of the VLIW hardware architecture.
When programming with the SIMT programming model, it is only necessary to describe the behavior of one thread; all threads execute the same instructions in parallel in lockstep. Data can thus be processed in batches: only one instruction needs to be written, and the hardware runs that same instruction simultaneously to parallelize the computation. Each thread can be regarded as an independent program that processes its own data. The operations on the data are varied: they can be simple operations such as addition, subtraction, multiplication, and division, or complex transcendental function calculations such as sin/cos/sqrt/exp/log.
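For illustration only, the per-thread view of SIMT programming may be sketched in C as follows; the thread index tid, the bound n, and the launch mechanism are assumptions made for exposition and are not PuDianNao's actual programming interface:

    /* Sketch: the programmer describes one thread's behavior; the hardware
     * runs the same body on all lanes in lockstep. */
    void kernel_body(int tid, int n, const float *in, float *out) {
        if (tid < n) {
            /* the same instruction stream, applied to per-lane data */
            out[tid] = in[tid] * in[tid] + 1.0f;
        }
    }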
The PuDianNao processor also supports a SIMD programming model/instruction set. SIMD (Single Instruction, Multiple Data) uses one controller to control multiple processing circuits, achieving data parallelism by performing the same computation on each element of the data. When a SIMD instruction executes, multiple elements are computed simultaneously by one instruction. For example, with a 128-bit vector register, four float values can be computed at a time.
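As a minimal illustration of this idea (emulated in plain C rather than actual SIMD instructions), one logical 128-bit operation applies the same computation to four packed floats:

    /* Sketch: a 128-bit vector of four floats and one logical SIMD add. */
    typedef struct { float lane[4]; } v128f;

    static v128f v128f_add(v128f a, v128f b) {
        v128f r;
        for (int i = 0; i < 4; ++i)
            r.lane[i] = a.lane[i] + b.lane[i];  /* same operation on every lane */
        return r;
    }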
The SIMD instruction set of PuDianNao supports a very rich set of instructions, and different instructions can be combined to accomplish many complex functions. The supported instructions include configuration instructions, arithmetic operation instructions, comparison instructions, logic and shift instructions, predicate instructions, extended-precision arithmetic instructions, data movement instructions, and the like.
Fig. 4 illustrates the distinction between SISD, SIMD, and SIMT. As shown, the instructions, data, and results involved differ across instruction sets. SISD is single instruction, single data: one instruction can process only one piece of data. SIMD is single instruction, multiple data: one instruction can apply the same processing to multiple pieces of data at the same time, but pure SIMD cannot parallelize conditional jumps, because a conditional jump requires processing data differently depending on the input. SIMT is single instruction, multiple threads: one instruction can drive multiple threads and multiple pieces of data, and SIMT allows each thread to take a different branch. SIMT is more flexible than SIMD in that the multiple data items of one instruction can be addressed separately, whereas SIMD operands must be contiguous fragments.
On the other hand, PuDianNao is essentially an accelerator and is typically embedded as a coprocessor in a host system, with which it cooperates to execute complete computing tasks. That is, PuDianNao needs to adopt a heterogeneous programming model. A heterogeneous system is typically composed of a general-purpose processor and several domain-specific processors. The general-purpose processor serves as the host side for complex control and scheduling; its main tasks include device acquisition, preparation of data and parameters, creation of execution streams, task description, kernel launching, and collection of outputs. The domain-specific processors serve as the device side for massively parallel and domain-specific computing tasks, and the two sides cooperate to complete the computation. The programming model on PuDianNao is thus heterogeneous, divided into a host side (a general-purpose processor) and a device side (an intelligent processor). The intelligent processor is responsible for large-scale parallel computing or intelligent computing tasks, and its computational throughput far exceeds that of the general-purpose processor. Data are transmitted from the host side to the device side, computed in parallel on the device side, and the final result is output back to the host side.
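The host-side flow described above may be sketched as follows; every function name in this sketch (dev_acquire, kernel_launch, and so on) is a hypothetical placeholder, since PuDianNao's actual runtime interface is not described here:

    #include <stddef.h>

    /* Hypothetical device-runtime prototypes -- placeholders only. */
    void *dev_acquire(int device_id);
    void *dev_alloc(void *dev, size_t bytes);
    void  dev_memcpy_h2d(void *dst, const void *src, size_t bytes);
    void  dev_memcpy_d2h(void *dst, const void *src, size_t bytes);
    void  kernel_launch(void *dev, const char *kernel, void *in, void *out, size_t n);
    void  dev_release(void *dev);

    /* Sketch of the heterogeneous flow: acquire the device, prepare data,
     * launch the kernel, and collect the output on the host side. */
    void host_flow(const float *host_in, float *host_out, size_t n) {
        void *dev  = dev_acquire(0);                          /* device acquisition */
        void *din  = dev_alloc(dev, n * sizeof(float));
        void *dout = dev_alloc(dev, n * sizeof(float));
        dev_memcpy_h2d(din, host_in, n * sizeof(float));      /* host -> device */
        kernel_launch(dev, "vec_math_kernel", din, dout, n);  /* parallel computation */
        dev_memcpy_d2h(host_out, dout, n * sizeof(float));    /* device -> host */
        dev_release(dev);
    }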
The foregoing describes the hardware environment and programming environment in which aspects of embodiments of the present disclosure may be implemented. As mentioned in the background, vector mathematical functions on the PuDianNao hardware architecture can only be realized by cyclically calling scalar functions, which has low performance and cannot fully exploit the characteristics of the hardware.
In view of this, embodiments of the present disclosure provide a scalar function vectorization scheme for a processing circuit array, based for example on the hardware architecture of FIG. 3, which takes into account the characteristics of different programming models in order to vectorize scalar functions and thereby improve the efficiency of function operations.
Fig. 5 illustrates an exemplary flowchart of a scalar function vectorization method according to an embodiment of the present disclosure. The method may be applied to a hardware architecture comprising a processing circuit array, the array comprising a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, each processing circuit comprising a sub-arithmetic circuit and a sub-storage circuit.
As shown, in step 510, source code to be compiled is received, the source code including scalar functions having a multi-branch structure.
Most complex mathematical functions have multiple branches and/or nested branches; such branch structures are referred to herein as "multi-branch structures". The source code may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, or C++, and conventional procedural programming languages such as the "C" language, Python, or similar languages; embodiments of the present disclosure are not limited in this respect. Typically, in source code, these branches are described using nested if/else statement blocks.
Next, in step 520, branch conditions for each branch in the multi-branch structure are constructed. Depending on the programming model, the branch conditions can take different forms. Under the SIMT programming model, a branch condition may be a condition mapped into a corresponding predicate register. Under the SIMD programming model, the branch condition may be an inserted mask vector used to control whether the execution result of each branch is valid. The ways in which specific branch conditions are constructed are described later in connection with different embodiments.
Finally, in step 530, the branch condition is used to control whether the corresponding processing circuit in the processing circuit array performs the branch operation, or whether the execution result of the branch operation is valid.
As mentioned in the previous step, the different forms of branch conditions intervene at different stages of the operation to achieve vectorized execution of the scalar function. With branch conditions held in predicate registers under the SIMT programming model, whether a branch operation needs to be performed can be determined before each processing circuit executes it, saving power and avoiding invalid computation. With the SIMD scheme of superimposing a mask vector as the branch condition, vectorized execution of the scalar function is achieved by having each processing circuit execute the operation according to the SIMD instruction and then multiplying the execution result by the mask vector to filter out invalid results.
Embodiments of the present disclosure thus provide corresponding vectorization schemes, based on the different characteristics of the SIMT and SIMD programming architectures, to control the execution of conditional branches, or the validity of their results, at different stages. Those skilled in the art will appreciate that, although embodiments of the present disclosure are described in the context of the PuDianNao hardware environment, they are not limited to the PuDianNao chip or processor, but may be applied to any hardware environment that supports SIMT and/or SIMD programming models.
Fig. 6 illustrates an exemplary flow chart of a method of scalar function vectorization based on a SIMT programming model in accordance with an embodiment of the present disclosure.
The SIMT programming model does not support jump instructions, so all branches of a function must be traversed once; in this situation SIMT programming becomes difficult and error-prone. For this problem, embodiments of the present disclosure propose a solution of exposing branch ranges and flattening branches, to overcome the difficulty of vectorizing the multi-branch structure of a scalar function, especially the various complex nested branches within it.
As shown, in step 610, source code to be compiled is received, the source code including scalar functions having a multi-branch structure. This step is the same as step 510 of fig. 5 and will not be described again here.
FIG. 7 illustrates an example of code in which embodiments of the present disclosure may be implemented. In this example, the multi-branch portion of a function in the source code is shown. As shown, the multi-branch structure is described by if/else statements and comprises a plurality of branches, with several levels of nested branches in some of them. Specifically, this example contains a total of 8 branches.
Continuing with FIG. 6, next, in step 620, branch ranges for each branch in the multi-branch structure are extracted.
Considering that the SIMT programming model does not support jump instructions and all branches of the function must be traversed once, the disclosed embodiments directly expose the branch range of each branch in order to flatten the branches. Specifically, the upper and lower bounds of each branch may be determined from the conditional branch statements (if/else and the like) used in the scalar function of the source code and from any nesting relationships, so as to extract the specific range of each branch.
In some embodiments, the specific upper and lower bounds of each branch may be noted manually as annotations. In other embodiments, a program may automatically extract the specific range of each branch based on the structure and nesting relationships of the if/else statement blocks.
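Such automatic extraction may be sketched as a walk over a simplified if/else tree that conjoins each condition with those of its ancestors; the tree representation below is an illustrative assumption, not the actual data structure used:

    #include <stdio.h>

    /* Sketch: each node is either an if-node (cond non-NULL) or a branch
     * body (leaf_name non-NULL); the condition accumulated along the path
     * from the root is exactly the branch range to be exposed. */
    typedef struct Node {
        const char *cond;
        struct Node *then_branch, *else_branch;
        const char *leaf_name;
    } Node;

    static void emit_ranges(const Node *n, const char *range) {
        if (!n) return;
        if (n->leaf_name) {
            printf("%s: %s\n", n->leaf_name, range);  /* expose the full range */
            return;
        }
        char buf[256];
        snprintf(buf, sizeof buf, "%s && (%s)", range, n->cond);
        emit_ranges(n->then_branch, buf);             /* range narrows in `then` */
        snprintf(buf, sizeof buf, "%s && !(%s)", range, n->cond);
        emit_ranges(n->else_branch, buf);             /* complement in `else`    */
    }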
For example, in the example of FIG. 7, the branch ranges of the various branches are shown on the right side of the source code. As shown, for the p1 branch, the branch range is x>0; for the p2 branch, x>23; for the p3 branch, x>23 and y>0; and so on.
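For exposition, the kind of nesting that produces these ranges can be reconstructed in C as follows (the branch bodies are placeholders, not the actual code of FIG. 7):

    float target_fn(float x, float y) {
        float r;
        if (x > 0.0f) {            /* p1 range: x > 0 */
            if (x > 23.0f) {       /* p2 range: x > 23 (x > 0 is implied) */
                if (y > 0.0f) {    /* p3 range: x > 23 and y > 0 */
                    r = 1.0f;      /* placeholder branch body */
                } else {
                    r = 2.0f;
                }
            } else {
                r = 3.0f;
            }
        } else {
            r = 4.0f;
        }
        return r;
    }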
Finally, in step 630, corresponding branch conditions are generated and mapped into corresponding predicate registers in the SIMT programming model according to the branch ranges of the respective branches to control, at run-time, whether the corresponding processing circuits in the processing circuit array perform branching operations.
As described above in connection with FIG. 3, in a hardware architecture supporting the SIMT programming model, each processing circuit has its own predicate registers. Instructions in the SIMT programming model traverse all branches in turn, using the predicate registers to control whether the statements of the corresponding branch execute. If a branch executes, the corresponding assembly instructions on that branch are used, and the element is ultimately computed in the ALU. Thus, in embodiments of the present disclosure, corresponding branch conditions are generated from the branch ranges of the respective branches and mapped into the corresponding predicate registers of the SIMT programming model, so that whether each branch executes can be controlled by the predicate registers when the SIMT instructions run.
In some embodiments, conditional branches may be constructed directly from the branch ranges and stored in the corresponding predicate registers. This approach suits branches whose ranges are relatively simple. For example, in the example of FIG. 7, the branch conditions constructed for the respective branches are given on the far right. For the P1 branch, since it is the first branch and its range, x>0, is simple, the conditional branch can be constructed directly and stored in predicate register P1, for example:
setp.gt.gp.s32 %P1, %R1, 0
Similarly, for the P2 branch, whose range is x>23, a conditional branch can also be constructed directly and stored in predicate register P2:
setp.gt.gp.s32 %P2, %R1, 23
In other embodiments, a branch condition may be constructed as a logical combination of the conditions in one or more existing predicate registers together with the branch range, and stored in the corresponding predicate register. This approach suits branches whose ranges are relatively complex. For example, in the example of FIG. 7, the branch range of the P3 branch is x>23 and y>0, i.e., the P2 branch condition (x>23) combined with a new condition (y>0); since the P2 branch condition has already been constructed and stored in predicate register P2, the branch condition of the P3 branch can be built on predicate register P2. For example, the conditional branch may be:
setp.gt.gp.s32 %P3, %R2, 0
and.gp %P3, %P3, %P2
Similarly, for branches P4, P5, and P7, branch conditions are constructed by logically combining the conditions in existing predicate registers and are stored in the P4, P5, and P7 predicate registers, respectively.
Further, when compiling the source code, the compiled instructions are placed under the respective conditional branches to perform the corresponding operations of the scalar function.
The above describes how to program based on the SIMT programming model to vectorize scalar functions. With this scheme, when a scalar function is evaluated, the corresponding SIMT instructions traverse all branches in turn, and the branch condition in the predicate register controls whether the operation of each branch is performed. Invalid operations of the processing circuits can thus be avoided, reducing power consumption. Since the device side (the intelligent processor) cannot return a result from within a particular branch the way the host side (a general-purpose processor) can, the result of each branch is temporarily stored in a register, such as a storage register in the sub-storage circuit, and the value in that register is ultimately returned.
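The effect of this scheme can be emulated in scalar C as follows; this is a sketch of the flattened control flow rather than the generated assembly, and the branch bodies are again placeholders:

    /* Sketch: every branch body is evaluated, and predicates mirroring
     * registers P1..P3 decide which result is committed to the register
     * whose value is ultimately returned. */
    float flattened_fn(float x, float y) {
        int p1 = (x > 0.0f);
        int p2 = (x > 23.0f);
        int p3 = p2 && (y > 0.0f);   /* and.gp-style predicate combination */

        /* all branch bodies execute unconditionally */
        float r_else = 4.0f;         /* x <= 0         */
        float r_p1   = 3.0f;         /* 0 < x <= 23    */
        float r_p2   = 2.0f;         /* x > 23, y <= 0 */
        float r_p3   = 1.0f;         /* x > 23, y > 0  */

        /* since p3 implies p2 implies p1, later commits override earlier ones */
        float r = r_else;
        if (p1) r = r_p1;
        if (p2) r = r_p2;
        if (p3) r = r_p3;
        return r;                    /* the value held in the result register */
    }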
In another aspect, embodiments of the present disclosure also provide a scalar function vectorization scheme based on the SIMD programming model. Although a SIMD instruction can compute multiple data streams simultaneously, in a vector function the branch taken by each element may differ, and all elements cannot be forced through the same flow in a pure SIMD fashion; conditional branches therefore require a solution. The inventors observed that, although SIMD instructions can only perform the same operation on multiple pieces of data at a time, the results of the operations can be selected afterwards to realize the processing of conditional branches.
FIG. 8 illustrates an exemplary flow chart of a method of scalar function vectorization based on a SIMD programming model in accordance with an embodiment of the present disclosure.
As shown, in step 810, source code to be compiled is received, the source code including scalar functions having a multi-branch structure therein. This step is the same as step 510 of fig. 5 and will not be described again here.
Next, in step 820, a mask vector is inserted; it will be multiplied with the execution results of the respective branches to screen out invalid data. Specifically, the element values of the mask vector control whether the execution results of the branch operations of the corresponding processing circuits in the processing circuit array are valid. For example, each element value of the mask vector is determined at run time by whether the corresponding branch condition holds, and may be 0.0 or 1.0: the element value is 1.0 where the branch condition is satisfied and 0.0 where it is not.
Finally, in step 830, the source code is compiled based on the SIMD instruction set. Thus, when the SIMD instructions traverse all branches of the mathematical function, the result of each branch operation is multiplied by the mask to clear the invalid data. Finally, the valid results of all branches are added and spliced together to obtain the vector output.
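This masking scheme can be sketched in plain C as follows, with an ordinary loop standing in for the vector lanes and placeholder branch bodies:

    /* Sketch: each branch result is multiplied by a 0.0/1.0 mask and the
     * valid pieces are summed into the output. */
    void masked_branches(const float *x, float *out, int n) {
        for (int i = 0; i < n; ++i) {
            float m1 = (x[i] > 0.0f) ? 1.0f : 0.0f;  /* mask: condition holds */
            float m0 = 1.0f - m1;                    /* mask: condition fails */
            float r1 = 2.0f * x[i];                  /* branch body, always computed  */
            float r0 = -x[i];                        /* other branch, always computed */
            out[i] = m1 * r1 + m0 * r0;              /* invalid results cleared, summed */
        }
    }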
In some embodiments, temporary variables in the function reuse the space of the output vector, i.e., they are stored in the storage registers, without requesting temporary space. This suits simple functions whose temporary variables occupy little space. In other embodiments, additional temporary space is requested for storing the temporary variables of the function. This suits complex functions whose temporary variables occupy a large amount of space.
The SIMD masking programming approach described above optimizes scalar function vectorization, solving the problem that the multi-branch structure of a mathematical function cannot otherwise be vectorized, with good performance and accuracy.
Based on the vectorization scheme provided by the embodiments of the present disclosure, in particular the scheme based on the SIMT programming model, a vector mathematical function library PuDianNao-VecMath for the PuDianNao chip can be realized, solving the difficulty of vectorizing the multi-branch structures of mathematical functions and providing high-precision, high-performance vector math library functions on the PuDianNao artificial intelligence processor chip. The PuDianNao-VecMath library has good accuracy, stable functions, and correct operation. The provided interfaces include common basic math library functions such as rounding functions (round), transcendental functions (asin), comparison functions (less), and activation functions (tanh), and the supported data types include single-precision floating point (float), half-precision floating point (half), signed integers (int), and unsigned integers (uint32).
The inventors performed accuracy and performance tests on the solutions provided by the embodiments of the present disclosure and compared them with existing solutions (cyclically calling scalar functions; interpolation). The test results show that the SIMT-based scheme has better accuracy and performance than the other schemes and does not occupy intelligent processor space. The SIMD-masking scheme also has good performance and accuracy, but most functions require temporary space, which multiplies in vector functions and occupies space on the device side (the intelligent processor), wasting much precious computing resource.
Specifically, the implementations of several selected functions in the PuDianNao-VecMath math library running on the PuDianNao chip were compared with implementations in other open-source libraries. In terms of precision, the results were compared with those of the glibc scalar functions running on a CPU (i7); the comparison shows that the maximum error of the single-precision functions is 3 ULP (unit in the last place, the unit used to measure distance between floating-point numbers) and the maximum error of the half-precision functions is 1 ULP, both maximum ULP values being less than or equal to those of the CUDA Math API library. Performance is greatly improved over scalar loops: the single-precision version achieves an average speedup of 18.26x over scalar loops, with a maximum speedup of 35.9x; the half-precision version achieves an average speedup of 15.65x, with a maximum of 30.1x. Compared with the CUDA Math API library running on GPU Tesla T4 hardware, the single-precision floating-point average speedup is 1.62x, with a maximum of 3.3x.
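For reference, one common way of measuring the ULP distance between a computed single-precision result and a reference value uses order-preserving integer images of the float bit patterns, as sketched below; this is not necessarily the exact methodology used in the tests above:

    #include <stdint.h>
    #include <string.h>

    /* Map a float to an integer such that adjacent floats map to adjacent
     * integers; the ULP distance is then an integer difference. */
    static int32_t ordered_bits(float f) {
        int32_t i;
        memcpy(&i, &f, sizeof i);           /* reinterpret the bits safely */
        return (i < 0) ? INT32_MIN - i : i; /* make the mapping monotonic  */
    }

    int64_t ulp_distance(float a, float b) {
        int64_t d = (int64_t)ordered_bits(a) - (int64_t)ordered_bits(b);
        return d < 0 ? -d : d;
    }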
The scalar function vectorization method for a processing circuit array according to an embodiment of the present disclosure is described above with reference to the accompanying drawings. The present disclosure also provides a computing device that may be used to perform a method of scalar function vectorization for an array of processing circuits.
Fig. 9 illustrates a block diagram of the hardware configuration of a computing device 900 in which various aspects of embodiments of the disclosure may be implemented. As shown, computing device 900 may include a processor 910 and a memory 920. In the computing device 900 of fig. 9, only the constituent elements related to the present embodiment are shown. It will therefore be apparent to those of ordinary skill in the art that computing device 900 may also include common constituent elements other than those shown in fig. 9, such as a display.
Computing device 900 may correspond to a computing apparatus having various processing functions, e.g., functions for programming, compiling source code. For example, computing apparatus 900 may be implemented as various types of devices, such as a Personal Computer (PC), a server device, a mobile device, and so forth.
The processor 910 is configured to execute program instructions to control all functions of the computing device 900. For example, the processor 910 controls all functions of the computing device 900 by executing programs stored in the memory 920. The processor 910 may be implemented by a central processing unit (CPU), a graphics processing unit (GPU), an application processor (AP), an artificial intelligence processor chip (IPU), or the like provided in the computing device 900. However, the present disclosure is not limited thereto.
Memory 920 is hardware for storing various data processed in computing device 900. For example, memory 920 may store processed data and data to be processed in computing device 900, such as source code before compilation, compiled assembly instructions, and the like. Further, memory 920 may store program instructions for applications, drivers, and the like to be driven by computing device 900; for example, it may store the various programs related to the vectorization methods to be executed by processor 910. The memory 920 may be a DRAM, but the present disclosure is not limited thereto. The memory 920 may include at least one of volatile memory or nonvolatile memory. The nonvolatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, phase-change RAM (PRAM), magnetic RAM (MRAM), resistive RAM (RRAM), ferroelectric RAM (FRAM), and the like. Volatile memory may include dynamic RAM (DRAM), static RAM (SRAM), synchronous DRAM (SDRAM), PRAM, MRAM, RRAM, ferroelectric RAM (FeRAM), and the like. In an embodiment, the memory 920 may include at least one of a hard disk drive (HDD), a solid-state drive (SSD), a CompactFlash (CF) card, a secure digital (SD) card, a micro secure digital (Micro-SD) card, a mini secure digital (Mini-SD) card, an extreme digital (xD) card, a cache, or a memory stick.
In summary, the specific functions implemented by the memory 920 and the processor 910 of the computing device 900 provided in the embodiments of the present disclosure may be explained in comparison with the foregoing embodiments in the present disclosure, and the technical effects of the foregoing embodiments may be achieved, which will not be repeated herein.
In an embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored therein program instructions that, when loaded and executed by a processor, cause the processor to perform the scalar function vectorization method described in the embodiments of the present disclosure. In an embodiment of the present disclosure, there is also provided a computer program product comprising a computer program or instructions which, when executed by a processor, implement a method of scalar function vectorization according to the embodiments described in the present disclosure.
Fig. 10 is a block diagram illustrating a combined processing device 1000 according to an embodiment of the present disclosure. As shown, the combined processing device 1000 includes a computing processing device 1002, an interface device 1004, other processing devices 1006, and a storage device 1008. Depending on the application scenario, the computing processing device may include one or more computing devices 1010, which may be configured as the computing device 900 shown in fig. 9 to perform the operations described herein in connection with the drawings.
In various embodiments, the computing processing device of the present disclosure may be configured to perform user-specified operations. In an exemplary application, the computing processing device may be implemented as a single-core artificial intelligence processor or a multi-core artificial intelligence processor. Similarly, the one or more computing devices included within the computing processing device may be implemented as an artificial intelligence processor core or as part of the hardware structure of an artificial intelligence processor core. When multiple computing devices are implemented as artificial intelligence processor cores or parts of their hardware structures, the computing processing device of the present disclosure may be regarded as having a single-core structure or a homogeneous multi-core structure.
In an exemplary operation, the computing processing device of the present disclosure may interact with other processing devices through the interface device to jointly complete a user-specified operation. Depending on the implementation, the other processing devices of the present disclosure may include one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), an artificial intelligence processor, and the like. These processors may include, but are not limited to, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on, and their number may be determined according to actual needs. As mentioned above, when considered on its own, the computing processing device of the present disclosure may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing processing device and the other processing devices are considered together, they may be regarded as forming a heterogeneous multi-core structure.
In one or more embodiments, the other processing devices may serve as the interface between the computing processing device of the present disclosure (which may be embodied as a computing device related to artificial intelligence, e.g., neural network operations) and external data and controls, performing basic control including, but not limited to, data transfer and the starting and/or stopping of the computing device. In other embodiments, the other processing devices may also cooperate with the computing processing device to jointly accomplish computing tasks.
In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices. For example, the computing device may obtain input data from other processing devices via the interface device, and write the input data to a storage device (or memory) on the computing device. Further, the computing processing device may obtain control instructions from other processing devices via the interface device, and write the control instructions into a control cache on the computing processing device chip. Alternatively or in addition, the interface device may also read data in a memory device of the computing processing device and transmit it to the other processing device.
Additionally or alternatively, the combined processing apparatus of the present disclosure may further comprise a storage device. As shown in the figure, the storage means are connected to the computing processing means and the other processing means, respectively. In one or more embodiments, a storage device may be used to store data for the computing processing device and/or the other processing devices. For example, the data may be data that cannot be stored entirely within an internal or on-chip memory device of a computing processing device or other processing device.
In some embodiments, the present disclosure also discloses a chip (e.g., chip 1102 shown in fig. 11). In one implementation, the Chip is a System on Chip (SoC) and is integrated with one or more combined processing devices as shown in fig. 10. The chip may be connected to other related components by an external interface device (such as external interface device 1106 shown in fig. 11). The relevant component may be, for example, a camera, a display, a mouse, a keyboard, a network card, or a wifi interface. In some application scenarios, other processing units (e.g., video codecs) and/or interface modules (e.g., DRAM interfaces) etc. may be integrated on the chip. In some embodiments, the disclosure also discloses a chip packaging structure including the chip. In some embodiments, the disclosure further discloses a board card, which includes the chip package structure described above. The board will be described in detail with reference to fig. 11.
Fig. 11 is a schematic diagram illustrating the structure of a board card 1100 according to an embodiment of the disclosure. As shown, the board card includes a storage device 1104 for storing data, which includes one or more storage units 1110. The storage device may be connected to the control device 1108 and the chip 1102 described above via, for example, a bus, for data transfer. Further, the board card also includes an external interface device 1106 configured for data relay or transfer between the chip (or a chip in a chip package structure) and an external device 1112 (e.g., a server or a computer). For example, the data to be processed may be transferred from the external device to the chip through the external interface device. For another example, the computation result of the chip may be transmitted back to the external device via the external interface device. The external interface device may take different forms depending on the application scenario; for example, it may use a standard PCIe interface or the like.
In one or more embodiments, the control device in the disclosed board card may be configured to regulate the state of the chip. For this purpose, in an application scenario, the control device may include a single chip microcomputer (Micro Controller Unit, MCU) for controlling the working state of the chip.
From the above description in connection with fig. 10 and 11, those skilled in the art will appreciate that the present disclosure also discloses an electronic device or apparatus that may include one or more of the above-described boards, one or more of the above-described chips, and/or one or more of the above-described combination processing apparatuses.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a PC device, an internet of things terminal, a mobile terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus. The electronic device or apparatus of the present disclosure may also be applied to the internet, the internet of things, data centers, energy sources, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical, and the like. Further, the electronic device or apparatus of the present disclosure may also be used in cloud, edge, terminal, etc. application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, a computationally intensive electronic device or apparatus according to aspects of the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power consuming electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smart phone or camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and collaborative work of an end cloud entity or an edge cloud entity.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the aspects of the present disclosure are not limited by the order of the actions described. Thus, in light of the disclosure or teachings herein, those skilled in the art will appreciate that certain steps may be performed in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be regarded as alternative embodiments, i.e., the actions or modules involved are not necessarily required for implementing one or another aspect of this disclosure. In addition, depending on the scenario, the description of each embodiment of the present disclosure has its own emphasis. In view of this, those skilled in the art will appreciate that, for portions of one embodiment that are not described in detail, reference may be made to the descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings of the present disclosure, one of ordinary skill in the art will appreciate that several embodiments of the disclosure disclosed herein may also be implemented in other ways not disclosed herein. For example, in terms of the foregoing embodiments of the electronic device or apparatus, the units are divided herein by taking into account the logic function, and there may be other manners of dividing the units when actually implemented. For another example, multiple units or components may be combined or integrated into another system, or some features or functions in the units or components may be selectively disabled. In terms of the connection relationship between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustical, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components illustrated as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, some or all of the units may be selected, as actually required, to achieve the objectives of the embodiments of the disclosure. In addition, in some scenarios, multiple units in the embodiments of the disclosure may be integrated into one unit, or each unit may physically exist alone.
In other implementation scenarios, the integrated units may also be implemented in hardware, i.e., as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of such circuits may include, but is not limited to, physical devices, which in turn may include, but are not limited to, devices such as transistors or memristors. In view of this, the various types of devices described herein (e.g., the computing device or other processing devices) may be implemented by appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium, a magneto-optical storage medium, etc.), and may be, for example, a resistive random-access memory (RRAM), a dynamic random-access memory (DRAM), a static random-access memory (SRAM), an enhanced dynamic random-access memory (EDRAM), a high-bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, or a RAM.
The foregoing may be better understood in light of the following clauses:
Clause 1, a data processing method for vectorizing scalar functions on a processing circuit array, the processing circuit array comprising a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, each processing circuit comprising a sub-arithmetic circuit and a sub-storage circuit, the method comprising:
receiving source code to be compiled, wherein the source code comprises a scalar function having a multi-branch structure;
constructing branch conditions for all branches in the multi-branch structure; and
controlling, according to the branch conditions, whether a corresponding processing circuit in the processing circuit array executes a branch operation, or whether the execution result of the branch operation is valid.
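By way of illustration only, and not as part of the claimed subject matter, the following minimal C++ sketch shows the kind of scalar function with a multi-branch structure that the method receives as source code; the function name and the interval boundaries are hypothetical:

    // Hypothetical compiler input: a scalar function whose multi-branch
    // (piecewise) structure selects a different formula per input interval.
    float piecewise(float x) {
        if (x < 0.0f) {             // branch 0
            return -x;
        } else if (x < 1.0f) {      // branch 1
            return x * x;
        } else {                    // branch 2
            return 2.0f * x - 1.0f;
        }
    }

When such a function is vectorized over the processing circuit array, each processing circuit holds one element of the input, and the branch conditions decide either which circuits execute a branch (clause 2) or whose results are treated as valid (clause 7).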
Clause 2, the method of clause 1, wherein constructing the branch conditions for each branch in the multi-branch structure comprises:
extracting branch ranges of all branches in the multi-branch structure; and
generating, according to the branch range of each branch, a corresponding branch condition and mapping it into a corresponding predicate register in a single-instruction multiple-thread (SIMT) programming model, so as to control at run time whether a corresponding processing circuit in the processing circuit array executes a branch operation.
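As an illustrative software model only (the width kLanes and the types Vec and PredReg are hypothetical stand-ins, not an actual hardware interface), predicated execution under a predicate register can be sketched in C++ as follows:

    #include <array>
    #include <cstddef>

    constexpr std::size_t kLanes = 8;           // assumed array width
    using Vec     = std::array<float, kLanes>;  // one element per processing circuit
    using PredReg = std::array<bool, kLanes>;   // one predicate bit per processing circuit

    // A processing circuit executes the branch body at run time only if
    // its bit in the branch's predicate register is set.
    void run_branch1(const PredReg& p, const Vec& x, Vec& y) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            if (p[lane]) {
                y[lane] = x[lane] * x[lane];    // body of branch 1 in the example above
            }
        }
    }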
Clause 3, the method of clause 2, wherein extracting the branch ranges of each branch in the multi-branch structure comprises:
determining an upper limit and a lower limit for each branch in the multi-branch structure according to the conditional branch statements and nesting relationships used in the source code, so as to extract the branch ranges.
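Purely as an illustrative aid (the record type and names are hypothetical), the limits extracted for the piecewise example above could be represented as follows:

    // From the conditional statements and their nesting, the compiler derives:
    //   branch 0: (-inf, 0)    from  (x < 0)
    //   branch 1: [0, 1)       from  !(x < 0) && (x < 1)
    //   branch 2: [1, +inf)    from  !(x < 0) && !(x < 1)
    // Infinite limits can be encoded, e.g., with std::numeric_limits<float>::infinity().
    struct BranchRange {
        float lower;  // lower limit of the branch's input interval
        float upper;  // upper limit of the branch's input interval
    };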
Clause 4, the method of any of clauses 2-3, wherein generating the corresponding branch condition and mapping it into the corresponding predicate register in the SIMT programming model according to the branch range of each branch comprises:
directly constructing the branch condition from the branch range and storing it in the corresponding predicate register.
Clause 5, the method of any of clauses 2-4, wherein generating the corresponding branch condition and mapping it into the corresponding predicate register in the SIMT programming model according to the branch range of each branch comprises:
constructing the branch condition by logically combining, according to the branch range, the conditions already held in one or more existing predicate registers, and storing the branch condition in the corresponding predicate register.
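By way of example only, and reusing the hypothetical PredReg type from the sketch after clause 2, the branch condition for the middle interval of the piecewise example can be built by logically combining two conditions that already sit in predicate registers, rather than re-evaluating them:

    // Branch 1 covers [0, 1): its condition is NOT(x < 0) AND (x < 1),
    // combined from the two existing predicate registers lt0 and lt1.
    PredReg combine_for_branch1(const PredReg& lt0, const PredReg& lt1) {
        PredReg p{};
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            p[lane] = !lt0[lane] && lt1[lane];
        }
        return p;
    }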
Clause 6, the method of any of clauses 2-5, wherein the sub-storage circuit of each processing circuit comprises the predicate register.
Clause 7, the method of clause 1, wherein constructing the branch conditions for each branch in the multi-branch structure comprises:
inserting a mask vector whose element values are used to control whether the execution result of a branch operation of a corresponding processing circuit in the processing circuit array is valid.
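For illustration only, using the same hypothetical Vec type as above, the mask-vector approach can be modeled in plain C++: every lane computes every branch, and the inserted mask decides which result is kept as valid:

    // All branch results are computed unconditionally; the mask elements,
    // derived from the branch conditions, select the valid result per lane.
    Vec masked_piecewise(const Vec& x) {
        Vec y{};
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            float b0 = -x[lane];                  // branch 0 result
            float b1 = x[lane] * x[lane];         // branch 1 result
            float b2 = 2.0f * x[lane] - 1.0f;     // branch 2 result
            bool  m0 = x[lane] < 0.0f;            // mask element for branch 0
            bool  m1 = !m0 && (x[lane] < 1.0f);   // mask element for branch 1
            y[lane] = m0 ? b0 : (m1 ? b1 : b2);   // keep only the masked-in result
        }
        return y;
    }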
Clause 8, the method of clause 7, wherein the element values of the mask vector are determined according to whether the corresponding branch condition evaluates to true or false at run time.
Clause 9, the method of any of clauses 7-8, wherein the source code is compiled based on a single-instruction multiple-data (SIMD) instruction set.
Clause 10, the method of any of clauses 1-9, further comprising: generating, under the respective branch conditions, assembly instructions that perform the corresponding operations in the scalar function.
Clause 11, the method of any of clauses 1-10, wherein the sub-storage circuit further comprises a storage register for storing intermediate operation results and/or temporary variables.
Clause 12, a computing device for performing a data processing method that vectorizes scalar functions on a processing circuit array, the processing circuit array comprising a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, each processing circuit comprising a sub-arithmetic circuit and a sub-storage circuit, the computing device comprising:
a processor configured to execute program instructions; and
a memory configured to store the program instructions that, when loaded and executed by the processor, cause the processor to perform the method according to any of clauses 1-11.
Clause 13, a computer-readable storage medium having stored therein program instructions which, when loaded and executed by a processor, cause the processor to perform the method according to any of clauses 1-11.
Clause 14, a computer program product comprising a computer program or instructions that, when executed by a processor, implement the method of any of clauses 1-11.
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. The appended claims are intended to define the scope of the disclosure and to cover all equivalents and alternatives falling within the scope of those claims.

Claims (14)

1. A data processing method for vectorizing scalar functions on a processing circuit array, the processing circuit array comprising a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, each processing circuit comprising a sub-arithmetic circuit and a sub-storage circuit, the method comprising:
receiving source code to be compiled, wherein the source code comprises a scalar function having a multi-branch structure;
constructing branch conditions for all branches in the multi-branch structure; and
controlling, according to the branch conditions, whether a corresponding processing circuit in the processing circuit array executes a branch operation, or whether the execution result of the branch operation is valid.
2. The method of claim 1, wherein constructing the branch conditions for each branch in the multi-branch structure comprises:
extracting branch ranges of all branches in the multi-branch structure; and
generating, according to the branch range of each branch, a corresponding branch condition and mapping it into a corresponding predicate register in a single-instruction multiple-thread (SIMT) programming model, so as to control at run time whether a corresponding processing circuit in the processing circuit array executes a branch operation.
3. The method of claim 2, wherein extracting branch ranges for each branch in the multi-branch structure comprises:
determining an upper limit and a lower limit for each branch in the multi-branch structure according to the conditional branch statements and nesting relationships used in the source code, so as to extract the branch ranges.
4. The method of any of claims 2-3, wherein generating the corresponding branch condition and mapping it into the corresponding predicate register in the SIMT programming model according to the branch range of each branch comprises:
directly constructing the branch condition from the branch range and storing it in the corresponding predicate register.
5. The method of any of claims 2-4, wherein generating the corresponding branch condition and mapping it into the corresponding predicate register in the SIMT programming model according to the branch range of each branch comprises:
constructing the branch condition by logically combining, according to the branch range, the conditions already held in one or more existing predicate registers, and storing the branch condition in the corresponding predicate register.
6. The method of any of claims 2-5, wherein the sub-storage circuit of each processing circuit includes the predicate register.
7. The method of claim 1, wherein constructing the branch conditions for each branch in the multi-branch structure comprises:
inserting a mask vector whose element values are used to control whether the execution result of a branch operation of a corresponding processing circuit in the processing circuit array is valid.
8. The method of claim 7, wherein the element values of the mask vector are determined according to whether the corresponding branch condition evaluates to true or false at run time.
9. The method of any of claims 7-8, wherein the source code is compiled based on a single-instruction multiple-data (SIMD) instruction set.
10. The method of any of claims 1-9, further comprising:
generating, under the respective branch conditions, assembly instructions that perform the corresponding operations in the scalar function.
11. The method of any of claims 1-10, wherein the sub-storage circuit further comprises a storage register for storing intermediate operation results and/or temporary variables.
12. A computing device for performing a data processing method that vectorizes scalar functions on a processing circuit array, the processing circuit array comprising a plurality of processing circuits connected in a one-dimensional or multi-dimensional array structure, each processing circuit comprising a sub-arithmetic circuit and a sub-storage circuit, the computing device comprising:
a processor configured to execute program instructions; and
a memory configured to store the program instructions which, when loaded and executed by the processor, cause the processor to perform the method according to any one of claims 1-11.
13. A computer-readable storage medium having stored therein program instructions which, when loaded and executed by a processor, cause the processor to perform the method according to any of claims 1-11.
14. A computer program product comprising a computer program or instructions which, when executed by a processor, implement the method of any of claims 1-11.
CN202210195550.8A 2022-03-01 2022-03-01 Data processing method oriented to processing circuit array and related product Pending CN116737159A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210195550.8A CN116737159A (en) 2022-03-01 2022-03-01 Data processing method oriented to processing circuit array and related product

Publications (1)

Publication Number Publication Date
CN116737159A true CN116737159A (en) 2023-09-12

Family ID=87910182

Country Status (1)

CN CN116737159A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination