CN111860798A - Operation method, device and related product

Operation method, device and related product

Info

Publication number
CN111860798A
Authority
CN
China
Prior art keywords
operand, memory, address, local memory, instruction
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910545270.3A
Other languages
Chinese (zh)
Inventor
Not announced
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to PCT/CN2020/083280 priority Critical patent/WO2020220935A1/en
Priority to EP21216615.1A priority patent/EP4012556A3/en
Priority to US17/606,838 priority patent/US20220261637A1/en
Priority to EP20799083.9A priority patent/EP3964950A4/en
Priority to PCT/CN2020/087043 priority patent/WO2020221170A1/en
Priority to EP21216623.5A priority patent/EP3998528A1/en
Publication of CN111860798A publication Critical patent/CN111860798A/en
Priority to US17/560,490 priority patent/US11841822B2/en
Priority to US17/560,411 priority patent/US20220188614A1/en

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Advance Control (AREA)

Abstract

The disclosure relates to an operation method, an operation device and a related product. The arithmetic device may include a processor configured to receive an input instruction, a memory controller configured to load an operand into a storage unit, and a plurality of operation nodes configured to execute the input instruction according to the operand, so as to implement the operation corresponding to the input instruction. The arithmetic device according to the present disclosure can improve operation efficiency.

Description

Operation method, device and related product
Technical Field
The present disclosure relates to the field of information processing technologies, and in particular, to an operation method, an operation device, and a related product.
Background
In the technical field of artificial intelligence, neural network algorithms have recently become very popular machine learning algorithms and have achieved very good results in various fields, such as image recognition, speech recognition and natural language processing. As neural network algorithms have developed, their complexity has grown ever higher, and in order to improve recognition accuracy, the scale of the models has gradually increased. Processing these large-scale models with GPUs and CPUs takes a great deal of computation time and consumes a great deal of power.
Disclosure of Invention
In view of the above, the present disclosure provides an operand obtaining method and an arithmetic device.
According to an aspect of the present disclosure, there is provided a method for obtaining an operand, the method including:
searching a data address information table to determine whether the operand is already stored on a local memory component;
if the operand is already stored on the local memory component, determining the storage address of the operand on the local memory component according to the storage address of the operand in an external storage space and the data address information table;
and assigning the storage address of the operand on the local memory component to the instruction for acquiring the operand.
According to another aspect of the present disclosure, there is provided an arithmetic device including a plurality of layers of operation nodes, each operation node including a local memory component, a processor and next-layer operation nodes,
wherein, when loading an operand from the memory component of the upper-layer operation node of the current operation node to the local memory component, the processor searches a data address information table to determine whether the operand is already stored on the local memory component;
if the operand is already stored on the local memory component, the processor determines the storage address of the operand on the local memory component according to the storage address of the operand in the external storage space and the data address information table, and assigns the storage address of the operand on the local memory component to the instruction for obtaining the operand.
In one possible implementation, if the operand is not stored on the local memory component, the processor generates a control signal for loading the operand according to the storage address of the operand, and this control signal is used to load the operand from its storage address to the local memory component.
In a possible implementation manner, the data address information table records an address correspondence relationship, where the address correspondence relationship includes: the storage address of the operand on the local memory component and the storage address of the operand in the external storage space.
In one possible implementation, the local memory component includes a static memory segment and a circular memory segment,
the processor is used for decomposing the input instruction of any one operation node to obtain a plurality of sub-instructions;
if a common operand exists among the sub-instructions, the processor allocates memory space for the common operand in the static memory segment and allocates memory space for the other operands of the sub-instructions in the circular memory segment;
wherein the common operand is an operand used in common when the next-layer operation nodes of the operation node execute the plurality of sub-instructions, and the other operands are the operands of the plurality of sub-instructions other than the common operand.
In a possible implementation manner, at least one data address information table corresponding to the static memory segment and a plurality of data address information tables corresponding to the circular memory segment are provided in the processor.
In one possible implementation, before allocating memory space for the common operand in the static memory segment, the processor first searches the at least one data address information table corresponding to the static memory segment to determine whether the common operand is already stored in the static memory segment of the local memory component;
if the common operand is already stored in the static memory segment of the local memory component, determining the storage address of the common operand on the local memory component according to the storage address of the common operand on the memory component of the upper-layer operation node and the at least one data address information table corresponding to the static memory segment;
and assigning the storage address of the common operand on the local memory component to the instruction for loading the common operand.
In one possible implementation manner, before allocating memory space for the other operands in the circular memory segment, the processor searches the plurality of data address information tables corresponding to the circular memory segment to determine whether the other operands are already stored in the circular memory segment of the local memory component;
if the other operands are already stored in the circular memory segment of the local memory component, determining the storage addresses of the other operands on the local memory component according to the storage addresses of the other operands on the memory component of the upper-layer operation node and the plurality of data address information tables corresponding to the circular memory segment,
and assigning the storage addresses of the other operands on the local memory component to the instruction for acquiring the other operands;
and if the other operands are not stored in the circular memory segment of the local memory component, loading them.
In a possible implementation manner, when an operand is loaded from the memory component of the upper-layer operation node to the static memory segment, the processor determines the data address information table to be updated according to the count value of a first counter; the count value of the first counter is used to distinguish the different data address information tables corresponding to the two ends of the static memory segment;
and the data address information table to be updated is updated according to the storage address of the loaded operand on the memory component of the upper-layer operation node and its storage address in the static memory segment.
In one possible implementation manner, when other operands are loaded from the external storage space to any one of the plurality of sub-memory blocks of the circular memory segment, the processor updates the data address information table corresponding to that sub-memory block according to the storage addresses of the loaded operands in the external storage space and their storage addresses on the local memory component.
According to another aspect of the present disclosure, there is provided an operand obtaining apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform the above method.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the above-described method.
According to the embodiment of the disclosure, the data stored on the local memory component is recorded by setting a data address information table, so that before the operand of an input instruction is loaded from the external storage space it can be checked whether the operand is already stored on the local memory component; if it is, the operand does not need to be loaded from the external storage space to the local memory component, and the operand already stored on the local memory component is used directly, which saves bandwidth resources.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows an application scenario diagram according to an embodiment of the present disclosure.
Fig. 2 shows a flow diagram of a method of operand fetch according to an embodiment of the present disclosure.
Fig. 3 shows a flow diagram of a method of operand fetch according to an embodiment of the present disclosure.
Fig. 4 shows a block diagram of an arithmetic device according to an embodiment of the present disclosure.
FIG. 5a illustrates a block diagram of an operational node according to an embodiment of the present disclosure.
Fig. 5b illustrates an example of a pipeline according to an embodiment of the present disclosure.
FIG. 6 shows a flow diagram of a process of serial decomposition according to an embodiment of the present disclosure.
Fig. 7 illustrates an example of partitioning of a memory component according to an embodiment of the present disclosure.
Fig. 8 is a schematic diagram illustrating a memory space allocation method for a static memory segment according to an embodiment of the disclosure.
Fig. 9 is a schematic diagram illustrating a memory space allocation method for a static memory segment according to an embodiment of the disclosure.
FIG. 10 shows a schematic diagram of a pipeline according to an example of the present disclosure.
FIG. 11 illustrates an example of partitioning of a memory component according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
In order to better understand the technical solutions described in the present application, the following first explains the technical terms related to the embodiments of the present application:
calculating a primitive: machine learning is a computation and memory intensive technique, highly parallel at different levels, and the present disclosure decomposes machine learning into matrix and vector based operations, e.g., aggregating operations such as vector multiplication matrices and matrix multiplication vectors into matrix multiplication, aggregating operations such as matrix add/subtract matrices, matrix multiplication scalars and vector base arithmetic into element-by-element operations, and so on. Seven main computation primitives can be obtained by decomposing and aggregating machine learning, including: inner Product (IP), Convolution (CONV), Pooling (POOL), matrix multiplication (MMM), element-wise operation (ELTW), Sorting (SORT) and Counting (COUNT). The above computation primitives summarize the main features of machine learning, and they are all operations that can be decomposed.
Operations that can be decomposed: if an operation g(·) satisfies the following formula (1),

f(X) = g(f(X_A), f(X_B), ...)    (1)

then the operation f(·) with operand X is said to be decomposable, where f(·) is the target operator, g(·) is a search operator, X denotes all the operands of f(·), and X_A, X_B, ... represent subsets of the operand X, where X may be tensor data.
For example, if f(X) = X × k, where k is a scalar, then f(X) may be decomposed into:
f(X) = [X_A, X_B, ...] × k = g(f(X_A), f(X_B), ...),
where the operation g(·) combines f(X_A), f(X_B), ... into a matrix or vector according to the way in which X was decomposed.
Operation classification: for operations that can be decomposed, based on the relationship between the decomposed operands X_A, X_B, ... and X, the operation can be divided into three categories: independent operations, input-dependent operations, and output-dependent operations.
Independent operation: the decomposed operands X_A, X_B, ... are independent of each other and do not overlap; each subset X_A, X_B, ... can perform its partial operation independently, and the final operation result is obtained simply by combining the results of the partial operations performed on each subset. Taking vector addition as an example of an independent operation, X may first be split into the two operands of the addition (i.e., the two input vectors x, y); since x and y can be split into the subsets (x_A, x_B) and (y_A, y_B), the two subsets can independently perform local vector additions, i.e., z_A = x_A + y_A and z_B = x_B + y_B, and the final operation result only needs the combination of the partial results, i.e., z = [z_A, z_B].
Input-dependent operation: the decomposed operands X_A, X_B, ... overlap, that is, the operands of the decomposed local operations coincide in part, so there is input redundancy. Taking a one-dimensional convolution as an example of an input-dependent operation, let the two operands be x and y, with x = [x_A, x_B] and z = [z_A, z_B] = x * y = [x_A, x_B] * y. The operation is still divided into two parts, but the operands of the two partial operations overlap: part x_A additionally needs a piece x_b taken from x_B, and part x_B additionally needs a piece x_a taken from x_A, i.e., z_A = [x_A, x_b] * y and z_B = [x_a, x_B] * y. Each partial operation can be performed independently, and the final operation result only needs the combination of the partial results, i.e., z = [z_A, z_B].
Output-dependent operation: the final operation result needs to be obtained by reducing the results of the decomposed local operations. Taking the inner product as an example of an output-dependent operation, the inner product z = x·y can be divided into two partial operations, each of which still performs an inner product, z_A = x_A·y_A and z_B = x_B·y_B; however, to obtain the final result, the results of the local operations need to be summed, i.e., z = z_A + z_B. Thus, g(·) is a summation operation, g(·) = sum(·). It should be noted that some operations may be input dependent or output dependent after decomposition, and the specific dependency is related to the decomposition manner.
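As a non-limiting illustration of the three classes, the following sketch (the use of NumPy, the array sizes and the particular splits are assumptions made here for exposition only) performs each partial operation on a decomposed subset and then combines the results:

```python
import numpy as np

x, y = np.arange(8.0), np.arange(8.0, 16.0)
x_A, x_B = x[:4], x[4:]
y_A, y_B = y[:4], y[4:]

# Independent operation: vector addition; the partial results are simply concatenated.
z_add = np.concatenate([x_A + y_A, x_B + y_B])            # equals x + y

# Input-dependent operation: 1-D "valid" convolution z = x * y_k; for this split each half
# of the output additionally needs a small overlapping piece (x_a, x_b) of the other half of x.
y_k = np.array([1.0, 2.0, 3.0])
x_b, x_a = x_B[:1], x_A[-1:]                              # redundant (overlapping) inputs
z_A = np.convolve(np.concatenate([x_A, x_b]), y_k, mode="valid")
z_B = np.convolve(np.concatenate([x_a, x_B]), y_k, mode="valid")
z_conv = np.concatenate([z_A, z_B])                       # equals np.convolve(x, y_k, mode="valid")

# Output-dependent operation: inner product; the partial results must be reduced, g(.) = sum.
z_ip = x_A @ y_A + x_B @ y_B                              # equals x @ y
```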
The above calculation primitives can be divided into three classes, but it should be noted that different decomposition methods may cause different dependencies, as shown in table 1 below.
Table 1 computational primitive analysis
Computing primitive | Decomposition mode        | Dependence        | g(·)  | Data redundancy
IP                  | Length                    | Output dependence | Add   | -
CONV                | Feature                   | Output dependence | Add   | -
CONV                | N dimension (batch)       | Input dependence  | -     | Weight
CONV                | H or W dimension (space)  | Input dependence  | -     | Weight, overlap
POOL                | Feature                   | Independent       | -     | -
POOL                | H or W dimension (space)  | Input dependence  | -     | Overlap
MMM                 | Left side, vertical       | Output dependence | Add   | -
MMM                 | Right side, vertical      | Input dependence  | -     | Left matrix
ELTW                | Arbitrary                 | Independent       | -     | -
SORT                | Arbitrary                 | Output dependence | Merge | -
COUNT               | Arbitrary                 | Output dependence | Add   | -
The length in the IP decomposition mode refers to the length direction of the vector. The operand of the convolution operation may be tensor data expressed in the NHWC (batch, height, width, channels) format; decomposition in the feature direction refers to decomposition along the C dimension, and decomposition of the POOL operation in the feature direction has the same meaning for the operand. Decomposition of the convolution operation along the N dimension is input dependent and the input redundancy is the weight, that is, the convolution kernel; decomposition in space (the H or W dimension) is also input dependent, and in addition to the weight, the input redundancy includes the overlap between the two decomposed pieces of tensor data. The left side and right side in the decomposition mode of MMM mean that the left or right operand of the MMM is decomposed, and vertical means that the decomposition is performed in the vertical direction of the matrix. The ELTW operation is independent under any way of decomposing the operand, while both the SORT and COUNT operations are output dependent.
As can be seen from the above analysis, the computation primitives of the machine learning are all separable operations, and when the computation device of the present disclosure is used to perform the computation of the machine learning technique, the computation primitives can be separated and then computed according to actual requirements.
Input instruction: may be an instruction describing a machine learning operation, which may be composed of the computation primitives described above; the input instruction may include operands, operators and the like.
Common operand: the operand used in common by the plurality of sub-operations after an operation is decomposed, or, after an input instruction is decomposed into a plurality of sub-instructions, the operand used in common by the plurality of sub-instructions.
Machine learning is a computation- and memory-intensive technology, and frequent data accesses place high demands on the bandwidth of an arithmetic device performing machine learning operations. In order to reduce the pressure on the bandwidth of the arithmetic device, the present disclosure provides an operand acquisition method, which can be applied to a processor. The processor may be a general-purpose processor, for example a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and the like. The processor may also be an artificial intelligence processor for performing artificial intelligence operations, which may include machine learning operations, brain-like operations, and the like. Machine learning operations include neural network operations, k-means operations, support vector machine operations and the like. The artificial intelligence processor may, for example, include one or a combination of an NPU (Neural-network Processing Unit), a DSP (Digital Signal Processor) and a Field Programmable Gate Array (FPGA) chip. The artificial intelligence processor may include a plurality of arithmetic units, and the plurality of arithmetic units may perform operations in parallel.
Fig. 1 shows an application scenario diagram according to an embodiment of the present disclosure. As shown in fig. 1, when the processor executes an input instruction, it needs to load an operand of the input instruction from the external storage space to the local memory component, and after the input instruction is executed, outputs an operation result of the input instruction to the external storage space. In order to reduce bandwidth pressure, the embodiment of the disclosure records data stored on the local memory component by setting the data address information table, so that whether the operand of the input instruction is stored on the local memory component can be checked before the operand of the input instruction is loaded from an external storage space, if the operand is stored, the operand of the input instruction is not required to be loaded onto the local memory component from the external storage space, and the operand stored on the local memory component is directly used, so that bandwidth resources can be saved.
Wherein, an address corresponding relationship may be recorded in the data address information table, and the address corresponding relationship may include: the storage address of the operand on the local memory element and the storage address of the operand on the external storage space.
Table 1 shows an example of a data address information table according to an embodiment of the present disclosure.
It should be noted that Out_addr1, In_addr1, etc. in Table 1 are merely symbols denoting addresses. The addresses recorded in the data address information table of the embodiment of the present disclosure may be in the form of a start address plus a granularity indicator, where the start address may refer to the start address of the memory space storing the operand, and the granularity indicator may indicate the size of the operand; that is, information such as the start address of the data storage and the size of the data is recorded.
Table 1. Data address information table

Storage address in external storage space | Storage address on local memory component
Out_addr1                                 | In_addr1
Out_addr2                                 | In_addr2
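For illustration only, one row of such a table might be modelled as follows (the field names, the size unit and the example addresses are hypothetical and are not part of the disclosure):

```python
from dataclasses import dataclass

@dataclass
class AddressEntry:
    """One address correspondence: where the operand lives externally and locally."""
    ext_start: int    # start address in the external storage space (e.g. Out_addr1)
    local_start: int  # start address on the local memory component (e.g. In_addr1)
    size: int         # granularity indicator: size of the operand (bytes assumed)

# The data address information table is then simply a small collection of such entries.
data_address_table: list[AddressEntry] = [
    AddressEntry(ext_start=0x1000, local_start=0x040, size=256),  # Out_addr1 -> In_addr1
    AddressEntry(ext_start=0x2000, local_start=0x140, size=128),  # Out_addr2 -> In_addr2
]
```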
Fig. 2 shows a flow diagram of a method of operand fetch according to an embodiment of the present disclosure. As illustrated in fig. 2, the method may include:
Step S11: look up in the data address information table whether the operand is already stored on the local memory component;
Step S12: if the operand is already stored on the local memory component, determine the storage address of the operand on the local memory component according to the storage address of the operand in the external storage space and the data address information table;
Step S13: assign the storage address of the operand on the local memory component to the instruction for obtaining the operand.
After receiving a data load instruction, the processor may execute the data load instruction to load the operand onto the local memory component. Specifically, the data load instruction is bound to the storage address of the operand in the external storage space, a control signal for loading the data is generated according to the data load instruction (the bound storage address), and a DMA (Direct Memory Access) unit executes the data loading process according to the control signal.
However, according to the embodiment of the present disclosure, before the control signal for loading the data is generated, step S11 may be executed to look up in the data address information table whether the operand to be loaded is already stored on the local memory component.
As described above, the data address information table may record address correspondence relationships; when the address correspondence relationships contain the storage addresses of all of the operand in the external storage space, it may be determined that the operand is already stored on the local memory component, and when they do not, it may be determined that the operand is not stored on the local memory component. Specifically, whether the operand is already stored on the local memory component can be found from the storage addresses in the external storage space recorded in the data address information table. In other words, if the operand to be loaded was stored previously, a correspondence between its storage address in the external storage space and its storage address on the local memory component is recorded in the data address information table; when the same operand is to be loaded next time, if the storage addresses in the external storage space recorded in the data address information table are found to include the storage address of the operand to be loaded in the external storage space, this indicates that the operand to be loaded is already stored on the local memory component and can be used directly without being loaded again.
For example, in some cases the operand may not be just a single number but may be a plurality of numbers, or may include a plurality of vectors, matrices, tensors and so on. In this case, the storage address in the external storage space bound by the data load instruction may be the address of a segment of storage space. When the storage addresses in the external storage space in the address correspondence relationships completely contain the storage address of the operand bound by the data load instruction in the external storage space, it may be determined that the operand is already stored on the local memory component; if they do not contain it, or contain only part of it, it may be determined that the operand is not stored on the local memory component.
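For a contiguous operand described by a start address and a size, the full-containment test just described reduces to an interval check. A minimal sketch, reusing the hypothetical AddressEntry above, in which a partial overlap is treated as a miss:

```python
from typing import Optional

def lookup(table: list[AddressEntry], ext_start: int, size: int) -> Optional[AddressEntry]:
    """Return the entry whose external range fully contains [ext_start, ext_start + size);
    partial containment counts as a miss, so the operand would be (re)loaded."""
    for entry in table:
        if entry.ext_start <= ext_start and ext_start + size <= entry.ext_start + entry.size:
            return entry
    return None
```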
In a possible implementation manner, checking whether the two address segments are in an inclusion relationship does not require traversing the addresses of all the data in the operand; it is only necessary to check whether the addresses of the data at two points of the operand fall within the storage address in the external storage space of any address correspondence relationship recorded in the data address information table. For example, if the operand is a matrix, it is only necessary to check whether the storage addresses of the data at the two vertices on the diagonal of the matrix are contained by the storage address in the external storage space of any recorded address correspondence relationship, and it is not necessary to check the storage address of every element of the matrix. Generalized to an N-dimensional space, for two axis-parallel hypercubes in the N-dimensional space it is only necessary to check whether the storage addresses of the data at the two vertices on the main diagonal of the operand are contained by the storage address in the external storage space of any recorded address correspondence relationship. In hardware, besides the register required to record the entry, each table entry may be provided with two discriminators, which can be used to judge whether the two diagonal vertices satisfy the inclusion condition; if both discriminators give a positive discrimination, the entry is considered to be hit, that is, the storage address of the operand to be queried in the external storage space falls within the storage address in the external storage space of that address correspondence relationship (entry), and the operand to be queried is already stored on the local memory component. For example, assume that:
Recorded table entry: 10000 [10,11] [1,2] [20,21]
Entry to be queried: 10053 [4,5] [6,7] [18,19]
From the granularity of the recorded entry, the condition for the data with address 10000 + 21*x1 + x0 to lie inside this tensor is:
0<=x0<21
2<=x0<2+11
0<=x1<20
1<=x1<1+10
From the granularity of the query entry, the condition for the data with address 10053 + 19*y1 + y0 to lie inside this tensor is:
0<=y0<19
7<=y0<7+5
0<=y1<18
6<=y1<6+4
Check the two vertices of the entry to be queried on its main diagonal: the point where y0 and y1 simultaneously take their minimum values and the point where they simultaneously take their maximum values, which also correspond respectively to the minimum and maximum values of the data address range. Minimum: y0 = 7, y1 = 6, address 10174; maximum: y0 = 11, y1 = 9, address 10235.
Checking whether 10174 and 10235 lie inside the recorded entry first requires recovering the coordinates x0 and x1 in reverse. Let
10000+21*x1+x0=10174
21*x1+x0=174
Since the coefficient (1) of the low-dimensional variable (x0) is always a factor of the coefficient (21) of the high-dimensional variable (x1), solving this equation requires only integer division. (A solution can be obtained directly when the dimension is 1; one integer division is needed when the dimension is 2; when the dimension is n, n-1 successive integer divisions are needed, the remainder being used as the dividend each time and the quotients being assigned in turn from the high dimension to the low dimension.)
174 divided by 21 gives 8 with remainder 6 (the fractional part is discarded), so x1 = 8 and x0 = 6. This yields a unique solution for x.
Next, it is determined whether the point x1 = 8, x0 = 6 satisfies the conditions for lying inside the tensor. This point is inside the tensor, since 1 <= x1 < 11 and 2 <= x0 < 13.
To realize the above method, the discriminator needs a subtractor (for 10174 - 10000), n integer dividers and 2n comparators, where n is the maximum dimensionality, typically no more than 8.
The two discriminators respectively judge the two vertexes. If both discriminators give a positive discrimination, the entry is considered to be hit.
Each TTT does not need to hold many entries, for example 8 to 32 entries suffice, because the number of tensors processed in an operation is not large. When querying, the maximum and minimum addresses are first calculated and broadcast to each TTT and to the two discriminators of every recorded entry; all discriminators work simultaneously, and the TTT only needs to return any entry that gives a positive discrimination.
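The discriminator described in this example can be sketched in software as follows (a non-limiting illustration: the entry layout of base address, per-dimension sizes, offsets and buffer extents mirrors the worked example above, and all names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class TensorEntry:
    base: int            # start address of the underlying buffer, e.g. 10000
    sizes: list[int]     # extent of the stored tensor in each dimension, high dimension first
    offsets: list[int]   # offset of the stored tensor inside the buffer, per dimension
    extents: list[int]   # full extent of the underlying buffer, per dimension

    def contains_address(self, addr: int) -> bool:
        """Discriminator: recover the coordinates by n-1 successive integer divisions
        (high dimension to low dimension, reusing the remainder) and check the bounds."""
        rem = addr - self.base                        # the subtractor
        coords = []
        for i in range(len(self.extents) - 1):
            stride = 1
            for e in self.extents[i + 1:]:
                stride *= e
            coords.append(rem // stride)              # integer divider
            rem %= stride
        coords.append(rem)
        return all(0 <= c < ext and off <= c < off + sz   # the 2n comparators
                   for c, ext, off, sz in zip(coords, self.extents, self.offsets, self.sizes))

# Worked example from the text: record entry 10000 [10,11] [1,2] [20,21]; the diagonal
# vertices of the query entry have addresses 10174 and 10235. The entry counts as a hit
# only if both vertices pass their discriminator.
record = TensorEntry(base=10000, sizes=[10, 11], offsets=[1, 2], extents=[20, 21])
hit = record.contains_address(10174) and record.contains_address(10235)
```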
For step S12, if it is determined that the operand is already stored on the local memory component, the storage address of the operand on the local memory component may be determined according to the storage address of the operand in the external storage space and the address correspondence relationship recorded in the data address information table. Specifically, the storage address on the local memory component that corresponds, in the address correspondence relationship, to the storage address of the operand in the external storage space is taken as the storage address of the operand on the local memory component. For example, as shown in Table 1, if the storage address of the operand in the external storage space is Out_addr1, the storage address of the operand on the local memory component may be determined to be In_addr1 according to the address correspondence in Table 1; or, if the storage address of the operand in the external storage space is a part of Out_addr1, the corresponding part of In_addr1 is determined to be the storage address of the operand on the local memory component. Specifically, if Out_addr1 covers addr11 to addr12 and the storage address of the operand in the external storage space is the segment addr13 to addr14 within addr11 to addr12, then the address in In_addr1 corresponding to the segment addr13 to addr14 is the storage address of the operand on the local memory component.
For step S13, the instruction for obtaining the operand may be the data load instruction. After the storage address of the operand on the local memory component is determined in step S12, that storage address may be bound to the data load instruction corresponding to the operand, so that the processor can directly execute the data load instruction to obtain the operand from the local memory component, thereby omitting the process of loading the operand from the external storage space to the local memory component and saving bandwidth resources.
Fig. 3 shows a flow diagram of a method of operand fetch according to an embodiment of the present disclosure. As shown in fig. 3, the method may further include:
In step S14, if the operand is not stored on the local memory component, a control signal for loading the operand is generated according to the storage address of the operand, and the control signal is used to load the operand from its storage address to the local memory component.
If the operand is not stored on the local memory component, the operand may be loaded from the external storage space onto the local memory component following the normal procedure. Specifically, memory space may be allocated for the operand on the local memory component and the address of the allocated memory space determined; a control signal for loading the operand is then generated according to the storage address of the operand bound by the data load instruction and the address of the allocated memory space, the control signal is sent to the DMA, and the DMA loads the operand from its storage address onto the local memory component according to the control signal.
In one possible implementation, as illustrated in fig. 3, the method may further include:
In step S15, when the operand is loaded from the external storage space to the local memory component, the data address information table is updated according to the storage address of the loaded operand in the external storage space and its storage address on the local memory component.
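Putting steps S11 to S15 together, a minimal sketch of the fetch flow might look as follows (alloc_local and dma_load are hypothetical stand-ins for local memory allocation and the DMA control signal; they are not interfaces defined by this disclosure):

```python
_local_free = 0

def alloc_local(size: int) -> int:
    """Toy bump allocator standing in for allocating space on the local memory component."""
    global _local_free
    addr = _local_free
    _local_free += size
    return addr

def dma_load(src: int, dst: int, size: int) -> None:
    """Stand-in for the control signal that makes the DMA copy the operand locally."""
    pass

def fetch_operand(table: list[AddressEntry], ext_start: int, size: int) -> int:
    """Return the local address to bind to the instruction that obtains the operand."""
    entry = lookup(table, ext_start, size)                     # S11: look up the table
    if entry is not None:                                      # S12: hit, reuse the local copy
        return entry.local_start + (ext_start - entry.ext_start)
    local_start = alloc_local(size)                            # S14: miss, allocate local space
    dma_load(src=ext_start, dst=local_start, size=size)        #       and load via DMA
    table.append(AddressEntry(ext_start, local_start, size))   # S15: record the correspondence
    return local_start                                         # S13: bind this local address
```

The hit branch assumes that the local copy keeps the same contiguous layout as the external data, so the operand's offset inside the matched range carries over unchanged.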
In a possible implementation manner, the loaded operand overwrites an operand originally stored on the local memory component, and the address correspondence relationship of the originally stored operand in the data address information table may be replaced by the correspondence between the storage address of the loaded operand in the external storage space and its storage address on the local memory component. The specific process may also be that it is determined whether the storage address of the recorded operand in the external storage space overlaps with the storage address in the external storage space in the address correspondence relationship; if so, the originally recorded address correspondence relationship may be invalidated and the address correspondence relationship of the newly loaded operand recorded, that is, the correspondence between the storage address of the loaded operand in the external storage space and its storage address on the local memory component is recorded.
For example, as shown in Table 1, suppose the processor allocates the memory space In_addr1 to the above operand, and loading the operand overwrites the data originally stored in In_addr1; at this time, the address correspondence between Out_addr1 and In_addr1 in the data address information table may be invalidated and replaced with the address correspondence between Out_addr3 and In_addr1. It should be noted that this is only an example of the present disclosure and does not limit it in any way; for example, if In_addr1 denotes a segment of memory space and the processor allocates only a part of it, In_addr3, to the above operand, then the address correspondence between Out_addr3 and In_addr3 may be used to replace the original address correspondence between Out_addr1 and In_addr1.
In one possible implementation manner, the original address correspondence relationship in the data address information table is replaced by the correspondence between the storage address of the loaded operand in the external storage space and its storage address on the local memory component. In this embodiment, the data address information table records only the address correspondence relationships of the most recently loaded operands. Therefore, when an operand is loaded from the external storage space to the local memory component, the original address correspondence relationship in the data address information table is directly replaced by the correspondence between the storage address of the loaded operand in the external storage space and its storage address on the local memory component. The specific process may also include the invalidation process described above; that is, an aging time may be set and timing started after an address correspondence relationship is recorded, and when the aging time is reached the corresponding address correspondence relationship is set to be invalid, so that even if a new operand to be loaded is recorded in the data address information table as already stored on the local memory component, the returned result is still that it is not stored on the local memory component, because the address correspondence relationship has expired.
The aging time can be set according to the requirement balance of bandwidth and efficiency, and the aging time is not particularly limited in the present disclosure. In one possible implementation, the aging time may be set to be greater than or equal to two pipeline cycles, and one pipeline cycle may refer to the time required for the pipeline of the compute node to propagate one stage forward.
That is, for step S11, when the address correspondence relationship is valid and its storage address in the external storage space contains the storage address of the operand to be loaded in the external storage space, the result that the operand is already stored on the local memory component is returned; when either of the two conditions is not satisfied, that result is not returned. For example, if the address correspondence relationship is invalid, the result that the operand is already stored on the local memory component is not returned; likewise, if the address correspondence relationship is valid but its storage address in the external storage space does not contain the storage address of the operand to be loaded in the external storage space, that result is not returned either.
In a possible implementation manner, an invalid flag of the address correspondence may be further recorded in the data address information table, where the invalid flag may indicate whether the address correspondence is valid, for example, an invalid flag of 1 indicates valid, and an invalid flag of 0 indicates invalid. Correspondingly, after recording an address corresponding relation, the corresponding invalid flag bit may be set to 1, and when the aging time is reached, the invalid flag may be set to 0.
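A minimal sketch of how an entry's validity could combine the invalid flag and the aging time (the timestamping, the numeric threshold standing in for 'at least two pipeline cycles', and the use of the local address range for the overwrite test are modelling assumptions, following the In_addr1 example above):

```python
import time
from dataclasses import dataclass, field

AGING_TIME = 2.0  # illustrative stand-in for an aging time of at least two pipeline cycles

@dataclass
class TimedEntry(AddressEntry):
    valid: bool = True                                   # invalid flag bit (1 = valid)
    recorded_at: float = field(default_factory=time.monotonic)

    def is_hit(self, ext_start: int, size: int, now: float) -> bool:
        fresh = (now - self.recorded_at) < AGING_TIME    # expired entries behave as misses
        contains = (self.ext_start <= ext_start and
                    ext_start + size <= self.ext_start + self.size)
        return self.valid and fresh and contains

def record_load(table: list[TimedEntry], new: TimedEntry) -> None:
    """Invalidate any entry whose local range the new operand overwrites, then record it."""
    for e in table:
        if not (new.local_start + new.size <= e.local_start or
                e.local_start + e.size <= new.local_start):
            e.valid = False                              # the old correspondence is stale
    table.append(new)
```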
According to the operand obtaining method of the above embodiment of the present disclosure, when the operand is already stored in the local memory component, the processor may directly execute the data loading instruction, and obtain the operand from the local memory component, thereby omitting a process of loading the operand from an external storage space to the local memory component, and saving bandwidth resources.
In one possible implementation, the method of the present disclosure may be applied to an arithmetic device, which may include: each of the plurality of layers of operation nodes comprises a local memory component, a processor and a next layer of operation node, and the external storage space may be a memory component of an operation node on a previous layer of operation node or a memory component of a next layer of operation node.
Fig. 4 shows a block diagram of an arithmetic device according to an embodiment of the present disclosure. As shown in fig. 4, the first layer of the computing device may be a computing node, and the computing node may include a processor, a memory component, and a next (second) layer of computing nodes, and the number of the second layer of computing nodes may be plural, and the disclosure is not limited thereto. As shown in fig. 4, each operation node in the second layer may also include: a processor, a memory component, and a next level (third level) compute node. Similarly, each operation node at the ith layer may include: the device comprises a processor, a memory component and an i +1 th layer operation node, wherein i is a natural number.
The processor may be implemented in hardware, and may be, for example, a digital circuit, an analog circuit, or the like; physical implementations of hardware structures include, but are not limited to, transistors, memristors, and the like. The memory component may be a Random Access Memory (RAM), a Read Only Memory (ROM), a CACHE memory (CACHE), etc., and the specific form of the memory component of the present disclosure is not limited.
It should be noted that, although fig. 4 only shows the expansion structure of one of the second-tier operation nodes included in the first-tier operation node (the second tier shown in fig. 4), it is understood that fig. 4 is only a schematic diagram, and the expansion structures of other second-tier operation nodes also include a processor, a memory component, and a third-tier operation node, and fig. 4 does not show the expansion structures of other second-tier operation nodes for simplicity, and the same is true for the ith-tier operation node. The number of the (i + 1) th layer operation nodes included in different ith layer operation nodes may be the same or different, and the disclosure does not limit this.
In a possible implementation manner, the processor may be configured to decompose an input instruction of any one operation node to obtain a plurality of sub instructions, for example, may decompose the input instruction into a plurality of parallel sub instructions, and send the parallel sub instructions to an operation node on a next layer of the any one operation node; and the any one operation node loads the operand required by the execution of the parallel sub-instruction from the memory component of the previous layer operation node to the memory component of the any one operation node, so that the next layer operation node of the any one operation node executes the parallel sub-instruction in parallel according to the operand.
In one possible implementation, the processor may include a serial decomposer SD (Sequential Decomposer), a decoder DD (Demotion Decoder, where demotion may refer to passing from an upper-layer operation node down to a lower-layer operation node), and a parallel decomposer PD (Parallel Decomposer). The input end of the SD may be connected to the output end of the PD in the processor of the upper-layer operation node, the output end of the SD may be connected to the input end of the DD, the output end of the DD may be connected to the input end of the PD, and the output end of the PD may be connected to the input end of the lower-layer operation nodes. The processor processes the input instruction in three stages to obtain the parallel sub-instructions: a serial decomposition stage, a decoding (demotion) stage and a parallel decomposition stage, where the SD serially decomposes the input instruction to obtain serial sub-instructions, the DD decodes the serial sub-instructions, and the PD decomposes the decoded serial sub-instructions in parallel to obtain the parallel sub-instructions.
FIG. 5a illustrates a block diagram of an operational node according to an embodiment of the present disclosure. As shown in fig. 5a, the input terminal of SD may be connected to the output terminal of the upper layer operational node, and receive the input instruction from the upper layer operational node. In a possible implementation manner, the input end of the SD may be connected to an instruction queue IQ (instruction queue), that is, the processor may load an output instruction of an operation node in a previous layer as an input instruction of a current operation node to the instruction queue IQ, the current operation node may be an operation node to which the processor belongs, and the SD obtains the input instruction from the IQ. The hardware limitation may include a limitation of a memory capacity of a memory component, and the like, and the serial decomposition of the input instruction may include decomposition of an operand of the input instruction and decomposition of the input instruction.
By using the IQ as a buffer between the SD and the upper-layer operation node, a strict synchronous execution relationship between the SD and the upper-layer operation node can be avoided. The IQ can simplify the circuit design and improve execution efficiency, for example by allowing independent, asynchronous execution of the SD and the upper-layer operation node and reducing the time the SD spends waiting for the upper-layer operation node to send an input instruction.
In a possible implementation manner, the processor decomposing the input instruction of any one operation node to obtain a plurality of sub-instructions may include: the SD serially decomposes the input instruction according to the memory capacity required by the input instruction and the capacity of the memory component, to obtain serial sub-instructions.
The memory capacity required by the input instruction can be determined according to the memory capacity required by storing the operand of the input instruction, the memory capacity required by an intermediate result after the operand is processed by the storage operator, and the like, after the memory capacity required by the input instruction is determined, whether the capacity of the memory component of the operation node of the layer meets the memory capacity required by the input instruction can be judged, and if the capacity of the memory component of the operation node of the layer does not meet the memory capacity required by the input instruction, the input instruction can be serially decomposed according to the capacity of the memory component of the operation node of the layer to obtain the serial sub-instruction.
The serial decomposition of the input instruction may include decomposition of operands of the input instruction and decomposition of the input instruction. In order to utilize the resources of the operation node more effectively when performing the serial decomposition, the serial sub-instruction obtained by the serial decomposition will have a decomposition granularity as large as possible, and the decomposition granularity of the serial sub-instruction obtained by the serial decomposition is determined by the resources of the operation node, for example, the resources of the operation node may be the capacity of the memory component of the operation node. The decomposition granularity herein may refer to the dimension of the decomposed operand.
For ease of understanding, the process of serial decomposition will be explained below using a specific operation as an example. Taking a matrix multiplication operation to describe the function of the SD: assume that the input instruction is to multiply matrices X and Y. The SD may determine the memory capacity required by the input instruction according to the sizes of X and Y and compare it with the capacity of the memory component of the operation node of the current layer; if the required capacity is greater than the capacity of the memory component, the input instruction needs to be serially decomposed. The specific process may be to decompose an operand so as to divide the input instruction into a plurality of serial sub-instructions that can be executed serially; for example, matrix X or matrix Y, or both, may be decomposed. Taking the decomposition of matrix X as an example, the input instruction may be serially decomposed into a plurality of matrix-multiply serial sub-instructions and a summing serial sub-instruction; after the matrix-multiply serial sub-instructions have been executed serially, their results are summed by the summing serial sub-instruction to obtain the operation result of the input instruction. It should be noted that the above serial decomposition of matrix multiplication is only one example used to illustrate the function of the SD, and does not limit the disclosure in any way.
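As a non-limiting illustration of such a serial decomposition (here the shared inner dimension of X and Y is sliced, so the partial products are combined by the summing sub-instruction; the slice width max_cols is a hypothetical stand-in for the granularity the memory component allows):

```python
import numpy as np

def serial_matmul(X: np.ndarray, Y: np.ndarray, max_cols: int) -> np.ndarray:
    """Execute the matrix multiplication as serially issued sub-instructions: one
    matrix-multiply per slice of the inner dimension, then a summing sub-instruction."""
    result = np.zeros((X.shape[0], Y.shape[1]))
    for k in range(0, X.shape[1], max_cols):
        result += X[:, k:k + max_cols] @ Y[k:k + max_cols, :]   # matrix-multiply sub-instruction
    return result                                               # accumulated by the summing step

X, Y = np.random.rand(64, 96), np.random.rand(96, 32)
assert np.allclose(serial_matmul(X, Y, max_cols=24), X @ Y)
```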
In a possible implementation manner, the serial decomposer performs serial decomposition on the input instruction according to the capacity of the memory component of any one of the operation nodes and the memory capacity required by the input instruction to obtain a serial sub-instruction, and specifically may include: determining the decomposition priority of the dimension of the operand, sequentially selecting the dimension for decomposing the operand according to the order of the decomposition priority and determining the maximum decomposition granularity in a dichotomy mode until the memory capacity required by the decomposed operand is less than or equal to the capacity of the memory component of the operation node at the layer. This decomposition ensures that the input instructions are serially decomposed with as large a decomposition granularity as possible.
In one possible implementation manner, in order to improve the decomposition efficiency, for any selected dimension for decomposing the operand, before determining the maximum decomposition granularity in a dichotomy manner in the dimension direction, a size relationship between a memory capacity required by the operand after being decomposed into an atomic size in the dimension direction and a capacity of a memory element of the operation node at the current layer may be determined: if the memory capacity required by the operand after being decomposed into the atomic size in the dimension direction is less than the capacity of the memory component of the operation node at the layer, the operand can be split in the dimension direction in a dichotomy mode; if the memory capacity required by the operand after being decomposed into the atomic size in the dimension direction is larger than the capacity of the memory component of the operation node at the layer, repeating the above processes in the next dimension direction according to the decomposition priority; if the memory capacity required by the operand after being decomposed into the atomic size in the dimension direction is equal to the capacity of the memory component of the operation node at the current layer, the dimension of the decomposition can be directly determined, and the process of decomposing the operand is finished. Wherein decomposing to an atomic size may mean that the decomposition particle size is 1.
FIG. 6 shows a flowchart of a process of serial decomposition according to an embodiment of the present disclosure. As shown in fig. 6: (1) in step S50, the decomposition priority of the dimensions of the operand of the input instruction may be determined first. In one possible implementation, the decomposition priority may be determined according to the sizes of the dimensions of the operand: the larger the dimension, the higher its decomposition priority, so the largest dimension of the operand is decomposed first. For example, the operand X is an N-dimensional tensor whose dimensions are t1, t2, … ti, … tN respectively, where t1 < t2 < … ti … < tN, i denotes a dimension index, i is a positive integer and i ≤ N; when the decomposition priority of the dimensions of the operand X is determined, the tN dimension is the largest and has the highest decomposition priority, followed by tN-1 … ti … t2, t1. (2) The dimension for decomposing the operand is selected in order of decomposition priority, and i is initialized to N. In this case, in step S51, it may be determined that i = N > 0; in step S52, the decomposition granularity is set to 1 in the tN direction; in step S53, the relationship between the memory capacity required by the operand after being decomposed to 1 in the tN direction and the capacity of the memory component of the operation node of the current layer is determined. If the former is smaller than the latter, the operand is decomposed in the tN dimension direction in the dichotomy manner, and the specific process may be as follows: step S54, setting the minimum decomposition granularity min = 0 and the maximum decomposition granularity max = tN; step S55, setting the decomposition granularity in the tN direction to [(min + max)/2]; step S56, determining the relationship between the memory capacity required by the operand decomposed to [(min + max)/2] in the tN direction and the capacity of the memory component of the operation node of the current layer: if the memory capacity required by the operand decomposed to [(min + max)/2] is equal to the capacity of the memory component of the operation node of the current layer, the decomposition process ends, and the operand is decomposed in the tN direction according to the decomposition granularity [(min + max)/2]; if the memory capacity required by the operand decomposed to [(min + max)/2] is less than the capacity of the memory component of the operation node of the current layer, step S57 sets the minimum decomposition granularity min = [(min + max)/2]; if it is greater than the capacity of the memory component of the operation node of the current layer, step S58 sets the maximum decomposition granularity max = [(min + max)/2]; step S59, judging whether the difference between the maximum decomposition granularity and the minimum decomposition granularity is 1: if so, in step S60 the decomposition granularity in the tN direction is determined to be min; if not, the process returns to step S55 to continue setting the decomposition granularity in the tN direction to [(min + max)/2], and the processes of S55-S60 are repeated.
(3) Returning to the determination in step S53 above: if the memory capacity required by the operand after being decomposed to 1 in the tN direction is equal to the capacity of the memory component of the operation node of the current layer, the dimension of decomposition can be determined directly and the process of decomposing the operand ends; if the memory capacity required by the operand after being decomposed to 1 in the tN direction is greater than the capacity of the memory component of the operation node of the current layer, i is set to i-1 and the process returns to step S51; if it is determined that i = N-1 > 0, step S52 is executed, and the above process is repeated until the memory capacity required by the decomposed operand satisfies the capacity of the memory component of the operation node of the current layer.
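For ease of understanding, the dichotomy search described above may be sketched in code as follows. The sketch is only an illustration and does not limit the present disclosure in any way; the helper memory_required(shape), which returns the memory footprint of an operand decomposed to the given shape, is an assumption introduced here for illustration.

def choose_decomposition(shape, capacity, memory_required):
    # Pick a decomposition so that the decomposed operand fits within `capacity`.
    # Dimensions are tried from largest to smallest (highest decomposition priority first);
    # within a dimension, the largest granularity that still fits is found by binary search.
    granularity = list(shape)
    if memory_required(granularity) <= capacity:
        return granularity                          # already fits without decomposition
    order = sorted(range(len(shape)), key=lambda d: shape[d], reverse=True)
    for dim in order:
        trial = list(granularity)
        trial[dim] = 1                              # atomic size in this dimension
        need = memory_required(trial)
        if need == capacity:
            granularity[dim] = 1
            return granularity                      # fits exactly at atomic size
        if need > capacity:
            granularity[dim] = 1                    # keep atomic size here, try the next dimension
            continue
        lo, hi = 0, granularity[dim]                # minimum / maximum decomposition granularity
        while hi - lo > 1:
            mid = (lo + hi) // 2                    # midpoint of the current search interval
            trial[dim] = mid
            if memory_required(trial) <= capacity:
                lo = mid                            # still fits: try a larger granularity
            else:
                hi = mid                            # too large: shrink the interval
        granularity[dim] = lo
        return granularity
    raise ValueError("operand does not fit even at atomic size in every dimension")

For instance, for a two-dimensional operand whose footprint is proportional to the number of elements, this sketch keeps the smaller dimension intact and binary-searches the larger one for the largest block that fits.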
After the operand is decomposed, decomposing the input instruction according to the decomposed operand may specifically include: decomposing the input instruction into a plurality of serial sub-instructions, where the plurality of serial sub-instructions include serial sub-instructions responsible for operating on each decomposed subset of the operand; if an output dependence exists after the serial decomposition, the serial sub-instructions may further include a reduction instruction.
It should be noted that fig. 6 is only one example of a process for decomposing operands, and does not limit the disclosure in any way. It is understood that the decomposition granularity may also be determined in other manners, for example, the decomposition priority may be selected in other manners, and the manner of dimension decomposition is not limited to dichotomy, as long as the largest possible decomposition granularity can be selected.
In one possible implementation, the memory component may include a static memory segment and a circulating memory segment. Fig. 7 illustrates an example of the partitioning of a memory component according to an embodiment of the present disclosure. As shown in fig. 7, the memory space of the memory component can be divided into a static memory segment and a circulating memory segment.
For some operations in machine learning, a part of the operands is shared among the parts of the operation obtained by decomposition; this part of the operands is referred to as a common operand in the present disclosure. Taking the matrix multiplication operation as an example, assuming that the input instruction is to multiply matrices X and Y, if only matrix X is decomposed, then the serial sub-instructions obtained by serially decomposing the input instruction all need to use operand Y, and operand Y is therefore a common operand.
As described above, the input instruction may be an instruction describing a machine learning operation, which is composed of the above computation primitives, and may include operands, operators, and the like. That is, for an input instruction of any one operation node, the processor decomposes the input instruction into a plurality of sub-instructions, and the plurality of sub-instructions may share a part of the operands; this shared part of the operands is the common operand.
In one possible implementation, whether common operands exist after an operation or instruction is decomposed may be determined according to the operation type and the decomposed dimension, where the operation type may refer to the specific operation, such as matrix multiplication, and the decomposed dimension may refer to the dimension in which the operand (tensor) of the input instruction is decomposed. For example, assuming that the operand is represented in the NHWC (batch, height, width, channels) format and the dimension of decomposition determined according to the process shown in fig. 6 is the C dimension, the decomposed dimension of the operand is the C dimension.
If a common operand exists among the sub-instructions, the processor allocates memory space for the common operand in the static memory segment and allocates memory space for the other operands of the sub-instructions in the circulating memory segment, wherein the common operand is an operand used in common when the next-layer operation nodes of any one operation node execute the plurality of sub-instructions, and the other operands are the operands of the plurality of sub-instructions other than the common operand.
For the common operand, in order to avoid frequent reading and writing, the static memory segment in the memory component is dedicated to storing the common operand; for the common operand of a plurality of sub-instructions, before the plurality of sub-instructions are executed, the operation of loading the common operand from the memory component of the upper-layer operation node of any one operation node into the static memory segment only needs to be executed once, so that frequent data access can be avoided and bandwidth resources can be saved.
The other operands may be the decomposed operands of the input instruction, intermediate results obtained by executing the sub-instructions, reduction results, and the like, where a reduction result may be obtained by performing operation reduction on the intermediate results, and the operation reduction may be the reduction processing mentioned above.
In a possible implementation manner, the decomposing of the input instruction of any one operation node by the processor to obtain a plurality of sub-instructions may include: and the SD carries out serial decomposition on the input instruction according to the memory capacity required by the input instruction, the capacity of the static memory segment and the capacity of the circulating memory segment to obtain a serial sub-instruction.
In one example, for an input instruction without a common operand after decomposition, the input instruction may be serially decomposed into serial sub-instructions according to the memory capacity required by the input instruction and the capacity of the circulating memory segment.
In one example, for an input instruction having a common operand after decomposition, the input instruction may be serially decomposed to obtain a serial sub-instruction according to a size relationship between a memory capacity required by the common operand and a remaining capacity of the static memory segment and a size relationship between a memory capacity required by the other operands and a capacity of the loop memory segment.
For the input instruction with the common operand after decomposition, if the memory capacity required by the common operand is larger than the remaining capacity of the static memory segment, or the memory capacity required by other operands is larger than the capacity of the circulating memory segment, the input instruction needs to be subjected to serial decomposition.
For a common operand: the SD may calculate the remaining memory capacity of the static memory segment, and perform a first serial decomposition on the input instruction according to the remaining memory capacity of the static memory segment and the memory capacity required by the common operand to obtain a first serial sub-instruction. Specifically, the decomposition priority of the dimension of the common operand may be determined, the dimension for decomposing the common operand is sequentially selected according to the order of the decomposition priority, and the maximum decomposition granularity is determined in a dichotomy manner until the memory capacity required by the decomposed common operand is less than or equal to the remaining memory capacity of the static memory segment of the operation node in the current layer. For a specific process, reference may be made to the description of fig. 6, which is not described in detail. The input instruction may then be decomposed according to the manner in which the common operands are decomposed.
For the other operands: the SD may perform second serial decomposition on the first serial sub-instruction according to the memory capacity of the loop memory segment and the memory capacity required by the other operands to obtain the serial sub-instruction. Similarly, the decomposition priority of the dimensionality of other operands can be determined, the dimensionality for decomposing other operands is sequentially selected according to the decomposition priority, and the maximum decomposition granularity is determined in a dichotomy mode until the memory capacity required by the decomposed other operands is smaller than or equal to the residual memory capacity of the circulating memory segment of the operation node at the current layer. For a specific process, reference may be made to the description of fig. 6, which is not described in detail. The input instruction may then be decomposed according to the manner in which the other operands are decomposed.
For example, assume that the input instruction is to multiply matrices X and Y, operand Y is the common operand, and the other operands include operand X. According to the embodiment of the present disclosure, the memory capacity required for storing the operand Y may be compared with the capacity of the static memory segment: if the memory capacity required for storing the operand Y is smaller than the capacity of the static memory segment, the operand Y may not be decomposed; if it is larger than the capacity of the static memory segment, the decomposition manner of the operand Y may be determined according to the process shown in fig. 6, and the input instruction may then be serially decomposed according to the decomposition of operand Y. The memory capacity required for storing the operand X, the intermediate results, and the reduction results may also be determined, where the memory capacity required for storing the intermediate results and the reduction results may be determined by combining the decomposed operand X and operand Y: if the memory capacity required for storing the other operands is smaller than the capacity of the circulating memory segment, the operand X may not be decomposed; if it is larger than the capacity of the circulating memory segment, the decomposition manner of the operand X may be determined according to the process shown in fig. 6, except that what is compared each time is the memory capacity required for storing all the other operands against the capacity of the circulating memory segment, not only that of operand X.
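The sizing checks in this example may be sketched as follows. This sketch assumes, purely for illustration, that Y is kept whole whenever it fits in the free space of the static memory segment, that X is split along its rows, and that one intermediate-result buffer and one reduction-result buffer are counted among the other operands; it is an example only and does not limit the present disclosure.

def plan_matmul_serial_decomposition(x_rows, x_cols, y_cols, elem_bytes,
                                     static_free, circ_capacity):
    # Common operand Y is checked against the free space of the static memory segment.
    y_bytes = x_cols * y_cols * elem_bytes
    if y_bytes > static_free:
        raise MemoryError("Y itself must be decomposed first (see the sketch after fig. 6)")
    # The other operands (the X piece, an intermediate result and a reduction result)
    # are checked against the capacity of the circulating memory segment.
    def other_bytes(rows):
        x_piece = rows * x_cols * elem_bytes
        result_piece = rows * y_cols * elem_bytes
        return x_piece + 2 * result_piece
    if other_bytes(x_rows) <= circ_capacity:
        return x_rows                      # X needs no serial decomposition
    lo, hi = 0, x_rows                     # binary search for the largest row block that fits
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if other_bytes(mid) <= circ_capacity:
            lo = mid
        else:
            hi = mid
    if lo == 0:
        raise MemoryError("even a single row of X does not fit in the circulating memory segment")
    return lo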
After determining the decomposition manner of the operands, the SD may allocate memory space for the common operand in the static memory segment. A serial sub-instruction obtained by serially decomposing an input instruction includes a head instruction and a body instruction, where the head instruction is used to load the common operand and records the address of the memory space allocated for the common operand, and the body instruction is used to load the other operands and to operate on the common operand and the other operands.
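The structure of the serial sub-instructions produced here can be pictured with the following sketch; the field names are illustrative assumptions only and do not limit the present disclosure.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class HeadInstruction:
    # Loads the common operand once; the SD binds to it the address of the memory
    # space allocated (or found via the tensor permutation table) in the static memory segment.
    common_operand_ext_addr: int
    common_operand_local_addr: Optional[int] = None

@dataclass
class BodyInstruction:
    # Loads one decomposed piece of the other operands and operates on it together
    # with the already-resident common operand.
    other_operand_ext_addrs: List[int] = field(default_factory=list)
    op: str = "matmul-piece"

@dataclass
class SerialSubInstructions:
    head: HeadInstruction
    bodies: List[BodyInstruction] = field(default_factory=list)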
In one possible implementation, a tensor permutation table (an example of the data address information table) may be provided in the arithmetic device, and the tensor permutation table may record the correspondence between the storage address, in the external storage space, of an operand stored in the static memory segment and its storage address in the static memory segment, where the external storage space may refer to the memory component of the upper-layer operation node of the current operation node.
Before the SD allocates memory space for the common operand in the static memory segment, it may first look up in the tensor permutation table whether the common operand is already stored in the static memory segment of the local memory component; if the common operand is already stored in the static memory segment of the local memory component, the storage address of the common operand on the local memory component is determined according to the storage address of the common operand in the external storage space (the storage address of the operand on the memory component of the upper-layer operation node) and the tensor permutation table, and the storage address of the common operand on the local memory component is assigned to the head instruction.
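The lookup-then-bind behaviour of the tensor permutation table can be illustrated with the following sketch; the entry fields and the address arithmetic are simplifying assumptions, not a concrete implementation of the present disclosure.

class TensorPermutationTable:
    # Maps an operand's address range in the external storage space (the memory
    # component of the upper-layer operation node) to an address in the local
    # static memory segment.
    def __init__(self):
        self.entries = []

    def record(self, ext_start, size, local_start):
        self.entries.append({"ext_start": ext_start, "ext_end": ext_start + size,
                             "local_start": local_start, "valid": True})

    def lookup(self, ext_start, size):
        # A hit is reported only when the requested range is fully covered by a
        # valid entry; the corresponding local address is then returned so that it
        # can be bound to the head instruction instead of issuing a load.
        for e in self.entries:
            if e["valid"] and e["ext_start"] <= ext_start and ext_start + size <= e["ext_end"]:
                return e["local_start"] + (ext_start - e["ext_start"])
        return None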
Fig. 8 is a schematic diagram illustrating a memory space allocation manner for the static memory segment according to an embodiment of the disclosure. As shown in fig. 8, the SD first allocates memory space for operand 1 of input instruction 1, and then allocates memory space for operand 2 of the second input instruction 2; since operand 1 is still in use, memory space is allocated for operand 2 at a position adjacent to operand 1. When the third input instruction 3 arrives, operand 1 may no longer be needed while operand 2 is still in use, so memory space may be allocated for operand 3 at the location where operand 1 was stored; however, the memory space required by operand 3 may be slightly smaller than the memory space storing operand 1, in which case a part of the memory space between the memory space storing operand 3 and the memory space storing operand 2 cannot be used. Alternatively, the memory space required to store operand 3 may be slightly larger than the memory space storing operand 1, in which case operand 3 may need to be allocated memory space to the right of operand 2 in fig. 8. This results in complex memory management and low memory space utilization.
In order to solve the above technical problem, the present disclosure further provides a first counter (which may be referred to as counter 1) in the processor, and the SD may allocate memory space for the common operands at different ends of the static memory segment according to the order of the head instructions generated by the serial decomposition and the count value of the counter 1.
In one possible implementation, the allocating, by the processor, a memory space for the common operand in the static memory segment may include: the processor allocates a memory space for the common operand starting from a first starting end in the static memory segment, wherein the first starting end is a starting end corresponding to the count value of the first counter. The counting value of the first counter is used for representing the storage position information on the static memory segment, and different counting values represent different ends of the static memory segment; for example, the count value of the counter 1 may include 0 and 1, where 0 may correspond to one end of the static memory segment and 1 may correspond to the other end of the static memory segment.
Fig. 9 is a schematic diagram illustrating a memory space allocation manner for the static memory segment according to an embodiment of the disclosure. The process by which the SD allocates memory space in the static memory segment for the common operands is described with reference to fig. 9. A sub-instruction queue SQ (Sub-level instruction Queue) may further be connected between the output end of the SD and the input end of the DD: the output end of the SD is connected to the input end of the SQ, and the output end of the SQ is connected to the input end of the DD. The SQ serves as a buffer between the SD and the DD, so that a strict synchronous execution relationship between the SD and the DD can be avoided. The SQ can simplify the circuit design while improving the execution efficiency, for example, allowing the SD to execute asynchronously on its own and reducing the time the DD waits for the SD to serially decompose input instructions.
The SD obtains an input instruction 1 from the SQ, performs serial decomposition on the input instruction 1 to obtain a plurality of serial sub-instructions 1, the plurality of serial sub-instructions 1 share an operand 1, and the SD allocates a memory space for the operand 1 from the static memory segment, and assuming that the count value of the counter 1 is 0 at this time, the SD may allocate a memory space for the operand 1 from the left end shown in fig. 9. The SD obtains an input instruction 2 from the SQ, performs serial decomposition on the input instruction 2 to obtain a plurality of serial sub-instructions 2, the plurality of serial sub-instructions 2 share the operand 2, and the SD allocates a memory space for the operand 2 from the static memory segment, assuming that the count value of the counter 1 is 1 at this time, the SD may allocate a memory space for the operand 2 from one end on the right side as shown in fig. 9. The SD obtains an input instruction 3 from the SQ, performs serial decomposition on the input instruction 3 to obtain a plurality of serial sub-instructions 3, the plurality of serial sub-instructions 3 share an operand 3, and the SD allocates a memory space for the operand 3 from the static memory segment, and assuming that the count value of the counter 1 is 0 at this time, the SD may allocate a memory space for the operand 3 from the left end shown in fig. 9.
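A minimal sketch of the counter-driven two-ended allocation described above is given below; addresses and sizes are in abstract units, and it is assumed, as in fig. 9, that the common operand previously placed at the selected end is no longer in use when that end is reused. The sketch is only illustrative and does not limit the present disclosure.

class StaticSegmentAllocator:
    def __init__(self, capacity):
        self.capacity = capacity
        self.counter1 = 0              # count value of counter 1: 0 = left end, 1 = right end
        self.used = {0: 0, 1: 0}       # bytes currently occupied at each end

    def allocate(self, size):
        other_end = self.used[self.counter1 ^ 1]
        if size + other_end > self.capacity:
            raise MemoryError("common operand does not fit in the static memory segment")
        addr = 0 if self.counter1 == 0 else self.capacity - size
        self.used[self.counter1] = size      # replaces the operand previously held at this end
        self.counter1 ^= 1                   # the next head instruction uses the other end
        return addr

# Following the order of the example above, with an illustrative 1024-unit segment:
alloc = StaticSegmentAllocator(1024)
print(alloc.allocate(300))   # operand 1 -> address 0 (left end)
print(alloc.allocate(200))   # operand 2 -> address 824 (right end)
print(alloc.allocate(100))   # operand 3 -> address 0 (left end again, reusing operand 1's space)

Compared with the allocation of fig. 8, alternating ends keeps at most one live operand at each end, which avoids the fragmentation between operands described above.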
For the above embodiment, a plurality of tensor permutation tables may be provided to record the correspondence between the storage addresses, in the external storage space, of the operands stored at different ends of the static memory segment and their storage addresses in the static memory segment. Thus, step S15 may include: when an operand is loaded from the external storage space into the static memory segment, determining the data address information table (tensor permutation table) to be updated according to the count value of the first counter; and updating the data address information table (tensor permutation table) to be updated according to the storage address of the loaded operand in the external storage space and its storage address in the static memory segment. The external storage space may be the memory component of the upper-layer operation node of the current operation node.
For example, the operation node may be provided with a tensor permutation table 1 and a tensor permutation table 2, where the tensor permutation table 1 is used to record the address correspondence relationships of the operands stored at the left end of the static memory segment, and the tensor permutation table 2 is used to record the address correspondence relationships of the operands stored at the right end of the static memory segment.
Following the above example, the SD obtains an input instruction 1 from the SQ and serially decomposes it to obtain a plurality of serial sub-instructions 1, which share the common operand 1. Before allocating memory space for the operand 1 in the static memory segment, the SD looks up in the tensor permutation table 1 and the tensor permutation table 2 whether the common operand 1 is already stored in the static memory segment. If it is not stored there, and assuming that the count value of the counter 1 is 0 at this time, the SD may allocate memory space for the operand 1 from the left end shown in fig. 9, and record, in the tensor permutation table 1, the correspondence between the storage address of the common operand 1 in the memory component of the upper-layer operation node and its storage address in the local memory component.
The SD obtains an input instruction 2 from the SQ and serially decomposes it to obtain a plurality of serial sub-instructions 2, which share the common operand 2. Before allocating memory space for the operand 2 in the static memory segment, the SD looks up in the tensor permutation table 1 and the tensor permutation table 2 whether the common operand 2 is already stored in the static memory segment. If it is not stored there, and assuming that the count value of the counter 1 is 1 at this time, the SD may allocate memory space for the operand 2 from the right end shown in fig. 9, and record, in the tensor permutation table 2, the correspondence between the storage address of the common operand 2 in the memory component of the upper-layer operation node and its storage address in the local memory component.
After an address correspondence relationship is recorded in the tensor permutation table, the SD may set a timer corresponding to that address correspondence relationship to start timing; when the timer reaches the aging time, the SD may set the address correspondence relationship corresponding to that timer to be invalid. Following the above example, timer 1 may be set for the address correspondence relationship of the common operand 1, and timer 2 for that of the common operand 2. Both address correspondence relationships are valid before timers 1 and 2 reach the aging time; after timer 1 reaches the aging time, the address correspondence relationship of the common operand 1 may be set invalid, and after timer 2 reaches the aging time, the address correspondence relationship of the common operand 2 may be set invalid.
The SD obtains an input instruction 3 from the SQ and serially decomposes it to obtain a plurality of serial sub-instructions 3, which share the common operand 3. Before allocating memory space for the operand 3 in the static memory segment, the SD looks up in the tensor permutation table 1 and the tensor permutation table 2 whether the common operand 3 is already stored in the static memory segment; if it is found that a part of the stored common operand 1 is the common operand 3, the storage address of the part of the common operand 1 corresponding to the common operand 3 is directly bound to the head instruction.
It should be noted that, if the address correspondence relationship of the common operand 1 is invalid, the lookup will not return the result that the common operand 3 is already stored in the static memory segment; this result is returned only when the timer 1 corresponding to the address correspondence relationship of the common operand 1 has not reached the aging time and the storage addresses in the external storage space in the address correspondence relationship of the common operand 1 include the storage address of the common operand 3 in the external storage space.
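The validity condition just described may be modeled as follows; the aging mechanism is sketched with a wall-clock timer purely for illustration and does not limit the present disclosure.

import time

class AgingAddressEntry:
    # One address correspondence relationship together with its aging timer.
    def __init__(self, ext_start, size, local_start, aging_time):
        self.ext_start, self.ext_end = ext_start, ext_start + size
        self.local_start = local_start
        self.expiry = time.monotonic() + aging_time

    def is_valid(self):
        return time.monotonic() < self.expiry

    def hit(self, ext_start, size):
        # The "already stored" result is returned only while the entry is still valid
        # and the requested operand's external address range is fully contained in
        # this entry's external address range.
        return (self.is_valid()
                and self.ext_start <= ext_start
                and ext_start + size <= self.ext_end)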
By the memory allocation mode of the embodiment, the memory management complexity can be reduced, the memory space utilization rate can be improved, and meanwhile, the bandwidth resource can be saved.
In one possible implementation manner, as shown in fig. 5a, a sub-instruction queue SQ (Sub-level instruction Queue) may further be connected between the output end of the SD and the input end of the DD in the present disclosure: the output end of the SD is connected to the input end of the SQ, and the output end of the SQ is connected to the input end of the DD. The SQ serves as a buffer between the SD and the DD, so that a strict synchronous execution relationship between the SD and the DD can be avoided. The SQ can simplify the circuit design while improving the execution efficiency, for example, allowing the SD to execute asynchronously on its own and reducing the time the DD waits for the SD to serially decompose input instructions.
As shown in fig. 5a, the operation node of the present disclosure is provided with a local processing unit LFU (Local Functional Unit), a first memory controller (DMAC), and a second memory controller (DMA). The first memory controller may be implemented by a hardware circuit or a software program, which is not limited in this disclosure.
The first memory controller and the second memory controller form the memory controller, and the first memory controller is connected with the second memory controller. A data path is connected between the memory component of any one operation node and the memory components of the upper-layer operation node and the next-layer operation nodes of that operation node; as shown in fig. 5a, the memory component i is connected to the memory component i-1, and for the connection of the memory component i to the next-layer operation nodes, reference may be made to the memory component i+1 connected to the next-layer operation node. The second memory controller may be connected to the data path, and the first memory controller may control the second memory controller in response to control signals sent by other components in the operation node, the second memory controller controlling the data path to transfer operands of the input instruction from one memory component to another memory component. For example, the first memory controller may, according to a control signal sent by the SD or DD, control the second memory controller to load an operand of the input instruction from the memory component of the upper-layer operation node to the local memory component, or to write the operation result of the input instruction from the local memory component back to the memory component of the upper-layer operation node.
In a possible implementation manner, the decoder sends a first control signal to the memory controller according to the head instruction, and the memory controller loads the common operand from the memory component of the upper-layer operation node into the static memory segment according to the first control signal; the decoder sends a second control signal to the memory controller according to the body instruction, and the memory controller loads the other operands from the memory component of the upper-layer operation node into the circulating memory segment according to the second control signal. The first memory controller may generate a load instruction according to the first control signal or the second control signal and send the load instruction to the second memory controller, and the second memory controller controls the data path according to the load instruction to implement the loading of the data.
The first memory controller is connected to the SD and the DD respectively, reads operands from the memory component of the upper-layer operation node according to a control signal sent by the SD or the DD, and writes the operands into the memory component of the current operation node. In addition to data reading and writing, the first memory controller is also responsible for data write-back between operation nodes of different layers, for example, writing the operation result of the (i+1)-th-layer operation node back to the i-th-layer operation node.
In a possible implementation, the memory component of each operation node is also connected to the local processing unit LFU in the same operation node. The output end of the decoder DD is further connected to a reduction control unit RC (Reduction Controller), and the RC is connected to the local processing unit LFU. The reduction control unit RC is configured to control the LFU to perform the operation reduction RD to obtain the operation result of the input instruction and write the operation result into the memory component, and the first memory controller may control the second memory controller to write the operation result in the memory component back to the memory component of the upper-layer operation node.
The SD may output the serial sub-instructions obtained by serial decomposition into the SQ, and the DD acquires the serial sub-instructions from the SQ. The DD mainly allocates memory space in the circulating memory segment according to the data storage requirement of the body instruction: the DD may allocate memory space on the memory component of the operation node of the current layer for the serial sub-instruction according to the storage requirement of the operands corresponding to the body instruction, and bind the address (local address) of the allocated memory space to the instruction, in the body instruction, for acquiring the operand, thereby implementing the decoding processing.
The DD may further send a control signal to the first memory controller according to the serial sub-instruction, and the first memory controller may, according to the control signal, control the second memory controller to load the operand corresponding to the serial sub-instruction into the memory space allocated for it; that is, the storage location of the operand corresponding to the serial sub-instruction is found in the memory component of the upper-layer operation node according to the address, recorded in the serial sub-instruction, of the operand corresponding to the input instruction, and the operand is read and then written into the memory component of the current-layer operation node according to the local address.
As shown in fig. 5a, the DD decodes the serial sub-instruction and sends the decoded serial sub-instruction to the PD, and the PD may perform parallel decomposition on the decoded serial sub-instruction according to the number of next-layer operation nodes connected to the PD, where parallel decomposition may mean that the decomposed parallel sub-instructions can be executed in parallel. For example, assuming that the serial sub-instruction is an addition of vectors A and B, where A = (A1, A2, … Aj, … An) and B = (B1, B2, … Bj, … Bn), n denotes the number of elements in vectors A and B, n is a positive integer, j denotes the sequence number of an element, j is a positive integer and j ≤ n, the PD may decompose the serial sub-instruction in parallel, according to the number of next-layer operation nodes, into a plurality of parallel sub-instructions, each responsible for the addition of part of the data in the vectors. For example, assuming that n is 4 and the PD is connected to 4 next-layer operation nodes, the PD may decompose the serial sub-instruction in parallel into 4 parallel sub-instructions, which are respectively responsible for the addition of A1 and B1, A2 and B2, A3 and B3, and A4 and B4, and the PD may send the 4 parallel sub-instructions to the next-layer operation nodes. It should be noted that the above example is only used to illustrate parallel decomposition and does not limit the present disclosure in any way.
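The parallel decomposition in this example can be illustrated by the following sketch, in which each parallel sub-instruction is represented simply as the pair of vector slices one next-layer operation node would add; this is only an illustration and does not limit the present disclosure.

def parallel_decompose_add(a, b, num_nodes):
    # Split the element-wise addition of vectors a and b into at most num_nodes
    # sub-operations that can be executed in parallel by the next-layer operation nodes.
    n = len(a)
    chunk = (n + num_nodes - 1) // num_nodes       # ceiling division
    return [(a[k:k + chunk], b[k:k + chunk]) for k in range(0, n, chunk)]

# With n = 4 and 4 next-layer operation nodes this yields the four pairs
# (A1, B1), (A2, B2), (A3, B3), (A4, B4) of the example above.
print(parallel_decompose_add([1, 2, 3, 4], [5, 6, 7, 8], 4))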
In a possible implementation manner, the processor in any one of the operation nodes controls the next layer of operation nodes to execute the operation corresponding to the serial sub-instruction of any one of the operation nodes in a pipeline manner in multiple stages. Fig. 5b illustrates an example of a pipeline according to an embodiment of the present disclosure.
As shown in fig. 5b, the plurality of stages may include: instruction decoding ID (instruction decode), data loading LD (load), operation execution EX (execution), operation reduction RD (reduction), and data write-back WB (write back), and the pipeline propagates in the order of instruction decoding ID, data loading LD, operation execution EX, operation reduction RD, and data write-back WB. In fig. 5b, FFU (Fractal Functional Units) denotes the next-layer operation nodes.
The DD is used to perform instruction decoding ID on the plurality of sub-instructions (serial sub-instructions). The decoder sends a first control signal to the memory controller according to the head instruction, so that the memory controller loads the common operand according to the first control signal. For the body instruction, the DD may allocate memory space in the circulating memory segment of the operation node of the current layer according to the storage requirement of the other operands corresponding to the body instruction, and bind the address (local address) of the allocated memory space to the instruction, in the body instruction, that acquires or stores the other operands, thereby implementing the decoding processing. The decoder may also send a second control signal to the memory controller according to the body instruction, so that the memory controller accesses the other operands according to the second control signal.
The DMA is used for the data loading LD, that is, loading the operands of the input instruction into the memory component, which specifically includes: the DMAC controls the DMA to load the common operand from the memory component of the upper-layer operation node into the static memory segment according to the first control signal corresponding to the head instruction, and to load the other data from the memory component of the upper-layer operation node into the circulating memory segment according to the second control signal corresponding to the body instruction. The other data loaded here is mainly the part of the other operands that belongs to the input operands, rather than intermediate results or reduction results.
The DD decodes the serial sub-instruction and sends the decoded serial sub-instruction to the PD, and the PD may perform parallel decomposition on the decoded serial sub-instruction according to the number of next-layer operation nodes connected to the PD, where the parallel decomposition may refer to that the parallel sub-instruction after decomposition may be executed in parallel.
The next-layer operation nodes may execute the operation execution EX among the plurality of stages in a pipelined manner to obtain execution results. The RC is used for controlling the LFU to perform operation reduction RD on the execution results to obtain the operation result of the input instruction, and the DMA is also used for the data write-back WB: writing the operation result back to the memory component of the upper-layer operation node of any one operation node.
FIG. 10 shows a schematic diagram of a pipeline according to an example of the present disclosure. Next, the process of executing the operation corresponding to an input instruction in stages in a pipelined manner will be described with reference to the arithmetic device shown in fig. 4 and fig. 10. As shown in fig. 4, taking the i-th-layer operation node as an example, the i-th-layer operation node receives an input instruction from the upper-layer (i-1-th-layer) operation node, performs instruction decoding ID on the input instruction to obtain a decoded instruction, loads the data required for operating on the input instruction, then sends the decoded instruction to the next-layer (i+1-th-layer) operation nodes, and the next-layer (i+1-th-layer) operation nodes execute the decoded instruction according to the loaded data to complete the operation execution EX stage. Because there may be a plurality of next-layer (i+1-th-layer) operation nodes, or because the capacity of the memory component of the operation node of the current layer may be smaller than the memory capacity required for storing the data needed by the input instruction, the processor may further decompose the decoded instruction, and some operations may also require reduction of the operation results of the decomposed instructions, that is, the operation reduction stage RD, to obtain the operation result of the input instruction; if the i-th-layer operation node is not the first-layer operation node, the processor of the i-th-layer operation node may also write the operation result of the input instruction back to the upper-layer (i-1-th-layer) operation node. It should be noted that the next-layer (i+1-th-layer) operation nodes also execute the operation execution EX among the plurality of stages in a pipelined manner, as shown in fig. 10; that is, after receiving an instruction (as an input instruction of the next-layer (i+1-th-layer) operation node) sent by the processor of the current-layer (i-th-layer) operation node, the next-layer (i+1-th-layer) operation node may decode the input instruction, load the data required by the input instruction from the memory component of its layer, and send the decoded instruction to its own next-layer (i+2-th-layer) operation nodes to execute the operation execution stage, and so on. In other words, the next-layer (i+1-th-layer) operation node also executes, in a pipelined manner in the order of instruction decoding ID, data loading LD, operation execution EX, operation reduction RD, and data write-back WB, the operation corresponding to the input instruction sent by the current-layer (i-th-layer) operation node.
The arithmetic device of the embodiment of the present disclosure constructs its hierarchical architecture in a multi-layer iteration mode; the structure of each operation node of the arithmetic device is the same, and operation nodes of different layers and computers of different scales have the same programming interface and instruction set architecture, execute the same program, and implicitly load data between layers. The hierarchical structure of the arithmetic device can execute the operation corresponding to the input instruction in an iterative pipelined manner, efficiently utilize the operation nodes of each hierarchy, and improve the operation efficiency.
The SD, DD, and PD are separated in the processor, so memory allocation can be well staggered in time. Specifically, the PD always allocates memory space after the DD, but the memory space it allocates is released earlier; likewise, the DD always allocates memory space after the SD, but the memory space it allocates is also released earlier. The memory space used by the serial decomposition of the SD may be used by multiple serial sub-instructions, so a static memory segment is provided for the SD, and the other parts share the memory in the memory component other than the static memory segment (that is, the circulating memory segment).
Among the above pipeline stages, the 4 stages other than the ID involve memory access, so at most 4 instructions need to access the memory at the same time. In the LD and WB stages, the memory is accessed by the DMA, and the order of LD and WB is controlled by the DMAC, so no conflict occurs between them when accessing the memory; that is, only 3 instructions need to access the circulating memory segment at the same time. Therefore, the circulating memory segment can be divided into multiple sub-memory blocks, for example, 3 sub-memory blocks. When the DD needs to allocate memory space for the operands of the serial sub-instructions, the memory space may be allocated for the operands of the serial sub-instructions sequentially in the 3 sub-memory blocks according to the input order of the serial sub-instructions.
In the memory management method according to this embodiment, a plurality of tensor permutation tables (examples of the data address information table) may be provided to record the operands stored in the different sub-memory blocks of the circulating memory segment. Before allocating memory space for an operand in the circulating memory segment, the DD may first look up, in the plurality of tensor permutation tables corresponding to the circulating memory segment, whether the operand is already stored in the circulating memory segment of the local memory component; if the operand is already stored in the circulating memory segment of the local memory component, the storage address of the operand on the local memory component is determined according to the tensor permutation table, and the storage address of the operand on the local memory component is assigned to the instruction for acquiring the operand; if the operand is not stored in the circulating memory segment of the local memory component, the data is loaded.
In this embodiment, an invalid flag bit of the address correspondence relationship may be recorded in the tensor permutation table; after an address correspondence relationship is recorded, a timer may be set to start timing, and when the timer reaches the aging time, the address correspondence relationship may be set to be invalid. Moreover, the result that the operand to be loaded is already stored in the circulating memory segment of the local memory component is returned only when the address correspondence relationship in the tensor permutation table is valid and the storage addresses in the external storage space in that address correspondence relationship contain the storage address of the operand to be loaded in the external storage space.
In this embodiment, step S15 may include: when an operand is loaded from the external storage space into any one of the plurality of sub-memory blocks of the circulating memory segment, the DD may update the data address information table (tensor permutation table) corresponding to that sub-memory block according to the storage address of the loaded operand in the external storage space and its storage address on the local memory component.
For example, a tensor permutation table is set for each sub-memory block. For the example including 3 sub-memory blocks (circulating memory segment 0, circulating memory segment 1, and circulating memory segment 2), the tensor permutation table 4, the tensor permutation table 5, and the tensor permutation table 6 may be set to correspond to circulating memory segment 0, circulating memory segment 1, and circulating memory segment 2 respectively. Thus, when an operand is loaded from the external storage space into circulating memory segment 0, the tensor permutation table 4 is updated according to the storage address of the loaded operand in the external storage space and its storage address on the local memory component.
In a possible implementation manner, a second counter is provided in the processor, the circulating memory segment includes multiple sub-memory blocks, and the allocating, by the processor, of memory space for the other operands of the plurality of sub-instructions in the circulating memory segment includes: allocating, by the processor, memory space for the other operands from the sub-memory block corresponding to the count value of the second counter in the circulating memory segment.
In a possible implementation manner, during the process of decoding the sub-instructions, the DD in the processor allocates memory space for the other operands from the sub-memory block corresponding to the count value of the second counter in the circulating memory segment.
FIG. 11 illustrates an example of the partitioning of a memory component according to an embodiment of the present disclosure. As shown in fig. 11, the circulating memory segment is divided into multiple sub-memory blocks, for example, 3 sub-memory blocks, and the memory capacities of the 3 sub-memory blocks may be the same or different, which is not limited in this disclosure. A counter 2 may be provided in the processor. After the DD acquires a serial sub-instruction from the SQ, the DD may allocate memory space in the circulating memory segment for the body instruction in the serial sub-instruction according to the body instruction and the count-value order of the counter 2. Before allocating the memory space, the DD may look up, in the plurality of tensor permutation tables corresponding to the circulating memory segment, whether the operand is already stored in the circulating memory segment of the local memory component, and if the operand is already stored in the circulating memory segment of the local memory component, assign the storage address of the operand on the local memory component to the instruction for acquiring the operand.
For example, when a body instruction 1 is obtained, it is looked up in the tensor permutation table 4, the tensor permutation table 5, and the tensor permutation table 6 whether the operand of the body instruction 1 is already stored in the circulating memory segment of the local memory component; if the operand is not stored in the circulating memory segment and the count value of the counter 2 is 0 at this time, the DD allocates memory space for the operand of the body instruction 1 in circulating memory segment 0. Then a body instruction 2 is obtained, and it is looked up in the tensor permutation table 4, the tensor permutation table 5, and the tensor permutation table 6 whether the operand of the body instruction 2 is already stored in the circulating memory segment of the local memory component; if the operand is not stored in the circulating memory segment and the count value of the counter 2 is 1 at this time, the DD allocates memory space for the operand of the body instruction 2 in circulating memory segment 1. Then a body instruction 3 is obtained, and it is looked up in the tensor permutation table 4, the tensor permutation table 5, and the tensor permutation table 6 whether the operand of the body instruction 3 is already stored in the circulating memory segment of the local memory component; if the operand is already stored in the circulating memory segment, the DD assigns the storage address of the operand on the local memory component to the instruction for acquiring the operand, so that the PD can directly obtain the operand from the circulating memory segment of the local memory component when executing the body instruction 3, without requiring the DMAC to load it from the upper-layer operation node into the circulating memory segment of the local memory component.
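The counter-2-driven allocation over the sub-memory blocks, together with the per-block tensor permutation table lookup, can be sketched as follows; the dictionary-based tables and the placement at each block's base address are simplifying assumptions for illustration only and do not limit the present disclosure.

class CircularSegmentAllocator:
    def __init__(self, block_bases):
        self.block_bases = block_bases                 # base addresses of the sub-memory blocks
        self.tables = [dict() for _ in block_bases]    # per-block tensor permutation table:
                                                       # external address -> local address
        self.counter2 = 0                              # count value of the second counter

    def bind_or_allocate(self, ext_addr):
        # Look the operand up in every sub-block's table first.
        for table in self.tables:
            if ext_addr in table:
                return table[ext_addr], False          # already resident: bind address, skip the load
        # Not resident: place it in the sub-block selected by counter 2 and record it.
        block = self.counter2
        local_addr = self.block_bases[block]           # simplistic placement at the block base
        self.tables[block][ext_addr] = local_addr
        self.counter2 = (self.counter2 + 1) % len(self.block_bases)
        return local_addr, True                        # caller must issue the data load LD

# Body instructions 1 and 2 miss and go to blocks 0 and 1; body instruction 3 hits.
alloc = CircularSegmentAllocator([0x0000, 0x1000, 0x2000])
print(alloc.bind_or_allocate(0xA000))   # (0, True)
print(alloc.bind_or_allocate(0xB000))   # (4096, True)
print(alloc.bind_or_allocate(0xA000))   # (0, False): reused, no load needed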
By the memory allocation mode of the embodiment, the memory management complexity can be reduced, the memory space utilization rate can be improved, and meanwhile, the bandwidth resource can be saved.
In one possible implementation, the operand acquisition method of the present disclosure supports data reuse in a "pipeline forwarding" fashion, where the next instruction may use the result of the previous instruction as input, so that no bubbles need to be inserted between the two instructions when they execute in the pipeline.
For example, consider the following two instructions:
ELTW A,B;
ELTW B,C;
and assume that neither of them requires operation reduction RD.
Without the tensor permutation table, B needs to be written back by the WB stage of the first instruction and then loaded by the LD stage of the second instruction. The pipeline is as follows:
ID LD EX RD WB;
__ __ __ __ ID LD EX RD WB;
After the tensor permutation table is added, the tensor permutation table records the address at which the output operand B of the first instruction is stored on the local memory component, and that output operand is ready once the EX stage finishes; accordingly, when the input operand address of the second instruction is replaced with the address on the local memory component, the LD stage becomes a bubble, and EX is scheduled directly as the starting stage of the second instruction in the cycle in which the data is ready. The pipeline is as follows:
ID LD EX RD WB;
__ ID LD EX RD WB;
execution of the pipeline becomes as independent, with data being passed directly from the EX of the first instruction to the EX of the second instruction. The technology is called 'pipeline forwarding' in a traditional static pipeline processor, and is realized by adding an extra data path, and the same effect is realized by a tensor permutation table in the scheme, so that the data path can be simplified compared with the traditional static pipeline, and the complexity of the processor structure is reduced.
In a possible implementation manner, the processor may further include a CMR (Commission Register). When the RC determines that the resources required for performing reduction processing on the operation results of the next-layer operation nodes exceed the resource upper limit of the local processing unit, the RC may write a commission instruction into the CMR according to the serial sub-instruction; the PD may periodically check whether a commission instruction exists in the CMR, and if so, control the next-layer operation nodes to perform reduction processing on their operation results according to the commission instruction to obtain the operation result of the input instruction. The periodic check may be a check performed once per processing cycle, and the processing cycle may be determined according to the time for the next-layer operation nodes to finish processing one serial sub-instruction, which is not limited in this disclosure. Providing the CMR can improve the processing efficiency of the whole operation node.
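The hand-off through the CMR can be pictured with the following minimal sketch; the names and return strings are illustrative assumptions only and do not limit the present disclosure.

class CommissionRegister:
    # Minimal model of the CMR between the RC and the PD.
    def __init__(self):
        self.pending = None

    def write(self, commission_instruction):
        self.pending = commission_instruction

    def take(self):
        instruction, self.pending = self.pending, None
        return instruction

def reduce_operation_result(serial_sub_instruction, required_resources, lfu_limit, cmr):
    # RC side: reduce locally when the LFU can handle it, otherwise write a
    # commission instruction into the CMR for the PD to find on its periodic check.
    if required_resources <= lfu_limit:
        return "LFU performs the operation reduction RD"
    cmr.write(serial_sub_instruction)
    return "reduction commissioned to the next-layer operation nodes via the CMR"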
It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.
It is further noted that, although the various steps in the flowcharts of figs. 2-3 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in figs. 2-3 may include multiple sub-steps or multiple stages that are not necessarily performed at the same time but may be performed at different times, and the order of performing these sub-steps or stages is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
It should be understood that the above-described apparatus embodiments are merely exemplary, and that the apparatus of the present disclosure may be implemented in other ways. For example, the division of the units/modules in the above embodiments is only one logical function division, and there may be another division manner in actual implementation. For example, multiple units, modules, or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented.
In addition, unless otherwise specified, each functional unit/module in the embodiments of the present disclosure may be integrated into one unit/module, each unit/module may exist alone physically, or two or more units/modules may be integrated together. The integrated units/modules may be implemented in the form of hardware or software program modules.
If the integrated unit/module is implemented in the form of hardware, the hardware may be a digital circuit, an analog circuit, or the like. Physical implementations of the hardware structure include, but are not limited to, transistors, memristors, and the like. Unless otherwise specified, the processor may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC. Unless otherwise specified, the memory component may be any suitable magnetic storage medium or magneto-optical storage medium, such as Resistive Random Access Memory (RRAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Enhanced Dynamic Random Access Memory (EDRAM), High-Bandwidth Memory (HBM), Hybrid Memory Cube (HMC), and so on.
The integrated units/modules, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned memory comprises: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments. The technical features of the embodiments may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the techniques in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A method for operand retrieval, the method comprising:
looking up, in a data address information table, whether an operand is stored on a local memory component;
if the operand is stored on the local memory component, determining a storage address of the operand on the local memory component according to a storage address of the operand in an external storage space and the data address information table; and
assigning the storage address of the operand on the local memory component to an instruction for acquiring the operand.
2. The method of claim 1, further comprising:
if the operand is not stored on the local memory component, generating a control signal for loading the operand according to the storage address of the operand, wherein the control signal for loading the operand is used for loading the operand from the storage address of the operand to the local memory component.
3. The method according to claim 1, wherein the data address information table records address correspondence relationships, and the address correspondence relationships include: the storage address of the operand on the local memory component and the storage address of the operand in the external storage space.
4. The method of claim 3, wherein looking up, in the data address information table, whether the operand is stored on the local memory component comprises:
when the address correspondence relationships include the storage addresses of all of the operands in the external storage space, determining that the operands are stored on the local memory component.
5. The method of claim 1, further comprising:
when the operand is loaded from the external storage space to the local memory component, updating the data address information table according to the storage address of the loaded operand in the external storage space and the storage address of the loaded operand on the local memory component.
6. The method of claim 5, wherein the local memory component comprises a static memory segment, and
wherein, when the operand is loaded from the external storage space to the local memory component, updating the data address information table according to the storage address of the loaded operand in the external storage space and the storage address on the local memory component comprises:
when the operand is loaded from the external storage space to the static memory segment, determining a data address information table to be updated according to a count value of a first counter, wherein the count value of the first counter represents storage position information on the static memory segment; and
updating the data address information table to be updated according to the storage address of the loaded operand in the external storage space and the storage address on the static memory segment.
7. The method of claim 5, wherein the local memory component further comprises a circular memory segment comprising a plurality of sub-memory blocks, and
wherein, when the operand is loaded from the external storage space to the local memory component, updating the data address information table according to the storage address of the loaded operand in the external storage space and the storage address on the local memory component comprises:
when the operand is loaded from the external storage space to any one of the plurality of sub-memory blocks on the circular memory segment, updating the data address information table corresponding to that sub-memory block according to the storage address of the loaded operand in the external storage space and the storage address on the local memory component.
8. An arithmetic device, comprising a plurality of levels of operation nodes, each operation node including a local memory component, a processor, and a next-level operation node, wherein:
when the processor loads an operand from the memory component of the upper-level operation node of the current operation node to the local memory component, the processor looks up, in a data address information table, whether the operand is stored on the local memory component; and
if the operand is stored on the local memory component, the processor determines the storage address of the operand on the local memory component according to the storage address of the operand in the external storage space and the data address information table, and assigns the storage address of the operand on the local memory component to the instruction for obtaining the operand.
9. An operand retrieval apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to carry out the method of any one of claims 1 to 7 when executing the instructions.
10. A non-transitory computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method of any of claims 1-7.
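The following is a minimal, illustrative Python sketch of the operand-retrieval flow described in claims 1 to 7, intended only as a reading aid and not as a definitive implementation of the disclosure. The names DataAddressTable, OperandFetcher, and the load_to_local callback are hypothetical and do not appear in the specification; the sketch models only the lookup, load, and table-update behavior and omits the static and circular memory segments of claims 6 and 7.

```python
# Illustrative sketch only (assumed names and data layout, not from the disclosure).

class DataAddressTable:
    """Records correspondences between an operand's address range in the
    external storage space and its storage address on the local memory
    component (cf. claim 3)."""

    def __init__(self):
        # key: (external start address, size) -> value: local start address
        self._entries = {}

    def lookup(self, ext_addr, size):
        """Return the local address if the whole requested range is resident
        (cf. claim 4), otherwise None."""
        for (start, length), local_start in self._entries.items():
            if start <= ext_addr and ext_addr + size <= start + length:
                return local_start + (ext_addr - start)
        return None

    def update(self, ext_addr, size, local_addr):
        """Record a new external-to-local correspondence after a load (cf. claim 5)."""
        self._entries[(ext_addr, size)] = local_addr


class OperandFetcher:
    """Resolves an operand's local address for an instruction (cf. claims 1 and 2)."""

    def __init__(self, table, load_to_local):
        self.table = table
        # load_to_local(ext_addr, size) copies the operand from the external
        # storage space to the local memory component and returns the local
        # address; it stands in for the loading control signal of claim 2.
        self.load_to_local = load_to_local

    def resolve(self, instruction, ext_addr, size):
        local_addr = self.table.lookup(ext_addr, size)
        if local_addr is None:
            # Operand not resident: load it and record the mapping.
            local_addr = self.load_to_local(ext_addr, size)
            self.table.update(ext_addr, size, local_addr)
        # Assign the local storage address to the instruction that uses the operand.
        instruction["operand_addr"] = local_addr
        return instruction
```

In such a sketch, load_to_local could be wired to a simple bump allocator over the local memory component; supporting the static memory segment of claim 6 (selected by a counter value) and the circular memory segment of claim 7 (one table per sub-memory block) would amount to replacing the single table with one table per segment or sub-memory block.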
CN201910545270.3A 2019-04-27 2019-06-21 Operation method, device and related product Pending CN111860798A (en)

Priority Applications (8)

Application Number Priority Date Filing Date Title
PCT/CN2020/083280 WO2020220935A1 (en) 2019-04-27 2020-04-03 Operation apparatus
EP21216615.1A EP4012556A3 (en) 2019-04-27 2020-04-26 Fractal calculating device and method, integrated circuit and board card
US17/606,838 US20220261637A1 (en) 2019-04-27 2020-04-26 Fractal calculating device and method, integrated circuit and board card
EP20799083.9A EP3964950A4 (en) 2019-04-27 2020-04-26 Fractal calculating device and method, integrated circuit and board card
PCT/CN2020/087043 WO2020221170A1 (en) 2019-04-27 2020-04-26 Fractal calculating device and method, integrated circuit and board card
EP21216623.5A EP3998528A1 (en) 2019-04-27 2020-04-26 Fractal calculating device and method, integrated circuit and board card
US17/560,490 US11841822B2 (en) 2019-04-27 2021-12-23 Fractal calculating device and method, integrated circuit and board card
US17/560,411 US20220188614A1 (en) 2019-04-27 2021-12-23 Fractal calculating device and method, integrated circuit and board card

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2019103470270 2019-04-27
CN201910347027 2019-04-27

Publications (1)

Publication Number Publication Date
CN111860798A true CN111860798A (en) 2020-10-30

Family

ID=72966068

Family Applications (3)

Application Number Title Priority Date Filing Date
CN201910545272.2A Withdrawn CN111860799A (en) 2019-04-27 2019-06-21 Arithmetic device
CN201910545270.3A Pending CN111860798A (en) 2019-04-27 2019-06-21 Operation method, device and related product
CN201910544723.0A Active CN111860797B (en) 2019-04-27 2019-06-21 Arithmetic device

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201910545272.2A Withdrawn CN111860799A (en) 2019-04-27 2019-06-21 Arithmetic device

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201910544723.0A Active CN111860797B (en) 2019-04-27 2019-06-21 Arithmetic device

Country Status (1)

Country Link
CN (3) CN111860799A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102141905A (en) * 2010-01-29 2011-08-03 上海芯豪微电子有限公司 Processor system structure
CN107329936A (en) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing neural network computing and matrix/vector computing
CN107992329A (en) * 2017-07-20 2018-05-04 上海寒武纪信息科技有限公司 A kind of computational methods and Related product
CN108363670A (en) * 2017-01-26 2018-08-03 华为技术有限公司 A kind of method, apparatus of data transmission, equipment and system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3343135A (en) * 1964-08-13 1967-09-19 Ibm Compiling circuitry for a highly-parallel computing system
CA1065492A (en) * 1974-02-28 1979-10-30 Alan L. Davis System and method for concurrent and pipeline processing employing a data driven network
CN105630733B (en) * 2015-12-24 2017-05-03 中国科学院计算技术研究所 Device for vector data returning processing unit in fractal tree, method utilizing the device, control device comprising the device and intelligent chip comprising the control device
US10762164B2 (en) * 2016-01-20 2020-09-01 Cambricon Technologies Corporation Limited Vector and matrix computing device
CN107861757B (en) * 2017-11-30 2020-08-25 上海寒武纪信息科技有限公司 Arithmetic device and related product
CN108197705A (en) * 2017-12-29 2018-06-22 国民技术股份有限公司 Convolutional neural networks hardware accelerator and convolutional calculation method and storage medium
US10929143B2 (en) * 2018-09-28 2021-02-23 Intel Corporation Method and apparatus for efficient matrix alignment in a systolic array

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102141905A (en) * 2010-01-29 2011-08-03 上海芯豪微电子有限公司 Processor system structure
US20130111137A1 (en) * 2010-01-29 2013-05-02 Shanghai Xin Hao Micro Electronics Co. Ltd. Processor-cache system and method
CN107329936A (en) * 2016-04-29 2017-11-07 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing neural network computing and matrix/vector computing
CN108363670A (en) * 2017-01-26 2018-08-03 华为技术有限公司 A kind of method, apparatus of data transmission, equipment and system
CN107992329A (en) * 2017-07-20 2018-05-04 上海寒武纪信息科技有限公司 A kind of computational methods and Related product

Also Published As

Publication number Publication date
CN111860797B (en) 2023-05-02
CN111860799A (en) 2020-10-30
CN111860797A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
US10540093B2 (en) Multidimensional contiguous memory allocation
US10942673B2 (en) Data processing using resistive memory arrays
CN114391135A (en) Method for performing in-memory processing operations on contiguously allocated data, and related memory device and system
JP2020500365A (en) Utilization of Sparsity of Input Data in Neural Network Computing Unit
US20240004655A1 (en) Computing Machine Using a Matrix Space And Matrix Pointer Registers For Matrix and Array Processing
US11669443B2 (en) Data layout optimization on processing in memory architecture for executing neural network model
WO2020220935A1 (en) Operation apparatus
Zheng et al. Spara: An energy-efficient ReRAM-based accelerator for sparse graph analytics applications
TW202018599A (en) Neural processing unit
JP2020518068A (en) Graph matching for optimized deep network processing
US20220391320A1 (en) Operation device of convolutional neural network, operation method of convolutional neural network and computer program stored in a recording medium to execute the method thereof
US20220114270A1 (en) Hardware offload circuitry
US20190377549A1 (en) Stochastic rounding of numerical values
Chen et al. fgSpMSpV: A fine-grained parallel SpMSpV framework on HPC platforms
US9570125B1 (en) Apparatuses and methods for shifting data during a masked write to a buffer
CN114282661A (en) Method for operating neural network model, readable medium and electronic device
WO2022047802A1 (en) Processing-in-memory device and data processing method thereof
CN111860798A (en) Operation method, device and related product
Qiu et al. Dcim-gcn: Digital computing-in-memory to efficiently accelerate graph convolutional networks
US20220343146A1 (en) Method and system for temporal graph neural network acceleration
CN114692854A (en) NPU for generating kernel of artificial neural network model and method thereof
CN111831333A (en) Instruction decomposition method and device for intelligent processor and electronic equipment
CN111831582A (en) Memory management device and method for intelligent processor and electronic equipment
US20240004954A1 (en) Computer-implemented accumulation method for sparse matrix multiplication applications
CN108809726B (en) Method and system for covering node by box

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20201030)