CN113543045B - Processing unit, correlation device, and tensor operation method - Google Patents

Processing unit, correlation device, and tensor operation method

Info

Publication number: CN113543045B (application CN202110588863.5A)
Authority: CN (China)
Prior art keywords: matrix, calculation, column, row, unit
Legal status: Active (granted)
Other versions: CN113543045A
Other languages: Chinese (zh)
Inventors: 范虎, 劳懋元, 阎承洋, 李玉东
Current assignee: Hangzhou C Sky Microsystems Co Ltd
Original assignee: Pingtouge Shanghai Semiconductor Co Ltd
Application filed by Pingtouge Shanghai Semiconductor Co Ltd; application granted; published as CN113543045A and CN113543045B

Classifications

    • H04W 4/06: Selective distribution of broadcast services, e.g. multimedia broadcast multicast service [MBMS]; services to user groups; one-way selective calling services
    • G06N 20/00: Machine learning
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • H04W 28/20: Central resource management; negotiating bandwidth
    • H04W 84/08: Trunked mobile radio systems
    • Y02D 30/70: Reducing energy consumption in wireless communication networks


Abstract

Provided are a processing unit, a correlation apparatus, and a tensor operation method. The processing unit includes: a plurality of calculation units, which form a calculation matrix with n rows and m columns, where n and m are non-zero natural numbers; and a calculation unit controller. When the external environment bandwidth in which the processing unit is located meets a predetermined bandwidth requirement, the calculation unit controller controls the calculation matrix to work in a multicast data input mode, in which data is broadcast column by column to all calculation units of the corresponding column and row by row to all calculation units of the corresponding row. When the external environment bandwidth does not meet the predetermined bandwidth requirement, the calculation unit controller controls the calculation matrix to work in a pulsating data input mode, in which each calculation unit receives data from the calculation unit in the same row of the previous column and from the calculation unit in the same column of the previous row, so as to support tensor operation. According to the embodiments of the present disclosure, the working mode of the calculation matrix is flexibly configured according to the external environment bandwidth of the processing unit, so that the external environment bandwidth and the computing power of the processing unit are adapted to each other and the computing energy efficiency of the processing unit is improved.

Description

Processing unit, correlation device, and tensor operation method
Technical Field
The present disclosure relates to the field of chips, and in particular, to a processing unit, a correlation apparatus, and a tensor operation method.
Background
Deep learning is widely used at present in fields such as face recognition, voice recognition and automatic driving. Because deep learning relies on a large number of repeated tensor operations such as convolution and matrix operations, traditional hardware executes the corresponding algorithms with low efficiency, and computing architectures dedicated to executing them have therefore emerged. The deep learning processing units in these architectures employ a calculation matrix composed of a plurality of calculation units. Each calculation unit in the calculation matrix performs operations on the elements involved in convolution and matrix operations, and the results are then accumulated to obtain the tensor operation result. The ways of transferring the elements of the tensors to be computed within the calculation matrix generally include a pulsating (systolic) data input mode and a multicast data input mode. In the pulsating data input mode, at each clock cycle a calculation unit in the calculation matrix receives data from the calculation unit in the same row of the previous column and from the calculation unit in the same column of the previous row; that is, at each clock cycle the data pulses forward by one calculation unit in the row direction and in the column direction, being passed on to the next calculation unit in the row and in the column. In the multicast data input mode, at each clock cycle data is broadcast column by column to all the calculation units of one column of the calculation matrix and row by row to all the calculation units of one row of the calculation matrix, i.e. data is multicast to all the calculation units of a row or a column.
The performance of the processing unit of a deep learning architecture is limited by two factors: one is the external environment bandwidth in which the processing unit is located, and the other is the computing power of the processing unit. The multicast data input mode places a high demand on bandwidth; the calculation units are easily limited by the bandwidth and spend long periods idle, waiting for data. The pulsating data input mode places a low demand on bandwidth, but its structure is fixed, the computation is not flexible enough, and the bandwidth cannot be fully utilized when the bandwidth is high. In actual use, the external environments faced by the processing units of deep learning architectures differ. When the external environment bandwidth is low, efficient calculation cannot be achieved with the multicast data input mode, and the calculation units spend much time idle and waiting. When the external environment bandwidth is high, the bandwidth and resources cannot be fully utilized with the pulsating data input mode. In both cases the external environment bandwidth and the computing power of the processing unit are not adapted to each other, and the computing energy efficiency of the processing unit is reduced.
Disclosure of Invention
In view of the above, an object of the present disclosure is to improve the adaptation between the external environment bandwidth of a processing unit and its computing power, so as to improve the computing energy efficiency of the processing unit.
In a first aspect, an embodiment of the present disclosure provides a processing unit, including:
a plurality of calculation units, which form a calculation matrix with n rows and m columns, wherein n and m are non-zero natural numbers;
and a computing unit controller, configured to control the computing matrix to work in a multicast data input mode when the external environment bandwidth in which the processing unit is located meets a predetermined bandwidth requirement, in which mode data is broadcast column by column to all computing units in the corresponding column and row by row to all computing units in the corresponding row, and to control the computing matrix to work in a pulsating data input mode when the external environment bandwidth does not meet the predetermined bandwidth requirement, in which mode each computing unit receives data from the computing unit in the same row of the previous column and from the computing unit in the same column of the previous row, so as to support tensor operation.
Optionally, the predetermined bandwidth requirements include:
the external environment bandwidth is greater than a predetermined environment bandwidth threshold, where the predetermined environment bandwidth threshold is a maximum total amount of data that needs to be input into the computation matrix in one clock cycle in the multicast data input mode.
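A minimal sketch of this selection rule is given below, assuming the external environment bandwidth is expressed as the number of data items deliverable per clock cycle; the function and constant names are illustrative only and are not taken from the patent:

    MULTICAST = "multicast"
    PULSATING = "pulsating"

    def select_input_mode(bandwidth_per_cycle: int, n: int, m: int) -> str:
        # The predetermined environment bandwidth threshold is n * m * 2, i.e. the
        # maximum total amount of data the multicast mode needs in one clock cycle
        # for an n-row, m-column calculation matrix.
        threshold = n * m * 2
        return MULTICAST if bandwidth_per_cycle > threshold else PULSATING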
Optionally, the processing unit is located in an acceleration unit, the acceleration unit is located in a computing device together with a scheduling unit, the acceleration unit performs tensor operations as scheduled by the scheduling unit, and the external environment bandwidth is obtained from an instruction received from the scheduling unit.
Optionally, the processing unit further includes: and the monitoring unit is used for monitoring the bandwidth of the external environment where the processing unit is located.
Optionally, the monitoring unit monitors the external environment bandwidth in real time, and the computing unit controller switches the computing matrix between the pulsating data input mode and the multicast data input mode according to the external environment bandwidth monitored in real time.
Optionally, the computing unit comprises the following (a minimal sketch of such a unit follows this list):
the first register is used for storing data transferred by the computing units in the same row in the previous column;
a second register for storing data transferred from the calculation units of a previous row of the same column;
a multiplier for multiplying the elements with the same received column sequence number and row sequence number;
an accumulator for accumulating the result multiplied by the multiplier into a previous accumulated result;
and the third register is used for storing the accumulation result.
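As a rough illustration only (the component names are assumptions, and the patent describes hardware registers rather than software), such a computing unit can be modelled as follows:

    class CalculationUnit:
        """One cell of the calculation matrix: two input registers, a multiplier,
        an accumulator, and a register holding the accumulation result."""

        def __init__(self) -> None:
            self.a_reg = 0.0   # first register: data passed from the same row, previous column
            self.b_reg = 0.0   # second register: data passed from the same column, previous row
            self.acc = 0.0     # third register: the running accumulation result

        def load(self, a_in: float, b_in: float) -> None:
            self.a_reg, self.b_reg = a_in, b_in

        def multiply_accumulate(self) -> None:
            # multiplier + accumulator: multiply the held pair and add it to the result
            self.acc += self.a_reg * self.b_reg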
Optionally, in the pulsating data input mode, a computational cell in the computational matrix receives data from the computational cell of the same row in the previous column via a first input line, and from the computational cell of the same column in the previous row via a second input line.
Optionally, in the multicast data input mode, the computing units in the same column in the computing matrix are connected to a first column bus in common, and the computing units in the same column receive data through the first column bus respectively; the computing units in the same row in the computing matrix are connected to a first row bus together, and the computing units in the same row receive data through the first row bus respectively.
In a second aspect, embodiments of the present disclosure provide an acceleration unit core comprising a processing unit according to any one of the above.
In a third aspect, an embodiment of the present disclosure provides an acceleration unit, including:
the acceleration unit core described above;
a command processor that assigns a tensor computation task to the acceleration unit core.
In a fourth aspect, an embodiment of the present disclosure provides an internet of things device, including:
the acceleration unit described above;
and the scheduling unit is used for distributing tensor calculation tasks to the accelerating unit.
In a fifth aspect, embodiments of the present disclosure provide a system on chip comprising a processing unit according to any one of the above.
In a sixth aspect, an embodiment of the present disclosure provides a tensor operation method, including:
acquiring the external environment bandwidth in which a plurality of computing units are located, wherein the computing units form a computing matrix with n rows and m columns, and n and m are non-zero natural numbers;
based on the external environment bandwidth, the calculation matrix is controlled to work in a multicast data input mode under the condition that the external environment bandwidth where the calculation matrix is located meets the requirement of preset bandwidth, data are broadcasted to all calculation units in a corresponding column in a column mode and are broadcasted to all calculation units in a corresponding row in a row mode, under the condition that the external environment bandwidth does not meet the requirement of the preset bandwidth, the calculation matrix is controlled to work in a pulsating data input mode, and the calculation units receive the data from the calculation units in the same row in the previous column and the calculation units in the previous row in the same column so as to support tensor operation.
In the embodiments of the present disclosure, the external environment bandwidth in which the processing unit is located is acquired, and the calculation unit controller flexibly controls the working mode of the calculation matrix according to that bandwidth. A predetermined bandwidth requirement is set. When the external environment bandwidth meets the predetermined bandwidth requirement, i.e. the capacity of the external environment of the processing unit to transmit data to the calculation matrix is higher than the predetermined bandwidth requirement, the total amount of data input into the calculation matrix from the external environment in one clock cycle can be guaranteed to reach the number of rows of the calculation matrix multiplied by the number of columns multiplied by 2. The calculation matrix is then made to work in the multicast data input mode, in which data is broadcast column by column to all calculation units of the corresponding column and row by row to all calculation units of the corresponding row, so that all calculation units in the calculation matrix perform calculations and no calculation unit is left in an idle waiting state; the computing capacity of the calculation matrix is fully exploited, the operation throughput of the calculation matrix is increased, and the computing capability of the processing unit is improved. When the external environment bandwidth does not meet the predetermined bandwidth requirement, i.e. the capacity of the external environment of the processing unit to transmit data to the calculation matrix is not higher than the predetermined bandwidth requirement, the total amount of data input into the calculation matrix from the external environment in one clock cycle cannot be guaranteed to reach the number of rows multiplied by the number of columns multiplied by 2, so that in the multicast data input mode some calculation units in the calculation matrix would be left in an idle waiting state. In this case the working mode of the calculation matrix is switched to the pulsating data input mode, in which each calculation unit receives data from the calculation unit in the same row of the previous column and from the calculation unit in the same column of the previous row, so that input data is multiplexed in a pulsating manner within the calculation matrix, the number of calculation units in an idle waiting state is reduced, and the computing energy efficiency of the processing unit is improved.
Drawings
The foregoing and other objects, features, and advantages of the disclosure will be apparent from the following description of embodiments of the disclosure, which refers to the accompanying drawings in which:
fig. 1 is a system architecture diagram of an internet of things (IoT) application scenario to which one embodiment of the present disclosure is applied;
FIG. 2 is an internal block diagram of a dispatch unit and acceleration unit according to one embodiment of the present disclosure;
FIG. 3 is an internal block diagram of an acceleration unit core according to one embodiment of the present disclosure;
FIG. 4 is an internal block diagram of a processing unit (tensor engine) according to one embodiment of the present disclosure;
FIG. 5 is an internal block diagram of a computation matrix according to one embodiment of the present disclosure;
FIG. 6 is an internal block diagram of a computing unit according to one embodiment of the present disclosure;
fig. 7 is a flowchart of a tensor operation method according to an embodiment of the present disclosure.
Detailed Description
The present disclosure is described below based on examples, but it is not limited to these examples. In the following detailed description, some specific details are set forth. It will be apparent to those skilled in the art that the present disclosure may be practiced without these specific details. Well-known methods, procedures and flows have not been described in detail so as not to obscure the present disclosure. The figures are not necessarily drawn to scale.
The following terms are used herein.
Deep learning model: deep Learning is a new research direction in the field of Machine Learning (ML), which is introduced to make Machine Learning closer to the original goal, Artificial Intelligence (AI). The internal rules and the expression levels of the sample data are deeply learned, and the information obtained in the learning process is greatly helpful for the interpretation of data such as characters, images and sounds. The final aim of the method is to enable the machine to have the analysis and learning capability like a human, and to recognize data such as characters, images and sounds. And the deep learning model is a deep learning model. The deep learning model has different formats according to different dependent model frameworks (frames), and can be divided into different types of formats, such as tenserflow, pytorch, mxnet, and the like.
A computing device: the device with computing or processing capability may be embodied in the form of a terminal, such as an internet of things device, a mobile terminal, a desktop computer, a laptop computer, etc., or may be embodied as a server or a cluster of servers. In the context of the internet of things of the present disclosure, the computing device is an internet of things terminal in the internet of things.
A scheduling unit: in addition to performing conventional processing in the computing device (processing not used for complex operations such as image processing and the various operations of deep learning models), it also assumes the function of scheduling the acceleration unit: it allocates to the acceleration unit the tasks that the acceleration unit needs to undertake, such as tensor calculation tasks. The scheduling unit may take various forms such as a central processing unit (CPU), an application specific integrated circuit (ASIC), or a field programmable gate array (FPGA).
An acceleration unit: a processing unit in a computing device designed to increase the data processing speed in certain special-purpose fields, to cope with the situation that a conventional processing unit is not efficient in those fields (for example, processing images or the various operations of a deep learning model). The acceleration unit, also known as an artificial intelligence (AI) processing unit, includes a central processing unit (CPU), a graphics processing unit (GPU), a general purpose graphics processing unit (GPGPU), a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), and special-purpose intelligent acceleration hardware (e.g., a neural network processor NPU or a tensor processor TPU).
A processing unit: a device located in the acceleration unit (such as the tensor engine of an acceleration unit core) with the capability of processing convolution, matrix multiplication and other related operations of a deep learning model. It may be embodied as a system on chip and may be inserted into or removed from a computing device.
Tensor operation: a tensor is the set of ordered numbers that satisfy a certain coordinate transformation relation when the coordinate system is changed; colloquially, it is a generalization of vectors and matrices. A scalar is treated as a 0th-order tensor, a vector as a 1st-order tensor, and a matrix as a 2nd-order tensor; when two spatial dimensions are not sufficient to represent the input quantity, tensors of order 3 and above are used. With tensors, an input quantity in a space of any dimension can be represented. A characteristic of deep learning models is that they can receive input quantities of any dimension space. Whatever its dimension, an input quantity can be expressed as an input tensor fed to the nodes of the first layer of the deep learning model. The nodes of the first layer have weight tensors of the same dimension space. Since the dimension space of the input tensor is the same as that of the weight tensor, operations between them, such as dot product and convolution, can be performed in that dimension space, and the generated output is still in the same dimension space. The output tensor of a node in one layer is fed as input to the nodes of the next layer, where tensor operations such as dot product and convolution are performed with that layer's weight tensors in the same dimension space, until the output tensor of the nodes of the last layer is obtained and used as the output tensor of the whole deep learning model.
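A small illustrative sketch of this layer-by-layer tensor propagation, with arbitrary shapes chosen only for the example and assuming NumPy is available:

    import numpy as np

    input_tensor = np.random.rand(1, 8)   # input tensor fed to the first layer
    w1 = np.random.rand(8, 16)            # weight tensor of the first layer
    w2 = np.random.rand(16, 4)            # weight tensor of the second (last) layer

    hidden = input_tensor @ w1            # tensor operation at the first layer
    output_tensor = hidden @ w2           # output tensor of the whole model
    print(output_tensor.shape)            # (1, 4)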
Application environment of the present disclosure
The embodiment of the disclosure provides a tensor operation scheme. The whole tensor operation scheme is relatively universal, and can be used for various hardware devices for executing various deep learning models, such as a data center, an AI (artificial intelligence) acceleration unit, a GPU (graphic processing unit), IOT (internet of things) devices for executing the deep learning models, embedded devices and the like. The tensor operation method is independent of the hardware where the processing unit executing the tensor operation method is finally deployed. For exemplary description, however, the following description mainly refers to the internet of things as an application scenario. Those skilled in the art will appreciate that the disclosed embodiments are also applicable to other application scenarios.
Whole framework of thing networking
Fig. 1 is a system architecture diagram of an internet of things (IoT) 100 to which an embodiment of the present disclosure is applied.
The cloud 110 may represent the internet, or may be a Local Area Network (LAN), or a Wide Area Network (WAN), such as a company's private network. IoT devices may include any number of different types of devices grouped in various combinations. For example, the traffic control group 206 may include IoT devices along streets in a city. These IoT devices may include traffic lights, traffic flow monitors, cameras, weather sensors, and the like. Each IoT device in the traffic control group 206 or other subgroup may communicate with the cloud 110 over a wireless link 208, such as an LPWA link or the like. Further, the wired or wireless subnetwork 212 can allow IoT devices to communicate with each other, such as over a local area network, wireless local area network, and so forth. The IoT device may use another device, such as the gateway 210, to communicate with the cloud 110.
Other groupings of IoT devices may include remote weather stations 214, local information terminals 216, alarm systems 218, automated teller machines 220, alarm panels 222, or mobile vehicles, such as emergency vehicles 224 or other vehicles 226, and the like. Each of these IoT devices may communicate with other IoT devices, with the server 140, or both.
As can be seen from fig. 1, a large number of IoT devices may communicate through the cloud 110. This may allow different IoT devices to autonomously request or provide information to other devices. For example, the traffic control group 206 may request a current weather forecast from a group of remote weather stations 214, which may provide the forecast without human intervention. Further, the emergency vehicle 224 may be alerted by the automated teller machine 220 that a theft is occurring. As the emergency vehicle 224 proceeds toward the automated teller machine 220, it may access the traffic control group 206 to request permission to reach the location, for example, by turning a light red to block cross traffic at the intersection for sufficient time to allow the emergency vehicle 224 to enter the intersection unimpeded.
Machine learning is often used in the IoT devices described above. For example, the automated teller machine 220 recognizes human faces using machine learning, and the traffic control group 206 uses machine learning to analyze traffic flow and determine control schemes. Because the environmental conditions are uncertain, the external environment bandwidth changes with the network conditions, weather conditions and other environmental conditions, so the computing power of the processing unit and the external environment bandwidth become unadapted to each other and the computing energy efficiency of the processor is reduced; the processing unit of the embodiments of the present disclosure therefore needs to be adopted.
Scheduling unit and acceleration unit
Fig. 2 is an internal structural diagram of a scheduling unit 420 and an acceleration unit 430 of an IoT device (computing device) according to an embodiment of the present disclosure. As shown in fig. 2, the IoT device includes a memory 410, a scheduling unit 420, and an acceleration unit 430. For convenience of description, only one scheduling unit 420 and one accelerating unit 430 are shown in fig. 2, but it should be understood that the embodiments of the present disclosure are not limited thereto. The IoT device of the present disclosure may include a scheduling unit cluster and an acceleration unit cluster connected to a memory 410 through a bus, the scheduling unit cluster including a plurality of scheduling units 420, and the acceleration unit cluster including a plurality of acceleration units 430. The acceleration unit 430 is a processing unit designed to increase the data processing speed in a special-purpose field. The acceleration unit, also known as an Artificial Intelligence (AI) processing unit, includes a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a General Purpose Graphics Processing Unit (GPGPU), a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), and special-purpose intelligent acceleration hardware (e.g., neural network processor NPU, tensor processor TPU). The embodiment of the disclosure can be applied to NPU scenes, but due to the adoption of a general compiling custom interface, a CPU, a GPU, a GPGPU, a TPU and the like can be used under hardware. The scheduling unit is a processing unit that schedules the acceleration units and allocates instruction sequences to be executed to each acceleration unit, and may take various forms such as a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), and the like. In some embodiments, the scheduling unit 420 allocates to each acceleration unit 430 a sequence of instructions to be executed for the tensor computation task to be performed.
In the traditional architecture design of a central processing unit, the control unit and the storage unit occupy a large part of the space in the architecture, and the space occupied by the computing unit is insufficient, so the traditional architecture is very effective at logic control but not efficient at large-scale parallel computing. Therefore, various special acceleration units have been developed to perform more efficient processing, with faster operation, for calculations of different functions and different fields. The acceleration unit proposed in the present disclosure is a processing unit dedicated to accelerating the operation processing speed of neural network models. It is a processing unit that adopts a data-driven parallel computing architecture and is used to process the large number of operations (such as convolution and pooling) of each neural network node. Because the data and intermediate results of these operations are closely related and frequently used throughout the calculation, and because the internal memory capacity of a core of a conventional central processing unit is small, the conventional central processing unit architecture needs to access off-core storage frequently and in large volume, which makes processing inefficient. By adopting an acceleration unit dedicated to accelerating the operation processing speed of neural network models, in which each core has an on-chip memory with a storage capacity suitable for neural network calculation, frequent access to memory outside the core is avoided, so the processing efficiency and the computing performance can be greatly improved.
The acceleration unit 430 is to accept the schedule of the scheduling unit 420. As shown in fig. 3, the memory 410 stores various deep learning models including nodes of the models, weight tensors of the nodes, and the like. These deep learning models are deployed by a scheduling unit 420 to an acceleration unit 430 in fig. 2 when needed. That is, the scheduling unit 420 may send addresses of parameters in the model (such as weight tensors of the nodes) in the memory 410 to the acceleration unit 430 in the form of instructions. When the acceleration unit 430 actually uses the deep learning model to perform the calculation, the parameters (such as the weight tensor) are directly addressed in the memory 410 according to the addresses of the parameters in the memory 410, and are temporarily stored in the on-chip memory thereof. When the acceleration unit 430 actually uses the deep learning model to perform calculation, the scheduling unit 420 further sends the input tensor of the model to the acceleration unit 430 in the form of an instruction, and temporarily stores the input tensor in the on-chip memory of the acceleration unit 430. The acceleration unit 430 can then perform inference calculations based on these input tensors and the parameters (e.g., weight tensors) in the model.
How the scheduling unit 420 schedules the acceleration unit 430 to operate will be described in detail below with reference to the internal structures of the scheduling unit 420 and the acceleration unit 430 shown in fig. 2.
As shown in fig. 2, scheduling unit 420 includes a plurality of processor cores 422 and a cache 221 shared by the plurality of processor cores 422. Each processor core 422 includes an instruction fetch unit 423, an instruction decode unit 424, an instruction issue unit 425, and an instruction execution unit 426.
Instruction fetch unit 423 is used to move instructions to be executed from memory 410 into an instruction register (which may be a register of register file 429 shown in fig. 2 that stores instructions) and to receive or compute a next instruction fetch address according to an instruction fetch algorithm, which may include, for example: the address is incremented or decremented according to the instruction length.
After fetching the instruction, dispatch unit 420 enters an instruction decode stage where instruction decode unit 424 decodes the fetched instruction according to a predetermined instruction format to obtain operand fetch information needed by the fetched instruction in preparation for operation by instruction execution unit 426. The operand fetch information points, for example, to an immediate, register, or other software/hardware capable of providing source operands.
The instruction issue unit 425 is located between the instruction decode unit 424 and the instruction execution unit 426 for scheduling and control of instructions to efficiently distribute individual instructions to different instruction execution units 426, enabling parallel operation of multiple instructions.
After instruction issue unit 425 issues the instruction to instruction execution unit 426, instruction execution unit 426 begins executing the instruction. But if the instruction execution unit 426 determines that the instruction should be executed by an acceleration unit, it is forwarded to the corresponding acceleration unit for execution. For example, if the instruction is a deep learning model inference (inference) instruction, the instruction execution unit 426 no longer executes the instruction, but rather sends the instruction over the bus to the acceleration unit 430 for execution by the acceleration unit 430.
Although the embodiment of the present disclosure is used in an NPU scenario, since a generic compiling custom interface is adopted, the acceleration unit 430 shown in fig. 2 is not limited to an NPU, and may also be a TPU. The TPU, i.e. tensor processor, is a processor dedicated to speeding up the computational power of deep neural networks. In addition, the acceleration unit 430 may also be a CPU, GPU, FPGA, ASIC, or the like.
The acceleration unit 430 includes a plurality of cores 436 (4 cores are shown in FIG. 2, but those skilled in the art will understand that the acceleration unit 430 may contain another number of cores 436), a command processor 437, a direct memory access mechanism 435, and a bus channel 431.
The bus channel 431 is the channel through which instructions pass between the bus and the acceleration unit 430.
The direct memory access (DMA) mechanism 435 is a function provided by some computer bus architectures that enables data to be written from an attached device directly into the memory of a computer motherboard. Compared with a mode in which all data transmission between devices has to pass through the processing unit, this greatly improves the efficiency of data access. Thanks to this mechanism, the cores of the acceleration unit 430 can directly access the memory 410 and read parameters of the deep learning model (such as the weight tensors of each node), which greatly improves data access efficiency.
The command processor 437 allocates instructions sent by the scheduling unit 420 to the acceleration unit 430 for execution by the cores 436. The instruction execution unit 426 sends the acceleration unit 430 the sequences of instructions that require execution by the acceleration unit 430. A sequence of instructions to be executed is buffered at the command processor 437 as it enters from the bus channel 431, and the command processor 437 selects the core 436 to which the instruction sequence is assigned for execution. In some embodiments, the sequence of instructions to be executed is the instruction sequence of a tensor computation task, and the command processor 437 assigns the tensor computation task to a core 436. In addition, the command processor 437 is also responsible for synchronizing operations between the cores 436.
Accelerating unit core
FIG. 3 is an internal block diagram of an acceleration unit core according to one embodiment of the present disclosure.
In one embodiment, as shown in FIG. 3, the acceleration unit core 436 includes a tensor engine 510, a pooling engine 520, a memory copy engine 530, a sequencer 550, an instruction buffer 540, an on-chip memory 560, and a constant buffer 570.
The instruction sequence assigned to the accelerator unit core 436 by the command processor 437 is first buffered in the instruction cache 540. The sequencer 550 then fetches instructions from the instruction buffer 540 in a first-in-first-out order, and assigns them to the tensor engine 510 or pooling engine 520 for execution based on the nature of the instructions. Tensor engine 510 is responsible for handling the convolution and matrix multiplication related operations in the deep learning model. The pooling engine 520 is responsible for handling pooling operations in the deep learning model. The memory copy engine 530 is a unit dedicated to handling data copies, where a data copy includes copying some data from the on-chip memory 560 to memory shared by the cores 436, or the on-chip memory 560 of other cores 436, due to a potential overflow of the on-chip memory 560. The sequencer 550 determines whether to assign an instruction to the tensor engine 510, the pooling engine 520, or the memory copy engine 530, depending on the operation property such as convolution, matrix multiplication, pooling, or data copy of the fetched instruction.
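A rough illustrative sketch of this dispatch decision (the engine names and the function itself are assumptions, not the patent's implementation):

    def dispatch(op_type: str) -> str:
        # map the operation property of a fetched instruction to the target engine
        if op_type in ("convolution", "matrix_multiplication"):
            return "tensor_engine"
        if op_type == "pooling":
            return "pooling_engine"
        if op_type == "data_copy":
            return "memory_copy_engine"
        raise ValueError(f"unknown operation property: {op_type}")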
The on-chip memory 560 is an in-core memory that stores the weight tensors of the deep learning model, as well as the input tensors and various intermediate results when the deep learning model is actually used. The constant buffer 570 is a buffer that stores constant parameters of the deep learning model other than the weight tensors (e.g., hyperparameters of the neural network model). As described above, in the process in which the scheduling unit 420 pre-configures the deep learning model in the acceleration unit 430, the scheduling unit 420 sends the addresses in the memory 410 of the parameters of the model to the acceleration unit 430 in the form of instructions. These parameters include the weight tensors of the nodes and other parameters (e.g., hyperparameters). For the weight tensors, when actually running the deep learning model the acceleration unit 430 fetches them from the corresponding locations of the memory 410 and places them in the on-chip memory 560. For the other parameters, the acceleration unit 430 fetches them from the corresponding locations of the memory 410 and places them in the constant buffer 570 during actual deep learning model operation. In addition, when an instruction that actually starts inference is assigned to a core 436 by the command processor 437 and executed, the input tensor in the instruction (the input of the deep learning model) is also stored in the on-chip memory 560. Moreover, after the tensor engine 510 and the pooling engine 520 perform convolution or pooling operations, the various intermediate results obtained are also stored in the on-chip memory 560.
Processing unit
Fig. 4 is an internal structural diagram of a processing unit (tensor engine 510) according to one embodiment of the present disclosure.
In one embodiment, as shown in FIG. 4, the processing unit includes a computational unit controller 610, a monitoring unit 620, and a computational matrix 630, wherein the monitoring unit 620 is optional. The calculation matrix 630 is composed of n × m calculation units 640 arranged in n rows and m columns, n and m being non-zero natural numbers.
When the monitoring unit 620 is not used, the instruction execution unit 426 may send the external environment bandwidth of the processing unit along with the instruction sequence to be executed. The external environment bandwidth is the ability of the external environment of the processing unit (e.g., the memory 410 and the on-chip memory 650) to transfer data to the computation matrix 630 of the processing unit, such as the total amount of data transferred from the external environment of the processing unit to the computation matrix 630 in one clock cycle. In general, a computing device operates under environmental conditions in which network conditions, weather conditions and the like change, and the external environment bandwidth changes as those conditions change. The instruction sequence to be executed enters from the bus channel 431, is buffered in the command processor 437, allocated to a core 436 by the command processor 437, buffered in the instruction buffer 540 of the core 436, and allocated by the sequencer 550 to the tensor engine 510, i.e., the processing unit. The processing unit extracts the external environment bandwidth from it.
When the monitoring unit 620 is employed, the monitoring unit 620 monitors the bandwidth of the external environment in which the processing unit is located. In one embodiment, the monitoring unit 620 may monitor network conditions, etc., and extrapolate the external environmental bandwidth in conjunction with predetermined rules. In another embodiment, the monitoring unit 620 may record execution of a sequence of instructions historically entered into the processing unit and determine an average outside environment bandwidth over a predetermined period of time from a current point in time as the current outside environment bandwidth based on the instruction sequence execution record. The detection may be real-time, thereby ensuring that the computational matrix data delivery manner determined by the computational unit controller 610 is more in line with objective practice, enabling the computational power and external bandwidth of the processing unit to be adapted in real time, and improving the computational energy efficiency of the processing unit.
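A rough sketch of the sliding-window averaging described above, with all names and the window size assumed only for illustration:

    from collections import deque

    class BandwidthMonitor:
        def __init__(self, window: int = 1024) -> None:
            # most recent per-cycle data-delivery records within the predetermined period
            self.samples = deque(maxlen=window)

        def record(self, items_delivered: int) -> None:
            self.samples.append(items_delivered)

        def current_bandwidth(self) -> float:
            # average external environment bandwidth over the recorded window
            return sum(self.samples) / len(self.samples) if self.samples else 0.0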
In some embodiments, the calculation unit controller 610 controls the working mode of the calculation matrix 630 according to the external environment bandwidth in which the processing unit is located. When the external environment bandwidth meets a predetermined bandwidth requirement, it controls the calculation matrix 630 to operate in the multicast data input mode, in which data is broadcast column by column to all the calculation units 640 of the corresponding column and row by row to all the calculation units 640 of the corresponding row. When the external environment bandwidth does not meet the predetermined bandwidth requirement, it controls the calculation matrix 630 to operate in the pulsating data input mode, in which each calculation unit 640 receives data from the calculation unit 640 of the same row in the previous column and from the calculation unit 640 of the same column in the previous row, so as to support the calculation matrix 630 in performing tensor operations on the input tensor and the tensor weights. In some cases, when the external environment bandwidth meets the predetermined bandwidth requirement, i.e., is greater than a predetermined environment bandwidth threshold, the calculation unit controller 610 controls the calculation matrix 630 to operate in the multicast data input mode. In other cases, when the external environment bandwidth does not meet the predetermined bandwidth requirement, i.e., is not greater than the predetermined environment bandwidth threshold, the calculation unit controller 610 controls the calculation matrix 630 to operate in the pulsating data input mode. The predetermined environment bandwidth threshold is, for example, a preset reference bandwidth, which may be the maximum amount of data that needs to be input into the calculation matrix 630 in one clock cycle in the multicast data input mode.
It should be appreciated that, in some embodiments, in the multicast data input mode, within one clock cycle data is broadcast column by column to all the calculation units 640 of each corresponding column of the calculation matrix 630 and row by row to all the calculation units 640 of each corresponding row of the calculation matrix 630. That is, the numbers of rows and columns of calculation units 640 in the calculation matrix 630 determine the maximum amount of data that needs to be input into the calculation matrix 630 in one clock cycle. In the multicast data input mode, the maximum total amount of data that needs to be input into the calculation matrix 630 in one clock cycle is n × m × 2. In some embodiments, in the pulsating data input mode, within one clock cycle the calculation units 640 in each row of the first column and the calculation units 640 in each column of the first row of the calculation matrix 630 receive data from the external environment, while the other calculation units 640 receive data from the calculation unit 640 of the same row in the previous column and from the calculation unit 640 of the same column in the previous row, so that the data input into the calculation matrix 630 is multiplexed in a pulsating manner within the calculation matrix 630. That is, in the pulsating data input mode, the maximum total amount of data that needs to be input into the calculation matrix 630 in one clock cycle is n + m. Since n × m × 2 is greater than n + m, for the n × m calculation matrix 630 the maximum total amount of data that needs to be input in one clock cycle in the multicast data input mode is greater than that in the pulsating data input mode. Accordingly, when the external environment bandwidth is greater than n × m × 2, in the multicast data input mode the total amount of data input from the external environment of the processing unit into the calculation matrix 630 in one clock cycle can be guaranteed to reach n × m × 2, so that all n × m calculation units 640 in the calculation matrix 630 perform calculations and none is in an idle waiting state. However, when the external environment bandwidth is equal to n × m × 2, since the external environment bandwidth varies with the environmental conditions of the processing unit's external environment, it is very likely that in the multicast data input mode the total amount of data input from the external environment into the calculation matrix 630 in one clock cycle is less than n × m × 2, leaving some calculation units 640 in an idle waiting state. When the external environment bandwidth is less than n × m × 2, in the multicast data input mode the total amount of data that can be input from the external environment into the calculation matrix 630 in one clock cycle is less than n × m × 2, and the calculation matrix 630 has calculation units 640 in an idle waiting state.
Thus, when the external environment bandwidth is not greater than n × m × 2, the operation mode of the calculation matrix 630 is switched from the multicast data input mode to the pulsating data input mode, and in the pulsating data input mode, although the total amount of data that can be input into the calculation matrix 630 from the external environment in one clock cycle is less than n × m × 2, the input data is transmitted from the calculation unit 640 in the same row in the previous column to the calculation unit 640 in the same row in the next column and from the calculation unit 640 in the previous row in the same column to the calculation unit 640 in the next row in the same column, that is, the input data is pulsatory multiplexed in the calculation matrix 630, and the number of the calculation units 640 in the idle waiting state can be reduced.
In the embodiment of the present disclosure, when the external environment bandwidth meets the predetermined bandwidth requirement, that is, is greater than the predetermined environment bandwidth threshold, the calculation matrix 630 is controlled to operate in the multicast data input mode, the input data is input into the calculation matrix 630 in rows and columns, the calculation units 640 all perform calculations, and the calculation matrix 630 does not have the calculation unit 640 in the idle waiting state, so that the calculation capacity of the calculation matrix 630 is fully exerted, the operation throughput of the calculation matrix 630 is improved, and the calculation capacity of the processing unit is improved. Under the condition that the external environment bandwidth does not meet the requirement of the preset bandwidth, namely the external environment bandwidth is not larger than the threshold value of the preset environment bandwidth, the working mode of the calculation matrix 630 is switched into a pulsating data input mode, input data are subjected to pulsating multiplexing in the calculation matrix 630, the number of the calculation units 640 in an idle waiting state is reduced, and the calculation energy efficiency of the processing units is improved.
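For a concrete, purely illustrative example of the two per-cycle data requirements compared above (the matrix size is chosen arbitrarily):

    n, m = 16, 16                      # size of the calculation matrix, for illustration only
    multicast_per_cycle = n * m * 2    # 512 data items must enter the matrix per clock cycle
    pulsating_per_cycle = n + m        # 32 data items must enter the matrix per clock cycle
    assert multicast_per_cycle > pulsating_per_cycle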
In some embodiments, the calculation matrix 630 is used to perform multiplication of a first matrix and a second matrix to obtain a product matrix. The first matrix is, for example, an input tensor, and the second matrix is, for example, a tensor weight. The first matrix, the second matrix and the product matrix are stored, for example, in the on-chip memory 650. Assuming that the first matrix is A, the second matrix is B, and the product matrix of the first matrix A and the second matrix B is C, they are respectively expressed as follows:
A = | A11  A12  A13  …  A1N |
    | A21  A22  A23  …  A2N |
    |  ⋮    ⋮    ⋮        ⋮  |
    | AN1  AN2  AN3  …  ANN |                                    (1)

B = | B11  B12  B13  …  B1N |
    | B21  B22  B23  …  B2N |
    |  ⋮    ⋮    ⋮        ⋮  |
    | BN1  BN2  BN3  …  BNN |                                    (2)

C = | C11  C12  C13  …  C1N |
    | C21  C22  C23  …  C2N |
    |  ⋮    ⋮    ⋮        ⋮  |
    | CN1  CN2  CN3  …  CNN |                                    (3)

Then C11 = A11B11 + A12B21 + A13B31 + … + A1NBN1                 (4)

C12 = A11B12 + A12B22 + A13B32 + … + A1NBN2                      (5)

and, by analogy,

C1N = A11B1N + A12B2N + A13B3N + … + A1NBNN                      (6)

and, by analogy,

CN1 = AN1B11 + AN2B21 + AN3B31 + … + ANNBN1                      (7)

CN2 = AN1B12 + AN2B22 + AN3B32 + … + ANNBN2                      (8)

and, by analogy,

CNN = AN1B1N + AN2B2N + AN3B3N + … + ANNBNN                      (9)

As can be seen from equations (1) to (9), the process of taking the product of the first matrix A and the second matrix B is in fact the process of making each element Alk of the first matrix A meet and multiply the corresponding element Bkj of the second matrix B, and accumulating the products, where l, j, k and N are non-zero natural numbers, l ≤ N, j ≤ N, and k ≤ N.
FIG. 5 is an internal block diagram of a computation matrix 630, according to one embodiment of the present disclosure.
In some embodiments, as shown in fig. 5, the computational matrix 630 is made up of N × N computational units 640 arranged in N rows and N columns, N being a non-zero natural number. In the multicast data input mode, the maximum total amount of data that needs to be input into the calculation matrix 630 in one clock cycle is N × N × 2. In some embodiments, the predetermined environment bandwidth threshold is N × N × 2; the calculation matrix 630 is controlled to operate in the multicast data input mode if the external environment bandwidth is greater than N × N × 2, and in the pulsating data input mode if the external environment bandwidth is not greater than N × N × 2.
In some embodiments, as shown in fig. 5, in the pulsating data input mode the computing units 640 in the computing matrix 630 receive data from the computing unit 640 of the same row in the previous column via a first input line 641, and from the computing unit 640 of the same column in the previous row via a second input line 642. That is, in the pulsating data input mode, the input data of each computing unit 640 comes from the computing unit 640 of the same row in the previous column and the computing unit 640 of the same column in the previous row.
In some cases, in the pulsating data input mode, as shown in fig. 5, in the first clock cycle the element A11 of the first row and first column of the first matrix A enters the first column of the calculation matrix 630, and the element B11 of the first row and first column of the second matrix B enters the first row of the calculation matrix 630, so that the calculation unit T11 in the first row and first column of the calculation matrix 630 obtains A11B11. In the second clock cycle, the element A11 of the first row and first column of the first matrix A moves rightwards from the calculation unit T11 of the first row and first column into the calculation unit T12 of the first row and second column, and the element B11 of the first row and first column of the second matrix B continues from the calculation unit T11 of the first row and first column into the calculation unit T21 of the second row and first column. At the same time, the element A12 of the first row and second column and the element A21 of the second row and first column of the first matrix A enter the calculation units T11 and T21 of the first two rows of the first column of the calculation matrix 630, respectively, and the element B21 of the second row and first column and the element B12 of the first row and second column of the second matrix B enter the calculation units T11 and T12 of the first two columns of the first row of the calculation matrix 630, respectively. Thus, in the second clock cycle, A12 and B21 meet at the calculation unit T11 to give A12B21; A11 and B12 meet at the calculation unit T12 to give A11B12; and A21 and B11 meet at the calculation unit T21 to give A21B11. Continuing by analogy, in the Nth clock cycle exactly N elements of the first matrix A (the N elements whose row number and column number sum to N+1) enter the N calculation units T11 to TN1 of the first column of the calculation matrix 630, respectively, and exactly N elements of the second matrix B (the N elements whose row number and column number sum to N+1) enter the N calculation units T11 to T1N of the first row of the calculation matrix 630, respectively. Continuing further, in the (2N-1)th clock cycle the first matrix A has only one element, ANN (row number plus column number equal to 2N), entering the calculation unit TN1 of the Nth row and first column of the calculation matrix 630, and the second matrix B has only one element, BNN (row number plus column number equal to 2N), entering the calculation unit T1N of the first row and Nth column of the calculation matrix 630.
That is, for the first 2N-1 clock cycles, in the ith clock cycle the calculation unit controller 610 makes the elements of the second matrix B whose row number and column number sum to i+1 enter the calculation units 640 of the corresponding columns of the calculation matrix 630, and makes the elements of the first matrix A whose row number and column number sum to i+1 enter the calculation units 640 of the corresponding rows of the calculation matrix 630; each calculation unit 640 multiplies the element received from the first matrix A and the element received from the second matrix B whose column number and row number are the same, and accumulates the product into the previous accumulated result. In clock cycles 2N to 3N-1, no new elements of the first matrix A and the second matrix B are input into the calculation matrix 630; the elements already inside continue to pulse through the calculation matrix 630, and each calculation unit 640 again multiplies the element received from the first matrix A and the element received from the second matrix B whose column number and row number are the same, accumulating the product into the previous accumulated result. Finally, after 3N-1 clock cycles, the calculation matrix 630 has obtained every element of the product matrix C of the first matrix A and the second matrix B.
Thus, in the systolic data input mode, in the Nth clock cycle the calculation unit controller 610 makes the N elements of the second matrix B whose row number and column number sum to N+1 enter the calculation units 640 of the corresponding columns of the calculation matrix 630, and makes the N elements of the first matrix A whose row number and column number sum to N+1 enter the calculation units 640 of the corresponding rows of the calculation matrix 630. That is, in the systolic data input mode, the maximum total amount of data that needs to be input into the calculation matrix 630 in one clock cycle is N+N; the input data are passed from the calculation unit 640 in the same row of the previous column to the calculation unit 640 in the same row of the next column, and from the calculation unit 640 in the previous row of the same column to the calculation unit 640 in the next row of the same column, so that the input data are multiplexed in a pulsating manner inside the calculation matrix 630.
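For illustration, the systolic schedule above can be modelled with a small cycle-by-cycle Python simulation; the function name systolic_matmul, the register arrays, and the zero-padding of out-of-range injections are modelling assumptions for this sketch, not a description of the actual circuit.

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cycle-by-cycle sketch of the systolic data input mode for an N x N
    calculation matrix (output-stationary: cell (r, c) accumulates C[r][c])."""
    N = A.shape[0]
    C = np.zeros((N, N), dtype=A.dtype)
    a_reg = np.zeros((N, N), dtype=A.dtype)   # value of A held by each cell this cycle
    b_reg = np.zeros((N, N), dtype=B.dtype)   # value of B held by each cell this cycle

    for cycle in range(3 * N - 1):            # full product ready after 3N-1 cycles
        new_a = np.zeros_like(a_reg)
        new_b = np.zeros_like(b_reg)
        for r in range(N):
            for c in range(N):
                # From the left neighbour (same row, previous column), or from
                # outside for the first column: row r of A enters skewed by r cycles.
                if c == 0:
                    k = cycle - r
                    a_in = A[r, k] if 0 <= k < N else 0
                else:
                    a_in = a_reg[r, c - 1]
                # From the upper neighbour (previous row, same column), or from
                # outside for the first row: column c of B enters skewed by c cycles.
                if r == 0:
                    k = cycle - c
                    b_in = B[k, c] if 0 <= k < N else 0
                else:
                    b_in = b_reg[r - 1, c]
                C[r, c] += a_in * b_in         # multiply-accumulate inside the cell
                new_a[r, c], new_b[r, c] = a_in, b_in
        a_reg, b_reg = new_a, new_b            # latched values pulse onward next cycle
    return C

A = np.arange(9).reshape(3, 3)
B = np.arange(9, 18).reshape(3, 3)
assert np.array_equal(systolic_matmul(A, B), A @ B)
```

In this model, row r of the first matrix enters the first column skewed by r cycles and column c of the second matrix enters the first row skewed by c cycles, which reproduces the "row number plus column number equals i+1" injection rule and yields the full product after 3N-1 cycles with only 2N external inputs at the busiest cycle.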
In some embodiments, as shown in fig. 5, in the multicast data input mode, the computing units 640 in the same column in the computing matrix 630 are commonly connected to the first column bus 643, and the computing units 640 in the same column respectively receive data through the first column bus 643. The computing units 640 in the same row of the computing matrix 630 are commonly connected to a first row bus 644, and the computing units 640 in the same row respectively receive data through the first row bus 644. That is, in the multicast data input mode, data is input to each calculation unit 640 in rows and columns in a multicast transmission manner.
In some cases, in the multicast data input mode, as shown in fig. 5, in the first clock cycle the calculation unit controller 610 broadcasts the element A11 in the first row and first column of the first matrix A to the calculation units T11 to T1N in the first row of the calculation matrix 630, broadcasts the element A21 in the second row and first column of the first matrix A to the calculation units T21 to T2N in the second row of the calculation matrix 630, and so on, and broadcasts the element AN1 in the Nth row and first column of the first matrix A to the calculation units TN1 to TNN in the Nth row of the calculation matrix 630; it broadcasts the element B11 in the first row and first column of the second matrix B to the calculation units T11 to TN1 in the first column of the calculation matrix 630, broadcasts the element B12 in the first row and second column of the second matrix B to the calculation units T12 to TN2 in the second column of the calculation matrix 630, and so on, and broadcasts the element B1N in the first row and Nth column of the second matrix B to the calculation units T1N to TNN in the Nth column of the calculation matrix 630. In the second clock cycle, the calculation unit controller 610 broadcasts the element A12 in the first row and second column of the first matrix A to the calculation units T11 to T1N in the first row of the calculation matrix 630, broadcasts the element A22 in the second row and second column of the first matrix A to the calculation units T21 to T2N in the second row of the calculation matrix 630, and so on, and broadcasts the element AN2 in the Nth row and second column of the first matrix A to the calculation units TN1 to TNN in the Nth row of the calculation matrix 630; it broadcasts the element B21 in the second row and first column of the second matrix B to the calculation units T11 to TN1 in the first column of the calculation matrix 630, broadcasts the element B22 in the second row and second column of the second matrix B to the calculation units T12 to TN2 in the second column of the calculation matrix 630, and so on, and broadcasts the element B2N in the second row and Nth column of the second matrix B to the calculation units T1N to TNN in the Nth column of the calculation matrix 630. By analogy, in the Nth clock cycle, the calculation unit controller 610 broadcasts the element A1N in the first row and Nth column of the first matrix A to the calculation units T11 to T1N in the first row of the calculation matrix 630, broadcasts the element A2N in the second row and Nth column of the first matrix A to the calculation units T21 to T2N in the second row of the calculation matrix 630, and so on, and broadcasts the element ANN in the Nth row and Nth column of the first matrix A to the calculation units TN1 to TNN in the Nth row of the calculation matrix 630; it broadcasts the element BN1 in the Nth row and first column of the second matrix B to the calculation units T11 to TN1 in the first column of the calculation matrix 630, broadcasts the element BN2 in the Nth row and second column of the second matrix B to the calculation units T12 to TN2 in the second column of the calculation matrix 630, and so on, and broadcasts the element BNN in the Nth row and Nth column of the second matrix B to the calculation units T1N to TNN in the Nth column of the calculation matrix 630.
Thus, for the first N clock cycles, in the ith clock cycle the calculation unit T11 in the first row and first column of the calculation matrix 630 obtains A1iBi1, the calculation unit T12 in the first row and second column obtains A1iBi2, and so on up to the calculation unit T1N in the first row and Nth column, which obtains A1iBiN; the calculation unit T21 in the second row and first column obtains A2iBi1, the calculation unit T22 in the second row and second column obtains A2iBi2, and so on up to the calculation unit T2N in the second row and Nth column, which obtains A2iBiN; and so on, until the calculation unit TN1 in the Nth row and first column obtains ANiBi1, the calculation unit TN2 in the Nth row and second column obtains ANiBi2, and so on up to the calculation unit TNN in the Nth row and Nth column, which obtains ANiBiN. That is, for the first N clock cycles, in the ith clock cycle the calculation unit controller 610 broadcasts the elements of the ith column of the first matrix A to the respective rows of the calculation matrix 630 and broadcasts the elements of the ith row of the second matrix B to the respective columns of the calculation matrix 630; each calculation unit 640 multiplies the element received from the first matrix A and the element received from the second matrix B whose column number and row number are the same, and accumulates the product into the previous accumulated result. In the (N+1)th clock cycle, every element of the product matrix C of the first matrix A and the second matrix B obtained by the calculation matrix 630 is output.
Thus, in the multicast data input mode, for the first N clock cycles, in the ith clock cycle the calculation unit controller 610 broadcasts the N elements in the ith column of the first matrix A to the N rows of the calculation matrix 630 and broadcasts the N elements in the ith row of the second matrix B to the N columns of the calculation matrix 630. That is, in the multicast data input mode, the maximum total amount of data that needs to be input into the calculation matrix 630 in one clock cycle is N×N×2; in other words, all N×N calculation units 640 in the calculation matrix 630 perform calculation, and no calculation unit 640 is left in an idle waiting state.
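The multicast schedule can be sketched in the same way; again, multicast_matmul is a hypothetical name, and the nested loops simply model the N×N cells all firing in the same cycle.

```python
import numpy as np

def multicast_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cycle-by-cycle sketch of the multicast data input mode: in cycle i,
    column i of A is broadcast along the rows and row i of B is broadcast
    along the columns, so every cell performs one MAC per cycle."""
    N = A.shape[0]
    C = np.zeros((N, N), dtype=A.dtype)
    for i in range(N):                        # N cycles instead of 3N-1
        a_col = A[:, i]                       # N elements onto the row buses
        b_row = B[i, :]                       # N elements onto the column buses
        for r in range(N):
            for c in range(N):
                C[r, c] += a_col[r] * b_row[c]    # all cells busy, none idle
    return C

A = np.arange(9).reshape(3, 3)
B = np.arange(9, 18).reshape(3, 3)
assert np.array_equal(multicast_matmul(A, B), A @ B)
```

In each cycle only 2N elements cross the external interface, but every element is delivered to N cells over a row or column bus, which is why the data delivered into the matrix per cycle is N×N×2 and why this mode needs the larger external bandwidth.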
It should be noted that, in general, the calculation matrix 630 in the processing unit shown in fig. 5 is fixed, while the first matrix A and the second matrix B are not. When the numbers of rows and columns of the first matrix A and the second matrix B do not match the calculation matrix 630, the first matrix A and the second matrix B are usually split first, and the calculation matrix 630 then performs the tensor operation on the resulting sub-matrices.
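A minimal sketch of the usual block-splitting, assuming square n×n tiles with zero padding at the edges and a pluggable kernel for the tile product; the actual splitting performed by the processing unit may differ.

```python
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, n: int, kernel) -> np.ndarray:
    """Split A (p x q) and B (q x s) into n x n tiles, compute each tile
    product with `kernel` (e.g. one of the simulations above), and combine."""
    p, q = A.shape
    q2, s = B.shape
    assert q == q2
    P, Q, S = -(-p // n) * n, -(-q // n) * n, -(-s // n) * n   # round up to tile size
    Ap = np.zeros((P, Q), dtype=A.dtype); Ap[:p, :q] = A        # zero-padded copies
    Bp = np.zeros((Q, S), dtype=B.dtype); Bp[:q, :s] = B
    C = np.zeros((P, S), dtype=A.dtype)
    for i in range(0, P, n):
        for j in range(0, S, n):
            for k in range(0, Q, n):
                C[i:i+n, j:j+n] += kernel(Ap[i:i+n, k:k+n], Bp[k:k+n, j:j+n])
    return C[:p, :s]

A = np.arange(5 * 7).reshape(5, 7)
B = np.arange(7 * 4).reshape(7, 4)
assert np.array_equal(tiled_matmul(A, B, n=3, kernel=lambda x, y: x @ y), A @ B)
```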
Computing unit
Fig. 6 is an internal structural diagram of the calculation unit 640 according to one embodiment of the present disclosure.
In some embodiments, the internal structure of the computation units 640 in the computation matrix 630 is the same. As shown in fig. 6, the calculation unit 640 includes: a first register 651, a second register 652, a third register 653, a multiplier 654, and an accumulator 655.
In some embodiments, the monitoring unit 620 may monitor the external environment bandwidth in real time, and the computational unit controller 610 switches the computational matrix 630 between the systolic data input mode and the multicast data input mode according to the external environment bandwidth monitored in real time. In some embodiments, the computing unit 640 further comprises: a first gate 656 and a second gate 657. The first gate 656 receives the control signal M provided by the computing unit controller 610 and, according to the control signal M, gates either the second input line 642 or the first column bus 643 in the computing matrix 630, so that the computing matrix 630 switches between the systolic data input mode and the multicast data input mode. The second gate 657 receives the control signal M provided by the computing unit controller 610 and, according to the control signal M, gates either the first input line 641 or the first row bus 644 in the computing matrix 630, so that the computing matrix 630 switches between the systolic data input mode and the multicast data input mode. In some cases, according to the control signal M, the first gate 656 gates the second input line 642 in the calculation matrix 630 and the calculation unit 640 receives data from the calculation unit 640 in the previous row of the same column through the second input line 642, while the second gate 657 gates the first input line 641 in the calculation matrix 630 and the calculation unit 640 receives data from the calculation unit 640 in the same row of the previous column through the first input line 641, so that the calculation matrix 630 operates in the systolic data input mode. In other cases, according to the control signal M, the first gate 656 gates the first column bus 643 in the computation matrix 630 and the computation units 640 in the same column receive data through the first column bus 643, while the second gate 657 gates the first row bus 644 in the computation matrix 630 and the computation units 640 in the same row receive data through the first row bus 644, so that the computation matrix 630 operates in the multicast data input mode.
In some embodiments, in the systolic data input mode, in one clock cycle the computational cells 640 in the computational matrix 630 receive data from the computational cell 640 of the same row in the previous column through the first input line 641 and from the computational cell 640 of the previous row in the same column through the second input line 642. The first register 651 stores, according to the control signal M, the data transferred from the calculation unit 640 of the same row in the previous column. The second register 652 stores, according to the control signal M, the data transferred from the calculation unit 640 of the previous row in the same column. The multiplier 654 multiplies the element received from the first input line 641 and the element received from the second input line 642, whose column sequence number and row sequence number are the same. The accumulator 655 accumulates the result of the multiplication by the multiplier 654 into the previous accumulated result. The third register 653 stores the accumulated result. In the next clock cycle, the calculation unit 640 transfers the data stored in the first register 651 to the calculation unit 640 of the same row in the next column, and transfers the data stored in the second register 652 to the calculation unit 640 of the next row in the same column.
In some embodiments, in the multicast data input mode, in one clock cycle the calculation units 640 in the same column of the calculation matrix 630 receive data through the first column bus 643, and the calculation units 640 in the same row receive data through the first row bus 644. The first register 651 suspends its operation according to the control signal M. The second register 652 suspends its operation according to the control signal M. The multiplier 654 multiplies the element received from the first row bus 644 and the element received from the first column bus 643, whose column sequence number and row sequence number are the same. The accumulator 655 accumulates the result of the multiplication by the multiplier 654 into the previous accumulated result. The third register 653 stores the accumulated result.
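A behavioral sketch of a single calculation unit, written as a hypothetical Python class rather than as the circuit of fig. 6; the class name, method names, and the string mode argument are assumptions made for this sketch.

```python
class ComputeUnit:
    """Behavioral sketch of calculation unit 640: pass-through registers
    (651/652), multiply-accumulate (654/655) with a result register (653),
    and gating between systolic and multicast inputs under control signal M."""

    def __init__(self):
        self.a_reg = 0    # first register 651: element of A, forwarded right next cycle
        self.b_reg = 0    # second register 652: element of B, forwarded down next cycle
        self.acc = 0      # third register 653: accumulated partial result

    def step(self, mode, a_from_left, b_from_top, a_row_bus, b_col_bus):
        if mode == "systolic":
            # gates select the neighbour input lines (641 / 642)
            a_in, b_in = a_from_left, b_from_top
            self.a_reg, self.b_reg = a_in, b_in      # latch for forwarding next cycle
        else:
            # gates select the shared row / column buses (644 / 643);
            # the pass-through registers are suspended in this mode
            a_in, b_in = a_row_bus, b_col_bus
        self.acc += a_in * b_in                      # multiplier 654 + accumulator 655
        # values the right and lower neighbours will read in the next cycle
        return self.a_reg, self.b_reg
```

In a full model, an N×N grid of such objects driven by either the skewed injection schedule or the row/column buses reproduces the two simulations sketched earlier.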
Tensor operation method of the disclosed embodiment
Fig. 7 is a flowchart of a tensor operation method provided by an embodiment of the present disclosure. As shown in the figure, the method comprises the following steps.
In step S701, the external environment bandwidth in which a plurality of computing units are located is obtained, the plurality of computing units constituting a computing matrix of n rows and m columns, n and m being non-zero natural numbers.
In step S702, based on the external environment bandwidth: when the external environment bandwidth in which the computation matrix is located meets a predetermined bandwidth requirement, the computation matrix is controlled to operate in a multicast data input mode, in which data are broadcast column by column to all computation units in the corresponding column and row by row to all computation units in the corresponding row; when the external environment bandwidth does not meet the predetermined bandwidth requirement, the computation matrix is controlled to operate in a systolic data input mode, in which each computation unit receives data from the computation unit in the same row in the previous column and from the computation unit in the previous row in the same column, so as to support tensor operation.
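A hedged end-to-end sketch of steps S701-S702, reusing the hypothetical simulation helpers sketched earlier in this text (systolic_matmul and multicast_matmul) and assuming a square n×n input for simplicity.

```python
import numpy as np

# Hypothetical sketch only: tensor_operation is not a name from the disclosure,
# and systolic_matmul / multicast_matmul are the illustrative helpers above.

def tensor_operation(A: np.ndarray, B: np.ndarray, external_bandwidth: int):
    n = A.shape[0]                          # a square n x n calculation matrix is assumed
    threshold = n * n * 2                   # peak per-cycle demand of the multicast mode
    if external_bandwidth > threshold:      # S702: bandwidth requirement met -> multicast
        return "multicast", multicast_matmul(A, B)
    return "systolic", systolic_matmul(A, B)    # otherwise -> systolic data input

A = np.eye(4)
B = np.full((4, 4), 2.0)
mode, C = tensor_operation(A, B, external_bandwidth=64)   # 64 > 32, so multicast is chosen
```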
The method of the embodiment of the present disclosure is executed in a computing device that comprises a processing unit including a computing unit controller. Using the computing unit controller, the method controls the computation matrix to work in at least one of the systolic data input mode and the multicast data input mode according to the external environment bandwidth in which the processing unit is located, so as to support tensor operation. As a result, the external environment bandwidth and the computing power of the processing unit are matched, and better computational efficiency is achieved for the computing device.
Commercial value of the disclosed embodiments
The processing unit provided by the embodiments of the present disclosure flexibly selects the working mode of the calculation matrix according to the external environment bandwidth in which the processing unit is located. When the external environment bandwidth is greater than the predetermined environment bandwidth threshold, the calculation matrix is controlled to work in the multicast data input mode: all calculation units perform calculation and no calculation unit sits in an idle waiting state, so the computing capability of the processing unit is improved. When the external environment bandwidth is not greater than the predetermined environment bandwidth threshold, the working mode of the calculation matrix is switched to the systolic data input mode: input data are multiplexed in a pulsating manner inside the calculation matrix, the number of calculation units in an idle waiting state is reduced, and the calculation energy efficiency of the processing unit is improved. In such scenarios, reducing the power consumption of the processing unit in the systolic data input mode reduces the power consumption of the computing device and thus the running cost of the whole Internet of Things, while improving the computing performance of the processing unit in the multicast data input mode improves the computing capability of the computing device and thus of the whole Internet of Things. The embodiments of the present disclosure therefore reduce calculation energy consumption and improve computing capability, and accordingly have good commercial and economic value.
As will be appreciated by one skilled in the art, the present disclosure may be embodied as systems, methods and computer program products. Accordingly, the present disclosure may be embodied in the form of entirely hardware, entirely software (including firmware, resident software, micro-code), or in the form of a combination of software and hardware. Furthermore, in some embodiments, the present disclosure may also be embodied in the form of a computer program product in one or more computer-readable media having computer-readable program code embodied therein.
Any combination of one or more computer-readable media may be employed. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium is, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical memory, a magnetic memory, or any suitable combination of the foregoing. In this context, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with a processing unit, apparatus, or device.
A computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic or optical signals, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., and any suitable combination of the foregoing.
Computer program code for carrying out embodiments of the present disclosure may be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages such as JAVA and C++, and may also include conventional procedural programming languages such as C. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (13)

1. A processing unit, comprising:
a plurality of calculation units, which form a calculation matrix with n rows and m columns, wherein n and m are non-zero natural numbers;
and a computing unit controller, configured to control the computing matrix to work in a multicast data input mode when the external environment bandwidth in which the processing unit is located meets a predetermined bandwidth requirement, in which mode data are broadcast column by column to all computing units in the corresponding column and row by row to all computing units in the corresponding row, and to control the computing matrix to work in a systolic data input mode when the external environment bandwidth does not meet the predetermined bandwidth requirement, in which mode the computing units receive data from the computing units in the same row in the previous column and from the computing units in the previous row in the same column, so as to support tensor operation.
2. The processing unit of claim 1, wherein the predetermined bandwidth requirements include:
the external environment bandwidth is greater than a predetermined environment bandwidth threshold, where the predetermined environment bandwidth threshold is a maximum total amount of data that needs to be input into the computation matrix in one clock cycle in the multicast data input mode.
3. The processing unit of claim 1 or 2, wherein the processing unit is located within an acceleration unit co-located within a computing device with a scheduling unit, the acceleration unit executes tensor operations as scheduled by the scheduling unit, and the external environment bandwidth is derived from an instruction received from the scheduling unit.
4. The processing unit of claim 1 or 2, further comprising: and the monitoring unit is used for monitoring the bandwidth of the external environment where the processing unit is located.
5. The processing unit of claim 4, wherein the monitoring unit monitors the external environment bandwidth in real time, and the computational unit controller switches the computational matrix between the systolic data input mode and the multicast data input mode based on the external environment bandwidth monitored in real time.
6. The processing unit of claim 1, wherein the computing unit comprises:
the first register is used for storing data transferred by the computing units in the same row in the previous column;
a second register for storing data transferred from the calculation units of a previous row of the same column;
a multiplier for multiplying the elements with the same received column sequence number and row sequence number;
an accumulator for accumulating the result multiplied by the multiplier into a previous accumulated result;
and the third register is used for storing the accumulation result.
7. The processing unit according to claim 1 or 2, wherein, in the systolic data input mode, the computational cells in the computational matrix receive data from the computational cells of the same row in the previous column via a first input line and from the computational cells of the previous row in the same column via a second input line.
8. The processing unit according to claim 1 or 2, wherein in the multicast data input mode, the computing units in the same column in the computing matrix are commonly connected to a first column bus, and the computing units in the same column respectively receive data through the first column bus; the computing units in the same row in the computing matrix are connected to a first row bus together, and the computing units in the same row receive data through the first row bus respectively.
9. An acceleration unit core comprising a processing unit according to any of claims 1-8.
10. An acceleration unit, comprising:
the acceleration unit core of claim 9;
a command processor that assigns a tensor computation task to the acceleration unit core.
11. An internet of things device, comprising:
an acceleration unit according to claim 10;
and the scheduling unit is used for distributing tensor calculation tasks to the accelerating unit.
12. A system on chip comprising a processing unit according to any of claims 1-8.
13. A tensor operation method, comprising:
acquiring external environment bandwidths of a plurality of computing units, wherein the computing units form a computing matrix with n rows and m columns, and n and m are non-zero natural numbers;
based on the external environment bandwidth, the calculation matrix is controlled to work in a multicast data input mode under the condition that the external environment bandwidth where the calculation matrix is located meets the requirement of preset bandwidth, data are broadcasted to all calculation units in a corresponding column in a column mode and are broadcasted to all calculation units in a corresponding row in a row mode, under the condition that the external environment bandwidth does not meet the requirement of the preset bandwidth, the calculation matrix is controlled to work in a pulsating data input mode, and the calculation units receive the data from the calculation units in the same row in the previous column and the calculation units in the previous row in the same column so as to support tensor operation.
CN202110588863.5A 2021-05-28 2021-05-28 Processing unit, correlation device, and tensor operation method Active CN113543045B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110588863.5A CN113543045B (en) 2021-05-28 2021-05-28 Processing unit, correlation device, and tensor operation method

Publications (2)

Publication Number Publication Date
CN113543045A CN113543045A (en) 2021-10-22
CN113543045B true CN113543045B (en) 2022-04-26

Family

ID=78094841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110588863.5A Active CN113543045B (en) 2021-05-28 2021-05-28 Processing unit, correlation device, and tensor operation method

Country Status (1)

Country Link
CN (1) CN113543045B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101932996A (en) * 2007-09-24 2010-12-29 认知电子公司 Parallel processing computer with reduced power consumption and method of providing such a system
CN104205659A (en) * 2011-10-21 2014-12-10 奥普蒂斯蜂窝技术有限责任公司 Methods, processing device, computer programs, computer program products and antenna apparatus for calibration of antenna apparatus
CN106155814A (en) * 2016-07-04 2016-11-23 合肥工业大学 A kind of reconfigurable arithmetic unit supporting multiple-working mode and working method thereof
CN110909865A (en) * 2019-11-18 2020-03-24 福州大学 Federated learning method based on hierarchical tensor decomposition in edge calculation
CN111199273A (en) * 2019-12-31 2020-05-26 深圳云天励飞技术有限公司 Convolution calculation method, device, equipment and storage medium
CN111325321A (en) * 2020-02-13 2020-06-23 中国科学院自动化研究所 Brain-like computing system based on multi-neural network fusion and execution method of instruction set
CN111461311A (en) * 2020-03-26 2020-07-28 中国科学技术大学 Convolutional neural network operation acceleration method and device based on many-core processor
CN111611202A (en) * 2019-02-24 2020-09-01 英特尔公司 Systolic array accelerator system and method
CN111783933A (en) * 2019-04-04 2020-10-16 北京芯启科技有限公司 Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11037330B2 (en) * 2017-04-08 2021-06-15 Intel Corporation Low rank matrix compression

Also Published As

Publication number Publication date
CN113543045A (en) 2021-10-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240223

Address after: 310052 Room 201, floor 2, building 5, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: C-SKY MICROSYSTEMS Co.,Ltd.

Country or region after: China

Address before: 5th Floor, No. 366 Shangke Road, Lane 55 Chuanhe Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai

Patentee before: Pingtouge (Shanghai) semiconductor technology Co.,Ltd.

Country or region before: China
