CN109919321A

CN109919321A - Artificial intelligence module whose processing units have a local accumulation function, and system chip

Info

Publication number: CN109919321A
Application number: CN201910103617.9A
Authority: CN
Inventors: 连荣椿; 王海力; 马明
Original assignee: Jing Wei Qi Li (beijing) Technology Co Ltd
Current assignee: Jing Wei Qi Li (beijing) Technology Co Ltd
Priority date: 2019-02-01
Filing date: 2019-02-01
Publication date: 2019-06-21

Abstract

An artificial intelligence (AI) module whose processing units have a local accumulation function, and a system chip including the AI module. In an embodiment, a chip circuit includes an AI module, and the AI module includes a plurality of processing units arranged in a two-dimensional array, each of which performs multiply-add operations. Each processing unit includes an enable input that receives an enable signal, and pauses or starts its operation according to the enable signal. Under the action of control signals, each processing unit is configured to accumulate products. All processing units share the same clock signal for their operations. Embodiments of the invention allow each unit to accumulate the results of its successive operations, which can effectively reduce the scale of the AI module.

Description

Artificial intelligence module whose processing units have a local accumulation function, and system chip

Technical field

The present invention relates to the technical field of integrated circuits, and in particular to an artificial intelligence (AI) module whose processing units have a local accumulation function, and to a system chip including the AI module.

Background

The purpose of a systolic array is to let data flow through an array of arithmetic units, which reduces the number of memory accesses, makes the structure more regular and the wiring more uniform, and raises the operating frequency. The concept of the systolic array was proposed as early as 1982; it has recently attracted attention again because artificial intelligence chips use this structure as their computing core.

As research into artificial intelligence deepens and its applications spread, AI modules that better meet demand need to be introduced.

In addition, an artificial intelligence module is accessed and controlled by a processor over a bus, and a bus has limited bandwidth, so such an architecture struggles to meet the high bandwidth demand of an AI module.

Summary of the invention

According to a first aspect, an embodiment of the present invention provides a chip circuit that includes an AI module. The AI module includes a plurality of processing units arranged in a two-dimensional array along a first dimension and a second dimension, and each processing unit can perform multiply-add operations. Each processing unit includes an enable input for receiving an enable signal, and pauses or starts its operation according to the enable signal. Under the action of control signals, the processing unit can accumulate products. All processing units in the two-dimensional array share the same clock signal for their operations. The first dimension and the second dimension are perpendicular to each other.

In one embodiment of the first aspect, the processing unit includes a coefficient memory that supplies the coefficient data used in the processing unit's operations. The processing unit includes a multiplier, an adder, a first register (REG1), a second register and a multiplexer; a first data input and a first data output in the first dimension; and a second data input and a second data output in the second dimension. First data enters through the first data input, and the multiplier multiplies the first data by the coefficient data. The multiplexer selects one of two values for output: the second data from the second data input, or the output data of the first register. The adder adds the multiplexer's output to the product, and the resulting sum is stored in the first register. Under clock control, the sum can be output through the second data output. The first data is also stored in the second register and, under clock control, can be output through the first data output.

According to a second aspect, an embodiment of the present invention provides a system chip, including: the chip circuit of the first aspect; and an FPGA module coupled to the AI module so as to send data to, or receive data from, the AI module.

In an embodiment of the second aspect, the AI module includes a first processing unit, a second processing unit and a third processing unit. The first and second processing units are arranged adjacent to each other along the first dimension, and the first output of the first processing unit is coupled to the first input of the second processing unit. The first and third processing units are arranged adjacent to each other along the second dimension, and the second output of the first processing unit is coupled to the second input of the third processing unit.

In yet another embodiment, the AI module is embedded in the FPGA module and reuses the routing architecture of the FPGA module, so that data is sent to or received from the AI module entirely through the reused FPGA routing architecture.

In embodiments, because each processing unit accumulates the results of its successive operations, the scale of the AI module can be effectively reduced.

Brief description of the drawings

Fig. 1 is a schematic diagram of a two-dimensional AI module according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a processing unit;

Fig. 3 is a schematic diagram of the memory MEM in the processing unit of Fig. 2;

Fig. 4 is a schematic diagram of a two-dimensional systolic array processing data;

Fig. 5 is a structural schematic diagram of a system chip that integrates an FPGA and an AI module;

Fig. 6 is a structural schematic diagram of an FPGA circuit.

Detailed description

To make the technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention are described in further detail below with reference to the drawings and embodiments.

In the description of this application, terms such as "center", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner" and "outer" indicate orientations or positional relationships based on those shown in the drawings. They are used only to simplify the description and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they should therefore not be understood as limiting this application.

Fig. 1 is a schematic diagram of a two-dimensional AI module according to an embodiment of the present invention. In one example, the AI module is a systolic array, a structure of processing units in which data streams flow synchronously through adjacent units of a two-dimensional array. As shown in Fig. 1, the systolic array includes, for example, 4×4 processing units PE. The systolic array can be divided into two dimensions, a first dimension and a second dimension perpendicular to each other. Take a first, a second and a third processing unit as an example. The first and second processing units are arranged in a first direction along the first dimension, and the first output of the first processing unit is coupled to the first input of the second processing unit. The first and third processing units are arranged in a second direction along the second dimension, and the second output of the first processing unit is coupled to the second input of the third processing unit.

Under the same clock, one-dimensional data a can be fed along the first dimension, in the first direction, into each processing unit that has the same second-dimension value in turn. In each processing unit the data is multiplied by the other dimension's data (the coefficient) W stored in that unit. The products are passed along the second dimension, in the second direction, from one processing unit to the next and are added together. For ease of understanding, the horizontal dimension is taken below as the first dimension with left to right as the first direction, and the vertical dimension as the second dimension with top to bottom as the second direction.

Note that each data line in Fig. 1 can represent either a single-bit signal or an 8-bit (or 16-bit or 32-bit) signal.

The processing unit has an enable signal EN input for receiving the enable signal EN, and starts or pauses its processing according to the enable signal EN. All processing units in the two-dimensional array share the same clock signal for their operations.

Under the action of control signals, the processing unit can accumulate products. The control signals may include the enable signal EN, the select signal of the multiplexer, and so on. Because each unit can accumulate the results of its successive operations, the scale of the AI module can be effectively reduced.

In one example, the two-dimensional array can implement matrix multiplication. In another example, the two-dimensional array can implement convolution.

Fig. 2 is the schematic diagram of processing unit.As shown in Fig. 2, processing unit includes multiplier MUL, adder ADD.Data It inputs from the first data-in port DI, is multiplied in MUL with the coefficient W being stored in coefficient memory MEM, then, the product It is added in adder ADD with the data P from the second data-in port PI, after being added and value is deposited in register REG1 In.In next clock, and value S is exported through the first output end PO.It can be through inputting after the first output end PO output with value S Port PI inputs another underlying PE.First input data end DI and the first data output end DO is distributed along first direction In the first dimension；Second data input pin PI and the second data output end PO are distributed in a second direction in the second dimension.

In one example, the processing unit further includes a multiplexer MUX. According to a control signal, the multiplexer selects either the data P from the second data input PI or the output signal of REG1 and feeds it to the adder ADD. Under the action of control signals, the processing unit can thus accumulate products. With this internal feedback, multiply-accumulate operations can be carried out within a single processing unit, which makes various types of AI operations possible.

The data a can also be stored in register REG2 and, under clock control, output through the first data output DO to the processing unit PE on the right.

The clock CK controls the processing of the processing unit.

The enable signal EN starts or pauses the processing of the processing unit.
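The behavior of the processing unit of Fig. 2 can be sketched in software. The following Python model is only an illustration under stated assumptions: the class name, the method names and the boolean `sel_reg1` (standing for the MUX select signal) are hypothetical, word widths are ignored, and one call to `clock` stands for one active edge of CK.

```python
from dataclasses import dataclass


@dataclass
class ProcessingElement:
    """Behavioral sketch of one PE; signal names follow Fig. 2."""
    w: int = 0      # coefficient W held in MEM
    reg1: int = 0   # REG1: holds the sum S and drives PO
    reg2: int = 0   # REG2: holds the data a and drives DO

    def clock(self, di: int, pi: int, en: bool = True,
              sel_reg1: bool = False) -> None:
        """One clock edge. sel_reg1 is the assumed MUX select signal."""
        if not en:
            return                               # EN low: pause and hold state
        addend = self.reg1 if sel_reg1 else pi   # MUX: REG1 feedback or PI
        self.reg1 = di * self.w + addend         # MUL, then ADD, stored in REG1
        self.reg2 = di                           # a is forwarded through REG2

    @property
    def po(self) -> int:
        return self.reg1

    @property
    def do(self) -> int:
        return self.reg2


pe = ProcessingElement(w=3)
pe.clock(di=2, pi=10)            # REG1 = 2*3 + 10, REG2 = 2
paused = (pe.po, pe.do)
pe.clock(di=5, pi=0, en=False)   # EN low: nothing changes
```

After the second call the outputs are unchanged, which is the pause behavior the enable signal EN is described as providing.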

Fig. 3 is a schematic diagram of the memory MEM in the processing unit of Fig. 2. As shown in Fig. 3, the memory includes an 8-bit D flip-flop. Coefficient data enters the flip-flop through the D input and leaves through the Q output as Q0-Q7. The clock CK sets the timing of the flip-flop. The enable signal EN determines whether the D flip-flop runs or pauses.
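A minimal Python sketch of such an enabled 8-bit register follows. The class and method names are hypothetical, and the bit ordering (Q0 as the least significant bit) is an assumption that the text does not state.

```python
class CoefficientMemory:
    """Sketch of the 8-bit D flip-flop of Fig. 3 with clock CK and enable EN."""

    def __init__(self) -> None:
        self.q = [0] * 8   # Q0..Q7; Q0 is taken as the least significant bit

    def clock(self, d: int, en: bool) -> None:
        """One clock edge: capture the 8-bit value d only while EN is high."""
        if en:
            self.q = [(d >> i) & 1 for i in range(8)]

    def value(self) -> int:
        """Reassemble Q0..Q7 into an integer coefficient."""
        return sum(bit << i for i, bit in enumerate(self.q))


mem = CoefficientMemory()
mem.clock(0xA5, en=True)    # coefficient is loaded
mem.clock(0x3C, en=False)   # EN low: the stored coefficient is held
```

In this sketch the second write is ignored, so `mem.value()` still returns `0xA5`.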

Fig. 4 is the schematic diagram of 2 dimension systolic arrays processing data.As shown in figure 4, the left column of 4X4 systolic arrays includes at 4 Unit is managed, the coefficient stored in each processing unit is respectively W11, W12, W13, W14.Can temporarily with the label reference of coefficient at Manage unit.First, it is assumed that the MUX of processing unit only gates the input data of PI.

Data is input from the left. On the first clock, a11 enters unit 11, which computes the product p11=a11*w11. If the value p10 arriving from the processing unit above is not 0, p10 must also be added.

On the second clock, a11*w11 moves down from unit 11 to unit 12; a21 enters unit 11 and a12 enters unit 12. Unit 11 then computes the product a21*w11 (plus whatever arrives from p10 at that moment), and unit 12 computes the product a12*w12 and outputs a12*w12+a11*w11.

On the third clock, a21*w11 moves down from unit 11 to unit 12, and a12*w12+a11*w11 moves down from unit 12 to unit 13; a31 enters unit 11, a22 enters unit 12 and a13 enters unit 13. Unit 11 then computes the product a31*w11 (plus whatever arrives from p10 at that moment); unit 12 computes the product a22*w12 and outputs a22*w12+a21*w11; unit 13 computes the product a13*w13 and outputs a13*w13+a12*w12+a11*w11.

On the fourth clock, a31*w11 moves down from unit 11 to unit 12, a22*w12+a21*w11 moves down from unit 12 to unit 13, and a13*w13+a12*w12+a11*w11 moves down from unit 13 to unit 14; a41 enters unit 11, a32 enters unit 12, a23 enters unit 13 and a14 enters unit 14. Unit 11 then computes the product a41*w11 (plus whatever arrives from p10 at that moment); unit 12 computes the product a32*w12 and outputs a32*w12+a31*w11; unit 13 computes the product a23*w13 and outputs a23*w13+a22*w12+a21*w11; unit 14 computes the product a14*w14 and outputs a14*w14+a13*w13+a12*w12+a11*w11.

Similarly, the output of unit 24 on the 5th clock is a14*w24+a13*w23+a12*w22+a11*w21; the output of unit 34 on the 6th clock is a14*w34+a13*w33+a12*w32+a11*w31; and the output of unit 44 on the 7th clock is a14*w44+a13*w43+a12*w42+a11*w41.

It can be seen that the outputs of units 14, 24, 34 and 44 on the 4th, 5th, 6th and 7th clocks respectively can be regarded as the matrix product of the matrix A with elements aij and the matrix W with elements wij.
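The clock-by-clock walk-through of Fig. 4 can be checked with a small simulation. The Python sketch below is an illustration, not the patented circuit: the function name and the array layout are assumptions, every MUX is taken to gate PI only, and the top row receives 0 on PI. Indices are zero-based, so `a[0][0]` is a11 and `w[c][k]` is the coefficient of the unit in column c, row k (w11 is `w[0][0]`, w24 is `w[1][3]`).

```python
def systolic_matmul(a, w):
    """Simulate the array of Fig. 4 clock by clock.

    a[t][k]: value fed to row k for input vector t, skewed by k clocks.
    w[c][k]: coefficient of the unit in column c, row k.
    Returns (y, first_clock), where y[t][c] = sum_k a[t][k] * w[c][k] is read
    at the bottom of column c, and first_clock[c] is the clock of y[0][c].
    """
    T, K, C = len(a), len(a[0]), len(w)
    reg1 = [[0] * K for _ in range(C)]   # sums, moving downward via PO -> PI
    reg2 = [[0] * K for _ in range(C)]   # data, moving right via DO -> DI
    y = [[0] * C for _ in range(T)]
    first_clock = [0] * C

    for clock in range(1, T + K + C - 1):        # clocks 1 .. T+K+C-2
        new1 = [[0] * K for _ in range(C)]
        new2 = [[0] * K for _ in range(C)]
        for c in range(C):
            for k in range(K):
                if c == 0:
                    t = clock - 1 - k            # external input, skewed per row
                    di = a[t][k] if 0 <= t < T else 0
                else:
                    di = reg2[c - 1][k]          # from the unit on the left
                pi = reg1[c][k - 1] if k > 0 else 0   # from the unit above
                new1[c][k] = di * w[c][k] + pi
                new2[c][k] = di
        reg1, reg2 = new1, new2                  # all units update together
        for c in range(C):
            t = clock - K - c                    # vector finished in column c
            if 0 <= t < T:
                y[t][c] = reg1[c][K - 1]
                if t == 0:
                    first_clock[c] = clock
    return y, first_clock


a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
w = [[1, 0, 2, 1], [0, 1, 1, 0], [2, 2, 0, 1], [1, 1, 1, 1]]
y, first_clock = systolic_matmul(a, w)
expected = [[sum(a[t][k] * w[c][k] for k in range(4)) for c in range(4)]
            for t in range(4)]
```

In this model the first results leave units 14, 24, 34 and 44 on clocks 4, 5, 6 and 7, matching the description, and `y` equals the directly computed sums in `expected`.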

If the input data or the coefficient data in the memory is rearranged, for example by replacing aij with a[N-i][M-j], and the matrix product is computed on the rearranged data, the result is a convolution.
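The effect of such an index reversal can be illustrated in one dimension. The Python sketch below is an assumption-laden simplification: it reverses the coefficient vector rather than the two-dimensional input, and the function names are hypothetical. It shows that a plain sliding multiply-accumulate over reversed data yields the discrete convolution.

```python
def sliding_dot(x, w):
    """Multiply-accumulate of w against each window of x (no reversal)."""
    n = len(w)
    return [sum(x[i + j] * w[j] for j in range(n))
            for i in range(len(x) - n + 1)]


def convolve_valid(x, w):
    """Discrete convolution, restricted to fully overlapping positions."""
    n = len(w)
    return [sum(x[i + n - 1 - j] * w[j] for j in range(n))
            for i in range(len(x) - n + 1)]


x = [1, 2, 3, 4, 5]
w = [1, 0, -1]
plain = sliding_dot(x, w)            # multiply-accumulate without reversal
reversed_w = sliding_dot(x, w[::-1]) # same hardware operation, reversed data
reference = convolve_valid(x, w)
```

Here `reversed_w` equals `reference`, while `plain` does not, so the same multiply-accumulate structure produces a convolution once the index order of one operand is reversed.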

From the MUX function shown in Fig. 2, it follows that, in one example, the MUX of each processing unit can be configured as follows: during the first N clock cycles the MUX gates only the output value of REG1, and in cycle N+1 the MUX gates only the input from PI. The processing unit can then accumulate the operation results of the first N cycles and output the accumulated result within the AI module in a subsequent cycle. In this way, the scale of the AI module can be effectively reduced.
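This MUX schedule can be sketched for a single processing unit. The following Python function is a hypothetical illustration: its name and the per-cycle tuple format are assumptions, and it further assumes that REG1 starts at 0 and that the coefficient may differ from cycle to cycle, which the text does not state.

```python
def run_pe(stream, n):
    """Trace REG1 of one PE under the N / N+1 MUX schedule.

    stream: one (a, w, pi) tuple per clock cycle (assumed format).
    n: number of cycles in which the MUX feeds REG1 back to the adder.
    Returns the value held in REG1 (visible on PO) after each cycle.
    """
    reg1 = 0
    trace = []
    for cycle, (a, w, pi) in enumerate(stream):
        select_pi = cycle % (n + 1) == n     # cycle N+1 of each window
        addend = pi if select_pi else reg1   # MUX: PI or REG1 feedback
        reg1 = a * w + addend
        trace.append(reg1)
    return trace


# Three accumulating cycles, then one cycle in which PI is gated.
trace = run_pe([(1, 2, 0), (3, 4, 0), (5, 6, 0), (7, 8, 100)], n=3)
```

After the third cycle REG1 holds 1*2+3*4+5*6, a three-term dot product formed in one unit rather than in three stacked units, which is the sense in which local accumulation can reduce the array size.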

Fig. 5 is a structural schematic diagram of a system chip that integrates an FPGA and an AI module. As shown in Fig. 5, at least one FPGA circuit and at least one AI module are integrated on the system chip. Each of the at least one AI module can be the AI module shown in Fig. 1.

Each of the at least one FPGA circuit can implement functions such as logic, computation and control. An FPGA implements combinational logic with small look-up tables (for example, 16 × 1 RAM). Each look-up table is connected to the input of a D flip-flop, and the flip-flop in turn drives other logic circuits or drives I/O. This forms basic logic cells that can implement both combinational and sequential logic functions; these cells are interconnected, or connected to I/O modules, by metal wires. The logic of the FPGA is implemented by loading programming data into internal static memory cells. The values stored in the memory cells determine the logic function of each logic cell and the connections between modules or between modules and I/O, and ultimately determine the function the FPGA implements.
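As an illustration of how a 16 × 1 RAM acts as a look-up table, the Python sketch below models one such table feeding a D flip-flop. The class name, the input ordering and the 4-input XOR configuration are assumptions chosen for the example.

```python
class LogicCell:
    """Sketch of a basic logic cell: a 16 x 1 look-up table plus a D flip-flop."""

    def __init__(self, truth_table):
        if len(truth_table) != 16:
            raise ValueError("a 16 x 1 look-up table needs 16 entries")
        self.ram = list(truth_table)   # configuration bits loaded into SRAM
        self.q = 0                     # flip-flop output

    def lut(self, a, b, c, d):
        """Combinational output: the four inputs address the RAM."""
        return self.ram[a | (b << 1) | (c << 2) | (d << 3)]

    def clock(self, a, b, c, d):
        """Sequential output: the flip-flop captures the table output."""
        self.q = self.lut(a, b, c, d)
        return self.q


# Configure the cell as a 4-input XOR (odd parity of the address bits).
xor4 = [bin(i).count("1") & 1 for i in range(16)]
cell = LogicCell(xor4)
```

Loading a different 16-entry table changes the logic function without changing the structure, which is the sense in which the stored values determine the function of the logic cell.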

The system chip is also provided with an interface module corresponding to the two-dimensional convolution array, and the FPGA module and the AI module are connected through the interface module. The interface module can be an XBAR module, which consists, for example, of multiple multiplexers and select bits. The interface module can also be a FIFO (first in, first out) buffer, or a synchronizer, which is formed, for example, by 2 flip-flops (FF) connected in series. The FPGA module can transfer data to the AI module and provide control.

The FPGA module and the AI module can be placed side by side, in which case the FPGA module can transfer data to the AI module and provide control. The AI module can also be embedded in the FPGA module, in which case the AI module reuses the routing architecture of the FPGA module and sends and receives data through that reused routing architecture.

Fig. 6 is the structural schematic diagram of FPGA circuitry.As shown in fig. 6, FPGA circuitry may include having multiple programmable logic moulds The modules such as block (LOGIC), embedded memory block (EMB), multiply-accumulator (MAC) and corresponding coiling (XBAR).Certainly, FPGA electricity Road is additionally provided with the related resources such as clock/configuration module (trunk spine/ branch seam).If desired EMB or when MAC module, because of it The big many of area ratio PLB, therefore several PLB modules are replaced with this EMB/MAC module.

The routing resource XBAR is the junction through which the modules interconnect, and it is evenly distributed throughout the FPGA module. All resources in the FPGA module (PLB, EMB, MAC and IO) are routed to one another through the same interface, the routing XBAR unit. From the routing point of view, the whole array is uniform: the neatly arranged XBAR units form a grid that connects all modules in the FPGA.

The LOGIC module may include, for example, 8 six-input look-up tables and 18 registers. The EMB module may be, for example, one 36-kbit storage unit or two 18-kbit storage units. The MAC module may be, for example, one 25x18 multiplier or two 18x18 multipliers. There is no restriction on the proportions of LOGIC, MAC and EMB modules in the FPGA array, nor on the size of the array; both are determined by the practical application at design time.

The specific embodiments above describe the purpose, technical solutions and beneficial effects of the present invention in further detail. It should be understood that the foregoing are only specific embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A chip circuit, comprising an artificial intelligence (AI) module, the AI module comprising: a plurality of processing units (PE) arranged in a two-dimensional array along a first dimension and a second dimension, each processing unit being able to perform multiply-add operations; wherein each processing unit includes an enable input for receiving an enable signal and pauses or starts its operation according to the enable signal; under the action of control signals, the processing unit can accumulate products; all processing units in the two-dimensional array share the same clock signal for their operations; and the first dimension and the second dimension are perpendicular to each other.

2. The chip circuit according to claim 1, characterized in that the processing unit includes a coefficient memory for supplying the coefficient data used in the processing unit's operations; the processing unit includes a multiplier (MUL), an adder (ADD), a first register (REG1), a second register (REG2) and a multiplexer (MUX); a first data input (DI) and a first data output (DO) in the first dimension; and a second data input (PI) and a second data output (PO) in the second dimension; first data enters through the first data input, and the multiplier multiplies the first data by the coefficient data (W); the multiplexer selects one of the second data from the second data input and the output data of the first register for output; the adder adds the multiplexer's output to the product, and the resulting sum is stored in the first register (REG1); under clock control, the sum can be output through the second data output; the first data is also stored in the second register and, under clock control, can be output through the first data output.

3. A system chip, comprising: the chip circuit according to claim 1 or 2; and

an FPGA module coupled to the AI module so as to send data to, or receive data from, the AI module.

4. The system chip according to claim 3, characterized in that the AI module includes a first processing unit, a second processing unit and a third processing unit; wherein the first and second processing units are arranged adjacent to each other along the first dimension, and the first output of the first processing unit is coupled to the first input of the second processing unit; and the first and third processing units are arranged adjacent to each other along the second dimension, and the second output of the first processing unit is coupled to the second input of the third processing unit.

5. The system chip according to claim 3, characterized in that the AI module is embedded in the FPGA module and reuses the routing architecture of the FPGA module, so that data is sent to or received from the AI module entirely through the reused FPGA routing architecture.