CN111882051B - Global broadcast data input circuit for neural network processing

Global broadcast data input circuit for neural network processing

Info

Publication number
CN111882051B
Authority
CN
China
Prior art keywords
data
module
state
input
data packet
Prior art date
Legal status
Active
Application number
CN202010746509.6A
Other languages
Chinese (zh)
Other versions
CN111882051A (en)
Inventor
韩军
张权
张永亮
曾晓洋
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date: 2020-07-29
Filing date: 2020-07-29
Publication date: 2022-05-20
Application filed by Fudan University
Priority to CN202010746509.6A
Publication of CN111882051A
Application granted
Publication of CN111882051B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention belongs to the technical field of integrated circuits, and specifically relates to a global broadcast data input circuit for neural network processing. The circuit of the invention comprises: a top module that records the number of data receptions, a vertical bus module that broadcasts input data in the vertical direction, a horizontal bus module that broadcasts input data in the horizontal direction, and broadcast transmitting modules that deliver data to the designated operation units. The circuit partitions the data path into a two-level bus in the horizontal and vertical directions, so that data can be sent with high parallelism while greatly reducing the extra area and power overhead that the huge bandwidth of a single-bus design would incur. Meanwhile, a handshaking mechanism between operation-unit identification numbers and input data tags is introduced in the broadcast transmitting modules, which increases data reuse, reduces the number of memory accesses of the circuit, and improves its overall energy efficiency while guaranteeing that the input circuit delivers data correctly. The invention can effectively improve the transmission efficiency of input data in neural network processing.

Description

Global broadcast data input circuit for neural network processing
Technical Field
The invention belongs to the technical field of integrated circuits, and particularly relates to a global broadcast data input circuit for neural network processing.
Background
Neural network algorithms have been applied successfully in important fields such as computer vision, speech recognition and robot control, but these applications keep raising the requirements on algorithm accuracy and complexity, so their implementation faces a series of challenging problems. Recent research on neural network processor architectures shows that an array-based parallel spatial processor architecture, combined with a row-stationary dataflow strategy and dedicated data transmission channels, can make good use of the high parallelism and high data reusability in neural network algorithms, greatly reducing the number of memory accesses and improving the overall energy efficiency of the processor.
The data transmission path is the key medium for data interaction between the storage system and the convolution operation array, and its hardware implementation is mainly concerned with transmitting data with high concurrency while limiting the area and power overhead caused by bandwidth. One option is to send input data directly to every operation unit in the convolution operation array, but as the array scales up the bandwidth overhead of this direct delivery becomes very high. Another option is to partition the data path with a two-level bus, whose bandwidth overhead is comparatively small. The two-level bus is a common hardware implementation that suits the array-based parallel spatial neural network processor architecture and greatly reduces the area and power overhead caused by data bandwidth. This design therefore adopts a row-stationary dataflow strategy and uses a two-level bus structure to complete highly parallel global input of data.
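For intuition about the bandwidth argument above, the sketch below gives a first-order count of dedicated broadcast wires for direct delivery versus a two-level bus (one vertical trunk plus one horizontal trunk per row). It ignores the short taps from each horizontal bus to its operation units, and the 12 x 14 array size is a hypothetical example rather than a figure from this patent; the 73-bit packet width matches the packet format described later.

```python
def direct_broadcast_wires(rows: int, cols: int, packet_bits: int) -> int:
    """Every operation unit gets its own dedicated copy of the packet bus."""
    return rows * cols * packet_bits


def two_level_bus_wires(rows: int, cols: int, packet_bits: int) -> int:
    """One vertical trunk feeds the per-row broadcast units; each row then
    shares a single horizontal trunk among all of its operation units."""
    vertical_trunk = packet_bits            # top module -> vertical bus
    horizontal_trunks = rows * packet_bits  # one horizontal bus per row
    return vertical_trunk + horizontal_trunks


if __name__ == "__main__":
    R, C, W = 12, 14, 73  # hypothetical array size, 73-bit packet
    print("direct broadcast:", direct_broadcast_wires(R, C, W), "wires")
    print("two-level bus   :", two_level_bus_wires(R, C, W), "wires")
```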
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a global broadcast data input circuit for neural network processing which cooperates with a row-stationary dataflow strategy and adopts a two-level bus structure to complete the global transmission of input data.
The invention provides a global broadcast data input circuit for neural network processing, which structurally comprises a top layer module, a vertical bus module, a horizontal bus module and a broadcast transmitting module; wherein:
The top module receives data packets from the storage system and, according to signals inside the packets, automatically records the number of data receptions and automatically switches the identification-number (ID) array. Specifically, the top module computes the number of data transmissions of a single convolution layer from an external control signal and records the number of data rows received, so that the transmission count stays correct, and it sends the ID-array switching signal to the broadcast transmitting units so that data are sent in order. The input of the top module consists of data packets and data tags: a data packet is an array of input data and contains 8-bit input data values, the masks corresponding to those values and a convolution-row end signal; the data tags include row tags and column tags.
The entire global broadcast data input process contains six states: the initialization state (eIdle), configuration state (eConfig), load-control-information state (eLoadctrl), read-data-tag state (eUpdatetag), read-input-data-packet state (eTrans) and current-packet-transfer-complete state (eTransdone), represented by s0, s1, s2, s3, s4 and s5 respectively. The state-jump conditions are: ① the configuration start signal; ② the configuration end signal; ③ an unconditional jump; ④ a new data tag array is read from the external FIFO; ⑤ an input data packet is read from the external FIFO and decoded; ⑥ the data of the current row have not finished transmitting; ⑦ the data of the current row have finished loading and input transmission of the next row starts; ⑧ all data of the current pass have finished loading, where a pass refers to the calculation of the current channel. After power-on the hardware is in state s0, and when condition ① is detected it enters state s1. The ID array is configured in state s1, and when condition ② is satisfied the state jumps to s2. In state s2 the hardware loads the relevant control information in one cycle and, under condition ③, enters state s3. In state s3 the top module reads the data tag array from the external FIFO and holds it until the next read update; under condition ④ the state jumps to s4. In state s4 the top module reads an input data packet from the external FIFO and passes it to the decoding module, which decodes it into the input data, the mask corresponding to the data values and the convolution-row end signal; under condition ⑤ the state jumps to s5. In state s5 the row and column tags obtained in s3, together with the decoded input data, mask and convolution-row end signal obtained in s4, are sent to the vertical bus module for transmission of the current data packet. When transmission of the current packet finishes, a decision is made from the convolution-row end signal and the number of transmitted rows: under condition ⑥ the state jumps to s4, under condition ⑦ it jumps to s3, and under condition ⑧ it jumps to s1.
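A minimal behavioral sketch of the six-state controller described above, written in Python rather than HDL. The boolean arguments stand in for the jump conditions ① to ⑧; their names (cfg_start, tag_ready and so on) are illustrative and do not appear in the patent.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()       # s0: eIdle
    CONFIG = auto()     # s1: eConfig, configure the ID array
    LOADCTRL = auto()   # s2: eLoadctrl, load control information in one cycle
    UPDATETAG = auto()  # s3: eUpdatetag, read the data tag array from the FIFO
    TRANS = auto()      # s4: eTrans, read and decode one input data packet
    TRANSDONE = auto()  # s5: eTransdone, current packet sent to the vertical bus


def next_state(s, cfg_start, cfg_end, tag_ready, pkt_ready,
               row_not_done, row_done, pass_done):
    """One step of the top-module state machine (jump conditions 1 to 8)."""
    if s is State.IDLE:
        return State.CONFIG if cfg_start else s          # condition 1
    if s is State.CONFIG:
        return State.LOADCTRL if cfg_end else s          # condition 2
    if s is State.LOADCTRL:
        return State.UPDATETAG                           # condition 3 (unconditional)
    if s is State.UPDATETAG:
        return State.TRANS if tag_ready else s           # condition 4
    if s is State.TRANS:
        return State.TRANSDONE if pkt_ready else s       # condition 5
    if s is State.TRANSDONE:
        if row_not_done:                                 # condition 6: stay on this row
            return State.TRANS
        if pass_done:                                    # condition 8: whole pass loaded
            return State.CONFIG
        if row_done:                                     # condition 7: start the next row
            return State.UPDATETAG
    return s
```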
The vertical bus module receives the decoded data sent by the top module, including the mask, the ID switching signal and the data tags, copies them, and sends the copies to all broadcast transmitting modules connected between this module and the horizontal bus modules. In addition, when copying, the vertical bus module generates a data packet valid signal Valid and sends it to all broadcast transmitting modules connected between this module and the horizontal bus modules.
The horizontal bus module receives the data sent by the broadcast transmitting modules located between the vertical bus module and the horizontal bus modules, including the mask, the ID switching signal and the column tag, copies them, and sends the copies to all broadcast transmitting modules connected between this module and the operation units. In addition, when copying, the horizontal bus module generates a data packet valid signal Valid and sends it to all broadcast transmitting modules connected between this module and the operation units.
The broadcast transmitting units have two module structures. One is located between the vertical bus module and the horizontal bus module: it selects the corresponding row tag according to the ID switching signal sent by the vertical bus module and compares it with the internal row identification number; if the two match, and both Valid and the non-empty signal ready of the operation unit's input first-in-first-out queue are high, it sends the data, mask, ID switching signal and column tag through the multiplexer Mux to the horizontal bus connected to it; otherwise the related outputs are masked. The other is located between the horizontal bus module and the operation unit: it selects the corresponding column tag according to the ID switching signal sent by the horizontal bus module and compares it with the internal column identification number; if the two match, and both Valid and the non-empty signal ready of the operation unit's input first-in-first-out queue are high, it selects the valid data through the multiplexer Mux and the value mask and sends them to the operation unit; otherwise the related outputs are masked.
The circuit provided by the invention partitions the data path into a two-level bus in the horizontal and vertical directions; the vertical bus module and the horizontal bus modules cooperate to send data with high parallelism, which greatly reduces the extra area and power overhead that the huge bandwidth of a single-bus design would incur. Meanwhile, a handshaking mechanism between operation-unit identification numbers and input data tags is introduced in the broadcast transmitting modules, which increases data reuse, reduces the number of memory accesses of the circuit, and improves its overall energy efficiency while guaranteeing that the input circuit delivers data correctly. The invention can improve the transmission efficiency of input data in neural network processing.
Drawings
Fig. 1 is a basic block diagram of a global broadcast data input circuit structure of the present invention.
Fig. 2 shows the input data packet format.
Fig. 3 is a top module structure diagram.
Fig. 4 is a schematic diagram of data transmission.
Fig. 5 is a diagram of a vertical bus module structure.
Fig. 6 is a block diagram of a horizontal bus module.
Fig. 7 is a block diagram of a broadcast transmitting unit located between a vertical bus module and a horizontal bus module.
Fig. 8 is a block diagram of a broadcast transmitting unit located between a horizontal bus module and an operation unit.
Detailed Description
In the present invention, a basic block diagram of a global broadcast data input circuit structure is shown in fig. 1. The working process of the design is as follows:
The top module records the number of data rows received so far according to the row end signal and sends the automatic switching signal of the identification-number array to the broadcast transmitting units. The input data, together with the corresponding data mask and data tags, are sent to the vertical bus module, which copies them and sends them to the broadcast transmitting units connected to it. The broadcast transmitting units located between the vertical bus module and the horizontal bus modules then complete the sending of data in the row direction according to the result of comparing the row data tag with the identification number stored in the unit; each horizontal bus module receives this input and copies it; finally, the broadcast transmitting units connected to the operation units in each row send the input data to the designated operation units according to the result of comparing the column data tag with the identification number stored in the unit.
The input data packet in this design is a data array that contains several data components. The data format is shown in Fig. 2: bits [72:9] hold the data values, bits [8:1] hold the data mask, and bit [0] is the convolution-row end signal.
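A minimal decoding sketch for the 73-bit packet layout of Fig. 2. It assumes the 64 data bits in [72:9] hold eight 8-bit values whose validity is flagged by the corresponding mask bits in [8:1]; this byte-to-mask-bit correspondence and the helper names are illustrative assumptions, not details taken from the patent.

```python
def decode_packet(packet: int):
    """Split a 73-bit input packet: [72:9] data, [8:1] mask, [0] row-end."""
    row_end = packet & 0x1
    mask = (packet >> 1) & 0xFF
    data_field = (packet >> 9) & ((1 << 64) - 1)
    # Eight 8-bit input values; value i is meaningful only if mask bit i is set.
    values = [(data_field >> (8 * i)) & 0xFF for i in range(8)]
    valid = [bool((mask >> i) & 1) for i in range(8)]
    return values, valid, bool(row_end)


# Example: two valid bytes (0x3C and 0x7F), mask 0b00000011, row-end flag set.
pkt = (0x7F3C << 9) | (0b00000011 << 1) | 0x1
print(decode_packet(pkt))
```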
The structure of the top module is shown in Fig. 3. Its input consists of data packets and data tags: a data packet is an array of input data and contains 8-bit input data values, the masks corresponding to those values and a convolution-row end signal; the data tags include row tags and column tags. The entire global broadcast data input process contains six states: the initialization state eIdle, configuration state eConfig, load-control-information state eLoadctrl, read-data-tag state eUpdatetag, read-input-data-packet state eTrans and current-packet-transfer-complete state eTransdone, represented by s0, s1, s2, s3, s4 and s5 respectively. The state-jump conditions are numbered in the figure: ① the configuration start signal; ② the configuration end signal; ③ an unconditional jump; ④ a new data tag array is read from the external FIFO; ⑤ an input data packet is read from the external FIFO and decoded; ⑥ the data of the current row have not finished transmitting; ⑦ the data of the current row have finished loading and input transmission of the next row starts; ⑧ all data of the current pass have finished loading, where a pass refers to the calculation of the current channel. After power-on the hardware is in state s0, and when condition ① is detected it enters state s1. The ID array is configured in state s1, and when condition ② is satisfied the state jumps to s2. In state s2 the hardware loads the relevant control information in one cycle and, under condition ③, enters state s3. In state s3 the top module reads the data tag array from the external FIFO and holds it until the next read update; under condition ④ the state jumps to s4. In state s4 the top module reads an input data packet from the external FIFO and passes it to the decoding module, which decodes it into the input data, the mask corresponding to the data values and the convolution-row end signal; under condition ⑤ the state jumps to s5. In state s5 the row and column tags obtained in s3, together with the decoded input data, mask and convolution-row end signal obtained in s4, are sent to the vertical bus module for transmission of the current data packet. When transmission of the current packet finishes, a decision is made from the convolution-row end signal and the number of transmitted rows: under condition ⑥ the state jumps to s4, under condition ⑦ it jumps to s3, and under condition ⑧ it jumps to s1. Another important function of the top module is to complete the automatic switching of the identification-number array in cooperation with the row-stationary dataflow strategy. Fig. 4 shows the data transmission scheme: within a single cycle the identification numbers stay unchanged, in the last cycle of a single pass the identification-number array is switched as required, and the passes are repeated according to the number of channels of the convolution layer to complete the ordered configuration of the whole identification-number array.
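A small Python sketch of the pass-by-pass identification-number switching illustrated in Fig. 4. The patent states that the ID values stay fixed within a pass and that the array is switched in the last cycle of each pass, but it does not give the concrete switching rule, so the rotation used here is only an assumed placeholder.

```python
def run_passes(id_array, num_channels, rows_per_pass, send_row):
    """IDs stay fixed within a pass and are switched only at the end of the
    pass; the passes repeat once per channel of the convolution layer."""
    ids = list(id_array)
    for ch in range(num_channels):        # one pass per channel
        for row in range(rows_per_pass):  # ID values unchanged inside the pass
            send_row(ch, row, ids)
        ids = ids[1:] + ids[:1]           # assumed switching rule (placeholder)
    return ids


# Example: 3 channels, 2 rows per pass, print what would be broadcast.
run_passes([0, 1, 2, 3], num_channels=3, rows_per_pass=2,
           send_row=lambda ch, row, ids: print(f"pass {ch} row {row} ids {ids}"))
```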
As shown in Fig. 5, the vertical bus module receives the data, mask, ID switching signal and data tags decoded and sent by the top module, copies them, and sends the copies to all broadcast transmitting modules connected between the vertical bus module and the horizontal bus modules. In addition, when copying, the vertical bus module generates a data packet valid signal Valid and sends it to all broadcast transmitting modules connected between this module and the horizontal bus modules.
The structure of the horizontal bus module is shown in Fig. 6. The horizontal bus module receives the data, mask, ID switching signal and column tag sent by the broadcast transmitting module located between the vertical bus module and the horizontal bus module, copies them, and sends the copies to all broadcast transmitting modules connected between this module and the operation units. In addition, when copying, the horizontal bus module generates a data packet valid signal Valid and sends it to all broadcast transmitting modules connected between this module and the operation units.
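The copy-and-fan-out behavior shared by the vertical and horizontal bus modules can be modeled in a few lines of Python. The dataclass fields mirror the signals named above (data, mask, ID switching signal, row and column tags, Valid), but the class itself is only an illustrative model, not RTL from the patent.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class BusPacket:
    data: int        # decoded input data
    mask: int        # mask corresponding to the data values
    id_switch: int   # ID switching signal from the top module
    row_tag: int
    col_tag: int
    valid: bool = False


def bus_broadcast(pkt: BusPacket, num_ports: int) -> List[BusPacket]:
    """Copy the incoming packet to every attached broadcast transmitting
    module and raise the Valid flag on the copies; the vertical and the
    horizontal bus modules behave the same way, only the fan-out differs."""
    stamped = replace(pkt, valid=True)
    return [stamped for _ in range(num_ports)]
```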
There are two kinds of broadcast transmitting unit. The first kind is located between the vertical bus module and the horizontal bus modules, as shown in Fig. 7: it selects the corresponding row tag according to the ID switching signal sent by the vertical bus module and compares it with the internal row identification number; if the two match, and both Valid and the non-empty signal ready of the operation unit's input first-in-first-out queue are high, it sends the data, mask, ID switching signal and column tag through the multiplexer Mux to the horizontal bus connected to it; otherwise the related outputs are masked. The second kind is located between a horizontal bus module and an operation unit, as shown in Fig. 8: it selects the corresponding column tag according to the ID switching signal sent by the horizontal bus module and compares it with the internal column identification number; if the two match, and both Valid and the non-empty signal ready of the operation unit's input first-in-first-out queue are high, it selects the valid data through the multiplexer Mux and the value mask and sends them to the operation unit; otherwise the related outputs are masked.
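The tag-versus-ID handshake of both kinds of broadcast transmitting unit reduces to the same gating rule: forward only when the selected tag matches the local identification number and both Valid and ready are high. The sketch below models that rule under stated assumptions; in particular, the way the value mask selects bytes out of the data field in the second stage is an assumption, not a detail given in the patent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BusPacket:      # same illustrative model as in the bus-module sketch above
    data: int
    mask: int
    id_switch: int
    row_tag: int
    col_tag: int
    valid: bool = False


def row_stage_forward(pkt: BusPacket, local_row_id: int,
                      ready: bool) -> Optional[BusPacket]:
    """Unit between the vertical bus and a horizontal bus: forward the data,
    mask, ID switching signal and column tag when the row tag matches."""
    if pkt.valid and ready and pkt.row_tag == local_row_id:
        return pkt
    return None       # otherwise the related outputs are masked


def col_stage_forward(pkt: BusPacket, local_col_id: int,
                      ready: bool) -> Optional[List[int]]:
    """Unit between a horizontal bus and an operation unit: when the column
    tag matches, pass only the mask-selected values on to the operation unit."""
    if not (pkt.valid and ready and pkt.col_tag == local_col_id):
        return None
    return [(pkt.data >> (8 * i)) & 0xFF
            for i in range(8) if (pkt.mask >> i) & 1]
```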

Claims (2)

1. A global broadcast data input circuit for neural network processing, characterized in that its structure comprises a top module, a horizontal bus module, a vertical bus module and a broadcast transmitting module; wherein:
the top module is used for receiving data packets from the storage system and, according to signals inside the packets, automatically recording the number of data receptions and automatically switching the identification-number array; specifically, the top module computes the number of data transmissions of a single convolution layer from an external control signal and records the number of data rows received, so that the transmission count stays correct, and sends the ID-array switching signal to the broadcast transmitting units so that data are sent in order; the input of the top module consists of data packets and data tags: a data packet is an array of input data and contains 8-bit input data values, the masks corresponding to those values and a convolution-row end signal; the data tags include row tags and column tags;
the vertical bus module is used for receiving the decoded data sent by the top module, including the mask, the ID switching signal and the data tags, copying them and sending the copies to all broadcast transmitting modules connected between this module and the horizontal bus module; when copying, the vertical bus module also generates a data packet valid signal Valid and sends it to all broadcast transmitting modules connected between the vertical bus module and the horizontal bus module;
the horizontal bus module is used for receiving the data sent by the broadcast transmitting module located between the vertical bus module and the horizontal bus module, including the mask, the ID switching signal and the column tag, copying them and sending the copies to all broadcast transmitting modules connected between this module and the operation units; when copying, the horizontal bus module also generates a data packet valid signal Valid and sends it to all broadcast transmitting modules connected between this module and the operation units;
the broadcast transmitting unit modules are of two kinds: one is located between the vertical bus module and the horizontal bus module, and the other is located between the horizontal bus module and an operation unit; the former selects the corresponding row tag according to the ID switching signal sent by the vertical bus module and compares it with the internal row identification number, and if the two match and both Valid and the non-empty signal ready of the operation unit's input first-in-first-out queue are high, it sends the data, mask, ID switching signal and column tag through the multiplexer Mux to the horizontal bus connected to it, otherwise the related outputs are masked; the latter selects the corresponding column tag according to the ID switching signal sent by the horizontal bus module and compares it with the internal column identification number, and if the two match and both Valid and the non-empty signal ready of the operation unit's input first-in-first-out queue are high, it selects the valid data through the multiplexer Mux and the value mask and sends them to the operation unit, otherwise the related outputs are masked.
2. The global broadcast data input circuit for neural network processing of claim 1, wherein the entire global broadcast data input process comprises six states: the initialization state, configuration state, load-control-information state, read-data-tag state, read-input-data-packet state and current-packet-transfer-complete state, represented by s0, s1, s2, s3, s4 and s5 respectively; the state-jump conditions are: ① the configuration start signal; ② the configuration end signal; ③ an unconditional jump; ④ a new data tag array is read from the external FIFO; ⑤ an input data packet is read from the external FIFO and decoded; ⑥ the data of the current row have not finished transmitting; ⑦ the data of the current row have finished loading and input transmission of the next row starts; ⑧ all data of the current pass have finished loading, where a pass refers to the calculation of the current channel;
after power-on the hardware is in state s0; when jump condition ① (the configuration start signal) is detected, the hardware enters state s1 and the ID array is configured in state s1; when jump condition ② (the configuration end signal) is satisfied, the state jumps to s2; in state s2 the hardware loads the relevant control information in one cycle and, under jump condition ③ (the unconditional jump), enters state s3; in state s3 the top module reads the data tag array from the external FIFO and holds it until the next read update; when jump condition ④ (a new data tag array is read from the external FIFO) is satisfied, the state jumps to s4; in state s4 the top module reads an input data packet from the external FIFO and passes it to the decoding module, which decodes it into the input data, the mask corresponding to the data values and the convolution-row end signal; when jump condition ⑤ (an input data packet is read from the external FIFO and decoded) is satisfied, the state jumps to s5; in state s5 the row and column tags obtained in states s3 and s4, together with the decoded input data, mask and convolution-row end signal, are sent to the vertical bus module for transmission of the current data packet; when transmission of the current packet finishes, a decision is made from the convolution-row end signal and the number of transmitted rows: when condition ⑥ is satisfied (the data of the current row have not finished transmitting), the state jumps to s4; when condition ⑦ is satisfied (the data of the current row have finished loading), the state jumps to s3; and when condition ⑧ is satisfied (all data of the current pass have finished loading), the state jumps to s1.
CN202010746509.6A, filed 2020-07-29 (priority date 2020-07-29): Global broadcast data input circuit for neural network processing. Status: Active. Granted as CN111882051B.

Priority Applications (1)

Application Number: CN202010746509.6A; Priority Date: 2020-07-29; Filing Date: 2020-07-29; Title: Global broadcast data input circuit for neural network processing

Publications (2)

Publication Number: CN111882051A (en), Publication Date: 2020-11-03
Publication Number: CN111882051B (en), Publication Date: 2022-05-20

Family

ID=73201088

Family Applications (1)

Application Number: CN202010746509.6A (Active, granted as CN111882051B); Priority Date: 2020-07-29; Filing Date: 2020-07-29

Country Status (1)

Country Link
CN (1) CN111882051B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112418419B (en) * 2020-11-20 2022-10-11 复旦大学 Data output circuit structure processed by neural network and scheduled according to priority

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178518A (en) * 2019-12-24 2020-05-19 杭州电子科技大学 Software and hardware cooperative acceleration method based on FPGA
CN111199277A (en) * 2020-01-10 2020-05-26 中山大学 Convolutional neural network accelerator

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178518A (en) * 2019-12-24 2020-05-19 杭州电子科技大学 Software and hardware cooperative acceleration method based on FPGA
CN111199277A (en) * 2020-01-10 2020-05-26 中山大学 Convolutional neural network accelerator

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
A Configurable Nonlinear Operation Unit For Neural Network Accelerator; Yujie Cai et al.; 2017 IEEE 12th International Conference on ASIC; 2018-01-11; pp. 319-322 *
SIMD instruction set extension method and implementation for the AES algorithm; Lu Shiting et al.; Computer Engineering; 2011-03-20 (No. 06); pp. 121-123 *
Design of RLWE Cryptoprocessor Based on Vector-Instruction Extension with RISC-V Architecture; Quan Zhang et al.; 2018 14th IEEE International Conference on Solid-State and Integrated Circuit Technology; 2018-12-06; pp. 1-3 *
Design and verification of an AES algorithm based on the AMBA bus; Ling Aimin; China Masters' Theses Full-text Database, Information Science and Technology; 2017-03-15 (No. 3); I136-855 *
FPGA implementation of video data transmission and processing based on the AXI bus; Zhong Xueyan et al.; Computer Measurement & Control; 2015-11-25 (No. 11); pp. 3825-3827 *
Image processing and transmission interface design for an embedded machine vision system; Qiu Yonghua; Electronic Product Reliability and Environmental Testing; 2012-06-20 (No. 03); pp. 65-69 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant