CN111142808B - Access device and access method - Google Patents

Info

Publication number
CN111142808B
Authority
CN
China
Prior art keywords
data
ports
control unit
data storage
write
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010267401.9A
Other languages
Chinese (zh)
Other versions
CN111142808A (en)
Inventor
王必胜
栾国庆
张弥
Current Assignee
Zhejiang Sineva Intelligent Technology Co ltd
Original Assignee
Zhejiang Sineva Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Sineva Intelligent Technology Co ltd filed Critical Zhejiang Sineva Intelligent Technology Co ltd
Priority to CN202010267401.9A
Publication of CN111142808A
Application granted
Publication of CN111142808B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specifically adapted to achieve a particular effect
    • G06F 3/0604 Improving or facilitating administration, e.g. storage management
    • G06F 3/0628 Interfaces making use of a particular technique
    • G06F 3/0629 Configuration or reconfiguration of storage systems
    • G06F 3/0638 Organizing or formatting or addressing of data
    • G06F 3/0644 Management of space entities, e.g. partitions, extents, pools
    • G06F 3/0668 Interfaces adopting a particular infrastructure
    • G06F 3/0671 In-line storage system
    • G06F 3/0683 Plurality of storage devices
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation of neural networks, neurons or parts of neurons using electronic means

Abstract

The application discloses an access device and an access method. A local data storage buffer in the access device stores data of at least one data type used in the calculations of a target hardware acceleration engine. A write control unit, connected to a first number of basic write ports of the local data storage buffer, writes data of at least one data type to be written by the target hardware acceleration engine into the corresponding data storage areas of the buffer in parallel. A read control unit, connected to a first number of basic read ports of the buffer, reads the data of the target data types to be read by the target hardware acceleration engine from the data storage areas of the buffer in parallel. The access device can flexibly configure the local data storage buffer according to the data volumes of the different data types, so the length of the accessed data is not limited to a fixed value.

Description

Access device and access method
Technical Field
The present disclosure relates to the field of integrated circuit technologies, and in particular, to an access device and an access method.
Background
With the development of the Internet of Things and artificial intelligence, the computational demands on hardware keep increasing, and large amounts of data are often processed by a hardware acceleration module. To meet requirements such as fast data access and data caching, a hardware acceleration module (or "hardware acceleration engine") usually adds an on-chip data storage buffer locally, i.e., a local data storage buffer. This data buffer participates directly in the core operations of the data path and is accessed by multiple modules within the hardware acceleration engine, so the local data storage buffer is required to support multi-port parallel access.
At present, for a processor with a low clock frequency, the different memory cells of a multi-port shared memory can be accessed k times within a short interval, i.e., time-shared parallel access, so as to complete the parallel accesses of the different ports. However, the shared memory must use a memory chip with a clock frequency high enough to sustain this time sharing, and the multi-port time-shared access mode is suitable only for low-throughput data transmission; in other words, a shared memory of this structure cannot provide true multi-port parallel access.
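As a minimal sketch of this time-shared scheme (an illustrative model, not code from the patent; all names are hypothetical), the k port requests are simply serviced one per fast memory clock within a single slow port-side clock cycle:

```python
# Hypothetical model of time-shared "parallel" access: the shared memory is
# clocked k times faster than the ports, so all k port requests are served
# sequentially within one slow (port-side) clock cycle.

def serve_time_shared(requests, memory):
    """requests: list of (port_name, address) pairs, one per port.
    Each iteration models one fast-clock memory access; the whole loop
    completes within a single slow cycle. Returns {port_name: value}."""
    results = {}
    for port, addr in requests:      # one fast-clock tick per iteration
        results[port] = memory[addr]
    return results

mem = {0x00: 10, 0x01: 20, 0x02: 30}
print(serve_time_shared([("p0", 0x00), ("p1", 0x01), ("p2", 0x02)], mem))
# → {'p0': 10, 'p1': 20, 'p2': 30}
```

The throughput ceiling is visible here: serving k ports requires a memory running k times faster, which is why the scheme fails for high-throughput engines.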
When a multi-core processor with a high clock frequency accesses a multi-port shared memory, a k-port memory controller uses a k×n crossbar switch or a network-on-chip to connect n different memory units together; the k ports access the n memory units simultaneously through the crossbar or network-on-chip, thereby realizing a multi-port parallel-access memory.
However, although the crossbar switch in such a multi-port shared memory can interconnect the multiple ports, it supports only fixed-length reads and writes; that is, it can only complete read and write operations on a fixed-length data buffer, and the data type stored in that fixed-length buffer is likewise a fixed type.
Disclosure of Invention
The embodiments of the present application provide an access device and an access method, which improve the flexibility of configuring the data types in a local data storage buffer and solve the prior-art problem that a local data storage buffer supports read and write operations only on fixed-length data.
In a first aspect, an access device is provided, which may include:
the local data storage buffer, configured to store data of at least one data type corresponding to the target hardware acceleration engine; the local data storage buffer consists of data storage areas for different data types, composed of a first number of basic storage modules with uniform address coding; wherein the first number is determined according to the amount of data required for the target hardware acceleration engine's calculations, and each basic storage module has a basic read port and a basic write port;
the write-in control unit is connected with the first number of basic write-in ports in the local data storage buffer area and is used for writing data of at least one data type to be written, which corresponds to the target hardware acceleration engine, into a corresponding data storage area of the local data storage buffer area in parallel;
and the reading control unit is connected with the first number of basic reading ports in the local data storage buffer and is used for reading the data of the target data type to be read by the target hardware acceleration engine from the data storage area of the local data storage buffer in parallel.
In an optional implementation, the write control unit is further configured to determine, according to the number of types of the at least one data type, the number of write ports of the write control unit;
the reading control unit is further configured to determine the number of reading ports of the reading control unit according to the type number of the target data type.
In an optional implementation, if the number of the write-in ports of the write-in control unit is a second number N, the write-in ports of the second number N of the write-in control unit are connected to the target hardware acceleration engine, and the output ports of the first number K of the write-in control unit are connected to the basic write-in ports of the first number K in a one-to-one correspondence;
if the number of the reading ports of the reading control unit is the third number M, the reading ports of the third number M of the reading control unit are connected with the target hardware acceleration engine, and the input ports of the first number K of the reading control unit are connected with the basic reading ports of the first number K in a one-to-one correspondence manner.
In an alternative implementation, the write control unit includes the second number N of demultiplexers and the first number K of multiplexers; each demultiplexer includes one input port and the first number K of output ports, and each multiplexer includes the second number N of input ports and one output port;
the second number N of write ports of the write control unit are connected in one-to-one correspondence with the input ends of the second number N of demultiplexers;
the first number K of output ends of each demultiplexer are connected in one-to-one correspondence with the input ends of the first number K of multiplexers;
and the output end of each multiplexer is connected to the basic write port to be written in the corresponding data storage area.
In an alternative implementation, the read control unit includes the third number M of multiplexers and the first number K of demultiplexers; each multiplexer includes the first number K of input ports and one output port, and each demultiplexer includes one input port and the third number M of output ports;
the third number M of read ports of the read control unit correspond one-to-one to the third number M of multiplexers;
the first number K of input ends of each multiplexer are connected in one-to-one correspondence with the output ends of the first number K of demultiplexers;
and the input end of each demultiplexer is connected to the basic read port to be read in the corresponding data storage area.
In an alternative implementation, if the target hardware acceleration engine is a convolutional neural network engine, the at least one data type includes a weight type, a feature-value type, and a convolution partial-sum type.
In a second aspect, an access method is provided, applied to an access device, and the method may include:
writing, in parallel, data of at least one data type corresponding to a target hardware acceleration engine into the data storage areas of a local data storage buffer in the access device;
storing the data of the at least one data type corresponding to the target hardware acceleration engine;
or, reading, in parallel, the data of the target data types to be read by the target hardware acceleration engine from the data storage areas of the local data storage buffer.
In an optional implementation, determining the number of write ports of a write control unit in the access device according to the type number of at least one data type corresponding to the target hardware acceleration engine;
and determining the number of reading ports of a reading control unit in the access device according to the type number of the target data types to be read by the target hardware acceleration engine.
In a third aspect, an electronic device is provided, which includes a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps as described in any one of the above second aspects when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, having stored therein a computer program which, when executed by a processor, carries out the method steps of any of the above second aspects.
The embodiment of the invention provides an access device. A local data storage buffer in the access device stores data of at least one data type corresponding to a target hardware acceleration engine; the local data storage buffer consists of data storage areas for different data types, composed of a first number of basic storage modules with uniform address coding, where the first number is determined according to the amount of data required for the target hardware acceleration engine's calculations, and each basic storage module has a basic read port and a basic write port. The write control unit is connected to the first number of basic write ports of the local data storage buffer and writes data of at least one data type to be written by the target hardware acceleration engine into the corresponding data storage areas of the buffer in parallel; the read control unit is connected to the first number of basic read ports of the buffer and reads the data of the target data types to be read by the target hardware acceleration engine from the data storage areas in parallel. The access device can flexibly configure the local data storage buffer according to the data volumes of the different data types, thereby solving the prior-art problem that the local data storage buffer supports read and write operations only on fixed-length data.
Drawings
Fig. 1 is a schematic structural diagram of an access device according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a write control unit according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a read control unit according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating an accessing method according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without any creative effort belong to the protection scope of the present application.
The access device provided by the embodiment of the invention can be installed on a server or a terminal. The Terminal may be a User Equipment (UE) such as a Mobile phone, a smart phone, a notebook computer, a digital broadcast receiver, a Personal Digital Assistant (PDA), a tablet computer (PAD), a handheld device, a vehicle-mounted device, a wearable device, a computing device or other processing devices connected to a wireless modem, a Mobile Station (MS), a Mobile Terminal (Mobile Terminal), or the like.
As shown in fig. 1, the access device may include a local data storage buffer, a write control unit, and a read control unit.
The local data storage buffer consists of data storage areas for different data types, composed of a first number K of basic storage modules with uniform address encoding. Each basic storage module has one basic read port and one basic write port; the first number K may be determined according to the amount of data required for the target hardware acceleration engine's calculations, or may be preset by a technician, which is not limited in the embodiment of the present invention.
The local data storage buffer area is used for storing data of at least one data type corresponding to the target hardware acceleration engine, wherein the data of at least one data type comprises data required by calculation of the target hardware acceleration engine and data obtained by calculation of the target hardware acceleration engine;
it should be understood that the target hardware acceleration engine is only a hardware module that needs to access the local data storage buffer for accessing data, and the local data storage buffer may also store other hardware modules that need to perform data access operations, which is not limited herein.
Optionally, if the target hardware acceleration engine is a convolutional neural network engine, the at least one data type includes a weight type, a feature-value type, and a convolution partial-sum type; that is, the data of the at least one data type are weight coefficients, images or feature values, and convolution partial sums. The weight coefficients and the images can be obtained from a memory, the convolution partial sums are produced by each calculation pass of the target hardware acceleration engine, and the engine accumulates the partial sums obtained from successive passes to obtain a feature value.
In order to improve flexibility of configuring the local data storage buffer, a corresponding data storage area may be configured according to data types of data required for calculation and data obtained by calculation of the current target hardware acceleration engine. The number of the basic memory modules occupied by the data memory areas of different data types can be different or the same.
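The flexible per-layer partitioning described above can be sketched as follows (an illustrative model with hypothetical names, not the patent's implementation): contiguous runs of basic storage modules are assigned to each data type's storage area, and the counts may change from layer to layer:

```python
# Hypothetical model of partitioning K uniform basic storage modules
# among data-type regions; block counts are a per-layer configuration.

def partition_blocks(region_sizes, total_blocks):
    """Assign contiguous block-index ranges to each named region.

    region_sizes: list of (name, block_count) in allocation order.
    Returns {name: (first_block, last_block)}, 0-based inclusive indices.
    """
    layout, next_block = {}, 0
    for name, count in region_sizes:
        layout[name] = (next_block, next_block + count - 1)
        next_block += count
    assert next_block <= total_blocks, "configuration exceeds the buffer"
    return layout

# First-layer example from the embodiment below: 2 + 13 + 12 = 27 blocks.
layout = partition_blocks(
    [("weights", 2), ("image", 13), ("partial_sums", 12)], total_blocks=27)
print(layout["image"])  # → (2, 14), i.e. B3..B15 in the patent's 1-based naming
```

Reconfiguring for another layer is just a new call with different counts, which is the flexibility the paragraph claims.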
The write control unit is connected to the first number K of basic write ports in the local data storage buffer, and the read control unit is connected to the first number K of basic read ports in the local data storage buffer.
And the write-in control unit is used for writing the data of at least one data type corresponding to the target hardware acceleration engine into the corresponding data storage area in parallel.
And the reading control unit is used for reading the data of the target data type to be read by the target hardware acceleration engine in parallel from the data storage area of the local data storage buffer area.
Optionally, the write control unit is further configured to determine the number of write ports of the write control unit according to the number of types of the at least one data type.
And the reading control unit is also used for determining the number of the reading ports of the reading control unit according to the type number of the target data types to be read by the target hardware acceleration engine.
If the number of the types of the at least one data type is the second number N, the number of the write ports of the write control unit is the second number N. As shown in fig. 2, when the number of the write ports of the write control unit is the second number N, the write ports of the write control unit of the second number N are connected to the target hardware acceleration engine, and the output ports of the write control unit of the first number K are connected to the basic write ports of the first number K in a one-to-one correspondence manner.
Wherein the write control unit may include a second number N of demultiplexers and a first number K of multiplexers; each demultiplexer comprises an input port and a first number K of output ports and each multiplexer comprises a second number N of input ports and an output port.
The second number N of write ports of the write control unit are connected in one-to-one correspondence with the input ends of the second number N of demultiplexers; the first number K of output ends of each demultiplexer are connected in one-to-one correspondence with the input ends of the first number K of multiplexers; and the output end of each multiplexer is connected to the basic write port to be written in the corresponding data storage area.
The demultiplexer determines the target basic storage module to be written according to the inter-block address of the basic storage module corresponding to the type of the data to be written.
The multiplexer determines, according to the target basic storage module to be written, the basic write port of the basic storage module corresponding to that target.
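The write path formed by the demultiplexers and multiplexers can be modeled behaviorally as follows (a software sketch with hypothetical names; the real unit is combinational hardware): each write is fanned out by its inter-block address and accepted only by the multiplexer statically configured for that source port:

```python
# Behavioral sketch of the write crossbar: each of N write ports feeds a
# 1-to-K demultiplexer keyed by the inter-block address; each of the K
# basic write ports is driven by an N-to-1 multiplexer whose select is
# the configured source port for that block.

def route_writes(writes, block_source):
    """writes: {port: (inter_block_addr, intra_addr, value)}
    block_source: {block: port}, the static per-layer mux configuration.
    Returns {block: (intra_addr, value)} for blocks actually written."""
    routed = {}
    for port, (block, intra, value) in writes.items():
        # Demultiplexer: the port's data appears only at output `block`.
        # The multiplexer at that block accepts it only if configured
        # to select this source port.
        if block_source.get(block) == port:
            routed[block] = (intra, value)
    return routed

# Three ports writing three data types in the same cycle (N = 3):
cfg = {0: "w", 1: "w", 2: "img", 15: "psum"}          # block -> source port
cycle = {"w": (0, 0x10, 7), "img": (2, 0x3FF, 5), "psum": (15, 0, -1)}
print(route_writes(cycle, cfg))  # each write lands on a distinct block
```

Because each data type owns a disjoint set of blocks, the N writes can never collide on a basic write port, which is what makes the parallel write conflict-free.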
If the number of types of the corresponding data types to be read by the target hardware acceleration engine is the third number M, the number of the read ports of the read control unit is the third number M. As shown in fig. 3, when the number of the read ports of the read control unit is the third number M, the read ports of the read control unit of the third number M are connected to the target hardware acceleration engine, and the input ports of the read control unit of the first number K are connected to the basic read ports of the first number K in a one-to-one correspondence manner.
Wherein the read control unit may include a third number M of multiplexers and a first number K of demultiplexers; each multiplexer comprises a first number K of input ports and one output port, and each demultiplexer comprises one input port and a third number M of output ports.
The third number M of read ports of the read control unit correspond one-to-one to the third number M of multiplexers; the first number K of input ends of each multiplexer are connected in one-to-one correspondence with the output ends of the first number K of demultiplexers; and the input end of each demultiplexer is connected to the basic read port to be read in the corresponding data storage area.
The multiplexer determines the target basic storage module to be read according to the basic read port of the basic storage module corresponding to the data type to be read.
The demultiplexer determines, according to the target basic storage module to be read, the basic read port of the basic storage module corresponding to that target.
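The read path can be modeled symmetrically (again a behavioral sketch with hypothetical names): each basic read port fans out through a 1-to-M demultiplexer, and each read port receives data only from the block configured for it:

```python
# Behavioral sketch of the read crossbar, mirroring the write path: each
# of the K basic read ports fans out through a 1-to-M demultiplexer, and
# each of the M read ports selects one block through a K-to-1 multiplexer.

def route_reads(requests, memory, block_dest):
    """requests: {port: (block, intra_addr)} issued in one cycle.
    memory: {(block, intra_addr): value}, contents of the basic modules.
    block_dest: {block: port}, the static per-layer demux configuration.
    Returns {port: value} for requests whose routing is configured."""
    out = {}
    for port, (block, intra) in requests.items():
        if block_dest.get(block) == port:   # demux routes this block to `port`
            out[port] = memory.get((block, intra))
    return out

mem = {(0, 0x10): 7, (2, 0x3FF): 5, (15, 0): -1, (15, 1): 9}
cfg = {0: "w_out", 2: "img_out", 15: "psum_out"}
reads = {"w_out": (0, 0x10), "img_out": (2, 0x3FF), "psum_out": (15, 1)}
print(route_reads(reads, mem, cfg))
# → {'w_out': 7, 'img_out': 5, 'psum_out': 9}
```

The time-shared final-accumulation port described later in the embodiment corresponds to rewriting `block_dest` for blocks B16 to B27 between phases.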
It should be noted that the specific configuration of each demultiplexer and each multiplexer is performed independently.
Furthermore, a convolutional neural network engine generally has multiple network layers that perform convolution calculations. Because the layers differ in the number of neural network nodes and in the weight coefficients of each node, the storage areas needed for the calculated data of the different data types, i.e., the numbers of basic storage modules, differ from layer to layer.
It can be seen that the local data storage buffer can be dynamically resized: the sizes of its functional areas can be changed dynamically as the requirements of the different convolutional network layers change, so as to meet the storage requirements of the different data types. This realizes the flexible configuration of the local data storage buffer.
In an example, taking the target hardware acceleration engine as a convolutional neural network engine: for the write ports of the write control unit, if the types of data to be written are the weight coefficients, the images or feature values, and the convolution partial sums, the write control unit has three write ports, i.e., N = 3, corresponding respectively to the three types of data to be written. For the read ports of the read control unit, if the types of data to be read are the weight coefficients, the images or feature values, the convolution partial sums, and the final convolution accumulated sum, the read control unit has four read ports, i.e., M = 4, corresponding respectively to the four types of data to be read.
The local data storage buffer needs 108 KB in total; the size of a basic storage module is 4 KB (the basic Block RAM of a common FPGA is 4 KB, which makes it convenient to verify with FPGA Block RAM resources), and the data bit width is 16 bits, so the first number K is 27. The intra-block address width is 11 bits, and the inter-block address width is 5 bits (2^5 = 32 > 27), i.e., the configured inter-block addresses run from 00000b to 11010b.
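The address-width arithmetic in this paragraph can be checked directly (the constants come from the embodiment; the variable names are ours):

```python
import math

# Address-width arithmetic from the embodiment: 108 KB buffer, 4 KB basic
# modules, 16-bit data words.
total_kb, block_kb, word_bits = 108, 4, 16

num_blocks = total_kb // block_kb                      # K = 27 basic modules
words_per_block = block_kb * 1024 // (word_bits // 8)  # 2048 16-bit words
intra_bits = words_per_block.bit_length() - 1          # 11-bit intra-block address
inter_bits = math.ceil(math.log2(num_blocks))          # 5 bits, since 2**5 = 32 > 27

print(num_blocks, intra_bits, inter_bits)  # → 27 11 5

# A full buffer address concatenates {inter, intra}: the last word of
# block 26 (B27) sits at 0xD7FF, matching the ranges quoted below.
assert (26 << intra_bits) | (words_per_block - 1) == 0xD7FF
```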
The write control unit comprises 3 demultiplexers and 27 3-to-1 multiplexers, where each demultiplexer is a 1-to-27 demultiplexer, i.e., each demultiplexer comprises 27 output ports.
The read control unit comprises 4 multiplexers and 27 1-to-4 demultiplexers, where each multiplexer is a 27-to-1 multiplexer.
Further, taking data required for convolution calculation of two layers of networks as an example, the following is described:
for the configuration of the local data storage buffer in the first network layer:
configuration of data types in local data storage buffers:
the space requirements for each data type entered are as follows:
the input weight coefficients are distributed to a Block1 (hereinafter referred to as B1) basic storage module space and a B2 basic storage module space;
the input image data is allocated to B3 to B15 basic storage module spaces;
the input convolution partial sums are allocated to the B16 to B27 basic storage module spaces;
the weight coefficient output port can only read the input weight coefficients of the first network layer from the B1 and B2 basic storage module spaces;
the image or feature-value output port can only read the input image data of the first network layer from the B3 to B15 basic storage module spaces;
the convolution partial-sum output port can only read, from the B16 to B27 basic storage module spaces, the convolution partial sums generated during the convolution calculation of the first network layer, for use in the subsequent convolution calculations of that layer;
the final convolution accumulation output port can only read, from the B16 to B27 basic storage module spaces, the final convolution sum formed after all convolution calculations of the first network layer are completed. This port reads the B16 to B27 basic storage module spaces in a time-shared manner with the convolution partial-sum output port.
Configuration of input ports in local data store buffer:
the corresponding input ports are a weight coefficient input port, an image data input port, a convolution part and a data input port.
(1) The inter-block address input range of the weight coefficient input port is 00000b to 00001b; combined with the 11-bit intra-block address (000h to 7FFh), the addresses at which the weight coefficient port writes into the local data storage buffer are 0000h to 0FFFh. The input ports of the 1st and 2nd 3-to-1 multiplexers are connected to the output ports of the demultiplexer that outputs the weight coefficients, with configuration parameter 00b;
(2) the inter-block address input range of the image data input port is 00010b to 01110b; combined with the 11-bit intra-block address, the addresses at which the image port writes into the local data storage buffer are 1000h to 77FFh. The input ports of the 3rd to 15th 3-to-1 multiplexers are connected to the output ports of the demultiplexer that outputs the image data, with configuration parameter 01b;
(3) the inter-block address input range of the convolution partial-sum input port is 01111b to 11010b; combined with the 11-bit intra-block address, the addresses at which this port writes into the local data storage buffer are 7800h to D7FFh. The input ports of the 16th to 27th 3-to-1 multiplexers are connected to the output ports of the demultiplexer that outputs the convolution partial sums, with configuration parameter 10b;
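The address ranges above follow from concatenating the 5-bit inter-block address with the 11-bit intra-block address; a quick check (0-based block indices, word addresses, hypothetical helper name):

```python
# Word-address spans implied by the block allocation (11-bit intra-block
# addresses; blocks are B1..B27 in the patent's 1-based naming, 0-based here).

INTRA_BITS = 11

def block_range(first_block, last_block):
    """Inclusive word-address span covered by a contiguous run of basic
    storage modules, given 0-based inclusive block indices."""
    lo = first_block << INTRA_BITS
    hi = ((last_block + 1) << INTRA_BITS) - 1
    return lo, hi

# First network layer: B1-B2 weights, B3-B15 image, B16-B27 partial sums.
print([hex(a) for a in block_range(0, 1)])    # → ['0x0', '0xfff']
print([hex(a) for a in block_range(2, 14)])   # → ['0x1000', '0x77ff']
print([hex(a) for a in block_range(15, 26)])  # → ['0x7800', '0xd7ff']
```

These reproduce exactly the 0000h to 0FFFh, 1000h to 77FFh, and 7800h to D7FFh ranges stated for the three input ports.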
configuration of output ports in local data store buffer:
The corresponding output ports are a weight coefficient output port, an image or feature-value output port, a convolution partial-sum output port, and a final convolution accumulation output port.
(1) The inter-block address input range of the weight coefficient output port is 00000b to 00001b, 11 bits of intra-block address 000 h-7 ffh are added, namely the weight coefficient output port reads the weight coefficient from the storage space with the local data storage buffer address of 0000h to 0 fffh; wherein, the output ports of the 1 st to 4-path multi-path distributor and the 2 nd 1 to 4-path multi-path distributor are respectively connected with the 1 st and 2 nd output ports of the 27-to-1-path multi-path selector for outputting weight coefficients, and the configuration parameter is 00 b;
(2) the input range of the inter-block address of the image or characteristic value output port is 00010b to 01110b, the inter-block address and the 11-bit intra-block address are combined together, namely the image or characteristic value output port reads image data from a storage and payment space with the address of a local data storage buffer area of 1000h to 77 ffh; wherein, the output ports of the 3 rd 1 to 4-way demultiplexer to the 15 th 1 to 4-way demultiplexer are respectively connected with the 3 rd to 15 th output ports of the 27-to-1-way demultiplexer for outputting image data, and the configuration parameter is 01 b;
(3) The inter-block address input range of the convolution partial-sum data output port is 01111b to 11010b; combined with the 11-bit intra-block address, the convolution partial-sum data output port reads partial sums from the storage space at local data storage buffer addresses 7800h to D7FFh. The output ports of the 16th to 27th 1-to-4 demultiplexers are connected to the 16th to 27th input ports of the 27-to-1 multiplexer that outputs the convolution partial sums, with configuration parameter 10b;
(4) The inter-block address input range of the final convolution accumulated-sum output port is 01111b to 11010b; combined with the 11-bit intra-block address, the final convolution accumulated-sum output port reads the final convolution sums from the storage space at local data storage buffer addresses 7800h to D7FFh. Because this port is time-division multiplexed with the convolution partial-sum data output port, the output ports of the 16th to 27th 1-to-4 demultiplexers are reconfigured to connect to the 16th to 27th input ports of the 27-to-1 multiplexer that outputs the final convolution accumulated sums, with configuration parameter 11b.
For the configuration of the local data storage buffer in the second network layer:
Configuration of the data types in the local data storage buffer:
The storage space required by each input data type is as follows:
The input weight coefficients are allocated to the storage spaces of basic storage modules B1 to B4;
the input image data are allocated to the storage spaces of basic storage modules B5 to B16;
the input convolution partial-sum data are allocated to the storage spaces of basic storage modules B17 to B27;
The weight coefficient output port can only read the weight coefficients of the second network layer from the storage spaces of basic storage modules B1 to B4;
the image/feature value output port can only read the input feature data of the second network layer from the storage spaces of basic storage modules B5 to B16;
the convolution partial-sum data output port can only read the partial sums generated during the convolution calculations of the second network layer from the storage spaces of basic storage modules B17 to B27, to support the layer's subsequent convolution calculations;
the final convolution accumulated-sum output port can only read, from the storage spaces of basic storage modules B17 to B27, the final convolution sums formed after all convolution calculations of the second network layer are complete. This port reads the B17 to B27 storage spaces in time-shared fashion with the convolution partial-sum data output port.
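The second-layer allocation above can be tabulated and the corresponding flat address ranges derived mechanically. The following is a hedged sketch (dictionary keys and 0-based block indices are assumptions for illustration, not patent terminology), using 2K basic modules as described earlier.

```python
# Illustrative allocation table for the second network layer: each data
# type owns a contiguous run of basic storage modules (B1 -> index 0),
# and each module covers 0x800 flat addresses.

BLOCK = 0x800
LAYER2_ALLOCATION = {
    "weight":      range(0, 4),    # B1..B4
    "image":       range(4, 16),   # B5..B16
    "partial_sum": range(16, 27),  # B17..B27 (shared with final-sum port)
}

def address_range(blocks: range) -> tuple[int, int]:
    """Flat address span covered by a contiguous run of basic modules."""
    return blocks[0] * BLOCK, (blocks[-1] + 1) * BLOCK - 1

for name, blocks in LAYER2_ALLOCATION.items():
    lo, hi = address_range(blocks)
    print(f"{name}: {lo:04X}h..{hi:04X}h")
# weight: 0000h..1FFFh, image: 2000h..7FFFh, partial_sum: 8000h..D7FFh
```

The printed ranges match the per-port addresses listed in the input- and output-port configurations for the second layer.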
Configuration of the input ports of the local data storage buffer:
The corresponding input ports are the weight coefficient input port, the image data input port, and the convolution partial-sum data input port.
(1) The inter-block address input range of the weight coefficient input port is 00000b to 00011b; combined with the 11-bit intra-block address 000h to 7FFh, the addresses written by the weight coefficient port into the local data storage buffer are 0000h to 1FFFh. The input ports of the 1st to 4th 3-to-1 multiplexers are connected to the corresponding output ports of the demultiplexer that outputs the weight coefficients, with configuration parameter 00b;
(2) The inter-block address input range of the image data input port is 00100b to 01111b; combined with the 11-bit intra-block address, the addresses written by the image port into the local data storage buffer are 2000h to 7FFFh. The input ports of the 5th to 16th 3-to-1 multiplexers are connected to the corresponding output ports of the demultiplexer that outputs the image data, with configuration parameter 01b;
(3) The inter-block address input range of the convolution partial-sum data input port is 10000b to 11010b; combined with the 11-bit intra-block address, the addresses written by this port into the local data storage buffer are 8000h to D7FFh. The input ports of the 17th to 27th 3-to-1 multiplexers are connected to the corresponding output ports of the demultiplexer that outputs the convolution partial sums, with configuration parameter 10b;
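The write-side routing just described can be summarized as a per-module selector: each of the 27 basic write ports sits behind a 3-to-1 multiplexer whose 2-bit configuration parameter picks which input demultiplexer feeds it. The following sketch follows the parameter values listed above (00b/01b/10b); the function and dictionary names are illustrative assumptions, not patent terms.

```python
# Hedged model of the second-layer write-side configuration: the 2-bit
# parameter of each 3-to-1 multiplexer selects the data source routed
# to that basic storage module's write port.

WRITE_SELECT = {0b00: "weight", 0b01: "image", 0b10: "partial_sum"}

def layer2_write_config(block_index: int) -> int:
    """2-bit mux parameter for basic module B{block_index+1} (layer 2)."""
    if block_index < 4:      # B1..B4: weight coefficient input port
        return 0b00
    if block_index < 16:     # B5..B16: image data input port
        return 0b01
    return 0b10              # B17..B27: convolution partial-sum input port

# Spot checks against the configuration listed above.
assert WRITE_SELECT[layer2_write_config(0)] == "weight"
assert WRITE_SELECT[layer2_write_config(10)] == "image"
assert WRITE_SELECT[layer2_write_config(26)] == "partial_sum"
```

Reconfiguring the buffer for a different layer amounts to rewriting this small parameter table, which is what makes the flexible per-layer allocation possible.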
Configuration of the output ports of the local data storage buffer:
The corresponding output ports are the weight coefficient output port, the image/feature value output port, the convolution partial-sum data output port, and the final convolution accumulated-sum output port.
(1) The inter-block address input range of the weight coefficient output port is 00000b to 00011b; combined with the 11-bit intra-block address 000h to 7FFh, the weight coefficient output port reads weight coefficients from the storage space at local data storage buffer addresses 0000h to 1FFFh. The output ports of the 1st to 4th 1-to-4 demultiplexers are connected to the 1st to 4th input ports of the 27-to-1 multiplexer that outputs the weight coefficients, with configuration parameter 00b;
(2) The inter-block address input range of the image/feature value output port is 00100b to 01111b; combined with the 11-bit intra-block address, the image/feature value output port reads image data from the storage space at local data storage buffer addresses 2000h to 7FFFh. The output ports of the 5th to 16th 1-to-4 demultiplexers are connected to the 5th to 16th input ports of the 27-to-1 multiplexer that outputs image data, with configuration parameter 01b;
(3) The inter-block address input range of the convolution partial-sum data output port is 10000b to 11010b; combined with the 11-bit intra-block address, the convolution partial-sum data output port reads partial sums from the storage space at local data storage buffer addresses 8000h to D7FFh. The output ports of the 17th to 27th 1-to-4 demultiplexers are connected to the 17th to 27th input ports of the 27-to-1 multiplexer that outputs the convolution partial sums, with configuration parameter 10b;
(4) The inter-block address input range of the final convolution accumulated-sum output port is 10000b to 11010b; combined with the 11-bit intra-block address, the final convolution accumulated-sum output port reads the final accumulated sums from the storage space at local data storage buffer addresses 8000h to D7FFh. Because this port is time-division multiplexed with the convolution partial-sum data output port, the output ports of the 17th to 27th 1-to-4 demultiplexers are reconfigured to connect to the 17th to 27th input ports of the 27-to-1 multiplexer that outputs the final convolution accumulated sums, with configuration parameter 11b.
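The read-side routing mirrors the write side, with one extra twist: the partial-sum and final-sum output ports share blocks B17 to B27, so the 2-bit demultiplexer parameter for those blocks is rewritten once the layer's convolutions finish (the time-division multiplexing described above). A hedged sketch, with illustrative names not taken from the patent:

```python
# Model of the second-layer read-side configuration: each basic read
# port drives a 1-to-4 demultiplexer whose 2-bit parameter picks one of
# four 27-to-1 output multiplexers (one per output port type).

READ_PORTS = {0b00: "weight_out", 0b01: "image_out",
              0b10: "partial_sum_out", 0b11: "final_sum_out"}

def layer2_read_config(block_index: int, convolution_done: bool) -> int:
    """2-bit demux parameter for basic module B{block_index+1} (layer 2)."""
    if block_index < 4:            # B1..B4 -> weight coefficient output
        return 0b00
    if block_index < 16:           # B5..B16 -> image/feature value output
        return 0b01
    # B17..B27: partial sums during convolution (10b), reconfigured to
    # the final accumulated-sum output (11b) once the layer completes.
    return 0b11 if convolution_done else 0b10

assert READ_PORTS[layer2_read_config(20, False)] == "partial_sum_out"
assert READ_PORTS[layer2_read_config(20, True)] == "final_sum_out"
```

Only the B17 to B27 entries ever change within a layer; the weight and image entries change only when the buffer is reconfigured for a new network layer.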
Therefore, the access device provided by the embodiments of the invention can flexibly configure the local data storage buffer according to the data volumes of the different data types, solving the prior-art problem that a local data storage buffer supports read and write operations only on fixed-length data.
Corresponding to the above access device, an embodiment of the present invention further provides an access method, as shown in fig. 4. The execution subject of the method is the access device, and the method includes:
S410, writing data of at least one data type corresponding to the target hardware acceleration engine into the data storage area of the local data storage buffer in parallel.
S420, storing the data of at least one data type corresponding to the target hardware acceleration engine.
The local data storage buffer is a set of data storage areas for different data types, composed of a first number K of basic storage modules with uniform address coding, and is used to store the data of at least one data type involved in the target hardware acceleration engine's calculations.
The first number K is determined according to the amount of data required by the target hardware acceleration engine's calculations, and each basic storage module has a basic read port and a basic write port.
S430, reading in parallel, from the data storage area of the local data storage buffer, the data of the target data types to be read by the target hardware acceleration engine.
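The three steps above can be modeled as a toy buffer with typed regions; the class and method names below are invented for illustration and make no claim about the patent's implementation.

```python
# Toy model of S410..S430: typed write ports store into disjoint regions
# of one uniformly addressed buffer, and typed read ports fetch from
# the same regions.

class LocalDataStoreBuffer:
    def __init__(self, num_modules: int = 27, module_size: int = 0x800):
        self.mem = [0] * (num_modules * module_size)
        self.regions = {}               # data type -> (start, end) addresses

    def configure(self, regions: dict) -> None:
        """Assign each data type a flat address range (per-layer setup)."""
        self.regions = regions

    def write(self, data_type: str, offset: int, value: int) -> None:
        """S410/S420: write a value into the region owned by data_type."""
        start, end = self.regions[data_type]
        assert start + offset <= end
        self.mem[start + offset] = value

    def read(self, data_type: str, offset: int) -> int:
        """S430: read a value back from the region owned by data_type."""
        start, end = self.regions[data_type]
        return self.mem[start + offset]

buf = LocalDataStoreBuffer()
buf.configure({"weight": (0x0000, 0x1FFF), "image": (0x2000, 0x7FFF)})
buf.write("image", 5, 123)
assert buf.read("image", 5) == 123
```

In the real device the "regions" are realized by the multiplexer/demultiplexer configuration parameters rather than by software bounds checks, but the data flow is the same.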
In an optional implementation, the number of write ports of the write control unit in the access device is determined according to the number of data types corresponding to the target hardware acceleration engine,
and the number of read ports of the read control unit in the access device is determined according to the number of target data types to be read by the target hardware acceleration engine.
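The port-count rule, and the determination of the first number K from the required data volume, reduce to simple arithmetic. The sketch below is a hedged illustration; the example data types and the 2K module size reflect the convolutional-engine example used throughout, and the function names are assumptions.

```python
# Port counts: one write port per input data type, one read port per
# data type to be read. K: number of 2K basic storage modules needed
# for the engine's total data volume.

import math

def port_counts(input_types, output_types):
    """(write ports, read ports) from the sets of data types in play."""
    return len(set(input_types)), len(set(output_types))

def first_number_k(total_words: int, module_words: int = 2048) -> int:
    """Basic storage modules needed to hold the required data volume."""
    return math.ceil(total_words / module_words)

writes, reads = port_counts(
    ["weight", "image", "partial_sum"],                  # written
    ["weight", "feature", "partial_sum", "final_sum"],   # read
)
assert (writes, reads) == (3, 4)
assert first_number_k(55296) == 27   # 27 modules of 2K, as in the example
```

These numbers match the example device: three write ports (second number), four read ports (third number), and K = 27 basic storage modules.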
The steps of the access method provided in the above embodiment are implemented by the units described above; therefore, their detailed working processes and beneficial effects are not repeated here.
An embodiment of the present invention further provides an electronic device, as shown in fig. 5, including a processor 510, a communication interface 520, a memory 530 and a communication bus 540, where the processor 510, the communication interface 520, and the memory 530 complete mutual communication through the communication bus 540.
A memory 530 for storing a computer program;
the processor 510, when executing the program stored in the memory 530, implements the following steps:
writing data of at least one data type corresponding to a target hardware acceleration engine into a data storage area of a local data storage buffer area in the access equipment in parallel;
storing data of at least one data type corresponding to the target hardware acceleration engine;
or, reading the data of the target data type to be read by the target hardware acceleration engine in parallel from the data storage area of the local data storage buffer.
In an optional implementation, the number of write ports of a write control unit in the access device is determined according to the number of data types corresponding to the target hardware acceleration engine,
and the number of read ports of a read control unit in the access device is determined according to the number of target data types to be read by the target hardware acceleration engine.
The aforementioned communication bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The memory may include random access memory (RAM) or non-volatile memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Since the implementation and beneficial effects of each component of the electronic device can be understood with reference to the steps of the embodiment shown in fig. 4, the detailed working processes and beneficial effects of the electronic device provided by the embodiment of the present invention are not repeated here.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, which has instructions stored therein, and when the instructions are executed on a computer, the computer is caused to execute the access method described in any of the above embodiments.
In a further embodiment of the present invention, there is also provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the access method of any of the above embodiments.
As will be appreciated by one of skill in the art, the embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the true scope of the embodiments of the present application.
It is apparent that those skilled in the art can make various changes and modifications to the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims of the embodiments of the present application and their equivalents, the embodiments of the present application are also intended to include such modifications and variations.

Claims (7)

1. An access device, characterized in that the device comprises:
the local data storage buffer, used for storing data of at least one data type corresponding to the target hardware acceleration engine; the local data storage buffer is a set of data storage areas for different data types, composed of a first number of basic storage modules with uniform address coding; wherein the first number is determined according to the amount of data required for the target hardware acceleration engine's calculations, and each basic storage module has a basic read port and a basic write port; if the target hardware acceleration engine is a convolutional neural network engine, the at least one data type comprises at least one of a weight type, a feature value type, and a convolution partial-sum type;
the write control unit, connected with the first number of basic write ports in the local data storage buffer, and used for writing the data of the at least one data type to be written, corresponding to the target hardware acceleration engine, into the corresponding data storage areas of the local data storage buffer in parallel;
the read control unit, connected with the first number of basic read ports in the local data storage buffer, and used for reading the data of the target data types to be read by the target hardware acceleration engine from the data storage areas of the local data storage buffer in parallel;
the write control unit is further configured to determine the number of write ports of the write control unit according to the number of types of the at least one data type;
the read control unit is further configured to determine the number of read ports of the read control unit according to the number of types of the target data to be read by the target hardware acceleration engine.
2. The apparatus of claim 1,
if the number of write ports of the write control unit is a second number, the second number of write ports of the write control unit are connected with the target hardware acceleration engine, and the first number of output ports of the write control unit are connected with the first number of basic write ports in one-to-one correspondence;
if the number of read ports of the read control unit is a third number, the third number of read ports of the read control unit are connected with the target hardware acceleration engine, and the first number of input ports of the read control unit are connected with the first number of basic read ports in one-to-one correspondence.
3. The apparatus of claim 2,
the write control unit includes the second number of demultiplexers and the first number of multiplexers; each demultiplexer comprises an input port and said first number of output ports, and each multiplexer comprises said second number of input ports and an output port;
the second number of write ports of the write control unit are connected in one-to-one correspondence with the input ends of the second number of demultiplexers;
the first number of output ends of each demultiplexer are connected in one-to-one correspondence with the input ends of the first number of multiplexers;
and the output end of each multiplexer is connected with the basic write port to be written of the corresponding data storage area.
4. The apparatus of claim 1 or 2,
the read control unit includes a third number of multiplexers and the first number of demultiplexers; each multiplexer includes the first number of input ports and an output port, and each demultiplexer includes an input port and the third number of output ports;
the third number of read ports of the read control unit are connected in one-to-one correspondence with the output ends of the third number of multiplexers;
the first number of input ends of each multiplexer are connected in one-to-one correspondence with the output ends of the first number of demultiplexers;
and the input end of each demultiplexer is connected with the basic read port to be read of the corresponding data storage area.
5. An access method, applied to the access device of claim 1, the method comprising:
writing data of at least one data type corresponding to a target hardware acceleration engine into a data storage area of a local data storage buffer area in the access equipment in parallel;
storing data of at least one data type corresponding to the target hardware acceleration engine;
or, reading the data of the target data type to be read by the target hardware acceleration engine in parallel from the data storage area of the local data storage buffer;
determining the number of write-in ports of a write-in control unit in the access equipment according to the type number of at least one data type corresponding to the target hardware acceleration engine;
determining the number of reading ports of a reading control unit in the access device according to the type number of the target data types to be read by the target hardware acceleration engine;
wherein, if the target hardware acceleration engine is a convolutional neural network engine, the at least one data type includes at least one of a weight type, a feature value type, and a convolution partial-sum type.
6. An electronic device, characterized in that the electronic device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of claim 5 when executing a program stored in the memory.
7. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method steps of claim 5.
CN202010267401.9A 2020-04-08 2020-04-08 Access device and access method Active CN111142808B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010267401.9A CN111142808B (en) 2020-04-08 2020-04-08 Access device and access method


Publications (2)

Publication Number Publication Date
CN111142808A CN111142808A (en) 2020-05-12
CN111142808B true CN111142808B (en) 2020-08-04

Family

ID=70528815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010267401.9A Active CN111142808B (en) 2020-04-08 2020-04-08 Access device and access method


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114442908B (en) * 2020-11-05 2023-08-11 珠海一微半导体股份有限公司 Hardware acceleration system and chip for data processing

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101330433A (en) * 2007-06-20 2008-12-24 中兴通讯股份有限公司 Method and apparatus for managing Ethernet equipment sharing buffer area base on transmission network
CN101776988B (en) * 2010-02-01 2012-11-07 中国人民解放军国防科学技术大学 Restructurable matrix register file with changeable block size
EP3035204B1 (en) * 2014-12-19 2018-08-15 Intel Corporation Storage device and method for performing convolution operations
CN105808454A (en) * 2014-12-31 2016-07-27 北京东土科技股份有限公司 Method and device for accessing to shared cache by multiple ports
US10572225B1 (en) * 2018-09-26 2020-02-25 Xilinx, Inc. Circuit arrangements and methods for performing multiply-and-accumulate operations
CN109740739B (en) * 2018-12-29 2020-04-24 中科寒武纪科技股份有限公司 Neural network computing device, neural network computing method and related products
CN110390385B (en) * 2019-06-28 2021-09-28 东南大学 BNRP-based configurable parallel general convolutional neural network accelerator
CN110751263B (en) * 2019-09-09 2022-07-01 瑞芯微电子股份有限公司 High-parallelism convolution operation access method and circuit
US11726950B2 (en) * 2019-09-28 2023-08-15 Intel Corporation Compute near memory convolution accelerator
CN110880038B (en) * 2019-11-29 2022-07-01 中国科学院自动化研究所 System for accelerating convolution calculation based on FPGA and convolution neural network

Also Published As

Publication number Publication date
CN111142808A (en) 2020-05-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant