CN110046704B

CN110046704B - Deep network acceleration method, device, equipment and storage medium based on data stream

Info

Publication number: CN110046704B
Application number: CN201910280156.2A
Authority: CN
Inventors: 牛昕宇; 蔡权雄
Original assignee: Shenzhen Corerain Technologies Co Ltd
Current assignee: Shenzhen Corerain Technologies Co Ltd
Priority date: 2019-04-09
Filing date: 2019-04-09
Publication date: 2022-11-08
Anticipated expiration: 2039-04-09
Also published as: WO2020206637A1; CN110046704A

Abstract

The application discloses and provides a deep network acceleration method, a device, equipment and a storage medium based on data flow, wherein the method comprises the following steps: acquiring target depth network information required by data to be processed; according to the target deep network information, matching preset target network configuration rules corresponding to the target deep network information, wherein the target network configuration rules comprise pre-configured configuration rules among a computing engine, a first data stream storage module and a global data stream network; configuring to obtain a target data flow network according to the target network configuration rule; and processing the data to be processed through the target data stream network. The deep network is accelerated through the data flow, off-chip data communication is reduced, therefore, no instruction idle overhead exists, the hardware acceleration efficiency of the deep network can be improved, and different deep network models can be configured through network configuration to support various different deep network models.

Description

Deep network acceleration method, device, equipment and storage medium based on data stream

Technical Field

The present application relates to the field of artificial intelligence, and more particularly, to a method, an apparatus, a device, and a storage medium for deep network acceleration based on data streams.

Background

Advances in neural network-based deep learning applications require high processing power on the underlying hardware platform. When CPU-based platforms are unable to meet this growing demand, many companies have developed specialized hardware accelerators to support the advancement in the field. A common idea of existing hardware accelerators is to accelerate certain specific types of computations that are more frequently used in deep learning algorithm applications. Existing hardware architectures are based on instruction execution with an extensible instruction set, and then acceleration is achieved by implementing common computations as custom instructions. Instruction-based architectural implementations are commonly represented as system-on-a-chip (SoC) designs. In instruction-based architectures, many clock cycles are wasted for non-compute dependent operations. To support a more general instruction architecture, computations within deep learning neural networks are typically decomposed into multiple instructions. A computation usually requires multiple clock cycles. Arithmetic and Logic Units (ALUs) in processors are usually a collection of different operations implemented in hardware. Due to the limited instruction expression and the limited I/O bandwidth, most of the ALU resources are in an idle state when executing a single instruction, for example, when multiplication and addition are performed, multiplication data are read first, because the I/O speed is affected by the bandwidth, the addition needs to wait for the completion of multiplication calculation and write in a memory, and then a calculation result and addition data are read and taken out for addition calculation, and in the processes of multiplication calculation and reading and writing, an addition calculation unit is in an idle state. There is a problem in that instruction-based hardware acceleration is inefficient.

Disclosure of Invention

The present application aims to overcome the above-mentioned drawbacks of the prior art, and provides a method, an apparatus, a device, and a storage medium for deep network acceleration based on data flow, which solve the problems of limited instruction expressions and limited I/O bandwidths, and low acceleration efficiency due to the fact that most ALU resources are in an idle state when executing a single instruction.

The purpose of the application is realized by the following technical scheme:

in a first aspect, a deep network acceleration method based on data flow is provided, where the method includes:

acquiring target depth network information required by data to be processed;

according to the target deep network information, matching preset target network configuration rules corresponding to the target deep network information, wherein the target network configuration rules comprise pre-configured configuration rules among a computing engine, a first data stream storage module and a global data stream network;

configuring to obtain a target data flow network according to the target network configuration rule;

and processing the data to be processed through the target data stream network.

Optionally, the configuring, according to the target network configuration rule, to obtain the target data flow network includes:

configuring the parallel or serial connection among a plurality of computing engines according to the global data flow network;

obtaining data flow paths of the plurality of computing engines according to the parallelism or the series between the first data flow storage module and the plurality of computing engines;

forming the target data flow network based on the data flow path.

Optionally, the processing the data to be processed through the target data flow network includes:

reading the data to be processed to the first data stream storage module;

in the first data stream storage module, generating an address sequence for the data to be processed according to a preset generation rule according to the data format and the data path of the data to be processed;

and in each clock cycle, reading the data quantity corresponding to the calculation engine in the target data stream network from the first data stream storage module according to the address sequence, inputting the data quantity, and acquiring the states of the first data stream storage module and the calculation engine.

Optionally, the target network configuration further includes a computing core, a second data stream storage unit, and a local data stream network connecting the computing core and the second buffer, where the configuration of the computing engine includes:

configuring the interconnection between the computing core and the local data flow network to obtain a computing path of the computing core;

configuring the interconnection of the second data stream and the local data stream network in a storage unit to obtain a storage path;

and obtaining the calculation engine according to the calculation path and the storage path.

In a second aspect, a deep network acceleration method based on data flow is further provided, where the method includes:

acquiring target depth network information required by data to be processed;

according to the target depth network information, matching preset target network configuration rules corresponding to the target depth network information, wherein the target network configuration rules comprise a computing core, a second data stream storage module and a local data stream network;

configuring to obtain a target data stream engine according to the target network configuration rule;

and processing the data to be processed through the target data stream engine.

Optionally, the configuring, according to the target network configuration rule, to obtain the target data flow engine includes:

configuring the interconnection between the second data stream storage module and the local data stream network to obtain a storage path;

and obtaining the target data stream engine according to the calculation path and the storage path.

Optionally, the processing the data to be processed by the target data stream engine includes:

reading the data to be processed to the second data stream storage module;

in the second data stream storage module, according to the data format and the data path of the data to be processed, generating an address sequence for the data to be processed according to a preset generation rule;

and in each clock cycle, reading the data quantity corresponding to the calculation core in the target data stream engine from the second data stream storage module according to the address sequence, inputting the data quantity, and acquiring the states of the second data stream storage module and the calculation core.

Optionally, the second data stream storage module includes a first storage unit and a second storage unit, and the processing the data to be processed by the target data stream engine includes:

inputting the data in the first storage unit into a calculation core to obtain a calculation result;

and storing the calculation result in a second storage unit as input data of a next calculation core.

In a third aspect, a deep network acceleration apparatus based on data flow is further provided, where the apparatus includes:

the first acquisition module is used for acquiring target depth network information required by data to be processed;

the first matching module is used for matching preset target network configuration rules corresponding to the target depth network information according to the target depth network information, wherein the target network configuration rules comprise pre-configured configuration rules among a computing engine, a first data stream storage module and a global data stream network;

the first configuration module is used for configuring and obtaining a target data flow network according to the target network configuration rule;

and the first processing module is used for processing the data to be processed through the target data stream network.

In a fourth aspect, there is also provided a deep network acceleration apparatus based on data flow, the apparatus including:

the second acquisition module is used for acquiring target depth network information required by the data to be processed;

the second matching module is used for matching a preset target network configuration rule corresponding to the target depth network information according to the target depth network information, wherein the target network configuration rule comprises a computing core, a second data stream storage module and a local data stream network;

the second configuration module is used for configuring a target data flow engine according to the target network configuration rule;

and the second processing module is used for processing the data to be processed through the target data stream engine.

In a fifth aspect, an electronic device is provided, comprising: the device comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to realize the steps of the deep network acceleration method based on data flow provided by the embodiment of the application.

In a sixth aspect, a computer-readable storage medium is provided, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps in the deep network acceleration method based on data stream provided in the embodiment of the present application.

The beneficial effect that this application brought: the deep network is accelerated through the data flow, off-chip data communication is reduced, therefore, no instruction idle overhead exists, the hardware acceleration efficiency of the deep network can be improved, and different deep network models can be configured through network configuration to support various different deep network models.

Drawings

Fig. 1 is a schematic diagram of an alternative implementation architecture of a deep network acceleration method based on data flow according to an embodiment of the present application;

fig. 2 is a schematic flowchart of a deep network acceleration method based on data flow according to a first aspect of the embodiments of the present application;

fig. 3 is a schematic flowchart of another deep network acceleration method based on data flow according to an embodiment of the present application;

fig. 4 is a schematic flowchart of a deep network acceleration method based on data flow according to a second aspect of the present embodiment;

fig. 5 is a schematic flowchart of another deep network acceleration method based on data flow according to an embodiment of the present application;

fig. 6 is a schematic diagram of a deep network acceleration apparatus based on data flow according to a third aspect of the embodiment of the present application;

fig. 7 is a schematic diagram of a deep network acceleration apparatus based on data flow according to a fourth aspect of the present embodiment.

Detailed Description

The following describes preferred embodiments of the present application, and those skilled in the art will be able to realize the invention and its innovative features and advantages in the present application based on the related art described below.

To further describe the technical solution of the present application, please refer to fig. 1, where fig. 1 is a schematic diagram of an alternative implementation architecture of a deep network acceleration method based on data flow according to an embodiment of the present application, as shown in fig. 1, an architecture 103 is connected to an off-chip memory module (DDR) 101 and a CPU through an interconnect, where the architecture 103 includes: a first storage module 104, a global data stream network 105 and a data stream engine 106, wherein the first storage module 104 is connected to the off-chip storage module 101 through an interconnection and also connected to the global data stream network 105 through an interconnection, and the data stream engine 106 is connected to the global data stream network 105 through an interconnection so that the data stream engine 106 can implement parallel or serial connection. The data flow engine 106 described above may include: the computation cores (or called computation modules), the second storage module 108, and the local data stream network 107, where the computation cores may include kernels for computation, such as a convolution kernel 109, a pooling kernel 110, and an activation function kernel 111, and of course, may also include other computation kernels besides the example convolution kernel 109, pooling kernel 110, and activation function kernel 111, and are not limited herein, and may also include all kernels for computation in the deep network. The first memory module 104 and the second memory module 108 may be on-chip cache modules, and may also be DDR or high-speed DDR memory modules. The data stream engine 106 described above may be understood as a computational engine that supports data stream processing, and may also be understood as a computational engine that is dedicated to data stream processing. The CPU may include a control register, and the control register is preconfigured with a network configuration rule for configuring a network.

The medium-depth network in the present application may also be referred to as a deep learning network, a deep learning neural network, or the like.

The application provides a deep network acceleration method, device and equipment based on data flow and a storage medium.

The purpose of the application is realized by the following technical scheme:

in a first aspect, please refer to fig. 2, fig. 2 is a schematic flowchart of a deep network acceleration method based on data flow according to an embodiment of the present application, and as shown in fig. 2, the method includes the following steps:

201. and acquiring target depth network information required by the data to be processed.

In this step, the data to be processed may be data to be processed through a depth network, such as image data to be recognized, target data to be detected, target data to be tracked, and the like, where the target depth network information corresponds to depth network information of the data to be processed, for example, the data to be processed is the image data to be recognized, the target depth network information is a configuration parameter of the depth network for processing image recognition, and if the data to be processed is the target data to be detected, the target depth network information is a configuration parameter of the depth network for processing target detection, and the target depth network information may be preset, and may be determined by matching the data to be processed, or may be determined by manual selection, and is not limited herein. Obtaining the target deep network information may facilitate configuration of the deep network, and the deep network information may include a network type, a data type, a layer number, a calculation type, and the like.

202. And matching preset target network configuration rules corresponding to the target depth network information according to the target depth network information, wherein the target network configuration rules comprise pre-configured configuration rules among a computing engine, a first data stream storage module and a global data stream network.

The target deep network information already includes a network type, a data type, a number of layers, a calculation type, and the like of a deep network required by the data to be processed, the target network configuration rule may be set in advance, for example, each parameter rule, a calculation rule, and the like in a preset type network such as an image recognition network, a target detection network, a target tracking network, and the like, the parameter rule may be a setting rule of a hyper-parameter, a setting rule of a weight, and the like, and the calculation rule may be a calculation rule such as addition, multiplication, convolution, deconvolution, and the like. The configuration rules among the pre-configured computing engines, the first data stream storage module, and the global data stream network may be understood as the number of computing engines and the connection manner between the computing engines and the global data stream network, the connection manner between the first data stream and the global data stream network, the route connection manner in the global data stream network, and the like. The global data flow network may be configured by control registers. The network implementation may be a router between the first data stream storage module and the compute engine. When multiple compute engines are instantiated in a single architecture, the global data flow network may be configured to send different data to different compute engines for data parallelism, or by serially linking compute engines through their inputs and outputs into a longer compute pipeline, in which more neural network layers may be processed.

In a possible embodiment, the first data stream storage module may include an input data stream storage unit and an output data stream storage unit, and is used for accessing data, that is, the input data stream storage unit inputs input data into the calculation engine for calculation, and the calculation engine outputs a calculation result to the output data stream storage unit for storage, so as to prevent the output result of the calculation engine from being unable to be written into the input data stream storage unit when the input data stream storage unit inputs data into the calculation engine, for example, the calculation engine needs to repeat calculation for 2 times on one data in the input data stream storage unit, and after the first calculation is completed, the calculation engine needs to read the data in the input data stream storage unit for the second time.

203. And configuring to obtain the target data flow network according to the target network configuration rule.

The configuration may be implemented by a connection relationship among a pre-configured computing engine, a first data stream storage module, and a global data stream network, where the connection relationship may include the number of connections of the computing engine, a connection order, and the like, the computing engine may be connected to the global data stream network through interconnections to form a new deep network, and different deep networks may be formed according to different connection numbers and connection orders of the computing engine. And configuring according to the target network configuration rule to obtain a target data flow network for processing the data to be processed. Because each computing engine reads data through the first data stream storage module, the data in the first data stream storage module can be respectively read into different computing engines to form data streams, and instruction set sequencing is not needed, so that the configured computing engines cannot generate computing vacancy.

204. And processing the data to be processed through the target data stream network.

The target data stream network is configured through target network information and may also be referred to as a customized data stream network, and the target data stream network connects the first data stream storage module and the calculation engine through the global data stream network to form a data stream.

In the embodiment, target depth network information required by data to be processed is acquired; according to the target deep network information, matching preset target network configuration rules corresponding to the target deep network information, wherein the target network configuration rules comprise pre-configured configuration rules among a computing engine, a first data stream storage module and a global data stream network; configuring to obtain a target data flow network according to the target network configuration rule; and processing the data to be processed through the target data stream network. The deep network is accelerated through the data flow, off-chip data communication is reduced, therefore, no instruction idle overhead exists, the hardware acceleration efficiency of the deep network can be improved, and different deep network models can be configured through network configuration to support various different deep network models.

It should be noted that the deep network acceleration method based on data flow provided in the embodiment of the present application may be applied to a device for deep network acceleration of data flow, for example: computers, servers, cell phones, etc. may be devices that perform deep network acceleration based on data flow.

Referring to fig. 3, fig. 3 is a schematic flowchart of another deep network acceleration method based on data flow according to an embodiment of the present application, and as shown in fig. 3, the method includes the following steps:

301. and acquiring target depth network information required by the data to be processed.

302. And matching preset target network configuration rules corresponding to the target depth network information according to the target depth network information, wherein the target network configuration rules comprise pre-configured configuration rules among a computing engine, a first data stream storage module and a global data stream network.

303. Configuring the parallelism or the series among a plurality of computing engines according to the global data stream network.

In this step, the global data stream network may be implemented by routing, and the global data stream network may be configured by a control register, where the control register is preset with a corresponding global data stream network configuration rule. The network is implemented as a router between the first data stream storage module and each compute engine, and the network router mainly functions to provide a skip path and a feedback path for data streams between each compute engine. For example, when the computing engine a and the computing engine B are in a global data stream network serial connection, the data stream may be selected from the computing engine a for computing, and then the computing result is flowed to the computing engine B, and when the computing engine a and the computing engine B are in a serial connection, the data stream may be understood as a deepening of a deep network computing layer. The specific configuration can be that the data flow direction is controlled through a global data flow network, so that the parallel or serial configuration among a plurality of computing engines is realized. The above configuration of the parallelism or the series between the plurality of computing engines may be obtained by configuring the global data stream network and the interconnections between the plurality of computing engines, for example, the plurality of computing engines may be interconnected with the global data stream network according to a parallel rule, or the plurality of computing engines may be interconnected with the global data stream network according to a serial rule, and the first data stream storage module is configured and interconnected with the global data stream network.

304. And obtaining data flow paths of the plurality of computing engines according to the parallel or serial connection between the first data flow storage module and the plurality of computing engines.

In this step, the first data stream storage module may be a cache, a DDR or a high-speed access DDR, and in this embodiment of the application, it is preferably a cache, and specifically, a controllable read-write address generation unit may be disposed in the cache. Depending on the input data format and the required computations in the data path, the address generation unit will generate an adapted address sequence to index the data in the buffer. The above address sequence may be used to index the data in the cache and input the data into the corresponding calculation engine, for example, if the calculation engine needs 80 data to perform calculation, the 80 data of the corresponding address sequence are read from the cache and input into the calculation engine. In addition, the address generation unit can also make the generated address sequence have different cycle sizes by setting a counter, for example, one small cycle of data 1, data 2 and data 3, thereby improving the reusability of data and being suitable for the data processing size of each calculation engine. The data stream is stored through the first data stream storage module, and the data stream is controlled to flow to each data node in parallel or serial among a plurality of computing engines, namely, the data stream path is controlled, so that the data processing is carried out in the computing engines like a pipeline, and the efficiency of the data processing is improved.

305. Forming the target data flow network based on the data flow path.

In the step, the first data stream storage module inputs data into the corresponding calculation engine through the global data stream network, and the calculation engine outputs the calculation result into the first data stream storage module through the global data stream network, without instruction control, that is, without the problem that the calculation unit is in an idle state when a single instruction is executed.

306. And processing the data to be processed through the target data stream network.

In the embodiment, the data stream is stored by the first data stream storage module, and the data stream is controlled to flow to each data node in parallel or serial among the plurality of computing engines, namely, the data stream path, so that the data processing is processed in the computing engines like a pipeline, and the efficiency of the data processing is improved.

Optionally, the processing the data to be processed through the target data stream network includes:

reading the data to be processed to the first data stream storage module;

In this embodiment, the first data stream storage module may be a cache, a DDR or a high-speed access DDR, and in this embodiment, it is preferably a cache, and specifically, may be a cache provided with a controllable read-write address generation unit. Depending on the input data format and the required computations in the data path, the address generation unit will generate an adapted address sequence to index the data in the buffer. The above address sequence may be used to index the data in the cache and input the data into the corresponding calculation engine, for example, if the calculation engine needs 80 data to perform calculation, the 80 data of the corresponding address sequence are read from the cache and input into the calculation engine. In addition, the address generation unit can also make the generated address sequence have different cycle sizes by setting a counter, for example, one small cycle of data 1, data 2 and data 3, thereby improving the reusability of data and being suitable for the data processing size of each calculation engine. The state of the first data stream storage module includes: the state of the calculation engine includes whether the calculation is completed or not, whether the next calculation data needs to be read or not, and the like. The state of the first data stream storage module may be obtained by monitoring the state of the data in the first data stream storage module in the finite state machine, and the state of the calculation engine is obtained by obtaining the state of the first data stream storage module, for example, after the calculation result is written into the first data stream storage module, it may be determined that the state of the calculation engine is calculated.

In each clock cycle, the state of each calculation engine and the first data stream storage module is obtained, so that accurate prediction can be realized, the hardware performance optimization with the maximum efficiency can be carried out through accurate calculation scheduling, and the data processing efficiency is further improved.

In this embodiment, the above-mentioned computation core, the second data stream storage module, and the local data stream network are main configurations constituting the computation engine, the above-mentioned computation core may be a kernel having computation performance, such as a convolution kernel, a pooling kernel, and an activation function kernel, and the computation core may also be referred to as a computation kernel, a computation unit, a computation module, and the like. The second data stream storage module may be a storage module with a data access function, such as a cache, a DDR, or a high-speed DDR, and the second data stream storage module and the first data stream storage module may be different storage regions on the same memory, for example, the second data stream storage module may be a second data cache region in the cache, and the first data stream storage module may be a first data cache region in the cache, and the local data stream network may be understood as a route used in the compute engine to connect the compute core and the second data stream storage module. For example, the connections between the compute cores may be controlled by a network router. The network router described above mainly functions to provide a skip path and a feedback path. By setting the control registers, the local data flow network may be configured to form flow paths with different compute cores available in the compute engine. The combination of the types and sequences of these computation kernels along the flow path provides a continuous data processing pipeline for multiple layers in the deep learning neural network, for example, according to the data flow direction, if the combination of the computation kernels is from a convolution kernel to a pooling kernel to an activation function kernel, a convolution neural network layer can be obtained, and for example, if the combination of the computation kernels is from a deconvolution kernel to a pooling kernel to an activation function kernel, a deconvolution neural network layer can be obtained, and the like. It should be noted that the combination of the type and the order of the computing cores is specifically determined by the target network configuration rule. By forming data streams between the computing cores, the computation of the computing engine can be accelerated, thereby further improving the data processing efficiency of the deep network.

The above optional embodiments may implement the method for deep network acceleration based on data stream according to the embodiments corresponding to fig. 2 and fig. 3, to achieve the same effect, and are not described herein again.

In a second aspect, please refer to fig. 4, fig. 4 is a schematic flowchart of a deep network acceleration method based on data flow according to an embodiment of the present application, and as shown in fig. 4, the method includes:

401. and acquiring target depth network information required by the data to be processed.

In this step, the data to be processed may be data that can be processed through a depth network, such as image data to be identified, target data to be detected, and target data to be tracked, where the target depth network information corresponds to depth network information of the data to be processed, for example, if the data to be processed is the image data to be identified, the target depth network information is a configuration parameter of the depth network for processing image identification, and if the data to be processed is the target data to be detected, the target depth network information is a configuration parameter of the depth network for processing target detection, where the target depth network information may be preset, and is determined by matching the data to be processed, or by manually selecting and determining, and is not limited herein. Obtaining the target deep network information may facilitate configuration of the deep network, and the deep network information may include a network type, a data type, a layer number, a calculation type, and the like.

402. And matching preset target network configuration rules corresponding to the target depth network information according to the target depth network information, wherein the target network configuration rules comprise a computing core, a second data stream storage module and a local data stream network.

The target deep network information already includes a network type, a data type, a number of layers, a calculation type, and the like of a deep network required by the data to be processed, the target network configuration rule may be set in advance, for example, each parameter rule, a calculation rule, and the like in a preset type network such as an image recognition network, a target detection network, a target tracking network, and the like, the parameter rule may be a setting rule of a hyper-parameter, a setting rule of a weight, and the like, and the calculation rule may be a calculation rule such as addition, multiplication, convolution, deconvolution, and the like. The configuration rules among the computing cores, the second data stream storage module, and the local data stream network may be understood as the types and numbers of the computing cores, and the connection manner between the computing cores and the global data stream network, the connection manner between the second data stream and the global data stream network, the route connection manner in the local data stream network, and the like. The local data flow network may be configured by control registers. The network implementation may be a router between the first data stream storage module and the compute engine. For example, the connections between the compute cores may be controlled by a network router. The network router described above mainly functions to provide a skip path and a feedback path.

403. And configuring to obtain a target data stream engine according to the target network configuration rule.

The configuration may be implemented as a connection relationship among a pre-configured computing core, a second data stream storage module, and a local data stream network, where the connection relationship may include a type of the computing core, a connection number, a connection order, and the like, and the computing core may be connected to the local data stream network through an interconnect to form a new computing engine, that is, a data stream engine, and data stream engines required by different deep networks may be formed according to different types of the computing core, connection numbers, and connection orders. And configuring according to the target network configuration rule to obtain a target data flow engine for processing the data to be processed. The data in the second data stream storage module can be respectively read into different computation cores to form data streams, for example, the data to be multiplied is read into the multiplication core to be multiplied, the data to be added is read into the addition core to be added, and the like.

404. And processing the data to be processed through the target data stream engine.

The target data stream engine is configured through target network information, and may also be referred to as a customized data stream engine, where the target data stream engine connects the second data stream storage module and each computation core through a local data stream network to form a data stream, and compared with an implementation form of an instruction set, the target data stream engine does not need to wait for completion of reading and writing of a previous instruction, and can implement high efficiency of computation under a deep network architecture.

In the embodiment, target depth network information required by data to be processed is acquired; according to the target deep network information, matching a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises a computing core, a second data stream storage module and a local data stream network; configuring to obtain a target data stream engine according to the target network configuration rule; and processing the data to be processed through the target data stream engine. The deep network is accelerated through the data flow, off-chip data communication is reduced, therefore, no instruction idle overhead exists, the hardware acceleration efficiency of the deep network can be improved, and moreover, through network configuration, the computing engines required by different deep network models can be configured, and the computing engines required by various different deep network models are supported.

Referring to fig. 5, fig. 5 is a schematic flow chart of another deep network acceleration method based on data flow according to an embodiment of the present application, and as shown in fig. 5, the method includes:

501. acquiring target depth network information required by data to be processed;

502. according to the target deep network information, matching a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises a computing core, a second data stream storage module and a local data stream network;

503. configuring the interconnection between the computing core and the local data flow network to obtain a computing path of the computing core;

504. configuring the interconnection between the second data stream storage module and the local data stream network to obtain a storage path;

505. and obtaining the target data stream engine according to the calculation path and the storage path.

506. And processing the data to be processed through the target data stream engine.

In this embodiment, the above-mentioned computation core, the second data stream storage module, and the local data stream network are main configurations constituting the data stream engine, the above-mentioned computation core may be a kernel with computation performance, such as a convolution kernel, a pooling kernel, and an activation function kernel, and it should be noted that the computation core may also be referred to as a computation kernel, a computation unit, a computation module, and the like. The second data stream storage module may be a storage module with a data access function, such as a cache, a DDR, or a high-speed DDR, and the second data stream storage module and the first data stream storage module may be different storage regions on the same memory, for example, the second data stream storage module may be a second data cache region in the cache, and the first data stream storage module may be a first data cache region in the cache, and the local data stream network may be understood as a route used in the compute engine to connect the compute core and the second data stream storage module. For example, the connections between the compute cores may be controlled by a network router. The network router described above mainly functions to provide a skip path and a feedback path. By setting the control registers, the local data flow network can be configured to form flow paths with different compute cores available in the compute engine. The combination of the types and sequences of these computation kernels along the flow path provides a continuous data processing pipeline for multiple layers in the deep learning neural network, for example, according to the data flow direction, if the combination of the computation kernels is from a convolution kernel to a pooling kernel to an activation function kernel, a convolution neural network layer can be obtained, and for example, if the combination of the computation kernels is from a deconvolution kernel to a pooling kernel to an activation function kernel, a deconvolution neural network layer can be obtained, and the like. It should be noted that the combination of the type and the order of the computing cores is specifically determined by the target network configuration rule.

By forming data streams between the computing cores, the computation of the computing engine can be accelerated, thereby further improving the data processing efficiency of the deep network.

Optionally, the processing the data to be processed by the target data flow engine includes:

reading the data to be processed to the second data stream storage module;

in the second data stream storage module, generating an address sequence for the data to be processed according to a preset generation rule according to the data format and the data path of the data to be processed;

In this embodiment, the second data stream storage module may be a cache, a DDR or a high-speed access DDR, in this embodiment, it is preferably a cache, and specifically, may be a cache provided with a controllable read-write address generation unit. Depending on the input data format and the required computations in the data path, the address generation unit will generate an adapted address sequence to index the data in the buffer. The above address sequence may be used to input the data in the index cache into the corresponding computing core, for example, if the computing core needs 80 data to perform computation, the 80 data of the corresponding address sequence are read from the cache into the computing core. In addition, the address generating unit can also set a counter to enable the generated address sequence to have different loop sizes, for example, one small loop of data 1, data 2 and data 3, so that the reusability of the data can be improved, and meanwhile, the address generating unit can also adapt to the size of data processing of each computing core. The state of the second data stream storage module includes: the states of the computation cores include whether computation is completed or not, whether next computation data needs to be read or not, and the like. The state of the first data stream storage module can be obtained by monitoring the state of the data in the first data stream storage module in the finite state machine, and the state of the computational core can be obtained by obtaining the state of the second data stream storage module, for example, after the computation result is written into the second data stream storage module, the state of the computational core can be determined to be the computation completion.

In each clock cycle, the state of each computing core and the second data stream storage module is obtained, so that the method can be accurately predicted, the hardware performance optimization with the maximum efficiency can be carried out through accurate computing scheduling, and the data processing efficiency is further improved.

In this embodiment, the first storage unit may be an input data stream storage unit, the second storage unit may be an input data stream storage unit, and the first storage unit and the second storage unit are used for alternate access of a data stream, that is, the first storage unit inputs input data into the computation core for computation, and the computation core outputs a computation result to the second storage unit for storage, so that it can be avoided that when the first storage unit inputs data into the computation core, an output result of the computation core cannot be written into the first storage unit, for example, the computation core needs to repeat computation on one data in the patella storage unit 2 times, after the first computation is completed, the computation core needs to read the data in the first storage unit for the second time, and in a normal case, it is waited to store a first computation result into the first storage unit and then to read the data for the second time, but after the storage unit is set, it is possible to read the data in the first storage unit while storing a first computation result into the second storage unit, and it is not needed to wait for processing efficiency, and thus data processing efficiency is improved.

The above optional embodiments may implement the deep network acceleration method based on data stream according to the embodiments corresponding to fig. 4 and fig. 5, to achieve the same effect, and are not described herein again. The above embodiments may be combined with the examples of fig. 2 and 3.

In a third aspect, referring to fig. 6, fig. 6 is a schematic diagram of a deep network acceleration apparatus based on data flow according to an embodiment of the present application, and as shown in fig. 6, the apparatus includes:

a first obtaining module 601, configured to obtain target depth network information required by data to be processed;

a first matching module 602, configured to match a preset target network configuration rule corresponding to the target deep network information according to the target deep network information, where the target network configuration rule includes a pre-configured configuration rule among a computing engine, a first data stream storage module, and a global data stream network;

a first configuration module 603, configured to obtain a target data stream network according to the target network configuration rule;

a first processing module 604, configured to process the data to be processed through the target data flow network.

Optionally, the first configuration module 603 includes:

the global configuration submodule is used for configuring the parallel or serial of a plurality of computing engines according to the global data stream network;

the path configuration submodule is used for obtaining data flow paths of the plurality of computing engines according to the parallel or serial connection between the first data flow storage module and the plurality of computing engines;

and the forming submodule is used for forming the target data flow network based on the data flow path.

Optionally, the first processing module 604 includes:

the first acquisition submodule is used for reading the data to be processed to the first data stream storage module;

a first data address generation submodule, configured to generate, in the first data stream storage module, an address sequence for the to-be-processed data according to a preset generation rule and according to a data format and a data path of the to-be-processed data;

and the first input submodule is used for reading and inputting the data quantity corresponding to the calculation engine in the target data stream network from the first data stream storage module according to the address sequence in each clock cycle, and acquiring the states of the first data stream storage module and the calculation engine.

Optionally, the target network configuration further includes a computing core, a second data stream storage unit, and a local data stream network connecting the computing core and the second buffer, and the first configuration module 603 further includes:

the first local configuration submodule is used for configuring the interconnection between the computing core and the local data flow network to obtain a computing path of the computing core;

a first local path sub-module, configured to configure interconnection of the second data stream with the local data stream network at a storage unit, to obtain a storage path;

and the first engine module is used for obtaining the calculation engine according to the calculation path and the storage path.

In a fourth aspect, referring to fig. 7, fig. 7 is a schematic diagram of a deep network acceleration device based on data flow according to an embodiment of the present application, and as shown in fig. 7, the device includes:

a second obtaining module 701, configured to obtain target depth network information required by data to be processed;

a second matching module 702, configured to match a preset target network configuration rule corresponding to the target deep network information according to the target deep network information, where the target network configuration rule includes a computation core, a second data stream storage module, and a local data stream network;

a second configuration module 703, configured to obtain a target data stream engine according to the target network configuration rule;

a second processing module 704, configured to process the data to be processed through the target data stream engine.

Optionally, the second configuration module 703 includes:

the second local configuration submodule is used for configuring the interconnection between the computing core and the local data flow network to obtain a computing path of the computing core;

a second local path sub-module, configured to configure interconnection between the second data stream storage module and the local data stream network, to obtain a storage path;

and the second engine module is used for obtaining the target data stream engine according to the calculation path and the storage path.

Optionally, the second processing module 704 includes:

the second acquisition submodule is used for reading the data to be processed to the second data stream storage module;

a second data address generation submodule, configured to generate, in the second data stream storage module, an address sequence for the to-be-processed data according to a preset generation rule according to the data format and the data path of the to-be-processed data;

and the second input submodule is used for reading and inputting the data quantity corresponding to the computing core in the target data stream engine from the second data stream storage module according to the address sequence in each clock cycle, and acquiring the states of the second data stream storage module and the computing core.

Optionally, the second processing module 704 includes:

the input calculation submodule is used for inputting the data in the first storage unit into a calculation core to obtain a calculation result;

and the output storage submodule is used for storing the calculation result into a second storage unit to be used as input data of a next calculation core.

In a fifth aspect, an embodiment of the present application provides an electronic device, including: the device comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to realize the steps of the deep network acceleration method based on data flow provided by the embodiment of the application.

In a sixth aspect, the present application provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps in the deep network acceleration method based on data streams provided by the present application.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required in this application.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative

In addition, the processors and chips in the embodiments of the present application may be integrated into one processing unit, may exist alone physically, or may be integrated into one unit by two or more pieces of hardware. The computer-readable storage medium or computer-readable program may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application may be substantially implemented or a part of or all or part of the technical solution contributing to the prior art may be embodied in the form of a software product stored in a memory, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned memory comprises: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash Memory disks, read-Only memories (ROMs), random Access Memories (RAMs), magnetic or optical disks, and the like.

The foregoing is a more detailed description of the present application in connection with specific preferred embodiments, and it is not intended that the present application be limited to the specific embodiments shown. For those skilled in the art to which the present application pertains, several simple deductions or substitutions may be made without departing from the concept of the present application, and all should be considered as belonging to the protection scope of the present application.

Claims

1. A deep network acceleration method based on data flow is characterized in that the method comprises the following steps:

acquiring target depth network information required by data to be processed;

according to the target deep network information, matching preset target network configuration rules corresponding to the target deep network information, wherein the target network configuration rules comprise a pre-configured computing engine, a first data stream storage module and configuration rules among a global data stream network, the global data stream network is configured by a control register, and the network is a router between the first data stream storage module and the computing engine;

processing the data to be processed through the target data stream network;

the processing the data to be processed through the target data stream network includes:

reading the data to be processed to the first data stream storage module;

2. The method of claim 1, wherein configuring the target data flow network according to the target network configuration rule comprises:

forming the target data flow network based on the data flow path.

3. The method of claim 1, wherein the processing the data to be processed through the target data flow network comprises:

reading the data to be processed to the first data stream storage module;

4. The method of any of claims 1 to 3, wherein the target network configuration rule further comprises a compute core, a second data stream storage unit, and a local data stream network connecting the compute core and the second data stream storage unit, the configuring of the compute engine comprising:

configuring interconnection between the computing core and the local data stream network to obtain a computing path of the computing core, wherein the local data stream network is a route used for connecting the computing core with a second data stream storage module in a computing engine;

configuring interconnection of the second data stream and the local data stream network in a second data stream storage unit to obtain a storage path;

5. A deep network acceleration method based on data flow, the method comprising:

acquiring target depth network information required by data to be processed;

according to the target deep network information, matching a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises a computing core, a second data stream storage module and a local data stream network, and the local data stream network is a route used for connecting the computing core and the second data stream storage module in a computing engine;

processing the data to be processed through the target data stream engine;

the processing the data to be processed by the target data stream engine includes:

reading the data to be processed to the second data stream storage module;

6. The method as recited in claim 5, wherein said configuring a target data flow engine according to said target network configuration rules comprises:

configuring interconnection of the second data stream and the local data stream network in a second data stream storage module to obtain a storage path;

7. The method of claim 5 or 6, wherein the second data stream storage module comprises a first storage unit and a second storage unit, and the processing the data to be processed by the target data stream engine comprises:

and storing the calculation result to a second storage unit as input data of a next calculation core.

8. A deep network acceleration apparatus based on data flow, the apparatus comprising:

the first matching module is used for matching preset target network configuration rules corresponding to the target depth network information according to the target depth network information, wherein the target network configuration rules comprise pre-configured configuration rules among a computing engine, a first data stream storage module and a global data stream network, the global data stream network is configured by a control register, and the network is a router between the first data stream storage module and the computing engine;

the first processing module is used for processing the data to be processed through the target data stream network;

reading the data to be processed to the first data stream storage module;

9. A deep network acceleration apparatus based on data flow, the apparatus comprising:

the second matching module is used for matching a preset target network configuration rule corresponding to the target depth network information according to the target depth network information, wherein the target network configuration rule comprises a computing core, a second data stream storage module and a local data stream network, and the local data stream network is a route used for connecting the computing core and the second data stream storage module in a computing engine;

the second configuration module is used for configuring and obtaining a target data stream engine according to the target network configuration rule;

the second processing module is used for processing the data to be processed through the target data stream engine;

reading the data to be processed to the second data stream storage module;

10. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, the processor implementing the steps in the data stream based deep network acceleration method according to any of claims 1 to 4 when executing the computer program.

11. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the steps in the data-flow based deep network acceleration method of any one of claims 1 to 4.