WO2020206637A1 - Deep network acceleration methods and apparatuses based on data stream, device, and storage medium - Google Patents

Deep network acceleration methods and apparatuses based on data stream, device, and storage medium Download PDF

Info

Publication number
WO2020206637A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
target
network
data stream
calculation
Prior art date
Application number
PCT/CN2019/082101
Other languages
French (fr)
Chinese (zh)
Inventor
牛昕宇
蔡权雄
Original Assignee
深圳鲲云信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳鲲云信息科技有限公司 filed Critical 深圳鲲云信息科技有限公司
Publication of WO2020206637A1 publication Critical patent/WO2020206637A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7814Specially adapted for real time processing, e.g. comprising hardware timers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • This application relates to the field of artificial intelligence, and more specifically, to a data stream-based deep network acceleration method, apparatus, device, and storage medium.
  • The advancement of neural-network-based deep learning applications requires high processing capability from the underlying hardware platform.
  • Because CPU-based platforms cannot meet this growing demand, many companies have developed dedicated hardware accelerators to support progress in this field.
  • The common idea of existing hardware accelerators is to accelerate certain types of calculations that are used frequently in deep learning applications.
  • Existing hardware architectures are based on the execution of instructions from an extensible instruction set, and achieve acceleration by implementing common calculations as customized instructions.
  • Instruction-based architecture implementations are usually expressed as system-on-chip (SoC) designs.
  • In an instruction-based architecture, many clock cycles are wasted on non-computation-related operations.
  • To support a more general instruction architecture, calculations in deep learning neural networks are usually decomposed into multiple instructions, so one calculation usually requires multiple clock cycles.
  • The arithmetic logic unit (ALU) in a processor is usually a collection of different operations implemented in hardware. Because of limited instruction expressiveness and limited I/O bandwidth, most ALU resources are idle while a single instruction executes. For example, when performing a multiplication followed by an addition, the multiplication operands are read first; because I/O speed is limited by bandwidth, the addition must wait until the multiplication completes and its result is written to memory, after which the result and the addition operands are read back for the addition. During the multiplication and the reads and writes, the addition unit is idle. Instruction-based hardware acceleration is therefore inefficient.
  • The purpose of this application is to provide, in view of the above defects in the prior art, a data stream-based deep network acceleration method, apparatus, device, and storage medium,
  • solving the problem that, with limited instruction expressiveness and limited I/O bandwidth, most ALU resources are idle while a single instruction executes, so acceleration efficiency is low.
  • In a first aspect, a data stream-based deep network acceleration method includes: obtaining the target deep network information required by the data to be processed;
  • according to the target deep network information, matching a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network;
  • configuring a target data stream network according to the target network configuration rule; and
  • processing the data to be processed through the target data stream network.
  • Optionally, configuring the target data stream network according to the target network configuration rule includes: configuring parallel or serial connections among multiple calculation engines according to the global data stream network;
  • obtaining the data stream paths of the multiple calculation engines according to the first data stream storage module and the parallel or serial connections among them; and forming the target data stream network based on the data stream paths.
  • Optionally, processing the data to be processed through the target data stream network includes: reading the data to be processed into the first data stream storage module;
  • in the first data stream storage module, generating an address sequence for the data to be processed according to a preset generation rule, based on its data format and data path;
  • and, in each clock cycle, reading from the first data stream storage module, according to the address sequence, the amount of data corresponding to a calculation engine in the target data stream network for input, and obtaining the states of the first data stream storage module and the calculation engine.
  • Optionally, the target network configuration further includes a computing core, a second data stream storage unit, and a local data stream network connecting the computing core and the second data stream storage unit, and the configuration of the calculation engine includes: configuring the interconnection between the computing core and the local data stream network to obtain the computing path of the computing core;
  • configuring the interconnection between the second data stream storage unit and the local data stream network to obtain a storage path; and obtaining the calculation engine according to the computing path and the storage path.
  • In a second aspect, a data stream-based deep network acceleration method includes: obtaining the target deep network information required by the data to be processed;
  • according to the target deep network information, matching a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule includes a computing core, a second data stream storage module, and a local data stream network;
  • configuring a target data stream engine according to the target network configuration rule; and
  • processing the data to be processed through the target data stream engine.
  • Optionally, configuring the target data stream engine according to the target network configuration rule includes: configuring the interconnection between the computing core and the local data stream network to obtain the computing path of the computing core;
  • configuring the interconnection between the second data stream storage module and the local data stream network to obtain a storage path; and obtaining the target data stream engine according to the computing path and the storage path.
  • Optionally, processing the data to be processed through the target data stream engine includes: reading the data to be processed into the second data stream storage module;
  • in the second data stream storage module, generating an address sequence for the data to be processed according to a preset generation rule, based on its data format and data path;
  • and, in each clock cycle, reading from the second data stream storage module, according to the address sequence, the amount of data corresponding to a computing core in the target data stream engine for input, and obtaining the states of the second data stream storage module and the computing core.
  • Optionally, the second data stream storage module includes a first storage unit and a second storage unit,
  • and processing the data to be processed through the target data stream engine includes: inputting the data in the first storage unit into a computing core to obtain a calculation result;
  • and storing the calculation result in the second storage unit as the input data of the next computing core.
  • a data stream-based deep network acceleration device includes:
  • the first obtaining module is used to obtain target deep network information required by the data to be processed
  • the first matching module is configured to match a preset target network configuration rule corresponding to the target deep network information according to the target deep network information, wherein the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network;
  • the first configuration module is configured to configure the target data flow network according to the target network configuration rule
  • the first processing module is configured to process the to-be-processed data through the target data flow network.
  • a data stream-based deep network acceleration device comprising:
  • the second acquisition module is used to acquire target deep network information required by the data to be processed
  • the second matching module is configured to match a preset target network configuration rule corresponding to the target deep network information according to the target deep network information, wherein the target network configuration rule includes a computing core, a second data stream storage module, and a local data stream network;
  • the second configuration module is used to configure the target data flow engine according to the target network configuration rule
  • the second processing module is configured to process the to-be-processed data through the target data flow engine.
  • an electronic device, including: a memory, a processor, and a computer program stored on the memory and capable of running on the processor; when the processor executes the computer program, the steps in the data stream-based deep network acceleration method provided in the embodiments of this application are implemented.
  • a computer-readable storage medium is provided, and a computer program is stored on the computer-readable storage medium.
  • When the computer program is executed by a processor, the steps in the data stream-based deep network acceleration method provided in the embodiments of the present application are implemented.
  • The deep network is accelerated through the data stream and off-chip data communication is reduced, so there is no instruction idle overhead, and the hardware acceleration efficiency of the deep network can be improved.
  • Moreover, different deep network models can be configured through network configuration, so a variety of deep network models are supported.
  • FIG. 1 is a schematic diagram of an optional implementation architecture of a data stream-based deep network acceleration method provided by an embodiment of this application;
  • FIG. 2 is a schematic flowchart of a data stream-based deep network acceleration method provided by the first aspect of the embodiments of this application;
  • FIG. 3 is a schematic flowchart of another data stream-based deep network acceleration method provided by an embodiment of this application;
  • FIG. 4 is a schematic flowchart of a data stream-based deep network acceleration method provided by the second aspect of the embodiments of this application;
  • FIG. 5 is a schematic flowchart of another data stream-based deep network acceleration method provided by an embodiment of this application;
  • FIG. 6 is a schematic diagram of a data stream-based deep network acceleration device provided by the third aspect of the embodiments of this application;
  • FIG. 7 is a schematic diagram of a data stream-based deep network acceleration device provided by the fourth aspect of the embodiments of this application.
  • FIG. 1 is a schematic diagram of an optional implementation architecture of a data stream-based deep network acceleration method provided by an embodiment of this application.
  • The architecture 103 is connected through interconnection to the off-chip storage module (DDR) 101 and the CPU.
  • The architecture 103 includes: a first storage module 104, a global data flow network 105, and a data flow engine 106.
  • The first storage module 104 is connected through interconnection to the off-chip storage module 101 and is also connected through interconnection to the global data flow network 105; the data flow engine 106 is connected to the global data flow network 105 through interconnection, so that data flow engines 106 can run in parallel or in series.
  • the aforementioned data flow engine 106 may include: a computing core (or called a computing module), a second storage module 108, and a local data flow network 107.
  • The computing core may include cores used for computation, such as a convolution core 109, a pooling core 110, and an activation function core 111. Of course, other computing cores besides these examples may also be included, which is not limited here, as long as they can be included in the deep network.
  • the above-mentioned first storage module 104 and the above-mentioned second storage module 108 may be on-chip cache modules, or may be DDR or high-speed DDR memory modules.
  • the above-mentioned data stream engine 106 can be understood as a computing engine that supports data stream processing, and can also be understood as a computing engine dedicated to data stream processing.
  • the foregoing CPU may include a control register, and the foregoing control register is pre-configured with network configuration rules for configuring the network.
  • the deep network in this application may also be called a deep learning network, a deep learning neural network, and the like.
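  • To make the relationships among the modules in FIG. 1 concrete, the following is a minimal, illustrative Python sketch that models the architecture as plain data structures. All class and field names (Architecture, DataFlowEngine, and so on) are hypothetical conveniences for illustration; the application does not define a software API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ComputingCore:
    """A computing core inside a data flow engine, e.g. the convolution
    core 109, pooling core 110, or activation function core 111."""
    kind: str  # "conv", "pool", "activation", ...

@dataclass
class DataFlowEngine:
    """Data flow engine 106: computing cores joined by the local data
    flow network 107, with a second storage module 108 for data access."""
    cores: List[ComputingCore] = field(default_factory=list)

@dataclass
class Architecture:
    """On-chip architecture 103: a first storage module 104, a global
    data flow network 105, and one or more data flow engines 106. The
    global network determines whether engines run in parallel or serial."""
    engines: List[DataFlowEngine] = field(default_factory=list)
    serial: bool = True  # True: engines chained; False: engines in parallel

# Example: one engine whose flow path is conv -> pool -> activation.
arch = Architecture(engines=[DataFlowEngine(cores=[
    ComputingCore("conv"), ComputingCore("pool"), ComputingCore("activation")])])
```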
  • This application provides a data stream-based deep network acceleration method, device, equipment and storage medium.
  • FIG. 2 is a schematic flowchart of a data stream-based deep network acceleration method provided by an embodiment of the present application. As shown in FIG. 2, the method includes the following steps:
  • the aforementioned data to be processed may be data that can be processed through a deep network, such as image data to be identified, target data to be detected, target data to be tracked, and so on.
  • The target deep network information is the information of the deep network corresponding to the data to be processed. For example, if the data to be processed is image data to be recognized, the target deep network information is the configuration parameters of the deep network used for image recognition; if the data to be processed is target data to be detected, the target deep network information is the configuration parameters of the deep network used for target detection.
  • The target deep network information may be preset, and may be determined by matching against the data to be processed or selected manually, which is not limited here.
  • Obtaining the target deep network information can facilitate the configuration of the deep network.
  • the aforementioned deep network information may include network type, data type, number of layers, calculation type, and so on.
  • According to the target deep network information, a preset target network configuration rule corresponding to the target deep network information is matched, wherein the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network.
  • The target deep network information contains the network type, data type, number of layers, calculation type, and so on of the deep network required by the data to be processed.
  • The target network configuration rules can be set in advance, for example, the parameter rules and calculation rules of preset image recognition networks, target detection networks, target tracking networks, and other types of networks.
  • The parameter rules may be hyperparameter setting rules, weight setting rules, and so on; the calculation rules may be the calculation rules for addition, multiplication, convolution, deconvolution, and so on. One possible way to organize such preset rules is sketched below.
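  • As one possible reading of this matching step, the target deep network information (network type, data type, number of layers, calculation type) can serve as a key into a preset table of configuration rules. The dictionary lookup below is only an illustrative sketch; the table contents and all names are assumptions, not a format defined by this application.

```python
# Hypothetical preset table: network type -> target network configuration rule.
PRESET_RULES = {
    "image_recognition": {"engines": 2, "connection": "serial",
                          "cores": ["conv", "pool", "activation"]},
    "target_detection":  {"engines": 4, "connection": "parallel",
                          "cores": ["conv", "pool", "activation"]},
}

def match_rule(target_info: dict) -> dict:
    """Match the preset target network configuration rule corresponding
    to the target deep network information (keyed here by network type)."""
    try:
        return PRESET_RULES[target_info["network_type"]]
    except KeyError:
        raise ValueError(f"no preset rule matches {target_info!r}")

rule = match_rule({"network_type": "image_recognition",
                   "data_type": "int8", "layers": 16})
```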
  • the foregoing configuration rules between the pre-configured calculation engine, the first data stream storage module, and the global data stream network can be understood as the number of calculation engines and the connection mode between the calculation engine and the global data stream network.
  • the global data flow network can be configured by the control register.
  • the network implementation may be a router between the first data flow storage module and the calculation engine.
  • The first data stream storage module may include two data stream storage units, one for input and one for output, used for alternating data access: the input data stream storage unit feeds input data into the calculation engine,
  • and the calculation engine outputs its results to the output data stream storage unit for storage. This prevents the conflict in which the input unit is feeding data to the calculation engine while the engine's output cannot be written back into that same unit.
  • For example, when the calculation engine must perform two passes over a piece of data in the input unit, after the first pass it needs to read the data a second time;
  • with separate units, the first result can be stored into the output unit at the same time as the second read proceeds, without waiting, which improves data processing efficiency. This behaves like the double-buffer sketch below.
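  • The alternating input/output storage units described above behave like a classic double ("ping-pong") buffer: while a result streams into one unit, the other can already be read again. A minimal software sketch of that behavior, with all names assumed:

```python
class PingPongBuffer:
    """Two storage units used alternately: one feeds the calculation
    engine while the other receives its results, so a second read of the
    input never has to wait for the previous write to finish."""
    def __init__(self):
        self.units = [[], []]
        self.read_idx = 0            # unit currently feeding the engine

    def read(self):
        return list(self.units[self.read_idx])

    def write(self, result):
        # Results always go to the unit NOT currently being read.
        self.units[1 - self.read_idx] = result

    def swap(self):
        # Freshly written results become the next stage's input.
        self.read_idx = 1 - self.read_idx

buf = PingPongBuffer()
buf.units[0] = [1, 2, 3]              # data that must be processed twice
first = [x * x for x in buf.read()]   # first pass over the input
buf.write(first)                      # store pass-1 results...
second_read = buf.read()              # ...while reading the input again
```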
  • Configuring according to the target network configuration rule may mean establishing the connection relationship among the pre-configured calculation engines, the first data stream storage module, and the global data stream network.
  • The connection relationship may include the number of calculation engines connected and their connection order.
  • Calculation engines can be connected to the global data stream network through interconnection to form a new deep network, and different deep networks can be formed from different numbers and orders of calculation engines.
  • The target data stream network obtained in this way is used to process the data to be processed. Because each calculation engine reads data through the first data stream storage module, the data in that module can be read into different calculation engines to form a data stream without any instruction-set ordering, so a configured calculation engine produces no computation gaps.
  • The target data stream network is configured from the target network information and may also be called a customized data stream network.
  • The target data stream network connects the first data stream storage module and the calculation engines through the global data stream network to form a data stream; compared with an instruction-set implementation, there is no need to wait for the previous instruction's reads and writes to complete, which improves computational efficiency under the deep network architecture.
  • In the embodiments of this application, the target deep network information required by the data to be processed is acquired; according to that information, a preset target network configuration rule is matched, wherein the rule includes configuration rules among the pre-configured calculation engine, the first data stream storage module, and the global data stream network; the target data stream network is configured according to the rule; and the data to be processed is processed through it. Accelerating the deep network through data flow reduces off-chip data communication, so there is no instruction idle overhead, which improves the hardware acceleration efficiency of the deep network; moreover, through network configuration, different deep network models can be configured, supporting a variety of deep network models.
  • The data stream-based deep network acceleration method provided in the embodiments of the present application can be applied to devices for data stream deep network acceleration, such as computers, servers, mobile phones, and other devices capable of performing data stream-based deep network acceleration.
  • FIG. 3 is a schematic flowchart of another data flow-based deep network acceleration method provided by an embodiment of the present application. As shown in FIG. 3, the method includes the following steps:
  • According to the target deep network information, a preset target network configuration rule corresponding to the target deep network information is matched, wherein the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network.
  • the aforementioned global data flow network can be implemented by routing, and the global data flow network can be configured by a control register, and corresponding global data flow network configuration rules are preset in the aforementioned control register.
  • the network is implemented as a router between the first data flow storage module and each calculation engine, and the main function of the network router is to provide skip paths and feedback paths for data flows between each calculation engine.
  • Parallel or serial operation among multiple calculation engines can be configured through the data flow. For example, when calculation engine A and calculation engine B are parallel in the global data flow network, the data stream flows to calculation engine A and calculation engine B at the same time;
  • when they are serial, the data stream first enters calculation engine A for calculation, and the calculation result then flows into calculation engine B.
  • The serial mode can be understood as deepening the computing layers of the deep network.
  • The specific configuration may be to control the data flow direction through the global data flow network, thereby realizing the parallel or serial configuration among multiple computing engines.
  • The parallel or serial configuration among multiple calculation engines can be obtained by configuring the interconnection between the global data flow network and the calculation engines: for example, multiple calculation engines may be interconnected with the global data flow network according to parallel rules,
  • or according to serial rules, and the first data stream storage module is configured to interconnect with the global data flow network. Both composition modes are sketched below.
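  • The two composition modes can be pictured as two ways of combining engine functions: in series, engine A's output becomes engine B's input (deepening the computing layers); in parallel, the same data stream fans out to A and B simultaneously. An illustrative sketch in which simple functions stand in for hardware engines:

```python
from typing import Callable, List

Engine = Callable[[list], list]

def serial(engines: List[Engine]) -> Engine:
    """Chain engines: data flows through A, then its result through B."""
    def run(data: list) -> list:
        for engine in engines:
            data = engine(data)
        return data
    return run

def parallel(engines: List[Engine]) -> Callable[[list], List[list]]:
    """Fan the same data stream out to every engine at the same time."""
    def run(data: list) -> List[list]:
        return [engine(data) for engine in engines]
    return run

engine_a: Engine = lambda d: [x + 1 for x in d]   # stand-in for engine A
engine_b: Engine = lambda d: [x * 2 for x in d]   # stand-in for engine B

print(serial([engine_a, engine_b])([1, 2, 3]))    # [4, 6, 8]
print(parallel([engine_a, engine_b])([1, 2, 3]))  # [[2, 3, 4], [2, 4, 6]]
```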
  • The first data stream storage module may be a cache, DDR, or high-speed-access DDR;
  • a cache is preferred.
  • A controllable read/write address generation unit may be provided in the cache.
  • The address generation unit generates an adapted address sequence to index the data in the cache.
  • The address sequence can be used to index data in the cache and input it to the corresponding calculation engine; for example, if a calculation engine requires 80 data items for a calculation, the 80 data items addressed by the sequence are read from the cache into the calculation engine.
  • The address generation unit may also be provided with a counter so that the generated address sequence has different cycle sizes, for example a small cycle of data 1, data 2, data 3; this improves data reusability and also adapts to the data processing size of each calculation engine (see the sketch below).
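  • In software terms, the address generation unit can be approximated as a generator that walks the cache with a counter and a configurable cycle size, so that a short window of addresses (data 1, data 2, data 3) repeats and the same data is reused. A minimal sketch, with the parameter names assumed for illustration:

```python
def address_sequence(base: int, count: int, cycle: int = 0):
    """Yield read addresses for one calculation engine.

    base  -- first address of the data block in the cache
    count -- how many data items the engine consumes (e.g. 80)
    cycle -- if > 0, repeat a small window of `cycle` addresses,
             letting the engine reuse the same data
    """
    for i in range(count):
        yield base + (i % cycle if cycle > 0 else i)

# An engine needing 80 items reads exactly 80 addresses from the cache:
addrs = list(address_sequence(base=0x100, count=80))

# A small cycle over data 1, data 2, data 3 (window of 3, repeated):
cycled = list(address_sequence(base=0x100, count=9, cycle=3))
```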
  • The data stream is stored through the first data stream storage module, and the flow of data to each data node, in parallel or in series among the multiple calculation engines (that is, the data flow path), is controlled so that data is processed in the calculation engines like a pipeline, improving data processing efficiency.
  • The first data stream storage module inputs data to the corresponding calculation engine through the global data flow network, and the calculation engine outputs its results to the first data stream storage module through the global data flow network; no instructions are needed for control, so there is no problem of a computing unit idling while a single instruction executes.
  • In some embodiments, processing the data to be processed through the target data flow network includes: reading the data to be processed into the first data stream storage module;
  • in the first data stream storage module, generating an address sequence for the data to be processed according to a preset generation rule, based on its data format and data path;
  • and, in each clock cycle, reading from the first data stream storage module, according to the address sequence, the amount of data corresponding to a calculation engine in the target data stream network for input, while obtaining the states of the first data stream storage module and the calculation engine.
  • As noted above, the first data stream storage module may be a cache, DDR, or high-speed-access DDR, preferably a cache provided with a controllable read/write address generation unit.
  • The address generation unit generates an adapted address sequence to index the data in the cache; the address sequence is used to input data from the cache to the corresponding calculation engine, for example reading the 80 data items addressed by the sequence when a calculation engine requires 80 data items.
  • The address generation unit may also be provided with a counter so that the generated address sequence has different cycle sizes, for example a small cycle of data 1, data 2, data 3, which improves data reusability and adapts to the data processing size of each calculation engine.
  • The state of the first data stream storage module includes a data-read-ready state and a data-write-complete state.
  • The state of the calculation engine includes whether its calculation is complete, whether the next calculation data needs to be read, and so on.
  • The state of the first data stream storage module can be obtained by monitoring the state of its data with a finite state machine, and the state of the calculation engine can be derived from the state of the first data stream storage module; for example, after a calculation result is written into the first data stream storage module, the state of the calculation engine can be determined to be calculation-complete.
  • In each clock cycle, the states of every calculation engine and of the first data stream storage module are obtained, so behavior can be accurately predicted; through accurate calculation scheduling, the hardware can be driven at maximum efficiency, further improving data processing efficiency (see the sketch below).
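  • One way to picture this per-cycle monitoring: a finite state machine exposes whether the storage module is read-ready or write-complete, and the engine's state is derived from that observation; a scheduler polls both every clock cycle. The state names and policy below are assumptions made for illustration only:

```python
from enum import Enum, auto

class StorageState(Enum):
    READ_READY = auto()        # data prepared for reading
    WRITE_COMPLETE = auto()    # a calculation result has been written back

class EngineState(Enum):
    COMPUTING = auto()
    DONE = auto()              # derived once the result write is observed

def engine_state(storage: StorageState) -> EngineState:
    """Derive the calculation engine's state from the storage module's
    state: observing a completed write means the engine has finished."""
    if storage is StorageState.WRITE_COMPLETE:
        return EngineState.DONE
    return EngineState.COMPUTING

# Poll the states every clock cycle for accurate scheduling.
trace = [StorageState.READ_READY, StorageState.READ_READY,
         StorageState.WRITE_COMPLETE]
for clock, st in enumerate(trace):
    print(f"cycle {clock}: storage={st.name}, engine={engine_state(st).name}")
```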
  • In some embodiments, the target network configuration further includes a computing core, a second data stream storage unit, and a local data stream network connecting the computing core and the second data stream storage unit, and the configuration of the calculation engine includes: configuring the interconnection between the computing core and the local data stream network to obtain the computing path of the computing core; configuring the interconnection between the second data stream storage unit and the local data stream network to obtain a storage path;
  • and obtaining the calculation engine according to the computing path and the storage path.
  • The computing core, the second data stream storage module, and the local data stream network are the main components configured in the calculation engine.
  • The computing core may be a convolution core, a pooling core, an activation function core, or another core used for computation;
  • in addition, it should be noted that the computing core may also be called a calculation core, a computing unit, a computing module, and so on.
  • The second data stream storage module may be a storage module with data access capability, such as a cache, DDR, or high-speed DDR, and the second data stream storage module and the first data stream storage module may be different storage areas
  • on the same memory; for example, the second data stream storage module may be a second data buffer area in the cache, and the first data stream storage module may be a first data buffer area in the cache.
  • The local data stream network can be understood as the routing used inside the calculation engine to connect the computing cores with the second data stream storage module.
  • The connections among computing cores can be controlled by a network router.
  • The main function of the network router is to provide skip paths and feedback paths.
  • The local data stream network can be configured to form a flow path with the different computing cores available in the calculation engine.
  • The combination of the types and order of these computing cores along the flow path provides a continuous data processing pipeline for multiple layers in a deep learning neural network.
  • For example, if the combination of computing cores is a convolution core to a pooling core,
  • a convolutional neural network layer can be obtained;
  • if the combination is a deconvolution core to a pooling core to an activation function core, a deconvolutional neural network layer can be obtained.
  • The combination of computing core types and their order is determined by the target network configuration rule, as sketched below.
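  • Choosing the core types and their order along the flow path amounts to composing a per-layer pipeline. The sketch below uses simple functions as stand-ins for hardware cores; the arithmetic is a placeholder, not the cores' real operations:

```python
from typing import Callable, List

Core = Callable[[list], list]

def conv_core(data: list) -> list:     # placeholder for a convolution core
    return [3 * x for x in data]

def pool_core(data: list) -> list:     # placeholder for a pooling core
    return data[::2]                   # crude downsampling

def act_core(data: list) -> list:      # placeholder for an activation core
    return [max(0, x) for x in data]   # ReLU-like

def build_layer(cores: List[Core]) -> Core:
    """Wire cores along a flow path to form one network layer, e.g.
    convolution -> pooling -> activation gives a convolutional layer."""
    def layer(data: list) -> list:
        for core in cores:
            data = core(data)
        return data
    return layer

conv_layer = build_layer([conv_core, pool_core, act_core])
print(conv_layer([-2, -1, 0, 1, 2, 3]))   # [0, 0, 6]
```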
  • FIG. 4 is a schematic flowchart of a data stream-based deep network acceleration method provided by an embodiment of the application. As shown in FIG. 4, the method includes:
  • the aforementioned data to be processed may be data that can be processed through a deep network, such as image data to be identified, target data to be detected, target data to be tracked, and so on.
  • The target deep network information is the information of the deep network corresponding to the data to be processed. For example, if the data to be processed is image data to be recognized, the target deep network information is the configuration parameters of the deep network used for image recognition; if the data to be processed is target data to be detected, the target deep network information is the configuration parameters of the deep network used for target detection.
  • The target deep network information may be preset, and may be determined by matching against the data to be processed or selected manually, which is not limited here.
  • Obtaining the target deep network information can facilitate the configuration of the deep network.
  • the aforementioned deep network information may include network type, data type, number of layers, calculation type, and so on.
  • According to the target deep network information, a preset target network configuration rule corresponding to the target deep network information is matched, wherein the target network configuration rule includes a computing core, a second data stream storage module, and a local data stream network.
  • The target deep network information contains the network type, data type, number of layers, calculation type, and so on of the deep network required by the data to be processed.
  • The target network configuration rules can be set in advance, for example, the parameter rules and calculation rules of preset image recognition networks, target detection networks, target tracking networks, and other types of networks.
  • The parameter rules may be hyperparameter setting rules, weight setting rules, and so on; the calculation rules may be the calculation rules for addition, multiplication, convolution, deconvolution, and so on.
  • The configuration rules among the computing core, the second data stream storage module, and the local data stream network can be understood as the types and number of computing cores and the connection mode between the computing cores and the local data stream network.
  • The local data stream network can be configured through the control register.
  • The local network implementation may be routing between the second data stream storage module and the computing cores.
  • The connections among computing cores can be controlled by a network router.
  • The main function of the network router is to provide skip paths and feedback paths.
  • Configuring according to the target network configuration rule may mean establishing the connection relationship among the pre-configured computing cores, the second data stream storage module, and the local data stream network.
  • The connection relationship may include the types of computing cores, the number of connections, and the connection order; computing cores can be connected with the local data stream network through interconnection to form a new computing engine, that is, a data stream engine, and the data stream engines required by different deep networks can be formed from different computing core types, connection counts, and connection orders.
  • The target data stream engine obtained in this way processes the data to be processed. Because each computing core reads data through the second data stream storage module, the data in the second data stream storage module can be read into different computing cores to form a data stream.
  • For example, data to be multiplied is read into a multiplication core for multiplication, and data to be added is read into an addition core for addition. Because the data stream requires no instruction-set ordering, the configured data stream engine produces no computation gaps.
  • The target data stream engine is configured from the target network information and may also be called a customized data stream engine.
  • The target data stream engine connects the second data stream storage module and each computing core through the local data stream network to form a data stream; compared with an instruction-set implementation, there is no need to wait for the previous instruction's reads and writes to complete, which improves computational efficiency under the deep network architecture.
  • In these embodiments, the target deep network information required by the data to be processed is acquired; according to that information, a preset target network configuration rule is matched, wherein the rule
  • includes a computing core, a second data stream storage module, and a local data stream network; a target data stream engine is configured according to the rule; and the data to be processed is processed by the target data stream engine.
  • Accelerating the deep network through data flow reduces off-chip data communication, so there is no instruction idle overhead, which improves the hardware acceleration efficiency of the deep network; moreover, through network configuration, the computing engines required by different deep network models can be configured, supporting the calculation engines required by a variety of deep network models.
  • FIG. 5 is a schematic flowchart of another data stream-based deep network acceleration method provided by an embodiment of the application. As shown in FIG. 5, the method includes:
  • According to the target deep network information, a preset target network configuration rule corresponding to the target deep network information is matched, wherein the target network configuration rule includes a computing core, a second data stream storage module, and a local data stream network;
  • The computing core, the second data stream storage module, and the local data stream network are the main components configured in the data stream engine.
  • The computing core may be a convolution core, a pooling core, an activation function core, or another core used for computation;
  • the computing core may also be called a calculation core, a computing unit, a computing module, and so on.
  • The second data stream storage module may be a storage module with data access capability, such as a cache, DDR, or high-speed DDR, and the second data stream storage module and the first data stream storage module may be different storage areas
  • on the same memory; for example, the second data stream storage module may be a second data buffer area in the cache, and the first data stream storage module may be a first data buffer area in the cache.
  • The local data stream network can be understood as the routing used inside the calculation engine to connect the computing cores with the second data stream storage module.
  • The connections among computing cores can be controlled by a network router.
  • The main function of the network router is to provide skip paths and feedback paths.
  • The local data stream network can be configured to form a flow path with the different computing cores available in the calculation engine.
  • The combination of the types and order of these computing cores along the flow path provides a continuous data processing pipeline for multiple layers in a deep learning neural network.
  • For example, if the combination of computing cores is a convolution core to a pooling core,
  • a convolutional neural network layer can be obtained;
  • if the combination is a deconvolution core to a pooling core to an activation function core, a deconvolutional neural network layer can be obtained.
  • The combination of computing core types and their order is determined by the target network configuration rule.
  • In this way, the calculation of the computing engine is accelerated, further improving the data processing efficiency of the deep network.
  • In some embodiments, processing the data to be processed through the target data stream engine includes: reading the data to be processed into the second data stream storage module;
  • in the second data stream storage module, generating an address sequence for the data to be processed according to a preset generation rule, based on its data format and data path;
  • and, in each clock cycle, reading from the second data stream storage module, according to the address sequence, the amount of data corresponding to a computing core in the target data stream engine for input, while obtaining the states of the second data stream storage module and the computing core.
  • The second data stream storage module may be a cache, DDR, or high-speed-access DDR;
  • a cache is preferred.
  • A controllable read/write address generation unit may be provided in the cache.
  • The address generation unit generates an adapted address sequence to index the data in the cache; the address sequence can be used to input data from the cache to the corresponding computing core. For example, if a computing core requires 80 data items for a calculation, the 80 data items addressed by the sequence are read from the cache into the computing core.
  • The address generation unit may also be provided with a counter so that the generated address sequence has different cycle sizes, for example a small cycle of data 1, data 2, data 3; this improves data reusability and also adapts to the data processing size of each computing core.
  • The state of the second data stream storage module includes a data-read-ready state and a data-write-complete state.
  • The state of the computing core includes whether its calculation is complete and whether the next calculation data needs to be read.
  • The state of the second data stream storage module can be obtained by monitoring it with a finite state machine, and the state of the computing core can be derived from the state of the second data stream storage module; for example, after a calculation result is written into the second data stream storage module, the state of the computing core can be determined to be calculation-complete.
  • In each clock cycle, the states of every computing core and of the second data stream storage module can be obtained, so behavior can be accurately predicted; through accurate calculation scheduling, the hardware can be driven at maximum efficiency, further improving data processing efficiency.
  • In some embodiments, the second data stream storage module includes a first storage unit and a second storage unit,
  • and processing the data to be processed through the target data stream engine includes: inputting the data in the first storage unit into a computing core to obtain a calculation result;
  • and storing the calculation result in the second storage unit as the input data of the next computing core.
  • The first storage unit may be an input data stream storage unit,
  • and the second storage unit may be an output data stream storage unit.
  • The first storage unit and the second storage unit are used for alternating data stream access: the first storage unit inputs data into the computing core for calculation, and the computing core outputs the calculation result to the second storage unit for storage. This prevents the conflict in which the first storage unit is feeding data to the computing core while the core's output cannot be written back into the first storage unit.
  • For example, when the computing core needs to perform two calculations on a piece of data in the first storage unit, after the first calculation is completed, the computing core needs to read the data from the first storage unit a second time.
  • FIG. 6 is a schematic diagram of a data stream-based deep network acceleration device provided by an embodiment of the application. As shown in FIG. 6, the device includes:
  • the first obtaining module 601 is configured to obtain target deep network information required by the data to be processed
  • the first matching module 602 is configured to match a preset target network configuration rule corresponding to the target deep network information according to the target deep network information, wherein the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network;
  • the first configuration module 603 is configured to configure the target data flow network according to the target network configuration rule
  • the first processing module 604 is configured to process the data to be processed through the target data stream network.
  • the first configuration module 603 includes:
  • the global configuration sub-module is used to configure parallel or serial between multiple calculation engines according to the global data flow network
  • a path configuration submodule configured to obtain the data flow paths of the multiple calculation engines according to the parallel or serial between the first data flow storage module and the multiple calculation engines;
  • a forming sub-module is used to form the target data flow network based on the data flow path.
  • the first processing module 604 includes:
  • the first acquisition submodule is configured to read the to-be-processed data into the first data stream storage module
  • the first data address generation sub-module is configured to generate an address sequence for the data to be processed according to a preset generation rule in the first data stream storage module according to the data format and data path of the data to be processed;
  • the first input sub-module is used for each clock cycle to read from the first data stream storage module according to the address sequence the data volume corresponding to the calculation engine in the target data stream network for input, and obtain the first The state of the data stream storage module and calculation engine.
  • the target network configuration further includes a computing core, a second data stream storage unit, and a local data stream network connecting the computing core and the second buffer
  • the first configuration module 603 further includes:
  • the first local configuration sub-module is used to configure the interconnection between the computing core and the local data stream network to obtain the computing path of the computing core;
  • the first local path sub-module is configured to configure the interconnection between the second data stream storage unit and the local data stream network to obtain a storage path;
  • the first engine module is used to obtain the calculation engine according to the calculation path and the storage path.
  • FIG. 7 is a schematic diagram of a data stream-based deep network acceleration device provided by an embodiment of the application. As shown in FIG. 7, the device includes:
  • the second obtaining module 701 is used to obtain target deep network information required by the data to be processed
  • the second matching module 702 is configured to match a preset target network configuration rule corresponding to the target deep network information according to the target deep network information, wherein the target network configuration rule includes a computing core and a second data stream Storage module and local data flow network;
  • the second configuration module 703 is configured to configure and obtain the target data flow engine according to the target network configuration rule
  • the second processing module 704 is configured to process the to-be-processed data through the target data flow engine.
  • the second configuration module 703 includes:
  • the second local configuration sub-module is used to configure the interconnection between the computing core and the local data stream network to obtain the computing path of the computing core;
  • the second local path sub-module is configured to configure the interconnection between the second data stream storage module and the local data stream network to obtain a storage path;
  • the second engine module is used to obtain the target data flow engine according to the calculation path and the storage path.
  • the second processing module 704 includes:
  • the second acquisition sub-module is configured to read the to-be-processed data into the second data stream storage module
  • the second data address generation sub-module is configured to generate an address sequence for the data to be processed according to a preset generation rule in the second data stream storage module according to the data format and data path of the data to be processed;
  • the second input sub-module is used for each clock cycle to read from the second data stream storage module according to the address sequence the data volume corresponding to the computing core in the target data stream engine for input, and obtain the second The state of the data stream storage module and the computing core.
  • the second processing module 704 includes:
  • the input calculation sub-module is used to input the data in the first storage unit into the calculation core to obtain the calculation result;
  • the output storage submodule is used to store the calculation result in the second storage unit as input data for the next calculation core.
  • An embodiment of the present application provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and capable of running on the processor.
  • When the processor executes the computer program, the steps in the data stream-based deep network acceleration method provided in the embodiments of this application are implemented.
  • An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed by a processor, the steps in the data stream-based deep network acceleration method are implemented.
  • The processors and chips in the various embodiments of the present application may be integrated into one processing unit, may exist alone physically, or two or more pieces of hardware may be integrated into one unit.
  • The computer-readable program may be stored in a computer-readable memory.
  • the technical solution of the present application essentially or the part that contributes to the prior art or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a memory, A number of instructions are included to enable a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method described in each embodiment of the present application.
  • the aforementioned memory includes: U disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), mobile hard disk, magnetic disk or optical disk and other various media that can store program codes.
  • The program can be stored in a computer-readable memory, and the memory can include: a flash disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and so on.
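  • Putting the pieces together, the end-to-end flow of the first aspect (obtain the target network information, match a preset rule, configure the data stream network, process the data) could look like the following toy sketch. Every name and the rule table are hypothetical; the stages reuse the placeholder arithmetic from the sketches above.

```python
def accelerate(data: list, target_info: dict) -> list:
    """Toy end-to-end version of the first-aspect method."""
    # Step 2: match the preset target network configuration rule
    # (step 1, obtaining target_info, is done by the caller).
    rules = {"image_recognition": ["conv", "pool", "activation"]}
    core_order = rules[target_info["network_type"]]
    # Step 3: configure the target data stream network by wiring stand-in
    # stages in the order the matched rule prescribes.
    stages = {"conv": lambda d: [3 * x for x in d],
              "pool": lambda d: d[::2],
              "activation": lambda d: [max(0, x) for x in d]}
    pipeline = [stages[name] for name in core_order]
    # Step 4: process the data through the configured network; the data
    # streams through every stage with no instruction-ordering overhead.
    for stage in pipeline:
        data = stage(data)
    return data

print(accelerate([-2, -1, 0, 1, 2, 3], {"network_type": "image_recognition"}))
```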

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present application provides a deep network acceleration method and apparatus based on a data stream, a device, and a storage medium. The method comprises: obtaining target deep network information required by data to be processed; according to the target deep network information, matching a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises configuration rules between a pre-configured calculation engine, a first data stream storage module, and a global data stream network; configuring to obtain a target data stream network according to the target network configuration rule; and processing said data by means of the target data stream network. A deep network is accelerated by means of the data stream and off-chip data communication is reduced, so instruction idle overhead is avoided and the hardware acceleration efficiency of the deep network can be improved; moreover, different deep network models can be configured by performing network configuration, and multiple different deep network models are supported.

Description

基于数据流的深度网络加速方法、装置、设备及存储介质Data stream-based deep network acceleration method, device, equipment and storage medium 技术领域Technical field
本申请涉及人工智能领域,更具体的说,是涉及一种基于数据流的深度网络加速方法、装置、设备及存储介质。This application relates to the field of artificial intelligence, and more specifically, to a data stream-based deep network acceleration method, device, device, and storage medium.
背景技术Background technique
基于神经网络的深度学习应用程序的进步要求底层硬件平台具有高处理能力。当基于CPU的平台无法满足这种不断增长的需求时,许多公司开发了专用硬件加速器来支持该领域的进步。现有的硬件加速器的共同想法是加速在深度学习算法应用中更频繁使用的某些特定类型的计算。现有的硬件架构基于具有可扩展指令集的指令执行,然后通过将常用计算实现为定制指令来实现加速。基于指令的架构实现通常表示为片上系统(SoC)设计。在基于指令的体系结构中,许多时钟周期被浪费用于非计算相关操作。为了支持更通用的指令体系结构,深度学习神经网络内的计算通常被分解为多个指令。因此一个计算通常需要多个时钟周期。处理器中的算术和逻辑单元(ALU)通常是以硬件实现的不同操作的集合。由于有限的指令表达式和有限的I/O带宽,大多数ALU资源在执行单个指令时处于空闲状态,比如,在做乘法与加法时,会先读取乘法的数据,由于I/O速度受带宽影响,使得加法需要等待乘法计算完成并写入存储器中,再读取出计算结果及加法数据进行加法计算,在乘法计算与读写过程中,加法计算单元是空闲状态。因此存在基于指令的硬件加速效率低的问题。The advancement of neural network-based deep learning applications requires high processing capabilities on the underlying hardware platform. When CPU-based platforms cannot meet this growing demand, many companies have developed dedicated hardware accelerators to support advancements in this field. The common idea of existing hardware accelerators is to accelerate certain types of calculations that are used more frequently in deep learning algorithm applications. The existing hardware architecture is based on the execution of instructions with an extensible instruction set, and then realizes acceleration by implementing common calculations as customized instructions. The instruction-based architecture implementation is usually expressed as a system-on-chip (SoC) design. In an instruction-based architecture, many clock cycles are wasted for non-computation related operations. In order to support a more general instruction architecture, calculations in deep learning neural networks are usually decomposed into multiple instructions. Therefore, a calculation usually requires multiple clock cycles. The arithmetic and logic unit (ALU) in the processor is usually a collection of different operations implemented in hardware. Due to limited instruction expressions and limited I/O bandwidth, most ALU resources are idle when executing a single instruction. For example, when doing multiplication and addition, the data of the multiplication will be read first, because the I/O speed is affected. Bandwidth affects, so that addition needs to wait for the multiplication calculation to be completed and write it into the memory, and then read the calculation result and the addition data for the addition calculation. During the multiplication calculation and reading and writing, the addition calculation unit is idle. Therefore, there is a problem of low efficiency of instruction-based hardware acceleration.
申请内容Application content
The purpose of this application is to address the above-mentioned defects in the prior art by providing a data stream-based deep network acceleration method, apparatus, device, and storage medium, which solves the problem that, with limited instruction expressions and limited I/O bandwidth, most ALU resources sit idle while a single instruction executes, so that acceleration efficiency is low.

The purpose of this application is achieved through the following technical solutions:

In a first aspect, a data stream-based deep network acceleration method is provided. The method includes:

obtaining target deep network information required by data to be processed;

matching, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, where the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network;

configuring a target data stream network according to the target network configuration rule; and

processing the data to be processed through the target data stream network.

Optionally, configuring the target data stream network according to the target network configuration rule includes:

configuring multiple calculation engines in parallel or in series according to the global data stream network;

obtaining data stream paths of the multiple calculation engines according to the first data stream storage module and the parallel or serial arrangement of the multiple calculation engines; and

forming the target data stream network based on the data stream paths.

Optionally, processing the data to be processed through the target data stream network includes:

reading the data to be processed into the first data stream storage module;

generating, in the first data stream storage module, an address sequence for the data to be processed according to a preset generation rule, based on the data format and data path of the data to be processed; and

in each clock cycle, reading from the first data stream storage module, according to the address sequence, the amount of data corresponding to a calculation engine in the target data stream network as input, and acquiring the states of the first data stream storage module and the calculation engine.

Optionally, the target network configuration further includes a calculation core, a second data stream storage unit, and a local data stream network connecting the calculation core and the second data stream storage unit, and the configuration of the calculation engine includes:

configuring the interconnection between the calculation core and the local data stream network to obtain a calculation path of the calculation core;

configuring the interconnection between the second data stream storage unit and the local data stream network to obtain a storage path; and

obtaining the calculation engine according to the calculation path and the storage path.
In a second aspect, a data stream-based deep network acceleration method is further provided. The method includes:

obtaining target deep network information required by data to be processed;

matching, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, where the target network configuration rule includes a calculation core, a second data stream storage module, and a local data stream network;

configuring a target data stream engine according to the target network configuration rule; and

processing the data to be processed through the target data stream engine.

Optionally, configuring the target data stream engine according to the target network configuration rule includes:

configuring the interconnection between the calculation core and the local data stream network to obtain a calculation path of the calculation core;

configuring the interconnection between the second data stream storage module and the local data stream network to obtain a storage path; and

obtaining the target data stream engine according to the calculation path and the storage path.

Optionally, processing the data to be processed through the target data stream engine includes:

reading the data to be processed into the second data stream storage module;

generating, in the second data stream storage module, an address sequence for the data to be processed according to a preset generation rule, based on the data format and data path of the data to be processed; and

in each clock cycle, reading from the second data stream storage module, according to the address sequence, the amount of data corresponding to a calculation core in the target data stream engine as input, and acquiring the states of the second data stream storage module and the calculation core.

Optionally, the second data stream storage module includes a first storage unit and a second storage unit, and processing the data to be processed through the target data stream engine includes:

inputting data in the first storage unit into a calculation core to obtain a calculation result; and

storing the calculation result in the second storage unit as input data for the next calculation core.
In a third aspect, a data stream-based deep network acceleration apparatus is further provided. The apparatus includes:

a first acquisition module, configured to obtain target deep network information required by data to be processed;

a first matching module, configured to match, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, where the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network;

a first configuration module, configured to configure a target data stream network according to the target network configuration rule; and

a first processing module, configured to process the data to be processed through the target data stream network.

In a fourth aspect, a data stream-based deep network acceleration apparatus is further provided. The apparatus includes:

a second acquisition module, configured to obtain target deep network information required by data to be processed;

a second matching module, configured to match, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, where the target network configuration rule includes a calculation core, a second data stream storage module, and a local data stream network;

a second configuration module, configured to configure a target data stream engine according to the target network configuration rule; and

a second processing module, configured to process the data to be processed through the target data stream engine.

In a fifth aspect, an electronic device is provided, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the steps in the data stream-based deep network acceleration method provided by the embodiments of this application.

In a sixth aspect, a computer-readable storage medium is provided, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the data stream-based deep network acceleration method provided by the embodiments of this application.

Beneficial effects of this application: the deep network is accelerated through data streams and off-chip data communication is reduced, so there is no instruction idle overhead and the hardware acceleration efficiency of the deep network can be improved; moreover, through network configuration, different deep network models can be configured, supporting a variety of deep network models.
Description of the drawings

FIG. 1 is a schematic diagram of an optional implementation architecture for a data stream-based deep network acceleration method provided by an embodiment of this application;

FIG. 2 is a schematic flowchart of a data stream-based deep network acceleration method provided by the first aspect of the embodiments of this application;

FIG. 3 is a schematic flowchart of another data stream-based deep network acceleration method provided by an embodiment of this application;

FIG. 4 is a schematic flowchart of a data stream-based deep network acceleration method provided by the second aspect of the embodiments of this application;

FIG. 5 is a schematic flowchart of another data stream-based deep network acceleration method provided by an embodiment of this application;

FIG. 6 is a schematic diagram of a data stream-based deep network acceleration apparatus provided by the third aspect of the embodiments of this application;

FIG. 7 is a schematic diagram of a data stream-based deep network acceleration apparatus provided by the fourth aspect of the embodiments of this application.
Detailed description

Preferred embodiments of this application are described below. Based on the following description, those of ordinary skill in the art will be able to implement them using related technologies in the field and will better understand the innovations and benefits of this application.
To further describe the technical solution of this application, please refer to FIG. 1, which is a schematic diagram of an optional implementation architecture for a data stream-based deep network acceleration method provided by an embodiment of this application. As shown in FIG. 1, an architecture 103 is connected to an off-chip storage module (DDR) 101 and a CPU through interconnects. The architecture 103 includes a first storage module 104, a global data stream network 105, and data stream engines 106. The first storage module 104 is connected through interconnects both to the off-chip storage module 101 and to the global data stream network 105, and each data stream engine 106 is connected to the global data stream network 105 through interconnects so that the data stream engines 106 can operate in parallel or in series. A data stream engine 106 may include calculation cores (also called calculation modules), a second storage module 108, and a local data stream network 107. The calculation cores may include kernels used for calculation, such as a convolution core 109, a pooling core 110, and an activation function core 111; of course, other calculation cores beyond these examples may also be included, which is not limited here, and all kernels used for calculation in a deep network may be included. The first storage module 104 and the second storage module 108 may be on-chip cache modules, or DDR or high-speed DDR storage modules. The data stream engine 106 can be understood as a calculation engine that supports, or is dedicated to, data stream processing. The CPU may include a control register, which is pre-configured with network configuration rules for configuring the network.
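To make the topology of FIG. 1 concrete, the following is a minimal software sketch of the architecture, not the actual hardware implementation; all class and field names (Architecture, DataStreamEngine, and so on) are illustrative assumptions rather than identifiers defined in this application.

    # Minimal sketch of the FIG. 1 topology; names are illustrative only.
    class ComputeCore:
        def __init__(self, kind):
            self.kind = kind  # e.g. "conv", "pool", "activation"

    class DataStreamEngine:
        def __init__(self, cores):
            self.cores = cores           # calculation cores inside the engine
            self.second_storage = []     # second storage module 108 (buffer)
            self.local_network = {}      # local data stream network 107 (routes)

    class Architecture:
        def __init__(self, num_engines):
            self.first_storage = []      # first storage module 104
            self.global_network = {}     # global data stream network 105 (routers)
            self.engines = [DataStreamEngine([ComputeCore("conv"),
                                              ComputeCore("pool"),
                                              ComputeCore("activation")])
                            for _ in range(num_engines)]

    arch = Architecture(num_engines=2)   # connected off-chip to DDR 101 and a CPU
    print(len(arch.engines), [c.kind for c in arch.engines[0].cores])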
It should be noted that the deep network in this application may also be called a deep learning network, a deep learning neural network, and the like.

This application provides a data stream-based deep network acceleration method, apparatus, device, and storage medium.

The purpose of this application is achieved through the following technical solutions:

In the first aspect, please refer to FIG. 2, which is a schematic flowchart of a data stream-based deep network acceleration method provided by an embodiment of this application. As shown in FIG. 2, the method includes the following steps:

201. Obtain target deep network information required by data to be processed.

In this step, the data to be processed may be data that can be processed through a deep network, such as image data to be recognized, target data to be detected, or target data to be tracked. The target deep network information corresponds to the deep network required by the data to be processed: for example, if the data to be processed is image data to be recognized, the target deep network information is the configuration parameters of a deep network used for image recognition; if the data to be processed is target data to be detected, the target deep network information is the configuration parameters of a deep network used for target detection. The target deep network information may be preset and determined by matching against the data to be processed, or may be selected manually, which is not limited here. Obtaining the target deep network information facilitates configuring the deep network. The deep network information may include the network type, data type, number of layers, calculation type, and so on.
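As a loose illustration of the kind of record the target deep network information could be, consider the sketch below; the field names and the mapping from workload to network information are hypothetical examples, not values defined by this application.

    # Hypothetical records of target deep network information, keyed by task.
    target_info_by_task = {
        "image_recognition": {"network_type": "cnn_classifier",
                              "data_type": "int8", "num_layers": 18,
                              "calc_types": ["conv", "pool", "activation"]},
        "target_detection":  {"network_type": "cnn_detector",
                              "data_type": "int8", "num_layers": 24,
                              "calc_types": ["conv", "pool", "activation"]},
    }

    def get_target_info(task):
        # Match the data to be processed (here, its task label) to preset info.
        return target_info_by_task[task]

    print(get_target_info("image_recognition")["num_layers"])  # 18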
202. Match, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, where the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network.

The target deep network information already contains the network type, data type, number of layers, calculation type, and so on of the deep network required by the data to be processed. The target network configuration rules may be set in advance, for example as the parameter rules and calculation rules of preset networks such as image recognition networks, target detection networks, and target tracking networks. The parameter rules may be hyperparameter setting rules, weight setting rules, and the like; the calculation rules may be calculation rules for addition, multiplication, convolution, deconvolution, and so on. The configuration rules among the pre-configured calculation engines, the first data stream storage module, and the global data stream network can be understood as the number of calculation engines, the connection mode between the calculation engines and the global data stream network, the connection mode between the first data stream storage module and the global data stream network, the routing connections within the global data stream network, and so on. The global data stream network can be configured by the control register. The network may be implemented as routers between the first data stream storage module and the calculation engines. When multiple calculation engines are instantiated in a single architecture, the global data stream network can be configured to send different data to different calculation engines for data parallelism, or to link the calculation engines in series through their inputs and outputs into a longer calculation pipeline, in which more neural network layers can be processed.
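A minimal sketch of matching network information to a preset configuration rule might look as follows; the rule contents (engine count, serial/parallel mode, core order) are assumptions chosen purely for illustration.

    # Preset configuration rules keyed by network type (illustrative values).
    config_rules = {
        "cnn_classifier": {"num_engines": 2, "engine_mode": "serial",
                           "core_order": ["conv", "pool", "activation"]},
        "cnn_detector":   {"num_engines": 4, "engine_mode": "parallel",
                           "core_order": ["conv", "pool", "activation"]},
    }

    def match_rule(target_info):
        # Pick the preset rule corresponding to the target deep network info.
        return config_rules[target_info["network_type"]]

    print(match_rule({"network_type": "cnn_detector"})["num_engines"])  # 4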
In a possible embodiment, the first data stream storage module may include two data stream storage units, one for input and one for output, used for data access: the input data stream storage unit feeds input data into the calculation engine for calculation, and the calculation engine writes the calculation result to the output data stream storage unit for storage. This avoids the situation where, while the input data stream storage unit is feeding data to the calculation engine, the engine's output cannot be written back into that same input unit. For example, suppose the calculation engine needs to compute on a piece of data from the input data stream storage unit twice: after the first calculation is completed, the engine needs to read that data from the input unit a second time. Normally it would have to wait for the first calculation result to be stored back into the input data stream storage unit before reading the data again; with a separate output data stream storage unit, the first result can be stored to the output unit while the data is read for the second time, with no waiting, which improves the efficiency of data processing.
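The benefit of separate input and output stream units can be sketched as below, assuming a simple engine that reads each operand twice; the queue-based model is an assumption for illustration, not the hardware design.

    from collections import deque

    # Separate input and output stream units: the engine can re-read its
    # input while earlier results drain into the output unit.
    input_unit = deque([3, 5, 7])   # input data stream storage unit
    output_unit = deque()           # output data stream storage unit

    for x in input_unit:            # the second read of x needs no waiting,
        y1 = x * x                  # first pass over the datum
        output_unit.append(y1)      # result goes to the output unit...
        y2 = x + y1                 # ...while x is immediately read again
        output_unit.append(y2)

    print(list(output_unit))        # [9, 12, 25, 30, 49, 56]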
203. Configure the target data stream network according to the target network configuration rule.

Implementing the target network configuration rule may amount to establishing the connection relationships among the pre-configured calculation engines, the first data stream storage module, and the global data stream network. The connection relationships may include the number of connected calculation engines, their connection order, and so on. The calculation engines can be connected to the global data stream network through interconnects to form a new deep network, and different numbers and orders of calculation engine connections form different deep networks. Configuring according to the target network configuration rule yields the target data stream network used to process the data to be processed. Since each calculation engine reads data through the first data stream storage module, the data in the first data stream storage module can be read into different calculation engines to form data streams, and no instruction-set ordering is required, so the configured calculation engines produce no idle calculation slots.

204. Process the data to be processed through the target data stream network.

The target data stream network is configured through the target network information and may also be called a customized data stream network. It connects the first data stream storage module and the calculation engines through the global data stream network to form data streams; compared with an instruction set, there is no need to wait for the reads and writes of a previous instruction to complete, so calculation under the deep network architecture is efficient.

In this embodiment, target deep network information required by data to be processed is obtained; according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information is matched, where the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network; a target data stream network is configured according to the target network configuration rule; and the data to be processed is processed through the target data stream network. Accelerating the deep network through data streams reduces off-chip data communication, so there is no instruction idle overhead and the hardware acceleration efficiency of the deep network can be improved; moreover, through network configuration, different deep network models can be configured, supporting a variety of deep network models.

It should be noted that the data stream-based deep network acceleration method provided by the embodiments of this application can be applied to devices that perform data stream-based deep network acceleration, such as computers, servers, and mobile phones.

Please refer to FIG. 3, which is a schematic flowchart of another data stream-based deep network acceleration method provided by an embodiment of this application. As shown in FIG. 3, the method includes the following steps:

301. Obtain target deep network information required by data to be processed.

302. Match, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, where the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network.

303. Configure multiple calculation engines in parallel or in series according to the global data stream network.

In this step, the global data stream network may be implemented by routing and may be configured by the control register, in which corresponding global data stream network configuration rules are preset. The network is implemented as routers between the first data stream storage module and each calculation engine; the main function of these network routers is to provide skip paths and feedback paths for the data streams between the calculation engines. The parallel or serial arrangement of multiple calculation engines can be configured through the data streams. For example, when calculation engine A and calculation engine B are in parallel on the global data stream network, the data stream flows to both at the same time, realizing parallel processing of the data; when they are in series, the data stream can first be calculated in calculation engine A, and the calculation result then flows to calculation engine B, which can be understood as deepening the calculation layers of the deep network. Concretely, the configuration may control the data flow direction through the global data stream network, thereby realizing the parallel or serial configuration of multiple calculation engines. This can be achieved by configuring the interconnection between the global data stream network and the calculation engines; for example, multiple calculation engines may be interconnected with the global data stream network according to parallel rules or according to serial rules, while the first data stream storage module is configured to interconnect with the global data stream network.
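The serial and parallel engine arrangements described above can be modeled with plain function composition; the sketch below treats each engine as a function over a block of data, which is a deliberate simplification of the routed hardware network.

    # Each "engine" is modeled as a function over a block of data.
    def engine_a(block):
        return [v * 2 for v in block]       # stand-in for one engine's work

    def engine_b(block):
        return [v + 1 for v in block]

    def run_serial(block):
        # Serial: A's output streams into B, deepening the pipeline.
        return engine_b(engine_a(block))

    def run_parallel(block_1, block_2):
        # Parallel: the global network sends different data to each engine.
        return engine_a(block_1), engine_b(block_2)

    print(run_serial([1, 2, 3]))            # [3, 5, 7]
    print(run_parallel([1, 2], [3, 4]))     # ([2, 4], [4, 5])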
304. Obtain data stream paths of the multiple calculation engines according to the first data stream storage module and the parallel or serial arrangement of the multiple calculation engines.

In this step, the first data stream storage module may be a cache, DDR, or high-speed-access DDR; in the embodiments of this application it is preferably a cache, specifically one provided with a controllable read/write address generation unit. Depending on the input data format and the calculations required in the data path, the address generation unit generates an adapted address sequence to index the data in the cache. The address sequence can be used to index the data in the cache for input to the corresponding calculation engine; for example, if a calculation engine needs 80 pieces of data for a calculation, 80 pieces of data corresponding to the address sequence are read from the cache into that engine. In addition, the address generation unit can use counters so that the generated address sequence has different loop sizes, for example a small loop over data 1, data 2, and data 3, which improves data reuse and also adapts to the data processing size of each calculation engine. The first data stream storage module stores the data stream and directs it to each data node in the parallel or serial arrangement of the multiple calculation engines, that is, along the data stream path, so that data processing moves through the calculation engines like a pipeline, improving the efficiency of data processing.
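The counter-controlled address generation unit can be sketched as a generator, as below; the loop size, base address, and burst length are example parameters, not values specified by this application.

    def address_sequence(base, loop_size, repeats):
        """Yield addresses that cycle over a small window to improve reuse,
        e.g. loop_size=3 cycles over data 1, data 2, data 3."""
        for _ in range(repeats):
            for offset in range(loop_size):
                yield base + offset

    def read_for_engine(cache, addresses, amount=80):
        # Read exactly the amount of data one engine needs per calculation.
        return [cache[a] for _, a in zip(range(amount), addresses)]

    cache = list(range(1024))                                # stand-in buffer
    seq = address_sequence(base=0, loop_size=3, repeats=40)  # up to 120 addrs
    block = read_for_engine(cache, seq, amount=80)
    print(block[:6])                                         # [0, 1, 2, 0, 1, 2]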
305. Form the target data stream network based on the data stream paths.

In this step, the first data stream storage module inputs data into the corresponding calculation engines through the global data stream network, and the calculation engines output calculation results to the first data stream storage module through the global data stream network. No instructions are needed for control, so there is no problem of a calculation unit sitting idle while a single instruction executes.

306. Process the data to be processed through the target data stream network.

In this embodiment, the first data stream storage module stores the data stream and directs it to each data node in the parallel or serial arrangement of the multiple calculation engines, that is, along the data stream path, so that data processing moves through the calculation engines like a pipeline, improving the efficiency of data processing.

Optionally, processing the data to be processed through the target data stream network includes:

reading the data to be processed into the first data stream storage module;

generating, in the first data stream storage module, an address sequence for the data to be processed according to a preset generation rule, based on the data format and data path of the data to be processed; and

in each clock cycle, reading from the first data stream storage module, according to the address sequence, the amount of data corresponding to a calculation engine in the target data stream network as input, and acquiring the states of the first data stream storage module and the calculation engine.

In this embodiment, the first data stream storage module may be a cache, DDR, or high-speed-access DDR; in the embodiments of this application it is preferably a cache, specifically one provided with a controllable read/write address generation unit. Depending on the input data format and the calculations required in the data path, the address generation unit generates an adapted address sequence to index the data in the cache for input to the corresponding calculation engine; for example, if a calculation engine needs 80 pieces of data for a calculation, 80 pieces of data corresponding to the address sequence are read from the cache into that engine. The address generation unit can also use counters so that the generated address sequence has different loop sizes, for example a small loop over data 1, data 2, and data 3, which improves data reuse and adapts to the data processing size of each calculation engine. The states of the first data stream storage module include a data-read-ready state and a data-write-complete state; the states of the calculation engine include whether the calculation is completed, whether the next batch of calculation data needs to be read, and so on. The data in the first data stream storage module can be monitored with a finite state machine to obtain the state of the first data stream storage module, and the state of the calculation engine can be derived from the state of the first data stream storage module; for example, after a calculation result is written into the first data stream storage module, the state of the calculation engine can be determined to be calculation-completed.

In each clock cycle, the states of each calculation engine and of the first data stream storage module are acquired, so behavior can be predicted accurately, and hardware performance can be optimized for maximum efficiency through precise calculation scheduling, further improving the efficiency of data processing.
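A toy model of this per-cycle state tracking with a finite state machine is given below; the state names and the way the engine state is derived are illustrative assumptions, not the hardware state encoding.

    # Illustrative per-cycle status polling of buffer and engine.
    READY, BUSY, DONE = "read_ready", "computing", "write_complete"

    class BufferFSM:
        def __init__(self):
            self.state = READY

        def on_result_written(self):
            self.state = DONE     # the result landed in the storage module

    def engine_state(buffer_fsm):
        # The engine's state is derived from the buffer's state: once the
        # result is written back, the calculation is known to be complete.
        return "calculation_complete" if buffer_fsm.state == DONE else BUSY

    fsm = BufferFSM()
    for cycle in range(3):                 # each clock cycle: poll both states
        if cycle == 2:
            fsm.on_result_written()
        print(cycle, fsm.state, engine_state(fsm))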
Optionally, the target network configuration further includes a calculation core, a second data stream storage unit, and a local data stream network connecting the calculation core and the second data stream storage unit, and the configuration of the calculation engine includes:

configuring the interconnection between the calculation core and the local data stream network to obtain a calculation path of the calculation core;

configuring the interconnection between the second data stream storage unit and the local data stream network to obtain a storage path; and

obtaining the calculation engine according to the calculation path and the storage path.

In this embodiment, the calculation cores, the second data stream storage module, and the local data stream network are the main components of the calculation engine. The calculation cores may be kernels with calculation capability such as convolution cores, pooling cores, and activation function cores; it should be noted that a calculation core may also be called a calculation kernel, a calculation unit, a calculation module, and so on. The second data stream storage module may be a storage module with data access functions, such as a cache, DDR, or high-speed DDR. The second data stream storage module and the first data stream storage module may be different storage areas on the same memory; for example, the second data stream storage module may be a second data buffer area in a cache, and the first data stream storage module a first data buffer area in the same cache. The local data stream network can be understood as the routing within the calculation engine that connects the calculation cores with the second data stream storage module. For example, the connections between calculation cores may be controlled by network routers, whose main function is to provide skip paths and feedback paths. By setting the control register, the local data stream network can be configured to form stream paths over the different calculation cores available in the calculation engine. The combination of the types and order of the calculation cores along a stream path provides a continuous data processing pipeline for multiple layers of a deep learning neural network; for example, following the data flow direction, a combination of convolution core to pooling core to activation function core yields a convolutional neural network layer, while a combination of deconvolution core to pooling core to activation function core yields a deconvolutional neural network layer, and so on. It should be noted that the combination of the types and order of the calculation cores is determined by the target network configuration rule. By forming data streams between the calculation cores, the calculation of the calculation engine can be accelerated, further improving the data processing efficiency of the deep network.
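The composition of calculation cores into a layer pipeline along a stream path can be sketched as follows; the core implementations are deliberately trivial stand-ins for illustration, and the names are not from this application.

    # Trivial stand-ins for calculation cores along one stream path.
    def conv_core(x):       return [2 * v for v in x]      # fake convolution
    def pool_core(x):       return x[::2]                  # fake pooling
    def activation_core(x): return [max(0, v) for v in x]  # ReLU-style

    def build_stream_path(core_order):
        cores = {"conv": conv_core, "pool": pool_core,
                 "activation": activation_core}
        ordered = [cores[name] for name in core_order]
        def layer(x):
            for core in ordered:   # data streams core to core, no instructions
                x = core(x)
            return x
        return layer

    # conv -> pool -> activation yields one CNN-style layer.
    cnn_layer = build_stream_path(["conv", "pool", "activation"])
    print(cnn_layer([1, 2, 3, 4]))   # [2, 6]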
The above optional embodiments can implement the data stream-based deep network acceleration methods of the embodiments corresponding to FIG. 2 and FIG. 3 and achieve the same effects, which will not be repeated here.

In the second aspect, please refer to FIG. 4, which is a schematic flowchart of a data stream-based deep network acceleration method provided by an embodiment of this application. As shown in FIG. 4, the method includes:

401. Obtain target deep network information required by data to be processed.

In this step, the data to be processed may be data that can be processed through a deep network, such as image data to be recognized, target data to be detected, or target data to be tracked. The target deep network information corresponds to the deep network required by the data to be processed: for example, if the data to be processed is image data to be recognized, the target deep network information is the configuration parameters of a deep network used for image recognition; if the data to be processed is target data to be detected, the target deep network information is the configuration parameters of a deep network used for target detection. The target deep network information may be preset and determined by matching against the data to be processed, or may be selected manually, which is not limited here. Obtaining the target deep network information facilitates configuring the deep network. The deep network information may include the network type, data type, number of layers, calculation type, and so on.

402. Match, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, where the target network configuration rule includes a calculation core, a second data stream storage module, and a local data stream network.

The target deep network information already contains the network type, data type, number of layers, calculation type, and so on of the deep network required by the data to be processed. The target network configuration rules may be set in advance, for example as the parameter rules and calculation rules of preset networks such as image recognition networks, target detection networks, and target tracking networks; the parameter rules may be hyperparameter setting rules, weight setting rules, and the like, and the calculation rules may be calculation rules for addition, multiplication, convolution, deconvolution, and so on. The configuration rules among the calculation cores, the second data stream storage module, and the local data stream network can be understood as the types and number of calculation cores, the connection mode between the calculation cores and the local data stream network, the connection mode between the second data stream storage module and the local data stream network, the routing connections within the local data stream network, and so on. The local data stream network can be configured by the control register and may be implemented as routers between the second data stream storage module and the calculation cores; for example, the connections between calculation cores may be controlled by network routers, whose main function is to provide skip paths and feedback paths.

403. Configure the target data stream engine according to the target network configuration rule.

Implementing the target network configuration rule may amount to establishing the connection relationships among the pre-configured calculation cores, the second data stream storage module, and the local data stream network. The connection relationships may include the types of calculation cores, the number of connections, the connection order, and so on. The calculation cores can be connected to the local data stream network through interconnects to form a new calculation engine, that is, a data stream engine, and different types, numbers, and connection orders of calculation cores form the data stream engines required by different deep networks. Configuring according to the target network configuration rule yields the target data stream engine used to process the data to be processed. Since each calculation core reads data through the second data stream storage module, the data in the second data stream storage module can be read into different calculation cores to form data streams; for example, data requiring multiplication is read into a multiplication core for multiplication, and data requiring addition is read into an addition core for addition. Since data streams require no instruction-set ordering, the configured data stream engine produces no idle calculation slots.
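Routing each datum from the storage module to the core matching its required operation can be sketched as a small dispatch loop; the tagging scheme below is an assumption made for illustration only.

    # Tagged data items flow to the core that performs their operation.
    def multiply_core(a, b): return a * b
    def add_core(a, b):      return a + b

    cores = {"mul": multiply_core, "add": add_core}

    # Items read from the second data stream storage module (illustrative).
    stream = [("mul", 3, 4), ("add", 3, 4), ("mul", 5, 6)]

    results = [cores[op](a, b) for op, a, b in stream]
    print(results)   # [12, 7, 30]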
404. Process the data to be processed through the target data stream engine.

The target data stream engine is configured through the target network information and may also be called a customized data stream engine. It connects the second data stream storage module and the calculation cores through the local data stream network to form data streams; compared with an instruction-set implementation, there is no need to wait for the reads and writes of a previous instruction to complete, so calculation under the deep network architecture is efficient.

In this embodiment, target deep network information required by data to be processed is obtained; according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information is matched, where the target network configuration rule includes a calculation core, a second data stream storage module, and a local data stream network; a target data stream engine is configured according to the target network configuration rule; and the data to be processed is processed through the target data stream engine. Accelerating the deep network through data streams reduces off-chip data communication, so there is no instruction idle overhead and the hardware acceleration efficiency of the deep network can be improved; moreover, through network configuration, the calculation engines required by different deep network models can be configured, supporting a variety of deep network models.

Please refer to FIG. 5, which is a schematic flowchart of another data stream-based deep network acceleration method provided by an embodiment of this application. As shown in FIG. 5, the method includes:

501. Obtain target deep network information required by data to be processed.

502. Match, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, where the target network configuration rule includes a calculation core, a second data stream storage module, and a local data stream network.

503. Configure the interconnection between the calculation core and the local data stream network to obtain a calculation path of the calculation core.

504. Configure the interconnection between the second data stream storage module and the local data stream network to obtain a storage path.

505. Obtain the target data stream engine according to the calculation path and the storage path.

506. Process the data to be processed through the target data stream engine.

In this embodiment, the calculation cores, the second data stream storage module, and the local data stream network are the main components of the data stream engine. The calculation cores may be kernels with calculation capability such as convolution cores, pooling cores, and activation function cores; a calculation core may also be called a calculation kernel, a calculation unit, a calculation module, and so on. The second data stream storage module may be a storage module with data access functions, such as a cache, DDR, or high-speed DDR, and the second and first data stream storage modules may be different storage areas on the same memory; for example, the second data stream storage module may be a second data buffer area in a cache and the first data stream storage module a first data buffer area in the same cache. The local data stream network can be understood as the routing within the calculation engine that connects the calculation cores with the second data stream storage module; for example, the connections between calculation cores may be controlled by network routers, whose main function is to provide skip paths and feedback paths. By setting the control register, the local data stream network can be configured to form stream paths over the different calculation cores available in the calculation engine. The combination of the types and order of the calculation cores along a stream path provides a continuous data processing pipeline for multiple layers of a deep learning neural network; for example, following the data flow direction, a combination of convolution core to pooling core to activation function core yields a convolutional neural network layer, while a combination of deconvolution core to pooling core to activation function core yields a deconvolutional neural network layer, and so on. It should be noted that the combination of the types and order of the calculation cores is determined by the target network configuration rule.

By forming data streams between the calculation cores, the calculation of the calculation engine can be accelerated, further improving the data processing efficiency of the deep network.

Optionally, processing the data to be processed through the target data stream engine includes:

reading the data to be processed into the second data stream storage module;

generating, in the second data stream storage module, an address sequence for the data to be processed according to a preset generation rule, based on the data format and data path of the data to be processed; and

in each clock cycle, reading from the second data stream storage module, according to the address sequence, the amount of data corresponding to a calculation core in the target data stream engine as input, and acquiring the states of the second data stream storage module and the calculation core.

In this embodiment, the second data stream storage module may be a cache, DDR, or high-speed-access DDR; in the embodiments of this application it is preferably a cache, specifically one provided with a controllable read/write address generation unit. Depending on the input data format and the calculations required in the data path, the address generation unit generates an adapted address sequence to index the data in the cache for input to the corresponding calculation core; for example, if a calculation core needs 80 pieces of data for a calculation, 80 pieces of data corresponding to the address sequence are read from the cache into that core. The address generation unit can also use counters so that the generated address sequence has different loop sizes, for example a small loop over data 1, data 2, and data 3, which improves data reuse and adapts to the data processing size of each calculation core. The states of the second data stream storage module include a data-read-ready state and a data-write-complete state; the states of the calculation core include whether the calculation is completed, whether the next batch of calculation data needs to be read, and so on. The data in the second data stream storage module can be monitored with a finite state machine to obtain the state of the second data stream storage module, and the state of the calculation core can be derived from it; for example, after a calculation result is written into the second data stream storage module, the state of the calculation core can be determined to be calculation-completed.

In each clock cycle, the states of each calculation core and of the second data stream storage module are acquired, so behavior can be predicted accurately, and hardware performance can be optimized for maximum efficiency through precise calculation scheduling, further improving the efficiency of data processing.

Optionally, the second data stream storage module includes a first storage unit and a second storage unit, and processing the data to be processed through the target data stream engine includes:

inputting data in the first storage unit into a calculation core to obtain a calculation result; and

storing the calculation result in the second storage unit as input data for the next calculation core.

In this embodiment, the first storage unit may be an input data stream storage unit and the second storage unit an output data stream storage unit; the two units are used for alternating access of data streams. That is, the first storage unit feeds input data into a calculation core for calculation, and the calculation core writes the calculation result to the second storage unit for storage. This avoids the situation where, while the first storage unit is feeding data to the calculation core, the core's output cannot be written back into that same unit. For example, suppose the calculation core needs to compute on a piece of data in the first storage unit twice: after the first calculation is completed, the core needs to read that data from the first storage unit a second time. Normally it would have to wait for the first calculation result to be stored back into the first storage unit before reading the data again; with the second storage unit in place, the first result can be stored into the second storage unit while the data is read from the first storage unit for the second time, with no waiting, which improves the efficiency of data processing.
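The alternating use of the first and second storage units between successive cores can be modeled as classic double buffering; the sketch below assumes list-backed units and trivial core functions, which is a simplification of the hardware.

    # Double buffering: each core reads from one unit and writes the other,
    # so re-reads of the input never wait on result write-back.
    def run_chain(data, cores):
        units = [list(data), []]          # first and second storage units
        src, dst = 0, 1
        for core in cores:
            units[dst] = [core(v) for v in units[src]]   # results -> other unit
            src, dst = dst, src           # swap roles for the next core
        return units[src]

    cores = [lambda v: v * v,             # first calculation core
             lambda v: v + 1]             # next core reads the stored results
    print(run_chain([1, 2, 3], cores))    # [2, 5, 10]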
上述的可选实施方式,可以现实图4和图5对应实施例的基于数据流的深度网络加速方法,达到相同的效果,在此不再赘述。需要说明的是,上述各个实施方式也可以与图2和图3实施例进行结合。The above-mentioned optional implementation manners can implement the data stream-based deep network acceleration method of the corresponding embodiment in FIG. 4 and FIG. 5, and achieve the same effect, which is not repeated here. It should be noted that each of the above-mentioned embodiments can also be combined with the embodiment of FIG. 2 and FIG. 3.
第三方面,请参照图6,图6为本申请实施例提供的一种基于数据流的深度网络加速装置示意图,如图6所示,所述装置包括:In the third aspect, please refer to FIG. 6. FIG. 6 is a schematic diagram of a data stream-based deep network acceleration device provided by an embodiment of the application. As shown in FIG. 6, the device includes:
第一获取模块601,用于获取待处理数据所需要的目标深度网络信息;The first obtaining module 601 is configured to obtain target deep network information required by the data to be processed;
第一匹配模块602,用于根据所述目标深度网络信息,匹配预先设置的与所述目标深度网络信息对应的目标网络配置规则,其中,所述目标网络配置规则包括预先配置的计算引擎、第一数据流存储模块以及全局数据流网络之间的配置规则;The first matching module 602 is configured to match a preset target network configuration rule corresponding to the target deep network information according to the target deep network information, wherein the target network configuration rule includes a pre-configured calculation engine, a second A configuration rule between the data stream storage module and the global data stream network;
第一配置模块603,用于根据所述目标网络配置规则,配置得到目标数据流网络;The first configuration module 603 is configured to configure the target data flow network according to the target network configuration rule;
第一处理模块604,用于通过所述目标数据流网络对所述待处理数据进行处理。The first processing module 604 is configured to process the data to be processed through the target data stream network.
可选的,所述第一配置模块603包括:Optionally, the first configuration module 603 includes:
全局配置子模块,用于根据所述全局数据流网络,配置多个计算引擎之间的并行或串行;The global configuration sub-module is used to configure parallel or serial between multiple calculation engines according to the global data flow network;
路径配置子模块,用于根据所述第一数据流存储模块及所述多个计算引擎之间的并行或串行,得到所述多个计算引擎的数据流路径;A path configuration submodule, configured to obtain the data flow paths of the multiple calculation engines according to the parallel or serial between the first data flow storage module and the multiple calculation engines;
形成子模块,用于基于所述数据流路径,形成所述目标数据流网络。A forming sub-module is used to form the target data flow network based on the data flow path.
Optionally, the first processing module 604 includes:
a first acquiring sub-module, configured to read the data to be processed into the first data stream storage module;
a first data address generation sub-module, configured to generate, in the first data stream storage module, an address sequence for the data to be processed according to a preset generation rule, based on the data format and data path of the data to be processed; and
a first input sub-module, configured to, in each clock cycle, read from the first data stream storage module according to the address sequence the amount of data corresponding to the computation engine in the target data stream network, input it, and acquire the states of the first data stream storage module and the computation engine.
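The address sequence and per-clock-cycle reads might behave as in the following sketch. The row-major flattening and the fixed number of words per cycle are assumptions; the embodiment only requires that the sequence follow a preset generation rule matched to the data format and data path.

    def address_sequence(shape, words_per_cycle):
        """Yield, per clock cycle, the addresses the engine will consume."""
        total = 1
        for dim in shape:
            total *= dim  # flatten the data format, row-major by assumption
        for base in range(0, total, words_per_cycle):
            yield range(base, min(base + words_per_cycle, total))

    memory = list(range(100, 124))          # data to be processed (2x3x4 format)
    for cycle, addrs in enumerate(address_sequence((2, 3, 4), 8)):
        burst = [memory[a] for a in addrs]  # one fixed-width read per clock cycle
        print(f"cycle {cycle}: {burst}")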
Optionally, the target network configuration further includes a compute core, a second data stream storage unit, and a local data stream network connecting the compute core and the second data stream storage unit; the first configuration module 603 further includes:
a first local configuration sub-module, configured to configure the interconnection between the compute core and the local data stream network to obtain the computation path of the compute core;
a first local path sub-module, configured to configure the interconnection between the second data stream storage unit and the local data stream network to obtain a storage path; and
a first engine module, configured to obtain the computation engine according to the computation path and the storage path.
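Assembling a computation engine from a computation path and a storage path could be modeled as below; representing compute cores as callables and the storage path as a list is only an analogy for the hardware interconnect, and make_engine is an invented name.

    def make_engine(compute_cores, storage_unit):
        """Wire cores (computation path) to a storage unit (storage path)."""
        def engine(x):
            for core in compute_cores:
                x = core(x)             # local network forwards core to core
                storage_unit.append(x)  # intermediate results reach storage
            return x
        return engine

    trace = []
    engine = make_engine([lambda v: v * 2, lambda v: v + 1], trace)
    print(engine(3), trace)  # 7 [6, 7]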
In a fourth aspect, please refer to FIG. 7, which is a schematic diagram of a data stream-based deep network acceleration apparatus according to an embodiment of this application. As shown in FIG. 7, the apparatus includes:
a second acquiring module 701, configured to acquire the target deep network information required by the data to be processed;
a second matching module 702, configured to match, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, where the target network configuration rule includes a compute core, a second data stream storage module, and a local data stream network;
a second configuration module 703, configured to obtain a target data stream engine through configuration according to the target network configuration rule; and
a second processing module 704, configured to process the data to be processed through the target data stream engine.
Optionally, the second configuration module 703 includes:
a second local configuration sub-module, configured to configure the interconnection between the compute core and the local data stream network to obtain the computation path of the compute core;
a second local path sub-module, configured to configure the interconnection between the second data stream storage module and the local data stream network to obtain a storage path; and
a second engine module, configured to obtain the target data stream engine according to the computation path and the storage path.
Optionally, the second processing module 704 includes:
a second acquiring sub-module, configured to read the data to be processed into the second data stream storage module;
a second data address generation sub-module, configured to generate, in the second data stream storage module, an address sequence for the data to be processed according to a preset generation rule, based on the data format and data path of the data to be processed; and
a second input sub-module, configured to, in each clock cycle, read from the second data stream storage module according to the address sequence the amount of data corresponding to the compute core in the target data stream engine, input it, and acquire the states of the second data stream storage module and the compute core.
Optionally, the second processing module 704 includes:
an input computation sub-module, configured to input the data in the first storage unit into the compute core to obtain a calculation result; and
an output storage sub-module, configured to store the calculation result in the second storage unit as the input data of the next compute core.
In a fifth aspect, an embodiment of this application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the data stream-based deep network acceleration method provided by the embodiments of this application.
In a sixth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the steps of the data stream-based deep network acceleration method provided by the embodiments of this application.
It should be noted that, for brevity, the foregoing method embodiments are all described as a series of action combinations. Those skilled in the art should understand, however, that this application is not limited by the described order of actions, because according to this application some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by this application.
In the above embodiments, the description of each embodiment has its own emphasis. For a part not described in detail in one embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative.
In addition, the processors and chips in the embodiments of this application may be integrated into one processing unit, may exist physically alone, or two or more pieces of hardware may be integrated into one unit. The computer-readable storage medium or computer-readable program may be stored in a computer-readable memory. Based on this understanding, the technical solution of this application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of this application. The aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
Those of ordinary skill in the art can understand that all or some of the steps in the methods of the above embodiments may be completed by a program instructing relevant hardware. The program may be stored in a computer-readable memory, and the memory may include a flash drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
The above content is a further detailed description of this application in conjunction with specific optional embodiments, and the specific embodiments of this application shall not be deemed limited to these descriptions. For those of ordinary skill in the art to which this application belongs, several simple deductions or substitutions may be made without departing from the concept of this application, all of which shall be deemed to fall within the protection scope of this application.

Claims (12)

  1. A data stream-based deep network acceleration method, wherein the method comprises:
    acquiring target deep network information required by data to be processed;
    matching, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises configuration rules among a pre-configured computation engine, a first data stream storage module, and a global data stream network;
    configuring a target data stream network according to the target network configuration rule; and
    processing the data to be processed through the target data stream network.
  2. The method according to claim 1, wherein the configuring a target data stream network according to the target network configuration rule comprises:
    configuring the parallel or serial arrangement among multiple computation engines according to the global data stream network;
    obtaining data stream paths of the multiple computation engines according to the first data stream storage module and the parallel or serial arrangement among the multiple computation engines; and
    forming the target data stream network based on the data stream paths.
  3. The method according to claim 1, wherein the processing the data to be processed through the target data stream network comprises:
    reading the data to be processed into the first data stream storage module;
    generating, in the first data stream storage module, an address sequence for the data to be processed according to a preset generation rule, based on the data format and data path of the data to be processed; and
    in each clock cycle, reading from the first data stream storage module according to the address sequence the amount of data corresponding to the computation engine in the target data stream network for input, and acquiring the states of the first data stream storage module and the computation engine.
  4. The method according to any one of claims 1 to 3, wherein the target network configuration further comprises a compute core, a second data stream storage unit, and a local data stream network connecting the compute core and the second data stream storage unit, and the configuration of the computation engine comprises:
    configuring the interconnection between the compute core and the local data stream network to obtain the computation path of the compute core;
    configuring the interconnection between the second data stream storage unit and the local data stream network to obtain a storage path; and
    obtaining the computation engine according to the computation path and the storage path.
  5. A data stream-based deep network acceleration method, wherein the method comprises:
    acquiring target deep network information required by data to be processed;
    matching, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises a compute core, a second data stream storage module, and a local data stream network;
    configuring a target data stream engine according to the target network configuration rule; and
    processing the data to be processed through the target data stream engine.
  6. The method according to claim 5, wherein the configuring a target data stream engine according to the target network configuration rule comprises:
    configuring the interconnection between the compute core and the local data stream network to obtain the computation path of the compute core;
    configuring the interconnection between the second data stream storage module and the local data stream network to obtain a storage path; and
    obtaining the target data stream engine according to the computation path and the storage path.
  7. The method according to claim 5, wherein the processing the data to be processed through the target data stream engine comprises:
    reading the data to be processed into the second data stream storage module;
    generating, in the second data stream storage module, an address sequence for the data to be processed according to a preset generation rule, based on the data format and data path of the data to be processed; and
    in each clock cycle, reading from the second data stream storage module according to the address sequence the amount of data corresponding to the compute core in the target data stream engine for input, and acquiring the states of the second data stream storage module and the compute core.
  8. The method according to any one of claims 5 to 7, wherein the second data stream storage module comprises a first storage unit and a second storage unit, and the processing the data to be processed through the target data stream engine comprises:
    inputting data in the first storage unit into the compute core to obtain a calculation result; and
    storing the calculation result in the second storage unit as the input data of the next compute core.
  9. A data stream-based deep network acceleration apparatus, wherein the apparatus comprises:
    a first acquiring module, configured to acquire target deep network information required by data to be processed;
    a first matching module, configured to match, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises configuration rules among a pre-configured computation engine, a first data stream storage module, and a global data stream network;
    a first configuration module, configured to obtain a target data stream network through configuration according to the target network configuration rule; and
    a first processing module, configured to process the data to be processed through the target data stream network.
  10. A data stream-based deep network acceleration apparatus, wherein the apparatus comprises:
    a second acquiring module, configured to acquire target deep network information required by data to be processed;
    a second matching module, configured to match, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises a compute core, a second data stream storage module, and a local data stream network;
    a second configuration module, configured to obtain a target data stream engine through configuration according to the target network configuration rule; and
    a second processing module, configured to process the data to be processed through the target data stream engine.
  11. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the data stream-based deep network acceleration method according to any one of claims 1 to 4.
  12. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the data stream-based deep network acceleration method according to any one of claims 1 to 4.
PCT/CN2019/082101 2019-04-09 2019-04-10 Deep network acceleration methods and apparatuses based on data stream, device, and storage medium WO2020206637A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910280156.2 2019-04-09
CN201910280156.2A CN110046704B (en) 2019-04-09 2019-04-09 Deep network acceleration method, device, equipment and storage medium based on data stream

Publications (1)

Publication Number Publication Date
WO2020206637A1 true WO2020206637A1 (en) 2020-10-15

Family

ID=67276511

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/082101 WO2020206637A1 (en) 2019-04-09 2019-04-10 Deep network acceleration methods and apparatuses based on data stream, device, and storage medium

Country Status (2)

Country Link
CN (1) CN110046704B (en)
WO (1) WO2020206637A1 (en)

Families Citing this family (11)

Publication number Priority date Publication date Assignee Title
WO2021026768A1 (en) * 2019-08-13 2021-02-18 深圳鲲云信息科技有限公司 Automatic driving method and apparatus based on data stream, and electronic device and storage medium
CN113272792A (en) * 2019-10-12 2021-08-17 深圳鲲云信息科技有限公司 Local data stream acceleration method, data stream acceleration system and computer equipment
CN112905525B (en) * 2019-11-19 2024-04-05 中科寒武纪科技股份有限公司 Method and equipment for controlling computing device to perform computation
CN111404770B (en) * 2020-02-29 2022-11-11 华为技术有限公司 Network device, data processing method, device and system and readable storage medium
CN111857989B (en) * 2020-06-22 2024-02-27 深圳鲲云信息科技有限公司 Artificial intelligence chip and data processing method based on same
CN111753994B (en) * 2020-06-22 2023-11-03 深圳鲲云信息科技有限公司 Data processing method and device of AI chip and computer equipment
CN111752887B (en) * 2020-06-22 2024-03-15 深圳鲲云信息科技有限公司 Artificial intelligence chip and data processing method based on same
CN111737193B (en) * 2020-08-03 2020-12-08 深圳鲲云信息科技有限公司 Data storage method, device, equipment and storage medium
CN114021708B (en) * 2021-09-30 2023-08-01 浪潮电子信息产业股份有限公司 Data processing method, device and system, electronic equipment and storage medium
CN114461978B (en) * 2022-04-13 2022-07-08 苏州浪潮智能科技有限公司 Data processing method and device, electronic equipment and readable storage medium
CN116974654B (en) * 2023-09-21 2023-12-19 浙江大华技术股份有限公司 Image data processing method and device, electronic equipment and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
WO2014105309A1 (en) * 2012-12-31 2014-07-03 Mcafee, Inc. System and method for correlating network information with subscriber information in a mobile network environment
CN106447034A (en) * 2016-10-27 2017-02-22 中国科学院计算技术研究所 Neutral network processor based on data compression, design method and chip
CN108154165A (en) * 2017-11-20 2018-06-12 华南师范大学 Love and marriage object matching data processing method, device, computer equipment and storage medium based on big data and deep learning
CN108710941A (en) * 2018-04-11 2018-10-26 杭州菲数科技有限公司 The hard acceleration method and device of neural network model for electronic equipment
CN109445935A (en) * 2018-10-10 2019-03-08 杭州电子科技大学 A kind of high-performance big data analysis system self-adaption configuration method under cloud computing environment

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US11055063B2 (en) * 2016-05-02 2021-07-06 Marvell Asia Pte, Ltd. Systems and methods for deep learning processor
US11216722B2 (en) * 2016-12-31 2022-01-04 Intel Corporation Hardware accelerator template and design framework for implementing recurrent neural networks
US20180189641A1 (en) * 2017-01-04 2018-07-05 Stmicroelectronics S.R.L. Hardware accelerator engine
CN107066239A (en) * 2017-03-01 2017-08-18 智擎信息系统(上海)有限公司 A kind of hardware configuration for realizing convolutional neural networks forward calculation

Also Published As

Publication number Publication date
CN110046704A (en) 2019-07-23
CN110046704B (en) 2022-11-08

Similar Documents

Publication Publication Date Title
WO2020206637A1 (en) Deep network acceleration methods and apparatuses based on data stream, device, and storage medium
US10713568B2 (en) Apparatus and method for executing reversal training of artificial neural network
US11893414B2 (en) Operation method, device and related products
KR102486030B1 (en) Apparatus and method for executing forward operation of fully-connected layer neural network
WO2018171717A1 (en) Automated design method and system for neural network processor
CN109086877B (en) Apparatus and method for performing convolutional neural network forward operation
US11915139B2 (en) Modifying machine learning models to improve locality
EP3407265B1 (en) Device and method for executing forward calculation of artificial neural network
AU2014203218B2 (en) Memory configuration for inter-processor communication in an MPSoC
US11294599B1 (en) Registers for restricted memory
TWI634489B (en) Multi-layer artificial neural network
US11694075B2 (en) Partitioning control dependency edge in computation graph
US10990525B2 (en) Caching data in artificial neural network computations
Voss et al. Convolutional neural networks on dataflow engines
CN113496248A (en) Method and apparatus for training computer-implemented models
US20210125042A1 (en) Heterogeneous deep learning accelerator
JPWO2020188658A1 (en) Architecture estimator, architecture estimation method, and architecture estimation program
US11797280B1 (en) Balanced partitioning of neural network based on execution latencies
Abeyrathne et al. Offloading specific performance-related kernel functions into an FPGA
WO2024120050A1 (en) Operator fusion method used for neural network, and related apparatus
US20230126594A1 (en) Instruction generating method, arithmetic processing device, and instruction generating device
CN115204086A (en) Network-on-chip simulation model, dynamic path planning method and device, and multi-core chip
AU2015271896A1 (en) Selection of system-on-chip component models for early design phase evaluation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 19924206
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 19924206
    Country of ref document: EP
    Kind code of ref document: A1