CN110046704A - Data-flow-based deep network acceleration method, apparatus, device and storage medium - Google Patents

Data-flow-based deep network acceleration method, apparatus, device and storage medium

Info

Publication number
CN110046704A
CN110046704A (application CN201910280156.2A); granted as CN110046704B
Authority
CN
China
Prior art keywords: data, network, target, data flow
Legal status (the legal status is an assumption and is not a legal conclusion): Granted
Application number: CN201910280156.2A
Other languages: Chinese (zh)
Other versions: CN110046704B (en)
Inventors: 牛昕宇 (Niu Xinyu), 蔡权雄 (Cai Quanxiong)
Current Assignee (the listed assignees may be inaccurate): Shenzhen Corerain Technologies Co., Ltd.
Original Assignee: Shenzhen Corerain Technologies Co., Ltd.
Application filed by Shenzhen Corerain Technologies Co., Ltd.
Priority: CN201910280156.2A; PCT/CN2019/082101 (published as WO2020206637A1)
Publication of CN110046704A; application granted; publication of CN110046704B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 - Digital computers in general; data processing equipment in general
    • G06F 15/76 - Architectures of general purpose stored program computers
    • G06F 15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807 - System on chip, i.e. computer system on a single chip; system in package, i.e. computer system on one or more chips in a single package
    • G06F 15/7814 - Specially adapted for real time processing, e.g. comprising hardware timers
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The application discloses a data-flow-based deep network acceleration method, apparatus, device and storage medium. The method comprises: obtaining the target deep network information required by data to be processed; according to the target deep network information, matching a pre-set target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises configuration rules among pre-configured compute engines, a first data flow storage module and a global data flow network; configuring a target data flow network according to the target network configuration rule; and processing the data to be processed through the target data flow network. Because the deep network is accelerated through data flow, off-chip data communication is reduced and there is no instruction idle overhead, so the hardware acceleration efficiency of the deep network can be improved; moreover, through network configuration, different deep network models can be configured, supporting a variety of different deep network models.

Description

Data-flow-based deep network acceleration method, apparatus, device and storage medium
Technical field
The present application relates to the field of artificial intelligence, and more particularly to a data-flow-based deep network acceleration method, apparatus, device and storage medium.
Background art
The progress of deep-learning applications based on neural networks demands high throughput from the underlying hardware platform. As CPU-based platforms fail to satisfy this growing need, many companies have developed dedicated hardware accelerators to support progress in the field. The common idea behind existing hardware accelerators is to speed up certain specific types of computation used frequently in deep-learning algorithms. Existing hardware architectures are based on instruction execution with an extensible instruction set, and acceleration is achieved by implementing common computations as custom instructions. Instruction-based architecture implementations typically take the form of a system-on-chip (SoC) design. In an instruction-based architecture, many clock cycles are wasted on operations not related to computation. To support a more general instruction architecture, a computation in a deep-learning neural network is often decomposed into multiple instructions, so a single computation usually requires multiple clock cycles. The arithmetic logic unit (ALU) in a processor is usually a set of hardware-implemented operations. Because of limited instruction expressions and limited I/O bandwidth, most ALU resources sit idle while a single instruction executes. For example, in a multiply-then-add computation, the operands of the multiplication are read first; since I/O speed is limited by bandwidth, the addition must wait for the multiplication to finish and its result to be written to memory, after which the result is read back and added to the addend. While the multiplication and the reads and writes are in progress, the addition unit is idle. Instruction-based acceleration therefore suffers from low hardware acceleration efficiency.
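As a purely illustrative toy model (the per-operation cycle costs below are assumptions, not figures from the patent), the idle-ALU problem can be made concrete by counting cycles for a multiply-add that round-trips through memory versus one whose product streams directly into the adder:

# Toy cycle model (illustrative only): an instruction-based ALU must write the
# multiply result to memory and read it back before the add, so the adder sits
# idle; a dataflow pipeline forwards the product directly.

MUL, ADD, MEM_WRITE, MEM_READ = 1, 1, 4, 4  # assumed per-op cycle costs

def instruction_based(n_ops: int) -> int:
    # Each multiply-add is serialized through memory: mul, write, read, add.
    return n_ops * (MUL + MEM_WRITE + MEM_READ + ADD)

def dataflow(n_ops: int) -> int:
    # The product streams straight into the adder; once the pipeline fills,
    # one result completes per cycle.
    fill = MUL + ADD
    return fill + (n_ops - 1)

if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, instruction_based(n), dataflow(n))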
Summary
The purpose of the present application, in view of the above drawbacks of the prior art, is to provide a data-flow-based deep network acceleration method, apparatus, device and storage medium, solving the problem that, due to limited instruction expressions and limited I/O bandwidth, most ALU resources sit idle while a single instruction executes and acceleration efficiency is low.
The purpose of the application is achieved through the following technical solutions:
In a first aspect, a data-flow-based deep network acceleration method is provided, the method comprising:
obtaining the target deep network information required by data to be processed;
according to the target deep network information, matching a pre-set target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises configuration rules among pre-configured compute engines, a first data flow storage module and a global data flow network;
configuring a target data flow network according to the target network configuration rule;
processing the data to be processed through the target data flow network.
Optionally, configuring the target data flow network according to the target network configuration rule comprises:
configuring parallel or serial connections among multiple compute engines according to the global data flow network;
obtaining the data flow paths of the multiple compute engines according to the parallel or serial connections between the first data flow storage module and the multiple compute engines;
forming the target data flow network based on the data flow paths.
Optionally, processing the data to be processed through the target data flow network comprises:
reading the data to be processed into the first data flow storage module;
in the first data flow storage module, according to the data format and data path of the data to be processed, generating an address sequence for the data to be processed by a pre-set generation rule;
in each clock cycle, reading from the first data flow storage module, according to the address sequence, a data volume corresponding to a compute engine, inputting it into the target data flow network, and obtaining the states of the first data flow storage module and the compute engines.
Optionally, the target network configuration further comprises compute cores, a second data flow storage unit, and a local data flow network connecting the compute cores and the second data flow storage unit, and the configuration of a compute engine comprises:
configuring the interconnection of the compute cores with the local data flow network to obtain the calculation paths of the compute cores;
configuring the interconnection of the second data flow storage unit with the local data flow network to obtain storage paths;
obtaining the compute engine according to the calculation paths and the storage paths.
In a second aspect, another data-flow-based deep network acceleration method is provided, the method comprising:
obtaining the target deep network information required by data to be processed;
according to the target deep network information, matching a pre-set target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises compute cores, a second data flow storage module and a local data flow network;
configuring a target data flow engine according to the target network configuration rule;
processing the data to be processed through the target data flow engine.
Optionally, configuring the target data flow engine according to the target network configuration rule comprises:
configuring the interconnection of the compute cores with the local data flow network to obtain the calculation paths of the compute cores;
configuring the interconnection of the second data flow storage module with the local data flow network to obtain storage paths;
obtaining the target data flow engine according to the calculation paths and the storage paths.
Optionally, processing the data to be processed through the target data flow engine comprises:
reading the data to be processed into the second data flow storage module;
in the second data flow storage module, according to the data format and data path of the data to be processed, generating an address sequence for the data to be processed by a pre-set generation rule;
in each clock cycle, reading from the second data flow storage module, according to the address sequence, a data volume corresponding to a compute core, inputting it into the target data flow engine, and obtaining the states of the second data flow storage module and the compute cores.
Optionally, the second data flow storage module comprises a first storage unit and a second storage unit, and processing the data to be processed through the target data flow engine comprises:
inputting the data in the first storage unit into a compute core to obtain a calculation result;
storing the calculation result into the second storage unit as the input data for the next compute-core calculation.
In a third aspect, a data-flow-based deep network acceleration apparatus is provided, the apparatus comprising:
a first obtaining module, configured to obtain the target deep network information required by data to be processed;
a first matching module, configured to match, according to the target deep network information, a pre-set target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises configuration rules among pre-configured compute engines, a first data flow storage module and a global data flow network;
a first configuration module, configured to configure a target data flow network according to the target network configuration rule;
a first processing module, configured to process the data to be processed through the target data flow network.
In a fourth aspect, another data-flow-based deep network acceleration apparatus is provided, the apparatus comprising:
a second obtaining module, configured to obtain the target deep network information required by data to be processed;
a second matching module, configured to match, according to the target deep network information, a pre-set target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises compute cores, a second data flow storage module and a local data flow network;
a second configuration module, configured to configure a target data flow engine according to the target network configuration rule;
a second processing module, configured to process the data to be processed through the target data flow engine.
In a fifth aspect, an electronic device is provided, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the steps of the data-flow-based deep network acceleration method provided by the embodiments of the present application.
In a sixth aspect, a computer-readable storage medium is provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the data-flow-based deep network acceleration method provided by the embodiments of the present application.
The beneficial effects brought by the application are as follows: the deep network is accelerated through data flow, which reduces off-chip data communication, so there is no instruction idle overhead and the hardware acceleration efficiency of the deep network can be improved; moreover, through network configuration, different deep network models can be configured, supporting a variety of different deep network models.
Detailed description of the invention
Fig. 1 is a schematic diagram of an optional implementation architecture for the data-flow-based deep network acceleration method provided by the embodiments of the present application;
Fig. 2 is a flow diagram of the data-flow-based deep network acceleration method provided by the first aspect of the embodiments of the present application;
Fig. 3 is a flow diagram of another data-flow-based deep network acceleration method provided by the embodiments of the present application;
Fig. 4 is a flow diagram of the data-flow-based deep network acceleration method provided by the second aspect of the embodiments of the present application;
Fig. 5 is a flow diagram of another data-flow-based deep network acceleration method provided by the embodiments of the present application;
Fig. 6 is a schematic diagram of the data-flow-based deep network acceleration apparatus provided by the third aspect of the embodiments of the present application;
Fig. 7 is a schematic diagram of the data-flow-based deep network acceleration apparatus provided by the fourth aspect of the embodiments of the present application.
Specific embodiment
The preferred embodiments of the application are described below. Those of ordinary skill in the art will be able to realize them in light of the related technologies in this field, and will more clearly understand the innovations and benefits of the application.
To further describe the technical solution of the application, please refer to Fig. 1, which is a schematic diagram of an optional implementation architecture for the data-flow-based deep network acceleration method provided by the embodiments of the present application. As shown in Fig. 1, a framework 103 is connected by interconnects to an off-chip storage module (DDR) 101 and to a host CPU. The framework 103 comprises a first storage module 104, a global data flow network 105 and data flow engines 106. The first storage module 104 is connected by interconnect to the off-chip storage module 101 and also to the global data flow network 105; each data flow engine 106 is connected by interconnect to the global data flow network 105, so that the data flow engines 106 can operate in parallel or serially. A data flow engine 106 may comprise compute cores (also called computing modules), a second storage module 108 and a local data flow network 107. The compute cores may include kernels used for computation, such as a convolution kernel 109, a pooling kernel 110 and an activation function kernel 111; of course, other compute cores beyond these examples may also be included without limitation, including all kernels used for computation in a deep network. The first storage module 104 and the second storage module 108 may be on-chip cache modules, or DDR or high-speed DDR storage modules, etc. A data flow engine 106 can be understood as a compute engine supporting data flow processing, or as a compute engine dedicated to data flow processing. The CPU may include control registers, in which network configuration rules are provided in advance for configuring the network.
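A minimal structural sketch of this Fig. 1 architecture may help; the class and field names below are illustrative assumptions, since the patent specifies no concrete data model:

# Structural sketch of the Fig. 1 architecture (names and fields assumed).
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ComputeCore:
    kind: str               # e.g. "conv" (109), "pool" (110), "activation" (111)

@dataclass
class DataflowEngine:       # element 106: cores + local network + second storage
    cores: List[ComputeCore] = field(default_factory=list)
    local_network: List[Tuple[str, str]] = field(default_factory=list)  # (src, dst) routes (107)
    second_storage_kb: int = 256                                        # element 108

@dataclass
class Framework:            # element 103, attached to off-chip DDR 101 and host CPU
    first_storage_kb: int = 1024                                        # element 104
    global_network: List[Tuple[str, str]] = field(default_factory=list) # element 105 routes
    engines: List[DataflowEngine] = field(default_factory=list)

fw = Framework(engines=[DataflowEngine(cores=[ComputeCore("conv"),
                                              ComputeCore("pool"),
                                              ComputeCore("activation")])])
print(len(fw.engines), [c.kind for c in fw.engines[0].cores])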
It should be noted that in this application a deep network may also be called a deep-learning network, a deep-learning neural network, and so on.
The application provides a data-flow-based deep network acceleration method, apparatus, device and storage medium.
The purpose of the application is achieved through the following technical solutions:
In a first aspect, please refer to Fig. 2, which is a flow diagram of a data-flow-based deep network acceleration method provided by the embodiments of the present application. As shown in Fig. 2, the method comprises the following steps:
201. Obtain the target deep network information required by the data to be processed.
In this step, the data to be processed may be image data to be recognized, target data to be detected, target data to be tracked, or other data that can be processed by a deep network. The target deep network information is the deep network information corresponding to the data to be processed: for example, if the data to be processed is image data to be recognized, the target deep network information is the configuration parameters of a deep network for image recognition; if the data to be processed is target data to be detected, the target deep network information is the configuration parameters of a deep network for target detection. The target deep network information may be set in advance and determined by matching against the data to be processed, or it may be determined by manual selection; no limitation is made here. Obtaining the target deep network information makes it possible to configure the deep network; the deep network information may include network type, data type, number of layers, computation type, and so on.
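A hedged sketch of this step and the matching in step 202 below (field names and rule contents are illustrative assumptions, not the patent's data layout):

# Sketch of steps 201/202: describe the required deep network, then look up a
# pre-set configuration rule, as the control registers might hold them.
from dataclasses import dataclass

@dataclass(frozen=True)
class NetworkInfo:
    task: str        # "image_recognition", "object_detection", ...
    num_layers: int
    dtype: str       # "int8", "fp16", ...

CONFIG_RULES = {     # pre-set rules keyed by task (contents assumed)
    "image_recognition": {"engines": 4, "topology": "serial"},
    "object_detection":  {"engines": 8, "topology": "parallel"},
}

def match_rule(info: NetworkInfo) -> dict:
    try:
        return CONFIG_RULES[info.task]
    except KeyError:
        raise ValueError(f"no pre-set rule for task {info.task!r}")

print(match_rule(NetworkInfo("image_recognition", 18, "int8")))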
202. According to the target deep network information, match a pre-set target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises configuration rules among pre-configured compute engines, a first data flow storage module and a global data flow network.
The target deep network information contains the network type, data type, number of layers, computation type, etc. of the deep network required by the data to be processed. The target network configuration rule may be configured in advance, for example as the parameter rules and computation rules of pre-set network types such as image recognition networks, target detection networks and target tracking networks; the parameter rules may be setting rules for hyperparameters, setting rules for weights, etc., and the computation rules may be computation rules such as addition, multiplication, convolution and deconvolution. The configuration rules among the pre-configured compute engines, the first data flow storage module and the global data flow network can be understood as the number of compute engines and their connection modes with the global data flow network, the connection mode between the first data flow storage module and the global data flow network, and the routing connection modes within the global data flow network. The global data flow network can be configured by control registers; it implements a router between the first data flow storage module and the compute engines. When multiple compute engines are instantiated in a single framework, the global data flow network can be configured to send different data to different compute engines for data parallelism, or to serially link the inputs and outputs of compute engines into a longer computation pipeline, in which more neural network layers can be processed.
In one possible embodiment, the first data flow storage module may include two data flow storage units, an input unit and an output unit, for data access: the input data flow storage unit feeds input data into a compute engine for calculation, and the compute engine outputs the calculation result to the output data flow storage unit for storage. This avoids the situation where, while the input data flow storage unit is feeding data to a compute engine, the compute engine's output cannot be written into the input data flow storage unit. For example, if a compute engine needs to compute twice on a piece of data in the input data flow storage unit, then after the first computation it needs to read that data a second time; ordinarily it would have to wait for the first calculation result to be stored before reading the data again, but with an output data flow storage unit in place, the data can be read a second time while the first result is stored into the output unit, without waiting, which improves the efficiency of data processing.
203. Configure a target data flow network according to the target network configuration rule.
Realizing the configuration according to the target network configuration rule may consist of configuring the connection relationships among the pre-configured compute engines, the first data flow storage module and the global data flow network. These connection relationships may include the number of connected compute engines, the connection order, and so on; compute engines can be connected to the global data flow network through interconnects to form a new deep network, and different deep networks can be formed with different numbers and orders of connected compute engines. After configuration according to the target network configuration rule, the resulting target data flow network can process the data to be processed. Since each compute engine reads its data through the first data flow storage module, the data in the first data flow storage module can be read into different compute engines to form data flows; no instruction-set sequencing is required, so the configured compute engines incur no idle computation.
204. Process the data to be processed through the target data flow network.
The target data flow network is configured from the target network information and may also be called a customized data flow network. In the target data flow network, the global data flow network connects the first data flow storage module with the compute engines to form data flows; compared with an instruction set, there is no waiting for the reads and writes of a previous instruction to complete, so high computational efficiency can be achieved under the deep network architecture.
In this embodiment, the target deep network information required by the data to be processed is obtained; according to the target deep network information, a pre-set target network configuration rule corresponding to the target deep network information is matched, wherein the target network configuration rule comprises configuration rules among pre-configured compute engines, a first data flow storage module and a global data flow network; a target data flow network is configured according to the target network configuration rule; and the data to be processed is processed through the target data flow network. The deep network is accelerated through data flow, reducing off-chip data communication; there is therefore no instruction idle overhead, and the hardware acceleration efficiency of the deep network can be improved. Moreover, through network configuration, different deep network models can be configured, supporting a variety of different deep network models.
It should be noted that the data-flow-based deep network acceleration method provided by the embodiments of the present application can be applied to devices that perform data-flow-based deep network acceleration, such as computers, servers, mobile phones and other devices capable of deep network acceleration based on data flow.
Please refer to Fig. 3, which is a flow diagram of another data-flow-based deep network acceleration method provided by the embodiments of the present application. As shown in Fig. 3, the method comprises the following steps:
301. Obtain the target deep network information required by the data to be processed.
302. According to the target deep network information, match a pre-set target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises configuration rules among pre-configured compute engines, a first data flow storage module and a global data flow network.
303. Configure parallel or serial connections among multiple compute engines according to the global data flow network.
In this step, the global data flow network may be implemented by routing and is configured by control registers, in which corresponding global data flow network configuration rules are provided in advance. The network implements a router between the first data flow storage module and each compute engine; the router's main function is to provide skip paths and feedback paths for the data flows among the compute engines. The parallel or serial relationship among multiple compute engines can be configured through the data flow. For example, when compute engine A and compute engine B are parallel in the global data flow network, the data flow reaches compute engine A and compute engine B simultaneously, realizing parallel processing of the data; when compute engine A and compute engine B are serial in the global data flow network, the data flow can first be computed in compute engine A, with the calculation result then flowing to compute engine B. A serial connection can be understood as deepening the computation layers of the deep network. Concretely, the configuration controls the data flow direction through the global data flow network, realizing the parallel or serial configuration among multiple compute engines. Configuring the parallel or serial relationship among the engines can be achieved by configuring the interconnections between the global data flow network and the compute engines: for example, multiple compute engines may be interconnected with the global data flow network by a parallel rule, or by a serial rule, and the first data flow storage module is configured to interconnect with the global data flow network.
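The two modes described above can be sketched as follows (the engine functions are toy stand-ins and the routing semantics an assumption drawn from this paragraph):

# Sketch of step 303: the global data flow network either fans data out to
# engines in parallel or chains them into one longer pipeline.
def run_parallel(engines, batches):
    # Each engine receives its own batch: data parallelism.
    return [eng(batch) for eng, batch in zip(engines, batches)]

def run_serial(engines, data):
    # The output of one engine feeds the next: a deeper processing pipeline.
    for eng in engines:
        data = eng(data)
    return data

double = lambda xs: [2 * x for x in xs]
inc    = lambda xs: [x + 1 for x in xs]

print(run_parallel([double, inc], [[1, 2], [3, 4]]))  # [[2, 4], [4, 5]]
print(run_serial([double, inc], [1, 2]))              # [3, 5]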
304. Obtain the data flow paths of the multiple compute engines according to the parallel or serial connections between the first data flow storage module and the multiple compute engines.
In this step, the first data flow storage module may be a cache, DDR or fast-access DDR; in the embodiments of the present application it is preferably a cache, specifically one provided with a controllable read/write address generation unit. Depending on the input data format and the data path required by the computation, the address generation unit generates an adapted address sequence to index the data in the cache. The address sequence can be used to index data in the cache for input into the corresponding compute engine; for example, if a compute engine needs 80 data items for a computation, the data at the 80 corresponding addresses of the sequence are read from the cache into that compute engine. In addition, by setting counters, the address generation unit can give the generated address sequence different loop sizes, for example a local loop over data 1, data 2 and data 3, which improves data reusability and also adapts to the data-processing size of each compute engine. The data flow is stored through the first data flow storage module, and the parallel or serial flow of the data among the multiple compute engines is controlled at each data node; these are the data flow paths, which allow data processing to proceed through the compute engines like a pipeline, improving the efficiency of data processing.
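A minimal sketch of such an address generation unit (the window and repeat parameters are illustrative assumptions):

# Address-sequence generator in the spirit of step 304: a counter makes the
# sequence loop over a small window so cached data can be reused.
def address_sequence(base: int, window: int, repeats: int, total: int):
    """Yield addresses in blocks of `window`, each block replayed `repeats` times."""
    for block in range(0, total, window):
        for _ in range(repeats):
            for offset in range(min(window, total - block)):
                yield base + block + offset

# Replay each 3-element window twice: data 1..3 loop before moving on.
print(list(address_sequence(base=0, window=3, repeats=2, total=6)))
# [0, 1, 2, 0, 1, 2, 3, 4, 5, 3, 4, 5]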
305. Form the target data flow network based on the data flow paths.
In this step, the first data flow storage module inputs data into the corresponding compute engines through the global data flow network, and the compute engines output calculation results into the first data flow storage module through the global data flow network, without instruction control, so there is no problem of compute units idling while a single instruction executes.
306. Process the data to be processed through the target data flow network.
In this embodiment, the data flow is stored through the first data flow storage module, and the parallel or serial flow of the data among the multiple compute engines is controlled at each data node; these are the data flow paths, which allow data processing to proceed through the compute engines like a pipeline, improving the efficiency of data processing.
Optionally, processing the data to be processed through the target data flow network comprises:
reading the data to be processed into the first data flow storage module;
in the first data flow storage module, according to the data format and data path of the data to be processed, generating an address sequence for the data to be processed by a pre-set generation rule;
in each clock cycle, reading from the first data flow storage module, according to the address sequence, a data volume corresponding to a compute engine, inputting it into the target data flow network, and obtaining the states of the first data flow storage module and the compute engines.
In this embodiment, the first data flow storage module may be a cache, DDR or fast-access DDR; in the embodiments of the present application it is preferably a cache provided with a controllable read/write address generation unit. Depending on the input data format and the data path required by the computation, the address generation unit generates an adapted address sequence to index the data in the cache. The address sequence can be used to index data in the cache for input into the corresponding compute engine; for example, if a compute engine needs 80 data items for a computation, the data at the 80 corresponding addresses of the sequence are read from the cache into that compute engine. In addition, by setting counters, the address generation unit can give the generated address sequence different loop sizes, for example a local loop over data 1, data 2 and data 3, which improves data reusability and also adapts to the data-processing size of each compute engine. The states of the first data flow storage module include a data-read-ready state and a data-write-complete state; the states of a compute engine include whether the computation is complete and whether the next batch of computation data needs to be read. The state of the first data flow storage module can be obtained by monitoring the data in the module with a finite state machine, and the state of a compute engine can be derived from the state of the first data flow storage module; for example, once a calculation result has been written into the first data flow storage module, the compute engine's state can be determined to be computation-finished.
By obtaining the state of each compute engine and of the first data flow storage module in each clock cycle, waits can be calculated accurately and predictably, enabling hardware performance optimization at maximum efficiency and further improving the efficiency of data processing.
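A sketch of this per-cycle state monitoring (the state names and handshake are assumptions drawn from the paragraphs above):

# Finite-state monitor: each clock cycle it reads the storage-module and
# engine states, feeding the engine only when both sides are ready.
from enum import Enum

class StorageState(Enum):
    READ_READY = 1      # data prepared for reading
    WRITE_DONE = 2      # result write-back has completed

class EngineState(Enum):
    BUSY = 1
    DONE = 2            # finished; may need the next block of data

def tick(storage_state, engine_state, feed):
    """One clock cycle: issue a read only when storage and engine are ready."""
    if storage_state is StorageState.READ_READY and engine_state is EngineState.DONE:
        feed()          # push the next per-engine data volume
        return EngineState.BUSY
    return engine_state

state = EngineState.DONE
state = tick(StorageState.READ_READY, state, lambda: print("feed engine"))
print(state)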
Optionally, the target network configuration further comprises compute cores, a second data flow storage unit, and a local data flow network connecting the compute cores and the second data flow storage unit, and the configuration of a compute engine comprises:
configuring the interconnection of the compute cores with the local data flow network to obtain the calculation paths of the compute cores;
configuring the interconnection of the second data flow storage unit with the local data flow network to obtain storage paths;
obtaining the compute engine according to the calculation paths and the storage paths.
In this embodiment, the compute cores, the second data flow storage module and the local data flow network are the main configuration composing a compute engine. The compute cores may be kernels with computing capability, such as convolution kernels, pooling kernels and activation function kernels; it should also be noted that a compute core may be called a computation kernel, compute unit, computing module, and so on. The second data flow storage module may be a storage module with data access capability such as a cache, DDR or high-speed DDR; the second data flow storage module and the first data flow storage module may be different storage areas on the same memory. For example, the second data flow storage module may be a second data buffer area in a buffer, and the first data flow storage module a first data buffer area in the same buffer. The local data flow network can be understood as the routing inside a compute engine that connects the compute cores with the second data flow storage module. For example, the connections between compute cores can be controlled by a network router, whose main function is to provide skip paths and feedback paths. By setting the control registers, the local data flow network can be configured to form different flow paths with the compute cores available in the compute engine. The combination of types and order of compute cores along these flow paths provides a continuous data-processing pipeline over multiple layers of a deep-learning neural network: for example, following the data flow, if the combination of compute cores is convolution kernel to pooling kernel to activation function kernel, a convolutional neural network layer is obtained; if the combination is deconvolution kernel to pooling kernel to activation function kernel, a deconvolution neural network layer is obtained; and so on. It should be noted that the combination of types and order of compute cores is specifically determined by the target network configuration rule. By forming data flows between the compute cores, the computation of the compute engines can be accelerated, further improving the data-processing efficiency of the deep network.
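A minimal sketch of how the type and order of cores along a flow path determine the layer (the core implementations are toy stand-ins, not the patent's):

# Local data flow: chaining conv -> pool -> activation yields one conv layer.
def conv(x):       return [v * 0.5 for v in x]       # stand-in convolution
def pool(x):       return [max(x[i:i+2]) for i in range(0, len(x), 2)]
def activation(x): return [max(0.0, v) for v in x]   # ReLU

def build_layer(cores):
    """Chain cores along the configured flow path into one pipeline."""
    def layer(x):
        for core in cores:
            x = core(x)
        return x
    return layer

conv_layer = build_layer([conv, pool, activation])   # one conv neural-net layer
print(conv_layer([2.0, -4.0, 6.0, 8.0]))             # [1.0, 4.0]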
The above optional embodiments can realize the data-flow-based deep network acceleration methods of the embodiments corresponding to Fig. 2 and Fig. 3 and achieve the same effects, which will not be repeated here.
In a second aspect, please refer to Fig. 4, which is a flow diagram of a data-flow-based deep network acceleration method provided by the embodiments of the present application. As shown in Fig. 4, the method comprises:
401. Obtain the target deep network information required by the data to be processed.
In this step, the data to be processed may be image data to be recognized, target data to be detected, target data to be tracked, or other data that can be processed by a deep network. The target deep network information is the deep network information corresponding to the data to be processed: for example, if the data to be processed is image data to be recognized, the target deep network information is the configuration parameters of a deep network for image recognition; if the data to be processed is target data to be detected, the target deep network information is the configuration parameters of a deep network for target detection. The target deep network information may be set in advance and determined by matching against the data to be processed, or it may be determined by manual selection; no limitation is made here. Obtaining the target deep network information makes it possible to configure the deep network; the deep network information may include network type, data type, number of layers, computation type, and so on.
402. According to the target deep network information, match a pre-set target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises compute cores, a second data flow storage module and a local data flow network.
The target deep network information contains the network type, data type, number of layers, computation type, etc. of the deep network required by the data to be processed. The target network configuration rule may be configured in advance, for example as the parameter rules and computation rules of pre-set network types such as image recognition networks, target detection networks and target tracking networks; the parameter rules may be setting rules for hyperparameters, setting rules for weights, etc., and the computation rules may be computation rules such as addition, multiplication, convolution and deconvolution. The configuration rules among the compute cores, the second data flow storage module and the local data flow network can be understood as the types and number of compute cores and their connection modes with the local data flow network, the connection mode between the second data flow storage module and the local data flow network, and the routing connection modes within the local data flow network. The local data flow network can be configured by control registers; it implements a router between the second data flow storage module and the compute cores. For example, the connections between compute cores can be controlled by a network router, whose main function is to provide skip paths and feedback paths.
403. Configure a target data flow engine according to the target network configuration rule.
Realizing the configuration according to the target network configuration rule may consist of configuring the connection relationships among the pre-configured compute cores, the second data flow storage module and the local data flow network. These connection relationships may include the types of compute cores, the number of connections, the connection order, and so on; compute cores can be connected to the local data flow network through interconnects to form a new compute engine, i.e. a data flow engine, and the data flow engines required by different deep networks can be formed with different types, numbers and orders of connected compute cores. After configuration according to the target network configuration rule, the resulting target data flow engine can process the data to be processed. Since each compute core reads its data through the second data flow storage module, the data in the second data flow storage module can be read into different compute cores to form data flows; for example, data requiring multiplication is read into a multiplication core for multiplication, and data requiring addition is read into an addition core for addition. Since data flow requires no instruction-set sequencing, the configured data flow engine incurs no idle computation.
404. Process the data to be processed through the target data flow engine.
The target data flow engine is configured from the target network information and may also be called a customized data flow engine. In the target data flow engine, the local data flow network connects the second data flow storage module with each compute core to form data flows; compared with an instruction-set implementation, there is no waiting for the reads and writes of a previous instruction to complete, so high computational efficiency can be achieved under the deep network architecture.
In this embodiment, the target deep network information required by the data to be processed is obtained; according to the target deep network information, a pre-set target network configuration rule corresponding to the target deep network information is matched, wherein the target network configuration rule comprises compute cores, a second data flow storage module and a local data flow network; a target data flow engine is configured according to the target network configuration rule; and the data to be processed is processed through the target data flow engine. The deep network is accelerated through data flow, reducing off-chip data communication; there is therefore no instruction idle overhead, and the hardware acceleration efficiency of the deep network can be improved. Moreover, through network configuration, the compute engines required by different deep network models can be configured, supporting the compute engines required by a variety of different deep network models.
Please refer to Fig. 5, which is a flow diagram of another data-flow-based deep network acceleration method provided by the embodiments of the present application. As shown in Fig. 5, the method comprises:
501. Obtain the target deep network information required by the data to be processed.
502. According to the target deep network information, match a pre-set target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises compute cores, a second data flow storage module and a local data flow network.
503. Configure the interconnection of the compute cores with the local data flow network to obtain the calculation paths of the compute cores.
504. Configure the interconnection of the second data flow storage module with the local data flow network to obtain storage paths.
505. Obtain the target data flow engine according to the calculation paths and the storage paths.
506. Process the data to be processed through the target data flow engine.
In this embodiment, the compute cores, the second data flow storage module and the local data flow network are the main configuration composing a data flow engine. The compute cores may be kernels with computing capability, such as convolution kernels, pooling kernels and activation function kernels; it should also be noted that a compute core may be called a computation kernel, compute unit, computing module, and so on. The second data flow storage module may be a storage module with data access capability such as a cache, DDR or high-speed DDR; the second data flow storage module and the first data flow storage module may be different storage areas on the same memory. For example, the second data flow storage module may be a second data buffer area in a buffer, and the first data flow storage module a first data buffer area in the same buffer. The local data flow network can be understood as the routing inside a compute engine that connects the compute cores with the second data flow storage module. For example, the connections between compute cores can be controlled by a network router, whose main function is to provide skip paths and feedback paths. By setting the control registers, the local data flow network can be configured to form different flow paths with the compute cores available in the compute engine. The combination of types and order of compute cores along these flow paths provides a continuous data-processing pipeline over multiple layers of a deep-learning neural network: for example, following the data flow, if the combination of compute cores is convolution kernel to pooling kernel to activation function kernel, a convolutional neural network layer is obtained; if the combination is deconvolution kernel to pooling kernel to activation function kernel, a deconvolution neural network layer is obtained; and so on. It should be noted that the combination of types and order of compute cores is specifically determined by the target network configuration rule.
By forming data flows between the compute cores, the computation of the compute engines can be accelerated, further improving the data-processing efficiency of the deep network.
Optionally, processing the data to be processed through the target data flow engine comprises:
reading the data to be processed into the second data flow storage module;
in the second data flow storage module, according to the data format and data path of the data to be processed, generating an address sequence for the data to be processed by a pre-set generation rule;
in each clock cycle, reading from the second data flow storage module, according to the address sequence, a data volume corresponding to a compute core, inputting it into the target data flow engine, and obtaining the states of the second data flow storage module and the compute cores.
In this embodiment, the second data flow storage module may be a cache, DDR or fast-access DDR; in the embodiments of the present application it is preferably a cache provided with a controllable read/write address generation unit. Depending on the input data format and the data path required by the computation, the address generation unit generates an adapted address sequence to index the data in the cache. The address sequence can be used to index data in the cache for input into the corresponding compute core; for example, if a compute core needs 80 data items for a computation, the data at the 80 corresponding addresses of the sequence are read from the cache into that compute core. In addition, by setting counters, the address generation unit can give the generated address sequence different loop sizes, for example a local loop over data 1, data 2 and data 3, which improves data reusability and also adapts to the data-processing size of each compute core. The states of the second data flow storage module include a data-read-ready state and a data-write-complete state; the states of a compute core include whether the computation is complete and whether the next batch of computation data needs to be read. The state of the second data flow storage module can be obtained by monitoring the data in the module with a finite state machine, and the state of a compute core can be derived from the state of the second data flow storage module; for example, once a calculation result has been written into the second data flow storage module, the compute core's state can be determined to be computation-finished.
By obtaining the state of each compute core and of the second data flow storage module in each clock cycle, waits can be calculated accurately and predictably, enabling hardware performance optimization at maximum efficiency and further improving the efficiency of data processing.
Optionally, the second data flow storage module comprises a first storage unit and a second storage unit, and processing the data to be processed through the target data flow engine comprises:
inputting the data in the first storage unit into a compute core to obtain a calculation result;
storing the calculation result into the second storage unit as the input data for the next compute-core calculation.
In this embodiment, the first storage unit may be an input data flow storage unit and the second storage unit an output data flow storage unit; the first storage unit and the second storage unit are used alternately for data access, so that the first storage unit inputs data into the compute core for calculation and the compute core outputs the calculation result to the second storage unit for storage. This avoids the situation where, while the first storage unit is feeding data to the compute core, the compute core's output cannot be written back into the first storage unit. For example, if the compute core needs to compute twice on a piece of data in the first storage unit, then after the first computation it needs to read that data from the first storage unit a second time; ordinarily it would have to wait for the first calculation result to be stored into the first storage unit before reading the data again, but with the second storage unit in place, the first calculation result can be stored into the second storage unit while the data is read a second time from the first storage unit, without waiting, which improves the efficiency of data processing.
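A minimal ping-pong sketch of these alternating storage units (the buffer contents and core function are illustrative assumptions):

# Ping-pong buffering: the core reads pass k from one buffer while pass k's
# result lands in the other, so a repeated read never waits on the write-back.
def pingpong(core, data, passes):
    buf_a, buf_b = list(data), [None] * len(data)    # first / second storage unit
    for _ in range(passes):
        # The result streams into the idle buffer while buf_a stays readable.
        buf_b = [core(v) for v in buf_a]
        buf_a, buf_b = buf_b, buf_a                  # swap roles for the next pass
    return buf_a

print(pingpong(lambda v: v + 1, [0, 1, 2], passes=2))  # [2, 3, 4]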
The above optional embodiments can realize the data-flow-based deep network acceleration methods of the embodiments corresponding to Fig. 4 and Fig. 5 and achieve the same effects, which will not be repeated here. It should be noted that each of the above embodiments can also be combined with the embodiments of Fig. 2 and Fig. 3.
In a third aspect, please refer to Fig. 6, which is a schematic diagram of a data-flow-based deep network acceleration apparatus provided by the embodiments of the present application. As shown in Fig. 6, the apparatus comprises:
a first obtaining module 601, configured to obtain the target deep network information required by data to be processed;
a first matching module 602, configured to match, according to the target deep network information, a pre-set target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises configuration rules among pre-configured compute engines, a first data flow storage module and a global data flow network;
a first configuration module 603, configured to configure a target data flow network according to the target network configuration rule;
a first processing module 604, configured to process the data to be processed through the target data flow network.
Optionally, the first configuration module 603 comprises:
a global configuration submodule, configured to configure parallel or serial connections among multiple compute engines according to the global data flow network;
a path configuration submodule, configured to obtain the data flow paths of the multiple compute engines according to the parallel or serial connections between the first data flow storage module and the multiple compute engines;
a forming submodule, configured to form the target data flow network based on the data flow paths.
Optionally, the first processing module 604 comprises:
a first acquisition submodule, configured to read the data to be processed into the first data flow storage module;
a first data-address generation submodule, configured to generate, in the first data flow storage module, an address sequence for the data to be processed by a pre-set generation rule, according to the data format and data path of the data to be processed;
a first input submodule, configured to read, in each clock cycle and according to the address sequence, a data volume corresponding to a compute engine from the first data flow storage module, input it into the target data flow network, and obtain the states of the first data flow storage module and the compute engines.
Optionally, the target network configuration further comprises compute cores, a second data flow storage unit, and a local data flow network connecting the compute cores and the second data flow storage unit, and the first configuration module 603 further comprises:
a first local configuration submodule, configured to configure the interconnection of the compute cores with the local data flow network to obtain the calculation paths of the compute cores;
a first local path submodule, configured to configure the interconnection of the second data flow storage unit with the local data flow network to obtain storage paths;
a first engine module, configured to obtain the compute engines according to the calculation paths and the storage paths.
Fourth aspect, please refers to Fig. 7, and Fig. 7 is that a kind of depth network based on data flow provided by the embodiments of the present application adds Speed variator schematic diagram, as shown in fig. 7, described device includes:
Second obtains module 701, for obtaining the target depth network information required for pending data;
Second matching module 702, for matching the pre-set and target according to the target depth network information The corresponding target network configuration rule of the depth network information, wherein the target network configuration rule includes calculating core, the second number According to stream memory module and local data's flow network;
Second configuration module 703, for according to the target network configuration rule, configuration to obtain target data stream engine;
Second processing module 704, for being handled by the target data stream engine the pending data.
Optionally, second configuration module 703 includes:
Second local configuration submodule is obtained for configuring the interconnection for calculating core and local data's flow network Calculate the calculating path of core;
Second local path submodule, for configuring the second data flow storage module and local data's flow network Interconnection, obtain store path;
Second engine modules, for obtaining the target data stream engine according to the calculating path and store path.
Optionally, the Second processing module 704 includes:
Second acquisition submodule, for the pending data to be read the second data flow memory module;
Second data address generates submodule, is used in the second data flow memory module, according to described to be processed The data format and data path of data are that the pending data generates address sequence by pre-set create-rule;
Second input submodule is used for each clock cycle, according to address sequence from the second data flow memory module Middle reading is inputted with data volume corresponding with core is calculated in the target data stream engine, and obtains the storage of the second data flow Module and the state for calculating core.
Optionally, the second processing module 704 further includes:
an input calculation submodule, configured to input the data in the first storage unit into the computing core to obtain a calculation result;
an output storage submodule, configured to store the calculation result into the second storage unit as the input data of the next computing core.
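The two storage units thus behave as a double buffer: each core reads from one unit and writes its result to the other, which then feeds the next core without a round trip to external memory. A minimal Python sketch of this handoff (the names and the callable-core abstraction are editorial assumptions):

def run_pipeline(cores, data):
    buf_a, buf_b = list(data), []          # first and second storage units
    for core in cores:
        buf_b = [core(x) for x in buf_a]   # compute from one unit
        buf_a, buf_b = buf_b, buf_a        # result becomes next core's input
    return buf_a

# Example: two cores applied in sequence.
print(run_pipeline([lambda x: x * 2, lambda x: x + 1], [1, 2, 3]))  # [3, 5, 7]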
In a fifth aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps in the data-flow-based deep network acceleration method provided by the embodiments of the present application.
In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps in the data-flow-based deep network acceleration method provided by the embodiments of the present application.
It should be noted that, for the sake of brevity, the foregoing method embodiments are described as a series of action combinations. Those skilled in the art should understand, however, that the present application is not limited by the described order of actions, since according to the present application some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in this specification are alternative embodiments, and that the actions and modules involved are not necessarily required by the present application.
The description of each of the above embodiments has its own emphasis; for parts not detailed in one embodiment, reference may be made to the related descriptions of the other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative.
In addition, the processors and chips in the embodiments of the present application may be integrated into one processing unit, may exist physically alone, or two or more pieces of hardware may be integrated into one unit. The computer-readable storage medium or computer-readable program may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. Such a computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disc.
Those of ordinary skill in the art will understand that all or part of the steps in the various methods of the above embodiments may be completed by a program instructing relevant hardware; the program may be stored in a computer-readable memory, which may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
The above content is a further detailed description of the present application in conjunction with specific preferred embodiments, and the specific implementation of the present application shall not be considered limited to these descriptions. For those of ordinary skill in the technical field to which the present application belongs, a number of simple deductions or substitutions may be made without departing from the concept of the present application, all of which shall be regarded as falling within the protection scope of the present application.

Claims (12)

1. A data-flow-based deep network acceleration method, characterized in that the method comprises:
obtaining target deep network information required by data to be processed;
matching, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises configuration rules among preconfigured computing engines, a first data flow memory module, and a global data flow network;
configuring a target data flow network according to the target network configuration rule; and
processing the data to be processed through the target data flow network.
2. The method according to claim 1, characterized in that configuring the target data flow network according to the target network configuration rule comprises:
configuring parallel or serial connections among the multiple computing engines according to the global data flow network;
obtaining the data flow paths of the multiple computing engines according to the parallel or serial connections between the first data flow memory module and the multiple computing engines; and
forming the target data flow network based on the data flow paths.
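For illustration only, and not as part of the claims, the parallel/serial composition of engines into data flow paths recited in claim 2 can be sketched in Python; the serial and parallel combinators below are editorial assumptions:

def serial(engines):
    # The output of each engine feeds the next one.
    def path(x):
        for engine in engines:
            x = engine(x)
        return x
    return path

def parallel(engines):
    # All engines consume the same input side by side.
    def path(x):
        return [engine(x) for engine in engines]
    return path

# A target data flow network: one engine in series with two parallel engines.
network = serial([lambda x: x + 1, parallel([lambda x: x * 2, lambda x: x * 3])])
print(network(1))  # [4, 6]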
3. The method according to claim 1, characterized in that processing the data to be processed through the target data flow network comprises:
reading the data to be processed into the first data flow memory module;
generating, in the first data flow memory module, an address sequence for the data to be processed by a preset generation rule, according to the data format and data path of the data to be processed; and
in each clock cycle, reading from the first data flow memory module according to the address sequence an amount of data matching the computing engines in the target data flow network, inputting the data, and obtaining the states of the first data flow memory module and the computing engines.
4. The method according to any one of claims 1 to 3, characterized in that the target network configuration further comprises a computing core, a second data flow memory module, and a local data flow network connecting the computing core and the second data flow memory module, and the configuration of the computing engines comprises:
configuring the interconnection between the computing core and the local data flow network to obtain the calculation path of the computing core;
configuring the interconnection between the second data flow memory module and the local data flow network to obtain the storage path; and
obtaining the computing engines according to the calculation path and the storage path.
5. A data-flow-based deep network acceleration method, characterized in that the method comprises:
obtaining target deep network information required by data to be processed;
matching, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises a computing core, a second data flow memory module, and a local data flow network;
configuring a target data flow engine according to the target network configuration rule; and
processing the data to be processed through the target data flow engine.
6. The method according to claim 5, characterized in that configuring the target data flow engine according to the target network configuration rule comprises:
configuring the interconnection between the computing core and the local data flow network to obtain the calculation path of the computing core;
configuring the interconnection between the second data flow memory module and the local data flow network to obtain the storage path; and
obtaining the target data flow engine according to the calculation path and the storage path.
7. The method according to claim 5, characterized in that processing the data to be processed through the target data flow engine comprises:
reading the data to be processed into the second data flow memory module;
generating, in the second data flow memory module, an address sequence for the data to be processed by a preset generation rule, according to the data format and data path of the data to be processed; and
in each clock cycle, reading from the second data flow memory module according to the address sequence an amount of data matching the computing core in the target data flow engine, inputting the data, and obtaining the states of the second data flow memory module and the computing core.
8. The method according to any one of claims 5 to 7, characterized in that the second data flow memory module comprises a first storage unit and a second storage unit, and processing the data to be processed through the target data flow engine comprises:
inputting the data in the first storage unit into the computing core to obtain a calculation result; and
storing the calculation result into the second storage unit as the input data of the next computing core.
9. A data-flow-based deep network acceleration apparatus, characterized in that the apparatus comprises:
a first obtaining module, configured to obtain target deep network information required by data to be processed;
a first matching module, configured to match, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises configuration rules among preconfigured computing engines, a first data flow memory module, and a global data flow network;
a first configuration module, configured to configure a target data flow network according to the target network configuration rule; and
a first processing module, configured to process the data to be processed through the target data flow network.
10. A data-flow-based deep network acceleration apparatus, characterized in that the apparatus comprises:
a second obtaining module, configured to obtain target deep network information required by data to be processed;
a second matching module, configured to match, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises a computing core, a second data flow memory module, and a local data flow network;
a second configuration module, configured to configure a target data flow engine according to the target network configuration rule; and
a second processing module, configured to process the data to be processed through the target data flow engine.
11. An electronic device, characterized by comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps in the data-flow-based deep network acceleration method according to any one of claims 1 to 4.
12. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps in the data-flow-based deep network acceleration method according to any one of claims 1 to 4.
CN201910280156.2A 2019-04-09 2019-04-09 Deep network acceleration method, device, equipment and storage medium based on data stream Active CN110046704B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910280156.2A CN110046704B (en) 2019-04-09 2019-04-09 Deep network acceleration method, device, equipment and storage medium based on data stream
PCT/CN2019/082101 WO2020206637A1 (en) 2019-04-09 2019-04-10 Deep network acceleration methods and apparatuses based on data stream, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910280156.2A CN110046704B (en) 2019-04-09 2019-04-09 Deep network acceleration method, device, equipment and storage medium based on data stream

Publications (2)

Publication Number Publication Date
CN110046704A (en) 2019-07-23
CN110046704B (en) 2022-11-08

Family

ID=67276511

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910280156.2A Active CN110046704B (en) 2019-04-09 2019-04-09 Deep network acceleration method, device, equipment and storage medium based on data stream

Country Status (2)

Country Link
CN (1) CN110046704B (en)
WO (1) WO2020206637A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9185093B2 (en) * 2012-10-16 2015-11-10 Mcafee, Inc. System and method for correlating network information with subscriber information in a mobile network environment
CN106447034B (en) * 2016-10-27 2019-07-30 中国科学院计算技术研究所 A kind of neural network processor based on data compression, design method, chip
CN108154165B (en) * 2017-11-20 2021-12-07 华南师范大学 Marriage and love object matching data processing method and device based on big data and deep learning, computer equipment and storage medium
CN109445935B (en) * 2018-10-10 2021-08-10 杭州电子科技大学 Self-adaptive configuration method of high-performance big data analysis system in cloud computing environment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170316312A1 (en) * 2016-05-02 2017-11-02 Cavium, Inc. Systems and methods for deep learning processor
US20180189638A1 (en) * 2016-12-31 2018-07-05 Intel Corporation Hardware accelerator template and design framework for implementing recurrent neural networks
CN108268943A (en) * 2017-01-04 2018-07-10 意法半导体股份有限公司 Hardware accelerator engine
CN107066239A (en) * 2017-03-01 2017-08-18 智擎信息系统(上海)有限公司 A kind of hardware configuration for realizing convolutional neural networks forward calculation
CN108710941A (en) * 2018-04-11 2018-10-26 杭州菲数科技有限公司 The hard acceleration method and device of neural network model for electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AHMAD SHAWAHNA et al.: "FPGA-Based Accelerators of Deep Learning Networks for Learning and Classification: A Review", IEEE Access *
SHUANGLONG LIU et al.: "Optimizing CNN-based segmentation with deeply customized convolutional and deconvolutional architectures on FPGA", ACM Transactions on Reconfigurable Technology and Systems *
LI Jingjun et al.: "Performance analysis of neural networks for the training stage", Journal of Frontiers of Computer Science and Technology *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112840284A (en) * 2019-08-13 2021-05-25 深圳鲲云信息科技有限公司 Automatic driving method and device based on data stream, electronic equipment and storage medium
CN113272792A (en) * 2019-10-12 2021-08-17 深圳鲲云信息科技有限公司 Local data stream acceleration method, data stream acceleration system and computer equipment
WO2021068244A1 (en) * 2019-10-12 2021-04-15 深圳鲲云信息科技有限公司 Local data stream acceleration method, data stream acceleration system, and computer device
CN112905525A (en) * 2019-11-19 2021-06-04 中科寒武纪科技股份有限公司 Method and equipment for controlling calculation of arithmetic device
CN112905525B (en) * 2019-11-19 2024-04-05 中科寒武纪科技股份有限公司 Method and equipment for controlling computing device to perform computation
CN111404770A (en) * 2020-02-29 2020-07-10 华为技术有限公司 Network device, data processing method, device, system and readable storage medium
WO2021259232A1 (en) * 2020-06-22 2021-12-30 深圳鲲云信息科技有限公司 Data processing method and apparatus of ai chip and computer device
CN111857989A (en) * 2020-06-22 2020-10-30 深圳鲲云信息科技有限公司 Artificial intelligence chip and data processing method based on artificial intelligence chip
CN111752887A (en) * 2020-06-22 2020-10-09 深圳鲲云信息科技有限公司 Artificial intelligence chip and data processing method based on artificial intelligence chip
CN111753994B (en) * 2020-06-22 2023-11-03 深圳鲲云信息科技有限公司 Data processing method and device of AI chip and computer equipment
CN111857989B (en) * 2020-06-22 2024-02-27 深圳鲲云信息科技有限公司 Artificial intelligence chip and data processing method based on same
CN111752887B (en) * 2020-06-22 2024-03-15 深圳鲲云信息科技有限公司 Artificial intelligence chip and data processing method based on same
CN111753994A (en) * 2020-06-22 2020-10-09 深圳鲲云信息科技有限公司 Data processing method and device of AI chip and computer equipment
WO2022028224A1 (en) * 2020-08-03 2022-02-10 深圳鲲云信息科技有限公司 Data storage method and apparatus, and device and storage medium
CN114021708A (en) * 2021-09-30 2022-02-08 浪潮电子信息产业股份有限公司 Data processing method, device and system, electronic equipment and storage medium
WO2023050807A1 (en) * 2021-09-30 2023-04-06 浪潮电子信息产业股份有限公司 Data processing method, apparatus, and system, electronic device, and storage medium
CN114461978A (en) * 2022-04-13 2022-05-10 苏州浪潮智能科技有限公司 Data processing method and device, electronic equipment and readable storage medium
CN116974654A (en) * 2023-09-21 2023-10-31 浙江大华技术股份有限公司 Image data processing method and device, electronic equipment and storage medium
CN116974654B (en) * 2023-09-21 2023-12-19 浙江大华技术股份有限公司 Image data processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110046704B (en) 2022-11-08
WO2020206637A1 (en) 2020-10-15

Similar Documents

Publication Publication Date Title
CN110046704A (en) Depth network accelerating method, device, equipment and storage medium based on data flow
CN110689138B (en) Operation method, device and related product
KR102175044B1 (en) Apparatus and method for running artificial neural network reverse training
CN108241890B (en) Reconfigurable neural network acceleration method and architecture
US20180260709A1 (en) Calculating device and method for a sparsely connected artificial neural network
JP7078758B2 (en) Improving machine learning models to improve locality
JP2020518042A (en) Processing device and processing method
CN110326003A (en) The hardware node with location-dependent query memory for Processing with Neural Network
CN107766935B (en) Multilayer artificial neural network
CN115186821B (en) Core particle-oriented neural network inference overhead estimation method and device and electronic equipment
KR20180102059A (en) Apparatus and method for executing forward operation of artificial neural network
CN109409510A (en) Neuron circuit, chip, system and method, storage medium
CN110309911A (en) Neural network model verification method, device, computer equipment and storage medium
CN103870335A (en) System and method for efficient resource management of signal flow programmed digital signal processor code
CN107402905A (en) Computational methods and device based on neutral net
CN112257368A (en) Clock layout method, device, EDA tool and computer readable storage medium
EP3451240A1 (en) Apparatus and method for performing auto-learning operation of artificial neural network
US20210125042A1 (en) Heterogeneous deep learning accelerator
WO2023125857A1 (en) Model training method based on machine learning framework system and related device
CN112114942A (en) Streaming data processing method based on many-core processor and computing device
WO2023065701A1 (en) Inner product processing component, arbitrary-precision computing device and method, and readable storage medium
KR101825880B1 (en) Input/output relationship based test case generation method for software component-based robot system and apparatus performing the same
CN110490308A (en) Accelerate design method, terminal device and the storage medium in library
CN110442753A (en) A kind of chart database auto-creating method and device based on OPC UA
Gonçalves et al. Exploring data size to run convolutional neural networks in low density fpgas

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant