WO2020206637A1 - Deep network acceleration methods and apparatuses based on data stream, device, and storage medium - Google Patents

Deep network acceleration methods and apparatuses based on data stream, device, and storage medium Download PDF

Info

Publication number
WO2020206637A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
target
network
data stream
calculation
Prior art date
Application number
PCT/CN2019/082101
Other languages
French (fr)
Chinese (zh)
Inventor
牛昕宇
蔡权雄
Original Assignee
深圳鲲云信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳鲲云信息科技有限公司 filed Critical 深圳鲲云信息科技有限公司
Publication of WO2020206637A1 publication Critical patent/WO2020206637A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7814Specially adapted for real time processing, e.g. comprising hardware timers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • This application relates to the field of artificial intelligence, and more specifically, to a data stream-based deep network acceleration method, apparatus, device, and storage medium.
  • The advancement of neural-network-based deep learning applications requires high processing capability from the underlying hardware platform.
  • Because CPU-based platforms cannot meet this growing demand, many companies have developed dedicated hardware accelerators to support progress in this field.
  • The common idea of existing hardware accelerators is to accelerate certain types of calculations that are used frequently in deep learning applications.
  • Existing hardware architectures are based on the execution of instructions from an extensible instruction set, and achieve acceleration by implementing common calculations as customized instructions.
  • Instruction-based architecture implementations are usually expressed as system-on-chip (SoC) designs.
  • In an instruction-based architecture, many clock cycles are wasted on non-computation-related operations.
  • To support a more general instruction architecture, calculations in deep learning neural networks are usually decomposed into multiple instructions, so one calculation usually requires multiple clock cycles.
  • The arithmetic logic unit (ALU) in a processor is usually a collection of different operations implemented in hardware. Because of limited instruction expressiveness and limited I/O bandwidth, most ALU resources are idle while a single instruction executes. For example, when performing a multiplication followed by an addition, the multiplication operands are read first; because I/O speed is limited by bandwidth, the addition must wait until the multiplication completes and its result is written to memory, after which the result and the addition operands are read back for the addition. During the multiplication and the reads and writes, the addition unit is idle. Instruction-based hardware acceleration is therefore inefficient.
  • The purpose of this application is to provide, in view of the above defects in the prior art, a data stream-based deep network acceleration method, apparatus, device, and storage medium,
  • solving the problem that, with limited instruction expressiveness and limited I/O bandwidth, most ALU resources are idle while a single instruction executes, so acceleration efficiency is low.
  • In a first aspect, a data stream-based deep network acceleration method includes: obtaining the target deep network information required by the data to be processed;
  • according to the target deep network information, matching a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network;
  • configuring a target data stream network according to the target network configuration rule; and
  • processing the data to be processed through the target data stream network.
  • Optionally, configuring the target data stream network according to the target network configuration rule includes: configuring parallel or serial connections among multiple calculation engines according to the global data stream network;
  • obtaining the data stream paths of the multiple calculation engines according to the first data stream storage module and the parallel or serial connections among them; and forming the target data stream network based on the data stream paths.
  • Optionally, processing the data to be processed through the target data stream network includes: reading the data to be processed into the first data stream storage module;
  • in the first data stream storage module, generating an address sequence for the data to be processed according to a preset generation rule, based on its data format and data path;
  • and, in each clock cycle, reading from the first data stream storage module, according to the address sequence, the amount of data corresponding to a calculation engine in the target data stream network for input, and obtaining the states of the first data stream storage module and the calculation engine.
  • Optionally, the target network configuration further includes a computing core, a second data stream storage unit, and a local data stream network connecting the computing core and the second data stream storage unit, and the configuration of the calculation engine includes: configuring the interconnection between the computing core and the local data stream network to obtain the computing path of the computing core;
  • configuring the interconnection between the second data stream storage unit and the local data stream network to obtain a storage path; and obtaining the calculation engine according to the computing path and the storage path.
  • In a second aspect, a data stream-based deep network acceleration method includes: obtaining the target deep network information required by the data to be processed;
  • according to the target deep network information, matching a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule includes a computing core, a second data stream storage module, and a local data stream network;
  • configuring a target data stream engine according to the target network configuration rule; and
  • processing the data to be processed through the target data stream engine.
  • Optionally, configuring the target data stream engine according to the target network configuration rule includes: configuring the interconnection between the computing core and the local data stream network to obtain the computing path of the computing core;
  • configuring the interconnection between the second data stream storage module and the local data stream network to obtain a storage path; and obtaining the target data stream engine according to the computing path and the storage path.
  • Optionally, processing the data to be processed through the target data stream engine includes: reading the data to be processed into the second data stream storage module;
  • in the second data stream storage module, generating an address sequence for the data to be processed according to a preset generation rule, based on its data format and data path;
  • and, in each clock cycle, reading from the second data stream storage module, according to the address sequence, the amount of data corresponding to a computing core in the target data stream engine for input, and obtaining the states of the second data stream storage module and the computing core.
  • Optionally, the second data stream storage module includes a first storage unit and a second storage unit,
  • and processing the data to be processed through the target data stream engine includes: inputting the data in the first storage unit into a computing core to obtain a calculation result;
  • and storing the calculation result in the second storage unit as the input data of the next computing core.
  • a data stream-based deep network acceleration device includes:
  • the first obtaining module is used to obtain target deep network information required by the data to be processed
  • the first matching module is configured to match a preset target network configuration rule corresponding to the target deep network information according to the target deep network information, wherein the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network;
  • the first configuration module is configured to configure the target data flow network according to the target network configuration rule
  • the first processing module is configured to process the to-be-processed data through the target data flow network.
  • a data stream-based deep network acceleration device comprising:
  • the second acquisition module is used to acquire target deep network information required by the data to be processed
  • the second matching module is configured to match a preset target network configuration rule corresponding to the target deep network information according to the target deep network information, wherein the target network configuration rule includes a computing core, a second data stream storage module, and a local data stream network;
  • the second configuration module is used to configure the target data flow engine according to the target network configuration rule
  • the second processing module is configured to process the to-be-processed data through the target data flow engine.
  • an electronic device, including: a memory, a processor, and a computer program stored on the memory and capable of running on the processor; when the processor executes the computer program, the steps in the data stream-based deep network acceleration method provided in the embodiments of this application are implemented.
  • a computer-readable storage medium is provided, and a computer program is stored on the computer-readable storage medium.
  • When the computer program is executed by a processor, the steps in the data stream-based deep network acceleration method provided in the embodiments of the present application are implemented.
  • The deep network is accelerated through the data stream and off-chip data communication is reduced, so there is no instruction idle overhead, and the hardware acceleration efficiency of the deep network can be improved.
  • Moreover, different deep network models can be configured through network configuration, so a variety of deep network models are supported.
  • FIG. 1 is a schematic diagram of an optional implementation architecture of a data stream-based deep network acceleration method provided by an embodiment of this application;
  • FIG. 2 is a schematic flowchart of a data stream-based deep network acceleration method provided by the first aspect of the embodiments of this application;
  • FIG. 3 is a schematic flowchart of another data stream-based deep network acceleration method provided by an embodiment of this application;
  • FIG. 4 is a schematic flowchart of a data stream-based deep network acceleration method provided by the second aspect of the embodiments of this application;
  • FIG. 5 is a schematic flowchart of another data stream-based deep network acceleration method provided by an embodiment of this application;
  • FIG. 6 is a schematic diagram of a data stream-based deep network acceleration device provided by the third aspect of the embodiments of this application;
  • FIG. 7 is a schematic diagram of a data stream-based deep network acceleration device provided by the fourth aspect of the embodiments of this application.
  • FIG. 1 is a schematic diagram of an optional implementation architecture of a data stream-based deep network acceleration method provided by an embodiment of this application.
  • The architecture 103 is connected through interconnection to the off-chip storage module (DDR) 101 and the CPU.
  • The architecture 103 includes: a first storage module 104, a global data flow network 105, and a data flow engine 106.
  • The first storage module 104 is connected through interconnection to the off-chip storage module 101 and is also connected through interconnection to the global data flow network 105; the data flow engine 106 is connected to the global data flow network 105 through interconnection, so that data flow engines 106 can run in parallel or in series.
  • the aforementioned data flow engine 106 may include: a computing core (or called a computing module), a second storage module 108, and a local data flow network 107.
  • The computing core may include cores used for computation, such as a convolution core 109, a pooling core 110, and an activation function core 111. Of course, other computing cores besides these examples may also be included, which is not limited here, as long as they can be included in the deep network.
  • the above-mentioned first storage module 104 and the above-mentioned second storage module 108 may be on-chip cache modules, or may be DDR or high-speed DDR memory modules.
  • the above-mentioned data stream engine 106 can be understood as a computing engine that supports data stream processing, and can also be understood as a computing engine dedicated to data stream processing.
  • the foregoing CPU may include a control register, and the foregoing control register is pre-configured with network configuration rules for configuring the network.
  • the deep network in this application may also be called a deep learning network, a deep learning neural network, and the like.
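  • To make the relationships among the modules in FIG. 1 concrete, the following is a minimal, illustrative Python sketch that models the architecture as plain data structures. All class and field names (Architecture, DataFlowEngine, and so on) are hypothetical conveniences for illustration; the application does not define a software API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ComputingCore:
    """A computing core inside a data flow engine, e.g. the convolution
    core 109, pooling core 110, or activation function core 111."""
    kind: str  # "conv", "pool", "activation", ...

@dataclass
class DataFlowEngine:
    """Data flow engine 106: computing cores joined by the local data
    flow network 107, with a second storage module 108 for data access."""
    cores: List[ComputingCore] = field(default_factory=list)

@dataclass
class Architecture:
    """On-chip architecture 103: a first storage module 104, a global
    data flow network 105, and one or more data flow engines 106. The
    global network determines whether engines run in parallel or serial."""
    engines: List[DataFlowEngine] = field(default_factory=list)
    serial: bool = True  # True: engines chained; False: engines in parallel

# Example: one engine whose flow path is conv -> pool -> activation.
arch = Architecture(engines=[DataFlowEngine(cores=[
    ComputingCore("conv"), ComputingCore("pool"), ComputingCore("activation")])])
```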
  • This application provides a data stream-based deep network acceleration method, device, equipment and storage medium.
  • FIG. 2 is a schematic flowchart of a data stream-based deep network acceleration method provided by an embodiment of the present application. As shown in FIG. 2, the method includes the following steps:
  • the aforementioned data to be processed may be data that can be processed through a deep network, such as image data to be identified, target data to be detected, target data to be tracked, and so on.
  • The target deep network information is the information of the deep network corresponding to the data to be processed. For example, if the data to be processed is image data to be recognized, the target deep network information is the configuration parameters of the deep network used for image recognition; if the data to be processed is target data to be detected, the target deep network information is the configuration parameters of the deep network used for target detection.
  • The target deep network information may be preset, and may be determined by matching against the data to be processed or selected manually, which is not limited here.
  • Obtaining the target deep network information can facilitate the configuration of the deep network.
  • the aforementioned deep network information may include network type, data type, number of layers, calculation type, and so on.
  • According to the target deep network information, a preset target network configuration rule corresponding to the target deep network information is matched, wherein the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network.
  • The target deep network information contains the network type, data type, number of layers, calculation type, and so on of the deep network required by the data to be processed.
  • The target network configuration rules can be set in advance, for example, the parameter rules and calculation rules of preset image recognition networks, target detection networks, target tracking networks, and other types of networks.
  • The parameter rules may be hyperparameter setting rules, weight setting rules, and so on; the calculation rules may be the calculation rules for addition, multiplication, convolution, deconvolution, and so on. One possible way to organize such preset rules is sketched below.
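  • As one possible reading of this matching step, the target deep network information (network type, data type, number of layers, calculation type) can serve as a key into a preset table of configuration rules. The dictionary lookup below is only an illustrative sketch; the table contents and all names are assumptions, not a format defined by this application.

```python
# Hypothetical preset table: network type -> target network configuration rule.
PRESET_RULES = {
    "image_recognition": {"engines": 2, "connection": "serial",
                          "cores": ["conv", "pool", "activation"]},
    "target_detection":  {"engines": 4, "connection": "parallel",
                          "cores": ["conv", "pool", "activation"]},
}

def match_rule(target_info: dict) -> dict:
    """Match the preset target network configuration rule corresponding
    to the target deep network information (keyed here by network type)."""
    try:
        return PRESET_RULES[target_info["network_type"]]
    except KeyError:
        raise ValueError(f"no preset rule matches {target_info!r}")

rule = match_rule({"network_type": "image_recognition",
                   "data_type": "int8", "layers": 16})
```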
  • the foregoing configuration rules between the pre-configured calculation engine, the first data stream storage module, and the global data stream network can be understood as the number of calculation engines and the connection mode between the calculation engine and the global data stream network.
  • the global data flow network can be configured by the control register.
  • the network implementation may be a router between the first data flow storage module and the calculation engine.
  • The first data stream storage module may include two data stream storage units, one for input and one for output, used for alternating data access: the input data stream storage unit feeds input data into the calculation engine,
  • and the calculation engine outputs its results to the output data stream storage unit for storage. This prevents the conflict in which the input unit is feeding data to the calculation engine while the engine's output cannot be written back into that same unit.
  • For example, when the calculation engine must perform two passes over a piece of data in the input unit, after the first pass it needs to read the data a second time;
  • with separate units, the first result can be stored into the output unit at the same time as the second read proceeds, without waiting, which improves data processing efficiency. This behaves like the double-buffer sketch below.
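  • The alternating input/output storage units described above behave like a classic double ("ping-pong") buffer: while a result streams into one unit, the other can already be read again. A minimal software sketch of that behavior, with all names assumed:

```python
class PingPongBuffer:
    """Two storage units used alternately: one feeds the calculation
    engine while the other receives its results, so a second read of the
    input never has to wait for the previous write to finish."""
    def __init__(self):
        self.units = [[], []]
        self.read_idx = 0            # unit currently feeding the engine

    def read(self):
        return list(self.units[self.read_idx])

    def write(self, result):
        # Results always go to the unit NOT currently being read.
        self.units[1 - self.read_idx] = result

    def swap(self):
        # Freshly written results become the next stage's input.
        self.read_idx = 1 - self.read_idx

buf = PingPongBuffer()
buf.units[0] = [1, 2, 3]              # data that must be processed twice
first = [x * x for x in buf.read()]   # first pass over the input
buf.write(first)                      # store pass-1 results...
second_read = buf.read()              # ...while reading the input again
```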
  • Configuring according to the target network configuration rule may mean establishing the connection relationship among the pre-configured calculation engines, the first data stream storage module, and the global data stream network.
  • The connection relationship may include the number of calculation engines connected and their connection order.
  • Calculation engines can be connected to the global data stream network through interconnection to form a new deep network, and different deep networks can be formed from different numbers and orders of calculation engines.
  • The target data stream network obtained in this way is used to process the data to be processed. Because each calculation engine reads data through the first data stream storage module, the data in that module can be read into different calculation engines to form a data stream without any instruction-set ordering, so a configured calculation engine produces no computation gaps.
  • The target data stream network is configured from the target network information and may also be called a customized data stream network.
  • The target data stream network connects the first data stream storage module and the calculation engines through the global data stream network to form a data stream; compared with an instruction-set implementation, there is no need to wait for the previous instruction's reads and writes to complete, which improves computational efficiency under the deep network architecture.
  • In the embodiments of this application, the target deep network information required by the data to be processed is acquired; according to that information, a preset target network configuration rule is matched, wherein the rule includes configuration rules among the pre-configured calculation engine, the first data stream storage module, and the global data stream network; the target data stream network is configured according to the rule; and the data to be processed is processed through it. Accelerating the deep network through data flow reduces off-chip data communication, so there is no instruction idle overhead, which improves the hardware acceleration efficiency of the deep network; moreover, through network configuration, different deep network models can be configured, supporting a variety of deep network models.
  • The data stream-based deep network acceleration method provided in the embodiments of the present application can be applied to devices for data stream deep network acceleration, such as computers, servers, mobile phones, and other devices capable of performing data stream-based deep network acceleration.
  • FIG. 3 is a schematic flowchart of another data flow-based deep network acceleration method provided by an embodiment of the present application. As shown in FIG. 3, the method includes the following steps:
  • According to the target deep network information, a preset target network configuration rule corresponding to the target deep network information is matched, wherein the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network.
  • the aforementioned global data flow network can be implemented by routing, and the global data flow network can be configured by a control register, and corresponding global data flow network configuration rules are preset in the aforementioned control register.
  • the network is implemented as a router between the first data flow storage module and each calculation engine, and the main function of the network router is to provide skip paths and feedback paths for data flows between each calculation engine.
  • Parallel or serial operation among multiple calculation engines can be configured through the data flow. For example, when calculation engine A and calculation engine B are parallel in the global data flow network, the data stream flows to calculation engine A and calculation engine B at the same time;
  • when they are serial, the data stream first enters calculation engine A for calculation, and the calculation result then flows into calculation engine B.
  • The serial mode can be understood as deepening the computing layers of the deep network.
  • The specific configuration may be to control the data flow direction through the global data flow network, thereby realizing the parallel or serial configuration among multiple computing engines.
  • The parallel or serial configuration among multiple calculation engines can be obtained by configuring the interconnection between the global data flow network and the calculation engines: for example, multiple calculation engines may be interconnected with the global data flow network according to parallel rules,
  • or according to serial rules, and the first data stream storage module is configured to interconnect with the global data flow network. Both composition modes are sketched below.
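  • The two composition modes can be pictured as two ways of combining engine functions: in series, engine A's output becomes engine B's input (deepening the computing layers); in parallel, the same data stream fans out to A and B simultaneously. An illustrative sketch in which simple functions stand in for hardware engines:

```python
from typing import Callable, List

Engine = Callable[[list], list]

def serial(engines: List[Engine]) -> Engine:
    """Chain engines: data flows through A, then its result through B."""
    def run(data: list) -> list:
        for engine in engines:
            data = engine(data)
        return data
    return run

def parallel(engines: List[Engine]) -> Callable[[list], List[list]]:
    """Fan the same data stream out to every engine at the same time."""
    def run(data: list) -> List[list]:
        return [engine(data) for engine in engines]
    return run

engine_a: Engine = lambda d: [x + 1 for x in d]   # stand-in for engine A
engine_b: Engine = lambda d: [x * 2 for x in d]   # stand-in for engine B

print(serial([engine_a, engine_b])([1, 2, 3]))    # [4, 6, 8]
print(parallel([engine_a, engine_b])([1, 2, 3]))  # [[2, 3, 4], [2, 4, 6]]
```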
  • The first data stream storage module may be a cache, DDR, or high-speed-access DDR;
  • a cache is preferred.
  • A controllable read/write address generation unit may be provided in the cache.
  • The address generation unit generates an adapted address sequence to index the data in the cache.
  • The address sequence can be used to index data in the cache and input it to the corresponding calculation engine; for example, if a calculation engine requires 80 data items for a calculation, the 80 data items addressed by the sequence are read from the cache into the calculation engine.
  • The address generation unit may also be provided with a counter so that the generated address sequence has different cycle sizes, for example a small cycle of data 1, data 2, data 3; this improves data reusability and also adapts to the data processing size of each calculation engine (see the sketch below).
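  • In software terms, the address generation unit can be approximated as a generator that walks the cache with a counter and a configurable cycle size, so that a short window of addresses (data 1, data 2, data 3) repeats and the same data is reused. A minimal sketch, with the parameter names assumed for illustration:

```python
def address_sequence(base: int, count: int, cycle: int = 0):
    """Yield read addresses for one calculation engine.

    base  -- first address of the data block in the cache
    count -- how many data items the engine consumes (e.g. 80)
    cycle -- if > 0, repeat a small window of `cycle` addresses,
             letting the engine reuse the same data
    """
    for i in range(count):
        yield base + (i % cycle if cycle > 0 else i)

# An engine needing 80 items reads exactly 80 addresses from the cache:
addrs = list(address_sequence(base=0x100, count=80))

# A small cycle over data 1, data 2, data 3 (window of 3, repeated):
cycled = list(address_sequence(base=0x100, count=9, cycle=3))
```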
  • The data stream is stored through the first data stream storage module, and the flow of data to each data node, in parallel or in series among the multiple calculation engines (that is, the data flow path), is controlled so that data is processed in the calculation engines like a pipeline, improving data processing efficiency.
  • The first data stream storage module inputs data to the corresponding calculation engine through the global data flow network, and the calculation engine outputs its results to the first data stream storage module through the global data flow network; no instructions are needed for control, so there is no problem of a computing unit idling while a single instruction executes.
  • In some embodiments, processing the data to be processed through the target data flow network includes: reading the data to be processed into the first data stream storage module;
  • in the first data stream storage module, generating an address sequence for the data to be processed according to a preset generation rule, based on its data format and data path;
  • and, in each clock cycle, reading from the first data stream storage module, according to the address sequence, the amount of data corresponding to a calculation engine in the target data stream network for input, while obtaining the states of the first data stream storage module and the calculation engine.
  • As noted above, the first data stream storage module may be a cache, DDR, or high-speed-access DDR, preferably a cache provided with a controllable read/write address generation unit.
  • The address generation unit generates an adapted address sequence to index the data in the cache; the address sequence is used to input data from the cache to the corresponding calculation engine, for example reading the 80 data items addressed by the sequence when a calculation engine requires 80 data items.
  • The address generation unit may also be provided with a counter so that the generated address sequence has different cycle sizes, for example a small cycle of data 1, data 2, data 3, which improves data reusability and adapts to the data processing size of each calculation engine.
  • The state of the first data stream storage module includes a data-read-ready state and a data-write-complete state.
  • The state of the calculation engine includes whether its calculation is complete, whether the next calculation data needs to be read, and so on.
  • The state of the first data stream storage module can be obtained by monitoring the state of its data with a finite state machine, and the state of the calculation engine can be derived from the state of the first data stream storage module; for example, after a calculation result is written into the first data stream storage module, the state of the calculation engine can be determined to be calculation-complete.
  • In each clock cycle, the states of every calculation engine and of the first data stream storage module are obtained, so behavior can be accurately predicted; through accurate calculation scheduling, the hardware can be driven at maximum efficiency, further improving data processing efficiency (see the sketch below).
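  • One way to picture this per-cycle monitoring: a finite state machine exposes whether the storage module is read-ready or write-complete, and the engine's state is derived from that observation; a scheduler polls both every clock cycle. The state names and policy below are assumptions made for illustration only:

```python
from enum import Enum, auto

class StorageState(Enum):
    READ_READY = auto()        # data prepared for reading
    WRITE_COMPLETE = auto()    # a calculation result has been written back

class EngineState(Enum):
    COMPUTING = auto()
    DONE = auto()              # derived once the result write is observed

def engine_state(storage: StorageState) -> EngineState:
    """Derive the calculation engine's state from the storage module's
    state: observing a completed write means the engine has finished."""
    if storage is StorageState.WRITE_COMPLETE:
        return EngineState.DONE
    return EngineState.COMPUTING

# Poll the states every clock cycle for accurate scheduling.
trace = [StorageState.READ_READY, StorageState.READ_READY,
         StorageState.WRITE_COMPLETE]
for clock, st in enumerate(trace):
    print(f"cycle {clock}: storage={st.name}, engine={engine_state(st).name}")
```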
  • In some embodiments, the target network configuration further includes a computing core, a second data stream storage unit, and a local data stream network connecting the computing core and the second data stream storage unit, and the configuration of the calculation engine includes: configuring the interconnection between the computing core and the local data stream network to obtain the computing path of the computing core; configuring the interconnection between the second data stream storage unit and the local data stream network to obtain a storage path;
  • and obtaining the calculation engine according to the computing path and the storage path.
  • The computing core, the second data stream storage module, and the local data stream network are the main components configured in the calculation engine.
  • The computing core may be a convolution core, a pooling core, an activation function core, or another core used for computation;
  • in addition, it should be noted that the computing core may also be called a calculation core, a computing unit, a computing module, and so on.
  • The second data stream storage module may be a storage module with data access capability, such as a cache, DDR, or high-speed DDR, and the second data stream storage module and the first data stream storage module may be different storage areas
  • on the same memory; for example, the second data stream storage module may be a second data buffer area in the cache, and the first data stream storage module may be a first data buffer area in the cache.
  • The local data stream network can be understood as the routing used inside the calculation engine to connect the computing cores with the second data stream storage module.
  • The connections among computing cores can be controlled by a network router.
  • The main function of the network router is to provide skip paths and feedback paths.
  • The local data stream network can be configured to form a flow path with the different computing cores available in the calculation engine.
  • The combination of the types and order of these computing cores along the flow path provides a continuous data processing pipeline for multiple layers in a deep learning neural network.
  • For example, if the combination of computing cores is a convolution core to a pooling core,
  • a convolutional neural network layer can be obtained;
  • if the combination is a deconvolution core to a pooling core to an activation function core, a deconvolutional neural network layer can be obtained.
  • The combination of computing core types and their order is determined by the target network configuration rule, as sketched below.
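  • Choosing the core types and their order along the flow path amounts to composing a per-layer pipeline. The sketch below uses simple functions as stand-ins for hardware cores; the arithmetic is a placeholder, not the cores' real operations:

```python
from typing import Callable, List

Core = Callable[[list], list]

def conv_core(data: list) -> list:     # placeholder for a convolution core
    return [3 * x for x in data]

def pool_core(data: list) -> list:     # placeholder for a pooling core
    return data[::2]                   # crude downsampling

def act_core(data: list) -> list:      # placeholder for an activation core
    return [max(0, x) for x in data]   # ReLU-like

def build_layer(cores: List[Core]) -> Core:
    """Wire cores along a flow path to form one network layer, e.g.
    convolution -> pooling -> activation gives a convolutional layer."""
    def layer(data: list) -> list:
        for core in cores:
            data = core(data)
        return data
    return layer

conv_layer = build_layer([conv_core, pool_core, act_core])
print(conv_layer([-2, -1, 0, 1, 2, 3]))   # [0, 0, 6]
```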
  • FIG. 4 is a schematic flowchart of a data stream-based deep network acceleration method provided by an embodiment of the application. As shown in FIG. 4, the method includes:
  • the aforementioned data to be processed may be data that can be processed through a deep network, such as image data to be identified, target data to be detected, target data to be tracked, and so on.
  • The target deep network information is the information of the deep network corresponding to the data to be processed. For example, if the data to be processed is image data to be recognized, the target deep network information is the configuration parameters of the deep network used for image recognition; if the data to be processed is target data to be detected, the target deep network information is the configuration parameters of the deep network used for target detection.
  • The target deep network information may be preset, and may be determined by matching against the data to be processed or selected manually, which is not limited here.
  • Obtaining the target deep network information can facilitate the configuration of the deep network.
  • the aforementioned deep network information may include network type, data type, number of layers, calculation type, and so on.
  • According to the target deep network information, a preset target network configuration rule corresponding to the target deep network information is matched, wherein the target network configuration rule includes a computing core, a second data stream storage module, and a local data stream network.
  • The target deep network information contains the network type, data type, number of layers, calculation type, and so on of the deep network required by the data to be processed.
  • The target network configuration rules can be set in advance, for example, the parameter rules and calculation rules of preset image recognition networks, target detection networks, target tracking networks, and other types of networks.
  • The parameter rules may be hyperparameter setting rules, weight setting rules, and so on; the calculation rules may be the calculation rules for addition, multiplication, convolution, deconvolution, and so on.
  • The configuration rules among the computing core, the second data stream storage module, and the local data stream network can be understood as the types and number of computing cores and the connection mode between the computing cores and the local data stream network.
  • The local data stream network can be configured through the control register.
  • The local network implementation may be routing between the second data stream storage module and the computing cores.
  • The connections among computing cores can be controlled by a network router.
  • The main function of the network router is to provide skip paths and feedback paths.
  • Configuring according to the target network configuration rule may mean establishing the connection relationship among the pre-configured computing cores, the second data stream storage module, and the local data stream network.
  • The connection relationship may include the types of computing cores, the number of connections, and the connection order; computing cores can be connected with the local data stream network through interconnection to form a new computing engine, that is, a data stream engine, and the data stream engines required by different deep networks can be formed from different computing core types, connection counts, and connection orders.
  • The target data stream engine obtained in this way processes the data to be processed. Because each computing core reads data through the second data stream storage module, the data in the second data stream storage module can be read into different computing cores to form a data stream.
  • For example, data to be multiplied is read into a multiplication core for multiplication, and data to be added is read into an addition core for addition. Because the data stream requires no instruction-set ordering, the configured data stream engine produces no computation gaps.
  • The target data stream engine is configured from the target network information and may also be called a customized data stream engine.
  • The target data stream engine connects the second data stream storage module and each computing core through the local data stream network to form a data stream; compared with an instruction-set implementation, there is no need to wait for the previous instruction's reads and writes to complete, which improves computational efficiency under the deep network architecture.
  • In these embodiments, the target deep network information required by the data to be processed is acquired; according to that information, a preset target network configuration rule is matched, wherein the rule
  • includes a computing core, a second data stream storage module, and a local data stream network; a target data stream engine is configured according to the rule; and the data to be processed is processed by the target data stream engine.
  • Accelerating the deep network through data flow reduces off-chip data communication, so there is no instruction idle overhead, which improves the hardware acceleration efficiency of the deep network; moreover, through network configuration, the computing engines required by different deep network models can be configured, supporting the calculation engines required by a variety of deep network models.
  • FIG. 5 is a schematic flowchart of another data stream-based deep network acceleration method provided by an embodiment of the application. As shown in FIG. 5, the method includes:
  • According to the target deep network information, a preset target network configuration rule corresponding to the target deep network information is matched, wherein the target network configuration rule includes a computing core, a second data stream storage module, and a local data stream network;
  • The computing core, the second data stream storage module, and the local data stream network are the main components configured in the data stream engine.
  • The computing core may be a convolution core, a pooling core, an activation function core, or another core used for computation;
  • the computing core may also be called a calculation core, a computing unit, a computing module, and so on.
  • The second data stream storage module may be a storage module with data access capability, such as a cache, DDR, or high-speed DDR, and the second data stream storage module and the first data stream storage module may be different storage areas
  • on the same memory; for example, the second data stream storage module may be a second data buffer area in the cache, and the first data stream storage module may be a first data buffer area in the cache.
  • The local data stream network can be understood as the routing used inside the calculation engine to connect the computing cores with the second data stream storage module.
  • The connections among computing cores can be controlled by a network router.
  • The main function of the network router is to provide skip paths and feedback paths.
  • The local data stream network can be configured to form a flow path with the different computing cores available in the calculation engine.
  • The combination of the types and order of these computing cores along the flow path provides a continuous data processing pipeline for multiple layers in a deep learning neural network.
  • For example, if the combination of computing cores is a convolution core to a pooling core,
  • a convolutional neural network layer can be obtained;
  • if the combination is a deconvolution core to a pooling core to an activation function core, a deconvolutional neural network layer can be obtained.
  • The combination of computing core types and their order is determined by the target network configuration rule.
  • In this way, the calculation of the computing engine is accelerated, further improving the data processing efficiency of the deep network.
  • In some embodiments, processing the data to be processed through the target data stream engine includes: reading the data to be processed into the second data stream storage module;
  • in the second data stream storage module, generating an address sequence for the data to be processed according to a preset generation rule, based on its data format and data path;
  • and, in each clock cycle, reading from the second data stream storage module, according to the address sequence, the amount of data corresponding to a computing core in the target data stream engine for input, while obtaining the states of the second data stream storage module and the computing core.
  • The second data stream storage module may be a cache, DDR, or high-speed-access DDR;
  • a cache is preferred.
  • A controllable read/write address generation unit may be provided in the cache.
  • The address generation unit generates an adapted address sequence to index the data in the cache; the address sequence can be used to input data from the cache to the corresponding computing core. For example, if a computing core requires 80 data items for a calculation, the 80 data items addressed by the sequence are read from the cache into the computing core.
  • The address generation unit may also be provided with a counter so that the generated address sequence has different cycle sizes, for example a small cycle of data 1, data 2, data 3; this improves data reusability and also adapts to the data processing size of each computing core.
  • The state of the second data stream storage module includes a data-read-ready state and a data-write-complete state.
  • The state of the computing core includes whether its calculation is complete and whether the next calculation data needs to be read.
  • The state of the second data stream storage module can be obtained by monitoring it with a finite state machine, and the state of the computing core can be derived from the state of the second data stream storage module; for example, after a calculation result is written into the second data stream storage module, the state of the computing core can be determined to be calculation-complete.
  • In each clock cycle, the states of every computing core and of the second data stream storage module can be obtained, so behavior can be accurately predicted; through accurate calculation scheduling, the hardware can be driven at maximum efficiency, further improving data processing efficiency.
  • In some embodiments, the second data stream storage module includes a first storage unit and a second storage unit,
  • and processing the data to be processed through the target data stream engine includes: inputting the data in the first storage unit into a computing core to obtain a calculation result;
  • and storing the calculation result in the second storage unit as the input data of the next computing core.
  • The first storage unit may be an input data stream storage unit,
  • and the second storage unit may be an output data stream storage unit.
  • The first storage unit and the second storage unit are used for alternating data stream access: the first storage unit inputs data into the computing core for calculation, and the computing core outputs the calculation result to the second storage unit for storage. This prevents the conflict in which the first storage unit is feeding data to the computing core while the core's output cannot be written back into the first storage unit.
  • For example, when the computing core needs to perform two calculations on a piece of data in the first storage unit, after the first calculation is completed, the computing core needs to read the data from the first storage unit a second time.
  • FIG. 6 is a schematic diagram of a data stream-based deep network acceleration device provided by an embodiment of the application. As shown in FIG. 6, the device includes:
  • the first obtaining module 601 is configured to obtain target deep network information required by the data to be processed
  • the first matching module 602 is configured to match a preset target network configuration rule corresponding to the target deep network information according to the target deep network information, wherein the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network;
  • the first configuration module 603 is configured to configure the target data flow network according to the target network configuration rule
  • the first processing module 604 is configured to process the data to be processed through the target data stream network.
  • the first configuration module 603 includes:
  • the global configuration sub-module is used to configure parallel or serial between multiple calculation engines according to the global data flow network
  • a path configuration submodule configured to obtain the data flow paths of the multiple calculation engines according to the parallel or serial between the first data flow storage module and the multiple calculation engines;
  • a forming sub-module is used to form the target data flow network based on the data flow path.
  • the first processing module 604 includes:
  • the first acquisition submodule is configured to read the to-be-processed data into the first data stream storage module
  • the first data address generation sub-module is configured to generate an address sequence for the data to be processed according to a preset generation rule in the first data stream storage module according to the data format and data path of the data to be processed;
  • the first input sub-module is used for each clock cycle to read from the first data stream storage module according to the address sequence the data volume corresponding to the calculation engine in the target data stream network for input, and obtain the first The state of the data stream storage module and calculation engine.
  • the target network configuration further includes a computing core, a second data stream storage unit, and a local data stream network connecting the computing core and the second buffer
  • the first configuration module 603 further includes:
  • the first local configuration sub-module is used to configure the interconnection between the computing core and the local data stream network to obtain the computing path of the computing core;
  • the first local path sub-module is configured to configure the interconnection between the second data stream storage unit and the local data stream network to obtain a storage path;
  • the first engine module is used to obtain the calculation engine according to the calculation path and the storage path.
  • FIG. 7 is a schematic diagram of a data stream-based deep network acceleration device provided by an embodiment of the application. As shown in FIG. 7, the device includes:
  • the second obtaining module 701 is used to obtain target deep network information required by the data to be processed
  • the second matching module 702 is configured to match a preset target network configuration rule corresponding to the target deep network information according to the target deep network information, wherein the target network configuration rule includes a computing core and a second data stream Storage module and local data flow network;
  • the second configuration module 703 is configured to configure and obtain the target data flow engine according to the target network configuration rule
  • the second processing module 704 is configured to process the to-be-processed data through the target data flow engine.
  • the second configuration module 703 includes:
  • the second local configuration sub-module is used to configure the interconnection between the computing core and the local data stream network to obtain the computing path of the computing core;
  • the second local path sub-module is configured to configure the interconnection between the second data stream storage module and the local data stream network to obtain a storage path;
  • the second engine module is used to obtain the target data flow engine according to the calculation path and the storage path.
  • the second processing module 704 includes:
  • the second acquisition sub-module is configured to read the to-be-processed data into the second data stream storage module
  • the second data address generation sub-module is configured to generate an address sequence for the data to be processed according to a preset generation rule in the second data stream storage module according to the data format and data path of the data to be processed;
  • the second input sub-module is used for each clock cycle to read from the second data stream storage module according to the address sequence the data volume corresponding to the computing core in the target data stream engine for input, and obtain the second The state of the data stream storage module and the computing core.
  • the second processing module 704 includes:
  • the input calculation sub-module is used to input the data in the first storage unit into the calculation core to obtain the calculation result;
  • the output storage submodule is used to store the calculation result in the second storage unit as input data for the next calculation core.
  • An embodiment of the present application provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and capable of running on the processor.
  • When the processor executes the computer program, the steps in the data stream-based deep network acceleration method provided in the embodiments of this application are implemented.
  • An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed by a processor, the steps in the data stream-based deep network acceleration method are implemented.
  • The processors and chips in the various embodiments of the present application may be integrated into one processing unit, may exist alone physically, or two or more pieces of hardware may be integrated into one unit.
  • The computer-readable program may be stored in a computer-readable memory.
  • the technical solution of the present application essentially or the part that contributes to the prior art or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a memory, A number of instructions are included to enable a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method described in each embodiment of the present application.
  • the aforementioned memory includes: U disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), mobile hard disk, magnetic disk or optical disk and other various media that can store program codes.
  • The program can be stored in a computer-readable memory, and the memory can include: a flash disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and so on.
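  • Putting the pieces together, the end-to-end flow of the first aspect (obtain the target network information, match a preset rule, configure the data stream network, process the data) could look like the following toy sketch. Every name and the rule table are hypothetical; the stages reuse the placeholder arithmetic from the sketches above.

```python
def accelerate(data: list, target_info: dict) -> list:
    """Toy end-to-end version of the first-aspect method."""
    # Step 2: match the preset target network configuration rule
    # (step 1, obtaining target_info, is done by the caller).
    rules = {"image_recognition": ["conv", "pool", "activation"]}
    core_order = rules[target_info["network_type"]]
    # Step 3: configure the target data stream network by wiring stand-in
    # stages in the order the matched rule prescribes.
    stages = {"conv": lambda d: [3 * x for x in d],
              "pool": lambda d: d[::2],
              "activation": lambda d: [max(0, x) for x in d]}
    pipeline = [stages[name] for name in core_order]
    # Step 4: process the data through the configured network; the data
    # streams through every stage with no instruction-ordering overhead.
    for stage in pipeline:
        data = stage(data)
    return data

print(accelerate([-2, -1, 0, 1, 2, 3], {"network_type": "image_recognition"}))
```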

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present application provides a deep network acceleration method and apparatus based on a data stream, a device, and a storage medium. The method comprises: obtaining target deep network information required by data to be processed; according to the target deep network information, matching a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises configuration rules between a pre-configured calculation engine, a first data stream storage module, and a global data stream network; configuring to obtain a target data stream network according to the target network configuration rule; and processing said data by means of the target data stream network. A deep network is accelerated by means of the data stream and off-chip data communication is reduced, so instruction idle overhead is avoided and the hardware acceleration efficiency of the deep network can be improved; moreover, different deep network models can be configured by performing network configuration, and multiple different deep network models are supported.

Description

基于数据流的深度网络加速方法、装置、设备及存储介质Data stream-based deep network acceleration method, device, equipment and storage medium 技术领域Technical field
本申请涉及人工智能领域,更具体的说,是涉及一种基于数据流的深度网络加速方法、装置、设备及存储介质。This application relates to the field of artificial intelligence, and more specifically, to a data stream-based deep network acceleration method, device, device, and storage medium.
背景技术Background technique
基于神经网络的深度学习应用程序的进步要求底层硬件平台具有高处理能力。当基于CPU的平台无法满足这种不断增长的需求时,许多公司开发了专用硬件加速器来支持该领域的进步。现有的硬件加速器的共同想法是加速在深度学习算法应用中更频繁使用的某些特定类型的计算。现有的硬件架构基于具有可扩展指令集的指令执行,然后通过将常用计算实现为定制指令来实现加速。基于指令的架构实现通常表示为片上系统(SoC)设计。在基于指令的体系结构中,许多时钟周期被浪费用于非计算相关操作。为了支持更通用的指令体系结构,深度学习神经网络内的计算通常被分解为多个指令。因此一个计算通常需要多个时钟周期。处理器中的算术和逻辑单元(ALU)通常是以硬件实现的不同操作的集合。由于有限的指令表达式和有限的I/O带宽,大多数ALU资源在执行单个指令时处于空闲状态,比如,在做乘法与加法时,会先读取乘法的数据,由于I/O速度受带宽影响,使得加法需要等待乘法计算完成并写入存储器中,再读取出计算结果及加法数据进行加法计算,在乘法计算与读写过程中,加法计算单元是空闲状态。因此存在基于指令的硬件加速效率低的问题。The advancement of neural network-based deep learning applications requires high processing capabilities on the underlying hardware platform. When CPU-based platforms cannot meet this growing demand, many companies have developed dedicated hardware accelerators to support advancements in this field. The common idea of existing hardware accelerators is to accelerate certain types of calculations that are used more frequently in deep learning algorithm applications. The existing hardware architecture is based on the execution of instructions with an extensible instruction set, and then realizes acceleration by implementing common calculations as customized instructions. The instruction-based architecture implementation is usually expressed as a system-on-chip (SoC) design. In an instruction-based architecture, many clock cycles are wasted for non-computation related operations. In order to support a more general instruction architecture, calculations in deep learning neural networks are usually decomposed into multiple instructions. Therefore, a calculation usually requires multiple clock cycles. The arithmetic and logic unit (ALU) in the processor is usually a collection of different operations implemented in hardware. Due to limited instruction expressions and limited I/O bandwidth, most ALU resources are idle when executing a single instruction. For example, when doing multiplication and addition, the data of the multiplication will be read first, because the I/O speed is affected. Bandwidth affects, so that addition needs to wait for the multiplication calculation to be completed and write it into the memory, and then read the calculation result and the addition data for the addition calculation. During the multiplication calculation and reading and writing, the addition calculation unit is idle. Therefore, there is a problem of low efficiency of instruction-based hardware acceleration.
申请内容Application content
The purpose of this application is to address the above-mentioned defects in the prior art by providing a data stream-based deep network acceleration method, apparatus, device, and storage medium, which solves the problem that, with limited instruction expressions and limited I/O bandwidth, most ALU resources sit idle while a single instruction executes, so that acceleration efficiency is low.

The purpose of this application is achieved through the following technical solutions:

In a first aspect, a data stream-based deep network acceleration method is provided. The method includes:

obtaining target deep network information required by data to be processed;

matching, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, where the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network;

configuring a target data stream network according to the target network configuration rule; and

processing the data to be processed through the target data stream network.

Optionally, configuring the target data stream network according to the target network configuration rule includes:

configuring multiple calculation engines in parallel or in series according to the global data stream network;

obtaining data stream paths of the multiple calculation engines according to the first data stream storage module and the parallel or serial arrangement of the multiple calculation engines; and

forming the target data stream network based on the data stream paths.

Optionally, processing the data to be processed through the target data stream network includes:

reading the data to be processed into the first data stream storage module;

generating, in the first data stream storage module, an address sequence for the data to be processed according to a preset generation rule, based on the data format and data path of the data to be processed; and

in each clock cycle, reading from the first data stream storage module, according to the address sequence, the amount of data corresponding to a calculation engine in the target data stream network as input, and acquiring the states of the first data stream storage module and the calculation engine.

Optionally, the target network configuration further includes a calculation core, a second data stream storage unit, and a local data stream network connecting the calculation core and the second data stream storage unit, and the configuration of the calculation engine includes:

configuring the interconnection between the calculation core and the local data stream network to obtain a calculation path of the calculation core;

configuring the interconnection between the second data stream storage unit and the local data stream network to obtain a storage path; and

obtaining the calculation engine according to the calculation path and the storage path.
In a second aspect, a data stream-based deep network acceleration method is further provided. The method includes:

obtaining target deep network information required by data to be processed;

matching, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, where the target network configuration rule includes a calculation core, a second data stream storage module, and a local data stream network;

configuring a target data stream engine according to the target network configuration rule; and

processing the data to be processed through the target data stream engine.

Optionally, configuring the target data stream engine according to the target network configuration rule includes:

configuring the interconnection between the calculation core and the local data stream network to obtain a calculation path of the calculation core;

configuring the interconnection between the second data stream storage module and the local data stream network to obtain a storage path; and

obtaining the target data stream engine according to the calculation path and the storage path.

Optionally, processing the data to be processed through the target data stream engine includes:

reading the data to be processed into the second data stream storage module;

generating, in the second data stream storage module, an address sequence for the data to be processed according to a preset generation rule, based on the data format and data path of the data to be processed; and

in each clock cycle, reading from the second data stream storage module, according to the address sequence, the amount of data corresponding to a calculation core in the target data stream engine as input, and acquiring the states of the second data stream storage module and the calculation core.

Optionally, the second data stream storage module includes a first storage unit and a second storage unit, and processing the data to be processed through the target data stream engine includes:

inputting data in the first storage unit into a calculation core to obtain a calculation result; and

storing the calculation result in the second storage unit as input data for the next calculation core.
In a third aspect, a data stream-based deep network acceleration apparatus is further provided. The apparatus includes:

a first acquisition module, configured to obtain target deep network information required by data to be processed;

a first matching module, configured to match, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, where the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network;

a first configuration module, configured to configure a target data stream network according to the target network configuration rule; and

a first processing module, configured to process the data to be processed through the target data stream network.

In a fourth aspect, a data stream-based deep network acceleration apparatus is further provided. The apparatus includes:

a second acquisition module, configured to obtain target deep network information required by data to be processed;

a second matching module, configured to match, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, where the target network configuration rule includes a calculation core, a second data stream storage module, and a local data stream network;

a second configuration module, configured to configure a target data stream engine according to the target network configuration rule; and

a second processing module, configured to process the data to be processed through the target data stream engine.

In a fifth aspect, an electronic device is provided, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the steps in the data stream-based deep network acceleration method provided by the embodiments of this application.

In a sixth aspect, a computer-readable storage medium is provided, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the data stream-based deep network acceleration method provided by the embodiments of this application.

Beneficial effects of this application: the deep network is accelerated through data streams and off-chip data communication is reduced, so there is no instruction idle overhead and the hardware acceleration efficiency of the deep network can be improved; moreover, through network configuration, different deep network models can be configured, supporting a variety of deep network models.
Description of the drawings

FIG. 1 is a schematic diagram of an optional implementation architecture for a data stream-based deep network acceleration method provided by an embodiment of this application;

FIG. 2 is a schematic flowchart of a data stream-based deep network acceleration method provided by the first aspect of the embodiments of this application;

FIG. 3 is a schematic flowchart of another data stream-based deep network acceleration method provided by an embodiment of this application;

FIG. 4 is a schematic flowchart of a data stream-based deep network acceleration method provided by the second aspect of the embodiments of this application;

FIG. 5 is a schematic flowchart of another data stream-based deep network acceleration method provided by an embodiment of this application;

FIG. 6 is a schematic diagram of a data stream-based deep network acceleration apparatus provided by the third aspect of the embodiments of this application;

FIG. 7 is a schematic diagram of a data stream-based deep network acceleration apparatus provided by the fourth aspect of the embodiments of this application.
Detailed description

Preferred embodiments of this application are described below. Based on the following description, those of ordinary skill in the art will be able to implement them using related technologies in the field and will better understand the innovations and benefits of this application.
To further describe the technical solution of this application, please refer to FIG. 1, which is a schematic diagram of an optional implementation architecture for a data stream-based deep network acceleration method provided by an embodiment of this application. As shown in FIG. 1, an architecture 103 is connected to an off-chip storage module (DDR) 101 and a CPU through interconnects. The architecture 103 includes a first storage module 104, a global data stream network 105, and data stream engines 106. The first storage module 104 is connected through interconnects both to the off-chip storage module 101 and to the global data stream network 105, and each data stream engine 106 is connected to the global data stream network 105 through interconnects so that the data stream engines 106 can operate in parallel or in series. A data stream engine 106 may include calculation cores (also called calculation modules), a second storage module 108, and a local data stream network 107. The calculation cores may include kernels used for calculation, such as a convolution core 109, a pooling core 110, and an activation function core 111; of course, other calculation cores beyond these examples may also be included, which is not limited here, and all kernels used for calculation in a deep network may be included. The first storage module 104 and the second storage module 108 may be on-chip cache modules, or DDR or high-speed DDR storage modules. The data stream engine 106 can be understood as a calculation engine that supports, or is dedicated to, data stream processing. The CPU may include a control register, which is pre-configured with network configuration rules for configuring the network.
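To make the topology of FIG. 1 concrete, the following is a minimal software sketch of the architecture, not the actual hardware implementation; all class and field names (Architecture, DataStreamEngine, and so on) are illustrative assumptions rather than identifiers defined in this application.

    # Minimal sketch of the FIG. 1 topology; names are illustrative only.
    class ComputeCore:
        def __init__(self, kind):
            self.kind = kind  # e.g. "conv", "pool", "activation"

    class DataStreamEngine:
        def __init__(self, cores):
            self.cores = cores           # calculation cores inside the engine
            self.second_storage = []     # second storage module 108 (buffer)
            self.local_network = {}      # local data stream network 107 (routes)

    class Architecture:
        def __init__(self, num_engines):
            self.first_storage = []      # first storage module 104
            self.global_network = {}     # global data stream network 105 (routers)
            self.engines = [DataStreamEngine([ComputeCore("conv"),
                                              ComputeCore("pool"),
                                              ComputeCore("activation")])
                            for _ in range(num_engines)]

    arch = Architecture(num_engines=2)   # connected off-chip to DDR 101 and a CPU
    print(len(arch.engines), [c.kind for c in arch.engines[0].cores])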
It should be noted that the deep network in this application may also be called a deep learning network, a deep learning neural network, and the like.

This application provides a data stream-based deep network acceleration method, apparatus, device, and storage medium.

The purpose of this application is achieved through the following technical solutions:

In the first aspect, please refer to FIG. 2, which is a schematic flowchart of a data stream-based deep network acceleration method provided by an embodiment of this application. As shown in FIG. 2, the method includes the following steps:

201. Obtain target deep network information required by data to be processed.

In this step, the data to be processed may be data that can be processed through a deep network, such as image data to be recognized, target data to be detected, or target data to be tracked. The target deep network information corresponds to the deep network required by the data to be processed: for example, if the data to be processed is image data to be recognized, the target deep network information is the configuration parameters of a deep network used for image recognition; if the data to be processed is target data to be detected, the target deep network information is the configuration parameters of a deep network used for target detection. The target deep network information may be preset and determined by matching against the data to be processed, or may be selected manually, which is not limited here. Obtaining the target deep network information facilitates configuring the deep network. The deep network information may include the network type, data type, number of layers, calculation type, and so on.
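As a loose illustration of the kind of record the target deep network information could be, consider the sketch below; the field names and the mapping from workload to network information are hypothetical examples, not values defined by this application.

    # Hypothetical records of target deep network information, keyed by task.
    target_info_by_task = {
        "image_recognition": {"network_type": "cnn_classifier",
                              "data_type": "int8", "num_layers": 18,
                              "calc_types": ["conv", "pool", "activation"]},
        "target_detection":  {"network_type": "cnn_detector",
                              "data_type": "int8", "num_layers": 24,
                              "calc_types": ["conv", "pool", "activation"]},
    }

    def get_target_info(task):
        # Match the data to be processed (here, its task label) to preset info.
        return target_info_by_task[task]

    print(get_target_info("image_recognition")["num_layers"])  # 18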
202. Match, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, where the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network.

The target deep network information already contains the network type, data type, number of layers, calculation type, and so on of the deep network required by the data to be processed. The target network configuration rules may be set in advance, for example as the parameter rules and calculation rules of preset networks such as image recognition networks, target detection networks, and target tracking networks. The parameter rules may be hyperparameter setting rules, weight setting rules, and the like; the calculation rules may be calculation rules for addition, multiplication, convolution, deconvolution, and so on. The configuration rules among the pre-configured calculation engines, the first data stream storage module, and the global data stream network can be understood as the number of calculation engines, the connection mode between the calculation engines and the global data stream network, the connection mode between the first data stream storage module and the global data stream network, the routing connections within the global data stream network, and so on. The global data stream network can be configured by the control register. The network may be implemented as routers between the first data stream storage module and the calculation engines. When multiple calculation engines are instantiated in a single architecture, the global data stream network can be configured to send different data to different calculation engines for data parallelism, or to link the calculation engines in series through their inputs and outputs into a longer calculation pipeline, in which more neural network layers can be processed.
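A minimal sketch of matching network information to a preset configuration rule might look as follows; the rule contents (engine count, serial/parallel mode, core order) are assumptions chosen purely for illustration.

    # Preset configuration rules keyed by network type (illustrative values).
    config_rules = {
        "cnn_classifier": {"num_engines": 2, "engine_mode": "serial",
                           "core_order": ["conv", "pool", "activation"]},
        "cnn_detector":   {"num_engines": 4, "engine_mode": "parallel",
                           "core_order": ["conv", "pool", "activation"]},
    }

    def match_rule(target_info):
        # Pick the preset rule corresponding to the target deep network info.
        return config_rules[target_info["network_type"]]

    print(match_rule({"network_type": "cnn_detector"})["num_engines"])  # 4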
In a possible embodiment, the first data stream storage module may include two data stream storage units, one for input and one for output, used for data access: the input data stream storage unit feeds input data into the calculation engine for calculation, and the calculation engine writes the calculation result to the output data stream storage unit for storage. This avoids the situation where, while the input data stream storage unit is feeding data to the calculation engine, the engine's output cannot be written back into that same input unit. For example, suppose the calculation engine needs to compute on a piece of data from the input data stream storage unit twice: after the first calculation is completed, the engine needs to read that data from the input unit a second time. Normally it would have to wait for the first calculation result to be stored back into the input data stream storage unit before reading the data again; with a separate output data stream storage unit, the first result can be stored to the output unit while the data is read for the second time, with no waiting, which improves the efficiency of data processing.
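The benefit of separate input and output stream units can be sketched as below, assuming a simple engine that reads each operand twice; the queue-based model is an assumption for illustration, not the hardware design.

    from collections import deque

    # Separate input and output stream units: the engine can re-read its
    # input while earlier results drain into the output unit.
    input_unit = deque([3, 5, 7])   # input data stream storage unit
    output_unit = deque()           # output data stream storage unit

    for x in input_unit:            # the second read of x needs no waiting,
        y1 = x * x                  # first pass over the datum
        output_unit.append(y1)      # result goes to the output unit...
        y2 = x + y1                 # ...while x is immediately read again
        output_unit.append(y2)

    print(list(output_unit))        # [9, 12, 25, 30, 49, 56]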
203. Configure the target data stream network according to the target network configuration rule.

Implementing the target network configuration rule may amount to establishing the connection relationships among the pre-configured calculation engines, the first data stream storage module, and the global data stream network. The connection relationships may include the number of connected calculation engines, their connection order, and so on. The calculation engines can be connected to the global data stream network through interconnects to form a new deep network, and different numbers and orders of calculation engine connections form different deep networks. Configuring according to the target network configuration rule yields the target data stream network used to process the data to be processed. Since each calculation engine reads data through the first data stream storage module, the data in the first data stream storage module can be read into different calculation engines to form data streams, and no instruction-set ordering is required, so the configured calculation engines produce no idle calculation slots.

204. Process the data to be processed through the target data stream network.

The target data stream network is configured through the target network information and may also be called a customized data stream network. It connects the first data stream storage module and the calculation engines through the global data stream network to form data streams; compared with an instruction set, there is no need to wait for the reads and writes of a previous instruction to complete, so calculation under the deep network architecture is efficient.

In this embodiment, target deep network information required by data to be processed is obtained; according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information is matched, where the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network; a target data stream network is configured according to the target network configuration rule; and the data to be processed is processed through the target data stream network. Accelerating the deep network through data streams reduces off-chip data communication, so there is no instruction idle overhead and the hardware acceleration efficiency of the deep network can be improved; moreover, through network configuration, different deep network models can be configured, supporting a variety of deep network models.

It should be noted that the data stream-based deep network acceleration method provided by the embodiments of this application can be applied to devices that perform data stream-based deep network acceleration, such as computers, servers, and mobile phones.

Please refer to FIG. 3, which is a schematic flowchart of another data stream-based deep network acceleration method provided by an embodiment of this application. As shown in FIG. 3, the method includes the following steps:

301. Obtain target deep network information required by data to be processed.

302. Match, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, where the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network.

303. Configure multiple calculation engines in parallel or in series according to the global data stream network.

In this step, the global data stream network may be implemented by routing and may be configured by the control register, in which corresponding global data stream network configuration rules are preset. The network is implemented as routers between the first data stream storage module and each calculation engine; the main function of these network routers is to provide skip paths and feedback paths for the data streams between the calculation engines. The parallel or serial arrangement of multiple calculation engines can be configured through the data streams. For example, when calculation engine A and calculation engine B are in parallel on the global data stream network, the data stream flows to both at the same time, realizing parallel processing of the data; when they are in series, the data stream can first be calculated in calculation engine A, and the calculation result then flows to calculation engine B, which can be understood as deepening the calculation layers of the deep network. Concretely, the configuration may control the data flow direction through the global data stream network, thereby realizing the parallel or serial configuration of multiple calculation engines. This can be achieved by configuring the interconnection between the global data stream network and the calculation engines; for example, multiple calculation engines may be interconnected with the global data stream network according to parallel rules or according to serial rules, while the first data stream storage module is configured to interconnect with the global data stream network.
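The serial and parallel engine arrangements described above can be modeled with plain function composition; the sketch below treats each engine as a function over a block of data, which is a deliberate simplification of the routed hardware network.

    # Each "engine" is modeled as a function over a block of data.
    def engine_a(block):
        return [v * 2 for v in block]       # stand-in for one engine's work

    def engine_b(block):
        return [v + 1 for v in block]

    def run_serial(block):
        # Serial: A's output streams into B, deepening the pipeline.
        return engine_b(engine_a(block))

    def run_parallel(block_1, block_2):
        # Parallel: the global network sends different data to each engine.
        return engine_a(block_1), engine_b(block_2)

    print(run_serial([1, 2, 3]))            # [3, 5, 7]
    print(run_parallel([1, 2], [3, 4]))     # ([2, 4], [4, 5])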
304. Obtain data stream paths of the multiple calculation engines according to the first data stream storage module and the parallel or serial arrangement of the multiple calculation engines.

In this step, the first data stream storage module may be a cache, DDR, or high-speed-access DDR; in the embodiments of this application it is preferably a cache, specifically one provided with a controllable read/write address generation unit. Depending on the input data format and the calculations required in the data path, the address generation unit generates an adapted address sequence to index the data in the cache. The address sequence can be used to index the data in the cache for input to the corresponding calculation engine; for example, if a calculation engine needs 80 pieces of data for a calculation, 80 pieces of data corresponding to the address sequence are read from the cache into that engine. In addition, the address generation unit can use counters so that the generated address sequence has different loop sizes, for example a small loop over data 1, data 2, and data 3, which improves data reuse and also adapts to the data processing size of each calculation engine. The first data stream storage module stores the data stream and directs it to each data node in the parallel or serial arrangement of the multiple calculation engines, that is, along the data stream path, so that data processing moves through the calculation engines like a pipeline, improving the efficiency of data processing.
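The counter-controlled address generation unit can be sketched as a generator, as below; the loop size, base address, and burst length are example parameters, not values specified by this application.

    def address_sequence(base, loop_size, repeats):
        """Yield addresses that cycle over a small window to improve reuse,
        e.g. loop_size=3 cycles over data 1, data 2, data 3."""
        for _ in range(repeats):
            for offset in range(loop_size):
                yield base + offset

    def read_for_engine(cache, addresses, amount=80):
        # Read exactly the amount of data one engine needs per calculation.
        return [cache[a] for _, a in zip(range(amount), addresses)]

    cache = list(range(1024))                                # stand-in buffer
    seq = address_sequence(base=0, loop_size=3, repeats=40)  # up to 120 addrs
    block = read_for_engine(cache, seq, amount=80)
    print(block[:6])                                         # [0, 1, 2, 0, 1, 2]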
305. Form the target data stream network based on the data stream paths.

In this step, the first data stream storage module inputs data into the corresponding calculation engines through the global data stream network, and the calculation engines output calculation results to the first data stream storage module through the global data stream network. No instructions are needed for control, so there is no problem of a calculation unit sitting idle while a single instruction executes.

306. Process the data to be processed through the target data stream network.

In this embodiment, the first data stream storage module stores the data stream and directs it to each data node in the parallel or serial arrangement of the multiple calculation engines, that is, along the data stream path, so that data processing moves through the calculation engines like a pipeline, improving the efficiency of data processing.

Optionally, processing the data to be processed through the target data stream network includes:

reading the data to be processed into the first data stream storage module;

generating, in the first data stream storage module, an address sequence for the data to be processed according to a preset generation rule, based on the data format and data path of the data to be processed; and

in each clock cycle, reading from the first data stream storage module, according to the address sequence, the amount of data corresponding to a calculation engine in the target data stream network as input, and acquiring the states of the first data stream storage module and the calculation engine.

In this embodiment, the first data stream storage module may be a cache, DDR, or high-speed-access DDR; in the embodiments of this application it is preferably a cache, specifically one provided with a controllable read/write address generation unit. Depending on the input data format and the calculations required in the data path, the address generation unit generates an adapted address sequence to index the data in the cache for input to the corresponding calculation engine; for example, if a calculation engine needs 80 pieces of data for a calculation, 80 pieces of data corresponding to the address sequence are read from the cache into that engine. The address generation unit can also use counters so that the generated address sequence has different loop sizes, for example a small loop over data 1, data 2, and data 3, which improves data reuse and adapts to the data processing size of each calculation engine. The states of the first data stream storage module include a data-read-ready state and a data-write-complete state; the states of the calculation engine include whether the calculation is completed, whether the next batch of calculation data needs to be read, and so on. The data in the first data stream storage module can be monitored with a finite state machine to obtain the state of the first data stream storage module, and the state of the calculation engine can be derived from the state of the first data stream storage module; for example, after a calculation result is written into the first data stream storage module, the state of the calculation engine can be determined to be calculation-completed.

In each clock cycle, the states of each calculation engine and of the first data stream storage module are acquired, so behavior can be predicted accurately, and hardware performance can be optimized for maximum efficiency through precise calculation scheduling, further improving the efficiency of data processing.
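A toy model of this per-cycle state tracking with a finite state machine is given below; the state names and the way the engine state is derived are illustrative assumptions, not the hardware state encoding.

    # Illustrative per-cycle status polling of buffer and engine.
    READY, BUSY, DONE = "read_ready", "computing", "write_complete"

    class BufferFSM:
        def __init__(self):
            self.state = READY

        def on_result_written(self):
            self.state = DONE     # the result landed in the storage module

    def engine_state(buffer_fsm):
        # The engine's state is derived from the buffer's state: once the
        # result is written back, the calculation is known to be complete.
        return "calculation_complete" if buffer_fsm.state == DONE else BUSY

    fsm = BufferFSM()
    for cycle in range(3):                 # each clock cycle: poll both states
        if cycle == 2:
            fsm.on_result_written()
        print(cycle, fsm.state, engine_state(fsm))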
Optionally, the target network configuration further includes a calculation core, a second data stream storage unit, and a local data stream network connecting the calculation core and the second data stream storage unit, and the configuration of the calculation engine includes:

configuring the interconnection between the calculation core and the local data stream network to obtain a calculation path of the calculation core;

configuring the interconnection between the second data stream storage unit and the local data stream network to obtain a storage path; and

obtaining the calculation engine according to the calculation path and the storage path.

In this embodiment, the calculation cores, the second data stream storage module, and the local data stream network are the main components of the calculation engine. The calculation cores may be kernels with calculation capability such as convolution cores, pooling cores, and activation function cores; it should be noted that a calculation core may also be called a calculation kernel, a calculation unit, a calculation module, and so on. The second data stream storage module may be a storage module with data access functions, such as a cache, DDR, or high-speed DDR. The second data stream storage module and the first data stream storage module may be different storage areas on the same memory; for example, the second data stream storage module may be a second data buffer area in a cache, and the first data stream storage module a first data buffer area in the same cache. The local data stream network can be understood as the routing within the calculation engine that connects the calculation cores with the second data stream storage module. For example, the connections between calculation cores may be controlled by network routers, whose main function is to provide skip paths and feedback paths. By setting the control register, the local data stream network can be configured to form stream paths over the different calculation cores available in the calculation engine. The combination of the types and order of the calculation cores along a stream path provides a continuous data processing pipeline for multiple layers of a deep learning neural network; for example, following the data flow direction, a combination of convolution core to pooling core to activation function core yields a convolutional neural network layer, while a combination of deconvolution core to pooling core to activation function core yields a deconvolutional neural network layer, and so on. It should be noted that the combination of the types and order of the calculation cores is determined by the target network configuration rule. By forming data streams between the calculation cores, the calculation of the calculation engine can be accelerated, further improving the data processing efficiency of the deep network.
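The composition of calculation cores into a layer pipeline along a stream path can be sketched as follows; the core implementations are deliberately trivial stand-ins for illustration, and the names are not from this application.

    # Trivial stand-ins for calculation cores along one stream path.
    def conv_core(x):       return [2 * v for v in x]      # fake convolution
    def pool_core(x):       return x[::2]                  # fake pooling
    def activation_core(x): return [max(0, v) for v in x]  # ReLU-style

    def build_stream_path(core_order):
        cores = {"conv": conv_core, "pool": pool_core,
                 "activation": activation_core}
        ordered = [cores[name] for name in core_order]
        def layer(x):
            for core in ordered:   # data streams core to core, no instructions
                x = core(x)
            return x
        return layer

    # conv -> pool -> activation yields one CNN-style layer.
    cnn_layer = build_stream_path(["conv", "pool", "activation"])
    print(cnn_layer([1, 2, 3, 4]))   # [2, 6]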
The above optional embodiments can implement the data stream-based deep network acceleration methods of the embodiments corresponding to FIG. 2 and FIG. 3 and achieve the same effects, which will not be repeated here.

In the second aspect, please refer to FIG. 4, which is a schematic flowchart of a data stream-based deep network acceleration method provided by an embodiment of this application. As shown in FIG. 4, the method includes:

401. Obtain target deep network information required by data to be processed.

In this step, the data to be processed may be data that can be processed through a deep network, such as image data to be recognized, target data to be detected, or target data to be tracked. The target deep network information corresponds to the deep network required by the data to be processed: for example, if the data to be processed is image data to be recognized, the target deep network information is the configuration parameters of a deep network used for image recognition; if the data to be processed is target data to be detected, the target deep network information is the configuration parameters of a deep network used for target detection. The target deep network information may be preset and determined by matching against the data to be processed, or may be selected manually, which is not limited here. Obtaining the target deep network information facilitates configuring the deep network. The deep network information may include the network type, data type, number of layers, calculation type, and so on.

402. Match, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, where the target network configuration rule includes a calculation core, a second data stream storage module, and a local data stream network.

The target deep network information already contains the network type, data type, number of layers, calculation type, and so on of the deep network required by the data to be processed. The target network configuration rules may be set in advance, for example as the parameter rules and calculation rules of preset networks such as image recognition networks, target detection networks, and target tracking networks; the parameter rules may be hyperparameter setting rules, weight setting rules, and the like, and the calculation rules may be calculation rules for addition, multiplication, convolution, deconvolution, and so on. The configuration rules among the calculation cores, the second data stream storage module, and the local data stream network can be understood as the types and number of calculation cores, the connection mode between the calculation cores and the local data stream network, the connection mode between the second data stream storage module and the local data stream network, the routing connections within the local data stream network, and so on. The local data stream network can be configured by the control register and may be implemented as routers between the second data stream storage module and the calculation cores; for example, the connections between calculation cores may be controlled by network routers, whose main function is to provide skip paths and feedback paths.

403. Configure the target data stream engine according to the target network configuration rule.

Implementing the target network configuration rule may amount to establishing the connection relationships among the pre-configured calculation cores, the second data stream storage module, and the local data stream network. The connection relationships may include the types of calculation cores, the number of connections, the connection order, and so on. The calculation cores can be connected to the local data stream network through interconnects to form a new calculation engine, that is, a data stream engine, and different types, numbers, and connection orders of calculation cores form the data stream engines required by different deep networks. Configuring according to the target network configuration rule yields the target data stream engine used to process the data to be processed. Since each calculation core reads data through the second data stream storage module, the data in the second data stream storage module can be read into different calculation cores to form data streams; for example, data requiring multiplication is read into a multiplication core for multiplication, and data requiring addition is read into an addition core for addition. Since data streams require no instruction-set ordering, the configured data stream engine produces no idle calculation slots.
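Routing each datum from the storage module to the core matching its required operation can be sketched as a small dispatch loop; the tagging scheme below is an assumption made for illustration only.

    # Tagged data items flow to the core that performs their operation.
    def multiply_core(a, b): return a * b
    def add_core(a, b):      return a + b

    cores = {"mul": multiply_core, "add": add_core}

    # Items read from the second data stream storage module (illustrative).
    stream = [("mul", 3, 4), ("add", 3, 4), ("mul", 5, 6)]

    results = [cores[op](a, b) for op, a, b in stream]
    print(results)   # [12, 7, 30]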
404. Process the data to be processed through the target data stream engine.

The target data stream engine is configured through the target network information and may also be called a customized data stream engine. It connects the second data stream storage module and the calculation cores through the local data stream network to form data streams; compared with an instruction-set implementation, there is no need to wait for the reads and writes of a previous instruction to complete, so calculation under the deep network architecture is efficient.

In this embodiment, target deep network information required by data to be processed is obtained; according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information is matched, where the target network configuration rule includes a calculation core, a second data stream storage module, and a local data stream network; a target data stream engine is configured according to the target network configuration rule; and the data to be processed is processed through the target data stream engine. Accelerating the deep network through data streams reduces off-chip data communication, so there is no instruction idle overhead and the hardware acceleration efficiency of the deep network can be improved; moreover, through network configuration, the calculation engines required by different deep network models can be configured, supporting a variety of deep network models.

Please refer to FIG. 5, which is a schematic flowchart of another data stream-based deep network acceleration method provided by an embodiment of this application. As shown in FIG. 5, the method includes:

501. Obtain target deep network information required by data to be processed.

502. Match, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, where the target network configuration rule includes a calculation core, a second data stream storage module, and a local data stream network.

503. Configure the interconnection between the calculation core and the local data stream network to obtain a calculation path of the calculation core.

504. Configure the interconnection between the second data stream storage module and the local data stream network to obtain a storage path.

505. Obtain the target data stream engine according to the calculation path and the storage path.

506. Process the data to be processed through the target data stream engine.

In this embodiment, the calculation cores, the second data stream storage module, and the local data stream network are the main components of the data stream engine. The calculation cores may be kernels with calculation capability such as convolution cores, pooling cores, and activation function cores; a calculation core may also be called a calculation kernel, a calculation unit, a calculation module, and so on. The second data stream storage module may be a storage module with data access functions, such as a cache, DDR, or high-speed DDR, and the second and first data stream storage modules may be different storage areas on the same memory; for example, the second data stream storage module may be a second data buffer area in a cache and the first data stream storage module a first data buffer area in the same cache. The local data stream network can be understood as the routing within the calculation engine that connects the calculation cores with the second data stream storage module; for example, the connections between calculation cores may be controlled by network routers, whose main function is to provide skip paths and feedback paths. By setting the control register, the local data stream network can be configured to form stream paths over the different calculation cores available in the calculation engine. The combination of the types and order of the calculation cores along a stream path provides a continuous data processing pipeline for multiple layers of a deep learning neural network; for example, following the data flow direction, a combination of convolution core to pooling core to activation function core yields a convolutional neural network layer, while a combination of deconvolution core to pooling core to activation function core yields a deconvolutional neural network layer, and so on. It should be noted that the combination of the types and order of the calculation cores is determined by the target network configuration rule.

By forming data streams between the calculation cores, the calculation of the calculation engine can be accelerated, further improving the data processing efficiency of the deep network.

Optionally, processing the data to be processed through the target data stream engine includes:

reading the data to be processed into the second data stream storage module;

generating, in the second data stream storage module, an address sequence for the data to be processed according to a preset generation rule, based on the data format and data path of the data to be processed; and

in each clock cycle, reading from the second data stream storage module, according to the address sequence, the amount of data corresponding to a calculation core in the target data stream engine as input, and acquiring the states of the second data stream storage module and the calculation core.

In this embodiment, the second data stream storage module may be a cache, DDR, or high-speed-access DDR; in the embodiments of this application it is preferably a cache, specifically one provided with a controllable read/write address generation unit. Depending on the input data format and the calculations required in the data path, the address generation unit generates an adapted address sequence to index the data in the cache for input to the corresponding calculation core; for example, if a calculation core needs 80 pieces of data for a calculation, 80 pieces of data corresponding to the address sequence are read from the cache into that core. The address generation unit can also use counters so that the generated address sequence has different loop sizes, for example a small loop over data 1, data 2, and data 3, which improves data reuse and adapts to the data processing size of each calculation core. The states of the second data stream storage module include a data-read-ready state and a data-write-complete state; the states of the calculation core include whether the calculation is completed, whether the next batch of calculation data needs to be read, and so on. The data in the second data stream storage module can be monitored with a finite state machine to obtain the state of the second data stream storage module, and the state of the calculation core can be derived from it; for example, after a calculation result is written into the second data stream storage module, the state of the calculation core can be determined to be calculation-completed.

In each clock cycle, the states of each calculation core and of the second data stream storage module are acquired, so behavior can be predicted accurately, and hardware performance can be optimized for maximum efficiency through precise calculation scheduling, further improving the efficiency of data processing.

Optionally, the second data stream storage module includes a first storage unit and a second storage unit, and processing the data to be processed through the target data stream engine includes:

inputting data in the first storage unit into a calculation core to obtain a calculation result; and

storing the calculation result in the second storage unit as input data for the next calculation core.

In this embodiment, the first storage unit may be an input data stream storage unit and the second storage unit an output data stream storage unit; the two units are used for alternating access of data streams. That is, the first storage unit feeds input data into a calculation core for calculation, and the calculation core writes the calculation result to the second storage unit for storage. This avoids the situation where, while the first storage unit is feeding data to the calculation core, the core's output cannot be written back into that same unit. For example, suppose the calculation core needs to compute on a piece of data in the first storage unit twice: after the first calculation is completed, the core needs to read that data from the first storage unit a second time. Normally it would have to wait for the first calculation result to be stored back into the first storage unit before reading the data again; with the second storage unit in place, the first result can be stored into the second storage unit while the data is read from the first storage unit for the second time, with no waiting, which improves the efficiency of data processing.
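The alternating use of the first and second storage units between successive cores can be modeled as classic double buffering; the sketch below assumes list-backed units and trivial core functions, which is a simplification of the hardware.

    # Double buffering: each core reads from one unit and writes the other,
    # so re-reads of the input never wait on result write-back.
    def run_chain(data, cores):
        units = [list(data), []]          # first and second storage units
        src, dst = 0, 1
        for core in cores:
            units[dst] = [core(v) for v in units[src]]   # results -> other unit
            src, dst = dst, src           # swap roles for the next core
        return units[src]

    cores = [lambda v: v * v,             # first calculation core
             lambda v: v + 1]             # next core reads the stored results
    print(run_chain([1, 2, 3], cores))    # [2, 5, 10]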
上述的可选实施方式,可以现实图4和图5对应实施例的基于数据流的深度网络加速方法,达到相同的效果,在此不再赘述。需要说明的是,上述各个实施方式也可以与图2和图3实施例进行结合。The above-mentioned optional implementation manners can implement the data stream-based deep network acceleration method of the corresponding embodiment in FIG. 4 and FIG. 5, and achieve the same effect, which is not repeated here. It should be noted that each of the above-mentioned embodiments can also be combined with the embodiment of FIG. 2 and FIG. 3.
第三方面,请参照图6,图6为本申请实施例提供的一种基于数据流的深度网络加速装置示意图,如图6所示,所述装置包括:In the third aspect, please refer to FIG. 6. FIG. 6 is a schematic diagram of a data stream-based deep network acceleration device provided by an embodiment of the application. As shown in FIG. 6, the device includes:
第一获取模块601,用于获取待处理数据所需要的目标深度网络信息;The first obtaining module 601 is configured to obtain target deep network information required by the data to be processed;
第一匹配模块602,用于根据所述目标深度网络信息,匹配预先设置的与所述目标深度网络信息对应的目标网络配置规则,其中,所述目标网络配置规则包括预先配置的计算引擎、第一数据流存储模块以及全局数据流网络之间的配置规则;The first matching module 602 is configured to match a preset target network configuration rule corresponding to the target deep network information according to the target deep network information, wherein the target network configuration rule includes a pre-configured calculation engine, a second A configuration rule between the data stream storage module and the global data stream network;
第一配置模块603,用于根据所述目标网络配置规则,配置得到目标数据流网络;The first configuration module 603 is configured to configure the target data flow network according to the target network configuration rule;
第一处理模块604,用于通过所述目标数据流网络对所述待处理数据进行处理。The first processing module 604 is configured to process the data to be processed through the target data stream network.
可选的,所述第一配置模块603包括:Optionally, the first configuration module 603 includes:
全局配置子模块,用于根据所述全局数据流网络,配置多个计算引擎之间的并行或串行;The global configuration sub-module is used to configure parallel or serial between multiple calculation engines according to the global data flow network;
路径配置子模块,用于根据所述第一数据流存储模块及所述多个计算引擎之间的并行或串行,得到所述多个计算引擎的数据流路径;A path configuration submodule, configured to obtain the data flow paths of the multiple calculation engines according to the parallel or serial between the first data flow storage module and the multiple calculation engines;
形成子模块,用于基于所述数据流路径,形成所述目标数据流网络。A forming sub-module is used to form the target data flow network based on the data flow path.
Optionally, the first processing module 604 includes:
a first acquiring sub-module, configured to read the data to be processed into the first data stream storage module;
a first data address generation sub-module, configured to generate, in the first data stream storage module, an address sequence for the data to be processed according to a preset generation rule, based on the data format and data path of the data to be processed; and
a first input sub-module, configured to, in each clock cycle, read from the first data stream storage module according to the address sequence the amount of data corresponding to the computation engine in the target data stream network, input it, and acquire the states of the first data stream storage module and the computation engine.
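The address sequence and per-clock-cycle reads might behave as in the following sketch. The row-major flattening and the fixed number of words per cycle are assumptions; the embodiment only requires that the sequence follow a preset generation rule matched to the data format and data path.

    def address_sequence(shape, words_per_cycle):
        """Yield, per clock cycle, the addresses the engine will consume."""
        total = 1
        for dim in shape:
            total *= dim  # flatten the data format, row-major by assumption
        for base in range(0, total, words_per_cycle):
            yield range(base, min(base + words_per_cycle, total))

    memory = list(range(100, 124))          # data to be processed (2x3x4 format)
    for cycle, addrs in enumerate(address_sequence((2, 3, 4), 8)):
        burst = [memory[a] for a in addrs]  # one fixed-width read per clock cycle
        print(f"cycle {cycle}: {burst}")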
Optionally, the target network configuration further includes a compute core, a second data stream storage unit, and a local data stream network connecting the compute core and the second data stream storage unit; the first configuration module 603 further includes:
a first local configuration sub-module, configured to configure the interconnection between the compute core and the local data stream network to obtain the computation path of the compute core;
a first local path sub-module, configured to configure the interconnection between the second data stream storage unit and the local data stream network to obtain a storage path; and
a first engine module, configured to obtain the computation engine according to the computation path and the storage path.
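Assembling a computation engine from a computation path and a storage path could be modeled as below; representing compute cores as callables and the storage path as a list is only an analogy for the hardware interconnect, and make_engine is an invented name.

    def make_engine(compute_cores, storage_unit):
        """Wire cores (computation path) to a storage unit (storage path)."""
        def engine(x):
            for core in compute_cores:
                x = core(x)             # local network forwards core to core
                storage_unit.append(x)  # intermediate results reach storage
            return x
        return engine

    trace = []
    engine = make_engine([lambda v: v * 2, lambda v: v + 1], trace)
    print(engine(3), trace)  # 7 [6, 7]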
In a fourth aspect, please refer to FIG. 7, which is a schematic diagram of a data stream-based deep network acceleration apparatus according to an embodiment of this application. As shown in FIG. 7, the apparatus includes:
a second acquiring module 701, configured to acquire the target deep network information required by the data to be processed;
a second matching module 702, configured to match, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, where the target network configuration rule includes a compute core, a second data stream storage module, and a local data stream network;
a second configuration module 703, configured to obtain a target data stream engine through configuration according to the target network configuration rule; and
a second processing module 704, configured to process the data to be processed through the target data stream engine.
Optionally, the second configuration module 703 includes:
a second local configuration sub-module, configured to configure the interconnection between the compute core and the local data stream network to obtain the computation path of the compute core;
a second local path sub-module, configured to configure the interconnection between the second data stream storage module and the local data stream network to obtain a storage path; and
a second engine module, configured to obtain the target data stream engine according to the computation path and the storage path.
Optionally, the second processing module 704 includes:
a second acquiring sub-module, configured to read the data to be processed into the second data stream storage module;
a second data address generation sub-module, configured to generate, in the second data stream storage module, an address sequence for the data to be processed according to a preset generation rule, based on the data format and data path of the data to be processed; and
a second input sub-module, configured to, in each clock cycle, read from the second data stream storage module according to the address sequence the amount of data corresponding to the compute core in the target data stream engine, input it, and acquire the states of the second data stream storage module and the compute core.
Optionally, the second processing module 704 includes:
an input computation sub-module, configured to input the data in the first storage unit into the compute core to obtain a calculation result; and
an output storage sub-module, configured to store the calculation result in the second storage unit as the input data of the next compute core.
In a fifth aspect, an embodiment of this application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the data stream-based deep network acceleration method provided by the embodiments of this application.
In a sixth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the steps of the data stream-based deep network acceleration method provided by the embodiments of this application.
It should be noted that, for brevity, the foregoing method embodiments are all described as a series of action combinations. Those skilled in the art should understand, however, that this application is not limited by the described order of actions, because according to this application some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by this application.
In the above embodiments, the description of each embodiment has its own emphasis. For a part not described in detail in one embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative.
In addition, the processors and chips in the embodiments of this application may be integrated into one processing unit, may exist physically alone, or two or more pieces of hardware may be integrated into one unit. The computer-readable storage medium or computer-readable program may be stored in a computer-readable memory. Based on this understanding, the technical solution of this application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of this application. The aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
Those of ordinary skill in the art can understand that all or some of the steps in the methods of the above embodiments may be completed by a program instructing relevant hardware. The program may be stored in a computer-readable memory, and the memory may include a flash drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
The above content is a further detailed description of this application in conjunction with specific optional embodiments, and the specific embodiments of this application shall not be deemed limited to these descriptions. For those of ordinary skill in the art to which this application belongs, several simple deductions or substitutions may be made without departing from the concept of this application, all of which shall be deemed to fall within the protection scope of this application.

Claims (12)

  1. A data stream-based deep network acceleration method, wherein the method comprises:
    acquiring target deep network information required by data to be processed;
    matching, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises configuration rules among a pre-configured computation engine, a first data stream storage module, and a global data stream network;
    configuring a target data stream network according to the target network configuration rule; and
    processing the data to be processed through the target data stream network.
  2. The method according to claim 1, wherein the configuring a target data stream network according to the target network configuration rule comprises:
    configuring the parallel or serial arrangement among multiple computation engines according to the global data stream network;
    obtaining data stream paths of the multiple computation engines according to the first data stream storage module and the parallel or serial arrangement among the multiple computation engines; and
    forming the target data stream network based on the data stream paths.
  3. The method according to claim 1, wherein the processing the data to be processed through the target data stream network comprises:
    reading the data to be processed into the first data stream storage module;
    generating, in the first data stream storage module, an address sequence for the data to be processed according to a preset generation rule, based on the data format and data path of the data to be processed; and
    in each clock cycle, reading from the first data stream storage module according to the address sequence the amount of data corresponding to the computation engine in the target data stream network for input, and acquiring the states of the first data stream storage module and the computation engine.
  4. The method according to any one of claims 1 to 3, wherein the target network configuration further comprises a compute core, a second data stream storage unit, and a local data stream network connecting the compute core and the second data stream storage unit, and the configuration of the computation engine comprises:
    configuring the interconnection between the compute core and the local data stream network to obtain the computation path of the compute core;
    configuring the interconnection between the second data stream storage unit and the local data stream network to obtain a storage path; and
    obtaining the computation engine according to the computation path and the storage path.
  5. A data stream-based deep network acceleration method, wherein the method comprises:
    acquiring target deep network information required by data to be processed;
    matching, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises a compute core, a second data stream storage module, and a local data stream network;
    configuring a target data stream engine according to the target network configuration rule; and
    processing the data to be processed through the target data stream engine.
  6. The method according to claim 5, wherein the configuring a target data stream engine according to the target network configuration rule comprises:
    configuring the interconnection between the compute core and the local data stream network to obtain the computation path of the compute core;
    configuring the interconnection between the second data stream storage module and the local data stream network to obtain a storage path; and
    obtaining the target data stream engine according to the computation path and the storage path.
  7. The method according to claim 5, wherein the processing the data to be processed through the target data stream engine comprises:
    reading the data to be processed into the second data stream storage module;
    generating, in the second data stream storage module, an address sequence for the data to be processed according to a preset generation rule, based on the data format and data path of the data to be processed; and
    in each clock cycle, reading from the second data stream storage module according to the address sequence the amount of data corresponding to the compute core in the target data stream engine for input, and acquiring the states of the second data stream storage module and the compute core.
  8. The method according to any one of claims 5 to 7, wherein the second data stream storage module comprises a first storage unit and a second storage unit, and the processing the data to be processed through the target data stream engine comprises:
    inputting data in the first storage unit into the compute core to obtain a calculation result; and
    storing the calculation result in the second storage unit as the input data of the next compute core.
  9. A data stream-based deep network acceleration apparatus, wherein the apparatus comprises:
    a first acquiring module, configured to acquire target deep network information required by data to be processed;
    a first matching module, configured to match, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises configuration rules among a pre-configured computation engine, a first data stream storage module, and a global data stream network;
    a first configuration module, configured to obtain a target data stream network through configuration according to the target network configuration rule; and
    a first processing module, configured to process the data to be processed through the target data stream network.
  10. A data stream-based deep network acceleration apparatus, wherein the apparatus comprises:
    a second acquiring module, configured to acquire target deep network information required by data to be processed;
    a second matching module, configured to match, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule comprises a compute core, a second data stream storage module, and a local data stream network;
    a second configuration module, configured to obtain a target data stream engine through configuration according to the target network configuration rule; and
    a second processing module, configured to process the data to be processed through the target data stream engine.
  11. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the data stream-based deep network acceleration method according to any one of claims 1 to 4.
  12. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the data stream-based deep network acceleration method according to any one of claims 1 to 4.
PCT/CN2019/082101 2019-04-09 2019-04-10 Deep network acceleration methods and apparatuses based on data stream, device, and storage medium WO2020206637A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910280156.2 2019-04-09
CN201910280156.2A CN110046704B (en) 2019-04-09 2019-04-09 Deep network acceleration method, device, equipment and storage medium based on data stream

Publications (1)

Publication Number Publication Date
WO2020206637A1 true WO2020206637A1 (en) 2020-10-15

Family

ID=67276511

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/082101 WO2020206637A1 (en) 2019-04-09 2019-04-10 Deep network acceleration methods and apparatuses based on data stream, device, and storage medium

Country Status (2)

Country Link
CN (1) CN110046704B (en)
WO (1) WO2020206637A1 (en)

Families Citing this family (11)

Publication number Priority date Publication date Assignee Title
WO2021026768A1 (en) * 2019-08-13 2021-02-18 深圳鲲云信息科技有限公司 Automatic driving method and apparatus based on data stream, and electronic device and storage medium
CN113272792A (en) * 2019-10-12 2021-08-17 深圳鲲云信息科技有限公司 Local data stream acceleration method, data stream acceleration system and computer equipment
CN112905525B (en) * 2019-11-19 2024-04-05 中科寒武纪科技股份有限公司 Method and equipment for controlling computing device to perform computation
CN111404770B (en) * 2020-02-29 2022-11-11 华为技术有限公司 Network device, data processing method, device and system and readable storage medium
CN111857989B (en) * 2020-06-22 2024-02-27 深圳鲲云信息科技有限公司 Artificial intelligence chip and data processing method based on same
CN111753994B (en) * 2020-06-22 2023-11-03 深圳鲲云信息科技有限公司 Data processing method and device of AI chip and computer equipment
CN111752887B (en) * 2020-06-22 2024-03-15 深圳鲲云信息科技有限公司 Artificial intelligence chip and data processing method based on same
CN111737193B (en) * 2020-08-03 2020-12-08 深圳鲲云信息科技有限公司 Data storage method, device, equipment and storage medium
CN114021708B (en) * 2021-09-30 2023-08-01 浪潮电子信息产业股份有限公司 Data processing method, device and system, electronic equipment and storage medium
CN114461978B (en) * 2022-04-13 2022-07-08 苏州浪潮智能科技有限公司 Data processing method and device, electronic equipment and readable storage medium
CN116974654B (en) * 2023-09-21 2023-12-19 浙江大华技术股份有限公司 Image data processing method and device, electronic equipment and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
WO2014105309A1 (en) * 2012-12-31 2014-07-03 Mcafee, Inc. System and method for correlating network information with subscriber information in a mobile network environment
CN106447034A (en) * 2016-10-27 2017-02-22 中国科学院计算技术研究所 Neutral network processor based on data compression, design method and chip
CN108154165A (en) * 2017-11-20 2018-06-12 华南师范大学 Love and marriage object matching data processing method, device, computer equipment and storage medium based on big data and deep learning
CN108710941A (en) * 2018-04-11 2018-10-26 杭州菲数科技有限公司 The hard acceleration method and device of neural network model for electronic equipment
CN109445935A (en) * 2018-10-10 2019-03-08 杭州电子科技大学 A kind of high-performance big data analysis system self-adaption configuration method under cloud computing environment

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US11055063B2 (en) * 2016-05-02 2021-07-06 Marvell Asia Pte, Ltd. Systems and methods for deep learning processor
US11216722B2 (en) * 2016-12-31 2022-01-04 Intel Corporation Hardware accelerator template and design framework for implementing recurrent neural networks
US20180189641A1 (en) * 2017-01-04 2018-07-05 Stmicroelectronics S.R.L. Hardware accelerator engine
CN107066239A (en) * 2017-03-01 2017-08-18 智擎信息系统(上海)有限公司 A kind of hardware configuration for realizing convolutional neural networks forward calculation

Also Published As

Publication number Publication date
CN110046704A (en) 2019-07-23
CN110046704B (en) 2022-11-08

Similar Documents

Publication Publication Date Title
WO2020206637A1 (en) Deep network acceleration methods and apparatuses based on data stream, device, and storage medium
US10713568B2 (en) Apparatus and method for executing reversal training of artificial neural network
US11893414B2 (en) Operation method, device and related products
KR102486030B1 (en) Apparatus and method for executing forward operation of fully-connected layer neural network
WO2018171717A1 (en) Automated design method and system for neural network processor
CN109086877B (en) Apparatus and method for performing convolutional neural network forward operation
US11915139B2 (en) Modifying machine learning models to improve locality
EP3407265B1 (en) Device and method for executing forward calculation of artificial neural network
AU2014203218B2 (en) Memory configuration for inter-processor communication in an MPSoC
US11294599B1 (en) Registers for restricted memory
TWI634489B (en) Multi-layer artificial neural network
US11694075B2 (en) Partitioning control dependency edge in computation graph
US10990525B2 (en) Caching data in artificial neural network computations
Voss et al. Convolutional neural networks on dataflow engines
CN113496248A (en) Method and apparatus for training computer-implemented models
US20210125042A1 (en) Heterogeneous deep learning accelerator
JPWO2020188658A1 (en) Architecture estimator, architecture estimation method, and architecture estimation program
US11797280B1 (en) Balanced partitioning of neural network based on execution latencies
Abeyrathne et al. Offloading specific performance-related kernel functions into an FPGA
WO2024120050A1 (en) Operator fusion method used for neural network, and related apparatus
US20230126594A1 (en) Instruction generating method, arithmetic processing device, and instruction generating device
CN115204086A (en) Network-on-chip simulation model, dynamic path planning method and device, and multi-core chip
AU2015271896A1 (en) Selection of system-on-chip component models for early design phase evaluation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 19924206
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 19924206
    Country of ref document: EP
    Kind code of ref document: A1