WO2020206637A1 - Data stream-based deep network acceleration methods and apparatuses, device, and storage medium - Google Patents

Data stream-based deep network acceleration methods and apparatuses, device, and storage medium

Info

Publication number
WO2020206637A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
target
network
data stream
calculation
Prior art date
Application number
PCT/CN2019/082101
Other languages
English (en)
Chinese (zh)
Inventor
牛昕宇
蔡权雄
Original Assignee
深圳鲲云信息科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳鲲云信息科技有限公司
Publication of WO2020206637A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7814Specially adapted for real time processing, e.g. comprising hardware timers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • This application relates to the field of artificial intelligence, and more specifically, to a data stream-based deep network acceleration method and apparatus, a device, and a storage medium.
  • The advancement of neural network-based deep learning applications requires high processing capability from the underlying hardware platform.
  • Because CPU-based platforms cannot meet this growing demand, many companies have developed dedicated hardware accelerators to support advances in this field.
  • The common idea of existing hardware accelerators is to accelerate certain types of calculations that are used most frequently in deep learning applications.
  • The existing hardware architecture is based on the execution of instructions with an extensible instruction set, and realizes acceleration by implementing common calculations as customized instructions.
  • The instruction-based architecture is usually implemented as a system-on-chip (SoC) design.
  • In an instruction-based architecture, many clock cycles are wasted on operations that are not related to computation.
  • Calculations in deep learning neural networks are usually decomposed into multiple instructions, so a calculation usually requires multiple clock cycles.
  • The arithmetic and logic unit (ALU) in a processor is usually a collection of different operations implemented in hardware. Because instruction expressiveness and I/O bandwidth are limited, most ALU resources are idle while a single instruction executes. For example, in a multiply-then-add sequence, the multiplication operands are read first; constrained by I/O bandwidth, the addition has to wait until the multiplication completes and its result is written to memory, after which the result and the addition operands are read back for the addition. During the multiplication and the associated reads and writes, the addition unit is idle. Instruction-based hardware acceleration therefore suffers from low efficiency.
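As a rough, purely illustrative reading of this bottleneck (not part of the original disclosure), the following Python sketch contrasts a serialized multiply-then-add sequence, where the adder idles while the product is written to and read back from memory, with a pipelined data flow in which the adder consumes each product as soon as it is produced. All cycle costs and function names are assumptions chosen only to make the contrast visible.

```python
# Illustrative only: the cycle costs below are assumptions, not figures from the patent.
MUL_CYCLES, ADD_CYCLES, MEM_RW_CYCLES = 3, 1, 4

def instruction_based(num_items: int) -> int:
    """Each product is written to memory and read back before the add can run."""
    cycles = 0
    for _ in range(num_items):
        cycles += MUL_CYCLES      # multiply executes
        cycles += MEM_RW_CYCLES   # product written to memory
        cycles += MEM_RW_CYCLES   # product (and addend) read back
        cycles += ADD_CYCLES      # add finally executes; the adder was idle until now
    return cycles

def dataflow_pipeline(num_items: int) -> int:
    """Multiplier and adder are chained: the adder consumes each product directly,
    so once the pipeline is full, one result is produced per multiply interval."""
    fill = MUL_CYCLES + ADD_CYCLES           # first item traverses both stages
    steady = (num_items - 1) * MUL_CYCLES    # remaining items overlap with the adds
    return fill + steady

if __name__ == "__main__":
    n = 1000
    print("instruction-based cycles:", instruction_based(n))
    print("dataflow pipeline cycles:", dataflow_pipeline(n))
```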
  • In view of the above defects in the prior art, the purpose of this application is to provide a data stream-based deep network acceleration method, apparatus, device, and storage medium, which solve the problem that, because instruction expressiveness and I/O bandwidth are limited, most ALU resources are idle while a single instruction executes, so instruction-based hardware acceleration is inefficient.
  • A data stream-based deep network acceleration method includes:
  • acquiring target deep network information required by the data to be processed;
  • matching, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network;
  • configuring a target data stream network according to the target network configuration rule; and
  • processing the data to be processed through the target data stream network.
  • Configuring the target data stream network according to the target network configuration rule includes:
  • configuring multiple calculation engines in parallel or in series according to the global data stream network;
  • obtaining the data flow paths of the multiple calculation engines according to the first data stream storage module and the parallel or serial arrangement of the multiple calculation engines; and
  • forming the target data stream network based on the data flow paths, as illustrated by the sketch below.
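Purely as a way to visualize these configuration steps, the sketch below models a target network configuration rule as a small Python description of how calculation engines attach to a global data stream network, either in parallel or in series, and derives the resulting data flow paths. The class and field names are invented for illustration and do not appear in the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NetworkConfigRule:
    """Assumed shape of a 'target network configuration rule': which calculation
    engines to instantiate and whether they are chained (serial) or fanned out
    (parallel) on the global data stream network."""
    engines: List[str]            # e.g. ["conv_engine", "pool_engine"]
    topology: str = "serial"      # "serial" or "parallel"

@dataclass
class TargetDataFlowNetwork:
    storage: str                                    # first data stream storage module
    flow_paths: List[List[str]] = field(default_factory=list)

def configure_network(rule: NetworkConfigRule) -> TargetDataFlowNetwork:
    """Build data flow paths of the form storage -> engine(s) -> storage."""
    net = TargetDataFlowNetwork(storage="first_data_stream_storage")
    if rule.topology == "serial":
        # a single path threading every engine in order (deeper computing layers)
        net.flow_paths.append([net.storage, *rule.engines, net.storage])
    else:
        # one independent path per engine, all fed from the same storage module
        for engine in rule.engines:
            net.flow_paths.append([net.storage, engine, net.storage])
    return net

if __name__ == "__main__":
    rule = NetworkConfigRule(engines=["conv_engine", "pool_engine"], topology="serial")
    print(configure_network(rule).flow_paths)
```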
  • Processing the data to be processed through the target data stream network includes:
  • reading the data to be processed into the first data stream storage module;
  • generating, in the first data stream storage module, an address sequence for the data to be processed according to a preset generation rule, based on the data format and data path of the data to be processed; and
  • in each clock cycle, reading from the first data stream storage module, according to the address sequence, the data volume corresponding to the calculation engine in the target data stream network for input, and obtaining the states of the first data stream storage module and the calculation engine.
  • The target network configuration rule further includes a computing core, a second data stream storage unit, and a local data stream network connecting the computing core and the second data stream storage unit, and the configuration of the calculation engine includes:
  • configuring the interconnection between the computing core and the local data stream network to obtain the computing path of the computing core;
  • configuring the interconnection between the second data stream storage unit and the local data stream network to obtain a storage path; and
  • obtaining the calculation engine according to the computing path and the storage path.
  • A data stream-based deep network acceleration method includes:
  • acquiring target deep network information required by the data to be processed;
  • matching, according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information, wherein the target network configuration rule includes a computing core, a second data stream storage module, and a local data stream network;
  • configuring a target data flow engine according to the target network configuration rule; and
  • processing the data to be processed by the target data flow engine.
  • Configuring the target data flow engine according to the target network configuration rule includes:
  • configuring the interconnection between the computing core and the local data stream network to obtain the computing path of the computing core;
  • configuring the interconnection between the second data stream storage module and the local data stream network to obtain a storage path; and
  • obtaining the target data flow engine according to the computing path and the storage path.
  • Processing the data to be processed by the target data flow engine includes:
  • reading the data to be processed into the second data stream storage module;
  • generating, in the second data stream storage module, an address sequence for the data to be processed according to a preset generation rule, based on the data format and data path of the data to be processed; and
  • in each clock cycle, reading from the second data stream storage module, according to the address sequence, the data volume corresponding to the computing core in the target data flow engine for input, and obtaining the states of the second data stream storage module and the computing core.
  • The second data stream storage module includes a first storage unit and a second storage unit.
  • Processing the data to be processed by the target data flow engine includes:
  • inputting the data in the first storage unit into the computing core to obtain a calculation result; and
  • storing the calculation result in the second storage unit as the input data of the next computing core.
  • a data stream-based deep network acceleration device includes:
  • the first obtaining module is used to obtain target deep network information required by the data to be processed
  • the first matching module is configured to match a preset target network configuration rule corresponding to the target deep network information according to the target deep network information, wherein the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network;
  • the first configuration module is configured to configure the target data flow network according to the target network configuration rule
  • the first processing module is configured to process the to-be-processed data through the target data flow network.
  • a data stream-based deep network acceleration device comprising:
  • the second acquisition module is used to acquire target deep network information required by the data to be processed
  • the second matching module is configured to match a preset target network configuration rule corresponding to the target deep network information according to the target deep network information, wherein the target network configuration rule includes a computing core, a second data stream storage module, and a local data stream network;
  • the second configuration module is used to configure the target data flow engine according to the target network configuration rule
  • the second processing module is configured to process the to-be-processed data through the target data flow engine.
  • An electronic device includes a memory, a processor, and a computer program stored on the memory and capable of running on the processor; when the processor executes the computer program, the steps in the data stream-based deep network acceleration method provided by the embodiments of this application are implemented.
  • A computer-readable storage medium is provided, and a computer program is stored on the computer-readable storage medium.
  • When the computer program is executed by a processor, the steps in the data stream-based deep network acceleration method provided in the embodiments of the present application are implemented.
  • The deep network is accelerated through the data stream and off-chip data communication is reduced, so there is no instruction idle overhead, and the hardware acceleration efficiency of the deep network can be improved.
  • Moreover, through network configuration, different deep network models can be configured, so that a variety of deep network models are supported.
  • FIG. 1 is a schematic diagram of an optional implementation architecture of a data stream-based deep network acceleration method provided by an embodiment of this application;
  • FIG. 2 is a schematic flowchart of a data stream-based deep network acceleration method according to the first aspect of the embodiments of the present application;
  • FIG. 3 is a schematic flowchart of another data stream-based deep network acceleration method provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a data flow-based deep network acceleration method provided by the second aspect of the embodiments of the application;
  • FIG. 5 is a schematic flowchart of another data stream-based deep network acceleration method provided by an embodiment of this application.
  • FIG. 6 is a schematic diagram of a data stream-based deep network acceleration device provided by the third aspect of the embodiments of the application.
  • FIG. 7 is a schematic diagram of a data stream-based deep network acceleration device provided by the fourth aspect of the embodiments of the application.
  • FIG. 1 is a schematic diagram of an optional implementation architecture of a data stream-based deep network acceleration method provided by an embodiment of this application.
  • The architecture 103 is connected to the off-chip storage module (DDR) 101 and to the CPU through interconnection.
  • the architecture 103 includes: a first storage module 104, a global data flow network 105 and a data flow engine 106.
  • The first storage module 104 is connected to the off-chip storage module 101 through interconnection and is also connected to the global data flow network 105 through interconnection; the data flow engine 106 is connected to the global data flow network 105 through interconnection, so that data flow engines 106 can be arranged in parallel or in series.
  • the aforementioned data flow engine 106 may include: a computing core (or called a computing module), a second storage module 108, and a local data flow network 107.
  • The computing core may include cores used for computation, such as a convolution core 109, a pooling core 110, and an activation function core 111. Of course, other computing cores besides the example convolution core 109, pooling core 110, and activation function core 111 may also be included; this is not limited here, and any other core that a deep network may include can also be used.
  • the above-mentioned first storage module 104 and the above-mentioned second storage module 108 may be on-chip cache modules, or may be DDR or high-speed DDR memory modules.
  • the above-mentioned data stream engine 106 can be understood as a computing engine that supports data stream processing, and can also be understood as a computing engine dedicated to data stream processing.
  • the foregoing CPU may include a control register, and the foregoing control register is pre-configured with network configuration rules for configuring the network.
  • the deep network in this application may also be called a deep learning network, a deep learning neural network, and the like.
  • This application provides a data stream-based deep network acceleration method, device, equipment and storage medium.
  • FIG. 2 is a schematic flowchart of a data stream-based deep network acceleration method provided by an embodiment of the present application. As shown in FIG. 2, the method includes the following steps:
  • the aforementioned data to be processed may be data that can be processed through a deep network, such as image data to be identified, target data to be detected, target data to be tracked, and so on.
  • The target deep network information is the information of the deep network corresponding to the data to be processed. For example, if the data to be processed is image data to be recognized, the target deep network information is the configuration parameters of the deep network used for image recognition; if the data to be processed is target data to be detected, the target deep network information is the configuration parameters of the deep network used for target detection.
  • the above-mentioned target deep network information may be preset, and the matching determination may be performed through the data to be processed, or may be manually selected and determined, which is not limited herein.
  • Obtaining the target deep network information can facilitate the configuration of the deep network.
  • the aforementioned deep network information may include network type, data type, number of layers, calculation type, and so on.
  • According to the target deep network information, a preset target network configuration rule corresponding to the target deep network information is matched, wherein the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network.
  • the aforementioned target deep network information already contains the network type, data type, number of layers, calculation type, etc. of the deep network required by the data to be processed.
  • The target network configuration rules can be set in advance, for example, the parameter rules and calculation rules of preset networks such as image recognition networks, target detection networks, and target tracking networks.
  • The parameter rules can be hyperparameter setting rules, weight setting rules, and so on.
  • The calculation rules can be calculation rules for addition, multiplication, convolution, deconvolution, and so on.
  • the foregoing configuration rules between the pre-configured calculation engine, the first data stream storage module, and the global data stream network can be understood as the number of calculation engines and the connection mode between the calculation engine and the global data stream network.
  • The global data stream network can be configured by the control register; its implementation may be a router between the first data stream storage module and the calculation engines.
  • The first data stream storage module may include two data stream storage units, one for input and one for output, used alternately for data access: the input data stream storage unit feeds input data into the calculation engine,
  • and the calculation engine outputs its calculation results to the output data stream storage unit for storage. This avoids the conflict that would arise if the calculation engine's output had to be written back into the same unit that is still supplying its input.
  • For example, when the calculation engine needs to process a piece of data in the input data stream storage unit twice, after the first calculation is completed the calculation engine needs to read the data from the input data stream storage unit a second time.
  • The first calculation result can be stored in the output data stream storage unit at the same time as the data is read for the second pass, without waiting, which improves data processing efficiency.
  • Configuring according to the target network configuration rule may mean configuring the connection relationship among the pre-configured calculation engine, the first data stream storage module, and the global data stream network.
  • The connection relationship may include the number of calculation engines connected and the connection sequence.
  • The calculation engines can be connected to the global data stream network through interconnection to form a new deep network, and different deep networks can be formed according to different numbers and connection sequences of calculation engines.
  • The target data stream network can thus be obtained for processing the data to be processed. Since each calculation engine reads data through the first data stream storage module, the data in the first data stream storage module can be read into different calculation engines to form a data stream; no instruction-set ordering is required, so the configured calculation engines do not produce computation gaps.
  • the above-mentioned target data stream network is configured through target network information, and can also be called a customized data stream network.
  • The target data stream network connects the first data stream storage module and the calculation engines through the global data stream network to form a data stream; compared with an instruction-set implementation, there is no need to wait for the previous instruction's reads and writes to complete, which improves calculation efficiency under the deep network architecture.
  • In this embodiment, the target deep network information required by the data to be processed is acquired; according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information is matched, wherein the target network configuration rule includes configuration rules among the pre-configured calculation engine, the first data stream storage module, and the global data stream network; the target data stream network is configured according to the target network configuration rule; and the data to be processed is processed through the target data stream network.
  • Accelerating the deep network through the data stream reduces off-chip data communication, so there is no instruction idle overhead, and the hardware acceleration efficiency of the deep network can be improved; moreover, through network configuration, different deep network models can be configured, so that a variety of deep network models are supported.
  • The data stream-based deep network acceleration method provided in the embodiments of the present application can be applied to devices for data stream-based deep network acceleration, such as computers, servers, mobile phones, and other devices capable of performing data stream-based deep network acceleration.
  • FIG. 3 is a schematic flowchart of another data flow-based deep network acceleration method provided by an embodiment of the present application. As shown in FIG. 3, the method includes the following steps:
  • According to the target deep network information, a preset target network configuration rule corresponding to the target deep network information is matched, wherein the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network.
  • The global data flow network can be implemented by routing and can be configured by a control register in which corresponding global data flow network configuration rules are preset.
  • The network is implemented as a router between the first data stream storage module and each calculation engine; the main function of this network router is to provide skip paths and feedback paths for data flows between the calculation engines.
  • Parallel or serial operation of the multiple calculation engines can be configured through the data flow. For example, when calculation engine A and calculation engine B are parallel in the global data flow network, the data flow flows to calculation engine A and calculation engine B at the same time.
  • Alternatively, the data flow can first be directed to calculation engine A for calculation, and the calculation result then flows to calculation engine B.
  • The serial mode can be understood as deepening the computing layers of the deep network.
  • The specific configuration can be to control the data flow direction through the global data flow network, so as to realize the parallel or serial configuration of the multiple calculation engines.
  • The parallel or serial configuration of multiple calculation engines can be obtained by configuring the interconnection between the global data flow network and the calculation engines: for example, multiple calculation engines can be interconnected with the global data flow network according to parallel rules,
  • or multiple calculation engines can be interconnected with the global data flow network according to serial rules, and the first data stream storage module is configured to be interconnected with the global data flow network.
  • The first data stream storage module may be a cache, DDR, or high-speed-access DDR.
  • It is preferably a cache.
  • A controllable read-write address generation unit may be provided in the cache.
  • The address generation unit generates an adapted address sequence to index the data in the cache.
  • The address sequence can be used to index data in the cache and input it to the corresponding calculation engine. For example, if a calculation engine requires 80 data items to perform its calculation, then the 80 data items corresponding to the address sequence are read from the cache into the calculation engine.
  • The address generation unit can also use a counter so that the generated address sequence has different cycle sizes, for example a small cycle of data 1, data 2, data 3, which improves data reusability and also adapts to the data processing size of each calculation engine; a sketch of such an address generator follows below.
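The following sketch is offered only as one interpretation of the address generation unit described above: it produces an address sequence whose inner counter replays a small window of addresses (for example data 1, data 2, data 3) so that data can be reused, and a helper reads exactly the amount of data one calculation engine consumes. The parameter names and values are assumptions, not taken from the patent.

```python
from typing import Iterator, List

def address_sequence(base: int, window: int, repeats: int, total: int) -> Iterator[int]:
    """Yield addresses in small cycles: each window of `window` consecutive
    addresses is replayed `repeats` times before moving on, which lets a
    calculation engine reuse the same data without re-fetching it."""
    addr = base
    while addr < base + total:
        span = min(window, base + total - addr)
        for _ in range(repeats):
            for offset in range(span):
                yield addr + offset
        addr += window

def read_for_engine(cache: List[int], addresses: Iterator[int], amount: int) -> List[int]:
    """Index `amount` items out of the cache for one engine input, e.g. 80 items."""
    return [cache[next(addresses)] for _ in range(amount)]

if __name__ == "__main__":
    cache = list(range(1000))                        # stand-in for the storage module
    small_cycle = address_sequence(base=0, window=3, repeats=2, total=12)
    print(list(small_cycle))                         # 0,1,2,0,1,2,3,4,5,3,4,5,...
    linear = address_sequence(base=0, window=80, repeats=1, total=800)
    print(len(read_for_engine(cache, linear, 80)))   # 80 items for one engine
```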
  • The data stream is stored in the first data stream storage module, and the data stream is routed, in parallel or in series among the multiple calculation engines, to each data node along the data flow path, so that data is processed in the calculation engines like a pipeline, which improves data processing efficiency.
  • The first data stream storage module inputs data to the corresponding calculation engine through the global data flow network, and the calculation engine outputs its calculation results to the first data stream storage module through the global data flow network, without instructions for control; therefore, there is no problem of a computing unit being idle while a single instruction executes.
  • Processing the data to be processed through the target data stream network includes:
  • reading the data to be processed into the first data stream storage module;
  • generating, in the first data stream storage module, an address sequence for the data to be processed according to a preset generation rule, based on the data format and data path of the data to be processed; and
  • in each clock cycle, reading from the first data stream storage module, according to the address sequence, the data volume corresponding to the calculation engine in the target data stream network for input, and obtaining the states of the first data stream storage module and the calculation engine.
  • The first data stream storage module may be a cache, DDR, or high-speed-access DDR.
  • It is preferably a cache.
  • A controllable read-write address generation unit may be provided in the cache.
  • The address generation unit generates an adapted address sequence to index the data in the cache.
  • The address sequence can be used to index data in the cache and input it to the corresponding calculation engine. For example, if a calculation engine requires 80 data items to perform its calculation, then the 80 data items corresponding to the address sequence are read from the cache into the calculation engine.
  • The address generation unit can also use a counter so that the generated address sequence has different cycle sizes, for example a small cycle of data 1, data 2, data 3, which improves data reusability and also adapts to the data processing size of each calculation engine.
  • the state of the first data stream storage module includes: a data read preparation state and a data write completion state.
  • the state of the calculation engine includes whether the calculation is completed, whether the next calculation data needs to be read, and so on.
  • The state of the first data stream storage module can be obtained by monitoring the state of the data in the first data stream storage module with a finite state machine, and the state of the calculation engine can be inferred from the state of the first data stream storage module; for example, after a calculation result has been written into the first data stream storage module, it can be determined that the state of the calculation engine is calculation completed.
  • In each clock cycle, the status of each calculation engine and of the first data stream storage module is obtained, so that behavior can be accurately predicted; through accurate calculation scheduling, hardware performance can be exploited with maximum efficiency, further improving data processing efficiency.
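As one possible reading of this per-cycle monitoring (not the patent's own implementation), the sketch below tracks a data-ready / write-complete state for the storage module and a busy/idle state for each calculation engine on every simulated clock cycle, and only issues new data when both sides are ready. The state fields, cycle counts, and names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StorageState:
    data_ready: bool = True    # data read preparation state
    write_done: bool = True    # data write completion state

@dataclass
class EngineState:
    busy_cycles: int = 0       # cycles left in the current calculation

    @property
    def needs_data(self) -> bool:
        return self.busy_cycles == 0

def run(engines: List[EngineState], storage: StorageState, items: int,
        cycles_per_item: int = 3, max_cycles: int = 100_000) -> int:
    """Cycle-accurate toy scheduler: every clock cycle, read the status of the
    storage module and of each engine, feed idle engines when data is ready,
    and advance busy engines; returns the number of cycles used."""
    clock = 0
    while (items > 0 or any(e.busy_cycles for e in engines)) and clock < max_cycles:
        for eng in engines:
            if eng.needs_data and storage.data_ready and items > 0:
                eng.busy_cycles = cycles_per_item   # issue one block of data
                items -= 1
            elif eng.busy_cycles > 0:
                eng.busy_cycles -= 1
                if eng.busy_cycles == 0:
                    storage.write_done = True       # result written back this cycle
        clock += 1
    return clock

if __name__ == "__main__":
    cycles = run([EngineState(), EngineState()], StorageState(), items=100)
    print("finished after", cycles, "cycles")
```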
  • The target network configuration rule further includes a computing core, a second data stream storage unit, and a local data stream network connecting the computing core and the second data stream storage unit, and the configuration of the calculation engine includes:
  • configuring the interconnection between the computing core and the local data stream network to obtain the computing path of the computing core;
  • configuring the interconnection between the second data stream storage unit and the local data stream network to obtain a storage path; and
  • obtaining the calculation engine according to the computing path and the storage path.
  • the aforementioned computing core, the second data stream storage module, and the local data stream network are the main configurations of the computing engine.
  • The computing core may be a convolution core, a pooling core, an activation function core, or another core for performing computation; in addition, it should be noted that the computing core may also be called a calculation core, a computing unit, a computing module, and so on.
  • The second data stream storage module may be a storage module with a data access function, such as a cache, DDR, or high-speed DDR, and the second data stream storage module and the first data stream storage module may be different storage areas on the same memory;
  • for example, the second data stream storage module may be a second data buffer area in the cache, the first data stream storage module may be a first data buffer area in the cache, and so on.
  • The local data stream network can be understood as a route used within the calculation engine to connect the computing core with the second data stream storage module.
  • the connection between computing cores can be controlled by a network router.
  • the main function of the aforementioned network router is to provide a skip path and a feedback path.
  • the local data flow network can be configured to form a flow path with different calculation cores available in the calculation engine.
  • the combination of the types and sequence of these computing cores along the flow path provides a continuous data processing pipeline for multiple layers in the deep learning neural network.
  • For example, if the combination of computing cores is a convolution core followed by a pooling core,
  • a convolutional neural network layer can be obtained.
  • If the combination of computing cores is a deconvolution core, then a pooling core, then an activation function core, a deconvolutional neural network layer can be obtained.
  • the combination of the type and sequence of the computing core is specifically determined by the target network configuration rule.
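To make the idea of combining core types and ordering concrete, here is a small NumPy-based sketch in which a convolution core, a pooling core, and an activation core are chained along a flow path selected by the configuration; the kernels, shapes, and function names are assumptions for illustration only, not the patent's hardware implementation.

```python
import numpy as np

def conv_core(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Tiny 'valid' 2-D convolution standing in for a convolution core."""
    kh, kw = kernel.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def pool_core(x: np.ndarray, size: int = 2) -> np.ndarray:
    """2x2 max pooling standing in for a pooling core."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

def relu_core(x: np.ndarray) -> np.ndarray:
    """ReLU standing in for an activation function core."""
    return np.maximum(x, 0.0)

def build_flow_path(cores):
    """The flow path (type and order of cores) is what the configuration rule
    selects, e.g. convolution -> pooling -> activation for one network layer."""
    def layer(x):
        for core in cores:
            x = core(x)
        return x
    return layer

if __name__ == "__main__":
    kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
    conv_layer = build_flow_path([lambda x: conv_core(x, kernel), pool_core, relu_core])
    print(conv_layer(np.random.rand(8, 8)).shape)    # (3, 3) for an 8x8 input
```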
  • FIG. 4 is a schematic flowchart of a data stream-based deep network acceleration method provided by an embodiment of the application. As shown in FIG. 4, the method includes:
  • the aforementioned data to be processed may be data that can be processed through a deep network, such as image data to be identified, target data to be detected, target data to be tracked, and so on.
  • The target deep network information is the information of the deep network corresponding to the data to be processed. For example, if the data to be processed is image data to be recognized, the target deep network information is the configuration parameters of the deep network used for image recognition; if the data to be processed is target data to be detected, the target deep network information is the configuration parameters of the deep network used for target detection.
  • the above-mentioned target deep network information may be preset, and the matching determination may be performed through the data to be processed, or may be manually selected and determined, which is not limited herein.
  • Obtaining the target deep network information can facilitate the configuration of the deep network.
  • the aforementioned deep network information may include network type, data type, number of layers, calculation type, and so on.
  • According to the target deep network information, a preset target network configuration rule corresponding to the target deep network information is matched, wherein the target network configuration rule includes a computing core, a second data stream storage module, and a local data stream network.
  • the aforementioned target deep network information already contains the network type, data type, number of layers, calculation type, etc. of the deep network required by the data to be processed.
  • The target network configuration rules can be set in advance, for example, the parameter rules and calculation rules of preset networks such as image recognition networks, target detection networks, and target tracking networks.
  • The parameter rules can be hyperparameter setting rules, weight setting rules, and so on.
  • The calculation rules can be calculation rules for addition, multiplication, convolution, deconvolution, and so on.
  • The configuration rules among the computing core, the second data stream storage module, and the local data stream network can be understood as the type and number of computing cores, and the connection method between the computing cores and the local data stream network.
  • The local data stream network can be configured by the control register.
  • Its implementation may be a router between the second data stream storage module and the computing cores.
  • the connection between computing cores can be controlled by a network router.
  • the main function of the aforementioned network router is to provide a skip path and a feedback path.
  • the foregoing configuration implements the target network configuration rule, which may be the connection relationship between the pre-configured computing core, the second data stream storage module, and the local data stream network.
  • The connection relationship may include the type of computing core, the number of connections, the connection sequence, and so on; computing cores can be connected with the local data stream network through interconnection to form a new computing engine, that is, a data flow engine, and the data flow engines required by different deep networks can be formed according to different computing core types, numbers of connections, and connection sequences.
  • The target data flow engine can thus be obtained to process the data to be processed. Since each computing core reads data through the second data stream storage module, the data in the second data stream storage module can be read into different computing cores to form a data stream.
  • For example, multiplication is performed in a multiplication core while the data that needs to be added is read into an addition core for addition, and so on. Since the data flow does not require instruction-set ordering, the configured data flow engine does not produce computation gaps.
  • the above-mentioned target data flow engine is configured through target network information, and can also be called a customized data flow engine.
  • The target data flow engine connects the second data stream storage module and each computing core through a local data stream network to form a data stream; compared with an instruction-set implementation, there is no need to wait for the previous instruction's reads and writes to complete, which improves calculation efficiency under the deep network architecture.
  • In this embodiment, the target deep network information required by the data to be processed is acquired; according to the target deep network information, a preset target network configuration rule corresponding to the target deep network information is matched, wherein the target network configuration rule includes a computing core, a second data stream storage module, and a local data stream network; a target data flow engine is configured according to the target network configuration rule; and the data to be processed is processed by the target data flow engine.
  • The deep network is accelerated through the data stream to reduce off-chip data communication, so there is no instruction idle overhead, which can improve the hardware acceleration efficiency of the deep network; moreover, through network configuration, the computing engines required by different deep network models can be configured, supporting the calculation engines required by a variety of deep network models.
  • FIG. 5 is a schematic flowchart of another data stream-based deep network acceleration method provided by an embodiment of the application. As shown in FIG. 5, the method includes:
  • According to the target deep network information, a preset target network configuration rule corresponding to the target deep network information is matched, wherein the target network configuration rule includes a computing core, a second data stream storage module, and a local data stream network.
  • the aforementioned computing core, the second data stream storage module, and the local data stream network are the main configurations of the data stream engine.
  • The computing core may be a convolution core, a pooling core, an activation function core, or another core for performing computation;
  • it may also be called a calculation core, a computing unit, a computing module, and so on.
  • The second data stream storage module may be a storage module with a data access function, such as a cache, DDR, or high-speed DDR, and the second data stream storage module and the first data stream storage module may be different storage areas on the same memory;
  • for example, the second data stream storage module may be a second data buffer area in the cache, the first data stream storage module may be a first data buffer area in the cache, and so on.
  • The local data stream network can be understood as a route used within the calculation engine to connect the computing core with the second data stream storage module.
  • the connection between computing cores can be controlled by a network router.
  • the main function of the aforementioned network router is to provide a skip path and a feedback path.
  • the local data flow network can be configured to form a flow path with different calculation cores available in the calculation engine.
  • the combination of the types and sequence of these computing cores along the flow path provides a continuous data processing pipeline for multiple layers in the deep learning neural network.
  • For example, if the combination of computing cores is a convolution core followed by a pooling core,
  • a convolutional neural network layer can be obtained.
  • If the combination of computing cores is a deconvolution core, then a pooling core, then an activation function core, a deconvolutional neural network layer can be obtained.
  • the combination of the type and sequence of the computing core is specifically determined by the target network configuration rule.
  • In this way, the calculation of the computing engine can be accelerated, thereby further improving the data processing efficiency of the deep network.
  • Processing the data to be processed by the target data flow engine includes:
  • reading the data to be processed into the second data stream storage module;
  • generating, in the second data stream storage module, an address sequence for the data to be processed according to a preset generation rule, based on the data format and data path of the data to be processed; and
  • in each clock cycle, reading from the second data stream storage module, according to the address sequence, the data volume corresponding to the computing core in the target data flow engine for input, and obtaining the states of the second data stream storage module and the computing core.
  • The second data stream storage module may be a cache, DDR, or high-speed-access DDR.
  • It is preferably a cache.
  • A controllable read-write address generation unit may be provided in the cache.
  • The address generation unit generates an adapted address sequence to index the data in the cache.
  • The address sequence can be used to index data in the cache and input it to the corresponding computing core. For example, if a computing core needs 80 data items for its calculation, then the 80 data items corresponding to the address sequence are read from the cache into the computing core.
  • The address generation unit can also use a counter so that the generated address sequence has different cycle sizes, for example a small cycle of data 1, data 2, data 3, which improves data reusability and also adapts to the data processing size of each computing core.
  • the state of the second data stream storage module includes: a data read preparation state and a data write completion state.
  • the state of the calculation core includes whether the calculation is completed and whether the next calculation data needs to be read.
  • The state of the second data stream storage module can be obtained by monitoring the state of the data in the second data stream storage module with a finite state machine, and the state of the computing core can be inferred from the state of the second data stream storage module; for example, after a calculation result has been written into the second data stream storage module, it can be determined that the state of the computing core is calculation completed.
  • In each clock cycle, the status of each computing core and of the second data stream storage module can be obtained, so that behavior can be accurately predicted; through accurate calculation scheduling, hardware performance can be exploited with maximum efficiency, further improving data processing efficiency.
  • The second data stream storage module includes a first storage unit and a second storage unit.
  • Processing the data to be processed by the target data flow engine includes:
  • inputting the data in the first storage unit into the computing core to obtain a calculation result; and
  • storing the calculation result in the second storage unit as the input data of the next computing core.
  • The first storage unit may be an input data stream storage unit,
  • and the second storage unit may be an output data stream storage unit.
  • The first storage unit and the second storage unit are used for alternate access of the data stream: the first storage unit inputs the input data into the computing core for calculation, and the computing core outputs the calculation result to the second storage unit for storage. This prevents the conflict in which the first storage unit is feeding data to the computing core while the computing core's output cannot be written back into the first storage unit.
  • For example, when the computing core needs to process a piece of data in the first storage unit twice, after the first calculation is completed the computing core can read the data from the first storage unit a second time while the first result is stored in the second storage unit, without waiting, which improves data processing efficiency.
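Read purely as a sketch of this alternating-access (ping-pong) scheme, with invented names and pure-Python stand-ins for the computing cores, the code below lets one storage unit feed a computing core while the core's results go into the other unit, which then becomes the input of the next core, so output writes never collide with pending input reads.

```python
from typing import Callable, List, Sequence

def run_engine(data: Sequence[float], cores: List[Callable[[float], float]]) -> List[float]:
    """Ping-pong between two storage units: the first unit supplies the current
    core's inputs while the second collects its results; the units then swap
    roles so the stored results become the next core's input without copies."""
    first_unit: List[float] = list(data)   # input data stream storage unit
    second_unit: List[float] = []          # output data stream storage unit

    for core in cores:
        second_unit = [core(x) for x in first_unit]   # results land in the 2nd unit
        first_unit, second_unit = second_unit, []     # swap: output feeds next core
    return first_unit

if __name__ == "__main__":
    double = lambda x: 2 * x      # stand-in for one computing core
    add_one = lambda x: x + 1     # stand-in for the next computing core
    print(run_engine([1.0, 2.0, 3.0], [double, add_one]))   # [3.0, 5.0, 7.0]
```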
  • FIG. 6 is a schematic diagram of a data stream-based deep network acceleration device provided by an embodiment of the application. As shown in FIG. 6, the device includes:
  • the first obtaining module 601 is configured to obtain target deep network information required by the data to be processed
  • the first matching module 602 is configured to match a preset target network configuration rule corresponding to the target deep network information according to the target deep network information, wherein the target network configuration rule includes configuration rules among a pre-configured calculation engine, a first data stream storage module, and a global data stream network;
  • the first configuration module 603 is configured to configure the target data flow network according to the target network configuration rule
  • the first processing module 604 is configured to process the data to be processed through the target data stream network.
  • the first configuration module 603 includes:
  • the global configuration sub-module is used to configure parallel or serial between multiple calculation engines according to the global data flow network
  • a path configuration submodule configured to obtain the data flow paths of the multiple calculation engines according to the parallel or serial between the first data flow storage module and the multiple calculation engines;
  • a forming sub-module is used to form the target data flow network based on the data flow path.
  • the first processing module 604 includes:
  • the first acquisition submodule is configured to read the to-be-processed data into the first data stream storage module
  • the first data address generation sub-module is configured to generate an address sequence for the data to be processed according to a preset generation rule in the first data stream storage module according to the data format and data path of the data to be processed;
  • the first input sub-module is used, in each clock cycle, to read from the first data stream storage module, according to the address sequence, the data volume corresponding to the calculation engine in the target data stream network for input, and to obtain the states of the first data stream storage module and the calculation engine.
  • the target network configuration rule further includes a computing core, a second data stream storage unit, and a local data stream network connecting the computing core and the second data stream storage unit;
  • the first configuration module 603 further includes:
  • the first local configuration sub-module is used to configure the interconnection between the computing core and the local data stream network to obtain the computing path of the computing core;
  • the first local path sub-module is configured to configure the interconnection between the second data stream storage unit and the local data stream network to obtain a storage path;
  • the first engine module is used to obtain the calculation engine according to the calculation path and the storage path.
  • FIG. 7 is a schematic diagram of a data stream-based deep network acceleration device provided by an embodiment of the application. As shown in FIG. 7, the device includes:
  • the second obtaining module 701 is used to obtain target deep network information required by the data to be processed
  • the second matching module 702 is configured to match a preset target network configuration rule corresponding to the target deep network information according to the target deep network information, wherein the target network configuration rule includes a computing core, a second data stream storage module, and a local data stream network;
  • the second configuration module 703 is configured to configure and obtain the target data flow engine according to the target network configuration rule
  • the second processing module 704 is configured to process the to-be-processed data through the target data flow engine.
  • the second configuration module 703 includes:
  • the second local configuration sub-module is used to configure the interconnection between the computing core and the local data stream network to obtain the computing path of the computing core;
  • the second local path sub-module is configured to configure the interconnection between the second data stream storage module and the local data stream network to obtain a storage path;
  • the second engine module is used to obtain the target data flow engine according to the calculation path and the storage path.
  • the second processing module 704 includes:
  • the second acquisition sub-module is configured to read the to-be-processed data into the second data stream storage module
  • the second data address generation sub-module is configured to generate an address sequence for the data to be processed according to a preset generation rule in the second data stream storage module according to the data format and data path of the data to be processed;
  • the second input sub-module is used, in each clock cycle, to read from the second data stream storage module, according to the address sequence, the data volume corresponding to the computing core in the target data flow engine for input, and to obtain the states of the second data stream storage module and the computing core.
  • the second processing module 704 includes:
  • the input calculation sub-module is used to input the data in the first storage unit into the calculation core to obtain the calculation result;
  • the output storage submodule is used to store the calculation result in the second storage unit as input data for the next calculation core.
  • An embodiment of the present application provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and capable of running on the processor.
  • When the processor executes the computer program, the steps in the data stream-based deep network acceleration method provided in the embodiments of this application are implemented.
  • An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed by a processor, the steps in the data stream-based deep network acceleration method provided in the embodiments of the present application are implemented.
  • The processors and chips in the various embodiments of the present application may be integrated into one processing unit, may exist alone physically, or two or more hardware components may be integrated into one unit.
  • the computer-readable storage medium or the computer-readable program can be stored in a computer-readable memory.
  • The technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a memory and includes a number of instructions that enable a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the method described in each embodiment of the present application.
  • the aforementioned memory includes: U disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), mobile hard disk, magnetic disk or optical disk and other various media that can store program codes.
  • The program can be stored in a computer-readable memory, and the memory can include: a flash disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and so on.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present invention relates to a data stream-based deep network acceleration method and apparatus, a device, and a storage medium. The method comprises: obtaining target deep network information required by data to be processed; according to the target deep network information, matching a preset target network configuration rule corresponding to the target deep network information, the target network configuration rule comprising pre-configured configuration rules among a calculation engine, a first data stream storage module, and a global data stream network; performing configuration so as to obtain a target data stream network according to the target network configuration rule; and processing said data by means of the target data stream network. A deep network is accelerated by means of the data stream and off-chip data communication is reduced, so instruction idle overhead is avoided and the hardware acceleration efficiency of the deep network can be improved; moreover, different deep network models can be configured by performing network configuration, and multiple different deep network models are supported.
PCT/CN2019/082101 2019-04-09 2019-04-10 Procédés et appareils d'accélération de réseau profond basés sur un flux de données, dispositif, et support de stockage WO2020206637A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910280156.2 2019-04-09
CN201910280156.2A CN110046704B (zh) 2019-04-09 2019-04-09 基于数据流的深度网络加速方法、装置、设备及存储介质

Publications (1)

Publication Number Publication Date
WO2020206637A1 true WO2020206637A1 (fr) 2020-10-15

Family

ID=67276511

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/082101 WO2020206637A1 (fr) 2019-04-09 2019-04-10 Procédés et appareils d'accélération de réseau profond basés sur un flux de données, dispositif, et support de stockage

Country Status (2)

Country Link
CN (1) CN110046704B (fr)
WO (1) WO2020206637A1 (fr)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021026768A1 (fr) * 2019-08-13 2021-02-18 深圳鲲云信息科技有限公司 Procédé et appareil de conduite automatique basés sur un flux de données et dispositif électronique et support de stockage
CN113272792A (zh) * 2019-10-12 2021-08-17 深圳鲲云信息科技有限公司 本地数据流加速方法、数据流加速系统及计算机设备
CN112905525B (zh) * 2019-11-19 2024-04-05 中科寒武纪科技股份有限公司 控制运算装置进行计算的方法及设备
CN111404770B (zh) * 2020-02-29 2022-11-11 华为技术有限公司 网络设备、数据处理方法、装置、系统及可读存储介质
CN111857989B (zh) * 2020-06-22 2024-02-27 深圳鲲云信息科技有限公司 人工智能芯片和基于人工智能芯片的数据处理方法
CN111753994B (zh) * 2020-06-22 2023-11-03 深圳鲲云信息科技有限公司 Ai芯片的数据处理方法、装置和计算机设备
CN111752887B (zh) * 2020-06-22 2024-03-15 深圳鲲云信息科技有限公司 人工智能芯片和基于人工智能芯片的数据处理方法
CN111737193B (zh) * 2020-08-03 2020-12-08 深圳鲲云信息科技有限公司 数据存储方法、装置、设备和存储介质
CN114021708B (zh) * 2021-09-30 2023-08-01 浪潮电子信息产业股份有限公司 一种数据处理方法、装置、系统、电子设备及存储介质
CN114461978B (zh) * 2022-04-13 2022-07-08 苏州浪潮智能科技有限公司 数据处理方法、装置、电子设备及可读存储介质
CN116974654B (zh) * 2023-09-21 2023-12-19 浙江大华技术股份有限公司 一种图像数据的处理方法、装置、电子设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014105309A1 (fr) * 2012-12-31 2014-07-03 Mcafee, Inc. Système et procédé pour mettre en corrélation des informations de réseau avec des informations d'abonné dans un environnement de réseau mobile
CN106447034A (zh) * 2016-10-27 2017-02-22 中国科学院计算技术研究所 一种基于数据压缩的神经网络处理器、设计方法、芯片
CN108154165A (zh) * 2017-11-20 2018-06-12 华南师范大学 基于大数据与深度学习的婚恋对象匹配数据处理方法、装置、计算机设备和存储介质
CN108710941A (zh) * 2018-04-11 2018-10-26 杭州菲数科技有限公司 用于电子设备的神经网络模型的硬加速方法和装置
CN109445935A (zh) * 2018-10-10 2019-03-08 杭州电子科技大学 云计算环境下一种高性能大数据分析系统自适应配置方法

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11055063B2 (en) * 2016-05-02 2021-07-06 Marvell Asia Pte, Ltd. Systems and methods for deep learning processor
US11216722B2 (en) * 2016-12-31 2022-01-04 Intel Corporation Hardware accelerator template and design framework for implementing recurrent neural networks
US20180189641A1 (en) * 2017-01-04 2018-07-05 Stmicroelectronics S.R.L. Hardware accelerator engine
CN107066239A (zh) * 2017-03-01 2017-08-18 智擎信息系统(上海)有限公司 一种实现卷积神经网络前向计算的硬件结构


Also Published As

Publication number Publication date
CN110046704A (zh) 2019-07-23
CN110046704B (zh) 2022-11-08

Similar Documents

Publication Publication Date Title
WO2020206637A1 (fr) Procédés et appareils d'accélération de réseau profond basés sur un flux de données, dispositif, et support de stockage
US10713568B2 (en) Apparatus and method for executing reversal training of artificial neural network
US11893414B2 (en) Operation method, device and related products
KR102486030B1 (ko) 완전연결층 신경망 정방향 연산 실행용 장치와 방법
WO2018171717A1 (fr) Procédé et système de conception automatisée pour processeur de réseau neuronal
CN109086877B (zh) 一种用于执行卷积神经网络正向运算的装置和方法
US11915139B2 (en) Modifying machine learning models to improve locality
EP3407265B1 (fr) Dispositif et procédé permettant d'exécuter un calcul depuis l'origine d'un réseau de neurones artificiels
AU2014203218B2 (en) Memory configuration for inter-processor communication in an MPSoC
US11294599B1 (en) Registers for restricted memory
TWI634489B (zh) 多層人造神經網路
US11694075B2 (en) Partitioning control dependency edge in computation graph
US10990525B2 (en) Caching data in artificial neural network computations
Voss et al. Convolutional neural networks on dataflow engines
CN113496248A (zh) 训练计算机实施的模型的方法和设备
US20210125042A1 (en) Heterogeneous deep learning accelerator
JPWO2020188658A1 (ja) アーキテクチャ推定装置、アーキテクチャ推定方法、およびアーキテクチャ推定プログラム
US11797280B1 (en) Balanced partitioning of neural network based on execution latencies
Abeyrathne et al. Offloading specific performance-related kernel functions into an FPGA
WO2024120050A1 (fr) Procédé de fusion d'opérateurs utilisé pour un réseau de neurones artificiels, et appareil associé
US20230126594A1 (en) Instruction generating method, arithmetic processing device, and instruction generating device
CN115204086A (zh) 片上网络仿真模型及动态路径规划方法、装置、多核芯片
AU2015271896A1 (en) Selection of system-on-chip component models for early design phase evaluation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19924206

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19924206

Country of ref document: EP

Kind code of ref document: A1