CN110321998B - Convolutional neural network implementation method and device, acceleration equipment and storage medium - Google Patents

Convolutional neural network implementation method and device, acceleration equipment and storage medium Download PDF

Info

Publication number
CN110321998B
CN110321998B CN201810278677.XA
Authority
CN
China
Prior art keywords
operator
operators
neural network
convolutional neural
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810278677.XA
Other languages
Chinese (zh)
Other versions
CN110321998A (en)
Inventor
李天平
孙晓明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xilinx Inc
Original Assignee
Xilinx Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xilinx Inc filed Critical Xilinx Inc
Priority to CN201810278677.XA priority Critical patent/CN110321998B/en
Publication of CN110321998A publication Critical patent/CN110321998A/en
Application granted granted Critical
Publication of CN110321998B publication Critical patent/CN110321998B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a convolutional neural network implementation method and device, acceleration equipment, and a storage medium. The convolutional neural network implementation method comprises the following steps: an operator dependency obtaining step of collecting, for the operators in each operator layer formed by the operators of the convolutional neural network, the dependencies between operators in different layers; and an operator fusion step of fusing a plurality of operators that satisfy an interdependency condition into a new operator, which replaces those operators. The convolutional neural network implementation method improves computational efficiency and reduces energy consumption.

Description

Convolutional neural network implementation method and device, acceleration equipment and storage medium
Technical Field
The present invention relates to convolutional neural networks, and more particularly, for example, to convolutional neural network implementation methods and apparatuses.
Background
A Convolutional Neural Network (CNN) is a feed-forward neural network, generally composed of one or more convolutional layers and a fully connected layer at the top (corresponding to a classical neural network), together with associated weights and pooling layers. Convolutional neural networks can give better results in image and speech recognition than other deep learning structures. Compared with other deep feed-forward neural networks, a convolutional neural network has fewer parameters to consider, which makes it an attractive deep learning structure.
After being trained, a CNN needs to be deployed on a corresponding platform to provide services to users. At present, CNNs mainly run on CPUs and GPUs, according to the needs of different application scenarios. For energy-consumption-sensitive scenarios, dedicated FPGA or ASIC acceleration chips have also emerged. These acceleration chips are typically integrated into embedded systems to form heterogeneous acceleration systems, in which the CPU is mainly responsible for business logic and task scheduling, while the FPGA/ASIC is mainly responsible for CNN forward acceleration operations. During the CNN forward operation, the FPGA/ASIC can make full use of on-chip storage and minimize the number of memory accesses through data-reuse techniques, thereby improving computational efficiency and reducing energy consumption. However, because bandwidth and storage space are limited in an embedded-system environment, the following two problems still exist when deploying networks with large depth and large parameter counts:
(1) current ASIC/FPGA chips mainly reduce the number of memory accesses through parameter-reuse mechanisms, but a data-reuse mechanism is generally lacking for Feature Map data passed between different layers; when feature maps are large, this causes excessive memory-access overhead, which hurts computational efficiency as well as the real-time performance and energy consumption of the system;
(2) in embedded systems, off-chip storage capacity is generally limited; when the network is very deep and there is a large amount of feature map data, the off-chip storage space cannot meet the requirement because corresponding Input and Output (IO) space must be allocated for each feature map, so deploying such networks remains very difficult, which restricts the application scenarios of CNNs.
Therefore, for scenarios in which storage capacity and bandwidth are limited in an embedded environment, it is necessary to provide an efficient storage and computation compilation optimization method for CNN forward acceleration systems.
Disclosure of Invention
To solve one of the above problems, the present invention provides an implementation scheme for convolutional neural network computation which can further accelerate convolutional neural network computation.
The invention provides a convolutional neural network implementation method, which comprises the following steps: an operator dependency obtaining step of collecting, for operators in each operator layer formed by the operators of the convolutional neural network, the dependencies between operators in different layers; and an operator fusion step of fusing a plurality of operators that are located on different operator layers and satisfy an interdependency condition into a new operator, wherein the new operator is used to replace the plurality of operators.
Optionally, the interdependence conditions between operators of different layers include a one-to-one correspondence between operators of different layers.
Alternatively, optionally, the interdependency conditions between operators of different layers comprise a one-to-one correspondence between operators of adjacent layers.
Optionally, fusing the plurality of operators satisfying the interdependency conditions into a new operator comprises sequentially recording the plurality of operators as a new abstract operator.
Optionally, the new operator is used as a single operator in an operator layer of the convolutional neural network, and the operator dependency relationship obtaining step and the operator fusing step are performed.
Optionally, the convolutional neural network implementation method further includes: an off-chip storage multiplexing step, in which memory is allocated for the output of the current operator, the correlation attribute value of the dependency between the current operator and its upstream operator, obtained in the operator dependency obtaining step, is adjusted as the memory allocation for the current operator is completed, and the memory of the upstream operator is released according to the change in the correlation attribute value.
The invention also provides a convolutional neural network implementation method, which comprises the following steps: an operator dependency obtaining step of collecting, for operators in an operator layer formed by the operators of the convolutional neural network, the dependencies between operators in different layers; and an off-chip storage multiplexing step of allocating memory for the output of the current operator, and then releasing the memory of an upstream operator based on the change in the correlation attribute value of the dependency between the current operator and its upstream operator, obtained in the operator dependency obtaining step.
Optionally, the correlation attribute value of the dependency of the current operator with its upstream operator is used to indicate the number of nodes downstream of the upstream operator.
Optionally, when the correlation attribute value of the dependency relationship between the current operator and the upstream operator is 0, releasing the memory of the upstream operator.
Optionally, the above implementation method of the convolutional neural network further includes an operator fusion step, in which multiple operators located on different operator layers and satisfying interdependence conditions are fused into a new operator, and the new operator is used to replace the multiple operators.
Optionally, fusing the plurality of operators satisfying the interdependence condition into a new operator comprises sequentially recording the plurality of operators as a new abstract operator.
Optionally, the new operator is used as a single operator in an operator layer of the convolutional neural network, and the operator dependency obtaining step and the operator fusion step are performed.
According to the present invention, there is provided a convolutional neural network implementing apparatus, the apparatus including: an operator dependency obtaining part configured to collect, for operators in each operator layer formed by the operators of the convolutional neural network, the dependencies between operators in different layers; and an operator fusion part that fuses a plurality of operators located on different operator layers and satisfying an interdependency condition into a new operator, which is used to replace the plurality of operators.
Optionally, the apparatus of the present invention further comprises: an off-chip storage multiplexing part that allocates memory for the output of the current operator, adjusts the correlation attribute value of the dependency between the current operator and its upstream operator, obtained by the operator dependency obtaining part, as the memory allocation for the current operator is completed, and releases the memory of the upstream operator according to the change in the correlation attribute value.
The invention provides a convolutional neural network implementation device, which comprises: an operator dependency obtaining part configured to collect, for operators in an operator layer formed by the operators of the convolutional neural network, the dependencies between operators in different layers; and an off-chip storage multiplexing part that allocates memory for the output of the current operator, adjusts the correlation attribute value of the dependency between the current operator and its upstream operator, obtained by the operator dependency obtaining part, as the memory allocation for the current operator is completed, and releases the memory of the upstream operator according to the change in the correlation attribute value.
Optionally, the convolutional neural network implementing apparatus further includes: and an operator fusion part fusing a plurality of operators which are positioned on different operator layers and satisfy the interdependence relation condition into a new operator, wherein the new operator is used for replacing the plurality of operators.
An acceleration apparatus according to an embodiment of the present invention includes: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the convolutional neural network implementing method described above.
A non-transitory machine-readable storage medium according to an embodiment of the present invention has stored thereon executable code that, when executed by a processor of an electronic device, causes the processor to perform the convolutional neural network implementation method described above.
For scenarios in which storage capacity and bandwidth are limited in an embedded environment, the present invention, according to an embodiment thereof, provides an efficient convolutional neural network implementation scheme for CNN forward acceleration systems: by collecting and counting the interrelationships between operators of different operator layers and fusing a plurality of operators that have a specific relationship, the CNN computation graph is simplified, which greatly reduces memory-access overhead, improves on-chip data usage efficiency, and greatly improves computational efficiency.
Moreover, for scenarios in which off-chip storage capacity and bandwidth are limited in an embedded environment, another embodiment of the present invention also provides an efficient convolutional neural network implementation scheme for CNN forward acceleration systems.
Furthermore, especially in scenarios where the CNN network is very deep or the feature maps are very large, the invention can greatly reduce the overall memory footprint, thereby reducing the difficulty of network deployment and expanding the range of CNN usage scenarios.
Furthermore, by progressively fusing operators that have a specific relationship, the CNN computation graph can be simplified to the greatest extent, which further reduces memory-access overhead, improves on-chip data usage efficiency, and further improves computational efficiency.
It will be understood by those skilled in the art that the above technical effects of the present invention do not exist alone but can be combined with each other by combination of various technical features of the present invention.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the disclosure.
FIG. 1 illustrates a flow diagram of a convolutional neural network implementation in accordance with an embodiment of the present invention.
FIG. 2 illustrates an example of operator fusion in accordance with the present invention.
FIG. 3 illustrates a flow diagram of a convolutional neural network implementation in accordance with another embodiment of the present invention.
Fig. 4 illustrates an example of off-chip storage multiplexing according to the present invention.
FIG. 5 illustrates a flow diagram of a convolutional neural network implementation in accordance with another embodiment of the present invention.
Fig. 6 illustrates a convolutional neural network implementing apparatus according to an embodiment of the present invention.
Fig. 7 illustrates a convolutional neural network implementing apparatus according to another embodiment of the present invention.
Fig. 8 illustrates a convolutional neural network implementing apparatus according to an embodiment of the present invention.
Fig. 9 illustrates a schematic block diagram of an acceleration apparatus in which an embodiment according to the present invention may be implemented.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. It should be noted that the numbers, serial numbers and reference numbers in the present application are only presented for convenience of description, and no limitation is made to the steps, the sequence and the like of the present invention unless the specific sequence of the steps is explicitly indicated in the specification.
According to an embodiment of the present invention, a convolutional neural network implementation method is provided, and a flowchart of the method is illustrated in fig. 1.
1) Step S1 of obtaining operator dependencies
In a CNN network, each operator can be regarded as an operator node, and the upstream-downstream relationships between operator nodes on the CNN computation graph form a hierarchy of operator nodes. For example, as shown in fig. 4, operator OP1 can be considered to be at the first layer, operators OP2 and OP3 at the second layer, and operators OP4 and OP5 at the third layer.
Step S1 is mainly used to collect the dependencies between different operators in the CNN network, in particular the dependencies between operators of different layers, and especially the interdependencies between upstream and downstream operators. Note that, for ease of understanding and description, the dependencies between operators are explained here taking two layers of operators as an example.
First, of the two layers, a node in the upper layer is defined as a predecessor node (i.e., an upstream node), and a node in the lower layer is defined as a successor node (i.e., a downstream node). Accordingly, the dependencies between the operators of the two layers can be divided into two types, predecessor-M and successor-M:
(1) predecessor-M: the operator node is depended on by M successor nodes, i.e., the output of this operator is taken as input by M successor nodes, where M is a natural number;
(2) successor-M: the operator node depends on M predecessor nodes, i.e., this operator needs the outputs of M predecessor operators as its inputs.
For example, the compiler may traverse all operators of the CNN network and record, in turn, each operator's relationship to its predecessor (upstream) and successor (downstream) nodes. After the traversal is completed, each operator node can be given an attribute that records its dependency relationship with its upstream and downstream operator nodes, and this attribute can be represented as (predecessor-M, successor-M).
For the parameters in this attribute representation, the following may be specified: a predecessor-M or successor-M parameter value of 0 indicates that there is no corresponding dependency; a parameter value of 1 indicates that the dependency exists.
In this step, the dependencies between operators in different layers, especially between upstream and downstream operators, are collected and counted for use by the subsequent operator fusion step, which fuses operators so as to reduce the number of operators on the CNN computation graph and thereby simplify the CNN computation graph.
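Purely for illustration, the dependency-collection pass described above might be sketched in Python as follows. The Node class, the helper name collect_dependencies and the field names pred_m / succ_m (standing for the predecessor-M and successor-M attributes) are assumptions introduced here for explanation only and are not taken from the patent itself.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        # One operator node of the CNN computation graph (illustrative structure only).
        name: str
        inputs: list = field(default_factory=list)   # predecessor (upstream) nodes
        outputs: list = field(default_factory=list)  # successor (downstream) nodes
        pred_m: int = 0  # predecessor-M: how many successor nodes depend on this node
        succ_m: int = 0  # successor-M: how many predecessor nodes this node depends on

    def collect_dependencies(ops):
        # Traverse all operators and record the (predecessor-M, successor-M) attribute
        # of each node; a value of 0 means there is no dependency in that direction.
        for op in ops:
            op.pred_m = len(op.outputs)
            op.succ_m = len(op.inputs)
        return {op.name: (op.pred_m, op.succ_m) for op in ops}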
2) Step S2 operator fusion
In step S2, operators satisfying the interdependency condition are fused, so that on-chip data can be reused to a high degree.
In step S2, it is first determined whether operators satisfy the fusion condition. For example, this can be done as follows. Starting from the input node (which can be regarded as a layer-0 or layer-1 operator node), the operator nodes in the CNN computation graph are traversed in turn, and the dependencies between operator nodes obtained in step S1 are used to determine whether to fuse the current node with its successor node.
Here, the operator fusion condition (interdependency condition) may, for example, be: the predecessor-M value of the current operator node is 1 and the successor-M value of its successor node is 1, i.e., the current operator is in one-to-one correspondence with its immediately downstream operator.
If the condition is judged to be satisfied, the current operator node and its successor operator node can be fused into a new operator node that replaces the original operators, so that the number of operators on the CNN computation graph is reduced and the CNN computation graph is simplified.
Here, after the current traversal of the CNN is completed, because new abstract nodes now exist, the operator dependency obtaining step and the operator fusion step may be performed again; this process can be repeated until no further fusable operators can be found, so that the CNN computation graph is simplified to the greatest extent.
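A rough sketch of such a fusion pass is given below, reusing the hypothetical Node structure and collect_dependencies helper from the previous sketch; the function names fuse_once and fuse_graph and the naming scheme for the merged node are likewise assumptions made only for illustration.

    def fuse_once(ops):
        # Find one fusable pair: an operator whose predecessor-M value is 1 and whose
        # single successor has a successor-M value of 1 (one-to-one correspondence),
        # then merge the pair into a new abstract operator that replaces both.
        for op in ops:
            if op.pred_m != 1:
                continue
            succ = op.outputs[0]
            if succ.succ_m != 1:
                continue
            merged = Node(name=f"{op.name}+{succ.name}",  # e.g. "convolution+pooling"
                          inputs=op.inputs, outputs=succ.outputs)
            for p in op.inputs:     # rewire upstream neighbours to the new node
                p.outputs = [merged if n is op else n for n in p.outputs]
            for s in succ.outputs:  # rewire downstream neighbours to the new node
                s.inputs = [merged if n is succ else n for n in s.inputs]
            ops.remove(op); ops.remove(succ); ops.append(merged)
            return True
        return False

    def fuse_graph(ops):
        # Alternate dependency collection and fusion until no fusable pair remains,
        # so that the computation graph is simplified as far as possible.
        while True:
            collect_dependencies(ops)
            if not fuse_once(ops):
                break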
In the following, operator fusion is explained taking fig. 2 as an example. As can be seen from fig. 2, the convolution operator and the pooling operator satisfy the above dependency condition, so the two operators can be fused into a new operator. The newly formed operator may be represented in the order of the operators' positions on the CNN computation graph before fusion. For example, if the upstream operator before fusion is "convolution" and the downstream operator that depends on it is "pooling", then after fusion the new operator may be expressed, based on their positional relationship, as "convolution pooling" or "convolution + pooling", and so on.
Here, the way of representing the newly formed operator is not limited to this, as long as the positional relationship or connection relationship (connection order) of the operators before fusion can be recognized in the subsequent computation process.
For the above example, before fusion the convolution operator needs to write the output result held in its on-chip cache to the off-chip storage space, so that when the pooling operation is performed, the intermediate result written by the preceding convolution operation must be read back from the off-chip storage space as input, requiring many frequent memory accesses. After fusion, the convolution operation does not need to write the intermediate result to the off-chip storage space, and the pooling operation can read the result directly from the on-chip cache as its input, which greatly reduces memory-access overhead.
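The effect can be illustrated with a toy NumPy sketch (single channel, "valid" convolution followed by max pooling); the function name and the row-tiling scheme are assumptions used only to show the idea and do not describe the accelerator's actual kernel. Only a few convolution rows are kept in the simulated on-chip tile at any time, so the full intermediate feature map never has to be written to off-chip memory.

    import numpy as np

    def fused_conv_pool(x, w, pool=2):
        # Fused convolution + max pooling: compute only the `pool` convolution rows
        # that one row of pooling windows needs (held in an on-chip tile), pool them,
        # and move on, instead of materialising the whole convolution output off-chip.
        kh, kw = w.shape
        ch, cw = x.shape[0] - kh + 1, x.shape[1] - kw + 1   # convolution output size
        out = np.empty((ch // pool, cw // pool))
        for po in range(ch // pool):
            tile = np.empty((pool, cw))                     # simulated on-chip buffer
            for r in range(pool):
                row = po * pool + r
                for c in range(cw):
                    tile[r, c] = np.sum(x[row:row + kh, c:c + kw] * w)
            for pc in range(cw // pool):
                out[po, pc] = tile[:, pc * pool:(pc + 1) * pool].max()
        return out

In this sketch the unfused version would need a buffer for the full ch x cw convolution result, whereas the fused version only ever holds a pool x cw tile before producing each row of pooled outputs.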
Therefore, the operator fusion technique can fully exploit on-chip data reuse and greatly reduce the number of memory accesses, thereby greatly reducing power consumption and improving computational efficiency.
Preferably, for operator fusion, if the current convolutional neural network hardware accelerator provides operation units for both the predecessor operator and the successor operator, more operations (operators) can be fused and the improvement in computational efficiency is greater. However, even without such operation units, a plurality of conventional operators can still be fused to realize on-chip data reuse and reduce the number of memory accesses, thereby reducing power consumption and improving computational efficiency.
According to another embodiment of the present invention, a convolutional neural network implementation method is proposed, a flowchart of which is illustrated in fig. 3, and will be described below with reference to fig. 3.
1) Step S1': operator dependency obtaining
Step S1' is mainly used to collect the dependencies between different operators in the CNN network, in particular the dependencies between operators of different layers, and especially between upstream and downstream operators. The operation of this step is similar to step S1 in fig. 1 and is not described again here.
2) Step S2': off-chip storage multiplexing
The aforementioned step S2 (the operator fusion step) is mainly used to reuse on-chip data (reducing the number of memory accesses), whereas step S2' is mainly used to reuse the off-chip storage space: based on the operator dependencies obtained in step S1', the off-chip storage space is allocated rationally and memory usage efficiency is improved.
In the present invention, in order to realize off-chip storage space reuse, the dependency relationships between operators proposed by the present invention are taken into account when allocating memory for the operators in the CNN computation graph. For example, the number of downstream nodes of an upstream node of the current operator node is used; this number can be represented by the "predecessor-M" parameter of that upstream node (i.e., the number of operator nodes that depend on the upstream node), which is one of the attribute values used above to describe the dependency relationships. For example, the specific operations may be as follows:
(1) for each operator in an operator layer, memory is allocated for its output in turn, and each time the memory allocation for one output of the current operator is completed, 1 is subtracted from the attribute value of the predecessor node (upstream node) of the current operator;
(2) when the attribute value of a predecessor node (upstream node) of the current operator reaches 0 (meaning that memory has been allocated for the outputs of all operators in the current layer that depend on it), the memory space of that predecessor node, which holds the input of the current layer, is released and put into the memory pool for recycling.
The above describes, as an example, a method that decides whether to release the memory of an upstream node one operator layer at a time; the present invention is not limited to this, and the decision may instead be made at intervals of at least one operator layer. For example, the memory of an upstream node (which may be the upstream node two operator layers earlier) may be released once every two operator layers; there is no limitation in this respect, as long as the off-chip memory space can be reused rationally. For easier understanding, an example is given with reference to fig. 4: after memory has been allocated for the outputs of all the successor nodes (OP2 and OP3, each taken in turn as the current operator node) in the layer immediately below operator OP1 (the upstream operator node), the output memory of OP1 can be recycled and put into the memory pool, so that the memory space released by OP1 can be reused when output memory space is allocated for OP4 and OP5, thereby realizing very efficient memory reuse.
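A minimal sketch of this allocation procedure is given below, reusing the hypothetical Node structure from the earlier sketches; the free-list "memory pool" and the integer buffer identifiers are assumptions used only to illustrate the bookkeeping, not an actual allocator.

    def allocate_with_reuse(layers):
        # Walk the operator layers in order, allocate an output buffer for each
        # operator, decrement the predecessor-M counter of each of its upstream
        # nodes, and recycle an upstream node's buffer once that counter reaches 0.
        pool = []      # released buffers available for reuse (the "memory pool")
        buffers = {}   # operator name -> buffer id
        next_id = 0
        for layer in layers:
            for op in layer:
                if pool:                        # reuse a released buffer if possible
                    buffers[op.name] = pool.pop()
                else:                           # otherwise allocate a new one
                    buffers[op.name] = next_id
                    next_id += 1
                for pred in op.inputs:          # this output is now allocated
                    pred.pred_m -= 1
                    if pred.pred_m == 0:        # all consumers served: release
                        pool.append(buffers[pred.name])
        return buffers, next_id                 # next_id = distinct buffers used

On the graph of fig. 4 this bookkeeping uses three buffers instead of five: the buffer released by OP1 is reused for OP4, and the buffer released by OP2 is reused for OP5.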
Therefore, through this off-chip memory reuse technique, the overall memory footprint can be greatly reduced, especially in scenarios where the CNN network is very deep or the feature maps are very large, which reduces the difficulty of network deployment and expands the range of CNN usage scenarios.
In conclusion, the invention provides a storage and computation compilation optimization method for convolutional neural network forward acceleration systems on embedded heterogeneous platforms, which can maximize the usage efficiency of on-chip data and reduce the use of off-chip memory space, thereby improving computational efficiency and reducing energy consumption.
In addition, according to still another embodiment of the present invention, a convolutional neural network implementation method is provided, and a flowchart of the method is illustrated in fig. 5, which will be described below with reference to fig. 5.
1) Step S1 ″: operator dependency acquisition
Step S1″ is mainly used to collect the dependencies between different operators in the CNN network, in particular the dependencies between operators of different layers, and especially between upstream and downstream operators.
The operation of this step is similar to step S1 in fig. 1 and step S1' in fig. 3, and is not described again here.
2) Step S2 ″: operator fusion
In step S2″, operators satisfying the interdependency condition are fused, so that on-chip data can be reused to a high degree.
The operation of this step is similar to step S2 in fig. 1, and is not described herein again.
3) Step S3 ″: off-chip memory multiplexing
This step is mainly used to reuse the off-chip storage space; based on the operator dependencies obtained in step S1″, the off-chip storage space is allocated rationally so as to improve memory usage efficiency.
The operation of this step is similar to step S2' in fig. 3, and is not described again here.
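Purely as an illustrative tie-together of the three operations on the graph of fig. 4, and again reusing the hypothetical helpers sketched earlier (Node, collect_dependencies, fuse_graph, allocate_with_reuse), the combined flow could look roughly as follows.

    # Build the fig. 4 graph: OP1 feeds OP2 and OP3; OP2 feeds OP4; OP3 feeds OP5.
    op1, op2, op3, op4, op5 = (Node(n) for n in ("OP1", "OP2", "OP3", "OP4", "OP5"))
    op1.outputs = [op2, op3]; op2.inputs = [op1]; op3.inputs = [op1]
    op2.outputs = [op4]; op4.inputs = [op2]
    op3.outputs = [op5]; op5.inputs = [op3]
    ops = [op1, op2, op3, op4, op5]

    collect_dependencies(ops)   # operator dependency obtaining
    fuse_graph(ops)             # operator fusion: OP2+OP4 and OP3+OP5 merge (one-to-one)
    collect_dependencies(ops)   # refresh the attributes of the fused graph
    layers = [[op1], [op for op in ops if op is not op1]]
    buffers, used = allocate_with_reuse(layers)  # off-chip storage multiplexing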
It should be noted that the above merely shows examples of the three operations of operator dependency obtaining, operator fusion and off-chip storage multiplexing; those skilled in the art will understand that these three operations may be combined and modified in various other ways, and such combinations should also be included in the scope of the present invention.
According to an embodiment of the present invention, a convolutional neural network implementation apparatus is provided.
As shown in fig. 6, the convolutional neural network implementing apparatus 100 may include an operator dependency obtaining part 110 and an operator fusion part 120.
The operator dependency obtaining part 110 is configured to collect, for the operators in each operator layer formed by the operators of the convolutional neural network, the dependencies between operators in different layers, especially the dependencies between upstream and downstream operators. These dependencies are similar to those described above and are not described in detail here. In the present invention, the operator dependency obtaining part 110 may be configured to perform operations similar to the operator dependency obtaining step described above, which are not repeated here.
The operator fusion part 120 is configured to fuse a plurality of operators located on different operator layers that satisfy the interdependency condition into a new operator, which is used to replace the plurality of operators. In the present invention, the operator fusion part 120 can be used to perform operations similar to the operator fusion step described above, which are not repeated here.
According to another embodiment of the present invention, a convolutional neural network implementing apparatus is provided.
As shown in fig. 7, the convolutional neural network implementing apparatus 200 may include an operator dependency obtaining part 210 and an off-chip storage multiplexing part 220.
Here, the operator dependency obtaining part 210 is similar to the aforementioned operator dependency obtaining part 110, and is configured to collect, for operators in an operator layer formed by operators in a convolutional neural network, dependencies between operators of different layers.
The off-chip storage multiplexing part 220 is configured to allocate memory for the output of the current operator, adjust the correlation attribute value of the dependency between the current operator and its upstream operator, obtained by the operator dependency obtaining part 210, as the memory allocation for the current operator is completed, and release the memory of the upstream operator according to the change in the correlation attribute value. The off-chip storage multiplexing part 220 may perform operations similar to the off-chip storage multiplexing step described previously.
According to still another embodiment of the present invention, an apparatus for implementing a convolutional neural network is provided.
As shown in fig. 8, the convolutional neural network implementing apparatus 300 may include an operator dependency obtaining part 310, an operator fusion part 320, and an off-chip storage multiplexing part 330.
Here, the operator dependency obtaining part 310 functions similarly to the aforementioned operator dependency obtaining parts 110 and 210, the operator fusion part 320 functions similarly to the aforementioned operator fusion part 120, and the off-chip storage multiplexing part 330 functions similarly to the aforementioned off-chip storage multiplexing part 220; a detailed description thereof is therefore omitted.
The convolutional neural network implementation method of the present invention can be applied both to scenarios in which convolutional neural network computation is implemented in software and to scenarios in which convolutional neural network computation is implemented with hardware accelerators such as FPGAs or ASICs.
Fig. 9 is a schematic structural diagram of a computing acceleration device according to an embodiment of the present invention.
Referring to fig. 9, the acceleration apparatus 1 includes a memory 10 and a processor 20.
The processor 20 may be a multi-core processor or may include a plurality of processors. In some embodiments, processor 20 may comprise a general-purpose host processor and one or more special purpose coprocessors such as, for example, a Graphics Processor (GPU), a Digital Signal Processor (DSP), or the like. In some embodiments, processor 20 may be implemented using custom circuits, such as an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA).
The memory 10 may include various types of storage units, such as system memory, read-only memory (ROM), and a permanent storage device. The ROM may store static data or instructions required by the processor 20 or other modules of the computer. The permanent storage device may be a readable and writable storage device, and may be a non-volatile storage device that does not lose the stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the permanent storage device. In other embodiments, the permanent storage device may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable volatile memory device, such as dynamic random access memory. The system memory may store instructions and data that some or all of the processors require at runtime. Furthermore, the memory 10 may comprise any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) as well as magnetic and/or optical disks. In some embodiments, the memory 10 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., SD card, mini SD card, Micro-SD card, etc.), a magnetic floppy disk, and so on. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 10 stores executable code which, when executed by the processor 20, causes the processor 20 to perform the convolutional neural network implementation method described above.
The convolutional neural network implementation method according to the present invention has been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
While embodiments of the present invention have been described above, the above description is illustrative, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (12)

1. A convolutional neural network implementation method, comprising:
an operator dependency obtaining step of traversing all operators in the operator layers formed by the operators of the convolutional neural network and collecting the dependencies between operators in different layers; and
an off-chip storage multiplexing step of allocating memory for the output of a current operator, and then releasing the memory of an upstream operator based on the change in the correlation attribute value of the dependency between the current operator and its upstream operator, obtained in the operator dependency obtaining step.
2. The convolutional neural network implementation of claim 1, wherein the correlation attribute value of the dependency of a current operator with its upstream operator is used to represent the number of nodes downstream of the upstream operator.
3. The convolutional neural network implementation of claim 1, wherein when the correlation attribute value of the dependency of a current operator and its upstream operator is 0, the memory of the upstream operator is released.
4. The convolutional neural network implementation method of claim 1, further comprising:
an operator fusion step of fusing a plurality of operators that are located on different operator layers and satisfy an interdependency condition into a new operator, wherein the new operator is used to replace the plurality of operators.
5. The convolutional neural network implementation of claim 4, wherein fusing a plurality of operators which satisfy an interdependency condition into a new operator comprises sequentially recording the plurality of operators as a new abstract operator.
6. The convolutional neural network implementation of claim 4, wherein the operator dependency obtaining step and the operator fusion step are performed with the new operator as a separate operator in an operator layer of the convolutional neural network.
7. The convolutional neural network implementation of claim 4, wherein the interdependence conditions between operators of different layers include a one-to-one correspondence between operators of different layers.
8. The convolutional neural network implementation of claim 4, wherein the interdependence conditions between operators of different layers include a one-to-one correspondence between operators of adjacent layers.
9. An apparatus for implementing a convolutional neural network, the apparatus comprising:
an operator dependency obtaining unit configured to traverse all operators in an operator layer formed by operators in the convolutional neural network, and collect dependency between operators in different layers; and
and an off-chip storage multiplexing component that allocates memory for the output of a current operator, adjusts the correlation attribute value of the dependency between the current operator and its upstream operator, obtained by the operator dependency obtaining unit, as the memory allocation for the current operator is completed, and releases the memory of the upstream operator according to the change in the correlation attribute value.
10. The convolutional neural network implementing device of claim 9, further comprising:
and an operator fusion part fusing a plurality of operators which are positioned on different operator layers and satisfy the interdependence relation condition into a new operator, wherein the new operator is used for replacing the plurality of operators.
11. An acceleration apparatus, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 1-8.
12. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1-8.
CN201810278677.XA 2018-03-31 2018-03-31 Convolutional neural network implementation method and device, acceleration equipment and storage medium Active CN110321998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810278677.XA CN110321998B (en) 2018-03-31 2018-03-31 Convolutional neural network implementation method and device, acceleration equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810278677.XA CN110321998B (en) 2018-03-31 2018-03-31 Convolutional neural network implementation method and device, acceleration equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110321998A CN110321998A (en) 2019-10-11
CN110321998B true CN110321998B (en) 2022-06-14

Family

ID=68111958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810278677.XA Active CN110321998B (en) 2018-03-31 2018-03-31 Convolutional neural network implementation method and device, acceleration equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110321998B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859797A (en) * 2020-07-14 2020-10-30 Oppo广东移动通信有限公司 Data processing method and device and storage medium
CN113657584B (en) * 2021-08-31 2024-04-09 安谋科技(中国)有限公司 Neural network model calculation method, data processing method, electronic device and medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105205409A (en) * 2015-09-14 2015-12-30 浪潮电子信息产业股份有限公司 Method for preventing data leakage during memory multiplexing and computer system
CN105447566A (en) * 2014-05-30 2016-03-30 富士通株式会社 Training device and method, and detection device
CN105631519A (en) * 2015-12-31 2016-06-01 北京工业大学 Convolution nerve network acceleration method based on pre-deciding and system
CN105869117A (en) * 2016-03-28 2016-08-17 上海交通大学 Method for accelerating GPU directed at deep learning super-resolution technology
CN106228238A (en) * 2016-07-27 2016-12-14 中国科学技术大学苏州研究院 The method and system of degree of depth learning algorithm is accelerated on field programmable gate array platform
CN106776356A (en) * 2016-11-28 2017-05-31 新疆熙菱信息技术股份有限公司 A kind of system and method for realizing that internal memory is interactive at a high speed
CN106779057A (en) * 2016-11-11 2017-05-31 北京旷视科技有限公司 The method and device of the calculating binary neural network convolution based on GPU
CN107451654A (en) * 2017-07-05 2017-12-08 深圳市自行科技有限公司 Acceleration operation method, server and the storage medium of convolutional neural networks
EP3276539A1 (en) * 2016-07-27 2018-01-31 Samsung Electronics Co., Ltd. Accelerator in convolutional neural network and method for operating the same

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447566A (en) * 2014-05-30 2016-03-30 富士通株式会社 Training device and method, and detection device
CN105205409A (en) * 2015-09-14 2015-12-30 浪潮电子信息产业股份有限公司 Method for preventing data leakage during memory multiplexing and computer system
CN105631519A (en) * 2015-12-31 2016-06-01 北京工业大学 Convolution nerve network acceleration method based on pre-deciding and system
CN105869117A (en) * 2016-03-28 2016-08-17 上海交通大学 Method for accelerating GPU directed at deep learning super-resolution technology
CN106228238A (en) * 2016-07-27 2016-12-14 中国科学技术大学苏州研究院 The method and system of degree of depth learning algorithm is accelerated on field programmable gate array platform
EP3276539A1 (en) * 2016-07-27 2018-01-31 Samsung Electronics Co., Ltd. Accelerator in convolutional neural network and method for operating the same
CN106779057A (en) * 2016-11-11 2017-05-31 北京旷视科技有限公司 The method and device of the calculating binary neural network convolution based on GPU
CN106776356A (en) * 2016-11-28 2017-05-31 新疆熙菱信息技术股份有限公司 A kind of system and method for realizing that internal memory is interactive at a high speed
CN107451654A (en) * 2017-07-05 2017-12-08 深圳市自行科技有限公司 Acceleration operation method, server and the storage medium of convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of an FPGA-based Deep Learning Accelerator; Yu Qi; China Master's Theses Full-text Database, Information Science and Technology; 2016-09-15 (No. 9); pp. I140-49 *
Design and Implementation of a VLIW Accelerator for Deep Learning Convolutional Neural Networks; Shi Runbin; China Master's Theses Full-text Database, Engineering Science and Technology II; 2017-01-15 (No. 1); pp. C030-41 *

Also Published As

Publication number Publication date
CN110321998A (en) 2019-10-11

Similar Documents

Publication Publication Date Title
CA3101214C (en) Modifying machine learning models to improve locality
JP7162074B2 (en) Method and system for performing machine learning
WO2016205978A1 (en) Techniques for virtual machine migration
CN110321998B (en) Convolutional neural network implementation method and device, acceleration equipment and storage medium
CN114580653A (en) Machine learning calculation optimization method and compiler
CN105009076B (en) Software pipelining at runtime
US20180181380A1 (en) Parallel program generating method and parallelization compiling apparatus
CN115829017B (en) Method, device, medium and equipment for processing data based on core particles
CN114265670B (en) Memory block sorting method, medium and computing device
CN116822657B (en) Method and device for accelerating model training, storage medium and electronic equipment
CN116306856A (en) Deep learning model deployment method and device based on search
TW201935293A (en) Method for memory management and system and method for machine learning
WO2023221626A1 (en) Memory allocation method and apparatus
CN115461718A (en) Memory allocation in neural networks
CN116304212A (en) Data processing system, method, equipment and storage medium
US10324837B2 (en) Reducing minor garbage collection overhead
CN109614388B (en) Budget deduction method and device
CN109491643B (en) Flow description, analysis and execution method and device, data processing equipment and medium
CN116933841A (en) Operator fusion method and device, electronic equipment and computer readable medium
CN115968467A (en) Memory constrained scheduling
CN105573717A (en) Chip multi-processor-oriented program division method and device
US20220414438A1 (en) Neural network acceleration via graph partition
CN117076095B (en) Task scheduling method, system, electronic equipment and storage medium based on DAG
US20210365370A1 (en) Memory for storing data blocks
CN116340004A (en) Task execution method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
TA01 Transfer of patent application right

Effective date of registration: 20190929

Address after: 2100 Rojack Avenue, San Jose, California, USA

Applicant after: XILINX INC

Address before: 100083, 17 floor, four building four, 1 Wang Zhuang Road, Haidian District, Beijing.

Applicant before: Beijing Shenjian Intelligent Technology Co., Ltd.

SE01 Entry into force of request for substantive examination
GR01 Patent grant