CN109919315B - Forward reasoning method, device, equipment and storage medium of neural network

Info

Publication number
CN109919315B
Authority
CN
China
Prior art keywords
neural network
sub-network
target neural network
parallel mode
Legal status
Active
Application number
CN201910188467.6A
Other languages
Chinese (zh)
Other versions
CN109919315A
Inventor
刘凯
吕亚飞
张致江
李必然
刘远东
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201910188467.6A
Publication of CN109919315A
Application granted
Publication of CN109919315B

Abstract

The application provides a forward reasoning method, apparatus, device and storage medium for a neural network. The method comprises the following steps: dividing a target neural network into a plurality of sub-networks, wherein each sub-network comprises at least one hidden layer of the target neural network; creating an inference instance and an inference engine corresponding to each sub-network on the hardware devices of the inference platform; and performing forward inference on the target neural network based on the inference instances and inference engines corresponding to the sub-networks. Because each inference engine is responsible for only a part of the hidden layers of the neural network, multiple inputs can be processed in parallel by different inference engines at the same time. The forward inference method provided by the application therefore achieves higher inference efficiency and data throughput, and the hardware resources of the inference platform are fully utilized.

Description

Forward reasoning method, device, equipment and storage medium of neural network
Technical Field
The present application relates to the field of parallel computing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for forward inference in a neural network.
Background
Forward inference of a neural network means that an inference instance and an inference engine are created on an inference platform for the neural network to be inferred, and the inference engine performs the operation of each layer of the neural network based on the input data fed to the input layer of the neural network and the inference instance.
The current inference scheme is as follows: an inference instance is created for the neural network to be inferred, and an inference engine is created within that instance. The inference engine receives input data and computes each layer of the whole neural network in sequence based on the inference instance. In other words, the computation of one input across different layers is strictly serial, and different inputs are also strictly serial: the next input can be processed only after the output for the previous input has been obtained.
Under the existing inference scheme, as the number of layers of the neural network grows, the computation time from input to output for a single piece of data becomes longer and longer, and the overall throughput becomes smaller and smaller. Meanwhile, with the continuous development of chip technology, the computing power of the various hardware devices suited to neural networks has improved greatly, yet the existing inference scheme leaves the utilization of these devices very low, seriously wasting hardware resources.
Disclosure of Invention
In view of this, the present application provides a forward inference method, apparatus, device and readable storage medium for a neural network, so as to solve the problems of the existing inference scheme, namely long time consumption, low efficiency and low hardware resource utilization. The technical solution is as follows:
a method of forward reasoning for a neural network, comprising:
dividing a target neural network into a plurality of sub-networks, wherein any sub-network comprises at least one hidden layer of the target neural network;
creating inference instances and inference engines corresponding to the sub-networks on hardware equipment of an inference platform respectively;
and carrying out forward reasoning on the target neural network based on the reasoning examples and the reasoning engines respectively corresponding to the sub-networks.
Optionally, the dividing the target neural network into a plurality of sub-networks includes:
acquiring hardware equipment information of the reasoning platform, the calculated amount of a target neural network and the required storage space;
and dividing the target neural network into a plurality of sub-networks based on the hardware equipment information of the reasoning platform, the calculated amount of the target neural network and the required storage space.
Wherein the hardware device information of the inference platform comprises one or more of the following information:
the number of hardware devices, the computing power of the hardware devices, the storage capacity of the hardware devices, and the transmission bandwidth among the hardware devices.
Optionally, obtaining the calculated amount and the required storage space of the target neural network includes:
constructing a calculation graph of the target neural network according to the network parameters of the target neural network;
determining the calculated amount and the required storage space of each layer of the target neural network according to the calculation graph of the target neural network;
and determining the calculated amount and the required storage space of the whole target neural network according to the calculated amount and the required storage space of each layer of the target neural network.
Optionally, the dividing the target neural network into a plurality of sub-networks based on the hardware device information of the inference platform, the calculation amount of the target neural network, and the required storage space includes:
determining a parallel mode suitable for the target neural network based on hardware device information of the inference platform, a calculation amount and required storage space of the target neural network and a parallel mode configured by a user, wherein the parallel mode comprises a single-device parallel mode and a multi-device parallel mode, in the single-device parallel mode, forward inference of the target neural network is realized based on a single device, and in the multi-device parallel mode, the forward inference of the target neural network is realized based on a plurality of devices;
dividing the target neural network into a plurality of sub-networks based on a parallel pattern that fits the target neural network.
Optionally, the determining a parallel mode suitable for the target neural network based on the hardware device information of the inference platform, the computation amount and the required storage space of the target neural network, and the user-configured parallel mode includes:
if the calculation amount of the whole target neural network is larger than the calculation capacity of a single device and/or the storage space required by the whole target neural network is larger than the storage capacity of the single device, determining that the parallel mode suitable for the target neural network is the multi-device parallel mode;
if the calculation amount of the whole target neural network is less than or equal to the calculation capacity of the single device, and the storage space required by the whole target neural network is less than or equal to the storage capacity of the single device, determining a parallel mode suitable for the target neural network based on the user-configured parallel mode.
Optionally, the determining a parallel pattern suitable for the target neural network based on the user-configured parallel pattern includes:
when the user-configured parallel mode is the single-device parallel mode, determining that the parallel mode suitable for the target neural network is the single-device parallel mode;
when the parallel mode configured by the user is the multi-device parallel mode, if the transmission time between the devices is greater than the maximum execution time of the preset sub-network, determining that the parallel mode suitable for the target neural network is the single-device parallel mode, and if the transmission time between the devices is less than or equal to the maximum execution time of the preset sub-network, determining that the parallel mode suitable for the target neural network is the multi-device parallel mode.
Optionally, the dividing the target neural network into a plurality of sub-networks based on a parallel mode suitable for the target neural network includes:
if the parallel mode suitable for the target neural network is the multi-device parallel mode, obtaining the dividing number of sub-networks based on the number of the hardware devices, and dividing the target neural network based on the dividing number of the sub-networks;
and if the parallel mode suitable for the target neural network is the single-device parallel mode, dividing the target neural network based on the preset number of sub-network divisions.
Optionally, the dividing the target neural network based on the division number of the sub-networks includes:
dividing the target neural network, based on the division number of the sub-networks, by taking the theoretical calculation amount for which a single device is responsible and the maximum data amount transmitted between devices as the division basis;
wherein the theoretical calculation amount for a single device is determined by the calculation amount of the whole target neural network and the division number of the sub-networks, and the maximum data amount transmitted between devices is determined by the preset maximum execution time of a sub-network and the transmission bandwidth between devices.
Optionally, the dividing the target neural network based on the division number of the sub-networks, with the theoretical calculation amount for a single device and the maximum data amount transmitted between devices as the division basis, includes:
traversing backwards from the input layer of the target neural network in sequence: accumulating the calculation amount of each hidden layer in turn, and when the currently accumulated calculation amount approaches the theoretical calculation amount for a single device, taking the sub-network formed by the adjacent hidden layers whose calculation amounts were accumulated as a candidate sub-network;
if the output data amount of the candidate sub-network is less than or equal to the maximum data amount transmitted between devices, taking the candidate sub-network as one divided sub-network; if the output data amount of the candidate sub-network is greater than the maximum data amount transmitted between devices, removing hidden layers from the candidate sub-network one by one from back to front until the output data amount of the remaining sub-network is less than or equal to the maximum data amount transmitted between devices, and taking the sub-network after the removal as one divided sub-network;
and continuing the backward traversal until all the sub-networks are obtained, wherein after each sub-network is obtained, the calculation amounts of the hidden layers following that sub-network are accumulated anew.
Optionally, the performing forward inference on the target neural network based on the inference instances and the inference engines respectively corresponding to the sub-networks includes:
determining the dependency relationship among inference engines corresponding to the sub-networks according to the dependency relationship among the sub-networks;
and inputting data to the inference engines corresponding to the sub-networks respectively in sequence, so that each inference engine operates the corresponding sub-network based on the input data and the corresponding inference instance.
A forward reasoning apparatus for a neural network, comprising: the system comprises a network processing module, an instance and engine creating module and an inference module;
the network processing module is used for dividing the target neural network into a plurality of sub-networks, wherein any sub-network comprises at least one hidden layer of the target neural network;
the instance and engine creating module is used for creating inference instances and inference engines which respectively correspond to the sub-networks on hardware equipment of the inference platform;
the reasoning module is used for carrying out forward reasoning on the target neural network based on the reasoning examples and the reasoning engines respectively corresponding to the sub-networks.
Optionally, the network processing module includes: the device comprises an information acquisition module and a sub-network dividing module;
the information acquisition module acquires hardware equipment information of the reasoning platform, the calculated amount of the target neural network and the required storage space;
the sub-network dividing module is used for dividing the target neural network into a plurality of sub-networks based on the hardware equipment information of the reasoning platform, the calculated amount of the target neural network and the required storage space.
Wherein the hardware device information of the inference platform comprises one or more of the following information:
the number of hardware devices, the computing power of the hardware devices, the storage capacity of the hardware devices, and the transmission bandwidth among the hardware devices.
Optionally, the information obtaining module includes: a computation graph construction sub-module and a computation amount and storage space determination sub-module;
the computation graph constructing sub-module is used for constructing a computation graph of the target neural network according to the network parameters of the target neural network;
and the calculation amount and storage space determining submodule is used for determining the calculation amount and the required storage space of each layer of the target neural network according to the calculation graph of the target neural network, and determining the calculation amount and the required storage space of the whole target neural network according to the calculation amount and the required storage space of each layer of the target neural network.
Optionally, the sub-network dividing module includes: a parallel mode determination submodule and a sub-network division submodule;
the parallel mode determination submodule is used for determining a parallel mode suitable for the target neural network based on hardware equipment information of the inference platform, the calculated amount and required storage space of the target neural network and a parallel mode configured by a user, wherein the parallel mode comprises a single-equipment parallel mode and a multi-equipment parallel mode, in the single-equipment parallel mode, the forward inference of the target neural network is realized based on a single equipment, and in the multi-equipment parallel mode, the forward inference of the target neural network is realized based on a plurality of equipments;
the sub-network dividing sub-module is used for dividing the target neural network into a plurality of sub-networks based on a parallel mode suitable for the target neural network.
Optionally, the parallel mode determining sub-module includes: a first determination submodule and a second determination submodule;
the first determining sub-module is used for determining that the parallel mode suitable for the target neural network is the multi-device parallel mode when the calculation amount of the whole target neural network is larger than the calculation capacity of a single device and/or the storage space required by the whole target neural network is larger than the storage capacity of the single device;
the second determining sub-module is used for determining the parallel mode suitable for the target neural network based on the parallel mode configured by the user when the calculation amount of the whole target neural network is less than or equal to the calculation capacity of the single device and the storage space required by the whole target neural network is less than or equal to the storage capacity of the single device.
Optionally, the second determining sub-module is specifically configured to determine, when the parallel mode configured by the user is the single device parallel mode, that the parallel mode suitable for the target neural network is the single device parallel mode; when the parallel mode configured by the user is the multi-device parallel mode, if the transmission time between the devices is greater than the maximum execution time of the preset sub-network, determining that the parallel mode suitable for the target neural network is the single-device parallel mode, and if the transmission time between the devices is less than or equal to the maximum execution time of the preset sub-network, determining that the parallel mode suitable for the target neural network is the multi-device parallel mode.
Optionally, the sub-network dividing sub-module includes: a first dividing sub-module and a second dividing sub-module;
the first dividing sub-module is configured to, when the parallel mode suitable for the target neural network is the multi-device parallel mode, obtain the division number of sub-networks based on the number of the hardware devices, and divide the target neural network based on that division number;
and the second dividing sub-module is configured to divide the target neural network based on the preset division number of sub-networks when the parallel mode suitable for the target neural network is the single-device parallel mode.
Optionally, the first dividing sub-module is specifically configured to divide the target neural network, based on the division number of the sub-networks, by taking the theoretical calculation amount for which a single device is responsible and the maximum data amount transmitted between devices as the division basis;
wherein the theoretical calculation amount for a single device is determined by the calculation amount of the whole target neural network and the division number of the sub-networks, and the maximum data amount transmitted between devices is determined by the preset maximum execution time of a sub-network and the transmission bandwidth between devices.
Optionally, the first dividing sub-module is specifically configured to traverse backwards from the input layer of the target neural network in sequence: accumulating the calculation amount of each hidden layer in turn, and when the currently accumulated calculation amount approaches the theoretical calculation amount for a single device, taking the sub-network formed by the adjacent hidden layers whose calculation amounts were accumulated as a candidate sub-network; if the output data amount of the candidate sub-network is less than or equal to the maximum data amount transmitted between devices, taking the candidate sub-network as one divided sub-network; if the output data amount of the candidate sub-network is greater than the maximum data amount transmitted between devices, removing hidden layers from the candidate sub-network one by one from back to front until the output data amount of the remaining sub-network is less than or equal to the maximum data amount transmitted between devices, and taking the sub-network after the removal as one divided sub-network; and continuing the backward traversal until all the sub-networks are obtained, wherein after each sub-network is obtained, the calculation amounts of the hidden layers following that sub-network are accumulated anew.
Optionally, the inference module is specifically configured to determine, according to the dependency relationships among the multiple sub-networks, the dependency relationships among the inference engines corresponding to the multiple sub-networks, respectively; and inputting data to the inference engines corresponding to the sub-networks respectively in sequence, so that each inference engine operates the corresponding sub-network based on the input data and the corresponding inference instance.
A forward reasoning apparatus for a neural network, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is used for executing the program and realizing each step of the forward reasoning method of the neural network.
A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the forward inference method of the neural network.
It can be seen from the above technical solutions that, in the forward inference method for a neural network provided by the present application, the target neural network is first divided into a plurality of sub-networks, inference instances and inference engines are then created for the sub-networks on the inference platform, and forward inference on the target neural network is finally performed based on the inference instances and inference engines corresponding to the sub-networks. Since there are multiple inference engines and each engine is responsible for only a part of the hidden layers of the target neural network, multiple pieces of data can be fed to different inference engines at the same time and the operations of the corresponding sub-networks can be executed in parallel. Compared with the existing inference scheme, because multiple inference engines compute simultaneously on multiple inputs, hardware resources are fully utilized, i.e., the utilization of hardware resources is improved; at the same time, inference efficiency and data throughput are improved, and storage space is saved on the premise that the storage resources remain unchanged.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a schematic diagram of a forward inference process with a single inference instance;
FIG. 2 is a schematic diagram of a forward inference process including multiple inference instances provided by an embodiment of the present application;
fig. 3 is a schematic flowchart of a forward inference method of a neural network according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of obtaining the calculation amount and the required storage space of the target neural network according to an embodiment of the present disclosure;
fig. 5 is a schematic flowchart of dividing a target neural network into a plurality of sub-networks based on hardware device information of an inference platform and a calculation amount and a required storage space of the target neural network provided in an embodiment of the present application;
fig. 6 is a schematic flowchart of an alternative specific implementation manner of determining a parallel mode suitable for a target neural network based on hardware device information of an inference platform, a calculation amount and a required storage space of the target neural network, and a parallel mode configured by a user according to an embodiment of the present application;
FIG. 7 is a diagram illustrating an example of sub-network partitioning of a neural network according to an embodiment of the present application;
FIG. 8 is a schematic diagram of creating an inference engine in a multi-device parallel mode according to an embodiment of the present application;
FIG. 9 is a diagram illustrating an example of an inference process of a neural network provided by an embodiment of the present application;
fig. 10 is a block diagram of a forward inference apparatus of a neural network according to an embodiment of the present application;
fig. 11 is a block diagram of a forward inference apparatus of a neural network according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the process of making the invention, the inventors found the following: the existing inference scheme creates one inference instance and one inference engine for the whole neural network, that is, the existing inference scheme implements inference of the whole neural network based on a single inference instance and a single inference engine. Specifically:
First, a computation graph is constructed according to the parameters of the neural network to be inferred, and the execution order of the whole network is determined from the computation graph. Based on this execution order and the inference instance of the whole network, the execution functions of the different hidden layers are pushed in sequence into the execution queue of the inference engine. When input data arrives, the inference engine performs the operation of each hidden layer according to the order of the execution functions in its internal execution queue; that is, the different hidden layers of the whole network are executed serially, and the inference engine accepts the next input only after the operations of all hidden layers have been executed for the current input. In other words, in the existing inference scheme, forward inference for a single input is strictly serial in the execution queue of the inference engine, and different inputs are strictly serial as well.
As shown in fig. 1, for a neural network with N hidden layers, a single inference instance is created on the inference platform; input data X enters the network, passes through the operations of the N hidden layers in sequence, and finally produces the output Y. The execution time of the whole network is T = T1 + T2 + … + TN, where Ti denotes the execution time of the i-th hidden layer. For a deep neural network, the more hidden layers there are, the longer one forward inference takes; moreover, only one hidden layer is executing at any moment while all other layers are idle, so the computational throughput is limited when forward inference is performed with a single instance. Here, throughput is the number of inputs processed per unit time.
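To make this throughput limitation concrete, the following small arithmetic sketch contrasts serial and pipelined execution; the per-layer times are hypothetical values chosen for illustration and are not taken from the patent.

```python
# Hypothetical illustration of why single-instance serial inference limits throughput.
# The per-layer execution times below are made-up example values, not from the patent.

layer_times_ms = [2.0, 3.0, 1.0, 4.0]            # Ti for each hidden layer, in milliseconds

serial_latency_ms = sum(layer_times_ms)          # T = T1 + T2 + ... + TN = 10 ms
serial_throughput = 1000.0 / serial_latency_ms   # inputs/s when inputs are strictly serial

# If the layers were split into pipeline stages on separate engines, steady-state
# throughput would be bounded by the slowest stage instead of the whole network.
pipelined_throughput = 1000.0 / max(layer_times_ms)

print(serial_throughput, pipelined_throughput)   # 100.0 vs 250.0 inputs per second
```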
With the continuous development of chip technology, the computing power of the various hardware devices suited to deep learning has improved greatly. Taking NVIDIA Graphics Processing Units (GPUs) as an example, the single-precision computing power of the M40 reaches 7 TFLOPS (i.e., 7 × 10^12 floating point operations per second), the P40 reaches 12 TFLOPS, the V100 reaches 15 TFLOPS, and the newly added Tensor Cores can theoretically reach up to 120 TFLOPS.
In order to increase the speed of forward inference and improve the utilization of hardware devices, the inventors of the present invention conducted intensive research:
the initial thinking was: in order to fully utilize hardware computing resources, a plurality of inference examples are created, one inference example is responsible for input data, the operation is carried out on the whole network based on the input data, when the forward inference is carried out, the inference examples are started to carry out inference simultaneously, and as shown in fig. 2, 4 inference examples are started to carry out inference simultaneously.
The inventors found through research that although this idea can to some extent exploit the strong computing capability of the hardware devices, execution within each instance is still serial (as shown in fig. 2, inside instance 0, after input X0 enters the network, the next input X4 can enter the instance for computation only after the output Y0 for X0 has finally been obtained), so the improvement is limited.
In view of the above problems, the present inventors have conducted further research and finally provide a forward reasoning scheme with better effect. The following embodiments are provided to describe the forward inference method of the neural network provided in the present application.
Referring to fig. 3, a flow chart of a forward inference method of a neural network provided in an embodiment of the present application is shown, where the method may include:
step S301: the target neural network is divided into a plurality of sub-networks.
The target neural network is the neural network to be inferred. It can be understood that the target neural network generally comprises a plurality of hidden layers whose operations are executed in sequence. Among the sub-networks obtained by the division, each sub-network may comprise one hidden layer or several consecutive adjacent hidden layers, and the sub-networks have a sequential dependency relationship.
Specifically, the process of dividing the target neural network into a plurality of sub-networks may include:
and step S3011, acquiring hardware device information of the inference platform, the calculated amount of the target neural network and the required storage space.
The inference platform may be, but is not limited to, a GPU server, a TPU (Tensor Processing Unit) server, and the like, and the hardware device of the inference platform may be a device with storage capability and computing capability, such as a video card.
The hardware device information may include one or more of the number of hardware devices, the computing power of the hardware devices, the storage capacity of the hardware devices, and the transmission bandwidth between the hardware devices, and preferably includes the above four kinds of information at the same time. In one possible implementation, the internal function may be called to obtain the hardware device information of the inference platform when the inference framework is started.
Illustratively, if the inference platform is a GPU (Graphics Processing Unit) server with four P40 graphics cards, the CUDA function interface may be invoked to obtain the following hardware device information: the number of hardware devices is 4; the compute capability of each device is 6.2, which by table lookup corresponds to a single-precision throughput of 12 TFLOPS (i.e., 12 × 10^12 floating point operations per second); the storage capacity of each hardware device is 24 GB; and the transmission bandwidth between devices is 10 GB/s over the PCIe interface, or up to 100 GB/s over NVLink.
The calculation amount of the target neural network refers to the calculation amount of the entire target neural network and can be determined from the calculation amount of each hidden layer; the storage space required by the target neural network refers to the total storage space required to run every hidden layer of the whole network and can be determined from the storage space required by each hidden layer. The specific process of obtaining the calculation amount and required storage space of the target neural network is described in the subsequent embodiments.
And step S3012, dividing the target neural network into a plurality of sub-networks based on the hardware device information of the inference platform, the calculated amount of the target neural network and the required storage space.
It should be noted that the hardware device information of the inference platform, the calculation amount of the target neural network and the required storage space determine the number of sub-networks into which the target neural network is divided, as well as the hidden layers contained in each sub-network when the network is divided according to that number. For this reason, these three pieces of information are used as the basis for dividing the target neural network into sub-networks.
Step S302: and creating an inference instance and an inference engine which respectively correspond to a plurality of sub-networks on a hardware device of the inference platform.
Specifically, after the target neural network is divided into a plurality of sub-networks, an inference instance and an inference engine need to be created for each sub-network, wherein the inference instance is responsible for the operation of each hidden layer in the corresponding sub-network, and the inference engine is responsible for receiving input data and completing the operation of the corresponding sub-network based on the input data and the corresponding inference instance.
Step S303: and carrying out forward reasoning on the target neural network based on the reasoning examples and the reasoning engines respectively corresponding to the sub-networks.
Because the target neural network is divided into a plurality of sub-networks and each sub-network corresponds to one inference engine and one inference instance, each inference engine is responsible for only one sub-network (i.e., a subset of the hidden layers). As a result, multiple pieces of input data can be fed to different inference engines at the same time; that is, multiple inference engines operate in parallel based on their input data and their corresponding inference instances.
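For illustration only, the following is a minimal sketch of such pipelined execution, with one engine per sub-network fed through queues. The class and function names are hypothetical, and a real inference platform would run each engine on its own hardware device or stream rather than on Python threads.

```python
# Minimal sketch of pipelined forward inference with one engine per sub-network.
# All class and function names here are hypothetical, for illustration only.
import queue
import threading

class SubNetworkEngine:
    """One inference engine responsible for one sub-network (a slice of hidden layers)."""
    def __init__(self, layers):
        self.layers = layers  # callables standing in for the hidden layers of this sub-network

    def run(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

def pipeline(engines, inputs):
    """Feed inputs through a chain of engines; different inputs occupy different
    engines at the same time, so the engines work in parallel."""
    queues = [queue.Queue() for _ in range(len(engines) + 1)]

    def stage(engine, q_in, q_out):
        while True:
            item = q_in.get()
            if item is None:              # sentinel: shut this stage down
                q_out.put(None)
                break
            q_out.put(engine.run(item))

    threads = [threading.Thread(target=stage, args=(e, queues[i], queues[i + 1]))
               for i, e in enumerate(engines)]
    for t in threads:
        t.start()
    for x in inputs:
        queues[0].put(x)
    queues[0].put(None)

    outputs = []
    while (y := queues[-1].get()) is not None:
        outputs.append(y)
    for t in threads:
        t.join()
    return outputs

# Example: three "sub-networks", each just scaling its input.
engines = [SubNetworkEngine([lambda v, k=k: v * k]) for k in (2, 3, 5)]
print(pipeline(engines, [1, 2, 3]))  # [30, 60, 90]
```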
In the neural network forward inference method provided by the embodiment of the application, the target neural network is divided into a plurality of sub-networks, an inference instance and an inference engine are created for each sub-network, and forward inference on the target neural network is performed based on the inference instances and inference engines corresponding to the sub-networks. Since there are multiple inference engines and each engine is responsible for only a part of the hidden layers of the target neural network, multiple pieces of data can be fed to different inference engines at the same time and the operations of the corresponding sub-networks can be executed in parallel. Compared with the existing inference scheme, because multiple inference engines compute simultaneously on multiple inputs, hardware resources are fully utilized, i.e., the utilization of hardware resources is improved; at the same time, inference efficiency and data throughput are improved, and storage space is saved on the premise that the storage resources remain unchanged.
A process of acquiring the calculation amount and the required memory space of the target neural network in step S3011 will be described below.
Referring to fig. 4, a schematic flow chart illustrating obtaining the calculation amount and the required storage space of the target neural network is shown, which may include:
step S401: and constructing a calculation graph of the target neural network according to the network parameters of the target neural network.
In this embodiment, the network parameters of the target neural network may include the number of hidden layers of the target neural network, the number of neurons in each hidden layer, the connection relationship between hidden layers, the serial number of input and output nodes, and the like, and these network parameters reflect the complexity of the target neural network and are related to the calculated amount of the target neural network and the required storage space.
Optionally, the present embodiment may create the computation graph based on the network parameters of the target neural network and a preset depth-first search algorithm. The calculation graph of the target neural network is a graph capable of reflecting the calculation process of the target neural network, and comprises nodes and edges, wherein the edges represent the operation of executing the function of each hidden layer, and the nodes represent the input of the executing function.
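As a rough sketch of what such a computation graph and a depth-first traversal might look like (the node structure below is a simplifying assumption, not the patent's data format):

```python
# Sketch of a computation-graph representation and a depth-first traversal order.
# The node/graph structure is a simplifying assumption made for illustration.
from dataclasses import dataclass, field

@dataclass
class GraphNode:
    name: str                                   # hidden-layer name, e.g. "conv1"
    op: str                                     # operator type, e.g. "conv", "fc"
    inputs: list = field(default_factory=list)  # upstream GraphNode objects

def dfs_order(outputs):
    """Return nodes in a valid execution order via depth-first search from the outputs."""
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.inputs:
            visit(parent)        # ensure producers are scheduled before consumers
        order.append(node)
    for out in outputs:
        visit(out)
    return order

# Tiny example graph: input -> fc1 -> fc2
x = GraphNode("input", "data")
fc1 = GraphNode("fc1", "fc", [x])
fc2 = GraphNode("fc2", "fc", [fc1])
print([n.name for n in dfs_order([fc2])])  # ['input', 'fc1', 'fc2']
```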
Step S402: and determining the calculated amount and the required storage space of each layer of the target neural network according to the calculation graph of the target neural network.
After the computation graph of the target neural network is obtained, the graph is traversed to obtain the calculation amount and required storage space of each hidden layer. To this end, a function computing the calculation amount and a function computing the required storage space may be set in advance for each hidden layer and associated or bound with the corresponding layer.
Alternatively, the present embodiment may use the number of multiply-add operations required to complete a hidden layer to represent the calculation amount of that layer. Illustratively, a fully connected layer with input dimension r × k and n neurons has a calculation amount of r × k × n × 2, and the required storage space is r × k + k × n + r × n; a convolutional layer with input dimension v × c × h × w, convolution kernel kh × kw, stride sh × sw and f output channels has a calculation amount of about (v × c × h × w × kh × kw × f × 2)/(sh × sw), and the required storage space is about v × c × h × w + f × c × kh × kw.
Step S403: and determining the calculated amount and the required storage space of the whole target neural network through the calculated amount and the required storage space of each layer of the target neural network.
After the calculated amount of each hidden layer of the target neural network and the required storage space are obtained, accumulating the calculated amount of each hidden layer of the target neural network to obtain the calculated amount of the whole target neural network; similarly, the storage space required by each hidden layer of the target neural network is accumulated to obtain the storage space required by the whole target neural network.
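The per-layer estimates of step S402 and the whole-network accumulation of step S403 can be sketched as follows; the function names and example shapes are illustrative assumptions rather than the patent's interface.

```python
# Sketch of the per-layer cost estimates described above (multiply-add counts and
# element counts for storage); function names and example shapes are illustrative.

def fc_cost(r, k, n):
    """Fully connected layer: input of shape r x k, n neurons."""
    macs = r * k * n * 2                     # multiply-add operations
    storage = r * k + k * n + r * n          # input + weights + output elements
    return macs, storage

def conv_cost(v, c, h, w, kh, kw, sh, sw, f):
    """Convolutional layer: input v x c x h x w, kernel kh x kw, stride sh x sw, f output channels."""
    macs = (v * c * h * w * kh * kw * f * 2) // (sh * sw)   # approximate
    storage = v * c * h * w + f * c * kh * kw               # approximate: input + weights
    return macs, storage

# Whole-network totals are just the per-layer sums.
layers = [fc_cost(1, 1024, 4096), conv_cost(1, 3, 224, 224, 3, 3, 1, 1, 64)]
total_macs = sum(m for m, _ in layers)
total_storage = sum(s for _, s in layers)
print(total_macs, total_storage)
```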
After obtaining the hardware equipment information of the reasoning platform, the calculated amount of the whole target neural network and the required storage space, the target neural network is divided into sub-networks according to the information.
The following describes "step S3012: dividing the target neural network into a plurality of sub-networks based on the hardware device information of the inference platform, the calculation amount of the target neural network and the required storage space" in the above embodiment. Referring to fig. 5, which shows a flow diagram of an implementation of this process, it may include:
step S501: and determining a parallel mode suitable for the target neural network based on the hardware equipment information of the inference platform, the calculation amount and the required storage space of the target neural network and the user configured parallel mode.
The parallel mode includes a single-device parallel mode and a multi-device parallel mode, in the single-device parallel mode, the forward inference process of the whole target neural network is realized based on a single device, and in the multi-device parallel mode, the forward inference process of the whole target neural network is realized based on a plurality of devices (the plurality of devices may be all hardware devices on the inference platform or part of the hardware devices).
It should be noted that the parallel mode configured by the user may be a parallel mode suitable for the target neural network or a parallel mode unsuitable for the target neural network, for example, the parallel mode configured by the user may not support the amount of computation required by the target neural network or may not support the storage space required by the target neural network.
Step S502: the target neural network is divided into a plurality of sub-networks based on a parallel pattern that is appropriate for the target neural network.
After the parallel mode suitable for the target neural network is determined, the number of sub-network partitions can be determined based on the parallel mode suitable for the target neural network, and the target neural network is partitioned based on the determined number of sub-network partitions.
The above-described "step S501: and determining a parallel mode suitable for the target neural network based on hardware equipment information of the inference platform, the calculated amount and the required storage space of the target neural network and the parallel mode configured by a user, and introducing the implementation process.
Based on the hardware device information of the inference platform, the computation amount and the required storage space of the target neural network, and the user-configured parallel pattern, the process of determining the parallel pattern suitable for the target neural network may include: if the calculation amount of the whole target neural network is larger than the calculation capacity of a single device and/or the storage space required by the whole target neural network is larger than the storage capacity of the single device, determining that the parallel mode suitable for the target neural network is a multi-device parallel mode; if the calculation amount of the whole target neural network is less than or equal to the calculation capacity of the single device, and the storage space required by the whole target neural network is less than or equal to the storage capacity of the single device, determining the parallel mode suitable for the target neural network based on the parallel mode configured by the user.
It should be noted that if the calculation amount of the entire target neural network is greater than the computing capability of a single device, or the storage space required by the entire target neural network is greater than the storage capacity of a single device, then a single device cannot meet the computing requirements of the target neural network. In that case, no matter what parallel mode the user configures, the final parallel mode must be the multi-device parallel mode: if the user set the single-device parallel mode, it is adjusted to the multi-device parallel mode; if the user set the multi-device parallel mode, it is kept unchanged.
It should be noted that if the calculation amount of the entire target neural network is less than or equal to the computing capability of a single device, and the storage space required by the entire target neural network is less than or equal to the storage capacity of a single device, then a single device can meet the computing requirements of the target neural network. At this time both the single-device parallel mode and the multi-device parallel mode can meet the requirements, and in this case the parallel mode suitable for the target neural network can be determined based on the parallel mode configured by the user.
Further, the implementation of determining the parallel mode suitable for the target neural network based on the user-configured parallel mode is as follows. When the user-configured parallel mode is the single-device parallel mode, the single-device parallel mode can be used directly as the parallel mode suitable for the target neural network. When the user-configured parallel mode is the multi-device parallel mode, one optional implementation is to use the multi-device parallel mode directly; however, in the multi-device parallel mode there is data transmission between devices, and if the transmission time between devices is too long it will inevitably affect the inference rate of the target neural network, in which case using the multi-device parallel mode is not the preferred scheme.
For example, for a PCIe-connected P40, the transmission bandwidth is only 10 GB/s while the single-precision computation throughput can reach 12 TFLOPS (12 × 10^12 floating point operations per second), about 1200 times the transmission rate. Taking a fully connected layer with input dimension m × k and n neurons as an example, the layer has a calculation amount of m × n × k × 2 and an output data amount of m × n. When k is not large, the transmission time between devices is longer than the computation time of the devices themselves, so a device has to pause and wait for data to arrive, which wastes computing resources and increases the total inference time; in this case it is not preferable to use the multi-card parallel mode for inference.
In view of this, in a preferred implementation, the parallel mode suitable for the target neural network may be determined based on the inter-device transmission time and the preset maximum execution time of a sub-network: specifically, if the inter-device transmission time is greater than the preset maximum execution time of a sub-network, the parallel mode suitable for the target neural network is determined to be the single-device parallel mode; if the inter-device transmission time is less than or equal to the preset maximum execution time of a sub-network, the parallel mode suitable for the target neural network is determined to be the multi-device parallel mode.
Referring to fig. 6, a schematic flow chart of an alternative specific implementation for determining a parallel mode suitable for a target neural network based on hardware device information of an inference platform, a calculation amount and a required storage space of the target neural network, and a user configured parallel mode is shown, and may include:
step S601: judging whether the calculated amount of the target neural network is larger than the calculation capacity of the single device, if not, executing a step S602; if yes, go to step S603.
Step S602: judging whether the storage space required by the target neural network is larger than the storage capacity of a single device, if so, executing a step S603; if not, go to step S604.
It should be noted that, the present embodiment does not limit the execution sequence of step S601 and step S602 to the above sequence, for example, step S602 may be executed first, step S601 may be executed later, or step S601 and step S602 may be executed in parallel. Regardless of the order of execution, if any of the determination results is yes, S603 is executed, and if both of the determination results are no, step S604 is executed.
Step S603: determining the parallel mode suitable for the target neural network as a multi-device parallel mode.
Step S604: judging whether the parallel mode configured by the user is a multi-device parallel mode, if not, executing the step S605; if yes, go to step S606.
Step S605: determining the parallel mode suitable for the target neural network as a single-device parallel mode.
Step S606: judging whether the transmission time between the devices is larger than the preset maximum execution time of the sub-network, if so, executing the step S605; if not, go to step S603.
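A compact sketch of the decision flow of steps S601 to S606 is given below; the parameter names are illustrative assumptions.

```python
# Sketch of the parallel-mode decision of steps S601-S606.
# All parameter names are illustrative; the thresholds come from the surrounding text.

def choose_parallel_mode(compute_amount, storage_needed,
                         device_compute, device_storage,
                         user_mode, inter_device_time, max_subnet_time):
    """Return 'multi-device' or 'single-device' for the target neural network."""
    # S601/S602: a single device cannot compute or hold the whole network.
    if compute_amount > device_compute or storage_needed > device_storage:
        return "multi-device"                      # S603
    # S604: a single device is sufficient; defer to the user's configuration.
    if user_mode == "single-device":
        return "single-device"                     # S605
    # S606: user asked for multi-device; accept it only if transmission between
    # devices does not exceed the preset maximum sub-network execution time.
    if inter_device_time > max_subnet_time:
        return "single-device"                     # S605
    return "multi-device"                          # S603
```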
The following further describes, by a specific example, determining a parallel pattern suitable for a target neural network based on hardware device information of an inference platform, a calculation amount and a required storage space of the target neural network, and a user configured parallel pattern:
the hardware device of the inference platform is a P40 video card, the actual storage capacity of a single P40 video card is 24GB, because some system space and reserved space are removed, the storage capacity of a single P40 video card is 22GB, the single precision peak computing capacity of a P40 video card is 12TFlops, because it is difficult to reach the theoretical peak in practical situations, and considering the influences of computing scale, read-write delay and the like, 8TFlops is taken as the average computing capacity of a single P40 video card, the parallel mode of the inference platform comprises a single-card parallel mode and a multi-card parallel mode, the calculated amount of a target neural network is S, the required storage space is M, and the process of determining the parallel mode suitable for the target neural network is as follows:
if M > 22G or S/(8 x 10)12)>T1max(that is, the storage requirement of the whole target neural network is greater than the available video memory of a single card, or the calculation amount of the whole target neural network is greater than the average calculation capacity of the single card), the single card cannot complete the forward reasoning task of the whole target neural network, and at this time, it is determined that the single card cannot complete the forward reasoning task of the whole target neural networkThe multi-card parallel mode is a parallel mode suitable for the target neural network. Wherein, T1maxAnd completing the maximum execution of one input operation occupation by the single card set by the user. When M > 22G or S/(8 x 10)12)>T1maxWhen the multi-card parallel mode is determined to be the parallel mode suitable for the target neural network, regardless of which mode the user configures the parallel mode.
If M ≤ 22 GB and S/(8 × 10^12) ≤ T1max (that is, the storage requirement of the whole target neural network is less than or equal to the available video memory of a single card, and the calculation of the whole target neural network can be completed by a single card within T1max), a single card can complete the forward inference task of the whole target neural network, and the parallel mode suitable for the target neural network is determined based on the parallel mode configured by the user. Specifically, if the parallel mode configured by the user is the single-card parallel mode, the parallel mode suitable for the target neural network is determined to be the single-card parallel mode; if the parallel mode configured by the user is the multi-card parallel mode, the decision is further based on the inter-card transmission time Tt and the preset maximum execution time T2max of a sub-network: if Tt > T2max, the parallel mode suitable for the target neural network is determined to be the single-card parallel mode, and if Tt ≤ T2max, it is determined to be the multi-card parallel mode. Here the inter-card transmission time Tt = m/B, where m is the amount of data exchanged between sub-networks and B is the transmission bandwidth between cards.
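As a worked check of this example, the arithmetic can be written out as follows; apart from the 22 GB and 8 TFLOPS figures quoted above, all values (M, S, T1max, m, B, T2max) are made-up illustrations.

```python
# Worked check of the P40 example above. The 22 GB and 8 TFLOPS thresholds come from
# the text; every other value here is an illustrative assumption.
M = 30e9                 # storage required by the target network, bytes (assumed)
S = 60e12                # calculation amount of the target network, multiply-adds (assumed)
T1_max = 5.0             # user-set max execution time for a single card, seconds (assumed)
avg_capability = 8e12    # average per-card computing capability, operations/s

needs_multi_card = (M > 22e9) or (S / avg_capability > T1_max)
print(needs_multi_card)  # True: 30 GB > 22 GB, so the multi-card mode is required

# If instead a single card sufficed, the user's multi-card choice would be kept
# only when the inter-card transmission time Tt = m / B does not exceed T2max.
m, B, T2_max = 0.2e9, 10e9, 0.05        # bytes, bytes/s, seconds (all assumed)
Tt = m / B
print(Tt <= T2_max)      # True: 0.02 s <= 0.05 s, so the multi-card mode is acceptable
```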
After the parallel mode suitable for the target neural network is determined, the target neural network may be divided into a plurality of sub-networks based on that parallel mode. The following describes dividing the target neural network into a plurality of sub-networks based on the parallel mode suitable for it.
The implementation of dividing the target neural network into a plurality of sub-networks based on a parallel pattern suitable for the target neural network may include: if the parallel mode suitable for the target neural network is a multi-device parallel mode, acquiring the number of sub-network partitions based on the number of hardware devices, and partitioning the target neural network based on the number of sub-network partitions; and if the parallel mode suitable for the target neural network is the single-device parallel mode, dividing the target neural network based on the preset sub-network dividing number.
It should be noted that, in the multi-device parallel mode, if there are more than two devices, the number of hardware devices actually used may be determined based on the information of each hardware device and the computational requirements of the target neural network; for example, if there are 5 hardware devices on the inference platform, only 3 of them may be used, and of course all the hardware devices on the inference platform may also be used directly. That is, when the parallel mode suitable for the target neural network is the multi-device parallel mode, P (2 ≤ P ≤ M, where M is the number of hardware devices on the inference platform) may be used as the division number of sub-networks, the target neural network is divided into P sub-networks, and each device is responsible for a calculation amount of S/P. Preferably, M, i.e., the number of hardware devices on the inference platform, is used as the division number, the target neural network is divided into M sub-networks, and each device is responsible for a calculation amount of S/M, where S is the calculation amount of the whole target neural network.
A process of dividing the target neural network based on the determined number of divided sub-networks when the parallel mode suitable for the target neural network is the multi-device parallel mode will be described below.
In one possible implementation, dividing the target neural network based on the division number of sub-networks may include: dividing the target neural network, based on the division number of sub-networks, by taking the theoretical calculation amount for which a single device is responsible and the maximum data amount transmitted between devices as the division basis.
The theoretical calculation amount for a single device is determined by the calculation amount of the whole target neural network and the division number of sub-networks: specifically, if the calculation amount of the whole target neural network is S and the division number of sub-networks is M, the theoretical calculation amount for a single device is S/M. The maximum data amount transmitted between devices is determined by the preset maximum execution time of a sub-network and the transmission bandwidth between devices: specifically, if the preset maximum execution time of a sub-network is T2max and the transmission bandwidth between devices is B, the maximum data amount transmitted between devices is mmax = T2max × B.
Further, the process of dividing the target neural network based on the number of sub-network divisions, with the theoretical calculation amount handled by a single device and the maximum data amount transmitted between devices as the dividing basis, may include: traversing backwards from the input layer of the target neural network and superposing the calculation amounts of the hidden layers in turn; when the currently superposed calculation amount is close to the theoretical calculation amount handled by a single device (such as S/M), the sub-network formed by the superposed adjacent hidden layers is taken as a candidate sub-network. If the output data volume of the candidate sub-network (i.e. the data volume output by the last hidden layer of the candidate sub-network) is less than or equal to the maximum data volume m_max transmitted between devices, the candidate sub-network is taken as one sub-network obtained by division; if the output data volume of the candidate sub-network is greater than m_max, hidden layers are removed from the candidate sub-network one by one from back to front until the output data volume of the trimmed sub-network is less than or equal to m_max, and the trimmed sub-network is taken as one sub-network obtained by division. The backward traversal then continues until all sub-networks are obtained; after each sub-network is obtained, the superposition of calculation amounts restarts from the first hidden layer behind it.
Illustratively, as shown in fig. 7, the target neural network includes Q hidden layers, namely Layer1, Layer2, …, LayerQ. Traversing backwards from the input layer, the calculation amounts of the hidden layers are superposed in turn to obtain S_sum(i): when traversing to the first hidden layer, S_sum(1) is the calculation amount of the first hidden layer; when traversing to the second hidden layer, S_sum(2) is the sum of the calculation amounts of the first and second hidden layers; and so on. When S_sum(K) is close to or equal to the theoretical calculation amount handled by a single device, Layer1-LayerK are taken as a candidate sub-network, and the output data volume of the candidate sub-network is compared with the maximum data volume transmitted between devices. If the output data volume of the candidate sub-network is less than or equal to the maximum data volume transmitted between devices, the candidate sub-network is taken as the first sub-network obtained by division; if it is greater, hidden layers are removed from the candidate sub-network one by one from back to front until the output data volume of the trimmed sub-network is less than or equal to the maximum data volume transmitted between devices. For example, if the output data volume after removing LayerK and LayerK-1 satisfies this condition, Layer1-LayerK-2 are taken as the first sub-network obtained by division. The subsequent sub-networks are then obtained step by step according to the same strategy. It should be noted that, each time a sub-network is obtained, the superposition of calculation amounts restarts from the first hidden layer behind that sub-network.
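The greedy division described above can be sketched as follows; the Layer data structure and field names are assumptions for illustration, not code from the patent.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Layer:
    name: str
    flops: float         # calculation amount of this hidden layer
    output_bytes: float  # data volume output by this hidden layer

def divide_by_compute_and_traffic(layers: List[Layer],
                                  per_device_flops: float,
                                  max_transfer_bytes: float) -> List[List[Layer]]:
    """Greedy front-to-back division: accumulate hidden layers until the
    summed compute reaches the per-device target S/M, then trim layers off
    the back of the candidate until its boundary output fits m_max."""
    subnets, current, acc = [], [], 0.0
    for layer in layers:
        current.append(layer)
        acc += layer.flops
        if acc >= per_device_flops:
            candidate = list(current)
            # remove hidden layers from back to front until the output fits
            while candidate and candidate[-1].output_bytes > max_transfer_bytes:
                candidate.pop()
            if candidate:
                subnets.append(candidate)
                # restart the superposition from the first layer after this sub-network
                current = current[len(candidate):]
                acc = sum(l.flops for l in current)
    if current:
        subnets.append(current)  # remaining layers form the last sub-network
    return subnets
```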
A process of dividing the target neural network based on the number of divided sub-networks set in advance when the parallel mode suitable for the target neural network is the single device parallel mode will be described below.
In the single-device parallel mode, the target neural network may be divided according to a preset number of sub-network divisions. It should be noted that this number should be set appropriately and should not be too large: an overly large number of divisions makes the calculation amount of each sub-network small, so the data synchronization time between sub-networks takes up a larger proportion relative to the sub-network calculation time, which in turn reduces throughput. In the single-device parallel mode, the target neural network may be divided based on the average calculation amount (the calculation amount of the entire target neural network divided by the preset number of sub-network divisions). Illustratively, if the preset number of sub-network divisions is 8 and the calculation amount of the entire target neural network is S, the average calculation amount is S/8. When dividing the target neural network, the calculation amounts of the hidden layers are superposed in turn; when the currently superposed calculation amount is close to or equal to S/8, the sub-network formed by the superposed adjacent hidden layers is taken as one sub-network obtained by division; the superposition then restarts from the first hidden layer behind that sub-network, and when the superposed calculation amount is again close to or equal to S/8, another sub-network is obtained; and so on until all the sub-networks are obtained, each with a calculation amount close to or equal to S/8.
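Under the same assumptions, single-device division reduces to the same greedy pass, reusing the sketch above with the average calculation amount S/8 as the target and no inter-device data cap:

```python
# Single-device parallel mode (illustrative, reusing the sketch above):
# divide purely by the average calculation amount S/8; the inter-device
# data cap does not apply on one device, so pass an unlimited value.
total_flops = sum(l.flops for l in layers)
subnets = divide_by_compute_and_traffic(layers,
                                        per_device_flops=total_flops / 8,
                                        max_transfer_bytes=float("inf"))
```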
It should be noted that, in the single-device parallel mode, the entire target neural network is divided into sub-networks, while in the multi-device parallel mode the sub-network assigned to a single device may be further divided; the specific dividing manner may refer to the manner in which the entire target neural network is divided in the single-device parallel mode.
After dividing the target neural network into a plurality of sub-networks, it is necessary to create inference instances and inference engines for the respective sub-networks on the hardware devices of the inference platform. Specifically, for the multi-device parallel mode, it is necessary to create an inference instance and an inference engine on the corresponding hardware device for each sub-network, please refer to fig. 8, which shows a schematic diagram of creating an inference engine in the multi-device parallel mode, where the neural network shown in fig. 8 is divided into 4 sub-networks, each sub-network corresponds to one hardware device, and when creating an inference engine, an inference engine is created on each of the 4 hardware devices; for the single-device parallel mode, it is necessary to create corresponding inference instances and inference engines on one device for each sub-network.
After the inference instances and inference engines corresponding to the sub-networks are created, forward inference is performed on the entire target neural network based on them. Because the sub-networks have a precedence dependency relationship, the precedence dependency relationship between the inference engines corresponding to the sub-networks is first established based on the precedence dependency relationship between the sub-networks. Specifically, read-write flags may be created between the inference engines corresponding to adjacent sub-networks (as shown in fig. 8, read-write flags are created between the inference engine on device 1 and the inference engine on device 2, between the inference engine on device 2 and the inference engine on device 3, and between the inference engine on device 3 and the inference engine on device 4), so that each inference engine operates on its corresponding sub-network based on its input data and its corresponding inference instance.
The neural network in fig. 8 performs inference as follows: data1 passes in sequence through device 1 (which completes the operation of sub-network 1), device 2 (sub-network 2), device 3 (sub-network 3) and device 4 (sub-network 4), completing the forward inference of data1. Importantly, while device 1 passes the output for data1 to device 2, data2 is already input into device 1; it can therefore be seen that multiple input data are executed in parallel in different inference engines at the same time.
The inference process of the neural network is further explained below with reference to fig. 9:
FIG. 9 includes N inference engines, namely Engine1, Engine2, …, EngineN. At any time T1, new input data dataN is sent to inference engine Engine1 for operation; at the same time, dataN-1 (essentially the operation result of Engine1 on dataN-1) is sent to inference engine Engine2 for operation, dataN-2 (essentially the operation result of Engine2 on dataN-2) is sent to inference engine Engine3 for operation, and so on, until data1 (essentially the operation result of EngineN-1 on data1) is sent to inference engine EngineN for operation. Thus, at the same time T1, the inference engines Engine1 to EngineN all perform operations simultaneously.
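A thread-and-queue sketch of this staggered execution is given below; the queue hand-off between stages plays the role of the read-write flags mentioned earlier, and the sub-network callables (subnet_fns) are assumptions, not defined by the patent.

```python
import queue
import threading

def make_stage(run_subnetwork, in_q, out_q):
    """One inference engine: repeatedly take an input, run its own
    sub-network, and hand the intermediate result to the next engine."""
    def worker():
        while True:
            item = in_q.get()
            if item is None:      # shutdown signal, forwarded downstream
                out_q.put(None)
                break
            out_q.put(run_subnetwork(item))
    return threading.Thread(target=worker, daemon=True)

# Illustrative wiring of N = 4 stages (subnet_fns is an assumed list of callables):
# queues = [queue.Queue() for _ in range(5)]
# stages = [make_stage(subnet_fns[i], queues[i], queues[i + 1]) for i in range(4)]
# for s in stages: s.start()
# for data in inputs: queues[0].put(data)   # data1, data2, ... enter back-to-back
# queues[0].put(None)                       # results drain from queues[4]
```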
Comparing the inference method provided by the application with the prior-art inference method (in which only one inference engine operates at any moment, and that single engine operates on the whole network): assuming there are x input data, the prior-art method requires an inference time of x*t, where t is the inference time for one input, whereas the method provided by the application requires an inference time of (2x-1)*t/N, where N is the number of inference engines. When the number of inputs x is large, the throughput of the whole target neural network approaches N/2 times that of the existing inference scheme, so the inference efficiency is greatly improved.
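Plugging example numbers into the two expressions above (pure arithmetic; the values of x, t and N are illustrative):

```python
def serial_time(x, t):
    return x * t                  # prior scheme: x inputs, t per input

def pipelined_time(x, t, n):
    return (2 * x - 1) * t / n    # expression given in the description

x, t, n = 1000, 0.04, 8           # 1000 inputs, 40 ms each, 8 inference engines
print(serial_time(x, t))                             # 40.0 s
print(pipelined_time(x, t, n))                       # ~10.0 s
print(serial_time(x, t) / pipelined_time(x, t, n))   # ~4.0, i.e. close to N/2
```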
According to the neural network forward reasoning method provided by the embodiment of the application, the neural network is divided into the sub-networks, and each sub-network corresponds to one reasoning engine, so that each reasoning engine is only responsible for one part of the hidden layer of the target neural network, a plurality of data can be input to different reasoning engines for operation at the same time, and a plurality of reasoning engines are operated in parallel at the same time, so that hardware resources of a reasoning platform are fully utilized, the reasoning efficiency is obviously improved, and the data throughput is greatly increased.
The embodiment of the present application further provides a forward inference device of a neural network, which is described below, and the forward inference device of the neural network described below and the forward inference method of the neural network described above may be referred to correspondingly.
Referring to fig. 10, a schematic structural diagram of a forward inference apparatus of a neural network according to an embodiment of the present application is shown, and as shown in fig. 10, the apparatus may include: a network processing module 1001, an instance and engine creation module 1002, and an inference module 1003.
A network processing module 1001 configured to divide a target neural network into a plurality of sub-networks, where any sub-network includes at least one hidden layer of the target neural network.
An instance and engine creation module 1002, configured to create inference instances and inference engines corresponding to the plurality of sub-networks, respectively, on the hardware device of the inference platform.
An inference module 1003, configured to perform forward inference on the target neural network based on the inference instances and the inference engines respectively corresponding to the multiple subnetworks.
The forward inference device of the neural network provided by the embodiment of the application can divide the target neural network into a plurality of sub-networks, create an inference instance and an inference engine for each sub-network, and then perform forward inference on the target neural network based on the inference instances and inference engines corresponding to the sub-networks. Because there are multiple inference engines and each inference engine is only responsible for a part of the hidden layers of the target neural network, multiple data inputs can be operated on in parallel by different inference engines at the same time. Compared with existing inference schemes, the parallel operation of multiple inference engines makes full use of hardware resources, improves the utilization rate of hardware resources, improves the inference efficiency, and increases the data throughput without requiring additional storage resources.
In a possible implementation manner, in the forward inference apparatus of a neural network provided in the foregoing embodiment, the network processing module 1001 may include: the device comprises an information acquisition module and a sub-network dividing module.
The information acquisition module is used for acquiring the hardware equipment information of the reasoning platform, the calculated amount of the target neural network and the required storage space.
The sub-network dividing module is used for dividing the target neural network into a plurality of sub-networks based on the hardware equipment information of the reasoning platform, the calculated amount of the target neural network and the required storage space.
In the forward inference apparatus of a neural network provided in the foregoing embodiment, the information obtaining module may include a hardware information obtaining sub-module.
The hardware information acquisition submodule is used for acquiring one or more of the following information: the number of hardware devices, the computing power of the hardware devices, the storage capacity of the hardware devices, and the transmission bandwidth among the hardware devices.
In a possible implementation manner, the information obtaining module further includes: a computation graph building sub-module and a computation amount and storage space determining sub-module.
And the computation graph constructing sub-module is used for constructing the computation graph of the target neural network according to the network parameters of the target neural network.
And the calculation amount and storage space determining submodule is used for determining the calculation amount and the required storage space of each layer of the target neural network according to the calculation graph of the target neural network, and determining the calculation amount and the required storage space of the whole target neural network according to the calculation amount and the required storage space of each layer of the target neural network.
In a possible implementation manner, the sub-network dividing module in the forward inference apparatus of the neural network provided by the above embodiments may include: a parallel mode determination sub-module and a sub-network partitioning sub-module.
The parallel mode determination submodule is used for determining a parallel mode suitable for the target neural network based on the hardware device information of the inference platform, the calculation amount and required storage space of the target neural network, and the parallel mode configured by the user, wherein the parallel mode comprises a single-device parallel mode and a multi-device parallel mode; in the single-device parallel mode, the forward inference of the target neural network is realized based on a single device, and in the multi-device parallel mode, the forward inference of the target neural network is realized based on a plurality of devices;
the sub-network dividing sub-module is used for dividing the target neural network into a plurality of sub-networks based on a parallel mode suitable for the target neural network.
In one possible implementation, the parallel mode determining sub-module includes: a first determination submodule and a second determination submodule.
The first determining sub-module is used for determining that the parallel mode suitable for the target neural network is the multi-device parallel mode when the calculation amount of the whole target neural network is larger than the calculation capacity of a single device and/or the storage space required by the whole target neural network is larger than the storage capacity of the single device.
The second determining sub-module is used for determining the parallel mode suitable for the target neural network based on the parallel mode configured by the user when the calculation amount of the whole target neural network is less than or equal to the calculation capacity of the single device and the storage space required by the whole target neural network is less than or equal to the storage capacity of the single device.
In a possible implementation manner, the second determining submodule is specifically configured to determine, when the user-configured parallel mode is the single-device parallel mode, that the parallel mode suitable for the target neural network is the single-device parallel mode; when the parallel mode configured by the user is the multi-device parallel mode, if the transmission time between the devices is greater than the maximum execution time of the preset sub-network, determining that the parallel mode suitable for the target neural network is the single-device parallel mode, and if the transmission time between the devices is less than or equal to the maximum execution time of the preset sub-network, determining that the parallel mode suitable for the target neural network is the multi-device parallel mode.
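The decision logic of these two determining sub-modules can be summarized in a small sketch; the parameter names below are illustrative assumptions, not identifiers from the patent.

```python
def choose_parallel_mode(total_flops, total_mem,
                         device_flops_cap, device_mem_cap,
                         user_mode, link_time_s, max_subnet_time_s):
    """Return "multi" or "single" following the rules described above."""
    # A single device cannot compute or store the whole network: force multi-device.
    if total_flops > device_flops_cap or total_mem > device_mem_cap:
        return "multi"
    # Otherwise defer to the user, but only honour multi-device if the
    # inter-device transfer time does not exceed the preset maximum
    # execution time of a sub-network.
    if user_mode == "single":
        return "single"
    return "single" if link_time_s > max_subnet_time_s else "multi"
```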
In one possible implementation, the sub-network partitioning sub-module includes: a first partitioning submodule and a second partitioning submodule.
The first dividing module is configured to, when the parallel mode suitable for the target neural network is the multi-device parallel mode, obtain the number of sub-networks to be divided based on the number of the hardware devices, and divide the target neural network based on the number of sub-networks to be divided.
And the second dividing submodule is used for dividing the target neural network based on the preset dividing number of sub-networks when the parallel mode suitable for the target neural network is the single-device parallel mode.
In a possible implementation manner, the first partitioning module is specifically configured to divide the target neural network based on the number of sub-network divisions, with the theoretical calculation amount handled by a single device and the maximum data amount transmitted between devices as the dividing basis.
The theoretical calculation amount of the single device is determined by the calculation amount of the whole target neural network and the division number of the sub-networks, and the maximum data amount transmitted among the devices is determined by the preset maximum execution time of the sub-networks and the transmission bandwidth among the devices.
In a possible implementation manner, the first dividing module is specifically configured to traverse backwards from the input layer of the target neural network, superposing the calculation amounts of the hidden layers in turn; when the currently superposed calculation amount is close to the theoretical calculation amount handled by a single device, the sub-network formed by the superposed adjacent hidden layers is taken as a candidate sub-network; if the output data volume of the candidate sub-network is less than or equal to the maximum data volume transmitted between devices, the candidate sub-network is taken as one sub-network obtained by division; if the output data volume of the candidate sub-network is greater than the maximum data volume transmitted between devices, hidden layers are removed from the candidate sub-network one by one from back to front until the output data volume of the trimmed sub-network is less than or equal to the maximum data volume transmitted between devices, and the trimmed sub-network is taken as one sub-network obtained by division; the backward traversal then continues until all sub-networks are obtained, and after each sub-network is obtained, the superposition of calculation amounts restarts from the first hidden layer behind it.
In a possible implementation manner, the inference module 1003 is specifically configured to determine, according to the dependency relationship among the multiple subnetworks, the dependency relationship among the inference engines corresponding to the multiple subnetworks, respectively; and inputting data to the inference engines corresponding to the sub-networks respectively in sequence, so that each inference engine operates the corresponding sub-network based on the input data and the corresponding inference instance.
An embodiment of the present application further provides a forward inference device of a neural network, please refer to fig. 11, which shows a schematic structural diagram of the forward inference device, where the forward inference device may include: at least one processor 1101, at least one communication interface 1102, at least one memory 1103, and at least one communication bus 1104;
in the embodiment of the present application, the number of the processor 1101, the communication interface 1102, the memory 1103 and the communication bus 1104 is at least one, and the processor 1101, the communication interface 1102 and the memory 1103 complete communication with each other through the communication bus 1104;
the processor 1101 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention, or the like;
the memory 1103 may include a high-speed RAM memory and may also include a non-volatile memory, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
dividing a target neural network into a plurality of sub-networks, wherein any sub-network comprises at least one hidden layer of the target neural network;
creating inference instances and inference engines corresponding to the sub-networks on hardware equipment of an inference platform respectively;
and carrying out forward reasoning on the target neural network based on the reasoning examples and the reasoning engines respectively corresponding to the sub-networks.
Alternatively, the detailed function and the extended function of the program may be as described above.
Embodiments of the present application further provide a readable storage medium, where a program suitable for being executed by a processor may be stored, where the program is configured to:
dividing a target neural network into a plurality of sub-networks, wherein any sub-network comprises at least one hidden layer of the target neural network;
creating inference instances and inference engines corresponding to the sub-networks on hardware equipment of an inference platform respectively;
and carrying out forward reasoning on the target neural network based on the reasoning examples and the reasoning engines respectively corresponding to the sub-networks.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (20)

1. A method of forward reasoning for a neural network, comprising:
dividing a target neural network into a plurality of sub-networks, wherein any sub-network comprises one hidden layer or a plurality of continuous hidden layers of the target neural network;
creating inference instances and inference engines corresponding to the sub-networks on hardware equipment of an inference platform respectively;
performing forward reasoning on the target neural network based on the reasoning examples and the reasoning engines respectively corresponding to the sub-networks;
wherein the dividing of the target neural network into a plurality of sub-networks comprises:
acquiring hardware equipment information of the reasoning platform, the calculated amount of a target neural network and the required storage space;
determining a parallel mode suitable for the target neural network based on hardware device information of the inference platform, a calculation amount and required storage space of the target neural network and a parallel mode configured by a user, wherein the parallel mode comprises a single-device parallel mode and a multi-device parallel mode, in the single-device parallel mode, forward inference of the target neural network is realized based on a single device, and in the multi-device parallel mode, the forward inference of the target neural network is realized based on a plurality of devices;
dividing the target neural network into a plurality of sub-networks based on a parallel pattern that fits the target neural network.
2. The neural network forward reasoning method of claim 1, wherein the hardware device information of the reasoning platform comprises one or more of the following information:
the number of hardware devices, the computing power of the hardware devices, the storage capacity of the hardware devices, and the transmission bandwidth among the hardware devices.
3. The method of forward reasoning for a neural network as claimed in claim 1, wherein obtaining the computational load and the required memory space of the target neural network comprises:
constructing a calculation graph of the target neural network according to the network parameters of the target neural network;
determining the calculated amount and the required storage space of each layer of the target neural network according to the calculation graph of the target neural network;
and determining the calculated amount and the required storage space of the whole target neural network according to the calculated amount and the required storage space of each layer of the target neural network.
4. The method of forward reasoning for neural networks of claim 1, wherein the determining the parallel mode suitable for the target neural network based on the hardware device information of the reasoning platform, the computation amount and the required storage space of the target neural network, and the user configured parallel mode comprises:
if the calculation amount of the whole target neural network is larger than the calculation capacity of a single device and/or the storage space required by the whole target neural network is larger than the storage capacity of the single device, determining that the parallel mode suitable for the target neural network is the multi-device parallel mode;
if the calculation amount of the whole target neural network is less than or equal to the calculation capacity of the single device, and the storage space required by the whole target neural network is less than or equal to the storage capacity of the single device, determining a parallel mode suitable for the target neural network based on the user-configured parallel mode.
5. The method of neural network forward reasoning of claim 4, wherein said determining a parallel pattern that fits the target neural network based on the user-configured parallel pattern comprises:
when the user-configured parallel mode is the single-device parallel mode, determining that the parallel mode suitable for the target neural network is the single-device parallel mode;
when the parallel mode configured by the user is the multi-device parallel mode, if the transmission time between the devices is greater than the maximum execution time of the preset sub-network, determining that the parallel mode suitable for the target neural network is the single-device parallel mode, and if the transmission time between the devices is less than or equal to the maximum execution time of the preset sub-network, determining that the parallel mode suitable for the target neural network is the multi-device parallel mode.
6. The method of forward reasoning for a neural network as claimed in claim 1, wherein said dividing the target neural network into a plurality of sub-networks based on a parallel pattern adapted to the target neural network comprises:
if the parallel mode suitable for the target neural network is the multi-device parallel mode, obtaining the dividing number of sub-networks based on the number of the hardware devices, and dividing the target neural network based on the dividing number of the sub-networks;
and if the parallel mode suitable for the target neural network is the single-device parallel mode, dividing the target neural network based on the preset number of sub-network divisions.
7. The method of neural network forward inference according to claim 6, wherein said dividing the target neural network based on the number of divisions of the sub-network comprises:
dividing the target neural network, based on the dividing number of the sub-networks, with the theoretical calculation amount handled by a single device and the maximum data amount transmitted between devices as the dividing basis;
the theoretical calculation amount of the single device is determined by the calculation amount of the whole target neural network and the division number of the sub-networks, and the maximum data amount transmitted among the devices is determined by the preset maximum execution time of the sub-networks and the transmission bandwidth among the devices.
8. The method for the neural network forward inference according to claim 7, wherein the dividing the target neural network based on the dividing number of the sub-networks, with the theoretical calculation amount handled by a single device and the maximum data amount transmitted between devices as the dividing basis, comprises:
traversing backwards from the input layer of the target neural network, superposing the calculation amounts of the hidden layers in turn, and when the currently superposed calculation amount is close to the theoretical calculation amount handled by a single device, taking the sub-network formed by the superposed adjacent hidden layers as a candidate sub-network;
if the output data volume of the candidate sub-network is less than or equal to the maximum data volume transmitted between the devices, taking the candidate sub-network as one sub-network obtained by division; if the output data volume of the candidate sub-network is greater than the maximum data volume transmitted between the devices, removing hidden layers from the candidate sub-network one by one from back to front until the output data volume of the trimmed sub-network is less than or equal to the maximum data volume transmitted between the devices, and taking the trimmed sub-network as one sub-network obtained by division;
and continuing traversing backwards until all the sub-networks are obtained, wherein after each sub-network is obtained, the calculated amount of the hidden layer behind the sub-network is superposed again.
9. The method of forward reasoning for neural networks of claim 1, wherein the forward reasoning for the target neural network based on the reasoning instances and the reasoning engines respectively corresponding to the sub-networks comprises:
determining the dependency relationship among inference engines corresponding to the sub-networks according to the dependency relationship among the sub-networks;
and inputting data to the inference engines corresponding to the sub-networks respectively in sequence, so that each inference engine operates the corresponding sub-network based on the input data and the corresponding inference instance.
10. A forward inference apparatus of a neural network, comprising: the system comprises a network processing module, an instance and engine creating module and an inference module;
the network processing module is used for dividing a target neural network into a plurality of sub-networks, wherein any sub-network comprises one hidden layer or a plurality of continuous hidden layers of the target neural network;
the instance and engine creating module is used for creating inference instances and inference engines which respectively correspond to the sub-networks on hardware equipment of the inference platform;
the reasoning module is used for carrying out forward reasoning on the target neural network based on the reasoning examples and the reasoning engines respectively corresponding to the sub-networks;
wherein the network processing module comprises: the device comprises an information acquisition module and a sub-network dividing module;
the information acquisition module is used for acquiring hardware equipment information of the reasoning platform, the calculated amount of the target neural network and the required storage space;
the sub-network dividing module includes: a parallel mode determination submodule and a sub-network division submodule;
the parallel mode determination submodule is used for determining a parallel mode suitable for the target neural network based on hardware equipment information of the inference platform, the calculated amount and required storage space of the target neural network and a parallel mode configured by a user, wherein the parallel mode comprises a single-equipment parallel mode and a multi-equipment parallel mode, in the single-equipment parallel mode, the forward inference of the target neural network is realized based on a single equipment, and in the multi-equipment parallel mode, the forward inference of the target neural network is realized based on a plurality of equipments;
the sub-network dividing sub-module is used for dividing the target neural network into a plurality of sub-networks based on a parallel mode suitable for the target neural network.
11. The neural network forward reasoning apparatus of claim 10, wherein the hardware device information of the reasoning platform comprises one or more of:
the number of hardware devices, the computing power of the hardware devices, the storage capacity of the hardware devices, and the transmission bandwidth among the hardware devices.
12. The neural network forward reasoning apparatus as claimed in claim 10, wherein the information acquisition module comprises: a computation graph construction sub-module and a computation amount and storage space determination sub-module;
the computation graph constructing sub-module is used for constructing a computation graph of the target neural network according to the network parameters of the target neural network;
and the calculation amount and storage space determining submodule is used for determining the calculation amount and the required storage space of each layer of the target neural network according to the calculation graph of the target neural network, and determining the calculation amount and the required storage space of the whole target neural network according to the calculation amount and the required storage space of each layer of the target neural network.
13. The neural network forward reasoning apparatus of claim 10, wherein the parallel mode determining sub-module comprises: a first determination submodule and a second determination submodule;
the first determining sub-module is used for determining that the parallel mode suitable for the target neural network is the multi-device parallel mode when the calculation amount of the whole target neural network is larger than the calculation capacity of a single device and/or the storage space required by the whole target neural network is larger than the storage capacity of the single device;
the second determining sub-module is used for determining the parallel mode suitable for the target neural network based on the parallel mode configured by the user when the calculation amount of the whole target neural network is less than or equal to the calculation capacity of the single device and the storage space required by the whole target neural network is less than or equal to the storage capacity of the single device.
14. The forward inference apparatus of neural networks as claimed in claim 13, wherein said second determining sub-module is specifically configured to determine that the parallel mode suitable for the target neural network is the single device parallel mode when the user-configured parallel mode is the single device parallel mode; when the parallel mode configured by the user is the multi-device parallel mode, if the transmission time between the devices is greater than the maximum execution time of the preset sub-network, determining that the parallel mode suitable for the target neural network is the single-device parallel mode, and if the transmission time between the devices is less than or equal to the maximum execution time of the preset sub-network, determining that the parallel mode suitable for the target neural network is the multi-device parallel mode.
15. The neural network forward inference device of claim 10, wherein the sub-network partitioning sub-module comprises: a first partitioning submodule and a second partitioning submodule;
the first dividing module is configured to, when the parallel mode suitable for the target neural network is the multi-device parallel mode, obtain the number of sub-networks to be divided based on the number of the hardware devices, and divide the target neural network based on the number of sub-networks to be divided;
and the second dividing submodule is used for dividing the target neural network based on the preset dividing number of sub-networks when the parallel mode suitable for the target neural network is the single-device parallel mode.
16. The forward inference device of a neural network of claim 15, wherein the first partitioning module is specifically configured to partition the target neural network based on the number of partitions of the sub-network, with a theoretical calculation amount for which a single device is responsible and a maximum data amount transmitted between devices as a partition basis;
the theoretical calculation amount of the single device is determined by the calculation amount of the whole target neural network and the division number of the sub-networks, and the maximum data amount transmitted among the devices is determined by the preset maximum execution time of the sub-networks and the transmission bandwidth among the devices.
17. The neural network forward reasoning apparatus as claimed in claim 16, wherein the first partitioning module is specifically configured to traverse backwards from the input layer of the target neural network, superposing the calculation amounts of the hidden layers in turn; when the currently superposed calculation amount is close to the theoretical calculation amount handled by a single device, the sub-network formed by the superposed adjacent hidden layers is taken as a candidate sub-network; if the output data volume of the candidate sub-network is less than or equal to the maximum data volume transmitted between the devices, the candidate sub-network is taken as one sub-network obtained by division; if the output data volume of the candidate sub-network is greater than the maximum data volume transmitted between the devices, hidden layers are removed from the candidate sub-network one by one from back to front until the output data volume of the trimmed sub-network is less than or equal to the maximum data volume transmitted between the devices, and the trimmed sub-network is taken as one sub-network obtained by division; the backward traversal continues until all the sub-networks are obtained, and after each sub-network is obtained, the superposition of calculation amounts restarts from the first hidden layer behind that sub-network.
18. The forward inference device of a neural network as claimed in claim 10, wherein the inference module is specifically configured to determine, according to the dependency relationships among the sub-networks, the dependency relationships among the inference engines respectively corresponding to the sub-networks;
and inputting data to the inference engines corresponding to the sub-networks respectively in sequence, so that each inference engine operates the corresponding sub-network based on the input data and the corresponding inference instance.
19. A forward reasoning apparatus for a neural network, comprising: a memory and a processor;
the memory is used for storing programs;
the processor, which executes the program, implements each step of the neural network forward inference method according to any one of claims 1 to 9.
20. A readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, performs the steps of the method of forward reasoning for a neural network as claimed in any one of claims 1 to 9.
CN201910188467.6A 2019-03-13 2019-03-13 Forward reasoning method, device, equipment and storage medium of neural network Active CN109919315B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910188467.6A CN109919315B (en) 2019-03-13 2019-03-13 Forward reasoning method, device, equipment and storage medium of neural network


Publications (2)

Publication Number Publication Date
CN109919315A CN109919315A (en) 2019-06-21
CN109919315B true CN109919315B (en) 2021-10-01

Family

ID=66964550

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910188467.6A Active CN109919315B (en) 2019-03-13 2019-03-13 Forward reasoning method, device, equipment and storage medium of neural network

Country Status (1)

Country Link
CN (1) CN109919315B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298437B (en) * 2019-06-28 2021-06-01 Oppo广东移动通信有限公司 Neural network segmentation calculation method and device, storage medium and mobile terminal
CN110796242A (en) * 2019-11-01 2020-02-14 广东三维家信息科技有限公司 Neural network model reasoning method and device, electronic equipment and readable medium
CN110837419B (en) * 2019-11-08 2023-05-19 上海交通大学 Reasoning engine system and method based on elastic batch processing and electronic equipment
CN113412493A (en) * 2019-12-30 2021-09-17 深圳元戎启行科技有限公司 Inference engine-based computing resource allocation method and device and computer equipment
CN113128678A (en) * 2020-01-15 2021-07-16 华为技术有限公司 Self-adaptive searching method and device for neural network
CN111753950B (en) * 2020-01-19 2024-02-27 杭州海康威视数字技术股份有限公司 Forward time consumption determination method, device and equipment
CN111372084B (en) * 2020-02-18 2021-07-20 北京大学 Parallel reasoning method and system for neural network coding and decoding tool
CN113469360B (en) * 2020-03-31 2023-10-20 杭州海康威视数字技术股份有限公司 Reasoning method and device
WO2022035058A1 (en) * 2020-08-13 2022-02-17 Samsung Electronics Co., Ltd. Method and system of dnn modularization for optimal loading
CN114501353B (en) * 2020-10-23 2024-01-05 维沃移动通信有限公司 Communication information sending and receiving method and communication equipment
WO2022217419A1 (en) * 2021-04-12 2022-10-20 深圳元戎启行科技有限公司 Neural network model inference method and apparatus, computer device, and storage medium
CN115186821B (en) * 2022-09-13 2023-01-06 之江实验室 Core particle-oriented neural network inference overhead estimation method and device and electronic equipment
CN116739090B (en) * 2023-05-12 2023-11-28 北京大学 Deep neural network reasoning measurement method and device based on Web browser
CN116629308A (en) * 2023-07-24 2023-08-22 科大讯飞股份有限公司 Neural network model reasoning method, device, equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0513976A2 (en) * 1991-03-15 1992-11-19 Sharp Kabushiki Kaisha A video camera having an adaptive automatic iris control circuit
CN1659589A (en) * 2002-04-19 2005-08-24 电脑联合想象公司 System and method for providing inferencing services
CN107203807A (en) * 2016-03-16 2017-09-26 中国科学院计算技术研究所 The computational methods of neutral net, system and its apparatus
CN107451653A (en) * 2017-07-05 2017-12-08 深圳市自行科技有限公司 Computational methods, device and the readable storage medium storing program for executing of deep neural network
CN107659609A (en) * 2017-07-26 2018-02-02 北京天云融创软件技术有限公司 A kind of deep learning support platform and deep learning training method based on cloud computing
CN107886167A (en) * 2016-09-29 2018-04-06 北京中科寒武纪科技有限公司 Neural network computing device and method
CN107945053A (en) * 2017-12-29 2018-04-20 广州思泰信息技术有限公司 A kind of multiple source power distribution network data convergence analysis platform and its control method
CN107977706A (en) * 2017-08-09 2018-05-01 小蚁科技(香港)有限公司 Modularized distribution type artificial neural network
CN108292241A (en) * 2015-10-28 2018-07-17 谷歌有限责任公司 Processing calculates figure
CN109299283A (en) * 2018-08-29 2019-02-01 阿里巴巴集团控股有限公司 A kind of data reasoning method, apparatus, server and the medium of knowledge based map

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102004486B (en) * 2010-09-26 2012-11-28 中国石油化工股份有限公司 Hybrid fault diagnosis method based on qualitative signed directed graph in petrochemical process


Also Published As

Publication number Publication date
CN109919315A (en) 2019-06-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant