CN110503199A - Method and device for splitting operation nodes, electronic equipment, and storage medium - Google Patents
Method and device for splitting operation nodes, electronic equipment, and storage medium
- Publication number
- CN110503199A (application number CN201910750828.1A)
- Authority
- CN
- China
- Prior art keywords
- neural network
- node
- network computing
- split
- functional module
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
This application provides a method and apparatus for executing tasks using an artificial intelligence processor, an electronic device, and a non-transitory computer-readable storage medium. The electronic device includes a central processing unit, an artificial intelligence processor, and a memory; the artificial intelligence processor is communicatively connected to the central processing unit and includes multiple functional modules. The memory stores a computer program which, when executed by the central processing unit, causes the central processing unit to perform the method of executing tasks using the artificial intelligence processor.
Description
Technical field
This application relates to the field of computers, and in particular to a method and apparatus for splitting operation nodes in a neural network model, as well as an electronic device and a non-transitory computer-readable storage medium.
Background art
Artificial intelligence (AI) is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning). It mainly covers the principles by which computers realize intelligence and the construction of computers that work in a manner similar to human-brain intelligence, so that computers can support higher-level applications.
An artificial neural network (ANN), also referred to as a neural network (NN), is an algorithmic and mathematical model that imitates the behavioral characteristics of biological neural networks and performs distributed parallel information processing. Relying on the complexity of the system, such a network processes information by adjusting the interconnections among a large number of internal nodes.
Neural networks are currently popular algorithms in the field of machine learning. In recent years, with the rapid development of deep learning technology, models and algorithms based on neural networks have achieved breakthroughs in many fields. For example, in speech technology, face recognition, autonomous driving, machine translation, and other fields, research on algorithms based on deep neural networks has become increasingly advanced.
Summary of the invention
Based on this, the present application provides a method for splitting operation nodes in a neural network model, comprising:
determining a critical path in the neural network model; and
splitting at least one neural network operation node on the critical path into multiple neural network operation child nodes according to the hardware parallelism of the functional modules of different types in an artificial intelligence processor.
According to another aspect of the application, a device for splitting operation nodes in a neural network model is provided, comprising:
a determination unit, which determines the critical path in the neural network model; and
a splitting unit, which splits at least one neural network operation node on the critical path into multiple neural network operation child nodes according to the hardware parallelism of the functional modules of different types in the artificial intelligence processor.
According to another aspect of the application, an electronic device is provided, comprising:
a central processing unit;
an artificial intelligence processor, communicatively connected to the central processing unit and including multiple functional modules; and
a memory storing a computer program which, when executed by the central processing unit, causes the central processing unit to perform the method described above.
According to another aspect of the application, a non-transitory computer-readable storage medium is provided, on which computer-readable instructions are stored; when the instructions are executed by a processor, they cause the processor to perform the method described above.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 shows an exemplary model of an artificial intelligence processor;
Fig. 2 shows a schematic diagram of a neural network model;
Fig. 3 shows a flowchart of a method for splitting operation nodes in a neural network model according to an embodiment of the application;
Fig. 4 shows a schematic diagram of splitting a neural network operation node;
Fig. 5 schematically illustrates different ways of splitting a neural network operation;
Fig. 6 shows a flowchart of a method for splitting operation nodes in a neural network model according to another embodiment of the application;
Fig. 7 shows a flowchart of a method for splitting operation nodes in a neural network model according to yet another embodiment of the application;
Fig. 8 shows a program example of the PSC algorithm according to an embodiment of the application;
Fig. 9 shows a schematic diagram of a device for splitting operation nodes in a neural network model according to an embodiment of the application;
Fig. 10 shows a schematic diagram of an electronic device according to an embodiment of the application.
Detailed description of embodiments
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those skilled in the art without creative effort shall fall within the protection scope of the present application.
It should be understood that the terms "first", "second", "third", "fourth", and the like in the claims, specification, and drawings of the present application are used to distinguish different objects rather than to describe a particular order. The terms "include" and "comprise" used in the specification and claims of this application indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terms used in this specification are merely for the purpose of describing specific embodiments and are not intended to limit the application. As used in this specification and the claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the term "and/or" used in this specification and the claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
In this application, an artificial intelligence processor, also referred to as a dedicated processor, is a processor for a specific application or field. For example, a graphics processing unit (GPU), also known as a display core, visual processor, or display chip, is a dedicated processor for image computation on personal computers, workstations, game consoles, and some mobile devices (such as tablet computers and smartphones). Another example is the neural network processing unit (NPU), a dedicated processor for matrix multiplication operations in artificial intelligence applications; it adopts a "data-driven parallel computing" architecture and is particularly good at processing massive multimedia data such as video and images. With the development of artificial intelligence technology, computers can now complete many complex tasks, such as face recognition, autonomous driving, and machine translation. To complete these tasks better, they can be implemented with neural network models. However, unlike traditional algorithms, an important feature of neural network models is the coexistence of high memory-access intensity and high computational intensity, which is a huge challenge for general-purpose processors. To cope with this challenge, artificial intelligence processors have been proposed as schemes for accelerating neural networks. Currently, schemes for accelerating neural networks fall mainly into three types: graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs). Among them, GPUs have powerful parallel computing capability but face serious power-efficiency problems, while FPGAs are flexible but have poor peak performance. Unlike GPUs and FPGAs, ASICs are dedicated custom hardware architectures; their multi-core and in-memory computing designs help them surpass GPUs and FPGAs in the field of artificial intelligence computing.
In practical applications, traditional task scheduling for general-purpose processors cannot be applied directly to artificial intelligence processors. The reason is as follows. Abstracting the functional logic inside a general-purpose processor, it contains multiple identical functional modules; the computing tasks executed by these functional modules have no essential difference, so the modules can substitute for one another. As a result, a general-purpose processor is slow and power-hungry when executing the tasks corresponding to a neural network model. For an artificial intelligence processor, by contrast, abstracting its internal functional logic shows that it contains multiple functional modules, each of which executes a corresponding type of computing task; the functional modules cannot substitute for one another.
Analyzing the arithmetic operations in a neural network model according to the operational characteristics of an artificial intelligence processor, the operations in a neural network model can be of many types. For example, convolution and fully connected operations can be completed by matrix multiplication and summation, while activation operations can be completed by table lookup. The artificial intelligence processor referred to in this application contains multiple functional modules, each of which is adapted to execute a different type of arithmetic operation in a neural network model. For example, an artificial intelligence processor may include a functional module dedicated to matrix multiplication and a functional module dedicated to table lookup; the functional modules customized according to operational characteristics are not limited to these. An artificial intelligence processor can therefore process a neural network model in parallel, improving system performance. In addition, the functional modules of an artificial intelligence processor can communicate with one another directly: for a neural network model, data flows are transmitted directly between the functional modules of the artificial intelligence processor without caching. Fig. 1 shows an exemplary model of an artificial intelligence processor. In Fig. 1, the artificial intelligence processor is abstracted as containing multiple functional modules, with different functional modules adapted to execute different types of arithmetic operations in a neural network model, such as matrix multiplication, table lookup, pooling, and vector operations. The artificial intelligence processor shown in Fig. 1 also includes the interconnected data-transmission paths between the functional modules. Different application scenarios place different requirements on an artificial intelligence processor. For example, the field of edge computing is mainly concerned with power efficiency and latency, while cloud servers mainly focus on computing throughput and parallelism. Depending on the practical application, the number of functional modules of the artificial intelligence processor, as well as their computing speed, connectivity, and corresponding bandwidth, can be configured.
As already mentioned, different types of arithmetic operations are accelerated by different functional modules of the artificial intelligence processor. Therefore, the neural network operations in the whole model that can be processed in parallel can be run on different functional modules, reducing the total running time. For example, in a neural network model, A is calculated first, the output of A is then used to calculate B and C, and finally the outputs of B and C are used to calculate D, where A, B, C, and D are different nodes in the neural network model. In this example, node B and node C can be executed in parallel. Suppose node B needs 4 s to execute its arithmetic operation and node C needs 3 s; node D's arithmetic operation can be executed only after both node B and node C have completed theirs. The time required by the parallel computation is then the time of the longest arithmetic operation among the parallel ones. Therefore, in this example, nodes A, B, and D are the nodes on the critical path. In the process of executing the task corresponding to the neural network model, the longest arithmetic operation among the parallel ones is determined, and this operation is optimized according to the resource utilization of the functional modules inside the artificial intelligence processor so as to reduce the time it requires. The time required by the parallel computation is reduced correspondingly, which solves the task-scheduling optimization problem of executing the task corresponding to a neural network model on an artificial intelligence processor.
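The critical-path reasoning in the example above (A feeds B and C in parallel; B at 4 s dominates C at 3 s; both feed D) amounts to a longest-path computation over the operation graph. The sketch below reproduces that example; the times for nodes A and D are assumed for illustration, and the function is only a minimal longest-path search, not the algorithm of this application.

```python
# Longest (critical) path through a DAG of operation nodes,
# using the A -> {B, C} -> D example from the text.
def critical_path(nodes, edges, cost):
    # nodes: iterable of node names; edges: dict node -> successors;
    # cost: dict node -> execution time
    order, seen = [], set()

    def visit(n):  # simple DFS that appends nodes in reverse topological order
        if n in seen:
            return
        seen.add(n)
        for m in edges.get(n, []):
            visit(m)
        order.append(n)

    for n in nodes:
        visit(n)

    best = {}  # node -> (total time from this node to the end, path)
    for n in order:
        succs = edges.get(n, [])
        if not succs:
            best[n] = (cost[n], [n])
        else:
            t, p = max(best[s] for s in succs)
            best[n] = (cost[n] + t, [n] + p)
    return max(best.values())

time, path = critical_path(
    ["A", "B", "C", "D"],
    {"A": ["B", "C"], "B": ["D"], "C": ["D"]},
    {"A": 2, "B": 4, "C": 3, "D": 1},  # B=4s, C=3s from the text; A, D assumed
)
```

With these assumed costs the search returns the path A → B → D, matching the critical path identified in the example.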
In the process of handling a neural network model, the neural network model corresponding to the task to be executed can first be resolved into a computational topology graph containing multiple neural network operation nodes.
Normally, according to the characteristics of neural network models, a task to be executed can be resolved into a computational topology graph containing multiple operation nodes. Fig. 2 shows a schematic diagram of a neural network model. As shown in Fig. 2, the neural network model contains multiple neural network operation nodes, each of which represents a neural network operation. In the process of executing the task, the input data passes through a large number of neural network operations and finally yields the result.
Fig. 3 shows a flowchart of a method for splitting operation nodes in a neural network model according to an embodiment of the application. As shown in Fig. 3, the method 100 may include steps S110 and S120.
In step S110, the critical path in the neural network model is determined.
As described above, the critical path refers to the logical path with the longest delay from input to output in the neural network model; the critical path contains multiple operation nodes. On the other hand, the artificial intelligence processor contains multiple functional modules adapted to different types of neural network operations. Thus, for each type of neural network operation, there are one or more type-matched functional modules in the artificial intelligence processor that can execute the operation. Based on this, any neural network operation node in the neural network model, whether an unsplit node or a child node obtained by splitting, can be executed by a type-matched functional module found in the artificial intelligence processor according to its operation type (the specific splitting process is described below).
In step S120, at least one neural network operation node on the critical path is split into multiple neural network operation child nodes according to the hardware parallelism of the functional modules of different types in the artificial intelligence processor. According to this embodiment, the splitting can be performed according to the hardware parallelism of the functional modules matching the different types. In this application, the hardware parallelism of a functional module refers to the number of functional modules that match nodes of the same type. For example, as shown in Fig. 4, when a neural network operation node is split, one node can be split into multiple child nodes. The number of child nodes actually split out can be determined by the number of functional modules matching that type. For instance, if the artificial intelligence processor contains X functional modules matching the operation type of the node, the node can be split into X child nodes, so that the X functional modules execute the X child nodes concurrently. That is, if the number of functional modules matching a certain node type is limited, the number of child nodes a node is split into is limited accordingly and cannot grow without bound. If the number of child nodes split out exceeds the number of type-matched functional modules, several child nodes split from the same node will be processed on the same functional module, which brings no improvement in operational parallelism.
Since there may be multiple type-matched functional modules, a single operation node can be executed in parallel by splitting it; and for the whole computation process, splitting at least some of the operation nodes improves the operational parallelism, which can greatly save task-processing time and improve operational efficiency.
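The bound described above — no more child nodes than matching functional modules, since co-located children run serially anyway — can be written as a small helper. The function name and interface are illustrative assumptions, not part of the application.

```python
def split_factor(requested_parts, matching_modules):
    """How many child nodes to split an operation node into.

    Splitting beyond the number of matching functional modules (the
    hardware parallelism X in the text) gains nothing, because two
    child nodes on the same module execute serially.
    """
    if matching_modules <= 1:
        return 1  # no parallel hardware of this type: splitting cannot help
    return min(requested_parts, matching_modules)
```

For example, with three matmul modules a request for eight parts is capped at three, while a request for two parts is honored as-is.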
As described above, neural network operations are usually arithmetic operations similar to matrix multiplication or table lookup, and operations of this kind are characterized by being splittable. For example, a multiplication between two matrices of larger dimensions can be converted into (that is, split into) multiplications between several matrices of smaller dimensions. This is the detachability of neural network operations. Based on this detachability, at least some of the neural network operation nodes on the critical path of the task determined in step S110 can be partitioned into multiple neural network operation child nodes. Fig. 4 shows a schematic diagram of splitting a neural network operation node. As shown in Fig. 4, before the split, the order of operations of the original neural network nodes is from node A to node B and then to node C. After node B is split into child nodes B1, B2, and B3, the order of operations becomes from node A to the parallel child nodes B1, B2, and B3, and then to node C. It can be seen that the splitting process improves the operational parallelism in the neural network model.
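The detachability claimed above for matrix multiplication (and, by analogy, the B → B1, B2, B3 split of Fig. 4) can be checked in a few lines of NumPy: splitting one large product into three smaller products and splicing the partial results reproduces the unsplit result. The matrix sizes are arbitrary; this demonstrates the property, not the application's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
W = rng.standard_normal((4, 5))

full = A @ W  # the original, unsplit operation

# Split the operation into three child operations (rows 0-1, 2-3, 4-5),
# run them independently, then splice the partial results back together.
parts = [A[i:i + 2] @ W for i in range(0, 6, 2)]
spliced = np.concatenate(parts, axis=0)

assert np.allclose(full, spliced)
```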
For heterogeneous systems, multiple task scheduling algorithms have been proposed. For example, in a list scheduling heuristic algorithm, a priority sequence is created for all tasks, and task selection and processor selection are then performed periodically until all tasks have been scheduled. The Mapping Heuristic algorithm and the Earliest Time First algorithm are two examples of list scheduling heuristic algorithms. In addition, the HEFT (Heterogeneous Earliest Finish Time) algorithm and the CPOP (Critical Path on a Processor) algorithm mentioned in the subsequent description also belong to list scheduling algorithms. The upward rank and downward rank of a task are important parts of these two algorithms and will be described in detail later.
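The upward rank mentioned above is, in HEFT and CPOP, the length of the longest path from a task to the exit task, counting average computation and communication costs. A minimal recursive sketch, assuming the DAG is given as successor lists with average costs (the toy graph and all names are illustrative):

```python
from functools import lru_cache

def upward_ranks(succ, w_avg, c_avg):
    """rank_u(t) = w_avg(t) + max over successors s of (c_avg(t, s) + rank_u(s)).

    succ:  dict task -> list of successor tasks
    w_avg: dict task -> average computation cost
    c_avg: dict (task, successor) -> average communication cost
    """
    @lru_cache(maxsize=None)
    def rank(t):
        ss = succ.get(t, [])
        if not ss:
            return w_avg[t]  # exit task: just its own cost
        return w_avg[t] + max(c_avg[(t, s)] + rank(s) for s in ss)

    return {t: rank(t) for t in succ}

# Tiny example: T1 -> T2 -> T3 and a direct edge T1 -> T3.
ranks = upward_ranks(
    {"T1": ["T2", "T3"], "T2": ["T3"], "T3": []},
    {"T1": 2, "T2": 3, "T3": 1},
    {("T1", "T2"): 1, ("T1", "T3"): 4, ("T2", "T3"): 2},
)
```

In list scheduling, tasks are then ordered by decreasing upward rank before processor selection.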
However, the task scheduling algorithms proposed for traditional heterogeneous systems do not consider whether a task is splittable; in other words, they are all based on the assumption that tasks are indivisible and that the processors are general-purpose with respect to all tasks. For neural network operations, however, the tasks are splittable, and an artificial intelligence processor such as a neural network acceleration chip can have multiple customized modules.
Fig. 5 schematically illustrates different ways of splitting a neural network operation. Part (a) of Fig. 5 shows the original, unsplit neural network operation, whose type is convolution (conv). The output data of the upper-layer node is a four-dimensional tensor out [2,256,101,101], which serves as the input in [2,256,101,101] of this layer's node. After the convolution operation of this layer, the output is again a four-dimensional tensor out [2,512,50,50], which serves as the input data in [2,512,50,50] of the next layer's node.
Part (b) of Fig. 5 shows splitting in the batch direction, i.e., the 2 in the tensor is split into two 1s. Part (c) of Fig. 5 shows splitting in the width direction, i.e., the 101 in the tensor is split into two 51s. Part (d) of Fig. 5 shows splitting in the input-channel direction, i.e., the 256 in the tensor is split into two 128s. Part (e) of Fig. 5 shows splitting in the output-channel direction, where the input data size of 5222912 is split into two of 2637312. Because of the particularity of the convolution operation, the input-channel split of Fig. 5(d) additionally requires the results of the two split operations to be combined by a further operation, whereas for the other splitting modes of Fig. 5 the results of the split operations can be spliced together directly.
Since neural network operations can be completed by vectorized computation, a neural network operation can be split to improve the parallelism of the task. Based on the characteristics of different operations, different splitting strategies such as those shown in Fig. 5 can be adopted. For example, based on the splitting direction, a two-dimensional convolution operation can be split in the following five ways. Splitting in the batch direction adds no extra burden, but if the batch size of the operation is 1, this mode cannot be used (the parallelism cannot be improved). Splitting in the input-channel direction requires extra addition operations: the partial results from the split child nodes must be added together. Splitting in the output-channel direction requires each split child node to obtain the entire input data, which increases the volume of data transmission. Splitting in the height or width direction generates extra data-transmission overhead because of the duplicated input in each split child node. On the other hand, for a batch-normalization operation, splitting does not lead to extra workload. Therefore, in this application, the following splitting principles can be constructed for each neural network operation.
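The contrast drawn above — an input-channel split requires the partial results to be added, while an output-channel split only requires concatenation (at the cost of duplicating the input) — can be illustrated with a fully connected layer, which has the same channel structure as convolution. This is a sketch of the mathematical property, not of the application's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(256)         # input with 256 channels
W = rng.standard_normal((256, 512))  # weights: 256 in-channels, 512 out-channels

full = x @ W

# Input-channel split (two halves of 128): each child produces a
# PARTIAL result over all 512 outputs, and the parts must be ADDED.
in_split = x[:128] @ W[:128] + x[128:] @ W[128:]

# Output-channel split (two halves of 256): each child produces the
# COMPLETE result for its own outputs, and the parts are CONCATENATED;
# note that each child still needs the whole input x.
out_split = np.concatenate([x @ W[:, :256], x @ W[:, 256:]])

assert np.allclose(full, in_split)
assert np.allclose(full, out_split)
```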
According to the application, a neural network application (that is, a task to be executed) can be represented as a computational topology graph, for example the directed acyclic graph (DAG) shown in Fig. 2, which can be denoted G = (V, E). Each node v_i ∈ V represents an operation, and each edge e_{i,j} ∈ E represents the data dependency between node v_i and node v_j. In the splitting process p_i of node v_i, node v_i is replaced by the new nodes (that is, child nodes) split out of it and by the edges associated with these new nodes. The data-transmission volume on these new edges is related to the operation and parameters of node v_i, as shown below:
G′(V′, E′) = p_i(G(V, E)) (5)
For all splittable nodes, a splitting sequence p = (p_1, p_2, …, p_k) of the nodes can be obtained, and the splitting-based scheduling process can then be expressed in terms of this sequence.
According to an embodiment of the application, in the above step S120, a Partition Scheduling Combination algorithm (PSC algorithm) is used to split the neural network operation nodes.
According to one embodiment of the application, all splittable neural network operation nodes can be split at once according to the hardware parallelism of the functional modules in the artificial intelligence processor. That is, the PSC algorithm can split all splittable neural network operation nodes in a single pass. Of course, this splitting operation also follows the principles mentioned above and must take into account the hardware parallelism of the functional modules in the artificial intelligence processor.
After all splittable neural network operation nodes have been split at once according to the PSC algorithm, it is possible, owing to the complexity of the neural network model, that not every child node obtained by splitting can actually run in parallel. That is, several child nodes split from the same neural network operation node may be assigned to the same functional module of the artificial intelligence processor for execution. In this case, the PSC algorithm can merge the split child nodes according to the neural network operation nodes and child nodes assigned to each functional module: when the child nodes assigned to some functional module are found to include multiple child nodes split from the same neural network operation node, those child nodes are merged.
For example, suppose the artificial intelligence processor has three functional modules a1, a2, and a3 matching the type of neural network node A; node A is then split into three child nodes A1, A2, and A3. All the nodes and child nodes after splitting are then scheduled and assigned to the functional modules. The merging process means that if, after scheduling, child nodes A1 and A2 are found to be dispatched to the same functional module (this can happen because the original neural network graph may also contain another node A′ of the same type as node A, so that the number of child nodes split out exceeds the number of type-matched functional modules), child nodes A1 and A2 are combined again.
What the PSC algorithm and the IPS algorithm have in common is that both split the computational topology graph into a new computational topology graph that is better suited to scheduling. The difference is that the IPS algorithm strictly computes the splitting gain and splits iteratively according to that gain; it is a relatively stable splitting method, but its running time is relatively long. The PSC algorithm, by contrast, is a more aggressive splitting method than the IPS algorithm: it does not compute splitting gains and costs but splits heuristically, and then performs a one-time merge according to the splitting result to minimize the splitting cost. The effect of the PSC algorithm is less stable than that of the IPS algorithm, but its running time is shorter.
According to one embodiment of the application, the neural network operation child nodes obtained by splitting can be merged in the following way. First, it is detected whether any functional module has been assigned neural network operation child nodes split from the same neural network operation node. Then, according to the detection result, if so, the child nodes split from the same neural network operation node in that functional module are merged. In this way, the splitting and merging of neural network operation nodes are realized.
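The detect-and-merge step described above can be sketched as follows: scan each functional module's assigned nodes, group them by the original node they were split from, and fuse any group of two or more. The data layout (`assignment`, `parent`) and the merged-node naming are illustrative assumptions.

```python
from collections import defaultdict

def merge_colocated_children(assignment, parent):
    """assignment: dict functional module -> list of node names.
    parent: dict child-node name -> original node it was split from.
    Returns a new assignment in which child nodes of the same parent
    that landed on the same module are merged back into one node."""
    merged = {}
    for module, nodes in assignment.items():
        groups = defaultdict(list)
        for n in nodes:
            groups[parent.get(n, n)].append(n)  # unsplit nodes group alone
        new_nodes = []
        for group in groups.values():
            if len(group) > 1:
                new_nodes.append("+".join(group))  # e.g. A1 and A2 -> "A1+A2"
            else:
                new_nodes.extend(group)
        merged[module] = new_nodes
    return merged

# A was split into A1, A2, A3, but A1 and A2 were both scheduled on a1.
result = merge_colocated_children(
    {"a1": ["A1", "A2"], "a2": ["A3"], "a3": ["B"]},
    {"A1": "A", "A2": "A", "A3": "A"},
)
```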
Fig. 6 shows a flowchart of a method for splitting operation nodes in a neural network model according to another embodiment of the application. As shown in Fig. 6, the method 100 may further include step S130 in addition to steps S110 and S120. Only the differences between the embodiment of Fig. 6 and that of Fig. 3 are described in detail below; their common features are not repeated.
Step S130: allocating the neural network operation nodes that were not split and the multiple neural network operation child nodes obtained by splitting respectively to type-matched functional modules in the artificial intelligence processor, to be executed by those functional modules. Step S130 completes the scheduling of each node in the computation topology graph, assigning the nodes to the functional modules of the artificial intelligence processor for processing. As described above, the artificial intelligence processor contains multiple functional modules suited to different types of neural network operations. Thus, for each type of neural network operation, the artificial intelligence processor has one or more type-matched functional modules that can execute it. On this basis, each neural network operation node in the neural network model can be matched, according to its operation type, to a functional module of the artificial intelligence processor for execution. For example, parallel neural network operation nodes in the neural network model can be respectively allocated to different type-matched functional modules in the artificial intelligence processor.
In this way, parallel processing of the neural network model on the artificial intelligence processor is realized: the neural network operations in the model that can be processed in parallel run on different functional modules, which reduces the total operation time and thereby improves system performance.
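The type-matched allocation of step S130 can be illustrated with a minimal sketch. The module table, the load-balancing tie-break, and all names are assumptions made for illustration only:

```python
def schedule_by_type(nodes, modules):
    """nodes: list of (name, op_type); modules: dict module_id -> supported op_type."""
    load = {m: 0 for m in modules}
    placement = {}
    for name, op_type in nodes:
        candidates = [m for m, t in modules.items() if t == op_type]
        if not candidates:
            raise ValueError(f"no functional module matches type {op_type}")
        # Parallel nodes of the same type spread across the matching modules:
        # pick the currently least-loaded candidate.
        target = min(candidates, key=lambda m: load[m])
        placement[name] = target
        load[target] += 1
    return placement
```

With two convolution modules and one pooling module, two parallel convolution nodes end up on different modules, matching the parallel allocation described above.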
The method according to this embodiment can be regarded as a splitting-and-scheduling algorithm based on the neural network model. It exploits both the splittability of neural network operations and the multiple functional modules in the artificial intelligence processor suited to different operation types: when executing a task, some or all of the nodes on the task's critical path are first split, and then the unsplit nodes and the split child nodes are together allocated (that is, scheduled) to type-matched functional modules in the artificial intelligence processor. Parallel processing is thereby achieved, which increases the speed of task execution and also improves hardware utilization.
For example, consider a neural network model containing m groups of arithmetic operations; the operations are called the nodes of the neural network model and are labeled op_1, op_2, op_3, ..., op_m. Correspondingly, the configured artificial intelligence processor contains n functional modules, whose operation functions are labeled fu_1, fu_2, ..., fu_n. For each functional module, the function g(fu_i) characterizes the computing speed of the i-th functional module fu_i, and the function h(fu_i, fu_j) characterizes the bandwidth between functional modules fu_i and fu_j. The function h(fu_i, fu_j) can be expressed by the following formula:
The neural network model can be represented by a directed acyclic graph (DAG): G = (V, E), where V denotes the nodes of the neural network model, each node v_i ∈ V denotes an arithmetic operation, and each edge e_{i,j} ∈ E denotes the data dependency between v_i and v_j. Let the operation scale be cp(v_i) and the operation function of the selected functional module be f(v_i). The computation time cpt(v_i) is then determined as cp(v_i)/g(f(v_i)). Let the data scale from node v_i to node v_j be io(v_i, v_j); the corresponding communication time iot(v_i, v_j) is defined as io(v_i, v_j)/h(f(v_i), f(v_j)).
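The two definitions above, cpt(v_i) = cp(v_i)/g(f(v_i)) and iot(v_i, v_j) = io(v_i, v_j)/h(f(v_i), f(v_j)), can be transcribed directly; the numeric values in the example are purely illustrative:

```python
def compute_time(cp, speed):
    # cpt(v_i) = cp(v_i) / g(f(v_i))
    return cp / speed

def comm_time(io, bandwidth):
    # iot(v_i, v_j) = io(v_i, v_j) / h(f(v_i), f(v_j))
    return io / bandwidth

# Example: a node of operation scale 1200 on a module of speed 300 takes
# 4 time units; moving 500 units of data over bandwidth 250 takes 2 units.
```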
The artificial intelligence processor is a processor for accelerating neural networks. Functional modules of different types are arranged inside it; the functional modules of each type execute tasks of the corresponding type, and modules of different types cannot substitute for one another. Since tasks of a given type in the neural network model must be assigned to modules of the corresponding function when accelerating the network, priorities are set for the nodes. Let the priority of node v_i be s(v_i); nodes assigned to the same functional module are then executed in priority order. Let the start time of node v_i be st(v_i) and its finish time be ft(v_i). For nodes v_i and v_j on the same functional module, the constraints of formulas (2) and (3) hold:
st(v_i) ≥ ft(v_j)  if s(v_i) < s(v_j)     (2)
Specifically, in the neural network model, for an entry node v_i with no preceding node: if v_j is also an entry node and f(v_i) equals f(v_j), then st(v_i) equals 0 or ft(v_j). For the other nodes in the neural network model, formula (3) gives the expression for the start time, and the finish time at which node v_i completes its arithmetic operation is ft(v_i) = st(v_i) + cpt(v_i).
Under the above definitions, the scheduling problem is to match suitable functional modules from the set of functional modules of the artificial intelligence processor and to determine the priority of each node; that is, to find a functional-module allocation function f and a priority-setting function s that minimize the execution time.
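A minimal sketch of these timing rules, under the assumption (implied by constraint (2)) that a larger s(v) means earlier execution and that each node starts as soon as the previous node on the same module finishes:

```python
def simulate_module(nodes):
    """nodes: list of (name, priority s, compute time cpt) on one functional module."""
    timeline = {}
    clock = 0.0
    # Constraint (2): st(v_i) >= ft(v_j) if s(v_i) < s(v_j),
    # so higher-priority nodes run first.
    for name, _, cpt in sorted(nodes, key=lambda n: -n[1]):
        st = clock
        ft = st + cpt          # ft(v_i) = st(v_i) + cpt(v_i)
        timeline[name] = (st, ft)
        clock = ft
    return timeline
```

On a module holding node A (priority 1, time 3) and node B (priority 2, time 2), B runs in [0, 2) and A in [2, 5), satisfying constraint (2).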
Fig. 7 shows a flowchart of a method for splitting operation nodes in a neural network model according to yet another embodiment of the application. As shown in Fig. 7, the method 100 may further include step S140 in addition to steps S110, S120 and S130. For brevity, only the differences between the embodiments of Fig. 7 and Fig. 6 are described below; their common features are not described in detail again.
In step S140, the priorities of all the neural network operation nodes and neural network operation child nodes allocated to the same functional module are determined, so that the functional module executes the nodes and child nodes in the order of the determined priorities. After the splitting and scheduling of steps S120 and S130, the unsplit neural network operation nodes and the child nodes obtained by splitting have been allocated to type-matched functional modules in the artificial intelligence processor, and each functional module may be assigned more than one operation node or child node. The order in which a functional module executes the nodes and child nodes scheduled to it can then be determined according to their priorities: a node or child node with a higher priority is processed earlier by the functional module, and one with a lower priority is processed later. Each functional module of the artificial intelligence processor thus processes its scheduled operations in priority order, ensuring the orderly execution of the task and reducing the overall execution time of the task.
According to an embodiment of the application, an upward sorting (Upward Rank) algorithm and/or a downward sorting (Downward Rank) algorithm is used to determine the priorities of all the neural network operation nodes and child nodes allocated to the same functional module.
Specifically, in the upward sorting algorithm, the priority is determined according to the time difference between the start time of a neural network operation node or child node and the start time of the pending task. For example, set the start time of the pending task to t0, the start time of a neural network operation node A allocated to some functional module in the artificial intelligence processor to t1, and the start time of a neural network operation child node B1 to t2. Then, if t1 − t0 > t2 − t0, node A starts later than child node B1, so the priority of child node B1 is higher than that of node A. Conversely, if t1 − t0 < t2 − t0, node A starts earlier than child node B1, so the priority of node A is higher than that of child node B1.
The HEFT (Heterogeneous Earliest Finish Time) scheduling algorithm used in traditional heterogeneous systems employs exactly this Upward Rank prioritization. The difference in this application, however, is that the algorithm is applied to the functional modules in the artificial intelligence processor, to rank the nodes and child nodes allocated to each functional module. As described above, the task scheduling algorithms proposed for traditional heterogeneous systems (such as the HEFT algorithm) do not consider whether a task can be split; in other words, they all assume tasks are indivisible, and are therefore not directly applicable to the artificial intelligence processor of this application. In this application, to meet the demands of neural network applications, the traditional HEFT and CPOP algorithms are improved. The improved HEFT algorithm first sets the priority of each node v_i using Upward Rank prioritization, where the Upward Rank value rank_u(v_i) of each node is calculated from the average computation time and the average communication data, in which node v_j is a successor node of node v_i. Unlike in the traditional HEFT algorithm, in the improved HEFT algorithm of this application the average computation time and average communication data are determined by the following formulas:
where n_i denotes the number of functional modules in the artificial intelligence processor capable of processing node v_i. The scheduling algorithm of this application can thereby be migrated seamlessly onto the artificial intelligence processor.
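Since the patent's averaged-cost formulas are not reproduced above, the sketch below follows the standard HEFT Upward Rank recursion and takes the per-node average computation costs and per-edge average communication costs as precomputed inputs (an assumption):

```python
def upward_rank(dag, avg_cpt, avg_iot):
    """dag: node -> list of successor nodes; avg_cpt: node -> average compute
    time; avg_iot: (node, successor) -> average communication time."""
    memo = {}
    def rank(v):
        # rank_u(v) = avg_cpt(v) + max over successors s of
        #             (avg_iot(v, s) + rank_u(s)); exit nodes take only avg_cpt.
        if v not in memo:
            succs = dag.get(v, [])
            tail = max((avg_iot.get((v, s), 0) + rank(s) for s in succs), default=0)
            memo[v] = avg_cpt[v] + tail
        return memo[v]
    for v in dag:
        rank(v)
    return memo
```

On a two-node chain A → B with costs 3, 2 and edge cost 1, this gives rank_u(B) = 2 and rank_u(A) = 6, so A is scheduled before B.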
On the other hand, in the downward sorting algorithm, the priority is determined according to the time difference between the finish time of a neural network operation node or child node and the finish time of the pending task. For example, set the finish time of the pending task to T0, the finish time of a neural network operation node A allocated to some functional module in the artificial intelligence processor to T1, and the finish time of a neural network operation child node B1 to T2. Then, if T0 − T1 > T0 − T2, node A finishes earlier than child node B1, so the priority of node A is higher than that of child node B1. Conversely, if T0 − T1 < T0 − T2, node A finishes later than child node B1, so the priority of child node B1 is higher than that of node A.
In addition, Upward Rank and Downward Rank can be considered together to determine the priorities of the neural network operation nodes and child nodes allocated to the same functional module. For example, the CPOP (Critical-Path-on-a-Processor) scheduling algorithm used in traditional heterogeneous systems employs a prioritization that combines Upward Rank and Downward Rank, using the sum of the two to select and optimize the critical path. The difference in this application is that the algorithm is applied to the functional modules in the artificial intelligence processor, to rank the nodes and child nodes allocated to each functional module. In the traditional CPOP algorithm, every node on the critical path must be handled by the same processor. In this application, to meet the demands of neural network applications, the traditional CPOP algorithm is improved: the improved CPOP algorithm no longer imposes the constraint that nodes of the same type must be assigned to the same functional module.
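A self-contained sketch of a CPOP-style priority, priority(v) = rank_u(v) + rank_d(v), using the standard definitions (the patent's improved formulas are not reproduced above, so this is an approximation under that assumption):

```python
def cpop_priority(dag, preds, avg_cpt, avg_iot):
    """dag: node -> successors; preds: node -> predecessors;
    avg_cpt / avg_iot: average computation and communication costs."""
    ru, rd = {}, {}
    def rank_u(v):  # length of the longest path from v to an exit node
        if v not in ru:
            ru[v] = avg_cpt[v] + max(
                (avg_iot.get((v, s), 0) + rank_u(s) for s in dag[v]), default=0)
        return ru[v]
    def rank_d(v):  # length of the longest path from an entry node to v
        if v not in rd:
            rd[v] = max(
                (rank_d(p) + avg_cpt[p] + avg_iot.get((p, v), 0) for p in preds[v]),
                default=0)
        return rd[v]
    return {v: rank_u(v) + rank_d(v) for v in dag}
```

On a two-node chain A → B, both nodes lie on the critical path, so their combined ranks coincide with the critical-path length.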
According to an embodiment of the application, the multiple neural network operation child nodes split from the same neural network operation node are allocated to different type-matched functional modules in the artificial intelligence processor. Referring to Fig. 5, before neural network operation node B is split, it can be executed by the type-matched functional module b1 in the artificial intelligence processor. After node B is split, the resulting child nodes B1, B2 and B3 can be respectively allocated to different type-matched functional modules b1, b2 and b3 for execution. This improves the degree of parallelism of task execution, speeds up the operation, and improves hardware utilization.
According to an embodiment of the application, in step S120, the splitting of a neural network operation node may be a split at the level of data scale (that is, the type of the arithmetic operation is unchanged before and after splitting) or a split of the arithmetic logic (that is, the operation type changes after splitting). In other words, the operation type of at least some of the child nodes obtained by splitting may differ from that of the node before splitting. For example, after a fully connected operation is split, the fully connected operation can be completed by a multiplication operation and an addition operation working together, further improving the effect of parallel computation.
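The fully connected example can be checked in pure Python: splitting the fused operation into a multiplication child node and an addition child node leaves the result unchanged (the helper names and tiny matrices are illustrative only):

```python
def matmul(a, b):
    # Plain list-of-lists matrix multiplication.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_bias(m, bias):
    return [[v + bv for v, bv in zip(row, bias)] for row in m]

def fully_connected(x, w, bias):
    # One fused node: x @ w + bias.
    return add_bias(matmul(x, w), bias)

def fc_split(x, w, bias):
    partial = matmul(x, w)        # child node B1: multiplication
    return add_bias(partial, bias)  # child node B2: addition
```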
According to an embodiment of the application, in step S140, the allocation is performed not only according to the type of the functional module but also according to the computing speed and/or the communication cost of the functional modules in the artificial intelligence processor.
First, when scheduling neural network operation nodes and child nodes onto the functional modules of the artificial intelligence processor, it is necessary to consider whether the types match, that is, whether a functional module can process the arithmetic operation of that type.
Second, when the artificial intelligence processor contains multiple type-matched functional modules, it is necessary to consider which of these modules has the higher computing speed; that module is preferred to process the node or child node.
In addition, the communication cost of the functional modules may be considered. For example, suppose two type-matched functional modules can both process a given neural network operation node, but the communication cost between one of them and the functional module that processes the node's downstream node is smaller than the communication cost between the other and that downstream module; then the functional module with the smaller communication cost is preferred to process the node. The communication cost here refers to the amount of communication transmitted between two functional modules divided by the average communication speed. The smaller the communication cost, the less time the communication takes, and the shorter the task execution time.
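The communication-cost rule above (data size divided by average communication speed, smaller is better) can be sketched as follows; the module ids and numbers are hypothetical:

```python
def pick_module(candidates, downstream_module, data_size, speed):
    """candidates: type-matched module ids; speed: dict
    (module_a, module_b) -> average communication speed between them."""
    def comm_cost(m):
        # Communication cost = data size / average communication speed.
        return data_size / speed[(m, downstream_module)]
    return min(candidates, key=comm_cost)
```

For example, with 100 units of data to send to the downstream module fu3, a candidate with speed 50 (cost 2) is preferred over one with speed 25 (cost 4).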
Fig. 8 shows a program example of the PSC algorithm according to an embodiment of the application. Fig. 9 shows a schematic diagram of a device for splitting operation nodes in a neural network model according to an embodiment of the application. As shown in Fig. 9, the device 200 may include a determination unit 210 and a splitting unit 230. The determination unit 210 determines the critical path in the neural network model. The splitting unit 230 splits at least one neural network operation node in the critical path into multiple neural network operation child nodes according to the hardware parallelism of the functional modules of different types in the artificial intelligence processor.
Fig. 10 shows a schematic diagram of an electronic device according to an embodiment of the application. As shown in Fig. 10, the electronic device 300 may include a central processor 310, an artificial intelligence processor 320 and a memory 330. The artificial intelligence processor 320 is communicatively connected to the central processor 310 and includes multiple functional modules 321. The memory 330 stores a computer program. When the computer program stored in the memory 330 is executed by the central processor 310, the central processor 310 is enabled to execute the method for splitting operation nodes in a neural network model according to any of the above embodiments.
According to another aspect of the application, a non-transitory computer-readable storage medium is provided, on which computer-readable instructions are stored; when the instructions are executed by a processor, the processor is enabled to execute the method for splitting operation nodes in a neural network model according to any of the above embodiments.
It should be understood that the above device embodiments are merely illustrative, and the device of the application may also be implemented in other ways. For example, the division into units/modules described in the above embodiments is only a division by logical function; other divisions are possible in actual implementation. For example, multiple units, modules or components may be combined or integrated into another system, or some features may be omitted or not executed.
Units described as separate components may or may not be physically separated. A component described as a unit or module may or may not be a physical unit, and may be located in one device or distributed over multiple devices. The schemes of the embodiments of the application may be implemented by selecting some or all of the units according to actual needs.
In addition, unless otherwise noted, the functional units/modules in the embodiments of the application may be integrated into one unit/module, may exist separately and physically, or two or more units/modules may be integrated together. The integrated units/modules may be implemented in the form of hardware or in the form of software program modules.
If the integrated units/modules are implemented in the form of hardware, the hardware may be digital circuits, analog circuits, and the like. Physical implementations of the hardware structures include, but are not limited to, transistors, memristors, and the like. Unless otherwise noted, the processor may be any appropriate hardware processor, such as a CPU, GPU, FPGA, DSP or ASIC. Unless otherwise noted, the storage unit may be any appropriate magnetic or magneto-optical storage medium, for example resistive random access memory RRAM (Resistive Random Access Memory), dynamic random access memory DRAM (Dynamic Random Access Memory), static random access memory SRAM (Static Random-Access Memory), enhanced dynamic random access memory EDRAM (Enhanced Dynamic Random Access Memory), high-bandwidth memory HBM (High-Bandwidth Memory), hybrid memory cube HMC (Hybrid Memory Cube), and so on.
If the integrated units/modules are implemented in the form of software program modules and sold or used as independent products, they may be stored in a computer-readable memory. On this understanding, the essence of the technical solution of the application, the part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product. The computer software product is stored in a memory and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the application. The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk or an optical disk.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments. The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of technical features contains no contradiction, it should be considered to be within the scope of this specification.
The embodiments of the application have been described in detail above. Specific examples have been used herein to explain the principles and implementations of the application, and the descriptions of the above embodiments are only intended to help understand the methods of the application and their core ideas. Meanwhile, changes or modifications made by those skilled in the art, based on the ideas of the application, to the specific implementations and the scope of application shall fall within the protection scope of the application. In summary, the content of this specification should not be construed as limiting the application.
Claims (16)
1. A method for splitting operation nodes in a neural network model, comprising:
determining a critical path in the neural network model; and
splitting at least one neural network operation node in the critical path into multiple neural network operation child nodes according to the hardware parallelism of functional modules of different types in an artificial intelligence processor.
2. The method according to claim 1, wherein the hardware parallelism is the number of functional modules matching nodes of a same type.
3. The method according to claim 1, wherein all splittable neural network operation nodes are split in one pass according to the hardware parallelism.
4. The method according to claim 3, further comprising:
merging the neural network operation child nodes obtained by splitting, according to the neural network operation child nodes allocated to each functional module.
5. The method according to claim 4, wherein merging the neural network operation child nodes obtained by splitting, according to the neural network operation child nodes allocated to each functional module, comprises:
detecting whether each functional module is assigned neural network operation child nodes split from a same neural network operation node; and
merging, according to the detection result, the neural network operation child nodes in each functional module that were split from the same neural network operation node.
6. The method according to claim 1, further comprising:
allocating the neural network operation nodes that are not split and the multiple neural network operation child nodes obtained by splitting respectively to multiple type-matched functional modules in the artificial intelligence processor, to be executed by the functional modules.
7. The method according to claim 6, further comprising:
determining priorities of all the neural network operation nodes and neural network operation child nodes allocated to a same functional module, so that the functional module executes the neural network operation nodes and neural network operation child nodes in the order of the determined priorities.
8. The method according to claim 7, wherein an upward sorting algorithm and/or a downward sorting algorithm is used to determine the priorities of all the neural network operation nodes and neural network operation child nodes allocated to the same functional module.
9. The method according to claim 8, wherein, in the upward sorting algorithm, the priority is determined according to the time difference between the start time of a neural network operation node or neural network operation child node and the start time of a pending task.
10. The method according to claim 8, wherein, in the downward sorting algorithm, the priority is determined according to the time difference between the finish time of a neural network operation node or neural network operation child node and the finish time of the pending task.
11. The method according to claim 6, wherein the multiple neural network operation child nodes split from a same neural network operation node are allocated to different type-matched functional modules in the artificial intelligence processor.
12. The method according to claim 6, wherein the operation type of at least some of the neural network operation child nodes obtained by splitting differs from the operation type of the node before splitting.
13. The method according to claim 6, wherein allocating the neural network operation nodes that are not split and the multiple neural network operation child nodes obtained by splitting respectively to multiple type-matched functional modules in the artificial intelligence processor, to be executed by the functional modules, comprises:
performing the allocation not only according to the type of the functional modules but also according to the computing speed and/or communication cost of the functional modules.
14. A device for splitting operation nodes in a neural network model, comprising:
a determination unit that determines a critical path in the neural network model; and
a splitting unit that splits at least one neural network operation node in the critical path into multiple neural network operation child nodes according to the hardware parallelism of functional modules of different types in an artificial intelligence processor.
15. An electronic device, comprising:
a central processor;
an artificial intelligence processor communicatively connected to the central processor and including multiple functional modules; and
a memory storing a computer program which, when executed by the central processor, causes the central processor to execute the method according to any one of claims 1-13.
16. A non-transitory computer-readable storage medium on which computer-readable instructions are stored, wherein, when the instructions are executed by a processor, the processor is caused to execute the method according to any one of claims 1-13.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910750828.1A CN110503199A (en) | 2019-08-14 | 2019-08-14 | Method for splitting and device, the electronic equipment and storage medium of operation node |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110503199A true CN110503199A (en) | 2019-11-26 |
Family
ID=68587426
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910750828.1A Pending CN110503199A (en) | 2019-08-14 | 2019-08-14 | Method for splitting and device, the electronic equipment and storage medium of operation node |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110503199A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022021073A1 (en) * | 2020-07-28 | 2022-02-03 | 嘉楠明芯(北京)科技有限公司 | Multi-operator operation method and apparatus for neural network model |
CN114462900A (en) * | 2022-04-13 | 2022-05-10 | 云智慧(北京)科技有限公司 | Method, device and equipment for splitting service active node |
WO2023116400A1 (en) * | 2021-12-20 | 2023-06-29 | 深圳市中兴微电子技术有限公司 | Vector operation method, vector operator, electronic device and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108090560A (en) * | 2018-01-05 | 2018-05-29 | 中国科学技术大学苏州研究院 | The design method of LSTM recurrent neural network hardware accelerators based on FPGA |
CN108664496A (en) * | 2017-03-29 | 2018-10-16 | 腾讯科技(深圳)有限公司 | Data migration method and device |
CN108989148A (en) * | 2018-07-17 | 2018-12-11 | 浙江大学 | A kind of relaying multipath flow allocation method that propagation delay time minimizes |
CN109284815A (en) * | 2018-11-30 | 2019-01-29 | 上海寒武纪信息科技有限公司 | Neural network model algorithm Compilation Method, device and Related product |
CN109902819A (en) * | 2019-02-12 | 2019-06-18 | Oppo广东移动通信有限公司 | Neural computing method, apparatus, mobile terminal and storage medium |
Non-Patent Citations (1)
Title |
---|
XIAOBING CHEN et al.: "Partition and Scheduling Algorithms for Neural Network Accelerators", Springer *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110490322A (en) | Method for splitting and device, the electronic equipment and storage medium of operation node | |
US11769061B2 (en) | Processing computational graphs | |
CN110503195A (en) | The method and its Related product of task are executed using artificial intelligence process device | |
Ali et al. | Grouped tasks scheduling algorithm based on QoS in cloud computing network | |
Kaur et al. | An intelligent regressive ensemble approach for predicting resource usage in cloud computing | |
US20200042856A1 (en) | Scheduler for mapping neural networks onto an array of neural cores in an inference processing unit | |
CN112084038B (en) | Memory allocation method and device of neural network | |
US11847553B2 (en) | Parallel computational architecture with reconfigurable core-level and vector-level parallelism | |
CN108351805A (en) | Calculate the accelerator processing based on stream of figure | |
CN110633153A (en) | Method for realizing neural network model splitting by using multi-core processor and related product | |
CN110503199A (en) | Method for splitting and device, the electronic equipment and storage medium of operation node | |
Hunter et al. | Parallel ranking and selection | |
CN110826708B (en) | Method for realizing neural network model splitting by using multi-core processor and related product | |
CN110147882A (en) | Training method, crowd's method of diffusion, device and the equipment of neural network model | |
US20230206132A1 (en) | Method and Apparatus for Training AI Model, Computing Device, and Storage Medium | |
CN113469355B (en) | Multi-model training pipeline in distributed system | |
CN112084037A (en) | Memory allocation method and device of neural network | |
CN110795233B (en) | Distributed resource allocation method and device and electronic equipment | |
CN103685492B (en) | Dispatching method, dispatching device and application of Hadoop trunking system | |
CN100531070C (en) | Network resource scheduling simulation system | |
CN108985449A (en) | A kind of control method and device of pair of convolutional neural networks processor | |
Hosseini et al. | Resource allocation optimization in cloud computing using the whale optimization algorithm | |
Rossant et al. | Playdoh: a lightweight Python library for distributed computing and optimisation | |
Ramesh et al. | Reinforcement learning-based spatial sorting based dynamic task allocation on networked multicore GPU processors | |
Kapoor et al. | Neural network based optimal placement strategy for service components in cloud computing |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB02 | Change of applicant information | Address after: Room 644, scientific research complex building, No. 6, South Road, Academy of Sciences, Haidian District, Beijing 100086. Applicant after: Zhongke Cambrian Technology Co.,Ltd. Address before: Room 644, scientific research complex building, No. 6, South Road, Academy of Sciences, Haidian District, Beijing 100086. Applicant before: Beijing Zhongke Cambrian Technology Co.,Ltd. |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20191126 |