CN110503199A - Method and device for splitting operation nodes, electronic equipment, and storage medium - Google Patents

Method and device for splitting operation nodes, electronic equipment, and storage medium

Info

Publication number
CN110503199A
CN110503199A
Authority
CN
China
Prior art keywords
neural network
node
network computing
split
functional module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910750828.1A
Other languages
Chinese (zh)
Inventor
Inventor not announced
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Beijing Zhongke Cambrian Technology Co Ltd
Original Assignee
Beijing Zhongke Cambrian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Cambrian Technology Co Ltd
Priority to CN201910750828.1A
Publication of CN110503199A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N 3/08 - Learning methods

Abstract

This application provides a method and apparatus for executing a task using an artificial intelligence processor, an electronic device, and a non-transitory computer-readable storage medium. The electronic device includes a central processing unit, an artificial intelligence processor, and a memory; the artificial intelligence processor is communicatively connected to the central processing unit and includes multiple functional modules. The memory stores a computer program which, when executed by the central processing unit, causes the central processing unit to perform the method of executing a task using the artificial intelligence processor.

Description

Method and device for splitting operation nodes, electronic equipment, and storage medium
Technical field
This application relates to the field of computers, and in particular to a method and apparatus for splitting operation nodes in a neural network model, an electronic device, and a non-transitory computer-readable storage medium.
Background technique
Artificial Intelligence (AI) is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning). It covers the principles by which computers realize intelligence and the construction of computers that function similarly to the human brain, enabling computers to support higher-level applications.
An artificial neural network (ANN), also called a neural network (NN), is an algorithmic and mathematical model that imitates the behavioral characteristics of biological neural networks and performs distributed parallel information processing. Such a network depends on the complexity of the system and processes information by adjusting the interconnections among a large number of internal nodes.
Neural networks are among the most popular algorithms in machine learning today. In recent years, with the rapid development of deep learning, neural-network-based models and algorithms have achieved breakthroughs in many fields. For example, research on deep-neural-network algorithms is advancing rapidly in speech technology, face recognition, autonomous driving, and machine translation.
Summary of the invention
Based on this, the present application provides a method for splitting operation nodes in a neural network model, comprising:
determining a critical path in the neural network model;
splitting at least one neural network operation node in the critical path into multiple neural network operation child nodes according to the hardware parallelism of the functional modules in an artificial intelligence processor that match different node types.
According to another aspect of the application, a device for splitting operation nodes in a neural network model is provided, comprising:
a determination unit that determines the critical path in the neural network model;
a splitting unit that splits at least one neural network operation node in the critical path into multiple neural network operation child nodes according to the hardware parallelism of the functional modules in an artificial intelligence processor that match different node types.
According to another aspect of the application, an electronic device is provided, comprising:
a central processing unit;
an artificial intelligence processor communicatively connected to the central processing unit and including multiple functional modules; and
a memory storing a computer program which, when executed by the central processing unit, causes the central processing unit to perform the method described above.
According to another aspect of the application, a non-transitory computer-readable storage medium is provided, having computer-readable instructions stored thereon which, when executed by a processor, cause the processor to perform the method described above.
Detailed description of the invention
To explain the technical solutions in the embodiments of the present application more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the application; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 shows an exemplary model of an artificial intelligence processor;
Fig. 2 shows a schematic diagram of a neural network model;
Fig. 3 shows a flowchart of a method for splitting operation nodes in a neural network model according to an embodiment of the application;
Fig. 4 shows a schematic diagram of splitting a neural network operation node;
Fig. 5 schematically illustrates different ways of splitting a neural network operation;
Fig. 6 shows a flowchart of a method for splitting operation nodes in a neural network model according to another embodiment of the application;
Fig. 7 shows a flowchart of a method for splitting operation nodes in a neural network model according to yet another embodiment of the application;
Fig. 8 shows a program example of the PSC algorithm according to an embodiment of the application;
Fig. 9 shows a schematic diagram of a device for splitting operation nodes in a neural network model according to an embodiment of the application;
Fig. 10 shows a schematic diagram of an electronic device according to an embodiment of the application.
Specific embodiment
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without creative effort fall within the protection scope of the application.
It should be understood that the terms "first", "second", "third", "fourth", and the like in the claims, specification, and drawings of the application are used to distinguish different objects rather than to describe a particular order. The terms "include" and "comprise" used in the specification and claims indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or sets thereof.
It should also be understood that the terminology used in this specification is for describing particular embodiments only and is not intended to limit the application. As used in the specification and claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the term "and/or" used in the specification and claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
In this application, an artificial intelligence processor, also called a special-purpose processor, is a processor for a specific application or field. For example, a graphics processing unit (GPU), also known as a display core, visual processor, or display chip, is a special-purpose processor for image computation on personal computers, workstations, game consoles, and some mobile devices (such as tablets and smartphones). Another example is the neural network processor (NPU), a special-purpose processor for the matrix multiplication operations of artificial intelligence applications; it adopts a "data-driven parallel computing" architecture and is particularly good at processing massive multimedia data such as video and images. With the development of artificial intelligence technology, computers can now complete many complex tasks, such as face recognition, autonomous driving, and machine translation. These tasks can be implemented with neural network models. However, unlike traditional algorithms, an important characteristic of neural network models is the coexistence of high memory-access intensity and high computational intensity, which is a huge challenge for general-purpose processors. To cope with this challenge, artificial intelligence processors have been proposed to accelerate the execution of neural networks. Current neural network acceleration schemes fall broadly into three types: graphics processing units (GPU), field-programmable gate arrays (FPGA), and application-specific integrated circuits (ASIC). GPUs have powerful parallel computing capability but face serious power-efficiency problems, while FPGAs are flexible but have poor peak performance. Unlike GPUs and FPGAs, ASICs are dedicated custom hardware architectures; their multi-core and memory-computing designs help them surpass GPUs and FPGAs in the field of artificial intelligence computing.
In practical applications, the task scheduling traditionally used for general-purpose processors cannot be applied directly to artificial intelligence processors. The reason is this: when the functional logic inside a general-purpose processor is abstracted, the processor contains multiple identical functional modules; the computing tasks executed by the functional modules have no essential difference, so the modules can substitute for one another. As a result, a general-purpose processor is slow and power-hungry when executing the task corresponding to a neural network model. For an artificial intelligence processor, by contrast, the abstracted functional logic consists of multiple distinct functional modules, each of which executes its corresponding type of computing task, and the functional modules cannot substitute for one another.
Analyzing the arithmetic operations in a neural network model according to the operation characteristics of an artificial intelligence processor, the operations fall into several types; for example, convolution and fully-connected operations can be completed by matrix multiplication and summation, while activation operations can be completed by table lookup. The artificial intelligence processor referred to in this application contains multiple functional modules, each suited to executing a different type of arithmetic operation in a neural network model. For example, the artificial intelligence processor may include a functional module dedicated to matrix multiplication and a functional module dedicated to table lookup; the functional modules customized for particular operation characteristics are not limited to one of each. The artificial intelligence processor can therefore process a neural network model in parallel and improve system performance. In addition, the functional modules of the artificial intelligence processor can communicate with one another directly: for a neural network model, data flows directly between the functional modules without intermediate caching. Fig. 1 shows an exemplary model of an artificial intelligence processor. In Fig. 1, the artificial intelligence processor is abstracted as containing multiple functional modules, with different modules suited to different types of arithmetic operations in a neural network model, such as matrix multiplication, table lookup, pooling, and vector operations. Fig. 1 also shows the interconnected data-transmission paths between the functional modules. Different application scenarios place different requirements on the artificial intelligence processor. In edge computing, the main concerns are power efficiency and latency; for cloud servers, the main concerns are computational throughput and parallelism. Depending on the practical application, the number of functional modules of the artificial intelligence processor, their computing speeds, their connectivity, and the corresponding bandwidths can be configured.
As already mentioned, different types of arithmetic operations are accelerated by different functional modules of the artificial intelligence processor. The neural network operations in the model that can be processed in parallel can therefore run on different functional modules, reducing the total running time. For example, in a neural network model, A is computed first, the output of A is used to compute B and C, and the outputs of B and C are then used to compute D, where A, B, C, and D are different nodes of the model. In this example, node B and node C can execute in parallel. Suppose node B needs 4s to execute its arithmetic operation and node C needs 3s; node D can execute its arithmetic operation only after both B and C have finished. The time required for the parallel computation is then the time of the longest of the parallel operations, so in this example nodes A, B, and D are the nodes on the critical path. When executing the task corresponding to the neural network model, the longest of the parallel operations is identified and optimized according to the resource utilization of the functional modules inside the artificial intelligence processor, reducing its required time so that the overall parallel computation time decreases correspondingly. This solves the task-scheduling optimization problem of executing the task of a neural network model on an artificial intelligence processor.
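The example above can be made concrete with a small sketch. The following Python fragment is illustrative only: the graph layout follows the A/B/C/D example, but the 1s times assigned to nodes A and D and all function names are assumptions, since the text only gives 4s for B and 3s for C. It finds the longest (critical) path by dynamic programming over a topological order:

```python
from collections import defaultdict

def critical_path(nodes, edges):
    """nodes: {name: compute_time}; edges: list of (src, dst) dependencies."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    for s, d in edges:
        succ[s].append(d)
        indeg[d] += 1
    order = [n for n in nodes if indeg[n] == 0]
    for n in order:                      # Kahn's topological sort
        for d in succ[n]:
            indeg[d] -= 1
            if indeg[d] == 0:
                order.append(d)
    dist = {n: nodes[n] for n in nodes}  # longest finish time ending at n
    prev = {n: None for n in nodes}
    for n in order:
        for d in succ[n]:
            if dist[n] + nodes[d] > dist[d]:
                dist[d] = dist[n] + nodes[d]
                prev[d] = n
    end = max(dist, key=dist.get)
    path = []
    while end is not None:
        path.append(end)
        end = prev[end]
    return path[::-1], max(dist.values())

path, t = critical_path({"A": 1, "B": 4, "C": 3, "D": 1},
                        [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")])
print(path, t)  # ['A', 'B', 'D'] 6
```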
When processing a neural network model, the neural network model corresponding to the task to be executed can first be parsed into a computation topology graph containing multiple neural network operation nodes.
Normally, given the characteristics of neural network models, a task to be executed can be parsed into a computation topology graph containing multiple operation nodes. Fig. 2 shows a schematic diagram of a neural network model. As shown in Fig. 2, the neural network model contains multiple neural network operation nodes, and each node represents one neural network arithmetic operation. During execution of the task, the input data passes through a large number of neural network operations to finally produce the result.
Fig. 3 shows a flowchart of a method for splitting operation nodes in a neural network model according to an embodiment of the application. As shown in Fig. 3, the method 100 may include steps S110 and S120.
In step S110, the critical path in the neural network model is determined.
As described above, the critical path is the logic path with the longest delay from input to output in the neural network model, and it contains multiple operation nodes. On the other hand, the artificial intelligence processor contains multiple functional modules suited to different types of neural network operations, so for each type of neural network operation there are one or more functional modules of the matching type in the artificial intelligence processor to execute it. On this basis, both the neural network operation nodes that have not been split and the neural network operation child nodes produced by splitting can be matched, according to their operation type, with functional modules of the corresponding type in the artificial intelligence processor for execution (the splitting process is described below).
In step S120, at least one neural network operation node in the critical path is split into multiple neural network operation child nodes according to the hardware parallelism of the functional modules in the artificial intelligence processor that match the different node types. According to this embodiment, the splitting is performed according to the hardware parallelism of the matching functional modules. In this application, the hardware parallelism of the functional modules refers to the number of functional modules matching nodes of the same type. For example, as shown in Fig. 4, when splitting a neural network operation node, one node can be split into multiple child nodes, and the number of child nodes can be determined by the number of functional modules matching that node type. If the artificial intelligence processor contains X functional modules matching the operation type of the node, the node can be split into X child nodes, so that the X functional modules execute the X child nodes concurrently. In other words, if the number of functional modules matching a node type is limited, the number of child nodes is also limited: a node cannot be split without bound. If the number of child nodes exceeded the number of matching functional modules, several child nodes split from the same node would be processed on the same functional module, which would not improve operation parallelism at all. A minimal sketch of this rule follows.
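In the sketch below, the operation type names, the module counts, and the batch cap are hypothetical; the point is only that the split width is the hardware parallelism X of the node type, bounded by what the data allows:

```python
def split_width(node_type, batch, module_counts):
    """module_counts: {op_type: number of matching functional modules}."""
    x = module_counts.get(node_type, 1)  # hardware parallelism X of this type
    return min(x, batch)                 # cannot split finer than the data allows

print(split_width("conv", batch=8, module_counts={"conv": 3}))  # 3
print(split_width("conv", batch=1, module_counts={"conv": 3}))  # 1 (no batch split)
```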
Since there may be multiple functional modules of the matching type, splitting an operation node enables parallel execution; and for the computation as a whole, splitting at least some of the operation nodes improves operation parallelism, which can greatly reduce the task processing time and improve operation efficiency.
As described above, neural network operations are usually arithmetic operations similar to matrix multiplication or table lookup, and a characteristic of such operations is that they can be split. For example, the multiplication of two larger matrices can be converted into (that is, split into) multiplications of several smaller matrices. This is the detachability of neural network operations. Based on this detachability, at least some of the neural network operation nodes in the critical path of the task determined in step S110 can be partitioned into multiple neural network operation child nodes. Fig. 4 shows a schematic diagram of splitting a neural network operation node. As shown in Fig. 4, before splitting, the original order of operations runs from node A to node B and then to node C. After node B is split into child nodes B1, B2, and B3, the order of operations runs from node A to the parallel child nodes B1, B2, B3 and then to node C. Splitting can thus improve the operation parallelism of the neural network model.
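The detachability of matrix multiplication can be sketched as follows; this is an illustrative NumPy fragment, not the application's implementation. Splitting the left operand by rows yields independent sub-multiplications, corresponding to the parallel child nodes B1, B2, B3 of Fig. 4, whose results are simply concatenated:

```python
import numpy as np

def split_matmul(a, b, parts):
    blocks = np.array_split(a, parts, axis=0)  # one row block per child node
    partials = [blk @ b for blk in blocks]     # independent sub-ops B1..Bx
    return np.concatenate(partials, axis=0)    # stitch the results back

a = np.random.rand(6, 4)
b = np.random.rand(4, 5)
assert np.allclose(split_matmul(a, b, 3), a @ b)
```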
For heterogeneous systems, a variety of task scheduling algorithms have been proposed. For example, in a list scheduling heuristic algorithm, a priority sequence is created for all tasks, and task selection and processor selection are then performed repeatedly until all tasks are scheduled. The Mapping Heuristic algorithm and the Earliest Time First algorithm are two examples of list scheduling heuristics. The HEFT (Heterogeneous Earliest Finish Time) algorithm and the CPOP (Critical Path on a Processor) algorithm mentioned in the following description also belong to the list scheduling algorithms. The Upward Rank and Downward Rank of a task are important parts of these two algorithms and will be described in detail later.
However, the task scheduling algorithms proposed for traditional heterogeneous systems do not consider whether a task can be split; in other words, they all assume indivisible tasks, and the processors are general-purpose with respect to all tasks. For neural network operations, by contrast, the tasks are detachable, and an artificial intelligence processor such as a neural network acceleration chip can have multiple customized modules.
Fig. 5 schematically illustrates different ways of splitting a neural network operation. Part (a) of Fig. 5 shows the original, unsplit neural network operation, a convolution (conv). The output of the previous layer node is a four-dimensional tensor out[2, 256, 101, 101], which is the input in[2, 256, 101, 101] of this layer node; after the convolution of this layer node, the output is again a four-dimensional tensor out[2, 512, 50, 50], which serves as the input in[2, 512, 50, 50] of the next layer node.
Part (b) of Fig. 5 shows splitting along the batch direction, i.e., the 2 in the tensor is split into two 1s. Part (c) of Fig. 5 shows splitting along the width direction, i.e., the 101 in the tensor is split into two 51s. Part (d) of Fig. 5 shows splitting along the input-channel direction, i.e., the 256 in the tensor is split into two 128s. Part (e) of Fig. 5 shows splitting along the output-channel direction, which splits the input data of size 5222912 into two portions of 2637312. Owing to the particularity of convolution, the input-channel split of Fig. 5(d) additionally requires the results of the two split operations to be combined by a further operation, whereas the other split modes of Fig. 5 allow the split results to be concatenated directly.
Since neural network operations can be completed by vectorized computation, a neural network operation can be split to improve task parallelism. Based on the characteristics of different operations, different splitting strategies such as those shown in Fig. 5 can be adopted. For example, in terms of the split direction, a two-dimensional convolution operation admits the following five split modes. Splitting along the batch direction adds no extra burden, but if the batch size of the operation is 1, this mode cannot be used (it cannot improve parallelism). Splitting along the input-channel direction requires an extra addition: the partial results from the split child nodes must be added together. Splitting along the output-channel direction requires each split child node to obtain the entire input data, which increases the volume of data transmission. Splitting along the height or width direction causes extra data-transmission overhead because input data is duplicated in each split child node. On the other hand, for a batch normalization operation, splitting causes no extra workload. Accordingly, in this application, a splitting rule can be constructed for each neural network operation. A sketch contrasting two of these modes follows.
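The following fragment contrasts the batch split (b) and the input-channel split (d) of Fig. 5, using a fully-connected layer y = x @ W as a stand-in for convolution, since the combination rules are the same; the use of NumPy and the exact shapes are assumptions for illustration:

```python
import numpy as np

x = np.random.rand(2, 256)      # [batch, in_channels]
w = np.random.rand(256, 512)    # [in_channels, out_channels]

# (b) batch split: each child node handles part of the batch;
# the partial outputs are simply concatenated
y_batch = np.concatenate([x[0:1] @ w, x[1:2] @ w], axis=0)

# (d) input-channel split: each child node sees half of the channels;
# the partial outputs must be summed afterwards (the "extra addition")
y_chan = x[:, :128] @ w[:128] + x[:, 128:] @ w[128:]

assert np.allclose(y_batch, x @ w)
assert np.allclose(y_chan, x @ w)
```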
According to the application, a neural network application (that is, a task to be executed) can be represented as a computation topology graph, for example the directed acyclic graph (DAG) shown in Fig. 2, denoted G = (V, E). Each node v_i ∈ V represents an operation, and each edge e_{i,j} ∈ E represents the data dependency between node v_i and node v_j. Under the split operation p_i applied to node v_i, node v_i is replaced by the new nodes (that is, child nodes) produced by the split, together with the edges associated with these new nodes. The data transmission volumes on these new edges depend on the operation and parameters of node v_i, as shown below:
G′(V′, E′) = p_i(G(V, E))   (5)
For all detachable nodes, a split sequence p = (p_1, p_2, ..., p_k) of the nodes can be obtained; the scheduling process based on splitting is then performed over the graph obtained by applying the split sequence, p_k(... p_2(p_1(G(V, E))) ...).
According to an embodiment of the application, in the above step S120, the neural network operation nodes are split using the Partition Scheduling Combination algorithm (PSC algorithm for short).
According to an embodiment of the application, all detachable neural network operation nodes can be split at once according to the hardware parallelism of the functional modules in the artificial intelligence processor. That is, the PSC algorithm can split all detachable neural network operation nodes in a single pass. Of course, this splitting must still follow the principle described above and take into account the hardware parallelism of the functional modules in the artificial intelligence processor.
After all detachable neural network operation nodes have been split at once according to the PSC algorithm, the complexity of the neural network model means that not every neural network operation child node produced by splitting can necessarily run in parallel. In other words, multiple child nodes split from the same neural network operation node may be assigned to, and executed by, the same functional module of the artificial intelligence processor. In that case, the PSC algorithm can merge the split neural network operation child nodes according to the neural network operation nodes and child nodes assigned to each functional module: when the child nodes assigned to some functional module are found to include multiple child nodes split from the same neural network operation node, those child nodes are merged.
For example, suppose the artificial intelligence processor has three functional modules a1, a2, a3 matching neural network nodes of type A; node A is then split into three child nodes A1, A2, A3. All nodes and child nodes after splitting are subsequently scheduled and assigned to the functional modules. Merging means that, if after scheduling child nodes A1 and A2 are found to be dispatched to the same functional module (this can happen because the original neural network may also contain another node A′ of the same type as node A, so the number of split child nodes exceeds the number of matching functional modules), child nodes A1 and A2 are combined again.
The PSC algorithm and the IPS algorithm have in common that both transform the computation topology graph into a new computation topology graph better suited to scheduling. They differ in that the IPS algorithm strictly computes the splitting gain and splits iteratively according to that gain; it is a relatively stable splitting method, but its running time is comparatively long. The PSC algorithm is a more aggressive splitting method than the IPS algorithm: it does not compute splitting gains and costs but splits purely heuristically, and then performs a one-time merge according to the split result to reduce the splitting cost as much as possible. The PSC algorithm is less stable than the IPS algorithm, but its running time is shorter.
According to an embodiment of the application, the neural network operation child nodes produced by splitting can be merged as follows. First, detect whether any functional module has been assigned neural network operation child nodes split from the same neural network operation node. Then, according to the detection result, if so, merge the neural network operation child nodes on that functional module that were split from the same neural network operation node. The splitting and merging of neural network operation nodes is thereby achieved.
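A minimal sketch of this merge step is given below; the node-naming scheme (children "A1", "A2" with parent "A") and the dictionary-based bookkeeping are assumptions for illustration:

```python
from collections import defaultdict

def merge_siblings(assignment, parent_of):
    """assignment: {child_node: module}; parent_of: {child_node: parent_node}."""
    groups = defaultdict(list)
    for child, module in assignment.items():
        groups[(module, parent_of[child])].append(child)
    merged = {}
    for (module, parent), children in groups.items():
        if len(children) > 1:                # siblings landed on one module:
            merged[parent] = (module, children)  # merge them back together
    return merged

assignment = {"A1": "fu1", "A2": "fu1", "A3": "fu2"}
parent_of = {"A1": "A", "A2": "A", "A3": "A"}
print(merge_siblings(assignment, parent_of))
# {'A': ('fu1', ['A1', 'A2'])} -- A1 and A2 would run sequentially on fu1 anyway
```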
Fig. 6 shows a flowchart of a method for splitting operation nodes in a neural network model according to another embodiment of the application. As shown in Fig. 6, in addition to steps S110 and S120, the method 100 may further include step S130. Only the differences between the embodiment of Fig. 6 and that of Fig. 3 are described in detail below; what they have in common is not repeated.
Step S130: the neural network operation nodes that have not been split and the multiple neural network operation child nodes obtained by splitting are respectively assigned to multiple functional modules of matching type in the artificial intelligence processor, to be executed by those functional modules. Step S130 completes the scheduling of each node in the computation topology graph, assigning the nodes to the functional modules of the artificial intelligence processor for processing. As described above, the artificial intelligence processor contains multiple functional modules suited to different types of neural network operations, so each type of neural network operation has one or more functional modules of matching type in the artificial intelligence processor to execute it. On this basis, each neural network operation node in the neural network model can be matched, according to its operation type, with a functional module of the corresponding type in the artificial intelligence processor for execution. For example, parallel neural network operation nodes in the neural network model can be assigned to different functional modules of matching type in the artificial intelligence processor.
Parallel processing of the neural network model is thus realized with the artificial intelligence processor: the neural network operations in the model that can be processed in parallel run on different functional modules, reducing the total running time and improving system performance.
The method according to this embodiment can be regarded as a split-based scheduling algorithm for neural network models. It exploits the detachability of neural network operations and the multiple functional modules of the artificial intelligence processor that handle different types of operations: when executing a task, some or all of the nodes in the task's critical path are first split, and then the unsplit nodes and the split child nodes are together assigned (that is, scheduled) to multiple functional modules of matching type in the artificial intelligence processor. Parallel processing is thereby realized, the task execution speed is increased, and hardware utilization efficiency is improved as well.
For example, suppose a neural network model includes m groups of arithmetic operations, called the nodes of the neural network model and labeled op_1, op_2, op_3, ..., op_m. Correspondingly, the configured artificial intelligence processor includes n functional modules, whose arithmetic functions are labeled fu_1, fu_2, ..., fu_n. For each functional module, the function g(fu_i) characterizes the computing speed of the i-th functional module fu_i, and the function h(fu_i, fu_j) characterizes the bandwidth between functional modules fu_i and fu_j.
The neural network model can be represented by a directed acyclic graph (DAG): G = (V, E), where V denotes the nodes of the neural network model, each node v_i ∈ V represents an arithmetic operation, and each edge e_{i,j} ∈ E represents the data dependency between v_i and v_j. Suppose the operation scale is cp(v_i) and the selected functional module is f(v_i). The computation time cpt(v_i) can then be determined as cp(v_i)/g(f(v_i)). Suppose the data scale from node v_i to node v_j is io(v_i, v_j); the corresponding communication time iot(v_i, v_j) is defined by io(v_i, v_j)/h(f(v_i), f(v_j)).
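The timing model just defined can be sketched as follows; the module speeds and bandwidths below are toy values, not values from the application:

```python
g = {"fu1": 2.0, "fu2": 4.0}                      # module computing speeds
h = {("fu1", "fu2"): 8.0, ("fu2", "fu1"): 8.0}    # inter-module bandwidths

def cpt(cp_vi, module):
    return cp_vi / g[module]                      # cpt(v_i) = cp(v_i) / g(f(v_i))

def iot(io_vivj, src_module, dst_module):
    if src_module == dst_module:
        return 0.0                                # no transfer within one module
    return io_vivj / h[(src_module, dst_module)]  # io / h(f(v_i), f(v_j))

print(cpt(100.0, "fu2"))        # 25.0
print(iot(40.0, "fu1", "fu2"))  # 5.0
```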
The artificial intelligence processor is a processor for accelerating neural networks; it contains functional modules of different types, each of which executes tasks of the corresponding type, and functional modules of different types cannot substitute for one another. Since tasks of a given type in the accelerated neural network must be assigned to the same type of functional module, priorities are set for the nodes of the neural network model. Let the priority of node v_i be s(v_i); the nodes assigned to the same functional module are then executed in priority order. Suppose node v_i starts at time st(v_i) and finishes at time ft(v_i). For nodes v_i and v_j on the same functional module, the constraints of formulas (2) and (3) hold.
st(v_i) ≥ ft(v_j)   if s(v_i) < s(v_j)   (2)
Specifically, in the neural network model, for an entry node v_i with no preceding node: if v_j is also an entry node and f(v_i) equals f(v_j), then st(v_i) equals 0 or ft(v_j). For the other nodes of the neural network model, formula (3) gives the expression for the start time, and the time at which node v_i finishes its arithmetic operation is ft(v_i) = st(v_i) + cpt(v_i).
According to the definitions above, the scheduling problem is to select suitable functional modules from the set of functional modules of the artificial intelligence processor and to determine the priorities of the nodes, that is, to find the functional-module assignment function f and the priority-setting function s that minimize the execution time.
Fig. 7 shows a flowchart of a method for splitting operation nodes in a neural network model according to yet another embodiment of the application. As shown in Fig. 7, in addition to steps S110, S120, and S130, the method 100 may further include step S140. For brevity, only the differences between the embodiments of Fig. 7 and Fig. 6 are described below; their common points are not detailed again.
In step S140, the priorities of all neural network operation nodes and neural network operation child nodes assigned to the same functional module are determined, so that the functional module executes the neural network operation nodes and child nodes in the order of the determined priorities. After the splitting and scheduling of steps S120 and S130, the unsplit neural network operation nodes and the child nodes obtained by splitting have been assigned to functional modules of matching type in the artificial intelligence processor, and each functional module may be assigned more than one operation node or child node. The order in which a functional module executes the nodes and child nodes scheduled to it can be determined by their priorities: nodes or child nodes with higher priority are processed earlier by the functional module, and those with lower priority are processed later. Each functional module of the artificial intelligence processor thus processes its scheduled operations in priority order, ensuring the orderly execution of the task and reducing the overall execution time of the task.
According to an embodiment of the application, the priorities of all neural network operation nodes and neural network operation child nodes assigned to the same functional module are determined using an Upward Rank algorithm and/or a Downward Rank algorithm.
Specifically, in the upward-rank algorithm, the priority is determined from the time difference between the start time of a neural network operation node or child node and the start time of the task to be executed. For example, set the start time of the task to t_0, the start time of a neural network operation node A assigned to some functional module in the artificial intelligence processor to t_1, and the start time of a neural network operation child node B1 to t_2. If t_1 - t_0 > t_2 - t_0, node A starts later than child node B1, so the priority of child node B1 is higher than that of node A. Conversely, if t_1 - t_0 < t_2 - t_0, node A starts earlier than child node B1, so the priority of node A is higher than that of child node B1. The HEFT (Heterogeneous Earliest Finish Time) scheduling algorithm used in traditional heterogeneous systems employs exactly this Upward Rank prioritization. This application differs in that the algorithm is applied to the functional modules inside the artificial intelligence processor, to rank the priorities of the nodes and child nodes assigned to each functional module. As described above, the task scheduling algorithms proposed for traditional heterogeneous systems (such as the HEFT algorithm) do not consider whether a task can be split; in other words, they assume indivisible tasks, so they are not directly applicable to the artificial intelligence processor of this application. To suit the needs of neural network applications, this application improves the traditional HEFT and CPOP algorithms. The improved HEFT algorithm first sets the priority of each node v_i using the Upward Rank prioritization, where the Upward Rank value of each node, rank_u(v_i) = cpt̄(v_i) + max over successors v_j of (iot̄(v_i, v_j) + rank_u(v_j)), is computed from the average computation time cpt̄(v_i) and the average communication time iot̄(v_i, v_j), node v_j being a successor node of node v_i. Unlike the traditional HEFT algorithm, in the improved HEFT algorithm of this application cpt̄(v_i) and iot̄(v_i, v_j) are averaged over only the matching functional modules:
cpt̄(v_i) = (1/n_i) Σ_{fu ∈ M(v_i)} cp(v_i)/g(fu)
iot̄(v_i, v_j) = (1/(n_i · n_j)) Σ_{fu ∈ M(v_i)} Σ_{fu′ ∈ M(v_j)} io(v_i, v_j)/h(fu, fu′)
where n_i denotes the number of functional modules in the artificial intelligence processor capable of handling node v_i, and M(v_i) denotes the set of those functional modules. This enables the scheduling algorithm of the application to migrate seamlessly onto the artificial intelligence processor.
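A minimal sketch of the Upward Rank computation follows; the graph, the pre-averaged times, and the resulting ordering are illustrative assumptions, not values from the application:

```python
from functools import lru_cache

succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
avg_cpt = {"A": 1.0, "B": 4.0, "C": 3.0, "D": 1.0}    # cpt-bar(v_i)
avg_iot = {("A", "B"): 0.5, ("A", "C"): 0.5,
           ("B", "D"): 0.5, ("C", "D"): 0.5}          # iot-bar(v_i, v_j)

@lru_cache(maxsize=None)
def rank_u(v):
    # rank of a node = its average compute time plus the maximum, over its
    # successors, of the average communication time plus the successor's rank
    if not succ[v]:
        return avg_cpt[v]
    return avg_cpt[v] + max(avg_iot[(v, s)] + rank_u(s) for s in succ[v])

order = sorted(succ, key=rank_u, reverse=True)  # higher rank = higher priority
print(order)  # ['A', 'B', 'C', 'D']
```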
In the downward-rank algorithm, on the other hand, the priority is determined from the time difference between the finish time of a neural network operation node or child node and the finish time of the task to be executed. For example, set the finish time of the task to T_0, the finish time of a neural network operation node A assigned to some functional module in the artificial intelligence processor to T_1, and the finish time of a neural network operation child node B1 to T_2. If T_0 - T_1 > T_0 - T_2, node A finishes earlier than child node B1, so the priority of node A is higher than that of child node B1. Conversely, if T_0 - T_1 < T_0 - T_2, node A finishes later than child node B1, so the priority of child node B1 is higher than that of node A.
In addition, Upward Rank and Downward Rank can be considered together to determine the priorities of the neural network operation nodes and child nodes assigned to the same functional module. For example, the CPOP (Critical Path on a Processor) scheduling algorithm used in traditional heterogeneous systems employs a prioritization combining Upward Rank and Downward Rank, using the sum of the two to select and optimize the critical path. This application differs in that the algorithm is applied to the functional modules inside the artificial intelligence processor, to rank the priorities of the nodes and child nodes assigned to each functional module. In the traditional CPOP algorithm, every node on the critical path must be processed by the same processor. To suit the needs of neural network applications, this application improves the traditional CPOP algorithm: the improved CPOP algorithm no longer imposes the constraint that all nodes of the same type must be assigned to the same functional module.
According to an embodiment of the application, the multiple neural network operation child nodes split from the same neural network operation node are assigned to different functional modules of matching type in the artificial intelligence processor. Referring to Fig. 4, before neural network operation node B is split, it can be executed by a matching functional module b1 in the artificial intelligence processor; after node B is split, the resulting child nodes B1, B2, and B3 can be assigned to different matching functional modules b1, b2, and b3 for execution. This improves the parallelism of task execution, speeds up the computation, and improves hardware utilization efficiency.
According to an embodiment of the application, in step S120 the splitting of a neural network operation node can be a split in data scale (that is, the type of arithmetic operation is unchanged before and after the split) or a split in operation logic (that is, the type of arithmetic operation changes after the split). In other words, the operation type of at least some of the resulting neural network operation child nodes may differ from the operation type of the node before splitting. For example, after a fully-connected operation is split, the fully-connected arithmetic operation can be completed by a multiplication operation cooperating with an addition operation. This can further improve the effect of concurrent computation, as sketched below.
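The sketch below re-expresses a fully-connected operation as a multiplication node followed by an addition (reduction) node, which could land on different functional modules; NumPy and the shapes are assumptions for illustration:

```python
import numpy as np

x = np.random.rand(256)
w = np.random.rand(256, 512)

prod = x[:, None] * w   # child node 1: elementwise multiplication
y = prod.sum(axis=0)    # child node 2: addition (reduction)

assert np.allclose(y, x @ w)  # together they reproduce the fully-connected op
```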
According to an embodiment of the application, in step S140 the assignment takes into account not only the type of the functional modules but also the computing speed and/or communication cost of the functional modules in the artificial intelligence processor.
First, when scheduling neural network operation nodes and child nodes onto the functional modules of the artificial intelligence processor, the type must match, that is, the functional module must be able to process arithmetic operations of that type.
Second, when the artificial intelligence processor contains multiple functional modules of matching type, the computing speeds of these functional modules are considered: the faster module is preferred for processing the node or child node.
On the other hand, the communication cost of the functional modules is also considered. For example, two functional modules of matching type may both be able to process some neural network operation node, but the communication cost between one of them and the functional module that processes the node's downstream node may be smaller than that of the other; the functional module with the smaller communication cost is then preferred for processing the node. The communication cost here is the amount of data transmitted between two functional modules divided by the average communication speed. The smaller the communication cost, the less time is spent on communication, which shortens the task execution time. A minimal sketch follows.
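In the fragment below, the candidate modules and their average communication speeds to the downstream module are hypothetical; the point is only the selection rule just described:

```python
def comm_cost(data_size, speed):
    return data_size / speed  # transmitted data divided by average speed

candidates = {"fu1": 10.0, "fu2": 40.0}  # avg speed to the downstream module
data_size = 80.0
best = min(candidates, key=lambda m: comm_cost(data_size, candidates[m]))
print(best)  # 'fu2' -- higher bandwidth, hence lower communication cost
```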
Fig. 8 shows a program example of the PSC algorithm according to an embodiment of the application. Fig. 9 shows a schematic diagram of a device for splitting operation nodes in a neural network model according to an embodiment of the application. As shown in Fig. 9, the device 200 may include a determination unit 210 and a splitting unit 230. The determination unit 210 determines the critical path in the neural network model. The splitting unit 230 splits at least one neural network operation node in the critical path into multiple neural network operation child nodes according to the hardware parallelism of the functional modules in the artificial intelligence processor that match the different node types.
Fig. 10 shows a schematic diagram of an electronic device according to an embodiment of the application. As shown in Fig. 10, the electronic device 300 may include a central processing unit 310, an artificial intelligence processor 320, and a memory 330. The artificial intelligence processor 320 is communicatively connected to the central processing unit 310 and includes multiple functional modules 321. The memory 330 stores a computer program; when the computer program stored in the memory 330 is executed by the central processing unit 310, the central processing unit 310 can perform the method for splitting operation nodes in a neural network model described in any of the above embodiments.
According to another aspect of the application, a non-transitory computer-readable storage medium is provided, having computer-readable instructions stored thereon; when the instructions are executed by a processor, the processor can perform the method for splitting operation nodes in a neural network model described in any of the above embodiments.
It should be understood that the above device embodiments are merely illustrative, and the device of the application can also be implemented in other ways. For example, the division of units/modules in the above embodiments is only a division by logical function; other divisions are possible in actual implementation. For example, multiple units, modules, or components may be combined or integrated into another system, or some features may be omitted or not executed.
The units or modules described as separate components may or may not be physically separate. A component described as a unit or module may or may not be a physical unit; it may be located in one device or distributed over multiple devices. The solutions of the embodiments of the application can be implemented by selecting some or all of the units according to actual needs.
In addition, unless otherwise noted, the functional units/modules in the embodiments of the application may be integrated into one unit/module, each unit/module may exist alone physically, or two or more units/modules may be integrated together. The integrated units/modules may be implemented in the form of hardware or in the form of software program modules.
If the integrated units/modules are implemented in hardware, the hardware may be digital circuits, analog circuits, and the like. Physical implementations of the hardware structure include, but are not limited to, transistors, memristors, and the like. Unless otherwise noted, the processor may be any appropriate hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC. Unless otherwise noted, the storage unit may be any appropriate magnetic or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), or hybrid memory cube (HMC).
If the integrated units/modules are implemented in the form of software program modules and sold or used as independent products, they may be stored in a computer-readable memory. Based on this understanding, the essence of the technical solution of the application, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods of the embodiments of the application. The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, or optical disc.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, refer to the related descriptions of other embodiments. The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The embodiments of the application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the application, and the description of the above embodiments is only intended to help understand the methods of the application and their core ideas. Meanwhile, changes or variations made by those skilled in the art within the specific implementations and the scope of application, based on the ideas of the application, fall within the protection scope of the application. In summary, the content of this specification should not be construed as limiting the application.

Claims (16)

1. A method for splitting operation nodes in a neural network model, comprising:
determining a critical path in the neural network model;
splitting at least one neural network operation node in the critical path into multiple neural network operation child nodes according to the hardware parallelism of functional modules in an artificial intelligence processor that match different node types.
2. The method according to claim 1, wherein the hardware parallelism is the number of functional modules matching nodes of the same type.
3. The method according to claim 1, wherein all detachable neural network operation nodes are split at once according to the hardware parallelism.
4. The method according to claim 3, further comprising:
merging the neural network operation child nodes obtained by splitting according to the neural network operation child nodes assigned to each functional module.
5. The method according to claim 4, wherein merging the neural network operation child nodes obtained by splitting according to the neural network operation child nodes assigned to each functional module comprises:
detecting whether each functional module is assigned neural network operation child nodes split from the same neural network operation node;
according to the detection result, merging the neural network operation child nodes on each functional module that were split from the same neural network operation node.
6. The method according to claim 1, further comprising:
assigning the neural network operation nodes that have not been split and the multiple neural network operation child nodes obtained by splitting respectively to multiple functional modules of matching type in the artificial intelligence processor, to be executed by the functional modules.
7. The method according to claim 6, further comprising:
determining the priorities of all neural network operation nodes and neural network operation child nodes assigned to the same functional module, so that the functional module executes the neural network operation nodes and neural network operation child nodes in the order of the determined priorities.
8. The method according to claim 7, wherein the priorities of all neural network operation nodes and neural network operation child nodes assigned to the same functional module are determined using an upward-rank algorithm and/or a downward-rank algorithm.
9. The method according to claim 8, wherein in the upward-rank algorithm the priority is determined according to the time difference between the start time of a neural network operation node or neural network operation child node and the start time of the task to be executed.
10. The method according to claim 8, wherein in the downward-rank algorithm the priority is determined according to the time difference between the finish time of a neural network operation node or neural network operation child node and the finish time of the task to be executed.
11. The method according to claim 6, wherein the multiple neural network operation child nodes split from the same neural network operation node are assigned to different functional modules of matching type in the artificial intelligence processor.
12. The method according to claim 6, wherein the operation type of at least some of the neural network operation child nodes obtained by splitting differs from the operation type of the node before splitting.
13. The method according to claim 6, wherein assigning the neural network operation nodes that have not been split and the multiple neural network operation child nodes obtained by splitting respectively to multiple functional modules of matching type in the artificial intelligence processor, to be executed by the functional modules, comprises:
performing the assignment not only according to the type of the functional modules but also according to the computing speed and/or communication cost of the functional modules.
14. A device for splitting operation nodes in a neural network model, comprising:
a determination unit that determines a critical path in the neural network model;
a splitting unit that splits at least one neural network operation node in the critical path into multiple neural network operation child nodes according to the hardware parallelism of functional modules in an artificial intelligence processor that match different node types.
15. An electronic device, comprising:
a central processing unit;
an artificial intelligence processor communicatively connected to the central processing unit and including multiple functional modules;
a memory storing a computer program, wherein when the computer program is executed by the central processing unit, the central processing unit performs the method of any one of claims 1-13.
16. A non-transitory computer-readable storage medium having computer-readable instructions stored thereon, wherein when the instructions are executed by a processor, the processor performs the method of any one of claims 1-13.
CN201910750828.1A 2019-08-14 2019-08-14 Method for splitting and device, the electronic equipment and storage medium of operation node Pending CN110503199A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910750828.1A CN110503199A (en) 2019-08-14 2019-08-14 Method for splitting and device, the electronic equipment and storage medium of operation node

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910750828.1A CN110503199A (en) 2019-08-14 2019-08-14 Method for splitting and device, the electronic equipment and storage medium of operation node

Publications (1)

Publication Number Publication Date
CN110503199A true CN110503199A (en) 2019-11-26

Family

ID=68587426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910750828.1A Pending CN110503199A (en) 2019-08-14 2019-08-14 Method for splitting and device, the electronic equipment and storage medium of operation node

Country Status (1)

Country Link
CN (1) CN110503199A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022021073A1 (en) * 2020-07-28 2022-02-03 嘉楠明芯(北京)科技有限公司 Multi-operator operation method and apparatus for neural network model
CN114462900A (en) * 2022-04-13 2022-05-10 云智慧(北京)科技有限公司 Method, device and equipment for splitting service active node
WO2023116400A1 (en) * 2021-12-20 2023-06-29 深圳市中兴微电子技术有限公司 Vector operation method, vector operator, electronic device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108090560A (en) * 2018-01-05 2018-05-29 中国科学技术大学苏州研究院 The design method of LSTM recurrent neural network hardware accelerators based on FPGA
CN108664496A (en) * 2017-03-29 2018-10-16 腾讯科技(深圳)有限公司 Data migration method and device
CN108989148A (en) * 2018-07-17 2018-12-11 浙江大学 A kind of relaying multipath flow allocation method that propagation delay time minimizes
CN109284815A (en) * 2018-11-30 2019-01-29 上海寒武纪信息科技有限公司 Neural network model algorithm Compilation Method, device and Related product
CN109902819A (en) * 2019-02-12 2019-06-18 Oppo广东移动通信有限公司 Neural computing method, apparatus, mobile terminal and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108664496A (en) * 2017-03-29 2018-10-16 腾讯科技(深圳)有限公司 Data migration method and device
CN108090560A (en) * 2018-01-05 2018-05-29 中国科学技术大学苏州研究院 The design method of LSTM recurrent neural network hardware accelerators based on FPGA
CN108989148A (en) * 2018-07-17 2018-12-11 浙江大学 A kind of relaying multipath flow allocation method that propagation delay time minimizes
CN109284815A (en) * 2018-11-30 2019-01-29 上海寒武纪信息科技有限公司 Neural network model algorithm Compilation Method, device and Related product
CN109902819A (en) * 2019-02-12 2019-06-18 Oppo广东移动通信有限公司 Neural computing method, apparatus, mobile terminal and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAOBING CHEN等: ""Partition and Scheduling Algorithms for Neural Network Accelerators", 《SPRINGER》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022021073A1 (en) * 2020-07-28 2022-02-03 嘉楠明芯(北京)科技有限公司 Multi-operator operation method and apparatus for neural network model
WO2023116400A1 (en) * 2021-12-20 2023-06-29 深圳市中兴微电子技术有限公司 Vector operation method, vector operator, electronic device and storage medium
CN114462900A (en) * 2022-04-13 2022-05-10 云智慧(北京)科技有限公司 Method, device and equipment for splitting service active node
CN114462900B (en) * 2022-04-13 2022-07-29 云智慧(北京)科技有限公司 Method, device and equipment for splitting service active node

Similar Documents

Publication Publication Date Title
CN110490322A (en) Method for splitting and device, the electronic equipment and storage medium of operation node
US11769061B2 (en) Processing computational graphs
CN110503195A (en) The method and its Related product of task are executed using artificial intelligence process device
Ali et al. Grouped tasks scheduling algorithm based on QoS in cloud computing network
Kaur et al. An intelligent regressive ensemble approach for predicting resource usage in cloud computing
US20200042856A1 (en) Scheduler for mapping neural networks onto an array of neural cores in an inference processing unit
CN112084038B (en) Memory allocation method and device of neural network
US11847553B2 (en) Parallel computational architecture with reconfigurable core-level and vector-level parallelism
CN108351805A (en) Calculate the accelerator processing based on stream of figure
CN110633153A (en) Method for realizing neural network model splitting by using multi-core processor and related product
CN110503199A (en) Method for splitting and device, the electronic equipment and storage medium of operation node
Hunter et al. Parallel ranking and selection
CN110826708B (en) Method for realizing neural network model splitting by using multi-core processor and related product
CN110147882A (en) Training method, crowd&#39;s method of diffusion, device and the equipment of neural network model
US20230206132A1 (en) Method and Apparatus for Training AI Model, Computing Device, and Storage Medium
CN113469355B (en) Multi-model training pipeline in distributed system
CN112084037A (en) Memory allocation method and device of neural network
CN110795233B (en) Distributed resource allocation method and device and electronic equipment
CN103685492B (en) Dispatching method, dispatching device and application of Hadoop trunking system
CN100531070C (en) Network resource scheduling simulation system
CN108985449A (en) A kind of control method and device of pair of convolutional neural networks processor
Hosseini et al. Resource allocation optimization in cloud computing using the whale optimization algorithm
Rossant et al. Playdoh: a lightweight Python library for distributed computing and optimisation
Ramesh et al. Reinforcement learning-based spatial sorting based dynamic task allocation on networked multicore GPU processors
Kapoor et al. Neural network based optimal placement strategy for service components in cloud computing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 644, scientific research complex building, No. 6, South Road, Academy of Sciences, Haidian District, Beijing 100086

Applicant after: Zhongke Cambrian Technology Co.,Ltd.

Address before: Room 644, scientific research complex building, No. 6, South Road, Academy of Sciences, Haidian District, Beijing 100086

Applicant before: Beijing Zhongke Cambrian Technology Co.,Ltd.

CB02 Change of applicant information
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20191126

WD01 Invention patent application deemed withdrawn after publication