CN109976809B - Scheduling method and related device - Google Patents

Scheduling method and related device

Info

Publication number
CN109976809B
Authority
CN
China
Prior art keywords
computing device
serial
computing devices
instruction
computing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711467705.4A
Other languages
Chinese (zh)
Other versions
CN109976809A (en)
Inventor
Inventor not publicized
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN201711467705.4A priority Critical patent/CN109976809B/en
Priority to EP18895350.9A priority patent/EP3731089B1/en
Priority to PCT/CN2018/098324 priority patent/WO2019128230A1/en
Priority to US16/767,415 priority patent/US11568269B2/en
Publication of CN109976809A publication Critical patent/CN109976809A/en
Application granted granted Critical
Publication of CN109976809B publication Critical patent/CN109976809B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3818Decoding for concurrent execution
    • G06F9/3822Parallel decoding, e.g. parallel decode units
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer And Data Communications (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiment of the application discloses a scheduling method and a related device. The method is based on a server comprising a plurality of computing devices and includes the following steps: receiving an operation request; acquiring an instruction stream of a target neural network model corresponding to the operation request; splitting the instruction stream into a plurality of parallel instructions and a plurality of serial instructions; selecting, from the plurality of computing devices, a plurality of parallel computing devices corresponding to the plurality of parallel instructions and at least one serial computing device corresponding to the plurality of serial instructions; calculating the operation data corresponding to the operation request according to the parallel instruction corresponding to each of the parallel computing devices and the serial instruction corresponding to each of the at least one serial computing device to obtain a final operation result; and sending the final operation result to the electronic equipment that sent the operation request. The embodiment of the application can improve the efficiency with which the server processes a single operation request.

Description

Scheduling method and related device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a scheduling method and a related apparatus.
Background
Neural networks currently underpin many artificial intelligence applications. As the range of neural network applications continues to expand, servers and cloud computing services are increasingly used to store various neural network models and to perform operations according to operation requests submitted by users. How to improve the operation efficiency of such a server is therefore a technical problem to be solved by those skilled in the art.
Disclosure of Invention
Embodiments of the present application provide a scheduling method and a related device, which can select the computing devices in a server that execute a single operation request, thereby improving the operation efficiency of the server.
In a first aspect, an embodiment of the present application provides a scheduling method applied to a server comprising a plurality of computing devices, the method including:
receiving an operation request;
acquiring an instruction stream of a target neural network model corresponding to the operation request;
splitting the instruction stream into a plurality of parallel instructions and a plurality of serial instructions;
selecting a plurality of parallel computing devices corresponding to the plurality of parallel instructions and at least one serial computing device corresponding to the plurality of serial instructions from the plurality of computing devices;
calculating the operation data corresponding to the operation request according to the parallel instruction corresponding to each parallel computing device in the plurality of parallel computing devices and the serial instruction corresponding to each serial computing device in the at least one serial computing device to obtain a final operation result;
and sending the final operation result to the electronic equipment sending the operation request.
In a second aspect, an embodiment of the present application provides a server, which includes a plurality of computing devices, wherein:
a receiving unit configured to receive an operation request;
the acquisition unit is used for acquiring the instruction stream of the target neural network model corresponding to the operation request;
a splitting unit, configured to split the instruction stream into a plurality of parallel instructions and a plurality of serial instructions;
a selecting unit configured to select, from the plurality of computing devices, a plurality of parallel computing devices corresponding to the plurality of parallel instructions and at least one serial computing device corresponding to the plurality of serial instructions;
the operation unit is used for calculating operation data corresponding to the operation request according to the parallel instruction corresponding to each parallel computing device in the plurality of parallel computing devices and the serial instruction corresponding to each serial computing device in the at least one serial computing device to obtain a final operation result;
and the sending unit is used for sending the final operation result to the electronic equipment sending the operation request.
In a third aspect, an embodiment of the present application provides another server comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for performing some or all of the steps described in the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect.
After the scheduling method and the related device are adopted, when the server receives a single operation request sent by an electronic device, it splits the instruction stream of the target neural network model corresponding to the operation request, selects parallel computing devices to process the corresponding parallel instructions in parallel, selects serial computing devices that are well suited to the serial instructions to execute them independently, and sends the final operation result corresponding to the operation request to the electronic device. That is, the computing devices among the plurality of parallel computing devices execute their corresponding parallel instructions simultaneously, which saves the execution time of the parallel instructions, while each serial instruction is executed by a serial computing device suited to it, which improves the operation efficiency of every serial stage. In other words, the computing resources are allocated in a unified manner according to the operation request, so that the plurality of computing devices in the server cooperate effectively and the overall operation efficiency of the server is improved.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present application, and other drawings can be obtained by those skilled in the art based on these drawings without creative effort.
Wherein:
fig. 1 is a schematic structural diagram of a server provided in an embodiment of the present application;
fig. 1a is a schematic structural diagram of a computing unit provided in an embodiment of the present application;
fig. 1b is a schematic structural diagram of a main processing circuit according to an embodiment of the present disclosure;
FIG. 1c is a schematic data distribution diagram of a computing unit according to an embodiment of the present application;
FIG. 1d is a schematic diagram of a data return of a computing unit according to an embodiment of the present application;
fig. 1e is an operational diagram of a neural network structure according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a scheduling method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of another server provided in the embodiment of the present application;
fig. 4 is a schematic structural diagram of another server provided in the embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
The embodiment of the application provides a scheduling method and a related device, which can select a computing device in a server to execute an operation request, and improve the operating efficiency of the server. The present application is described in further detail below with reference to specific embodiments and with reference to the attached drawings.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a server according to an embodiment of the present disclosure. As shown in fig. 1, the server includes a plurality of computing devices, and the computing devices include, but are not limited to, server computers, and may be Personal Computers (PCs), network PCs, minicomputers, mainframe computers, and the like.
In the present application, the computing devices included in the server are connected to one another and transfer data by wire or wirelessly, and each computing device includes at least one computing carrier, such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or a processor board card. The server referred to in this application may also be a cloud server that provides cloud computing services for electronic devices.
Each computing carrier may comprise at least one computing unit for neural network operations, such as a processing chip. The specific structure of the computing unit is not limited; refer to fig. 1a, which is a schematic structural diagram of the computing unit. As shown in fig. 1a, the computing unit includes a main processing circuit, basic processing circuits, and branch processing circuits. Specifically, the main processing circuit is connected to the branch processing circuits, and each branch processing circuit is connected to at least one basic processing circuit.
The branch processing circuit is used for receiving and transmitting data of the main processing circuit or the basic processing circuit.
Referring to fig. 1b, fig. 1b is a schematic structural diagram of a main processing circuit, as shown in fig. 1b, the main processing circuit may include a register and/or an on-chip cache circuit, and the main processing circuit may further include a control circuit, a vector operator circuit, an ALU (arithmetic and logic unit) circuit, an accumulator circuit, a DMA (Direct memory access) circuit, and other circuits.
The main processing circuit further includes a data transmitting circuit and a data receiving circuit or interface. The data transmitting circuit may integrate a data distributing circuit and a data broadcasting circuit; in practical applications, the data distributing circuit and the data broadcasting circuit may also be configured separately, and the data transmitting circuit and the data receiving circuit may be integrated into a single data transceiving circuit. Broadcast data is data that needs to be sent to every basic processing circuit, whereas distribution data is data that is selectively sent to some of the basic processing circuits, the specific selection being determined by the main processing circuit according to the load and the calculation mode. In the broadcast transmission mode, the broadcast data is transmitted to each basic processing circuit in broadcast form (in practical applications, the broadcast data may be sent to each basic processing circuit in a single broadcast or in multiple broadcasts, and the embodiments of the present invention do not limit the number of broadcasts). In the distribution transmission mode, the distribution data is selectively transmitted to some of the basic processing circuits.
When data is distributed, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits. The data may be the same or different: specifically, when data is sent in the distribution mode, the data received by the individual basic processing circuits may differ, although some basic processing circuits may receive the same data.
When data is broadcast, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits, and every basic processing circuit that receives data receives the same data; that is, the broadcast data includes the data that all basic processing circuits need to receive, while the distribution data includes the data that only part of the basic processing circuits need to receive. The main processing circuit may send the broadcast data to all of the branch processing circuits in one or more broadcasts, and the branch processing circuits then forward the broadcast data to all of the basic processing circuits.
Optionally, the vector operator circuit of the main processing circuit may perform vector operations, including but not limited to: addition, subtraction, multiplication and division of two vectors; addition, subtraction, multiplication and division of a vector and a constant; or any operation applied to each element of a vector. Continuous operations may be, for example, addition, subtraction, multiplication, division, activation, or accumulation operations between a vector and a constant.
Each base processing circuit may include a base register and/or a base on-chip cache circuit; each base processing circuit may further include: an inner product operator circuit, a vector operator circuit, an accumulator circuit, or the like, in any combination. The inner product operator circuit, the vector operator circuit, and the accumulator circuit may be integrated circuits, or the inner product operator circuit, the vector operator circuit, and the accumulator circuit may be circuits provided separately.
The connection structure between the branch processing circuits and the basic circuits may be arbitrary and is not limited to the H-shaped structure of fig. 1b. Optionally, the structure from the main processing circuit to the basic circuits is a broadcast or distribution structure, and the structure from the basic circuits to the main processing circuit is a gather structure. Broadcast, distribution, and gather are defined as follows:
The data transfer modes from the main processing circuit to the basic circuits may include the following:
The main processing circuit is connected to a plurality of branch processing circuits, and each branch processing circuit is in turn connected to a plurality of basic circuits.
The main processing circuit is connected to one branch processing circuit, that branch processing circuit is connected to a further branch processing circuit, and so on, so that a plurality of branch processing circuits are connected in series; each branch processing circuit is then connected to a plurality of basic circuits.
The main processing circuit is connected to a plurality of branch processing circuits, and each branch processing circuit is connected in series to a plurality of basic circuits.
The main processing circuit is connected to one branch processing circuit, that branch processing circuit is connected to a further branch processing circuit, and so on, so that a plurality of branch processing circuits are connected in series; each branch processing circuit is then connected in series to a plurality of basic circuits.
When distributing data, the main processing circuit transmits data to part or all of the basic circuits, and the data received by each basic circuit for receiving data can be different;
when broadcasting data, the main processing circuit transmits data to part or all of the basic circuits, and each basic circuit receiving data receives the same data.
When collecting data, some or all of the base circuits transmit data to the main processing circuit. It should be noted that the computing unit shown in fig. 1a may be a single physical chip, and of course, in practical applications, the computing unit may also be integrated into other chips (e.g., CPU, GPU).
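To make the three transfer modes concrete, the following Python sketch models them in software. It is only an illustration under assumed names (MainCircuit, BasicCircuit and their methods are hypothetical and do not describe the actual circuit implementation): broadcast sends the same data to every basic circuit, distribution sends different slices to the basic circuits, and gather collects the partial results back to the main circuit.

    import numpy as np

    class BasicCircuit:
        def __init__(self):
            self.buffer = None
        def receive(self, data):
            self.buffer = data
        def inner_product(self, weights):
            return float(np.dot(self.buffer, weights))

    class MainCircuit:
        def __init__(self, num_basic):
            self.basic = [BasicCircuit() for _ in range(num_basic)]
        def broadcast(self, data):
            # broadcast mode: every basic circuit receives the same data
            for circuit in self.basic:
                circuit.receive(data)
        def distribute(self, data):
            # distribution mode: each basic circuit receives a different slice
            for circuit, chunk in zip(self.basic, np.array_split(data, len(self.basic))):
                circuit.receive(chunk)
        def gather(self, weights):
            # gather mode: collect the partial results back to the main circuit
            chunks = np.array_split(weights, len(self.basic))
            return [c.inner_product(w) for c, w in zip(self.basic, chunks)]

    main = MainCircuit(num_basic=4)
    main.distribute(np.arange(8, dtype=np.float32))
    partial_sums = main.gather(np.ones(8, dtype=np.float32))
    print(sum(partial_sums))  # 28.0, i.e. 0 + 1 + ... + 7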
Referring to fig. 1c, fig. 1c is a schematic diagram of data distribution of the computing unit, with the arrows indicating the distribution direction of the data. As shown in fig. 1c, after receiving external data, the main processing circuit splits the external data and distributes the parts to a plurality of branch processing circuits, and the branch processing circuits send the split data on to the basic processing circuits.
Referring to fig. 1d, fig. 1d is a schematic diagram of data return of the computing unit, with the arrows indicating the return direction of the data. As shown in fig. 1d, the basic processing circuits return data (e.g., inner product calculation results) to the branch processing circuits, and the branch processing circuits return the data to the main processing circuit.
The input data may specifically be a vector, a matrix, or multidimensional (three-dimensional, four-dimensional, or higher) data, and an individual value of the input data may be referred to as an element of the input data.
The embodiment of the present disclosure further provides a computing method of a computing unit as shown in fig. 1a, where the computing method is applied to neural network computing, and specifically, the computing unit may be used to perform operations on input data and weight data of one or more layers in a multi-layer neural network.
Specifically, the computing unit is configured to perform an operation on input data and weight data of one or more layers of the trained multi-layer neural network;
or the computing unit is used for executing operation on the input data and the weight data of one or more layers in the multilayer neural network of forward operation.
The above operations include, but are not limited to, one or any combination of: a convolution operation, a matrix-matrix multiplication operation, a matrix-vector multiplication operation, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
The GEMM operation refers to the matrix-matrix multiplication operation in the BLAS library. Its general form is C = alpha * op(S) * op(P) + beta * C, where S and P are the two input matrices, C is the output matrix, alpha and beta are scalars, op denotes some operation applied to matrix S or P, and additional integer parameters describe the width and height of matrices S and P.
The GEMV operation refers to the matrix-vector multiplication operation in the BLAS library. Its general form is C = alpha * op(S) * P + beta * C, where S is the input matrix, P is the input vector, C is the output vector, alpha and beta are scalars, and op denotes some operation applied to matrix S.
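As a concrete reading of the two formulas above, the following NumPy sketch computes GEMM and GEMV directly from their definitions; it is an illustration only and does not use the BLAS interface itself, and the op arguments default to the identity.

    import numpy as np

    def gemm(alpha, S, P, beta, C, op_s=lambda x: x, op_p=lambda x: x):
        # C = alpha * op(S) * op(P) + beta * C
        return alpha * (op_s(S) @ op_p(P)) + beta * C

    def gemv(alpha, S, p, beta, c, op_s=lambda x: x):
        # C = alpha * op(S) * P + beta * C, with P and C vectors
        return alpha * (op_s(S) @ p) + beta * c

    S = np.array([[1.0, 2.0], [3.0, 4.0]])
    P = np.array([[5.0, 6.0], [7.0, 8.0]])
    print(gemm(1.0, S, P, 0.0, np.zeros((2, 2)), op_s=np.transpose))  # op(S) = S^T here
    print(gemv(2.0, S, np.array([1.0, 1.0]), 1.0, np.array([10.0, 10.0])))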
The connection relationship between the computing carriers in a computing device is not limited, and the computing carriers may be homogeneous or heterogeneous; likewise, the connection relationship between the computing units in a computing carrier is not limited. Heterogeneous computing carriers or computing units executing parallel tasks can improve computing efficiency.
The computing apparatus shown in fig. 1 includes at least one computing carrier, and each computing carrier includes at least one computing unit. In other words, the target computing device selected in this application depends on the connection relationships between the computing devices, on concrete physical conditions such as the neural network models and network resources deployed in each computing device, and on the attribute information of the operation request. Computing carriers of the same type may therefore be deployed in the same computing device; for example, the computing carriers used for forward propagation may be deployed in one computing device rather than spread across different computing devices, which effectively reduces the communication overhead between computing devices and helps improve operation efficiency. A specific neural network model may also be deployed on a specific computing carrier, so that when the server receives an operation request for that specific neural network, it calls the computing carrier corresponding to that network to execute the request, which saves the time needed to determine the processing task and improves operation efficiency.
The neural network models referred to in the present application may be publicly available and widely used models, for example convolutional neural network (CNN) models such as LeNet, AlexNet, ZFNet, GoogLeNet, VGG, and ResNet.
Optionally, the operation requirement of each designated neural network model in a designated neural network model set and the hardware attributes of each computing device in the plurality of computing devices are obtained, yielding a plurality of operation requirements and a plurality of hardware attributes; and, according to the plurality of operation requirements and the plurality of hardware attributes, each designated neural network model in the designated neural network model set is deployed on the designated computing device corresponding to it.
The designated neural network model set comprises a plurality of designated neural network models. The hardware attributes of a computing device include its network bandwidth, storage capacity, and processor main frequency, and also cover the hardware attributes of the computing carriers or computing units within the computing device. That is, the computing device matching the operation requirement of a designated neural network model is selected according to the hardware attributes of each computing device, which avoids server failures caused by untimely processing and improves the operation support capability of the server.
The input neurons and output neurons mentioned in this application do not refer to the neurons in the input layer and output layer of the entire neural network. For any two adjacent layers in the network, the neurons in the lower layer of the feedforward operation are the input neurons, and the neurons in the upper layer of the feedforward operation are the output neurons. Taking a convolutional neural network as an example, suppose the convolutional neural network has L layers and K = 1, 2, ..., L-1. For the K-th layer and the (K+1)-th layer, the K-th layer is referred to as the input layer, whose neurons are the input neurons, and the (K+1)-th layer is referred to as the output layer, whose neurons are the output neurons. That is, every layer except the topmost layer can serve as an input layer, and the next layer is its corresponding output layer.
The operations mentioned above are operations within one layer of the neural network. For a multi-layer neural network, the implementation process is shown in fig. 1e, in which the dashed arrows indicate the reverse operation and the solid arrows indicate the forward operation. In the forward operation, after the artificial neural network of the previous layer has finished executing, the output neurons obtained by that layer are used as the input neurons of the next layer (or some operation is applied to those output neurons before they are used as the input neurons of the next layer), and the weights are likewise replaced by the weights of the next layer. In the reverse operation, after the reverse operation of the artificial neural network of the previous layer has been completed, the input neuron gradients obtained by that layer are used as the output neuron gradients of the next layer (or some operation is applied to those gradients before they are used as the output neuron gradients of the next layer), and the weights are replaced by the weights of the next layer.
The forward operation of a neural network is the process of computing from the input data to the final output data. The propagation direction of the reverse operation is opposite to that of the forward operation: the reverse operation propagates back through the forward computation path a loss, or the gradient of a loss function, between the final output data and the expected output data. By repeatedly alternating forward and reverse computations, the weights of all layers are corrected by gradient descent on the loss or loss function; adjusting the weights of all layers in this way is the learning and training process of the neural network, and it reduces the loss of the network output.
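The alternation of forward and reverse operations described above can be illustrated with a minimal two-layer sketch. This is hypothetical code, assuming ReLU layers and a mean-squared-error style loss; it is not the training procedure claimed by the application.

    import numpy as np

    def forward(x, weights):
        # the output neurons of layer K become the input neurons of layer K+1
        activations = [x]
        for W in weights:
            x = np.maximum(W @ x, 0.0)          # linear layer followed by ReLU
            activations.append(x)
        return activations

    def backward(activations, weights, grad_out, lr=0.01):
        # the input-neuron gradient of one layer becomes the
        # output-neuron gradient of the layer below it
        for k in reversed(range(len(weights))):
            grad_pre = grad_out * (activations[k + 1] > 0)   # ReLU derivative
            grad_w = np.outer(grad_pre, activations[k])
            grad_out = weights[k].T @ grad_pre               # pass the gradient back
            weights[k] -= lr * grad_w                        # adjust the layer weights

    weights = [np.random.randn(4, 3), np.random.randn(2, 4)]
    acts = forward(np.ones(3), weights)
    loss_grad = acts[-1] - np.array([1.0, 0.0])              # gradient of an MSE-style loss
    backward(acts, weights, loss_grad)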
Referring to fig. 2, fig. 2 is a flowchart of a scheduling method according to an embodiment of the present application. As shown in fig. 2, the method is applied to the server shown in fig. 1 and involves an electronic device that is permitted to access the server. The electronic device may include various handheld devices with wireless communication functions, vehicle-mounted devices, wearable devices, computing devices or other processing devices connected to a wireless modem, and various forms of User Equipment (UE), Mobile Stations (MS), terminal devices, and so on.
201: an operation request is received.
In the present application, a server receives an operation request transmitted from an electronic device that is permitted to access.
The operation request includes attribute information such as an operation task (whether a training task or a testing task) and a target neural network model related to the operation. The training task is used for training a target neural network model, namely performing forward operation and reverse operation on the neural network model until the training is finished; and the test task is used for carrying out forward operation once according to the target neural network model.
The target neural network models may be neural network models uploaded when a user sends an operation request through the electronic device, or neural network models stored in the server, and the like.
202: and acquiring the instruction stream of the target neural network model corresponding to the operation request.
The instruction stream indicates the operation sequence of the target neural network model and the instruction corresponding to each step of that sequence, i.e., it is an instruction sequence, and the operation of the target neural network model can be carried out by executing the instruction stream. Each target neural network model corresponds to a basic operation sequence, i.e., a data structure describing the operation of the target neural network model obtained by parsing the model. The present application does not limit the parsing rules between the basic operation sequence and the instruction descriptors; the instruction descriptor stream corresponding to the target neural network model is obtained according to those parsing rules.
The present application likewise does not limit the predetermined format of the instruction descriptors in the instruction descriptor stream; the instructions corresponding to the instruction descriptor stream are generated according to the network structure in the preset format. The instructions may include the instructions in the Cambricon instruction set, such as matrix operation instructions, convolution operation instructions, fully connected forward operation instructions, pooling operation instructions, normalization instructions, vector operation instructions, and scalar operation instructions.
Optionally, the obtaining of the instruction stream of the target neural network model corresponding to the operation request includes: acquiring a first instruction descriptor stream according to a basic operation sequence corresponding to the target neural network model; simplifying the first instruction descriptor stream to obtain a second instruction descriptor stream; and acquiring the instruction stream according to the second instruction descriptor stream.
That is, redundant instruction descriptors in the first instruction descriptor stream are eliminated by simplifying the first instruction descriptor stream, thereby shortening the instruction stream. And then, the instruction stream which can be executed by the computing device is obtained according to the second instruction descriptor stream, and the output data is obtained through operation according to the instruction and the input data, so that redundant input, output or other operations generated when the operation is performed by using a complete neural network consisting of fine-grained atomic operations such as convolution, pooling, activation and the like are overcome, and the operation speed of the server is further improved.
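One possible shape of this pipeline, from basic operation sequence to first descriptor stream, simplified second descriptor stream, and executable instruction stream, is sketched below. The data structures, the redundancy rule (dropping back-to-back identical copy descriptors), and the instruction format are all assumptions made for illustration.

    def to_descriptor_stream(operation_sequence):
        # basic operation sequence -> first instruction descriptor stream
        return [{"op": op["type"], "args": op["params"]} for op in operation_sequence]

    def simplify(descriptors):
        # second instruction descriptor stream: drop redundant descriptors,
        # here modelled as back-to-back identical "copy" descriptors
        simplified, previous = [], None
        for d in descriptors:
            if d["op"] == "copy" and d == previous:
                continue
            simplified.append(d)
            previous = d
        return simplified

    def to_instruction_stream(descriptors):
        # map each descriptor to an instruction executable by a computing device
        return ["{} {}".format(d["op"].upper(), d["args"]) for d in descriptors]

    ops = [{"type": "conv", "params": "k3s1"},
           {"type": "copy", "params": "buf0"},
           {"type": "copy", "params": "buf0"},   # redundant descriptor
           {"type": "pool", "params": "max2"}]
    print(to_instruction_stream(simplify(to_descriptor_stream(ops))))
    # ['CONV k3s1', 'COPY buf0', 'POOL max2']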
It should be noted that, if the operation request corresponds to multiple target neural network models, instruction streams of the multiple target neural network models need to be acquired and then split, so as to complete the operation request.
203: splitting the instruction stream into a plurality of parallel instructions and a plurality of serial instructions.
The present application does not limit how the instruction stream is split. A parallel instruction is an instruction that can be allocated to multiple computing devices so that they operate simultaneously, while a serial instruction is an instruction that can only be executed by a single computing device. For example, operation requests for video recognition, video understanding, and the like generally comprise a feature extraction instruction and a feature recognition instruction: the feature extraction instruction needs to perform convolution processing on consecutive frames of images, and the feature recognition instruction generally needs to run a recurrent neural network over the features produced by the feature extraction instruction. The feature extraction instruction can therefore be distributed to multiple computing devices, while the feature recognition instruction can only be processed by a single computing device.
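A minimal sketch of such a split is shown below; the is_parallel flag used to classify each instruction is an assumed annotation, since the application does not limit how the instruction stream is split.

    def split_instruction_stream(instruction_stream):
        # parallel instructions can be fanned out to several computing devices;
        # serial instructions must each run on a single computing device
        parallel, serial = [], []
        for instruction in instruction_stream:
            (parallel if instruction["is_parallel"] else serial).append(instruction)
        return parallel, serial

    stream = [{"name": "feature_extraction", "is_parallel": True},
              {"name": "feature_recognition", "is_parallel": False}]
    parallel_instructions, serial_instructions = split_instruction_stream(stream)
    print(len(parallel_instructions), len(serial_instructions))  # 1 1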
204: selecting, from the plurality of computing devices, a plurality of parallel computing devices corresponding to the plurality of parallel instructions and at least one serial computing device corresponding to the plurality of serial instructions.
The instruction stream is split to obtain a plurality of parallel instructions and a plurality of serial instructions. From the plurality of computing devices in the server, a parallel computing device is selected for executing each parallel instruction and a serial computing device is selected for executing the serial instructions, yielding a plurality of parallel computing devices and at least one serial computing device. The operation instruction of each parallel computing device is the parallel instruction corresponding to it, and the operation instruction of each serial computing device is the corresponding serial instruction.
The present application does not limit which computing devices serve as the serial computing devices and which serve as the parallel computing devices.
Optionally, if the operation task of the operation request is a training task, the operation data corresponding to the operation request are grouped based on the training method corresponding to the training task to obtain multiple groups of operation data, and the plurality of parallel computing devices are selected from the plurality of computing devices according to the multiple groups of operation data and the plurality of parallel instructions.
The operation data corresponding to the operation request are grouped according to the specific training method to obtain multiple groups of operation data; the grouping may be by data type or simply a division into several groups, which is not limited here. After grouping, appropriate computing devices are selected for parallel operation, which further reduces the amount of computation on each computing device and improves operation efficiency.
For example, for the batch gradient descent (BGD) algorithm, where there is one training set (batch), the batch may be divided into a plurality of subsets and distributed to a plurality of computing devices, with each computing device training on one subset; each subset is one group of operation data. For the stochastic gradient descent (SGD) algorithm, where each batch contains only one piece of operation data, different batches can be assigned to different computing devices. For the mini-batch gradient descent algorithm (mini-batch SGD), the data of each batch may be distributed to different computing devices for calculation, or each batch may be divided into smaller subsets that are distributed to different computing devices.
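The grouping strategies described above can be sketched as follows; the helper names and the round-robin assignment for SGD are assumptions made for illustration.

    import numpy as np

    def group_for_bgd(batch, num_devices):
        # batch gradient descent: split the single batch into subsets,
        # one subset of operation data per parallel computing device
        return np.array_split(batch, num_devices)

    def group_for_sgd(batches, num_devices):
        # stochastic gradient descent: each batch holds one sample,
        # so assign whole batches to devices round-robin
        return [batches[i::num_devices] for i in range(num_devices)]

    print(group_for_bgd(np.arange(10), 3))          # three subsets of one batch
    print(group_for_sgd(list(range(6)), 3))         # [[0, 3], [1, 4], [2, 5]]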
In the present application, a serial computing device may be one or more of a plurality of parallel computing devices, or may be another idle computing device.
For cases in which the portions corresponding to the serial instructions differ greatly from one another in their computational characteristics, optionally, the selecting of the at least one serial computing device corresponding to the serial instructions from the plurality of computing devices includes: grouping the serial instructions to obtain at least one group of serial instruction sequences; and selecting, from the plurality of computing devices, the computing device corresponding to each serial instruction sequence in the at least one group of serial instruction sequences, to obtain the at least one serial computing device.
That is, the plurality of serial instructions are grouped into several instruction sequences, a computing device is selected to execute each group of instruction sequences, and each part is executed by the computing device best suited to process it; this improves the operation efficiency of each part and therefore the overall operation efficiency.
Taking the faster region-based convolutional neural network (Faster R-CNN) as an example, Faster R-CNN is composed of convolutional layers, a region proposal network (RPN) layer, and an ROI pooling layer, and the computational characteristics of these layers differ greatly. The convolutional layers and the RPN layer can therefore be deployed on a neural network computing device that is good at handling convolution, while the ROI pooling layer can be deployed on a more general-purpose processor such as a CPU, so that the operation efficiency of each part is improved and the overall operation efficiency is improved accordingly.
Optionally, if the operation task of the operation request is a test task, computing devices that include the forward operation of the target neural network model are selected from the plurality of computing devices to obtain a plurality of target computing devices; if the operation task is a training task, computing devices that include both the forward operation and the backward training of the target neural network model are selected from the plurality of computing devices to obtain the plurality of target computing devices; and the plurality of parallel computing devices corresponding to the plurality of parallel instructions and the at least one serial computing device corresponding to the plurality of serial instructions are selected from the plurality of target computing devices.
That is, if the operation task of the operation request is a test task, the target computing devices are computing devices that can be used to perform forward operations of the target neural network model; when the operation task is a training task, the plurality of target computing devices are computing devices which can be used for executing forward operation and backward training of the target neural network model, namely, the accuracy and the efficiency of operation can be improved by processing the operation request through a special computing device.
For example, the server includes a first computing device and a second computing device, where the first computing device only supports the forward operation of a specified neural network model, while the second computing device can perform both the forward operation and the backward training of that model. When the target neural network model in a received target operation request is the specified neural network model and the operation task is a test task, the first computing device is determined to execute the target operation request.
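Filtering the candidate devices by operation task, as in the first/second computing device example above, might look like the following sketch (the device attributes are hypothetical); among the resulting target computing devices, a forward-only device such as the first computing device could then be preferred for a test task.

    def select_target_devices(devices, task):
        if task == "test":
            # a test task only needs the forward operation of the model
            return [d for d in devices if d["forward"]]
        # a training task additionally needs backward-training support
        return [d for d in devices if d["forward"] and d["backward"]]

    devices = [{"id": "first", "forward": True, "backward": False},
               {"id": "second", "forward": True, "backward": True}]
    print([d["id"] for d in select_target_devices(devices, "test")])       # ['first', 'second']
    print([d["id"] for d in select_target_devices(devices, "training")])   # ['second']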
Optionally, selecting an auxiliary scheduling algorithm from an auxiliary scheduling algorithm set according to the attribute information of the operation request; selecting the plurality of parallel computing devices and the at least one serial computing device from the plurality of computing devices according to the secondary scheduling algorithm.
The set of auxiliary scheduling algorithms includes, but is not limited to, one of: the Round-Robin Scheduling algorithm, the Weighted Round-Robin algorithm, the Least Connections algorithm, the Weighted Least Connections algorithm, the Locality-Based Least Connections algorithm, the Locality-Based Least Connections with Replication algorithm, the Destination Hashing algorithm, and the Source Hashing algorithm.
The present application does not limit how the auxiliary scheduling algorithm is selected according to the attribute information. For example, if a plurality of target computing devices process the same operation request, the auxiliary scheduling algorithm may be the round-robin scheduling algorithm. If different target computing devices differ in their capacity to absorb load, so that more operation requests should be allocated to target computing devices with high configuration and low load, the auxiliary scheduling algorithm may be the weighted round-robin algorithm. If the workload allocated to each target computing device differs, the auxiliary scheduling algorithm may be the least connections scheduling algorithm, which dynamically selects the target computing device with the fewest currently backlogged connections to process the current request so as to make the best possible use of the target computing devices, or it may be the weighted least connections scheduling algorithm.
That is, on the basis of the scheduling method in the above embodiment, the computing device that finally executes the operation request is selected in combination with the auxiliary scheduling algorithm, thereby further improving the operation efficiency of the server.
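As an illustration of how two of these auxiliary scheduling algorithms could pick among target computing devices (a sketch only, with hypothetical weights and connection counts, not the claimed selection method):

    import itertools

    def weighted_round_robin(devices):
        # higher-weight devices (higher configuration, lower load) are
        # scheduled proportionally more often
        pool = [d["id"] for d in devices for _ in range(d["weight"])]
        return itertools.cycle(pool)

    def least_connections(devices):
        # pick the device with the fewest currently backlogged connections
        return min(devices, key=lambda d: d["connections"])["id"]

    devices = [{"id": "dev0", "weight": 3, "connections": 5},
               {"id": "dev1", "weight": 1, "connections": 2}]
    scheduler = weighted_round_robin(devices)
    print([next(scheduler) for _ in range(4)])   # ['dev0', 'dev0', 'dev0', 'dev1']
    print(least_connections(devices))            # 'dev1'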
205: and calculating the operation data corresponding to the operation request according to the parallel instruction corresponding to each parallel computing device in the plurality of parallel computing devices and the serial instruction corresponding to each serial computing device in the at least one serial computing device to obtain a final operation result.
The present application does not limit the operation data corresponding to each operation request, and may be image data used for image recognition, or voice data used for voice recognition, or the like; when the operation task is a test task, the operation data is data uploaded by the user, and when the operation task is a training task, the operation data can be a training set uploaded by the user or a training set stored in the server and corresponding to the target neural network model.
The calculation process of the operation instruction can generate a plurality of intermediate operation results, and the final operation result corresponding to the operation request can be obtained according to the intermediate operation results.
206: and sending the final operation result to the electronic equipment sending the operation request.
It can be understood that, when the server receives a single operation request sent by the electronic device, it splits the instruction stream of the target neural network model corresponding to the operation request, selects parallel computing devices to process the corresponding parallel instructions in parallel, selects serial computing devices that are well suited to the serial instructions to execute them independently, and sends the final operation result corresponding to the operation request to the electronic device. That is, the computing devices among the plurality of parallel computing devices execute their corresponding parallel instructions simultaneously, which saves the execution time of the parallel instructions, while each serial instruction is executed by a serial computing device suited to it, which improves the operation efficiency of every serial stage. In other words, the computing resources are allocated in a unified manner according to the operation request, so that the plurality of computing devices in the server cooperate effectively and the overall operation efficiency of the server is improved.
Optionally, the method further includes: waiting for a first preset time length and detecting whether each computing device among the plurality of parallel computing devices and the at least one serial computing device has obtained its corresponding final operation result; if not, treating each computing device that has not obtained the final operation result as a delay computing device; and selecting a spare computing device from the idle computing devices among the plurality of computing devices according to the instruction corresponding to the delay computing device.
That is, when the first preset time length is reached, the computing device which does not obtain the final operation result is used as the delay computing device, and the spare computing device is selected from the idle computing devices according to the instruction executed by the delay computing device, so that the operation efficiency is improved.
Optionally, after the spare computing device executes the operation instruction corresponding to the delay computing device, the method further includes: obtaining whichever final operation result is produced first by the delay computing device or the spare computing device; and sending a pause instruction to the one of the delay computing device and the spare computing device that has not obtained the final operation result.
The pause instruction instructs whichever of the delay computing device and the spare computing device has not obtained the final operation result to suspend execution of the corresponding operation instruction. That is, the spare computing device executes the instruction corresponding to the delay computing device, whichever final operation result is obtained first by the spare computing device or the delay computing device is taken as the final operation result corresponding to that operation instruction, and a pause instruction is sent to the device that has not obtained the final operation result; in other words, the computing device that has not completed the operation instruction is suspended, which saves power consumption.
Optionally, the method further includes: waiting for a second preset time length and detecting whether the delay computing device has obtained its corresponding final operation result; if not, treating the delay computing device that has not obtained the final operation result as a fault computing device and sending a fault instruction.
The fault instruction is used for informing the operation and maintenance personnel that the fault calculation device has a fault, and the second preset time length is longer than the first preset time length. That is, when the second preset time period is reached, if the final operation result obtained by the delay calculation device is not received, it is determined that the delay calculation device is in fault, and the corresponding operation and maintenance personnel are notified, so that the fault processing capability is improved.
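Taken together, the first and second preset time lengths amount to a two-stage supervision scheme. The sketch below imitates it with Python threads; the timeout values, device names, and the use of concurrent.futures are assumptions for illustration, and cancelling the slower future is only a best-effort stand-in for the pause instruction.

    import time
    from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

    FIRST_TIMEOUT = 1.0    # first preset time length, in seconds (assumed value)
    SECOND_TIMEOUT = 3.0   # second preset time length, longer than the first

    def run_on_device(name, delay, instruction):
        time.sleep(delay)                       # stands in for the real computation
        return name, "result of " + instruction

    def supervised_execute(executor, instruction):
        primary = executor.submit(run_on_device, "primary", 2.0, instruction)
        done, _ = wait([primary], timeout=FIRST_TIMEOUT)
        if done:
            return primary.result()
        # the primary device is now a delay computing device: engage a spare device
        spare = executor.submit(run_on_device, "spare", 0.5, instruction)
        done, not_done = wait([primary, spare],
                              timeout=SECOND_TIMEOUT - FIRST_TIMEOUT,
                              return_when=FIRST_COMPLETED)
        if not done:
            print("fault instruction: notify operation and maintenance personnel")
            return None
        for future in not_done:
            future.cancel()                     # best-effort stand-in for the pause instruction
        return next(iter(done)).result()

    with ThreadPoolExecutor(max_workers=2) as executor:
        print(supervised_execute(executor, "serial instruction"))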
Optionally, the method further includes: updating a hash table of the plurality of computing devices every target time threshold.
A hash table is a data structure that is accessed directly by key value. In the present application, the IP addresses of the plurality of computing devices are used as key values and are mapped to positions in the hash table by a hash function (mapping function), so that once a target computing device has been determined, the physical resources allocated to it can be found quickly. The specific form of the hash table is not limited; it may be a static hash table configured manually, or the hardware resources may be allocated according to IP address. The hash tables of the plurality of computing devices are updated every target time threshold, which improves both the accuracy and the efficiency of the lookup.
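A minimal sketch of such a lookup table is shown below; the key layout (IP address to allocated physical resources) follows the description above, while the refresh mechanism and field names are assumptions.

    import time

    class DeviceHashTable:
        def __init__(self, refresh_interval):
            self.refresh_interval = refresh_interval   # the target time threshold
            self.last_refresh = None
            self.table = {}

        def refresh(self, devices):
            # map each device's IP address (the key value) to its allocated
            # physical resources via the dictionary's hashing of the key
            self.table = {d["ip"]: d["resources"] for d in devices}
            self.last_refresh = time.monotonic()

        def lookup(self, ip, devices):
            stale = (self.last_refresh is None or
                     time.monotonic() - self.last_refresh > self.refresh_interval)
            if stale:
                self.refresh(devices)                  # periodic update of the hash table
            return self.table.get(ip)

    devices = [{"ip": "10.0.0.1", "resources": {"cores": 32, "mem_gb": 128}}]
    table = DeviceHashTable(refresh_interval=60.0)
    print(table.lookup("10.0.0.1", devices))           # {'cores': 32, 'mem_gb': 128}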
Referring to fig. 3, fig. 3 is a schematic structural diagram of another server provided in the present application, consistent with the embodiment of fig. 2, where the server includes a plurality of computing devices. As shown in fig. 3, the server 300 includes:
a receiving unit 301 configured to receive an operation request;
an obtaining unit 302, configured to obtain an instruction stream of a target neural network model corresponding to the operation request;
a splitting unit 303, configured to split the instruction stream into a plurality of parallel instructions and a plurality of serial instructions;
a selecting unit 304, configured to select, from the plurality of computing devices, a plurality of parallel computing devices corresponding to the plurality of parallel instructions and at least one serial computing device corresponding to the plurality of serial instructions;
an arithmetic unit 305, configured to calculate, according to a parallel instruction corresponding to each of the multiple parallel computing devices and a serial instruction corresponding to each of the at least one serial computing device, arithmetic data corresponding to the arithmetic request to obtain a final arithmetic result;
a sending unit 306, configured to send the final operation result to the electronic device that sent the operation request.
Optionally, the obtaining unit 302 is specifically configured to obtain a first instruction descriptor stream according to a basic operation sequence corresponding to the target neural network model; simplifying the first instruction descriptor stream to obtain a second instruction descriptor stream; and acquiring the instruction stream according to the second instruction descriptor stream.
Optionally, the splitting unit 303 is specifically configured to group the multiple serial instructions to obtain at least one group of serial instruction sequences; the selecting unit 304 is specifically configured to select, from the plurality of computing devices, a computing device corresponding to each serial instruction sequence in the at least one group of serial instruction sequences, so as to obtain the at least one serial computing device.
Optionally, the splitting unit 303 is specifically configured to, if an operation task of the operation request is a training task, group operation data corresponding to the operation request based on a training method corresponding to the operation request to obtain multiple groups of operation data; the selecting unit 304 is specifically configured to select the multiple parallel computing devices from the multiple computing devices according to the multiple sets of operation data and the multiple parallel instructions.
Optionally, the selecting unit 304 is specifically configured to select, if the operation task of the operation request is a test task, a computing device including a forward operation of the target neural network model from the multiple computing devices to obtain multiple target computing devices; if the operation task is a training task, selecting a computing device comprising forward operation and backward training of the target neural network model from the plurality of computing devices to obtain the plurality of target computing devices; selecting the plurality of parallel computing devices and the at least one serial computing device from the plurality of target computing devices.
Optionally, the selecting unit 304 is specifically configured to select an auxiliary scheduling algorithm from an auxiliary scheduling algorithm set according to the attribute information of the operation request, where the auxiliary scheduling algorithm set includes at least one of the following: the round-robin scheduling algorithm, the weighted round-robin algorithm, the least connections algorithm, the weighted least connections algorithm, the locality-based least connections with replication algorithm, the destination hashing algorithm, and the source hashing algorithm; and to select the plurality of parallel computing devices and the at least one serial computing device from the plurality of computing devices according to the auxiliary scheduling algorithm.
Optionally, the server further includes a detecting unit 307, configured to wait for a first preset duration, detect whether the multiple parallel computing devices and the at least one serial computing device obtain corresponding final operation results, and if not, use the computing device that does not obtain the final operation result as a delay computing device; selecting, by the selecting unit 304, a spare computing device from the idle computing devices of the plurality of computing devices according to the instruction corresponding to the delay computing device; the instructions corresponding to the delay calculation means are executed by the arithmetic unit 305 through the standby calculation means.
Optionally, the obtaining unit 302 is further configured to obtain a final operation result obtained first between the delay calculating device and the standby calculating device; a pause instruction is sent by the sending unit 306 to a computing device between the delay computing device and the standby computing device that does not obtain the final operation result.
Optionally, the detecting unit 307 is further configured to wait for a second preset duration and detect whether the delay computing device has obtained the corresponding final operation result, and if not, to take the delay computing device that has not obtained the final operation result as a fault computing device; and a fault instruction is sent by the sending unit 306, where the fault instruction is used to inform operation and maintenance personnel that the fault computing device has failed, and the second preset duration is longer than the first preset duration.
Optionally, the server further includes an updating unit 308, configured to update the hash table of the server at intervals of a target time threshold.
Optionally, the obtaining unit 302 is further configured to obtain an operation requirement of each specified neural network model in the specified neural network model set and a hardware attribute of each computing device in the plurality of computing devices to obtain a plurality of operation requirements and a plurality of hardware attributes;
the server further includes a deployment unit 309 configured to deploy, according to the plurality of operation requirements and the plurality of hardware attributes, a corresponding designated neural network model on a designated computing device corresponding to each designated neural network model in the designated neural network model set.
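As a hedged illustration of this deployment step, the sketch below matches each designated model to a device whose hardware attribute covers its operation requirement; collapsing requirements and attributes to a single memory figure is an assumption made purely for readability:

```python
# Illustrative sketch only: deploy each designated neural network model onto a
# computing device whose hardware attributes satisfy the model's operation
# requirement. Real requirements and attributes could include throughput,
# supported operations, interconnect bandwidth, and so on.
def deploy(models, devices):
    deployment = {}
    free = sorted(devices, key=lambda d: d["memory_gb"])   # smallest device first
    for model in sorted(models, key=lambda m: m["required_memory_gb"], reverse=True):
        for dev in free:
            if dev["memory_gb"] >= model["required_memory_gb"]:
                deployment[model["name"]] = dev["name"]    # pick the tightest fit
                free.remove(dev)
                break
    return deployment

models = [{"name": "resnet", "required_memory_gb": 8}, {"name": "mlp", "required_memory_gb": 2}]
devices = [{"name": "card0", "memory_gb": 16}, {"name": "card1", "memory_gb": 4}]
print(deploy(models, devices))  # {'resnet': 'card0', 'mlp': 'card1'}
```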
Optionally, the computing device comprises at least one computing carrier comprising at least one computing unit.
It can be understood that, when the server receives a single operation request sent by the electronic device, the instruction stream of the target neural network model corresponding to the operation request is split, parallel computing devices corresponding to the parallel instructions are selected to process those parallel instructions concurrently, serial computing devices suited to the serial instructions are selected to execute the corresponding serial instructions independently, and the final operation result corresponding to the operation request is sent to the electronic device. That is, the computing devices among the multiple parallel computing devices execute the corresponding parallel instructions in parallel, which shortens the execution time of the parallel instructions, while the serial computing devices improve the computing efficiency of the serial instructions. In other words, the computing resources are allocated according to the operation request as a whole, so that the multiple computing devices in the server cooperate effectively and the overall computing efficiency of the server is improved.
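The overall idea can be illustrated with a short end-to-end sketch in which parallel instructions run concurrently on several worker devices, serial instructions run one after another, and the intermediate operation results are combined into the final operation result; modeling devices as thread-pool workers and tagging instructions with a "kind" field are assumptions for illustration only:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative end-to-end sketch: split an instruction stream into parallel
# and serial instructions, run the parallel ones concurrently, then run the
# serial ones one device at a time on the combined intermediate results.

def schedule(instruction_stream, num_parallel_devices):
    parallel = [i for i in instruction_stream if i["kind"] == "parallel"]
    serial = [i for i in instruction_stream if i["kind"] == "serial"]

    with ThreadPoolExecutor(max_workers=num_parallel_devices) as pool:
        intermediate = list(pool.map(lambda instr: instr["fn"](), parallel))

    result = sum(intermediate)          # combine the intermediate operation results
    for instr in serial:                # serial instructions execute one at a time
        result = instr["fn"](result)
    return result

stream = [
    {"kind": "parallel", "fn": lambda: 2},
    {"kind": "parallel", "fn": lambda: 3},
    {"kind": "serial", "fn": lambda x: x * 10},
]
print(schedule(stream, num_parallel_devices=2))  # (2 + 3) * 10 = 50
```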
In one embodiment, as shown in fig. 4, the present application discloses another server 400 comprising a processor 401, a memory 402, a communication interface 403, and one or more programs 404, where the one or more programs 404 are stored in the memory 402 and configured to be executed by the processor 401, and the programs 404 comprise instructions for performing some or all of the steps described in the scheduling method.
In another embodiment of the present invention, a computer-readable storage medium is provided, which stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the scheduling method described above.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both, and that the components and steps of the examples have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the terminal and the unit described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal and method can be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative; the above-described division of units is only one division of logical functions, and other divisions may be adopted in practice; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may also be an electrical, mechanical, or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit may be stored in a computer-readable storage medium if it is implemented in the form of a software functional unit and sold or used as a separate product. Based on such understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the above method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program codes, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that implementations not shown or described in the drawings or in the description are forms known to those of ordinary skill in the art and are not described in detail. Further, the above definitions of the various elements and methods are not limited to the specific structures, shapes, or arrangements of parts mentioned in the embodiments, which may be readily modified or substituted by those of ordinary skill in the art.
The above embodiments are described in further detail to illustrate the invention. It should be understood that they are for illustrative purposes only and are not to be construed as limiting the present invention; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present application.

Claims (15)

1. A method of scheduling, the method being based on a server comprising a plurality of computing devices, the method comprising:
receiving a single operation request;
acquiring an instruction stream of a target neural network model corresponding to the operation request;
splitting the instruction stream into a plurality of parallel instructions and a plurality of serial instructions; the parallel instruction is an instruction which can be distributed to a plurality of computing devices for simultaneous operation, and the serial instruction is an instruction which can only be executed by a single computing device;
determining an auxiliary scheduling algorithm according to the attribute information of the operation request, and selecting, from the plurality of computing devices according to the auxiliary scheduling algorithm, a plurality of parallel computing devices corresponding to the parallel instructions and at least one serial computing device corresponding to the serial instructions; the serial computing device is one or more of the plurality of parallel computing devices, or other idle computing devices in the server;
calculating the operation data corresponding to the operation request according to the parallel instruction corresponding to each parallel computing device in the plurality of parallel computing devices and the serial instruction corresponding to each serial computing device in the at least one serial computing device, generating a plurality of intermediate operation results in the calculation process, and acquiring a final operation result corresponding to the operation request according to the plurality of intermediate operation results;
and sending the final operation result to the electronic device that sent the operation request.
2. The method of claim 1, wherein obtaining the instruction stream of the target neural network model corresponding to the operation request comprises:
acquiring a first instruction descriptor stream according to a basic operation sequence corresponding to the target neural network model;
simplifying the first instruction descriptor stream to obtain a second instruction descriptor stream;
and acquiring the instruction stream according to the second instruction descriptor stream.
3. The method of claim 2, wherein said selecting, from the plurality of computing devices, the at least one serial computing device corresponding to the plurality of serial instructions comprises:
grouping the serial instructions to obtain at least one group of serial instruction sequences;
and selecting the computing device corresponding to each serial instruction sequence in the at least one group of serial instruction sequences from the plurality of computing devices to obtain the at least one serial computing device.
4. The method according to any of claims 1-3, wherein said selecting said plurality of parallel computing devices corresponding to said plurality of parallel instructions from said plurality of computing devices comprises:
if the operation task of the operation request is a training task, grouping operation data corresponding to the operation request based on a training method corresponding to the operation request to obtain multiple groups of operation data;
selecting the plurality of parallel computing devices from the plurality of computing devices according to the plurality of sets of arithmetic data and the plurality of parallel instructions.
5. The method of claim 1, wherein selecting from the plurality of computing devices a plurality of parallel computing devices corresponding to the plurality of parallel instructions and at least one serial computing device corresponding to the plurality of serial instructions comprises:
if the operation task of the operation request is a test task, selecting a computing device comprising the forward operation of the target neural network model from the plurality of computing devices to obtain a plurality of target computing devices;
if the operation task is a training task, selecting a computing device comprising forward operation and backward training of the target neural network model from the plurality of computing devices to obtain the plurality of target computing devices;
selecting the plurality of parallel computing devices and the at least one serial computing device from the plurality of target computing devices.
6. The method of claim 1, wherein selecting from the plurality of computing devices a plurality of parallel computing devices corresponding to the plurality of parallel instructions and at least one serial computing device corresponding to the plurality of serial instructions comprises:
selecting an auxiliary scheduling algorithm from an auxiliary scheduling algorithm set according to the attribute information of the operation request, wherein the auxiliary scheduling algorithm set comprises at least one of the following: a round-robin scheduling algorithm, a weighted round-robin algorithm, a least-connections algorithm, a weighted least-connections algorithm, a locality-based least-connections algorithm with replication, a destination address hashing algorithm, and a source address hashing algorithm;
selecting the plurality of parallel computing devices and the at least one serial computing device from the plurality of computing devices according to the auxiliary scheduling algorithm.
7. The method of claim 1, 2, 3, 5, or 6, further comprising:
waiting for a first preset time length, detecting whether each computing device in the plurality of parallel computing devices and the at least one serial computing device obtains a corresponding final operation result, and if not, taking the computing device which does not obtain the final operation result as a delay computing device;
selecting a standby computing device from the idle computing devices of the plurality of computing devices according to the instruction corresponding to the delay computing device;
and executing, by the standby computing device, the instruction corresponding to the delay computing device.
8. The method of claim 7, wherein after the executing, by the standby computing device, of the instruction corresponding to the delay computing device, the method further comprises:
obtaining the final operation result that is obtained first between the delay computing device and the standby computing device;
and sending a pause instruction to whichever of the delay computing device and the standby computing device does not obtain the final operation result.
9. The method of claim 7, further comprising:
waiting for a second preset time length, detecting whether the delay computing device obtains a corresponding final operation result, and if not, taking the delay computing device which does not obtain the final operation result as a fault computing device and sending a fault instruction, wherein the fault instruction is used for notifying operation and maintenance personnel that the fault computing device has failed, and the second preset time length is longer than the first preset time length.
10. The method of claim 1, 2, 3, 5, 6, 8, or 9, further comprising:
and updating the hash table of the server every target time threshold.
11. The method of claim 1, 2, 3, 5, 6, 8, or 9, further comprising:
acquiring the operation requirement of each appointed neural network model in an appointed neural network model set and the hardware attribute of each computing device in the plurality of computing devices to obtain a plurality of operation requirements and a plurality of hardware attributes;
and deploying the corresponding appointed neural network model on the appointed computing device corresponding to each appointed neural network model in the appointed neural network model set according to the plurality of operation requirements and the plurality of hardware attributes.
12. The method of claim 1, 2, 3, 5, 6, 8, or 9, wherein the computing device comprises at least one computing carrier comprising at least one computing unit.
13. A server, comprising a plurality of computing devices, the server further comprising: means for performing the method of any of claims 1-12.
14. A server, comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps of the method of any of claims 1-12.
15. A computer-readable storage medium, having stored thereon a computer program comprising program instructions, which, when executed by a processor, cause the processor to carry out the method of any one of claims 1-12.
CN201711467705.4A 2017-12-28 2017-12-28 Scheduling method and related device Active CN109976809B (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201711467705.4A CN109976809B (en) 2017-12-28 2017-12-28 Scheduling method and related device
EP18895350.9A EP3731089B1 (en) 2017-12-28 2018-08-02 Scheduling method and related apparatus
PCT/CN2018/098324 WO2019128230A1 (en) 2017-12-28 2018-08-02 Scheduling method and related apparatus
US16/767,415 US11568269B2 (en) 2017-12-28 2018-08-02 Scheduling method and related apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711467705.4A CN109976809B (en) 2017-12-28 2017-12-28 Scheduling method and related device

Publications (2)

Publication Number Publication Date
CN109976809A CN109976809A (en) 2019-07-05
CN109976809B true CN109976809B (en) 2020-08-25

Family

ID=67075497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711467705.4A Active CN109976809B (en) 2017-12-28 2017-12-28 Scheduling method and related device

Country Status (1)

Country Link
CN (1) CN109976809B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112988229B (en) * 2019-12-12 2022-08-05 上海大学 Convolutional neural network resource optimization configuration method based on heterogeneous computation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6192465B1 (en) * 1998-09-21 2001-02-20 Advanced Micro Devices, Inc. Using multiple decoders and a reorder queue to decode instructions out of order
US8812564B2 (en) * 2011-12-20 2014-08-19 Sap Ag Parallel uniqueness checks for partitioned tables

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5142665A (en) * 1990-02-20 1992-08-25 International Business Machines Corporation Neural network shell for application programs
CN101441557A (en) * 2008-11-08 2009-05-27 腾讯科技(深圳)有限公司 Distributed parallel calculating system and method based on dynamic data division
CN102171650B (en) * 2008-11-24 2014-09-17 英特尔公司 Systems, methods, and apparatuses to decompose a sequential program into multiple threads, execute said threads, and reconstruct the sequential execution
CN102360309B (en) * 2011-09-29 2013-12-18 中国科学技术大学苏州研究院 Scheduling system and scheduling execution method of multi-core heterogeneous system on chip
JP6083687B2 (en) * 2012-01-06 2017-02-22 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Distributed calculation method, program, host computer, and distributed calculation system (distributed parallel calculation using accelerator device)
CN103049330B (en) * 2012-12-05 2015-08-12 大连理工大学 A kind of trustship type distributed task dispatching method and system
US9501449B2 (en) * 2013-09-10 2016-11-22 Sviral, Inc. Method, apparatus, and computer-readable medium for parallelization of a computer program on a plurality of computing cores
CN104035751B (en) * 2014-06-20 2016-10-12 深圳市腾讯计算机系统有限公司 Data parallel processing method based on multi-graphics processor and device
CN106056529B (en) * 2015-04-03 2020-06-02 阿里巴巴集团控股有限公司 Method and equipment for training convolutional neural network for picture recognition
CN109375951B (en) * 2016-04-27 2020-10-09 中科寒武纪科技股份有限公司 Device and method for executing forward operation of full-connection layer neural network
CN106779057B (en) * 2016-11-11 2020-04-17 北京旷视科技有限公司 Method and device for calculating binary neural network convolution based on GPU
CN106909971A (en) * 2017-02-10 2017-06-30 华南理工大学 A kind of BP neural network parallel method towards multinuclear computing environment
CN107018184B (en) * 2017-03-28 2019-08-30 华中科技大学 Distributed deep neural network cluster packet synchronization optimization method and system
CN107239826A (en) * 2017-06-06 2017-10-10 上海兆芯集成电路有限公司 Computational methods and device in convolutional neural networks
CN107506173A (en) * 2017-08-30 2017-12-22 郑州云海信息技术有限公司 A kind of accelerated method, the apparatus and system of singular value decomposition computing

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6192465B1 (en) * 1998-09-21 2001-02-20 Advanced Micro Devices, Inc. Using multiple decoders and a reorder queue to decode instructions out of order
US8812564B2 (en) * 2011-12-20 2014-08-19 Sap Ag Parallel uniqueness checks for partitioned tables

Also Published As

Publication number Publication date
CN109976809A (en) 2019-07-05

Similar Documents

Publication Publication Date Title
CN109284823B (en) Arithmetic device and related product
US20200089535A1 (en) Data sharing system and data sharing method therefor
CN112380020A (en) Computing power resource allocation method, device, equipment and storage medium
CN106302780B (en) Method, device and system for batch data transmission of cluster equipment and server
US11568269B2 (en) Scheduling method and related apparatus
CN111431996B (en) Method, apparatus, device and medium for resource configuration
CN110262887B (en) CPU-FPGA task scheduling method and device based on feature recognition
CN111047045B (en) Distribution system and method for machine learning operation
CN109598250A (en) Feature extracting method, device, electronic equipment and computer-readable medium
CN109978129B (en) Scheduling method and related device
CN116450355A (en) Multi-cluster model training method, device, equipment and medium
CN109976809B (en) Scheduling method and related device
CN109976887B (en) Scheduling method and related device
TWI768167B (en) Integrated circuit chip device and related products
CN109978149B (en) Scheduling method and related device
CN113761017A (en) Similarity searching method and device
CN107634978B (en) Resource scheduling method and device
CN115048179A (en) Migration optimization method, source end device and virtual machine migration management system
CN109993292B (en) Integrated circuit chip device and related product
CN114035906A (en) Virtual machine migration method and device, electronic equipment and storage medium
Coviello et al. Content-aware auto-scaling of stream processing applications on container orchestration platforms
CN111767996B (en) Integrated circuit chip device and related products
CN111767998B (en) Integrated circuit chip device and related products
CN110197266B (en) Integrated circuit chip device and related product
CN118133928B (en) Scheduling method and device for model parameters

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant after: Zhongke Cambrian Technology Co., Ltd

Address before: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.

GR01 Patent grant
GR01 Patent grant