CN113472550A - Distributed management method and system, and management system

Info

Publication number: CN113472550A
Application number: CN202010233858.8A
Authority: CN (China)
Prior art keywords: computing system, neural network, distributed management, computing, flow
Legal status: Pending
Original language: Chinese (zh)
Inventors: 陆冲之, 陈佩
Original and current assignee: Alibaba Group Holding Ltd
Priority application: CN202010233858.8A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/04 Network management architectures or arrangements
    • H04L 41/042 Network management architectures or arrangements comprising distributed management centres cooperatively managing the network
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • G06F 9/5061 Partitioning or combining of resources
    • G06F 9/5066 Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network

Abstract

A distributed management system and method are disclosed. The distributed management system is applicable to a first computing system, the first computing system and at least one second computing system form a task distribution network, and the distributed management system comprises a manager, an orchestrator and a scheduler, wherein the manager is used for organizing and managing a predefined operation set, and the operation set is used for completing a specified function; the orchestrator is used for determining an operation flow consisting of operation sets according to requirements; the scheduler is used for distributing the operation flow to one of the at least one second computing system to run according to the current load information of the at least one second computing system. The system can construct operation flows according to requirements and deploy the operation flows according to load information.

Description

Distributed management method and system, and management system
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a distributed management method and system, and a management system.
Background
In recent years, with the rise of artificial intelligence, various neural network models and the acceleration units that support them have emerged one after another, and many customers wish to use neural network models in the application services they provide.
However, while many vendors can provide neural network models for various hardware processors (acceleration units, central processing units) and software versions, few provide a complete platform architecture that helps customers or developers rapidly develop and deploy AI (artificial intelligence) systems.
Disclosure of Invention
Based on this, it is an object of the present disclosure to provide a platform architecture that helps customers or developers to quickly develop and deploy AI systems.
In a first aspect, the disclosed embodiments provide a distributed management system for a first computing system, the first computing system and at least one second computing system forming a task distribution network, the distributed management system comprising a manager, an orchestrator, and a scheduler, wherein,
the manager is used for organizing and managing a predefined operation set, and the operation set is used for completing a specified function;
the orchestrator is used for determining the operation flow composed of the operation sets according to requirements;
the scheduler is used for distributing the operation flow to one of the at least one second computing system to run according to the current load information of the at least one second computing system.
In some embodiments, the scheduler is further configured to decompose the operation flow into a plurality of sub-operation flows according to current load information of a plurality of the second computing systems, and to distribute the plurality of sub-operation flows to run on several of the plurality of second computing systems.
In some embodiments, the scheduler decomposes the operation flow into the plurality of sub-operation flows in response to determining that none of the plurality of second computing systems is suitable to independently run the operation flow based on current load information for the plurality of second computing systems.
In some embodiments, the scheduler determines whether the amount of data input and output between adjacent operation sets exceeds a set threshold, so as to determine whether to divide the adjacent operation sets into the same sub-operation flow.
In some embodiments, the manager comprises an operation set registration unit configured to register an operation set edited by a user and to take the registered operation set as the predefined operation set.
In some embodiments, the orchestrator includes a graphical interface on which icons corresponding to the predefined operation sets are displayed, so that a user can construct a directed acyclic graph on the graphical interface by dragging an icon, and the orchestrator generates the operation flow according to the directed acyclic graph.
In some embodiments, the graphical interface comprises a view area for displaying the predefined operation sets, a drawing area for drawing the directed acyclic graph, and an area for displaying the operation flow corresponding to the directed acyclic graph.
In some embodiments, the current load information of the second computing system includes the following information:
usage of hardware execution units inside the acceleration unit;
processor and memory usage;
usage of input and output devices.
In some embodiments, the first computing system and the second computing system are both embedded systems.
In a second aspect, the disclosed embodiments provide a distributed management system for a first computing system, the first computing system and at least one second computing system forming a task distribution network, the distributed management system comprising a manager, an orchestrator, and a scheduler,
the manager is used for organizing and managing a predefined operation set, and the operation set is divided into a neural network operation set and a non-neural network operation set;
the orchestrator is used for determining a network structure of a neural network model to be used according to requirements, and constructing an operation flow by adopting the neural network operation set and the non-neural network operation set;
the scheduler is used for distributing the operation flow to one of the at least one second computing system to run according to the current load information of the at least one second computing system.
In some embodiments, the requirements include: video processing, face recognition, and target detection, and the non-neural network model operation sets include: an image processing operation set, a data encryption operation set, a video coding and decoding operation set, a data storage operation set, and an image acquisition operation set.
In a third aspect, embodiments of the present disclosure provide a management system comprising a first computing system and at least one second computing system, the first computing system comprising a manager, an orchestrator, and a scheduler, wherein,
the manager is used for organizing and managing a predefined operation set, and the operation set is used for completing a specified function;
the orchestrator is used for determining the operation flow composed of the operation sets according to requirements;
the scheduler is used for distributing the operation flow to one of the at least one second computing system to run according to the current load information of the at least one second computing system,
the second computing system, upon receiving an operational flow, executes the operational flow.
In some embodiments, the first computing system performs cloud computing in a cloud environment and the second computing system performs edge computing on an edge side of the cloud environment.
In some embodiments, the operation flow received and executed by the second computing system is an operation flow of a lightweight neural network model.
In a fourth aspect, an embodiment of the present disclosure provides a distributed management method, which is applied to a first computing system, where the first computing system and at least one second computing system form a task distribution network, and the management method includes:
determining an operation flow consisting of operation sets according to requirements, wherein the operation sets are predefined operation sets used for completing specified functions;
and distributing the operation flow to one of the at least one second computing system to run according to the current load information of the at least one second computing system.
In some embodiments, further comprising: according to the current load information of the plurality of second computing systems, the operation flow is decomposed into a plurality of sub-operation flows, and the plurality of sub-operation flows are distributed to a plurality of the second computing systems to run.
In some embodiments, said decomposing said operation flow into a plurality of sub-operation flows according to current load information for a plurality of said second computing systems comprises:
and judging whether the second computing system suitable for independently running the operation flow exists according to the current load information of the plurality of second computing systems, and decomposing the operation flow into the plurality of sub-operation flows under the condition that none of the plurality of second computing systems is suitable for independently running the operation flow.
In some embodiments, said decomposing said operation flow into a plurality of sub-operation flows according to current load information for a plurality of said second computing systems comprises: judging whether the amount of data input and output between adjacent operation sets exceeds a set threshold, so as to determine whether to divide the adjacent operation sets into the same sub-operation flow.
In some embodiments, the operation set is divided into a neural network operation set and a non-neural network operation set, and the determining an operation flow composed of the operation sets according to requirements includes: determining a network structure of a neural network model to be used according to requirements, constructing an operation flow of the neural network model by adopting the neural network operation set, and constructing an operation flow of an input/output module which provides input data of the neural network model and processes output data of the neural network model by adopting the non-neural network model operation set according to requirements.
In some embodiments, the method further comprises: registering an operation set edited by a user, and taking the registered operation set as one of the predefined operation sets.
In some embodiments, the current load information of the second computing system includes the following information:
usage of hardware execution units inside the acceleration unit;
processor and memory usage;
usage of input and output devices.
In a fifth aspect, the present disclosure provides a computer-readable medium storing computer instructions executable by an electronic device, where the computer instructions, when executed, implement the management method of any one of the above.
The distributed management system provided by the embodiment of the disclosure is not only suitable for constructing an AI system equipped with a neural network model, but also suitable for other application systems, and the distributed management system can construct an operation flow according to requirements and deploy the operation flow according to load information, thereby realizing rapid construction and deployment of the application system.
Drawings
The foregoing and other objects, features, and advantages of the disclosure will be apparent from the following description of embodiments of the disclosure, which refers to the accompanying drawings in which:
FIG. 1 is a schematic diagram of a network architecture suitable for use with embodiments of the present disclosure;
FIG. 2 is an example of a "central" system architecture;
FIG. 3 is an example of an embedded system;
FIG. 4 illustrates a distributed management system according to one embodiment of the present disclosure;
FIG. 5 is an exemplary graphical interface of the orchestrator of FIG. 4;
FIG. 6 is a diagram illustrating data flow relationships for a manager, scheduler, orchestrator, and computing system;
FIG. 7 shows a flow diagram of a management method according to one embodiment of the present disclosure.
Detailed Description
The present disclosure is described below based on examples, but it is not limited to these examples. In the following detailed description, some specific details are set forth; it will be apparent to those skilled in the art that the present disclosure may be practiced without them. Well-known methods, procedures, and components have not been described in detail so as not to obscure the present disclosure. The figures are not necessarily drawn to scale.
The following terms are used herein.
An acceleration unit: also called a neural network acceleration unit, is a processing unit designed to improve data processing speed in special-purpose fields where a general-purpose processor is inefficient (for example, image processing, the various operations of a neural network, video processing). It is often used together with a general-purpose CPU, receives the control of the general-purpose processor, executes the special-purpose processing, and improves computer processing efficiency in those special-purpose fields.
A neural network: generally refers to an artificial neural network (ANN), an algorithmic network that simulates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. A classical neural network, which is also the simplest neural network structure, comprises three levels: an input layer, an output layer, and an intermediate layer (also called a hidden layer). Each of these layers in turn includes a plurality of neurons (also referred to as nodes). A neuron is the smallest processing unit in a neural network, and very complex neural network structures can be formed by extensively interconnecting large numbers of simply functioning neurons.
A neural network model: in a neural network, each neuron is digitized into a neuron mathematical model, and the many neuron mathematical models in the neural network together form the neural network model.
Fig. 1 is a schematic diagram of a network architecture. Referring to the figure, a first computing system 101 and a plurality of second computing systems 103 communicate via a plurality of interconnection units 102. The plurality of interconnection units 102 are software and hardware communication platforms supporting different interconnection protocols. The first computing system 101 and the second computing system 103 are collections of software and hardware that have communication capabilities and can handle certain tasks. The first computing system 101 may be implemented as a variety of devices, including but not limited to: a cellular phone, an internet protocol device, a digital camera, a personal digital assistant (PDA), a personal computer, a notebook, a workstation, a network computer (NetPC), a set-top box, a network hub, a wide area network (WAN) switch, or an internet of things device.
In some embodiments, the first computing system 101 and the second computing system 103 may be computer systems 200 as shown in FIG. 2, and the interconnection unit 102 represents the Internet, or may be a Local Area Network (LAN), or a Wide Area Network (WAN), such as a company's private network.
As shown in FIG. 2, system 200 is an example of a "central" system architecture. The system 200 may be constructed based on various types of processors currently on the market and driven by an operating system such as a WINDOWS™ operating system version, a UNIX operating system, or a Linux operating system. Further, the system 200 is typically implemented in a PC, desktop, notebook, or server. Thus, the system 200 in the disclosed embodiments is not limited to any specific combination of hardware circuitry and software.
Referring to fig. 2, the system 200 includes a processor 201. The processor 201 has data processing capabilities known in the art. It may be a processor of a complex instruction set (CISC) architecture, a reduced instruction set (RISC) architecture, a very long instruction word (VLIW) architecture, or a combination of the above, or any processor device built for a dedicated purpose.
The processor 201 is coupled to a system bus 202. The system bus 202 may be interconnect circuitry that connects the processor 201 to the various other components; the interconnect circuitry supports the various interconnection protocols and interface circuits required, thereby enabling communication between the processor 201 and those components.
The system 200 also includes a memory 204 and a graphics card 207. The memory 204 may be dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, or other memory. The memory 204 includes an instruction memory 205 and a data memory 206; by storing instruction information and data information separately, it allows the processor 201 to fetch instructions from the instruction memory 205 and data from the data memory 206 simultaneously when executing an instruction pipeline. The graphics card 207 comprises a display driver for controlling the correct display of display signals on the display screen.
The system 200 also includes a Memory Controller Hub (MCH) 203, which may be coupled to the system bus 202, the memory 204, the graphics card 207, and an I/O controller hub (ICH) 208. The processor 201 may communicate with the Memory Controller Hub (MCH) 203 via the system bus 202. The Memory Controller Hub (MCH) 203 provides a high-bandwidth memory access path to the memory 204 for storage and retrieval of instruction information and data information. Meanwhile, the Memory Controller Hub (MCH) 203 and the graphics card 207 transmit display signals over a graphics card signal input/output interface, for example, an interface type such as DVI or HDMI.
The Memory Controller Hub (MCH) 203 not only transfers digital signals between the processor 201, the memory 204, and the graphics card 207, but also bridges digital signals between the system bus 202 on one side and the memory 204 and the I/O controller hub (ICH) 208 on the other.
The system 200 also includes an I/O controller hub (ICH) 208 that is coupled to the Memory Controller Hub (MCH) 203 through a dedicated hub interface bus; some I/O devices are coupled to the I/O controller hub (ICH) 208 via a local I/O bus. The local I/O bus is used to couple peripherals to the I/O controller hub (ICH) 208, which in turn is coupled to the Memory Controller Hub (MCH) 203 and the system bus 202. Peripheral devices include, but are not limited to, the following: hard disk 209, optical disk drive 210, sound card 211, serial expansion port 212, audio controller 213, keyboard 214, mouse 215, GPIO interface 216, flash memory 217, and network card 218.
Of course, the block diagrams of different computer systems vary depending on the motherboard, operating system, and instruction set architecture. For example, many computer systems now integrate the Memory Controller Hub (MCH) 203 within the processor 201, so that the I/O controller hub (ICH) 208 becomes the control hub coupled directly to the processor 201.
In some embodiments, the first computing system 101 and the second computing system 103 are embedded systems 300 as shown in fig. 3, and the interconnection unit 102 represents the internet, a Local Area Network (LAN), a Wide Area Network (WAN), or the internet of things. Although embedded systems differ widely in specific functions, appearance, interfaces, and operation, their basic hardware structure is highly similar to that of a general-purpose computer; the application characteristics of embedded systems, however, lead to considerable differences from general-purpose computer systems in hardware composition and implementation form.
First, in order to meet the requirements of the embedded system 300 on speed, volume, and power consumption, data that needs to be stored for a long time, such as the operating system, application software, and special data, is usually not placed on a large-capacity, low-speed storage medium such as a magnetic disk; instead, a random access memory 302 or a flash memory 303 is mostly used.
In addition, the embedded system 300 requires an A/D (analog/digital conversion) interface 305 and a serial interface 306 for measurement and control, both of which are rarely used in general-purpose computers. The A/D interface 305 performs the conversion of analog signals to digital signals and of digital signals to analog signals required in testing: testing is often required when the embedded system 300 is used in industrial production, and since the single chip generates a digital signal that must be converted into an analog signal for testing, an A/D interface 305 is needed, unlike in a general-purpose computer. Industry also often requires multiple embedded systems to be connected in series to perform related functions, so a serial interface 306 for connecting multiple embedded systems in series is required, which general-purpose computers do not need.
In addition, the embedded system 300 is a basic processing unit, and industrial designs often need to connect a plurality of embedded systems 300 into a network, so a network interface 307 for connecting the embedded system 300 to the network is required; this, too, is mostly unnecessary in general-purpose computers. Depending on the application and scale, some embedded systems 300 also employ an external bus 304: as the application field of the embedded system 300 rapidly expands, embedded systems tend to become more and more personalized, and the types of buses adopted to suit their characteristics keep increasing. Finally, to test the internal circuits of the embedded processor 301, boundary scan test technology is commonly used in processor chips, and a debug interface 308 is employed to accommodate this testing.
With the rapid development of very large scale integration (VLSI) and semiconductor processes, an embedded system can be implemented on a single silicon chip, i.e., as an embedded system on chip (SoC). Accordingly, the first computing system 101 and the second computing system 103 may be a first system on chip and a second system on chip, and the interconnection unit 102 may be a connection unit between the systems on chip, for example, an AHB bus, an APB bus, or an AXI (Advanced eXtensible Interface) bus.
In addition, the first computing system 101 and/or the second computing system 103 may also be an embedded system or a computer system on which a neural network acceleration unit is built, so as to operate the neural network acceleration unit. For example, the first computing system 101 or the second computing system 103 is an AI product prototype built according to requirements: a selected neural network computing card paired with a selected open source mainboard and a high-performance processor, forming a hardware platform with rich hardware interfaces.
It should be noted that the above description is intended only to illustrate a suitable hardware and software environment for embodiments of the present disclosure, and not as a limitation on embodiments of the present disclosure. In fact, first computing system 101 and second computing system 103 may be one of any of the types of computing systems described above, forming a cross-system, cross-platform network architecture.
FIG. 4 illustrates a distributed management system according to one embodiment of the present disclosure. Referring to the figure, the distributed management system 50 sits between individual or enterprise users and service providers. As a result, an individual or enterprise user can build an application system on their own based on the distributed management system 50 and deploy it on the hardware platform 70, which is composed of computing systems 1-3. Individual or enterprise users may also hand requirements 60 to service providers, who decompose and process the requirements 60 based on the distributed management system 50, form application systems, and deploy them onto the various computing systems of the hardware platform 70. With the distributed management system 50, developers need only pay attention to development work related to the requirements 60, do not need to manage the underlying equipment, and can take advantage of hardware acceleration by docking the system to a hardware platform.
The distributed management system 50 can be understood in terms of the network structure shown in fig. 1: the distributed management system 50 is deployed on the first computing system 101 in fig. 1, and each computing system under the hardware platform 70 corresponds to a second computing system 103 in fig. 1. The distributed management system 50 is responsible for task allocation and management in the network architecture, and the computing systems under the hardware platform 70 are responsible for executing the allocated tasks.
The distributed management system 50 includes a manager 501, a scheduler 502, and an orchestrator 503. The manager 501 uniformly organizes and manages the predefined operation sets; an operation set is a collection of operations that implements a certain function or task. The orchestrator 503 determines the operation flow composed of operation sets according to the requirements. The scheduler 502 obtains the current load information of each computing system on the hardware platform and allocates the operation flow to at least one computing system to run.
To facilitate management of operation sets, the manager 501 defines three basic components for each operation set: a start interface, a function definition and call, and a transfer data type. The start interface is analogous to the entry function that every program has: each operation set defines a uniform start interface so that it can be instantiated and started externally. The function definition describes the service the operation set provides externally, and the function call describes the specific method of calling that service. As for the transfer data type, each operation set acts as an execution node whose transfer data type describes its input and output data types; data can be passed between operation sets only if these types match. On this basis, operations across platforms, machines, networks, and protocols can be unified under the same operation set abstraction.
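To make the three components concrete, the following is a minimal Python sketch (illustrative only; the patent gives no code, and all names here are assumptions) of an operation set carrying a uniform start interface, a function definition, and declared transfer data types:

```python
from abc import ABC, abstractmethod
from typing import Any, Type


class OperationSet(ABC):
    """Base class for an operation set, i.e. one execution node in an
    operation flow (a hypothetical abstraction, not the patent's code)."""

    # Transfer data types: an edge between two operation sets is valid
    # only when the upstream output type matches the downstream input type.
    input_type: Type = Any
    output_type: Type = Any

    @abstractmethod
    def start(self, data: Any) -> Any:
        """Uniform start interface: the single entry point every operation
        set exposes so it can be instantiated and started externally."""

    def describe(self) -> dict:
        """Function definition and call: describes the service provided
        externally so the orchestrator can match operation sets to needs."""
        return {
            "name": type(self).__name__,
            "input_type": self.input_type,
            "output_type": self.output_type,
        }
```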
The manager 501 may further define a plurality of operation set templates. An operation set template specifies the operation set specification, which includes, for example, the basic requirements on an operation set's start interface, function definition and call, and transfer data type; a developer can use the templates to develop custom operation sets according to actual needs. In this way, research and development personnel can quickly develop customized operation sets and build more personalized application systems from them.
In some embodiments, the manager 501 includes an operation set registration unit configured to register an operation set edited by a user, and take the registered operation set as a predefined operation set. The manager 501 may provide an editing interface to the developer so that the developer may make an operation set using the editing interface and then register it in the system using the operation set registration unit. The manager 501 may also provide only an upload interface to the research and development staff, so that the research and development staff can upload the created operation set to the system based on the upload interface and then register the operation set in the system by using the operation set registration unit.
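A hedged sketch of how such a registration unit might look, building on the OperationSet base class sketched above (class and operation names are illustrative assumptions):

```python
class OperationSetRegistry:
    """Hypothetical operation set registration unit inside the manager."""

    def __init__(self) -> None:
        self._registered: dict[str, type] = {}

    def register(self, op_cls: type) -> type:
        """Register a user-edited operation set; once registered it counts
        as one of the predefined operation sets."""
        self._registered[op_cls.__name__] = op_cls
        return op_cls  # usable as a decorator on uploaded operation sets

    def lookup(self, name: str) -> type:
        return self._registered[name]


registry = OperationSetRegistry()


@registry.register
class ImageAcquisition(OperationSet):
    """Illustrative user-edited operation set."""
    input_type = str      # e.g. a camera address
    output_type = bytes   # e.g. a captured frame

    def start(self, data: str) -> bytes:
        return b""  # placeholder capture logic
```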
After the operation sets are constructed, the manager 501 may provide their detailed information to the orchestrator 503 and the scheduler 502 through a storage unit such as a database or a table. The orchestrator 503 analyzes the requirements against the operation sets stored in the system, determines the operation sets that need to be used, and combines them into an operation flow; the combination methods include, but are not limited to, sequential combination, conditional combination, loop combination, and jump combination.
In some embodiments, the orchestrator 503 builds operation flows based on DAGs (directed acyclic graphs). FIG. 5 is an exemplary graphical interface of the orchestrator 503. The graphical interface includes a plurality of functional areas, of which the operation set view area 601, the drawing area 602, the operation flow display area 603, and the code display area 604 are the main areas involved in operation flow construction. The operation set view area 601 displays the predefined operation sets; each icon 6011 corresponds to a predefined operation set, and clicking an icon 6011 displays, to the right of the icon, the detailed information of the operation set, including the start interface, the function definition and call, and the transfer data types. The drawing area 602 is used to draw and display a directed acyclic graph of operation sets, the operation flow display area 603 shows the operation flow corresponding to the current directed acyclic graph (the figure uses the operation flow of a neural network model as an example), and the code display area 604 shows the source code corresponding to the current directed acyclic graph. In use, research and development staff first determine the required operation sets according to the requirements, then drag the corresponding icons into the drawing area 602 and establish directed connection lines between the operation sets to form a directed acyclic graph; the operation flow corresponding to the directed acyclic graph, and the source code generated from it, are displayed in the operation flow display area 603 and the code display area 604. Developers can also consult the displayed operation flow and source code and then modify the directed acyclic graph in turn. Such a graphical interface helps developers build operation flows visually.
In other embodiments, the orchestrator 503 does not construct operation flows by drawing a directed acyclic graph. For example, the orchestrator 503 provides a textual operation flow editing interface: research and development personnel map the implementation of the requirement onto a combination of operation sets, display that combination on the editing interface, and then adjust it there to obtain the operation flow. In this process, developers also need to check that the data types of adjacent operation sets match. For example, if the output data type of the previous operation set is a floating point type and the input data type of the next operation set is an integer type, a type conversion operation set needs to be inserted between the adjacent operation sets to convert the floating point data into integer data.
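The sketch below illustrates the idea under the same assumptions as the earlier OperationSet sketch: the flow is a directed acyclic graph of operation sets, and a conversion operation set is inserted when the transfer data types of adjacent nodes do not match. The converter lookup is simplified here to a single float-to-int case:

```python
class FloatToInt(OperationSet):
    """Illustrative type conversion operation set."""
    input_type = float
    output_type = int

    def start(self, data: float) -> int:
        return int(data)


class OperationFlow:
    """Operation flow as a directed acyclic graph of operation sets."""

    def __init__(self) -> None:
        self.nodes: list[OperationSet] = []
        self.edges: list[tuple[int, int]] = []  # (upstream, downstream)

    def add(self, op: OperationSet) -> int:
        self.nodes.append(op)
        return len(self.nodes) - 1

    def connect(self, src: int, dst: int) -> None:
        up, down = self.nodes[src], self.nodes[dst]
        if up.output_type is down.input_type:
            self.edges.append((src, dst))
        elif (up.output_type, down.input_type) == (float, int):
            # Mismatched adjacent types: insert a conversion operation set.
            # A full orchestrator would look converters up by type pair.
            conv = self.add(FloatToInt())
            self.edges += [(src, conv), (conv, dst)]
        else:
            raise TypeError("no converter for this type pair")
```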
The scheduler 502 is responsible for distributing the operation flow across the computing systems, in one of two ways. The first is to distribute the operation flow as a whole to one computing system to run; this is simpler to process but cannot fully utilize the hardware performance of the computing systems. The second is to divide the operation flow into a plurality of sub-operation flows and distribute them to several computing systems to run, i.e., the operation flow is spread over different computing systems for execution; this makes full use of hardware performance but is relatively complex to process, because the data dependencies among the sub-operation flows must be considered. In the embodiment that divides the operation flow into multiple sub-operation flows, the scheduler 502 determines whether the amount of data input and output between adjacent operation sets exceeds a set threshold: if it does not, the adjacent operation sets may be divided into two sub-operation flows; if it does, the adjacent operation sets are divided into the same sub-operation flow.
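A sketch of that splitting rule follows; the threshold value, data structures, and volume estimates are all assumptions for illustration. Adjacent operation sets joined by a heavy edge stay in the same sub-operation flow, while light edges are candidate cut points:

```python
DATA_VOLUME_THRESHOLD = 1 << 20  # 1 MiB; an assumed, illustrative threshold


def split_into_subflows(ops: list, edge_volume: dict) -> list:
    """ops: operation sets in execution order; edge_volume[(i - 1, i)] is
    the number of bytes exchanged between adjacent operation sets."""
    subflows, current = [], [ops[0]]
    for i in range(1, len(ops)):
        if edge_volume.get((i - 1, i), 0) > DATA_VOLUME_THRESHOLD:
            current.append(ops[i])    # heavy edge: keep on one system
        else:
            # Light edge: a cheap place to cut (a real scheduler would cut
            # only where the load information requires it).
            subflows.append(current)
            current = [ops[i]]
    subflows.append(current)
    return subflows
```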
The distributed management system 50 described above can be used to construct an AI system, i.e., an application system equipped with a neural network model. Such an application system handles requirements 60 such as video processing, object detection, and face recognition. Specifically, the operation sets are first divided into two broad categories at the manager: the operation sets used by the neural network model, called neural network model operation sets, and the operation sets used by the remaining modules, called non-neural network model operation sets. As shown in the figure, the non-neural network operation sets include an image processing operation set, a data encryption operation set, a video encoding and decoding operation set, a data storage operation set, and an image acquisition operation set, while the neural network model operation set is the model operation set. Research and development personnel can then collect information on the hardware execution units of the acceleration unit, collect the network structure information of existing neural network models, refine this information, and define one or more neural network model operation sets; they can also collect the requirements of enterprises and individuals on neural network applications, extract one or more functional units from those requirements, and form them into non-neural network model operation sets. Given a specific requirement, the orchestrator 503 determines the network structure of the neural network model to be used (preferably one that has already been trained and verified), constructs the operation flow corresponding to that network structure according to the correspondence between each node in the network structure and an operation set, then constructs, according to the requirement, the operation flow of the input/output modules that provide the input data of the neural network model and process its output data, and finally combines the operation flows for output. Of course, the neural network operation flow and the non-neural network operation flow may also be executed separately on different computing systems during the execution phase: the neural network operation flow mainly performs neural network model computation, while the non-neural network operation flow mainly performs data collection, preprocessing, transmission, and similar operations.
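As a hedged illustration of that decomposition, a face recognition requirement might be assembled as follows; every operation set name here is hypothetical, not taken from the patent:

```python
def build_face_recognition_flow() -> list:
    """Combine non-neural I/O flows with a neural network model flow."""
    input_flow = ["image_acquisition", "video_decode", "image_processing"]
    model_flow = ["face_detection_model", "face_recognition_model"]
    output_flow = ["data_encryption", "data_storage"]
    # Combined for output; at run time the scheduler may still place the
    # model flow and the non-neural flows on different computing systems.
    return input_flow + model_flow + output_flow
```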
The development host 40 represents the host configuration used by the developer, including but not limited to a graphical development environment, web-based remote management, and SSH-based remote management.
FIG. 6 illustrates the data flow relationships of the manager 501, scheduler 502, orchestrator 503, and the computing systems. The orchestrator 503 obtains the predefined operation sets from the manager 501 and constructs an operation flow according to the requirements. The scheduler 502 receives the operation flow and the current load information of computing systems 1-3 and assigns the operation flow to one or more of them. The current load information reflects the usage of the various resources on a computing system, including hardware devices such as the central processing unit, acceleration units, memory, and input-output devices. For example, computing system 1 reports to the scheduler 502 that it has 8 acceleration units, 4 of which are currently running neural network models; computing system 1 may also report the usage of the hardware execution units inside each acceleration unit, for example, 8 convolution operators of which 4 are occupied. For the central processing unit, an index of total processor utilization across all cores may be reported; if the index exceeds a set threshold, the central processing unit is considered busy. For storage, the total amount of physical memory occupied by all currently running processes may be reported, and the scheduler 502 measures how busy or free it is accordingly; if the index exceeds the set threshold, the physical memory is considered busy. The scheduler 502 evaluates the various metrics reported by the computing systems to determine whether to deploy the operation flow as a whole or to split it into multiple sub-operation flows first.
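A sketch of what such a load report might contain, with assumed field names and thresholds (the patent does not fix a report format):

```python
from dataclasses import dataclass


@dataclass
class LoadReport:
    """Assumed shape of a computing system's current load information."""
    accel_units_total: int = 8
    accel_units_busy: int = 4
    conv_operators_total: int = 8   # hardware execution units per accel unit
    conv_operators_busy: int = 4
    cpu_utilization: float = 0.35   # total utilization across all cores
    mem_utilization: float = 0.50   # physical memory held by running processes
    io_utilization: float = 0.20    # input/output device usage

    def is_busy(self, threshold: float = 0.8) -> bool:
        # A resource is treated as busy once its index exceeds a set
        # threshold, as described above.
        return (self.cpu_utilization > threshold
                or self.mem_utilization > threshold)

    def free_accel_units(self) -> int:
        return self.accel_units_total - self.accel_units_busy
```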
For AI systems equipped with neural network models, because the neural network models run primarily on the acceleration units, the scheduler 502 may allocate the operation flows according to acceleration unit usage: for example, allocating the operation flow as a whole to the computing system containing the most unused acceleration units, or deploying the data preprocessing operation flow and the neural network model operation flow on at least two computing systems.
When the scheduler 502 determines, according to the current load information of each computing system, that no computing system is suitable for independently running the operation flow, it decomposes the operation flow into a plurality of sub-operation flows and deploys them according to the current load information. When an operation flow is deployed across computing systems, a data communication cost for input and output arises: for example, when the output data of operation set 1 deployed on computing system 1 is the input data of operation set 2 deployed on computing system 2, the output data of operation set 1 must be transferred from computing system 1 to computing system 2. This resource overhead can impact the overall execution efficiency of the system. In this case, the amount of data input and output between operation sets executed on different computing systems can be calculated; if it exceeds a set threshold, the flow is re-divided, or the adjacent operation sets whose input/output data amount exceeds the threshold are merged to execute on the same computing system.
It should be understood that the network structure may be one in a private environment built by an enterprise, or one in a cloud environment, such as a data center. The scheduler 502 also needs to take into account the physical distribution of the various computing systems when distributing operation sets to run on the computing systems of a data center. A data center typically includes a plurality of computing system clusters located in different geographical areas; when the scheduler 502 allocates an operation set to the computing systems of a data center, a zone needs to be selected first, and the operation set is then allocated to run on a computing system in the cluster of the selected zone.
The distributed management system provided by the embodiment of the disclosure is not only suitable for constructing an AI system equipped with a neural network model, but also suitable for other application systems, and the distributed management system can construct an operation flow according to requirements and deploy the operation flow according to load information, thereby realizing rapid construction and deployment of the application system. For example, for an AI system, after various available operation sets are defined, a desired AI system can be built in a manner similar to building blocks, so that the AI system can be realized with less cost and higher performance.
The distributed management system also provides an integrated development deployment environment comprising development verification, product integration and product deployment. Specifically, a developer may build a hardware platform (e.g., an embedded platform or an AI platform) according to a requirement, then construct an operation flow of the application system according to the requirement by using the distributed management system, repeatedly test and verify the operation flow of the application system on the distributed management system, and finally deploy the operation flow passing the test and verification to a computing system in an actual environment by using the distributed management system.
Correspondingly, the embodiment of the disclosure also provides a management system. The management system comprises the first computing system and the second computing system. The distributed management system runs on a first computing system. Optionally, the first computing system and the second computing system form a cloud environment, the first computing system runs in a cloud, and the second computing system executes edge computing on an edge side. The edge side refers to the side close to the source of the object or data, and the computing system performing the edge calculation is located between the physical entity and the industrial connection, or is located on the top of the physical entity, so as to meet the basic requirements in the aspects of real-time, intelligence, security, privacy protection and the like. Further, the distributed management system deploys the operation flow of the constructed AI system, which may be a lightweight or simplified version of a neural network model, to the edge side for execution.
FIG. 7 shows a flow diagram of a management method according to one embodiment of the present disclosure. The method specifically comprises the following steps.
Step S110 determines an operation flow composed of operation sets according to the requirement. The system includes a plurality of predefined operation sets whose details are stored in a storage unit. To ensure the validity of operation sets, the detailed information of an edited operation set is stored in the storage unit through a registration operation; when an operation flow is orchestrated, the detailed information of the operation sets is read from the storage unit to decide which operation sets to use. In other words, only registered operation sets can be used to orchestrate an operation flow.
When a system equipped with a neural network model needs to be built, the network structure of the neural network model to be used is determined according to the requirement, and the operation flow is built using the neural network operation sets and the non-neural network operation sets. Optionally, the operation flow of the neural network model is constructed first; once it is determined, the input and output types of the neural network model are known. At that point, input data conforming to the model's input type must be provided, and the model's output data must be processed into data meeting the requirement: that is, an operation flow providing the model's input data and an operation flow processing the model's output data are each constructed from the non-neural network model operation sets. The operation flow of the system equipped with the neural network model is then complete.
Step S120 acquires the current load information of a plurality of computing systems. The load information covers hardware such as the acceleration units, processor, memory, and input and output devices, and the corresponding software. Each computing system can report its current load information.
Step S130 determines whether there is a computing system suited to run the operation flow independently. Specifically, based on the load information of each computing system, it is judged whether such a computing system exists, and the result determines whether to run the operation flow on one computing system or to divide it into a plurality of sub-operation flows. A computing system suited to independently run the operation flow is, for example, one with idle acceleration units and low processor and memory usage. If such a computing system exists, S160 is performed; otherwise, S140 is performed.
Step S140 decomposes the operation flow into a plurality of sub-operation flows. For example, for the operation sets in the operation flow, the amount of data input and output between adjacent operation sets is calculated; if it exceeds a set threshold, the adjacent operation sets are divided into the same sub-operation flow, and otherwise into different sub-operation flows. As another example, the number of sub-operation flows and the computing system suited to each sub-operation flow are determined according to the current load information of each computing system, and the division is then performed.
Step S150 distributes the plurality of sub-operation flows to a plurality of computing systems for execution, for example by deploying each sub-operation flow onto the adapted computing system determined at the time of partitioning.
Step S160 assigns the operation flow to a computing system to run.
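Pulling steps S110-S160 together, the following is a minimal sketch of the decision loop, reusing the LoadReport and split_into_subflows sketches above and assuming each computing system object exposes report_load() and run() (an assumed interface, not the patent's API):

```python
def schedule(flow: list, systems: dict, edge_volume: dict) -> None:
    """systems maps a name to an object with report_load() and run()."""
    loads = {name: s.report_load() for name, s in systems.items()}   # S120
    idle = [name for name, r in loads.items()
            if not r.is_busy() and r.free_accel_units() > 0]         # S130
    if idle:
        systems[idle[0]].run(flow)                                   # S160
        return
    subflows = split_into_subflows(flow, edge_volume)                # S140
    names = list(systems)
    for i, sub in enumerate(subflows):                               # S150
        systems[names[i % len(names)]].run(sub)  # round-robin placement
```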
The embodiment of the disclosure provides a distributed management system, which constructs an operation flow according to requirements, and determines whether to decompose and deploy the operation flow according to load information, so as to fully utilize the software and hardware performance of a computing system.
In some embodiments, when the operation flow is constructed, a directed acyclic graph is built on a graphical interface by dragging icons corresponding to the predefined operation sets displayed on that interface, and the operation flow is then produced from the directed acyclic graph.
In some embodiments, distributing the operation flow to run on the respective computing systems comprises: calculating whether the amount of data input and output between adjacent operation sets exceeds a set threshold; when it exceeds the threshold, dividing the adjacent operation sets to run on the same computing system, and otherwise dividing them to execute on different computing systems.
In some embodiments, the current load information for the computing system includes the following information:
occupancy of hardware execution units in the acceleration unit;
central processor and memory usage;
usage of input and output devices.
Commercial value of the disclosed embodiments
AI systems equipped with neural network models perform well in target detection, face recognition, natural language processing, and similar applications. Taking face recognition as an example: video surveillance is collected through cameras, face images are recognized by a neural network model and compared with faces stored in the cloud, and criminals appearing in the surveillance video can thus be identified. In the speech recognition field, a neural network model performs speech recognition to realize simultaneous interpretation. These application scenarios can bring great commercial benefit. The distributed management system provided by the embodiments of the present disclosure can help research and development personnel quickly construct such AI systems, and therefore has broad application prospects and economic value.
As will be appreciated by one skilled in the art, the present disclosure may be embodied as systems, methods, and computer program products. Accordingly, the present disclosure may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code), or in a combination of software and hardware. For example, the present disclosure may be implemented as a terminal device comprising a memory and a processor, where the memory stores computer instructions executable by the processor, and the computer instructions, when executed, implement the distributed management method provided by the embodiments of the present disclosure.
Furthermore, in some embodiments, the present disclosure may also be embodied in the form of a computer program product in one or more computer-readable media having computer-readable program code embodied therein.
Any combination of one or more computer-readable media may be employed. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium is, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical memory, a magnetic memory, or any suitable combination of the foregoing. In this context, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., and any suitable combination of the foregoing.
Computer program code for carrying out embodiments of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as C. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The above description covers only preferred embodiments of the present disclosure and is not intended to limit it; those skilled in the art may make various modifications and changes to the present disclosure. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within its protection scope.

Claims (22)

1. A distributed management system for a first computing system, the first computing system and at least one second computing system forming a task distribution network, the distributed management system comprising a manager, an orchestrator, and a scheduler, wherein,
the manager is used for organizing and managing a predefined operation set, and the operation set is used for completing a specified function;
the orchestrator is used for determining the operation flow composed of the operation sets according to requirements;
the scheduler is used for distributing the operation flow to one of the at least one second computing system to run according to the current load information of the at least one second computing system.
2. The distributed management system of claim 1, wherein the scheduler is further configured to split the operation flow into a plurality of sub-operation flows and distribute the plurality of sub-operation flows to run on ones of the plurality of second computing systems according to current load information of the plurality of second computing systems.
3. The distributed management system of claim 2, wherein the scheduler decomposes the operation flow into the plurality of sub-operation flows upon determining that none of the plurality of second computing systems is adapted to independently run the operation flow according to current load information of the plurality of second computing systems.
4. The distributed management system of claim 2 or 3, wherein the scheduler determines whether an amount of data input and output between adjacent operation sets exceeds a set threshold to determine whether to divide the adjacent operation sets in the same sub-operation stream.
5. The distributed management system of claim 1, wherein the manager comprises an operation set registration unit, configured to register an operation set edited by a user and take the registered operation set as the predefined operation set.
6. The distributed management system of claim 5, wherein the orchestrator comprises a graphical interface on which icons corresponding to the predefined operation sets are displayed, so that a user can build a directed acyclic graph on the graphical interface by dragging the icons, and the orchestrator generates the operation flow according to the directed acyclic graph.
7. The distributed management system of claim 6, wherein the graphical interface comprises a view area for displaying the predefined operation sets and a drawing area for drawing the directed acyclic graph and displaying the operation flow corresponding to the directed acyclic graph.
8. The distributed management system of claim 1, wherein the current load information of the second computing system includes the following information:
the occupancy of the hardware execution units inside the acceleration unit;
processor and memory usage;
use of input and output devices.
9. The distributed management system of any of claims 1 to 8, wherein the first computing system and the second computing system are both embedded systems.
10. A distributed management system for a first computing system, the first computing system and at least one second computing system forming a task distribution network, the distributed management system comprising a manager, an orchestrator, and a scheduler,
the manager is used for organizing and managing a predefined operation set, and the operation set is divided into a neural network operation set and a non-neural network operation set;
the orchestrator is used for determining a network structure of a neural network model to be used according to requirements, and constructing an operation flow by adopting the neural network operation set and the non-neural network operation set;
the scheduler is used for distributing the operation flow to one of the at least one second computing system to run according to the current load information of the at least one second computing system.
11. The distributed management system of claim 10, wherein the requirements include video processing, face recognition, and target detection, and the non-neural network operation set is classified into an image processing operation set, a data encryption operation set, a video coding and decoding operation set, a data storage operation set, and an image acquisition operation set.
12. A management system comprising a first computing system and at least one second computing system, the first computing system comprising a manager, an orchestrator, and a scheduler, wherein,
the manager is used for organizing and managing a predefined operation set, and the operation set is used for completing a specified function;
the orchestrator is used for determining the operation flow composed of the operation sets according to requirements;
the scheduler is used for distributing the operation flow to one of the at least one second computing system to run according to the current load information of the at least one second computing system,
and the second computing system, upon receiving the operation flow, executes the operation flow.
13. The management system of claim 12, wherein the first computing system runs in the cloud and the second computing system performs edge computing on the edge side.
14. The management system of claim 13, wherein the operation flow received and executed by the second computing system is an operation flow of a lightweight neural network model.
15. A distributed management method, applied to a first computing system, the first computing system and at least one second computing system forming a task distribution network, the method comprising:
determining an operation flow consisting of operation sets according to requirements, wherein the operation sets are predefined operation sets used for completing specified functions;
and distributing the operation flow to one of the at least one second computing system to run according to the current load information of the at least one second computing system.
16. The distributed management method of claim 15, further comprising: decomposing the operation flow into a plurality of sub-operation flows according to current load information of a plurality of the second computing systems, and distributing the plurality of sub-operation flows to a plurality of the second computing systems to run.
17. The distributed management method of claim 16 wherein said decomposing the operation flow into a plurality of sub-operation flows according to current load information for a plurality of the second computing systems comprises:
judging, according to the current load information of the plurality of second computing systems, whether any second computing system is suitable for independently running the operation flow, and decomposing the operation flow into the plurality of sub-operation flows when none of the plurality of second computing systems is suitable for independently running the operation flow.
18. The distributed management method of claim 16 or 17, wherein said decomposing the operation flow into a plurality of sub-operation flows according to current load information of a plurality of the second computing systems comprises: judging whether the amount of data input and output between adjacent operation sets exceeds a set threshold, so as to determine whether to divide the adjacent operation sets into the same sub-operation flow.
19. The distributed management method of claim 15, wherein the operation sets are divided into a neural network operation set and a non-neural network operation set, and the determining, according to requirements, an operation flow composed of operation sets comprises: determining a network structure of a neural network model to be used according to the requirements, constructing an operation flow of the neural network model using the neural network operation set, and constructing, according to the requirements and using the non-neural network operation set, an operation flow for an input/output module that provides input data to the neural network model and processes output data of the neural network model.
20. The distributed management method of claim 15, further comprising: registering an operation set edited by a user, and taking the registered operation set as the predefined operation set.
21. The distributed management method of claim 15, wherein the current load information of the second computing system includes the following information:
the occupancy of the hardware execution units inside the acceleration unit;
processor and memory usage;
use of input and output devices.
22. A computer readable medium storing computer instructions executable by an electronic device, the computer instructions when executed implementing the distributed management method of any of claims 15 to 21.
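Purely as an illustrative reading of claims 1 and 15, and not as the claimed implementation, the three roles of manager, orchestrator, and scheduler could be sketched as follows; every class and method name here is hypothetical, as is the least-loaded dispatch heuristic.

```python
# Illustrative sketch of the manager/orchestrator/scheduler roles; all
# names are hypothetical and the dispatch logic is an assumption.
from typing import Callable, Dict, List

class Manager:
    """Organizes and manages predefined operation sets."""
    def __init__(self) -> None:
        self._registry: Dict[str, Callable] = {}

    def register(self, name: str, op_set: Callable) -> None:
        # A user-edited operation set, once registered, becomes predefined.
        self._registry[name] = op_set

    def get(self, name: str) -> Callable:
        return self._registry[name]

class Orchestrator:
    """Composes an operation flow from operation sets according to requirements."""
    def __init__(self, manager: Manager) -> None:
        self._manager = manager

    def build_flow(self, names: List[str]) -> List[Callable]:
        return [self._manager.get(n) for n in names]

class Scheduler:
    """Distributes the operation flow according to current load information."""
    def dispatch(self, flow: List[Callable], loads: Dict[str, float]) -> str:
        target = min(loads, key=loads.__getitem__)  # least-loaded system wins
        # ...the flow would then be transmitted to `target` over the task
        # distribution network...
        return target
```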
CN202010233858.8A 2020-03-30 2020-03-30 Distributed management method and system, and management system Pending CN113472550A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010233858.8A CN113472550A (en) 2020-03-30 2020-03-30 Distributed management method and system, and management system

Publications (1)

Publication Number Publication Date
CN113472550A true CN113472550A (en) 2021-10-01

Family

ID=77864903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010233858.8A Pending CN113472550A (en) 2020-03-30 2020-03-30 Distributed management method and system, and management system

Country Status (1)

Country Link
CN (1) CN113472550A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110289200A1 (en) * 2010-05-18 2011-11-24 International Business Machines Corporation Mobile Device Workload Management For Cloud Computing Using SIP And Presence To Control Workload And Method Thereof
US20160057075A1 (en) * 2014-08-20 2016-02-25 At&T Intellectual Property I, L.P. Load Adaptation Architecture Framework for Orchestrating and Managing Services in a Cloud Computing System
CN105630589A (en) * 2014-11-24 2016-06-01 航天恒星科技有限公司 Distributed process scheduling system and process scheduling and execution method
CN106506605A (en) * 2016-10-14 2017-03-15 华南理工大学 A kind of SaaS application construction methods based on micro services framework
CN107885762A (en) * 2017-09-19 2018-04-06 北京百度网讯科技有限公司 Intelligent big data system, the method and apparatus that intelligent big data service is provided
CN108182068A (en) * 2017-12-26 2018-06-19 杭州数梦工场科技有限公司 The generation method of part and device, storage medium are delivered in deployment based on micro services
CN110162414A (en) * 2019-02-01 2019-08-23 腾讯科技(深圳)有限公司 The method and device of artificial intelligence service is realized based on micro services framework
US20190266504A1 (en) * 2019-05-09 2019-08-29 Intel Corporation Using computational cost and instantaneous load analysis for intelligent deployment of neural networks on multiple hardware executors
CN110209472A (en) * 2018-08-29 2019-09-06 腾讯科技(深圳)有限公司 Task data processing method and board
CN110908638A (en) * 2019-10-31 2020-03-24 维沃移动通信有限公司 Operation flow creating method and electronic equipment

Similar Documents

Publication Publication Date Title
Chen et al. DNNOff: offloading DNN-based intelligent IoT applications in mobile edge computing
Abd Elaziz et al. Advanced optimization technique for scheduling IoT tasks in cloud-fog computing environments
CN107885762A (en) Intelligent big data system, the method and apparatus that intelligent big data service is provided
CN111310936A (en) Machine learning training construction method, platform, device, equipment and storage medium
US10585718B2 (en) Hybrid acceleration in a processing environment
AU2018260855A1 (en) Hybrid cloud migration delay risk prediction engine
US11429434B2 (en) Elastic execution of machine learning workloads using application based profiling
EP4198771A1 (en) Data processing method and apparatus, computer readable medium, and electronic device
US11379718B2 (en) Ground truth quality for machine learning models
US20230206132A1 (en) Method and Apparatus for Training AI Model, Computing Device, and Storage Medium
US20220374219A1 (en) Deployment of service
KR20190140801A (en) A multimodal system for simultaneous emotion, age and gender recognition
US10762089B2 (en) Open ended question identification for investigations
Li et al. An intelligent collaborative inference approach of service partitioning and task offloading for deep learning based service in mobile edge computing networks
CN114490116B (en) Data processing method and device, electronic equipment and storage medium
CN110868324A (en) Service configuration method, device, equipment and storage medium
CN113127195B (en) Artificial intelligence analysis vertical solution integrator
CN114237587A (en) Management and control method and system based on IDEA technical service SmartFlow
Zhou et al. Semantic-based discovery method for high-performance computing resources in cyber-physical systems
CN116909748A (en) Computing power resource allocation method and device, electronic equipment and storage medium
CN113472550A (en) Distributed management method and system, and management system
CN115713216A (en) Robot scheduling method and related equipment
US20210358022A1 (en) Machine learning based tiered graphical user interface (gui)
Hsu et al. Multimedia fog computing: Minions in the cloud and crowd
CN110648081A (en) Business modeling method and device for computing system and computer system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination