CN110221910A - Method and apparatus for executing MPI operation - Google Patents

Method and apparatus for executing MPI operation

Info

Publication number
CN110221910A
Authority
CN
China
Prior art keywords
mpi
mpi operation
pod
operating status
executing
Prior art date
Legal status
Granted
Application number
CN201910533277.3A
Other languages
Chinese (zh)
Other versions
CN110221910B (en)
Inventor
蔡卫东
杨金锋
王辉
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910533277.3A
Publication of CN110221910A
Application granted
Publication of CN110221910B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/485 Task life-cycle, e.g. stopping, restarting, resuming execution

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiments of the present application disclose a method and apparatus for executing MPI jobs and relate to the field of cloud computing. One specific embodiment of the method includes: in response to detecting a request to execute an MPI job, recording the current time as the start time of the MPI job and setting the running status of the job to initializing; creating Pods corresponding to the MPI job according to the resource configuration information corresponding to the job; generating a public key and a private key corresponding to the MPI job and mounting them in the Pods corresponding to the job; updating the running status of the MPI job to creating; and executing the MPI job with the Pods corresponding to it and, during execution, updating the running status of the MPI job in real time according to the running statuses of those Pods. This embodiment provides life-cycle control for MPI jobs executed on the k8s platform in cloud computing without requiring the user to write code, which simplifies user operation.

Description

Method and apparatus for executing MPI operation
Technical field
Embodiments of the present application relate to the field of computer technology, and in particular to a method and apparatus for executing MPI jobs.
Background
In the prior art, distributed systems are generally built with virtual-machine technology. Container virtualization, as an alternative to traditional virtualization, has developed rapidly thanks to its efficiency and reliability. Container virtualization isolates the different processes running on a host, so that processes are isolated from one another and from the host operating system and do not interfere with each other. Container cluster management is the technology of managing large numbers of containers across distributed hardware resources. It hides the decentralization that distribution introduces and offers a higher-level view for managing containers, compute, storage, network, and other resources in a unified way; representative examples include Kubernetes and Marathon on Mesos. Kubernetes (k8s) is a container-oriented cluster management platform developed on the basis of Google's Borg cluster manager. Built on container technology, it manages the underlying hosts, network, and storage resources in a unified way and provides mechanisms for application deployment, maintenance, and scaling, so that containerized applications running across machines can be managed easily with k8s.
Batch computing is a distributed cloud service suited to large-scale parallel batch-processing jobs. A job is the smallest unit of batch processing and usually consists of one or more tasks with dependencies between them. An MPI job is a job that follows the MPI protocol. MPI (Message Passing Interface) is a cross-language communication protocol for writing parallel programs and has broad application prospects in fields such as machine learning and big data.
Running MPI jobs in containers on k8s provides effective resource isolation and prevents jobs from interfering with one another. At present, however, the life cycle of an MPI job is mostly managed with hand-written scripts or code.
Summary of the invention
Embodiments of the present application propose a method and apparatus for executing MPI jobs.
In a first aspect, an embodiment of the present application provides a method for executing an MPI job. The method includes: in response to detecting a request to execute an MPI job, recording the current time as the start time of the MPI job and setting the running status of the job to initializing; creating Pods corresponding to the MPI job according to the resource configuration information corresponding to the job; generating a public key and a private key corresponding to the MPI job and mounting them in the Pods corresponding to the job; updating the running status of the MPI job to creating; and executing the MPI job with the Pods corresponding to it and, while the job is executing, updating the running status of the MPI job in real time according to the running statuses of those Pods.
In some embodiments, creating the Pods corresponding to the MPI job according to the resource configuration information includes: reading the resource configuration information from a job description file corresponding to the MPI job; and creating the Pods corresponding to the MPI job according to the read resource configuration information.
In some embodiments, creating the Pods according to the read resource configuration information includes: creating, according to the read resource configuration information, a main-process Pod and worker-process Pods corresponding to the MPI job.
In some embodiments, executing the MPI job with the Pods and updating its running status in real time during execution includes: in response to determining that the main-process Pod and worker-process Pods corresponding to the MPI job have all started successfully, updating the running status of the MPI job to running and executing the job with those Pods; in response to determining that any of the main-process Pod and worker-process Pods has the running status failed, updating the running status of the MPI job to failed; in response to determining that the main-process Pod and worker-process Pods all have the running status succeeded, updating the running status of the MPI job to succeeded; and in response to determining that the time elapsed between the start time of the MPI job and the current time exceeds a preset timeout duration, updating the running status of the MPI job to timed out.
In some embodiments, the method further includes: in response to determining that the running status of the MPI job is succeeded, failed, or timed out, deleting the main-process Pod and worker-process Pods corresponding to the MPI job.
In a second aspect, an embodiment of the present application provides an apparatus for executing an MPI job. The apparatus includes: a start-time recording unit configured to, in response to detecting a request to execute an MPI job, record the current time as the start time of the MPI job and set the running status of the job to initializing; a Pod creation unit configured to create Pods corresponding to the MPI job according to the resource configuration information corresponding to the job; a key mounting unit configured to generate a public key and a private key corresponding to the MPI job and mount them in the Pods corresponding to the job; a creating-status update unit configured to update the running status of the MPI job to creating; and an execution and status update unit configured to execute the MPI job with the Pods corresponding to it and, while the job is executing, update the running status of the MPI job in real time according to the running statuses of those Pods.
In some embodiments, the Pod creation unit includes: a reading module configured to read the resource configuration information from a job description file corresponding to the MPI job; and a Pod creation module configured to create the Pods corresponding to the MPI job according to the read resource configuration information.
In some embodiments, the Pod creation module is further configured to create, according to the read resource configuration information, a main-process Pod and worker-process Pods corresponding to the MPI job.
In some embodiments, the execution and status update unit includes: a running-status update module configured to, in response to determining that the main-process Pod and worker-process Pods corresponding to the MPI job have all started successfully, update the running status of the MPI job to running and execute the job with those Pods; a failed-status update module configured to, in response to determining that any of the main-process Pod and worker-process Pods has the running status failed, update the running status of the MPI job to failed; a succeeded-status update module configured to, in response to determining that the main-process Pod and worker-process Pods all have the running status succeeded, update the running status of the MPI job to succeeded; and a timeout-status update module configured to, in response to determining that the time elapsed between the start time of the MPI job and the current time exceeds a preset timeout duration, update the running status of the MPI job to timed out.
In some embodiments, the apparatus further includes: a resource cleanup unit configured to, in response to determining that the running status of the MPI job is succeeded, failed, or timed out, delete the MPI job and the resources associated with it.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; and a storage apparatus storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, implements the method described in any implementation of the first aspect.
With the method and apparatus for executing MPI jobs provided by embodiments of the present application, the time at which a request to execute an MPI job is detected is recorded as the job's start time, and the job's running status is set to initializing; next, Pods corresponding to the MPI job are created according to its resource configuration information; then a public key and a private key corresponding to the job are generated and mounted in those Pods; the running status of the job is then updated to creating; finally, the job is executed with its Pods and, during execution, its running status is updated in real time according to the running statuses of those Pods. By monitoring the running statuses of the Pods corresponding to the MPI job, the life cycle of the job is controlled automatically, and at no point does the user need to write scripts or code for life-cycle control of MPI jobs on the k8s platform, which simplifies user operation.
Description of the drawings
Other features, objects, and advantages of the present application will become more apparent upon reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
Fig. 1 is a diagram of an exemplary system architecture to which an embodiment of the present application may be applied;
Fig. 2 is a flowchart of one embodiment of the method for executing an MPI job according to the present application;
Fig. 3 is a flowchart of another embodiment of the method for executing an MPI job according to the present application;
Fig. 4 is a schematic structural diagram of one embodiment of the apparatus for executing an MPI job according to the present application;
Fig. 5 is a schematic structural diagram of a computer system suitable for implementing the electronic device of embodiments of the present application.
Detailed description of embodiments
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the relevant invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention.
It should be noted that, where no conflict arises, the embodiments of the present application and the features in them may be combined with one another. The present application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method or apparatus for executing an MPI job of the present application may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a k8s cluster 105. The network 104 provides the medium for communication links between the terminal devices 101, 102, 103 and the k8s cluster 105, and may include various connection types, such as wired or wireless communication links or fiber-optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the k8s cluster 105 through the network 104, for example to receive or send messages. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as k8s client applications, web browsers, shopping applications, search applications, instant messaging tools, email clients, and social platform software.
The terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they may be various electronic devices that have a display screen and support k8s client applications, including but not limited to smartphones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, and desktop computers. When they are software, they may be installed in the electronic devices listed above and may be implemented either as multiple pieces of software or software modules (for example, to provide a k8s client service) or as a single piece of software or software module. No specific limitation is imposed here.
The k8s cluster 105 may be a server cluster running k8s, for example a server cluster that supports the k8s client applications displayed on the terminal devices 101, 102, 103. The server cluster may analyze and otherwise process received data such as MPI job execution requests, and feed the processing results (for example, the running status and/or execution result of the MPI job) back to the terminal devices.
It should be noted that the method for executing an MPI job provided by embodiments of the present application is generally executed by the k8s cluster 105; accordingly, the apparatus for executing an MPI job is generally arranged in the k8s cluster 105.
It should also be noted that the k8s cluster 105 may consist of multiple worker machines, each of which may be a physical machine or a virtual machine. By role, worker machines are divided into a master node (Master) and minion nodes (Minion). The Master provides scheduling and control functions and is responsible for managing the whole cluster, while the Minions are the working nodes of the k8s cluster.
It should be understood that the numbers of terminal devices, networks, and servers in the k8s cluster in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers as required by the implementation.
Continuing to refer to Fig. 2, a flow 200 of one embodiment of the method for executing an MPI job according to the present application is shown. The method for executing an MPI job includes the following steps:
Step 201: in response to detecting a request to execute an MPI job, record the current time as the start time of the MPI job and set the running status of the job to initializing.
In this embodiment, the executing body of the method for executing an MPI job (for example, a worker machine in the k8s cluster shown in Fig. 1) may detect in real time whether there is a request to execute an MPI job. If such a request is detected, it may record the current time as the start time of the MPI job and set the running status of the job to initializing.
In practice, the MPI job resource may be predefined with a CustomResourceDefinition (CRD). As an example, CRD code for an MPI job is given below:
In the above code, the CRD defines "MPIJob" as an MPI job, "MPIJobList" as a list of MPI jobs, "MPIJobSpec" as the MPI job description, and "MPIJobLauncherStatusType" as the running status of an MPI job. The running status may take the values {"", "Creating", "Running", "Succeed", "Failed", "Timeout"}, where "" indicates that the MPI job is initializing, "Creating" that it is being created, "Running" that it is running, "Succeed" that it ran successfully, "Failed" that it failed, and "Timeout" that it timed out. "MPIJobStatus" is the status information of the MPI job, which may include the job's running status, its start time, a resource-removal flag indicating whether the resources corresponding to the job have been removed, and the reason for a failed execution. Recording the current time as the start time of the MPI job can be implemented by setting "StartTime" in "MPIJobStatus" to the current time, and setting the running status of the job to initializing can be implemented by setting "MPIJobLauncherStatusType" to "".
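As a rough illustration, the following Python sketch models the status values and status fields described above, together with the bookkeeping of step 201. Only the type names and the status strings come from the description; the Python field and function names (`launcher_status`, `resources_cleaned`, `reason`, `on_execution_request`) are assumptions, and a real operator would more likely declare these as Go types for the CRD.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class MPIJobLauncherStatusType(str, Enum):
    """Running states of an MPI job; the empty string means "initializing"."""
    INITIALIZING = ""
    CREATING = "Creating"
    RUNNING = "Running"
    SUCCEED = "Succeed"
    FAILED = "Failed"
    TIMEOUT = "Timeout"


@dataclass
class MPIJobStatus:
    launcher_status: MPIJobLauncherStatusType = MPIJobLauncherStatusType.INITIALIZING
    start_time: Optional[datetime] = None
    resources_cleaned: bool = False   # resource-removal flag
    reason: str = ""                  # why execution failed, if it did


@dataclass
class MPIJob:
    name: str
    status: MPIJobStatus = field(default_factory=MPIJobStatus)


def on_execution_request(job: MPIJob, now: Optional[datetime] = None) -> None:
    """Step 201: record the start time and mark the job as initializing."""
    job.status.start_time = now if now is not None else datetime.now(timezone.utc)
    job.status.launcher_status = MPIJobLauncherStatusType.INITIALIZING
```

Mixing `str` into the enum lets each member compare equal to the plain string stored in the resource, so `MPIJobLauncherStatusType.CREATING == "Creating"` holds and step 204 reduces to assigning `MPIJobLauncherStatusType.CREATING`.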
Step 202: create Pods corresponding to the MPI job according to the resource configuration information corresponding to the job.
In this embodiment, the executing body may create Pods corresponding to the MPI job according to the job's resource configuration information. Here, a Pod is the basic unit of operation in k8s and the smallest deployable unit that is created, scheduled, and managed. One or more related containers may make up a Pod; generally, the containers in a Pod run the same application. The containers in a Pod run on the same Minion (host) or Master, are managed as a single unit, and share the same volumes, network, namespace, IP space, and ports.
It can be understood that, after creating the Pods corresponding to the MPI job, the executing body may also allocate the corresponding resources to them, such as central processing units (CPUs), memory, and graphics processing units (GPUs).
In some optional implementations of this embodiment, step 202 may be performed as follows:
First, the executing body may read the resource configuration information from the job description file corresponding to the MPI job.
In practice, each MPI job may have a corresponding job description file that stores information about the Pods that must be created to execute the job, for example the number of Pods to create and the resource requirements of each Pod (such as CPU and memory requirements).
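The description does not specify the format of the job description file. Assuming, purely for illustration, a JSON file with a hypothetical `pods` list whose entries give a role, a replica count, and CPU and memory requests, reading the resource configuration information could look like this:

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class PodRequirement:
    role: str       # hypothetical role name, e.g. "launcher" or "worker"
    replicas: int   # number of Pods to create for this role
    cpu: str        # Kubernetes quantity string, e.g. "2"
    memory: str     # Kubernetes quantity string, e.g. "4Gi"


def read_resource_config(description: str) -> List[PodRequirement]:
    """Parse a (hypothetical) JSON job description into per-role Pod requirements."""
    doc = json.loads(description)
    reqs = []
    for entry in doc["pods"]:
        replicas = int(entry.get("replicas", 1))
        if replicas < 1:
            raise ValueError(f"role {entry['role']!r} needs at least one Pod")
        reqs.append(PodRequirement(role=entry["role"], replicas=replicas,
                                   cpu=str(entry["cpu"]), memory=str(entry["memory"])))
    return reqs
```

For example, `{"pods": [{"role": "worker", "replicas": 3, "cpu": "4", "memory": "8Gi"}]}` yields a single requirement for three worker Pods with 4 CPUs and 8 GiB of memory each.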
Then, the executing body may create the Pods corresponding to the MPI job according to the read resource configuration information.
In practice, because an MPI job follows a master-slave scheduling model, one MPI job may include a main process (also called the main task) and at least one worker process (also called a subtask). Creating the Pods corresponding to the MPI job according to the read resource configuration information may therefore include creating a main-process Pod and worker-process Pods corresponding to the job.
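A minimal sketch of turning the read configuration into Pod manifests: one main-process (launcher) Pod plus the requested number of worker-process Pods, each with CPU, memory, and optionally GPU requests. The naming scheme, the label keys, the image name, and the `nvidia.com/gpu` resource name (which assumes the NVIDIA device plugin is installed) are illustrative assumptions; the manifests are plain Python dictionaries that would still have to be submitted to the API server, for example through a Kubernetes client library.

```python
from typing import Dict, List


def container_resources(cpu: str, memory: str, gpus: int = 0) -> Dict[str, Dict[str, str]]:
    """Requests equal to limits, which gives the Pod the Guaranteed QoS class."""
    amounts = {"cpu": cpu, "memory": memory}
    if gpus:
        amounts["nvidia.com/gpu"] = str(gpus)   # assumes the NVIDIA device plugin
    return {"requests": dict(amounts), "limits": dict(amounts)}


def build_pods(job_name: str, image: str, launcher: Dict, worker: Dict,
               workers: int) -> List[Dict]:
    """One main-process (launcher) Pod plus `workers` worker-process Pods."""
    def pod(name: str, role: str, res: Dict) -> Dict:
        return {
            "apiVersion": "v1",
            "kind": "Pod",
            "metadata": {
                "name": name,
                "labels": {"mpi-job-name": job_name, "mpi-job-role": role},
            },
            "spec": {
                "restartPolicy": "Never",   # let Pods reach Succeeded/Failed
                "containers": [{
                    "name": role,
                    "image": image,
                    "resources": container_resources(
                        res["cpu"], res["memory"], res.get("gpus", 0)),
                }],
            },
        }

    pods = [pod(f"{job_name}-launcher", "launcher", launcher)]
    pods += [pod(f"{job_name}-worker-{i}", "worker", worker) for i in range(workers)]
    return pods
```

`restartPolicy: Never` is chosen so that a Pod whose process exits ends in the terminal `Succeeded` or `Failed` phase instead of being restarted, which is what per-Pod status tracking during execution relies on.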
With this optional implementation, a user only needs to upload the job description file of an MPI job to have the job executed and its life cycle managed automatically.
Step 203: generate a public key and a private key corresponding to the MPI job, and mount them in the Pods corresponding to the job.
In this embodiment, the executing body may generate the public key and private key corresponding to the MPI job in various ways, and mount the generated keys in the Pods corresponding to the job.
In practice, the generated public key and private key may be mounted in the Pods corresponding to the MPI job by means of a ConfigMap resource.
As an example, the public key and private key may be those of an SSH (Secure Shell) key pair. The executing body may generate the SSH key pair based on a passphrase. A passphrase is a sequence of characters longer than a typical password (which is usually 4 to 16 characters long), used to create a digital signature or to encrypt and decrypt information; passphrases often reach 100 characters or more. The SSH key material may include a public key, a private key, and an authorized-keys file (authorized_keys) that holds the public key. Note that the passphrase is generally used to encrypt the private key. Here, the passphrase may also be empty, that is, the private key is not additionally encrypted.
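Assuming the key pair has already been generated (for example with `ssh-keygen -t rsa -N "" -f id_rsa`, where `-N ""` sets the empty passphrase mentioned above), the sketch below packages it into a ConfigMap and mounts that ConfigMap into a Pod. The ConfigMap name, the volume name, and the mount path `/root/.ssh` are hypothetical choices. Because every Pod of the job receives the same key pair and an `authorized_keys` file containing its public key, any Pod can log in to any other over SSH without a password.

```python
from typing import Dict

SSH_DIR = "/root/.ssh"   # hypothetical mount path inside every container


def ssh_configmap(job_name: str, private_key: str, public_key: str) -> Dict:
    """ConfigMap holding the job's key pair; authorized_keys carries the public key."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": f"{job_name}-ssh"},
        "data": {
            "id_rsa": private_key,
            "id_rsa.pub": public_key,
            "authorized_keys": public_key,
        },
    }


def mount_ssh_keys(pod: Dict, configmap_name: str) -> Dict:
    """Add the ConfigMap as a volume and mount it in every container of the Pod."""
    spec = pod.setdefault("spec", {})
    spec.setdefault("volumes", []).append({
        "name": "ssh-keys",
        # 0o600: ssh refuses to use a private key that other users can read
        "configMap": {"name": configmap_name, "defaultMode": 0o600},
    })
    for container in spec.get("containers", []):
        container.setdefault("volumeMounts", []).append(
            {"name": "ssh-keys", "mountPath": SSH_DIR, "readOnly": True})
    return pod
```

A Kubernetes Secret is the more usual home for private keys; the ConfigMap here simply follows the description.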
Here, because an MPI job follows a master-slave scheduling model, one MPI job includes a main process (also called the main task) and at least one worker process (also called a subtask). After step 203, every Pod corresponding to the MPI job can access the job's public key and private key, which enables password-free login between the job's main process and worker processes and makes the MPI job easier to run.
Step 204: update the running status of the MPI job to creating.
As an example, continuing with the code shown in step 201, updating the running status of the MPI job to creating may be implemented by setting "MPIJobLauncherStatusType" to "Creating".
Step 205: execute the MPI job with the Pods corresponding to it and, while the job is executing, update the running status of the MPI job in real time according to the running statuses of those Pods.
In this embodiment, the executing body may execute the MPI job with the Pods corresponding to it and, during execution, update the job's running status in real time according to the running statuses of those Pods, thereby managing the life cycle of the MPI job automatically throughout its execution.
It should be noted that, in practice, the MPI job resource may be defined with a CRD, and the above method for executing an MPI job may run in Operator mode on a worker machine of the k8s cluster. Here, the Operator pattern, developed by CoreOS, extends the Kubernetes API with an application-specific controller that creates, configures, and manages complex stateful applications such as databases, caches, and monitoring systems. An Operator is built on the Kubernetes resource and controller concepts but also encodes domain knowledge specific to the application.
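To illustrate the Operator style of running the method, here is a minimal, hypothetical reconcile function: each call inspects the job's current status and performs only the work that is still missing, so it can safely be repeated on every event. `Job`, `FakeCluster`, and the `update_status` callback are illustrative stand-ins, not a real Kubernetes client API, and step 205 is delegated to the callback.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Job:
    name: str
    workers: int
    status: Optional[str] = None      # None: request not yet handled; "": initializing
    start_time: Optional[float] = None


@dataclass
class FakeCluster:
    """Hypothetical in-memory stand-in for the Kubernetes API server."""
    pods: Dict[str, str] = field(default_factory=dict)    # Pod name -> phase
    configmaps: List[str] = field(default_factory=list)


def reconcile(job: Job, cluster: FakeCluster, now: float,
              update_status: Callable[[Job, FakeCluster, float], None]) -> None:
    """One pass of an operator-style control loop; safe to repeat on every event."""
    if job.status is None:                               # step 201
        job.start_time, job.status = now, ""
    if job.status == "":                                 # steps 202-204
        names = [f"{job.name}-launcher"]
        names += [f"{job.name}-worker-{i}" for i in range(job.workers)]
        for name in names:
            cluster.pods.setdefault(name, "Pending")     # create only if missing
        key_cm = f"{job.name}-ssh"
        if key_cm not in cluster.configmaps:             # step 203: key ConfigMap
            cluster.configmaps.append(key_cm)
        job.status = "Creating"                          # step 204
    else:                                                # step 205
        update_status(job, cluster, now)
```

Keeping every step idempotent (create only what is missing, never reset the start time) is what lets a controller recover after a restart simply by reconciling again.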
The method provided by the above embodiment of the present application manages the life cycle of the MPI job automatically by monitoring the running statuses of the Pods corresponding to the job. The whole process requires no scripts or code from the user, so users need not write code for life-cycle control of MPI jobs on the k8s platform, which simplifies user operation.
With further reference to Fig. 3, a flow 300 of another embodiment of the method for executing an MPI job is shown. The flow 300 includes the following steps:
Step 301: in response to detecting a request to execute an MPI job, record the current time as the start time of the MPI job and set the running status of the job to initializing.
Step 302: create Pods corresponding to the MPI job according to the resource configuration information corresponding to the job.
Step 303: generate a public key and a private key corresponding to the MPI job, and mount them in the Pods corresponding to the job.
Step 304: update the running status of the MPI job to creating.
In this embodiment, the specific operations of steps 301, 302, 303, and 304 and the technical effects they bring are essentially the same as those of steps 201, 202, 203, and 204 in the embodiment shown in Fig. 2, and are not repeated here.
It step 305, will in response to determining that host process Pod and progress of work Pod corresponding with MPI operation is successfully started up The operating status of MPI operation is updated in execution, executes MPI operation using each Pod corresponding with MPI operation.
In the present embodiment, for executing the executing subject of the method for MPI operation (such as in k8s cluster shown in FIG. 1 Work machine) it can be in the case where determining that host process Pod and progress of work Pod corresponding with MPI operation is successfully started up, it will The operating status of MPI operation is updated in execution, and executes MPI operation using each Pod corresponding with MPI operation.
Step 306, in response to determining, there are operating statuses in host process Pod and progress of work Pod corresponding with MPI operation For the resource for executing failure, it is updated to the operating status of MPI operation to execute failure.
In the present embodiment, above-mentioned executing subject can determine host process Pod corresponding with MPI operation and the progress of work In the case where being the resource for executing failure there are operating status in Pod, it is updated to the operating status of MPI operation to execute failure.
Step 307, it is in response to the operating status of determining host process Pod and progress of work Pod corresponding with MPI operation It runs succeeded, the operating status of MPI operation is updated to run succeeded.
In the present embodiment, above-mentioned executing subject can determine host process Pod corresponding with MPI operation and the progress of work The operating status of Pod is to be updated to run succeeded by the operating status of MPI operation in the case where running succeeded.
Step 308: in response to determining that the time difference between the current time and the start time of the MPI job exceeds a preset timeout duration, update the status of the MPI job to timed out.
In this embodiment, the executing entity may update the status of the MPI job to timed out upon determining that the difference between the current time and the start time of the MPI job exceeds the preset timeout duration.
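Steps 305 to 308 can be read as a single function that maps the observed Pod statuses and the elapsed time to a job status. The sketch below is illustrative only and rests on several assumptions not stated in the application: Pod statuses are the standard Kubernetes Pod phases (`Pending`, `Running`, `Succeeded`, `Failed`, `Unknown`), a Pod counts as "started successfully" once it is `Running` or already `Succeeded`, and the checks are applied in the order failed, succeeded, timed out, running. The names `JobState` and `derive_job_state` are hypothetical.

```python
from enum import Enum


class JobState(str, Enum):
    """Hypothetical job statuses; names chosen for illustration only."""
    CREATING = "Creating"
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    TIMEOUT = "Timeout"


def derive_job_state(master_phase, worker_phases, start_time, now, timeout_s,
                     current=JobState.CREATING):
    """Map Pod phases plus elapsed time to a job status (assumed precedence)."""
    phases = [master_phase, *worker_phases]
    # Step 306 (assumed first): any failed Pod fails the whole job.
    if any(p == "Failed" for p in phases):
        return JobState.FAILED
    # Step 307: master-process Pod and every worker-process Pod succeeded.
    if all(p == "Succeeded" for p in phases):
        return JobState.SUCCEEDED
    # Step 308: elapsed time since the recorded start time exceeds the timeout.
    if now - start_time > timeout_s:
        return JobState.TIMEOUT
    # Step 305: every Pod has started, so the job is running.
    if all(p in ("Running", "Succeeded") for p in phases):
        return JobState.RUNNING
    # Otherwise keep whatever status the job already had (e.g. creating).
    return current
```

Checking failure before success means that one failed worker overrides the others even if the master-process Pod exits cleanly. A different precedence is equally compatible with the description.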
Steps 305 to 308 make it possible to determine accurately whether the MPI job is running, succeeded, failed, or timed out.
Step 309: in response to determining that the status of the MPI job is succeeded, failed, or timed out, delete the MPI job and the resources associated with it.
In this embodiment, upon determining that the status of the MPI job is succeeded, failed, or timed out, the executing entity may delete the MPI job and the resources associated with the MPI job.
In practice, the executing entity may clean up resources using the owner reference mechanism. When an MPI job is submitted, a custom MPI job resource is generated, and this resource carries a resource identifier that identifies it. The MPI job resource includes a master-process Pod and at least one subprocess (worker) Pod, and the owner reference of each of these Pods is set to the resource identifier of the MPI job resource. When the status of the MPI job is determined to be succeeded, failed, or timed out, the executing entity can clean up automatically by deleting the corresponding MPI job. The k8s garbage collection (GC) mechanism then finds all remaining resources that depend on this MPI job resource and deletes them in cascade.
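The owner reference mechanism can be illustrated with plain manifest dictionaries. The sketch below builds Pod manifests whose `ownerReferences` entry points at the MPI job resource. Kubernetes matches owners by UID, so the UID is the "resource identifier" here. The sketch then models cascading deletion with a toy in-memory garbage collector. In a real cluster, the garbage collector in kube-controller-manager performs the cascade. The `apiVersion`/`kind` used for the MPI job in the usage note, all UIDs, and the helper names `owner_reference`, `make_pod`, and `cascade_delete` are hypothetical.

```python
def owner_reference(owner):
    """Build one ownerReferences entry that points at `owner`."""
    meta = owner["metadata"]
    return {
        "apiVersion": owner["apiVersion"],
        "kind": owner["kind"],
        "name": meta["name"],
        "uid": meta["uid"],  # Kubernetes matches owners by UID, not by name
        "controller": True,
        "blockOwnerDeletion": True,
    }


def make_pod(name, owner):
    """Pod manifest owned by the MPI job resource (UID is a placeholder)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "uid": f"uid-{name}",  # in a real cluster the API server assigns this
            "ownerReferences": [owner_reference(owner)],
        },
    }


def cascade_delete(objects, root_uid):
    """Toy model of cascading deletion: remove root_uid and, transitively,
    every object whose ownerReferences point at a removed object."""
    to_visit, doomed = [root_uid], set()
    while to_visit:
        uid = to_visit.pop()
        if uid in doomed or uid not in objects:
            continue
        doomed.add(uid)
        for other_uid, obj in objects.items():
            refs = obj["metadata"].get("ownerReferences", [])
            if any(ref["uid"] == uid for ref in refs):
                to_visit.append(other_uid)
    for uid in doomed:
        del objects[uid]
    return doomed
```

For example, the job could be represented as `{"apiVersion": "kubeflow.org/v1", "kind": "MPIJob", "metadata": {"name": "demo", "uid": "uid-job"}}` (an assumed group/version) with Pods `demo-launcher` and `demo-worker-0` created through `make_pod`. Calling `cascade_delete(objects, "uid-job")` then removes the job and both Pods, and leaves any unrelated objects untouched.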
As can be seen from FIG. 3, compared with the embodiment corresponding to FIG. 2, the process 300 of the method for executing an MPI job in this embodiment updates the status of the MPI job according to the statuses of the Pods corresponding to the MPI job and explicitly distinguishes the statuses running, failed, succeeded, and timed out. It also adds a step of automatic resource cleanup. The solution described in this embodiment can therefore determine accurately whether an MPI job is running, succeeded, failed, or timed out. It also cleans up resources automatically without requiring the user to write scripts or code, thereby reducing resource consumption.
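Taken together with steps 201 to 205, the statuses described above form a small state machine: initialization, then creating, then running, then one of three terminal statuses that trigger cleanup. The sketch below encodes one possible set of allowed transitions. It permits a direct jump from creating to a terminal status in case the Pods finish between two observations, and it treats terminal statuses as final. Both choices are assumptions, not requirements taken from the application. `JobState`, `ALLOWED`, and `MPIJobRecord` are hypothetical names.

```python
from datetime import datetime, timezone
from enum import Enum


class JobState(str, Enum):
    INIT = "Init"
    CREATING = "Creating"
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    TIMEOUT = "Timeout"


TERMINAL = {JobState.SUCCEEDED, JobState.FAILED, JobState.TIMEOUT}

# Assumed transition table; terminal statuses have no outgoing edges.
ALLOWED = {
    JobState.INIT: {JobState.CREATING, JobState.FAILED},
    JobState.CREATING: {JobState.RUNNING} | TERMINAL,
    JobState.RUNNING: set(TERMINAL),
}


class MPIJobRecord:
    def __init__(self, name, now=None):
        self.name = name
        # Record the start time when the execution request is detected,
        # and begin in the initialization status.
        self.start_time = now if now is not None else datetime.now(timezone.utc)
        self.state = JobState.INIT

    def transition(self, new_state):
        """Move to new_state; returns False for a no-op, raises if illegal."""
        if new_state == self.state:
            return False
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state.value} -> {new_state.value}")
        self.state = new_state
        return True

    def needs_cleanup(self):
        # Step 309: succeeded, failed, and timed out all trigger deletion.
        return self.state in TERMINAL
```

Rejecting transitions out of a terminal status prevents a late Pod event from reviving a job that has already been scheduled for cleanup.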
With further reference to FIG. 4, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for executing an MPI job. This apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus may be applied in various electronic devices.
As shown in FIG. 4, the apparatus 400 for executing an MPI job of this embodiment includes a start time recording unit 401, a Pod creation unit 402, a public key and private key mounting unit 403, a creating-state updating unit 404, and an execution and status updating unit 405. The start time recording unit 401 is configured to, in response to detecting an execution request for an MPI job, record the current time as the start time of the MPI job and set the status of the MPI job to the initialization state. The Pod creation unit 402 is configured to create Pods corresponding to the MPI job according to the resource configuration information corresponding to the MPI job. The public key and private key mounting unit 403 is configured to generate a public key and a private key corresponding to the MPI job and mount them into the Pods corresponding to the MPI job. The creating-state updating unit 404 is configured to update the status of the MPI job to creating. The execution and status updating unit 405 is configured to execute the MPI job using the Pods corresponding to the MPI job and, while the MPI job is executing, update the status of the MPI job in real time according to the statuses of those Pods.
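The application does not say how unit 403 mounts the key pair. One common approach in Kubernetes is to store the keys in a Secret and mount that Secret as a volume in every Pod of the job, so that the MPI launcher can reach the workers over SSH. The sketch below assumes this approach. The key material is passed in as strings, because key generation (for example with `ssh-keygen`) is out of scope here. The Secret name suffix, the key file names, the mount path `/root/.ssh`, and the helper names are all hypothetical.

```python
import base64


def ssh_secret(job_name, private_key, public_key):
    """Secret manifest holding the job's key pair (names are assumptions)."""
    def b64(text):
        # Values under a Secret's `data` field must be base64-encoded strings.
        return base64.b64encode(text.encode()).decode()

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": f"{job_name}-ssh"},
        "type": "Opaque",
        "data": {
            "id_rsa": b64(private_key),
            "id_rsa.pub": b64(public_key),
            # Same public key everywhere, so every Pod trusts every other Pod.
            "authorized_keys": b64(public_key),
        },
    }


def mount_secret(pod, secret_name, mount_path="/root/.ssh"):
    """Add a Secret volume to the Pod and mount it into each container."""
    spec = pod.setdefault("spec", {})
    spec.setdefault("volumes", []).append({
        "name": "ssh-keys",
        # 0o600 (decimal 384): SSH refuses private keys readable by others.
        "secret": {"secretName": secret_name, "defaultMode": 0o600},
    })
    for container in spec.get("containers", []):
        container.setdefault("volumeMounts", []).append(
            {"name": "ssh-keys", "mountPath": mount_path, "readOnly": True}
        )
    return pod
```

Because one Secret holds a single key pair shared by the master-process and worker-process Pods, only one key pair per MPI job needs to be generated.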
In this embodiment, for the specific processing performed by the start time recording unit 401, the Pod creation unit 402, the public key and private key mounting unit 403, the creating-state updating unit 404, and the execution and status updating unit 405 of the apparatus 400, and for the technical effects they bring, reference may be made to the descriptions of steps 201, 202, 203, 204, and 205, respectively, in the embodiment corresponding to FIG. 2. Details are not repeated here.
In some optional implementations of this embodiment, the Pod creation unit 402 may include: a reading module 4021, configured to read the resource configuration information from the task description file corresponding to the MPI job; and a Pod creation module 4022, configured to create the Pods corresponding to the MPI job according to the read resource configuration information.
In some optional implementations of this embodiment, the Pod creation module 4022 may be further configured to create a master-process Pod and worker-process Pods corresponding to the MPI job according to the read resource configuration information.
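The application does not specify the format of the task description file. The sketch below assumes a JSON document with an `image` field, a `master` section, and a `worker` section with a replica count. From these it produces one master-process Pod manifest and the requested number of worker-process Pod manifests, copying the requested resources into both `requests` and `limits`. The schema, the label keys, the container name, and the function name `pods_from_task_description` are hypothetical.

```python
import json


def pods_from_task_description(job_name, text):
    """Parse an (assumed) JSON task description into master/worker Pod specs."""
    desc = json.loads(text)

    def pod(name, role, resources):
        return {
            "apiVersion": "v1",
            "kind": "Pod",
            "metadata": {"name": name, "labels": {"mpi-job": job_name, "role": role}},
            "spec": {
                "containers": [{
                    "name": "mpi",
                    "image": desc["image"],
                    # Copies so requests and limits are independent dicts.
                    "resources": {"requests": dict(resources), "limits": dict(resources)},
                }],
            },
        }

    master = pod(f"{job_name}-master", "master", desc["master"]["resources"])
    workers = [
        pod(f"{job_name}-worker-{i}", "worker", desc["worker"]["resources"])
        for i in range(desc["worker"]["replicas"])
    ]
    return master, workers
```

For example, a description such as `{"image": "mpi:latest", "master": {"resources": {"cpu": "1"}}, "worker": {"replicas": 2, "resources": {"cpu": "4", "nvidia.com/gpu": "1"}}}` yields `demo-master` plus `demo-worker-0` and `demo-worker-1` for a job named `demo`.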
In some optional implementations of this embodiment, the execution and status updating unit 405 may include the following modules. A running-state updating module 4051 is configured to, in response to determining that the master-process Pod and the worker-process Pods corresponding to the MPI job have started successfully, update the status of the MPI job to running and execute the MPI job using the Pods corresponding to the MPI job. A failed-state updating module 4052 is configured to, in response to determining that any of the master-process Pod and the worker-process Pods corresponding to the MPI job has a status of failed, update the status of the MPI job to failed. A succeeded-state updating module 4053 is configured to, in response to determining that the statuses of the master-process Pod and the worker-process Pods corresponding to the MPI job are succeeded, update the status of the MPI job to succeeded. A timeout-state updating module 4054 is configured to, in response to determining that the time difference between the current time and the start time of the MPI job exceeds a preset timeout duration, update the status of the MPI job to timed out.
In some optional implementations of this embodiment, the apparatus may further include a resource cleanup unit 406, configured to, in response to determining that the status of the MPI job is succeeded, failed, or timed out, delete the MPI job and the resources associated with the MPI job.
It should be noted that, for the implementation details and technical effects of each unit in the apparatus for executing an MPI job provided by the embodiments of the present application, reference may be made to the descriptions of other embodiments in the present application. Details are not repeated here.
Referring now to FIG. 5, a schematic structural diagram is shown of a computer system 500 of an electronic device suitable for implementing the embodiments of the present application. The electronic device shown in FIG. 5 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in FIG. 5, the computer system 500 includes a central processing unit (CPU) 501. The CPU 501 can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage section 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the system 500. The CPU 501, the ROM 502, and the RAM 503 are connected to one another via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
The following components are connected to the I/O interface 505:
- an input section 506 including a keyboard, a mouse, and the like;
- an output section 507 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like;
- a storage section 508 including a hard disk and the like; and
- a communication section 509 including a network interface card such as a local area network (LAN) card or a modem.

The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read from it can be installed into the storage section 508 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. When the computer program is executed by the central processing unit (CPU) 501, the above-described functions defined in the methods of the present application are performed. It should be noted that the computer-readable medium described in the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
More specific examples of computer-readable storage media may include, but are not limited to, the following:
- an electrical connection having one or more wires;
- a portable computer diskette;
- a hard disk;
- a random access memory (RAM);
- a read-only memory (ROM);
- an erasable programmable read-only memory (EPROM or flash memory);
- an optical fiber;
- a portable compact disc read-only memory (CD-ROM);
- an optical storage device;
- a magnetic storage device; or
- any suitable combination of the above.

In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. A computer-readable signal medium, in turn, may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless links, electrical wires, optical cables, RF, or any suitable combination of the above.
Computer program code for performing the operations of the present application may be written in one or more programming languages or combinations thereof. These include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute in any of the following ways:
- entirely on the user's computer;
- partly on the user's computer;
- as a stand-alone software package;
- partly on the user's computer and partly on a remote computer; or
- entirely on a remote computer or server.

Where a remote computer is involved, it may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN). Alternatively, it may be connected to an external computer, for example through the Internet using an Internet service provider.
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that shown in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application may be implemented in software or in hardware. The described units may also be provided in a processor. For example, the processor may be described as including a start time recording unit, a Pod creation unit, a public key and private key mounting unit, a creating-state updating unit, and an execution and status updating unit. In some cases, the names of these units do not limit the units themselves. For example, the start time recording unit may also be described as "a unit that, in response to detecting an execution request for an MPI job, records the current time as the start time of the MPI job and determines that the status of the MPI job is the initialization state".
In another aspect, the present application further provides a computer-readable medium. The medium may be included in the apparatus described in the above embodiments, or it may exist separately without being assembled into that apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to:
- in response to detecting an execution request for an MPI job, record the current time as the start time of the MPI job and determine that the status of the MPI job is the initialization state;
- create Pods corresponding to the MPI job according to the resource configuration information corresponding to the MPI job;
- generate a public key and a private key corresponding to the MPI job and mount them into the Pods corresponding to the MPI job;
- update the status of the MPI job to creating; and
- execute the MPI job using the Pods corresponding to the MPI job and, while the MPI job is executing, update the status of the MPI job in real time according to the statuses of those Pods.
The above description covers only preferred embodiments of the present application and the technical principles they apply. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features. It should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept described above, for example, technical solutions formed by mutually replacing the above features with technical features having similar functions disclosed in (but not limited to) the present application.

Claims (12)

1. A method for executing an MPI job, comprising:
in response to detecting an execution request for an MPI job, recording the current time as the start time of the MPI job, and determining that the status of the MPI job is an initialization state;
creating Pods corresponding to the MPI job according to resource configuration information corresponding to the MPI job;
generating a public key and a private key corresponding to the MPI job, and mounting the public key and the private key into the Pods corresponding to the MPI job;
updating the status of the MPI job to creating; and
executing the MPI job using the Pods corresponding to the MPI job, and, while executing the MPI job, updating the status of the MPI job in real time according to the statuses of the Pods corresponding to the MPI job.
2. The method according to claim 1, wherein creating the Pods corresponding to the MPI job according to the resource configuration information corresponding to the MPI job comprises:
reading the resource configuration information from a task description file corresponding to the MPI job; and
creating the Pods corresponding to the MPI job according to the read resource configuration information.
3. The method according to claim 2, wherein creating the Pods corresponding to the MPI job according to the read resource configuration information comprises:
creating a master-process Pod and worker-process Pods corresponding to the MPI job according to the read resource configuration information.
4. The method according to claim 3, wherein executing the MPI job using the Pods corresponding to the MPI job, and, while executing the MPI job, updating the status of the MPI job in real time according to the statuses of the Pods corresponding to the MPI job, comprises:
in response to determining that the master-process Pod and the worker-process Pods corresponding to the MPI job have started successfully, updating the status of the MPI job to running, and executing the MPI job using the Pods corresponding to the MPI job;
in response to determining that any of the master-process Pod and the worker-process Pods corresponding to the MPI job has a status of failed, updating the status of the MPI job to failed;
in response to determining that the statuses of the master-process Pod and the worker-process Pods corresponding to the MPI job are succeeded, updating the status of the MPI job to succeeded; and
in response to determining that the time difference between the current time and the start time of the MPI job exceeds a preset timeout duration, updating the status of the MPI job to timed out.
5. The method according to claim 4, further comprising:
in response to determining that the status of the MPI job is succeeded, failed, or timed out, deleting the master-process Pod and the worker-process Pods corresponding to the MPI job.
6. An apparatus for executing an MPI job, comprising:
a start time recording unit, configured to, in response to detecting an execution request for an MPI job, record the current time as the start time of the MPI job, and determine that the status of the MPI job is an initialization state;
a Pod creation unit, configured to create Pods corresponding to the MPI job according to resource configuration information corresponding to the MPI job;
a public key and private key mounting unit, configured to generate a public key and a private key corresponding to the MPI job, and mount the public key and the private key into the Pods corresponding to the MPI job;
a creating-state updating unit, configured to update the status of the MPI job to creating; and
an execution and status updating unit, configured to execute the MPI job using the Pods corresponding to the MPI job, and, while executing the MPI job, update the status of the MPI job in real time according to the statuses of the Pods corresponding to the MPI job.
7. The apparatus according to claim 6, wherein the Pod creation unit comprises:
a reading module, configured to read the resource configuration information from a task description file corresponding to the MPI job; and
a Pod creation module, configured to create the Pods corresponding to the MPI job according to the read resource configuration information.
8. The apparatus according to claim 7, wherein the Pod creation module is further configured to:
create a master-process Pod and worker-process Pods corresponding to the MPI job according to the read resource configuration information.
9. The apparatus according to claim 8, wherein the execution and status updating unit comprises:
a running-state updating module, configured to, in response to determining that the master-process Pod and the worker-process Pods corresponding to the MPI job have started successfully, update the status of the MPI job to running, and execute the MPI job using the Pods corresponding to the MPI job;
a failed-state updating module, configured to, in response to determining that any of the master-process Pod and the worker-process Pods corresponding to the MPI job has a status of failed, update the status of the MPI job to failed;
a succeeded-state updating module, configured to, in response to determining that the statuses of the master-process Pod and the worker-process Pods corresponding to the MPI job are succeeded, update the status of the MPI job to succeeded; and
a timeout-state updating module, configured to, in response to determining that the time difference between the current time and the start time of the MPI job exceeds a preset timeout duration, update the status of the MPI job to timed out.
10. The apparatus according to claim 9, further comprising:
a resource cleanup unit, configured to, in response to determining that the status of the MPI job is succeeded, failed, or timed out, delete the MPI job and the resources associated with the MPI job.
11. An electronic device, comprising:
one or more processors; and
a storage apparatus storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1 to 5.
12. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by one or more processors, implements the method according to any one of claims 1 to 5.
CN201910533277.3A 2019-06-19 2019-06-19 Method and apparatus for performing MPI jobs Active CN110221910B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910533277.3A CN110221910B (en) 2019-06-19 2019-06-19 Method and apparatus for performing MPI jobs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910533277.3A CN110221910B (en) 2019-06-19 2019-06-19 Method and apparatus for performing MPI jobs

Publications (2)

Publication Number Publication Date
CN110221910A true CN110221910A (en) 2019-09-10
CN110221910B CN110221910B (en) 2022-08-02

Family

ID=67814075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910533277.3A Active CN110221910B (en) 2019-06-19 2019-06-19 Method and apparatus for performing MPI jobs

Country Status (1)

Country Link
CN (1) CN110221910B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110795207A (en) * 2019-09-27 2020-02-14 广东浪潮大数据研究有限公司 Virtual container minimum resource unit mutual trust configuration method and device
CN115454450A (en) * 2022-09-15 2022-12-09 北京火山引擎科技有限公司 Method and device for resource management of data operation, electronic equipment and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
CN102622414A (en) * 2012-02-17 2012-08-01 清华大学 Peer-to-peer structure based distributed high-dimensional indexing parallel query framework
CN108062254A (en) * 2017-12-12 2018-05-22 腾讯科技(深圳)有限公司 Job processing method, device, storage medium and equipment
CN108920259A (en) * 2018-03-30 2018-11-30 华为技术有限公司 Deep learning job scheduling method, system and relevant device
CN108964982A (en) * 2018-06-13 2018-12-07 众安信息技术服务有限公司 For realizing the method, apparatus and storage medium of the deployment of the multinode of block chain


Cited By (4)

Publication number Priority date Publication date Assignee Title
CN110795207A (en) * 2019-09-27 2020-02-14 广东浪潮大数据研究有限公司 Virtual container minimum resource unit mutual trust configuration method and device
CN110795207B (en) * 2019-09-27 2022-08-12 广东浪潮大数据研究有限公司 Virtual container minimum resource unit mutual trust configuration method and device
CN115454450A (en) * 2022-09-15 2022-12-09 北京火山引擎科技有限公司 Method and device for resource management of data operation, electronic equipment and storage medium
CN115454450B (en) * 2022-09-15 2024-04-30 北京火山引擎科技有限公司 Method and device for resource management of data job, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110221910B (en) 2022-08-02

Similar Documents

Publication Publication Date Title
US10735345B2 (en) Orchestrating computing resources between different computing environments
US11170316B2 (en) System and method for determining fuzzy cause and effect relationships in an intelligent workload management system
US11245588B2 (en) Modifying realized topologies
US8065676B1 (en) Automated provisioning of virtual machines for a virtual machine buffer pool and production pool
US20180101371A1 (en) Deployment manager
US20120096461A1 (en) Load balancing in multi-server virtual workplace environments
CN108304250A (en) Method and apparatus for the node for determining operation machine learning task
CN106575243A (en) Hypervisor-hosted virtual machine forensics
US20140081615A1 (en) Virtual systems testing
US10019293B2 (en) Enhanced command selection in a networked computing environment
CN108429768A (en) Cloud data analysis service manages system, method and cloud server
KR102651083B1 (en) MANAGING MULTI-SINGLE-TENANT SaaS SERVICES
CN108369532A (en) For first and the encapsulation tool of third party's arrangements of components
CN114371857B (en) Digital twin enabled asset performance and upgrade management
US10956131B2 (en) Separation of user interface logic from user interface presentation by using a protocol
Grandinetti Pervasive cloud computing technologies: future outlooks and interdisciplinary perspectives: future outlooks and interdisciplinary perspectives
CN110221910A (en) Method and apparatus for executing MPI operation
CN115185697A (en) Cluster resource scheduling method, system, equipment and storage medium based on kubernets
Xu et al. Enhanced service framework based on microservice management and client support provider for efficient user experiment in edge computing environment
US20210203665A1 (en) Process and system for managing data flows for the unified governance of a plurality of intensive computing solutions
US20120311117A1 (en) Object Pipeline-Based Virtual Infrastructure Management
US20220335318A1 (en) Dynamic anomaly forecasting from execution logs
US11886921B2 (en) Serverless runtime container allocation
CN109324892A (en) Distribution management method, distributed management system and device
JP2021526685A (en) Distributed computing system with frameset package store of synthetic data as a service

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant