WO2012024435A2 - System and method for execution of high performance computing applications - Google Patents

Publication number
WO2012024435A2
Authority
WO
WIPO (PCT)
Prior art keywords
kernel
algorithm
algorithms
library
states
Prior art date
Application number
PCT/US2011/048134
Other languages
French (fr)
Other versions
WO2012024435A3 (en)
Inventor
Kevin D. Howard
Original Assignee
Massively Parallel Technologies, Inc.
Application filed by Massively Parallel Technologies, Inc. filed Critical Massively Parallel Technologies, Inc.
Priority to EP11818745.9A priority Critical patent/EP2606424A4/en
Priority to JP2013524967A priority patent/JP2013534347A/en
Publication of WO2012024435A2 publication Critical patent/WO2012024435A2/en
Publication of WO2012024435A3 publication Critical patent/WO2012024435A3/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/448 Execution paradigms, e.g. implementations of programming paradigms
    • G06F9/4498 Finite state machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Stored Programmes (AREA)

Abstract

A runtime environment for high performance computing applications. Algorithms, kernels used in the algorithms, inputs to and outputs from the algorithms, and the number of computing nodes to be used in executing each step of each of the algorithms are initially defined. The algorithms and the kernels are added to respective libraries. A request for execution of one of the algorithms is received, and input datasets for the algorithm are transferred to the computing system. The requested algorithm is then executed to generate output data sets.

Description

Docket No.: 514405
SYSTEM AND METHOD FOR EXECUTION OF HIGH PERFORMANCE COMPUTING APPLICATIONS
SUMMARY
[0001] The present system comprises a runtime environment for parallel computing and other high performance applications. Primary functions of the present system include storing, controlling, and running small run-time state-machine-associated programs called kernels. The system uses four interface methods: kernel management, algorithm management, kernel execution, and algorithm execution.
[0002] Algorithms, kernels used in the algorithms, inputs to and outputs from the algorithms, and the number of computing nodes to be used in executing each step of each of the algorithms are initially defined. The algorithms and the kernels are added to respective libraries. A request for execution of one of the algorithms is received, and input datasets for the algorithm are transferred to the system. The requested algorithm is then executed to generate output data sets.
BRIEF DESCRIPTION OF THE DRAWINGS
[0003] Figure 1 is an exemplary diagram showing high-level components in one embodiment of the present system;
[0004] Figure 2 is an exemplary illustration of a requester-system interaction;
[0005] Figure 3 is a flowchart showing an exemplary set of steps performed by the present system in executing an algorithm; and
[0006] Figure 4 is an exemplary illustration of an HPC state machine.
DETAILED DESCRIPTION
[0007] Figure 1 is an exemplary diagram showing high-level components in one embodiment of the present system 100 for execution of high performance computing applications. An application is defined herein to be an end-user accessible algorithm. As shown in Figure 1, system 100 includes a processor 101 which, in operation, executes a kernel management module 110, an algorithm management module 105, a kernel execution module 130, and an algorithm execution module 125. System 100 further includes libraries 115 / 120 which respectively store algorithms 117 and kernels 122.
[0008] System 100 is coupled to a host management system 145, which provides management of system functions, and issues system requests. Algorithm execution module 125 initiates execution of kernels invoked by algorithms that are executed. Algorithm execution system 135 may comprise any computing system with multiple computing nodes 140 which can execute kernels stored in system 100. Management system 145 can be any external client computer system which requests services from the present system 100. These services, which are described in detail below, include requesting that kernels or algorithms be added/changed/deleted from a respective library within the current system. In addition, the external client system can request that a kernel/algorithm be executed. It should be noted that the description of kernel and algorithm management and execution is merely exemplary, and the claimed system is not limited to the specific file names, formats and instructions presented herein.
Kernel Management
[0009] A kernel is an executable computer program or program segment that contains data transformation/data code, and no program execution control code. Execution control code is any code that can change which code is to be executed next. In the exemplary embodiment described herein, kernels 122 are stored in a kernel library file 121 in kernel library 120. System 100, in an exemplary embodiment, provides kernel management to add, change or delete a kernel in a kernel library file 121 by adding, updating, or deleting the .DLL (Dynamic Linked Library) or .SO (Shared Object) library or other library file that contains the kernel of interest. References to "the system" in this section refer, in general, to system 100, and in applicable embodiments, refer more specifically to kernel management module 110.
[0010] All kernel management messages sent from host system 145 to system 100 contain library and kernel information. The kernel information fields are Library_Title and Kernel_Title. Upon completion of a management task, the system creates and sends a Return_Status message containing completion status. The Status field contains a zero if the task was completed successfully or a non-zero if the task could not be completed.
[0011] Kernels 122 are grouped with other kernels from the same originating organization and category in a kernel library file 121 which is created when the first kernel is added, and deleted when the last kernel is deleted. The name of the library file (Library_Title) is the concatenation of the organization and category names with a '_' character between them. The library is of file type .DLL, .SO, or other library file type, depending on the operating system which is used as a platform by the system.
[0012] The function name of the kernel in the library (Kernel_Title) is the kernel name concatenated with the user name with a '_' character between them. The '@' character in the user name is replaced with the string "_MPT_". The Kernel_Title is further concatenated with the string "_OWN", "_POST", or "_PUBLISH" depending upon where the kernel has been posted or published. The function may exist in the OWN, POST, and PUBLISH form at the same time, depending on owner/organization desires and the development status.
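The naming rules in the two preceding paragraphs can be sketched as follows. This is only an illustration: the function names `library_title` and `kernel_title`, the `posting` parameter, and the sample values are assumptions and do not appear in the specification.

```python
def library_title(organization: str, category: str) -> str:
    """Library_Title: organization and category names joined by '_'."""
    return f"{organization}_{category}"


def kernel_title(kernel_name: str, user_name: str, posting: str) -> str:
    """Kernel_Title: kernel name, '_', user name, then the posting suffix.

    The '@' in the user name is replaced with "_MPT_", and the suffix is
    one of "_OWN", "_POST", or "_PUBLISH".
    """
    if posting not in ("OWN", "POST", "PUBLISH"):
        raise ValueError(f"unknown posting level: {posting!r}")
    safe_user = user_name.replace("@", "_MPT_")
    return f"{kernel_name}_{safe_user}_{posting}"


# Hypothetical sample values, chosen only to exercise the rules.
lib = library_title("Acme", "FFT")
title = kernel_title("fft2d", "jane@acme", "OWN")
```

Because the same kernel may exist in the OWN, POST, and PUBLISH forms at once, the suffix is part of the title rather than a separate attribute.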
Adding Kernels
[0013] When the system receives a Kernel_Add_Request message, the system attempts to add the new kernel to a system Kernel list (Library_Kernel_Table_List) 160. The Library_Title is the name of the library file that contains the kernel, and Kernel_Title is the name of the kernel to add. The Kernel_Add_Request message is defined below:
Kernel_Add_Request
{
char message = 120,
char Library_Title [32],
char Kernel_Title [32],
int number_data_bytes,
char Kernel_data[number_data_bytes]
}
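One possible byte layout for this message is sketched below. The specification gives the fields and the message code 120 but not the wire encoding, so the little-endian byte order, the 4-byte `int`, the NUL padding of the 32-byte title fields, and the helper names are all assumptions.

```python
import struct

KERNEL_ADD_REQUEST = 120  # message code from the definition above

# Assumed layout: 1-byte code, two 32-byte titles, 4-byte length, no padding.
_HEADER = struct.Struct("<B32s32si")


def pack_kernel_add_request(library_title: str, kernel_title: str,
                            kernel_data: bytes) -> bytes:
    """Serialize a Kernel_Add_Request under the assumed layout."""
    lib = library_title.encode("ascii")
    ker = kernel_title.encode("ascii")
    if len(lib) > 32 or len(ker) > 32:
        raise ValueError("titles must fit in 32 bytes")
    # struct pads the '32s' fields with NUL bytes on the right.
    return _HEADER.pack(KERNEL_ADD_REQUEST, lib, ker, len(kernel_data)) + kernel_data


def unpack_kernel_add_request(message: bytes):
    """Parse a Kernel_Add_Request; returns (library, kernel, data)."""
    code, lib, ker, count = _HEADER.unpack_from(message)
    if code != KERNEL_ADD_REQUEST:
        raise ValueError(f"unexpected message code {code}")
    data = message[_HEADER.size:_HEADER.size + count]
    if len(data) != count:
        raise ValueError("message is shorter than number_data_bytes")
    return lib.rstrip(b"\0").decode("ascii"), ker.rstrip(b"\0").decode("ascii"), data


packed = pack_kernel_add_request("Acme_FFT", "fft2d_jane_MPT_acme_OWN", b"\x7fELF")
```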
[0014] Upon receipt of the Kernel_Add_Request message, the system determines if the library named Library_Title exists. Library existence is checked by comparing Library_Title to all library titles in the Library_Kernel_Table_List 160. If the library does exist, the system determines if the kernel exists within the library by comparing the Kernel_Title to all Kernel_Titles 162 stored in the Library_Kernel_Table 161.
The definition of the Library Kernel table is given below in Table 1:
Table 1 Library Kernel Table Definition
[Table 1 appears as an image in the original publication]
[0015] Existence of Kernel_Title in the library causes the system to generate and send the Return_Status message containing a 19 (Kernel Already Exists Error) in its Status field to the requester, which in an exemplary embodiment, is host management system 145.
[0016] The Return_Status message is defined below:
Return_Status
{
char message = 2,
int status = 19
}
[0017] Non-existence of the kernel in the library causes the system to attempt to add the kernel to the kernel library file 121. Unsuccessful addition of the kernel to the library causes the system to generate and send the Return_Status message containing a 22 (Kernel Add/Update Error) in its Status field. Adding a kernel to the library requires the library file to be replaced by the requester's information.
[0018] Successful replacement of the library and non-existence of the Library_Kernel_Table causes the system to create the Library_Kernel_Table 161 and add it to the Library_Kernel_Table_List 160. Unsuccessful creation of the Library_Kernel_Table causes the system to remove the library and generate and send the Return_Status message containing a 22 (Kernel Add/Update Error) in its Status field to the requester.
[0019] Successful creation of the Library_Kernel_Table causes the system to add the kernel information to the Library_Kernel_Table. Unsuccessful addition of the kernel information to the Library_Kernel_Table causes the system to remove the library and Library_Kernel_Table, if new, or restore their original contents, and then generate and send the Return_Status message containing a 20 (Kernel Not Added Successfully) in its Status field to the requester.
[0020] Successful addition of the kernel information to the Library_Kernel_Table causes the system to attempt to reload the Library_Kernel_Table_Address for all kernels in the library. Unsuccessful reloading of the Library_Kernel_Table_Address for all kernels in the library causes the system to restore the prior contents of the library, remove the added kernel from the Library_Kernel_Table, restore the address list of the Library_Kernel_Table and generate and send the Return_Status message containing a 23 (Kernel Not Added/Changed Error - kernel not found in library) in its Status field to the requester.
[0021] Successful reload of the Library_Kernel_Table_Address for all kernels in the library causes the system to generate and send the Return_Status message containing a zero in its Status field to the requester.
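The add sequence in paragraphs [0014] to [0021] can be sketched as a single function that returns the Status value. This is a simplified in-memory model, not the patented implementation: the dictionary stands in for the Library_Kernel_Table_List, the two callables stand in for replacing the library file and reloading kernel addresses, and the table-creation and table-insert failures (codes 22 and 20) are omitted because they cannot occur in this model. All names are hypothetical.

```python
from typing import Callable, Dict, Set

OK = 0
KERNEL_ALREADY_EXISTS = 19
KERNEL_ADD_UPDATE_ERROR = 22
KERNEL_NOT_FOUND_IN_LIBRARY = 23


def add_kernel(table_list: Dict[str, Set[str]],
               library_title: str,
               kernel_title: str,
               replace_library: Callable[[], bool],
               reload_addresses: Callable[[], bool]) -> int:
    """Return the Status value for a Kernel_Add_Request."""
    table = table_list.get(library_title)
    if table is not None and kernel_title in table:
        return KERNEL_ALREADY_EXISTS
    if not replace_library():
        return KERNEL_ADD_UPDATE_ERROR
    created = table is None
    if created:
        table = set()
        table_list[library_title] = table
    table.add(kernel_title)
    if not reload_addresses():
        # Roll back the table entry; dropping a table created by this
        # request is an assumption, not stated in the text.
        table.discard(kernel_title)
        if created:
            del table_list[library_title]
        return KERNEL_NOT_FOUND_IN_LIBRARY
    return OK


tables: Dict[str, Set[str]] = {}
first = add_kernel(tables, "Acme_FFT", "fft_OWN", lambda: True, lambda: True)
second = add_kernel(tables, "Acme_FFT", "fft_OWN", lambda: True, lambda: True)
```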
Table 2 Library Kernel Table List
[Table 2 appears as an image in the original publication]
Change Kernels
[0022] When the system receives a Kernel_Change_Request message from management system 145, the system attempts to change an existing kernel in its library of kernels. The Library_Title is the name of the library file that contains the kernel, and Kernel_Title is the name of the kernel to change.
The Kernel_Change_Request message is defined below:
Kernel_Change_Request
{
char message = 121,
char Library_Title [32],
char Kernel_Title [32],
int number_data_bytes,
char Kernel_data[number_data_bytes]
}
[0023] Upon receipt of the Kernel_Change_Request message, the system determines if the Library_Title exists by comparing Library_Title to all Library_Titles in the Library_Kernel_Table_List. If the library does exist, the system determines if the kernel exists within the library by comparing the Kernel_Title to all Kernel_Titles stored in the Library_Kernel_Table.
[0024] If the library does not exist or Kernel_Title does not exist in the Library_Kernel_Table for the library, the system generates and sends the Return_Status message containing a 21 (Kernel Does Not Exist Error) in its Status field to the requester.
[0025] If the library does exist and Kernel_Title does exist in the library, the system replaces the old kernel information with the information from the new kernel definition by replacing the contents of the Library_Title library file. Unsuccessful replacement of the library causes the system to generate and send the Return_Status message containing a 22 (Kernel Add/Update Error - library not replaced) in its Status field to the requester.
[0026] Successful replacement of the library causes the system to reload the Library_Kernel_Table_Address for all kernels in the library. Unsuccessful reload of the Library_Kernel_Table_Address for all kernels in the library causes the system to restore the prior contents of the library, remove the added kernel from the Library_Kernel_Table, and restore the address list of the Library_Kernel_Table. It then generates and sends the Return_Status message containing a 23 (Kernel Not Added/Changed Error - kernel not found in library) in its Status field to the requester.
[0027] Successful reload of the Library_Kernel_Table_Address for all kernels in the library causes the system to generate and send the Return_Status message containing a zero in its Status field to the requester.
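The change sequence differs from the add sequence mainly in its existence check and its rollback. A minimal sketch follows, assuming the same in-memory table model; the callables `replace_library`, `reload_addresses`, and `restore_library` are hypothetical stand-ins for the file operations. The sketch restores only the library on a failed reload and leaves the table entry in place, which is a simplification of paragraph [0026].

```python
from typing import Callable, Dict, Set

OK = 0
KERNEL_DOES_NOT_EXIST = 21
KERNEL_ADD_UPDATE_ERROR = 22
KERNEL_NOT_FOUND_IN_LIBRARY = 23


def change_kernel(table_list: Dict[str, Set[str]],
                  library_title: str,
                  kernel_title: str,
                  replace_library: Callable[[], bool],
                  reload_addresses: Callable[[], bool],
                  restore_library: Callable[[], None]) -> int:
    """Return the Status value for a Kernel_Change_Request."""
    table = table_list.get(library_title)
    if table is None or kernel_title not in table:
        return KERNEL_DOES_NOT_EXIST
    if not replace_library():
        return KERNEL_ADD_UPDATE_ERROR
    if not reload_addresses():
        restore_library()  # put the prior library contents back
        return KERNEL_NOT_FOUND_IN_LIBRARY
    return OK


restored = []
tables = {"Acme_FFT": {"fft_OWN"}}
status = change_kernel(tables, "Acme_FFT", "fft_OWN",
                       lambda: True, lambda: False,
                       lambda: restored.append("Acme_FFT"))
```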
Kernel Delete
[0028] The system receives the Kernel_Delete_Request message from management system 145 and attempts to delete an existing kernel in its library of kernels.
The Kernel_Delete_Request is defined below:
Kernel_Delete_Request
{
char message = 122,
char Library_Title [32],
char Kernel_Title [32]
}
[0029] Upon receipt of the Kernel_Delete_Request message, the system determines if the Library_Title exists by comparing Library_Title to all Library_Titles in the Library_Kernel_Table_List. If the library does exist, the system determines if the kernel exists within the library by comparing the Kernel_Title to all Kernel_Titles stored in the Library_Kernel_Table.
[0030] If Library_Title does not exist in the Library_Kernel_Table_List or Kernel_Title does not exist in the Library_Kernel_Table, the system generates and sends the Return_Status message containing a 21 (Kernel Does Not Exist Error) in its Status field to the requester.
[0031] If the library exists and the kernel exists in the library, the system replaces the old kernel information with the information from the new kernel definition by replacing the contents of the Library_Title library file. Unsuccessful replacement of the library causes the system to generate and send the Return_Status message containing a 24 (Kernel Not Deleted Error - library not replaced) in its Status field to the requester.
[0032] Successful library replacement causes the system to verify that the deleted kernel is not in the library. If the deleted kernel exists in the library, then the system restores the prior contents of the library, restores the deleted kernel to the Library_Kernel_Table, and restores the address list of the Library_Kernel_Table. It then sends the Return_Status message containing a 54 (Kernel Not Deleted Error - kernel still found in library) in its Status field to the requester.
[0033] Non-existence of the deleted kernel in the library causes the system to delete the Kernel_Title from the Library_Kernel_Table. Unsuccessful deletion of the Kernel_Title from the Library_Kernel_Table causes the system to restore the prior contents of the library, restore the deleted kernel to the Library_Kernel_Table, and restore the address list of the Library_Kernel_Table. It then sends the Return_Status message containing a 53 (Kernel Not Deleted Error - kernel not found in library) in its Status field to the requester.
[0034] Successful deletion of the Kernel_Title from the Library_Kernel_Table causes the system to generate and send the Return_Status message containing a zero in its Status field to the requester.
Algorithm Management
[0035] Algorithm management allows the present system 100 to add, change, or delete an algorithm. An algorithm is a state machine that comprises states (kernel invocations) and state transitions (the conditions needed to go from one state to another). References to "the system" in this section refer in general to system 100, and in applicable embodiments, to algorithm management module 105.
[0036] Each algorithm 117 is kept in an algorithm definition file 116 in algorithm library 115 with a name (Algorithm_Title) that is the concatenation of the organization name, the category name, algorithm name, and user name with a '_' character between each of the names. The '@' character in the user name is replaced with the string "_MPT_". The Algorithm_Title is further concatenated with the string "_OWN", "_POST", or "_PUBLISH" depending upon where the algorithm has been posted or published. The algorithm may exist in the OWN, POST, and PUBLISH form at the same time, depending on owner/organization desires and the development status. The file type of the file is ".MPT".
Algorithm Add
[0037] When system 100 receives an Algorithm_Add_Request message, the system attempts to add the algorithm to the system Algorithm_Table 150. The Algorithm_Title is the name of the file that contains the algorithm to add. The Algorithm_Add_Request is defined below:
Algorithm_Add_Request
{
char message = 130,
char Algorithm_Title [32],
int number_data_bytes,
char Algorithm_data[number_data_bytes]
}
[0038] Upon receipt of the Algorithm_Add_Request message, the system checks to see if the algorithm exists by comparing Algorithm_Title to all Algorithm_Titles 151 in the Algorithm_Table 150. Existence of the algorithm in the Algorithm table causes the system to generate and send the Return_Status message containing a 29 (Algorithm Already Exists Error) in its Status field to the requester.
[0039] Non-existence of the algorithm in the Algorithm table causes the system to save the algorithm data to an algorithm definition file 116. Unsuccessful saving of the algorithm data to the algorithm definition file 116 in algorithm library 115 causes the system to generate and send the Return_Status message containing a 29 (Algorithm Already Exists Error) in its Status field to the requester.
[0040] Successful saving of the algorithm data to the algorithm file causes the system to add the Algorithm_Title to the Algorithm_Table 150. Unsuccessful addition of the Algorithm_Title to the Algorithm_Table causes the system to remove the added algorithm file and generate and send the Return_Status message containing a 33 (Algorithm Add to Table Error) in its Status field to the requester.
[0041] Successful addition of the algorithm to the algorithm title list 151 in the Algorithm_Table 150 causes the system to generate and send the Return_Status message containing a zero in its Status field to the requester.
Algorithm Change
[0042] When the system receives an Algorithm_Change_Request message, the system attempts to change the algorithm in the system algorithm title list 151 in the Algorithm_Table 150. The Algorithm_Title is the name of the file that contains the algorithm to change. The Algorithm_Change_Request message is defined below:
Algorithm_Change_Request
{
char message = 131,
char Algorithm_Title [32],
int number_data_bytes,
char Algorithm_data[number_data_bytes]
}
[0043] Upon receipt of the Algorithm_Change_Request message, the system checks the existence of the algorithm by comparing Algorithm_Title to all Algorithm_Titles in the Algorithm_Table. Non-existence of the algorithm in the Algorithm_Table causes the system to generate and send the Return_Status message containing a 31 (Algorithm Does Not Exist Error) in its Status field to the requester.
[0044] Existence of the Algorithm in the Algorithm_Table causes the system to replace the algorithm information in its file with the new contents. Unsuccessful change of the algorithm file contents causes the system to generate and send the Return_Status message containing a 32 (Algorithm Unsuccessfully Changed Error) in its Status field to the requester.
[0045] Successful change of the algorithm file contents causes the system to generate and send the Return_Status message containing a zero in its Status field to the requester.
Algorithm Delete
[0046] When the system receives an Algorithm_Delete_Request message, the system attempts to delete the algorithm from the system Algorithm List. The Algorithm_Title is the name of the file that contains the algorithm to delete. The Algorithm_Delete_Request message is defined below:
Algorithm_Delete_Request
{
char message = 132,
char Algorithm_Title [32]
}
[0047] Upon receipt of the Algorithm_Delete_Request message, the system checks for algorithm existence by comparing Algorithm_Title to all Algorithm_Titles in the Algorithm_Table. Non-existence of the algorithm causes the system to generate and send the Return_Status message containing a 31 (Algorithm Does Not Exist Error) in its Status field to the requester.
[0048] Existence of the algorithm in the Algorithm_Table causes the system to delete the algorithm from the Algorithm_Table. Unsuccessful deletion of the algorithm from the Algorithm_Table causes the system to generate and send the Return_Status message containing a 34 (Algorithm Not Successfully Deleted Error) in its status field to the requester.
[0049] Successful deletion of the algorithm from the Algorithm_Table causes the system to delete the algorithm file. The system then generates and sends the Return_Status message containing a zero in its Status field to the requester.
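The add, change, and delete operations of the algorithm management section can be modeled together. The sketch below is an assumption-laden illustration: the class name `AlgorithmLibrary`, its methods, and the use of a Python set for the Algorithm_Table are hypothetical, and codes 33 and 34 (table add and table delete failures) are not reachable in this model. Code 29 is returned for a failed save because the text assigns that code to both cases.

```python
import tempfile
from pathlib import Path

OK = 0
ALGORITHM_ALREADY_EXISTS = 29
ALGORITHM_DOES_NOT_EXIST = 31
ALGORITHM_UNSUCCESSFULLY_CHANGED = 32


class AlgorithmLibrary:
    """Algorithm_Table plus one ".MPT" definition file per algorithm."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.table = set()  # stands in for the Algorithm_Table titles

    def path_for(self, title: str) -> Path:
        return self.directory / f"{title}.MPT"

    def add(self, title: str, data: bytes) -> int:
        if title in self.table:
            return ALGORITHM_ALREADY_EXISTS
        try:
            self.path_for(title).write_bytes(data)
        except OSError:
            return ALGORITHM_ALREADY_EXISTS  # the text reuses 29 here
        self.table.add(title)
        return OK

    def change(self, title: str, data: bytes) -> int:
        if title not in self.table:
            return ALGORITHM_DOES_NOT_EXIST
        try:
            self.path_for(title).write_bytes(data)
        except OSError:
            return ALGORITHM_UNSUCCESSFULLY_CHANGED
        return OK

    def delete(self, title: str) -> int:
        if title not in self.table:
            return ALGORITHM_DOES_NOT_EXIST
        # Table entry first, then the file, following the order in the text.
        self.table.discard(title)
        self.path_for(title).unlink(missing_ok=True)
        return OK


library = AlgorithmLibrary(Path(tempfile.mkdtemp()))
name = "Acme_FFT_fft2d_jane_MPT_acme_OWN"
added = library.add(name, b"state machine definition")
```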
Algorithm Execution
[0050] Figure 2 is an exemplary illustration of requester 145 - system 100 interaction. Executing an algorithm means that the system accesses an algorithm 117 (contained in a respective algorithm definition file 116) in response to an 'execute' request, at step 201, and executes the contained state machine until completion. The execution begins with the first state and continues through each required transition to new states until the completion state is executed or an error is encountered.
[0051] To run the algorithm, several steps need to be completed, in order. First, the system verifies that the algorithm exists in its Algorithm table. The system then verifies all execution parameters. If either of these checks fails, the user is notified and the request is rejected. This should be done before the investment is made to transfer all input data sets to the system.
[0052] Once the existence of the algorithm and execute parameters is verified, the input datasets need to be sent to the system in a reliable manner. To facilitate this transfer, the system requests each dataset from the requester. The datasets are transferred to dataset buffer 103 one at a time to ensure that unambiguous transfer is accomplished. The system requests each specified dataset from the execution requester (typically host management system 145) and completes receiving the dataset before moving to the next dataset. This procedure of inputting datasets is collectively shown by the series of steps indicated by brace 205.
[0053] Once all input datasets have been transferred, the system starts the execution of the algorithm. The algorithm execution continues until the final state of the algorithm is reached, an error is encountered, or the user cancels the execution. Once execution has completed, at step 207, the system notifies the requester of the completion status of the execution.
[0054] At this time, the requester may request the sending of the output datasets to the requester. This transfer is done in essentially the inverse of the way in which the input datasets are sent to the system. The requester sends a request for each dataset and the system responds with the dataset or a reason why it is unavailable, as indicated by the series of steps referenced by brace 210. A dataset may be unavailable when a requester cancels or when an error occurs during execution.
[0055] Once the requester has requested all desired datasets, at step 212 the requester notifies the system that it is finished and that the system can de-allocate all resources that were in use during algorithm execution, including all datasets and computing resources. Once the system has responded to this request, at step 214, the execution process is complete.
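The ordering of the steps in paragraphs [0051] to [0055] can be captured as a small phase tracker. The phase names, the `ALLOWED` transition table, and the `ExecutionSession` class are hypothetical; they reflect one reading of the sequence (verify, transfer inputs, execute, serve outputs, release), with a failed input transfer treated as an early completion.

```python
from enum import Enum, auto


class Phase(Enum):
    VERIFYING = auto()         # algorithm and parameters are checked
    REJECTED = auto()          # a check failed; nothing was transferred
    RECEIVING_INPUTS = auto()  # datasets are requested one at a time
    EXECUTING = auto()
    COMPLETED = auto()         # output datasets may now be requested
    RELEASED = auto()          # resources de-allocated; job is over


ALLOWED = {
    Phase.VERIFYING: {Phase.REJECTED, Phase.RECEIVING_INPUTS},
    Phase.REJECTED: set(),
    Phase.RECEIVING_INPUTS: {Phase.EXECUTING, Phase.COMPLETED},
    Phase.EXECUTING: {Phase.COMPLETED},
    Phase.COMPLETED: {Phase.RELEASED},
    Phase.RELEASED: set(),
}


class ExecutionSession:
    """Tracks which step of the requester-system exchange a job is in."""

    def __init__(self) -> None:
        self.phase = Phase.VERIFYING
        self.history = [self.phase]

    def advance(self, next_phase: Phase) -> None:
        if next_phase not in ALLOWED[self.phase]:
            raise ValueError(f"cannot go from {self.phase.name} to {next_phase.name}")
        self.phase = next_phase
        self.history.append(next_phase)


session = ExecutionSession()
for step in (Phase.RECEIVING_INPUTS, Phase.EXECUTING, Phase.COMPLETED, Phase.RELEASED):
    session.advance(step)
```

Verifying before any dataset transfer matches the point made in the text: a request that is going to be rejected should be rejected before the cost of moving input data is paid.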
[0056] As noted above, the execution of an algorithm begins with a request message from the requester. The Algorithm_Execute_Request message is used for this task and is defined below:
Algorithm_Execute_Request
{
char message = 133,
char Algorithm_Title [32],
int number_of_input_datasets,
char input_dataset_name[32][number_of_input_datasets],
char input_dataset_type[32][number_of_input_datasets],
char input_dataset_element_size[32][number_of_input_datasets],
int_64 input_dataset_size[number_of_input_datasets],
int number_of_output_datasets,
char output_dataset_name[32][number_of_output_datasets],
char output_dataset_type[32][number_of_output_datasets],
char output_dataset_element_size[32][number_of_output_datasets],
int_64 output_dataset_size[number_of_output_datasets]
}
[0057] Once the Algorithm_Execute_Request message is received, the system checks for Algorithm existence by comparing Algorithm_Title to all Algorithm_Titles in the Algorithm_Table. Non-existence of the algorithm causes the system to generate and send the Return_Status message containing a 31 (Algorithm Does Not Exist Error) in its Status field to the requester.
[0058] Existence of the algorithm causes the system to begin requesting the datasets from the requester, one at a time, as algorithm execution calls for each particular set. For each dataset, the system sends a Dataset_Request message to the requester with the dataset to be returned to the system. The Dataset_Request message appears as follows:
Dataset_Request
{
char message = 135,
char dataset_name[32]
}
The Dataset_Response message appears as follows:
Dataset_Response
{
char message,
char dataset_name[32],
int status,
int_64 dataset_size
}
[0059] If the dataset name is not recognized by the requester, the requester generates and sends the Dataset_Response message to the system with the Status field set to 56 (Dataset Name Not Recognized).
[0060] If the dataset name is recognized but the requester is unable to send the dataset contents, the requester generates and sends the Dataset_Response message to the system with the Status field set to 57 (Dataset Not Available).
[0061] If the dataset name is recognized and the requester is able to send the dataset contents, the requester generates and sends the Dataset_Response message to the system with the status field set to 0 (Success). The contents of the dataset immediately follow the Dataset_Response message.
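The three requester-side outcomes described above (56, 57, and 0) reduce to a short decision function. The sketch assumes the requester keeps a set of known dataset names and a dictionary of the contents it can currently send; both structures and the function name are hypothetical.

```python
from typing import Dict, Set, Tuple

SUCCESS = 0
DATASET_NAME_NOT_RECOGNIZED = 56
DATASET_NOT_AVAILABLE = 57


def answer_dataset_request(name: str,
                           known_names: Set[str],
                           contents: Dict[str, bytes]) -> Tuple[int, int, bytes]:
    """Return (status, dataset_size, payload) for one Dataset_Request."""
    if name not in known_names:
        return DATASET_NAME_NOT_RECOGNIZED, 0, b""
    payload = contents.get(name)
    if payload is None:
        # The name is known but the contents cannot be sent right now.
        return DATASET_NOT_AVAILABLE, 0, b""
    # On success the payload immediately follows the response message.
    return SUCCESS, len(payload), payload


known = {"matrix_a", "matrix_b"}
store = {"matrix_a": b"\x01\x02\x03\x04"}
reply = answer_dataset_request("matrix_a", known, store)
```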
[0062] If any error code is returned by the requester, the system generates and sends an Algorithm_Execution_Response message containing a 59 (Dataset Receive Failed) in its Status field to the requester, along with the status returned by the requester in the Error_Status_Supplemental field and the dataset name in the Error_Entity_Name field.
[0063] The Algorithm_Execution_Response message appears as follows:
Algorithm_Execute_Response
{
char message = 134,
char Algorithm_Title [32],
int status,
int error_status_supplemental,
int error_entity_name[32],
int error_data_supplemental[32],
int number_of_output_datasets,
char output_dataset_name[32][number_of_output_datasets],
char output_dataset_type[32][number_of_output_datasets],
char output_dataset_element_size[32][number_of_output_datasets],
int_64 output_dataset_size[number_of_output_datasets]
}
[0064] If the system cannot store the received dataset contents, the system generates and sends the Algorithm_Execution_Response message containing a 60 (Dataset Store Failed) in its Status field, the dataset name in the Error_Entity_Name field, and any system error code in the Error_Status_Supplemental field to the requester.
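On the system side, input transfer is a loop that requests one dataset at a time and stops at the first failure. The following sketch assumes two hypothetical callables: `request_dataset`, which returns the requester's status and payload, and `store`, which returns 0 on success or a system error code. The response is modeled as a plain dictionary whose keys mirror the Algorithm_Execute_Response field names.

```python
from typing import Callable, Dict, List, Tuple, Union

DATASET_RECEIVE_FAILED = 59
DATASET_STORE_FAILED = 60

Response = Dict[str, Union[int, str]]


def receive_inputs(names: List[str],
                   request_dataset: Callable[[str], Tuple[int, bytes]],
                   store: Callable[[str, bytes], int]) -> Response:
    """Fetch each input dataset in turn; stop at the first failure."""
    for name in names:
        status, payload = request_dataset(name)
        if status != 0:
            return {"status": DATASET_RECEIVE_FAILED,
                    "error_status_supplemental": status,
                    "error_entity_name": name}
        store_error = store(name, payload)
        if store_error != 0:
            return {"status": DATASET_STORE_FAILED,
                    "error_status_supplemental": store_error,
                    "error_entity_name": name}
    return {"status": 0, "error_status_supplemental": 0, "error_entity_name": ""}


buffer: Dict[str, bytes] = {}
source = {"a": (0, b"12"), "b": (57, b"")}


def put(name: str, payload: bytes) -> int:
    buffer[name] = payload
    return 0


outcome = receive_inputs(["a", "b"], lambda n: source[n], put)
```

Completing one dataset before requesting the next keeps each failure attributable to a single named dataset, which is what the Error_Entity_Name field reports.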
[0065] Once all input datasets have been received by the system, algorithm execution module 125 proceeds to initiate the execution of the Algorithm on nodes 140 of algorithm execution system 135. Once execution completes for any reason, the system generates and sends the Algorithm_Execution_Response message to the requester. If the status is nonzero when returned, the supplemental error fields may contain additional information that may aid in determining the cause of the error. Additionally, the output dataset fields may also contain useful information.
[0066] Regardless of the execution completion status, the requester may retrieve the output datasets from the system using the same protocol as the present system uses to retrieve the input datasets from the requester:
[0067] The requester generates and sends a Dataset_Request with the dataset_name set to the name of the dataset to be retrieved. If the dataset is not available (a dataset can be stored in dataset buffer 103 after execution and/or deleted after transmission to the user), the system generates and sends the Dataset_Response message with the status field set to 57 (Dataset Not Available) to the requester. If the dataset is available, the system generates and sends the Dataset_Response message with the status field set to 0 to the requester. The Dataset_Response is followed by the contents of the dataset. The requester may request any other dataset, particularly if the execution ended in failure. In this case, the system may respond with a 0 status and the dataset contents or may respond with the status field set to 57 (Dataset Not Available) to the requester.
[0068] Once all datasets have been retrieved by the requester, the requester generates and sends an Execute_Release_Request message to the system. The Execute_Release_Request message looks as follows:
Execute_Release_Request
[The body of this message definition appears as an image in the original publication]
[0069] This message causes the system to terminate the execution job and release all resources, such as memory and datasets, associated with the job. Once all resources are released, the system generates and sends a Status_return message with the status set to 0 to the requester. If this message is received while the job is still running, the system generates and sends a Status_return message with the status set to 66 (Execute Still Running) to the requester. Other error status values are returned as appropriate.
[0070] At any time during the algorithm execution, the requester may send an Execute_Cancel message, which looks as follows:
Execute_Cancel
{
char message = 137
}
[0071] If algorithm execution is no longer taking place, the system generates and sends a Return_status message with the status field set to 67 (Execute Not Running) to the requester. If the algorithm execution is still running, the system generates and sends a Return_status message with the status field set to 0 to the requester, once the execution has been stopped.
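The release and cancel rules depend only on whether the job is still running. A minimal sketch, assuming a hypothetical `Job` class with a `running` flag and a dictionary standing in for the job's datasets and other resources:

```python
OK = 0
EXECUTE_STILL_RUNNING = 66
EXECUTE_NOT_RUNNING = 67


class Job:
    """Minimal model of one execution job and its held resources."""

    def __init__(self) -> None:
        self.running = True
        self.resources = {"datasets": ["input_0", "output_0"]}

    def finish(self) -> None:
        """Mark the job as no longer executing."""
        self.running = False

    def release(self) -> int:
        """Handle Execute_Release_Request."""
        if self.running:
            return EXECUTE_STILL_RUNNING
        self.resources.clear()
        return OK

    def cancel(self) -> int:
        """Handle Execute_Cancel."""
        if not self.running:
            return EXECUTE_NOT_RUNNING
        self.running = False  # status 0 is sent once execution has stopped
        return OK


job = Job()
early_release = job.release()
cancel_status = job.cancel()
second_cancel = job.cancel()
final_release = job.release()
```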
Kernel Execution
[0072] A kernel is always one state whereas an algorithm is always more than one state. An HPC (high performance computing) state machine, as defined herein, effectively constitutes a set of 'states', each of which is a single kernel that contains compiled software and state-transitions (or state-vectors) which are the conditions under which control is transferred from one state to another.
[0073] Executing a kernel means that the system executes a simple one-state algorithm that calls the kernel. Because of this, all messaging for dataset retrieval at job start and end, as well as for job cancel and resource release, appears and proceeds like the execution of an algorithm as indicated above. The exceptions are the Kernel_Execute_Request and Kernel_Execute_Response messages, which are used in place of the Algorithm_Execute_Request and System_Algorithm_Execute_Response.
[0074] The format of the Kernel_Execute_Request is as follows:
Kernel_Execute_Request
{
char message = 123,
char Library_Title [32],
char Kernel_Title [32],
int number_of_input_datasets,
char input_dataset_name[32][number_of_input_datasets],
char input_dataset_type[32][number_of_input_datasets],
char input_dataset_element_size[32][number_of_input_datasets],
int input_dataset_size[number_of_input_datasets],
int number_of_output_datasets,
char output_dataset_name[32][number_of_output_datasets],
char output_dataset_type[32][number_of_output_datasets],
char output_dataset_element_size[32][number_of_output_datasets],
int output_dataset_size[number_of_output_datasets],
}
The format of the Kernel_Execute_Response is as follows:
Kernel_Execute_Response
{
char message = 124,
char Library_Title [32],
char Kernel_Title [32],
int number_of_output_datasets,
char output_dataset_name[32][number_of_output_datasets],
char output_dataset_type[32][number_of_output_datasets],
char output_dataset_element_size[32][number_of_output_datasets],
int output_dataset_size[number_of_output_datasets],
}
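As a rough illustration of how a Kernel_Execute_Request might be laid out on the wire, the sketch below packs the listed fields in order. The description does not state byte order, integer width, string padding, or how the per-dataset arrays are interleaved, so the choices here (little-endian 32-bit integers, NUL-padded 32-byte ASCII fields, each array written contiguously) are assumptions, as are the function names.

```python
import struct


def fixed32(text):
    """Encode text as a 32-byte, NUL-padded ASCII field (assumed layout)."""
    raw = str(text).encode("ascii")
    if len(raw) > 32:
        raise ValueError("field longer than 32 bytes: %r" % text)
    return raw.ljust(32, b"\0")


def pack_kernel_execute_request(library, kernel, inputs, outputs):
    """Pack the request; inputs/outputs are (name, type, element_size, size)."""
    parts = [struct.pack("<B", 123), fixed32(library), fixed32(kernel)]
    for datasets in (inputs, outputs):
        parts.append(struct.pack("<i", len(datasets)))        # number_of_*_datasets
        parts += [fixed32(d[0]) for d in datasets]            # *_dataset_name
        parts += [fixed32(d[1]) for d in datasets]            # *_dataset_type
        parts += [fixed32(d[2]) for d in datasets]            # *_dataset_element_size
        parts += [struct.pack("<i", d[3]) for d in datasets]  # *_dataset_size
    return b"".join(parts)


message = pack_kernel_execute_request(
    "Shop_Tools",
    "ConvolveWithFilter",
    inputs=[("BobsImage_1", "uint32", 4, 1024)],
    outputs=[("BobsTargettedImage_1", "uint32", 4, 1024)],
)
```

With one input and one output dataset, this layout produces 1 + 32 + 32 bytes of header followed by two blocks of 4 + 3 × 32 + 4 bytes each.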
Algorithm Definition
[0075] An algorithm is defined herein to be a directly executable Finite State Machine containing a series of steps that are taken to complete a computation or other computer-based activity. These steps, called states, are followed in an order which may be dependent on the algorithm-run parameters or output of a prior step. For each step, the algorithm must define the inputs to be used, the outputs to be produced, and the next step to execute.
[0076] An algorithm definition is a parsable, unambiguous text description composed of more than one state. The algorithm definition is converted by algorithm management module 105 into a table or list containing index numbers, kernels, transition conditions and the index of a to-be-transitioned-to kernel. All characters in the definition are printable ASCII characters. While the state definitions of the algorithm are normally defined to be case insensitive, kernel identifiers (the identifiers of the kernels to be executed) are normally case sensitive. Line feeds, carriage returns and blank characters are ignored. State entries are separated by a semicolon and are comprised of four sub-sections: node count, input datasets, output datasets, and transitions. Although human readability is not required (the algorithm definition may be generated by management system 145), it can be accommodated within this format.
Step Format Definition
[0077] The following sections show the steps of an algorithm (entries in the state transition table or algorithm). The syntax is as follows:
[0078] StateNumber,KernelIdentifier(node cnt)(input datasets)(output datasets)(transitions)
[0079] The StateNumber is an index for this state, giving the rest of the algorithm an index by which to refer to this state. The values of StateNumber should start at 1 and increment by one. It is never 0, as a next state of 0 indicates that execution has completed.
[0080] The KernelIdentifier is the identifier of the kernel to run during this state. It is comprised of the Library_Title and the Kernel_Title of a kernel concatenated with a ':' character between them. It should correspond to the name of a dynamic library which contains the code for the kernel and the function name of the kernel.
[0081] The Node Count subsection defines the number of nodes that are used to execute the kernel. It has the following format:
Min,Opt,Max
[0082] The Min value denotes the minimum number of nodes that are used to run the kernel. This is usually 1. It may also be 0, which also means 1. The Opt value denotes the optimal number of nodes that are used to run the kernel. A value of 0 should be specified if there is no restriction or recommendation on the number of nodes to be used. A value of /n may be specified if a division of the number of nodes specified by the user at run time is to be used. Likewise, a value of *n may be used for a multiple of the number of nodes specified by the user at run time.
[0083] The Max value denotes the maximum number of nodes that are used to run the kernel. This is usually 0, which means that the number of nodes specified by the user at run time is used. If a division or multiple of the number of nodes specified by the user at run time is desired, the same syntax may be used here as for the Opt value.
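One possible reading of the Min,Opt,Max rules is sketched below. It assumes the division form is written `/n` and the multiple form `*n`, that division rounds down but never below one node, and that "no restriction" for Opt is represented as `None`. The function names are illustrative and not part of the described system.

```python
def resolve_field(field, user_nodes):
    """Resolve a literal count, '/n' (division) or '*n' (multiple)."""
    field = field.strip()
    if field.startswith("/"):
        # Assumption: integer division, never dropping below one node.
        return max(1, user_nodes // int(field[1:]))
    if field.startswith("*"):
        return user_nodes * int(field[1:])
    return int(field)


def resolve_node_count(spec, user_nodes):
    """Turn a 'Min,Opt,Max' string into concrete (min, opt, max) values."""
    min_s, opt_s, max_s = spec.split(",")
    minimum = int(min_s) or 1                                # 0 also means 1
    optimal = resolve_field(opt_s, user_nodes) or None       # 0: no recommendation
    maximum = resolve_field(max_s, user_nodes) or user_nodes  # 0: user's run-time count
    return minimum, optimal, maximum


print(resolve_node_count("1,0,0", 8))    # prints (1, None, 8)
print(resolve_node_count("0,/2,*2", 8))  # prints (1, 4, 16)
```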
Input Datasets
[0084] The Input Datasets subsection defines the names of the input datasets that are used to execute the kernel. These names correspond to either a dataset that is an input to the algorithm or a dataset that is produced by the execution of a kernel in another state. The names are positionally dependent on the implementation of the kernel being executed and are separated by commas.
Output Datasets
[0085] The Output Datasets subsection defines the names of the output datasets that are produced when the kernel is executed. The names are positionally dependent on the implementation of the kernel being executed and are separated by commas.
Transitions
[0086] The Transitions subsection defines the next state to be transitioned to after the execution of this kernel. The transitions take the form of condition/next_state pairs separated by commas. The pairs are scanned from left to right and the processing stops when the first true condition is encountered. The state transitions to the corresponding next_state for that condition and all other condition/next_state pairs are ignored.
The format for the Transitions subsection is:
[condition, next_state],[condition,next_state],[condition,next_state]
[0087] To determine the next state, the condition/next_state pairs are examined from left to right. When a condition is found to be true, the next_state is used for the next state to be executed for the algorithm. If no condition is found to be true, the execution of the algorithm is declared to be complete.
Condition
[0088] The format of the condition is as follows:
(dataset_name[index], dataset_element_type, value, comparator)
[0089] The dataset_name must correspond to either a dataset that is an input to the algorithm or a dataset that is produced by the execution of a kernel in another state. Multiple elements are accessed via the use of an index that is used as an offset into the dataset to find a particular element.
[0090] The dataset_element_type is one of the following shown in Table 3:
Table 3
Figure imgf000022_0001
[0091] The value is of the form shown in Table 4:
Table 4
Figure imgf000022_0002
[0092] The comparator is one of the following shown in Table 5:
Table 5
Figure imgf000023_0001
[0093] The evaluation of the condition follows the ANSI-standard C language rules of precedence.
Next State
[0094] The next state is always the index of another state in the algorithm. A next state of 0 (zero) always indicates that the execution of the algorithm is complete.
Final State
[0095] As above, if no condition evaluates to true, that state is declared to be the final state and execution is completed. If the next_state for any conditional that evaluates to true is 0, this also indicates that the execution is complete. If it is desired to make a state the (unconditional) final state in the Algorithm, the format of the Transition section may be shortened to:
(0)
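The first-true-condition rule and the implicit fall-through to state 0 can be sketched as follows. Tables 3 to 5 are not reproduced in the text, so the comparator set shown is an assumption (only `==` appears in the description), every element is treated as a plain integer, and the tuple layout and function name are illustrative. The sample transitions mirror the step 12 example given in this description.

```python
import operator

# Assumed comparator set; only "==" is confirmed by the description.
COMPARATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def next_state(transitions, datasets):
    """Scan condition/next_state pairs left to right; 0 means completion."""
    for (name, index, _element_type, value, comparator), target in transitions:
        if COMPARATORS[comparator](datasets[name][index], value):
            return target  # first true condition wins, the rest are ignored
    return 0  # no condition held: execution is complete


transitions = [
    (("BobsControlData", 27, "uint32", 1, "=="), 13),
    (("BobsControlData", 27, "uint32", 2, "=="), 14),
    (("BobsControlData", 28, "uint32", 1, "=="), 15),
]
control = [0] * 32
control[27] = 2
print(next_state(transitions, {"BobsControlData": control}))  # prints 14
```

An empty transition list behaves like the shortened `(0)` form, since the loop falls straight through to the final `return 0`.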
Example
[0096] The following is an example of a step of an algorithm (an entry in the state transition table that is the algorithm). It has line breaks and indentation to allow clear viewing of the various components.
12,
Shop_Tools:ConvolveWithFilter_bob_MPT_Shop.com_POST
(1,0,0)
(BobsImage_1,BobsTargetSet,BobsFilterSet)
(BobsTargettedImage_1,BobsFilteredTargettedImage_1)
([(BobsControlData[27],uint32,1,==),13],
[(BobsControlData[27],uint32,2,==),14],
[(BobsControlData[28],uint32,1,==),15])
[0097] The above code is step 12 in the algorithm. The KernelIdentifier is:
Shop_Tools:ConvolveWithFilter_bob_MPT_Shop.com_POST
[0098] The kernel comes from the library (Library_Title) Shop_Tools (where the organization name is Shop and the category name is Tools). The Kernel_Title is ConvolveWithFilter_bob_MPT_Shop.com_POST, where the KernelName is ConvolveWithFilter which is owned by user bob@Shop.com. The kernel specified is the one in the posted state.
[0099] The node count specifier says that at least one (1) node must be used; otherwise, there is no restriction on the number of compute nodes to use.
The Input Dataset list has three members (datasets):
BobsImage_1
BobsTargetSet
BobsFilterSet
The Output Dataset list has two members (datasets):
BobsTargettedImage_1
BobsFilteredTargettedImage_1
The Transition has the following meaning:
If ((uint32 *)BobsControlData)[27] == 1 Goto state 13
If ((uint32 *)BobsControlData)[27] == 2 Goto state 14
If ((uint32 *)BobsControlData)[28] == 1 Goto state 15
Implied at the end of this list is:
Else Goto state 0
[0100] This is the end-of-algorithm indicator. If none of the other conditional state transitions apply, the algorithm execution is complete.
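To make the step syntax concrete, here is a minimal parsing sketch for a single state entry. It assumes that whitespace can be stripped before parsing (as the definition rules state), that dataset names and values contain no commas or parentheses, and that each condition carries exactly the four listed parts. The helper names and the returned dictionary layout are hypothetical.

```python
import re


def split_groups(text):
    """Return the text before the first '(' and the top-level (...) groups."""
    head_end = text.index("(")
    groups, depth, start = [], 0, 0
    for i in range(head_end, len(text)):
        if text[i] == "(":
            if depth == 0:
                start = i + 1
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                groups.append(text[start:i])
    return text[:head_end], groups


def parse_state(entry):
    """Parse 'StateNumber,Library:Kernel(nodes)(inputs)(outputs)(transitions)'."""
    entry = re.sub(r"\s+", "", entry)  # line feeds and blanks are ignored
    head, groups = split_groups(entry)
    if len(groups) != 4:
        raise ValueError("expected four parenthesised subsections")
    number, identifier = head.split(",", 1)
    library, kernel = identifier.split(":", 1)
    node_count, inputs, outputs, transitions = groups
    pairs = [
        (tuple(condition.split(",")), int(target))
        for condition, target in re.findall(r"\[\((.*?)\),(\d+)\]", transitions)
    ]
    return {
        "state": int(number),
        "library": library,
        "kernel": kernel,
        "node_count": node_count.split(","),
        "inputs": inputs.split(","),
        "outputs": outputs.split(","),
        "transitions": pairs,  # empty for the shortened (0) form
    }


EXAMPLE = """
12,
Shop_Tools:ConvolveWithFilter_bob_MPT_Shop.com_POST
(1,0,0)
(BobsImage_1,BobsTargetSet,BobsFilterSet)
(BobsTargettedImage_1,BobsFilteredTargettedImage_1)
([(BobsControlData[27],uint32,1,==),13],
[(BobsControlData[27],uint32,2,==),14],
[(BobsControlData[28],uint32,1,==),15])
"""
state = parse_state(EXAMPLE)
```

Splitting on top-level parentheses first keeps the nested condition tuples intact, so the transition pairs can then be pulled out with a single pattern.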
[0101] Figure 3 is a flowchart summarizing an exemplary set of steps performed by the present system in executing an algorithm. As shown in Figure 3, at step 305, for each algorithm to be executed, a definition is provided for the kernels to be used, the names of the input datasets to be used, the names of output datasets to be generated, the next step to execute, and the number of nodes to be used to execute each kernel (or step), as described above.
[0102] Kernels 122 to be executed are added to kernel library 120 at step 310, and algorithms 117 to be executed are added to algorithm library 115, at step 320. In response to an algorithm execution request, existence of the requested algorithm is verified, and the algorithm's execution parameters are verified, at respective steps 325 and 330. The algorithm is then transferred from algorithm library 115 to algorithm execution system 135 at step 333.
[0103] At step 335, the input datasets for the algorithm are transferred to algorithm execution system 135 one at a time, and at step 340, each state transition in the algorithm is evaluated and conditionally executed in order until the final state of the algorithm is reached. Prior to execution, each kernel invoked by the algorithm is transferred from kernel library 120 to the algorithm execution system 135. Finally, at step 345, algorithm completion status and all output data sets are returned to the requester.
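The execution phase (step 340) can be pictured as a loop over the state table. The sketch below is a simplification under stated assumptions: kernels are plain Python callables standing in for code loaded from the kernel library, conditions are predicates over the dataset dictionary, execution starts at state 1, and a `max_steps` guard is added purely for safety. None of these names come from the described system.

```python
def run_algorithm(states, kernels, datasets, start=1, max_steps=1000):
    """Run states until next_state is 0; return the list of visited states."""
    trace = []
    current = start
    while current != 0:
        if len(trace) >= max_steps:
            raise RuntimeError("step limit reached")
        state = states[current]
        kernel = kernels[state["kernel"]]  # stands in for a kernel library fetch
        inputs = [datasets[name] for name in state["inputs"]]
        results = kernel(*inputs)
        datasets.update(zip(state["outputs"], results))
        trace.append(current)
        current = 0  # default: no true condition means completion
        for condition, target in state["transitions"]:
            if condition(datasets):
                current = target
                break
    return trace


# Hypothetical two-state algorithm: count to 3, then double the counter.
states = {
    1: {"kernel": "Demo:Increment", "inputs": ["counter"], "outputs": ["counter"],
        "transitions": [(lambda d: d["counter"][0] < 3, 1), (lambda d: True, 2)]},
    2: {"kernel": "Demo:Double", "inputs": ["counter"], "outputs": ["result"],
        "transitions": []},
}
kernels = {
    "Demo:Increment": lambda c: ([c[0] + 1],),
    "Demo:Double": lambda c: ([c[0] * 2],),
}
datasets = {"counter": [0]}
trace = run_algorithm(states, kernels, datasets)
print(trace, datasets["result"])  # prints [1, 1, 1, 2] [6]
```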
[0104] Table 6, below, is an example of an HPC algorithm definition / HPC State Machine:
Table 6
Figure imgf000025_0001
Figure imgf000025_0002
Figure imgf000026_0001
Figure imgf000026_0002
Figure imgf000026_0003
[0105] Figure 4 is an exemplary illustration of an HPC state machine 401 in a system context. There are several differences between an HPC Algorithm Definition / HPC state machine 401 and a standard state machine. In an HPC State Machine, the state number (state #) is an index that identifies the row in the algorithm definition that holds the state. A zero state index indicates the last-state condition. The state is actually the library and kernel name that is to be executed, as indicated by arrow 403 in Figure 4. The Cluster Node Allocation entry in the HPC state machine provides a way to dynamically specify the minimum, maximum, and optimal node count usable by the executed kernel. The ability to specify a node count means that, unlike a standard state machine, an HPC state machine is suited for parallel processing.
[0106] Selection of the actual node count used is a function of the designated condition. The input and output variable lists consist of the memory address of the variable, the variable type and the number of dimensions associated with the variable. All variables used by a "Condition" must be either variables found in the input/output variable lists, constants, or a computed index value within a loop. This allows a state transition vector to access the required variables without needing to compile special codes into the source code of the kernel.
[0107] A standard high performance computer system does not use a finite state machine (FSM) - it runs computer programs that have Message Passing Interface (MPI) functions embedded in that code. Adding MPI functions to standard software means changing that software and possibly injecting new bugs into it. Using the present method, parallel states can be injected into an FSM without recompiling or changing the original code.
[0108] In a standard state machine, a state transition consists of a condition and a vector linking together two states. In an HPC state machine, a condition can link together a state with multiple states (called a state list). Each state in the state list is executed on a separate node, as indicated by arrows 405 and 406 in Figure 4. Each node can have a single or multiple processors, and/or single or multiple cores, which means that multiple parallel threads on multiple machines can be spawned from a single transition. To collapse back down to a single node, the special transition called "Collapse" is used. This transition both collapses execution down to a single node and transitions to a new state. A loop allows a transition to efficiently call itself, initialize and calculate an index value, and then transition to one or more new states.
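The state-list fan-out and the "Collapse" transition can be illustrated loosely as below. Worker threads stand in for separate cluster nodes, which is a simplification and not the described mechanism. The function names, the kernel table, and the way partial results are combined are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(state_list, kernels, data):
    """Run every state in the state list concurrently, one worker per state."""
    with ThreadPoolExecutor(max_workers=len(state_list)) as pool:
        futures = [pool.submit(kernels[state], data) for state in state_list]
        # Results are gathered in state-list order, not completion order.
        return [future.result() for future in futures]


def collapse(partials, combine, next_state):
    """Merge per-node results on a single node and name the state to enter."""
    return combine(partials), next_state


# Hypothetical kernels keyed by state number.
kernels = {2: sum, 3: max}
partials = fan_out([2, 3], kernels, [1, 2, 3])
merged, state_after = collapse(partials, sum, 4)
print(partials, merged, state_after)  # prints [6, 3] 9 4
```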
[0109] Certain changes may be made in the above methods and systems without departing from the scope of that which is described herein. It is to be noted that all matter contained in the above description or shown in the accompanying drawings is to be interpreted as illustrative and not in a limiting sense. For example, the system shown in the accompanying drawings may include different components than those shown. The following claims are intended to cover all generic and specific features described herein, as well as all statements of the scope of the present method, system and structure, which, as a matter of language, might be said to fall therebetween.

Claims

What is claimed is:
1. A method for execution of applications on a parallel processing computing system including a plurality of computing nodes, the method comprising:
receiving a plurality of algorithms and adding the algorithms to an algorithm library;
receiving kernels and adding the kernels to a kernel library; wherein the kernels are executable programs invoked by execution of the algorithms;
receiving a request for execution of a requested one of the algorithms;
transferring the requested one of the algorithms from the algorithm library to the computing system;
transferring, to the computing system, input datasets used by the algorithm; and
executing the requested one of the algorithms on the computing system using the input datasets, to generate one or more output data sets, wherein each of the kernels invoked by the requested one of the algorithms is transferred from the kernel library to the computing system to effect execution of the algorithm;
wherein each of the algorithms indicate the kernels to be invoked, the names of the input datasets to be used, the names of the output datasets to be generated, and the number of the nodes to be used in executing each step of each of the algorithms.
2. The method of claim 1, wherein each of the algorithms is a state machine that comprises states and state transitions, wherein each of the states is a kernel invocation.
3. The method of claim 1, wherein executing each said kernel comprises executing a one-state algorithm that calls the kernel.
4. The method of claim 1, wherein:
the request is received from a second computing system external to the parallel processing computing system; and
wherein a third computing system performs the functions of receiving the algorithms, receiving the request, transferring the algorithms and the input datasets to the parallel processing computing system.
5. The method of claim 1, wherein each of the algorithms is a state machine comprising a set of states, each of which comprises a single kernel including compiled software, and a list of state-transitions indicating conditions under which control is transferred to another one of the states;
wherein each of the states includes:
indicia for locating the kernel to be executed;
a cluster node allocation entry dynamically specifying minimum, maximum, and optimal number of the nodes usable by the kernel to be executed, and
a state list a conditionally linking together one of the states with one or more other said states;
wherein each of the states is executed on a separate one of the nodes.
6. The method of claim 5, wherein each of the nodes has multiple processors to allow parallel threads on a plurality of the processors to be spawned from a single state transition.
7. The method of claim 5, wherein the indicia for locating the kernel to be executed is an address of a kernel in a kernel library.
8. A system for execution of applications on a parallel processing computing system comprising including a plurality of computing nodes:
an algorithm library containing algorithms;
a kernel library containing kernels used by the algorithms;
an algorithm execution module; and
a kernel execution module;
wherein:
the kernels invoked, names of input datasets used by the algorithm and output data sets from each of the algorithms, and a number of the nodes to be used in executing each step of each of the algorithms are indicated in each of the algorithms;
a requested one of the algorithms is loaded from the algorithm library into the computing system;
the input datasets used by the algorithm are transferred to the computing system;
the requested one of the algorithms is executed by the computing system to generate the output data sets;
the requested one of the algorithms is executed under control of the algorithm execution module and each of the kernels invoked by the algorithm is executed under control of the kernel execution module; and
the kernels are transferred from the kernel library and loaded into the computing system prior to execution thereof.
9. The system of claim 8, wherein each of the algorithms is a state machine that comprises states and state transitions, wherein each of the states is a kernel invocation.
10. The system of claim 8, wherein executing each said kernel comprises executing a one-state algorithm that calls the kernel.
11. The system of claim 8, wherein a request for execution of the requested one of the algorithms is received from a system external to the computing system.
12. The method of claim 8, wherein each of the algorithms is a state machine comprising a set of states, each of which comprises a single kernel including compiled software, and a state-transition indicating a condition under which control is transferred to another one of the states;
wherein each of the states includes:
indicia for locating the kernel to be executed;
a cluster node allocation entry dynamically specifying minimum, maximum, and optimal number of the nodes usable by the kernel to be executed, and
a state list a conditionally linking together one of the states with one or more other said states;
wherein each of the states is executed on a separate one of the nodes.
13. The method of claim 12, wherein each of the nodes has multiple processors to allow parallel threads on a plurality of the processors to be spawned from a single state transition.
14. The method of claim 12, wherein the indicia for locating the kernel to be executed is an address of a kernel in a kernel library.
15. A state machine executable on a parallel processing computing system including a plurality of computing nodes comprising:
a set of states, each of which comprises a single kernel including compiled software and a state-transition indicating a condition under which control is transferred to another one of the states;
wherein each of the states includes:
indicia for locating the kernel to be executed;
a cluster node allocation entry dynamically specifying minimum, maximum, and optimal number of the nodes usable by the kernel to be executed, and
a state list a conditionally linking together one of the states with one or more other said states;
wherein each of the states is executed on a separate one of the nodes.
16. The state machine of claim 12, wherein each of nodes has multiple processors to allow parallel threads on a plurality of the processors to be spawned from a single state transition.
17. The state machine of claim 12, wherein the indicia for locating the kernel to be executed is an address of a kernel in a kernel library.
PCT/US2011/048134 2010-08-17 2011-08-17 System and method for execution of high performance computing applications WO2012024435A2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP11818745.9A EP2606424A4 (en) 2010-08-17 2011-08-17 System and method for execution of high performance computing applications
JP2013524967A JP2013534347A (en) 2010-08-17 2011-08-17 System and method for execution of high performance computing applications

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US37450110P 2010-08-17 2010-08-17
US61/374,501 2010-08-17

Publications (2)

Publication Number Publication Date
WO2012024435A2 true WO2012024435A2 (en) 2012-02-23
WO2012024435A3 WO2012024435A3 (en) 2012-05-03

Family

ID=45605663

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2011/048134 WO2012024435A2 (en) 2010-08-17 2011-08-17 System and method for execution of high performance computing applications

Country Status (3)

Country Link
EP (1) EP2606424A4 (en)
JP (1) JP2013534347A (en)
WO (1) WO2012024435A2 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9851949B2 (en) 2014-10-07 2017-12-26 Kevin D. Howard System and method for automatic software application creation
US10496514B2 (en) 2014-11-20 2019-12-03 Kevin D. Howard System and method for parallel processing prediction
US11520560B2 (en) 2018-12-31 2022-12-06 Kevin D. Howard Computer processing and outcome prediction systems and methods
US11687328B2 (en) 2021-08-12 2023-06-27 C Squared Ip Holdings Llc Method and system for software enhancement and management
US11861336B2 (en) 2021-08-12 2024-01-02 C Squared Ip Holdings Llc Software systems and methods for multiple TALP family enhancement and management

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11110351A (en) * 1997-10-08 1999-04-23 Hitachi Ltd State transition control method
US7418470B2 (en) * 2000-06-26 2008-08-26 Massively Parallel Technologies, Inc. Parallel processing systems and method
JP4596781B2 (en) * 2002-01-10 2010-12-15 マッシブリー パラレル テクノロジーズ, インコーポレイテッド Parallel processing system and method
CN101371264A (en) * 2006-01-10 2009-02-18 光明测量公司 Method and apparatus for processing sub-blocks of multimedia data in parallel processing systems
US8424003B2 (en) * 2006-05-31 2013-04-16 International Business Machines Corporation Unified job processing of interdependent heterogeneous tasks using finite state machine job control flow based on identified job type
US8136104B2 (en) * 2006-06-20 2012-03-13 Google Inc. Systems and methods for determining compute kernels for an application in a parallel-processing computer system
US8381202B2 (en) * 2006-06-20 2013-02-19 Google Inc. Runtime system for executing an application in a parallel-processing computer system
US20080172677A1 (en) * 2007-01-16 2008-07-17 Deepak Tripathi Controlling execution instances
US8423749B2 (en) * 2008-10-22 2013-04-16 International Business Machines Corporation Sequential processing in network on chip nodes by threads generating message containing payload and pointer for nanokernel to access algorithm to be executed on payload in another node

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of EP2606424A4 *

Also Published As

Publication number Publication date
JP2013534347A (en) 2013-09-02
EP2606424A2 (en) 2013-06-26
WO2012024435A3 (en) 2012-05-03
EP2606424A4 (en) 2014-10-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11818745

Country of ref document: EP

Kind code of ref document: A2

ENP Entry into the national phase in:

Ref document number: 2013524967

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase in:

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2011818745

Country of ref document: EP