WO2022093936A1 - Batch processing - Google Patents

Batch processing

Info

Publication number
WO2022093936A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
service
run
services
configuration file
Application number
PCT/US2021/056816
Other languages
French (fr)
Inventor
Jinlyung CHOI
Cesar Emilio Cardona URIBE
Satish VISWANATHAM
Paul Michael LORIAUX
Joanna Catherine Ceolane DREUX
Brendan Michael WEE
Original Assignee
Second Genome, Inc.
Application filed by Second Genome, Inc.
Publication of WO2022093936A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system

Definitions

  • the present document generally relates to computer technology for batch processing, such as cloud-based batch processing.
  • the disclosed technology is generally directed to a platform and architecture to provide improved cloud-based batch processing for processing, for example, large data sets.
  • the disclosed technology can provide for robust, dynamic microservice-based workflows that incorporate a reusable template, sometimes called an inner-state machine, as the basis of each job in a batch.
  • an inner-state machine can be adapted to each specific job through specifications that are passed into each instantiated instance of the inner-state machine, which can permit advantageous features built into the inner-state machine (e.g., graceful error handling, data handling) to be realized for each job without having to be specifically delineated for each job, for example.
  • the disclosed technology uses an architectural style where applications are decomposed into loosely coupled services.
  • the code can be broken into smaller services that run as separate jobs. These jobs can be independently run and the output from one service can be used as an input to another service.
  • Such a decentralized architecture can be less prone to failures.
  • This architecture can also be coded, tested, deployed and scaled independently.
  • a system can be used for batch processing of services.
  • the system includes one or more processors; and computer memory.
  • the computer memory can store instructions that, when executed by the processors, cause the processors to perform operations comprising: receiving a request to run one or more services in a pipeline; wherein the request comprises data to be operated on; chunking the data to generate data-chunks, each data-chunk comprising some, but not all, of the data; wherein all data-chunks combined comprise all of the data; storing the data-chunks in locations specified by one or more references; generating an initial configuration file to be used as a current configuration file; wherein the initial configuration file comprises the one or more references; and for each service of the one or more services of the request, after generation of the initial configuration file: executing at least one instance of the service using the initial configuration file to access at least one data-chunk in order to generate a new configuration file to be used as the current configuration file; wherein executing the at least one instance comprises generating, from the at least one data-chunk,
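  • the following is a minimal, runnable sketch of these claimed operations (chunk the data, store each chunk behind a reference, generate an initial configuration file carrying the references, then let each service consume the current configuration and emit a new one). The names, chunk size, and toy service are illustrative assumptions, not the patented implementation.

```python
import json
import math
import uuid
from pathlib import Path

# Hypothetical chunk size; the patent leaves sizing to the service's
# file-management settings in the config file.
CHUNK_SIZE = 1024  # bytes per data-chunk

def chunk_and_store(data: bytes, workdir: Path) -> list[str]:
    """Split the request's data into data-chunks and store each chunk,
    returning one reference (here, a file path) per chunk."""
    workdir.mkdir(parents=True, exist_ok=True)
    refs = []
    for i in range(math.ceil(len(data) / CHUNK_SIZE)):
        ref = workdir / f"chunk-{i:05d}.bin"
        ref.write_bytes(data[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE])
        refs.append(str(ref))
    return refs

def run_pipeline(data: bytes, services: list, workdir: Path) -> dict:
    """Generate an initial config file holding the chunk references,
    then run each service against the current config, letting every
    service emit a new config that becomes the current one."""
    config = {"run_id": str(uuid.uuid4()),
              "references": chunk_and_store(data, workdir)}
    for service in services:
        # Each service reads data-chunks via the references and returns
        # an updated config (new references, parameters, status, ...).
        config = service(config)
    return config

# Toy service: uppercases each chunk and records the new references.
def uppercase_service(config: dict) -> dict:
    new_refs = []
    for ref in config["references"]:
        out = Path(ref).with_suffix(".out")
        out.write_bytes(Path(ref).read_bytes().upper())
        new_refs.append(str(out))
    return {**config, "references": new_refs}

result = run_pipeline(b"acgt" * 1000, [uppercase_service], Path("/tmp/run"))
print(json.dumps(result, indent=2)[:200])
```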
  • Implementations can include any, all, or none of the following features. The request is received from a client device geographically remote from the one or more processors and the computer memory, and in data communication with the one or more processors and the computer memory. At least some of the services are executed in parallel with each other. At least some services are executed in series. The output of some services is used as input for some other services.
  • the system is configured to: monitor the operations to determine if the operations cause an error; and halt, in response to determining that the operations cause an error, the operations.
  • the system is further configured to generate, responsive to determining that the operations cause an error, an error message containing information about the service.
  • the system is further configured to record, in a run-trace datastore, trace information about the execution of the instances of a service, the trace information comprising parameters related to operations of the system as the processors perform the operations.
  • the system is further configured to: receive a query identifying the request; and respond to the query with at least some of the trace data.
  • the disclosed technology can provide improved batch-processing of computer jobs.
  • the disclosed technology can be trackable, resilient, scalable, flexible, testable, and deployable.
  • Existing cloud services may not provide for flexibility and complexity in batch processing, such as piping output from one process into input for another process without limits, as well as formatting the output that is piped into input.
  • the disclosed technology can increase efficiency in processing large amounts of data.
  • Existing approaches to batch processing may not provide a comprehensive framework for tracking parameters for a run or components (e.g., service executions) of that run.
  • Existing approaches may also limit workflows in the number of services that can be executed during a run, thereby limiting scalability.
  • the disclosed technology can provide for a scalable framework that accepts multiple services to be executed in parallel, where each service can have multiple different parameters. Moreover, one or more output files from different services can be formatted during run execution and provided as input to one or more other services being executed. This feature can make runs more efficient to perform, no matter how extensive or varied a run pipeline is. Moreover, each run can be identified by a unique identifier. That unique identifier can be used by a user who requested the run in order to track progress through run execution. The identifier can create a trail for logging progress and errors that occur during execution of any of the services within the run pipeline.
  • parameters for a run as well as components (e.g., services) of the run can be tracked in real-time (e.g., with a powerful query language provided by a service such as AMAZON).
  • the disclosed technology can advantageously assist users in more quickly identifying, diagnosing, and remediating errors in the pipeline. As a result, users can rerun only parts of the pipeline rather than rerun the entire pipeline, thereby improving run efficiency.
  • high-level notifications can be provided to the user who requested a run, which can assist the user in more quickly identifying errors in the run execution.
  • Traditional batch processing services provide the user with output containing both necessary and unnecessary information. The user would have to meticulously comb through such information to identify any errors or other parameters/conditions that are important to run execution.
  • the disclosed technology can provide the user with output that indicates whether the entire run was successful and if not, where there were errors.
  • the notification for a failed run can include information about which service experienced an error, whether the error was a service or data error, and other information that can be beneficial to a user in resolving that error.
  • the notification may not, for example, include information about every service that was executed, such as those that were successful.
  • the disclosed technology can provide for more efficient batch processing, run reporting, and faster error identification.
  • the disclosed technology can provide for increased traceability through run execution.
  • users can more easily move backwards from run execution output all the way to run execution input. In doing so, the user can more readily determine and identify what operations were performed, where there may have been errors in execution of any step in the run, and why the run execution output resulted.
  • the user not only can view run execution results but also trace an entire process of obtaining those results.
  • the user can view and query not only high-level information about the run execution but also more detailed information about execution of every step in the run and/or every service in the run in order to track any information in the run execution.
  • the disclosed technology can provide for dynamic batch processing.
  • the disclosed technology can provide a platform that is able to dynamically change to varied and changing batch processing demands, such as introducing new and different processing pipelines and tests, introducing new and different types of data, introducing new and different output requirements, introducing new and different processing dependencies among pipelines, introducing new and different testing parameters, and/or others.
  • because the core of the platform is a dynamically customizable inner state machine, just about any change or modification that is conceived of can be implemented and dynamically accommodated by the platform.
  • the disclosed technology can be easily deployable in any cloud environment.
  • a script or marked up document can outline (e.g., consolidate) all entities, processes, techniques, and/or methods of the disclosed technology. That script or marked up document can then be deployed into any cloud environment such that the disclosed technology can be created and implemented.
  • the disclosed technology can be used across different industries, services, platforms, and computing environments.
  • FIG. 1 is a conceptual diagram of an example cloud-based batch processing as described herein.
  • FIG. 2 is a system diagram of an example cloud-based batch processing environment.
  • FIG. 3 is a flowchart of a process for executing cloud-based batch processing.
  • FIG. 4 is a flowchart of a process for executing an instance in FIG. 3.
  • FIGS. 5A-B are exemplary notifications for executed cloud-based batch processes.
  • FIG. 6 is a flowchart of a process for generating notifications.
  • FIG. 7 is a flowchart of a process for using a config file for cloud-based batch processing.
  • FIGS. 8A-J are exemplary code segments during cloud-based batch processing as described herein.
  • FIGS. 9A-C are exemplary code segments for querying results from cloud-based batch processing as described herein.
  • FIG. 10 shows an example of a computing device and an example of a mobile computing device that can be used to implement the techniques described here.
  • the disclosed systems and methods are generally directed to providing for cloud-based batch processing in a scheme that uses an updating or “living” config file to manage inputs and outputs through the batch processing.
  • the disclosed technology provides for efficient processing of run requests.
  • a workflow can have a pipeline of numerous services (e.g., jobs) to be executed.
  • a config file can be associated with the workflow.
  • the config file can provide information for each of the services to be executed, how to clean and/or modify output from each service, and how to translate each service output into input for another service.
  • batch processing in parallel can be possible and more efficient, regardless of how many services are to be executed in the workflow and/or how many parameters are to be used for each service execution.
  • workflows can be translated into a state machine that is easy to understand, document, and modify. Moreover, each step or service of execution in the workflow can be monitored such that a user can more quickly identify and fix errors in the workflow. High-level notifications can be generated and presented to the user that articulate why a workflow failed. Such notifications can also identify which step or service experienced an error that caused the workflow to fail. Therefore, the user can more easily and quickly identify and remediate errors in the workflow, even if the user does not have a full understanding of the computing platform in which the workflow is running. This can advantageously extend the use of these features to users with less training on the computing platform.
  • the batch processing described throughout this disclosure can provide for more efficient and automatic scaling of each workflow.
  • a user can request 8 jobs, each job requiring 1 central processing unit (e.g., CPU) and 4 gigabytes (e.g., Gb) of memory.
  • the disclosed technology can auto scale the 8 jobs by using one instance that has 8 CPUs and 32 Gb of memory instead of using 8 smaller, independent instances.
  • workflows can be performed quicker and more efficiently.
  • this provides for more scalable systems and methods to meet demands of different workflows and services that are requested per workflow. More so, continuous integration and delivery techniques can be employed with the disclosed technology to assist developers in delivering code changes more frequently and reliably.
  • unit-tests and automatic deployment can be more easily conducted.
  • Largescale architectural deployment can also be made possible by the disclosed technology.
  • This example, for purposes of clarity, describes a relatively simple set of job requirements. However, it will be understood that this technology can handle much more complex requirements (e.g., different CPU requirements for each job, variable memory requirements depending on the output of previous jobs in the workflow).
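  • as a minimal sketch of the scaling arithmetic in the 8-job example above (8 jobs needing 1 CPU and 4 Gb each are consolidated onto one instance with 8 CPUs and 32 Gb rather than 8 smaller instances), the function below picks the smallest instance that fits all jobs at once. The job shape and instance catalog are illustrative assumptions.

```python
def pack_jobs(num_jobs: int, cpus_per_job: int, gb_per_job: int,
              instance_types: list[tuple[str, int, int]]) -> str:
    """Return the smallest instance type that fits all jobs at once."""
    need_cpu = num_jobs * cpus_per_job   # 8 jobs x 1 CPU  = 8 CPUs
    need_gb = num_jobs * gb_per_job      # 8 jobs x 4 Gb   = 32 Gb
    for name, cpus, gb in sorted(instance_types, key=lambda t: t[1]):
        if cpus >= need_cpu and gb >= need_gb:
            return name
    raise ValueError("no single instance fits; fall back to multiple")

# Hypothetical instance catalog: (name, CPUs, Gb of memory).
catalog = [("small", 2, 8), ("medium", 4, 16), ("large", 8, 32)]
print(pack_jobs(8, 1, 4, catalog))  # -> "large" (8 CPUs, 32 Gb)
```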
  • FIG. 1 is a conceptual diagram of an example cloud-based batch processing as described herein.
  • a cloud-based batch processing environment can include a cloud system 100 and a user device 102.
  • the cloud system can include a run trace store 110, which can be configured to store one or more parameters or other information associated with running a request.
  • the cloud system 100 can receive a request to run one or more jobs (e.g., workflows) from the user device 102 in A.
  • the run request can specify an analysis pipeline of one or more services (e.g., processes) to complete.
  • the cloud system 100 can receive more than one request from more than one user device.
  • the cloud system 100 can run the requested jobs in B. Once running the request is completed, results and/or notifications about the run can be generated and sent to the user device 102 in C. As described below, the notifications can provide a user at the user device 102 with information as to whether a run was successful or not. The notifications can include information about which services in the run failed and reason(s) for those failures. The notifications are beneficial for the user to more quickly understand, address, and fix data and/or service errors.
  • the cloud system 100 can receive data 104 and a config file 106.
  • the cloud system 100 can autogenerate the config file 106 based on the received data 104, that is to say, without specific user input other than the submission of the data 104.
  • the data 104 can include information about the user’s run request, including technical information (e.g., services to be run, input files/data to be used for the services) and other information (e.g., billing information, user identifiers or contact information).
  • the config file 106 can be updated with information about the run as well as pointer(s) to the data 104 for a first service to be executed.
  • This information can be stored in the run trace store 110.
  • the information stored in the run trace store 110 can include, e.g., a run ID, execution link, start time, etc.
  • the run information can also be beneficial for the user to more easily identify errors that occur during execution of any of the services in the workflow.
  • the information stored in the run trace store 110 can be used to provide the user with a high-level view of the overall run request as well as more detailed views of each of the executed services. The user can then more easily and quickly trace each step of each executed service in the run request.
  • the updated/prepared config file 106 then becomes input for batch processing in the cloud system 100.
  • the config file 106 can be input for each of the executed services, which provides for more efficient parallel processing of the services.
  • the config file 106 can be continuously updated during execution of the run. Updating the config file 106 can include generating/outputting new config files for each executed service, which can then be provided as input to other services in the pipeline. Information that is added to updated config files can be stored in the run trace store 110.
  • Batch processing in the system 100 includes spinning up various instances 108A-N during a run.
  • Each of the instances 108A-N can be different services in the pipeline.
  • Each of the instances 108A-N can also be multiple executions of a single service in the pipeline.
  • Each of the instances 108A-N can be spun up simultaneously.
  • a bid can be submitted to the system 100 for each of the instances 108A-N.
  • one or more of the instances 108A-N can be selected for processing based on their bids while one or more other of the instances 108A-N can wait in a queue until their bids are received.
  • Each instance 108A-N can receive the necessary data 104 to run the instance based on the pointer indicated in the config file 106.
  • a first instance 108 A can receive the config file 106 as input.
  • the config file 106 can be updated and outputted from the instance 108A as config file 106’.
  • the config file 106’ and every updated config file thereafter can be separate config files that expand on the original config file 106.
  • These updated config files can be input for services further down the run pipeline. This is advantageous so that the updated config files can be re-used across multiple different services and/or instances that are executed in parallel.
  • the disclosed technology is scalable and efficient in batch processing.
  • the config file 106 can be stored.
  • one or more parameters from the config file 106 and every updated config file thereafter can be stored. This is beneficial for more efficient traceability. In other words, with the stored parameters, more detailed, as well as high-level, views of a run and any executed services therein can be provided to the user.
  • Updating and outputting the config file 106 includes preparing the config file to be input into another process/service.
  • the updated config file 106’ can be fed down the pipeline as input into instance 108B.
  • the config file 106’ can also be fed as input into one or more other instances 108C-N in parallel/simultaneously.
  • the config file 106’ can be updated and outputted from the instance 108B as config file 106”.
  • the updated config file 106” can then be fed down the pipeline as input into instance 108C.
  • the config file 106’’ can be updated and outputted from the instance 108C as config file 106’’’.
  • the updated config file 106’” can then be fed down the pipeline as input into any additional instances, such as instance 108N.
  • a final updated config file 106’’’’ can be outputted.
  • the system 100 can identify results for each of the executed services and/or the overall run.
  • the system 100 can determine a notification to send to the user device 102 that indicates whether the overall run was a success or a failure.
  • the run request can be identified as failed where at least one service execution in the run experienced an error.
  • FIG. 2 is a system diagram of an example cloud-based batch processing environment.
  • the environment can include the cloud system 100 and the user device 102, as depicted and described in reference to FIG. 1.
  • the system 100 and device 102 can communicate via network(s) 200 (e.g., wireless and/or wired, an intranet, the Internet).
  • the cloud system 100 can include a batch processing module 204, and a network interface 212.
  • the batch processing module 204 can include a preparation module 206, a run module 208, and a reporting engine 210.
  • the run can be prepared and executed by the batch processing module 204 (e.g., refer to B in FIG. 1, FIG. 3, FIGS. 7-8).
  • the preparation module 206 can be configured to prepare a config file for use in executing the run request.
  • the module 206 can also scale each of the services in the run pipeline. In some implementations, scaling the services can be performed before executing each service. In other implementations, one or more services can be scaled before executing any service. As part of scaling a service, input data for that service can be chunked/separated into smaller sizes/files such that the data can be more efficiently and quickly processed.
  • the module 206 can also be configured to, after running each service, dechunk or stitch back together data that was previously chunked.
  • the run module 208 can be configured to run the requested services once they are prepared by the preparation module 206. As described throughout, running the services includes spinning up one or more instances (e.g., refer to B in FIG. 1, FIG. 3). While the run module 208 executes each of the instances, a config file updater 209 can be configured to continuously update the config file. The updater 209 can also clean and/or filter output results from each service such that the output results can be provided as input to any other instances/services in the run (e.g., refer to FIG. 3). The updater 209 can also clean/filter the output results such that the results are easier to use for identifying errors in execution of one or more services (e.g., refer to FIGS. 7-8).
  • the run module 208 can include a tracing module 207.
  • the tracing module 207 can be configured to store information (e.g., parameters) associated with the prepared config file and any updated config file thereafter in the run trace store 110. As a result, the tracing module 207 can be used to create a trackable trail of every step through execution of the services in the run. This is beneficial for the user to more easily and quickly trace through an entire run to see where errors might have occurred.
  • the tracing module 207 can generate tracing data, which can include parameters about what happened during an execution step, what steps were performed during overall execution of a service or run, what input was received, what output was generated, etc.
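  • a minimal sketch of such a tracing module, assuming a JSON-lines file as a stand-in for the run trace store 110; the field names are illustrative assumptions.

```python
import datetime
import json
from pathlib import Path

# Stand-in for the run trace store 110: one JSON record per line.
TRACE_PATH = Path("/tmp/run_trace.jsonl")

def record_trace(run_id: str, service: str, step: str,
                 inputs: list[str], outputs: list[str]) -> None:
    """Append what happened during one step: which service ran, which
    step it performed, and what input/output references it touched."""
    record = {
        "run_id": run_id,
        "service": service,
        "step": step,
        "inputs": inputs,
        "outputs": outputs,
        "timestamp": datetime.datetime.utcnow().isoformat(),
    }
    with TRACE_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")

record_trace("run-001", "taxonomic_classification", "execute",
             ["/tmp/run/chunk-00000.bin"], ["/tmp/run/chunk-00000.out"])
```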
  • the reporting engine 210 can be configured to review a final, updated config file (e.g., cleaned, output results) to determine whether the run was a success or a failure.
  • the run can be a success where each of instances are successfully executed by the run module 208.
  • the run can be a failure where at least one of the instances is not successfully executed by the run module 208 (e.g., there was a data and/or service error in any one of the instances in the run).
  • the reporting engine 210 can generate notifications about a success or fail state of the run (e.g., refer to FIG. 6).
  • the notifications generated by the reporting engine 210 can be sent to the user device 102 (e.g., refer to C in FIG. 1).
  • the network interface 212 can facilitate communication of the cloud system 100 and the user device 102 over the network(s) 200.
  • the cloud system 100 can be in communication with the run trace store 110 and a data store 214.
  • the run trace store 110 can be a serverless query service that makes it easier to analyze data (e.g., output) from run executions. For example, SQL commands can be used to query the data.
  • the store 110 can include run data tables and views that register history of service executions, input files, parameters, and result directories.
  • the store 110 can also include service derived tables with selected results that can be queried as input to nominations and other analysis.
  • a first run data table can be a table that aggregates all execution information in a single place. Additional views can allow the user to more easily query parts of this data.
  • each run can save a run data JSON file in the store 110.
  • Such files can provide files that were run through specific services, parameters used with the services, and other metadata (e.g., refer to FIG. 9B).
  • the structure of these files can become nested, so they can be parsed, searched, and/or filtered based on a run ID, workflow name, group, read strand, file name, service name, parameter name, and/or parameter value (e.g., refer to FIG. 9A). Additional values can be used to parse, search, and/or filter these files.
  • a second view table can provide the user with a quick and easy way to look at combined files, service information, and parameter information. In other words, the second view table can provide the user with a more general view of the overall run.
  • Specific views can also be provided for the user. For example, an input files table view can be provided, which lists the input files associated with a specific run. A run summary view can be provided, which lists general execution information of the specific run. A service run view can be provided, which lists service run details for the specific run. A service parameters view can also be provided, which lists details of the services and their associated parameters that were used for the specific run. A service source ID view can also be provided, which can list run details for each source ID that is processed by each service and run. Moreover, each of the services can generate ready-to-see details, or derived data, on each of the services. The derived data and run data described above can be joined together to answer traceability questions that the user may have.
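  • a minimal sketch of querying such trace records for a single run ID. In a serverless query service this would be a SQL statement over the registered run data tables (the table and column names below are illustrative assumptions); the Python shown queries the JSON-lines stand-in from the tracing sketch above.

```python
import json
from pathlib import Path

# Roughly the equivalent SQL against a hypothetical run_data table:
#
#   SELECT service, step, outputs
#   FROM run_data
#   WHERE run_id = 'run-001';
#
def query_run(run_id: str, trace_path: Path) -> list[dict]:
    """Return every trace record belonging to one run."""
    records = [json.loads(line)
               for line in trace_path.read_text().splitlines()]
    return [r for r in records if r["run_id"] == run_id]

for r in query_run("run-001", Path("/tmp/run_trace.jsonl")):
    print(r["service"], r["step"], r["outputs"])
```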
  • the data store 214 can store data that is used by the batch processing module 204.
  • the data store 214 can also store the config file (e.g., an original config file and updated config files or output files for each of the executed services) as described throughout this disclosure.
  • the data store 214 can store cleaned output results for each of the executed services.
  • the reporting engine 210 can then access each of the cleaned output results from the data store 214 to more quickly and efficiently determine whether the run was a success or failure.
  • Data store 214 can be one or more different storage services.
  • the data store 214 can be a database (e.g., cloud-based) for storing unstructured data.
  • the data store 214 can also be a database (e.g., cloud-based) for storing structured (e.g., filtered, processed, and/or prepared) data.
  • data can additionally be stored in different services within the data store 214.
  • the different storage services may offer different service levels (e.g., one may be faster to search structured data, one may be faster to read and write unstructured data), and the particular storage services used may be selected based on technology needs of the various types of data and access.
  • the user device 102 can include a display 216, input device(s) 218, and a network interface 220.
  • the user device 102 can be a mobile device, such as a smartphone and/or tablet.
  • the device 102 can also be a computing device, such as a computer or laptop.
  • the display 216 can provide a graphical user interface (GUI) to a user of the user device 102. Notifications about a state of the run can be displayed on the display 216.
  • the input device(s) 218 can include a touchscreen (e.g., the display 216), a keyboard, a mouse, or any other type of input device, such as a microphone.
  • the user can input, using the input device(s) 218, requests to run services.
  • the display 216 can present an application (e.g., web-based), software, or other interface to the user that prompts the user to enter run requests.
  • the user-inputted requests can then be sent to the cloud system 100.
  • the network interface 220 can facilitate communication of the user device 102 and the cloud system 100 over the network(s) 200.
  • the user can provide a mapping file (e.g., text file) as part of the user’s run request.
  • the mapping file can indicate what services are requested to be run and what input files/data from the data 104 are used for each service.
  • the system 100 can use the mapping file to autogenerate a config file.
  • the config file can then be used to execute the run request.
  • FIG. 3 is a flowchart of a process 300 for executing cloud-based batch processing. The process 300 can be performed by the cloud system 100 described throughout this disclosure and/or other cloud-computing systems.
  • the cloud system can receive a request.
  • the request can be received from a user device, as described herein (e.g., refer to FIG. 1).
  • the request can include one or more services (e.g., jobs, processes) that the user would like to be run in a pipeline (e.g., workflow).
  • the cloud system can also receive data, or pointer(s) to the data in a data store, associated with the run request.
  • the cloud system can automatically generate and/or populate a config file with data to run the services.
  • the data can also be used to update a config file during one or more parts of the process 300 as described herein.
  • the config file can be used as input for running an instance. As described herein, the config file can be continuously updated and used as input for one or more other instances during run execution.
  • the cloud system can determine whether to execute one or more services in the request. In some examples, this can be determined before the request is submitted and/or this determination can be made by the user. If the cloud system determines that it cannot execute the service, then execution for that service is skipped (330). A service in the run request can be skipped while other services are executed. For example, the run request can define every service that can be run but then only certain services of that pipeline are chosen to be executed. Services that are not chosen to be executed can merely be skipped over so that other services that are chosen to be executed can be run.
  • the cloud system can chunk data associated with the service in 304 into smaller files. Whether to chunk data can depend on each of the services to be executed in the run. In other words, one or more services can require data to be chunked before processing while one or more other services do not require data to be chunked.
  • the config file can include file management information for each service.
  • the file management information can include a boolean value (e.g., TRUE/FALSE) that indicates whether data should be chunked before the service is executed.
  • Chunking the data can include splitting up input data into a series of input files. Each of the input files can have a portion of the aggregate input data received in 301. Chunking is beneficial to reduce costs and time in processing the input data. For example, by breaking up the input data into smaller chunks or files, the smaller chunks can be run quicker through smaller systems and/or memory. The chunked data can then be stitched back together later in the process 300 (e.g., refer to dechunking in 318).
  • a decision to chunk the input data can be made by the cloud system in 301 and/or 302. As mentioned, not all input data may require chunking. For example, input data for one service can already be small enough in size that it can be processed via a smaller system while input data for another service can be too large for processing. Chunking in 304 can then be skipped for the input data that is already an appropriate size.
  • If the chunking process fails, then an error can be identified in 328. Once the error is identified, a notification can be generated and sent in 320, as described below.
  • Where chunking is successfully completed, output items can be created in 306.
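  • a minimal sketch of the chunk/dechunk round trip referenced above, assuming line-oriented input files (e.g., sequence records); a real service would honor the chunking flag and chunk size recorded in the config file.

```python
from pathlib import Path

def chunk_file(src: Path, out_dir: Path, lines_per_chunk: int) -> list[Path]:
    """Split one large input file into a series of smaller input files,
    each holding a portion of the aggregate input data."""
    out_dir.mkdir(parents=True, exist_ok=True)
    lines = src.read_text().splitlines(keepends=True)
    chunks = []
    for i in range(0, len(lines), lines_per_chunk):
        chunk = out_dir / f"{src.stem}.part{i // lines_per_chunk:04d}"
        chunk.write_text("".join(lines[i:i + lines_per_chunk]))
        chunks.append(chunk)
    return chunks

def dechunk(chunks: list[Path], dst: Path) -> None:
    """Stitch the processed chunks back together in their original order."""
    dst.write_text("".join(c.read_text() for c in sorted(chunks)))
```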
  • the config file can be expanded to include chunked data and other data (e.g., unchunked data, metadata, parameters) needed for each of the services in the run.
  • output of one service can be input to another service. Therefore, a lambda can be used to map (e.g., run transformations on) a service’s inputs and outputs such that they can reside in the config file in a predictable way.
  • the service’s inputs and outputs can be correctly formatted, using the config file so that the service’s outputs can be easily inputted into another service in the pipeline, whether at a later time or in parallel execution.
  • the expanded data (e.g., updated config file) can be stored in a temporary data store (e.g., the data store 214 in FIG. 2).
  • the expanded data can include information such as a list of integers for each service.
  • the list of integers can indicate a number of instances to be spun up for that particular service (e.g., execute instance(s) in 308).
  • the integers in the list can also be pointers that point to locations in the data store along with other metadata needed to execute the service.
  • the list of integers can be passed to preparing an instance for execution in 310, rather than passing in the actual parameters for each service.
  • the cloud system can use the pointers to identify which data (e.g., output from a previously-run instance) to pull as that instance’s input.
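  • a minimal sketch of this pointer hand-off, with an in-memory dict standing in for the data store; the parameter shapes are illustrative assumptions.

```python
# Stand-in for the data store holding per-instance parameters.
parameter_store: dict[int, dict] = {}

def create_output_items(per_instance_params: list[dict]) -> list[int]:
    """Store each instance's parameters; return one integer pointer per
    instance instead of passing the actual parameters."""
    pointers = []
    for params in per_instance_params:
        idx = len(parameter_store)
        parameter_store[idx] = params
        pointers.append(idx)
    return pointers

def prepare_instance(pointer: int) -> dict:
    """Pull the parameters an instance needs using only its pointer;
    a dangling pointer is the kind of failure identified in 314."""
    if pointer not in parameter_store:
        raise KeyError(f"pointer {pointer} resolves to nothing")
    return parameter_store[pointer]

pointers = create_output_items([{"input": "chunk-0"}, {"input": "chunk-1"}])
print([prepare_instance(p) for p in pointers])
```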
  • the config file is continuously updated/transformed with each executed instance. Therefore, creating output items in 306, which includes updating the config file, can occur with every execution of an instance in the pipeline.
  • An instance of the service can be executed in 308.
  • the cloud system can spin up one or more instances using the config file that has been updated.
  • one or more instances can be executed per service.
  • Various other determinations for how many instances to run can be made by the cloud system, a user at the user device, information in the config file, and/or other systems.
  • one or more instances can be executed in parallel/ simultaneously.
  • tracing data can be generated and stored (e.g., refer to the run trace store 110 in FIGS. 1-2). This tracing data can be used by the user in order to track progress through every step in the executed instance, service, as well as overall run.
  • the tracing data can, as described throughout this disclosure, indicate what happened during each step of run execution, what steps were performed during execution, etc.
  • the instance can be prepared in 310 as part of executing the instance.
  • Preparing the instance can include using the list of integers (e.g., pointers) provided from creating output items in 306 to pull necessary parameters to run that instance.
  • An updated config file (e.g., the config file was updated during execution of one or more instances before this particular instance) can also be pulled when preparing the instance in 310.
  • preparing the instance can include generating a new config file based on the previous config file (e.g., refer to FIGS. 7-8). A new config file can be generated per service execution. That config file can then be used as input to one or more other services in the pipeline.
  • if preparing the instance fails, an error can be identified in 314. For example, preparing the instance can fail where an integer in the list does not point to data in the data store that is needed to run that instance (e.g., the integer points to nothing in the data store, the integer points to data for a different instance or service, etc.).
  • the prepared instance can be run in 312.
  • the instance can be run using parameters, data, metadata, and/or the config file that was prepared in 310.
  • if running the instance fails, an error can be identified in 314 (e.g., error in transforming data, incorrect input data, bugs in code, networking issues, etc.). Identifying the error while executing the instance in 308 is beneficial because it adds value later in the process 300 when generating and sending notifications (320). For example, recognizing that an error occurred when preparing the instance or running the instance and incorporating that error into the notifications can assist users to more quickly identify reasons why the overall run was not successful. Adjustments can then be made specific to preparing and/or running the instance such that during a subsequent run, these errors would not occur again.
  • batch results can be collected in 316.
  • Collecting batch results can include getting exit codes for each of the services that are run.
  • Collecting batch results can also include cleaning and filtering output results from service execution (e.g., refer to FIGS. 7-8).
  • Traditional output results can include an abundance of pre-filled fields and information that is not critical for identifying errors in execution. Therefore, collecting batch results in 316 includes filtering out unimportant or non-critical fields from the output results. The remaining, cleaned output results can include only the information relevant for identifying errors in the service execution.
  • the batch results can include success and/or failure information about each of the prepared and run instances in the run request. Batch results can be collected while each instance is executed. In other implementations, batch results can be collected once all the instances in the run are executed.
  • a status for the overall run can be determined based on the exit codes and/or each service’s output(s) (e.g., refer to 320, 322). For example, when collecting batch results, the cloud system can determine whether any of the services failed. If any service did fail or experience an error, then the cloud system can determine that the overall run was a failure (e.g., refer to 320, 322). Determining a status for the overall run can also be performed separate from collecting the batch results in 316.
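  • a minimal sketch of collecting and cleaning batch results, assuming an illustrative raw-result shape; only error-relevant fields survive the filter.

```python
# Fields worth keeping for error identification; the set is an
# illustrative assumption, not a fixed schema from the patent.
RELEVANT_FIELDS = {"service", "exit_code", "error", "log_stream"}

def collect_batch_results(raw_results: list[dict]) -> list[dict]:
    """Filter each raw result down to error-relevant fields only,
    dropping pre-filled fields that are not critical for diagnosis."""
    return [{k: v for k, v in r.items() if k in RELEVANT_FIELDS}
            for r in raw_results]

# Hypothetical raw result for one failed instance; the extra fields
# mimic the non-critical output that gets filtered away.
raw = [
    {"service": "assembly", "exit_code": 1, "error": "bad input",
     "log_stream": "logs/assembly-0",
     "image": "svc:1.2", "host": "ip-10-0-0-1", "platform": "batch"},
]
print(collect_batch_results(raw))
```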
  • the collected batch results can be dechunked in 318. Dechunking can occur after each instance is executed in 308. In other implementations, dechunking can occur after all instances are executed in 308. Dechunking is a process of putting or stitching back together whatever data was chunked in 304. If dechunking is successfully completed, a notification can be generated and sent in 320.
  • one or more errors can be identified based on failures in dechunking.
  • Identifying such errors is beneficial because it adds value later in the process 300 when generating and sending notifications to the user (320). For example, recognizing that an error occurred when chunking the data in 304 and incorporating that error into the notifications can assist users to more quickly identify reasons why the overall run was not successful. Therefore, adjustments can be made specific to chunking data for a specific service such that during a subsequent run, that error does not reoccur.
  • a notification can be generated and sent in 320 (e.g., FIGS. 5-6).
  • the notifications can provide users with a high-level view of whether the overall run was successful or failed.
  • the notifications can also include high-level views of one or more services that failed and/or were successful.
  • a lambda can be used to generate the notifications.
  • the notifications can be generated based on pulling data from the config file and/or the collected batch results (316). Pulling data from the config file can be beneficial to get a high-level view of how each service in the batch was executed. Pulling data from the collected batch results (316) can also be beneficial to get a more detailed view or understanding of individual services or instances that were run as well as services or instances that require additional reporting out.
  • the notifications can be generated based on pulling stored tracing data (e.g., refer to 308 in FIG. 3, the run trace store 110 in FIGS. 1-2).
  • the cloud system can check a status of the full run in 322.
  • 322 can be performed as part of 320 in generating and sending the notifications. 322 can also be performed as part of other portions of the process 300, such as collecting batch results (316), identifying error(s) 328, and/or skipping execution (330).
  • the cloud system can determine whether there was an error or failure in executing the full run (e.g., data associated with one service could not be chunked in 304, an instance for another service could not be successfully prepared in 310, etc.).
  • the cloud system can identify the run with a failed state identifier in 324.
  • the run can be identified with a success state identifier in 326.
  • the identifiers determined in 324 and 326 can be used by the cloud system in generating and sending the notifications to the user (320). For example, in some implementations, the cloud system may only send a notification to the user in 320 if the run is identified with the failed state identifier. This is beneficial for the user to more easily and quickly determine why the run failed and what adjustments can be made so that the run does not fail again.
  • FIG. 4 is a flowchart of a process 400 for executing an exemplary instance in FIG. 3.
  • the process 400 can be performed by the cloud system or any other system described herein.
  • the process 400 can be performed during metagenomics analysis.
  • the process 400 can also be performed in a variety of other applications and industries.
  • in metagenomic analysis, a user may seek to identify genetic markers for gut bacteria. Identifying these genetic markers is advantageous for predicting/determining trajectories of individual hosts’ (e.g., humans’) health conditions. Batch processing for metagenomic analysis therefore uses significant computing power and automation.
  • quality control can be performed on input data in 402 (e.g., refer to 304-306 in FIG. 3).
  • Scripts can be gathered by the cloud system in order to transform and clean the input data.
  • the input data can include metagenomics or transcriptomics data.
  • This data further includes sequencing data.
  • the metagenomics data can be human gut sequencing data.
  • the input data received in 402 can be a large population of test cases or human samples.
  • Part of cleaning the input data can include removing or filtering out (e.g., sequencing) DNA sequences that come from a host (e.g., patient, person, human).
  • Preferred data for metagenomics analysis (e.g., batch processing) includes DNA sequences for the bacteria being studied that do not include other, non-related DNA sequences.
  • One or more existing quality filters can be applied in 402. For example, quality values can be assigned to each character in a DNA sequence. Based on a comparison of the quality values, the cloud system can determine how confident it is about what a certain DNA sequence represents. In other implementations of the process 400, different quality filters and/or quality control techniques can be employed.
  • the cleaned up data can be used to execute instances in 404 (e.g., refer to 308 in FIG. 3). In other words, jobs or services in a pipeline can be run using the cleaned data.
  • Instances can be executed for a plurality of services, as described herein. Some of these services can include, for example, a taxonomic classification module 406, an assembly module 408, which involves gene calling in 410, a read mapping module 412, and an rna_seq module 414. One or more other modules or services can be implemented/executed in 404 based on a user’s run request.
  • the taxonomic classification module 406 can take all the data in a classification and identify where the data comes from. In other words, the module 406 can label the input sample data with identifiers indicating which organisms the sample data is predicted to originate from.
  • the assembly module 408 can be configured to take all fragments of DNA (e.g., individual characters of individual DNA sequences) and combine them into a single, contiguous piece of DNA. Assembly can be reference-guided and/or de novo. As a result of this assembly, genomes for all species can be identified.
  • gene calling 410 can be applied to the contiguous piece of DNA to predict, based on genomic features, what functional properties are encoded by the genes in the contiguous piece of DNA.
  • the assembly module 408 can receive, as input, the output from the taxonomic classification module 406 and/or any other module that is being executed, whether at a different time or in parallel.
  • the module 408 can also receive a config file associated with the run request such that the module 408 can be executed accordingly.
  • the read mapping module 412 can also perform similar functions as the taxonomic classification module 406. In some implementations, the module 412 can generate a list of assemblies.
  • the rna_seq module 414 can be configured to focus on RNA sequencing.
  • the module 414 can determine what cells in a sample are doing at time of data collection by generating a list of all the genes/proteins that are created at the time of collection.
  • output from the taxonomic classification module 406 can be added to a config file. That updated config file can be inputted into the assembly module 408.
  • the assembly module 408 can use information from the updated config file for execution. Output from the module 408 can then be used to update the config file. That updated config file can be used as input into the read mapping module 412. Output from the module 412 can be used to update the config file. That updated config file can be used as input into the rna_seq module 414. Output from the module 414 can then be used to update the config file. This process can continue until all instances and/or services are executed in 404 during the run request.
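  • a minimal sketch of this config-file chaining across the modules of process 400; the module bodies are placeholders, and only the hand-off pattern reflects the text.

```python
# Each module reads the current config, does its work, and returns an
# updated config whose outputs become the next module's inputs.
def taxonomic_classification(config: dict) -> dict:
    return {**config, "taxa": ["bacteroides", "prevotella"]}

def assembly(config: dict) -> dict:
    # Consumes the classifier's output from the updated config.
    return {**config, "contigs": [f"contig-for-{t}" for t in config["taxa"]]}

def read_mapping(config: dict) -> dict:
    return {**config, "mappings": len(config["contigs"])}

def rna_seq(config: dict) -> dict:
    return {**config, "expressed_genes": ["geneA", "geneB"]}

config = {"run_id": "run-001", "input": "cleaned-reads"}
for module in (taxonomic_classification, assembly, read_mapping, rna_seq):
    config = module(config)  # the updated config feeds the next module
print(config)
```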
  • 300 instances of quality control can be performed in 402.
  • 300 instances of any of the services 406-414 can also be executed.
  • Having the config file, which is continuously updated and used as input as well as output, makes it easier to handle passing of parameters from service to service in a batch process.
  • using the config file provides for services to be executed in parallel with minimal input or processing restrictions.
  • FIGS. 5A-B are exemplary notifications for executed cloud-based batch processes. Users can prefer receiving a notification as an email, where that notification provides enough detailed yet high-level information/status about run execution. This type of notification can be more beneficial to users than uncleaned/unfiltered output that results from conventional batch processing techniques (e.g., refer to FIG. 7). With the disclosed technology, users do not have to spend valuable time combing through long, platform-specific service outputs and code to identify information about the execution and/or any errors that occurred during execution. Instead, the disclosed notifications can, for example, indicate what data is or has been run, what data caused the entire run to error out, what service experienced the error, and any additional metadata that can be useful in error identification and/or remediation.
  • FIG. 5A depicts a notification 500 for a successful run.
  • the notification 500 can include information such as a run ID, type of run, billing code, data partner, execution link, and status statement.
  • the notification 500 can also include IDs for one or more of the services that were successfully executed during the run, including start and end times, and output files.
  • the run ID provides a trail for the user to track an entire run. Unlike conventional approaches, the disclosed technology provides the run ID so that users can more easily track and identify errors throughout an entire run.
  • the execution link can be accessed by the user to view real-time progress of run execution.
  • the user can use the execution link to see what service is currently being executed, what services have been executed, and what services still need to be executed. This is another way in which the user can track progress of the run and any potential errors therein.
  • the notification 500 can also provide the user with the service IDs and a high-level view summarizing each service execution. As a result, the notification 500 can be more service specific. This is beneficial for the user to understand how long it took to execute a particular service and what output was generated during that execution.
  • FIG. 5B depicts a notification 510 for a failed run.
  • the notification 510 can include information such as run ID, type of run, billing code, data partner, execution link, and status statement. As described in reference to the notification 500, the notification 510 can also include high-level information about the services executed during the run. The notification 510 can provide information about the failed services.
  • the notification 510 can additionally and/or optionally indicate input for each of the executed services as well as what input or datasets need to be looked at for a re-run. This is beneficial to point the user towards what caused the specific service to fail so that the error can be remediated.
  • the notification 510 can also list exit code and how many re-runs were attempted. Providing this type of information in the notification 510 can assist the user in more quickly and easily determining whether there was a data error or a service error and how to fix such errors.
  • a run can fail when even a single service has an error during execution. Therefore, the notification 510 for a failed run may only provide information about the service(s) that failed, rather than providing information for every service that was executed.
  • the user is mostly interested in the services that failed so that they can fix whatever reasons for those failures. Hence, providing the user with a notification pointing the user to the particular service(s) that failed can assist the user in more quickly identifying and addressing errors.
  • the information about the failed services can include execution information, exit code, start and end times, and log_stream links.
  • the execution information provided shows what job number this service was in the list created when building output items. This can be an arbitrary number.
  • the group information provided shows a unique identifier for data that is being processed.
  • the retry information provided shows what retry attempt this entry represents. For example, 1 can indicate a first retry of the initial failed job.
  • the exit code provided shows an outcome of the batch process (e.g., whether it was a success or fail).
  • the exit code can be represented by a boolean value. For example, 1 can indicate that the outcome failed and 0 can indicate that the outcome was successful.
  • the vCPUs value provided shows the number of vCPUs that are requisitioned for the batch process. In other words, it shows the compute resources that are required for execution of the run.
  • the log stream link provided shows a log for the batch process. The log stream link can include reporting information about individual failed task attempts.
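  • a minimal sketch assembling the failed-service fields described above into one notification entry; the concrete values are illustrative.

```python
def failed_service_entry(job_number: int, group: str, retry: int,
                         exit_code: int, vcpus: int,
                         log_stream: str) -> dict:
    """Build one failed-service record for a failure notification."""
    return {
        "execution": job_number,   # position in the build-output-items list
        "group": group,            # unique ID of the data being processed
        "retry": retry,            # which retry attempt this entry represents
        "exit_code": exit_code,    # 1 = failed, 0 = success
        "vCPUs": vcpus,            # compute requisitioned for the batch job
        "log_stream": log_stream,  # link to per-attempt failure logs
    }

print(failed_service_entry(3, "sample-042", 1, 1, 8, "logs/assembly/3"))
```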
  • FIG. 6 is a flowchart of a process 600 for generating notifications.
  • the process 600 can be performed by the cloud system or any other computing system described herein.
  • Services in a run request can be executed in 602.
  • one or more services can be executed in parallel.
  • the services can be executed according to a processing order as specified in a config file.
  • the config file, as described in further detail in reference to FIG. 7, can dictate parameters for execution of each of the services.
  • the config file can be parsed to pull the specifics for executing a particular service.
  • Output can be generated during execution of each service.
  • tracing data can be generated and stored (e.g., refer to the run trace store 110 in FIGS. 1-2, 308 in FIG. 3).
  • the output data of each service can be cleaned and/or filtered in 604 (e.g., refer to FIG. 7).
  • the output can be in a new file, which is cleaned according to one or more parameters identified in the config file.
  • Cleaning the output can include transforming the output into readable input for one or more other services in the run pipeline. As a result, more services can be processed in parallel without sacrificing efficiency, time, and/or memory.
  • cleaning the output data can occur after a service is executed. In other examples, cleaning the output data can occur after all the services in the run pipeline are executed.
  • Cleaning the output data can include identifying information in the output that can be useful to a user for diagnosing errors in the run.
  • conventional output data can include a lot of unnecessary information about execution of a service.
  • the user can specify how to clean the output data in the user’s run request. If there is an error in execution of that service, the user would have to comb through all of the output data to identify a source for that error. This can be a time-consuming and tedious process.
  • Cleaning the output data can result in refining the output data to include only information pertinent to identifying and/or diagnosing errors in the service execution.
  • the cleaned output data can include information about potential errors in the service execution.
  • Cleaning the output data can also include updating and/or generating a status statement/value for the executed service. The status statement can indicate whether execution of the service was a success or a failure.
  • once the output data for each service is cleaned, it can be determined whether any of the executed services have errors in their output data in 606.
  • the cleaned output data can be reviewed to quickly identify whether a service was successfully executed or not.
  • the cloud system can look at the status statement/value for each of the outputs. If at least one service has an error (e.g., a status statement of failed), then the entire run can be identified as failed in 608. On the other hand, if none of the services have errors, then the entire run can be identified as successful in 612.
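  • a minimal sketch of this status check: the run fails if any cleaned service output carries a failed status, and succeeds otherwise. The status values are illustrative assumptions.

```python
def run_status(service_statuses: dict[str, str]) -> str:
    """Identify the entire run as failed if at least one service has an
    error (a failed status statement); otherwise as successful."""
    failed = any(s == "failed" for s in service_statuses.values())
    return "FAILED" if failed else "SUCCEEDED"

print(run_status({"qc": "success", "assembly": "failed"}))   # FAILED
print(run_status({"qc": "success", "assembly": "success"}))  # SUCCEEDED
```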
  • a notification indicating that the run failed can be generated in 610 (e.g., refer to FIG. 5B). Likewise, a notification indicating that the run was successful can be generated in 614 (e.g., refer to FIG. 5A).
  • the notification generated in 610 or 614 can then be outputted in 616.
  • the notification can be sent to a user device of the user, for display on an application, website interface, or other graphical user interface.
  • the notification can be sent as an email to the user who requested the run.
  • the notification can be transmitted to devices of one or more stakeholders of the run request.
  • FIG. 7 is a flowchart of a process 700 for using a config file for cloudbased batch processing.
  • the process 700 can be performed by the cloud system or any other system described herein.
  • a mapping file can be received in 702.
  • a user at a user device (e.g., refer to the user device 102 in FIGS. 1-2) can generate the mapping file (e.g., a text file). The mapping file includes a run request and pointers to data that will be processed in the run request.
  • a config file can be generated in 704.
  • the config file can be automatically generated based on the mapping file. Autogeneration of the config file can be done using one or more scripts that are stored in the cloud system. For example, a script can automatically generate run request code for the received mapping file.
• The config file can include information about the run request, data to be processed in the run, and one or more services that are part of the run pipeline.
• The config file can also identify input items, which are stored parameters that are to be read in for processing during run execution.
• The config file can include batch parameters, which can be used to identify where in the cloud system to run the request.
• The config file also includes information specific to the request, including an order of services in the run pipeline, what services are requested, and which user to bill for completing the run request.
• The config file also includes service parameters, which can be used to identify one or more services that will be run.
• The config file can further include input item parameters, which can indicate what datasets to access for each of the services during the run.
• The config file can include file management information specific to each of the services in the run. For example, the file can identify input and output files for each service. The config file can also indicate whether certain input files need to be chunked before processing, and if so, what size chunks should be used. The chunking information can be tailored to the parameters of the particular service and/or the input files that are being used for that service. As discussed below, the config file can also include an output items strategy for each of the services. The output items strategy can identify the type of data that is to be generated as output for the particular service. The strategy can also help determine a location of the output files so that the output files can be fed into another service as input for that service. In some implementations, the output items strategy can also identify and/or allocate storage locations for output items.
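For illustration, a config file of the shape described above might look as follows. This is a hedged sketch written as a Python dict; all field names and values are assumptions modeled on this description and on the discussion of FIG. 8A below, not the actual schema.

    config = {
        "run_id": None,              # filled in when the file is prepared (706)
        "execution_link": None,
        "user_email": "user@example.com",
        "status": None,
        "input_items": "s3://bucket/user-mapping-file.txt",  # pointer to input data
        "services": {
            "MTxQC": {
                "execute": False,    # flipped while the service is being performed
                "file_management": {"execute": False},  # FALSE: no chunking needed
                "output_items_strategy": {"name": "per_sample", "extension": ".fastq"},
                "batch_params": {"job_name": "mtxqc", "queue": "default", "timeout": 3600},
                "service_params": {"pair": {}, "trim": {}, "remove": {}},
            },
            "taxonomic_classification": {
                "execute": False,
                "file_management": {"execute": True, "chunk_size_mb": 256},
                "output_items_strategy": {"name": "per_chunk", "extension": ".json"},
                "batch_params": {"job_name": "taxclass", "queue": "default", "timeout": 7200},
                "service_params": {},
            },
        },
    }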
• The config file can be updated and parsed with every service execution.
• The config file can be expanded into separate files per executed service, such that separate output files can be generated per service and fed into the next service in the pipeline as input. This input would effectively be read-only, which means it can be reused multiple times across multiple different services all in parallel. This is advantageous to promote robust and scalable batch processing.
• The config file can be prepared in 706.
• A lambda can be used to prepare the file.
  • Preparing the config file can include adding an execution link to the config file, assigning a run ID to the config file, and updating input data for the config file (e.g., refer to FIG. 8B). Additionally, a start time and/or price estimate can be added to the config file when it is being prepared.
• The execution link is beneficial to assist the user in tracking execution of the run and the services therein. For example, the user can use the execution link to see a real-time diagram flow of the run and how far processing has progressed. The user can view real-time progress of the run and potential errors that may arise during the run.
• The run ID is beneficial because it assists the user in more easily and quickly tracking execution of each service/instance during the entire run. In other words, the run ID creates a trail through every working step and/or service that is executed during the run. Using the trail, the user can more easily identify where and why any error occurred during the run. Traditional batch processing techniques do not provide for tracking via a run ID.
• The disclosed technology uses the run ID to provide the user with a high-level view of what service failed and what error caused that failure (e.g., refer to FIG. 6).
• The input data can also be updated in the config file in 706.
• The lambda can read through the mapping file to appropriately configure the config file for each of the services in the run.
• The input data can be adapted based on mapping information for each of the services such that the input data can be read and used during execution of the services in the run.
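A minimal sketch of this preparation step (706) follows, assuming the config file arrives as JSON in a lambda event; the event shape, URL pattern, and field names are assumptions for illustration only.

    import json
    import uuid
    from datetime import datetime, timezone

    def prepare_config(event, context):
        """Add a run ID, execution link, and start time, and update input data."""
        config = json.loads(event["config"])
        run_id = str(uuid.uuid4())
        config["run_id"] = run_id
        # Assumed console URL pattern for tracking execution of the run.
        config["execution_link"] = f"https://console.example.com/executions/{run_id}"
        config["start_time"] = datetime.now(timezone.utc).isoformat()
        # Repoint the input items from the user's mapping file to the temporary
        # input file for the first service to be executed.
        config["input_items"] = event.get("first_service_input", config["input_items"])
        return config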
• The prepared config file can then be received in 708.
• Once the config file is received, it is ready to be used in batch processing (e.g., execution of the run and each of the services therein).
• One or more of the services in the run can then be executed.
• One or more services can be executed in parallel.
• A service identifier and its associated parameters can be parsed from the prepared config file in 710.
• Code associated with that service can be identified and selected from the config file based on the service identifier, as depicted in FIGS. 8C and 8E.
• The service can be run (e.g., executed) in 712.
  • Running the service includes accessing the data and/or parameters that are associated with the service identifier (710).
• The cloud system can receive only the service identifier and/or pointers to parameters for that service, as depicted in FIGS. 8F and 8G. The system does not receive pointers or identifiers for other services in the run that are not yet being executed.
• As the service runs, the config file can be updated.
• The config file can have an execute parameter associated with each service in the run.
• The execute parameter can have a boolean value, such as TRUE/FALSE, YES/NO, 0/1, etc. This value can be updated/changed to indicate whether the service is currently being performed (e.g., if the service is executing in 712, then the execute parameter can be changed to TRUE in the config file).
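A hedged sketch of steps 710-712 follows. The config structure mirrors the earlier sketch, and submit_job is an injected placeholder for whatever batch backend actually runs the service; it is not a real API call.

    def run_service(config: dict, service_id: str, submit_job) -> dict:
        service = config["services"][service_id]   # 710: select by service identifier
        service["execute"] = True                  # mark the service as executing (712)
        return submit_job(                         # 712: hand off to the batch backend
            name=service["batch_params"]["job_name"],
            queue=service["batch_params"]["queue"],
            parameters=service["service_params"],
            input_pointer=config["input_items"],   # only this service's input is passed
        )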
• An output strategy for the service can be received in 714. This can occur before and/or during execution of the service (712). Each service can have a pre-identified output strategy. The output strategy can indicate what types of files and/or output items are expected from running the service (e.g., refer to FIG. 8D).
• Expected output items can be created in 716. This can occur during execution of the service (712).
• The output can include a list of identifiers to parameters and/or data, as described throughout this disclosure (e.g., refer to FIG. 8H).
• The created output items can be written into the config file in 718.
• In this way, the config file can be updated.
• The updated config file can then be provided as input to one or more other services further down the pipeline.
• In some implementations, the created output items can be written into a separate config file that is associated with the particular service that is being executed.
• The separate config file can include some of the original data from the prepared config file of 708. However, some of the data can be removed and/or added based on execution of the service.
• Where multiple services have their own associated config files, each of those config files can include some or all data from other and/or previous config files associated with the different services in the run.
• The config file associated with the particular service can be provided as an input file to one or more other services further down the run pipeline.
• The config file associated with the particular service can be a read-only file, such that the file can be reused across multiple different services in parallel.
• The different services can receive the config file associated with the particular service as well as the prepared config file from 708.
• The config file associated with the particular service can include input (output from the particular service) used for execution of the next service.
• The prepared config file from 708 can also be used to execute the next service.
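As a hedged sketch of steps 714-718, the following builds expected output items from a service's output items strategy and writes them into a separate, per-service config file that downstream services can consume as read-only input. The strategy fields and file layout are assumptions.

    import json

    def build_output_items(service_id: str, strategy: dict, instance_count: int) -> list:
        """716: create one expected output item per executed instance."""
        ext = strategy.get("extension", ".json")
        return [f"{service_id}/output-{i}{ext}" for i in range(instance_count)]

    def write_service_config(base_config: dict, service_id: str, items: list, path: str):
        """718: write a separate config file associated with this service."""
        service_config = dict(base_config)      # carry over data from the prepared file
        service_config["output_items"] = items  # add this service's created outputs
        with open(path, "w") as f:
            json.dump(service_config, f, indent=2)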
• A lambda can then be used to filter and/or clean output batch results in 720. This can occur after the service is executed. In some implementations, cleaning the output batch results can occur after every service is executed in the run.
• The cleaned output batch results can be stored in a temporary data store (e.g., refer to the data storage 214 in FIG. 2).
• Each service outputs batch results.
• For example, a service can generate one or more output files upon completion of service execution.
• The output batch results (e.g., files) can include a lot of different types of information. All that information, however, may not be necessary for the user to understand a status of the service (e.g., whether there was an error or the service was successfully executed), as depicted in FIG. 8H.
• Otherwise, the user may have to comb through all the information in the output batch results in order to identify whether there were any errors and, if so, what was the source or sources of such errors. This can be a time-consuming and error-prone process.
• The disclosed technology uses the lambda to go through the output batch results for each service and clean or filter such results. Filtering such results can provide for a high-level view of an overarching service status, as depicted in FIG. 8I.
• The filtered output batch results can include the run ID, the service identifier, input and output file identifiers, a start time, and a status for the service (e.g., success or failure). This type of high-level view can be used by the user to more easily and quickly identify what might have gone wrong in executing the service and how these issues can be resolved so that the service can be executed successfully in a subsequent run.
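A minimal sketch of such a cleaning lambda (720) follows; the raw result fields are assumptions, and the point is only to show verbose batch output being reduced to the handful of high-level fields listed above.

    def clean_batch_result(raw: dict) -> dict:
        """Keep only the fields useful for diagnosing the service's execution."""
        return {
            "run_id": raw.get("run_id"),
            "service": raw.get("service_id"),
            "status": "success" if raw.get("exit_code") == 0 else "failed",
            "start_time": raw.get("start_time"),
            "end_time": raw.get("end_time"),
            "input": raw.get("input_files"),
            "output": raw.get("output_files"),
        }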
• Steps 710-720 can be repeated for each additional service. If no more services need to be executed in the run, then a notification can be generated for the run in 724. As described throughout this disclosure (e.g., refer to FIG. 6), once all services are executed and their outputs are cleaned, the cloud system can check all of the outputs to determine whether there were any failures. If there is even one failure, the system can generate a notification indicating that the entire run failed. An overarching run notification can then include information from one or more of the outputs associated with the services to indicate which of the services failed/had errors (e.g., refer to FIG. 8J).
• Tracing data can be generated at a variety of steps in the process 700. For example, tracing data can be generated and stored when the service identifier and parameters are parsed in 710. Tracing data can also be generated and stored when the service is run in 712. The tracing data can then be used in generating notifications for the run in 724. The tracing data is beneficial to assist the user in more easily stepping through execution of any service or step in the run.
• In some implementations, with regards to the process 700, 702-710 can be performed by the prepare module 206 as depicted and described in FIG. 2. 702-710 can also be performed during and/or as part of 302-310 in FIG. 3 and/or 402 in FIG. 4.
• 712-720 can be performed by the run module 208 and the config file updater module 209 as depicted and described in FIG. 2. 712-720 can also be performed during and/or as part of 312-318 in FIG. 3 and/or 404-414 in FIG. 4. Additionally, 724 can be performed by the reporting engine 210 as depicted and described in FIG. 2. 724 can also be performed during and/or as part of 320-326 in FIG. 3 and/or the process 600 in FIG. 6. In yet other implementations, any portion of the process 700 can be performed by one or more other modules or systems and/or during any other processes described throughout this disclosure.
  • FIGS. 8A-J are exemplary code segments during cloud-based batch processing as described herein (e.g., refer to FIG. 7).
  • FIG. 8A depicts an exemplary config file 800, which can be generated in 704 in FIG. 7.
• The config file 800 includes run execution information, such as the run ID, execution link, user email, start and end times, and status.
• The config file 800 also includes a pointer to a location in a data store where input items for the run are located.
• The pointer can also be to a mapping file generated by the user as part of their run request. That mapping file is used to autogenerate the config file 800 (e.g., refer to 702 in FIG. 7).
• The config file 800 includes execution information for each service in the run request.
• Here, two services are requested: "MTxQC" and "taxonomic_classification".
  • Each of the services can include a boolean value for execution (e.g., TRUE/FALSE), file management information, an output items strategy, batch parameters, and service parameters.
• The file management information can dictate whether input data for that service should be chunked before processing and, if so, a size for that chunking (e.g., refer to 304 in FIG. 3).
• For example, the file management information for the "MTxQC" service has an execute boolean value of FALSE, which means no chunking is needed for this service's execution.
• The output strategy can indicate what type of output is expected as a result of executing the service, where to find the output, etc.
• For example, the "taxonomic_classification" service's strategy includes a strategy name as well as an output file extension.
• The batch parameters can identify the service and where it is being executed in the cloud system.
• For example, each service can be given a job name, queue, definition, and timeout.
• The service parameters can indicate parameters, conditions, and other information necessary to execute the service.
• For example, the "MTxQC" service has parameters for each step in the service's execution. Those steps can include pairing data, trimming data, and removing data.
  • FIG. 8B depicts an exemplary updated config file 810, which is prepared in 706 in FIG. 7.
• A lambda is used to add metadata to the config file 810.
• The original config file 800 does not include values for the run ID, execution link, and start time.
• The config file 800 is therefore updated into the config file 810 with the run ID, execution link, and start time.
• The input items pointer can also be updated from the location of the user's mapping file to a temporary input file for the first service that is to be executed.
  • FIG. 8C depicts an exemplary segment 820 of the config file that is passed for executing a particular service.
• The segment 820 of the file is selected in order to execute the service (e.g., refer to 710 in FIG. 7).
• The segment 820 can include run information, input location, and service information.
• The exemplary segment 820 includes information needed to run the "MTxQC" service, the first service in the run pipeline.
  • FIG. 8D depicts exemplary output items 830 that are built during execution of the service (e.g., refer to 716 in FIG. 7).
• The exemplary output items 830 can include identifiers for each of the values that are going to be returned during execution of the service.
  • FIG. 8E depicts exemplary code 840 for files that are going to be written into the data store during service execution.
  • Each number/index represents a file (e.g., JSON) that is to be written into the data store.
• Each of the numbers indicates the number of times that the service will be executed in a batch.
• The code 840 also includes associated parameters and input files in order to execute each instance of the service. This code 840 can be generated while building the output items.
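For illustration, generating such per-instance input files might look like the following sketch, with one JSON file written per batch instance; the names and file layout are assumptions.

    import json

    def write_instance_inputs(service_id: str, params: dict, chunks: list, out_dir: str):
        # One input file per instance of the service, indexed like code 840.
        for i, chunk_pointer in enumerate(chunks):
            instance_input = {
                "index": i,                   # which execution of the service this is
                "service": service_id,
                "parameters": params,         # associated parameters for the instance
                "input_file": chunk_pointer,  # the data-chunk this instance reads
            }
            with open(f"{out_dir}/{service_id}-{i}.json", "w") as f:
                json.dump(instance_input, f)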
  • FIG. 8F depicts exemplary code 850 that is passed in order to run the service.
• The code 850 can be passed for each instance of execution of that service. Therefore, the code 850 can be passed for each of the numbers represented in the code 840.
  • FIG. 8G depicts exemplary code 860 that is used to run the service.
• For example, before running the service, the service's input files and parameters need to be read from the data store. Those read files and parameters are depicted in the code 860, which can then be used by the system to run the service.
  • FIG. 8H depicts exemplary batch results 870 (e.g., refer to 316 in FIG. 3).
• Pre-defined fields from conventional batch processing services can be returned in the results. This causes the output file for the service to expand undesirably, which makes it more challenging and tedious for the user to comb through the results and identify errors or other pertinent information.
• FIG. 8I depicts exemplary collected batch results 880 (e.g., refer to 316 in FIG. 3 and 720 in FIG. 7).
• The batch results 870 can be cleaned and/or filtered to identify and pull out pertinent reporting information.
• As a result, the collected batch results 880 are generated.
• The results 880 can include a service identifier, run ID, status information (e.g., fail or success), start and end times, input information, and output information.
• The collected batch results 880 can then be saved in the data store, rather than the original batch results 870.
  • FIG. 8J depicts exemplary data 890 that is used to generate notifications for the user (e.g., refer to 320 in FIG. 3, FIG. 6, 724 in FIG. 7).
• The system can use the collected batch results 880, which are stored in the data store, to notify users about results for each of the executed services and the overall run.
• The system may not output all of the data 890 and instead may only output high-level information about a service or services that failed during execution. Therefore, the user can more easily and quickly identify errors in run execution.
  • FIGS. 9A-C are exemplary code segments for querying results from cloud-based batch processing as described herein.
  • FIG. 9A depicts a run data schema 900.
• Run data can be queried and reported to the user to assist the user in tracing every aspect of an executed run.
• The run data, as described throughout this disclosure, can come from the config file and/or any of the updated config files generated thereafter during execution of each service (e.g., instance) in the run pipeline.
• The run data schema 900 of FIG. 9A indicates some of the data that can be stored in the run trace store 110. This schema 900 can also be used to query run data.
• The schema 900 includes a run ID, a JSON version, a parent run ID, an OSM, a user email, a project code, a project billable, a price estimate, a start time, a pointer to input file(s), an end time, a run status, a results location in storage, and information about one or more services that are executed in the run.
• The services information can include a service name, service parameters, service results, a start time, an end time, and a service results location.
  • FIG. 9B depicts an exemplary run data JSON file 910.
• The file 910 can be parsed and/or queried using one or more of the techniques described herein (e.g., refer to the run trace store 110 in FIGS. 1-2 and FIG. 9A).
• The file 910 includes many of the fields in the run data schema 900 of FIG. 9A, including the JSON version, end time, parent run ID, price estimate, project billable, project code, pointer to the results location, run ID, pointer to input files, and one or more services information.
• The file 910 further includes the start time, status, user email, and workflow name.
  • FIG. 9C depicts an exemplary query used to create tables and parse raw data for presentation to a user.
  • Arrays can often be stored with complex structs.
• To query such arrays, an UNNEST command in SQL can be used.
• The raw input table 920 can be transformed into a queried table 940.
• The exemplary raw input table 920 includes information for two runs, where each executes different services with the same input files. However, each service and each parameter for that service are grouped into one column. Rows for each of the service names can be generated using the query 930.
• The UNNEST command in the query 930 can be used to make each entry in the array into a row.
• The UNNEST command can be called followed by a t(ALIAS_NAME) statement.
• The ALIAS_NAME changes depending on what a user inputs in the t(ALIAS_NAME) statement.
• Here, the ALIAS_NAME is "serv."
• As a result, the queried table 940 is produced. Additionally and/or optionally, to parse each service parameter into individual columns, the query 930 can be expanded using WITH and AS statements, as depicted in the expanded query 950.
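For illustration, a query of this shape might look as follows; the table and column names are assumptions, and the SQL is shown as a string so it can be submitted programmatically.

    # Each entry of the services array becomes its own row, aliased as "serv".
    QUERY = """
    SELECT run_id, serv.service_name, serv.service_results
    FROM run_data
    CROSS JOIN UNNEST(services) AS t(serv)
    """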
  • FIG. 10 shows an example of a computing device 1000 and an example of a mobile computing device that can be used to implement the techniques described here.
• The computing device 1000 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers.
• The mobile computing device is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart-phones, and other similar computing devices.
• The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed in this document.
• The computing device 1000 includes a processor 1002, a memory 1004, a storage device 1006, a high-speed interface 1008 connecting to the memory 1004 and multiple high-speed expansion ports 1010, and a low-speed interface 1012 connecting to a low-speed expansion port 1014 and the storage device 1006.
• Each of the processor 1002, the memory 1004, the storage device 1006, the high-speed interface 1008, the high-speed expansion ports 1010, and the low-speed interface 1012 are interconnected using various busses, and can be mounted on a common motherboard or in other manners as appropriate.
• The processor 1002 can process instructions for execution within the computing device 1000, including instructions stored in the memory 1004 or on the storage device 1006 to display graphical information for a GUI on an external input/output device, such as a display 1016 coupled to the high-speed interface 1008.
• Multiple processors and/or multiple buses can be used, as appropriate, along with multiple memories and types of memory.
• Multiple computing devices can be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
• The memory 1004 stores information within the computing device 1000.
• In some implementations, the memory 1004 is a volatile memory unit or units.
• In other implementations, the memory 1004 is a non-volatile memory unit or units.
• The memory 1004 can also be another form of computer-readable medium, such as a magnetic or optical disk.
• The storage device 1006 is capable of providing mass storage for the computing device 1000.
• The storage device 1006 can be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations.
• A computer program product can be tangibly embodied in an information carrier.
• The computer program product can also contain instructions that, when executed, perform one or more methods, such as those described above.
• The computer program product can also be tangibly embodied in a computer- or machine-readable medium, such as the memory 1004, the storage device 1006, or memory on the processor 1002.
• The high-speed interface 1008 manages bandwidth-intensive operations for the computing device 1000, while the low-speed interface 1012 manages lower bandwidth-intensive operations. Such allocation of functions is exemplary only.
• The high-speed interface 1008 is coupled to the memory 1004, the display 1016 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 1010, which can accept various expansion cards (not shown).
• The low-speed interface 1012 is coupled to the storage device 1006 and the low-speed expansion port 1014.
• The low-speed expansion port 1014, which can include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), can be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
• The computing device 1000 can be implemented in a number of different forms, as shown in the figure. For example, it can be implemented as a standard server 1020, or multiple times in a group of such servers. In addition, it can be implemented in a personal computer such as a laptop computer 1022. It can also be implemented as part of a rack server system 1024. Alternatively, components from the computing device 1000 can be combined with other components in a mobile device (not shown), such as a mobile computing device 1050. Each of such devices can contain one or more of the computing device 1000 and the mobile computing device 1050, and an entire system can be made up of multiple computing devices communicating with each other.
• The mobile computing device 1050 includes a processor 1052, a memory 1064, an input/output device such as a display 1054, a communication interface 1066, and a transceiver 1068, among other components.
• The mobile computing device 1050 can also be provided with a storage device, such as a micro-drive or other device, to provide additional storage.
  • Each of the processor 1052, the memory 1064, the display 1054, the communication interface 1066, and the transceiver 1068, are interconnected using various buses, and several of the components can be mounted on a common motherboard or in other manners as appropriate.
• The processor 1052 can execute instructions within the mobile computing device 1050, including instructions stored in the memory 1064.
• The processor 1052 can be implemented as a chipset of chips that include separate and multiple analog and digital processors.
• The processor 1052 can provide, for example, for coordination of the other components of the mobile computing device 1050, such as control of user interfaces, applications run by the mobile computing device 1050, and wireless communication by the mobile computing device 1050.
• The processor 1052 can communicate with a user through a control interface 1058 and a display interface 1056 coupled to the display 1054.
• The display 1054 can be, for example, a TFT (Thin-Film-Transistor Liquid Crystal Display) display or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology.
• The display interface 1056 can comprise appropriate circuitry for driving the display 1054 to present graphical and other information to a user.
• The control interface 1058 can receive commands from a user and convert them for submission to the processor 1052.
• An external interface 1062 can provide communication with the processor 1052, so as to enable near area communication of the mobile computing device 1050 with other devices.
• The external interface 1062 can provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces can also be used.
• The memory 1064 stores information within the mobile computing device 1050.
• The memory 1064 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units.
  • An expansion memory 1074 can also be provided and connected to the mobile computing device 1050 through an expansion interface 1072, which can include, for example, a SIMM (Single In Line Memory Module) card interface.
• The expansion memory 1074 can provide extra storage space for the mobile computing device 1050, or can also store applications or other information for the mobile computing device 1050.
• The expansion memory 1074 can include instructions to carry out or supplement the processes described above, and can include secure information also.
• The expansion memory 1074 can be provided as a security module for the mobile computing device 1050, and can be programmed with instructions that permit secure use of the mobile computing device 1050.
• Secure applications can be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.
• The memory can include, for example, flash memory and/or NVRAM memory (non-volatile random access memory), as discussed below.
• A computer program product is tangibly embodied in an information carrier.
• The computer program product contains instructions that, when executed, perform one or more methods, such as those described above.
• The computer program product can be a computer- or machine-readable medium, such as the memory 1064, the expansion memory 1074, or memory on the processor 1052.
• The computer program product can be received in a propagated signal, for example, over the transceiver 1068 or the external interface 1062.
• The mobile computing device 1050 can communicate wirelessly through the communication interface 1066, which can include digital signal processing circuitry where necessary.
• The communication interface 1066 can provide for communications under various modes or protocols, such as GSM voice calls (Global System for Mobile communications), SMS (Short Message Service), EMS (Enhanced Messaging Service), or MMS messaging (Multimedia Messaging Service), CDMA (code division multiple access), TDMA (time division multiple access), PDC (Personal Digital Cellular), WCDMA (Wideband Code Division Multiple Access), CDMA2000, or GPRS (General Packet Radio Service), among others.
• A GPS (Global Positioning System) receiver module 1070 can provide additional navigation- and location-related wireless data to the mobile computing device 1050, which can be used as appropriate by applications running on the mobile computing device 1050.
• The mobile computing device 1050 can also communicate audibly using an audio codec 1060, which can receive spoken information from a user and convert it to usable digital information.
• The audio codec 1060 can likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of the mobile computing device 1050.
  • Such sound can include sound from voice telephone calls, can include recorded sound (e.g., voice messages, music files, etc.) and can also include sound generated by applications operating on the mobile computing device 1050.
• The mobile computing device 1050 can be implemented in a number of different forms, as shown in the figure. For example, it can be implemented as a cellular telephone 1080. It can also be implemented as part of a smart-phone 1082, personal digital assistant, or other similar mobile device.
• Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof.
  • These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
• These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language.
• As used herein, the terms machine-readable medium and computer-readable medium refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal.
• The term machine-readable signal refers to any signal used to provide machine instructions and/or data to a programmable processor.
• The systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer.
  • Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
• The systems and techniques described here can be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components.
• The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
• The computing system can include clients and servers.
• A client and server are generally remote from each other and typically interact through a communication network.
• The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Abstract

Computer memory stores instructions that, when executed by one or more processors, cause the processors to perform operations comprising: receiving a request to run one or more services in a pipeline; generating an initial configuration file to be used as a current configuration file; and for each service of the one or more services of the request, after generation of the initial configuration file: executing at least one instance of the service using the initial configuration file to access at least one data-chunk in order to generate a new configuration file to be used as the current configuration file; wherein executing the at least one instance comprises generating, from the at least one data-chunk, corresponding output data stored in the new configuration file; aggregating the current configuration file with any other current configuration files available in the system; and providing, after each service of the pipeline is executed, the results.

Description

BATCH PROCESSING
[0001] The present document generally relates to computer technology for batch processing, such as cloud-based batch processing.
BACKGROUND
[0002] Many industries process large quantities of data to make advancements in their fields. For example, some companies in the pharmaceutical industry have used large scale computing efforts in an attempt to treat various diseases through microbiome medicine. These companies have leveraged large quantities of data to identify and discover, for example, effective peptides, proteins, and small molecule therapeutics that can provide new therapeutics to patients. Such large scale computing platforms have used numerous bioinformatics software tools, each with their own runtime environments and compute resource requirements. For example, monolithic workflows have been used to coordinate such bioinformatic tools into computational pipelines that may each have dedicated processing resources, such as dedicated servers, memory, and storage capacity.
SUMMARY
[0003] The disclosed technology is generally directed to a platform and architecture to provide improved cloud-based batch processing for processing, for example, large data sets. For example, the disclosed technology can provide for robust, dynamic microservice-based workflows that incorporate a reusable template, sometimes called an inner-state machine, as the basis of each job in a batch. Such an inner-state machine can be adapted to each specific job through specifications that are passed into each instantiated instance of the inner-state machine, which can permit advantageous features built into the inner-state machine (e.g., graceful error handling, data handling) to be realized for each job without having to be specifically delineated for each job, for example. This can provide any of a variety of advantages over other technology, such as monolithic workflows, which may instead rely upon separate definition and specification of features within each pipeline that is used. The disclosed technology uses an architectural style where applications are decomposed into loosely coupled services. The code can be broken into smaller services that run as separate jobs. These jobs can be independently run and the output from one service can be used as an input to another service. Such a decentralized architecture can be less prone to failures. This architecture can also be coded, tested, deployed and scaled independently.
[0004] A system can be used for batch processing of services. The system includes one or more processors; and computer memory. The computer memory can store instructions that, when executed by the processors, cause the processors to perform operations comprising: receiving a request to run one or more services in a pipeline; wherein the request comprises data to be operated on; chunking the data to generate data-chunks, each data-chunk comprising some, but not all, of the data; wherein all data-chunks combined comprise all of the data; storing the data-chunks in locations specified by one or more references; generating an initial configuration file to be used as a current configuration file; wherein the initial configuration file comprises the one or more references; and for each service of the one or more services of the request, after generation of the initial configuration file: executing at least one instance of the service using the initial configuration file to access at least one data-chunk in order to generate a new configuration file to be used as the current configuration file; wherein executing the at least one instance comprises generating, from the at least one data-chunk, corresponding output data stored in the new configuration file; aggregating the current configuration file with any other current configuration files available in the system; and providing, after each service of the pipeline is executed, the results based on the current configuration file. Other systems, methods, devices, computer-readable media, software, and other forms of technology may be used for the batch processing of services.
[0005] Implementations can include any, all, or none of the following features. The request is received from a client device geographically remote from the one or more processors and the computer memory, and in data communication with the one or more processors and the computer memory. At least some of the services are executed in parallel with each other. At least some services are executed in series. The output of some services is used as input for some other services. The system is configured to: monitor the operations to determine if the operations cause an error; and halt, in response to determining that the operations cause an error, the operations. The system is further configured to generate, responsive to determining that the operations cause an error, an error message containing information about the service. The system is further configured to record, in a run-trace datastore, trace information about the execution of the instances of a service, the trace information comprising parameters related to operations of the system as the processors perform the operations. The system is further configured to: receive a query identifying the request; and respond to the query with at least some of the trace data.
[0006] One or more advantages can be provided by the disclosed technology. For example, the disclosed technology can provide improved batch-processing of computer jobs. For instance, the disclosed technology can be trackable, resilient, scalable, flexible, testable, and deployable. Existing cloud services may not provide for flexibility and complexity in batch processing, such as piping output from one process into input for another process without limits as well as formatting the output that is piped into input.
[0007] In another example, the disclosed technology can increase efficiency in processing large amounts of data. Existing approaches to batch processing may not provide a comprehensive framework for tracking parameters for a run or components (e.g., service executions) of that run. Existing approaches may also limit workflows in the number of services that can be executed during a run, thereby limiting scalability. The disclosed technology, on the other hand, can provide for a scalable framework that accepts multiple services to be executed in parallel, where each service can have multiple different parameters. Moreover, one or more output files from different services can be formatted during run execution and provided as input to one or more other services being executed. This feature can provide for more ability to efficiently perform runs, no matter how extensive or variant a run pipeline can be. Moreover, each run can be identified by a unique identifier. That unique identifier can be used by a user who requested the run in order to track progress through run execution. The identifier can create a trail for logging progress and errors that occur during execution of any of the services within the run pipeline. In other words, parameters for a run as well as components (e.g., services) of the run can be tracked in real-time (e.g., with a power query language provided by a service such as AMAZON). Unlike traditional batch processing services, the disclosed technology can advantageously assist users in more quickly identifying, diagnosing, and remediating errors in the pipeline. As a result, users can rerun only parts of the pipeline rather than rerun the entire pipeline, thereby improving run efficiency.
[0008] In another example, high-level notifications can be provided to the user who requested a run, which can assist the user in more quickly identifying errors in the run execution. Traditional batch processing services provide the user with output containing both necessary and unnecessary information. The user would have to meticulously comb through such information to identify any errors or other parameters/conditions that are important to run execution. On the other hand, the disclosed technology can provide the user with output that indicates whether the entire run was successful and if not, where there were errors. For example, the notification for a failed run can include information about which service experienced an error, whether the error was a service or data error, and other information that can be beneficial to a user in resolving that error. The notification may not, for example, include information about every service that was executed, such as those that were successful. As a result, the user does not have to spend time reviewing all output information to manually identify a source of an error. Furthermore, this can reduce potential human mistakes that can result when having to review all the output information to manually identify errors. The disclosed technology, therefore, can provide for more efficient batch processing, run reporting, and faster error identification.
[0009] In another example, the disclosed technology can provide for increased traceability through run execution. As a result, users can more easily move backwards from run execution output all the way to run execution input. Doing so, the user can more readily determine and identify what operations were performed, where there may have been errors in execution of any step in the run, and why the run execution output resulted. The user not only can view run execution results but also trace an entire process of obtaining those results. The user can view and query high-level information about the run execution but also more detailed information about execution of every step in the run and/or every service in the run in order to track any information in the run execution.
[0010] In yet another example, the disclosed technology can provide for dynamic batch processing. For instance, the disclosed technology can provide a platform that is able to dynamically change to varied and changing batch processing demands, such as introducing new and different processing pipelines and tests, introducing new and different types of data, introducing new and different output requirements, introducing new and different processing dependencies among pipelines, introducing new and different testing parameters, and/or others. Given that the core of the platform is a dynamically customizable inner state machine, just about any change or modification that is conceived of can be implemented and dynamically accommodated by the platform.
[0011] In another example, the disclosed technology can be easily deployable in any cloud environment. A script or marked up document can outline (e.g., consolidate) all entities, processes, techniques, and/or methods of the disclosed technology. That script or marked up document can then be deployed into any cloud environment such that the disclosed technology can be created and implemented. As a result, the disclosed technology can be used across different industries, services, platforms, and computing environments.
[0012] Other features, aspects and potential advantages will be apparent from the accompanying description and figures.
DESCRIPTION OF DRAWINGS
[0013] FIG. 1 is a conceptual diagram of an example cloud-based batch processing as described herein.
[0014] FIG. 2 is a system diagram of an example cloud-based batch processing environment.
[0015] FIG. 3 is a flowchart of a process for executing cloud-based batch processing.
[0016] FIG. 4 is a flowchart of a process for executing an instance in FIG. 3.
[0017] FIGS. 5A-B are exemplary notifications for executed cloud-based batch processes.
[0018] FIG. 6 is a flowchart of a process for generating notifications.
[0019] FIG. 7 is a flowchart of a process for using a config file for cloud-based batch processing.
[0020] FIGS. 8A-J are exemplary code segments during cloud-based batch processing as described herein.
[0021] FIGS. 9A-C are exemplary code segments for querying results from cloud-based batch processing as described herein.
[0022] FIG. 10 shows an example of a computing device and an example of a mobile computing device that can be used to implement the techniques described here.
[0023] Like reference symbols in the various drawings indicate like elements.
DETAILED DESCRIPTION
[0024] The disclosed systems and methods are generally directed to providing for cloud-based batch processing in a scheme that uses an updating or “living” config file to manage inputs and outputs through the batch processing. The disclosed technology provides for efficient processing of run requests. A workflow can have a pipeline of numerous services (e.g., jobs) to be executed. A config file can be associated with the workflow. The config file can provide information for each of the services to be executed, how to clean and/or modify output from each service, and how to translate each service output into input for another service. As a result, batch processing in parallel can be possible and more efficient, regardless of how many services are to be executed in the workflow and/or how many parameters are to be used for each service execution.
[0025] Using step function techniques, workflows can be translated into a state machine that is easy to understand, document, and modify. Moreover, each step or service of execution in the workflow can be monitored such that a user can more quickly identify and fix errors in the workflow. High-level notifications can be generated and presented to the user that articulates why a workflow failed. Such notifications can also identify which step or service experienced an error that caused the workflow to fail. Therefore, the user can more easily and quickly identify and remediate errors in the workflow, even if the user does not have a full understanding of the computing platform in which the workflow is running. This can advantageously extend the use of these features to users with less training on the computing platform.
[0026] The batch processing described throughout this disclosure can provide for more efficient and automatic scaling of each workflow. For example, a user can request 8 jobs, each job requiring 1 central processing unit (e.g., CPU) and 4 gigabytes (e.g., Gb) of memory. During batch processing, the disclosed technology can auto scale the 8 jobs by using one instance that has 8 CPUs and 32 Gb of memory instead of using 8 smaller, independent instances. As a result, workflows can be performed quicker and more efficiently. Moreover, this provides for more scalable systems and methods to meet demands of different workflows and services that are requested per workflow. Additionally, continuous integration and delivery techniques can be employed with the disclosed technology to assist developers in delivering code changes more frequently and reliably. For example, unit-tests and automatic deployment can be more easily conducted. Large-scale architectural deployment can also be made possible by the disclosed technology. This example, for purposes of clarity, describes a relatively simple set of job requirements. However, it will be understood that this technology can handle much more complex requirements (e.g., different CPU requirements for each job, variable memory requirements depending on the output of previous jobs in the workflow).
[0027] FIG. 1 is a conceptual diagram of an example cloud-based batch processing as described herein. A cloud-based batch processing environment can include a cloud system 100 and a user device 102. The cloud system can include a run trace store 110, which can be configured to store one or more parameters or other information associated with running a request. The cloud system 100 can receive a request to run one or more jobs (e.g., workflows) from the user device 102 in A. For example, the run request can specify an analysis pipeline of one or more services (e.g., processes) to complete. The cloud system 100 can receive more than one request from more than one user device.
[0028] The cloud system 100 can run the requested jobs in B. Once running the request is completed, results and/or notifications about the run can be generated and sent to the user device 102 in C. As described below, the notifications can provide a user at the user device 102 with information as to whether a run was successful or not. The notifications can include information about which services in the run failed and reason(s) for those failures. The notifications are beneficial for the user to more quickly understand, address, and fix data and/or service errors.
[0029] To run the request in B, the cloud system 100 can receive data 104 and a config file 106. In some implementations, the cloud system 100 can autogenerate the config file 106 based on the received data 104, that is to say, without specific user input other than the submission of the data 104. The data 104 can include information about the user's run request, including technical information (e.g., services to be run, input files/data to be used for the services) and other information (e.g., billing information, user identifiers or contact information). Before executing the run and the services therein, the config file 106 can be updated with information about the run as well as pointer(s) to the data 104 for a first service to be executed. This information can be stored in the run trace store 110. The information stored in the run trace store 110 (e.g., a run ID, execution link, start time, etc.) can be beneficial for the user to track progress (e.g., in real-time) through the run request. The run information can also be beneficial for the user to more easily identify errors that occur during execution of any of the services in the workflow. For example, the information stored in the run trace store 110 can be used to provide the user with a high-level view of the overall run request as well as more detailed views of each of the executed services. The user can then more easily and quickly trace each step of each executed service in the run request.
[0030] The updated/prepared config file 106 then becomes input for batch processing in the cloud system 100. In other words, the config file 106 can be input for each of the executed services, which provides for more efficient parallel processing of the services. As described herein, the config file 106 can be continuously updated during execution of the run. Updating the config file 106 can include generating/outputting new config files for each executed service, which can then be provided as input to other services in the pipeline. Information that is added to updated config files can be stored in the run trace store 110.
[0031] Batch processing in the system 100 includes spinning up various instances 108A-N during a run. Each of the instances 108A-N can be different services in the pipeline. Each of the instances 108A-N can also be multiple executions of a single service in the pipeline. Each of the instances 108A-N can be spun up simultaneously. In some implementations, for example, a bid can be submitted to the system 100 for each of the instances 108A-N. Based on traffic in the system 100, one or more of the instances 108A-N can be selected for processing based on their bids while one or more other of the instances 108A-N can wait in a queue until their bids are received. Each instance 108A-N can receive the necessary data 104 to run the instance based on the pointer indicated in the config file 106.
[0032] A first instance 108A can receive the config file 106 as input. As the instance 108A is spun up and a service is performed therein, the config file 106 can be updated and outputted from the instance 108A as config file 106'. As described throughout, the config file 106' and every updated config file thereafter can be separate config files that expand on the original config file 106. These updated config files can be input for services further down the run pipeline. This is advantageous so that the updated config files can be re-used across multiple different services and/or instances that are executed in parallel. As a result, the disclosed technology is scalable and efficient in batch processing. The config file 106 can be stored. In some implementations, one or more parameters from the config file 106 and every updated config file thereafter can be stored. This is beneficial for more efficient traceability. In other words, with the stored parameters, more detailed, as well as high-level, views of a run and any executed services therein can be provided to the user.
[0033] Updating and outputting the config file 106 includes preparing the config file to be input into another process/service. Thus, the updated config file 106' can be fed down the pipeline as input into instance 108B. Although not depicted, the config file 106' can also be fed as input into one or more other instances 108C-N in parallel/simultaneously. As the instance 108B is spun up and a service/process is performed therein, the config file 106' can be updated and outputted from the instance 108B as config file 106''. The updated config file 106'' can then be fed down the pipeline as input into instance 108C. As the instance 108C is spun up and a service/process is performed therein, the config file 106'' can be updated and outputted from the instance 108C as config file 106'''. The updated config file 106''' can then be fed down the pipeline as input into any additional instances, such as instance 108N.
[0034] Once the instances 108A-N are spun up and down (e.g., a run request is completed), a final updated config file 106”” can be outputted. Based on the config file 106””, the system 100 can identify results for each of the executed services and/or the overall run. The system 100 can determine a notification to send to the user device 102 that indicates whether the overall run was a success or a failure. As described further, the run request can be identified as failed where at least one service execution in the run experienced an error.
[0035] FIG. 2 is a system diagram of an example cloud-based batch processing environment. The environment can include the cloud system 100 and the user device 102, as depicted and described in reference to FIG. 1. The system 100 and device 102 can communicate via network(s) 200 (e.g., wireless and/or wired, an intranet, the Internet).

[0036] The cloud system 100 can include a batch processing module 204 and a network interface 212. The batch processing module 204 can include a preparation module 206, a run module 208, and a reporting engine 210. Once the cloud system 100 receives a request to run services from the user device 102 (e.g., refer to A in FIG. 1), the run can be prepared and executed by the batch processing module 204 (e.g., refer to B in FIG. 1, FIG. 3, FIGS. 7-8). The preparation module 206 can be configured to prepare a config file for use in executing the run request. The module 206 can also scale each of the services in the run pipeline. In some implementations, scaling the services can be performed before executing each service. In other implementations, one or more services can be scaled before executing any service. As part of scaling a service, input data for that service can be chunked/separated into smaller sizes/files such that the data can be more efficiently and quickly processed. The module 206 can also be configured to, after running each service, dechunk or stitch back together data that was previously chunked.

[0037] The run module 208 can be configured to run the requested services once they are prepared by the preparation module 206. As described throughout, running the services includes spinning up one or more instances (e.g., refer to B in FIG. 1, FIG. 3). While the run module 208 executes each of the instances, a config file updater 209 can be configured to continuously update the config file. The updater 209 can also clean and/or filter output results from each service such that the output results can be provided as input to any other instances/services in the run (e.g., refer to FIG. 3). The updater 209 can also clean/filter the output results such that the results are easier to use for identifying errors in execution of one or more services (e.g., refer to FIGS. 7-8). Traditional batch processing techniques may not clean/filter the output results. Therefore, with traditional techniques, it is a time-consuming and tedious task to comb through fields of unfiltered output data in order to identify potential errors in service execution. The disclosed technology, on the other hand, filters the output results such that only information relevant to identifying errors and to understanding service execution is stored.
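By way of illustration, the chunking performed by the preparation module 206 might be sketched as follows; the line-based splitting and chunk-file naming are assumptions, not the module’s actual behavior:

    # Hypothetical sketch of chunking: split one large input file into smaller
    # chunk files that can be processed by separate, smaller instances.
    from pathlib import Path

    def chunk_file(path, lines_per_chunk=100_000):
        chunk_paths, buffer, index = [], [], 0
        with open(path) as src:
            for line in src:
                buffer.append(line)
                if len(buffer) >= lines_per_chunk:
                    chunk_paths.append(write_chunk(path, index, buffer))
                    buffer, index = [], index + 1
            if buffer:  # flush the final partial chunk
                chunk_paths.append(write_chunk(path, index, buffer))
        return chunk_paths  # these references go into the config file

    def write_chunk(path, index, lines):
        chunk_path = Path(f"{path}.chunk{index:04d}")
        chunk_path.write_text("".join(lines))
        return str(chunk_path)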
[0038] Moreover, the run module 208 can include a tracing module 207. The tracing module 207 can be configured to store information (e.g., parameters) associated with the prepared config file and any updated config file thereafter in the run trace store 110. As a result, the tracing module 207 can be used to create a trackable trail of every step through execution of the services in the run. This is beneficial for the user to more easily and quickly trace through an entire run to see where errors might have occurred. The tracing module 207 can generate tracing data, which can include parameters about what happened during an execution step, what steps were performed during overall execution of a service or run, what input was received, what output was generated, etc.

[0039] The reporting engine 210 can be configured to review a final, updated config file (e.g., cleaned, output results) to determine whether the run was a success or a failure. As described throughout, the run can be a success where each of the instances is successfully executed by the run module 208. The run can be a failure where at least one of the instances is not successfully executed by the run module 208 (e.g., there was a data and/or service error in any one of the instances in the run). The reporting engine 210 can generate notifications about a success or fail state of the run (e.g., refer to FIG. 6).
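A tracing record of the kind the tracing module 207 might append to the run trace store 110 could be sketched as follows; the record layout is an assumption for illustration:

    # Hypothetical sketch of emitting a tracing record for one execution step;
    # the record layout is illustrative, not the run trace store's actual schema.
    import json
    import time

    def write_trace(store_path, run_id, service, step, inputs, outputs):
        record = {
            "run_id": run_id,      # ties the step to the overall run
            "service": service,
            "step": step,          # e.g., "prepare", "run", "collect"
            "inputs": inputs,
            "outputs": outputs,
            "timestamp": time.time(),
        }
        with open(store_path, "a") as store:  # append-only trail of steps
            store.write(json.dumps(record) + "\n")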
[0040] The notifications generated by the reporting engine 210 can be sent to the user device 102 (e.g., refer to C in FIG. 1). For example, the network interface 212 can facilitate communication of the cloud system 100 and the user device 102 over the network(s) 200.
[0041] The cloud system 100 can be in communication with the run trace store 110 and a data store 214. The run trace store 110 can be a serverless query service that makes it easier to analyze data (e.g., output) from run executions. For example, SQL commands can be used to query the data. The store 110 can include run data tables and views that register history of service executions, input files, parameters, and result directories. The store 110 can also include service derived tables with selected results that can be queried as input to nominations and other analysis.
[0042] A first run data table can be a table that aggregates all execution information in a single place. Additional views can allow the user to more easily query parts of this data. As described throughout, each run can save a run data JSON file in the store 110. Such files can identify the files that were run through specific services, the parameters used with the services, and other metadata (e.g., refer to FIG. 9B). The structure of these files can become nested, so they can be parsed, searched, and/or filtered based on a run ID, workflow name, group, read strand, file name, service name, parameter name, and/or parameter value (e.g., refer to FIG. 9A). Additional values can be used to parse, search, and/or filter these files.

[0043] A second view table can provide the user with a quick and easy way to look at combined files, service information, and parameter information. In other words, the second view table can provide the user with a more general view of the overall run.

[0044] Specific views can also be provided for the user. For example, an input files table view can be provided, which lists the input files associated with a specific run. A run summary view can be provided, which lists general execution information of the specific run. A service run view can be provided, which lists service run details for the specific run. A service parameters view can also be provided, which lists details of the services and their associated parameters that were used for the specific run. A service source ID view can also be provided, which can list run details for each source ID that is processed by each service and run. Moreover, each of the services can generate ready-to-see details, or derived data, on each of the services. The derived data and run data described above can be joined together to answer traceability questions that the user may have.
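By way of illustration, a query against such run data tables might look like the following sketch; the table and column names are assumptions rather than the store’s actual schema:

    # Hypothetical SQL for querying the run data table; table and column
    # names are illustrative only.
    RUN_DATA_QUERY = """
    SELECT run_id, workflow_name, service_name, parameter_name, parameter_value
    FROM run_data
    WHERE run_id = 'example-run-id'
      AND service_name = 'taxonomic_classification'
    """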
[0045] The data store 214 can store data that is used by the batch processing module 204. The data store 214 can also store the config file (e.g., an original config file and updated config files or output files for each of the executed services) as described throughout this disclosure. For example, the data store 214 can store cleaned output results for each of the executed services. The reporting engine 210 can then access each of the cleaned output results from the data store 214 to more quickly and efficiently determine whether the run was a success or failure.
[0046] Data store 214 can be one or more different storage services. For example, the data store 214 can be a database (e.g., cloud-based) for storing unstructured data. The data store 214 can also be a database (e.g., cloud-based) for storing structured (e.g., filtered, processed, and/or prepared) data. To provide for quicker and more efficient batch processing in the cloud, data can additionally be stored in different services within the data store 214. As will be understood, the different storage services may offer different service levels (e.g., one may be faster to search structured data, one may be faster to read and write unstructured data), and the particular storage services used may be selected based on technology needs of the various types of data and access.
[0047] Still referring to FIG. 2, the user device 102 can include a display 216, input device(s) 218, and a network interface 220. The user device 102 can be a mobile device, such as a smartphone and/or tablet. The device 102 can also be a computing device, such as a computer or laptop. The display 216 can provide a graphical user interface (GUI) to a user of the user device 102. Notifications about a state of the run can be displayed on the display 216. The input device(s) 218 can include a touchscreen (e.g., the display 216), a keyboard, a mouse, or any other type of input device, such as a microphone. The user can input, using the input device(s) 218, requests to run services. The display 216 can present an application (e.g., web-based), software, or other interface to the user that prompts the user to enter run requests. The user-inputted requests can then be sent to the cloud system 100. For example, the network interface 220 can facilitate communication of the user device 102 and the cloud system 100 over the network(s) 200.

[0048] In some implementations, as described further, the user can provide a mapping file (e.g., text file) as part of the user’s run request. The mapping file can indicate what services are requested to be run and what input files/data from the data store are to be used for each service. The system 100 can use the mapping file to autogenerate a config file. The config file can then be used to execute the run request.

[0049] FIG. 3 is a flowchart of a process 300 for executing cloud-based batch processing. The process 300 can be performed by the cloud system 100 described throughout this disclosure and/or other cloud-computing systems.
[0050] In 301, the cloud system can receive a request. The request can be received from a user device, as described herein (e.g., refer to FIG. 1). The request can include one or more services (e.g., jobs, processes) that the user would like to be run in a pipeline (e.g., workflow). The cloud system can also receive data, or pointer(s) to the data in a data store, associated with the run request. In 301, the cloud system can automatically generate and/or populate a config file with data to run the services. The data can also be used to update a config file during one or more parts of the process 300 as described herein. The config file can be used as input for running an instance. As described herein, the config file can be continuously updated and used as input for one or more other instances during run execution.
[0051] In 302, the cloud system can determine whether to execute one or more services in the request. In some examples, this can be determined before the request is submitted and/or this determination can be made by the user. If the cloud system determines that it cannot execute the service, then execution for that service is skipped (330). A service in the run request can be skipped while other services are executed. For example, the run request can define every service that can be run but then only certain services of that pipeline are chosen to be executed. Services that are not chosen to be executed can merely be skipped over so that other services that are chosen to be executed can be run.
[0052] If the cloud system determines that it can execute the service (e.g., if no errors are found, if the services have authorization to use the computing resources), then the cloud system can chunk data associated with the service in 304 into smaller files. Whether to chunk data can depend on each of the services to be executed in the run. In other words, one or more services can require data to be chunked before processing while one or more other services do not require data to be chunked. The config file can include file management information for each service. The file management information can include a boolean value (e.g., TRUE/FALSE) that indicates whether data should be chunked before the service is executed.
[0053] Chunking the data can include splitting up input data into a series of input files. Each of the input files can have a portion of the aggregate input data received in 301. Chunking is beneficial to reduce costs and time in processing the input data. For example, by breaking up the input data into smaller chunks or files, the smaller chunks can be run quicker through smaller systems and/or memory. The chunked data can then be stitched back together later in the process 300 (e.g., refer to dechunking in 318).
[0054] In some implementations, a decision to chunk the input data can be made by the cloud system in 301 and/or 302. As mentioned, not all input data may require chunking. For example, input data for one service can already be small enough in size that it can be processed via a smaller system while input data for another service can be too large for processing. Chunking in 304 can then be skipped for the input data that is already an appropriate size.

[0055] If the chunking process fails, then an error can be identified in 328. Once the error is identified, a notification can be generated and sent in 320, as described below.

[0056] Where chunking is successfully completed, output items can be created in 306. During creation of the output items, the config file can be expanded to include chunked data and other data (e.g., unchunked data, metadata, parameters) needed for each of the services in the run. As described throughout this disclosure, output of one service can be input to another service. Therefore, a lambda can be used to map (e.g., run transformations on) a service’s inputs and outputs such that they can reside in the config file in a predictable way. The service’s inputs and outputs can be correctly formatted using the config file so that the service’s outputs can be easily inputted into another service in the pipeline, whether at a later time or in parallel execution.
[0057] The expanded data (e.g., updated config file) can be stored in a temporary data store (e.g., the data store 214 in FIG. 2). The expanded data can include information such as a list of integers per each service. The list of integers can indicate a number of instances to be spun up for that particular service (e.g., execute instance(s) in 308). The integers in the list can also be pointers that point to locations in the data store along with other metadata needed to execute the service. The list of integers can be passed to preparing an instance for execution in 310, rather than passing in the actual parameters for each service. For each execution of an instance in 308, the cloud system can use the pointers to identify which data (e.g., output from a previously-run instance) to pull as that instance’s input. As described throughout this disclosure, the config file is continuously updated/transformed with each executed instance. Therefore, creating output items 306, which includes updating the config file, can occur with every execution of an instance in the pipeline.
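A minimal sketch of creating output items in this pointer-based style follows; the file layout and naming are assumptions for illustration:

    # Hypothetical sketch of creating output items: per-instance parameters
    # are written to the data store, and only integer pointers are passed on.
    import json
    from pathlib import Path

    def create_output_items(service_name, per_instance_params, store_dir):
        indices = []
        for i, params in enumerate(per_instance_params):
            # One small parameter file per instance; the integer i points to it.
            item = Path(store_dir) / f"{service_name}.{i}.json"
            item.write_text(json.dumps(params))
            indices.append(i)
        return indices  # passed to instance preparation instead of raw params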
[0058] If creating output items fails, then an error can be identified in 328. Once the error is identified, a notification can be generated and sent in 320, as described below.
[0059] An instance of the service can be executed in 308. In other words, the cloud system can spin up one or more instances using the config file that has been updated. In some implementations, one or more instances can be executed per service. Various other determinations for how many instances to run can be made by the cloud system, a user at the user device, information in the config file, and/or other systems. Moreover, as depicted in FIG. 3, one or more instances can be executed in parallel/simultaneously. As part of executing an instance of the service, tracing data can be generated and stored (e.g., refer to the run trace store 110 in FIGS. 1-2). This tracing data can be used by the user in order to track progress through every step in the executed instance, service, as well as overall run. The tracing data can, as described throughout this disclosure, indicate what happened during each step of run execution, what steps were performed during execution, etc.
[0060] The instance can be prepared in 310 as part of executing the instance. Preparing the instance can include using the list of integers (e.g., pointers) provided from creating output items in 306 to pull necessary parameters to run that instance. An updated config file (e.g., a config file that was updated during execution of one or more instances before this particular instance) can also be pulled when preparing the instance in 310. Moreover, as described further below, preparing the instance can include generating a new config file based on the previous config file (e.g., refer to FIGS. 7-8). A new config file can be generated per service execution. That config file can then be used as input to one or more other services in the pipeline.
[0061] If preparing the instance fails, an error can be identified in 314. For example, preparing the instance can fail where an integer in the list does not point to data in the data store that is needed to run that instance (e.g., the integer points to nothing in the data store, the integer points to data for a different instance or service, etc.).
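The preparation side of that exchange, including the dangling-pointer failure just described, might be sketched as follows (hypothetical names and storage layout):

    # Hypothetical counterpart to creating output items: preparation resolves
    # an integer pointer back into the stored parameters before running.
    import json
    from pathlib import Path

    def prepare_instance(service_name, index, store_dir):
        item = Path(store_dir) / f"{service_name}.{index}.json"
        if not item.exists():
            # A pointer to nothing is exactly the kind of failure that would
            # be identified during preparation (314).
            raise FileNotFoundError(f"no stored parameters for index {index}")
        return json.loads(item.read_text())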
[0062] If preparing the instance is successful in 310, then the prepared instance can be run in 312. The instance can be run using parameters, data, metadata, and/or the config file that was prepared in 310.
[0063] If running the instance fails, then an error can be identified in 314 (e.g., error in transforming data, incorrect input data, bugs in code, networking issues, etc.). Identifying the error while executing the instance in 308 is beneficial because it adds value later in the process 300 when generating and sending notifications (320). For example, recognizing that an error occurred when preparing the instance or running the instance and incorporating that error into the notifications can assist users to more quickly identify reasons why the overall run was not successful. Adjustments can then be made specific to preparing and/or running the instance such that during a subsequent run, these errors would not occur again.
[0064] If running the instance is successful in 312, then batch results can be collected in 316. Collecting batch results can include getting exit codes for each of the services that are run. Collecting batch results can also include cleaning and filtering output results from service execution (e.g., refer to FIGS. 7-8). Traditional output results can include an abundance of pre-filled fields and information that is not critical for identifying errors in execution. Therefore, collecting batch results in 316 includes filtering out unimportant or non-critical fields from the output results. The remaining, cleaned output results can include only the information relevant for identifying errors in the service execution. For example, the batch results can include success and/or failure information about each of the prepared and run instances in the run request. Batch results can be collected while each instance is executed. In other implementations, batch results can be collected once all the instances in the run are executed.
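A minimal sketch of this filtering follows; the set of retained fields is illustrative, not the system’s actual list:

    # Hypothetical sketch of cleaning raw batch results down to the fields
    # relevant for identifying errors; the retained fields are illustrative.
    KEEP_FIELDS = ("run_id", "service", "status", "exit_code",
                   "start_time", "end_time", "input", "output")

    def clean_batch_results(raw_results):
        # raw_results: list of dicts with many pre-filled, non-critical fields
        return [{k: r[k] for k in KEEP_FIELDS if k in r} for r in raw_results]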
[0065] A status for the overall run can be determined based on the exit codes and/or each service’s output(s) (e.g., refer to 320, 322). For example, when collecting batch results, the cloud system can determine whether any of the services failed. If any service did fail or experience an error, then the cloud system can determine that the overall run was a failure (e.g., refer to 320, 322). Determining a status for the overall run can also be performed separate from collecting the batch results in 316.
[0066] The collected batch results can be dechunked in 318. Dechunking can occur after each instance is executed in 308. In other implementations, dechunking can occur after all instances are executed in 308. Dechunking is a process of putting or stitching back together whatever data was chunked in 304. If dechunking is successfully completed, a notification can be generated and sent in 320.
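Dechunking might be sketched as follows, assuming the chunk references recorded at chunking time are available in order:

    # Hypothetical sketch of dechunking: stitch per-chunk outputs back into a
    # single file, in the order the chunks were produced.
    def dechunk_files(chunk_paths, merged_path):
        with open(merged_path, "w") as dst:
            for chunk_path in chunk_paths:  # order matters for stitching
                with open(chunk_path) as src:
                    dst.write(src.read())
        return merged_path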
[0067] If dechunking is unsuccessful, an error can be identified in 328.
[0068] In 328, one or more errors can be identified based on failures in chunking
(304), creating output items (306), and/or dechunking (318). Identifying such errors is beneficial because it adds value later in the process 300 when generating and sending notifications to the user (320). For example, recognizing that an error occurred when chunking the data in 304 and incorporating that error into the notifications can assist users to more quickly identify reasons why the overall run was not successful. Therefore, adjustments can be made specific to chunking data for a specific service such that during a subsequent run, that error does not reoccur.
[0069] A notification can be generated and sent in 320 (e.g., FIGS. 5-6). The notifications can provide users with a high-level view of whether the overall run was successful or failed. The notifications can also include high-level views of one or more services that failed and/or were successful. A lambda can be used to generate the notifications. The notifications can be generated based on pulling data from the config file and/or the collected batch results (316). Pulling data from the config file can be beneficial to get a high-level view of how each service in the batch was executed. Pulling data from the collected batch results (316) can also be beneficial to get a more detailed view or understanding of individual services or instances that were run as well as services or instances that require additional reporting out. Moreover, the notifications can be generated based on pulling stored tracing data (e.g., refer to 308 in FIG. 3, the run trace store 110 in FIGS. 1-2).
[0070] The cloud system can check a status of the full run in 322. 322 can be performed as part of 320 in generating and sending the notifications. 322 can also be performed as part of other portions of the process 300, such as collecting batch results (316), identifying error(s) (328), and/or skipping execution (330). During the status check, the cloud system can determine whether there was an error or failure in executing the full run (e.g., data associated with one service could not be chunked in 304, an instance for another service could not be successfully prepared in 310, etc.).

[0071] If there is an error or failure with any one of the services that are part of the run request, then the cloud system can identify the run with a failed state identifier in 324. If the cloud system determines that every service was successfully executed, then the run can be identified with a success state identifier in 326. The identifiers determined in 324 and 326 can be used by the cloud system in generating and sending the notifications to the user (320). For example, in some implementations, the cloud system may only send a notification to the user in 320 if the run is identified with the failed state identifier. This is beneficial for the user to more easily and quickly determine why the run failed and what adjustments can be made so that the run does not fail again.
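The status check reduces to a simple rule: one failed service fails the run. A minimal sketch, assuming cleaned per-service results that carry a status field:

    # Hypothetical sketch of the full-run status check: one failed service
    # marks the entire run as failed.
    def full_run_status(service_results):
        failed = [r for r in service_results if r.get("status") != "success"]
        return ("FAILED", failed) if failed else ("SUCCESS", [])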
[0072] FIG. 4 is a flowchart of a process 400 for executing an exemplary instance in FIG. 3. The process 400 can be performed by the cloud system or any other system described herein. By way of example, the process 400 can be performed during metagenomics analysis. The process 400 can also be performed in a variety of other applications and industries. In metagenomic analysis, a user may seek to identify genetic markers for gut bacteria. Identifying these genetic markers is advantageous to predict/determine trajectories of individual hosts’ (e.g., humans) health conditions. Batch processing for metagenomic analysis therefore uses significant computing power and automation.
[0073] Before instances can be run, quality control can be performed on input data in 402 (e.g., refer to 304-306 in FIG. 3). Scripts can be gathered by the cloud system in order to transform and clean the input data. In the exemplary use of FIG. 4, the input data can include metagenomics or transcriptomics data. This data can further include sequencing data. The metagenomics data can be human gut sequencing data. Typically, the input data received in 402 can be a large population of test cases or human samples.
[0074] Part of cleaning the input data can include removing or filtering out DNA sequences that come from a host (e.g., patient, person, human). After all, the preferred data for metagenomics analysis (e.g., batch processing) is DNA sequences for the bacteria being studied, without other non-related DNA sequences. One or more existing quality filters can be applied in 402. For example, quality values can be assigned to each character in a DNA sequence. Based on a comparison of the quality values, the cloud system can determine how confident it is about what a certain DNA sequence represents. In other implementations of the process 400, different quality filters and/or quality control techniques can be employed.
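One simple quality filter of this kind, keeping a read only if its mean per-base quality clears a threshold, might be sketched as follows; the threshold and read layout are assumptions, and real pipelines typically combine several filters (e.g., host-sequence removal, adapter trimming):

    # Hypothetical sketch of one quality filter: keep a read only if its mean
    # per-base quality clears a threshold.
    def passes_quality(read, min_mean_quality=20):
        # read is assumed to look like:
        # {"sequence": "ACGT...", "quality": [30, 31, 28, ...]}
        scores = read["quality"]
        return sum(scores) / len(scores) >= min_mean_quality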
[0075] Once quality control is performed on the input data, the cleaned up data can be used to execute instances in 404 (e.g., refer to 308 in FIG. 3). In other words, jobs or services in a pipeline can be run using the cleaned data.
[0076] Instances can be executed for a plurality of services, as described herein. Some of these services can include, for example, a taxonomic classification module 406, an assembly module 408, which involves gene calling in 410, a read mapping module 412, and an rna_seq module 414. One or more other modules or services can be implemented/executed in 404 based on a user’s run request.
[0077] In the exemplary use of FIG. 4, the taxonomic classification module 406 can take all the data in a classification and identify where the data comes from. In other words, the module 406 can label the input sample data with identifiers indicating which organisms the sample data is predicted to originate from.

[0078] The assembly module 408 can be configured to take all fragments of DNA (e.g., individual characters of individual DNA sequences) and combine them into a single, contiguous piece of DNA. Assembly can be reference-guided and/or de novo. As a result of this assembly, genomes for all species can be identified. Thus, gene calling 410 can be applied to the contiguous piece of DNA to predict, based on genomic features, what functional properties are encoded by the genes in the contiguous piece of DNA. As described throughout the disclosure, the assembly module 408 can receive, as input, the output from the taxonomic classification module 406 and/or any other module that is being executed, whether at a different time or in parallel. The module 408 can also receive a config file associated with the run request such that the module 408 can be executed accordingly.
[0079] The read mapping module 412 can also perform similar functions as the taxonomic classification module 406. In some implementations, the module 412 can generate a list of assemblies.
[0080] The rna_seq module 414 can be configured to focus on RNA sequencing. The module 414 can determine what cells in a sample are doing at time of data collection by generating a list of all the genes/proteins that are created at the time of collection.
[0081] As described throughout this disclosure, output from the taxonomic classification module 406 can be added to a config file. That updated config file can be inputted into the assembly module 408. The assembly module 408 can use information from the updated config file for execution. Output from the module 408 can then be used to update the config file. That updated config file can be used as input into the read mapping module 412. Output from the module 412 can be used to update the config file. That updated config file can be used as input into the rna_seq module 414. Output from the module 414 can then be used to update the config file. This process can continue until all instances and/or services are executed in 404 during the run request. For example, if data is inputted for 300 test patients, 300 instances of quality control can be performed in 402. 300 instances of any of the services 406-414 can also be executed. Having the config file, which is continuously updated and used as input as well as output, makes it easier to handle passing of parameters from service to service in a batch process. Moreover, using the config file provides for services to be executed in parallel with minimal input or processing restrictions.
[0082] FIGS. 5A-B are exemplary notifications for executed cloud-based batch processes. Users can prefer receiving a notification as an email, where that notification provides detailed yet high-level information/status about run execution. This type of notification can be more beneficial to users than uncleaned/unfiltered output that results from conventional batch processing techniques (e.g., refer to FIG. 7). With the disclosed technology, users do not have to spend valuable time combing through long, platform-specific service outputs and code to identify information about the execution and/or any errors that occurred during execution. Instead, the disclosed notifications can, for example, indicate what data is or has been run, what data caused the entire run to error out, what service experienced the error, and any additional metadata that can be useful in error identification and/or remediation. As such, a scientist can be presented with information about the job that they submitted, instead of computer-hardware information that they have not spent the time to become familiar with.

[0083] FIG. 5A depicts a notification 500 for a successful run. The notification 500 can include information such as a run ID, type of run, billing code, data partner, execution link, and status statement. The notification 500 can also include IDs for one or more of the services that were successfully executed during the run, including start and end times, and output files. As described in more detail below (e.g., refer to FIGS. 7-8), the run ID provides a trail for the user to track an entire run. Unlike conventional approaches, the disclosed technology provides the run ID so that users can more easily track and identify errors throughout an entire run. The execution link can be accessed by the user to view real-time progress of run execution. In other words, the user can use the execution link to see what service is currently being executed, what services have been executed, and what services still need to be executed. This is another way in which the user can track progress of the run and any potential errors therein.
[0084] The notification 500 can also provide the user with the service IDs and a high-level view summarizing each service execution. As a result, the notification 500 can be more service specific. This is beneficial for the user to understand how long it took to execute a particular service and what output was generated during that execution.
[0085] FIG. 5B depicts a notification 510 for a failed run. The notification 510 can include information such as run ID, type of run, billing code, data partner, execution link, and status statement. As described in reference to the notification 500, the notification 510 can also include high-level information about the services executed during the run. The notification 510 can provide information about the failed services.

[0086] The notification 510 can additionally and/or optionally indicate input for each of the executed services as well as what input or datasets need to be looked at for a re-run. This is beneficial to point the user towards what caused the specific service to fail so that the error can be remediated. The notification 510 can also list the exit code and how many re-runs were attempted. Providing this type of information in the notification 510 can assist the user in more quickly and easily determining whether there was a data error or a service error and how to fix such errors.
As described herein, a run is considered failed if even one service has an error during execution. Therefore, the notification 510 for a failed run may only provide information about the service(s) that failed, rather than providing information for every service that was executed. The user is mostly interested in the services that failed so that they can fix whatever caused those failures. Hence, providing the user with a notification pointing the user to the particular service(s) that failed can assist the user in more quickly identifying and addressing errors. The information about the failed services can include execution information, exit code, start and end times, and log stream links. The execution information provided shows what job number this service was in the list created when building output items. This can be an arbitrary number. The group information provided shows a unique identifier for data that is being processed. The retry information provided shows what retry attempt this entry represents. For example, 1 can indicate a first retry of the initial failed job. The exit code provided shows an outcome of the batch process (e.g., whether it was a success or failure). The exit code can be represented by a boolean value. For example, 1 can indicate that the outcome failed and 0 can indicate that the outcome was successful. The vCPUs value provided shows the number of vCPUs that are requisitioned for the batch process. In other words, it shows the compute resources that are required for execution of the run. The log stream link provided shows a log for the batch process. The log stream link can include reporting information about individual failed task attempts.
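By way of illustration, one failed-service entry carrying the fields just described might look like the following sketch; all values are hypothetical:

    # Hypothetical failed-service entry in a failure notification, mirroring
    # the fields described above; all values are illustrative.
    failed_service_entry = {
        "execution": 17,        # job number in the list built with the output items
        "group": "sample-042",  # unique identifier for the data being processed
        "retry": 1,             # 1 = first retry of the initial failed job
        "exit_code": 1,         # 1 = failed, 0 = succeeded
        "vcpus": 4,             # compute resources requisitioned for the batch process
        "log_stream": "https://example.com/logs/stream-abc123",
    }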
[0087] FIG. 6 is a flowchart of a process 600 for generating notifications. The process 600 can be performed by the cloud system or any other computing system described herein.
[0088] Services in a run request can be executed in 602. As described throughout this disclosure, one or more services can be executed in parallel. The services can be executed according to a processing order as specified in a config file. The config file, as described in further detail in reference to FIG. 7, can dictate parameters for execution of each of the services. Thus, the config file can be parsed to pull the specifics for executing a particular service. Output can be generated during execution of each service. Moreover, during service execution, tracing data can be generated and stored (e.g., refer to the run trace store 110 in FIGS. 1-2, 308 in FIG. 3).
[0089] The output data of each service can be cleaned and/or filtered in 604 (e.g., refer to FIG. 7). The output can be in a new file, which is cleaned according to one or more parameters identified in the config file. Cleaning the output can include transforming the output into readable input for one or more other services in the run pipeline. As a result, more services can be processed in parallel without sacrificing efficiency, time, and/or memory. In some examples, cleaning the output data can occur after a service is executed. In other examples, cleaning the output data can occur after all the services in the run pipeline are executed.
[0090] Cleaning the output data can include identifying information in the output that can be useful to a user for diagnosing errors in the run. For example, conventional output data can include a lot of unnecessary information about execution of a service. The user can specify how to clean the output data in the user’s run request. If there is an error in execution of that service, the user would otherwise have to comb through all of the output data to identify a source for that error. This can be a time-consuming and tedious process. Cleaning the output data, on the other hand, can result in refining the output data to include only information pertinent to identifying and/or diagnosing errors in the service execution. The cleaned output data can include information about potential errors in the service execution. Cleaning the output data can also include updating and/or generating a status statement/value for the executed service. The status statement can indicate whether execution of the service was a success or a failure.
[0091] Once the output data for each service is cleaned, it can be determined whether any of the executed services have errors in their output data in 606. The cleaned output data can be reviewed to quickly identify whether a service was successfully executed or not. For example, the cloud system can look at the status statement/value for each of the outputs. If at least one service has an error (e.g., a status statement of failed), then the entire run can be identified as failed in 608. On the other hand, if none of the services have errors, then the entire run can be identified as successful in 612.
[0092] A notification indicating that the run failed can be generated in 610 (e.g., refer to FIG. 5B). Likewise, a notification indicating that the run was successful can be generated in 614 (e.g., refer to FIG. 5A).
[0093] The notification generated in 610 or 614 can then be outputted in 616. For example, the notification can be sent to a user device of the user, for display on an application, website interface, or other graphical user interface. As another example, the notification can be sent as an email to the user who requested the run. The notification can be transmitted to devices of one or more stakeholders of the run request.
[0094] FIG. 7 is a flowchart of a process 700 for using a config file for cloud-based batch processing. The process 700 can be performed by the cloud system or any other system described herein.
[0095] A mapping file can be received in 702. A user at a user device (e.g., refer to the user device 102 in FIGS. 1-2) can generate the mapping file (e.g., text file) and send it to the cloud system. The mapping file includes a run request and pointers to data that will be processed in the run request.
[0096] A config file can be generated in 704. The config file can be automatically generated based on the mapping file. Autogeneration of the config file can be done using one or more scripts that are stored in the cloud system. For example, a script can automatically generate run request code for the received mapping file. The config file can include information about the run request, data to be processed in the run, and one or more services that are part of the run pipeline. The config file can also identify input items, which are stored parameters that are to be read in for processing during run execution.
[0097] For example, and as depicted in FIG. 8A, the config file can include batch parameters, which can be used to identify where in the cloud system to run the request. The config file also includes information specific to the request, including an order of services in the run pipeline, what services are requested, and what user to bill for completing the run request. The config file also includes service parameters, which can be used to identify one or more services that will be run. The config file can further include input item parameters, which can indicate what datasets to access for each of the services during the run.
[0098] The config file can include file management information specific to each of the services in the run. For example, the file can identify input and output files for each service. The config file can also indicate whether certain input files need to be chunked before processing, and if so, what size chunks should be used. The chunking information can be tailored to the parameters of the particular service and/or the input files that are being used for that service. As discussed below, the config file can also include an output items strategy for each of the services. The output items strategy can identify the type of data that is to be generated as output for the particular service. The strategy can also help determine a location of the output files so that the output files can be fed into another service as input for that service. In some implementations, the output items strategy can also identify and/or allocate storage locations for output items.
[0099] Because information for identifying and executing each service is in one config file, the file can be updated and parsed with every service execution. Moreover, as described further below, the config file can be expanded into separate files per executed service, such that separate output files can be generated per service and fed into the next service in the pipeline as input. This input would effectively be read-only, which means it can be reused multiple times across multiple different services all in parallel. This is advantageous to promote robust and scalable batch processing.
[00100] Still referring to FIG. 7, the config file can be prepared in 706. A lambda can be used to prepare the file. Preparing the config file can include adding an execution link to the config file, assigning a run ID to the config file, and updating input data for the config file (e.g., refer to FIG. 8B). Additionally, a start time and/or price estimate can be added to the config file when it is being prepared.
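A minimal sketch of this preparation step follows, with hypothetical field names and link format:

    # Hypothetical sketch of preparing the config file: stamping it with a
    # run ID, execution link, and start time before execution begins.
    import datetime
    import uuid

    def prepare_config(config):
        prepared = dict(config)
        prepared["run_id"] = str(uuid.uuid4())
        prepared["execution_link"] = (
            "https://example.com/executions/" + prepared["run_id"]
        )
        prepared["start_time"] = datetime.datetime.utcnow().isoformat()
        return prepared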
[00101] The execution link is beneficial to assist the user in tracking execution of the run and the services therein. For example, the user can use the execution link to see a real-time diagram flow of the run and where in the run the processing is up to. The user can view real-time progress of the run and potential errors that may arise during the run.

[00102] The run ID is beneficial because it assists the user in more easily and quickly tracking execution of each service/instance during the entire run. In other words, the run ID creates a trail through every working step and/or service that is executed during the run. Using the trail, the user can more easily identify where and why any error occurred during the run. Traditional batch processing techniques do not provide for tracking via a run ID. As a result, under a traditional approach, the user would have to comb through logs associated with each executed service in order to locate and identify any errors. This is a time-consuming process and can result in the user misinterpreting data in the batch logs and/or missing any errors in the logs. The disclosed technology, on the other hand, uses the run ID to provide the user with a high-level view of what service failed and what error caused that failure (e.g., refer to FIG. 6).
[00103] The input data can also be updated in the config file in 706. The lambda can read through the mapping file to appropriately configure the config file for each of the services in the run. In other words, the input data can be adapted based on mapping information for each of the services such that the input data can be read and used during execution of the services in the run.

[00104] The prepared config file can then be received in 708. When the config file is received, it is ready to be used in batch processing (e.g., execution of the run and each of the services therein). In other words, one or more of the services in the run can be executed. As described herein, one or more services can be executed in parallel.
[00105] A service identifier and its associated parameters can be parsed from the prepared config file in 710. Thus, to run the service, code associated with that service can be identified and selected from the config file based on the service identifier, as depicted in FIGS. 8C and 8E.
[00106] The service can be run (e.g., executed) in 712. Running the service includes accessing the data and/or parameters that are associated with the service identifier (710). To run the service, the cloud system can receive only the service identifier and/or pointers to parameters for that service, as depicted in FIGS. 8F and 8G. The system does not receive pointers or identifiers for other services in the run that are not yet being executed.
[00107] While the service is being run, the config file can be updated. For example, the config file can have an execute parameter associated with each service in the run. The execute parameter can have a boolean value, such as TRUE/FALSE, YES/NO, 0/1, etc. This value can be updated/changed to indicate whether the service is currently being performed (e.g., if the service is executing in 712, then the execute parameter can be changed to TRUE in the config file).
[00108] An output strategy for the service can be received in 714. This can occur before and/or during execution of the service (712). Each service can have a pre-identified output strategy. The output strategy can indicate what types of files and/or output items are expected from running the service (e.g., refer to FIG. 8D).
[00109] Expected output items can be created in 716. This can occur during execution of the service (712). The output can include a list of identifiers to parameters and/or data, as described throughout this disclosure (e.g., refer to FIG. 8H).
[00110] The created output items can be written into the config file in 718. In other words, the config file can be updated. The updated config file can then be provided as input to one or more other services further down the pipeline. In some implementations, the created output items can be written into a separate config file that is associated with the particular service that is being executed. The separate config file can include some of the original data from the prepared config file of 708. However, some of the data can be removed and/or added based on execution of the service. Moreover, as config files are generated for each service that is executed, each of those config files can include some or all data from other and/or previous config files associated with the different services in the run.
[00111] Therefore, the config file associated with the particular service can be provided as an input file to one or more other services further down the run pipeline. The config file associated with the particular service can be a read-only file, such that the file can be reused across multiple different services in parallel. When the different services are run/executed, they can receive the config file associated with the particular service as well as the prepared config file from 708. As mentioned, the config file associated with the particular service can include input (output from the particular service) used for execution of the next service. The prepared config file from 708 can be used to execute the next service.
[00112] A lambda can then be used to filter and/or clean output batch results in 720. This can occur after the service is executed. In some implementations, cleaning the output batch results can occur after every service is executed in the run. The cleaned output batch results can be stored in a temporary data store (e.g., refer to the data store 214 in FIG. 2).
[00113] Each service outputs batch results. For example, a service can generate one or more output files upon completion of service execution. The output batch results (e.g., files) can include a lot of different types of information. All that information, however, may not be necessary for the user to understand a status of the service (e.g., whether there was an error or the service was successfully executed), as depicted in FIG. 8H.
[00114] Traditionally, the user may have to comb through all the information in the output batch results in order to identify whether there were any errors and if so, what was the source or sources of such errors. This can be a time-consuming and error-prone process. The disclosed technology, on the other hand, uses the lambda to go through the output batch results for each service and clean or filter such results. Filtering such results can provide for a high-level view of an overarching service status, as depicted in FIG. 8I. For example, the filtered output batch results can include the run ID, the service identifier, input and output file identifiers, a start time, and a status for the service (e.g., success or failure). This type of high-level view can be used by the user to more easily and quickly identify what might have gone wrong in executing the service and how these issues can be resolved so that the service can be executed successfully in a subsequent run.
[00115] It can be determined whether additional services need to be executed in the run in 722. As mentioned above, cleaning the output batch results can occur before and/or after all services in the run are executed. In 722, the cloud system can look at the prepared config file from 708 and see whether any additional service identifiers need to be parsed.
[00116] If additional services need to be executed (e.g., this can be done in parallel with other services as they are being executed), then 710-720 can be repeated for each additional service. If no more services need to be executed in the run, then a notification can be generated for the run in 724. As described throughout this disclosure (e.g., refer to FIG. 6), once all services are executed and their outputs are cleaned, the cloud system can check all of the outputs to determine whether there were any failures. If there is even one failure, the system can generate a notification indicating that the entire run failed. An overarching run notification can then include information from one or more of the outputs associated with the services to indicate which of the services failed/had errors (e.g., refer to FIG. 8J).
[00117] Tracing data can be generated at a variety of steps in the process 700. For example, tracing data can be generated and stored when the service identifier and parameters are parsed in 710. Tracing data can also be generated and stored when the service is run in 712. The tracing data can then be used in generating notifications for the run in 724. The tracing data is beneficial to assist the user in more easily stepping through execution of any service or step in the run.

[00118] In some implementations, with regards to the process 700, 702-710 can be performed by the preparation module 206 as depicted and described in FIG. 2. 702-710 can also be performed during and/or as part of 302-310 in FIG. 3 and/or 402 in FIG. 4. 712-720 can be performed by the run module 208 and config file updater 209 as depicted and described in FIG. 2. 712-720 can also be performed during and/or as part of 312-318 in FIG. 3 and/or 404-414 in FIG. 4. Additionally, 724 can be performed by the reporting engine 210 as depicted and described in FIG. 2. 724 can also be performed during and/or as part of 320-326 in FIG. 3 and/or the process 600 in FIG. 6. In yet other implementations, any portion of the process 700 can be performed by one or more other modules or systems and/or during any other processes described throughout this disclosure.
[00119] FIGS. 8A-J are exemplary code segments during cloud-based batch processing as described herein (e.g., refer to FIG. 7). FIG. 8A depicts an exemplary config file 800, which can be generated in 704 in FIG. 7. The config file 800 includes run execution information, such as the run ID, execution link, user email, start and end times, and status. The config file 800 also includes a pointer to a location in a data store where input items for the run are located. The pointer can also be to a mapping file generated by the user as part of their run request. That mapping file is used to autogenerate the config file 800 (e.g., refer to 702 in FIG. 7).
[00120] As described throughout, the config file 800 includes execution information for each service in the run request. In the exemplary config file 800, two services are requested: “MTxQC” and “taxonomic_classification” (e.g., refer to 402 and
406 in FIG. 4). Each of the services can include a boolean value for execution (e.g., TRUE/FALSE), file management information, an output items strategy, batch parameters, and service parameters. The file management information can dictate whether input data for that service should be chunked before processing and if so, a size for that chunking (e.g., refer to 304 in FIG. 3). For example, the file management information for the “MTxQC” service has an execute boolean value of FALSE, which means no chunking is needed for this service’s execution.
[00121] The output strategy can indicate what type of output is expected as a result of executing the service, where to find the output, etc. For example, the “taxonomic_classification” service’s strategy includes a strategy name as well as an output files extension.
[00122] The batch parameters information can indicate information for identifying the service and where it is being executed in the cloud system. Thus, each service can be given a job name, queue, definition, and timeout.
[00123] Finally, the service parameters can indicate parameters, conditions, and other information necessary to execute the service. For example, the “MTxQC” service has parameters for each step in the service’s execution. Those steps can include pairing data, trimming data, and removing data.
[00124] FIG. 8B depicts an exemplary updated config file 810, which is prepared in 706 in FIG. 7. A lambda is used to add metadata to the config file 810. For example, the original config file 800 does not include values for the run ID, execution link, and start time. Once the lambda is used, the config file 800 is updated into config file 810 with the run ID, execution link, and start time. In addition, the input items pointer can be updated from the location of the user’s mapping file to a temporary input file for the first service that is to be executed.
[00125] FIG. 8C depicts an exemplary segment 820 of the config file that is passed for executing a particular service. In other words, when the config file 810 is parsed, the segment 820 of the file is selected in order to execute the service (e.g., refer to 710 in FIG. 7). The segment 820 can include run information, input location, and service information. For example, the exemplary segment 820 includes information needed to run the “MTxQC” service, the first service in the run pipeline.
[00126] FIG. 8D depicts exemplary output items 830 that are built during execution of the service (e.g., refer to 716 in FIG. 7). The exemplary output items 830 can include identifiers for each of the values that are going to be returned during execution of the service.
[00127] FIG. 8E depicts exemplary code 840 for files that are going to be written into the data store during service execution. Each number/index represents a file (e.g., JSON) that is to be written into the data store. In other words, the numbers together indicate the number of times that the service will be executed in a batch. The code 840 also includes associated parameters and input files in order to execute each instance of the service. This code 840 can be generated while building the output items.
[00128] FIG. 8F depicts exemplary code 850 that is passed in order to run the service. The code 850 can be passed for each instance of execution of that service. Therefore, the code 850 can be passed for each of the numbers represented in the code
840 in FIG. 8E. [00129] FIG. 8G depicts exemplary code 860 that is used to run the service. For example, before running the service, the service’s input files and parameters need to be read from the data store. Those read files and parameters are depicted in the code 860, which can then be used by the system to run the service.
[00130] FIG. 8H depicts exemplary batch results 870 (e.g., refer to 316 in FIG. 3). When the service is executed, pre-defined fields from conventional batch processing services can be returned in the results. This causes the output file for the service to expand undesirably, which makes it more challenging and tedious for the user to comb through the results and identify errors or other pertinent information.
[00131] FIG. 8I depicts exemplary collected batch results 880 (e.g., refer to 316 in FIG. 3, 720 in FIG. 7). Using a lambda, the batch results 870 can be cleaned and/or filtered to identify and pull out pertinent reporting information, generating the collected batch results 880. The results 880 can include a service identifier, run ID, status information (e.g., fail or success), start and end times, input information, and output information. The collected batch results 880 can then be saved in the data store, rather than the original batch results 870.
[00132] FIG. 8J depicts exemplary data 890 that is used to generate notifications for the user (e.g., refer to 320 in FIG. 3, FIG. 6, 724 in FIG. 7). The system can use the collected batch results 880, which is stored in the data store, in order to notify the users about results for each of the executed services and the overall run. As described throughout this disclosure, the system may not output all of the data 890 and instead may only output high-level information about a service or services that failed during execution. Therefore, the user can more easily and quickly identify errors in run execution.
[00133] FIGS. 9A-C are exemplary code segments for querying results from cloud-based batch processing as described herein. FIG. 9A depicts a run data schema 900. As described in reference to the run trace store 110 in FIGS. 1-2, run data can be queried and reported to the user to assist the user in tracing every aspect of an executed run. The run data, as described throughout this disclosure, can come from the config file and/or any of the updated config files generated thereafter during execution of each service (e.g., instance) in the run pipeline.
[00134] The run data schema 900 of FIG. 9A indicates some of the data that can be stored in the store 110. This schema 900 can also be used to query run data. The schema 900 includes a run ID, a JSON version, a parent run ID, an OSM, a user email, a project code, a project billable, a price estimate, a start time, a pointer to input file(s), an end time, a run status, a results location in storage, and information about one or more services that are executed in the run. The services information can include a service name, service parameters, service results, a start time, an end time, and a service results location.
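Expressed as a type, the schema might look like the sketch below. The field names paraphrase the list above and are not necessarily the literal names used in schema 900.

```python
from typing import List, Optional, TypedDict

class ServiceRecord(TypedDict):
    service_name: str
    service_params: dict
    service_results: dict
    start_time: str
    end_time: str
    results_location: str      # where this service's results are stored

class RunRecord(TypedDict):
    run_id: str
    json_version: str
    parent_run_id: Optional[str]
    osm: str                   # the "OSM" field of schema 900
    user_email: str
    project_code: str
    project_billable: bool
    price_estimate: float
    start_time: str
    end_time: str
    input_files: str           # pointer to input file(s)
    run_status: str
    results_location: str
    services: List[ServiceRecord]
```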
[00135] FIG. 9B depicts an exemplary run data JSON file 910. The file 910 can be parsed and/or queried using one or more of the techniques described herein (e.g., refer to the run trace store 110 in FIGS. 1-2, FIG. 9A). As demonstrated, the file 910 includes many of the fields in the run data schema 900 of FIG. 9A, including the JSON version, end time, parent run ID, price estimate, project billable, project code, pointer to the results location, run ID, pointer to input files, and one or more services information. The file 910 further includes the start time, status, user email, and workflow name.
[00136] FIG. 9C depicts an exemplary query used to create tables and parse raw data for presentation to a user. Arrays can often be stored with complex structs. To turn the items of such arrays into rows, the UNNEST command in SQL can be used. Using the UNNEST command in a query 930, raw input table 920 can be transformed into a queried table 940. The exemplary raw input table 920 includes information for two runs, where each executes different services with the same input files. However, each service and each parameter for that service are grouped into one column. Rows for each of the service names can be generated using the query 930. The UNNEST command in the query 930 can be used to turn each entry in the array into a row. Thus, as depicted in the query 930, the UNNEST command can be called followed by a t(ALIAS_NAME) statement, which saves the parsed rows under an object with the name ALIAS_NAME. The ALIAS_NAME changes depending on what a user inputs in the t(ALIAS_NAME) statement; in the exemplary query 930, the alias is “serv.” Running the query 930 produces the queried table 940. Additionally and/or optionally, to parse each service parameter into individual columns, the query 930 can be expanded using WITH and AS statements, as depicted in expanded query 950. As a result of adding the expanded query 950 to the query 930, two additional columns can be generated: “name” and “value.” Each row can contain a single value for each of these columns. One or more other queries can be used to generate different customized views of all run data.
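Because the literal text of queries 930 and 950 appears only in the figures, the following is a reconstruction of the described pattern: UNNEST with a t(ALIAS_NAME) alias, then a WITH ... AS expansion that splits each service parameter into “name” and “value” columns. The table and column names are assumptions, and the SQL assumes a Presto/Athena-style engine that supports UNNEST over arrays of structs.

```python
# Reconstruction of the query pattern described for FIG. 9C; table and
# column names are hypothetical.

# Turn each entry of the services array into its own row, aliased "serv".
QUERY_930 = """
SELECT run_id, serv.service_name
FROM run_data
CROSS JOIN UNNEST(services) AS t(serv)
"""

# Expanded with WITH ... AS to split each service parameter into
# separate "name" and "value" columns, one value per row.
QUERY_950 = """
WITH per_service AS (
    SELECT run_id, serv
    FROM run_data
    CROSS JOIN UNNEST(services) AS t(serv)
)
SELECT run_id, serv.service_name, p.name, p.value
FROM per_service
CROSS JOIN UNNEST(serv.service_params) AS t(p)
"""
```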
[00137] FIG. 10 shows an example of a computing device 1000 and an example of a mobile computing device that can be used to implement the techniques described here. The computing device 1000 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The mobile computing device is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart-phones, and other similar computing devices. The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed in this document.
[00138] The computing device 1000 includes a processor 1002, a memory 1004, a storage device 1006, a high-speed interface 1008 connecting to the memory 1004 and multiple high-speed expansion ports 1010, and a low-speed interface 1012 connecting to a low-speed expansion port 1014 and the storage device 1006. Each of the processor 1002, the memory 1004, the storage device 1006, the high-speed interface 1008, the high-speed expansion ports 1010, and the low-speed interface 1012, are interconnected using various busses, and can be mounted on a common motherboard or in other manners as appropriate. The processor 1002 can process instructions for execution within the computing device 1000, including instructions stored in the memory 1004 or on the storage device 1006 to display graphical information for a GUI on an external input/output device, such as a display 1016 coupled to the high-speed interface 1008. In other implementations, multiple processors and/or multiple buses can be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices can be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
[00139] The memory 1004 stores information within the computing device 1000. In some implementations, the memory 1004 is a volatile memory unit or units. In some implementations, the memory 1004 is a non-volatile memory unit or units. The memory 1004 can also be another form of computer-readable medium, such as a magnetic or optical disk.
[00140] The storage device 1006 is capable of providing mass storage for the computing device 1000. In some implementations, the storage device 1006 can be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. A computer program product can be tangibly embodied in an information carrier. The computer program product can also contain instructions that, when executed, perform one or more methods, such as those described above. The computer program product can also be tangibly embodied in a computer- or machine-readable medium, such as the memory 1004, the storage device 1006, or memory on the processor 1002.
[00141] The high-speed interface 1008 manages bandwidth-intensive operations for the computing device 1000, while the low-speed interface 1012 manages lower bandwidth-intensive operations. Such allocation of functions is exemplary only. In some implementations, the high-speed interface 1008 is coupled to the memory 1004, the display 1016 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 1010, which can accept various expansion cards (not shown). In the implementation, the low-speed interface 1012 is coupled to the storage device 1006 and the low-speed expansion port 1014. The low-speed expansion port 1014, which can include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet) can be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
[00142] The computing device 1000 can be implemented in a number of different forms, as shown in the figure. For example, it can be implemented as a standard server 1020, or multiple times in a group of such servers. In addition, it can be implemented in a personal computer such as a laptop computer 1022. It can also be implemented as part of a rack server system 1024. Alternatively, components from the computing device 1000 can be combined with other components in a mobile device (not shown), such as a mobile computing device 1050. Each of such devices can contain one or more of the computing device 1000 and the mobile computing device 1050, and an entire system can be made up of multiple computing devices communicating with each other.
[00143] The mobile computing device 1050 includes a processor 1052, a memory 1064, an input/output device such as a display 1054, a communication interface 1066, and a transceiver 1068, among other components. The mobile computing device 1050 can also be provided with a storage device, such as a micro-drive or other device, to provide additional storage. Each of the processor 1052, the memory 1064, the display 1054, the communication interface 1066, and the transceiver 1068, are interconnected using various buses, and several of the components can be mounted on a common motherboard or in other manners as appropriate.

[00144] The processor 1052 can execute instructions within the mobile computing device 1050, including instructions stored in the memory 1064. The processor 1052 can be implemented as a chipset of chips that include separate and multiple analog and digital processors. The processor 1052 can provide, for example, for coordination of the other components of the mobile computing device 1050, such as control of user interfaces, applications run by the mobile computing device 1050, and wireless communication by the mobile computing device 1050.
[00145] The processor 1052 can communicate with a user through a control interface 1058 and a display interface 1056 coupled to the display 1054. The display 1054 can be, for example, a TFT (Thin-Film-Transistor Liquid Crystal Display) display or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology. The display interface 1056 can comprise appropriate circuitry for driving the display 1054 to present graphical and other information to a user. The control interface 1058 can receive commands from a user and convert them for submission to the processor 1052. In addition, an external interface 1062 can provide communication with the processor 1052, so as to enable near area communication of the mobile computing device 1050 with other devices. The external interface 1062 can provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces can also be used.
[00146] The memory 1064 stores information within the mobile computing device 1050. The memory 1064 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units.
An expansion memory 1074 can also be provided and connected to the mobile computing device 1050 through an expansion interface 1072, which can include, for example, a SIMM (Single In Line Memory Module) card interface. The expansion memory 1074 can provide extra storage space for the mobile computing device 1050, or can also store applications or other information for the mobile computing device 1050. Specifically, the expansion memory 1074 can include instructions to carry out or supplement the processes described above, and can also include secure information. Thus, for example, the expansion memory 1074 can be provided as a security module for the mobile computing device 1050, and can be programmed with instructions that permit secure use of the mobile computing device 1050. In addition, secure applications can be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.
[00147] The memory can include, for example, flash memory and/or NVRAM memory (non-volatile random access memory), as discussed below. In some implementations, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The computer program product can be a computer- or machine-readable medium, such as the memory 1064, the expansion memory 1074, or memory on the processor 1052. In some implementations, the computer program product can be received in a propagated signal, for example, over the transceiver 1068 or the external interface 1062.
[00148] The mobile computing device 1050 can communicate wirelessly through the communication interface 1066, which can include digital signal processing circuitry where necessary. The communication interface 1066 can provide for communications under various modes or protocols, such as GSM voice calls (Global System for Mobile communications), SMS (Short Message Service), EMS (Enhanced Messaging Service), or MMS messaging (Multimedia Messaging Service), CDMA (code division multiple access), TDMA (time division multiple access), PDC (Personal Digital Cellular), WCDMA (Wideband Code Division Multiple Access), CDMA2000, or GPRS (General Packet Radio Service), among others. Such communication can occur, for example, through the transceiver 1068 using a radio frequency. In addition, short-range communication can occur, such as using a Bluetooth, WiFi, or other such transceiver (not shown). In addition, a GPS (Global Positioning System) receiver module 1070 can provide additional navigation- and location-related wireless data to the mobile computing device 1050, which can be used as appropriate by applications running on the mobile computing device 1050.
[00149] The mobile computing device 1050 can also communicate audibly using an audio codec 1060, which can receive spoken information from a user and convert it to usable digital information. The audio codec 1060 can likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of the mobile computing device 1050. Such sound can include sound from voice telephone calls, can include recorded sound (e.g., voice messages, music files, etc.) and can also include sound generated by applications operating on the mobile computing device 1050.
[00150] The mobile computing device 1050 can be implemented in a number of different forms, as shown in the figure. For example, it can be implemented as a cellular telephone 1080. It can also be implemented as part of a smart-phone 1082, personal digital assistant, or other similar mobile device.

[00151] Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

[00152] These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms machine-readable medium and computer-readable medium refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term machine-readable signal refers to any signal used to provide machine instructions and/or data to a programmable processor.

[00153] To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
[00154] The systems and techniques described here can be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
[00155] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Claims

WHAT IS CLAIMED IS:
1. A system for batch processing of services, the system comprising:
one or more processors; and
computer memory storing instructions that, when executed by the processors, cause the processors to perform operations comprising:
receiving a request to run one or more services in a pipeline; wherein the request comprises data to be operated on;
chunking the data to generate data-chunks, each data-chunk comprising some, but not all, of the data; wherein all data-chunks combined comprise all of the data;
storing the data-chunks in locations specified by one or more references;
generating an initial configuration file to be used as a current configuration file; wherein the initial configuration file comprises the one or more references; and
for each service of the one or more services of the request, after generation of the initial configuration file:
executing at least one instance of the service using the initial configuration file to access at least one data-chunk in order to generate a new configuration file to be used as the current configuration file; wherein executing the at least one instance comprises generating, from the at least one data-chunk, corresponding output data stored in the new configuration file;
aggregating the current configuration file with any other current configuration files available in the system; and
providing, after each service of the pipeline is executed, the results based on the current configuration file.
2. The system of claim 1, wherein the request is received from a client device geographically remote from the one or more processors and the computer memory, and in data communication with the one or more processors and the computer memory.

3. The system of claim 1, wherein at least some of the services are executed in parallel with each other.

4. The system of claim 1, wherein at least some services are executed in series.

5. The system of claim 4, wherein the output of some services is used as input for some other services.

6. The system of claim 1, wherein the system is configured to:
monitor the operations to determine if the operations cause an error; and
halt, in response to determining that the operations cause an error, the operations.

7. The system of claim 6, wherein the system is further configured to generate, responsive to determining that the operations cause an error, an error message containing information about the service.

8. The system of claim 1, wherein the system is further configured to record, in a run-trace datastore, trace information about the execution of the instances of a service, the trace information comprising parameters related to operations of the system as the processors perform the operations.

9. The system of claim 8, wherein the system is further configured to:
receive a query identifying the request; and
respond to the query with at least some of the trace data.

10. A method for batch processing of services, the method comprising:
receiving a request to run one or more services in a pipeline; wherein the request comprises data to be operated on;
chunking the data to generate data-chunks, each data-chunk comprising some, but not all, of the data; wherein all data-chunks combined comprise all of the data;
storing the data-chunks in locations specified by one or more references;
generating an initial configuration file to be used as a current configuration file; wherein the initial configuration file comprises the one or more references; and
for each service of the one or more services of the request, after generation of the initial configuration file:
executing at least one instance of the service using the initial configuration file to access at least one data-chunk in order to generate a new configuration file to be used as the current configuration file; wherein executing the at least one instance comprises generating, from the at least one data-chunk, corresponding output data stored in the new configuration file;
aggregating the current configuration file with any other current configuration files available; and
providing, after each service of the pipeline is executed, the results based on the current configuration file.

11. The method of claim 10, wherein the request is received from a client device geographically remote from the one or more processors and the computer memory, and in data communication with the one or more processors and the computer memory.

12. The method of claim 10, wherein at least some of the services are executed in parallel with each other.

13. The method of claim 10, wherein at least some services are executed in series.

14. The method of claim 13, wherein the output of some services is used as input for some other services.

15. The method of claim 10, wherein the method further comprises:
monitoring the operations to determine if the operations cause an error; and
halting, in response to determining that the operations cause an error, the operations.

16. The method of claim 15, wherein the method further comprises generating, responsive to determining that the operations cause an error, an error message containing information about the service.

17. The method of claim 10, wherein the method further comprises recording, in a run-trace datastore, trace information about the execution of the instances of a service, the trace information comprising parameters related to operations of the method.

18. The method of claim 17, wherein the method further comprises:
receiving a query identifying the request; and
responding to the query with at least some of the trace data.

19. A non-transitory, computer-readable medium tangibly storing instructions that, when executed by one or more processors, cause the processors to perform operations comprising:
receiving a request to run one or more services in a pipeline; wherein the request comprises data to be operated on;
chunking the data to generate data-chunks, each data-chunk comprising some, but not all, of the data; wherein all data-chunks combined comprise all of the data;
storing the data-chunks in locations specified by one or more references;
generating an initial configuration file to be used as a current configuration file; wherein the initial configuration file comprises the one or more references; and
for each service of the one or more services of the request, after generation of the initial configuration file:
executing at least one instance of the service using the initial configuration file to access at least one data-chunk in order to generate a new configuration file to be used as the current configuration file; wherein executing the at least one instance comprises generating, from the at least one data-chunk, corresponding output data stored in the new configuration file;
aggregating the current configuration file with any other current configuration files available; and
providing, after each service of the pipeline is executed, the results based on the current configuration file.

20. The medium of claim 19, wherein the operations further comprise recording, in a run-trace datastore, trace information about the execution of the instances of a service, the trace information comprising parameters related to the operations.
PCT/US2021/056816 2020-10-28 2021-10-27 Batch processing WO2022093936A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063106872P 2020-10-28 2020-10-28
US63/106,872 2020-10-28

Publications (1)

Publication Number Publication Date
WO2022093936A1 true WO2022093936A1 (en) 2022-05-05

Family

ID=81384407

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/056816 WO2022093936A1 (en) 2020-10-28 2021-10-27 Batch processing

Country Status (1)

Country Link
WO (1) WO2022093936A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120158672A1 (en) * 2010-12-16 2012-06-21 Microsoft Corporation Extensible Pipeline for Data Deduplication
US20170102693A1 (en) * 2013-03-04 2017-04-13 Fisher-Rosemount Systems, Inc. Data analytic services for distributed industrial performance monitoring
US9852151B1 (en) * 2013-06-28 2017-12-26 Sanmina Corporation Network system to distribute chunks across multiple physical nodes with disk support for object storage
US20190138638A1 (en) * 2016-09-26 2019-05-09 Splunk Inc. Task distribution in an execution node of a distributed execution environment
US20180203711A1 (en) * 2017-01-12 2018-07-19 Roger Wagner Method and apparatus for bidirectional control connecting hardware device action with url-based web navigation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LAURO B. COSTA ; MATEI RIPEANU: "Towards automating the configuration of a distributed storage system", GRID COMPUTING (GRID), 2010 11TH IEEE/ACM INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 25 October 2010 (2010-10-25), Piscataway, NJ, USA , pages 201 - 208, XP031857546, ISBN: 978-1-4244-9347-0 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21887426

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21887426

Country of ref document: EP

Kind code of ref document: A1