AU2015414467A1 - Methods and systems for generating workflows for analysing large data sets - Google Patents

Publication number
AU2015414467A1
Authority
AU
Australia
Prior art keywords
algorithms
user
data
computer
computer system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
AU2015414467A
Inventor
Dadabhai T SINGH
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cloudseq Pte Ltd
Original Assignee
Cloudseq Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cloudseq Pte Ltd
Publication of AU2015414467A1
Priority to AU2022241571A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/34 Graphical or visual programming
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biotechnology (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Game Theory and Decision Science (AREA)
  • Educational Administration (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Stored Programmes (AREA)

Abstract

A method is proposed for designing a computational process for performing a computational task such as a next generation sequencing task. A user who wishes to instigate a workflow for analysing a dataset is presented with a graphical user interface (GUI) comprising a plurality of areas representative of respective associated data analysis algorithms, each of the algorithms being characterized by one or more numerical parameters. Default values have been defined for some or all of the numerical parameters, and one or more of the parameters of each algorithm may be modified using the GUI. The GUI allows the user to select and combine a plurality of the icons (e.g. by drag and drop operations) and set one or more of the modifiable parameters, to form a workflow comprising the corresponding algorithms.

Description

Field of the invention
The present invention proposes methods and computer systems to be used by a user (typically a scientist or scientific researcher) to analyse a large data-set, typically composed of scientific data. In particular, the invention proposes ways of defining a computational workflow to be performed on the data set.
Background of the invention
Next Generation Sequencing (NGS) has revolutionized life sciences research across academia and industry alike. The cost of sequencing is dwindling with rapid innovations in NGS technology. As a result, thousands of genome sequencing projects have been taken up over the past five years by leading institutes in academia and industrial laboratories, and the number of such projects is rising every year. It has been estimated that an average genomics lab will generate more than 20 terabytes of sequence data per annum. There are thousands of such labs all over the world. Analysing this volume of data is a great challenge as well as a good opportunity for bioinformatics firms to roll out various solutions for managing, analysing and visualizing this sequence data, to produce results with direct implications for human welfare (clinical genomics and diagnostics), food security (agrigenomics) and environmental sustainability (metagenomics).
Furthermore, NGS is just one example of the use of computers to analyse vast datasets generated in a complex scientific field. Another such field is phyloinformatics (the use of a data system structured and queried according to the hierarchical relationships of living creatures according to their evolutionary taxa).
A common problem in such scientific fields is that a user who instigates the data analysis is typically a scientist or researcher with training and expertise in the scientific field itself, rather than in data analysis. Existing data analysis software is often command line based, so a user has to input complex commands including values for each of a number of numerical parameters associated with the data analysis. This means that the user must engage or at least consult an expert in data analysis in order to set the values of all of the parameters. Once the values of the parameters are set, it is difficult to tweak certain parameters without the need to consult the data analysis expert again.
WO 2017/082813
PCT/SG2015/050446
Summary of the invention
In general terms, the invention proposes a method of designing a computational process (“workflow”) for performing a computational task, in which a user who wishes to instigate a workflow for analysing a dataset is presented with a graphical user interface (GUI) comprising a plurality of areas representative of respective associated data analysis algorithms, each of the algorithms being characterized by one or more numerical parameters. Default values have been defined for some or all of the numerical parameters, and one or more of the parameters of the algorithms may be modified using the GUI. The GUI allows the user to select and combine a plurality of the icons (e.g. by drag and drop operations) and set one or more of the modifiable parameters, to form a workflow comprising the corresponding algorithms.
For each algorithm, there is a corresponding Functionality Requirements Document (FRD). The FRDs are populated by a computer programmer with indications of which parameters are interactive (that is, able to be modified by the user). Typically, the FRDs contain default values for some or all of the interactive parameters; if the user chooses not to change one or more of the default values, then later the algorithm will be performed using the default values for the corresponding parameters. The FRDs also specify the default values for the parameters which are not interactive.
Once the user has completed the workflow, the workflow can be executed. A set of shell scripts generates commands for the algorithms based on the parameter values selected by the user and the default values stored in the FRD.
The implementation of the workflow could in principle be carried out in a platform which is a single computer system (such as a single server, or two or more proximate servers operating as a single computer system). However, it is preferable for the implementation to be carried out by a platform “in the cloud”: a network of preferably geographically distributed data nodes (“cores”), typically implemented by respective servers. The data nodes may be coordinated by a master node. The master node (which may be the computer which generated the GUI and received the user choices) may be accessible to a user terminal over a communication network such as the internet.
The output of the workflow is preferably stored in the cloud (that is, in a logical space (“pool”) supported by a plurality of geographically distributed servers), and preferably in a format such as XML (extensible mark-up language) which is suitable for processing using a MapReduce framework, such as using the programming language XQuery or its extension ChuQL.
The platform preferably includes visualisation tools appropriate to the type of data. The visualisation tools may include standard XML utilities, which are bundled with a data storage and/or visualisation tool which the user is also permitted to select using the GUI, for example by positioning an icon representing the tool at the end of the workflow.
The visualisation tools may be ones which access pre-existing databases, such as databases of public-domain data.
In one example, the algorithms may be ones suitable for the implementation of NGS. In this case, the visualisation tools may be ones for analysis of genetic data, and the public-domain data may be genome data.
The invention may be expressed as a system for generating the GUI and receiving the user’s choices, to define the workflow. The system may include the cluster of data nodes which actually implement the workflow. Alternatively, it may be defined as a method performed by the system. The system may perform the method by running a set of computer program instructions stored in non-transitory form on a tangible data storage device.
Brief description of the drawings
An embodiment of the invention will now be described for the sake of example only with reference to the following figures, in which:
Fig. 1 shows the overall structure of a network of computer units which can cooperate to implement a method which is an embodiment of the invention, and incorporating a server which is an embodiment of the invention;
Fig. 2 shows the logical structure of elements for implementing the invention;
Fig. 3 is a flow diagram of a method according to the invention;
Fig. 4 shows a first graphical user interface (GUI) presented to a user of the system of Fig. 1 at a first time;
Fig. 5 is an expanded view of a portion of Fig. 4, showing a workflow defined using the invention;
Fig. 6 is an expanded view of one of the portions of the GUI of Fig. 4 at a different time;
Fig. 7 is a flow diagram of one of the steps of the workflow illustrated in Fig. 5;
Fig. 8 is a view of a GUI for browsing the results of an NGS;
Fig. 9 is a view of a GUI for examining the results of an NGS; and
Fig. 10 is an expanded view of a portion of Fig. 4, showing a second workflow defined using the network of Fig. 1.
Detailed description
Referring to Fig. 1 a first embodiment of the system is shown. The system includes a number of client nodes 1, 2, 3 operated by respective users. Although three client nodes are illustrated, the number of client nodes may be lower, higher or much higher than this.
In Fig. 1, the structure of the client node 3 is shown in more detail than the other client nodes 1, 2. It includes a terminal 3a having a screen and one or more user input devices (not shown) such as keyboard, mouse etc, and a database 3b storing one or more large datasets, which may be datasets of structured, semi-structured or unstructured data. Each dataset is preferably in the XML format. The terminal 3a is in read/write communication with the database 3b. The datasets are composed of data generated in a scientific field, and the client nodes 1, 2, 3 are operated by respective users who are scientists or researchers in the scientific field of the data.
The terminal 3a is a “dumb terminal” which has limited computing power. Its functions may be limited to read/write operations to the database 3b, communication with external devices, recognition of user input using the data input devices, and display of information received from elsewhere on the screen.
The users control the client nodes 1, 2, 3 to interact over a communication network (such as the internet) with a platform 5. The platform 5 may be provided as a single server, or a single cluster of neighbouring servers. However, more preferably it is provided in the form of a cloud-based network of distributed units as illustrated in Fig. 1. It includes a master node 6, and a number m of data nodes 61, 62, ... 6m which are geographically distributed (e.g. at least one pair of the data nodes is spaced apart by at least 10 km). The master node 6 may not be in physical proximity to the data nodes 61, 62, ... 6m (e.g. it may be at least 10 km from at least one of them), and communicates with them over a communications network. The data nodes 61, 62, ... 6m communicate with each other over a communication network using a file-share system such as the Java-based system HDFS 7. The master node 6 has the task of coordinating the data nodes 61, 62, ... 6m, including passing to them program instructions for the data nodes 61, 62, ... 6m to implement. The programming of the master node 6 (both its own programming, and the software elements which the master node 6 is operative to transfer to the data nodes 61, 62, ... 6m) is preferably controlled by one or more developers operating respective terminals 8 (for simplicity only one such terminal is shown).
The data in the database 3b can be accessed via the terminal 3a by the platform 5. The platform 5 uses XQuery, a publicly known language for querying XML data, to access and handle one or more of the datasets.
The platform 5 further preferably uses Hadoop, an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop consists of a distributed file system, HDFS (mentioned above), and a computational model, MapReduce. HDFS is shown in Fig. 1 as element 7, though it is in fact simply a framework used by the master node 6 and data nodes 61, 62, ... 6m. HDFS is responsible for distributing data, with built-in fault tolerance, between the data nodes 61, 62, ... 6m through the master node 6.
MapReduce is responsible for parallel processing of the data. It has two components: a Map component, which maps the data to the data nodes 61, 62, ... 6m, and a Reduce component, which collects the analysed data and presents a unified picture to the end user. The Map component receives records from the data nodes as key/value pairs, and the results are written to an initial output and replicated and distributed using HDFS. The Reduce component receives all the results and aggregates them. The aggregated results are written to the final output, which is replicated and distributed again. MapReduce is hardware fault tolerant and has the intrinsic ability to distribute load based on hardware availability.
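The Map and Reduce roles can be illustrated with a minimal single-process sketch. This is not Hadoop code: the function names `map_record`, `shuffle`, `reduce_group` and `run_mapreduce`, and the toy task of counting bases across reads, are assumptions chosen purely for illustration.

```python
from collections import defaultdict


def map_record(record):
    """Map: emit (key, value) pairs for one input record (here, one read)."""
    for base in record:
        yield base, 1


def shuffle(pairs):
    """Group intermediate values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce: aggregate all values collected for one key."""
    return key, sum(values)


def run_mapreduce(records):
    """Run the map phase over every record, then reduce each key group."""
    pairs = [pair for record in records for pair in map_record(record)]
    return dict(reduce_group(k, v) for k, v in shuffle(pairs).items())


reads = ["ACGT", "AACC", "GGTA"]  # hypothetical toy input
counts = run_mapreduce(reads)
```

In a real cluster the map calls would run on separate data nodes and the grouping step would move data between them; here everything runs in one process so that the data flow is easy to follow.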
The system further preferably employs Docker, a tool for running isolated containers on Linux. This allows rapid development of applications and workflows. The system further preferably uses InfiniBand technology, to provide network linking.
Fig. 2 shows the logical structure of the operation of the platform 5 in relation to the client node 3. The platform 5 comprises a GUI module 51 for generating GUIs for display on the screen of the terminal 3a. It further comprises a shell scripts module 52 comprising a number of shell scripts, and n sections of code 531, 532, ... 53n, which implement n corresponding algorithms referred to as Alg_1, Alg_2, Alg_3, ... Alg_n. It further has access to a storage unit 54, which stores a list 55 of the n algorithms. For each of the algorithms there is a corresponding Functionality Requirements Document (FRD): FRD Alg_1, FRD Alg_2, ... FRD Alg_n.
Each of the n algorithms accepts input in a certain format (“Input”) and generates output in a certain format (“Output”). These formats are included in the corresponding FRD. Each FRD also stores a location of software implementing the algorithm. The locations may be stored as specific paths or as environment variables.
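Because each FRD records the input and output format of its algorithm, consecutive algorithms in a workflow could be checked for compatibility before anything is run. The following is a minimal sketch of that idea only; the `Frd` class, its field names, the `check_chain` helper and the example format strings and locations are all hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frd:
    """Hypothetical FRD record: name, I/O formats, and software location."""
    name: str
    input_format: str
    output_format: str
    location: str  # a specific path or the name of an environment variable


def check_chain(frds):
    """Return a list of format mismatches between consecutive algorithms."""
    problems = []
    for upstream, downstream in zip(frds, frds[1:]):
        if upstream.output_format != downstream.input_format:
            problems.append(
                f"{upstream.name} outputs {upstream.output_format}, "
                f"but {downstream.name} expects {downstream.input_format}"
            )
    return problems


# Illustrative formats and locations only; not taken from any real FRD.
alg_1 = Frd("Alg_1", "FASTQ", "BAM", "$ALG_1_HOME")
alg_2 = Frd("Alg_2", "BAM", "VCF", "$ALG_2_HOME")
alg_3 = Frd("Alg_3", "BAM", "VCF", "$ALG_3_HOME")

ok = check_chain([alg_1, alg_2])
bad = check_chain([alg_1, alg_2, alg_3])
```

Here `ok` is empty because the output of Alg_1 matches the input of Alg_2, while `bad` reports that Alg_3 cannot consume the output of Alg_2.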
Each algorithm includes a number of parameters, typically including one or more which are predefined (take “default” values) and one or more which are settable by the user. The FRD stores the default values.
For example, as illustrated in Fig. 2, the algorithm Alg_1 includes four parameters: Para 1, which takes the default value 3; Para 2, which is selectable (i.e. settable by a user, for example by choosing from a plurality of predefined options); Para 3, which takes the default value 1; and Para 4, which is selectable. The FRD 58 stores the default values for the algorithm Alg_1.
The storage unit 54 further stores, for each of the client nodes 1, 2, 3, a corresponding user selection document 56 showing selections the corresponding user has made (see below) of values for the settable parameters of one or more of the algorithms the user has selected. The user of client node 3 has selected Alg_1, and for it has selected the parameter values Para 2 = 4 and Para 4 = 3.
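The relationship between the FRD and the user selection document for Alg_1 can be sketched as two small mappings that are combined into one complete parameter set. The dictionary layout, the use of `None` to mark a settable parameter without a default, and the function name `resolve_parameters` are assumptions made for this illustration; only the parameter names and values come from the example of Fig. 2.

```python
# FRD for Alg_1 as described for Fig. 2: two fixed defaults, two settable.
# None marks a settable parameter with no default (an assumption of this sketch).
frd_alg_1 = {
    "Para 1": {"default": 3, "settable": False},
    "Para 2": {"default": None, "settable": True},
    "Para 3": {"default": 1, "settable": False},
    "Para 4": {"default": None, "settable": True},
}

# User selection document for the user of client node 3.
user_selection = {"Para 2": 4, "Para 4": 3}


def resolve_parameters(frd, selection):
    """Combine FRD defaults with user choices, rejecting invalid selections."""
    unknown = set(selection) - set(frd)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    resolved = {}
    for name, spec in frd.items():
        if name in selection:
            if not spec["settable"]:
                raise ValueError(f"{name} is not settable by the user")
            resolved[name] = selection[name]
        elif spec["default"] is not None:
            resolved[name] = spec["default"]
        else:
            raise ValueError(f"{name} has no default and was not set")
    return resolved


params = resolve_parameters(frd_alg_1, user_selection)
```

With the values above, `params` holds all four parameters: the two defaults from the FRD and the two values chosen by the user.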
The system of Fig. 1 may be operable to provide computing services to users from a number of different scientific fields. Each field may have a different set of applicable algorithms. A given user would initially specify which scientific field he is interested in, and Fig. 2 illustrates only the logical elements applicable to the scientific field chosen by the user of the client node 3.
If the platform is implemented by a single system, such as a server, then all of the elements of Fig. 2 would be present in the server.
Alternatively, if the platform is implemented in the distributed manner shown in Fig. 1, then the structure of Fig. 2 may optionally be provided in the master node 6. Alternatively, it may be provided by another unit (not shown) which is able to communicate with the terminal 3a. Or the elements of the structure of Fig. 2 may be distributed (e.g. with part on the master node 6 and part elsewhere).
Turning to Fig. 3, a method according to the embodiment is shown for defining and running a workflow. In step 31, the GUI module 51 controls the terminal 3a to display a GUI as illustrated in Fig. 4. The GUI includes a portion 41 which lists the algorithms from list 55. Thus, there is a plurality of areas of the GUI representing respective ones of the algorithms. In one version of the embodiment these areas are displayed only upon a user command (e.g. activating a drop down menu).
The GUI further includes a workflow space 42, an area 43 for defining parameters of the algorithms (one algorithm at a time), and a section 44 for displaying information generated when the workflow is implemented. Areas 42, 43, 44 are initially empty.
In step 32, the user successively uses the data input devices of the client node 3 to generate user input to select one or more algorithms from the list of available algorithms 41, and drags each algorithm to a corresponding portion of the area 42. An icon representing the selected algorithm is shown in the area 42. The set of algorithms selected by the user, and the order in which the selected algorithms are performed, defines a workflow which is illustrated by the icons. Note that the user may select one or more of the algorithms more than once, so that the workflow contains multiple instances of the same algorithm.
Fig. 5 shows the area 42 after the user has chosen four algorithms labelled BWA, Gatk, HPC, and SNPEFF and positioned them in area 42 to produce an NGS workflow. The significance of these algorithms is explained below when the NGS workflow is considered in detail. Area 42 comprises arrows defined by the user to show the order in which the algorithms are performed and/or the flow of data between them.
The item “Print” in area 42 is not an algorithm, but rather a graphical representation of a software tool which is invoked at the end of the workflow to visualize the data output. The user may have specified the tool by clicking on a corresponding area of the GUI. The area labelled “Print” may then be automatically created in the area 42 at the end of the workflow, or more preferably the user may position it there (e.g. by a drag and drop operation). The function of the tool is explained in more detail below. The “Print” module is also associated with an FRD. This FRD specifies a number of utilities, such as standard XML utilities, for forming graphical representations of data generated by the selected algorithms of the workflow.
In step 33, the user selects parameters for the algorithms. The user sets the parameters of the algorithms in the workflow algorithm-by-algorithm. When the user is setting the parameters of a given algorithm, those parameters are shown in area 43.
For example, the user first selects one of the algorithms in the workflow shown in area 42 (in Fig. 5 the algorithm BWA has been selected by clicking on the area 45; this area is therefore highlighted in Fig. 5). The GUI uses the FRD corresponding to that algorithm to display to the user in area 43 the corresponding settable parameters of the highlighted algorithm (and typically the default parameter values also), and to receive user input specifying a selection for a given settable parameter. An example is shown in Fig. 6, where the GUI is showing that the parameter Nst of algorithm Bayes can take the values 1, 2, 6 or “mixed”. The user selects one of these options, thereby selecting a value for this interactive parameter. Specifically, Nst is a parameter specifying the nucleotide substitution rate of a model which is to be studied using the Bayes algorithm. Its value is selected to be “mixed” when the input is an amino acid. For other parameters, the user may be free to enter a numerical value, e.g. by typing. However, providing the user with options has the advantage that the user is able to make a choice which is likely to be reasonable, even if he has little insight into the algorithm.
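Restricting a settable parameter to a fixed set of options can be sketched as a small validation step. The option list below is the one given for Nst in Fig. 6; the helper name `choose` and the choice to accept the user's input as free text are assumptions of this sketch rather than features of the embodiment.

```python
# Options offered for the parameter Nst in the example of Fig. 6.
NST_OPTIONS = (1, 2, 6, "mixed")


def choose(options, raw):
    """Return the option matching the user's raw text input, or raise."""
    by_label = {str(option): option for option in options}
    key = raw.strip().lower()
    if key not in by_label:
        allowed = ", ".join(by_label)
        raise ValueError(f"{raw!r} is not one of: {allowed}")
    return by_label[key]


nst_numeric = choose(NST_OPTIONS, " 6 ")   # numeric option, typed with spaces
nst_mixed = choose(NST_OPTIONS, "Mixed")   # text option, case-insensitive
```

A value outside the list, such as "4", raises a `ValueError`, so an unreasonable choice is caught before the workflow is run.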
The values of the parameters are stored in the user selection document 56. The user may also be able to select the input location (i.e. the location of an input file containing data to be processed by the algorithm) and the output location to which the result of the workflow is written. The input and output may be in the same database, or different databases, or within different sections of the same database. In the case of the first algorithm 45 in the workflow, the input location may be in the database 3b, and similarly for the last algorithm in the workflow the output location may be in the database 3b. Other inputs/outputs are chosen so as to define the arrows in Fig. 5 between algorithms. Note that this can be done after the algorithms themselves are selected. Furthermore, the choice of how data flows between the algorithms can be modified later by moving the arrows.
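The arrows between blocks can be thought of as edges of a directed graph, from which an execution order follows. The sketch below is one possible representation, not the one used by the embodiment: the arrow list mirrors the blocks named for Fig. 5, the function name `execution_order` is hypothetical, and the standard-library `graphlib` module (Python 3.9 or later) is used for the ordering.

```python
from graphlib import TopologicalSorter

# Each arrow (source, target) means: the output of source feeds target.
arrows = [
    ("BWA", "Gatk"),
    ("Gatk", "HPC"),
    ("HPC", "SNPEFF"),
    ("SNPEFF", "Print"),
]


def execution_order(arrows):
    """Return the blocks in an order that respects every arrow."""
    sorter = TopologicalSorter()
    for source, target in arrows:
        sorter.add(target, source)  # target depends on source
    # static_order() raises graphlib.CycleError if the arrows form a loop.
    return list(sorter.static_order())


order = execution_order(arrows)
```

If a workflow contained two instances of the same algorithm, each instance would need its own identifier (for example "BWA#1" and "BWA#2") so that the graph can tell them apart. Moving an arrow in the GUI would correspond to editing one pair in `arrows` and recomputing the order.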
Once the user has completed the parameter selection in step 33, in step 34 the workflow is executed. During execution, the shell scripts module 52 generates commands for the algorithms using the values of the parameters stored in the FRDs and the user selection document 56. Each shell script may be provided as a Java wrapper. Various IDEs (integrated development environments) written in Java are well known, such as NetBeans IDE and Eclipse IDE. If the input data for a given module is in XML format, the module may be Java wrapped for processing XML data, such as using the MapReduce framework.
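Command generation from stored defaults and user choices can be sketched as follows. The embodiment uses shell scripts for this step; the Python below only illustrates the idea. The executable name `alg_1`, the `--name value` flag style, the `--input`/`--output` flags and the file paths are hypothetical and do not correspond to any real tool.

```python
import shlex


def build_command(executable, defaults, selection, input_path, output_path):
    """Merge FRD defaults with user choices and render one shell command."""
    params = {**defaults, **selection}  # user choices override defaults
    argv = [executable]
    for name in sorted(params):
        argv += [f"--{name}", str(params[name])]
    argv += ["--input", input_path, "--output", output_path]
    return shlex.join(argv)  # quotes each argument for a POSIX shell


command = build_command(
    "alg_1",                     # hypothetical executable located via the FRD
    {"para1": 3, "para3": 1},    # defaults stored in the FRD
    {"para2": 4, "para4": 3},    # values from the user selection document
    "reads/sample 1.fastq",      # hypothetical input location
    "out/sample1.bam",           # hypothetical output location
)
```

Using `shlex.join` rather than plain string concatenation means that a path containing a space is quoted and still reaches the program as a single argument.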
The implementation of the workflow is managed by the master node 6 by controlling a plurality of the geographically distributed data nodes 61, 62, ..... 6m in parallel, to provide a high performance computing (HPC) environment.
Optionally, the user may be able to choose (e.g. in step 33) the number of data nodes 61, 62, ... 6m which are used to run any of the selected algorithms (and particularly the computationally intensive ones) according to the user’s budget and the urgency of the data analysis. For example, a user may choose for 10 data nodes, 100 data nodes or 1000 data nodes to run any one of the selected algorithms, such as the (computationally expensive) BWA algorithm of Fig. 5.
When the workflow is running, the area 44 shows any error messages. This allows a user to stop the workflow if there is an error. This is particularly useful when the workflow is run in the cloud, where resources may be charged for on the basis, for example, of the number of processors used multiplied by the number of hours for which each processor is used. In response to an error message, the user can “tweak” the workflow, for example by clicking on one of the icons representing the algorithms of the workflow, and resetting one or more parameters of the algorithm corresponding to the icon using the GUI portion 43.
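The value of stopping a failing workflow early can be made concrete with the charging basis mentioned above (processors multiplied by hours). The figures and the rate below are invented for illustration, and `estimated_charge` is a hypothetical helper, not part of the platform.

```python
def estimated_charge(processors, hours_per_processor, rate_per_processor_hour):
    """Charge = processors x hours per processor x rate per processor-hour."""
    return processors * hours_per_processor * rate_per_processor_hour


# Hypothetical run: 100 processors, 6 hours each, 2 currency units per hour.
full_run = estimated_charge(100, 6, 2)

# The same run stopped after 1 hour because area 44 showed an error.
stopped_early = estimated_charge(100, 1, 2)

saved = full_run - stopped_early
```

Under these assumed figures the full run would cost 1200 units and the stopped run 200 units, so noticing the error after one hour saves 1000 units.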
If the workflow is repeated multiple times (i.e. with different input data each time), a series of outputs are generated.
In step 35, the platform 5 and/or the database 3b may store an output file with the results of the processing carried out by the workflow. The output file may store results generated by each of the algorithms. Additionally, the output file may store results for running the workflow for different input data.
Preferably, the database is in the cloud, and in an XML format. It may be created by XML utilities.
In step 36, the data is visualised. This step may include accessing one or more previously generated database(s), such as public domain databases, to compare previously generated data to data generated by the selected algorithms. Step 36 too may be performed using XML utilities. Thus, XML utilities can be used in all three stages of data integration: the performance of the workflows in step 34 (data generation), database creation (step 35), and data visualisation (step 36).
The selection of exactly what actions the computer system performs in steps 35 and/or 36 may be made during the construction of the workflow, and be graphically represented on the GUI as an area of the workflow. Specifically, the user may input data to the computerized network which is recognised as a selection of at least one tool providing database construction and visualisation. A representation of the tool may be displayed in area 42, for example as a final block of the workflow (e.g. the area “Print” in Fig. 5).
We now return to Fig. 5, and consider in more detail the NGS workflow which the embodiment permitted the user to define. The input here is a FASTQ file, which represents a set of fragments of DNA. The output of the NGS workflow is an indication of the Variants that have been identified. Each Variant is a location at which one of the DNA fragments differs from a known genome sequence (“reference genome”).
The first stage of the NGS workflow is the BWA (Burrows-Wheeler Aligner) algorithm. This is an algorithm that aligns sequence fragments to the reference genome. This algorithm takes a FASTQ input of DNA fragments (raw sequence data) and outputs a BAM file which is an aligned DNA sequence.
The detailed steps are shown in Fig. 7. First, in step 71, the raw sequence data is divided into m paired end files.
In steps 721, 722, ... 72m, each of these is processed separately and simultaneously by the data nodes 61, 62, ... 6m. First, a known BWA algorithm (such as BWA-MEM) is used to give a corresponding sequence alignment map (SAM). This is improved with known SAM tools, to give a corresponding sorted SAM.
In step 73, the results are combined to produce a BAM file.
In step 74, the BAM file is improved using known tools such as the Picard tools supplied by the Broad Institute of Cambridge, MA.
In step 75 SAM tools are used for indexing, giving an indexed BAM file.
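The split, parallel-process and combine pattern of steps 71 to 73 can be sketched without any real alignment software. In the sketch below, `align_chunk` is a placeholder that merely pairs each read with a fake position (its length) and sorts; a real data node would run an aligner and sorting tools instead. All function names are hypothetical, threads stand in for data nodes, and the improvement and indexing of steps 74 and 75 are omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def split_reads(reads, m):
    """Step 71 stand-in: divide the raw reads into m roughly equal chunks."""
    return [reads[i::m] for i in range(m)]


def align_chunk(chunk):
    """Steps 721..72m stand-in: 'align' and sort one chunk.

    Placeholder logic only: the fake position of a read is its length.
    """
    return sorted((len(read), read) for read in chunk)


def merge_sorted(parts):
    """Step 73 stand-in: combine the per-chunk results into one sorted list."""
    return sorted(item for part in parts for item in part)


def run_alignment(reads, m):
    """Scatter the reads over m workers, then gather and merge the results."""
    chunks = split_reads(reads, m)
    with ThreadPoolExecutor(max_workers=m) as pool:
        parts = list(pool.map(align_chunk, chunks))
    return merge_sorted(parts)


reads = ["ACGTAC", "AC", "ACGT", "A", "ACG"]  # hypothetical toy reads
aligned = run_alignment(reads, m=2)
```

The merged result does not depend on the number of chunks, which is the property that lets the number of data nodes be chosen freely.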
Gatk: This refers to the Genome Analysis Toolkit developed by the Broad Institute of Cambridge, MA. It takes the aligned DNA and compares it with the reference genome to identify variants. A first stage is a local realignment process which is designed to input one or more BAM files and to locally realign reads such that the number of mismatching bases is minimized across all the reads. Then there is a base recalibration step, followed by a step of Base Quality Score Recalibration (BQSR).
HPC (Haplotype caller): This identifies haplotypes (that is, sets of genes which a progeny tends to inherit from a parent). The output is a raw VCF (variant call format) file, i.e. a text file in a format commonly used for storing gene sequence variations. Note that in a variation of the workflow the HPC, which is accurate but requires intensive computing resources, is replaced by a commonly-known tool called a unified type caller, which is less accurate but only needs low memory. Both of these algorithms are preferably available in the section 41 of the GUI for the user to select during the definition of the workflow.
SNPEFF: This annotates the VCF file. The output is an annotated variant list.
Print: This takes the output of the above algorithms and puts it all in a database file, and then calls a visualization tool to view the data.
The database file may be stored in the database 3b of the client node 3, or on the platform 5.
In a preferred possibility, the output of the workflow (in the case of the NGS workflow, the output of SNPEFF) is stored in the cloud, preferably as an XML format database. Optionally, the format of the data may initially be according to a relational database management system (RDBMS) such as MYSQL or PostgreSQL (also called Postgres), but if so it is converted to an XML format database. The XML format of the data makes it possible for the data to be analysed using XQuery, a programming language for XML data. XQuery programs can be implemented in Hadoop using the MapReduce function.
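Storing workflow output as XML and then selecting parts of it can be sketched with Python's standard library. This is only an analogy: `xml.etree.ElementTree` supports a limited XPath subset, not XQuery, and the element name `variant`, its attributes and the sample records are invented for illustration rather than taken from the embodiment.

```python
import xml.etree.ElementTree as ET


def variants_to_xml(variants):
    """Build a <variants> document with one <variant> element per record."""
    root = ET.Element("variants")
    for v in variants:
        ET.SubElement(
            root, "variant",
            chrom=v["chrom"], pos=str(v["pos"]), ref=v["ref"], alt=v["alt"],
        )
    return root


# Hypothetical annotated variants, as rows that might come from an RDBMS.
variants = [
    {"chrom": "chr1", "pos": 101, "ref": "A", "alt": "G"},
    {"chrom": "chr2", "pos": 202, "ref": "C", "alt": "T"},
    {"chrom": "chr1", "pos": 303, "ref": "G", "alt": "A"},
]

root = variants_to_xml(variants)
xml_text = ET.tostring(root, encoding="unicode")

# Roughly what a path query such as /variants/variant[@chrom="chr1"] selects.
chr1_positions = [
    int(v.get("pos")) for v in root.findall("variant[@chrom='chr1']")
]
```

Converting relational rows to XML in this way is what makes the same data addressable by path-style queries afterwards.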
The output data generated by the NGS workflow of Fig. 5 can be viewed by the user in various ways. One way of doing this is for the “Print” module (i.e. the module represented by the last block of the workflow in area 42 of Fig. 5) to call a visualization tool which is a browser which opens automatically on the terminal 3a to access a website which is operative to read the output database file, and display it in a graphical format. The browser includes a number of utilities, which are specified by the FRD of the print module. This is shown in Figs. 8 and 9.
Fig. 8 shows how the website presents a GUI with a number of options from which the user can choose in order to select portions of the output database. The website then searches the database and displays information about the items in the database file which match the user’s selections. This allows the user to browse the data in the database file.
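The search behaviour described for Fig. 8 could be sketched, under stated assumptions, as a filter that keeps only the items matching every option the user has selected. The field names, the sample records and the function names `matches` and `search` are hypothetical and are not taken from the embodiment.

```python
# Hypothetical records, loosely modelled on the kind of columns a variant browser shows.
RECORDS = [
    {"chromosome": "chr3", "type": "Substitution", "category": "Known"},
    {"chromosome": "chr3", "type": "Insertion", "category": "Novel"},
    {"chromosome": "chr5", "type": "Substitution", "category": "Known"},
]


def matches(record, selections):
    """Return True if the record agrees with every field the user selected."""
    return all(record.get(field) == wanted for field, wanted in selections.items())


def search(records, selections):
    """Return the records matching the user's selections (all records if none)."""
    return [r for r in records if matches(r, selections)]


hits = search(RECORDS, {"chromosome": "chr3", "type": "Substitution"})
```

An empty selection returns every record, which corresponds to browsing the whole database file.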
Fig. 9 shows a screen in which a user can cause the website to display an image to visualize the data. Fig. 9 shows a case in which the workflow of Fig. 5 has been run for 6 sets of data SS1 to SS6. These are listed at the top of the area 91 on the left-hand side of the GUI. By selecting one (or more) of the tickboxes in the area 91, the user indicates to the website the set(s) of data to be visualised. The results are then shown in the area 92.
As mentioned above, the browser includes bundled utilities (such as standard XML utilities) for analysing the data generated by the workflow automatically. At least some of the utilities preferably do this by accessing pre-existing databases, such as public-domain databases.
As explained above, the embodiment of Fig. 1 is not only useful for NGS. Fig. 10 shows the workflow space 42 generated by another of the users to generate a different workflow.

Claims (31)

  1. CLAIMS:
    1. A computer-implemented method of designing a computational process for analysing a dataset, the method comprising a computer system:
    presenting to a user a graphical user interface (GUI) comprising a plurality of areas representative of respective associated data analysis algorithms, each of the algorithms being characterized by one or more numerical parameters, receiving first user input from the user specifying a selection from among the areas, thereby selecting one or more of the algorithms, receiving second user input from the user specifying a positioning within a workspace portion of the GUI of respective graphical elements representing the selected algorithms, thereby defining a workflow in which the selected algorithms are performed, and receiving third user input from the user specifying the values of at least one of the parameters of the selected algorithms.
  2. 2. A method according to claim 1 in which the computer system comprises a data storage unit storing for each of the algorithms a corresponding functionality requirements document, the functionality requirements document specifying a plurality of characteristics of the corresponding algorithm and a location of software for performing the algorithm.
  3. 3. A method according to claim 1 or claim 2 in which the dataset comprises one or more genetic sequences.
  4. 4. A method according to claim 3 in which the computational process is for comparing the genetic sequences to a reference genome to align the genetic sequences with the reference genome, and identifying instances of variants between the genetic sequences and the reference genome.
  5. 5. A computer-implemented method according to any preceding claim further including receiving fourth user input specifying that the workflow includes at least one software module for at least one of (i) generating a database of data output by the selected algorithms, and (ii) visualising the data output by the selected algorithms.
  6. 6. A computer-implemented method of analysing a dataset, the method comprising:
    designing a computational process to analyse the dataset by a method according to any of claims 1 to 5; and a computer platform implementing the computational process.
  7. 7. A computer-implemented method according to claim 6 in which the computer platform implements the computational process by activating shell scripts to generate program instructions for the algorithms using the parameter values specified by the user.
  8. 8. A computer-implemented method according to claim 6 or 7 in which the computer platform comprises a plurality of data nodes, and the implementation of the computational process comprises dividing the dataset into respective portions for the data nodes, distributing the portions of the dataset to the respective data nodes for processing, and amalgamating the results.
  9. 9. A computer-implemented method according to any of claims 6 to 8 further including storing data generated by the selected algorithms in a memory space defined by a plurality of geographically distributed servers.
  10. 10. A computer-implemented method according to any of claims 6 to 9 further including generating an XML database of data generated by the selected algorithms.
  11. 11. A computer-implemented method according to any of claims 6 to 9 further including using at least one pre-existing database to generate a visual representation of data generated by the selected algorithms.
  12. 12. A computer system for designing a computational process for analysing a dataset, the computer system comprising a processor and a data storage device storing program instructions operative when implemented by the processor to cause the computer system to:
    present to a user a graphical user interface (GUI) comprising a plurality of areas representative of respective associated data analysis algorithms, each of the algorithms being characterized by one or more numerical parameters, receive first user input from the user specifying a selection from among the areas, thereby selecting one or more of the algorithms,
    receive second user input from the user specifying a positioning within a workspace portion of the GUI of respective graphical elements representing the selected algorithms, thereby defining a workflow in which the selected algorithms are performed, and receive third user input from the user specifying the values of at least one of the parameters of the selected algorithms.
  13. 13. A computer system according to claim 12 in which the data storage device stores for each of the algorithms a corresponding functionality requirements document, the functionality requirements document specifying a plurality of characteristics of the corresponding algorithm and a location of software for performing the algorithm.
  14. 14. A computer system according to claim 12 or 13 in which the algorithms comprise algorithms for comparing genetic sequences to a reference genome to align the genetic sequences with the reference genome, and identifying instances of variants between the genetic sequences and the reference genome.
  15. 15. A computer system according to any of claims 12 to 14 which is operative to receive fourth user input specifying that the workflow includes at least one software module for at least one of (i) generating a database of data output by the selected algorithms, and (ii) visualizing the data output by the selected algorithms.
  16. 16. A computer system according to any of claims 12 to 15 which is operative to perform the designed computational process.
  17. 17. A computer system according to claim 16 in which the data storage device stores shell scripts to generate program instructions for the algorithms using the parameter values specified by the user.
  18. 18. A computer system according to claim 16 or 17 comprising a master node for dividing the dataset into respective portions, distributing the portions of the dataset to respective data nodes for processing, and collecting and amalgamating the results of the processing by the data nodes.
  19. 19. A computer system according to any of claims 16 to 18 which is operative to store data generated by the selected algorithms in a memory space defined by a plurality of geographically distributed servers.
  20. 20. A computer system according to any of claims 16 to 19, which is operative to generate an XML database of data generated by the selected algorithms.
  21. 21. A computer system according to any of claims 16 to 20 which is operative to use at least one pre-existing database to generate a visual representation of data generated by the selected algorithms.
  22. 22. A computer program product for designing a computational process for analysing a dataset, the program instructions being operative when implemented by the processor of a computer system to cause the computer system to:
    present to a user a graphical user interface (GUI) comprising a plurality of areas representative of respective associated data analysis algorithms, each of the algorithms being characterized by one or more numerical parameters, receive first user input from the user specifying a selection from among the areas, thereby selecting one or more of the algorithms, receive second user input from the user specifying a positioning within a workspace portion of the GUI of respective graphical elements representing the selected algorithms, thereby defining a workflow in which the selected algorithms are performed, and receive third user input from the user specifying the values of at least one of the parameters of the selected algorithms.
  23. 23. A computer program product according to claim 22 in which the program instructions are operative to cause the processor to populate using the third user input a respective functionality requirements document for each of the selected algorithms, the functionality requirements document specifying a plurality of characteristics of the corresponding algorithm and a location of software for performing the algorithm.
  24. 24. A computer program product according to claim 22 or 23 in which the algorithms comprise algorithms for comparing genetic sequences to a reference genome to align the genetic sequences with the reference genome, and identifying instances of variants between the genetic sequences and the reference genome.
  25. 25. A computer program product according to any of claims 22 to 24 which is operative to cause the computer system to receive fourth user input specifying that the workflow includes at least one software module for at least one of (i) generating a database of data output by the selected algorithms, and (ii) visualizing the data output by the selected algorithms.
  26. 26. A computer program product according to any of claims 22 to 25 comprising an implementation module operative when performed by the processor to cause the computer system to perform the designed computational process.
  27. 27. A computer program product according to claim 26 comprising shell scripts to generate program instructions for the algorithms using the parameter values specified by the user.
  28. 28. A computer program product according to claim 26 or 27 in which the implementation module is operative to cause the processor to divide the dataset into respective portions, distribute the portions of the dataset to respective data nodes for processing, and collect and amalgamate the results of the processing by the data nodes.
  29. 29. A computer program product according to any of claims 26 to 28 in which the implementation module is operative to store data generated by the selected algorithms in a memory space defined by a plurality of geographically distributed servers.
  30. 30. A computer program product according to any of claims 26 to 29 in which the implementation module is operative to generate an XML database of data generated by the selected algorithms.
  31. 31. A computer program product according to any of claims 26 to 30 in which the implementation module is operative to use at least one pre-existing database to generate a visual representation of data generated by the selected algorithms.

Drawings (sheets 2/7 to 7/7; images not reproduced, only legible labels retained)

Sheet 2/7: Fig. 2 (reference numerals 531, 532, 53n).
Sheet 3/7: Fig. 3.
Sheet 4/7: Fig. 5.
Sheet 5/7: Fig. 6 and Fig. 7. Legible text: a "Data" panel with Name and Value columns for a Bayes component, listing Component ID, Caption, Server, Generations, Nst, Rates and OmegaVar.
Sheet 6/7: Fig. 8, an "SNPs Data Browser" table with columns including id, Chromosome, Start, End, Reference, Variant Allele, Variant Type, Sample Count, Expected allele Frequency, dbSNP match category, dbSNP Id, Strand and Gene. Also a track-selection list with Alignments (BAM), Consensus, coverage and VCF entries for SS1 to SS6, together with reference sequence and annotation entries (figure label not legible).
Sheet 7/7: Fig. 10.

AU2015414467A 2015-11-12 2015-11-12 Methods and systems for generating workflows for analysing large data sets Abandoned AU2015414467A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2022241571A AU2022241571A1 (en) 2015-11-12 2022-09-29 Methods and systems for generating workflows for analysing large data sets

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/SG2015/050446 WO2017082813A1 (en) 2015-11-12 2015-11-12 Methods and systems for generating workflows for analysing large data sets

Related Child Applications (1)

Application Number Title Priority Date Filing Date
AU2022241571A Division AU2022241571A1 (en) 2015-11-12 2022-09-29 Methods and systems for generating workflows for analysing large data sets

Publications (1)

Publication Number Publication Date
AU2015414467A1 true AU2015414467A1 (en) 2018-06-28

Family

ID=58695833

Family Applications (2)

Application Number Title Priority Date Filing Date
AU2015414467A Abandoned AU2015414467A1 (en) 2015-11-12 2015-11-12 Methods and systems for generating workflows for analysing large data sets
AU2022241571A Pending AU2022241571A1 (en) 2015-11-12 2022-09-29 Methods and systems for generating workflows for analysing large data sets

Family Applications After (1)

Application Number Title Priority Date Filing Date
AU2022241571A Pending AU2022241571A1 (en) 2015-11-12 2022-09-29 Methods and systems for generating workflows for analysing large data sets

Country Status (4)

Country Link
AU (2) AU2015414467A1 (en)
GB (1) GB2565439A (en)
NZ (1) NZ743378A (en)
WO (1) WO2017082813A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240094996A1 (en) * 2022-09-15 2024-03-21 International Business Machines Corporation Auto-wrappering tools with guidance from exemplar commands

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004192257A (en) * 2002-12-10 2004-07-08 Nec Corp Array display method/device/program/recording medium, and homology retrieval method/device/ program/recording medium
JP5790006B2 (en) * 2010-05-25 2015-10-07 ソニー株式会社 Information processing apparatus, information processing method, and program
WO2013037687A1 (en) * 2011-09-14 2013-03-21 Siemens Aktiengesellschaft A system and method for managing development of a test piece of code
US9501202B2 (en) * 2013-03-15 2016-11-22 Palantir Technologies, Inc. Computer graphical user interface with genomic workflow
US20150066383A1 (en) * 2013-09-03 2015-03-05 Seven Bridges Genomics Inc. Collapsible modular genomic pipeline

Also Published As

Publication number Publication date
NZ743378A (en) 2022-10-28
WO2017082813A1 (en) 2017-05-18
GB201813187D0 (en) 2018-09-26
GB2565439A (en) 2019-02-13
AU2022241571A1 (en) 2022-10-27

Legal Events

Date Code Title Description
MK5 Application lapsed section 142(2)(e) - patent request and compl. specification not accepted