NZ743378A - Methods and systems for generating workflows for analysing large data sets

Info

Publication number: NZ743378A
Application number: NZ743378A
Authority: NZ
Inventors: Dadabhai T Singh
Original assignee: Cloudseq Pte Ltd
Priority date: 2015-11-12
Filing date: 2015-11-12
Publication date: 2022-10-28
Also published as: GB201813187D0; GB2565439A; WO2017082813A1; AU2022241571A1; AU2015414467A1

Abstract

A method is proposed for designing a computational process for performing a computational task such as a next generation sequencing task. A user who wishes to instigate the workflow for analysing a dataset is presented with a graphical user interface (GUI) comprising a plurality of areas representative of respective associated data analysis algorithms, each of the algorithms being characterized by one or more numerical parameters. Default values have been defined for some or all of the numerical parameters, and one or more of the parameters of each algorithm may be modified using the GUI. The GUI allows the user to select and combine a plurality of the icons (e.g. by drag and drop operations) and set one or more of the modifiable parameters, to form a workflow comprising the corresponding algorithms.
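The way a workflow of selected algorithms, default parameter values and user modifications could be turned into commands may be illustrated with a minimal Python sketch. It is not taken from the specification: the class names, the tool names `example-aligner` and `example-caller`, and the `--key value` flag convention are all hypothetical and are used here only for illustration.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class Algorithm:
    """One selectable algorithm with default numerical parameters."""
    name: str
    executable: str  # hypothetical command-line tool name
    defaults: dict = field(default_factory=dict)


@dataclass
class Step:
    """An algorithm placed in the workflow, with any user-modified values."""
    algorithm: Algorithm
    overrides: dict = field(default_factory=dict)

    def command(self) -> str:
        # User-supplied values take precedence over the defaults.
        params = {**self.algorithm.defaults, **self.overrides}
        args = [self.algorithm.executable]
        for key, value in params.items():
            args += [f"--{key}", str(value)]
        return shlex.join(args)


def build_workflow(steps):
    """Return one command string per step, in workflow order."""
    return [step.command() for step in steps]


# Hypothetical two-step workflow: align reads, then call variants.
aligner = Algorithm("align", "example-aligner", {"threads": 4, "min-score": 30})
caller = Algorithm("call-variants", "example-caller", {"min-depth": 10})
workflow = [Step(aligner, {"threads": 8}), Step(caller)]
commands = build_workflow(workflow)
```

In this sketch the order of the steps in the list stands in for the positioning of icons in the workspace, and only the `threads` value of the first step is changed from its default.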

Claims

1. A computer-implemented method of analysing a dataset carried out by a master node of a cloud-based network of distributed processing units, the method comprising: generating a graphical user interface (GUI) for presentation to a user, the GUI comprising a plurality of areas representative of respective associated data analysis algorithms, each of the algorithms being characterized by one or more numerical parameters, receiving first user input specifying a selection from among the areas, thereby selecting one or more of the algorithms, receiving second user input specifying a positioning within a workspace portion of the GUI of respective graphical elements representing the selected algorithms, thereby defining a workflow in which the selected algorithms are performed, receiving third user input specifying the values of at least one of the parameters of the selected algorithms, executing the workflow by generating commands for the selected algorithms using the values of the at least one of the parameters of the selected algorithms, receiving user input specifying the number of distributed processing units to be used to implement the commands and controlling the specified number of distributed processing units of the cloud-based network to implement the commands, storing an output file comprising results generated by each of the selected algorithms in a database, and generating a visualization of the results generated by each of the selected algorithms.

2. A method according to claim 1 in which the master node comprises a data storage unit storing for each of the algorithms a corresponding functionality requirements document, the functionality requirements document specifying a plurality of characteristics of the corresponding algorithm and a location of software for performing the algorithm.

3. A method according to claim 1 or claim 2 in which the dataset comprises one or more genetic sequences.

4. A method according to claim 3 in which the computational process is for comparing the genetic sequences to a reference genome to align the genetic sequences with the reference genome, and identifying instances of variants between the genetic sequences and the reference genome.

5. A computer-implemented method according to any preceding claim further including receiving fourth user input specifying that the workflow includes at least one software module for at least one of (i) generating the database of data output by the selected algorithms, and (ii) visualizing the data output by the selected algorithms.

6. A computer-implemented method according to any preceding claim in which the master node implements the computational process by activating shell scripts to generate program instructions for the algorithms using the parameter values specified by the user.

7. A computer-implemented method according to any preceding claim in which the cloud-based network of distributed processing units comprises a plurality of data nodes, and the implementation of the computational process comprises dividing the dataset into respective portions for the data nodes, distributing the portions of the dataset to the respective data nodes for processing, and amalgamating the results.

8. A computer-implemented method according to any preceding claim further including storing data generated by the selected algorithms in a memory space defined by a plurality of geographically distributed servers.

9. A computer-implemented method according to any preceding claim, wherein the output file is stored in an XML database.

10. A computer-implemented method according to any preceding claim further including using at least one pre-existing database to generate a visual representation of data generated by the selected algorithms.

11. A computer system for analysing a dataset, the computer system being configured as a master node of a cloud-based network of distributed processing units, the computer system comprising a processor and a data storage device storing program instructions operative when implemented by the processor to cause the computer system to: generate a graphical user interface (GUI) for presentation to a user, the GUI comprising a plurality of areas representative of respective associated data analysis algorithms, each of the algorithms being characterized by one or more numerical parameters, receive first user input specifying a selection from among the areas, thereby selecting one or more of the algorithms, receive second user input specifying a positioning within a workspace portion of the GUI of respective graphical elements representing the selected algorithms, thereby defining a workflow in which the selected algorithms are performed, receive third user input specifying the values of at least one of the parameters of the selected algorithms, execute the workflow by generating commands for the selected algorithms using the values of the at least one of the parameters of the selected algorithms, receiving user input specifying the number of distributed processing units to be used to implement the commands and controlling the specified number of distributed processing units of the cloud-based network to implement the commands, store an output file comprising results generated by each of the selected algorithms in a database, and generate a visualization of the results generated by each of the selected algorithms.

12. A computer system according to claim 11 in which the data storage device stores for each of the algorithms a corresponding functionality requirements document, the functionality requirements document specifying a plurality of characteristics of the corresponding algorithm and a location of software for performing the algorithm.

13. A computer system according to claim 11 or 12 in which the algorithms comprise algorithms for comparing genetic sequences to a reference genome to align the genetic sequences with the reference genome, and identifying instances of variants between the genetic sequences and the reference genome.

14. A computer system according to any of claims 11 to 13 which is operative to receive fourth user input specifying that the workflow includes at least one software module for at least one of (i) generating the database of data output by the selected algorithms, and (ii) visualizing the data output by the selected algorithms.

15. A computer system according to any of claims 11 to 14 in which the data storage device stores shell scripts to generate program instructions for the algorithms using the parameter values specified by the user.

16. A computer system according to any of claims 11 to 15 further configured to divide the dataset into respective portions, distribute the portions of the dataset to respective data nodes for processing, and collect and amalgamate the results of the processing by the data nodes.

17. A computer system according to any of claims 11 to 16 which is operative to store data generated by the selected algorithms in a memory space defined by a plurality of geographically distributed servers.

18. A computer system according to any of claims 11 to 17, which is operative to generate an XML database as the database.

19. A computer system according to any of claims 11 to 18 which is operative to use at least one pre-existing database to generate a visual representation of data generated by the selected algorithms.

20. A computer program product for analysing a dataset, the program instructions being operative when implemented by the processor of a computer system configured as a master node of a cloud-based network of distributed processing units to cause the computer system to: generate a graphical user interface (GUI) for presentation to a user, the GUI comprising a plurality of areas representative of respective associated data analysis algorithms, each of the algorithms being characterized by one or more numerical parameters, receive first user input specifying a selection from among the areas, thereby selecting one or more of the algorithms, receive second user input specifying a positioning within a workspace portion of the GUI of respective graphical elements representing the selected algorithms, thereby defining a workflow in which the selected algorithms are performed, and receive third user input specifying the values of at least one of the parameters of the selected algorithms, execute the workflow by generating commands for the selected algorithms using the values of the at least one of the parameters of the selected algorithms, receiving user input specifying the number of distributed processing units to be used to implement the commands and controlling the specified number of distributed processing units of the cloud-based network to implement the commands, store an output file comprising results generated by each of the selected algorithms in a database, and generate a visualization of the results generated by each of the selected algorithms.

21. A computer program product according to claim 20 in which the program instructions are operative to cause the processor to populate using the third user input a respective functionality requirements document for each of the selected algorithms, the functionality requirements document specifying a plurality of characteristics of the corresponding algorithm and a location of software for performing the algorithm.

22. A computer program product according to claim 20 or 21 in which the algorithms comprise algorithms for comparing genetic sequences to a reference genome to align the genetic sequences with the reference genome, and identifying instances of variants between the genetic sequences and the reference genome.

23. A computer program product according to any of claims 20 to 22 which is operative to cause the computer system to receive fourth user input specifying that the workflow includes at least one software module for at least one of (i) generating the database of data output by the selected algorithms, and (ii) visualizing the data output by the selected algorithms.

24. A computer program product according to any of claims 20 to 23 comprising shell scripts to generate program instructions for the algorithms using the parameter values specified by the user.

25. A computer program product according to any of claims 20 to 24 in which the implementation module is operative to cause the processor to divide the dataset into respective portions, distribute the portions of the dataset to respective data nodes for processing, and collect and amalgamate the results of the processing by the data nodes.

26. A computer program product according to any of claims 20 to 25 in which the implementation module is operative to store data generated by the selected algorithms in a memory space defined by a plurality of geographically distributed servers.

27. A computer program product according to any of claims 20 to 26 in which the implementation module is operative to generate an XML database of data generated by the selected algorithms.

28. A computer program product according to any of claims 20 to 27 in which the implementation module is operative to use at least one pre-existing database to generate a visual representation of data generated by the selected algorithms.
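
The divide, distribute and amalgamate pattern recited in several of the claims above can be illustrated with a short Python sketch. It rests on stated assumptions and is not the claimed implementation: a local thread pool stands in for the data nodes, the per-portion analysis is a placeholder (counting G and C bases), and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_dataset(records, n_nodes):
    """Divide records into n_nodes contiguous portions of near-equal size."""
    size, extra = divmod(len(records), n_nodes)
    portions, start = [], 0
    for i in range(n_nodes):
        # The first `extra` portions receive one additional record.
        end = start + size + (1 if i < extra else 0)
        portions.append(records[start:end])
        start = end
    return portions


def process_portion(portion):
    """Placeholder analysis: count G and C bases in one portion."""
    return sum(seq.count("G") + seq.count("C") for seq in portion)


def run_distributed(records, n_nodes):
    """Split the dataset, process each portion in a worker, sum the results."""
    portions = split_dataset(records, n_nodes)
    # Assumption: worker threads stand in for remote data nodes.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(process_portion, portions))
    return sum(partials)


reads = ["ACGT", "GGCC", "ATAT", "CGCG", "TTTT"]
total_gc = run_distributed(reads, n_nodes=2)
```

Here `n_nodes` plays the role of the user-specified number of distributed processing units, and the final `sum` is the amalgamation step; a real deployment would replace the thread pool with calls to remote nodes.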