NZ743378A - Methods and systems for generating workflows for analysing large data sets - Google Patents
Methods and systems for generating workflows for analysing large data sets
- Publication number
- NZ743378A
- Authority
- NZ
- New Zealand
- Prior art keywords
- algorithms
- selected algorithms
- user input
- computer system
- data
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/30—Creation or generation of source code
- G06F8/34—Graphical or visual programming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/30—Data warehousing; Computing architectures
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- General Health & Medical Sciences (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Evolutionary Biology (AREA)
- Biotechnology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Databases & Information Systems (AREA)
- Biophysics (AREA)
- Bioethics (AREA)
- Business, Economics & Management (AREA)
- Software Systems (AREA)
- Economics (AREA)
- Strategic Management (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Entrepreneurship & Innovation (AREA)
- Human Resources & Organizations (AREA)
- Operations Research (AREA)
- Marketing (AREA)
- Game Theory and Decision Science (AREA)
- General Business, Economics & Management (AREA)
- Educational Administration (AREA)
- Development Economics (AREA)
- Quality & Reliability (AREA)
- Tourism & Hospitality (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Stored Programmes (AREA)
Abstract
A method is proposed for designing a computational process for performing a computational task such as a next generation sequencing task. A user who wishes to instigate the workflow for analysing a dataset is presented with a graphical user interface (GUI) comprising a plurality of areas representative of respective associated data analysis algorithms, each of the algorithms being characterized by one or more numerical parameters. Default values have been defined for some or all of the numerical parameters, and one or more of the parameters of each algorithm may be modified using the GUI. The GUI allows the user to select and combine a plurality of the icons (e.g. by drag and drop operations) and set one or more of the modifiable parameters, to form a workflow comprising the corresponding algorithms.
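The relationship between algorithms, default parameter values and user overrides described in the abstract can be sketched as a small data model. This is an illustrative sketch only, not the patented implementation: the class names `Algorithm` and `Step`, the algorithm names, and the parameter names are all assumptions introduced here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Algorithm:
    """One selectable analysis step with numerical parameters and defaults."""
    name: str
    defaults: dict[str, float] = field(default_factory=dict)


@dataclass
class Step:
    """An algorithm placed in the workflow, with any user overrides."""
    algorithm: Algorithm
    overrides: dict[str, float] = field(default_factory=dict)

    def parameters(self) -> dict[str, float]:
        # Reject overrides for parameters the algorithm does not define.
        unknown = set(self.overrides) - set(self.algorithm.defaults)
        if unknown:
            raise KeyError(f"unknown parameters: {sorted(unknown)}")
        # User-supplied values take precedence over the defaults.
        return {**self.algorithm.defaults, **self.overrides}


# Hypothetical algorithms; names and parameters are illustrative only.
aligner = Algorithm("align", {"seed_length": 19, "threads": 1})
caller = Algorithm("call_variants", {"min_quality": 20})

# The list order stands in for the positioning of icons in the workspace.
workflow = [Step(aligner, {"threads": 4}), Step(caller)]
resolved = [(s.algorithm.name, s.parameters()) for s in workflow]
```

Here `resolved` pairs each step's name with its effective parameters, so a step the user never touched simply runs with its defaults.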
Claims (26)
1. A computer-implemented method of analysing a dataset carried out by a master node of a cloud-based network of distributed processing units, the method comprising: generating a graphical user interface (GUI) for presentation to a user, the GUI comprising a plurality of areas representative of respective associated data analysis algorithms, each of the algorithms being characterized by one or more numerical parameters, receiving first user input specifying a selection from among the areas, thereby selecting one or more of the algorithms, receiving second user input specifying a positioning within a workspace portion of the GUI of respective graphical elements representing the selected algorithms, thereby defining a workflow in which the selected algorithms are performed, receiving third user input specifying the values of at least one of the parameters of the selected algorithms, executing the workflow by generating commands for the selected algorithms using the values of the at least one of the parameters of the selected algorithms, receiving user input specifying the number of distributed processing units to be used to implement the commands and controlling the specified number of distributed processing units of the cloud-based network to implement the commands, storing an output file comprising results generated by each of the selected algorithms in a database, and generating a visualization of the results generated by each of the selected algorithms.
2. A method according to claim 1 in which the master node comprises a data storage unit storing for each of the algorithms a corresponding functionality requirements document, the functionality requirements document specifying a plurality of characteristics of the corresponding algorithm and a location of software for performing the algorithm.
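As an illustration of what a functionality requirements document might hold, the sketch below parses a JSON record and checks that it names both the algorithm's characteristics and the location of its software. The claim does not specify a file format; JSON, the key names, and the path `/opt/tools/align` are assumptions made for this example.

```python
import json

# Hypothetical document layout; the claim only requires "characteristics"
# of the algorithm and the "location of software" that performs it.
FRD_TEXT = """
{
  "algorithm": "align",
  "characteristics": {
    "input": "fastq",
    "output": "sam",
    "parameters": ["seed_length", "threads"]
  },
  "software_location": "/opt/tools/align"
}
"""

REQUIRED_KEYS = {"algorithm", "characteristics", "software_location"}


def load_frd(text: str) -> dict:
    """Parse a functionality requirements document and check required keys."""
    doc = json.loads(text)
    missing = REQUIRED_KEYS - doc.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return doc


frd = load_frd(FRD_TEXT)
```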
3. A method according to claim 1 or claim 2 in which the dataset comprises one or more genetic sequences.
4. A method according to claim 3 in which the computational process is for comparing the genetic sequences to a reference genome to align the genetic sequences with the reference genome, and identifying instances of variants between the genetic sequences and the reference genome.
5. A computer-implemented method according to any preceding claim further including receiving fourth user input specifying that the workflow includes at least one software module for at least one of (i) generating the database of data output by the selected algorithms, and (ii) visualizing the data output by the selected algorithms.
6. A computer-implemented method according to any preceding claim in which the master node implements the computational process by activating shell scripts to generate program instructions for the algorithms using the parameter values specified by the user.
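One way to picture the generation of program instructions from user-specified parameter values is a function that renders each algorithm invocation as a command line. The claim refers to shell scripts; this sketch instead uses Python's standard `shlex` module to build a shell-safe string. The function name `build_command`, the `--flag value` convention, and the executable path are assumptions for illustration.

```python
from __future__ import annotations

import shlex


def build_command(executable: str, params: dict[str, object], inputs: list[str]) -> str:
    """Render one algorithm invocation as a shell-safe command line."""
    argv = [executable]
    for name, value in params.items():
        # Assumed convention: parameter "seed_length" becomes "--seed-length".
        argv += [f"--{name.replace('_', '-')}", str(value)]
    argv += inputs
    # shlex.join quotes any argument that contains spaces or shell metacharacters.
    return shlex.join(argv)


cmd = build_command(
    "/opt/tools/align",
    {"seed_length": 19, "threads": 4},
    ["reads 1.fastq"],
)
```

Quoting the arguments at this point matters because file names chosen by users may contain spaces or characters that a shell would otherwise interpret.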
7. A computer-implemented method according to any preceding claim in which the cloud-based network of distributed processing units comprises a plurality of data nodes, and the implementation of the computation process comprises dividing the dataset into respective portions for the data nodes, distributing the portions of the dataset to the respective data nodes for processing, and amalgamating the results.
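The divide, distribute and amalgamate pattern of this claim can be sketched on a single machine. In the example below, worker threads stand in for data nodes and a GC-base count stands in for the per-node analysis; both substitutions, along with the function names, are assumptions for illustration rather than the system the patent describes.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor


def split(records: list[str], n_parts: int) -> list[list[str]]:
    """Divide records into n_parts contiguous portions of near-equal size."""
    size, extra = divmod(len(records), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` portions receive one additional record.
        end = start + size + (1 if i < extra else 0)
        parts.append(records[start:end])
        start = end
    return parts


def count_gc(portion: list[str]) -> int:
    """Stand-in per-node task: count G and C bases in a portion."""
    return sum(seq.count("G") + seq.count("C") for seq in portion)


def run_distributed(records: list[str], n_nodes: int) -> int:
    portions = split(records, n_nodes)
    # Threads stand in for data nodes here; a real cluster would ship
    # each portion to a separate machine.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(count_gc, portions))
    return sum(partials)  # amalgamate the per-node results


reads = ["GATTACA", "CCGG", "ATAT", "GGC", "TTTT"]
total = run_distributed(reads, n_nodes=3)
```

The `n_nodes` argument plays the role of the user-specified number of distributed processing units in claim 1. Summation works as the amalgamation step only because the stand-in task is additive; other analyses would need their own merge logic.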
8. A computer-implemented method according to any preceding claim further including storing data generated by the selected algorithms in a memory space defined by a plurality of geographically distributed servers.
9. A computer-implemented method according to any preceding claim, wherein the output file is stored in an XML database.
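A minimal sketch of serialising per-algorithm results to XML, using Python's standard `xml.etree.ElementTree` module, is shown below. The element and attribute names (`workflow_results`, `algorithm`, `result`, `key`) are hypothetical, and an in-memory string stands in for the XML database named in the claim.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET


def results_to_xml(results: dict[str, dict[str, str]]) -> str:
    """Serialise {algorithm: {key: value}} results as an XML string."""
    root = ET.Element("workflow_results")
    for algorithm, fields in results.items():
        node = ET.SubElement(root, "algorithm", name=algorithm)
        for key, value in fields.items():
            ET.SubElement(node, "result", key=key).text = value
    return ET.tostring(root, encoding="unicode")


xml_text = results_to_xml({"align": {"mapped_reads": "1200"}})
parsed = ET.fromstring(xml_text)  # round-trip to confirm it is well-formed
```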
10. A computer-implemented method according to any preceding claim further including using at least one pre-existing database to generate a visual representation of data generated by the selected algorithms.
11. A computer system for analysing a dataset, the computer system being configured as a master node of a cloud-based network of distributed processing units, the computer system comprising a processor and a data storage device storing program instructions operative when implemented by the processor to cause the computer system to: generate a graphical user interface (GUI) for presentation to a user, the GUI comprising a plurality of areas representative of respective associated data analysis algorithms, each of the algorithms being characterized by one or more numerical parameters, receive first user input specifying a selection from among the areas, thereby selecting one or more of the algorithms, receive second user input specifying a positioning within a workspace portion of the GUI of respective graphical elements representing the selected algorithms, thereby defining a workflow in which the selected algorithms are performed, receive third user input specifying the values of at least one of the parameters of the selected algorithms, execute the workflow by generating commands for the selected algorithms using the values of the at least one of the parameters of the selected algorithms, receiving user input specifying the number of distributed processing units to be used to implement the commands and controlling the specified number of distributed processing units of the cloud-based network to implement the commands, store an output file comprising results generated by each of the selected algorithms in a database, and generate a visualization of the results generated by each of the selected algorithms.
12. A computer system according to claim 11 in which the data storage device stores for each of the algorithms a corresponding functionality requirements document, the functionality requirements document specifying a plurality of characteristics of the corresponding algorithm and a location of software for performing the algorithm.
13. A computer system according to claim 11 or 12 in which the algorithms comprise algorithms for comparing genetic sequences to a reference genome to align the genetic sequences with the reference genome, and identifying instances of variants between the genetic sequences and the reference genome.
14. A computer system according to any of claims 11 to 13 which is operative to receive fourth user input specifying that the workflow includes at least one software module for at least one of (i) generating the database of data output by the selected algorithms, and (ii) visualizing the data output by the selected algorithms.
15. A computer system according to any of claims 11 to 14 in which the data storage device stores shell scripts to generate program instructions for the algorithms using the parameter values specified by the user.
16. A computer system according to any of claims 11 to 15 further configured to divide the dataset into respective portions, distribute the portions of the dataset to respective data nodes for processing, and collect and amalgamate the results of the processing by the data nodes.
17. A computer system according to any of claims 11 to 16 which is operative to store data generated by the selected algorithms in a memory space defined by a plurality of geographically distributed servers.
18. A computer system according to any of claims 11 to 17, which is operative to generate an XML database as the database.
19. A computer system according to any of claims 11 to 18 which is operative to use at least one pre-existing database to generate a visual representation of data generated by the selected algorithms.
20. A computer program product for analysing a dataset, the program instructions being operative when implemented by the processor of a computer system configured as a master node of a cloud-based network of distributed processing units to cause the computer system to: generate a graphical user interface (GUI) for presentation to a user, the GUI comprising a plurality of areas representative of respective associated data analysis algorithms, each of the algorithms being characterized by one or more numerical parameters, receive first user input specifying a selection from among the areas, thereby selecting one or more of the algorithms, receive second user input specifying a positioning within a workspace portion of the GUI of respective graphical elements representing the selected algorithms, thereby defining a workflow in which the selected algorithms are performed, and receive third user input specifying the values of at least one of the parameters of the selected algorithms, execute the workflow by generating commands for the selected algorithms using the values of the at least one of the parameters of the selected algorithms, receiving user input specifying the number of distributed processing units to be used to implement the commands and controlling the specified number of distributed processing units of the cloud-based network to implement the commands, store an output file comprising results generated by each of the selected algorithms in a database, and generate a visualization of the results generated by each of the selected algorithms.
21. A computer program product according to claim 20 in which the program instructions are operative to cause the processor to populate using the third user input a respective functionality requirements document for each of the selected algorithms, the functionality requirements document specifying a plurality of characteristics of the corresponding algorithm and a location of software for performing the algorithm.
22. A computer program product according to claim 20 or 21 in which the algorithms comprise algorithms for comparing genetic sequences to a reference genome to align the genetic sequences with the reference genome, and identifying instances of variants between the genetic sequences and the reference genome.
23. A computer program product according to any of claims 20 to 22 which is operative to cause the computer system to receive fourth user input specifying that the workflow includes at least one software module for at least one of (i) generating the database of data output by the selected algorithms, and (ii) visualizing the data output by the selected algorithms.
24. A computer program product according to any of claims 20 to 23 comprising shell scripts to generate program instructions for the algorithms using the parameter values specified by the user.
25. A computer program product according to any of claims 20 to 24 in which the implementation module is operative to cause the processor to divide the dataset into respective portions, distribute the portions of the dataset to respective data nodes for processing, and collect and amalgamate the results of the processing by the data nodes.
26. A computer program product according to any of claims 20 to 25 in which the implementation module is operative to store data generated by the selected algorithms in a memory space defined by a plurality of geographically distributed servers.
27. A computer program product according to any of claims 20 to 26 in which the implementation module is operative to generate an XML database of data generated by the selected algorithms.
28. A computer program product according to any of claims 20 to 27 in which the implementation module is operative to use at least one pre-existing database to generate a visual representation of data generated by the selected algorithms.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/SG2015/050446 WO2017082813A1 (en) | 2015-11-12 | 2015-11-12 | Methods and systems for generating workflows for analysing large data sets |
Publications (1)
Publication Number | Publication Date |
---|---|
NZ743378A (en) | 2022-10-28 |
Family
ID=58695833
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
NZ743378A NZ743378A (en) | 2015-11-12 | 2015-11-12 | Methods and systems for generating workflows for analysing large data sets |
Country Status (4)
Country | Link |
---|---|
AU (2) | AU2015414467A1 (en) |
GB (1) | GB2565439A (en) |
NZ (1) | NZ743378A (en) |
WO (1) | WO2017082813A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US12020007B2 (en) * | 2022-09-15 | 2024-06-25 | International Business Machines Corporation | Auto-wrappering tools with guidance from exemplar commands |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004192257A (en) * | 2002-12-10 | 2004-07-08 | Nec Corp | Array display method/device/program/recording medium, and homology retrieval method/device/ program/recording medium |
JP5790006B2 (en) * | 2010-05-25 | 2015-10-07 | ソニー株式会社 | Information processing apparatus, information processing method, and program |
WO2013037687A1 (en) * | 2011-09-14 | 2013-03-21 | Siemens Aktiengesellschaft | A system and method for managing development of a test piece of code |
US9501202B2 (en) * | 2013-03-15 | 2016-11-22 | Palantir Technologies, Inc. | Computer graphical user interface with genomic workflow |
US20150066383A1 (en) * | 2013-09-03 | 2015-03-05 | Seven Bridges Genomics Inc. | Collapsible modular genomic pipeline |
-
2015
- 2015-11-12 AU AU2015414467A patent/AU2015414467A1/en not_active Abandoned
- 2015-11-12 NZ NZ743378A patent/NZ743378A/en unknown
- 2015-11-12 GB GB1813187.0A patent/GB2565439A/en not_active Withdrawn
- 2015-11-12 WO PCT/SG2015/050446 patent/WO2017082813A1/en active Application Filing
-
2022
- 2022-09-29 AU AU2022241571A patent/AU2022241571A1/en not_active Abandoned
Also Published As
Publication number | Publication date |
---|---|
GB201813187D0 (en) | 2018-09-26 |
GB2565439A (en) | 2019-02-13 |
WO2017082813A1 (en) | 2017-05-18 |
AU2022241571A1 (en) | 2022-10-27 |
AU2015414467A1 (en) | 2018-06-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Kumar et al. | AIR: A batch-oriented web program package for construction of supermatrices ready for phylogenomic analyses | |
JP2020527794A5 (en) | ||
JP6594950B2 (en) | Summary of data lineage | |
CN106529682A (en) | Method and apparatus for processing deep learning task in big-data cluster | |
US11062213B2 (en) | Table-meaning estimation system, method, and program | |
WO2016036824A4 (en) | Visually specifying subsets of components in graph-based programs through user interactions | |
EP3032442B1 (en) | Modeling and simulation of infrastructure architecture for big data | |
CN108229101B (en) | NGS-based targeted sequencing data simulation method and device | |
JP5051135B2 (en) | Resource information collection device, resource information collection method, program, and collection schedule generation device | |
CN108008942B (en) | Method and system for processing data records | |
JP2017535856A (en) | Job creation using data preview | |
JP2022009364A (en) | Method and system for flexible pipeline generation | |
CN114598631B (en) | Neural network computing-oriented modeling method and device for distributed data routing | |
Xu et al. | A memetic algorithm for the re-entrant permutation flowshop scheduling problem to minimize the makespan | |
Carrillo et al. | A post-pareto approach for multi-objective decision making using a non-uniform weight generator method | |
KR20150084596A (en) | The method for parameter investigation to optimal design | |
EP3420485A1 (en) | Method and system for quantifying the likelihood that a gene is casually linked to a disease | |
JPWO2018025707A1 (en) | Table meaning estimation system, method and program | |
Vlasov et al. | Cloud technology in simulation studies: GPSS cloud project | |
NZ743378A (en) | Methods and systems for generating workflows for analysing large data sets | |
CN106708609B (en) | Feature generation method and system | |
CN108595149B (en) | Reconfigurable multiply-add operation device | |
JP2020052451A (en) | Computer system and pattern generation method of business flow | |
US20140236977A1 (en) | Mapping epigenetic surprisal data througth hadoop type distributed file systems | |
JP6370230B2 (en) | Secret calculation control device, secret calculation control method, and secret calculation control program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PSEA | Patent sealed | ||
RENW | Renewal (renewal fees accepted) | Free format text: PATENT RENEWED FOR 1 YEAR UNTIL 12 NOV 2024 BY CPA GLOBAL; Effective date: 20231214 |