NZ743378A - Methods and systems for generating workflows for analysing large data sets

Methods and systems for generating workflows for analysing large data sets

Info

Publication number
NZ743378A
Authority
NZ
New Zealand
Prior art keywords
algorithms
selected algorithms
user input
computer system
data
Prior art date
Application number
NZ743378A
Inventor
Dadabhai T Singh
Original Assignee
Cloudseq Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-11-12
Filing date
2015-11-12
Publication date
2022-10-28
Application filed by Cloudseq Pte Ltd
Publication of NZ743378A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 - Arrangements for software engineering
    • G06F8/30 - Creation or generation of source code
    • G06F8/34 - Graphical or visual programming
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 - Administration; Management
    • G06Q10/06 - Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 - ICT programming tools or database systems specially adapted for bioinformatics
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 - ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 - Data warehousing; Computing architectures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Bioethics (AREA)
  • Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Human Resources & Organizations (AREA)
  • Operations Research (AREA)
  • Marketing (AREA)
  • Game Theory and Decision Science (AREA)
  • General Business, Economics & Management (AREA)
  • Educational Administration (AREA)
  • Development Economics (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Stored Programmes (AREA)

Abstract

A method is proposed for designing a computational process for performing a computational task such as a next generation sequencing task. A user who wishes to instigate a workflow for analysing a dataset is presented with a graphical user interface (GUI) comprising a plurality of areas (icons) representative of respective associated data analysis algorithms, each of the algorithms being characterized by one or more numerical parameters. Default values have been defined for some or all of the numerical parameters, and one or more of the parameters of each algorithm may be modified using the GUI. The GUI allows the user to select and combine a plurality of the icons (e.g. by drag and drop operations) and set one or more of the modifiable parameters, to form a workflow comprising the corresponding algorithms.
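
As an illustration of the approach summarised in the abstract, the following Python sketch shows one way a workflow step could overlay user-set values on default parameter values and render a command line for the corresponding algorithm. The class names (`Algorithm`, `Step`), the function `build_commands`, the executable names and the `--name value` flag style are all assumptions made for illustration; none of them is taken from the patent.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class Algorithm:
    """One selectable GUI area: a tool plus its numerical parameters and defaults."""
    name: str
    executable: str              # hypothetical executable name
    defaults: dict[str, float]   # default value for each numerical parameter


@dataclass
class Step:
    """An algorithm placed in the workspace, with any user-modified parameters."""
    algorithm: Algorithm
    overrides: dict[str, float] = field(default_factory=dict)

    def parameters(self) -> dict[str, float]:
        # Reject parameters the algorithm does not define.
        unknown = set(self.overrides) - set(self.algorithm.defaults)
        if unknown:
            raise KeyError(f"unknown parameters for {self.algorithm.name}: {sorted(unknown)}")
        # User-set values take precedence over the defaults.
        return {**self.algorithm.defaults, **self.overrides}

    def command(self, input_path: str) -> str:
        # Assumed flag style: "--<parameter> <value>", followed by the input file.
        args = [self.algorithm.executable]
        for key, value in self.parameters().items():
            args += [f"--{key}", str(value)]
        args.append(input_path)
        return shlex.join(args)


def build_commands(steps: list[Step], input_path: str) -> list[str]:
    """Render one command per step, in workflow order."""
    return [step.command(input_path) for step in steps]
```

For example, a step built from `Algorithm("align", "hypothetical-aligner", {"threads": 4, "min-score": 30})` with the override `{"min-score": 20}` keeps the default for `threads` and uses the user-set value for `min-score`. Chaining the output of one step into the input of the next is left out to keep the sketch short.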

Claims (26)

CLAIMS :
1. A computer-implemented method of analysing a dataset carried out by a master node of a cloud-based network of distributed processing units, the method comprising:
generating a graphical user interface (GUI) for presentation to a user, the GUI comprising a plurality of areas representative of respective associated data analysis algorithms, each of the algorithms being characterized by one or more numerical parameters,
receiving first user input specifying a selection from among the areas, thereby selecting one or more of the algorithms,
receiving second user input specifying a positioning within a workspace portion of the GUI of respective graphical elements representing the selected algorithms, thereby defining a workflow in which the selected algorithms are performed,
receiving third user input specifying the values of at least one of the parameters of the selected algorithms,
executing the workflow by generating commands for the selected algorithms using the values of the at least one of the parameters of the selected algorithms,
receiving user input specifying the number of distributed processing units to be used to implement the commands and controlling the specified number of distributed processing units of the cloud-based network to implement the commands,
storing an output file comprising results generated by each of the selected algorithms in a database, and
generating a visualization of the results generated by each of the selected algorithms.
2. A method according to claim 1 in which the master node comprises a data storage unit storing for each of the algorithms a corresponding functionality requirements document, the functionality requirements document specifying a plurality of characteristics of the corresponding algorithm and a location of software for performing the algorithm.
3. A method according to claim 1 or claim 2 in which the dataset comprises one or more genetic sequences.
4. A method according to claim 3 in which the computational process is for comparing the genetic sequences to a reference genome to align the genetic sequences with the reference genome, and identifying instances of variants between the genetic sequences and the reference genome.
5. A computer-implemented method according to any preceding claim further including receiving fourth user input specifying that the workflow includes at least one software module for at least one of (i) generating the database of data output by the selected algorithms, and (ii) visualizing the data output by the selected algorithms.
6. A computer-implemented method according to any preceding claim in which the master node implements the computational process by activating shell scripts to generate program instructions for the algorithms using the parameter values specified by the user.
7. A computer-implemented method according to any preceding claim in which the cloud-based network of distributed processing units comprises a plurality of data nodes, and the implementation of the computation process comprises dividing the dataset into respective portions for the data nodes, distributing the portions of the dataset to the respective data nodes for processing, and amalgamating the results.
8. A computer-implemented method according to any preceding claim further including storing data generated by the selected algorithms in a memory space defined by a plurality of geographically distributed servers.
9. A computer-implemented method according to any preceding claim, wherein the output file is stored in an XML database.
10. A computer-implemented method according to any preceding claim further including using at least one pre-existing database to generate a visual representation of data generated by the selected algorithms.
11. A computer system for analysing a dataset, the computer system being configured as a master node of a cloud-based network of distributed processing units, the computer system comprising a processor and a data storage device storing program instructions operative when implemented by the processor to cause the computer system to:
generate a graphical user interface (GUI) for presentation to a user, the GUI comprising a plurality of areas representative of respective associated data analysis algorithms, each of the algorithms being characterized by one or more numerical parameters,
receive first user input specifying a selection from among the areas, thereby selecting one or more of the algorithms,
receive second user input specifying a positioning within a workspace portion of the GUI of respective graphical elements representing the selected algorithms, thereby defining a workflow in which the selected algorithms are performed,
receive third user input specifying the values of at least one of the parameters of the selected algorithms,
execute the workflow by generating commands for the selected algorithms using the values of the at least one of the parameters of the selected algorithms,
receiving user input specifying the number of distributed processing units to be used to implement the commands and controlling the specified number of distributed processing units of the cloud-based network to implement the commands,
store an output file comprising results generated by each of the selected algorithms in a database, and
generate a visualization of the results generated by each of the selected algorithms.
12. A computer system according to claim 11 in which the data storage device stores for each of the algorithms a corresponding functionality requirements document, the functionality requirements document specifying a plurality of characteristics of the corresponding algorithm and a location of software for performing the algorithm.
13. A computer system according to claim 11 or 12 in which the algorithms comprise algorithms for comparing genetic sequences to a reference genome to align the genetic sequences with the reference genome, and identifying instances of variants between the genetic sequences and the reference genome.
14. A computer system according to any of claims 11 to 13 which is operative to receive fourth user input specifying that the workflow includes at least one software module for at least one of (i) generating the database of data output by the selected algorithms, and (ii) visualizing the data output by the selected algorithms.
15. A computer system according to any of claims 11 to 14 in which the data storage device stores shell scripts to generate program instructions for the algorithms using the parameter values specified by the user.
16. A computer system according to any of claims 11 to 15 further configured to divide the dataset into respective portions, distribute the portions of the dataset to respective data nodes for processing, and collect and amalgamate the results of the processing by the data nodes.
17. A computer system according to any of claims 11 to 16 which is operative to store data generated by the selected algorithms in a memory space defined by a plurality of geographically distributed servers.
18. A computer system according to any of claims 11 to 17, which is operative to generate an XML database as the database.
19. A computer system according to any of claims 11 to 18 which is operative to use at least one pre-existing database to generate a visual representation of data generated by the selected algorithms.
20. A computer program product for analysing a dataset, the program instructions being operative when implemented by the processor of a computer system configured as a master node of a cloud-based network of distributed processing units to cause the computer system to:
generate a graphical user interface (GUI) for presentation to a user, the GUI comprising a plurality of areas representative of respective associated data analysis algorithms, each of the algorithms being characterized by one or more numerical parameters,
receive first user input specifying a selection from among the areas, thereby selecting one or more of the algorithms,
receive second user input specifying a positioning within a workspace portion of the GUI of respective graphical elements representing the selected algorithms, thereby defining a workflow in which the selected algorithms are performed, and
receive third user input specifying the values of at least one of the parameters of the selected algorithms,
execute the workflow by generating commands for the selected algorithms using the values of the at least one of the parameters of the selected algorithms,
receiving user input specifying the number of distributed processing units to be used to implement the commands and controlling the specified number of distributed processing units of the cloud-based network to implement the commands,
store an output file comprising results generated by each of the selected algorithms in a database, and
generate a visualization of the results generated by each of the selected algorithms.
21. A computer program product according to claim 20 in which the program instructions are operative to cause the processor to populate using the third user input a respective functionality requirements document for each of the selected algorithms, the functionality requirements document specifying a plurality of characteristics of the corresponding algorithm and a location of software for performing the algorithm.
22. A computer program product according to claim 20 or 21 in which the algorithms comprise algorithms for comparing genetic sequences to a reference genome to align the genetic sequences with the reference genome, and identifying instances of variants between the genetic sequences and the reference genome.
23. A computer program product according to any of claims 20 to 22 which is operative to cause the computer system to receive fourth user input specifying that the workflow includes at least one software module for at least one of (i) generating the database of data output by the selected algorithms, and (ii) visualizing the data output by the selected algorithms.
24. A computer program product according to any of claims 20 to 23 comprising shell scripts to generate program instructions for the algorithms using the parameter values specified by the user.
25. A computer program product according to any of claims 20 to 24 in which the implementation module is operative to cause the processor to divide the dataset into respective portions, distribute the portions of the dataset to respective data nodes for processing, and collect and amalgamate the results of the processing by the data nodes.
26. A computer program product according to any of claims 20 to 25 in which the implementation module is operative to store data generated by the selected algorithms in a memory space defined by a plurality of geographically distributed servers.
27. A computer program product according to any of claims 20 to 26 in which the implementation module is operative to generate an XML database of data generated by the selected algorithms.
28. A computer program product according to any of claims 20 to 27 in which the implementation module is operative to use at least one pre-existing database to generate a visual representation of data generated by the selected algorithms.
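
The divide, distribute and amalgamate steps recited in the claims can be pictured with the short Python sketch below. It is only an illustration under stated assumptions: local threads stand in for the data nodes of the cloud-based network, the per-node task (counting mismatches between reads and a reference string) is a toy substitute for a real alignment or variant-calling algorithm, and the function names `split_dataset`, `count_mismatches` and `run_on_nodes` are hypothetical rather than taken from the patent.

```python
from concurrent.futures import ThreadPoolExecutor


def split_dataset(records: list[str], n_portions: int) -> list[list[str]]:
    """Divide the records into n_portions contiguous, near-equal portions."""
    if n_portions < 1:
        raise ValueError("n_portions must be at least 1")
    size, extra = divmod(len(records), n_portions)
    portions, start = [], 0
    for i in range(n_portions):
        # The first `extra` portions each take one additional record.
        end = start + size + (1 if i < extra else 0)
        portions.append(records[start:end])
        start = end
    return portions


def count_mismatches(portion: list[str], reference: str) -> int:
    """Toy per-node task: count positions where each read differs from the reference."""
    return sum(a != b for read in portion for a, b in zip(read, reference))


def run_on_nodes(records: list[str], reference: str, n_units: int) -> int:
    """Split the dataset, process each portion on its own worker, and sum the results."""
    portions = split_dataset(records, n_units)
    # Threads are an assumption standing in for distributed processing units.
    with ThreadPoolExecutor(max_workers=n_units) as pool:
        partials = list(pool.map(lambda p: count_mismatches(p, reference), portions))
    # Amalgamation here is a simple sum of the per-portion counts.
    return sum(partials)
```

The `n_units` argument plays the role of the user-specified number of distributed processing units. In a real deployment the amalgamation step would depend on the algorithm, for example merging per-portion alignment files rather than summing counts.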
NZ743378A 2015-11-12 2015-11-12 Methods and systems for generating workflows for analysing large data sets NZ743378A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/SG2015/050446 WO2017082813A1 (en) 2015-11-12 2015-11-12 Methods and systems for generating workflows for analysing large data sets

Publications (1)

Publication Number Publication Date
NZ743378A (en) 2022-10-28

Family

ID=58695833

Family Applications (1)

Application Number Priority Date Filing Date Title
NZ743378A NZ743378A (en) 2015-11-12 2015-11-12 Methods and systems for generating workflows for analysing large data sets

Country Status (4)

Country Link
AU (2) AU2015414467A1 (en)
GB (1) GB2565439A (en)
NZ (1) NZ743378A (en)
WO (1) WO2017082813A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12020007B2 (en) * 2022-09-15 2024-06-25 International Business Machines Corporation Auto-wrappering tools with guidance from exemplar commands

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004192257A (en) * 2002-12-10 2004-07-08 Nec Corp Array display method/device/program/recording medium, and homology retrieval method/device/ program/recording medium
JP5790006B2 (en) * 2010-05-25 2015-10-07 ソニー株式会社 Information processing apparatus, information processing method, and program
WO2013037687A1 (en) * 2011-09-14 2013-03-21 Siemens Aktiengesellschaft A system and method for managing development of a test piece of code
US9501202B2 (en) * 2013-03-15 2016-11-22 Palantir Technologies, Inc. Computer graphical user interface with genomic workflow
US20150066383A1 (en) * 2013-09-03 2015-03-05 Seven Bridges Genomics Inc. Collapsible modular genomic pipeline

Also Published As

Publication number Publication date
GB201813187D0 (en) 2018-09-26
GB2565439A (en) 2019-02-13
WO2017082813A1 (en) 2017-05-18
AU2022241571A1 (en) 2022-10-27
AU2015414467A1 (en) 2018-06-28

Legal Events

Date Code Title Description
PSEA Patent sealed
RENW Renewal (renewal fees accepted)

Free format text: PATENT RENEWED FOR 1 YEAR UNTIL 12 NOV 2024 BY CPA GLOBAL

Effective date: 20231214