EP4309044A1 - Data pipeline - Google Patents
Data pipeline
Info
- Publication number
- EP4309044A1 (application EP22715864.9A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- data
- files
- file system
- storage
- distributed file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/178—Techniques for file synchronisation in file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
- G06F16/184—Distributed file systems implemented as replicated file system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/541—Interprogram communication via adapters, e.g. between incompatible applications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/546—Message passing systems or structures, e.g. queues
Definitions
- cloud computing services are provided globally to millions of users and customers who reside in different locations (e.g., countries, continents, etc.).
- Various entities provide private or public cloud computing services globally to different customers over various sectors for critical and non-critical applications.
- These entities provide various cloud computing services including, for example, software-as-a-service (SaaS), infrastructure-as-a-service (IaaS), and/or platform-as-a-service (PaaS).
- In order to utilize such cloud computing services, users must transfer locally generated data to the cloud.
- research experiments generate enormous amounts of data and uploading such data to a cloud computing service takes an unsatisfactory amount of time and slows down research.
- cryo-electron microscopy reveals the structure of proteins by probing a flash-frozen solution with a beam of electrons, and then combining two-dimensional (2D) images of individual molecules into a three-dimensional (3D) picture.
- Cryo-EMs are powerful scientific instruments, and they produce enormous amounts of data in the form of 2D pictures of the proteins at high resolutions.
- scientists have to perform a series of computations that requires a large amount of computing power to convert the 2D images into useful 3D models.
- Such a task takes weeks of time on a regular workstation or computer clusters with finite capacities. Excess upload times associated with utilizing a cloud computing service for model generation further exacerbate the time required to generate the 3D models, slowing down research.
- the methods and systems disclosed, individually or in combination, provide a scalable cloud-based data processing and computing platform to support a large-volume data pipeline.
- the disclosure provides a method.
- the method comprises receiving an indication of a synchronization request.
- the method further comprises determining, based on the indication, one or more files stored in a staging location.
- the method further comprises generating, based on the one or more files, a data transfer filter.
- the method further comprises causing, based on the data transfer filter, transfer of the one or more files to a destination computing device.
- the disclosure provides a method.
- the method comprises receiving, via a graphical user interface, a request to convert a dataset from object storage to a distributed file system.
- the method further comprises receiving, via the graphical user interface, an indication of a storage size of the distributed file system.
- the method further comprises converting, based on the request and the indication, the dataset from object storage to the distributed file system associated with the storage size.
- the disclosure provides a method.
- the method comprises identifying a data analysis application program.
- the method further comprises identifying a dataset associated with the data analysis application program.
- the method further comprises determining, as a program template, one or more job parameters associated with the data analysis application program processing the dataset.
- the method further comprises causing, based on the program template, execution of the data analysis application program on the dataset.
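By way of illustration only, a program template can be captured as a set of job parameters and used to launch the data analysis application on a compute backend. The sketch below uses AWS Batch via boto3 as one possible backend; the disclosure does not name a specific job scheduler, and the queue, job definition, and command values are hypothetical.

```python
import boto3

batch = boto3.client("batch")

# Hypothetical program template: job parameters determined for a data
# analysis application processing a dataset (all values illustrative).
program_template = {
    "jobQueue": "hpc-gpu-queue",
    "jobDefinition": "cryoem-refine:3",
    "containerOverrides": {
        "command": ["refine", "--input", "/fsx/dataset-001"],
    },
}

# Cause execution of the data analysis application based on the template.
response = batch.submit_job(jobName="dataset-001-refine", **program_template)
print("submitted job:", response["jobId"])
```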
- the disclosure provides a method.
- the method comprises receiving an indication of a synchronization request.
- the method further comprises determining, based on the indication, one or more files stored in a staging location.
- the method further comprises generating, based on the one or more files, a data transfer filter.
- the method further comprises causing, based on the data transfer filter, transfer of the one or more files to object storage of a destination computing device.
- the method further comprises receiving, via a graphical user interface, a request to convert the one or more files from object storage to a distributed file system.
- the method further comprises receiving, via the graphical user interface, an indication of a storage size of the distributed file system.
- the method further comprises converting, based on the request and the indication, the one or more files from object storage to the distributed file system associated with the storage size.
- the method further comprises identifying a data analysis application program associated with the one or more files in the distributed file system.
- the method further comprises determining, as a program template, one or more job parameters associated with the data analysis application program processing the one or more files.
- the method further comprises causing, based on the program template, execution of the data analysis application program on the one or more files in the distributed file system.
- Figure 1 shows an example operating environment
- Figure 2A shows an example data pipeline
- Figure 2B shows an example operating environment
- Figure 3 shows an example operating environment
- Figure 4A shows an example operating environment
- Figure 4B shows an example cloud-based storage system
- Figure 5 shows an example graphical user interface
- Figure 6A shows an example graphical user interface
- Figure 6B shows an example graphical user interface
- Figure 7 shows an example graphical user interface
- Figure 8A shows an example program template
- Figure 8B shows an example operating environment
- Figure 8C shows an example operating environment
- Figure 9 shows an example operating environment
- Figure 10 shows an example operating environment
- Figure 11 shows an example operating environment
- Figure 12 shows an example method
- Figure 13 shows an example method
- Figure 14 shows an example method
- Figure 15 shows an example method
- the word “comprise” and variations of the word, such as “comprising” and “comprises,” mean “including but not limited to,” and are not intended to exclude, for example, other additives, components, integers, or steps.
- each step comprises what is listed (unless that step includes a limiting term such as “consisting of”), meaning that each step is not intended to exclude, for example, other additives, components, integers, or steps that are not listed in the step.
- Ranges may be expressed herein as from “about” one particular value, and/or to “about” another particular value. When such a range is expressed, also specifically contemplated and considered disclosed is the range from the one particular value and/or to the other particular value unless the context specifically indicates otherwise. Similarly, when values are expressed as approximations, by use of the antecedent “about,” it will be understood that the particular value forms another, specifically contemplated embodiment that should be considered disclosed unless the context specifically indicates otherwise. It will be further understood that the endpoints of each of the ranges are significant both in relation to the other endpoint, and independently of the other endpoint unless the context specifically indicates otherwise.
- a protein can be an antibody or fragment thereof.
- once a macromolecule, such as an antibody, is identified using the disclosed technology, the macromolecule can then be used in methods of treating, detecting, or diagnosing.
- an antibody identified using the disclosed technology can be administered to a subject to treat a disease, disorder, and/or condition of interest.
- a disease, disorder, and/or condition of interest can be cancer, viral infection (e.g., coronavirus, influenza virus), or inflammatory disorder (e.g. rheumatoid arthritis, lupus).
- the present disclosure provides a High Performance Computing (HPC) platform on the cloud.
- the methods and systems disclosed can provide end results (such as 3D models) in significantly shorter times than state of the art systems.
- a self-service storage management utility may enable users to manage the datasets being analyzed. For example, instead of creating one larger file system, a distributed file system (DFS) and/or parallel file system may be generated per dataset being analyzed. Such a distributed/parallel file system may reduce storage capacity requirements and/or cost.
- a data processing application that may be supported by the efficient, high-speed, big data transfer techniques disclosed herein includes 3D structure estimation from 2D electron cryo-microscopy images.
- a system 100 can comprise a data origin 102.
- the data origin 102 can be any type of data generating system, for example, an imaging system, a genetic sequencing system, combinations thereof, and the like.
- the data origin 102 may comprise, in an embodiment, one or more components that supply data.
- the component(s) may expose the data in numerous ways, according to one or several mechanisms.
- the component(s) may be embodied in, or may constitute, a computing device comprising one or several types of data storage.
- the data origin 102 may comprise a network file system (NFS), a server message block (SMB), a Hadoop Distributed File System (HDFS), and/or an on-premises object store.
- the data origin 102 may comprise an imaging system made up of one or more electron microscopes (e.g., cryogenic electron microscopy (Cryo-EM)).
- Cryo-EM is a computer vision-based approach to 3D macromolecular structure determination.
- Cryo-EM is applicable to medium-sized to large-sized molecules in their native state. This scope of applicability is in sharp contrast to X-ray crystallography, which requires a crystal of the target molecule that is often difficult (if not outright infeasible) to grow. Such a scope also is in sharp contrast to nuclear magnetic resonance (NMR) spectroscopy, which is limited to relatively small molecules.
- a purified solution of a target molecule is first cryogenically frozen into a thin (single molecule thick) film on a carbon grid, and then the resulting grid is imaged with a transmission electron microscope.
- the grid is exposed to a low-dose electron beam inside the microscope column, and 2D projections of the sample are collected using a camera (film, charge-coupled device (CCD) sensor, direct electron detector, or similar) at the base of the column.
- a large number of such projections are obtained, each of which provides a micrograph containing hundreds of visible, individual molecules.
- during a process referred to as “particle picking,” individual molecules are selected from the micrographs, resulting in a stack of cropped images of the molecule (referred to as “particle images”).
- each particle image provides a noisy view of the molecule with an unknown pose.
- the usefulness of a particular Cryo-EM reconstruction for a given target depends on the resolution that is achievable on that target.
- a high-resolution reconstruction can resolve fine detail including, in a particularly good case, atomic positions to be interpreted from the reconstruction.
- a low-resolution reconstruction may only depict large, globular features of a protein molecule rather than fine detail; thus, making it difficult to use the reconstruction in further chemistry or biological research pipelines.
- high resolution reconstructions of a target can be substantially advantageous. As an example, such high resolution reconstructions can yield extremely valuable insight into whether the target is well-suited for the application of a therapeutic (such as a drug).
- high resolution reconstructions can be used to understand the types of drug candidates that may be suitable for the target.
- high resolution reconstructions can even illuminate possible ways to optimize a drug candidate to improve its binding affinity and reduce off-target binding, thereby reducing the potential for unwanted side effects.
- Cryo-EM reconstruction approaches that can improve the resolution of a computationally reconstructed 3D result are therefore of high scientific and commercial value.
- Resolution in the context of Cryo-EM is generally measured and described in terms of a shortest resolvable wavelength of a 3D structural signal in a final 3D structure output of a structure refinement technique.
- the resolution is the shortest wavelength at which the 3D structural signal is correct and validatable.
- the wavelength is typically stated in units of ångströms (Å; one tenth of a nanometer). Smaller values for the wavelength indicate a higher resolution.
- a very high resolution Cryo-EM structure can have a resolution of approximately 2 Å, a medium resolution can be approximately 4 Å, and a low resolution can be in the range of about 8 Å or worse.
- interpretability and usefulness of a Cryo-EM reconstruction can depend on the quality of the 3D density map that is reconstructed and whether or not a qualified user can examine the 3D density map with their naked eye to identify critical features of the protein molecule; for example, backbone, side-chains, bound ligands, or the like. The ability of the user to identify these features with accuracy is highly dependent on the resolution quality of the 3D density map.
- the data origin 102 may be configured to generate data 104.
- the data 104 may comprise image data, such as image data defining 2D electron cryo-microscopy images, also referred to as particle images.
- the data 104 may comprise sequence data, in some cases.
- a computing device 106 may be in communication with the data origin 102.
- the computing device 106 may be, for example, a smartphone, a tablet, a laptop computer, a desktop computer, a server computer, or the like.
- the computing device 106 may include a group of one or more server devices.
- the computing device 106 may be configured to generate, store, maintain, and/or update various data structures including a database for storage of the data 104.
- the computing device 106 may be configured to operate one or more application programs, such as a data staging module 108, a data sync manager 110, and/or a data sync module 112.
- the data staging module 108, the data sync manager 110, and/or the data sync module 112 may be stored and/or configured to operate on the same computing device 106 or separately on separate computing devices.
- the computing device 106 may be configured, via the data staging module 108, to collect, retrieve, and/or receive the data 104 from the data origin 102 for storage in a storage system on the computing device 106 (or in a storage system functionally coupled to the computing device 106).
- the storage system may comprise one or multiple memory devices, and may be referred to as a staging location.
- the data staging module 108 may manage data stored in the storage system until such data is transferred out of that staging location. Once data has been transferred out of the staging location, the data staging module 108 may delete such data.
- the data staging module 108 may be configured to receive the data 104 through a variety of mechanisms.
- the staging location may be treated as a remote directory for the data origin 102 such that data 104 generated by the data origin 102 is saved directly into the staging location.
- the data staging module 108 may be configured to monitor one or more network storage locations to detect new data 104. Upon identifying new data 104 in a network storage location, the data staging module 108 may transfer the new data 104 to the staging location; a minimal sketch of such a monitoring loop is shown below. In yet another embodiment, the data staging module 108 may be configured to permit a user to manually upload data to the staging location.
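In the sketch below, the watched directory, the staging path, and the polling interval are assumptions for illustration, not part of the disclosure.

```python
import shutil
import time
from pathlib import Path

WATCHED = Path("/mnt/instrument-output")  # network storage location (assumed)
STAGING = Path("/staging")                # staging location (assumed)

def poll_for_new_data(interval_s: float = 30.0) -> None:
    """Move newly detected files from the network location into staging."""
    seen: set[str] = set()
    while True:
        for entry in WATCHED.iterdir():
            if entry.is_file() and entry.name not in seen:
                shutil.move(str(entry), str(STAGING / entry.name))
                seen.add(entry.name)
        time.sleep(interval_s)
```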
- the computing device 106 may be configured, via the data sync manager 110 and the data sync module 112, to transfer the data 104 from the staging location to a cloud platform 114.
- the computing device 106 may be configured, via the data sync manager 110 and the data sync module 112, to transfer the data 104 as the data 104 is received from the data origin 102.
- the system 100 represents an automated, end-to-end processing pipeline that enables the transport and processing of more than 1 TB/hour of raw data.
- the data 104 may be transferred in near real-time as the data 104 is acquired.
- the data sync module 112 may be a data synchronization application program configured to transport the data 104 to the cloud platform 114.
- the data synchronization application program may be any data synchronization program, including, for example, AWS DataSync.
- AWS DataSync is a native AWS service configured to transport large amounts of data between on-premises storage and Amazon native storage services.
- the on-premises storage can be the staging location present in the computing device 106 or functionally coupled thereto.
- because data synchronization application programs are “sync” utilities, such application programs do not function as a unidirectional copy utility.
- AWS DataSync executes four phases to transfer data: launching, preparing, transferring, and verifying.
- AWS DataSync examines the source (e.g., the computing device 106) and destination (e.g., the cloud platform 114) file systems to determine which files to sync. AWS DataSync does so by recursively scanning the contents and metadata of files on the source and destination file systems for differences.
- the time that AWS DataSync spends in the preparing phase depends on the number of files in both the source and destination file systems and for large data transfers can take several hours. As the size of the data 104 stored at either source or destination, or both, grows, the time AWS DataSync spends in the preparing phase increases. Currently, with an example data size of 500 TB on the destination (e.g., the cloud platform 114), the preparing phase takes upwards of 2 hours.
- the data origin 102 generates an extremely large amount of data 104.
- This extremely large amount of data needs to be made available on high- performance computing platforms, such as the cloud platform 114, as quickly as possible.
- Making the data 104 available faster provides a lead time for scientists to process and achieve results more quickly, directly impacting drug discovery timelines.
- the present state of existing data synchronization application programs greatly increases the time needed to transfer such data to high-performance computing platforms because of the time spent in scanning local and remote file systems prior to data transfer.
- the system 100 is configured to implement an improved data pipeline 201, as shown in FIG. 2A, that addresses the technological deficiencies of data synchronization application programs.
- the data pipeline 201 may comprise a multi-stage data transfer process to push the data 104 from the staging location on the computing device 106 (e.g., on-premises) to the cloud platform 114.
- the data 104 may be generated by the data origin 102.
- the data 104 may be stored at the staging location by the data staging module 108.
- the purpose of the data staging process 202 is to hold the data 104 and maintain the data 104 ready for transmission.
- the data 104 in the staging location may be deleted once the data 104 is moved to the data destination (e.g., the cloud platform 114).
- a sync condition 203 dictates when a data transfer process 204 may be initiated. Thus, satisfying the sync condition 203 may cause initiation of the data transfer process 204.
- the data transfer process 204 is initiated periodically, at a rate defined by a time interval that may be configurable. Thus, the sync condition 203 dictates that elapsed time since last data transfer must be equal to the time interval.
- the data sync manager 110 may be configured to determine the data 104 (e.g., identify files and/or directories) currently available at the staging location.
- the data sync manager 110 may fetch a list 205 of the data 104 currently available at the staging location.
- the data sync manager 110 may connect to the staging location and/or to any respective mount point/disk-volumes.
- the data sync manager 110 may then execute a list command to fetch a list of available files.
- the data sync manager 110 may be configured to utilize naming conventions when fetching a list of available files. For example, a scientific instrument may be configured to produce data with a defined naming convention.
- the data sync manager 110 may utilize Regular Expressions (RegEx) to include (or exclude) one or more files in the list.
- the data sync manager 110 may also rely on RegEx to validate the directories and/or files for inclusion on the list.
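As one possible sketch of this validation step, the list of available files can be filtered with include/exclude expressions; the naming convention shown is invented for illustration.

```python
import re

# Assumed instrument naming convention: "<instrument>_<dataset>_<index>.tif"
INCLUDE = re.compile(r"^cryoem_[A-Za-z0-9-]+_\d{4}\.tif$")
EXCLUDE = re.compile(r"\.(tmp|partial)$")  # skip in-flight files

def select_files(listing: list[str]) -> list[str]:
    """Validate listed files for inclusion on the transfer list."""
    return [name for name in listing
            if INCLUDE.match(name) and not EXCLUDE.search(name)]

print(select_files(["cryoem_ds42_0001.tif", "cryoem_ds42_0001.tif.partial"]))
```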
- the data sync manager 110 may be configured to use the list to generate a filter 206.
- the filter may comprise one or more of: a file name, a file location, a file extension, a file size, a checksum, a created date, a modified date, combinations thereof, and the like.
- Generating the filter 206 may comprise generating a message that invokes a function call to a cloud service (e.g., AWS DataSync), where the message passes the list of available files as an argument of the function call.
- the function call can initiate a task (or job) of the cloud service.
- the function call can be invoked according to an API implemented by the data storage service.
- the cloud service can be provided by one or more components of the cloud platform 114.
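Where the cloud service is AWS DataSync, the filter can be passed as an include pattern when starting the task execution, as in the hedged sketch below; the task ARN is a placeholder, and include patterns are '|'-separated paths relative to the source location.

```python
import boto3

datasync = boto3.client("datasync")

def start_filtered_sync(task_arn: str, staged_files: list[str]) -> str:
    """Start a DataSync task execution restricted to the staged files."""
    pattern = "|".join("/" + name for name in staged_files)
    response = datasync.start_task_execution(
        TaskArn=task_arn,
        Includes=[{"FilterType": "SIMPLE_PATTERN", "Value": pattern}],
    )
    return response["TaskExecutionArn"]
```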
- the filter may be generated dynamically in that the filter may be generated at each iteration of the data transfer process 204.
- the filter may include a reference to a partial file (e.g., a file that is not yet complete or is in the process of transfer to the staging location).
- on a subsequent iteration, once the file is complete, the filter will include the full file and the previously transferred partial file will be updated.
- the data sync manager 110 then triggers the data transfer process 204 according to the filter 206.
- the filter 206 causes the data transfer process 204 to transfer only those files and/or directories specified by the filter 206.
- the filter 206 thus represents only the data 104 that is newly present at the staging location.
- the data pipeline 201 represents an improvement in computer technology: the standard data transfer process would compare the data available at the staging location and at the cloud platform 114, determine all new and changed/updated files to transfer, and push the data to the cloud platform 114, significantly increasing the time to complete the data transfer process.
- the present dynamically generated filter causes the data transfer process 204 to scan only a limited set of data at the staging location and at the cloud platform 114 which significantly reduces the time required for completing the data transfer process 204.
- the filter 206 causes the prepare phase of the AWS DataSync task to only scan the files specified in the filter instead of all files, thus minimizing the prepare phase time.
- various synchronization policies can be generated and/or applied to determine data that is synchronized and data that is not synchronized.
- Synchronization policies may specify files to be synchronized based on selected criteria including data type, metadata, and location information (e.g., the electron microscopy equipment that generated the data).
- synchronization policies can be retained in one or more memory devices 250 (referred to as datastore 250) within one or more data structures 260 (referred to as policies 260).
- the datastore 250 can be integrated into the computing device 106 or can be functionally coupled thereto. In some cases, the datastore 250 can be part of the staging location.
- Synchronization policies can dictate the manner of generating the filter 206.
- a scientist can flag particular data to not be synchronized, even though the data is present in the staging location.
- a synchronization policy may dictate that data flagged in such a manner is to not be synchronized.
- the data sync manager 110 may be configured to use a list of one or more files and such a synchronization policy in order to generate an instance of the filter 206. Accordingly, that instance of the filter may be updated to include one or more flags (which may be referred to as exclusion flags) associated with respective files. Due to the exclusion flag(s), such file(s) are excluded from synchronization.
- Another synchronization policy can dictate the time-to-live period of an exclusion flag, where the time-to-live period defines a time interval during which the exclusion flag is active.
- the time-to-live (TTL) period causes data to be synchronized at some point in time, which avoids unnecessarily withholding data in the staging location.
- flags or metadata can be defined to control the manner in which an instance of the filter 206 is generated and applied in data synchronization. Some flags may automatically expire after a full dataset is loaded to the staging location to avoid partial synchronization, for example.
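One way such exclusion flags and TTL periods could be applied when building an instance of the filter 206 is sketched below; the policy record layout and the 24-hour TTL are assumptions for illustration.

```python
import time

TTL_SECONDS = 24 * 3600  # assumed time-to-live of an exclusion flag

# Hypothetical policy records: file name -> exclusion flag and flag time.
policies = {"dataset_a.mrc": {"exclude": True, "flagged_at": time.time()}}

def apply_policies(staged: list[str]) -> list[str]:
    """Withhold files whose exclusion flag is still active."""
    now = time.time()
    included = []
    for name in staged:
        record = policies.get(name)
        if record and record["exclude"] and now - record["flagged_at"] < TTL_SECONDS:
            continue  # flag active: exclude from this synchronization
        included.append(name)  # no flag, or flag expired: synchronize
    return included
```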
- FIG. 3 shows an example AWS architecture for implementing the data pipeline 201 of FIG. 2.
- Data is generated at data centers/laboratories at 301.
- Generated data may be staged in NetApp storage located in a local datacenter at 302.
- An AWS CloudWatch rule is configured to trigger a Lambda function at regular intervals (e.g., periodically, at a configurable rate or time interval) depending on the agreed SLA at 303.
- An invoked Lambda function may connect to the on-premises NetApp storage via NFS to fetch a list of available files at 304. Once the file list is available, the Lambda function may select valid datasets (based on the naming convention), which are passed as a filter to the triggered DataSync job at 305.
- A Lambda environment variable holds the ID of the DataSync job that the function has to trigger. The outcome of the Lambda execution (success/failure) is passed to an SNS topic at 306; another Lambda environment variable holds the SNS topic ARN. All success or failure messages are sent to subscribed emails at 307. The SNS subscription has a message attribute filter that picks up failures and additionally sends a text to admins at 308, so admins are notified of any failures instantaneously via text and can react quicker.
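A condensed, illustrative Lambda handler for steps 303-308 might look as follows; the mount path, environment variable names, and naming convention are assumptions, and connectivity from Lambda to the NFS share is presumed to be configured as described above.

```python
import os
import re
import boto3

datasync = boto3.client("datasync")
sns = boto3.client("sns")

DATASET = re.compile(r"^[A-Za-z0-9-]+_\d{8}$")  # assumed naming convention

def handler(event, context):
    # 304: list files on the NFS-mounted staging share (path assumed).
    names = [n for n in os.listdir("/mnt/staging") if DATASET.match(n)]
    try:
        # 305: pass the validated datasets as a filter to the DataSync job.
        datasync.start_task_execution(
            TaskArn=os.environ["DATASYNC_TASK_ARN"],
            Includes=[{"FilterType": "SIMPLE_PATTERN",
                       "Value": "|".join("/" + n for n in names)}],
        )
        status = "SUCCESS"
    except Exception as exc:
        status = f"FAILURE: {exc}"
    # 306-308: publish the outcome; a subscription filter on the message
    # attribute routes failures to an additional text notification.
    sns.publish(
        TopicArn=os.environ["SNS_TOPIC_ARN"],
        Message=status,
        MessageAttributes={"outcome": {"DataType": "String",
                                       "StringValue": status.split(":")[0]}},
    )
```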
- the example AWS architecture in FIG. 3 greatly reduces the prepare phase timing, as shown in Table 1. Making the data available for compute as quickly as possible is a key factor for faster drug discovery and analysis. With the improved data pipeline provided by the embodiments of this disclosure, data is available at compute significantly faster. In some cases, a speedup factor of about 4 can be achieved.
- the data 104 received by the cloud platform 114 may be stored in one or more types of storage (e.g., file systems).
- the cloud platform 114 may comprise a distributed parallel file system (e.g., Lustre) and/or an object based file system.
- the data 104 received by the cloud platform 114 may be stored in the distributed parallel file system or the object based file system.
- the data 104 received by the cloud platform 114 is initially stored in the object based file system and moved to the distributed parallel file system when the data 104 is to be processed (e.g., analyzed).
- a file system is a subsystem that an operating system or program uses to organize and keep track of files.
- File systems may be organized in different ways. For example, a hierarchical file system is one that uses directories to organize files into a tree structure.
- File systems provide the ability to search for one or more files stored within the file system. Often this is performed using a “directory” scan or search. In some operating systems, the search can include file versions, file names, and/or file extensions.
- third party file systems may be developed. These systems can interact smoothly with the operating system but provide more features, such as encryption, compression, file versioning, improved backup procedures, and stricter file protection.
- File systems are implemented over networks. Two common systems include the Network File System (NFS) and the Server Message Block (SMB, now CIFS) system.
- a file system implemented over a network takes a request from the operating system, converts the request into a network packet, transmits the packet to a remote server, and then processes the response.
- Other file systems are implemented as downloadable file systems, where the file system is packaged and delivered as a unit to the user.
- File systems share an abstracted interface upon which the user may perform operations. These operations include, but are not limited to: Mount/Unmount, Directory scan, Open(Create)/Close, Read/Write, Status, and the like. Associating a file system with an operating system is referred to as mounting.
- Mounting creates the virtual-layer binding of the file system to the operating system.
- a newly mounted file system is associated with a specific location in a hierarchical file tree. All requests to that portion of the file tree are passed to the mounted file system.
- Different operating systems impose restrictions on the number of file system mounts and on how deeply nested mounts can be.
- Un-mounting is the converse of mounting: a file system is disassociated from the operating system.
- analysis of the data 104 may be performed on a distributed computation and storage architecture, such as the cloud platform 114.
- the data origin 102 typically generates a significant amount of data 104 (e.g., data per experiment), it is not feasible to keep such data in Hot Storage 401 (e.g., a distributed parallel file system, solid-state drive (SSD), etc.) for a long period. Accordingly, the data 104 may be kept in Warm Storage 402 (e.g., object storage) instead of Hot Storage 401.
- the data 104 can be moved to the Hot Storage 401 via a self-service model using a Dataset Management (DSM) utility 116 as disclosed herein.
- the DSM utility 116 can permit or otherwise facilitate creation of a POSIX distributed filesystem by a user and retrieval of the appropriate datasets from Warm Storage 402 to Hot Storage 401.
- the POSIX file system may be attached to an HPC cluster (e.g., compute nodes 403) for processing.
- Lustre is a high-performance distributed file system and can act as a front end to S3 data and present S3 data in a POSIX based filesystem to the compute nodes 403.
- the disclosed DSM utility 116 provides on-demand provision of a cloud-based file system, for example.
- a Lustre file system is an example of the cloud-based file system that can be provided.
- a user may create a Lustre file system pointing to a dataset when running a job.
- the Lustre file system can serve as staging storage for the processing; once the job is complete, the user can sync the results back to S3 object storage and delete the Lustre file system using the DSM utility 116.
- the DSM utility 116 may create a new custom size distributed file system by targeting the datasets to be processed.
- the DSM utility 116 can mount the distributed file system on a HPC cluster (e.g., compute nodes 403) for staging the processed data.
- the DSM utility 116 can sync the modified datasets back to the S3 object store.
- the DSM utility 116 can enable viewing of the files available in the S3 object store.
- the DSM utility 116 can enable self-service data life cycle management. Typically, such functions require the assistance of technically trained users; the DSM utility 116, however, permits non-technical users to perform these tasks.
- FIG. 5 shows a graphical user interface 501 for the DSM utility 116.
- the graphical user interface 501 provides a user with the ability to create and manage file systems for distributed workloads.
- the graphical user interface 501 provides a menu of selectable options, comprising a first selectable option 502 (labeled “Create Lustre”) and a second selectable option 503 (labeled “Manage Lustre”).
- the first selectable option 502 permits a user to browse through a data store on S3 to view files and directories and to create a file system (e.g., a Lustre file system) from any location on S3.
- the second selectable option 503, once a Lustre file system is created, permits a user to mount the file system to view it from an operating system (O/S) level and access data within the file system.
- the second selectable option 503, once a Lustre file system is created, also permits a user to save data to S3 and, while working with the file system, to create new data or modify existing data in the file system. To make this data persistent even after deleting the file system, the user may export the data back to a data store (S3).
- the second selectable option 503, once a Lustre file system is created, further permits a user to view the status of export jobs. A user can switch between the export job status view and the file systems view.
- the graphical user interface 501 provides the contents of the data store, Warm Storage.
- the user can drill down into any of the directories to view subfolders by double clicking on the specific directory.
- the visual selectable element 602 (labeled “Previous Directory”) permits the user to go back by one step.
- the visual selectable element 603 (labeled “Refresh Dataset”) permits the user to go to the top level screen.
- the visual selectable element 603 may also serve as a refresh marking to fetch the latest data from the data store.
- the visual selectable element 604 (labeled “Load Dataset”) causes initiation of the creation of a Lustre file system.
- the graphical user interface 501 provides the user the ability to adjust the size of the Lustre file system.
- a Lustre file system may be created with 7.2 TB of storage space, which can be altered by moving the slider indicium 610 to the left (decrease) or right (increase) to change storage capacity.
- a menu of selectable options also is shown in FIG. 6B, comprising a first selectable option 605 (labeled “Proceed”) and a second selectable option 606 (labeled “Cancel”). Selecting the first selectable option 605 causes the Lustre file system to be created.
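By way of a hedged example, the “Proceed” action could map onto the Amazon FSx for Lustre API roughly as below; the subnet, bucket, and deployment type are placeholders, and the 7200 GiB capacity corresponds to the 7.2 TB default shown in FIG. 6B.

```python
import boto3

fsx = boto3.client("fsx")

# Create a Lustre file system sized per the slider and linked to the
# selected S3 dataset (bucket/prefix and subnet are illustrative).
response = fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=7200,  # GiB; adjusted by the slider indicium 610
    SubnetIds=["subnet-0123456789abcdef0"],
    LustreConfiguration={
        "DeploymentType": "SCRATCH_2",
        "ImportPath": "s3://warm-storage/datasets/experiment-042/",
        "ExportPath": "s3://warm-storage/datasets/experiment-042/",
    },
)
print("created:", response["FileSystem"]["FileSystemId"])
```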
- the graphical user interface 501 upon selecting the second selectable option 503 (“Manage Lustre”), provides an upper window 710a that displays all the file systems owned by the user and a lower window 710b that displays other file systems that are not owned by the user.
- the graphical user interface 501 shown in FIG. 7 also comprises a menu of selectable options, including a first selectable option 701 (labeled “Mount File System”), a second selectable option 702 (labeled “Save Data to S3”), a third selectable option 703 (labeled “Show Repo Tasks”), and a fourth selectable option 704 (labeled “Delete Lustre FSx”).
- the user needs to mount the file system on the O/S level to access files.
- To mount a file system the user may select the file system to be mounted and further select (e.g., click on) the first selectable option 701 (“Mount File System”).
- the data is to be saved to the data store (e.g., S3).
- the user may select the file system to be saved and further select (e.g., click on) the second selectable option 702 (“Save Dataset to S3”).
- the user may select the file system on which 'Save dataset S3' operation is being performed and further select (e.g., click on) the third selectable option 703 (“Show Repo Tasks”) to be shown a screen listing repository job status.
- the user may delete the file system to save costs.
- the user may select the file system to be deleted and further select (e.g., click on) the fourth selectable option 704 (“Delete Lustre FSx”). The latter selection can prompt the user to run “Save Dataset to S3” 702 before deleting the selected file system. Once confirmed, deletion of the file system may start.
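The “Save Dataset to S3”, “Show Repo Tasks”, and “Delete Lustre FSx” actions could correspond to FSx data repository task calls roughly as sketched below; this is a simplification, with error handling and export reports omitted.

```python
import time
import boto3

fsx = boto3.client("fsx")

def save_to_s3_then_delete(file_system_id: str) -> None:
    # "Save Dataset to S3": export modified data back to the linked bucket.
    task = fsx.create_data_repository_task(
        FileSystemId=file_system_id,
        Type="EXPORT_TO_REPOSITORY",
        Report={"Enabled": False},
    )
    task_id = task["DataRepositoryTask"]["TaskId"]
    # "Show Repo Tasks": poll the repository job status until it settles.
    while True:
        tasks = fsx.describe_data_repository_tasks(TaskIds=[task_id])
        lifecycle = tasks["DataRepositoryTasks"][0]["Lifecycle"]
        if lifecycle in ("SUCCEEDED", "FAILED", "CANCELED"):
            break
        time.sleep(30)
    # "Delete Lustre FSx": remove the file system once the export succeeded.
    if lifecycle == "SUCCEEDED":
        fsx.delete_file_system(FileSystemId=file_system_id)
```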
- FIG. 4B sets forth an example of a cloud-based storage system 418 of the cloud platform 114 in accordance with some embodiments of the present disclosure.
- the DSM utility 116 may be in communication with the cloud-based storage system 418 and, in an embodiment, may be embodied in one or more components shown in FIG. 4B (e.g., the storage controller application, software daemon, and the like).
- the cloud-based storage system 418 is created entirely in the cloud platform 114 such as, for example, Amazon Web Services (AWS)™, Microsoft Azure™, Google Cloud Platform™, IBM Cloud™, Oracle Cloud™, and others.
- the cloud computing instances 420, 422 may be embodied, for example, as instances of cloud computing resources (e.g., virtual machines) that may be provided by the cloud platform 114 to support the execution of software applications such as the storage controller application 424, 426.
- cloud computing instances 420, 422 may execute on an Azure VM, where each Azure VM may include high speed temporary storage that may be leveraged as a cache (e.g., as a read cache).
- the cloud computing instances 420, 422 may be embodied as Amazon Elastic Compute Cloud (‘EC2’) instances.
- an Amazon Machine Image (‘AMI’) that includes the storage controller application 424, 426 may be booted to create and configure a virtual machine that may execute the storage controller application 424, 426.
- the storage controller application 424, 426 may be embodied as a module of computer program instructions that, when executed, carries out various storage tasks.
- the storage controller application 424, 426 may be embodied as a module of computer program instructions that, when executed, carries out the same tasks associated with writing data to the cloud-based storage system 418, erasing data from the cloud-based storage system 418, retrieving data from the cloud- based storage system 418, monitoring and reporting of disk utilization and performance, performing redundancy operations, such as RAID or RAID-like data redundancy operations, compressing data, encrypting data, deduplicating data, and so forth.
- in embodiments that include cloud computing instances 420, 422 that each include the storage controller application 424, 426, one cloud computing instance 420 may operate as the primary controller as described above while the other cloud computing instance 422 may operate as the secondary controller as described above.
- the storage controller application 424, 426 depicted in FIG. 4B may include identical source code that is executed within different cloud computing instances 420, 422 such as distinct EC2 instances.
- each cloud computing instance 420, 422 may operate as a primary controller for some portion of the address space supported by the cloud-based storage system 418, each cloud computing instance 420, 422 may operate as a primary controller where the servicing of I/O operations directed to the cloud-based storage system 418 are divided in some other way, and so on.
- in embodiments where cost savings may be prioritized over performance demands, only a single cloud computing instance containing the storage controller application may exist.
- the cloud-based storage system 418 depicted in FIG. 4B includes cloud computing instances 440A, 440B, and 440n with local storage 430, 434, and 438.
- the cloud computing instances 440A, 440B, and 440n may be embodied, for example, as instances of cloud computing resources that may be provided by the cloud platform 114 to support the execution of software applications.
- the cloud computing instances 440A, 440B, and 440n of FIG. 4B may differ from the cloud computing instances 420, 422 described above in that the cloud computing instances 440A, 440B, and 440n of FIG. 4B include local storage resources.
- the cloud computing instances 440A, 440B, and 440n with local storage 430, 434, and 438 may be embodied, for example, as EC2 M5 instances that include one or more SSDs, as EC2 R5 instances that include one or more SSDs, as EC2 I3 instances that include one or more SSDs, and so on.
- the local storage 430, 434, and 438 may be embodied as solid-state storage (e.g., SSDs) rather than storage that makes use of hard disk drives.
- Hot storage 401 may include one or more of the local storage 430, 434, and 438.
- each of the cloud computing instances 440A, 440B, and 440n with local storage 430, 434, and 438 can include a software daemon 428, 432, 436 that, when executed by a cloud computing instance 440A, 440B, and 440n can present itself to the storage controller applications 424, 426 as if the cloud computing instance 440A, 440B, and 440n were a physical storage device (e.g., one or more SSDs).
- the software daemon 428, 432, 436 may include computer program instructions similar to those that would normally be contained on a storage device such that the storage controller applications 424, 426 can send and receive the same commands that a storage controller would send to storage devices.
- the storage controller applications 424, 426 may include code that is identical to (or substantially identical to) the code that would be executed by the controllers in the storage systems described above.
- communications between the storage controller applications 424, 426 and the cloud computing instances 440A, 440B, and 440n with local storage 430, 434, and 438 may utilize iSCSI, NVMe over TCP, messaging, a custom protocol, or some other mechanism.
- each of the cloud computing instances 440A, 440B, and 440n with local storage 430, 434, and 438 may also be coupled to block storage 442, 444, 446 that is offered by the cloud platform 114 such as, for example, as Amazon Elastic Block Store (‘EBS’) volumes.
- Hot storage 401 may include one or more of the block storage 442, 444, and 446.
- the block storage 442, 444, 446 that is offered by the cloud platform 114 may be utilized in a manner that is similar to how the NVRAM devices described above are utilized, as the software daemon 428, 432, 436 (or some other module) that is executing within a particular cloud computing instance 440A, 440B, and 440n may, upon receiving a request to write data, initiate a write of the data to its attached EBS volume as well as a write of the data to its local storage 430, 434, 438 resources. In some alternative embodiments, data may only be written to the local storage 430, 434, 438 resources within a particular cloud computing instance 440A, 440B, 440n.
- actual RAM on each of the cloud computing instances 440A, 440B, 440n with local storage 430, 434, 438 may be used as NVRAM, thereby decreasing network utilization costs that would be associated with using an EBS volume as the NVRAM.
- high performance block storage resources such as one or more Azure Ultra Disks may be utilized as the NVRAM.
- the storage controller applications 424, 426 may be used to perform various tasks such as deduplicating the data contained in the request, compressing the data contained in the request, determining where to write the data contained in the request, and so on, before ultimately sending a request to write a deduplicated, encrypted, or otherwise possibly updated version of the data to one or more of the cloud computing instances 440A, 440B, 440n with local storage 430, 434, 438.
- Either cloud computing instance 420, 422 may receive a request to read data from the cloud-based storage system 418 and may ultimately send a request to read data to one or more of the cloud computing instances 440A, 440B, 440n with local storage 430, 434, 438.
- the software daemon 428, 432, 436 may be configured to not only write the data to its own local storage 430, 434, 438 resources and any appropriate block storage 442, 444, 446 resources, but the software daemon 428, 432, 436 may also be configured to write the data to cloud object storage 448 that is attached to the particular cloud computing instance 440A, 440B, 440n.
- the cloud object storage 448 that is attached to the particular cloud computing instance 440A, 440B, 440n may be embodied, for example, as Amazon Simple Storage Service (‘S3’).
- the cloud computing instances 420, 422 that each include the storage controller application 424, 426 may initiate the storage of the data in the local storage 430, 434, 438 of the cloud computing instances 440A, 440B, 440n and the cloud object storage 448.
- a persistent storage layer may be implemented in other ways. For example, one or more Azure Ultra disks may be used to persistently store data (e.g., after the data has been written to the NVRAM layer).
- Warm storage 402 may include the cloud object storage 448.
- the DSM utility 116 may be in communication with the cloud object storage 448, the local storage (430, 434, 438), and/or the block storage (442, 444, and 446).
- the DSM utility 116 may be configured to permit or otherwise facilitate creation of a distributed file system by a user and retrieval of datasets from Warm Storage 402 to Hot Storage 401.
- the DSM utility 116 enables creation of a file system on the cloud object storage 448, the local storage (430, 434, 438), and/or the block storage (442, 444, and 446).
- the DSM utility 116 supports transfer of data sets from the cloud object storage 448 to/from the local storage (430, 434, 438) and/or the block storage (442, 444, and 446).
- the software daemon 428, 432, 436 may therefore be configured to take blocks of data, package those blocks into objects, and write the objects to the cloud object storage 448 that is attached to the particular cloud computing instance 440A, 440B, 440n.
- writing the data to the local storage 430, 434, 438 resources and the block storage 442, 444, 446 resources that are utilized by the cloud computing instances 440A, 440B, 440n is relatively straightforward: for example, five blocks that are 1 MB in size may be written directly to those resources.
- the software daemon 428, 432, 436 may also be configured to create five objects containing distinct 1 MB chunks of the data.
- each object that is written to the cloud object storage 448 may be identical (or nearly identical) in size.
- metadata that is associated with the data itself may be included in each object (e.g., the first 1 MB of the object is data and the remaining portion is metadata associated with the data).
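- A rough sketch of this packaging scheme follows; the bucket, key layout, and JSON metadata format are illustrative assumptions, with only the "1 MB of data per object, metadata appended after the data" shape taken from the description above:

```python
import json
import boto3

BLOCK_SIZE = 1 << 20  # 1 MB of data per object

s3 = boto3.client("s3")

def write_blocks_as_objects(data: bytes, bucket: str, key_prefix: str) -> None:
    """Package each 1 MB block into its own object, with the block's
    metadata appended after the data portion."""
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        metadata = json.dumps({"offset": i, "length": len(block)}).encode()
        s3.put_object(
            Bucket=bucket,
            Key=f"{key_prefix}/block-{i // BLOCK_SIZE:08d}",
            Body=block + metadata,  # data first, metadata after it
        )
```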
- the cloud object storage 448 may be incorporated into the cloud-based storage system 418 to increase the durability of the cloud-based storage system 418.
- all data that is stored by the cloud-based storage system 418 may be stored in both: 1) the cloud object storage 448, and 2) at least one of the local storage 430, 434, 438 resources or block storage 442, 444, 446 resources that are utilized by the cloud computing instances 440A, 440B, 440n.
- the local storage 430, 434, 438 resources and block storage 442, 444, 446 resources that are utilized by the cloud computing instances 440A, 440B, 440n may effectively operate as a cache that generally includes all data that is also stored in S3, such that all reads of data may be serviced by the cloud computing instances 440A, 440B, 440n without requiring the cloud computing instances 440A, 440B, 440n to access the cloud object storage 448.
- all data that is stored by the cloud-based storage system 418 may be stored in the cloud object storage 448, but less than all data that is stored by the cloud-based storage system 418 may be stored in at least one of the local storage 430, 434, 438 resources or block storage 442, 444, 446 resources that are utilized by the cloud computing instances 440A, 440B, 440n.
- various policies may be utilized to determine which subset of the data that is stored by the cloud-based storage system 418 should reside in both: 1) the cloud object storage 448, and 2) at least one of the local storage 430, 434, 438 resources or block storage 442, 444, 446 resources that are utilized by the cloud computing instances 440A, 440B, 440n.
- One or more modules of computer program instructions that are executing within the cloud-based storage system 418 may be designed to handle the failure of one or more of the cloud computing instances 440A, 440B, 440n with local storage 430, 434, 438.
- the monitoring module may handle the failure of one or more of the cloud computing instances 440A, 440B, 440n with local storage 430, 434, 438 by creating one or more new cloud computing instances with local storage, retrieving data that was stored on the failed cloud computing instances 440A, 440B, 440n from the cloud object storage 448, and storing the data retrieved from the cloud object storage 448 in local storage on the newly created cloud computing instances.
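- A minimal recovery sketch under the assumption of AWS and boto3 (the launch template, bucket layout, and local restore path are hypothetical):

```python
import boto3

ec2 = boto3.client("ec2")
s3 = boto3.client("s3")

def recover_failed_node(bucket: str, key_prefix: str, launch_template_id: str) -> str:
    """Create a replacement cloud computing instance with local storage and
    rehydrate the failed node's data from object storage."""
    resp = ec2.run_instances(
        MinCount=1,
        MaxCount=1,
        LaunchTemplate={"LaunchTemplateId": launch_template_id},
    )
    instance_id = resp["Instances"][0]["InstanceId"]
    # enumerate the failed node's objects; in a real system the data would be
    # copied onto the new instance's local storage rather than onto this host
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=key_prefix):
        for obj in page.get("Contents", []):
            name = obj["Key"].rsplit("/", 1)[-1]
            s3.download_file(bucket, obj["Key"], f"/restored/{name}")
    return instance_id
```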
- Various performance aspects of the cloud-based storage system 418 may be monitored (e.g., by a monitoring module that is executing in an EC2 instance) such that the cloud-based storage system 418 can be scaled-up or scaled-out as needed.
- a monitoring module may create a new, more powerful cloud computing instance (e.g., a cloud computing instance of a type that includes more processing power, more memory, and so on) that includes the storage controller application, such that the new, more powerful cloud computing instance can begin operating as the primary controller.
- the monitoring module may create a new, less powerful (and less expensive) cloud computing instance that includes the storage controller application such that the new, less powerful cloud computing instance can begin operating as the primary controller.
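- The scale-up/scale-down decision could be as simple as the following sketch; the CPU-utilization thresholds and the instance-type ladder are assumptions for illustration:

```python
def choose_controller_instance_type(avg_cpu: float, current_type: str) -> str:
    """Pick a more powerful controller type under sustained load and a less
    powerful (less expensive) type when the controller is underutilized."""
    ladder = ["m5.large", "m5.xlarge", "m5.2xlarge", "m5.4xlarge"]  # illustrative
    if current_type not in ladder:
        return current_type
    i = ladder.index(current_type)
    if avg_cpu > 0.80 and i < len(ladder) - 1:
        return ladder[i + 1]  # scale up: more processing power, more memory
    if avg_cpu < 0.20 and i > 0:
        return ladder[i - 1]  # scale down: cheaper instance as primary controller
    return current_type
```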
- the cloud platform 114 may comprise a plurality of compute nodes (not depicted in FIG. 1, for the sake of simplicity).
- the plurality of compute nodes communicate with the storage system of the cloud platform 114.
- the plurality of compute nodes may comprise respective processing devices of one or more processing platforms.
- the plurality of compute nodes may comprise respective virtual machines (VMs) each having a processor and a memory, although numerous other configurations are possible.
- the plurality of compute nodes may additionally or alternatively be part of cloud infrastructure, such as an Amazon Web Services (AWS) system.
- Other examples of cloud-based systems that can be used to provide compute nodes include Google Cloud Platform (GCP) and Microsoft Azure.
- the plurality of compute nodes illustratively provide compute services such as execution of one or more application programs on behalf of each of one or more users associated with respective ones of the plurality of compute nodes.
- the plurality of compute nodes can be configured for parallel computation.
- the cloud platform 114 may be part of a data analysis system.
- the cloud platform 114 may provide a 3D structure estimation service, a genetic data analysis service (e.g., GWAS, PheWAS, etc.), and the like.
- the cloud platform 114 may be configured to perform such data analysis via one or more data analysis modules 118.
- the data analysis module(s) 118 can be configured to leverage a computation module 120.
- the computation module 120 may be configured to generate a program template that may be used by at least one of the data analysis module(s) 118 to govern the execution of one or more processes/tasks, such as the use of GPU-based computing.
- the data analysis module(s) 118 may be configured to output a data analysis result, such as an estimated 3D structure of a target in a resultant 3D map (e.g., a 3D model).
- the cloud platform 114 may also comprise a remote display module 122.
- the remote display module 122 may comprise a high-performance remote display protocol configured to securely deliver remote desktops and application streaming to another computing device 124.
- the remote display module 122 may be configured as NICE DCV.
- the data analysis module 118 may be an application program configured to perform image reconstructions (e.g., a reconstruction module).
- the application program can be configured to execute a reconstruction technique to determine a likely molecular structure. Any known technique for determining the likely molecular structure may be used.
- the application program may comprise RELION.
- RELION is an open-source program configured to apply an empirical Bayesian approach, in which optimal Fourier filters for alignment and reconstruction are derived from data in a fully automated manner.
- the computation module 120 may be configured to determine one or more job parameters for the data analysis module 118.
- the one or more job parameters may be referred to as a program template.
- the program template may enable an application program to manage programs and/or jobs.
- the program template may enable an application program to leverage computational resources, including, for example, CPU processing time and/or GPU processing time.
- a program template may enable an application program (e.g., a reconstruction module) to determine a level of detail to be extracted from raw data 104 (e.g., raw image data files and/or raw video data files).
- the job parameters may comprise one or more of a number of Message Passing Interfaces (MPIs), a number of threads, a number of compute nodes, desired wall-clock time, combinations thereof, and the like.
- a particular configuration of job parameters constitutes a particular program template.
- a program template is defined by a number of MPIs, a number of threads, and a number of compute nodes.
- the computation module 120 may be configured to determine such job parameters for one or more portions of a given application program, to include for each of one or more given tasks or processes of the given application program.
- FIG. 8A shows examples of program templates.
- the program templates are identified by respective template names.
- a template name identifies a file that contains the program template; that is, the file that contains the one or more job parameters defining the program template.
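- A program template file can therefore be sketched as little more than a serialized set of job parameters; the JSON layout and field names below are assumptions:

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class ProgramTemplate:
    """A defined set of job parameters."""
    n_mpi: int      # number of Message Passing Interfaces (MPIs)
    n_threads: int  # number of threads
    n_nodes: int    # number of compute nodes

def save_template(template: ProgramTemplate, template_name: str) -> None:
    """The template name identifies the file containing the job parameters."""
    with open(f"{template_name}.json", "w") as f:
        json.dump(asdict(template), f, indent=2)
```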
- the computation module 120 may assume that the larger the number of MPIs and threads for a job, the more performance is gained (e.g., less time consumed for job completion). The computation module 120 may assume that disabling hyperthreaded cores may benefit performance.
- the computation module 120 may implement one or more parameters that specify a multi-GPU and multi-core infrastructure setup with hyperthreaded cores disabled.
- the computation module 120 may be configured to run one or more simulations in order to determine one or more job parameters defining a program template that is satisfactory (e.g., optimal or nearly optimal) for an application program or a task thereof.
- the computation module 120 may equate the number of MPIs to that of available GPU cards and the number of threads to that of available CPU cores on a node.
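- That heuristic might be expressed as the sketch below; whether MPI ranks are counted per node or in total is an assumption here:

```python
def default_job_parameters(gpu_cards_per_node: int, cpu_cores_per_node: int,
                           nodes: int) -> dict:
    """Starting configuration: one MPI rank per available GPU card, and
    threads equal to the available CPU cores on a node."""
    return {
        "n_mpi": gpu_cards_per_node * nodes,  # one MPI rank per GPU card
        "n_threads": cpu_cores_per_node,      # fill the CPU cores on each node
        "n_nodes": nodes,
    }
```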
- performance benchmarks may be run for a combination of multi-node jobs (e.g., 2, 4, 6, 12 node jobs, and the like).
- a multi-queue model for executing jobs on GPU-based versus CPU-based compute may use applications such as RELION and CryoSPARC to process images.
- a workflow may comprise a sequence of jobs (for example, 8 jobs) to run to complete image processing.
- the workflow may comprise a number of computationally light steps and a number of steps that demand significant resources (CPU vs. GPU). Having the compute nodes set to GPU-based processing for all workflow processing can be costly when handling jobs that only require CPU-based processing.
- a multi-queueing system may be implemented on a high-performance computing (HPC) cluster.
- An HPC cluster may comprise hundreds or thousands of compute servers that are networked together. Each server is called a node. The nodes in each cluster work in parallel with each other, boosting processing speed to deliver high performance computing.
- a queue may be configured to run with CPU-based compute instances and another queue may be configured to run with GPU-based compute instances. Users may have an option to choose the required queue to run a specific job and/or workflow.
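- Such routing can be sketched as a simple rule; the queue names and the mapping of job types to queues below are assumptions:

```python
def select_queue(job_type: str) -> str:
    """Route computationally light steps to a CPU queue and
    resource-intensive steps to a GPU queue."""
    gpu_job_types = {"2d_classification", "3d_classification", "3d_refinement"}
    return "gpu-queue" if job_type in gpu_job_types else "cpu-queue"

# e.g., a scheduler submission for a refinement step would target
# select_queue("3d_refinement") -> "gpu-queue"
```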
- RELION is an open-source software package configured to process Cryo-EM data and produce protein structure images. Execution of that software depends on various job parameters which determine how the software uses the underlying compute resources. Any misconfiguration of these job parameters leads to poor utilization of those resources, significantly increasing both the operational cost and the job run-time.
- the resource usage of Cryo-EM jobs of all job types in a cluster may be determined over time.
- the disclosed methods may manage the resources available in the cluster effectively to reduce the jobs runtime and cost associated with the compute and distributed storage.
- the disclosed methods may be applied in multiple phases of job execution.
- the disclosed methods may observe Cryo-EM jobs resource usage data over time and determine an optimized pattern in a template file for future use. That optimized pattern defines a program template — that is, a defined set of multiple job parameters. Such an optimized pattern may enable completion of jobs many times (e.g., six to eight times) faster by using fewer compute resources.
- a computing environment 800 may generate program templates, in accordance with aspects described herein.
- the computing environment 800 may include a job generation module 810 that can receive data 802.
- the data can be received from the data origin 102.
- the data 802 can be synthetic in that it may be generated by a computing device for the purpose of executing a simulated reconstruction.
- the job generation module 810 can generate jobs, or tasks associated with jobs, to reconstruct one or more targets.
- the job generation module 810 may select subsets of the data 802 and may generate or otherwise schedule a job directed to performing an abridged simulation (or reconstruction).
- a job generated in such a fashion may be sent to a template generator module 820 that may generate various configurations of job parameters. Such configurations can be referred to as job configurations. Each job configuration includes particular values of respective job parameters. Thus, such job configurations correspond to respective candidate program templates.
- the template generator module 820 may apply numerous strategies to generate job configurations. In some cases, the template generator module 820 may generate job configurations randomly. In other cases, the template generator module 820 may rely on a perturbative approach whereby the template generator module 820 generates variations of pre-existing configurations that have been used in production (or actual) reconstruction of targets.
- the template generator module 820 may send a job configuration to the computation module 120 for execution in the cloud platform 114 according to the job parameters defined in the job configuration.
- the template generator module 820 may collect or otherwise receive metrics indicative of the performance of executing a job using a particular job configuration. Numerous metrics can be collected. Examples of metrics include wall-clock time, GPU time, CPU time, number of I/O operations, execution cost, and the like. Values of the metrics that are collected serve as feedback on the fitness of a job configuration for a job.
- the template generator module 820 can iteratively generate job configurations for the job until a satisfactory performance has been achieved. To that end, the template generator module 820 may explore the space of job parameters using one of various optimization solvers, such as steepest descent, Monte Carlo simulations, genetic algorithm, or similar. A job configuration that results in a satisfactory performance (e.g., optimal performance) can determine satisfactory values of the job parameters. Such values define a program template.
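- One minimal realization of that search is the perturbative loop sketched below; the run_job callback, assumed to execute an abridged job and return its wall-clock time in seconds, is a hypothetical interface:

```python
import random

def perturb(config: dict) -> dict:
    """Generate a variation of a pre-existing job configuration."""
    varied = dict(config)
    key = random.choice(list(varied))
    varied[key] = max(1, varied[key] + random.choice([-1, 1]))
    return varied

def search_template(seed: dict, run_job, iterations: int = 50) -> dict:
    """Iteratively generate job configurations, using the measured
    wall-clock time returned by run_job as fitness feedback."""
    best, best_time = seed, run_job(seed)
    for _ in range(iterations):
        candidate = perturb(best)
        elapsed = run_job(candidate)
        if elapsed < best_time:
            best, best_time = candidate, elapsed
    return best  # the winning values define a program template
```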
- the data analysis module 118 may execute one or more jobs according to the program template in order to analyze data.
- the computation module 120 may select compute nodes within the cloud platform 114 to execute a computing job or a task that is part of the computing job.
- the selected compute nodes can be part of the compute nodes 403 (FIG. 4).
- the computation module 120 includes an interface module 850 that may receive a program template 844 and data 846 defining the job.
- That program template 844 specifies a set of multiple job parameters and serves as a condition for the selection of compute nodes within the cloud platform 114.
- the program template can specify n MPIs, m threads, and q compute nodes for a task (e.g., a reconstruction task) to be executed.
- the cloud platform 114 can include multiple sets of q compute nodes that can be selected to execute the task. Additionally, at least some of the compute nodes may have respective processors, each having multiple cores that may support the m threads. Similarly, other compute nodes may support, for example, the n MPIs. Accordingly, the cloud platform 114 may support multiple arrangements, or allocations, consistent with the program template.
- the computation module 120 includes a selection module 860 that can evaluate a candidate arrangement consistent with the program template.
- the evaluation component 864 may determine respective performance metrics of respective workloads on respective compute nodes that form the candidate arrangement.
- the respective workloads may include the computing job defined by the data 846.
- the computing device 106 (FIG. 1) may request the computing job.
- the evaluation component 864 may determine the respective performance metrics based on respective measured performance data of compute nodes in a candidate arrangement.
- the computation module 120 may obtain the measured performance data from one or more components within the cloud platform 114.
- the measured performance data can include, e.g., present usage or supply of one or more resources, or other data.
- the measured performance data can also include or be based on processed data, e.g., values derived from the measured data such as statistics of the measured data. For example, the average CPU usage and/or average GPU usage on a compute node can be included in the measured performance data for the nodes in the candidate arrangement.
- the selection module 860 can include a configuration component 868 that can traverse a set of multiple candidate arrangements, evaluating each (or, in some cases, at least some) candidate arrangement. That traversal can result in multiple fitness scores for respective candidate arrangements.
- the configuration component 868 can rank the multiple candidate arrangements according to fitness score and can then select a highest-ranked or high-ranked one of the candidate arrangements as a node arrangement 850 to be utilized to execute the computing job defined by the data 846.
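- The traversal, scoring, and ranking can be sketched as follows; the fitness function shown, which rewards spare average CPU and GPU capacity, is an illustrative assumption:

```python
def fitness(node_metrics: list) -> float:
    """Illustrative fitness score for one candidate arrangement: prefer
    nodes reporting more spare CPU and GPU capacity on average."""
    spare = [(1.0 - m["avg_cpu"]) + (1.0 - m["avg_gpu"]) for m in node_metrics]
    return sum(spare) / len(spare)

def select_arrangement(candidate_arrangements: list) -> list:
    """Traverse the candidate arrangements, score each, and select the
    highest-ranked one as the node arrangement for the computing job."""
    return max(candidate_arrangements, key=fitness)
```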
- the data analysis module 118 may store the results of any data analysis in a file system of the cloud platform 114 and/or may provide the results back to the computing device 106.
- the DSM utility 116 may be used to save the results of the data analysis from the file system to a data store and delete the file system.
- FIG. 9 and FIG. 10 show an example system and method wherein data may be generated via electron microscopy and cached in a respective support computing device.
- Multiple electron microscopes can generate imaging data as part of respective electron microscopy experiments.
- Support computing devices functionally coupled to respective ones of the electron microscopes can obtain and cache imaging data.
- the imaging data from a support computing device may be pushed to a local staging area. On a schedule (e.g., hourly, daily, at defined times, etc.), imaging data from the staging area may be pushed into a storage system, such as cloud-based storage (e.g., AWS S3).
- Separate scheduled data sync tasks may keep pushing data into respective datastore buckets (e.g., S3 buckets).
- Imaging data can be viewed from a storage gateway. Scheduled auto cache-refresh may be used. Datasets required for processing may be mounted on to master/compute nodes via the DSM utility, and the storage used may be distributed and/or parallel (e.g., FSx for Lustre).
- FIG. 11 is a block diagram depicting an environment 1100 comprising non-limiting examples of the computing device 106 and the cloud platform 114 connected through a network 1104.
- the computing device 106 can comprise one or multiple computers configured to store one or more of the data 104, the data sync manager 110, and/or the data sync module 112.
- the cloud platform 114 can comprise a high-throughput storage system 1106 configured to store the data 104, the DSM utility 116, the data analysis module(s) 118, the computation module 120, the remote display module 122, and/or one or more compute nodes 1108 configured to process the data 104.
- the cloud platform 114 can communicate with the computing device 106 via the network 1104.
- the computing device 106 and the cloud platform 114 can be one or more digital computers that, in terms of hardware architecture, generally include a processor 1110, a memory system 1112, input/output (I/O) interfaces 1114, and network interfaces 1116. These components (1110, 1112, 1114, and 1116) are communicatively coupled via a local interface 1118.
- the local interface 1118 can be, for example, but not limited to, one or more buses or other wired or wireless connections, as is known in the art.
- the local interface 1118 can have additional elements, which are omitted for simplicity, such as controllers, buffers (caches), drivers, repeaters, and receivers, to enable communications. Further, the local interface may include address, control, and/or data connections to enable appropriate communications among the aforementioned components.
- the processor 1110 can be one or more hardware devices for executing software, particularly that stored in memory system 1112.
- the processor 1110 can be any custom made or commercially available processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the computing device 106 and the cloud platform 114, a semiconductor-based microprocessor (in the form of a microchip or chip set), or generally any device for executing software instructions.
- the processor 1110 can be configured to execute software stored within the memory system 1112, to communicate data to and from the memory system 1112, and to generally control operations of the computing device 106 and the cloud platform 114 pursuant to the software.
- the I/O interfaces 1114 can be used to receive user input from, and/or for providing system output to, one or more devices or components.
- User input can be provided via, for example, a keyboard and/or a mouse.
- System output can be provided via a display device and a printer (not shown).
- I/O interfaces 1114 can include, for example, a serial port, a parallel port, a Small Computer System Interface (SCSI), an infrared (IR) interface, a radio frequency (RF) interface, and/or a universal serial bus (USB) interface.
- the network interface 1116 can be used to transmit and receive from the computing device 106 and/or the cloud platform 114 on the network 1104.
- the network interface 1116 may include, for example, a 10BaseT Ethernet Adaptor, a 100BaseT Ethernet Adaptor, a LAN PHY Ethernet Adaptor, a Token Ring Adaptor, a wireless network adapter (e.g., WiFi, cellular, satellite), or any other suitable network interface device.
- the network interface 1116 may include address, control, and/or data connections to enable appropriate communications on the network 1104.
- the memory system 1112 can include any one or combination of volatile memory elements (e.g., random access memory (RAM, such as DRAM, SRAM, SDRAM, etc.)) and nonvolatile memory elements (e.g., ROM, hard drive, tape, CDROM, DVDROM, etc.). Moreover, the memory system 1112 may incorporate electronic, magnetic, optical, and/or other types of storage media. Note that the memory system 1112 can have a distributed architecture, where various components are situated remote from one another, but can be accessed by the processor 1110.
- the software in memory system 1112 may include one or more software programs, each of which comprises an ordered listing of executable instructions for implementing logical functions.
- the software in the memory system 1112 of the computing device 106 can comprise the data 104, the data staging module 108, the data sync manager 110, the data sync module 112, the policies 260, a suitable operating system (O/S) 1120, and/or any other modules (for example modules disclosed in FIG. 1).
- the software in the high-throughput storage system 1106 of the cloud platform 114 can comprise the data 104, the DSM utility 116, the data analysis module(s) 118, the computation module 120, the remote display module 122, a suitable operating system (O/S) 1120, and/or any other modules (for example modules disclosed in FIG. 1).
- the operating system 1120 essentially controls the execution of other computer programs and provides scheduling, input-output control, file and data management, memory management, and communication control and related services.
- application programs and other executable program components such as the operating system 1120 are illustrated herein as discrete blocks, although it is recognized that such programs and components can reside at various times in different storage components of the computing device 106 and/or the cloud platform 114.
- An implementation of the data sync manager 110, the data sync module 112, the DSM utility 116, the data analysis module(s) 118, the computation module 120, and/or the remote display module 122 can be stored on or transmitted across some form of computer readable media. Any of the disclosed methods can be performed by computer readable instructions embodied on computer readable media.
- Computer readable media can be any available media that can be accessed by a computer.
- Computer readable media can comprise “computer storage media” and “communications media.”
- “Computer storage media” can comprise volatile and non-volatile, removable and non-removable media implemented in any methods or technology for storage of information such as computer readable instructions, data structures, program modules, or other data.
- Exemplary computer storage media can comprise RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
- the data sync manager 110 and/or the data sync module 112 may be configured to perform an example method 1200, shown in FIG. 12.
- the example method 1200 may be performed in whole or in part by a single computing device, a plurality of electronic devices, and the like.
- the example method 1200 may comprise, at block 1210, receiving an indication of a synchronization request. Receiving the indication of the synchronization request may be based on a synchronization condition. In some cases, the synchronization condition is a time interval.
- the indication comprises payload data conveying that data synchronization is to be implemented. In some cases, the indication may be embodied in a message invoking a function call to a data storage service, for example.
- the example method 1200 may comprise determining, based on the indication, one or more files stored in a staging location.
- Various types of files may be determined.
- the one or more files may comprise sequence data, particle images, or a combination of sequence data and particle image(s).
- the example method 1200 may comprise generating, based on the one or more files, a data transfer filter.
- Generating the data transfer filter may comprise generating a message that invokes a function call to a cloud service (e.g., AWS DataSync), where the message passes the list of available files as an argument of the function call.
- the function call can initiate a task (or job) of the cloud service.
- the function call can be invoked according to an API implemented by the data storage service.
- the data transfer filter comprises a list of the one or more files stored in the staging location.
- the example method 1200 may comprise causing, based on the data transfer filter, transfer of the one or more files to a destination computing device.
- Causing such a transfer based on the data transfer filter may comprise causing a data synchronization application program to scan the staging location and the destination computing device only for the one or more files.
- the example method 1200 may comprise receiving, from a data origin device, the one or more files.
- the data origin device may comprise one or more of a sequencer or an electron microscope.
- the example method 1200 may comprise deleting, based on the transfer of the one or more files to the destination computing device, the one or more files from the staging location.
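- Assuming the cloud service is AWS DataSync (named above only as an example) and a pre-existing transfer task, the whole of example method 1200 might be sketched as follows; the task ARN and staging layout are hypothetical:

```python
import os
import boto3

datasync = boto3.client("datasync")

def sync_staging(staging_dir: str, task_arn: str) -> None:
    """Determine the files in the staging location, build a data transfer
    filter listing exactly those files, and start the transfer."""
    files = [name for name in os.listdir(staging_dir)
             if os.path.isfile(os.path.join(staging_dir, name))]
    if not files:
        return
    # DataSync include filters are pipe-delimited patterns relative to the source
    include_pattern = "|".join(f"/{name}" for name in files)
    datasync.start_task_execution(
        TaskArn=task_arn,
        Includes=[{"FilterType": "SIMPLE_PATTERN", "Value": include_pattern}],
    )
    # once the transfer is confirmed complete (polling omitted here), the
    # transferred files may be deleted from the staging location:
    # for name in files:
    #     os.remove(os.path.join(staging_dir, name))
```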
- the DSM utility 116 may be configured to perform an example method 1300, shown in FIG. 13.
- the method 1300 may be performed in whole or in part by a single computing device, a plurality of electronic devices, and the like.
- the example method 1300 may comprise, at block 1310, receiving, via a graphical user interface, a request to convert a dataset from object storage to a distributed file system.
- the example method 1300 may comprise receiving, via the graphical user interface, an indication of a storage size of the distributed file system.
- the example method 1300 may comprise converting, based on the request and the indication, the dataset from object storage to the distributed file system associated with the storage size.
- the example method 1300 may comprise receiving a request to perform an operation involving the distributed file system.
- the example method 1300, at block 1350, may comprise performing the operation.
- the operation can be one or many operations involving the distributed file system.
- the example method 1300 comprises receiving, via the graphical user interface, a request to mount the distributed file system. Additionally, at block 1350, the example method 1300 comprises mounting the distributed file system.
- the example method 1300 comprises receiving, via the graphical user interface, a request to save data in the distributed file system into the object storage. Additionally, at block 1350, the example method 1300 comprises saving the data in the distributed file system into the object storage.
- the example method 1300 comprises receiving, via the graphical user interface, a request to delete the distributed file system. Additionally, at block 1350, the example method 1300 comprises deleting the distributed file system.
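- Assuming the distributed file system is Amazon FSx for Lustre (mentioned earlier only as an example), the conversion, save, and delete operations of example method 1300 might be sketched with boto3 as follows; the bucket, subnet, and deployment type are illustrative:

```python
import boto3

fsx = boto3.client("fsx")

def create_dfs_from_object_storage(bucket: str, size_gib: int, subnet_id: str) -> str:
    """Create a distributed file system of the requested storage size,
    linked to a dataset held in object storage."""
    resp = fsx.create_file_system(
        FileSystemType="LUSTRE",
        StorageCapacity=size_gib,  # e.g., 1200 GiB
        SubnetIds=[subnet_id],
        LustreConfiguration={
            "ImportPath": f"s3://{bucket}",  # convert the dataset from object storage
            "ExportPath": f"s3://{bucket}",  # allow saving results back
            "DeploymentType": "SCRATCH_2",
        },
    )
    return resp["FileSystem"]["FileSystemId"]

def save_and_delete(file_system_id: str) -> None:
    """Save the file system's data back into object storage, then delete it."""
    fsx.create_data_repository_task(
        FileSystemId=file_system_id,
        Type="EXPORT_TO_REPOSITORY",
        Report={"Enabled": False},
    )
    # in practice one would wait for the export task to finish before deleting
    fsx.delete_file_system(FileSystemId=file_system_id)
```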
- the data analysis module(s) 118 and/or the computation module 120 may be configured to perform a method 1400, shown in FIG. 14.
- the method 1400 may be performed in whole or in part by a single computing device, a plurality of electronic devices, and the like.
- the method 1400 may comprise, at block 1410, identifying a data analysis application program.
- the example method 1400 may comprise identifying a dataset associated with the data analysis application program.
- the example method 1400 may comprise determining, as a program template, one or more job parameters associated with the data analysis application program processing the dataset.
- the one or more job parameters may comprise one or more of a number of Message Passing Interfaces (MPIs), a number of threads, or a number of compute nodes.
- the example method 1400 may comprise causing, based on the program template, execution of the data analysis application program on the dataset.
- the example method 1400 may comprise determining a plurality of tasks executable by the data analysis application program.
- the data sync manager 110, the data sync module 112, the DSM utility 116, the data analysis module(s) 118, and/or the computation module 120 may be configured to perform an example method 1500, shown in FIG. 15.
- the example method 1500 may be performed in whole or in part by a single computing device, a plurality of electronic devices, and the like.
- the example method 1500 may comprise, at block 1510, receiving an indication of a synchronization request.
- the indication comprises payload data conveying that data synchronization is to be implemented.
- the indication may be embodied in a message invoking a function call to a data storage service, for example.
- Receiving the indication of the synchronization request may be based on a synchronization condition.
- the synchronization condition is a time interval.
- the example method 1500, at block 1520 may comprise determining, based on the indication, one or more files stored in a staging location.
- the example method 1500 may comprise generating, based on the one or more files, a data transfer filter.
- the example method 1500 may comprise causing, based on the data transfer filter, transfer of the one or more files to object storage of a destination computing device.
- the example method 1500 may comprise receiving, via a graphical user interface, a request to convert the one or more files from object storage to a distributed file system.
- the example method 1500 may comprise receiving, via the graphical user interface, an indication of a storage size of the distributed file system.
- the example method 1500 may comprise converting, based on the request and the indication, the one or more files from object storage to the distributed file system associated with the storage size.
- the example method 1500 may comprise identifying a data analysis application program associated with the one or more files in the distributed file system.
- the example method 1500, at block 1590 may comprise determining, as a program template, one or more job parameters associated with the data analysis application program processing the dataset.
- the example method 1500 may comprise causing, based on the program template, execution of the data analysis application program on the one or more files in the distributed file system.
- an Example 1 of those embodiments includes a method comprising receiving an indication of a synchronization request; determining, based on the indication, one or more files stored in a staging location; generating, based on the one or more files, a data transfer filter; and causing, based on the data transfer filter, transfer of the one or more files to a destination computing device.
- An Example 2 of the numerous embodiments comprises the method of Example 1, where receiving the indication of the synchronization request is based on a synchronization condition.
- An Example 3 of the numerous embodiments comprises the method of Example 2, where the synchronization condition is a time interval.
- Example 4 of the numerous embodiments comprises the method of Example 1, where the data transfer filter comprises a list of the one or more files stored in the staging location.
- An Example 5 of the numerous embodiments comprises the method of Example 1, wherein generating, based on the one or more files, the data transfer filter comprises generating a message that invokes a function call to a cloud service, wherein the message passes one or more parameters identifying the one or more files as an argument of the function call.
- An Example 6 of the numerous embodiments comprises the method of Example 1, where causing, based on the data transfer filter, transfer of the one or more files to the destination computing device comprises causing a data synchronization application program to scan the staging location and the destination computing device only for the one or more files.
- An Example 7 of the numerous embodiments comprises the method of Example 1 and further comprises receiving, from a data origin device, the one or more files.
- An Example 8 of the numerous embodiments comprises the method of Example 7, where the data origin device comprises one or more of a sequencer or an electron microscope.
- An Example 9 of the numerous embodiments comprises the method of Example 8, where the one or more files comprise sequence data, particle images, or both.
- An Example 10 of the numerous embodiments comprises the method of Example 1 and further comprises deleting, based on the transfer of the one or more files to the destination computing device, the one or more files from the staging location.
- An Example 11 of those other numerous embodiments includes a method comprising receiving, via a graphical user interface, a request to convert a dataset from object storage to a distributed file system; receiving, via the graphical user interface, an indication of a storage size of the distributed file system; and converting, based on the request and the indication, the dataset from object storage to the distributed file system associated with the storage size.
- An Example 12 of the numerous embodiments comprises the method of Example 11 and further comprises receiving, via the graphical user interface, a request to mount the distributed file system; and mounting the distributed file system.
- An Example 13 of the numerous embodiments comprises the method of Example 11 and further comprises receiving, via the graphical user interface, a request to save data in the distributed file system into the object storage; and saving the data in the distributed file system into the object storage.
- An Example 14 of the numerous embodiments comprises the method of Example 11 and further comprises receiving, via the graphical user interface, a request to delete the distributed file system; and deleting the distributed file system.
- An Example 15 of the numerous embodiments includes a method comprising identifying a data analysis application program; identifying a dataset associated with the data analysis application program; determining, as a program template, one or more job parameters associated with the data analysis application program processing the dataset; and causing, based on the program template, execution of the data analysis application program on the dataset.
- An Example 16 of the numerous embodiments comprises the method of Example 15, where the one or more job parameters comprise one or more of: a number of Message Passing Interfaces (MPIs), a number of threads, or a number of compute nodes.
- An Example 17 of the numerous embodiments comprises the method of Example 15 and further comprises determining a plurality of tasks executable by the data analysis application program.
- Example 18 of the numerous embodiments comprises the method of Example 17, where determining the one or more job parameters associated with the data analysis application program processing the dataset comprises determining one or more job parameters for each task of the plurality of tasks.
- An Example 19 of the numerous embodiments includes a method comprising receiving an indication of a synchronization request; determining, based on the indication, one or more files stored in a staging location; generating, based on the one or more files, a data transfer filter; causing, based on the data transfer filter, transfer of the one or more files to object storage of a destination computing device; receiving, via a graphical user interface, a request to convert the one or more files from object storage to a distributed file system; receiving, via the graphical user interface, an indication of a storage size of the distributed file system; converting, based on the request and the indication, the one or more files from object storage to the distributed file system associated with the storage size; identifying a data analysis application program associated with the one or more files in the distributed file system; determining, as a program template, one or more job parameters associated with the data analysis application program processing the dataset; and causing, based on the program template, execution of the data analysis application program on the one or more files in the distributed file system.
- Example 20 of the numerous embodiments includes a computing system comprising at least one processor; and at least one memory device having processor-executable instructions stored thereon that, in response to execution by the at least one processor, cause the computing system to: receive an indication of a synchronization request; determine, based on the indication, one or more files stored in a staging location; generate, based on the one or more files, a data transfer filter; and cause, based on the data transfer filter, transfer of the one or more files to a destination computing device.
- Example 21 of the numerous embodiments comprises the method of Example 20, where receiving the indication of the synchronization request is based on a synchronization condition.
- Example 22 of the numerous embodiments comprises the method of Example 21, where the synchronization condition is a time interval.
- Example 23 of the numerous embodiments comprises the method of Example 20, where the data transfer filter comprises a list of the one or more files stored in the staging location.
- An Example 24 of the numerous embodiments comprises the method of Example 20, where generating, based on the one or more files, the data transfer filter comprises generating a message that invokes a function call to a cloud service, wherein the message passes one or more parameters identifying the one or more files as an argument of the function call.
- An Example 25 of the numerous embodiments comprises the method of Example 20, where causing, based on the data transfer filter, transfer of the one or more files to the destination computing device comprises causing a data synchronization application program to scan the staging location and the destination computing device only for the one or more files.
- An Example 26 of the numerous embodiments comprises the method of Example 20, the at least one memory device having further processor-executable instructions stored thereon that in response to execution by the at least one processor further cause the computing system to receive, from a data origin device, the one or more files.
- Example 27 of the numerous embodiments comprises the method of Example 26, where the data origin device comprises one or more of a sequencer or an electron microscope.
- Example 28 of the numerous embodiments comprises the method of Example 27, where the one or more files comprise sequence data, particle images, or both.
- An Example 29 of the numerous embodiments comprises the method of Example 20, the at least one memory device having further processor-executable instructions stored thereon that in response to execution by the at least one processor further cause the computing system to delete, based on the transfer of the one or more files to the destination computing device, the one or more files from the staging location.
- An Example 30 of the numerous embodiments includes a computing system comprising at least one processor; and at least one memory device having processor-executable instructions stored thereon that, in response to execution by the at least one processor, cause the computing system to: receive, via a graphical user interface, a request to convert a dataset from object storage to a distributed file system; receive, via the graphical user interface, an indication of a storage size of the distributed file system; and convert, based on the request and the indication, the dataset from object storage to the distributed file system associated with the storage size.
- An Example 31 of the numerous embodiments comprises the computing system of Example 30, the at least one memory device having further processor-executable instructions stored thereon that in response to execution by the at least one processor further cause the computing system to: receive, via the graphical user interface, a request to mount the distributed file system; and mount the distributed file system.
- An Example 32 of the numerous embodiments comprises the computing system of Example 30, the at least one memory device having further processor-executable instructions stored thereon that in response to execution by the at least one processor further cause the computing system to: receive, via the graphical user interface, a request to save data in the distributed file system into the object storage; and save the data in the distributed file system into the object storage.
- An Example 33 of the numerous embodiments comprises the computing system of Example 30, the at least one memory device having further processor-executable instructions stored thereon that in response to execution by the at least one processor further cause the computing system to: receive, via the graphical user interface, a request to delete the distributed file system; and delete the distributed file system.
- An Example 34 of the numerous embodiments includes a computing system comprising at least one processor; and at least one memory device having processor-executable instructions stored thereon that, in response to execution by the at least one processor, cause the computing system to: identify a data analysis application program; identify a dataset associated with the data analysis application program; determine, as a program template, one or more job parameters associated with the data analysis application program processing the dataset; and cause, based on the program template, execution of the data analysis application program on the dataset.
- An Example 35 of the numerous embodiments comprises the computing system of Example 34, where the one or more job parameters comprise one or more of: a number of Message Passing Interfaces (MPIs), a number of threads, or a number of compute nodes.
- An Example 36 of the numerous embodiments comprises the computing system of Example 34, the at least one memory device having further processor-executable instructions stored thereon that in response to execution by the at least one processor further cause the computing system to determine a plurality of tasks executable by the data analysis application program.
- An Example 37 of the numerous embodiments comprises the computing system of Example 36, where determining the one or more job parameters associated with the data analysis application program processing the dataset comprises determining one or more job parameters for each task of the plurality of tasks.
- An Example 38 of the numerous embodiments includes an apparatus comprising at least one processor; and at least one memory device having processor-executable instructions stored thereon that, in response to execution by the at least one processor, cause the computing system to: receive an indication of a synchronization request; determine, based on the indication, one or more files stored in a staging location; generate, based on the one or more files, a data transfer filter; and cause, based on the data transfer filter, transfer of the one or more files to a destination computing device.
- An Example 39 of the numerous embodiments comprises the apparatus of Example 38, where receiving the indication of the synchronization request is based on a synchronization condition.
- An Example 40 of the numerous embodiments comprises the apparatus of Example 39, where the synchronization condition is a time interval.
- An Example 41 of the numerous embodiments comprises the apparatus of Example 38, where the data transfer filter comprises a list of the one or more files stored in the staging location.
- An Example 42 of the numerous embodiments comprises the apparatus of Example 38, where generating, based on the one or more files, the data transfer filter comprises generating a message that invokes a function call to a cloud service, wherein the message passes one or more parameters identifying the one or more files as an argument of the function call.
- An Example 43 of the numerous embodiments comprises the apparatus of Example 38, where causing, based on the data transfer filter, transfer of the one or more files to the destination computing device comprises causing a data synchronization application program to scan the staging location and the destination computing device only for the one or more files.
- An Example 44 of the numerous embodiments comprises the apparatus of Example 38, the at least one memory device having further processor-executable instructions stored thereon that in response to execution by the at least one processor further cause the computing system to receive, from a data origin device, the one or more files.
- Example 45 of the numerous embodiments comprises the apparatus of Example 44, where the data origin device comprises one or more of a sequencer or an electron microscope.
- An Example 46 of the numerous embodiments comprises the apparatus of Example 45, where the one or more files comprise sequence data, particle images, or both.
- An Example 47 of the numerous embodiments comprises the apparatus of Example 38 and further comprises deleting, based on the transfer of the one or more files to the destination computing device, the one or more files from the staging location.
- An Example 48 of the numerous embodiments includes an apparatus comprising at least one processor; and at least one memory device having processor-executable instructions stored thereon that, in response to execution by the at least one processor, cause the computing system to: receive, via a graphical user interface, a request to convert a dataset from object storage to a distributed file system; receive, via the graphical user interface, an indication of a storage size of the distributed file system; and convert, based on the request and the indication, the dataset from object storage to the distributed file system associated with the storage size.
- An Example 49 of the numerous embodiments comprises the apparatus of Example 48, the at least one memory device having further processor-executable instructions stored thereon that in response to execution by the at least one processor further cause the computing system to: receive, via the graphical user interface, a request to mount the distributed file system; and mount the distributed file system.
- An Example 50 of the numerous embodiments comprises the apparatus of Example 48, the at least one memory device having further processor-executable instructions stored thereon that in response to execution by the at least one processor further cause the computing system to: receive, via the graphical user interface, a request to save data in the distributed file system into the object storage; and save the data in the distributed file system into the object storage.
- An Example 51 of the numerous embodiments comprises the apparatus of Example 48, the at least one memory device having further processor-executable instructions stored thereon that in response to execution by the at least one processor further cause the computing system to: receive, via the graphical user interface, a request to delete the distributed file system; and delete the distributed file system.
- An Example 52 of the numerous embodiments includes an apparatus comprising at least one processor; and at least one memory device having processor-executable instructions stored thereon that, in response to execution by the at least one processor, cause the computing system to: identify a data analysis application program; identify a dataset associated with the data analysis application program; determine, as a program template, one or more job parameters associated with the data analysis application program processing the dataset; and cause, based on the program template, execution of the data analysis application program on the dataset.
- An Example 53 of the numerous embodiments comprises the apparatus of Example 52, where the one or more job parameters comprise one or more of: a number of Message Passing Interfaces (MPIs), a number of threads, or a number of compute nodes.
- An Example 54 of the numerous embodiments comprises the apparatus of Example 52, the at least one memory device having further processor-executable instructions stored thereon that in response to execution by the at least one processor further cause the apparatus to determine a plurality of tasks executable by the data analysis application program.
- An Example 55 of the numerous embodiments comprises the apparatus of Example 54, where determining the one or more job parameters associated with the data analysis application program processing the dataset comprises determining one or more job parameters for each task of the plurality of tasks.
- An Example 56 of the numerous embodiments includes at least one computer-readable non-transitory storage medium having processor-executable instructions stored thereon that, in response to execution, cause a computing system to: receive an indication of a synchronization request; determine, based on the indication, one or more files stored in a staging location; generate, based on the one or more files, a data transfer filter; and cause, based on the data transfer filter, transfer of the one or more files to a destination computing device.
- An Example 57 of the numerous embodiments comprises the at least one computer-readable non-transitory storage medium of Example 56, where receiving the indication of the synchronization request is based on a synchronization condition.
- An Example 58 of the numerous embodiments comprises the at least one computer-readable non-transitory storage medium of Example 57, where the synchronization condition is a time interval.
- An Example 59 of the numerous embodiments comprises the at least one computer-readable non-transitory storage medium of Example 56, where the data transfer filter comprises a list of the one or more files stored in the staging location.
- An Example 60 of the numerous embodiments comprises the at least one computer-readable non-transitory storage medium of Example 56, wherein generating, based on the one or more files, the data transfer filter comprises generating a message that invokes a function call to a cloud service, wherein the message passes one or more parameters identifying the one or more files as an argument of the function call.
- An Example 61 of the numerous embodiments comprises the at least one computer-readable non-transitory storage medium of Example 56, wherein causing, based on the data transfer filter, transfer of the one or more files to the destination computing device comprises causing a data synchronization application program to scan the staging location and the destination computing device only for the one or more files.
- An Example 62 of the numerous embodiments comprises the at least one computer-readable non-transitory storage medium of Example 56, where the processor-executable instructions, in response to further execution, further cause the computing system to receive, from a data origin device, the one or more files.
- An Example 63 of the numerous embodiments comprises the at least one computer-readable non-transitory storage medium of Example 62, where the data origin device comprises one or more of a sequencer or an electron microscope.
- An Example 64 of the numerous embodiments comprises the at least one computer-readable non-transitory storage medium of Example 63, where the one or more files comprise sequence data, particle images, or both.
- An Example 65 of the numerous embodiments comprises the at least one computer-readable non-transitory storage medium of Example 56, where the processor-executable instructions, in response to further execution, further cause the computing system to delete, based on the transfer of the one or more files to the destination computing device, the one or more files from the staging location.
- An Example 66 of the numerous embodiments includes at least one computer-readable non-transitory storage medium having processor-executable instructions stored thereon that, in response to execution, cause a computing system to: receive, via a graphical user interface, a request to convert a dataset from object storage to a distributed file system; receive, via the graphical user interface, an indication of a storage size of the distributed file system; and convert, based on the request and the indication, the dataset from object storage to the distributed file system associated with the storage size.
- An Example 67 of the numerous embodiments comprises the at least one computer-readable non-transitory storage medium of Example 66, where the processor-executable instructions, in response to further execution, further cause the computing system to: receive, via the graphical user interface, a request to mount the distributed file system; and mount the distributed file system.
- An Example 68 of the numerous embodiments comprises the at least one computer-readable non-transitory storage medium of Example 66, where the processor-executable instructions, in response to further execution, further cause the computing system to: receive, via the graphical user interface, a request to save data in the distributed file system into the object storage; and save the data in the distributed file system into the object storage.
- An Example 69 of the numerous embodiments comprises the at least one computer-readable non-transitory storage medium of Example 66, where the processor-executable instructions, in response to further execution, further cause the computing system to: receive, via the graphical user interface, a request to delete the distributed file system; and delete the distributed file system.
- An Example 70 of the numerous embodiments includes at least one computer-readable non-transitory storage medium having processor-executable instructions stored thereon that, in response to execution, cause a computing system to: identify a data analysis application program; identify a dataset associated with the data analysis application program; determine, as a program template, one or more job parameters associated with the data analysis application program processing the dataset; and cause, based on the program template, execution of the data analysis application program on the dataset.
- An Example 71 of the numerous embodiments comprises the at least one computer-readable non-transitory storage medium of Example 70, where the one or more job parameters comprise one or more of: a number of Message Passing Interfaces (MPIs), a number of threads, or a number of compute nodes.
- An Example 72 of the numerous embodiments comprises the at least one computer-readable non-transitory storage medium of Example 70, where the processor-executable instructions, in response to further execution, further cause the computing system to determine a plurality of tasks executable by the data analysis application program.
- An Example 73 of the numerous embodiments comprises the at least one computer-readable non-transitory storage medium of Example 72, where determining the one or more job parameters associated with the data analysis application program processing the dataset comprises determining one or more job parameters for each task of the plurality of tasks.
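- As an informal illustration (not part of the claims), the program template of Examples 70 through 73 could be rendered into a scheduler submission script. The sketch below assumes a Slurm scheduler and a RELION command line; all parameter values and task names are hypothetical.

```python
"""Sketch: render a per-task program template of job parameters into a
Slurm batch script. Parameter values and the RELION command line are
illustrative assumptions, not prescribed by the document."""
from dataclasses import dataclass

@dataclass
class ProgramTemplate:
    task: str        # e.g. "2d_classification"
    mpi_ranks: int   # number of MPI processes
    threads: int     # threads per MPI rank
    nodes: int       # compute nodes to request

# Hypothetical per-task templates for a Cryo-EM analysis application.
TEMPLATES = {
    "2d_classification": ProgramTemplate("2d_classification", 16, 4, 2),
    "3d_refinement": ProgramTemplate("3d_refinement", 33, 6, 4),
}

def render_sbatch(t: ProgramTemplate, command: str) -> str:
    """Turn a program template into a submittable Slurm script."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={t.task}",
        f"#SBATCH --nodes={t.nodes}",
        f"#SBATCH --ntasks={t.mpi_ranks}",      # one task per MPI rank
        f"#SBATCH --cpus-per-task={t.threads}",
        f"srun {command} --j {t.threads}",      # RELION's thread flag
    ])

print(render_sbatch(TEMPLATES["2d_classification"],
                    "relion_refine_mpi --i particles.star"))
```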
- An Example 74 of the numerous embodiments includes a computing system comprising at least one processor; and at least one memory device having processor-executable instructions stored thereon that, in response to execution by the at least one processor, cause the computing system to: receive an indication of a synchronization request; determine, based on the indication, one or more files stored in a staging location; generate, based on the one or more files, a data transfer filter; cause, based on the data transfer filter, transfer of the one or more files to object storage of a destination computing device; receive, via a graphical user interface, a request to convert the one or more files from object storage to a distributed file system; receive, via the graphical user interface, an indication of a storage size of the distributed file system; convert, based on the request and the indication, the one or more files from object storage to the distributed file system associated with the storage size; identify a data analysis application program associated with the one or more files in the distributed file system; determine, as a program template, one or more job parameters associated with the data analysis application program processing the dataset; and cause, based on the program template, execution of the data analysis application program on the dataset.
- An Example 75 of the numerous embodiments includes at least one computer-readable non-transitory storage medium having processor-executable instructions stored thereon that, in response to execution, cause a computing system to: receive an indication of a synchronization request; determine, based on the indication, one or more files stored in a staging location; generate, based on the one or more files, a data transfer filter; cause, based on the data transfer filter, transfer of the one or more files to object storage of a destination computing device; receive, via a graphical user interface, a request to convert the one or more files from object storage to a distributed file system; receive, via the graphical user interface, an indication of a storage size of the distributed file system; convert, based on the request and the indication, the one or more files from object storage to the distributed file system associated with the storage size; identify a data analysis application program associated with the one or more files in the distributed file system; determine, as a program template, one or more job parameters associated with the data analysis application program processing the dataset; and cause, based on the program template, execution of the data analysis application program on the dataset.
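- As an informal illustration (not part of the claims), the filter-and-transfer steps of Examples 74 and 75 could be expressed with AWS DataSync include filters. In the sketch below, the task ARN and staging path are placeholders; the filter string is built from whatever files are found in the staging location.

```python
"""Sketch: build a data transfer filter from the files in a staging
location and start an AWS DataSync task execution with it. The task ARN
and staging path are hypothetical placeholders."""
from pathlib import Path
import boto3

STAGING = Path("/staging")  # hypothetical staging location
TASK_ARN = "arn:aws:datasync:us-east-1:123456789012:task/task-0123456789abcdef0"

def build_filter(staging: Path) -> str:
    """DataSync SIMPLE_PATTERN filters are '|'-separated paths relative
    to the source location root."""
    files = [p for p in staging.rglob("*") if p.is_file()]
    return "|".join(f"/{p.relative_to(staging)}" for p in files)

datasync = boto3.client("datasync")

# Transfer only the staged files to the destination's object storage.
datasync.start_task_execution(
    TaskArn=TASK_ARN,
    Includes=[{"FilterType": "SIMPLE_PATTERN", "Value": build_filter(STAGING)}],
)
```

The subsequent conversion, mounting, and templated-execution steps would then proceed as in the earlier sketches.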
- The methods and systems disclosed may be configured for big data collection and real-time analysis.
- The methods and systems disclosed are configured for ultra-fast end-to-end processing of raw Cryo-EM data and reconstruction of an electron density map ready for ingestion into model-building software.
- The methods and systems disclosed optimize reconstruction algorithms and GPU acceleration at one or more stages, from pre-processing through particle picking, 2D particle classification, 3D ab-initio structure determination, high-resolution refinement, and heterogeneity analysis.
- The methods and systems disclosed enable real-time Cryo-EM data quality assessment and decision-making during live data collection, as well as an expedited, streamlined workflow for processing already available data.
- The methods and systems disclosed comprise compute platforms with high storage bandwidth for faster processing, thereby reducing run time on costly compute resources.
- The methods and systems disclosed can be configured as a self-service, cloud-based computational platform that enables scientists to run multiple analytical processes on demand, without IT dependencies or having to determine the compute design.
- The methods and systems disclosed have broad, flexible applications, regardless of the data type or size, or the type of experimentation.
- The methods and systems disclosed may be configured as a platform that enables scientists to scale and process a vast amount of imagery in a timely fashion, with high levels of quality and agility, while containing costs.
- The methods and systems disclosed may be configured as an automated, end-to-end processing pipeline employing AWS DataSync, Apache Airflow (for orchestration), the Lustre filesystem (for high-throughput storage), Nextflow, and the AWS ParallelCluster framework to enable the transport and processing of large amounts of data over time (e.g., 1 TB/hour of raw data) for model development.
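- As an informal illustration (not part of the claims), those components could be sequenced with an Apache Airflow DAG such as the sketch below; the task bodies are stubs, and all identifiers are hypothetical.

```python
"""Sketch: an Apache Airflow DAG ordering the pipeline's stages. The task
bodies are stubs; in practice they would call DataSync, FSx, and the
cluster scheduler as in the surrounding sketches."""
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def sync_raw_data():    ...  # trigger the AWS DataSync transfer to object storage
def build_filesystem(): ...  # create the Lustre file system from the bucket
def run_analysis():     ...  # submit templated jobs on AWS ParallelCluster
def export_results():   ...  # export refined maps back to object storage

with DAG(
    dag_id="cryoem_pipeline",
    start_date=datetime(2022, 3, 21),
    schedule_interval=None,  # triggered per synchronization request
) as dag:
    sync = PythonOperator(task_id="sync_raw_data", python_callable=sync_raw_data)
    build = PythonOperator(task_id="build_filesystem", python_callable=build_filesystem)
    analyze = PythonOperator(task_id="run_analysis", python_callable=run_analysis)
    export = PythonOperator(task_id="export_results", python_callable=export_results)

    # Linear dependency chain: sync, then convert, then process, then export.
    sync >> build >> analyze >> export
```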
- The methods and systems disclosed may integrate RELION for real-time Cryo-EM data quality assessment and decision-making during collection of data.
- The methods and systems disclosed may extend the AWS parallel computation framework to accommodate GPU-based computing.
- The methods and systems disclosed may comprise data management and tiering tooling that enables users to manage the life cycle of the data.
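- As an informal illustration (not part of the claims), such tiering is commonly expressed as object-storage lifecycle rules; in the sketch below, the bucket name, prefix, and retention periods are assumptions.

```python
"""Sketch: lifecycle tiering of raw data in object storage. The bucket
name, prefix, and retention periods are illustrative assumptions."""
import boto3

s3 = boto3.client("s3")

# Age raw micrographs into progressively colder storage classes.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-cryoem-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-raw-micrographs",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }],
    },
)
```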
- The methods and systems disclosed may implement a high-performance remote display protocol such as NICE DCV to deliver graphics-intensive applications to remote users and stream user interfaces to any client machine, eliminating the need for dedicated workstations.
- The methods and systems disclosed may apply blue-green high-performance computing, a concept generally limited to software development, to Cryo-EM data quality assessment and decision-making during collection. As a result, job processing is both sped up and scaled up.
- The methods and systems disclosed are able to speed up the Cryo-EM pipeline to approximately 60 minutes per 1 TB of data, e.g., ingesting raw data, preprocessing, classifying, reconstructing, and refining a 3D map while the sample is still in the microscope.
- The methods and systems disclosed may be configured as a managed service that provides users instant access to RELION and its associated applications from anywhere.
- The methods and systems disclosed represent a scalable, cloud-based data processing and computing platform to support a Cryo-EM-type, large-volume data pipeline.
- The methods and systems disclosed provide the key benefits of a cloud-based solution: scalability, nimbleness, and responsiveness to ever-changing research needs.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Surgical Instruments (AREA)
- Ultrasonic Diagnosis Equipment (AREA)
- Tires In General (AREA)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202163163690P | 2021-03-19 | 2021-03-19 | |
US202163237904P | 2021-08-27 | 2021-08-27 | |
PCT/US2022/021190 WO2022198132A1 (en) | 2021-03-19 | 2022-03-21 | Data pipeline |
Publications (1)
Publication Number | Publication Date |
---|---|
EP4309044A1 true EP4309044A1 (en) | 2024-01-24 |
Family
ID=81328486
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP22715864.9A Pending EP4309044A1 (en) | 2021-03-19 | 2022-03-21 | Data pipeline |
Country Status (8)
Country | Link |
---|---|
US (1) | US20220300321A1 (en) |
EP (1) | EP4309044A1 (en) |
JP (1) | JP2024511756A (en) |
KR (1) | KR20230156416A (en) |
AU (1) | AU2022238487A1 (en) |
CA (1) | CA3210417A1 (en) |
IL (1) | IL305574A (en) |
WO (1) | WO2022198132A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11695853B1 (en) * | 2022-04-07 | 2023-07-04 | T-Mobile Usa, Inc. | Content management systems providing zero recovery point objective |
Family Cites Families (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH09282359A (en) * | 1996-04-09 | 1997-10-31 | Nippon Telegr & Teleph Corp <Ntt> | Job-shop scheduling device |
US6748504B2 (en) * | 2002-02-15 | 2004-06-08 | International Business Machines Corporation | Deferred copy-on-write of a snapshot |
US7590667B2 (en) * | 2003-01-30 | 2009-09-15 | Hitachi, Ltd. | File replication method for distributed file systems |
US8336040B2 (en) * | 2004-04-15 | 2012-12-18 | Raytheon Company | System and method for topology-aware job scheduling and backfilling in an HPC environment |
US8725698B2 (en) * | 2010-03-30 | 2014-05-13 | Commvault Systems, Inc. | Stub file prioritization in a data replication system |
US9838478B2 (en) * | 2014-06-30 | 2017-12-05 | International Business Machines Corporation | Identifying a task execution resource of a dispersed storage network |
CN104537713B (en) * | 2015-01-05 | 2017-10-03 | 清华大学 | A kind of novel three-dimensional reconfiguration system |
US10409863B2 (en) * | 2016-02-05 | 2019-09-10 | Sas Institute Inc. | Verification and export of federated areas and job flow objects within federated areas |
WO2018053761A1 (en) * | 2016-09-22 | 2018-03-29 | 华为技术有限公司 | Data processing method and device, and computing node |
US11074220B2 (en) * | 2017-01-06 | 2021-07-27 | Oracle International Corporation | Consistent file system semantics with cloud object storage |
US11680914B2 (en) * | 2017-10-06 | 2023-06-20 | The Governing Council Of The University Of Toronto | Methods and systems for 3D structure estimation using non-uniform refinement |
US11449813B2 (en) * | 2018-04-13 | 2022-09-20 | Accenture Global Solutions Limited | Generating project deliverables using objects of a data model |
CN113835869B (en) * | 2020-06-23 | 2024-04-09 | 中国石油化工股份有限公司 | MPI-based load balancing method, MPI-based load balancing device, computer equipment and storage medium |
CN112258627B (en) * | 2020-09-18 | 2023-09-15 | 中国科学院计算技术研究所 | Local fault three-dimensional reconstruction system |
CN113377733B (en) * | 2021-06-09 | 2022-12-27 | 西安理工大学 | Storage optimization method for Hadoop distributed file system |
- 2022
- 2022-03-21 CA CA3210417A patent/CA3210417A1/en active Pending
- 2022-03-21 WO PCT/US2022/021190 patent/WO2022198132A1/en active Application Filing
- 2022-03-21 EP EP22715864.9A patent/EP4309044A1/en active Pending
- 2022-03-21 KR KR1020237035224A patent/KR20230156416A/en unknown
- 2022-03-21 AU AU2022238487A patent/AU2022238487A1/en active Pending
- 2022-03-21 IL IL305574A patent/IL305574A/en unknown
- 2022-03-21 US US17/700,076 patent/US20220300321A1/en active Pending
- 2022-03-21 JP JP2023557114A patent/JP2024511756A/en active Pending
Also Published As
Publication number | Publication date |
---|---|
JP2024511756A (en) | 2024-03-15 |
IL305574A (en) | 2023-10-01 |
WO2022198132A1 (en) | 2022-09-22 |
US20220300321A1 (en) | 2022-09-22 |
KR20230156416A (en) | 2023-11-14 |
CA3210417A1 (en) | 2022-09-22 |
AU2022238487A1 (en) | 2023-09-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9740706B2 (en) | Management of intermediate data spills during the shuffle phase of a map-reduce job | |
EP4242822A2 (en) | Ensuring reproducibility in an artificial intelligence infrastructure | |
Zhang et al. | Applying twister to scientific applications | |
Merceedi et al. | A comprehensive survey for hadoop distributed file system | |
Zhang et al. | Design and evaluation of a collective IO model for loosely coupled petascale programming | |
US20220300321A1 (en) | Data pipeline | |
EP4118536A1 (en) | Extensible streams on data sources | |
Fomferra et al. | Calvalus: Full-mission EO cal/val, processing and exploitation services | |
Wilke et al. | An experience report: porting the MG‐RAST rapid metagenomics analysis pipeline to the cloud | |
Wang et al. | ODDS: Optimizing data-locality access for scientific data analysis | |
García et al. | Data-intensive analysis for scientific experiments at the large scale data facility | |
CN117043759A (en) | Data pipeline | |
Abramson et al. | A cache-based data movement infrastructure for on-demand scientific cloud computing | |
CN117667853B (en) | Data reading method, device, computer equipment and storage medium | |
Abramson et al. | Democratising large scale instrument-based science through e-Infrastructure | |
Wan et al. | An image management system implemented on open-source cloud platform | |
US11513710B2 (en) | Multi-pass distributed data shuffle | |
Narayanapppa et al. | Need of Hadoop and Map Reduce for Processing and Managing Big Data | |
US20240028473A1 (en) | System and method for optimizing network attached storage backup of a large set of files based on resource availability | |
US20240028474A1 (en) | System and method for managing a backup of a large set of files using a file system analysis for data stored in a network attached storage | |
Jung et al. | High-performance serverless data transfer over wide-area networks | |
WO2008014614A1 (en) | A method for providing live file transfer between machines | |
Zeng et al. | SHAstor: A Scalable HDFS-based Storage Framework for Small-Write Efficiency in Pervasive Computing | |
Thakker et al. | GeoProcessing Workflow Models for Distributed Processing Frameworks | |
KR20220073947A (en) | Algorithm for Distributed Parallel Processing of Energy Big Data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: UNKNOWN |
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
| PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
| 17P | Request for examination filed | Effective date: 20231010 |
| AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| RIN1 | Information on inventor provided before grant (corrected) | Inventor name: SALUNKE, SIDDHESH; HERNANDEZ, MARCO; SHAIK, ABDUL; FRANKLIN, MATTHEW; BUHAY, CHRISTIAN; SADANANDHAMURTHY, SRINIVASAN; HU, CUIE; GANDE, RAJESHWAR; KARUMURI, NAVEEN; NAWAZ, SHAH; YANG, QUAN |
| DAV | Request for validation of the european patent (deleted) | |
| DAX | Request for extension of the european patent (deleted) | |
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: EXAMINATION IS IN PROGRESS |