CN109937426A - System and method for biological data management - Google Patents

System and method for biological data management

Info

Publication number
CN109937426A
Authority
CN
China
Prior art keywords
base
data
stored
biological data
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201780035638.0A
Other languages
Chinese (zh)
Inventor
马苏德·瓦基利
库尔特·克里斯托弗森
马克·奥尔德姆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Quantum Biosystems Inc
Original Assignee
Quantum Biosystems Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Quantum Biosystems Inc
Publication of CN109937426A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/22 Social work
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/40 Encryption of genetic data
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/50 Compression of genetic data
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/40 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for data related to laboratory analysis, e.g. patient specimen analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00 ICT specially adapted for medical reports, e.g. generation or transmission thereof

Abstract

Systems and methods for biological data management can retain alternative interpretations of data and can implement multi-level encryption and privacy management. Systems and methods for biological data management may include a cell-level architecture, a bank-level and block-level architecture, and/or a multi-tier architecture. Systems and methods for biological data management may include definitions, rules, and instructions, and/or may use two-dimensional or three-dimensional data structures.

Description

System and method for biological data management
Cross reference to related applications
This application claims priority to U.S. Provisional Patent Application No. 62/321,103, filed April 11, 2016, which is incorporated herein by reference in its entirety.
Background
New research continues to increase our understanding of genetic information and to pose challenges for how that information is managed. A more complete understanding of genetic profiles at higher resolution may produce valuable results in health care and other disciplines.
As an example, one of the challenges in managing genetic deoxyribonucleic acid (DNA) data is the existence of highly conserved regions that remain unchanged over time yet do not appear to code for proteins. However, studies suggest that they may play important roles in the regulation of gene expression, in alternative splicing, and as distal enhancers. It is therefore desirable to have an efficient method for storing infrequently used regions while maintaining fast access to the more frequently used regions of a genetic sequence.
Summary of the invention
Recognized herein is the need for data management schemes that accommodate alternative interpretations of data and can therefore provide access to the lower-level data measured by various devices. Also recognized herein is the need to sense, store, and manage genetic data with greater flexibility and greater integrity, and to flexibly and efficiently create, add to, maintain, and query these data sets at different levels while handling error scenarios.
Provided herein are systems and methods for efficiently and securely managing genetic data, including reading and interpreting raw data, storing and interpreting genetic data, and maintaining the privacy and confidentiality of the data.
Some systems and methods can provide definitions and rules, as well as appropriate instructions for issues related to health care, food safety, and/or the handling of other pathogens. A multi-tier network architecture in an information processing environment can be used.
Parallelism can be used as required by the task and by the type of biological data interpretation. Information can initially be stored in a distributed store of semi-structured data, allowing it to be scanned, reduced, and shuffled as needed into structured, columnar, or relational databases.
Systems and methods can perform different queries simultaneously in stages, can allow information to be stored in repositories, and can encrypt information at rest. Information can be transmitted in a secure and flexible manner across a distributed system, between repositories, between servers, or between a server and a client.
Systems and methods can store biological data in one or more storage devices according to the relationship between the size of the data or data units and the size of the cells, blocks, or banks of the one or more storage devices.
Systems and methods can support access control, which can be based on user, role, application, process, or location.
Systems and methods can map and store genetic data (for example, polynucleotide data) in one or more memory devices at the memory cell level, the memory block level, the memory bank level, or another storage unit level.
One aspect of the present disclosure provides a biological data management system, comprising: (a) an end user module that includes a sequencing device configured to generate base data; (b) a local repository in network communication with the end user module, the local repository being programmed or configured to (i) receive the base data, (ii) convert the base data into sequence data, (iii) generate digest data based on the sequence data, and (iv) compare the digest data with a database of existing digests; and (c) a central server in network communication with the local repository, the central server being configured to update the database of existing digests.
In some embodiments, the local repository is also programmed or configured to flag digests and to transmit the flagged digests to the central server. In some embodiments, the central server is also programmed or configured to receive the flagged digests and to perform further analysis on them. In some embodiments, the central server is also programmed or configured to generate instructions upon analyzing the flagged digests and to transmit the instructions to the local repository. In some embodiments, the digest is a variance, a hash, or a checksum.
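As a minimal sketch of the digest comparison described above, the following Python fragment computes a hash or checksum of sequence data and checks it against a set of existing digests. The function names, the choice of SHA-256 and CRC-32, and the rule that an unknown digest is the one that gets flagged are all assumptions made for illustration; the disclosure does not specify them.

```python
import hashlib
import zlib


def make_digest(sequence: str, kind: str = "hash") -> str:
    """Return a hex digest of a base sequence (hypothetical helper)."""
    data = sequence.upper().encode("ascii")
    if kind == "hash":
        # SHA-256 is an assumed choice of hash function.
        return hashlib.sha256(data).hexdigest()
    if kind == "checksum":
        # CRC-32 is an assumed choice of checksum.
        return format(zlib.crc32(data), "08x")
    raise ValueError(f"unsupported digest kind: {kind}")


def check_against_known(sequence: str, known_digests: set[str]) -> tuple[str, bool]:
    """Compare a sequence digest with existing digests.

    Returns the digest and a flag; here a digest is flagged when it is
    not already in the database (an assumption for this sketch).
    """
    digest = make_digest(sequence)
    return digest, digest not in known_digests


# The "database of existing digests" is modeled as an in-memory set.
known = {make_digest("ACGTACGT")}
digest_a, flagged_a = check_against_known("ACGTACGT", known)
digest_b, flagged_b = check_against_known("ACGTTCGT", known)
```

In this sketch only `digest_b` would be forwarded for further analysis, since it is the only digest absent from the existing set.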
Another aspect of the present disclosure provides a method for storing biological data, comprising: (a) determining the size of the biological data to identify a storage unit size suitable for storing the biological data; (b) identifying a memory location in a memory device having a block size compatible with the storage unit size; and (c) storing the biological data in an erasable block at the memory location of the memory device.
In some embodiments, each erasable block includes a partition for storing the biological data and a partition for storing metadata related to the biological data. In some embodiments, the partition for storing metadata has a longer lifetime. In some embodiments, the partition for storing metadata has a controller different from the controller of the partition for storing sequence data. In some embodiments, the partition for storing metadata is configured for more frequent access than the partition for storing sequence data.
Another aspect of the present disclosure provides a biological data management system, comprising: (a) a first memory device configured to store biological data for infrequent access; and (b) a second memory device having a block size, the second memory device being in communication with the first memory device and configured to store biological data for frequent access; wherein the second memory device is faster than the first memory device, and wherein the block size is selected to store the biological data according to the size of the biological data.
In some embodiments, the biological data is an n-mer sequence, and the block size is n times the number of bits needed to store a monomer of the n-mer. In some embodiments, the biological data is an n-mer sequence, and the block size is at least n times the number of bits needed to store a monomer of the n-mer. In some embodiments, the second memory device includes a flash memory device. In some embodiments, the second memory device includes blocks that serve as flash erase blocks.
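The block-size relationship for an n-mer can be illustrated with a short calculation. The sketch below is an assumption-laden illustration, not part of the disclosure: the function names are hypothetical, and the number of bits per monomer is taken to be the smallest whole number of bits that can distinguish every symbol in the alphabet.

```python
def bits_per_monomer(alphabet_size: int) -> int:
    """Smallest number of bits that can distinguish alphabet_size symbols."""
    if alphabet_size < 1:
        raise ValueError("alphabet_size must be positive")
    return max(1, (alphabet_size - 1).bit_length())


def min_block_bits(n: int, alphabet_size: int = 4) -> int:
    """Minimum block size in bits for an n-mer: n times the bits per monomer."""
    return n * bits_per_monomer(alphabet_size)


# A 21-mer over four bases, and over a seven-symbol alphabet
# (four bases plus three modified or missing-base symbols).
bits_four = min_block_bits(21, 4)
bits_seven = min_block_bits(21, 7)
```

With four possible bases each monomer needs 2 bits, so a 21-mer needs 42 bits; with seven symbols each monomer needs 3 bits, so the same 21-mer needs 63 bits.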
Another aspect of the present disclosure provides a method for storing genetic sequence data in a multi-level cell (MLC) memory device, the MLC memory device including memory cells, each of which is configured to store two bits, the method comprising, in a memory cell: (a) setting the two bits to 00 to represent a first type of base; (b) setting the two bits to 01 to represent a second type of base; (c) setting the two bits to 10 to represent a third type of base; or (d) setting the two bits to 11 to represent a fourth type of base.
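The two-bit mapping can be sketched in software as follows. The assignment of A, C, G, and T to the codes 00, 01, 10, and 11 is an assumption chosen for illustration (the text above refers only to first through fourth types of base), and the list of integers stands in for physical memory cells.

```python
# Assumed assignment of bases to two-bit cell values.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_cells(sequence: str) -> list[int]:
    """Map each base to the two-bit value held by one simulated cell."""
    return [BASE_TO_BITS[base] for base in sequence]


def decode_cells(cells: list[int]) -> str:
    """Recover the base sequence from two-bit cell values."""
    return "".join(BITS_TO_BASE[value] for value in cells)


def pack_cells(cells: list[int]) -> bytes:
    """Pack four two-bit cells per byte; the last byte is zero-padded."""
    out = bytearray()
    for start in range(0, len(cells), 4):
        chunk = cells[start:start + 4]
        byte = 0
        for value in chunk:
            byte = (byte << 2) | value
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


cells = encode_cells("GATTACA")
packed = pack_cells(encode_cells("ACGT"))
```

Because each cell holds exactly one base, a sequence of length n occupies n cells, or n/4 bytes (rounded up) when packed.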
In some embodiments, the genetic sequence data represents one or more polynucleotides, each of which includes one or more bases, each of the one or more bases being one of at least four possible bases. In some embodiments, the polynucleotide is DNA or RNA.
Another aspect of the present disclosure provides a method for storing biological data in a memory device, the memory device including blocks, each of which has a block size, the method comprising: (a) determining the size of the biological data; (b) determining the block size of at least a subset of the blocks; (c) compressing the biological data based on the block size to generate compressed biological data; and (d) storing the biological data in the at least a subset of the blocks.
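One way to picture block-aware compression is the sketch below, which compresses data with zlib and checks that the result fits a fixed-size block. The use of zlib, the 4-byte length prefix, the zero padding, and the function names are all assumptions for illustration; the disclosure does not name a compression algorithm or a block layout.

```python
import zlib

LENGTH_PREFIX_BYTES = 4  # assumed header that records the compressed length


def store_in_block(data: bytes, block_size: int) -> bytes:
    """Compress data and pad it to exactly block_size bytes.

    Raises ValueError when the compressed data does not fit the block.
    """
    compressed = zlib.compress(data, 9)
    needed = LENGTH_PREFIX_BYTES + len(compressed)
    if needed > block_size:
        raise ValueError(f"need {needed} bytes, block holds {block_size}")
    header = len(compressed).to_bytes(LENGTH_PREFIX_BYTES, "big")
    return (header + compressed).ljust(block_size, b"\x00")


def load_from_block(block: bytes) -> bytes:
    """Read the length prefix, then decompress only the stored payload."""
    length = int.from_bytes(block[:LENGTH_PREFIX_BYTES], "big")
    payload = block[LENGTH_PREFIX_BYTES:LENGTH_PREFIX_BYTES + length]
    return zlib.decompress(payload)


raw = b"ACGT" * 1000          # 4000 bytes of highly repetitive sequence text
block = store_in_block(raw, block_size=512)
restored = load_from_block(block)
```

A caller that receives the `ValueError` could split the data across several blocks or choose a larger block size; either policy is outside this sketch.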
In some embodiments, the memory device includes a flash memory device, and the block size is an erase block size.
In some embodiments, the block size is greater than or equal to the size of the compressed biological data. In some embodiments, the erase block stores the biological data and metadata of the biological data.
Another aspect of the present disclosure provides a method for storing genetic sequence data in a memory device, the memory device including memory cells, each of which is configured to store at least three bits, the method comprising, in a memory cell: (a) setting three of the at least three bits to 000 to represent a first type of base; (b) setting three of the at least three bits to 001 to represent a second type of base; (c) setting three of the at least three bits to 010 to represent a third type of base; (d) setting three of the at least three bits to 011 to represent a fourth type of base; (e) setting three of the at least three bits to 100 to represent a fifth type of base; (f) setting three of the at least three bits to 101 to represent a sixth type of base; (g) setting three of the at least three bits to 110 to represent a seventh type of base; and (h) setting three of the at least three bits to 111 to represent an eighth type of base.
In some embodiments, the genetic sequence data represents one or more polynucleotides, each of which includes one or more bases, each of the one or more bases being one of four different natural bases, a methylated base, an oxidized base, or an abasic site. In some embodiments, the polynucleotide is DNA or RNA. In some embodiments, the memory device includes flash memory, phase-change memory, or resistive memory.
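The three-bit scheme can be sketched in the same spirit. The symbols used here for a methylated base ("m"), an oxidized base ("o"), and an abasic site ("-"), and the assignment of each symbol to a particular code, are hypothetical labels chosen for illustration; the code 111 is left unused in this sketch.

```python
# Assumed assignment of seven symbols to three-bit cell values.
SYMBOL_TO_BITS = {
    "A": 0b000, "C": 0b001, "G": 0b010, "T": 0b011,
    "m": 0b100,  # methylated base (hypothetical label)
    "o": 0b101,  # oxidized base (hypothetical label)
    "-": 0b110,  # abasic site (hypothetical label)
}
BITS_TO_SYMBOL = {bits: symbol for symbol, bits in SYMBOL_TO_BITS.items()}


def encode_symbols(symbols: str) -> int:
    """Concatenate three-bit codes, first symbol in the highest bits."""
    value = 0
    for symbol in symbols:
        value = (value << 3) | SYMBOL_TO_BITS[symbol]
    return value


def decode_symbols(value: int, count: int) -> str:
    """Read count three-bit codes back out of an encoded integer."""
    out = []
    for _ in range(count):
        out.append(BITS_TO_SYMBOL[value & 0b111])
        value >>= 3
    return "".join(reversed(out))


encoded = encode_symbols("Gm-")   # 010 100 110
decoded = decode_symbols(encoded, 3)
```

The symbol count must be supplied on decoding because leading "A" symbols encode as zero bits and would otherwise be lost.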
Another aspect of the present disclosure provides a method for storing genetic sequence data in a memory device, the genetic sequence data including two possible bases to represent each of a plurality of measured bases, the memory device including memory cells, each of which is configured to store a plurality of bits, the method comprising: storing the most likely base of the genetic sequence data in a first portion of the plurality of bits; storing the second most likely base of the genetic sequence data in a second portion of the plurality of bits; and storing the relative probability of the most likely base and the second most likely base in the remainder of the plurality of bits.
In some embodiments, the method further comprises: using a first cell of the memory cells to identify the most likely base; using a second cell of the memory cells to identify the second most likely base; and using one or more other cells of the memory cells to store the relative probability. In some embodiments, the method further comprises storing the probability of the second most likely base in a third cell of the memory cells.
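The following sketch shows one possible bit layout for keeping two candidate bases together with their relative probability. Everything about the layout is an assumption made for illustration: an 8-bit word with 2 bits for the most likely base, 2 bits for the second most likely base, and 4 bits for the relative probability p1 / (p1 + p2), which always lies between 0.5 and 1.0 and is quantized to 16 levels.

```python
BASES = "ACGT"  # assumed two-bit code is the index into this string


def pack_call(probs: dict[str, float]) -> int:
    """Pack the top two bases and their relative probability into 8 bits."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    first, second = ranked[0], ranked[1]
    # Relative probability of the top call versus the runner-up.
    relative = probs[first] / (probs[first] + probs[second])
    quantized = round((relative - 0.5) / 0.5 * 15)  # 4-bit value, 0..15
    return (BASES.index(first) << 6) | (BASES.index(second) << 4) | quantized


def unpack_call(word: int) -> tuple[str, str, float]:
    """Recover both bases and the approximate relative probability."""
    first = BASES[(word >> 6) & 0b11]
    second = BASES[(word >> 4) & 0b11]
    relative = 0.5 + (word & 0b1111) / 15 * 0.5
    return first, second, relative


word = pack_call({"A": 0.70, "C": 0.20, "G": 0.06, "T": 0.04})
first, second, relative = unpack_call(word)
```

The quantization error in this layout is at most one half of a step, that is 1/60 of the probability scale; a design that needs finer resolution could give more bits to the probability field.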
Another aspect of the present disclosure provides a method for storing genetic sequence data in a memory device, the memory device including memory cells, each of which is configured to store at least three bits, the method comprising, in a memory cell: (a) providing a first three-bit indication, using three of the at least three bits, to represent a first type of base; (b) providing a second three-bit indication, using three of the at least three bits, to represent a second type of base; (c) providing a third three-bit indication, using three of the at least three bits, to represent a third type of base; (d) providing a fourth three-bit indication, using three of the at least three bits, to represent a fourth type of base; (e) providing a fifth three-bit indication, using three of the at least three bits, to represent a methylated base; (f) providing a sixth three-bit indication, using three of the at least three bits, to represent an oxidized base; and (g) providing a seventh three-bit indication, using three of the at least three bits, to represent an abasic site.
In some embodiments, the memory device includes flash memory, phase-change memory, or resistive memory.
Another aspect of the present disclosure provides a method for encrypting biological sequence data, the method comprising: (a) identifying a normal level of variance in the biological sequence data; and (b) introducing a second level of variance into the biological sequence data, the second level of variance being comparable to the normal level of variance, such that the biological sequence data is indistinguishable relative to the normal level of variance.
In some embodiments, the method further comprises transmitting the introduced level of variance using an encryption method.
Another aspect of the present disclosure provides a method for encrypting biological sequence data of a subject, the method comprising: (a) encrypting information related to the subject using a first encryption scheme; and (b) encrypting the biological sequence data using a second encryption scheme that is different from the first encryption scheme.
In some embodiments, the second encryption scheme includes less extensive encryption than the first encryption scheme. In some embodiments, the second encryption scheme includes chaffing and winnowing. In some embodiments, the first encryption scheme uses a public key infrastructure, and the second encryption scheme uses the same public key infrastructure. In some embodiments, the first encryption scheme uses a first public key infrastructure, and the second encryption scheme uses a second public key infrastructure different from the first public key infrastructure.
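Chaffing and winnowing hides a message by mixing authenticated packets ("wheat") with packets that carry invalid authentication codes ("chaff"); only a holder of the shared key can tell them apart. The sketch below applies the idea to a base sequence. The packet layout (position, base, message authentication code), the use of HMAC-SHA-256, and the function names are assumptions for illustration and are not taken from the disclosure.

```python
import hashlib
import hmac
import secrets

BASES = "ACGT"


def _mac(key: bytes, position: int, base: str) -> bytes:
    """Authentication code binding a base to its position."""
    message = f"{position}:{base}".encode("ascii")
    return hmac.new(key, message, hashlib.sha256).digest()


def add_chaff(sequence: str, key: bytes) -> list[tuple[int, str, bytes]]:
    """Emit all four bases per position; only the true base has a valid MAC."""
    packets = []
    for position, true_base in enumerate(sequence):
        for base in BASES:
            if base == true_base:
                tag = _mac(key, position, base)   # wheat
            else:
                tag = secrets.token_bytes(32)     # chaff: random, invalid MAC
            packets.append((position, base, tag))
    return packets


def winnow(packets: list[tuple[int, str, bytes]], key: bytes) -> str:
    """Keep only packets whose MAC verifies, then rebuild the sequence."""
    wheat = {
        position: base
        for position, base, tag in packets
        if hmac.compare_digest(tag, _mac(key, position, base))
    }
    return "".join(wheat[position] for position in sorted(wheat))


shared_key = secrets.token_bytes(32)
packets = add_chaff("GATTACA", shared_key)
recovered = winnow(packets, shared_key)
```

Without the key, every position appears to carry all four bases with equally plausible tags, so the packet stream reveals only the sequence length.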
Another aspect of the present disclosure provides a method for storing genetic sequence data, the method comprising: providing a two-dimensional table structure in computer memory, the two-dimensional table structure being configured to store information representing potential bases; storing information representing the most likely measured base of the genetic sequence data in a first dimension of the two-dimensional table structure; storing information representing other potential bases of the genetic sequence data in a second dimension of the two-dimensional table structure; and storing, in the two-dimensional table structure, a probability corresponding to the intersection of the first dimension and the second dimension.
In some embodiments, the potential bases include each of a set of four possible bases and at least one of a methylated base, an oxidized base, and an abasic site. In some embodiments, the method further comprises providing a second two-dimensional table structure in computer memory, the second two-dimensional table structure being configured to store information representing potential bases; and storing, in the second two-dimensional table structure, the most likely measured base and the second most likely measured base of the genetic sequence data.
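A two-dimensional table of this kind can be sketched as a nested mapping in which the row is the most likely measured base, the column is a potential base, and the cell holds a probability. The nested-dictionary representation, the function names, and the example probabilities are assumptions for illustration only.

```python
BASES = ("A", "C", "G", "T")


def empty_table() -> dict[str, dict[str, float]]:
    """Rows: most likely measured base. Columns: potential bases."""
    return {row: {column: 0.0 for column in BASES} for row in BASES}


def record_measurement(table: dict[str, dict[str, float]],
                       probs: dict[str, float]) -> str:
    """Store one measurement's probabilities in the row of its top call."""
    called = max(probs, key=probs.get)
    for base, probability in probs.items():
        # The intersection of (called base, potential base) holds the probability.
        table[called][base] = probability
    return called


table = empty_table()
called = record_measurement(table, {"A": 0.05, "C": 0.10, "G": 0.80, "T": 0.05})
```

Extending `BASES` with symbols for a methylated base, an oxidized base, or an abasic site would enlarge both dimensions without changing the functions.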
Another aspect of the present disclosure provides a method for managing biological data, the method comprising: providing an application server programmed or configured to (i) receive raw measured biological data from a sensor and (ii) generate processed biological data from the raw measured biological data; receiving, at the application server from a local repository, definitions and rules related to the processed biological data; and issuing, by the application server, instructions based on the definitions and rules related to the processed biological data.
In some embodiments, the processed biological data includes a portion for which no related definitions and rules are found in the local repository, and the method further comprises sending at least the portion of the processed biological data to the local repository. In some embodiments, the method further comprises sending at least the portion of the processed biological data from the local repository to a central server. In some embodiments, the method further comprises sending instructions from the central server to the local repository. In some embodiments, the method further comprises sending new definitions and rules from the central server to the local repository.
Another aspect of the present disclosure provides a method for storing genetic sequence data, the method comprising: for a base position, storing information representing the most likely base of the genetic sequence data in a first location of a storage device, and storing the probability of occurrence of the most likely base in a second location of the storage device.
Another aspect of the present disclosure provides a method for storing genetic sequence data that includes at least four possible bases, the method comprising: (a) providing a three-dimensional table structure in computer memory, the three-dimensional table structure being configured to store the genetic sequence data, wherein (i) a first dimension of the three-dimensional table structure stores information representing the most likely measured base of the genetic sequence data; (ii) a second dimension of the three-dimensional table structure stores information representing potential bases of the genetic sequence data; and (iii) a third dimension of the three-dimensional table structure stores information representing base count probabilities for each of the at least four possible bases of the genetic sequence data; and (b) storing, in the three-dimensional table structure, a probability corresponding to the intersection of the first dimension, the second dimension, and the third dimension.
Another aspect of the present disclosure provides a method for protecting biological data related to a subject, the method comprising: encrypting personally identifiable information of the subject using a first encryption scheme; encrypting a phenotype of the subject using a second encryption scheme; encrypting the biological data using a third encryption scheme, wherein the second encryption scheme or the third encryption scheme is different from the first encryption scheme; and storing the encrypted personally identifiable information, the encrypted phenotype, and the encrypted biological data in computer memory.
In some embodiments, (i) the second encryption scheme is different from the first encryption scheme, (ii) the third encryption scheme is different from the first encryption scheme, and (iii) the third encryption scheme is different from the second encryption scheme. In some embodiments, the method further comprises storing gene expression data of the subject. In some embodiments, the method further comprises storing geographic data of the subject.
Another aspect of the present disclosure provides a method for storing genetic data of a subject, the method comprising: storing personally identifiable information of the subject in a first storage segment having a first access restriction level; storing phenotypic data of the subject in a second storage segment having a second access restriction level; and storing the genetic data of the subject in a third storage segment having a third access restriction level.
In some embodiments, the second access restriction level or the third access restriction level is different from the first access restriction level. In some embodiments, (i) the second access restriction level is different from the first access restriction level, (ii) the third access restriction level is different from the first access restriction level, and (iii) the third access restriction level is different from the second access restriction level.
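Segmented storage with a separate access restriction level per segment can be sketched as follows. The class names, the integer levels, the rule that a higher number is more restrictive, and the specific level given to each segment are all assumptions for illustration; the disclosure requires only that the segments have access restriction levels, which may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One storage segment with its own access restriction level."""
    name: str
    required_level: int
    records: dict[str, object] = field(default_factory=dict)


class SegmentedStore:
    """Keeps identity, phenotype, and genetic data in separate segments."""

    def __init__(self) -> None:
        # Assumed levels: identity data is the most restricted.
        self.segments = {
            "pii": Segment("pii", required_level=3),
            "phenotype": Segment("phenotype", required_level=2),
            "genetic": Segment("genetic", required_level=1),
        }

    def put(self, segment: str, subject_id: str, value: object) -> None:
        self.segments[segment].records[subject_id] = value

    def get(self, segment: str, subject_id: str, clearance: int) -> object:
        """Return a record only if the caller's clearance is high enough."""
        target = self.segments[segment]
        if clearance < target.required_level:
            raise PermissionError(f"level {target.required_level} required")
        return target.records[subject_id]


store = SegmentedStore()
store.put("pii", "subject-1", {"name": "example"})
store.put("genetic", "subject-1", "GATTACA")
sequence = store.get("genetic", "subject-1", clearance=1)
```

A caller with clearance 1 can read the genetic segment but receives a `PermissionError` for the identity segment, so sequence data can be analyzed without exposing who it belongs to.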
Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in the art from the following detailed description, in which only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modification in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature and not as restrictive.
Incorporation by reference
All publications, patents, and patent applications mentioned in this specification are incorporated herein by reference to the same extent as if each individual publication, patent, or patent application were specifically and individually indicated to be incorporated by reference.
Brief description of the drawings
The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description, which sets forth illustrative embodiments in which the principles of the invention are utilized, and to the accompanying drawings (also referred to herein as "figures"), of which:
Fig. 1 shows an example of a conductance-versus-time plot for a sensor.
Fig. 2 shows an example of a schematic diagram of a biological data management system.
Fig. 3 shows an example of a diagram of a distributed network for biological data management.
Fig. 4 shows an example of a schematic diagram of a biological data management system in which the central server is located at a central location.
Fig. 5 shows an example of a flowchart illustrating a process that can be executed by an application server.
Fig. 6 shows an example of a flowchart illustrating a process that can be executed by a local repository.
Fig. 7 shows an example of a base probability matrix for a sensor read of a 21-mer.
Fig. 8 shows an example of additional dimensions of data retained for a read.
Fig. 9 shows examples of various sample identifiers.
Fig. 10 shows three examples of syntax.
Fig. 11 shows an example of conversion syntax.
Fig. 12 shows an example of application server input.
Fig. 13 shows an example of application server output.
Fig. 14 shows an example of a distributed file system.
Fig. 15 shows an example of an architecture for segmented access control.
Figs. 16A, 16B, 16C, and 16D show examples of tiered storage access schemes.
Fig. 17 shows an example of a computer system that is programmed or otherwise configured to manage biological data.
Detailed description
While various embodiments of the invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions may occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed.
As used herein, the term "subject" generally refers to an animal, such as a mammalian species (for example, human) or an avian species (for example, bird), or to another organism, such as a plant. A subject can be a vertebrate, a mammal, a mouse, a primate, a simian, or a human. Animals may include, but are not limited to, farm animals, sport animals, and pets. A subject can be a healthy individual, an individual who has or is suspected of having a disease or a predisposition to the disease, or an individual who is in need of treatment or suspected of needing treatment. A subject can be a patient.
As used herein, the term "genome" generally refers to the entirety of an organism's hereditary information. A genome can be encoded in deoxyribonucleic acid (DNA) or in ribonucleic acid (RNA). A genome may include coding regions that code for proteins as well as non-coding regions. A genome may include the sequence of any or all chromosomes of an organism. For example, the human genome has a total of 46 chromosomes. The sequences of all of these chromosomes together constitute the human genome.
As used herein, the term "genetic variation" generally refers to an alteration, variant, or polymorphism in a nucleic acid sample or genome of a subject. Such an alteration, variant, or polymorphism can be relative to a reference genome, which can be a reference genome of the subject or of another individual. Polymorphisms may include single nucleotide polymorphisms (SNPs). In some examples, one or more polymorphisms include one or more single nucleotide variations (SNVs), insertions or deletions (indels), repeats, small insertions, small deletions, small repeats, structural variant junctions, variable-length tandem repeats, and/or flanking sequences. Genetic variation may include copy number variants (CNVs), transversions, or other types of rearrangement. Genomic alterations may include base changes, insertions or deletions (indels), substitutions, repeats, copy number variations, or transversions.
As used herein, the term "polynucleotide" generally refers to a molecule comprising one or more nucleic acid subunits. A polynucleotide may include one or more subunits selected from adenosine (A), cytosine (C), guanine (G), thymine (T), and uracil (U), or variants thereof. A nucleotide may include A, C, G, T, or U, or variants thereof. A nucleotide may include any subunit that can be incorporated into a nucleic acid strand. Such a subunit can be A, C, G, T, or U, or any other subunit that is specific to one or more complementary A, C, G, T, or U, or that is complementary to a purine (for example, A or G, or a variant thereof) or a pyrimidine (for example, C, T, or U, or a variant thereof). A subunit can enable individual nucleic acid bases or groups of bases (for example, AA, TA, AT, GC, CG, CT, TC, GT, TG, AC, CA, or uracil counterparts thereof) to be resolved. In some examples, a polynucleotide may include deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or derivatives thereof. A polynucleotide can be single-stranded or double-stranded.
The systems and methods described herein can relate to genetic data management. Genetic data management may include network architecture, reporting, definitions and rules, instructions and actions, storage devices and storage management, privacy, encryption or compression.
Various types of sensors can be used to measure different genetic properties. Certain sensors can record and report at different levels of resolution. Some sensors can provide local base sequences. In some cases, a sensor can detect chemical modifications in DNA and RNA, for example, methylation, amination/deamination, oxidation and/or any other modification, as well as abasic (AP) sites.
A sensor can be configured to detect various types of signals, for example, optical signals, electrical signals or combinations thereof. Optical signals may include fluorescence, luminescence, chemiluminescence, bioluminescence, incandescence, laser light, light-emitting diode (LED) light, visible light, infrared radiation, near-infrared radiation or combinations thereof. Electrical signals may include current, voltage, differential impedance, tunneling current, resistance, capacitance, conductance or combinations thereof. Some approaches to genetic detection alter the native molecules in order to detect them. Some detection methods (e.g., polymerase chain reaction (PCR)) can rely on amplification, in which many copies of the original genetic polymer are produced.
The amplification process can in turn introduce significant mutation errors, which can make the results inaccurate. There can also be other sources of error, for example, electronic noise, phasing errors, spectral deconvolution errors, fluidic propagation errors, quantization errors, position within the read, sequence context, and spatial and spectral optical crosstalk. As a result, different sensors or detectors differ in signal quality, error pattern, measurement accuracy or the alternative interpretations of the sensed or measured data.
When managing these different types of genetic data, it can be important to manage information about the source of the data, how the data were measured, and the sensor, detection system, hardware, consumables, chemistry or software version used for the measurement. Each set of data may carry characteristic errors and uncertainties that may need to be taken into account.
Another issue in managing genetic data can be managing data storage. Different memory technologies and devices can be used. Various types of specific storage media can be used, which can be specified in connection with the nature, quality or quantity of the genetic data. Various types of genetic data, for example, DNA or RNA sequences, can be stored in multi-cell memory devices. Memory blocks can be used in various ways with respect to the characteristics of the genetic data. For example, there can be a relationship between the size of a memory block and the type and size of the data stored in the memory block.
Data acquisition
One or more biosensors can detect raw data from a molecular strand. Each raw data read can be converted into a natively formatted record of that read. For example, if the sensor senses and measures conductance, it can generate a time series of conductance as the strand passes through the sensor, as shown in Fig. 1.
In the case of deoxyribonucleic acid (DNA) or ribonucleic acid (RNA), the raw conductance data can later be interpreted as nucleotide base data or records.
Raw data from the sensor can be delivered to an application server. The data can depend on the sensor type and can originate from electrical characteristics, for example, conductance, capacitance, current (e.g., tunneling current), voltage, resistance or any combination thereof. The data may include optical data, for example, optical data derived from fluorescence (e.g., from a modification or subunit, such as a nucleic acid base, carrying a fluorescent label), luminescence (e.g., chemiluminescence) or absorbance.
Data transfer from the sensor to the application server can be performed using a wireless module integrated with the sensor over a wireless protocol such as Wireless Fidelity (Wi-Fi), Bluetooth or near-field communication (NFC). Data transfer can also be performed over a wired connection such as universal serial bus (USB).
The application server may include a desktop computer, a laptop computer, or a mobile device such as a mobile phone (e.g., an iPhone or Android phone) or a tablet computer (e.g., an iPad or Android tablet).
The application server can have an instruction set that receives the raw signal data and generates base data using certain base-calling routines. These routines can be programmed and updated on the application server based on the capabilities and characteristics of the sensor or on other global instructions, as described elsewhere herein.
For example, sensor updates can be received from or pushed by the sensor manufacturer to improve signal measurement or to change hardware or firmware.
As shown in Fig. 2, the application server or central server 201 may include, or have access to, a private database of definitions and rules that the application server or central server receives from a local repository 202. The definitions and rules can be updated as needed. The definitions and rules can identify various conditions and actions. For example, there can be pathogen signatures or sequences, or any other data relating to a specific pathogen that can be detected by a local sensor. The definitions and rules can therefore be customized and can be dynamic. The application server 201 can communicate with a local master device 205, which can serve as a resource for the data that the application server interprets or obtains. The local master device 205 can communicate with a local slave device 206, which can reside in the same facility but can provide limited functionality with fast access to the local master device. The local repository 202 can communicate with end node 1 203 and end node 2 204, which can be measuring devices.
When the application server performs a measurement, it can compare its results with the definitions and rules accessible to it, and can then suggest instructions accordingly.
If no definition or rule is available for a particular situation, the application server can communicate with its local repository 202 about that situation.
A local repository may include a server that is network-connected to one or more application servers, as shown in Fig. 3. The local repository 301 may include, or have access to, a larger database and more definitions and rules, or updated definitions and rules.
For example, a local repository can be network-connected to a central server 302. The central server can be network-connected to multiple local repositories 301, which can in turn be network-connected to local application servers 303.
As shown in Fig. 4, the central server can be located at a central site, for example, a national laboratory or a health organization facility.
The role of the central server may include transmitting or updating definitions and rules, together with instructions, to multiple local repositories and receiving reports from them.
Several scenarios are possible, depending on the point of view of a given machine. In some cases, one or more of the operations shown in Fig. 5 can be performed for the application server:
The sensor measures a signal from a polynucleotide 501;
The sensor transmits the signal data to the application server 502;
The application server receives the signal data and generates base data 503;
The application server identifies sequence data based on the base data 504;
The application server analyzes the sequence data against the definitions and rules received from the local repository 505;
The application server provides a message to the user based on the analysis 506;
If needed, the application server transmits the sequence data to the local repository 507.
Fig. 6 shows possible operations performed by the local repository, which can correspond to the set of operations described in Fig. 5 when the application server transmits sequence data to the local repository:
The local repository receives base data from the application server 601;
The local repository checks the definitions and rules 602;
The local repository transmits exceptions relating to the base data to the central server 603;
The local repository receives global and regional updates from the central server 604;
The local repository updates the definitions and rules 605;
The local repository transmits the new definitions and rules to the application server 606;
The central server passes instructions to the local repository; and
The local repository passes the instructions to the application server.
The application server can communicate with the local repository directly or over a network. The local repository can periodically send the application server the updates that the local repository receives from the central server.
The central server can be located at a central laboratory or health center and can analyze the sequence data transmitted by the local repositories. The central server can have access to a sequence database.
Example: pathogen
The sequence database may include a database of pathogen sequences. The central server can quickly access recently reported pathogen sequences by using faster memory and communication channels.
When a local repository receives information that may relate to a new pathogen or a known harmful pathogen, the local repository can look in its private database for definitions and rules, provided by the central server, that may be relevant to the received sequence. Based on a comparison of the received sequence data with the sequences associated with specific definitions and rules in the private database, the local repository can take the appropriate option. For example, the local repository can find a specific rule and then pass specific instructions to the application server.
Alternatively, if the definitions and rules in the local repository meet a certain set of criteria, the received sequence can be forwarded to the central server.
The central server can have access to a larger database, such as a comprehensive central database of recent and/or earlier outbreaks. The central server can continuously update the central database based on what the central server collects from multiple local repositories.
The central server can be accessed by a central laboratory or health center, where health or security professionals can access events that meet certain predetermined thresholds and be alerted to them.
The organization operating the central server can make various decisions. These decisions may include automatic or semi-automatic decisions. For example, if the central laboratory determines that a sequence is not dangerous, the central laboratory can transmit to the local repository a decision to ignore the instance. Alternatively, if there are indications of a more serious condition, the central server can tag the sequence with instructions dedicated to that instance and keep the instructions in fast-access memory. Subsequent instances with the same or a similar pattern that are reported to the central laboratory can receive the same instructions. The instructions may include decisions regarding medication, quarantine, rest and the like.
When the central laboratory has resolved and classified a situation, the central laboratory can establish definitions and rules relating to the situation. These definitions and rules, together with instructions, can then be transmitted to the relevant local repositories. For example, if an outbreak in a geographic region has ended, the central server can update any or all local repositories connected to end users and application servers associated with that region, while placing other regions near that region on alert.
With regard to food safety, multiple sensors at different locations can measure sequences from various types of food. The sensors at these locations can measure sequences and search for candidate pathogens. Each sensor can communicate with an application server. A sensor can measure signals from a sequence and send the raw data to the application server.
The application server may include a set of definitions and rules. When the application server receives raw data from a sensor, the application server can run a program to generate base reads from the raw data and to generate contig sequences from the base reads. After generating the contig sequences, the application server can run a program that compares the base data or sequence data with the pre-established definitions and rules. These definitions can be located in a database accessible to the application server. The definitions can be stored remotely on a dedicated server. There can be a subset of definitions designated as especially important or critical. For example, there can be a set of recent or current pathogen information. Such especially important or critical data can be stored in faster-access memory that the application server can readily access. In some cases, the application server can be directed by an instruction or rule to search for a specific pattern. For example, the specific pattern can relate to a current outbreak or to reports from other sensors that may have indicated a pathogen in a similar type of food (e.g., produce).
The application server can be in network communication with the local repository. The local repository can serve many application servers that hold definitions and rules and can provide instructions to the application servers. The local repository can therefore periodically send updates to the application servers.
If the application server does not find an appropriate definition or rule for a particular situation, the application server can send the sequence data or other biological data to the local repository. The local repository can then search a broader database of definitions or rules to which it has access. The database can be shared among one or more local repositories. The database can have, for example, a larger set of known pathogens, or can have pathogens associated with historical outbreaks that have not been observed for some time. Alternatively, such pathogens may not have been observed near the sensor location, but the local repository can access a database that records them and can therefore be aware of them.
In a particular situation, the local repository can take any of a variety of options. For example, the local repository can look up the definitions and rules associated with a pathogen and pass them to the application server together with certain instructions. Alternatively, the local repository can pass the data to the central server.
The local repository can have its own definitions and rules received from the central server. The central server can be in network communication with many local repositories. The central server can therefore periodically update the definitions and rules at the local repositories.
If the local repository cannot find any definition or rule for a particular situation, the local repository can choose to pass the data to the central server. A rule can require the local repository to report any base data, sequence data or biological data that may indicate a special situation.
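The escalation path described above (application server, then local repository, then central server) can be sketched as a tiered lookup. This is a minimal illustration only: the function name `resolve`, the tier contents, the signature strings and the instruction strings are all hypothetical, and exact substring matching stands in for whatever comparison a real rule would use.

```python
def resolve(sequence, tiers):
    """Return (tier name, instruction) from the first tier whose rules match."""
    for tier_name, rules in tiers:
        for signature, instruction in rules.items():
            # Hypothetical matching step: a plain substring test.
            if signature in sequence:
                return tier_name, instruction
    # No tier recognised the sequence; it is left for human assessment.
    return None, None


# Tiers are consulted in escalation order; all entries are made up.
tiers = [
    ("application server", {"GATTACA": "notify user"}),
    ("local repository", {"CCGGTTAA": "expand sampling"}),
    ("central server", {"TTTTGGGG": "quarantine batch"}),
]

print(resolve("AACCGGTTAACC", tiers))
```

In this sketch the application server has no matching rule, so the lookup falls through to the local repository, which mirrors the case in which sequence data is forwarded one level up.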
The central repository can be located in, used in or used by a central laboratory that includes researchers or health professionals. For example, a national or international health center can control the central repository. When a special situation is detected and passed from a sensor to the central server, the central server can access a large set of definitions or rules to handle the situation. Optionally, when certain predetermined thresholds are reached or at the user's discretion, researchers or health professionals can assess the situation and its severity.
A single sample can produce multiple gigabytes of raw analog conductance information, representing millions of reads of sequence information. An initial interpretation process can consume these analog reads and can filter out ambient noise recorded when no molecule passes through the molecular sensor or when contaminants cause unreliable or invalid results. The interpretation process can interpret the data and convert it into base sequence strings. Each base call can be associated with one or more data dimensions. For example, a dimension or vector can represent the probability level of the base being read, as shown in Fig. 7.
Fig. 7 shows the base probability matrix of a 21-mer read from a sensor that can sense an abasic (AP) site or one of five possible bases. The determined base sequence 310 can represent the most probable base at each position in the read. The possibilities for an abasic site or a base can include:
A=adenine
B=abasic site
C=cytosine
G=guanine
T=thymine
U=uracil
Each column shows the probability of a specific nucleotide base at each position in the sequence. The sensor-side node or application server can interpret the probability of each possible base at each position. For example, the figure shows that cytosine (C) is the most probable base at the 16th base position.
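As a hedged illustration of how a determined base sequence can be derived from such a probability matrix, the following sketch picks the most probable symbol at each position. The function name `call_bases`, the dictionary layout and the probability values are assumptions made for the example; they are not taken from Fig. 7.

```python
# Symbols tracked per position: five bases plus "B" for an abasic (AP) site.
SYMBOLS = "ABCGTU"


def call_bases(prob_matrix):
    """Return the most probable symbol at each position of a read."""
    called = []
    for column in prob_matrix:
        # Each column maps a symbol to its probability at that position.
        best = max(SYMBOLS, key=lambda symbol: column.get(symbol, 0.0))
        called.append(best)
    return "".join(called)


# Illustrative three-position read; the numbers are invented.
read = [
    {"A": 0.81, "T": 0.10, "C": 0.05, "G": 0.02, "U": 0.01, "B": 0.01},
    {"T": 0.74, "A": 0.12, "C": 0.08, "G": 0.03, "U": 0.02, "B": 0.01},
    {"C": 0.6774, "T": 0.14, "G": 0.09, "A": 0.05, "U": 0.03, "B": 0.0126},
]

print(call_bases(read))
```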
Fig. 8 shows how additional dimensions of data can be retained for a read. In this illustration, a modification table indicates, at each base position, whether the base is methylated, oxidized or acylated. In this example, the third and fourth bases comprise a methylated 5'-C-phosphate-G-3' (CpG) pair. The cytosine (C) is also believed to be oxidized. The associated base probabilities correspond to the determined base sequence. A distance table, or transition position table, contains the distance, in number of bases, between transitions to a new base, providing the length of each determined homopolymer. The example shows a run of about two thymine (T) bases before a transition to adenine (A). It also shows, later in the sequence, two adenine (A) bases before a transition to guanine (G). Storing this dimension of the read data can address the inherent uncertainty that this type of sensor has about the number of bases of the same type in a sequence or subsequence.
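One way to picture a transition distance table is as a run-length encoding of the determined base sequence. The sketch below is an assumption about how such a table could be built from an already called sequence; the name `transition_table` is hypothetical, and a real sensor would report these distances with their own uncertainty rather than derive them from called bases.

```python
from itertools import groupby


def transition_table(sequence):
    """List each homopolymer run as (base, run length) in read order."""
    return [(base, len(list(run))) for base, run in groupby(sequence)]


# Two T bases, then two A bases, then a transition to G.
print(transition_table("TTAAG"))
```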
Other dimensions may include the total length and the base position expressed as the distance from the start of the read. Some sequencing technologies start at one end of an oligonucleotide and proceed by sequencing by synthesis (SBS). Such a process can involve looking for base incorporation after each round (e.g., one base at a time). Phasing errors can therefore arise each time a base is introduced. For example, if a clonal population is present, base incorporation can be non-uniform across the population. Some members may incorporate more than one base, while other members may incorporate none. Confidence can therefore decrease further along the sequence read. A fourth dimension may include the distance, or number of bases, from the base-paired end or primer cleavage end of the analyzed sequence to the base transition.
Raw data reads can be retained for further analysis. For example, one may wish to improve sensitivity by detecting polymer creep, phototoxicity, the presence of contaminants that affect the sensor, or changes in the atomic structure of the nanogap tips. The uncertainty of a base call can be specific to the make and model of the sensor used.
For example, an interpretation process controller can pass each filtered conductance record to an individual interpretation worker process or thread. Each raw read can be interpreted without regard to locking, because there may be no shared data. Synchronization may be unnecessary, because the downstream interpretation process can run repeatedly on the growing set of interpreted sample data until the interpretation reaches its completed state with an acceptable confidence level.
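The lock-free worker arrangement described above can be sketched with a thread pool in which each worker handles one record and shares no mutable state with the others. The `interpret` function here is a hypothetical stand-in for base calling (it simply thresholds a conductance trace), and the record layout is an assumption made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def interpret(record):
    """Hypothetical stand-in for base calling: threshold a conductance trace."""
    read_id, trace = record
    bases = "".join("A" if level > 0.5 else "T" for level in trace)
    return read_id, bases


def interpret_all(records, workers=4):
    """Interpret every record independently; no locks are needed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each call touches only its own record, so nothing is shared.
        return dict(pool.map(interpret, records))


records = [("read-1", [0.9, 0.1, 0.8]), ("read-2", [0.2, 0.7, 0.3])]
print(interpret_all(records))
```

Because each worker returns its own result, the controller only has to collect the outputs, which matches the observation that synchronization may be unnecessary.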
In addition, the system can work with sensors from different vendors that use various technologies to sense sequences. In some cases, raw information may be unavailable. Instead, reads can be obtained from the sample, where the probabilities and induced errors are specific to the technology used. Each technology has advantages and disadvantages and can have different sensitivities. Each technology can have a different resolution for the various aspects or dimensions of reading a DNA or RNA sequence. Some technologies can be very sensitive to the transition from one base to the next but less sensitive to the specific base of interest. In such cases, further analysis of the base reads may be desirable.
Some technologies can be especially good at base determination but weaker at detecting base movement or transitions. This can result in a high likelihood of identifying a particular base but lower certainty about the number of bases and repeats. Another technology can read each base along an oligonucleotide (e.g., one at a time) with an additive error model, so that the farther a base is from the start marker, the less certain its sensing becomes.
Accordingly, various embodiments support storing interpreted sequence base data in non-volatile memory using various file and record schemas and formats. For example, data from a sample in an Extensible Markup Language (XML) or JavaScript Object Notation (JSON) file can be stored on a distributed file system.
A file may include reads stored as a single base value for each nucleotide in the strand. A read can also be stored with probability values. Alternatively, a read can be stored as a complete probability matrix over each possible base at each nucleotide position. A possible syntax may include a metadata syntax that describes, using one or more attributes, the content stored in a read record.
Depending on the various factors involved in sample collection, there are various examples of semi-structured read formats, which various embodiments can interpret and use. Examples of these factors may include sample preparation, the make and/or model of the sensor, or the analysis of the data. A sample file may include a simple, basic schema that includes a unique sample identifier with one or more base reads.
Fig. 9 shows examples of a sequence read, base-format reads and syntax. Part A shows a read containing the determined base sequence. Part B shows an example of the same base-format read including probability data for each base. The syntax of the second example includes one word describing each single base. For example, the word "C67.74" describes the third base as cytosine (C) with a probability of more than 67%.
The third example, shown in Part C, shows the same base-format read, in which each word describes a single base position. In this example, each word describes the base, its probability and any modifications. For example, the word "Cf67.74" describes the third base as cytosine (C) with a probability of about 67%. Modifications can be recorded in each word by adding a lowercase letter after the base. In this embodiment, the absence of a lowercase letter indicates that the base is not methylated, oxidized or acylated. The lowercase letters "a" to "h" can be translated into the numbers 1 to 8 to hold the bitmask of the modification table. Methylation corresponds to the most significant bit (MSB) (4), oxidation to (2), and acylation to the least significant bit (LSB) (1). A cytosine (C) base modified with "f" therefore indicates that the cytosine is methylated and oxidized.
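The word syntax above can be decoded with a short parser. The following is a minimal sketch, assuming a word consists of one uppercase symbol, an optional lowercase modification letter and a decimal probability; the regular expression, the function name `parse_word` and the output field names are assumptions, while the letter-to-bitmask mapping (a to h as 1 to 8, with methylation 4, oxidation 2 and acylation 1) follows the description.

```python
import re

# One symbol, an optional modification letter, then a probability.
WORD = re.compile(r"([ABCGTU])([a-h]?)(\d+(?:\.\d+)?)")

# Bit values of the modification table as described in the text.
FLAGS = {"methylated": 4, "oxidized": 2, "acylated": 1}


def parse_word(word):
    """Decode one base-format word into base, probability and modifications."""
    match = WORD.fullmatch(word)
    if match is None:
        raise ValueError(f"unrecognised word: {word!r}")
    base, letter, probability = match.groups()
    # "a".."h" map to 1..8; a missing letter means no modification.
    mask = ord(letter) - ord("a") + 1 if letter else 0
    record = {"base": base, "probability": float(probability)}
    record.update({name: bool(mask & bit) for name, bit in FLAGS.items()})
    return record


print(parse_word("Cf67.74"))
```

The letter "f" maps to 6, which sets the methylation bit (4) and the oxidation bit (2) but not the acylation bit (1).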
According to the systems and methods described herein, secondary and tertiary possible base values can be maintained, along with any modifications to those bases and any other data dimensions recorded by the sensor. Fig. 10 shows three examples of syntax for storing the following: (A) each of the six tracked base or AP-site possibilities; (B) the two most probable base or AP-site possibilities; or (C) an array of base position probabilities maintained only where a probability exceeds some predetermined threshold. In the first example, shown in Part A, the file stores the probability of each of the six possibilities at every base position; at the third base position in the read, these range from cytosine (C), which has the highest probability at more than 67%, to the abasic site, which has the lowest probability at less than 2%. If only the two most probable base values are maintained, the base position can be regarded as primarily a cytosine (C) base or, alternatively, with a probability of about 14%, a thymine (T) base, as shown in Part B.
Storing probabilities only when they exceed a predetermined threshold can be implemented with the length/value syntax shown in Part C. A base position with two base possibilities exceeding a 15% threshold results in the leading number "2" as the first character of the word "2C64.46", which also gives the length of the base array retained for that base position. Cytosine (C) has the highest probability, at 64%, and guanine (G) also exceeds the 15% threshold.
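The threshold-based length/value idea can be sketched as follows. The text shows only the prefix "2C64.46" of such a word, so the way the remaining entries are concatenated here, the function name `encode_position`, and every probability other than 64.46 are assumptions made purely for illustration.

```python
def encode_position(probabilities, threshold=15.0):
    """Keep only possibilities above the threshold, prefixed by their count."""
    kept = sorted(
        ((base, p) for base, p in probabilities.items() if p > threshold),
        key=lambda item: item[1],
        reverse=True,
    )
    # Assumed layout: count, then base letter and probability for each entry.
    return str(len(kept)) + "".join(f"{base}{p:.2f}" for base, p in kept)


# Invented probabilities (in percent) for one base position.
position = {"C": 64.46, "G": 17.20, "T": 9.00, "A": 5.00, "U": 3.00, "B": 1.34}
print(encode_position(position))
```

With a 15% threshold only cytosine and guanine survive, so the encoded word starts with the length "2", consistent with the example in the text.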
A transition syntax can also be used for sensors that record the dimension of distance between base transitions, as shown in Fig. 11.
The application server can collect millions of reads from a sample. Longer aligned sequences, or contig data, can then be identified from analysis of the reads. For further evaluation, the application server can align the base reads to a reference. Alternatively, reads can be clustered with several other reads and used for de novo assembly. The application server can be extensible, allowing it to call other processes that accept only a subset of the information stored in the semi-structured read format. For example, the interface of an alignment process may accept reads in FASTA-formatted or FASTQ-formatted syntax. In that case, the reads can be converted into a format that the alignment process understands.
For example, when converted to FASTQ format, the example read described in Fig. 12 can look similar to the following four lines:
@10032QB:1157S:1.1:20151221:09:42:37
ATCGTCGAGBAGTTACAAGCT
+10032QB:11578:1.1:20151221:09:42:37
'*&*'+%+)&(%'(&&)&&&(
The bases and the corresponding Phred quality scores can be sent. The reads can be interpreted, and contigs can be returned from the consensus algorithm of the alignment process. A sample may include millions of reads. Reads can be aligned to a reference sequence or assembled de novo. Converting base reads into a different syntax can lose some of the context or resolution of the base reads. In the example shown in Fig. 13, in addition to the base sequence and the probabilities or quality scores that are sent to and returned from the program that aligns reads into contigs, the sensor can also capture transition distances and chemical modifications. The application server can perform the alignment and, once consensus is determined, re-apply some of the lost context or resolution to the contig sequence, so that the contigs are stored in a semi-structured syntax similar to that of the reads. For example, for a contig derived from base reads that contain chemical modifications, the application server can re-apply any modifications that were not used in aligning the reads.
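A conversion from per-base probabilities to a FASTQ record can be sketched as below. The Phred relationship Q = -10 log10(error probability) and the offset of 33 for the quality character are the common FASTQ convention; the function names, the cap of Q40 and the sample identifier are assumptions for the example and are not stated in the text.

```python
import math


def phred_char(p_correct, max_q=40):
    """Convert a probability that the call is correct into a quality character."""
    # Guard against log10(0) by flooring the error at the capped quality.
    p_error = max(1.0 - p_correct, 10 ** (-max_q / 10))
    q = min(max_q, round(-10 * math.log10(p_error)))
    return chr(q + 33)  # Phred+33 encoding


def to_fastq(read_id, bases, probabilities):
    """Build a four-line FASTQ record; other read dimensions are dropped."""
    quality = "".join(phred_char(p) for p in probabilities)
    return f"@{read_id}\n{bases}\n+\n{quality}\n"


print(to_fastq("sample-read-1", "ATC", [0.90, 0.99, 0.6774]))
```

As the text notes, a record like this carries only bases and quality scores, so transition distances and chemical modifications would have to be kept elsewhere and re-applied after alignment.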
The application server can analyze the contig sequence data against the definitions and rules received from the local repository. Installations can be distributed across end nodes, servers and/or repositories that are networked and cooperate to supervise the acquisition of sequence data and to operate on it. In one aspect, the application server may include rules for efficiently discovering and acting on genetic sequence information. Sequence discovery can be directed to finding pathogens. In other cases, one may want to discover contigs of certain expressed genes. Various embodiments allow a person such as a microbiologist to manage a database of pathogen or gene sequence definitions. A rule definition can be assigned to, or associated with, a specific instruction or instruction set.
A central control and rule management module can process these rules. In some cases, it can transform or further modify a rule so that the rule runs on specific downstream servers and nodes. Many rules can be distributed automatically.
For example, a rule may include a simple sequence, a matching procedure, weightings, one or more regression adjustments, or instructions to bundle the sample information into a BioSample that conforms to the National Center for Biotechnology Information (NCBI) format and to notify a department head.
An instantiation of the system in this example may include a base sensor, local nodes and/or a local server. Rules can be adjusted according to the specific device on which they execute. The application server can attempt to discover sequences from each individual read or contig. The discovery portion of a rule can be better served by modifying a higher-level rule to discover sequences more effectively based on the make or model of the sensor used. A high-level rule can be to align a sequence with a contig having fewer than a predetermined number of variances, based on the type of sequencing device used. In some cases, global methods and scoring can be used, while for other sequencing devices local methods and scoring can be applied. Alternatively, for example, if the sensor used is a Roche 454, the mapping of a sequence to a contig can have a variance threshold based on the flowgram.
In one embodiment, rules can be distributed, and a rule may include cooperation with a dedicated application server. This can allow more accurate results with fewer errors, without negatively affecting the overall performance of the end sequencing device. For example, a device can have multiple sensor nodes that test food samples:
The read signals are sent to the application server to be interpreted into base reads and subsequent contigs.
The original application server executes, for each base read, a rule with a simple, low-processing-cost sequence alignment algorithm against an array of pathogen signatures.
If one or more pathogens meet a threshold number of close matches or a threshold score, the rule may include:
Expanding the sampling at the sensor; and/or
Bundling the complete sample and forwarding it to a dedicated pathogen-test application server for a more rigorous interpretation of the sensor measurements.
The pathogen-test application server can then apply its own instructions based on its findings.
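The first-pass rule described in the steps above can be sketched as a cheap screen followed by a threshold check. This is only an illustration: exact substring matching is used here in place of a real low-cost alignment algorithm, and the function name `apply_pathogen_rule`, the signature names and the action labels are hypothetical.

```python
def apply_pathogen_rule(reads, signatures, threshold=2):
    """Count signature hits per pathogen and decide on follow-up actions."""
    # Cheap screen: substring matching stands in for a simple alignment.
    matches = {
        name: sum(signature in read for read in reads)
        for name, signature in signatures.items()
    }
    flagged = {name: count for name, count in matches.items() if count >= threshold}
    # Both follow-up actions from the text are requested when anything is flagged.
    actions = ["expand_sampling", "forward_sample"] if flagged else []
    return flagged, actions


reads = ["AAGATTACAGG", "TTGATTACATT", "CCCCCC"]
signatures = {"pathogen_x": "GATTACA", "pathogen_y": "TTTTT"}
print(apply_pathogen_rule(reads, signatures))
```

Keeping the screen this cheap leaves the more rigorous interpretation to the dedicated pathogen-test application server, as the text describes.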
This embodiment can ensure that information is protected both when it is transmitted across the network and when it is stored in a repository.
For data in transit, an encryption scheme such as Secure Sockets Layer (SSL) or Transport Layer Security (TLS) can be used. Data can be generated at a sensor. These end-node sensors can support a connection to a local application server, which analyzes the raw data into base reads. The application server can further analyze the base reads into contigs or sequences. Alternatively, the application server can transmit the reads to another application server to create base reads and sequences. Communication between sensors and application servers, between cooperating application servers, between application servers and repositories, and between application servers and services can support Secure Sockets Layer (SSL) or Transport Layer Security (TLS) connections. This can include the servers that associate base reads and sequences with other metadata (such as a name or geographic location) and that apply rules and instructions.
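As a hedged sketch of the client side of such a connection, the following builds a TLS context with Python's standard `ssl` module. The choice of TLS 1.2 as the minimum version and the function name `make_client_context` are assumptions for the example; the text does not specify protocol versions.

```python
import ssl


def make_client_context():
    """Create a client-side TLS context for sensor-to-server connections."""
    # The default context verifies server certificates and host names.
    context = ssl.create_default_context()
    # Assumption: refuse anything older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_client_context()
print(context.minimum_version)
```

A socket wrapped with this context would then carry raw data, base reads or sequences between nodes without exposing them in transit.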
For data at rest (for example, not in transit), various mechanisms can be used to protect the data. Data can be stored in multiple locations. Sample data can be stored in a file system. Each sample may include semi-structured data files. A process can perform marshalling, unmarshalling, and/or deletion of sample files.
The resulting contig or sequence data can be stored in a similar manner, as multiple semi-structured files. Contig data can be stored in a distributed file system because contig data may comprise large data sets that can be continuously mined and analyzed to test hypotheses, which may require a repository that supports highly concurrent access. As with sample files, a process can perform marshalling, unmarshalling, and/or deletion of contig files. These files can be anonymous. The encryption and compression mechanisms can be tuned to obtain lower central processing unit (CPU) access cost and higher read throughput.
When a sequence is stored in the repository, only an identifier may be associated with the contig. The contig may be de-identified with respect to the subject, location, contact details, or study corresponding to the sample. Identity data can be stored in a repository different from that of the sequences. Likewise, base reads from a sample may be associated only with a unique identifier. If raw data is retained, it too may be associated only with an identifier. Identity data can be placed in a separate database. Identity data can be kept in a relational database. Sample-identity and contig-identity reference tables can be maintained so that, where access control permits, a sample and a contig can be linked and re-identified. A different set of access controls can be applied to anonymous samples. Identity data and sequence data can be encrypted at rest.
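The separation of sequence data from identity data, joined only through a reference table, can be illustrated with a small sketch. Two in-memory SQLite databases stand in for the separate repositories; the table names, column names, and the boolean `allowed` flag standing in for a real access-control decision are all hypothetical, and encryption at rest is omitted for brevity.

```python
import sqlite3
import uuid

# Two separate in-memory databases stand in for the separate repositories.
sequence_db = sqlite3.connect(":memory:")
identity_db = sqlite3.connect(":memory:")

sequence_db.execute("CREATE TABLE contig (contig_id TEXT PRIMARY KEY, bases TEXT)")
identity_db.execute("CREATE TABLE identity (identity_id TEXT PRIMARY KEY, name TEXT)")
identity_db.execute(
    "CREATE TABLE contig_identity (contig_id TEXT PRIMARY KEY, identity_id TEXT)"
)

def store_contig(bases: str, name: str) -> str:
    """Store the contig under an opaque identifier; keep the name elsewhere."""
    contig_id = uuid.uuid4().hex
    identity_id = uuid.uuid4().hex
    sequence_db.execute("INSERT INTO contig VALUES (?, ?)", (contig_id, bases))
    identity_db.execute("INSERT INTO identity VALUES (?, ?)", (identity_id, name))
    identity_db.execute(
        "INSERT INTO contig_identity VALUES (?, ?)", (contig_id, identity_id)
    )
    return contig_id

def reidentify(contig_id: str, allowed: bool):
    """Follow the reference table only when access control permits it."""
    if not allowed:
        raise PermissionError("re-identification not permitted")
    row = identity_db.execute(
        "SELECT i.name FROM contig_identity ci "
        "JOIN identity i ON i.identity_id = ci.identity_id "
        "WHERE ci.contig_id = ?",
        (contig_id,),
    ).fetchone()
    return row[0] if row else None
```

Because the sequence repository holds only `contig_id` and `bases`, a process that can mine contigs learns nothing about the subject unless it is also granted access to the reference table.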
Sample data, contigs, and sequences can represent relatively static data sets. After being added to the repository, they may seldom be updated. They can represent up to petabytes of data (thousands of terabytes, or millions of gigabytes). Analytical processing of these very large data sets can be achieved by using a distributed file system that stores protected semi-structured data sets, which can be accessed by processes such as MapReduce or Spark and reduced into working transactional or columnar databases.
For example, Figure 14 shows an example of a distributed file system in which information is kept in three separate storage systems, one each for samples 1401, contigs 1402, and operational data 1403. Raw sample data 1401 can be interpreted and converted into a semi-structured format consisting of molecular reads together with simple or basic metadata about the sample. The basic metadata can include a sample identifier. All other metadata about the sample can be considered job information. Job information can be stored separately in a database with reference to the sample identifier. After processing, the sample data may or may not be retained. If sample data is retained long term and used or accessed for other purposes, it can be stored in the distributed file repository 1404. Alternatively, if sample data is retained for a long time but is not normally accessed or used for other purposes, it can be archived.
Sample data can be further interpreted, aligned, or assembled into sets of contigs or sequences. These contigs can be stored in the distributed file system 1404 in a semi-structured format such as XML or JSON, with an assigned contig identifier. As with sample data, other metadata about a contig can be job information and can be stored separately in a database with reference to the contig identifier.
Contigs can also have operational data. Operational data may include additional data captured and used beyond the reads and derived contigs. This may include information about the process involved in capturing the information, for example, the brand, model, or serial number of the equipment used; sample preparation information; source information; the location where the sample was obtained; and protected health information, such as the name and contact details of the patient.
These sample data and contig data files can be compressed to increase capacity, with the understanding that doing so incurs a computational cost when the files are read. These files may be encrypted. Because the information in these files can be anonymous, this embodiment can use an encryption algorithm with a high-performance (for example, secure) decryption counterpart. A hardware encryption accelerator can be used to minimize encryption and decryption costs.
Operational data may include additional information stored for re-identifying or using samples and contigs. Operational data can also include phenotype patterns with associations between identity, sequence, and phenotype 1405. Operational data can also be encrypted. However, although performance can be an important factor in determining which algorithm to use, security can be an important factor for operational data. Furthermore, fine-grained security and access, such as record-level access, can be implemented for operational data.
The sample store and the contig/sequence distributed store can use symmetric keys to encrypt semi-structured files. The application server processes responsible for marshalling and unmarshalling files can maintain a list of the files' keys in a secure wallet. In addition, hosts running application server processes may include accelerators, such as Intel Advanced Encryption Standard New Instructions (AES-NI).
One benefit of the embodiment can be that the repository is modeled to maintain, and to provide the tools necessary to access and mine, a large collection of biological information that can be stored long term in an anonymous environment. Anonymous contigs and, optionally, the original sample data can be retained, enabling researchers to improve the understanding of genetics.
In some embodiments, a doctor may be able to access a patient medical record that includes genetic contigs linked to related job information. In this example, the doctor works in an application that provides two distinct types of access: efficient access to specific contig and sequence sets, and secure access to the operational data linked to the contigs and sequences.
Example 1: research
In a research environment, raw sample data from multiple sensors made by different manufacturers is sent to an application server. The application server interprets the raw data and determines the base sequences of some or all of the reads in the raw data. The application server then either performs the alignment analysis itself or formats the reads into a syntax understood by an external alignment analysis server tool that it calls. The resulting contigs are returned from the external server to the application server.
In some cases, the application server re-applies information from the sample reads back to the contigs. The reconstructed contigs are tagged with identifiers and transferred to the contig repository, where they are saved as semi-structured files in the application server's distributed file system. Other information related to a contig (such as source, identity, location, and/or address) is inserted into the repository's working database.
Additional meta-information, such as a classification, can be merged into the semi-structured files to allow data to be stored or reduced efficiently during mining in the distributed file system. The contig repository grows over time.
A researcher formulates a hypothesis about a causal or probabilistic relationship between a specific genetic signature and one or more phenotypes. The contig repository is mined. The specific signatures and their associated identifiers are extracted as independent variables and loaded into a database to test the researcher's theory.
The signatures can then be mapped to phenotypes obtained from external sources.
Hypotheses that prove useful can be saved and incorporated into a separate database 1406, in the application server, of gene signatures associated with gene expression and phenotypes.
The semi-structured files are encrypted, as is the database. Access is controlled down to the level of sample and contig identifiers.
Sample and contig information can be retrieved without the job information, which has a different security level. For example, a researcher may be permitted to access all contigs in the system but not permitted to access any contig together with its related job information.
Access control is abstracted and can support concepts such as group and role security. Fine-grained security with abstracted controls can provide security and privacy that remain effective over time. As an example, employees of a medical group may access an embodiment that stores bioinformatics information about some or all patient members of the medical group. Over time, the doctor responsible for a particular patient can change. Doctors can access only the bioinformatics information of the patients for whom they are currently responsible.
Access permissions are granted through a robust public/private key management system, which provides support for non-repudiation.
A management program can manage the nodes and users of the system. The management program may include a certificate authority service for issuing keys and maintaining a certificate revocation list. Processes running on end-node sensors, application servers, and distributed file system managers have public/private key pairs that allow them to act on information. Users also generate key pairs. A user can have multiple key pairs associated with their account to support authentication from multiple different computers, tablets, or other computing devices.
The concept of roles or groups is supported. Access to stored data is defined by role, and a currently active user may belong to one or more roles.
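Role-defined access of the kind described above can be sketched in a few lines. The role names, permission strings, patient identifier, and the table recording which physician is currently responsible for a patient are hypothetical; key management and non-repudiation are outside the scope of this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)

# Hypothetical role-to-permission mapping; the permission names are illustrative.
ROLE_PERMISSIONS = {
    "researcher": {"contig:read"},
    "physician": {"contig:read", "operational:read"},
}

# Hypothetical assignment of patients to the physician currently responsible.
CURRENT_PHYSICIAN = {"patient-17": "dr_lee"}

def can_access(user: User, permission: str, patient_id=None) -> bool:
    """Grant access if any of the user's roles carries the permission.

    Operational (identifying) data additionally requires that the user is
    the physician currently responsible for the patient.
    """
    granted = any(permission in ROLE_PERMISSIONS.get(r, set()) for r in user.roles)
    if not granted:
        return False
    if permission.startswith("operational:"):
        return CURRENT_PHYSICIAN.get(patient_id) == user.name
    return True
```

Reassigning a patient only requires updating `CURRENT_PHYSICIAN`; the previous physician keeps the role but loses access to that patient's operational data, which matches the changing-responsibility example in the text.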
This architecture and abstraction of access control for data at rest has the additional benefit of ensuring that some or all sequence information is protected and available only to authorized entities throughout the life cycle of the data record. Figure 15 shows an exemplary architecture illustrating segmented access control.
Access control can be fine-grained down to, for example, the level of an individual sample. Each sample can be tagged with a unique identifier.
For largely non-critical operations, a low-level sequencing device or biosensor can be used. A low-level sequencing device or biosensor may not need a large permanent storage device. Examples of such a device may include a measurement or data acquisition module. Such a device can have measurement hardware, a processor, and/or system memory for handling system functions. Each of these components can have its own buffer memory for handling its own functions.
A low-level sequencing device may need a communication link to relay its raw data to a higher-level device, such as an application server, a local repository, or a local server.
The communication link may include a short-range communication protocol such as Bluetooth or near-field communication (NFC), or a wireless protocol such as Wi-Fi. The communication link may include a cabled (for example, wired) communication device such as USB. In some cases, the communication link may include a satellite or cellular communication module.
A low-level sequencing device can be integrated with an application server that can operate on a mobile device, such as a smartphone, to perform some of the aforementioned functions. For example, a low-level sequencing device may include measurement hardware and use the capabilities and applications of the mobile device as local storage, processor, and communication link.
Alternatively, a mid-level sequencing device can be used for more critical situations. Examples of such critical situations may include patient monitoring and point-of-care applications that require an initial diagnosis.
A mid-level sequencing device can perform more accurate polynucleotide measurements. The precision can be set according to the precision required for a reliable and accurate determination of the sequence.
A mid-level sequencing device can use memory devices and communication components. Accordingly, a mid-level sequencing device may include a measurement and data acquisition module with measurement hardware, a processor, and system memory for handling system functions. Each of these components may include its own buffer memory for handling its own functions.
Additional memory storage devices may include flash memory capable of storing bits of data (for example, multi-level cell flash memory). The data in a mid-level sequencing device can be base data, in which case multi-level cell flash memory can be suitable for storing the data locally. A port such as a USB port can be used to transfer data, for example where there is a large amount of data such that a wired connection may be needed for high bandwidth or throughput.
In one embodiment, a multi-level cell device such as flash memory is used as a relatively fast way to store and access genetic sequence data. In a flash memory device, a large number of cells can be used to store data based on floating-gate field-effect transistors (FETs) capable of holding charge. Cells can be programmed by individually charging the floating gate of each FET.
One advantage of the embodiment arises from the fact that flash memory cells can be erased in blocks via a block erase operation, which clears all the charges on all of the floating gates in the block in a single operation.
The embodiment can also have the characteristic that individual cells are addressable but not individually erasable. In this embodiment, however, an erasable block of flash memory is used to store genetic data related to base sequences, nucleotides, or other forms of contiguous genetic data. Where such an erasable block needs to be replaced, a user may generally want to erase all of the data in the erasable block at once, rather than erase part of the block. The embodiment can therefore permit optimizing the flexibility of cost relative to speed for genetic data storage.
In a flash memory storage device, cells may begin to fail after a number of program and erase cycles, after which reads or writes may fail. This fact can be used to advantage for genetic data storage. Because the number of erase cycles of flash memory may be limited, and stored genetic data is seldom rewritten, the data can be kept safe for longer than in some other usage scenarios.
A specific relationship may exist between the erase block size and the size of a sequence or other form of genetic data. This can ensure the integrity of the data related to an entire sequence.
As a specific example, a base sequence of 128 kilobases (kb) is stored in an erase block of 128k cells:
CTT...GAG (128k bases)
===...=== (128k-cell erase block)
For natural DNA and RNA bases, two bits of a multi-level cell (MLC) can be dedicated to each base. For example, in the case of DNA, one can use:
A(00) C(01) G(10) T(11)
This means that when the base is A, both the first and second bits are off; when the base is C, the second bit is on; when the base is G, the first bit is on; and when the base is T, both the first and second bits are on. A similar scheme can be used for RNA.
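The two-bit mapping A(00), C(01), G(10), T(11) can be exercised in software with the sketch below. Packing four bases into each byte is an assumption made here so the example runs on ordinary hardware; in the described flash device each two-bit value would occupy one multi-level cell. The function names are hypothetical.

```python
# Two-bit code taken from the text: A(00) C(01) G(10) T(11).
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = {v: k for k, v in ENCODE.items()}

def pack(seq: str) -> bytes:
    """Pack bases into bytes, four bases per byte, first base in the high bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].upper()
        byte = 0
        for base in chunk:
            byte = (byte << 2) | ENCODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, n_bases: int) -> str:
    """Recover n_bases bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(DECODE[(byte >> shift) & 0b11])
    return "".join(bases[:n_bases])
```

With this encoding a 128k-base sequence occupies 256k bits, that is, 128k two-bit cells, which is the one-base-per-cell relationship assumed in the erase block example.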
Each erase block can be specifically designed or configured to store multiple sequences. Alternatively, a larger sequence can be stored across a certain number of erase blocks having similar or identical properties and life cycles.
Erase blocks of different sizes can be used for sequences of different sizes. For example, a flash memory device with a smaller erase block size can be used to store oligonucleotide data or hybridization data, while a flash memory device with a larger erase block size can be used to store gene mutations or reference genes. A flash memory device with a large block size can be used to store genomic data.
The advantage of faster access using flash memory may be affected by life cycle issues. A copy of the flash memory contents can be mirrored on a storage server, which has slower access but a longer life cycle. Tests can then be designed to check the integrity of the data in each block. From time to time, the data in each block can be tested against the mirrored data on the server. If the data in a flash memory erase block shows any sign of degradation, the flash memory device can retire that block.
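One way to test each block against its mirrored copy is to compare per-block digests, as in the sketch below. The block size, the use of SHA-256, and the function names are assumptions for illustration; the text does not say how the integrity test is performed.

```python
import hashlib

BLOCK_SIZE = 32 * 1024  # bytes per erase block; hypothetical value

def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Return one SHA-256 digest per fixed-size block."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]

def degraded_blocks(flash: bytes, mirror: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Indices of blocks whose flash contents no longer match the mirror."""
    pairs = zip(block_digests(flash, block_size), block_digests(mirror, block_size))
    return [i for i, (a, b) in enumerate(pairs) if a != b]
```

Blocks reported by `degraded_blocks` are candidates for retirement, with their contents restored from the mirror to a healthy block. In practice the server could keep only the digests, so the periodic test would not need to transfer the full mirrored data.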
The embodiment can be advantageous at least because the longer-lifetime storage device can be remote, such as a hard disk drive (HDD) storage server in the cloud.
In another example, an erase block of a flash memory storage device can be used to store sequence data plus some metadata:
CTT...GAG (96k bases) - metadata (64k bits = 32k MLC cells)
===...=== (128k-cell erase block)
Examples of metadata may include any information related to the origin of the sequence, such as the name of the patient, other information related to the patient, or information about the sequence itself.
The size of the data can be optimized relative to the storage device architecture, for example by using compression or an abbreviated representation of the biological data. The size of the compressed data can be fine-tuned for better storage device compatibility.
A hash table can be built from different biological data. Each hash can correspond to a category or a gene. For example, as pathogen data proliferates, a hash can be built for each pathogen and a hash table used. Whenever a new sample is measured, hashing the new sample can readily find a match in the hash table. This is a quick and efficient way to obtain pathogen information.
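The hash table lookup described above can be sketched as follows. The signatures, the fixed window length, and the use of SHA-256 are hypothetical choices; this sketch finds only exact matches of a signature within the sample, and approximate matching would need a different technique.

```python
import hashlib

def signature_hash(seq: str) -> str:
    """Hash a normalized sequence; SHA-256 is an arbitrary choice here."""
    return hashlib.sha256(seq.upper().encode("ascii")).hexdigest()

# Hypothetical pathogen signatures, one hash entry per pathogen.
PATHOGEN_TABLE = {
    signature_hash("ACGTACGTTAGC"): "pathogen_A",
    signature_hash("TTGACCGGATCA"): "pathogen_B",
}

def identify(sample_seq: str, k: int = 12):
    """Hash every k-length window of the sample and look it up in the table."""
    for i in range(len(sample_seq) - k + 1):
        hit = PATHOGEN_TABLE.get(signature_hash(sample_seq[i:i + k]))
        if hit:
            return hit
    return None
```

Each lookup is a constant-time dictionary access, so the cost of screening a new sample grows with the sample length rather than with the number of pathogens in the table.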
A multi-level cell (MLC) memory cell can store two bits. The two bits can be used to store information about a polynucleotide base. For example, the following bit configuration can be used for DNA bases:
In this way, all four natural bases can be represented using a single memory cell. This approach can be advantageous for ensuring the integrity of the data.
In another example, an MLC memory cell can store three bits. These three bits can be used to store information about a polynucleotide base, with the additional bit indicating a methylation or oxidation state. For example, the following bit configuration can be used for DNA bases:
In this way, multi-level cell memory devices such as flash memory and phase-change memory can be used.
In a storage device having blocks of multiple cells, in the event of data degradation, data loss can be avoided by providing a warning, by a refresh cycle, or by an automatic or triggered dump of the data to a storage server (for example, an HDD) or a cloud storage server.
Erase blocks in a flash memory device can be used for easy access and storage management. When all the data on an erase block corresponds to one biological unit (for example, a DNA or RNA sequence), memory accesses can be saved and the data can have higher integrity. This can lead to power optimization in large-scale operations, in which many sequence regions or genetic data can be accessed and operated on in a short time.
With this embodiment, data integrity can be maintained by keeping all the data related to a given genetic element (for example, a gene or contig) in one or more particular blocks. Other benefits can also be realized, for example, optimized processing and reduced heat generation. It is contemplated that an ecosystem of data management, data compression, memory access, temperature control, and data integrity can have a positive net effect on overall biological data management, whether local or global.
A memory block, such as a flash memory erase block, can be selected to be compatible with the size of the genetic data. To this end, customized compression and variance analysis can be performed so that the compressed size of the genetic data is better optimized to fit the size of a memory block or memory bank. Optimization can be performed with respect to data loss and data retention. For example, where the memory unit size (such as the block size or bank size) is larger than the size of the biological unit data, the remainder of the storage space can be used to store additional information about the biological unit data. For example, an erase block in flash memory can be used to hold gene information, and additional information about the gene (such as gene expression) can be stored in the remaining space of the block.
Access to biological data can be managed through a tiered storage access scheme, as shown in Figure 16A. An application can reside on a local repository or a central server. First-tier access can be implemented using fast memory. In critical cases, random access memory (RAM) 1601 can be used to access certain data that needs frequent access. In less critical systems, the fast memory may include flash memory 1602 in or near a local HDD or a cloud-based storage unit.
The decision to retain certain biological data can be based on a hit-or-miss architecture. When a certain number of hits is registered, the processor can access the biological data and promote it to faster memory (for example, by copying or moving the biological data). For example, on detecting reports of instances of a pathogen, a local repository or central server can decide to bring a copy of the pathogen data into local storage. In addition, when a specific region of a biological data unit is identified as potentially important, a copy of that region can be kept in faster memory while the other parts of the data unit are kept in a lower tier of slower memory, for example, an HDD, cloud, or equivalent 1603. Figures 16B, 16C, and 16D provide additional examples of storage architectures. Figure 16B shows an example of an architecture suitable for providing ultrafast data access and decision making, in which the processor can be configured to communicate with RAM, flash memory, and/or an HDD or equivalent. Figure 16C shows an example of an architecture suitable for providing fast genetic access and decision making, in which the processor can be configured to communicate with flash memory and/or an HDD or equivalent. Figure 16D shows an example of an architecture suitable for providing genetic archiving, in which the processor can be configured to communicate with an HDD or equivalent.
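The hit-based promotion described above can be sketched with a two-tier store. The class name, the promotion threshold, the fast-tier capacity, and the least-hit eviction policy are assumptions for illustration; dictionaries stand in for the physical tiers.

```python
class TieredStore:
    """Two-tier store: a small fast tier in front of a slow tier."""

    def __init__(self, slow: dict, promote_after: int = 3, fast_capacity: int = 2):
        self.slow = slow                # stands in for HDD or cloud storage
        self.fast = {}                  # stands in for RAM or flash
        self.hits = {}                  # access counts per key
        self.promote_after = promote_after
        self.fast_capacity = fast_capacity

    def get(self, key):
        """Return (value, tier) and promote the item once it is hit often enough."""
        self.hits[key] = self.hits.get(key, 0) + 1
        if key in self.fast:
            return self.fast[key], "fast"
        value = self.slow[key]
        if self.hits[key] >= self.promote_after:
            self._promote(key, value)
        return value, "slow"

    def _promote(self, key, value):
        """Copy into the fast tier, evicting the least-hit entry if full."""
        if len(self.fast) >= self.fast_capacity:
            coldest = min(self.fast, key=lambda k: self.hits[k])
            del self.fast[coldest]
        self.fast[key] = value
```

Promotion here copies rather than moves the data, so the slow tier keeps a complete archive while frequently requested items, such as the data for a pathogen with many reported instances, are served from the fast tier.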
Example 2: privacy encryption
An example is provided of an encryption technique applied to the genetic sequence data of a hypothetical person, using the name Michael Smith and a 16-mer sequence associated with that hypothetical person. The 16-mer can be part of a larger sequence, gene, or genome associated with the person.
Michael Smith-... t t g c g atG t c t a a t g g ... (subject's sequence)
In this example, for purposes of illustration, the name "Michael Smith" is encrypted using a 24-bit password. The encrypted name and the corresponding syntax are expressed as follows:
Encrfn (" Michael Smith ", password 1)=
EnCt2568e6c561c2b3a78926b5dbb3adea5ba827c065e568e6c561c2b3a78926b5db bJIGwNtmg0ACHd+Q9elZHTMJV2DqVe3XSDb77IwEmS
As long as the password is secure, this method can ensure the privacy of the name. Such encryption, and the subsequent decryption and password protection, may be computationally intensive and costly. It will be appreciated that, in this example, if heavy encryption is used, a person's name of a few bytes can grow by several hundred bytes.
To ensure the privacy of the sequence, assume there is a reference sequence comprising the following:
t t g c g a aGt c t a a t g g ... (reference sequences)
Assume that the base shown in bold and underlined (the seventh base of the 16-mer) is the only base that varies within the population.
Then assume that the original sequence obtained from Michael Smith comprises the following:
… t t g c g a tG t c t a a t g g ... (subject's sequence)
According to this embodiment, the sequence is stored as follows:
… t t g c g a a* G t c t a a t g g ... (representation of the subject's sequence)
where * can be a number from 0 to 3, giving:
a0 = a
a1 = c
a2 = g
and
a3 = t
In the case of Michael Smith, the number is taken as 3, and "a" is shifted to "t".
This example illustrates that the sequence
… t t g c g a a(0123) G t c t a a t g g ... can represent the entire population at a cost of two bits, which in this case encode (0, 1, 2, 3).
Because the rest of the sequence is identical for the entire population, according to this embodiment the complete privacy of the sequence can be achieved at the cost of a 2-bit key.
In this example, a portion of an oligonucleotide or contig is presented in which only one base varies compared with the reference oligonucleotide or contig.
In this example, to encrypt the sequence, assume the reference sequence plus a 2-bit code (1, 2, 3) that can shift one base by 1 to 3 positions according to the encryption scheme, such as:
a c(1)g(2)t(3)
For example, if the encrypted variable base is "g", the shift function in the encryption code can give:
a(2)c(3)gt(1)
Similar schemes can be used without departing from the scope of the present embodiment.
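The shift scheme of this example can be sketched as follows, using the cyclic base order a, c, g, t implied by the shifts shown above. The function names are hypothetical, and the sequences are written entirely in lowercase for simplicity.

```python
BASES = "acgt"  # cyclic order used by the shift scheme in the text

def shift_base(base: str, shift: int) -> str:
    """Cyclically shift a base by `shift` positions in the order a, c, g, t."""
    return BASES[(BASES.index(base) + shift) % 4]

def encode(subject: str, reference: str, pos: int):
    """Return the shared reference plus the 2-bit key for the variable position."""
    key = (BASES.index(subject[pos]) - BASES.index(reference[pos])) % 4
    return reference, key

def decode(reference: str, pos: int, key: int) -> str:
    """Rebuild the subject's sequence from the reference and the key."""
    return reference[:pos] + shift_base(reference[pos], key) + reference[pos + 1:]
```

For the example sequences, `encode("ttgcgatgtctaatgg", "ttgcgaagtctaatgg", 6)` yields a key of 3, and `decode` with that key restores the subject's sequence; only the key needs to be kept private, since the reference is common to the whole population.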
Computer control system
The present disclosure provides computer control systems that are programmed to implement the methods of the disclosure. Figure 17 shows a computer system 1701 that is programmed or otherwise configured to manage biological data. The computer system 1701 can regulate various aspects of the data management of the present disclosure, for example, the collection, storage, and encryption of biological data, communication about definitions and rules between servers and between servers and repositories, and the management of definitions and rules. The computer system 1701 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device. The electronic device can be a mobile electronic device.
The computer system 1701 includes a central processing unit (CPU, also "processor" and "computer processor" herein) 1705, which can be a single-core or multi-core processor, or a plurality of processors for parallel processing. The computer system 1701 also includes memory or a memory location 1710 (for example, random-access memory, read-only memory, flash memory), an electronic storage unit 1715 (for example, a hard disk), a communication interface 1720 (for example, a network adapter) for communicating with one or more other systems, and peripheral devices 1725 (for example, cache, other memory, data storage, and/or electronic display adapters). The memory 1710, storage unit 1715, interface 1720, and peripheral devices 1725 communicate with the CPU 1705 through a communication bus (solid lines), such as a motherboard. The storage unit 1715 can be a data storage unit (or data repository) for storing data. The computer system 1701 can be operatively coupled to a computer network ("network") 1730 with the aid of the communication interface 1720. The network 1730 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. In some cases, the network 1730 is a telecommunication and/or data network. The network 1730 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network 1730, in some cases with the aid of the computer system 1701, can implement a peer-to-peer network, which may enable devices coupled to the computer system 1701 to behave as a client or a server.
The CPU 1705 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 1710. The instructions can be directed to the CPU 1705, which can subsequently program or otherwise configure the CPU 1705 to implement the methods of the present disclosure. Examples of operations performed by the CPU 1705 can include fetch, decode, execute, and writeback.
The CPU 1705 can be part of a circuit, such as an integrated circuit. One or more other components of the system 1701 can be included in the circuit. In some cases, the circuit is an application-specific integrated circuit (ASIC).
The storage unit 1715 can store files, such as drivers, libraries, and saved programs. The storage unit 1715 can store user data, for example, user preferences and user programs. In some cases, the computer system 1701 can include one or more additional data storage units that are external to the computer system 1701, such as on a remote server that communicates with the computer system 1701 through an intranet or the Internet.
The computer system 1701 can communicate with one or more remote computer systems through the network 1730. For example, the computer system 1701 can communicate with a remote computer system of a user (for example, a laboratory or hospital). Examples of remote computer systems include personal computers (for example, a portable PC), slate or tablet PCs (for example, iPad, Galaxy Tab), telephones, smartphones (for example, iPhone, Android-enabled devices), or personal digital assistants. The user can access the computer system 1701 via the network 1730.
The methods described herein can be implemented by way of machine (for example, computer processor) executable code stored on an electronic storage location of the computer system 1701, such as on the memory 1710 or the electronic storage unit 1715. The machine-executable or machine-readable code can be provided in the form of software. During use, the code can be executed by the processor 1705. In some cases, the code can be retrieved from the storage unit 1715 and stored on the memory 1710 for ready access by the processor 1705. In some cases, the electronic storage unit 1715 can be omitted, and the machine-executable instructions are stored on the memory 1710.
The code can be precompiled and configured for use with a machine having a processor adapted to execute the code, or it can be compiled at runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a precompiled or as-compiled fashion.
The various aspects (for example, computer system 1701) of system and method provided herein can embody in programming.It should The various aspects of technology are considered the machine usually to carry or embody in a type of machine readable media " product " of the form of (or processor) executable code and/or related data or " product ".Machine executable code can deposit Storage is on electronic memory module, for example, memory (for example, read-only memory, random access memory, flash memory) or hard Disk." storage " type medium may include any one of tangible memory or processor of computer etc. or its correlation module or All, various semiconductor memories, tape drive and disc driver etc., they can provide at any time non-transitory and deposit Reservoir is to be used for software programming.Sometimes, all or part of software can be carried out by internet or various other telecommunication networks Communication.For example, this communication can enable software to be loaded into another computer or processing from a computer or processor Device, for example, being loaded into the computer platform of application server from management server or master computer.Therefore, software can be carried The another type of medium of element includes on physical interface such as between local device, by wired and optics land line net Network and the optics used by various airlinks, electricity and electromagnetic wave.Such carrier wave physical component (for example, it is wired or Radio Link or optical link etc.) also it is considered the media of carrying software.As it is used herein, non-temporary except being not limited to Shi Xing, tangible " storage " media, otherwise the term of such as computer or machine " readable medium " etc refers to participation to processor Instruction is provided for any medium of execution.
Hence, a machine-readable medium, such as one bearing computer-executable code, may take many forms, including but not limited to a tangible storage medium, a carrier-wave medium or a physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer or the like that may be used to implement the databases shown in the drawings. Volatile storage media include dynamic memory, such as the main memory of such a computer platform. Tangible transmission media include coaxial cables, copper wire and fiber optics, including the wires that make up a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves, such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-ROM, a DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
The computer system 1701 can include or be in communication with an electronic display 1735 that comprises a user interface (UI) 1740 for providing, for example, genetic data, including, e.g., base sequence strings or reads in various syntaxes, and sequence alignments. Examples of UIs include, without limitation, a graphical user interface (GUI) and a web-based user interface.
The methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit 1705. The algorithm can, for example, encrypt data, convert genetic reads, and analyze, interpret, align and assemble various data, including but not limited to sequence data, run data, metadata, sample data and contig data.
While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes and substitutions will now occur to those skilled in the art without departing from the invention. Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations or relative proportions set forth herein, which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is therefore contemplated that the invention shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

Claims (55)

1. A biological data management system, comprising:
(a) an end user module comprising a sequencing device, the sequencing device being configured to generate base data;
(b) a locally-stored library in network communication with the end user module, the locally-stored library being programmed or configured to: (i) receive the base data, (ii) convert the base data to sequence data, (iii) generate abbreviation data based on the sequence data, and (iv) compare the abbreviation data with a database of existing abbreviations; and
(c) a central server in network communication with the locally-stored library, the central server being configured to update the database of existing abbreviations.
2. The biological data management system according to claim 1, wherein the locally-stored library is further programmed or configured to:
label an abbreviation, and transmit the labeled abbreviation to the central server.
3. The biological data management system according to claim 2, wherein the central server is further programmed or configured to:
receive the labeled abbreviation, and perform further analysis on the labeled abbreviation.
4. The biological data management system according to claim 3, wherein the central server is further programmed or configured to:
upon analyzing the labeled abbreviation, generate an instruction and transmit the instruction to the locally-stored library.
5. The biological data management system according to claim 1, wherein
the abbreviation is a variance, a hash or a checksum.
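By way of non-limiting illustration only, steps (iii) and (iv) of claim 1 and the labeling of claim 2 might be sketched as follows. The sketch assumes that the abbreviation is a SHA-256 hash, which is only one of the options named in claim 5; the function names and the in-memory set standing in for the database of existing abbreviations are hypothetical.

```python
import hashlib


def make_abbreviation(sequence: str) -> str:
    """Return a SHA-256 hex digest of a base sequence (one possible abbreviation)."""
    return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()


def compare_with_database(sequence: str, known: set[str]) -> tuple[str, bool]:
    """Return the abbreviation and whether the database already contains it."""
    abbreviation = make_abbreviation(sequence)
    return abbreviation, abbreviation in known


# Hypothetical state of the locally-stored library (an assumption for illustration).
known_abbreviations = {make_abbreviation("ACGTACGT")}

abbreviation, seen = compare_with_database("acgtacgt", known_abbreviations)

# An abbreviation that is not found would be labeled and sent to the central server.
labeled = [] if seen else [abbreviation]
```

Comparing fixed-length abbreviations rather than full sequences keeps the lookup cost independent of read length, which is presumably one reason a hash or checksum is among the recited options.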
6. A method for storing biological data, comprising:
(d) determining a size of the biological data to identify a storage unit size suitable for storing the biological data;
(e) identifying a memory location in a memory device having a block size compatible with the storage unit size; and
(f) storing the biological data in an erasable block at the memory location of the memory device.
7. The method according to claim 6, wherein
each erasable block comprises a partition for storing the biological data and a partition for storing metadata related to the biological data.
8. The method according to claim 7, wherein
the partition for storing metadata has a longer lifetime.
9. The method according to claim 7, wherein
the partition for storing metadata comprises a controller different from the controller of the partition for storing sequence data.
10. The method according to claim 7, wherein
the partition for storing metadata is configured for more frequent access than the partition for storing sequence data.
11. A biological data management system, comprising:
(g) a first memory device configured to store and serve infrequently accessed biological data; and
(h) a second memory device having a block size, the second memory device being in communication with the first memory device and configured to store and serve frequently accessed biological data;
wherein the second memory device is faster than the first memory device, and
wherein the block size is selected according to the size of the biological data so as to store the biological data.
12. The biological data management system according to claim 11,
wherein the biological data is an n-mer sequence, and
wherein the block size is n times the number of bits required to store a monomer of the n-mer.
13. The biological data management system according to claim 11,
wherein the biological data is an n-mer sequence, and
wherein the block size is at least n times the number of bits required to store a monomer of the n-mer.
14. The biological data management system according to claim 11, wherein
the second memory device comprises a flash memory device.
15. The biological data management system according to claim 14, wherein
the second memory device comprises blocks that are flash erase blocks.
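By way of non-limiting illustration only, the block size relationship of claims 12 and 13 can be worked through as follows. The sketch assumes that each monomer is stored in the minimum whole number of bits needed to distinguish the monomers of the alphabet; the function names are hypothetical.

```python
def bits_per_monomer(alphabet_size: int) -> int:
    """Minimum whole number of bits that distinguishes alphabet_size monomers."""
    if alphabet_size < 2:
        raise ValueError("alphabet must contain at least two monomers")
    # (k - 1).bit_length() equals ceil(log2(k)) for k >= 2, without float rounding.
    return (alphabet_size - 1).bit_length()


def block_size_bits(n: int, alphabet_size: int = 4) -> int:
    """Block size in bits for an n-mer: n times the bits needed per monomer."""
    return n * bits_per_monomer(alphabet_size)


# A 32-mer over four bases needs 32 * 2 = 64 bits;
# with seven symbols per position it needs 32 * 3 = 96 bits.
size_four_bases = block_size_bits(32)
size_seven_symbols = block_size_bits(32, alphabet_size=7)
```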
16. A method for storing sequence base data in a multi-level cell (MLC) memory device, the MLC memory device comprising memory cells, each memory cell being configured to store two bits,
the method comprising, in a memory cell:
(i) setting the two bits to 00 to represent a first type of base;
(j) setting the two bits to 01 to represent a second type of base;
(k) setting the two bits to 10 to represent a third type of base; or
(l) setting the two bits to 11 to represent a fourth type of base.
17. The method according to claim 16, wherein
the sequence base data represents one or more polynucleotides, each polynucleotide comprising one or more bases, each of the one or more bases being one of at least four possible bases.
18. The method according to claim 17, wherein
the polynucleotides are DNA or RNA.
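By way of non-limiting illustration only, the two-bit representation of claim 16 might be modeled in software as follows. The assignment of A, C, G and T to the codes 00, 01, 10 and 11 is an assumption (the claim refers only to first to fourth types of base), and packing four codes into one byte stands in for writing individual two-bit cells.

```python
# Assumed mapping of bases to two-bit codes; the claim does not fix which base is which.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_bases(seq: str) -> bytes:
    """Pack a base string into bytes, four two-bit codes per byte, first base in the high bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= BASE_TO_BITS[base] << (6 - 2 * (i % 4))
    return bytes(out)


def unpack_bases(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    return "".join(
        BITS_TO_BASE[(data[i // 4] >> (6 - 2 * (i % 4))) & 0b11]
        for i in range(length)
    )


packed = pack_bases("GATTACA")
restored = unpack_bases(packed, 7)
```

Because the final byte may be only partly filled, the number of bases has to be kept alongside the packed data so that padding is not read back as bases.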
19. A method for storing biological data in a memory device, the memory device comprising blocks, each block having a block size,
the method comprising:
(m) determining a size of the biological data;
(n) determining the block size of at least a subset of the blocks;
(o) compressing the biological data based on the block size to generate compressed biological data; and
(p) storing the biological data in at least the subset of the blocks.
20. The method according to claim 19,
wherein the memory device comprises a flash memory device, and
wherein the block size is an erase block size.
21. The method according to claim 19, wherein
the block size is greater than or equal to the size of the compressed biological data.
22. The method according to claim 20, wherein
the erase block stores the biological data and metadata of the biological data.
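By way of non-limiting illustration only, steps (m) to (p) of claim 19 together with the size condition of claim 21 might be sketched as follows. The use of zlib, the zero padding and the function names are assumptions; the claims do not name a compression scheme.

```python
import zlib


def store_in_block(data: bytes, block_size: int) -> bytes:
    """Compress data and pad it to exactly one block, or fail if it cannot fit."""
    compressed = zlib.compress(data, 9)
    if len(compressed) > block_size:
        raise ValueError("compressed data does not fit in one block")
    return compressed + b"\x00" * (block_size - len(compressed))


def read_from_block(block: bytes) -> bytes:
    """Decompress a padded block; bytes after the zlib stream are ignored as padding."""
    decompressor = zlib.decompressobj()
    return decompressor.decompress(block)


reads = b"ACGT" * 1000           # 4000 bytes of highly repetitive example data
block = store_in_block(reads, 512)
recovered = read_from_block(block)
```

A decompression object is used for reading because it stops at the end of the compressed stream and leaves the padding in its `unused_data` attribute.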
23. A method for storing sequence base data in a memory device, the memory device comprising memory cells, each memory cell being configured to store at least three bits,
the method comprising, in a memory cell:
(q) setting three of the at least three bits to 000 to represent a first type of base;
(r) setting three of the at least three bits to 001 to represent a second type of base;
(s) setting three of the at least three bits to 010 to represent a third type of base;
(t) setting three of the at least three bits to 011 to represent a fourth type of base;
(u) setting three of the at least three bits to 100 to represent a fifth type of base;
(v) setting three of the at least three bits to 101 to represent a sixth type of base;
(w) setting three of the at least three bits to 110 to represent a seventh type of base; and
(x) setting three of the at least three bits to 111 to represent an eighth type of base.
24. The method according to claim 23, wherein
the sequence base data represents one or more polynucleotides, each polynucleotide comprising one or more bases, each of the one or more bases being one of four different natural bases, a methylated base, an oxidized base, or an abasic site.
25. The method according to claim 24, wherein
the polynucleotides are DNA or RNA.
26. The method according to claim 23, wherein
the memory device comprises flash memory, phase-change memory or resistive memory.
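By way of non-limiting illustration only, the three-bit representation of claims 23 and 24 might be modeled as follows. The symbol labels and their order are assumptions, and the eighth code is shown as reserved because claim 24 lists only seven kinds of position.

```python
# Assumed symbol table: index i is stored as the three-bit code i (000 through 111).
SYMBOLS = ["A", "C", "G", "T", "methylated", "oxidized", "abasic", "reserved"]
CODE = {symbol: i for i, symbol in enumerate(SYMBOLS)}


def encode(symbols: list[str]) -> int:
    """Concatenate three-bit codes into one integer, first symbol in the highest bits."""
    value = 0
    for symbol in symbols:
        value = (value << 3) | CODE[symbol]
    return value


def decode(value: int, count: int) -> list[str]:
    """Split an integer back into `count` three-bit codes and look up each symbol."""
    return [
        SYMBOLS[(value >> shift) & 0b111]
        for shift in range(3 * (count - 1), -1, -3)
    ]


read = ["A", "methylated", "abasic"]
stored = encode(read)            # bits 000 100 110
roundtrip = decode(stored, len(read))
```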
27. A method for storing sequence base data in a memory device, the sequence base data comprising two possible bases representing each of a plurality of measured bases, the memory device comprising memory cells, each memory cell being configured to store a plurality of bits,
the method comprising:
storing, in a first of the plurality of bits, the most probable base of the sequence base data;
storing, in a second of the plurality of bits, the second most probable base of the sequence base data; and
storing, in the remainder of the plurality of bits, the relative probability of the most probable base and the second most probable base.
28. The method according to claim 27, further comprising:
identifying the most probable base using a first cell of the memory cells;
identifying the second most probable base using a second cell of the memory cells; and
storing the relative probability using one or more other cells of the memory cells.
29. The method according to claim 27, further comprising:
storing, in a third cell of the memory cells, the probability of the second most probable base.
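By way of non-limiting illustration only, the layout of claim 27 might be sketched as a single byte per base position. The sketch assumes, purely for illustration, that each base identifier occupies two bits and that the relative probability is quantized to the remaining four bits; these widths, the base order and the function names are hypothetical.

```python
BASES = "ACGT"  # assumed order; the index of a base is its two-bit identifier


def pack_call(probs: dict[str, float]) -> int:
    """Pack the top two candidate bases and their relative probability into one byte."""
    ranked = sorted(BASES, key=lambda b: probs.get(b, 0.0), reverse=True)
    top, second = ranked[0], ranked[1]
    p_top, p_second = probs.get(top, 0.0), probs.get(second, 0.0)
    total = p_top + p_second
    # Share of the top base among the two candidates, always within [0.5, 1.0].
    relative = p_top / total if total > 0 else 1.0
    quantized = round((relative - 0.5) * 2 * 15)   # 4-bit value, 0..15
    return (BASES.index(top) << 6) | (BASES.index(second) << 4) | quantized


def unpack_call(byte: int) -> tuple[str, str, float]:
    """Recover the two candidate bases and the quantized relative probability."""
    top = BASES[(byte >> 6) & 0b11]
    second = BASES[(byte >> 4) & 0b11]
    relative = 0.5 + (byte & 0b1111) / 30
    return top, second, relative


cell = pack_call({"A": 0.9, "G": 0.1})
top, second, relative = unpack_call(cell)
```

With four bits the relative probability is resolved in steps of 1/30, so values read back are approximations of the stored probability.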
30. A method for storing sequence base data in a memory device, the memory device comprising memory cells, each memory cell being configured to store at least three bits,
the method comprising, in the memory cell:
(y) providing a first indication comprising three of the at least three bits, to represent a first type of base;
(z) providing a second indication comprising three of the at least three bits, to represent a second type of base;
(aa) providing a third indication comprising three of the at least three bits, to represent a third type of base;
(bb) providing a fourth indication comprising three of the at least three bits, to represent a fourth type of base;
(cc) providing a fifth indication comprising three of the at least three bits, to represent a methylated base;
(dd) providing a sixth indication comprising three of the at least three bits, to represent an oxidized base; and
(ee) providing a seventh indication comprising three of the at least three bits, to represent an abasic site.
31. The method according to claim 29, wherein
the memory device comprises flash memory, phase-change memory or resistive memory.
32. A method for encrypting biological sequence data, the method comprising:
(ff) identifying a normal level of variance in the biological sequence data; and
(gg) introducing a secondary level of variance into the biological sequence data, the secondary level of variance being comparable to the normal level of variance, such that the biological sequence data is indistinguishable relative to the normal level of variance.
33. The method according to claim 32, further comprising:
transmitting the introduced level of variance using an encryption method.
34. A method for encrypting biological sequence data of a subject, the method comprising:
(hh) encrypting information related to the subject using a first encryption scheme; and
(ii) encrypting the biological sequence data using a second encryption scheme, the second encryption scheme being different from the first encryption scheme.
35. The method according to claim 34, wherein
the second encryption scheme comprises less extensive encryption than the first encryption scheme.
36. The method according to claim 35, wherein
the second encryption scheme comprises scrambling and winnowing.
37. The method according to claim 35, wherein
the first encryption scheme uses a public key infrastructure, and
the second encryption scheme uses the same public key infrastructure.
38. The method according to claim 35, wherein
the first encryption scheme uses a first public key infrastructure, and
the second encryption scheme uses a second public key infrastructure different from the first public key infrastructure.
39. A method for storing sequence base data, the method comprising:
providing a two-dimensional table structure in computer memory, the two-dimensional table structure being configured to store information representing potential bases;
storing information representing the most probable measured base of the sequence base data in a first dimension of the two-dimensional table structure;
storing information representing other potential bases of the sequence base data in a second dimension of the two-dimensional table structure; and
storing, in the two-dimensional table structure, a probability corresponding to an intersection of the first dimension and the second dimension.
40. The method according to claim 39, wherein
the potential bases comprise each of four possible bases and at least one of the following:
a methylated base, an oxidized base and an abasic site.
41. The method according to claim 39, further comprising:
providing a second two-dimensional table structure in computer memory, the second two-dimensional table structure being configured to store information representing potential bases; and
storing, in the second two-dimensional table structure, the most probable measured base of the sequence base data and the second most probable measured base of the sequence base data.
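By way of non-limiting illustration only, the two-dimensional table structure of claim 39 might be held in memory as a nested list, with rows indexed by the most probable measured base and columns by the other potential bases. The base order, the function names and the example probabilities are assumptions.

```python
BASES = ["A", "C", "G", "T"]                      # assumed set of potential bases
INDEX = {base: i for i, base in enumerate(BASES)}


def make_table() -> list[list[float]]:
    """Create an empty table: rows are called bases, columns are potential bases."""
    return [[0.0] * len(BASES) for _ in BASES]


def store(table: list[list[float]], called: str, potential: str, probability: float) -> None:
    """Store the probability at the intersection of the two dimensions."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    table[INDEX[called]][INDEX[potential]] = probability


def lookup(table: list[list[float]], called: str, potential: str) -> float:
    """Read the probability stored for a called base and a potential base."""
    return table[INDEX[called]][INDEX[potential]]


table = make_table()
store(table, "A", "A", 0.92)   # hypothetical: measured A is truly A
store(table, "A", "G", 0.05)   # hypothetical: measured A is actually G
```

Extending `BASES` with entries for a methylated base, an oxidized base or an abasic site would give the larger table contemplated by claim 40 without changing the functions.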
42. A method for managing biological data, the method comprising:
providing an application server programmed or configured to: (i) receive raw measured biological data from a sensor, and (ii) generate processed biological data from the raw measured biological data;
receiving, at the application server, definitions and rules related to the processed biological data from a locally-stored library; and
issuing, by the application server, instructions based on the definitions and rules related to the processed biological data.
43. The method according to claim 42,
wherein the processed biological data comprises a portion for which no related definitions and rules are found in the locally-stored library, and
wherein the method further comprises sending at least the portion of the processed biological data to the locally-stored library.
44. The method according to claim 43, further comprising:
sending at least the portion of the processed biological data from the locally-stored library to a central server.
45. The method according to claim 44, further comprising:
sending instructions from the central server to the locally-stored library.
46. The method according to claim 45, further comprising:
sending new definitions and rules from the central server to the locally-stored library.
47. A method for storing sequence base data, the method comprising:
for a base position, storing information representing the most probable base of the sequence base data in a first location of a storage device, and storing a probability of the number of occurrences of the most probable base in a second location of the storage device.
48. A method for storing sequence base data comprising at least four possible bases, the method comprising:
(jj) providing a three-dimensional table structure in computer memory, the three-dimensional table structure being configured to store the sequence base data, wherein
(i) a first dimension of the three-dimensional table structure stores information representing the most probable measured base of the genetic sequence base data;
(ii) a second dimension of the three-dimensional table structure stores information representing potential bases of the genetic sequence base data; and
(iii) a third dimension of the three-dimensional table structure stores information representing a base count probability for each of the at least four possible bases of the sequence base data; and
(kk) storing, in the three-dimensional table structure, a probability corresponding to an intersection of the first dimension, the second dimension and the third dimension.
49. A method for protecting biological data related to a subject, the method comprising:
encrypting personally identifiable information of the subject using a first encryption scheme;
encrypting a phenotype of the subject using a second encryption scheme;
encrypting the biological data using a third encryption scheme, wherein the second encryption scheme or the third encryption scheme is different from the first encryption scheme; and
storing the encrypted personally identifiable information, the encrypted phenotype and the encrypted biological data in computer memory.
50. The method according to claim 49, wherein
(i) the second encryption scheme is different from the first encryption scheme,
(ii) the third encryption scheme is different from the first encryption scheme, and
(iii) the third encryption scheme is different from the second encryption scheme.
51. The method according to claim 49, further comprising:
storing gene expression data of the subject.
52. The method according to claim 50, further comprising:
storing geographic data of the subject.
53. A method for storing genetic data of a subject, the method comprising:
storing personally identifiable information of the subject in a first memory segment having a first access restriction level;
storing phenotypic data of the subject in a second memory segment having a second access restriction level; and
storing the genetic data of the subject in a third memory segment having a third access restriction level.
54. The method according to claim 53, wherein
the second access restriction level or the third access restriction level is different from the first access restriction level.
55. The method according to claim 54, wherein
(i) the second access restriction level is different from the first access restriction level,
(ii) the third access restriction level is different from the first access restriction level, and
(iii) the third access restriction level is different from the second access restriction level.
CN201780035638.0A 2016-04-11 2017-04-11 System and method for biological data management Pending CN109937426A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201662321103P 2016-04-11 2016-04-11
US62/321,103 2016-04-11
PCT/JP2017/014847 WO2017179581A1 (en) 2016-04-11 2017-04-11 Systems and methods for biological data management

Publications (1)

Publication Number Publication Date
CN109937426A true CN109937426A (en) 2019-06-25

Family

ID=60041640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780035638.0A Pending CN109937426A (en) 2016-04-11 2017-04-11 System and method for biological data management

Country Status (7)

Country Link
US (1) US20190304571A1 (en)
EP (1) EP3443531A4 (en)
JP (1) JP2019517056A (en)
KR (1) KR20190017738A (en)
CN (1) CN109937426A (en)
CA (1) CA3020669A1 (en)
WO (1) WO2017179581A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011108540A1 (en) 2010-03-03 2011-09-09 国立大学法人大阪大学 Method and device for identifying nucleotide, and method and device for determining nucleotide sequence of polynucleotide
CA2929929A1 (en) 2013-09-18 2015-03-26 Quantum Biosystems Inc. Biomolecule sequencing devices, systems and methods
JP2015077652A (en) 2013-10-16 2015-04-23 クオンタムバイオシステムズ株式会社 Nano-gap electrode and method for manufacturing same
US10438811B1 (en) 2014-04-15 2019-10-08 Quantum Biosystems Inc. Methods for forming nano-gap electrodes for use in nanosensors
WO2015170782A1 (en) 2014-05-08 2015-11-12 Osaka University Devices, systems and methods for linearization of polymers
GB2554883A (en) * 2016-10-11 2018-04-18 Petagene Ltd System and method for storing and accessing data
US20190318118A1 (en) * 2018-04-16 2019-10-17 International Business Machines Corporation Secure encrypted document retrieval

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1938720A (en) * 2004-03-31 2007-03-28 松下电器产业株式会社 Memory card and memory card system
US20070171714A1 (en) * 2006-01-20 2007-07-26 Marvell International Ltd. Flash memory with coding and signal processing
JP2008146538A (en) * 2006-12-13 2008-06-26 Intec Web & Genome Informatics Corp Microrna detector, detection method and program
CN101497924A (en) * 2008-01-30 2009-08-05 中国农业大学 Biological sequence analysis method based on gap spectrum
US20110276277A1 (en) * 2009-11-06 2011-11-10 The Chinese University Of Hong Kong Size-based genomic analysis
JP2012118709A (en) * 2010-11-30 2012-06-21 Brother Ind Ltd Distribution system, storage capacity decision program, and storage capacity decision method
CN102870086A (en) * 2010-03-29 2013-01-09 卡尼股份有限公司 Digital profile system of personal attributes, tendencies, recommended actions, and historical events with privacy preserving controls
CN102915594A (en) * 2011-08-04 2013-02-06 深圳市凯智汇科技有限公司 Bank card security system based on human body biological information code and operation method thereof
CN103559427A (en) * 2013-11-12 2014-02-05 高扬 Method for identifying biological sequence and deducing species genetic relationship through digitals
CN105190636A (en) * 2013-03-28 2015-12-23 三菱宇宙软件株式会社 Genetic information storage device, genetic information search device, genetic information storage program, genetic information search program, genetic information storage method, genetic information search method, and genetic information search system
CN105447844A (en) * 2014-08-15 2016-03-30 大连达硕信息技术有限公司 New method for characteristic selection of complex multivariable data

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6437640A (en) * 1987-08-03 1989-02-08 Mitsubishi Electric Corp Control system for cache memory
JPH04289938A (en) * 1991-03-18 1992-10-14 Nippon Telegr & Teleph Corp <Ntt> Cache memory control system
JPH10283230A (en) * 1997-03-31 1998-10-23 Nec Corp File data storage device and machine-readable recording medium with program recorded
JP4259902B2 (en) * 2003-04-01 2009-04-30 日立オムロンターミナルソリューションズ株式会社 Information reading device, program for information reading device
US8340914B2 (en) * 2004-11-08 2012-12-25 Gatewood Joe M Methods and systems for compressing and comparing genomic data
EP2634716A1 (en) * 2012-02-28 2013-09-04 Koninklijke Philips Electronics N.V. Tamper-proof genetic sequence processing
US20150310228A1 (en) * 2014-02-26 2015-10-29 Nantomics Secured mobile genome browsing devices and methods therefor
WO2015134664A1 (en) * 2014-03-04 2015-09-11 Bigdatabio, Llc Methods and systems for biological sequence alignment

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1938720A (en) * 2004-03-31 2007-03-28 松下电器产业株式会社 Memory card and memory card system
US20070171714A1 (en) * 2006-01-20 2007-07-26 Marvell International Ltd. Flash memory with coding and signal processing
JP2008146538A (en) * 2006-12-13 2008-06-26 Intec Web & Genome Informatics Corp Microrna detector, detection method and program
CN101497924A (en) * 2008-01-30 2009-08-05 中国农业大学 Biological sequence analysis method based on gap spectrum
US20110276277A1 (en) * 2009-11-06 2011-11-10 The Chinese University Of Hong Kong Size-based genomic analysis
CN102870086A (en) * 2010-03-29 2013-01-09 卡尼股份有限公司 Digital profile system of personal attributes, tendencies, recommended actions, and historical events with privacy preserving controls
JP2012118709A (en) * 2010-11-30 2012-06-21 Brother Ind Ltd Distribution system, storage capacity decision program, and storage capacity decision method
CN102915594A (en) * 2011-08-04 2013-02-06 深圳市凯智汇科技有限公司 Bank card security system based on human body biological information code and operation method thereof
CN105190636A (en) * 2013-03-28 2015-12-23 三菱宇宙软件株式会社 Genetic information storage device, genetic information search device, genetic information storage program, genetic information search program, genetic information storage method, genetic information search method, and genetic information search system
US20160048690A1 (en) * 2013-03-28 2016-02-18 Mitsubishi Space Software Co., Ltd. Genetic information storage apparatus, genetic information search apparatus, genetic information storage program, genetic information search program, genetic information storage method, genetic information search method, and genetic information search system
CN103559427A (en) * 2013-11-12 2014-02-05 高扬 Method for identifying biological sequence and deducing species genetic relationship through digitals
CN105447844A (en) * 2014-08-15 2016-03-30 大连达硕信息技术有限公司 New method for characteristic selection of complex multivariable data

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114996763A (en) * 2022-07-28 2022-09-02 北京锘崴信息科技有限公司 Private data security analysis method and device based on trusted execution environment
CN114996763B (en) * 2022-07-28 2022-11-15 北京锘崴信息科技有限公司 Private data security analysis method and device based on trusted execution environment

Also Published As

Publication number Publication date
CA3020669A1 (en) 2017-10-19
US20190304571A1 (en) 2019-10-03
WO2017179581A1 (en) 2017-10-19
EP3443531A1 (en) 2019-02-20
JP2019517056A (en) 2019-06-20
EP3443531A4 (en) 2020-07-22
KR20190017738A (en) 2019-02-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190625