EP3566050A1 - Systems, methods, and devices for analysis of genetic material
- Publication number
- EP3566050A1 (EP18736154.8A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- nucleic acid
- acid sequence
- protein
- intron
- signature value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6813—Hybridisation assays
- C12Q1/6816—Hybridisation assays characterised by the detection means
- C12Q1/6825—Nucleic acid detection involving sensors
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
Definitions
- This application includes a source code appendix with example computer source code.
- the source code appendix is considered part of this application for all purposes.
- This invention relates to genetic engineering and microbiology, and, more particularly, to systems, methods, and devices for analysis of genetic material such as nucleotide sequences.
- a computer-implemented method may comprise obtaining a representation of a nucleic acid sequence.
- the nucleic acid sequence may encode a particular gene and the particular gene may encode a particular protein.
- the nucleic acid sequence may comprise at least one intron.
- the embodiment may also comprise determining an intron signature value corresponding to the at least one intron, with the intron signature value being based on a first computational function applied to one or more portions of the representation of the nucleic acid sequence corresponding to the at least one intron.
- the embodiment may also comprise determining a protein signature value corresponding to the particular protein, with the protein signature value being based on a second computational function applied to a representation of the particular protein, and forming, in a database, an association between the intron signature value and the protein signature value.
- the method may comprise repeating the above acts for each of a plurality of nucleic acid sequences.
- the computer-implemented method may also comprise using the formed association between the intron signature value and the protein signature value to determine or confirm at least one aspect of a genetic function of a particular nucleic acid sequence.
- the computer-implemented method may comprise using the formed association between the intron signature value and the protein signature value to determine or confirm at least one aspect of a genetic function of a particular nucleic acid sequence for each of a plurality of nucleic acid sequences.
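The acts described above (compute an intron signature, compute a protein signature, and associate the two) can be sketched as follows. This is a minimal illustration, not the method of the source code appendix: the choice of SHA-256 and BLAKE2b as the "first" and "second" computational functions, the in-memory dictionary standing in for the database, and the names `intron_signature`, `protein_signature`, and `associate` are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List, Tuple


def intron_signature(sequence: str, introns: List[Tuple[int, int]]) -> str:
    """Assumed 'first computational function': hash the intron portions."""
    digest = hashlib.sha256()
    for start, end in introns:  # half-open [start, end) positions in the sequence
        digest.update(sequence[start:end].encode("ascii"))
    return digest.hexdigest()


def protein_signature(protein: str) -> str:
    """Assumed 'second computational function': hash the protein representation."""
    return hashlib.blake2b(protein.encode("ascii"), digest_size=16).hexdigest()


# Hypothetical in-memory stand-in for the database of associations.
associations: Dict[str, str] = {}


def associate(sequence: str, introns: List[Tuple[int, int]], protein: str) -> Tuple[str, str]:
    """Form an association between the intron and protein signature values."""
    i_sig = intron_signature(sequence, introns)
    p_sig = protein_signature(protein)
    associations[i_sig] = p_sig
    return i_sig, p_sig


# Toy values for illustration only; not real gene data.
i_sig, p_sig = associate("ATGGTAAGTCCAGGCTAA", [(3, 12)], "MA")
```

Repeating `associate` over many sequences would populate the table, after which a lookup by intron signature returns the associated protein signature.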
- a computer-implemented method, implemented by hardware in combination with software, may comprise obtaining a representation of a nucleic acid sequence, with the nucleic acid sequence comprising multiple non-overlapping nucleic acid subsequences, with each of the nucleic acid subsequences encoding an amino acid.
- the embodiment may also comprise determining a particular amino acid encoded by a particular nucleic acid subsequence of the multiple non-overlapping nucleic acid subsequences, the particular nucleic acid subsequence comprising a first nucleotide, a second nucleotide adjacent to the first nucleotide, and a third nucleotide adjacent to the second nucleotide.
- the particular amino acid may be determined by considering a nucleotide pair consisting of: (i) the first nucleotide and the second nucleotide, or (ii) the second nucleotide and the third nucleotide.
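As a hedged illustration of case (i), the sketch below relies on a property of the standard genetic code: for eight first-and-second nucleotide pairs, the amino acid is the same whatever the third nucleotide is. The table and the function names `split_codons` and `amino_acid_from_pair` are assumptions for this example; case (ii) and the source's actual lookup scheme are not shown.

```python
from typing import List, Optional

# First-two-nucleotide pairs that fix the amino acid in the standard
# genetic code, regardless of the third nucleotide (DNA alphabet).
PAIR_TO_AMINO_ACID = {
    "GC": "alanine",
    "CG": "arginine",
    "GG": "glycine",
    "CT": "leucine",
    "CC": "proline",
    "TC": "serine",
    "AC": "threonine",
    "GT": "valine",
}


def split_codons(sequence: str) -> List[str]:
    """Split a sequence into non-overlapping three-nucleotide subsequences."""
    usable = len(sequence) - len(sequence) % 3  # ignore a trailing partial codon
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


def amino_acid_from_pair(codon: str) -> Optional[str]:
    """Return the amino acid if the first two nucleotides alone determine it."""
    codon = codon.upper().replace("U", "T")  # accept RNA or DNA spelling
    if len(codon) != 3:
        raise ValueError("a codon must have exactly three nucleotides")
    return PAIR_TO_AMINO_ACID.get(codon[:2])
```

For a codon such as ATG, whose first two nucleotides do not settle the amino acid, the function returns `None` and a full three-nucleotide lookup would be needed.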
- a computer-implemented method of encoding a nucleic acid sequence may comprise obtaining a representation of the nucleic acid sequence, with the nucleic acid sequence comprising multiple non-overlapping nucleic acid subsequences, with each of the nucleic acid subsequences encoding an amino acid.
- the embodiment may also comprise determining a plurality of subsequences of the nucleic acid sequence, with the plurality of subsequences determined by a binary counting of the nucleic acid sequence, and determining a digital signature for the nucleic acid sequence based on a computational hash function applied to the plurality of subsequences of the nucleic acid sequence.
- the computer-implemented method may comprise using the digital signature for the nucleic acid sequence to determine or confirm at least one aspect of a genetic function of the nucleic acid sequence.
- a computer-implemented method may comprise obtaining a particular character string, with the particular character string being a representation of a first one or more portions of a particular nucleic acid sequence.
- the particular nucleic acid sequence may comprise a second one or more portions that may encode a particular protein, with the first one or more portions of the particular nucleic acid sequence being distinct from the second one or more portions of the particular nucleic acid sequence.
- the embodiment may also comprise determining a plurality of hash values of a corresponding plurality of substrings associated with the particular character string, determining a signature value for the first one or more portions of the particular nucleic acid sequence based on the plurality of hash values, and forming an association in a database between the signature value for the first one or more portions of the particular nucleic acid sequence and the particular protein.
- the embodiment may also comprise using the association formed between the signature value for the first one or more portions of the particular nucleic acid sequence and the particular protein to determine or confirm at least one aspect of a genetic function of the particular nucleic acid sequence.
- the computer-implemented method may comprise also using other associations in the database between other signature values of other character strings to determine or confirm the at least one aspect of a genetic function of the particular nucleic acid sequence.
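The substring-hash approach described above might look like the following sketch. The text does not say how the substrings are chosen or how the hash values are combined, so the fixed-length sliding window (`k`), the use of SHA-256, and the names `substring_hashes` and `signature` are assumptions for illustration only.

```python
import hashlib
from typing import Dict, List


def substring_hashes(text: str, k: int = 4) -> List[int]:
    """Hash every length-k substring of the character string (assumed scheme)."""
    hashes = []
    for i in range(len(text) - k + 1):
        digest = hashlib.sha256(text[i:i + k].encode("ascii")).digest()
        hashes.append(int.from_bytes(digest[:8], "big"))  # keep 64 bits per substring
    return hashes


def signature(text: str, k: int = 4) -> str:
    """Combine the substring hash values, in order, into one signature value."""
    combined = hashlib.sha256()
    for value in substring_hashes(text, k):
        combined.update(value.to_bytes(8, "big"))
    return combined.hexdigest()


# Hypothetical association table: signature value -> protein name.
protein_by_signature: Dict[str, str] = {}
protein_by_signature[signature("ACGTAC")] = "example protein"
```

Because the signature depends on every substring hash in order, changing a single character of the string changes the signature.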
- a computer-implemented method may comprise determining a particular intron signature value corresponding to a particular intron, with the particular intron signature value being based on a first computational function applied to one or more portions of a representation of a particular nucleic acid sequence corresponding to the particular intron.
- the embodiment may also comprise determining a particular protein signature value corresponding to a particular protein, with the protein signature value being based on a second computational hash function applied to a representation of the particular protein, and adding a record to a database, with the record comprising a first field for the particular intron signature value and a second field for the particular protein signature value.
- the database may comprise a plurality of records corresponding to a plurality of nucleic acid sequences, with each of the records comprising an intron signature value and a corresponding protein signature value, with each of the intron signature values having been determined for and based on the first computational function applied to a portion of a corresponding nucleic acid sequence, the corresponding nucleic acid sequence encoding a particular gene, and each of the protein signature values having been determined based on a second computational hash function applied to a representation of a protein associated with the corresponding nucleic acid sequence.
- the computer-implemented method may comprise using information in the database to determine or confirm at least one aspect of a genetic function of the particular nucleic acid sequence.
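A record with one field per signature value can be sketched with Python's built-in `sqlite3` module. The table name, column names, and helper functions below are hypothetical; the database organizations actually used are the ones shown in FIGS. 2A and 2B, which this example does not attempt to reproduce.

```python
import sqlite3
from typing import List


def open_database() -> sqlite3.Connection:
    """Create an in-memory database with one table of signature records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE signatures ("
        "intron_signature TEXT NOT NULL, "   # first field
        "protein_signature TEXT NOT NULL)"   # second field
    )
    return conn


def add_record(conn: sqlite3.Connection, intron_sig: str, protein_sig: str) -> None:
    """Add a record holding an intron signature and its protein signature."""
    conn.execute("INSERT INTO signatures VALUES (?, ?)", (intron_sig, protein_sig))
    conn.commit()


def proteins_for_intron(conn: sqlite3.Connection, intron_sig: str) -> List[str]:
    """Look up the protein signatures recorded for an intron signature."""
    rows = conn.execute(
        "SELECT protein_signature FROM signatures WHERE intron_signature = ?",
        (intron_sig,),
    )
    return [row[0] for row in rows]
```

A caller would add one record per analysed nucleic acid sequence and then query by intron signature when examining a new sequence.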
- a non-transitory computer-readable recording medium storing one or more programs, which, when executed, may cause one or more processors to obtain a representation of a nucleic acid sequence, with the nucleic acid sequence encoding a particular gene and the particular gene encoding a particular protein.
- the nucleic acid sequence may comprise at least one intron.
- the embodiment may also comprise causing the one or more processors to determine an intron signature value corresponding to the at least one intron, with the intron signature value being based on a first computational function applied to one or more portions of the representation of the nucleic acid sequence corresponding to the at least one intron, and to determine a protein signature value corresponding to the particular protein, with the protein signature value being based on a second computational function applied to a representation of the particular protein.
- the embodiment may also comprise causing the one or more processors to form, in a database, an association between the intron signature value and the protein signature value.
- the non-transitory computer-readable recording medium may store one or more programs which, when executed, cause one or more processors to use the association between the intron signature value and the protein signature value to determine or confirm at least one aspect of a genetic function of a particular nucleic acid sequence.
- a non-transitory computer-readable recording medium storing one or more programs, which, when executed, may cause one or more processors to obtain a representation of a nucleic acid sequence, with the nucleic acid sequence comprising multiple non-overlapping nucleic acid subsequences, and with each of the nucleic acid subsequences encoding an amino acid.
- the embodiment may also comprise causing the one or more processors to determine a plurality of subsequences of the nucleic acid sequence, with the plurality of subsequences being determined by a binary counting of the nucleic acid sequence, and to determine a digital signature for the nucleic acid sequence based on a computational hash function applied to the plurality of subsequences of the nucleic acid sequence.
- the non-transitory computer-readable recording medium may store one or more programs which, when executed, cause one or more processors to use the digital signature for the nucleic acid sequence to determine or confirm at least one aspect of a genetic function of the nucleic acid sequence.
- a non-transitory computer-readable recording medium storing one or more programs, which, when executed, may cause one or more processors to obtain a particular character string, with the particular character string being a representation of a first one or more portions of a particular nucleic acid sequence, with the particular nucleic acid sequence comprising a second one or more portions that encode a particular protein, and with the first one or more portions of the particular nucleic acid sequence being distinct from the second one or more portions of the particular nucleic acid sequence.
- the embodiment may also comprise causing the one or more processors to determine a plurality of hash values of a corresponding plurality of substrings associated with the particular character string, and to determine a signature value for the first one or more portions of the particular nucleic acid sequence based on the first plurality of hash values.
- the embodiment may also comprise causing the one or more processors to form an association in a database between the signature value for the first one or more portions of the particular nucleic acid sequence and the particular protein, and to use the association to determine or confirm at least one aspect of a genetic function of the particular nucleic acid sequence.
- the non-transitory computer-readable recording medium may store one or more programs which, when executed, cause one or more processors to use other associations in the database between other signature values of other character strings to determine or confirm the at least one aspect of a genetic function of the particular nucleic acid sequence.
- a non-transitory computer-readable recording medium storing one or more programs, which, when executed, may cause one or more processors to determine a particular intron signature value corresponding to a particular at least one intron, with the particular intron signature value being based on a first computational function applied to one or more portions of a representation of a particular nucleic acid sequence corresponding to the particular at least one intron.
- the embodiment may also comprise causing the one or more processors to determine a particular protein signature value corresponding to a particular protein, with the protein signature value being based on a second computational hash function applied to a representation of the particular protein, and to add a record to a database, with the record comprising a first field for the particular intron signature value and a second field for the particular protein signature value.
- the database may comprise a plurality of records corresponding to a plurality of nucleic acid sequences, with each of the records comprising an intron signature value and a corresponding protein signature value, with each of the intron signature values having been determined for and based on the first computational function applied to a portion of a corresponding nucleic acid sequence, the corresponding nucleic acid sequence encoding a particular gene, and each of the protein signature values having been determined based on a second computational hash function applied to a representation of a protein associated with the corresponding nucleic acid sequence.
- the non-transitory computer-readable recording medium may store one or more programs which, when executed, cause one or more processors to use information in the database to determine or confirm at least one aspect of a genetic function of the particular nucleic acid sequence.
- nucleic acid sequence encodes a particular gene and the particular gene encodes a particular protein, the nucleic acid sequence comprising at least one intron;
- nucleic acid sequence comprising multiple non-overlapping nucleic acid subsequences, each of the nucleic acid subsequences encoding an amino acid
- a computer-implemented method of encoding a nucleic acid sequence, the method implemented by hardware in combination with software, the method comprising:
- nucleic acid sequence comprising multiple non-overlapping nucleic acid subsequences, each of the nucleic acid subsequences encoding an amino acid
- a method according to the previous embodiment further comprising: (D) using the digital signature for the nucleic acid sequence to determine or confirm at least one aspect of a genetic function of the nucleic acid sequence.
- a computer-implemented method implemented by hardware in combination with software, the method comprising:
- the database comprises a plurality of records corresponding to a plurality of nucleic acid sequences, each of the records comprising an intron signature value and a corresponding protein signature value, each of the intron signature values having been determined for and based on the first computational function applied to a portion of a corresponding nucleic acid sequence, the corresponding nucleic acid sequence encoding a particular gene, and each of the protein signature values having been determined based on a second computational hash function applied to a representation of a protein associated with the corresponding nucleic acid sequence.
- a method according to the previous embodiment further comprising: (D) using information in the database to determine or confirm at least one aspect of a genetic function of the particular nucleic acid sequence.
- a non-transitory computer-readable recording medium storing one or more programs, which, when executed, cause one or more processors to, at least:
- nucleic acid sequence comprising multiple non-overlapping nucleic acid subsequences, each of the nucleic acid subsequences encoding an amino acid
- (c) determine a digital signature for the nucleic acid sequence based on a computational hash function applied to the plurality of subsequences of the nucleic acid sequence.
- a non-transitory computer-readable recording medium storing one or more programs, which, when executed, cause one or more processors to, at least:
- the particular nucleic acid sequence comprises a second one or more portions that encode a particular protein, the first one or more portions of the particular nucleic acid sequence being distinct from the second one or more portions of the particular nucleic acid sequence;
- a non-transitory computer-readable recording medium storing one or more programs, which, when executed, cause one or more processors to, at least:
- the database comprises a plurality of records corresponding to a plurality of nucleic acid sequences, each of the records comprising an intron signature value and a corresponding protein signature value, each of the intron signature values having been determined for and based on the first computational function applied to a portion of a corresponding nucleic acid sequence, the corresponding nucleic acid sequence encoding a particular gene, and each of the protein signature values having been determined based on a second computational hash function applied to a representation of a protein associated with the corresponding nucleic acid sequence.
- FIG. 1 shows an overview of a system according to embodiments hereof.
- FIGS. 2A and 2B show exemplary database organizations according to embodiments hereof.
- FIG. 3 is a schematic diagram of a computer system used in embodiments hereof.
- alphanumeric character means a character that is either a letter or a number;
- alphanumeric string means a character string consisting of letters and/or numbers;
- DNA means deoxyribonucleic acid;
- DNA sequence means a representation of the order in which the nucleotide bases are arranged within a nucleic acid sequence;
- exon means a segment of a DNA or RNA molecule containing information coding for a protein or peptide sequence;
- FNV means the Fowler-Noll-Vo hash function;
- genomic means the nuclear or organellar DNA content of a biological individual or sample;
- intron means the part of a nucleotide sequence that's not an exon; and
- RNA means ribonucleic acid.
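The glossary names the Fowler-Noll-Vo (FNV) hash function without saying which variant or width is meant. As a minimal sketch, assuming the 32-bit FNV-1a variant, the hash can be computed as below; the function name `fnv1a_32` is chosen for this example.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME_32 = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """Compute the 32-bit FNV-1a hash of a byte string."""
    value = FNV_OFFSET_BASIS_32
    for byte in data:
        value ^= byte                                # FNV-1a: XOR first...
        value = (value * FNV_PRIME_32) & 0xFFFFFFFF  # ...then multiply, mod 2**32
    return value


# Example: hashing a short nucleotide string given as ASCII bytes.
example_hash = fnv1a_32(b"ACGT")
```

FNV-1 differs only in the order of the two steps (multiply, then XOR), so switching variants is a two-line change.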
- the term “mechanism” refers to any device(s), process(es), service(s), or combination thereof.
- a mechanism may be implemented in hardware, software, firmware, using a special-purpose device, or any combination thereof.
- a mechanism may be integrated into a single device or it may be distributed over multiple devices. The various components of a mechanism may be co-located or distributed. The mechanism may be formed from other mechanisms.
- the term “mechanism” may thus be considered to be shorthand for the term device(s) and/or process(es) and/or service(s).
- a DNA sequence is a sequence of nucleotide bases selected from the four nucleotide bases adenine (A), cytosine (C), guanine (G), and thymine (T).
- A adenine
- C cytosine
- G guanine
- T thymine
- a DNA sequence S may be written as $s_1 s_2 \ldots s_k$, wherein each element $s_i$ in the sequence is selected from the set {A, C, T, G}.
- a DNA sequence takes the form of a double helix formed with base pairs.
- in base pairs, an adenine (A) nucleotide pairs with a thymine (T) nucleotide (and vice versa) and a cytosine (C) nucleotide pairs with a guanine (G) nucleotide (and vice versa).
- a DNA sequence encodes a genetic function
- a DNA sequence may contain a gene, e.g., encoding for a protein.
- amino acids there are twenty (20) amino acids (alanine, arginine, asparagine, aspartate, cysteine, glutamate, glutamine, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine) and all proteins are formed from combinations of these amino acids.
- mRNA messenger RNA
- Transcription need not (and usually does not) read or copy the entire DNA, only the parts within the DNA that hold the information to perform the required function - e.g., to make the desired protein.
- a starting point is found in the DNA for the desired protein and then a transcript of the DNA is made (using RNA) up to an end point in the DNA.
- RNA copies at least a portion of the DNA using complementary bases according to the following rules:
- in RNA, the complement of A is U (uracil), not thymine (T).
- an RNA sequence R may be written as $r_1 r_2 \ldots r_k$, wherein each $r_i$ is selected from the set {A, C, U, G}.
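The complement rule for transcription can be illustrated with a short Ruby sketch. The method name `transcribe` and the choice to treat the input as the strand being read are assumptions made for illustration; the source does not give this code.

```ruby
# Hypothetical helper: map each DNA base to its complementary RNA base.
# In RNA the complement of A is U (not T); T pairs with A; C and G pair with each other.
RNA_COMPLEMENT = { "A" => "U", "T" => "A", "C" => "G", "G" => "C" }.freeze

def transcribe(dna)
  # fetch raises KeyError on any character outside {A, C, G, T}
  dna.each_char.map { |base| RNA_COMPLEMENT.fetch(base) }.join
end

puts transcribe("AAGCGT")
```

Using `fetch` rather than `[]` means an unexpected character fails loudly instead of silently producing a shorter transcript.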
- before being sent from the nucleus into the cytoplasm of the cell, the RNA may undergo further processing, as explained here.
- each gene may code for a specific protein, and the DNA that makes up a gene provides the code. However, not all of the DNA within a gene may directly code for proteins. The parts of DNA that code directly for proteins are called exons. Non-coding portions of DNA that are within a gene and are between exons (coding portions) are referred to as introns. A gene may thus comprise one or more exons interspersed with one or more introns (e.g., as shown in the following):
- the mRNA would consist of:
- <EXON-T k> is the RNA sequence corresponding to the DNA sequence
- the mRNA comprising the gene's concatenated exons is passed out of the cell's nucleus to the cytoplasm where translation of the mRNA to a protein occurs. Specifically, once the mRNA is made from transcription, the mRNA moves from the cell's nucleus into the cytoplasm where a ribosome assembles amino acids associated with another molecule known as tRNA that match the codon of mRNA sequences moving through the ribosome (as specified or encoded in the mRNA) according to the following process.
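The relationship between a gene's exons, its introns, and the spliced mRNA can be sketched as follows. The `Segment` struct, the method names, and the toy sequences are hypothetical and chosen only for illustration.

```ruby
# Hypothetical representation: a gene as an ordered list of tagged segments.
Segment = Struct.new(:kind, :sequence)

# The mRNA consists of the gene's exons concatenated in order.
def spliced_mrna(segments)
  segments.select { |seg| seg.kind == :exon }.map(&:sequence).join
end

# The introns are the segments between the exons that are left out of the mRNA.
def intron_sequences(segments)
  segments.select { |seg| seg.kind == :intron }.map(&:sequence)
end

example_gene = [
  Segment.new(:exon, "AUGGCC"),
  Segment.new(:intron, "GUAAGU"),
  Segment.new(:exon, "CCUUAA")
]
puts spliced_mrna(example_gene)
puts intron_sequences(example_gene).inspect
```

Keeping the introns as a separate list, rather than discarding them, matches the document's later use of intron sequences as the input to signature computation.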
- RNA sequence may conveniently be grouped into triples called codons.
- As conventionally understood, one triple or codon - the first codon of the mRNA transcript - encodes a start codon. The most common start codon is AUG.
- a stop codon (or termination codon) is a nucleotide triplet within mRNA that signals termination of translation. There are several stop codons, specifically, in RNA: UAG, UAA, and UGA (in DNA: TAG, TAA, TGA).
- Each codon corresponds to one of the twenty amino acids (recall that there are twenty (20) amino acids and all proteins are formed from combinations of these amino acids).
- since there are four bases, there are sixty-four possible codons (i.e., there are 64 triples that can be formed from the bases A, C, G, T (or A, C, G, U)).
- As there are 20 amino acids and 64 possible codons that code for amino acids (less the start and stop codons) some amino acids are produced by more than one codon.
- the four mRNA codons CCU, CCC, CCA, CCG all correspond to the amino acid proline.
- the four mRNA codons GCU, GCC, GCA, GCG all correspond to the amino acid alanine.
- these codons differ only in their third nucleotide.
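The codon arithmetic above (four bases, three positions, 64 triples) can be checked with a few lines of Ruby. The constant and method names are hypothetical, and the codon table is deliberately partial: it contains only the proline and alanine codons named in the text, not the full genetic code.

```ruby
BASES = %w[A C G U].freeze

# Every ordered triple of bases: 4 * 4 * 4 = 64 codons.
ALL_CODONS = BASES.repeated_permutation(3).map(&:join).freeze

# Partial table: only the two amino acids given as examples in the text.
PARTIAL_CODON_TABLE = {
  "CCU" => "proline", "CCC" => "proline", "CCA" => "proline", "CCG" => "proline",
  "GCU" => "alanine", "GCC" => "alanine", "GCA" => "alanine", "GCG" => "alanine"
}.freeze

# Codons in the partial table that encode the given amino acid.
def synonymous_codons(amino_acid)
  PARTIAL_CODON_TABLE.select { |_codon, aa| aa == amino_acid }.keys
end

puts ALL_CODONS.size
puts synonymous_codons("proline").inspect
```

The synonymous codons returned for each amino acid share their first two nucleotides, which is the "differ only in their third nucleotide" observation.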
- a ribosome recognizes the start codon in the mRNA and reads the mRNA one codon at a time.
- An amino acid (one of the twenty) attached to an available tRNA molecule in the cytoplasm is attracted by the ribosome and joined to form a sequence of amino acids based on the underlying mRNA codons being processed through the ribosome, according to the rules in the following table, until a "stop" codon (TAA, TAG, TGA for DNA or UAA, UAG, UGA for RNA) is reached.
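The codon-by-codon reading described above can be sketched as below. This is a minimal illustration, not the document's implementation: the method name `coding_codons` is hypothetical, the start codon is assumed to be the first AUG in the string, and the amino-acid lookup itself is omitted because the codon table is not reproduced here.

```ruby
RNA_STOP_CODONS = %w[UAA UAG UGA].freeze

# Returns the codons read by the ribosome, from the start codon (inclusive)
# up to, but not including, the first in-frame stop codon.
def coding_codons(mrna)
  start = mrna.index("AUG")
  return [] if start.nil?

  # Group the transcript into complete triples beginning at the start codon.
  codons = mrna[start..-1].scan(/.{3}/)
  stop_index = codons.index { |codon| RNA_STOP_CODONS.include?(codon) }
  stop_index ? codons[0...stop_index] : codons
end

puts coding_codons("GGAUGCCUGCAUAAGG").inspect
```

Trailing bases that do not form a complete triple are dropped by the `/.{3}/` scan, so a partial codon at the end never reaches the lookup step.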
- a system 100 includes a number of mechanisms 102 interacting with one or more databases 104.
- the exemplary mechanisms 102 may include database mechanism(s) 106, including database access mechanism(s) 108 and database maintenance mechanism(s) 110, signature determination mechanism(s) 112, which may include hashing mechanism(s) 114 and permuting mechanism(s) 116. Additionally, the mechanisms 102 may include sequence acquisition mechanism(s) 118, interface mechanism(s) 120, miscellaneous and auxiliary mechanism(s) 122, and analysis mechanism(s) 124.
- the interface mechanism(s) 120 may include user interface (UI) mechanism(s) 126.
- the mechanisms may include the following mechanisms (described below): the
- the databases 104 may include one or more sequence databases 128 and one or more association databases 130.
- the sequence database(s) 128 may include a mapping of sequence identifiers to corresponding sequences such as DNA sequences, RNA sequences, proteins, or the like.
- the sequences may correspond to introns or exons.
- the sequence identifiers may be arbitrarily assigned by the system.
- the association database(s) 130 preferably include a mapping of signatures (described in greater detail below) to associated and corresponding DNA sequences (e.g., as identified by their sequence identifiers).
- the signatures may also map to one or more of: (i) one or more intron sequences (e.g., as identified by their sequence identifiers), (ii) one or more exon sequences (e.g., as identified by their sequence identifiers), and (iii) a protein.
- a particular signature may not map to all of the fields.
- the database may include other mappings, e.g., from intron signatures to genetic functions or the like.
- the protein, if included, may be identified by a protein identifier that maps to a separate protein database (not shown) or it may be identified by a sequence identifier, mapping to the sequence database 128.
- signature Sk maps to DNA sequence identified by sequence identifier IDk (that maps to a DNA sequence in the sequence database 128 in FIG. 2B).
- Signature Sk may also map to (i) a sequence of one or more introns (identified by a corresponding one or more sequence identifiers ID-I-k ... ); (ii) a sequence of one or more exons (identified by a corresponding one or more sequence identifiers ID-E-k ... ); and (iii) a protein (identified by a corresponding sequence identifier £).
- mappings are shown as tables. Those of ordinary skill in the art will realize and appreciate, upon reading this description, that these are merely exemplary representations of the databases and their mappings, and that different and/or other database organizations and/or representations may be used. It should be appreciated that the system is not limited by the particular database implementation or organization.
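One possible in-memory representation of these two mappings is sketched below. It assumes plain Ruby hashes stand in for the sequence database 128 and the association database 130; the struct name, field names, identifiers, and the signature value 1234 are all hypothetical.

```ruby
# Hypothetical record in the association database: every field except dna_id may be empty or nil,
# since a particular signature may not map to all of the fields.
AssociationRecord = Struct.new(:dna_id, :intron_ids, :exon_ids, :protein_id)

# Look up a signature and resolve each sequence identifier through the sequence database.
def resolve_signature(signature, association_db, sequence_db)
  record = association_db[signature]
  return nil if record.nil?

  {
    dna: sequence_db[record.dna_id],
    introns: record.intron_ids.map { |id| sequence_db[id] },
    exons: record.exon_ids.map { |id| sequence_db[id] },
    protein: record.protein_id && sequence_db[record.protein_id]
  }
end

sequence_db = {
  "ID-1" => "ATGGCCGTAAGTCCTTAA",
  "ID-I-1" => "GTAAGT",
  "ID-E-1" => "ATGGCC",
  "ID-E-2" => "CCTTAA"
}
association_db = {
  1234 => AssociationRecord.new("ID-1", ["ID-I-1"], ["ID-E-1", "ID-E-2"], nil)
}
puts resolve_signature(1234, association_db, sequence_db).inspect
```

Storing identifiers rather than sequences in the association record keeps each sequence in one place, which is the separation the two-database description suggests.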
- nucleotide sequences or peptide sequences are typically stored in a text format, such as FASTA or FASTQ, using single-letter codes to represent the nucleotides or amino acids.
- in FASTA format, the bases adenine (A), cytosine (C), guanine (G), and thymine (T) are represented by the four ASCII letters "A", "C", "G", and "T", respectively.
- in RNA, the base thymine (T) is replaced by the base uracil (U), represented by the letter "U".
- nucleotides that constitute DNA and RNA may be expressed in five distinct characters (A, G, C, T, or U for adenine, guanine, cytosine, thymine, or uracil, respectively).
- each sequence element i.e., each nucleotide
- a DNA or RNA sequence may be considered to be a character string with the characters selected from (and limited to) "A", "C", "G", and "T" (or "U" instead of "T" for RNA).
- sequences may be represented as strings of characters.
- each letter may also be assigned a numeric value. It should be appreciated that standard encoding of values for the letters (e.g., ASCII) may be used, as long as the encoding is consistent across the system.
- the letters may be assigned the values, e.g., 1, 2, 3, 4, and 5 for the letters "A", "C", "G", "T", and "U", respectively.
- the DNA sequence "AAGCGT" would correspond to the numerical sequence (or number) "113234"
- the RNA sequence "AAGCGU" would correspond to the numerical sequence (or number) "113235".
- a DNA or RNA sequence may be directly manipulated and operated on arithmetically.
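The example encoding (A=1, C=2, G=3, T=4, U=5) can be expressed directly in Ruby. The names `BASE_VALUES` and `encode_sequence` are hypothetical; the values are the ones given in the text.

```ruby
# Letter-to-digit assignment taken from the example in the text.
BASE_VALUES = { "A" => 1, "C" => 2, "G" => 3, "T" => 4, "U" => 5 }.freeze

# Encode a DNA or RNA string as a string of digits, one digit per nucleotide.
def encode_sequence(sequence)
  sequence.each_char.map { |base| BASE_VALUES.fetch(base) }.join
end

puts encode_sequence("AAGCGT")
puts encode_sequence("AAGCGU")
# The digit string can be converted to an integer for arithmetic manipulation.
puts encode_sequence("AAGCGT").to_i + 1
```

A digit string is easy to read but not compact; a real system would likely pick a denser encoding for storage, which the text leaves open.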
- Those of ordinary skill in the art will realize and appreciate, upon reading this description, that different and/or other encodings of DNA and RNA sequences may be used, and that some of these encodings will provide more efficient storage than others, and that some of these encodings will provide more efficient computation than others.
- different coding and storage schemes may be used for different aspects of the system, again, as long as consistency is maintained. For example, it may be efficient to store data in one form and to operate on those data in another form.
- Those of ordinary skill in the art will understand how to select an encoding scheme that balances storage and computing requirements.
- a protein may be represented by a character string (e.g., with the letters drawn from the alphabet in the 1-letter column in the table above).
- each of the twenty characters may be assigned a different numeric value (e.g., 1 to 20 or the characters ASCII numeric value or some other value).
- strings i.e., strings of characters
- arbitrary characters are used to describe operation of aspects of some algorithms.
- the characters in the strings may represent DNA or RNA or protein sequences.
- the subsequence is preferably ordered and preferably includes the characters which are directly adjacent to Sj in a forward direction.
- the substring length is greater than or equal to 8.
- $\mathrm{Pattern}_1(S)$ allows substrings of any length greater than or equal to 1.
- the function Pattern(S) refers to $\mathrm{Pattern}_8(S)$.
- $\mathrm{Pattern}_8(S_i)$ = $S_{i-7} \ldots S_i$
- $\mathrm{Pattern}_8(S_i)$ is an ordered sequence of zero or more substrings. For notational purposes, these substrings or sub-sequences are shown in the text separated by the bar ("|") character.
- each letter is also given a subscript showing its position in the string.
- Pattern($G_0$) = Empty (no subsequences of length 8 or more)
- Pattern($C_7$) = GTGGGCCC (i.e., $G_0T_1G_2G_3G_4C_5C_6C_7$)
- Pattern($A_8$) = TGGGCCCA | GTGGGCCCA (i.e., $T_1G_2G_3G_4C_5C_6C_7A_8$ | $G_0T_1G_2G_3G_4C_5C_6C_7A_8$)
- Pattern($G_9$) = GGGCCCAG | ...
- Pattern($C_{11}$) = GCCCAGAC | ...
- the index into the ordered pattern sequence is referred to herein as the "offset," with the first element being element zero.
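The Pattern examples above are consistent with the following reading: for position i, list every substring that ends at i and has at least the minimum length, shortest first. The sketch below implements that reading; the method name `pattern_at` and the default minimum length parameter are assumptions, and the source's own implementation may differ.

```ruby
# Hypothetical sketch: all substrings of s that end at index i and are at least
# min_len characters long, ordered from shortest to longest.
# The position of each substring in the returned array is its "offset" (starting at 0).
def pattern_at(s, i, min_len = 8)
  longest = i + 1 # the longest candidate starts at index 0
  return [] if longest < min_len

  (min_len..longest).map { |len| s[(i - len + 1)..i] }
end

sequence = "GTGGGCCCAGAC"
puts pattern_at(sequence, 0).inspect
puts pattern_at(sequence, 7).inspect
puts pattern_at(sequence, 8).inspect
```

With this reading, the element at offset 0 is always the shortest qualifying substring, and each later offset extends it by one character to the left.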
- S' 1234567890ABCDEFGHIJ
- Pattern(S) is a pattern for that DNA or RNA or protein sequence.
- the count mechanisms operate on pattern sequences (i.e., on ordered alphanumeric sequences). Each count mechanism determines a count for each element of the sequence on which it operates. That is, if a sequence has k elements or subsequences (numbered from offset 0 to offset k-1), then that sequence will have k count values for each kind of count. When the ordered sequence of elements corresponds to a pattern for a string, the count mechanism determines the count for that pattern.
- Count(E) (for a particular count mechanism) provides corresponding count values for that DNA or RNA or protein sequence.
- the "offset count, bigger length" mechanism determines a corresponding vector or array of k count values, as follows:
- GCCCAGAC | GGCCCAGAC
- Offset Count Bigger Length(E) provides corresponding count values for that DNA or RNA or protein sequence.
- the count for the j-th element (E j ) increments by 1 if either (i) the target subsequence contains the compared subsequence, or (ii) the compared subsequence contains the target subsequence.
- Offset Count Bigger Length also matches elements at offsets 0 and 1.
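The mutual-containment rule can be sketched as a count over a list of subsequences. This is an illustration of the rule only: the method name `inclusion_counts` is hypothetical, the sketch assumes an element is not compared with itself, and it does not reproduce the distinction between the "bigger length" and "ignore length" variants, which the text does not fully specify.

```ruby
# For each element (the target), count how many other elements either contain it
# or are contained in it. Returns one count per element, in offset order.
def inclusion_counts(elements)
  elements.each_with_index.map do |target, j|
    elements.each_with_index.count do |compared, k|
      k != j && (target.include?(compared) || compared.include?(target))
    end
  end
end

puts inclusion_counts(["GCCCAGAC", "GGCCCAGAC", "TTTTTTTT"]).inspect
```

A sequence with k elements yields k count values, matching the description of the count mechanisms.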
- Offset_Count_Ignore_Length(Subsequence 2 ) 6
- Offset Count Ignore Length will always be greater than or equal to
- Offset Count Ignore Length(E) provides corresponding count values for that DNA or RNA or protein sequence.
- Subsequence 2 GTGGGCCCA
- the iScore mechanism determines the iScores for that pattern.
- iScore(E) provides corresponding iScore values for that DNA or RNA or protein sequence.
- the iScore mechanism provides a count of subsequence recurrence or inter-inclusiveness based on a one-nucleotide advance.
- Pattern 8 (S)
- the complementary sequence (C) is a sequence formed using complementary bases according to the following rules:
- in RNA, the complement of A is U (uracil), not thymine (T).
- an RNA sequence R may be written as $r_1 r_2 \ldots r_k$, wherein each $r_i$ is selected from the set {A, C, U, G}.
- each "T" is replaced by a "U”, otherwise the sequence remains the same.
- the reverse complementary sequence of S is the complement of the reverse of S.
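The complement, RNA form, and reverse complement of a DNA string map naturally onto Ruby's `String#tr` and `String#reverse`. The method names below are hypothetical; the base-pairing rules are the ones stated in the text.

```ruby
# Complement of a DNA sequence: A<->T and C<->G.
def dna_complement(sequence)
  sequence.tr("ACGT", "TGCA")
end

# RNA form of a DNA sequence: each T becomes U, otherwise unchanged.
def to_rna(sequence)
  sequence.tr("T", "U")
end

# Reverse complement: the complement of the reversed sequence.
def reverse_complement(sequence)
  dna_complement(sequence.reverse)
end

puts dna_complement("AAGCGT")
puts to_rna("AAGCGT")
puts reverse_complement("AAGCGT")
```

Because complementing and reversing act independently on each position, applying `reverse_complement` twice returns the original sequence.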
- the system maintains a count of the occurrences of all subsequences that include any subsequence's complement.
- the signature mechanism provides a signature value of an alphanumeric string, where the value may be used, e.g., as a database index or key for that string.
- the signature (Vs) for a particular sequence (represented, e.g., as a character string $s_1 \ldots s_n$) is obtained as follows:
- signature Vs is determined as a function of the sequence of hash values of the substrings.
- signature Vs is a hash of the concatenated sequence of hash values of the substrings.
- Vs is an intron signature for that gene. For example, suppose a gene has the form:
- Other information e.g., the length of the string S, may be used to determine Vs.
- a hash function is a function that converts its input into a numeric value, called its hash value.
- the hash function is a non-cryptographic hash function such as FNV (e.g., FNV1-64Bit).
- FNV non-cryptographic hash function
- Exemplary source code for the FNV hashing function is shown in Appendix I to Application PCT/US15/30478, which has been incorporated into this application for all purposes.
- the function call puts FNV.calculate("ABCDE") generates the hash value 813007184206524010 corresponding to the string "ABCDE".
- any hash function may be used; FNV is merely provided by way of example.
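A minimal sketch of a 64-bit FNV-1 hash, and of a signature built as a hash of concatenated substring hashes, is given below. It is not the source code from the referenced Appendix I, and it is not confirmed to reproduce the hash value quoted in the text. The method names are hypothetical, the FNV-1 variant (multiply, then XOR) is assumed, and joining the substring hashes as decimal strings is an assumption about how the concatenation is done. How a sequence is split into substrings is left to the caller.

```ruby
# Standard 64-bit FNV parameters.
FNV64_OFFSET_BASIS = 0xcbf29ce484222325
FNV64_PRIME = 0x100000001b3
MASK64 = 0xffffffffffffffff

# FNV-1: for each byte, multiply by the prime (mod 2**64), then XOR in the byte.
def fnv1_64(string)
  string.each_byte.reduce(FNV64_OFFSET_BASIS) do |hash, byte|
    ((hash * FNV64_PRIME) & MASK64) ^ byte
  end
end

# Assumed construction: hash each substring, concatenate the hash values as
# decimal strings, and hash the concatenation.
def signature_of(substrings)
  fnv1_64(substrings.map { |sub| fnv1_64(sub).to_s }.join)
end

puts fnv1_64("ABCDE")
puts signature_of(["GTAAGT", "GTGAGT"])
```

The mask keeps every intermediate value within 64 bits, which Ruby's arbitrary-precision integers would otherwise not enforce.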
- although any hash function may be used, preferred embodiments hereof use two-way or reversible hash functions. In preferred embodiments the same hash function must be used for all independent entities in the string to produce the result.
- MEN1 has 15 protein-coding transcripts and BRCA1 has 25 protein-coding transcripts.
- for a given gene with k protein-coding transcripts ($T_1, T_2 \ldots T_k$), the PhV mechanism generates an ordered sequence (or vector) of k values, where the i-th value is signature($T_i$).
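The vector construction can be sketched as a simple map over the transcripts' protein strings. The method name `protein_hash_vector` is hypothetical, and the default hash used here (CRC-32 from Ruby's `zlib` standard library) is only a stand-in so the example runs on its own; the text describes FNV-based signatures, and any signature function can be supplied as a block.

```ruby
require "zlib"

# Returns one signature per transcript protein, in transcript order.
# hash_fn is supplied as a block; CRC-32 is used only as a placeholder default.
def protein_hash_vector(proteins, &hash_fn)
  hash_fn ||= ->(protein) { Zlib.crc32(protein) }
  proteins.map { |protein| hash_fn.call(protein) }
end

# Toy protein strings in one-letter amino acid code (illustrative only).
proteins = %w[MAPK MAPR MGLS]
puts protein_hash_vector(proteins).inspect
```

Because the protein of a transcript does not change with offset, this vector is computed once per gene and reused as the fixed reference order.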
- T protein-coding transcripts
- Pattern ⁇ a Pattern mechanism
- the ordered sequence of elements for the intron(s) of the i-th transcript is referred to here as $E_i$. If a gene has m protein-coding transcripts, then m ordered sequences of transcripts ($E_1, E_2 \ldots E_m$) are generated.
- the iScore mechanism is applied to the components of $E_i$, giving, for each element of $E_i$ (i.e., at each offset in $E_i$) a corresponding iScore.
- the Position in Vector (PiV) mechanism determines an ordered sequence (or vector) of vector index values for each offset of an intron subsequence.
- the vector index values for each PiV are in the range 1 to m for a gene that has m protein-coding transcripts (m > 1).
- the protein hash vector mechanism (described above), applied to the proteins of the gene, gives a vector with m protein signatures (sometimes also referred to as protein hashes).
- the PiV for the given offset j is determined by comparing the iScore values for each of the other intron subsequences for this gene at the given offset j. If the sorted iScore values match the protein hash vector for the gene under consideration, then the PiV uses those indices. Since iValues may be the same for multiple indices, some PiV indices with the same iValue may be reordered.
- the PiV uses a max-order for the indices, where the max-order tries to maximize the match of the order of the PhV.
- Position in Vector is determined using subsequence hashes at the same nucleotide from zero.
- the ordered subsequence hashes are compared to the original position of the transcript protein hash, the constant of each transcript in the set being analyzed.
- the cumulative PiV is a count, for all subsequences, of the number of times they are assigned to each PiV.
- PhV and PiV are each relativity measures of all transcripts at an offset. Both maximize the ordering of transcripts based on the same application of ordering rules.
- PiV begins with the iScore, whereas PhV begins with the protein hash. The protein is the constant for each transcript at all offsets, so using its hash causes each transcript at any offset to start in the same order. The iScore, by contrast, is a variable that potentially starts each transcript in a different order at each offset.
- the following table is an iScore matrix for the MEN1 gene at offset 1,000.
- the following table is an iScore matrix for the BRCA1 gene at offset 7,000 (columns: Sequence, iScore, Position in iScore Vector).
- Programs that implement such methods may be stored and transmitted using a variety of media (e.g., computer readable media) in a number of manners.
- Hard-wired circuitry or custom hardware may be used in place of, or in combination with, some or all of the software instructions that can implement the processes of various embodiments.
- various combinations of hardware and software may be used instead of software only.
- FIG. 3 is a schematic diagram of a computer system 300 upon which embodiments of the present disclosure may be implemented and carried out.
- the computer system 300 includes a bus 302 (i.e., interconnect), one or more processors 304, one or more communications ports 314, a main memory 306, removable storage media 310, read-only memory 308, and a mass storage 312.
- Communication port(s) 314 may be connected to one or more networks by way of which the computer system 300 may receive and/or transmit data.
- a "processor” means one or more microprocessors, central processing units (CPUs), computing devices, microcontrollers, digital signal processors, or like devices or any combination thereof, regardless of their architecture.
- An apparatus that performs a process can include, e.g., a processor and those devices such as input devices and output devices that are appropriate to perform the process.
- Processor(s) 304 can be (or include) any known processor, such as, but not limited to, an Intel® Itanium® or Itanium 2® processor(s), AMD® Opteron® or Athlon MP® processor(s), or Motorola® lines of processors, and the like.
- Communications port(s) 314 can be any of an RS-232 port for use with a modem based dial-up connection, a 10/100 Ethernet port, a Gigabit port using copper or fiber, or a USB port, and the like. Communications port(s) 314 may be chosen depending on a network such as a Local Area Network (LAN), a Wide Area Network (WAN), a CDN, or any network to which the computer system 300 connects.
- LAN Local Area Network
- WAN Wide Area Network
- CDN Content Delivery Network
- the computer system 300 may be in communication with peripheral devices (e.g., display screen 316, input device(s) 318) via Input / Output (I/O) port 320. Some or all of the peripheral devices may be integrated into the computer system 300, and the input device(s) 318 may be integrated into the display screen 316 (e.g., in the case of a touch screen).
- Main memory 306 can be Random Access Memory (RAM), or any other dynamic storage device(s) commonly known in the art.
- Read-only memory 308 can be any static storage device(s) such as Programmable Read-Only Memory (PROM) chips for storing static information such as instructions for processor(s) 304.
- Mass storage 312 can be used to store information and instructions. For example, hard disks such as the Adaptec® family of Small Computer System Interface (SCSI) drives, an optical disc, an array of disks such as Redundant Array of Independent Disks (RAID), such as the Adaptec® family of RAID drives, or any other mass storage devices may be used.
- Bus 302 communicatively couples processor(s) 304 with the other memory, storage and communications blocks.
- Bus 302 can be a PCI / PCI-X, SCSI, a Universal Serial Bus (USB) based system bus (or other) depending on the storage devices used, and the like.
- Removable storage media 310 can be any kind of external hard-drives, floppy drives, IOMEGA® Zip Drives, Compact Disc - Read Only Memory (CD-ROM), Compact Disc - Re-Writable (CD-RW), Digital Versatile Disk - Read Only Memory (DVD-ROM), etc.
- Embodiments herein may be provided as one or more computer program products, which may include a machine-readable medium having stored thereon instructions, which may be used to program a computer (or other electronic devices) to perform a process.
- machine-readable medium refers to any medium, a plurality of the same, or a combination of different media, which participate in providing data (e.g., instructions, data structures) which may be read by a computer, a processor or a like device.
- Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media.
- Non-volatile media include, for example, optical or magnetic disks and other persistent memory.
- Volatile media include dynamic random access memory, which typically constitutes the main memory of the computer.
- Transmission media include coaxial cables, copper wire and fiber optics, including the wires that comprise a system bus coupled to the processor. Transmission media may include or convey acoustic waves, light waves and electromagnetic emissions, such as those generated during radio frequency (RF) and infrared (IR) data communications.
- RF radio frequency
- IR infrared
- the machine-readable medium may include, but is not limited to, floppy diskettes, optical discs, CD-ROMs, magneto-optical disks, ROMs, RAMs, erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or other type of media/machine-readable medium suitable for storing electronic instructions.
- embodiments herein may also be downloaded as a computer program product, wherein the program may be transferred from a remote computer to a requesting computer by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., modem or network connection).
- data may be (i) delivered from RAM to a processor; (ii) carried over a wireless transmission medium; (iii) formatted and/or transmitted according to numerous formats, standards or protocols; and/or (iv) encrypted in any of a variety of ways well known in the art.
- a computer-readable medium can store (in any appropriate format) those program elements that are appropriate to perform the methods.
- main memory 306 is encoded with application(s) 322 that support(s) the functionality as discussed herein (an application 322 may be an application that provides some or all of the functionality of one or more of the mechanisms described herein).
- Application(s) 322 (and/or other resources as described herein) can be embodied as software code such as data and/or logic instructions (e.g., code stored in the memory or on another computer readable medium such as a disk) that supports processing functionality according to different embodiments described herein.
- processor(s) 304 accesses main memory 306 via the use of bus 302 in order to launch, run, execute, interpret or otherwise perform the logic instructions of the application(s) 322.
- Execution of application(s) 322 produces processing functionality of the service(s) or mechanism(s) related to the application(s).
- the process(es) 324 represents one or more portions of the application(s) 322 performing within or upon the processor(s) 304 in the computer system 300.
- the application 322 itself (i.e., the un-executed or non-performing logic instructions and/or data).
- the application 322 may be stored on a computer readable medium (e.g., a repository) such as a disk or in an optical medium.
- the application 322 can also be stored in a memory type system such as in firmware, read only memory (ROM), or, as in this example, as executable code within the main memory 306 (e.g., within Random Access Memory or RAM).
- application 322 may also be stored in removable storage media 310, read-only memory 308, and/or mass storage device 312.
- the application(s) 322 may correspond, at least in part, to one or more of the mechanisms 102 shown in FIG. 1.
- each of the "Pattern” mechanism; the “count” mechanisms; the “offset count, bigger length” mechanism; the “offset count, ignore length” mechanism; the iScore mechanism; the Signature mechanism; PhV mechanism; and the Position in Vector (PiV) mechanism may conveniently be implemented by one or more application(s) 322 that, when executed, run as corresponding processes 324.
- the computer system 300 can include other processes and/or software and hardware components, such as an operating system that controls allocation and use of hardware resources.
- embodiments of the present invention include various steps or operations. A variety of these steps may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the operations.
- module refers to a self-contained functional component, which can include hardware, software, firmware or any combination thereof.
- an apparatus may include a computer/computing device operable to perform some (but not necessarily all) of the described process.
- Embodiments of a computer-readable medium storing a program or data structure include a computer-readable medium storing a program that, when executed, can cause a processor to perform some (but not necessarily all) of the described process.
- process may operate without any user intervention.
- process includes some human intervention (e.g., a step is performed by or with the assistance of a human).
- portion means some or all. So, for example, "A portion of X” may include some of “X” or all of "X”. In the context of a sequence, the term “portion” means some or all of the sequence.
- the phrase “at least some” means “one or more,” and includes the case of only one.
- the phrase “at least some ABCs” means “one or more ABCs”, and includes the case of only one ABC.
- the phrase “based on” means “based in part on” or “based, at least in part, on,” and is not exclusive.
- the phrase “based on factor X” means “based in part on factor X” or “based, at least in part, on factor X.”
- the phrase “based on X” does not mean “based only on X.”
- the phrase “using” means “using at least,” and is not exclusive. Thus, e.g., the phrase “using X” means “using at least X.” Unless specifically stated by use of the word “only”, the phrase “using X” does not mean “using only X.”
- overlap means “at least partially overlap.” Unless specifically stated, “overlap” does not mean fully overlap.
- sequence “ABCD” may be said to overlap the sequence “BCDE” and the sequence “ABCDE” and the sequence “DEFG”.
- distinct means “at least partially distinct.” Unless specifically stated, distinct does not mean fully distinct.
- the phrase, "X is distinct from Y” means that "X is at least partially distinct from Y,” and does not mean that "X is fully distinct from Y.”
- the phrase "X is distinct from Y” means that X differs from Y in at least some way. It should be appreciated that two fully or partially overlapping sequences can be distinct. As a non-limiting example, the sequence "BCD” overlaps and is distinct from the sequence "ABCDE”.
- a list may include only one item, and, unless otherwise stated, a list of multiple items need not be ordered in any particular manner.
- a list may include duplicate items.
- the phrase "a list of XYZs" may include one or more "XYZs”.
- compare_length = compare_hash["sequence"].length
- compare_hash["sequence"].include?(value_hash["sequence"])
- value_hash["sequence"].include?(compare_hash["sequence"])
- value_hash["iscore"] = (value_hash["offset_counts_same_transcript_ignore_length"] - value_hash["offset_counts_same_transcript_bigger_length"]) * 1.0 /
- sequence_hash["protein_hash_value"], "i" => sequence_hash["iscore"] }
- sequences_array = generate_offset_counts(sequences_array)
- sequences_array = calculate_iscore(sequences_array)
- sequences_array = input_data(transcript_stable_id)
- sequences_array = generate_offset_counts(sequences_array, offset)
- sequences_array = calculate_iscore(sequences_array)
- sort_by_hash_value(input_hash, "i") # step3 Reorder
- regroup_hash = regroup(input_hash)
- transcripts_count = input_hash.size if transcripts_count.nil?
- order_percentage = pop * 1.0 / transcripts_count
- gene_rank = total_gene_rank * 1.0 / input_hash.size
- group_number, transcripts] groupedj3_ranking_positions_hash[group_number] Array. new if
- group_numbers = regroup_hash.keys.sort
- transcripts = regroup_hash[group_number]
- temp_hash.has_key? p_ranking_position temp_hash[p_ranking_position] + 1
- group_index += transcripts.size
- previous_buttom = bottom if !bottom.nil? # reset bottom; bottom is nil only when all are pushed to the top end
- sorted_tmp = input_hash.sort_by {
- hash_values[hash_value_type] ⁇ prev hash value nil
- group_hash Hash, new hash_value.each_pair do
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Chemical & Material Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Health & Medical Sciences (AREA)
- Analytical Chemistry (AREA)
- Biophysics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Biotechnology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Medical Informatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Biology (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Organic Chemistry (AREA)
- Zoology (AREA)
- Wood Science & Technology (AREA)
- Bioethics (AREA)
- Microbiology (AREA)
- Immunology (AREA)
- Biochemistry (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201762443136P | 2017-01-06 | 2017-01-06 | |
PCT/IB2018/050053 WO2018127821A1 (en) | 2017-01-06 | 2018-01-04 | Systems, methods, and devices for analysis of genetic material |
Publications (2)
Publication Number | Publication Date |
---|---|
EP3566050A1 true EP3566050A1 (en) | 2019-11-13 |
EP3566050A4 EP3566050A4 (en) | 2020-11-25 |
Family
ID=62790982
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP18736154.8A Pending EP3566050A4 (en) | 2017-01-06 | 2018-01-04 | Systems, methods, and devices for analysis of genetic material |
Country Status (6)
Country | Link |
---|---|
EP (1) | EP3566050A4 (en) |
CN (1) | CN110651186B (en) |
AU (1) | AU2018205825A1 (en) |
CA (1) | CA3049176C (en) |
IL (1) | IL267867B1 (en) |
WO (1) | WO2018127821A1 (en) |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060224329A1 (en) * | 2004-12-01 | 2006-10-05 | David Roth | Protein arrays and methods and systems for producing the same |
CN102762743A (en) * | 2009-12-09 | 2012-10-31 | 阿维埃尔公司 | Biomarker assay for diagnosis and classification of cardiovascular disease |
US8725422B2 (en) * | 2010-10-13 | 2014-05-13 | Complete Genomics, Inc. | Methods for estimating genome-wide copy number variations |
CN104699998A (en) * | 2013-12-06 | 2015-06-10 | 国际商业机器公司 | Method and device for compressing and decompressing genome |
WO2015175602A1 (en) * | 2014-05-15 | 2015-11-19 | Codondex Llc | Systems, methods, and devices for analysis of genetic material |
US10325676B2 (en) * | 2015-06-15 | 2019-06-18 | Atgenomix Inc. | Method and system for high-throughput sequencing data analysis |
2018
- 2018-01-04 AU AU2018205825A patent/AU2018205825A1/en active Pending
- 2018-01-04 CA CA3049176A patent/CA3049176C/en active Active
- 2018-01-04 IL IL267867A patent/IL267867B1/en unknown
- 2018-01-04 WO PCT/IB2018/050053 patent/WO2018127821A1/en unknown
- 2018-01-04 EP EP18736154.8A patent/EP3566050A4/en active Pending
- 2018-01-04 CN CN201880012087.0A patent/CN110651186B/en active Active
Also Published As
Publication number | Publication date |
---|---|
IL267867A (en) | 2019-09-26 |
WO2018127821A1 (en) | 2018-07-12 |
CA3049176A1 (en) | 2018-07-12 |
AU2018205825A1 (en) | 2019-08-15 |
CN110651186B (en) | 2021-06-25 |
IL267867B1 (en) | 2024-10-01 |
CA3049176C (en) | 2023-09-05 |
EP3566050A4 (en) | 2020-11-25 |
CN110651186A (en) | 2020-01-03 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
Nie et al. | The phylogeny of leaf beetles (Chrysomelidae) inferred from mitochondrial genomes | |
Robb et al. | SmedGD 2.0: The Schmidtea mediterranea genome database | |
US8412462B1 (en) | Methods and systems for processing genomic data | |
Keedwell et al. | Intelligent bioinformatics: The application of artificial intelligence techniques to bioinformatics problems | |
Krogh et al. | A hidden Markov model that finds genes in E. coli DNA | |
Vision et al. | The origins of genomic duplications in Arabidopsis | |
US11308056B2 (en) | Systems and methods for SNP analysis and genome sequencing | |
Benson | Sequence alignment with tandem duplication | |
US10854314B2 (en) | Systems, methods, and devices for analysis of genetic material | |
Demongeot et al. | The uroboros theory of life’s origin: 22-nucleotide theoretical minimal RNA rings reflect evolution of genetic code and tRNA-rRNA translation machineries | |
WO2015013657A2 (en) | Method and system for rapid searching of genomic data and uses thereof | |
Gao et al. | Extent and evolution of gene duplication in DNA viruses | |
Oh et al. | Landscape of gene transposition–duplication within the Brassicaceae family | |
Dey et al. | Complete mitogenome of endemic plum-headed parakeet Psittacula cyanocephala–characterization and phylogenetic analysis | |
Morgan-Richards et al. | Sticky genomes: using NGS evidence to test hybrid speciation hypotheses | |
Benson | Composition alignment | |
CA3049176C (en) | Systems, methods, and devices for analysis of genetic material | |
US11017881B2 (en) | Systems, methods, and devices for analysis of genetic material | |
US8340917B2 (en) | Sequence matching allowing for errors | |
Clare et al. | Evolutionary search techniques for the Lyndon factorization of biosequences | |
US20220180064A1 (en) | Systems and methods for using dynamic reference graphs to accurately align sequence reads | |
Fujimoto et al. | Modeling global and local codon bias with deep language models | |
Timm | Analysis and application of hash-based similarity estimation techniques for biological sequence analysis | |
Salari et al. | A new hash function and its use in read mapping on genome | |
Bird et al. | Genome Sequence of Staphylococcus aureus Phage ESa2 | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
| PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
| 17P | Request for examination filed | Effective date: 20190805 |
| AK | Designated contracting states | Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| AX | Request for extension of the european patent | Extension state: BA ME |
| DAV | Request for validation of the european patent (deleted) | |
| DAX | Request for extension of the european patent (deleted) | |
| REG | Reference to a national code | Ref country code: DE Ref legal event code: R079 Free format text: PREVIOUS MAIN CLASS: G01N0033480000 Ipc: G16B0020000000 |
| A4 | Supplementary search report drawn up and despatched | Effective date: 20201028 |
| RIC1 | Information provided on ipc code assigned before grant | Ipc: G16B 20/00 20190101AFI20201022BHEP |
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: EXAMINATION IS IN PROGRESS |
| 17Q | First examination report despatched | Effective date: 20231124 |