EP3566050A1 - Systems, methods, and devices for analysis of genetic material

Info

Publication number: EP3566050A1
Application number: EP18736154.8A
Authority: EP
Inventors: Kevin Bermeister; Xinghao Yu
Original assignee: Codondex LLC
Current assignee: Codondex LLC
Priority date: 2017-01-06
Filing date: 2018-01-04
Publication date: 2019-11-13
Also published as: IL267867A; CA3049176C; CN110651186B; CA3049176A1; WO2018127821A1; IL267867B1; IL267867B2; AU2018205825A1; CN110651186A; EP3566050A4

Abstract

A representation of a nucleic acid sequence encodes a particular gene having at least one intron. An intron signature value corresponding to the at least one intron is determined based on a first computational function applied to at least one portion of the representation of the nucleic acid sequence corresponding to the at least one intron. A protein signature value is determined, being based on a second computational function applied to a representation of a protein. In a database, an association is formed between the intron and protein signature values. This process is repeated for each of a plurality of nucleic acid sequences. Nucleic acid sequences in the database are ordered based on a sort of corresponding intron signature values. An ordering determined by the sort is used to determine or confirm a role or function of a portion of a given nucleic acid sequence.

Description

Systems, Methods, and Devices for Analysis Of Genetic Material

Copyright Statement

This patent document contains material subject to copyright protection. The copyright owner has no objection to the reproduction of this patent document or any related materials in the files of the United States Patent and Trademark Office, but otherwise reserves all copyrights whatsoever.

Related Application

This application claims priority from U.S. Provisional Patent Application No. 62/443,136, filed January 6, 2017, the entire contents of which are hereby fully incorporated herein by reference for all purposes.

This application is a continuation-in-part (CIP) of U.S. Patent Application No. 15/345,505, filed November 7, 2016, which is a continuation of PCT/US2015/030478, filed May 13, 2015, which claims priority from copending (i) U.S. Provisional Patent Application No. 61/993,846, filed May 15, 2014; and (ii) U.S. Provisional Patent Application No. 62/015,492, filed June 22, 2014, the entire contents of each of which are hereby fully incorporated herein by reference for all purposes.

Source Code Appendix

This application includes a source code appendix with example computer source code. The source code appendix is considered part of this application for all purposes.

Field of the Invention

This invention relates to genetic engineering and microbiology, and, more particularly, to systems, methods, and devices for analysis of genetic material such as nucleotide sequences.

Summary

The present invention is specified in the claims as well as in the below description. The following summary is exemplary and not limiting. Presently preferred embodiments are particularly specified in the dependent claims and the description of various embodiments.

In one embodiment, a computer-implemented method, implemented by hardware in combination with software, may comprise obtaining a representation of a nucleic acid sequence. The nucleic acid sequence may encode a particular gene and the particular gene may encode a particular protein. The nucleic acid sequence may comprise at least one intron. The embodiment may also comprise determining an intron signature value corresponding to the intron, with the intron signature value being based on a first computational function applied to one or more portions of the representation of the nucleic acid sequence corresponding to the at least one intron. The embodiment may also comprise determining a protein signature value corresponding to the particular protein, with the protein signature value being based on a second computational function applied to a representation of the particular protein, and forming, in a database, an association between the intron signature value and the protein signature value.
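The acts in the preceding paragraph can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes the 64-bit FNV-1a variant of the Fowler-Noll-Vo hash for both the first and the second computational function (the text here does not say which function is used), and the names `intron_signature`, `protein_signature`, `associate`, and the in-memory dictionary standing in for the database are hypothetical.

```python
# Sketch only: FNV-1a (64-bit) is an assumed choice for both computational functions.
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """Return the 64-bit FNV-1a hash of a byte string."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


def intron_signature(introns: list[str]) -> int:
    """Hash the portions of the sequence representation that correspond to introns.

    Concatenating the intron strings before hashing is an assumption.
    """
    return fnv1a_64("".join(introns).upper().encode("ascii"))


def protein_signature(protein: str) -> int:
    """Hash a character-string representation of the protein."""
    return fnv1a_64(protein.upper().encode("ascii"))


# Hypothetical stand-in for the database: intron signature -> protein signature.
associations: dict[int, int] = {}


def associate(introns: list[str], protein: str) -> tuple[int, int]:
    """Compute both signature values and record the association between them."""
    i_sig = intron_signature(introns)
    p_sig = protein_signature(protein)
    associations[i_sig] = p_sig
    return i_sig, p_sig
```

Calling `associate(["GTAAGT", "GTGAGT"], "MAK")` with made-up inputs would store one association; repeating the call for each of a plurality of sequences fills the dictionary.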

In another aspect, the method may comprise repeating the above acts for each of a plurality of nucleic acid sequences.

In another aspect, the computer-implemented method may also comprise using the formed association between the intron signature value and the protein signature value to determine or confirm at least one aspect of a genetic function of a particular nucleic acid sequence.

In another aspect, the computer-implemented method may comprise using the formed association between the intron signature value and the protein signature value to determine or confirm at least one aspect of a genetic function of a particular nucleic acid sequence for each of a plurality of nucleic acid sequences.

In another embodiment, a computer-implemented method, implemented by hardware in combination with software, may comprise obtaining a representation of a nucleic acid sequence, with the nucleic acid sequence comprising multiple non-overlapping nucleic acid subsequences, with each of the nucleic acid subsequences encoding an amino acid. The embodiment may also comprise determining a particular amino acid encoded by a particular nucleic acid subsequence of the multiple non-overlapping nucleic acid subsequences. The particular nucleic acid subsequence may comprise a first nucleotide, a second nucleotide adjacent to the first nucleotide, and a third nucleotide adjacent to the second nucleotide, by considering a nucleotide pair consisting of: (i) the first nucleotide and the second nucleotide, or (ii) the second nucleotide and the third nucleotide.

In another embodiment, a computer-implemented method of encoding a nucleic acid sequence, the method implemented by hardware in combination with software, may comprise obtaining a representation of the nucleic acid sequence, with the nucleic acid sequence comprising multiple non-overlapping nucleic acid subsequences, with each of the nucleic acid subsequences encoding an amino acid. The embodiment may also comprise determining a plurality of subsequences of the nucleic acid sequence, with the plurality of subsequences determined by a binary counting of the nucleic acid sequence, and determining a digital signature for the nucleic acid sequence based on a computational hash function applied to the plurality of subsequences of the nucleic acid sequence.

In another aspect, the computer-implemented method may comprise using the digital signature for the nucleic acid sequence to determine or confirm at least one aspect of a genetic function of the nucleic acid sequence.

In another embodiment, a computer-implemented method, implemented by hardware in combination with software, may comprise obtaining a particular character string, with the particular character string being a representation of a first one or more portions of a particular nucleic acid sequence. The particular nucleic acid sequence may comprise a second one or more portions that may encode a particular protein, with the first one or more portions of the particular nucleic acid sequence being distinct from the second one or more portions of the particular nucleic acid sequence. The embodiment may also comprise determining a plurality of hash values of a corresponding plurality of substrings associated with the particular character string, determining a signature value for the first one or more portions of the particular nucleic acid sequence based on the first plurality of hash values, and forming an association in a database between the signature value for the first one or more portions of the particular nucleic acid sequence and the particular protein. The embodiment may also comprise using the association formed between the signature value for the first one or more portions of the particular nucleic acid sequence and the particular protein to determine or confirm at least one aspect of a genetic function of the particular nucleic acid sequence.

In another aspect, the computer-implemented method may comprise also using other associations in the database between other signature values of other character strings to determine or confirm the at least one aspect of a genetic function of the particular nucleic acid sequence.

In yet another embodiment, a computer-implemented method, implemented by hardware in combination with software, may comprise determining a particular intron signature value corresponding to a particular intron, with the particular intron signature value being based on a first computational function applied to one or more portions of a representation of a particular nucleic acid sequence corresponding to the particular intron. The embodiment may also comprise determining a particular protein signature value corresponding to a particular protein, with the protein signature value being based on a second computational hash function applied to a representation of the particular protein, and adding a record to a database, with the record comprising a first field for the particular intron signature value and a second field for the particular protein signature value. The database may comprise a plurality of records corresponding to a plurality of nucleic acid sequences, with each of the records comprising an intron signature value and a corresponding protein signature value, with each of the intron signature values having been determined for and based on the first computational function applied to a portion of a corresponding nucleic acid sequence, the corresponding nucleic acid sequence encoding a particular gene, and each of the protein signature values having been determined based on a second computational hash function applied to a representation of a protein associated with the corresponding nucleic acid sequence.

In another aspect, the computer-implemented method may comprise using information in the database to determine or confirm at least one aspect of a genetic function of the particular nucleic acid sequence.

In another embodiment, a non-transitory computer-readable recording medium storing one or more programs, which, when executed, may cause one or more processors to obtain a representation of a nucleic acid sequence, with the nucleic acid sequence encoding a particular gene and the particular gene encoding a particular protein. The nucleic acid sequence may comprise at least one intron. The embodiment may also comprise causing the one or more processors to determine an intron signature value corresponding to the at least one intron, with the intron signature value being based on a first computational function applied to one or more portions of the representation of the nucleic acid sequence corresponding to the at least one intron, and to determine a protein signature value corresponding to the particular protein, with the protein signature value being based on a second computational function applied to a representation of the particular protein. The embodiment may also comprise causing the one or more processors to form, in a database, an association between the intron signature value and the protein signature value.

In another aspect, the non-transitory computer-readable recording medium, storing one or more programs, which, when executed, may cause one or more processors to use the association between the intron signature value and the protein signature value to determine or confirm at least one aspect of a genetic function of a particular nucleic acid sequence.

In another embodiment, a non-transitory computer-readable recording medium storing one or more programs, which, when executed, may cause one or more processors to obtain a representation of a nucleic acid sequence, with the nucleic acid sequence comprising multiple non-overlapping nucleic acid subsequences, and with each of the nucleic acid subsequences encoding an amino acid. The embodiment may also comprise causing the one or more processors to determine a plurality of subsequences of the nucleic acid sequence, with the plurality of subsequences being determined by a binary counting of the nucleic acid sequence, and to determine a digital signature for the nucleic acid sequence based on a computational hash function applied to the plurality of subsequences of the nucleic acid sequence.

In another aspect, the non-transitory computer-readable recording medium, storing one or more programs, which, when executed, may cause one or more processors to use the digital signature for the nucleic acid sequence to determine or confirm at least one aspect of a genetic function of the nucleic acid sequence.

In another embodiment, a non-transitory computer-readable recording medium storing one or more programs, which, when executed, may cause one or more processors to obtain a particular character string, with the particular character string being a representation of a first one or more portions of a particular nucleic acid sequence, with the particular nucleic acid sequence comprising a second one or more portions that encode a particular protein, and with the first one or more portions of the particular nucleic acid sequence being distinct from the second one or more portions of the particular nucleic acid sequence. The embodiment may also comprise causing the one or more processors to determine a plurality of hash values of a corresponding plurality of substrings associated with the particular character string, and to determine a signature value for the first one or more portions of the particular nucleic acid sequence based on the first plurality of hash values. The embodiment may also comprise causing the one or more processors to form an association in a database between the signature value for the first one or more portions of the particular nucleic acid sequence and the particular protein, and to use the association to determine or confirm at least one aspect of a genetic function of the particular nucleic acid sequence.

In another aspect, the non-transitory computer-readable recording medium, storing one or more programs, which, when executed, may cause one or more processors to use other associations in the database between other signature values of other character strings to determine or confirm the at least one aspect of a genetic function of the particular nucleic acid sequence.

In another embodiment, a non-transitory computer-readable recording medium storing one or more programs, which, when executed, may cause one or more processors to determine a particular intron signature value corresponding to a particular at least one intron, with the particular intron signature value being based on a first computational function applied to one or more portions of a representation of a particular nucleic acid sequence corresponding to the particular at least one intron. The embodiment may also comprise causing the one or more processors to determine a particular protein signature value corresponding to a particular protein, with the protein signature value being based on a second computational hash function applied to a representation of the particular protein, and to add a record to a database, with the record comprising a first field for the particular intron signature value and a second field for the particular protein signature value. The database may comprise a plurality of records corresponding to a plurality of nucleic acid sequences, with each of the records comprising an intron signature value and a corresponding protein signature value, with each of the intron signature values having been determined for and based on the first computational function applied to a portion of a corresponding nucleic acid sequence, the corresponding nucleic acid sequence encoding a particular gene, and each of the protein signature values having been determined based on a second computational hash function applied to a representation of a protein associated with the corresponding nucleic acid sequence.

In another aspect, the non-transitory computer-readable recording medium, storing one or more programs, which, when executed, may cause one or more processors to use information in the database to determine or confirm at least one aspect of a genetic function of the particular nucleic acid sequence.

Other aspects of the invention are discussed herein.

Below, further numbered embodiments of the invention will be discussed.

1. A computer-implemented method, implemented by hardware in combination with software, the method comprising:

(A) obtaining a representation of a nucleic acid sequence, wherein the nucleic acid sequence encodes a particular gene and the particular gene encodes a particular protein, the nucleic acid sequence comprising at least one intron;

(B) determining an intron signature value corresponding to the at least one intron, the intron signature value being based on a first computational function applied to one or more portions of the representation of the nucleic acid sequence corresponding to the at least one intron;

(C) determining a protein signature value corresponding to the particular protein, the protein signature value being based on a second computational function applied to a representation of the particular protein; and

(D) forming, in a database, an association between the intron signature value and the protein signature value.

2. The method according to the previous embodiment, further comprising: (E) repeating acts (A) to (D) for each of a plurality of nucleic acid sequences.

3. The method according to any of the previous embodiments further comprising: (F) using the association formed in (D) to determine or confirm at least one aspect of a genetic function of a particular nucleic acid sequence.

4. The method according to any of the previous embodiments further comprising: (F) using the association formed in (D) to determine or confirm at least one aspect of a genetic function of a particular nucleic acid sequence.

5. A computer-implemented method, implemented by hardware in combination with software, the method comprising:

(A) obtaining a representation of a nucleic acid sequence, the nucleic acid sequence comprising multiple non-overlapping nucleic acid subsequences, each of the nucleic acid subsequences encoding an amino acid;

(B) determining a particular amino acid encoded by a particular nucleic acid subsequence of the multiple non-overlapping nucleic acid subsequences, wherein the particular nucleic acid subsequence comprises a first nucleotide, a second nucleotide adjacent to the first nucleotide, and a third nucleotide adjacent to the second nucleotide, by considering a nucleotide pair consisting of: (i) the first nucleotide and the second nucleotide, or (ii) the second nucleotide and the third nucleotide.
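Case (i) of act (B) can be illustrated with the standard genetic code, in which eight codon families are fully determined by their first two nucleotides. The sketch below is only one possible reading of the embodiment: the table covers those eight families (written with DNA letters), the function returns `None` when the pair alone does not determine the amino acid, and the names `PAIR_TO_AMINO_ACID`, `split_codons`, and `amino_acid_from_pair` are hypothetical.

```python
from typing import Optional

# Codon families of the standard genetic code in which the first two
# nucleotides alone determine the amino acid (DNA letters, coding strand).
PAIR_TO_AMINO_ACID = {
    "CT": "leucine",
    "GT": "valine",
    "TC": "serine",
    "CC": "proline",
    "AC": "threonine",
    "GC": "alanine",
    "CG": "arginine",
    "GG": "glycine",
}


def split_codons(sequence: str) -> list[str]:
    """Split a sequence into non-overlapping three-nucleotide subsequences.

    Trailing nucleotides that do not fill a complete triplet are ignored.
    """
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3].upper() for i in range(0, usable, 3)]


def amino_acid_from_pair(codon: str) -> Optional[str]:
    """Return the amino acid implied by the first nucleotide pair of a codon.

    Returns None when the pair is not one of the eight families above,
    i.e. when the third nucleotide would also have to be considered.
    """
    if len(codon) != 3:
        raise ValueError("a codon must contain exactly three nucleotides")
    return PAIR_TO_AMINO_ACID.get(codon[:2].upper())
```

For example, every codon beginning with `GC` maps to alanine, so `amino_acid_from_pair("GCA")` and `amino_acid_from_pair("GCT")` give the same result, whereas `ATG` needs all three nucleotides and yields `None` here.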

6. A computer-implemented method of encoding a nucleic acid sequence, the method implemented by hardware in combination with software, the method comprising:

(A) obtaining a representation of the nucleic acid sequence, the nucleic acid sequence comprising multiple non-overlapping nucleic acid subsequences, each of the nucleic acid subsequences encoding an amino acid;

(B) determining a plurality of subsequences of the nucleic acid sequence, wherein the plurality of subsequences are determined by a binary counting of the nucleic acid sequence;

(C) determining a digital signature for the nucleic acid sequence based on a computational hash function applied to the plurality of subsequences of the nucleic acid sequence.

7. A method according to the previous embodiment further comprising: (D) using the digital signature for the nucleic acid sequence to determine or confirm at least one aspect of a genetic function of the nucleic acid sequence.

8. A computer-implemented method, implemented by hardware in combination with software, the method comprising:

(A) obtaining a particular character string, the particular character string being a representation of a first one or more portions of a particular nucleic acid sequence, wherein the particular nucleic acid sequence comprises a second one or more portions that encode a particular protein, the first one or more portions of the particular nucleic acid sequence being distinct from the second one or more portions of the particular nucleic acid sequence;

(B) determining a plurality of hash values of a corresponding plurality of substrings associated with the particular character string;

(C) determining a signature value for the first one or more portions of the particular nucleic acid sequence based on the first plurality of hash values;

(D) forming an association in a database between the signature value for the first one or more portions of the particular nucleic acid sequence and the particular protein; and

(E) using the association formed in (D) to determine or confirm at least one aspect of a genetic function of the particular nucleic acid sequence.
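Acts (B) through (E) can be sketched as below, under several assumptions that the text does not confirm: the substrings are taken to be all contiguous windows of a fixed length `k`, BLAKE2b truncated to 64 bits stands in for the unspecified hash function, the hash values are combined by XOR, and an in-memory SQLite table stands in for the database. All function, table, and column names are hypothetical.

```python
import hashlib
import sqlite3
from typing import Optional


def substring_hashes(text: str, k: int) -> list[int]:
    """Hash every contiguous length-k substring (assumed substring scheme)."""
    hashes = []
    for i in range(len(text) - k + 1):
        digest = hashlib.blake2b(text[i:i + k].encode("ascii"), digest_size=8).digest()
        hashes.append(int.from_bytes(digest, "big"))
    return hashes  # empty when the string is shorter than k


def signature(text: str, k: int = 4) -> int:
    """Combine the substring hash values into one value by XOR (assumed)."""
    sig = 0
    for h in substring_hashes(text, k):
        sig ^= h
    return sig


def open_database() -> sqlite3.Connection:
    """Create an in-memory database with a single association table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE association (signature TEXT PRIMARY KEY, protein TEXT)")
    return conn


def store_association(conn: sqlite3.Connection, sig: int, protein: str) -> None:
    """Associate a signature value with a protein identifier."""
    # Stored as hex text because a 64-bit unsigned value can exceed
    # SQLite's signed 64-bit INTEGER range.
    conn.execute(
        "INSERT OR REPLACE INTO association (signature, protein) VALUES (?, ?)",
        (format(sig, "016x"), protein),
    )


def lookup_protein(conn: sqlite3.Connection, sig: int) -> Optional[str]:
    """Use a stored association to find the protein linked to a signature."""
    row = conn.execute(
        "SELECT protein FROM association WHERE signature = ?",
        (format(sig, "016x"),),
    ).fetchone()
    return row[0] if row else None
```

A caller would compute `signature(...)` for the non-coding character string, store it with `store_association`, and later call `lookup_protein` with the signature of a query string.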

9. A method according to the previous embodiment wherein the using in (E) also uses other associations in the database between other signature values of other character strings to determine or confirm the at least one aspect of a genetic function of the particular nucleic acid sequence.

10. A computer-implemented method, implemented by hardware in combination with software, the method comprising:

(A) determining a particular intron signature value corresponding to a particular at least one intron, the particular intron signature value being based on a first computational function applied to one or more portions of a representation of a particular nucleic acid sequence corresponding to the particular at least one intron;

(B) determining a particular protein signature value corresponding to a particular protein, the protein signature value being based on a second computational hash function applied to a representation of the particular protein;

(C) adding a record to a database, the record comprising a first field for the particular intron signature value and a second field for the particular protein signature value,

wherein the database comprises a plurality of records corresponding to a plurality of nucleic acid sequences, each of the records comprising an intron signature value and a corresponding protein signature value, each of the intron signature values having been determined for and based on the first computational function applied to a portion of a corresponding nucleic acid sequence, the corresponding nucleic acid sequence encoding a particular gene, and each of the protein signature values having been determined based on a second computational hash function applied to a representation of a protein associated with the corresponding nucleic acid sequence.
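The record layout in act (C), together with the ordering by intron signature value that the abstract describes, might look like the following sketch. The class name `SignatureRecord`, the extra `sequence_id` field, and the function `order_by_intron_signature` are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignatureRecord:
    """One database record: a first field for the intron signature value
    and a second field for the protein signature value."""
    intron_signature: int
    protein_signature: int
    sequence_id: str = ""  # hypothetical label for the nucleic acid sequence


def order_by_intron_signature(records: list[SignatureRecord]) -> list[SignatureRecord]:
    """Return the records sorted by their intron signature values."""
    return sorted(records, key=lambda record: record.intron_signature)
```

Sorting places records with numerically close intron signature values next to each other, which is the ordering the abstract says is used to determine or confirm a role or function of a portion of a sequence.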

11. A method according to the previous embodiment, further comprising: (D) using information in the database to determine or confirm at least one aspect of a genetic function of the particular nucleic acid sequence.

12. A non-transitory computer-readable recording medium storing one or more programs, which, when executed, cause one or more processors to, at least:

(a) obtain a representation of a nucleic acid sequence, wherein the nucleic acid sequence encodes a particular gene and the particular gene encodes a particular protein, the nucleic acid sequence comprising at least one intron;

(b) determine an intron signature value corresponding to the at least one intron, the intron signature value being based on a first computational function applied to one or more portions of the representation of the nucleic acid sequence corresponding to the at least one intron;

(c) determine a protein signature value corresponding to the particular protein, the protein signature value being based on a second computational function applied to a representation of the particular protein; and

(d) form, in a database, an association between the intron signature value and the protein signature value.

13. The non-transitory computer-readable recording medium according to the previous embodiment, wherein the one or more programs, when executed, cause one or more processors to, at least: (f) use the association formed in (d) to determine or confirm at least one aspect of a genetic function of a particular nucleic acid sequence.

14. A non-transitory computer-readable recording medium storing one or more programs, which, when executed, cause one or more processors to, at least:

(a) obtain a representation of the nucleic acid sequence, the nucleic acid sequence comprising multiple non-overlapping nucleic acid subsequences, each of the nucleic acid subsequences encoding an amino acid;

(b) determine a plurality of subsequences of the nucleic acid sequence, wherein the plurality of subsequences are determined by a binary counting of the nucleic acid sequence;

(c) determine a digital signature for the nucleic acid sequence based on a computational hash function applied to the plurality of subsequences of the nucleic acid sequence.

15. The non-transitory computer-readable recording medium according to the previous embodiment, wherein the one or more programs, when executed, cause one or more processors to, at least: (d) use the digital signature for the nucleic acid sequence to determine or confirm at least one aspect of a genetic function of the nucleic acid sequence.

16. A non-transitory computer-readable recording medium storing one or more programs, which, when executed, cause one or more processors to, at least:

(a) obtain a particular character string, the particular character string being a representation of a first one or more portions of a particular nucleic acid sequence, wherein the particular nucleic acid sequence comprises a second one or more portions that encode a particular protein, the first one or more portions of the particular nucleic acid sequence being distinct from the second one or more portions of the particular nucleic acid sequence;

(b) determine a plurality of hash values of a corresponding plurality of substrings associated with the particular character string;

(c) determine a signature value for the first one or more portions of the particular nucleic acid sequence based on the first plurality of hash values;

(d) form an association in a database between the signature value for the first one or more portions of the particular nucleic acid sequence and the particular protein; and

(e) use the association formed in (d) to determine or confirm at least one aspect of a genetic function of the particular nucleic acid sequence.

17. The non-transitory computer-readable recording medium according to the previous embodiment, wherein the using the association in (e) also uses other associations in the database between other signature values of other character strings to determine or confirm the at least one aspect of a genetic function of the particular nucleic acid sequence.

18. A non-transitory computer-readable recording medium storing one or more programs, which, when executed, cause one or more processors to, at least:

(a) determine a particular intron signature value corresponding to a particular at least one intron, the particular intron signature value being based on a first computational function applied to one or more portions of a representation of a particular nucleic acid sequence corresponding to the particular at least one intron;

(b) determine a particular protein signature value corresponding to a particular protein, the protein signature value being based on a second computational hash function applied to a representation of the particular protein;

(c) add a record to a database, the record comprising a first field for the particular intron signature value and a second field for the particular protein signature value,

19. The non-transitory computer-readable recording medium according to the previous embodiment, wherein the one or more programs, when executed, cause one or more processors to, at least: (d) use information in the database to determine or confirm at least one aspect of a genetic function of the particular nucleic acid sequence.

The above features, along with additional details of the invention, are described further in the examples below, which are intended to further illustrate the invention but are not intended to limit its scope in any way.

Brief Description Of The Drawings

Objects, features, and characteristics of the present invention as well as the methods of operation and functions of the related elements of structure, and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following description and the appended claims with reference to the accompanying drawings, all of which form a part of this specification.

FIG. 1 shows an overview of a system according to embodiments hereof;

FIGS. 2A and 2B show exemplary database organizations according to embodiments hereof; and

FIG. 3 is a schematic diagram of a computer system used in embodiments hereof.

Detailed Description Of The Presently Preferred Exemplary Embodiments

Glossary and Abbreviations

As used herein, unless used or described otherwise, the following terms or abbreviations have the following meanings:

"alphanumeric" means consisting of or using both letters and numbers;

"alphanumeric character" means a character that is either a letter or a number;

"alphanumeric string" means a character string consisting of letters and/or numbers;

"DNA" means deoxyribonucleic acid;

"DNA sequence" means a representation of the order in which the nucleotide bases are arranged within a nucleic acid sequence;

"exon" means a segment of a DNA or RNA molecule containing information coding for a protein or peptide sequence;

"FNV" means the Fowler-Noll-Vo hash function;

"genome" means the nuclear or organellar DNA content of a biological individual or sample;

"intron" means the part of a nucleotide sequence that is not an exon;

"RNA" means ribonucleic acid.

As used herein, the term "mechanism" refers to any device(s), process(es), service(s), or combination thereof. A mechanism may be implemented in hardware, software, firmware, using a special-purpose device, or any combination thereof. A mechanism may be integrated into a single device or it may be distributed over multiple devices. The various components of a mechanism may be co-located or distributed. The mechanism may be formed from other mechanisms. In general, as used herein, the term "mechanism" may thus be considered to be shorthand for the term device(s) and/or process(es) and/or service(s).

Description

Background and Overview

The background section of patent application no. PCT/US 15/30478, filed 05/13/2015, published as WO 2015/175602 on 11/19/2015, is fully incorporated herein for all purposes.

A DNA sequence is a sequence of nucleotide bases selected from the four nucleotide bases adenine (A), cytosine (C), guanine (G), and thymine (T). For notational convenience, and following convention, these bases are abbreviated herein as A, C, G, and T. That is, a DNA sequence S may be written as $s_1 s_2 \ldots s_k$, wherein each element $s_i$ in the sequence is selected from the set {A, C, T, G}.

As is well known, within the nucleus of a cell a DNA sequence takes the form of a double helix formed with base pairs. Within the base pairs, an adenine (A) nucleotide pairs with a thymine (T) nucleotide (and vice versa) and a cytosine (C) nucleotide pairs with a guanine (G) nucleotide (and vice versa).
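The pairing rule just described (A with T, C with G) can be expressed as a small lookup. This is an illustrative sketch; the names `DNA_COMPLEMENT` and `complement_strand` are hypothetical.

```python
# Base-pairing rule from the text: A pairs with T, and C pairs with G.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_strand(dna: str) -> str:
    """Return the base-by-base complement of a DNA sequence.

    Raises ValueError if the input contains a character outside {A, C, G, T}.
    """
    result = []
    for base in dna.upper():
        if base not in DNA_COMPLEMENT:
            raise ValueError(f"not a DNA base: {base!r}")
        result.append(DNA_COMPLEMENT[base])
    return "".join(result)
```

The function pairs each position independently and does not reverse the strand.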

As conventionally understood, a DNA sequence encodes a genetic function, and a DNA sequence may contain a gene, e.g., encoding for a protein.

There are twenty (20) amino acids (alanine, arginine, asparagine, aspartate, cysteine, glutamate, glutamine, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine) and all proteins are formed from combinations of these amino acids.

In order for a cell to perform a genetic function encoded in DNA (i.e., in a gene of the DNA) a copy of a portion of the DNA is made or transcribed onto messenger RNA (mRNA) to be passed from the nucleus into the cell. Transcription need not (and usually does not) read or copy the entire DNA, only the parts within the DNA that hold the information to perform the required function - e.g., to make the desired protein. In the case of protein production, a starting point is found in the DNA for the desired protein and then a transcript of the DNA is made (using RNA) up to an end point in the DNA.

The RNA copies at least a portion of the DNA using complementary bases according to the following rules:

DNA to RNA

G → C

C → G

T → A

A → U

For RNA, the complement of A is U (Uracil), not thymine (T).

Thus an RNA sequence R may be written as r_1 r_2 ... r_k, wherein each r_i is selected from the set {A, C, U, G}.

When DNA is transcribed to RNA each "T" is replaced by a "U", otherwise the sequence remains the same.
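The two views of transcription described above can be sketched as follows. This is a minimal illustration rather than part of the described system: the function names are hypothetical, and the pairing rules used are the standard ones (G with C, C with G, T with A, and A with U on the RNA side).

```python
# Standard pairing rules: DNA template base -> complementary RNA base.
DNA_TO_RNA_COMPLEMENT = {"G": "C", "C": "G", "T": "A", "A": "U"}


def complement_to_rna(template_dna: str) -> str:
    """Build the RNA that is complementary, base by base, to a DNA template."""
    return "".join(DNA_TO_RNA_COMPLEMENT[base] for base in template_dna)


def transcribe(coding_dna: str) -> str:
    """Copy a DNA sequence into RNA notation: each T becomes U, the rest is unchanged."""
    return coding_dna.replace("T", "U")


if __name__ == "__main__":
    print(complement_to_rna("TACCGA"))  # complementary RNA for a template strand
    print(transcribe("ATGGCT"))         # same letters with T replaced by U
```

Both calls describe the same mRNA from two starting points: the first pairs each template base with its complement, the second rewrites the matching DNA text with U in place of T.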

Before being sent from the nucleus into the cytoplasm of the cell, the RNA may undergo further processing, as explained here.

Recall that each gene may code for a specific protein and that the DNA that makes up a gene provides the code. However, not all of the DNA within a gene may directly code for proteins. The parts of DNA that code directly for proteins are called exons. Non-coding portions of DNA that are within a gene and are between exons (coding portions) are referred to as introns. A gene may thus comprise one or more exons interspersed with one or more introns, e.g., as shown in the following:

Processing of the transcribed RNA in the nucleus removes the introns so that the messenger RNA (mRNA) only contains the exons for a gene. E.g., for the above example, the mRNA would consist of:

<EXON-T_1><EXON-T_2><EXON-T_3><EXON-T_4>

where <EXON-T_k> is the RNA sequence corresponding to the DNA sequence <EXON_k>.

The mRNA comprising the gene's concatenated exons is passed out of the cell's nucleus to the cytoplasm where translation of the mRNA to a protein occurs. Specifically, once the mRNA is made from transcription, the mRNA moves from the cell's nucleus into the cytoplasm, where a ribosome assembles amino acids, each associated with another molecule known as tRNA, that match the codons of the mRNA sequence moving through the ribosome (as specified or encoded in the mRNA) according to the following process.

The bases in an RNA sequence may conveniently be grouped into triples called codons. As conventionally understood, one triple or codon - the first codon of the mRNA transcript - encodes a start codon. The most common start codon is AUG. A stop codon (or termination codon) is a nucleotide triplet within mRNA that signals termination of translation. There are several stop codons, specifically, in RNA: UAG, UAA, and UGA (in DNA: TAG, TAA, TGA).

Each codon (other than the start and stop codons) corresponds to one of the twenty amino acids (recall that there are twenty (20) amino acids and all proteins are formed from combinations of these amino acids). As there are four bases, there are sixty-four possible codons (i.e., there are 64 triples that can be formed from the bases A, C, G, T (or A, C, G, U)). As there are 20 amino acids and 64 possible codons that code for amino acids (less the start and stop codons), some amino acids are produced by more than one codon. For example, as shown in the table below, the four mRNA codons CCU, CCC, CCA, CCG all correspond to the amino acid proline. Similarly, the four mRNA codons GCU, GCC, GCA, GCG all correspond to the amino acid alanine. Notably, these codons differ only in their third nucleotide.

Translation from codon to protein occurs as follows:

1. In the cell's cytoplasm a ribosome recognizes the start codon in the mRNA and reads the mRNA one codon at a time.

2. An amino acid (one of the twenty) attached to an available tRNA molecule in the cytoplasm is attracted by the ribosome and joined to form a sequence of amino acids based on the underlying mRNA codons being processed through the ribosome, according to the rules in the following table, until a "stop" codon (ATT, ATC, ACT for DNA or UAA, UAG, UGA for RNA) is reached.

Amino Acid | DNA Base Triplets | mRNA Codons
Alanine | CGA, CGG, CGT, CGC | GCU, GCC, GCA, GCG
Arginine | GCA, GCG, GCT, GCC, TCT, TCC | CGU, CGC, CGA, CGG, AGA, AGG
Asparagine | TTA, TTG | AAU, AAC
Aspartate | CTA, CTG | GAU, GAC
Cysteine | ACA, ACG | UGU, UGC
Glutamate | CTT, CTC | GAA, GAG
Glutamine | GTT, GTC | CAA, CAG
Glycine | CCA, CCG, CCT, CCC | GGU, GGC, GGA, GGG
Histidine | GTA, GTG | CAU, CAC
Isoleucine | TAA, TAG, TAT | AUU, AUC, AUA
Leucine | AAT, AAC, GAA, GAG, GAT, GAC | UUA, UUG, CUU, CUC, CUA, CUG
Lysine | TTT, TTC | AAA, AAG
Methionine | TAC | AUG
Phenylalanine | AAA, AAG | UUU, UUC
Proline | GGA, GGG, GGT, GGC | CCU, CCC, CCA, CCG
Serine | AGA, AGG, AGT, AGC, TCA, TCG | UCU, UCC, UCA, UCG, AGU, AGC
Stop | ATT, ATC, ACT | UAA, UAG, UGA
Threonine | TGA, TGG, TGT, TGC | ACU, ACC, ACA, ACG
Tryptophan | ACC | UGG
Tyrosine | ATA, ATG | UAU, UAC
Valine | CAA, CAG, CAT, CAC | GUU, GUC, GUA, GUG
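The translation steps and the codon table above can be illustrated with a short sketch. It is an illustration under stated assumptions, not the described system: the names `CODON_TABLE` and `translate` are hypothetical, only the mRNA codon column of the table is used, and the reading frame is assumed to begin at the first AUG (which, as the start codon, also contributes methionine).

```python
# mRNA codons grouped by amino acid, following the mRNA column of the table above.
_ROWS = {
    "alanine": "GCU GCC GCA GCG",
    "arginine": "CGU CGC CGA CGG AGA AGG",
    "asparagine": "AAU AAC",
    "aspartate": "GAU GAC",
    "cysteine": "UGU UGC",
    "glutamate": "GAA GAG",
    "glutamine": "CAA CAG",
    "glycine": "GGU GGC GGA GGG",
    "histidine": "CAU CAC",
    "isoleucine": "AUU AUC AUA",
    "leucine": "UUA UUG CUU CUC CUA CUG",
    "lysine": "AAA AAG",
    "methionine": "AUG",
    "phenylalanine": "UUU UUC",
    "proline": "CCU CCC CCA CCG",
    "serine": "UCU UCC UCA UCG AGU AGC",
    "stop": "UAA UAG UGA",
    "threonine": "ACU ACC ACA ACG",
    "tryptophan": "UGG",
    "tyrosine": "UAU UAC",
    "valine": "GUU GUC GUA GUG",
}

# Invert the rows into a codon -> amino acid lookup.
CODON_TABLE = {codon: name for name, codons in _ROWS.items() for codon in codons.split()}


def translate(mrna: str) -> list:
    """Read codons from the first AUG until a stop codon or the end of the sequence."""
    start = mrna.find("AUG")
    if start < 0:
        return []
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "stop":
            break
        protein.append(amino_acid)
    return protein


if __name__ == "__main__":
    print(translate("AUGGCUCCGUAA"))
```

The lookup is built by inverting the rows, so the many-to-one relation between codons and amino acids (e.g., four codons for proline) is explicit in the data.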

The System

With reference now to FIG. 1, a system 100 according to embodiments hereof includes a number of mechanisms 102 interacting with one or more databases 104.

The exemplary mechanisms 102 may include database mechanism(s) 106, including database access mechanism(s) 108 and database maintenance mechanism(s) 110, signature determination mechanism(s) 112, which may include hashing mechanism(s) 114 and permuting mechanism(s) 116. Additionally, the mechanisms 102 may include sequence acquisition mechanism(s) 118, interface mechanism(s) 120, miscellaneous and auxiliary mechanism(s) 122, and analysis mechanism(s) 124. The interface mechanism(s) 120 may include user interface (UI) mechanism(s) 126.

The mechanisms may include the following mechanisms (described below): the "Pattern" mechanism; the "count" mechanisms; the "offset count, bigger length" mechanism; the "offset count, ignore length" mechanism; the iScore mechanism; the Signature mechanism; the PhV mechanism; and the Position in Vector (PiV) mechanism.

The databases 104 may include one or more sequence databases 128 and one or more association databases 130.

With reference now to FIGS. 2A and 2B, the sequence database(s) 128 may include a mapping of sequence identifiers to corresponding sequences such as DNA sequences, RNA sequences, proteins, or the like. The sequences may correspond to introns or exons. The sequence identifiers may be arbitrarily assigned by the system.

The association database(s) 130 preferably include a mapping of signatures (described in greater detail below) to associated and corresponding DNA sequences (e.g., as identified by their sequence identifiers). The signatures may also map to one or more of: (i) one or more intron sequences (e.g., as identified by their sequence identifiers), (ii) one or more exon sequences (e.g., as identified by their sequence identifiers), and (iii) a protein. It should be appreciated that a particular signature may not map to all of the fields. It should also be appreciated that the database may include other mappings, e.g., from intron signatures to genetic functions or the like. The protein, if included, may be identified by a protein identifier that maps to a separate protein database (not shown) or it may be identified by a sequence identifier, mapping to the sequence database 128.

In the exemplary mappings in FIGS. 2A and 2B, the k-th row shows that signature S_k maps to the DNA sequence identified by sequence identifier ID_k (that maps to a DNA sequence in the sequence database 128 in FIG. 2B). Signature S_k may also map to (i) a sequence of one or more introns (identified by a corresponding one or more sequence identifiers ID-I-k ... ); (ii) a sequence of one or more exons (identified by a corresponding one or more sequence identifiers ID-E-k ... ); and (iii) a protein (identified by a corresponding sequence identifier).

In FIGS. 2A and 2B the mappings are shown as tables. Those of ordinary skill in the art will realize and appreciate, upon reading this description, that these are merely exemplary representations of the databases and their mappings, and that different and/or other database organizations and/or representations may be used. It should be appreciated that the system is not limited by the particular database implementation or organization.
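One possible in-memory representation of the two mappings is sketched below. It is only an illustration: the class name `Association`, the field names, the identifiers, and the signature value 1234 are all hypothetical, and a real embodiment may use any database organization.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Association:
    """One row of the association database (this layout is an assumption)."""
    dna_id: str
    intron_ids: List[str] = field(default_factory=list)
    exon_ids: List[str] = field(default_factory=list)
    protein_id: Optional[str] = None  # a signature need not map to every field


# Sequence database: sequence identifier -> sequence text (illustrative values).
sequence_db: Dict[str, str] = {
    "ID1": "GTGGGCCCAGAC",
    "ID-I-1": "GTGGGCCC",
}

# Association database: signature value -> associated sequence identifiers.
association_db: Dict[int, Association] = {
    1234: Association(dna_id="ID1", intron_ids=["ID-I-1"]),
}


def dna_for_signature(signature: int) -> str:
    """Follow a signature to its DNA sequence via the sequence identifier."""
    return sequence_db[association_db[signature].dna_id]


def ordered_signatures() -> List[int]:
    """Return the stored signatures in sorted order."""
    return sorted(association_db)


if __name__ == "__main__":
    print(dna_for_signature(1234))
    print(ordered_signatures())
```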

Data Representation

For computational purposes, e.g., in order to record and manipulate nucleotide (DNA and RNA) sequences, genetic sequencing data representing nucleotide sequences or peptide sequences are typically stored in a text format, such as FASTA or FASTQ, using single-letter codes to represent the nucleotides or amino acids. In the FASTA format, the bases adenine (A), cytosine (C), guanine (G), and thymine (T) are represented by the four ASCII letters "A", "C", "G", and "T", respectively. For RNA the base thymine (T) is replaced by the base uracil (U), represented by the letter "U". Thus the nucleotides that constitute DNA and RNA may be expressed in five distinct characters (A, G, C, T, or U for adenine, guanine, cytosine, thymine, or uracil, respectively).

However, if space and storage requirements are a concern, 3 bits may be used to represent each character.

As there are five values to be considered in each type of sequence, 3 bits are sufficient to store each sequence element (i.e., each nucleotide).
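The 3-bit representation can be sketched as below. The particular code values (1 to 5 for A, C, G, T, U) and the function names `pack` and `unpack` are assumptions made for illustration; any consistent assignment of five distinct 3-bit values would serve.

```python
# Hypothetical 3-bit codes; five values fit in 3 bits (which can hold up to 8).
CODES = {"A": 1, "C": 2, "G": 3, "T": 4, "U": 5}
LETTERS = {value: letter for letter, value in CODES.items()}


def pack(sequence: str) -> int:
    """Pack a nucleotide string into one integer, 3 bits per nucleotide."""
    value = 0
    for base in sequence:
        value = (value << 3) | CODES[base]
    return value


def unpack(value: int, length: int) -> str:
    """Recover a nucleotide string of known length from its packed form."""
    letters = []
    for _ in range(length):
        letters.append(LETTERS[value & 0b111])
        value >>= 3
    return "".join(reversed(letters))


if __name__ == "__main__":
    packed = pack("AAGCGT")
    print(packed.bit_length(), unpack(packed, 6))
```

The sequence length is kept separately because the packed integer alone does not record how many elements it holds.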

For some computational purposes a DNA (or RNA) sequence may be considered to be a character string with the characters selected from (and limited to) "A", "C", "G", and "T" (or "U" instead of "T" for RNA). In conventional programming languages, such sequences may be represented as strings of characters.

Since arithmetic functions will be performed on these strings (e.g., to compute their hashes), each letter may also be assigned a numeric value. It should be appreciated that standard encoding of values for the letters (e.g., ASCII) may be used, as long as the encoding is consistent across the system. In some embodiments the letters may be assigned the values, e.g., 1, 2, 3, 4, and 5 for the letters "A", "C", "G", "T", and "U", respectively. In these embodiments the DNA sequence "AAGCGT" would correspond to the numerical sequence (or number) "113234", whereas the RNA sequence "AAGCGU" would correspond to the numerical sequence (or number) "113235". Using this approach, a DNA or RNA sequence may be directly manipulated and operated on arithmetically. Those of ordinary skill in the art will realize and appreciate, upon reading this description, that different and/or other encodings of DNA and RNA sequences may be used, and that some of these encodings will provide more efficient storage than others, and that some of these encodings will provide more efficient computation than others. It should also be appreciated that different coding and storage schemes may be used for different aspects of the system, again, as long as consistency is maintained. For example, it may be efficient to store data in one form and to operate on those data in another form. Those of ordinary skill in the art will understand how to select an encoding scheme that balances storage and computing requirements.
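The digit encoding described above can be sketched in a few lines. The function name `encode` is hypothetical; the digit assignments are the ones given in the text.

```python
# Digit assigned to each letter, as in the embodiment described above.
VALUES = {"A": "1", "C": "2", "G": "3", "T": "4", "U": "5"}


def encode(sequence: str) -> int:
    """Map a DNA or RNA string to the number formed by its per-letter digits."""
    return int("".join(VALUES[base] for base in sequence))


if __name__ == "__main__":
    print(encode("AAGCGT"))  # DNA example from the text
    print(encode("AAGCGU"))  # RNA example from the text
```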

By convention the 20 amino acids are sometimes identified by three-letter abbreviations (e.g., "Glu" for "glutamate", etc.), although for storage and computation purposes they are typically given one-letter abbreviations. The following table shows the standard 3 and 1 letter abbreviations for the twenty amino acids.

Amino Acid | 3-Letter | 1-Letter
Alanine | Ala | A
Arginine | Arg | R
Asparagine | Asn | N
Aspartate | Asp | D
Cysteine | Cys | C
Glutamate | Glu | E
Glutamine | Gln | Q
Glycine | Gly | G
Histidine | His | H
Isoleucine | Ile | I
Leucine | Leu | L
Lysine | Lys | K
Methionine | Met | M
Phenylalanine | Phe | F
Proline | Pro | P
Serine | Ser | S
Threonine | Thr | T
Tryptophan | Trp | W
Tyrosine | Tyr | Y
Valine | Val | V

Thus, as with DNA and RNA sequences, a protein may be represented by a character string (e.g., with the letters drawn from the alphabet in the 1-letter column in the table above). In order to operate arithmetically on sequences that represent proteins, each of the twenty characters may be assigned a different numeric value (e.g., 1 to 20, or the character's ASCII numeric value, or some other value).

Mechanisms

For the purposes of this description, some of the mechanisms and algorithms are described as operating on alphanumeric strings (i.e., strings of characters). In some of the descriptions arbitrary characters are used to describe operation of aspects of some algorithms. As should be appreciated, the characters in the strings may represent DNA or RNA or protein sequences.

The "Pattern" mechanism

A version of this pattern mechanism is described in our co-pending patent application no. PCT/US 15/30478, the entire contents of which are hereby fully incorporated herein by reference for all purposes. An exemplary version of this procedure is described here.

For a given sequence or alphanumeric string S = S_0 S_1...S_n, where each element S_i of the sequence is an alphabetic character: for each S_i in S, the Pattern mechanism generates a set of subsequences that, inter alia, preserve sequences between letters in the original string S = S_0 S_1...S_n. The subsequence is preferably ordered and preferably includes the characters which are directly adjacent to S_i in a forward direction. In a presently preferred implementation, the substring length is greater than or equal to 8. Although this mechanism is described here for substring lengths greater than or equal to 8, it should be appreciated that other lengths, including zero (0), may be used. Thus, e.g., Pattern_1(S) allows substrings of any length greater than or equal to 1. For the remainder of this description, unless stated otherwise, and without loss of generality, the function Pattern(S) refers to Pattern_8(S).

Pattern_8(S_i) = S_{i-7}...S_i | S_{i-8}...S_i | S_{i-9}...S_i | ... | S_0 S_1...S_i

Note that Pattern_8(S_i) is an ordered sequence of zero or more substrings. For notational purposes, these substrings or sub-sequences are shown in the text separated by the bar ("|") character.

Consider the example where S = GTGGGCCCAGAC.

For this example, as an aid to the explanation, each letter is also given a subscript showing its position in the string.

Pattern(G_0) = Empty (no subsequences of length 8 or greater)

Pattern(C_7) = GTGGGCCC (i.e., G_0 T_1 G_2 G_3 G_4 C_5 C_6 C_7)

Pattern(A_8) = TGGGCCCA | GTGGGCCCA (i.e., T_1 G_2 G_3 G_4 C_5 C_6 C_7 A_8 and G_0 T_1 G_2 G_3 G_4 C_5 C_6 C_7 A_8)

Pattern(G_9) = GGGCCCAG | TGGGCCCAG | GTGGGCCCAG

Pattern(A_10) = GGCCCAGA | GGGCCCAGA | TGGGCCCAGA | GTGGGCCCAGA

Pattern(C_11) = GCCCAGAC | GGCCCAGAC | GGGCCCAGAC | TGGGCCCAGAC | GTGGGCCCAGAC

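The pattern for the entire string S is the ordered sequence of the patterns for each S_i in S. For the example above, where S = GTGGGCCCAGAC,

P = Pattern(S) = Pattern(G_0) | ... | Pattern(C_7) | Pattern(A_8) | ... | Pattern(C_11)

This corresponds to the following ordered sequence of fifteen strings:

GTGGGCCC | TGGGCCCA | GTGGGCCCA | GGGCCCAG | TGGGCCCAG | GTGGGCCCAG | GGCCCAGA | GGGCCCAGA | TGGGCCCAGA | GTGGGCCCAGA | GCCCAGAC | GGCCCAGAC | GGGCCCAGAC | TGGGCCCAGAC | GTGGGCCCAGAC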

The index into the ordered pattern sequence is referred to herein as the "offset," with the first element being element zero. Thus, in the example above, the first element, i.e., the element at offset = 0, is P_0 = GTGGGCCC = Pattern(S)_0, the second element, at offset = 1, is P_1 = TGGGCCCA, and so on, where P_14 = GTGGGCCCAGAC.

As another example, for the string S' = 1234567890ABCDEFGHIJ, P' = Pattern(S') corresponds to the following ordered sequence of strings:

12345678 | 23456789 | 123456789 | 34567890 | 234567890 | 1234567890 | 4567890A | 34567890A | 234567890A | 1234567890A | 567890AB | 4567890AB | 34567890AB | 234567890AB | 1234567890AB | 67890ABC | 567890ABC | 4567890ABC | 34567890ABC | 234567890ABC | 1234567890ABC | 7890ABCD | 67890ABCD | 567890ABCD | ... | 4567890ABCDEFGHI | 34567890ABCDEFGHI | 234567890ABCDEFGHI | ... | 567890ABCDEFGHIJ | 4567890ABCDEFGHIJ | 34567890ABCDEFGHIJ | 234567890ABCDEFGHIJ | 1234567890ABCDEFGHIJ

In the above example, Pattern(S') = P' has 91 elements, with the element P'_0 at offset zero being "12345678", and the 91st element P'_90 at offset 90 being "1234567890ABCDEFGHIJ".

When the string S corresponds to a DNA or RNA or protein sequence, then Pattern(S) is a pattern for that DNA or RNA or protein sequence.
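The Pattern mechanism as described above can be sketched as follows. This is a minimal reading of the description, not the implementation from the co-pending application: the function name `pattern` and the parameter `min_len` are hypothetical, and the ordering (for each end position, shortest substring first) is inferred from the worked examples.

```python
def pattern(s: str, min_len: int = 8) -> list:
    """Return the ordered substrings of s of length >= min_len.

    For each end position i (in increasing order), the substrings ending at i
    are listed from the shortest (length min_len) to the longest (from s[0]).
    """
    elements = []
    for i in range(len(s)):
        # start runs from i - min_len + 1 down to 0; the range is empty for i < min_len - 1
        for start in range(i - min_len + 1, -1, -1):
            elements.append(s[start:i + 1])
    return elements


if __name__ == "__main__":
    p = pattern("GTGGGCCCAGAC")
    print(len(p), p[0], p[1], p[14])
    print(len(pattern("1234567890ABCDEFGHIJ")))
```

The index of each element in the returned list is its offset, so `pattern(S)[0]` corresponds to the element at offset zero.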

The "count" mechanisms

The count mechanisms operate on pattern sequences (i.e., on ordered alphanumeric sequences). Each count mechanism determines a count for each element of the sequence on which it operates. That is, if a sequence has k elements or subsequences (numbered from offset 0 to offset k-1), then that sequence will have k count values for each kind of count. When the ordered sequence of elements corresponds to a pattern for a string, the count mechanism determines the count for that pattern.

When the ordered sequence of elements E corresponds to a pattern of a DNA or RNA or protein sequence, then Count(E) (for a particular count mechanism) provides corresponding count values for that DNA or RNA or protein sequence.

The "offset count, bigger length" mechanism

For an ordered sequence of k elements E = [E_0, E_1, ..., E_{k-1}], the "offset count, bigger length" mechanism determines a corresponding vector or array of k count values, as follows:

For each sequence element E_j, for j=0 to k-1 (i.e., for subsequence E_j), start the count at zero (0), and then compare that subsequence E_j to all the subsequences in the ordered sequence E. The count for the j-th element (E_j) increments by 1 for each target subsequence that is longer than and contains the compared subsequence (E_j).

For example, for the following ordered sequence E

GTGGGCCC | TGGGCCCA | GTGGGCCCA | GGGCCCAG | TGGGCCCAG | GTGGGCCCAG |

GGCCCAGA | GGGCCCAGA | TGGGCCCAGA | GTGGGCCCAGA |

GCCCAGAC | GGCCCAGAC | GGGCCCAGAC | TGGGCCCAGAC | GTGGGCCCAGAC

For the third element, E_2 = Subsequence_2 = GTGGGCCCA matches at offsets 5, 9 and 14 (GTGGGCCCAG, GTGGGCCCAGA, and GTGGGCCCAGAC). To aid in this explanation, the matching parts of the sequence are underlined and in italics. Thus, in this example,

Offset_Count_Bigger_Length(Subsequence_2) = 3

When an ordered sequence of elements E corresponds to a pattern of a DNA or RNA or protein sequence, then Offset_Count_Bigger_Length(E) provides corresponding count values for that DNA or RNA or protein sequence.
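The "offset count, bigger length" rule can be sketched directly from the description above. The function name follows the notation in the text; the list `E` repeats the fifteen-element example, and the straightforward quadratic comparison is an illustrative choice rather than a claim about how an embodiment computes the counts.

```python
def offset_count_bigger_length(elements: list) -> list:
    """For each element, count the elements that are longer than it and contain it."""
    return [
        sum(1 for target in elements if len(target) > len(e) and e in target)
        for e in elements
    ]


# The fifteen-element ordered sequence from the example above.
E = [
    "GTGGGCCC", "TGGGCCCA", "GTGGGCCCA", "GGGCCCAG", "TGGGCCCAG", "GTGGGCCCAG",
    "GGCCCAGA", "GGGCCCAGA", "TGGGCCCAGA", "GTGGGCCCAGA",
    "GCCCAGAC", "GGCCCAGAC", "GGGCCCAGAC", "TGGGCCCAGAC", "GTGGGCCCAGAC",
]

if __name__ == "__main__":
    counts = offset_count_bigger_length(E)
    print(counts[2])  # count for GTGGGCCCA at offset 2
```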

The "offset count, ignore length" mechanism

For an ordered sequence of k elements E = [E_0, E_1, ..., E_{k-1}], the "offset count, ignore length" mechanism determines a corresponding vector or array of k count values, as follows:

For each sequence element E_j, for j=0 to k-1 (i.e., for subsequence E_j), start the count at zero (0), and then compare that subsequence to all the subsequences. The count for the j-th element (E_j) increments by 1 if either (i) the target subsequence contains the compared subsequence, or (ii) the compared subsequence contains the target subsequence.

For example:

0 GTGGGCCC | 1 TGGGCCCA | 2 GTGGGCCCA | 3 GGGCCCAG | 4 TGGGCCCAG | 5 GTGGGCCCAG | 6 GGCCCAGA | 7 GGGCCCAGA | 8 TGGGCCCAGA | 9 GTGGGCCCAGA | 10 GCCCAGAC | 11 GGCCCAGAC | 12 GGGCCCAGAC | 13 TGGGCCCAGAC | 14 GTGGGCCCAGAC

For the third element, E_2 = Subsequence_2 = GTGGGCCCA matches at offsets 5, 9 and 14 (GTGGGCCCAG, GTGGGCCCAGA, and GTGGGCCCAGAC), as with the Offset_Count_Bigger_Length mechanism. In this case, Offset_Count_Ignore_Length also matches the elements at offsets 0 and 1 (GTGGGCCC and TGGGCCCA, which are contained in Subsequence_2), and Subsequence_2 matches itself at offset 2. Thus, in this example,

Offset_Count_Ignore_Length(Subsequence_2) = 6

It should be apparent that for any particular element of a sequence, Offset_Count_Ignore_Length will always be greater than or equal to Offset_Count_Bigger_Length.

When an ordered sequence of elements E corresponds to a pattern of a DNA or RNA or protein sequence, then Offset_Count_Ignore_Length(E) provides corresponding count values for that DNA or RNA or protein sequence.
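The "offset count, ignore length" rule can be sketched in the same style. One assumption is made explicit here: each element is also compared with itself, which is what yields the count of 6 in the example above. The function name follows the notation in the text and the list `E` repeats the fifteen-element example.

```python
def offset_count_ignore_length(elements: list) -> list:
    """For each element, count the elements that contain it or are contained in it.

    Every element is compared against the whole list, including itself.
    """
    return [
        sum(1 for target in elements if e in target or target in e)
        for e in elements
    ]


# The fifteen-element ordered sequence from the example above.
E = [
    "GTGGGCCC", "TGGGCCCA", "GTGGGCCCA", "GGGCCCAG", "TGGGCCCAG", "GTGGGCCCAG",
    "GGCCCAGA", "GGGCCCAGA", "TGGGCCCAGA", "GTGGGCCCAGA",
    "GCCCAGAC", "GGCCCAGAC", "GGGCCCAGAC", "TGGGCCCAGAC", "GTGGGCCCAGAC",
]

if __name__ == "__main__":
    counts = offset_count_ignore_length(E)
    print(counts[2])  # count for GTGGGCCCA at offset 2
```

Because every longer containing element is also counted here, each value is at least as large as the corresponding "bigger length" count.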

The iScore mechanism

For an ordered sequence of k elements E = [E_0, E_1, ..., E_{k-1}], the iScore mechanism determines a corresponding value for each element E_j, for j=0 to k-1 (i.e., for subsequence E_j) of the sequence, as follows:

iScore(E_j) = ( Offset_Count_Ignore_Length(E_j) - Offset_Count_Bigger_Length(E_j) ) ÷ Length(E_j)

In the above example, for Subsequence_2 = GTGGGCCCA,

iScore(Subsequence_2) = (6 - 3) / 9 = 0.33333...

When the ordered sequence of elements corresponds to a pattern for a string, the iScore mechanism determines the iScores for that pattern.

When an ordered sequence of elements E corresponds to a pattern of a DNA or RNA or protein sequence, then iScore(E) provides corresponding iScore values for that DNA or RNA or protein sequence.
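Putting the pieces together, the iScore of every pattern element can be sketched as below. This is a self-contained illustration under the same assumptions as the descriptions above (minimum substring length of 8, each element compared against all elements including itself); the function names are hypothetical.

```python
def pattern(s: str, min_len: int = 8) -> list:
    """Ordered substrings of length >= min_len, grouped by end position, shortest first."""
    return [s[start:i + 1]
            for i in range(len(s))
            for start in range(i - min_len + 1, -1, -1)]


def iscores(elements: list) -> list:
    """iScore(E_j) = (ignore-length count - bigger-length count) / Length(E_j)."""
    scores = []
    for e in elements:
        bigger = sum(1 for t in elements if len(t) > len(e) and e in t)
        ignore = sum(1 for t in elements if e in t or t in e)
        scores.append((ignore - bigger) / len(e))
    return scores


if __name__ == "__main__":
    elements = pattern("GTGGGCCCAGAC")
    scores = iscores(elements)
    for offset, (element, score) in enumerate(zip(elements, scores)):
        print(offset, element, format(score, ".3f"))
```

For the element GTGGGCCCA at offset 2 of the twelve-letter example, this gives (6 - 3) / 9, matching the value worked out above.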

In the case of a nucleotide sequence, the iScore mechanism provides a count of subsequence recurrence or inter-inclusiveness based on a one-nucleotide advance.

Example

This example shows the iScore values for the sequence

S = "GTGCCCCCGGACTACATTTT"

The pattern mechanism applied to the string S (i.e., Pattern_8(S) = Pattern_8(GTGCCCCCGGACTACATTTT)) gives the following ordered sequence (with 91 elements):

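GTGCCCCC |

TGCCCCCG | GTGCCCCCG |

GCCCCCGG | TGCCCCCGG | GTGCCCCCGG |

CCCCCGGA | GCCCCCGGA | TGCCCCCGGA | GTGCCCCCGGA |

CCCCGGAC | CCCCCGGAC | GCCCCCGGAC | TGCCCCCGGAC | GTGCCCCCGGAC |

CCCGGACT | CCCCGGACT | CCCCCGGACT | GCCCCCGGACT | TGCCCCCGGACT | GTGCCCCCGGACT |

CCGGACTA | CCCGGACTA | CCCCGGACTA | CCCCCGGACTA | GCCCCCGGACTA | TGCCCCCGGACTA | GTGCCCCCGGACTA |

CGGACTAC | CCGGACTAC | CCCGGACTAC | CCCCGGACTAC | CCCCCGGACTAC | GCCCCCGGACTAC | TGCCCCCGGACTAC | GTGCCCCCGGACTAC |

GGACTACA | CGGACTACA | CCGGACTACA | CCCGGACTACA | CCCCGGACTACA | CCCCCGGACTACA | GCCCCCGGACTACA | TGCCCCCGGACTACA | GTGCCCCCGGACTACA |

GACTACAT | GGACTACAT | CGGACTACAT | CCGGACTACAT | CCCGGACTACAT | CCCCGGACTACAT | CCCCCGGACTACAT | GCCCCCGGACTACAT | TGCCCCCGGACTACAT | GTGCCCCCGGACTACAT |

ACTACATT | GACTACATT | GGACTACATT | CGGACTACATT | CCGGACTACATT | CCCGGACTACATT | CCCCGGACTACATT | CCCCCGGACTACATT | GCCCCCGGACTACATT | TGCCCCCGGACTACATT | GTGCCCCCGGACTACATT |

CTACATTT | ACTACATTT | GACTACATTT | GGACTACATTT | CGGACTACATTT | CCGGACTACATTT | CCCGGACTACATTT | CCCCGGACTACATTT | CCCCCGGACTACATTT | GCCCCCGGACTACATTT | TGCCCCCGGACTACATTT | GTGCCCCCGGACTACATTT |

TACATTTT | CTACATTTT | ACTACATTTT | GACTACATTTT | GGACTACATTTT | CGGACTACATTTT | CCGGACTACATTTT | CCCGGACTACATTTT | CCCCGGACTACATTTT | CCCCCGGACTACATTTT | GCCCCCGGACTACATTTT | TGCCCCCGGACTACATTTT | GTGCCCCCGGACTACATTTT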

The following table shows the iScores for each element of Pattern_8(S) (with the iScore values rounded to three decimal places, for convenience of display).

Offset | Subsequence | Subsequence Length | Offset Count Bigger Length | Offset Count Ignore Length | iScore
0 | GTGCCCCC | 8 | 12 | 13 | 0.125
1 | TGCCCCCG | 8 | 23 | 24 | 0.125
2 | GTGCCCCCG | 9 | 11 | 14 | 0.333
3 | GCCCCCGG | 8 | 32 | 33 | 0.125
4 | TGCCCCCGG | 9 | 21 | 24 | 0.333
5 | GTGCCCCCGG | 10 | 10 | 16 | 0.600
6 | CCCCCGGA | 8 | 39 | 40 | 0.125
7 | GCCCCCGGA | 9 | 29 | 32 | 0.333
8 | TGCCCCCGGA | 10 | 19 | 25 | 0.600
9 | GTGCCCCCGGA | 11 | 9 | 19 | 0.909
10 | CCCCGGAC | 8 | 44 | 45 | 0.125
11 | CCCCCGGAC | 9 | 35 | 38 | 0.333
12 | GCCCCCGGAC | 10 | 26 | 32 | 0.600
13 | TGCCCCCGGAC | 11 | 17 | 27 | 0.909
14 | GTGCCCCCGGAC | 12 | 8 | 23 | 1.250
15 | CCCGGACT | 8 | 47 | 48 | 0.125
16 | CCCCGGACT | 9 | 39 | 42 | 0.333
17 | CCCCCGGACT | 10 | 31 | 37 | 0.600
18 | GCCCCCGGACT | 11 | 23 | 33 | 0.909
19 | TGCCCCCGGACT | 12 | 15 | 30 | 1.250
20 | GTGCCCCCGGACT | 13 | 7 | 28 | 1.615
21 | CCGGACTA | 8 | 48 | 49 | 0.125
22 | CCCGGACTA | 9 | 41 | 44 | 0.333
23 | CCCCGGACTA | 10 | 34 | 40 | 0.600
24 | CCCCCGGACTA | 11 | 27 | 37 | 0.909
25 | GCCCCCGGACTA | 12 | 20 | 35 | 1.250
26 | TGCCCCCGGACTA | 13 | 13 | 34 | 1.615
27 | GTGCCCCCGGACTA | 14 | 6 | 34 | 2.000
28 | CGGACTAC | 8 | 47 | 48 | 0.125
29 | CCGGACTAC | 9 | 41 | 44 | 0.333
30 | CCCGGACTAC | 10 | 35 | 41 | 0.600
31 | CCCCGGACTAC | 11 | 29 | 39 | 0.909
32 | CCCCCGGACTAC | 12 | 23 | 38 | 1.250
33 | GCCCCCGGACTAC | 13 | 17 | 38 | 1.615
34 | TGCCCCCGGACTAC | 14 | 11 | 39 | 2.000
35 | GTGCCCCCGGACTAC | 15 | 5 | 41 | 2.400
36 | GGACTACA | 8 | 44 | 45 | 0.125
37 | CGGACTACA | 9 | 39 | 42 | 0.333
38 | CCGGACTACA | 10 | 34 | 40 | 0.600
39 | CCCGGACTACA | 11 | 29 | 39 | 0.909
40 | CCCCGGACTACA | 12 | 24 | 39 | 1.250
41 | CCCCCGGACTACA | 13 | 19 | 40 | 1.615
42 | GCCCCCGGACTACA | 14 | 14 | 42 | 2.000
43 | TGCCCCCGGACTACA | 15 | 9 | 45 | 2.400
44 | GTGCCCCCGGACTACA | 16 | 4 | 49 | 2.813
45 | GACTACAT | 8 | 39 | 40 | 0.125
46 | GGACTACAT | 9 | 35 | 38 | 0.333
47 | CGGACTACAT | 10 | 31 | 37 | 0.600
48 | CCGGACTACAT | 11 | 27 | 37 | 0.909
49 | CCCGGACTACAT | 12 | 23 | 38 | 1.250
50 | CCCCGGACTACAT | 13 | 19 | 40 | 1.615
51 | CCCCCGGACTACAT | 14 | 15 | 43 | 2.000
52 | GCCCCCGGACTACAT | 15 | 11 | 47 | 2.400
53 | TGCCCCCGGACTACAT | 16 | 7 | 52 | 2.813
54 | GTGCCCCCGGACTACAT | 17 | 3 | 58 | 3.235
55 | ACTACATT | 8 | 32 | 33 | 0.125
56 | GACTACATT | 9 | 29 | 32 | 0.333
57 | GGACTACATT | 10 | 26 | 32 | 0.600
58 | CGGACTACATT | 11 | 23 | 33 | 0.909
59 | CCGGACTACATT | 12 | 20 | 35 | 1.250
60 | CCCGGACTACATT | 13 | 17 | 38 | 1.615
61 | CCCCGGACTACATT | 14 | 14 | 42 | 2.000
62 | CCCCCGGACTACATT | 15 | 11 | 47 | 2.400
63 | GCCCCCGGACTACATT | 16 | 8 | 53 | 2.813
64 | TGCCCCCGGACTACATT | 17 | 5 | 60 | 3.235
65 | GTGCCCCCGGACTACATT | 18 | 2 | 68 | 3.667
66 | CTACATTT | 8 | 23 | 24 | 0.125
67 | ACTACATTT | 9 | 21 | 24 | 0.333
68 | GACTACATTT | 10 | 19 | 25 | 0.600
69 | GGACTACATTT | 11 | 17 | 27 | 0.909
70 | CGGACTACATTT | 12 | 15 | 30 | 1.250
71 | CCGGACTACATTT | 13 | 13 | 34 | 1.615
72 | CCCGGACTACATTT | 14 | 11 | 39 | 2.000
73 | CCCCGGACTACATTT | 15 | 9 | 45 | 2.400
74 | CCCCCGGACTACATTT | 16 | 7 | 52 | 2.813
75 | GCCCCCGGACTACATTT | 17 | 5 | 60 | 3.235
76 | TGCCCCCGGACTACATTT | 18 | 3 | 69 | 3.667
77 | GTGCCCCCGGACTACATTT | 19 | 1 | 79 | 4.105
78 | TACATTTT | 8 | 12 | 13 | 0.125
79 | CTACATTTT | 9 | 11 | 14 | 0.333
80 | ACTACATTTT | 10 | 10 | 16 | 0.600
81 | GACTACATTTT | 11 | 9 | 19 | 0.909
82 | GGACTACATTTT | 12 | 8 | 23 | 1.250
83 | CGGACTACATTTT | 13 | 7 | 28 | 1.615
84 | CCGGACTACATTTT | 14 | 6 | 34 | 2.000
85 | CCCGGACTACATTTT | 15 | 5 | 41 | 2.400
86 | CCCCGGACTACATTTT | 16 | 4 | 49 | 2.813
87 | CCCCCGGACTACATTTT | 17 | 3 | 58 | 3.235
88 | GCCCCCGGACTACATTTT | 18 | 2 | 68 | 3.667
89 | TGCCCCCGGACTACATTTT | 19 | 1 | 79 | 4.105
90 | GTGCCCCCGGACTACATTTT | 20 | 0 | 91 | 4.550

Complementary Subsequences

For a given DNA sequence (S), the complementary sequence (C) is a sequence formed using complementary bases according to the base-pairing rules described above, i.e., A pairs with T (and vice versa) and C pairs with G (and vice versa).

Note that on the RNA the complement of A is U (Uracil), not thymine (T).

An RNA sequence R may be written as r_1 r_2 ... r_k, wherein each r_i is selected from the set {A, C, U, G}. When DNA is transcribed to RNA each "T" is replaced by a "U", otherwise the sequence remains the same.

For a given sequence S, the reverse complementary sequence of S is the complement of the reverse of S.

For example, the complementary sequence of "GTAAGCCT" is "CATTCGGA," and the reverse complementary sequence of "GTAAGCCT" is "AGGCTTAC."

In embodiments hereof, the system maintains a count of the occurrences of all subsequences that include any subsequence's complement.
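The complement and reverse complement operations can be sketched as follows, using the DNA base pairing (A with T, C with G) described earlier. The function names are hypothetical. For the input "GTAAGCCT", the base-by-base complement is "CATTCGGA" and the complement of the reverse is "AGGCTTAC".

```python
# DNA base pairing: A <-> T and C <-> G.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna: str) -> str:
    """Replace each base by its pairing partner, keeping the order."""
    return dna.translate(_COMPLEMENT)


def reverse_complement(dna: str) -> str:
    """Complement of the reverse of the sequence."""
    return complement(dna[::-1])


if __name__ == "__main__":
    print(complement("GTAAGCCT"))
    print(reverse_complement("GTAAGCCT"))
```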

The Signature mechanism

The signature mechanism provides a signature value of an alphanumeric string, where the value may be used, e.g., as a database index or key for that string. In presently preferred implementations the signature (Vs) for a particular sequence (represented, e.g., as a character string S = S_0 S_1...S_n) is obtained as follows:

1: Generate a sequence (or ordered list) of substrings of S using, e.g., the pattern function (described above, e.g., Pattern_8(S)).

2: Generate a sequence of hash values, one for each element of the ordered list of substrings.

3: Generate signature Fs as a function of the sequence of hash values of the substrings. Preferably the signature Fs is a hash of the concatenated sequence of hash values of the substrings.

4: Repeat steps 1 to 3 for the reverse (R) of string S to obtain a signature Rs of the reverse string R.

5: Generate Vs as a function of Fs and Rs (e.g., Fs + Rs).

Where the sequence S corresponds to an intron, the value

Vs = signature (S)

is a signature for that intron.

Where the sequence S corresponds to an exon, the value

Vs = signature (S)

is a signature for that exon.

Where the sequence S corresponds to a protein, the value

Vs = signature (S)

is a signature for that protein.

Where the sequence S corresponds to a plurality of introns in a gene, the value Vs is an intron signature for that gene. For example, suppose a gene has the form:

Then the introns for that gene are:

and Vs = signature(<INTRON_1><INTRON_2><INTRON_3>) may be considered an intron signature of the gene.

In some embodiments, the hashes of the reverse of the string may be omitted (in step 4), and the value Vs is a function of Fs, e.g., Vs = Fs. Other information, e.g., the length of the string S, may be used to determine Vs.

A requirement of the signature function is that two identical strings have the same signature. As is well known, a hash function is a function that converts its input into a numeric value, called its hash value. Preferably the hash function is a non-cryptographic hash function such as FNV (e.g., FNV1-64Bit). Exemplary source code for the FNV hashing function (in the Ruby programming language) is shown in Appendix I to Application PCT/US 15/30478, which has been incorporated into this application for all purposes.

For example, using the exemplary FNV source code in Appendix I to Application PCT/US15/30478, the function call puts FNV.calculate("ABCDE") generates the hash value 813007184206524010 corresponding to the string "ABCDE".

It should be appreciated that any hash function may be used, and that FNV is merely provided by way of example. Furthermore, while any hash function may be used, preferred embodiments hereof use two-way or reversible hash functions. In preferred embodiments the same hash function must be used for all independent entities in the string to produce the result.
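The signature steps can be sketched as below. Several choices here are assumptions made only for illustration: the hash is the 64-bit FNV-1 variant with its published offset basis and prime, the hash values are concatenated as decimal text before being hashed again, and Fs + Rs is reduced to 64 bits. This sketch is not the Appendix I code, and it has not been checked against the hash value quoted above for "ABCDE".

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = (1 << 64) - 1


def fnv1_64(data: bytes) -> int:
    """64-bit FNV-1: multiply by the prime, then XOR in each byte."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h = (h * FNV64_PRIME) & MASK64
        h ^= byte
    return h


def pattern(s: str, min_len: int = 8) -> list:
    """Ordered substrings of length >= min_len, grouped by end position, shortest first."""
    return [s[start:i + 1]
            for i in range(len(s))
            for start in range(i - min_len + 1, -1, -1)]


def one_way_signature(s: str) -> int:
    """Steps 1 to 3: hash each pattern element, then hash the concatenated hashes."""
    hashes = [fnv1_64(sub.encode("ascii")) for sub in pattern(s)]
    concatenated = "".join(str(h) for h in hashes)  # decimal text is an assumption
    return fnv1_64(concatenated.encode("ascii"))


def signature(s: str) -> int:
    """Steps 4 and 5: combine the forward and reverse signatures as Fs + Rs."""
    fs = one_way_signature(s)
    rs = one_way_signature(s[::-1])
    return (fs + rs) & MASK64  # truncation to 64 bits is an assumption


if __name__ == "__main__":
    print(signature("GTGCCCCCGGACTACATTTT"))
```

With Fs + Rs as the combining function, a string and its reverse receive the same signature, since the two terms simply swap.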

PhV mechanism

Recall that a given gene has a number of transcripts, some of which may be protein coding. For example, MEN1 has 15 protein-coding transcripts and BRCA1 has 25 protein-coding transcripts.

For a given gene with k protein-coding transcripts (T_1, T_2 ... T_k), the PhV mechanism generates an ordered sequence (or vector) of k values, where the i-th value is signature(T_i). Thus, e.g., for the MEN1 gene with 15 protein-coding transcripts,

PhV(MEN1) = [signature(T_1), signature(T_2), ..., signature(T_15)]
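The PhV construction amounts to applying a signature function to each transcript in order. In the sketch below the signature function is passed in as a parameter; `toy_signature` is a hypothetical stand-in used only so the example runs, and is not the signature mechanism described above.

```python
from typing import Callable, List, Sequence


def phv(transcripts: Sequence[str], signature: Callable[[str], int]) -> List[int]:
    """Return [signature(T_1), ..., signature(T_k)] in transcript order."""
    return [signature(t) for t in transcripts]


def toy_signature(s: str) -> int:
    """Stand-in for illustration only: sum of the character codes."""
    return sum(ord(c) for c in s)


if __name__ == "__main__":
    print(phv(["AC", "G"], toy_signature))
```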

Position in Vector (PiV) mechanism

For a given gene G with m protein-coding transcripts (m>1), a Pattern mechanism (e.g., Pattern_8) is applied to the intron values for each of the m transcripts, giving m ordered sequences of elements, one for the intron(s) of each transcript.

For the purposes of this description, and without loss of generality, the ordered sequence of elements for the intron(s) of the i-th transcript is referred to here as E_i. If a gene has m protein-coding transcripts, then m ordered sequences of transcripts (E_1, E_2 ... E_m) are generated.

For each of the m ordered sequences of transcripts (E_1, E_2 ... E_m), the iScore mechanism is applied to the components of E_i, giving, for each element of E_i (i.e., at each offset in E_i) a corresponding iScore.

The Position in Vector (PiV) mechanism determines an ordered sequence (or vector) of vector index values for each offset of an intron subsequence. The vector index values for each PiV are in the range 1 to m for a gene that has m protein-coding transcripts (m>1). The protein hash vector mechanism (described above), applied to the proteins of the gene, gives a vector with m protein signatures (sometimes also referred to as protein hashes).

For a particular intron subsequence (e.g., E_i), at a given offset (e.g., offset = j), the PiV for the given offset j is determined by comparing the iScore values for each of the other intron subsequences for this gene at the given offset j. If the sorted iScore values match the protein hash vector for the gene under consideration, then the PiV uses those indices. Since iValues may be the same for multiple indices, some PiV indices with the same iValue may be reordered. On the other hand, if the sorted iScore values do not match the protein hash vector (PhV) for the gene under consideration, then the PiV uses a max-order for the indices, where the max-order tries to maximize the match of the order of the PhV.

Position in Vector (PiV) is determined using subsequence hashes at the same nucleotide from zero. The ordered subsequence hashes are compared to the original position of the transcript protein hash, the constant of each transcript in the set being analyzed.

The cumulative PiV is a count, for all subsequences, of the number of times they are assigned to each PiV.

It should be appreciated that PhV and PiV are each relativity measures of all transcripts at an offset. Both maximize the ordering of transcripts based on the same application of ordering rules. PiV begins using iScore, PhV protein hash. Protein is the constant for each transcript at all offsets. Using its hash causes each transcript at any offset to start in the same order. Whereas, iScore is a variable that potentially starts each transcript in a different order at each offset.

The rules of max-order for transcripts at an offset may be summarized (for some embodiments) as follows: PiV is initially grouped by iScore, and PhV by protein hash. Each is subsequently reordered within its sub-group boundaries by the reciprocal measure (PhV or iScore, respectively) to obtain a final ordering that is maximized at the offset.
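The "group, then reorder within the group" idea can be illustrated with a deliberately simplified sketch. It only sorts by iScore and breaks ties by protein hash; it is not the full max-order procedure, and the `order_within_ties` name and the symbol keys are assumptions made for illustration.

```ruby
# Simplified sketch of "group, then reorder within the group": transcripts
# are grouped by iScore, and each tie group is ordered by protein hash.
# This is NOT the full max-order procedure; all names are hypothetical.
def order_within_ties(transcripts)
  transcripts
    .sort_by { |t| [t[:iscore], t[:protein_hash]] }
    .map { |t| t[:id] }
end

# Illustrative values only.
sample = [
  { id: "T1", iscore: 3.5, protein_hash: 30 },
  { id: "T2", iscore: 3.5, protein_hash: 10 },
  { id: "T3", iscore: 3.2, protein_hash: 20 }
]
p order_within_ties(sample) # T3 first (lowest iScore), then T2 before T1
```

Sorting on the pair of values keeps each iScore group contiguous, so the secondary key can only move transcripts within their own group.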

Our objective is to discover transcripts at offsets that retain the same position in each of their respective vectors, because the sequence text of transcripts that retain position at offsets under both methods exposes a dominant subset.
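A minimal sketch of that comparison at a single offset, assuming each vector is available as a mapping from transcript identifier to position (the `retained_positions` name and this layout are hypothetical):

```ruby
# Hypothetical helper: return the transcripts whose position is the same in
# both vectors at one offset. Each argument maps transcript_id => position.
def retained_positions(piv, phv)
  piv.select { |transcript_id, position| phv[transcript_id] == position }.keys
end

# Illustrative positions only.
piv = { "T1" => 1, "T2" => 3, "T3" => 2 }
phv = { "T1" => 1, "T2" => 2, "T3" => 3 }
p retained_positions(piv, phv)
```

Here only "T1" holds position 1 in both vectors, so it is the only transcript returned.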

Example:

Recall that MEN1 has 15 transcripts (ENST00000413626; ENST00000429702; ENST00000394376; ENST00000377326; ENST00000450708; ENST00000377313; ENST00000424912; ENST00000377321; ENST00000315422; ENST00000443283; ENST00000337652; ENST00000312049; ENST00000377316; ENST00000440873; ENST00000394374). For the MEN1 gene, at offset 1,000, the subsequences for the introns of each transcript are shown in the following table:

The following table is an iScore matrix for the MEN1 gene at offset 1,000.

Sequence           iScore         Position in iScore Vector
ENST00000377316    3.666666667    1
ENST00000440873    3.666666667    3
ENST00000394374    3.666666667    6

Example

For the BRCA1 gene at offset 7,000, the subsequences for the introns of each transcript are shown in the following table:

Sequence           Compared Sequence
ENST00000493919    GGAAAGGGACAGGGGGCCCAAGTGATGCTCTGGGGTACTGGCGTGGGAGAGTGGATTTCCGAAGCTGACAGATGGGTATTCTTTGACGGGGGGTAGGGGCGGAAC
ENST00000489037    GGGGCGGAACCTGAGAGGCGTAAGGCGTTGTGAACCCTGGGGAGGGGGGCAGTTTGTAGGTCGCGAGGGAAGCGCTGAGGATCAGGAAGGGGGCACTGAGTGTCC
ENST00000412061    CTTAACAGGCACTGAAAAGAGAGTGGGTAGATACAGTACTGTAATTAGATTATTCTGAAGACCATTTGGGACCTTTACAACCCACAAAATCTCTTGGCAGAGTTA
ENST00000468300    GGAAAGGGACAGGGGGCCCAAGTGATGCTCTGGGGTACTGGCGTGGGAGAGTGGATTTCCGAAGCTGACAGATGGGTATTCTTTGACGGGGGGTAGGGGCGGAAC
ENST00000470026    GGGGCGGAACCTGAGAGGCGTAAGGCGTTGTGAACCCTGGGGAGGGGGGCAGTTTGTAGGTCGCGAGGGAAGCGCTGAGGATCAGGAAGGGGGCACTGAGTGTCC
ENST00000478531    GGAAAGGGACAGGGGGCCCAAGTGATGCTCTGGGGTACTGGCGTGGGAGAGTGGATTTCCGAAGCTGACAGATGGGTATTCTTTGACGGGGGGTAGGGGCGGAAC
ENST00000497488    GGACAGGGGGCCCAAGTGATGCTCTGGGGTACTGGCGTGGGAGAGTGGATTTCCGAAGCTGACAGATGGGTATTCTTTGACGGGGGGTAGGGGCGGAACCTGAGA
ENST00000494123    GGAAAGGGACAGGGGGCCCAAGTGATGCTCTGGGGTACTGGCGTGGGAGAGTGGATTTCCGAAGCTGACAGATGGGTATTCTTTGACGGGGGGTAGGGGCGGAAC
ENST00000473961    GAATGACACTCAAGTGCTGTCCATGAAAACTCAGGAAGTTTGCACAATTACTTTCTATGACGTGGTGATAAGACCTTTTAGTCTAGGTTAATTTTAGTTCTGTAT
ENST00000354071    GGACAGGGGGCCCAAGTGATGCTCTGGGGTACTGGCGTGGGAGAGTGGATTTCCGAAGCTGACAGATGGGTATTCTTTGACGGGGGGTAGGGGCGGAACCTGAGA
ENST00000461574    TGTTTGCCCCAGTCTATTTATAGAAGTGAGCTAAATGTTTATGCTTTTGGGGAGCACATTTTACAAATTTCCAAGTATAGTTAAAGGAACTGCTTCTTAAACTTG
ENST00000492859    GGACAGGGGGCCCAAGTGATGCTCTGGGGTACTGGCGTGGGAGAGTGGATTTCCGAAGCTGACAGATGGGTATTCTTTGACGGGGGGTAGGGGCGGAACCTGAGA
ENST00000346315    GGAAAGGGACAGGGGGCCCAAGTGATGCTCTGGGGTACTGGCGTGGGAGAGTGGATTTCCGAAGCTGACAGATGGGTATTCTTTGACGGGGGGTAGGGGCGGAAC
ENST00000352993    GGACAGGGGGCCCAAGTGATGCTCTGGGGTACTGGCGTGGGAGAGTGGATTTCCGAAGCTGACAGATGGGTATTCTTTGACGGGGGGTAGGGGCGGAACCTGAGA
ENST00000586385    GGGGCGGAACCTGAGAGGCGTAAGGCGTTGTGAACCCTGGGGAGGGGGGCAGTTTGTAGGTCGCGAGGGAAGCGCTGAGGATCAGGAAGGGGGCACTGAGTGTCC
ENST00000591849    GGACAGGGGGCCCAAGTGATGCTCTGGGGTACTGGCGTGGGAGAGTGGATTTCCGAAGCTGACAGATGGGTATTCTTTGACGGGGGGTAGGGGCGGAACCTGAGA
ENST00000476777    GGACAGGGGGCCCAAGTGATGCTCTGGGGTACTGGCGTGGGAGAGTGGATTTCCGAAGCTGACAGATGGGTATTCTTTGACGGGGGGTAGGGGCGGAACCTGAGA
ENST00000309486    GGAAAGGGACAGGGGGCCCAAGTGATGCTCTGGGGTACTGGCGTGGGAGAGTGGATTTCCGAAGCTGACAGATGGGTATTCTTTGACGGGGGGTAGGGGCGGAAC
ENST00000484087    GAATGACACTCAAGTGCTGTCCATGAAAACTCAGGAAGTTTGCACAATTACTTTCTATGACGTGGTGATAAGACCTTTTAGTCTAGGTTAATTTTAGTTCTGTAT
ENST00000491747    GGAAAGGGACAGGGGGCCCAAGTGATGCTCTGGGGTACTGGCGTGGGAGAGTGGATTTCCGAAGCTGACAGATGGGTATTCTTTGACGGGGGGTAGGGGCGGAAC
ENST00000591534    GGACAGGGGGCCCAAGTGATGCTCTGGGGTACTGGCGTGGGAGAGTGGATTTCCGAAGCTGACAGATGGGTATTCTTTGACGGGGGGTAGGGGCGGAACCTGAGA

The following table is an iScore matrix for the BRCA1 gene at offset 7,000.

Sequence           iScore              Position in iScore Vector
ENST00000477152    46.2285714285714    18
ENST00000493795    46.2285714285714    21
ENST00000487825    46.2                2
ENST00000461221    46.2190476190476    9
ENST00000357654    46.2285714285714    13
ENST00000471181    46.2285714285714    27
ENST00000351666    46.2380952380952    29
ENST00000461798    46.2285714285714    10
ENST00000493919    46.2285714285714    26
ENST00000489037    46.2190476190476    7
ENST00000412061    46.2095238095238    5
ENST00000468300    46.2285714285714    12
ENST00000470026    46.2190476190476    6
ENST00000478531    46.2285714285714    20
ENST00000497488    46.2285714285714    25
ENST00000494123    46.2285714285714    14
ENST00000473961    46.2                3
ENST00000354071    46.2285714285714    24
ENST00000461574    46.2                4
ENST00000492859    46.2285714285714    22
ENST00000346315    46.2285714285714    28
ENST00000352993    46.2285714285714    23
ENST00000586385    46.2190476190476    8
ENST00000591849    46.2285714285714    16
ENST00000476777    46.2285714285714    11
ENST00000309486    46.2285714285714    17
ENST00000484087    46.2                1
ENST00000491747    46.2285714285714    15
ENST00000591534    46.2285714285714    19
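Many transcripts in the matrix above share an identical iScore, which is why their relative positions can be reordered. As a minimal sketch, the tie groups can be listed as follows; the `tie_groups` helper is hypothetical, and the five rows are copied from the BRCA1 matrix above.

```ruby
# Hypothetical helper: list groups of transcripts that share an iScore,
# in ascending iScore order. Input maps transcript_id => iScore.
def tie_groups(iscores)
  iscores.group_by { |_id, score| score }
         .sort_by { |score, _pairs| score }
         .map { |score, pairs| [score, pairs.map(&:first)] }
end

# Five rows copied from the BRCA1 iScore matrix at offset 7,000.
rows = {
  "ENST00000484087" => 46.2,
  "ENST00000487825" => 46.2,
  "ENST00000412061" => 46.2095238095238,
  "ENST00000470026" => 46.2190476190476,
  "ENST00000351666" => 46.2380952380952
}
tie_groups(rows).each { |score, ids| puts "#{score}: #{ids.join(', ')}" }
```

For these rows, the two transcripts with an iScore of 46.2 fall in one group, consistent with their adjacent positions (1 and 2) in the matrix.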

Computing

Programs that implement such methods (as well as other types of data) may be stored and transmitted using a variety of media (e.g., computer readable media) in a number of manners. Hard-wired circuitry or custom hardware may be used in place of, or in combination with, some or all of the software instructions that can implement the processes of various embodiments. Thus, various combinations of hardware and software may be used instead of software only.

FIG. 3 is a schematic diagram of a computer system 300 upon which embodiments of the present disclosure may be implemented and carried out.

According to the present example, the computer system 300 includes a bus 302 (i.e., interconnect), one or more processors 304, one or more communications ports 314, a main memory 306, removable storage media 310, read-only memory 308, and a mass storage 312. Communication port(s) 314 may be connected to one or more networks by way of which the computer system 300 may receive and/or transmit data.

As used herein, a "processor" means one or more microprocessors, central processing units (CPUs), computing devices, microcontrollers, digital signal processors, or like devices or any combination thereof, regardless of their architecture. An apparatus that performs a process can include, e.g., a processor and those devices such as input devices and output devices that are appropriate to perform the process.

Processor(s) 304 can be (or include) any known processor, such as, but not limited to, an Intel® Itanium® or Itanium 2® processor(s), AMD® Opteron® or Athlon MP® processor(s), or Motorola® lines of processors, and the like. Communications port(s) 314 can be any of an RS-232 port for use with a modem-based dial-up connection, a 10/100 Ethernet port, a Gigabit port using copper or fiber, or a USB port, and the like. Communications port(s) 314 may be chosen depending on a network such as a Local Area Network (LAN), a Wide Area Network (WAN), a CDN, or any network to which the computer system 300 connects. The computer system 300 may be in communication with peripheral devices (e.g., display screen 316, input device(s) 318) via Input/Output (I/O) port 320. Some or all of the peripheral devices may be integrated into the computer system 300, and the input device(s) 318 may be integrated into the display screen 316 (e.g., in the case of a touch screen).

Main memory 306 can be Random Access Memory (RAM), or any other dynamic storage device(s) commonly known in the art. Read-only memory 308 can be any static storage device(s) such as Programmable Read-Only Memory (PROM) chips for storing static information such as instructions for processor(s) 304. Mass storage 312 can be used to store information and instructions. For example, hard disks such as the Adaptec® family of Small Computer System Interface (SCSI) drives, an optical disc, an array of disks such as Redundant Array of Independent Disks (RAID), such as the Adaptec® family of RAID drives, or any other mass storage devices may be used. Bus 302 communicatively couples processor(s) 304 with the other memory, storage and communications blocks. Bus 302 can be a PCI/PCI-X, SCSI, a Universal Serial Bus (USB) based system bus (or other) depending on the storage devices used, and the like. Removable storage media 310 can be any kind of external hard-drives, floppy drives, IOMEGA® Zip Drives, Compact Disc - Read Only Memory (CD-ROM), Compact Disc - Re-Writable (CD-RW), Digital Versatile Disk - Read Only Memory (DVD-ROM), etc.

Embodiments herein may be provided as one or more computer program products, which may include a machine-readable medium having stored thereon instructions, which may be used to program a computer (or other electronic devices) to perform a process. As used herein, the term "machine-readable medium" refers to any medium, a plurality of the same, or a combination of different media, which participate in providing data (e.g., instructions, data structures) which may be read by a computer, a processor or a like device. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical or magnetic disks and other persistent memory. Volatile media include dynamic random access memory, which typically constitutes the main memory of the computer. Transmission media include coaxial cables, copper wire and fiber optics, including the wires that comprise a system bus coupled to the processor. Transmission media may include or convey acoustic waves, light waves and electromagnetic emissions, such as those generated during radio frequency (RF) and infrared (IR) data communications.

The machine-readable medium may include, but is not limited to, floppy diskettes, optical discs, CD-ROMs, magneto-optical disks, ROMs, RAMs, erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or other type of media/machine-readable medium suitable for storing electronic instructions. Moreover, embodiments herein may also be downloaded as a computer program product, wherein the program may be transferred from a remote computer to a requesting computer by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., modem or network connection).

Various forms of computer readable media may be involved in carrying data (e.g., sequences of instructions) to a processor. For example, data may be (i) delivered from RAM to a processor; (ii) carried over a wireless transmission medium; (iii) formatted and/or transmitted according to numerous formats, standards or protocols; and/or (iv) encrypted in any of a variety of ways well known in the art.

A computer-readable medium can store (in any appropriate format) those program elements that are appropriate to perform the methods. As shown, main memory 306 is encoded with application(s) 322 that support(s) the functionality as discussed herein (an application 322 may be an application that provides some or all of the functionality of one or more of the mechanisms described herein). Application(s) 322 (and/or other resources as described herein) can be embodied as software code such as data and/or logic instructions (e.g., code stored in the memory or on another computer readable medium such as a disk) that supports processing functionality according to different embodiments described herein.

During operation of one embodiment, processor(s) 304 accesses main memory 306 via the use of bus 302 in order to launch, run, execute, interpret or otherwise perform the logic instructions of the application(s) 322. Execution of application(s) 322 produces processing functionality of the service(s) or mechanism(s) related to the application(s). In other words, the process(es) 324 represents one or more portions of the application(s) 322 performing within or upon the processor(s) 304 in the computer system 300.

It should be noted that, in addition to the process(es) 324 that carry out operations as discussed herein, other embodiments herein include the application 322 itself (i.e., the un-executed or non-performing logic instructions and/or data). The application 322 may be stored on a computer readable medium (e.g., a repository) such as a disk or in an optical medium. According to other embodiments, the application 322 can also be stored in a memory type system such as in firmware, read only memory (ROM), or, as in this example, as executable code within the main memory 306 (e.g., within Random Access Memory or RAM). For example, application 322 may also be stored in removable storage media 310, read-only memory 308, and/or mass storage device 312.

The application(s) 322 may correspond, at least in part, to one or more of the mechanisms 102 shown in FIG. 1. For example, each of the "Pattern" mechanism; the "count" mechanisms; the "offset count, bigger length" mechanism; the "offset count, ignore length" mechanism; the iScore mechanism; the Signature mechanism; PhV mechanism; and the Position in Vector (PiV) mechanism may conveniently be implemented by one or more application(s) 322 that, when executed, run as corresponding processes 324.

Those skilled in the art will understand that the computer system 300 can include other processes and/or software and hardware components, such as an operating system that controls allocation and use of hardware resources. As discussed herein, embodiments of the present invention include various steps or operations. A variety of these steps may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the operations.

Alternatively, the steps may be performed by a combination of hardware, software, and/or firmware. The term "module" refers to a self-contained functional component, which can include hardware, software, firmware or any combination thereof.

One of ordinary skill in the art will readily appreciate and understand, upon reading this description, that embodiments of an apparatus may include a computer/computing device operable to perform some (but not necessarily all) of the described process.

Embodiments of a computer-readable medium storing a program or data structure include a computer-readable medium storing a program that, when executed, can cause a processor to perform some (but not necessarily all) of the described process.

Where a process is described herein, those of skill in the art will appreciate that the process may operate without any user intervention. In another embodiment, the process includes some human intervention (e.g., a step is performed by or with the assistance of a human).

As used in this description, the term "portion" means some or all. So, for example, "A portion of X" may include some of "X" or all of "X". In the context of a sequence, the term "portion" means some or all of the sequence.

As used herein, including in the claims, the phrase "at least some" means "one or more," and includes the case of only one. Thus, e.g., the phrase "at least some ABCs" means "one or more ABCs", and includes the case of only one ABC.

As used herein, including in the claims, the phrase "based on" means "based in part on" or "based, at least in part, on," and is not exclusive. Thus, e.g., the phrase "based on factor X" means "based in part on factor X" or "based, at least in part, on factor X." Unless specifically stated by use of the word "only", the phrase "based on X" does not mean "based only on X."

As used herein, including in the claims, the phrase "using" means "using at least," and is not exclusive. Thus, e.g., the phrase "using X" means "using at least X." Unless specifically stated by use of the word "only", the phrase "using X" does not mean "using only X."

In general, as used herein, including in the claims, unless the word "only" is specifically used in a phrase, it should not be read into that phrase.

As used herein, including in the claims, the phrase "overlap" means "at least partially overlap." Unless specifically stated, "overlap" does not mean fully overlap. Thus, by way of non-limiting example, the sequence "ABCD" may be said to overlap the sequence "BCDE" and the sequence "ABCDE" and the sequence "DEFG".

As used herein, including in the claims, the phrase "distinct" means "at least partially distinct." Unless specifically stated, distinct does not mean fully distinct. Thus, e.g., the phrase, "X is distinct from Y" means that "X is at least partially distinct from Y," and does not mean that "X is fully distinct from Y." Thus, as used herein, including in the claims, the phrase "X is distinct from Y" means that X differs from Y in at least some way. It should be appreciated that two fully or partially overlapping sequences can be distinct. As a non-limiting example, the sequence "BCD" overlaps and is distinct from the sequence "ABCDE".

As used herein, including in the claims, a list may include only one item, and, unless otherwise stated, a list of multiple items need not be ordered in any particular manner. A list may include duplicate items. For example, as used herein, the phrase "a list of XYZs" may include one or more "XYZs".

It should be appreciated that the words "first", "second", "third," and so on, if used in the claims, are used to distinguish or identify, and not to show a serial or numerical limitation. Similarly, letter or numerical labels (such as "(a)", "(b)", and the like) are used to help distinguish and/or identify, and not to show any serial or numerical limitation or ordering.

No ordering is implied by any of the labeled boxes in any of the flow diagrams unless specifically shown and stated. When disconnected boxes are shown in a diagram the activities associated with those boxes may be performed in any order, including fully or partially in parallel.

While the invention has been described in connection with and with respect to protein, DNA, and RNA molecules, those of ordinary skill in the art will realize and appreciate, upon reading this description, that these techniques may be applied to other molecules, including molecules in a genome (including non-coding molecules).

While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiments, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Appendix I (Source Code)

Exemplary source code (in the Ruby programming language) is shown below:

# Generate offset counts & iScore report
#
# Function 1: Scan sequences and generate offset counts & iScore
# Input example:
# [{"transcript_id"=>"ENST00000377313", "Direction" => "F", "offset" => 0, "sequence"=>"GATTGAT", "protein_hash_value" => 10016473972116439183},
#  {"transcript_id"=>"ENST00000377313", "Direction" => "F", "offset" => 1, "sequence"=>"GATTGATCC", "protein_hash_value" => 10016473972116439183},
#  {"transcript_id"=>"ENST00000377313", "Direction" => "F", "offset" => 2, "sequence"=>"GATTG", "protein_hash_value" => 12865572994853230352},
#  {"transcript_id"=>"ENST00000377313", "Direction" => "Fr", "offset" => 3, "sequence"=>"TTTTTTTTTTTTTT", "protein_hash_value" => 9662452273227638168},
#  {"transcript_id"=>"ENST00000394374", "Direction" => "F", "offset" => 0, "sequence"=>"GATTGATCC", "protein_hash_value" => 9662452273227638168}]
# Run Iscore.test_iscore to run an example
#
# Function 2: Generate iScore report
# Input example:
# [{"transcript_id"=>"ENST00000377313", "Direction" => "F", "offset" => 0, "sequence"=>"GATTGAT", "protein_hash_value" => 10016473972116439183},
#  {"transcript_id"=>"ENST00000377313", "Direction" => "F", "offset" => 1, "sequence"=>"GATTGATCC", "protein_hash_value" => 10016473972116439183},
#  {"transcript_id"=>"ENST00000377313", "Direction" => "F", "offset" => 2, "sequence"=>"GATTG", "protein_hash_value" => 12865572994853230352},
#  {"transcript_id"=>"ENST00000377313", "Direction" => "Fr", "offset" => 3, "sequence"=>"TTTTTTTTTTTTTT", "protein_hash_value" => 9662452273227638168},
#  {"transcript_id"=>"ENST00000394374", "Direction" => "F", "offset" => 0, "sequence"=>"GATTGATCC", "protein_hash_value" => 9662452273227638168}]
# Run Iscore.test_iscore_report to run an example
#
# Function 3: Generate a particular offset iScore
# Iscore.test_iscore_specific_offset("ENST00000600453", 130, "F")
# Iscore.test_iscore_specific_offset("ENST00000597198", 154, "F")
#
# Function 4: Another example of iScore report
# Iscore.test_iscore_report_specific_offset

class Iscore
  require 'json'

  # For every sequence, count how many other sequences contain it (or are
  # contained by it), split by same/different transcript.
  def self.generate_offset_counts(sequences_array, offset = nil, direction = "F")
    icount = 0
    sequences_array.each do |value_hash|
      value_hash["offset_counts_same_transcript_bigger_length"] = 0
      value_hash["offset_counts_same_transcript_ignore_length"] = 0
      value_hash["offset_counts_diff_transcript_ignore_length"] = 0
      value_hash["offset_counts_diff_transcript_bigger_length"] = 0
      # When a specific offset is requested, only that offset/direction is counted
      next if !offset.nil? && (offset != value_hash["offset"] || direction != value_hash["Direction"])
      sequences_array.each do |compare_hash|
        bigger_count = 0
        smaller_count = 0
        src_length = value_hash["sequence"].length
        compare_length = compare_hash["sequence"].length
        if src_length <= compare_length && compare_hash["sequence"].include?(value_hash["sequence"])
          bigger_count = 1
        end
        if src_length > compare_length && value_hash["sequence"].include?(compare_hash["sequence"])
          smaller_count = 1
        end
        if value_hash["transcript_id"] == compare_hash["transcript_id"]
          if bigger_count == 1
            value_hash["offset_counts_same_transcript_bigger_length"] += 1
            value_hash["offset_counts_same_transcript_ignore_length"] += 1
          end
          if smaller_count == 1
            value_hash["offset_counts_same_transcript_ignore_length"] += 1
          end
        else
          if bigger_count == 1
            value_hash["offset_counts_diff_transcript_bigger_length"] += 1
            value_hash["offset_counts_diff_transcript_ignore_length"] += 1
          end
          if smaller_count == 1
            value_hash["offset_counts_diff_transcript_ignore_length"] += 1
          end
        end
      end
      icount += 1
      puts "Process: #{icount}"
    end
    return sequences_array
  end

  def self.calculate_iscore(sequences_array)
    puts "calculate_iscore"
    sequences_array.each do |value_hash|
      value_hash["iscore"] = (value_hash["offset_counts_same_transcript_ignore_length"] - value_hash["offset_counts_same_transcript_bigger_length"]) * 1.0 / value_hash["sequence"].length
    end
    return sequences_array
  end

  # Regroup the flat sequence list as { "<Direction><offset>" => { transcript_id => {"p" => ..., "i" => ...} } }
  def self.transfer_to_report_input_hash(sequence_array)
    ret_hash = Hash.new
    sequence_array.each do |sequence_hash|
      key = sequence_hash["Direction"] + sequence_hash["offset"].to_s
      ret_hash[key] = Hash.new if !ret_hash.has_key?(key)
      ret_hash[key][sequence_hash["transcript_id"]] = {"p" => sequence_hash["protein_hash_value"], "i" => sequence_hash["iscore"]}
    end
    return ret_hash
  end

  # Function 1: generate iScore
  def self.test_iscore
    sequences_array = [
      {"transcript_id"=>"ENST00000377313", "Direction" => "F", "offset" => 0, "sequence"=>"GATTGAT", "protein_hash_value" => 10016473972116439183},
      {"transcript_id"=>"ENST00000377313", "Direction" => "F", "offset" => 1, "sequence"=>"GATTGATCC", "protein_hash_value" => 10016473972116439183},
      {"transcript_id"=>"ENST00000377313", "Direction" => "F", "offset" => 2, "sequence"=>"GATTG", "protein_hash_value" => 12865572994853230352},
      {"transcript_id"=>"ENST00000377313", "Direction" => "Fr", "offset" => 3, "sequence"=>"GATTGATCC", "protein_hash_value" => 9662452273227638168}
    ]
    sequences_array = generate_offset_counts(sequences_array)
    sequences_array = calculate_iscore(sequences_array)
    puts sequences_array.to_json
    return sequences_array
  end

  # Function 2: generate iScore reports
  def self.test_iscore_report
    input_hashes = transfer_to_report_input_hash(test_iscore)
    input_hashes.each do |key, input_hash|
      # A report needs at least two transcripts at the same offset
      next if input_hash.size == 1
      puts Report.generate(input_hash).to_json
    end
  end

  # Function 3: generate iScore for specific offset
  def self.test_iscore_specific_offset(transcript_stable_id, offset, direction = "F")
    sequences_array = input_data(transcript_stable_id)
    sequences_array = generate_offset_counts(sequences_array, offset, direction)
    sequences_array = calculate_iscore(sequences_array)
    sequences_array.each do |seq_hash|
      puts seq_hash if seq_hash["offset"] == offset && direction == seq_hash["Direction"]
    end
    return nil
  end

  # Read "transcript_id,Direction,offset,protein_hash_value,sequence" rows from a CSV file
  def self.input_data(stable_id)
    puts "input data"
    t_array = Array.new
    File.foreach("resources/#{stable_id}.csv").with_index do |line, line_num|
      #puts "#{line_num}: #{line}"
      t_input = line.gsub(/\r|\t|\n/, "").split(",")
      raise "Data format error!" if t_input.size != 5
      t_hash = Hash.new
      t_hash["transcript_id"] = t_input[0]
      t_hash["Direction"] = t_input[1]
      t_hash["offset"] = t_input[2].to_i
      t_hash["protein_hash_value"] = t_input[3].to_i
      t_hash["sequence"] = t_input[4]
      t_array.push t_hash
    end
    return t_array
  end

  # Function 4: generate iScore report for specific offset
  def self.test_iscore_report_specific_offset
    input_hash = {
      "ENST00000596756" => {"p" => 4006617391866799131, "i" => 3.6666666666666665},
      "ENST00000597198" => {"p" => 12271691713048959439, "i" => 3.6666666666666665},
      "ENST00000309877" => {"p" => 12271691713048959439, "i" => 3.611111111111111},
      "ENST00000600911" => {"p" => 5168434887446584552, "i" => 3.611111111111111},
      "ENST00000597636" => {"p" => 9314657173126866352, "i" => 3.611111111111111},
      "ENST00000442265" => {"p" => 13186677374619181588, "i" => 3.611111111111111},
      "ENST00000377135" => {"p" => 14380345047581822947, "i" => 3.611111111111111},
      "ENST00000593337" => {"p" => 10922829938332626334, "i" => 3.611111111111111},
      "ENST00000601809" => {"p" => 13988773081312178186, "i" => 3.611111111111111},
      "ENST00000601373" => {"p" => 1075220869017619972, "i" => 3.611111111111111},
      "ENST00000596765" => {"p" => 4242541338062981510, "i" => 3.611111111111111},
      "ENST00000600022" => {"p" => 4242541338062981510, "i" => 3.611111111111111},
      "ENST00000599144" => {"p" => 9284278690327170613, "i" => 3.611111111111111},
      "ENST00000598808" => {"p" => 9284278690327170613, "i" => 3.611111111111111},
      "ENST00000593922" => {"p" => 9284278690327170613, "i" => 3.611111111111111},
      "ENST00000377139" => {"p" => 12271691713048959439, "i" => 3.611111111111111},
      "ENST00000599223" => {"p" => 14380345047581822947, "i" => 3.611111111111111},
      "ENST00000596822" => {"p" => 14742079273191910198, "i" => 3.611111111111111},
      "ENST00000595034" => {"p" => 14904930809086793406, "i" => 3.611111111111111},
      "ENST00000601291" => {"p" => 15016766153870036082, "i" => 3.611111111111111},
      "ENST00000600453" => {"p" => 12599430326588101984, "i" => 4.055555555555555},
      "ENST00000596788" => {"p" => 1939525357298296176, "i" => 3.7222222222222223},
      "ENST00000593818" => {"p" => 4343270347374302829, "i" => 3.611111111111111},
      "ENST00000598108" => {"p" => 12020088722410334026, "i" => 3.611111111111111}
    }
    puts Report.generate(input_hash).to_json
  end

end

puts "Testing iscore"
Iscore.test_iscore_specific_offset("ENST00000600453", 130, "F")

# Generate POP, GR and CR
#
class Report
  def self.generate(input_hash, transcripts_count = nil)
    # step 1: Generate protein rank/sort position
    sort_by_hash_value(input_hash, "p")
    # step 2: Generate intron rank/sort position
    sort_by_hash_value(input_hash, "i")
    # step 3: Reorder
    # step 3.1: group
    regroup_hash = regroup(input_hash)
    # step 3.2: reorder the group
    reorder(regroup_hash, input_hash)
    # Calculate POP, GR, CR
    return calculate(input_hash, transcripts_count)
  end

  def self.calculate(input_hash, transcripts_count = nil)

    transcripts_count = input_hash.size if transcripts_count.nil?
    sorted_tmp = input_hash.sort_by { |stable_id, hash_values| hash_values["intron_order_position"] }
    pop = 0
    total_gene_rank = 0
    sorted_tmp.each_with_index do |transcript, i_index|
      values = transcript[1]
      values["gene_rank"] = (values["p_ranking_position"] - values["i_ranking_position"]).abs
      values["paired"] = false
      total_gene_rank += values["gene_rank"]
      # A transcript is "paired" if it shares a protein rank with a neighbor,
      # or if its intron order position equals its protein rank
      if i_index == 0
        values["paired"] = true if values["p_ranking_position"] == sorted_tmp[i_index + 1][1]["p_ranking_position"] || values["intron_order_position"] == values["p_ranking_position"]
      elsif i_index == (sorted_tmp.size - 1)
        values["paired"] = true if values["p_ranking_position"] == sorted_tmp[i_index - 1][1]["p_ranking_position"] || values["intron_order_position"] == values["p_ranking_position"]
      else
        values["paired"] = true if values["p_ranking_position"] == sorted_tmp[i_index + 1][1]["p_ranking_position"] || values["p_ranking_position"] == sorted_tmp[i_index - 1][1]["p_ranking_position"] || values["intron_order_position"] == values["p_ranking_position"]
      end
      pop += 1 if values["paired"]
    end
    if input_hash.empty?
      order_percentage = nil
      gene_rank = nil
      combine_score = nil
    else
      order_percentage = pop * 1.0 / transcripts_count
      gene_rank = total_gene_rank * 1.0 / input_hash.size
      combine_score = gene_rank - order_percentage
    end
    print_hash(input_hash)
    return {"POP" => order_percentage, "GR" => gene_rank, "CR" => combine_score}
  end

  def self.print_hash(input_hash)

    sorted_tmp = input_hash.sort_by { |stable_id, hash_values| hash_values["intron_order_position"] }
    puts "NO., Transcript, Intron Value/iScore, Intron Order, Protein Order, Protein Value, Gene Rank Intron Order, Gene Rank"
    sorted_tmp.each_with_index do |transcript, index|
      puts "#{transcript[1]["paired"] ? "*" : ""}[#{index + 1}] #{transcript[0]}, #{transcript[1]["i"]}, #{transcript[1]["intron_order_position"]}, #{transcript[1]["p_ranking_position"]}, #{transcript[1]["p"]}, #{transcript[1]["i_ranking_position"]}, #{transcript[1]["gene_rank"]} "
    end
  end

  def self.reorder(regroup_hash, input_hash)

    # generate a temp hash which will be used later to determine bottom
    grouped_p_ranking_positions_hash = Hash.new
    regroup_hash.each_pair do |group_number, transcripts|
      grouped_p_ranking_positions_hash[group_number] = Array.new if !grouped_p_ranking_positions_hash.has_key? group_number
      transcripts.each_pair do |stable_id, p_ranking_position|
        grouped_p_ranking_positions_hash[group_number].push p_ranking_position
      end
    end
    group_index = 0
    previous_buttom = nil
    group_numbers = regroup_hash.keys.sort
    group_numbers.each do |group_number|
      transcripts = regroup_hash[group_number]
      handled = Array.new
      top_index = 0
      # decide top
      if group_number > 0 # skip the first group
        tmp_index = 0
        transcripts.each_pair do |stable_id, p_ranking_position|
          if p_ranking_position == previous_buttom
            tmp_index += 1
            top_index += 1
            input_hash[stable_id]["intron_order_position"] = group_index + tmp_index
            handled.push stable_id
          end
        end
      end
      # decide bottom
      if group_number < group_numbers.size # skip the last group
        # determine the bottom
        bottom_cadidates = Array.new
        temp_hash = Hash.new
        # remove those paired to top and already paired ones
        transcripts.each_pair do |stable_id, p_ranking_position|
          next if handled.include? stable_id
          if temp_hash.has_key? p_ranking_position
            temp_hash[p_ranking_position] += 1
          else
            temp_hash[p_ranking_position] = 1
          end
        end
        # generate bottom candidates
        temp_hash.each_pair do |p_ranking_position, count|
          bottom_cadidates.push p_ranking_position if (count == 1) && (grouped_p_ranking_positions_hash[group_number + 1].include? p_ranking_position)
        end
        if bottom_cadidates.size > 1
          bottom = determine_bot(bottom_cadidates, group_number + 1, grouped_p_ranking_positions_hash)
        elsif bottom_cadidates.size == 1
          bottom = bottom_cadidates.first
        end
        if !bottom.nil?
          tmp_index = 0
          transcripts.each_pair do |stable_id, p_ranking_position|
            if p_ranking_position == bottom && !(handled.include? stable_id)
              input_hash[stable_id]["intron_order_position"] = group_index + transcripts.size - tmp_index
              tmp_index += 1
              handled.push stable_id
            end
          end
        end
      end
      # sort middle
      tt_hash = Hash.new
      transcripts.each_pair do |stable_id, p_ranking_position|
        next if (handled.include? stable_id)
        tt_hash[stable_id] = p_ranking_position
      end
      if tt_hash.size > 0
        sorted_tt_hash = tt_hash.sort_by { |stable_id, p_ranking_position| p_ranking_position }
        sorted_tt_hash.each_with_index do |(stable_id, p_ranking_position), ii|
          input_hash[stable_id]["intron_order_position"] = group_index + top_index + ii + 1
          bottom = p_ranking_position if ii == (sorted_tt_hash.size - 1) && bottom.nil?
        end
      end
      group_index += transcripts.size
      previous_buttom = bottom if !bottom.nil? # reset bottom; bottom is nil only if all were pushed to top
    end
  end

def self.determine_bot(candidates, index, array)
  return nil if index > array.size # max out
  candidates.sort.each do |c|
    return c if !array[index].include?(c)
  end
  return determine_bot(candidates, index + 1, array)
end

def self.sort_by_hash_value(input_hash, hash_value_type)
  sorted_tmp = input_hash.sort_by { |stable_id, hash_values| hash_values[hash_value_type] }
  prev_hash_value = nil
  current_paired_position = 1
  current_rank_posistion = 1
  sorted_tmp.each_with_index do |a, index|
    if index > 0 && a[1][hash_value_type] != prev_hash_value
      current_paired_position += 1
      current_rank_posistion = index + 1
    end
    input_hash[a.first][hash_value_type + "_paired_position"] = current_paired_position
    input_hash[a.first][hash_value_type + "_ranking_position"] = current_rank_posistion
    prev_hash_value = a[1][hash_value_type]
  end
  return input_hash
end

def self.regroup(hash_value)
  group_hash = Hash.new
  hash_value.each_pair do |stable_id, values|
    group_hash[values["i_paired_position"]] = {stable_id => values["p_ranking_position"]} if !group_hash.has_key? values["i_paired_position"]
    group_hash[values["i_paired_position"]][stable_id] = values["p_ranking_position"]
  end
  return group_hash
end

def self.test
  input = {
    "ENST00000377313" => {"p" => 10016473972116439183, "i" => 167613066517741313},
    "ENST00000394374" => {"p" => 10016473972116439183, "i" => 2503961881512634917},
    "ENST00000394376" => {"p" => 10016473972116439183, "i" => 5164633206682038560},
    "ENST00000424912" => {"p" => 12865572994853230352, "i" => 5164633206682038560},
    "ENST00000429702" => {"p" => 12865572994853230352, "i" => 5819545816596913902},
    "ENST00000337652" => {"p" => 10016473972116439183, "i" => 5819545816596913902},
    "ENST00000443283" => {"p" => 10016473972116439183, "i" => 7189214929071588867},
    "ENST00000440873" => {"p" => 9662452273227638168, "i" => 9351644970681852145},
    "ENST00000377321" => {"p" => 10854016887269896695, "i" => 9351644970681852145},
    "ENST00000377316" => {"p" => 1070273296197864103, "i" => 9534311961692397952},
    "ENST00000377326" => {"p" => 13888077090689236035, "i" => 9534311961692397952},
    "ENST00000315422" => {"p" => 13888077090689236035, "i" => 12292542568658633110},
    "ENST00000312049" => {"p" => 13888077090689236035, "i" => 15081914161581960811},
    "ENST00000450708" => {"p" => 6743057925785323865, "i" => 16379423582671150862},
    "ENST00000413626" => {"p" => 15724440847510920542, "i" => 18300401426631122607}
  }
  puts generate(input)
end
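The tie handling performed by sort_by_hash_value in the listing above can be illustrated in isolation. The following is a minimal sketch rather than part of the listing: the helper name rank_with_ties and the four-record sample hash are hypothetical, and the values are small placeholders, not real signature values. Records that share a value receive the same paired position (which advances by one per distinct value) and the same ranking position (the 1-based index of the first record holding that value).

```ruby
# Hypothetical helper mirroring the tie handling of sort_by_hash_value.
# Returns { id => [paired_position, ranking_position] }.
def rank_with_ties(values_by_id)
  sorted = values_by_id.sort_by { |_id, value| value }
  paired = 1
  rank = 1
  previous = nil
  result = {}
  sorted.each_with_index do |(id, value), index|
    if index > 0 && value != previous
      paired += 1        # one step per distinct value
      rank = index + 1   # 1-based index of the first record with this value
    end
    result[id] = [paired, rank]
    previous = value
  end
  result
end

# Placeholder data: "a" and "b" tie on the smallest value.
sample = { "a" => 10, "b" => 10, "c" => 30, "d" => 20 }
positions = rank_with_ties(sample)
positions.each { |id, (paired, rank)| puts "#{id}: paired=#{paired} rank=#{rank}" }
```

In this sketch the two tied records share paired position 1 and ranking position 1, while the next distinct value takes paired position 2 but ranking position 3, which is the distinction the listing relies on when regroup keys groups by paired position.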

Claims

I claim:

1. A computer-implemented method, implemented by hardware in combination with software, said method comprising:

(A) obtaining a representation of a nucleic acid sequence, wherein said nucleic acid sequence encodes a particular gene and said particular gene encodes a particular protein, said nucleic acid sequence comprising at least one intron;

(B) determining an intron signature value corresponding to said at least one intron, said intron signature value being based on a first computational function applied to one or more portions of said representation of said nucleic acid sequence corresponding to said at least one intron;

(C) determining a protein signature value corresponding to said particular protein, said protein signature value being based on a second computational function applied to a representation of said particular protein; and

(D) forming, in a database, an association between said intron signature value and said protein signature value.

2. The method of claim 1 further comprising:

(E) repeating acts (A) to (D) for each of a plurality of nucleic acid sequences.

3. The method of claim 1 further comprising:

(F) using said association formed in (D) to determine or confirm at least one aspect of a genetic function of a particular nucleic acid sequence.

4. The method of claim 2 further comprising:

5. A computer-implemented method, implemented by hardware in combination with software, said method comprising:

(A) obtaining a representation of a nucleic acid sequence, said nucleic acid sequence comprising multiple non-overlapping nucleic acid subsequences, each of said nucleic acid subsequences encoding an amino acid;

(B) determining a particular amino acid encoded by a particular nucleic acid subsequence of said multiple non-overlapping nucleic acid subsequences, wherein said particular nucleic acid subsequence comprises a first nucleotide, a second nucleotide adjacent to said first nucleotide, and a third nucleotide adjacent to said second nucleotide, by considering a nucleotide pair consisting of: (i) said first nucleotide and said second nucleotide, or (ii) said second nucleotide and said third nucleotide.

(A) obtaining a representation of said nucleic acid sequence, said nucleic acid sequence comprising multiple non-overlapping nucleic acid subsequences, each of said nucleic acid subsequences encoding an amino acid;

(B) determining a plurality of subsequences of said nucleic acid sequence, wherein said plurality of subsequences are determined by a binary counting of said nucleic acid sequence;

(C) determining a digital signature for said nucleic acid sequence based on a computational hash function applied to said plurality of subsequences of said nucleic acid sequence.

7. The method of claim 6 further comprising:

(D) using said digital signature for said nucleic acid sequence to determine or confirm at least one aspect of a genetic function of said nucleic acid sequence.

(A) obtaining a particular character string, said particular character string being representation of a first one or more portions of a particular nucleic acid sequence, wherein said particular nucleic acid sequence comprises a second one or more portions that encode a particular protein, said first one or more portions of said particular nucleic acid sequence being distinct from said second one or more portions of said particular nucleic acid sequence;

(B) determining a plurality of hash values of a corresponding plurality of substrings associated with said particular character string;

(C) determining a signature value for said first one or more portions of said particular nucleic acid sequence based on said first plurality of hash values;

(D) forming an association in a database between said signature value for said first one or more portions of said particular nucleic acid sequence and said particular protein; and

(E) using said association formed in (D) to determine or confirm at least one aspect of a genetic function of said particular nucleic acid sequence.

9. The method of claim 8 wherein said using in (E) also uses other associations in said database between other signature values of other character strings to determine or confirm said at least one aspect of a genetic function of said particular nucleic acid sequence.

10. A computer-implemented method, implemented by hardware in combination with software, said method comprising:

(A) determining a particular intron signature value corresponding to a particular at least one intron, said particular intron signature value being based on a first computational function applied to one or more portions of a representation of a particular nucleic acid sequence corresponding to said particular at least one intron;

(B) determining a particular protein signature value corresponding to a particular protein, said protein signature value being based on a second computational hash function applied to a representation of said particular protein;

(C) adding a record to a database, said record comprising a first field for said particular intron signature value and a second field for said particular protein signature value, wherein said database comprises a plurality of records corresponding to a plurality of nucleic acid sequences, each of said records comprising an intron signature value and a corresponding protein signature value, each of said intron signature values having been determined for and based on said first computational function applied to a portion of a corresponding nucleic acid sequence, said corresponding nucleic acid sequence encoding a particular gene, and each of said protein signature values having been determined based on a second computational hash function applied to a representation of a protein associated with said corresponding nucleic acid sequence.

11. The method of claim 10, further comprising:

(D) using information in said database to determine or confirm at least one aspect of a genetic function of said particular nucleic acid sequence.

(a) obtain a representation of a nucleic acid sequence, wherein said nucleic acid sequence encodes a particular gene and said particular gene encodes a particular protein, said nucleic acid sequence comprising at least one intron;

(b) determine an intron signature value corresponding to said at least one intron, said intron signature value being based on a first computational function applied to one or more portions of said representation of said nucleic acid sequence corresponding to said at least one intron;

(c) determine a protein signature value corresponding to said particular protein, said protein signature value being based on a second computational function applied to a representation of said particular protein; and

(d) form, in a database, an association between said intron signature value and said protein signature value.

13. The non-transitory computer-readable recording medium of claim 12, wherein the one or more programs, when executed, cause one or more processors to, at least:

(f) use said association formed in (d) to determine or confirm at least one aspect of genetic function of a particular nucleic acid sequence.

(a) obtain a representation of said nucleic acid sequence, said nucleic acid sequence comprising multiple non-overlapping nucleic acid subsequences, each of said nucleic acid subsequences encoding an amino acid;

(b) determine a plurality of subsequences of said nucleic acid sequence, wherein said plurality of subsequences are determined by a binary counting of said nucleic acid sequence;

(c) determine a digital signature for said nucleic acid sequence based on a computational hash function applied to said plurality of subsequences of said nucleic acid sequence.

15. The non-transitory computer-readable recording medium of claim 12, wherein the one or more programs, when executed, cause one or more processors to, at least:

(d) use said digital signature for said nucleic acid sequence to determine or confirm at least one aspect of a genetic function of said nucleic acid sequence.

(a) obtain a particular character string, said particular character string being representation of a first one or more portions of a particular nucleic acid sequence, wherein said particular nucleic acid sequence comprises a second one or more portions that encode a particular protein, said first one or more portions of said particular nucleic acid sequence being distinct from said second one or more portions of said particular nucleic acid sequence;

(b) determine a plurality of hash values of a corresponding plurality of substrings associated with said particular character string;

(c) determine a signature value for said first one or more portions of said particular nucleic acid sequence based on said first plurality of hash values;

(d) form an association in a database between said signature value for said first one or more portions of said particular nucleic acid sequence and said particular protein; and

(e) use said association formed in (d) to determine or confirm at least one aspect of genetic function of said particular nucleic acid sequence.

17. The non-transitory computer-readable recording medium of claim 16, wherein said using said association in (e) also uses other associations in said database between other signature values of other character strings to determine or confirm said at least one aspect of a genetic function of said particular nucleic acid sequence.

(a) determine a particular intron signature value corresponding to a particular at least one intron, said particular intron signature value being based on a first computational function applied to one or more portions of a representation of a particular nucleic acid sequence corresponding to said particular at least one intron;

(b) determine a particular protein signature value corresponding to a particular protein, said protein signature value being based on a second computational hash function applied to a representation of said particular protein;

(c) add a record to a database, said record comprising a first field for said particular intron signature value and a second field for said particular protein signature value,

wherein said database comprises a plurality of records corresponding to a plurality of nucleic acid sequences, each of said records comprising an intron signature value and a corresponding protein signature value, each of said intron signature values having been determined for and based on said first computational function applied to a portion of a corresponding nucleic acid sequence, said corresponding nucleic acid sequence encoding a particular gene, and each of said protein signature values having been determined based on a second computational hash function applied to a representation of a protein associated with said corresponding nucleic acid sequence.

19. The non-transitory computer-readable recording medium of claim 18, wherein the one or more programs, when executed, cause one or more processors to, at least:

(d) use information in said database to determine or confirm at least one aspect of a genetic function of said particular nucleic acid sequence.
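The record structure recited in the claims above, in which an intron signature value is stored alongside a protein signature value and records are ordered by intron signature, can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the specification: the claims do not name the computational functions, so SHA-256 truncated to 64 bits is used purely as a stand-in. The helper names signature_of and add_record, the in-memory array standing in for a database, and the toy sequences are all hypothetical.

```ruby
require "digest"

# Assumption: SHA-256 truncated to an unsigned 64-bit integer stands in for
# the unspecified "computational function"; the claims do not name one.
def signature_of(text)
  Digest::SHA256.digest(text).unpack1("Q>")
end

# Hypothetical in-memory "database": an array of records, each holding an
# intron signature value ("i") and a protein signature value ("p").
def add_record(database, stable_id, intron_text, protein_text)
  database << { "id" => stable_id,
                "i" => signature_of(intron_text),
                "p" => signature_of(protein_text) }
end

database = []
# Toy placeholder strings, not real transcripts.
add_record(database, "T1", "GTAAGTCCTAG", "MKV")
add_record(database, "T2", "GTAAGTCCTAG", "MKL")
add_record(database, "T3", "GTGAGTTTCAG", "MKV")

# Order the records by intron signature value.
ordered = database.sort_by { |record| record["i"] }
ordered.each { |record| puts "#{record["id"]} i=#{record["i"]} p=#{record["p"]}" }
```

Because identical input strings yield identical signature values, records T1 and T2 share an intron signature and sort next to each other, while T1 and T3 share a protein signature; adjacency of this kind is what a sort over the intron field exposes.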