WO2009024974A2 - Systems and methods for rational selection of context sequences and sequence templates - Google Patents

Systems and methods for rational selection of context sequences and sequence templates

Info

Publication number
WO2009024974A2
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
context
attributes
sequences
template
Prior art date
Application number
PCT/IL2008/001140
Other languages
English (en)
Inventor
Yoav Namir
Original Assignee
Yoav Namir
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yoav Namir
Priority to US12/733,256 (US20100153400A1)
Publication of WO2009024974A2
Priority to US13/764,894 (US9779205B2)
Priority to US15/677,234 (US20170351810A1)

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/10 Ontologies; Annotations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Definitions

  • the present invention relates to the analysis of polynucleotide sequence clusters, and in particular to the characterization of such sequences according to one or more parameters.
  • Clustering may be performed by a variety of methods.
  • Hierarchical clustering seeks to create, by steps of either mergers or divisions, a hierarchy of segments or clusters.
  • Agglomerative approaches build the hierarchy of clusters by steps of such mergers. Some approaches combine the above two [1].
  • The K-Means clustering algorithm is an example of such a clustering technique. It has been used in combination with other techniques, for example, for exploring protein structure [2]. It was also used to identify recurring local sequence motifs for proteins [3].
  • the present invention is directed to a computer implemented method for obtaining a repository of attributes sets, wherein attributes sets are statistically associated with a sequence template representing two or more context sequences, comprising:
  • the dataset of context sequences of step is further subjected to multiple sequence alignment.
  • the latter provides a solution in particular instances, for example, where the context sequences in the data set are of different lengths or where the context sequences in the data set were substantially affected by insertion/deletion regions.
  • the present invention is directed to a repository obtained by the computer implemented method for obtaining a repository of attributes sets as defined.
  • the present invention is directed to a computer implemented method for identifying a sequence template as statistically associated with an attributes set of interest, comprising:
  • the computer implemented method for identifying a sequence template as statistically associated with an attributes set of interest further comprises the step of merging at least two retrieved sequence templates.
  • the attributes are selected from: the Gene Ontology Project (GO), Interpro annotation (European Molecular Biology Laboratory, EMBL), SMART (a Simple Modular Architecture Research Tool, found at http://smart.embl.de/), UniProt Knowledgebase (SwissProt), OMIM (by NCBI), PROSITE (by the Swiss Institute of Bioinformatics), Protein Information Resource (PIR), GeneCards, and Kyoto Encyclopedia of Genes and Genomes (KEGG).
  • the present invention is directed to a computer memory system comprising a plurality of tree topologies representing a plurality of (k) heaps, wherein the plurality of tree topologies is managed through a common interface; and (k > 1).
  • the heaps are min heaps. In another embodiment, the heaps are max heaps.
  • an active subset of heaps is held in Random Access Memory (RAM), while the rest of said heaps are maintained on a secondary storage.
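The memory system summarized above, (k) heaps behind one common interface, can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the class name `MultiHeap`, its method names, and the use of the standard `heapq` module are hypothetical choices, not the patent's implementation, and the split between RAM and secondary storage is not modeled.

```python
import heapq
import itertools
from typing import Any, List, Tuple


class MultiHeap:
    """Hypothetical sketch: k independent min heaps behind one interface."""

    def __init__(self, k: int) -> None:
        if k <= 1:
            raise ValueError("k must be greater than 1")
        self._heaps: List[List[Tuple[float, int, Any]]] = [[] for _ in range(k)]
        # Insertion counter used as a tie-breaker so items are never compared.
        self._counter = itertools.count()

    def push(self, heap_index: int, key: float, item: Any) -> None:
        """Insert an item into one specific heap."""
        heapq.heappush(self._heaps[heap_index], (key, next(self._counter), item))

    def _index_of_top(self) -> int:
        """Index of the heap whose root holds the minimal key overall."""
        candidates = [i for i, heap in enumerate(self._heaps) if heap]
        if not candidates:
            raise IndexError("all heaps are empty")
        return min(candidates, key=lambda i: self._heaps[i][0][:2])

    def top(self) -> Tuple[float, Any]:
        """Return (key, item) with the minimal key among all k heaps."""
        key, _, item = self._heaps[self._index_of_top()][0]
        return key, item

    def pop(self) -> Tuple[float, Any]:
        """Remove and return (key, item) with the minimal key among all k heaps."""
        key, _, item = heapq.heappop(self._heaps[self._index_of_top()])
        return key, item

    def __len__(self) -> int:
        return sum(len(heap) for heap in self._heaps)
```

In this sketch finding the top item costs O(k) because every heap root is inspected; a max-heap variant could be obtained by negating the keys.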
  • the invention is directed to a computer implemented method for clustering a plurality of polynucleotide sequences, comprising: determining an attributes set for the plurality of polynucleotide sequences; and clustering the polynucleotide sequences into a plurality of clusters according to values of said attributes set.
  • the invention is further directed to a method of preparing a polynucleotide construct, comprising:
  • the preparing step comprises synthesizing said context sequence. In another embodiment, the preparing step comprises the preparing of an expression vector comprising said context sequence. In another embodiment, the preparing step comprises the preparing of a probe comprising said context sequence.
  • the present invention is directed to a computerized system configured for identifying a sequence template as statistically associated with an attributes set of interest, the computerized system comprising: a context sequence clustering module, configured to cluster said sequences into a plurality of clusters; and an enrichment analysis module, configured to provide enrichment appraisal, wherein the context sequence clustering module is communicatively coupled to the enrichment analysis module.
  • any device featuring a data processor and/or the ability to execute one or more instructions may be described as a computer, including but not limited to a PC (personal computer), or a server. Any two or more of such devices in communication with each other, and/or any computer in communication with any other computer may optionally comprise a "computer network".
  • FIG. 1 illustrates, in accordance with one embodiment of the present invention, an exemplary computerized system on which the present invention may be implemented.
  • FIG. 2a illustrates, in accordance with one embodiment of the present invention, an exemplary user interface for obtaining a requested function array from a user.
  • FIG. 2b illustrates, in accordance with one embodiment of the present invention, an exemplary user interface for obtaining a function array or attributes set of interest which is optionally provided by a user.
  • FIG. 3 illustrates, in accordance with one embodiment of the present invention, an exemplary user interface for proposing the predicted context sequences for synthesis.
  • FIG. 4 illustrates, in accordance with one embodiment of the present invention, an exemplary viewer application reproducing a context sequence, the cellular function annotations and the size of the context sequence cluster.
  • FIG. 5 illustrates, in accordance with one embodiment of the present invention, an exemplary data structure of a function attribute array or the cellular function annotations array.
  • FIG. 6 illustrates, in accordance with one embodiment of the present invention, a simplified example of ascertaining the processing order of templates, (a) and (b) are two clusters of templates having equal minimum distances to a common template.
  • FIG. 7 illustrates, in accordance with one embodiment of the present invention, a simplified example of ascertaining the processing order of clusters.
  • (a') is a new cluster representing a merger of closest neighbors of (a) which was shown in Fig 6, and (b) is to be handled subsequently.
  • FIG. 8 illustrates, in accordance with one embodiment of the present invention, a multiple-tree-array topology within a memory module.
  • the top item is defined as the element having the minimal key value amongst the (k) specific min heaps as shown.
  • FIG. 8a illustrates, in accordance with one embodiment of the present invention, a multiple-tree-array topology within a memory module.
  • the top item is defined as the element having the maximal key value amongst the (k) specific heaps as shown.
  • the present invention, in some embodiments, is of a system and method for analyzing a plurality of nucleotide or other sequences. In other embodiments, the present invention relates to a system and method which provide more efficient memory structures and computational processes. The latter system and method may optionally be used with the former embodiments or may optionally be used independently.
  • Section I relates to the system of the present invention
  • Section II relates to embodiments for obtaining a repository of attributes sets statistically associated with context sequences and/or a sequence template representing them
  • Section III relates to embodiments which provide more efficient memory structures and computational processes
  • Section IV details embodiments of a computer implemented method for identifying a sequence template as statistically associated with an attributes set of interest
  • Section V relates to experimental examples using such embodiments
  • "function attribute" and "attribute" of a given gene shall mean an attribute, term, characterization, molecular function annotation, or biological process annotation describing a gene or a gene product.
  • the terms can be used interchangeably and synonymously herein.
  • the cellular function annotations are typically reported in a variety of sources such as, but not limited to, the Gene Ontology Project (GO), Interpro annotation (European Molecular Biology Laboratory, EMBL), SMART (a Simple Modular Architecture Research Tool, found at http://smart.embl.de/), UniProt Knowledgebase (SwissProt), OMIM (by NCBI), PROSITE (by the Swiss Institute of Bioinformatics), Protein Information Resource (PIR), GeneCards, and Kyoto Encyclopedia of Genes and Genomes (KEGG).
  • the term or attribute "cell adhesion" associated with Homo sapiens discoidin domain receptor tyrosine kinase 1 is a cellular function annotation. This attribute is found, for example, in Gene Ontology under GO:0007155.
  • "complete function attributes set" and "complete attributes set" shall mean the complete set of function attributes, i.e. all function attributes stored in a repository of the present invention.
  • the terms can be used interchangeably and synonymously herein.
  • "function attributes set" shall mean a subset of the complete function attributes set.
  • the terms can be used interchangeably and synonymously herein.
  • the function attributes array can be used to represent a specific user selection in which the user manifests particular function attributes of interest. The user can typically select an attributes set in order to perform the computer implemented method of the present invention for identifying a sequence template which is statistically associated with the attributes set of interest.
  • the attributes set can be used to represent an attributes set which is statistically associated with a sequence template.
  • the latter can be identified in a functional appraisal performed by the methods and system of the present invention. The latter is typically performed with respect to a cluster of context sequences or attributes associated with a gene operably linked to the context sequences of the cluster. The results of the functional appraisal performed can thus be represented by an attributes set.
  • the attributes set optionally features an array of real numbers, with each of said numbers representing a level of association of a particular annotation or attribute. It can also feature an array of binary digits, where each of said binary digits represents association with a particular annotation or attribute. In this case, '0' can represent the absence of association of a particular function attribute and '1' can indicate statistical association of the particular function attribute.
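The two array encodings described above can be sketched in Python as follows. This is a minimal illustration: the list `COMPLETE_ATTRIBUTES`, its ordering, and both function names are assumptions made for the example, and a real repository would hold far more attributes.

```python
from typing import Dict, Iterable, List

# Hypothetical complete attributes set; the ordering fixes each array position.
COMPLETE_ATTRIBUTES: List[str] = [
    "cell adhesion",
    "immunoglobulin",
    "transcription regulation",
]


def to_binary_array(selected: Iterable[str],
                    complete: List[str] = COMPLETE_ATTRIBUTES) -> List[int]:
    """Encode an attributes set as binary digits: 1 = associated, 0 = absent."""
    chosen = set(selected)
    unknown = chosen - set(complete)
    if unknown:
        raise ValueError(f"attributes not in the complete set: {sorted(unknown)}")
    return [1 if name in chosen else 0 for name in complete]


def to_real_array(levels: Dict[str, float],
                  complete: List[str] = COMPLETE_ATTRIBUTES,
                  default: float = 0.0) -> List[float]:
    """Encode an attributes set as real numbers giving a level of association."""
    return [float(levels.get(name, default)) for name in complete]
```

For instance, a user selection of "immunoglobulin" and "transcription regulation" maps to the binary array `[0, 1, 1]` under the ordering assumed here.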
  • "sequence" shall mean a polynucleotide sequence, continuous or otherwise, of nucleotides selected from a group consisting of deoxyribonucleotides (DNA) and ribonucleotides (RNA).
  • Sequence therefore does not encompass gene order in general or genomic meta structures.
  • Context sequence shall mean a sequence which regulates or affects a gene product (mRNA, polypeptide and the like). Context sequences consist of at least a portion of an untranslated sequence.
  • a context sequence may comprise a sequence which is operably linked to a coding region, a sequence affecting the expression level of a gene product, or otherwise a sequence regulating a gene product (or its activity). Therefore, a context sequence may comprise a stretch of nucleotides preceding the translation initiation codon of an mRNA molecule.
  • a context sequence may comprise a stretch of nucleotides downstream of the translation termination codon of an mRNA molecule. In the above examples the context sequence was defined by its location relative to a coding region.
  • a context sequence of the present invention may further comprise a promoter, enhancer, inhibitor or other regulatory region.
  • "template" or "sequence template" shall include a matrix $T_{4 \times l}$, where (l) denotes the length of the context sequences or aligned context sequences which are represented by the template.
  • the template can represent the distribution of each nucleotide for each position along a context sequence.
  • the template can further include a matrix (T) where T[a,i] holds the distribution of nucleotide (a) at position (i) in the context sequences represented.
  • context sequence transformation into a template can typically be performed as an integral part of matrix allocation.
  • sequence 'AG' is represented by a template having the following distribution matrix:
  • a sequence template can represent a cluster of context sequences and the distribution matrix will thus reflect the distribution of nucleotides which characterizes the context sequences within the cluster.
  • the sequence template can typically further comprise a set of gene names or unique identifiers which are operably linked or affected by the context sequences represented thereby.
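The transformation of context sequences into a template matrix T[a,i] can be sketched in Python as below. The row order A, C, G, T and the function names are assumptions made for illustration; the text does not fix a row order. For the sequence 'AG' the sketch yields a 4 x 2 matrix with 1.0 in row A at position 0 and 1.0 in row G at position 1.

```python
from typing import List, Sequence

NUCLEOTIDES = "ACGT"  # assumed row order of the template matrix


def sequence_to_template(seq: str) -> List[List[float]]:
    """Turn one context sequence into a 4 x l distribution matrix."""
    seq = seq.upper()
    template = [[0.0] * len(seq) for _ in NUCLEOTIDES]
    for i, base in enumerate(seq):
        # str.index raises ValueError for characters outside NUCLEOTIDES.
        template[NUCLEOTIDES.index(base)][i] = 1.0
    return template


def cluster_template(seqs: Sequence[str]) -> List[List[float]]:
    """Distribution of each nucleotide at each position over a cluster
    of equal-length (for example, aligned) context sequences."""
    if not seqs:
        raise ValueError("at least one sequence is required")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must share one length")
    counts = [[0] * length for _ in NUCLEOTIDES]
    for s in seqs:
        for i, base in enumerate(s.upper()):
            counts[NUCLEOTIDES.index(base)][i] += 1
    return [[c / len(seqs) for c in row] for row in counts]
```

Gap characters produced by a multiple sequence alignment are not handled here; a fuller version would need an explicit policy for them.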
  • repository and “database” shall mean a database or any system configured for insertion and retrieval of information of the present invention.
  • the terms can be used interchangeably and synonymously herein.
  • the repository of the present invention is typically configured for insertion and retrieval of attributes, attributes sets, and context sequences. The latter are typically in the form of a sequence of ASCII characters.
  • the repository of the present invention can also be configured for insertion and retrieval of sequence templates which can typically comprise an array of numbers, or a 2D matrix of numbers.
  • a repository of the present invention is typically configured to insert and retrieve pointers or associations between information elements stored therein.
  • the repository of the present invention can insert and retrieve an attributes set and a sequence template, and associate them with one another, so as to enable retrieval of a sequence template together with at least one respective attributes set.
  • it can be configured to enable retrieval of an attributes set together with at least one respective sequence template.
  • "multiple sequence alignment" shall have the ordinary meaning as used by the person skilled in the art of bioinformatics.
  • CLUSTAL W is a typical software package used for that purpose, and can be run with default values or with other values adapted to the particular dataset at hand.
  • "synthetic context sequence" or "predicted context sequence" shall mean at least one context sequence or sequence template representing said context sequence that was identified by the systems and methods of the present invention, as statistically associated with an attributes set of interest.
  • Embodiments of the invention can be used in a general purpose computer system suitably adapted and designed for performing the extensive context sequences clustering, enrichment analysis and comparison.
  • FIG. 1 illustrates, in accordance with one embodiment of the present invention, an exemplary system on which the present invention may be implemented.
  • the computerized system 100 permits clients or users to provide an attributes set of interest for analysis 135.
  • the attributes set can consist of two or more attributes of interest.
  • the clients or users can further provide a dataset of context sequences 105 as input information; thereby obtaining a dataset of context sequences for analysis.
  • the context sequences can typically further comprise a set of gene names or unique identifiers which are operably linked or affected by the context sequences, respectively.
  • the attributes set of interest 135 and context sequences 105 can be entered via a user interface specifically configured for that purpose. Where the system 100 is implemented on a computer network, the attributes set 135 and context sequences 105 can be provided through a browser application, such as, but not limited to, a web browsing application. Alternatively, the attributes set 135 or the context sequences 105 can be comprised in a file. The file can be uploaded to the system 100 through either a network or other information uploading methods known in the art for that purpose.
  • the context sequence clustering module 110 clusters the context sequences as described hereinafter.
  • the dataset of context sequences 105 comprises a huge amount of sequence information.
  • each context sequence is transformed into a sequence template.
  • Clustering of the dataset of sequence templates is performed and results in a plurality of clusters.
  • each gene cluster or the genes which are regulated or affected by the context sequences within the cluster is subjected to functional appraisal.
  • the result of the functional appraisals is a plurality of clusters each statistically associated with their respective attributes set.
  • the system and method of the present invention enables obtaining of heterogeneous clusters, as defined below.
  • the clustering procedures of the present invention are, inter alia, utilized in order to obtain a repository of attributes sets, statistically associated with a sequence template.
  • the sequence template represents two or more context sequences. The latter may not be identical. Therefore, the clustering procedures of the present invention enable obtaining a heterogeneous repository, as defined hereinafter.
  • the clustering procedures of the present invention can use a 2-dimensional distance matrix to store and retrieve distance related information.
  • distance related information is typically stored in and retrieved from a computer memory system comprising a plurality of heaps 130, or heap data structures.
  • Data items which are stored and retrieved in the computer memory system 130 of the present invention typically comprise references pointing at two matrices or templates and a real number.
  • Each of said templates represents a cluster of context sequences and the real number measures the distance between the clusters.
  • data items may further comprise information such as, but not limited to, gene names or unique identifiers of genes which were classified within the clusters.
  • a template can further comprise information such as gene names or unique identifiers of genes which were classified within the cluster which is represented by the template.
  • Clustering of the present invention is typically performed by the clustering module 110.
  • the context sequence clustering module 110 stores and retrieves data items from the computer memory system (or memory module) 130.
  • the structure of the computer memory system is described below.
  • the memory system is based on a plurality of heap data structures which were restructured and remodeled, as described below, to improve performance, especially where large data sets are at hand.
  • the memory system shall also be referred to as a "multiple-tree-array", the particulars of which are described below.
  • the latter typically comprises min heaps and adheres to the invariant according to which the top data item in the multiple-tree-array is a data item referencing a pair of templates having a minimal distance between them.
  • the multiple-tree-array allows the system 100 to perform the clustering of the context sequences and enrichment analysis in an extremely efficient manner, reducing the complexity by about one order in comparison to typical 2-dimensional distance matrices.
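To show how a min heap of data items (two template references plus a distance) can drive clustering, here is a simplified Python sketch. It is not the multiple-tree-array itself: it uses a single `heapq` heap with lazy deletion of stale entries, and the Euclidean distance, the size-weighted merge, and the `max_distance` stopping rule are all assumptions introduced for illustration.

```python
import heapq
import itertools
import math
from typing import Dict, List, Tuple

Template = List[List[float]]


def distance(t1: Template, t2: Template) -> float:
    """Assumed metric: Euclidean distance over all matrix entries."""
    return math.sqrt(sum((x - y) ** 2
                         for r1, r2 in zip(t1, t2)
                         for x, y in zip(r1, r2)))


def merge(t1: Template, n1: int, t2: Template, n2: int) -> Template:
    """Assumed merge rule: average of two templates weighted by cluster size."""
    total = n1 + n2
    return [[(x * n1 + y * n2) / total for x, y in zip(r1, r2)]
            for r1, r2 in zip(t1, t2)]


def agglomerate(templates: List[Template],
                max_distance: float) -> List[Tuple[Template, int]]:
    """Repeatedly merge the closest pair until the minimal distance
    exceeds max_distance; returns (template, cluster size) pairs."""
    clusters: Dict[int, Tuple[Template, int]] = {
        i: (t, 1) for i, t in enumerate(templates)}
    next_id = len(templates)
    # Each data item: (distance, id of first cluster, id of second cluster).
    heap = [(distance(clusters[i][0], clusters[j][0]), i, j)
            for i, j in itertools.combinations(clusters, 2)]
    heapq.heapify(heap)
    while heap:
        d, i, j = heapq.heappop(heap)
        if i not in clusters or j not in clusters:
            continue  # stale item: one of the clusters was already merged
        if d > max_distance:
            break  # the closest remaining pair is too far apart
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        merged = merge(ti, ni, tj, nj)
        for other, (t_other, _) in clusters.items():
            heapq.heappush(heap, (distance(merged, t_other), other, next_id))
        clusters[next_id] = (merged, ni + nj)
        next_id += 1
    return list(clusters.values())
```

The heap keeps the closest pair at the top, which mirrors the invariant stated above; stale pairs are simply skipped when popped rather than removed eagerly.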
  • the enrichment analysis module 120 performs enrichment appraisals or functional appraisals as described below.
  • the context sequence clustering module 110 sends a request to the enrichment analysis module 120.
  • the request comprises a data set of context sequences or unique identifiers representing the context sequences within a cluster or unique identifiers of genes regulated or otherwise affected by context sequences.
  • the request is typically channeled through a communication port, a bus, or a computer network 115 to the enrichment analysis module 120.
  • clusters of context sequences together with their respective enrichment appraisals can be stored in or retrieved from a repository or database 125.
  • the results of enrichment appraisals are represented by an attributes set or function attribute array being associated with respective cluster or clusters.
  • the function array comparator 140 is adapted to compare the attributes set of interest (typically provided by a user), with said stored enrichment appraisals retrieved from the repository 125.
  • FIG. 2a and FIG. 2b illustrate, in accordance with one embodiment of the present invention, an exemplary user interface 200 for obtaining an attributes set of interest from a user.
  • an attributes set is obtained from a client over the network (not shown).
  • the client may be local or remote, and may be either a human or an automated procedure performed on a computer system.
  • the user selects an attributes set from a list of function attributes 210.
  • the list of function attributes contains at least a subset of a complete function attributes set.
  • the user selects the function attributes of interest in order to retrieve a sequence template statistically associated with the selection.
  • the sequence template retrieved can be used in order to design a context sequence for the purpose of either synthesis or manufacture of polynucleotide construct, or vector.
  • the context sequence designed comprises the most dominant nucleotide in each position along the sequence template retrieved. In another embodiment, the context sequence designed comprises 80%-85%, 85%-90%, or 90%-100% homology with the sequence template or the sequence comprising the most dominant nucleotide in each position along the sequence template.
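Designing a context sequence from the most dominant nucleotide at each position can be sketched in Python as follows. The row order A, C, G, T, the function names, and the use of simple position-wise percent identity as a stand-in for "homology" are assumptions for illustration only.

```python
from typing import List


def consensus(template: List[List[float]], alphabet: str = "ACGT") -> str:
    """Pick the most dominant nucleotide at each position of a template.

    Ties are resolved in favor of the earlier row; this is an arbitrary
    choice made for the sketch.
    """
    length = len(template[0])
    picked = []
    for i in range(length):
        best_row = max(range(len(alphabet)), key=lambda a: template[a][i])
        picked.append(alphabet[best_row])
    return "".join(picked)


def percent_identity(a: str, b: str) -> float:
    """Share of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)
```

A candidate sequence could then be checked against the consensus with `percent_identity` to see which of the stated homology bands it falls into.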
  • the subset of function attributes selected by the user can be represented by a function attribute array or the attributes set.
  • the manual selection can be performed with checkboxes 215 which indicate whether a particular function attribute was selected.
  • page scroller 205 can provide means for navigating through the entire list of function attributes.
  • the list of function attributes can be organized by several techniques, such as but not limited to, lexicographical order, classification, or source of the function attribute.
  • the user interface includes textboxes 220 in which a user enters the importance degree or confidence level associated with a particular function attribute.
  • the system of the present invention is adapted to retrieve an enrichment appraisal previously stored in the repository.
  • the enrichment appraisal typically shares a similarity with an attributes set of interest.
  • the system of the present invention is adapted to retrieve an enrichment appraisal which shares similarity with an attributes set of interest at a predetermined threshold.
  • the system of the present invention is adapted to retrieve an enrichment appraisal which shares maximal similarity with an attributes set of interest.
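The two retrieval modes above (similarity at a predetermined threshold, or maximal similarity) can be sketched in Python. The similarity measure of the invention is not given in this passage, so the sketch substitutes the Jaccard index purely as an assumed placeholder; the repository layout (a mapping from a template identifier to its stored attributes set) and all names are likewise hypothetical.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Placeholder similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def retrieve_above(repository: Dict[str, Set[str]], query: Set[str],
                   threshold: float) -> List[Tuple[str, float]]:
    """Stored appraisals whose similarity to the query meets the threshold,
    most similar first."""
    scored = [(tid, jaccard(query, attrs)) for tid, attrs in repository.items()]
    hits = [(tid, s) for tid, s in scored if s >= threshold]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


def retrieve_most_similar(repository: Dict[str, Set[str]],
                          query: Set[str]) -> List[Tuple[str, float]]:
    """Stored appraisals sharing maximal similarity with the query."""
    scored = [(tid, jaccard(query, attrs)) for tid, attrs in repository.items()]
    if not scored:
        return []
    best = max(s for _, s in scored)
    return sorted((tid, s) for tid, s in scored if s == best)
```

Swapping in a different similarity function only requires replacing `jaccard`; the two retrieval functions do not depend on how the score is computed.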
  • the output 150 comprises a cluster of context sequences or a sequence template representation thereof, which is statistically associated with said retrieved enrichment appraisal(s).
  • FIG. 3 illustrates, in accordance with one embodiment of the present invention, an exemplary user interface 300 providing an identified context sequence 310 or a sequence template representing a cluster of context sequences statistically associated with the attributes set of interest.
  • the identified context sequence or the sequence template consists of those which are statistically associated with stored attributes sets sharing maximal similarity with the attributes set of interest. Similarity or similarity degree is determined by the method described below.
  • the user interface 300 includes a textbox, label, or information box 320.
  • Each context sequence 310 or sequence template (not shown) can be associated with textbox, label, or information box 320.
  • the textbox, label, or information box may include statistical confidence level of the context sequence such as p_value or a false discovery rate (FDR) or other enrichment estimator.
  • the page scroller 305 can provide means for navigating through the entire list of context sequences where, for example, the predicted context sequences exceed the window size of the user interface 300.
  • FIG. 4 illustrates, in accordance with one embodiment of the present invention, another exemplary user interface 400 consisting of a sequence template 410 representing a cluster of context sequences, the statistically associated attributes set 420 and the size of the cluster 430.
  • the distribution table 415 can comprise a matrix representing the probability of a given nucleotide at a particular position along the context sequences of the current cluster viewed. Each column can represent a position along a predicted context sequence. The most dominant nucleotide at a particular position along the identified context sequences can appear at the top of the respective column 410. Where two nucleotides share similar or identical dominance levels, both can appear at the top of the respective column 425.
  • the user interface 400 is utilized for viewing the clustered context sequences 410 comprising polypeptide sequences.
  • the distribution table 415 can comprise a matrix representing the probability of a given amino acid at a particular position along the predicted context sequence. Each column can represent a position along a predicted context sequence. The most dominant amino acid at a particular position along the predicted context sequence can appear at the top of the respective column 410, while two amino acids sharing similar or identical dominance levels both can appear at the top of the respective column 425.
  • FIG. 5 illustrates, in accordance with one embodiment of the present invention, an exemplary data structure of a function attribute array or the attributes set 500.
  • the attributes set 500 typically features a matrix of cells or items 510. Each of the cells in the matrix can comprise several fields or objects.
  • the first field or object is a function name/attribute 520 and the second is a value 530 associated therewith.
  • the value 530 may optionally represent a Boolean variable.
  • where a Boolean variable in a cell holds #true, for example 530, the attributes set includes the particular attribute 520.
  • where a Boolean variable in a cell holds #false, the attributes set does not include the particular attribute.
  • value 530 can be represented as a Real variable which represents the statistical confidence level of the particular attribute.
  • where value 530 holds "1.0E-17", the function attribute array is highly likely to include a particular attribute 520.
  • where value 530 holds "1.0", the function attribute array most likely does not include the particular attribute 520.
  • the data structure of the attributes set can be varied almost indefinitely. Many other data structures can be employed for storing a subset of attributes.
  • the attributes set may optionally be stored as a Dictionary or hash table. Other non-limiting examples: an array of pairs <string, boolean>, or indeed a 2D matrix where one dimension is the function attribute and the other dimension is a value.
  • the attributes set of interest 135 can be represented by the function attribute array 500.
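  • As a minimal sketch of the dictionary form mentioned above, assuming Python, the attributes set can be held as a mapping from attribute name to value. The attribute names, the variable names and the helper `included_attributes` are hypothetical and serve only as illustration; the 0.05 cut-off is one of the P value limits named elsewhere in this description.

```python
# Illustrative only: attribute names and helper names are hypothetical.

# Boolean form: True means the attributes set includes the attribute.
boolean_attributes = {
    "immunoglobulin": True,
    "transcription regulation": True,
    "cell cycle": False,
}

# Real-valued form: each value is a statistical confidence level (e.g. a P value).
# A small value such as 1.0e-17 suggests inclusion, while 1.0 suggests exclusion.
confidence_attributes = {
    "immunoglobulin": 1.0e-17,
    "transcription regulation": 3.0e-4,
    "cell cycle": 1.0,
}


def included_attributes(attributes, p_cutoff=0.05):
    """Return the sorted names of attributes regarded as included in the set."""
    names = []
    for name, value in attributes.items():
        if isinstance(value, bool):
            if value:
                names.append(name)
        elif value <= p_cutoff:
            names.append(name)
    return sorted(names)
```

  • Both forms yield the same list of included attributes here, which is why either can serve as the value field 530 of a cell.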
  • the user may seek to identify one or more sequence templates associated with an attributes set of interest. Assume that the user wishes to consider immunoglobulin and transcription regulation with respect to humans.
  • the user selection of interest is transformed into an attributes set 135 which is typically represented by the function attribute array 500.
  • the function array comparator 140 compares the attributes set received comprising the user selection with said stored enrichment appraisals. The latter are retrieved from the repository 125, with respect to humans.
  • the user can request retrieval of a stored sequence template which is statistically associated with the specific function attributes chosen by the user.
  • the user can retrieve the context sequences which were clustered together, and represented by the template. The user may find it advantageous to design or synthesize polynucleotide or polypeptide sequences on the basis of their functional association.
  • the system and methods of the present invention can thus be used in preparing a polynucleotide construct, comprising: identifying a sequence template as statistically associated with an attributes set of interest by a user or client; and preparing a polynucleotide construct having at least one portion operably linked to a context sequence; wherein said context sequence is characterized as having either 80%-85%, 85%-90%, or 90%- 100% homology with said sequence template.
  • the user may wish to synthesize said context sequence, by utilizing any synthesis method known in the art for that purpose.
  • the user may construct an expression vector comprising said context sequence or prepare a probe comprising the identified context sequence.
  • Homology in the range of X%-Y% shall be defined as identity score in the percentage range of X%-Y%. Said identity score is typically provided by an alignment analysis program.
  • the alignment analysis can be performed using numerous commercial sequence analysis packages, such as, but not limited to, WATER (Smith-Waterman local alignment) provided by EMBOSS (European Molecular Biology Open Software Suite), operated with either default values or open gap penalty: 11, extended gap penalty: 0.5, and the default EDNAFULL or BLOSUM62 similarity matrix. Therefore, homology in the range of 80%-100%, as an example, shall mean an identity score which ranges between 80%-100% using WATER according to the parameters set above.
  • the system described above is further adapted to execute the methods described hereinafter.
  • the method for obtaining a repository of attributes sets, wherein attributes sets are statistically associated with a sequence template representing two or more context sequences and the method for identifying a sequence template as statistically associated with an attributes set of interest.
  • the K-Means algorithm and its derivatives require the initial input of k-criterion from the user.
  • the initial input of k-criterion is simply not known.
  • the results of the K-Means algorithm are extremely sensitive to the initial random selection of cluster representatives. It was recently demonstrated that the worst-case running time of K-Means is super-polynomial, i.e. $2^{\Omega(\sqrt{n})}$ [16].
  • the present invention utilizes a different computer implemented method (hereinafter: "LBDL (Lower Bound Distance Limit) clustering method").
  • the LBDL is preferably used for large datasets e.g. N> 16000 context sequences, and/or where no prior information relating to the suitable number of clusters is available i.e. k is unknown. While LBDL is preferred over K-means for example, the present invention is not limited to a particular clustering algorithm and may in fact optionally be implemented with any type of clustering.
  • the LBDL clustering method of the present invention does not require k-criterion at all. Instead, it requires a lower bound distance limit (LBDL) between clusters, as detailed below.
  • lower bound distance limit shall mean a predetermined real number representing the lower bound distance limit.
  • lower bound distance limit invariant shall mean the following invariant (hereafter: the LBDL-invariant): during the execution of the computer implemented LBDL clustering method, clusters will not merge where the distance between them is greater than a given distance limit.
  • data item shall mean a data item in a memory structure comprising (i) representing a first template, (j) representing a second template, and d(i,j) representing the distance between the templates.
  • data item can be presented by other means such as, but not limited to, other data items, or differently ordered data items, all of which essentially hold the template information and distance information relating thereto.
  • each sequence under analysis is transformed to an information node or, as exemplified below, a sequence template.
  • the algorithm efficiently performs merger operations, until satisfaction of the LBDL criteria.
  • Each unraveled cluster is in turn subjected to statistical functional appraisal.
  • Each sequence template representing a cluster of context sequences is stored together with the associated results of the functional appraisal in a repository.
  • For the purposes of the present application, '//' shall mean a comment or remark.
  • 1. for each context sequence in the dataset, allocate a template // representing the distribution of nucleotides along the sequence.
  • This step is an initialization step in which each context sequence is respectively represented by a template.
  • a particular embodiment of template representation is detailed below.
  • step 3 represents an abstract data structure or data item typically comprising 3 numbers, two of which are identifying a pair of templates and the third is a distance measurement between them.
  • Retrieval of the minimal data item is typically performed by executing the DeleteMin() procedure on a multiple-tree-array data structure.
  • Multiple-tree-arrays are defined below, and by definition the minimal data item is the item having the minimum distance held therein, i.e. the data item represents a pair of templates sharing the highest similarity.
  • CurMin stores all items which were retrieved from the multiple-tree-array and have the same distance.
  • HandleCurrentTemplates() is a procedure which is defined below; in essence, this procedure handles the merger operation(s) of the currently handled cluster(s).
  • each cluster of context sequences which is represented by a sequence template is subjected to a functional appraisal. This is typically performed by first retrieving the names or unique identifiers of genes regulated or affected by the context sequences within a cluster; and secondly, executing functional appraisal on the names or unique identifiers retrieved.
  • each sequence template or cluster of context sequences is stored in a repository together with the associated functional appraisal result.
  • the functional appraisal result can be stored or represented as an attributes set or a list.
  • the associated functional appraisal is represented by the function attributes array 500.
  • the method therefore obtains a repository of attributes sets, where the attributes set is statistically associated with a sequence template or cluster of context sequences represented thereby.
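  • The overall flow described above (allocate one template per context sequence, repeatedly retrieve the closest pair, merge, and stop at the lower bound distance limit) can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed method: the distance (summed absolute difference of nucleotide frequencies, which ranges from 0 to 2l) and the merger rule (size-weighted average) are assumptions, a single Python `heapq` heap stands in for the multiple-tree-array, and the order-invariant handling of equal distances via CurMin List is omitted. All function names are hypothetical.

```python
import heapq
from itertools import combinations

ALPHABET = "ATGC"


def to_template(seq):
    """One column per position holding the frequency of each nucleotide."""
    return [[1.0 if base == a else 0.0 for a in ALPHABET] for base in seq]


def distance(t1, t2):
    """Assumed distance: summed absolute difference of columns (0 to 2l)."""
    return sum(abs(x - y) for c1, c2 in zip(t1, t2) for x, y in zip(c1, c2))


def merge(t1, n1, t2, n2):
    """Assumed merger rule: size-weighted average of the two distributions."""
    total = n1 + n2
    return [[(x * n1 + y * n2) / total for x, y in zip(c1, c2)]
            for c1, c2 in zip(t1, t2)]


def lbdl_cluster(sequences, lower_bound_distance_limit):
    templates = {i: to_template(s) for i, s in enumerate(sequences)}
    members = {i: [s] for i, s in enumerate(sequences)}
    # Data items are stored as (d(i,j), i, j) so that the distance is the key.
    heap = [(distance(templates[i], templates[j]), i, j)
            for i, j in combinations(templates, 2)]
    heapq.heapify(heap)
    next_id = len(sequences)
    while heap:
        d, i, j = heapq.heappop(heap)
        if d > lower_bound_distance_limit:
            break  # LBDL-invariant: never merge beyond the limit
        if i not in templates or j not in templates:
            continue  # stale item referring to an already merged template
        merged = merge(templates[i], len(members[i]),
                       templates[j], len(members[j]))
        merged_members = members.pop(i) + members.pop(j)
        del templates[i], templates[j]
        for k, t in templates.items():
            heapq.heappush(heap, (distance(t, merged), k, next_id))
        templates[next_id] = merged
        members[next_id] = merged_members
        next_id += 1
    return list(members.values())
```

  • For example, `lbdl_cluster(["ACGT", "ACGA", "TTTT", "TTTA"], 0.55 * 2 * 4)` groups the first two and the last two sequences, since each pair differs at a single position while the two groups differ at three or more. Each resulting cluster would then be passed to the functional appraisal step.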
  • a sequence template represents a cluster of two or more context sequences. The latter may be either identical context sequences or, typically, context sequences consisting of different sequences.
  • the attributes set associated with a cluster of context sequence(s) can consist of two or more attributes.
  • a given cluster may also be associated with a particular attribute even where at least one of the context sequence (or gene affected thereby) is not characterized by the attribute.
  • a cluster may be deemed as statistically associated with an attribute by functional appraisal even where a specific context sequence within the cluster is not particularly characterized by that attribute. Therefore, a cluster in the present invention may be deemed a heterogeneous cluster.
  • "homogeneous cluster” shall mean a context sequence cluster (or sequence template representing said cluster) wherein all context sequences in the cluster are of identical sequence.
  • the term homogeneous cluster shall encompass a context sequence cluster (or sequence template representing said cluster) wherein all genes/context sequences in the cluster are characterized by an attribute.
  • a "heterogeneous cluster" shall mean a context sequence cluster (or sequence template representing said cluster) which is not a homogeneous cluster, in other words, a cluster exhibiting either: (1) at least one pair of non-identical context sequences, or (2) statistical association to an attribute wherein at least one gene/context sequence is not characterized by the attribute.
  • heterogeneous repository shall refer to a repository comprising at least one heterogeneous cluster. Examples 1 to 4 exemplify numerous heterogeneous clusters detailed in Tables 1 to 4.
  • steps 12 or 13 further comprise the step of discarding those attributes for which the functional appraisal resulted in a P_value greater than 0.3, 0.2, or 0.1, and preferably greater than 0.05.
  • the lower_bound_distance_limit can be set to various values depending on the distance formula used and the sought degree of separation between the clusters.
  • the distance formula used is d(V,W) (defined below) and (l) denotes the length of the context sequences
  • the LBDL can range between $2\% \times (2l)$ to $5\% \times (2l)$, $5\% \times (2l)$ to $20\% \times (2l)$, or $20\% \times (2l)$ to $55\% \times (2l)$. The latter is the most preferable as an initial configuration for analysis.
  • the dataset of context sequences is further subjected to multiple sequence alignment.
  • multiple sequence alignment can result in gap insertions which in turn may increase the length (l) of the context sequences.
  • K-Means algorithm will cluster the population as follows:
  • In order to perform the clustering of the context sequences, parameterization of each context sequence is required. To that end, at the initialization stage of the clustering method, each context sequence typically requires transformation into a corresponding sequence template.
  • (l) denotes the length of the context sequences or, alternatively, the length of the aligned context sequences.
  • the distance calculation procedure can be varied such that the fourth step would comprise i] ⁇ J e N .
  • the sequence template T would hold the following matrix $T_{4 \times l}$, as follows: for each i: 0 to l-1, for each $a \in \{A, T, G, C\}$ perform:
  • This merger procedure can be referred to as "merge", or "merger”.
  • the above merger procedures can handle a merger of more than two context sequence clusters by using sequential merger procedures.
  • merger of 3 templates may typically require 2 merger operations.
  • the first merger can take place with respect to templates 1 and 2, the product of which can be denoted as new template 12'.
  • a second merger can merge the new template 12' with template 3 thereby producing a single template 123' representing all the context sequences which were previously represented by the separate templates 1, 2 and 3.
  • Sequence templates as defined above may be designed as a data structure or object.
  • the sequence template essentially represents a subset of context sequences from the dataset i.e. a cluster.
  • the sequence template would, therefore, hold distribution information of each nucleotide at each position in the cluster.
  • the sequence template will typically hold the specific sequences which are grouped together in the cluster represented thereby.
  • a sequence template further holds gene name(s) or unique gene IDs which are regulated or otherwise affected by the context sequences within the respective cluster.
  • each sequence in the dataset is transformed to a sequence template.
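  • A sequence template of this kind can be sketched as a small Python data class holding the 4 x l distribution matrix, the grouped sequences and the affected gene identifiers. The class name, field names and, in particular, the merger rule (averaging the two distributions weighted by cluster size) are assumptions made for illustration, since the exact merger formula is not reproduced in this text.

```python
from dataclasses import dataclass, field

NUCLEOTIDES = "ATGC"


@dataclass
class SequenceTemplate:
    # matrix[row][i] is the frequency of NUCLEOTIDES[row] at position i (4 x l).
    matrix: list
    sequences: list = field(default_factory=list)
    gene_ids: set = field(default_factory=set)

    @classmethod
    def from_sequence(cls, seq, gene_id=None):
        """Initialization: a single context sequence becomes its own template."""
        matrix = [[1.0 if base == a else 0.0 for base in seq]
                  for a in NUCLEOTIDES]
        return cls(matrix, [seq], {gene_id} if gene_id else set())

    def merge(self, other):
        """Assumed merger rule: average weighted by the two cluster sizes."""
        n1, n2 = len(self.sequences), len(other.sequences)
        matrix = [[(x * n1 + y * n2) / (n1 + n2) for x, y in zip(r1, r2)]
                  for r1, r2 in zip(self.matrix, other.matrix)]
        return SequenceTemplate(matrix, self.sequences + other.sequences,
                                self.gene_ids | other.gene_ids)


# Sequential merger of three templates: (1 + 2) gives 12', then 12' + 3 gives 123'.
t1 = SequenceTemplate.from_sequence("AT", "gene_1")
t2 = SequenceTemplate.from_sequence("AG", "gene_2")
t3 = SequenceTemplate.from_sequence("CG", "gene_3")
t123 = t1.merge(t2).merge(t3)
```

  • Under this assumed rule every column of the merged matrix still sums to 1, so the result remains a per-position probability distribution over the nucleotides.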
  • any pair of templates (or indeed the clusters represented thereby) having equal distances measured between them are preferably stored in CurMin List.
  • the order of merging the clusters, or the templates representing them, is determined according to the order-invariant as explained and exemplified below. As described above, the order of merger operations according to the LBDL clustering method is dominated by the distance between the clusters.
  • the context sequence dataset might include subsets of numerous clusters having equal or substantially equal distances. The initial order of these clusters or the order of the context sequences may affect the final results of the algorithm. Therefore, in an embodiment, the clustering method of the present invention aims at reducing the sensitivity of the algorithm to the initial order.
  • the clusters of context sequences which share equal distances are handled together, without arbitrarily preferring any particular cluster.
  • pairs of clusters or templates representing them
  • the common template having the maximum number of neighboring clusters will be the first to merge or be handled, i.e. the largest "cluster" of clusters (currently held in CurMin List) will be merged first. Subsequently, the algorithm merges the rest of the currently handled templates according to the order-invariant.
  • FIG. 6 illustrates, in accordance with one embodiment of the present invention, a simplified example of ascertaining the processing order of templates. (a) and (b), for example, are two clusters of templates having equal distances to a common cluster.
  • FIG. 7 illustrates the application of the order-invariant according to which cluster (a), previously shown in FIG. 6, was merged prior to handling of cluster (b).
  • (a') is a new cluster representing the merger, and (b) is to be handled subsequently according to the order-invariant.
  • the multiple-tree-array is typically updated with all new distances between the pre-existing clusters (or templates representing them) and the newly merged templates.
  • a new heap item (i,j,d(i,j)) is inserted into the multiple-tree-array, with a single proviso. Said insertion takes place unless the distance d(i,j) is lower than min.distance, defined above. In that case, the handling of data items which are held in CurMin List is temporarily suspended and these data items are re-inserted into the multiple-tree-array.
  • a sorted dictionary data structure is utilized in order to provide fast identification of a common template having the maximal number of neighboring templates.
  • the sorted histogram data structure has the following structure: <number of template appearances, sequence template referenced>
  • the 3 retrieved data items will generate the following histogram in the sorted dictionary: <2,5>, <1,1>, <1,3>, <1,4>, and <1,6>.
  • the neighbors of template 5 will merge first, under the order-invariant (template referenced as
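  • The histogram above can be reproduced with a short sketch. The three data items used here are hypothetical: they are merely one set of equal-distance pairs that is consistent with the histogram quoted above, and the distance value 2.0 is arbitrary. Python's `collections.Counter` stands in for the sorted dictionary.

```python
from collections import Counter

# Hypothetical data items (i, j, d(i,j)) of equal distance held in CurMin List.
cur_min = [(1, 5, 2.0), (3, 5, 2.0), (4, 6, 2.0)]

# Count how many retrieved data items refer to each template.
appearances = Counter()
for i, j, _distance in cur_min:
    appearances[i] += 1
    appearances[j] += 1

# Entries <number of appearances, template>, most frequent template first.
histogram = sorted(
    ((count, template) for template, count in appearances.items()),
    key=lambda entry: (-entry[0], entry[1]),
)
first_to_merge = histogram[0][1]
```

  • With these items template 5 appears twice and therefore heads the histogram, so its neighbors are handled first.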
  • the present invention utilizes a computer implemented method (hereinafter: "Vector Space clustering method").
  • the VS clustering method is used for performing clustering which is a variant of LBDL method shown above.
  • This VS method is particularly useful where the length of the context sequences is in the range of 3-17 characters.
  • the skilled person in the art would recognize that range is largely affected by computation time, which is associated with the length, and the computer system employed.
  • Computer systems having high computation capabilities may process context sequences of greater length, including but not limited to the range of 10-
  • Subject Cluster to a functional appraisal // each cluster of context sequences which is represented by a sequence template is subjected to a functional appraisal. This is typically performed by first retrieving the names or unique identifiers of genes regulated or affected by the context sequences within a cluster; and secondly, executing functional appraisal on the names or unique identifiers retrieved.
  • the method is utilized for a particular subset of context sequences of interest.
  • the latter embodiment can be used to loop through a subset of possible sequences instead of looping through the entire vector space of possible sequences. This may be advantageous for achieving more efficient execution time in cases, for example, that some sequences are known not to feature substantial sequence patterns or important functional characteristics.
  • the VS differs from the LBDL in several aspects. For example, each context sequence in LBDL is classified into a single cluster. On the other hand, VS may classify each context sequence into several clusters. In that respect VS is a "softer" classifier, which sometimes can be advantageous because a single context sequence may be associated with a multiplicity of functional attributes or attributes sets. Another difference lies in the fact that VS typically spans through the entire vector space of all possible sequences, i.e. even sequences which are absent from the context sequences of the data set. This is especially advantageous where synthetic or predicted sequences cannot be found in vivo. This is exemplified in Step 1, where the analysis is performed for each (c) representing a possible sequence (not necessarily a context sequence of the data set).
  • Section III
  • The present invention, in some embodiments, relates to an implementation of specialized memory structures and processes for computations. These structures and processes may optionally be implemented with the embodiments described above and/or may also optionally be used independently.
  • Memory module for holding Parameterized Information
  • the latter typically comprises a 2D matrix of distances, such that each cell in said matrix holds the distance between a pair of points of a set.
  • a distance matrix is typically a symmetric NxN matrix containing real numbers as elements, given N points in a set.
  • the distance matrix performance is unacceptable.
  • the performance time of retrieving the minimal or maximal element stored in the distance matrix is impractical for large data sets i.e. time for retrieving minimal/maximal element stored in the distance matrix.
  • key is a parameter within a data field comprising a value stored within a data item, or node.
  • key is a parameter capable of at least semi-order.
  • a key may comprise a real number stored in a data item.
  • (A) is data item
  • "KEY(A)” shall mean the parameter within a data field of data item (A).
  • the key in a data item (i,j,d(i,j)) of the present invention can be the field consisting of the distance between the pair of clusters i and j.
  • heap is a data structure based on a tree topology that satisfies a general heap invariant as follows: for each pair of elements, items or nodes in a heap, X and Y, where X is a child node of Y, $\mathrm{KEY}(Y) \ge \mathrm{KEY}(X)$, i.e. the node having the maximum value as key ("greatest element") is the top node (or root node) of the heap. This heap is typically referred to as a max-heap. Where $\mathrm{KEY}(Y) \le \mathrm{KEY}(X)$, the smallest element is always the top node, and the heap is referred to as a min heap.
  • DeleteMin() or "deletion" shall mean removing and retrieving the root node of a min-heap.
  • Insert() or "insertion" shall mean adding a new element to a min heap. Heap shall further mean as defined in Cormen et al. [17], which is incorporated herein by reference.
  • a min heap provides an efficient data structure in which retrieving a minimal element is performed in $O(\log N)$.
  • the latter is clearly more efficient than the traditional distance matrix, by about 2 orders of magnitude.
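  • As a brief illustration of a min heap over data items, the following sketch uses Python's standard `heapq` module. Storing each data item as (d(i,j), i, j), with the distance first, is a choice made here so that the distance acts as the key; the sample distances are arbitrary.

```python
import heapq

# Data items stored as (d(i,j), i, j) so that the distance acts as the key.
heap = []
for item in [(3.5, 1, 2), (0.5, 2, 3), (2.0, 1, 3)]:
    heapq.heappush(heap, item)   # Insert(): O(log N) per item

root = heap[0]                   # the smallest key always sits at the root
closest = heapq.heappop(heap)    # DeleteMin(): removes and returns the root
```

  • After the call to `heappop`, the heap re-establishes the invariant, so the next closest pair becomes the new root without scanning all stored distances.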
  • the present invention provides a "multiple-tree- array" as defined and exemplified below.
  • "multiple-tree-array" shall mean a memory module, or a data structure comprised therein, employing a plurality of tree topologies representing a plurality of min-heaps, wherein the plurality of tree topologies is managed through a common interface.
  • the present invention is directed to a computer memory system comprising a plurality of tree topologies representing plurality of (k) heaps, wherein the plurality of tree topologies is managed through a common interface; such that (k > 1).
  • FIG. 8 illustrates, in accordance with one embodiment of the present invention, a multiple-tree-array topology within a memory module.
  • the top item, i.e. the item having the minimum distance, is the element having the minimal key value amongst the (k) min heaps as shown. Therefore, in one embodiment the computer memory system comprises min heaps.
  • the global minimum in the multiple-tree-array is defined as the minimal element (or minimal root element) amongst the min heaps comprising the multiple-tree-array.
  • the minimal element holds the minimal key value in comparison to all (k) min heaps which comprise the multiple-tree-array (hereafter: min-heap invariant).
  • the global minimum is the minimal distance between a pair of context sequences or sequence templates.
  • FIG. 8a illustrates similarly, in accordance with another embodiment of the present invention, a multiple-tree-array topology within a memory module.
  • the root element in this embodiment, is the element having the maximal key value amongst the (k) max heaps as shown.
  • the multiple-tree-array is exemplified herein as a multiple-tree-array comprising min heaps and having a global minimal element
  • the present invention similarly relates to a multiple-tree-array comprising max heaps and having a global maximal element.
  • the computer memory system comprises max heaps.
  • Secondary storage shall mean any data storage system performing slower than typical RAM (Random Access Memory). Secondary Storage typically includes the non-volatile or semi-permanent storage in a computer environment. Common secondary storage devices are diskettes, hard drives, or tapes.
  • each specific heap comprising the multiple-tree-array can be configured to operate as a conventional heap, either min- or max-heap. Insertion of a data item into the multiple-tree-array can be performed by invoking an Insert() procedure upon a specific min heap in the multiple-tree-array, with one proviso. If the size of the specific min heap reaches a certain predetermined size threshold, another min heap is selected for the insertion procedure. In the case where all min heaps have reached the predetermined size threshold, additional memory comprising a min-heap or max-heap is allocated to the multiple-tree-array memory module. In one embodiment, said size threshold is in the range of 100-1000 elements.
  • the element is a data item as defined above.
  • Deletion of a data item from the multiple-tree-array which comprises min-heaps can be performed by deleting the global minimum of the multiple-tree-array.
  • global minimum is the minimal top element which holds the minimal key value in comparison to all (k) min heaps comprised in the multiple-tree-array.
  • the deleted element is replaced by an element from a specific min heap, ensuring the heap invariant, that is, ensuring that the global minimum is the element which holds the minimal key value in comparison to all (k) min heaps comprised in the multiple-tree-array. Where the last element in a min heap is removed, the min heap can be released from the multiple-tree-array memory module.
  • the multiple-tree-array provides storage and retrieval performed at the worst case time of $O(k \log n)$, where (k) is the number of heaps managed therein.
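  • The insertion and deletion behaviour described above can be sketched as a small Python class. This is an in-memory illustration only, under stated assumptions: the class and method names are hypothetical, each heap is a plain `heapq` list, and the active/passive split between RAM and secondary storage is not modelled.

```python
import heapq


class MultipleTreeArray:
    """Several bounded min heaps behind one insert/delete_min interface."""

    def __init__(self, size_threshold=1000):
        self.size_threshold = size_threshold
        self.heaps = []  # invariant: no heap in this list is ever empty

    def insert(self, item):
        for heap in self.heaps:
            if len(heap) < self.size_threshold:
                heapq.heappush(heap, item)
                return
        self.heaps.append([item])  # every heap is full: allocate another one

    def delete_min(self):
        if not self.heaps:
            raise IndexError("delete_min from an empty multiple-tree-array")
        # The global minimum is the smallest of the k root elements.
        index = min(range(len(self.heaps)), key=lambda h: self.heaps[h][0])
        item = heapq.heappop(self.heaps[index])
        if not self.heaps[index]:
            del self.heaps[index]  # release a heap once its last element is removed
        return item


mta = MultipleTreeArray(size_threshold=2)
for data_item in [(5.0, 0, 1), (1.0, 0, 2), (3.0, 1, 2), (0.5, 2, 3), (4.0, 1, 3)]:
    mta.insert(data_item)
number_of_heaps = len(mta.heaps)
in_order = [mta.delete_min() for _ in range(5)]
```

  • Finding the global minimum compares only the k roots, and the subsequent pop touches a single bounded heap, which is what keeps each heap small enough to move between RAM and secondary storage in the embodiments described here.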
  • An "active min heap” and “active subset of min heaps” shall mean the min heaps which are stored in RAM, and at least one of the min heaps stores the global minimum of the multiple-tree-array.
  • a "passive min heap” and “passive subset of min heaps” shall mean the min heaps which are held in secondary storage.
  • An “active max heap” and “active subset of max heaps” shall mean the max heaps which are stored in RAM, and at least one of the heaps holds the global maximum of the multiple-tree-array.
  • a “passive max heap” and “passive subset of heaps” shall mean the max heaps which are held in secondary storage.
  • an active subset of heaps is held in RAM, while the rest of the heaps are maintained on a secondary storage.
  • a subset of passive min heaps is maintained on secondary storage.
  • an active subset of max heaps is held in RAM, while the rest of the heaps are maintained on a secondary storage.
  • a subset of passive max heaps is maintained on secondary storage.
  • the multiple-tree-array is configured to replace or switch at least one of the active min-heap with at least one passive min heap (one of which is storing the current global minimum).
  • a data item in a min heap array shall have at least the following members (i,j,d(i,j)), whereby i and j are pointers to respective templates (of the template matrices) and the third member is a real number representing the distance between the templates, i.e. the 3rd field in the data item is the key, the common field as defined above.
  • a template (or a context sequence represented thereby) can be erased or invalidated from the data set during the "life time" of the multiple-tree- array. The invalidation may occur upon merger of templates, as described in the present invention. The merger procedure typically entails invalidation of the merged templates.
  • At least one existing data item in the multiple-tree-array may be holding distance information relating to the invalidated template. Therefore, said existing data item in turn requires invalidation or deletion. Typically, such invalidation would require $O(2N)$ deletions of data items from the multiple-tree-array (N being the number of clusters). Therefore, in yet another aspect, the present invention is directed to a postponed deletion procedure or postponed invalidation procedure. The deletion is postponed until the operation of DeleteMin().
  • the postponed deletion or invalidation of data items is delayed until their respective deletion by the operation of DeleteMin().
  • instead of searching for the data item for deletion, the multiple-tree-array "awaits" until the invalidated data item is retrieved, by operation of DeleteMin().
  • the retrieved data item (i,j,d(i,j)) is verified to comprise valid data or valid templates (i) and (j).
  • validation procedure utilizes a one dimensional array of Boolean values (B) such that B[i] holds #true if and only if template (i) is of valid status.
  • the validation procedure can utilize an array of other validation information such as but not limited to: a time stamp or a string representing a status.
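  • The postponed deletion can be illustrated with a short sketch in which stale data items are simply skipped when they surface. The function name `delete_min_valid` is hypothetical, a single `heapq` heap stands in for the multiple-tree-array, and the sample distances are arbitrary; the Boolean array plays the role of (B) above.

```python
import heapq


def delete_min_valid(heap, valid):
    """Pop data items (d, i, j) until both referenced templates are valid."""
    while heap:
        d, i, j = heapq.heappop(heap)
        if valid[i] and valid[j]:
            return d, i, j
        # Otherwise the item refers to a merged template and is dropped here.
    return None


valid = [True, True, True, True]        # B[i] is True iff template i is valid
heap = [(1.0, 0, 1), (2.0, 1, 2), (3.0, 2, 3)]
heapq.heapify(heap)

first = delete_min_valid(heap, valid)   # the pair (0, 1) is retrieved and merged
valid[0] = valid[1] = False             # invalidate merged templates, no heap search
second = delete_min_valid(heap, valid)  # skips the stale item mentioning template 1
```

  • Invalidation thus costs a constant-time flag update per template, while the cost of removing each stale item is paid once, at the moment it reaches the root.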
  • the computer implemented method of the present invention for identifying a sequence template as statistically associated with an attributes set of interest typically comprises: (a) providing a repository of attributes sets, said attributes set is statistically associated with a sequence template; (b) selecting an attributes set of interest; and (c) retrieving at least one sequence template statistically associated with said attributes set.
  • a sequence template represents two or more context sequences.
  • the attributes set can consist of two or more attributes of interest selected by a user or client.
  • the retrieved sequence template of step (c) typically also represents two or more context sequences.
  • retrieved sequence template or cluster represented thereby is a heterogeneous cluster.
  • the repository was obtained according to any method of the present invention.
  • the repository can be obtained by utilization of the LBDL clustering method.
  • the repository was obtained by utilization of the VS clustering method.
  • the repository is a heterogeneous repository.
  • Attributes or function attributes of interest can be selected from the group consisting of: the Gene Ontology Project (GO), Interpro annotation (European Molecular Biology Laboratory, EMBL), SMART (a Simple Modular Architecture Research Tool, found at http://smart.embl.de/), UniProt Knowledgebase (SwissProt), OMIM (by NCBI), PROSITE (by the Swiss Institute of Bioinformatics), Protein Information Resource (PIR), GeneCards, and Kyoto Encyclopedia of Genes and Genomes (KEGG).
  • (a) represents a particular function attribute name; and V[a].value represents a value associated with the particular function (a).
  • V and W comprise binary digits as values;
  • V and W comprise real numbers as values
  • similarity between any pair of function attribute arrays V and W can be determined by the following procedure: sd ← 0
  • V and/or W may not comprise a particular function attribute.
  • '0.0' may be deemed to represent the non-inclusion of a particular function.
  • the above step of (sd +
  • similarity degree can be determined by the above distance measurement between a pair of function attribute arrays.
  • many alternative approaches may be adopted to provide a measure of similarity between function attribute arrays.
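  • One such measure can be sketched as follows. Because the accumulation step of the procedure is not fully reproduced in this text, the rule used here (sd accumulates the absolute difference between V[a].value and W[a].value) is an assumption, as are the function name and the sample attribute names. An attribute absent from an array is treated as 0.0, in line with the convention stated above.

```python
def attribute_distance(v, w):
    """Assumed rule: sum of absolute value differences over all attribute names."""
    sd = 0.0
    for a in set(v) | set(w):
        # An attribute absent from an array is treated as 0.0 (non-inclusion).
        sd += abs(float(v.get(a, 0.0)) - float(w.get(a, 0.0)))
    return sd


# Two function attribute arrays holding binary digits as values.
V = {"immunoglobulin": 1, "transcription regulation": 1}
W = {"immunoglobulin": 1, "cell cycle": 1}
sd = attribute_distance(V, W)
```

  • Under this assumed rule a smaller sd indicates more similar arrays, and identical arrays give sd = 0.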
  • "functional significance appraisal", "functional appraisal", "attribute appraisal" and "functional significance test" shall refer to a computational method comprising a statistical test yielding a confidence level or probability (P value) that at least one function attribute is associated with a given gene cluster, or with a gene cluster regulated or otherwise affected by context sequence(s).
  • the typical input for this computational method is the names or unique identifiers of genes regulated or otherwise affected by the context sequence within a cluster.
  • the typical result (or output) of functional appraisal is typically a list of attributes which can be deemed as statistically over represented within said input cluster.
  • the list of attributes can further comprise the P value or confidence level of an attribute within the list.
  • the statistical test can be based on Fisher exact probability test, or hyper-geometric (HG) probability distribution pertaining the sampling without replacement from finite population as explained hereinafter.
  • N typically denotes the entire size of the gene population (i.e. population size);
  • n denotes the size of context sequence cluster under analysis (i.e. sample size);
  • m denotes the number of genes in the entire population characterized by at least one function attribute (i.e. the "unique" group size);
  • k denotes the number of unique items found in the cluster under analysis.
  • the hypergeometric distribution with parameters N, m and n therefore defines the probability of getting exactly k genes characterized by said function attribute in a cluster of input genes (or in the context sequence cluster regulating or affecting them).
  • Jackknife methodologies and other confidence assisting procedures can be added to increase the confidence level of the enrichment results.
  • Functional appraisal tools can be obtained from, for example, http://david.abcc.ncifcrf.gov 18, 19.
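  • The hypergeometric probability described above is $P(X = k) = \binom{m}{k}\binom{N-m}{n-k} / \binom{N}{n}$. A minimal sketch of this probability, and of the upper-tail sum commonly used as an over-representation P value, is given below. The function names are hypothetical, and the use of the upper tail is an assumption about how the test would typically be applied, not a statement of the method used here.

```python
from math import comb


def hypergeom_pmf(k, N, m, n):
    """P(X = k): exactly k annotated genes in a cluster of n genes,
    drawn without replacement from N genes of which m are annotated."""
    if k < max(0, n - (N - m)) or k > min(m, n):
        return 0.0
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)


def enrichment_p_value(k, N, m, n):
    """Upper-tail P(X >= k), a common over-representation P value."""
    return sum(hypergeom_pmf(i, N, m, n) for i in range(k, min(m, n) + 1))


# Toy numbers: population of 10 genes, 4 annotated, cluster of 3, 2 hits.
print(hypergeom_pmf(2, 10, 4, 3))
print(enrichment_p_value(2, 10, 4, 3))
```

  • For the toy numbers, the exact probability is 36/120 and the upper tail is 40/120.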
  • the retrieval of a sequence template statistically associated with an attributes set of interest comprises: determining similarity between the attributes set of interest and each attributes set previously inserted into the repository; and retrieving from the repository a sequence template associated with at least one attributes set previously inserted into said repository.
  • the repository can typically comprise N pair(s) of sequence templates and their associated attributes sets: $\langle T_i, AS_i \rangle$, $1 \le i \le N$, where $T_i$ and $AS_i$ are the sequence template and attributes set of the i-th record in the repository, respectively.
  • the method of retrieval of a sequence template statistically associated with an attributes set (AS) of interest can therefore be performed by: (a) determining similarity, by utilizing a similarity formula such as, but not limited to, $ds(AS, AS_i)$, as defined above; and (b) retrieving $\langle T_i, AS_i \rangle$, $1 \le i \le N$, from the repository together with the respective $ds(AS, AS_i)$.
  • the order of retrieved records is preferably in descending order according to the similarity degree.
  • the retrieved sequence template typically also represents two or more context sequences. The latter may be either identical context sequences or, typically, context sequences consisting of different sequences.
  • the attributes set associated with the context sequence(s) or sequence template can consist of two or more attributes.
  • the context sequence(s) or sequence template may be statistically associated with a particular attribute even where at least one of the context sequences (or a gene affected thereby) is not characterized by the attribute.
  • the retrieval procedures of the present invention therefore enable retrieval of heterogeneous clusters, as defined above.
  • the retrieval of a sequence template statistically associated with said attributes set may comprise the steps of: determining similarity between the attributes set of interest and at least one attributes set previously inserted into the repository; and retrieving from the repository a sequence template associated with the at least one attributes set previously inserted into said repository.
  • the repository can therefore typically comprise N pair(s) of sequence templates and their associated attributes sets: $\langle T_i, AS_i \rangle$, $1 \le i \le N$, where $T_i$ and $AS_i$ are the sequence template and attributes set of the i-th record in the repository, respectively.
  • the method of retrieval of a sequence template statistically associated with an attributes set (AS) of interest can therefore be performed by: (a) determining similarity, by utilizing a similarity formula such as, but not limited to, $ds(AS, AS_i)$, as defined above; and (b) retrieving at least one of $\langle T_i, AS_i \rangle$, $1 \le i \le N$, from the repository together with the respective $ds(AS, AS_i)$.
  • the order of retrieved records is preferably in descending order according to the similarity degree.
  • the retrieved sequence template typically also represents two or more context sequences.
  • the latter may be either identical context sequences or, typically, context sequences consisting of different sequences.
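  • The retrieval steps (a) and (b) can be sketched as below. This is a minimal illustration under stated assumptions: ds is treated as a distance (a smaller value means more similar), so "descending similarity" is implemented as ascending distance; `set_distance` is a hypothetical stand-in for ds; and the repository records and attribute names are toy data.

```python
def set_distance(a, b):
    """Hypothetical stand-in for ds: size of the symmetric difference."""
    return len(a ^ b)


def retrieve_ranked(query, repository, ds):
    """Return (template, attributes_set, distance) triples, most similar first."""
    scored = [(t, attrs, ds(query, attrs)) for t, attrs in repository]
    return sorted(scored, key=lambda rec: rec[2])


# Toy repository of <T_i, AS_i> pairs (illustrative only).
repository = [
    ("T1", {"transcription regulation", "nucleus"}),
    ("T2", {"photosynthesis"}),
    ("T3", {"transcription regulation", "metal ion binding", "nucleus"}),
]
query = {"transcription regulation", "nucleus"}

ranked = retrieve_ranked(query, repository, set_distance)
for template, attrs, distance in ranked:
    print(template, distance)
```

  • Each record is returned together with its distance, so a later filtering step can operate on the ranked list without recomputing ds.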
  • the method typically retrieves at least one sequence template together with a degree of similarity between the attributes set of interest and the attributes set statistically associated with the sequence template. However, filtering of at least one sequence template is typically required.
  • the retrieving includes discarding a sequence template associated with said at least one attributes set, where the similarity between said at least one attributes set and the attributes set is above a predefined threshold (L).
  • the retrieval further comprises discarding (or filtering out) records having $ds(AS, AS_i) > L$.
  • the threshold (L) can be set to various values depending on the number of results sought by the user or the client. As an alternative, the user or client may wish to retrieve the best result alone.
  • the retrieving step includes discarding a sequence template associated with said at least one attributes set, where the similarity between said at least one attributes set and the attributes set of interest is above the global minimum.
  • the retrieval further comprises discarding (or filtering out) records having $ds(AS, AS_i) > \min_{1 \le j \le N} ds(AS, AS_j)$.
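  • The two filtering options (a predefined threshold L, or keeping the best result alone) can be sketched as follows. The function names and the toy scores are hypothetical, and ds is again assumed to behave as a distance, so records above the threshold or above the global minimum are discarded.

```python
def filter_by_threshold(scored, limit):
    """Discard records whose distance exceeds the threshold L."""
    return [rec for rec in scored if rec[1] <= limit]


def keep_best(scored):
    """Discard every record whose distance is above the global minimum."""
    if not scored:
        return []
    best = min(distance for _, distance in scored)
    return [rec for rec in scored if rec[1] == best]


# Toy (template, distance) pairs, illustrative only.
scored = [("T1", 0.0), ("T2", 3.0), ("T3", 1.0)]
print(filter_by_threshold(scored, 1.0))  # [('T1', 0.0), ('T3', 1.0)]
print(keep_best(scored))                 # [('T1', 0.0)]
```

  • Ties at the global minimum are all kept, which matches the strict inequality in the discard rule.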
  • said retrieving includes discarding attributes (i.e. members of the attributes set) for which the functional appraisal resulted in a respective P_value greater than 0.3, 0.2, 0.1, or preferably greater than 0.05.
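  • Discarding attributes by P_value can be sketched as below. The function name, the default cut-off argument and the toy attribute names and values are assumptions made for illustration.

```python
def drop_insignificant(attributes, max_p=0.05):
    """Keep only attributes whose P_value does not exceed max_p.

    attributes maps an attribute name to the P_value obtained from the
    functional appraisal.
    """
    return {name: p for name, p in attributes.items() if p <= max_p}


# Toy P_values, illustrative only.
appraised = {"attribute A": 6.5e-6, "attribute B": 0.04, "attribute C": 0.2}
print(drop_insignificant(appraised))             # attributes A and B remain
print(drop_insignificant(appraised, max_p=0.3))  # all three remain
```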
  • the method can further comprise merging at least two of the retrieved sequence templates (or clusters represented thereby). The merger procedure is detailed above.
  • Section V - Experimental Examples
  • This Section relates to experimental examples, illustrating the above embodiments of the present invention. These examples are provided for the purpose of illustration only and without any intention of being limiting in any way.
  • the complete RefSeq sequences of plant mRNA were downloaded (http://www.ncbi.nlm.nih.gov/RefSeq).
  • the database was filtered in order to include exclusively mRNA sequences of Arabidopsis thaliana.
  • the dataset was thereafter cleaned of duplicate genes to reduce over-representation of identical genes.
  • the translation initiator codon was identified using the RefSeq CDS. Sequences of 9 nucleotides in length preceding the translation initiator codon were parsed and indexed.
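  • The parsing step can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, the CDS start is given as a 0-based index into the mRNA string (RefSeq CDS locations are 1-based, so one would subtract 1 first), and records whose 5' region is shorter than 9 nucleotides are skipped.

```python
def context_sequence(mrna, cds_start, length=9):
    """Return the `length` nucleotides immediately preceding the start codon.

    cds_start is the 0-based index of the first nucleotide of the
    translation initiator codon. Returns None when fewer than `length`
    nucleotides precede the codon.
    """
    if cds_start < length:
        return None
    return mrna[cds_start - length:cds_start].lower()


# Toy mRNA: 3 leading nucleotides, a 9-nt context, then the ATG codon.
mrna = "GGC" + "ACAAAAACA" + "ATG" + "GCTTCC"
cds_start = 12  # 0-based position of the 'A' in ATG
print(context_sequence(mrna, cds_start))  # acaaaaaca
```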
  • the complete dataset was aligned.
  • the LBDL clustering method was applied to the mRNA dataset in 8 separate phases. In each phase the algorithm was provided with a different Lower Bound Distance Limit so as to cluster with varying degrees of stringency (0.01; 2.01; 3.01; 4.01; 5.01; 6.01; and 7.01). The separate phase analysis provides an opportunity to investigate smaller, more exotic clusters of genes before they merge into larger clusters and lose some significant functional properties along the way.
  • Table 1 presents the emerging gene clusters which were identified by the LBDL clustering method. This table includes selected clusters which demonstrated significant functional attributes.
  • the clusters in Table 1 are arranged according to size, i.e. the number of different genes in each cluster.
  • for each cluster, said table provides a template comprising a matrix $T_{4 \times 9}$, which gives the distribution of nucleotides for each position preceding the translation initiation codon.
  • the translation initiation codon is at position '0' and does not appear in the table.
  • Table 1 includes only a portion of the results due to the amount of information the LBDL clustering method extracted and collected.
  • the table provides the significant functions or functional attributes set associated with the template.
  • the largest gene cluster includes some 1613 distinct genes.
  • the second largest cluster has 1433 distinct genes. These clusters seem to support previous work which postulated an A-rich conserved region in higher plants 20.
  • the large clusters were enriched, inter alia, with genes encoding nuclear and transcription-related proteins, partially in contradiction to previous speculations 21.
  • Another observation is that the smaller clusters tend to be quite distant from the largest gene clusters. Smaller clusters also tend to include non-A nucleotides with a distribution above 80%. For easier reference these nucleotides were highlighted in the body of the table.
  • templates associated with transcription regulation consist, inter alia, of: 'aaaaaaaaaa', 'gttaagaaa', 'ttttcttca' and 'gagagagaa'.
  • Photosynthesis is associated with 'acaaaaaca', and also 'gaagaagaa'. This reveals that even a single function can be associated with a plurality of context sequences or dominant context sequences with strong statistical significance.
  • Table 1 illustrates a plurality of other templates and their association with significant functional attributes.
  • the complete RefSeq sequences of human mRNA were downloaded (http://www.ncbi.nlm.nih.gov/RefSeq).
  • the database was filtered in order to include exclusively mRNA sequences of Homo sapiens.
  • the dataset was thereafter cleaned of duplicate genes to reduce over-representation of identical genes.
  • the translation initiator codon was identified using the RefSeq CDS. Sequences of 9 nucleotides in length preceding the translation initiator codon were parsed and indexed.
  • the complete dataset was aligned.
  • the LBDL clustering method was applied to the mRNA dataset in 3 separate phases. In each phase the algorithm was provided with a different Lower Bound Distance Limit so as to cluster with varying degrees of stringency (5.01; 6.01; and 7.01). The separate phase analysis provides an opportunity to investigate smaller, more exotic clusters of genes before they merge into larger clusters and lose some significant functional properties along the way.
  • Table 2 presents the emerging gene clusters which were identified by the LBDL clustering method. This table includes selected clusters which demonstrated significant functional attributes.
  • the clusters in Table 2 are arranged according to size, i.e. the number of different genes in each cluster.
  • for each cluster, said table provides a template comprising a matrix $T_{4 \times 9}$, which gives the distribution of nucleotides for each position preceding the translation initiation codon.
  • for convenience, the most frequent sequence of successive nucleotides, i.e. the dominant context sequence, is also disclosed.
  • the translation initiation codon is at position '0' and does not appear in the table.
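  • A template of this kind can be sketched as a 4 x 9 matrix of per-position nucleotide percentages computed from the context sequences of one cluster. The sketch below is illustrative: the function names, the row order (a, c, g, t) and the toy cluster are assumptions, and the dominant context sequence is taken here to be the most frequent whole sequence in the cluster, which is one reading of "most frequent sequence of successive nucleotides".

```python
from collections import Counter

NUCLEOTIDES = "acgt"


def template_matrix(cluster):
    """Return a 4 x L matrix of per-position nucleotide percentages.

    Rows follow NUCLEOTIDES order; columns run from position -L to -1.
    """
    length = len(cluster[0])
    matrix = []
    for nt in NUCLEOTIDES:
        row = [100.0 * sum(seq[pos] == nt for seq in cluster) / len(cluster)
               for pos in range(length)]
        matrix.append(row)
    return matrix


def dominant_context_sequence(cluster):
    """Most frequent full-length sequence in the cluster."""
    return Counter(cluster).most_common(1)[0][0]


# Toy cluster of 9-nucleotide context sequences (illustrative only).
cluster = ["acaaaaaca", "acaaaaaca", "gcaaaaaca", "acataaaca"]
matrix = template_matrix(cluster)
for nt, row in zip(NUCLEOTIDES, matrix):
    print(nt, row)
print(dominant_context_sequence(cluster))  # acaaaaaca
```

  • Each column of the matrix sums to 100%, so a position where one nucleotide exceeds 80% is easy to spot when scanning a row.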
  • Table 2 includes only a portion of the results due to the amount of information the LBDL clustering method extracted and collected. The most significant functional enrichment of each cluster appears as well. The largest gene cluster includes some 1562 distinct genes. The second largest cluster has 987 distinct genes.
  • templates associated with transcription regulation consist, inter alia, of: 'cgcgggaag', 'ggaggaaaa', and 'ctgaagaaa'. Metabolism is statistically associated with 'cccgccgcg', 'agcctagaa' and also 'ctgaagaaa'. Again, even a single function can be associated with a plurality of context sequences with strong statistical significance. Table 2 illustrates a plurality of other templates and their association with significant functional attributes.
  • the statistically supported association of functional attribute arrays with a template can be used both in research and in genetic engineering.
  • the database was filtered in order to include exclusively mRNA sequences of Mus musculus.
  • the dataset was thereafter cleaned of duplicate genes to reduce over-representation of identical genes.
  • the translation initiator codon was identified using the RefSeq CDS. Sequences of 9 nucleotides in length preceding the translation initiator codon were parsed and indexed. The dataset thereafter included a total of 15,312 short sequences of 9 successive nucleotides. The complete dataset was aligned.
  • the LBDL clustering method was applied to the mRNA dataset in 3 separate phases. In each phase the algorithm was provided with a different Lower Bound Distance Limit so as to cluster with varying degrees of stringency (5.01; 6.01; and 7.01). The separate phase analysis provides an opportunity to investigate smaller, more exotic clusters of genes before they merge into larger clusters and lose some significant functional properties along the way.
  • the clusters in Table 3 are arranged according to size, i.e. the number of different genes in each cluster.
  • for each cluster, said table provides a template comprising a matrix $T_{4 \times 9}$, which gives the distribution of nucleotides for each position preceding the translation initiation codon.
  • for convenience, the most frequent sequence of successive nucleotides, i.e. the dominant context sequence, is also disclosed.
  • the translation initiation codon is at position '0' and does not appear in the table.
  • Table 3 includes only a portion of the results due to the amount of information the LBDL clustering method extracted and collected.
  • the most significant functional enrichment of each cluster appears as well.
  • the largest gene cluster includes some 1197 distinct genes.
  • the second largest cluster has 710 distinct genes.
  • the context sequence 'gccgccgcc' can be associated with the SH3 domain.
  • a plurality of context sequences is now associated with metabolism in general.
  • templates associated with metabolism consist, inter alia, of: 'ccccgcgcc' and 'cggaggaag'.
  • Metal ion binding is statistically associated with both 'gccgccgcc' and 'ccccgcgcc'.
  • Table 3 illustrates a plurality of other templates and their association with significant functional attributes.
  • the statistically supported association of functional attribute arrays with a template can be used both in research and in genetic engineering.
  • the database was filtered in order to include exclusively mRNA sequences of Bos taurus.
  • the dataset was thereafter cleaned of duplicate genes to reduce over-representation of identical genes.
  • the translation initiator codon was identified using the RefSeq CDS. Sequences of 9 nucleotides in length preceding the translation initiator codon were parsed and indexed. The dataset thereafter included a total of 9,723 short sequences of 9 successive nucleotides. The complete dataset was aligned.
  • the LBDL clustering method was applied to the mRNA dataset in 3 separate phases. In each phase the algorithm was provided with a different Lower Bound Distance Limit so as to cluster with varying degrees of stringency (5.01; 6.01; and 7.01). The separate phase analysis provides an opportunity to investigate smaller, more exotic clusters of genes before they merge into larger clusters and lose some significant functional properties along the way.
  • Table 4 presents the emerging gene clusters which were identified by the LBDL clustering method. This table includes selected clusters which demonstrated significant functional attributes.
  • the clusters in Table 4 are arranged according to size, i.e. the number of different genes in each cluster. For each cluster, said table provides a template comprising a matrix $T_{4 \times 9}$, which gives the distribution of nucleotides for each position preceding the translation initiation codon. For convenience, the most frequent sequence of successive nucleotides, i.e. the dominant context sequence, is also disclosed.
  • Table 4 illustrates a plurality of other templates and their association with significant functional attributes.
  • the most significant functional enrichment of each cluster appears as well.
  • the largest gene cluster includes some 815 distinct genes.
  • the second largest cluster has 583 distinct genes.
  • Examples 1-4 exemplify numerous heterogeneous clusters, detailed in Tables 1-4, which were identified by the method and systems of the present invention.
  • Table 1: Emerging gene clusters identified by the clustering algorithm for Arabidopsis thaliana. The clusters below are arranged in order of declining size. For each cluster, the table depicts the distribution of nucleotides for each position along the context sequence.
  • Table 2: Emerging gene clusters identified by the clustering algorithm for Homo sapiens. The clusters below are arranged in order of declining size. For each cluster, the table depicts the distribution of nucleotides for each position along the context sequence.
  • Table 3: Emerging gene clusters identified by the clustering algorithm for Mus musculus. The clusters below are arranged in order of declining size. For each cluster, the table depicts the distribution of nucleotides for each position along the context sequence.
  • Column headings: Cluster (number of context sequences); Function attributes set (Enrichment score/P_value/Benjamini); Distribution of nucleotides per position along the context sequence (%), Pos: -9 -8 -7 -6 -5 -4 -3 -2 -1.
  • intracellular non-membrane-bound organelle (7.46, 1.0E-9, 1.6E-7); non-membrane-bound organelle (7.46, 1.0E-9,
  • A%: 11.44 2.756 27.23 4.594 9.857 36.42 47.11 2.840 1.169
  • tissue kallikrein activity (3.41, 1.3E-12, 3.3E-9); serine protease (3.41, 6.5E-6, 4.1E-4); serine proteinase (3.41, 9.7E-6, 5.6E-4); SF001135:trypsin (3.41, 2.0E-5, 4.2E-2); submandibular gland (3.41, 2.5E-5, 1.2E-3); zymogen (3.41, 3.2E-5, 1.5E-3); protease (3.41, 2.5E-4, 8.9E-3); Peptidase S1A, chymotrypsin (3.41, 8.3E-4, 5.7E-1); Peptidase S1 and S6, chymotrypsin/Hap (3.41, 1.2E-3, 5.9E-1); serine-type endopeptidase activity (3.41, 3.5E-3, 2.9E-1); serine-type peptidase activity (3.41, 6.4E-3,
  • Table 4: Emerging gene clusters identified by the clustering method for Bos taurus. The clusters below are arranged in order of declining size. For each cluster, the table depicts the distribution of nucleotides for each position along the context sequence.

PCT/IL2008/001140 2007-08-21 2008-08-20 Systems and methods for rational selection of context sequences and sequence templates WO2009024974A2 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US12/733,256 US20100153400A1 (en) 2007-08-21 2008-08-20 Systems and methods for rational selection of context sequences and sequence templates
US13/764,894 US9779205B2 (en) 2007-08-21 2013-02-12 Systems and methods for rational selection of context sequences and sequence templates
US15/677,234 US20170351810A1 (en) 2007-08-21 2017-08-15 Systems and methods for rational selection of context sequences and sequence templates

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US93559207P 2007-08-21 2007-08-21
US60/935,592 2007-08-21

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US12/733,256 A-371-Of-International US20100153400A1 (en) 2007-08-21 2008-08-20 Systems and methods for rational selection of context sequences and sequence templates
US13/764,894 Continuation US9779205B2 (en) 2007-08-21 2013-02-12 Systems and methods for rational selection of context sequences and sequence templates

Publications (1)

Publication Number Publication Date
WO2009024974A2 true WO2009024974A2 (fr) 2009-02-26

Family

ID=39967653

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2008/001140 WO2009024974A2 (fr) 2007-08-21 2008-08-20 Systems and methods for rational selection of context sequences and sequence templates

Country Status (2)

Country Link
US (3) US20100153400A1 (fr)
WO (1) WO2009024974A2 (fr)


Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6109776A (en) * 1998-04-21 2000-08-29 Gene Logic, Inc. Method and system for computationally identifying clusters within a set of sequences
US6223186B1 (en) * 1998-05-04 2001-04-24 Incyte Pharmaceuticals, Inc. System and method for a precompiled database for biomolecular sequence information
US7020561B1 (en) * 2000-05-23 2006-03-28 Gene Logic, Inc. Methods and systems for efficient comparison, identification, processing, and importing of gene expression data
US20040117127A1 (en) * 2002-12-11 2004-06-17 Affymetrix, Inc. Methods, computer software products and systems for clustering genes
US20040142325A1 (en) * 2001-09-14 2004-07-22 Liat Mintz Methods and systems for annotating biomolecular sequences
US6941332B2 (en) * 2002-04-23 2005-09-06 Medtronic, Inc. Implantable medical device fast median filter
WO2006044839A2 (fr) 2004-10-18 2006-04-27 The Samuel Roberts Noble Foundation, Inc. Increasing wax production in plants

Also Published As

Publication number Publication date
US20100153400A1 (en) 2010-06-17
US20170351810A1 (en) 2017-12-07
US20130230916A1 (en) 2013-09-05
US9779205B2 (en) 2017-10-03


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08789812

Country of ref document: EP

Kind code of ref document: A2

WWE Wipo information: entry into national phase

Ref document number: 204036

Country of ref document: IL

WWE Wipo information: entry into national phase

Ref document number: 12733256

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08789812

Country of ref document: EP

Kind code of ref document: A1