WO2014098479A1 - Computer-implemented method for analyzing genomic variation or epigenetic variation - Google Patents

Computer-implemented method for analyzing genomic variation or epigenetic variation

Info

Publication number
WO2014098479A1
Authority
WO
WIPO (PCT)
Prior art keywords
variation
snp
cancer
snps
organism
Prior art date
Application number
PCT/KR2013/011823
Other languages
English (en)
Korean (ko)
Inventor
김성호
김민승
Original Assignee
연세대학교 산학협력단
Priority date
Filing date
Publication date
Priority claimed from KR1020130115261A (KR101538692B1)
Application filed by 연세대학교 산학협력단
Publication of WO2014098479A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B20/40 Population genetics; Linkage disequilibrium
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis

Definitions

  • the present invention relates to a computer-implemented method for analyzing genomic variation or epigenetic variation of an organism, and to a computer-readable storage medium and a system therefor.
  • GWAS genome-wide association study
  • the present inventors have tried to solve the above-mentioned problems of the prior art. As a result, the inventors have developed a novel variation analysis protocol that can more accurately analyze various variations (eg, genetic variations) found in an organism to obtain clinically meaningful predictive results.
  • an object of the present invention is to provide a computer-implemented method for analyzing genomic variation or epigenomic variation of an organism.
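  • the invention provides a computer-implemented method for analyzing genomic variation or epigenomic variation of an organism comprising the following steps: (a) constructing a linked string of the variants; (b) constructing a variation syntax (VAR-S) of a specific length by applying a sliding window of a specific length along the entire length of the linked string; (c) counting all possible features in the variation syntax of the specific length and assembling them into feature frequency profiles (FFPs); and (d) determining the distance between the FFPs or classifying the FFPs.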
  • the present invention takes an approach similar to "words" and "word frequency profiles" in natural language analysis, using two distinct concepts to express and analyze the organizational characteristics of an individual: variation syntax (VAR-S; called SNP syntax, SNP-S, when applied to SNPs) and the feature frequency profile (FFP) of VAR-S.
  • VAR-S variation syntax
  • SNP-S SNP syntax
  • FFP feature frequency profile
  • in a manner similar to "words" and "word frequency profiles" in natural language analysis (C. D. Manning & H. Schütze (1999). Foundations of Statistical Natural Language Processing. The MIT Press, 1st edn), the present inventors introduce two ideas: "SNP syntax (SNP-S)" and the "feature frequency profile (FFP) of SNP-S".
  • each genetic susceptibility allele for cancer may by itself have minor detrimental phenotypic consequences, pose a minor hereditary risk, have variable penetrance, appear in only a small portion of the population, and generally rarely cause a "fatal" outcome.
  • each genetic susceptibility allele can occur at minor frequencies in populations (E. T. Cirulli & D. B. Goldstein (2010). Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat Rev Genet 11(6): 415-425);
  • a particular cancer type has many subtypes, and in a single individual each subtype is caused by many genes with varying degrees of effect (M. Gerlinger, et al. (2012). Intratumor Heterogeneity and Branched Evolution Revealed by Multiregion Sequencing. N Engl J Med 366(10): 883-892); multiple inherited alleles (in genic and non-genic sequences) arising from various genetic variations make an individual susceptible to cancer, and most of them are present as minor alleles in that individual's genome;
  • Each set of genetic susceptibility alleles can trigger cancer development followed by driver mutations.
  • driver mutations e.g., events caused by pathogens, radiation, compounds, environmental factors, etc.
  • cancer driver alleles can trigger cancer development followed by driver mutations.
  • one or more consecutive waves of clonal expansion that depend on one or more consecutive acquisition of pj Stephens, et al. (2012) .
  • the correlation between cancer driver alleles and cancer susceptibility alleles may or may not be apparent, direct, or strong.
  • Genomic variations to be analyzed in the present invention include various variations found in an organism, preferably SNPs (single nucleotide polymorphisms), deletions, insertions or repeat variations in nucleotide sequences, or epigenomic variations. Examples of epigenetic variations include DNA methylation and histone modifications. Most preferably, the variation to be analyzed in the present invention is an SNP.
  • the variation to be analyzed is a variation present in a nucleotide sequence
  • the nucleotide sequence is a sequence on one chromosome, a sequence on a plurality of chromosomes, or a whole genome sequence, more preferably a whole genome (WG) sequence.
  • the variations to be analyzed in the present invention are SNPs in the whole genome sequence.
  • step (a) is carried out by assigning a code to each of the variants to build an associative string of the codes.
  • step (a) is performed by assigning a code to each genotype or haplotype of the SNP to construct a linked string of the codes.
  • for example, when analyzing SNPs of the human genome, there are ten possible SNP genotypes, and each SNP genotype may be assigned an alphabetic code to construct a linked string for the SNPs (see Table 3).
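  • A minimal sketch of this encoding step, in Python. Table 3 is not reproduced in this text, so the letter assigned to each genotype below is a hypothetical choice; only the count of ten unordered genotypes over A, C, G and T follows from the description.

```python
from itertools import combinations_with_replacement
from string import ascii_uppercase

# The ten unordered genotypes over {A, C, G, T}: AA, AC, AG, AT, CC, ...
GENOTYPES = ["".join(pair) for pair in combinations_with_replacement("ACGT", 2)]

# Hypothetical code table (the source's Table 3 is not available here):
# AA -> A, AC -> B, AG -> C, ... TT -> J.
GENOTYPE_CODE = dict(zip(GENOTYPES, ascii_uppercase))


def encode_genotypes(genotypes):
    """Return the linked string for SNP genotypes given in genome order."""
    # Sort the two alleles so that "GA" and "AG" map to the same code.
    return "".join(GENOTYPE_CODE["".join(sorted(g))] for g in genotypes)


linked = encode_genotypes(["AA", "GA", "CT", "TT"])
print(linked)
```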
  • the variants to be analyzed in the present invention are SNPs, and the SNPs are quality controlled by at least one method selected from the group consisting of (i) removal of SNPs with an allele frequency of 5% or less (more preferably 4% or less, even more preferably 3% or less), (ii) the Hardy-Weinberg equilibrium test and (iii) the plate-effect test.
  • the method further comprises the step of determining the optimal length of the variant syntax before step (b). Determination of the optimal length of variant syntax (eg, SNP-S) can be made in a variety of ways.
  • the optimal length of the variant syntax is determined empirically as the length that exhibits the highest accuracy for the phenotype of the organism.
  • the phenotype is preferably a disease (e.g., cancer).
  • alternatively, the optimal length of the variant syntax may be determined by selecting a length in the convergence section of the tree topologies constructed using the Robinson-Foulds distance (see reference).
  • step (b) is carried out using a sliding window having the determined optimal length.
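  • A short Python sketch of the sliding-window step. The step size of one position and the function name `snp_syntaxes` are assumptions for illustration; the text only states that a window of the chosen length is moved along the whole linked string.

```python
def snp_syntaxes(linked_string, length):
    """Yield every overlapping window (one VAR-S) of the given length."""
    # Assumed step size: the window advances one position at a time.
    for start in range(len(linked_string) - length + 1):
        yield linked_string[start:start + length]


windows = list(snp_syntaxes("ACGJAC", 3))
print(windows)
```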
  • the variation to be analyzed in the present invention is SNP, and at an SNP density of 1 million SNPs per genome the optimal length is 6-14 (more preferably 8-12, most preferably 10). As the SNP density increases, the optimal length also increases.
  • the FFPs whose distance is determined in step (d) are FFPs of rare VAR-S (e.g., SNP-S) of a specific length, which are filtered in.
  • a rare VAR-S (e.g., SNP-S) of a specific length is one found in 20% or less (more preferably 5% or less, even more preferably 3% or less) of a population carrying the variation to be analyzed.
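  • The filtering-in of rare syntax can be sketched as follows in Python. Profiles are assumed to be dictionaries mapping each VAR-S to its count in one individual; the function names and the toy data are hypothetical, and the 25% cutoff is chosen only so that the small example is easy to follow.

```python
from collections import Counter


def rare_features(profiles, max_fraction):
    """Return the features carried by at most max_fraction of individuals."""
    # Each individual counts once per feature, however often it occurs.
    carriers = Counter(feature for profile in profiles for feature in profile)
    limit = max_fraction * len(profiles)
    return {feature for feature, n in carriers.items() if n <= limit}


def filter_in(profile, keep):
    """Keep only rare features and normalize to their total count."""
    kept = {f: c for f, c in profile.items() if f in keep}
    total = sum(kept.values())
    return {f: c / total for f, c in kept.items()} if total else {}


profiles = [{"AB": 2, "CD": 1}, {"AB": 1}, {"AB": 3, "EF": 1}, {"AB": 1}]
rare = rare_features(profiles, max_fraction=0.25)
print(rare, filter_in(profiles[0], rare))
```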
  • the distance between the FFPs in step (d) can be obtained with various distance functions, for example Jensen-Shannon (JS) divergence, the Euclidean distance, the cosine distance, the Minkowski distance or Pearson linear correlation; most preferably, the distance between FFPs is obtained as the JS divergence.
  • JS Jensen-Shannon
  • for two FFPs $P_l$ and $Q_l$ of VAR-S of length $l$, the JS divergence is $JS(P_l, Q_l) = \tfrac{1}{2} RE(P_l, M_l) + \tfrac{1}{2} RE(Q_l, M_l)$, where $M_l = (P_l + Q_l)/2$ is the average FFP of $P_l$ and $Q_l$
  • RE is the relative entropy
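  • A Python sketch of the Jensen-Shannon divergence between two FFPs, using the standard definition as the mean relative entropy of each profile to their average. Profiles are assumed to be dictionaries of relative frequencies, and base-2 logarithms are an assumption of this example.

```python
import math


def relative_entropy(p, m):
    """Relative entropy RE(p, m) in bits; features with p = 0 contribute 0."""
    return sum(v * math.log2(v / m[f]) for f, v in p.items() if v > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two frequency profiles."""
    features = set(p) | set(q)
    # M is the average profile of P and Q over the union of their features.
    m = {f: 0.5 * (p.get(f, 0.0) + q.get(f, 0.0)) for f in features}
    return 0.5 * relative_entropy(p, m) + 0.5 * relative_entropy(q, m)


same = js_divergence({"AB": 0.6, "BA": 0.4}, {"AB": 0.6, "BA": 0.4})
disjoint = js_divergence({"AB": 1.0}, {"CD": 1.0})
print(same, disjoint)
```

  • With base-2 logarithms the value lies between 0 (identical profiles) and 1 (profiles with no shared feature).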
  • for example, if the individual whose SNP-S FFP has the smallest JS (Jensen-Shannon) divergence from that of a subject is a breast cancer patient, the subject may be determined to have high susceptibility to breast cancer.
  • the pairwise all-against-all distances thus obtained are stored in the distance matrix.
  • identical objects have a distance of zero and dissimilar objects have a large distance.
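  • The all-against-all distance matrix and the nearest-neighbor lookup can be sketched in Python as below. The distance function is passed in as a parameter; the absolute difference of plain numbers used in the example is only a placeholder for a real FFP distance such as the JS divergence.

```python
def distance_matrix(profiles, distance):
    """Return the symmetric all-against-all distance matrix as nested lists."""
    n = len(profiles)
    matrix = [[0.0] * n for _ in range(n)]  # the diagonal stays at zero
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = distance(profiles[i], profiles[j])
    return matrix


def nearest_neighbor(matrix, i):
    """Return the index of the individual closest to individual i."""
    others = [j for j in range(len(matrix)) if j != i]
    return min(others, key=lambda j: matrix[i][j])


# Placeholder data: numbers instead of FFPs, absolute difference as distance.
toy = distance_matrix([0.0, 1.0, 5.0], lambda a, b: abs(a - b))
print(toy, nearest_neighbor(toy, 2))
```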
  • the distance relationship between FFPs can be visualized in various ways (e.g., as a nearest-neighbor connection map or a phylogenetic tree) (see FIG. 3).
  • the FFPs may be classified using a class prediction algorithm, for example a support vector machine (SVM).
  • SVM support vector machine
  • the organism to be analyzed in the present invention is an animal, a plant, a fungus, a yeast, a bacterium or a protist.
  • Animals that can be analyzed by the present invention include, but are not limited to, mammals, strata, reptiles, and birds.
  • the animals analyzed by the present invention include humans, mice, rats, cattle, pigs, horses, sheep, rabbits, goats, birds, fish, and stratum.
  • Plants that can be analyzed by the present invention include, but are not limited to, monocotyledonous plants, dicotyledonous plants and algae.
  • the plants analyzed by the present invention are food crops, including rice, wheat, barley, corn, soybean potatoes, wheat, palm, oats and sorghum;
  • Vegetable crops including arabidopsis, cabbage, radish, peppers, strawberries, tomatoes, watermelons, cucumbers, cabbages, melons, pumpkins, green onions, onions, and carrots;
  • Special crops including ginseng, tobacco, cotton, sesame, sugar cane, sugar beet, perilla, peanuts and rapeseed;
  • Fruit trees including apples, pears, jujube peaches, lambs, grapes, chisels, persimmons, plums, apricots and bananas;
  • Flowers including roses, gladiolus, gerberas, carnations, chrysanthemums, lilies, and yul
  • Examples of bacteria that can be analyzed by the present invention include, but are not limited to, Escherichia coli, Thermus thermophilus, Bacillus subtilis, Bacillus stearothermophilus, Salmonella typhimurium, Pseudomonas, Streptomyces, Staphylococcus, Lactobacillus, Lactococcus and Streptococcus.
  • Examples of archaebacteria that can be analyzed by the present invention include Methanococcus jannaschii (Mj), Methanosarcina mazei (Mm), Methanobacterium thermoautotrophicum (Mt), Methanococcus maripaludis, Methanopyrus kandleri, Halobacterium, Archaeoglobus fulgidus, Pyrobaculum aerophilum, Pyrococcus abyssi, Sulfolobus solfataricus (Ss), Sulfolobus tokodaii, Aeropyrum pernix (Ap), Thermoplasma acidophilum and Thermoplasma volcanium.
  • Mj Methanococcus jannaschii
  • Mm Methanosarcina mazei
  • Mt Methanobacterium thermoautotrophicum
  • Protists that can be analyzed by the present invention include, but are not limited to, algae, Plasmodium, Phytophthora, slime molds and protozoans.
  • the variation to be analyzed is a variation associated with a trait of the organism, and the method of the invention is used to predict susceptibility to that trait.
  • the traits to be analyzed in the present invention are adverse traits, diseases, disorders, conditions or symptoms.
  • the diseases, disorders, conditions or symptoms may include cancer, tumors, chronic disease, infectious disease, neurological disease, metabolic disease, immune disease, inflammatory disease, cardiovascular disease, respiratory disease, bone disease, thyroid disease, otolaryngological diseases, ophthalmic diseases, dermatological diseases, dental diseases, endocrine diseases, gastrointestinal diseases, hereditary diseases, musculoskeletal disorders, arthritis, obesity and hyperlipidemia.
  • the trait to be analyzed in the present invention is a cancer disease.
  • the susceptibility of a subject to a cancer disease can be quantitatively predicted by the present invention.
  • the trait to be analyzed in the present invention is an advantageous trait, namely growth rate, yield or quality.
  • the variation to be analyzed in the present invention is a variation associated with the therapeutic responsiveness of the organism, and the method is used to predict the therapeutic responsiveness of the organism.
  • a representative example of therapeutic responsiveness is drug responsiveness. Responders, non-responders and adverse responders to a particular drug can be determined by the present invention.
  • the variations analyzed by the present invention can be used for multiclass cancer classification.
  • the most important thing in the treatment of cancer is the accurate diagnosis or information about the cancer of the patient.
  • Multiclass cancer classification is required for such accurate diagnosis (Ramaswamy S, et al. (2001). Multi class cancer diagnosis using tumor gene expression signatures. PNAS USA 98 (26): 15149-54).
  • the present invention can be used for this multiclass cancer classification.
  • the invention provides a computer-readable storage medium embodied with instructions instructing a computer processor to perform the following steps: (a) constructing a linked string of the variants; (b) constructing a variation syntax (VAR-S) of a specific length by applying a sliding window of a specific length along the entire length of the linked string; (c) counting all possible features in the variation syntax of the specific length and assembling them into feature frequency profiles (FFPs); and (d) determining the distance between the FFPs or classifying the FFPs.
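  • Steps (b) and (c) can be sketched together in Python: windows of a fixed length are counted and the counts assembled into a profile. Returning relative frequencies rather than raw counts, and the name `feature_frequency_profile`, are assumptions of this example.

```python
from collections import Counter


def feature_frequency_profile(linked_string, length):
    """Count every window of the given length and return relative frequencies."""
    counts = Counter(
        linked_string[i:i + length]
        for i in range(len(linked_string) - length + 1)
    )
    total = sum(counts.values())
    # An input shorter than the window yields an empty profile.
    return {f: c / total for f, c in counts.items()} if total else {}


ffp = feature_frequency_profile("ABABAB", 2)
print(ffp)
```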
  • VAR-S variation syntax
  • FFPs feature frequency profiles
  • the invention provides a system for analyzing genomic variation or epigenomic variation of an organism comprising: (a) a computer processor; and (b) the computer-readable storage medium coupled with the processor.
  • the storage medium and system of the present invention are for carrying out the method of the present invention as described above, and the contents in common between the two are omitted in order to avoid excessive complexity of the present specification.
  • the storage medium of the present invention is not particularly limited and includes various storage media known in the art, for example, CD-R, CD-ROM, DVD, data signals contained in carrier waves, flash memory, floppy disks, hard drives, magnetic tapes, MINIDISC, nonvolatile memory cards, EEPROM, optical disks, optical storage media, RAM, ROM, system memory and web servers.
  • the system of the present invention can be built in a variety of ways.
  • the system of the present invention may be built with a multiprocessor computer array, a web server and a multi-user / interactive system.
  • the system of the present invention may include various elements, for example: a variant (e.g., SNP) information storage database, a processor to create a linked string of variants, a processor to construct the variant syntax (e.g., SNP-S), a processor to determine the optimal length of the variant syntax (e.g., SNP-S), an FFP generator, a processor to determine the distances between FFPs, a processor to create the distance matrix, and a processor to visualize the distance matrix.
  • a detailed description of the integrated approach of the second invention is as follows:
  • the invention provides a computer-implemented method for analyzing genomic variation or epigenomic variation of an organism comprising the following steps: (a) constructing at least two kinds of descriptors for the variations; (b) applying at least two kinds of class prediction algorithms to each of the at least two kinds of descriptors to analyze the genomic variation or epigenomic variation of the organism and obtain at least four kinds of prediction results; and (c) finally predicting the trait of the organism by applying the at least four kinds of prediction results obtained in step (b) to an inference algorithm.
  • the basic strategy of the present invention is to analyze the variation of a particular individual (e.g., its susceptibility to specific traits) by applying at least two kinds of descriptors for the variation to each of at least two kinds of class prediction algorithms, and applying the results of these applications to an appropriate inference algorithm.
  • Genomic variations to be analyzed in the present invention include various variations found in organisms, preferably SNPs (single nucleotide polymorphisms), deletions, insertions or repeat variations in nucleotide sequences, or epigenomic variations.
  • the variation to be analyzed is a variation present in a nucleotide sequence
  • the nucleotide sequence is a sequence on one chromosome, a sequence on a plurality of chromosomes, or a whole genome sequence, more preferably a whole genome (WG) sequence.
  • the mutations to be analyzed in the present invention are SNPs in the entire genome sequence.
  • the at least two kinds of descriptors for the mutation are constructed.
  • the at least two kinds of descriptors for the variation are: (i) the profile of the variations (e.g., the profile of ordered SNPs), which assumes that each variation is independent of its neighboring variations, and (ii) the profile of the above-described variation syntax (VAR-S) (e.g., SNP syntax) of a specific length, in which neighboring variations are linked.
  • VAR-S (e.g., SNP syntax) is used as one of the two descriptors because each variation (e.g., SNP) position is not independent and is linked to its neighbors to varying degrees.
  • step (a) is carried out by constructing a string of codes by assigning a code to each of the variants.
  • step (a) is performed by assigning a code to each genotype of the SNP to construct a string of the codes.
  • the variations to be analyzed are SNPs, and the SNPs are quality controlled (QC) by at least one method selected from the group consisting of (i) removal of SNPs exhibiting an allele frequency of 5% or less (more preferably 4% or less, even more preferably 3% or less, still more preferably 2% or less, most preferably 1% or less), (ii) the Hardy-Weinberg equilibrium test and (iii) the plate-effect test. Such QC allows the analytical results of the present invention to be more accurate.
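  • A Python sketch of QC criterion (i) only. It assumes the allele frequency in question is the minor allele frequency, that genotypes are two-letter strings with no missing calls, and that SNP names such as `rs_a` are made up for illustration. The Hardy-Weinberg and plate-effect tests are not shown.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Frequency of the least common allele among two-letter genotypes."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if len(counts) < 2:
        return 0.0  # monomorphic site: no minor allele observed
    return min(counts.values()) / sum(counts.values())


def passes_maf_filter(genotypes, threshold=0.05):
    """Keep a SNP only if its minor allele frequency exceeds the threshold."""
    return minor_allele_frequency(genotypes) > threshold


snps = {"rs_a": ["AA", "AG", "AA", "AA"], "rs_b": ["CC"] * 4}
kept = [name for name, calls in snps.items() if passes_maf_filter(calls)]
print(kept)
```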
  • the class prediction algorithm applied in step (b) of the present invention includes various algorithms known in the art, for example, the k-nearest neighbor (KNN) algorithm (Bremner et al. (2005). "Output-sensitive algorithms for computing nearest-neighbor decision boundaries." Discrete and Computational Geometry 33(4): 593-604) and support vector machine (SVM) algorithms (Theodoridis).
  • KNN k-nearest neighbor
  • SVM support vector machine
  • the class prediction algorithm applied in step (b) is a k-nearest neighbor (KNN) algorithm.
  • SVM support vector machine
  • the k-nearest neighbor algorithm searches for the k nearest neighbors of the test subject. All pairwise "distances" between the descriptor of the test subject and the descriptor of each other individual are calculated. The KNNs of the test subject are then selected, and the subject is predicted to be susceptible to the most common trait among the KNNs.
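  • A Python sketch of this vote, assuming the distances from the test subject to every training individual have already been computed. The function name, the labels and k = 3 are illustrative, and ties between traits are resolved arbitrarily in this simple version.

```python
from collections import Counter


def knn_predict(distances, labels, k=3):
    """Predict the most common trait among the k nearest training individuals.

    distances[j] is the distance from the test subject to individual j,
    and labels[j] is that individual's known trait.
    """
    nearest = sorted(range(len(distances)), key=distances.__getitem__)[:k]
    votes = Counter(labels[j] for j in nearest)
    return votes.most_common(1)[0][0]


prediction = knn_predict(
    distances=[0.1, 0.5, 0.2, 0.9],
    labels=["BRCA", "COAD", "BRCA", "COAD"],
    k=3,
)
print(prediction)
```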
  • the support vector machine (SVM) algorithm is a fractional classification method that identifies the most likely class to which the test subject belongs.
  • SVM support vector machine
  • the SVM is trained to recognize the correct trait of an individual in each of all binary trait pairs.
  • the trait receiving the most votes across all pairwise SVM classifications is predicted as the most likely trait of the test subject.
  • when the descriptor for the variation is a VAR-S profile and the class prediction algorithm is the k-nearest neighbor algorithm, step (b) comprises: (b-1) a substep of selecting rare VAR-S found at a low frequency of less than 20% in a population; (b-2) a substep of normalizing to the total number of rare VAR-S; (b-3) a substep of constructing a Jensen-Shannon (JS) divergence matrix using the profile of the rare VAR-S; and (b-4) a substep of selecting the k nearest neighbors for the organism using the JS divergence matrix.
  • This embodiment is abbreviated as KNN / VAR-S (KNN / SNP-S, when applied to SNP).
  • KNN/SNP-S is described in more detail as follows. A vector of SNP-Ss is obtained for all members of the training set, followed by a feature selection step. At this stage, the syntax shared by more than a given percentage of the population is removed (filtered out) and the remainder (filtered in) is used for analysis. Each profile is then normalized to the total number of rare SNP syntax of the subject. Finally, the rare SNP syntax is used to build the JS divergence matrix between all members. JS divergence was chosen to measure the distance between descriptors because it is more predictive than other conventional measures such as allele sharing. The pairwise JS distance between the test subject and every other individual is measured, the classes of the k nearest individuals are tallied as votes, and the class with the highest count is selected.
  • when the descriptor for the variation is a profile of the variation and the class prediction algorithm is a support vector machine (SVM) algorithm, step (b) comprises: (b-1) a substep of selecting variations with a low P-value of $10^{-2}$ to $10^{-6}$; (b-2) a substep of performing SVM on each of all binary trait pairs; and (b-3) a substep of classifying according to a max-wins voting scheme.
  • This embodiment is abbreviated as SVM/VAR (SVM/SNP, when applied to SNP).
  • SVM can be implemented in a variety of ways, for example by the One-Versus-One (OVO) method.
  • the OVO method produces $n(n-1)/2$ classifiers, one for each pair of classes, and takes the class with the most votes among the $n(n-1)/2$ predictions for the test sample.
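  • The pair enumeration and the max-wins vote of the OVO scheme can be sketched in Python as below. No SVM is trained here: the outcome of each binary classifier is supplied by hand as a placeholder, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def ovo_pairs(classes):
    """Return the n(n-1)/2 class pairs, one per binary classifier."""
    return list(combinations(classes, 2))


def max_wins(pair_predictions):
    """Return the class or classes with the most pairwise wins.

    pair_predictions maps each class pair to the class chosen by that
    pair's binary classifier. More than one class is returned on a tie,
    so the caller can repeat the vote among the tied classes.
    """
    votes = Counter(pair_predictions.values())
    top = max(votes.values())
    return sorted(c for c, v in votes.items() if v == top)


classes = ["BRCA", "COAD", "KIRC", "LGG"]
pairs = ovo_pairs(classes)
# Placeholder outcomes standing in for trained binary SVMs.
outcomes = dict(zip(pairs, ["BRCA", "BRCA", "LGG", "KIRC", "LGG", "LGG"]))
print(len(pairs), max_wins(outcomes))
```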
  • LIBSVM from Chang et al. is used (Chang CC & Lin CJ (2011). LIBSVM: A Library for Support Vector Machines. ACM T Intel Syst Tec 2(3)).
  • the SNPs are filtered with a given P-value threshold (p) to select the SNPs associated with the two classes. Cutoffs below $10^{-6}$ are not recommended, since some classifiers are left with no SNPs after filtering by the association test. Each genotype is encoded. In the case of an ambiguous prediction (i.e., several classes tied for the most votes), the vote is repeated among the tied classes until the tie breaks. The SVM is trained to recognize the correct trait of an individual in each binary trait pair, and the most likely trait of the test subject is predicted as the one with the most votes across all pairwise SVM classifications.
  • when the descriptor is a profile of the variation and the class prediction algorithm is the k-nearest neighbor algorithm, step (b) comprises: (b-1) a substep of selecting rare variations found at a low frequency of less than 20% in the population; (b-2) a substep of normalizing to the total number of rare variations; (b-3) a substep of constructing a JS divergence matrix using the profile of the rare variations; and (b-4) a substep of selecting the k nearest neighbors (KNN) for the organism using the JS divergence matrix.
  • This embodiment is abbreviated as KNN/VAR (KNN/SNP, when applied to SNP).
  • when the descriptor is a VAR-S profile and the class prediction algorithm is a support vector machine (SVM) algorithm, step (b) comprises: (b-1) a substep of selecting VAR-S with a low P-value of $10^{-2}$ to $10^{-6}$; (b-2) a substep of performing SVM on each of all binary trait pairs; and (b-3) a substep of classifying according to a max-wins voting scheme.
  • This embodiment is abbreviated as SVM/VAR-S (SVM/SNP-S, when applied to SNP).
  • SVM/SNP-S is performed in the same way as SVM/SNP, but using SNP-S instead of individual SNPs.
  • an additional parameter for the optimal length of SNP-S is used.
  • in step (c), the at least four kinds of prediction results obtained in step (b) are applied to an inference algorithm to finally predict the trait of an organism or individual whose trait has not been determined.
  • the inference algorithm used in step (c) includes a Bayesian inference algorithm and a voting scheme, most preferably a Bayesian inference algorithm.
  • each phenotype is labeled with the first character of its trait name (e.g., B for BRCA, C for COAD).
  • Bayesian inference of the prediction results of the four methods is used. These methods have the following shorthand: KNN / SNP-S, KNN / SNP, SVM / SNP-S, SVM / SNP.
  • the predictions of these methods are denoted $m_1$, $m_2$, $m_3$ and $m_4$, respectively.
  • the trait $s$ with the highest posterior probability conditioned on the predictions obtained from the methods is selected, which can be formulated as $\arg\max_{s} P(s \mid m_1, m_2, m_3, m_4)$.
  • by Bayes' theorem this can be expressed as $P(s \mid m_1, m_2, m_3, m_4) = \frac{P(m_1, m_2, m_3, m_4 \mid s)\, P(s)}{P(m_1, m_2, m_3, m_4)}$. Since the denominator $P(m_1, m_2, m_3, m_4)$ is a normalization constant, it is omitted. Since the prediction decisions of the methods are taken to be independent of each other, the chain rule is applied (Zhang H (2005). Exploring conditions for the optimality of naive Bayes. Int J Pattern Recogn 19(2): 183-198): $P(s \mid m_1, m_2, m_3, m_4) \propto P(s) \prod_{i=1}^{4} P(m_i \mid s)$
  • for example, $P(m_1 = C \mid s = B)$ can be estimated as the fraction of true BRCA individuals that the KNN/SNP-S method estimates to be COAD.
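  • A Python sketch of this naive Bayes combination. The per-method likelihood tables are estimated from training results as simple fractions with no smoothing, which is an assumption of this example (a prediction never observed for a trait gets probability zero); function names and toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_likelihoods(true_labels, predicted_labels):
    """Estimate P(prediction | true trait) for one method from training results."""
    counts = defaultdict(Counter)
    for truth, predicted in zip(true_labels, predicted_labels):
        counts[truth][predicted] += 1
    return {
        truth: {pred: n / sum(row.values()) for pred, n in row.items()}
        for truth, row in counts.items()
    }


def posterior(priors, tables, predictions):
    """Combine one prediction per method into a posterior over traits."""
    scores = {}
    for trait, prior in priors.items():
        score = prior
        for table, predicted in zip(tables, predictions):
            score *= table.get(trait, {}).get(predicted, 0.0)
        scores[trait] = score
    total = sum(scores.values())  # the omitted normalization constant
    return {t: s / total for t, s in scores.items()} if total else scores


truth = ["BRCA"] * 4 + ["COAD"] * 4
predicted = ["BRCA", "BRCA", "BRCA", "COAD", "COAD", "COAD", "COAD", "BRCA"]
table = estimate_likelihoods(truth, predicted)
# Two methods with the same table both predict BRCA for the test subject.
post = posterior({"BRCA": 0.5, "COAD": 0.5}, [table, table], ["BRCA", "BRCA"])
print(table["BRCA"]["COAD"], post)
```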
  • the trait of the organism is a disease, disorder, condition, symptom or therapy responsiveness.
  • the trait to be analyzed in the present invention is a cancer disease.
  • susceptibility to cancer disease in one subject can be quantitatively predicted by the present invention.
  • the trait to be analyzed in the present invention is an advantageous trait, namely growth rate, yield or quality.
  • the variant to be analyzed in the present invention is a variation associated with the therapeutic responsiveness of the organism and the method is used to predict the therapeutic response of the organism.
  • a representative example of therapeutic responsiveness is drug responsiveness. Respondents, non-respondents, and reverse respondents for a particular drug can be determined by the present invention.
  • the invention provides a computer-readable storage medium embodied with instructions instructing a computer processor to perform the following steps: (a) constructing at least two kinds of descriptors for the variations; (b) applying at least two kinds of class prediction algorithms to each of the at least two kinds of descriptors to analyze the genomic variation or epigenetic variation of the organism and obtain at least four kinds of prediction results; and (c) finally predicting the trait of the organism by applying the at least four kinds of prediction results obtained in step (b) to an inference algorithm.
  • the invention provides a system for analyzing genomic variation or epigenomic variation of an organism comprising: (a) a computer processor; And (b) the computer-readable storage medium coupled with the processor.
  • the present invention is similar to comparing two texts with words of natural language, and through this method provides a systematic feature frequency profile (FFP) for various variations (eg SNPs) found in an individual.
  • FFP feature frequency profile
  • the present invention also determines the distance between FFPs to accurately predict susceptibility to certain traits of an individual.
  • cancer sensitivity of an individual can be predicted with an accuracy of 47 to 76% even when the sample size is small.
  • this accuracy can be increased by increasing the size of the SNP genotype data, and can be further increased by classifying in advance.
  • the prediction accuracy of the second invention is several times higher than that of random prediction, an improvement large enough to help determine the health state of an individual or a population.
  • FIG. 1 is a diagram of the process of a method for assessing the sensitivity of eight cancer types.
  • This method includes several processes: preprocessing of SNP data, such as sample quality screening and genotype coding; filtering out of common SNP syntax and profiling of SNP syntax frequencies (FFPs of SNP-Ss); and calculating the distances between pairs of FFPs to identify the nearest neighbor ("sister"), i.e. the individual at the smallest "distance".
  • FIG. 2 is a graph showing the accuracy of cancer susceptibility assessment as a function of the length (l) and the percentage filtering-in of SNP-S. The performance of risk assessment for multiclass cancers was measured while increasing the length (l) and reducing the percentage filtering-in.
  • 2% filtering-in means retaining the SNP-S of length l found in less than 2% of the population; that is, only "rare" SNP-S present in less than 2% of the population are retained.
  • the gray line represents the baseline accuracy, and "No syntax" means the accuracy obtained by comparing the entire SNPs as unlinked features, i.e., without using the FFP of SNP-Ss.
  • FIG. 3 is a nearest-neighbor connection map.
  • the nearest neighbors of 594 individuals (66 for each of the eight cancers and 66 controls) were identified. Each individual is represented by the FFP of its rare SNP-S of length 10 (2% filtered-in), and the nearest neighbor of an individual is defined as the other individual whose FFP has the smallest Jensen-Shannon divergence (distance) from that of the first.
  • Cancer types are listed outside the outer circle and indicated by different colors in the inner circle; each curve inside the circle connects two individuals in a nearest-neighbor ("sister") relationship. The color of a curve is that of the cancer type of the nearest neighbor found by the searching member (which is also the color of the small cell in the outer circle).
  • the nearest neighbor found may be of the same cancer type ("true" sister) or not ("error" sister). If the searching and found individuals are interchangeable, the curve is drawn as a thick line. The curves for all error-sister relationships are shown in dark gray.
  • the color scheme is as follows: CEU, red; BRCA, orange; COAD, bright orange; HNSC, yellow; KIRC, green; LGG, light blue; OV, blue; READ, dark blue; UCEC, purple. This map was created using Circos.
  • FIG. 4 is a genome mapping of the susceptibility marker alleles on chromosome 3.
  • the density of the sensitive marker allele is indicated by heat-maps on colored circle tracks for each cancer type (from inside to outside, CEU, red; BRCA, orange; C0AD, light orange; HNSC, yellow; KIRC , Green; LGG, light blue; 0V, blue; READ, dark blue; UCEC, purple), high density areas are indicated in dark colors.
  • the outermost track shows the cytoband of chromosome 3, and the labeled light blue tick marker indicates the location of the known cancer gene.
  • On the cytoband track, the short blue arcs mark individual cytobands with one or more GWAS hits for known cancers in the Caucasian population.
  • The short green bars on the next inner track show the gene-coding regions, and the next inner circle shows the density of the published SNPs. This map was created using Circos.
  • FIG. 5 maps the susceptibility marker alleles near the loci of two known cancer genes (BRCA2 and TP53). Each marker allele is represented by a cluster of circles (the SNPs making up a specific SNP-S10 allele) in the color of the cancer type.
  • The X-axis represents the physical position on chromosome 17 or 13, where TP53 and BRCA2 are located, respectively, and the Y-axis separates the marker alleles of the different cancer types: CEU, red; BRCA, orange; COAD, bright orange; HNSC, yellow; KIRC, green; LGG, light blue; OV, blue; READ, dark blue; UCEC, purple.
  • The susceptibility marker alleles do not overlap with TP53 or BRCA2 (each indicated by a dashed vertical line). The recombination rate is indicated by blue spikes, and the other genes surrounding the two genes are shown in the lower box of each panel. This figure was produced using LocusZoom.
  • FIG. 6 shows QC (Quality Control) results.
  • the graph shows the overall accuracy of different datasets from different QC criteria as a function of filtering threshold (left: dataset with HapMap control, right: dataset without HapMap control).
  • THM1 (dataset used in this study), TCGA and HapMap data with MAF>0.01 and two filters, HWE and plate effect; THM5, same as THM1 except MAF>0.05; THM0, same as THM1 except without MAF filtering; THM0R, same as THM0 except without the HWE and plate effect tests; TM5, TCGA data with MAF>0.05 and two filters, HWE and plate effect; TM1, same as TM5 except MAF>0.01; TM0, same as TM1 except without MAF filtering; TM0R, same as TM0 except without the two filters, HWE and plate effect.
  • BRCA, OV, and UCEC were selected for this analysis; the other classes were excluded because of their limited sample sizes.
  • (Right) Accuracy decreases as the number of classes increases from 3 to 6 to 9 (three classes: BRCA, COAD, and CEU; six classes: BRCA, COAD, HNSC, KIRC, OV, and CEU; nine classes: BRCA, COAD, HNSC, KIRC, LGG, OV, READ, UCEC, and CEU).
  • Each feature dataset size was fixed at 66 individuals.
  • The method of the present invention comprises SNP data preprocessing (including sample quality-control screening and genotype encoding), selection of low p-value SNPs and low-frequency SNP syntaxes, application of two different analysis algorithms, and a final prediction step that integrates the results of the four methods.
  • FIG. 10A shows the optimization of the parameters used in applying the k-nearest neighbor algorithm to the profiles of SNP-Ss.
  • FIG. 10B shows the optimization of the parameters used in applying the k-nearest neighbor algorithm to the profiles of SNPs.
  • FIG. 10C shows the optimization of the parameters used in applying the SVM algorithm to the profiles of SNPs.
  • FIG. 10D shows the optimization of the parameters used in applying the SVM algorithm to the profiles of SNP-Ss.
  • FIGS. 11A-11C show the 9-class prediction results on the test set for each of three cancer classes: BRCA (FIG. 11A), OV (FIG. 11B) and UCEC (FIG. 11C). The prediction results of the four methods and of the Bayesian inference for 50 test subjects of each of the three cancer classes are shown. The dotted horizontal line represents random prediction, and the tick marks on each bar represent the standard error of the prediction result, measured by resampling 50 test subjects 10 times.
  • The SNP array results of 165 CEU individuals (typed on the Affymetrix 6.0 SNP array) were downloaded from the HapMap FTP website. The data were genotyped using Affymetrix Power Tools with default parameter settings, and samples reported on the website to have low sample quality were discarded (see Table 1).
  • TCGA, The Cancer Genome Atlas project initiated by the National Institutes of Health (NIH); Breast Invasive Carcinoma (BRCA); Colon Adenocarcinoma (COAD); Head and Neck Squamous Cell Carcinoma (HNSC); Kidney Renal Clear Cell Carcinoma (KIRC); Brain Lower Grade Glioma (LGG); Ovarian Serous Cystadenocarcinoma (OV); Rectum Adenocarcinoma (READ); Uterine Corpus Endometrioid Carcinoma (UCEC); Haplotype Map project (HapMap); CEU, Caucasians from Utah, USA; European-American (EA); PI_HAT, PLINK parameter estimating how closely two individuals are related; sample quality control + genetic relatedness test (PI_HAT cutoff of 0.2) + removal of self-released wheat for EA individuals.
  • Sample quality control
  • SNP Single Nucleotide Polymorphism
  • QC Quality Control
  • MAF Minor Allele Frequency
  • TCGA, The Cancer Genome Atlas project initiated by the National Institutes of Health (NIH); HapMap, Haplotype Map project of worldwide human populations; CEU, Caucasians from Utah, USA; THM1 (dataset used in this study), TCGA and HapMap data with MAF>0.01 and two filters, HWE and plate effect; THM5, same as THM1 except MAF>0.05; THM0, same as THM1 except without MAF filtering; THM0R, same as THM0 except without the HWE and plate effect tests; TM5, TCGA data with MAF>0.05 and two filters, HWE and plate effect; TM1, same as TM5 except MAF>0.01; TM0, same as TM1 except without MAF filtering; TM0R, same as TM0 except without the two filters, HWE and plate effect; X indicates that the dataset did not undergo the associated QC step, and O indicates that it did.
  • SNP code conversion
  • SNP-S SNP syntax
  • The feature frequency profile (FFP) of an individual is a vector representing the systematic characteristics of the individual's whole-genome (WG) SNPs. It is constructed by sliding a window of fixed length along the entire SNP string of the individual's genome and counting all possible features (SNP-Ss in this case) (see Sims GE, et al. (2009) Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc Natl Acad Sci USA 106(8):2677-2682).
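The sliding-window construction of an FFP can be sketched as follows. This is a minimal illustration under stated assumptions, not the inventors' implementation: the function name `feature_frequency_profile` and the toy SNP string are hypothetical, and plain integer counts are used instead of the fractional counts that the method assigns when heterozygous or missing markers are present.

```python
from collections import Counter


def feature_frequency_profile(snp_string: str, length: int) -> Counter:
    """Count every window of `length` consecutive SNP codes in one SNP string."""
    if length <= 0 or length > len(snp_string):
        return Counter()
    return Counter(
        snp_string[start:start + length]
        for start in range(len(snp_string) - length + 1)
    )


# Toy example: a string of 7 encoded SNPs yields 5 overlapping windows of length 3.
profile = feature_frequency_profile("ABBABBA", 3)
```

A string of n codes always yields n - length + 1 windows, so the profile size grows with the number of distinct windows rather than with the genome length alone.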
  • the optimal feature length for profiling is determined to be the length that shows the highest accuracy in calculating cancer sensitivity.
  • The optimal length was 10 (FIG. 2). Since each SNP carries genotype information without the exact chromosomal (haplotype) phase of its alleles, the number of heterozygotes in an SNP syntax determines how many haplotypes may underlie that syntax. Thus, on the premise that each haplotype is equally likely to occur in a given syntax, the "count" of an occurrence of an SNP-S is inversely proportional to the number of possible haplotypes represented by that SNP-S.
  • Although the study dataset contains no missing genotypes, such cases can be handled easily: all features arising from the combinations of possible genotypes at the untyped markers are included, and their counts are divided by the total number of SNP-Ss generated by this expansion. The following equation represents the count in all cases:
  • x is the count of SNP-S (count in fraction)
  • i is the number of heterozygotes in SNP-S
  • j is the number of missing markers in the SNP-S sequence (see Table 4 ).
  • Percent filtering-in, normalization, and Jensen-Shannon divergence matrix: FFPs of SNP-S10 counts are obtained for all members; the syntaxes shared by more than a few percent of the population are then removed (filtered out), and the remainder is kept for analysis (filtered in). The count of each remaining rare syntax is then normalized by the individual's total number of rare SNP syntaxes. Finally, the rare SNP syntaxes produced by percentage filtering-in were used to construct a Jensen-Shannon (JS) divergence matrix between all members.
  • Identification of nearest neighbors ("sisters") and accuracy of susceptibility prediction
  • Section I presents the two ideas of the invention, SNP-S and FFP, and how the overall systematic features of an individual's WG SNPs can be represented by the FFP of SNP-Ss; Section II shows the process of empirically identifying the optimal length of SNP-Ss and the optimal filtering level for revealing "rare" SNP-Ss, so as to best practice the method of the present invention; Section III details the susceptibility predictions; Section IV summarizes the validation results for the approach of the present invention; and Section V shows the genomic locations of the susceptibility SNP-S alleles relative to known cancer genes, cancer-related SNPs recently identified by GWASs, and other genomic features.
  • the method of the present invention for comparing the systematic features of any two individuals' WG SNPs comprises four steps:
  • Concatenated WG SNP strings: The present invention starts from the most general description of the systematic features of an individual's WG SNPs, similar to the description of text in natural language processing (Manning CD & Schuetze H (1999) Foundations of Statistical Natural Language Processing. The MIT Press, 1st edn). A very important difference is that the concatenated WG SNPs are treated as natural-language text without spaces between words.
  • An individual's WG SNPs are represented by a single concatenated string of the SNPs in their order along the individual's genome, with each SNP genotype written as one of ten alphabetic codes representing the ten possible genotypes of an SNP, assuming two alleles per genotype (see Table 3).
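The ten-letter genotype encoding can be illustrated with a short sketch. The ten codes arise because an unordered pair of alleles drawn from A, C, G and T has exactly ten possibilities (four homozygous, six heterozygous). The letter assignment below (A to J) and the names `GENOTYPE_CODES` and `encode_genotypes` are assumptions made for illustration; the actual assignment used by the method is the one given in Table 3.

```python
from itertools import combinations_with_replacement

ALLELES = "ACGT"

# Hypothetical letter assignment; the real mapping is defined in Table 3.
GENOTYPE_CODES = {
    pair: letter
    for pair, letter in zip(combinations_with_replacement(ALLELES, 2), "ABCDEFGHIJ")
}


def encode_genotypes(genotypes):
    """Map unphased genotypes such as 'AG' or 'GA' to one letter each."""
    return "".join(GENOTYPE_CODES[tuple(sorted(g))] for g in genotypes)


# 'GA' and 'AG' are the same unphased genotype, so both map to the same letter.
snp_string = encode_genotypes(["AA", "GA", "TT"])
```

Sorting the two alleles before the lookup is what makes the code independent of allele order, which matches the fact that the genotypes carry no haplotype phase.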
  • SNP syntax: An SNP syntax (SNP-S) is defined as a short ordered string of SNPs of a given length, which plays a role similar to that of a "word" of a certain length in natural-language text. All possible SNP-Ss of a given length (l) for one genome are obtained by sliding a window of length l along the entire SNP string of the genome. Thus, SNP-Ss capture not only the systematic features of SNPs caused by various genetic mutations, but also those known to exist in WG SNPs, such as linkage disequilibrium.
  • FFP of SNP-Ss is suitable for showing the overall systematic characteristics of WG SNPs.
  • The FFP of "rare" SNP-Ss is better suited to capturing disease-specific systematic features. Therefore, the following two assumptions were used as working criteria in the empirical search for rare SNP-Ss of optimal length for cancer susceptibility studies:
  • The number of sets of genomic susceptibility alleles for one particular cancer type may be large, but it is assumed to be limited.
  • For the FFP of the rare SNP-Ss of one individual with cancer, there will be one or more other individuals ("sisters") in the population of the same cancer type whose FFPs are much more similar to it than those of individuals with other cancer types or of the control group (this assumption is shown to be correct in the study described below).
  • Table 5 summarizes the prediction accuracy for the genetic sensitivity determined by the method of the present invention.
  • A dataset of 66 samples for each of the eight cancer types and the controls was used to optimize two parameters: the length of SNP-Ss and the filtering-in percentage (2%). Because little SNP data is available in public databases, three cancers with somewhat more data (BRCA, OV and UCEC) were chosen for testing. For each of the three cancers, 66 new samples not included in the dataset used for optimization were randomly selected, and the two optimized parameters were used to calculate the accuracy of susceptibility prediction for that cancer type. This process was repeated 10 times for each of the three cancers, and the average accuracy was calculated.
  • SNP-S10 susceptibility alleles: To localize the genomic regions covered by the SNP-S10 susceptibility alleles of a cancer type, the SNP-S10s that appear only among members of that cancer type, are shared by one or more true-sister pairs, and are not detected in any other cancer type were identified (referred to as "susceptibility SNP-S marker alleles" or, hereinafter, "susceptibility marker alleles"). These were then analyzed at three levels: (1) a global view of all marker alleles across the entire genome, (2) an intermediate view of the positions of the marker alleles on one chromosome, and (3) a close-up view of the positions of the marker alleles relative to several known cancer genes. Table 6 shows the quartiles for the entire genome.
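The selection rule for marker alleles can be sketched with set operations. This is an illustrative reading of the rule, not the inventors' code: the function name `marker_alleles`, the representation of each individual as a set of rare SNP-Ss, and the `sister_of` index list are all assumptions.

```python
def marker_alleles(profiles, labels, sister_of, cancer):
    """Return SNP-Ss shared by a true-sister pair of `cancer` and absent elsewhere.

    profiles[i]  : set of rare SNP-Ss carried by individual i (assumed layout)
    labels[i]    : class label of individual i
    sister_of[i] : index of the nearest neighbor ("sister") of individual i
    """
    seen_elsewhere = set()
    for features, label in zip(profiles, labels):
        if label != cancer:
            seen_elsewhere |= features

    markers = set()
    for i, j in enumerate(sister_of):
        # A "true" sister pair: query and nearest neighbor share the cancer type.
        if labels[i] == cancer and labels[j] == cancer:
            markers |= (profiles[i] & profiles[j]) - seen_elsewhere
    return markers


toy_profiles = [{"s1", "s2"}, {"s1", "s3"}, {"s2", "s4"}, {"s3"}]
toy_labels = ["BRCA", "BRCA", "OV", "OV"]
toy_sisters = [1, 0, 3, 2]
brca_markers = marker_alleles(toy_profiles, toy_labels, toy_sisters, "BRCA")
```

In the toy data, `s1` is shared by the BRCA true-sister pair and never seen in OV, so it is the only BRCA marker; `s2` and `s3` are excluded because OV individuals also carry them.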
  • Gene annotation data were downloaded from the gene track of the UCSC Table Browser for human genome build 19, disease genes were downloaded from the GAD (Genetic Association Studies of Complex Diseases and Disorders) track of the UCSC Table Browser, and cancer genes were annotated from the Cancer Gene Census of the Wellcome Trust Sanger Institute.
  • FIG. 4 shows the result of mapping the susceptibility marker alleles of the eight cancers on chromosome 3 (on which many cancer genes have been identified) relative to various known features of that chromosome (e.g., the locations of known cancer genes, the SNP density, the gene-coding regions, and the cytobands in which cancer susceptibility GWAS hits are found). The following general observations were made:
  • FIG. 5 shows two examples in which marker alleles are mapped near the coding regions of two well-known cancer genes, (a) TP53 and (b) BRCA2. All marker alleles of the two cancers (BRCA and COAD) for which the most cancer genes are recorded in the OMIM database were examined. The following experimental results were obtained:
  • The susceptibility marker alleles of the two cancer types do not overlap with the positions of the two genes; they overlap with other nearby genes;
  • the present inventors introduce the concept of SNP syntax (SNP-S) and the feature frequency profile of this syntax to provide a method for analyzing the systematic characteristics of WG SNPs of an individual. Subsequently, multiclass cancer susceptibility was evaluated by comparing the FFP of the rare SNP-Ss of each individual with the FFP of the control individual and those with eight main cancers. Although the amount of SNP data currently available in the TCGA database is small, the present invention predicts genetic susceptibility to eight major cancers with an accuracy in the range of about 47-76%, depending on the type of cancer. This accuracy will increase as the size of the sample for each cancer type increases, and the increase in sample size will be readily obtainable by current sequencing techniques.
  • the findings of the present invention support the "multiple assortment model" for cancer susceptibility:
  • An individual's susceptibility to cancer is associated with a set of many rare SNP syntaxes (sister-specific marker alleles) located in the non-coding regions of the genome (Table 6);
  • Each set of alleles can be expected to be drawn from a cancer-specific marker allele "assortment", which is the collection of all rare, sister-specific marker alleles for one type of cancer.
  • Discussion
  • the present invention predicts genetic susceptibility to eight major cancers with an accuracy in the range of about 47-76%, depending on the type of cancer. Although increasing the sample size for one cancer type increases accuracy ( Figure 7), it does not reach 100%. Not all genetic susceptibility to one cancer type triggers cancer, and in most cases, the occurrence of cancer requires one or more triggering events that are non-genetic.
  • The present invention can provide information of practical value: a quantitative prediction of the size of the population with high genetic susceptibility to cancer is very useful for establishing cancer prevention policies and cost-control strategies. Similarly, predicting the genetic susceptibility of an individual provides motivation for prevention and for early diagnosis. Other possible applications of the present invention include the study of genetic susceptibility to other diseases, such as chronic, infectious and neurological diseases. In addition, if genome data are available for a stratified sample, the present invention may also be useful for determining a patient's sensitivity to, and therapeutic benefit from, a particular treatment. It may likewise be applied to assess patients' sensitivity in clinical trials, which may increase the likelihood of efficacy and reduce the risk of adverse events.
  • each SNP genotype was converted to a number of 0, 1 or 2 depending on the number of minor alleles in that genotype;
  • each SNP of the SNP-S descriptor was converted to one of 10 alphabets (see Table 3).
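The numeric 0/1/2 conversion used for the plain-SNP descriptor can be sketched as below. The function names and the convention that the caller supplies the minor allele of each SNP are assumptions made for this example; how the minor allele is determined from the population is not shown here.

```python
def minor_allele_dosage(genotype: str, minor_allele: str) -> int:
    """Return 0, 1 or 2: the number of minor-allele copies in a diploid genotype."""
    if len(genotype) != 2:
        raise ValueError("expected a diploid genotype such as 'AG'")
    return genotype.count(minor_allele)


def encode_individual(genotypes, minor_alleles):
    """Encode one individual's genotypes, given the minor allele of each SNP."""
    return [minor_allele_dosage(g, m) for g, m in zip(genotypes, minor_alleles)]


dosages = encode_individual(["AA", "AG", "GG"], ["G", "G", "G"])
```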
  • kNN/SNPS method: k-nearest neighbor (kNN) algorithm applied to SNP syntaxes (SNP-Ss)
  • Vectors of SNP-Ss for all members of the training set were obtained and then the feature screening step proceeded.
  • The syntaxes shared by more than a given percentage of the population are removed (filtered out), and the remainder (filtered in) is used for analysis. Each remaining count was then normalized by the subject's total number of rare SNP syntaxes.
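The filtering-in and normalization step can be sketched as follows. This is a minimal illustration: the function name `filter_in_rare`, the dictionary layout of the profiles, and the strict "fewer than" comparison against the threshold are assumptions, and the toy threshold of 75% is chosen only so that a four-person example behaves visibly.

```python
from collections import Counter


def filter_in_rare(profiles, max_fraction):
    """Keep SNP-Ss carried by fewer than `max_fraction` of individuals, then normalize.

    profiles: list of {snp_syntax: count} dictionaries, one per individual.
    """
    carriers = Counter()
    for profile in profiles:
        carriers.update(profile.keys())  # one carrier per individual per syntax

    limit = max_fraction * len(profiles)
    filtered = []
    for profile in profiles:
        rare = {s: c for s, c in profile.items() if carriers[s] < limit}
        total = sum(rare.values())
        filtered.append({s: c / total for s, c in rare.items()} if total else {})
    return filtered


toy = [
    {"AAA": 2, "ABC": 1, "ABD": 3},
    {"AAA": 1, "ABD": 1},
    {"AAA": 4},
    {"AAA": 1, "ABC": 1},
]
rare_profiles = filter_in_rare(toy, 0.75)
```

Here `AAA` is carried by all four individuals and is filtered out, while `ABC` and `ABD` (two carriers each) are kept and rescaled so that each individual's remaining frequencies sum to 1.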
  • The rare SNP syntaxes were used to construct a Jensen-Shannon (JS) divergence matrix among all members. JS divergence was chosen to measure the distance between descriptors because it is more predictive than other conventional measures such as allele sharing.
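A pairwise JS divergence matrix over normalized profiles can be computed as in the sketch below. The function names and the dictionary representation are assumptions; base-2 logarithms are used so that the divergence lies between 0 (identical profiles) and 1 (profiles with no shared syntax).

```python
import math


def js_divergence(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence (base-2 logs) between two normalized profiles."""
    total = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution
        if pk > 0:
            total += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            total += 0.5 * qk * math.log2(qk / mk)
    return total


def js_matrix(profiles):
    """Symmetric matrix of JS divergences between all pairs of profiles."""
    n = len(profiles)
    matrix = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            matrix[a][b] = matrix[b][a] = js_divergence(profiles[a], profiles[b])
    return matrix


example = js_matrix([{"x": 1.0}, {"y": 1.0}, {"x": 0.5, "y": 0.5}])
```

The nearest neighbor ("sister") of individual i is then the index of the smallest off-diagonal entry in row i.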
  • The pairwise JS distances were measured between all individuals; for each individual, the k nearest individuals voted among the 9 classes, and the class with the highest count was selected.
  • When more than one class had the highest count, the class with the shortest average distance from the target individual among the members of the top k was selected. Accuracy was measured as the fraction of correct assignments over all members. To obtain the best accuracy of cancer susceptibility estimation on the training dataset, the length l of SNP-S, the parameter f for low-frequency selection, and the parameter k were optimized. The optimal values were 8, 1, and 40 for l, f, and k, respectively (FIG. 10A, Table 8a). In the testing phase, the same optimal l and f were used. The JS distance vector between the test subject and the training samples was then measured, and the test subject's class was predicted through the same selection process as in the training phase, with the optimal k.
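The voting rule, including the tie-break by shortest average distance, can be sketched as follows. The function name `knn_predict` and the list-based inputs are assumptions; the distances would in practice be the JS distances from the test subject to the training samples.

```python
from collections import defaultdict


def knn_predict(distances, labels, k):
    """Predict a class from the k nearest training samples.

    distances[i] : distance from the test subject to training sample i
    labels[i]    : class label of training sample i
    """
    nearest = sorted(range(len(distances)), key=distances.__getitem__)[:k]
    by_class = defaultdict(list)
    for idx in nearest:
        by_class[labels[idx]].append(distances[idx])
    # Most votes wins; a tie goes to the class with the shortest mean distance.
    return min(
        by_class,
        key=lambda c: (-len(by_class[c]), sum(by_class[c]) / len(by_class[c])),
    )


tie_case = knn_predict(
    [0.1, 0.5, 0.2, 0.4, 0.9], ["BRCA", "OV", "OV", "BRCA", "UCEC"], k=4
)
majority_case = knn_predict([0.1, 0.2, 0.3], ["OV", "BRCA", "BRCA"], k=3)
```

In the tie case BRCA and OV each receive two votes, and BRCA is chosen because its mean distance (0.25) is smaller than that of OV (0.35); in the majority case BRCA wins on votes even though the single closest sample is OV.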
  • The SNP-S descriptors were replaced with SNPs, and the kNN model was rebuilt in the same manner as in 1) above. Unlike SNP-S, each SNP was converted to the numeric form 0, 1 or 2, according to the count of minor alleles in the genotype. For the SNPs, the f and k parameters were trained (see FIG. 10B, Table 8b). The optimal values of f and k were 15% and 200, respectively.
  • SVM is a supervised classification method that was originally designed for building binary classifiers and was later extended in various ways to build multiclass classifiers.
  • OVO One-Versus-One
  • The OVO method builds one binary classifier for each pair of the n classes, n(n-1)/2 classifiers in total, and assigns the test sample to the class that receives the most votes among their predictions.
  • LIBSVM: To implement the OVO SVM method, LIBSVM by Chang et al. was used (Chang CC & Lin CJ (2011) LIBSVM: A Library for Support Vector Machines).
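The one-versus-one voting scheme can be illustrated without any SVM library. In the sketch below, `pairwise_winner` is a hypothetical stand-in for a trained binary classifier (in the method itself these are SVMs built with LIBSVM), and the one-dimensional prototypes are toy values invented for the example.

```python
from collections import Counter
from itertools import combinations


def n_pairwise_classifiers(n_classes: int) -> int:
    """Number of binary classifiers needed by one-versus-one for n classes."""
    return n_classes * (n_classes - 1) // 2


def ovo_predict(classes, pairwise_winner, sample):
    """Let every pair of classes vote and return the class with the most votes."""
    votes = Counter(
        pairwise_winner(a, b, sample) for a, b in combinations(classes, 2)
    )
    return votes.most_common(1)[0][0]


# Toy stand-in for trained binary classifiers: the closer 1-D prototype wins.
PROTOTYPES = {"BRCA": 0.0, "OV": 2.0, "UCEC": 5.0, "CEU": 9.0}


def closer_prototype(a, b, sample):
    return a if abs(PROTOTYPES[a] - sample) <= abs(PROTOTYPES[b] - sample) else b


predicted = ovo_predict(list(PROTOTYPES), closer_prototype, 2.1)
```

With nine classes the scheme requires 36 binary classifiers. Note that `most_common` resolves exact vote ties arbitrarily (by first occurrence), so a real pipeline would need an explicit tie-break rule.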
  • SVM/SNPS method: support vector machine (SVM) applied to SNP-Ss
  • Another predictive model was constructed using SVM with SNP-S instead of SNP (see FIG. 10D, Table 8d). Except that an additional parameter, the length of SNP-S (explored and optimized during training), is included, the overall pipeline of the method is the same as in 3) above. The optimal values of the p-value cutoff and of the SNP-S length were 10^-5 and 2, respectively.
  • Bayesian inference over multiple prediction algorithms
  • each phenotype was labeled with the first letter of the full initial of each trait.
  • Bayesian inference over the prediction results of the four methods was used. The methods are abbreviated kNN/SNPS, kNN/SNP, SVM/SNPS and SVM/SNP, and are denoted mathematically by $m_1$, $m_2$, $m_3$ and $m_4$, respectively.
  • The trait with the highest posterior probability given the results predicted by the trained methods was selected, which can be formulated as $P(s_j \mid m_1^i, m_2^i, m_3^i, m_4^i) = \dfrac{P(m_1^i, m_2^i, m_3^i, m_4^i \mid s_j)\, P(s_j)}{P(m_1^i, m_2^i, m_3^i, m_4^i)}$, where
  • $s_j$ is the predicted trait of subject $i$, and
  • $m_k^i$ is the trait of subject $i$ predicted by method $m_k$.
  • The denominator $P(m_1^i, m_2^i, m_3^i, m_4^i)$ is the normalization constant. Since the prediction decisions of the methods are treated as independent of each other, we apply the chain rule (Zhang H (2005) Exploring conditions for the optimality of naive Bayes. Int J Pattern Recogn 19(2):183-198): $P(s_j \mid m_1^i, m_2^i, m_3^i, m_4^i) \propto P(s_j) \prod_{k=1}^{4} P(m_k^i \mid s_j)$
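Combining the four predictions under the independence assumption amounts to a naive Bayes calculation, sketched below. The function name `bayes_combine`, the nested-dictionary layout, and the idea of taking each method's conditional probabilities from its training confusion matrix are assumptions for illustration; the numbers in the example are invented toy values, not results.

```python
def bayes_combine(predictions, priors, likelihoods):
    """Combine per-method predictions into a posterior over traits.

    predictions          : trait predicted by each method, in method order
    priors[t]            : prior probability of trait t
    likelihoods[k][t][p] : P(method k predicts p | true trait is t), assumed
                           here to be estimated on the training set
    """
    scores = {}
    for trait, prior in priors.items():
        score = prior
        for k, predicted in enumerate(predictions):
            score *= likelihoods[k][trait].get(predicted, 0.0)
        scores[trait] = score

    total = sum(scores.values())  # normalization constant
    posterior = {t: (s / total if total else 0.0) for t, s in scores.items()}
    return max(posterior, key=posterior.get), posterior


# Toy example with two traits and two methods sharing one confusion table.
table = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.3, "B": 0.7}}
best, posterior = bayes_combine(["A", "A"], {"A": 0.5, "B": 0.5}, [table, table])
```

In the toy example both methods predict trait A, giving unnormalized scores of 0.32 for A and 0.045 for B, so A is selected with a posterior of about 0.88.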
  • The methods of the invention classify individuals into multiple cancer types, comprising three female-specific cancers and five general cancers.
  • Male subjects were classified among the five general cancers, excluding breast cancer, ovarian cancer and endometrial cancer.
  • Results
  • SNP-S as one of two descriptors reflects the observation that each SNP location is not independent and is connected to neighbors to varying degrees.
  • Experimentally obtained genotypes were used instead of computationally inferred haplotypes because inferred haplotypes are unreliable, particularly for the rare-frequency SNPs of unrelated individuals on which the methods of the present invention are built (Fan HC, Wang J, Potanina A, & Quake SR (2011) Whole-genome molecular haplotyping of single cells. Nature Biotechnology 29(1):51-57).
  • the individual genome SNP-Ss is created by sliding a window of a certain length along the entire length of the total genome SNPs.
  • Among the descriptor elements (SNPs or SNP-Ss), the factors that increase the susceptibility to the different cancer types are selected: SNPs or SNP-Ss with a "very low p-value" or a "rare frequency", depending on the analysis algorithm used.
  • All pairwise "distances" are calculated between the descriptor of one individual and the descriptor of every other individual.
  • The k nearest neighbors (kNNs) of the test subject are then selected, and the subject is predicted to be susceptible to the most common trait among the kNNs (if more than one trait is most common, see the method above).
  • An SVM is trained for every pair of traits to recognize which of the two traits an individual has.
  • The trait receiving the most votes over all pairwise SVM classifications is predicted as the trait to which the test subject is most likely susceptible.
  • The final prediction of the susceptibility of the test subject is obtained by Bayesian inference from the four prediction results. For female subjects, multiclass susceptibility was estimated over nine classes (eight cancer classes and one healthy class); for male subjects, predictions were made over six classes, excluding the three female-specific cancer classes.
  • TCGA: The Cancer Genome Atlas
  • HapMap: Haplotype Map project
  • Details of data selection, sampling methods, sample control procedures and other details are described in the above experimental methods, and the numbers before and after sample control are listed in Table 7.
  • the dataset was divided into two groups: a training set for optimization of the parameters for each method and a test set for independent verification of the methods.
  • The maximum sample size for each trait in the training set was limited to the smallest per-trait sample size in TCGA (66). To prevent artificially skewed predictions caused by unequal sample sizes, 66 individuals were randomly drawn from each trait group. Owing to the shortage of TCGA samples, a test set covering all nine phenotypes could not be constructed.
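Drawing the same number of individuals from every trait group can be sketched as below. The function name `balanced_sample`, the fixed seed, and the toy identifier lists are assumptions; only the group size of 66 comes from the text.

```python
import random


def balanced_sample(ids_by_trait, size, seed=0):
    """Randomly draw `size` distinct individuals from each trait group."""
    rng = random.Random(seed)  # seeded so that the draw is reproducible
    return {trait: rng.sample(ids, size) for trait, ids in ids_by_trait.items()}


toy_groups = {"BRCA": list(range(100)), "CEU": list(range(100, 170))}
training_ids = balanced_sample(toy_groups, 66, seed=1)
```

`random.Random.sample` draws without replacement and raises `ValueError` if a group holds fewer than `size` individuals, which makes an undersized trait group fail loudly instead of silently unbalancing the training set.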
  • Table 8a gives the results of the kNN/SNP-S method; the results of the remaining three methods are given in Tables 8b-8d.
  • Table 8a
  • Training performance of the SVM algorithm applied to profiles of SNPs.
  • The results on the test set can be summarized as follows: (i) for each cancer class, three of the four methods predicted the test set with significantly better accuracy than random prediction; (ii) the individual genome variants of BRCA and OV (as described in terms of SNPs or SNP-Ss) are more closely related to each other than to those of the remaining cancer types, and a similar but slightly weaker relation was found between the descriptors of OV and UCEC.

Abstract

The present invention relates to a novel variation analysis protocol that yields clinically meaningful prediction results by more accurately analyzing the various variations (for example, genetic variations) found in an organism. The present invention makes it possible to accurately predict an individual's predisposition to certain traits. The prediction accuracy of the present invention is several times higher than that of random prediction, a level at which the health state of an individual or a population can be determined, and thus represents a markedly improved prediction accuracy.
PCT/KR2013/011823 2012-12-18 2013-12-18 Computer-implemented method for analyzing genomic mutation or epigenetic mutation WO2014098479A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR10-2012-0148533 2012-12-18
KR20120148533 2012-12-18
KR1020130115261A KR101538692B1 (ko) 2012-12-18 2013-09-27 Computer-implemented method for analyzing genome variation or epigenetic variation
KR10-2013-0115261 2013-09-27

Publications (1)

Publication Number Publication Date
WO2014098479A1 true WO2014098479A1 (fr) 2014-06-26

Family

ID=50978718

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2013/011823 WO2014098479A1 (fr) 2012-12-18 2013-12-18 Computer-implemented method for analyzing genomic mutation or epigenetic mutation

Country Status (1)

Country Link
WO (1) WO2014098479A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019165279A1 (fr) * 2018-02-23 2019-08-29 EMULATE, Inc. Organs-on-chips as a platform for epigenetics discovery
CN113035274A (zh) * 2021-04-22 2021-06-25 广东技术师范大学 NMF-based feature map extraction algorithm for tumor gene point mutations

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110020815A1 (en) * 2001-03-30 2011-01-27 Nila Patil Methods for genomic analysis

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110020815A1 (en) * 2001-03-30 2011-01-27 Nila Patil Methods for genomic analysis

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
FAY J. HOSKING ET AL.: "Genome-wide association studies for detecting cancer susceptibility", 18 January 2011 (2011-01-18), pages 27 - 46, Retrieved from the Internet <URL:http://bmb.oxfordjournals.org/content/97/1/27.full.pdf> *
GREGORY E.SIMS ET AL.: "Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions", 24 February 2009 (2009-02-24), pages 2677 - 2682, Retrieved from the Internet <URL:http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2634796/pdf/zpq2677.pdf> *
H CHRISTINA FAN ET AL.: "Whole-genome molecular haplotyping of single cells", 19 December 2010 (2010-12-19), pages 1 - 9, Retrieved from the Internet <URL:http://thebigone.stanford.edu/papers/Fan%20Natbiotech%202010.pdf> *
SHAUN PURCELL ET AL.: "PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses", September 2007 (2007-09-01), pages 559 - 575, Retrieved from the Internet <URL:http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1950838/pdf/AJHGv81p559.pdf> *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019165279A1 (fr) * 2018-02-23 2019-08-29 EMULATE, Inc. Organs-on-chips as a platform for epigenetics discovery
GB2585302A (en) * 2018-02-23 2021-01-06 Emulate Inc Organs-on-chips as a platform for epigenetics discovery
GB2585302B (en) * 2018-02-23 2023-03-22 Emulate Inc Organs-on-chips as a platform for epigenetics discovery
CN113035274A (zh) * 2021-04-22 2021-06-25 广东技术师范大学 NMF-based feature map extraction algorithm for tumor gene point mutations

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13866114

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13866114

Country of ref document: EP

Kind code of ref document: A1