CN101894216B - Method of discovering SNP group related to complex disease from SNP information - Google Patents


Info

Publication number
CN101894216B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2010102309492A
Other languages
Chinese (zh)
Other versions
CN101894216A (en)
Inventor
张军英
耿耀君
于国强
赵晓雪
尚军亮
王跃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Application filed by Xidian University
Priority to CN2010102309492A
Publication of CN101894216A
Application granted
Publication of CN101894216B
Legal status: Expired - Fee Related


Abstract

The invention discloses a method for discovering SNP groups related to a complex disease from SNP data. It addresses the problem that prior-art methods cannot find, in the SNP data of a complex disease, the multiple causes of disease susceptibility and the SNP combination related to each cause. The method comprises the steps of: preprocessing the SNP data set of the complex disease; searching the preprocessed data set for candidate suspected pathogenic SNP groups according to the relevance measure of an SNP group; calculating the stability measure of the relevance of every candidate suspected pathogenic SNP group; adding the SNP group with the maximum relevance stability measure, as a suspected cause, to the set of SNP groups related to the complex disease; and outputting the SNP groups in that set, evaluating the degree to which each SNP group is a suspected cause by how close its relevance stability is to 1. The invention can simultaneously find multiple pathogenic SNP groups hidden in the data, and can be used for research on pathogenic mechanisms, early diagnosis of complex diseases, and biopharmaceutical development.

Description

Method for finding SNP group related to complex disease from SNP data
Technical Field
The invention belongs to the technical field of data processing, and in particular relates to a method for discovering SNP groups related to a complex disease from single nucleotide polymorphism (SNP) data using a maximum stability criterion. The method can be used for research on the pathogenesis of complex diseases, early diagnosis, and biopharmaceutical development.
Background
A complex disease arises from the combined action of multiple genetic and environmental factors, and its onset and development are influenced by many genes organized in a complex network. Complex diseases differ from Mendelian genetic diseases in that, in most cases, no single major gene is sufficient to cause the disease: the effect of any single gene on the disease may be negligible or even absent, yet the combined effect of these individually insignificant genes may be the cause of the complex disease. These characteristics make it very difficult to find the causative genes or related markers needed for pathogenesis research, early diagnosis, and biopharmaceutical development. The main open problems are how to find the multiple causes of a disease on a genome-wide scale, and which genes combine to form each cause.
To overcome these problems, researchers have developed a variety of methods for finding disease markers. These fall mainly into methods based on hypothesis testing, methods based on feature selection, and methods based on causal analysis:
(1) Methods based on hypothesis testing. These are currently the principal methods for finding single-gene disease associations, and the search is usually exhaustive. They can find single pathogenic SNPs in whole-genome data, or pathogenic two-SNP combinations in medium-scale data.
(2) Methods based on feature selection. Some of these can process large-scale data but generally do not consider correlation between features; some can examine limited feature correlation but can only process small and medium-scale data; some can be combined with a classifier in various ways to optimize the classifier's generalization performance, but are computationally expensive and suitable only for small-scale data.
(3) Methods based on causal analysis. Most take the form of causal networks; they are currently at the stage of theoretical research and cannot process large-scale data.
None of the three kinds of methods above starts from discovering the causative SNPs that objectively exist in the data. Computationally, they either can find only single pathogenic SNPs and so cannot be used to discover the multi-SNP groups associated with complex diseases, or require so much computation that they cannot be applied to whole-genome data.
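To give a sense of scale for the exhaustive search mentioned above: the number of distinct SNP groups of size r that can be drawn from d SNPs is the binomial coefficient C(d, r). The following minimal sketch evaluates it for the SNP count of the lung adenocarcinoma data described in Experiment 2 of this document; the function name `n_candidate_groups` is introduced here for illustration only.

```python
from math import comb


def n_candidate_groups(d: int, r: int) -> int:
    """Number of distinct SNP groups of size r among d SNPs (binomial coefficient)."""
    return comb(d, r)


d = 238304  # SNP count of the lung adenocarcinoma data in Experiment 2
for r in (1, 2, 3):
    # The count grows roughly like d**r / r!, which is why exhaustive
    # search over larger groups quickly becomes impractical.
    print(r, n_candidate_groups(d, r))
```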
Disclosure of Invention
The purpose of the present invention is to overcome the shortcomings of the existing methods by providing a method for discovering SNP groups related to a complex disease from SNP data. Starting from the discovery of the pathogenic SNPs that objectively exist in the data, the method finds, from complex disease SNP data, multiple possible causes of the disease and the SNP combination related to each cause.
The technical scheme for realizing the invention comprises the following steps:
(1) Let C be the set of SNP groups related to the complex disease, initialized to the empty set; let M be the preset number of complex-disease-related SNP groups to be discovered, with default value 6; let L be the upper limit on the number of SNPs contained in an SNP group, with default value 5. On the principle that a variation in either allele of a pair of homologous chromosomes is treated as having an equivalent influence on the disease, the SNP data are preprocessed into:
$$\Omega = \{(x_i, y_i)\}_{i=1}^{N}$$
where N is the number of samples in the SNP data; $x_i \in \{0, 1, 2, 3\}^d$ is the i-th sample; d is the number of SNPs in the data; $y_i \in \{1, 2\}$ is the class label of sample $x_i$, with 1 denoting the disease group and 2 the control group; $y = [y_1, y_2, \ldots, y_N]$; and $\Omega$ denotes the preprocessed data;
(2) The relevance measure $\mathrm{AS}(F_r/\Omega)$ between an SNP group $F_r$ and the class label y is defined as the mutual information $\mathrm{MI}(F_r; y)$ between $F_r$ and y:
$$\mathrm{AS}(F_r/\Omega) = \mathrm{MI}(F_r; y) = \sum_{F_r} \sum_{y} p(F_r, y) \log \frac{p(F_r, y)}{p(F_r)\, p(y)}$$
where $F_r$ is a group of r SNPs, $p(F_r, y)$ is the joint probability of $F_r$ and y, $p(F_r)$ is the joint probability of the r SNPs, and $p(y)$ is the probability of the class label y;
(3) Based on the relevance measure $\mathrm{AS}(F_r/\Omega)$ of an SNP group $F_r$, search $\Omega$ for candidate suspected pathogenic SNP groups according to the following steps:
(3a) calculate the relevance measure of each single SNP in $\Omega$, and add the SNPs with the K largest relevance values to the set D of candidate suspected pathogenic SNP groups;
(3b) take an unmarked SNP group $F_r$ out of D and go to step (3c); if no unmarked SNP group can be taken out of D, end step (3);
(3c) if the number of SNPs contained in $F_r$ equals L, mark $F_r$ and go to step (3b); otherwise go to step (3d);
(3d) calculate the relevance measure of each new SNP group formed by $F_r$ together with one of the remaining SNPs in $\Omega$, add the SNP groups with the K largest relevance values to the set D, mark $F_r$, and go to step (3b);
(4) Calculate the stability measure $\mathrm{ST}(F_r)$ of the relevance of every candidate suspected pathogenic SNP group found by the search:
$$\mathrm{ST}(F_r) = \frac{1}{\sqrt{2\pi}} \int_{\frac{\delta(F_r) - \mu_\delta(r)}{\sigma_\delta(r)}}^{+\infty} \exp\left(-\frac{x^2}{2}\right) dx$$
where $F_r$ is a group of r SNPs; $\mu_\delta(r)$ and $\sigma_\delta(r)$ are respectively the mean and standard deviation of the volatilities $\delta(F_r^i)$, where $F_r^i$ is the group of r SNPs obtained by the i-th sampling, $i = 1, 2, \ldots, m_f$, and $m_f$ is the number of samplings with replacement of the SNPs in $\Omega$, with default value 100;
$\delta(F_r)$ is the volatility of the relevance measure $\mathrm{AS}(F_r/\Omega)$:
$$\delta(F_r) = \frac{\sigma_{AS}(F_r)}{\mu_{AS}(F_r)}$$
where $\mu_{AS}(F_r)$ and $\sigma_{AS}(F_r)$ are respectively the mean and standard deviation of the relevance measures $\mathrm{AS}(F_r/Z_i)$ obtained from $m_s$ samplings with replacement of the samples in $\Omega$:
$$\left\{ \begin{aligned} \mu_{AS}(F_r) &= \frac{1}{m_s} \sum_{i=1}^{m_s} \mathrm{AS}(F_r/Z_i) \\ \sigma_{AS}(F_r) &= \left[ \frac{1}{m_s - 1} \sum_{i=1}^{m_s} \left( \mathrm{AS}(F_r/Z_i) - \mu_{AS}(F_r) \right)^2 \right]^{1/2} \end{aligned} \right.$$
where $Z_i$ is the data obtained by the i-th sampling with replacement, $\mathrm{AS}(F_r/Z_i)$ is the relevance measure of $F_r$ in the data $Z_i$, $i = 1, 2, \ldots, m_s$, and $m_s$ has default value 1000;
(5) According to the maximum stability criterion, select from the set D of candidate suspected pathogenic SNP groups the SNP group with the largest relevance stability measure as one suspected pathogenic SNP group, add it to the set C of SNP groups related to the complex disease, and remove the SNPs it contains from $\Omega$; if the number of SNP groups in C is less than M, go to step (3), otherwise go to step (6);
(6) Output the SNP groups in C, and evaluate the degree to which each SNP group is suspected of being pathogenic by how close its relevance stability is to 1.
The invention has the following advantages:
(1) The invention uses mutual information as the measure of relevance between an SNP group and the disease, which captures not only linear statistical relationships between them but also nonlinear ones.
(2) The invention discovers suspected pathogenic SNP groups using the stability of SNP-group relevance. Using sampling with replacement, it provides a statistical way to judge whether the association between an SNP group and the disease is stable, making it possible to find disease-related SNP groups that objectively exist in the data.
(3) In the process of finding SNP groups related to the complex disease, no artificial parameters are introduced and no hypothesis-based machine learning, pattern recognition, or data mining method is used, so the influence of artificial assumptions is avoided as far as possible.
(4) The invention can find, from complex disease SNP data, multiple possible causes of disease susceptibility and the SNP combination related to each cause.
Drawings
FIG. 1 is a flow chart of the present invention for discovering multiple pathogenic SNP groups.
Detailed Description
Referring to FIG. 1, the specific implementation steps of the present invention are as follows:
Step 1, preprocess and initialize the SNP data.
(1.1) On the principle that a variation in either allele of a pair of homologous chromosomes is treated as having an equivalent influence on the disease, process the SNP data into data containing only the values 0, 1, 2, and 3, where 0 represents missing data;
(1.2) Let
$$\Omega = \{(x_i, y_i)\}_{i=1}^{N}$$
denote the preprocessed data, where N is the number of samples in the SNP data, $x_i \in \{0, 1, 2, 3\}^d$ is the i-th sample, d is the number of SNPs in the data, $y_i \in \{1, 2\}$ is the class label of sample $x_i$ (1 denotes the disease group and 2 the control group), and $y = [y_1, y_2, \ldots, y_N]$;
(1.3) Let C be the set of SNP groups related to the complex disease, initialized to the empty set; let M be the number of complex-disease-related SNP groups expected to be found, with default value 6; let L be the upper limit on the number of SNPs contained in an SNP group, with default value 5.
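The text fixes only the value range {0, 1, 2, 3}, with 0 for missing data, and the principle that a variation in either allele is treated equivalently; it does not give the coding table itself. The sketch below therefore uses a hypothetical coding (1, 2, 3 for zero, one, or two copies of a chosen allele) purely to illustrate that the two orderings of a heterozygous genotype map to the same value. The function name `encode_genotype` and the parameter `minor` are assumptions of this example.

```python
def encode_genotype(genotype: str, minor: str) -> int:
    """Hypothetical coding: 0 = missing, 1/2/3 = zero/one/two copies of `minor`."""
    if len(genotype) != 2 or any(allele not in "ACGT" for allele in genotype):
        return 0  # missing or unreadable genotype call
    # Counting copies ignores which homologous chromosome carries the variant,
    # so "AG" and "GA" receive the same code.
    return 1 + sum(allele == minor for allele in genotype)


sample = ["AA", "AG", "GA", "GG", "NN"]
print([encode_genotype(g, minor="G") for g in sample])
```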
Step 2, define the relevance measure.
(2.1) The relevance measure $\mathrm{AS}(F_r/\Omega)$ is defined as the mutual information $\mathrm{MI}(F_r; y)$ between the SNP group $F_r$ and the class label y, given by formula (1):
$$\mathrm{AS}(F_r/\Omega) = \mathrm{MI}(F_r; y) = \sum_{F_r} \sum_{y} p(F_r, y) \log \frac{p(F_r, y)}{p(F_r)\, p(y)} \tag{1}$$
where $F_r$ is a group of r SNPs, $p(F_r, y)$ is the joint probability of $F_r$ and y, $p(F_r)$ is the joint probability of the r SNPs, and $p(y)$ is the probability of the class label y.
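Formula (1) can be estimated from data by replacing the probabilities with relative frequencies. The following is a minimal sketch under stated assumptions, not the patent's implementation: the function name `association` is introduced here, the natural logarithm is assumed (the text does not fix the base), and the missing-data code 0 is simply treated as one more genotype value.

```python
from collections import Counter
from math import log


def association(rows, labels, group):
    """Empirical mutual information MI(F_r; y), in nats, for the SNP columns in `group`."""
    n = len(rows)
    # Count each (genotype pattern of the group, class label) pair.
    joint = Counter((tuple(x[j] for j in group), y) for x, y in zip(rows, labels))
    pattern_counts = Counter()
    label_counts = Counter()
    for (pattern, y), c in joint.items():
        pattern_counts[pattern] += c
        label_counts[y] += c
    # Sum of p(F, y) * log(p(F, y) / (p(F) * p(y))), written with raw counts.
    return sum(
        (c / n) * log(c * n / (pattern_counts[pattern] * label_counts[y]))
        for (pattern, y), c in joint.items()
    )


# Toy data: the label is 1 when the two SNPs agree and 2 when they differ.
rows = [(1, 1), (1, 2), (2, 1), (2, 2)]
labels = [1, 2, 2, 1]
print(association(rows, labels, (0,)))    # a single SNP on its own
print(association(rows, labels, (0, 1)))  # the two SNPs taken together
```

On this toy data a single SNP carries no information about the label, while the pair determines it completely (mutual information log 2), which mirrors the situation described in the Background where individually insignificant genes are jointly informative.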
Step 3, search $\Omega$ for candidate suspected pathogenic SNP groups.
(3.1) Calculate the relevance measure of each single SNP by formula (1), and add the SNPs with the K largest relevance values to the set D of candidate suspected pathogenic SNP groups;
(3.2) take an unmarked SNP group $F_r$ out of D and go to step (3.3); if no unmarked SNP group can be taken out of D, end step 3;
(3.3) if the number of SNPs contained in $F_r$ equals L, mark $F_r$ and go to step (3.2); otherwise go to step (3.4);
(3.4) calculate by formula (1) the relevance measure of each new SNP group formed by $F_r$ together with one of the remaining SNPs in $\Omega$, add the SNP groups with the K largest relevance values to the set D, mark $F_r$, and go to step (3.2).
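Steps (3.1) to (3.4) can be sketched as a breadth-first expansion that keeps the K best extensions of every group. This is an illustrative sketch only: the names `association` and `search_candidates` are introduced here, the text gives no default for K so it is a required argument, and skipping groups already present in D is an assumption of this example that the text does not state.

```python
from collections import Counter
from math import log


def association(rows, labels, group):
    """Empirical mutual information between the SNP columns in `group` and the labels."""
    n = len(rows)
    joint = Counter((tuple(x[j] for j in group), y) for x, y in zip(rows, labels))
    pattern_counts, label_counts = Counter(), Counter()
    for (pattern, y), c in joint.items():
        pattern_counts[pattern] += c
        label_counts[y] += c
    return sum((c / n) * log(c * n / (pattern_counts[p] * label_counts[y]))
               for (p, y), c in joint.items())


def search_candidates(rows, labels, snps, K, L=5):
    """Build the candidate set D; `snps` is a list of column indices."""
    def score(group):
        return association(rows, labels, group)

    # (3.1) the K single SNPs with the largest relevance seed D.
    seeds = sorted(((s,) for s in snps), key=score, reverse=True)[:K]
    D = list(seeds)
    unmarked = list(seeds)
    while unmarked:                  # (3.2) take an unmarked group out of D
        F = unmarked.pop(0)
        if len(F) == L:              # (3.3) size limit reached: mark only
            continue
        # (3.4) extend F by each remaining SNP and keep the K best extensions.
        extensions = sorted({tuple(sorted(F + (s,))) for s in snps if s not in F})
        for G in sorted(extensions, key=score, reverse=True)[:K]:
            if G not in D:           # assumption: do not add a group twice
                D.append(G)
                unmarked.append(G)
    return D


# Toy data: SNPs 0 and 1 jointly determine the label, SNP 2 is unrelated.
rows = [(a, b, c) for a in (1, 2) for b in (1, 2) for c in (1, 2)]
labels = [1 if a == b else 2 for a, b, c in rows]
print(search_candidates(rows, labels, [0, 1, 2], K=2, L=2))
```

Because every group of size below L contributes up to K new groups, D can hold on the order of K + K^2 + ... + K^L groups, so K and L together bound the cost of the search.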
Step 4, calculate the stability measure of the relevance of every candidate suspected pathogenic SNP group $F_r$ found by the search.
(4.1) The stability measure $\mathrm{ST}(F_r)$ of the relevance of $F_r$ is defined by formula (2):
$$\mathrm{ST}(F_r) = \frac{1}{\sqrt{2\pi}} \int_{\frac{\delta(F_r) - \mu_\delta(r)}{\sigma_\delta(r)}}^{+\infty} \exp\left(-\frac{x^2}{2}\right) dx \tag{2}$$
where $F_r$ is a group of r SNPs; $\mu_\delta(r)$ and $\sigma_\delta(r)$ are respectively the mean and standard deviation of the volatilities $\delta(F_r^i)$, where $F_r^i$ is the group of r SNPs obtained by the i-th sampling, $i = 1, 2, \ldots, m_f$, and $m_f$ is the number of samplings with replacement of the SNPs in $\Omega$, with default value 100;
$\delta(F_r)$ is the volatility of the relevance measure $\mathrm{AS}(F_r/\Omega)$, defined by formula (3):
$$\delta(F_r) = \frac{\sigma_{AS}(F_r)}{\mu_{AS}(F_r)} \tag{3}$$
where $\mu_{AS}(F_r)$ and $\sigma_{AS}(F_r)$ are respectively the mean and standard deviation of the relevance measures $\mathrm{AS}(F_r/Z_i)$ obtained from $m_s$ samplings with replacement of the samples in $\Omega$:
$$\left\{ \begin{aligned} \mu_{AS}(F_r) &= \frac{1}{m_s} \sum_{i=1}^{m_s} \mathrm{AS}(F_r/Z_i) \\ \sigma_{AS}(F_r) &= \left[ \frac{1}{m_s - 1} \sum_{i=1}^{m_s} \left( \mathrm{AS}(F_r/Z_i) - \mu_{AS}(F_r) \right)^2 \right]^{1/2} \end{aligned} \right. \tag{4}$$
where $Z_i$ is the data obtained by the i-th sampling with replacement, $\mathrm{AS}(F_r/Z_i)$ is the relevance measure of $F_r$ in the data $Z_i$, $i = 1, 2, \ldots, m_s$, and $m_s$ has default value 1000;
(4.2) Calculate the volatility $\delta(F_r)$ of the relevance measure $\mathrm{AS}(F_r/\Omega)$ as follows:
(4.2.1) perform $m_s$ samplings with replacement of the samples in $\Omega$ to obtain data $Z_i$, $i = 1, 2, \ldots, m_s$;
(4.2.2) for every $Z_i$, calculate the relevance measure $\mathrm{AS}(F_r/Z_i)$ of $F_r$ in $Z_i$ by the following formula, with the probabilities estimated from $Z_i$:
$$\mathrm{AS}(F_r/Z_i) = \sum_{F_r} \sum_{y} p(F_r, y) \log \frac{p(F_r, y)}{p(F_r)\, p(y)} \tag{5}$$
(4.2.3) calculate the mean $\mu_{AS}(F_r)$ and standard deviation $\sigma_{AS}(F_r)$ of the relevance measures $\mathrm{AS}(F_r/Z_i)$ by formula (4);
(4.2.4) calculate the volatility $\delta(F_r)$ of the relevance measure by formula (3);
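Step (4.2) amounts to a bootstrap estimate of the ratio of standard deviation to mean of the relevance measure. The sketch below is one possible reading, with assumptions marked: the names `association` and `volatility` are introduced here, each resample $Z_i$ is assumed to contain as many samples as $\Omega$ (the text does not state the resample size), and a seeded `random.Random` is used only to make the example reproducible.

```python
import random
from collections import Counter
from math import log
from statistics import mean, stdev


def association(rows, labels, group):
    """Empirical mutual information between the SNP columns in `group` and the labels."""
    n = len(rows)
    joint = Counter((tuple(x[j] for j in group), y) for x, y in zip(rows, labels))
    pattern_counts, label_counts = Counter(), Counter()
    for (pattern, y), c in joint.items():
        pattern_counts[pattern] += c
        label_counts[y] += c
    return sum((c / n) * log(c * n / (pattern_counts[p] * label_counts[y]))
               for (p, y), c in joint.items())


def volatility(rows, labels, group, m_s=1000, rng=None):
    """delta(F_r) = sigma_AS(F_r) / mu_AS(F_r), estimated from m_s resamples."""
    rng = rng or random.Random(0)
    n = len(rows)
    values = []
    for _ in range(m_s):
        # (4.2.1) Z_i: n samples drawn with replacement (resample size is an assumption).
        idx = [rng.randrange(n) for _ in range(n)]
        # (4.2.2) relevance of the group within Z_i, formula (5).
        values.append(association([rows[i] for i in idx],
                                  [labels[i] for i in idx], group))
    # (4.2.3)-(4.2.4): stdev uses the 1/(m_s - 1) normalisation of formula (4).
    return stdev(values) / mean(values)


# Toy data: SNPs 0 and 1 jointly determine the label, SNP 2 is unrelated.
rows = [(a, b, c) for a in (1, 2) for b in (1, 2) for c in (1, 2)] * 25
labels = [1 if a == b else 2 for a, b, c in rows]
print(volatility(rows, labels, (0, 1), m_s=50))  # informative pair
print(volatility(rows, labels, (2,), m_s=50))    # unrelated SNP
```

A group whose relevance is genuinely present in the data keeps nearly the same relevance value across resamples and so has a small volatility, whereas a relevance value that arises only from sampling noise fluctuates strongly relative to its mean.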
(4.3) Calculate $\mu_\delta(r)$ and $\sigma_\delta(r)$ as follows:
(4.3.1) perform $m_f$ samplings with replacement of the SNPs in $\Omega$ to obtain SNP groups $F_r^i$, each containing r SNPs, $i = 1, 2, \ldots, m_f$;
(4.3.2) calculate the volatility $\delta(F_r^i)$ of the relevance measure of each $F_r^i$, $i = 1, 2, \ldots, m_f$:
(4.3.2.1) perform $m_s$ samplings with replacement of the samples in $\Omega$ to obtain data $Z_j$, $j = 1, 2, \ldots, m_s$;
(4.3.2.2) for every $Z_j$, calculate the relevance measure $\mathrm{AS}(F_r^i/Z_j)$ of $F_r^i$ in $Z_j$ by the following formula:
$$\mathrm{AS}(F_r^i/Z_j) = \sum_{F_r^i} \sum_{y} p(F_r^i, y) \log \frac{p(F_r^i, y)}{p(F_r^i)\, p(y)} \tag{6}$$
(4.3.2.3) calculate the mean $\mu_{AS}(F_r^i)$ and standard deviation $\sigma_{AS}(F_r^i)$ of the relevance measures $\mathrm{AS}(F_r^i/Z_j)$ by the following formula:
$$\left\{ \begin{aligned} \mu_{AS}(F_r^i) &= \frac{1}{m_s} \sum_{j=1}^{m_s} \mathrm{AS}(F_r^i/Z_j) \\ \sigma_{AS}(F_r^i) &= \left[ \frac{1}{m_s - 1} \sum_{j=1}^{m_s} \left( \mathrm{AS}(F_r^i/Z_j) - \mu_{AS}(F_r^i) \right)^2 \right]^{1/2} \end{aligned} \right. \tag{7}$$
(4.3.2.4) calculate the volatility $\delta(F_r^i)$ of the relevance measure by the following formula:
$$\delta(F_r^i) = \frac{\sigma_{AS}(F_r^i)}{\mu_{AS}(F_r^i)} \tag{8}$$
(4.3.3) calculate the mean $\mu_\delta(r)$ and standard deviation $\sigma_\delta(r)$ of the volatilities $\delta(F_r^i)$, $i = 1, 2, \ldots, m_f$, by the following formula:
$$\left\{ \begin{aligned} \mu_\delta(r) &= \frac{1}{m_f} \sum_{i=1}^{m_f} \delta(F_r^i) \\ \sigma_\delta(r) &= \left[ \frac{1}{m_f - 1} \sum_{i=1}^{m_f} \left( \delta(F_r^i) - \mu_\delta(r) \right)^2 \right]^{1/2} \end{aligned} \right. \tag{9}$$
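Step (4.3) builds a reference distribution of volatilities from randomly drawn SNP groups of the same size r. The following sketch is illustrative only. The names `association`, `volatility`, and `volatility_reference` are introduced here. It assumes that each random group consists of r distinct SNPs and that groups are drawn independently of one another (so the same group may recur), which is one reading of "sampling with replacement of the SNPs"; the resample size equal to the sample count is likewise an assumption.

```python
import random
from collections import Counter
from math import log
from statistics import mean, stdev


def association(rows, labels, group):
    """Empirical mutual information between the SNP columns in `group` and the labels."""
    n = len(rows)
    joint = Counter((tuple(x[j] for j in group), y) for x, y in zip(rows, labels))
    pattern_counts, label_counts = Counter(), Counter()
    for (pattern, y), c in joint.items():
        pattern_counts[pattern] += c
        label_counts[y] += c
    return sum((c / n) * log(c * n / (pattern_counts[p] * label_counts[y]))
               for (p, y), c in joint.items())


def volatility(rows, labels, group, m_s, rng):
    """Formulas (6)-(8): bootstrap ratio of standard deviation to mean of the relevance."""
    n = len(rows)
    values = []
    for _ in range(m_s):
        idx = [rng.randrange(n) for _ in range(n)]  # Z_j, drawn with replacement
        values.append(association([rows[i] for i in idx],
                                  [labels[i] for i in idx], group))
    return stdev(values) / mean(values)


def volatility_reference(rows, labels, snps, r, m_f=100, m_s=1000, seed=0):
    """Formula (9): mean and standard deviation of delta over m_f random groups of size r."""
    rng = random.Random(seed)
    deltas = []
    for _ in range(m_f):
        group = tuple(rng.sample(snps, r))  # (4.3.1) F_r^i: r distinct SNPs (assumption)
        deltas.append(volatility(rows, labels, group, m_s, rng))  # (4.3.2)
    return mean(deltas), stdev(deltas)      # (4.3.3)


# Toy data with no real association, small m_f and m_s to keep the example fast.
gen = random.Random(1)
rows = [tuple(gen.choice((1, 2, 3)) for _ in range(6)) for _ in range(100)]
labels = [gen.choice((1, 2)) for _ in range(100)]
print(volatility_reference(rows, labels, list(range(6)), r=2, m_f=5, m_s=20))
```

With the default values $m_f = 100$ and $m_s = 1000$ given in the text, this step evaluates the relevance measure 100,000 times for each group size r, so in practice the result for a given r would be computed once and reused for all candidates of that size.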
(4.4) For each SNP group $F_r$ in the set D of candidate suspected pathogenic SNP groups obtained in step 3, calculate the stability of its relevance as follows:
(4.4.1) calculate the volatility $\delta(F_r)$ of the SNP group $F_r$ by step (4.2);
(4.4.2) calculate $\mu_\delta(r)$ and $\sigma_\delta(r)$ by step (4.3);
(4.4.3) calculate the lower limit of integration of formula (2),
$$\frac{\delta(F_r) - \mu_\delta(r)}{\sigma_\delta(r)}$$
and substitute it into formula (2) to calculate the stability of the SNP group $F_r$.
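The integral in formula (2) is the upper-tail probability of the standard normal distribution, which equals one half of the complementary error function evaluated at the lower limit divided by the square root of 2. A minimal sketch of step (4.4.3) using that identity follows; the function name `stability` and the numeric inputs are illustrative, not values from the document.

```python
from math import erfc, sqrt


def stability(delta, mu_delta, sigma_delta):
    """ST(F_r) of formula (2), evaluated in closed form."""
    # Lower limit of integration from step (4.4.3).
    z = (delta - mu_delta) / sigma_delta
    # (1 / sqrt(2*pi)) * integral from z to +infinity of exp(-x^2 / 2) dx
    return 0.5 * erfc(z / sqrt(2.0))


print(stability(0.1, 1.0, 0.2))  # volatility far below the reference mean
print(stability(1.0, 1.0, 0.2))  # volatility equal to the reference mean
print(stability(2.0, 1.0, 0.2))  # volatility far above the reference mean
```

The stability approaches 1 when the volatility of a group lies well below the volatilities of random groups of the same size, equals 0.5 when it matches their mean, and approaches 0 when it lies well above, which is why closeness to 1 is used in step 6 as the degree of suspicion.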
Step 5, select the suspected pathogenic SNP group.
(5.1) According to the maximum stability criterion, select from the set D of candidate suspected pathogenic SNP groups the SNP group with the largest relevance stability measure as one suspected pathogenic SNP group, and denote it S;
(5.2) add S to the set C of SNP groups related to the complex disease; if the number of SNP groups in C equals M, go to step 6;
(5.3) remove the SNPs contained in S from the data $\Omega$ and go to step 3 to find the next disease-related SNP group.
Step 6, output the set C of SNP groups related to the complex disease, and evaluate the degree to which each SNP group is suspected of being a pathogenic SNP group by how close the stability of its relevance is to 1.
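The outer loop of steps 3 to 6 can be sketched as follows. To keep the example short, the candidate search of step 3 and the stability measure of step 4 are passed in as callables; `discover_groups`, `search`, `stability_of`, and the toy scores are all hypothetical names and values introduced for illustration, and the early exit when no candidates remain is a guard added for this sketch.

```python
from itertools import combinations


def discover_groups(snps, search, stability_of, M=6):
    """Repeat steps 3-5 until M groups are found; return (group, stability) pairs."""
    remaining = list(snps)
    C = []
    while len(C) < M and remaining:
        D = search(remaining)             # step 3 on the SNPs still in the data
        if not D:
            break                         # guard added for this sketch
        S = max(D, key=stability_of)      # (5.1) maximum stability criterion
        C.append((S, stability_of(S)))    # (5.2) add S to C
        # (5.3) remove the SNPs contained in S before the next round.
        remaining = [s for s in remaining if s not in S]
    return C                              # step 6: groups with their stabilities


# Toy stand-ins: every pair is a candidate, two pairs have high stability.
toy_scores = {(0, 1): 0.9, (2, 3): 0.8}
result = discover_groups(
    range(5),
    search=lambda rem: list(combinations(rem, 2)),
    stability_of=lambda g: toy_scores.get(g, 0.1),
    M=2,
)
print(result)
```

Removing the SNPs of each selected group before the next round is what lets the procedure report several distinct groups instead of variations of the same one.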
The effect of the method of the invention is described in more detail through the following experimental examples. These examples are for illustrative purposes and are not intended to limit the scope of the invention.
Experiment 1: discovery of SNP groups related to a complex disease in simulated data.
The simulated data were obtained by biologists adding 7 known complex-disease-related SNP groups to real SNP data from a New York population; the 7 SNP groups have different models of association with the disease. There are two data sets: the first contains 2000 samples and 100 SNPs and is denoted SNP100; the second contains 2000 samples and 2000 SNPs and is denoted SNP2000. Details of the data are given in Table 1. The experimental results obtained on the two data sets described in Table 1 are shown in Table 2.
TABLE 1 Experimental data

| Data set name | Number of SNPs | Number of samples | Disease-group samples | Control-group samples |
|---|---|---|---|---|
| SNP100 | 100 | 2000 | 1127 | 873 |
| SNP2000 | 2000 | 2000 | 1181 | 819 |
In Table 2, q gives, for each SNP in a discovered SNP group, its rank when all individual SNPs are sorted by relevance in descending order; "Discovered SNP set" is the disease-related SNP group found in the data by the method of the present invention; "Pathogenic SNP group" is the known complex-disease-related SNP group added to the data in advance by biologists; "Relevance" is the relevance measure between the discovered SNP group and the disease; "Stability" is the stability measure of the relevance of the discovered SNP group; "P value" is a measure commonly used, in the field of finding complex-disease-related SNP groups from SNP data, to evaluate the quality of an SNP group. SNPs in the table are indicated by their numbers in the data.
TABLE 2 Results of discovering pathogenic SNP groups in the SNP100 and SNP2000 data by the method of the invention
Figure BSA00000196295300101
Experiment 2: discovery of SNP groups related to lung adenocarcinoma.
The real lung adenocarcinoma data contain 191 disease samples, 99 control samples and 238304 SNPs, with 5.55% of the data missing. The results of discovering pathogenic SNP groups in this data are shown in Table 3, in which SNPs are indicated by their numbers in the data.
TABLE 3 Results of discovering pathogenic SNP groups in the lung adenocarcinoma data by the method of the present invention
q | Discovered SNP set | Relevance | Stability | P value
187716, 1 | 130199, 177958 | 0.223783 | 0.998951 | 1.3701e-005
3130, 70815, 2 | 102091, 180050, 234964 | 0.568258 | 0.986097 | 6.7758e-005
62201, 3, 14707 | 48316, 144695, 181381 | 0.586346 | 0.980482 | 7.4825e-006
2712, 4 | 66357, 206952 | 0.204549 | 0.997601 | 1.897e-005
5, 2525, 197037 | 7938, 116763, 236441 | 0.653206 | 0.984182 | 0.010945
114680, 20, 6 | 41440, 76592, 236930 | 0.492324 | 0.972419 | 0.0013376
From Tables 2 and 3, the following conclusions can be drawn:
(1) For the SNP100 data, the method of the invention found 6 of the 7 true pathogenic SNP groups in the simulated data; for the SNP2000 data, it found 5 of the 7; for the real lung adenocarcinoma data, it found 6 suspected pathogenic SNP groups. The method of the invention can therefore find disease-related SNP groups in SNP data.
(2) The simulated-data experiments also show that the number of true pathogenic SNP groups found does not drop noticeably as the number of SNPs in the data grows (6 of 7 with 100 SNPs versus 5 of 7 with 2000 SNPs), which indicates that the method is robust to the number of SNPs in the data; in addition, SNP groups with different association models with the disease were found, which indicates that the method is robust to the association model.
(3) In terms of stability, the relevance stability of the pathogenic SNP groups found by the invention is very high, close to 1. Compared with the commonly used P value, the stability measure can reveal more hidden suspected pathogenic SNP groups, which shows the advantage of this evaluation measure.
(4) In terms of the q values of the SNPs in the discovered pathogenic SNP groups: some SNP groups whose individual SNPs have weak relevance but whose combination has a strong effect, such as the combination 83, 85, 100 and the combination 1818, 1747, 1998, are also found successfully by the method of the invention, which further shows that the invention is well able to find pathogenic SNP groups that are individually weakly relevant but jointly strongly relevant.

Claims (3)

1. A method for discovering SNP groups related to a complex disease from single nucleotide polymorphism (SNP) data, comprising the following steps:
(1) letting C be the set of SNP groups related to the complex disease, with an initial value of empty; letting M be the preset number of complex-disease-related SNP groups to be discovered, with a default value of 6; letting L be the upper limit on the number of SNPs contained in an SNP group, with a default value of 5; and preprocessing the SNP data, on the principle that a variation of either gene in a pair of homologous-chromosome alleles is treated as having an equivalent influence on the disease, into data of the form:
Figure FSB00000822690000011
wherein N is the number of samples in the SNP data; $x_i \in \{0, 1, 2, 3\}^d$, where d is the number of SNPs in the data; $y_i \in \{1, 2\}$ is the class label of sample $x_i$, with 1 denoting the disease group and 2 the control group; $y = [y_1, y_2, \ldots, y_N]$; and Ω denotes the preprocessed data;
(2) defining the relevance $AS(F_r/\Omega)$ between an SNP group $F_r$ and the class label $y$ as the mutual information $MI(F_r; y)$ between $F_r$ and $y$:
$$AS(F_r/\Omega) = MI(F_r; y) = \sum_{F_r} \sum_{y} p(F_r, y) \log \frac{p(F_r, y)}{p(F_r)\, p(y)}$$
wherein $F_r$ is a set of r SNPs, $p(F_r, y)$ is the joint probability of $F_r$ and $y$, $p(F_r)$ is the joint probability of the r SNPs, and $p(y)$ is the probability of the class label $y$;
(3) based on the relevance measure $AS(F_r/\Omega)$ of an SNP group $F_r$, searching Ω for candidate suspected pathogenic SNP groups according to the following steps:
(3a) calculating the relevance measure of each single SNP in Ω, and adding the K SNPs with the largest relevance measures to the set D of candidate suspected pathogenic SNP groups;
(3b) taking an unmarked SNP group $F_r$ from D and going to step (3c); if no unmarked SNP group $F_r$ can be taken from D, ending step (3);
(3c) if the number of SNPs contained in $F_r$ equals L, marking $F_r$ and going to step (3b); otherwise going to step (3d);
(3d) calculating the relevance measure of each new SNP group formed by $F_r$ together with one of the remaining SNPs in Ω, adding the K new SNP groups with the largest relevance measures to the set D of candidate suspected pathogenic SNP groups, marking $F_r$, and going to step (3b);
(4) calculating the stability measure $ST(F_r)$ of the relevance of every candidate suspected pathogenic SNP group found by the search:
$$ST(F_r) = \frac{1}{\sqrt{2\pi}} \int_{\frac{\delta(F_r) - \mu_\delta(r)}{\sigma_\delta(r)}}^{+\infty} \exp\left(-\frac{x^2}{2}\right) dx$$
whose value lies in [0, 1], wherein $F_r$ is a set consisting of r SNPs; $\mu_\delta(r)$ and $\sigma_\delta(r)$ are respectively the mean and standard deviation of the volatility of the relevance $AS(F_r^i/\Omega)$, where $F_r^i$ is the set of r SNPs obtained in the i-th sampling, $i = 1, 2, \ldots, m_f$, and $m_f$ is the number of samplings with replacement of the features in Ω, with a default value of 100;
$\delta(F_r)$ is the volatility of the relevance measure $AS(F_r/\Omega)$:
$$\delta(F_r) = \frac{\sigma_{AS}(F_r)}{\mu_{AS}(F_r)}$$
wherein $\mu_{AS}(F_r)$ and $\sigma_{AS}(F_r)$ are respectively the mean and standard deviation of the relevance measures $AS(F_r/Z_i)$ obtained from $m_s$ samplings with replacement of the samples in Ω:
$$\begin{cases} \mu_{AS}(F_r) = \dfrac{1}{m_s} \sum\limits_{i=1}^{m_s} AS(F_r/Z_i) \\ \sigma_{AS}(F_r) = \left[ \dfrac{1}{m_s - 1} \sum\limits_{i=1}^{m_s} \left( AS(F_r/Z_i) - \mu_{AS}(F_r) \right)^2 \right]^{1/2} \end{cases}$$
wherein $Z_i$ is the data obtained by the i-th sampling with replacement, $AS(F_r/Z_i)$ is the relevance measure of $F_r$ in the data $Z_i$, $i = 1, 2, \ldots, m_s$, and the default value of $m_s$ is 1000;
(5) according to the maximum stability criterion, selecting from the set D of candidate suspected pathogenic SNP groups the SNP group with the highest relevance stability as one suspected pathogenic SNP group, adding it to the set C of SNP groups related to the complex disease, and removing the SNPs contained in that group from Ω; if the number of SNP groups in C is less than M, going to step (3); otherwise going to step (6);
(6) outputting all SNP groups in C, and evaluating the degree to which each SNP group is suspected of being a pathogenic SNP group by how close the stability of its relevance is to 1.
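Steps (2) and (3) of claim 1 can be illustrated with a minimal Python sketch. It assumes the data are held as a list of genotype rows plus a list of class labels, estimates probabilities by empirical frequencies, and uses the natural logarithm (the claim does not state a base). The function names, the default `k=3` for K, and the skipping of duplicate groups are assumptions made here for illustration; only the default L = 5 comes from the claim.

```python
import math
from collections import Counter
from typing import FrozenSet, Iterable, List, Sequence

Group = FrozenSet[int]


def relevance(rows: Sequence[Sequence[int]], labels: Sequence[int],
              group: Iterable[int]) -> float:
    """Empirical mutual information MI(F_r; y), in nats (assumed base)."""
    cols = sorted(group)
    n = len(labels)
    joint, geno, lab = Counter(), Counter(), Counter(labels)
    for row, y in zip(rows, labels):
        g = tuple(row[c] for c in cols)   # joint genotype of the group
        joint[(g, y)] += 1
        geno[g] += 1
    # p(g, y) * log(p(g, y) / (p(g) * p(y))), written with raw counts
    return sum((c / n) * math.log(c * n / (geno[g] * lab[y]))
               for (g, y), c in joint.items())


def search_candidates(rows, labels, snps, k=3, max_size=5) -> List[Group]:
    """Steps (3a)-(3d): grow groups, keeping the k best extensions."""
    pool = sorted(snps)

    def score(g: Group) -> float:
        return relevance(rows, labels, g)

    # (3a) seed D with the k most relevant single SNPs
    d = sorted((frozenset([s]) for s in pool), key=score, reverse=True)[:k]
    seen = set(d)
    i = 0                      # groups before index i count as marked
    while i < len(d):          # (3b) stop when no unmarked group is left
        group = d[i]
        i += 1                 # mark the group
        if len(group) >= max_size:
            continue           # (3c) size limit L reached
        # (3d) extend by each remaining SNP, keep the k best new groups
        extended = [group | {s} for s in pool if s not in group]
        for new in sorted(extended, key=score, reverse=True)[:k]:
            if new not in seen:    # assumption: skip duplicate groups
                seen.add(new)
                d.append(new)
    return d
```

On a toy data set in which the label depends only on the combination of two SNPs, each of those SNPs has zero relevance on its own while the pair has relevance log 2, which is the kind of individually weak but jointly strong group that the search is meant to reach.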
2. The method of claim 1, wherein the lower limit $\frac{\delta(F_r) - \mu_\delta(r)}{\sigma_\delta(r)}$ of the integral in the stability measure formula of step (4) is determined by the following steps:
(4a) performing $m_f$ samplings with replacement on the features in Ω to obtain sets $F_r^i$, $i = 1, 2, \ldots, m_f$, each containing r SNPs;
(4b) calculating the relevance $AS(F_r^i/\Omega)$ of each $F_r^i$, $i = 1, 2, \ldots, m_f$:
$$AS(F_r^i/\Omega) = \sum_{F_r^i} \sum_{y} p(F_r^i, y) \log \frac{p(F_r^i, y)}{p(F_r^i)\, p(y)}$$
(4c) calculating the mean $\mu_\delta(r)$ and the standard deviation $\sigma_\delta(r)$ of the volatility of the relevance $AS(F_r^i/\Omega)$;
(4d) performing $m_s$ samplings with replacement on the samples in Ω to obtain data $Z_i$, $i = 1, 2, \ldots, m_s$;
(4e) calculating the relevance measure $AS(F_r/Z_i)$ of $F_r$ in the data $Z_i$, $i = 1, 2, \ldots, m_s$;
(4f) calculating the mean $\mu_{AS}(F_r)$ and the standard deviation $\sigma_{AS}(F_r)$ of the relevance measures $AS(F_r/Z_i)$;
(4g) calculating the volatility $\delta(F_r)$ of the relevance measure from $\mu_{AS}(F_r)$ and $\sigma_{AS}(F_r)$;
(4h) determining the lower limit $\frac{\delta(F_r) - \mu_\delta(r)}{\sigma_\delta(r)}$ of the integral in the stability measure formula.
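Steps (4d) to (4h), together with the stability integral of claim 1, can be sketched in Python as below. The integral is the upper-tail probability of a standard normal distribution, so it is evaluated here through the complementary error function `math.erfc`. The function names, the `relevance_fn` callback that computes AS on a resampled data set, and the fixed default random seed are assumptions introduced for illustration; the sketch also assumes the mean relevance is non-zero.

```python
import math
import random
from statistics import mean, stdev


def volatility(rows, labels, group, relevance_fn, m_s=1000, rng=None):
    """Steps (4d)-(4g): delta(F_r) = sigma_AS(F_r) / mu_AS(F_r)."""
    rng = rng or random.Random(0)     # assumption: fixed seed by default
    n = len(labels)
    values = []
    for _ in range(m_s):
        # (4d) draw n samples with replacement to form Z_i
        idx = [rng.randrange(n) for _ in range(n)]
        z_rows = [rows[i] for i in idx]
        z_labels = [labels[i] for i in idx]
        # (4e) relevance of the group in the resampled data
        values.append(relevance_fn(z_rows, z_labels, group))
    # (4f)-(4g) sample standard deviation (1/(m_s - 1)) over the mean
    return stdev(values) / mean(values)


def stability(delta, mu_delta, sigma_delta):
    """ST(F_r): standard normal tail probability above the lower limit."""
    z = (delta - mu_delta) / sigma_delta          # (4h) lower limit
    return 0.5 * math.erfc(z / math.sqrt(2.0))    # equals 1 - Phi(z)
```

Under this formula a group whose volatility is well below the reference mean gets a strongly negative lower limit and hence a stability close to 1, while a group with typical volatility gets a stability of about 0.5.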
3. The method of claim 2, wherein the mean $\mu_\delta(r)$ and the standard deviation $\sigma_\delta(r)$ of the volatility of the relevance $AS(F_r^i/\Omega)$ in step (4c) are calculated as follows:
$$\begin{cases} \mu_\delta(r) = \dfrac{1}{m_f} \sum\limits_{i=1}^{m_f} \delta(F_r^i) \\ \sigma_\delta(r) = \left[ \dfrac{1}{m_f - 1} \sum\limits_{i=1}^{m_f} \left( \delta(F_r^i) - \mu_\delta(r) \right)^2 \right]^{1/2} \end{cases}$$
wherein $F_r^i$ is a set containing r SNPs, $m_f$ is the number of samplings with replacement of the SNPs in Ω, and $\delta(F_r^i)$ is the volatility of the relevance measure $AS(F_r^i/\Omega)$:
$$\delta(F_r^i) = \frac{\sigma_{AS}(F_r^i)}{\mu_{AS}(F_r^i)}$$
$$\begin{cases} \mu_{AS}(F_r^i) = \dfrac{1}{m_s} \sum\limits_{j=1}^{m_s} AS(F_r^i/Z_j) \\ \sigma_{AS}(F_r^i) = \left[ \dfrac{1}{m_s - 1} \sum\limits_{j=1}^{m_s} \left( AS(F_r^i/Z_j) - \mu_{AS}(F_r^i) \right)^2 \right]^{1/2} \end{cases}$$
wherein $Z_j$ is the data obtained by the j-th sampling with replacement, and $AS(F_r^i/Z_j)$ is the relevance measure of $F_r^i$ in the data $Z_j$, $j = 1, 2, \ldots, m_s$.
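The reference statistics $\mu_\delta(r)$ and $\sigma_\delta(r)$ of claim 3 can be sketched in Python as follows. The sketch rests on one interpretation that the claim does not spell out: each of the $m_f$ draws picks r distinct SNPs, and "with replacement" means that the same SNP may appear again in a later draw. The function name and the `volatility_fn` callback, which returns $\delta(F_r^i)$ for a given SNP set, are hypothetical.

```python
import random
from statistics import mean, stdev


def volatility_reference(snps, r, volatility_fn, m_f=100, rng=None):
    """Steps (4a)-(4c): mean and standard deviation of delta(F_r^i)."""
    rng = rng or random.Random(0)     # assumption: fixed seed by default
    pool = sorted(snps)               # random.sample needs a sequence
    deltas = []
    for _ in range(m_f):
        # (4a) one random set of r distinct SNPs; SNPs may recur across draws
        group = frozenset(rng.sample(pool, r))
        # (4b)-(4c) volatility of the relevance of this random set
        deltas.append(volatility_fn(group))
    # stdev uses the 1/(m_f - 1) normalisation of the claimed formula
    return mean(deltas), stdev(deltas)
```

The two returned values are the quantities subtracted from and divided into $\delta(F_r)$ when forming the lower limit of the stability integral, so they only need to be computed once per group size r.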
CN2010102309492A 2010-07-16 2010-07-16 Method of discovering SNP group related to complex disease from SNP information Expired - Fee Related CN101894216B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102309492A CN101894216B (en) 2010-07-16 2010-07-16 Method of discovering SNP group related to complex disease from SNP information

Publications (2)

Publication Number Publication Date
CN101894216A CN101894216A (en) 2010-11-24
CN101894216B true CN101894216B (en) 2012-09-05

Family

ID=43103406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010102309492A Expired - Fee Related CN101894216B (en) 2010-07-16 2010-07-16 Method of discovering SNP group related to complex disease from SNP information

Country Status (1)

Country Link
CN (1) CN101894216B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120905

Termination date: 20180716