CN104615912A

CN104615912A - Improved pathway-based genome-wide association analysis algorithm

Info

Publication number: CN104615912A
Application number: CN201510096276.9A
Authority: CN
Inventors: 高会江; 樊惠中; 李俊雅; 夏江威; 吴洋; 张路培; 高雪; 陈燕; 郭鹏
Original assignee: Institute of Animal Science of CAAS
Current assignee: Institute of Animal Science of CAAS
Priority date: 2015-03-04
Filing date: 2015-03-04
Publication date: 2015-05-13
Anticipated expiration: 2035-03-04
Also published as: CN104615912B

Abstract

The invention discloses an improved pathway-based genome-wide association analysis algorithm. For the first time, the algorithm uses principal component analysis together with the max-mean method to construct the gene statistic, which removes the interaction effects among SNPs and effectively addresses SNP linkage within genes. The strategy was applied to Simmental beef cattle GWAS data, and two pathways (the γ-aminobutyric acid pathway and the NAFLD pathway) were found to be significantly associated with two traits (pre-slaughter live weight and eye muscle area). This provides a reliable reference for beef cattle breeding improvement and a reliable theoretical basis for subsequent molecular verification.

Description

An improved pathway-based genome-wide association analysis algorithm

Technical field

The present invention relates to a pathway-based genome-wide association analysis algorithm, specifically to an improved pathway-based genome-wide association analysis algorithm, and belongs to the field of biotechnology.

Background art

With the development of sequencing technology and the spread of high-density SNP chips, genome-wide association studies (GWAS) have increasingly become a powerful tool for human disease research and animal breeding.

Traditional genome-wide association analysis focuses only on the very few loci that strictly reach the statistical "genome-wide" significance level. These few loci usually explain only a small fraction of the genetic variation, so a large amount of residual genetic information in genome-wide association data remains to be mined.

With further study of GWAS, the following shortcomings have gradually become apparent:

(1) For some traits, no SNP passes multiple testing, so gene mapping cannot be carried out; or, even when some SNP loci pass the test, they turn out to show no biological significance.

(2) Studies have shown that the phenotypic variation of complex quantitative traits is often not determined by a few SNPs or genes, so the SNPs or genes found by single-locus regression cannot explain all of the genetic variation.

To solve these problems with GWAS, many different algorithms have been proposed, the most important of which is the pathway-based genome-wide association analysis algorithm. According to how the statistic is constructed, these algorithms fall into three classes:

(1) Using the most significant SNP effect to construct the gene statistic. This approach may fail to detect SNP loci whose individual effects are small but whose combined effect is large, and it is biased toward genes containing more SNPs and pathways containing more genes.

(2) Using the effects of all SNPs within a gene to construct the statistic. This approach is computationally expensive and prone to false positives.

(3) Ranking the SNPs by effect and using the effects of the top K SNPs to construct the statistic. This approach in fact rests on the assumption that SNPs are independent of one another; however, SNPs are in linkage disequilibrium, and the interaction effects among SNPs can greatly reduce the accuracy of the association analysis.

Summary of the invention

To overcome the deficiencies of the prior art, the object of the present invention is to provide an improved pathway-based genome-wide association analysis algorithm that takes SNP interaction effects into account in pathway-based GWAS analysis and effectively reduces the influence of SNP linkage on the results.

To achieve the above object, the present invention adopts the following technical solution:

An improved pathway-based genome-wide association analysis algorithm, characterized in that it comprises the following steps:

(1) Collect pathways:

Download the cattle-related pathways from the KEGG database and screen them;

(2) Correct phenotypes:

Phenotype correction is carried out with the GLM of the R language, using the following model:

y_{ijkm} = \mu + Season_{i} + Year_{j} + Fattendays_{k} + Enterweight_{m} + e_{ijkm}

where y_{ijkm} is the individual phenotypic value, \mu is the population mean, Season_{i} is the birth season, Year_{j} is the slaughter year, Fattendays_{k} is the slaughter date minus the date of entry into the feedlot, Enterweight_{m} is the body weight of the individual at entry into the feedlot, and e_{ijkm} is the residual effect;

(3) Build the eSNP matrix:

Establish the correlation matrix of the genotype indicator variables of the multiple SNPs that make up a gene, select principal components according to the cumulative contribution rate of the eigenvalues of the matrix, and build the super-SNP (eSNP) indicator variable matrix from the eigenvectors of the correlation matrix that correspond to the selected principal components;

(4) Establish the GWAS model:

The association analysis uses a simple generalized linear model based on single-locus regression, as follows:

y^{*} = Xb + Qv + e

where y^{*} is the phenotypic value with the fixed effects removed, b is the effect of the eSNP marker, v is the population structure effect, e is the residual effect, X is the incidence matrix corresponding to the eSNPs, and Q is the incidence matrix corresponding to v;

(5) Build the gene statistic:

The gene statistic is calculated by the following formula:

In the formula, the two terms are, respectively, the mean of the positive statistics and the mean of the negative statistics in gene K;

(6) Calculate the pathway ES value:

The pathway ES value is calculated by the following formula:

ES(K) = \max_{1 \leq j \leq N} \left\{ \sum_{j' < j} \frac{|S_{(j')}|}{N_{R}} - \sum_{j' < j} \frac{1}{N - m} \right\}

where

N_{R} = \sum_{j'}^{N} |S_{(j')}|;

(7) Permute the data and test significance.

The aforesaid algorithm, characterized in that, in step (1), the specific steps of collecting pathways are as follows:

(1a) Download all cattle-related pathways from the KEGG database;

(1b) Retain the pathways with the following characteristics: the number of genes contained is greater than 5 and less than 300, and the pathway still contains more than 5 genes after the SNPs have been assigned to genes.

The aforesaid algorithm, characterized in that, in step (2), the intragenic SNPs consist of the SNPs within the gene and within 20 kb upstream and downstream of it.

The aforesaid algorithm, characterized in that, in step (3), the specific steps of building the eSNP matrix are as follows:

(3a) Establish the correlation matrix of the genotype indicator variables of the multiple SNPs that make up a gene;

(3b) Calculate the eigenvalues and eigenvectors of the correlation matrix;

(3c) Select principal components according to the cumulative contribution rate of the eigenvalues;

(3d) Multiply the genotype indicator variables of the multiple SNPs that make up the gene by the eigenvectors of the correlation matrix that correspond to the selected principal components, to build the super-SNP indicator variable matrix.

The aforesaid algorithm, characterized in that, in step (7), the steps of data permutation and significance testing are as follows:

(7a) Data permutation and estimation of the raw significance:

Permute the data across the phenotype labels and recalculate the ES value, forming a new ES distribution for each pathway or gene set. A total of 1000 permutations are performed, so that each pathway has 1000 null values ES_{null}. The significance of the actually observed ES is estimated from the percentage of ES_{null} values after permutation that are greater than the observed ES;

(7b) Multiple testing correction:

First, a standardized ES value, NES, is constructed from the observed ES value and the mean and standard deviation of ES_{null}, using the following formula:

NES = \frac{ES - mean(ES_{null})}{SD(ES_{null})}

Then, the false discovery rate (FDR) is used for multiple testing correction based on the NES values to obtain more reliable results.

The aforesaid algorithm, characterized in that, in step (7b), the false discovery rate (FDR) is calculated by the following formula:

The beneficial effects of the present invention are:

(1) Principal component analysis and the max-mean method are used for the first time to construct the gene statistic, which eliminates the interaction effects among SNPs and effectively solves the problem of SNP linkage within genes.

(2) We applied this strategy to Simmental beef cattle GWAS data and found that two pathways (the γ-aminobutyric acid pathway and the NAFLD pathway) are significantly associated with two traits (slaughter weight and eye muscle area). This provides a reliable reference for beef cattle breeding improvement and a reliable theoretical basis for subsequent molecular verification.

Description of the drawings

Fig. 1 is a composite plot of the -log_{10}(P) values of the 263 pathways analyzed for slaughter weight in Simmental beef cattle, where P is the significance;

Fig. 2 is a composite plot of the -log_{10}(P) values of the 263 pathways analyzed for eye muscle area in Simmental beef cattle, where P is the significance;

Fig. 3(a) shows the relationship between pathway significance and the number of genes in the pathway for the slaughter weight analysis in Simmental beef cattle;

Fig. 3(b) shows the relationship between pathway significance and the number of SNPs in the pathway for the slaughter weight analysis in Simmental beef cattle;

Fig. 3(c) shows the relationship between pathway significance and the total size (kb) of the genes in the pathway for the slaughter weight analysis in Simmental beef cattle;

Fig. 3(d) shows the relationship between pathway significance and the average size (kb) of the genes in the pathway for the slaughter weight analysis in Simmental beef cattle.

Detailed description of the embodiment

We used the idea of principal component analysis to improve the existing pathway-based genome-wide association analysis algorithm, and used a new formula to construct the gene statistic. The improved algorithm takes SNP interaction effects into account in pathway-based GWAS analysis and effectively reduces the influence of SNP linkage on the results.

The present invention is described in detail below with reference to the drawings and a specific embodiment.

We selected 807 beef cattle from the Wulagai area of Inner Mongolia as the reference population and collected phenotypic data for two traits, slaughter weight and eye muscle area. Summary statistics for the two traits are given in Table 1.

Table 1. Basic information on the phenotypic data of the two meat quality traits

Phenotype	Mean	Standard deviation	Standard error	Maximum	Minimum
Slaughter weight	491.64 kg	57.53 kg	2.03 kg	711 kg	318 kg
Eye muscle area	82.85 cm²	12.23 cm²	0.43 cm²	150 cm²	51 cm²

Illumina 770K high-density SNP chip data were then used, together with the pathway analysis strategy, to carry out an association analysis of the meat quality traits. The specific procedure is as follows:

One. Pathway collection

1. All cattle-related pathways were downloaded from the KEGG database; we collected 280 pathways in total.

2. Unqualified pathways were screened out, retaining the pathways with the following characteristics: the number of genes contained is greater than 5 and less than 300, and the pathway still contains more than 5 genes after the SNPs have been assigned to genes. We counted all SNPs within a gene and within 20 kb upstream and downstream of it as intragenic SNPs.

After screening, we finally selected 263 pathways for the next step of the analysis.
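The screening rule above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation used in the patent: the toy gene coordinates, SNP positions, pathway memberships, and the helper names `snps_in_gene` and `keep_pathway` are all hypothetical. Only the thresholds (more than 5 and fewer than 300 genes, more than 5 genes left after SNP assignment, 20 kb flanks) come from the text.

```python
FLANK = 20_000  # 20 kb upstream and downstream, as stated in the text


def snps_in_gene(gene, snps, flank=FLANK):
    """Names of SNPs on the gene's chromosome within [start - flank, end + flank]."""
    chrom, start, end = gene
    return [name for name, (c, pos) in snps.items()
            if c == chrom and start - flank <= pos <= end + flank]


def keep_pathway(pathway_genes, genes, snps, lo=5, hi=300):
    """Apply both screening conditions to one pathway."""
    if not (lo < len(pathway_genes) < hi):
        return False
    # Genes that still carry at least one SNP after SNP-to-gene assignment
    covered = [g for g in pathway_genes
               if g in genes and snps_in_gene(genes[g], snps)]
    return len(covered) > lo


# Hypothetical toy data: gene -> (chromosome, start, end); SNP -> (chromosome, position)
genes = {f"g{i}": ("1", i * 100_000, i * 100_000 + 10_000) for i in range(1, 9)}
snps = {f"s{i}": ("1", i * 100_000 + 25_000) for i in range(1, 7)}  # 15 kb past gene end
snps["far"] = ("1", 760_000)  # more than 20 kb away from every gene

pathways = {
    "pw_ok": [f"g{i}" for i in range(1, 9)],             # 8 genes, 6 with SNPs
    "pw_small": ["g1", "g2", "g3"],                      # too few genes
    "pw_sparse": ["g3", "g4", "g5", "g6", "g7", "g8"],   # 6 genes, only 4 with SNPs
}
kept = [name for name, gl in pathways.items() if keep_pathway(gl, genes, snps)]
print(kept)
```

A SNP that falls inside the windows of two genes is returned for both of them, which matches the assignment rule described later for the eSNP matrix.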

Two. Phenotype correction

The reference population of 807 bulls was born in 2010, 2011 and 2012 and came from 15 farms; age at slaughter ranged from 13 to 20 months. To remove the fixed effects of year, season, entry weight, and fattening days, we used the GLM of the R language to correct the phenotypes, with the following model:

y_{ijkm} = \mu + Season_{i} + Year_{j} + Fattendays_{k} + Enterweight_{m} + e_{ijkm}

where y_{ijkm} is the individual phenotypic value; \mu is the population mean; Season_{i} is the birth season (coded by group: 001 represents November to April, 010 represents May to August, and 100 represents September to October); Year_{j} is the slaughter year (also coded by group: 001 represents 2009, 010 represents 2010, and 100 represents 2011); Fattendays_{k} is the slaughter date minus the date of entry into the feedlot; Enterweight_{m} is the body weight of the individual at entry into the feedlot; and e_{ijkm} is the residual effect.

In the experiment, the residual effect e_{ijkm} was taken as the corrected phenotype y^{*} and used as the phenotypic value in the genome-wide association analysis.
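The correction step can be illustrated with a short sketch. The text uses the GLM of the R language; the Python code below swaps in an ordinary least-squares fit with NumPy as a stand-in, on simulated data. All variable names, effect sizes, and the helper `one_hot_drop_first` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated (hypothetical) records: two 3-level factors and two covariates
season = rng.integers(0, 3, n)            # birth-season group
year = rng.integers(0, 3, n)              # slaughter-year group
fatten_days = rng.uniform(200, 400, n)    # slaughter date minus entry date
enter_weight = rng.uniform(150, 300, n)   # body weight at entry
y = (300 + 10 * season - 5 * year + 0.3 * fatten_days
     + 0.5 * enter_weight + rng.normal(0, 20, n))


def one_hot_drop_first(codes, levels):
    """Dummy-code a factor, using the first level as the reference."""
    return (codes[:, None] == np.arange(1, levels)[None, :]).astype(float)


# Design matrix: intercept (mu), Season, Year, Fattendays, Enterweight
X = np.column_stack([
    np.ones(n),
    one_hot_drop_first(season, 3),
    one_hot_drop_first(year, 3),
    fatten_days,
    enter_weight,
])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_star = y - X @ beta   # residuals, used as the corrected phenotype y*
```

Because the model contains an intercept, the residuals `y_star` average to zero, and their spread is smaller than that of the raw phenotype once the fixed effects are removed.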

Three. Building the eSNP matrix

Below we use the gene SV2C, which contains 133 SNPs, as a detailed example.

1. Establish the correlation matrix of the genotype indicator variables of the multiple SNPs that make up the gene.

The matrix of SNP genotype indicator variables is built from the position information of the gene and the SNPs; here we also use the SNPs located within 20 kb upstream and downstream of the gene to build the matrix.

If a SNP lies within two genes at the same time, it is assigned to both genes.

2. Calculate the eigenvalues and eigenvectors of the correlation matrix.

The specific calculation steps are as follows:

(1) Standardize the original indicator variable matrix.

(2) Calculate the correlation matrix.

(3) Find the eigenvalues of the correlation matrix and sort them from largest to smallest.

(4) Find the eigenvectors of the correlation matrix corresponding to the eigenvalues.

3. Select principal components according to the cumulative contribution rate of the eigenvalues.

In the algorithm of the present invention, a cumulative contribution rate of 85% is used as the criterion for selecting principal components.

For the SV2C gene, 13 principal components were selected.

4. Multiply the genotype indicator variables of the multiple SNPs that make up the gene by the eigenvectors of the correlation matrix that correspond to the selected principal components, to build the "super-SNP" (eSNP) indicator variable matrix.
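The four steps above can be sketched for a single gene as follows. This is an illustrative Python sketch, not the patent's code: the function name `build_esnp`, the simulated genotypes, and the choice to project the standardized (rather than raw) indicator variables are assumptions. It also assumes every SNP is polymorphic, since a constant column cannot be standardized.

```python
import numpy as np


def build_esnp(genotypes, threshold=0.85):
    """genotypes: (individuals x SNPs) matrix of 0/1/2 indicator variables for one gene."""
    G = np.asarray(genotypes, dtype=float)
    Z = (G - G.mean(axis=0)) / G.std(axis=0, ddof=1)   # (1) standardize each SNP
    R = np.corrcoef(Z, rowvar=False)                   # (2) correlation matrix
    eigval, eigvec = np.linalg.eigh(R)                 # (3)/(4) eigen-decomposition
    order = np.argsort(eigval)[::-1]                   # sort from largest to smallest
    eigval, eigvec = eigval[order], eigvec[:, order]
    cum = np.cumsum(eigval) / eigval.sum()             # cumulative contribution rate
    k = int(np.searchsorted(cum, threshold)) + 1       # smallest k reaching the threshold
    return Z @ eigvec[:, :k], cum[:k]                  # eSNP matrix, contribution rates


# Hypothetical gene: 12 SNPs in three LD blocks, 100 individuals
rng = np.random.default_rng(1)
n = 100
latent = rng.binomial(2, 0.4, size=(n, 3))
G = np.column_stack([latent[:, j % 3] for j in range(12)]).astype(float)
noise = rng.random(G.shape) < 0.1                      # perturb 10% of genotypes
G[noise] = rng.binomial(2, 0.4, size=int(noise.sum()))

esnp, cum = build_esnp(G)
print(esnp.shape, cum[-1])
```

The columns of `esnp` are mutually uncorrelated by construction, which is the property the method relies on to remove the effect of linkage among the SNPs of a gene before the association test.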

Four. Establishing the GWAS model

For the association analysis we use a simple generalized linear model based on single-locus regression, as follows:

y^{*} = Xb + Qv + e

where y^{*} is the phenotypic value with the fixed effects removed, b is the effect of the eSNP marker, v is the population structure effect, e is the residual effect, X is the incidence matrix corresponding to the eSNPs, and Q is the incidence matrix corresponding to v.

In addition, because our Simmental beef cattle are not a completely random population, population stratification is also taken into account here, and we add the first three eigenvectors to the regression analysis as covariates.
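A single-marker fit of this model can be sketched as below. The sketch assumes an ordinary least-squares fit with a normal error term and reports a t statistic for b; the function name `single_marker_t` and the simulated data are hypothetical, and the three columns of `Q` stand in for the first three eigenvectors used as covariates.

```python
import numpy as np


def single_marker_t(y, x, Q):
    """t statistic for the effect b of one eSNP column x, adjusted for covariates Q."""
    n = len(y)
    D = np.column_stack([np.ones(n), x, Q])       # intercept, marker, structure covariates
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ coef
    dof = n - D.shape[1]                          # residual degrees of freedom
    sigma2 = resid @ resid / dof                  # residual variance estimate
    cov = sigma2 * np.linalg.inv(D.T @ D)         # covariance of the coefficients
    return coef[1] / np.sqrt(cov[1, 1])


# Simulated (hypothetical) data: one associated eSNP and one unassociated eSNP
rng = np.random.default_rng(2)
n = 300
Q = rng.normal(size=(n, 3))                       # stand-in for three eigenvectors
x_causal = rng.normal(size=n)
x_null = rng.normal(size=n)
y_star = 0.8 * x_causal + Q @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)

t_causal = single_marker_t(y_star, x_causal, Q)
t_null = single_marker_t(y_star, x_null, Q)
print(t_causal, t_null)
```

Running the function once per eSNP column gives the per-marker statistics that feed the gene statistic in the next step.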

Five. Building the gene statistic

First, suppose that there are N genes in this study and that each gene has one or more SNPs X_{i} (i = 1, 2, ..., N) mapped to it.

For a given gene G_{i}, among all X SNPs mapped to it, the larger of the mean positive effect and the mean negative effect of the SNP-phenotype associations is selected as the statistic representing the gene, denoted r_{i}.

Then, the gene statistic is constructed according to the following formula:

In the formula, the two terms represent, respectively, the mean of the positive statistics and the mean of the negative statistics in gene K.

For the gene SV2C, the mean of the positive values is 0.6046093 and the mean of the negative values is 0.5865826, so the statistic of this gene is finally determined to be 0.6046093.

Next, the gene-phenotype association statistics are sorted from highest to lowest.

Finally, the genes corresponding to these values are arranged into a gene list, denoted L(r_{1}, r_{2}, ..., r_{n}).
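The max-mean rule and the ranking can be sketched as follows. The code assumes that each mean is taken only over the statistics of the corresponding sign and that the negative mean is compared by its absolute value, which is one reading of the description above. The function name `max_mean` and the toy per-eSNP statistics are hypothetical.

```python
def max_mean(stats):
    """Larger of the mean positive statistic and the mean absolute negative statistic."""
    pos = [s for s in stats if s > 0]
    neg = [-s for s in stats if s < 0]
    mean_pos = sum(pos) / len(pos) if pos else 0.0
    mean_neg = sum(neg) / len(neg) if neg else 0.0
    return max(mean_pos, mean_neg)


# Hypothetical per-eSNP association statistics for three genes
gene_stats = {
    "geneA": [1.2, 0.4, -0.3],
    "geneB": [-2.0, -1.0, 0.5],
    "geneC": [0.2, -0.1],
}

r = {gene: max_mean(stats) for gene, stats in gene_stats.items()}
ranked = sorted(r, key=r.get, reverse=True)   # gene list L, highest statistic first
print(ranked)
```

With the two means reported for SV2C, `max_mean([0.6046093, -0.5865826])` returns 0.6046093, in line with the value stated in the text.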

Six. Calculating the pathway ES value

For each given pathway S, which is made up of N genes, its ES value is calculated. The ES value is a weighted running-sum statistic similar to the Kolmogorov-Smirnov test, and it reflects the degree to which the genes in set S are over-represented at the top of the ranked list of all genes. The enrichment value is computed by walking from the top of the sorted gene list L toward its tail: whenever a gene belonging to S is encountered, the running sum is increased; otherwise it is decreased. The ES value measures the maximum deviation between the enrichment of the statistics in set S and that of a set of genes drawn at random from the genome. If the association signals of set S are concentrated at the top of the ranking, the ES value will be large. The ES is calculated as follows:

ES(K) = \max_{1 \leq j \leq N} \left\{ \sum_{j' < j} \frac{|S_{(j')}|}{N_{R}} - \sum_{j' < j} \frac{1}{N - m} \right\}

where

N_{R} = \sum_{j'}^{N} |S_{(j')}| .
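The running-sum calculation can be sketched as follows. The sketch assumes, following the verbal description above, that the first sum runs over genes that belong to the pathway and the second over genes that do not, with N the length of the ranked list and m the number of pathway genes in it. The function name `enrichment_score` and the toy values are hypothetical.

```python
def enrichment_score(ranked_genes, stats, pathway):
    """Maximum of the weighted running sum along the ranked gene list."""
    members = set(pathway)
    N = len(ranked_genes)
    m = sum(g in members for g in ranked_genes)               # pathway genes in the list
    n_r = sum(abs(stats[g]) for g in ranked_genes if g in members)
    running, best = 0.0, float("-inf")
    for g in ranked_genes:
        if g in members:
            running += abs(stats[g]) / n_r                    # hit: add weighted step
        else:
            running -= 1.0 / (N - m)                          # miss: subtract fixed step
        best = max(best, running)
    return best


# Hypothetical ranked list with gene statistics, highest first
stats = {"a": 3.0, "b": 2.5, "c": 2.0, "d": 1.0, "e": 0.5, "f": 0.2}
ranked = sorted(stats, key=stats.get, reverse=True)

es_top = enrichment_score(ranked, stats, {"a", "b", "e"})     # members near the top
es_bottom = enrichment_score(ranked, stats, {"d", "e", "f"})  # members near the tail
print(es_top, es_bottom)
```

A pathway whose genes cluster at the top of the list reaches a large running sum early, while a pathway whose genes sit at the tail never rises far above zero.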

Seven. Data permutation and significance testing

1. Data permutation and estimation of the raw significance

The data are permuted across the phenotype labels and the ES value is recalculated, forming a new ES distribution (null distribution) for each pathway or gene set; 1000 permutations are performed in total. Each pathway therefore has 1000 null values ES_{null}. The significance of the actually observed ES is estimated from the percentage of ES_{null} values after permutation that are greater than the observed ES.
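The permutation procedure can be sketched generically as below. In the real analysis each permutation would rerun the association, gene statistic, and ES steps on the shuffled phenotypes; here `compute_es` is a hypothetical stand-in (the absolute correlation between the phenotype and one marker score) so that the sketch stays self-contained. The function names and simulated data are assumptions.

```python
import numpy as np


def permutation_null(y, compute_es, n_perm=1000, seed=0):
    """Recompute the statistic on n_perm shuffles of the phenotype labels."""
    rng = np.random.default_rng(seed)
    return np.array([compute_es(rng.permutation(y)) for _ in range(n_perm)])


def empirical_p(es_obs, es_null):
    """Fraction of null values that are greater than the observed value."""
    return float(np.mean(np.asarray(es_null) > es_obs))


# Simulated (hypothetical) data with a real phenotype-marker association
rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)


def compute_es(phenotype):
    # Stand-in for the full pipeline: any statistic recomputed from the phenotype
    return abs(np.corrcoef(x, phenotype)[0, 1])


es_obs = compute_es(y)
es_null = permutation_null(y, compute_es, n_perm=1000)
p_value = empirical_p(es_obs, es_null)
print(es_obs, p_value)
```

With 1000 permutations the smallest non-zero p-value that can be resolved is 0.001, which is worth keeping in mind when reading values such as P = 0.0005 against a fixed number of permutations.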

2. Multiple testing correction

First, a standardized ES value (NES) is constructed from the observed ES value and the mean and standard deviation of ES_{null}, using the following formula:

NES = \frac{ES - mean(ES_{null})}{SD(ES_{null})} .

Then, the false discovery rate (FDR) is used for multiple testing correction based on the NES values to obtain more reliable results.
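The NES standardization can be written out in a few lines. The sketch assumes the sample standard deviation (n - 1 denominator) for SD, which the text does not specify; the function name `nes` and the toy null values are hypothetical.

```python
import statistics


def nes(es_obs, es_null):
    """Standardize an observed ES by the mean and SD of its permutation null values."""
    return (es_obs - statistics.mean(es_null)) / statistics.stdev(es_null)


# Hypothetical null values for one pathway (a real run would have 1000 of them)
es_null = [0.1, 0.2, 0.3, 0.2, 0.2]
score = nes(0.4, es_null)
print(score)
```

Because each pathway is standardized against its own null distribution, NES values from pathways of different sizes become comparable, which is what allows a single FDR correction across all pathways.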

The FDR is calculated, after the 1000 data permutations, from the proportion of the total accounted for by count ① and the proportion of the total accounted for by count ②; the formula is as follows:

Eight. Results

After carrying out the pathway-based GWAS analysis for slaughter weight, we found a total of 4 pathways that reached significance (P < 0.01), as shown in Fig. 1. Among these pathways, the γ-aminobutyric acid pathway had the highest significance (ES = 0.4, P = 0.00876). The γ-aminobutyric acid pathway contains 87 genes, of which 58 are within our SNP coverage and 7 rank among the top 1000 genes by significance. Among these genes:

(1) The gene with the largest enrichment value is GNG11 (T_{maxmean} = 1.44).

(2) The remaining genes are GLS2 (T_{maxmean} = 1.36), GABRD (T_{maxmean} = 1.19), GNB3 (T_{maxmean} = 1.07), GABRA6 (T_{maxmean} = 1.05), and GABBR1 (T_{maxmean} = 1.04).

Further analysis of γ-aminobutyric acid showed that it is related to feed intake: Fan et al. found that feeding pigs γ-aminobutyric acid increased their feed intake, and the study of Wang et al. showed that γ-aminobutyric acid is related to feed intake, lactation performance, and animal health in dairy cows. With the algorithm of the present invention, we found that γ-aminobutyric acid is associated with the slaughter weight of beef cattle.

After carrying out the pathway-based GWAS analysis for eye muscle area, we found a total of 5 pathways that reached significance (P < 0.01), as shown in Fig. 2. Among these pathways, the non-alcoholic fatty liver disease (NAFLD) pathway had the highest significance (ES = 0.38, P = 0.0005), in which:

(1) The gene with the largest enrichment value is MAP3K11 (T_{maxmean} = 1.82).

(2) The remaining genes are NDUFA6 (T_{maxmean} = 1.80), IRS1 (T_{maxmean} = 1.62), and NDUFA7 (T_{maxmean} = 1.53).

Nine. Method comparison

We also analyzed the collected data with the minimum P-value method. The results show that the minimum P-value method found only one significant pathway for each of the two traits, and neither of the two pathways passed multiple testing.

We examined four factors that might affect the results: the number of genes in a pathway, the number of SNPs in a pathway, the total length of the genes in a pathway, and the average length of the genes in a pathway; the results are shown in Fig. 3. Fig. 3 shows that these factors are not associated with the pathway enrichment value.

From the above analysis, in data sets where the SNP interaction effects within genes are strong, the pathways selected by the algorithm of the present invention may contain larger genes or genes with more SNPs.

This is the first use of pathway-based GWAS in a beef cattle population. The results identify pathways that can effectively influence growth traits in beef cattle, which provides a reliable reference for beef cattle breeding improvement and a reliable theoretical basis for subsequent molecular verification.

It should be noted that the above embodiment does not limit the present invention in any form; all technical solutions obtained by equivalent substitution or equivalent transformation fall within the protection scope of the present invention.

Claims

1. An improved pathway-based genome-wide association analysis algorithm, characterized in that it comprises the following steps:

(1) Collect pathways:

Download the cattle-related pathways from the KEGG database and screen them;

(2) Correct phenotypes:

y_{ijkm} = \mu + Season_{i} + Year_{j} + Fattendays_{k} + Enterweight_{m} + e_{ijkm}

(3) Build the eSNP matrix:

(4) Establish the GWAS model:

y^{*} = Xb + Qv + e

(5) Build the gene statistic:

The gene statistic is calculated by the following formula:

(6) Calculate the pathway ES value:

The pathway ES value is calculated by the following formula:

ES(K) = \max_{1 \leq j \leq N} \left\{ \sum_{j' < j} \frac{|S_{(j')}|}{N_{R}} - \sum_{j' < j} \frac{1}{N - m} \right\}

where

N_{R} = \sum_{j'}^{N} |S_{(j')}|;

(7) Permute the data and test significance.

2. The algorithm according to claim 1, characterized in that, in step (1), the specific steps of collecting pathways are as follows:

(1a) Download all cattle-related pathways from the KEGG database;

3. The algorithm according to claim 2, characterized in that, in step (2), the intragenic SNPs consist of the SNPs within the gene and within 20 kb upstream and downstream of it.

4. The algorithm according to claim 1, characterized in that, in step (3), the specific steps of building the eSNP matrix are as follows:

5. The algorithm according to claim 1, characterized in that, in step (7), the steps of data permutation and significance testing are as follows:

(7a) Data permutation and estimation of the raw significance:

(7b) Multiple testing correction:

NES = \frac{ES - mean(ES_{null})}{SD(ES_{null})}

6. The algorithm according to claim 5, characterized in that, in step (7b), the false discovery rate (FDR) is calculated by the following formula: