CN105512510A

CN105512510A - Algorithm for assessing heritability through genome data

Info

Publication number: CN105512510A
Application number: CN201510873172.4A
Authority: CN
Inventors: 肖世俊; 董林松; 王志勇
Original assignee: Jimei University
Current assignee: Jimei University
Priority date: 2015-12-03
Filing date: 2015-12-03
Publication date: 2016-04-20
Anticipated expiration: 2035-12-03
Also published as: CN105512510B

Abstract

The invention discloses an algorithm for assessing heritability from genomic data. For a given quantitative trait, marker effects across the whole genome are estimated with the GBLUP algorithm using reference populations of different sizes, the breeding values of a validation population are then obtained, and the estimation accuracy is calculated. A linearized curve is fitted to the genomic estimation accuracy and the reference population size, and the reciprocal of the intercept of the fitted regression equation serves as the heritability estimate. The algorithm assesses the heritability of quantitative traits from genomic data, and the results can be applied directly to quantitative-trait breeding of animals and plants. No pedigree is recorded for the individuals; instead, the genome of each individual is sequenced and the heritability of the trait is predicted from genome-wide markers. The heritability estimates are intended mainly for use in subsequent breeding work. In addition, because sequencing can capture Mendelian sampling error, it yields more accurate relationship information than recorded pedigree data.

Description

An algorithm for assessing heritability from genomic data

Technical field

The present invention relates to the field of genetic engineering, and specifically to an algorithm for assessing heritability from genomic data.

Background art

Current methods for assessing heritability mainly rely on the kinship between individuals and infer heritability by statistical means such as analysis of variance and correlation-based methods. These methods require complete pedigree records, but for some species, such as aquatic animals, pedigree recording is very laborious or even impracticable. In addition, traditional methods treat genomic information as a "black box" and therefore cannot capture the specifics of how genes are transmitted from parents to offspring; that is, they cannot accurately capture Mendelian sampling error, which leads to larger estimation errors. To address the heavy workload of pedigree recording and the inability to accurately capture Mendelian sampling error in conventional heritability estimation, the prior art needs to be improved.

Summary of the invention

The object of the present invention is to provide an algorithm for assessing heritability from genomic data that overcomes the large errors and cumbersome pedigree recording of conventional heritability estimation, so as to solve the problems raised in the background art above.

The present invention does not record individual pedigrees. Instead, the genomes of all individuals are sequenced directly, and individual performance records are combined with genomic marker information to estimate the accuracy of genomic breeding value estimation, from which the heritability of the trait is then estimated.

To achieve the above object, the present invention provides the following technical solution:

An algorithm for assessing heritability from genomic data: for a given quantitative trait, marker effects across the whole genome are estimated using reference populations of different sizes, the breeding values of the validation population are then obtained, and the estimation accuracy is calculated. This process is in essence the procedure of genomic selection. The invention uses GBLUP as the algorithm for computing marker effects. GBLUP was proposed by Meuwissen et al. in 2001; its prior assumes that the effect variances of all marker loci in the genome are equal, and the marker effects are obtained from the following formula:

$$ \begin{bmatrix} 1_{n}' 1_{n} & 1_{n}' X \\ X' 1_{n} & X' X + I\lambda \end{bmatrix} \begin{bmatrix} \hat{\mu} \\ \hat{g} \end{bmatrix} = \begin{bmatrix} 1_{n}' y \\ X' y \end{bmatrix} \qquad (1) $$

where $\hat{\mu}$ is the population mean and $\hat{g}$ is the vector of effects of all marker loci. The genomic estimated breeding value (GEBV) is obtained by summing the effects of all marker loci, i.e. $\mathrm{GEBV} = \sum X_{i} g_{i}$. The accuracy of the GEBV is given by the correlation coefficient between the GEBV and the true breeding value (TBV), i.e. $r_{(\mathrm{GEBV},\,\mathrm{TBV})}$. Daetwyler et al. (2008) derived another formula for $r_{(\mathrm{GEBV},\,\mathrm{TBV})}$ when breeding values are estimated with the GBLUP algorithm:

$$ r_{(\mathrm{GEBV},\,\mathrm{TBV})} = \sqrt{\frac{N_{p} h^{2}}{N_{p} h^{2} + M}} \qquad (2) $$

where $N_{p}$ is the number of individuals in the reference population, $h^{2}$ is the heritability of the trait under study, and $M$ is the number of effective genome segments that determine the trait. In practice, however, the actual TBV cannot be known, so the phenotypic value (Y) is used in place of the TBV, and the relationship between GEBV and Y is derived as:

$$ r_{(\mathrm{GEBV},\,Y)} = r_{(\mathrm{GEBV},\,\mathrm{TBV})} \cdot h = \sqrt{\frac{N_{p} h^{2}}{N_{p} h^{2} + M}} \cdot h \qquad (3) $$

In formula (3), different values of $r_{(\mathrm{GEBV},\,Y)}$ are obtained by varying $N_{p}$, and the resulting curve is fitted by linearization. Rearranging formula (3) gives the linear equation:

$$ \frac{1}{r_{(\mathrm{GEBV},\,Y)}^{2}} = \frac{1}{h^{2}} + \frac{M}{h^{4}} \cdot \frac{1}{N_{p}} \qquad (4) $$

This equation has the form of the linear regression model $y = a + bx$, where $y$ is the reciprocal of the square of $r_{(\mathrm{GEBV},\,Y)}$ and $x$ is the reciprocal of $N_{p}$. The intercept $a$ is the reciprocal of the heritability, so taking the reciprocal of the fitted intercept gives the heritability estimate.
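The rearrangement of formula (3) into formula (4) can be checked by squaring and then inverting. The following worked step uses only the symbols already defined above:

```latex
\begin{aligned}
r_{(\mathrm{GEBV},\,Y)}^{2}
  &= \frac{N_{p} h^{2}}{N_{p} h^{2} + M} \cdot h^{2}
   = \frac{N_{p} h^{4}}{N_{p} h^{2} + M} \\[4pt]
\frac{1}{r_{(\mathrm{GEBV},\,Y)}^{2}}
  &= \frac{N_{p} h^{2} + M}{N_{p} h^{4}}
   = \frac{1}{h^{2}} + \frac{M}{h^{4}} \cdot \frac{1}{N_{p}}
\end{aligned}
```

Matching terms with $y = a + bx$ gives $a = 1/h^{2}$ and $b = M/h^{4}$, so $h^{2} = 1/a$; as a by-product, $M = b/a^{2}$.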

As a further aspect of the present invention: the genomes of all individuals are sequenced to obtain SNP information, the SNP loci are matched across all individuals, and missing data are filled in by imputation.

As a further aspect of the present invention: to prevent a single estimate from having a large error, repeated cross-validation is used, in which the reference population and the validation population are repeatedly drawn at random from the whole population, giving an estimate close to the true value.

As a further aspect of the present invention: GBLUP is applied with different reference population sizes to compute the effect of each genomic marker and obtain the breeding values of the validation population; the estimation accuracy is obtained by correlation analysis of the breeding values and phenotypic values of the validation population.

Compared with the prior art, the beneficial effects of the present invention are as follows. The invention assesses the heritability of quantitative traits from genomic data, and the results can be applied directly to quantitative-trait breeding of animals and plants. The algorithm does not need to be built on family structure: it predicts the heritability of a trait from genome-wide markers, which solves the problem that pedigree recording is laborious or even impracticable. Moreover, because sequencing can capture Mendelian sampling error, the algorithm obtains more accurate relationship information than recorded pedigree data.

Brief description of the drawings

Fig. 1 is a flow chart of the algorithm of the present invention.

Fig. 2 shows how the GEBV accuracy for the two traits, body weight and body length, varies with reference population size.

Fig. 3 shows the GEBV accuracy for the two traits, body weight and body length, against reference population size after transformation according to formula (4).

In Fig. 3, the horizontal axis is the reciprocal of the number of individuals in the reference population, the vertical axis is the reciprocal of the square of the GEBV accuracy, and $R^{2}$ is the coefficient of determination of the regression equation.

Detailed description

The technical solution of this patent is described in more detail below with reference to specific embodiments.

Referring to Figs. 1-3, in the algorithm for assessing heritability from genomic data, for a given quantitative trait, marker effects across the whole genome are estimated using reference populations of different sizes, the breeding values of the validation population are then obtained, and the estimation accuracy is calculated. A linearized curve is fitted to the genomic estimation accuracy and the reference population size, and the reciprocal of the intercept of the fitted regression equation is the heritability estimate. The algorithm is characterized in that the genomic selection procedure uses GBLUP as the algorithm for computing marker effects, the effect variances of all marker loci in the genome are equal, and the marker effects are obtained from the following formula:

$$ \begin{bmatrix} 1_{n}' 1_{n} & 1_{n}' X \\ X' 1_{n} & X' X + I\lambda \end{bmatrix} \begin{bmatrix} \hat{\mu} \\ \hat{g} \end{bmatrix} = \begin{bmatrix} 1_{n}' y \\ X' y \end{bmatrix} \qquad (1) $$

where $\hat{\mu}$ is the population mean and $\hat{g}$ is the vector of effects of all marker loci. The genomic estimated breeding value (GEBV) is obtained by summing the effects of all marker loci, i.e. $\mathrm{GEBV} = \sum X_{i} g_{i}$. The accuracy of the GEBV is given by the correlation coefficient between the GEBV and the true breeding value (TBV), i.e. $r_{(\mathrm{GEBV},\,\mathrm{TBV})}$. When breeding values are estimated with the GBLUP algorithm, another formula for $r_{(\mathrm{GEBV},\,\mathrm{TBV})}$ is:

$$ r_{(\mathrm{GEBV},\,\mathrm{TBV})} = \sqrt{\frac{N_{p} h^{2}}{N_{p} h^{2} + M}} \qquad (2) $$

where $N_{p}$ is the number of individuals in the reference population, $h^{2}$ is the heritability of the trait under study, and $M$ is the number of effective genome segments that determine the trait. In practice, the actual TBV cannot be known, so the phenotypic value (Y) is used in place of the TBV, and the relationship between GEBV and Y is derived as:

$$ r_{(\mathrm{GEBV},\,Y)} = r_{(\mathrm{GEBV},\,\mathrm{TBV})} \cdot h = \sqrt{\frac{N_{p} h^{2}}{N_{p} h^{2} + M}} \cdot h \qquad (3) $$

$$ \frac{1}{r_{(\mathrm{GEBV},\,Y)}^{2}} = \frac{1}{h^{2}} + \frac{M}{h^{4}} \cdot \frac{1}{N_{p}} \qquad (4) $$

The genomes of all individuals are sequenced to obtain SNP information, the SNP loci are matched across all individuals, and missing data are filled in by imputation. To prevent a single estimate from having a large error, repeated cross-validation is used, in which the reference population and the validation population are repeatedly drawn at random from the whole population, giving an estimate close to the true value. GBLUP is applied with different reference population sizes to compute the effect of each genomic marker and obtain the breeding values of the validation population, and the estimation accuracy is obtained by analyzing the breeding values and phenotypic values of the validation population. This solves the problem that pedigree recording is laborious or even impracticable, while accurately capturing the Mendelian sampling error of alleles during transmission.

Embodiment 1

1. The test subjects were 500 large yellow croakers produced by artificially induced spawning; all fish were born on the same day and are therefore the same age. Measurements were taken when the fish were two years old, and the measured traits were the body weight and body length of every fish.

2. All individuals under study were genotyped by GBS (genotyping-by-sequencing), and qualified SNP loci were screened with the following parameters: loci with MAF > 0.05, a Hardy-Weinberg equilibrium test P-value > 0.001, and a per-locus missing rate below 20% were retained. A total of 29,748 qualified SNP markers were finally obtained; missing genotypes were filled in with the imputation program of Beagle version 3.3.2.
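The three screening thresholds in step 2 can be illustrated with a minimal sketch. It assumes genotypes are coded as 0/1/2 allele counts with `None` for missing calls, and it uses a 1-degree-of-freedom chi-square test for Hardy-Weinberg equilibrium; the text does not say which test or which software performed the filtering, so the test choice and the function names `hwe_p_value` and `passes_qc` are assumptions for illustration only.

```python
import math


def hwe_p_value(n_aa, n_ab, n_bb):
    """Chi-square (1 df) Hardy-Weinberg p-value from genotype counts (assumed test)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    if min(expected) == 0:  # monomorphic locus: the test is undefined
        return 1.0
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 df is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(genotypes, maf_min=0.05, hwe_p_min=0.001, max_missing=0.20):
    """Apply the step-2 thresholds to one locus.

    genotypes: one entry per individual, 0/1/2 allele count or None if missing.
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1.0 - len(called) / len(genotypes)
    if not called or missing_rate >= max_missing:
        return False  # keep only loci with missing rate below 20%
    freq = sum(called) / (2 * len(called))
    maf = min(freq, 1.0 - freq)
    if maf <= maf_min:
        return False  # keep only loci with MAF > 0.05
    n_bb, n_ab, n_aa = (called.count(k) for k in (0, 1, 2))
    return hwe_p_value(n_aa, n_ab, n_bb) > hwe_p_min
```

A locus is kept only if all three conditions hold; in a real pipeline the same check would be applied to every candidate locus before imputation.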

3. Of the 500 individuals, 20% (100 individuals) were drawn at random as the validation population. Reference populations were formed from the remaining individuals at four levels of 100, 200, 300 and 400 individuals, to observe how the estimation accuracy changes across the four reference population sizes. At each level, all marker effects were estimated with the GBLUP algorithm and the GEBV of each individual in the validation population was obtained; the estimation accuracy, $r_{(\mathrm{GEBV},\,Y)}$, was obtained as the correlation coefficient between the GEBV and the phenotypic values of the validation population.
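As an illustration of formula (1) and of the accuracy calculation in step 3, the following is a minimal pure-Python sketch, not the patent's implementation. It assumes `X` is a matrix of marker genotypes (one row per individual), `y` is the phenotype vector, and `lam` is a shrinkage parameter supplied by the caller (commonly taken as the ratio of residual variance to marker-effect variance, which the text does not specify). All function names are hypothetical, and the dense toy solver is only practical for a handful of markers.

```python
import math


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_marker_effects(X, y, lam):
    """Build and solve the system in formula (1); returns (mu_hat, g_hat)."""
    n, k = len(X), len(X[0])
    col_sums = [sum(row[j] for row in X) for j in range(k)]  # 1'X
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    lhs = [[float(n)] + col_sums]  # first block row: [1'1, 1'X]
    for i in range(k):
        penalised = [xtx[i][j] + (lam if i == j else 0.0) for j in range(k)]
        lhs.append([col_sums[i]] + penalised)  # [X'1, X'X + I*lambda]
    rhs = [sum(y)] + [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(k)]
    solution = solve_linear(lhs, rhs)
    return solution[0], solution[1:]


def gebv(X, g_hat):
    """GEBV of each individual as the sum of X_i * g_i over all markers."""
    return [sum(x * g for x, g in zip(row, g_hat)) for row in X]


def pearson(u, v):
    """Pearson correlation coefficient, used here as the estimation accuracy."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)
```

In use, `fit_marker_effects` would be called on the reference population, `gebv` on the validation genotypes, and `pearson` on the validation GEBV and phenotypes. For tens of thousands of markers, a numerical linear algebra library or an equivalent formulation based on a genomic relationship matrix would be the practical choice.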

To reduce the influence of sampling error in any single draw, step 3 was repeated 20 times. Because the individuals in the validation and reference populations are drawn at random each time, the result of each repetition differs slightly, but the mean of the 20 results should be closer to the true value. The means over the 20 repetitions are shown in Fig. 2.
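The repeated random sampling described above can be sketched as a small harness. This is an assumption-laden outline: `accuracy_fn` is a hypothetical callback that fits marker effects on the reference individuals and returns $r_{(\mathrm{GEBV},\,Y)}$ on the validation individuals, and the exact way reference individuals were drawn at each level is not stated in the text (here they are sampled from the individuals left after removing the validation population).

```python
import random


def mean_accuracy_by_size(n_total, n_validation, ref_sizes, accuracy_fn,
                          n_repeats=20, seed=0):
    """Average validation accuracy per reference size over repeated random splits.

    accuracy_fn(reference_ids, validation_ids) -> float is supplied by the caller.
    """
    rng = random.Random(seed)
    totals = {size: 0.0 for size in ref_sizes}
    for _ in range(n_repeats):
        ids = list(range(n_total))
        rng.shuffle(ids)
        validation, pool = ids[:n_validation], ids[n_validation:]
        for size in ref_sizes:
            reference = rng.sample(pool, size)  # never overlaps the validation set
            totals[size] += accuracy_fn(reference, validation)
    return {size: total / n_repeats for size, total in totals.items()}
```

With the design of this embodiment the call would be `mean_accuracy_by_size(500, 100, [100, 200, 300, 400], accuracy_fn)`, giving one averaged accuracy per reference population size.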

4. The reciprocal of the reference population size ($N_{p}$) at each level was taken, as was the reciprocal of the square of the mean estimation accuracy ($r_{(\mathrm{GEBV},\,Y)}$) over the 20 results at each level. The relationship between the two is shown in Fig. 3, and the final regression equation was fitted according to formula (4), as shown in the following table:

According to the results in the table above, the heritability estimate is 0.227 for body weight and 0.196 for body length.
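The fitting in step 4 can be sketched as follows, assuming ordinary least squares is used for the regression in formula (4) (the text says only that a regression equation is fitted). The function names are hypothetical, and `expected_accuracy` simply evaluates formula (3) so that the sketch can be tried on synthetic values; it does not reproduce the measured accuracies of this embodiment.

```python
import math


def expected_accuracy(n_ref, h2, m):
    """Formula (3): r(GEBV, Y) = sqrt(N_p h^2 / (N_p h^2 + M)) * h."""
    return math.sqrt(n_ref * h2 / (n_ref * h2 + m)) * math.sqrt(h2)


def estimate_heritability(ref_sizes, accuracies):
    """Fit 1/r^2 = a + b * (1/N_p) by least squares; return (h2, M) = (1/a, b/a^2)."""
    xs = [1.0 / n for n in ref_sizes]
    ys = [1.0 / r ** 2 for r in accuracies]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    h2 = 1.0 / intercept   # intercept a = 1 / h^2
    m = slope * h2 ** 2    # slope b = M / h^4
    return h2, m


# Synthetic check with made-up parameters (h^2 = 0.25, M = 50), not patent data.
sizes = [100, 200, 300, 400]
synthetic = [expected_accuracy(n, 0.25, 50.0) for n in sizes]
h2_hat, m_hat = estimate_heritability(sizes, synthetic)
```

Because the synthetic accuracies follow formula (3) exactly, the fit returns the parameters that generated them; with real averaged accuracies the points scatter around the line and the coefficient of determination indicates how well formula (4) describes them.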

The preferred embodiments of this patent have been described in detail above, but this patent is not limited to them; within the knowledge of a person of ordinary skill in the art, various changes can be made without departing from the spirit of this patent.

Claims

1. An algorithm for assessing heritability from genomic data, in which, for a given quantitative trait, marker effects across the whole genome are estimated using reference populations of different sizes, the breeding values of the validation population are then obtained, and the estimation accuracy is calculated; a linearized curve is fitted to the genomic estimation accuracy and the reference population size, and the reciprocal of the intercept of the fitted regression equation is the heritability estimate; characterized in that the genomic selection procedure uses GBLUP as the algorithm for computing marker effects, the effect variances of all marker loci in the genome are equal, and the marker effects are obtained from the following formula:

$$ \begin{bmatrix} 1_{n}' 1_{n} & 1_{n}' X \\ X' 1_{n} & X' X + I\lambda \end{bmatrix} \begin{bmatrix} \hat{\mu} \\ \hat{g} \end{bmatrix} = \begin{bmatrix} 1_{n}' y \\ X' y \end{bmatrix} \qquad (1) $$

where $\hat{\mu}$ is the population mean and $\hat{g}$ is the vector of effects of all marker loci; the genomic estimated breeding value (GEBV) is obtained by summing the effects of all marker loci, i.e. $\mathrm{GEBV} = \sum X_{i} g_{i}$; the accuracy of the GEBV is given by the correlation coefficient between the GEBV and the true breeding value (TBV), i.e. $r_{(\mathrm{GEBV},\,\mathrm{TBV})}$; when breeding values are estimated with the GBLUP algorithm, another formula for $r_{(\mathrm{GEBV},\,\mathrm{TBV})}$ is:

$$ r_{(\mathrm{GEBV},\,\mathrm{TBV})} = \sqrt{\frac{N_{p} h^{2}}{N_{p} h^{2} + M}} \qquad (2) $$

$$ r_{(\mathrm{GEBV},\,Y)} = r_{(\mathrm{GEBV},\,\mathrm{TBV})} \cdot h = \sqrt{\frac{N_{p} h^{2}}{N_{p} h^{2} + M}} \cdot h \qquad (3) $$

$$ \frac{1}{r_{(\mathrm{GEBV},\,Y)}^{2}} = \frac{1}{h^{2}} + \frac{M}{h^{4}} \cdot \frac{1}{N_{p}} \qquad (4) $$

2. The algorithm for assessing heritability from genomic data according to claim 1, characterized in that the genomes of all individuals are sequenced to obtain SNP information, the SNP loci are matched across all individuals, and missing data are filled in by imputation.

3. The algorithm for assessing heritability from genomic data according to claim 1, characterized in that, to prevent a single estimate from having a large error, repeated cross-validation is used, in which the reference population and the validation population are repeatedly drawn at random from the whole population, giving an estimate close to the true value.

4. The algorithm for assessing heritability from genomic data according to claim 1, characterized in that GBLUP is applied with different reference population sizes to compute the effect of each genomic marker and obtain the breeding values of the validation population, and the estimation accuracy is obtained by correlation analysis of the breeding values and phenotypic values of the validation population.