TWI786916B

TWI786916B - Genomic estimated breeding value for predicting a trait and its application

Info

Publication number: TWI786916B
Application number: TW110140652A
Authority: TW
Inventors: 游卓遠; 許智堯; 王子明; 吳心平
Original assignee: 基育生物科技股份有限公司
Priority date: 2021-11-01
Filing date: 2021-11-01
Publication date: 2022-12-11
Also published as: TW202320082A

Abstract

The present disclosure provides a method for establishing a formula of a genomic estimated breeding value for predicting a trait, which includes performing genome-wide association analysis and machine learning on a whole-genome single nucleotide polymorphism database and a quantified trait database to establish the formula.

Description

Genomic Estimated Breeding Value for Predicting a Trait and Its Application

The present disclosure relates to a method for predicting traits, and more specifically to a method for predicting traits using molecular biology techniques.

Economic animals are very important to the human food supply. According to FAO statistics for 2018, about 1.4 billion pigs are raised worldwide, a figure that surpasses beef, mutton, and other poultry and ranks first in the livestock and poultry industry. Pork is also the main source of meat protein on the market. Asia accounts for more than 50% of both global annual production and global annual consumption, making it the main region producing and consuming pork.

Traditional farms most often select pigs by phenotypic traits, but this approach is slow and inaccurate, and it usually takes many generations of screening before a suitable breed can be obtained. With the development of molecular biotechnology, single-gene DNA marker-assisted breeding has become an important aid to breeding. At present, molecular markers are applied to livestock improvement mainly by using information on the genetics and molecular metabolic pathways of a trait to identify candidate genes that affect it, and then either using genetic markers to locate the chromosomal locus regions that affect the important economic trait or testing the effect of individual genes on the trait. This approach requires a thorough understanding of the genetic and metabolic pathways of the target trait in order to effectively improve breeding accuracy for that trait.

Growth, meat quality, and reproduction are the three major economic traits of pigs and the focus of breeding research. However, the main genetic mechanisms of these traits are mostly unclear, and most are complex traits under polygenic control. Given the technical limitations of traditional breeding and single-gene marker-assisted selection, identifying major-effect genes and polymorphic loci is extremely difficult, and pigs that stably carry the desired traits are hard to select. Establishing a complete genome-wide molecular marker database as a basis for pig breeding is therefore urgent and important.

To address at least one of the aforementioned problems, the present disclosure establishes a method for predicting traits as a basis for breeding for economic traits.

The present disclosure provides a method for establishing, in a population, a calculation formula for predicting the genomic estimated breeding value (GEBV) of a trait, comprising: establishing a whole-genome single nucleotide polymorphism (SNP) database of the population, which contains the whole-genome SNPs of each individual in the population; establishing a trait quantification database for the trait in the population, which contains the quantified trait value of each individual in the population and the average quantified trait value; performing a genome-wide association study (GWAS) on the whole-genome SNP database and the trait quantification database, thereby assigning to each polymorphism of each SNP locus a weight value (e value) representing its degree of influence; and selecting, by a machine learning method, n SNPs that affect the trait and establishing the calculation formula of formula (1):

$$\mathrm{GEBV}_x = \mathrm{Mean}_x + e_1 \times \mathrm{SNP}_1 + e_2 \times \mathrm{SNP}_2 + \cdots + e_n \times \mathrm{SNP}_n \qquad \text{(1)}$$

In formula (1): $\mathrm{GEBV}_x$ is the breeding value of the trait; $\mathrm{Mean}_x$ is the average quantified trait value; $e_1$ to $e_n$ are the weight values of the respective SNP loci; and $\mathrm{SNP}_1$ to $\mathrm{SNP}_n$ are numbers assigned according to the genotype at each SNP locus, where a heterozygous genotype is assigned 1, a homozygous genotype that increases the quantified trait value is assigned 2, and a homozygous genotype that decreases the quantified trait value is assigned 0.
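Formula (1) is a weighted sum, which a few lines of Python can illustrate. The sketch below is only an illustration under stated assumptions: the function name `gebv`, its parameter names, and the numbers in the usage line are hypothetical and are not taken from the disclosure.

```python
from typing import Sequence


def gebv(mean_trait: float, weights: Sequence[float], snp_codes: Sequence[int]) -> float:
    """Evaluate formula (1): Mean_x + e_1*SNP_1 + ... + e_n*SNP_n.

    weights   -- the e values, one per selected SNP locus
    snp_codes -- the genotype codes (0, 1 or 2), in the same order as weights
    """
    if len(weights) != len(snp_codes):
        raise ValueError("weights and snp_codes must have the same length")
    if any(code not in (0, 1, 2) for code in snp_codes):
        raise ValueError("each SNP code must be 0, 1 or 2")
    return mean_trait + sum(e * code for e, code in zip(weights, snp_codes))


# Hypothetical values for three SNP loci (illustration only).
print(gebv(mean_trait=10.0, weights=[0.5, 0.25, 1.0], snp_codes=[2, 1, 0]))
```

With these made-up inputs the sum is 10.0 + 0.5 × 2 + 0.25 × 1 + 1.0 × 0.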

In an embodiment of the present disclosure, the genome-wide association analysis is performed using the fixed and random model circulating probability unification (FarmCPU) model.

In an embodiment of the present disclosure, the machine learning includes supervised learning, unsupervised learning, and/or semi-supervised learning.

In an embodiment of the present disclosure, the numbers assigned to $\mathrm{SNP}_1$ to $\mathrm{SNP}_n$ are 0, 1, or 2.

In an embodiment of the present disclosure, the population is pigs. Examples of pigs include, but are not limited to, Duroc, Yorkshire, Landrace, or crossbred lines thereof.

In embodiments of the present disclosure, examples of the trait include, but are not limited to, pig growth traits, meat quality traits, or reproductive traits. Examples of pig growth traits include, but are not limited to, average daily gain, backfat thickness, age at 100 kg, or feed efficiency. Examples of pork quality traits include, but are not limited to, meat color, marbling, loin eye area, loin eye depth, backfat depth, crude fat, live weight, carcass weight, or cooking loss. Examples of pig reproductive traits include, but are not limited to, average total number born, average number born alive, average number of malformed/weak/stillborn piglets, average farrowing interval, average deviation in days from the expected farrowing date, average litter weight at birth, average number of live piglets at three weeks, or average litter weight at three weeks.

The present disclosure also provides a calculation formula for predicting the breeding value of a trait of an individual, which is represented by formula (1) and is established by the method of the present disclosure for establishing a prediction formula for a trait in a population.

The present disclosure also provides a method for predicting the breeding value of a trait of a test individual using the aforementioned calculation formula, which comprises detecting, in the genome of the test individual, the polymorphisms of the n SNPs that affect the trait, and calculating the breeding value of the test individual with formula (1).

With the rapid development of big data, artificial intelligence, and genome sequencing technology, the present disclosure applies the concept of genomic selection (GS): the large number of molecular markers scattered across the genome of a test sample are all collected, the genomic estimated breeding value (GEBV) is estimated from them, and that breeding value serves as the criterion for breeding selection. The accuracy and efficiency of genomic selection are markedly higher than those of traditional phenotypic selection and single-gene marker-assisted selection.

To improve the competitiveness of the industry and increase the overall income of breeders, the present disclosure provides a method for establishing, in a population, a calculation formula for predicting the breeding value of a trait, comprising: establishing a whole-genome single nucleotide polymorphism (SNP) database of the population, which contains the whole-genome SNPs of each individual in the population; establishing a trait quantification database for the trait in the population, which contains the quantified trait value of each individual in the population and the average quantified trait value; performing genome-wide association analysis on the whole-genome SNP database and the trait quantification database, thereby assigning to each polymorphism of each SNP locus a weight value (e value) representing its degree of influence; and selecting, by a machine learning method, n SNPs that affect the trait and establishing the calculation formula of formula (1):

$$\mathrm{GEBV}_x = \mathrm{Mean}_x + e_1 \times \mathrm{SNP}_1 + e_2 \times \mathrm{SNP}_2 + \cdots + e_n \times \mathrm{SNP}_n \qquad \text{(1)}$$

In formula (1): $\mathrm{GEBV}_x$ is the breeding value of the trait; $\mathrm{Mean}_x$ is the average quantified trait value; $e_1$ to $e_n$ are the weight values of the respective SNP loci; and $\mathrm{SNP}_1$ to $\mathrm{SNP}_n$ are numbers assigned according to the genotype at each SNP locus, where a heterozygous genotype is assigned a reference value, a homozygous genotype that increases the quantified trait value is assigned a number greater than the reference value, and a homozygous genotype that decreases the quantified trait value is assigned a number smaller than the reference value.

In one embodiment of the present disclosure, the concept of genomic selection is combined with genome-wide association analysis and machine learning to analyze genome-wide single nucleotide polymorphism loci and identify the polymorphic loci that correspond to each trait. Each marker is then given a weight according to its degree of influence on the trait, so that evaluation systems for various important economic traits can be established. The test results are quantified into comparable values that serve as a rating index, which is provided to users as a reference for breeding selection.

The "population" described in the present disclosure may be a set of multiple biological individuals, including an animal population or a plant population. In one embodiment of the present disclosure, all individuals in the population belong to the same species; in another embodiment, all individuals in the population belong to the same breed; in a further embodiment, all individuals in the population belong to a breed raised in the same region. In one embodiment of the present disclosure, the organism is an economic animal, including but not limited to pigs, cattle, chickens, ducks, fish, and shrimp.

The pigs described in the present disclosure are not limited by breed, sex, or the like. For example, the pig may be a boar or a sow. Examples of the pig include Duroc, Yorkshire, Landrace, and crossbred lines thereof.

The traits of the present disclosure include, but are not limited to, growth traits, meat quality traits, or reproductive traits. To make analysis possible, a trait according to the present disclosure is quantified, for example, but not limited to, by assigning a score.

Formula (1) of the present disclosure can be used to estimate the breeding value of a trait, and thereby to evaluate whether a test individual is suitable as a breeding parent.

In formula (1), $e_1$ to $e_n$ are the weight values of the respective loci, and $\mathrm{SNP}_1$ to $\mathrm{SNP}_n$ are numbers assigned according to the genotype at each SNP locus: a heterozygous genotype is assigned a reference value; a homozygous genotype that increases the quantified trait value is assigned a number greater than the reference value; and a homozygous genotype that decreases the quantified trait value is assigned a number smaller than the reference value.

In one embodiment of the present disclosure, if the genotype at an SNP locus is Aa, the reference value assigned is 1; if genotype AA at this locus increases the quantified value of a trait, it is assigned 2; and if genotype aa at this locus decreases the quantified value of the trait, it is assigned 0. In another embodiment of the present disclosure, if the genotype at an SNP locus is Bb, the reference value assigned is 1; if genotype bb at this locus increases the quantified value of a trait, it is assigned 2; and if genotype BB at this locus decreases the quantified value of the trait, it is assigned 0.
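With a reference value of 1, the coding in these two embodiments amounts to counting the copies of the trait-increasing allele at a locus. A minimal Python sketch of that reading follows; the function `encode_genotype` and its `favorable_allele` parameter are hypothetical names introduced for illustration, and the two-letter genotype strings are an assumed input format.

```python
def encode_genotype(genotype: str, favorable_allele: str) -> int:
    """Return 0, 1 or 2: the number of trait-increasing alleles in the genotype.

    genotype         -- two-character string such as "AA", "Aa" or "aa"
    favorable_allele -- the allele whose homozygote increases the trait value
    """
    if len(genotype) != 2:
        raise ValueError("genotype must contain exactly two alleles, e.g. 'Aa'")
    # Allele symbols are compared case-sensitively, so "A" and "a" differ.
    return sum(1 for allele in genotype if allele == favorable_allele)


# First embodiment: AA increases the trait value, so "A" is the favorable allele.
print([encode_genotype(g, "A") for g in ("AA", "Aa", "aa")])
# Second embodiment: bb increases the trait value, so "b" is the favorable allele.
print([encode_genotype(g, "b") for g in ("bb", "Bb", "BB")])
```

Counting alleles is only one way to satisfy the more general rule (heterozygote at a reference value, favorable homozygote above it, unfavorable homozygote below it); other numeric scales would fit that rule as well.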

In one embodiment of the present disclosure, the genome-wide association analysis is performed using the fixed and random model circulating probability unification (FarmCPU) model. Compared with other tools, it effectively reduces false positives and false negatives. It consists of a fixed effect model (FEM) and a random effect model (REM), which are applied repeatedly in alternation to improve statistical accuracy, while also increasing computing speed and requiring fewer computing resources.

In embodiments of the present disclosure, growth traits may include average daily gain, backfat thickness, age at 100 kg, or feed efficiency. Pork quality traits may include meat color, marbling, loin eye area, loin eye depth, backfat depth, crude fat, live weight, carcass weight, or cooking loss. Pig reproductive traits may include average total number born, average number born alive, average number of malformed/weak/stillborn piglets, average farrowing interval, average deviation in days from the expected farrowing date, average litter weight at birth, average number of live piglets at three weeks, or average litter weight at three weeks.

One embodiment of the present disclosure combines machine learning with whole-genome sequencing technology. Compared with phenotypic breeding and single-gene marker methods, this technique allows screening as soon as an individual is born, without waiting for it to mature, and analysis of whole-genome data allows the genome-wide SNP loci of the test individual to be screened and analyzed comprehensively. Compared with the prior art, individuals with superior traits can therefore be selected early, which greatly reduces rearing costs and shortens breeding time. In addition, a selection strategy based on genome-wide SNP screening can identify all genes that affect the trait at once, which reduces breeding variation and improves breeding accuracy and stability. In other words, the methods and markers of the present disclosure can be used for early detection of important economic traits of a test individual, such as growth, meat quality, or reproduction, so that individuals with poor trait values are identified and culled early while individuals with superior traits are retained, reducing selection and breeding costs and improving overall economic return.

The following non-limiting examples help those having ordinary skill in the art to which the present invention pertains to practice the invention. These examples should not be regarded as unduly limiting the invention. Those having ordinary skill in the art may modify and vary the embodiments discussed herein without departing from the spirit or scope of the invention, and such modifications and variations remain within the scope of the invention.

Examples

Formula (1) of the present disclosure can be established through the process shown in FIG. 1. For 21 traits covering pig growth, meat quality, and reproduction, test data from a total of 2,531 pig samples were collected: 251 by whole-genome sequencing, 1,239 on the Illumina Porcine SNP60 v2 BeadChip genome-wide SNP array, and 1,041 on the Affymetrix Axiom Porcine Breeders Array genome-wide SNP array. After detection and analysis on these three platforms and exclusion of poor-quality SNPs, a total of 47,672 SNP markers were available for the subsequent genome-wide association analysis (GWAS). The 21 traits comprise 4 growth traits, 9 meat quality traits, and 8 reproductive traits. The analysis steps and methods are detailed below.

Analysis steps and methods

(1) Experimental sample collection

The present invention collected phenotypic data for 21 important economic traits covering pig growth, meat quality, and reproduction, and sampled 2,531 whole-genome genotyping records from the pigs so recorded. The samples included Duroc, Landrace, and Yorkshire pigs.

(2) Pig genomic DNA extraction

a. Tissue samples

The collected fresh tissue was placed in a microcentrifuge tube containing 250 μl of lysis buffer and cut into pieces with scissors. A further 400 μl of lysis buffer and 35 μl of proteinase K (10 mg/ml) were added, and the sample was mixed thoroughly with the lysis buffer, incubated overnight in a 55°C water bath, and then centrifuged at 320 × g for 5 minutes at room temperature. The supernatant was transferred to a new centrifuge tube, an equal volume of phenol was added, and the tube was mixed thoroughly for 10 minutes and centrifuged at 10,000 × g for 10 minutes. The supernatant was collected, an equal volume of phenol/chloroform-isoamyl alcohol was added, and the tube was mixed and centrifuged at 10,000 × g for 10 minutes. The supernatant was collected again, an equal volume of chloroform-isoamyl alcohol (24:1) was added, and the tube was mixed and centrifuged at 10,000 × g for 10 minutes. The supernatant was transferred to a new centrifuge tube, 1/10 of the total volume of 3 M sodium acetate (NaOAc) was added, followed by an equal volume of isopropanol. After thorough mixing, the tube was placed in a -20°C freezer for 30 minutes to accelerate precipitation and then centrifuged at 12,000 × g for 15 minutes. After the supernatant was removed, the precipitate was washed with 1 ml of 70% ethanol and then with 1 ml of 100% ethanol, each time with centrifugation at 12,000 × g for 5 minutes at 4°C, to remove salts and water. Finally, the DNA was redissolved in 50 μl of TE buffer and stored at -80°C until use.

b. Blood samples

Blood DNA was extracted with the Geneaid DNA extraction kit. To 250 μl of whole blood, 50 μl of proteinase K (10 mg/ml) was added, and the mixture was mixed thoroughly and incubated at 60°C for 30 minutes. Next, 300 μl of GSB buffer was added, and the mixture was mixed thoroughly and incubated at 60°C for 30 minutes. Then 300 μl of 100% ethanol was added and mixed. The mixture was transferred to a filter column and centrifuged at 16,000 × g for 1 minute at room temperature, and the flow-through was discarded. The following step was then performed twice: 300 μl of W1 buffer was added, the column was centrifuged at 16,000 × g for 1 minute at room temperature, and the flow-through was discarded. The following step was likewise performed twice: 300 μl of wash buffer was added, the column was centrifuged at 16,000 × g for 1 minute at room temperature, and the flow-through was discarded. The column was then centrifuged at 16,000 × g for 3 minutes at room temperature to remove residual liquid. Finally, 100 μl of elution buffer was added and the column was centrifuged at 16,000 × g for 3 minutes at room temperature. The resulting solution contained the sample DNA and was stored at -80°C until use.

c. Semen samples

Semen DNA was likewise extracted with the Geneaid DNA extraction kit. A 100 μl aliquot of concentrated semen was mixed thoroughly with 1× PBS at a ratio of 1:9 and centrifuged at 16,000 × g for 1 minute at room temperature, and the supernatant was removed. Then 100 μl of lysis buffer (850 μl Geneaid sperm lysis buffer + 160 μl 100 mM DTT + 20 μl 10 mg/ml proteinase K) was added, and the mixture was mixed thoroughly and incubated at 60°C for 16 hours. Next, 100 μl of GSB buffer was added and mixed, followed by 100 μl of 100% ethanol, which was also mixed in. The mixture was transferred to a filter column and centrifuged at 16,000 × g for 1 minute at room temperature, and the flow-through was discarded. The following step was then performed twice: 300 μl of W1 buffer was added, the column was centrifuged at 16,000 × g for 1 minute at room temperature, and the flow-through was discarded. The following step was likewise performed twice: 300 μl of wash buffer was added, the column was centrifuged at 16,000 × g for 1 minute at room temperature, and the flow-through was discarded. The column was then centrifuged at 16,000 × g for 3 minutes at room temperature to remove residual liquid. Finally, 100 μl of elution buffer was added and the column was centrifuged at 16,000 × g for 3 minutes at room temperature. The resulting solution contained the sample DNA and was stored at -80°C until use.

(3) Sample DNA quality analysis

After extraction, the pig DNA used in the present disclosure must pass both of the following quality checks, and only samples that meet both standards may be used for subsequent sequencing and analysis. A sample that fails a quality check must return to step (2) and be extracted again.

a. Agarose gel electrophoresis

A 0.8 g portion of agarose (Agarose I, purchased from Amresco) was added to 100 ml of 0.5× TBE buffer (purchased from Bioman Scientific), mixed, and heated in a microwave oven until the agarose was completely dissolved and the solution was clear. Once the solution had cooled to about 50°C, 5 μl of ethidium bromide (EtBr, 10 mg/ml, purchased from Sigma) was added and mixed in, the solution was poured into a gel casting mold, and the comb was removed after the gel had set. A 2 μl aliquot of the genomic DNA extracted in step (2) was mixed with an appropriate amount of loading dye and run on the 0.8% agarose gel at 100 V for about 30 minutes, with the 23 kb band of a Lambda/Hind III marker as the reference, to confirm the quality and integrity of the genomic DNA. Under ultraviolet light, the absence of a smear of small fragments together with a main band larger than 20 kb indicates that the extracted genomic DNA is highly intact and can be used for subsequent sequencing.

b. Ultra-micro spectrophotometer analysis

A 1 μl aliquot of the genomic DNA sample was loaded onto a NanoDrop 2000 ultra-micro spectrophotometer (Thermo Fisher Scientific). The OD230, OD260, and OD280 readings of the DNA sample were measured, and the sample concentration and the OD260/OD230 and OD260/OD280 ratios were calculated. The OD260/OD230 ratio should be greater than 2.0 and the OD260/OD280 ratio should be between 1.8 and 2.0, which indicates that the extracted DNA is of high purity, with little residual protein or polysaccharide, and can be used for subsequent sequencing.

(4) Whole-genome SNP locus analysis in pigs
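Only DNA that passes both checks of step (3) goes on to genotyping. That gate can be sketched in Python as below. This is a minimal illustration, assuming the gel result has already been read off by eye; the `DnaSample` class, its field names, and the inclusive treatment of the 1.8 to 2.0 bounds are assumptions made for the example rather than details from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class DnaSample:
    od230: float         # absorbance reading at 230 nm
    od260: float         # absorbance reading at 260 nm
    od280: float         # absorbance reading at 280 nm
    main_band_kb: float  # size of the main gel band, judged against the marker
    has_smear: bool      # True if a smear of small fragments is visible


def passes_gel_check(sample: DnaSample) -> bool:
    """Step (3)a: no small-fragment smear and a main band larger than 20 kb."""
    return (not sample.has_smear) and sample.main_band_kb > 20.0


def passes_purity_check(sample: DnaSample) -> bool:
    """Step (3)b: OD260/OD230 above 2.0 and OD260/OD280 within 1.8 to 2.0."""
    ratio_230 = sample.od260 / sample.od230
    ratio_280 = sample.od260 / sample.od280
    return ratio_230 > 2.0 and 1.8 <= ratio_280 <= 2.0


def ready_for_genotyping(sample: DnaSample) -> bool:
    """A sample must pass both checks; otherwise it is re-extracted (step 2)."""
    return passes_gel_check(sample) and passes_purity_check(sample)


# Hypothetical readings, for illustration only.
good = DnaSample(od230=0.45, od260=1.0, od280=0.53, main_band_kb=23.0, has_smear=False)
print(ready_for_genotyping(good))
```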

The Porcine SNP60 v2 BeadChip whole-genome array developed by Illumina (USA) and the Axiom Porcine Breeders Array developed by Thermo Fisher were used. They contain 64,232 and 55,150 porcine SNP loci, respectively, and provide whole-genome coverage averaging one detection point per 40 kb, so that the dense array loci give detailed information on genetic variation in pigs. After image analysis, the resulting SNP information was compared against the Sus scrofa 11.1 database.

(5) Quality control of SNP genotype data

SNP genotypes were quality-controlled with GAPIT 3 (Genome Association and Prediction Integrated Tool 3). SNP loci with an SNP call rate ≦ 99% or a minor allele frequency (MAF) ≦ 0.01 were removed, and the SNP loci that passed this screening were used for the subsequent GWAS.

(6) Genome-wide association analysis
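The SNP set that enters the association analysis is the one that survives the two thresholds of step (5). The NumPy sketch below illustrates those thresholds only; it is not GAPIT 3, whose interface is not reproduced here. The function name `qc_filter`, the 0/1/2 coding with `np.nan` for missing calls, and the individuals-by-SNPs matrix layout are assumptions made for the example.

```python
import numpy as np


def qc_filter(genotypes: np.ndarray,
              min_call_rate: float = 0.99,
              min_maf: float = 0.01) -> np.ndarray:
    """Return a boolean mask over SNP columns that pass quality control.

    genotypes -- float array, rows = individuals, columns = SNPs,
                 coded 0/1/2 with np.nan marking a missing call
    """
    n_individuals, n_snps = genotypes.shape
    n_called = (~np.isnan(genotypes)).sum(axis=0)
    call_rate = n_called / n_individuals

    # Frequency of the counted allele among called genotypes (two alleles each);
    # columns with no calls at all are left at frequency 0.
    allele_sum = np.nansum(genotypes, axis=0)
    freq = np.divide(allele_sum, 2 * n_called,
                     out=np.zeros(n_snps), where=n_called > 0)
    maf = np.minimum(freq, 1.0 - freq)

    # The text removes SNPs at or below each threshold, so keep strictly above.
    return (call_rate > min_call_rate) & (maf > min_maf)


# Tiny made-up matrix: 4 individuals x 3 SNPs.
g = np.array([[0, 0, 1],
              [1, 0, np.nan],
              [2, 0, 2],
              [1, 0, 0]], dtype=float)
print(qc_filter(g))
```

In this toy matrix the second SNP is monomorphic (MAF of 0) and the third has a missing call (call rate of 0.75), so only the first column is kept.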

GWAS analysis was performed using the FarmCPU (fixed and random model circulating probability unification) model in the GAPIT 3 software package.

After the collected whole-genome SNP data and phenotypic trait data were processed by the FarmCPU model, the model assigned each SNP locus, for each trait, a weight value (e value) representing its degree of influence, and a genome-wide association Manhattan plot was generated, as shown in FIG. 2 using average farrowing interval as an example. The data were then examined with a quantile-quantile (Q-Q) plot, as shown in FIG. 3 for average farrowing interval. FIG. 3 shows that the SNP loci beyond 3 on the X axis (−log₁₀ scale) clearly deviate from the uniform-distribution line of the null hypothesis, confirming that these loci are significantly associated with the target trait.

(7) Machine learning

The training data use the whole-genome SNP data of the pigs as features and the phenotypic trait data of their later growth as labels. By appropriately selecting the combination of key genes that affect pig growth, machine learning methods are used to establish a predictive classification model. The model-building process is shown in FIG. 4. The present disclosure uses the following three types of machine learning techniques:

a. Supervised learning
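As a minimal sketch of the supervised case, in which SNP codes are the features and measured phenotypes are the labels, the example below fits a linear model by ordinary least squares with NumPy. The disclosure does not name a specific learning algorithm, so least squares is used here purely as a stand-in; the function name `fit_linear_model` and the toy data are hypothetical.

```python
import numpy as np


def fit_linear_model(snp_codes: np.ndarray, phenotypes: np.ndarray):
    """Fit phenotype = intercept + snp_codes @ weights by least squares.

    snp_codes  -- rows = individuals, columns = SNPs, coded 0/1/2 (features)
    phenotypes -- one measured trait value per individual (labels)
    """
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(snp_codes)), snp_codes])
    coef, *_ = np.linalg.lstsq(design, phenotypes, rcond=None)
    return coef[0], coef[1:]


# Toy data generated from known values so the fit can be checked.
X = np.array([[0, 0], [1, 0], [2, 1], [0, 2], [1, 1], [2, 2]], dtype=float)
y = 5.0 + X @ np.array([1.0, -0.5])

intercept, weights = fit_linear_model(X, y)
print(intercept, weights)

# Predicting a new individual from its SNP codes alone.
new_individual = np.array([2.0, 0.0])
print(intercept + new_individual @ weights)
```

The fitted model has the same additive shape as formula (1), but the intercept and weights here come from the least-squares fit, not from the GWAS-derived e values and trait mean described in the disclosure.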

Supervised learning means that, in the training phase, every record used to train the model includes the correct answer. Taking the evaluation of pig traits as an example, when the database holds both the genotype data of all pigs to be analysed and the data on their phenotypic traits after they have grown up, an inference model can be built with a supervised machine learning method to find the association pattern between the genotype features and the phenotypic traits. Afterwards, entering genotype feature data into the inference model is enough to predict the phenotypic traits a pig will show once grown. In other words, a supervised learning method is given feature data and the corresponding answers during training and finds the association pattern between the two.

b. Unsupervised learning

Whereas supervised learning is given both the data and the corresponding answers during training, unsupervised learning is given only the data and features, and the machine learning method is left to find possible answers by itself. Again taking the evaluation of pig traits as the example, the database may hold only genotype data or only phenotype data, without complete information linking genotypes to phenotypic traits. Problems of this type are very common in practice. Because the correspondence between genotype and phenotype is missing, supervised learning cannot be used. Only unsupervised methods apply: they observe trends in the genotype data or in the phenotypic traits and try to identify possible association patterns from the similarity and the trends of the data.

c. Semi-supervised learning

Supervised learning requires historical data already labelled with answers, so the manual labelling burden is heavy. Unsupervised learning has no labelled answers and can only infer from the similarity of the data. Semi-supervised learning combines features of both. A model is first built from the labelled data; that model is then used to make inferences on the unlabelled data; the inferences with higher confidence are selected and added to the labelled training set. Repeating this process yields an inference model for the whole data set. Because fully labelled data are hard to obtain in practice, semi-supervised learning is of considerable practical importance.
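The iterative procedure described for semi-supervised learning is often called self-training. The Python sketch below illustrates the loop under stated assumptions: the base learner is a simple nearest-centroid classifier, confidence is the distance margin between the two closest class centroids, and the threshold, function names and toy genotype vectors are all hypothetical. The disclosure does not specify which learner or confidence measure is used.

```python
import numpy as np

def fit_centroids(X, y):
    """Supervised step: one mean feature vector per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_with_confidence(X, classes, centroids):
    """Predict the nearest class; confidence = gap to the second-nearest.

    Requires at least two classes.
    """
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    ordered = np.sort(d, axis=1)
    confidence = ordered[:, 1] - ordered[:, 0]
    return classes[d.argmin(axis=1)], confidence

def self_train(X_lab, y_lab, X_unlab, threshold=1.0, max_rounds=10):
    X_lab = np.asarray(X_lab, dtype=float)
    y_lab = np.asarray(y_lab)
    X_unlab = np.asarray(X_unlab, dtype=float)
    classes, centroids = fit_centroids(X_lab, y_lab)
    for _ in range(max_rounds):
        if len(X_unlab) == 0:
            break
        pred, conf = predict_with_confidence(X_unlab, classes, centroids)
        keep = conf >= threshold            # trust only confident inferences
        if not keep.any():
            break
        # Move confident samples, with their inferred labels, into the
        # labelled set and refit the model.
        X_lab = np.vstack([X_lab, X_unlab[keep]])
        y_lab = np.concatenate([y_lab, pred[keep]])
        X_unlab = X_unlab[~keep]
        classes, centroids = fit_centroids(X_lab, y_lab)
    return classes, centroids

# Toy data: rows are 0/1/2 genotype codes at three SNPs; labels 0 and 1 stand
# for two hypothetical phenotype classes.
X_lab = [[0, 0, 0], [2, 2, 2]]
y_lab = [0, 1]
X_unlab = [[0, 0, 1], [2, 1, 2], [1, 1, 1]]
classes, centroids = self_train(X_lab, y_lab, X_unlab)
pred, _ = predict_with_confidence(np.array([[0.0, 1.0, 0.0]]), classes, centroids)
label = int(pred[0])
```

In this toy run the first two unlabelled samples are absorbed in the first round, while the ambiguous sample `[1, 1, 1]` is equidistant from both centroids and is never added, which is the intended behaviour of a confidence threshold.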

The SNP sites found in the GWAS analysis to be significantly associated with the average number of live piglets in the 3rd to 5th litters were compared against the Sus scrofa 11.1 database on the Ensembl website to locate the gene sequence position of each SNP.

Through the iterative computation of the FarmCPU model, a weight was obtained for the effect of each of the 47,672 SNP markers on the average number of live piglets in the 3rd to 5th litters. Summing the weights obtained for the genotype at each SNP site gives the genomic estimated breeding value of each pig. In addition, the Manhattan plot shows one SNP site that is significantly associated with the average number of live piglets in the 3rd to 5th litters, as shown in Figure 5 and Table 1. When these data are examined with a Q-Q plot (Figure 6), the SNP sites beyond a value of about 3.5 on the -log₁₀ X axis deviate markedly from the line expected under the null hypothesis, which confirms that this site is indeed associated with the average number of live piglets in the 3rd to 5th litters.

Table 1: SNP site highly associated with the average number of live piglets in the 3rd to 5th litters

Chromosome   SNP marker    Position   P value    Weight   SNP   Nearest gene
15           ALGA0085579   56317018   2.97E-11   -0.800   T/C   mfhas1
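To make the summation concrete, the following Python sketch evaluates formula (1), GEBV_x = Mean_x + e_1 × SNP_1 + ... + e_n × SNP_n, for a single SNP. The weight -0.800 is the value listed in Table 1 for ALGA0085579. The population mean of 10.0 and the genotype code of 2 are hypothetical placeholders, and the function name `gebv` is introduced here only for illustration; which genotype receives which of the codes 0, 1 or 2 is not stated in this section and is left as an assumption.

```python
def gebv(mean_trait, weights, genotype_codes):
    """Formula (1): GEBV_x = Mean_x + sum over i of e_i * SNP_i."""
    if len(weights) != len(genotype_codes):
        raise ValueError("one genotype code is needed per SNP weight")
    return mean_trait + sum(e * s for e, s in zip(weights, genotype_codes))

# Weight from Table 1 (ALGA0085579); the mean and genotype code are hypothetical.
weights = [-0.800]
genotype_codes = [2]          # assumed 0/1/2 coding with 1 as the heterozygote
mean_live_piglets = 10.0      # placeholder, not a value from the disclosure

example = gebv(mean_live_piglets, weights, genotype_codes)
# 10.0 + (-0.800 * 2) is approximately 8.4
```

With the full model the same function would take the n selected SNP weights and the matching genotype codes of the individual being evaluated.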

The above embodiments merely illustrate the principles and effects of the present invention and do not limit it. Modifications and changes made to the above embodiments by a person of ordinary skill in the art to which the present invention pertains do not depart from the spirit of the present invention. The scope of rights of the present invention shall be as set out in the claims below.

Figure 1 shows a flowchart of one embodiment of establishing formula (1) according to the present disclosure.

Figure 2 shows the Manhattan plot of the genome-wide association analysis for the average farrowing interval.

Figure 3 shows the Q-Q plot of the genome-wide association analysis for the average farrowing interval.

Figure 4 shows the workflow for building a classification prediction model with machine learning and for its actual operation.

Figure 5 shows the Manhattan plot of the genome-wide association analysis for the average number of live piglets in the 3rd to 5th litters.

Figure 6 shows the Q-Q plot of the genome-wide association analysis for the average number of live piglets in the 3rd to 5th litters.

Claims

A method for establishing a formula for calculating a genomic estimated breeding value (GEBV) for predicting a trait in a population, comprising: establishing a genome-wide single nucleotide polymorphism (SNP) database of the population, which includes the genome-wide SNPs of each individual in the population; establishing a trait quantification database for the trait in the population, which includes the quantified trait value of each individual in the population and the mean quantified trait value; performing a genome-wide association study (GWAS) on the genome-wide SNP database and the trait quantification database, whereby each polymorphism of each SNP site is assigned a weight value (e value) representing its degree of influence; and selecting, by a machine learning method, n SNPs that affect the trait, and establishing formula (1):

$\mathrm{GEBV}_x = \mathrm{Mean}_x + e_1 \times \mathrm{SNP}_1 + e_2 \times \mathrm{SNP}_2 + \cdots + e_n \times \mathrm{SNP}_n$   Formula (1)

wherein, in formula (1): $\mathrm{GEBV}_x$ is the breeding value of the trait; $\mathrm{Mean}_x$ is the mean quantified trait value; $e_1$ to $e_n$ are the weight values of the respective SNP sites; and $\mathrm{SNP}_1$ to $\mathrm{SNP}_n$ are numbers assigned according to the genotype at each SNP site, wherein a heterozygous genotype is assigned a reference value, a homozygous genotype that increases the quantified trait value is assigned a number greater than the reference value, and a homozygous genotype that decreases the quantified trait value is assigned a number smaller than the reference value.

The method according to claim 1, wherein the genome-wide association study is performed using a fixed and random model circulating probability unification (FarmCPU) model.

The method of claim 1, wherein the machine learning includes supervised learning, unsupervised learning and/or semi-supervised learning.

The method according to claim 1, wherein the numbers to which $\mathrm{SNP}_1$ to $\mathrm{SNP}_n$ are converted are 0, 1 or 2.

The method according to claim 1, wherein the population is pigs.

The method according to claim 5, wherein the pig is Duroc, Yorkshire, Landrace or a hybrid thereof.

The method according to any one of claims 1 to 6, wherein the traits are growth traits, meat quality traits or reproductive traits of pigs.

The method according to claim 7, wherein the pig growth traits are average daily gain, backfat thickness, age at 100 kg or feed efficiency.

The method according to claim 7, wherein the meat quality traits are meat color, marbling, loin eye area, loin eye depth, backfat depth, crude fat, live weight, carcass weight or cooking weight loss rate.

The method according to claim 7, wherein the reproductive traits of pigs are average total litter size, average number of live piglets, average number of deformed, weak or stillborn piglets, average farrowing interval, average difference in days from the expected farrowing date, average litter weight at birth, average number of live piglets at three weeks or average litter weight at three weeks.

A method for predicting the breeding value of a trait of an individual to be tested, comprising detecting the polymorphisms of n SNPs affecting the trait in the genome of the individual to be tested, and calculating the breeding value of the individual to be tested by formula (1):

$\mathrm{GEBV}_x = \mathrm{Mean}_x + e_1 \times \mathrm{SNP}_1 + e_2 \times \mathrm{SNP}_2 + \cdots + e_n \times \mathrm{SNP}_n$   Formula (1)

wherein, in formula (1): $\mathrm{GEBV}_x$ is the breeding value of the trait; $\mathrm{Mean}_x$ is the mean quantified trait value; $e_1$ to $e_n$ are the weight values of the respective SNP sites; and $\mathrm{SNP}_1$ to $\mathrm{SNP}_n$ are numbers assigned according to the genotype at each SNP site, wherein a heterozygous genotype is assigned a reference value, a homozygous genotype that increases the quantified trait value is assigned a number greater than the reference value, and a homozygous genotype that decreases the quantified trait value is assigned a number smaller than the reference value; and formula (1) is established by the method according to any one of claims 1 to 10.