KR102603207B1

KR102603207B1 - Method for screening SNP marker associated with phenotype using statistical regularization and selection probability

Info

Publication number: KR102603207B1
Application number: KR1020200149370A
Authority: KR
Inventors: 선호근; 김기풍
Original assignee: 부산대학교 산학협력단
Priority date: 2020-11-10
Filing date: 2020-11-10
Publication date: 2023-11-16
Also published as: KR20220063480A

Abstract

The present invention relates to a method of discovering SNP markers associated with phenotypic traits by analyzing genomic data with the open-source R program. The method calculates selection probabilities using regularization, a variable-selection technique for high-dimensional data, and also calculates a threshold for those probabilities, so that phenotype-associated SNP markers can be discovered more accurately and stably.

Description

Method for discovering SNP marker associated with phenotype using statistical regularization method and selection probability {Method for screening SNP marker associated with phenotype using statistical regularization and selection probability}

The present invention relates to a method for discovering phenotype-associated SNP markers using a statistical regularization method and selection probability.

As reference genome information has become available and large-scale sequence production using next-generation sequencing (NGS) has accelerated, genome-based molecular breeding has drawn attention as a new driving force in the seed market and the breeding industry. Molecular breeding is advancing rapidly, notably through techniques that determine the presence or absence of a desired trait using molecular markers, which indicate differences in DNA sequence, without observing the phenotype arising from the trait.

Genome-wide association study (GWAS), which analyzes genomic data, is a field that continues to attract attention as a way to discover genetic variants associated with phenotypic traits more accurately. In particular, regression-based regularization methods that consider multiple genetic variants simultaneously are widely used in statistics, but they remain hard to access for researchers who actually perform genome analysis, partly because of a lack of suitable software.

Korean Patent Publication No. 2011-0064699 discloses a 'Method for analyzing single nucleotide polymorphism (SNP) genotype', and Korean Patent No. 1774275 discloses a 'Single nucleotide polymorphism marker set for backcross breeding of pepper and its use'. Neither, however, describes the method of the present invention for discovering phenotype-associated SNP markers using a statistical regularization method and selection probability.

The present invention aims to provide an algorithm, and an accompanying visualization, for discovering SNP (single nucleotide polymorphism) markers associated with phenotypic traits by analyzing genomic data with the open-source R program. It is hoped that the invention will allow many researchers to carry out genome analysis with ease.

To solve the above problem, the present invention provides a method for discovering phenotype-associated SNP markers using a regularization method and selection probability, the method comprising: (1) inputting data; (2) preprocessing the input data; (3) calculating selection probabilities for the preprocessed data and calculating a threshold for them; and (4) providing visualization and results.

The present invention also provides a recording medium on which a computer-readable program for performing the method is recorded.

In genome research, most researchers other than specialists have had considerable difficulty with analysis. The present invention can discover SNP markers associated with a given phenotypic trait by using a regression-based regularization method for analyzing high-dimensional genomic data together with selection probability. With the invention, general researchers can obtain analysis results suited to their data simply by adjusting the input arguments.

Figure 1 is the first page of the manual of the developed R package "sp.gwas".
Figure 2 is a schematic diagram of the analysis algorithm.
Figure 3 is an example of a circular Manhattan plot of the selection probabilities of SNP markers, as provided by the developed R package "sp.gwas". The x-axis shows the SNP markers located on each chromosome, and the y-axis shows the selection probability of each SNP marker for each of three phenotypic traits.

To achieve the object of the present invention, the present invention provides a method for discovering phenotype-associated SNP markers using a regularization method and selection probability, the method comprising:

(1) inputting data;

(2) preprocessing the input data;

(3) calculating selection probabilities for the preprocessed data and calculating a threshold for them; and

(4) providing visualization and results.

In a method according to one embodiment of the present invention, step (1) may consist of inputting SNP data in HapMap format together with phenotype data, but is not limited thereto.
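As an illustration of the input step, the sketch below reads tab-delimited HapMap text into plain Python structures. It is not the code of the "sp.gwas" package (which is written in R); the function name `read_hapmap`, the example data, and the use of `NN` for a missing call are assumptions made for this example. The eleven leading metadata columns follow the usual HapMap layout.

```python
import csv
import io

# The eleven leading metadata columns of a HapMap genotype file.
HAPMAP_META = ["rs#", "alleles", "chrom", "pos", "strand", "assembly#",
               "center", "protLSID", "assayLSID", "panelLSID", "QCcode"]


def read_hapmap(text):
    """Parse tab-delimited HapMap text into (sample_ids, snps).

    snps maps each SNP id to its chromosome, position, and a list of
    genotype calls ordered like sample_ids.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    n_meta = len(HAPMAP_META)
    sample_ids = header[n_meta:]  # everything after the metadata is a sample
    snps = {}
    for row in body:
        if not row:
            continue  # skip blank lines
        snps[row[0]] = {
            "chrom": row[2],
            "pos": int(row[3]),
            "calls": row[n_meta:],
        }
    return sample_ids, snps


# Hypothetical two-SNP, three-sample example ("NN" assumed to mean missing).
example = "\n".join([
    "\t".join(HAPMAP_META + ["S1", "S2", "S3"]),
    "\t".join(["snp1", "A/G", "1", "100", "+"] + ["NA"] * 6 + ["AA", "AG", "NN"]),
    "\t".join(["snp2", "C/T", "2", "250", "+"] + ["NA"] * 6 + ["CC", "CC", "TT"]),
])
sample_ids, snps = read_hapmap(example)
```

Keeping the sample IDs in header order makes it straightforward to match them against the IDs in the phenotype data during preprocessing.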

In a method according to one embodiment of the present invention, step (2) may consist of matching the sample IDs of the input SNP data and phenotype data; removing low-quality SNPs from the SNP data on the basis of MAF (minor allele frequency) and missing call rate; imputing missing values in the remaining SNPs on the basis of the allele frequency distribution; and removing phenotype data samples that have missing values together with the corresponding SNP data samples, but is not limited thereto.
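Two of the preprocessing operations can be sketched as follows. The text does not spell out the exact imputation rule, so this example takes one plausible reading: each missing call is replaced by a random draw from the genotype frequencies observed at the same SNP. The function names, the `NN` missing code, and the use of `None` for a missing phenotype are assumptions for illustration only.

```python
import random
from collections import Counter

MISSING = "NN"  # assumed code for a missing genotype call


def impute_by_frequency(calls, rng):
    """Replace missing calls with draws from the observed genotype frequencies."""
    observed = [g for g in calls if g != MISSING]
    if not observed:
        return list(calls)  # nothing observed, so nothing to draw from
    counts = Counter(observed)
    genotypes = list(counts)
    weights = [counts[g] for g in genotypes]
    return [g if g != MISSING else rng.choices(genotypes, weights=weights)[0]
            for g in calls]


def match_and_drop_missing(sample_ids, snp_calls, phenotype):
    """Keep samples present in both data sets whose phenotype is not missing."""
    keep = [i for i, s in enumerate(sample_ids) if phenotype.get(s) is not None]
    kept_ids = [sample_ids[i] for i in keep]
    kept_calls = {snp: [calls[i] for i in keep] for snp, calls in snp_calls.items()}
    kept_pheno = [phenotype[s] for s in kept_ids]
    return kept_ids, kept_calls, kept_pheno


rng = random.Random(0)
imputed = impute_by_frequency(["AA", "AA", "NN", "AG"], rng)
ids, calls, pheno = match_and_drop_missing(
    ["S1", "S2", "S3"],
    {"snp1": ["AA", "AG", "GG"]},
    {"S1": 1.2, "S2": None, "S4": 3.0},  # S2 missing, S3 absent, S4 unmatched
)
```

Drawing from the observed frequencies keeps the imputed data consistent with the SNP's empirical distribution; a fixed seed makes the draw reproducible.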

In the SNP marker discovery method according to the present invention, "MAF (minor allele frequency)" means the frequency of the less common allele at a given SNP. Without limitation, SNPs with a MAF below 5%, or more preferably below 1%, were excluded from the analysis.
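A minimal sketch of the MAF filter, assuming biallelic SNPs stored as two-character genotype calls such as `AG`, with `NN` marking a missing call. The function names and the default cutoff argument are illustrative, not taken from the package.

```python
from collections import Counter

MISSING = "NN"  # assumed code for a missing genotype call


def minor_allele_frequency(calls):
    """Return the MAF of one biallelic SNP from two-character genotype calls."""
    alleles = Counter()
    for genotype in calls:
        if genotype != MISSING:
            alleles.update(genotype)  # counts both allele characters
    total = sum(alleles.values())
    if total == 0 or len(alleles) < 2:
        return 0.0  # entirely missing or monomorphic
    return min(alleles.values()) / total


def filter_by_maf(snp_calls, min_maf=0.05):
    """Keep only SNPs whose MAF is at least min_maf."""
    return {snp: calls for snp, calls in snp_calls.items()
            if minor_allele_frequency(calls) >= min_maf}


snp_calls = {
    "common": ["AA", "AG", "GG", "AA"],   # 3 G alleles out of 8
    "rare": ["CC"] * 39 + ["CT"],         # 1 T allele out of 80
}
kept_at_5_percent = filter_by_maf(snp_calls, min_maf=0.05)
kept_at_1_percent = filter_by_maf(snp_calls, min_maf=0.01)
```

With the 5% cutoff only the `common` SNP survives, while the 1% cutoff retains both.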

The term "missing call rate" is also used interchangeably with call rate or missing rate, and means the proportion of individuals successfully observed for each SNP. In the method for discovering phenotype-associated SNP markers according to one embodiment of the present invention, SNPs for which this value is below 95% may be removed as low quality, but the cutoff is not limited to this value, and a person skilled in the art can set an appropriate cutoff according to the amount of data.
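The call-rate criterion can be sketched in a few lines, following the definition above (the proportion of individuals with an observed call). The `NN` missing code and the function names are assumptions for this example.

```python
MISSING = "NN"  # assumed code for a missing genotype call


def call_rate(calls):
    """Proportion of samples with an observed (non-missing) call."""
    if not calls:
        return 0.0
    return sum(g != MISSING for g in calls) / len(calls)


def filter_by_call_rate(snp_calls, min_call_rate=0.95):
    """Keep only SNPs observed in at least min_call_rate of the samples."""
    return {snp: calls for snp, calls in snp_calls.items()
            if call_rate(calls) >= min_call_rate}


snp_calls = {
    "complete": ["AA"] * 20,              # call rate 1.0
    "sparse": ["AA"] * 18 + ["NN"] * 2,   # call rate 0.9
}
kept = filter_by_call_rate(snp_calls)
```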

In the method for discovering phenotype-associated SNP markers according to one embodiment of the present invention, low-quality SNPs are first removed using MAF and missing call rate, and further low-quality SNPs may then be removed by adding one or more of two criteria: the significance probability (p-value) of the HWE (Hardy-Weinberg equilibrium) test and heterozygosity. Which of the two additional criteria to apply can be chosen according to the characteristics of the SNP data. Specifically, when the data of step (1) come from inbred samples, most genotypes are homozygous, so applying an HWE-based criterion is not desirable; instead, heterozygosity can be used as the criterion to select homozygous SNP genotypes with low heterozygosity values. When the data of step (1) come from non-inbred samples, most genotypes are heterozygous, so applying heterozygosity as a criterion is not desirable; instead, SNPs can be selected by applying the HWE test significance probability.

The term "HWE (Hardy-Weinberg equilibrium)" refers to the state in which allele frequencies in a population remain unchanged, in equilibrium, from generation to generation. In general, SNPs whose goodness-of-fit test yields a significance probability below 0.001 are excluded from the analysis. The HWE test can be carried out with conventional methods known in the art.
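One conventional form of the HWE test is a chi-square goodness-of-fit test with one degree of freedom on the three genotype counts of a biallelic SNP. The sketch below is an illustrative implementation of that textbook test, not necessarily the variant used by the package; the function name is an assumption. For one degree of freedom the upper-tail chi-square probability equals `erfc(sqrt(x / 2))`, which avoids any dependency beyond the standard library.

```python
import math


def hwe_chi_square_pvalue(n_aa, n_ab, n_bb):
    """P-value of the 1-df chi-square test for Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic SNP: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))


in_equilibrium = hwe_chi_square_pvalue(25, 50, 25)   # matches p^2, 2pq, q^2
no_heterozygotes = hwe_chi_square_pvalue(50, 0, 50)  # strong departure
excluded = no_heterozygotes < 0.001                  # cutoff quoted above
```

The second example illustrates why the text advises against an HWE criterion for inbred samples: a SNP with no heterozygotes departs strongly from Hardy-Weinberg proportions and would be excluded even though its genotype calls are sound.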

In a method according to one embodiment of the present invention, low-quality SNPs may be removed using all four measures, namely MAF (minor allele frequency), missing call rate, HWE (Hardy-Weinberg equilibrium) test significance probability, and heterozygosity; alternatively, low-quality SNPs may first be removed using MAF and missing call rate, and SNPs may then be further filtered using the HWE test significance probability or heterozygosity.

In a method according to one embodiment of the present invention, step (3) calculates, from the preprocessed SNP data and phenotype data, a selection probability for each SNP and a corresponding threshold, using the R package 'glmnet' to apply a regularization method based on a generalized linear model with a penalty function. The selection probability is calculated by applying the regularization method to repeatedly drawn subsamples and taking the proportion of subsamples in which the SNP is selected. The threshold for the selection probability is calculated by randomly permuting the given data, calculating the selection probabilities, and taking the θ-th highest selection probability; this process is repeated, and the resulting values are averaged.
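The selection-probability idea can be sketched as follows for a continuous phenotype. This is not glmnet: to stay self-contained, the example uses a small hand-written lasso solver (cyclic coordinate descent on centred, unstandardized columns). The half-size subsamples drawn without replacement, the single fixed penalty `lam`, the 0/1/2 genotype coding, and all function names are assumptions for illustration.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used by the lasso update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Lasso coefficients via cyclic coordinate descent on centred data."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    col_mean = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - col_mean[j] for j in range(p)] for row in X]
    resid = [yi - y_mean for yi in y]  # residual for beta = 0
    col_ss = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue  # constant column cannot be selected
            # Covariance of column j with the partial residual (j excluded).
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            if new != beta[j]:
                delta = new - beta[j]
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = new
    return beta


def selection_probability(X, y, lam, n_subsamples=30, seed=0):
    """Fraction of half-size subsamples in which each SNP gets a nonzero coefficient."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)
        beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            if beta[j] != 0.0:
                counts[j] += 1
    return [c / n_subsamples for c in counts]


# Hypothetical data: 60 samples, 5 SNPs coded 0/1/2, only SNP 0 affects y.
data_rng = random.Random(1)
X = [[data_rng.choice((0, 1, 2)) for _ in range(5)] for _ in range(60)]
y = [2.0 * row[0] + data_rng.gauss(0.0, 0.3) for row in X]
sp = selection_probability(X, y, lam=0.3)
```

Because the causal SNP is selected in every subsample while noise SNPs enter only occasionally, its selection probability stands out, which is the stability property the method relies on.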

Specifically, a method according to one embodiment of the present invention may comprise, but is not limited to, the following steps:

(1) inputting SNP data in HapMap format and phenotype data;

(2) preprocessing the input data by matching the sample IDs of the SNP data and phenotype data, removing low-quality SNPs from the SNP data on the basis of MAF (minor allele frequency) and missing call rate, imputing missing values in the remaining SNPs on the basis of the allele frequency distribution, and removing phenotype data samples that have missing values together with the corresponding SNP data samples;

(3) calculating, from the preprocessed SNP data and phenotype data, a selection probability for each SNP and a corresponding threshold, using the R package 'glmnet' to apply a regularization method based on a generalized linear model with a penalty function, wherein the selection probability is calculated by applying the regularization method to repeatedly drawn subsamples and taking the resulting proportion, and the threshold for the selection probability is calculated by randomly permuting the given data, calculating the selection probabilities, taking the θ-th highest selection probability, repeating this process, and averaging the values obtained; and

(4) providing visualization and results.

The present invention also provides a recording medium on which a computer-readable program for performing the method is recorded.

A computer-readable recording medium means any recording medium that can be read and accessed directly by a computer. Examples include magnetic recording media such as floppy disks, hard disks, and magnetic tape; optical recording media such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, and DVD-RW; electrical recording media such as RAM and ROM; and hybrids of these categories (for example, magneto-optical media such as MO), but are not limited to these.

The choice of a device for recording or inputting to the recording medium, or of a device or apparatus for reading the information on it, depends on the type of recording medium and the access method. Various data processor programs, software, comparators, and formats can be used to record on the medium a program for performing the method of the present invention. The information can be represented, for example, as a binary file, a text file, or an ASCII file formatted with commercially available software.

Hereinafter, the present invention is described in detail by way of an example. The following example merely illustrates the present invention, and the scope of the invention is not limited to it.

The procedure of the method proposed in the present invention for discovering phenotype-associated SNP markers using a regularization method and selection probability consists of four steps: data input, data preprocessing, SNP marker selection, and provision of result files and visualization.

In the data input step, SNP data in HapMap format and phenotype data are read in; the phenotype may be either continuous or categorical. In the data preprocessing step, the sample IDs of the SNP data and phenotype data are matched, and low-quality SNPs are then removed on the basis of four measures: MAF (minor allele frequency), missing call rate, HWE (Hardy-Weinberg equilibrium) test significance probability, and heterozygosity. Missing values in the remaining SNPs are imputed on the basis of the allele frequency distribution. Phenotype data samples with missing values and the corresponding SNP data samples are then removed. For a continuous phenotype, a normality test is performed and a histogram is provided, after which a user-supplied input argument determines whether the phenotype is normalized.

Next, from the preprocessed SNP data and phenotype data, the selection probability of each SNP and the corresponding threshold are calculated with the R package 'glmnet', using a regularization method based on a generalized linear model with a penalty function such as the lasso or elastic net. The selection probability is calculated by applying the regularization method to repeatedly drawn subsamples and taking the resulting proportion. The threshold for the selection probability is calculated by randomly permuting the given data, calculating the selection probabilities, and taking the θ-th highest selection probability; this process is repeated, and the resulting values are averaged.
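The threshold calculation described above can be sketched independently of the selector. The example assumes that "permuting the given data" means shuffling the phenotype values against the genotypes, and that a SNP is reported when its selection probability exceeds the threshold; neither detail is stated in the text. To keep the block short, `toy_selection_prob` is a deliberately simple stand-in (it picks the column with the largest absolute covariance in each half-size subsample) and is not the glmnet-based regularization; all names are hypothetical.

```python
import random


def permutation_threshold(X, y, selection_prob_fn, theta=1, n_perm=20, seed=0):
    """Average, over permutations of y, of the theta-th highest selection probability."""
    rng = random.Random(seed)
    tops = []
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break any genotype-phenotype association
        probs = sorted(selection_prob_fn(X, y_perm), reverse=True)
        tops.append(probs[theta - 1])
    return sum(tops) / n_perm


def toy_selection_prob(X, y, n_subsamples=20, seed=0):
    """Stand-in selector (NOT glmnet): top-covariance column per subsample."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)
        m = len(idx)
        y_bar = sum(y[i] for i in idx) / m
        cov = []
        for j in range(p):
            x_bar = sum(X[i][j] for i in idx) / m
            cov.append(abs(sum((X[i][j] - x_bar) * (y[i] - y_bar) for i in idx)) / m)
        counts[cov.index(max(cov))] += 1
    return [c / n_subsamples for c in counts]


# Hypothetical data: 60 samples, 5 SNPs coded 0/1/2, only SNP 0 affects y.
data_rng = random.Random(1)
X = [[data_rng.choice((0, 1, 2)) for _ in range(5)] for _ in range(60)]
y = [2.0 * row[0] + data_rng.gauss(0.0, 0.3) for row in X]

threshold = permutation_threshold(X, y, toy_selection_prob, theta=1)
observed = toy_selection_prob(X, y)
selected = [j for j, prob in enumerate(observed) if prob > threshold]
```

Passing the selector as a function keeps the threshold logic separate from the model, so the same routine would work with a lasso or elastic-net based selection probability.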

Finally, all result files produced in the preceding steps, including the data before and after preprocessing, the selection probabilities, and the threshold, are saved in a user-specified folder, and the user can draw a Manhattan plot from the result folder with the visualization function provided by the present invention.

The present invention was developed as a package for R, an open-source software environment, to make it more accessible to general researchers, and it automates most of the analysis to make it convenient to use. It differs from the prior art in that it calculates selection probabilities using regularization, a variable-selection technique for high-dimensional data that is widely used in statistics, and can thereby discover phenotype-associated SNP markers more stably and accurately.

Claims

A method for discovering SNP markers associated with phenotypic traits using a regularization method and selection probability, the method comprising:
(1) inputting SNP data in HapMap format and phenotype data;
(2) preprocessing the input data by matching the sample IDs of the SNP data and phenotype data, removing low-quality SNPs from the SNP data on the basis of an MAF (minor allele frequency) of less than 5% and a missing call rate of less than 95%, imputing missing values in the remaining SNPs on the basis of the allele frequency distribution, and removing phenotype data samples that have missing values together with the corresponding SNP data samples;
(3) calculating, from the preprocessed SNP data and phenotype data, a selection probability for each SNP and a corresponding threshold, using the R package 'glmnet' to apply a regularization method based on a generalized linear model with a penalty function, wherein the selection probability is calculated by applying the regularization method to repeatedly drawn subsamples and taking the resulting proportion, and the threshold for the selection probability is calculated by randomly permuting the given data, calculating the selection probabilities, taking the θ-th highest selection probability, repeating this process, and averaging the values obtained; and
(4) providing visualization and results,
wherein the selection probability of step (3) is calculated using the following algorithm,

and wherein the threshold for the selection probability of step (3) is calculated using the following algorithm.

delete

A recording medium on which a computer-readable program for performing the method of claim 1 is recorded.