KR20240006339A

KR20240006339A - Device and method of selecting STR marker candidates

Info

Publication number: KR20240006339A
Application number: KR1020220083275A
Authority: KR
Inventors: 신동용; 김경수; 신동훈; 박원; 김진호; 강태욱; 김무상; 김경현; 정의석; 채준영; 유선혜
Original assignee: 주식회사 코아아이티; (주)더모아젠
Priority date: 2022-07-06
Filing date: 2022-07-06
Publication date: 2024-01-15

Abstract

The present invention relates to an apparatus and method for selecting STR (Short Tandem Repeat) marker candidates. In particular, an STR dataset is preprocessed and then reduced in size to make the procedure more efficient. A machine learning classification model is used to calculate the importance of each STR locus of the samples, and the STR loci with the lowest importance are removed to reduce the number of marker candidates. An importance ranking of the remaining STR loci is then calculated, and STR loci are selected on the basis of that ranking for use as STR marker candidates.

Description

Device and method for selecting STR marker candidates {DEVICE AND METHOD OF SELECTING STR MARKER CANDIDATES}

In general, a genetic marker is a DNA sequence whose physical location on a chromosome is known, in humans or animals. Variation arising from mutations or alterations at a genetic locus can be used as a genetic marker. One class of genetic marker is the simple sequence length polymorphism (SSLP). An SSLP is a difference in the number of times a sequence is repeated within a DNA sequence, and this length polymorphism is what serves as the marker. Examples of SSLPs are microsatellites (also called short tandem repeats, STRs) and minisatellites. A microsatellite is generally a DNA fragment in which a unit of 2 to 6 base pairs (bp) or more is repeated with identical content. The unit is typically repeated 5 to 50 times, and differences in repeat number arise mainly through slippage during DNA replication. A minisatellite, the other example of an SSLP, has a repeat unit of 10 or more bases and is concentrated near the telomeres of chromosomes. Because the repeat length of microsatellites varies between individuals or groups and microsatellites are spread evenly across the whole chromosome, they are used as genetic markers more often than minisatellites. For example, microsatellites are widely used in paternity testing by genetic identification, forensics (verification, identification), agricultural and marine products (origin and species identification), food (pathogen identification), and identification of medicinal materials.

Discovering genetic markers generally requires a reference sequence. Conventional methods for discovering STR markers also require one, so for animals and plants that lack a reference sequence, a reference is first built by an assembly process that stitches DNA sequencing data together. In that case accuracy suffers in repeat regions, because the same sequence occurs repeatedly. The marker discovery solution developed by the applicant (denovoPoly) predicts the STR region of a region of interest by aligning sequencing data to increase the depth over the same region, and it was developed to overcome this problem by discovering STR markers without a reference sequence. It can be applied to the many animals and plants whose reference sequences are unknown, and it can search for markers quickly and accurately without an assembly step.

However, a limitation of denovoPoly is that it produces too many candidate STR markers for species discrimination and identification. When the candidate set is too large, the experiments needed to validate each marker take a great deal of time and effort. Previously, the candidate set was narrowed with a top-N method based on the p-value. Because this method judges by the p-value alone, its accuracy can degrade with other factors such as the number of samples, and it is limited in its ability to select multiple markers.

Korean Laid-open Patent Publication No. 10-2017-0074418 (published June 30, 2017)

The present invention was made to solve the problems described above, and its object is to provide an apparatus and method for selecting STR marker candidates that can select STR marker candidates with high accuracy.

To achieve the above object, an apparatus for selecting STR marker candidates according to an embodiment of the present invention comprises: an STR dataset acquisition unit configured to acquire an STR (Short Tandem Repeat) dataset from a sequence dataset of samples through a marker discovery solution; a dataset change and preprocessing unit configured to convert the STR dataset into a dataset of STR counts for each STR locus (flanking sequence) of the samples and to preprocess the converted dataset to reduce the number of STR loci; a first dataset separation unit configured to split the preprocessed dataset into a training dataset and a validation dataset according to a set ratio; a machine learning classification model selection unit configured to select one of a plurality of preset machine learning classification models based on the training dataset and the validation dataset; a sample STR locus importance calculation and reprocessing unit configured to calculate, through the selected machine learning classification model, the importance of each STR locus of the samples in the preprocessed dataset and to remove the STR loci with the lowest importance from the preprocessed dataset; a second dataset separation unit configured to split the dataset from which the lowest-importance STR loci have been removed into a training dataset and a validation dataset according to a set ratio; a sample STR locus importance ranking calculation unit configured to calculate, based on the training dataset and validation dataset obtained by the second dataset separation unit, an importance ranking of the STR loci of the samples in the dataset from which the lowest-importance STR loci have been removed; and an STR marker candidate selection unit configured to select, from the dataset from which the lowest-importance STR loci have been removed, the STR loci with the highest importance rank and to use them as STR marker candidates.

In the apparatus for selecting STR marker candidates according to the above embodiment, the marker discovery solution may be denovoPoly.

In the apparatus for selecting STR marker candidates according to the above embodiment, the preprocessing may use a method such as low-variance removal.

In the apparatus for selecting STR marker candidates according to the above embodiment, a grid search method may be used to select one of the plurality of preset machine learning classification models.

In the apparatus for selecting STR marker candidates according to the above embodiment, the importance ranking of the STR loci of the samples may be calculated using the RFECV (Recursive Feature Elimination with Cross Validation) method.

To achieve the above object, a method for selecting STR marker candidates according to another embodiment of the present invention may include the steps of: an STR dataset acquisition unit acquiring an STR dataset from a sequence dataset of samples through a marker discovery solution; a dataset change and preprocessing unit converting the STR dataset into a dataset of STR counts for each STR locus (flanking sequence) of the samples and preprocessing the converted dataset to reduce the number of STR loci; a first dataset separation unit splitting the preprocessed dataset into a training dataset and a validation dataset according to a set ratio; a machine learning classification model selection unit selecting one of a plurality of preset machine learning classification models based on the training dataset and the validation dataset; a sample STR locus importance calculation and reprocessing unit calculating, through the selected machine learning classification model, the importance of each STR locus of the samples in the preprocessed dataset and removing the STR loci with the lowest importance from the preprocessed dataset; a second dataset separation unit splitting the dataset from which the lowest-importance STR loci have been removed into a training dataset and a validation dataset according to a set ratio; a sample STR locus importance ranking calculation unit calculating, based on the training dataset and validation dataset obtained by the second dataset separation unit, an importance ranking of the STR loci of the samples in the dataset from which the lowest-importance STR loci (for example, those with an importance of 0) have been removed; and an STR marker candidate selection unit selecting, from the dataset from which the lowest-importance STR loci have been removed, the STR loci with the highest importance rank and using them as STR marker candidates.

According to the apparatus and method for selecting STR marker candidates according to an embodiment of the present invention, the STR dataset is converted into a dataset of STR counts for each STR locus of the samples, and the converted dataset is preprocessed to reduce the number of STR loci. The preprocessed dataset is split into a training dataset and a validation dataset according to a set ratio, and one of a plurality of preset machine learning classification models is selected based on them. The selected model is used to calculate the importance of each STR locus of the samples in the preprocessed dataset, and the STR loci with the lowest importance are removed. The resulting dataset is again split into a training dataset and a validation dataset according to a set ratio, and these are used to calculate an importance ranking of the remaining STR loci. The STR loci with the highest importance rank are then selected and used as STR marker candidates. This configuration has the effect that STR marker candidates can be selected with high accuracy.

Figure 1 is a block diagram of an apparatus for selecting STR marker candidates according to an embodiment of the present invention.
Figure 2 is a flowchart illustrating a method for selecting STR marker candidates according to an embodiment of the present invention.
Figure 3 is an example of the STR dataset acquired by the STR dataset acquisition unit of Figure 1.
Figure 4 is an example of the dataset of STR counts for each STR locus (flanking sequence) of the samples, into which the dataset change and preprocessing unit of Figure 1 converts the STR dataset.
Figure 5 is an example of the per-locus importance of the samples in the dataset, as calculated by the sample STR locus importance calculation and reprocessing unit of Figure 1.
Figure 6 is an example of the per-locus importance ranking of the samples in the dataset, as calculated by the sample STR locus importance ranking calculation unit of Figure 1.

In describing embodiments of the present invention, detailed descriptions of known technology related to the invention are omitted where they would unnecessarily obscure the gist of the invention. The terms used below are defined in view of their functions in the present invention and may vary with the intention or practice of a user or operator. Their definitions should therefore be based on the content of this specification as a whole. The terms used in the detailed description serve only to describe embodiments of the present invention and should not be construed as limiting. Unless clearly used otherwise, singular expressions include the plural. In this description, expressions such as "including" or "having" indicate certain features, numbers, steps, operations, elements, or parts or combinations thereof, and should not be construed as excluding the presence or possibility of one or more other features, numbers, steps, operations, elements, or parts or combinations thereof beyond those described.

In each apparatus shown in the drawings, elements may in some cases carry the same reference number or different reference numbers, to suggest that the elements represented may be different or similar. An element may, however, have different implementations and work with some or all of the apparatuses shown or described in this specification. The various elements shown in the drawings may be the same or different. Which one is referred to as the first element and which as the second element is arbitrary.

In this specification, when one component is said to 'transmit', 'deliver', or 'provide' data or a signal to another component, this covers both sending the data or signal directly to the other component and sending it to the other component through at least one further component.

Hereinafter, embodiments of the present invention are described in detail with reference to the drawings.

Figure 1 is a block diagram of an apparatus for selecting STR marker candidates according to an embodiment of the present invention.

As shown in Figure 1, the apparatus for selecting STR marker candidates according to an embodiment of the present invention includes an STR dataset acquisition unit 100, a dataset change and preprocessing unit 200, a first dataset separation unit 300, a machine learning classification model selection unit 400, a sample STR locus importance calculation and reprocessing unit 500, a second dataset separation unit 600, a sample STR locus importance ranking calculation unit 700, and an STR marker candidate selection unit 800. All of these units may be implemented in a single terminal device (for example, a notebook computer, personal computer, PDA, PMP, or smartphone).

The STR dataset acquisition unit 100 acquires an STR (Short Tandem Repeat) dataset from a sequence dataset of samples through a marker discovery solution (for example, denovoPoly). As shown in Figure 3, the STR dataset acquired by the STR dataset acquisition unit 100 contains information such as which type of STR is present at which position (STR locus) in the sequence dataset of each sample.

The dataset change and preprocessing unit 200 converts the STR dataset acquired by the STR dataset acquisition unit 100 into a dataset of STR counts for each STR locus (flanking sequence) of the samples. In the converted dataset, the features of a sample are the STR loci and the feature values are the STR counts at each locus. The converted dataset usually contains thousands to tens of thousands of features (STR loci) (see Figure 4).
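
The conversion described above amounts to pivoting per-sample STR records into a samples-by-loci matrix. The following is a minimal sketch of that idea, not the format used by denovoPoly: the record layout `(sample, locus, count)`, the locus names, and the choice to fill absent loci with 0 are all assumptions made for illustration.

```python
def to_count_matrix(records):
    """Pivot (sample, locus, count) records into a samples x loci matrix.

    Assumption: a locus that is absent for a sample gets the value 0.
    """
    samples = sorted({sample for sample, _, _ in records})
    loci = sorted({locus for _, locus, _ in records})
    row = {sample: i for i, sample in enumerate(samples)}
    col = {locus: j for j, locus in enumerate(loci)}
    matrix = [[0] * len(loci) for _ in samples]
    for sample, locus, count in records:
        matrix[row[sample]][col[locus]] = count
    return samples, loci, matrix


# Hypothetical records; real loci would be identified by flanking sequence.
records = [
    ("s1", "locus_a", 12),
    ("s1", "locus_b", 7),
    ("s2", "locus_a", 9),
    ("s2", "locus_c", 4),
]
samples, loci, matrix = to_count_matrix(records)
```

Each row of `matrix` is one sample and each column one STR locus, which is the shape the later feature-selection steps operate on.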

The dataset change and preprocessing unit 200 also preprocesses the converted dataset to reduce the number of features (STR loci) as far as possible, for example by using a "low-variance removal" method.
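
Low-variance removal can be sketched as dropping every locus whose count barely changes across samples, since such a locus cannot help separate groups. The threshold below is a hypothetical parameter; the source does not state which value is used.

```python
from statistics import pvariance


def drop_low_variance(loci, matrix, threshold=0.0):
    """Keep only the loci whose population variance exceeds `threshold`."""
    keep = [
        j for j in range(len(loci))
        if pvariance([row[j] for row in matrix]) > threshold
    ]
    kept_loci = [loci[j] for j in keep]
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    return kept_loci, kept_matrix


# Hypothetical data: locus_b has the same count in every sample.
loci = ["locus_a", "locus_b", "locus_c"]
matrix = [[12, 7, 4], [9, 7, 4], [12, 7, 5]]
kept_loci, kept_matrix = drop_low_variance(loci, matrix)
```

With the default threshold of 0, only constant columns are removed; a larger threshold would prune more aggressively.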

The first dataset separation unit 300 splits the dataset preprocessed by the dataset change and preprocessing unit 200 into a training dataset and a validation dataset according to a set ratio (for example, 8:2).
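
An 8:2 split can be sketched as a seeded shuffle followed by a cut. The function name, the fixed seed, and the absence of stratification by class are assumptions of this sketch; the source only specifies that a set ratio is used.

```python
import random


def split_dataset(rows, labels, train_ratio=0.8, seed=0):
    """Shuffle sample indices reproducibly and cut them at `train_ratio`."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(round(len(idx) * train_ratio))
    train_idx, valid_idx = idx[:cut], idx[cut:]

    def pick(ids, seq):
        return [seq[i] for i in ids]

    train = (pick(train_idx, rows), pick(train_idx, labels))
    valid = (pick(valid_idx, rows), pick(valid_idx, labels))
    return train, valid


# Hypothetical data: ten samples with one feature each.
rows = [[i] for i in range(10)]
labels = ["A"] * 5 + ["B"] * 5
(train_X, train_y), (valid_X, valid_y) = split_dataset(rows, labels)
```

With ten samples this yields eight training rows and two validation rows. For small or unbalanced sample sets, a split that preserves class proportions may be preferable.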

The machine learning classification model selection unit 400 uses the training dataset and validation dataset obtained by the first dataset separation unit 300, together with a methodology such as grid search, to select from a plurality of preset machine learning classification models (tree-ensemble-based models such as random forest) the one model that best explains the training dataset.
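
Grid search tries every combination of preset settings and keeps the one that scores best on held-out data. To stay dependency-free, the sketch below swaps in a small k-nearest-neighbour classifier in place of the tree-ensemble models named in the source; the parameter names `k` and `p` and the toy data are hypothetical. In practice a library implementation of grid search over random forest settings would be the more natural choice.

```python
from collections import Counter
from itertools import product


def knn_predict(train_X, train_y, x, k, p):
    """Majority vote of the k nearest training rows (Minkowski order p)."""
    def dist(a, b):
        return sum(abs(u - v) ** p for u, v in zip(a, b))

    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def grid_search(train, valid, grid):
    """Score every parameter combination on the validation set; keep the best."""
    (train_X, train_y), (valid_X, valid_y) = train, valid
    best = None
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        hits = sum(
            knn_predict(train_X, train_y, x, **params) == y
            for x, y in zip(valid_X, valid_y)
        )
        score = hits / len(valid_y)
        if best is None or score > best[1]:  # first best wins on ties
            best = (params, score)
    return best


# Hypothetical STR count rows for two groups, A and B.
train = ([[5, 7], [6, 8], [5, 8], [12, 7], [11, 8], [13, 7]],
         ["A", "A", "A", "B", "B", "B"])
valid = ([[6, 7], [12, 8]], ["A", "B"])
best_params, best_score = grid_search(train, valid, {"k": [1, 3], "p": [1, 2]})
```

Each parameter combination plays the role of one "preset model"; the selected combination is the one carried forward to the importance calculation.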

The sample STR locus importance calculation and reprocessing unit 500 uses the machine learning classification model selected by the machine learning classification model selection unit 400 to calculate the feature importance of each STR locus (feature) of the samples in the dataset preprocessed by the dataset change and preprocessing unit 200, and removes the STR loci with the lowest importance (for example, those with an importance of 0) from the preprocessed dataset. This removal typically reduces the number of STR loci (features) to the order of tens to hundreds. Figure 5 shows an example of the per-locus importance calculated by the sample STR locus importance calculation and reprocessing unit 500.
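
Once a fitted model has produced one importance value per locus, the removal step is a column filter. The sketch below takes the importances as given; the values shown are hypothetical stand-ins for what a tree-ensemble model would report, and the `floor` parameter is an assumption.

```python
def drop_unimportant(loci, matrix, importances, floor=0.0):
    """Remove every locus whose importance is not above `floor`."""
    keep = [j for j, weight in enumerate(importances) if weight > floor]
    kept_loci = [loci[j] for j in keep]
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    return kept_loci, kept_matrix


# Hypothetical importances: two loci contributed nothing to the model.
loci = ["locus_a", "locus_b", "locus_c", "locus_d"]
importances = [0.62, 0.0, 0.38, 0.0]
matrix = [[12, 7, 4, 1], [9, 7, 5, 1]]
reduced_loci, reduced_matrix = drop_unimportant(loci, matrix, importances)
```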

The second dataset separation unit 600 splits the dataset from which the lowest-importance STR loci have been removed by the sample STR locus importance calculation and reprocessing unit 500 into a training dataset and a validation dataset according to a set ratio.

The sample STR locus importance ranking calculation unit 700 uses the training dataset and validation dataset obtained by the second dataset separation unit 600, together with, for example, the RFECV (Recursive Feature Elimination with Cross Validation) method, to calculate an importance ranking of the STR loci of the samples in the dataset (the dataset from which the lowest-importance STR loci have been removed by the sample STR locus importance calculation and reprocessing unit 500).
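
The idea behind recursive feature elimination with cross-validation can be sketched in plain Python: repeatedly drop the least important feature, score each intermediate feature set by cross-validation, and give rank 1 to the features in the best-scoring set. This is a simplified illustration, not the source's implementation. The importance measure (gap between class means), the nearest-centroid classifier, the fold assignment, and the tie rule (prefer the smallest best-scoring set) are all assumptions.

```python
from statistics import mean


def class_gap(X, y, j):
    """Hypothetical importance: gap between the two class means of feature j."""
    a = [row[j] for row, label in zip(X, y) if label == 0]
    b = [row[j] for row, label in zip(X, y) if label == 1]
    return abs(mean(a) - mean(b))


def cv_accuracy(X, y, feats, k=4):
    """k-fold accuracy of a nearest-centroid classifier restricted to `feats`."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        centroids = {}
        for label in (0, 1):
            rows = [X[i] for i in train if y[i] == label]
            centroids[label] = [mean(r[j] for r in rows) for j in feats]
        for i in test:
            dist = {
                label: sum((X[i][j] - c) ** 2 for j, c in zip(feats, centre))
                for label, centre in centroids.items()
            }
            correct += min(dist, key=dist.get) == y[i]
    return correct / len(X)


def rfecv_ranking(X, y, k=4):
    """Return {feature index: rank}; rank 1 marks the selected features."""
    feats = list(range(len(X[0])))
    history, eliminated = [], []
    while feats:
        history.append((list(feats), cv_accuracy(X, y, feats, k)))
        if len(feats) == 1:
            break
        worst = min(feats, key=lambda j: class_gap(X, y, j))
        feats.remove(worst)
        eliminated.append(worst)
    best_score = max(score for _, score in history)
    best = min((s for s, score in history if score == best_score), key=len)
    ranking = {j: 1 for j in best}
    dropped = [j for j in eliminated if j not in best]
    for pos, j in enumerate(reversed(dropped)):
        ranking[j] = pos + 2  # the later a feature was dropped, the better its rank
    return ranking


# Hypothetical counts: feature 0 separates the groups, feature 1 is noise.
X = [[5, 7, 3], [5, 8, 4], [6, 7, 3], [5, 8, 4],
     [12, 8, 4], [11, 7, 5], [12, 8, 4], [13, 7, 5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
ranking = rfecv_ranking(X, y)
```

In this toy data, feature 0 alone already classifies every fold correctly, so it is the only rank-1 feature. The loci with rank 1 are the ones that would be carried forward as marker candidates.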

Figure 6 shows an example of the per-locus importance ranking of the samples in the dataset, as calculated by the sample STR locus importance ranking calculation unit 700.

The STR marker candidate selection unit 800 selects, from the dataset (the dataset from which the lowest-importance STR loci have been removed by the sample STR locus importance calculation and reprocessing unit 500), the STR loci with the highest importance rank (the STR loci with rank 1) and uses them as STR marker candidates.

A method for selecting STR marker candidates using the apparatus configured as described above is now described.

Figure 2 is a flowchart illustrating a method for selecting STR marker candidates according to an embodiment of the present invention, where S denotes a step.

First, the STR dataset acquisition unit 100 acquires an STR dataset from a sequence dataset of samples using a marker discovery solution (denovoPoly) (S10).

Next, the dataset change and preprocessing unit 200 converts the STR dataset acquired in step S10 into a dataset of STR counts for each STR locus (flanking sequence) of the samples (S20), and reduces the number of STR loci by preprocessing the converted dataset, for example with a low-variance removal method (S30).

Next, the first dataset separation unit 300 splits the dataset preprocessed in step S30, in which the number of STR loci has been reduced, into a training dataset and a validation dataset according to a set ratio (for example, 8:2) (S40).

Next, the machine learning classification model selection unit 400 selects one of a plurality of preset machine learning classification models based on the training dataset and validation dataset obtained in step S40, for example by using a grid search method (S50).

Next, the sample STR locus importance calculation and reprocessing unit 500 uses the machine learning classification model selected in step S50 to calculate the importance of each STR locus of the samples in the dataset preprocessed in step S30 (S60), and removes the STR loci with the lowest importance (for example, STR loci with an importance of 0) from that dataset (S70).

Next, the second dataset separation unit 600 splits the dataset from which the lowest-importance STR loci were removed in step S70 into a training dataset and a validation dataset according to a set ratio (for example, 8:2) (S80).

Next, the sample STR locus importance ranking calculation unit 700 calculates an importance ranking of the STR loci of the samples in the dataset obtained in step S70, based on the training dataset and validation dataset obtained in step S80 and using, for example, the RFECV (Recursive Feature Elimination with Cross Validation) method (S90).

Next, the STR marker candidate selection unit 800 selects, from the dataset obtained in step S70, the STR loci with the highest importance rank (STR loci with rank 1) and uses them as STR marker candidates (S100).

According to the STR marker candidate selection device and method of the embodiment of the present invention configured as described above, the STR dataset is converted into a dataset of STR counts for each STR locus of each sample, and the converted dataset is preprocessed to reduce the number of STR loci. The preprocessed dataset is separated into a training dataset and a validation dataset according to a set ratio, and one of a plurality of preset machine learning classification models is selected based on the training dataset and the validation dataset. The selected machine learning classification model is used to calculate the importance of each STR locus for the samples of the preprocessed dataset, and the STR locus with the lowest importance is removed from the preprocessed dataset. The dataset from which the STR locus with the lowest importance has been removed is again separated into a training dataset and a validation dataset according to a set ratio, and the importance ranking of each STR locus is calculated based on the training dataset and validation dataset thus obtained. Finally, the STR locus with the highest importance rank is selected from that dataset and used as an STR marker candidate. As a result, STR marker candidates can be selected with high accuracy.
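
The earlier stages summarized above (low-variance removal, separation into training and validation datasets, model selection, and removal of the least important locus) can be sketched as follows. This is a rough illustration under stated assumptions, not the embodiment's implementation. It assumes scikit-learn, a zero-variance threshold, two arbitrary candidate classifiers with small hypothetical parameter grids, a 75:25 split, and impurity-based `feature_importances_` as the importance measure. The helper name `preprocess_and_prune` and the toy data are invented for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import GridSearchCV, train_test_split


def preprocess_and_prune(X, y, locus_names, test_size=0.25, random_state=0):
    """Hypothetical helper: preprocess, pick a model, drop the weakest locus."""
    # 1) Low-variance removal: drop loci whose repeat count never varies.
    keep = VarianceThreshold(threshold=0.0).fit(X).get_support()
    X = X[:, keep]
    names = [n for n, k in zip(locus_names, keep) if k]

    # 2) Separate into training and validation datasets at a set ratio.
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )

    # 3) Grid-search each preset model, then keep the one that scores
    #    best on the validation dataset (models and grids are assumptions).
    presets = [
        (RandomForestClassifier(random_state=random_state), {"n_estimators": [25, 50]}),
        (GradientBoostingClassifier(random_state=random_state), {"n_estimators": [25, 50]}),
    ]
    best_model, best_score = None, -1.0
    for estimator, grid in presets:
        search = GridSearchCV(estimator, grid, cv=3).fit(X_tr, y_tr)
        score = search.best_estimator_.score(X_va, y_va)
        if score > best_score:
            best_model, best_score = search.best_estimator_, score

    # 4) Compute per-locus importance and remove the least important locus.
    drop = int(np.argmin(best_model.feature_importances_))
    kept = names[:drop] + names[drop + 1:]
    return np.delete(X, drop, axis=1), kept, names[drop], best_score


# Hypothetical toy data: 80 samples x 5 loci of STR repeat counts.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 40)
X = rng.integers(5, 15, size=(80, 5))
X[:, 0] = 9                          # constant locus, removed by preprocessing
X[:, 3] = np.where(y == 1, 13, 6)    # locus that separates the two groups
locus_names = [f"locus_{i}" for i in range(5)]

X_pruned, kept, dropped, score = preprocess_and_prune(X, y, locus_names)
print("kept:", kept, "dropped:", dropped, "validation accuracy:", score)
```

The pruned matrix returned here would then be separated again and passed to the ranking step.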

Optimal embodiments are disclosed in the drawings and the specification. Although specific terms are used, they are used only to describe embodiments of the present invention, and not to limit their meaning or to restrict the scope of the present invention set forth in the claims. Those of ordinary skill in the art will therefore understand that various modifications and other equivalent embodiments are possible. Accordingly, the true scope of technical protection of the present invention should be determined by the technical spirit of the appended claims.

100: STR dataset acquisition unit
200: Dataset conversion and preprocessing unit
300: First dataset separation unit
400: Machine learning classification model selection unit
500: Sample STR locus importance calculation and reprocessing unit
600: Second dataset separation unit
700: Sample STR locus importance ranking calculation unit
800: STR marker candidate selection unit

Claims

An STR marker candidate selection device comprising:
an STR dataset acquisition unit configured to acquire an STR (Short Tandem Repeat) dataset from a sample sequence dataset through a marker discovery solution;
a dataset conversion and preprocessing unit configured to convert the STR dataset into a dataset of STR counts for each STR locus (flanking sequence) of a sample and to preprocess the converted dataset to reduce the number of STR loci;
a first dataset separation unit configured to separate the preprocessed dataset into a training dataset and a validation dataset according to a set ratio;
a machine learning classification model selection unit configured to select one of a plurality of preset machine learning classification models based on the training dataset and the validation dataset;
a sample STR locus importance calculation and reprocessing unit configured to calculate the importance of each STR locus of a sample of the preprocessed dataset through the selected machine learning classification model and to remove the STR locus with the lowest importance from the preprocessed dataset;
a second dataset separation unit configured to separate the dataset from which the STR locus with the lowest importance has been removed into a training dataset and a validation dataset according to a set ratio;
a sample STR locus importance ranking calculation unit configured to calculate an importance ranking for each STR locus of a sample of the dataset from which the STR locus with the lowest importance has been removed, based on the training dataset and the validation dataset obtained by the second dataset separation unit; and
an STR marker candidate selection unit configured to select the STR locus with the highest importance rank from the dataset from which the STR locus with the lowest importance has been removed and to use it as an STR marker candidate.

The STR marker candidate selection device of claim 1,
wherein the marker discovery solution is denovoPoly.

The STR marker candidate selection device of claim 1,
wherein the preprocessing uses a low-variance removal method.

The STR marker candidate selection device of claim 1,
wherein a grid search method is used to select one of the plurality of preset machine learning classification models.

The STR marker candidate selection device of claim 1,
wherein the RFECV (Recursive Feature Elimination with Cross-Validation) method is used to calculate the importance ranking for each STR locus of the sample.

An STR marker candidate selection method using an STR (Short Tandem Repeat) marker candidate selection device, the method comprising:
acquiring, by an STR dataset acquisition unit, an STR dataset from a sample sequence dataset through a marker discovery solution;
converting, by a dataset conversion and preprocessing unit, the STR dataset into a dataset of STR counts for each STR locus (flanking sequence) of a sample, and preprocessing the converted dataset to reduce the number of STR loci;
separating, by a first dataset separation unit, the preprocessed dataset into a training dataset and a validation dataset according to a set ratio;
selecting, by a machine learning classification model selection unit, one of a plurality of preset machine learning classification models based on the training dataset and the validation dataset;
calculating, by a sample STR locus importance calculation and reprocessing unit, the importance of each STR locus of a sample of the preprocessed dataset through the selected machine learning classification model, and removing the STR locus with the lowest importance from the preprocessed dataset;
separating, by a second dataset separation unit, the dataset from which the STR locus with the lowest importance has been removed into a training dataset and a validation dataset according to a set ratio;
calculating, by a sample STR locus importance ranking calculation unit, an importance ranking for each STR locus of a sample of the dataset from which the STR locus with the lowest importance has been removed, based on the training dataset and the validation dataset obtained by the second dataset separation unit; and
selecting, by an STR marker candidate selection unit, the STR locus with the highest importance rank from the dataset from which the STR locus with the lowest importance has been removed, and using it as an STR marker candidate.