KR102171681B1

KR102171681B1 - Computer readable media recording program of constructing potential RNA aptamers binding to target protein using machine learning algorithms and process of constructing potential RNA aptamers

Info

Publication number: KR102171681B1
Application number: KR1020180094894A
Authority: KR
Inventors: 한경숙; 이욱
Original assignee: 인하대학교 산학협력단
Priority date: 2018-08-14
Filing date: 2018-08-14
Publication date: 2020-10-29
Also published as: KR20200019404A

Abstract

The present invention relates to a computer-readable recording medium recording a program that generates and verifies candidate RNA aptamer sequences that bind to a target protein molecule by means of a random forest model, one of the machine learning models, and to a method of generating and verifying candidate RNA aptamer sequences that bind to a target protein molecule using the same. With the program according to the present invention, the pool of candidate aptamer sequences that bind to a specific target protein is reduced, and the final aptamer can be selected quickly and efficiently.

Description

A computer-readable recording medium recording a program that constructs candidate RNA aptamers binding to a target protein using a machine learning algorithm, and a method for constructing candidate RNA aptamers {COMPUTER READABLE MEDIA RECORDING PROGRAM OF CONSTRUCTING POTENTIAL RNA APTAMERS BINDING TO TARGET PROTEIN USING MACHINE LEARNING ALGORITHMS AND PROCESS OF CONSTRUCTING POTENTIAL RNA APTAMERS}

This patent application concerns the results of two projects carried out as part of the "Individual Basic Research" program of the Ministry of Science and ICT of the Republic of Korea: "Development of a predictive model for protein-nucleic acid interactions" (host institution: Inha University; project number: 2015R1A1A3A04001243) and "Cancer gene discovery and individual gene network inference through mathematical modeling of bio big data" (host institution: Inha University; project number: 2017R1E1A1A03069921).

The present invention relates to a medium recording a program that generates molecules binding to a target substance. More specifically, it relates to a computer-readable recording medium recording a program for generating and constructing candidate RNA aptamers that bind to a target protein through a learning process using a machine learning algorithm, and to a method of generating and constructing candidate RNA aptamers that bind to a target protein using the same.

As the understanding of molecules involved in biological mechanisms deepens, interest in the interactions between biological molecules is growing. Among these molecules, aptamers have recently attracted attention. An aptamer is a single-stranded nucleic acid that adopts a stable three-dimensional structure of its own and can bind a target molecule with high affinity and specificity. Aptamers that bind a target molecule resemble monoclonal antibodies, but have the following advantages over them.

Antibodies have a large molecular structure of approximately 150 kDa, which makes them difficult to manufacture and modify, whereas aptamers, which typically consist of 100 or fewer nucleotides, are small molecules and are easy to modify. Aptamers are far more stable than antibodies, so they can be transported and/or stored at room temperature and retain their function even after sterilization. In addition, even when denatured, aptamers can be regenerated in a short time, so they are readily applicable as diagnostic materials that require prolonged or repeated use.

Moreover, because antibodies are produced using animals or cells, their production requires considerable time and cost, and their functionality may vary with the time or method of production (batch-to-batch variation). Aptamers, by contrast, are produced by chemical synthesis, so they can be made quickly and cheaply, show almost no batch-to-batch variation, and are easy to purify to high purity. Furthermore, antibodies and other pharmaceutical proteins frequently provoke immune rejection in vivo, whereas aptamers are known to provoke almost none, which is an advantage in developing therapeutic materials. Aptamers can also be prepared against toxins, complex protein complexes, or sugar-protein complexes for which antibodies are difficult to raise, and they are easily adapted into binders for new target substances (flexibility), so they can be actively used in the discovery of novel aptamers.

For selecting new aptamers, the basic technique in use is SELEX (Systematic Evolution of Ligands by EXponential enrichment), a discovery technology developed in 1990 by Larry Gold's research team at the University of Colorado. Discovering a new aptamer by SELEX proceeds as follows: 1) nucleic acid libraries of diverse forms (>10⁵) are prepared by DNA synthesis and, for RNA, in vitro transcription; and 2) the nucleic acid constructs in the library (aptamer candidate molecules) are screened so that only those able to bind the desired target molecule are selected. In the screening step, a method such as affinity chromatography is used to wash away constructs that do not bind the target molecule and to selectively retain those that do. Finally, the nucleic acid constructs are eluted from the target molecule, and the eluted nucleic acids are amplified by a method such as the polymerase chain reaction (PCR). The selection and separation steps are then repeated 5 to 15 times using only the amplified constructs, yielding aptamers with very high binding strength and specificity.

However, selecting aptamer candidate molecules usually involves preparing a random nucleic acid library of 10¹⁵ or more sequences, of which only a tiny fraction bind the target molecule, so the final selection of an aptamer that binds the target molecule demands a great deal of time, labor, and cost. Moreover, an aptamer identified by SELEX may fail to bind its target molecule in vivo because of the varying antigen densities, interactions, and microenvironment found in vivo.

With regard to the problems in selecting aptamers, U.S. Patent No. 9,315,804 proposes a method of identifying, from candidate aptamers, at least one aptamer for a target substance by taking a fitness function into account. Even in that patent, however, the pool of initial aptamer candidates remains very large, so considerable time and labor must still be spent to select an aptamer that binds strongly to the target substance.

The present invention has been proposed to solve the problems of the prior art described above. Its object is to provide a computer-readable recording medium recording a program for generating and constructing, in a fast and efficient manner, potential candidate aptamers that bind to a target substance, and a method of generating and constructing potential candidate aptamers that bind to a target substance.

According to one aspect, the present invention provides a computer-readable recording medium recording a program for generating candidate RNA aptamers that bind to a target protein molecule, the program comprising: sequence learning means that constructs feature vectors of RNA sequences and feature vectors of the amino acid sequences constituting proteins on the basis of RNA-protein complex data, and trains on the RNA sequences and protein sequences by applying a random forest model to the constructed feature vectors; and sequence discovery means that filters random RNA sequences and applies the random forest model trained by the sequence learning means to construct, from the filtered RNA sequences, candidate RNA aptamers that bind to the target protein molecule. In the sequence learning means, the RNA feature vector is built from the interaction propensity with the amino acids constituting the protein and from the mono-nucleotide (single-base) composition, the di-nucleotide (two-base) composition, and the pseudo tri-nucleotide (pseudo three-base) composition of the RNA sequence, and the protein feature vector is built from the composition-transition-distribution of the amino acids constituting the protein and from the pseudo amino acid composition.
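Two of the RNA features named above, the mono-nucleotide and di-nucleotide compositions, are simple k-mer frequencies. The following is a minimal sketch, assuming frequencies are normalized by the number of k-mers in the sequence; the function names `kmer_composition` and `mc_dc_features` are hypothetical and are not taken from the patent.

```python
from itertools import product

BASES = "ACGU"

def kmer_composition(seq, k):
    """Return the frequency of every length-k word over ACGU, in a fixed order."""
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip words that contain ambiguous bases such as N
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]

def mc_dc_features(seq):
    """Concatenate mono-nucleotide (4 values) and di-nucleotide (16 values) composition."""
    return kmer_composition(seq, 1) + kmer_composition(seq, 2)

if __name__ == "__main__":
    print(mc_dc_features("AAGU"))
```

The interaction propensity and pseudo tri-nucleotide terms would be appended to this vector in the same way; they are left out here because their exact definitions are not given in this passage.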

In an optional embodiment, the program may further include sequence evaluation means that uses evaluation measures to evaluate the model that generates the candidate RNA aptamers selected by the sequence discovery means.

For example, the evaluation measure is at least one selected from sensitivity, specificity, accuracy, positive predictive value, negative predictive value, and the Matthews correlation coefficient.
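All six measures can be computed from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions; returning 0.0 when a denominator is zero is an assumption made for illustration, and the function name `evaluation_measures` is hypothetical.

```python
import math

def evaluation_measures(tp, tn, fp, fn):
    """Compute the six measures from true/false positive/negative counts."""
    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }

if __name__ == "__main__":
    for name, value in evaluation_measures(tp=40, tn=45, fp=5, fn=10).items():
        print(f"{name}: {value:.4f}")
```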

In one example, the sequence discovery means filters the random RNA sequences by restricting their secondary structure, free energy, number of unpaired nucleotides, and the number of identical secondary structures within the pool.
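A filter of this kind might be sketched as follows. The sketch assumes each candidate already carries a predicted secondary structure in dot-bracket notation and a free energy from an external folding tool. The `Candidate` class, the `filter_pool` function, and every threshold are hypothetical, since this passage states no concrete values, and the check on the shape of the secondary structure itself is omitted.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Candidate:
    sequence: str
    structure: str      # dot-bracket notation, e.g. "(((...)))"
    free_energy: float  # kcal/mol, as reported by a folding tool

def filter_pool(pool, max_energy, min_unpaired, max_unpaired, max_per_structure):
    """Keep candidates that satisfy the energy, unpaired-count and duplicate limits."""
    kept = []
    seen = Counter()  # how many kept candidates share each secondary structure
    for cand in pool:
        unpaired = cand.structure.count(".")
        if cand.free_energy > max_energy:
            continue  # not stable enough
        if not (min_unpaired <= unpaired <= max_unpaired):
            continue  # too few or too many single-stranded nucleotides
        if seen[cand.structure] >= max_per_structure:
            continue  # this structure is already represented often enough
        seen[cand.structure] += 1
        kept.append(cand)
    return kept

if __name__ == "__main__":
    pool = [
        Candidate("GGGAAACCC", "(((...)))", -3.0),
        Candidate("GGCAAAGCC", "(((...)))", -2.5),
        Candidate("AAAAAAAAA", ".........", 0.0),
    ]
    for cand in filter_pool(pool, -2.0, 1, 5, 1):
        print(cand.sequence)
```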

Meanwhile, the pseudo tri-nucleotide composition in the RNA feature vector includes the hydrophobicity, hydrophilicity, and side-chain mass of the tri-nucleotides, and the pseudo amino acid composition in the protein feature vector may include the hydrophobicity, hydrophilicity, side-chain mass, ionization index, and isoelectric point of the amino acids.

According to another aspect, the present invention provides a method of generating candidate RNA aptamers that bind to a target protein molecule using a computer-implemented program, the method comprising: a step in which sequence learning means constructs feature vectors of RNA sequences and feature vectors of the amino acid sequences constituting proteins on the basis of RNA-protein complex data, and trains on the RNA sequences and protein sequences by applying a random forest model to the constructed feature vectors; and a step in which sequence discovery means filters random RNA sequences and applies the random forest model trained by the sequence learning means to construct, from the filtered RNA sequences, candidate RNA aptamers that bind to the target protein molecule. In the sequence learning means, the RNA feature vector is built from the interaction propensity with the amino acids constituting the protein and from the mono-nucleotide (single-base) composition, the di-nucleotide (two-base) composition, and the pseudo tri-nucleotide (pseudo three-base) composition of the RNA sequence, and the protein feature vector is built from the composition-transition-distribution of the amino acids constituting the protein and from the pseudo amino acid composition.

If necessary, the method may further include a step in which sequence evaluation means uses evaluation measures to evaluate the model that generates the candidate RNA aptamers constructed by the sequence discovery means.

According to the program and method of the present invention, aptamers that are highly likely to bind a specific target molecule can be generated and constructed using a model trained with the random forest algorithm, a machine learning algorithm that remedies the shortcomings of decision trees. The initial pool is formed only from potential aptamer candidates that are likely to bind the target molecule, and the final aptamer can be selected quickly and efficiently starting from the nucleic acid library in this reduced pool.

Because the final aptamer can be selected quickly and efficiently from only those potential candidates that are likely to bind a specific target molecule, the present invention is expected to be useful in developing new drug candidates by screening for substances that can bind a drug target protein in competition with an aptamer for that protein, in developing diagnostic reagents or microarrays that use aptamers as biomarkers, and in developing biosensors that measure the concentration of environmental hazards such as pollutants, or of hazardous substances in food, by detecting a signal that depends on the aptamer or on aptamer-target binding.

FIG. 1 schematically shows the configuration of a computer loaded with a program that constructs and generates candidate RNA aptamers binding to a target protein molecule using the random forest algorithm, one of the machine learning algorithms, and of a computer-readable recording medium on which the program is recorded, according to an exemplary embodiment of the present invention.
FIG. 2 schematically shows the overall framework for discovering candidate RNA aptamer sequences according to an exemplary embodiment of the present invention. mC denotes the mono-nucleotide (single-base) composition, dC the di-nucleotide (two-base) composition, PseTNC the pseudo tri-nucleotide composition (pseudo three-base composition), C-T-D the composition-transition-distribution of amino acid groups, and PseAAC the pseudo amino acid composition.
FIG. 3 is a flow chart schematically showing a method of constructing and generating candidate RNA aptamers that bind to a target protein molecule using the random forest algorithm, according to an exemplary embodiment of the present invention.
FIG. 4 shows an example of a positive window and a negative window, each 27 nucleotides long, in an RNA sequence, according to an exemplary embodiment of the present invention. As the RNA sequence is scanned with a sliding window of 27 nucleotides, a feature vector representing each window is generated. "+" marks a nucleotide that binds the protein and "-" marks a nucleotide that does not. If the middle nucleotide of a window is a protein-binding nucleotide, the feature vector encoding that window is treated as positive. If a window contains only non-binding nucleotides, its feature vector is treated as negative. Feature vectors that are neither positive nor negative are not used in training.
FIG. 5 shows examples of the secondary structures of known RNA aptamers. The secondary structures are closed by at least three base pairs, with or without dangling-end bases, and the nucleotides that bind the target protein molecule best, marked by red circles, are found in the single-stranded portions of the RNA secondary structure.
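The window labelling rule described for FIG. 4 can be sketched as follows. The sketch assumes that only full-length windows are scanned (no padding at the sequence ends, which the text does not specify) and that binding annotations arrive as a string of "+" and "-" characters. The function name `label_windows` is hypothetical.

```python
WINDOW = 27  # window length stated in the description of FIG. 4

def label_windows(sequence, binding, window=WINDOW):
    """Return (start, subsequence, label) for every full window.

    label is 1 (positive), 0 (negative) or None (discarded from training).
    """
    if len(sequence) != len(binding):
        raise ValueError("sequence and binding annotation must have equal length")
    half = window // 2  # index of the middle nucleotide inside a window
    labelled = []
    for start in range(len(sequence) - window + 1):
        marks = binding[start:start + window]
        if marks[half] == "+":
            label = 1      # middle nucleotide binds the protein
        elif "+" not in marks:
            label = 0      # no binding nucleotide anywhere in the window
        else:
            label = None   # binding nucleotide present, but not in the middle
        labelled.append((start, sequence[start:start + window], label))
    return labelled

if __name__ == "__main__":
    # A window of 5 is used here only to keep the example short.
    for row in label_windows("ACGUACGUAC", "--+-------", window=5):
        print(row)
```

Windows labelled `None` correspond to the feature vectors that are neither positive nor negative and are therefore excluded from training.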

Hereinafter, the present invention is described with reference to the accompanying drawings where necessary.

Interactions between nucleic acid molecules and protein molecules are essential in many cellular processes. High-throughput technologies have recently produced a large amount of data on interactions between protein molecules and nucleic acid molecules, which creates a need for computational methods that predict binding sites in a given sequence or determine the interactions between such sequences.

In particular, there is a need for a computational model that goes beyond a simple classifier deciding whether a single nucleic acid sequence binds a target protein molecule, and beyond models applicable only to one specific target protein molecule, and that can discover and generate novel aptamer sequences for a target protein molecule. The present invention uses a machine learning model called Random Forest (RF) to generate, select, evaluate, and/or verify candidate RNA aptamer sequences that bind to a target protein molecule, as described below.

FIG. 1 schematically shows the configuration of a computer loaded with a program that constructs and generates candidate RNA aptamers binding to a target protein molecule using the random forest algorithm/model, and of a computer-readable recording medium on which the program is recorded, according to an exemplary embodiment of the invention. FIG. 2 is a flow chart schematically showing a method of constructing and generating candidate RNA aptamers that bind to a target protein molecule by applying the random forest model, according to an exemplary embodiment of the present invention.

As shown in FIG. 1, the medium 200, on which the program 210 for generating candidate RNA aptamer sequences that bind to a target protein molecule is recorded, can be loaded into a suitable computer 100. The computer 100 is a data processing device with at least minimal data processing capability.

For example, the computer 100 may be a computer terminal such as a desktop or laptop and/or a mobile terminal such as a smartphone, tablet PC, or PDA. If necessary, the computer 100 may be connected to the Internet or to a data communication network such as 3G, LTE, and/or 5G. For example, the computer 100 may be connected to a server (not shown) through a communication network or connected to the server locally, but the computer according to the present invention need not be connected to a communication network.

The computer 100 may include various components: hardware (110 to 140 and 160) and software, such as the operating/application programs 150, for running that hardware.

In an exemplary embodiment, the computer 100 has an input unit 110, such as a keyboard, mouse, and/or touch panel, through which data or information can be entered, for example the initial input values used to train on suitable candidate aptamers with the candidate-aptamer generation program, which is a kind of application program. The computer 100 also includes an output unit 120, such as a printer, for printing entered or downloaded data or information, and a display 130, such as a monitor, for showing the data on a screen. In addition, the computer 100 has a memory 140, such as RAM, ROM, or a hard disk, that stores various data and information (for example, the amino acid sequence of the target protein molecule, secondary structure data of candidate RNA aptamers, generated candidate aptamer sequences, and the sequences and/or secondary structures of known RNA aptamers), and is loaded with operating/application programs 150 for the various tasks carried out on the computer 100.

Examples of the operating/application programs 150 running on the computer 100 include operating software such as WINDOWS, content editing software for creating and editing documents, images, video data, and other content, a browser for searching data or information on a network such as the Internet, and software for building training and learning models using artificial neural networks. Various other operating/application programs may also be installed on the computer 100.

The computer 100 also includes a control unit 160, such as a CPU, that appropriately controls the operation of this hardware and software. In particular, the memory 140 and the control unit 160 of the computer 100 together form a kind of program execution unit, in that they carry out the series of operations specified by the instructions of the program 210 recorded on the recording medium 200.

The recording medium 200, which can be loaded into the computer 100 so that the computer 100 can read it, records the program 210 that generates, selects, and verifies candidate aptamer sequences binding to a target protein molecule according to the present invention. As described later, the program 210 can be used to efficiently generate candidate RNA aptamer sequences that bind to a target protein molecule. For example, the computer-readable recording medium 200 according to the present invention may contain, alone or in combination, the series of program instructions for generating, selecting, evaluating, and verifying candidate aptamer sequences that bind to a target protein molecule, as well as data files, data structures, and the like.

The computer-readable recording medium 200, on which the instructions of the program 210 for generating candidate aptamer sequences that bind to a candidate protein sequence according to the present invention are recorded, includes every kind of recording device that stores data readable by a computer system. Examples of the recording medium 200 readable by the computer 100 include magnetic media such as magnetic tapes, hard disks, and floppy disks; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Implementations in the form of a carrier wave (for example, transmission over the Internet) are also included. Such a recording medium 200 may be a transmission medium, such as an optical line, a metal wire, or a waveguide, that carries a carrier wave transmitting signals specifying program instructions, data structures, and the like. The recording medium 200 readable by the computer 100 may also be distributed over computer systems connected through a communication network, so that code readable by the computer 100 is stored and executed in a distributed manner. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like. The functional programs, code, and code segments for implementing the present invention, described by way of example below, can be readily inferred by programmers of ordinary skill in the art to which the present invention pertains.

The program recorded on the recording medium (200) for generating candidate RNA aptamers that bind to a target protein molecule (hereinafter abbreviated as the RNA aptamer generation program, 210) includes a sequence learning means (220), which trains on RNA sequences and protein amino acid sequences through a suitable learning and training model, and a sequence discovery means (230), which constructs suitable candidate RNA aptamers that bind to the target protein molecule from randomly generated RNA sequences. If necessary, the program may further include a sequence evaluation means (240) that evaluates, among other things, the suitability of the candidate RNA aptamers constructed by the sequence discovery means (230).

The process of discovering candidate RNA aptamers according to the present invention is described with reference to FIG. 3. FIG. 3 schematically shows the overall framework for discovering candidate RNA aptamers that bind to a target molecule according to an exemplary embodiment of the present invention.

The sequence learning means (220) uses a known database containing protein-RNA complex data to construct RNA feature vectors and protein feature vectors, and then uses a random forest (RF) model to train on the RNA and protein sequences from the constructed feature vectors.

In one exemplary embodiment, a known database may be used that provides protein-RNA complex data sets identified through cross-linking and immunoprecipitation (CLIP), an in vivo analysis method, or through its variants and improvements. One example is CLIPdb, which provides protein-RNA complex data sets analyzed and identified by CLIP.

For example, CLIP and its variant in vivo analysis methods include: 1) CLIP-seq (CLIP sequencing; also called HITS-CLIP, high-throughput sequencing CLIP), which combines UV cross-linking with immunoprecipitation to identify RNA-protein binding sites; 2) PAR-CLIP (photoactivatable ribonucleoside-enhanced cross-linking and immunoprecipitation), which is used to identify cellular RNA-binding proteins and microRNA-containing ribonucleoprotein complexes; 3) iCLIP (individual-nucleotide resolution cross-linking and immunoprecipitation), which uses UV light to covalently bind proteins and RNA molecules; and 4) sCLIP (simple CLIP), which reduces the amount of RNA and omits radiolabeling of the immunoprecipitated RNA.

In another exemplary embodiment, another known database may be used that provides protein-RNA complex data sets identified through SELEX (Systematic Evolution of Ligands by EXponential enrichment), an in vitro analysis method, or through its variants and improvements.

For example, SELEX can provide data sets on protein-RNA complexes identified by repeating the steps of 1) preparing a nucleic acid library, 2) selecting nucleic acid constructs that bind to the target protein molecule using affinity chromatography or the like, and 3) isolating and amplifying the selected nucleic acid constructs.

Protein-RNA complex data sets identified through other methods that improve or modify SELEX may also be used for sequence learning. Such methods may include 1) the mirror-image aptamer (Spiegelmer) method, which uses L-oligonucleotides that are not degraded by nucleases; 2) Cell-SELEX, which discovers specific aptamers in a cell-to-aptamer manner by binding to proteins present on the cell surface, without a high-purity protein purification step; 3) capillary electrophoresis SELEX (CE-SELEX), which uses capillary electrophoresis; 4) counter-SELEX; and 5) Toggle-SELEX. In addition, to improve an initial aptamer obtained through SELEX into a stable and potent aptamer, a post-SELEX process may be performed in which the ribose 2'-OH of the RNA aptamer is replaced with a 2'-F, 2'-NH₂, or 2'-O-methyl group, or in which the aptamer is conjugated to a polymer such as polyethylene glycol (PEG), to diacylglycerol, or to cholesterol.

To train on RNA sequences and protein amino acid sequences, the sequence learning means (220) constructs suitable feature vectors for these sequences. In the present invention, the feature vector for an RNA sequence may be built from the interaction propensity (IP) of the RNA with amino acids, the mono-nucleotide composition (mC) of the RNA sequence, the di-nucleotide composition (dC), and the pseudo tri-nucleotide composition (PseTNC).

When predicting RNA-protein interactions, the interaction propensity (IP) of nucleotide triplets with amino acids, one of the basic features of the RNA feature vector, can be a very powerful feature. The other features can likewise be used as features of the RNA sequence.
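
By way of illustration only, the mono-nucleotide and di-nucleotide compositions named above can be computed as simple k-mer frequencies. The following is a minimal sketch, not the implementation of the invention; the function names and the lexicographic ordering of the k-mers are assumptions made for this example, and the interaction propensity and PseTNC features are not covered.

```python
from itertools import product

ALPHABET = "ACGU"  # RNA bases; the ordering is an assumption of this sketch


def kmer_composition(seq: str, k: int) -> list[float]:
    """Frequency of every k-mer over ACGU, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of overlapping windows of width k
    for i in range(total):
        word = seq[i:i + k]
        if word in counts:  # ignore windows with symbols outside ACGU
            counts[word] += 1
    return [counts[w] / total if total > 0 else 0.0 for w in kmers]


def rna_composition_features(seq: str) -> list[float]:
    """Concatenate mC (4 values) and dC (16 values) into one vector."""
    return kmer_composition(seq, 1) + kmer_composition(seq, 2)
```

For a 4-letter alphabet this yields 4 + 16 = 20 values per RNA sequence, which would then be joined with the remaining RNA features.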

Meanwhile, the feature vector for training on the amino acid sequence of a protein may be built from the composition-transition-distribution (C-T-D) of the amino acids constituting the protein and the pseudo amino acid composition (PseAAC). To extract protein features, the 20 amino acids constituting a protein may be clustered according to suitable criteria. For example, the amino acids may be clustered on the basis of dipole and volume, but the present invention is not limited thereto.
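
The composition and transition parts of such a C-T-D descriptor can be sketched as follows. The seven-class grouping used here is one grouping by dipole and side-chain volume that is commonly used in the literature; it is an assumption of this example, since the text above does not list the clusters, and the distribution part is omitted for brevity. All names are hypothetical.

```python
from itertools import combinations

# Assumed seven-class grouping by dipole and volume (not given in the text).
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: i for i, group in enumerate(GROUPS) for aa in group}


def ctd_composition_transition(protein: str):
    """Return (composition, transition) over the amino acid classes."""
    classes = [CLASS_OF[aa] for aa in protein.upper() if aa in CLASS_OF]
    n, k = len(classes), len(GROUPS)
    # Composition: fraction of residues that fall in each class.
    composition = [classes.count(c) / n if n else 0.0 for c in range(k)]
    # Transition: fraction of adjacent residue pairs that switch between
    # two given classes, counted in either direction.
    pairs = list(combinations(range(k), 2))
    counts = dict.fromkeys(pairs, 0)
    for a, b in zip(classes, classes[1:]):
        if a != b:
            counts[(min(a, b), max(a, b))] += 1
    denom = n - 1
    transition = [counts[p] / denom if denom > 0 else 0.0 for p in pairs]
    return composition, transition
```

With seven classes this gives 7 composition values and 21 transition values per protein.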

In addition, for the pseudo tri-nucleotide composition (PseTNC), one of the features for training on RNA sequences, and the pseudo amino acid composition (PseAAC), one of the features for training on protein amino acid sequences, physicochemical properties of the constituent residues (the nucleotides of the RNA sequence and the amino acids of the protein, respectively) may be extracted. For example, for the PseTNC, the hydrophobicity and hydrophilicity of the nucleotides constituting the RNA sequence in known protein-RNA complexes, and the side-chain mass of the bases constituting the nucleotides, may be considered as nucleotide properties. For the PseAAC, the hydrophobicity, hydrophilicity, side-chain mass, ionization index (for example, the ionization indices of the carboxyl terminus and the amino terminus), and isoelectric point of the amino acids constituting the protein in known protein-RNA complexes may be considered. Other features of the nucleotides constituting the RNA and/or the amino acids constituting the protein may also be used as needed.

When computing the PseTNC and the PseAAC, a weight factor and the counted rank (tier, usually denoted λ) of the correlations formed along the amino acid sequence of the protein may be set. For example, the weight factor may be set to any value between 0.05 and 0.70. The correlation tier 1) must be smaller than the length of the input protein sequence, 2) must not be a negative integer, and 3) when set to 0, must cause the PseAAC output to reduce to the conventional amino acid composition. According to one exemplary embodiment, the correlation tier for the amino acid sequence constituting the protein may be set to an integer between 1 and 5.
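
The role of the weight factor and the tier λ can be illustrated with a generic pseudo-composition of the kind used for PseAAC. This is a minimal sketch under stated assumptions: the property tables are supplied by the caller and assumed to be already normalized, the correlation term is the mean squared property difference between residues k positions apart, and the function and parameter names are hypothetical. It works on single residues, so a tri-nucleotide variant would need a different alphabet.

```python
def pseudo_composition(seq, alphabet, properties, lam, weight):
    """Pseudo-composition vector of length len(alphabet) + lam.

    properties: list of dicts mapping each symbol to a normalized value.
    lam:        correlation tier (0 <= lam < len(seq)).
    weight:     weight factor applied to the correlation terms.
    """
    n = len(seq)
    if lam < 0 or lam >= n:
        raise ValueError("tier must satisfy 0 <= lam < len(seq)")
    freqs = [seq.count(s) / n for s in alphabet]

    def theta(k):
        # Average correlation between residues that are k positions apart.
        total = 0.0
        for i in range(n - k):
            a, b = seq[i], seq[i + k]
            total += sum((p[a] - p[b]) ** 2 for p in properties) / len(properties)
        return total / (n - k)

    thetas = [theta(k) for k in range(1, lam + 1)]
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]
```

With `lam = 0` no correlation terms are added and the function returns the plain composition, which corresponds to constraint 3) above.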

Meanwhile, experimental techniques such as SELEX often end with 30 or fewer finally identified RNA aptamers. In the PAR-CLIP technique, protein-binding RNA sequences typically consisting of 21 to 35 nucleotides can be identified. It may therefore be desirable to design the RNA sequences used for training and learning so that their length corresponds to that of the nucleotide sequences identified through these analysis methods. Accordingly, the number of nucleotides in the RNA sequences to be trained on may be set to 20 to 40, preferably 21 to 35, but the present invention is not limited thereto.

Once the feature vectors of the RNA sequences and the protein amino acid sequences have been constructed from known protein-RNA complex data sets, the sequence learning means (220) trains on those RNA and amino acid sequences using a machine learning algorithm (or model). For example, the sequence learning means (220) may use a random forest (RF) model to train on the feature vectors of the RNA sequences and the protein amino acid sequences. The random forest model is an algorithm designed to remedy the shortcomings of decision trees. By adding randomness to the variables, it maximizes the advantages of ensemble methods, and it is known to improve prediction and classification accuracy and to yield stable results.

The random forest model uses bagging (bootstrap aggregation): through bootstrapping, a subset is sampled at random N times from the given data set, producing N sampled data sets (that is, N predictive models). Variables are then selected at random for each sampled data set. Of the M total variables, sqrt(M) or M/3 variables are chosen at random and all remaining variables are discarded, and this process is repeated. The decision trees built with this variable selection are combined into an ensemble model, and misclassification is assessed through the out-of-bag (OOB) error.
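
The sampling scheme described above can be sketched without any machine learning library. The example below only draws the bootstrap samples, the out-of-bag indices, and a random subset of sqrt(M) variables for each tree; it does not grow the trees. It follows the per-data-set variable selection as described in the text (many random forest implementations instead redraw the variables at every split), and all names are hypothetical.

```python
import math
import random


def bootstrap_plan(n_samples, n_features, n_trees, seed=0):
    """Draw in-bag rows, out-of-bag rows and a feature subset per tree."""
    rng = random.Random(seed)
    k = max(1, round(math.sqrt(n_features)))  # sqrt(M) variables per tree
    plan = []
    for _ in range(n_trees):
        # Bootstrap: sample n_samples rows with replacement.
        in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
        # Rows never drawn are "out of bag" and can score this tree.
        oob = sorted(set(range(n_samples)) - set(in_bag))
        features = sorted(rng.sample(range(n_features), k))
        plan.append({"in_bag": in_bag, "oob": oob, "features": features})
    return plan
```

The OOB error would then be estimated by predicting each row only with the trees for which that row is out of bag.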

Applying the random forest model makes it possible to reduce the variance (when this value is high, predictions fit a particular data set well but fit other data sets poorly) while lowering the bias (when this value is high, predictions are inaccurate compared with the actual results), both of which are components of the learning error.

In one exemplary embodiment, when the central nucleotide of a window is a nucleotide that binds to the protein, the feature vector encoding that window may be defined as positive. Conversely, when the window contains no nucleotide that binds to the protein, the feature vector encoding that window may be defined as negative. Feature vectors that are neither positive nor negative may be excluded and not used in the training and learning process.
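
The labeling rule above can be sketched as a sliding window over an RNA sequence with per-nucleotide binding flags. The window width and the function name are assumptions for this example; the text does not specify a width.

```python
def label_windows(rna, binding, width):
    """Yield (window, label) pairs: 1 = positive, 0 = negative.

    binding: list of booleans, True where the nucleotide binds the protein.
    Windows with a binding nucleotide away from the center are skipped.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so the window has a center")
    half = width // 2
    labeled = []
    for start in range(len(rna) - width + 1):
        flags = binding[start:start + width]
        if flags[half]:
            label = 1          # central nucleotide binds: positive
        elif not any(flags):
            label = 0          # no binding nucleotide at all: negative
        else:
            continue           # neither positive nor negative: excluded
        labeled.append((rna[start:start + width], label))
    return labeled
```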

Meanwhile, the sequence discovery means (230) filters random sequences and applies the random forest model trained by the sequence learning means (220) described above, thereby constructing and generating, from among the filtered RNA sequences, candidate RNA aptamers that bind to the target protein molecule.
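
As a minimal sketch of the starting point of this step, a pool of random RNA sequences can be generated as follows. The length bounds reuse the 20 to 40 nucleotide range mentioned for training sequences; applying that range to the random pool, the uniform base choice, and the function name are assumptions of this example.

```python
import random


def random_rna_pool(size, min_len=20, max_len=40, seed=None):
    """Generate `size` random RNA sequences with uniform base choice."""
    rng = random.Random(seed)
    pool = []
    for _ in range(size):
        length = rng.randint(min_len, max_len)  # inclusive bounds
        pool.append("".join(rng.choice("ACGU") for _ in range(length)))
    return pool
```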

When the sequence discovery means (230) filters random RNA sequences, constraints may be set on the secondary structure of the RNA sequence, namely the secondary structure itself, the free energy, the number of unpaired bases, and the number of identical secondary structures in the pool.

As for the secondary structure, the terminal regions of the RNA sequence (for example, the 1st to 5th bases from the start and the 1st to 5th bases from the end) may or may not have dangling-end bases (an end at which one of two nucleic acid strands paired to form a duplex is longer or shorter than the end of the other strand), but the terminal region needs to be closed by at least 3 base pairs (for example, 3 to 7, preferably 3 to 5 base pairs).

The free energy of the RNA sequence should be less than -4.0 to -7.0 kcal/mol, preferably less than -5.0 to -6.0 kcal/mol. This takes into account the Z-score values of the free energies of known RNA aptamers.

Meanwhile, as a filtering constraint, the number of unpaired nucleotides in the RNA sequence needs to be set to at least a certain value. In general, in the secondary structure of RNA, nucleotides in single-stranded regions (that is, nucleotides that are not base-paired) bind to proteins more frequently than nucleotides in double-stranded regions (that is, base-paired nucleotides). There are several reasons for this. Because nucleotides in single-stranded regions can easily change their conformation, unpaired nucleotides are more flexible than base-paired nucleotides. In addition, unpaired nucleotides have donor or acceptor atoms that can form hydrogen bonds with the amino acids of a protein.

Accordingly, it may be desirable to set the number of unpaired nucleotides, which have a relatively flexible structure and can readily form hydrogen bonds with the amino acids of a protein, to approximately 30% or more of the candidate RNA aptamer to be constructed. For example, when the candidate RNA aptamer is designed with 20 to 40 nucleotides, the number of unpaired nucleotides may be set to approximately 5 to 20.

In addition, when the same secondary structure is repeated within the pool, it may be difficult to construct suitable candidate RNA aptamers. The number of identical secondary structures in the pool may therefore be set to fewer than 300, for example fewer than 200, preferably fewer than 150, for example fewer than 100.
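
The four filtering constraints described above can be sketched as follows. The sketch assumes that each candidate already comes with a dot-bracket secondary structure and a free energy in kcal/mol produced by an external folding tool; the thresholds are example values taken from the ranges given above, the terminal check is a simplified reading of the "closed by at least 3 base pairs" condition, and all names are hypothetical.

```python
from collections import Counter


def pair_table(structure):
    """Map each opening-bracket index to its closing-bracket index."""
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs[stack.pop()] = i
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def passes_filters(structure, energy, max_energy=-5.0, min_unpaired=0.30,
                   end_window=5, min_end_pairs=3):
    n = len(structure)
    if n == 0:
        return False
    # Base pairs that join the first and last `end_window` positions.
    end_pairs = sum(1 for i, j in pair_table(structure).items()
                    if i < end_window and j >= n - end_window)
    unpaired_fraction = structure.count(".") / n
    return (energy < max_energy
            and unpaired_fraction >= min_unpaired
            and end_pairs >= min_end_pairs)


def filter_pool(candidates, max_identical=100, **limits):
    """candidates: iterable of (sequence, structure, energy) tuples."""
    seen, kept = Counter(), []
    for seq, structure, energy in candidates:
        if not passes_filters(structure, energy, **limits):
            continue
        if seen[structure] + 1 >= max_identical:  # keep fewer than the cap
            continue
        seen[structure] += 1
        kept.append(seq)
    return kept
```

Only the sequences that survive this filter would be passed to the trained model for scoring.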

The sequence discovery means (230) may predict, construct, and generate candidate RNAs by applying the random forest model trained by the sequence learning means (220) to the filtered RNA sequences.

In this way, the key features of the interacting RNA sequences and target protein molecules are identified from known RNA-protein complexes at both the sequence level and the structure level. In the present invention, a random forest (RF) model is built using these key features of the interacting RNA and target protein molecules. As confirmed later, the results of cross-validation and of independent tests on the RF model built in the present invention indicate that the set of potential RNA aptamers found by the RF model includes strong candidate aptamer sequences, so that the time and cost required by processes such as conventional SELEX can be greatly reduced.

Optionally, the structure of the candidate RNA aptamers predicted, constructed, and selected through the sequence learning means (220) and the sequence discovery means (230) may be analyzed using known structure analysis software, following the procedure shown schematically in FIG. 2. For example, structure analysis software may be used to analyze the secondary structure of a candidate RNA aptamer selected by the sequence discovery means (230) and the binding structure, centered on the binding site, of the candidate RNA aptamer docked to the target protein molecule.

Here, structure analysis software may be used to compare the secondary structure of a candidate RNA aptamer selected by the sequence discovery means (230) with the secondary structure of a known RNA aptamer. Additionally or alternatively, the candidate RNA aptamer may be docked to the target protein molecule to form a three-dimensional protein-RNA complex, and the binding structure of the candidate RNA aptamer to the target protein molecule, centered on its binding site, may be compared with the binding structure of a known RNA aptamer to the target protein molecule.

For example, the secondary structure of a candidate RNA aptamer and/or a known RNA aptamer may be determined using software such as RNAfold (Lorenz et al., 2011, ViennaRNA Package 2.0, Algorithms Mol. Biol. 6(1), 26) and visualized using software such as PseudoViewer (Byun and Han, 2009, PseudoViewer3, Bioinformatics 25(11), 1435-1437). The binding structure of a candidate RNA aptamer and/or a known RNA aptamer docked to the target protein molecule may be obtained using software such as HDOCK (Yan et al., 2017, HDOCK: a web server for protein-protein and protein-DNA/RNA docking based on a hybrid strategy, Nucleic Acids Res. 45(W1), W365-W373), and the tertiary structure of an RNA aptamer may be analyzed using software such as RNAComposer (Popenda et al., 2012, Automated 3D structure composition for large RNAs, Nucleic Acids Res. 40(14), e112).

Meanwhile, the candidate RNA aptamer generation program (210) may, if necessary, further include a sequence evaluation means (240) that verifies the random forest model used by the sequence discovery means (230) to generate candidate RNA sequences. The sequence evaluation means (240) evaluates and verifies, using suitable evaluation measures, the positive (+) instances and negative (-) instances used in the random forest model of the sequence learning means (220) and the sequence discovery means (230) described above. If necessary, independent tests may also be performed by the sequence evaluation means (240).

In an exemplary embodiment, the sequence verification means (250) may perform standard cross-validation, for example standard 10-fold cross-validation. Optionally, leave-one-out (LOO) cross-validation may be performed in addition to the standard 10-fold cross-validation. LOO cross-validation is performed because conventional k-fold cross-validation tends to overestimate predictive performance for paired inputs such as PPI (protein-protein interaction) or RNA interaction data (Abbasi, W.A., Minhas, F.U.A.A.: Issues in performance evaluation for host-pathogen protein interaction prediction, Journal of Bioinformatics and Computational Biology 14(3):1650011 (2016)).
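
The two splitting schemes named above can be sketched as index generators. This is an illustration only: the shuffling, the seed, and the function names are assumptions of this example, and leave-one-out is expressed simply as k-fold with k equal to the number of instances.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def leave_one_out_indices(n):
    """Leave-one-out: every instance is the test set exactly once."""
    return k_fold_indices(n, n)
```

For 10-fold cross-validation, `k_fold_indices(n, 10)` yields ten splits in which every instance appears in exactly one test fold.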

In one exemplary embodiment, the sequence evaluation means (240) may use sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV), and the Matthews correlation coefficient (MCC) as evaluation measures for the random forest model used according to the present invention. These evaluation measures can be expressed by equations (1) to (6) below.
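
Equations (1) to (6) are not reproduced in this text. The standard definitions of the six measures, written in terms of the counts TP, TN, FP, and FN, are given here for reference. Assigning the numbers (1) to (6) in the order in which the measures are listed above is an assumption.

```latex
\begin{align}
\text{Sensitivity} &= \frac{TP}{TP + FN} \tag{1}\\
\text{Specificity} &= \frac{TN}{TN + FP} \tag{2}\\
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \tag{3}\\
\text{PPV} &= \frac{TP}{TP + FP} \tag{4}\\
\text{NPV} &= \frac{TN}{TN + FN} \tag{5}\\
\text{MCC} &= \frac{TP \cdot TN - FP \cdot FN}
  {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{6}
\end{align}
```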

In equations (1) to (6) above, sensitivity is the proportion of the actual positive instances of the sequence discovery means (230) that the random forest model correctly predicts as positive; specificity is the proportion of the actual negative instances that the random forest model correctly predicts as negative; accuracy is the proportion of all actual instances, positive and negative, that the random forest model predicts correctly; positive predictive value is the proportion of the instances predicted as positive by the random forest model that are actually positive; and negative predictive value is the proportion of the instances predicted as negative by the random forest model that are actually negative.

Also in equations (1) to (6), TP (true positive) denotes a positive instance correctly predicted as positive, TN (true negative) a negative instance correctly predicted as negative, FP (false positive) a negative instance incorrectly predicted as positive, and FN (false negative) a positive instance incorrectly predicted as negative.
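
Given the four counts defined above, the six evaluation measures can be computed as in the following minimal sketch. It uses the standard textbook definitions of these measures; the function name and the choice to return NaN when a denominator is zero are assumptions of this example.

```python
import math


def evaluation_measures(tp, tn, fp, fn):
    """Compute the six measures from the confusion-matrix counts."""
    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as NaN.
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }
```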

If necessary, the sequence evaluation means (240) may also perform an independent test of the candidate RNA construction model (RF model) identified in the present invention, using data sets identified in prior studies.

Meanwhile, the present invention also relates to a method of generating candidate RNA aptamers that bind to a target protein molecule using the RNA aptamer generation program (210, see FIG. 1) described above. FIG. 3 is a flow chart schematically showing a method of generating candidate RNA aptamers that bind to a target protein molecule according to an exemplary embodiment of the present invention.

As shown in FIG. 3, the method of generating candidate RNA aptamers that bind to a target protein molecule using a program executed on a computer includes a step of training on RNA sequences and the amino acid sequences constituting proteins with a random forest model, based on protein-RNA complex data sets (step S210), and a step of filtering random RNA sequences and using the trained random forest model to predict, construct, and generate, from the filtered random RNA sequences, candidate RNA aptamers that bind to the target protein molecule (step S220). If necessary, the method may further include a step of evaluating the predicted, selected, and constructed candidate RNA aptamers using evaluation measures (step S230).

The step of training on RNA sequences and protein amino acid sequences (step S210) may be performed by the sequence learning means (220, see FIG. 1). Based on protein-RNA complex data sets identified through experimental analysis methods such as CLIP and/or SELEX, feature vectors of the RNA sequences and feature vectors of the amino acid sequences constituting the proteins are constructed, and a random forest model is applied to the constructed feature vectors to train on the RNA and protein sequences. As one example, the feature vector for training on an RNA sequence may include the interaction propensity with the amino acids constituting the protein, the mono-nucleotide composition of the RNA sequence, the di-nucleotide composition of the RNA sequence, and the pseudo tri-nucleotide composition. For example, the pseudo tri-nucleotide composition may include the hydrophobicity and hydrophilicity of the nucleotides constituting the RNA sequence and the side-chain mass of the bases. Here, the number of nucleotides in the RNA sequences to be trained on may be set to 20 to 40, but the present invention is not limited thereto.

In addition, the feature vector for a protein sequence may include the composition-transition-distribution characteristics of the amino acids constituting the protein and the pseudo amino acid composition. The pseudo amino acid composition may reflect the hydrophobicity, hydrophilicity, side-chain mass, ionization constants, and isoelectric point of the amino acids constituting the protein.

The step of predicting, constructing, and generating candidate RNA aptamers (S220) may be performed by the sequence discovery means (230, see FIG. 1). In this step, RNA sequences that meet specific conditions are filtered from random RNA sequences, and the random forest model trained in step S210 is applied to the filtered RNA sequences to predict, construct, generate, and select, from among them, candidate RNA aptamers that bind to the target protein molecule.

In an exemplary embodiment, the random RNA sequences may be filtered by constraining the secondary structure of each random RNA sequence, its free energy, its number of unpaired nucleotides, and the number of identical secondary structures in the pool.

If necessary, known structure analysis software may be used to analyze the secondary structure of the selected and constructed candidate RNA aptamers and/or their binding structure when docked with the target protein molecule. The secondary structure and the binding structure to the target protein molecule of the selected candidate RNA aptamers may then be compared with those of known RNA aptamers.

Meanwhile, the step of evaluating the candidate RNA aptamers using appropriate evaluation measures (S230) may be performed by the sequence evaluation means (240, see FIG. 1). In step S230, the random forest model used to predict, generate, and select the candidate RNA aptamers is evaluated and verified using appropriate evaluation measures. In an exemplary embodiment, the evaluation measure may be at least one selected from sensitivity, specificity, accuracy, positive predictive value, negative predictive value, and the Matthews correlation coefficient. The evaluation method is not particularly limited; standard cross validation, for example standard 10-fold cross validation or leave-one-out (LOO) cross validation, may be performed. If necessary, an independent test using the results of previous studies may also be performed in this step.

Hereinafter, the present invention is described through an exemplary embodiment, but the present invention is not limited to the technical idea described in the following example.

Example

[Data set]

Protein-RNA complexes determined by X-ray crystallography at a resolution of 5.0 Å or better were extracted from the Protein Data Bank (PDB). Complexes with RNA sequences shorter than 10 nucleotides or longer than 120 nucleotides were removed, leaving a total of 696 protein-RNA complexes. These 696 protein-RNA complexes were used to learn the interaction propensity (IP) of nucleotides with amino acids.

In addition, structural data for known RNA aptamer-protein complexes that are not included in the 696 protein-RNA complexes described above were obtained from the PDB. The PDB contained 35 such RNA aptamer-protein complexes.

To compare the model used in this example with other models, the benchmark data set constructed by Li et al. (Li et al., 2014, Prediction of Aptamer-Target Interacting Pairs with Pseudo-Amino Acid Composition, PLoS One 9(1), e86729) from Aptamer Base (Cruz et al., 2012, Aptamer Base: a collaborative knowledge base to describe aptamers and SELEX experiments. Database (Oxford), bas006) was obtained. As positive examples for testing the model of this example, 56 RNA aptamer-protein pairs from the positive data were used.

[Features for training the prediction model]

The SELEX process selects aptamers from an oligonucleotide library with a variable region of 30 or 60 nucleotides, and the selected aptamers are usually shorter than 30-mer. Therefore, to generate potential RNA aptamers shorter than 30-mer that bind to a target protein molecule, this example was designed to generate 27-mer RNA aptamers. For each pair of RNA and protein sequences, the following features were encoded.

RNA features: IP of nucleotide triplets with amino acids, mono-nucleotide composition (mC, single-base composition), di-nucleotide composition (dC, 2-base composition), and pseudo tri-nucleotide composition (PseTNC).

Protein features: composition-transition-distribution (C-T-D) of amino acid groups and pseudo amino acid composition (PseAAC).

To define the IP of nucleotide triplets with amino acids, the data set of 696 protein-RNA complexes was used to compute 1,280 (64 × 20) IPs between the 4³ = 64 nucleotide triplets and the 20 amino acids. In addition, the mono-nucleotide composition (mC) and the di-nucleotide composition (dC) were computed for each RNA sequence.
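As a minimal illustrative sketch (not part of the original disclosure), the mC and dC features can be computed as relative frequencies. The normalization by window length, the A, C, G, U ordering, and the example 27-mer are assumptions made here, since the text does not define them.

```python
from itertools import product

BASES = "ACGU"

def mono_composition(seq):
    """Return the 4 mono-nucleotide frequencies in A, C, G, U order."""
    n = len(seq)
    return [seq.count(b) / n for b in BASES]

def di_composition(seq):
    """Return the 16 di-nucleotide frequencies over overlapping pairs."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    keys = ["".join(p) for p in product(BASES, repeat=2)]
    return [pairs.count(k) / len(pairs) for k in keys]

# 64 triplets x 20 amino acids gives the 1,280 propensity values in the text.
TRIPLETS = ["".join(p) for p in product(BASES, repeat=3)]
N_PROPENSITIES = len(TRIPLETS) * 20

window = "GGGAUCCAUGCUAGCUAGGAUCGACCC"  # hypothetical 27-mer
mc = mono_composition(window)
dc = di_composition(window)
```

The two lists supply the 4 mC and 16 dC elements of the feature vector.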

Meanwhile, to represent a protein sequence, the 20 amino acids were clustered into 7 groups based on their dipoles and side-chain volumes: {M (Methionine), S (Serine), T (Threonine), Y (Tyrosine)}, {F (Phenylalanine), I (Isoleucine), L (Leucine), P (Proline)}, {H (Histidine), N (Asparagine), Q (Glutamine), W (Tryptophan)}, {A (Alanine), G (Glycine), V (Valine)}, {K (Lysine), R (Arginine)}, {D (Aspartic acid), E (Glutamic acid)}, {C (Cysteine)}. For the amino acid groups in a protein sequence, the composition-transition-distribution (C-T-D) features were computed, which require 7 + 21 + 35 = 63 elements in the feature vector (Dubchak et al., 1995, Prediction of protein folding class using global description of amino acid sequence, Proc. Natl. Acad. Sci. U.S.A. 92(19), 8700-8704).
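The 7 + 21 + 35 = 63 breakdown can be illustrated with the following sketch. It assumes the common reading of the C-T-D descriptor (group fractions, unordered group-to-group transition fractions, and the relative positions of the first, 25%, 50%, 75%, and last occurrence of each group); the exact definitions used in the example are not reproduced in this text, and the protein string is hypothetical.

```python
import math
from itertools import combinations

# The seven amino acid groups listed in the text, in the same order.
GROUPS = ["MSTY", "FILP", "HNQW", "AGV", "KR", "DE", "C"]
GROUP_OF = {aa: g for g, members in enumerate(GROUPS) for aa in members}

def ctd_features(protein):
    """Return 7 composition + 21 transition + 35 distribution values."""
    labels = [GROUP_OF[aa] for aa in protein]
    n = len(labels)

    # Composition: fraction of residues in each group (7 values).
    comp = [labels.count(g) / n for g in range(7)]

    # Transition: fraction of adjacent residue pairs that switch between
    # two given groups, in either direction (7 choose 2 = 21 values).
    steps = list(zip(labels, labels[1:]))
    trans = []
    for a, b in combinations(range(7), 2):
        hits = sum(1 for x, y in steps if {x, y} == {a, b})
        trans.append(hits / len(steps) if steps else 0.0)

    # Distribution: relative positions of the first, 25%, 50%, 75% and
    # last occurrence of each group (7 x 5 = 35 values).
    dist = []
    for g in range(7):
        pos = [i + 1 for i, lab in enumerate(labels) if lab == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not pos:
                dist.append(0.0)  # group absent from this protein
                continue
            k = max(1, math.ceil(frac * len(pos)))
            dist.append(pos[k - 1] / n)
    return comp + trans + dist

features = ctd_features("MKRLSTAGVDECFWHNQYIP")  # hypothetical sequence
```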

As additional features, the pseudo tri-nucleotide composition (PseTNC) of each RNA sequence (Chen et al., 2014, iTIS-PseTNC: a sequence-based predictor for identifying translation initiation site in human genes using pseudo trinucleotide composition. Anal. Biochem. 462, 76-83) and the pseudo amino acid composition (PseAAC) of each protein sequence (Shen and Chou, 2008, PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition, Anal. Biochem. 373(2), 386-388) were computed. These features take into account the sequence-order effect of the residues within a sequence.

Specifically, PseTNC was computed using three physicochemical properties of nucleotides: hydrophobicity, hydrophilicity, and side-chain mass. Similarly, PseAAC was computed using several physicochemical properties of amino acids: hydrophobicity, hydrophilicity, side-chain mass, ionization constant 1 (pK1, α-CO₂H), ionization constant 2 (pK2, NH₃), and the isoelectric point (pI) at 25°C. When computing PseTNC and PseAAC, the weight factor w was set to 0.1 and the correlation tier λ was set to 1. PseTNC and PseAAC represent the tri-nucleotide composition and the amino acid composition, respectively, each extended with tier correlation factors. Because the number of tier correlation factors equals the parameter λ, the PseTNC and PseAAC feature vectors have 64 + 1 = 65 and 20 + 1 = 21 elements, respectively. Each pair of an RNA window and a protein is encoded in a feature vector with 669 elements: 500 IP, 4 mC, 16 dC, 65 PseTNC, 63 C-T-D, and 21 PseAAC.
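The element counts above follow from the general form of a pseudo composition: the ordinary composition plus λ weighted correlation factors, jointly normalized. The sketch below assumes the widely published Chou-style formula, which this text does not reproduce. The property table `TOY_PROPS` holds placeholder numbers only, because the real physicochemical values are not given here.

```python
from itertools import product

def pseudo_composition(tokens, alphabet, props, lam=1, w=0.1):
    """Chou-style pseudo composition: len(alphabet) + lam values.

    tokens   : sequence units (amino acids, or overlapping trinucleotides)
    alphabet : all possible units, fixing the order of the first block
    props    : dict unit -> tuple of standardized property values
    """
    n = len(tokens)
    freq = [tokens.count(u) / n for u in alphabet]

    # Tier-j correlation factor: mean squared property difference between
    # units that are j positions apart.
    thetas = []
    for j in range(1, lam + 1):
        diffs = [
            sum((a - b) ** 2
                for a, b in zip(props[tokens[i]], props[tokens[i + j]]))
            / len(props[tokens[i]])
            for i in range(n - j)
        ]
        thetas.append(sum(diffs) / len(diffs))

    denom = sum(freq) + w * sum(thetas)
    return [f / denom for f in freq] + [w * t / denom for t in thetas]

TRIPLETS = ["".join(p) for p in product("ACGU", repeat=3)]
# Placeholder property values (3 per triplet); NOT real physicochemical data.
TOY_PROPS = {t: (i / 63.0, (i % 8) / 7.0, (i % 3) / 2.0)
             for i, t in enumerate(TRIPLETS)}

window = "GGGAUCCAUGCUAGCUAGGAUCGACCC"  # hypothetical 27-mer
tokens = [window[i:i + 3] for i in range(len(window) - 2)]
pse_tnc = pseudo_composition(tokens, TRIPLETS, TOY_PROPS, lam=1, w=0.1)

# Element counts quoted in the text for the full feature vector.
TOTAL = 500 + 4 + 16 + len(pse_tnc) + 63 + 21
```

The same function would yield 21 values for PseAAC when called with the 20 amino acids as the alphabet and λ = 1.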

[Positive and negative instances for training]

A random forest (RF) model was constructed and trained on the feature vectors described above, which were generated from the protein-RNA complexes. When constructing the RF model, the number of trees was set to 35 and the number of features was set to the square root of the number of feature elements.

Although the main purpose of the present invention is to generate potential RNA aptamers shorter than 30-mer, the RNA sequences in the training and test data sets have various lengths. As shown in FIG. 3, positive and negative instances were defined using a window of 27 nucleotides. As an RNA sequence is scanned with the 27-nucleotide window, a feature vector representing each window is generated. If the central nucleotide of a window is a protein-binding nucleotide, the feature vector encoding that window is considered positive. If a window contains only non-binding nucleotides, the feature vector for that window is considered negative. Feature vectors that are neither positive nor negative were removed from the training data set, because they cause highly unbalanced positive/negative instances for training. As an exception, the first and last windows were considered positive if they contained a protein-binding nucleotide at any position in the window. In the training data set of this example, the ratio of positive instances to negative instances was about 3:1.
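The labeling rule above can be sketched as follows. The function name, the boolean input format, and the 60-nucleotide example are illustrative assumptions and are not taken from the disclosure.

```python
WINDOW = 27

def label_windows(binding):
    """Label each 27-nt window of an RNA as 1, 0, or None (discarded).

    binding: list of booleans, True where the nucleotide contacts protein.
    Returns a list of (start_index, label) for every window.
    """
    n_windows = len(binding) - WINDOW + 1
    out = []
    for start in range(n_windows):
        win = binding[start:start + WINDOW]
        is_edge = start == 0 or start == n_windows - 1
        if win[WINDOW // 2]:
            label = 1       # central nucleotide binds
        elif not any(win):
            label = 0       # no binding nucleotide in the window
        elif is_edge:
            label = 1       # first/last window exception
        else:
            label = None    # ambiguous, dropped from training
        out.append((start, label))
    return out

# Hypothetical 60-nt RNA with binding nucleotides at indices 2 and 20.
flags = [i in (2, 20) for i in range(60)]
labels = dict(label_windows(flags))
```

In this example the window starting at index 7 is positive because its central nucleotide (index 20) binds, the first window is positive by the exception, and windows starting at index 21 or later are negative.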

[Structural constraints on RNA aptamers]

To find 27-mer RNA aptamers for a given protein sequence, 6 × 10⁶ random 27-mer RNA sequences were generated and their secondary structures were predicted using RNAfold (Lorenz et al., 2011, ViennaRNA Package 2.0. Algorithms Mol. Biol. 6(1), 26). Among the random RNA sequences, those whose secondary structures satisfy the following four constraints were retained to build the initial pool of candidate aptamers.

First, the secondary structure must be closed by a stem of at least three consecutive base pairs, with or without a dangling-end base, in one of the following forms: (((N21))), .(((N20))), (((N20)))., and .(((N19)))., where Nk is a k-mer sequence fragment.

Second, the free energy must be less than -5.7 kcal/mol.

Third, the secondary structure must contain at least 11 unpaired bases.

Fourth, the number of identical secondary structures in the pool must not exceed 150.
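The four constraints can be sketched as a filter over dot-bracket structures. The sketch assumes that the structure string and free energy of each sequence have already been obtained from a folding tool and are passed in as plain values; the function names are hypothetical, and a balanced dot-bracket string is assumed.

```python
from collections import Counter

def pair_table(structure):
    """Map each paired index to its partner for a dot-bracket string."""
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            pairs[i], pairs[j] = j, i
    return pairs

def closed_by_stem(structure, min_pairs=3):
    """Constraint 1: a closing stem of >= 3 pairs, <= 1 dangling base per end."""
    n = len(structure)
    pairs = pair_table(structure)
    for left in (0, 1):
        for right in (n - 1, n - 2):
            if all(pairs.get(left + k) == right - k for k in range(min_pairs)):
                return True
    return False

def passes(structure, energy):
    """Constraints 1 to 3 for a single candidate."""
    return (closed_by_stem(structure)
            and energy < -5.7
            and structure.count(".") >= 11)

def build_pool(candidates, max_per_structure=150):
    """Apply all four constraints; candidates are (seq, structure, energy)."""
    seen, pool = Counter(), []
    for seq, structure, energy in candidates:
        if passes(structure, energy) and seen[structure] < max_per_structure:
            seen[structure] += 1
            pool.append((seq, structure, energy))
    return pool
```

Checking the pairing partners, rather than only the first and last characters, ensures that the three outer pairs belong to the same closing stem.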

The four constraints above were derived from an analysis of all known RNA aptamer-protein complexes. As shown in FIG. 4, the secondary structure of most RNA aptamers of known structure is closed by a stem of at least three base pairs, with or without dangling-end bases.

The free energy constraint was taken from the study by Chushak and Stone (Chushak and Stone, 2009, In silico selection of RNA aptamers. Nucleic Acids Res. 37(12), e87). Those researchers computed Z-scores from the mean and standard deviation of the free energies of 10,000 random sequences of the same length. Because the Z-scores of the free energies of known RNA aptamers range from -0.9 to -4.3, the free energy cut-off value was set to -5.7 kcal/mol, which corresponds to a Z-score of -1.
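Written out, and assuming the usual definition of a Z-score (the symbols μ and σ for the mean and standard deviation of the random-sequence free energies are introduced here for illustration), the cut-off corresponds to:

```latex
Z(s) = \frac{\Delta G(s) - \mu}{\sigma},
\qquad
Z = -1 \;\Longrightarrow\; \Delta G_{\mathrm{cut}} = \mu - \sigma = -5.7\ \text{kcal/mol}
```

A sequence therefore passes the second constraint when its free energy lies more than one standard deviation below the random-sequence mean.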

For example, as shown in FIG. 4, in the structures of known RNA aptamers bound to target protein molecules, protein-binding nucleotides are observed more frequently in single-stranded regions than in double-stranded regions of the RNA secondary structure (Chushak and Stone, 2009; Carothers et al., 2004, Informational complexity and functional activity of RNA structures. J. Am. Chem. Soc. 126(16), 5130-5137).

Considering the length of the RNA aptamers, candidate aptamers with at least 11 unpaired nucleotides were selected in this example (third constraint).

Many of the secondary structures formed by the initially generated 6 × 10⁶ random RNA sequences are similar to one another. Some secondary structures were observed more than 10,000 times, whereas others were observed only once. To maintain the diversity of the potential RNA aptamers, the number of identical secondary structures was limited to 150 (fourth constraint).

After the four structural constraints were applied, only 38,327 RNA sequences remained from the initial 6 × 10⁶ random sequences. For each protein target molecule, each RNA-protein sequence pair was encoded in a feature vector for the RF model described above. The RF model consists of 35 sub-trees, and each sub-tree votes on an input feature vector. For every RNA sequence that passed the four constraints, the trained RF model was applied and a probability estimate was obtained from the proportion of positive votes. The 38,327 RNA sequences that passed the constraints were ranked by free energy and probability, and the top 10 candidate RNA aptamers, namely those with the lowest free energy among the sequences with the highest probability, were selected.
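The voting and ranking step can be sketched as below. The lexicographic ordering (highest probability first, then lowest free energy) is one reading of the ranking described above, and the candidate names and values are hypothetical.

```python
N_TREES = 35

def binding_probability(votes):
    """Fraction of trees that voted positive (votes: list of 0/1)."""
    return sum(votes) / len(votes)

def top_candidates(scored, k=10):
    """Rank by highest probability, then by lowest free energy.

    scored: list of (sequence, probability, free_energy) tuples.
    """
    ranked = sorted(scored, key=lambda c: (-c[1], c[2]))
    return ranked[:k]

# Hypothetical candidates: (label, positive votes out of 35, kcal/mol).
raw = [("seqA", 35, -6.1), ("seqB", 35, -8.4),
       ("seqC", 30, -9.9), ("seqD", 12, -7.0)]
scored = [(name, binding_probability([1] * v + [0] * (N_TREES - v)), e)
          for name, v, e in raw]
best = top_candidates(scored, k=3)
```

Here `seqB` ranks ahead of `seqA` because both received all 35 votes and `seqB` has the lower free energy.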

[Cross validation and independent test]

The model used to predict and generate the candidate RNA aptamers selected above was evaluated by 10-fold cross validation, leave-one-out (LOO) cross validation, and an independent test. The cross validations and the independent test were assessed using the sensitivity (Sn), specificity (Sp), accuracy (Acc), positive predictive value (PPV), negative predictive value (NPV), and Matthews correlation coefficient (MCC) defined by equations (1) to (6) above. Table 1 shows the results of the 10-fold and leave-one-out cross validations of the model for constructing and generating candidate RNA aptamers that bind to a target protein molecule according to this example. As shown in Table 1, the model performed well in both cross validations, and slightly better in the 10-fold cross validation than in the leave-one-out cross validation.
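For reference, the six measures can be computed from confusion-matrix counts as in the sketch below. It uses the standard textbook definitions, on the assumption that equations (1) to (6) take that form; the counts are illustrative and are not taken from the example.

```python
import math

def evaluate(tp, fn, fp, tn):
    """Return the six measures from confusion-matrix counts."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": tp / (tp + fn),                      # sensitivity
        "Sp": tn / (tn + fp),                      # specificity
        "Acc": (tp + tn) / (tp + fn + fp + tn),    # accuracy
        "PPV": tp / (tp + fp),                     # positive predictive value
        "NPV": tn / (tn + fn),                     # negative predictive value
        "MCC": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }

scores = evaluate(tp=40, fn=10, fp=5, tn=45)  # illustrative counts only
```

The sketch does not guard against empty classes; a count of zero positives or negatives would need separate handling.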

Table 1. Validation of the candidate RNA aptamer construction model

Validation   Sn       Sp       Acc      PPV      NPV      MCC
10-fold      95.10%   98.69%   97.79%   96.04%   98.37%   0.941
LOO          94.12%   98.04%   97.06%   94.12%   98.04%   0.922

In the independent test of the RF model constructed in this example, RNA aptamers that bind to proteins were taken as positive instances from the benchmark data set of previous studies (Li et al., 2014, Prediction of Aptamer-Target Interacting Pairs with Pseudo-Amino Acid Composition. PLoS One 9(1), e86729; Zhang et al., Prediction of aptamer-protein interacting pairs using ensemble classifier in combination with various protein sequence attributes. BMC Bioinformatics 17, 225). Both previous studies used the same benchmark data set. They generated negative instances by randomly pairing aptamers and protein targets within the positive data set. Those negative instances were not used here, because such a negative instance becomes a positive instance when its protein partner is changed to one of the protein targets in the positive data set. Instead, negative instances for the independent test were made by generating random RNA sequences and pairing each with one of the protein chains in the 35 protein-RNA complexes obtained from the PDB. The results of the independent test are shown in Table 2 below.
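The construction of negative instances for the independent test can be sketched as follows. The random-sequence length of 27, the number of negatives, the fixed seed, and the short protein strings are assumptions for illustration; the text specifies only that random RNA sequences were paired with protein chains from the 35 complexes.

```python
import random

def random_rna(length=27, rng=random):
    """Draw a uniformly random RNA sequence."""
    return "".join(rng.choice("ACGU") for _ in range(length))

def make_negatives(protein_chains, n, length=27, seed=0):
    """Pair n random RNA sequences with randomly chosen protein chains."""
    rng = random.Random(seed)
    return [(random_rna(length, rng), rng.choice(protein_chains))
            for _ in range(n)]

chains = ["MKRLSTAGV", "DECFWHNQY"]  # hypothetical stand-ins for PDB chains
negatives = make_negatives(chains, n=56)
```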

Table 2. Independent test of the candidate RNA aptamer construction model

Method         Sn      Sp      Acc     PPV     NPV     MCC
Li et al.      48.3%   87.1%   77.4%   -       -       0.372
Zhang et al.   73.8%   71.3%   71.9%   -       -       0.388
Example        76.8%   66.1%   71.4%   69.4%   74.0%   0.431

As shown in Table 2, the RF model of this example for constructing candidate RNA aptamers that bind to a target protein molecule generally outperformed the models of the previous papers. Because the negative instances used in the previous papers differ from those used in this example, sensitivity appears to be the most important and fairest measure for comparison.

[Conclusion]

Conventional computational methods for RNA aptamers that bind to target protein molecules cannot be used to discover new RNA aptamers for a target protein molecule. This is because the models adopted in the prior art are either limited to specific targets or are classification models that determine whether a given pair of protein and RNA sequences interacts. This example proposes a computational model for constructing potential candidate RNA aptamers for a target protein molecule, using various features of interacting RNAs and proteins together with structural constraints on the RNA secondary structure. A random forest model was constructed and trained using various features of RNA and protein sequences. Random RNA sequences are filtered according to appropriate criteria, and the constructed random forest model is used to predict and construct candidate RNA aptamers for the target protein molecule. The model performed well in both the cross validations and the independent test. Accordingly, applying the model of this example can greatly reduce the size of the initial pool of nucleic acid sequences that bind to a target protein molecule, and is therefore expected to save much of the time and cost required for in vitro experiments.

The present invention has been described above based on exemplary embodiments and an example, but the present invention is not limited to the technical idea described in those embodiments and that example. Rather, a person of ordinary skill in the art to which the present invention pertains can readily derive various modifications and changes from the embodiments and example described above. It is clear from the appended claims that all such modifications and changes fall within the scope of the present invention.

100: computer 110: input unit
120: output unit 130: display
140: memory 150: driver/application program
160: control unit
200: recording medium
210: candidate RNA aptamer generation program
220: sequence learning means 230: sequence discovery means
240: sequence evaluation means

Claims

A computer-readable recording medium on which is recorded a program for generating a candidate RNA aptamer that binds to a target protein molecule, the program comprising:
sequence learning means for constructing a feature vector of an RNA sequence and a feature vector of an amino acid sequence constituting a protein based on RNA-protein complex data, and for training a random forest model on the constructed feature vectors; and
sequence discovery means for filtering random RNA sequences and applying the random forest model trained by the sequence learning means to construct, from among the filtered RNA sequences, a candidate RNA aptamer that binds to the target protein molecule,
wherein, in the sequence learning means, the RNA feature vector is created based on the interaction propensity with amino acids constituting a protein, the mono-nucleotide (single-base) composition, the di-nucleotide (2-base) composition, and the pseudo tri-nucleotide composition of the RNA sequence, and
wherein, in the sequence learning means, the protein feature vector is created based on the composition, transition, and distribution of the amino acids constituting the protein and on the pseudo amino acid composition.

The recording medium of claim 1,
wherein the program further comprises sequence evaluation means for evaluating, using an evaluation measure, the model for generating candidate RNA aptamers constructed by the sequence discovery means.

The recording medium of claim 2,
wherein the evaluation measure is at least one selected from sensitivity, specificity, accuracy, positive predictive value, negative predictive value, and the Matthews correlation coefficient.

The recording medium according to any one of claims 1 to 3,
wherein the sequence discovery means filters the random RNA sequences by constraining the secondary structure of each random RNA sequence, its free energy, its number of unpaired nucleotides, and the number of identical secondary structures in a pool.

The recording medium according to any one of claims 1 to 3,
wherein the pseudo tri-nucleotide composition constituting the RNA feature vector includes the hydrophobicity, hydrophilicity, and side-chain mass of the tri-nucleotides, and
wherein the pseudo amino acid composition constituting the protein feature vector includes the hydrophobicity, hydrophilicity, side-chain mass, ionization constants, and isoelectric point of the amino acids.

A method of generating a candidate RNA aptamer that binds to a target protein molecule using a computer-implemented program, the method comprising:
constructing, by sequence learning means, a feature vector of an RNA sequence and a feature vector of an amino acid sequence constituting a protein based on RNA-protein complex data, and training a random forest model on the constructed feature vectors; and
filtering, by sequence discovery means, random RNA sequences, and applying the random forest model trained by the sequence learning means to construct, from among the filtered RNA sequences, a candidate RNA aptamer that binds to the target protein molecule,
wherein, in the sequence learning means, the RNA feature vector is created based on the interaction propensity with amino acids constituting a protein, the mono-nucleotide (single-base) composition, the di-nucleotide (2-base) composition, and the pseudo tri-nucleotide composition of the RNA sequence, and
wherein, in the sequence learning means, the protein feature vector is created based on the composition, transition, and distribution of the amino acids constituting the protein and on the pseudo amino acid composition.

The method of claim 6,
further comprising evaluating, by sequence evaluation means using an evaluation measure, the model for generating candidate RNA aptamers constructed by the sequence discovery means.

The method of claim 7,
wherein the evaluation measure is at least one selected from sensitivity, specificity, accuracy, positive predictive value, negative predictive value, and the Matthews correlation coefficient.

The method according to any one of claims 6 to 8,
wherein the sequence discovery means filters the random RNA sequences by constraining the secondary structure of each random RNA sequence, its free energy, its number of unpaired nucleotides, and the number of identical secondary structures in a pool.

The method according to any one of claims 6 to 8,
wherein the pseudo tri-nucleotide composition constituting the RNA feature vector includes the hydrophobicity, hydrophilicity, and side-chain mass of the tri-nucleotides, and
wherein the pseudo amino acid composition constituting the protein feature vector includes the hydrophobicity, hydrophilicity, side-chain mass, ionization constants, and isoelectric point of the amino acids.