KR20000025535A

KR20000025535A - Method for predicting secondary structure of RNA molecules by simulation of growth process

Info

Publication number: KR20000025535A
Application number: KR1019980042656A
Authority: KR
Inventors: 한경숙; 김도형; 김홍진
Original assignee: 노건일; 학교법인 인하학원
Priority date: 1998-10-13
Filing date: 1998-10-13
Publication date: 2000-05-06
Also published as: KR100295246B1

Abstract

PURPOSE: A method for predicting the secondary structure of RNA molecules by simulating the growth process is provided. In particular, the process by which the secondary structure of an RNA forms is simulated, the secondary structure is predicted, and the predicted structure is visualized. CONSTITUTION: The method comprises the steps of: 1) analyzing homologous base sequences and constructing a covariation matrix; 2) searching the covariation matrix for potential helices; 3) simulating the structure formation process to generate its intermediate structures; and 4) predicting the secondary structure of the RNA and visualizing the predicted secondary structure, which is given in text form, in graphic form.

Description

A method for predicting the secondary structure of RNA molecules by simulating the folding process

The present invention relates to a method for predicting the secondary structure of an RNA molecule in an RNA secondary structure modeling system and, more particularly, to a method that predicts the secondary structure by simulating the process by which the secondary structure of an RNA forms and that visualizes the predicted structure.

First, to aid understanding of the present invention, the underlying concepts and terms are described with reference to FIG. 1, which shows the structural elements of an RNA secondary structure.

As shown in FIG. 1, a structural element of an RNA secondary structure is either a double-stranded part or a single-stranded part such as an internal loop, a bulge loop, a multiple loop, or a dangling end. A contiguous segment of a base sequence is called a structural unit, and a structural element consists of one or more structural units.

The double-stranded part is called a helix (or stem) and is a region in which two or more base pairs occur consecutively.

An internal loop is a region that bulges out because bases on both strands are unpaired, and a bulge loop is a region of a double-stranded part that bulges out because bases on only one strand are unpaired. A multiple loop is the single-stranded region at a junction where two or more helices are connected, and a dangling end is an unpaired region at the start or end of the base sequence.

In the present invention described below, the result of analyzing the homologous base sequences is represented as a covariation matrix. Each element (i, j) of the covariation matrix represents the relationship BP(i, j) between the i-th base and the j-th base, and BP(i, j) is defined as follows.

An exact-invariant match is the case where bases i and j pair in every base sequence and neither base varies; an exact-variant match is the case where bases i and j pair in every base sequence and show compensating base changes.

A wobble match is the case where base i forms a G-U wobble pair with base j in most of the base sequences; an inexact match is the case where base i pairs with base j in most, but not all, of the base sequences and the frequency of non-pairing does not exceed a set threshold.
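The four relationships defined above can be illustrated with a minimal Python sketch that classifies one pair of alignment columns. The function name `classify_bp`, the threshold `max_unpaired_fraction`, the reading of "most" as "more than half", and the order in which the cases are tested are all assumptions made for illustration. The sketch also treats any variation among fully paired columns as exact-variant, which is a simplification of "compensating base changes".

```python
# Hypothetical sketch: classify BP(i, j) from two columns of an alignment
# of homologous RNA sequences. Names and thresholds are illustrative only.

WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def classify_bp(column_i, column_j, max_unpaired_fraction=0.25):
    """Return the match type for positions i and j, or None if unrelated.

    column_i / column_j hold the base found at position i / j in each
    homologous sequence (one character per sequence).
    """
    pairs = list(zip(column_i, column_j))
    n = len(pairs)
    if n == 0:
        return None
    wobble = sum(p in WOBBLE for p in pairs)
    paired = wobble + sum(p in WATSON_CRICK for p in pairs)

    # Assumption: "most sequences" means strictly more than half.
    if 2 * wobble > n:
        return "wobble"
    if paired == n:
        # Same pair everywhere -> invariant; otherwise the pair changed
        # between sequences while staying complementary.
        return "exact-invariant" if len(set(pairs)) == 1 else "exact-variant"
    if (n - paired) / n <= max_unpaired_fraction:
        return "inexact"
    return None
```

Calling `classify_bp` for every i < j would fill the upper triangle of a covariation matrix in this simplified model.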

The secondary structure of such an RNA molecule can be drawn as a polygonal display, a mountain, a circle, or a dome; FIGS. 2b to 2e show each of these representations for the secondary structure of FIG. 2a.

The core tasks in the study of RNA molecular structure are predicting the secondary structure and visualizing the predicted structure.

Methods for theoretically predicting RNA secondary structure fall into two types: thermodynamic methods and phylogenetic comparative methods. Thermodynamic methods use an energy model and dynamic programming to estimate a structure whose free energy is minimal or near minimal [References 1 to 3].

However, because the energy model itself is incomplete and inaccurate, the predicted structure is not accurate, and the method is overly sensitive to local changes in the RNA base sequence [Reference 4].

Phylogenetic comparative methods compare homologous base sequences to predict a structure common to them, but a considerable part of the procedure depends on tedious manual work [References 5, 6].

Moreover, in light of experimental results showing that an RNA molecule forms structure almost as soon as it grows and that the structure keeps reorganizing to reach a more stable form [Reference 7], both methods leave room for improvement.

The present inventors previously developed a heuristic for predicting the secondary structure common to several homologous base sequences [Reference 4].

The time and space this heuristic needs for secondary structure prediction equal the time and space that the method of Zuker and Stiegler [Reference 2], known as the most efficient, needs to determine the structure of a single base sequence.

This method, which can propagate folding constraints, was implemented as a modeling system called Folder. Folder was also applied to simulating changes in structure formation by estimating, at each step, a phylogenetically and thermodynamically stable structure [References 8, 9].

However, because this simulation method derives each intermediate structure independently of the structure at the previous step, it permits structural reorganization but has weak biological significance. In addition, Folder shows the predicted structure only as text, not as graphics, and was implemented on a workstation that is too expensive for most biology and biochemistry laboratories to purchase, so it is difficult for RNA researchers to use it widely.

Existing RNA structure prediction methods, represented by the thermodynamic and phylogenetic comparative methods above, estimate the final structure without considering the process by which the RNA structure forms. However, an RNA molecule does not wait until its base sequence is fully grown before adopting the most stable form; its secondary structure forms almost as soon as the base sequence is synthesized.

Consequently, if folding is kinetically difficult, the molecule may never reach the thermodynamically or phylogenetically stable form, and the final structure of the RNA molecule is influenced to some extent by the intermediate structures formed as the primary structure grows over time. Dynamic transition from one structure to another may also be an important part of the function of some RNA molecules.

For this reason there have been attempts to simulate the structure formation process of RNA molecules, in which free energy was emphasized as the criterion for judging whether an intermediate structure is suitable. Among them, methods that can be characterized as an additive approach successively select a helix that increases the free energy least and that can coexist with, that is, does not conflict with, the helices already present in the growing structure, and add it to the current structure.

Because this additive approach does not permit structure reorganization, it is at odds with observations from biochemical experiments.

A secondary structure exists in dynamic equilibrium: when the base sequence grows or is cleaved by chemical treatment, the existing structure is broken up to form a more stable one.

Some methods do permit structural reorganization, but because they seek the structure that minimizes the total free energy at each intermediate step, they too have weak biological plausibility.

There is also an approach based on the Monte Carlo method, which simulates the formation of an RNA secondary structure by treating low-energy structures as those most likely to form at a given step and selects the most probable structure at each step; however, it requires very complex computation and applies only to small RNA molecules [Reference 10].

Recently, genetic algorithms (computer algorithms that simulate how, in an ecosystem, the genes of organisms repeatedly undergo breeding, mutation, and crossover so that those best suited to the environment survive) have been used to predict RNA secondary structure [Reference 11] and to simulate the RNA structure formation process [Reference 12].

Methods based on genetic algorithms are useful, but because they require a large amount of computation they are not yet practical for large molecules, and the algorithm of Reference 20 takes too long to run a simulation (for example, 20 hours to simulate an RNA molecule of 500 bases).

Several drawing programs have been developed to visualize RNA secondary structure [References 13 to 15].

The main difficulty in visualizing an RNA secondary structure is drawing it so that its components do not overlap while the overall structure still looks compact. The larger the RNA molecule, the harder it is to satisfy these conflicting requirements. Another difficulty, from the user's point of view, is that the structure visualization system and the prediction system often use different types of data. This happens because the visualization system is developed as a separate system rather than as part of the prediction system, and the inconvenience grows when the two systems run on different computers.

Accordingly, there has been a need for a method of predicting the secondary structure of an RNA molecule and visualizing the predicted structure that runs on widely available IBM PC compatible computers, is not sensitive to local changes in the base sequence, allows the energy model to be modified and its values changed, and requires no manual work.

[References]

1. Sankoff, D., Kruskal, J. B., Mainville, S., and Cedergren, R. J., "Fast Algorithms to Determine RNA Secondary Structure Containing Multiple Loops," In Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison (Sankoff, D., Kruskal, J. B., eds), pp. 93-120, Addison-Wesley Publishing Company, 1983.

2. Zuker, M. and Stiegler, P., "Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information," Nucleic Acids Res., Vol. 9, No. 1, pp. 133-148, 1981.

3. Zuker, M., "On Finding All Suboptimal Foldings of an RNA Molecule," Science, Vol. 244, pp. 48-52, 1989.

4. Han, K. and Kim, H-J., "Prediction of Common Folding Structures of Homologous RNAs," Nucleic Acids Res., Vol. 21, No. 5, pp. 1251-1257, 1993.

5. Noller, H. F. and Woese, C. R., "Secondary Structure of 16S Ribosomal RNA," Science, Vol. 212, No. 4493, pp. 403-411, 1981.

6. Noller, H. F., "Structure of Ribosomal RNA," Ann. Rev. Biochem., Vol. 53, pp. 119-162, 1984.

7. Kramer, F. R. and Mills, D. R., "Secondary structure formation during RNA synthesis," Nucleic Acids Res., Vol. 9, No. 19, pp. 5109-5124, 1981.

8. Han, K. and Gelsey, N., "Qualitative Modeling of RNA Structure," In Proc. of IJCAI-93, pp. 1558-1563, 1993.

9. Kim, H-J. and Han, K., "Automated Modeling of the RNA Folding Process," Mol. Cells, Vol. 5, pp. 406-412, 1995.

10. Mironov, A., Dyakonova, L. P., and Kister, A., "A Kinetic Approach to the Prediction of RNA Secondary Structures," J. Biomolecular Structure & Dynamics, Vol. 2, No. 5, pp. 953-962, 1985.

11. Benedetti, G. and Morosetti, S., "A genetic algorithm to search for optimal and suboptimal RNA secondary structures," Biophys. Chem., Vol. 55, pp. 253-259, 1995.

12. Gultyaev, A. P., van Batenburg, F. H. D., and Pleij, C. W. A., "The Computer Simulation of RNA Folding Pathways using a Genetic Algorithm," J. Mol. Biol., Vol. 250, pp. 37-51, 1995.

13. Muller, G., Gaspin, C., Etienne, A., and Westhof, E., "Automatic display of RNA secondary structures," Comput. Applic. Biosci., Vol. 9, No. 5, pp. 551-561, 1993.

14. Nakaya, A., Taura, K., Yamamoto, K., and Yonezawa, A., "Visualization of RNA secondary structure using highly parallel computers," Comput. Applic. Biosci., Vol. 12, No. 3, pp. 205-211, 1996.

15. Yamamoto, K., Sakurai, N., and Yoshikura, H., "Graphics of RNA secondary structure; towards an object-oriented algorithm," Comput. Applic. Biosci., Vol. 3, pp. 99-103, 1987.

To solve the problems of the conventional methods for predicting and visualizing RNA secondary structure described above, an object of the present invention is to provide a method for predicting the secondary structure of an RNA molecule through growth process simulation that predicts a thermodynamically and phylogenetically stable secondary structure while taking the structure formation process into account, and that can express quantitatively the reliability of the structures predicted at the intermediate and final steps.

FIG. 1 is a diagram showing the structural elements of an RNA secondary structure.

FIGS. 2a to 2e are diagrams showing representations of an RNA secondary structure.

FIG. 3 is a diagram showing the format of input data according to an embodiment of the present invention.

FIG. 4 is a diagram showing the output format according to an embodiment of the present invention.

FIG. 5 is an overall flowchart showing the process of predicting the secondary structure of an RNA molecule according to an embodiment of the present invention.

FIG. 6 is a flowchart showing the simulation of the formation process of an RNA secondary structure according to an embodiment of the present invention.

FIG. 7 is a flowchart showing the process of visualizing the RNA secondary structure predicted by an embodiment of the present invention.

FIG. 8 is a flowchart showing the placement priority determination process of FIG. 7.

FIG. 9 shows the layout of a system (QFolder) that uses the method of the present invention.

FIG. 10 is a diagram showing the arrangement of the base sequence of the mRNA 5'NTR.

FIG. 11 is a diagram showing the intermediate and final structures generated during the growth of the mRNA 5'NTR, as predicted using the method of the present invention.

FIG. 12 is a diagram showing the intermediate and final structures generated during the growth of the HIV-1 TAR, as predicted using the method of the present invention.

The method of the present invention for predicting the secondary structure of an RNA molecule through growth process simulation comprises: a homologous base sequence analysis and covariation matrix generation step, in which homologous base sequences are analyzed and the result is represented as a covariation matrix whose elements are the relationships BP(i, j) between the i-th and j-th bases; a potential helix search step, in which potential helices are found in the covariation matrix, the start position, end position, length, score function S1 (used to find the most stable hairpin-loop helix in the whole structure, called best_helix), and certainty factor CF are computed for each helix, and, once all potential helices have been found, the helices are sorted in order of increasing end position; and a structure formation simulation and structure generation step, in which the intermediate structures that the RNA molecule adopts as its base sequence grows are generated in order. The method is further characterized in that the predicted secondary structure, which is in text form, is visualized in graphic form.
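As an illustration of the per-helix record and the ordering described above, the following Python sketch stores the five values named in the text and sorts candidates by end position. The class name `PotentialHelix`, the field names, and the 1-based position convention are assumptions; the score S1 and the certainty factor CF are treated as values computed elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PotentialHelix:
    """Hypothetical record for one potential helix."""
    start: int    # assumed: position of the 5'-most base of the helix
    end: int      # assumed: position of the 3'-most base of the helix
    length: int   # number of base pairs in the helix
    s1: float     # score S1, computed elsewhere
    cf: float     # certainty factor CF(helix), computed elsewhere


def sort_by_end(helices):
    """Return the helices in order of increasing end position."""
    return sorted(helices, key=lambda h: h.end)
```

Sorting by end position means that, as the simulated sequence grows from its start, the helices that become available first appear first in the list.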

First, the reliability of a predicted RNA secondary structure is described.

Simulating the structure formation process of an RNA molecule and predicting the final structure from it involves considerable uncertainty. This uncertainty stems fundamentally from the lack of information needed to reproduce the actual structure formation process in a program. For example, because the energy model on which thermodynamic methods rely is inaccurate and incomplete, the free energy values of the energy model cannot be expected to be the sole and sufficient information for simulating RNA structure formation.

Therefore, the present invention uses approximations rather than the raw values of the energy model, and simulates the structure formation process taking into account both the kinetics of structure formation and the information in the homologous base sequences. It then predicts the final structure from this simulation, with the aim of improving the accuracy of the predicted structure.

Specifically, the uncertainty in structure prediction is expressed and managed as reliability at two levels: the reliability CF(helix) of a potential helix, and the reliability CF(structure) of an intermediate or final structure composed of helices selected from the potential helices.

The reliability of a helix is defined as the value of a function of five parameters, as in Equation 1.

CF(helix) = w₁·L + w₂·E + w₃·W + w₄·I + w₅·H    (Equation 1)

where

L: length of the helix (expressed as the number of base pairs)

E: number of exact-variant matches in the helix

W: number of wobble matches in the helix

I: number of inexact matches in the helix

H: length of the hairpin loop that the helix can form.

The weights (w₁ to w₅) by which the five parameters are multiplied are determined from free energy and from the information in the homologous base sequences.

A long helix is relatively more stable than a short one, and an exact-variant match is stronger evidence for the existence of a helix than other types of base pair. The weights of the helix length (L) and of the exact-variant matches (E) are therefore positive when the helix reliability is computed. Wobble matches and inexact matches are weaker bonds than exact matches, and a helix that forms a long hairpin loop is less stable than one that forms a short hairpin loop. The weights of parameters W, I, and H are therefore negative.
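Equation 1 and the sign rules above can be written as a short Python sketch. The function name `helix_cf` and the weight values are hypothetical: the actual defaults are given only in FIG. 9 and are not reproduced here, so the numbers below merely respect the stated signs (w₁, w₂ positive; w₃, w₄, w₅ negative).

```python
# Hypothetical weights (w1..w5); only their signs follow the description.
EXAMPLE_HELIX_WEIGHTS = (2.0, 1.0, -1.0, -1.0, -0.5)


def helix_cf(length, exact_variant, wobble, inexact, hairpin_len,
             weights=EXAMPLE_HELIX_WEIGHTS):
    """Weighted sum of Equation 1: CF = w1*L + w2*E + w3*W + w4*I + w5*H."""
    w1, w2, w3, w4, w5 = weights
    return (w1 * length + w2 * exact_variant + w3 * wobble
            + w4 * inexact + w5 * hairpin_len)
```

With these illustrative weights, a helix of 5 base pairs with 2 exact-variant matches, 1 wobble match, no inexact matches, and a 4-base hairpin loop scores 2·5 + 1·2 − 1·1 − 1·0 − 0.5·4 = 9.0, and lengthening the hairpin loop lowers the score.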

The reliability of an intermediate or final structure, in turn, is determined by considering the following.

First, how stable are the helices that make up the structure adopted at each step?

Second, how difficult, kinetically, is the transition from the structure at one step to the structure at the next?

Third, how many potential helices exist at each step, and how many of them are included in the structure?

Fourth, how many bases does the structure formed at each step span?

Accordingly, the reliability of a structure is determined as a function of parameters representing these four considerations, as in Equation 2.

where

ΣCF(helix): sum of the reliabilities of the helices included in the structure

K: kinetic difficulty of the transition from the structure at the previous step, given as the number of base pairs of the previous structure that are broken

R: ratio of the number of helices included in the structure to the total number of potential helices within the range of the structure

S: length of the base sequence within the range of the structure.

The weights by which each parameter is multiplied when computing the reliability of a structure are determined from the kinetics of structure formation as well as from free energy and the information in the homologous base sequences. For example, a structure composed of helices with high reliability is relatively more reliable than one composed of helices with low reliability, so the sum of helix reliabilities is given a positive weight.

If many base pairs of the previous step must be broken to form the structure of the current step, the transition is kinetically difficult and the reliability is lower, so parameter K is multiplied by a negative weight. The more helices are included relative to the total number of potential helices, the smaller the uncertainty in the choice of helices, so parameter R is multiplied by a positive weight.

Predicting the structure of a long molecule involves more uncertainty than predicting that of a short RNA molecule, so parameter S is multiplied by a negative weight. In summary, the reliability of a structure is computed as the value of a function of four parameters; because each parameter takes a positive value, a parameter that raises the reliability is multiplied by a positive weight and any other parameter by a negative weight. The computed reliability has no fixed upper or lower bound, and a structure with a higher reliability can be regarded as relatively more stable than one with a lower reliability.
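The following Python sketch illustrates how the four parameters could be combined, assuming that the structure reliability is a weighted sum of the same form as Equation 1. That form, the function name `structure_cf`, the representation of a structure as a set of (i, j) base pairs, and the weight values are all assumptions; only the signs of the weights follow the description.

```python
# Hypothetical weights for (sum of helix CFs, K, R, S); signs follow the text.
EXAMPLE_STRUCTURE_WEIGHTS = (1.0, -1.0, 2.0, -0.5)


def structure_cf(helix_cfs, prev_pairs, curr_pairs, n_potential, span_length,
                 weights=EXAMPLE_STRUCTURE_WEIGHTS):
    """Assumed weighted-sum form of the structure reliability.

    helix_cfs   : CF(helix) of every helix included in the current structure
    prev_pairs  : base pairs (i, j) of the previous step's structure
    curr_pairs  : base pairs (i, j) of the current structure
    n_potential : number of potential helices within the structure's range
    span_length : number of bases within the structure's range (S)
    """
    w_sum, w_k, w_r, w_s = weights
    # K: base pairs of the previous structure that no longer exist.
    k = len(set(prev_pairs) - set(curr_pairs))
    # R: included helices relative to all potential helices in range.
    r = len(helix_cfs) / n_potential if n_potential else 0.0
    return w_sum * sum(helix_cfs) + w_k * k + w_r * r + w_s * span_length
```

For example, two helices with CF 9.0 and 4.0, one broken base pair, 2 of 4 potential helices included, and a 20-base span give 13.0 − 1 + 1.0 − 10.0 = 3.0 under these illustrative weights.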

In addition to the nine parameters described above, two more parameters control the simulation process of the present invention. They specify the minimum length of a potential helical structure and the number of mismatches allowed when searching for potential helical structures.

FIG. 9, described later, shows the default values that the present invention uses for these eleven parameters. The signs and relative magnitudes of the defaults follow readily from the reasoning above, but the absolute values were obtained somewhat empirically.

A sensitivity analysis showed that, as long as the signs and relative magnitudes of the parameter values are preserved, the predicted secondary structure is not sensitive to their absolute values. The defaults in FIG. 9 are the values that gave generally satisfactory results when predicting the secondary structure common to about ten homologous sequences of 1000 bases or fewer.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

In the present invention the unit of structure formation is the helical structure, not the base pair. At each step of the simulation the structure changes when a new helical structure is added to the structure of the previous step, or when a helical structure from the previous step is replaced by a new one.

The input of the present invention is an alignment of homologous sequences together with the values of the control parameters described above; the output is the set of intermediate structures adopted during structure formation, each with its reliability. This is shown in FIGS. 3 and 4.

As shown in FIG. 5, prediction of the secondary structure of an RNA molecule according to an embodiment of the present invention consists of three processes (P100 to P300).

These are: analysis of the homologous sequences and generation of the covariation matrix; search for potential helical structures; and simulation of structure formation with generation of structures.

In the first process the homologous sequences are analyzed and the result is expressed as a covariation matrix. The lower triangular part of the matrix is first filled with '-', the relation BP(i, j) is determined for each base pair (i, j), and a consensus sequence of the homologous sequences is then generated.

Each element of the covariation matrix is the relation BP(i, j) between the i-th base and the j-th base; its definition follows Reference 4.

In the second process, potential helical structures are identified in the covariation matrix and built. A potential helical structure is a diagonal of match symbols (o, *, w, +) running from upper right to lower left in the matrix.

For each helical structure the start position, end position, length, S1, and CF are computed. S1 is a score function used to find the most stable hairpin-loop helical structure in the whole structure (called best_helix), and CF is the reliability of the helical structure.

Once all potential helical structures have been found, they are sorted in order of increasing end position so that the next process examines them in that order.
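The search for diagonals and the final sort can be sketched as follows. This is only an illustration under stated assumptions: the covariation (mutual change) matrix is a list of lists of one-character symbols, '.' is a hypothetical "no match" symbol, the start and end positions are taken to be the outermost 5' and 3' bases of the helical structure, and the allowed-mismatch parameter, S1, and CF are left out.

```python
MATCH_SYMBOLS = set("o*w+")  # match symbols named in the text


def empty_matrix(n):
    """n x n matrix with '-' on and below the diagonal and '.' above it."""
    return [["-" if j <= i else "." for j in range(n)] for i in range(n)]


def find_potential_helices(matrix, min_length=3):
    """Collect maximal runs of match symbols along upper-right to lower-left diagonals."""
    n = len(matrix)
    helices = []
    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] not in MATCH_SYMBOLS:
                continue
            # Skip cells that continue a run already started at (i-1, j+1).
            if i > 0 and j + 1 < n and matrix[i - 1][j + 1] in MATCH_SYMBOLS:
                continue
            length, a, b = 0, i, j
            while a < b and matrix[a][b] in MATCH_SYMBOLS:
                length += 1
                a += 1
                b -= 1
            if length >= min_length:
                helices.append({"start": i, "end": j, "length": length})
    # Sort by increasing end position, as required by the next process.
    helices.sort(key=lambda h: h["end"])
    return helices
```

Sorting by end position means that a helical structure becomes a candidate only once the growing chain has reached its 3' end.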

The last process, simulation of structure formation and generation of structures, generates in order the intermediate structures that the RNA molecule adopts as its base sequence grows. Whether a new helical structure is included in the structure at a given step depends on whether it is best_helix, whether it can coexist with the existing helical structures, and, if it cannot coexist, whether it can replace them.

This is described in detail below with reference to FIG. 6.

First, the variables are initialized (S301, S302).

Next, helical structure h(i) in the sorted list is examined.

If h(i) is best_helix, or h(i) can coexist with the existing helical structures without conflict, it is included in the k-th structure. Otherwise, if h(i) is reliable enough to replace an existing helical structure, that is, if (w₆·CF(h) + w₇·#brokenbp) > w₆·CF(existing helix), the existing helical structure in the k-th structure is replaced by h(i), and the reliability CF(k-th structure) is calculated (S303 to S309).

Otherwise, h(i) is ignored (S310).

The helical structures h(i) in the list are examined in this way until all have been examined, and whenever the k-th structure changes, the calculation of the reliability CF(k-th structure) is repeated.

The weights w₆ and w₇, which in the second and last steps multiply the reliability of a helical structure and the number of broken base pairs respectively, are the same weights w₆ and w₇ used to calculate the reliability of a structure. The reliability of a structure built at an intermediate stage of structure formation is calculated by the method described above.
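The loop of FIG. 6 might be sketched roughly as below. Several details are assumptions made only for illustration: two helical structures are treated as conflicting when they share bases or cross, a conflicting best_helix simply displaces the helical structures it conflicts with, the reliabilities and lengths of several displaced helical structures are summed, and the weights are placeholder values. The reliability of each intermediate structure is not computed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Helix:
    start: int    # outermost 5' base
    end: int      # outermost 3' base
    length: int   # number of base pairs
    cf: float     # helix reliability


def conflicts(h, g):
    """True unless the two helices are disjoint or one is nested inside the other."""
    def nested(outer, inner):
        return (outer.start + outer.length - 1 < inner.start
                and inner.end < outer.end - outer.length + 1)
    disjoint = h.end < g.start or g.end < h.start
    return not (disjoint or nested(h, g) or nested(g, h))


def simulate_growth(helices, best_helix=None, w6=1.0, w7=-0.5):
    """Return the list of intermediate structures, one per structural change."""
    structure, snapshots = [], []
    for h in sorted(helices, key=lambda x: x.end):
        clashing = [g for g in structure if conflicts(h, g)]
        broken_bp = sum(g.length for g in clashing)
        if not clashing:
            structure = structure + [h]
        elif h == best_helix or (
            w6 * h.cf + w7 * broken_bp > w6 * sum(g.cf for g in clashing)
        ):
            structure = [g for g in structure if g not in clashing] + [h]
        else:
            continue  # h is ignored
        snapshots.append(list(structure))
    return snapshots
```

Because the candidates are visited in order of end position, each snapshot corresponds to a structure that could exist once the chain has grown to that position.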

The process of visualizing the predicted RNA secondary structure is described with reference to FIG. 7.

In an embodiment of the present invention, the predicted RNA secondary structure is first expressed in text format (pairs of parentheses and bases). The steps for drawing a secondary structure given in text format in graphical form are as follows.
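As an illustration of the text format, the following sketch recovers the base pairs from a string of parentheses. It assumes that '(' and ')' mark paired positions and that every other character (such as '-') marks an unpaired base; the function name is hypothetical.

```python
def parse_brackets(text):
    """Return the sorted list of (i, j) base pairs encoded by matching parentheses."""
    stack, pairs = [], []
    for pos, ch in enumerate(text):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)
```

For example, `parse_brackets("((--))")` pairs position 0 with 5 and position 1 with 4.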

The first step, preprocessing, modifies the structure so that it satisfies the assumption that the beginning and end of the sequence are closed, converts bulge loops into internal loops, and inserts a single-stranded segment wherever two helical structures are directly connected.

The second step calculates the positioning priority, taking into account how the loops are connected.

The last step repeats the following for every loop. Helical structures are placed automatically when the positions of the loops are determined.

First, the desired region, an open interval that keeps adequate space from the already placed elements of the secondary structure, is searched for, and then the rotatable region of the loop is determined.

Then, considering both regions: if the desired region lies within the rotatable region, the loop is placed in the direction of the desired region; otherwise, if the desired region and the rotatable region partially overlap, the loop is placed, within its movable range, in a direction that lies in the desired region.

Preprocessing is not always required to draw a secondary structure in graphical form; it is needed, as described below, only when there are dangling (unpaired) ends or bulge loops, or when two helical structures are directly connected. Its purpose is to generalize the process of updating the data structure used for visualizing the secondary structure, by adding bases that are not original components (artificial bases).

1. Dangling ends are removed by adding a helical structure.

For example, ---(((----))) or ---(((----)))--- is changed to (((---(((----)))---))).

2. Bulge loops are changed to internal loops.

For example, ((((((----)))---))) is changed to (((-(((----)))---))).

3. No two helical structures may be directly connected.

For example, (((----)))(((----))) is changed to (((----)))-(((----))).
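Rule 3 is the simplest of the three to illustrate. The sketch below assumes the text format uses '-' for an unpaired base, so a closing parenthesis immediately followed by an opening one marks two directly connected helical structures; the function name and the choice of a single artificial base are assumptions. Rules 1 and 2 are not covered.

```python
def separate_adjacent_helices(text, filler="-"):
    """Insert one artificial unpaired base between directly connected helices."""
    # ')(' is the only place where one helix ends and the next begins
    # with no single-stranded base in between.
    return text.replace(")(", ")" + filler + "(")
```

Applied to the example in rule 3, `separate_adjacent_helices("(((----)))(((----)))")` gives `(((----)))-(((----)))`.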

The score of a helical structure is the number of base pairs it contains, that is, half the number of bases forming it. The score of a loop is the area of the circle obtained by treating the loop as a circle, assuming that each base in the loop has length 1 and that the distance between adjacent bases is 1.

The priority with which these are drawn is determined from the scores as shown in FIG. 8.

First, the scores of the helical structures and loops are calculated, and the loop with the highest score is added to the drawing list (S421).

Next, a helical structure is searched for that is connected to an object in the drawing list and has a loop connected at its other end (S422).

If no such helical structure exists, the highest-scoring helical structure connected to an object in the drawing list is searched for and added to the drawing list; if one exists, the helical structure found and the loop connected to it are added to the drawing list, helical structure first and loop second (S423 to S426).

Steps S422 to S426 are repeated until the top of the drawing list equals the number of secondary structure elements (iTopOfDrawList = iNumberOfSecondaryStructure); as a result, the drawing priority is determined in descending order.
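One possible reading of steps S421 to S426 is sketched below. The data layout is hypothetical: loops are given as a name-to-score mapping, and each helical structure records the loops at its two ends (or `None` where no loop is attached). Choosing the highest-scoring helical structure when several qualify in S422 is an assumption, since the text does not say how such ties are resolved.

```python
from typing import NamedTuple, Optional


class HelixEdge(NamedTuple):
    name: str
    score: float             # number of base pairs
    loop_a: Optional[str]    # loop at one end
    loop_b: Optional[str]    # loop at the other end


def far_loop(helix, placed):
    """Loop attached to the helix that is not yet in the drawing list, if any."""
    for loop in (helix.loop_a, helix.loop_b):
        if loop is not None and loop not in placed:
            return loop
    return None


def drawing_order(loop_scores, helices):
    order = [max(loop_scores, key=loop_scores.get)]   # S421: best loop first
    placed = set(order)
    pending = list(helices)
    while pending:
        attached = [h for h in pending
                    if h.loop_a in placed or h.loop_b in placed]
        if not attached:
            break  # remaining helices are not connected to anything drawn
        # S422: prefer helices that lead to a loop not yet drawn.
        with_loop = [h for h in attached if far_loop(h, placed) is not None]
        best = max(with_loop or attached, key=lambda h: h.score)
        order.append(best.name)                       # helix first
        loop = far_loop(best, placed)
        if loop is not None:
            order.append(loop)                        # then its loop
            placed.add(loop)
        pending.remove(best)
    return order
```

Starting from the largest loop and growing outward along the helical structures means every element is drawn next to something that has already been placed.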

The present invention provides functions for drawing, editing, and printing RNA secondary structures, together with an easy-to-use GUI (Graphical User Interface).

FIG. 9 shows a typical interface of the present invention; the functions of its windows are described below, starting from the upper right.

1. Option

Allows the alignment of homologous sequences to be entered or modified and the values of the control parameters of the present invention to be set.

2. Covariation matrix & helical structures

Shows, as a covariation matrix, the base-pair relations determined by analyzing the homologous sequences. The potential helical structures found in this matrix are also shown.

3. Simulation results

Shows in text format the intermediate and final structures obtained from the simulation, together with their reliabilities.

4. Visualization

Shows the secondary structure in graphical form and provides basic functions for editing it (move, rotate, zoom in, zoom out, group, and so on).

5. Outline view

Shows the overall outline of the secondary structure without showing its bases.

6. History

Shows and saves the progress of the work.

Each of these windows can be resized or scrolled, and its contents can be printed or saved to an external file.

An embodiment of the present invention was implemented with Borland C++ Builder under Windows 95 and can run on any IBM PC compatible machine that has at least 40 Mbytes of RAM and uses Windows 95 as its operating system.

FIG. 10 shows an alignment of the 5' NTR (non-translated region) sequences of seven BiP (binding protein) mRNAs and two FGF-2 (fibroblast growth factor 2) mRNAs. The alignment is 156 bases long including gaps, and the similarity among the BiP sequences and among the FGF-2 sequences is in both cases around 40%, which is very low for homologous sequences.

FIG. 11 shows the result of simulating, with the present invention, the formation of the secondary structure common to the 5' NTR BiP and 5' NTR FGF-2 sequences when the alignment of FIG. 10 is given as input.

The first helical structure forms only after 30 or more bases have been synthesized (step 1); thereafter a new helical structure is added at every step, and the final structure is formed at step 5. In this simulation no helical structure included in an intermediate structure is broken up to form the structure of a later step.

Comparative analysis of the nine mRNA 5' NTR sequences yields six potential helical structures in total, all of which are included in the final structure, indicating a stable structure formation process.

Although the similarity among the homologous sequences is very low, the final structure reached by the simulation according to the present invention (the structure of step 6) is very stable and agrees with the structure that Le and Maizel predicted using several programs together with manual work [Reference: Le, S-Y. and Maizel, J. V., "A common RNA structural motif involved in the internal initiation of translation of cellular mRNAs," Nucleic Acids Res., Vol. 25, No. 2, pp. 362-369, 1997].

FIG. 12 shows the result of simulating, with the present invention, the formation of the structure common to ten HIV-1 TAR sequences.

In steps 1, 2, 3, and 5 the structure is formed by adding a new helical structure. In step 4 all three helical structures included in the structure of the previous step (step 3) are broken up, and the structure is formed from new helical structures.

From a kinetic point of view, the transition from step 3 to step 4 is the most difficult one.

The alignment used for the HIV-1 TAR simulation is 57 bases long, less than half the 156-base length of the mRNA 5' NTR alignment. Nevertheless, the reliability of the final TAR structure is 3.57, far lower than the 7.74 of the final mRNA 5' NTR structure.

The structure predicted for TAR has lower reliability than the one predicted for the mRNA 5' NTR because of the following factors in the reliability calculation.

1. Helical structures were broken up during the formation of the TAR structure, whereas none were broken up in the case of the mRNA 5' NTR.

2. In TAR only a small fraction of the potential helical structures take part in the structure (4 of 63), whereas in the mRNA 5' NTR all six potential helical structures are included.

In the simulation of the present invention the step size is variable, not fixed. The range of the base sequence is not extended by a constant amount at each step (for example, by 10 bases); instead it is determined by the largest end position among the helical structures included.

One advantage of a variable step size over a fixed one is that the simulation result does not depend on the step size. A greater advantage is that the trivial intermediate steps produced by a fixed step size (for example, a step whose structure differs from the previous one only in that an existing helical structure or loop is lengthened) are not generated.

As described above, the present invention can predict RNA secondary structures that are thermodynamically and phylogenetically stable and that take the structure formation process into account; it can express quantitatively the reliability of the structures predicted at intermediate and final stages; and, by allowing structures to rearrange, it can yield more biologically meaningful RNA secondary structures. It can therefore serve as a useful tool for studying RNA structure and its formation.

In addition, because it can be implemented on widely available IBM PC compatible computers, anyone can use it easily, and because the predicted secondary structure is visualized through a GUI, users unfamiliar with computers can also access and use it easily.

Claims

In the secondary structure modeling system of RNA molecules,

a method for predicting the secondary structure of an RNA molecule through growth process simulation, comprising: a homologous sequence analysis and covariation matrix generation process of analyzing homologous sequences from the input data and representing the result of the analysis as a covariation matrix; a potential helical structure search process of finding potential helical structures in the covariation matrix; a structure formation simulation and structure generation process of generating, in order, the intermediate structures adopted as the base sequence of the RNA molecule grows; and a process of visualizing the predicted secondary structure, which is expressed in text format.

The method of claim 1,

wherein, in the homologous sequence analysis and covariation matrix generation process, each element of the covariation matrix is the relation BP(i, j) between the i-th base and the j-th base of the analyzed homologous sequences.

The method of claim 1,

wherein the potential helical structure search process finds potential helical structures in the covariation matrix; computes, for each helical structure, its start position, end position, length, the score function S1 used to find the most stable hairpin-loop helical structure in the whole structure (called best_helix), and the reliability CF of the helical structure; and, once all potential helical structures have been found, sorts them in order of increasing end position.

The method of claim 3,

wherein the reliability of a helical structure is defined as a function of five parameters as in the following equation, and the weights (w₁ to w₅) multiplied by the five parameters are determined from free energy and homologous sequence information:

$$CF(\mathrm{helix}) = w_1 \cdot L + w_2 \cdot E + w_3 \cdot W + w_4 \cdot I + w_5 \cdot H$$

where L is the length of the helical structure (expressed as the number of base pairs), E is the number of precision strain pairs in the helical structure, W is the number of wobble pairs in the helical structure, I is the number of coarse pairs in the helical structure, and H is the length of the hairpin loop that can be formed by the helical structure.

The method of claim 1,

wherein the structure formation simulation and structure generation process comprises: an initialization step; examining helical structure h(i) in the sorted list; if h(i) is best_helix or can coexist with the existing helical structures without conflict, including it in the k-th structure; otherwise, if h(i) is reliable enough to replace an existing helical structure, replacing the existing helical structure in the k-th structure with h(i); otherwise ignoring h(i); calculating the reliability CF(k-th structure) if the k-th structure has changed; and examining the helical structures h(i) in the list until all have been examined, repeating the calculation of CF(k-th structure) whenever the k-th structure changes.

The method of claim 5,

wherein the reliability CF(k-th structure) is defined as a function of four parameters as in the following equation, and the weights (w₁ to w₅) multiplied by the four parameters are determined from the kinetics of the structure formation process as well as from free energy and homologous sequence information:

Here, the first parameter is the sum of the reliabilities of the helical structures included in the structure; K is the kinetic difficulty of the transition from the structure of the previous step, given as the number of base pairs of the previous structure that are broken; R is the ratio of the number of helical structures included to the total number of potential helical structures within the range of the structure; and S is the length of the base sequence within the range of the structure.