KR100699437B1

KR100699437B1 - Apparatus and Method for Analysis of Amino Acid Sequence

Info

Publication number: KR100699437B1
Application number: KR1020040085496A
Authority: KR
Inventors: 김상태; 박희진; 이공주; 백은옥
Original assignee: 김상태; 백은옥; 박희진; 이공주
Priority date: 2004-10-25
Filing date: 2004-10-25
Publication date: 2007-03-27
Also published as: KR20060036322A

Abstract

An apparatus and method for analyzing the amino acid sequence of a protein are provided.

The present invention extracts a list of sequence tags contained in the peptide under analysis and determines the candidate peptides that contain them. It then joins the sequence tags to generate a list of peptides containing gaps, and searches for the amino acid sequence or post-translational modification of each gap using the mass shift of the gap and the amino acid sequence of the corresponding gap region in the candidate peptide. This minimizes the search needed to find the amino acid sequence and the post-translational modifications of a protein.

Protein, amino acid sequence, post-translational modification

Description

Apparatus and Method for Analysis of Amino Acid Sequence

FIG. 1 is a block diagram of a protein sequence analysis system to which the amino acid sequence analysis device according to the present invention is applied.

FIG. 2 is a detailed block diagram of the amino acid sequence analysis device shown in FIG. 1.

FIG. 3 is a flowchart of an amino acid sequence analysis method according to an embodiment of the present invention.

FIG. 4 is a detailed flowchart of the tag extraction process in the amino acid sequence analysis method according to the present invention.

FIG. 5 is a detailed flowchart of the peptide search process in the amino acid sequence analysis according to the present invention.

FIG. 6 is a detailed flowchart of the tag placement process in the amino acid sequence analysis according to the present invention.

FIG. 7 is a detailed flowchart of the post-translational amino acid sequence determination process in the amino acid sequence analysis according to the present invention.

FIGS. 8 to 12 illustrate an example of how the amino acid sequence analysis device according to the present invention analyzes the amino acid sequence of a protein.

<Description of reference numerals for the main parts of the drawings>

10: control unit
20: tandem mass spectrometer
30: amino acid sequence analysis device
40: interface unit
50: database
310: peak selector
320: tag extraction unit
322: mass calculation means
324: amino acid matching means
330: peptide search unit
332: peptide comparison means
334: candidate peptide list determination means
340: sequence tag generation unit
342: tag filtering means
344: tag joining means
346: peptide list generation means
348: peptide list filtering means
350: amino acid sequence determination unit
352: gap sequence matching means
354: spectrum generation and comparison means

The present invention relates to protein sequence analysis and, more particularly, to an apparatus and method for analyzing the amino acid sequence of a protein.

To uncover how the human body works, it is important to study not only the genome, which carries all genetic information, but also the function of proteins, through which that information is expressed. Because proteins regulate all functional activity in the body, understanding how a protein works makes it possible to identify the proteins that cause a disease and to begin developing treatments in earnest.

The proteome is the entire set of proteins that can be produced from the genome. It is a dynamic concept: it changes with the specific physiological or pathological state of a cell or tissue. Proteomics refers to the methods and techniques used to study the proteome. That is, it is the field that studies protein properties such as expression, post-translational modification, and binding to other proteins, in order to understand intracellular modification and network formation comprehensively in relation to disease progression.

Tandem mass spectrometry has been widely used to identify the proteins in a biological sample. In this method a protein is digested (that is, cleaved) with a proteolytic enzyme such as trypsin. The resulting fragments, called peptides, are broken apart by collision-induced dissociation, and the products are ionized and their masses measured. The peptide sequences are determined from these masses and are finally used to find the sequence of the protein.

Peptides produced by digesting a protein with trypsin characteristically end in the amino acid R or K. For example, if the protein sequence is "MEMEKEFEQIDKSGSWAAIYQDIRHEASDFPCRVAKLPKNKNRNRYRDVSPFDHSRKLHQEDNDYINASLIKMEEAQRSYILTQ", digestion with trypsin can yield peptides such as "SGSWAAIYQDIR", "LPKNKNRNRYRDVSPFDHSR", and "LHQEDNDYINASLIKMEEAQR". When the peptide "SGSWAAIYQDIR" is subjected to collision-induced dissociation, the chemical bonds that make up the peptide break. In particular, when the dissociation is performed at low energy, it is mainly the peptide (amide) bonds along the backbone that break. This produces fragments that start with the leading amino acid of the peptide (prefixes beginning with S in this example, such as "S", "SG", "SGSWA", and "SGSWAAIYQDI") or fragments that end with its trailing amino acid (suffixes ending with R in this example, such as "R", "IR", "DIR", "QDIR", and "GSWAAIYQDIR").

These fragments exist in ionized form. Fragments that start with the leading amino acid are called b ions, and fragments that end with the trailing amino acid are called y ions. Besides the b and y ions produced when a peptide (amide) bond breaks, various other ions are produced, including a, c, x, and z ions and ions that have lost water or ammonia, such as b-H₂O, b-NH₃, y-H₂O, and y-NH₃. A tandem mass spectrum is obtained by detecting these ionized peptide fragments, measuring their masses, and plotting as intensity how many ions of each mass are detected. Analyzing the tandem mass spectrum in this way makes it possible to infer the mass of the original peptide.
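The digestion and fragmentation described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented method: the function names are hypothetical, and the digest is simplified to cleave after every K or R with no missed cleavages, so it returns only fully cleaved peptides such as "SGSWAAIYQDIR".

```python
PROTEIN = ("MEMEKEFEQIDKSGSWAAIYQDIRHEASDFPCRVAKLPKNKNRNRYRDVSPFDHSR"
           "KLHQEDNDYINASLIKMEEAQRSYILTQ")

def tryptic_digest(protein):
    """Cleave after every K or R (simplified: no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal remainder
    return peptides

def prefixes(peptide):
    """Sequences of the b-type fragments: proper prefixes of the peptide."""
    return [peptide[:i] for i in range(1, len(peptide))]

def suffixes(peptide):
    """Sequences of the y-type fragments: proper suffixes of the peptide."""
    return [peptide[i:] for i in range(1, len(peptide))]
```

Calling `prefixes("SGSWAAIYQDIR")` and `suffixes("SGSWAAIYQDIR")` lists the fragment sequences that correspond to the b and y ions of that peptide.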

In a tandem mass spectrum, y and b ions are relatively more intense than other ions, so a widely used approach is to extract the peaks, that is, the local maxima of the spectrum, and analyze the differences between them. For example, if two peaks in a tandem mass spectrum correspond to the y ion with sequence "DIR" and the y ion with sequence "QDIR", the difference between their masses equals the mass of the amino acid Q, which shows that the original peptide sequence contains Q.
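The DIR/QDIR example can be checked numerically. The sketch below assumes singly charged y ions and uses standard monoisotopic residue masses; the function names are illustrative and do not come from the patent.

```python
# Monoisotopic residue masses in daltons (amino acid minus water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added to the C-terminal fragment
PROTON = 1.00728   # charge carrier for a singly charged ion

def y_ion_mz(suffix):
    """m/z of a singly charged y ion with the given sequence."""
    return sum(RESIDUE_MASS[aa] for aa in suffix) + WATER + PROTON

def mass_difference(longer, shorter):
    """Difference between two y-ion peaks; water and proton cancel."""
    return y_ion_mz(longer) - y_ion_mz(shorter)
```

`mass_difference("QDIR", "DIR")` returns the residue mass of Q. Note that L and I have identical masses, which is why tags such as V[L|I]GAN list both alternatives.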

However, predicting a peptide sequence from a tandem mass spectrum is difficult because of several confounding factors: post-translational modifications, in which an amino acid is chemically modified and its mass changes; mutations, such as the substitution of one amino acid by another; mass measurement errors; and missed cleavages, which occur when trypsin fails to digest the protein completely.

Post-translational modifications are particularly important. Knowing which kind of modification occurs at which position in a protein sequence is key to elucidating the protein's biological function, so identifying post-translational modifications is itself recognized as a major problem in proteomics. An algorithm is therefore needed that can determine peptide and protein sequences from tandem mass spectra, and find the type and position of each post-translational modification, even when many modifications and other confounding factors are present.

Several algorithms exist for determining protein or peptide sequences from tandem mass spectra. The three most commonly used are database search, de novo sequencing, and the sequence tag approach.

Database search generates, from a database of protein sequences, every possible peptide variant including post-translational and other modifications, computes a theoretical tandem mass spectrum for each variant, and compares it with the experimentally obtained tandem mass spectrum.

De novo sequencing predicts the full sequence of a peptide, including its post-translational modifications, from the tandem mass spectrum without using a protein database.

The sequence tag approach uses de novo sequencing to find sequence tags, which are short peptide sequences, in the tandem mass spectrum. It then finds the peptides in a protein sequence database that contain those tags and compares the theoretical tandem mass spectrum of each peptide found with the experimental spectrum.

However, because these methods focus on determining the protein sequence, they cannot predict it correctly when several confounding factors are present. In particular, when predicting the sequence of a peptide that carries multiple post-translational modifications, the running time grows exponentially with the number and variety of modifications. Consequently, when a peptide carries several post-translational or other modifications, existing protein sequence analysis methods cannot predict the sequence accurately and consume a great deal of time and effort.

The present invention was devised to solve these problems. Its object is to provide an apparatus and method that analyze an amino acid sequence by using the mass shift of the gaps in a sequence tag to find, quickly and easily, the amino acids that make up a protein and the types and positions of multiple post-translational modifications on those amino acids.

To achieve the above object, the present invention provides an apparatus for analyzing the amino acid sequence of a protein, comprising: a database containing peptide sequence data and post-translational modification data; a sequence tag generation unit that, once a list of sequence tags free of post-translational modifications has been generated from a plurality of peaks selected by mass analysis of the protein under analysis, receives the candidate peptides containing those sequence tags, extracted with reference to the peptide sequence data, and places all sequence tags contained in each candidate peptide, joining those that can be joined, to generate a sequence tag containing gaps; and an amino acid sequence determination unit that, with reference to the peptide sequence data or the post-translational modification database, replaces each gap in the sequence tag generated by the sequence tag generation unit with an amino acid sequence, or with an amino acid sequence including a post-translational modification, to generate a peptide in which the gaps are filled.

The present invention finds sequence tags by mass analysis of the protein under analysis and uses them to search a peptide sequence database. This limits the number of proteins to be searched to between a few and a few tens, after which the types and positions of the post-translational modifications in those proteins are searched. No limit is placed on the length or number of sequence tags, and a suffix array is used to find the peptides in the protein database that contain a sequence tag, so the search is fast. A further advantage is that the running time does not increase greatly even when a protein carries several kinds of post-translational modification or several modifications.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a protein sequence analysis system to which the amino acid sequence analysis device according to the present invention is applied.

As shown in FIG. 1, the protein sequence analysis system includes: a control unit (10); a tandem mass spectrometer (20) that analyzes the mass of a protein in two stages to generate mass spectra; an amino acid sequence analysis device (30) that, once the candidate peptides containing the sequence tags extracted from the peaks of the mass spectrum produced by the tandem mass spectrometer (20) have been determined, generates a peptide list from the sequence tags and determines the amino acid sequence and/or post-translational modifications of the portions of the peptide list for which no sequence tag has been determined; an interface unit (40) for receiving control commands from the user and outputting results; and a database (50). The mass analysis of the protein is of course not limited to the tandem mass spectrometer (20); any suitable mass spectrometer may be used.

Here, the database (50) includes a peptide sequence database and a post-translational modification database. The peptide sequence database is a set of peptide data, free of post-translational modifications, that is derived computationally from protein sequences and stored in suffix array form. The post-translational modification database contains, for each modification, its name, the amino acids on which it occurs, the resulting change in amino acid mass, and the conditions under which it occurs. In particular, the peptide sequence database stores peptides in groups by mass, so that peptides of the same mass are grouped together. Each peptide is stored in suffix form, and information on the mass shifts before and after each sequence tag contained in a peptide is also stored.

As described in the background above, the tandem mass spectrometer (20) cleaves a protein with a proteolytic enzyme (for example, trypsin) to produce a plurality of peptide fragments, ionizes and injects each peptide, fragments it by, for example, collision with a collision gas, separates the fragments by mass-to-charge ratio, and from this generates a mass spectrum for each peptide. The horizontal axis of the mass spectrum is the mass-to-charge ratio, and the vertical axis is the relative abundance (intensity) at each value on the horizontal axis.

The amino acid sequence analysis device (30) searches the tandem mass spectrum obtained by the tandem mass spectrometer (20) for the amino acid sequence and/or post-translational modifications of the protein under analysis. Its detailed configuration is shown in FIG. 2.

FIG. 2 is a detailed block diagram of the amino acid sequence analysis device shown in FIG. 1. For convenience, it is described with reference to FIGS. 8 to 12, which illustrate an example of how the device analyzes the amino acid sequence of a protein.

As shown, the amino acid sequence analysis device (30) of the present invention includes the following units.

A peak selector (310) extracts a plurality of relatively intense peaks from the raw data of the tandem mass spectrum produced by the tandem mass spectrometer (20).

A tag extraction unit (320) uses the mass differences between the peaks extracted by the peak selector (310) to find short amino acid sequences in the spectrum that contain no post-translational modifications (that is, sequence tags).

A peptide search unit (330) refers to the peptide sequence data stored in the database (50) and extracts the candidate peptides that contain the sequence tags found by the tag extraction unit (320).

A sequence tag generation unit (340) places, for each candidate peptide extracted by the peptide search unit (330), all the sequence tags that the peptide contains, including the cases in which several sequence tags can be placed at the same time, and generates a continuous sequence tag containing gaps.

An amino acid sequence determination unit (350) searches the post-translational modification database and replaces each gap in the sequence tag generated by the sequence tag generation unit (340) with an amino acid sequence or with an amino acid sequence including a post-translational modification. It then generates a theoretical spectrum for each candidate peptide sequence filled in this way and compares it with the original spectrum to measure their similarity.

Here, the peak selector (310), the tag extraction unit (320), and the peptide search unit (330) may each be implemented with any device capable of, respectively, selecting peaks from the protein under analysis, extracting sequence tags, and extracting candidate peptides.

The amino acid sequence analysis device (30) of the present invention shown in FIG. 2 is described in more detail below.

First, the peak selector (310) extracts a plurality of intense peaks (for example, 20 to 200) from each mass spectrum generated by the tandem mass spectrometer (20). If the peaks near a particular mass are weak, so that important peaks such as y or b ions are not extracted, the mass range used for peak extraction is changed and the relatively intense peaks within that range are extracted in addition. In other words, the selector picks not only the peaks that are intense overall but also those that are intense locally.
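One way to combine globally and locally intense peaks is sketched below. It is an assumption-laden illustration rather than the patent's procedure: instead of adjusting the mass range on demand, it divides the m/z axis into fixed windows, and the parameter names `top_n`, `window`, and `per_window` are hypothetical.

```python
def select_peaks(peaks, top_n=50, window=100.0, per_window=3):
    """Select peaks that are intense globally or within their m/z window.

    peaks: list of (mz, intensity) tuples.
    Returns the selected peaks sorted by m/z.
    """
    by_intensity = lambda peak: peak[1]

    # Globally most intense peaks.
    chosen = set(sorted(peaks, key=by_intensity, reverse=True)[:top_n])

    # Group peaks into fixed-width m/z windows (an assumed simplification).
    windows = {}
    for mz, intensity in peaks:
        windows.setdefault(int(mz // window), []).append((mz, intensity))

    # Add the locally most intense peaks of each window.
    for members in windows.values():
        chosen.update(sorted(members, key=by_intensity, reverse=True)[:per_window])

    return sorted(chosen)
```

With this scheme a weak peak in a sparsely populated mass region survives selection even when much stronger peaks elsewhere fill the global quota.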

The tag extraction unit (320) includes a mass calculation means (322) and an amino acid matching means (324). To find the amino acid sequences in the spectrum from the mass differences between peaks, it may use, for example, de novo sequencing.

First, the mass calculation means (322) computes, for every pair of peaks extracted by the peak selector (310), the difference between the masses of the two peaks. The amino acid matching means (324) then compares each of these differences with the mass of each amino acid. If the mass difference of a peak pair and the mass of a particular amino acid agree to within a specified tolerance, that peak pair is judged to possibly represent that amino acid. In other words, the result is that an amino acid whose mass equals the mass difference of the peak pair is present in the peptide under analysis.
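The pairwise comparison can be sketched as follows. The residue masses are standard monoisotopic values; the function name, the tolerance default, and the representation of peaks as a dictionary from peak number to mass are assumptions made for illustration.

```python
from itertools import combinations

# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def match_pairs(peak_masses, tol=0.02):
    """Find peak pairs whose mass difference matches an amino acid.

    peak_masses: dict mapping peak number -> mass.
    Returns a list of (amino_acid, lower_peak, higher_peak).
    """
    ordered = sorted(peak_masses.items(), key=lambda item: item[1])
    matches = []
    for (low, low_mass), (high, high_mass) in combinations(ordered, 2):
        delta = high_mass - low_mass
        for amino_acid, mass in RESIDUE_MASS.items():
            if abs(delta - mass) <= tol:
                matches.append((amino_acid, low, high))
    return matches
```

Because every pair is examined, the cost is quadratic in the number of selected peaks, which is one reason the peak list is limited in size beforehand. A difference of about 113.084 matches both L and I, so both are reported.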

In addition, when two peak pairs that each represent one amino acid share a peak, the amino acid matching means (324) determines that the three peaks formed by joining the two pairs may represent two amino acids. In the same way it computes every case in which n peaks represent n-1 amino acids, and generates a sequence tag list in which each entry gives a sequence tag of n-1 amino acids together with the n peaks that make it up.

Referring to FIG. 8, the mass difference of the peak pair formed by peaks 15 and 17 represents the amino acid G. Likewise, the mass difference of the pair formed by peaks 17 and 20 represents M, that of the pair formed by peaks 20 and 24 represents I, and that of the pair formed by peaks 24 and 29 represents D. In this case the sequence tag list contains G(15, 17), M(17, 20), I(20, 24), and D(24, 29).

In FIG. 8, the amino acids G and M share peak 17, so the three peaks (15, 17, 20) formed by joining the two peak pairs that represent G and M can be judged to represent two amino acids, which produces the sequence tag GM(15, 17, 20). In the same way, the five peaks (15, 17, 20, 24, 29) can represent four amino acids, which produces the sequence tag GMID(15, 17, 20, 24, 29).
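Joining peak pairs that share a peak amounts to enumerating paths in a graph whose nodes are peaks and whose edges are matched amino acids. The sketch below applies this to the four single-residue pairs of FIG. 8; the function name and data layout are illustrative assumptions.

```python
# Single-residue matches from FIG. 8: (amino acid, lower peak, higher peak).
FIG8_PAIRS = [("G", 15, 17), ("M", 17, 20), ("I", 20, 24), ("D", 24, 29)]

def extend_tags(pairs):
    """Enumerate every tag obtained by chaining pairs that share a peak.

    Returns a list of (sequence, peaks), where a sequence of n-1
    amino acids is supported by n peaks.
    """
    # Index the pairs by their lower peak so chains can be extended.
    by_start = {}
    for amino_acid, low, high in pairs:
        by_start.setdefault(low, []).append((amino_acid, high))

    tags = []

    def walk(sequence, peaks):
        tags.append((sequence, tuple(peaks)))
        # Extend the chain with every pair starting at the last peak.
        for amino_acid, high in by_start.get(peaks[-1], []):
            walk(sequence + amino_acid, peaks + [high])

    for amino_acid, low, high in pairs:
        walk(amino_acid, [low, high])
    return tags
```

The recursion terminates because every pair leads from a lower peak to a higher one, so no chain can revisit a peak. For the FIG. 8 data it yields all single-residue tags, the two-residue tag over peaks (15, 17, 20), and the four-residue tag spanning peaks 15 to 29, among others.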

The peptide search unit (330) searches the peptide sequence database for the peptides that contain the sequence tags in the list generated by the tag extraction unit (320). It includes a peptide comparison means (332) and a candidate peptide list determination means (334). Because it cannot be determined whether a sequence tag found by the tag extraction unit (320) runs in the b-ion or the y-ion direction, the database is searched for both the forward and the reversed sequence of each tag. Because the peptides are stored in the database as a suffix array, looking up a sequence tag requires comparing only as many characters as the tag is long. As a result, generating and searching many more sequence tags than existing algorithms do does not slow the search down.
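A suffix-based lookup in both directions can be sketched as below. This is a simplified stand-in for a real suffix array: it stores each suffix as an explicit string, which is wasteful in memory but keeps the binary search easy to follow. The function names are hypothetical, and the peptides are those listed in [Table 2].

```python
from bisect import bisect_left

PEPTIDES = ["VTAEDKGTGNK", "RALSSQHQAR", "VYADQRPLTK", "ETMEKAVEEK",
            "EFFNGKEPSR", "QATKDAGTIAGLNVLR", "VEIINQDAGNR", "FLPFKVVEKK"]

def build_suffix_index(peptides):
    """Sorted list of (suffix, peptide index) for every suffix."""
    entries = [(peptide[i:], index)
               for index, peptide in enumerate(peptides)
               for i in range(len(peptide))]
    entries.sort()
    return entries

def find_peptides(index, tag):
    """Indices of the peptides that contain the tag as a substring."""
    # Every occurrence of the tag is a prefix of some suffix, and those
    # suffixes are contiguous in sorted order, starting at this position.
    position = bisect_left(index, (tag, -1))
    hits = set()
    while position < len(index) and index[position][0].startswith(tag):
        hits.add(index[position][1])
        position += 1
    return hits

def search_both_directions(index, tag):
    """Search the tag as written (b direction) and reversed (y direction)."""
    return {"forward": find_peptides(index, tag),
            "reverse": find_peptides(index, tag[::-1])}
```

For the tag AITGADKT, the forward search finds nothing in this list, while the reversed sequence is found in QATKDAGTIAGLNVLR.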

The candidate peptide list determination means (334) then extracts, from the results of the peptide comparison means (332), the one or more peptides that contain a sequence tag of at least a specified first length (for example, 3), and generates a first candidate peptide list. Next, from the peptides in the first candidate peptide list, it extracts those that contain a sequence tag of a specified second length (for example, 2), and generates a second candidate peptide list. For each second candidate peptide it also stores the list of all sequence tags the peptide contains and the direction in which each tag matched the peptide (the b-ion direction for a forward match, the y-ion direction for a reverse match).
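The first filtering stage and the direction bookkeeping can be sketched as follows. This is a partial illustration under stated assumptions: the second-length filter is deliberately omitted, the function names are hypothetical, the tags are a subset of [Table 1] without the [L|I] alternatives, and a tag that matches in the forward direction is recorded as a b match only.

```python
PEPTIDES = ["VTAEDKGTGNK", "RALSSQHQAR", "VYADQRPLTK", "ETMEKAVEEK",
            "EFFNGKEPSR", "QATKDAGTIAGLNVLR", "VEIINQDAGNR", "FLPFKVVEKK"]
TAGS = ["AITGADKT", "WWG", "VSV", "TIA", "ITG", "LDA", "GMI", "NVL",
        "VI", "GM", "MI", "ID"]

def tag_hits(peptide, tags):
    """List (tag, direction) for each tag found in the peptide.

    Direction is "b" for a forward match and "y" for a reversed match.
    """
    hits = []
    for tag in tags:
        if tag in peptide:
            hits.append((tag, "b"))
        elif tag[::-1] in peptide:
            hits.append((tag, "y"))
    return hits

def first_candidates(peptides, tags, first_len=3):
    """Peptides containing a tag of at least first_len residues,
    each mapped to all of its matching tags and their directions."""
    long_tags = [tag for tag in tags if len(tag) >= first_len]
    candidates = {}
    for peptide in peptides:
        if tag_hits(peptide, long_tags):
            candidates[peptide] = tag_hits(peptide, tags)
    return candidates
```

With these inputs only QATKDAGTIAGLNVLR qualifies, matched by AITGADKT and ITG in the y direction and by TIA and NVL in the b direction.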

An example in which the sequence tags extracted by the tag extraction unit (320) are as listed in [Table 1], and part of the list of peptide sequences stored in the peptide sequence database is as listed in [Table 2], is described below with reference to FIG. 9.

[Table 1]

V[L|I]GAN, AITGADKT, …

WWG, AT[L|I], VSV, TIA, ITG, …

LDA, GMI, NVL, …

VI, VO, GM, MI, ID, …

G, M, I, K, …

[Table 2]

VTAEDKGTGNK

RALSSQHQAR

VYADQRPLTK

ETMEKAVEEK

EFFNGKEPSR

QATKDAGTIAGLNVLR

VEIINQDAGNR

FLPFKVVEKK

⋮

The peptide comparison means 332 searches the database containing peptide sequences such as those in [Table 2] for every peptide sequence that contains a tag from the sequence tag list generated as in [Table 1]. Because the true direction of each sequence tag is unknown, each tag is searched in both the forward and the reverse direction. For example, for the tag AITGADKT a search is performed in the b-ion (forward) direction, and a search is also performed for the corresponding y-ion (reverse) direction tag, namely tkdagtia. Here, a tag sequence written in uppercase denotes a forward ion, and a tag sequence written in lowercase denotes a reverse ion.

Next, from the list of peptides that contain sequence tags, the candidate peptide list determination means 334 extracts the peptides containing a sequence tag of, for example, length 3 or more, and from these generates the list of candidate peptides that also contain a sequence tag of length 2. In the example of FIG. 9, QATKDAGTIAGLNVLR, which contains both the reverse ion of AITGADKT and NVL, is selected from the peptide sequences stored in the database as a candidate peptide. Although this example shows a single candidate peptide, a plurality of candidate peptides may of course be selected.

Once the candidate peptide list has been determined, the sequence tag generation unit 340 places every sequence tag contained in a candidate peptide onto that peptide, including the cases in which several sequence tags can be placed simultaneously, and thereby generates contiguous sequence tags that contain gaps (blanks). To this end, the sequence tag generation unit 340 includes a tag filtering means 342, a tag combining means 344, a peptide list generation means 346 and a peptide list filtering means 348.

First, the tag filtering means 342 can perform parent-mass (mother mass) filtering and filtering based on comparing the masses before and after a sequence tag; the parent-mass filtering may be omitted. Here, the parent mass is the molecular weight of a peptide and, if the peptide carries a modification, includes the mass of that modification. Parent-mass filtering compares the parent mass of each candidate peptide with the mass of the peptide under analysis and excludes a candidate peptide if its parent mass is smaller than the mass of the peptide under analysis by a first set value (for example, 100 Da) or more. Parent-mass filtering is performed because, when a post-translational modification occurs in a peptide, the mass of the peptide rarely decreases by the first set value (for example, 100 Da) or more.
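The parent-mass rule stated above reduces to a single comparison per candidate. The sketch below assumes the candidate parent masses are already available as a mapping; the function name and the peptide labels and masses used in the example are hypothetical.

```python
def parent_mass_filter(candidates, observed_mass, first_set_value=100.0):
    """Drop candidates whose parent mass is smaller than the observed mass
    by `first_set_value` (Da) or more.

    candidates: dict mapping peptide -> parent mass in Da.
    """
    return {
        pep: mass
        for pep, mass in candidates.items()
        if observed_mass - mass < first_set_value
    }
```

For an observed mass of 1482.9 Da, a hypothetical candidate of 1500.0 Da is kept (it is heavier than the observed peptide), while a candidate of 1350.0 Da is removed because it is 132.9 Da lighter.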

Next, in the filtering based on comparing the masses before and after a sequence tag, a sequence tag is deleted from the sequence tag list if the mass at the position of its first or last peak is smaller than the mass at the corresponding peak position theoretically calculated from the candidate peptide by a second set value (for example, 100 Da) or more. In other words, the mass of the first or last peak of the sequence tag is compared with the mass of the part of the candidate peptide that corresponds to the tag, and the tag is deleted if it is smaller by the second set value or more.

When this filtering is complete, the tag combining means 344 checks the consistency among the sequence tags matched to a candidate peptide and, if two or more sequence tags can be combined, combines them. Whether two sequence tags can be combined is checked according to their relative positions, distinguishing four cases: one tag is contained in the other, the two tags overlap, the two tags are adjacent, or the two tags are separated.

When one of the two sequence tags is contained in the other, when the two tags overlap, or when they are adjacent, whether to combine them is decided from the difference between their mass shifts. If the difference is close to a first reference value (for example, 0), the two tags have different directions and can be combined, because a pair of sequence tags with the same direction and a mass-shift difference close to 0 cannot exist. In these three cases, if the difference in mass shift equals a second reference value (for example, 17) or a third reference value (for example, 18), one of the ions forming the sequence tags can be regarded as having lost NH₃ or H₂O, respectively, during collision-induced dissociation in the mass analysis, so the two tags can be combined. Likewise, if the difference equals a fourth reference value (for example, 28), the ions represented by the sequence tags can be regarded as a ions, produced by cleavage of a carbon bond, rather than ions produced by cleavage of the peptide (amide) bond, so the two tags can be combined. Thus a list of mass-shift differences for which two sequence tags may be combined is defined in advance; a pair of tags is combined if the difference between their mass shifts is close to a value in the list, and is not combined otherwise.
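The predefined-list check for contained, overlapping or adjacent tags can be sketched as follows. The reference values 0, 17, 18 and 28 come from the text; the function name, the tolerance of 0.5 Da and the choice to compare a single mass shift per tag are assumptions made for illustration.

```python
# Mass-shift differences (Da) for which two tags may be combined:
# 0 (consistent), 17 (NH3 loss), 18 (H2O loss), 28 (a ion).
COMBINABLE_DIFFS = (0.0, 17.0, 18.0, 28.0)


def can_combine(shift_a, shift_b, tolerance=0.5):
    """Return True if the difference of the two mass shifts is close to a listed value."""
    diff = abs(shift_a - shift_b)
    return any(abs(diff - ref) <= tolerance for ref in COMBINABLE_DIFFS)
```

Using the leading-gap mass shifts quoted for FIG. 10a, tkdagtia (-17.12) and KDA (-17.00) differ by about 0.12 and pass the check, whereas tkdagtia (-17.12) and TKD (26.03) differ by about 43.15 and do not.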

On the other hand, when the two sequence tags are separated, the mass shift of the gap that arises between them when they are combined is calculated, and the two tags are combined if that mass shift lies within an allowable third set value (for example, 100 Da).

After the sequence tags have been combined in this way, the peptide list generation means 346 generates a list of gapped peptides from the individual sequence tags and the combined sequence tags.

Referring to FIGS. 10a and 10b, FIG. 10a shows a sequence tag list, and FIG. 10b shows the list of gapped sequences obtained by combining the sequence tags that can be combined.

First, FIG. 10a shows the case in which the candidate peptide is, for example, QATKDAGTIAGLNVLR and the sequence tag list is tkdagtia, KDA, TKD and nvl: the sequence tags are aligned against the candidate peptide, and the mass shift of each resulting gap is indicated. Specifically, for tkdagtia the mass shifts of gaps [①, ②] are [-17.12, 0.07]; for KDA the mass shifts of gaps [③, ④] are [-17.00, -0.03]; for TKD the mass shifts of gaps [⑤, ⑥] are [26.03, -43.07]; and for nvl the mass shifts of gaps [⑦, ⑧] are [-17.05, 0.03]. Here, the mass shift of a gap is the difference between the theoretical position of the peak for the gap and the position of the actual peak in the spectrum. Because each post-translational modification, like each amino acid, has its own characteristic mass, the mass shift of a gap reveals which post-translational modification has occurred on the amino acids in that gap; a mass shift of 0 indicates that no modification has occurred in the gap.

Next, FIG. 10b shows the peptide list generated by combining the combinable tags among the sequence tags (tkdagtia, KDA, TKD, nvl). The sequence containing tkdagtia and nvl results from combining tkdagtia, KDA and nvl; the sequence containing KDA and nvl results from combining KDA and nvl; and the sequence containing TKD and nvl results from combining TKD and nvl. Gaps [⑨, ⑩, ⑪] have mass shifts [-17.12, 0.07, 0.03], gaps [⑫, ⑬, ⑭] have mass shifts [-17.00, -0.05, 0.03], and gaps [⑮, ⑯, ⑰] have mass shifts [26.03, -43.09, 0.03].

As described above, once the list of gapped peptides, each consisting of a series of sequence tags matched to a candidate peptide, has been generated and the mass shift of each gap has been calculated, the peptide list filtering means 348 deletes the weakly correlated peptides from the list. For example, as shown in FIG. 11, the mass spectrum of the peptide under analysis is divided into a plurality of intervals, and the intensities are normalized so that within each interval the most intense peak has intensity 100 and the remaining peaks are scaled proportionally. The score of each gapped peptide matched to a candidate peptide is then calculated; for each spectrum, the highest-scoring gapped peptide obtained from that spectrum is selected, and every gapped peptide whose score is at or below a specified value (for example, 1/2 of that highest score) is deleted from the gapped peptide list. The score of a gapped peptide can be calculated using [Equation 1].
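The interval normalization and the score cut-off can be sketched as follows. The scoring function itself ([Equation 1]) is not reproduced here, so the sketch takes precomputed scores as input; the function names, the equal-width intervals and the default number of intervals are assumptions, and peak intensities are assumed to be positive.

```python
def normalize_by_interval(peaks, n_intervals=4):
    """Scale intensities so the strongest peak in each m/z interval becomes 100.

    peaks: non-empty list of (mz, intensity) pairs with positive intensities.
    """
    lo = min(mz for mz, _ in peaks)
    hi = max(mz for mz, _ in peaks)
    width = (hi - lo) / n_intervals or 1.0  # guard against a zero-width range

    def bucket(mz):
        # The last interval is closed on the right so the maximum m/z fits in it.
        return min(int((mz - lo) / width), n_intervals - 1)

    top = {}
    for mz, inten in peaks:
        b = bucket(mz)
        top[b] = max(top.get(b, 0.0), inten)
    return [(mz, 100.0 * inten / top[bucket(mz)]) for mz, inten in peaks]


def filter_by_score(scored, fraction=0.5):
    """Keep gapped peptides scoring above `fraction` of the best score."""
    best = max(scored.values())
    return {pep: s for pep, s in scored.items() if s > best * fraction}
```

With a best score of 10.0 and the default fraction of 1/2, a gapped peptide scoring 6.0 is kept and one scoring exactly 5.0 is deleted, matching the "at or below" wording of the text.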

[Equation 1]

Thereafter, the amino acid sequence determination unit 350 determines the amino acid sequence, or the amino acid sequence including a post-translational modification, that fills each gap. To this end, the amino acid sequence determination unit 350 includes a gap sequence matching means 352 and a spectrum generation and comparison means 354.

The gap sequence matching means 354 searches the post-translational modification database for the gapped peptides that remain after the above process. To find the amino acid sequence or post-translational modification of a gap, the gap sequence matching means 354 queries the post-translational modification database with the candidate-peptide sequence corresponding to the gap and the mass shift of the gap and, if a match exists, receives the type and position of the post-translational modification as the result. If the mass shift of a gap is close to 0, it is judged that no post-translational modification occurred in the gap, and the sequence of the previously extracted candidate peptide is taken as the resulting amino acid sequence; if the mass shift is not close to 0, it is judged that the amino acid sequence in the gap carries a post-translational modification, and the post-translational modification database is searched.

Referring to FIG. 12, the database is searched using the sequence (QA) and the mass shift (-17.12) of the first gap of the peptide __tkdagtia__nvl_; because this mass shift is not close to 0, the post-translational modification database is searched. The mass shift of 0.07 for the sequence GL and the mass shift of 0.03 for the sequence R are close to 0, so no post-translational modification is judged to have occurred there. The post-translational modification database holds, for each modification, the mass change it causes, the amino acid it affects, the type of modification and so on, so the modification present can be identified from the input sequence (the affected amino acid) and the mass shift. In the case of FIG. 12, the result is that the peptide under analysis is QAtkdagtiaGLnvlR, that a post-translational modification occurred at amino acid Q, and that the modification is pyroglutamic acid.
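The lookup by gap sequence and mass shift can be sketched with a small in-memory table. The table below is a hypothetical stand-in for the post-translational modification database: it lists three well-known modifications with rounded mass changes, and the function name and both tolerances are assumptions.

```python
# Hypothetical stand-in for the PTM database: (residue, mass change in Da, name).
PTM_TABLE = [
    ("Q", -17.03, "Pyroglutamic acid"),
    ("S", 79.97, "Phosphorylation"),
    ("K", 42.01, "Acetylation"),
]


def lookup_ptm(gap_sequence, mass_shift, tolerance=0.2, zero_tolerance=0.2):
    """Return (position, residue, name) for each PTM consistent with the gap.

    A mass shift close to 0 means the gap is unmodified, so nothing is returned.
    """
    if abs(mass_shift) <= zero_tolerance:
        return []
    return [
        (pos, residue, name)
        for pos, aa in enumerate(gap_sequence)
        for residue, delta, name in PTM_TABLE
        if aa == residue and abs(mass_shift - delta) <= tolerance
    ]
```

For the gaps of the FIG. 12 example, the query ("QA", -17.12) returns pyroglutamic acid at position 0 (residue Q), while ("GL", 0.07) and ("R", 0.03) return no modification.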

As a result, the gaps of each gapped peptide are filled with the candidate peptide sequence, annotated with the type and position of any post-translational modification.

Thereafter, the spectrum generation and comparison means 356 generates a mass spectrum for each candidate peptide obtained by the gap sequence matching means 354 (that is, each gap-filled candidate peptide), compares it with the spectrum under analysis, and outputs the result. This is preferably done for every candidate peptide, with a score assigned to and reported for each.

FIG. 3 is a flowchart illustrating an amino acid sequence analysis method according to an embodiment of the present invention; it is described below with reference to FIGS. 4 to 7, which are detailed flowcharts of the individual steps.

To analyze the amino acid sequence contained in a protein under analysis, two-stage (tandem) mass spectrometry is first performed on the protein and a mass spectrum is generated (S10). The peak selector 310 of the amino acid sequence analysis device 30 then extracts a plurality of relatively intense peaks from the raw data of the tandem mass spectrum (S20). Peaks that are intense locally are selected as well as peaks that are intense overall, and it is preferable to select, for example, 20 to 200 peaks.

Thereafter, using the mass differences between the peaks extracted in the peak selection step, the tag extraction unit 320 extracts short amino acid sequences free of post-translational modifications (that is, sequence tags) contained in the spectrum of the peptide under analysis. The tag extraction step (S30) is described in more detail with reference to FIG. 4. First, the mass difference of every pair of extracted peaks is calculated (S310), and the amino acid corresponding to each mass difference is determined (S320). Then the mass differences formed by multiple peaks are compared with the mass differences of multiple amino acids (S330) to generate a list of short sequence tags that contain no post-translational modification (S340). The sequence tag list contains at least one sequence tag.
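Steps S310 to S340 can be sketched as a small graph walk over the selected peaks. This is an illustrative simplification: the residue table is a subset of monoisotopic residue masses rounded to three decimals, leucine and isoleucine are written as [L|I] as in [Table 1], and the function names, tolerance and minimum tag length are assumptions.

```python
# Subset of monoisotopic residue masses in Da (illustrative, not exhaustive).
RESIDUE_MASS = {
    "G": 57.021, "A": 71.037, "V": 99.068, "T": 101.048,
    "[L|I]": 113.084, "N": 114.043, "D": 115.027, "K": 128.095,
}


def match_residue(diff, tol=0.02):
    """S320: return the residue whose mass matches the peak difference, if any."""
    for aa, mass in RESIDUE_MASS.items():
        if abs(diff - mass) <= tol:
            return aa
    return None


def extract_tags(peak_masses, min_len=2, tol=0.02):
    """S310-S340: chain peak pairs whose mass differences match residues."""
    peaks = sorted(peak_masses)
    # S310/S320: edge i -> j when peaks[j] - peaks[i] equals a residue mass.
    edges = {i: [] for i in range(len(peaks))}
    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            aa = match_residue(peaks[j] - peaks[i], tol)
            if aa:
                edges[i].append((j, aa))

    tags = set()

    def walk(i, tag):
        # S330/S340: every path of at least min_len residues is a sequence tag.
        if len(tag) >= min_len:
            tags.add("".join(tag))
        for j, aa in edges[i]:
            walk(j, tag + [aa])

    for start in edges:
        walk(start, [])
    return sorted(tags)
```

For peaks at 200.0, 314.043, 413.111 and 526.195 the consecutive differences match N, V and [L|I], so the sketch yields the tags NV, NV[L|I] and V[L|I].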

Once the sequence tag list has been generated, the peptide search unit 330 searches the peptide sequence database and extracts the candidate peptides that contain tags from the list (S40). As shown in FIG. 5, each sequence tag is compared with the peptides stored in the peptide sequence database to extract first candidate peptides that match a sequence tag of at least a specified first length (m, for example 3) (S410), and from the first candidate peptides, second candidate peptides that also match a sequence tag of a specified second length (n, for example 2) are extracted (S420). The candidate peptide list is then generated from the second candidate peptides, and the direction in which each sequence tag matches the peptide is also stored (S430).

Steps S10 to S40 are not limited to the methods described above; any method capable of selecting peaks from the protein under analysis, extracting sequence tags and determining candidate peptides may be applied.

Next, all the sequence tags contained in each candidate peptide extracted in the peptide search step (S40) are placed on that peptide and, where several sequence tags can be combined, the combined sequence tag is placed, to generate contiguous sequence tags that contain gaps (S50). Described more specifically with reference to FIG. 6, before the sequence tags are generated, parent-mass filtering (S510) and filtering based on comparing the masses before and after each tag (S520) are performed first; the parent-mass filtering may be omitted. In the parent-mass filtering step (S510), the parent mass of each candidate peptide is compared with the mass of the peptide under analysis, and a candidate peptide is excluded if its parent mass is smaller than the mass of the peptide under analysis by the first set value (for example, 100 Da) or more.

Next, in the step of filtering by comparing the masses before and after a sequence tag (S520), a sequence tag is deleted from the sequence tag list if the mass at the position of its first or last peak is smaller than the mass at the corresponding peak position theoretically calculated from the candidate peptide by the second set value (for example, 100 Da) or more.

Thereafter, the consistency among the sequence tags matched to a candidate peptide is checked and, if two or more sequence tags can be combined, they are combined (S530, S540). Whether one of the two sequence tags is contained in the other, the two overlap, they are adjacent or they are separated, the combinable sequence tags are combined in consideration of the mass relationships described with reference to FIG. 2. When the combining is complete, the list of gapped peptides is generated from the individual sequence tags and the combined sequence tags (S550).

Next, each peptide in the gapped peptide list is filtered (S560). In the peptide list filtering step (S560), a normalization process is performed for each peptide, the score of each gapped peptide matched to a candidate peptide is calculated, the highest-scoring gapped peptide obtained from each spectrum is selected, and every gapped peptide whose score is at or below a specified value (for example, 1/2 of that highest score) is deleted from the gapped peptide list. This peptide list filtering step may be omitted.

After the gapped peptide list has been generated in the tag placement step (S50) described above, the amino acid sequence determination unit 350 uses the sequence and the mass shift of each gap to look up, in the post-translational modification database, the amino acid sequence contained in the peptide or the position and type of the post-translational modification, and thereby determines the gap-filled candidate peptides (S60). Described specifically with reference to FIG. 7, the post-translational modification database is searched for each gapped peptide to identify the amino acid sequence, or the position and type of the post-translational modification, that satisfies the conditions (S610). The amino acid sequence or post-translational modification identified in step S610 is then filled into the gaps to generate a list of gap-filled candidate peptides (S620); a mass spectrum is generated for each candidate peptide in the list and compared with the spectrum of the peptide under analysis (S630), and the result is output (S640).

By comparing the mass spectrum of each candidate peptide with the spectrum of the peptide under analysis, the candidate peptide with the best score can be selected; because the information on the selected peptide (its sequence and the position and type of any post-translational modification) is known, it can be used for protein sequence analysis.

Those skilled in the art to which the present invention pertains will understand that the invention can be embodied in other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims below rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

As described above, in searching for amino acid sequences and post-translational modifications, which is an important part of protein sequence analysis, the present invention extracts the sequence tags contained in the peptide under analysis, determines the candidate peptides that contain them, generates a list of gapped peptides by combining the sequence tags, and then searches for the amino acid sequence or post-translational modification of each gap using the mass shift of the gap and the amino acid sequence of the corresponding part of the candidate peptide. The procedure and time required to search for the amino acid sequences and post-translational modifications contained in a protein can thereby be minimized.

In addition, even when a single protein carries a plurality of post-translational modifications, the position and type of each modification can be determined easily, which can further accelerate the development of proteomics technology.

Claims

An apparatus for analyzing an amino acid sequence contained in a protein, the amino acid sequence analysis device comprising:

a peptide sequence database and a post-translational modification database;

a peak selector configured to receive a mass spectrum from a tandem mass spectrometer that analyzes the mass of a peptide constituting the protein under analysis, and to extract a plurality of peaks from the mass spectrum;

a sequence tag generation unit configured to receive, when a sequence tag list including post-translational modifications is generated from the plurality of peaks extracted by the peak selector, a candidate peptide that contains the sequence tags and was extracted by referring to the peptide sequence database, and to generate a sequence tag having gaps by arranging all the sequence tags contained in the candidate peptide and combining the sequence tags that can be combined; and

an amino acid sequence determination unit configured to generate a gap-filled peptide by replacing, with reference to the post-translational modification database, each gap in the sequence tag generated by the sequence tag generation unit with an amino acid sequence or an amino acid sequence including a post-translational modification.

(Deleted)

The amino acid sequence analysis device of claim 1,

wherein the peak selector selects 20 to 200 peaks.

The amino acid sequence analysis device of claim 1,

further comprising a tag extraction unit that searches for sequence tags containing no post-translational modification on the basis of the mass differences between the selected peaks, and generates a sequence tag list including the sequence tags.

The apparatus of claim 4, wherein the tag extraction unit comprises:

mass calculation means for calculating, for all pairs of peaks extracted by the peak selector, the mass difference of each pair; and

amino acid matching means for comparing the mass difference of each peak pair calculated by the mass calculation means with the mass of each amino acid, searching for an amino acid whose mass equals the mass difference of the corresponding peak pair, and generating a sequence tag list composed of a plurality of amino acids.
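The pairwise mass-difference matching recited above can be sketched in Python as follows. This is an illustrative sketch only, not the claimed implementation: the function names `match_pairs` and `extend_tags`, the tolerance `tol`, the minimum tag length, the reduced residue table, and the assumption of singly charged peaks are all hypothetical choices made for the example.

```python
from itertools import combinations

# Monoisotopic residue masses in Da (standard values; a subset for brevity).
# Leucine and isoleucine share the same mass, so only "L" is listed.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "F": 147.06841,
}

def match_pairs(peaks, tol=0.02):
    """Return (low_mz, high_mz, residue) for every peak pair whose mass
    difference equals a residue mass within `tol` (an assumed tolerance)."""
    edges = []
    for lo, hi in combinations(sorted(peaks), 2):
        diff = hi - lo
        for aa, mass in RESIDUE_MASS.items():
            if abs(diff - mass) <= tol:
                edges.append((lo, hi, aa))
    return edges

def extend_tags(edges, min_len=2):
    """Chain matched pairs that share a peak into sequence tags
    of at least `min_len` residues (an assumed minimum length)."""
    by_start = {}
    for lo, hi, aa in edges:
        by_start.setdefault(lo, []).append((hi, aa))
    tags = set()

    def walk(peak, tag):
        if len(tag) >= min_len:
            tags.add(tag)
        for nxt, aa in by_start.get(peak, []):
            walk(nxt, tag + aa)

    for lo in by_start:
        walk(lo, "")
    return sorted(tags)

# Hypothetical peak list whose consecutive differences spell A, S, P.
peaks = [100.0, 171.03711, 258.06914, 355.1219]
tag_list = extend_tags(match_pairs(peaks))
```

Because only unmodified residue masses are in the table, the resulting tags contain no post-translational modifications, which matches the role of the tag extraction unit described above.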

The apparatus of claim 1, further comprising a peptide search unit that extracts the candidate peptide containing the sequence tag by referring to the peptide sequence database.

The apparatus of claim 6, wherein the peptide search unit comprises:

peptide comparison means for searching whether the sequence tag is contained in the peptide data; and

candidate peptide list determination means for extracting, from the comparison result of the peptide comparison means, at least one peptide containing a sequence tag of at least a specified first length as a first candidate peptide list, extracting as a second candidate peptide each peptide in the first candidate peptide list that contains a sequence tag of a specified second length, and generating a candidate peptide list from the second candidate peptides.

The apparatus of claim 7, wherein the peptide comparison means performs both forward and reverse comparison against the sequence tag.

The apparatus of claim 7, wherein the candidate peptide list determination means stores the matching direction between the candidate peptide list and the sequence tag list.
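A minimal Python sketch of forward and reverse tag comparison with a stored matching direction is given below. It is an illustration under stated assumptions: the function name `find_candidates`, the in-memory list standing in for the peptide sequence database, and the single length threshold `min_len` are hypothetical, and the two-stage first-length and second-length selection is not modeled. Reverse comparison is useful because a tag read from one ion series may run in the opposite direction to the stored peptide sequence.

```python
def find_candidates(tags, peptides, min_len=3):
    """Return {peptide: [(tag, direction, position), ...]} for every tag of at
    least `min_len` residues found in a peptide, read forward or reversed."""
    hits = {}
    for pep in peptides:
        for tag in tags:
            if len(tag) < min_len:
                continue  # short tags are ignored in this sketch
            for direction, probe in (("forward", tag), ("reverse", tag[::-1])):
                pos = pep.find(probe)
                if pos != -1:
                    # the matching direction is stored with the candidate
                    hits.setdefault(pep, []).append((tag, direction, pos))
    return hits

# Hypothetical tag list and peptide list.
candidates = find_candidates(["EPT", "DIT", "GG"], ["PEPTIDE", "GGGG"])
```

In this example "EPT" matches "PEPTIDE" in the forward direction and "DIT" matches only when reversed to "TID"; "GG" is skipped because it is shorter than the assumed minimum length.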

The apparatus of claim 1, wherein the sequence tag generation unit comprises:

tag filtering means for comparing the mass at the position of the first peak or the last peak of each sequence tag with the mass at the corresponding position of the candidate peptide and, if the mass comparison result is smaller than a specified value, deleting the sequence tag from the sequence tag list;

tag binding means for checking whether the sequence tags matched to the candidate peptide can be combined, and combining the sequence tags that can be combined; and

peptide list generation means for generating a list of peptides including blanks from each sequence tag and the combined sequence tags.

The apparatus of claim 10, wherein, before filtering on the masses preceding and following the sequence tag, the tag filtering means further performs parent mass filtering that compares the mass of the candidate peptide with the mass of the peptide extracted from the protein to be analyzed and removes from the list each candidate peptide for which the mass comparison result is smaller than or equal to a first predetermined value.

The apparatus of claim 10, wherein the tag binding means combines two sequence tags when one of the two sequence tags is contained in the other, or when the two sequence tags overlap or are adjacent, and the difference in mass variation between the two sequence tags is close to a reference value.

The apparatus of claim 10, wherein, when two sequence tags are spaced apart, the tag binding means combines the two sequence tags if the mass variation of the blank arising between them after combination falls within a predetermined range.
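The combination rules for contained, overlapping, adjacent, and spaced-apart sequence tags can be illustrated with the following Python sketch. Everything here is an assumption made for illustration: the `PlacedTag` record, the `combine` function, the representation of a tag as a residue index range on the candidate peptide with a mass variation `shift`, and the default values of `ref`, `tol`, and `blank_range` are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass, field

@dataclass
class PlacedTag:
    start: int    # index of the first matched residue in the candidate peptide
    end: int      # index one past the last matched residue
    shift: float  # mass variation (Da) of the tag relative to the candidate
    blanks: list = field(default_factory=list)  # (start, end, delta) blanks

def combine(a, b, ref=0.0, tol=0.5, blank_range=(-150.0, 500.0)):
    """Try to combine two placed tags; return the combined tag or None.
    `ref`, `tol`, and `blank_range` are assumed illustrative settings."""
    if a.start > b.start:
        a, b = b, a
    delta = b.shift - a.shift
    if b.start <= a.end:
        # Contained, overlapping, or adjacent: the difference in mass
        # variation must be close to the reference value.
        if abs(delta - ref) <= tol:
            return PlacedTag(a.start, max(a.end, b.end), a.shift,
                             a.blanks + b.blanks)
        return None
    # Spaced apart: the blank between the tags carries the mass variation,
    # which must fall within the configured range.
    if blank_range[0] <= delta <= blank_range[1]:
        blank = (a.end, b.start, delta)
        return PlacedTag(a.start, b.end, a.shift,
                         a.blanks + [blank] + b.blanks)
    return None

# Hypothetical placements on one candidate peptide.
overlapping = combine(PlacedTag(0, 3, 0.0), PlacedTag(2, 5, 0.1))
spaced = combine(PlacedTag(0, 3, 0.0), PlacedTag(7, 10, 79.97))
rejected = combine(PlacedTag(0, 3, 0.0), PlacedTag(3, 6, 20.0))
```

In the spaced-apart case the combined tag records a blank covering residues 3 to 7 with a mass variation of 79.97 Da, which a later step could try to explain as an amino acid sequence or a post-translational modification.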

The apparatus of claim 10, wherein the sequence tag generation unit further comprises peptide list filtering means for normalizing the mass spectrum obtained by mass analysis of the protein to be analyzed, comparing it with all peptides having blanks, selecting the peptide having the highest peptide score among the peptides having blanks, and deleting from the list all blank-containing peptides whose score is less than or equal to a score determined from that of the selected peptide.

The apparatus of claim 14, wherein the score of a blank-containing peptide is derived from the number of peaks matched with the spectrum to be analyzed, the sum of the normalized intensities of the matched peaks, and the length of the matched peptide.
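The three quantities from which the score is derived can be computed as in the following Python sketch. The claim does not state how they are combined into a single score, so the sketch returns them separately. The function name `score_components`, the matching tolerance `tol`, and normalization by the most intense peak are assumptions made for illustration.

```python
def score_components(observed, theoretical, matched_length, tol=0.02):
    """observed: list of (mz, intensity) pairs from the spectrum to be analyzed.
    theoretical: list of m/z values expected for the blank-containing peptide.
    Returns (matched peak count, summed normalized intensity, matched length)."""
    if not observed:
        return 0, 0.0, matched_length
    max_intensity = max(intensity for _, intensity in observed)
    if max_intensity <= 0:
        return 0, 0.0, matched_length
    count, total = 0, 0.0
    for mz, intensity in observed:
        if any(abs(mz - t) <= tol for t in theoretical):
            count += 1
            # assumed normalization: divide by the most intense observed peak
            total += intensity / max_intensity
    return count, total, matched_length

# Hypothetical spectrum and theoretical peak positions.
components = score_components(
    [(100.0, 50.0), (200.0, 100.0), (300.0, 25.0)],
    [100.01, 300.0, 400.0],
    matched_length=5,
)
```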

The apparatus of claim 1, wherein the amino acid sequence determination unit comprises:

blank sequence matching means for retrieving, from the post-translational modification database, an amino acid sequence or an amino acid sequence including a post-translational modification, by referring to the sequence of the candidate peptide corresponding to the blank portion of the blank-containing peptide and to the mass variation corresponding to the blank portion; and

spectrum generation and comparison means for filling the blank-containing peptide with the amino acid sequence, or the amino acid sequence including the post-translational modification, retrieved by the blank sequence matching means, generating a mass spectrum from the blank-filled peptide, and comparing it with the spectrum to be analyzed.
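The blank sequence matching described above can be sketched in Python as a lookup of the blank's mass variation against a small modification table. This is a simplified illustration, not the claimed implementation: `PTM_TABLE` is a hypothetical stand-in for the post-translational modification database, the function name `explain_blank` and the tolerance `tol` are assumptions, the listed target residues are a common simplification, and only a single modification per blank is considered.

```python
# Monoisotopic residue masses in Da (standard values; a subset for brevity).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "T": 101.04768,
    "K": 128.09496, "M": 131.04049, "R": 156.10111, "Y": 163.06333,
}

# Hypothetical stand-in for the post-translational modification database:
# name -> (monoisotopic mass shift in Da, residues it may modify).
PTM_TABLE = {
    "phosphorylation": (79.96633, "STY"),
    "acetylation": (42.01057, "K"),
    "oxidation": (15.99491, "M"),
    "methylation": (14.01565, "KR"),
}

def explain_blank(candidate_segment, observed_blank_mass, tol=0.02):
    """Explain the mass variation of a blank using the candidate peptide's
    residues at that position. Returns (name, position, residue) triples."""
    expected = sum(RESIDUE_MASS[aa] for aa in candidate_segment)
    delta = observed_blank_mass - expected
    if abs(delta) <= tol:
        return [("unmodified", None, None)]
    results = []
    for name, (shift, targets) in PTM_TABLE.items():
        if abs(delta - shift) <= tol:
            for pos, aa in enumerate(candidate_segment):
                if aa in targets:
                    results.append((name, pos, aa))
    return results

# Hypothetical blank covering "GSA" whose observed mass is 79.96633 Da heavier.
explanation = explain_blank("GSA", 215.0906 + 79.96633)
```

Here the mass variation matches the phosphorylation shift and the only eligible residue in the blank is the serine at position 1, so both the type and the location of the modification are reported.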

The apparatus of claim 1, wherein, in the peptide database, peptides having the same mass are grouped, each peptide sequence is stored in suffix form, and the masses preceding and following the sequence tag contained in each peptide sequence are included.

A method for analyzing an amino acid sequence contained in a protein using a peptide sequence database and a post-translational modification database, the method comprising:

receiving a tandem mass spectrum of each peptide of the protein to be analyzed and extracting a plurality of peaks from the tandem mass spectrum;

extracting, from the mass differences of the extracted peaks, sequence tags not including post-translational modifications that are contained in the spectrum of the peptide to be analyzed, and determining a sequence tag list including the sequence tags;

searching the peptide sequence database to determine candidate peptides that contain the sequence tags;

combining and arranging all sequence tags contained in the candidate peptide and the combinable sequence tags to generate a sequence tag having a blank; and

an amino acid sequence determination step of searching the post-translational modification database, using the amino acid sequence of the candidate peptide corresponding to the blank portion of the sequence tag having the blank and the mass variation of the blank portion, to find the amino acid sequence contained in the blank portion of the peptide to be analyzed, or the location and type of a post-translational modification.

The method of claim 18, wherein 20 to 200 peaks are selected.

(Deleted)

The method of claim 18, wherein determining the sequence tag list comprises:

calculating the mass differences of all pairs of the extracted peaks;

determining the amino acid corresponding to each mass difference between peaks; and

comparing the mass differences between the plurality of peaks with the masses of the plurality of amino acids to generate a sequence tag list not including post-translational modifications.

The method of claim 18, wherein determining the candidate peptide list comprises:

extracting first candidate peptides containing a sequence tag of at least a specified first length by comparing each of the sequence tags with the peptides stored in the peptide sequence database;

extracting, from among the first candidate peptides, second candidate peptides that match a sequence tag of a specified second length; and

generating a candidate peptide list from the second candidate peptides, and storing the matching direction between the sequence tag and the peptide.

The method of claim 18, wherein generating the sequence tag having the blank comprises:

comparing the position of the first peak or the last peak of the sequence tag with the position of the peak theoretically calculated from the candidate peptide, and deleting the sequence tag from the sequence tag list based on the comparison;

combining the combinable sequence tags among the sequence tags included in the sequence tag list; and

generating a list of peptides including blanks, each composed of the sequence tags and the combined sequence tags.

The method of claim 23, further comprising, before the filtering step that compares the masses preceding and following the sequence tag, a parent mass filtering step of comparing the parent mass of each candidate peptide with the mass of the peptide to be analyzed and removing from the list each candidate peptide for which the value obtained by the comparison is smaller than a specified value.

The method of claim 18, wherein the amino acid sequence determination step further comprises, before searching the database, performing normalization for the peptide extracted from the protein to be analyzed, calculating its degree of matching with each blank-containing peptide, and deleting the blank-containing peptides having a poor degree of matching.

The method of claim 18, wherein the amino acid sequence determination step searches for the amino acid sequence or the post-translational modification, fills the blank of the peptide with the amino acid sequence or the post-translational modification to generate a blank-filled peptide, generates a mass spectrum from the blank-filled peptide, and compares it with the spectrum to be analyzed.
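Generating a mass spectrum from a blank-filled peptide and comparing it with the spectrum to be analyzed can be sketched in Python as follows. This sketch assumes singly charged b- and y-ions only, which is a common simplification and is not stated in the claims; the function names `theoretical_ions` and `count_matches`, the `mods` mapping, and the tolerance `tol` are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values; a subset for brevity).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
}
PROTON = 1.00728  # mass of a proton in Da
WATER = 18.01056  # monoisotopic mass of H2O in Da

def theoretical_ions(peptide, mods=None):
    """Return (b_ions, y_ions) as singly charged m/z lists.
    `mods` maps a residue index to a modification mass shift in Da."""
    mods = mods or {}
    masses = [RESIDUE_MASS[aa] + mods.get(i, 0.0) for i, aa in enumerate(peptide)]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # b1 .. b(n-1), from the N-terminus
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # y1 .. y(n-1), from the C-terminus
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions

def count_matches(observed_mz, ions, tol=0.02):
    """Count theoretical ions that have an observed peak within `tol`."""
    return sum(any(abs(o - t) <= tol for o in observed_mz) for t in ions)

# Hypothetical blank-filled peptide "GAS" with a phosphorylated serine.
b_ions, y_ions = theoretical_ions("GAS", mods={2: 79.96633})
matched = count_matches([58.03, 186.02, 500.0], b_ions + y_ions)
```

A blank-filled peptide whose theoretical ions match more of the observed peaks is a better explanation of the spectrum, which is the purpose of the comparison recited in the claim.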