JP2019185224A - Identification quality evaluation method and apparatus for endogenous modified peptide

Info

Publication number: JP2019185224A
Application number: JP2018072566A
Authority: JP
Inventors: Masaki Murase
Original assignee: Shimadzu Corp
Current assignee: Shimadzu Corp
Priority date: 2018-04-04
Filing date: 2018-04-04
Publication date: 2019-10-24

Abstract

PROBLEM TO BE SOLVED: To provide a decoy database with which the false discovery rate (FDR) of a database search by a PTM-compatible endogenous peptide search method can be estimated with high accuracy. SOLUTION: When a reading unit 132 reads data from a target DB 12 that contains the sequences of modified proteins, an amino acid distribution statistics creation unit 133 creates amino acid distribution statistics indicating the frequency of occurrence of each amino acid, and a PTM distribution statistics creation unit 134 creates PTM distribution statistics indicating, for each type of modification, the frequency of occurrence of its modification sites. A decoy sequence generation unit 135 randomly creates as many decoy sequences as there are entries in the original target DB 12, weighting the random choice by the amino acid distribution statistics. A decoy PTM information generation unit 136 creates decoy PTM information for each decoy sequence, weighting the choice by the PTM distribution statistics. Collecting the decoy sequences and the decoy PTM information created in this way yields a decoy database with PTM information added. SELECTED DRAWING: Figure 2

Description

The present invention relates to a method for evaluating the identification quality of techniques that use mass spectrometry to identify endogenous peptides that have undergone post-translational modification (endogenous modified peptides), and to an apparatus therefor.

In recent years, the analysis of protein structure and function has advanced rapidly as post-genomic research. The most widely used data analysis method for comprehensively identifying proteins and peptides by mass spectrometry is the protein sequence database search method (hereinafter simply the “database search method”, abbreviated in places as the “DB search method”). A widely known database search method is the MS/MS Ions Search (see Non-Patent Document 1) included in Mascot, a search engine provided by Matrix Science Ltd. (UK); free search engines such as X! Tandem offer similar functionality.

Because database search methods come in various forms (algorithms), comparing their identification quality is important for selecting an appropriate one. Even with the same method, identification quality changes when the threshold used to judge the score or expectation value of an amino acid sequence hit by the search is changed, so it is also important to control identification quality by setting that threshold appropriately. The false discovery rate (FDR) is widely used as an objective index of identification quality for these purposes (see Non-Patent Document 2).

In general, the false discovery rate of a database search method is estimated with a decoy sequence database, which contains decoy sequences: false sequences that are constructed to resemble the amino acid sequences of natural proteins but do not actually exist. When the mass spectrum information collected by mass spectrometry is searched against the decoy sequence database, the false discovery rate can be estimated by counting the number of decoy sequences that are identified by chance.
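
The counting idea above can be sketched as follows. This is a minimal illustration, assuming separate target and decoy searches of equal database size and the commonly used estimator FDR ≈ (decoy hits) / (target hits) at a score threshold; the text does not fix a particular estimator, and the function name, variable names and scores are hypothetical.

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR as (#decoy hits) / (#target hits) at a score threshold.

    Assumptions: a higher score means a better match, and the decoy
    database has the same size as the target database.
    """
    target_hits = sum(1 for s in target_scores if s >= threshold)
    decoy_hits = sum(1 for s in decoy_scores if s >= threshold)
    if target_hits == 0:
        return 0.0  # nothing identified, so nothing can be falsely identified
    return min(1.0, decoy_hits / target_hits)


# Illustrative scores only (not measured data).
target_scores = [72.0, 65.5, 58.1, 40.2, 33.0, 21.7]
decoy_scores = [35.4, 22.9, 18.3, 12.0, 9.8, 5.1]
print(estimate_fdr(target_scores, decoy_scores, threshold=30.0))
```

Raising the threshold lowers the estimated FDR at the cost of fewer target identifications, which is the trade-off that threshold setting controls.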

In proteome analysis, which analyzes peptides obtained by artificially fragmenting proteins with a digestive enzyme such as trypsin, it is known that the false discovery rate can be estimated simply and with high accuracy by using, as decoy sequences, the amino acid sequences of the proteins recorded in the natural protein sequence database to be searched (hereinafter the “target database”), reversed (N-terminal side exchanged with C-terminal side) either per protein or per fragmented peptide. In contrast, when endogenous peptides produced by unspecified degrading enzymes in the living body are to be identified comprehensively, it has been found that using such end-to-end reversed amino acid sequences as decoy sequences may overestimate the false discovery rate. In such cases, decoy sequences are therefore generated by random rearrangement based on the amino acid sequences of the proteins recorded in the target database.
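
The two decoy constructions mentioned above can be contrasted in a short sketch. It assumes that "random rearrangement" means a random permutation of the residues of each entry, which is one possible reading; the example sequence and function names are hypothetical.

```python
import random


def reversed_decoy(sequence):
    """Reverse the sequence end to end (N-terminus <-> C-terminus)."""
    return sequence[::-1]


def shuffled_decoy(sequence, rng):
    """Randomly permute the residues; amino acid composition is preserved."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


rng = random.Random(0)  # fixed seed only so that the sketch is reproducible
target = "MKWVTFISLLLLFSSAYS"  # hypothetical example sequence
print(reversed_decoy(target))
print(shuffled_decoy(target, rng))
```

Both constructions keep the length and composition of the target entry; they differ in whether local residue order is mirrored or destroyed.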

Many of the proteins and peptides present in living organisms have undergone post-translational modification (hereinafter abbreviated as “PTM” where appropriate), such as the attachment or removal of chemical functional groups or the hydrolysis (cleavage) of peptide bonds, and their functions are thereby regulated dynamically and precisely. Analyzing which modification occurs at which site (which amino acid residue) in the amino acid sequence of a protein in vivo is very important both for life science research and for applying its results in industry. However, conventional general database search methods cannot identify all modified peptides that can occur in vivo in a single search. This is because, once the variations of modification are taken into account, the number of theoretical peptides to be searched, that is, the search space, grows exponentially, and as a result peptide candidates can no longer be narrowed down uniquely on the basis of a statistical confidence measure of the score (see Non-Patent Document 2).

For this reason, in order to identify many modified peptides with acceptable confidence, that is, as a restriction that improves the identification sensitivity for modified peptides, it is common to narrow the types of modification to be searched down to a small number in advance and then perform the database search (the so-called variable modification method). Even then, it is known that identification sensitivity falls when the search covers a modification, such as phosphorylation, that occurs on the side chains of amino acid residues that make up a large fraction of proteins, that is, residues with a high frequency of occurrence. It is likewise known that when cleavage sites, rather than functional-group modifications such as phosphorylation, are searched exhaustively, the search space grows substantially and identification sensitivity falls.

Another effective way to suppress the growth of the search space, which causes the loss of identification sensitivity in modified peptide searches described above, is to restrict the search to known modification sites. For functional-group addition, one such method performs the search after mapping each amino acid that carries a specific functional-group modification at a known modification site to a pseudo amino acid symbol that does not actually exist (see Patent Document 1). For peptide-bond cleavage, another method performs the search after restricting the combinations of cleavages to sequences detected from the sample under measurement in past measurements, sequences reported in the literature as detected from such a sample, or sequences containing at least one partial sequence of those sequences (see Patent Document 2). A common feature of both methods is that they do not search the protein sequence database as it is, but instead construct and search a new, smaller peptide sequence database organized in units of peptides.

As is well known, an endogenous modified peptide is a peptide that has undergone both kinds of modification: modification by functional groups and cleavage. Combining the technique of Patent Document 1 with that of Patent Document 2 is therefore a natural approach to identifying endogenous modified peptides effectively.

However, because the technique of Patent Document 1 encodes each modified amino acid as a pseudo amino acid, as described above, a peptide with, for example, three known phosphorylation sites on its amino acid sequence must be expanded into 2³ = 8 peptide sequences, one for each combination of phosphorylated and unphosphorylated sites, and all of them must be recorded in the database. Moreover, endogenous peptides are generally large and a single peptide is expected to contain many known modification sites, so applying the technique of Patent Document 1 as it is makes the number of possible combinations of modification sites enormous. The technique of Patent Document 2 likewise requires a huge number of peptides to be recorded in the database. Consequently, an attempt to combine the methods of Patent Documents 1 and 2 as described above produces a database too large to implement in practice.
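
The combinatorial growth described above can be made concrete with a small sketch. It enumerates every modified/unmodified combination over a set of known sites; the symbol "#" stands in for a pseudo amino acid symbol and, like the peptide and the function name, is a hypothetical choice for illustration only.

```python
from itertools import product


def expand_modified_forms(sequence, sites, pseudo_symbol="#"):
    """Enumerate every modified/unmodified combination over the known sites.

    Each modified residue is replaced by a pseudo symbol; "#" is a
    hypothetical placeholder, not a symbol taken from the cited method.
    """
    forms = []
    for states in product((False, True), repeat=len(sites)):
        residues = list(sequence)
        for site, modified in zip(sites, states):
            if modified:
                residues[site] = pseudo_symbol
        forms.append("".join(residues))
    return forms


forms = expand_modified_forms("ASPTYK", sites=[1, 3, 4])  # 0-based S, T, Y
print(len(forms))  # 2**3 = 8
```

With n known sites the list holds 2**n sequences, so ten sites on one peptide already require 1024 database entries.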

Patent Document 1: JP 2013-47624 A
Patent Document 2: International Publication No. WO 2017/047580

Non-Patent Document 1: “Matrix Science - Mascot - MS/MS Ions Search”, [online], Matrix Science Ltd., [retrieved April 3, 2018], Internet <URL: http://www.matrixscience.com/cgi/search_form.pl?FORMVER=2&SEARCH=MIS>
Non-Patent Document 2: Akiyasu Yoshizawa, “Which database to use: database search and sequence analysis, misunderstandings and difficult problems”, Proteomics Letters, 2016, 1:63-80, [online], Japanese Proteomics Society, [retrieved April 3, 2018], Internet <URL: https://www.jhupo.org/data/proteomeletters/16008.pdf>
Non-Patent Document 3: Siwy J. and 4 others, “Human urinary peptide database for multiple disease biomarker discovery”, Proteomics Clinical Applications, 2011, Vol. 5, pp. 367-374
Non-Patent Document 4: “PEFF - PSI Extended Fasta Format”, PSI, [online], [retrieved April 3, 2018], Internet <URL: http://www.psidev.info/node/363>

To reduce the size of the database described above, the modification information for functional-group modifications needs to be handled independently of the amino acid sequence, as additional information attached to individual amino acid residues on the sequence, so that only the information needed at search time is dynamically expanded in memory into modified peptides and searched. However, estimating the false discovery rate in order to evaluate the identification quality of an endogenous modified peptide search method devised in this way (hereinafter the “PTM-compatible endogenous peptide search method”) raises the following problems.
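
One way to picture "modification information kept apart from the sequence and expanded only at search time" is the sketch below. It is not the search method itself: the `Entry` record, the `max_mods` cap and the window arguments are all hypothetical, introduced only to show lazy expansion with a generator.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Entry:
    sequence: str
    # 0-based residue index -> modification name, kept apart from the sequence
    ptm_sites: dict = field(default_factory=dict)


def iter_modified_peptides(entry, start, end, max_mods=2):
    """Lazily yield (peptide, applied_mods) for the window [start, end).

    Only sites inside the window are considered and nothing is stored:
    each modified form exists only while the caller is scoring it.
    """
    peptide = entry.sequence[start:end]
    local = sorted(
        (i - start, mod) for i, mod in entry.ptm_sites.items() if start <= i < end
    )
    for k in range(min(max_mods, len(local)) + 1):
        for combo in combinations(local, k):
            yield peptide, combo


entry = Entry("MASPTYKRSG", {2: "Phospho", 4: "Phospho", 8: "Phospho"})
for peptide, mods in iter_modified_peptides(entry, 1, 7):
    print(peptide, mods)
```

The database stores one sequence plus three annotations rather than every expanded form, and the expansion cost is paid only for the peptide windows that are actually examined.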

(1) The PTM-compatible endogenous peptide search method described above requires post-translational modification information in addition to amino acid sequence information. However, the conventional general methods for creating a decoy database described above can produce a decoy database containing decoy amino acid sequences but yield no decoy modification information. A search using such a database therefore does not properly cover modified peptides, and the false discovery rate is underestimated. In other words, conventional general methods for creating a decoy database cannot estimate the false discovery rate of the PTM-compatible endogenous peptide search method accurately enough for practical use.

(2) The protein sequence database or peptide sequence database used in the PTM-compatible endogenous peptide search method described above is either a subset obtained by extracting the amino acid sequences of sample-specific proteins from the amino acid sequences of commonly used proteins, or a database containing the amino acid sequences of peptides that form a further subset of that subset. In the PTM-compatible endogenous peptide search method, the number of identifications can therefore be increased while the false discovery rate is held to a given value by appropriately integrating the search results obtained from separate searches against several different sequence databases. For example, a peptide identified in more than one of the sets of identification results obtained by searching several different sequence databases can be presumed to be a high-confidence peptide identified reproducibly under different search conditions. However, the probability that decoy sequences generated randomly and separately for several different target databases coincide is extremely small. With conventional general methods for generating decoy sequences, it is therefore not possible to obtain a false discovery rate for peptides hit in common among the results identified using several different databases.

The present invention has been made to solve the above problems, and its main object is to provide a method and an apparatus for evaluating the identification quality of endogenous modified peptides that can accurately determine the false discovery rate used to evaluate the identification quality of a database search method for identifying modified endogenous peptides (endogenous modified peptides).

The method for evaluating the identification quality of endogenous modified peptides according to the present invention, made to solve the above problems, is a method for evaluating the identification quality when endogenous modified peptides are identified by a database search based on mass spectrometry results, in which an index value for evaluating the identification quality is calculated using decoy sequences, the method comprising:
a) an information reading step of reading, from a protein sequence database to which post-translational modification information has been added, the amino acid sequence information and the post-translational modification information of the modified proteins recorded therein;
b) an amino acid distribution statistics acquisition step of obtaining, from the amino acid sequence information obtained in the information reading step, amino acid distribution statistical information indicating the frequency of occurrence of each amino acid;
c) a post-translational modification distribution statistics acquisition step of obtaining, from the post-translational modification information obtained in the information reading step, post-translational modification distribution statistical information indicating the frequency of occurrence of each post-translational modification;
d) a decoy sequence generation step of generating, from the amino acid sequence information obtained in the information reading step and the amino acid distribution statistical information, decoy sequences that resemble genuine amino acid sequences;
e) a decoy post-translational modification information generation step of generating, from the decoy sequences generated in the decoy sequence generation step and the post-translational modification distribution statistical information, decoy post-translational modification information including the type of modification and the modification site; and
f) a decoy database creation step of creating a decoy database containing decoy sequences with post-translational modification information added, by integrating the decoy sequences generated in the decoy sequence generation step with the decoy post-translational modification information generated in the decoy post-translational modification information generation step.

The apparatus for evaluating the identification quality of endogenous modified peptides according to the present invention is an apparatus that carries out the above identification quality evaluation method. It evaluates the identification quality when endogenous modified peptides are identified by a database search based on mass spectrometry results, calculates an index value for evaluating the identification quality using decoy sequences, and comprises:
a) an information reading unit that reads, from a protein sequence database to which post-translational modification information has been added, the amino acid sequence information and the post-translational modification information of the modified proteins recorded therein;
b) an amino acid distribution statistics acquisition unit that obtains, from the amino acid sequence information obtained by the information reading unit, amino acid distribution statistical information indicating the frequency of occurrence of each amino acid;
c) a post-translational modification distribution statistics acquisition unit that obtains, from the post-translational modification information obtained by the information reading unit, post-translational modification distribution statistical information indicating the frequency of occurrence of each post-translational modification;
d) a decoy sequence generation unit that generates, from the amino acid sequence information obtained by the information reading unit and the amino acid distribution statistical information, decoy sequences that resemble genuine amino acid sequences;
e) a decoy post-translational modification information generation unit that generates, from the decoy sequences generated by the decoy sequence generation unit and the post-translational modification distribution statistical information, decoy post-translational modification information including the type of modification and the modification site; and
f) a decoy database creation unit that creates a decoy database containing decoy sequences with post-translational modification information added, by integrating the decoy sequences generated by the decoy sequence generation unit with the decoy post-translational modification information generated by the decoy post-translational modification information generation unit.

In the present invention, the “protein sequence database to which post-translational modification information has been added” may be, for example, UniProtKB, provided by the Swiss Institute of Bioinformatics (SIB) and others. Instead of such an existing protein database in its entirety, an appropriately extracted part of it may also be used as the “protein sequence database to which post-translational modification information has been added”.

The present invention is characterized by the procedure for creating the decoy database used when evaluating the identification quality of a database search method for identifying endogenous modified peptides, and by the processing performed during that creation. Specifically, in the identification quality evaluation apparatus according to the present invention, when amino acid sequence information is read from the protein sequence database to which post-translational modification information has been added, the amino acid distribution statistics acquisition unit creates, from the amino acid sequence information obtained, amino acid distribution statistical information indicating the frequency of occurrence of each amino acid. Likewise, when post-translational modification information is read from that database, the post-translational modification distribution statistics acquisition unit creates, from the post-translational modification information obtained, post-translational modification distribution statistical information indicating the frequency of occurrence of each post-translational modification and the frequency of the sites to which each post-translational modification attaches.
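
The two kinds of statistics described above can be sketched as simple counting. The entry layout (a sequence plus a list of position/modification pairs), the toy entries and the function names are assumptions made for illustration; a real target database such as UniProtKB would need its own parser.

```python
from collections import Counter, defaultdict


def amino_acid_statistics(entries):
    """Relative frequency of each amino acid over all target sequences."""
    counts = Counter()
    for sequence, _ptms in entries:
        counts.update(sequence)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def ptm_statistics(entries):
    """Count, per modification type, how often it occurs and on which residue."""
    type_counts = Counter()
    site_counts = defaultdict(Counter)
    for sequence, ptms in entries:
        for position, mod in ptms:  # position is a 0-based residue index
            type_counts[mod] += 1
            site_counts[mod][sequence[position]] += 1
    return type_counts, site_counts


# Hypothetical toy entries: (sequence, [(position, modification), ...])
entries = [
    ("MASPTK", [(2, "Phospho"), (4, "Phospho")]),
    ("GSKAAY", [(1, "Phospho"), (2, "Acetyl")]),
]
aa_freq = amino_acid_statistics(entries)
type_counts, site_counts = ptm_statistics(entries)
print(aa_freq["A"], type_counts["Phospho"], dict(site_counts["Phospho"]))
```

Keeping the per-type counts and the per-residue counts separate allows a modification type to be drawn first and a matching residue second when decoy modifications are generated.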

The decoy sequence generation unit then generates false decoy sequences, for example by starting from the genuine amino acid sequence information obtained by the reading and randomly replacing amino acids with weighting according to the amino acid distribution statistical information. The decoy post-translational modification information generation unit further determines at random, for each generated decoy sequence and on the basis of the post-translational modification distribution statistical information, the post-translational modification molecule to attach and the site (amino acid residue) to which it attaches. That is, in the present invention, amino acid distribution statistical information and post-translational modification distribution statistical information are each obtained from the protein sequence database to which post-translational modification information has been added, and are used to determine decoy sequences, which are false amino acid sequences, and decoy modifications, which are false post-translational modifications. The decoy database creation unit then integrates the generated decoy sequences with the decoy post-translational modifications to create a decoy database containing decoy sequences with post-translational modification information added. By searching the mass spectrum information against the decoy database created in this way with a given database search method, and counting the number of decoy sequence identifications that satisfy the identification criterion (exceed the identification threshold), the false discovery rate can be estimated with high accuracy even for post-translationally modified proteins or peptides.
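
The generation steps described above might look like the following sketch. It is one possible reading, not the patented implementation: residues are drawn independently with weights from the amino acid statistics, each decoy receives as many modification sites as its target entry had (an assumption), and the statistics are given as hypothetical literal values so that the block runs on its own.

```python
import random


def generate_decoy_sequence(target_sequence, aa_freq, rng):
    """Draw each residue independently, weighted by the amino acid statistics."""
    amino_acids = sorted(aa_freq)
    weights = [aa_freq[aa] for aa in amino_acids]
    return "".join(rng.choices(amino_acids, weights=weights, k=len(target_sequence)))


def generate_decoy_ptms(decoy_sequence, n_sites, type_counts, site_counts, rng):
    """Pick modification types and residues weighted by the PTM statistics."""
    mods = sorted(type_counts)
    mod_weights = [type_counts[m] for m in mods]
    decoy_ptms = {}  # position -> modification
    for _ in range(n_sites):
        mod = rng.choices(mods, weights=mod_weights, k=1)[0]
        # Candidates: unused positions whose residue was observed with this mod.
        candidates = [i for i, aa in enumerate(decoy_sequence)
                      if i not in decoy_ptms and site_counts[mod].get(aa, 0) > 0]
        if not candidates:
            continue  # the decoy has no suitable residue for this modification
        weights = [site_counts[mod][decoy_sequence[i]] for i in candidates]
        decoy_ptms[rng.choices(candidates, weights=weights, k=1)[0]] = mod
    return sorted(decoy_ptms.items())


def build_decoy_database(entries, aa_freq, type_counts, site_counts, seed=0):
    """Integrate decoy sequences and decoy PTM information, one per target entry."""
    rng = random.Random(seed)
    decoy_db = []
    for sequence, ptms in entries:
        decoy = generate_decoy_sequence(sequence, aa_freq, rng)
        decoy_ptms = generate_decoy_ptms(decoy, len(ptms), type_counts, site_counts, rng)
        decoy_db.append((decoy, decoy_ptms))
    return decoy_db


# Hypothetical statistics and target entries, for illustration only.
aa_freq = {"A": 0.25, "S": 0.25, "T": 0.2, "K": 0.2, "G": 0.1}
type_counts = {"Phospho": 3, "Acetyl": 1}
site_counts = {"Phospho": {"S": 2, "T": 1}, "Acetyl": {"K": 1}}
entries = [("ASTKGA", [(1, "Phospho")]), ("KSSTAG", [(0, "Acetyl"), (2, "Phospho")])]
decoy_db = build_decoy_database(entries, aa_freq, type_counts, site_counts)
for decoy, decoy_ptms in decoy_db:
    print(decoy, decoy_ptms)
```

Because a decoy modification is only placed on a residue type that carried that modification in the target data, the decoy entries stay chemically plausible for the search engine.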

In a preferred aspect, the method for evaluating the identification quality of endogenous modified peptides according to the present invention may further comprise:
a database dividing step of dividing a target database, which is a protein sequence database to which post-translational modification information has been added, into a plurality of subset databases that share no entries;
a subset decoy database creation step of creating a subset decoy database for each of the plurality of subset databases by performing the information reading step, the amino acid distribution statistics acquisition step, the post-translational modification distribution statistics acquisition step, the decoy sequence generation step, the decoy post-translational modification information generation step, and the decoy database creation step; and
a decoy database integration step of creating a whole-set decoy database by integrating the plurality of subset decoy databases created in the subset decoy database creation step in correspondence with the target database.

According to this preferred aspect, the decoy sequences and decoy post-translational modifications contained in each subset decoy database reflect the genuine amino acid composition and post-translational modifications contained in the corresponding subset database. In other words, each subset decoy database is a decoy database with which the false discovery rate can be determined with high accuracy. Because the whole-set decoy database integrates the information contained in these subset decoy databases, using it makes it possible to determine with high accuracy the false discovery rate that applies when the search results of a search method using a plurality of subset databases are integrated.
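
The divide, build and integrate flow of this preferred aspect can be sketched as below. To stay short, the sketch handles sequences only and omits decoy PTM information; the round-robin split and all names are assumptions, since the text does not say how entries are assigned to subsets.

```python
import random
from collections import Counter


def split_database(entries, n_subsets):
    """Partition entries into disjoint subsets (round-robin; an assumption)."""
    subsets = [[] for _ in range(n_subsets)]
    for index, entry in enumerate(entries):
        subsets[index % n_subsets].append(entry)
    return subsets


def build_subset_decoy(subset, rng):
    """Build decoys whose residues follow this subset's own composition."""
    counts = Counter("".join(subset))
    amino_acids = sorted(counts)
    weights = [counts[aa] for aa in amino_acids]
    return ["".join(rng.choices(amino_acids, weights=weights, k=len(seq)))
            for seq in subset]


def build_whole_decoy(entries, n_subsets, seed=0):
    """Create one decoy set per subset, then integrate them into a whole set."""
    rng = random.Random(seed)
    subset_decoys = [build_subset_decoy(s, rng)
                     for s in split_database(entries, n_subsets)]
    # Integration by concatenation: every decoy in the whole set is also
    # a member of exactly one subset decoy database.
    whole = [decoy for decoys in subset_decoys for decoy in decoys]
    return subset_decoys, whole


entries = ["AAAA", "KKKK", "AAGG", "KKRR"]  # hypothetical target sequences
subset_decoys, whole = build_whole_decoy(entries, n_subsets=2)
print(subset_decoys)
print(whole)
```

Since the whole-set decoys are the same objects as the subset decoys, a decoy hit found in a subset search can be matched against hits from the other searches, which independently randomized decoy sets would not allow.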

According to the present invention, the false discovery rate used to evaluate the identification quality of a database search method for identifying peptides that have undergone post-translational modification, specifically a PTM-compatible endogenous peptide search method or the like, can be determined accurately.

FIG. 1 is a block diagram of a modified protein identification evaluation apparatus according to one embodiment of the present invention.
FIG. 2 is a block diagram of the main part centered on the decoy database creation unit in FIG. 1.
FIG. 3 is a flowchart showing the flow of processing when the decoy database is created.
FIG. 4 is a schematic diagram showing an example of the information obtained in the course of creating the decoy database.
FIG. 5 is a diagram showing an example of an endogenous peptide database.
FIG. 6 is a graph showing, for a specific example of decoy database creation, the relationship between the ratio of the PTM occurrence frequency in the decoy sequences to that in the target sequences and the total number of registered PTM sites on the decoy sequences.
FIG. 7 is a diagram showing, for a specific example of decoy database creation, the estimated relationship between the number of truly identified spectra and the FDR.
FIG. 8 is a diagram showing, for a specific example of decoy database creation, a comparison between the search results of the variable modification method and those of the PTM-compatible endogenous peptide search method.
FIG. 9 is a Venn diagram showing the identification results of the three methods shown in FIG. 8.

An embodiment of the endogenous modified peptide identification quality evaluation apparatus according to the present invention will be described with reference to the accompanying drawings.
FIG. 1 is a block diagram of the endogenous modified peptide identification quality evaluation apparatus according to this embodiment.

This endogenous modified peptide identification quality evaluation apparatus 1 estimates the quality of identification, specifically the false discovery rate, of a peptide identification unit 11 that identifies the amino acid sequence and post-translational modifications of endogenous modified peptides contained in a sample to be measured. In addition to the peptide identification unit 11 under evaluation, the apparatus 1 includes a target database 12, a decoy database creation unit 13, a decoy database 14, a false discovery rate estimation unit 15, and the like. Typically, the functions other than the databases 12 and 14 can be implemented by running dedicated software installed on a computer. In that case, the decoy database creation unit 13 and the false discovery rate estimation unit 15 are, in effect, functions implemented in software. The target database 12 and the decoy database 14 are physically storage devices in which data are stored in a predetermined format.

Note that the mass spectrometry unit 2 and the spectrum analysis unit 3 shown in FIG. 1 are components needed to input MS² spectrum information to the peptide identification unit 11 when identifying endogenous modified peptides in a sample; they are not used when evaluating the identification quality of the peptide identification unit 11.

That is, when identifying endogenous modified peptides in a sample, the mass spectrometry unit 2 performs MS² measurement on a sample containing the target endogenous modified peptide and acquires an MS² spectrum (or an MSⁿ spectrum with n of 3 or more). The spectrum analysis unit 3 detects peaks derived from product ions in the MS² (or MSⁿ) spectrum and creates a peak list that collects information such as the mass-to-charge ratio, signal intensity, and charge state of each peak. The peak list is then input to the peptide identification unit 11 as mass spectrum information. The peptide identification unit 11 refers to the amino acid sequences and post-translational modification information of the proteins or peptides stored in the target database 12, performs a search based on the supplied measured mass spectrum information, and identifies the target endogenous modified peptide.

To obtain the false discovery rate as the identification quality of the peptide identification unit 11, a search must be performed against a decoy database that contains false decoy sequences and false decoy post-translational modification information (hereinafter "decoy PTM information") instead of the database that contains the genuine peptide sequences and post-translational modification information. The decoy database creation unit 13 has the function of creating the decoy database 14 for this purpose.

FIG. 2 is a block diagram of the main part centered on the decoy database creation unit 13 in FIG. 1, FIG. 3 is a flowchart showing the flow of processing when creating a decoy database, and FIG. 4 is a schematic diagram showing an example of information obtained in the course of creating a decoy database.

As shown in FIG. 2, the decoy database creation unit 13 includes, as functional blocks, a database division unit 131, a data reading unit 132, an amino acid distribution statistical information creation unit 133, a PTM distribution statistical information creation unit 134, a decoy sequence generation unit 135, a decoy PTM information generation unit 136, a subset decoy database construction unit 137, a whole-set decoy database construction unit 138, and the like.
An example of the procedure for creating a decoy database in the decoy database creation unit 13 is as follows.

The database division unit 131 reads the target database 12, which is a database of modified proteins such as a human protein database (for example, the well-known UniProtKB), and divides it into a plurality of subset databases that share no entries, that is, that have no duplicated entries (step S1). The data reading unit 132 then reads data from one of the subset databases (step S2).

The amino acid distribution statistical information creation unit 133 examines the appearance frequency of each amino acid based on the amino acid sequences of the modified proteins or modified peptides in the read data and creates amino acid distribution statistical information indicating those frequencies (step S3). Meanwhile, the PTM distribution statistical information creation unit 134 examines, for each post-translational modification in the modified proteins or modified peptides in the read data, the appearance frequency of its modification sites (the types of amino acid to which it is attached) and creates PTM distribution statistical information indicating those frequencies (step S4).
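Steps S3 and S4 amount to two counting passes over the entries that were read. The following Python sketch illustrates one possible form under stated assumptions: the entry layout (a sequence string plus a list of 1-based (position, modification name) pairs), the function names, and the definition of the PTM frequency as the fraction of residues of a given type that carry the modification are all chosen for illustration and are not taken from this description.

```python
from collections import Counter

def amino_acid_distribution(entries):
    """Step S3 (sketch): relative frequency of each amino acid over all sequences."""
    counts = Counter()
    for sequence, _ptms in entries:
        counts.update(sequence)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

def ptm_distribution(entries):
    """Step S4 (sketch): for each (modification, residue type), the fraction of
    residues of that type that carry the modification (an assumed definition)."""
    residue_counts = Counter()
    modified_counts = Counter()
    for sequence, ptms in entries:
        residue_counts.update(sequence)
        for position, mod in ptms:  # positions are 1-based
            modified_counts[(mod, sequence[position - 1])] += 1
    return {key: n / residue_counts[key[1]] for key, n in modified_counts.items()}

# Hypothetical toy entries, not data from the target database.
entries = [
    ("APGPMK", [(2, "Oxidation"), (5, "Oxidation")]),
    ("PKPA", [(3, "Oxidation")]),
]
aa_freq = amino_acid_distribution(entries)
ptm_freq = ptm_distribution(entries)
```

Keeping the two tables separate mirrors the split between units 133 and 134: the first drives sequence generation, the second drives modification-site selection.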

Next, for each protein entry read by the data reading unit 132 in step S2, the decoy sequence generation unit 135 creates a decoy sequence with the same sequence length (that is, the same number of amino acid residues) as the amino acid sequence of that entry by arranging amino acids at random under weighting according to the amino acid distribution statistical information created in step S3 (step S5). Alternatively, a decoy sequence may be created by randomly replacing the amino acids in the genuine amino acid sequence under weighting according to the amino acid distribution statistical information, or by rearranging them by permutation (excluding sequence reversal). This yields false decoy sequences that reflect the appearance frequencies of the amino acids in the amino acid sequences of the genuine modified proteins or modified peptides. Note that, in place of statistics computed from the sequences of the subset database itself, the amino acid distribution statistical information of protein sequences forming a superset of the target sequences (for example, the protein sequences registered in UniProtKB for the species under analysis) may be used.
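The weighted random arrangement of step S5 can be sketched as independent weighted draws per residue. This is a minimal illustration only: the function names, the fixed seed, and the toy frequency table are assumptions, and the substitution and permutation variants mentioned above are not shown.

```python
import random

def generate_decoy_sequence(length, aa_frequencies, rng):
    """Draw `length` residues independently, weighted by amino acid frequency."""
    residues = list(aa_frequencies)
    weights = [aa_frequencies[aa] for aa in residues]
    return "".join(rng.choices(residues, weights=weights, k=length))

def generate_decoys(target_sequences, aa_frequencies, seed=0):
    """One decoy per target entry, each with the same length as its target."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    return [generate_decoy_sequence(len(s), aa_frequencies, rng)
            for s in target_sequences]

# Hypothetical frequencies; real ones would come from step S3.
aa_frequencies = {"A": 0.5, "P": 0.3, "K": 0.2}
decoys = generate_decoys(["APKAP", "KPA"], aa_frequencies)
```

Because every decoy keeps the length of its target entry, the decoy set has the same number of entries and the same length distribution as the subset database it was derived from.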

Next, for each decoy sequence generated in step S5, the decoy PTM information generation unit 136 randomly determines the modification sites (amino acid residues) for each type of modification under weighting according to the PTM distribution statistical information created in step S4 (step S6). This yields false decoy PTM information that reflects the appearance frequency of post-translational modifications and of their modification sites in the genuine modified proteins or modified peptides. In step S6, the weighting may be corrected so that the distribution of the number of modification sites (or the total number of modification sites over all peptides) matches between the peptides generated from the target sequences and the decoy sequences.
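One way to picture step S6 is a per-residue random decision whose probability follows the PTM distribution statistics. The sketch below assumes the PTM table is keyed by (modification, residue type) with a per-residue frequency, and it models the optional weighting correction as a single multiplicative `scale` factor; both choices are illustrative assumptions rather than the procedure fixed by this description.

```python
import random

def assign_decoy_ptms(decoy_sequence, ptm_frequencies, rng, scale=1.0):
    """Mark each candidate residue as modified with probability scale * frequency.

    Returns a list of (1-based position, modification name) pairs.
    """
    sites = []
    for position, residue in enumerate(decoy_sequence, start=1):
        for (mod, target_residue), freq in ptm_frequencies.items():
            probability = min(1.0, scale * freq)  # cap so it stays a probability
            if residue == target_residue and rng.random() < probability:
                sites.append((position, mod))
    return sites

rng = random.Random(0)
always = assign_decoy_ptms("APKP", {("Oxidation", "P"): 1.0}, rng)
never = assign_decoy_ptms("APKP", {("Oxidation", "P"): 0.0}, rng)
```

With `scale` above 1.0 more decoy sites are registered, which is the lever used when the decoy sequences would otherwise carry fewer modification sites than the target database.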

The subset decoy database construction unit 137 collects the decoy sequences generated in step S5 and the decoy PTM information generated in step S6, integrates them with at least part of the modified protein or modified peptide sequences recorded in the original subset database, and creates a decoy subset database containing the false information (step S7). It is then determined whether steps S2 to S7 have been executed for all of the subset databases created by the division in step S1 (step S8). If an unprocessed subset database remains, the process returns from step S8 to step S2, and steps S2 to S7 are executed for it. In this way, a decoy subset database is created for every subset database.

Further, the whole-set decoy database construction unit 138 integrates the plurality of subset decoy databases generated in step S7 to create a whole-set decoy database corresponding to the target database before the division in step S1, and registers it as the decoy database 14 (step S9).
Through the above processing, a decoy database corresponding to a modified protein database such as UniProtKB can be created.

Next, a specific example of decoy database creation and its evaluation results are described.
Here, starting from the peptide sequences registered in the urinary peptide sequence database published by Mosaiques (Germany) (Mosaiques DB; see Non-Patent Document 3), peptide variant sequences (10 to 30 residues) were created by the method described in Patent Document 2, that is, by extending and shortening each peptide with reference to the full-length sequence of its precursor protein, and were collected to build a peptide sequence database for endogenous peptide identification. In this peptide sequence database, the modified amino acids that had undergone oxidation (including hydroxylation), the PTM registered in the Mosaiques DB, were registered as known post-translational modification information. The PEFF format proposed by the PSI (Proteomics Standards Initiative) (see Non-Patent Document 4) was adopted as the database format, and the post-translational modification information was written in the header information of each peptide sequence entry, giving the target database for endogenous peptide identification.

FIG. 5 shows an example of the endogenous peptide database thus created. Using the PEFF format, the accession information records the accession ID of the peptide's precursor protein (P02641) together with the positions of the peptide's N-terminal amino acid residue (705) and C-terminal amino acid residue (723) on the full-length precursor sequence. The post-translational modification information is listed under the ModResUnimod header key. For example, (8|UNIMOD:35|Oxidation) in FIG. 5 means that proline, the eighth amino acid residue of the peptide, undergoes oxidation, the modification with Unimod ID 35.
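Reading the modification items back out of such a header value can be sketched with a regular expression. Only the single item shape (position|UNIMOD:id|name) comes from the example above; the assumption that several items are concatenated back to back, the second item in the sample string, and the function name are illustrative.

```python
import re

# One item looks like (8|UNIMOD:35|Oxidation): position, Unimod ID, name.
ITEM = re.compile(r"\((\d+)\|UNIMOD:(\d+)\|([^)|]+)\)")

def parse_mod_res_unimod(value):
    """Return (position, unimod_id, name) tuples found in a ModResUnimod value."""
    return [(int(pos), int(uid), name) for pos, uid, name in ITEM.findall(value)]

# The second item is hypothetical and only demonstrates multiple sites.
sites = parse_mod_res_unimod("(8|UNIMOD:35|Oxidation)(11|UNIMOD:35|Oxidation)")
```

The parsed positions are relative to the peptide entry, so they can be fed directly into the per-residue PTM counting of step S4.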

As the processing corresponding to step S2 in FIG. 3, the amino acid sequence information and post-translational modification information of the registered peptides were read from the target database, that is, the endogenous peptide database described above. In addition, amino acid sequence information was read from Swiss-Prot, a superset of this target database. Next, as the processing corresponding to step S3 in FIG. 3, for the amino acid information read from Swiss-Prot and the target database, the appearance frequencies of the amino acids, obtained after excluding amino acids that overlap on the protein sequence, were acquired as the amino acid distribution statistical information.

As the processing corresponding to step S5 in FIG. 3, sequences matching the sequence lengths of the peptide sequences registered in the target database were generated at random based on the amino acid distribution statistical information obtained from the Swiss-Prot database as described above, and these were used as the decoy sequences. As the processing corresponding to step S4 in FIG. 3, the PTM appearance frequency for each amino acid (the target sequence PTM appearance frequency) was first determined from the post-translational modification information obtained as described above, after excluding amino acids that overlap on the protein sequence. The results are shown in Table 1, where Pro is proline, Lys is lysine, and Met is methionine. Next, the total number of registered PTM sites on the peptide sequences registered in the target database, overlaps included, was determined. The results are shown in Table 2.

Furthermore, the total number of modifiable amino acids on the decoy sequences obtained by decoy sequence generation (overlaps included) was determined. The results are shown in Table 3. In Table 3, the figures in parentheses in the total column are the ratios to the number of modifiable amino acids in the target database. As can be seen, the total number of modifiable amino acids with overlaps included is a little under 70% of that of the target sequences, so the two do not agree with sufficient reliability.

Therefore, PTM sites were selected at random from the modifiable amino acids on the decoy sequences under weighting by a PTM appearance frequency proportional to the target sequence PTM appearance frequency, and the total number of registered PTM sites on the decoy sequences (overlaps included) was determined. FIG. 6 is a graph showing the relationship between the ratio of the decoy sequence PTM appearance frequency to the target sequence PTM appearance frequency and the total number of registered PTM sites on the decoy sequences. From FIG. 6, it is estimated that when this ratio is about 4.5, the total number of registered PTM sites on the decoy sequences becomes comparable to that of the target database. Table 4 shows the number of decoy PTM sites in the decoy DB when the ratio is 4.5. In Table 4, the figures in parentheses in the total column are the ratios to the total number of registered PTM sites in the target database.
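Choosing the ratio so that the decoy and target PTM site totals match can be viewed as a one-dimensional scan, which is essentially what reading the value off the graph does. The sketch below uses the expected number of sites instead of a random draw so that it is deterministic. All numbers are toy values chosen for illustration and are not the values of Tables 1 to 4, and the function names and candidate grid are assumptions.

```python
def expected_decoy_sites(decoy_residue_counts, target_ptm_freq, ratio):
    """Expected number of decoy PTM sites when each modifiable residue is
    selected with probability ratio * (target PTM frequency), capped at 1."""
    return sum(n * min(1.0, ratio * target_ptm_freq.get(aa, 0.0))
               for aa, n in decoy_residue_counts.items())

def calibrate_ratio(decoy_residue_counts, target_ptm_freq, target_site_total, candidates):
    """Pick the candidate ratio whose expected site total is closest to the target."""
    return min(
        candidates,
        key=lambda r: abs(
            expected_decoy_sites(decoy_residue_counts, target_ptm_freq, r)
            - target_site_total
        ),
    )

# Toy inputs (hypothetical).
decoy_residue_counts = {"P": 1000, "K": 500, "M": 200}
target_ptm_freq = {"P": 0.10, "K": 0.02, "M": 0.05}
candidates = [0.5 * i for i in range(1, 21)]  # 0.5, 1.0, ..., 10.0
best = calibrate_ratio(decoy_residue_counts, target_ptm_freq, 360, candidates)
```

In this toy case a ratio of 1.0 would register only about 120 sites against a target of 360, which illustrates why an uncorrected ratio leaves the decoy PTM search space too small.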

In this way, the modification types and modification sites obtained with the ratio of the decoy sequence PTM appearance frequency to the target sequence PTM appearance frequency set to 4.5 were selected as the decoy post-translational modification information for the decoy database. By integrating the decoy sequences and the decoy post-translational modification information obtained as described above, a decoy database containing decoy sequences annotated with post-translational modification information was created.

To evaluate the decoy database created in this way, FDR accuracy was compared using a peptide database in which the decoy database and the target database were combined. Two databases were chosen for comparison: a peptide database without decoy post-translational modification information, in which the decoy post-translational modification information was removed from the decoy sequences, and a peptide database without PTM appearance frequency correction, which uses decoy sequences created by selecting 1.0 as the ratio of the decoy sequence PTM appearance frequency to the target sequence PTM appearance frequency (that is, the position 1.0 on the horizontal axis of FIG. 6) when acquiring the PTM distribution statistical information.

For each of the above three peptide databases, the PTM-compatible endogenous peptide search method using the search engine Comet was applied to about 7000 MS/MS spectra derived from a urinary peptide mixture measured with a liquid chromatograph mass spectrometer (LC-MS). From the number of decoy sequences identified, the relationship between the number of true identified spectra and the FDR was estimated based on the concatenated target-decoy database method. The results are shown in FIG. 7. They show that, in the practically useful FDR range of 0.01 to 0.05, using the peptide database without decoy post-translational modification information or the peptide database without PTM appearance frequency correction overestimates the number of true identified spectra relative to the peptide database created here. This is because the PTM search space for the decoy sequences in those two databases is too small compared with that for the target sequences, so the FDR is underestimated.
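The estimation from decoy hits can be illustrated with a commonly used estimator for concatenated target-decoy searches: at each score threshold, FDR is approximated by the number of decoy hits divided by the number of target hits, and the number of true identifications by targets minus decoys. This description does not state the exact formula used, so the estimator, the function name, and the toy scores below are assumptions.

```python
def fdr_curve(psms):
    """psms: iterable of (score, is_decoy) for the best match of each spectrum.

    Returns rows of (score threshold, estimated true identifications, FDR),
    walking down from the highest score.
    """
    rows = []
    targets = decoys = 0
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets:  # FDR is undefined until at least one target hit is accepted
            rows.append((score, targets - decoys, decoys / targets))
    return rows

# Hypothetical scores, not measured data.
psms = [(9.1, False), (8.7, False), (8.0, False), (7.5, True), (7.2, False), (6.9, False)]
curve = fdr_curve(psms)
```

If the decoy search space is too small, fewer decoy hits are counted at every threshold, so the FDR column is too low and the estimated number of true identifications is too high, which is the effect described for the two comparison databases.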

In addition, taking as the target database a sequence database containing only the full-length sequences of the precursor proteins (proproteins) of the target sequences registered in the peptide database described above, a decoy database without decoy post-translational modification information, containing decoy sequences, was created by the same method as above and integrated with the target database to produce a proprotein database. This proprotein database is a superset of the peptide database, and its number of registered peptides was about 10 times that of the peptide database.

For the proprotein database and the peptide database, the MS/MS spectra derived from the urinary peptide mixture measured by LC-MS as described above were searched with the search engine Comet by the variable modification method (a conventional PTM search method that does not use the post-translational modification information registered in the database), and the results were compared with the results of searching the peptide database by the PTM-compatible endogenous peptide search method described above. The comparison is shown in FIG. 8. It was confirmed that applying the PTM-compatible endogenous peptide search method to the peptide database improves identification sensitivity in the practically useful FDR range of 0.01 to 0.05 compared with PTM search by the conventional variable modification method.

FIG. 9 summarizes the identification results of the three methods shown in FIG. 8 as a Venn diagram. The set of amino acid sequences in the peptide database is a subset of the proprotein database, and its number of registered peptides is only about 1/10 of the latter. Nevertheless, FIG. 9 shows that most of the high-confidence identifications obtained with the proprotein database were also obtained with the peptide database, which therefore has sufficient coverage.

Furthermore, the FDR of the results identified in common by the three methods compared was found to be smaller than the FDR of the identification results as a whole. Such an evaluation is possible because, in the identification quality evaluation method according to the present invention, all of the decoy modified peptides in the peptide database that are search targets in the PTM-compatible endogenous peptide search method are also search targets when the peptide database or the proprotein database is searched by the variable modification method, so the number of falsely discovered peptides can be estimated even for results identified in common across the databases and PTM search methods.

Next, a more detailed example of decoy database creation is described with reference to FIG. 4. In this example, a decoy database is created for each of three target databases that share common entries.

Of the three target databases, the top-level database 12a, which contains the most information, is a human protein database such as UniProtKB. When the sample to be measured consists of endogenous modified peptides present in human urine (urinary peptides), a protein sequence database that is a subset of UniProtKB and contains the proprotein sequences of urinary peptides reported in the literature is used as one subset database (protein sequence database). This is the second target database 12b. Furthermore, as described in Patent Document 1, a database containing, as endogenous peptide sequences produced by degradation in vivo, the urinary endogenous peptides reported in the literature, or peptide sequences that include at least a specified number of their partial sequences, is used as the lowest-level subset database (peptide sequence database). This is the third target database 12c.

The three protein or peptide sequence databases 12a, 12b, and 12c included in the target database 12 consist of sequences from UniProtKB or from subsets of it. The three databases therefore contain common entries. Consequently, when a peptide search is performed based on peak information obtained from MS/MS spectra acquired by mass spectrometry (MS/MS analysis) of peptides obtained by purifying a urine sample, the same peptide is expected to be identified redundantly from more than one of the databases in the target database 12.

In practice, there is almost no difference between the amino acid composition of the sequences in the urinary peptide protein sequence database 12b, which is a subset database, and that of the sequences in the peptide sequence database 12c, which is in turn a subset of database 12b. In this case, therefore, the amino acid distribution statistical information and the PTM distribution statistical information are created from the urinary peptide protein sequence database 12b and are used to create not only the decoy database corresponding to database 12b but also the decoy database corresponding to the peptide sequence database 12c.

On the other hand, what remains after removing the urinary peptide protein sequence database 12b from UniProtKB (the top-level protein sequence database 12a) is a set of data having no entries in common with database 12b. This is equivalent to dividing the top-level protein sequence database 12a into two parts, one being the urinary peptide protein sequence database 12b and the other the difference-set database 12d; the difference-set database 12d is also a subset database. The proteins in the difference-set database 12d differ in amino acid composition from those in database 12b. Therefore, amino acid distribution statistical information and PTM distribution statistical information are created for the difference-set database 12d separately from those for database 12b.

After the amino acid distribution statistical information and the PTM distribution statistical information have been created from the urinary peptide protein sequence database 12b and from the difference-set database 12d in this way, they are used to create, by the procedure described above, as many decoy (amino acid) sequences as there are protein entries in each database. For each decoy sequence created, decoy PTM additional information is determined based on the PTM distribution statistical information. The decoy sequences and decoy PTM additional information are then integrated with the data in the original databases to create the subset decoy database 14b and the difference-set decoy database 14d, respectively, and these are further integrated to create the top-level decoy database 14a corresponding to the top-level database 12a.
In this way, a decoy database corresponding to UniProtKB can be created.
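The division into a subset and a difference set, followed by the final merge, can be sketched with plain dictionaries keyed by accession. The accession strings, the function names, and the placeholder used in place of real decoy generation are hypothetical; the point of the sketch is only that the two parts are disjoint, are processed separately, and are then recombined.

```python
def split_target_database(top_level, subset_ids):
    """Split entries into the subset (cf. 12b) and the difference set (cf. 12d)."""
    subset = {acc: seq for acc, seq in top_level.items() if acc in subset_ids}
    difference = {acc: seq for acc, seq in top_level.items() if acc not in subset_ids}
    return subset, difference

def merge_decoy_databases(*decoy_dbs):
    """Combine disjoint decoy databases into one (cf. 14a); reject shared entries."""
    merged = {}
    for db in decoy_dbs:
        shared = merged.keys() & db.keys()
        if shared:
            raise ValueError(f"entries appear in more than one database: {sorted(shared)}")
        merged.update(db)
    return merged

def placeholder_decoys(db):
    """Stand-in for per-database decoy generation; keeps entry count and lengths."""
    return {"DECOY_" + acc: "A" * len(seq) for acc, seq in db.items()}

top_level = {"ACC1": "APKM", "ACC2": "PPKA", "ACC3": "MKAP"}  # hypothetical entries
subset, difference = split_target_database(top_level, {"ACC2"})
merged = merge_decoy_databases(placeholder_decoys(subset), placeholder_decoys(difference))
```

Rejecting shared entries at merge time is a design choice of this sketch that enforces the no-duplicate condition of step S1 on the decoy side as well.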

The above embodiment is merely one example of the present invention, and any modification, correction, or addition made as appropriate within the spirit of the present invention is naturally encompassed by the claims of the present application.

1 … Endogenous modified peptide identification quality evaluation apparatus
11 … Peptide identification unit
12 … Target database
12a … Top-level protein sequence database
12b … Subset protein sequence database
12c … Peptide sequence database
12d … Difference-set database
13 … Decoy database creation unit
131 … Database division unit
132 … Data reading unit
133 … Amino acid distribution statistical information creation unit
134 … PTM distribution statistical information creation unit
135 … Decoy sequence generation unit
136 … Decoy PTM information generation unit
137 … Subset decoy database construction unit
138 … Whole-set decoy database construction unit
14 … Decoy database
14a … Top-level decoy database
14b … Subset decoy protein database
14c … Subset decoy peptide sequence database
14d … Difference-set decoy database
15 … False discovery rate estimation unit
2 … Mass spectrometry unit
3 … Spectrum analysis unit

Claims

A method for evaluating the identification quality of endogenous modified peptides, which calculates, using decoy sequences, an index value for evaluating the identification quality when endogenous modified peptides are identified by a database search based on mass spectrometry results, the method comprising:
a) an information reading step of reading amino acid sequence information and post-translational modification information of the recorded modified proteins from a protein sequence database to which post-translational modification information is added;
b) an amino acid distribution statistics acquisition step of obtaining, based on the amino acid sequence information obtained in the information reading step, amino acid distribution statistical information indicating the appearance frequency of each amino acid;
c) a post-translational modification distribution statistics acquisition step of obtaining, based on the post-translational modification information obtained in the information reading step, post-translational modification distribution statistical information indicating the appearance frequency of each post-translational modification;
d) a decoy sequence generation step of generating a decoy sequence resembling a normal amino acid sequence based on the amino acid sequence information obtained in the information reading step and the amino acid distribution statistical information;
e) a decoy post-translational modification information generation step of generating decoy post-translational modification information, including the type and site of modification, based on the decoy sequence generated in the decoy sequence generation step and the post-translational modification distribution statistical information; and
f) a decoy database creation step of creating a decoy database with modification information, in which decoy sequences with added post-translational modification information are recorded, by integrating the decoy sequence generated in the decoy sequence generation step and the decoy post-translational modification information generated in the decoy post-translational modification information generation step.

The method for evaluating the identification quality of endogenous modified peptides according to claim 1, further comprising:
a database dividing step of dividing a target database, which is a protein sequence database to which post-translational modification information is added, into a plurality of subset databases having no common entries;
a subset decoy database creation step of creating a subset decoy database for each of the plurality of subset databases by performing the processing of the information reading step, the amino acid distribution statistics acquisition step, the post-translational modification distribution statistics acquisition step, the decoy sequence generation step, the decoy post-translational modification information generation step, and the decoy database creation step; and
a decoy database integration step of creating a decoy database corresponding to the target database by integrating the plurality of subset decoy databases created in the subset decoy database creation step.

An apparatus for evaluating the identification quality of endogenous modified peptides, which calculates, using decoy sequences, an index value for evaluating the identification quality when endogenous modified peptides are identified by a database search based on mass spectrometry results, the apparatus comprising:
a) an information reading unit that reads amino acid sequence information and post-translational modification information of the recorded modified proteins from a protein sequence database to which post-translational modification information is added;
b) an amino acid distribution statistics acquisition unit that obtains, based on the amino acid sequence information obtained by the information reading unit, amino acid distribution statistical information indicating the appearance frequency of each amino acid;
c) a post-translational modification distribution statistics acquisition unit that obtains, based on the post-translational modification information obtained by the information reading unit, post-translational modification distribution statistical information indicating the appearance frequency of each post-translational modification;
d) a decoy sequence generation unit that generates a decoy sequence resembling a normal amino acid sequence based on the amino acid sequence information obtained by the information reading unit and the amino acid distribution statistical information;
e) a decoy post-translational modification information generation unit that generates decoy post-translational modification information, including the type and site of modification, based on the decoy sequence generated by the decoy sequence generation unit and the post-translational modification distribution statistical information; and
f) a decoy database creation unit that creates a decoy database with modification information, in which decoy sequences with added post-translational modification information are recorded, by integrating the decoy sequence generated by the decoy sequence generation unit and the decoy post-translational modification information generated by the decoy post-translational modification information generation unit.