JP5765700B2

JP5765700B2 - Soluble control tag design apparatus and method and program thereof

Info

Publication number: JP5765700B2
Application number: JP2010270358A
Authority: JP
Inventors: 修一廣瀬; 保野口; 直樹五島; 正敏森
Original assignee: National Institute of Advanced Industrial Science and Technology AIST; Japan Biological Informatics Consortium
Current assignee: National Institute of Advanced Industrial Science and Technology AIST; Japan Biological Informatics Consortium
Priority date: 2010-12-03
Filing date: 2010-12-03
Publication date: 2015-08-19
Anticipated expiration: 2030-12-03
Also published as: JP2012116816A

Description

The present invention relates to a technique for controlling the solubilization or insolubilization of proteins.

Protein production is an important issue in fields such as biochemistry, structural science, pharmacy, and industry. To obtain a protein successfully by genetic recombination, three steps must be overcome: expression, solubility, and purification. Until now, living cells have often been used as protein expression systems. Escherichia coli is one of the preferred hosts because it is genetically easy to handle and yields large amounts of recombinant protein. In addition to methods using microorganisms and cultured cells, methods using protein synthesis systems extracted from prokaryotes and eukaryotes have also been proposed. These techniques can express the target protein in large quantities, dramatically increase its solubility, and facilitate purification.

A reliable approach to increasing protein solubility is to fuse a highly soluble protein to the target protein. The added sequence is generally called a tag. Several proteins that function as solubilization tags have been reported in the literature: glutathione S-transferase (GST) in Non-Patent Document 1, maltose-binding protein (MBP) in Non-Patent Document 2, thioredoxin (Trx) in Non-Patent Document 3, and N-utilization substance A (NusA) in Non-Patent Document 4. These proteins are well known empirically to be highly soluble.

Like solubilization tags, affinity tags have been developed to facilitate the purification of recombinant proteins. MBP and GST both function as solubilization tags and as affinity tags: GST binds strongly to glutathione resin, and MBP binds strongly to amylose resin.

Non-Patent Document 1: Nygren, P.A. et al., "Engineering proteins to facilitate bioprocessing," Trends Biotechnol. 12 (1994), 184-188
Non-Patent Document 2: Nallamsetty, S. and Waugh, D.S., "Solubility-enhancing proteins MBP and NusA play a passive role in the folding of their fusion partners," Protein Expr. Purif. 45 (2006), 175-182
Non-Patent Document 3: LaVallie, E.R. et al., "A thioredoxin gene fusion expression system that circumvents inclusion body formation in the E. coli cytoplasm," Biotechnology (NY) 11 (1993), 187-193
Non-Patent Document 4: Davis, G.D. et al., "New fusion protein systems designed to give soluble expression in Escherichia coli," Biotechnol. Bioeng. 65 (1999), 382-388

Although tags are useful tools for protein solubilization and purification, they are not beneficial for every protein. Researchers must express various recombinant proteins fused to different tags and compare their solubility to find the optimal tag. In addition, the tag must be removed for biochemical research and for testing therapeutic proteins, because the large size of the tag affects both the structure and the function of the target protein. These problems particularly hinder high-efficiency cloning and expression projects.

Based on the idea that the amino acid sequence of the terminal region affects the solubility of a protein, the present invention proposes a method of designing a tag that controls protein solubility.

The present invention designs a solubility control tag based on data read from a database that stores the amino acid sequences of proteins experimentally confirmed to be soluble or insoluble. Specifically, the N-terminal amino acid sequences of soluble proteins and insoluble proteins are read from the database, and the solubility control tag is obtained by analyzing the sequences that were read.

By analyzing the amino acid sequences found at the N-terminus of soluble proteins and insoluble proteins in this way, a solubility control tag can be designed appropriately on the basis of actual data. The suitable solubility control tag differs depending on conditions such as the expression system, but obtaining the tag from data for the actual expression system makes it possible to design a tag that fits those conditions.

The solubility control tag design apparatus of the present invention designs a solubility control tag based on data read from a database storing the amino acid sequences of soluble proteins and insoluble proteins. The apparatus comprises an input unit for inputting the residue length L of the solubility control tag to be obtained, an arithmetic unit for obtaining the solubility control tag based on the data read from the database, and an output unit for outputting the solubility control tag obtained by the arithmetic unit.
The arithmetic unit executes the following steps:
(1) generating amino acid similarity group sequences, which are sequences of amino acid similarity groups of L-residue length covering all combinations of the amino acid similarity groups;
(2) reading, from the database, amino acid sequences of L-residue length contained in the N-terminal K residues (K ≥ L) of the soluble proteins and the insoluble proteins;
(3) counting, based on the read amino acid sequences, the number of appearances of each amino acid similarity group sequence at the N-terminus of the soluble proteins and of the insoluble proteins;
(4) calculating the appearance frequency of each amino acid similarity group sequence at the N-terminus of the soluble proteins when designing a solubilization tag that increases solubility, or at the N-terminus of the insoluble proteins when designing an insolubilization tag that increases insolubility, and obtaining, as frequent amino acid similarity group sequences, the amino acid similarity group sequences whose appearance frequency is higher than a predetermined threshold;
(5) clustering the plurality of frequent amino acid similarity group sequences into a plurality of clusters;
(6) comparing the amino acid sequences of L-residue length read from the N-terminal K residues of the soluble proteins (when designing a solubilization tag) or of the insoluble proteins (when designing an insolubilization tag) with the frequent amino acid similarity group sequences and, when an amino acid sequence corresponds to a frequent amino acid similarity group sequence, counting up the number of appearances of each type of amino acid at each position in the amino acid sequence based on the amino acids it contains, this processing being performed sequentially for the read amino acid sequences;
(7) summing, over the frequent amino acid similarity group sequences in the same cluster, the number of appearances of each type of amino acid at each position in the amino acid sequence; and
(8) obtaining, as the solubilization tag or the insolubilization tag, the amino acid sequence composed of the amino acid with the highest number of appearances at each position.
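Steps (6) to (8) can be illustrated with a minimal Python sketch. It assumes that the amino acid sequences matching the frequent amino acid similarity group sequences of one cluster have already been collected, so pooling them corresponds to the summation in step (7). The function name `consensus_tag` and the sample sequences are hypothetical, and ties are broken by first occurrence, which the text does not specify.

```python
from collections import Counter

def consensus_tag(matched_sequences):
    """Return the most frequent amino acid at each position (steps 6-8).

    matched_sequences: equal-length amino acid strings that matched the
    frequent amino acid similarity group sequences of one cluster.
    """
    if not matched_sequences:
        raise ValueError("no matched sequences")
    length = len(matched_sequences[0])
    # One counter per position: amino acid -> number of appearances.
    counts = [Counter() for _ in range(length)]
    for seq in matched_sequences:
        if len(seq) != length:
            raise ValueError("sequences must share one length")
        for pos, aa in enumerate(seq):
            counts[pos][aa] += 1
    # Pick the amino acid with the highest count at each position.
    return "".join(c.most_common(1)[0][0] for c in counts)

# Hypothetical 7-residue sequences, for illustration only.
example = ["MKEEAKT", "MKDEAKS", "MREEGKT"]
tag = consensus_tag(example)
```

With these three made-up inputs, every position has a clear majority, so the resulting tag is simply the majority residue at each of the seven positions.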

Obtaining the number of appearances of the amino acid similarity group sequences contained at the N-terminus of the soluble or insoluble proteins requires less computation than obtaining the number of appearances of the amino acid sequences themselves. Moreover, for the same residue length, the total number of amino acid similarity group sequences is far smaller than the total number of amino acid sequences. Therefore, a consistent trend in the number of appearances can be found even when the amino acid sequence database holds only a small amount of protein data.

In step (4), when a solubilization tag is to be obtained, the amino acid similarity group sequences for which the S value and the p value calculated by the following formulas satisfy S > 0.9 and p < 1 × 10^-5 may be obtained as the frequent amino acid similarity group sequences.

In step (5), each frequent amino acid similarity group sequence may be converted into 20-dimensional coordinate values in which an amino acid included in the frequent amino acid similarity group is "1" and an amino acid not included is "0". A dendrogram is then generated based on the Euclidean distances between the frequent amino acid similarity group sequences, and clustering may be performed by cutting the dendrogram at a predetermined height.
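The encoding and clustering of step (5) can be sketched in Python as follows. Several points are assumptions rather than statements of the specification: the group memberships in `GROUPS` are hypothetical stand-ins for those of FIG. 3, a group sequence is encoded by concatenating the 20-dimensional vector of each position, and single-linkage clustering is used because the linkage method is not stated. Cutting a single-linkage dendrogram at a given height yields the connected components of the graph that joins every pair no farther apart than that height, which is what `cut_clusters` computes.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # fixed order of the 20 coordinates

# Hypothetical group memberships, for illustration only.
GROUPS = {
    "negative": "DE",
    "positive": "HKR",
    "charged": "DEHKR",
    "all": AMINO_ACIDS,
}

def encode_group(name):
    """20-dimensional 0/1 vector: 1 if the amino acid is in the group."""
    members = GROUPS[name]
    return [1 if aa in members else 0 for aa in AMINO_ACIDS]

def encode_sequence(group_names):
    """Assumed encoding: concatenate the vectors of all positions."""
    vec = []
    for name in group_names:
        vec.extend(encode_group(name))
    return vec

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cut_clusters(sequences, height):
    """Single-linkage clusters obtained by cutting at `height`."""
    vectors = [encode_sequence(s) for s in sequences]
    parent = list(range(len(sequences)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if euclidean(vectors[i], vectors[j]) <= height:
                parent[find(i)] = find(j)
    clusters = {}
    for i, seq in enumerate(sequences):
        clusters.setdefault(find(i), []).append(seq)
    return sorted(clusters.values())

# Hypothetical 2-position group sequences.
seqs = [("negative", "charged"), ("negative", "negative"), ("positive", "all")]
clusters = cut_clusters(seqs, 2.0)
```

In this toy input the first two sequences differ in only three coordinates, so they fall into one cluster at a cut height of 2.0, while the third stays on its own.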

The solubility control tag design method of the present invention is a method of designing, as a solubility control tag, a solubilization tag that increases the solubility of a protein or an insolubilization tag that increases its insolubility. The method comprises the following steps:
(1) generating amino acid similarity group sequences, which are sequences of amino acid similarity groups of L-residue length covering all combinations of the amino acid similarity groups;
(2) reading, from a database storing the amino acid sequences of soluble proteins and insoluble proteins, amino acid sequences of L-residue length contained in the N-terminal K residues (K ≥ L) of the soluble proteins and the insoluble proteins;
(3) counting, based on the read amino acid sequences, the number of appearances of each amino acid similarity group sequence at the N-terminus of the soluble proteins and of the insoluble proteins;
(4) calculating the appearance frequency of each amino acid similarity group sequence at the N-terminus of the soluble proteins when designing a solubilization tag, or at the N-terminus of the insoluble proteins when designing an insolubilization tag, and obtaining, as frequent amino acid similarity group sequences, the amino acid similarity group sequences whose appearance frequency is higher than a predetermined threshold;
(5) clustering the plurality of frequent amino acid similarity group sequences into a plurality of clusters;
(6) comparing the amino acid sequences of L-residue length read from the N-terminal K residues of the soluble proteins (when designing a solubilization tag) or of the insoluble proteins (when designing an insolubilization tag) with the frequent amino acid similarity group sequences and, when an amino acid sequence corresponds to a frequent amino acid similarity group sequence, counting up the number of appearances of each type of amino acid at each position in the amino acid sequence based on the amino acids it contains, this processing being performed sequentially for the read amino acid sequences;
(7) summing, over the frequent amino acid similarity group sequences in the same cluster, the number of appearances of each type of amino acid at each position in the amino acid sequence; and
(8) obtaining, as the solubilization tag or the insolubilization tag, the amino acid sequence composed of the amino acid with the highest number of appearances at each position.

The program of the present invention is a program for designing, as a solubility control tag, a solubilization tag that increases the solubility of a protein or an insolubilization tag that increases its insolubility. The program causes a computer to execute the following steps:
(1) securing an area for storing the number of times each amino acid similarity group sequence, being a sequence of amino acid similarity groups of L-residue length covering all combinations of the amino acid similarity groups, appears at the N-terminus of the soluble proteins and of the insoluble proteins;
(2) reading, from a database storing the amino acid sequences of soluble proteins and insoluble proteins, amino acid sequences of L-residue length contained in the N-terminal K residues (K ≥ L) of the soluble proteins and the insoluble proteins;
(3) counting, based on the read amino acid sequences, the number of appearances of each amino acid similarity group sequence at the N-terminus of the soluble proteins and of the insoluble proteins;
(4) calculating the appearance frequency of each amino acid similarity group sequence at the N-terminus of the soluble proteins when designing a solubilization tag, or at the N-terminus of the insoluble proteins when designing an insolubilization tag, and obtaining, as frequent amino acid similarity group sequences, the amino acid similarity group sequences whose appearance frequency is higher than a predetermined threshold;
(5) clustering the plurality of frequent amino acid similarity group sequences into a plurality of clusters;
(6) comparing the amino acid sequences of L-residue length read from the N-terminal K residues of the soluble proteins (when designing a solubilization tag) or of the insoluble proteins (when designing an insolubilization tag) with the frequent amino acid similarity group sequences and, when an amino acid sequence corresponds to a frequent amino acid similarity group sequence, counting up the number of appearances of each type of amino acid at each position in the amino acid sequence based on the amino acids it contains, this processing being performed sequentially for the read amino acid sequences;
(7) summing, over the frequent amino acid similarity group sequences in the same cluster, the number of appearances of each type of amino acid at each position in the amino acid sequence;
(8) obtaining, as the solubilization tag or the insolubilization tag, the amino acid sequence composed of the amino acid with the highest number of appearances at each position; and
(9) outputting the solubilization tag or the insolubilization tag.

According to the present invention, analyzing the amino acid sequences of L-residue length at the N-terminus of soluble proteins and insoluble proteins makes it possible to design an appropriate solubility control tag that suits conditions such as the expression system.

A diagram showing the configuration of the solubility control tag design apparatus of the embodiment.
A flowchart showing the operation of the solubility control tag design apparatus of the embodiment.
A diagram showing the amino acids contained in each amino acid similarity group.
A diagram showing amino acid similarity group sequences of 7-residue length.
A diagram showing examples of the N-terminal 20 residues of soluble proteins and insoluble proteins.
A diagram showing an example of reading amino acid sequences of 7-residue length from the N-terminal 20 residues of a protein.
(a) A diagram showing the amino acid similarity groups corresponding to an amino acid sequence. (b) A diagram showing the amino acid similarity group sequences corresponding to an amino acid sequence. (c) A diagram showing an example of counting the number of appearances of the amino acid similarity group sequences.
A diagram showing an example of converting amino acid similarity groups into numerical values.
A diagram showing an example of converting amino acid similarity group sequences into numerical values.
A diagram showing an example of clustering the frequent amino acid similarity group sequences.
A diagram showing examples of amino acid sequences corresponding to a frequent amino acid similarity group sequence.
A diagram showing an example of obtaining the number of appearances of each type of amino acid at each position of the amino acid sequence.

The solubility control tag design apparatus and method of the present invention are described below with reference to the drawings. In the present embodiment, the solubility control tags of interest are, in particular, short amino acid sequences of a few residues up to about 100 residues.

FIG. 1 is a diagram showing the configuration of a solubility control tag design apparatus 10 according to the embodiment. The apparatus 10 includes an input unit 12 for inputting the residue length L of the solubility control tag to be designed, a CPU 16 that designs a solubility control tag of L-residue length using the data on soluble proteins and insoluble proteins stored in a protein database (hereinafter, "protein DB") 14, and an output unit 18 that outputs the data of the designed solubility control tag. A RAM 20 and a ROM 22 are connected to the CPU 16. When designing a solubilization tag, the CPU 16 writes the data needed for the calculation to the RAM 20 and reads it back from the RAM 20. The CPU 16 designs the solubilization tag by reading and executing a program 24 stored in the ROM 22. This program 24 is also included in the scope of the present invention.

The protein DB 14 stores amino acid sequence data for proteins confirmed to be soluble or insoluble in experiments in which they were expressed in a given system. The present embodiment takes as an example a configuration in which the solubility control tag design apparatus 10 contains the protein DB 14, but the protein DB 14 may be located outside the apparatus 10. In that case, the apparatus 10 includes a communication unit for communicating with the external database and reads the data of the protein DB 14 through the communication unit.

The solubility control tag design apparatus 10 is configured by, for example, a personal computer. The input unit 12 is, for example, a keyboard, a mouse, or a CD-ROM reader. The output unit 18 is, for example, a monitor, a printer, or a CD-ROM writer.

Next, the process by which the solubility control tag design apparatus 10 designs a solubility control tag is described. The following description covers the case of designing a solubilization tag as the solubility control tag, but an insolubilization tag can be designed by the same method.

(Overview)
FIG. 2 is a flowchart showing the operation of solubilization tag design by the solubility control tag design apparatus 10. This specification first outlines the solubilization tag design and then describes each process in detail.

As shown in FIG. 3, amino acids can be divided into 10 amino acid similarity groups based on properties such as hydrophobicity and polarity. In the present embodiment, a group x containing all amino acids is also included as an amino acid similarity group, giving 11 amino acid similarity groups in total.

The solubility control tag design apparatus 10 of the present embodiment uses the concept of a sequence composed of a combination of amino acid similarity groups (called an "amino acid similarity group sequence"). It first obtains the amino acid similarity group sequences that are often found in soluble proteins but not in insoluble proteins (called "frequent amino acid similarity group sequences"). This corresponds to steps S10 to S16 of the flowchart shown in FIG. 2.

Next, the solubility control tag design apparatus 10 searches the amino acid sequences contained at the N-terminus of the soluble proteins for amino acid sequences corresponding to the frequent amino acid similarity group sequences, and combines all of the amino acid sequences found to determine the amino acid sequence of the solubilization tag. This corresponds to steps S18 to S22 of the flowchart shown in FIG. 2.

(Detailed description of each process)
Next, each process of determining the solubilization tag is described in detail. The following description uses as an example the case in which the amino acid sequence of the N-terminal 20 residues is analyzed to obtain a solubilization tag of 7-residue length.

First, the solubility control tag design apparatus 10 accepts input of the residue length of the solubilization tag to be designed. In the present embodiment, a length of 7 residues is input. The residue length need not be input every time; the apparatus 10 may store the input residue length as a setting. At this stage, the apparatus may also accept input of how many N-terminal residues are to be analyzed. In the present embodiment, the N-terminal 20 residues are analyzed, so a length of 20 residues is input.

As shown in FIG. 4, the solubility control tag design apparatus 10 generates all combinations of amino acid similarity groups of 7-residue length (S10). Since there are 11 amino acid similarity groups, as shown in FIG. 3, there are 11^7 combinations for a 7-residue length. However, when the amino acid similarity group containing all amino acids (group x) is located at the beginning or the end of the sequence, the sequence is equivalent to a 6-residue sequence. In the present embodiment, therefore, the first and last positions are restricted to groups other than group x, and 10^2 × 11^5 amino acid similarity group sequences are generated. In addition, as described next, the apparatus 10 counts the number of times each amino acid similarity group sequence appears in the N-terminal 20 residues of the soluble proteins and of the insoluble proteins, so it allocates an area in the RAM 20 for storing the appearance counts.
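The count stated for step S10 can be checked with a short Python sketch. The single-letter labels a to j for the ten groups are an assumption based on the letters mentioned in the text (the full assignment is in FIG. 3), and the function names are hypothetical. The generator is lazy because materializing every sequence at once would be wasteful.

```python
from itertools import islice, product

GROUPS = "abcdefghijx"           # 11 labels; "x" is the all-inclusive group
NON_X = GROUPS.replace("x", "")  # 10 groups allowed at the first and last position

def count_group_sequences(length):
    """Number of group sequences with no "x" at either end (length >= 2)."""
    return len(NON_X) ** 2 * len(GROUPS) ** (length - 2)

def iter_group_sequences(length):
    """Lazily yield every such group sequence as a string."""
    for first in NON_X:
        for middle in product(GROUPS, repeat=length - 2):
            for last in NON_X:
                yield first + "".join(middle) + last

total = count_group_sequences(7)           # 10^2 * 11^5
unrestricted = len(GROUPS) ** 7            # 11^7, before excluding "x" at the ends
amino_acid_sequences = 20 ** 7             # all 7-residue amino acid sequences
first_three = list(islice(iter_group_sequences(7), 3))
```

Here `total` evaluates to 16,105,100, compared with 19,487,171 unrestricted group sequences and 1,280,000,000 possible 7-residue amino acid sequences.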

Next, the solubility control tag design apparatus 10 reads amino acid sequences of 7-residue length from the N-terminal 20 residues of each soluble protein and each insoluble protein, and counts the number of appearances of the amino acid similarity group sequences corresponding to the read amino acid sequences (S12). The details are as follows.

FIG. 5 is a diagram showing examples of the N-terminal 20 residues of soluble proteins and insoluble proteins. FIG. 6 is a diagram showing an example of extracting amino acid sequences of 7-residue length from the N-terminal 20 residues. There are 20 − 7 + 1 = 14 ways of taking a 7-residue sequence from the N-terminal 20 residues: residues 1 to 7 from the N-terminus, residues 2 to 8, and so on up to residues 14 to 20. Next, the amino acid similarity group sequences corresponding to each extracted amino acid sequence are obtained, and the appearance count of each of those amino acid similarity group sequences is incremented.
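The window extraction described above can be sketched in Python. The function name `n_terminal_windows` and the sample protein are hypothetical; the defaults K = 20 and L = 7 follow the example in the text.

```python
def n_terminal_windows(sequence, k=20, length=7):
    """Return every `length`-residue window within the first `k` residues."""
    head = sequence[:k]
    # A head shorter than `length` yields an empty list.
    return [head[i:i + length] for i in range(len(head) - length + 1)]

# Hypothetical 25-residue sequence, for illustration only.
protein = "MSKEELLKAVEETGADVVIDLGTKR"
windows = n_terminal_windows(protein)
```

For this input the list holds 14 windows, starting with residues 1 to 7 and ending with residues 14 to 20.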

FIG. 7(a) is a diagram showing an example of an amino acid sequence taken from the N-terminus. Below the amino acid sequence, the amino acid similarity groups corresponding to each amino acid are shown. For example, the amino acid similarity groups corresponding to "A" (alanine) are "a" (hydrophobic), "e" (tiny side chain), and "x" (all), except that "x" is excluded at the head of the sequence. The amino acid similarity groups corresponding to "E" (glutamic acid) are "b" (polar), "i" (negatively charged), "j" (charged), and "x" (all).

FIG. 7(b) is a diagram showing the amino acid similarity group sequences corresponding to the amino acid sequence shown in FIG. 7(a). In the amino acid sequence of FIG. 7(a), the amino acids in the sequence correspond to 2, 3, 4, 3, 3, 4, and 4 amino acid similarity groups, respectively, so the sequence corresponds to 2 × 3 × 4 × 3 × 3 × 4 × 4 = 3456 amino acid similarity group sequences in total. The solubility control tag design apparatus 10 obtains the amino acid similarity group sequences corresponding to each read amino acid sequence in turn and increments their appearance counts.
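The expansion of one amino acid sequence into its matching amino acid similarity group sequences can be sketched in Python. Only the memberships stated in the text for "A" and "E" are used; the test segment "AEAEAEA" is a hypothetical input (the sequence of FIG. 7(a) is not reproduced in the text), and excluding "x" at the last position as well as the first follows the restriction described for step S10.

```python
from itertools import product
from math import prod

# Memberships stated in the text for "A" and "E". Every amino acid also
# belongs to the all-inclusive group "x".
MEMBERSHIP = {
    "A": ("a", "e"),
    "E": ("b", "i", "j"),
}

def groups_at(aa, is_end):
    """Groups matching one residue; "x" is excluded at either end."""
    groups = MEMBERSHIP[aa]
    return groups if is_end else groups + ("x",)

def group_sequences(segment):
    """All amino acid similarity group sequences matching `segment`."""
    last = len(segment) - 1
    options = [groups_at(aa, i in (0, last)) for i, aa in enumerate(segment)]
    return ["".join(combo) for combo in product(*options)]

matches = group_sequences("AEAEAEA")  # hypothetical segment
# Per-position group counts quoted in the text for FIG. 7(a).
source_example_total = prod([2, 3, 4, 3, 3, 4, 4])
```

The hypothetical segment has 2, 4, 3, 4, 3, 4, and 2 candidate groups per position, so it expands to 2304 group sequences; the per-position counts quoted for FIG. 7(a) multiply out to 3456, as stated.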

FIG. 7(c) shows an example of the occurrence counts of each of the 10² × 11⁵ possible amino acid similarity group sequences at the N-terminus of soluble proteins and at the N-terminus of insoluble proteins. In this specification, the number of times a given amino acid similarity group sequence occurs at the N-terminus of soluble proteins is denoted "Mp", and the number of times it occurs at the N-terminus of insoluble proteins is denoted "Mn". The total count of all amino acid similarity group sequences occurring at the N-terminus of soluble proteins is the segment count "Np", and the corresponding total for insoluble proteins is the segment count "Nn".

Next, the soluble control tag design apparatus 10 calculates the occurrence frequency of each amino acid similarity group sequence in the soluble proteins (S14). In this embodiment, the occurrence frequency is expressed by the S value and the p value given by the following expressions. The S value indicates how specifically a sequence occurs in the data set obtained from soluble proteins, and the p value indicates how rare the sequence is.

The soluble control tag design apparatus 10 then obtains the amino acid similarity group sequences whose occurrence frequency exceeds a predetermined threshold (referred to as "frequent amino acid similarity group sequences") (S16). Specifically, the amino acid similarity group sequences whose S value and p value satisfy both S > 0.9 and p < 1 × 10⁻⁵ are extracted as frequent amino acid similarity group sequences.

Next, the soluble control tag design apparatus 10 clusters the frequent amino acid similarity group sequences it has obtained (S18). The distance between frequent amino acid similarity group sequences is defined as follows. First, each amino acid similarity group is converted to numerical form according to which amino acids it contains. Then each amino acid similarity group sequence, which is a combination of amino acid similarity groups, is converted to numerical form. A specific example follows.

FIG. 8 shows an example of amino acid similarity groups converted to numerical form. As shown in FIG. 8, assigning "1" to each amino acid contained in an amino acid similarity group and "0" to each amino acid not contained allows the group to be represented as a 20-dimensional coordinate value. For example, group a (hydrophobic) becomes (1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,1,0). Because an amino acid similarity group sequence is a combination of seven amino acid similarity groups, it is represented by coordinates of 20 × 7 = 140 dimensions.
FIG. 9 shows an example of frequent amino acid similarity group sequences converted to numerical form. In the figure, each underlined digit marks the start of the values representing one amino acid similarity group.
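A minimal Python sketch of this encoding follows. The column order `AMINO_ACIDS` and the membership sets in `GROUP_MEMBERS` are illustrative placeholders, not the contents of FIG. 8; only three groups are defined, and the set for "j" in particular is an assumption.

```python
import math

# Hypothetical column order; FIG. 8 may order the 20 amino acids differently.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative membership sets (placeholders, not taken from FIG. 8).
GROUP_MEMBERS = {
    "i": set("DE"),         # negatively charged
    "j": set("DEKR"),       # charged (assumed membership)
    "x": set(AMINO_ACIDS),  # all amino acids
}


def encode_group(group):
    """20-dimensional 0/1 vector: 1 if the amino acid belongs to the group."""
    members = GROUP_MEMBERS[group]
    return [1 if aa in members else 0 for aa in AMINO_ACIDS]


def encode_sequence(group_sequence):
    """Concatenate the per-group vectors (7 groups give 140 dimensions)."""
    vector = []
    for group in group_sequence:
        vector.extend(encode_group(group))
    return vector


u = encode_sequence("xxxxxxx")
v = encode_sequence("ixxxxxx")
distance = math.dist(u, v)  # Euclidean distance between the two encodings
print(len(u), distance)
```

Here the two sequences differ only in their first group, where "x" covers 20 amino acids and "i" covers 2, so the vectors differ in 18 coordinates and the distance is the square root of 18.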

Converting the frequent amino acid similarity group sequences to numerical form in this way makes it possible to compute the Euclidean distance between them. The soluble control tag design apparatus 10 uses the Euclidean distances to generate a dendrogram of the frequent amino acid similarity group sequences and performs clustering by cutting the dendrogram at a suitable height. Specifically, the distance between clusters is calculated by the furthest-neighbor (complete linkage) method, in which the distance between two clusters is the longest distance between any pair of their members. The dendrogram is generated by repeatedly merging the two closest clusters; initially, each cluster contains a single frequent amino acid similarity group sequence. Clustering with a dendrogram in this way is a known technique.
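The merge procedure described above can be sketched in plain Python as follows. This is a simplified illustration, not the apparatus's implementation: instead of building the full dendrogram and cutting it at a height, the hypothetical function `complete_linkage` stops merging once a requested number of clusters remains, which corresponds to cutting the dendrogram at the height that leaves that many clusters. The 2-dimensional points are toy data standing in for the 140-dimensional encodings.

```python
import math


def complete_linkage(points, n_clusters):
    """Merge the two closest clusters until `n_clusters` remain.

    Cluster distance is the longest pairwise Euclidean distance between
    members (furthest-neighbor method). Returns lists of point indices.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]  # start with singletons

    def cluster_distance(c1, c2):
        return max(math.dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy 2-D points used only for illustration.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = complete_linkage(points, 3)
print(result)
```

For larger inputs a library routine for hierarchical clustering would normally be preferable to this quadratic search at every merge.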

FIG. 10 shows a dendrogram generated by the soluble control tag design apparatus 10. In this example, 10 clusters are generated.

Next, the soluble control tag design apparatus 10 searches the amino acid sequences of the N-terminal 20 residues of the soluble proteins for amino acid sequences that correspond to a frequent amino acid similarity group sequence.
FIG. 11 shows examples of amino acid sequences corresponding to a frequent amino acid similarity group sequence. In this example, "IHVGLDT", "CKREMPA", and others are found as amino acid sequences corresponding to the frequent amino acid similarity group sequence "abxxaca". Based on the amino acid sequences found, the soluble control tag design apparatus 10 counts the occurrences of each type of amino acid at each position in the sequence and stores the counts in the RAM 20.
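The search and per-position counting can be sketched as below. The membership sets are deliberately partial: they contain only the memberships implied by the "abxxaca" example in the text ("IHVGLDT" and "CKREMPA" both match), not the full group definitions. The function names are hypothetical.

```python
from collections import Counter

# Partial memberships implied by the example in the text; the full sets
# are not reproduced here, so treat these as placeholders.
GROUP_MEMBERS = {
    "a": set("ACILMT"),
    "b": set("HK"),
    "c": set("DP"),
}


def matches(peptide, pattern):
    """True if every residue belongs to the group at its position ('x' = any)."""
    if len(peptide) != len(pattern):
        return False
    return all(
        group == "x" or residue in GROUP_MEMBERS.get(group, set())
        for residue, group in zip(peptide, pattern)
    )


def position_counts(peptides, pattern):
    """Count amino acids per position over the peptides matching `pattern`."""
    counts = [Counter() for _ in pattern]
    for peptide in peptides:
        if matches(peptide, pattern):
            for pos, residue in enumerate(peptide):
                counts[pos][residue] += 1
    return counts


peptides = ["IHVGLDT", "CKREMPA", "EEEEEEE"]
counts = position_counts(peptides, "abxxaca")
print(counts[0], counts[5])
```

The third peptide does not match the pattern, so it contributes nothing to the counts.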

For the other frequent amino acid similarity group sequences in the same cluster, the soluble control tag design apparatus 10 likewise stores the occurrence count of each type of amino acid at each position in the sequence. Then, as shown in FIG. 11, the soluble control tag design apparatus 10 sums the per-position, per-amino-acid occurrence counts obtained from all the frequent amino acid similarity group sequences in the same cluster (S20). Based on the summed counts, the solubilization tag is determined by combining the amino acid that occurs most often at each position (S22).

FIG. 12 is a visual representation of the occurrence counts of amino acids at each position in the sequence. The horizontal axis indicates the position in the sequence: first, second, and so on up to seventh, from the left. The vertical axis indicates how often each amino acid occurred; amino acids that occurred more often are drawn in a larger font and placed higher. In this example, "E" is the most common amino acid at positions 1, 2, and 4, and "L" is common at position 7. At the other positions no significant difference was seen among the amino acids that occurred, so these positions are set to "x" (all). In this case, "EExExxL" is determined as the solubilization tag. Whether a difference is significant can be judged with a threshold. For example, the difference can be judged significant when the occurrence counts of the most common and second most common amino acids differ by 10% or more.
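One way to turn per-position counts into a tag is sketched below. The text does not say what the 10% is measured against, so this sketch assumes the gap between the top two counts is taken relative to the top count; that choice, the function name `consensus_tag`, and the example counts are all assumptions made for illustration.

```python
from collections import Counter


def consensus_tag(position_counts, min_gap=0.10):
    """Pick the most frequent amino acid per position, or 'x' if no clear winner."""
    tag = []
    for counts in position_counts:
        ranked = counts.most_common(2)
        if not ranked or ranked[0][1] == 0:
            tag.append("x")
            continue
        top_residue, top_n = ranked[0]
        second_n = ranked[1][1] if len(ranked) > 1 else 0
        # Assumption: the gap is measured relative to the top count.
        if (top_n - second_n) / top_n >= min_gap:
            tag.append(top_residue)
        else:
            tag.append("x")
    return "".join(tag)


# Hypothetical summed counts for one cluster, one Counter per position.
example_counts = [
    Counter({"E": 50, "K": 20}),
    Counter({"E": 40, "D": 10}),
    Counter({"A": 20, "S": 19}),
    Counter({"E": 30, "Q": 12}),
    Counter({"G": 15, "T": 15}),
    Counter({"K": 21, "R": 20}),
    Counter({"L": 33, "V": 11}),
]
tag = consensus_tag(example_counts)
print(tag)
```

The example counts were chosen so that positions 3, 5, and 6 have no clear winner, giving a tag of the same shape as the one in the text.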

The configuration of the soluble control tag design apparatus 10 of this embodiment and its operation in designing a solubilization tag have been described above. When designing an insolubilization tag, step S16, which extracts the frequent amino acid similarity group sequences, uses the following expressions to obtain the amino acid similarity group sequences that occur frequently in insoluble proteins, and step S20, which searches for amino acid sequences corresponding to the frequent amino acid similarity group sequences, searches the N-termini of the insoluble proteins.

In this embodiment, the amino acid similarity group sequences that occur frequently at the N-terminus of soluble and insoluble proteins are obtained first, so the amount of computation is lower than when amino acid sequences are used directly. In addition, because there are far fewer possible amino acid similarity group sequences than amino acid sequences, counting occurrences of group sequences makes trends in the counts visible even when data are scarce. For a length of 7 residues there are 20⁷ possible amino acid sequences, about 80 times the number of amino acid similarity group sequences, so the occurrence count falling on each amino acid sequence is about 1/80 of that falling on each group sequence. The counts are then small for every sequence, which makes the features of soluble or insoluble proteins hard to capture. Using amino acid similarity group sequences, by contrast, makes it possible to find characteristic group sequences even with relatively little data.
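The "about 80 times" figure can be checked with a short calculation, using the 10² × 11⁵ count of amino acid similarity group sequences given earlier in the description. The variable names are illustrative only.

```python
# Number of possible 7-residue amino acid sequences (20 amino acids).
n_amino_acid_sequences = 20 ** 7

# Number of 7-position similarity group sequences stated in the description.
n_group_sequences = 10 ** 2 * 11 ** 5

ratio = n_amino_acid_sequences / n_group_sequences
print(n_amino_acid_sequences, n_group_sequences, round(ratio, 1))
```

The ratio comes out just under 80, consistent with the approximation used in the text.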

The soluble control tag design apparatus of the present invention has been described in detail with reference to an embodiment, but the present invention is not limited to the embodiment described above.

The embodiment above describes an example in which a soluble control tag is designed using amino acid similarity group sequences. When the protein database holds a large amount of data, however, amino acid sequences characteristic of soluble or insoluble proteins may be searched for directly.

The embodiment above describes an example in which clustering is performed using a dendrogram. The clustering method is not limited to one that uses a dendrogram, and a known method such as the k-means method may be adopted. When the number of frequent amino acid similarity group sequences is not large, clustering need not be performed at all.

In the embodiment above, the frequent amino acid similarity group sequences are obtained using the S value and the p value, but the threshold for deciding whether a sequence is frequent may be set by another method.

The effect on protein solubility of adding solubilization tags and insolubilization tags designed by the soluble control tag design apparatus of the present invention was evaluated.

(Generation of soluble control tags)
Using a database of proteins expressed in a wheat germ cell-free system, the 16 solubilization tags and 12 insolubilization tags shown in Table 1 below were designed.

Nine genes to which soluble control tags would be added were selected according to the following criteria.
(1) The gene matches RefSeq and has no transmembrane domain.
(2) According to HGPD (Human Gene and Protein Database) data, the molecular weight is about 50 kDa, and the selected genes differ in their degree of solubilization.
The nine genes are shown in Table 2 below.

For each gene shown in Table 2, the gene sequence corresponding to a solubilization tag or insolubilization tag from Table 1 was added upstream of the protein-coding sequence, so that the tag is attached at the N-terminus, and the protein was expressed in a wheat germ cell-free system. Table 3 below shows how the solubility of each protein changed compared with the case where no solubilization or insolubilization tag was added.

As shown in Table 3, the solubilization tags and insolubilization tags designed by the soluble control tag design apparatus of the present invention affected the solubilization and insolubilization of the proteins.

Because the present invention derives soluble control tags from actual data, it can design tags suited to the conditions at hand and is useful for protein production.

10 Soluble control tag design apparatus
12 Input unit
14 Protein database
16 CPU
18 Output unit
20 RAM
22 ROM
24 Program

Claims

An apparatus for designing a soluble control tag based on data read from a database storing amino acid sequences of soluble proteins and insoluble proteins, comprising:
an input unit for inputting the residue length L of the soluble control tag to be obtained;
an arithmetic unit for obtaining the soluble control tag based on the data read from the database; and
an output unit for outputting the soluble control tag obtained by the arithmetic unit,
wherein the arithmetic unit performs:
(1) generating amino acid similarity group sequences, which are sequences of amino acid similarity groups classified according to the properties of amino acids, covering all combinations of amino acid similarity groups for an amino acid sequence of residue length L;
(2) reading, from the database, amino acid sequences of residue length L contained in the N-terminal K residues (K ≥ L) of the soluble proteins and the insoluble proteins;
(3) counting, based on the read amino acid sequences, the number of occurrences of each amino acid similarity group sequence at the N-terminus of the soluble proteins and of the insoluble proteins, respectively;
(4) calculating the occurrence frequency of each amino acid similarity group sequence at the N-terminus of the soluble proteins when designing a solubilization tag that enhances solubility, or at the N-terminus of the insoluble proteins when designing an insolubilization tag that enhances insolubility, and obtaining, as frequent amino acid similarity group sequences, the amino acid similarity group sequences whose occurrence frequency is higher than a predetermined threshold;
(5) clustering the plurality of frequent amino acid similarity group sequences into a plurality of clusters;
(6) sequentially performing, for each read amino acid sequence, a process of comparing an amino acid sequence of residue length L read from the N-terminal K residues of the soluble proteins when designing a solubilization tag, or of the insoluble proteins when designing an insolubilization tag, with the frequent amino acid similarity group sequences and, if the amino acid sequence corresponds to a frequent amino acid similarity group sequence, incrementing the occurrence count of each type of amino acid at each position in the amino acid sequence based on the amino acids it contains;
(7) summing, over the frequent amino acid similarity group sequences in the same cluster, the occurrence counts of each type of amino acid at each position in the amino acid sequence; and
(8) obtaining, as the solubilization tag or the insolubilization tag, an amino acid sequence formed by combining the amino acid with the highest occurrence count at each position.

The soluble control tag design apparatus according to claim 1, wherein in step (4), when a solubilization tag is obtained, the amino acid similarity group sequences whose S value and p value, calculated by the following formulas, satisfy S > 0.9 and p < 1 × 10⁻⁵ are obtained as the frequent amino acid similarity group sequences.

The soluble control tag design apparatus according to claim 1 or 2, wherein step (5) comprises:
converting each frequent amino acid similarity group sequence into coordinate values in which each amino acid similarity group is represented by 20-dimensional coordinates, with "1" for each amino acid contained in the group and "0" for each amino acid not contained;
generating a dendrogram based on the Euclidean distances between the frequent amino acid similarity group sequences; and
performing clustering by cutting the dendrogram at a predetermined height.

A method for designing, as a soluble control tag, a solubilization tag that enhances the solubility of a protein or an insolubilization tag that enhances its insolubility, the method comprising:
(1) generating amino acid similarity group sequences, which are sequences of amino acid similarity groups classified according to the properties of amino acids, covering all combinations of amino acid similarity groups for an amino acid sequence of residue length L;
(2) reading, from a database storing amino acid sequences of soluble proteins and insoluble proteins, amino acid sequences of residue length L contained in the N-terminal K residues (K ≥ L) of the soluble proteins and the insoluble proteins;
(3) counting, based on the read amino acid sequences, the number of occurrences of each amino acid similarity group sequence at the N-terminus of the soluble proteins and of the insoluble proteins, respectively;
(4) calculating the occurrence frequency of each amino acid similarity group sequence at the N-terminus of the soluble proteins when designing a solubilization tag, or at the N-terminus of the insoluble proteins when designing an insolubilization tag, and obtaining, as frequent amino acid similarity group sequences, the amino acid similarity group sequences whose occurrence frequency is higher than a predetermined threshold;
(5) clustering the plurality of frequent amino acid similarity group sequences into a plurality of clusters;
(6) sequentially performing, for each read amino acid sequence, a process of comparing an amino acid sequence of residue length L read from the N-terminal K residues of the soluble proteins when designing a solubilization tag, or of the insoluble proteins when designing an insolubilization tag, with the frequent amino acid similarity group sequences and, if the amino acid sequence corresponds to a frequent amino acid similarity group sequence, incrementing the occurrence count of each type of amino acid at each position in the amino acid sequence based on the amino acids it contains;
(7) summing, over the frequent amino acid similarity group sequences in the same cluster, the occurrence counts of each type of amino acid at each position in the amino acid sequence; and
(8) obtaining, as the solubilization tag or the insolubilization tag, an amino acid sequence formed by combining the amino acid with the highest occurrence count at each position.

The soluble control tag design method according to claim 4, wherein in step (4), when a solubilization tag is obtained, the amino acid similarity group sequences whose S value and p value, calculated by the following formulas, satisfy S > 0.9 and p < 1 × 10⁻⁵ are obtained as the frequent amino acid similarity group sequences.

The soluble control tag design method according to claim 4 or 5, wherein step (5) comprises:
converting each frequent amino acid similarity group sequence into coordinate values in which each amino acid similarity group is represented by 20-dimensional coordinates, with "1" for each amino acid contained in the group and "0" for each amino acid not contained;
generating a dendrogram based on the Euclidean distances between the frequent amino acid similarity group sequences; and
performing clustering by cutting the dendrogram at a predetermined height.

A program for designing, as a soluble control tag, a solubilization tag that enhances the solubility of a protein or an insolubilization tag that enhances its insolubility, the program executing:
(1) securing an area for storing, for each amino acid similarity group sequence, the number of occurrences at the N-terminus of soluble proteins and of insoluble proteins, the amino acid similarity group sequences being sequences of amino acid similarity groups classified according to the properties of amino acids and covering all combinations of amino acid similarity groups for an amino acid sequence of residue length L;
(2) reading, from a database storing amino acid sequences of soluble proteins and insoluble proteins, amino acid sequences of residue length L contained in the N-terminal K residues (K ≥ L) of the soluble proteins and the insoluble proteins;
(3) counting, based on the read amino acid sequences, the number of occurrences of each amino acid similarity group sequence at the N-terminus of the soluble proteins and of the insoluble proteins, respectively;
(4) calculating the occurrence frequency of each amino acid similarity group sequence at the N-terminus of the soluble proteins when designing a solubilization tag, or at the N-terminus of the insoluble proteins when designing an insolubilization tag, and obtaining, as frequent amino acid similarity group sequences, the amino acid similarity group sequences whose occurrence frequency is higher than a predetermined threshold;
(5) clustering the plurality of frequent amino acid similarity group sequences into a plurality of clusters;
(6) sequentially performing, for each read amino acid sequence, a process of comparing an amino acid sequence of residue length L read from the N-terminal K residues of the soluble proteins when designing a solubilization tag, or of the insoluble proteins when designing an insolubilization tag, with the frequent amino acid similarity group sequences and, if the amino acid sequence corresponds to a frequent amino acid similarity group sequence, incrementing the occurrence count of each type of amino acid at each position in the amino acid sequence based on the amino acids it contains;
(7) summing, over the frequent amino acid similarity group sequences in the same cluster, the occurrence counts of each type of amino acid at each position in the amino acid sequence;
(8) obtaining, as the solubilization tag or the insolubilization tag, an amino acid sequence formed by combining the amino acid with the highest occurrence count at each position; and
(9) outputting the solubilization tag or the insolubilization tag.

The program according to claim 7, wherein in step (4), when a solubilization tag is obtained, the amino acid similarity group sequences whose S value and p value, calculated by the following formulas, satisfy S > 0.9 and p < 1 × 10⁻⁵ are obtained as the frequent amino acid similarity group sequences.

The program according to claim 7 or 8, wherein step (5) comprises:
converting each frequent amino acid similarity group sequence into coordinate values in which each amino acid similarity group is represented by 20-dimensional coordinates, with "1" for each amino acid contained in the group and "0" for each amino acid not contained;
generating a dendrogram based on the Euclidean distances between the frequent amino acid similarity group sequences; and
performing clustering by cutting the dendrogram at a predetermined height.