JP2019095819A

JP2019095819A - Information processing device and program

Info

Publication number: JP2019095819A
Application number: JP2016070976A
Authority: JP
Inventors: Goro Terai (寺井 悟朗); Kiyoshi Asai (浅井 潔)
Original assignee: Intec Inc Japan; University of Tokyo NUC
Current assignee: Intec Inc Japan; University of Tokyo NUC
Priority date: 2016-03-31
Filing date: 2016-03-31
Publication date: 2019-06-20
Also published as: WO2017169736A1

Abstract

To provide an information processing device and a program capable of suppressing homologous recombination and increasing the production of a target protein.

SOLUTION: First-generation parent population data is acquired; it is generated from data representing an amino acid sequence, the number of genes, and a codon frequency table, and contains a predetermined number of individual data items. A mutation process is executed on the individuals in the first-generation parent population data, and first-generation child population data containing the mutated individuals is acquired. A non-dominated sort is executed on first-generation integrated data, obtained by merging the first-generation parent and child population data, based on predetermined evaluation criteria concerning codon adaptation and the base sequences of the codons, and all individual data in the integrated data is classified by rank in the Pareto-optimal solutions. A predetermined number of individual data items is then selected from the classified data in descending order of rank.

SELECTED DRAWING: Figure 8

Description

The present invention relates to an information processing device and a program that can increase the production of a target protein.

When a microorganism or the like is made to produce a target protein, a known technique is to introduce multiple copies of the gene encoding the target protein. Genes with identical DNA sequences are often used for this purpose. However, when multiple genes with the same DNA sequence are introduced, homologous recombination occurs between them and some of the genes are lost. Here, homologous recombination is recombination that occurs at sites where the DNA base sequences closely resemble each other (homologous sites). FIG. 1 illustrates this conceptually. FIG. 1(a) shows an example in which five genes having the same DNA sequence are introduced. If homologous recombination occurs between the latter half of the second gene and the first half of the fifth gene, the number of genes falls to two, as shown in FIG. 1(b), and the production efficiency of the target protein decreases.

Patent Document 1 discloses a method for obtaining a synthetic nucleic acid molecule, comprising the steps of: (i) providing an amino acid sequence derived from an amino acid repeat region of a polypeptide; (ii) deducing a plurality of sample codon-optimized nucleic acid sequences, each encoding the amino acid sequence; (iii) aligning the plurality of sample codon-optimized nucleic acid sequences by sequence homology and constructing a neighbor-joining tree containing them; (iv) selecting only one of the plurality of sample codon-optimized nucleic acid sequences; and (v) obtaining a nucleic acid molecule containing the selected sample codon-optimized nucleic acid sequence.

Patent Document 1: JP-A-2015-524658 (published Japanese translation of a PCT application, 特表2015-524658号公報)

The present invention provides an information processing device and a program capable of suppressing homologous recombination and increasing the production of a target protein.

According to the present invention, there is provided an information processing device comprising: a parent population data acquisition unit that acquires first-generation parent population data, which is generated from data representing an amino acid sequence, the number of genes, and a codon frequency table and which represents a first-generation parent population containing a predetermined number of individual data items; a mutation processing unit that executes a mutation process on the individuals included in the first-generation parent population data; a child population data acquisition unit that acquires first-generation child population data representing a first-generation child population containing the individuals on which the mutation process has been executed; a non-dominated sort execution unit that executes a non-dominated sort on first-generation integrated data, obtained by integrating the first-generation parent population data and the first-generation child population data, based on predetermined evaluation criteria concerning codon adaptation and the base sequences of the codons, and classifies all individual data included in the first-generation integrated data by rank in the Pareto-optimal solutions; and an individual selection unit that selects a predetermined number of the individual data items, in descending order of rank, from all the individual data classified by rank.

According to the present invention, homologous recombination can be suppressed and the production of the target protein increased on the basis of two different evaluation criteria.

Various embodiments of the present invention are illustrated below. The embodiments shown below can be combined with one another.
Preferably, when selecting the predetermined number of individual data items, the individual selection unit selects, among individual data items of the same rank, those with the larger crowding distance first.
Preferably, the parent population data acquisition unit takes the individual data selected by the individual selection unit as second-generation parent population data representing a second-generation parent population, and the processing by the mutation processing unit, the non-dominated sort execution unit, and the individual selection unit is repeated until a predetermined number of generations is reached.
Preferably, the evaluation criterion concerning codon adaptation is based on the minimum value of the codon adaptation index of the CDSs, a CDS being a base sequence subject to translation into amino acids, of which each individual has a plurality.
Preferably, the larger the minimum codon adaptation index within an individual, the higher the individual is evaluated.
Preferably, the evaluation criterion concerning the base sequences of the codons is based on the minimum value of the number of mismatched bases, that is, the number of bases that differ between two of the CDSs included in each individual.
Preferably, the larger the minimum number of mismatched bases, the higher the individual is evaluated.
Preferably, the evaluation criterion concerning the base sequences of the codons is based on the length of the longest common string, that is, the longest of the base sequences that match consecutively between the CDSs included in each individual or at different sites within a single CDS.
Preferably, the shorter the longest common string, the higher the individual is evaluated.
Preferably, the mutation processing unit executes, on each individual data item included in the g-th generation parent population data representing the g-th generation parent population, a first mutation process and a second mutation process different from the first mutation process.
Preferably, the mutation processing unit executes a first mutation process in which, for every CDS included in each individual, a codon in the CDS is replaced, with a predetermined probability, by a codon of higher frequency than that codon.
Preferably, the mutation processing unit executes a second mutation process in which a codon overlapping the longest common string, that is, the longest of the base sequences that match consecutively between the CDSs included in each individual or at different sites within a single CDS, is replaced by another codon with a predetermined probability.
Preferably, the first mutation process or the second mutation process is selected at random.
Preferably, the device has a crossover processing unit that executes a crossover process on the individuals included in the first-generation parent population data; the crossover process extracts a predetermined even number of individual data items from the g-th generation parent population data representing the g-th generation parent population, selects two individual data items from the extracted items, and executes the crossover on the two selected items.
Preferably, the crossover processing unit determines a crossover point from among the codon boundaries in the CDSs included in the two selected individual data items, namely first individual data and second individual data, and swaps the codons of the first and second individual data with the crossover point as the boundary.
Preferably, there is provided an information processing program for causing a computer to function as: a parent population data acquisition unit that acquires first-generation parent population data, which is generated from data representing an amino acid sequence, the number of genes, and a codon frequency table and which represents a first-generation parent population containing a predetermined number of individual data items; a mutation processing unit that executes a mutation process on the individuals included in the first-generation parent population data; a child population data acquisition unit that acquires first-generation child population data representing a first-generation child population containing the individuals on which the mutation process has been executed; a non-dominated sort execution unit that executes a non-dominated sort on first-generation integrated data, obtained by integrating the first-generation parent population data and the first-generation child population data, based on predetermined evaluation criteria concerning codon adaptation and the base sequences of the codons, and classifies all individual data included in the first-generation integrated data by rank in the Pareto-optimal solutions; and an individual selection unit that selects a predetermined number of the individual data items, in descending order of rank, from all the individual data classified by rank.
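The three evaluation criteria named in the list above (minimum codon adaptation index, minimum number of mismatched bases, and length of the longest common string) can be sketched as follows. This is a minimal illustration, not the patented implementation. It assumes the standard definition of the codon adaptation index as the geometric mean of per-codon weights; the `weights` mapping and all function names are hypothetical. It also assumes that repeated sites within one CDS may overlap, which the text does not specify.

```python
import math
from itertools import combinations


def codons_of(cds):
    """Split a CDS string into its codons (triplets)."""
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def cai(cds, weights):
    """Geometric mean of per-codon weights (standard CAI definition, assumed here)."""
    codons = codons_of(cds)
    return math.exp(sum(math.log(weights[c]) for c in codons) / len(codons))


def min_cai(individual, weights):
    """Criterion 1: the smallest CAI among the CDSs of one individual (larger is better)."""
    return min(cai(cds, weights) for cds in individual)


def mismatches(a, b):
    """Number of positions at which two equal-length CDSs differ."""
    return sum(x != y for x, y in zip(a, b))


def min_mismatches(individual):
    """Criterion 2: smallest mismatch count over all CDS pairs (needs at least two CDSs)."""
    return min(mismatches(a, b) for a, b in combinations(individual, 2))


def common_run(a, b, same_cds=False):
    """Length of the longest run of consecutive matching bases between a and b.

    When same_cds is True, a site is never matched with itself, so only
    repeats at different sites of the one CDS are counted.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1] and not (same_cds and i == j):
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def longest_common_length(individual):
    """Criterion 3: longest common string between or within CDSs (shorter is better)."""
    n = len(individual)
    return max(
        common_run(individual[i], individual[j], same_cds=(i == j))
        for i in range(n)
        for j in range(i, n)
    )
```

The dynamic-programming scan in `common_run` is quadratic in CDS length; a suffix-based structure would scale better for long genes, but the simple form keeps the definition visible.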

FIG. 1 is a conceptual diagram of the conventional technique of introducing multiple genes encoding a target protein when a microorganism or the like is made to produce the target protein; (a) shows an example in which five genes having the same DNA sequence are introduced, and (b) shows the result of homologous recombination occurring between the latter half of the second gene and the first half of the fifth gene, reducing the number of genes to two.
FIG. 2 is a conceptual diagram of gene sequence design according to an embodiment of the present invention; (a) shows input data being supplied to the algorithm according to the present invention and gene sequences being output as output data, and (b) shows that no homologous recombination occurs among the five introduced genes and that the target protein is produced from all of them.
FIG. 3 is a diagram showing an example of the hardware configuration of the information processing device 1.
FIG. 4 is an exemplary functional block diagram of the information processing device 1 according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining the crowding distance; (a) is a conceptual diagram of the crowding distance, and (b) shows the formula for calculating it.
FIG. 6 is a diagram showing an example of a flowchart for carrying out gene sequence design according to an embodiment of the present invention. This processing is executed before the main routine shown in FIG. 8.
FIG. 7 is a diagram showing an example of individual data. In this embodiment, one individual is represented as a plurality of protein coding regions (CDSs) encoding the same amino acid sequence.
FIG. 8 is a diagram showing an example of a flowchart for carrying out gene sequence design according to an embodiment of the present invention. The crossover process in S22 is optional and can be omitted as necessary.
FIG. 9 is a diagram showing an example of a flowchart for carrying out the crossover process according to an embodiment of the present invention.
FIG. 10 is a conceptual diagram of the crossover process according to an embodiment of the present invention.
FIG. 11 is a diagram showing an example of a flowchart for carrying out the mutation process according to an embodiment of the present invention.
FIG. 12 is a conceptual diagram of the mutation process according to an embodiment of the present invention; (a) shows the first mutation process and (b) shows the second mutation process.
FIG. 13 is a diagram for explaining the "number of mismatched bases", an evaluation criterion concerning the base sequences of the codons according to an embodiment of the present invention. In the example of FIG. 13, the number of mismatched bases is 5.
FIG. 14 is a diagram for explaining the "longest common string", an evaluation criterion concerning the base sequences of the codons according to an embodiment of the present invention. In FIG. 14, the part with a solid underline is the longest common string, and the parts with a broken underline are common strings.
FIG. 15 is a diagram showing processing results in an example of the present invention. The horizontal axis is the evaluation criterion concerning codon adaptation (the minimum codon adaptation index (CAI) of the CDSs, a CDS being a base sequence subject to translation into amino acids, of which each individual has a plurality), the vertical axis is the evaluation criterion concerning the base sequences of the codons (the minimum number of mismatched bases), and the results for the 1st, 10th, and 250th generations are plotted.
FIG. 16 is a diagram showing processing results in an example of the present invention. The horizontal axis is the evaluation criterion concerning codon adaptation (the minimum codon adaptation index (CAI) of the CDSs), the vertical axis is the evaluation criterion concerning the base sequences of the codons (the length of the longest common string), and the results for the 1st, 10th, and 250th generations are plotted.

<Embodiment>
Embodiments of the present invention are described below with reference to the drawings. The various features shown in the embodiments below can be combined with one another.

<Gene Sequence Design According to One Embodiment of the Present Invention>
FIG. 2 is a conceptual diagram showing gene sequence design according to an embodiment of the present invention. The gene sequence design according to this embodiment increases the production of a target protein by designing a group of gene sequences that does not induce homologous recombination and introducing it into a microorganism or the like. As shown in FIG. 2(a), data representing the target protein and data representing N genes are used as input data, calculation is performed according to an algorithm (hereinafter, "the present algorithm"), and data representing a group of N gene sequences encoding the target protein is output as the result. The present algorithm uses a multi-objective genetic algorithm, whose purpose is to optimize simultaneously a plurality of objective functions with mutually conflicting evaluation criteria. As a result, as shown in FIG. 2(b), when five genes are introduced, for example, no homologous recombination occurs among the five genes and the target protein is produced from all of them. In FIG. 2(b), genes 1 to 5 have base sequences that differ from one another.
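One way to write down the optimization problem described above is the following. The notation is introduced here for illustration only and does not appear in the source: an individual x is a set of N CDSs, d counts mismatched bases between two CDSs, and ℓ is the length of the longest common string. The objectives correspond to the evaluation criteria named earlier in the description, and the last line is the usual Pareto-dominance relation for maximization.

```latex
\begin{aligned}
x &= (c_1, \dots, c_N), \quad c_k \text{ a CDS encoding the target protein},\\
f_1(x) &= \min_{1 \le k \le N} \mathrm{CAI}(c_k),\\
f_2(x) &= \min_{1 \le k < l \le N} d(c_k, c_l)
  \quad\text{or}\quad f_2(x) = -\,\ell(x),\\
x \succ y &\iff f_i(x) \ge f_i(y)\ (i = 1, 2)
  \ \text{ and }\ f_i(x) > f_i(y) \text{ for some } i .
\end{aligned}
```

Raising f_1 pushes every CDS toward the same high-frequency codons, while raising f_2 pushes the CDSs apart, which is why the two criteria conflict and a set of Pareto-optimal trade-offs is sought rather than a single optimum.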

<Hardware configuration>
Next, an example of the hardware configuration of the information processing device 1 according to an embodiment of the present invention is described with reference to FIG. 3. The information processing device 1 has a processing unit 10, a storage unit 20, an operation unit 30, a display unit 40, and a communication unit 50. The processing unit 10 executes various arithmetic operations and consists of, for example, a CPU. The storage unit 20 stores various data and programs and consists of, for example, memory, an HDD, or an SSD. The program may be preinstalled when the information processing device 1 is shipped, downloaded as an application from a site on the Web, or transferred from another information processing device by wireless communication. The operation unit 30 is used to operate the information processing device 1 and consists of, for example, a touch panel, a keyboard, a voice input unit, or a motion recognition device using a camera or the like. The display unit 40 displays various images (including still images and video) and consists of, for example, a touch panel display, an organic EL display, electronic paper, or another display. The communication unit 50 exchanges various data with other information processing devices and consists of any I/O. The bus 100 consists of a serial bus, a parallel bus, or the like; it connects the units electrically and allows them to exchange data.

<Function block diagram>
Next, the functions of the information processing device 1 are described with reference to the functional block diagram of FIG. 4. The information processing device 1 is, for example, a multifunction information terminal such as a PC, a server, a smartphone, a tablet, or a smartwatch. The information processing device 1 includes the operation unit 30, the display unit 40, the communication unit 50, the processing unit 10, and the storage unit 20. The processing unit 10 includes an individual generation unit 101, a parent population data acquisition unit 102, a child population data acquisition unit 103, a crossover processing unit 104, a mutation processing unit 105, a non-dominated sort execution unit 106, and an individual selection unit 107. The storage unit 20 includes an amino acid sequence data storage unit 201, a gene number data storage unit 202, a codon frequency table data storage unit 203, a calculation data storage unit 204, and an evaluation criteria storage unit 205.

For the functions of the operation unit 30, the display unit 40, and the communication unit 50, refer to the description of FIG. 3.

<Processing unit 10>
Next, the functions of the processing unit 10 are described. The individual generation unit 101 acquires data representing the amino acid sequence, the number of genes, and the codon frequency table from the amino acid sequence data storage unit 201, the gene number data storage unit 202, and the codon frequency table data storage unit 203, respectively, and generates p individual data items, each representing an individual generated at random under the constraint that it encodes the amino acid sequence of the same protein. Here, p is a parameter taking a positive value and can be set to any number.
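The generation step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the codon table is a hypothetical, truncated example; each individual is modelled as a list of CDS strings; and synonymous codons are drawn uniformly at random, since the text only says that individuals are generated randomly under the coding constraint.

```python
import random

# Hypothetical, truncated codon frequency table: amino acid -> {codon: frequency}.
CODON_TABLE = {
    "M": {"ATG": 1.0},
    "A": {"GCT": 0.4, "GCC": 0.3, "GCA": 0.2, "GCG": 0.1},
    "K": {"AAA": 0.6, "AAG": 0.4},
}


def random_cds(protein, table, rng):
    """Pick one synonymous codon per amino acid (uniform choice is an assumption)."""
    return "".join(rng.choice(sorted(table[aa])) for aa in protein)


def generate_individual(protein, n_genes, table, rng):
    """One individual = n_genes CDSs that all encode the same protein."""
    return [random_cds(protein, table, rng) for _ in range(n_genes)]


def generate_population(protein, n_genes, p, table, rng):
    """The first-generation parent population: p random individuals."""
    return [generate_individual(protein, n_genes, table, rng) for _ in range(p)]


def translate(cds, table):
    """Translate a CDS back to amino acids to check the coding constraint."""
    codon_to_aa = {c: aa for aa, codons in table.items() for c in codons}
    return "".join(codon_to_aa[cds[i:i + 3]] for i in range(0, len(cds), 3))
```

Passing an explicit `random.Random` instance keeps runs reproducible, for example `generate_population("MAK", 5, 10, CODON_TABLE, random.Random(0))`.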

The parent population data acquisition unit 102 acquires the p individual data items generated by the individual generation unit 101 as g-th generation parent population data, which is the data used by the present algorithm and represents the g-th generation parent population. As described later, the calculation in the present algorithm is a loop that repeats a predetermined flow multiple times, and the parent population data acquisition unit 102 acquires first-generation parent population data, second-generation parent population data, ..., g-th generation parent population data, representing the parent populations of the first, second, ..., g-th generations. Here, g is a positive number representing the loop count of the calculation in the present algorithm.

The crossover processing unit 104 extracts e individual data items (a predetermined even number) from the g-th generation parent population data, selects two individual data items from the extracted items, and executes a crossover on the two selected items. It then selects two more items from among those not yet crossed and executes the crossover again, repeating this until all e extracted items have been processed. Specifically, a crossover point is determined from among the codon boundaries in the CDSs included in the two selected items (first individual data and second individual data), and the codons of the first and second individual data are swapped with the crossover point as the boundary. The two items are selected at random, for example by using a random number table. Here, e is the largest even number not exceeding p × Pc, where Pc is a parameter that can take any value greater than 0 and less than 1. The method of extracting e individual data items from the g-th generation parent population data containing p items is not particularly limited; for example, binary tournament selection can be used.
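The crossover step described above can be sketched as follows. Several details are assumptions made for illustration, because the text here does not fix them: a single codon boundary is chosen and applied to every CDS of the pair, the tournament compares individuals with a caller-supplied `key` function (higher is better), and all function names are hypothetical. Because both parents encode the same protein, swapping whole codons at a codon boundary keeps the encoded amino acid sequence unchanged.

```python
import math
import random


def num_to_cross(p, pc):
    """e: the largest even number not exceeding p * Pc."""
    return math.floor(p * pc) // 2 * 2


def binary_tournament(population, e, key, rng):
    """Draw e individuals; each draw keeps the better of two random candidates."""
    chosen = []
    for _ in range(e):
        a, b = rng.sample(population, 2)
        chosen.append(a if key(a) >= key(b) else b)
    return chosen


def crossover(ind1, ind2, rng):
    """Swap the codons after one randomly chosen codon boundary.

    Assumes every CDS has the same length and at least two codons.
    """
    n_codons = len(ind1[0]) // 3
    cut = 3 * rng.randrange(1, n_codons)  # a boundary strictly inside the CDS
    child1 = [a[:cut] + b[cut:] for a, b in zip(ind1, ind2)]
    child2 = [b[:cut] + a[cut:] for a, b in zip(ind1, ind2)]
    return child1, child2


def cross_population(selected, rng):
    """Pair the extracted individuals at random and cross each pair once."""
    pool = list(selected)
    rng.shuffle(pool)
    children = []
    for i in range(0, len(pool) - 1, 2):
        children.extend(crossover(pool[i], pool[i + 1], rng))
    return children
```

For example, with p = 10 and Pc = 0.75, `num_to_cross` gives 6, so three pairs are crossed per generation.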

The mutation processing unit 105 executes a mutation process on all individuals included in the g-th generation parent population data. In this embodiment, a first mutation process and a second mutation process different from the first are available for each individual data item in the g-th generation parent population data. Specifically, for each individual data item, either the first or the second mutation process is chosen at random. If the first mutation process is chosen, then for every CDS in the individual data, each codon in the CDS is replaced, with a predetermined probability Pm, by a codon of higher frequency than that codon. If the second mutation process is chosen, each codon that overlaps the longest common string (the longest of the base sequences that match consecutively between the CDSs in the individual data or at different sites within a single CDS) is replaced by another codon with the predetermined probability Pm. These processes are described in detail later.
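The two mutation processes described above can be sketched as follows. This is an illustration under stated assumptions rather than the patented procedure: the codon table is hypothetical and truncated, replacements are restricted to synonymous codons, the replacement is drawn uniformly from the eligible codons, both occurrences of the longest common string are mutated, and the two processes are chosen with equal probability.

```python
import random

# Hypothetical, truncated codon frequency table: amino acid -> {codon: frequency}.
TABLE = {
    "M": {"ATG": 1.0},
    "A": {"GCT": 0.4, "GCC": 0.3, "GCA": 0.2, "GCG": 0.1},
    "K": {"AAA": 0.6, "AAG": 0.4},
}
CODON_TO_AA = {c: aa for aa, codons in TABLE.items() for c in codons}


def split_codons(cds):
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def first_mutation(individual, pm, rng):
    """With probability pm, replace a codon by a higher-frequency synonymous codon."""
    mutated = []
    for cds in individual:
        codons = split_codons(cds)
        for i, codon in enumerate(codons):
            freqs = TABLE[CODON_TO_AA[codon]]
            better = [c for c in freqs if freqs[c] > freqs[codon]]
            if better and rng.random() < pm:
                codons[i] = rng.choice(better)
        mutated.append("".join(codons))
    return mutated


def find_longest_common(individual):
    """Return (length, [(cds_index, start), (cds_index, start)]) of the longest shared run."""
    best_len, best_sites = 0, []
    for k1 in range(len(individual)):
        for k2 in range(k1, len(individual)):
            a, b = individual[k1], individual[k2]
            prev = [0] * (len(b) + 1)
            for i in range(1, len(a) + 1):
                cur = [0] * (len(b) + 1)
                for j in range(1, len(b) + 1):
                    # Within one CDS, never match a site with itself.
                    if a[i - 1] == b[j - 1] and not (k1 == k2 and i == j):
                        cur[j] = prev[j - 1] + 1
                        if cur[j] > best_len:
                            best_len = cur[j]
                            best_sites = [(k1, i - best_len), (k2, j - best_len)]
                prev = cur
    return best_len, best_sites


def second_mutation(individual, pm, rng):
    """With probability pm, replace each codon overlapping the longest common string."""
    length, sites = find_longest_common(individual)
    codons = [split_codons(cds) for cds in individual]
    for k, start in sites:
        for idx in range(start // 3, (start + length - 1) // 3 + 1):
            codon = codons[k][idx]
            others = [c for c in TABLE[CODON_TO_AA[codon]] if c != codon]
            if others and rng.random() < pm:
                codons[k][idx] = rng.choice(others)
    return ["".join(c) for c in codons]


def mutate(individual, pm, rng):
    """Choose one of the two mutation processes at random (equal odds assumed)."""
    op = first_mutation if rng.random() < 0.5 else second_mutation
    return op(individual, pm, rng)
```

The first process tends to raise the codon adaptation index, while the second breaks up the longest shared stretch, so each one targets one of the two conflicting criteria.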

The child population data acquisition unit 103 acquires g-th generation child population data representing the g-th generation child population, which contains the individuals on which the mutation processing unit 105 has executed the mutation process. The number of individual data items in the g-th generation child population data is p, equal to the number in the g-th generation parent population data, because the mutation processing unit 105 executes the mutation process on every individual in the g-th generation parent population data.

The non-dominated sort execution unit 106 executes a non-dominated sort on g-th generation integrated data, obtained by integrating the g-th generation parent population data and the g-th generation child population data, based on predetermined evaluation criteria concerning codon adaptation and the base sequences of the codons. The g-th generation integrated data contains 2p (= p + p) individual data items. All individuals are then classified by front (that is, by rank) in the Pareto-optimal solutions.
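The classification into fronts described above can be sketched as follows. This is a simple quadratic-time version written for clarity, not the bookkeeping-based fast variant used in NSGA-II implementations. It assumes each individual has already been scored as a tuple of objectives that are all to be maximized (for example, minimum CAI and minimum mismatch count; a longest-common-string length would be negated first). Function names are hypothetical.

```python
def dominates(u, v):
    """True if u is at least as good as v in every objective and better in one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def non_dominated_sort(scores):
    """Return fronts as lists of indices into scores; fronts[0] is the best rank."""
    remaining = set(range(len(scores)))
    fronts = []
    while remaining:
        # An individual belongs to the current front if nothing left dominates it.
        front = sorted(
            i for i in remaining
            if not any(dominates(scores[j], scores[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts
```

For the scores `[(0.9, 10), (0.8, 20), (0.7, 15), (0.6, 5), (0.9, 10)]` this yields `[[0, 1, 4], [2], [3]]`: the two identical points and the point with the largest second objective share the first front.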

The individual selection unit 107 selects a predetermined number of pieces of individual data, in descending order of rank, from the g-th generation integrated data classified into fronts (ranks) of the Pareto-optimal solutions. For example, p can be used as the predetermined number. The parent population data acquisition unit 102 then acquires the p pieces of individual data selected by the individual selection unit 107 as (g+1)-th generation parent population data, which represents the (g+1)-th generation parent population.

When selecting the predetermined number of pieces of individual data, if several pieces of individual data have the same rank, the individual selection unit 107 may select them in descending order of crowding distance. Here, the crowding distance is the average distance between the two solutions on either side of a given solution. FIG. 5(a) illustrates this concept, and the crowding distance is calculated by the formula in FIG. 5(b). The crowding distance corresponds to the average length of the sides of the rectangle drawn with a broken line in FIG. 5(a).
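The formula of FIG. 5(b) is not reproduced in this text, so the sketch below uses the crowding distance as it is commonly defined for NSGA-II: for each criterion, the normalized gap between the two neighbours of a solution is added up, and boundary solutions receive an infinite distance. Whether the figure applies an additional averaging factor is not known here. The function name and the data layout are assumptions.

```python
import math
from typing import Dict, List, Sequence


def crowding_distance(front: List[int],
                      scores: Sequence[Sequence[float]]) -> Dict[int, float]:
    """Crowding distance of each index in a non-empty front (NSGA-II style)."""
    dist = {i: 0.0 for i in front}
    n_criteria = len(scores[front[0]])
    for m in range(n_criteria):
        ordered = sorted(front, key=lambda i: scores[i][m])
        lo, hi = scores[ordered[0]][m], scores[ordered[-1]][m]
        # Boundary solutions are always kept, so give them infinite distance.
        dist[ordered[0]] = dist[ordered[-1]] = math.inf
        if hi == lo:
            continue  # every solution has the same value on this criterion
        for k in range(1, len(ordered) - 1):
            gap = scores[ordered[k + 1]][m] - scores[ordered[k - 1]][m]
            dist[ordered[k]] += gap / (hi - lo)
    return dist


d = crowding_distance([0, 1, 2], [(0.0, 10.0), (0.5, 5.0), (1.0, 0.0)])
print(d[1])  # 2.0
```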

<Storage unit 20>
Next, the functions of the storage unit 20 will be described. The amino acid sequence data storage unit 201 stores data representing an amino acid sequence, that is, the sequence of amino acids in a protein.

The gene number data storage unit 202 stores data representing the number of genes. In this embodiment, the number of genes is the number of CDSs contained in each piece of individual data.

The codon frequency table data storage unit 203 stores data representing a codon frequency table, which summarizes how frequently each codon is used in the host cell.

The calculation data storage unit 204 stores the results of the various calculations performed by the individual generation unit 101, the parent population data acquisition unit 102, the child population data acquisition unit 103, the crossover processing unit 104, the mutation processing unit 105, the non-dominated sort execution unit 106, the individual selection unit 107, and so on.

The evaluation criteria storage unit 205 stores predetermined evaluation criteria relating to codon adaptation and to the base sequences of codons. Specifically, the criterion relating to codon adaptation is based on the minimum value of the codon adaptation index over the CDSs of an individual, where each individual has multiple CDSs and a CDS is a base sequence that is translated into amino acids. This is hereinafter called the first evaluation criterion. Under it, the larger the minimum codon adaptation index in an individual, the more highly the individual is rated. The first of the criteria relating to the base sequences of codons is based on the minimum value of the number of mismatched bases, that is, the number of bases that differ between two CDSs of an individual. This is hereinafter called the second evaluation criterion. Under it, the larger the minimum number of mismatched bases, the more highly the individual is rated. The second of the criteria relating to the base sequences of codons is based on the length of the longest common string, that is, the longest base sequence that matches contiguously between CDSs, or at different sites within one CDS, among the CDSs of an individual. This is hereinafter called the third evaluation criterion. Under it, the shorter the longest common string, the more highly the individual is rated.

Next, the functions, processes, and criteria described above will be explained in detail with reference to FIGS. 6 to 13.

<Pre-processing>
FIG. 6 shows an example of a flowchart for carrying out gene sequence design according to an embodiment of the present invention. The processing shown in FIG. 6 is executed before the main routine shown in FIG. 8 and is hereinafter called pre-processing.

First, in S11, the processing unit 10 acquires the amino acid sequence data and the gene number data from the amino acid sequence data storage unit 201 and the gene number data storage unit 202, and stores them in a storage unit such as a cache memory (not shown).

Next, in S12, the processing unit 10 acquires the codon frequency table data from the codon frequency table data storage unit 203 and stores it in a storage unit such as a cache memory (not shown).

Next, in S13, the individual generation unit 101 generates p pieces of individual data, each representing an individual generated at random under the constraint that it encodes the amino acid sequence of the same protein. For example, 100 pieces of random individual data may be generated.

(Individual data)
Individual data will now be described with reference to FIG. 7. As shown in FIG. 7, in this embodiment one individual is represented as multiple protein-coding regions (CDSs) that encode the same amino acid sequence. The individual data shown in FIG. 7 has x CDSs, where x is the number of genes represented by the gene number data that the processing unit 10 acquired from the gene number data storage unit 202 in S11 of FIG. 6. Every CDS encodes the same amino acid sequence. The sequence G, I, V, E, Q shown in FIG. 7 is the amino acid sequence represented by the amino acid sequence data that the processing unit 10 acquired from the amino acid sequence data storage unit 201 in S11 of FIG. 6. The CDSs differ from one another in base sequence.
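The random generation of S13 and the individual representation above can be sketched as follows. This is an illustration under stated assumptions: an individual is modelled as a list of nucleotide strings (one per CDS), the synonymous codons are taken from the standard genetic code and limited to the five amino acids of the G-I-V-E-Q example, and all function names are hypothetical.

```python
import random
from typing import Dict, List

# Synonymous codons (standard genetic code) for the amino acids in "GIVEQ".
SYNONYMS: Dict[str, List[str]] = {
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "I": ["ATT", "ATC", "ATA"],
    "V": ["GTT", "GTC", "GTA", "GTG"],
    "E": ["GAA", "GAG"],
    "Q": ["CAA", "CAG"],
}
CODON_TO_AA = {codon: aa for aa, codons in SYNONYMS.items() for codon in codons}


def random_cds(protein: str, rng: random.Random) -> str:
    """Pick one synonymous codon at random for every amino acid."""
    return "".join(rng.choice(SYNONYMS[aa]) for aa in protein)


def random_individual(protein: str, num_genes: int, rng: random.Random) -> List[str]:
    """An individual is num_genes CDSs that all encode the same protein."""
    return [random_cds(protein, rng) for _ in range(num_genes)]


def translate(cds: str) -> str:
    """Translate a CDS back to its amino acid sequence."""
    return "".join(CODON_TO_AA[cds[i:i + 3]] for i in range(0, len(cds), 3))


rng = random.Random(0)
# p = 100 individuals with x = 3 CDSs each, as in the examples of the text.
population = [random_individual("GIVEQ", 3, rng) for _ in range(100)]
```

Purely random sampling does not guarantee that the CDSs of one individual differ from one another. The sketch leaves that to the later mutation and selection steps.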

Returning to FIG. 6, the pre-processing is described further. In S14, the parent population data acquisition unit 102 acquires the p pieces of individual data generated at random by the individual generation unit 101 in S13 as first-generation parent population data, which represents the first-generation parent population. The parent population data is the archive population preserved throughout the processing of this algorithm. Once the first-generation parent population data has been acquired, the pre-processing ends.

<Main routine>
Next, the main routine of this algorithm will be described with reference to FIG. 8. First, in S20, the processing unit 10 sets the variable g to 1. Here, g is the index denoting the g-th generation parent population and takes values from 1 to G, where G is a predetermined number of generations described later.

Next, in S21, the processing unit 10 acquires the first-generation parent population data from the parent population data acquisition unit 102.

Next, in S22, the crossover processing unit 104 and the mutation processing unit 105 execute crossover processing and mutation processing on the individual data included in the first-generation parent population data. Crossover processing is optional and may be omitted as needed. Crossover processing and mutation processing are described below with reference to FIGS. 9 to 12.

<Crossover processing>
First, crossover processing will be described with reference to FIGS. 9 and 10. FIG. 9 shows an example of a flowchart for carrying out crossover processing according to an embodiment of the present invention. First, in S321, the crossover processing unit 104 sets the variable i to 0.

Next, in S322, the crossover processing unit 104 acquires the g-th generation parent population data (the first-generation parent population data when g = 1 in the main routine of FIG. 8) from the processing unit 10 or the parent population data acquisition unit 102. It then randomly extracts (e - i) pieces of individual data (e pieces at this point, since i = 0) from the p pieces of individual data included in the g-th generation parent population data. The extraction method is not particularly limited. For example, binary tournament selection can be used. Here, e is the largest even number not exceeding p × Pc, and Pc is a parameter that can be set to any value greater than 0 and less than 1.
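The computation of e and one possible form of binary tournament selection can be sketched as follows. The source does not state what two candidates are compared on, so the scalar `fitness` callable is a placeholder assumption, as are the function names. This sketch also draws with replacement, so the same individual may be chosen more than once.

```python
import math
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def crossover_count(p: int, pc: float) -> int:
    """Largest even number not exceeding p * pc."""
    return 2 * math.floor(p * pc / 2)


def binary_tournament(pop: Sequence[T], e: int,
                      fitness: Callable[[T], float],
                      rng: random.Random) -> List[T]:
    """Draw e individuals; each draw keeps the better of two random candidates."""
    chosen: List[T] = []
    for _ in range(e):
        a, b = rng.sample(range(len(pop)), 2)  # two distinct candidates
        winner = a if fitness(pop[a]) >= fitness(pop[b]) else b
        chosen.append(pop[winner])
    return chosen


print(crossover_count(100, 0.5))   # 50
print(crossover_count(10, 0.75))   # 6 (p * pc = 7.5)
```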

Next, in S323, the crossover processing unit 104 randomly selects two pieces of individual data from the (e - i) pieces of individual data, for example by using a random number table.

Next, in S324, the crossover processing unit 104 executes crossover processing on the two pieces of individual data selected in S323. The crossover processing is described concretely below with reference to FIG. 10.

As shown in FIG. 10, the two pieces of individual data selected in S323 are called the first individual data and the second individual data. In the example of FIG. 10, the first and second individual data each have three CDSs, and their base sequences differ. A crossover point is determined for these pieces of individual data. One crossover point is chosen from among the boundaries between codons, and the choice may be made at random. In this embodiment, the crossover point is at the same position in the first and second individual data. The codons of the first and second individual data are then exchanged across the crossover point. In this embodiment, this operation is called crossover processing.
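A single-point crossover at a codon boundary can be sketched as follows. FIG. 10 is not available in this text, so the sketch assumes that the crossover point is drawn over the concatenated codon list of all CDSs of an individual and that the codons after the point are exchanged. Both are assumptions, as are the function names.

```python
import random
from typing import List, Tuple

Individual = List[str]  # one nucleotide string per CDS


def to_codons(ind: Individual) -> List[str]:
    """Flatten an individual into one list of codons."""
    return [cds[i:i + 3] for cds in ind for i in range(0, len(cds), 3)]


def from_codons(codons: List[str], num_cds: int) -> Individual:
    """Rebuild num_cds equally long CDSs from a flat codon list."""
    per_cds = len(codons) // num_cds
    return ["".join(codons[k * per_cds:(k + 1) * per_cds]) for k in range(num_cds)]


def single_point_crossover(a: Individual, b: Individual,
                           rng: random.Random) -> Tuple[Individual, Individual]:
    """Exchange the codons after one randomly chosen codon boundary."""
    ca, cb = to_codons(a), to_codons(b)
    point = rng.randrange(1, len(ca))  # interior boundary, same in both parents
    child_a = ca[:point] + cb[point:]
    child_b = cb[:point] + ca[point:]
    return from_codons(child_a, len(a)), from_codons(child_b, len(b))


first = ["GGTATTGTTGAACAA"] * 3
second = ["GGCATCGTCGAGCAG"] * 3
child_1, child_2 = single_point_crossover(first, second, random.Random(0))
```

Because both parents encode the same amino acid at every codon position, exchanging codons at a shared boundary leaves the encoded protein unchanged in both children.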

Returning to FIG. 9, crossover processing is described further. In S325, the crossover processing unit 104 increments the variable i by 2.

Next, in S326, the crossover processing unit 104 determines whether i = e. If the result is NO, the process returns to S323. If the result is YES, crossover processing ends and the results are output to the calculation data storage unit 204. At this point i = 2, so if e is greater than 2 the process returns from S326 to S323, and two pieces of individual data are selected at random from the (e - 2) pieces that have not yet undergone crossover processing. This is repeated until the result in S326 is YES, that is, until crossover processing has been executed on all e pieces of individual data. As noted above, crossover processing is optional and may be omitted as needed.

<Mutation processing>
Next, mutation processing will be described with reference to FIGS. 11 and 12. Mutation processing is executed on all individuals included in the g-th generation parent population data. If crossover processing was not executed in S22, mutation processing is executed on the p pieces of individual data of the g-th generation. If crossover processing was executed in S22, mutation processing is executed on a total of p pieces of individual data: the e pieces that underwent crossover processing and the p - e pieces that did not.

FIG. 11 shows an example of a flowchart for carrying out mutation processing according to an embodiment of the present invention. First, in S221, the mutation processing unit 105 randomly determines, for each piece of individual data included in the g-th generation parent population data, whether to execute the first mutation process or the second mutation process. In this embodiment, mutation processing is executed on all p pieces of individual data included in the g-th generation parent population data. The second mutation process is a mutation process different from the first mutation process.

Next, in S222, the mutation processing unit 105 determines whether the process chosen in S221 is the first mutation process. If the result is YES, the process proceeds to S223a and the first mutation process is executed. If the result is NO, the process proceeds to S223b and the second mutation process is executed.

(First mutation process)
Next, in S223a, the mutation processing unit 105 executes the first mutation process on the individual data. Specifically, in every CDS of the individual data, each codon is replaced, with a predetermined probability Pm, by a codon of higher frequency than that codon. The higher-frequency codons are obtained from the codon frequency table data acquired from the codon frequency table data storage unit 203 in S12 of the pre-processing in FIG. 6. The first mutation process is described below with reference to FIG. 12(a).

FIG. 12(a) shows an example in which the individual data contains three CDSs. As shown in FIG. 12(a), the first mutation process applies mutation with probability Pm to every codon (5 × 3 = 15 codons) of the three CDSs in the individual data. The broken line in FIG. 12(a) indicates the range of codons subject to mutation with probability Pm. As an example, "GGC", the first codon of the third CDS (CDS-3), is replaced with probability Pm by a codon of higher frequency than GGC. The higher-frequency codons are obtained from the codon frequency table data. In the example of FIG. 12(a), the codons of higher frequency than "GGC" are "GGT" and "GGA". When there are several higher-frequency codons, as here, one of them is chosen at random and substituted for "GGC". If no codon of higher frequency than "GGC" exists, no replacement is made. This replacement is applied to every codon in the individual data. In this embodiment, this operation is called the first mutation process. The first mutation process is intended to increase the minimum CAI value used in the first evaluation criterion described later.
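The first mutation process can be sketched as follows. The codon frequencies in `FREQ` are invented for illustration and were chosen only so that "GGT" and "GGA" are more frequent than "GGC", as in the example above. A real run would read them from the codon frequency table of the host. The function name and data layout are likewise assumptions.

```python
import random
from typing import Dict, List

# Hypothetical codon frequencies (illustrative values, not real host data).
FREQ: Dict[str, float] = {
    "GGT": 0.35, "GGA": 0.30, "GGC": 0.20, "GGG": 0.15,  # Gly
    "ATT": 0.50, "ATC": 0.35, "ATA": 0.15,               # Ile
    "GTT": 0.40, "GTC": 0.25, "GTG": 0.20, "GTA": 0.15,  # Val
    "GAA": 0.70, "GAG": 0.30,                            # Glu
    "CAA": 0.65, "CAG": 0.35,                            # Gln
}
SYNONYMS: Dict[str, List[str]] = {
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "I": ["ATT", "ATC", "ATA"],
    "V": ["GTT", "GTC", "GTA", "GTG"],
    "E": ["GAA", "GAG"],
    "Q": ["CAA", "CAG"],
}
CODON_TO_AA = {codon: aa for aa, codons in SYNONYMS.items() for codon in codons}


def first_mutation(individual: List[str], pm: float, rng: random.Random) -> List[str]:
    """With probability pm, replace each codon by a more frequent synonym."""
    mutated = []
    for cds in individual:
        codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
        for k, codon in enumerate(codons):
            better = [c for c in SYNONYMS[CODON_TO_AA[codon]] if FREQ[c] > FREQ[codon]]
            # No replacement when the codon is already the most frequent one.
            if better and rng.random() < pm:
                codons[k] = rng.choice(better)
        mutated.append("".join(codons))
    return mutated
```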

(Second mutation process)
Returning to FIG. 11, mutation processing is described further. In S223b, the mutation processing unit 105 executes the second mutation process on the individual data. Specifically, each codon that overlaps the longest common string is replaced, with the predetermined probability Pm, by another codon. The longest common string is the longest base sequence that matches contiguously between CDSs, or at different sites within one CDS, among the CDSs of the individual data. The longest common string is described below with reference to FIG. 14.

"Longest common string"
The individual data shown in FIG. 14 contains three CDSs as an example. The character string representing the five codons of each CDS (3 bases (= characters) × 5 = 15 characters) is compared with the character strings of the other CDSs and with other sites within the same CDS, and the longest contiguously matching string is called the longest common string. In the example of FIG. 14, "GGCATCGTCGA" (underlined with a solid line) is the longest common string, and its length (number of characters) is 11. "GTCGAGCAG" (underlined with a broken line) is also a common string, but its length is 9, so it is not the longest common string. The longest common string corresponds to the concept known in computer science as the longest common substring.
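The search for the longest common string can be sketched with a plain dynamic-programming longest-common-substring routine. Two points are assumptions of this sketch: a match inside a single CDS only has to start at a different position (overlapping occurrences are allowed), and the three sequences below are hypothetical. They were constructed to reproduce the lengths 11 and 9 quoted in the text and are not taken from FIG. 14.

```python
from itertools import combinations_with_replacement
from typing import List


def longest_common_string(cds_list: List[str]) -> str:
    """Longest run of bases shared by two CDSs, or by two sites of one CDS."""
    best = ""
    for a, b in combinations_with_replacement(range(len(cds_list)), 2):
        s, t = cds_list[a], cds_list[b]
        prev = [0] * (len(t) + 1)
        for i in range(1, len(s) + 1):
            cur = [0] * (len(t) + 1)
            for j in range(1, len(t) + 1):
                # Within the same CDS, skip the trivial match of a site with itself.
                if s[i - 1] == t[j - 1] and not (a == b and i == j):
                    cur[j] = prev[j - 1] + 1
                    if cur[j] > len(best):
                        best = s[i - cur[j]:i]
            prev = cur
    return best


# Hypothetical CDSs, all encoding G-I-V-E-Q.
individual = ["GGTATTGTCGAGCAG", "GGCATCGTCGAACAA", "GGCATCGTCGAGCAG"]
print(longest_common_string(individual))  # GGCATCGTCGA
```

The run time is quadratic in the CDS length for each pair. A suffix-based method would scale better for long genes, but the simple table keeps the idea visible.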

FIG. 12(b) shows an example in which the individual data contains three CDSs. As shown in FIG. 12(b), the second mutation process applies mutation with probability Pm to those codons that overlap the longest common string within the character string representing the five codons of each CDS (3 bases (= characters) × 5 = 15 characters), for the three CDSs in the individual data. In the example of FIG. 12(b), the longest common string is "GGCATCGTCGA" (underlined with a solid line), and the codons that overlap it are the first to fourth codons of the second and third CDSs (CDS-2 and CDS-3). The broken line in FIG. 12(b) indicates the range of codons subject to mutation with probability Pm. As an example, "GGC", the first codon of CDS-3, is replaced with probability Pm by another codon. In the example of FIG. 12(b), the codons other than "GGC" are "GGT", "GGA", and "GGG". When there are several other codons, as here, one of them is chosen at random and substituted for "GGC". If no codon other than "GGC" exists, no replacement is made. This can happen, for example, when only one codon encodes a particular amino acid. This replacement is applied to the codons that overlap the longest common string. In this embodiment, this operation is called the second mutation process. The second mutation process is intended to increase the number of mismatched bases used in the second evaluation criterion described later and to shorten the longest common string.
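The second mutation process can be sketched as follows, taking the longest common string as an input that has already been found. The helper `codons_overlapping`, the function names, and the three example sequences are assumptions introduced for illustration. The sequences are hypothetical and were chosen so that "GGCATCGTCGA" covers the first to fourth codons of CDS-2 and CDS-3, as in the text.

```python
import random
from typing import Dict, List

SYNONYMS: Dict[str, List[str]] = {
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "I": ["ATT", "ATC", "ATA"],
    "V": ["GTT", "GTC", "GTA", "GTG"],
    "E": ["GAA", "GAG"],
    "Q": ["CAA", "CAG"],
}
CODON_TO_AA = {codon: aa for aa, codons in SYNONYMS.items() for codon in codons}


def codons_overlapping(start: int, length: int) -> range:
    """Indices of the codons touched by a match of `length` bases at `start`."""
    return range(start // 3, (start + length - 1) // 3 + 1)


def second_mutation(individual: List[str], lcs: str, pm: float,
                    rng: random.Random) -> List[str]:
    """With probability pm, replace each codon overlapping `lcs` by another synonym."""
    if not lcs:
        return list(individual)
    mutated = []
    for cds in individual:
        codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
        targets = set()
        pos = cds.find(lcs)
        while pos != -1:  # collect every occurrence in this CDS
            targets.update(codons_overlapping(pos, len(lcs)))
            pos = cds.find(lcs, pos + 1)
        for k in sorted(targets):
            others = [c for c in SYNONYMS[CODON_TO_AA[codons[k]]] if c != codons[k]]
            # No replacement when the amino acid has only one codon.
            if others and rng.random() < pm:
                codons[k] = rng.choice(others)
        mutated.append("".join(codons))
    return mutated


individual = ["GGTATTGTCGAGCAG", "GGCATCGTCGAACAA", "GGCATCGTCGAGCAG"]
mutated = second_mutation(individual, "GGCATCGTCGA", 0.5, random.Random(0))
```

Only synonymous codons are substituted, so every mutated CDS still encodes the same amino acid sequence.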

When the first and second mutation processes are complete, the results are output to the calculation data storage unit 204.

Returning to FIG. 8, the main routine is described further. In S22, after the mutation processing unit 105 has executed mutation processing on the g-th generation parent population data and, where applicable, the crossover processing unit 104 has executed crossover processing, the process proceeds to S23.

Next, in S23, the child population data acquisition unit 103 generates the g-th generation child population data. How this data is generated is described below for the cases with and without crossover processing.

1. When only mutation processing is executed in S22
The child population data acquisition unit 103 takes the p pieces of individual data obtained by applying mutation processing to all p pieces of individual data in the g-th generation parent population data as the new g-th generation child population data.

2. When both mutation processing and crossover processing are executed in S22
Of the p pieces of individual data in the g-th generation parent population data, the e pieces that underwent crossover processing and the p - e pieces that did not are combined into a total of p pieces. The child population data acquisition unit 103 takes the p pieces of individual data obtained by applying mutation processing to all of them as the new g-th generation child population data.

Next, in S24, the processing unit 10 integrates the g-th generation parent population data and the g-th generation child population data to generate the g-th generation integrated data, which therefore contains 2p pieces of individual data.

Next, in S25, the non-dominated sort execution unit 106 executes non-dominated sorting on the g-th generation integrated data based on the predetermined evaluation criteria relating to codon adaptation and to the base sequences of the codons, and classifies the 2p pieces of individual data into fronts (ranks) of the Pareto-optimal solutions.

Next, in S26, the individual selection unit 107 selects a predetermined number of pieces of individual data, in descending order of rank, from the g-th generation integrated data classified into fronts (ranks) of the Pareto-optimal solutions. If several pieces of individual data have the same rank, the individual selection unit 107 may select them in descending order of crowding distance. Here, p can be used as the predetermined number. The parent population data acquisition unit 102 then generates and acquires the p pieces of individual data selected by the individual selection unit 107 as (g+1)-th generation parent population data, which represents the (g+1)-th generation parent population. The evaluation criteria are described below with reference to FIGS. 13 and 14.
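The selection of S26 can be sketched as follows, assuming the fronts and the crowding distances have already been computed. The function name and the input layout (lists of indices per front and a dictionary of distances) are assumptions made for this illustration.

```python
from typing import Dict, List


def select_next_parents(fronts: List[List[int]],
                        crowding: Dict[int, float],
                        p: int) -> List[int]:
    """Pick p indices: whole fronts first, then by crowding distance."""
    selected: List[int] = []
    for front in fronts:
        if len(selected) + len(front) <= p:
            selected.extend(front)  # the whole front fits
        else:
            # Only part of this front fits: prefer less crowded solutions.
            by_distance = sorted(front, key=lambda i: crowding[i], reverse=True)
            selected.extend(by_distance[:p - len(selected)])
            break
    return selected


fronts = [[0, 1], [2, 3, 4], [5]]
crowding = {0: 9.0, 1: 9.0, 2: 5.0, 3: 0.4, 4: 1.2, 5: 9.0}
print(select_next_parents(fronts, crowding, 4))  # [0, 1, 2, 4]
```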

<Evaluation criteria>
In this embodiment, evaluation criteria from two viewpoints are used when selecting p pieces of individual data from the 2p pieces that have undergone non-dominated sorting. These viewpoints were derived with the aim of suppressing homologous recombination and increasing the production of the target protein. The first viewpoint concerns codon adaptation. Specifically, it is based on the minimum value of the codon adaptation index over the CDSs of an individual, where each individual has multiple CDSs and a CDS is a base sequence that is translated into amino acids. This is the first evaluation criterion. The second viewpoint concerns the base sequences of codons and is further divided into the number of mismatched bases and the longest common string. The second evaluation criterion is based on the minimum number of mismatched bases between two CDSs of the individual data. The third evaluation criterion is based on the length of the longest common string among the CDSs of the individual data. The significance of each of these three evaluation criteria is described below.

(First evaluation criterion: codon adaptation)
The first evaluation criterion, which is the first viewpoint, concerns codon adaptation. Here, codon adaptation is higher the more frequently used codons the CDSs of the individual data contain. Specifically, the criterion is the minimum value (hereinafter, the minimum CAI value) of the Codon Adaptation Index (hereinafter, CAI) of the CDSs in each piece of individual data. The CAI can be obtained, for example, by the following equation.

L: number of codons contained in the individual data
fi: frequency of the i-th codon
max(fj): frequency of the most frequent synonymous codon (the j-th codon)
Here, a synonymous codon is a codon that encodes the same amino acid but has a different sequence.

Using the CAI obtained by the above equation, the minimum CAI value can be obtained by the following equation.

x: number of CDSs contained in the individual data
Ci: the i-th CDS
CAI(Ci): CAI of Ci
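The two equations themselves are not reproduced in this text. The sketch below therefore assumes the usual form of the CAI, which is consistent with the symbol definitions above: the geometric mean, over the codons of a CDS, of fi / max(fj). The minimum CAI value is then the smallest CAI(Ci) over the x CDSs. The frequencies in `FREQ` are invented for illustration, the function names are hypothetical, and L is taken here as the number of codons in the CDS being scored.

```python
import math
from typing import Dict, List

# Hypothetical codon frequencies (illustrative values, not real host data).
FREQ: Dict[str, float] = {
    "GGT": 0.35, "GGA": 0.30, "GGC": 0.20, "GGG": 0.15,  # Gly
    "ATT": 0.50, "ATC": 0.35, "ATA": 0.15,               # Ile
    "GTT": 0.40, "GTC": 0.25, "GTG": 0.20, "GTA": 0.15,  # Val
    "GAA": 0.70, "GAG": 0.30,                            # Glu
    "CAA": 0.65, "CAG": 0.35,                            # Gln
}
SYNONYMS: Dict[str, List[str]] = {
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "I": ["ATT", "ATC", "ATA"],
    "V": ["GTT", "GTC", "GTA", "GTG"],
    "E": ["GAA", "GAG"],
    "Q": ["CAA", "CAG"],
}
CODON_TO_AA = {codon: aa for aa, codons in SYNONYMS.items() for codon in codons}


def cai(cds: str) -> float:
    """Geometric mean of f_i / max(f_j) over the codons of one CDS (assumed form)."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    log_sum = 0.0
    for codon in codons:
        best = max(FREQ[c] for c in SYNONYMS[CODON_TO_AA[codon]])
        log_sum += math.log(FREQ[codon] / best)
    return math.exp(log_sum / len(codons))


def min_cai(individual: List[str]) -> float:
    """Smallest CAI among the CDSs of an individual (first evaluation criterion)."""
    return min(cai(cds) for cds in individual)
```

Working in log space avoids underflow when the product runs over the many codons of a realistic gene.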

The higher the CAI of a CDS, the more frequently used codons it contains (conversely, the fewer rare codons it contains). If a piece of individual data has a CDS with an extremely low CAI (in other words, one containing many rare codons), that CDS may not be translated efficiently. Using the minimum CAI value as the first evaluation criterion, and rating individual data more highly the larger its minimum CAI value, therefore makes it possible to eliminate individual data that has a CDS with an extremely low CAI during optimization. Increasing the minimum CAI value through the first mutation process can thus yield more favorable simulation results.

(Second evaluation criterion: number of mismatched bases)
Next, the second evaluation criterion will be described with reference to FIG. 13. The second evaluation criterion, which is the first part of the second viewpoint, concerns the number of mismatched bases. Specifically, the minimum value of the number of mismatched bases (hereinafter, the minimum number of mismatched bases) is used as the criterion. Here, mismatched bases are the bases that differ when two of the x CDSs in the individual data (hereinafter, a CDS pair), Ci and Cj, are compared base by base. In the example of FIG. 13, five of the bases of Ci and Cj differ, so the number of mismatched bases for this CDS pair (Ci and Cj) is 5. The minimum number of mismatched bases can be obtained by the following equation.

ｘ：個体データに含まれるＣＤＳの数
Ｃｉ：ｉ番目のＣＤＳ
Ｃｊ：ｊ番目のＣＤＳ
ＮＮ（Ｃｉ，Ｃｊ）：ＣｉとＣｊの不一致塩基数
x: Number of CDS contained in individual data Ci: i-th CDS
Cj: jth CDS
NN (Ci, Cj): Number of unmatched bases of Ci and Cj

Here, if an individual data item has a CDS pair with an extremely small number of mismatched bases (in other words, a pair with very similar base sequences), homologous recombination is more likely to occur between the CDSs of that pair. This is because homologous recombination occurs at sites where the base sequences are very similar (homologous sites). Therefore, by using the minimum number of mismatched bases as the second evaluation criterion and rating an individual data item more highly the larger its minimum number of mismatched bases (in other words, the larger the proportion of differing bases), individual data with very similar base sequences can be removed in the course of optimization. Accordingly, a more favorable simulation result can be obtained by increasing the number of mismatched bases through the second mutation process.
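The second evaluation criterion can be sketched in Python as a position-by-position comparison of every CDS pair. This is a minimal sketch that assumes all CDSs of an individual have the same length (they encode the same amino acid sequence); the function names `mismatches` and `min_mismatches` are hypothetical.

```python
from itertools import combinations

def mismatches(ci, cj):
    """NN(Ci, Cj): number of positions at which two CDSs differ."""
    if len(ci) != len(cj):
        raise ValueError("CDSs must have the same length")
    return sum(a != b for a, b in zip(ci, cj))

def min_mismatches(individual):
    """Second evaluation criterion: smallest NN over all CDS pairs."""
    return min(mismatches(ci, cj) for ci, cj in combinations(individual, 2))

# Toy individual with three CDSs (illustrative sequences only).
individual = ["GGTATT", "GGCATC", "GGTATC"]
closest = min_mismatches(individual)
```

Because the minimum is taken over all pairs, a single pair of near-identical CDSs is enough to give the whole individual a poor score.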

(Third evaluation criterion: longest common string)
Next, the third evaluation criterion will be described with reference to FIG. 14. The third evaluation criterion, which is the second of the criteria under the second aspect, relates to the "longest common string". As already described, the "longest common string" is the longest of the consecutively matching strings found when the character strings representing the codons are compared between different CDSs or between different sites within a single CDS.

Here, if "exactly identical base sequences" lie close to each other in the genome, homologous recombination is more likely to occur. As described above, this is because homologous recombination occurs at sites where the base sequences are very similar (homologous sites). Therefore, by using the length of the "longest common string" as the third evaluation criterion and rating an individual data item more highly the shorter its longest common string, individual data containing a high proportion of "exactly identical base sequences" can be removed in the course of optimization. Accordingly, a more favorable simulation result can be obtained by shortening the longest common string through the second mutation process.
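One way to evaluate the third criterion is a longest-common-substring search by dynamic programming, sketched below in Python. The sketch compares strings base by base and, within a single CDS, allows the two occurrences to overlap as long as they start at different positions; both choices are assumptions, as are the function names.

```python
from itertools import combinations

def longest_common(a, b, same=False):
    """Length of the longest run of consecutive matches between a and b.

    With same=True, a and b are the same CDS and a position is never
    compared with itself, so only repeats at different sites are counted.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1] and not (same and i == j):
                cur[j] = prev[j - 1] + 1  # extend the run ending at (i-1, j-1)
                best = max(best, cur[j])
        prev = cur
    return best

def longest_common_in_individual(individual):
    """Third evaluation criterion: longest common string over all CDS pairs
    and over repeats inside each single CDS."""
    between = [longest_common(a, b) for a, b in combinations(individual, 2)]
    within = [longest_common(c, c, same=True) for c in individual]
    return max(between + within)

# Toy individual (illustrative sequences only).
individual = ["ACGTAC", "TTGTAA"]
length = longest_common_in_individual(individual)
```

The search costs time proportional to the product of the two string lengths for each comparison, which is acceptable for short CDSs such as those in the example below; a suffix-based method would be a natural alternative for long sequences.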

As described above, using the first evaluation criterion, which corresponds to the first aspect, makes it possible to select individual data whose CDSs contain many frequently used codons. Using the second and third evaluation criteria, which correspond to the second aspect, makes it possible to select individual data in which a large proportion of the base sequences differ, and thereby to suppress the occurrence of homologous recombination.

Returning to FIG. 8, the main routine will be described further. In S26, the individual selection unit 107 selects p individual data, in descending order of rank, from the g-th generation integrated data (which contains 2p individual data) classified by front (by rank) in the Pareto optimal solution. The parent individual population data acquisition unit 102 then takes the selected p individual data as the new (g+1)-th generation parent individual population data, thereby generating the (g+1)-th generation parent individual population data.
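The ranking and selection in S26 can be sketched as follows. This is a simple quadratic-time front-peeling sketch, not necessarily how the embodiment implements it. Each individual is assumed to be represented by a score tuple in which larger is better for every entry (so the longest common string length is negated), and ties within the last front are broken by crowding distance as in claim 2. All function names are hypothetical.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated_fronts(scores):
    """Split indices into fronts: rank 1 first, then rank 2, and so on."""
    remaining = set(range(len(scores)))
    fronts = []
    while remaining:
        front = sorted(i for i in remaining
                       if not any(dominates(scores[j], scores[i])
                                  for j in remaining if j != i))
        fronts.append(front)
        remaining -= set(front)
    return fronts

def crowding_distance(front, scores):
    """Crowding distance of each index in one front (larger = less crowded)."""
    dist = {i: 0.0 for i in front}
    for m in range(len(scores[front[0]])):
        ordered = sorted(front, key=lambda i: scores[i][m])
        lo, hi = scores[ordered[0]][m], scores[ordered[-1]][m]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for k in range(1, len(ordered) - 1):
            gap = scores[ordered[k + 1]][m] - scores[ordered[k - 1]][m]
            dist[ordered[k]] += gap / (hi - lo)
    return dist

def select(scores, p):
    """Pick p indices, best rank first; break ties by crowding distance."""
    chosen = []
    for front in non_dominated_fronts(scores):
        if len(chosen) + len(front) <= p:
            chosen.extend(front)
        else:
            dist = crowding_distance(front, scores)
            ordered = sorted(front, key=lambda i: dist[i], reverse=True)
            chosen.extend(ordered[:p - len(chosen)])
        if len(chosen) == p:
            break
    return chosen

# (min CAI, min mismatched bases, -longest common string); toy values.
scores = [(0.9, 10, -5), (0.8, 12, -4), (0.7, 9, -6), (0.6, 8, -7)]
parents_next = select(scores, 2)
```

Negating the third score lets one dominance test handle all three criteria, at the cost of having to remember the sign when reading results back.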

Next, in S27, the processing unit 10 determines whether the variable g exceeds the predetermined number of generations G. If the result is NO, the process proceeds to S28. If the result in S27 is YES, the main routine ends and the calculation result is output to the calculation data storage unit 204.

If the determination result in S27 is NO, the process proceeds to S28, where the variable g is incremented (that is, 1 is added to g), and then returns to S21. Since g = 2 at this point, the parent individual population data acquisition unit 102 acquires the second generation parent individual population data generated in S26. This processing is repeated until the variable g reaches the predetermined number of generations G. In other words, the processing in S21 to S26 is executed 250 times.

By repeatedly executing the main routine described above, the p individual data selected on the basis of the three evaluation criteria become increasingly favorable as a group of gene sequences as the number of repetitions increases.
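The overall loop can be summarized by the following skeleton. It is a sketch only: `make_children` stands in for the crossover and mutation steps and `select` for the non-dominated sort and selection in S26, both passed in as hypothetical callables, and the loop simply runs G times in line with the statement that S21 to S26 are repeated 250 times.

```python
def run_generations(parents, make_children, select, G):
    """Repeat: create children, merge with parents, keep the best p."""
    p = len(parents)
    for g in range(1, G + 1):
        children = make_children(parents)   # crossover and mutation
        merged = parents + children         # g-th generation integrated data (2p items)
        parents = select(merged, p)         # becomes the (g+1)-th generation parents
    return parents

# Toy stand-ins so the skeleton can be run: individuals are plain integers,
# children are parents plus one, and selection keeps the p largest values.
def toy_children(parents):
    return [x + 1 for x in parents]

def toy_select(merged, p):
    return sorted(merged, reverse=True)[:p]

final = run_generations([0, 0], toy_children, toy_select, 3)
```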

<Example>
An example of gene sequence design using the present algorithm is described below. In this example, ten CDSs encoding the human insulin A chain (amino acid sequence: GIVEQCCTSICSLYQLENYCN) were designed as a simulation. The parameters were as follows.
Predetermined probability Pm (mutation rate) = 0.05
Pc (crossover rate) = 0.5
Number of individual data included in the g-th generation individual population data (parent individual population data and child individual population data) p = 100
Predetermined number of generations G (maximum number of generations) = 250
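For orientation, the setup of this example can be sketched as follows. The sketch builds one individual of ten CDSs by choosing a synonymous codon for each amino acid of the insulin A chain. The codon table is the standard genetic code restricted to the amino acids that occur in this sequence; the uniform random choice is an assumption made for brevity, whereas the embodiment generates individuals using a codon frequency table. The constant and function names are hypothetical.

```python
import random

# Parameters of the example.
PM, PC, P, G = 0.05, 0.5, 100, 250
N_CDS = 10
INSULIN_A = "GIVEQCCTSICSLYQLENYCN"

# Standard genetic code, only for the amino acids present in INSULIN_A.
SYNONYMS = {
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "I": ["ATT", "ATC", "ATA"],
    "V": ["GTT", "GTC", "GTA", "GTG"],
    "E": ["GAA", "GAG"],
    "Q": ["CAA", "CAG"],
    "C": ["TGT", "TGC"],
    "T": ["ACT", "ACC", "ACA", "ACG"],
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "Y": ["TAT", "TAC"],
    "N": ["AAT", "AAC"],
}
CODON_TO_AA = {c: aa for aa, codons in SYNONYMS.items() for c in codons}

def random_individual(protein, n_cds, rng):
    """One individual: n_cds CDSs, each a random synonymous encoding of protein."""
    return ["".join(rng.choice(SYNONYMS[aa]) for aa in protein)
            for _ in range(n_cds)]

def translate(cds):
    """Translate a CDS back to its amino acid sequence."""
    return "".join(CODON_TO_AA[cds[k:k + 3]] for k in range(0, len(cds), 3))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
individual = random_individual(INSULIN_A, N_CDS, rng)
```

Every CDS produced this way is 63 bases long and translates back to the same 21-residue sequence, which is the invariant the mutation and crossover steps must also preserve.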

The calculation results of this simulation are described below with reference to FIGS. 15 and 16: a graph with the first evaluation criterion on the horizontal axis and the second evaluation criterion on the vertical axis, and a graph with the first evaluation criterion on the horizontal axis and the third evaluation criterion on the vertical axis.

FIG. 15 is a graph with the first evaluation criterion on the horizontal axis and the second evaluation criterion on the vertical axis. In the graph, circles show the calculation results for the 1st generation, squares those for the 10th generation, and triangles those for the 250th generation. Each point corresponds to one design result (= one individual data item). As already described, under the first evaluation criterion a larger minimum CAI value means a higher evaluation, so points further to the right along the horizontal axis are better and points further to the left are worse. Under the second evaluation criterion a larger minimum number of mismatched bases means a higher evaluation, so points higher along the vertical axis are better and points lower along it are worse. As FIG. 15 shows, the individual population data as a whole becomes more favorable as the number of generations increases (in other words, as the number of repetitions of the main routine in FIG. 8 increases).

FIG. 16 is a graph with the first evaluation criterion on the horizontal axis and the third evaluation criterion on the vertical axis. The circles, squares and triangles have the same meaning as in FIG. 15. Under the third evaluation criterion a shorter longest common string means a higher evaluation, so points lower along the vertical axis are better and points higher along it are worse. As FIG. 16 shows, the individual population data as a whole becomes more favorable as the number of generations increases (in other words, as the number of repetitions of the main routine in FIG. 8 increases).

Various embodiments have been described above, but the present invention is not limited to them.

For example, the selection in S26 of the main routine in FIG. 8 may use only one of two combinations, namely the first and second evaluation criteria or the first and third evaluation criteria, so that only one of the graphs shown in FIGS. 15 and 16 is obtained. When both combinations are used, individual data with a high evaluation may be identified in each generation from the graph of FIG. 15 and from the graph of FIG. 16, points may be assigned according to an arbitrary rule, and the individual data with the highest total points across the two graphs may be selected. Alternatively, instead of two-dimensional graphs as in FIGS. 15 and 16, a three-dimensional graph may be created with the first evaluation criterion on the x axis, the second on the y axis and the third on the z axis, and individual data that are rated highly on each of the three evaluation criteria may be selected.

The storage unit 20 need not be provided inside the information processing apparatus 1; it may instead be provided in an external information processing apparatus such as a PC or a server, in a cloud computing arrangement. In this case, the external information processing apparatus transmits the necessary data to the information processing apparatus 1 for each calculation.

The invention can also be provided as an ASIC (application specific integrated circuit), an FPGA (field-programmable gate array) or a DRP (Dynamic ReConfigurable Processor) in which the functions of the information processing apparatus 1 are implemented. It can also be provided as a program that causes a computer to realize the functions of the information processing apparatus 1. In this case, the program can be distributed via the Internet or the like.

Furthermore, "NSGA-II", a multi-objective genetic algorithm, can be used as the present algorithm. This is because, like the present algorithm, it yields p optimal solutions at once. "Simulated annealing" or a "(single-objective) genetic algorithm", which are types of combinatorial optimization algorithm, may also be used. In that case, however, p optimal solutions cannot be obtained at once, so the calculation must be repeated at least p times to obtain p optimal solutions. The calculation results of two or more of these algorithms may also be mixed. In this case, an arbitrary integer α not greater than p is set, α individuals are selected from the calculation results of one algorithm, p−α individuals are selected from the calculation results of another algorithm, and the p individuals obtained by combining them are used.
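The mixing of results from two algorithms can be sketched as follows. How the α and p−α individuals are chosen from each result set is not specified here, so the sketch simply takes the first entries of each list, assuming each list is already ordered from best to worst. The function name `mix_results` is hypothetical.

```python
def mix_results(results_a, results_b, alpha, p):
    """Combine alpha individuals from one algorithm with p - alpha from another."""
    if not 0 <= alpha <= p:
        raise ValueError("alpha must be an integer between 0 and p")
    if len(results_a) < alpha or len(results_b) < p - alpha:
        raise ValueError("not enough individuals to select from")
    # Assumption: each list is ordered best first, so slicing takes the best ones.
    return list(results_a[:alpha]) + list(results_b[:p - alpha])

# Toy labels standing in for individuals.
mixed = mix_results(["a1", "a2", "a3"], ["b1", "b2", "b3"], alpha=1, p=3)
```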

Furthermore, the present invention can also be understood as a gene sequence design method comprising:
a parent individual population data acquisition step of acquiring first generation parent individual population data, which is generated based on data representing an amino acid sequence, a number of genes and a codon frequency table, and which represents a first generation parent individual population including a predetermined number of individual data;
a mutation processing step of executing mutation processing on the individuals included in the first generation parent individual population data;
a child individual population data acquisition step of acquiring first generation child individual population data representing a first generation child individual population including the individuals subjected to the mutation processing;
a non-dominated sort execution step of executing non-dominated sort processing on first generation integrated data, in which the first generation parent individual population data and the first generation child individual population data are integrated, based on predetermined evaluation criteria relating to codon suitability and to the base sequences of the codons, and classifying all individual data included in the first generation integrated data by rank in the Pareto optimal solution; and
an individual selection step of selecting a predetermined number of the individual data, in descending order of rank, from all the individual data classified by rank.

1: information processing apparatus, 10: processing unit, 20: storage unit, 30: operation unit, 40: display unit, 50: communication unit, 100: bus, 101: individual generation unit, 102: parent individual population data acquisition unit, 103: child individual population data acquisition unit, 104: crossover processing unit, 105: mutation processing unit, 106: non-dominated sort execution unit, 107: individual selection unit, 201: amino acid sequence data storage unit, 202: gene number data storage unit, 203: codon frequency table data storage unit, 204: calculation data storage unit, 205: evaluation criteria storage unit

Claims

An information processing apparatus comprising:
a parent individual population data acquisition unit that acquires first generation parent individual population data, which is generated based on data representing an amino acid sequence, a number of genes and a codon frequency table, and which represents a first generation parent individual population including a predetermined number of individual data;
a mutation processing unit that executes mutation processing on the individuals included in the first generation parent individual population data;
a child individual population data acquisition unit that acquires first generation child individual population data representing a first generation child individual population including the individuals subjected to the mutation processing;
a non-dominated sort execution unit that executes non-dominated sort processing on first generation integrated data, in which the first generation parent individual population data and the first generation child individual population data are integrated, based on predetermined evaluation criteria relating to codon suitability and to the base sequences of the codons, and classifies all individual data included in the first generation integrated data by rank in the Pareto optimal solution; and
an individual selection unit that selects a predetermined number of the individual data, in descending order of rank, from all the individual data classified by rank.

The information processing apparatus according to claim 1,
wherein, when selecting the predetermined number of the individual data, the individual selection unit selects individual data of the same rank in descending order of crowding distance.

The information processing apparatus according to claim 1 or 2,
wherein the parent individual population data acquisition unit uses the individual data selected by the individual selection unit as second generation parent individual population data representing a second generation parent individual population, and
the processing by the mutation processing unit, the non-dominated sort execution unit and the individual selection unit is executed until a predetermined number of generations is reached.

The information processing apparatus according to any one of claims 1 to 3,
wherein the evaluation criterion relating to codon suitability is based on the minimum value of the codon adaptation index (CAI) of the CDSs, each CDS being a base sequence possessed by each individual that is to be translated into amino acids.

The information processing apparatus according to claim 4,
wherein the higher the minimum value of the codon adaptation index of the CDSs contained in an individual, the higher the evaluation of that individual.

The information processing apparatus according to any one of claims 1 to 5,
wherein the evaluation criterion relating to the base sequences of the codons is based on the minimum value of the number of mismatched bases, which represents the number of bases that differ between two CDSs contained in each individual.

The information processing apparatus according to claim 6,
wherein the higher the minimum value of the number of mismatched bases, the higher the evaluation of the individual.

The information processing apparatus according to any one of claims 1 to 7,
wherein the evaluation criterion relating to the base sequences of the codons is based on the length of the longest common string, which is the longest of the base sequences that match consecutively between the CDSs contained in each individual or between different sites within a single CDS.

The information processing apparatus according to claim 8,
wherein the shorter the length of the longest common string, the higher the evaluation of the individual.

The information processing apparatus according to any one of claims 1 to 9,
wherein the mutation processing unit executes a first mutation process and a second mutation process different from the first mutation process on each individual data item included in g-th generation parent individual population data representing a g-th generation parent individual population.

The information processing apparatus according to claim 10,
wherein the mutation processing unit executes the first mutation process, which, for all CDSs contained in the individual, replaces a codon contained in a CDS with a codon of higher frequency than that codon, with a predetermined probability.

The information processing apparatus according to claim 10,
wherein the mutation processing unit executes the second mutation process, which replaces, with a predetermined probability, a codon overlapping the longest common string with another codon, the longest common string being the longest of the base sequences that match consecutively between the CDSs contained in each individual or between different sites within a single CDS.

The information processing apparatus according to any one of claims 10 to 12,
wherein the first mutation process or the second mutation process is selected at random.

The information processing apparatus according to any one of claims 1 to 13,
further comprising a crossover processing unit that executes crossover processing on the individuals included in the first generation parent individual population data,
wherein the crossover processing extracts a predetermined even number of individual data from g-th generation parent individual population data representing a g-th generation parent individual population, selects two individual data at a time from the extracted individual data, and is executed on the two selected individual data.

The information processing apparatus according to claim 14,
wherein the crossover processing unit determines a crossover point from among the boundaries of the codons included in the CDSs of first individual data and second individual data, which are the two selected individual data, and
exchanges the codons included in the first individual data and the second individual data at the crossover point.

An information processing program that causes a computer to function as:
a parent individual population data acquisition unit that acquires first generation parent individual population data, which is generated based on data representing an amino acid sequence, a number of genes and a codon frequency table, and which represents a first generation parent individual population including a predetermined number of individual data;
a mutation processing unit that executes mutation processing on the individuals included in the first generation parent individual population data;
a child individual population data acquisition unit that acquires first generation child individual population data representing a first generation child individual population including the individuals subjected to the mutation processing;
a non-dominated sort execution unit that executes non-dominated sort processing on first generation integrated data, in which the first generation parent individual population data and the first generation child individual population data are integrated, based on predetermined evaluation criteria relating to codon suitability and to the base sequences of the codons, and classifies all individual data included in the first generation integrated data by rank in the Pareto optimal solution; and
an individual selection unit that selects a predetermined number of the individual data, in descending order of rank, from all the individual data classified by rank.