JP7551189B1

JP7551189B1 - Method for predicting promoter activity and method for modifying promoter based on the results of the prediction

Info

Publication number: JP7551189B1
Application number: JP2024001583A
Authority: JP
Inventors: 浩一豊倉; 連谷本; 隆之近藤
Original assignee: GRA&GREEN INC.
Current assignee: GRA&GREEN INC.
Priority date: 2023-12-28
Filing date: 2024-01-10
Publication date: 2024-09-17
Anticipated expiration: 2043-12-28

Abstract

[Problem] The present disclosure aims to provide a method for obtaining a promoter sequence modified to have a desired transcriptional activity. [Solution] Provided is a method for obtaining a promoter sequence modified to have a desired transcriptional activity, the method including preparing an original promoter sequence to be modified, generating, based on that promoter sequence, a set of multiple modified promoter sequences that can be created by genome editing technology, predicting the transcriptional activity of each modified promoter sequence in the generated set using a machine learning model, and selecting a modified promoter sequence predicted to have the desired transcriptional activity. [Selected Figure] Figure 2

Description

The present disclosure relates to predicting promoter activity and to modifying promoters based on the results of that prediction.

At present, when traits are conferred on plants by genome editing, the main approach is loss of function: a mutation is introduced into the coding region (CDS) to cause a frameshift and thereby abolish gene function.

On the other hand, one can envisage approaches that enhance function by raising the expression level of a target gene, or that suppress the side effects of a knockout by leaving a small amount of expression rather than reducing it to zero. Because such approaches make it possible to apply results from studies that enhanced function by introducing transgenes, as well as knowledge from functional analyses using RNAi, they are expected to greatly broaden the range of traits that can be conferred by genome editing, and are therefore important.

Zhang, Yi, et al. "Applications and potential of genome editing in crop improvement." Genome biology 19 (2018): 1-11.

One such approach is to edit the promoter region rather than the CDS. Because the strength of gene expression is determined by regions such as promoters and enhancers, modifying the base sequence of these regions should make it possible to raise or lower the expression level while leaving the function of the gene unchanged.

In fact, the dynamic range of gene expression at the mRNA level, as measured by RNA-seq, is very large. Some mRNAs are present at only a few copies per cell, while others are present at about 10⁵ copies. Such differences in expression level are thought to arise mainly from promoters and enhancers, which indicates the potential of methods that modify the base sequence of the promoter region. Accordingly, one objective of the present disclosure is to provide a technology for precisely controlling gene expression levels by genome editing.

In regions such as promoters and enhancers, the sequence-function correspondence ("this part has this function") is more ambiguous than in the CDS. Several cis-regulatory elements with highly conserved consensus sequences, typified by the TATA box, have been discovered and compiled into databases. For example, software such as RIKEN's PromoterCAD and NEW PLACE from the National Agriculture and Food Research Organization searches for cis-regulatory elements on the basis of this knowledge and infers whether transcription factors bind to a promoter, but its accuracy is insufficient for the purpose of predicting and designing expression levels.

Therefore, the present inventors developed an approach in which the relationship between base sequence and expression level is learned by machine learning, rather than a database approach in which individual elements are registered. As a result, a high coefficient of determination, R² = 0.75, was obtained for the correlation between measured and predicted values. The inventors then designed sequences using the prediction system and carried out experiments in plants to demonstrate the accuracy of the predictions. If an arbitrary base sequence can be given to a computer and its transcriptional activity predicted, the expression level can be "designed". As shown in the present disclosure, the inventors demonstrated that the function (expression level) of a gene can be increased or decreased by editing its promoter on the basis of computer prediction.
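For reference, a coefficient of determination such as the R² = 0.75 cited above can be computed as in the following minimal sketch. It assumes the common definition R² = 1 - SS_res/SS_tot; the disclosure does not state which variant was used (the squared Pearson correlation is another common convention), and the numbers are illustrative values, not data from the disclosure.

```python
def r_squared(measured, predicted):
    """Coefficient of determination, defined here as 1 - SS_res / SS_tot."""
    if len(measured) != len(predicted) or len(measured) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_y = sum(measured) / len(measured)
    # Residual sum of squares: how far predictions are from measurements.
    ss_res = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    # Total sum of squares: spread of the measurements around their mean.
    ss_tot = sum((y - mean_y) ** 2 for y in measured)
    if ss_tot == 0:
        raise ValueError("measured values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical values for illustration only (not from the disclosure).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(measured, predicted), 3))
```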

The present disclosure builds on these findings and includes the following aspects:
[Aspect 1]
A method for obtaining a promoter sequence modified to have a desired activity, the method comprising:
providing an original promoter sequence to be modified;
generating, based on the promoter sequence, a set of multiple modified promoter sequences that can be created by genome editing technology;
predicting, using a machine learning model, the activity of each modified promoter sequence included in the generated set of modified promoter sequences; and
selecting a modified promoter sequence predicted to have the desired activity.
[Aspect 2]
The method according to aspect 1, wherein the desired activity is a gene expression induction activity higher than that of the original promoter sequence, or a gene expression induction activity lower than that of the original promoter sequence.
[Aspect 3]
The method according to aspect 1, wherein the promoter sequence is a plant cell promoter sequence.
[Aspect 4]
The method according to aspect 1, wherein the original promoter sequence comprises a core promoter sequence and the sequence upstream of it.
[Aspect 5]
The method according to aspect 1, wherein the genome editing technology is a genome editing technology using a CRISPR/Cas system.
[Aspect 6]
The method according to aspect 5, wherein the set of multiple modified promoter sequences is generated by sequence deletions resulting from cleavage induced by combinations of guide RNA sequences designed on the basis of two PAM recognition sequences.
[Aspect 7]
The method according to aspect 1, wherein the set of modified promoter sequences comprises at least 1000 different sequences.
[Aspect 8]
The method according to aspect 1, wherein the machine learning model is a regression model trained to predict gene expression induction activity from a promoter sequence, using gene expression induction activity data of a plurality of promoter sequences in plant cells as training data.
[Aspect 9]
The method according to aspect 1, further comprising visualizing, on a computer display, the activity of each modified promoter sequence included in the generated set of modified promoter sequences.
[Aspect 10]
An information processing device for predicting a promoter sequence modified to have a desired activity, the device comprising:
a sequence input unit that receives input of an original promoter sequence to be modified;
a modified sequence generation unit that generates, based on the promoter sequence, a set of multiple modified promoter sequences that can be created by genome editing technology;
an activity prediction unit that predicts, using a machine learning model, the activity of each modified promoter sequence included in the generated set of modified promoter sequences; and
a sequence selection unit that selects a modified promoter sequence predicted to have the desired activity.
[Aspect 11]
A non-transitory computer-readable medium having instructions stored thereon, wherein the instructions, when executed by a processor, are capable of carrying out the following steps:
receiving input of an original promoter sequence to be modified;
generating, based on the promoter sequence, a set of multiple modified promoter sequences that can be created by genome editing technology;
predicting, using a machine learning model, the activity of each modified promoter sequence included in the generated set of modified promoter sequences; and
selecting a modified promoter sequence predicted to have a desired activity.
[Aspect 12]
A program that causes a computer to implement:
a function of receiving input of an original promoter sequence to be modified;
a function of generating, based on the promoter sequence, a set of multiple modified promoter sequences that can be created by genome editing technology;
a function of predicting, using a machine learning model, the activity of each modified promoter sequence included in the generated set of modified promoter sequences; and
a function of selecting a modified promoter sequence predicted to have a desired activity.
[Aspect 13]
A method of genome editing of a cell for regulating the expression level of a desired gene, the method comprising:
obtaining a modified promoter sequence having a desired activity by the method according to aspect 1;
preparing a cell to be subjected to genome editing; and
editing the genome of the cell so as to produce the modified promoter sequence.
[Aspect 14]
A method for producing a genome-edited plant in which the expression level of a desired gene is regulated, the method comprising:
editing the genome of a desired plant cell by the method according to aspect 13; and
obtaining a plant individual derived from the genome-edited cell.
[Aspect 15]
A machine learning model that predicts gene expression induction activity in plant cells from a promoter sequence, the machine learning model being a regression model trained to predict gene expression induction activity from a promoter sequence, using gene expression induction activity data of a plurality of promoter sequences in plant cells as training data.
[Aspect 16]
A method for generating a machine learning model that predicts gene expression induction activity in plant cells from a promoter sequence, the method comprising training a model using gene expression induction activity data of a plurality of promoter sequences in plant cells as training data.
[Aspect 17]
The method according to aspect 16, wherein a transformer-based pre-trained foundation model is used as the model to be trained.
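The generate, predict, and select steps of aspects 1 and 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: only SpCas9-style NGG PAMs on the given strand are considered, each cut is assumed to fall 3 bp upstream of its PAM, every pair of cuts is assumed to yield a clean deletion of the intervening segment, and `placeholder_activity` is a hypothetical stand-in (GC fraction) for the trained regression model, which is not reproduced here.

```python
from itertools import combinations


def find_cut_sites(seq):
    """Cut positions for forward-strand NGG PAMs (assumed cut 3 bp upstream of the PAM)."""
    sites = []
    for p in range(20, len(seq) - 2):  # p indexes the N of NGG; a 20-nt protospacer must fit
        if seq[p + 1:p + 3] == "GG":
            sites.append(p - 3)
    return sites


def enumerate_deletions(seq):
    """Yield (cut_a, cut_b, edited_sequence) for every pair of cut sites."""
    for a, b in combinations(find_cut_sites(seq), 2):
        yield a, b, seq[:a] + seq[b:]  # drop the segment between the two cuts


def placeholder_activity(seq):
    """Hypothetical stand-in for the trained model: GC fraction of the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def select_edits(seq, predict, want_higher=True):
    """Return edits whose predicted activity is above (or below) that of the original."""
    base = predict(seq)
    hits = []
    for a, b, edited in enumerate_deletions(seq):
        score = predict(edited)
        improved = score > base if want_higher else score < base
        if improved:
            hits.append((score, a, b, edited))
    return sorted(hits, reverse=want_higher)


# Toy promoter for illustration only.
promoter = "A" * 20 + "TGG" + "A" * 10 + "CGG" + "A" * 5
for score, a, b, edited in select_edits(promoter, placeholder_activity, want_higher=False):
    print(a, b, round(score, 3), edited)
```

In practice `predict` would wrap the trained regression model, and the candidate set would typically be much larger (aspect 7 refers to at least 1000 sequences).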

Figure 1 is a schematic diagram showing the structure of the artificial synthetic genes. Seven promoters were selected from the high-expression group, eight from the medium group, and four from the low-expression group, and each was connected to the luciferase (LUC) gene to create an artificial synthetic gene. The 19 promoters have a variety of sequence patterns. The downstream LUC gene and the upstream cauliflower mosaic virus (CaMV) 35S enhancer sequence are common to all the artificial genes.

Figure 2 is a scatter plot showing the relationship between the measured expression level of the LUC gene and the predicted promoter strength. The vertical axis shows the predicted promoter strength; the higher the value, the higher the expected expression of the downstream gene. The horizontal axis shows the measured LUC expression level, normalized to the transcriptional activity of the CaMV promoter used as the positive control.

Figure 3 shows the result of calculating the LUC expression levels corresponding to the predicted transcriptional activities and comparing them with measurements. LUC expression is plotted on the vertical axis as a value relative to the luminescence intensity of the positive control, which is taken as 1. The cauliflower mosaic virus (CaMV) 35S promoter, which is known to show high transcriptional activity, was used as the positive control. The horizontal axis shows the numbers of the 19 promoters. Promoter numbers 3 to 10 were classified into the "high expression" group, for which high expression strength was predicted, promoter numbers 13 to 21 into the "medium" group, and promoter numbers 22 to 25 into the "low expression" group. Black bars show the measured values from the LUC assay. White bars show the transcriptional activity values predicted by the prediction system of the present disclosure.

Figure 4 shows the results of searching for editing patterns that can reduce transcriptional activity. The search was performed for promoter numbers 3, 4, 5, 6, 9, 10, and 21. Black triangles show the predicted LUC assay score (expression level) calculated from the newly predicted transcriptional activity. The transcriptional activity was predicted to fall to about 14% to 1% of that of the original promoter.

Figure 5 overlays on Figure 4, as shaded bars, the measured LUC expression levels for the new promoters. Because the expression levels were very low, the vertical axis is expanded. No value was obtained for promoter 5. For all six promoters for which measurements were obtained, a large decrease in expression was observed, consistent with the predictions.

Figure 6 shows the results of searching for edits whose predicted transcriptional activity increases after editing. For promoter numbers 13, 17, 22, 23, 24, and 25, edits that increase transcriptional activity were selected, and the predicted expression levels are shown as white circles. LUC expression levels 7.4 to 125 times that of the original promoter were predicted.

Figure 7 shows the results of artificially synthesizing genes carrying the theoretical promoter sequences, transfecting them into protoplasts in the same manner, and measuring expression with a plate reader. For five of the six promoters, expression increased relative to the original promoter, but only promoter 13 exceeded its predicted value.

Figure 8 shows an example of a visualization method for analyzing the overall picture of an input base sequence X. The vertical axis corresponds to the predicted transcriptional activity and the horizontal axis to the position in the base sequence.

Figure 9 shows an example of visualization by another method. The vertical axis corresponds to the position of the guide RNA designed close to the target gene (proximal guide), and the horizontal axis to the position of the guide RNA designed far from it (distal guide). The color of the plot at each position is set according to the predicted transcriptional activity. For example, on the color chart, light gray indicates high predicted transcriptional activity and black indicates low predicted transcriptional activity.

Figure 10 shows the results of a study aimed at raising the transcriptional activity of a specific soybean gene. Starting from the promoter of a soybean gene, two post-editing promoter sequences intended to increase expression of the gene were created (edit1, edit2). The transcriptional activity of edit1 was predicted to be 2.950119, and its LUC expression level was predicted by linear regression to be 13.2% of the positive control. The measured LUC assay value for this promoter was 19.0% of the positive control. The transcriptional activity of edit2 was predicted to be 2.432977, and its LUC expression level was predicted to be 7.27% of the positive control. The measured LUC assay value for this promoter was 41.6% of the positive control.

Figure 11 shows an exemplary configuration of an information processing device that predicts a promoter sequence modified to have a desired activity.

Figure 12 shows an exemplary flowchart of a program that causes a computer to implement a function of receiving input of an original promoter sequence to be modified, a function of generating, based on the promoter sequence, a set of multiple modified promoter sequences that can be created by genome editing technology, a function of predicting the activity of each modified promoter sequence in the generated set using a machine learning model, and a function of selecting a modified promoter sequence predicted to have a desired activity.

Figure 13 is a schematic configuration of a computer that can be used to implement aspects of the present disclosure.
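As a quick check on the Figure 10 numbers, the ratio of measured to predicted LUC expression can be computed directly. The sketch below uses only the percentages reported for edit1 and edit2; it assumes that both the predicted and the measured values are expressed relative to the same positive control, and the function name is illustrative.

```python
def fold_difference(predicted_pct, measured_pct):
    """Ratio of measured to predicted LUC expression (both as % of the control)."""
    if predicted_pct <= 0:
        raise ValueError("predicted value must be positive")
    return measured_pct / predicted_pct


# (predicted %, measured %) as reported for Figure 10.
figure10 = {"edit1": (13.2, 19.0), "edit2": (7.27, 41.6)}

for name, (predicted, measured) in figure10.items():
    print(f"{name}: measured/predicted = {fold_difference(predicted, measured):.2f}")
```

Both edits were measured above their predicted values, by roughly 1.4-fold for edit1 and 5.7-fold for edit2.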

In the course of crop breeding, breeders wish to enhance or suppress the expression of specific genes. As gene functions have been analyzed, relationships between genes and traits have been revealed one after another. Take semi-dwarfism in rice as an example: the semi-dwarf trait was adopted as a breeding goal, and the semi-dwarf variety "Reimei" was developed in 1966. Semi-dwarf rice does not readily grow tall even when fertilizer is applied, and its yield does not decrease. Ordinary rice grows taller when fertilized, which makes it prone to lodging in typhoons or strong winds, so that it cannot be harvested or its quality declines. Fertilizing beyond a certain level therefore causes lodging and does not increase yield. Because semi-dwarf rice does not readily grow tall, it resists lodging even under heavy fertilization, and yields increased. It is now known that the mutation that conferred semi-dwarfism on Reimei lies in a gene encoding GA20 oxidase of the gibberellin biosynthesis pathway. Semi-dwarf crops were a major invention that brought enormous progress in expanding food production.

Such relationships between genes and traits have been revealed one after another, and knowledge has accumulated. As a result, a target gene is now often specified in order to obtain a desired trait. For example, to introduce semi-dwarfism into rice, one can select a mutant of the above gene (an sd1 mutant). Naturally occurring mutants can be sought, but mutations can also be introduced at random by chemical treatment or irradiation. The loss-of-function allele of sd1 is thought to be recessive, but planned crossing can produce individuals in which both alleles are loss-of-function sd1.

The advent of genetic recombination technology made it possible to design genotypes artificially. That is, functions and traits can be conferred by introducing an arbitrary gene sequence into the target crop. A representative example is corn with herbicide tolerance (Roundup Ready (registered trademark)). Roundup (registered trademark) (glyphosate) is a herbicide and is in substance an amino acid analog. When plants and some bacteria take up glyphosate, amino acid synthesis is inhibited and they die. Introducing a bacterial gene that detoxifies glyphosate conferred glyphosate tolerance on plants, a major invention that greatly reduced the cost of weeding. On the other hand, the production and use of genetically modified plants carrying foreign genes remained limited, and genetically modified varieties did not replace all crops on Earth. Opposition was particularly strong in Europe, where cultivation of genetically modified crops is still heavily restricted and only one variety is grown (corn in Spain).

The situation changed further with the invention of genome editing. It became possible to introduce mutations directly into a target gene without genetic recombination. This can be called a historic invention. For example, to introduce semi-dwarfism into Koshihikari, one can introduce a small indel of about 1 to 10 bases at the SD1 locus to induce a frameshift and a loss of function. Compared with chemical or radiation mutagenesis (followed by about 5 to 7 rounds of backcrossing), this method allows faster development and avoids the expression of unintended traits caused by linkage drag, giving it a major cost advantage. An example of a food produced by this technique and sold in Japan is red sea bream lacking the myostatin gene, sold by Regional Fish.
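Whether a small indel of this kind shifts the reading frame depends only on its net length change. The following minimal sketch encodes that rule; the function name is illustrative, and the check deliberately ignores other ways an indel can affect function (for example an in-frame deletion that removes essential residues).

```python
def causes_frameshift(inserted: int, deleted: int) -> bool:
    """True when the net length change is not a multiple of 3 (one codon)."""
    if inserted < 0 or deleted < 0:
        raise ValueError("lengths must be non-negative")
    return (inserted - deleted) % 3 != 0


# Deletion sizes from 1 to 10 bases that would shift the reading frame.
print([n for n in range(1, 11) if causes_frameshift(0, n)])
```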

In addition to small indels of about 1 to 10 bases, genome editing can also induce medium-sized deletions of up to several thousand bases. In this case, two guide RNAs are delivered into the cell simultaneously so that the DNA is cut at two target sequences. When the two cut sites are physically close, for example linked on the same chromosome, the segment between them may drop out before the break is repaired. A specific example is Corteva Agriscience's waxy corn, in which approximately 4 kb, including the entire coding sequence of the Wx locus, has been deleted.

The mutations in the varieties described above, which were bred through genome editing, are of a kind that can also occur in nature, and the resulting crops are essentially no different from those obtained by conventional cross-breeding. By contrast, genome editing can also be used to trigger homologous recombination, which allows a target position in the genome to be replaced with an arbitrary DNA sequence. This can be applied to introduce various mutations such as point mutations, deletions, and insertions, and replacement with a foreign gene amounts to genetic recombination. Thus, although all of this is called genome editing, it can produce mutants of essentially different categories, from mutations that can arise in nature to genetic recombination that could never occur naturally. To avoid confusion, varieties created by genome editing are classified into three categories, SDN1 to SDN3. The Regional Fish and Corteva examples above are both classified as SDN1. For genome-edited crops, many countries, including Japan, require procedures such as notification to government agencies and prior consultation when a new variety is created. With SDN1, because the newly introduced mutation is essentially equivalent to a naturally occurring one and because precedents have accumulated, procedures are expected to proceed quickly in many countries, which is an industrial advantage.

As mentioned above, genome editing can induce frameshifts, which makes it easy to create loss-of-function mutants of a specific gene. However, loss-of-function mutants are not the only ones useful to breeders. Countless mutants with increased or decreased gene expression have been selected, intentionally or unintentionally. In addition, a vast number of gene silencing (RNAi, knockdown) experiments have been reported in the history of plant science. Knockout (KO) and knockdown (KD) lines can differ in phenotype, and in some cases a small amount of residual expression is thought to avert a harmful phenotype. A large reduction in gene expression that mimics a KD line is therefore also often useful. The method of the present disclosure has the advantage of making it possible to design such increases and decreases in gene expression using SDN1 genome editing, which in Japan can be commercialized through notification to the relevant ministries alone.

One way to enhance function is to introduce a mutation into the gene itself. For example, activity can be increased by disrupting the autoinhibitory domain of an enzyme. As an example among crops sold in Japan, in the high-GABA tomato from Sanatech Seed, GABA synthesis activity is enhanced by using a frameshift to disrupt the autoinhibitory domain at the C-terminus of the GABA-synthesizing enzyme GAD. This method is very effective, but it can be used only when the target protein (gene) has an autoinhibitory domain. Because relatively few proteins have such a domain, the range of application is narrow. Other conceivable approaches include disrupting a protein's localization signal to make it constitutively active, replacing a phosphorylated amino acid with glutamic acid, and introducing a mutation into an enzyme's active site to improve reactivity. However, none of these approaches is generally applicable to any gene.

機能を強化する方法で、多くの遺伝子に一般的に利用可能なものとして、もうひとつは、遺伝子の発現量を調節する領域に変異を導入する方法が考えられる。つまり、プロモーターを編集することで発現量を増加させ、機能が強化されるというアイデアであるが、このアイデアには欠点がある。それは、プロモーターにしても、配列と機能の対応関係が曖昧であるという点である。遺伝子では、コドンとアミノ酸配列の対応関係は厳密であり、特定のアミノ酸配列がつくる機能ドメインも保存性が高いため、配列から機能を推定することができる。このため、標的のドメインを破壊するといったことが高い確度で実施できる。翻ってプロモーター（非ORF）では、「この部分をこうすれば、このようなことが起こる」という仮説を立てることが難しい。
もちろん、配列と機能の関連についても存在している。TATAボックスに代表されるCis制御エレメントは、基本転写因子や転写因子の結合部位として理解されており、配列を元に転写因子を推測すること、転写因子から結合部位を推定することに成功した事例がある。例えば、理化学研究所のPromoterCADや農業・食品産業技術総合研究機構のNEW PLACEといったWebアプリケーションは、こうしたエレメントを検索するためのツールである。こうしたツールが提案されている一方で、これらを利用して育種に成功した例は限られていると思われる。Song et al.,2022はプロモーターのゲノム編集による遺伝子発現の向上を報告した（Song, X., Meng, X., Guo, H. et al. Targeting a gene regulatory element enhances rice grain yield by decoupling panicle number and size. Nat Biotechnol 40, 1403-1411 (2022). https://doi.org/10.1038/s41587-022-01281-7）。この報告ではイネIPA1遺伝子のプロモーターまたは5’UTRを編集することで、収量の増加が得られることを報告している。ゲノム編集で様々な欠失パターンの系統を作出して、発現パターンや形質を評価して、重要な制御エレメントを探し当てる、という手順になっている。これは、制御エレメントから機能をデザインすることがいかに困難であるかを示している。制御エレメントのゆらぎ（典型的な、コンセンサス配列から数塩基のミスマッチを許容する）が、困難さのひとつの理由であろう。最近は、ディープラーニングを応用してCis制御エレメントを推定する報告もあるが、実際作物の形質に結びついたという例は絶無である。 Another method of enhancing function that can be generally used for many genes is to introduce mutations into the region that regulates the expression level of a gene. In other words, the idea is to increase the expression level by editing the promoter, and enhance the function, but this idea has a drawback. That is, even in the case of promoters, the correspondence between sequence and function is ambiguous. In genes, the correspondence between codons and amino acid sequences is strict, and the functional domains created by specific amino acid sequences are also highly conserved, so the function can be inferred from the sequence. For this reason, it is possible to destroy the target domain with a high degree of accuracy. On the other hand, with promoters (non-ORFs), it is difficult to make a hypothesis that "if you do this part like this, this will happen."
Of course, relationships between sequence and function do exist. Cis-regulatory elements, typified by the TATA box, are understood as binding sites for general transcription factors and transcription factors, and there are cases in which transcription factors were successfully inferred from sequences and binding sites were inferred from transcription factors. For example, web applications such as PromoterCAD from RIKEN and NEW PLACE from the National Agriculture and Food Research Organization are tools for searching for such elements. Although such tools have been proposed, successful breeding examples that used them appear to be limited. Song et al., 2022 reported improved gene expression through genome editing of a promoter (Song, X., Meng, X., Guo, H. et al. Targeting a gene regulatory element enhances rice grain yield by decoupling panicle number and size. Nat Biotechnol 40, 1403-1411 (2022). https://doi.org/10.1038/s41587-022-01281-7). The study shows that editing the promoter or 5'UTR of the rice IPA1 gene increases yield. Its procedure was to create lines with various deletion patterns by genome editing, evaluate their expression patterns and traits, and thereby locate the important regulatory element. This illustrates how difficult it is to design function starting from regulatory elements. One reason for the difficulty is probably the variability of regulatory elements (they typically tolerate mismatches of several bases from the consensus sequence). Recently, there have been reports of applying deep learning to infer cis-regulatory elements, but none of them has been tied to an actual crop trait.

本発明者らは、Cis制御エレメントから機能を推測する従来の方法ではなく、DNA配列から発現量を直接推測する戦略を採用した。そのために、プロモーターに標的を限定した。配列と発現量の関係性はブラックボックスであるが、ディープラーニングを利用した方法では、しばしば原理や理論が不明な場合でも高いパフォーマンスを発揮する。例えば機械翻訳の分野では、1990年代までは、語彙を収集して機能（意味情報）に紐づけた辞書を構築し、言語を変換する方法が模索されたが（例えば、英語から独語への変換）、高いパフォーマンスは得られなかった。これに対し、2011年ごろになってニューラルネットワーク（NN）を応用したWATSONが自然言語処理で高いパフォーマンスを発揮すると、人間が辞書を作成してコンピューターに意味を理解させるよりも、膨大なテキストデータを学習させることで、個々の文の意味を入力することなく高い精度で翻訳が実現されるようになった。今日のGoogle翻訳やDeepLはNNを用いた機械翻訳の代表的な例であって、いずれも大変高い有用性を提供している。このように、NNを用いた機械学習では、原理や理論を人間が理解・整理することなく高い性能を発揮することがある。 The inventors adopted a strategy to directly infer expression levels from DNA sequences, rather than the conventional method of inferring function from Cis regulatory elements. To achieve this, they limited the target to promoters. Although the relationship between sequence and expression level is a black box, methods using deep learning often perform well even when the principles and theories are unknown. For example, in the field of machine translation, until the 1990s, methods were explored to convert language by collecting vocabulary and building a dictionary linked to function (semantic information) (e.g., converting English to German), but high performance was not obtained. In contrast, around 2011, when WATSON, which applied neural networks (NN), demonstrated high performance in natural language processing, it became possible to translate with high accuracy without inputting the meaning of individual sentences by having the computer learn huge amounts of text data, rather than having humans create dictionaries and have the computer understand the meaning. Today's Google Translate and DeepL are typical examples of machine translation using NNs, and both provide very high usefulness. In this way, machine learning using NNs can sometimes demonstrate high performance without humans understanding or organizing the principles and theories.

DNABERTは、自然言語処理を目的として作成された自然言語モデルBERTを参考に、DNA配列を扱うように事前学習されたモデルである。DNABERTはDNA配列を元に非コード領域を解析することを目的に開発され、プロモーターや、スプライスサイトの発見などの課題で高いパフォーマンスが示されている。このことから、コアプロモーターの転写活性を予測させるという本発明の課題に用いることにした。しかし、DNABERTを本発明に応用するにあたり、2つの問題点があった。 DNABERT is a model that has been pre-trained to handle DNA sequences, based on the natural language model BERT, which was created for the purpose of natural language processing. DNABERT was developed with the aim of analyzing non-coding regions based on DNA sequences, and has shown high performance in tasks such as discovering promoters and splice sites. For this reason, we decided to use it for the task of this invention, which is to predict the transcriptional activity of core promoters. However, there were two problems with applying DNABERT to this invention.

第一に、ヒトゲノムが事前学習のデータとして用いられており、植物のゲノムをDNABERTで扱えるかどうかは明らかではなかった。植物ではコアプロモーターを構成するコアエレメントとして、TATA boxやInitiator、Kozakといった動物と共通の因子に加えて、Y patch、CAおよびGAという植物特異的エレメントを有する場合がある。また、動植物の遺伝子のプロモーターでは、DNAがメチル化されることで転写活性が制御される場合があるが、動物ではCG（CpG）という配列のシトシンがメチル化される場合がほとんどである。これに対して植物では非CG配列のシトシンもメチル化される。このように、転写の基本的なメカニズムは動植物間で共通しているとはいえ、コアエレメントやメチル化部位などの構成要素は異なる。このように、ヒトと構造が大きく異なるプロモーターを有する植物において、DNABERTが高いパフォーマンスを発揮するかどうかは不明であった。 First, the human genome was used as pre-training data, and it was unclear whether DNABERT could handle plant genomes. In plants, the core elements that make up the core promoter may have plant-specific elements such as Y patch, CA, and GA, in addition to common elements with animals such as the TATA box, Initiator, and Kozak. In addition, in the promoters of animal and plant genes, transcriptional activity may be controlled by DNA methylation, but in animals, cytosines in sequences called CG (CpG) are often methylated. In contrast, in plants, cytosines in non-CG sequences are also methylated. Thus, although the basic mechanism of transcription is common between animals and plants, the components such as core elements and methylation sites are different. Thus, it was unclear whether DNABERT would perform well in plants with promoters whose structures are significantly different from those of humans.

また、第二に、DNABERTは転写開始点の発見やスプライスサイトの発見などのタスクで高いパフォーマンスが示されていたが、これらは、与えられた配列中に、興味のある要素が「存在する」または「存在しない」ことを判定する課題で、二値分類問題であった。これに対し、本発明者らの課題はコアプロモーター配列の下流の遺伝子発現強度を予測するもので、連続的な値を出力することが要求された。このような回帰問題でもDNABERTが有効であるか、高いパフォーマンスを発揮するかどうかは全く不明であった。DNAの配列に関連した連続的な値を扱う場合は、CNNを用いることが一般的であった。本発明者らは、自然言語処理モデルBERTを応用して、回帰問題も扱えるとの報告を元に、DNABERTのコードを改変した。具体的には、クラスを追加して連続的な値を扱えるように変更した。 Secondly, DNABERT has been shown to perform well in tasks such as finding transcription start sites and splice sites, but these tasks were binary classification problems, in which the task was to determine whether an element of interest was "present" or "absent" in a given sequence. In contrast, the inventors' task was to predict gene expression intensity downstream of a core promoter sequence, and it was required to output continuous values. It was completely unclear whether DNABERT would be effective or perform well in such regression problems. When dealing with continuous values related to DNA sequences, it was common to use CNN. Based on reports that the natural language processing model BERT can also be used to handle regression problems, the inventors modified the DNABERT code. Specifically, they added a class and modified it so that it could handle continuous values.
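The difference between the binary classification setting and the regression setting described above can be written compactly. The following is only an illustrative formulation: the linear output head and the mean squared error loss are assumptions commonly used for regression fine-tuning, and the text does not state which head or loss the inventors actually used. Here $\mathbf{h}_i$ denotes the pooled encoder representation of promoter sequence $i$, $y_i$ its measured expression strength, and $\mathbf{w}$, $b$ the parameters of the added output layer (all symbols are introduced here for illustration).

```latex
% Classification (original DNABERT tasks): probability that an element is present
p_i = \sigma\!\left(\mathbf{w}^{\top}\mathbf{h}_i + b\right), \qquad p_i \in (0, 1)

% Regression (assumed form for this task): a continuous expression strength
\hat{y}_i = \mathbf{w}^{\top}\mathbf{h}_i + b, \qquad
\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^{2}
```

Under this formulation the encoder is unchanged; only the output layer and the loss differ between the two settings.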

このような問題点がありつつも、別に示した手順でDNABERTの事前学習モデルをファインチューニングし、高い精度でプロモーターの強度を予測することに成功した。さらに発明者らは、この学習モデルを組込み、ゲノム編集によりプロモーターの機能を変化させるプログラムの開発にも成功している。 Despite these issues, the inventors were able to fine-tune the DNABERT pre-training model using a procedure described separately, and successfully predict promoter strength with high accuracy. Furthermore, the inventors have also successfully developed a program that incorporates this learning model and changes promoter function through genome editing.

１つの態様において、本開示は、所望の活性を有するように改変されたプロモーター配列の取得方法に関する。一部の実施形態では、本開示に係る方法は、改変の対象となる元のプロモーター配列を用意すること、前記プロモーター配列に基づきゲノム編集技術により作成可能な複数の改変プロモーター配列のセットを生成すること、生成された前記改変プロモーター配列のセットに含まれる個々の改変プロモーター配列の活性を機械学習モデルによって予測すること、所望の活性を有すると予測された改変プロモーター配列を選択することを含む、方法でありうる。 In one aspect, the present disclosure relates to a method for obtaining a promoter sequence modified to have a desired activity. In some embodiments, the method of the present disclosure may include preparing an original promoter sequence to be modified, generating a set of multiple modified promoter sequences that can be created by genome editing technology based on the promoter sequence, predicting the activity of each modified promoter sequence included in the generated set of modified promoter sequences using a machine learning model, and selecting the modified promoter sequence predicted to have the desired activity.

プロモーターは遺伝子の（通常）上流に位置するDNA配列で、自身は転写されないが、遺伝子の転写量（発現レベル）や、多細胞生物では発現する部位・組織特異性を決定する機能を有する。本明細書では、特に遺伝子近傍の、基本転写因子と相互作用する領域を「コアプロモーター」と呼ぶ。コアプロモーターは、転写開始点の塩基の位置を＋1と表したとき、－200～＋50の位置として定義されうる。より好ましくは、－180～＋25の位置として定義されうる。さらに好ましくは、―170～＋17の位置として定義されうる。最も好ましくは、－165～＋5の位置がコアプロモ―ターとして定義されうる。コアプロモーターは基本的な転写量を決定する機能があると考えられる。 A promoter is a DNA sequence located (usually) upstream of a gene; it is not itself transcribed, but has the function of determining the amount of transcription (expression level) of the gene, and in multicellular organisms, the site and tissue specificity of expression. In this specification, the region that interacts with basic transcription factors, particularly in the vicinity of the gene, is called a "core promoter." A core promoter can be defined as a position between -200 and +50, when the base position of the transcription start point is represented as +1. More preferably, it can be defined as a position between -180 and +25. Even more preferably, it can be defined as a position between -170 and +17. Most preferably, a position between -165 and +5 can be defined as a core promoter. It is believed that a core promoter has the function of determining the basic amount of transcription.
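As an illustration of the coordinate convention above, the following Python sketch extracts the -165 to +5 window from a sequence. The function name and the 0-based `tss_index` argument are hypothetical and introduced only for this example. The sketch also assumes that the numbering has no position 0 (..., -2, -1, +1, +2, ...), which the text does not state explicitly; under that assumption the window is 170 bases long.

```python
def core_promoter(seq: str, tss_index: int,
                  upstream: int = 165, downstream: int = 5) -> str:
    """Return the window from -upstream to +downstream around the TSS.

    tss_index is the 0-based index of the base numbered +1 (hypothetical
    convention for this sketch). Position 0 is assumed not to exist, so
    -1 is the base immediately 5' of +1.
    """
    start = tss_index - upstream   # index of position -upstream
    end = tss_index + downstream   # exclusive end; +1..+downstream is `downstream` bases
    if start < 0 or end > len(seq):
        raise ValueError("sequence does not cover the requested window")
    return seq[start:end]
```

The other windows mentioned in the text (for example -200 to +50) can be obtained by changing `upstream` and `downstream`.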

一部の実施形態では、プロモーター配列は、植物細胞のプロモーター配列である。なお、動物と植物では、コアプロモ―ターに含まれるコアエレメントの種類が異なることが知られている。本明細書中において、「植物」は、特に制限されない。例えばコケ植物、シダ植物、裸子植物、被子植物のモクレン類、単子葉類、真正双子葉類（バラ類I、バラ類II、キク類I、キク類II及びそれらの外群）を含む広い範囲の植物を挙げることができる。植物のより具体的な例としては、トマト、ピーマン、トウガラシ、ナス、タバコ、トルバム等のナス類；キュウリ、カボチャ、メロン、スイカ等のウリ類；キャベツ、ブロッコリー、ハクサイ、ケール等の菜類；シソ、セルリー、パセリー、レタス等の生菜・香辛菜類；ネギ、タマネギ、ニンニク等のネギ類；イチゴ、メロン等のその他果菜類；ダイコン、カブ、ニンジン、ゴボウ等の直根類；サトイモ、キャッサバ、ジャガイモ（バレイショ）、サツマイモ、ナガイモ等のイモ類；イネ、トウモロコシ、コムギ、ソルガム、オオムギ、ライムギ、ミナトカモジグサ、ソバ等の穀類；ダイズ、アズキ、リョクトウ、ササゲ、インゲンマメ、ラッカセイ、エンドウ、ソラマメ等のマメ類；アスパラガス、ホウレンソウ、ミツバ等の柔菜類；トルコギキョウ、バラ、ストック、カーネーション、キク等の花卉類；ベントグラス、コウライシバ等の芝類；ナタネ、カメリナ、セイヨウアブラナ、ナンヨウアブラギリ（ジャトロファ）、ゴマ、エゴマ等の油料作物類；ワタ、イグサ、アサ等の繊維料作物類；クローバー、デントコーン、タルウマゴヤシ等の飼料作物類；リンゴ、ナシ、ブドウ、モモ等の落葉性果樹類；ウンシュウミカン、オレンジ、レモン、グレープフルーツ等の柑橘類；サツキ、ツツジ、スギ、ポプラ、パラゴムノキ等の木本類等が挙げられる。また、プロモーターは、部位特異的なものでも、非部位特異的なものでもあってもよい。部位特異的プロモーターは、例えば、葉や根において特異的な発現を制御するものでありうる。
一部の実施形態では、改変プロモーターに求められる所望の活性は、元のプロモーター配列よりも高い遺伝子発現誘導活性、または元のプロモーター配列よりも低い遺伝子発現誘導活性でありうる。元のプロモーター配列よりも高い遺伝子発現誘導活性は、例えば、元の活性の1.1倍から1000倍のいずれか、例えば、1.1倍、1.2倍、1.5倍、2倍、3倍、4倍、5倍、10倍、20倍、50倍、100倍、200倍、300倍、400倍、500倍、600倍、700倍、800倍、900倍、または1000倍でありうる。元のプロモーター配列よりも低い遺伝子発現誘導活性は、例えば、元の活性の90％から0.01％のいずれか、例えば、90％、80％、50％、10％、5％、1％、0.5％、0.1％、0.05％、0.04％、0.03％、0.02％、または0.01％でありうる。 In some embodiments, the promoter sequence is a promoter sequence of a plant cell. It is known that the types of core elements contained in core promoters differ between animals and plants. In this specification, "plant" is not particularly limited. Examples span a wide range of plants, including bryophytes, ferns, gymnosperms, and, among angiosperms, magnoliids, monocots, and eudicots (rosids I, rosids II, asterids I, asterids II, and their outgroups). More specific examples of plants include solanaceous plants such as tomato, bell pepper, chili pepper, eggplant, tobacco, and torvum; cucurbits such as cucumber, pumpkin, melon, and watermelon; leaf vegetables such as cabbage, broccoli, Chinese cabbage, and kale; salad and herb vegetables such as shiso, celery, parsley, and lettuce; alliums such as Welsh onion, onion, and garlic; other fruit vegetables such as strawberry and melon; taproot vegetables such as radish, turnip, carrot, and burdock; root and tuber crops such as taro, cassava, potato, sweet potato, and Chinese yam; cereals such as rice, maize, wheat, sorghum, barley, rye, Brachypodium distachyon, and buckwheat; legumes such as soybean, adzuki bean, mung bean, cowpea, common bean, peanut, pea, and broad bean; tender vegetables such as asparagus, spinach, and mitsuba; ornamental flowers such as lisianthus, rose, stock, carnation, and chrysanthemum; turfgrasses such as bentgrass and zoysiagrass; oil crops such as rapeseed, camelina, oilseed rape (Brassica napus), jatropha, sesame, and perilla; fiber crops such as cotton, rush, and hemp; forage crops such as clover, dent corn, and barrel medic (Medicago truncatula); deciduous fruit trees such as apple, pear, grape, and peach; citrus such as satsuma mandarin, orange, lemon, and grapefruit; and woody plants such as satsuki azalea, azalea, Japanese cedar, poplar, and Para rubber tree. The promoter may be site-specific or non-site-specific. A site-specific promoter may be, for example, one that controls expression specifically in leaves or roots.
In some embodiments, the desired activity of the modified promoter may be a higher gene expression induction activity than the original promoter sequence, or a lower gene expression induction activity than the original promoter sequence. The gene expression induction activity higher than the original promoter sequence may be, for example, 1.1 to 1000 times the original activity, for example, 1.1 times, 1.2 times, 1.5 times, 2 times, 3 times, 4 times, 5 times, 10 times, 20 times, 50 times, 100 times, 200 times, 300 times, 400 times, 500 times, 600 times, 700 times, 800 times, 900 times, or 1000 times. The gene expression induction activity lower than that of the original promoter sequence can be, for example, anywhere from 90% to 0.01% of the original activity, such as 90%, 80%, 50%, 10%, 5%, 1%, 0.5%, 0.1%, 0.05%, 0.04%, 0.03%, 0.02%, or 0.01%.

一部の実施形態では、改変プロモーターに求められる所望の活性は、元のプロモーター配列ではなく、他の基準となるプロモーター配列の活性との比較で決定されてもよい。 In some embodiments, the desired activity of the modified promoter may be determined by comparison with the activity of another reference promoter sequence, rather than the original promoter sequence.

あるいは、改変プロモーターに求められる所望の活性は、高発現、中発現、低発現といった層別化により決定されてもよい。一部の実施形態では、任意の数や範囲の層別化が行われうる。 Alternatively, the desired activity of the modified promoter may be determined by stratification, such as high expression, medium expression, and low expression. In some embodiments, any number or range of stratifications may be performed.

一部の実施形態では、プロモーター配列の改変は、配列の一部の欠失、置換、または挿入によるものでありうる。一部の実施形態では、プロモーター配列の改変は、プロモーター配列中の２カ所の切断による配列の一部の欠失でありうる。 In some embodiments, the modification of the promoter sequence may be a deletion, substitution, or insertion of part of the sequence. In some embodiments, the modification of the promoter sequence may be a deletion of part of the sequence produced by cleavage at two sites in the promoter sequence.

一部の実施形態では、改変の対象となる元のプロモーター配列は、GenBankなどのデータベースから取得することにより用意されうる。プロモーター配列としては、例えば、標的遺伝子の転写開始点周辺の－9900～＋100、好ましくは－4950～＋50または－2975～＋25、より好ましくは－1995～＋5のDNA領域内の、転写開始点を含む任意のDNA領域が挙げられる。あるいは、標的遺伝子の転写開始点から、隣接する遺伝子の転写終結点までの間の任意のDNA領域が挙げられる。つまり、プロモーター配列としては、コアプロモーターとその上流の配列を含む領域が用いられうる。コアプロモーター上流の配列の長さは、例えば、少なくとも200bp、400bp、600bp、800bp、1000bp、1200bp、1400bp、1600bp、1800bp、2000bp、3000bp、4000bp、5000bp、6000bp、7000bp、8000bp、または9000bpでありうる。なお、転写開始点が複数存在する場合は、その複数の転写開始点を選択することができる。また、例えば、転写開始点周辺の－1995～＋5に含まれる100～1800塩基の任意の範囲が使用されてもよい。プロモーターの長さ（塩基対数：bp）としては、例えば、少なくとも200bp、400bp、600bp、800bp、1000bp、1200bp、1500bp、2000bp、3000bp、4000bp、5000bp、6000bp、7000bp、8000bp、または9000bp、範囲としては、100～10000bp、好ましくは200～5000bp、より好ましくは500～3000bpの範囲の長さが挙げられうる。 In some embodiments, the original promoter sequence to be modified may be prepared by obtaining it from a database such as GenBank. Examples of promoter sequences include any DNA region including the transcription start point within a DNA region of -9900 to +100, preferably -4950 to +50 or -2975 to +25, more preferably -1995 to +5 around the transcription start point of the target gene. Alternatively, any DNA region between the transcription start point of the target gene and the transcription end point of the adjacent gene may be used. In other words, a region including the core promoter and its upstream sequence may be used as the promoter sequence. The length of the sequence upstream of the core promoter may be, for example, at least 200bp, 400bp, 600bp, 800bp, 1000bp, 1200bp, 1400bp, 1600bp, 1800bp, 2000bp, 3000bp, 4000bp, 5000bp, 6000bp, 7000bp, 8000bp, or 9000bp. If there are multiple transcription start points, the multiple transcription start points can be selected. In addition, any range of 100 to 1800 bases included in -1995 to +5 around the transcription start point may be used. 
The length of the promoter (number of base pairs: bp) can be, for example, at least 200 bp, 400 bp, 600 bp, 800 bp, 1000 bp, 1200 bp, 1500 bp, 2000 bp, 3000 bp, 4000 bp, 5000 bp, 6000 bp, 7000 bp, 8000 bp, or 9000 bp, and the range can be 100 to 10000 bp, preferably 200 to 5000 bp, and more preferably 500 to 3000 bp.

一部の実施形態では、所望の活性を有する改変プロモーター配列を選択するために、プロモーター配列に基づきゲノム編集技術により作成可能な複数の改変プロモーター配列のセットが生成される。 In some embodiments, a set of multiple modified promoter sequences is generated that can be created using genome editing techniques based on the promoter sequence to select a modified promoter sequence with a desired activity.

ゲノム編集技術としては、ZFNsやTALEN、CRISPR-Casシステムが使用されうる。CRISPR-Casシステムは、Class1 Type IのCRISPR-Cas3やClass2 Type IIのCas9、Class2 Type VのCas12a、Class2 Type VのCas12f (Cas14a)、Class2 Type VIのCas13aなどが挙げられ、その機能メカニズムや分類は問わない。また、Casタンパク質は様々な細菌に由来するものが使用されうる。例えば、Cas9ではStreptococcus pyogenesに由来するSpCas9、Staphylococcus aureusに由来するSaCas9、Francisella novicidaに由来するFnCas9、Campylobacter jejuniに由来するCjCas9、Cas12aではAcidaminococcus sp.に由来するAsCas12a（AsCpf1）、Lachnospiraceae bacteriumに由来するLbCas12a(LbCpf1)、Eubacterium rectaleに由来するErCas12aなどが挙げられ、その由来を限定しない。また、Casタンパク質をコードする塩基配列やCasタンパク質のアミノ酸配列を改変したものや、他のタンパク質または機能ドメインまたはペプチドまたはアミノ酸配列を融合したもの、化合物で修飾したものも使用されうる。 Genome editing techniques that can be used include ZFNs, TALEN, and CRISPR-Cas systems. Examples of CRISPR-Cas systems include CRISPR-Cas3 (Class 1 Type I), Cas9 (Class 2 Type II), Cas12a (Class 2 Type V), Cas12f (Cas14a) (Class 2 Type V), and Cas13a (Class 2 Type VI), regardless of their functional mechanism or classification. Cas proteins derived from various bacteria can also be used. For example, Cas9 includes SpCas9 derived from Streptococcus pyogenes, SaCas9 derived from Staphylococcus aureus, FnCas9 derived from Francisella novicida, CjCas9 derived from Campylobacter jejuni, and Cas12a includes AsCas12a (AsCpf1) derived from Acidaminococcus sp., LbCas12a (LbCpf1) derived from Lachnospiraceae bacterium, and ErCas12a derived from Eubacterium rectale, and the origin is not limited. In addition, modified nucleotide sequences encoding Cas proteins or amino acid sequences of Cas proteins, fused with other proteins, functional domains, peptides, or amino acid sequences, or modified with compounds may also be used.

CRISPR-Cas系では、標的配列の切断のために、CasヌクレアーゼとガイドRNA（gRNA）が使用される。ガイドRNAはcrRNAとtracrRNAの２つの要素から成り、これらは個別のRNAとして存在しても、連結されて一本鎖のRNAとして存在してもよい。一本鎖のガイドRNAはsgRNAとも呼ばれる。ガイドRNAの5’末端、3’末端には1塩基から10塩基、10塩基から50塩基、50塩基から100塩基、100塩基から500塩基の塩基配列が付加される場合もあり得る。ガイドRNAは特定のDNA配列と相補的に結合し、Casヌクレアーゼをゲノムの特定の位置に導く。これにより、Casヌクレアーゼは、特定のDNA配列を切断し、ゲノム編集を可能とする。例えば、CRISPRを使用する場合は、プロモーター配列中のPAM配列を検索して、切断可能な位置の一覧をリスト化する。PAM配列は使用するCasタンパク質により、例えば、SpCas9の場合はNGG、SaCas9の場合はNGRRTもしくはNGNRRN、NmeCas9の場合はNNNNGATT、CjCas9の場合はNNNNRYAC、LbCas12a(Cpf1)の場合はTTTV、AsCas12a(Cpf1)の場合はTTTV、AacCas12a(Cpf1)の場合はTTN、BhCas12b v4の場合はATTNもしくはTTTNもしくはGTTNなどが挙げられるが、Casタンパク質が認識できる配列であればこれらのPAM配列に1塩基、2塩基、3塩基、4塩基、5塩基、6塩基、7塩基、8塩基のミスマッチが含まれる場合もありうる。PAM配列の存在は、CRISPR系のゲノム編集の精度と特異性を高める上で重要である。PAM配列が無い場合、CasヌクレアーゼはガイドRNAが指定するDNA配列に結合できない。この特性により、特定の遺伝子領域を正確にゲノム編集の標的とすることが可能になる。また、上記のように、異なる種類のCasヌクレアーゼを使用することで、異なるPAM配列を有する領域を標的化することもできる。例えば典型的には、Cas9ならばPAM配列から3～4塩基の位置、Cas12aならばPAM配列から18～23塩基の位置がそれぞれ切断位置となる。当業者は標的ゲノム中のPAM配列にもとづき、適切なガイドRNAを設計することができる。切断位置は典型的には、2000塩基の中では50か所程度発見される。例えば、n箇所の切断位置のうち、2箇所を指定する組み合わせはnC₂通りある。つまり、50か所の切断位置について、1225通りの組み合わせがありえる。よって、一部の実施形態では、所望の活性を有する改変プロモーター配列を選択するために、改変の対象とされるプロモーター配列中のPAM配列のnC₂通りの組み合わせについて、2か所の切断位置の間を切り詰めた（削除した）塩基配列のセットが生成されうる。このように、一部の実施形態では、複数の改変プロモーター配列のセットが、2つのPAM認識配列に基づき設計されるガイドRNA配列の組合せが誘導する切断により生じる配列欠失により生成される。 In the CRISPR-Cas system, a Cas nuclease and a guide RNA (gRNA) are used to cleave the target sequence. The guide RNA consists of two elements, crRNA and tracrRNA, which may exist as separate RNAs or be linked into a single RNA strand. A guide RNA linked into a single strand is also called an sgRNA. A base sequence of 1 to 10 bases, 10 to 50 bases, 50 to 100 bases, or 100 to 500 bases may be added to the 5' end or 3' end of the guide RNA. The guide RNA binds complementarily to a specific DNA sequence and guides the Cas nuclease to a specific position in the genome. This allows the Cas nuclease to cleave the specific DNA sequence, enabling genome editing.
For example, when using CRISPR, the promoter sequence is searched for PAM sequences and the cleavable positions are compiled into a list. The PAM sequence depends on the Cas protein used; examples include NGG for SpCas9, NGRRT or NGNRRN for SaCas9, NNNNGATT for NmeCas9, NNNNRYAC for CjCas9, TTTV for LbCas12a(Cpf1), TTTV for AsCas12a(Cpf1), TTN for AacCas12a(Cpf1), and ATTN, TTTN, or GTTN for BhCas12b v4. These PAM sequences may contain mismatches of 1, 2, 3, 4, 5, 6, 7, or 8 bases as long as the Cas protein can still recognize the sequence. The presence of the PAM sequence is important for increasing the accuracy and specificity of CRISPR-based genome editing. Without a PAM sequence, the Cas nuclease cannot bind to the DNA sequence specified by the guide RNA. This property makes it possible to target a specific gene region precisely for genome editing. Also, as described above, regions with different PAM sequences can be targeted by using different types of Cas nucleases. Typically, the cleavage position is 3 to 4 bases from the PAM sequence for Cas9 and 18 to 23 bases from the PAM sequence for Cas12a. A person skilled in the art can design an appropriate guide RNA based on the PAM sequences in the target genome. Typically, about 50 cleavage positions are found in 2000 bases. For n cleavage positions, there are nC₂ = n(n-1)/2 ways to choose two of them; with 50 cleavage positions, this gives 1225 combinations. Thus, in some embodiments, in order to select a modified promoter sequence having a desired activity, a set of base sequences can be generated in which, for each of the nC₂ combinations of PAM sequences in the promoter sequence to be modified, the region between the two cleavage positions is removed (deleted). Thus, in some embodiments, the set of multiple modified promoter sequences is generated by sequence deletions resulting from cleavage induced by combinations of guide RNA sequences designed based on two PAM recognition sequences.
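The enumeration described above (find PAM sequences, list the cut positions, then delete the region between every pair of cuts) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the inventors' program: it handles only SpCas9 with an exact NGG PAM, assumes a 20-base protospacer and a blunt cut 3 bases from the PAM, and assumes that the two breaks are rejoined precisely with no additional insertions or deletions. The function names are hypothetical.

```python
from itertools import combinations
from math import comb


def spcas9_cut_sites(seq: str, spacer_len: int = 20) -> list[int]:
    """Return sorted, de-duplicated SpCas9 cut indices for NGG PAMs on both strands.

    A cut index c means the break falls between seq[c - 1] and seq[c].
    Assumptions of this sketch: exact NGG PAM, cut 3 bases from the PAM,
    and the full protospacer must lie inside seq.
    """
    seq = seq.upper()
    cuts = set()
    for i in range(len(seq) - 2):
        # Forward strand: N-G-G at i..i+2, protospacer lies 5' of the PAM.
        if seq[i + 1:i + 3] == "GG" and i >= spacer_len:
            cuts.add(i - 3)
        # Reverse strand: NGG appears as C-C-N on the forward strand,
        # and the protospacer lies 3' of it on the forward strand.
        if seq[i:i + 2] == "CC" and i + 3 + spacer_len <= len(seq):
            cuts.add(i + 6)
    return sorted(cuts)


def deletion_variants(seq: str, cuts: list[int]):
    """Yield (left_cut, right_cut, edited_sequence) for every pair of cut sites."""
    for left, right in combinations(cuts, 2):
        yield left, right, seq[:left] + seq[right:]


# Number of two-cut combinations for 50 cut sites.
print(comb(50, 2))
```

Other nucleases would need their own PAM pattern and cut offset (for example, the 18 to 23 base offset given above for Cas12a), which this sketch does not implement.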

また、他のプロモーター編集方法として、Cas9ニッカーゼによって生じさせた2箇所のニック間を削除する方法、ゲノム上の狙った塩基を別の塩基へと置換するbase editingを用いた方法、1から20塩基または20から50塩基または50から100塩基からなる任意の塩基配列をゲノム中に欠失、挿入、置換することが可能なprime editorを用いた方法も使用されうる。 Other promoter editing methods that can be used include deleting the space between two nicks created by Cas9 nickase, using base editing to replace a targeted base in the genome with another base, and using a prime editor that can delete, insert, or replace any base sequence consisting of 1 to 20 bases, 20 to 50 bases, or 50 to 100 bases in the genome.

改変プロモーター配列のセットに含まれる異なる配列の数は、少なくとも5、10、20、50、100、150、200、300、500、700、1000、1200、1500、2000、3000、4000、または5000でありうる。活性に影響を及ぼす配列の十分な探索のためには、改変プロモーター配列のセットに含まれる異なる配列の数は、1000以上であることが好ましい。また、改変プロモーター配列のセットに含まれる異なる配列の数を十分に確保するためには、元のプロモーター配列の長さは、例えば1800bp以上であることが好ましい。 The number of different sequences included in the set of modified promoter sequences may be at least 5, 10, 20, 50, 100, 150, 200, 300, 500, 700, 1000, 1200, 1500, 2000, 3000, 4000, or 5000. To fully explore sequences that affect activity, it is preferable that the number of different sequences included in the set of modified promoter sequences is 1000 or more. In addition, to ensure a sufficient number of different sequences included in the set of modified promoter sequences, it is preferable that the length of the original promoter sequence is, for example, 1800 bp or more.

複数の改変プロモーター配列のセットが生成された後、一部の実施形態では、生成された改変プロモーター配列のセットに含まれる個々の改変プロモーター配列の活性が、機械学習モデルによって予測される。例えば、生成されたそれぞれの塩基配列について、3'末端側170塩基を取得して、機械学習モデルに入力する。一部の実施形態では、3'末端側100～250塩基を取得して、機械学習モデルに入力してもよい。また、一部の実施形態では、3'末端側170塩基に含まれる一部の配列、例えば、100～169塩基が入力されてもよい。適切に訓練された機械学習モデルは、与えられた塩基配列のコアプロモーターとしての強さ（転写活性）を出力することができる。 After a set of multiple modified promoter sequences is generated, in some embodiments, the activity of each modified promoter sequence included in the set of modified promoter sequences generated is predicted by a machine learning model. For example, for each generated base sequence, 170 bases on the 3' end are obtained and input to the machine learning model. In some embodiments, 100 to 250 bases on the 3' end may be obtained and input to the machine learning model. In some embodiments, a portion of the sequence included in the 170 bases on the 3' end, for example, 100 to 169 bases, may be input. A properly trained machine learning model can output the strength (transcriptional activity) of a given base sequence as a core promoter.
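The input preparation described above can be sketched as follows: take the 170 bases at the 3' end of each edited sequence and convert them to overlapping k-mer tokens, which is the input format DNABERT uses. The function names are hypothetical, k = 6 is an assumed choice (the text does not state the k used), and the sketch assumes the edited sequence ends at the 3' boundary of the core promoter.

```python
def model_input_window(edited_seq: str, length: int = 170) -> str:
    """Return the `length` bases at the 3' end of an edited promoter sequence."""
    if len(edited_seq) < length:
        raise ValueError("edited sequence is shorter than the input window")
    return edited_seq[-length:]


def to_kmers(seq: str, k: int = 6) -> str:
    """Convert a DNA string into space-separated overlapping k-mers."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))
```

A deletion whose right-hand cut lies upstream of this window changes which upstream bases are brought into the last 170 positions, which is how different deletions can yield different predicted activities.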

一部の実施形態では、生成された改変プロモーター配列のセットに含まれる個々の改変プロモーター配列の活性がコンピューターディスプレイ上でビジュアライズされる。ビジュアライズは、例えば、図８または図９に示されるようにして行われうる。 In some embodiments, the activity of each modified promoter sequence in the generated set of modified promoter sequences is visualized on a computer display. Visualization can be performed, for example, as shown in FIG. 8 or FIG. 9.

そして、機械学習モデルの予測結果から、所望の活性を有すると予測された改変プロモーター配列を選択することにより、所望の活性を有するように改変されたプロモーター配列を取得することができる。 Then, by selecting a modified promoter sequence predicted to have the desired activity from the prediction results of the machine learning model, a promoter sequence modified to have the desired activity can be obtained.

一部の実施形態では、機械学習モデルは、植物細胞における複数のプロモーター配列の遺伝子発現誘導活性データを教師データとして、プロモーター配列から遺伝子発現誘導活性を予測するように訓練された回帰モデルでありうる。一部の実施形態では、Joresら（2021）のデータが学習に使用されうる（Jores, T., Tonnies, J., Wrightsman, T. et al. Synthetic promoter designs enabled by a comprehensive analysis of plant core promoters. Nat. Plants 7, 842-855 (2021). https://doi.org/10.1038/s41477-021-00932-y）。よって、本開示に係る１つの態様は、プロモーター配列から植物細胞における遺伝子発現誘導活性を予測する機械学習モデルの生成方法であって、植物細胞における複数のプロモーター配列の遺伝子発現誘導活性データを教師データとしてモデルを訓練することを含む、方法に関する。一部の実施形態では、訓練の対象となるモデルとしては、トランスフォーマーベースの事前訓練された基礎モデルが用いられうる。 In some embodiments, the machine learning model may be a regression model trained to predict gene expression induction activity from a promoter sequence using gene expression induction activity data of multiple promoter sequences in plant cells as teacher data. In some embodiments, data from Jores et al. (2021) may be used for learning (Jores, T., Tonnies, J., Wrightsman, T. et al. Synthetic promoter designs enabled by a comprehensive analysis of plant core promoters. Nat. Plants 7, 842-855 (2021). https://doi.org/10.1038/s41477-021-00932-y). Thus, one aspect of the present disclosure relates to a method for generating a machine learning model that predicts gene expression induction activity in plant cells from a promoter sequence, comprising training the model using gene expression induction activity data of multiple promoter sequences in plant cells as teacher data. In some embodiments, a transformer-based pre-trained basic model may be used as the model to be trained.

一部の実施形態では、機械学習モデルの構築において、トランスフォーマーなどの深層学習モデル、特にBERTをベースにしたモデルが使用されうる。BERT（Bidirectional Encoder Representations from Transformers）は、自然言語処理（NLP）において広く使われている機械学習モデルの一つである。Google社によって2018年に開発され、テキストの理解と生成において顕著な性能が示された。BERTの主な特徴としては、１）双方向の文脈理解、２）Transformerアーキテクチャの利用、３）事前学習とファインチューニング、４）多様な応用可能性などがある。まず、BERTは「双方向」モデルであり、与えられたテキスト内の単語を、左右両方の文脈で理解する。従来のモデルが一方向（左から右、またはその逆）でしか文脈を考慮しなかったのに対し、BERTはテキスト全体を包括的に理解することができる。また、BERTはTransformerと呼ばれるニューラルネットワークアーキテクチャを基に構築されている。Transformerは「アテンション機構」を使用して、入力されたテキスト内の各単語間の関係を捉える。これにより、より複雑で洗練されたテキスト理解が可能になる。さらに、BERTは大量のテキストデータで「事前学習」されており、一般的な言語の理解を身につけている。その後、特定のタスク（例えば感情分析や質問応答）に対して「ファインチューニング」（fine-tuning）を行うことで、特定の用途に合わせて最適化することができる。BERTは様々なNLPタスクに適用可能であり、例えば、テキスト分類、質問応答、感情分析、機械翻訳など、幅広い分野で利用されている。 In some embodiments, deep learning models such as Transformers, particularly models based on BERT, may be used to build the machine learning model. BERT (Bidirectional Encoder Representations from Transformers) is one of the machine learning models widely used in natural language processing (NLP). It was developed by Google in 2018 and has shown remarkable performance in understanding and generating text. The main features of BERT include 1) bidirectional context understanding, 2) use of Transformer architecture, 3) pre-training and fine-tuning, and 4) diverse application possibilities. First, BERT is a "bidirectional" model that understands words in a given text in both left and right context. While conventional models only consider context in one direction (from left to right or vice versa), BERT can comprehensively understand the entire text. In addition, BERT is built on a neural network architecture called Transformer. Transformer uses an "attention mechanism" to capture the relationship between each word in the input text. This enables more complex and sophisticated text understanding. In addition, BERT is "pre-trained" on a large amount of text data to acquire a general understanding of language. 
It can then be "fine-tuned" for a specific task (e.g., sentiment analysis or question answering) to optimize it for a specific use. BERT is applicable to a wide range of NLP tasks, and is used in a wide range of fields, including text classification, question answering, sentiment analysis, and machine translation.

BERTモデルのトレーニングには主に２つのタスクが用いられる。MLMタスクは、入力テキストからランダムに単語を「マスク」し、BERTにそのマスクされた単語を予測させるタスクである。このタスクの主な目的は、BERTに文脈を利用して単語の意味を理解させることである。双方向の文脈を考慮するため、モデルは文全体の情報を活用してマスクされた単語を予測する。NSPタスクは、BERTに２つの文が連続しているかどうかを判断させるものである。モデルには、ある文（A）ともう一つの文（B）が与えられ、BがAの直後に来る文かどうかを予測させる。この際、半分の確率でBはAに続く実際の文で、残り半分はランダムに選ばれた関連のない文である。NSPタスクの目的は、BERTに文章間の関係を理解させることである。この能力は特に、文章の繋がりや意味の流れを理解することが重要なタスク、例えば質問応答や文章の要約などにおいて有用となる。これらのタスクにより、BERTは単語レベルだけでなく、文全体や複数の文の関係を理解する能力を養う。この結果、BERTはさまざまな自然言語処理タスクにおいて高いパフォーマンスを発揮することができるようになる。 The BERT model is trained on two main tasks. The MLM task is a task that randomly "masks" words from the input text and asks BERT to predict the masked words. The main goal of this task is to have BERT use context to understand the meaning of words. To take into account bidirectional context, the model uses information from the entire sentence to predict the masked words. The NSP task asks BERT to determine whether two sentences are consecutive. The model is given a sentence (A) and another sentence (B) and is asked to predict whether B comes immediately after A. Half the time, B is the actual sentence that follows A, and the other half is a randomly chosen unrelated sentence. The goal of the NSP task is to teach BERT to understand the relationships between sentences. This ability is especially useful in tasks where understanding the connections and semantic flow of sentences is important, such as question answering and text summarization. These tasks train BERT to understand not only at the word level, but also the relationships between whole sentences and multiple sentences. As a result, BERT is able to perform well in a variety of natural language processing tasks.

一部の実施形態では、機械学習には、トランスフォーマーベースの事前訓練された基礎モデルであるDNABERTが利用されうる（Yanrong Ji, Zhihan Zhou, Han Liu, Ramana V Davuluri, DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome, Bioinformatics, Volume 37, Issue 15, August 2021, Pages 2112-2120）。DNABERTは、上流および下流の塩基配列のコンテキストに基づいて、ゲノムDNA配列のグローバルな理解を捉えることのできる、事前訓練された双方向エンコーダー表現である。DNABERTでは、BERTで用いられていたNSPタスクは行われず、MLMタスクのみでの学習が行われている。DNA配列において一定の割合をマスクし、マスク部位のk-merトークンが予測される。DNABERTの事前学習に用いられた訓練データはヒトゲノムからサンプリングされたDNA配列である。本発明者らは、DNABERTのような事前訓練された基礎モデルをファインチューニングすることで、プロモーター配列から植物細胞における遺伝子発現誘導活性を予測する機械学習モデルを生成できることを実証した。なお、本開示に係る機械学習モデルは、二値分類問題だけでなく、回帰問題も扱うことができ、その結果として、遺伝子発現誘導活性の高低を予測することができる。 In some embodiments, the machine learning may utilize DNABERT, a transformer-based pre-trained base model (Yanrong Ji, Zhihan Zhou, Han Liu, Ramana V Davuluri, DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome, Bioinformatics, Volume 37, Issue 15, August 2021, Pages 2112-2120). DNABERT is a pre-trained bidirectional encoder representation that can capture a global understanding of genomic DNA sequences based on the context of upstream and downstream base sequences. DNABERT does not perform the NSP task used in BERT, but only trains on the MLM task. A certain percentage of the DNA sequence is masked, and the k-mer token of the masked site is predicted. The training data used for pre-training DNABERT is DNA sequences sampled from the human genome. The inventors have demonstrated that by fine-tuning a pre-trained basic model such as DNABERT, it is possible to generate a machine learning model that predicts gene expression induction activity in plant cells from a promoter sequence. The machine learning model disclosed herein can handle not only binary classification problems but also regression problems, and as a result, can predict the level of gene expression induction activity.
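The k-mer tokenization and masking described above can be illustrated with a minimal sketch. It is not DNABERT's actual code: the function names, the value k = 6, the 15% mask rate, and the choice of independent random mask positions are all assumptions made for illustration (the text only states that "a certain percentage" is masked, and the real masking scheme may differ).

```python
import random


def kmer_tokens(seq: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mask_tokens(tokens: list[str], mask_rate: float = 0.15, seed: int = 0):
    """Replace a fraction of tokens with "[MASK]".

    Returns the masked token list and a dict mapping each masked
    position to the original token the model would have to predict.
    mask_rate is a hypothetical value, not taken from the source.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]
        masked[p] = "[MASK]"
    return masked, targets
```

For example, `kmer_tokens("ACGTAC", k=3)` yields the four overlapping 3-mers of the sequence, and `mask_tokens` hides some of them so that a model can be trained to recover the hidden k-mers from the surrounding context.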

本開示の１つの態様は、所望の遺伝子の発現量を調節するための細胞のゲノム編集方法に関する。一部の実施形態では、本方法は、本開示に係る所望の活性を有するように改変されたプロモーター配列の取得方法により所望の活性を有する改変プロモーター配列を取得すること、ゲノム編集の対象となる細胞を用意すること、前記改変プロモーター配列を生じるように前記細胞のゲノムを編集することを含む、方法でありうる。 One aspect of the present disclosure relates to a method for genome editing of a cell to regulate the expression level of a desired gene. In some embodiments, the method may include obtaining a modified promoter sequence having a desired activity by a method for obtaining a promoter sequence modified to have a desired activity according to the present disclosure, preparing a cell to be subjected to genome editing, and editing the genome of the cell to produce the modified promoter sequence.

一部の実施形態では、細胞は植物の細胞、例えば、コケ植物、シダ植物、裸子植物、被子植物のモクレン類、単子葉類、真正双子葉類（バラ類I、バラ類II、キク類I、キク類II及びそれらの外群）を含む広い範囲の植物を挙げることができる。植物のより具体的な例としては、トマト、ピーマン、トウガラシ、ナス、タバコ、トルバム等のナス類；キュウリ、カボチャ、メロン、スイカ等のウリ類；キャベツ、ブロッコリー、ハクサイ、ケール等の菜類；シソ、セルリー、パセリー、レタス等の生菜・香辛菜類；ネギ、タマネギ、ニンニク等のネギ類；イチゴ、メロン等のその他果菜類；ダイコン、カブ、ニンジン、ゴボウ等の直根類；サトイモ、キャッサバ、ジャガイモ（バレイショ）、サツマイモ、ナガイモ等のイモ類；イネ、トウモロコシ、コムギ、ソルガム、オオムギ、ライムギ、ミナトカモジグサ、ソバ等の穀類；ダイズ、アズキ、リョクトウ、ササゲ、インゲンマメ、ラッカセイ、エンドウ、ソラマメ等のマメ類；アスパラガス、ホウレンソウ、ミツバ等の柔菜類；トルコギキョウ、バラ、ストック、カーネーション、キク等の花卉類；ベントグラス、コウライシバ等の芝類；ナタネ、カメリナ、セイヨウアブラナ、ナンヨウアブラギリ（ジャトロファ）、ゴマ、エゴマ等の油料作物類；ワタ、イグサ、アサ等の繊維料作物類；クローバー、デントコーン、タルウマゴヤシ等の飼料作物類；リンゴ、ナシ、ブドウ、モモ等の落葉性果樹類；ウンシュウミカン、オレンジ、レモン、グレープフルーツ等の柑橘類；サツキ、ツツジ、スギ、ポプラ、パラゴムノキ等の木本類等が挙げられる。 In some embodiments, the cells are plant cells, and the plants can span a wide range including bryophytes, ferns, gymnosperms, and angiosperms such as magnoliids, monocots, and eudicots (rosids I, rosids II, asterids I, asterids II, and their outgroups). More specific examples of plants include solanaceous plants such as tomato, bell pepper, chili pepper, eggplant, tobacco, and Solanum torvum; cucurbits such as cucumber, pumpkin, melon, and watermelon; leaf vegetables such as cabbage, broccoli, Chinese cabbage, and kale; salad and herb vegetables such as shiso, celery, parsley, and lettuce; alliums such as Welsh onion, onion, and garlic; other fruit vegetables such as strawberry and melon; taproot vegetables such as daikon radish, turnip, carrot, and burdock; tuber and root crops such as taro, cassava, potato, sweet potato, and Chinese yam; cereals such as rice, maize, wheat, sorghum, barley, rye, Brachypodium distachyon, and buckwheat; legumes such as soybean, adzuki bean, mung bean, cowpea, common bean, peanut, pea, and broad bean; soft vegetables such as asparagus, spinach, and mitsuba; flowering plants such as lisianthus, rose, stock, carnation, and chrysanthemum; turfgrasses such as bentgrass and Korean lawngrass; oil crops such as rapeseed, camelina, oilseed rape (Brassica napus), jatropha, sesame, and perilla; fiber crops such as cotton, rush, and hemp; forage crops such as clover, dent corn, and barrel medic (Medicago truncatula); deciduous fruit trees such as apple, pear, grape, and peach; citrus such as satsuma mandarin, orange, lemon, and grapefruit; and woody plants such as satsuki azalea, azalea, Japanese cedar, poplar, and Pará rubber tree.

本開示の１つの態様は、本開示に係る方法により取得された配列を有する、プロモーター活性を有するポリヌクレオチドに関する。そのようなポリヌクレオチドは、例えば、配列番号２または４の配列を有するものでありうる。 One aspect of the present disclosure relates to a polynucleotide having promoter activity, the polynucleotide having a sequence obtained by the method of the present disclosure. Such a polynucleotide may have, for example, the sequence of SEQ ID NO: 2 or 4.

本開示の１つの態様は、所望の遺伝子の発現量が調節されたゲノム編集植物の製造方法に関する。一部の実施形態では、本方法は、本開示に係る所望の遺伝子の発現量を調節するための細胞のゲノム編集方法により所望の植物細胞のゲノムを編集すること、ゲノム編集された細胞に由来する植物個体を得ることを含む、方法でありうる。さらに、本開示の１つの態様は、本開示に係る製造方法により製造されたゲノム編集植物に関する。 One aspect of the present disclosure relates to a method for producing a genome-edited plant in which the expression level of a desired gene is regulated. In some embodiments, the method may include editing the genome of a desired plant cell by a cellular genome editing method for regulating the expression level of a desired gene according to the present disclosure, and obtaining an individual plant derived from the genome-edited cell. Furthermore, one aspect of the present disclosure relates to a genome-edited plant produced by the production method according to the present disclosure.

ゲノム編集された細胞を得る際には、植物の組織や細胞にゲノム編集酵素をDNA、RNP、タンパク質のいずれかの態様で導入しうる。植物におけるゲノム編集酵素の導入部位としては、例えば花（卵細胞、花粉、花弁等）、茎（形成層、髄、皮層等）、葉（葉原基を含む）、根、茎頂、側芽、花芽、根端、プロトプラスト等が挙げられる。 When obtaining genome-edited cells, the genome editing enzyme can be introduced into plant tissues or cells in the form of DNA, RNP, or protein. Examples of sites where the genome editing enzyme can be introduced in plants include flowers (egg cells, pollen, petals, etc.), stems (cambium, pith, cortex, etc.), leaves (including leaf primordia), roots, shoot tips, lateral buds, flower buds, root tips, protoplasts, etc.

導入方法は、特に制限されず、導入する植物種や導入対象細胞/組織に応じて、適宜選択することができる。導入方法としては、例えば、アグロバクテリウム法、パーティクル・ガン法、ウィスカー法、ナノピペット法、ウイルス媒介性核酸送達等が挙げられる。 The introduction method is not particularly limited and can be selected appropriately depending on the plant species and the target cells/tissues. Examples of introduction methods include the Agrobacterium method, the particle gun method, the whisker method, the nanopipette method, and virus-mediated nucleic acid delivery.

一部の実施形態では、ゲノム編集された細胞に由来する植物個体を得る際には、例えば、ゲノム編集された細胞、あるいは編集された細胞を含む組織を適切な培養条件下で培養する工程（a）、細胞を増殖させてカルス（未分化の細胞塊）を形成させる工程（b）、カルスからシュート（新芽）を誘導する工程（c1）、編集された細胞を含む組織から直接的にシュートを形成させる工程（c2）、シュートに根を誘導して、植物体を再生させる工程（d）、再生した植物体から次世代の種子を得る工程(e1)、あるいは再生した植物体の一部を挿し木して増殖させる工程（e2）、などの工程を含みうる。 In some embodiments, obtaining an individual plant derived from a genome-edited cell may include, for example, steps such as (a) culturing the genome-edited cell or tissue containing the edited cell under appropriate culture conditions, (b) proliferating the cells to form a callus (an undifferentiated cell mass), (c1) inducing a shoot (a sprout) from the callus, (c2) directly forming a shoot from the tissue containing the edited cell, (d) inducing roots in the shoot to regenerate a plant, (e1) obtaining next-generation seeds from the regenerated plant, or (e2) propagating a part of the regenerated plant by cutting.

本開示の１つの態様は、所望の活性を有するように改変されたプロモーター配列を予測する情報処理装置に関する（図１１参照）。 One aspect of the present disclosure relates to an information processing device that predicts promoter sequences that have been modified to have a desired activity (see FIG. 11).

一部の実施形態において、本装置010は、改変の対象となる元のプロモーター配列の入力を受け付ける配列入力部011、前記プロモーター配列に基づきゲノム編集技術により作成可能な複数の改変プロモーター配列のセットを生成する改変配列生成部012、生成された前記改変プロモーター配列のセットに含まれる個々の改変プロモーター配列の活性を機械学習モデルによって予測する活性予測部013、所望の活性を有すると予測された改変プロモーター配列を選択する配列選択部014を含む、情報処理装置でありうる。配列入力部011は、例えば、外部ネットワークとの通信機器やキーボードなどの入力デバイスと接続されていてもよい。配列選択部014は、例えば、ディスプレイやプリンタなどの出力機器、外部ネットワークとの通信機器などと接続されていてもよい。当業者は、本装置が本開示に係る配列取得方法を実施するために必要な構成を、本明細書の開示に照らし、理解することができるであろう。 In some embodiments, the present device 010 may be an information processing device including a sequence input unit 011 that accepts input of an original promoter sequence to be modified, a modified sequence generation unit 012 that generates a set of multiple modified promoter sequences that can be created by genome editing technology based on the promoter sequence, an activity prediction unit 013 that predicts the activity of each modified promoter sequence included in the generated set of modified promoter sequences using a machine learning model, and a sequence selection unit 014 that selects a modified promoter sequence predicted to have a desired activity. The sequence input unit 011 may be connected to an input device such as a communication device with an external network or a keyboard. The sequence selection unit 014 may be connected to an output device such as a display or a printer, a communication device with an external network, or the like. A person skilled in the art would be able to understand the configuration required for the present device to carry out the sequence acquisition method according to the present disclosure in light of the disclosure of this specification.

本開示の１つの態様は、命令またはプログラムが格納された非一時的なコンピューター可読媒体に関する。 One aspect of the present disclosure relates to a non-transitory computer-readable medium having instructions or programs stored thereon.

一部の実施形態において、本コンピューター可読媒体は、命令またはプログラムがプロセッサーによって実行されると、以下のステップ：改変の対象となる元のプロモーター配列の入力を受け付けるステップS100、前記プロモーター配列に基づきゲノム編集技術により作成可能な複数の改変プロモーター配列のセットを生成するステップS110、生成された前記改変プロモーター配列のセットに含まれる個々の改変プロモーター配列の活性を機械学習モデルによって予測するステップS120、所望の活性を有すると予測された改変プロモーター配列を選択するステップS130を実行することができる命令またはプログラムが格納された、コンピューター可読媒体でありうる。図１２は、このようなプログラムの例示的なフローチャートを示している。 In some embodiments, the computer-readable medium may be a computer-readable medium having stored thereon instructions or a program that, when executed by a processor, can execute the following steps: step S100 of accepting an input of an original promoter sequence to be modified; step S110 of generating a set of multiple modified promoter sequences that can be created by genome editing technology based on the promoter sequence; step S120 of predicting the activity of each modified promoter sequence included in the generated set of modified promoter sequences using a machine learning model; and step S130 of selecting the modified promoter sequence predicted to have the desired activity. FIG. 12 shows an exemplary flowchart of such a program.

ゲノム編集技術により作成可能な複数の改変プロモーター配列のセットを生成するステップS110はさらに、入力された配列の中からCas9による切断可能位置（PAM配列）を網羅的に検索するステップS112、検索された切断可能位置（PAM配列）のうちの2つを組合せ論的にすべて選択するステップS114、すべての組合せについて選択された2か所の切断部位の間の配列を欠失させて改変プロモーター配列のセットを生成するステップS116を含んでいてもよい。また、個々の改変プロモーター配列の活性を機械学習モデルによって予測するステップS120は、個々の改変プロモーター配列のコアプロモーター部分（3'末端側の150～200bp、例えば約170bpの配列）のみを用いて行ってもよい。 Step S110 of generating a set of multiple modified promoter sequences that can be created by genome editing technology may further include step S112 of exhaustively searching the input sequence for positions cleavable by Cas9 (PAM sequences), step S114 of enumerating every possible pair of the cleavable positions found, and step S116 of deleting, for each pair, the sequence between the two selected cleavage sites to generate the set of modified promoter sequences. In addition, step S120 of predicting the activity of each modified promoter sequence using a machine learning model may be performed using only the core promoter portion of each modified promoter sequence (the 150 to 200 bp at the 3' end, for example a sequence of about 170 bp).
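Steps S112 to S116 and the core-promoter truncation can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the disclosure: the function names are hypothetical, both strands are scanned for NGG (CCN on the forward strand), and the cut is placed 3 bp from the PAM on the protospacer side, which is the usual SpCas9 behaviour but is not specified in the text. Repair is modelled as a clean rejoining of the two cut ends.

```python
from itertools import combinations


def find_cut_sites(seq: str) -> list[int]:
    """S112: return sorted Cas9 cut positions for every NGG PAM.

    A cut position c means the cut falls between seq[c-1] and seq[c].
    Assumption: the cut lies 3 bp from the PAM on the protospacer side.
    """
    seq = seq.upper()
    cuts = set()
    for i in range(len(seq) - 2):
        # Forward-strand PAM: N at i, GG at i+1..i+2.
        if seq[i + 1:i + 3] == "GG" and i - 3 > 0:
            cuts.add(i - 3)
        # Reverse-strand PAM reads as CCN on the forward strand.
        if seq[i:i + 2] == "CC" and i + 6 < len(seq):
            cuts.add(i + 6)
    return sorted(cuts)


def deletion_variants(seq: str, core_len: int = 170) -> dict[tuple[int, int], str]:
    """S114 and S116: delete the segment between every pair of cut sites.

    Returns a dict mapping (cut_a, cut_b) to the last core_len bases of
    the edited sequence, i.e. the new core promoter handed to the model.
    """
    variants = {}
    for a, b in combinations(find_cut_sites(seq), 2):
        edited = seq[:a] + seq[b:]          # rejoin the two cut ends
        variants[(a, b)] = edited[-core_len:]
    return variants
```

Because every pair of cut sites is enumerated, the number of candidates grows quadratically with the number of PAM sites, which is consistent with a promoter of about 1 kb yielding thousands of editing patterns.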

本開示の１つの態様は、所望の活性を有するように改変されたプロモーター配列を予測するコンピュータープログラムに関する。 One aspect of the present disclosure relates to a computer program that predicts promoter sequences that have been modified to have a desired activity.

一部の実施形態において、本プログラムは、コンピューターに、改変の対象となる元のプロモーター配列の入力を受け付ける機能、前記プロモーター配列に基づきゲノム編集技術により作成可能な複数の改変プロモーター配列のセットを生成する機能、生成された前記改変プロモーター配列のセットに含まれる個々の改変プロモーター配列の活性を機械学習モデルによって予測する機能、所望の活性を有すると予測された改変プロモーター配列を選択する機能を実現させる、プログラムでありうる。 In some embodiments, the program may be a program that causes a computer to perform the functions of accepting input of an original promoter sequence to be modified, generating a set of multiple modified promoter sequences that can be created using genome editing technology based on the promoter sequence, predicting the activity of each modified promoter sequence included in the generated set of modified promoter sequences using a machine learning model, and selecting a modified promoter sequence predicted to have a desired activity.

図１３は、本開示の装置の例示的な一態様を示した概略図である。図１３おいて、100はコンピューターであり、制御部101、記憶部102、周辺機器I/F部103、入力部104、表示部105、通信部106を備え、これらがバス110により接続される。なお、この構成は例示であり、適宜、様々な構成を採ることができる。 FIG. 13 is a schematic diagram showing an exemplary embodiment of the device of the present disclosure. In FIG. 13, 100 is a computer, which includes a control unit 101, a memory unit 102, a peripheral device I/F unit 103, an input unit 104, a display unit 105, and a communication unit 106, which are connected by a bus 110. Note that this configuration is merely an example, and various configurations can be adopted as appropriate.

制御部101は、CPU（Central Processing Unit）、ROM（Read Only Memory）、RAM（Random Access Memory）等で構成される。CPUは、記憶部102、ROM、記録媒体等に格納されるプログラムをRAM上のワークメモリ領域に呼び出して実行し、バス110を介して接続された各装置を駆動制御し、コンピューターが行う処理を実現する。ROMは、不揮発性メモリであり、コンピューター100のブートプログラムやBIOS等のプログラム、データ等を保持している。RAMは、揮発性メモリであり、記憶部102、ROM、記録媒体等からロードしたプログラム、データ等を一時的に保持するとともに、制御部101が各種処理を行う際に使用するワークエリアを備える。記憶部102は、例えばHDD（ハードディスクドライブ）であり、制御部101が実行するプログラム、その他各種データを格納する。 The control unit 101 is composed of a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), etc. The CPU loads and executes programs stored in the memory unit 102, the ROM, a recording medium, etc. into a work memory area on the RAM, drives and controls each device connected via the bus 110, and realizes the processing performed by the computer. The ROM is a non-volatile memory that holds the boot program of the computer 100, programs such as the BIOS, and data, etc. The RAM is a volatile memory that temporarily holds programs, data, etc. loaded from the memory unit 102, the ROM, a recording medium, etc., and has a work area that the control unit 101 uses when performing various processing. The memory unit 102 is, for example, a hard disk drive (HDD), and stores the programs executed by the control unit 101 and various other data.

周辺機器I/F（インターフェース）部103は、コンピューター100と周辺機器とを接続させるためのポートである。周辺機器I/F部103は、USBやIEEE1394やRS-232C等で構成される。なお、周辺機器との接続形態は有線、無線を問わない。入力部104は、キーボード、マウス等のポインティングデバイス、テンキー等の入力装置を有し、コンピューター100に対して、操作指示、動作指示、データ入力等を行う。表示部105は、液晶パネル等のディスプレイ装置に映像・画像等の表示を行うための論理回路乃至デバイスドライバーである。入力部104及び表示部105を、タッチディスプレイとして一体的に構成することもできる。 The peripheral device I/F (interface) unit 103 is a port for connecting the computer 100 to a peripheral device. The peripheral device I/F unit 103 is configured with USB, IEEE1394, RS-232C, etc. The connection with the peripheral device may be wired or wireless. The input unit 104 has input devices such as a keyboard, a pointing device such as a mouse, and a numeric keypad, and issues operational instructions, operation instructions, data input, etc. to the computer 100. The display unit 105 is a logic circuit or device driver for displaying videos, images, etc. on a display device such as a liquid crystal panel. The input unit 104 and the display unit 105 can also be configured as an integrated touch display.

通信部106は、通信制御装置、通信ポート等を有し、ネットワーク120との通信を媒介する有線または無線の通信インターフェースである。バス110は、各装置間の制御信号、データ信号等の授受を媒介する通信経路である。ネットワーク120は、さらに外部サーバー130やネットストレージ140に接続されていてもよい。 The communication unit 106 has a communication control device, a communication port, etc., and is a wired or wireless communication interface that mediates communication with the network 120. The bus 110 is a communication path that mediates the transmission and reception of control signals, data signals, etc. between each device. The network 120 may further be connected to an external server 130 and net storage 140.

例えば、図１３の装置に、本開示に係るコンピューター可読媒体に記録されたプログラムを読み込み、コンピューターを、改変の対象となる元のプロモーター配列の入力を受け付ける配列入力部、前記プロモーター配列に基づきゲノム編集技術により作成可能な複数の改変プロモーター配列のセットを生成する改変配列生成部、生成された前記改変プロモーター配列のセットに含まれる個々の改変プロモーター配列の活性を機械学習モデルによって予測する活性予測部、所望の活性を有すると予測された改変プロモーター配列を選択する配列選択部を備えた、情報処理装置として機能させることができる。 For example, a program recorded on a computer-readable medium according to the present disclosure can be loaded into the device of FIG. 13, and the computer can function as an information processing device including a sequence input unit that accepts input of an original promoter sequence to be modified, a modified sequence generation unit that generates a set of multiple modified promoter sequences that can be created by genome editing technology based on the promoter sequence, an activity prediction unit that predicts the activity of each modified promoter sequence included in the generated set of modified promoter sequences using a machine learning model, and a sequence selection unit that selects a modified promoter sequence predicted to have a desired activity.

特に定義されない限り、本明細書中で使用されるすべての技術的および科学的な用語は、本発明が属する技術分野の当業者によって一般に理解されるのと同じ意味を有する。本明細書中で記述されるものと類似もしくは等価なあらゆる方法および材料が、本発明の実施もしくは試験のために使用されうるものの、いくつかの可能性のある、好ましい方法および材料がこれから記述される。本明細書で言及されるすべての刊行物は、参照により本明細書に組み込まれ、関連して刊行物が引用される方法および／または材料が開示および記述される。本開示は、矛盾がある場合、組み込まれた刊行物の開示に優先することが理解される。 Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, some possible, preferred methods and materials are now described. All publications mentioned herein are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited. It is understood that the present disclosure supersedes the disclosure of the incorporated publications in the case of any conflict.

値の範囲が記載される場合、文脈上明らかに別段の指示がない限り、その範囲の上限と下限の間に、下限の単位の10分の1までの介在するそれぞれの値もまた具体的に開示されていると理解される。記載された範囲内の任意の記載された値または介在する値と、その記載された範囲内の任意の他の記載された値または介在する値との間の、より小さなそれぞれの範囲もまた、本開示に包含される。これらのより小さな範囲の上限と下限は独立して、その範囲に含められても除外されてもよく、より小さな範囲にどちらか、どちらも、または両方の限界値が含まれる各範囲もまた、本発明に包含されるが、記載された範囲において具体的に除外された限界値は留保される。記載された範囲が限界値の一方または両方を含む場合、含まれる限界値のいずれかまたは両方を除外する範囲も本発明に含まれる。数値に関する「約」という用語は、5％以内を意味する。 When a range of values is described, unless the context clearly indicates otherwise, it is understood that each intervening value, to one-tenth of the unit of the lower limit, between the upper and lower limits of that range is also specifically disclosed. Each smaller range between any stated or intervening value in a stated range and any other stated or intervening value in that stated range is also encompassed by the disclosure. The upper and lower limits of these smaller ranges may independently be included in or excluded from the range, and each range in which either, neither, or both limits are included in the smaller range is also encompassed by the invention, subject to any specifically excluded limit in the stated range. When a stated range includes one or both limits, ranges excluding either or both of the included limits are also included in the invention. The term "about" with respect to numerical values means within 5%.

本明細書に記載の実施形態は、単に例示的なものであることを意図しており、当業者であれば、本発明の精神から逸脱することなく、数多くの変形及び修正を行うことができるであろう。また、ある種の変形及び修正は、最適な結果には至らないものの、それでも満足のいく結果をもたらしうる。そのような変形及び修正は全て、添付の特許請求の範囲によって定義される本発明の範囲内にあることが意図されている。また、本明細書に開示の構成要素の任意の組み合わせ、本開示の表現を方法、装置、システム、コンピュータプログラム、データ構造、記録媒体などの間で変換したものもまた、本開示に係る態様として有効である。よって、本開示の方法に関して記載された詳細は、システム、コンピュータプログラム、データ構造、記録媒体等に適用されうる。 The embodiments described herein are intended to be merely exemplary, and those skilled in the art will be able to make numerous variations and modifications without departing from the spirit of the present invention. Also, certain variations and modifications may produce less than optimal results, but still be satisfactory. All such variations and modifications are intended to be within the scope of the present invention as defined by the appended claims. Also, any combination of the components disclosed herein, and any conversion of the expressions of the present disclosure between methods, devices, systems, computer programs, data structures, recording media, etc., are also valid aspects of the present disclosure. Thus, details described with respect to the methods of the present disclosure may be applied to systems, computer programs, data structures, recording media, etc.

実施例１：機械学習モデルの構築 Example 1: Construction of a machine learning model
プロモーター配列から植物細胞における遺伝子発現誘導活性を予測する機械学習モデルは、以下のようにして構築した。まず、学習元となるデータとして、Jores et al.による論文で示されたものを入手した（Jores, T., Tonnies, J., Wrightsman, T. et al. Synthetic promoter designs enabled by a comprehensive analysis of plant core promoters. Nat. Plants 7, 842-855 (2021). https://doi.org/10.1038/s41477-021-00932-y）。データセットの概要としては、約7万件のデータセットであって、170塩基のDNA配列と、発現強度の組であった。 A machine learning model that predicts gene expression induction activity in plant cells from a promoter sequence was constructed as follows. First, the data presented in the paper by Jores et al. was obtained as training data (Jores, T., Tonnies, J., Wrightsman, T. et al. Synthetic promoter designs enabled by a comprehensive analysis of plant core promoters. Nat. Plants 7, 842-855 (2021). https://doi.org/10.1038/s41477-021-00932-y). The dataset comprised about 70,000 entries, each pairing a 170-base DNA sequence with an expression intensity.

また、機械学習には、トランスフォーマーベースの事前訓練された基礎モデルであるDNABERTを利用した。DNABERTについては、例えば、上述のJi et al.による論文で報告されている。DNABERTは、主として2値分類問題を扱うことに主眼を置いており、例えば、「与えられた塩基配列について、スプライシングサイトを有するか否か」といった判定を行うことができると示されている。2値分類問題とはつまり、スプライシングサイトを有するならば1、そうでなければ0というように、2つのカテゴリーに分類する問題である。これに対し、「与えられた塩基配列について、下流の遺伝子発現量を予測する」といった問題は、分類問題に対して回帰問題と呼ばれ、連続的な値を出力する必要がある。そこで、DNABERTで連続的な値を扱うために、トランスフォーマーにクラスを追加してモデルの訓練を行った。 For machine learning, we used DNABERT, a transformer-based pre-trained basic model. DNABERT has been reported, for example, in the paper by Ji et al. mentioned above. DNABERT is primarily focused on dealing with binary classification problems, and has been shown to be able to determine, for example, whether a given base sequence has a splicing site or not. A binary classification problem is a problem of classifying into two categories, such as 1 if it has a splicing site and 0 if it does not. In contrast, a problem such as "predicting downstream gene expression levels for a given base sequence" is called a regression problem in contrast to a classification problem, and requires the output of continuous values. Therefore, in order to handle continuous values in DNABERT, a class was added to the transformer and the model was trained.

実施例２：学習済みモデルによるプロモーター強度の予測と分類 Example 2: Prediction and classification of promoter strength by the trained model
プロモーター配列は多くの場合遺伝子の上流に存在し、直下にある遺伝子の転写量（発現量）を制御している。遺伝子の発現量はプロモーターの配列により制御されていると考えられる。下流の遺伝子の発現量を、プロモーターの強度と定義する。プロモーターの配列ごとに固有の強度を示すと考えられる。 Promoter sequences are usually located upstream of a gene and control the transcription level (expression level) of the gene immediately downstream. The expression level of a gene is considered to be controlled by the promoter sequence. Here, the expression level of the downstream gene is defined as the promoter strength. Each promoter sequence is considered to have its own characteristic strength.

実施例１で説明したように、塩基配列を基に、転写活性を予測する機械学習モデルを構築した。これを用いてシロイヌナズナの全遺伝子の転写活性を評価した一覧を作成した。 As explained in Example 1, a machine learning model was constructed to predict transcription activity based on the base sequence. Using this model, a list was created that evaluated the transcription activity of all Arabidopsis genes.

全てのプロモーターを、予測された転写活性を基に、以下の3つのグループに分類した：
１．高いプロモーター強度を示すと予測された「高発現グループ」；
２．非常に低いかほぼ発現を示さないプロモーター強度をもつと予測された「低発現グループ」；および
３．それらの中間のプロモーター強度を示す「中程度グループ」。
All promoters were classified into the following three groups based on their predicted transcriptional activity:
1. a "high expression group" predicted to show high promoter strength;
2. a "low expression group" predicted to have very low promoter strength or almost no expression; and
3. a "moderate group" showing promoter strengths intermediate between the two.
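The three-way grouping can be expressed as a simple thresholding of the predicted transcriptional activity. The source does not state how the group boundaries were chosen, so the cut-off values `low_cut` and `high_cut` below are purely hypothetical placeholders, as is the function name.

```python
def classify_promoter(predicted_activity: float,
                      low_cut: float = 1.0,
                      high_cut: float = 4.0) -> str:
    """Assign a promoter to a group from its predicted activity.

    low_cut and high_cut are illustrative values only; the actual
    boundaries used in the disclosure are not given.
    """
    if low_cut >= high_cut:
        raise ValueError("low_cut must be smaller than high_cut")
    if predicted_activity >= high_cut:
        return "high"
    if predicted_activity <= low_cut:
        return "low"
    return "moderate"
```

In practice the cut-offs could instead be derived from the distribution of predictions across all promoters (for example, quantiles), which is equally an assumption here.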

実施例３：モデルによる予測と実際の遺伝子発現との比較 Example 3: Comparison of model predictions and actual gene expression
高発現グループから7プロモーター、中程度グループから8プロモーター、低発現グループから4プロモーターをそれぞれ選び、ルシフェラーゼ（LUC）遺伝子に接続した人工合成遺伝子を作成した。 Seven promoters were selected from the high expression group, eight from the moderate group, and four from the low expression group, and artificial synthetic genes were created by connecting each of them to the luciferase (LUC) gene.

遺伝子の構造は、図１に示されているとおりである。プロモーターは19種類の様々な配列パターンを持つ。下流のLUC遺伝子と上流のカリフラワーモザイクウイルス（CaMV）35Sエンハンサーの配列は各人工遺伝子で共通である。 The gene structures are shown in Figure 1. The promoters have 19 different sequence patterns. The downstream LUC gene and the upstream cauliflower mosaic virus (CaMV) 35S enhancer sequences are common to all artificial genes.

この人工遺伝子を搭載したプラスミドベクターをシロイヌナズナの葉から採取したプロトプラストに対し、ポリエチレングリコール法によりトランスフェクションした。プロトプラストはLUC遺伝子を様々の発現量で発現した。LUC発現量を評価するために、プレートリーダーを用いて発光強度を測定した。測定の結果と、予測されたプロモーター強度の関係を図２に示す。LUC遺伝子の発現量の実測値と、プロモーター強度の予測値の関係を散布図で示した。縦軸は予測された転写活性を示す。値が高いほど、下流の遺伝子の発現量が大きくなると期待される。横軸は対数変換したLUC発現強度の実測値を示す。ただし、ポジティブコントロール（Pコントロール）であるCaMVプロモーターの発現量で標準化した。 The plasmid vector carrying this artificial gene was transfected into protoplasts taken from Arabidopsis leaves using the polyethylene glycol method. The protoplasts expressed the LUC gene at various levels. To evaluate the LUC expression level, luminescence intensity was measured using a plate reader. Figure 2 shows the relationship between the measurement results and the predicted promoter strength. A scatter plot shows the relationship between the actual measured expression level of the LUC gene and the predicted promoter strength. The vertical axis shows the predicted transcription activity. The higher the value, the greater the expected expression level of the downstream gene. The horizontal axis shows the actual measured LUC expression intensity, which was logarithmically transformed. However, this was standardized using the expression level of the CaMV promoter, which was the positive control (P control).

図２に示されている結果から、予測値に基づくグループと一致して、実測値が3群に分かれることがわかる。このように、構築した機械学習モデルによる予測値と実測値には明瞭な相関関係が確認された。この図は、予測性の高さを示している。この図において、線形に収束する場合には予測値と実測値が一致していることを意味し、高い予測性能を有することになる。この予測性能を相関係数Rで評価すると、R=0.9296673となった。 The results shown in Figure 2 show that the actual measured values are divided into three groups, consistent with the groups based on the predicted values. In this way, a clear correlation was confirmed between the predicted values by the constructed machine learning model and the actual measured values. This figure shows the high level of predictability. In this figure, linear convergence means that the predicted values and actual measured values match, and the model has high predictive performance. When this predictive performance was evaluated using the correlation coefficient R, it was found to be R = 0.9296673.
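For reference, a correlation coefficient of this kind can be computed as follows. The sketch assumes that R is the Pearson product-moment correlation between predicted activity and (log-transformed, normalized) measured LUC intensity; the source does not name the coefficient explicitly, and the function name is hypothetical.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)
```

A value close to 1 means the points lie near a rising straight line, which is what the text describes as the predictions and measurements "converging linearly".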

次に、上記の散布図で示された相関関係を基に、線形回帰予測により、予測された転写活性に対応するLUC発現量を計算した。これらを比較したものを次の図３に示す。 Next, based on the correlation shown in the scatter plot above, we calculated the LUC expression level corresponding to the predicted transcription activity by linear regression prediction. A comparison of these is shown in Figure 3 below.
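The conversion from predicted transcriptional activity to an expected LUC expression level can be sketched as an ordinary least-squares line fit. The function names are hypothetical, and the source does not say whether the fit was made on log-transformed or raw LUC values, so the sketch simply fits whatever values are passed in.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_luc(predicted_activity: float, slope: float, intercept: float) -> float:
    """Map a model score to an expected LUC value using the fitted line."""
    return slope * predicted_activity + intercept
```

The line would be fitted once on the measured promoters (model score as x, measured LUC as y) and then reused to translate the score of any new or edited promoter into an expected expression level.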

縦軸にLUCの発光強度をプロットした。ただし、Pコントロールの発光強度を1とした相対値で示している。Pコントロールとして、高い転写活性が知られているカリフラワーモザイクウイルス（CaMV）プロモーターを使用した。横軸に19種類のプロモーターの番号を示した。プロモーター番号3～10は高い発現強度が予測された「高発現」グループ、プロモーター番号13～21は同「中程度」グループ、プロモーター番号22～25は同「低発現」グループに分類した。黒い棒グラフはLUCアッセイの実測値（発現量）を示す。白色の棒グラフは、機械学習モデルによる発現強度予測値を示す。図３において、黒い棒グラフに着目すると、「高発現」グループでは比較的高い発現量を示し、また「低発現」グループでは低い発現量を示した。「中程度」グループでは中間の値を示した。 The luminescence intensity of LUC is plotted on the vertical axis. The luminescence intensity of the P control is set to 1, and is shown as a relative value. The cauliflower mosaic virus (CaMV) promoter, known for its high transcriptional activity, was used as the P control. The horizontal axis shows the numbers of 19 types of promoters. Promoter numbers 3 to 10 were classified into the "high expression" group, which was predicted to have high expression intensity, promoter numbers 13 to 21 into the "moderate" group, and promoter numbers 22 to 25 into the "low expression" group. The black bars show the actual values (expression levels) of the LUC assay. The white bars show the expression intensity predicted by the machine learning model. In Figure 3, the black bars show a relatively high expression level in the "high expression" group and a low expression level in the "low expression" group. The "moderate" group showed intermediate values.

次に、個別のプロモーターについて、どの程度正確に発現量が予測されるのか確認した。各プロモーターの黒い棒と白色の棒を比較すると、3番、6番、10番、18番のプロモーターでは高い一致度で予測されていることがわかる。一方で、4番は実測値＞予測値となった。5番では反対に予測値＞実測値となった。 Next, we checked how accurately the expression levels were predicted for individual promoters. Comparing the black and white bars for each promoter, we can see that promoters 3, 6, 10, and 18 were predicted with a high degree of agreement. On the other hand, for promoter 4, the actual value was greater than the predicted value. For promoter 5, the opposite was true, with the predicted value greater than the actual value.

概括すると、構築したモデルにより、高発現・低発現といった大まかな傾向を予測することができた。低発現のプロモーターについては、総じて高い精度で予測された。高発現のグループでは、実測値の50％～200％の範囲で発現を予測することができた。 In summary, the constructed model was able to predict general trends such as high and low expression. Low-expressing promoters were generally predicted with high accuracy. In the high-expressing group, expression could be predicted within the range of 50% to 200% of the actual measured value.

このように、本開示に係る機械学習モデルは、遺伝子発現量を、プロモーターの塩基配列を基に予測できる。 In this way, the machine learning model disclosed herein can predict gene expression levels based on promoter base sequences.

実施例４：活性の低下したプロモーターの予測と実証 Example 4: Prediction and demonstration of promoters with reduced activity
次に、プロモーターの転写活性を低下させる方法を考案した。各プロモーターには、Cas9のPAM認識配列（NGG）が多く存在する。これらの配列を検索し、可能な編集パターンおよそ数千通りをリストアップした。 Next, we devised a method to reduce the transcriptional activity of promoters. Each promoter contains many Cas9 PAM recognition sequences (NGG). These sequences were searched for, and several thousand possible editing patterns were listed.

より具体的な手順は以下のとおりである：
１．与えられた塩基配列の中からCas9による切断可能な位置（PAM配列）を網羅的に検索する；
２．その中から2か所を選ぶ；
３．2か所の切断部位で切断し、修復後に出現する新しいコアプロモーター配列をシミュレーションする；
４．シミュレーションされた配列に対し、コアプロモーターとしての強度を推測する；
５．上記の工程を、考えうる全ての切断部位のペアについて実行する；
６．シミュレーションされた全てのコアプロモーターの中で、最低のスコア（転写活性が上昇したプロモーターを選択する場合は最高のスコア）を示すものを選ぶ。
More specifically, the procedure is as follows:
1. Exhaustively search the given base sequence for positions cleavable by Cas9 (PAM sequences);
2. Select two of these positions;
3. Simulate the new core promoter sequence that results from cutting at the two cleavage sites and subsequent repair;
4. Estimate the strength of the simulated sequence as a core promoter;
5. Repeat the above steps for every possible pair of cleavage sites;
6. Among all the simulated core promoters, select the one with the lowest score (or the highest score when selecting a promoter with increased transcriptional activity).
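Steps 4 to 6 amount to scoring every simulated core promoter and taking the minimum or maximum. The sketch below assumes the candidates have already been generated as a mapping from a cut-site pair to the resulting core promoter sequence. `toy_score` is a deliberately trivial stand-in (GC fraction) for the fine-tuned model and has no biological meaning; all names are hypothetical.

```python
from typing import Callable


def toy_score(core_promoter: str) -> float:
    """Placeholder scorer (GC fraction); NOT the model of the disclosure."""
    if not core_promoter:
        return 0.0
    s = core_promoter.upper()
    return (s.count("G") + s.count("C")) / len(s)


def select_edit(candidates: dict[tuple[int, int], str],
                score: Callable[[str], float],
                goal: str = "decrease") -> tuple[tuple[int, int], float]:
    """Score every candidate and return the best cut-site pair and its score.

    goal="decrease" picks the lowest score, goal="increase" the highest.
    """
    if goal not in ("decrease", "increase"):
        raise ValueError("goal must be 'decrease' or 'increase'")
    if not candidates:
        raise ValueError("no candidate edits to choose from")
    scored = {pair: score(seq) for pair, seq in candidates.items()}
    pick = min if goal == "decrease" else max
    best_pair = pick(scored, key=scored.get)
    return best_pair, scored[best_pair]
```

Passing the scoring function as an argument keeps the search loop independent of the model, so the same selection code works whether the score comes from a fine-tuned transformer or any other predictor.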

The above makes it possible to determine which cleavage sites to select when expression is to be enhanced or suppressed. By estimating the transcriptional activity of the promoter for the base sequence of each editing pattern in this way, we searched for editing patterns that reduce promoter transcriptional activity. The results are shown in Figure 4. Editing patterns that reduce promoter strength were sought for promoters 3, 4, 5, 6, 9, 10, and 21. The black triangles show the predicted LUC assay scores calculated from the predicted new promoter strengths. The edited promoters were predicted to give LUC expression of roughly 14% down to 1% of that of the original promoter.

Next, genes carrying these theoretical promoter sequences were artificially synthesized and transfected into protoplasts in the same manner, and expression was measured with a plate reader. The results are shown in Figure 5. The measured LUC expression of the promoters with the new sequences is shown as shaded bars overlaid on Figure 4. Because expression was very low, the vertical axis is enlarged. Promoter 5 yielded a missing value. All six promoters for which measurements were obtained showed a large decrease in expression, consistent with the predictions. For example, the sequence of promoter 3 is
CGGAAACTTGTCACTTCCTTTACATTTGAGTTTCCAACACCTAATCACGACAACAATCATATAGCTCTCGCATACAAACAAACATATGCATGTATTCTTACACGTGAACTCCATGCAAGTCTCTTTTCTCACCTATAAATACCAACCACACCTTCACCACATTCTTCACT (SEQ ID NO: 1)
Its predicted transcriptional activity was 5.43, and the LUC expression predicted from this value by linear regression was 111.5% of the P control. The measured LUC expression was 68.6%. In contrast, the edited sequence of promoter 3 is
GAAACTGATTAGCTCCTATCAGTTCAGCAAACCACAAGCTGAAGAATCCAAGACTTGAGAAACAAATTTACAAAAGCCCATGTTCCAATCAAAACTGTTACCAAACATCTGAAATAGATCTAAATGAGCGTTGGTATAATTGAAACTTACCGAAGGCCCACATTCTTCAC (SEQ ID NO: 2)
which results from truncating the 845 bases between -852 and -7. Its predicted transcriptional activity was 0.393, and the LUC expression predicted from this value by linear regression was 7.98% of the P control. The measured LUC expression was 3.41% of the P control.

Thus, expression can be reduced by editing the promoter base sequence based on the predictions of the machine learning model according to the present disclosure.

Example 5: Prediction and demonstration of promoters with increased activity
Finally, we devised a method to increase promoter activity. As in Example 4, we searched the promoter base sequences of the possible editing patterns for those whose predicted promoter strength increases after editing. The results are shown in Figure 6. For promoters 13, 17, 22, 23, 24, and 25, edits that increase promoter strength were selected, and the predicted expression levels are shown as white circles. LUC expression was predicted to be 7.4 to 125 times that of the original promoter.

Next, genes carrying these theoretical promoter sequences were artificially synthesized and transfected into protoplasts in the same manner, and expression was measured with a plate reader. The results are shown in Figure 7. Five of the six promoters showed higher expression than the original promoter, but only promoter 13 showed an increase that exceeded the predicted value. For example, the sequence of promoter 13 is
TCAAGCAATCATTATCGACTACGGTCGTTCGTTAAAGATCATGCATGTGCTTAGTGGCAATACCCTACGCATCTTGATTCGTTACTGCGGCACGTGTCATGACCATGCACATGAATGATGATTAATGTTTAGTACATATAATGTTCACGCAAACGCATAGTGTTAGGAAA (SEQ ID NO: 3)
Its predicted transcriptional activity was 2.00, and the LUC expression predicted from this value by linear regression was 6.94% of the P control. The measured LUC expression was 6.60% of the P control. In contrast, the edited sequence of promoter 13 is
GAAACTTGAAAATCAAATCAGTGAGTCGCAAGTAAGACTTTGTGGTTGTTGTATCAGATTTCGCCGTGCGCATCTTGATTCGTTACTGCGGCACGTGTCATGACCATGCACATGAATGATGATTAATGTTTAGTACATATAATGTTCACGCAAACGCATAGTGTTAGGAA (SEQ ID NO: 4)
Its predicted transcriptional activity was 4.18, and the LUC expression predicted from this value by linear regression was 36.7% of the P control. The measured LUC expression was 69.5% of the P control.

Thus, expression can be increased by editing the promoter base sequence based on the predictions of the machine learning model according to the present disclosure.

Example 6: Program for changing promoter function by genome editing
The present inventors developed a program for changing promoter function by genome editing. It operates according to the following procedure.

1. Input base sequence X, for example the 2000 bases from -1995 to +5 around the transcription start site of the target gene.
2. Search base sequence X for PAM sequences (NGG for Cas9 or TTTV for Cas12a).
3. Based on the PAM sequences, store a list of cleavage positions in an array. The cleavage position is 4 bases downstream of the PAM sequence for Cas9 and 18 bases downstream of the PAM sequence for Cas12a. For example, about 50 cleavage positions are found in 2000 bases.
4. There are nC₂ ways to choose two of the n cleavage positions. For example, 50 cleavage positions give 50 × 49 / 2 = 1225 combinations. For each of the nC₂ combinations, generate a base sequence in which the region between the two cleavage positions is truncated (deleted).
5. For each generated base sequence, take the 170 bases at the 3' end and input them into the learning model.
6. The learning model outputs the strength (transcriptional activity) of the given base sequence as a core promoter.
7. Taken together, the following three kinds of information are obtained:
A. The transcriptional activity of base sequence X
B. The pairs of cleavage positions for which the sequence obtained by truncating the region between them has higher promoter activity than A.
C. The pairs of cleavage positions for which the sequence obtained by truncating the region between them has lower promoter activity than A.
8. For example, to increase the expression of a target gene, two guide RNAs targeting the cleavage positions obtained in 7.B. are introduced into cells simultaneously and genome editing is performed with CRISPR-Cas9 or CRISPR-Cas12a. This is expected to yield individuals in which the region between the two cleavage sites is deleted.
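The procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the inventors' program: the function names are hypothetical, only the forward strand is searched, the cut offset is measured from the first base of the PAM (the text only says "downstream of the PAM sequence"), and `toy_score` is a placeholder standing in for the trained model, which is not reproduced here.

```python
import re
from itertools import combinations

WINDOW = 170  # bases taken from the 3' end and passed to the model (step 5)

# PAM pattern and cut offset per nuclease. The offsets follow the text
# (4 bases for Cas9, 18 for Cas12a); counting them from the first base
# of the PAM is an assumption of this sketch.
NUCLEASES = {
    "Cas9": (r"(?=[ACGT]GG)", 4),      # NGG
    "Cas12a": (r"(?=TTT[ACG])", 18),   # TTTV, V = A, C or G
}

def find_cut_sites(seq, nuclease="Cas9"):
    """Steps 2-3: forward-strand PAM search, returning sorted cut positions."""
    pattern, offset = NUCLEASES[nuclease]
    cuts = {m.start() + offset for m in re.finditer(pattern, seq)}
    return sorted(c for c in cuts if 0 < c < len(seq))

def delete_between(seq, a, b):
    """Step 4: remove the bases between two cut positions (a < b)."""
    return seq[:a] + seq[b:]

def toy_score(window):
    """Hypothetical stand-in for the trained model: counts TATA motifs."""
    return float(window.count("TATA"))

def classify_edits(seq, score=toy_score, nuclease="Cas9"):
    """Steps 4-7: score every pairwise deletion and compare with the baseline."""
    baseline = score(seq[-WINDOW:])                 # 7.A
    higher, lower = [], []
    for a, b in combinations(find_cut_sites(seq, nuclease), 2):
        s = score(delete_between(seq, a, b)[-WINDOW:])
        if s > baseline:
            higher.append((a, b, s))                # 7.B
        elif s < baseline:
            lower.append((a, b, s))                 # 7.C
    return baseline, higher, lower
```

Calling `classify_edits(x)` on an input sequence returns the baseline score together with the two lists of cut-position pairs. Replacing `toy_score` with a call to a real regression model would leave the rest of the loop unchanged.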

Note in particular that two different guide RNAs are used simultaneously in order to induce small to medium-sized deletions that remain within the SDN1 category.

Example 7: Visualization of predicted promoter activity
The final output of the program described in Example 6 is a set of pairs, each consisting of the positions of two guide RNAs and the predicted promoter activity of the sequence after deletion. This output is difficult to grasp intuitively, so the present inventors also devised methods for visualizing it.

As a first visualization method, we devised a way to analyze the overall picture of the input base sequence X. The vertical axis shows the predicted promoter strength and the horizontal axis shows the position in the base sequence. Each point is the transcriptional activity predicted for the 170-base window cut out of the DNA sequence at that position. For example, the point at position 0 on the horizontal axis was obtained by taking 170 bases from the 5' end of the input base sequence X and predicting their transcriptional activity, which was 1.24. (This 170-base sequence is not adjacent to a gene and is therefore unlikely to function as a core promoter; the value predicts that, if a gene were connected immediately downstream of it, transcription would occur with a strength of 1.24.) The point was therefore plotted at (0, 1.24). Next, the window was slid by one base, bases 2 through 171 of base sequence X were extracted, and the prediction was repeated. The transcriptional activity was estimated to be 1.37, so the point was plotted at (1, 1.37). Repeating this operation yielded 1830 plotted points in total (Figure 8).
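The window scan described above amounts to scoring every 170-base slice of X and recording the start position of each slice. The following Python sketch shows the idea; `sliding_window_scores` is a hypothetical name, positions are 0-based as in the text, and `gc_fraction` is only a placeholder scorer used so that the example runs without the trained model.

```python
def sliding_window_scores(seq, score, window=170):
    """Return (position, score) for every window start, 0-based as in the text."""
    return [
        (i, score(seq[i:i + window]))
        for i in range(len(seq) - window + 1)
    ]

def gc_fraction(window):
    """Hypothetical placeholder scorer; the real model is not reproduced here."""
    return sum(base in "GC" for base in window) / len(window)

# Tiny demonstration with a 4-base window instead of 170 bases.
points = sliding_window_scores("AAGGCC", gc_fraction, window=4)
```

The last element of the returned list corresponds to the window that ends at the 3' end of X, which is the window immediately upstream of the gene.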

In Figure 8, the rightmost point (1829, 1.47) is the evaluation of the 170 bases immediately upstream of the gene. If the sequence between a point with a higher value than this and the 3' end can be removed by genome editing, gene expression is expected to increase. Gene expression could likewise be decreased. This figure gives an overview of sequence X and shows where the guide RNAs should hybridize in order to obtain the desired expression level.

The visualization method above may be useful for getting a rough overall picture of the target sequence. However, it does not show the relationship between promoter strength and the positions of the two guide RNAs to be designed. To overcome this limitation, we devised a second visualization method, which is shown in Figure 9.

A guide RNA designed close to the target gene is called the proximal guide, and a guide RNA designed farther away is called the distal guide. The visualization is based on the positions at which guide RNAs can be designed (the positions of the PAM sequences). Only the 1000 bases immediately upstream of the gene are shown.

The horizontal axis shows the position in sequence X of the base to which the distal guide hybridizes, that is, the distance between the transcription start site of the gene and the distal guide. The vertical axis likewise shows the distance between the proximal guide and the transcription start site. Because the proximal guide is never designed farther away than the distal guide, half of the plot area in Figure 9 is empty. Points on the line y = x indicate that the proximal and distal guides lie at almost the same distance from the gene and are therefore close to each other. In this case the number of deleted bases is small, so the edit is minimal, which is desirable. The 170 bases at the 3' end of the new sequence created by deleting the region between the proximal and distal guides were extracted and their transcriptional activity was predicted. The color of each point reflects the predicted value: in the color chart, light gray indicates high predicted transcriptional activity and black indicates low predicted transcriptional activity.

Take the point (849, 17) as an example. At this point the distal guide is located 849 bases from the gene and the proximal guide is located 17 bases from the gene. Deleting the 832 bases between these two positions from base sequence X gives base sequence X'. The 170 bases at the 3' end of X' were extracted and their transcriptional activity was predicted to be 4.02, so a light gray point was plotted according to the color chart. The figure was drawn by performing this operation for every combination of distal and proximal guides.

To increase the expression of this gene, one selects points with a value greater than the original transcriptional activity of 2.34 and reads off the positions of the guide RNAs to be designed from their distal and proximal guide positions. For example, selecting point (849, 17) or point (841, 29) is expected to give high transcriptional activity, whereas selecting point (432, 110) or point (271, 77) is expected to give low transcriptional activity. Visualizing the results in this way may make it easier to design guide RNAs that achieve the desired promoter structure.
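The points of such a map, and the selection of candidates from it, can be sketched as follows. This is an illustration only: the function names are hypothetical, distances are counted from the 3' end of X (an assumption, since the text measures them from the gene), and ranking by deletion size reflects the remark that smaller deletions are preferable rather than any rule stated in the source.

```python
from itertools import combinations

def guide_pair_points(seq, cut_sites, score, window=170):
    """Build (distal, proximal, score) points for every pair of cut sites."""
    n = len(seq)
    points = []
    for a, b in combinations(sorted(cut_sites), 2):   # a is farther from the gene
        edited = seq[:a] + seq[b:]                    # delete the bases between cuts
        points.append((n - a, n - b, score(edited[-window:])))
    return points

def pick_points(points, baseline, increase=True):
    """Keep points above (or below) the baseline, smallest deletion first."""
    if increase:
        keep = [p for p in points if p[2] > baseline]
    else:
        keep = [p for p in points if p[2] < baseline]
    return sorted(keep, key=lambda p: p[0] - p[1])    # deletion size = distal - proximal
```

With this coordinate convention, the deletion size of a point is the difference between its two coordinates, which matches the worked example in the text: point (849, 17) corresponds to a deletion of 832 bases.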

After the positions of the guide RNA target sequences are obtained in this way, the base sequence at each position is retrieved and 20 or 23 bases are taken to design the guide RNA. The guide RNA sequence is then evaluated and verified with a tool such as CRISPOR to assess its specificity, off-target risk, and so on. Once a suitable guide RNA has been designed, a vector carrying the guide RNA is constructed using artificial DNA synthesis and PCR. From this point on, genome editing can be carried out by standard methods.
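Extracting the target bases next to a chosen PAM can be sketched as below. The text does not say which length applies to which nuclease or on which side of the PAM the bases are taken. This sketch assumes the common convention of 20 bases immediately 5' of the NGG PAM for Cas9 and 23 bases immediately 3' of the TTTV PAM for Cas12a, on the forward strand only. `protospacer` is a hypothetical helper, and its output would still need to be checked with a tool such as CRISPOR.

```python
SPACER_LENGTH = {"Cas9": 20, "Cas12a": 23}  # assumed mapping of "20 or 23 bases"
PAM_LENGTH = {"Cas9": 3, "Cas12a": 4}       # NGG and TTTV

def protospacer(seq, pam_start, nuclease="Cas9"):
    """Return the target bases adjacent to the PAM, or None if they do not fit."""
    n = SPACER_LENGTH[nuclease]
    if nuclease == "Cas9":
        start = pam_start - n                     # target lies 5' of the PAM
    else:
        start = pam_start + PAM_LENGTH[nuclease]  # target lies 3' of the PAM
    end = start + n
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]
```

Returning `None` for a PAM too close to either end of the sequence keeps incomplete targets out of the candidate list.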

Example 8: Example of a soybean gene promoter
The training data for the machine learning model consisted of promoter sequences from Arabidopsis thaliana, sorghum, and maize. It was therefore unclear whether the prediction system according to the present invention would work as expected for promoter sequences from organisms other than these three species. We therefore examined whether the transcriptional activity of a specific soybean gene could be increased. The results are shown in Figure 10.

The transcriptional activity of the 170-base core promoter sequence of a soybean gene was predicted to be 0.927494. The LUC expression predicted from this value was 1.26% of the positive control. This promoter sequence was artificially synthesized and introduced into protoplasts isolated from Arabidopsis leaves, and a LUC assay was performed. The measured value was 4.15%.

Based on this promoter, two post-editing promoter sequences intended to increase gene expression were created, designated edit1 and edit2. The transcriptional activity of edit1 was predicted to be 2.950119, and its LUC expression was predicted to be 13.2%. The measured LUC assay value for this promoter was 19.0%.

The transcriptional activity of edit2 was predicted to be 2.432977, and its LUC expression was predicted to be 7.27%. The measured LUC assay value for this promoter was 41.6%.
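The predicted and measured values for the soybean promoter can be compared as fold changes relative to the unedited promoter. The short Python calculation below uses only the percentages quoted in this example; the dictionary layout and the `fold_change` helper are illustrative and are not part of the disclosed program.

```python
# LUC expression as % of the positive control, taken from the text of Example 8.
LUC_PERCENT = {
    "original": {"predicted": 1.26, "measured": 4.15},
    "edit1": {"predicted": 13.2, "measured": 19.0},
    "edit2": {"predicted": 7.27, "measured": 41.6},
}

def fold_change(name, kind):
    """Expression of an edited promoter divided by that of the original."""
    return LUC_PERCENT[name][kind] / LUC_PERCENT["original"][kind]

for name in ("edit1", "edit2"):
    predicted = fold_change(name, "predicted")
    measured = fold_change(name, "measured")
    print(f"{name}: predicted {predicted:.1f}x, measured {measured:.1f}x")
```

Both edits raised measured expression above that of the original promoter, in line with the predicted direction of change, although the predicted and measured magnitudes differ.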

010: Information processing device
011: Sequence input section
012: Modified sequence generation section
013: Activity prediction section
014: Sequence selection section
100: Computer
101: Control section
102: Memory section
103: Peripheral device I/F section
104: Input section
105: Display section
106: Communication section
110: Bus
120: Network
130: External server
140: Database

Claims

A method for obtaining a promoter sequence modified to have a desired activity, comprising the steps of:
Providing an original promoter sequence to be modified;
Generating a set of multiple modified promoter sequences that can be created by genome editing technology based on the promoter sequence;
predicting the activity of each modified promoter sequence included in the generated set of modified promoter sequences using a machine learning model;
selecting a modified promoter sequence predicted to have a desired activity;
the promoter sequence is a plant cell promoter sequence,
The original promoter sequence includes a core promoter and a sequence upstream thereof, wherein the core promoter is a sequence included in positions −200 to +50 when the base position of the transcription start site is represented as +1;
The genome editing technology is a genome editing technology using the CRISPR/Cas system,
A set of multiple modified promoter sequences is generated by sequence deletion caused by cleavage induced by a combination of guide RNA sequences designed based on two PAM recognition sequences, where at least one of the two PAM recognition sequences is located within a core promoter;
A method, wherein the original promoter is at least 2000 bp in length.

The method of claim 1, wherein the desired activity is a gene expression induction activity higher than that of the original promoter sequence, or a gene expression induction activity lower than that of the original promoter sequence.

The method of claim 1, wherein the original promoter sequence includes a core promoter sequence and an upstream sequence thereof.

The method of claim 1, wherein the set of modified promoter sequences includes at least 1000 different sequences.

The method according to claim 1, wherein the machine learning model is a regression model trained to predict gene expression induction activity from a promoter sequence using gene expression induction activity data of multiple promoter sequences in plant cells as training data.

The method of claim 1, further comprising a step of visualizing the activity of each of the modified promoter sequences included in the generated set of modified promoter sequences on a computer display.

An information processing device for predicting a promoter sequence modified to have a desired activity, comprising:
a sequence input section for receiving input of an original promoter sequence to be modified;
A modified sequence generating unit that generates a set of multiple modified promoter sequences that can be created by genome editing technology based on the promoter sequence;
an activity prediction unit that predicts the activity of each modified promoter sequence included in the generated set of modified promoter sequences using a machine learning model;
a sequence selection section for selecting a modified promoter sequence predicted to have a desired activity;
the promoter sequence is a plant cell promoter sequence,
The original promoter sequence includes a core promoter and an upstream sequence thereof, wherein the core promoter is a sequence included in positions −200 to +50 when the base position of the transcription start site is represented as +1;
The genome editing technology is a genome editing technology using the CRISPR/Cas system,
A set of multiple modified promoter sequences is generated by sequence deletion caused by cleavage induced by a combination of guide RNA sequences designed based on two PAM recognition sequences, where at least one of the two PAM recognition sequences is located within a core promoter;
The original promoter is at least 2000 bp in length.

A non-transitory computer-readable medium having instructions stored thereon, the instructions, when executed by a processor, causing the following steps to be performed:
receiving an input of an original promoter sequence to be modified;
generating a set of multiple modified promoter sequences that can be created by genome editing technology based on the promoter sequence;
predicting the activity of each of the modified promoter sequences included in the generated set of modified promoter sequences using a machine learning model;
selecting a modified promoter sequence predicted to have a desired activity;
the promoter sequence is a plant cell promoter sequence,
The original promoter sequence includes a core promoter and a sequence upstream thereof, wherein the core promoter is a sequence included in positions −200 to +50 when the base position of the transcription start site is represented as +1;
The genome editing technology is a genome editing technology using the CRISPR/Cas system,
A set of multiple modified promoter sequences is generated by sequence deletion caused by cleavage induced by a combination of guide RNA sequences designed based on two PAM recognition sequences, where at least one of the two PAM recognition sequences is located within a core promoter;
A computer readable medium, wherein the original promoter is at least 2000 bp in length.

A program for causing a computer to realize:
a function of accepting input of an original promoter sequence to be modified;
a function of generating a set of multiple modified promoter sequences that can be created by genome editing technology based on the promoter sequence;
a function of predicting the activity of each modified promoter sequence included in the generated set of modified promoter sequences using a machine learning model;
a function of selecting a modified promoter sequence predicted to have a desired activity;
the promoter sequence is a plant cell promoter sequence,
the original promoter sequence includes a core promoter and a sequence upstream thereof, wherein the core promoter is a sequence included in positions −200 to +50 when the base position of the transcription start site is represented as +1;
the genome editing technology is a genome editing technology using the CRISPR/Cas system,
the set of multiple modified promoter sequences is generated by sequence deletion caused by cleavage induced by a combination of guide RNA sequences designed based on two PAM recognition sequences, where at least one of the two PAM recognition sequences is located within the core promoter;
wherein the original promoter is at least 2000 bp in length.

A method for genome editing of a plant cell for regulating the expression level of a desired gene, comprising:
Obtaining a modified promoter sequence having a desired activity by the method of claim 1;
Preparing a plant cell to be subjected to genome editing;
editing the genome of said plant cell to generate said modified promoter sequence.

A method for producing a genome-edited plant in which the expression level of a desired gene is regulated, comprising:
Editing the genome of a desired plant cell by the method of claim 10;
obtaining a plant individual derived from the genome-edited cell.

The method of claim 1, wherein the prediction of the activity of each modified promoter sequence by a machine learning model is performed using only the core promoter portion of each modified promoter sequence.