JPWO2020167667A5

Info

Publication number: JPWO2020167667A5
Application number: JP2021546841A
Authority: JP
Publication date: 2023-02-16
Anticipated expiration: 2040-02-10

Description

In the above description and examples, those skilled in the art will recognize that the specific numbers of sample sizes, iterations, epochs, batch sizes, learning rates, accuracies, data input sizes, filters, amino acid sequences, and other values can be adjusted or optimized. Although particular embodiments are described in the examples, the numbers listed in the examples are non-limiting.
The present invention also includes the following aspects as embodiments.
[Aspect 1]
A method of modeling a desired protein property, comprising:
(a) providing a first pre-trained system comprising a first neural net embedder and a first neural net predictor, wherein the first neural net predictor of the pre-trained system is different from the desired protein property;
(b) transferring at least a portion of the first neural net embedder of the pre-trained system to a second system, the second system comprising a second neural net embedder and a second neural net predictor, wherein the second neural net predictor of the second system provides the desired protein property; and
(c) analyzing, by the second system, a primary amino acid sequence of a protein analyte, thereby generating a prediction of the desired protein property of the protein analyte.
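Steps (a) to (c) of aspect 1 can be illustrated with a minimal sketch. Everything here is a hypothetical toy, not the system described in this document: the `Embedder` is a per-residue lookup table with mean pooling, the `Predictor` is an untrained linear head, and the names `System` and `transfer` are chosen only for illustration.

```python
import copy
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


class Embedder:
    """Toy embedder: one vector per residue, mean-pooled over the sequence."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.table = {aa: [rng.gauss(0.0, 1.0) for _ in range(dim)]
                      for aa in AMINO_ACIDS}

    def __call__(self, sequence):
        rows = [self.table[aa] for aa in sequence]
        return [sum(col) / len(rows) for col in zip(*rows)]


class Predictor:
    """Toy predictor: a linear map from the embedding to n_outputs values."""

    def __init__(self, dim, n_outputs, seed=1):
        rng = random.Random(seed)
        self.weights = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                        for _ in range(n_outputs)]

    def __call__(self, embedding):
        return [sum(w * x for w, x in zip(row, embedding))
                for row in self.weights]


class System:
    def __init__(self, embedder, predictor):
        self.embedder = embedder
        self.predictor = predictor

    def predict(self, sequence):
        return self.predictor(self.embedder(sequence))


def transfer(pretrained, n_outputs):
    """Step (b): copy the embedder, attach a fresh predictor for the new task."""
    embedder = copy.deepcopy(pretrained.embedder)
    return System(embedder, Predictor(embedder.dim, n_outputs, seed=2))


# Step (a): a stand-in "pre-trained" system whose head predicts 5 other labels.
first_system = System(Embedder(dim=8), Predictor(dim=8, n_outputs=5))
# Step (b): the second system reuses the embedder but predicts one new property.
second_system = transfer(first_system, n_outputs=1)
# Step (c): analyze a primary amino acid sequence with the second system.
prediction = second_system.predict("MKTAYIAKQR")
```

In practice the second system would then be trained on data labeled with the desired property; training is omitted from this sketch.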
[Aspect 2]
The method of aspect 1, wherein the architectures of the neural net embedders of the first and second systems are convolutional architectures independently selected from at least one of VGG16, VGG19, Deep ResNet, Inception/GoogLeNet (V1-V4), Inception/GoogLeNet ResNet, Xception, AlexNet, LeNet, MobileNet, DenseNet, NASNet, and MobileNet.
[Aspect 3]
The method of aspect 1, wherein the first system comprises a generative adversarial network (GAN) selected from a conditional GAN, DCGAN, CGAN, SGAN or progressive GAN, SAGAN, LSGAN, WGAN, EBGAN, BEGAN, or infoGAN.
[Aspect 4]
The method of aspect 3, wherein the first system comprises a recurrent neural network selected from Bi-LSTM/LSTM, Bi-GRU/GRU, or a transformer network.
[Aspect 5]
The method or system of aspect 3, wherein the first system comprises a variational autoencoder (VAE).
[Aspect 6]
The method of any one of aspects 1-5, wherein the embedder is trained on a set of at least 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 600, 700, 800, 900, 1000, or more amino acid sequences.
[Aspect 7]
The method of aspect 6, wherein the amino acid sequences comprise annotations spanning one or more functional representations including at least one of GP, Pfam, keywords, Kegg ontology, Interpro, SUPFAM, or OrthoDB.
[Aspect 8]
The method of aspect 7, wherein the amino acid sequences have at least about 10,000, 20,000, 30,000, 40,000, 50,000, 75,000, 100,000, 120,000, 140,000, 150,000, 160,000, or 170,000 possible annotations.
[Aspect 9]
The method of any one of aspects 1-8, wherein the second model has an improved performance measure compared to a model trained without the transferred embedder of the first model.
[Aspect 10]
The method of any one of aspects 1-9, wherein the first or second system is optimized by Adam, RMSprop, stochastic gradient descent (SGD) with momentum, SGD with momentum and Nesterov accelerated gradient, SGD without momentum, Adagrad, Adadelta, or NAdam.
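The three SGD variants named in aspect 10 differ only in how the update step is formed. The following is a minimal sketch on a one-dimensional quadratic; the objective, learning rate, momentum coefficient, and step counts are illustrative assumptions and are not taken from this document.

```python
def grad(x):
    """Gradient of the toy objective f(x) = (x - 3)^2, minimized at x = 3."""
    return 2.0 * (x - 3.0)


def sgd(x, lr=0.1, steps=100):
    """SGD without momentum: step directly along the negative gradient."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x


def sgd_momentum(x, lr=0.1, beta=0.9, steps=300, nesterov=False):
    """SGD with momentum; with nesterov=True the gradient is evaluated
    at the look-ahead point x + beta * v instead of at x."""
    v = 0.0
    for _ in range(steps):
        point = x + beta * v if nesterov else x
        v = beta * v - lr * grad(point)
        x += v
    return x


x_plain = sgd(0.0)
x_momentum = sgd_momentum(0.0)
x_nesterov = sgd_momentum(0.0, nesterov=True)
```

All three runs approach the minimizer at 3 on this toy problem; adaptive methods such as Adam or RMSprop additionally rescale the step per parameter and are not shown.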
[Aspect 11]
The method of any one of aspects 1-10, wherein the first and second models can be optimized using any of the following activation functions: softmax, ELU, SELU, softplus, softsign, ReLU, tanh, sigmoid, hard sigmoid, exponential, PReLU, LeakyReLU, or linear.
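For reference, several of the activation functions listed in aspect 11 can be written out as scalar functions. This is a sketch using commonly cited definitions; the default slopes, the SELU constants, and the piecewise-linear form of the hard sigmoid are conventions assumed here, and other libraries may use slightly different parameters.

```python
import math

SELU_ALPHA = 1.6732632423543772  # commonly cited SELU constants
SELU_SCALE = 1.0507009873554805


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    # PReLU has the same form, but alpha is learned during training.
    return x if x > 0 else alpha * x


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def selu(x):
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def softplus(x):
    # Fine for moderate x; very large x would overflow math.exp.
    return math.log1p(math.exp(x))


def softsign(x):
    return x / (1.0 + abs(x))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hard_sigmoid(x):
    # Piecewise-linear approximation of the sigmoid (one common convention).
    return max(0.0, min(1.0, 0.2 * x + 0.5))


def softmax(xs):
    # Subtract the maximum first for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

The hyperbolic tangent is available directly as `math.tanh`, and the linear activation is the identity.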
[Aspect 12]
The method of any one of aspects 1-11, wherein the neural net embedder comprises at least 10, 50, 100, 250, 500, 750, 1000, or more layers, and the predictor comprises at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more layers.
[Aspect 13]
The method of any one of aspects 1-12, wherein at least one of the first or second systems utilizes a regularization selected from early stopping, L1-L2 regularization, skip connections, or a combination thereof, wherein the regularization is performed on 1, 2, 3, 4, 5, or more layers.
[Aspect 14]
The method of aspect 13, wherein the regularization is performed using batch normalization.
[Aspect 15]
The method of aspect 13, wherein the regularization is performed using group normalization.
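The regularization options in aspects 13 to 15 can be sketched in plain Python. The function names, the patience rule used for early stopping, and the omission of learnable scale and shift parameters are simplifying assumptions made for illustration only.

```python
import math


def _normalize(values, eps=1e-5):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def batch_norm(batch):
    """Normalize each feature across the samples of a batch.
    batch: list of samples, each a list of feature values."""
    columns = [_normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


def group_norm(sample, num_groups):
    """Normalize groups of channels within a single sample.
    Assumes len(sample) is divisible by num_groups."""
    size = len(sample) // num_groups
    out = []
    for g in range(num_groups):
        out.extend(_normalize(sample[g * size:(g + 1) * size]))
    return out


def l1_l2_penalty(weights, l1, l2):
    """Combined L1-L2 penalty that would be added to the training loss."""
    return l1 * sum(abs(w) for w in weights) + l2 * sum(w * w for w in weights)


def early_stopping_epoch(val_losses, patience):
    """Return the epoch at which training stops: the first epoch that is
    `patience` epochs past the best validation loss seen so far."""
    best, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1


normed_batch = batch_norm([[1.0, 10.0], [3.0, 30.0]])
normed_sample = group_norm([1.0, 3.0, 10.0, 30.0], num_groups=2)
stop_epoch = early_stopping_epoch([1.0, 0.8, 0.9, 0.85, 0.95, 0.7], patience=2)
```

The key difference is the axis of normalization: batch normalization depends on the other samples in the batch, whereas group normalization is computed per sample and is therefore independent of batch size.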
[Aspect 16]
The method of any one of aspects 1-15, wherein the second model of the second system comprises the first model of the first system with the last layer of the first model removed.
[Aspect 17]
The method of aspect 16, wherein 2, 3, 4, 5, or more layers of the first model are removed in the transfer to the second model.
[Aspect 18]
The method of aspect 16 or 17, wherein the transferred layers are frozen during training of the second model.
[Aspect 19]
The method of aspect 16 or 17, wherein the transferred layers are not frozen during training of the second model.
[Aspect 20]
The method of any one of aspects 17-19, wherein the second model has 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more layers added to the transferred layers of the first model.
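Aspects 16 to 20 describe three independent choices when building the second model: how many trailing layers to drop, whether the transferred layers are frozen, and which new layers are appended. A minimal bookkeeping sketch follows; the `Layer` record, the layer names, and `build_second_model` are hypothetical and stand in for whatever layer objects a real framework provides.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Layer:
    name: str
    trainable: bool = True


def build_second_model(first_model, n_removed, new_layers, freeze):
    """Drop the last n_removed layers of the first model, optionally freeze
    the layers that are kept, and append the new task-specific layers."""
    kept = first_model[:len(first_model) - n_removed]
    transferred = [replace(layer, trainable=not freeze) for layer in kept]
    return transferred + list(new_layers)


first_model = [Layer("embed"), Layer("conv1"), Layer("conv2"),
               Layer("pretrain_head")]

# Remove the last layer, freeze what remains, and add one new output layer.
second_model = build_second_model(
    first_model, n_removed=1, new_layers=[Layer("new_head")], freeze=True)
```

With `freeze=False` the same call would instead yield a second model in which the transferred layers are fine-tuned together with the new layers.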
[Aspect 21]
The method of any one of aspects 1-20, wherein the neural net predictor of the second system predicts one or more of protein binding activity, nucleic acid binding activity, protein solubility, and protein stability.
[Aspect 22]
The method of any one of aspects 1-21, wherein the neural net predictor of the second system predicts protein fluorescence.
[Aspect 23]
The method of any one of aspects 1-22, wherein the neural net predictor of the second system predicts enzymatic activity.
[Aspect 24]
A computer-implemented method of identifying a previously unknown association between an amino acid sequence and a protein function, comprising:
(a) generating, with a first machine learning software module, a first model of a plurality of associations between a plurality of protein properties and a plurality of amino acid sequences;
(b) transferring the first model or a portion thereof to a second machine learning software module;
(c) generating, with the second machine learning software module, a second model comprising at least a portion of the first model; and
(d) identifying, based on the second model, the previously unknown association between the amino acid sequence and the protein function.
[Aspect 25]
The method of aspect 24, wherein the amino acid sequence comprises a primary protein structure.
[Aspect 26]
The method of aspect 24 or 25, wherein the amino acid sequence gives rise to a protein configuration that results in the protein function.
[Aspect 27]
The method of any one of aspects 24-26, wherein the protein function comprises fluorescence.
[Aspect 28]
The method of any one of aspects 24-27, wherein the protein function comprises enzymatic activity.
[Aspect 29]
The method of any one of aspects 24-28, wherein the protein function comprises nuclease activity.
[Aspect 30]
The method of any one of aspects 24-29, wherein the protein function comprises a degree of protein stability.
[Aspect 31]
The method of any one of aspects 24-30, wherein the plurality of protein properties and the plurality of amino acid sequences are from UniProt.
[Aspect 32]
The method of any one of aspects 24-31, wherein the plurality of protein properties comprises one or more of the labels GP, Pfam, keywords, Kegg ontology, Interpro, SUPFAM, and OrthoDB.
[Aspect 33]
The method of any one of aspects 24-32, wherein the plurality of amino acid sequences form primary, secondary, and tertiary protein structures of a plurality of proteins.
[Aspect 34]
The method of any one of aspects 24-33, wherein the first model is trained on input data comprising one or more of a multidimensional tensor, a representation of three-dimensional atomic positions, an adjacency matrix of pairwise interactions, and character embeddings.
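Two of the input representations named in aspect 34 are easy to make concrete. The sketch below builds a one-hot character encoding of a sequence and an adjacency matrix of pairwise contacts from three-dimensional positions. The 8.0 distance cutoff and the use of one coordinate per residue are assumptions chosen for illustration; this document does not specify them.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Character-level encoding: one row per residue, one column per letter."""
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[aa]] = 1
        rows.append(row)
    return rows


def contact_map(coords, cutoff=8.0):
    """Adjacency matrix of pairwise interactions: entry (i, j) is 1 when
    residues i and j are distinct and within `cutoff` of each other."""
    n = len(coords)
    return [[1 if i != j and math.dist(coords[i], coords[j]) <= cutoff else 0
             for j in range(n)]
            for i in range(n)]


encoded = one_hot("ACD")
contacts = contact_map([(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (20.0, 0.0, 0.0)])
```

A learned character embedding would replace each one-hot row with a trainable dense vector, and stacking such matrices across sequences gives the multidimensional tensor form.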
[Aspect 35]
The method of any one of aspects 24-34, comprising inputting into the second machine learning module at least one of data associated with variations in primary amino acid sequences, a contact map of amino acid interactions, a tertiary protein structure, and predicted isoforms from alternatively spliced transcripts.
[Aspect 36]
The method of any one of aspects 24-35, wherein the first model and the second model are trained using supervised learning.
[Aspect 37]
The method of any one of aspects 24-36, wherein the first model is trained using supervised learning and the second model is trained using unsupervised learning.
[Aspect 38]
The method of any one of aspects 24-37, wherein the first model and the second model comprise a neural network comprising a convolutional neural network, a generative adversarial network, a recurrent neural network, or a variational autoencoder.
[Aspect 39]
The method of aspect 38, wherein the first model and the second model each comprise a different neural network architecture.
[Aspect 40]
The method of aspect 38 or 39, wherein the convolutional network comprises one of VGG16, VGG19, Deep ResNet, Inception/GoogLeNet (V1-V4), Inception/GoogLeNet ResNet, Xception, AlexNet, LeNet, MobileNet, DenseNet, NASNet, or MobileNet.
[Aspect 41]
The method of any one of aspects 24-40, wherein the first model comprises an embedder and the second model comprises a predictor.
[Aspect 42]
The method of aspect 41, wherein the first model architecture comprises a plurality of layers and the second model architecture comprises at least two layers of the plurality of layers.
[Aspect 43]
The method of any one of aspects 24-42, wherein the first machine learning software module trains the first model on a first training dataset comprising at least 10,000 protein properties, and the second machine learning software module trains the second model using a second training dataset.
[Aspect 44]
A computer system for identifying a previously unknown association between an amino acid sequence and a protein function, comprising:
(a) a processor; and
(b) a non-transitory computer-readable medium having instructions stored therein,
wherein the instructions, when executed, are configured to cause the processor to:
(i) generate, with a first machine learning software model, a first model of a plurality of associations between a plurality of protein properties and a plurality of amino acid sequences;
(ii) transfer the first model or a portion thereof to a second machine learning software module;
(iii) generate, with the second machine learning software module, a second model comprising at least a portion of the first model; and
(iv) identify, based on the second model, the previously unknown association between the amino acid sequence and the protein function.
[Aspect 45]
The system of aspect 44, wherein the amino acid sequence comprises a primary protein structure.
[Aspect 46]
The system of aspect 44 or 45, wherein the amino acid sequence gives rise to a protein configuration that results in the protein function.
[Aspect 47]
The system of any one of aspects 44-46, wherein the protein function comprises fluorescence.
[Aspect 48]
The system of any one of aspects 44-47, wherein the protein function comprises enzymatic activity.
[Aspect 49]
The system of any one of aspects 44-48, wherein the protein function comprises nuclease activity.
[Aspect 50]
The system of any one of aspects 44-49, wherein the protein function comprises a degree of protein stability.
[Aspect 51]
The system of any one of aspects 44-50, wherein the plurality of protein properties and a plurality of protein markers are from UniProt.
[Aspect 52]
The system of any one of aspects 44-51, wherein the plurality of protein properties comprises one or more of the labels GP, Pfam, keywords, Kegg ontology, Interpro, SUPFAM, and OrthoDB.
[Aspect 53]
The system of any one of aspects 44-52, wherein the plurality of amino acid sequences comprises primary protein structures, secondary protein structures, and tertiary protein structures of a plurality of proteins.
[Aspect 54]
The system of any one of aspects 44-53, wherein the first model is trained on input data comprising one or more of a multidimensional tensor, a representation of three-dimensional atomic positions, an adjacency matrix of pairwise interactions, and character embeddings.
[Aspect 55]
The system of any one of aspects 44-54, wherein the software is configured to cause the processor to input into the second machine learning module at least one of data associated with variations in primary amino acid sequences, a contact map of amino acid interactions, a tertiary protein structure, and predicted isoforms from alternatively spliced transcripts.
[Aspect 56]
The system of any one of aspects 44-55, wherein the first model and the second model are trained using supervised learning.
[Aspect 57]
The system of any one of aspects 44-56, wherein the first model is trained using supervised learning and the second model is trained using unsupervised learning.
[Aspect 58]
The system of any one of aspects 44-57, wherein the first model and the second model comprise a neural network comprising a convolutional neural network, a generative adversarial network, a recurrent neural network, or a variational autoencoder.
[Aspect 59]
The system of aspect 58, wherein the first model and the second model each comprise a different neural network architecture.
[Aspect 60]
The system of aspect 58 or 59, wherein the convolutional network comprises one of VGG16, VGG19, Deep ResNet, Inception/GoogLeNet (V1-V4), Inception/GoogLeNet ResNet, Xception, AlexNet, LeNet, MobileNet, DenseNet, NASNet, or MobileNet.
[Aspect 61]
The system of any one of aspects 44-60, wherein the first model comprises an embedder and the second model comprises a predictor.
[Aspect 62]
The system of aspect 61, wherein the first model architecture comprises a plurality of layers and the second model architecture comprises at least two layers of the plurality of layers.
[Aspect 63]
The system of any one of aspects 44-62, wherein the first machine learning software module trains the first model on a first training dataset comprising at least 10,000 protein properties, and the second machine learning software module trains the second model using a second training dataset.
[Aspect 64]
A method of modeling a desired protein property, comprising:
training a first system using a first dataset, the first system comprising a first neural net transformer encoder and a first decoder, wherein the first decoder of the pre-trained system is configured to generate an output different from the desired protein property;
transferring at least a portion of the first transformer encoder of the pre-trained system to a second system, the second system comprising a second transformer encoder and a second decoder;
training the second system using a second dataset, the second dataset comprising a set of proteins representing fewer protein classes than the first dataset, wherein the protein classes include one or more of (a) classes of proteins in the first dataset and (b) classes of proteins excluded from the first dataset; and
analyzing, by the second system, a primary amino acid sequence of a protein analyte, thereby generating a prediction of the desired protein property of the protein analyte.
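The relationship between the two datasets in aspect 64 can be stated as a small check: the second dataset must represent fewer protein classes than the first, and each of its classes is either shared with the first dataset or held out of it. The helper name and the class labels below are hypothetical and serve only to illustrate that condition.

```python
def describe_second_dataset(first_classes, second_classes):
    """Check the class-count condition of aspect 64 and split the second
    dataset's classes into those shared with, or excluded from, the first."""
    first, second = set(first_classes), set(second_classes)
    if len(second) >= len(first):
        raise ValueError("second dataset must represent fewer protein classes")
    return {
        "in_first": sorted(second & first),        # case (a)
        "excluded_from_first": sorted(second - first),  # case (b)
    }


# Hypothetical class labels: a broad pre-training set and a narrow second set.
pretraining_classes = ["enzyme", "transporter", "receptor", "structural"]
split_shared = describe_second_dataset(pretraining_classes, ["enzyme"])
split_held_out = describe_second_dataset(pretraining_classes, ["antibody"])
```

Either outcome satisfies the aspect; the distinction matters mainly for judging how far the second task lies from what the encoder saw during pre-training.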
[Aspect 65]
The method of aspect 64, wherein the primary amino acid sequences of protein analytes are one or more asparaginase sequences and corresponding activity labels.
[Aspect 66]
The method of aspect 64 or 65, wherein the first dataset comprises a set of proteins comprising a plurality of classes of proteins.
[Aspect 67]
The method of any one of aspects 64-66, wherein the second dataset is one of the classes of proteins.
[Aspect 68]
The method of any one of aspects 64-67, wherein one of the classes of proteins is enzymes.
[Aspect 69]
A system configured to perform the method of any one of aspects 64-68.

Claims

1. A computer-implemented method of modeling a desired protein property, comprising:
(a) providing a first pre-trained system comprising a first neural net embedder and a first neural net predictor, wherein the first neural net predictor of the pre-trained system is different from the desired protein property;
(b) transferring at least a portion of the first neural net embedder of the pre-trained system to a second system, the second system comprising a second neural net embedder and a second neural net predictor, wherein the second neural net predictor of the second system provides the desired protein property; and
(c) analyzing a primary amino acid sequence of a protein analyte by the second system, which comprises at least a portion of the transferred first neural net embedder, the second neural net embedder of the second system, and the second neural net predictor of the second system, thereby generating a prediction of the desired protein property of the protein analyte.

2. The computer-implemented method of claim 1, wherein the amino acid sequence comprises annotations spanning one or more functional representations including at least one of GP, Pfam, keywords, Kegg ontology, Interpro, SUPFAM, or OrthoDB.

3. The computer-implemented method of claim 1 or 2, wherein the second model has an improved performance measure compared to a model trained without the transferred embedder of the first model.

4. The computer-implemented method of any one of claims 1-3, wherein the second model of the second system comprises the first model of the first system with the last layer of the first model removed.

5. The computer-implemented method of claim 4, wherein 2, 3, 4, 5, or more layers of the first model are removed in the transfer to the second model.

6. The computer-implemented method of claim 4 or 5, wherein the transferred layers are frozen during training of the second model.

7. The computer-implemented method of claim 4 or 5, wherein the transferred layers are not frozen during training of the second model.

8. The computer-implemented method of any one of claims 5-7, wherein the second model has 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more layers added to the transferred layers of the first model.

9. The computer-implemented method of any one of claims 1-8, wherein the neural net predictor of the second system predicts one or more of protein binding activity, nucleic acid binding activity, protein solubility, and protein stability.

10. The computer-implemented method of any one of claims 1-9, wherein the neural net predictor of the second system predicts protein fluorescence.

11. The computer-implemented method of any one of claims 1-10, wherein the neural net predictor of the second system predicts enzymatic activity.

12. A computer-implemented method of identifying a previously unknown association between an amino acid sequence and a protein function, comprising:
(a) generating, with a first machine learning software module, a first model of a plurality of associations between a plurality of protein properties and a plurality of amino acid sequences;
(b) transferring the first model or a portion thereof to a second machine learning software module;
(c) generating, with the second machine learning software module, a second model comprising at least a portion of the first model; and
(d) identifying, based on the second model, the previously unknown association between the amino acid sequence and the protein function.

13. A computer system for identifying a previously unknown association between an amino acid sequence and a protein function, comprising:
(a) a processor; and
(b) a non-transitory computer-readable medium having instructions stored therein,
wherein the instructions, when executed, are configured to cause the processor to:
(i) generate, with a first machine learning software model, a first model of a plurality of associations between a plurality of protein properties and a plurality of amino acid sequences;
(ii) transfer the first model or a portion thereof to a second machine learning software module;
(iii) generate, with the second machine learning software module, a second model comprising at least a portion of the first model; and
(iv) identify, based on the second model, the previously unknown association between the amino acid sequence and the protein function.

14. A computer-implemented method of modeling a desired protein property, comprising:
training a first system using a first dataset, the first system comprising a first neural net transformer encoder and a first decoder, wherein the first decoder of the pre-trained system is configured to generate an output different from the desired protein property;
transferring at least a portion of the first transformer encoder of the pre-trained system to a second system, the second system comprising a second transformer encoder and a second decoder;
training the second system using a second dataset, the second dataset comprising a set of proteins representing fewer protein classes than the first dataset, wherein the protein classes include one or more of (a) classes of proteins in the first dataset and (b) classes of proteins excluded from the first dataset; and
analyzing, by the second system, a primary amino acid sequence of a protein analyte, thereby generating a prediction of the desired protein property of the protein analyte.

15. The computer-implemented method of claim 14, wherein the primary amino acid sequences of protein analytes are one or more asparaginase sequences and corresponding activity labels.