JP2020525939A

JP2020525939A - Iterative feature selection method

Info

Publication number: JP2020525939A
Application number: JP2019572105A
Authority: JP
Inventors: リリー，パトリック
Original assignee: Liquid Biosciences, Inc.
Priority date: 2017-06-28
Filing date: 2017-06-28
Publication date: 2020-08-27
Anticipated expiration: 2037-06-28
Also published as: EP3646207A4; EP3646207A1; WO2019005049A1; JP6741888B1

Abstract

Feature selection methods and processes that facilitate reducing the model components available for iterative modeling. It has been found that model components that do not contribute meaningfully to a solution can be identified in advance and discarded, dramatically decreasing the computational requirements of iterative programming techniques. This development unlocks the ability of iterative modeling to solve complex problems that would previously have required computation time orders of magnitude too great to be useful. [Selected drawing] FIG. 6

Description

The field of the invention is iterative feature selection.

The background description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided in this application is prior art or relevant to the claimed invention, or that any publication specifically or implicitly referenced is prior art.

As data becomes more available and datasets grow in size, many analytical processes suffer from the "curse of dimensionality." The phrase, coined by Richard E. Bellman ("Adaptive control processes: a guided tour," 1961, Princeton University Press), refers to problems that arise when analyzing and organizing data in hyper-dimensional spaces (e.g., datasets with hundreds, thousands, or millions of features or variables) and that do not occur in low-dimensional settings.

All publications identified herein are incorporated by reference to the same extent as if each individual publication or patent application were specifically and individually indicated to be incorporated by reference. Where a definition or use of a term in an incorporated reference is inconsistent with or contrary to the definition of that term provided herein, the definition provided herein applies and the definition in the reference does not apply.

Although computer technology continues to advance, processing and analyzing hyper-dimensional datasets is computationally intensive. For example, in an iterative modeling process, the computation time required to search through all possible combinations of model components increases exponentially with each additional model component. In particular, there is a need to reduce computational requirements in hyper-dimensional spaces in a way that makes techniques such as iterative modeling processes better suited to solving complex problems using large datasets. One way to reduce the computational requirements of an iterative modeling process is to reduce the universe of algorithm components available to the modeling process.
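As a rough illustration of the exponential growth described above, the following Python sketch counts candidate models under a deliberately simplified assumption that is not taken from the source: a model is any non-empty, unordered subset of the available components, with no repetition. The function name `candidate_model_count` is hypothetical.

```python
def candidate_model_count(n_components):
    """Number of non-empty subsets of n_components distinct components.

    Simplifying assumption (not from the source): a model is just a
    non-empty, unordered selection of components without repetition.
    """
    return 2 ** n_components - 1


# Under this assumption, each added component roughly doubles the search space.
for n in (10, 11, 12):
    print(n, candidate_model_count(n))
```

Real iterative modeling processes typically allow ordering and repetition of components, so the true search space would be larger still; the sketch only shows why removing even a few components can matter.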

It has yet to be appreciated that the number of algorithm components available to an iterative modeling process can be dramatically reduced by determining which components are, and which are not, important to a solution.

Thus, there is still a need in the art for iterative feature selection methods applied to iterative modeling processes.

The present invention provides apparatus, systems, and methods in which model components are eliminated as candidate model components for developing models in an iterative modeling process.

In one aspect of the inventive subject matter, a method of decreasing the computation time required to improve models that relate predictors and outcomes in a dataset is contemplated. The method includes several steps. First, models are generated using model components from a pool of model components. Using a subset of the dataset, a model attribute metric (e.g., accuracy, sensitivity, specificity, area under the curve (AUC) from a receiver operating characteristic (ROC) metric, or algorithm length) is generated for each model. Next, for some of the model components, a utility metric is calculated that is the ratio of (1) the quantity of models in which each model component is present to (2) the quantity of model component pools in which each model component is present. A weighted utility metric corresponding to each model component can then be calculated.

In some embodiments, the weighted utility metric is the output of a function involving (1) the model attribute metrics for the models in which the model component is present and (2) the utility metric for that model component. Based on the weighted utility metric, certain model components are eliminated from or retained in the pool of model components. In some embodiments, the function includes the product of the model attribute metric and the utility metric.

In some embodiments, the model components are randomly generated. Model components can be, among other things, computational operators, mathematical operators, constants, predictors, features, variables, ternary operators, algorithms, formulas, binary operators, weights, gradients, nodes, or hyperparameters.

It should be appreciated that the disclosed subject matter provides advantageous technical effects, including improved operation of a computer, by dramatically decreasing the computational cycles required to perform certain tasks (e.g., genetic programming). Without the inventive subject matter, iterative modeling methods often fail to provide a tenable solution, largely because of prohibitive computational requirements that can amount to months or years of computation time.

Various objects, features, aspects, and advantages of the inventive subject matter will become more apparent from the following detailed description of preferred embodiments, along with the accompanying drawing figures, in which like numerals represent like components.

FIG. 1 illustrates a general framework of an iterative modeling process.
FIG. 2 illustrates a contemplated method of determining a utility metric for a model component.
FIG. 3 illustrates a contemplated method of determining a model attribute metric.
FIG. 4 illustrates a contemplated method of calculating a weighted utility metric.
FIG. 5 illustrates a contemplated method of eliminating a given model component from, or retaining it in, a pool of model components.
FIG. 6 shows one contemplated embodiment including a run having models, a series of generations, and a "best" model.
FIG. 7 shows a pool of model components corresponding to the run of FIG. 6.
FIG. 8 shows one contemplated embodiment including a series of runs, each having models, a series of generations, and a "best" model.
FIG. 9 shows a series of model component pools corresponding to the runs of FIG. 8.
FIG. 10 illustrates a contemplated method of eliminating a given model component from, or retaining it in, a pool of model components.
FIG. 11 illustrates another contemplated method of eliminating a given model component from, or retaining it in, a pool of model components.

The following discussion provides example embodiments of the inventive subject matter. Although each embodiment represents a single combination of inventive elements, the inventive subject matter is considered to include all possible combinations of the disclosed elements. Thus, if one embodiment comprises elements A, B, and C, and a second embodiment comprises elements B and D, then the inventive subject matter is also considered to include the other remaining combinations of A, B, C, or D, even if not explicitly disclosed.

As used in the description in this application and throughout the claims that follow, the meaning of "a," "an," and "the" includes plural reference unless the context clearly dictates otherwise. Also, as used in the description in this application, the meaning of "in" includes "in" and "on" unless the context clearly dictates otherwise.

Also, as used in this application, and unless the context dictates otherwise, the term "coupled to" is intended to include both direct coupling (in which two elements that are coupled to each other contact each other) and indirect coupling (in which at least one additional element is located between the two elements). Therefore, the terms "coupled to" and "coupled with" are used synonymously.

In some embodiments, the numbers expressing quantities of ingredients, properties such as concentration, reaction conditions, and so forth, used to describe and claim certain embodiments of the invention are to be understood as being modified in some instances by the term "about." Accordingly, in some embodiments, the numerical parameters set forth in the written description and attached claims are approximations that can vary depending upon the desired properties sought to be obtained by a particular embodiment. In some embodiments, the numerical parameters should be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of some embodiments of the invention are approximations, the numerical values set forth in the specific examples are reported as precisely as practicable. The numerical values presented in some embodiments of the invention may contain certain errors necessarily resulting from the standard deviation found in their respective testing measurements. Moreover, unless the context dictates the contrary, all ranges set forth in this application should be interpreted as being inclusive of their endpoints, and open-ended ranges should be interpreted to include only commercially practical values. Similarly, all lists of values should be considered as inclusive of intermediate values unless the context indicates the contrary.

It should be noted that any language directed to a computer should be read to include any suitable combination of computing devices, including servers, interfaces, systems, databases, agents, peers, engines, controllers, or other types of computing devices operating individually or collectively. One should appreciate that the computing devices comprise a processor configured to execute software instructions stored on a tangible, non-transitory computer-readable storage medium (e.g., hard drive, solid state drive, RAM, flash, ROM, etc.). The software instructions preferably configure the computing device to provide the roles, responsibilities, or other functionality discussed below with respect to the disclosed apparatus. In especially preferred embodiments, the various servers, systems, databases, or interfaces exchange data using standardized protocols or algorithms, possibly based on HTTP, HTTPS, AES, public-private key exchanges, web service APIs, known financial transaction protocols, or other electronic information exchange methods. Data exchanges preferably are conducted over a packet-switched network, the Internet, a LAN, WAN, VPN, or other type of packet-switched network. The following description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided in this application is prior art or relevant to the claimed invention, or that any publication specifically or implicitly referenced is prior art.

As used in this application, terms such as "set" or "subset" should be construed to include one or more items. Unless otherwise noted, a "set" does not necessarily include two or more items.

One purpose of the inventive subject matter is to identify and eliminate low-performing (e.g., unnecessary or unneeded) model components used to create models that describe the relationship between predictors and outcomes in a target dataset. Pruning the number of possible model components improves computational efficiency by reducing the computation time required to converge on a high-performing model in an iterative modeling process.

The inventive subject matter has several phases, which can be implemented as method steps.

In one contemplated embodiment of the inventive subject matter, the first phase is to use an iterative modeling process to generate a set of models from a pool of model components. FIG. 1 shows a general iterative modeling framework in which the model components in the set {c_1, ..., c_z} undergo a modeling process to generate models m_1 through m_n.

As used herein, the term "iterative modeling process" refers to a modeling method for creating one or more models that describe the relationship between predictors and outcomes in a target dataset, where the method includes a repeatable or loopable subroutine or process (e.g., a run, a for loop, an epoch, a cycle).

Contemplated iterative modeling processes include deep learning methods such as artificial neural networks (ANN), convolutional neural networks (CNN), recurrent neural networks, deep Boltzmann machines (DBM), deep belief networks (DBN), stacked autoencoders, and other modeling techniques derived from a neural network framework.

Additionally or alternatively, contemplated iterative modeling processes include evolutionary programming methods, including genetic algorithms and genetic programming (e.g., tree-based genetic programming, stack-based genetic programming, linear (including machine code) genetic programming, grammatical evolution, extended compact genetic programming (ECGP), embedded Cartesian genetic programming (ECGP), probabilistic incremental program evolution (PIPE), and typed genetic programming (STEP)). Other evolutionary programming methods include gene expression programming, evolution strategies, differential evolution, neuroevolution, learning classifier systems, or reinforcement learning systems, in which the solution is a set of classifiers (rules or conditions) that can be of binary, real-valued, neural net, or S-expression type. In the case of learning classifier systems, fitness can be determined with either a strength-based or accuracy-based reinforcement learning approach or a supervised learning approach.

Additional or alternative contemplated iterative modeling processes can include Monte Carlo methods, Markov chains, stepwise linear logistic regression, decision trees, random forests, support vector machines, Bayesian modeling techniques, or gradient boosting techniques, so long as the process includes a repeatable or loopable subroutine or process (e.g., a run, a for loop, an epoch, a cycle).

In the next phase, utility metrics are calculated for selected model components and model attribute metrics are calculated for selected models. A weighted utility metric is then calculated from each utility metric and one or more model attribute metrics. Based on the weighted utility metrics, some model components are eliminated from the model component pool while others are left in place. This pruning process improves a computer's ability to perform iterative modeling methods by reducing the number of model components and thereby the dimensionality of the search space, as described in more detail below.

In some embodiments, each model component has a utility metric calculated for it. The utility metric, one embodiment of which is shown in FIG. 2, is a ratio in which the numerator is the number of times the model component appears in models and the denominator is the number of times the model component appears in model component pools.
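The ratio described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes models and pools are plain lists of component names, it counts every occurrence (rather than mere presence), and the function name `utility_metrics` and the sample data are hypothetical.

```python
from collections import Counter


def utility_metrics(models, pools):
    """Map each pooled component to (appearances in models) / (appearances in pools).

    Assumption: `models` and `pools` are lists of lists of component names.
    """
    in_models = Counter(c for model in models for c in model)
    in_pools = Counter(c for pool in pools for c in pool)
    # Counter returns 0 for components that never appear in any model.
    return {c: in_models[c] / n for c, n in in_pools.items()}


# Hypothetical example data: two models built from a single pool.
example_models = [["x1", "+", "x2"], ["x1", "*", "x1"]]
example_pools = [["x1", "x2", "+", "*", "sin"]]
example_utility = utility_metrics(example_models, example_pools)
```

Here `sin` never appears in a model, so its utility metric is zero, while `x1` appears three times against a single pool entry.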

In some contemplated embodiments, model components can include, for example, computational operators (e.g., logical statements such as IF, AND, OR), mathematical operators (e.g., arithmetic operations such as multiplication, division, subtraction, and addition; trigonometric operations; logistic functions; calculus operations; "floor" or "ceiling" operators; or any other mathematical operators), constants (e.g., a constant numerical value, including integers or values such as π), predictors (e.g., observed or measured values, or formulas), features (e.g., characteristics), variables, ternary operators (e.g., an operator that takes three arguments, where the first argument is a comparison argument, the second is the result upon a true comparison, and the third is the result upon a false comparison), algorithms, formulas, literals, functions (e.g., unary functions, binary functions, etc.), binary operators (e.g., an operator that operates on two operands and returns a result), weights and weight vectors, nodes and hidden nodes, gradient descent, sigmoid activation functions, hyperparameters, and biases.

FIG. 3 shows how a model attribute metric can be determined. In some contemplated embodiments, the model attribute metric describes a model's ability to predict an outcome using predictors, that is, its accuracy, expressed as a percentage. Data from the dataset are used to determine the model attribute metric: the dataset contains predictors and outcomes, and the model attribute metric is determined by giving the model only the predictors and then comparing the outcomes the model produces with the actual outcomes in the dataset. For example, if a model uses a set of predictors to correctly predict the outcome 35% of the time, the model attribute metric for that model is 35%.
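The accuracy calculation described above can be sketched as follows. This is only an illustration under stated assumptions: a model is represented as a Python callable that takes a dictionary of predictors, the dataset is a list of (predictors, outcome) pairs, and the names `accuracy_percent`, `toy_model`, and `toy_rows` are hypothetical.

```python
def accuracy_percent(model, rows):
    """Percentage of rows whose outcome the model predicts correctly.

    Assumption: `rows` is a sequence of (predictors, outcome) pairs and
    `model` is a callable that sees only the predictors.
    """
    rows = list(rows)
    correct = sum(1 for predictors, outcome in rows if model(predictors) == outcome)
    return 100.0 * correct / len(rows)


# Hypothetical toy model: predict a positive outcome when x is positive.
def toy_model(predictors):
    return predictors["x"] > 0


toy_rows = [({"x": 1}, True), ({"x": -1}, False), ({"x": 2}, False), ({"x": 3}, True)]
toy_accuracy = accuracy_percent(toy_model, toy_rows)  # three of four rows are correct
```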

In other embodiments, the model attribute metric can additionally or alternatively be sensitivity, specificity, area under the curve (AUC) from a receiver operating characteristic (ROC) metric, root mean square error (RMSE), algorithm length, algorithm computation time, the variables or components used, or another suitable model attribute. Although the model attribute metric can be determined using one or more of the identified model attributes, it is contemplated that the model attribute metric is not limited to these attributes.
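For reference, three of the alternative attributes named above have standard textbook definitions, sketched below in Python. The function names are hypothetical, binary outcomes are assumed to be truthy/falsy values, and the sketch assumes the data contain at least one actual positive and one actual negative.

```python
import math


def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity) for binary predictions.

    Assumes at least one actual positive and one actual negative.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # true positives
    fn = sum(1 for p, a in pairs if not p and a)      # false negatives
    tn = sum(1 for p, a in pairs if not p and not a)  # true negatives
    fp = sum(1 for p, a in pairs if p and not a)      # false positives
    return tp / (tp + fn), tn / (tn + fp)


def rmse(predicted, actual):
    """Root mean square error between two equal-length numeric sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))
```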

To determine whether a model component contributes sufficiently to model performance (e.g., whether a particular model component affects a model's ability to determine an outcome using a set of predictors), a weighted utility metric is generated as a function of each utility metric and one or more model attribute metrics, as shown in FIG. 4.
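One possible form of such a function is sketched below. The source states only that, in some embodiments, the function includes the product of the model attribute metric and the utility metric; averaging the attribute metrics over the models that contain the component is an assumption made here for illustration, as are the name `weighted_utility` and the sample data.

```python
from statistics import mean


def weighted_utility(component, utility, models, attribute_metrics):
    """Utility metric times the mean attribute metric of models containing the component.

    Assumptions: `models` is a list of lists of component names,
    `attribute_metrics` holds one metric per model (same order), and the
    mean is an illustrative way to combine several models' metrics.
    """
    containing = [
        metric
        for model, metric in zip(models, attribute_metrics)
        if component in model
    ]
    if not containing:
        return 0.0  # component used by no model
    return utility * mean(containing)


# Hypothetical example: x1 appears in the first two models only.
wu_models = [["x1", "+"], ["x1", "*"], ["x2", "+"]]
wu_metrics = [0.8, 0.6, 0.4]
wu_x1 = weighted_utility("x1", 2.0, wu_models, wu_metrics)
```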

It is contemplated that whether a model component is "important" or "unimportant" is determined by whether its weighted utility metric falls below or above a threshold. In some embodiments, the threshold can be calculated by first averaging all of the weighted utility metrics for the model components that appear in a set of models (e.g., {m_1, ..., m_n} in FIG. 1). Each weighted utility metric is then divided by a summary statistic (e.g., mean, trimean, variance, standard deviation, mode, median) of all the weighted utility metrics.

If the result of dividing a weighted utility metric by the summary statistic of all the weighted utility metrics is below a certain threshold (e.g., the result is less than 1.2, 1.1, 1, 0.9, 0.8, 0.7, 0.6, 0.5, or 0.4), the model component corresponding to that weighted utility metric is removed from consideration (e.g., it cannot be placed in any new model component pool used to generate a new set of runs). This process is shown in FIG. 5.
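The divide-and-compare step above can be sketched as follows, using the mean as the summary statistic and 0.5 as the default threshold (both drawn from the example values in the text). The function name `prune_by_relative_utility`, the treatment of a ratio exactly equal to the threshold as "keep", and the sample values are assumptions for illustration.

```python
from statistics import mean


def prune_by_relative_utility(weighted, threshold=0.5, summary=mean):
    """Split components into (kept, dropped) by weighted utility / summary statistic.

    Assumption: `weighted` maps component name -> weighted utility metric.
    """
    center = summary(list(weighted.values()))
    if center == 0:
        raise ValueError("summary statistic is zero; the ratio is undefined")
    kept, dropped = {}, {}
    for component, value in weighted.items():
        target = kept if value / center >= threshold else dropped
        target[component] = value
    return kept, dropped


# Hypothetical weighted utility metrics; their mean is 0.7.
example_weighted = {"x1": 1.4, "x2": 0.4, "sin": 0.0, "+": 1.0}
kept_half, dropped_half = prune_by_relative_utility(example_weighted, threshold=0.5)
kept_one, dropped_one = prune_by_relative_utility(example_weighted, threshold=1.0)
```

Another summary statistic from the listed examples, such as `statistics.median`, can be passed through the `summary` parameter.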

Other suitable methods of determining whether to keep or eliminate a model component are also contemplated. For example, in some embodiments the weighted utility metric is compared to a threshold without first undergoing any manipulation (e.g., the averaging and division described above, or any of the other processes described above). The threshold can be an arbitrary value or can be selected based on an understanding of expected weighted utility metric values. In these embodiments, once the weighted utility metric for a model component is calculated, it is compared to a predefined threshold, and based on that comparison the corresponding model component is either eliminated from all model component pools (e.g., when the weighted utility metric falls below the threshold) or retained for use in future runs.
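The direct comparison described above needs no summary statistic. A minimal Python sketch follows, in which the function name `keep_or_eliminate`, the decision labels, and the choice to keep a component whose metric exactly equals the threshold are assumptions.

```python
def keep_or_eliminate(weighted, threshold):
    """Label each component by comparing its raw weighted utility metric to a fixed threshold.

    Assumption: `weighted` maps component name -> weighted utility metric;
    values below `threshold` are eliminated, all others are kept.
    """
    return {
        component: ("keep" if value >= threshold else "eliminate")
        for component, value in weighted.items()
    }


decisions = keep_or_eliminate({"x1": 1.4, "x2": 0.4, "sin": 0.0}, threshold=0.5)
```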

Ultimately, it is contemplated that some model components will be found to be less useful than others based on their corresponding weighted utility metrics, and that model components whose utility falls below a threshold are discarded.

In some embodiments, after model components are removed from consideration, a new pool of model components is generated without the eliminated model components. In other embodiments, the model components are eliminated from the existing model component pool, and the same pool is used again to generate a set of models for a new set of runs. In still other embodiments, the model components are simply removed from consideration without being eliminated from the model component pool. From this point, the process can be repeated, and ultimately more model components may be eliminated. The process can be repeated as needed until all remaining model components are found to contribute meaningfully to the "best" model in each iteration or run.
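The repeat-until-stable behavior described above can be sketched as a loop. Everything here is illustrative: `score_components` stands in for the whole generate-models-and-score phase, which a real system would run on each pass, and the toy scorer below simply divides fixed, made-up base values by their mean over the current pool.

```python
from statistics import mean


def prune_until_stable(pool, score_components, threshold, max_rounds=100):
    """Repeatedly drop components scoring below `threshold` until none are dropped.

    Assumption: `score_components(pool)` returns {component: score} for the
    current pool (a stand-in for rerunning the modeling process).
    """
    pool = list(pool)
    for _ in range(max_rounds):
        if not pool:
            break
        scores = score_components(pool)
        survivors = [c for c in pool if scores[c] >= threshold]
        if len(survivors) == len(pool):
            break  # nothing eliminated this round, so the pool is stable
        pool = survivors
    return pool


# Hypothetical fixed base utilities used by the toy scorer.
BASE = {"x1": 4.0, "x2": 2.0, "sin": 0.3, "cos": 1.0}


def toy_scores(pool):
    center = mean(BASE[c] for c in pool)
    return {c: BASE[c] / center for c in pool}


stable_pool = prune_until_stable(["x1", "x2", "sin", "cos"], toy_scores, threshold=0.5)
```

With these toy values the first round drops `sin`, the second drops `cos` (its relative score falls once the mean rises), and the third round eliminates nothing.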

Through this process, model components are pruned from one or more pools of model components. Pruning model components in accordance with the inventive subject matter dramatically reduces the computation time required to perform iterative modeling (and related tasks).

Without wishing to be limited to one particular type of iterative modeling, a subset of embodiments of the inventive subject matter provides apparatus, systems, and methods in which model components are eliminated as candidate model components for developing models in a genetic programming process. The example of applying the inventive subject matter to genetic programming is useful for understanding its application to other iterative modeling techniques.

For example, in this subset of contemplated embodiments, the first phase is to use a genetic programming process to generate sets of models that make up a "run." The term "run" refers to a set of models that are manipulated to converge on a "best" model. Within a run, a set of models is generated using model components from a pool of model components.

This set of models is called a generation of models. In the next phase, the (randomly generated) models of the first generation are made to compete to determine which model or models in that generation perform best, and the next generation of models is then generated, in part using models from the previous generation (e.g., based on them or by replicating them). These phases are completed repeatedly over multiple generations within each run until one or more models are developed that adequately describe the relationship between predictors and outcomes in the dataset.
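The compete-then-regenerate cycle described above can be sketched as a very small generational loop. This is a generic toy, not the genetic programming process of the source: models are flat lists of component names, the better half of each generation is copied forward unchanged, each survivor also yields one child with a single randomly replaced component, and the names `run_generations` and `count_a` are hypothetical.

```python
import random


def run_generations(pool, fitness, n_models=20, model_len=3, n_generations=10, seed=0):
    """Toy run: random first generation, then repeated selection plus mutation.

    Assumptions: `fitness(model)` returns a number (higher is better) and
    `n_models` is even so the generation size stays constant.
    """
    rng = random.Random(seed)
    generation = [
        [rng.choice(pool) for _ in range(model_len)] for _ in range(n_models)
    ]
    for _ in range(n_generations):
        ranked = sorted(generation, key=fitness, reverse=True)
        parents = ranked[: n_models // 2]  # best-performing half survives
        children = []
        for parent in parents:
            child = list(parent)
            child[rng.randrange(model_len)] = rng.choice(pool)  # point mutation
            children.append(child)
        generation = parents + children
    return max(generation, key=fitness)  # the "best" model of the run


# Hypothetical fitness: number of "a" components in the model.
def count_a(model):
    return model.count("a")


best_model = run_generations(["a", "b", "c"], count_a)
```

Because the surviving parents are carried forward unchanged, the best fitness in this sketch can never decrease from one generation to the next.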

In the next phase, utility metrics are calculated for selected model components, and model attribute metrics are calculated for the selected model of each run.

The first generation of a run requires the generation of a set of models. Models of the inventive subject matter are described using the notation m_abc, where a is the run number, b is the generation number, and c is the model number. FIG. 6 shows a run with run number 1 whose first generation consists of models m_111 through m_11i. The value of i is the number of models in that generation. It is contemplated that i can be 10 to 1,000,000, more preferably 100 to 10,000, and most preferably 1,000 to 5,000.

Models m_111 through m_11i are randomly generated using various model components from the model component pool, as shown in FIG. 7. It is contemplated that a model is an algorithm and that model components are used to construct the algorithm. The model components in FIG. 7 are represented as the set {c_1, ..., c_z}. Every model component in a pool is available for use in the models corresponding to that model pool, but not every model component needs to be used. Furthermore, when a model component is used in one model, it remains available for use in other models.

As described elsewhere, to determine whether a particular model component affects a model's ability to determine outcomes using a set of predictors, a weighted utility metric is created as a function of each utility metric and one or more model attributes, as shown in FIG. 4.

In one aspect of the inventive subject matter, a first generation of models {m_111, ..., m_11i} is generated, and the models in that first generation compete with one another to determine which models perform best. For example, the competition can be a comparison of model performance (e.g., a model's ability to predict outcomes from a set of predictors). In some embodiments, after the models in each generation of a run compete with one another, a set of best-performing models is identified. In other embodiments, a single best-performing model is identified. It is contemplated that a top percentage of models ranked by performance (e.g., the top 1-5%, 5-10%, 10-20%, 20-30%, 30-40%, or 40-50%) can be considered the best-performing models of each generation.
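The top-percentage selection described above can be sketched as follows. This is a minimal illustration, not the implementation of the inventive subject matter: the function name `select_top_fraction`, the dictionary of accuracies, and the choice to round the cutoff up and always keep at least one model are all assumptions made for the example.

```python
import math

def select_top_fraction(scores, fraction):
    """Return the names of the best-scoring models, keeping at least one.

    `scores` maps a model name to a performance value (higher is better).
    Rounding up and keeping at least one model are illustrative assumptions.
    """
    keep = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]

# Hypothetical accuracies for a ten-model first generation of run 1.
accuracies = [0.52, 0.61, 0.48, 0.77, 0.58, 0.66, 0.71, 0.49, 0.55, 0.63]
generation = {f"m_11{c}": acc for c, acc in enumerate(accuracies, start=1)}

# Keep the top 20% of the generation (two of ten models).
best = select_top_fraction(generation, 0.20)
```

Any other performance measure could replace accuracy here, provided that larger values mean better models.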

The best-performing model can be characterized in several ways. For example, if a model uses predictors to correctly predict some percentage of outcomes (e.g., the outcomes are known, and the model's results are compared with the actual outcomes from the dataset), that percentage can be used to determine whether the model performs better than the other models in its generation. In such embodiments, the models in a generation "compete" against one another, with models that have a higher percent accuracy in determining outcomes from predictors "beating" models that have a lower percent accuracy. When some (or all but one) of the models in a generation are eliminated (e.g., losing models are removed from the set), the best models (or the single best model) remain.

In another example, the "best" model of a generation can be a model that has one or more preferred characteristics compared with the other models in that generation. For example, the "best" model can be the one that is "shortest" in terms of algorithm length (e.g., the model uses the fewest model components, whether counted by quantity, by type, or by distinct model components), that requires the least computation time to run, that has the highest training accuracy, that has the best standard process training validation, or that has the best training validation. Furthermore, the "best" model can be determined by a combination of these and any other factors discussed in this application.

Using one or more models from the first generation of a run that were identified as the best performers, a second generation of models can be generated. The second generation can be composed of several subsets of models. For example, one subset of the models in the next generation can be randomly generated using model components from the model pool (shown in FIG. 7), another subset can be generated by mutating models from the previous generation (e.g., one or more of the best models), and yet another subset can be generated by using models from the previous generation to create offspring (also called crossover).

In some embodiments, a subset of models from one generation is included, unchanged, in a subsequent generation (e.g., any subsequent generation). For example, one or more models from a previous generation (e.g., one or more "best" models) can be introduced into any subsequent generation to reduce the time required to converge on the "best" model for the run (the concept of a "best" model for a run is explained in more detail below). Thus, for example, in FIG. 6, upon reaching generation a, any of the models from generations 1 through a-1 can be included in generation a.

It is further contemplated that a run can be flagged so that models (e.g., "best" models) are introduced unchanged from one generation into a subsequent generation only after some number of generations (e.g., 10-100 generations, 100-150 generations, 150-250 generations) have been iterated. For example, in some embodiments, a "best" model from any of the previous generations can be incorporated into the 100th generation. In other embodiments, if a run is flagged so that "best" models can be carried over only after the 100th generation, the 101st generation can incorporate one or more "best" models from the 100th generation. In these embodiments, after the 100th generation, models from the 100th generation and earlier can be incorporated into later generations.

The term "crossover" refers to the combination of one or more models to generate a new model from one generation to the next. It is analogous to reproduction and biological crossover, on which genetic programming is based. In some embodiments, models can further be modified between generations using a fitness function (e.g., a particular type of objective function used to summarize, as a single figure of merit, how close a given design solution is to achieving the set aims) or using multi-generation evolution to solve a user-defined task (e.g., describing the relationship between predictors and outcomes).

Mutating a model is the creation of a new model based on one existing model. It is contemplated that a mutated model is one that is subtly changed or altered from its original form. Mutation can be used to maintain diversity from one generation of a population of models to the next. It is analogous to biological DNA mutation and involves altering one or more aspects of a model from its initial state.

One example of mutation involves implementing a probability that any given bit in a model will be changed from its original state. A common way to implement mutation is to generate a random variable for each bit in a sequence. This random variable indicates whether a particular bit is modified. This mutation procedure, which is based on biological point mutation, is called single-point mutation. Other types of mutation include inversion, floating-point mutation, swap, and scramble.
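The per-bit procedure described above can be sketched as follows. The sketch assumes, purely for illustration, that a model is encoded as a list of 0/1 bits; the function name `mutate_bits` and the `rate` parameter are hypothetical and do not appear in the source.

```python
import random

def mutate_bits(bits, rate, rng):
    """Return a copy of `bits` in which each bit is flipped with probability `rate`.

    One random draw is made per bit; the draw decides whether that bit changes.
    """
    return [b ^ 1 if rng.random() < rate else b for b in bits]

rng = random.Random(0)                  # seeded only to make the example repeatable
parent = [1, 0, 1, 1, 0, 0, 1, 0]       # hypothetical bit-encoded model
child = mutate_bits(parent, rate=0.1, rng=rng)
```

A low rate keeps the child close to its parent, which matches the description of a mutated model as one that is only subtly changed from its original form.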

Creating offspring of models is the creation of a new model based on two or more existing models. The offspring of two or more parent models takes features from the parent models and combines them to create a new model. Embodiments of the inventive subject matter use offspring to vary the features of models from one generation to the next. This is analogous to reproduction and biological crossover, on which the models of the inventive subject matter (e.g., genetic algorithms) are based. Crossover is the process of taking multiple (e.g., two or more) parent models and producing a child model from them.
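A two-parent case of crossover can be sketched as follows. The source does not specify a crossover operator, so one-point crossover is used here as a stand-in, and each model is assumed to be an equal-length list of model component names; both choices, along with the function name `one_point_crossover`, are assumptions for illustration.

```python
import random

def one_point_crossover(parent_a, parent_b, rng):
    """Build a child from a prefix of parent_a and the matching suffix of parent_b."""
    if len(parent_a) != len(parent_b) or len(parent_a) < 2:
        raise ValueError("parents must have equal length of at least 2")
    cut = rng.randrange(1, len(parent_a))   # cut point strictly inside the model
    return parent_a[:cut] + parent_b[cut:]

rng = random.Random(1)                      # seeded only to make the example repeatable
parent_a = ["c_1", "c_2", "c_3", "c_4"]     # hypothetical models built from pool components
parent_b = ["c_5", "c_6", "c_7", "c_8"]
child = one_point_crossover(parent_a, parent_b, rng)
```

Because the cut point always lies strictly inside the model, the child inherits at least one component from each parent.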

Using any number or combination of the techniques described above, the second generation of the run is created, shown in FIG. 6 as the set of models {m_121, ..., m_12j}. It is contemplated that in some embodiments each successive generation includes fewer models than the previous generation (e.g., j < i), while in other embodiments each successive generation has the same number of models as the previous generation (e.g., j = i). Likewise, each successive generation can include more models than the previous generation (e.g., j > i), or the number of models can vary from generation to generation (e.g., the second generation has fewer models than the first, and the third generation has more models than the second, or even more than the first).

The process of iterating through generations of models within a run can be carried out as many times as desired. In FIG. 6, the number of generations is represented by the variable a. Preferably, a is large enough that the resulting number of models can adequately traverse the dataset. For example, there should be enough generations for the models to take into account all possible variables (e.g., predictors) from the dataset. A larger dataset, for instance, may require more generations of models than a smaller dataset. In some embodiments, a can be 10 to 10,000 generations, more preferably 50 to 1,000 generations, and most preferably 100 to 500 generations. Generational evolution as described in the inventive subject matter can be classified as genetic programming. Because the inventive subject matter enables efficient elimination of model components, it is contemplated that its methods can be useful for dramatically improving computational efficiency in any method of iterative programming.

After iterating through a generations, the final generation of the run in FIG. 6 is reached. In some embodiments, the final generation of a run (e.g., generation a in FIG. 6) consists of a single model, but it is also contemplated that the final generation can consist of a set of models. In embodiments where the final generation of a run includes a set of models, one or more "best" models are again determined based on any of the criteria described above for determining which models are "best" in a generation. It is also contemplated that all models in the final generation can be considered the "best" models of that run. In embodiments where only one model exists in the final generation of a run, that model is necessarily considered the "best" model of that generation and therefore the "best" model of the run.

Once the "best" model or models of a run have been identified (e.g., in FIG. 6 the best model is labeled m_1a1), model attributes are calculated for the "best" models.

Because each model in a run is created using model components identified in a model pool, the "best" model or models from a particular run likewise use model components from the same model pool from which the first generation of models drew its model components. For example, FIG. 7 shows a pool of model components that can be used to generate the models in the run shown in FIG. 6. The model components used in the "best" model of the run shown in FIG. 6 are therefore necessarily drawn from the pool of model components shown in FIG. 7.

This matters for the step of calculating utility metrics for model components. In some embodiments, each model component used within a run (e.g., used in any generation of the run) has a utility metric calculated for it. In other embodiments, only the model components used in the "best" models have utility metrics calculated for them. In still other embodiments, utility metrics can be calculated for model components found in a subset of the models from a run (e.g., only the most recent 10%, 20%, 30%, 40%, 50%, 60%, or 70% of generations).

For example, in FIGS. 6 and 7, if a model component from the pool of model components appears in the "best" model (e.g., m_1a1), the numerator of that model component's utility metric is 1. If a model component appears multiple times within one model (or within multiple models that make up the "best" generation), the count still increases by only 1 for that model (or for that run). For example, if the "best" generation includes two models and both contain model component c_g, the numerator for c_g is incremented by only 1 for that run.

As for the denominator of the utility metric, it increases by 1 each time the model component appears in a pool of model components. For example, every model component in the pool of model components of FIG. 7 has a denominator of 1 for its utility metric. The denominator of the utility metric can be greater than 1 when there are two or more pools of model components.

FIGS. 8 and 9 show an embodiment of the inventive subject matter that implements X runs and Y pools of model components. It is contemplated that there can be one pool of model components per run (e.g., X = Y), with each pool corresponding specifically to a particular run, but it is likewise contemplated that there can be fewer model component pools than runs (e.g., X > Y) or more model component pools than runs (e.g., X < Y).

In determining the utility metric of a model component that appears in runs 1 through X of FIG. 8, the numerator can be between 0 and X (the total number of runs), and the denominator can be between 1 and Y (the total number of model component pools). For example, if a model component appears in the "best" models of two runs and the same model component is present in four model component pools, the utility metric is 0.5 (2 divided by 4). A utility metric is calculated for each model component, but if a model component does not appear in any "best" model in any run, it has a numerator of 0, and its utility metric is therefore 0.

It is contemplated that utility metrics can be calculated for every model component in every pool of model components. In some embodiments, however, utility metrics are calculated only for model components that appear in one or more models in the "best" generation of a run. Intuitively, if a model component does not appear in any "best" model, its numerator is necessarily 0. Calculating utility metrics for such model components can therefore be skipped; instead, all model components that do not appear in at least one "best" model can be eliminated from all model component pools without spending extra processor cycles.

For example, in FIG. 8 there are X runs, each of which has one best model (i.e., the models in the set {m_1a1, ..., m_Xc1}, the final generation of each run). Because it is contemplated that the pools of model components shown in FIG. 9 can have overlapping model components, model component c_1g may also be present in all or some of the other model pools. If model component c_1g appears in five model component pools (i.e., Y ≥ 5) and also appears in three of the "best" models of these runs, the utility metric for c_1g is 3/5, or 0.6.
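The counting rule for the utility metric can be sketched as follows: the numerator counts the runs whose "best" models contain the component (at most once per run), and the denominator counts the pools that contain it. The data layout (sets of component names) and the function name `utility_metric` are assumptions made for illustration.

```python
def utility_metric(component, best_models_by_run, pools):
    """Runs whose "best" models use the component, divided by pools containing it.

    `best_models_by_run` holds, for each run, a list of "best" models, each a
    set of component names. A run contributes at most 1 to the numerator even
    if several of its "best" models contain the component.
    """
    numerator = sum(
        any(component in model for model in best_models)
        for best_models in best_models_by_run
    )
    denominator = sum(component in pool for pool in pools)
    if denominator == 0:
        raise ValueError("component does not appear in any pool")
    return numerator / denominator

# Hypothetical data: five pools and five runs.
pools = [{"c_1g", "c_2", "c_3"} for _ in range(5)]
best_models_by_run = [
    [{"c_1g"}],
    [{"c_1g", "c_2"}, {"c_1g"}],   # two "best" models in one run still count once
    [{"c_1g"}],
    [{"c_2"}],
    [{"c_2"}],
]

u = utility_metric("c_1g", best_models_by_run, pools)   # 3 runs / 5 pools
```

With this data the result for c_1g matches the 3/5 example above, and a component such as c_3 that appears in no "best" model receives a utility metric of 0.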

A model attribute metric is needed for each model in which model component c_1g appears. To calculate the weighted utility metric, the utility metric of c_1g is multiplied by some function of the model attributes of the models in which c_1g appears. Those model attributes can, for example, be averaged. It is also contemplated that in other embodiments the median of the model attributes can be used, in others the mode, and in still others the geometric mean.

It is also contemplated that when a particular model component appears in many "best" models, outliers can be eliminated before calculating the mean, median, or mode (e.g., some number of the highest and lowest model attributes can be ignored before calculating the mean or determining the median of the model attributes). In some embodiments, other known mathematical operations or functions can be applied to the set of model attributes to arrive at a manipulated model attribute that can be used in calculating the weighted utility metric for a particular model component.

Thus, returning to the example above, if the utility metric for c_1g is 0.6 and the average of the model attributes is 30%, the weighted utility metric is 0.18 (0.6 multiplied by 0.30). This process is repeated for every model component that appears in the set of "best" models {m_1a1, ..., m_Xc1}, producing a weighted utility metric for each model component in that set.
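The weighting step can be sketched as follows, reproducing the 0.6 and 30% example. The function name `weighted_utility` and the individual attribute values are hypothetical; the arithmetic mean is the default combining function, and a median or geometric mean could be passed in instead.

```python
from statistics import mean

def weighted_utility(utility, model_attributes, combine=mean):
    """Multiply a utility metric by a summary of the model attributes.

    `model_attributes` are the attribute values of the "best" models in which
    the component appears; `combine` reduces them to a single number.
    """
    return utility * combine(model_attributes)

# Hypothetical attributes of the three "best" models containing c_1g;
# their arithmetic mean is 0.30.
attributes = [0.25, 0.30, 0.35]
w = weighted_utility(0.6, attributes)   # approximately 0.18
```

Swapping the combining function, for example `weighted_utility(0.6, attributes, combine=statistics.median)`, corresponds to the median-based embodiments mentioned earlier.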

The next phase of the method of the inventive subject matter requires determining which model components are considered important and which are considered unimportant. Model components that are "important" are reused and are eligible for placement in the new set of model pools used to generate the next set of runs. "Unimportant" model components are discarded and are not reused in the new set of model pools, which ensures that they are not used to create new models.

It is contemplated that whether a model component is "important" or "unimportant" is determined by whether its weighted utility metric falls below or exceeds a threshold. In some embodiments, the threshold test is computed by first averaging all of the weighted utility metrics for the model components that appear in the "best" set of models (e.g., {m_1a1, ..., m_Xc1} in FIG. 8). Each individual weighted utility metric is then divided by that average. If the result of dividing a weighted utility metric by the average of all weighted utility metrics is below a certain threshold (e.g., below 1.2, 1.1, 1, 0.9, 0.8, 0.7, 0.6, 0.5, or 0.4), the model component corresponding to that weighted utility metric is removed from consideration (e.g., it cannot be placed in any of the new model component pools used to generate the new set of runs). This process is shown in FIG. 5.
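The average-then-divide test can be sketched as follows. The function name `prune_by_relative_score`, the example metric values, and the threshold of 0.5 (one of the example values listed above) are illustrative assumptions.

```python
from statistics import mean

def prune_by_relative_score(weighted, threshold):
    """Keep components whose weighted utility, divided by the average, meets the threshold.

    `weighted` maps a component name to its weighted utility metric.
    Components whose ratio falls below `threshold` are dropped.
    """
    average = mean(weighted.values())
    if average == 0:
        raise ValueError("average weighted utility is zero")
    return {name: w for name, w in weighted.items() if w / average >= threshold}

# Hypothetical weighted utility metrics; their average is 0.10.
weighted = {"c_1": 0.18, "c_2": 0.02, "c_3": 0.10}
kept = prune_by_relative_score(weighted, threshold=0.5)
```

Here c_2 has a ratio of about 0.2 and is dropped, while c_1 (about 1.8) and c_3 (about 1.0) remain eligible for the new model component pools.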

Other suitable methods of deciding whether to keep or eliminate a model component are also contemplated. For example, in some embodiments the weighted utility metric is compared with a threshold directly, without first undergoing any manipulation (e.g., the averaging and division described above, or any of the other processes described above). The threshold can be an arbitrary value or can be chosen based on an understanding of the expected weighted utility metric values. In these embodiments, once the weighted utility metric for a model component is calculated, it is compared with the predefined threshold, and based on that comparison the corresponding model component is either eliminated from all model component pools (e.g., when the weighted utility metric is below the threshold) or retained for use in future runs.

In some embodiments, model components can be eliminated (e.g., removed from consideration) based on their corresponding utility metrics. To do this, once utility metrics have been calculated for a number of components, the utility metrics for those model components are analyzed using, for example, summary statistics. Contemplated summary statistics include measures of location (e.g., arithmetic mean, median, mode, and interquartile mean), dispersion (e.g., standard deviation, variance, range, interquartile range, absolute deviation, mean absolute difference, and distance standard deviation), shape (e.g., skewness or kurtosis, and alternatives based on L-moments), and dependence (e.g., Pearson product-moment correlation coefficient or Spearman rank correlation coefficient).

A utility metric can then be compared with a summary statistic to determine whether the corresponding model component should be kept or eliminated. For example, if the utility metric for a model component is compared with the arithmetic mean calculated from the set of utility metrics (e.g., the utility metric is divided by the mean of the set), the model component can be eliminated if the result is less than 1 (indicating that the model component is less influential or less useful than half of the model components whose utility metrics contribute to the mean). In another example, if a utility metric is more than one standard deviation below the mean, the model component corresponding to that utility metric can be eliminated. The main goal is to facilitate the elimination of model components that contribute less to the "best" models than other model components do. FIG. 11 illustrates this concept generally, with the threshold determined using summary statistics as described above.
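The standard-deviation rule can be sketched as follows. The source does not say whether a sample or population standard deviation is intended; this sketch assumes the sample standard deviation (`statistics.stdev`), and the function name `prune_below_one_sd` and the metric values are hypothetical.

```python
from statistics import mean, stdev

def prune_below_one_sd(utilities):
    """Drop components whose utility metric is more than one SD below the mean.

    Uses the sample standard deviation, which is an assumption; at least two
    utility metrics are required.
    """
    values = list(utilities.values())
    cutoff = mean(values) - stdev(values)
    return {name: u for name, u in utilities.items() if u >= cutoff}

# Hypothetical utility metrics: mean 0.4375, sample SD about 0.229, cutoff about 0.209.
utilities = {"c_1": 0.60, "c_2": 0.50, "c_3": 0.55, "c_4": 0.10}
kept = prune_below_one_sd(utilities)
```

Only c_4 falls below the cutoff in this example, so it is the only component eliminated.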

It is contemplated that in many situations a utility metric is compared with a summary statistic by dividing the individual utility metric by the summary statistic of the set of utility metrics. This works for some summary statistics (e.g., location summary statistics), but other summary statistics (e.g., dispersion summary statistics) require comparing the utility metric value with a range of values to see whether it falls within a desired range.

Instead of calculating the mean of the weighted utility metrics, it is also contemplated that the weighted utility metrics for the model components in the set of "best" models can be manipulated in other ways. For example, in some embodiments each individual weighted utility metric can be divided by the median of the set of weighted utility metrics. In other embodiments, the mode of the set of weighted utility metrics can be used in place of the mean or median.

In some embodiments, after model components are removed from consideration, new pools of model components are generated without the eliminated model components. In other embodiments, model components are eliminated from the existing model pools, and the same model pools are used again to generate the sets of models for the new set of runs. In still other embodiments, model components are simply removed from consideration without being removed from the model component pools. From this point on, the process repeats, and more model components may eventually be eliminated. The process can be repeated as needed until all remaining model components are found to contribute significantly to the "best" model in each run.

It is also contemplated that, when the next run is generated using a model component pool that has been trimmed of model components, the set of "best" models from the previous run can be incorporated into the next run. Thus, if a "best" model from a previous run includes a model component that would otherwise be discarded because it was determined to be unimportant, that model component can be reintroduced through the introduction of the "best" model from the previous run.

For example, a first run results in a "best" model, and a second run (which begins with a set of randomly generated models that use model components from the pruned model component pool) can include the "best" model from the first run in its initial set of randomly generated models. Doing so introduces elements of a previously identified effective model into the new run (e.g., it can revive one or more model components that would otherwise have been discarded), which improves the second run's ability to evolve a "best" model over generations.
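
A minimal sketch of seeding a second run with the first run's "best" model is shown below. It assumes, purely for illustration, that a model is a fixed-length tuple of component names; `random_model` and `initial_population` are hypothetical helpers, not names from the disclosure.

```python
import random


def random_model(pool, length, rng):
    """Build one random model as a tuple of components drawn from the pool."""
    return tuple(rng.choice(pool) for _ in range(length))


def initial_population(pool, size, length, rng, carried_over=()):
    """Start a run with carried-over "best" models, then fill up randomly."""
    population = list(carried_over)[:size]
    while len(population) < size:
        population.append(random_model(pool, length, rng))
    return population


rng = random.Random(0)
pruned_pool = ["x1", "+"]                  # "x7" was pruned after the first run
best_from_run_1 = [("x1", "+", "x7")]      # but the first run's "best" model used it

population = initial_population(pruned_pool, size=5, length=3, rng=rng,
                                carried_over=best_from_run_1)
# Components present anywhere in the second run's starting population.
present = {component for model in population for component in model}
```

Here `"x7"` is absent from the pruned pool yet present in the starting population, so later generations can still inherit it from the carried-over model.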

Through this process, model components are pruned from one or more pools of model components. Pruning model components according to the inventive subject matter dramatically reduces the computation time required to perform genetic programming (and related tasks).

The inventor contemplates that model components from the "best" models of previous runs can additionally be reincorporated, so that model components eliminated through the process described above can be brought back into consideration. The "best" models from past runs (e.g., the one or more models from each run that were found to be "best") may include model components that were eliminated for failing to meet the threshold to remain in consideration. These "best" models can be considered in the next run (as described above), so that model components that would otherwise have been eliminated are brought back into consideration. With reference to the drawings, it is contemplated, for example as shown in FIG. 8, that a model within any generation of run 2 (e.g., the final generation) can incorporate the "best" model from run 1 (m_1a1), thereby reintroducing any model component in model m_1a1 that might otherwise have been removed from consideration. This process is shown in FIG. 10.

In embodiments where model components are brought back into consideration in this way, it is contemplated that, instead of eliminating a model component from one or more pools of model components when it fails to meet the threshold, the model component is simply removed from consideration (e.g., it can remain in all model component pools, but it can no longer be used in any model). Thus, when a "best" model from one run is reintroduced into the next run, the denominator of the utility metric is non-zero, and the model component has an opportunity to be brought back into consideration. For example, if a model component is initially removed from consideration but is then reintroduced, and its weighted utility metric subsequently exceeds the threshold, the model component is brought back into consideration and can be used in subsequently generated models.
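
The bookkeeping behind this kind of soft exclusion can be sketched as follows. The sketch rests on several assumptions that are not stated in this passage: the weighted utility metric is taken to be the sum of the model attribute metrics (e.g., accuracy) of the "best" models containing a component, divided by the number of pools in which the component has been present; the counts accumulate across runs; and each run yields a single "best" model. `ComponentRecord`, `update_after_run`, and `usable` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ComponentRecord:
    attribute_sum: float = 0.0   # attribute metrics of "best" models that use the component
    pools_seen: int = 0          # denominator: pools in which the component was present
    active: bool = True          # False means "out of consideration", not deleted

    def weighted_utility(self) -> float:
        if self.pools_seen == 0:
            return 0.0
        return self.attribute_sum / self.pools_seen


def update_after_run(records, best_models, threshold):
    """Update counts after one run and re-evaluate every component."""
    for record in records.values():
        # Soft-excluded components remain in the pool, so the denominator still grows.
        record.pools_seen += 1
    for components, attribute in best_models:
        for name in set(components):
            records[name].attribute_sum += attribute
    for record in records.values():
        record.active = record.weighted_utility() >= threshold


def usable(records):
    """Components that newly generated models are allowed to use."""
    return sorted(name for name, record in records.items() if record.active)


records = {name: ComponentRecord() for name in ("x1", "x2", "x3")}
history = []
for best_models in (
    [(("x1", "x3"), 0.75)],   # run 1: the "best" model uses x1 and x3
    [(("x1",), 1.0)],         # run 2: x3 drops below the threshold
    [(("x1", "x3"), 1.0)],    # run 3: a reintroduced run-1 model puts x3 back in a "best" model
):
    update_after_run(records, best_models, threshold=0.5)
    history.append(usable(records))
```

In this example `x3` is usable after run 1, out of consideration after run 2, and usable again after run 3, while `x2` never appears in a "best" model and stays flagged throughout. Because flagged components are never deleted, their denominators are always non-zero when they reappear.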

The inventive subject matter is an improvement on the state of the art in part because computational methods that handle large data sets are subject to the "curse of dimensionality." The curse of dimensionality is the idea that, as the number of rows (e.g., observations) and/or columns (e.g., predictors) increases, the dimensionality of the problem increases. As dimensionality increases, the volume of the space increases so quickly that the available data becomes sparse. This sparsity is problematic for any method that requires statistical significance, and it is problematic for any analytical method in several important respects.

First, when a statistically sound and reliable result is desired, the amount of data needed to support the result often grows exponentially with dimensionality. Second, many methods for organizing and searching data rely on detecting regions where objects form groups with similar properties. In high-dimensional data, however, all objects appear sparse and dissimilar in many ways, which can reduce the efficiency of common data organization strategies.

In the context of iterative modeling techniques, high dimensionality poses an additional problem. Each additional dimension exponentially increases the size of the solution search space. Because many iterative methods randomly sample the search space of possible solutions, each model component added to the problem exponentially increases the amount of time (physical and computational) required to converge on a solution.
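
A rough feel for this growth can be had from a deliberately simplified count. The sketch assumes a candidate model is characterized only by which components it uses (any non-empty subset of the pool); real model spaces, such as expression trees in genetic programming, are larger still. `candidate_subsets` is a hypothetical helper.

```python
def candidate_subsets(num_components: int) -> int:
    """Number of non-empty subsets of a pool with `num_components` components."""
    return 2 ** num_components - 1


# Each added component roughly doubles the number of candidates.
sizes = {n: candidate_subsets(n) for n in (10, 20, 30)}

# Pruning a 30-component pool down to 20 shrinks the space about a thousandfold.
shrink_factor = candidate_subsets(30) / candidate_subsets(20)
```

Under this simplified counting, removing even a handful of components cuts the space that a random sampler has to cover by orders of magnitude.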

In applying the inventive subject matter, the inventors have found that iteratively reducing the number of input features (e.g., model components) available to an iterative modeling process can, in some cases, reduce the time required to reach convergence by a factor of 100 or, alternatively, greatly increase the "search space" or depth that the process can consider in the same amount of time.

One reason for this performance improvement is that, with fewer model components available to the iterative modeling process, any individual model component is more likely to be stored in the CPU cache and subsequently called from there (a "cache hit") rather than called from RAM or another form of electronic storage, such as a hard drive or flash memory (a "cache miss"). The inventive subject matter increases the likelihood of a cache hit and, in some cases, makes a cache hit more likely than a cache miss for any given model component.

As briefly mentioned above, a "cache hit" is a state in which the data requested for processing by a program (e.g., a model component) is found in the CPU's cache memory. Cache memory is considerably faster at delivering data to the processor. When executing a command, the CPU looks for the data in the nearest accessible memory location, which is usually the primary CPU cache. If the requested data is found in the cache, it is considered a cache hit. A cache hit delivers data more quickly because of the speed at which the CPU cache transfers data to the CPU. "Cache hit" can also refer to retrieving data from a disk cache, where the requested data was stored when first queried and is accessed thereafter.

The improvement in computation time from maximizing cache hits comes from the speed of access to data stored in the CPU cache compared with other storage media. For example, a level 1 cache reference takes on the order of 0.5 nanoseconds, and a level 2 cache reference takes on the order of 7 nanoseconds. By comparison, a random read from a solid-state drive takes on the order of 150,000 nanoseconds, which is 300,000 times slower than a level 1 cache reference.
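
The arithmetic behind these figures, and the effect of the hit rate on average access time, can be checked with a small sketch. The latencies are the order-of-magnitude values quoted above; the two-level model (level 1 cache or solid-state drive, nothing in between) and the function names are simplifying assumptions.

```python
L1_CACHE_NS = 0.5             # level 1 cache reference
L2_CACHE_NS = 7.0             # level 2 cache reference
SSD_RANDOM_READ_NS = 150_000.0


def slowdown(slow_ns: float, fast_ns: float) -> float:
    """How many times slower one access is than another."""
    return slow_ns / fast_ns


def expected_access_ns(hit_rate: float,
                       hit_ns: float = L1_CACHE_NS,
                       miss_ns: float = SSD_RANDOM_READ_NS) -> float:
    """Average access time when a fraction `hit_rate` of requests hit the cache."""
    return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns


ssd_vs_l1 = slowdown(SSD_RANDOM_READ_NS, L1_CACHE_NS)   # 300,000
half_hits = expected_access_ns(0.5)                     # misses dominate the average
all_hits = expected_access_ns(1.0)
```

Because a miss costs so much more than a hit, the average is dominated by the miss term, so even a modest increase in hit rate removes a large share of the total access time.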

Thus, specific configurations and methods of iterative feature selection have been disclosed. It should be apparent to those skilled in the art, however, that many more modifications besides those already described are possible without departing from the inventive concepts herein. The inventive subject matter, therefore, is not to be restricted except in the spirit of the disclosure. Moreover, in interpreting the disclosure, all terms should be interpreted in the broadest possible manner consistent with the context. In particular, the term "comprising" should be interpreted as referring to elements, components, or steps in a non-exclusive manner, indicating that the referenced elements, components, or steps may be present, or utilized, or combined with other elements, components, or steps that are not expressly referenced.

Claims

A method of reducing the computational time required to improve a model that relates predictors to outcomes in a dataset, the method comprising:
generating a first model, the first model comprising a first model component from a first model component pool;
generating a second model, the second model comprising a second model component from a second model component pool;
calculating a first utility metric of the first model component, the first utility metric comprising a ratio of (1) the number of models in which the first model component is present and (2) the number of model component pools in which the first model component is present;
calculating a second utility metric of the second model component, the second utility metric comprising a ratio of (1) the number of models in which the second model component is present and (2) the number of model component pools in which the second model component is present; and
excluding the first model component and the second model component from the first model component pool and the second model component pool based on the first utility metric and the second utility metric.

The method of claim 1, wherein the first model component is randomly generated.

The method of claim 1, wherein the first model component pool and the second model component pool comprise at least one of a computational operator, a mathematical operator, a constant, a predictor, a feature, a variable, a ternary operator, an algorithm, a mathematical expression, a binary operator, a hidden node, a weight, a bias, a gradient, and a hyperparameter.

The method of claim 1, wherein the first function comprises at least the product of the first model attribute and the first utility metric.

The method of claim 4, wherein the first function and the second function are the same.

The method of claim 1, wherein generating the first model and the second model comprises an iterative modeling process.

The method of claim 6, wherein the iterative modeling process comprises at least one of an evolutionary computing process, a genetic programming process, a genetic algorithm process, a neural network process, a deep learning process, a Markov modeling process, a Monte Carlo modeling process, and a stepwise regression process.

The method of claim 1, further comprising retaining the first model component and the second model component in the first model component pool and the second model component pool based on the first utility metric and the second utility metric.

The method of claim 1, further comprising:
generating a third model component pool by excluding the first model component and the second model component from the first model component pool and the second model component pool based on the first utility metric and the second utility metric;
generating a third model, the third model comprising a third model component from the third model component pool;
calculating a third utility metric of the third model component, the third utility metric comprising (1) the number of models in which the third model component is present and (2) the number of model component pools in which the third model component is present; and
excluding the third model component from the third model component pool based on the third utility metric.

A method of reducing the computational time required to improve a model that relates predictors to outcomes in a dataset, the method comprising:
generating a first model, the first model comprising a first model component from a first model component pool;
generating a second model, the second model comprising a second model component from a second model component pool;
calculating (1) a first model attribute metric corresponding to the first model and (2) a second model attribute metric corresponding to the second model;
calculating a first utility metric of the first model component, the first utility metric comprising a ratio of (1) the number of models in which the first model component is present and (2) the number of model component pools in which the first model component is present;
calculating a second utility metric of the second model component, the second utility metric comprising a ratio of (1) the number of models in which the second model component is present and (2) the number of model component pools in which the second model component is present;
calculating a first weighted utility metric corresponding to the first model component, the first weighted utility metric comprising a result of a first function that incorporates (1) a model attribute metric for the model in which the first model component is present and (2) the first utility metric;
calculating a second weighted utility metric corresponding to the second model component, the second weighted utility metric comprising a result of a second function that incorporates (1) a model attribute metric for the model in which the second model component is present and (2) the second utility metric; and
excluding the first model component and the second model component from the first model component pool and the second model component pool based on the first weighted utility metric and the second weighted utility metric.

The method of claim 10, wherein the model attribute metric comprises at least one of accuracy, sensitivity, specificity, area under the curve (AUC) from a receiver operating characteristic (ROC) metric, and algorithm length.

The method of claim 10, wherein generating the first model and the second model comprises an iterative modeling process.

The method of claim 12, wherein the iterative modeling process comprises at least one of an evolutionary computing process, a genetic programming process, a genetic algorithm process, a neural network process, a deep learning process, a Markov modeling process, a Monte Carlo modeling process, and a stepwise regression process.

A method of reducing the computational time required to improve a model that relates predictors to outcomes in a dataset, the method comprising:
generating a model comprising a model component;
calculating a model attribute metric corresponding to the model using a subset of the dataset;
calculating a utility metric of the model component, the utility metric comprising a ratio, wherein the numerator of the ratio comprises the number of models in which the model component is present, and the denominator of the ratio starts at 0 and increases by 1 if the model component is in the model component pool;
calculating a weighted utility metric corresponding to the model component, the weighted utility metric comprising a result of a function that incorporates (1) the model attribute metric and (2) the utility metric; and
excluding the model component from the model component pool based on the weighted utility metric.

The method of claim 14, further comprising retaining the model component in the model component pool based on the weighted utility metric.

The method of claim 14, wherein the model component is randomly generated.

The method of claim 14, wherein the model component comprises at least one of a computational operator, a mathematical operator, a constant, a predictor, a feature, a variable, a ternary operator, an algorithm, a mathematical expression, a binary operator, a hidden node, a weight, a bias, a gradient, and a hyperparameter.

The method of claim 14, wherein the model component pool comprises at least one of a computational operator, a mathematical operator, a constant, a predictor, a feature, a variable, a ternary operator, an algorithm, a mathematical expression, and a binary operator.

The method of claim 14, wherein the function comprises at least the product of the model attribute and the utility metric.

The method of claim 14, wherein the model attributes include at least one of accuracy, sensitivity, specificity, area under the curve (AUC) from a receiver operating characteristic (ROC) metric, and algorithm length.