JP6824450B2

JP6824450B2 - Iterative feature selection method

Info

Publication number: JP6824450B2
Application number: JP2019571977A
Authority: JP
Inventors: リリー，パトリック
Original assignee: リキッドバイオサイエンシズ，インコーポレイテッド
Priority date: 2017-06-28
Filing date: 2017-10-03
Publication date: 2021-02-03
Anticipated expiration: 2037-10-03
Also published as: EP3646194A1; WO2019005187A1; EP3646194A4; JP2020528175A

Description

This application is a continuation-in-part of, and claims priority to, U.S. Patent Application No. 15/636394, filed June 28, 2017, and International Patent Application No. PCT/US2017/39812, filed June 28, 2017. All extrinsic materials identified in this application are incorporated herein by reference in their entirety.

The field of the invention is iterative feature selection.

The background description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided in this application is prior art or relevant to the claimed invention, or that any publication specifically or implicitly referenced is prior art.

As data becomes more available and datasets grow in size, many analytical processes suffer from the "curse of dimensionality." The phrase, coined by Richard E. Bellman ("Adaptive Control Processes: A Guided Tour", 1961, Princeton University Press), refers to problems that arise when analyzing and organizing data in high-dimensional spaces (e.g., datasets with hundreds, thousands, or millions of features or variables) and that do not occur in low-dimensional settings.

All publications identified herein are incorporated by reference to the same extent as if each individual publication or patent application were specifically and individually indicated to be incorporated by reference. Where a definition or use of a term in an incorporated reference is inconsistent with or contrary to the definition of that term provided herein, the definition provided herein applies and the definition in the reference does not.

Computer technology continues to advance, but processing and analyzing high-dimensional datasets remains computationally intensive. In an iterative modeling process, for example, the computation time needed to search all possible combinations of model components grows exponentially with each added model component. There is a particular need to reduce computational requirements in high-dimensional spaces in ways that make techniques such as iterative modeling processes better suited to solving complex problems with large datasets. One way to reduce the computational requirements of an iterative modeling process is to shrink the universe of algorithm components available to it.

It has yet to be appreciated that the number of algorithm components available to an iterative modeling process can be dramatically reduced by determining which components are, and which are not, significant to a solution.

Thus, there is still a need in the art for iterative feature selection methods applied to iterative modeling processes.

The present invention provides apparatus, systems, and methods in which model components are eliminated as candidate model components for developing models in an iterative modeling process.

In one aspect of the inventive subject matter, a method is contemplated for reducing the computation time required to improve models that relate predictors to outcomes in a dataset. The method includes several steps. First, models are generated using model components from a pool of model components. Using a subset of the dataset, a model attribute metric (e.g., accuracy, sensitivity, specificity, area under the curve (AUC) from a receiver operating characteristic (ROC) metric, and algorithm length) is generated for each model. Next, a utility metric is calculated for some of the model components: the ratio of (1) the quantity of models in which the model component is present to (2) the quantity of model component pools in which the model component is present. A weighted utility metric corresponding to each such model component can then be calculated.

In some embodiments, the weighted utility metric is the result of a function of (1) the model attribute metrics of the models in which a model component is present and (2) the utility metric of that model component. Based on the weighted utility metric, a given model component is either eliminated from or retained in the pool of model components. In some embodiments, the function includes the product of a model attribute metric and a utility metric.

In some embodiments, the model components are randomly generated. A model component can be, among other things, a computational operator, mathematical operator, constant, predictor, feature, variable, ternary operator, algorithm, formula, binary operator, weight, gradient, node, or hyperparameter.

It should be appreciated that the disclosed subject matter provides advantageous technical effects, including improved operation of a computer, by dramatically reducing the computational cycles required to perform certain tasks (e.g., genetic programming). Without the inventive subject matter, iterative modeling methods often fail to provide a tenable solution, mainly because of exorbitant computational requirements that can amount to months or years of computation time.

Various objects, features, aspects, and advantages of the inventive subject matter will become more apparent from the following detailed description of preferred embodiments, along with the accompanying drawings, in which like numerals represent like components.

FIG. 1 shows a general framework for an iterative modeling process.
FIG. 2 shows a contemplated method for determining a utility metric for a model component.
FIG. 3 shows a contemplated method for determining a model attribute metric.
FIG. 4 shows a contemplated method for calculating a weighted utility metric.
FIG. 5 shows a contemplated method for eliminating a given model component from, or retaining it in, a pool of model components.
FIG. 6 shows one contemplated embodiment comprising a run that has models, a series of generations, and a "best" model.
FIG. 7 shows the pool of model components corresponding to the run of FIG. 6.
FIG. 8 shows one contemplated embodiment comprising a series of runs, each having models, a series of generations, and a "best" model.
FIG. 9 shows a series of model component pools corresponding to the runs of FIG. 8.
FIG. 10 shows a contemplated method for eliminating a given model component from, or retaining it in, a pool of model components.
FIG. 11 shows another contemplated method for eliminating a given model component from, or retaining it in, a pool of model components.
FIG. 12 shows a method of incorporating model components from one model set into another model set.
FIG. 13 shows another method of incorporating model components from one model set into another model set.
FIG. 14 shows a method of incorporating one or more unaltered models from one or more generations of models into another generation of models.

The following discussion provides example embodiments of the inventive subject matter. Although each embodiment represents a single combination of inventive elements, the inventive subject matter is considered to include all possible combinations of the disclosed elements. Thus, if one embodiment comprises elements A, B, and C, and a second embodiment comprises elements B and D, the inventive subject matter is also considered to include the other remaining combinations of A, B, C, or D, even if not explicitly disclosed.

As used in this application and throughout the claims, the meaning of "a," "an," and "the" includes plural reference unless the context clearly dictates otherwise. Also, as used in the description in this application, the meaning of "in" includes "in" and "on" unless the context clearly dictates otherwise.

Also, as used in this application, and unless the context clearly dictates otherwise, the term "coupled to" is intended to include both direct coupling (in which two elements that are coupled to each other contact each other) and indirect coupling (in which at least one additional element is located between the two elements). The terms "coupled to" and "coupled with" are therefore used synonymously.

In some embodiments, the numbers expressing quantities of components and properties such as concentration and reaction conditions, used to describe and claim certain embodiments of the invention, are to be understood as being modified in some instances by the term "about." Accordingly, in some embodiments, the numerical parameters set forth in the written description and the attached claims are approximations that can vary depending on the desired properties sought to be obtained by a particular embodiment. In some embodiments, the numerical parameters should be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Although the numerical ranges and parameters setting forth the broad scope of some embodiments of the invention are approximations, the numerical values set forth in the specific examples are reported as precisely as practicable. The numerical values presented in some embodiments of the invention may contain certain errors necessarily resulting from the standard deviation found in their respective testing measurements. Moreover, unless the context dictates the contrary, all ranges set forth in this application should be interpreted as inclusive of their endpoints, and open-ended ranges should be interpreted to include only commercially practical values. Similarly, all lists of values should be considered inclusive of intermediate values unless the context indicates the contrary.

It should be noted that any language directed to a computer should be read to include any suitable combination of computing devices, including servers, interfaces, systems, databases, agents, peers, engines, controllers, or other types of computing devices operating individually or collectively. It should be understood that the computing devices comprise a processor configured to execute software instructions stored on a tangible, non-transitory computer-readable storage medium (e.g., hard drive, solid-state drive, RAM, flash, ROM). The software instructions preferably configure the computing device to provide the roles, responsibilities, or other functionality discussed below with respect to the disclosed apparatus. In especially preferred embodiments, the various servers, systems, databases, or interfaces exchange data using standardized protocols or algorithms, possibly based on HTTP, HTTPS, AES, public-private key exchanges, web service APIs, known financial transaction protocols, or other electronic information exchange methods. Data exchanges are preferably conducted over a packet-switched network, the Internet, a LAN, a WAN, a VPN, or another type of packet-switched network. The following description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided in this application is prior art or relevant to the claimed invention, or that any publication specifically or implicitly referenced is prior art.

As used in this application, terms such as "set" or "subset" should be read to include one or more items. Unless otherwise noted, a "set" does not necessarily include more than one item.

One purpose of the inventive subject matter is to identify and eliminate low-performing (e.g., unnecessary or unneeded) model components used to create models that describe the relationship between predictors and outcomes in a target dataset. Pruning the number of possible model components improves computational efficiency by reducing the computation time needed to converge on high-performing models in an iterative modeling process.

The inventive subject matter has several phases, which can be implemented as method steps.

In one contemplated embodiment of the inventive subject matter, the first phase is to generate a set of models from a pool of model components using an iterative modeling process. FIG. 1 shows a general iterative modeling framework in which the model components in the set {c_1, ..., c_z} undergo a modeling process to produce models m_1 through m_n.

As used herein, the term "iterative modeling process" refers to a modeling method for creating one or more models that describe the relationship between predictors and outcomes in a target dataset, where the method includes a repeatable or loopable subroutine or process (e.g., a run, a for loop, an epoch, a cycle).

Contemplated iterative modeling processes include deep learning methods such as artificial neural networks (ANN), convolutional neural networks (CNN), recurrent neural networks, deep Boltzmann machines (DBM), deep belief networks (DBN), stacked autoencoders, and other modeling techniques derived from neural network frameworks.

Additionally or alternatively, contemplated iterative modeling processes include evolutionary programming methods, including genetic algorithms and genetic programming (e.g., tree-based genetic programming, stack-based genetic programming, linear (including machine code) genetic programming, grammatical evolution, extended compact genetic programming (ECGP), embedded Cartesian genetic programming (ECGP), probabilistic incremental program evolution (PIPE), and typed genetic programming (STEP)). Other evolutionary programming methods include gene expression programming, evolution strategies, differential evolution, neuroevolution, learning classifier systems, or reinforcement learning systems, in which the solution is a set of classifiers (rules or conditions) that can be of binary, real-valued, neural-net, or S-expression type. For learning classifier systems, fitness can be determined using either a strength-based or accuracy-based reinforcement learning approach or a supervised learning approach.

Additional or alternative contemplated iterative modeling processes can include Monte Carlo methods, Markov chains, stepwise linear logistic regression, decision trees, random forests, support vector machines, Bayesian modeling techniques, or gradient boosting techniques, so long as the process includes a repeatable or loopable subroutine or process (e.g., a run, a for loop, an epoch, a cycle).

In the next phase, utility metrics are calculated for selected model components and model attribute metrics are calculated for selected models. A weighted utility metric is then calculated from each utility metric and one or more model attribute metrics. Based on the weighted utility metrics, some model components are eliminated from the model component pool and the others are left in place. By cutting the number of model components, this pruning process reduces the dimensionality of the search space and so improves a computer's ability to carry out iterative modeling methods, as described in more detail below.

In some embodiments, a utility metric is calculated for each model component. The utility metric, one embodiment of which is shown in FIG. 2, is a ratio whose numerator is the number of times the model component appears in the models and whose denominator is the number of times the model component appears in the model component pool.
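The ratio just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each model is represented as a flat list of component names, the pool is a list in which a component may appear more than once, and the names `utility_metrics`, `pool`, and `models` are hypothetical rather than taken from the patent.

```python
from collections import Counter


def utility_metrics(models, pool):
    """Return {component: occurrences in the models / occurrences in the pool}."""
    # Numerator: how many times each component appears across all models.
    in_models = Counter(c for model in models for c in model)
    # Denominator: how many times each component appears in the pool.
    in_pool = Counter(pool)
    return {c: in_models[c] / n for c, n in in_pool.items()}


# Hypothetical pool; "x1" appears twice in it.
pool = ["x1", "x2", "+", "*", "x1"]
# Hypothetical models, each a flat list of the components it uses.
models = [["x1", "+", "x2"], ["x1", "*", "x1"], ["x2", "+", "x2"]]

metrics = utility_metrics(models, pool)
print(metrics)
```

A component that is in the pool but never used by any model gets a utility metric of 0, which makes it a natural pruning candidate.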

In some contemplated embodiments, model components can include, for example, computational operators (e.g., logical statements such as IF, AND, OR), mathematical operators (e.g., arithmetic operations such as multiplication, division, subtraction, and addition; trigonometric operations; logistic functions; calculus operations; "floor" or "ceiling" operators; or any other mathematical operator), constants (e.g., a constant numerical value, including an integer or a value such as π), predictors (e.g., observed or measured values, or formulas), features (e.g., characteristics), variables, ternary operators (e.g., an operator that takes three arguments, where the first argument is a comparison, the second is the result if the comparison is true, and the third is the result if the comparison is false), algorithms, formulas, literals, functions (e.g., unary functions, binary functions), binary operators (e.g., an operator that operates on two operands and returns a result), weights and weight vectors, nodes and hidden nodes, gradient descent, sigmoid activation functions, hyperparameters, and biases.

FIG. 3 shows how a model attribute metric can be determined. In some contemplated embodiments, the model attribute metric describes a model's ability to predict an outcome using predictors, namely its accuracy, expressed as a percentage. Data from a dataset is used to determine the model attribute metric: the dataset contains predictors and outcomes, and the model attribute is determined by giving the model only the predictors and then comparing the model's outputs with the actual outcomes in the dataset. For example, if a model uses a set of predictors to predict the outcome correctly 35% of the time, the model attribute metric for that model is 35%.
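As a minimal sketch of the accuracy-based model attribute metric, the following Python treats a model as any callable that maps predictors to a predicted outcome. The function `accuracy`, the toy model, and the four data rows are all hypothetical stand-ins chosen for illustration.

```python
def accuracy(model, rows):
    """Fraction of rows for which the model's prediction equals the actual outcome.

    rows is a list of (predictors, outcome) pairs; the model sees only the predictors.
    """
    correct = sum(1 for predictors, outcome in rows if model(predictors) == outcome)
    return correct / len(rows)


def toy_model(predictors):
    # Hypothetical model: predicts a positive outcome when x exceeds 0.5.
    return predictors["x"] > 0.5


# Hypothetical dataset subset: (predictors, actual outcome).
rows = [
    ({"x": 0.9}, True),
    ({"x": 0.2}, False),
    ({"x": 0.7}, False),  # the toy model gets this one wrong
    ({"x": 0.1}, False),
]

score = accuracy(toy_model, rows)
print(f"{score:.0%}")
```

Here the toy model is right on three of four rows, so its model attribute metric is 75%.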

In other embodiments, the model attribute metric can additionally or alternatively be sensitivity, specificity, area under the curve (AUC) from a receiver operating characteristic (ROC) metric, root mean square error (RMSE), algorithm length, algorithm computation time, the variables or components used, or another suitable model attribute. The model attribute metric can be determined using one or more of the identified model attributes, but it is contemplated that the model attribute metric is not limited to these attributes.

To determine whether a model component contributes sufficiently to model performance (e.g., whether a particular model component affects a model's ability to determine an outcome using a set of predictors), a weighted utility metric is generated as a function of each utility metric and one or more model attribute metrics, as shown in FIG. 4.
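One way to picture such a function is the product form mentioned in the summary: utility metric times model attribute metric. The sketch below assumes, purely for illustration, that the attribute metrics of all models containing the component are combined by taking their mean before multiplying; the text does not prescribe that aggregation, and the names `weighted_utility`, `models`, and `attribute_metric` are hypothetical.

```python
from statistics import mean


def weighted_utility(component, utility, models, attribute_metric):
    """Utility metric times the mean attribute metric of the models using the component.

    Averaging over models is an assumption made for this sketch.
    """
    scores = [attribute_metric[i] for i, m in enumerate(models) if component in m]
    if not scores:
        return 0.0  # the component appears in no model
    return utility * mean(scores)


# Hypothetical models and their accuracies (model attribute metrics).
models = [["x1", "+", "x2"], ["x1", "*", "x1"]]
attribute_metric = [0.75, 0.5]

# Utility metrics are supplied by hand here (1.5 for "x1", 1.0 for "*").
weighted_x1 = weighted_utility("x1", 1.5, models, attribute_metric)
weighted_star = weighted_utility("*", 1.0, models, attribute_metric)
print(weighted_x1, weighted_star)
```

Under these assumptions "x1" scores 1.5 × 0.625 = 0.9375, while "*" appears only in the weaker model and scores 1.0 × 0.5 = 0.5.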

It is contemplated that whether a model component is "important" or "unimportant" is determined by whether its weighted utility metric falls above or below a threshold. In some embodiments, the threshold can be computed by first averaging all of the weighted utility metrics for the model components that appear in the set of models (e.g., {m_1, ..., m_n} in FIG. 1). Each weighted utility metric is then divided by a summary statistic (e.g., mean, trimean, variance, standard deviation, mode, median) of all the weighted utility metrics.

If the result of dividing a weighted utility metric by the summary statistic of all the weighted utility metrics is below a certain threshold (e.g., the result is less than 1.2, 1.1, 1, 0.9, 0.8, 0.7, 0.6, 0.5, or 0.4), the model component corresponding to that weighted utility metric is eliminated from consideration (e.g., it cannot be placed in any new model component pool used to generate a new set of runs). This process is shown in FIG. 5.
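The normalize-then-compare step can be sketched as follows. The mean is used as the summary statistic and 0.5 as the threshold, both picked from the options listed above; the function name `prune_by_relative_score` and the sample values are hypothetical, and the sketch assumes the summary statistic is nonzero.

```python
from statistics import mean


def prune_by_relative_score(weighted, threshold=0.5, summary=mean):
    """Split components into (kept, removed) by weighted utility / summary statistic."""
    baseline = summary(list(weighted.values()))  # assumed to be nonzero
    kept = {c for c, w in weighted.items() if w / baseline >= threshold}
    return kept, set(weighted) - kept


# Hypothetical weighted utility metrics; their mean is 1.5.
weighted = {"x1": 1.0, "x2": 2.0, "+": 2.5, "*": 0.5}

kept, removed = prune_by_relative_score(weighted)
print(sorted(kept), sorted(removed))
```

Passing `summary=statistics.median` (or another statistic) swaps the baseline without changing the rest of the logic.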

Other suitable methods of deciding whether to keep or eliminate a model component are also contemplated. For example, in some embodiments the weighted utility metric is compared with a threshold without first undergoing any manipulation (e.g., the averaging and division described above, or any other such process). The threshold can be an arbitrary value or can be selected based on an understanding of the expected weighted utility metric values. In these embodiments, once the weighted utility metric for a model component has been calculated, it is compared with a predefined threshold, and based on that comparison the corresponding model component is either eliminated from all model component pools (e.g., when the weighted utility metric is below the threshold) or kept for use in future runs.

Ultimately, it is contemplated that some model components will prove less useful than others based on their corresponding weighted utility metrics, and that model components whose utility falls below a threshold are discarded.

In some embodiments, after model components have been eliminated from consideration, a new pool of model components is generated without the eliminated components. In other embodiments, the model components are removed from the existing model component pool, and the same pool is used again to generate sets of models for a new set of runs. In still other embodiments, the model components are not removed from the model component pool but are simply excluded from consideration. From this point the process repeats, and more model components may ultimately be eliminated. The process can be repeated as needed until all remaining model components are found to contribute significantly to the "best" models in each iteration or run.
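The repeat-until-stable loop can be sketched as below. The modeling run and metric calculation are abstracted into a caller-supplied function, `score_components`, which is a hypothetical stand-in that returns a weighted utility metric for every component in the current pool; the sketch uses the direct comparison against a predefined threshold, and `max_rounds` is an added safety limit rather than something the text specifies.

```python
def iterative_prune(pool, score_components, threshold, max_rounds=10):
    """Repeatedly score the pool and drop components below the threshold."""
    pool = list(pool)
    for _ in range(max_rounds):
        weighted = score_components(pool)  # one modeling pass over the current pool
        survivors = [c for c in pool if weighted[c] >= threshold]
        # Stop when nothing was pruned, or when pruning would empty the pool.
        if len(survivors) == len(pool) or not survivors:
            break
        pool = survivors
    return pool


# Hypothetical fixed scores standing in for a real modeling run.
base_scores = {"x1": 0.9, "x2": 0.8, "noise1": 0.3, "noise2": 0.1}


def toy_scorer(current_pool):
    return {c: base_scores[c] for c in current_pool}


final_pool = iterative_prune(["x1", "x2", "noise1", "noise2"], toy_scorer, threshold=0.5)
print(final_pool)
```

In a real setting the scores would change from round to round as models are rebuilt from the smaller pool, which is why the loop rescoring on every pass matters.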

Through this process, model components are pruned from one or more pools of model components. Pruning model components according to the inventive subject matter dramatically reduces the computation time required to perform iterative modeling (and related tasks).

Without wishing to be limited to one particular type of iterative modeling, a subset of embodiments of the inventive subject matter provides apparatus, systems, and methods in which model components are eliminated as candidate model components for developing models in a genetic programming process. An example that applies the inventive subject matter to genetic programming is useful for understanding its application to other iterative modeling techniques.

For example, in this subset of contemplated embodiments, the first phase is to use a genetic programming process to generate the set of models that make up a "run." The term "run" refers to a set of models that is manipulated so as to converge on a "best" model. Within a run, a set of models is generated using model components from a pool of model components.

This set of models is called a generation of models. In the next phase, the (randomly generated) first generation of models is made to compete to determine which model or models in that generation perform best, and the next generation of models is then generated, in part using models from the previous generation (e.g., based on them or by replicating them). These phases are completed repeatedly over multiple generations within each run until one or more models are developed that adequately describe the relationship between the predictors and the outcomes in the dataset.

In the next phase, utility metrics are computed for selected model components, and model attribute metrics are computed for selected models of each run.

The first generation of a run requires generating a set of models. Models of the main part of the invention are described using the notation m_abc, where a is the run number, b is the generation number, and c is the model number. FIG. 6 shows a run having run number 1 and a first generation made up of models m_111 through m_11i. The value of i is the number of models in that generation. It is contemplated that i can be 10 to 1,000,000, more preferably 100 to 10,000, and most preferably 1,000 to 5,000.

Models m_111 through m_11i are generated randomly using various model components from a model component pool, as shown in FIG. 7. It is contemplated that a model is an algorithm and that model components are used to make up the algorithm. The model components in FIG. 7 are represented as the set {c_1, ..., c_z}. All model components in a pool are available for use in the models corresponding to that pool, but not every model component need be used. Moreover, when a model component is used in one model, it remains available for use in other models.
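
The random generation of a first generation from a pool can be sketched as follows. This is a minimal illustration only, not the patented implementation: representing a model as a plain list of component names, the helper names `random_model` and `first_generation`, and the `max_len` limit are all assumptions introduced for the example.

```python
import random

def random_model(pool, rng, max_len=5):
    # Assumption: a model is a non-empty list of component names.
    # Components are drawn with replacement, so a component used in one
    # model remains available to every other model.
    length = rng.randint(1, max_len)
    return [rng.choice(pool) for _ in range(length)]

def first_generation(pool, size, seed=0):
    # Build `size` random models (m_111 ... m_11i in the notation above).
    rng = random.Random(seed)
    return [random_model(pool, rng) for _ in range(size)]

pool = ["c_1", "c_2", "c_3", "c_4", "c_5"]
generation_1 = first_generation(pool, size=10)
```

Nothing forces every component in the pool to be used, which matches the statement that all components are available but not all need appear in a model.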

As described in another paragraph, to determine whether a particular model component affects a model's ability to determine outcomes using a set of predictors, a weighted utility metric is created as a function of each utility metric and one or more model attributes, as shown in FIG. 4.

In one aspect of the main part of the invention, a first generation of models {m_111, ..., m_11i} is generated, and the models in that first generation compete with one another to determine which models perform best. For example, the competition can be a comparison of model performance (e.g., a model's ability to predict outcomes from a set of predictors). In some embodiments, a set of best-performing models is identified after the models in each generation of a run compete with one another. In other embodiments, a single best-performing model is identified. It is contemplated that a top percentage of models by performance (e.g., the top 1-5%, 5-10%, 10-20%, 20-30%, 30-40%, or 40-50%) can be considered the best-performing models in each generation.

The best-performing models can be described in several ways. For example, if a model uses the predictors to predict some percentage of the outcomes (e.g., where the outcomes are known and the model's results are compared with the actual outcomes from the dataset), that percentage can be used to determine whether the model performs better than other models in its generation. In such embodiments, the models in a generation "compete" against one another in that a model with a higher percent accuracy in determining outcomes from the predictors "beats" a model with a lower percent accuracy. Once some (or all but one) of the models in a generation are eliminated (e.g., the losing models are removed from the set), the best models (or model) remain.
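
One way to picture this accuracy-based competition is the short sketch below. It is illustrative only: treating each model as a Python callable, the function names `accuracy` and `select_best`, and the toy data are assumptions, and `top_fraction` stands in for the "top percentage" mentioned above.

```python
def accuracy(model, predictors, outcomes):
    # Fraction of known outcomes that the model predicts correctly.
    hits = sum(1 for x, y in zip(predictors, outcomes) if model(x) == y)
    return hits / len(outcomes)

def select_best(models, predictors, outcomes, top_fraction=0.2):
    # Rank models by accuracy; higher-accuracy models "beat" lower ones.
    ranked = sorted(models,
                    key=lambda m: accuracy(m, predictors, outcomes),
                    reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))  # always keep at least one
    return ranked[:keep]

# Toy dataset (hypothetical): the outcome is 1 when the predictor exceeds 2.
predictors = [1, 2, 3, 4]
outcomes = [0, 0, 1, 1]
candidates = [
    lambda x: int(x > 2),  # predicts every outcome correctly
    lambda x: 1,
    lambda x: int(x > 3),
    lambda x: 0,
    lambda x: int(x > 1),
]
best = select_best(candidates, predictors, outcomes, top_fraction=0.4)
```

Other criteria named in the text, such as model length or computation time, could replace or be combined with `accuracy` as the sort key.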

In another example, the "best" models of a generation can be those having one or more favorable characteristics compared with other models in that generation. For example, a "best" model can be the model that is "shortest" in terms of algorithm length (e.g., it uses the fewest model components, whether counted by quantity, by type, or by non-overlapping model components), that requires the least computation time to run, or that has the best training accuracy, the best standard process training validation, or the best training validation. In addition, the "best" model can be determined by a combination of these and any other factors discussed in this application.

Using one or more models from the first generation in the run that were identified as the best performers, a second generation of models can be generated. The second generation of models can be made up of several subsets of models. For example, one subset of the models in the next generation can be generated randomly using model components from the model pool (shown in FIG. 7), another subset can be generated by mutating models from the previous generation (e.g., one or more of the best models), and yet another subset can be generated by creating offspring (also called crossover) using models from the previous generation.
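
The three sources of a next generation (random models, mutated models, and offspring) can be sketched as below. This is a simplified illustration under stated assumptions: models are lists of component names, the mix of the three sources is chosen uniformly at random, and the function name `next_generation` and its parameters are hypothetical.

```python
import random

def next_generation(best, pool, size, rng, max_len=5):
    # Assumption: `best` is a non-empty list of non-empty models
    # (lists of component names) kept from the previous generation.
    children = []
    while len(children) < size:
        kind = rng.choice(["random", "mutation", "crossover"])
        if kind == "random":
            # Fresh model drawn from the component pool.
            child = [rng.choice(pool) for _ in range(rng.randint(1, max_len))]
        elif kind == "mutation":
            # Copy one parent and replace a single component.
            child = list(rng.choice(best))
            child[rng.randrange(len(child))] = rng.choice(pool)
        else:
            # Offspring: front half of one parent, back half of another.
            a, b = rng.choice(best), rng.choice(best)
            child = a[: len(a) // 2] + b[len(b) // 2 :]
        children.append(child)
    return children

pool = ["c_1", "c_2", "c_3", "c_4"]
best_models = [["c_1", "c_2", "c_3"], ["c_4", "c_2"]]
generation_2 = next_generation(best_models, pool, size=8, rng=random.Random(1))
```

Passing a different `size` for each generation gives the cases j < i, j = i, and j > i.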

In some embodiments, a subset of the models from one generation (e.g., one or more models) is included, unchanged, in a subsequent generation (e.g., any subsequent generation). For example, one or more models from a previous generation (e.g., one or more "best" models) can be introduced into any subsequent generation to reduce the time needed to converge on the "best" model for the run. Thus, for example, in FIG. 6, once generation a is reached, any of the models from generations 1 through a-1 can be included in generation a. One embodiment of this concept is shown in FIG. 14.

By incorporating one or more models from one generation, unchanged, into a subsequent generation, model components or model features (e.g., groups of model components that together affect model performance) can be introduced into the subsequent generation to improve its ability to converge on the "best" model for the run in which that generation exists. It is suggested that any model introduced from one run or generation into another run or generation first be saved (e.g., in computer memory) so that it can be recalled later.

In some embodiments, the models introduced into a subsequent generation are not necessarily the optimal models (e.g., they do not necessarily outperform other models in other generations of the same run according to any of the model performance characteristics discussed in this application). The goal is only to ensure that a model component or feature is not eliminated by chance, even when it exists only in models that would otherwise have caused it to be discarded or removed from consideration.

It is contemplated that the unchanged introduction of models (e.g., "best" models) from one generation into a subsequent generation can be flagged to occur only after some number of generations (e.g., 10-100, 100-150, or 150-250 generations) have been iterated. For example, in some embodiments, the "best" models from any of the previous generations can be incorporated into the 100th generation. In other embodiments, if a run is flagged so that the "best" models from a generation can be carried over only after the 100th generation, the 101st generation can incorporate the "best" model or models from the 100th generation. In these embodiments, after the 100th generation, models from the 100th generation and earlier generations can be incorporated into later generations.

The term "crossover" refers to the combination of one or more models to generate new models from one generation to the next. It is analogous to reproduction and biological crossover, on which genetic programming is based. In some embodiments, models can further be modified between generations using a fitness function (e.g., a particular type of objective function used to summarize, as a single figure of merit, how close a given design solution is to achieving the set aims) or using multiple generations of evolution to solve a user-defined task (e.g., describing the relationship between predictors and outcomes).

Mutating a model is creating a new model based on a single existing model. It is contemplated that a mutated model is one that has been subtly changed or altered from its original form. Mutation can be used to maintain diversity from one generation of a population of models to the next. It is analogous to biological DNA mutation and involves altering one or more aspects of a model from its initial state.

One example of mutation involves implementing a probability that any given bit in a model will be changed from its original state. A common way to implement mutation involves generating a random variable for each bit in a sequence. This random variable indicates whether a particular bit will be modified. This mutation procedure, based on biological point mutation, is called single-point mutation. Other types include inversion and floating-point mutation. Still other types of mutation include swap, inversion, and scramble.
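
The per-bit mutation procedure described above can be sketched as follows. The bit-list encoding of a model, the function name `point_mutate`, and the `rate` parameter are assumptions made for illustration; the text does not fix a particular encoding.

```python
import random

def point_mutate(bits, rate, rng):
    # Draw one random number per bit; the bit is flipped when the draw
    # falls below the mutation rate, otherwise it is left unchanged.
    return [b ^ 1 if rng.random() < rate else b for b in bits]

rng = random.Random(42)
original = [0, 1, 1, 0, 1, 0, 0, 1]
mutated = point_mutate(original, rate=0.1, rng=rng)
```

A low `rate` changes a model only subtly, which is how mutation preserves most of a parent while still maintaining diversity across generations.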

Creating offspring of models is creating a new model based on two or more existing models. The offspring of two or more parent models takes features from the parent models and combines them to create a new model. Embodiments of the main part of the invention use offspring to vary the features of models from one generation to the next. This is analogous to reproduction and biological crossover, on which the models of the main part of the invention (e.g., genetic algorithms) are based. Crossover is the process of taking multiple (e.g., two or more) parent models and producing a child model from them.

Using any number or combination of the techniques described above, the second generation of the run, shown in FIG. 6 as the set of models {m_121, ..., m_12j}, is created. It is contemplated that in some embodiments each subsequent generation contains fewer models than the previous generation (e.g., j < i), while in other embodiments each subsequent generation has the same number of models as the previous generation (e.g., j = i). Likewise, each subsequent generation of models can contain more models than the previous generation (e.g., j > i), or each generation can contain a varying number of models (e.g., the second generation has fewer models than the first, the third has more than the second or even more than the first, and so on).

The process of iterating through generations of models within a run can be completed as many times as desired. In FIG. 6, the number of generations is represented by the variable a. Preferably, a is large enough that the resulting number of models can sufficiently traverse the dataset. For example, there should be enough generations that the models can take into account all possible variables (e.g., predictors) from the dataset. For example, a larger dataset may require more generations of models than a smaller one. In some embodiments, a can be 10 to 10,000 generations, more preferably 50 to 1,000 generations, and most preferably 100 to 500 generations. Generational evolution as described in the main part of the invention can be classified as genetic programming. Because the main part of the invention enables efficient elimination of model components, it is contemplated that its methods can be useful for dramatically improving computational efficiency in any method of iterative programming.

After iterating through a generations, the final generation of the run in FIG. 6 is reached. In some embodiments, the final generation of a run (e.g., generation a in FIG. 6) consists of a single model, but it is also contemplated that the final generation of a run can consist of a set of models. In embodiments where the final generation of a run includes a set of models, one or more "best" models are again determined based on any of the criteria described above for determining which models in a generation are "best." It is also contemplated that all models in the final generation can be considered "best" models of that run. In embodiments where only one model exists in the final generation of a run, that model is necessarily considered the "best" model of that generation and therefore the "best" model of the run.

With the "best" model or models of a run identified (e.g., in FIG. 6 the best model is labeled m_1a1), model attributes are computed for the "best" models.

Because each model in a run is created using model components identified in a model pool, the "best" model or models from a particular run likewise use model components from the same model pool from which the first generation of models drew its model components. For example, FIG. 7 shows a pool of model components that can be used to generate the models in the run shown in FIG. 6. The model components used in the "best" model of the run shown in FIG. 6 are therefore necessarily drawn from the pool of model components shown in FIG. 7.

In some embodiments, each model component used in a run (e.g., used in any generation of the run) has a utility metric computed for it. In other embodiments, only the model components used in the "best" models have utility metrics computed for them. In still other embodiments, utility metrics can be computed for the model components found in a subset of the models from a run (e.g., only the latest 10%, 20%, 30%, 40%, 50%, 60%, or 70% of generations).

For example, in FIGS. 6 and 7, if a model component from the pool of model components appears in the "best" model (e.g., m_1a1), the numerator of that model component's utility metric is 1. If a model component appears multiple times within a single model (or within multiple models that make up a "best" generation), the count increases by only 1 for that model (or for that run). For example, if a "best" generation contains two models and both contain model component C_g, the numerator for C_g is incremented by only 1 for that run.

As for the denominator of the utility metric, the denominator increases by 1 each time the model component appears in a pool of model components. For example, every model component in the pool of model components in FIG. 7 has a denominator of 1 for its utility metric. The denominator of the utility metric can be greater than 1 when there are two or more pools of model components.

FIGS. 8 and 9 show an embodiment of the main part of the invention that implements X runs and Y pools of model components. It is contemplated that there is one pool of model components per run (e.g., X = Y), with each pool of model components corresponding specifically to a particular run, but it is likewise contemplated that there can be fewer model component pools than runs (e.g., X > Y) or more model component pools than runs (e.g., X < Y).

In determining the utility metrics of the model components that appear in runs 1 through X of FIG. 8, the numerator can be between 0 and X (i.e., the total number of runs), and the denominator can be between 1 and Y (i.e., the total number of model component pools). For example, if a model component appears in the "best" models of two runs but is present in four model component pools, its utility metric is 0.5 (2 divided by 4). A utility metric is computed for each model component, but if a model component does not appear in any "best" model in any run, it has a numerator of 0 and therefore a utility metric of 0.
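
The counting rule for the utility metric can be illustrated with the sketch below. It is a minimal example, assuming that models are lists of component names, that each run contributes a list of its "best" models, and that pools are sets; the function name `utility_metrics` and the sample data are hypothetical.

```python
def utility_metrics(best_models_per_run, pools):
    # Numerator: number of runs whose "best" model(s) contain the component
    # (counted at most once per run, however often it appears in that run).
    # Denominator: number of component pools that contain the component.
    metrics = {}
    for component in set().union(*pools):
        numerator = sum(
            1 for best in best_models_per_run
            if any(component in model for model in best)
        )
        denominator = sum(1 for pool in pools if component in pool)
        metrics[component] = numerator / denominator
    return metrics

pools = [{"c_g", "c_a"}, {"c_g", "c_b"}, {"c_g", "c_d"}, {"c_g", "c_a"}]
best_models_per_run = [
    [["c_g", "c_a"]],          # run 1
    [["c_b"]],                 # run 2
    [["c_g", "c_g"], ["c_g"]], # run 3: c_g still counts only once
    [["c_a"]],                 # run 4
]
metrics = utility_metrics(best_models_per_run, pools)
```

Here `c_g` appears in the "best" models of two runs and in four pools, giving 2 divided by 4, or 0.5, while `c_d` never appears in a "best" model and gets 0.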

It is contemplated that utility metrics can be computed for every model component in all pools of model components. In some embodiments, however, utility metrics are computed only for model components that appear in one or more models in the "best" generation of a run. Intuitively, if a model component does not appear in a "best" model, its numerator is necessarily 0. Computing utility metrics for model components that do not appear in at least one "best" model can therefore be skipped; instead, all model components that do not appear in at least one "best" model can be eliminated from all model component pools without spending excess processor cycles.

For example, in FIG. 8 there are X runs, each of which has one best model (i.e., the models in the set {m_1a1, ..., m_Xc1}, the final generation of each run). Because it is contemplated that the pools of model components shown in FIG. 9 can have overlapping model components, model component c_1g may be present in all or some of the other model pools. If model component c_1g appears in five model component pools (i.e., Y ≥ 5) and also appears in three of the "best" models of these runs, the utility metric for model component c_1g is 3:5, or 0.6.

A model attribute metric is needed for each model in which c_1g appears. To compute the weighted utility metric, the utility metric of c_1g is multiplied by some function of the model attributes of the models in which c_1g appears. The model attributes of the models in which c_1g appears can, for example, be averaged. It is also contemplated that in other embodiments the median of the model attributes can be used, in other embodiments the mode can be used, and in still other embodiments the geometric mean can be implemented.

It is also contemplated that, where a particular model component appears in a large number of "best" models, outliers can be eliminated before computing the mean, median, or mode (e.g., some number of the highest and lowest model attributes can be ignored before computing the mean or determining the median of the model attributes). In some embodiments, other known mathematical operations or functions can be applied to the set of model attributes to arrive at a manipulated model attribute that can be used in computing the weighted utility metric for a particular model component.
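
Ignoring a fixed number of the highest and lowest model attributes before averaging amounts to a trimmed mean, sketched below. The function name `trimmed_mean`, the `drop` parameter, and the sample attribute values are assumptions for illustration.

```python
def trimmed_mean(values, drop=1):
    # Discard the `drop` lowest and `drop` highest values, then average
    # what remains. Requires more than 2 * drop values.
    ordered = sorted(values)
    if len(ordered) <= 2 * drop:
        raise ValueError("not enough values to trim")
    core = ordered[drop:len(ordered) - drop]
    return sum(core) / len(core)

# Hypothetical model attributes for the "best" models containing a component.
attributes = [0.05, 0.28, 0.30, 0.32, 0.95]
robust_attribute = trimmed_mean(attributes, drop=1)
```

With the two extreme values removed, the result is close to 0.30 rather than the untrimmed mean of 0.38.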

Returning to the example above, if the utility metric for c_1g is 0.6 and the mean of the model attributes is 30%, the weighted utility metric is 0.18. This process is repeated for every model component that appears in the set of "best" models {m_1a1, ..., m_Xc1}, producing a weighted utility metric for each model component in the set of "best" models.
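
The worked example (0.6 multiplied by 30% gives 0.18) can be reproduced with a short sketch. The function name `weighted_utility` and the `how` selector are assumptions; the aggregation functions come from Python's standard `statistics` module (`geometric_mean` requires Python 3.8 or later).

```python
import statistics

# Ways of condensing the model attributes into one number, as listed above.
AGGREGATORS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "mode": statistics.mode,
    "geometric_mean": statistics.geometric_mean,
}

def weighted_utility(utility, model_attributes, how="mean"):
    # Weighted utility = utility metric x aggregate of the model attributes
    # of the "best" models in which the component appears.
    return utility * AGGREGATORS[how](model_attributes)

# Utility metric 3/5 and three model attributes whose mean is 0.30.
w = weighted_utility(3 / 5, [0.20, 0.30, 0.40])
```

The value of `w` is approximately 0.18, matching the example in the text up to floating-point rounding.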

The next phase of the method of the main part of the invention requires determining which model components are considered important and which are considered unimportant. Model components that are "important" are reused and are eligible to be placed in the new set of model pools used to generate the next set of runs. Model components that are "unimportant" are discarded and are not reused in the new set of model pools, which ensures that "unimportant" model components are not used to create new models.

It is contemplated that whether a model component is "important" or "unimportant" is determined by whether its weighted utility metric falls below or above a threshold. In some embodiments, the threshold can be computed by first averaging all of the weighted utility metrics for the model components that appear in the "best" set of models (e.g., {m_1a1, ..., m_Xc1} in FIG. 8). Each individual weighted utility metric is then divided by that average. If the result of dividing a weighted utility metric by the average of all weighted utility metrics is below a certain threshold (e.g., if the result is less than 1.2, 1.1, 1, 0.9, 0.8, 0.7, 0.6, 0.5, or 0.4), the model component corresponding to that weighted utility metric is removed from consideration (e.g., it cannot be placed in any new model component pool used to generate the new set of runs). This process is shown in FIG. 5.
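
The ratio-to-average test can be sketched as follows. The function name `prune_by_relative_weight`, the sample weights, and the choice to raise an error when the average is zero are assumptions made for the example.

```python
def prune_by_relative_weight(weighted, threshold=1.0):
    # Divide each weighted utility metric by the average of all of them;
    # components whose ratio falls below `threshold` are eliminated.
    if not weighted:
        raise ValueError("no weighted utility metrics given")
    average = sum(weighted.values()) / len(weighted)
    if average == 0:
        # Assumption: an all-zero set is treated as an error, not pruned.
        raise ValueError("average weighted utility is zero")
    kept = {c for c, w in weighted.items() if w / average >= threshold}
    eliminated = set(weighted) - kept
    return kept, eliminated

# Hypothetical weighted utility metrics; their average is 0.15.
weighted = {"c_1": 0.18, "c_2": 0.02, "c_3": 0.25, "c_4": 0.15}
kept, eliminated = prune_by_relative_weight(weighted, threshold=0.5)
```

With a threshold of 0.5, only `c_2` (ratio of about 0.13) is removed from consideration; raising the threshold prunes more aggressively.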

In some embodiments, model components can be eliminated (e.g., removed from consideration) based on their corresponding utility metrics. To do this, once utility metrics have been computed for a number of components, the utility metrics for those model components are analyzed using, for example, summary statistics. Contemplated summary statistics include location (e.g., arithmetic mean, median, mode, and interquartile mean), spread (e.g., standard deviation, variance, range, interquartile range, absolute deviation, mean absolute difference, and distance standard deviation), shape (e.g., skewness or kurtosis, and alternatives based on L-moments), and dependence (e.g., Pearson's product-moment correlation coefficient or Spearman's rank correlation coefficient).

A utility metric can then be compared with a summary statistic to determine whether the corresponding model component should be kept or eliminated. For example, if the utility metric for a model component is compared with the arithmetic mean computed from a set of utility metrics (e.g., the utility metric is divided by the mean of the set of utility metrics), the model component can be eliminated if the resulting value is less than 1 (indicating that the model component is less influential or less useful than half of the total number of model components whose utility metrics contribute to the mean). In another example, if a utility metric falls more than one standard deviation below the mean, the model component corresponding to it can be eliminated. The primary goal is to facilitate eliminating model components that, compared with other model components, do not contribute as much to the "best" models. FIG. 11 illustrates this concept generally; there, the threshold is determined using summary statistics as described above.
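
The standard-deviation criterion can be sketched as below. The function name `prune_below_one_sd` and the sample utility metrics are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) rather than the sample standard deviation is an assumption, since the text does not say which is intended.

```python
import statistics

def prune_below_one_sd(utility):
    # Cutoff = mean minus one (population) standard deviation of the
    # utility metrics; components below the cutoff are eliminated.
    values = list(utility.values())
    cutoff = statistics.mean(values) - statistics.pstdev(values)
    return {c for c, u in utility.items() if u < cutoff}

# Hypothetical utility metrics for four model components.
utility = {"c_1": 0.90, "c_2": 0.80, "c_3": 0.85, "c_4": 0.10}
to_eliminate = prune_below_one_sd(utility)
```

In this toy data only `c_4` lies more than one standard deviation below the mean, so it is the only component eliminated.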

In many situations, it is contemplated that utility metrics are compared with summary statistics by dividing each individual utility metric by the summary statistic of the set of utility metrics. This works for some summary statistics (e.g., location summary statistics), but other summary statistics (e.g., spread summary statistics) require comparing the utility metric value with a range of values to see whether the utility metric falls within a desired range.

Instead of computing the mean of the weighted utility metrics, it is also contemplated that the weighted utility metric for each model component in the set of "best" models can be manipulated in other ways. For example, in some embodiments, each individual weighted utility metric can be divided by the median of the set of weighted utility metrics. In other embodiments, the mode of the set of weighted utility metrics can be used in place of the mean or median.
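As a rough sketch of swapping the location statistic, the snippet below divides each weighted utility metric by the mean, median, or mode of the set. The `normalize` helper, the `SUMMARY` lookup table, and the sample values are hypothetical; note also that the mode is only informative when metric values actually repeat (for example after rounding).

```python
from statistics import mean, median, mode

# Hypothetical lookup of location statistics by name.
SUMMARY = {"mean": mean, "median": median, "mode": mode}


def normalize(weighted, statistic="mean"):
    """Divide each weighted utility metric by the chosen location statistic."""
    center = SUMMARY[statistic](list(weighted.values()))
    return {name: value / center for name, value in weighted.items()}


# Hypothetical weighted utility metrics for five model components.
weighted = {"x1": 0.2, "x2": 0.2, "x3": 0.5, "x4": 0.8, "x5": 1.3}

by_median = normalize(weighted, "median")  # the median of the set is 0.5
by_mode = normalize(weighted, "mode")      # the mode of the set is 0.2
```

A component whose normalized value is below 1 sits below the chosen center of the set, so the same "less than 1" rule can be reused regardless of which statistic is selected.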

In some embodiments, after model components are removed from consideration, a new pool of model components is generated without the eliminated model components. In other embodiments, the model components are eliminated from the existing pool, and the same model component pool is used again to generate a set of models for a new set of runs. In still other embodiments, the model components are not eliminated from the model component pool but are simply removed from consideration. From this point, the process repeats, and eventually more model components can be eliminated. The process can be repeated as needed until all remaining model components are found to contribute significantly to the "best" models in each run.
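The repeat-until-stable loop described above might be sketched as follows, using the first variant (a new pool is built without the eliminated components). Everything here is an assumption made for illustration: `prune_until_stable`, the fixed `threshold`, and especially `toy_run`, which merely stands in for a full run of model generation and scoring.

```python
def prune_until_stable(pool, score_run, threshold, max_rounds=10):
    """Repeat runs, dropping weak components, until every survivor meets the threshold."""
    active = set(pool)
    for round_number in range(1, max_rounds + 1):
        scores = score_run(active)  # one "run" over the current pool
        weak = {c for c in active if scores[c] < threshold}
        if not weak:
            return active, round_number
        active -= weak  # build the next pool without the weak components
    return active, max_rounds


# Toy stand-in for a run: each component's score is its share of a fixed "signal".
SIGNAL = {"a": 5.0, "b": 3.0, "c": 1.0, "d": 1.0}


def toy_run(active):
    total = sum(SIGNAL[c] for c in active)
    return {c: SIGNAL[c] / total for c in active}


survivors, rounds = prune_until_stable(SIGNAL, toy_run, threshold=0.2)
print(sorted(survivors), rounds)  # ['a', 'b'] 2
```

The first round removes `c` and `d`; the second round finds that both remaining components meet the threshold, so the loop stops.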

It is also contemplated that, when the next run is generated using a model component pool that has been trimmed of model components, the set of "best" models from previous runs can be incorporated into subsequent runs. Thus, if a "best" model from a previous run contains a model component that would otherwise be discarded as unimportant, that model component can be reintroduced by introducing that "best" model, even if it is less favorable than the "best" models of other runs.

It is contemplated that model components or features of models found within the "best" models from some runs may be eliminated from consideration because those "best" models are less favorable than other "best" models (e.g., with respect to any of the performance characteristics described in this application). By introducing a mechanism through which model components or features that would otherwise be eliminated (either by removing the model component from consideration or by removing the entire model from consideration) are brought back into consideration, faster convergence to high-performing models can result.

This concept can be extended to any form of iterative model generation described in this application. As shown in FIG. 12, model components from a first model set can be introduced into a second model set by incorporating models from the first model set into the second model set. It is contemplated that a "set" of models can include a generation of models as described in this application.

For example, a first run results in a "best" model, and a second run (which begins with a set of randomly generated models that use model components from the pruned model component pool) can include the "best" model of the first run in its initial set of randomly generated models. Doing so introduces elements of a previously identified effective model into the new run (e.g., it can revive one or more model components that would otherwise have been discarded), which improves the second run's ability to evolve "best" models over generations. The "best" model from the first run can be introduced into the second run even if it is not the best among the "best" models of the other runs in the group of runs that includes the first and second runs described in this example. The goal is to give models that are not necessarily the top performers across several runs an opportunity to be introduced into a new run, so as to incorporate useful model components or model features that would otherwise have been eliminated (e.g., by eliminating model components according to the methods described in this application, or by removing the model itself from consideration because it performs worse than other models).
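A minimal sketch of this seeding step is given below, assuming that a model can be represented as a tuple of component names and that the first generation of a run is otherwise drawn at random from the pruned pool. The function `initial_population`, its parameters, and the component names are hypothetical.

```python
import random


def initial_population(pool, size, model_len, seeds=(), rng=None):
    """Build a run's first generation: seed models first, random models after."""
    rng = rng or random.Random()
    population = [tuple(model) for model in seeds][:size]
    candidates = sorted(pool)  # sorted so a seeded rng gives repeatable draws
    while len(population) < size:
        population.append(tuple(rng.sample(candidates, model_len)))
    return population


pruned_pool = {"x1", "x4", "x7", "x9"}  # "x3" was pruned after run 1
best_run1 = ("x1", "x3", "x4")          # run 1's "best" model still uses "x3"

generation0 = initial_population(
    pruned_pool, size=5, model_len=3, seeds=[best_run1], rng=random.Random(0)
)
revived = {c for model in generation0 for c in model} - pruned_pool
print(revived)  # {'x3'}
```

Because the random models draw only from the pruned pool, the only way `x3` can reach the second run is through the carried-over model, which is the revival effect the paragraph describes.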

Throughout this process, model components are pruned from one or more pools of model components. By pruning model components in accordance with the inventive subject matter, the computation time required to perform genetic programming (and related tasks) is dramatically reduced.

The inventor contemplates that model components from the "best" models of previous runs can additionally be reincorporated. In these situations, the "best" model from one run may not be the "best" model when compared with the "best" model from the next run. Even if a model from a previous run is no longer considered "best", it can be important to reintroduce that model into one or more subsequent runs. For example, in genetic programming, features of a "best" model from a previous run (e.g., groups of model components that collectively affect the model's performance) may lead to more accurate models in subsequent runs even though, for whatever reason, those features did not lead to a better model earlier. Thus, by introducing the "best" model from a previous run into the next run, features of that model can be incorporated into models within the next run. In embodiments that implement genetic programming techniques, for example, this ensures that features of good models are introduced into new runs, so that those features are not lost when the models containing them would otherwise be discarded or ignored because, as a whole, they do not perform as well as models from other runs.

Thus, model components eliminated through the process described above can be brought back into consideration. "Best" models from past runs (e.g., one or more models from each run that were found to be "best") can contain model components or features that were eliminated, for example, for failing to meet the threshold for remaining in consideration. These "best" models can be considered in the next run (as described above and as shown in FIG. 13), which brings the otherwise eliminated model components back into consideration. With reference to the drawings, it is contemplated that, as shown in FIG. 8 for example, the models within any generation (e.g., the final generation) of run 2 can incorporate the "best" model from run 1 (m_1a1), thereby reintroducing any model components within model m_1a1 that would otherwise have been removed from consideration. This process is shown in FIG. 10.

In embodiments where model components are brought back into consideration in this way, it is contemplated that, instead of eliminating a model component from one or more pools of model components when it fails to meet the threshold, the model component is simply removed from consideration (e.g., it can remain in all model component pools, but it can no longer be used in any model). Then, when a "best" model from one run is reintroduced into the next run, the denominator of the utility metric is no longer zero, and the model component has an opportunity to be brought back into consideration. For example, if a model component is initially removed from consideration but is then reintroduced, and its weighted utility metric subsequently exceeds the threshold, the model component can be brought back into consideration and used in subsequently generated models.
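One way to picture this "sidelined but not deleted" bookkeeping is sketched below. It is an illustration under several assumptions that are not taken from this application: components carry a simple in-consideration flag, the utility metric is left unweighted, and the denominator is taken to be the number of "best" models a component had a chance to appear in during the run. The name `review_components` and the sample data are hypothetical.

```python
def review_components(in_consideration, best_models, threshold):
    """Return updated in/out-of-consideration flags after one run.

    in_consideration: dict of component name -> bool (components stay in the
    dict, i.e. in the pool, even when their flag is False).
    best_models: the run's "best" models, including any carried over from
    earlier runs, each given as a collection of component names.
    """
    updated = {}
    for name, flag in in_consideration.items():
        hits = sum(name in model for model in best_models)
        # Assumed denominator: models the component could have appeared in.
        # It is zero for a sidelined component unless a carried-over model
        # brought the component back into the run.
        chances = len(best_models) if (flag or hits) else 0
        updated[name] = chances > 0 and hits / chances >= threshold
    return updated


# Hypothetical state before run 2: "x3" and "x5" were sidelined after run 1.
flags = {"x1": True, "x3": False, "x4": True, "x5": False, "x6": True}
# Hypothetical "best" models of run 2; a carried-over model spread "x3" again.
best_models = [("x1", "x3"), ("x1", "x3", "x4"), ("x3", "x4"), ("x1",)]

new_flags = review_components(flags, best_models, threshold=0.5)
print(new_flags)
```

Here `x3` returns to consideration because the carried-over model gave it a nonzero denominator and it then cleared the threshold, `x5` stays sidelined because nothing reintroduced it, and `x6` is newly sidelined because it appeared in none of the "best" models.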

The inventive subject matter is an improvement in the state of the art in part because computational methods for processing large data sets suffer from the "curse of dimensionality". The curse of dimensionality is the idea that the dimension of a problem grows as the number of rows (e.g., observations) and/or columns (e.g., predictors) increases. As the dimension increases, the volume of the space grows so quickly that the available data become sparse. This sparsity is a problem for any method that requires statistical significance, and it causes difficulties for any analytical method in several important ways.

First, when statistically sound and reliable results are desired, the amount of data needed to support the results often grows exponentially with the dimension. Second, many methods of organizing and searching data rely on detecting regions where objects form groups with similar properties. In high-dimensional data, however, all objects appear sparse and dissimilar in many ways, which can reduce the efficiency of common data organization procedures.

In the context of iterative modeling techniques, high dimensionality raises a further problem. Each additional dimension exponentially increases the size of the solution search space. Because many iterative methods randomly sample the search space of possible solutions, each model component added to a problem exponentially increases the amount of time (both physical and computational) needed to converge on a solution.
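The exponential growth can be made concrete with a small calculation. The sketch assumes, purely for illustration, that a candidate model is any non-empty subset of the available model components; real model spaces (which also vary structure and operators) are larger still. The function name and the component counts are hypothetical.

```python
def candidate_models(n_components):
    """Count the non-empty subsets of n_components (one subset per candidate model)."""
    return 2 ** n_components - 1


before = candidate_models(40)  # pool of 40 components
after = candidate_models(20)   # pool pruned to 20 components

print(before // after)  # 1048577
```

Under this assumption, halving a pool of 40 components shrinks the space of candidate models by a factor of roughly one million, which is why removing even a handful of components can matter.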

In applying the inventive subject matter, the inventors have found that iteratively reducing the number of input features (e.g., model components) available to an iterative modeling process can, in some cases, reduce the time required to reach convergence by a factor of 100 or, alternatively, greatly increase the "search space" or depth that the process can consider in the same amount of time.

One reason for this performance improvement is that, with fewer model components available to the iterative modeling process, any individual model component is more likely to be stored in, and subsequently retrieved from, the CPU cache (a "cache hit") rather than retrieved from RAM or another form of electronic storage such as a hard drive or flash (a "cache miss"). The inventive subject matter increases the likelihood of a "cache hit" and, in some cases, makes a "cache hit" more likely than a "cache miss" for any given model component.

As briefly mentioned above, a "cache hit" is the state in which data requested for processing by a program (e.g., a model component) are found in the CPU's cache memory. Cache memory is considerably faster at delivering data to the processor. When executing a command, the CPU looks for the data in the nearest accessible memory location, which is usually the primary CPU cache. If the requested data are found in the cache, it is considered a "cache hit". A "cache hit" delivers data more quickly because of the speed at which the CPU cache transfers data to the CPU. "Cache hit" can also refer to retrieving data from a disk cache where the requested data are stored and accessed on first query.

The improvement in computation time from maximizing "cache hits" comes from the speed of access to data stored in the CPU cache compared with other storage media. For example, a level 1 cache reference takes on the order of 0.5 nanoseconds, and a level 2 cache reference takes on the order of 7 nanoseconds. By comparison, a random read from a solid-state drive takes on the order of 150,000 nanoseconds, i.e., about 300,000 times longer than a level 1 cache reference.
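The arithmetic behind these figures, and the effect of the hit rate on average access time, can be checked with a short calculation. The latencies are the order-of-magnitude values quoted above; the two-tier model (every access is either a level 1 hit or a solid-state read) is a deliberate simplification assumed for this sketch, and the function name is hypothetical.

```python
# Order-of-magnitude latencies quoted in the text, in nanoseconds.
L1_NS = 0.5         # level 1 cache reference
L2_NS = 7.0         # level 2 cache reference
SSD_NS = 150_000.0  # random read from a solid-state drive


def expected_access_ns(hit_rate, hit_ns=L1_NS, miss_ns=SSD_NS):
    """Average access time when a fraction hit_rate of accesses hit the cache."""
    return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns


print(SSD_NS / L1_NS)  # 300000.0
for rate in (0.90, 0.99):
    print(rate, expected_access_ns(rate))
```

In this simplified model, raising the hit rate from 90% to 99% cuts the average access time by roughly a factor of ten, because the rare misses dominate the total.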

Thus, specific configurations and methods of iterative feature selection have been disclosed. It should be apparent to those skilled in the art, however, that many more modifications besides those already described are possible without departing from the inventive concepts in this application. The inventive subject matter, therefore, is not to be restricted except in the spirit of the disclosure. Moreover, in interpreting the disclosure, all terms should be interpreted in the broadest possible manner consistent with the context. In particular, the term "comprising" should be interpreted as referring to elements, components, or steps in a non-exclusive manner, indicating that the referenced elements, components, or steps can be present, utilized, or combined with other elements, components, or steps that are not expressly referenced.

Claims

A method of reducing the computation time required to improve models that relate predictors and outcomes in a dataset, using a processor in a computing system, the method comprising the steps of:
generating at least one model comprising at least one model component;
performing an iterative model development process to generate a set of improved models, including a first improved model based on the at least one model, wherein the set of improved models comprises at least two model generations and the iterative model development process is a deep learning method;
calculating a model attribute metric for at least one of the models using a subset of the dataset;
calculating at least one utility metric for the at least one model component, the utility metric comprising a ratio, wherein the numerator of the ratio is the number of models in the set of improved models in which the at least one model component is present, and the denominator of the ratio is incremented when the at least one model component is in a pool of model components;
calculating a weighted utility metric corresponding to the at least one model component, wherein the weighted utility metric comprises the result of a function incorporating the model attribute metric and the at least one utility metric;
eliminating the at least one model component from the pool of model components based on the weighted utility metric;
identifying a model from one of the at least two model generations based on at least one criterion; and
storing the identified model.

The method of claim 1, further comprising the step of introducing the identified model into a next model run by generating the next model run comprising a plurality of models, wherein a next model generation in the next model run comprises the identified model.

The method of claim 1, wherein the model run comprises a randomly generated model.

The method of claim 2, wherein the next model run further comprises at least one of a crossover of at least two models from the model run and a mutated model from the model run.

The method of claim 2, wherein the next model run further comprises a randomly generated model.

The method of claim 1, wherein the at least one criterion comprises at least one of: model accuracy compared to other models in the generation, a characteristic compared to other models in the generation, model length compared to other models in the generation, and computation time compared to other models in the generation.

The method of claim 2, wherein the identified model comprises a model component that is not in the next model run.

A method of reducing the computation time required to improve models that relate predictors and outcomes in a dataset, using a processor in a computing system, the method comprising the steps of:
generating at least one model comprising at least one model component;
performing an iterative model development process to generate a set of improved models, including a first improved model based on the at least one model, wherein the set of improved models comprises at least one model generation and the iterative model development process is a deep learning method;
calculating a model attribute metric for at least one of the models using a subset of the dataset;
calculating at least one utility metric for the at least one model component, the utility metric comprising a ratio, wherein the numerator of the ratio is the number of models in the set of improved models in which the at least one model component is present, and the denominator of the ratio is incremented when the at least one model component is in a pool of model components;
calculating a weighted utility metric corresponding to the at least one model component, wherein the weighted utility metric comprises the result of a function incorporating the model attribute metric and the at least one utility metric;
eliminating the at least one model component from the pool of model components based on the weighted utility metric;
identifying a first model from the at least one model generation based on at least one criterion, wherein the first model is not a preferred model from the at least one model generation; and
storing the identified model.

The method of claim 8, further comprising the step of introducing the first model into the next model generation by generating the next model generation comprising the first model.

The method of claim 9, wherein the next model generation further comprises a randomly generated model.

The method of claim 8, wherein the at least one criterion comprises at least one of: model accuracy compared to other models in the generation, a characteristic compared to other models in the generation, model length compared to other models in the generation, and computation time compared to other models in the generation.

The method of claim 9, wherein the next model generation further comprises at least one of a crossover of at least two models from the model generation and a mutated model from the model generation.

The method of claim 9, further comprising the steps of:
identifying a second model from the next model generation based on the at least one criterion, wherein the second model is not a preferred model from the next model generation; and
introducing the second model and the first model into a new next model generation by generating the new next model generation comprising the second model and the first model.

A method of reducing the computation time required to improve models that relate predictors and outcomes in a dataset, using a processor in a computing system, the method comprising the steps of:
generating at least one model comprising at least one model component;
performing an iterative model development process to generate a set of improved models, including a first improved model based on the at least one model, wherein the set of improved models comprises a plurality of model generations and the iterative model development process is a deep learning method;
calculating a model attribute metric for at least one of the models using a subset of the dataset;
calculating at least one utility metric for the at least one model component, the utility metric comprising a ratio, wherein the numerator of the ratio is the number of models in the set of improved models in which the at least one model component is present, and the denominator of the ratio is incremented when the at least one model component is in a pool of model components;
calculating a weighted utility metric corresponding to the at least one model component, wherein the weighted utility metric comprises the result of a function incorporating the model attribute metric and the at least one utility metric;
eliminating the at least one model component from the pool of model components based on the weighted utility metric;
identifying a model for each generation within a subset of the plurality of model generations based on a criterion, wherein each identified model is a preferred model from its corresponding generation; and
storing each identified model.

The method of claim 14, further comprising the step of introducing each identified model into a final model generation by generating the final model generation comprising each identified model.

The method of claim 14, wherein the next model generation additionally comprises a randomly generated model.

The method of claim 14, wherein the criterion comprises at least one of: model accuracy compared to other models in the generation, a characteristic compared to other models in the generation, model length compared to other models in the generation, and computation time compared to other models in the generation.

The method of claim 14, wherein the next model generation further comprises at least one of a crossover of at least two models from the model generation and a mutated model from the model generation.