JP7439125B2

JP7439125B2 - Decentralized privacy-preserving computing for protected data

Info

Publication number: JP7439125B2
Application number: JP2021557379A
Authority: JP
Inventors: レイチェルカルカット; マイケルブラム; ジョーヘッセ; ロバートディー．ロジャーズ; スコットハモンド; マリーエリザベスチョーク
Original assignee: University of California
Current assignee: University of California
Priority date: 2019-03-26
Filing date: 2020-03-26
Publication date: 2024-02-27
Anticipated expiration: 2040-03-26
Also published as: AU2020244856A1; BR112021018241A2; AU2020244856B2; EP3948570A1; EP3948570A4; US20200311300A1; IL286232A; US11531904B2; JP2022526948A; KR20240019868A; WO2020198542A1; US20230325682A1; US12001965B2; KR20210143879A; CN113892093A; US11748633B2; CA3133466A1; IL311967A; US20230080780A1; KR102634785B1

Description

Cross-Reference to Related Applications
This application claims priority to and the benefit of U.S. Provisional Application No. 62/948,556, filed on December 16, 2019, entitled "DISTRIBUTED PRIVACY-PRESERVING COMPUTING ON PROTECTED DATA," and U.S. Provisional Application No. 62/824,183, filed on March 26, 2019, entitled "FEDERATED MACHINE LEARNING TECHNIQUES FOR HIGHLY CURATED HEALTH-CARE DATA SETS," the entire contents of which are incorporated herein by reference for all purposes.

Field
The present invention relates to privacy-preserving computing, and in particular to techniques (e.g., systems, methods, and computer program products storing code or instructions executable by one or more processors) for developing artificial intelligence applications and/or algorithms using distributed analysis across multiple sources of privacy-protected, harmonized data. The present invention is particularly effective for developing artificial intelligence algorithms for regulated medical applications using privacy-protected, harmonized clinical and health data.

Background
Modern computing paradigms, including cloud computing, data-parallel cluster computing, and high-performance computing, combined with a variety of widely available machine learning and deep learning algorithm architectures, have created an environment in which a vast number of artificial intelligence (AI) applications can be developed to solve problems in almost any industry, provided that enough data is available to properly optimize the underlying algorithms. It is now clear that access to data is the primary barrier to developing AI applications. Indeed, in many industries, creating robust and generalizable AI requires data from a variety of sources. The challenge is that data owners generally cannot share their data, will not share it, or will not allow it to leave their control. This is understandable: data is an asset, it often includes sensitive private and/or personal data, and it can be regulated in ways that make sharing difficult or impossible. These challenges are particularly difficult to overcome in the development of healthcare AI.

Approximately 30% of the world's stored data is in healthcare, and this is fueling the development and funding of AI algorithms. The insights created by AI and machine learning ("ML") have the potential to learn the relationships (i.e., correlations) in complex data that are needed to enhance clinical decision-making, enable precision treatment, and create digital therapeutics. Healthcare AI is relevant to the fields of biotechnology, pharmaceuticals, medical information technology, analytics and genetic testing, and medical devices. Artificial intelligence approaches, including machine learning systems, can identify patterns in complex data. In general, the higher the fidelity and diversity of the data used to create algorithms and models, the more accurately and consistently those algorithms and models perform across diverse environments and populations. These AI approaches therefore require access to diverse, high-fidelity data to develop, optimize, and validate models. However, most AI algorithm developers lack healthcare data assets with sufficient fidelity and diversity to train, test, and validate their algorithms and models without overcoming significant barriers to data access. Moreover, even when AI algorithm developers have sufficient data assets, few perform third-party validation, which results in algorithms and models that are essentially proof-of-concept studies rather than solutions that can be applied in production or clinical settings. Further development of AI models and algorithms for production or clinical use appears to be significantly hindered by one major obstacle: timely access to high-fidelity, diverse, privacy-protected data.

Healthcare is subject to regulatory, legal, and ethical requirements to maintain the privacy of patient information. The goals of privacy include protecting data from unauthorized access, providing transparency of use in accordance with individuals' privacy consents, and minimizing the use of personally identifiable data wherever possible. Data privacy and its protection are therefore a barrier for AI, which requires timely access to high-fidelity, real-time, diverse data. The opportunity in healthcare AI is to use privacy-preserving computing to eliminate the risk of exposing identifiable information. These considerations apply to the development of AI in many industries where the sensitivity of the data (e.g., whether it contains trade secrets or private data about individuals) prevents it from being shared outside the boundaries of the organization responsible for protecting it. Accordingly, there is a need to establish privacy-preserving computing across multiple organizations to facilitate timely access to high-fidelity, diverse data.

Brief Summary
A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that, in operation, causes the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by a data processing apparatus, cause the apparatus to perform the actions. One general aspect includes a method that includes receiving, at a data processing system, an algorithm and input data requirements associated with the algorithm, where the input data requirements include optimization and/or validation selection criteria for data assets to be run on the algorithm. The method also includes identifying, by the data processing system, the data assets as being available from a data host based on the optimization and/or validation selection criteria for the data assets. The method also includes curating, by the data processing system, the data assets within a data storage structure that is within infrastructure of the data host. The method also includes preparing, by the data processing system, the data assets within the data storage structure for processing by the algorithm. The method also includes integrating, by the data processing system, the algorithm into a secure capsule computing framework, where the secure capsule computing framework serves the algorithm to the data assets within the data storage structure in a secure manner that preserves privacy of the data assets and the algorithm. The method also includes running, by the data processing system, the data assets through the algorithm. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the method.
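The sequence of steps in this aspect (identify, prepare, integrate, run) can be illustrated with a minimal Python sketch. Every name in it (`DataAsset`, `InputRequirements`, `identify`, `prepare`, `run_in_capsule`) is hypothetical and chosen only for illustration; the specification does not define an API, and an ordinary function call stands in here for the secure capsule computing framework.

```python
from dataclasses import dataclass


@dataclass
class DataAsset:
    host: str        # data host that keeps the records inside its own infrastructure
    modality: str    # assumed metadata field used by the selection criteria
    records: list


@dataclass
class InputRequirements:
    """Selection criteria supplied by the algorithm developer (assumed fields)."""
    modality: str
    min_records: int

    def accepts(self, asset: DataAsset) -> bool:
        return (asset.modality == self.modality
                and len(asset.records) >= self.min_records)


def identify(catalog, requirements):
    """Identify the data assets that satisfy the selection criteria."""
    return [asset for asset in catalog if requirements.accepts(asset)]


def prepare(asset):
    """Prepare an asset for the algorithm; in this sketch, drop missing values."""
    cleaned = [r for r in asset.records if r is not None]
    return DataAsset(asset.host, asset.modality, cleaned)


def run_in_capsule(algorithm, assets):
    """Stand-in for the secure capsule: records stay put, only results are returned."""
    return {asset.host: algorithm(asset.records) for asset in assets}


catalog = [
    DataAsset("hospital_a", "ct", [1.0, None, 3.0]),
    DataAsset("hospital_b", "mri", [2.0, 4.0]),
    DataAsset("hospital_c", "ct", [5.0]),
]
requirements = InputRequirements(modality="ct", min_records=2)
selected = [prepare(asset) for asset in identify(catalog, requirements)]
results = run_in_capsule(lambda records: sum(records) / len(records), selected)
```

In this toy run only `hospital_a` meets both criteria, and the caller sees a per-host summary rather than the underlying records. A real deployment would enforce that boundary with isolation and encryption rather than by convention.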

Implementations may include one or more of the following features. The method where the algorithm and the input data requirements are received from an algorithm developer that is a different entity from the data host, and the optimization and/or validation selection criteria define characteristics, formats, and requirements for the data assets to be run on the algorithm. The method where the characteristics and requirements of the data assets are defined based on: (i) the environment of the algorithm, (ii) the distribution of examples in the input data, (iii) the parameters and types of devices generating the input data, (iv) variance versus bias, (v) the tasks implemented by the algorithm, or (vi) any combination thereof. The method further including onboarding, by the data processing system, the data host, where the onboarding includes confirming that use of the data assets with the algorithm complies with data privacy requirements. The method where preparing the data assets includes applying one or more transforms to the data assets, annotating the data assets, harmonizing the data assets, or a combination thereof. The method where running the data assets through the algorithm includes executing a training workflow that includes: creating multiple instances of a model, splitting the data assets into a training data set and one or more test data sets, training the multiple instances of the model on the training data set, integrating results from training each of the multiple instances of the model into a fully federated model, running the one or more test data sets through the fully federated model, and computing performance of the fully federated model based on the running of the one or more test data sets. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.

Implementations may include one or more of the following features. The method where the secure capsule computing framework is provisioned within computing infrastructure configured to accept encrypted code required to run the algorithm, and provisioning the computing infrastructure includes instantiating the secure capsule computing framework on the computing infrastructure, depositing, by the algorithm developer, the encrypted code inside the secure capsule computing framework, and decrypting the encrypted code once the secure capsule computing framework is instantiated. The method where running the data assets through the algorithm includes executing a validation workflow that includes: splitting the data assets into one or more validation data sets, running the one or more validation data sets through the algorithm, and computing performance of the algorithm based on the running of the one or more validation data sets. The method where the identifying is performed using differential privacy for sharing information within the data assets by describing patterns of groups within the data assets while withholding private information about individuals in the data assets, the curating includes selecting the data storage structure from among multiple data storage structures and provisioning the data storage structure within the infrastructure of the data host, and the selection of the data storage structure is based on a type of the algorithm, a type of data within the data assets, system requirements of the data processing system, or a combination thereof. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
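To make the differential privacy feature concrete, the following Python sketch answers a group-level question ("how many records match?") while adding noise that masks any single individual's contribution. It uses the textbook Laplace mechanism for a counting query, which is an assumption made for illustration: the specification does not state which differential privacy mechanism is used. The names `laplace_noise` and `private_count`, the `epsilon` value, and the sample data are likewise hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Noisy count of records matching `predicate`.

    Adding or removing one individual changes a count by at most 1
    (sensitivity 1), so the Laplace scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(7)  # seeded only so the example is repeatable
ages = [34, 71, 68, 45, 80, 29]
noisy = private_count(ages, lambda age: age >= 65, epsilon=1.0, rng=rng)
```

A smaller `epsilon` gives stronger privacy and a noisier answer. A production system would also need a cryptographically sound noise source and a budget that limits how many such queries a data host answers.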

One general aspect includes a method that includes identifying multiple instances of an algorithm, where each instance of the algorithm is integrated into one or more secure capsule computing frameworks, and the one or more secure capsule computing frameworks serve each instance of the algorithm to training data assets within one or more data storage structures of one or more data hosts in a secure manner that preserves privacy of the training data assets and each instance of the algorithm. The method also includes executing, by a data processing system, a federated training workflow on each instance of the algorithm, where the federated training workflow takes the training data assets as input, maps features of the training data assets to a target inference using parameters, computes a loss or error function, updates the parameters to learned parameters in order to minimize the loss or error function, and outputs one or more trained instances of the algorithm. The method also includes integrating, by the data processing system, the learned parameters for each trained instance of the algorithm into a fully federated algorithm, where the integrating includes aggregating the learned parameters to obtain aggregated parameters and updating learned parameters of the fully federated algorithm with the aggregated parameters. The method also includes executing, by the data processing system, a testing workflow on the fully federated algorithm, where the testing workflow takes testing data as input, finds patterns in the testing data using the updated learned parameters, and outputs an inference. The method also includes calculating, by the data processing system, performance of the fully federated algorithm in providing the inference. The method also includes determining, by the data processing system, whether the performance of the fully federated algorithm satisfies algorithm termination criteria. The method also includes, when the performance of the fully federated algorithm does not satisfy the algorithm termination criteria, replacing, by the data processing system, each instance of the algorithm with the fully federated algorithm and re-executing the federated training workflow on each instance of the fully federated algorithm. The method also includes, when the performance of the fully federated algorithm satisfies the algorithm termination criteria, providing, by the data processing system, the performance of the fully federated algorithm and the aggregated parameters to an algorithm developer of the algorithm. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the method.
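The train, aggregate, test, and repeat loop described in this aspect can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the model is a one-variable linear regression, the aggregation is a size-weighted average of parameters (in the style of federated averaging), and the termination criterion is a loss threshold on held-out test data. The function names `local_update`, `aggregate`, `loss`, and `federated_training`, and all data values, are hypothetical.

```python
def local_update(params, shard, lr=0.05, steps=5):
    """Train one instance on one host's data: a few gradient steps on squared error."""
    w, b = params
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in shard) / len(shard)
        grad_b = sum(2 * (w * x + b - y) for x, y in shard) / len(shard)
        w, b = w - lr * grad_w, b - lr * grad_b
    return [w, b]


def aggregate(param_sets, sizes):
    """Combine learned parameters into one set, weighting each host by its data size."""
    total = sum(sizes)
    return [
        sum(params[i] * n for params, n in zip(param_sets, sizes)) / total
        for i in range(len(param_sets[0]))
    ]


def loss(params, data):
    """Mean squared error of the model on `data`."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def federated_training(shards, test_data, max_loss=1e-4, max_rounds=500):
    global_params = [0.0, 0.0]
    for round_no in range(1, max_rounds + 1):
        # Every instance starts from the current federated parameters, which is
        # how "replacing each instance with the fully federated algorithm" shows up here.
        local = [local_update(global_params, shard) for shard in shards]
        global_params = aggregate(local, [len(shard) for shard in shards])
        performance = loss(global_params, test_data)
        if performance <= max_loss:  # termination criterion (assumed form)
            break
    return global_params, performance, round_no


# Two hosts hold disjoint samples of y = 2x + 1; only parameters leave each host.
shards = [
    [(0.0, 1.0), (0.2, 1.4), (0.4, 1.8)],
    [(0.6, 2.2), (0.8, 2.6), (1.0, 3.0)],
]
test_data = [(0.1, 1.2), (0.5, 2.0), (0.9, 2.8)]
params, performance, rounds = federated_training(shards, test_data)
```

Only the parameter lists cross the boundary between a host and the aggregator in this sketch; the `(x, y)` pairs are read solely inside `local_update`, which is the role the secure capsule plays in the text.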

Implementations may include one or more of the following features. The method where identifying the multiple instances of the algorithm includes receiving, at the data processing system, the algorithm and input data requirements associated with the algorithm, where the input data requirements include optimization and/or validation selection criteria for data assets to be run on the algorithm. The method may also include identifying, by the data processing system, the data assets as being available from the one or more data hosts based on the optimization and/or validation selection criteria for the data assets. The method may also include curating, by the data processing system, the data assets within a data storage structure that is within infrastructure of each data host of the one or more data hosts. The method may also include splitting at least a portion of the data assets into the training data assets within the data storage structure that is within the infrastructure of each data host of the one or more data hosts. The method where the algorithm and the input data requirements are received from an algorithm developer that is a different entity from the one or more data hosts, and the optimization and/or validation selection criteria define characteristics, formats, and requirements for the data assets to be run on the algorithm. The method where the federated training workflow further includes encrypting training gradients, and the integrating includes decrypting the training gradients. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
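The last feature, encrypting training gradients before they leave a host and decrypting them during integration, can be illustrated with a toy Python sketch. The specification does not name a cipher, so the scheme below is a stand-in chosen for illustration: gradients are fixed-point encoded and hidden with a one-time additive pad modulo a prime, and the pad is assumed to be shared in advance between each host and the aggregator. `MODULUS`, `SCALE`, and all function names are hypothetical, and a real system would use a vetted cryptographic library and key exchange.

```python
import secrets

MODULUS = 2**61 - 1   # all arithmetic is done modulo this prime
SCALE = 10**6         # fixed-point precision: six decimal places


def encode(value: float) -> int:
    """Map a float to a residue; negative values wrap around the modulus."""
    return round(value * SCALE) % MODULUS


def decode(residue: int) -> float:
    # Residues above MODULUS // 2 represent negative numbers.
    signed = residue - MODULUS if residue > MODULUS // 2 else residue
    return signed / SCALE


def make_pad(length: int) -> list:
    """One fresh random pad value per gradient entry; never reuse a pad."""
    return [secrets.randbelow(MODULUS) for _ in range(length)]


def encrypt(gradient, pad):
    return [(encode(g) + k) % MODULUS for g, k in zip(gradient, pad)]


def decrypt(ciphertext, pad):
    return [decode((c - k) % MODULUS) for c, k in zip(ciphertext, pad)]


# Each host encrypts its gradient; the aggregator decrypts and then averages.
gradients = {"host_a": [0.5, -0.25], "host_b": [0.1, 0.35]}
pads = {host: make_pad(2) for host in gradients}
sent = {host: encrypt(g, pads[host]) for host, g in gradients.items()}
received = [decrypt(sent[host], pads[host]) for host in gradients]
averaged = [sum(column) / len(column) for column in zip(*received)]
```

Anyone observing `sent` without the pads sees uniformly random residues. The sketch protects gradients in transit only; hiding individual gradients from the aggregator itself would need a different technique, such as secure aggregation or homomorphic encryption.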

Implementations may include one or more of the following features. The method further including, when the performance of the fully federated algorithm satisfies the algorithm termination criteria, sending, by the data processing system, the aggregated parameters to each instance of the algorithm. The method may also include executing, by the data processing system, an update training workflow on each instance of the algorithm, where the update training workflow updates the learned parameters with the aggregated parameters and outputs one or more updated and trained instances of the algorithm. The method further including running, by the data processing system, the remaining data assets through each instance of the algorithm. The method where running the data assets through each instance of the algorithm includes executing a validation workflow that includes: further splitting at least a portion of the data assets into one or more validation data sets, running the one or more validation data sets through each instance of the algorithm, and computing performance of each instance of the algorithm based on the running of the one or more validation data sets. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
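The validation workflow (split, run, compute performance) is simple enough to show directly. The Python sketch below is illustrative only: the round-robin split, the use of accuracy as the performance measure, the threshold classifier, and the names `split_into_validation_sets`, `accuracy`, and `validate` are all assumptions, since the specification leaves these choices open.

```python
def split_into_validation_sets(data, n_sets):
    """Deal examples round-robin into `n_sets` disjoint validation sets."""
    return [data[i::n_sets] for i in range(n_sets)]


def accuracy(algorithm, validation_set):
    """Fraction of examples for which the algorithm's output equals the label."""
    correct = sum(1 for features, label in validation_set
                  if algorithm(features) == label)
    return correct / len(validation_set)


def validate(algorithm, data, n_sets=2):
    """Run every validation set through the algorithm and score each one."""
    return [accuracy(algorithm, subset)
            for subset in split_into_validation_sets(data, n_sets)]


def threshold_classifier(x):
    # Hypothetical trained instance: predicts 1 when the feature reaches 0.5.
    return int(x >= 0.5)


labelled = [(0.1, 0), (0.9, 1), (0.4, 1), (0.7, 1), (0.2, 0), (0.6, 1)]
scores = validate(threshold_classifier, labelled, n_sets=2)
```

Reporting one score per validation set, rather than a single pooled number, makes it easier to see whether an instance performs consistently across subsets of a host's data.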

One general aspect includes a method that includes identifying, by a data processing system, data assets available from a data host based on selection criteria for the data assets. The method also includes curating, by the data processing system, the data assets within a data storage structure that is within infrastructure of the data host. The method also includes preparing, by the data processing system, a transformer prototype data set to be used as a guide for developing algorithms for data transformation, where the transformer prototype data set captures key attributes of the harmonization process. The method also includes creating, at the data processing system, a first set of harmonization transformers for transforming the data assets based on a current format of data in the transformer prototype data set. The method also includes applying, by the data processing system, the first set of harmonization transformers to the data assets to generate transformed data assets. The method also includes preparing, by the data processing system, a harmonization prototype data set to be used as a guide for developing algorithms for data transformation, where the harmonization prototype data set captures key attributes of the harmonization process. The method also includes creating, by the data processing system, a second set of harmonization transformers for transforming the transformed data assets based on a current format of data in the harmonization prototype data set. The method also includes applying, by the data processing system, the second set of harmonization transformers to the transformed data assets to generate harmonized data assets. The method also includes running, by the data processing system, the harmonized data assets through an algorithm, where the algorithm is within a secure capsule computing framework that serves the algorithm to the harmonized data assets within the data storage structure in a secure manner that preserves privacy of the harmonized data assets and the algorithm. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the method.
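The two-stage harmonization described in this aspect can be pictured as two lists of record-level functions applied in order. The Python sketch below is a hypothetical illustration: the record layout, the field names (`wt`, `weight_value`, `weight_unit`, `weight_kg`), and the functions `parse_weight`, `to_kilograms`, and `apply_transformers` are invented for the example, and the specification does not prescribe what any particular transformer does.

```python
LB_TO_KG = 0.45359237  # exact definition of the international pound in kilograms


def apply_transformers(records, transformers):
    """Apply each transformer, in order, to every record."""
    result = []
    for record in records:
        for transform in transformers:
            record = transform(record)
        result.append(record)
    return result


def parse_weight(record):
    """First-stage transformer: split a raw 'value unit' string into typed fields."""
    value, unit = record["wt"].split()
    rest = {k: v for k, v in record.items() if k != "wt"}
    return {**rest, "weight_value": float(value), "weight_unit": unit}


def to_kilograms(record):
    """Second-stage transformer: express every weight in the unit the algorithm expects."""
    factor = {"kg": 1.0, "lb": LB_TO_KG}[record["weight_unit"]]
    rest = {k: v for k, v in record.items()
            if k not in ("weight_value", "weight_unit")}
    return {**rest, "weight_kg": round(record["weight_value"] * factor, 2)}


first_transformer_set = [parse_weight]    # written against the transformer prototype
second_transformer_set = [to_kilograms]   # written against the harmonization prototype

raw_assets = [{"sex": "F", "wt": "154 lb"}, {"sex": "M", "wt": "70 kg"}]
transformed = apply_transformers(raw_assets, first_transformer_set)
harmonized = apply_transformers(transformed, second_transformer_set)
```

Keeping the stages separate leaves room for a step between them, such as annotation of the transformed records, without touching either transformer set.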

Implementations may include one or more of the following features. The method where the selection criteria are received from an algorithm developer that is a different entity from the data host, and the selection criteria define characteristics, formats, and requirements for the data assets to be run on the algorithm. The method where the characteristics and requirements of the data assets are defined based on: (i) the environment of the algorithm, (ii) the distribution of examples in the input data, (iii) the parameters and types of devices generating the input data, (iv) variance versus bias, (v) the tasks implemented by the algorithm, or (vi) any combination thereof. The method further including anonymizing the transformer prototype data set and making the anonymized transformer prototype data set available to the algorithm developer for the purpose of creating the first set of harmonization transformers for transforming the data assets. The method where applying the first set of harmonization transformers to the data assets is performed within the data structure. The method further including annotating, at the data processing system, the transformed data assets in accordance with a predefined annotation protocol to generate an annotated data set, where annotating the transformed data is performed within the data structure, and the second set of harmonization transformers is applied to the annotated data set to generate the harmonized data assets. The method where applying the second set of harmonization transformers to the annotated data assets is performed within the data structure. The method further including determining whether the first set of harmonization transformers, the annotations, and the second set of harmonization transformers were applied successfully and without violating data privacy requirements. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.

One general aspect includes a method that includes identifying an algorithm or model, where the algorithm or model is integrated into a secure capsule computing framework, and the secure capsule computing framework serves the algorithm or model to training data assets within a data storage structure of a data host in a secure manner that preserves the privacy of the training data assets and the algorithm or model. The method also includes executing, by a data processing system, a federated training workflow on the algorithm or model, where the federated training workflow takes the training data assets as input, maps features of the training data assets to a target inference using parameters, computes a loss or error function, updates the parameters to learned parameters in order to minimize the loss or error function, and outputs a trained algorithm or model. The method also includes integrating, by the data processing system, the learned parameters of the algorithm or model into a fully federated algorithm or model, where the integrating includes aggregating the learned parameters to obtain aggregated parameters and updating the learned parameters of the fully federated algorithm or model with the aggregated parameters. The method also includes executing, by the data processing system, a testing workflow on the fully federated algorithm or model, where the testing workflow takes test data as input, finds patterns in the test data using the updated learned parameters, and outputs an inference. The method also includes calculating, by the data processing system, the performance of the fully federated algorithm in providing the inference. The method also includes determining, by the data processing system, whether the performance of the fully federated algorithm or model satisfies algorithm termination criteria. When the performance does not satisfy the algorithm termination criteria, the method also includes replacing, by the data processing system, the algorithm or model with the fully federated algorithm or model and re-executing the federated training workflow on the fully federated algorithm or model. When the performance satisfies the algorithm termination criteria, the method also includes providing, by the data processing system, the performance of the fully federated algorithm or model and the aggregated parameters to an algorithm developer of the algorithm or model. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the method.

Implementations may include one or more of the following features. The method where identifying the algorithm includes receiving, at the data processing system, the algorithm and input data requirements associated with the algorithm, where the input data requirements include validation selection criteria for data assets to be run on the algorithm. The method may also include identifying, by the data processing system, the data assets as being available from the data host based on the validation selection criteria for the data assets. The method may also include curating, by the data processing system, the data assets within a data storage structure that is within the infrastructure of the data host. The method may also include splitting at least a portion of the data assets into validation data assets within the data storage structure that is within the infrastructure of the data host. The method further including onboarding, by the data processing system, the data host, where the onboarding includes confirming that use of the data assets with the algorithm is in compliance with data privacy requirements. The method may also include completing governance and compliance requirements, including approval from an institutional review board for use of the data assets from the data host for the purpose of validating the algorithm. The method may also include the curating including selecting the data storage structure from among multiple data storage structures and provisioning the data storage structure within the infrastructure of the data host, where the selection of the data storage structure is based on the type of algorithm, the type of data within the data assets, the system requirements of the data processing system, or a combination thereof. The method further including, when the performance of the algorithm satisfies the validation criteria, maintaining, by the data processing system, the algorithm and the validation data assets in a secure manner that preserves the privacy of the validation data assets and the algorithm. The method where the validation data assets are multiple independent sets of data assets, the encrypted code is signed by the data processing system and stored in a data storage archive, and the performance of the algorithm is provided as a single validation report on the validation of the algorithm, aggregated from multiple validations performed on the multiple independent sets of data assets. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
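The single validation report aggregated from multiple independent data asset sets can be illustrated with a minimal sketch. It assumes each validation yields binary confusion-matrix counts and that pooling those counts is an acceptable way to aggregate; the names `SiteResult` and `aggregate_report` and the chosen metrics are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class SiteResult:
    """Confusion-matrix counts from one validation on one independent data asset set."""
    site: str
    tp: int
    fp: int
    tn: int
    fn: int


def _ratio(numerator, denominator):
    # Return None rather than dividing by zero when a class is absent.
    return numerator / denominator if denominator else None


def aggregate_report(results):
    """Pool per-site counts and return one combined validation report."""
    tp = sum(r.tp for r in results)
    fp = sum(r.fp for r in results)
    tn = sum(r.tn for r in results)
    fn = sum(r.fn for r in results)
    total = tp + fp + tn + fn
    return {
        "sites": [r.site for r in results],
        "n": total,
        "accuracy": _ratio(tp + tn, total),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
    }
```

Only the counts leave each site in this sketch, which is one way the report could be assembled without moving the underlying records.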

One general aspect includes a method that includes identifying an algorithm, where the algorithm is provided by an algorithm developer and integrated into a secure capsule computing framework, and the secure capsule computing framework serves the algorithm to validation data assets within a data storage structure in a secure manner that preserves the privacy of the validation data assets and the algorithm. The method also includes executing, by a data processing system, a validation workflow on the algorithm, where the validation workflow takes the validation data assets as input, applies the algorithm to the validation data assets using learned parameters, and outputs an inference. The method also includes calculating, by the data processing system, the performance of the algorithm in providing the inference, where the performance is calculated based on gold standard labels. The method also includes determining, by the data processing system, whether the performance of the algorithm satisfies validation criteria defined by the algorithm developer. When the performance of the algorithm does not satisfy the validation criteria, the method also includes optimizing, at the data processing system, one or more hyperparameters of the algorithm and re-executing the validation workflow on the algorithm with the optimized one or more hyperparameters. When the performance of the algorithm satisfies the validation criteria, the method also includes providing, by the data processing system, the performance of the algorithm and the one or more hyperparameters to the algorithm developer. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the method.
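The validate, check, optimize, and re-run loop described in this aspect might look like the following sketch. It makes several assumptions not found in the disclosure: the algorithm emits a score per example, the only hyperparameter is a decision threshold, "optimization" is a plain search over candidate values, and the validation criterion is a minimum accuracy against gold standard labels. The function names are hypothetical.

```python
def accuracy(predictions, gold_labels):
    """Fraction of predictions that match the gold standard labels."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)


def run_validation(scores, gold_labels, thresholds, required_accuracy):
    """Re-run validation with each candidate threshold until the criterion is met.

    Returns whether the criterion was met, plus the hyperparameter and
    performance that would be reported back to the algorithm developer.
    """
    if not thresholds:
        raise ValueError("at least one candidate threshold is required")
    best = None
    for threshold in thresholds:
        # Validation workflow: apply the algorithm and output inferences.
        predictions = [int(s >= threshold) for s in scores]
        perf = accuracy(predictions, gold_labels)
        if best is None or perf > best[1]:
            best = (threshold, perf)
        if perf >= required_accuracy:
            return {"met": True, "threshold": threshold, "performance": perf}
    # Criterion never met: report the best candidate seen.
    return {"met": False, "threshold": best[0], "performance": best[1]}
```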

Implementations may include one or more of the following features. The method where the validation selection criteria include clinical cohort criteria, demographic criteria, and/or data set class balance, and where the clinical cohort criteria define a group of people from whom data assets are to be obtained for a cohort study, the type of cohort study, risk factors to which the group of people may be exposed over a period of time, the question or hypothesis to be resolved and the associated disease or condition, other parameters that define the criteria for the cohort study, or any combination thereof. The method where the secure capsule computing framework is provisioned within a computing infrastructure configured to accept the encrypted code required to run the algorithm, and where provisioning the computing infrastructure includes instantiating the secure capsule computing framework on the computing infrastructure, depositing, by the algorithm developer, the encrypted code inside the secure capsule computing framework, and decrypting the encrypted code after the secure capsule computing framework has been instantiated. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
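As a rough illustration of validation selection criteria, the sketch below filters records by a demographic criterion and then checks data set class balance. The record keys `age` and `label`, the age bounds, and the minimum class fraction are all hypothetical; the disclosure does not prescribe how such criteria are encoded.

```python
from collections import Counter


def class_balance(labels):
    """Return the fraction of examples carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def meets_selection_criteria(records, min_age, max_age, min_class_fraction):
    """Apply a demographic filter, then test class balance on the cohort.

    `records` is a list of dicts with the assumed keys 'age' and 'label'.
    Returns (criteria_met, cohort).
    """
    cohort = [r for r in records if min_age <= r["age"] <= max_age]
    if not cohort:
        return False, cohort
    balance = class_balance(r["label"] for r in cohort)
    # Require at least two classes, each above the minimum fraction.
    ok = len(balance) > 1 and min(balance.values()) >= min_class_fraction
    return ok, cohort
```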

One general aspect includes a method that includes receiving, at a data processing system, an algorithm and input data requirements associated with the algorithm, where the input data requirements include validation selection criteria for data assets to be run on the algorithm. The method also includes identifying, by the data processing system, the data assets as being available from a data host based on the validation selection criteria for the data assets. The method also includes curating, by the data processing system, the data assets within a data storage structure that is within the infrastructure of the data host. The method also includes preparing, by the data processing system, the data assets within the data storage structure for processing by the algorithm. The method also includes integrating, by the data processing system, the algorithm into a secure capsule computing framework, where the secure capsule computing framework serves the algorithm to the data assets within the data storage structure in a secure manner that preserves the privacy of the data assets and the algorithm. The method also includes executing, by the data processing system, a validation workflow on the algorithm, where the validation workflow takes the data assets as input, finds patterns in the data assets using learned parameters, and outputs an inference. The method also includes calculating, by the data processing system, the performance of the algorithm in providing the inference, where the performance is calculated based on gold standard labels. The method also includes providing, by the data processing system, the performance of the algorithm to the algorithm developer. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the method.

Implementations may include one or more of the following features. The method where the validation selection criteria include clinical cohort criteria, demographic criteria, and/or data set class balance, and where the clinical cohort criteria define a group of people from whom data assets are to be obtained for a cohort study, the type of cohort study, risk factors to which the group of people may be exposed over a period of time, the question or hypothesis to be resolved and the associated disease or condition, other parameters that define the criteria for the cohort study, or any combination thereof. The method further including onboarding, by the data processing system, the data host, where the onboarding includes confirming that use of the data assets with the algorithm is in compliance with data privacy requirements. The method may also include completing governance and compliance requirements, including approval from an institutional review board for use of the data assets from the data host for the purpose of validating the algorithm. The method may also include the curating including selecting the data storage structure from among multiple data storage structures and provisioning the data storage structure within the infrastructure of the data host, where the selection of the data storage structure is based on the type of algorithm, the type of data within the data assets, the system requirements of the data processing system, or a combination thereof. The method further including maintaining, by the data processing system, the algorithm and the data assets in a secure manner that preserves the privacy of the data assets and the algorithm. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.

Implementations may include one or more of the following features. The method where the secure capsule computing framework is provisioned within a computing infrastructure configured to accept the encrypted code required to run the algorithm, and where provisioning the computing infrastructure includes instantiating the secure capsule computing framework on the computing infrastructure, depositing, by the algorithm developer, the encrypted code inside the secure capsule computing framework, and decrypting the encrypted code after the secure capsule computing framework has been instantiated. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the method. The method where the data assets are multiple independent sets of data assets, the encrypted code is signed by the data processing system and stored in a data storage archive, and the performance of the algorithm is provided as a single validation report on the validation of the algorithm, aggregated from multiple validations performed on the multiple independent sets of data assets. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.

In some embodiments, a system is provided that includes one or more data processors and a non-transitory computer-readable storage medium containing instructions which, when executed on the one or more data processors, cause the one or more data processors to perform part or all of one or more methods disclosed herein.

In some embodiments, a computer program product is provided that is tangibly embodied in a non-transitory machine-readable storage medium and that includes instructions configured to cause one or more data processors to perform part or all of one or more methods disclosed herein.

Some embodiments of the present disclosure include a system including one or more data processors. In some embodiments, the system includes a non-transitory computer-readable storage medium containing instructions which, when executed on the one or more data processors, cause the one or more data processors to perform part or all of one or more methods and/or part or all of one or more processes disclosed herein. Some embodiments of the present disclosure include a computer program product tangibly embodied in a non-transitory machine-readable storage medium, including instructions configured to cause one or more data processors to perform part or all of one or more methods and/or part or all of one or more processes disclosed herein.

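[Invention 1001]
A method comprising the following steps:
receiving, at a data processing system, an algorithm and input data requirements associated with the algorithm, wherein the input data requirements include optimization and/or validation selection criteria for data assets to be run on the algorithm;
identifying, by the data processing system, the data assets as being available from a data host based on the optimization and/or validation selection criteria for the data assets;
curating, by the data processing system, the data assets within a data storage structure that is within the infrastructure of the data host;
preparing, by the data processing system, the data assets within the data storage structure for processing by the algorithm;
integrating, by the data processing system, the algorithm into a secure capsule computing framework, wherein the secure capsule computing framework serves the algorithm to the data assets within the data storage structure in a secure manner that preserves the privacy of the data assets and the algorithm; and
running, by the data processing system, the data assets through the algorithm.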
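[Invention 1002]
The method of Invention 1001, wherein the algorithm and the input data requirements are received from an algorithm developer that is a different entity from the data host, and the optimization and/or validation selection criteria define characteristics, formats, and requirements for the data assets to be run on the algorithm.
[Invention 1003]
The method of Invention 1002, wherein the characteristics and the requirements of the data assets are defined based on:
(i) the environment of the algorithm, (ii) the distribution of examples in the input data, (iii) the parameters and types of devices generating the input data, (iv) variance versus bias, (v) the tasks implemented by the algorithm, or (vi) any combination thereof.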
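[Invention 1004]
The method of Invention 1002, wherein:
the identifying is performed using differential privacy for sharing information within the data assets by describing patterns of groups within the data assets while withholding private information about individuals within the data assets;
the curating includes selecting the data storage structure from among multiple data storage structures and provisioning the data storage structure within the infrastructure of the data host; and
the selection of the data storage structure is based on the type of the algorithm, the type of data within the data assets, the system requirements of the data processing system, or a combination thereof.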
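Invention 1004 names differential privacy without specifying a mechanism. As one possible illustration, the sketch below applies the Laplace mechanism to a counting query, so that a group-level pattern (how many records match) can be shared while each individual's contribution is masked by noise. The function name `noisy_count`, the record layout, and the `epsilon` privacy parameter are assumptions made for this example only.

```python
import random


def noisy_count(records, predicate, epsilon, rng=None):
    """Return a differentially private count of records matching `predicate`.

    Adding or removing one individual changes a count by at most 1, so the
    sensitivity is 1 and Laplace noise with scale 1/epsilon is added.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two i.i.d. exponential draws with rate epsilon
    # is Laplace-distributed with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise
```

Smaller values of `epsilon` add more noise and give stronger privacy; the right value depends on the data host's requirements and is not something this sketch can decide.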
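[Invention 1005]
The method of Invention 1001, 1002, or 1003, further comprising onboarding, by the data processing system, the data host,
wherein the onboarding includes confirming that use of the data assets with the algorithm is in compliance with data privacy requirements.
[Invention 1006]
The method of any of Inventions 1001 to 1005, wherein preparing the data assets includes applying one or more transformations to the data assets, annotating the data assets, harmonizing the data assets, or a combination thereof.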
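[Invention 1007]
The method of any of Inventions 1001 to 1006, wherein running the data assets through the algorithm includes executing a training workflow that includes:
creating multiple instances of the model, splitting the data assets into a training dataset and one or more test datasets, training the multiple instances of the model on the training dataset, integrating the results from the training of each of the multiple instances of the model into a fully federated model, running the one or more test datasets through the fully federated model, and calculating the performance of the fully federated model based on the running of the one or more test datasets.
[Invention 1008]
The method of any of Inventions 1001 to 1006, wherein running the data assets through the algorithm includes executing a validation workflow that includes:
splitting the data assets into one or more validation datasets, running the one or more validation datasets through the algorithm, and calculating the performance of the algorithm based on the running of the one or more validation datasets.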
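[Invention 1009]
The method of any of Inventions 1001 to 1008, wherein:
the secure capsule computing framework is provisioned within a computing infrastructure configured to accept the encrypted code required to run the algorithm; and
provisioning the computing infrastructure includes instantiating the secure capsule computing framework on the computing infrastructure, depositing, by the algorithm developer, the encrypted code inside the secure capsule computing framework, and decrypting the encrypted code after the secure capsule computing framework has been instantiated.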
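[Invention 1010]
A system comprising:
one or more data processors; and
a non-transitory computer-readable storage medium containing instructions which, when executed on the one or more data processors, cause the one or more data processors to perform actions including:
receiving an algorithm and input data requirements associated with the algorithm, wherein the input data requirements include optimization and/or validation selection criteria for data assets to be run on the algorithm;
identifying the data assets as being available from a data host based on the optimization and/or validation selection criteria for the data assets;
curating the data assets within a data storage structure that is within the infrastructure of the data host;
preparing the data assets within the data storage structure for processing by the algorithm;
integrating the algorithm into a secure capsule computing framework, wherein the secure capsule computing framework serves the algorithm to the data assets within the data storage structure in a secure manner that preserves the privacy of the data assets and the machine learning model; and
running the data assets through the algorithm.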
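[Invention 1011]
The system of Invention 1010, wherein the algorithm and the input data requirements are received from an algorithm developer that is a different entity from the data host, and the optimization and/or validation selection criteria define characteristics, formats, and requirements for data assets to be run on the algorithm.
[Invention 1012]
The system of Invention 1011, wherein the characteristics and the requirements of the data assets are defined based on:
(i) the environment of the algorithm, (ii) the distribution of examples in the input data, (iii) the parameters and types of devices generating the input data, (iv) variance versus bias, (v) the tasks implemented by the algorithm, or (vi) any combination thereof.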
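[Invention 1013]
The system of Invention 1010, wherein:
the identifying is performed using differential privacy for sharing information within the data assets by describing patterns of groups within the data assets while withholding private information about individuals within the data assets;
the curating includes selecting the data storage structure from among multiple data storage structures and provisioning the data storage structure within the infrastructure of the data host; and
the selection of the data storage structure is based on the type of the algorithm, the type of data within the data assets, the requirements of the system, or a combination thereof.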
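[Invention 1014]
The system of Invention 1010, 1011, or 1012, wherein the actions further include onboarding the data host,
and the onboarding includes confirming that use of the data assets with the algorithm is in compliance with data privacy requirements.
[Invention 1015]
The system of any of Inventions 1010 to 1014, wherein preparing the data assets includes applying one or more transformations to the data assets, annotating the data assets, harmonizing the data assets, or a combination thereof.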
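[Invention 1016]
The system of any of Inventions 1010 to 1015, wherein running the data assets through the algorithm includes executing a training workflow that includes:
creating multiple instances of the model, splitting the data assets into a training dataset and one or more test datasets, training the multiple instances of the model on the training dataset, integrating the results from the training of each of the multiple instances of the model into a fully federated model, running the one or more test datasets through the fully federated model, and calculating the performance of the fully federated model based on the running of the one or more test datasets.
[Invention 1017]
The system of any of Inventions 1010 to 1015, wherein running the data assets through the algorithm includes executing a validation workflow that includes:
splitting the data assets into one or more validation datasets, running the one or more validation datasets through the algorithm, and calculating the performance of the algorithm based on the running of the one or more validation datasets.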
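[Invention 1018]
The system of any of Inventions 1010 to 1017, wherein:
the secure capsule computing framework is provisioned within a computing infrastructure configured to accept the encrypted code required to run the algorithm; and
provisioning the computing infrastructure includes instantiating the secure capsule computing framework on the computing infrastructure, depositing, by the algorithm developer, the encrypted code inside the secure capsule computing framework, and decrypting the encrypted code after the secure capsule computing framework has been instantiated.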
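[Invention 1019]
A computer program product tangibly embodied in a non-transitory machine-readable storage medium, including instructions configured to cause one or more data processors to perform actions including:
receiving an algorithm and input data requirements associated with the algorithm, wherein the input data requirements include optimization and/or validation selection criteria for data assets to be run on the algorithm;
identifying the data assets as being available from a data host based on the optimization and/or validation selection criteria for the data assets;
curating the data assets within a data storage structure that is within the infrastructure of the data host;
preparing the data assets within the data storage structure for processing by the algorithm;
integrating the algorithm into a secure capsule computing framework, wherein the secure capsule computing framework serves the algorithm to the data assets within the data storage structure in a secure manner that preserves the privacy of the data assets and the machine learning model; and
running the data assets through the algorithm.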
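[Invention 1020]
The computer program product of Invention 1019, wherein the algorithm and the input data requirements are received from an algorithm developer that is a different entity from the data host, and the optimization and/or validation selection criteria define characteristics, formats, and requirements for data assets to be run on the algorithm.
[Invention 1021]
The computer program product of Invention 1020, wherein the characteristics and the requirements of the data assets are defined based on:
(i) the environment of the algorithm, (ii) the distribution of examples in the input data, (iii) the parameters and types of devices generating the input data, (iv) variance versus bias, (v) the tasks implemented by the algorithm, or (vi) any combination thereof.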
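[Invention 1022]
The computer program product of Invention 1020, wherein:
the identifying is performed using differential privacy for sharing information within the data assets by describing patterns of groups within the data assets while withholding private information about individuals within the data assets;
the curating includes selecting the data storage structure from among multiple data storage structures and provisioning the data storage structure within the infrastructure of the data host; and
the selection of the data storage structure is based on the type of the algorithm, the type of data within the data assets, the requirements of the system, or a combination thereof.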
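[Invention 1023]
The computer program product of Invention 1019, 1020, or 1021, wherein the actions further include onboarding the data host,
and the onboarding includes confirming that use of the data assets with the algorithm is in compliance with data privacy requirements.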
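[Invention 1024]
The computer program product of any of Inventions 1019 to 1023, wherein running the data assets through the algorithm includes executing a training workflow that includes:
creating multiple instances of the model, splitting the data assets into a training dataset and one or more test datasets, training the multiple instances of the model on the training dataset, integrating the results from the training of each of the multiple instances of the model into a fully federated model, running the one or more test datasets through the fully federated model, and calculating the performance of the fully federated model based on the running of the one or more test datasets.
[Invention 1025]
The computer program product of any of Inventions 1019 to 1023, wherein running the data assets through the algorithm includes executing a validation workflow that includes:
splitting the data assets into one or more validation datasets, running the one or more validation datasets through the algorithm, and calculating the performance of the algorithm based on the running of the one or more validation datasets.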
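[Invention 1026]
The computer program product of any of Inventions 1019 to 1025, wherein:
the secure capsule computing framework is provisioned within a computing infrastructure configured to accept the encrypted code required to run the algorithm; and
provisioning the computing infrastructure includes instantiating the secure capsule computing framework on the computing infrastructure, depositing, by the algorithm developer, the encrypted code inside the secure capsule computing framework, and decrypting the encrypted code after the secure capsule computing framework has been instantiated.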
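[Invention 1027]
A method comprising the following steps:
identifying multiple instances of an algorithm, wherein each instance of the algorithm is integrated into one or more secure capsule computing frameworks, and the one or more secure capsule computing frameworks serve each instance of the algorithm to training data assets within one or more data storage structures of one or more data hosts in a secure manner that preserves the privacy of the training data assets and each instance of the algorithm;
executing, by a data processing system, a federated training workflow on each instance of the algorithm, wherein the federated training workflow takes the training data assets as input, maps features of the training data assets to a target inference using parameters, computes a loss or error function, updates the parameters to learned parameters in order to minimize the loss or error function, and outputs one or more trained instances of the algorithm;
integrating, by the data processing system, the learned parameters for each trained instance of the algorithm into a fully federated algorithm, wherein the integrating includes aggregating the learned parameters to obtain aggregated parameters and updating the learned parameters of the fully federated algorithm with the aggregated parameters;
executing, by the data processing system, a testing workflow on the fully federated algorithm, wherein the testing workflow takes test data as input, finds patterns in the test data using the updated learned parameters, and outputs an inference;
calculating, by the data processing system, the performance of the fully federated algorithm in providing the inference;
determining, by the data processing system, whether the performance of the fully federated algorithm satisfies algorithm termination criteria;
when the performance of the fully federated algorithm does not satisfy the algorithm termination criteria, replacing, by the data processing system, each instance of the algorithm with the fully federated algorithm and re-executing the federated training workflow on each instance of the fully federated algorithm; and
when the performance of the fully federated algorithm satisfies the algorithm termination criteria, providing, by the data processing system, the performance of the fully federated algorithm and the aggregated parameters to an algorithm developer of the algorithm.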
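The train, aggregate, test, and repeat loop of Invention 1027 can be sketched with a deliberately tiny model. The example assumes a one-parameter linear model `y ≈ w * x`, squared-error loss, plain gradient descent at each data host, and sample-weighted averaging as the aggregation rule; none of these choices is stated in the disclosure, and all function names are hypothetical.

```python
def local_train(w, data, lr=0.1, epochs=5):
    """Train one instance on one data host's (x, y) pairs via gradient descent."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def aggregate(weights, sizes):
    """Combine learned parameters, weighting each host by its sample count."""
    total = sum(sizes)
    return sum(w * n for w, n in zip(weights, sizes)) / total


def mse(w, data):
    """Performance of the fully federated model on held-out test data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def federated_train(sites, test_data, w0=0.0, max_rounds=50, target_mse=1e-4):
    """Repeat federated rounds until the termination criterion is met."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    w = w0
    for round_no in range(1, max_rounds + 1):
        # Each instance starts from the current fully federated parameters.
        local = [local_train(w, data) for data in sites]
        w = aggregate(local, [len(data) for data in sites])
        perf = mse(w, test_data)
        if perf <= target_mse:  # algorithm termination criterion
            break
    return w, perf, round_no
```

Only the learned parameter leaves each site in this sketch; the raw `(x, y)` pairs stay with the data host that owns them.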
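[Invention 1028]
The method of Invention 1027, wherein identifying the multiple instances of the algorithm includes:
receiving, at the data processing system, the algorithm and input data requirements associated with the algorithm, wherein the input data requirements include optimization and/or validation selection criteria for data assets to be run on the algorithm;
identifying, by the data processing system, the data assets as being available from the one or more data hosts based on the optimization and/or validation selection criteria for the data assets;
curating, by the data processing system, the data assets within a data storage structure that is within the infrastructure of each data host of the one or more data hosts; and
splitting at least a portion of the data assets into the training data assets within the data storage structure that is within the infrastructure of each data host of the one or more data hosts.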
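[Invention 1029]
The method of Invention 1028, wherein the algorithm and the input data requirements are received from an algorithm developer that is a different entity from the one or more data hosts, and the optimization and/or validation selection criteria define characteristics, formats, and requirements for data assets to be run on the algorithm.
[Invention 1030]
The method of Invention 1027, 1028, or 1029, wherein the federated training workflow further includes encrypting training gradients, and the integrating includes decrypting the training gradients.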
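[Invention 1031]
The method of Invention 1027, 1028, 1029, or 1030, further comprising:
when the performance of the fully federated algorithm satisfies the algorithm termination criteria, sending, by the data processing system, the aggregated parameters to each instance of the algorithm; and
executing, by the data processing system, an update training workflow on each instance of the algorithm, wherein the update training workflow updates the learned parameters with the aggregated parameters and outputs one or more updated and trained instances of the algorithm.
[Invention 1032]
The method of Invention 1031, further comprising running, by the data processing system, the remaining data assets through each instance of the algorithm.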
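[Invention 1033]
The method of Invention 1031, wherein running the data assets through each instance of the algorithm includes executing a validation workflow that includes:
further splitting at least a portion of the data assets into one or more validation datasets, running the one or more validation datasets through each instance of the algorithm, and calculating the performance of each instance of the algorithm based on the running of the one or more validation datasets.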
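[Invention 1034]
A system comprising:
one or more data processors; and
a non-transitory computer-readable storage medium containing instructions which, when executed on the one or more data processors, cause the one or more data processors to perform actions including:
identifying multiple instances of an algorithm, wherein each instance of the algorithm is integrated into one or more secure capsule computing frameworks, and the one or more secure capsule computing frameworks serve each instance of the algorithm to training data assets within one or more data storage structures of one or more data hosts in a secure manner that preserves the privacy of the training data assets and each instance of the algorithm;
executing a federated training workflow on each instance of the algorithm, wherein the federated training workflow takes the training data assets as input, maps features of the training data assets to a target inference using parameters, computes a loss or error function, updates the parameters to learned parameters in order to minimize the loss or error function, and outputs one or more trained instances of the algorithm;
integrating, by the data processing system, the learned parameters for each trained instance of the algorithm into a fully federated algorithm, wherein the integrating includes aggregating the learned parameters to obtain aggregated parameters and updating the learned parameters of the fully federated algorithm with the aggregated parameters;
executing, by the data processing system, a testing workflow on the fully federated algorithm, wherein the testing workflow takes test data as input, finds patterns in the test data using the updated learned parameters, and outputs an inference;
calculating, by the data processing system, the performance of the fully federated algorithm in providing the inference;
determining, by the data processing system, whether the performance of the fully federated algorithm satisfies algorithm termination criteria;
when the performance of the fully federated algorithm does not satisfy the algorithm termination criteria, replacing, by the data processing system, each instance of the algorithm with the fully federated algorithm and re-executing the federated training workflow on each instance of the fully federated algorithm; and
when the performance of the fully federated algorithm satisfies the algorithm termination criteria, providing, by the data processing system, the performance of the fully federated algorithm and the aggregated parameters to an algorithm developer of the algorithm.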
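[Invention 1035]
A method comprising the following steps:
identifying, by a data processing system, data assets available from a data host based on selection criteria for the data assets;
curating, by the data processing system, the data assets within a data storage structure that is within the infrastructure of the data host;
preparing, by the data processing system, a transformer prototype dataset to be used as a guide for developing algorithms for data transformation, wherein the transformer prototype dataset captures key attributes of a harmonization process;
creating, at the data processing system, a first set of harmonization transformers for transformation of the data assets based on the current format of the data in the transformer prototype dataset;
applying, by the data processing system, the first set of harmonization transformers to the data assets to generate transformed data assets;
preparing, by the data processing system, a harmonization prototype dataset to be used as a guide for developing algorithms for data transformation, wherein the harmonization prototype dataset captures key attributes of the harmonization process;
creating, by the data processing system, a second set of harmonization transformers for transformation of the transformed data assets based on the current format of the data in the harmonization prototype dataset;
applying, by the data processing system, the second set of harmonization transformers to the transformed data assets to generate harmonized data assets; and
running, by the data processing system, the harmonized data assets through an algorithm, wherein the algorithm is within a secure capsule computing framework that serves the algorithm to the harmonized data assets within the data storage structure in a secure manner that preserves the privacy of the harmonized data assets and the algorithm.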
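The two-stage transformation in Invention 1035 can be pictured as two ordered lists of record-level functions applied in sequence. The sketch below is only an illustration under assumed conditions: records are flat dictionaries, the raw field names `Age` and `HT_CM` and the target schema are invented for the example, and each transformer is a pure function of one record.

```python
def apply_transformers(records, transformers):
    """Apply each transformer, in order, to every record."""
    out = []
    for record in records:
        for transform in transformers:
            record = transform(record)
        out.append(record)
    return out


# First set (hypothetical): bring a site-specific raw format into a common shape.
def rename_fields(record):
    return {"patient_age": record["Age"], "height_cm": record["HT_CM"]}


def parse_numbers(record):
    return {key: float(value) for key, value in record.items()}


# Second set (hypothetical): convert to the units the algorithm expects.
def height_to_metres(record):
    out = dict(record)
    out["height_m"] = out.pop("height_cm") / 100
    return out


first_set = [rename_fields, parse_numbers]
second_set = [height_to_metres]


def harmonize(raw_records):
    """Run the first set, then the second set, yielding harmonized records."""
    transformed = apply_transformers(raw_records, first_set)
    return apply_transformers(transformed, second_set)
```

Keeping the two sets separate leaves room for an intermediate step, such as annotation of the transformed records, between them.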
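[Invention 1036]
The method of Invention 1035, wherein the selection criteria are received from an algorithm developer that is a different entity from the data host, and the selection criteria define characteristics, formats, and requirements for the data assets to be run on the algorithm.
[Invention 1037]
The method of Invention 1036, wherein the characteristics and the requirements of the data assets are defined based on:
(i) the environment of the algorithm, (ii) the distribution of examples in the input data, (iii) the parameters and types of devices generating the input data, (iv) variance versus bias, (v) the tasks implemented by the algorithm, or (vi) any combination thereof.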
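[Invention 1038]
The method of Invention 1037, further comprising anonymizing the transformer prototype dataset and making the anonymized transformer prototype dataset available to the algorithm developer for the purpose of creating the first set of harmonization transformers for transformation of the data assets.
[Invention 1039]
The method of Invention 1035, 1036, 1037, or 1038, wherein applying the first set of harmonization transformers to the data assets is performed within the data structure.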
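[Invention 1040]
The method of any of Inventions 1035 to 1039, further comprising annotating, at the data processing system, the transformed data assets in accordance with a defined annotation protocol to generate an annotated dataset,
wherein the annotating of the transformed data is performed within the data structure, and the second set of harmonization transformers is applied to the annotated dataset to generate the harmonized data assets.
[Invention 1041]
The method of any of Inventions 1035 to 1040, wherein applying the second set of harmonization transformers to the annotated data assets is performed within the data structure.
[Invention 1042]
The method of Invention 1040 or 1041, further comprising determining whether the first set of harmonization transformers, the annotation, and the second set of harmonization transformers are applied successfully and without violating data privacy requirements.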
[本発明1043]
1つまたは複数のデータプロセッサと、
前記1つまたは複数のデータプロセッサ上で実行されると、前記1つまたは複数のデータプロセッサに、
データ資産の選択基準に基づいてデータホストから利用可能な前記データ資産を識別する動作と、
前記データ資産を、前記データホストのインフラストラクチャ内にあるデータストレージ構造内でキュレートする動作と、
データ変換のためのアルゴリズムとして使用すべきトランスフォーマ・プロトタイプ・データセットを準備する動作であって、前記トランスフォーマ・プロトタイプ・データセットが整合化プロセスのキー属性を取り込む、準備する動作と、
前記トランスフォーマ・プロトタイプ・データセット内のデータの現在のフォーマットに基づいて前記データ資産の変換のための第1の整合化トランスフォーマセットを作成する動作と、
変換されたデータ資産を生成するために前記データ資産に前記第1の整合化トランスフォーマセットを適用する動作と、
データ変換のためのアルゴリズムを開発するために使用すべき整合化プロトタイプデータセットを準備する動作であって、前記整合化プロトタイプデータセットが前記整合化プロセスのキー属性を取り込む、準備する動作と、
前記整合化プロトタイプデータセット内のデータの現在のフォーマットに基づいて、前記変換されたデータ資産の変換のための第2の整合化トランスフォーマセットを作成する動作と、
整合化されたデータ資産を生成するために、前記変換されたデータ資産に前記第2の整合化トランスフォーマセットを適用する動作と、
前記整合化されたデータ資産をアルゴリズムを通して動作させる動作であって、前記アルゴリズムが、前記アルゴリズムを、前記データストレージ構造内の前記整合化されたデータ資産に、前記整合化されたデータ資産および前記アルゴリズムのプライバシーを保全するセキュアな方法で提供するセキュアなカプセル計算フレームワーク内にある、動作させる動作と
を含む動作を行わせる命令を含む、非一時的コンピュータ可読記憶媒体と
を含む、システム。
[本発明1044]
以下の工程を含む方法:
アルゴリズムまたはモデルを識別する工程であって、前記アルゴリズムまたはモデルが、セキュアなカプセル計算フレームワークに統合され、前記セキュアなカプセル計算フレームワークが、前記アルゴリズムまたはモデルを、データホストのデータストレージ構造内の訓練データ資産に、前記訓練データ資産および前記アルゴリズムまたはモードのプライバシーを保全するセキュアな方法で提供する、識別する工程;
データ処理システムによって、前記アルゴリズムまたはモデルで連合訓練ワークフローを実行する工程であって、前記連合訓練ワークフローが、前記訓練データ資産を入力として取り込み、パラメータを使用して前記訓練データ資産の特徴をターゲット推論にマップし、損失関数または誤差関数を計算し、前記損失関数または前記誤差関数を最小化するためにパラメータを学習されたパラメータに更新し、訓練されたアルゴリズムまたはモデルを出力する、実行する工程;
前記データ処理システムによって、前記アルゴリズムまたはモデルの前記学習されたパラメータを、完全連合アルゴリズムまたはモデルに統合する工程であって、前記統合する工程が、前記学習されたパラメータを集約して、集約されたパラメータを取得することと、前記完全連合アルゴリズムまたはモデルの学習されたパラメータを前記集約されたパラメータで更新することとを含む、統合する工程;
前記データ処理システムによって、前記完全連合アルゴリズムまたはモデルで試験ワークフローを実行する工程であって、前記試験ワークフローが、試験データを入力として取り込み、前記更新された学習されたパラメータを使用して前記試験データ内のパターンを見つけ、推論を出力する、実行する工程;
前記データ処理システムによって、前記推論を提供する際の前記完全連合アルゴリズムの性能を計算する工程;
前記データ処理システムによって、前記完全連合アルゴリズムまたはモデルの前記性能がアルゴリズム終了基準を満たすかどうかを判定する工程;
前記完全連合アルゴリズムまたはモデルの前記性能が前記アルゴリズム終了基準を満たさない場合、前記データ処理システムによって、前記アルゴリズムまたはモデルを前記完全連合アルゴリズムまたはモデルで置き換え、前記完全連合アルゴリズムまたはモデルで前記連合訓練ワークフローを再実行する工程;ならびに
前記完全連合アルゴリズムまたはモデルの前記性能が前記アルゴリズム終了基準を満たす場合、前記データ処理システムによって、前記完全連合アルゴリズムまたはモデルの前記性能および前記集約されたパラメータを、前記アルゴリズムまたはモデルのアルゴリズム開発者に提供する工程。
[本発明1045]
以下の工程を含む方法:
アルゴリズムを識別する工程であって、前記アルゴリズムが、アルゴリズム開発者によって提供されてセキュアなカプセル計算フレームワークに統合され、前記セキュアなカプセル計算フレームワークが、前記アルゴリズムを、データストレージ構造内の検証データ資産に、前記検証データ資産および前記アルゴリズムのプライバシーを保全するセキュアな方法で提供する、識別する工程;
データ処理システムによって、前記アルゴリズムで検証ワークフローを実行する工程であって、前記検証ワークフローが、前記検証データ資産を入力として取り込み、学習されたパラメータを使用して前記検証データ資産に前記アルゴリズムを適用し、推論を出力する、実行する工程;
前記データ処理システムによって、前記推論を提供する際の前記アルゴリズムの性能を計算する工程であって、前記性能がゴールド・スタンダード・ラベルに基づいて計算される、計算する工程;
前記データ処理システムによって、前記アルゴリズムの前記性能がアルゴリズム開発者によって定義された検証基準を満たすかどうかを判定する工程;
前記アルゴリズムの前記性能が前記検証基準を満たさない場合、前記データ処理システムで、前記アルゴリズムの1つまたは複数のハイパーパラメータを最適化し、前記最適化された1つまたは複数のハイパーパラメータを用いて、前記アルゴリズムで前記検証ワークフローを再実行する工程;ならびに
前記アルゴリズムの前記性能が前記検証基準を満たす場合、前記データ処理システムによって、前記アルゴリズムの前記性能および前記1つまたは複数のハイパーパラメータを、前記アルゴリズム開発者に提供する工程。
[本発明1046]
前記アルゴリズムを前記識別する工程が、
前記データ処理システムにおいて、前記アルゴリズムおよび前記アルゴリズムと関連付けられた入力データ要件を受け取ることであって、前記入力データ要件が、データ資産が前記アルゴリズムで動作するための検証選択基準を含む、受け取ることと、
前記データ処理システムによって、前記データ資産を、前記データ資産についての前記検証選択基準に基づいてデータホストから利用可能であるものとして識別することと、
前記データ処理システムによって、前記データホストのインフラストラクチャ内にあるデータストレージ構造内の前記データ資産をキュレートすることと、
前記データ資産の少なくとも一部を、前記データホストの前記インフラストラクチャ内にある前記データストレージ構造内の前記検証データ資産に分割することと
を含む、本発明1044の方法。
[本発明1047]
前記検証選択基準が、臨床コホート基準、人口統計学的基準、および/またはデータ・セット・クラス・バランスを含み、前記臨床コホート基準が、コホート研究のために前記データ資産を取得するべき人々のグループ、前記コホート研究のタイプ、前記人々のグループが一定期間にわたってさらされる可能性のあるリスク因子、解決されるべき疑問もしくは仮説および関連付けられる疾患もしくは状態、前記コホート研究の基準を定義するその他のパラメータ、またはそれらの任意の組み合わせを定義する、本発明1045の方法。
[本発明1048]
前記データ処理システムによって、前記データホストを迎え入れる工程であって、前記迎え入れる工程が、前記アルゴリズムでの前記データ資産の使用がデータプライバシー要件に準拠したものであることを確認することを含む、迎え入れる工程;ならびに
前記アルゴリズムを検証する目的での前記データホストからの前記データ資産の使用の施設内審査委員会からの許可を含む、ガバナンス要件およびコンプライアンス要件を完了する工程
をさらに含み、
前記キュレートする工程が、複数のデータストレージ構造の中から前記データストレージ構造を選択することと、前記データホストの前記インフラストラクチャ内に前記データストレージ構造をプロビジョニングすることとを含み、前記データストレージ構造の前記選択が、前記アルゴリズム内のアルゴリズムのタイプ、前記データ資産内のデータのタイプ、前記データ処理システムのシステム要件、またはそれらの組み合わせに基づくものである、
本発明1045または1046の方法。
[本発明1049]
前記アルゴリズムの前記性能が前記検証基準を満たす場合、前記データ処理システムによって、前記アルゴリズムおよび前記検証データ資産を、前記検証データ資産および前記アルゴリズムのプライバシーを保全するセキュアな方法で維持する工程
をさらに含む、本発明1044～1047のいずれかの方法。
[本発明1050]
前記セキュアなカプセル計算フレームワークが、前記アルゴリズムを動作させるのに必要な暗号化コードを受け入れるように構成された計算インフラストラクチャ内にプロビジョニングされ、
前記計算インフラストラクチャを前記プロビジョニングすることが、前記計算インフラストラクチャ上で前記セキュアなカプセル計算フレームワークをインスタンス化することと、前記アルゴリズム開発者によって、前記暗号化コードを前記セキュアなカプセル計算フレームワークの内部に配置することと、前記セキュアなカプセル計算フレームワークがインスタンス化された後で、前記暗号化コードを復号することとを含む、
本発明1044～1048のいずれかの方法。
[本発明1051]
前記検証データ資産が、複数の独立したデータ資産セットであり、前記暗号化コードが、前記データ処理システムによって署名されてデータ・ストレージ・アーカイブに格納され、前記アルゴリズムの前記性能が、前記複数の独立したデータ資産セットに対して行われた複数の検証から集約された前記アルゴリズムの検証についての単一の検証報告として提供される、本発明1049の方法。
[本発明1052]
1つまたは複数のデータプロセッサと、
前記1つまたは複数のデータプロセッサ上で実行されると、前記1つまたは複数のデータプロセッサに、
アルゴリズムを識別する動作であって、前記アルゴリズムが、アルゴリズム開発者によって提供されてセキュアなカプセル計算フレームワークに統合され、前記セキュアなカプセル計算フレームワークが、前記アルゴリズムを、データストレージ構造内の検証データ資産に、前記検証データ資産および前記アルゴリズムのプライバシーを保全するセキュアな方法で提供する、識別する動作と、
前記アルゴリズムで検証ワークフローを実行する動作であって、前記検証ワークフローが、前記検証データ資産を入力として取り込み、学習されたパラメータを使用して前記検証データ資産内のパターンを見つけ、推論を出力する、実行する動作と、
前記推論を提供する際の前記アルゴリズムの性能を計算する動作であって、前記性能がゴールド・スタンダード・ラベルに基づいて計算される、計算する動作と、
前記アルゴリズムの前記性能がアルゴリズム開発者によって定義された検証基準を満たすかどうかを判定する動作と、
前記アルゴリズムの前記性能が前記検証基準を満たさない場合、前記アルゴリズムの1つまたは複数のハイパーパラメータを最適化し、前記最適化された1つまたは複数のハイパーパラメータを用いて、前記アルゴリズムで前記検証ワークフローを再実行する動作と、
前記アルゴリズムの前記性能が前記検証基準を満たす場合、前記アルゴリズムの前記性能および前記1つまたは複数のハイパーパラメータを、前記アルゴリズム開発者に提供する動作と
を含む動作を行わせる命令を含む、非一時的コンピュータ可読記憶媒体と
を含む、システム。
[本発明1053]
以下の工程を含む方法:
データ処理システムにおいて、アルゴリズムおよび前記アルゴリズムと関連付けられた入力データ要件を受け取る工程であって、前記入力データ要件が、データ資産が前記アルゴリズムで動作するための検証選択基準を含む、受け取る工程;
前記データ処理システムによって、前記データ資産を、前記データ資産についての前記検証選択基準に基づいてデータホストから利用可能であるものとして識別する工程;
前記データ処理システムによって、前記データホストのインフラストラクチャ内のデータストレージ構造内の前記データ資産をキュレートする工程;
前記データ処理システムによって、前記アルゴリズムによって処理するための前記データストレージ構造内の前記データ資産を準備する工程;
前記データ処理システムによって、前記アルゴリズムをセキュアなカプセル計算フレームワークに統合する工程であって、前記セキュアなカプセル計算フレームワークが、前記アルゴリズムを前記データストレージ構造内の前記データ資産に、前記データ資産および前記アルゴリズムのプライバシーを保全するセキュアな方法で提供する、統合する工程;ならびに
前記データ処理システムによって、前記アルゴリズムで検証ワークフローを実行する工程であって、前記検証ワークフローが、前記データ資産を入力として取り込み、学習されたパラメータを使用して前記データ資産内のパターンを見つけ、推論を出力する、実行する工程;
前記データ処理システムによって、前記推論を提供する際の前記アルゴリズムの性能を計算する工程であって、前記性能がゴールド・スタンダード・ラベルに基づいて計算される、計算する工程;ならびに
前記データ処理システムによって、前記アルゴリズムの前記性能を、前記アルゴリズム開発者に提供する工程。
[本発明1054]
前記検証選択基準が、臨床コホート基準、人口統計学的基準、および/またはデータ・セット・クラス・バランスを含み、前記臨床コホート基準が、コホート研究のために前記データ資産を取得するべき人々のグループ、前記コホート研究のタイプ、前記人々のグループが一定期間にわたってさらされる可能性のあるリスク因子、解決されるべき疑問もしくは仮説および関連付けられる疾患もしくは状態、前記コホート研究の基準を定義するその他のパラメータ、またはそれらの任意の組み合わせを定義する、本発明1053の方法。
[本発明1055]
前記データ処理システムによって、前記データホストを迎え入れる工程であって、前記迎え入れる工程が、前記アルゴリズムでの前記データ資産の使用がデータプライバシー要件に準拠したものであることを確認することを含む、迎え入れる工程;ならびに
前記アルゴリズムを検証する目的での前記データホストからの前記データ資産の使用の施設内審査委員会からの許可を含む、ガバナンス要件およびコンプライアンス要件を完了する工程
をさらに含み、
前記キュレートする工程が、複数のデータストレージ構造の中から前記データストレージ構造を選択することと、前記データホストの前記インフラストラクチャ内に前記データストレージ構造をプロビジョニングすることとを含み、前記データストレージ構造の前記選択が、前記アルゴリズム内のアルゴリズムのタイプ、前記データ資産内のデータのタイプ、前記データ処理システムのシステム要件、またはそれらの組み合わせに基づくものである、
本発明1053または1054の方法。
[本発明1056]
前記データ処理システムによって、前記アルゴリズムおよび前記データ資産を、前記データ資産および前記アルゴリズムのプライバシーを保全するセキュアな方法で維持する工程をさらに含む、本発明1053～1055のいずれかの方法。
[本発明1057]
前記セキュアなカプセル計算フレームワークが、前記アルゴリズムを動作させるのに必要な暗号化コードを受け入れるように構成された計算インフラストラクチャ内にプロビジョニングされ、
前記計算インフラストラクチャを前記プロビジョニングすることが、前記計算インフラストラクチャ上で前記セキュアなカプセル計算フレームワークをインスタンス化することと、前記アルゴリズム開発者によって、前記暗号化コードを前記セキュアなカプセル計算フレームワークの内部に配置することと、前記セキュアなカプセル計算フレームワークがインスタンス化された後で、前記暗号化コードを復号することとを含む、
本発明1053～1056のいずれかの方法。
[本発明1058]
前記データ資産が、複数の独立したデータ資産セットであり、前記暗号化コードが、前記データ処理システムによって署名されてデータ・ストレージ・アーカイブに格納され、前記アルゴリズムの前記性能が、前記複数の独立したデータ資産セットに対して行われた複数の検証から集約された前記アルゴリズムの検証についての単一の検証報告として提供される、本発明1057の方法。
用いられている用語および表現は、限定ではなく説明の用語として使用されており、そのような用語および表現の使用に際して、図示および説明される特徴のうちのその部分の任意の均等物を除外する意図はなく、特許請求される発明の範囲内で様々な改変が可能であることを理解されたい。よって、特許請求される本発明は態様および任意の特徴によって具体的に開示されているが、当業者によれば本明細書に開示される概念の改変および変形が用いられ得ること、およびそのような改変および変形は、添付の特許請求の範囲によって定義される本発明の範囲内にあるとみなされることを理解されたい。
[Invention 1001]
A method including the following steps:
receiving, at a data processing system, an algorithm and input data requirements associated with the algorithm, wherein the input data requirements include optimization and/or validation selection criteria for data assets to be operated with the algorithm;
identifying, by the data processing system, the data assets as available from a data host based on the optimization and/or validation selection criteria for the data assets;
curating, by the data processing system, the data assets in a data storage structure within an infrastructure of the data host;
preparing, by the data processing system, the data assets in the data storage structure for processing by the algorithm;
integrating, by the data processing system, the algorithm into a secure capsule computation framework, wherein the secure capsule computation framework provides the algorithm to the data assets within the data storage structure in a secure manner that preserves the privacy of the data assets and the algorithm; and
operating, by the data processing system, the data assets through the algorithm.
[Invention 1002]
The method of invention 1001, wherein the algorithm and the input data requirements are received from an algorithm developer that is a different entity from the data host, and the optimization and/or validation selection criteria define the characteristics, format, and requirements for the data assets to be operated with the algorithm.
[Invention 1003]
The method of invention 1002, wherein the characteristics and the requirements of the data assets are defined based on:
(i) the environment of the algorithm, (ii) the distribution of examples within the input data, (iii) the parameters and types of devices producing the input data, (iv) variance versus bias, (v) the tasks implemented by the algorithm, or (vi) any combination thereof.
[Invention 1004]
The method of invention 1002, wherein:
the identifying is performed using differential privacy to share information within the data assets by describing patterns of groups within the data assets while keeping private information about individuals within the data assets hidden;
the curating includes selecting the data storage structure from among a plurality of data storage structures and provisioning the data storage structure within the infrastructure of the data host; and
the selection of the data storage structure is based on the type of algorithm, the type of data within the data assets, system requirements of the data processing system, or a combination thereof.
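As an illustration of the differential-privacy idea in invention 1004 (describing group-level patterns while hiding individuals), the following is a minimal sketch and not the method of the patent. It assumes a Laplace mechanism applied to per-group counts; the names `laplace_noise`, `dp_group_counts`, the `groups` list, and the `epsilon` parameter are hypothetical and do not appear in the source.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_group_counts(records, group_key, groups, epsilon, rng=None):
    """Return noisy per-group counts (assumed Laplace mechanism).

    `groups` is assumed to be a publicly known list of group labels, so the
    set of reported keys does not itself depend on the private records.
    Adding or removing one individual changes one count by at most 1,
    so the noise scale is 1 / epsilon.
    """
    rng = rng or random.Random()
    counts = {g: 0 for g in groups}
    for rec in records:
        if rec[group_key] in counts:
            counts[rec[group_key]] += 1
    scale = 1.0 / epsilon
    return {g: c + laplace_noise(scale, rng) for g, c in counts.items()}


# Hypothetical records held inside the data host's infrastructure.
records = [{"age_band": "40-49"}, {"age_band": "40-49"}, {"age_band": "50-59"}]
noisy = dp_group_counts(records, "age_band", ["40-49", "50-59", "60-69"],
                        epsilon=0.5, rng=random.Random(0))
```

A smaller `epsilon` adds more noise and therefore hides individuals more strongly, at the cost of less accurate group counts.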
[Invention 1005]
The method of invention 1001, 1002, or 1003, further comprising onboarding, by the data processing system, the data host,
wherein the onboarding includes confirming that the use of the data assets with the algorithm complies with data privacy requirements.
[Invention 1006]
The method of any of inventions 1001 to 1005, wherein preparing the data assets includes applying one or more transformations to the data assets, annotating the data assets, harmonizing the data assets, or a combination thereof.
[Invention 1007]
The method of any of inventions 1001 to 1006, wherein operating the data assets through the algorithm comprises performing the following:
creating a plurality of instances of the model; splitting the data assets into a training dataset and one or more test datasets; training the plurality of instances of the model with the training dataset; integrating results from the training of each of the plurality of instances of the model into a fully federated model; operating the one or more test datasets through the fully federated model; and calculating performance of the fully federated model based on the operation of the one or more test datasets.
[Invention 1008]
The method of any of inventions 1001 to 1006, wherein operating the data assets through the algorithm comprises performing the following:
dividing the data assets into one or more validation datasets; operating the one or more validation datasets through the algorithm; and calculating performance of the algorithm based on the operation of the one or more validation datasets.
[Invention 1009]
The method of any of inventions 1001 to 1008, wherein:
the secure capsule computation framework is provisioned within a computing infrastructure configured to accept encrypted code necessary to operate the algorithm; and
provisioning the computing infrastructure includes instantiating the secure capsule computation framework on the computing infrastructure, placing, by the algorithm developer, the encrypted code inside the secure capsule computation framework, and decrypting the encrypted code after the secure capsule computation framework is instantiated.
[Invention 1010]
A system comprising:
one or more data processors; and
a non-transitory computer-readable storage medium containing instructions that, when executed on the one or more data processors, cause the one or more data processors to perform operations including:
receiving an algorithm and input data requirements associated with the algorithm, wherein the input data requirements include optimization and/or validation selection criteria for data assets to be operated with the algorithm;
identifying the data assets as available from a data host based on the optimization and/or validation selection criteria for the data assets;
curating the data assets within a data storage structure within an infrastructure of the data host;
preparing the data assets in the data storage structure for processing by the algorithm;
integrating the algorithm into a secure capsule computation framework, wherein the secure capsule computation framework provides the algorithm to the data assets within the data storage structure in a secure manner that preserves the privacy of the data assets and the machine learning model; and
operating the data assets through the algorithm.
[Invention 1011]
The system of invention 1010, wherein the algorithm and the input data requirements are received from an algorithm developer that is a different entity from the data host, and the optimization and/or validation selection criteria define the characteristics, format, and requirements for the data assets to be operated with the algorithm.
[Invention 1012]
The system of invention 1011, wherein the characteristics and the requirements of the data assets are defined based on:
(i) the environment of the algorithm, (ii) the distribution of examples within the input data, (iii) the parameters and types of devices producing the input data, (iv) variance versus bias, (v) the tasks implemented by the algorithm, or (vi) any combination thereof.
[Invention 1013]
The system of invention 1010, wherein:
the identifying is performed using differential privacy to share information within the data assets by describing patterns of groups within the data assets while concealing private information about individuals within the data assets;
the curating includes selecting the data storage structure from among a plurality of data storage structures and provisioning the data storage structure within the infrastructure of the data host; and
the selection of the data storage structure is based on the type of algorithm, the type of data within the data assets, the requirements of the system, or a combination thereof.
[Invention 1014]
The system of invention 1010, 1011, or 1012, wherein the operations further include onboarding the data host,
and the onboarding includes confirming that the use of the data assets with the algorithm complies with data privacy requirements.
[Invention 1015]
The system of any of inventions 1010 to 1014, wherein preparing the data assets includes applying one or more transformations to the data assets, annotating the data assets, harmonizing the data assets, or a combination thereof.
[Invention 1016]
The system of any of inventions 1010 to 1015, wherein operating the data assets through the algorithm comprises performing the following:
creating a plurality of instances of the model; splitting the data assets into a training dataset and one or more test datasets; training the plurality of instances of the model with the training dataset; integrating results from the training of each of the plurality of instances of the model into a fully federated model; operating the one or more test datasets through the fully federated model; and calculating performance of the fully federated model based on the operation of the one or more test datasets.
[Invention 1017]
The system of any of inventions 1010 to 1015, wherein operating the data assets through the algorithm comprises performing the following:
dividing the data assets into one or more validation datasets; operating the one or more validation datasets through the algorithm; and calculating performance of the algorithm based on the operation of the one or more validation datasets.
[Invention 1018]
The system of any of inventions 1010 to 1017, wherein:
the secure capsule computation framework is provisioned within a computing infrastructure configured to accept encrypted code necessary to operate the algorithm; and
provisioning the computing infrastructure includes instantiating the secure capsule computation framework on the computing infrastructure, placing, by the algorithm developer, the encrypted code inside the secure capsule computation framework, and decrypting the encrypted code after the secure capsule computation framework is instantiated.
[Invention 1019]
A computer program product tangibly embodied in a non-transitory machine-readable storage medium, including instructions configured to cause one or more data processors to perform operations including:
receiving an algorithm and input data requirements associated with the algorithm, wherein the input data requirements include optimization and/or validation selection criteria for data assets to be operated with the algorithm;
identifying the data assets as available from a data host based on the optimization and/or validation selection criteria for the data assets;
curating the data assets within a data storage structure within an infrastructure of the data host;
preparing the data assets in the data storage structure for processing by the algorithm;
integrating the algorithm into a secure capsule computation framework, wherein the secure capsule computation framework provides the algorithm to the data assets in the data storage structure in a secure manner that preserves the privacy of the data assets and the machine learning model; and
operating the data assets through the algorithm.
[Invention 1020]
The computer program product of invention 1019, wherein the algorithm and the input data requirements are received from an algorithm developer that is a different entity from the data host, and the optimization and/or validation selection criteria define the characteristics, format, and requirements for the data assets to be operated with the algorithm.
[Invention 1021]
The computer program product of invention 1020, wherein the characteristics and the requirements of the data assets are defined based on:
(i) the environment of the algorithm, (ii) the distribution of examples within the input data, (iii) the parameters and types of devices producing the input data, (iv) variance versus bias, (v) the tasks implemented by the algorithm, or (vi) any combination thereof.
[Invention 1022]
The computer program product of invention 1020, wherein:
the identifying is performed using differential privacy to share information within the data assets by describing patterns of groups within the data assets while concealing private information about individuals within the data assets;
the curating includes selecting the data storage structure from among a plurality of data storage structures and provisioning the data storage structure within the infrastructure of the data host; and
the selection of the data storage structure is based on the type of algorithm, the type of data within the data assets, the requirements of the system, or a combination thereof.
[Invention 1023]
The computer program product of invention 1019, 1020, or 1021, wherein the operations further include onboarding the data host,
and the onboarding includes confirming that the use of the data assets with the algorithm complies with data privacy requirements.
[Invention 1024]
The computer program product of any of inventions 1019 to 1023, wherein operating the data assets through the algorithm comprises performing the following:
creating a plurality of instances of the model; splitting the data assets into a training dataset and one or more test datasets; training the plurality of instances of the model with the training dataset; integrating results from the training of each of the plurality of instances of the model into a fully federated model; operating the one or more test datasets through the fully federated model; and calculating performance of the fully federated model based on the operation of the one or more test datasets.
[Invention 1025]
The computer program product of any of inventions 1019 to 1023, wherein operating the data assets through the algorithm comprises performing the following:
dividing the data assets into one or more validation datasets; operating the one or more validation datasets through the algorithm; and calculating performance of the algorithm based on the operation of the one or more validation datasets.
[Invention 1026]
The computer program product of any of inventions 1019 to 1025, wherein:
the secure capsule computation framework is provisioned within a computing infrastructure configured to accept encrypted code necessary to operate the algorithm; and
provisioning the computing infrastructure includes instantiating the secure capsule computation framework on the computing infrastructure, placing, by the algorithm developer, the encrypted code inside the secure capsule computation framework, and decrypting the encrypted code after the secure capsule computation framework is instantiated.
[Invention 1027]
A method including the following steps:
identifying a plurality of instances of an algorithm, wherein each instance of the algorithm is integrated into one or more secure capsule computation frameworks, and the one or more secure capsule computation frameworks provide each instance of the algorithm to training data assets within one or more data storage structures of one or more data hosts in a secure manner that preserves the privacy of the training data assets and each instance of the algorithm;
executing, by a data processing system, a federated training workflow with each instance of the algorithm, wherein the federated training workflow takes the training data assets as input, uses parameters to map features of the training data assets to a target inference, calculates a loss function or an error function, updates the parameters to learned parameters to minimize the loss function or the error function, and outputs one or more trained instances of the algorithm;
integrating, by the data processing system, the learned parameters for each trained instance of the algorithm into a fully federated algorithm, wherein the integrating includes aggregating the learned parameters to obtain aggregated parameters and updating the learned parameters of the fully federated algorithm with the aggregated parameters;
executing, by the data processing system, a test workflow with the fully federated algorithm, wherein the test workflow takes test data as input, uses the updated learned parameters to find patterns in the test data, and outputs an inference;
calculating, by the data processing system, the performance of the fully federated algorithm in providing the inference;
determining, by the data processing system, whether the performance of the fully federated algorithm satisfies an algorithm termination criterion;
if the performance of the fully federated algorithm does not satisfy the algorithm termination criterion, replacing, by the data processing system, each instance of the algorithm with the fully federated algorithm and re-executing the federated training workflow with each instance of the fully federated algorithm; and
if the performance of the fully federated algorithm satisfies the algorithm termination criterion, providing, by the data processing system, the performance of the fully federated algorithm and the aggregated parameters to an algorithm developer of the algorithm.
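The train, aggregate, test, and repeat loop of invention 1027 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: it assumes a one-feature linear model, mean squared error as the loss, size-weighted averaging as the aggregation rule, and an error threshold as the termination criterion. All function names, data, and constants are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]   # (feature x, target y)
Params = Tuple[float, float]   # (weight w, bias b)


def local_train(params: Params, data: List[Sample],
                lr: float = 0.05, epochs: int = 20) -> Params:
    """Train one instance on one host's data by gradient descent on MSE."""
    w, b = params
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in data:
            err = (w * x + b) - y
            gw += 2 * err * x
            gb += 2 * err
        w -= lr * gw / len(data)
        b -= lr * gb / len(data)
    return (w, b)


def aggregate(site_params: List[Params], site_sizes: List[int]) -> Params:
    """Aggregate learned parameters (assumed: average weighted by data size)."""
    total = sum(site_sizes)
    w = sum(p[0] * n for p, n in zip(site_params, site_sizes)) / total
    b = sum(p[1] * n for p, n in zip(site_params, site_sizes)) / total
    return (w, b)


def mse(params: Params, data: List[Sample]) -> float:
    w, b = params
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)


def federated_train(sites: List[List[Sample]], test_data: List[Sample],
                    max_mse: float = 1e-3, max_rounds: int = 500):
    """Repeat local training and aggregation until the criterion is met."""
    global_params: Params = (0.0, 0.0)
    performance = float("inf")
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # Each instance starts from the current federated parameters;
        # only learned parameters, never raw data, leave a site.
        local = [local_train(global_params, d) for d in sites]
        global_params = aggregate(local, [len(d) for d in sites])
        performance = mse(global_params, test_data)
        if performance <= max_mse:   # assumed termination criterion
            break
    return global_params, performance, rounds


# Two hypothetical data hosts whose data follow y = 2x + 1.
site_a = [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)]
site_b = [(1.5, 4.0), (2.0, 5.0)]
test_set = [(0.25, 1.5), (1.25, 3.5), (1.75, 4.5)]
params, performance, rounds = federated_train([site_a, site_b], test_set)
```

On termination, `params` and `performance` correspond to the aggregated parameters and the performance that the claim provides to the algorithm developer.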
[Invention 1028]
The method of invention 1027, wherein identifying the plurality of instances of the algorithm comprises:
receiving, at the data processing system, the algorithm and input data requirements associated with the algorithm, wherein the input data requirements include optimization and/or validation selection criteria for data assets to be operated with the algorithm;
identifying, by the data processing system, the data assets as available from the one or more data hosts based on the optimization and/or validation selection criteria for the data assets;
curating, by the data processing system, the data assets in a data storage structure within an infrastructure of each data host of the one or more data hosts; and
dividing at least a portion of the data assets into the training data assets within the data storage structure within the infrastructure of each data host of the one or more data hosts.
[Invention 1029]
The method of invention 1028, wherein the algorithm and the input data requirements are received from an algorithm developer that is a different entity from the one or more data hosts, and the optimization and/or validation selection criteria define the characteristics, format, and requirements for the data assets to be operated with the algorithm.
[Invention 1030]
The method of invention 1027, 1028, or 1029, wherein the federated training workflow further includes encrypting training gradients, and the integrating includes decrypting the training gradients.
[Invention 1031]
The method of invention 1027, 1028, 1029, or 1030, further comprising:
if the performance of the fully federated algorithm satisfies the algorithm termination criterion, sending, by the data processing system, the aggregated parameters to each instance of the algorithm; and
executing, by the data processing system, an update training workflow on each instance of the algorithm, wherein the update training workflow updates the learned parameters with the aggregated parameters and outputs multiple updated and trained instances of the algorithm.
[Invention 1032]
The method of invention 1031, further comprising operating, by the data processing system, the remainder of the data assets through each instance of the algorithm.
[Invention 1033]
The method of invention 1031, wherein operating the data assets through each instance of the algorithm comprises performing the following:
further dividing at least a portion of the data assets into one or more validation datasets; operating the one or more validation datasets through each instance of the algorithm; and calculating performance of each instance of the algorithm based on the operation of the one or more validation datasets.
[Invention 1034]
A system comprising:
one or more data processors; and
a non-transitory computer-readable storage medium containing instructions that, when executed on the one or more data processors, cause the one or more data processors to perform operations including:
identifying a plurality of instances of an algorithm, wherein each instance of the algorithm is integrated into one or more secure capsule computation frameworks, and the one or more secure capsule computation frameworks provide each instance of the algorithm to training data assets within one or more data storage structures of one or more data hosts in a secure manner that preserves the privacy of the training data assets and each instance of the algorithm;
executing a federated training workflow on each instance of the algorithm, wherein the federated training workflow takes the training data assets as input, uses parameters to map features of the training data assets to a target inference, calculates a loss function or an error function, updates the parameters to learned parameters to minimize the loss function or the error function, and outputs one or more trained instances of the algorithm;
integrating, by the data processing system, the learned parameters for each trained instance of the algorithm into a fully federated algorithm, wherein the integrating includes aggregating the learned parameters to obtain aggregated parameters and updating the learned parameters of the fully federated algorithm with the aggregated parameters;
executing, by the data processing system, a test workflow with the fully federated algorithm, wherein the test workflow takes test data as input, uses the updated learned parameters to find patterns in the test data, and outputs an inference;
calculating, by the data processing system, the performance of the fully federated algorithm in providing the inference;
determining, by the data processing system, whether the performance of the fully federated algorithm satisfies an algorithm termination criterion;
if the performance of the fully federated algorithm does not satisfy the algorithm termination criterion, replacing, by the data processing system, each instance of the algorithm with the fully federated algorithm and re-executing the federated training workflow with each instance of the fully federated algorithm; and
if the performance of the fully federated algorithm satisfies the algorithm termination criterion, providing, by the data processing system, the performance of the fully federated algorithm and the aggregated parameters to an algorithm developer of the algorithm.
[Invention 1035]
A method including the following steps:
identifying, by a data processing system, data assets available from a data host based on selection criteria for the data assets;
curating, by the data processing system, the data assets in a data storage structure within an infrastructure of the data host;
preparing, by the data processing system, a transformer prototype dataset to be used as a guide for developing algorithms for data transformation, wherein the transformer prototype dataset captures key attributes of a harmonization process;
creating, at the data processing system, a first harmonized transformer set for transformation of the data assets based on a current format of data in the transformer prototype dataset;
applying, by the data processing system, the first harmonized transformer set to the data assets to generate transformed data assets;
preparing, by the data processing system, a harmonized prototype dataset to be used as a guide for developing algorithms for data transformation, wherein the harmonized prototype dataset captures key attributes of the harmonization process;
creating, by the data processing system, a second harmonized transformer set for transformation of the transformed data assets based on a current format of data in the harmonized prototype dataset;
applying, by the data processing system, the second harmonized transformer set to the transformed data assets to generate harmonized data assets; and
operating, by the data processing system, the harmonized data assets through an algorithm, wherein the algorithm is within a secure capsule computation framework that provides the algorithm to the harmonized data assets within the data storage structure in a secure manner that preserves the privacy of the harmonized data assets and the algorithm.
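The two-stage transformation in invention 1035 (a first transformer set producing transformed data assets, then a second set producing harmonized data assets) can be pictured with the sketch below. It is illustrative only and assumes that a transformer is a pure function from one record to another. The field names, the unit conversion, and the helpers `rename_field`, `convert_units`, and `apply_transformer_set` are hypothetical and not taken from the patent.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Transformer = Callable[[Record], Record]


def rename_field(old: str, new: str) -> Transformer:
    """Build a transformer that renames one field, leaving the input intact."""
    def apply(rec: Record) -> Record:
        out = dict(rec)
        if old in out:
            out[new] = out.pop(old)
        return out
    return apply


def convert_units(field: str, factor: float) -> Transformer:
    """Build a transformer that parses a numeric field and rescales it."""
    def apply(rec: Record) -> Record:
        out = dict(rec)
        out[field] = round(float(out[field]) * factor, 2)
        return out
    return apply


def apply_transformer_set(transformers: List[Transformer],
                          assets: List[Record]) -> List[Record]:
    """Apply every transformer in order to every record."""
    result = []
    for rec in assets:
        for transform in transformers:
            rec = transform(rec)
        result.append(rec)
    return result


# Hypothetical raw records as they sit in the data host's storage structure.
raw_assets = [{"pt_wt_lb": "154.0", "sex": "F"},
              {"pt_wt_lb": "198.0", "sex": "M"}]

# First set: written against the format seen in a transformer prototype dataset.
first_set = [rename_field("pt_wt_lb", "weight_lb")]
# Second set: written against the format seen in a harmonized prototype dataset.
second_set = [convert_units("weight_lb", 0.45359237),
              rename_field("weight_lb", "weight_kg")]

transformed = apply_transformer_set(first_set, raw_assets)
harmonized = apply_transformer_set(second_set, transformed)
```

Keeping the two sets separate mirrors the claim: an intermediate step such as annotation (invention 1040) can run on `transformed` before the second set is applied.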
[Invention 1036]
The method of invention 1035, wherein the selection criteria are received from an algorithm developer that is a different entity from the data host, and the selection criteria define the characteristics, format, and requirements for the data assets to be operated with the algorithm.
[Present invention 1037]
The method of invention 1036, wherein the characteristics and the requirements of the data asset are defined based on:
(i) the environment of the algorithm; (ii) the distribution of examples within the input data; (iii) the parameters and type of devices producing the input data; (iv) variance versus bias; or (vi) any combination thereof.
[Invention 1038]
The method of invention 1037, further comprising anonymizing the transformer prototype data set and making the anonymized transformer prototype data set available to the algorithm developer for the purpose of creating the first harmonized transformer set for transformation of the data asset.
[Invention 1039]
The method of the invention 1035, 1036, 1037, or 1038, wherein applying the first set of harmonized transformers to the data asset occurs within the data structure.
[Invention 1040]
The method according to any one of inventions 1035 to 1039, further comprising annotating, by the data processing system, the transformed data asset according to a defined annotation protocol to generate an annotated data set,
wherein the annotating of the transformed data asset is performed within the data structure, and the second harmonized transformer set is applied to the annotated data set to generate the harmonized data asset.
[Present invention 1041]
The method of any of the inventions 1035-1040, wherein the applying the second set of harmonized transformers to the annotated data asset occurs within the data structure.
[Present invention 1042]
The method of invention 1040 or 1041, further comprising determining whether the first harmonized transformer set, the annotation, and the second harmonized transformer set are applied successfully and without violating data privacy requirements.
[Invention 1043]
A system including:
one or more data processors; and
a non-transitory computer-readable storage medium containing instructions which, when executed on the one or more data processors, cause the one or more data processors to perform operations including:
an act of identifying a data asset available from a data host based on data asset selection criteria;
an act of curating the data asset within a data storage structure within an infrastructure of the data host;
an act of preparing a transformer prototype data set to be used as a guide for developing algorithms for data transformation, the transformer prototype data set capturing key attributes of a harmonization process;
an act of creating a first harmonized transformer set for transformation of the data asset based on a current format of data in the transformer prototype data set;
an act of applying the first harmonized transformer set to the data asset to generate a transformed data asset;
an act of preparing a harmonized prototype data set to be used as a guide for developing algorithms for data transformation, the harmonized prototype data set capturing key attributes of the harmonization process;
an act of creating a second harmonized transformer set for transformation of the transformed data asset based on a current format of data in the harmonized prototype data set;
an act of applying the second harmonized transformer set to the transformed data asset to generate a harmonized data asset; and
an act of running the harmonized data asset through an algorithm, the algorithm operating within a secure capsule computation framework that provides the algorithm to the harmonized data asset in the data storage structure in a secure manner that preserves the privacy of the harmonized data asset and the algorithm.
[Present invention 1044]
A method including the following steps:
identifying an algorithm or model, the algorithm or model being integrated into a secure capsule computation framework, the secure capsule computation framework providing the algorithm or model to a training data asset within a data storage structure of a data host in a secure manner that preserves the privacy of the training data asset and the algorithm or model;
executing, by a data processing system, a federated training workflow with the algorithm or model, the federated training workflow taking the training data asset as input, mapping features of the training data asset to a target inference using parameters, calculating a loss function or an error function, updating the parameters to learned parameters to minimize the loss function or the error function, and outputting a trained algorithm or model;
integrating, by the data processing system, the learned parameters of the algorithm or model into a fully federated algorithm or model, the integrating comprising aggregating the learned parameters to obtain aggregated parameters and updating learned parameters of the fully federated algorithm or model with the aggregated parameters;
executing, by the data processing system, a test workflow with the fully federated algorithm or model, the test workflow taking test data as input, finding patterns within the test data using the updated learned parameters, and outputting an inference;
calculating, by the data processing system, the performance of the fully federated algorithm in providing the inference;
determining, by the data processing system, whether the performance of the fully federated algorithm or model satisfies algorithm termination criteria;
if the performance of the fully federated algorithm or model does not meet the algorithm termination criteria, replacing, by the data processing system, the algorithm or model with the fully federated algorithm or model and re-executing the federated training workflow with the fully federated algorithm or model; and
if the performance of the fully federated algorithm or model satisfies the algorithm termination criteria, providing, by the data processing system, the performance of the fully federated algorithm or model and the aggregated parameters to an algorithm developer of the algorithm or model.
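By way of non-limiting illustration, the federated training loop recited above can be sketched in Python for a toy linear model. This is a sketch under stated assumptions, not the claimed implementation: the choice of a linear model trained by gradient descent on mean squared error, the sample-size-weighted averaging used for aggregation (in the spirit of federated averaging), and all function names and default values are hypothetical.

```python
def local_train(params, data, lr=0.1, epochs=50):
    """Training workflow at one data host: gradient descent on mean squared error."""
    m, b = params
    n = len(data)
    for _ in range(epochs):
        grad_m = sum(2 * (m * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (m * x + b - y) for x, y in data) / n
        m, b = m - lr * grad_m, b - lr * grad_b
    return m, b


def aggregate(site_params, site_sizes):
    """Combine learned parameters, weighting each site by its sample count."""
    total = sum(site_sizes)
    m = sum(p[0] * s for p, s in zip(site_params, site_sizes)) / total
    b = sum(p[1] * s for p, s in zip(site_params, site_sizes)) / total
    return m, b


def mean_squared_error(params, data):
    """Performance of the aggregated model on test data (lower is better)."""
    m, b = params
    return sum((m * x + b - y) ** 2 for x, y in data) / len(data)


def federated_training(sites, test_data, max_error=1e-6, max_rounds=100):
    """Repeat train, aggregate, test until the termination criterion is met."""
    params = (0.0, 0.0)
    performance = mean_squared_error(params, test_data)
    rounds = 0
    while performance > max_error and rounds < max_rounds:
        # Each site starts from the current fully federated parameters.
        learned = [local_train(params, data) for data in sites]
        params = aggregate(learned, [len(data) for data in sites])
        performance = mean_squared_error(params, test_data)
        rounds += 1
    return params, performance, rounds
```

Only the learned parameters leave each site in this sketch; the per-site training data is read solely inside `local_train`.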
[Invention 1045]
A method including the following steps:
identifying an algorithm, the algorithm being provided by an algorithm developer and integrated into a secure capsule computation framework, the secure capsule computation framework providing the algorithm to a validation data asset in a data storage structure in a secure manner that preserves the privacy of the validation data asset and the algorithm;
executing, by a data processing system, a validation workflow with the algorithm, the validation workflow taking the validation data asset as input, applying the algorithm to the validation data asset using learned parameters, and outputting an inference;
calculating, by the data processing system, a performance of the algorithm in providing the inference, the performance being calculated based on a gold standard label;
determining, by the data processing system, whether the performance of the algorithm meets validation criteria defined by an algorithm developer;
if the performance of the algorithm does not meet the validation criteria, optimizing, by the data processing system, one or more hyperparameters of the algorithm and re-running the validation workflow with the algorithm using the optimized one or more hyperparameters; and
If the performance of the algorithm meets the validation criteria, providing the performance of the algorithm and the one or more hyperparameters to the algorithm developer by the data processing system.
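By way of non-limiting illustration, the validation loop recited above can be sketched in Python. The sketch makes several assumptions that are not taken from this document: the algorithm is reduced to precomputed scores plus a single hyperparameter (a decision threshold), performance is plain accuracy against the gold standard labels, and "optimizing" is a simple walk over candidate values. All names are hypothetical.

```python
def run_validation_workflow(scores, threshold):
    """Turn model scores into binary inferences using one hyperparameter."""
    return [1 if score >= threshold else 0 for score in scores]


def accuracy(inferences, gold_labels):
    """Performance measured against gold standard labels."""
    matches = sum(1 for p, g in zip(inferences, gold_labels) if p == g)
    return matches / len(gold_labels)


def validate(scores, gold_labels, candidate_thresholds, min_accuracy):
    """Re-run validation with new hyperparameters until the criteria are met."""
    best = {"passed": False, "performance": 0.0, "hyperparameters": None}
    for threshold in candidate_thresholds:
        inferences = run_validation_workflow(scores, threshold)
        performance = accuracy(inferences, gold_labels)
        if performance > best["performance"] or best["hyperparameters"] is None:
            best = {
                "passed": performance >= min_accuracy,
                "performance": performance,
                "hyperparameters": {"threshold": threshold},
            }
        if best["passed"]:
            break  # criteria met: report performance and hyperparameters
    return best
```

The returned dictionary stands in for what would be reported to the algorithm developer: the measured performance and the hyperparameters that produced it.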
[Invention 1046]
The method of invention 1044, wherein the step of identifying the algorithm includes:
receiving, in the data processing system, the algorithm and input data requirements associated with the algorithm, the input data requirements including validation selection criteria for data assets to operate with the algorithm;
identifying, by the data processing system, the data asset as available from a data host based on the validation selection criteria for the data asset;
curating, by the data processing system, the data asset in a data storage structure within an infrastructure of the data host; and
dividing at least a portion of the data asset into the validation data asset within the data storage structure within the infrastructure of the data host.
[Invention 1047]
The validation selection criteria include clinical cohort criteria, demographic criteria, and/or data set class balance, and the clinical cohort criteria include a group of people for which the data asset is to be acquired for a cohort study, the type of the cohort study, the risk factors to which the group of people may be exposed over a period of time, the question or hypothesis to be answered and the associated disease or condition, other parameters defining the criteria for the cohort study, or any combination thereof.
[Invention 1048]
The method of invention 1045 or 1046, further including:
onboarding the data host by the data processing system, the onboarding including verifying that use of the data asset with the algorithm complies with data privacy requirements; and
completing governance and compliance requirements, including approval from an institutional review board for use of the data asset from the data host for the purpose of validating the algorithm,
wherein the curating includes selecting the data storage structure from among a plurality of data storage structures and provisioning the data storage structure within the infrastructure of the data host, the selection being based on a type of algorithm within the algorithm, a type of data within the data asset, system requirements of the data processing system, or a combination thereof.
[Invention 1049]
The method of any of inventions 1044-1047, further comprising, if the performance of the algorithm meets the validation criteria, maintaining, by the data processing system, the algorithm and the validation data asset in a secure manner that preserves the privacy of the validation data asset and the algorithm.
[Invention 1050]
The method according to any one of inventions 1044 to 1048, wherein the secure capsule computation framework is provisioned within a computational infrastructure configured to accept encrypted code necessary to operate the algorithm, and
the provisioning of the computational infrastructure includes instantiating the secure capsule computation framework on the computational infrastructure, depositing, by the algorithm developer, the encrypted code inside the secure capsule computation framework, and decrypting the encrypted code after the secure capsule computation framework is instantiated.
[Present invention 1051]
The method of invention 1049, wherein the validation data asset is a plurality of independent sets of data assets, the encrypted code is signed by the data processing system and stored in a data storage archive, and the performance of the algorithm is provided as a single validation report for validation of the algorithm, aggregated from multiple validations performed on the plurality of independent sets of data assets.
[Invention 1052]
A system including:
one or more data processors; and
a non-transitory computer-readable storage medium containing instructions which, when executed on the one or more data processors, cause the one or more data processors to perform operations including:
an act of identifying an algorithm, the algorithm being provided by an algorithm developer and integrated into a secure capsule computation framework, the secure capsule computation framework providing the algorithm to a validation data asset in a data storage structure in a secure manner that preserves the privacy of the validation data asset and the algorithm;
an act of executing a validation workflow with the algorithm, the validation workflow taking the validation data asset as input, finding patterns in the validation data asset using learned parameters, and outputting an inference;
an act of calculating performance of the algorithm in providing the inference, wherein the performance is calculated based on a gold standard label;
an act of determining whether the performance of the algorithm meets validation criteria defined by an algorithm developer;
an act of, if the performance of the algorithm does not meet the validation criteria, optimizing one or more hyperparameters of the algorithm and re-running the validation workflow with the algorithm using the optimized one or more hyperparameters; and
an act of providing the performance of the algorithm and the one or more hyperparameters to the algorithm developer if the performance of the algorithm meets the validation criteria.
[Present invention 1053]
A method including the following steps:
receiving, in a data processing system, an algorithm and input data requirements associated with the algorithm, the input data requirements including validation selection criteria for data assets to operate with the algorithm;
identifying, by the data processing system, the data asset as available from a data host based on the validation selection criteria for the data asset;
curating, by the data processing system, the data assets in a data storage structure within an infrastructure of the data host;
preparing, by the data processing system, the data asset in the data storage structure for processing by the algorithm;
integrating, by the data processing system, the algorithm into a secure capsule computation framework, the secure capsule computation framework providing the algorithm to the data asset within the data storage structure in a secure manner that preserves the privacy of the data asset and the algorithm;
executing, by the data processing system, a validation workflow with the algorithm, the validation workflow taking the data asset as input, finding patterns in the data asset using learned parameters, and outputting an inference;
calculating, by the data processing system, the performance of the algorithm in providing the inference, the performance being calculated based on a gold standard label; and
providing the performance of the algorithm to the algorithm developer by the data processing system.
[Invention 1054]
The validation selection criteria include clinical cohort criteria, demographic criteria, and/or data set class balance, and the clinical cohort criteria include a group of people for which the data asset is to be acquired for a cohort study, the type of the cohort study, the risk factors to which the group of people may be exposed over a period of time, the question or hypothesis to be answered and the associated disease or condition, other parameters defining the criteria for the cohort study, or any combination thereof.
[Present invention 1055]
The method according to invention 1053 or 1054, further including:
onboarding the data host by the data processing system, the onboarding including verifying that use of the data asset with the algorithm complies with data privacy requirements; and
completing governance and compliance requirements, including approval from an institutional review board for use of the data asset from the data host for the purpose of validating the algorithm,
wherein the curating includes selecting the data storage structure from among a plurality of data storage structures and provisioning the data storage structure within the infrastructure of the data host, the selection being based on a type of algorithm within the algorithm, a type of data within the data asset, system requirements of the data processing system, or a combination thereof.
[Invention 1056]
The method of any of the inventions 1053-1055, further comprising maintaining, by the data processing system, the algorithm and the data asset in a secure manner that preserves the privacy of the data asset and the algorithm.
[Present invention 1057]
The method according to any one of inventions 1053 to 1056, wherein the secure capsule computation framework is provisioned within a computational infrastructure configured to accept encrypted code necessary to operate the algorithm, and
the provisioning of the computational infrastructure includes instantiating the secure capsule computation framework on the computational infrastructure, depositing, by the algorithm developer, the encrypted code inside the secure capsule computation framework, and decrypting the encrypted code after the secure capsule computation framework is instantiated.
[Invention 1058]
The method of invention 1057, wherein the data asset is a plurality of independent sets of data assets, the encrypted code is signed by the data processing system and stored in a data storage archive, and the performance of the algorithm is provided as a single validation report for validation of the algorithm, aggregated from multiple validations performed on the plurality of independent sets of data assets.
The terms and expressions used are terms of description rather than limitation, and the use of such terms and expressions is not intended to exclude any equivalents of the features shown and described, or portions thereof; it is to be understood that various modifications are possible within the scope of the claimed invention. Thus, while the claimed invention has been specifically disclosed by aspects and optional features, it is to be understood that those skilled in the art may resort to modifications and variations of the concepts disclosed herein, and that such modifications and variations are considered to be within the scope of the invention as defined by the appended claims.

The invention will be better understood upon consideration of the following non-limiting figures.

A diagram illustrating an AI ecosystem in accordance with various aspects.
A diagram illustrating an AI algorithm development platform in accordance with various aspects.
A diagram illustrating a core process for optimizing and/or validating a model on data (e.g., clinical and health data) in accordance with various aspects.
A diagram illustrating a process for onboarding one or more data hosts onto an AI algorithm development platform in accordance with various aspects.
A diagram illustrating a process for identifying, acquiring, and curating data assets to be used with an AI algorithm development platform in accordance with various aspects.
A diagram illustrating a process for transforming data assets to be used with an AI algorithm development platform in accordance with various aspects.
A diagram illustrating a process for annotating data assets to be used with an AI algorithm development platform in accordance with various aspects.
A diagram illustrating a process for harmonizing data assets to be used with an AI algorithm development platform in accordance with various aspects.
A diagram illustrating a process for optimizing one or more models using an AI algorithm development platform in accordance with various aspects.
A diagram illustrating a process for validating one or more models using an AI algorithm development platform in accordance with various aspects.
A diagram illustrating an example flow for optimizing and/or validating a model on clinical and health data in accordance with various aspects.
A diagram illustrating an example computing device as part of a data processing system in accordance with various aspects.

DETAILED DESCRIPTION
I. Introduction
This disclosure describes techniques for developing AI applications and/or algorithms by distributing analysis across multiple sources of privacy-protected harmonized data (e.g., clinical and health data). More specifically, some aspects of the present disclosure provide an AI algorithm development platform that accelerates application and/or algorithm development (which may be referred to herein individually or collectively as algorithm development) by distributing analysis across multiple sources of privacy-protected harmonized data (e.g., clinical and health data). Although various aspects of machine learning and algorithmic architectures are disclosed herein in which AI algorithms are developed to solve problems in the healthcare industry, it should be understood that these architectures and techniques can also be implemented in other types of systems and settings. For example, these architectures and techniques can be implemented in the development of AI algorithms in many industries (finance, life sciences, supply chain, national security, law enforcement, public safety, etc.) in which the sensitivity of the data (e.g., whether the data contains trade secrets or private data about individuals) prevents the data from being shared outside the boundaries of the organization responsible for protecting it.

Various aspects disclosed herein describe techniques for algorithm and model development, including training, testing, optimization, and validation. The terms "algorithm(s)" and "model(s)" are used interchangeably in the detailed description (but not in the claims) for readability and brevity; thus, wherever the term "algorithm(s)" is used herein, it may be replaced by the term "model(s)", and wherever the term "model(s)" is used, it may be replaced by the term "algorithm(s)". However, it should be understood that these terms have distinct meanings. An "algorithm" is a function, method, or procedure that is implemented to complete a task or solve a problem, whereas a "model" is a well-defined computation comprising one or more algorithms that takes a value or set of values as input and produces a value or set of values as output. For example, an algorithm such as y=mx+b may be used to describe the linear relationship between the x and y variables in order to find the value of y for a particular value of x when the variables are linearly related. A model having values for the slope (m) and intercept (b) may be trained, tested, optimized, or validated, and the model may then be used to find the value of y for different points of x using the algorithm y=mx+b and the values of the slope (m) and intercept (b). Typically, regulated AI and ML algorithms (e.g., those intended for use in medical applications) should be generalizable and should therefore perform, regardless of the environment in which they are used, in a manner that is independent of details such as ethnicity, application environment and workflow (e.g., a specific clinical environment), and geography. To achieve consistent performance, algorithms need timely access to highly diverse data to enable discovery, optimization, and validation.
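The distinction between the algorithm y=mx+b and a model holding specific slope and intercept values can be made concrete with a short Python sketch. This is only an illustration under stated assumptions: the names `linear_algorithm`, `LinearModel`, and `fit` are hypothetical, and ordinary least squares is used here as one possible way of training, which the text above does not prescribe.

```python
from dataclasses import dataclass
from typing import Sequence


def linear_algorithm(x: float, m: float, b: float) -> float:
    """The algorithm: the rule y = m*x + b, with no particular values attached."""
    return m * x + b


@dataclass
class LinearModel:
    """The model: the algorithm together with learned slope and intercept."""
    m: float
    b: float

    def predict(self, x: float) -> float:
        return linear_algorithm(x, self.m, self.b)


def fit(xs: Sequence[float], ys: Sequence[float]) -> LinearModel:
    """Train a model by ordinary least squares (an assumed training method)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return LinearModel(m=m, b=b)


# Training fixes the values of m and b; the trained model is then used
# to find y for new points of x.
model = fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
prediction = model.predict(10.0)
```

Here `linear_algorithm` never changes, while `fit` produces a different `LinearModel` for each training data set.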

One of the biggest barriers to the discovery, optimization, and validation of regulated AI and ML algorithms is the lack of timely access to highly diverse data, which is often privacy-protected. Privacy-protected data is legally protected, is regarded by institutions as an organizational asset, and resides in isolated repositories with different "owners"; even when algorithm developers can access the data, it is likely not in a computable form. Within the healthcare industry, this problem is widely understood, as highlighted by Francis Collins, MD, PhD, Director of the National Institutes of Health (NIH), who stated in a July 2018 presentation to the NIH Workshop on Harnessing Artificial Intelligence and Machine Learning to Advance Biomedical Research that "many researchers spend considerable time accessing data and making it analyzable, and this aspect needs to be improved." In its Data Science Strategic Plan, the NIH also acknowledges that "currently... there is no general system for transforming or hardening innovative algorithms and tools into enterprise-ready resources that meet industry standards for ease of use and operational efficiency." These challenges apply broadly across industries in which data assets are distributed across different locations and include sensitive or private data.

Key roles in the development of AI solutions include algorithm developers, data providers, third-party annotators, third-party validator(s), algorithm regulator(s), and end users or customers (e.g., healthcare organizations). As an example of the complexity of the process of developing algorithms in regulated environments (as is the case in healthcare, finance, law enforcement, education, and many other fields), consider the current process of a company developing a regulated AI or ML algorithm for healthcare. Such companies typically endure a lengthy process to create an algorithm that meets industry standards. This process may include securing data providers, establishing a technical environment, establishing security and privacy reviews, preparing the data (e.g., curation and anonymization), transferring the anonymized data to the algorithm developer, applying AI or ML to the anonymized data for algorithm discovery and development, documenting the results of algorithm discovery and development, and submitting the algorithm or model to a regulatory body (e.g., the Food and Drug Administration (FDA)) for review and approval. This process involves several privacy threats, including reconstruction attacks (reconstructing raw private data from published features extracted from the raw data), model inversion attacks (creating feature vectors using responses received from the model), membership inference attacks (determining whether a sample was a member of the training set used to build the ML model), and re-identification of private data.

Many conventional privacy-enhancing techniques focus on enabling data providers and algorithm developers to overcome or minimize these privacy threats using secure communications, cryptographic techniques, or differentially private data release (perturbation techniques). For example, differential privacy is effective in preventing membership inference attacks. In addition, limiting the model inference output (e.g., to class labels only) can reduce the success of model inversion and membership inference attacks. Despite the aforementioned techniques for protecting private data during discovery, optimization, and validation, many algorithm developers avoid using privacy-protected data (including clinical data from healthcare institutions) because of the increased time to market, cost, and complexity of algorithm development. As a result, they produce algorithms that incorporate inherent biases due to a lack of fidelity (including clinical fidelity) and that are essentially proof-of-concept studies rather than the real-world studies that would yield AI for use in privacy-sensitive industries. For example, the timeline for completing the steps necessary to secure privacy-protected data exceeds the return expectations of a typical venture investment. Capital markets are accustomed to the timelines on which industries that use less sensitive data produce algorithms, and expect the same of privacy-sensitive industries (including healthcare). This lack of patience for returns will likely mean, for example, that most investments in healthcare AI fall short of their goals, because early-stage companies run out of funds before they can continue the multi-site training required for clinically deployed algorithms.
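Two of the conventional protections mentioned above, perturbed (differentially private) data release and label-only inference output, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than a description of any particular system: the Laplace mechanism for a counting query is a standard textbook construction swapped in here, and the function names, the `epsilon` value, and the example records are all hypothetical.

```python
import random
from typing import Callable, Dict, Sequence


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(
    records: Sequence[dict],
    predicate: Callable[[dict], bool],
    epsilon: float,
    rng: random.Random,
) -> float:
    """Release a count with Laplace noise (a counting query has sensitivity 1)."""
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def label_only(class_probabilities: Dict[str, float]) -> str:
    """Expose only the predicted class label, not the confidence scores."""
    return max(class_probabilities, key=class_probabilities.get)
```

A smaller `epsilon` adds more noise and gives stronger protection at the cost of accuracy, which is one source of the added development complexity described above.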

To address these problems, various aspects are directed to platforms and techniques for developing artificial intelligence applications by distributing analysis across multiple sources of privacy-protected harmonized clinical and health data. For example, in the context of healthcare AI algorithm development, the platform may include a group of participating healthcare data provider organizations from around the world that collectively represent ethnically and equipment-diverse data. This group may participate in a centralized contracting and pricing model that allows algorithm developers to establish a single contract for access to multiple data sources. Establishing, within the platform, shared technical infrastructure for data harmonization, annotation, validation, and federated training may allow algorithm developers to access multiple data sources in a single development and deployment environment. Multi-organization collaboration may make it possible to adopt a standardized security and privacy review that is centralized under the collaboration, allowing algorithm developers to satisfy a single standardized security and privacy review applicable across multiple organizations. Data preparation may include adopting a standardized multi-organization institutional review board (IRB) approval format (used in the context of life sciences companies) that eliminates redundancy when working with multiple organizations, and developing tools and workflows for curation, anonymization, and augmentation that perform and debug data curation, anonymization, transformation, and augmentation without exposing the underlying data set to algorithm developers and in a manner that protects the underlying assets (tools developed for one data host may be reused, or modified for reuse, with data from additional data hosts). The prepared data to be used for algorithm discovery, validation, or training is never moved out of the control of the original data owner. The software components responsible for performing discovery, validation, and/or training operate in an encapsulated format within infrastructure controlled by the data host. This infrastructure may be on-premises or in the cloud, but is entirely within the domain control of the data host.

The platform provides a secure capsule computing service framework for applying AI or ML to the prepared data in a privacy-preserving manner. In some instances, technical calibration is performed as part of this process, and it may first be performed on a small portion of the prepared data in an iterative process to ensure that the end-to-end computational pipeline completes. Because these steps run within the platform's standardized infrastructure, the process can be managed by the platform rather than by the data host's employees. This means that the process is repeatable and scalable and is performed by well-trained individuals who are familiar with it. Applying AI or ML to the prepared data may include algorithm discovery, algorithm optimization, and/or algorithm validation.

In some instances, algorithm discovery operations may be implemented entirely within the technical control of the data host, which allows a range of levels of protection for the underlying data. Certain types of algorithm development activities may be protected using differential privacy techniques to ensure minimal risk of data leakage. Depending on the level of trust between the parties, new algorithm discovery methods may be enabled through extensive monitoring of algorithm activity and of outgoing traffic from the encapsulated software running in the data host environment. One illustrative aspect of the present disclosure includes: receiving, at a data processing system, an algorithm or model and input data requirements associated with the algorithm, where the input data requirements include optimization and/or validation selection criteria for data assets to be run on the algorithm or model; identifying, by the data processing system, the data assets as available from a data host based on the optimization and/or validation selection criteria for the data assets; curating, by the data processing system, the data assets within a data storage structure that is within infrastructure of the data host; preparing, by the data processing system, the data assets within the data storage structure for processing by the algorithm or model; integrating, by the data processing system, the algorithm or model into a secure capsule computing framework, where the secure capsule computing framework serves the algorithm or model to the data assets within the data storage structure in a secure manner that preserves the privacy of the data assets and the algorithm or model; and running, by the data processing system, the data assets through the algorithm or model.
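As a rough illustration of the differential privacy techniques mentioned above, the following sketch releases a noisy cohort count using the Laplace mechanism. The description does not name a specific mechanism, so the choice of the Laplace mechanism, the function names (`laplace_noise`, `dp_count`), and the record fields are all assumptions made for this example.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two Exp(1) samples."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def dp_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that satisfy `predicate`.

    A counting query has sensitivity 1 (adding or removing one individual
    changes the count by at most 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical cohort held inside the data host's infrastructure.
cohort = [
    {"age": 71, "dx": "copd"},
    {"age": 54, "dx": "asthma"},
    {"age": 66, "dx": "copd"},
]
noisy = dp_count(cohort, lambda r: r["dx"] == "copd", epsilon=0.5,
                 rng=random.Random(0))
```

Smaller values of `epsilon` add more noise and give stronger protection; a real deployment would also need to track the cumulative privacy budget across queries, which this sketch omits.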

In some instances, optimization may be performed entirely within the data host's infrastructure and control, without requiring any external sharing of data. Depending on the level of security selected by the data host, it is possible to support algorithm optimization with no possibility of training data leakage (for example, using homomorphic encryption and refactoring algorithms). All tools for cohort development may be governed by strict differential privacy controls to prevent reconstruction, differencing, tracking, and other attacks on private data. One illustrative aspect of the present disclosure includes: identifying an algorithm or model, where the algorithm or model is integrated into a secure capsule computing framework, and the secure capsule computing framework serves the algorithm or model to training data assets within a data storage structure of a data host in a secure manner that preserves the privacy of the training data assets and the algorithm or model; executing, by a data processing system, a federated training workflow on the algorithm or model, where the federated training workflow takes the training data assets as input, maps features of the training data assets to a target inference using parameters, computes a loss or error function, updates the parameters to learned parameters in order to minimize the loss or error function, and outputs a trained algorithm or model; integrating, by the data processing system, the learned parameters of the algorithm or model into a fully federated algorithm or model, where the integrating includes aggregating the learned parameters to obtain aggregated parameters and updating the learned parameters of the fully federated algorithm or model with the aggregated parameters; executing, by the data processing system, a testing workflow on the fully federated algorithm or model, where the testing workflow takes testing data as input, finds patterns in the testing data using the updated learned parameters, and outputs an inference (a conclusion, such as a model prediction or an algorithm result, obtained from the input data, the learned parameters, and the configuration of the algorithm or model); calculating, by the data processing system, the performance of the fully federated algorithm in providing the inference; determining, by the data processing system, whether the performance of the fully federated algorithm or model satisfies an algorithm termination criterion; when the performance does not satisfy the algorithm termination criterion, replacing, by the data processing system, the algorithm or model with the fully federated algorithm or model and re-executing the federated training workflow on the fully federated algorithm or model; and when the performance satisfies the algorithm termination criterion, providing, by the data processing system, the performance of the fully federated algorithm or model and the aggregated parameters to an algorithm developer of the algorithm or model.
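The train, aggregate, test, and terminate loop described above can be sketched as follows. This is a minimal illustration, assuming a one-feature linear model, plain gradient descent at each site, and sample-weighted parameter averaging as the aggregation rule; the description does not prescribe any of these, and the names `Site`, `local_train`, `aggregate`, and `federated_train` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Training data that stays inside one data host (hypothetical)."""
    xs: list
    ys: list


def local_train(params, site, lr=0.1, epochs=20):
    """Minimize mean squared error locally; return learned params and sample count."""
    w, b = params
    n = len(site.xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(site.xs, site.ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(site.xs, site.ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b), n


def aggregate(updates):
    """Sample-weighted average of the learned parameters from every site."""
    total = sum(n for _, n in updates)
    w = sum(p[0] * n for p, n in updates) / total
    b = sum(p[1] * n for p, n in updates) / total
    return (w, b)


def mse(params, xs, ys):
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def federated_train(sites, test_xs, test_ys, max_mse=1e-4, max_rounds=200):
    params = (0.0, 0.0)  # starting algorithm or model
    for round_no in range(1, max_rounds + 1):
        updates = [local_train(params, site) for site in sites]  # only parameters leave a site
        params = aggregate(updates)            # fully federated model for this round
        performance = mse(params, test_xs, test_ys)
        if performance <= max_mse:             # algorithm termination criterion
            break
        # Otherwise the federated model replaces the starting model and training reruns.
    return params, performance, round_no


site_a = Site(xs=[0.0, 0.5, 1.0], ys=[2 * x + 1 for x in [0.0, 0.5, 1.0]])
site_b = Site(xs=[0.25, 0.75], ys=[2 * x + 1 for x in [0.25, 0.75]])
test_xs = [0.1, 0.3, 0.9]
test_ys = [2 * x + 1 for x in test_xs]
params, performance, rounds = federated_train([site_a, site_b], test_xs, test_ys)
```

Only the learned parameters and the sample counts cross the site boundary in this sketch; the raw `xs` and `ys` are read only inside `local_train`.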

In some instances, validation may be performed entirely within the data host's infrastructure and control, without requiring any external sharing of data. There is no possibility of validation data leakage, and all tools for cohort development are governed by strict differential privacy controls to prevent reconstruction, differencing, tracking, and other attacks on private data. One illustrative aspect of the present disclosure includes: receiving, at a data processing system, an algorithm and input data requirements associated with the algorithm, where the input data requirements include validation selection criteria for data assets to be run on the algorithm; identifying, by the data processing system, the data assets as available from a data host based on the validation selection criteria for the data assets; curating, by the data processing system, the data assets within a data storage structure that is within infrastructure of the data host; preparing, by the data processing system, the data assets within the data storage structure for processing by the algorithm; integrating, by the data processing system, the algorithm into a secure capsule computing framework, where the secure capsule computing framework serves the algorithm to the data assets within the data storage structure in a secure manner that preserves the privacy of the data assets and the algorithm; executing, by the data processing system, a validation workflow on the algorithm, where the validation workflow takes the validation data assets as input, finds patterns in the validation data assets using learned parameters, and outputs an inference; calculating, by the data processing system, the performance of the algorithm in providing the inference, where the performance is calculated based on gold standard labels; and providing, by the data processing system, the performance of the algorithm to an algorithm developer. A key requirement for algorithms or models that require a document submission is often validation of the algorithm or model on diverse, independent datasets from different geographies. Because the platform performs validation on independent datasets with consistent operating parameters and validation reports, and because the platform reports the underlying characteristics of each dataset (demographics, equipment used, protocols used to collect the data), completing the validation required for most submissions is dramatically easier, faster, and less expensive.
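To make the validation step concrete, the sketch below scores a binary classifier against gold standard labels and returns only aggregate metrics. The metric set (accuracy, sensitivity, specificity), the function name `validate`, and the record layout are assumptions for illustration; the description only states that performance is calculated from gold standard labels.

```python
def validate(predict, records, gold_labels):
    """Run `predict` over validation records and compare with gold standard labels."""
    if len(records) != len(gold_labels):
        raise ValueError("each record needs exactly one gold standard label")
    tp = fp = tn = fn = 0
    for record, gold in zip(records, gold_labels):
        predicted = bool(predict(record))
        if predicted and gold:
            tp += 1
        elif predicted and not gold:
            fp += 1
        elif not predicted and gold:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # None marks a metric that is undefined for this dataset.
        return numerator / denominator if denominator else None

    return {
        "n": len(records),
        "accuracy": ratio(tp + tn, len(records)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Hypothetical validation records and a threshold model.
records = [{"score": 0.9}, {"score": 0.2}, {"score": 0.7}, {"score": 0.4}]
gold = [True, False, False, True]
report = validate(lambda r: r["score"] >= 0.5, records, gold)
```

Only the summary dictionary would be returned to the algorithm developer in this sketch; the individual records and labels never appear in the report.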

Advantageously, these techniques enable the deployment of a privacy-preserving distributed platform that supports multi-site development of algorithms and models. In addition, these techniques can help algorithm developers accelerate the time to market for an algorithm or model, optimize commercial viability, and reduce the risk of investing in algorithms and models that use private data. Furthermore, these techniques can help data hosts optimize the value of their data assets in a manner that protects the privacy of the individuals whose data is represented in the applicable datasets, helps optimize patient outcomes (in medical applications), and creates innovative thought leadership in the field.

II. Artificial Intelligence (AI) Ecosystem
FIG. 1 shows an AI ecosystem 100 that enables secure federated computation on private datasets 105a-105n ("n" represents any natural number), including augmentation of data from the datasets 105a-105n, deployment of algorithms/models 110a-110n, validation of the algorithms/models 110a-110n, optimization (training/testing) of the algorithms/models 110a-110n, and federated training of the algorithms/models 110a-110n on multiple datasets 105a-105n. The AI development platform 110 includes an AI system 115 and secure capsule computing services 120a-120n. The AI system 115 manages the development and deployment of software assets within the AI ecosystem 100. Various software modules, including pre-trained algorithms/models 110a-110n for training, optimization, scoring, and validation, and algorithm training code (for training algorithms/models on data), are developed and deployed through the AI system 115. For example, the AI system 115 may deploy or output various software components to one or more of the secure capsule computing services 120a-120n. The secure capsule computing services 120a-120n are encapsulated or otherwise portable software components that can compute in a secure capsule computing service environment entirely within the data host's computing domain.

The secure capsule computing services facilitate the deployment of software, including the algorithms to be validated and/or trained, into an external environment (in this case, the data host's computing environment), and additionally provide security services that protect both the privacy of the software deployed within the capsule and the security of the host computing environment. In different aspects, encryption, secure capsule computing frameworks, isolation, obfuscation, and other methods may be used to provide security to all parties. For example, in some instances it may be important to the algorithm developer that the organization running the software in the secure encapsulated computing module (e.g., the data host) cannot inspect, copy, or modify the algorithm developer's proprietary software. The secure encapsulated computing services provide a portable environment to support such a computing model. Moreover, in many cases the stakeholder running the secure encapsulated computing service wants protection against the possibility that the hosted software is malicious and could harm the host's infrastructure or compromise data privacy. In addition, the AI system 115 may deploy or output data resources, model parameters, or shared data for training (e.g., a parent-teacher training paradigm). In return, the AI system 115 may receive inputs including computation results, data, and computation monitoring results, trained models, model parameters, or other results of deployed computing components and processes such as the secure capsule computing services 120a-120n.

The secure capsule computing services 120a-120n may provide the same or different functions or operations. For example, secure capsule computing services 120a and 120n may both manage data transformation and computation activities, yet implement different functions or operations in carrying out those activities. The secure capsule computing service 120a may receive data and software from the AI system 100 and receive uncurated data from data sources 125a-125n (e.g., data sources 120a and 120b). The data may be imported by the software from the AI system 100 (e.g., the algorithms/models 110a-110n), optionally stored, transformed, otherwise harmonized, and then computed on. In contrast, the secure capsule computing service 120n may receive data and software from the AI system 100, receive uncurated data from the data sources 125a-125n (e.g., data sources 120a and 120b), and join data from one or more third parties 130. The data may be imported by the software from the AI system 100 (e.g., the algorithms/models 110a-110n), optionally stored, transformed, joined with the combined data, otherwise harmonized, and then computed on. The secure capsule computing services 120a-120n communicate with the data sources 125a-125n, and optionally with the third parties 130, via a communication network 135. Examples of the communication network 135 may include mobile networks, wireless networks, cellular networks, local area networks (LANs), wide area networks (WANs), other wireless communication networks, or combinations thereof.

FIG. 2 shows an AI algorithm development platform 200 (e.g., a data processing system implemented within the AI ecosystem 100 described with respect to FIG. 1) for developing artificial intelligence algorithms by distributing analysis across multiple sources of privacy-protected, harmonized data. In some instances, the platform 200 includes an AI system 205 that communicates with a network of secure capsule computing services 210 (for simplicity, only one secure capsule computing service is shown). The AI system 205 may include a data science development module 215, a data harmonizer workflow creation module 220, a software deployment module 225, a federated master algorithm training module 230, a system monitoring module 235, and a data store containing global combined data 240. The AI system 205 may communicate with one or more algorithm developers and is configured to receive one or more algorithms or models to be optimized and/or validated in a new project.

The data science development module 215 may be configured to receive input data requirements from one or more algorithm developers for the optimization and/or validation of one or more models. The input data requirements define the objectives of the data curation, data transformation, and data harmonization workflows. The input data requirements also provide constraints for identifying data assets that are acceptable for use with the one or more models. The data harmonizer workflow creation module 220 may be configured to manage the development and deployment of transformation, harmonization, and annotation protocols. The software deployment module 225, together with the data science development module 215 and the data harmonizer workflow creation module 220, may be configured to evaluate data assets for use with the one or more models. This process can be automated, or it can be an interactive search/query process. The software deployment module 225, together with the data science development module 215, may be further configured to integrate the models, along with the required libraries and resources, into a secure capsule computing framework.
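One way to picture how input data requirements act as constraints on acceptable data assets is the automated filter sketched below. The field names (`modality`, `min_records`, `require_annotations`) and the classes `DataAsset` and `InputDataRequirements` are hypothetical; the description does not define a schema for requirements or assets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataAsset:
    """Catalog entry describing a dataset held by a data host (hypothetical fields)."""
    host: str
    modality: str
    n_records: int
    annotated: bool


@dataclass(frozen=True)
class InputDataRequirements:
    """Selection criteria supplied by the algorithm developer (hypothetical fields)."""
    modality: str
    min_records: int
    require_annotations: bool = False


def acceptable_assets(assets, requirements):
    """Return the assets that satisfy every constraint in the requirements."""
    return [
        asset for asset in assets
        if asset.modality == requirements.modality
        and asset.n_records >= requirements.min_records
        and (asset.annotated or not requirements.require_annotations)
    ]


catalog = [
    DataAsset("host-a", "chest-xray", 12000, True),
    DataAsset("host-b", "chest-xray", 800, True),
    DataAsset("host-c", "ecg", 50000, False),
]
wanted = InputDataRequirements("chest-xray", min_records=1000, require_annotations=True)
matches = acceptable_assets(catalog, wanted)
```

The filter works on catalog metadata only, so it can run without touching the underlying records; an interactive search/query process could apply the same constraints one at a time.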

In some aspects, it is desirable to develop a robust, superior algorithm/model that has learned from multiple independent private datasets 245, 250 (e.g., clinical and health data) collected by data hosts 255 from sources 260 (e.g., patients). The federated master algorithm training module 230 may be configured to aggregate the learning from the independent datasets into a single master algorithm. In different aspects, the algorithmic methodology for federated training may differ. For example, sharing of model parameters, ensemble learning, parent-teacher learning on shared data, and many other methods may be developed to enable federated training. Privacy and security requirements, along with commercial considerations such as determining how much each data system may be paid for access to its data, may determine which federated training methodology is used.

The system monitoring module 235 monitors activity in the secure capsule computing services 210. The monitored activity may range from operational tracking, such as computing workload, error states, and connection status, to data science monitoring, such as the amount of data processed, algorithm convergence status, variation in data characteristics, data errors, algorithm/model performance metrics, and a host of additional metrics required by each use case and aspect.

In some instances, it is desirable to augment the private datasets 245, 250 with additional data 240 (combined data). For example, air quality data for a geographic location can be joined with a patient's geographic location data to ascertain environmental exposure. In certain instances, the combined data 240 may be transmitted to the secure capsule computing services 210 to be joined with the data 245, 250 during data harmonization or computation.
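The air quality example can be sketched as a simple keyed join performed inside the data host's environment. The join key (`region`), the `pm25` field, and the function name `join_air_quality` are assumptions for illustration only.

```python
def join_air_quality(patients, air_quality_by_region):
    """Attach a regional air quality reading to each patient record.

    Patients whose region has no reading get None, so missing exposure
    data stays visible instead of being silently dropped.
    """
    joined = []
    for patient in patients:
        row = dict(patient)  # copy, so the private source records are not mutated
        row["pm25"] = air_quality_by_region.get(patient["region"])
        joined.append(row)
    return joined


# Hypothetical private records (stay with the data host) and public reference data.
patients = [
    {"patient_id": "p1", "region": "94110"},
    {"patient_id": "p2", "region": "10001"},
]
air_quality = {"94110": 8.4}
augmented = join_air_quality(patients, air_quality)
```

Because the reference data is sent to where the private records live, rather than the other way around, the join does not require the patient data to leave the data host.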

The secure capsule computing services 210 may include a harmonizer workflow module 265, harmonized data 270, a remote computing module 275, a system monitoring module 280, and a data management module 285. The transformation, harmonization, and annotation workflows managed by the data harmonizer workflow creation module 230 may be deployed and executed in the environment by the harmonizer workflow module 265 using the transformation and harmonization data 270. In some instances, the combined data 240 may be transmitted to the harmonizer workflow module 265 to be joined with the data 245, 250 during data harmonization. The remote computing module 275 may be configured to run the private datasets 245, 250 through the algorithm/model. In some aspects, the running includes executing a training workflow that includes: creating multiple instances of the algorithm/model; splitting the private datasets 245, 250 into a training dataset and one or more testing datasets; training the multiple instances of the algorithm/model on the training dataset; integrating the results from training each of the multiple instances of the model into a fully federated algorithm/model; running the one or more testing datasets through the fully federated algorithm/model; and computing the performance of the fully federated model based on the running of the one or more testing datasets. In other aspects, the running includes executing a validation workflow that includes: splitting, combining, and/or transforming the private datasets 245, 250 into one or more validation datasets; running the one or more validation datasets through the machine learning algorithm/model; and computing the performance of the machine learning algorithm/model based on the running of the one or more validation datasets. In some instances, the combined data 240 may be transmitted to the remote computing module 275 to be joined with the data 245, 250 during computation.

The system monitoring module 280 monitors activity in the secure capsule computing services 210. The monitored activity may include operational tracking, such as algorithm/model intake, workflow configuration, and data host onboarding, as required by each use case and aspect. The data management module 285 may be configured to import data assets, such as the private datasets 245, 250, from the data hosts 255 while maintaining the data assets within the existing infrastructure of the data hosts 255.

III. Techniques for Optimizing and/or Validating a Model
In various aspects, techniques are provided for optimizing and validating one or more models using one or more sources of privacy-protected, harmonized data (e.g., harmonized clinical and health data). The models may be provided by a first entity (e.g., an algorithm developer such as a life sciences company), and the datasets may be provided by a second entity (e.g., a data host such as an academic medical center). For example, a life sciences company may be interested in accelerating the time to market, or optimizing the commercial viability, of its products and services by optimizing or evaluating the performance of its models on one or more sources of privacy-protected, harmonized clinical and health data. Likewise, an academic medical center may be interested in unlocking the value of its clinical and health data in a manner that maintains data privacy. As shown in FIG. 3, to meet the needs of these two entities, a core process 300 for optimizing and/or validating models on clinical and health data may be performed using an artificial intelligence algorithm development platform (e.g., the platforms and systems described with respect to FIGS. 1 and 2).

At block 305, a third-party algorithm developer (a first entity) provides one or more algorithms or models to be optimized and/or validated in a new project. The one or more algorithms or models may be developed by the algorithm developer using its own development environment, tools, and seed datasets (e.g., training/testing datasets). In some instances, the models include one or more predictive models. The predictive models can include any algorithm, such as an ML model, including but not limited to a convolutional neural network ("CNN"), e.g., an inception neural network or a residual neural network ("Resnet"), or a recurrent neural network, e.g., a long short-term memory ("LSTM") model or a gated recurrent unit ("GRU") model. A predictive model can also be any other suitable ML model trained to predict something that cannot be measured directly or that will occur in the future, or to make an inference from data (e.g., a model prediction or an algorithm result). For example, in medical applications, such a model may infer a clinical symptom, identity, diagnosis, or prognosis from images or video frames, and may be, e.g., a three-dimensional CNN ("3DCNN"), a dynamic time warping ("DTW") technique, a hidden Markov model ("HMM"), or a combination of one or more such techniques, e.g., CNN-HMM or MCNN (Multi-Scale Convolutional Neural Network). The algorithm developer may use the same type of predictive model or different types of predictive models (e.g., an ensemble ML technique) to make predictions or inferences, e.g., of a clinical symptom, identity, diagnosis, or prognosis. The seed dataset may be an initial dataset (e.g., private or public clinical or health data) obtained by the algorithm developer for initial training and testing of the model.

At block 310, the algorithm developer provides constraints for the optimization and/or validation of the one or more models. The constraints may include one or more of (i) training constraints, (ii) data preparation constraints, and (iii) validation constraints. These constraints essentially define the objectives of the optimization and/or validation of the one or more models, including data preparation (e.g., data curation, data transformation, data harmonization, and data annotation), model training, model validation, and reporting, as described in further detail with respect to FIGS. 4-12. As will be appreciated, where the algorithm developer requests that one or more of its models be optimized using training/testing datasets available from a data host (the second entity), the algorithm developer may provide the models, training constraints, and data preparation constraints. In other instances, where the algorithm developer requests that one or more of its models be validated using validation datasets available from the data host, the algorithm developer may provide the models, validation constraints, and data preparation constraints. In still other instances, where the algorithm developer requests that one or more of its models be optimized and validated using training/testing/validation datasets available from the data host, the algorithm developer may provide the models, training constraints, validation constraints, and data preparation constraints.
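
The mapping between the kind of request and the constraint groups that accompany it can be sketched as follows. This is only an illustration under stated assumptions: the names `RequestType`, `ProjectSubmission`, `model_ref`, and the dictionary layout are hypothetical and are not taken from the platform described here.

```python
from dataclasses import dataclass, field
from enum import Enum


class RequestType(Enum):
    OPTIMIZE = "optimize"
    VALIDATE = "validate"
    OPTIMIZE_AND_VALIDATE = "optimize_and_validate"


# Constraint groups that the description associates with each kind of request.
REQUIRED_GROUPS = {
    RequestType.OPTIMIZE: {"training", "data_preparation"},
    RequestType.VALIDATE: {"validation", "data_preparation"},
    RequestType.OPTIMIZE_AND_VALIDATE: {"training", "validation", "data_preparation"},
}


@dataclass
class ProjectSubmission:
    """Hypothetical container for what an algorithm developer submits."""
    request: RequestType
    model_ref: str  # opaque handle to the model; the format is an assumption
    constraints: dict = field(default_factory=dict)  # group name -> settings

    def missing_groups(self) -> set:
        """Return the constraint groups still needed for this request type."""
        return REQUIRED_GROUPS[self.request] - set(self.constraints)


submission = ProjectSubmission(
    request=RequestType.OPTIMIZE_AND_VALIDATE,
    model_ref="model-001",
    constraints={"training": {"max_epochs": 20}, "data_preparation": {}},
)
print(sorted(submission.missing_groups()))  # ['validation']
```

A check of this kind could run before a project is accepted, so that a request to validate a model is not started without validation constraints.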

In some embodiments, the training constraints include, but are not limited to, one or more of: hyperparameters, regularization criteria, convergence criteria, algorithm termination criteria, training/validation/test data splits defined for use with the one or more algorithms, and training/testing report requirements. A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data. Hyperparameters are settings that can be tuned or optimized to control the behavior of a machine learning algorithm and to help estimate or learn the model parameters. The process of selecting and optimizing hyperparameters is an important aspect of many machine learning solutions. Most machine learning algorithms explicitly define hyperparameters that control different aspects of the model, such as memory or cost of execution. Hyperparameters are specified by the algorithm developer and may be set using one or more problem-solving techniques such as heuristics. However, additional hyperparameters may be defined to adapt an algorithm to a specific scenario. For example, the hyperparameters may include the number of hidden units of a model, the learning rate of a model, or the convolution kernel width of a model.
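
As an illustration of how hyperparameters might be supplied as training constraints, the sketch below enumerates candidate configurations over the three example hyperparameters named above. The exhaustive grid is an assumption made for illustration (the description mentions heuristics, not a specific search method), and the value ranges and the name `search_space` are hypothetical.

```python
from itertools import product

# Hypothetical search space over the three hyperparameters named in the text.
search_space = {
    "hidden_units": [64, 128],
    "learning_rate": [1e-3, 1e-2],
    "kernel_width": [3, 5],
}


def enumerate_configs(space):
    """Yield every combination of hyperparameter values (a plain grid)."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(enumerate_configs(search_space))
print(len(configs))  # 8
print(configs[0])    # {'hidden_units': 64, 'learning_rate': 0.001, 'kernel_width': 3}
```

Each configuration would then be trained and scored separately; none of these values can be estimated from the data itself, which is what distinguishes them from model parameters.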

Regularization constrains or shrinks the coefficient estimates toward zero. In other words, this technique discourages learning a more complex or flexible model in order to avoid the risk of overfitting. Regularization significantly reduces the variance of the model without a substantial increase in its bias. Accordingly, regularization constraints, such as the tuning parameter λ used in regularization techniques, control the impact on bias and variance. As the value of λ rises, the values of the coefficients shrink, and the variance is therefore reduced. Up to a point, this increase in λ is beneficial, because it only reduces the variance (and hence avoids overfitting) without losing any important properties of the data. Beyond a certain value, however, the model starts to lose important properties, which introduces bias into the model and thus leads to underfitting. Regularization constraints, such as the value of λ, may therefore be selected to implement various regularization techniques within the model. In contrast, convergence criteria are used to verify the convergence of a sequence (e.g., the convergence of one or more weights after a number of iterations). Convergence criteria may be implemented in various forms, such as a fixed number of epochs, a target definition, and early stopping; constraints for the convergence criteria may therefore include the form or technique to be used and the variables of that form or technique, such as the number of training iterations to be performed, a target value, a performance value, the validation dataset to be used, a defined improvement, etc.
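
The effect of λ and one form of convergence criterion can be made concrete with a small sketch. It assumes a one-dimensional ridge (L2) penalty without an intercept, for which the estimate has the closed form Σxy / (Σx² + λ), and a patience-based early-stopping rule; both are illustrative choices, and the function names and the `patience` and `min_delta` parameters are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form 1-D ridge estimate (no intercept): sum(x*y) / (sum(x*x) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def early_stop_epoch(val_losses, patience, min_delta=0.0):
    """Return the index of the epoch at which training would stop, or None.

    Training stops once `patience` consecutive epochs fail to improve the
    best validation loss by more than `min_delta`.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # the unpenalized slope is exactly 2
slopes = [ridge_slope(xs, ys, lam) for lam in (0.0, 1.0, 14.0)]
print(slopes)  # the slope shrinks toward zero as lam grows: 2.0, about 1.867, 1.0
print(early_stop_epoch([0.9, 0.7, 0.6, 0.61, 0.62, 0.63], patience=2))  # 4
```

With λ = 14 the estimated slope is half the true value, which illustrates the point at which shrinkage starts to cost accuracy rather than only reducing variance.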

Algorithm termination criteria define the parameters for determining whether a model has achieved sufficient training. Because algorithm training is an iterative optimization process, the training algorithm may repeat the following steps multiple times: operate on all or part of the training data, update the model parameters, and then re-evaluate the performance of the model. In general, the termination criteria may include an algorithm performance target, often defined as a minimum amount of performance improvement per iteration or set of iterations that is required for processing to continue. In some instances, the termination criteria may include a maximum number of iterations of the training model update process, or a maximum amount of clock time or number of compute cycles to be allotted to training. Other methods for determining when to stop the iterative training process are also contemplated. The training/validation/test data splits include criteria for splitting the data assets into training sets, validation sets, and/or test sets. The training dataset is the set of data used to fit or train the model. The validation dataset is the set of data used to provide an unbiased evaluation of a model fit or trained on the training dataset while the model parameters are tuned. The test dataset is the set of data used to provide an unbiased evaluation of a final model fit or trained on the training dataset. How these datasets are split may depend on a number of factors, including the total number of samples available from the data to be used for training, testing, and/or validating the model. For example, some models need substantial data to train on, so in that case the algorithm developer may define constraints that optimize for a larger training set. Models with very few parameters may be simpler to validate and tune, so the algorithm developer may define constraints that reduce the size of the validation set. If a model has many parameters, however, the algorithm developer may want to define constraints that accommodate a large validation set (although cross-validation may also be considered and included within the constraints).
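
The termination criteria and data splits described above can be sketched as two small helpers. The split fractions, the seed, and all function and parameter names are assumptions chosen for illustration; they are not values prescribed by this description.

```python
import random
import time


def split_indices(n_samples, fractions=(0.7, 0.15, 0.15), seed=0):
    """Shuffle sample indices and cut them into train/validation/test lists."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * fractions[0])
    n_val = int(n_samples * fractions[1])
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


def should_terminate(history, max_iterations, min_improvement,
                     started_at=None, max_seconds=None):
    """Decide whether iterative training should stop.

    `history` holds one loss value per completed iteration (lower is better).
    `started_at` is a time.monotonic() reading taken when training began.
    """
    if len(history) >= max_iterations:
        return True  # iteration budget exhausted
    if max_seconds is not None and started_at is not None:
        if time.monotonic() - started_at >= max_seconds:
            return True  # clock-time budget exhausted
    if len(history) >= 2 and history[-2] - history[-1] < min_improvement:
        return True  # last iteration improved by less than the minimum
    return False


train, val, test = split_indices(100)
print(len(train), len(val), len(test))  # 70 15 15
print(should_terminate([1.0, 0.5], max_iterations=10, min_improvement=0.01))    # False
print(should_terminate([1.0, 0.995], max_iterations=10, min_improvement=0.01))  # True
```

A developer who needs a larger training set would raise the first fraction; a model with many parameters might instead call for a larger validation fraction or for cross-validation.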

The training/testing report may include the metrics and criteria that the algorithm developer is interested in observing from the training, optimization, and/or testing of the one or more models. In some instances, the constraints for the metrics and criteria are selected to illustrate the performance of the models. For example, metrics and criteria such as mean percentage error may provide information on bias, variance, and other errors that may occur when finalizing a model, such as vanishing or exploding gradients. Bias is an error in the learning algorithm that arises when the learning algorithm is too weak to learn from the data. With high bias, the learning algorithm is unable to learn the relevant details in the data, so the model performs poorly on both the training and test datasets. Variance, in contrast, is an error in the learning algorithm that arises when the learning algorithm tries to over-learn from the dataset or tries to fit the training data as closely as possible. With high variance, the algorithm performs poorly on the test dataset but may perform quite well on the training dataset. Moreover, common error metrics such as mean percentage error and the R² (coefficient of determination) score are not always indicative of the accuracy of a model, so the algorithm developer may want to define additional metrics and criteria for a more in-depth look at model accuracy. For example, if the selected dataset includes time-series data, which tends to be correlated in time and often exhibits significant autocorrelation, the algorithm developer may want to use one or more additional metrics or criteria to monitor the autocorrelation; when a model is evaluated in terms of its ability to predict a value directly, common error metrics such as mean percentage error and the R² score can both indicate a falsely high prediction accuracy.
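
One way to monitor autocorrelation is sketched below: compute the lag-1 autocorrelation of the series and compare the model's error against a naive persistence forecast that simply repeats the previous observation. The persistence baseline is an assumption introduced here for illustration and is not named in this description; the sample series and function names are hypothetical.

```python
def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation of a numeric sequence."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    numer = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    return numer / denom


def mean_abs_error(actual, predicted):
    """Average absolute difference between paired values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def persistence_forecast(series):
    """Naive baseline: predict each value as the previous observation."""
    return series[:-1]


series = [10.0, 10.5, 11.0, 11.4, 12.0, 12.3, 13.0, 13.2]
r1 = lag1_autocorrelation(series)
baseline_mae = mean_abs_error(series[1:], persistence_forecast(series))
print(round(r1, 2))            # 0.64
print(round(baseline_mae, 3))  # 0.457
```

When the lag-1 autocorrelation is high, a model whose error is no better than `baseline_mae` has learned little beyond "tomorrow looks like today", even if its R² score looks strong.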

In some embodiments, the data preparation constraints include, but are not limited to, one or more of input data requirements and annotation protocol requirements. The input data requirements may include optimization and/or validation selection criteria for the data assets to be run on the algorithms or models. The optimization and/or validation selection criteria define the characteristics, data formats, and requirements for input data (e.g., external data) to be usable by the models. The characteristics and requirements of the input data are those that make the data usable for optimizing and/or validating the models. For example, an algorithm implemented by a model may require training data that accurately represents the environment in which the model will operate, such as data from different ethnicities or geographies, in order to create a more generalizable algorithm. In some instances, the characteristics and requirements of the input data are defined based on: (i) the environment of the model; (ii) the distribution of examples, such as 50% male and 50% female; (iii) the parameters and type of devices generating the data (e.g., image data) and/or measurements; (iv) variance versus bias, i.e., models with high variance can easily fit the training data and accommodate complexity but are sensitive to noise, whereas models with high bias are less flexible, less sensitive to variations in data and noise, and prone to missing complexity; (v) the task(s) implemented by the models, such as classification, clustering, regression, or ranking; or (vi) any combination thereof. For example, the data characteristics and requirements for a model developed to predict the presence of a tumor using classification and clustering techniques may include requirements for three-dimensional imaging data and/or biomarker testing data, used to identify such tumors, from an equal mix of women and men between the ages of 30 and 60. Format refers not only to the file type of the data (e.g., all image data must be in .jpeg file format) but also to the consistency of the records themselves. For example, the data format constraints may define a standard nomenclature for the datasets that is recognized by the models (e.g., one that provides standardized codes, terms, synonyms, and definitions covering anatomy, diseases, findings, procedures, microorganisms, substances, etc.).
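
A check of candidate records against input data requirements of this kind might look like the following sketch. It encodes the tumor example above (image file type, ages 30 to 60, equal mix of women and men); the record layout, the `InputRequirements` class, and the tolerance `max_sex_imbalance` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class InputRequirements:
    """Hypothetical encoding of input data requirements."""
    allowed_extensions: tuple = (".jpeg", ".jpg")
    min_age: int = 30
    max_age: int = 60
    max_sex_imbalance: float = 0.05  # allowed deviation from a 50/50 split


def check_records(records, req):
    """Return a list of human-readable violations (empty if none)."""
    problems = []
    for i, rec in enumerate(records):
        if not rec["file"].lower().endswith(req.allowed_extensions):
            problems.append(f"record {i}: unsupported file type {rec['file']}")
        if not req.min_age <= rec["age"] <= req.max_age:
            problems.append(f"record {i}: age {rec['age']} outside {req.min_age}-{req.max_age}")
    if records:
        counts = Counter(rec["sex"] for rec in records)
        female_share = counts["F"] / len(records)
        if abs(female_share - 0.5) > req.max_sex_imbalance:
            problems.append(f"sex distribution {dict(counts)} is not balanced")
    return problems


records = [
    {"file": "scan1.jpeg", "age": 45, "sex": "F"},
    {"file": "scan2.png", "age": 52, "sex": "M"},
    {"file": "scan3.jpeg", "age": 71, "sex": "M"},
    {"file": "scan4.jpeg", "age": 33, "sex": "F"},
]
for problem in check_records(records, InputRequirements()):
    print(problem)
# record 1: unsupported file type scan2.png
# record 2: age 71 outside 30-60
```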

Annotation protocol requirements may include the different types of data annotation to be used for the models. Data annotation is the task of labeling data (e.g., the datasets for optimization and/or validation), which may take any form, such as structured numbers, text, audio, images, or video. Data annotation is an important stage of data preparation in supervised machine learning. The model learns to recognize recurring patterns in the annotated data. After an algorithm has processed enough annotated data, it can begin to recognize the same patterns when presented with new, unannotated data. Various types of annotation can be defined within the annotation protocol based on the constraints. For example, semantic annotation may be defined for various concepts within text, such as analytical data, medical notes, or diagnostic codes. Text categorization and content categorization may be defined for assigning predefined categories to documents. For example, sentences or paragraphs within a document can be tagged by topic, or medical publications can be organized by subject, such as internal medicine, oncology, hematology, or microbiology. For image and video annotation, bounding boxes may be used, which are imaginary boxes drawn on an image or a video frame. The contents of a bounding box may be annotated to help the model recognize the contents as a distinct type of object, such as a tumor or a bone fracture. For audio annotation, entity annotation, entity linking, and phrase chunking may be used to label and define portions of unstructured speech and to tag portions of speech with their linguistic or grammatical meaning.
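
A bounding-box annotation and a check against an annotation protocol can be sketched as follows. The label set, the pixel-coordinate convention, and the `BoundingBox` class are assumptions for illustration only.

```python
from dataclasses import dataclass

ALLOWED_LABELS = {"tumor", "fracture"}  # assumed label set for illustration


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in pixel coordinates, with a class label."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int
    label: str

    def is_valid(self, image_width, image_height):
        """True if the box lies inside the image and uses a permitted label."""
        inside = (0 <= self.x_min < self.x_max <= image_width
                  and 0 <= self.y_min < self.y_max <= image_height)
        return inside and self.label in ALLOWED_LABELS


box = BoundingBox(x_min=40, y_min=60, x_max=120, y_max=150, label="tumor")
print(box.is_valid(512, 512))  # True
print(BoundingBox(40, 60, 600, 150, "tumor").is_valid(512, 512))  # False: exceeds image width
```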

In some embodiments, the validation constraints include, but are not limited to, one or more of validation data selection criteria, validation termination criteria, and validation report requirements. The validation data selection criteria may include selection criteria for the validation dataset, which can include any factors required to select an appropriate subset of data for the application under development. For example, in medical applications, cohort selection includes, but is not limited to, clinical cohort criteria, demographic criteria, and dataset class balance. In medical algorithm development, a cohort study is a type of medical research used to investigate the causes of disease and to establish links between risk factors and health outcomes in a group of people known as a cohort. Retrospective cohort studies look at data that already exist and try to identify risk factors for particular conditions. In prospective cohort studies, researchers raise a question and form a hypothesis about what might cause a disease, and then observe the cohort over a period of time to prove or disprove the hypothesis. Accordingly, the clinical cohort criteria may define the group of people from whom data is to be obtained for a study, the type of study (e.g., retrospective or prospective), the risk factors to which the group may be exposed over a period of time, the question/hypothesis to be answered and the associated disease or condition, and/or other parameters that define the criteria for the cohort study. The demographic criteria define the demographic factors of the group of people from whom data is to be obtained for the study. Demographic factors may include, for example, age, gender, education level, income level, marital status, occupation, religion, birth rate, death rate, average family size, and average age at marriage. The dataset class balance defines whether and how the data is to be presented for the study. For example, datasets are often imbalanced (e.g., many more patients with negative analytical test results than patients with positive analytical test results). A simple way to fix an imbalanced dataset is to balance it, either by oversampling instances of the minority class or by undersampling instances of the majority class. Accordingly, constraints for dataset class balance may define (i) whether the dataset should be balanced at all; (ii) how balanced the dataset should be, e.g., whether 40:60 is acceptable as compared to 80:20, or whether it must be 50:50; and (iii) how the balancing is to be performed, e.g., by oversampling the minority class. Many of these cohort-definition considerations in medicine have analogs in other application areas.
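
The class-balance constraint can be illustrated with a minimal random-oversampling sketch for two classes. The `target_ratio` parameter (minority count divided by majority count) is a hypothetical way of expressing targets such as 40:60 or 50:50, and the function name and seed are assumptions.

```python
import random
from collections import Counter


def oversample_minority(samples, labels, target_ratio=1.0, seed=0):
    """Randomly duplicate minority-class samples until
    minority_count / majority_count >= target_ratio (binary labels only)."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    (majority, n_major), (minority, n_minor) = counts.most_common(2)
    rng = random.Random(seed)
    minority_pool = [s for s, y in zip(samples, labels) if y == minority]
    out_samples, out_labels = list(samples), list(labels)
    while n_minor / n_major < target_ratio:
        out_samples.append(rng.choice(minority_pool))
        out_labels.append(minority)
        n_minor += 1
    return out_samples, out_labels


# 80 negative and 20 positive results: an 80:20 imbalance.
samples = list(range(100))
labels = ["neg"] * 80 + ["pos"] * 20

_, balanced = oversample_minority(samples, labels, target_ratio=1.0)
print(Counter(balanced))  # Counter({'neg': 80, 'pos': 80})

_, relaxed = oversample_minority(samples, labels, target_ratio=40 / 60)
print(Counter(relaxed)["pos"])  # 54, the smallest count giving at least 40:60
```

Undersampling the majority class would be the mirror image of this function; which of the two is permitted is exactly what item (iii) of the constraint specifies.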

The validation termination criteria define whether a model has achieved sufficient validation. The validation report may include the metrics and criteria that the algorithm developer is interested in observing from the validation of the one or more models. In some instances, the constraints for the metrics and criteria are selected to illustrate the overall performance and/or accuracy of the models. For example, the metrics and criteria may provide information on: whether there are any data errors (e.g., a mismatch between the expected and actual state of the data) in new batches of data ingested by the model; whether errors surface between batches of data; whether there is feature skew or distribution skew between training and testing; whether there is a mismatch between the expected data and the assumptions made in the training code; the quality, accuracy, precision, and performance of the model; and empirical data that help diagnose model quality issues. In certain embodiments, the metrics and criteria may provide information on hyperparameters that can be tuned to enhance model performance.
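
Distribution skew between training and test data can be reported in many ways; the sketch below uses the largest absolute difference in category frequency (an L-infinity distance) for a single categorical feature. The choice of measure, the 0.1 threshold mentioned in the comment, and the example feature values are assumptions made for illustration.

```python
from collections import Counter


def category_frequencies(values):
    """Map each category to its relative frequency in `values`."""
    counts = Counter(values)
    total = len(values)
    return {key: count / total for key, count in counts.items()}


def linf_distance(train_values, test_values):
    """Largest absolute difference in category frequency between two samples."""
    p = category_frequencies(train_values)
    q = category_frequencies(test_values)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


train_modality = ["CT"] * 70 + ["MRI"] * 30
test_modality = ["CT"] * 40 + ["MRI"] * 60
skew = linf_distance(train_modality, test_modality)
print(round(skew, 2))  # 0.3
print(skew > 0.1)      # True: would be flagged under an assumed 0.1 threshold
```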

At block 315, one or more data hosts are onboarded to the platform using the process described in detail with respect to FIG. 4. In some instances, a potential data host may be notified of possible opportunities to derive additional value from its data assets (e.g., clinical and health data) in a manner that maintains data privacy. For data hosts interested in making their data assets available via the platform, the platform may onboard the one or more data hosts using a process that includes: provisioning data host compute and storage infrastructure within the infrastructure of the data host (e.g., the existing infrastructure of the data host prior to instantiating the storage infrastructure); completing governance and compliance requirements relating to the exposure of the data assets through the platform; and obtaining the data assets that the data host is interested in making available to algorithm developers (the first entity) through the platform. Once onboarded, the obtained data assets become searchable and evaluable by third parties in a manner that maintains data privacy.

At block 320, the data assets to be used with the models are identified, acquired, and curated using the process described in detail with respect to FIG. 5. In some instances, the data assets defined by the algorithm developer in block 310 are identified, acquired, and curated by the platform for transformation, annotation, and computation. All data assets remain within the environment of the data host; they can be organized physically or in cloud-based data structures, or organized logically within existing data storage infrastructure. Storage space may be identified for storing metadata, intermediate computation steps, configurations and data, and model provenance and computation results. In certain instances, the datasets, metadata, and provenance data may be saved to persistent storage for future reference and regulatory review.

At block 325, a determination is made as to whether the curated data assets are annotated in accordance with an annotation protocol (e.g., the annotation protocol defined in block 310). In some instances, the determination may be made by comparing the constraints of the annotation protocol with the annotations currently applied to the curated data assets. When the curated data assets are annotated in accordance with the annotation protocol, no annotation is needed and the process proceeds to block 340. When they are not, annotation is needed and the process proceeds to block 330. At block 330, the curated data assets are prepared for annotation as described in detail with respect to FIG. 6. In some instances, the annotation protocol requires the data assets to be transformed into a specific format or modified in a specific way for annotation. For example, a video made up of multiple images may need to be separated into individual frames or images for viewing and annotation. The transformation process is typically an intermediate step toward the end goal of harmonization, in which the data is brought into the form required by the one or more algorithms of the model(s).

At block 335, once the data assets are prepared for annotation, the data assets are annotated as described in detail with respect to FIG. 7. Each algorithm of a model may require data to be labeled in a specific way. For example, a breast cancer detection/screening system may require specific lesions to be localized and identified. Another example is gastrointestinal cancer digital pathology, in which each image may need to be segmented and labeled by the type of tissue present (normal, necrotic, malignant, etc.). In some instances involving text or clinical data, annotation may include applying a labeling ontology to selected subsets of text and structured data. Annotation is performed on the data host within the secure capsule computing service. A key principle of the transformation and annotation processes is that the platform facilitates a variety of processes for applying and refining data cleaning and transformation algorithms while preserving the privacy of the data assets, all without requiring the data to be moved outside the technical purview of the data host.

At block 340, a determination is made as to whether the annotated data assets are harmonized in accordance with an algorithm protocol (e.g., the training and/or validation constraints defined in block 310). Data harmonization is the process of bringing together datasets with varying file formats, naming conventions, and columns and transforming them into one cohesive dataset. In some instances, the determination may be made by comparing the training constraints and/or validation constraints with the harmonization of the annotated data assets. When the annotated data assets are harmonized in accordance with the algorithm protocol, no further harmonization is needed and the process proceeds to block 360. When they are not, harmonization is needed and the process proceeds to block 345. At block 345, the annotated data assets are harmonized as described in detail with respect to FIG. 8. In some instances, the algorithm protocol requires the data assets to be transformed into a specific format or modified in a specific way for computation. Harmonization of the data may be performed to transform the data assets into that format or to modify them in that way.
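
Harmonizing records that arrive with different naming conventions and units can be sketched as a per-site column mapping followed by unit and code normalization. The site names, column names, canonical schema, and rounding are all hypothetical and serve only to illustrate the idea.

```python
# Hypothetical per-site mappings from local column names to a shared schema.
COLUMN_MAPS = {
    "site_a": {"pt_age": "age_years", "sex": "sex", "wt_kg": "weight_kg"},
    "site_b": {"Age": "age_years", "Gender": "sex", "weight_lb": "weight_kg"},
}

# Unit conversions applied after renaming, keyed by (site, canonical column).
CONVERTERS = {
    ("site_b", "weight_kg"): lambda pounds: round(pounds * 0.45359237, 1),
}

# Normalization of free-form sex codes to a single convention.
SEX_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}


def harmonize(site, row):
    """Rename columns, convert units, and normalize codes for one record."""
    out = {}
    for local_name, value in row.items():
        canonical = COLUMN_MAPS[site][local_name]
        convert = CONVERTERS.get((site, canonical))
        out[canonical] = convert(value) if convert else value
    out["sex"] = SEX_CODES[str(out["sex"]).lower()]
    return out


print(harmonize("site_a", {"pt_age": 54, "sex": "F", "wt_kg": 70.0}))
# {'age_years': 54, 'sex': 'F', 'weight_kg': 70.0}
print(harmonize("site_b", {"Age": 61, "Gender": "male", "weight_lb": 154.0}))
# {'age_years': 61, 'sex': 'M', 'weight_kg': 69.9}
```

After this step both sites produce records with the same keys and units, so a single model can consume them without site-specific handling.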

At blocks 350 and 355, the models are prepared for computing on the data assets. The processes performed at blocks 350 and 355 may be carried out after, before, or in parallel with the preparation of the data assets (e.g., transformation and annotation). At optional block 350, the models may be refactored. This may be done to support special libraries or optimizations that enable the models to operate within the platform. For example, in order to operate on homomorphically encrypted data in an efficient manner, special libraries may need to be included in the model code; these libraries can be incorporated into the model during the refactoring process. It should be understood, however, that the refactoring process is not required for every optimization and/or validation algorithm.

At block 355, the model is integrated into the secure capsule computing service framework. This integration facilitates deployment of the software containing the algorithm to be validated and/or trained to an external environment (in this case, the data host's computing environment), and provides security services that protect both the privacy of the software deployed within the capsule and the security of the host computing environment. The secure capsule computing service framework includes a model serving system provisioned within the platform. The model serving system associates each model with one or more secure deployment capsules and deploys the model within the one or more secure deployment capsules to the data host, where the model may be served or used from within the capsules for optimization and/or validation via one or more secure application program interfaces (APIs). For example, once a model is deployed to the data host as part of the secure capsule computing service, the platform may deploy or output data resources, model parameters, or shared data for training (e.g., in a parent-teacher training paradigm) to the model via the algorithm API. In return, the platform may receive, via the algorithm API, input including computation results, data, computation monitoring results, trained models, model parameters, or other results of deployed computing components and processes such as the secure capsule computing service. The secure deployment capsules allow the model to operate in an isolated manner from within the data host's infrastructure while maintaining the privacy of both the data assets and the model (e.g., the algorithm). Further, a data host running a model within one or more secure deployment capsules may be unable to inspect, copy, or modify the algorithm developer's proprietary software. Likewise, a data host running a model within one or more secure deployment capsules may be protected from the possibility that the hosted model is malicious and could harm the organization's infrastructure or compromise data privacy.

In various aspects, the secure capsule computing framework is provisioned within computing infrastructure (e.g., one or more servers, one or more compute nodes, one or more virtual computing devices, etc.) configured to accept the encrypted code needed to run the algorithm. The encryption may be industry-standard public key/private key encryption (e.g., AES 256). The computing infrastructure may run on the data host's own computing infrastructure or on the data host's cloud infrastructure. The encrypted code is signed by the platform and stored in an archive (e.g., a data storage device). The platform cannot see the contents of the encrypted code, but this establishes a record of the exact algorithm validated by the platform in the event one is requested by a regulatory agency such as the Food and Drug Administration (FDA). The secure capsule computing framework is instantiated on the computing infrastructure, and the encrypted code is placed inside the secure capsule computing framework by the algorithm developer. At this point, the encrypted code is decrypted. In some instances, the encrypted code may be decrypted by the algorithm developer passing the correct private key to the secure capsule computing framework. The secure capsule computing framework may be created by encrypting the contents of memory using a private key stored in the server's CPU hardware, so that, once launched, the contents of the framework cannot be interacted with except through the defined APIs described herein. Even the operating system itself cannot interact with the contents of the secure capsule computing framework, because those contents are encrypted with a key accessible only to the CPU hardware. This means that even if the operating system is compromised, the contents of the secure capsule computing framework cannot be monitored or read.

At block 360, optimization and/or validation of the model is performed using the data assets. In some instances, optimization includes initializing the algorithm with predefined or random values for the weights and biases and attempting to predict the output with those values. In certain instances, the model has been pre-trained by the algorithm developer, so the algorithm may already be initialized with weights and biases. In other instances, the predefined or random values for the weights and biases may be defined by the algorithm developer at block 310 and loaded into the algorithm at block 360. Thereafter, the data assets and hyperparameters may be input into the algorithm, inferences or predictions may be computed, and tests or comparisons may be performed to determine how accurately the trained model predicted the output. Optimization may include running one or more instances of training and/or testing with the data assets in an attempt to optimize the performance of the model (e.g., to optimize the weights and biases), as described in detail with respect to FIG. 9. Validation may include running one or more instances of validation with the data assets in an attempt to validate the model against gold standard labels, as described in detail with respect to FIG. 10. At block 365, one or more reports are generated based on the results of block 360 and delivered to the algorithm developer. In some instances, the reports are generated in accordance with the training/testing report requirements and/or validation report requirements defined at block 310.
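
The initialize, predict, and compare cycle described for block 360 can be sketched with a deliberately tiny model. The example assumes a one-feature linear model trained by plain gradient descent on mean squared error; the specification does not prescribe any particular model or optimizer, and all names below are illustrative.

```python
import random


def init_params(predefined=None, seed=0):
    """Use developer-supplied weights and biases if given, otherwise random values."""
    if predefined is not None:
        return dict(predefined)
    rng = random.Random(seed)
    return {"w": rng.uniform(-1.0, 1.0), "b": rng.uniform(-1.0, 1.0)}


def predict(params, x):
    """Inference step: compute the model output for one input."""
    return params["w"] * x + params["b"]


def mean_squared_error(params, data):
    """Compare predictions with the known outputs."""
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)


def train(params, data, learning_rate=0.05, epochs=1000):
    """Adjust the weight and bias by gradient descent on the mean squared error."""
    w, b = params["w"], params["b"]
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return {"w": w, "b": b}
```

Here `learning_rate` and `epochs` play the role of hyperparameters, and the error before and after training is the kind of comparison a report at block 365 might summarize.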

FIG. 4 shows a process 400 for onboarding one or more data hosts to an artificial intelligence algorithm development platform (e.g., the platforms and systems described with respect to FIGS. 1 and 2). At block 405, data host compute and storage infrastructure (e.g., the secure capsule computing service described with respect to FIGS. 1 and 2) is provisioned within the data host's infrastructure. In some instances, the provisioning includes deployment of encapsulated algorithms in the infrastructure, deployment of physical computing devices with appropriately provisioned hardware and software in the infrastructure, deployment of storage (physical data stores or cloud-based storage), deployment on public or private cloud infrastructure accessible via the infrastructure, and the like.

At block 410, governance and compliance requirements are completed. In some instances, completion of the governance and compliance requirements includes obtaining approval from an institutional review board (IRB). The IRB may be set up as part of the platform or may be a separate entity affiliated with the data host. The IRB may be used to review and approve use of the data assets from the data host for purposes of algorithm development, validation, and training, including federated training. In certain instances, specific projects to be run on the platform are reviewed as amendments to the IRB in order to streamline the review and approval process. Completion of the governance and compliance requirements may further include review and approval of the compliance of the platform itself, and/or of any projects being performed by the platform, under applicable law such as the Health Insurance Portability and Accountability Act (HIPAA). In some instances, the platform, along with some or all activities within and ancillary to the platform for running projects, is required to comply with applicable law (e.g., to be 100% HIPAA compliant); this covers deployment of encapsulated algorithms in existing infrastructure, deployment of physical appliances with appropriately provisioned hardware and software, deployment in public or private cloud infrastructure, and the like. This process step is intended to capture the activities required of the data host to review and approve the platform's compliance with applicable law. In some instances, assertion of a predetermined certification, such as a HyTrust certification, may be sufficient for this process.

Completion of the governance and compliance requirements may further include obtaining security certification. For example, it is common for a data host to apply a security review process to all newly provisioned hardware and software systems. The details of the security review may be defined by the data host and determined on a case-by-case basis for obtaining the security certification, but the platform should be provisioned to comply with security best practices that can be clearly documented and demonstrated. Completion of the governance and compliance requirements may further include review and approval of the compliance of the platform itself, and/or of any projects being performed by the platform, under additional governance/compliance activities. For example, each data host, or the governmental region in which a data host is located, may have additional governance/compliance activities. The platform, along with some or all activities within and ancillary to the platform for running projects, is required to comply with these additional governance/compliance activities; this covers deployment of encapsulated algorithms in existing infrastructure, deployment of physical appliances with appropriately provisioned hardware and software, deployment in public or private cloud infrastructure, and the like. This process step is intended to capture the activities required of the data host to review and approve the platform's governance/compliance activities. In some instances, assertion of a predetermined certification may be sufficient for this process.

At block 415, the data assets that the data host wishes to make available for optimization and/or validation of models are retrieved. In some instances, the data assets may be transferred from their existing storage locations and formats to provisioned storage (physical data stores or cloud-based storage) for use by the platform (i.e., curated into one or more data stores accessible by the platform). For example, a first data set of the data assets may be identified as available in a first database of the data host (a second entity), and a second data set of the data assets may be identified as available in a second database of the data host. In this example, retrieving the data assets may include physically placing the first data set and the second data set within the provisioned storage in a manner that maintains data privacy. The provisioned storage may be newly provisioned at step 505, or may be existing storage for which new access permissions are defined for the platform. In other instances, the data assets are not physically moved to the provisioned storage; instead, a collection of logical addresses may be recorded in the provisioned storage or in existing storage accessible to the platform (a physical data store or cloud-based storage that is part of the data host's existing infrastructure). As the data assets are retrieved and stored (physically or logically), data provenance may be documented for use by the platform in project management and data quality assurance. As should be understood, the data assets are not moved out of the data host's infrastructure (which may be newly provisioned storage or existing storage).

Block 415 may be repeated over time to update the platform with new data as the data host collects new data or uses the platform for new data assets. These updates may be continuous or batched, but all changes to the retrieved data assets, including provenance information, may be recorded to ensure that the data used at each stage of each project can be exactly reproduced and its provenance understood.
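
One way to make each stage reproducible is to record a content fingerprint with every update. The following is a minimal sketch, assuming the records are JSON-serializable and that a SHA-256 digest is an acceptable fingerprint; the `ProvenanceLog` class and its field names are hypothetical and not taken from the specification.

```python
import hashlib
import json


def fingerprint(records):
    """Stable SHA-256 digest of a list of JSON-serializable records."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ProvenanceLog:
    """Append-only log of data asset versions and where they came from."""

    def __init__(self):
        self.entries = []

    def record(self, source, records, note):
        """Store the origin and fingerprint of the data as it stands after an update."""
        entry = {
            "version": len(self.entries) + 1,
            "source": source,
            "sha256": fingerprint(records),
            "record_count": len(records),
            "note": note,
        }
        self.entries.append(entry)
        return entry

    def matches(self, version, records):
        """Check whether the given records are exactly those logged for a version."""
        return self.entries[version - 1]["sha256"] == fingerprint(records)
```

A project could then cite the version number used at each training or validation stage, and a later audit could confirm that a reloaded data set matches it.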

Optionally, at block 420, the data assets may be anonymized. In some instances, the data assets need to be anonymized before use on the platform. Anonymization is a process used to prevent an individual's personal identity from being revealed. For example, data generated during human subject research may be anonymized to preserve the privacy of the research participants. Strategies for anonymization may include deleting or masking personal identifiers, such as personal names, and suppressing or generalizing quasi-identifiers, such as dates of birth. In certain instances, the data assets may be anonymized and the appropriate re-identification information recorded in a protected form (e.g., encrypted).
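
The two strategies named above, removing direct identifiers and generalizing quasi-identifiers, can be sketched as follows. The field names and the choice to generalize an ISO-format date of birth to a birth year are assumptions made for illustration only; real de-identification rules depend on the applicable regulations.

```python
# Hypothetical field names treated as direct personal identifiers.
DIRECT_IDENTIFIERS = {"name", "medical_record_number"}


def deidentify(record):
    """Return a copy of the record without direct identifiers and with a generalized birth date."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Generalize the quasi-identifier: keep only the year of an ISO "YYYY-MM-DD" date.
    date_of_birth = cleaned.pop("date_of_birth", None)
    if date_of_birth is not None:
        cleaned["birth_year"] = int(date_of_birth[:4])
    return cleaned
```

The input record is left unmodified, so the original values remain available to whatever protected store holds the re-identification information.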

Optionally, at block 425, the data assets may be obfuscated. In some instances, the data assets need to be obfuscated before use on the platform. Data obfuscation is a process that includes data encryption or tokenization. Data encryption is a security process in which sensitive data is encoded using an algorithm and an encryption key, and can be accessed or decrypted only by a user with the algorithm and the correct encryption key. In some aspects, the data is encrypted using either a conventional encryption algorithm (e.g., RSA) or homomorphic encryption. The choice of encryption algorithm may be determined by a combination of regulatory requirements and business needs. For example, a data host may provide broad access to homomorphically encrypted data for purposes of algorithm validation, but highly restricted access to RSA-encrypted data for algorithm training. Data tokenization is a security method in which sensitive data is converted to a random string of characters called a token, which has no meaningful value if breached and which can be accessed or decrypted only by a user with access to the token vault that stores the relationship between the sensitive value and the token. In some aspects, the data is tokenized and the token vault is maintained by the data host.
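
Tokenization as described here can be sketched with an in-memory vault. This is an illustration only: the `TokenVault` class is hypothetical, and a production vault would be a hardened, access-controlled service maintained by the data host rather than a Python dictionary.

```python
import secrets


class TokenVault:
    """Maps sensitive values to random tokens and back; only the vault holder can reverse a token."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        """Return the existing token for a value, or mint a new random one."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = secrets.token_hex(16)  # 32 hex characters, unrelated to the value
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        """Recover the sensitive value; requires access to the vault."""
        return self._token_to_value[token]
```

Because the token is drawn at random rather than derived from the value, a leaked token reveals nothing without the vault, which is the property the paragraph above relies on.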

At block 430, the data assets may be indexed. Data indexing allows queries to retrieve data from a database efficiently. An index may be associated with a specific table and may be composed of one or more keys or values to be looked up in the index (e.g., the keys may be based on the columns or rows of a data table). By comparing a query term to the keys in the index, one or more database records with the same value can be found efficiently. In some instances, basic information such as metadata and statistical properties of data fields is computed as one or more keys and stored in the index. The details of what basic information is collected, and how it is exposed for searching within the platform, depend on the data type and the anticipated use cases. In general, this basic information is intended to assist queries by identifying what data may be available on the platform for a project and the attributes of that data. The basic information may also be used to inform the platform or the end user of the data transformation and harmonization methods that may be needed to process the data assets. Further, the basic information may be used for anomaly detection and general data quality assurance purposes.
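
A minimal sketch of the two ideas in this paragraph, key-based lookup and per-field statistical summaries, is shown below. It assumes the data asset is a list of dictionaries; the function names and the choice of summary statistics are illustrative, not taken from the specification.

```python
from collections import defaultdict
from statistics import mean


def build_index(rows, key_field):
    """Map each value of key_field to the positions of the rows that hold it."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[key_field]].append(position)
    return dict(index)


def field_summary(rows, field):
    """Basic statistical attributes of a numeric field, ignoring missing values."""
    values = [row[field] for row in rows if row.get(field) is not None]
    if not values:
        return {"count": 0}
    return {"count": len(values), "min": min(values), "max": max(values), "mean": mean(values)}
```

A lookup such as `build_index(rows, "modality")["CT"]` then returns matching row positions without scanning the table, and the summary can feed search filters or simple anomaly checks.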

FIG. 5 shows a process 500 for identifying, acquiring, and curating data assets to be used with the platform (e.g., the platforms and systems described with respect to FIGS. 1 and 2) to complete a new project (e.g., optimization and/or validation of the model(s) received at block 305 of FIG. 3). At block 505, a determination is made as to whether the data assets to be used with the model are already available on the platform (i.e., already onboarded). The determination may be made by identifying the presence (or absence) of the data assets on the platform. The data assets may be identified by running queries against the data assets retrieved from onboarded data hosts (e.g., partner academic medical centers), as described with respect to FIG. 4. In some instances, the input data requirements (e.g., the input data characteristics, data formats, and requirements for external data usable by the model, obtained at block 310 of FIG. 3) may be used by the platform as search terms, filters, and/or additional information to identify data assets that are available to the platform and usable by the model to achieve its stated objective.

The identification process may be performed automatically by the platform, which runs queries against the data assets using the input data requirements as search terms and/or filters (e.g., queries against provisioned data stores using the data index). Alternatively, the process may be performed interactively: for example, the algorithm developer provides search terms and/or filters to the platform; in response, the platform formulates questions to obtain additional information; the algorithm developer provides the additional information; and the platform uses the search terms, filters, and/or additional information to run queries for the data assets (e.g., running queries on the databases of one or more data hosts, or web crawling to identify data hosts that may have the data assets). In either case, the identification is performed using differential privacy, which shares information within the data assets by describing patterns of groups within the data assets while withholding private information about individuals. When the data assets are available on the platform (e.g., the search identifies data assets that satisfy the query or the constraints of the input data requirements), the process proceeds to block 510 to configure a new project for the data assets. When the data assets are not available on the platform (e.g., the search identifies no data assets that satisfy the query or the constraints of the input data requirements), the process proceeds to block 525 to determine whether the data assets are available from an existing data host.
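
As a hedged illustration of how differential privacy lets a search report group-level patterns without exposing individuals, the sketch below answers a counting query with the Laplace mechanism. The specification does not say which mechanism is used; the Laplace mechanism, the function names, and the `epsilon` parameter are assumptions of this example.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Noisy count of matching records; a count has sensitivity 1, so the noise scale is 1/epsilon."""
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

A smaller `epsilon` adds more noise and gives stronger privacy. Adding or removing any single individual changes the true count by at most one, which the noise masks, while the approximate size of the matching group remains visible to the search.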

At block 510, a new project is configured for the data assets. In some instances, data host compute and storage infrastructure is provisioned or configured within the data host's infrastructure to handle the new project with the identified data assets. In some instances, the provisioning or configuration is performed in a manner similar to the process described at block 405 of FIG. 4. For example, the provisioning or configuration may include deployment of encapsulated algorithms specific to the new project in the infrastructure, deployment of storage (physical data stores or cloud-based storage) specific to the new project, deployment on public or private cloud infrastructure accessible via the infrastructure, and the like.

At block 515, regulatory approvals (e.g., IRB and other data governance processes) are completed and documented. In some instances, the regulatory approvals already exist and merely need to be updated for the new project; in others, they need to be completed in full for the new project. In some instances, the regulatory approvals are completed in a manner similar to the process described at block 410 of FIG. 4. For example, completion of the governance and compliance requirements may include setting up an IRB or amending a current IRB, reviewing and approving the compliance of the new project and/or the platform under applicable law, obtaining security certification, and reviewing and approving the compliance of the new project and/or the platform under additional governance/compliance activities.

At block 520, data storage is provisioned and data formats are configured for the new data assets identified on the platform or being onboarded. In some instances, provisioning the data storage includes identifying and provisioning new logical data storage locations, along with creating the appropriate data storage and query structures. For example, in some aspects, the data may be stored in relational databases, NoSQL data stores, flat files (e.g., JSON), or other structures. Within these access models, the data may be organized as a relational data schema (e.g., a star schema) or as a set of data frames, among many other possibilities. The choice of storage model may be influenced by the data type, the algorithm type, and the underlying algorithm software, or may be dictated by system requirements set by the platform itself. When data collection and aggregation into the platform is ongoing (as in the case of a prospective study), additional documentation may be stored to identify exactly which data set was used at each step of the training or validation process, so that ongoing performance metrics can be properly compared. Additional provisioning factors may need to be considered here, including the allowable total size of the data set (as it grows) and how ongoing quality assessments are to be performed to avoid introducing flawed data into the training set. Further, because this is a new project for an existing data host, the data host compute and storage infrastructure may need to be provisioned or reconfigured, and the regulatory approvals (e.g., IRB and other data governance processes) may need to be completed and documented to account for the intended use of the new data assets, as described with respect to blocks 510 and 515.

At block 525, a determination is made as to whether the data assets to be used with the model are available from a known or existing data host (not previously onboarded). The determination may be made by the platform sending a data asset request carrying the input data requirements (e.g., the input data characteristics, data formats, and requirements for external data usable by the model, obtained at block 310 of FIG. 3) to known or existing data hosts. The data hosts are thereby notified of the opportunity and may identify data assets based on the input data requirements. When the data assets are available from a known or existing data host (e.g., one or more data hosts respond to the request), the process proceeds to block 510 to configure a new project and onboard the data assets (the known or existing data host may have had other data sets onboarded previously, but in this instance new data assets are being onboarded). When the data assets are not available from a known or existing data host (e.g., one or more data hosts do not respond to the request), the process proceeds to block 530 to look for new hosts (e.g., potential data hosts may be notified of the new project as an opportunity to derive additional value from their data assets in a manner that maintains data privacy, as described with respect to block 315 of FIG. 3).

At block 530, one or more new data hosts are onboarded to the platform. In instances where one or more new data hosts respond to the notification of the new project, the new data hosts and their data assets may be onboarded to the platform as described with respect to FIG. 4. If the data assets are available from the one or more new data hosts (e.g., one or more data hosts are onboarded), the process proceeds to block 510 to configure a new project and onboard the new data assets.

FIG. 6 shows a process 600 for transforming data assets to be used with the platform (e.g., the platforms and systems described with respect to FIGS. 1 and 2) to complete a new project (e.g., optimization and/or validation of the model(s) received in block 305 of FIG. 3). The goal of process 600 is to prepare the data assets for annotation, which may require the data to be presented in a specific format or modified for annotation. At block 605, exemplar data may be prepared. In some instances, preparing the exemplar data includes identifying or creating a dataset that captures key attributes for the harmonization process. In certain instances, the preparing includes anonymizing the exemplar data. For example, the data host may identify or create a small, representative exemplar dataset (a transformer prototype set) to be used as a guide for developing the algorithms of a data transformation model. The data in the transformer prototype set may be anonymized and made available to the algorithm developer for creating a remote harmonization transformer that implements the harmonization process.

At block 610, a harmonization transformer is created for transforming the data assets based on the current format of the data in the transformer prototype set. The harmonization transformer may be created in accordance with the (i) training constraints, (ii) data preparation constraints, and (iii) validation constraints defined in block 310 of FIG. 3. The harmonization transformer is ultimately used to transform the data host's raw data assets into a format usable for annotation. At block 615, the harmonization transformer developed for the transformer prototype set is applied to the data host's raw data assets made available for the project. To maintain the privacy of the data assets, the harmonization transformer is run on the data assets by the platform or the data host within the data host's infrastructure. At block 620, the platform or data host reviews the resulting transformed dataset (prior to annotation) to determine whether the transformation was applied successfully and without violating data privacy requirements. In some instances, the determination includes identifying gaps in the initial transformer prototype set and/or errors in the execution of the transformer. If the transformation process failed, the process returns to block 605 to identify new data members for the transformer prototype set. If the data assets are transformed successfully, the process proceeds to the annotation process, in which the transformed data assets are annotated.
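
The apply-and-review loop of blocks 615 and 620 can be sketched as follows. This is a minimal illustration, not the patented implementation: the record fields (`patient_age`, `weight_lb`, `weight_kg`), the `FORBIDDEN` privacy rule, and all function names are hypothetical, chosen only to show a transformer being applied inside the data host's environment and the result being reviewed before annotation.

```python
REQUIRED = {"patient_age", "weight_kg"}   # hypothetical target format
FORBIDDEN = {"name", "ssn"}               # hypothetical privacy rule


def harmonize_record(raw):
    """Hypothetical transformer: lower-case field names, convert pounds to kg."""
    out = {key.strip().lower(): value for key, value in raw.items()}
    if "weight_lb" in out:
        out["weight_kg"] = round(out.pop("weight_lb") * 0.45359237, 2)
    return out


def review(records):
    """Block 620: indices of records that miss the format or break privacy."""
    failures = []
    for index, record in enumerate(records):
        fields = set(record)
        missing_fields = not (REQUIRED <= fields)
        leaks_identity = bool(FORBIDDEN & fields)
        if missing_fields or leaks_identity:
            failures.append(index)
    return failures


def run_transformation(raw_assets):
    """Apply the transformer (block 615), review it (block 620), pick next step."""
    transformed = [harmonize_record(record) for record in raw_assets]
    failures = review(transformed)
    next_step = "block 605" if failures else "annotation"
    return transformed, failures, next_step
```

A failed review sends the process back to block 605, where the failing records would be candidates for new members of the transformer prototype set.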

FIG. 7 shows a process 700 for annotating data assets to be used with the platform (e.g., the platforms and systems described with respect to FIGS. 1 and 2) to complete a new project (e.g., optimization and/or validation of the model(s) received in block 305 of FIG. 3). At block 705, the algorithm developer defines an annotation protocol for creating the gold standard labels used to train, test, and validate the model. The annotation protocol may be defined as described with respect to block 310 of FIG. 3. At block 710, the algorithm developer creates training materials for the annotators. The training materials may include, but are not limited to, videos of annotation being performed on example data, annotated example data from algorithm developer sources, annotated examples of anonymized data from data host sources, written flowcharts, and automated scripts and tools available from the annotation software. At block 715, the annotation infrastructure may be deployed. Deploying the infrastructure may include setting up an annotation interface and a back-end data storage environment for the annotated data. In some instances, the annotation infrastructure is deployed within the data host's infrastructure to maintain the privacy of the data assets. This process may include obtaining licenses for a sufficient number of annotator logins and tracking annotator performance, annotations, and other metadata regarding project status and data provenance.

At block 720, annotators are onboarded to the platform. For example, annotators may be identified, engaged, and trained. Login credentials may be provided to the annotators, along with whatever is needed to meet computing or access requirements (computers, cloud computing access, display workstations, etc.). Work conditions and contracts, including compensation and HIPAA compliance, may be put in place to ensure that the privacy of the data assets is maintained throughout annotation. At block 725, the annotation project is set up or configured. In some instances, the data host may collaborate with the algorithm developer to define project objectives and an annotator worklist strategy, which can then be used to set up or configure the annotation project. For example, project objectives may be defined that include key performance indicators (e.g., the number of data points to be annotated per day or per session) that are measurable and can be used to evaluate the success of the annotation project. Further, the annotator worklist strategy may be defined to account for one or more operational considerations, including dividing the data among annotators, including test data (supplied by the algorithm developer, with known labels), exposing the same data to multiple annotators (e.g., to monitor inter-operator variability and annotator accuracy), exposing the same data multiple times to the same annotator (e.g., to monitor intra-operator variability), and other operational considerations.
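
One way to picture an annotator worklist strategy that covers the considerations listed for block 725 is the sketch below. It is an assumption-laden illustration: the function name `build_worklists`, the round-robin assignment, and the `overlap` and `repeats` parameters are hypothetical and are not taken from the source, which leaves the strategy to the data host and algorithm developer.

```python
import random


def build_worklists(items, test_items, annotators, overlap=2, repeats=1, seed=0):
    """Build one worklist per annotator.

    - every item goes to `overlap` distinct annotators (inter-operator checks);
    - every annotator also receives the test items with known labels;
    - every annotator sees `repeats` of their own items twice (intra-operator checks).
    Assumes overlap <= len(annotators).
    """
    rng = random.Random(seed)
    worklists = {annotator: [] for annotator in annotators}

    # Round-robin assignment so the load is spread evenly.
    for index, item in enumerate(items):
        for offset in range(overlap):
            target = annotators[(index + offset) % len(annotators)]
            worklists[target].append(item)

    for annotator in annotators:
        own_items = list(worklists[annotator])
        worklists[annotator].extend(test_items)
        worklists[annotator].extend(rng.sample(own_items, min(repeats, len(own_items))))
        rng.shuffle(worklists[annotator])  # hide which entries are checks
    return worklists
```

Shuffling each list keeps annotators from recognizing test items or repeats by their position.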

At block 730, annotation of the data (e.g., the transformed data from process 600) is performed according to the defined annotation protocol and the project structure defined in block 725. At block 735, the performance of the annotation process and compliance with data privacy requirements are monitored. The algorithm developer, the data host, or a combination thereof may monitor annotator performance. In this context, performance may include both the rate of work and the accuracy of work (which may be monitored separately at block 740), and may be broken out by annotator and by data type. For example, the number of annotations per day and the total number of annotations per annotator may be monitored to determine whether annotators are meeting their contractual obligations and progressing toward the project objectives. The accuracy of annotation can be monitored by several means. For example, test data that is supplied by the algorithm developer and has known labels can be presented to annotators to confirm that they are substantially following the protocol. In some aspects, the same data can be presented to multiple annotators to evaluate or estimate annotator accuracy. As used herein, the terms "substantially," "approximately," and "about" are defined as being largely but not necessarily wholly what is specified (and include wholly what is specified), as understood by one of ordinary skill in the art. In any disclosed aspect, the term "substantially," "approximately," or "about" may be substituted with "within [a percentage] of" what is specified, where the percentage includes 0.1, 1, 5, and 10 percent. It is contemplated that automated annotation techniques may be incorporated in block 730 as well. In some instances, existing automated annotation techniques can be used for some annotation tasks. In other instances, as data is annotated, models are developed to automate, accelerate, and/or augment the annotation process. Such tools can then be used in future annotation tasks, which may be associated with different datasets and different algorithm developer projects.
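
The two accuracy checks described above (scoring against test data with known labels, and comparing annotators who saw the same data) can be sketched as follows, together with the "within [a percentage] of" reading of "substantially." All function names are hypothetical, and using the majority label as the reference for shared items is an assumption; the source does not prescribe how agreement is scored.

```python
from collections import Counter


def accuracy_on_test_items(annotations, known_labels):
    """Fraction of an annotator's labels that match the known test labels."""
    scored = [item for item in annotations if item in known_labels]
    if not scored:
        return None  # this annotator has not seen any test items yet
    hits = sum(annotations[item] == known_labels[item] for item in scored)
    return hits / len(scored)


def agreement_with_majority(all_annotations):
    """Per-annotator agreement with the majority label on items seen by 2+ annotators."""
    votes = {}
    for labels in all_annotations.values():
        for item, label in labels.items():
            votes.setdefault(item, []).append(label)
    majority = {
        item: Counter(labels).most_common(1)[0][0]
        for item, labels in votes.items()
        if len(labels) > 1
    }
    rates = {}
    for annotator, labels in all_annotations.items():
        shared = [item for item in labels if item in majority]
        if shared:
            rates[annotator] = sum(labels[i] == majority[i] for i in shared) / len(shared)
    return rates


def within_percent(value, target, percent):
    """True if `value` lies within `percent` percent of `target`."""
    return abs(value - target) <= abs(target) * percent / 100.0
```

For example, `within_percent(accuracy, 1.0, 5)` would treat an annotator with 96 percent accuracy on the test items as substantially following the protocol under a 5 percent reading.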

FIG. 8 shows a process 800 for harmonizing data assets to be used with the platform (e.g., the platforms and systems described with respect to FIGS. 1 and 2) to complete a new project (e.g., optimization and/or validation of the model(s) received in block 305 of FIG. 3). The goal of process 800 is to prepare the data assets for computation by one or more algorithms, which may require the data to be presented in a specific format or modified for processing. At block 805, exemplar data may be prepared. In some instances, preparing the exemplar data includes identifying or creating a dataset that captures key attributes for the harmonization process. In certain instances, the preparing includes anonymizing the exemplar data. For example, the data host may identify or create a small, representative exemplar dataset (a harmonization prototype set) to be used as a model for data transformation. The data in the harmonization prototype set may be anonymized and made available to the algorithm developer for creating a remote harmonization transformer that implements the harmonization process. The exemplar data may be the same exemplar dataset used in the transformation process described with respect to FIG. 6 (the transformer prototype set), a supplemented dataset (e.g., new data members added to the transformer prototype set to create the harmonization prototype set), or an entirely new dataset for harmonization.

At block 810, a harmonization transformer is created for harmonizing the data assets based on the current format of the data in the harmonization prototype set. The harmonization transformer may be created in accordance with the (i) training constraints, (ii) data preparation constraints, and (iii) validation constraints defined in block 310 of FIG. 3. The harmonization transformer is ultimately used to transform the data host's transformed/annotated data assets into a format usable as input to the model. At block 815, the harmonization transformer developed for the harmonization prototype set is applied to the data host's transformed/annotated data assets made available for the project. To maintain the privacy of the data assets, the data transformer is run on the transformed/annotated data assets by the platform or the data host within the data host's infrastructure. At block 820, the platform or data host reviews the resulting harmonized dataset (prior to running the model) to determine whether the transformation was applied successfully and without violating data privacy requirements. In some instances, the determination includes identifying gaps in the initial transformer prototype set and/or errors in the execution of the transformer. If the transformation process failed, the process returns to block 805 to identify new data members for the harmonization prototype set. If the data assets are transformed successfully, the process proceeds to the optimization and/or validation process, in which the harmonized data assets are used with the model. Advantageously, these steps, which facilitate multi-stage harmonization and annotation of data for use in an algorithm, can be carried out without exposing the underlying data to the algorithm developer.

FIG. 9 shows a process 900 for optimizing one or more models using the platform (e.g., the platforms and systems described with respect to FIGS. 1 and 2). At block 905, a federated training workflow is executed to train each instance of the algorithm on a training dataset (split from the data assets). In some instances, algorithm training is performed on the training datasets to generate trained models. The training datasets may be hosted by one or more data hosts, as described with respect to FIGS. 3 and 4. The federated training workflow takes the training data assets as input, maps features of the training data assets to a target inference using parameters, computes a loss or error function, updates the parameters to learned parameters in order to minimize the loss or error function, and outputs one or more trained instances of the model. For example, the federated training workflow may take the training data assets as input, find patterns in the training data assets that map attributes of the training data assets to a target prediction using model parameters, compute a training gradient of a loss or error function, update the model parameters to learned model parameters in response to the training gradient, and output a trained instance of the algorithm that captures the patterns in the training data assets using the learned model parameters.
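
A single local training step of the kind described for block 905 can be sketched as below. The choice of a one-feature linear model with a mean-squared-error loss and plain gradient descent is an assumption made only to keep the example small; the source does not restrict the model, the loss, or the optimizer, and the function name `local_training_step` is hypothetical.

```python
def local_training_step(params, features, targets, lr=0.1):
    """One gradient-descent step for y = w * x + b on a host's private data.

    Returns the updated parameters, the training gradient, and the loss
    that was measured before the update.
    """
    w, b = params
    n = len(features)

    # Map features to the target inference using the current parameters.
    predictions = [w * x + b for x in features]

    # Mean-squared-error loss and its gradient with respect to w and b.
    errors = [p - y for p, y in zip(predictions, targets)]
    loss = sum(e * e for e in errors) / n
    grad_w = sum(2 * e * x for e, x in zip(errors, features)) / n
    grad_b = sum(2 * e for e in errors) / n

    # Update the parameters in the direction that reduces the loss.
    learned = (w - lr * grad_w, b - lr * grad_b)
    return learned, (grad_w, grad_b), loss
```

Only the learned parameters or the gradient would need to leave the host; the `features` and `targets` stay where they are.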

Data from the training set is passed by the data host to the secure capsule computation framework via an API, and the responses of the one or more models (e.g., training gradients) are returned to the platform, where they are aggregated into a training report. Other than this interaction, no communication from inside the secure capsule computation framework to outside of it is permitted, which prevents a rogue algorithm from communicating private data outside the secure capsule computation framework. In some aspects, the federated training workflow comprises full model training, in which algorithm training runs to full convergence and all hyperparameters are optimized. In other aspects, the federated training workflow performs incremental training. In either case, the results of training, such as parameters and/or training gradients, are transmitted to a master algorithm module for integration into a "fully federated model." The results, such as parameters and/or training gradients, may be encrypted prior to transmission to ensure the privacy of the data assets and the model.

At block 910, the results, including the parameters and/or training gradients for each trained instance of the model, are integrated into a fully federated model. The integrating includes aggregating the results, such as parameters and/or training gradients, to obtain aggregated parameters and/or training gradients, and updating the learned model parameters of the fully federated model with the aggregated parameters and/or training gradients. In some instances, the aggregating is performed using horizontal federated learning, vertical federated learning, federated transfer learning, or any combination thereof.
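
As an illustration of the aggregation step, the sketch below averages per-instance parameters weighted by the number of training samples each instance saw. This sample-weighted average (in the style of federated averaging for a horizontal setting) is an assumed aggregation rule; the source names the families of federated learning that may be used but does not fix a formula, and `federated_average` is a hypothetical name.

```python
def federated_average(updates):
    """Aggregate per-instance parameter vectors into one vector.

    `updates` is a list of (parameters, n_samples) pairs, one per trained
    instance. Each parameter is averaged across instances, weighted by the
    number of samples the instance was trained on.
    """
    total_samples = sum(n for _, n in updates)
    dimension = len(updates[0][0])
    return [
        sum(params[i] * n for params, n in updates) / total_samples
        for i in range(dimension)
    ]
```

For instance, `federated_average([([1.0, 2.0], 100), ([3.0, 6.0], 300)])` gives `[2.5, 5.0]`, pulling the result toward the instance trained on more data.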

At block 915, a testing workflow is executed on the fully federated model. The testing workflow takes test data as input, finds patterns in the test data using the updated learned model parameters, and outputs inferences. In addition, the performance of the fully federated model in providing the inferences is computed, and a determination is made as to whether the fully federated model has met the algorithm termination criteria (i.e., criteria that define whether the model has achieved sufficient training). In some instances, the hyperparameters and model parameters of the model are tested for convergence, error conditions, failure to converge, and the like, in accordance with the algorithm termination criteria defined by the algorithm developer in block 310 of FIG. 3. If the termination criteria have not been met, the process returns to block 905, where the updated model (e.g., the fully federated model) is distributed to each data host for additional training on the training datasets. This iterative process repeats until the algorithm termination criteria are met. Progress of the algorithm training process may be reported to the algorithm developer during this step, subject to the reporting constraints. If the algorithm termination criteria are met, the process proceeds to block 920.
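
The loop through blocks 905, 910, and 915 can be sketched end to end as follows. The model (y = w * x), the sample-weighted averaging, and the termination criteria (a target test error or a round limit) are all assumptions chosen for brevity, and every name here is hypothetical; in the source, the termination criteria are whatever the algorithm developer defined in block 310.

```python
def local_update(w, data, lr=0.02, epochs=5):
    """Train one model instance on a host's private (x, y) pairs."""
    for _ in range(epochs):
        gradient = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * gradient
    return w


def mean_squared_error(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def optimize(host_datasets, test_data, max_rounds=50, target_mse=1e-4):
    """Repeat train, aggregate, test until a termination criterion is met."""
    w = 0.0  # initial fully federated model
    for round_number in range(1, max_rounds + 1):
        # Block 905: each host trains its own instance of the current model.
        local = [(local_update(w, data), len(data)) for data in host_datasets]
        # Block 910: integrate the results into the fully federated model.
        w = sum(wi * n for wi, n in local) / sum(n for _, n in local)
        # Block 915: test the federated model and check the criteria.
        if mean_squared_error(w, test_data) <= target_mse:
            return w, round_number, "criteria met"
    return w, max_rounds, "max rounds reached"
```

Each pass through the loop corresponds to redistributing the updated model to the data hosts for additional training.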

At block 920, once the iterative algorithm training process has met the algorithm termination criteria, the aggregated results, such as the aggregated parameters and/or training gradients of the fully federated model, may be delivered to the algorithm developer together with a report of performance metrics, in accordance with the reporting constraints. At block 925, the algorithm developer may determine that the optimization process is complete. In some instances, when the iterative algorithm training process has met the algorithm termination criteria, or upon request by the algorithm developer, the aggregated results, such as the aggregated parameters and/or training gradients, may be transmitted to each instance of the model. An update training workflow may then be executed on each instance of the model. The update training workflow updates the learned model parameters with the aggregated parameters and/or training gradients and outputs an updated instance of the model that captures the patterns in the training and test data assets using the updated learned model parameters.

In certain instances, the model and datasets used in the training process are maintained for future purposes (e.g., regulatory review). For example, once a training activity is complete, the entire secure capsule computation framework and all of its contents may be securely deleted so that there is no continuing risk of exposure of the transferred data or model. The original encrypted code submitted for training is archived by the platform so that it is always available for inspection by regulators.

FIG. 10 shows a process 1000 for validating one or more models using the platform (e.g., the platforms and systems described with respect to FIGS. 1 and 2). At block 1005, an initial model is obtained. In some instances, the obtained model is an existing trained model obtained from the algorithm developer. The initial model may be obtained as described with respect to FIG. 3. At block 1010, a validation workflow is executed on validation datasets (split from the data assets). In some instances, validation is performed on the validation datasets to determine model performance or accuracy. The validation datasets may be hosted by one or more data hosts, as described with respect to FIGS. 3 and 4. The validation workflow takes the validation datasets as input, finds patterns in the validation datasets using the learned model parameters, and outputs inferences. Data from the validation set is passed by the data host to the secure capsule computation framework via an API, and the responses of the one or more models (e.g., inferences) are returned to the platform, where they are aggregated into a validation report. Other than this interaction, no communication from inside the secure capsule computation framework to outside of it is permitted, which prevents a rogue algorithm from communicating private data outside the secure capsule computation framework.

At block 1015, the performance or accuracy of the model is computed based on gold standard labels (i.e., ground truth), and a determination is made as to whether the model has been validated. For example, an algorithm designed to detect breast cancer lesions in mammograms can be validated on a set of mammograms that have been labeled by medical experts as either containing or not containing a lesion. The performance of the algorithm on this expertly labeled set of mammograms is the validation report. In some instances, feature selection, classification, and parameterization of the model are visualized (e.g., using area-under-the-curve analysis) and ranked in accordance with the validation criteria (i.e., the criteria by which validation of the model is determined) defined by the algorithm developer in block 310 of FIG. 3. In some instances, determining whether the model has been validated includes determining whether the model has met the validation termination criteria (i.e., criteria that define whether the model has achieved sufficient validation) defined by the algorithm developer in block 310 of FIG. 3.
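
For the mammogram example, an area-under-the-curve figure can be computed from the model's lesion scores and the expert labels as sketched below. The pairwise (Mann-Whitney) formulation is one standard way to compute the area under the ROC curve; the `min_auc` threshold of 0.9 and the function names are hypothetical stand-ins for the validation criteria that the algorithm developer would actually define.

```python
def auc(scores, gold_labels):
    """Area under the ROC curve from model scores and 0/1 gold standard labels.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case (ties count as half).
    """
    positives = [s for s, y in zip(scores, gold_labels) if y == 1]
    negatives = [s for s, y in zip(scores, gold_labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def is_validated(scores, gold_labels, min_auc=0.9):
    """Hypothetical validation criterion: AUC must reach `min_auc`."""
    return auc(scores, gold_labels) >= min_auc
```

The quadratic pairwise loop is fine for a sketch; a rank-based computation would be preferable for large validation sets.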

If the performance of the model has not been validated, the process returns to block 1020, where the model may be fine-tuned (e.g., by optimizing hyperparameters). The hyperparameter optimization may be performed using any known optimization technique, such as a grid search or random search technique. This iterative process repeats until the model has been validated or until the validation termination criteria are met. Progress of the validation process may be reported to the algorithm developer during this step, subject to the reporting constraints. If the validation termination criteria are met, the process proceeds to block 1025. At block 1025, once the validation termination criteria have been met, the optimized hyperparameters of the validated model may be delivered to the algorithm developer together with a report of performance metrics, in accordance with the reporting constraints. In some instances, the report of performance metrics may be provided as a single validation report of the validation of the algorithm or model on a single set of data assets. In other instances, the report of performance metrics may be provided as a single validation report of the validation of the algorithm or model aggregated from validations on any number of independent sets of data assets. Thereafter, the algorithm developer may determine that the validation process is complete. In certain instances, the model and datasets used in the validation process are maintained for future purposes (e.g., regulatory review). For example, once a validation activity is complete, the entire secure capsule computation framework and all of its contents may be securely deleted so that there is no continuing risk of exposure of the transferred data or model. The original encrypted code submitted for validation is archived by the platform so that it is always available for inspection by regulators.
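
The grid search mentioned for block 1020 can be sketched in a few lines. The `evaluate` callback is a hypothetical placeholder for running the validation workflow with a given hyperparameter setting and returning a score where higher is better; the hyperparameter names in the usage note are invented for illustration.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Score every combination of hyperparameter values and keep the best.

    `grid` maps each hyperparameter name to the list of values to try.
    `evaluate` takes a dict of hyperparameters and returns a score.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

A call such as `grid_search(score_fn, {"threshold": [0.3, 0.5, 0.7], "window": [1, 3, 5]})` would evaluate all nine combinations; a random search would instead sample a fixed number of combinations from the same grid.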

FIG. 11 is a simplified flowchart 1100 illustrating an example of processing for optimizing and/or validating a model using a model development platform and system (e.g., the model development platforms and systems described with respect to FIGS. 1-10). Process 1100 begins at block 1105, where a model and input data requirements associated with the model are received from an algorithm developer (e.g., a first entity). The input data requirements may include optimization and/or validation selection criteria for data assets to be run through the model. The optimization and/or validation selection criteria define the characteristics, formats, and requirements for the data assets to be run through the model. At block 1110, data assets are identified as being available from a data host based on the input data requirements (e.g., the optimization and/or validation selection criteria) for the data assets. The data assets may be identified by running one or more queries against the data storage structures of one or more hosts based on the optimization and/or validation selection criteria.
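
Identifying data assets by query, as in block 1110, might look like the sketch below. The catalog schema (`assets` table with `asset_id`, `modality`, `file_format`, `n_records`), the in-memory SQLite store, and the three selection criteria are all hypothetical; the source does not describe how a host's data storage structure is organized or which criteria are used.

```python
import sqlite3


def demo_catalog():
    """Build a small in-memory catalog standing in for a host's data store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE assets "
        "(asset_id TEXT, modality TEXT, file_format TEXT, n_records INTEGER)"
    )
    conn.executemany(
        "INSERT INTO assets VALUES (?, ?, ?, ?)",
        [
            ("A1", "mammogram", "DICOM", 1200),
            ("A2", "mammogram", "PNG", 5000),
            ("A3", "ct", "DICOM", 800),
            ("A4", "mammogram", "DICOM", 150),
        ],
    )
    return conn


def find_matching_assets(conn, modality, file_format, min_records):
    """Return the ids of assets that satisfy the selection criteria."""
    rows = conn.execute(
        "SELECT asset_id FROM assets "
        "WHERE modality = ? AND file_format = ? AND n_records >= ? "
        "ORDER BY asset_id",
        (modality, file_format, min_records),
    ).fetchall()
    return [row[0] for row in rows]
```

Because the query runs inside the host's environment and returns only asset identifiers, the underlying records are not exposed by the lookup itself.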

At block 1115, the data host is onboarded (if not previously onboarded). Onboarding includes confirming that use of the data asset with the model complies with data privacy requirements. At block 1120, the data asset is curated within a data storage structure that is within the infrastructure of the data host. Curating may include selecting the data storage structure from among a plurality of data storage structures and provisioning the data storage structure within the infrastructure of the data host. The selection of the data storage structure may be based on the type of algorithm within the model, the type of data within the data asset, the system requirements of the computing device, or a combination thereof. At block 1125, the data asset is prepared within the data storage structure for processing by the model. Preparing the data asset may include applying one or more transformations to the data asset, annotating the data asset, harmonizing the data asset, or a combination thereof.
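The preparation at block 1125 combines three kinds of operation: transformation, annotation, and harmonization. A minimal sketch of such a pipeline follows. The function names, the unit-conversion transform, the field-renaming form of harmonization, and the order in which the three operations are applied are all assumptions for illustration; the disclosure leaves these choices open.

```python
def prepare_data_asset(records, transforms, field_map, annotate):
    """Apply transformations, harmonize field names, then annotate each record."""
    prepared = []
    for record in records:
        row = dict(record)  # copy so the host's stored records are not mutated
        for transform in transforms:
            row = transform(row)
        # Harmonization is modeled here as mapping host-specific field names
        # onto a shared vocabulary; unmapped fields keep their names.
        row = {field_map.get(key, key): value for key, value in row.items()}
        row["label"] = annotate(row)
        prepared.append(row)
    return prepared


def pounds_to_kilograms(row):
    """Example transformation: convert a weight in pounds to kilograms."""
    if "weight_lb" in row:
        row = dict(row)
        row["weight_kg"] = round(row.pop("weight_lb") * 0.45359237, 2)
    return row


if __name__ == "__main__":
    raw = [{"pid": 1, "weight_lb": 150}]
    out = prepare_data_asset(
        raw,
        transforms=[pounds_to_kilograms],
        field_map={"pid": "patient_id"},
        annotate=lambda r: "over_60kg" if r["weight_kg"] > 60 else "under_60kg",
    )
    print(out)
```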

At block 1130, the model is integrated into a secure capsule computation framework. The secure capsule computation framework may provide the model to the data asset within the data storage structure via an application program interface in a secure manner that preserves the privacy of the data asset. At block 1135, the data asset is run through the model. In some aspects, running the data asset through the model includes executing a training workflow that includes: creating multiple instances of the model; splitting the data asset into a training dataset and one or more test datasets; training the multiple instances of the model on the training dataset; integrating results from the training of each of the multiple instances of the model into a fully federated model; running the one or more test datasets through the fully federated model; and computing performance of the fully federated model based on the running of the one or more test datasets. In other aspects, running the data asset through the model includes executing a validation workflow that includes: splitting the data asset into one or more validation datasets; running the one or more validation datasets through the model; and computing performance of the model based on the running of the one or more validation datasets. At block 1140, a report regarding the running of the model at block 1135 may be provided to the algorithm developer.
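The training workflow described for block 1135 can be sketched end to end with a toy model. The example below is a simplified illustration under stated assumptions, not the platform's implementation: the model is a one-variable linear fit, each model instance is trained on its own shard of the training dataset, and the instances are integrated by size-weighted parameter averaging (in the style of federated averaging). None of these choices is specified by the disclosure, and all function names are hypothetical.

```python
import random


def train_instance(shard, epochs=500, lr=0.3):
    """Fit y ~ w*x + b on one shard with plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(shard)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in shard) / n
        grad_b = sum((w * x + b - y) for x, y in shard) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def federate(params, sizes):
    """Integrate instance results by size-weighted parameter averaging."""
    total = sum(sizes)
    w = sum(p[0] * s for p, s in zip(params, sizes)) / total
    b = sum(p[1] * s for p, s in zip(params, sizes)) / total
    return w, b


def mean_squared_error(model, data):
    """Performance metric computed on a held-out test dataset."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def training_workflow(data_asset, n_instances=3, test_fraction=0.25, seed=0):
    """Split, train several instances, integrate them, and score the result."""
    rows = list(data_asset)
    if len(rows) < 2:
        raise ValueError("need at least two rows to split into train and test")
    random.Random(seed).shuffle(rows)
    n_test = min(len(rows) - 1, max(1, int(len(rows) * test_fraction)))
    test, train = rows[:n_test], rows[n_test:]
    # Assumption: each instance sees a disjoint shard of the training dataset.
    shards = [train[i::n_instances] for i in range(n_instances)]
    shards = [shard for shard in shards if shard]
    params = [train_instance(shard) for shard in shards]
    federated_model = federate(params, [len(shard) for shard in shards])
    return federated_model, mean_squared_error(federated_model, test)


if __name__ == "__main__":
    # Synthetic, noise-free data following y = 2x + 1 on [-1, 1].
    asset = [(i / 20 - 1, 2 * (i / 20 - 1) + 1) for i in range(41)]
    model, test_error = training_workflow(asset)
    print(model, test_error)
```

A validation workflow would follow the same shape with the training and integration steps removed: split the data asset into validation datasets, run each through the model, and compute the metric.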

FIG. 12 illustrates an example computing device 1200 suitable for use in systems and methods for developing artificial intelligence algorithms by distributing analysis across multiple sources of privacy-protected, harmonized clinical and health data according to the present disclosure. The example computing device 1200 includes a processor 1205 that communicates with a memory 1210 and other components of the computing device 1200 using one or more communication buses 1215. The processor 1205 is configured to execute processor-executable instructions stored in the memory 1210 to perform one or more methods for developing artificial intelligence algorithms according to different examples, such as some or all of the example method 1100 described above with respect to FIG. 11. In this example, the memory 1210 stores processor-executable instructions that provide inference 1220 using a model and a data asset, and model performance analysis 1225, as described above with respect to FIGS. 1-11. The computing device 1200, in this example, also includes one or more user input devices 1230, such as a keyboard, mouse, touchscreen, or microphone, for accepting user input. The computing device 1200 also includes a display 1235 for providing visual output, such as a user interface, to a user.

The computing device 1200 also includes a communications interface 1240. In some examples, the communications interface 1240 may enable communication using one or more networks, including a local area network ("LAN"), a wide area network ("WAN") such as the Internet, a metropolitan area network ("MAN"), or a point-to-point or peer-to-peer connection. Communication with other devices may be accomplished using any suitable networking protocol. For example, one suitable networking protocol may include the Internet Protocol ("IP"), Transmission Control Protocol ("TCP"), User Datagram Protocol ("UDP"), or combinations thereof, such as TCP/IP or UDP/IP.

While some examples of the methods and systems herein are described in terms of software executing on various machines, the methods and systems may also be implemented as specially configured hardware, such as a field-programmable gate array (FPGA) configured specifically to execute the various methods according to the present disclosure. For example, the examples can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or a combination thereof. In one example, a device may include one or more processors. The processor includes a computer-readable medium, such as a random access memory (RAM), coupled to the processor. The processor executes computer-executable program instructions stored in memory, such as executing one or more computer programs. Such processors may include microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), and state machines. Such processors may further include programmable electronic devices such as PLCs, programmable interrupt controllers (PICs), programmable logic devices (PLDs), programmable read-only memories (PROMs), electronically programmable read-only memories (EPROMs or EEPROMs), or other similar devices.

Such processors may include, or may be in communication with, media, for example one or more non-transitory computer-readable media, that may store processor-executable instructions that, when executed by the processor, can cause the processor to perform methods according to the present disclosure as carried out, or assisted, by a processor. Examples of non-transitory computer-readable media may include, but are not limited to, an electronic, optical, magnetic, or other storage device capable of providing processor-executable instructions to a processor, such as the processor in a web server. Other examples of non-transitory computer-readable media include, but are not limited to, a floppy disk, CD-ROM, magnetic disk, memory chip, ROM, RAM, ASIC, configured processor, all optical media, all magnetic tape or other magnetic media, or any other medium from which a computer processor can read. The processors and the processing described may be in one or more structures, and may be distributed across one or more structures. The processor may include code for carrying out methods (or parts of methods) according to the present disclosure.

IV. Further Considerations
Specific details are given in the above description to provide a thorough understanding of the aspects. However, it is understood that the aspects can be practiced without these specific details. For example, circuits can be shown in block diagrams in order not to obscure the aspects in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques can be shown without unnecessary detail in order to avoid obscuring the aspects.

The techniques, blocks, steps, and means described above can be implemented in various ways. For example, these techniques, blocks, steps, and means can be implemented in hardware, software, or a combination thereof. For a hardware implementation, the processing units can be implemented within one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, other electronic units designed to perform the functions described above, and/or a combination thereof.

It is also noted that the aspects can be described as a process that is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart can describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations can be rearranged. A process is terminated when its operations are completed, but could have additional steps not included in the figure. A process can correspond to a method, a function, a procedure, a subroutine, a subprogram, and so on. When a process corresponds to a function, its termination corresponds to a return of the function to the calling function or the main function.

Furthermore, the aspects can be implemented by hardware, software, scripting languages, firmware, middleware, microcode, hardware description languages, and/or any combination thereof. When implemented in software, firmware, middleware, scripting language, and/or microcode, the program code or code segments to perform the necessary tasks can be stored in a machine-readable medium such as a storage medium. A code segment or machine-executable instruction can represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a script, a class, or any combination of instructions, data structures, and/or program statements. A code segment can be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, and/or memory contents. Information, arguments, parameters, data, and the like can be passed, forwarded, or transmitted via any suitable means, including memory sharing, message passing, ticket passing, network transmission, and so on.

For a firmware and/or software implementation, the methodologies can be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. Any machine-readable medium tangibly embodying instructions can be used in implementing the methodologies described herein. For example, software code can be stored in a memory. Memory can be implemented within the processor or external to the processor. As used herein, the term "memory" refers to any type of long-term, short-term, volatile, non-volatile, or other storage medium and is not limited to any particular type of memory or number of memories, or type of media upon which memory is stored.

Moreover, as disclosed herein, the terms "storage medium," "storage," or "memory" can represent one or more memories for storing data, including read-only memory (ROM), random access memory (RAM), magnetic RAM, core memory, magnetic disk storage media, optical storage media, flash memory devices, and/or other machine-readable media for storing information. The term "machine-readable medium" includes, but is not limited to, portable or fixed storage devices, optical storage devices, wireless channels, and/or various other storage media capable of storing, containing, or carrying instruction(s) and/or data.

While the principles of the present disclosure have been described above in connection with specific apparatuses and methods, it is to be clearly understood that this description is made only by way of example and not as a limitation on the scope of the disclosure.

Claims

A method comprising:
receiving, at a data processing system, an algorithm and input data requirements associated with the algorithm, wherein the input data requirements include optimization and/or validation selection criteria for a data asset to be run through the algorithm, the optimization and/or validation selection criteria define characteristics, formats, and requirements for the data asset to be run through the algorithm, the characteristics and requirements of the data asset refer to the characteristics and requirements of data that can be used to optimize and/or validate the algorithm, and the algorithm and the input data requirements are received from an algorithm developer;
identifying, by the data processing system, the data asset as being available from a data host based on the optimization and/or validation selection criteria for the data asset, wherein the data host is an entity different from the algorithm developer;
curating, by the data processing system, the data asset within a data storage structure that is within an infrastructure of the data host;
preparing, by the data processing system, the data asset within the data storage structure for processing by the algorithm;
provisioning, by the data processing system, a secure capsule computation framework within a computing infrastructure of the infrastructure of the data host, wherein the secure capsule computation framework provides the algorithm to the data asset within the data storage structure in a secure manner that preserves the privacy of the data asset and the algorithm;
integrating, by the data processing system, the algorithm into the secure capsule computation framework, wherein the integrating includes the algorithm developer placing encrypted code for running the algorithm within the secure capsule computation framework, and the algorithm developer decrypting the encrypted code to obtain decrypted code for running the algorithm; and
running, by the data processing system, the data asset through the algorithm, wherein the running includes: passing the data asset from the data storage structure to the algorithm within the secure capsule computation framework via one or more secure application program interfaces; optimizing the algorithm, validating the algorithm, or computing an inference with the algorithm, using the data asset and the decrypted code; and passing results of the optimizing, the validating, or the inference computation to the algorithm developer or the data host via the one or more secure application program interfaces.

The method of claim 1, wherein the characteristics and the requirements of the data asset are defined based on: (i) the environment of the algorithm; (ii) the distribution of instances within the data asset; (iii) the parameters and type of device generating the data asset; (iv) variance versus bias; (v) tasks implemented by the algorithm; or (vi) any combination thereof.

The method of claim 1, wherein:
the identifying is performed using differential privacy to share information within the data asset by describing patterns of groups within the data asset while withholding private information about individuals within the data asset;
the curating includes selecting the data storage structure from among a plurality of data storage structures, and provisioning the data storage structure within the infrastructure of the data host; and
the selection of the data storage structure is based on the type of the algorithm, the type of data within the data asset, system requirements of the data processing system, or a combination thereof.

The method of claim 1, further comprising onboarding, by the data processing system, the data host,
wherein the onboarding includes confirming that use of the data asset with the algorithm complies with data privacy requirements.

The method of claim 1, wherein the preparing of the data asset includes applying one or more transformations to the data asset, annotating the data asset, harmonizing the data asset, or a combination thereof.

The method of claim 1, wherein the running of the data asset through the algorithm further includes executing a training workflow that includes:
creating a plurality of instances of the algorithm; splitting the data asset into a training dataset and one or more test datasets; training the plurality of instances of the algorithm on the training dataset; integrating results from the training of each of the plurality of instances of the algorithm into a fully federated algorithm; running the one or more test datasets through the fully federated algorithm; and computing performance of the fully federated algorithm based on the running of the one or more test datasets.

The method of claim 1, wherein the running of the data asset through the algorithm further includes executing a validation workflow that includes:
splitting the data asset into one or more validation datasets; running the one or more validation datasets through the algorithm; and computing performance of the algorithm based on the running of the one or more validation datasets.

A system comprising:
one or more data processors; and
a computer-readable storage medium containing instructions that, when executed on the one or more data processors, cause the one or more data processors to perform operations comprising:
receiving an algorithm and input data requirements associated with the algorithm, wherein the input data requirements include optimization and/or validation selection criteria for a data asset to be run through the algorithm, the optimization and/or validation selection criteria define characteristics, formats, and requirements for the data asset to be run through the algorithm, the characteristics and requirements of the data asset refer to the characteristics and requirements of data that can be used to optimize and/or validate the algorithm, and the algorithm and the input data requirements are received from an algorithm developer;
identifying the data asset as being available from a data host based on the optimization and/or validation selection criteria for the data asset, wherein the data host is an entity different from the algorithm developer;
curating the data asset within a data storage structure that is within an infrastructure of the data host;
preparing the data asset within the data storage structure for processing by the algorithm;
provisioning a secure capsule computation framework within a computing infrastructure of the infrastructure of the data host, wherein the secure capsule computation framework provides the algorithm to the data asset within the data storage structure in a secure manner that preserves the privacy of the data asset and the algorithm;
integrating the algorithm into the secure capsule computation framework, wherein the integrating includes the algorithm developer placing encrypted code for running the algorithm within the secure capsule computation framework, and the algorithm developer decrypting the encrypted code to obtain decrypted code for running the algorithm; and
running the data asset through the algorithm within the secure capsule computation framework, wherein the running includes: passing the data asset from the data storage structure to the algorithm within the secure capsule computation framework via one or more secure application program interfaces; optimizing the algorithm, validating the algorithm, or computing an inference with the algorithm, using the data asset and the decrypted code; and passing results of the optimizing, the validating, or the inference computation to the algorithm developer or the data host via the one or more secure application program interfaces.

The system of claim 8, wherein the characteristics and the requirements of the data asset are defined based on: (i) the environment of the algorithm; (ii) the distribution of instances within the data asset; (iii) the parameters and type of device generating the data asset; (iv) variance versus bias; (v) tasks implemented by the algorithm; or (vi) any combination thereof.

The system of claim 8, wherein:
the identifying is performed using differential privacy to share information within the data asset by describing patterns of groups within the data asset while withholding private information about individuals within the data asset;
the curating includes selecting the data storage structure from among a plurality of data storage structures, and provisioning the data storage structure within the infrastructure of the data host; and
the selection of the data storage structure is based on the type of the algorithm, the type of data within the data asset, requirements of the system, or a combination thereof.

The system of claim 8, wherein the operations further comprise onboarding the data host,
wherein the onboarding includes confirming that use of the data asset with the algorithm complies with data privacy requirements.

The system of claim 8, wherein the preparing of the data asset includes applying one or more transformations to the data asset, annotating the data asset, harmonizing the data asset, or a combination thereof.

The system of claim 8, wherein the running of the data asset through the algorithm further includes executing a training workflow that includes:
creating a plurality of instances of the algorithm; splitting the data asset into a training dataset and one or more test datasets; training the plurality of instances of the algorithm on the training dataset; integrating results from the training of each of the plurality of instances of the algorithm into a fully federated algorithm; running the one or more test datasets through the fully federated algorithm; and computing performance of the fully federated algorithm based on the running of the one or more test datasets.

The system of claim 8, wherein the running of the data asset through the algorithm further includes executing a validation workflow that includes:
splitting the data asset into one or more validation datasets; running the one or more validation datasets through the algorithm; and computing performance of the algorithm based on the running of the one or more validation datasets.

One or more non-transitory machine-readable storage media storing instructions that, when executed by one or more processors, cause a system to perform operations comprising:
receiving an algorithm and input data requirements associated with the algorithm, wherein the input data requirements include optimization and/or validation selection criteria for a data asset to be run through the algorithm, the optimization and/or validation selection criteria define characteristics, formats, and requirements for the data asset to be run through the algorithm, the characteristics and requirements of the data asset refer to the characteristics and requirements of data that can be used to optimize and/or validate the algorithm, and the algorithm and the input data requirements are received from an algorithm developer;
identifying the data asset as being available from a data host based on the optimization and/or validation selection criteria for the data asset, wherein the data host is an entity different from the algorithm developer;
curating the data asset within a data storage structure that is within an infrastructure of the data host;
preparing the data asset within the data storage structure for processing by the algorithm;
provisioning a secure capsule computation framework within a computing infrastructure of the infrastructure of the data host, wherein the secure capsule computation framework provides the algorithm to the data asset within the data storage structure in a secure manner that preserves the privacy of the data asset and the algorithm;
integrating the algorithm into the secure capsule computation framework, wherein the integrating includes the algorithm developer placing encrypted code for running the algorithm within the secure capsule computation framework, and the algorithm developer decrypting the encrypted code to obtain decrypted code for running the algorithm; and
running the data asset through the algorithm within the secure capsule computation framework, wherein the running includes: passing the data asset from the data storage structure to the algorithm within the secure capsule computation framework via one or more secure application program interfaces; optimizing the algorithm, validating the algorithm, or computing an inference with the algorithm, using the data asset and the decrypted code; and passing results of the optimizing, the validating, or the inference computation to the algorithm developer or the data host via the one or more secure application program interfaces.