CN117648998A - Large language model federal pre-training method based on trusted execution environment - Google Patents
- Publication number
- CN117648998A (application number CN202410117882.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- training
- model
- language model
- tee
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
Abstract
The invention discloses a large language model federated pre-training method based on a trusted execution environment, comprising the following steps: step 1: create a large language model joint pre-training task, determine the joint modeling participants, prepare the data, and create the computing, storage, and network resources for joint modeling; step 2: perform joint pre-training of the large language model; step 3: optimize the large language model obtained from joint pre-training. The method targets the practical scenario of multi-party federated modeling for large language model pre-training. It makes full use of RDMA and CXL technologies to build a cross-domain trusted execution environment cluster in a distributed environment, divides memory into a shared area and a private area, and combines the shared areas into one large memory capable of holding the large language model, its training data, and intermediate training results, thereby addressing the communication bottleneck and resource under-utilization of trusted modeling at the data scale of large models.
Description
Technical Field
The invention belongs to the technical field of large language model federal pre-training, and particularly relates to a large language model federal pre-training method based on a trusted execution environment.
Background
Against the background of the rapid development of artificial intelligence, natural language processing, as a major direction of artificial intelligence, has shown broad application prospects in fields such as machine translation, sentiment analysis, intelligent dialogue, and text generation, and continues to empower the industrial development of healthcare, finance, government affairs, and other sectors.
Pre-trained large models are a key technology of natural language processing: by pre-training on large-scale data, a model can learn rich linguistic knowledge and patterns, providing strong support for a variety of downstream tasks. However, this technology also faces a serious set of challenges, concerning not only technical aspects but also data privacy, computational efficiency, security, and other critical areas.
First, the data needed for pre-training large language models is enormous in scale, spanning multiple data sources and data owners, and may contain sensitive information such as personal privacy and business secrets; how to train large models under multi-party collaboration while guaranteeing data privacy against disclosure and abuse therefore becomes a critical issue.
Second, the pre-training process of a large language model requires vast storage space, for example for the training datasets, the large language model itself, and intermediate training results, placing high demands on computing and storage resources; with limited resources, training slows down or the training requirements cannot be met at all. Meanwhile, GPU acceleration plays an important role in training large NLP language models, yet the trustworthiness of the GPU is open to challenge and may introduce security risks, so how to ensure GPU trustworthiness and security needs to be solved. In addition, the pre-training process of a large language model is long; node failures and abnormal interruptions can interrupt the training process, and because the data resides in memory, the risk of losing intermediate data is significant.
Large language models are based on the Transformer architecture. The idea of pre-training is that model parameters are no longer initialized randomly; instead, a set of parameters is first obtained by pre-training on a task, the model is initialized with those parameters, and training then continues. Large language model pre-training mainly predicts the next word from the preceding words and belongs to unsupervised pre-training; it covers autoregressive (AR) models (i.e., models that learn from left to right), in-context learning techniques, and so on. Since the advent of ChatGPT, once the pre-training of a large language model is complete, further fine-tuning can be performed using supervised learning, reward models, and reinforcement learning.
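The autoregressive objective described above can be illustrated with a deliberately tiny sketch: the training signal is simply "predict the next token from the tokens before it". Real large language models use Transformer networks; this bigram frequency counter is only a minimal stand-in for that objective, and all function names here are illustrative.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count next-token frequencies for each preceding token."""
    model = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model

def predict_next(model, token):
    """Return the most frequent continuation of `token`, or None."""
    if token not in model:
        return None
    return model[token].most_common(1)[0][0]
```

A Transformer replaces the frequency table with a learned neural conditional distribution, but the left-to-right prediction target is the same.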
In this context, how to make full use of frontier technologies such as the Trusted Execution Environment (TEE), Compute Express Link (CXL), and Remote Direct Memory Access (RDMA), combined with means such as key management, encrypted transmission, and data isolation, to perform efficient multi-party joint modeling for the large language model pre-training scenario under compliance constraints, becomes a problem to be solved.
Disclosure of Invention
In order to overcome the shortcomings of the prior art, the invention aims to provide a large language model federated pre-training method based on a trusted execution environment. Applied to the practical scenario of multi-party federated modeling for large language model pre-training, it makes full use of RDMA and CXL technologies, builds a cross-domain Trusted Execution Environment (TEE) cluster in a distributed environment, and divides memory into a shared area and a private area; by combining the shared areas into one large memory that can hold the large language model, its training data, and intermediate training results, it overcomes the communication bottleneck and resource under-utilization of trusted modeling at the data scale of large models.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a large language model federation pre-training method based on a trusted execution environment comprises the following steps:
step 1, creating a large language model joint pre-training task, determining joint modeling participants, preparing data, and creating computing storage network resources for joint modeling;
step 2, performing joint pre-training of a large language model;
and 3, optimizing the large language model obtained by the combined pre-training.
The step 1 specifically comprises the following steps:
step 101, confirming a pre-training task: defining a specific task of pre-training of the large language model, wherein the specific task comprises initial parameter configuration of the large language model and requirement of training data;
step 102, determining a joint modeling participant: determining each joint modeling participant participating in the joint pre-training of the large language model, including an owner of the large language model and a data provider;
step 103, constructing a cross-domain TEE cluster: in a distributed environment, building a cross-domain Trusted Execution Environment (TEE) cluster, starting a TEE management node by each joint modeling participant, planning a shared area, a private area and GPU resources in the TEE, and loading access authority information into an FPGA (field programmable gate array) for controlling access of RDMA (remote direct memory access) and CXL (virtual local area network);
step 104, setting a global clock: introducing a global clock as a unified time scale to distribute N random number seeds for all joint modeling participants, so as to ensure the consistency of time and the safety of data;
step 105, encryption and MAC authentication: the large language model federal pre-training task initiator selects a random seed to acquire a global clock generation key, encrypts an initial model, generates an MAC message authentication code at the same time, and puts the encrypted initial model and related identification information into a shared memory area to prepare for model distribution;
the shared memory area is divided into a private area and a shared public area, private data of participants participating in federal modeling joint modeling are stored in the private area and are sensitive data, and the data cannot go out of the domain; the shared public area is used for efficient and rapid sharing of non-sensitive data, metadata in a modeling process, global model parameters and public data sets;
step 106, training data loading and processing: each joint modeling participant loads data into a private area and a shared area of the TEE according to data sensitivity, performs Token processing on the data to form vector representation, acquires a global clock, and selects a local random seed to encrypt the data of the shared area so as to enhance the confidentiality of the data;
step 107, data aggregation and confusion: the task initiator aggregates the data of the TEE shared area, generates a secret key according to the random number seed number identified by the metadata and the global clock, decrypts the data by using the secret key, obfuscates the data source, and encrypts the data of the shared area again by acquiring the global clock, selecting a local random seed to generate a public data set and storing the public data set in the shared memory area;
the step 107 specifically includes:
data aggregation: the task initiator obtains data (the data are preprocessed, token-based and encrypted vector representations) from the TEE shared areas of the joint modeling participants, the task initiator generates a key by using a global clock and a random number seed number identified by metadata, and decrypts the data obtained from the TEEs by using the generated key, and the step restores the original encrypted data to the original state;
data confusion: based on decryption, the task initiator can mix the data (the purpose of mixing is to blur the source of the data and increase the privacy of the data), the mixing process can adopt different technologies, can mix the data in a sampling mode, can also adopt a duplication removing mode to mix the data together, and can adopt technologies of adding some noise, disturbing the data or introducing other randomness.
And (5) re-encrypting: the obfuscated data is re-encrypted, again using the global clock and a random seed local to the task initiator to generate a new key, the encryption ensuring security of the obfuscated data during storage, while still being able to be decrypted in the next step and used for further training of the model.
A common dataset is generated: the processed, obfuscated and re-encrypted data is combined into a common data set. ( A common dataset is a collection that contains information from different joint modeling participants, but the specific source of individual data therein has become ambiguous due to the presence of confusion. The common data set is stored in a shared memory area so that other joint modeling participants can access and use it. )
The goal of this step is to ensure that in federal pre-training, the data can be co-modeled without exposing individual privacy. The encryption and obfuscation process provides additional security in the transmission and storage of data, while the model can obtain more comprehensive information from many aspects by aggregating the data.
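A minimal sketch of the obfuscation sub-step, assuming pooled records are tuples of floats gathered from all participants. The de-duplicate, shuffle, and optional-noise pipeline mirrors the options listed above; all names and parameters are illustrative.

```python
import random

def obfuscate(records, noise_scale=0.0, rng=None):
    """Blur the per-participant origin of pooled records.

    records: list of tuples of floats, pooled from all participants.
    noise_scale: std-dev of optional Gaussian noise added per element.
    """
    rng = rng or random.Random(0)
    pooled = list(dict.fromkeys(records))  # de-duplicate, keeping first occurrence
    rng.shuffle(pooled)                    # break any source-based ordering
    if noise_scale:
        pooled = [tuple(x + rng.gauss(0, noise_scale) for x in row)
                  for row in pooled]
    return pooled
```

After this step a record can no longer be attributed to a particular participant by its position or by exact duplication across contributions.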
Step 108, public dataset pulling: each joint modeling participant pulls the shared-area data into its local environment via the RDMA protocol in preparation for subsequent training;
Step 109, initializing the GPU: each joint modeling participant initializes its GPU as a trusted computing resource to accelerate the training of the large language model.
The step 2 specifically comprises the following steps:
step 201, model initialization and memory transfer: each joint modeling participant directly copies the internal memory of the initial model to the internal memory of each TEE cluster through an FPGA controller by adopting an RDMA protocol, so as to ensure that the model is loaded into a trusted environment;
step 202, data selection and preparation: the joint modeling participant randomly selects a part of data from public data, and simultaneously uses private data of a user as training data of the training round;
the public data refers to user non-sensitive data or corresponding data desensitization of the user non-sensitive data or data from outside which has been disclosed, wherein the data are basic public data sets which are formed after data confusion is carried out by sending the data to a central node by each user; mainly to accelerate the convergence speed of federal learning (essentially solving the problem of nonid), the distribution of the common dataset is the same as the final joint modeling object-oriented data distribution.
Step 203, accelerated computation and data transmission: the FPGA controller locks the GPU for exclusive use; the program portions and training data that require accelerated computation are routed through the FPGA, decrypted, and transmitted to the GPU via CXL for high-speed computation and processing;
Step 204, distributed parallel training: within the local TEE cluster, distributed parallel training is performed with multiple groups of data; the joint modeling participant processes public-area and private-area data with different strategies, computes gradients, completes gradient aggregation within the private-area TEE cluster, encrypts the result with a key generated from the global clock and a selected random seed, stores it in the public shared memory, and notifies the other joint modeling participants;
the shared memory is a region that all nodes together form into one large memory and read and write directly, improving processing speed;
Step 205, clearing sensitive memory: after this round of computation, the FPGA controller performs a clearing operation on the sensitive memory of the GPU chip to ensure data security;
step 206, global aggregation and model update: the aggregation node is responsible for aggregating gradient data in a shared memory, and then updating model parameters;
the gradient data is obtained by a joint modeling participant of federal modeling by using a local data set to perform local forward calculation, then calculating the gradient of parameters according to a set loss function, and uploading the gradient to an aggregation node. The FedAvg algorithm can be consulted, the central aggregation node carries out weighted average on gradient parameter values from each node, and a global model is updated, so that the method is a common processing mode of the federal learning algorithm.
Step 207, model distribution: the aggregation node places the updated model into the public area and encrypts it using a random seed and the global clock, preparing it for distribution in subsequent training;
Step 208, downloading and training with the updated model: each joint modeling participant downloads the new model; the aggregation node updates the global model and sends it directly to the participants, with execution accelerated by having the participants pull the model, and it is used for new rounds of training until the model reaches convergence;
Step 209, encrypted storage of intermediate results: the task is triggered on a timer; as the aggregation node trains by continuous iteration, three copies are generated of the intermediate-result model (the continuously aggregated parameters) and of data such as the training state, and these are stored encrypted with a random key; then Shamir's Secret Sharing algorithm is used to distribute fragments of the random key to the joint modeling participants for future key recovery;
the Shamir secret sharing algorithm is a method that restores the original secret only if a certain threshold is reached by dividing the secret information into multiple parts, each of which is distributed to the joint modeling participants. This algorithm was proposed by Adi Shamir in 1979 as an application of threshold cryptography.
The following is the basic principle of Shamir secret sharing algorithm:
secret segmentation: assuming a secret S, the algorithm splits this secret into N parts, where at least K parts are needed to recover the original secret;
generating a polynomial: starting from a polynomial of high degree, the constant term of which is the secret S, the coefficients of which are randomly chosen from a finite field (e.g. integer modulus);
calculating the segmentation: calculating coordinates of a plurality of points by selecting different X values on the polynomial, each coordinate corresponding to a secret portion;
distribution portion: distributing the coordinates of the points as part of a secret to different joint modeling participants, each joint modeling participant only knowing the coordinate values they hold;
restoring the secret: at least K different parts are needed from which the original polynomial is restored using interpolation to obtain the original secret.
This method of Shamir provides a good security and resilience, and only when the threshold K is reached can the secret be restored. This approach has wide application in key management, data storage, and security of distributed systems.
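The five steps above can be sketched directly. This is a toy implementation over a 127-bit prime field; a production deployment would use a vetted library, authenticated shares, and keys no larger than the field modulus.

```python
import random

PRIME = 2**127 - 1  # Mersenne prime used as the finite-field modulus

def _eval_poly(coeffs, x):
    """Evaluate a polynomial (constant term first) at x, mod PRIME."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % PRIME
    return acc

def split_secret(secret, n, k):
    """Split `secret` into n shares; any k of them recover it."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    return [(x, _eval_poly(coeffs, x)) for x in range(1, n + 1)]

def recover_secret(shares):
    """Lagrange interpolation at x = 0 recovers the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

In the scheme of step 209, `secret` would be the random checkpoint-encryption key, `n` the number of joint modeling participants, and `k` the threshold needed to restore training after a failure.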
Step 210, abnormal-condition handling: when an abnormal condition occurs, the joint modeling participants jointly submit the secret fragments they hold and perform the decryption operation to recover the intermediate results, ensuring the continuity and reliability of training.
An abnormal condition refers to situations such as network faults or hardware failures in which the aggregation node, as a central node, carries a single-point-of-failure risk. Training a large language model takes long periods and substantial computing resources, so the aggregation node may fail or a large number of participating modeling nodes may disconnect, and such situations must be planned for.
The step 3 specifically comprises the following steps:
Step 301, after the joint pre-training task is completed, deploying the large language model into the actual application environment for use;
Step 302, collecting feedback data, continuously enlarging the training data, and continuously improving model accuracy and training efficiency.
The invention has the beneficial effects that:
the invention aims to be applied to the actual scene of the pretrained multiparty federal modeling of the large language model, fully utilizes RDMA and CXL technologies, builds a cross-domain Trusted Execution Environment (TEE) cluster in a distributed environment, divides a shared area and a private area for a memory, and can accommodate the large language model, training data and intermediate training results thereof by forming the shared area into the large memory, thereby overcoming the problem of communication bottleneck and resource utilization deficiency of the trusted modeling under the large data scale of the large model. The method can ensure effective utilization and isolation of resources, reduce memory delay, promote pre-training speed and enhance data confidentiality, integrity and safety.
The invention adopts Remote Direct Memory Access (RDMA) protocol, realizes high-speed communication between different TEEs, introduces CXL technology, realizes high-speed interconnection of GPU and TEEs, reduces memory delay and improves pre-training speed.
In order to ensure the safety of the GPU, a TPM chip is added in the FPGA to form a trusted controller, and authority strategies are dynamically distributed according to training tasks, so that the GPU is ensured to be in a safe state in the starting process and to be monopolized in the executing process of the GPU, malicious attacks and data tampering are prevented, meanwhile, the GPU is routed to a TEE environment through the FPGA after the tasks are executed, sensitive data of a memory on the GPU chip are cleared, and the problem of residual video memory is solved.
The method introduces a global clock, a multiparty key negotiation mechanism and random number seed distribution, performs fragmentation confusion on shared memory data, and combines encryption and MAC authentication to enhance data confidentiality, integrity and security. In the large model pre-training stage, public data and private data are selected for federal modeling, and a hierarchical aggregation strategy is adopted, so that gradient aggregation is finally carried out in a public area, and the risk of gradient leakage is reduced.
The invention adopts multiparty secret sharing and multicopy encryption storage persistence mode, ensures the safety and restorability of intermediate data in the model training process, and reduces the risk of data loss.
In conclusion, the method can effectively address a plurality of challenges in the pretraining of the large language model, provides a reliable and efficient solution for the practical application scene of the large model multi-party federal modeling, and has higher practical value and wide application prospect.
Drawings
FIG. 1 is a schematic diagram of the federal modeling node composition of a large model of the present invention.
FIG. 2 is a diagram illustrating a memory sharing protocol connection according to the present invention.
FIG. 3 is a schematic diagram of the functional composition of the node of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
As shown in fig. 1, according to the actual requirements of the multi-party federated modeling task for large language model pre-training, a cross-domain Trusted Execution Environment (TEE) cluster is built in a distributed environment and memory is divided into shared and private areas; the shared areas are combined into one large memory into which the large model and its training data are loaded for pre-training, solving the resource limitations of trusted modeling at the data scale of large models.
A global clock is set, and high-speed memory access is achieved through the RDMA and CXL protocols, fully utilizing GPU and TEE resources and effectively reducing the performance bottleneck caused by memory latency. An FPGA controller and a security module are adopted, integrating mechanisms such as multi-party key negotiation, random-seed distribution, data fragmentation and obfuscation, encryption, and MAC authentication, ensuring GPU trustworthiness and the privacy and security of the distributed training process. In addition, a hierarchical aggregation strategy reduces the risk of gradient leakage, and by setting checkpoints with multi-copy encrypted persistent storage of intermediate results, training reliability is improved.
The main function of the large language model is to process, understand, and generate natural language; in the pre-training stage it acquires rich linguistic knowledge and patterns, providing powerful support for various downstream NLP tasks. The joint modeling participants are the organizations or institutions participating in joint training of the model; they contribute training data, model parameters, and computing resources, and cooperatively train the large language model to improve pre-training performance.
The TEE environment provides an isolated and secure computing environment; each TEE cluster node has a private memory area and a shared memory area, and the TEE provides remote attestation, permission control, a security module, secure interconnection, and other functions to ensure the security of the computing environment.
Memory division means that each TEE cluster node divides its memory into a private area and a shared area, ensuring the isolation and confidentiality of data between different nodes.
The remote attestation function ensures that a TEE node can perform remote attestation and verify the identities of other nodes, so that only legitimate nodes can access the resources. The security module is the set of security-related functions built into the TEE environment, including encryption and decryption, MAC computation, and secret sharing, used for secure transmission and processing of data and for privacy protection.
As shown in fig. 2, secure interconnection is implemented by the FPGA controller between TEE cluster nodes and between the TEE and the GPU; it supports encryption and authentication, ensuring the security and permission authentication of high-speed data transmission.
The FPGA controller realizes GPU trustworthiness as well as secure communication and permission authentication between the TEE and the outside. It ensures that the GPU is in a secure state during startup by verifying the integrity and correctness of the GPU driver, and maintains exclusive use during GPU execution to prevent malicious attacks and data tampering; after a task finishes, the GPU is routed back to the TEE environment through the FPGA and the sensitive data in on-chip GPU memory is cleared, solving the problem of residual video memory. The controller mainly provides TPM chip management, RDMA memory sharing, CXL secure communication, and related functions.
TPM chip management means that the TPM chip in the FPGA controller manages the keys and certificates related to trusted computing, ensuring the security and trustworthiness of the TEE cluster.
RDMA memory sharing means that, via the RDMA protocol, the FPGA controller realizes high-speed memory sharing between TEE clusters and accelerates data transmission and communication.
CXL secure communication means that the FPGA controller introduces CXL technology to realize high-speed, secure interconnection between the GPU and the TEE, increasing pre-training speed and reducing memory latency.
The GPU is responsible for the accelerated training of the large language model, improving training efficiency and speed; it is controlled and verified by the FPGA controller to realize GPU trustworthiness.
The TEE cluster consists of multiple Trusted Execution Environments (TEEs) belonging to the different joint modeling participants in the joint training. The federated joint modeling participants design the interconnection of TEE nodes and provide remote attestation and permission control functions: TEE cluster remote attestation ensures that only attested nodes can participate in the pre-training process and designates some nodes to interact with the TEE cluster environments of other external participants, while the TEE cluster permission control function dynamically allocates permissions according to the training tasks and joint modeling participants, so that each node can access only the data and resources for which it has permission.
As shown in fig. 3: the large language model joint pre-training task runs in the TEE cluster in which the multiple parties participate, and comprises computing nodes and an aggregation/management node;
the computing nodes execute the model training and training acceleration tasks; training is carried out inside the trusted execution environment, ensuring the security and trustworthiness of model training. The computing nodes are the participating nodes and are responsible for the local model training computation.
The aggregation/management node is responsible for the management and scheduling of the distributed training of the large language model, as well as gradient aggregation and model distribution, covering model loading, data loading, data confusion, gradient aggregation, parameter updating, model distribution, key management and model persistence;
the model loading is an operation of loading a large language model from a storage medium to a TEE cluster;
the data loading is to load the data required by the pre-training of the large language model into the designated memory of the TEE cluster according to its privacy sensitivity; the data confusion is responsible for jointly obfuscating the data stored in the shared area so as to eliminate its traceability and identifiability, and for encrypting the data with a unified key;
the gradient aggregation function is responsible for aggregating gradients obtained by calculation of the distributed training TEE nodes so as to update model parameters;
the parameter updating function is to update model parameters by using the gradient obtained by aggregation so as to be used in the next training round;
the model distribution is to distribute updated model parameters to each TEE training node by adopting an RDMA protocol;
the key management function is to introduce a global clock into the system as time reference input, combine random number seeds to generate a shared symmetric key and manage the shared symmetric key;
the model persistence is to store intermediate results, such as the model parameters, in a multi-copy encrypted persistent form, so that when an abnormal condition occurs during training the training progress can be recovered and the risk of data loss is reduced. The global clock serves as the clock synchronization mechanism of the distributed system, ensuring time consistency across the nodes; it is used for the area identification of the memory and as part of the key.
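The key management function above derives a shared symmetric key from the global clock and a pre-distributed random-number seed. A minimal sketch follows; HMAC-SHA256 is our stand-in key derivation function, since the patent does not name a concrete KDF:

```python
import hashlib
import hmac

def derive_shared_key(global_clock_tick: int, seed: bytes) -> bytes:
    """Derive a 256-bit symmetric key from the synchronized clock tick and a
    pre-distributed random-number seed. HMAC-SHA256 is an assumed KDF choice."""
    msg = global_clock_tick.to_bytes(8, "big")
    return hmac.new(seed, msg, hashlib.sha256).digest()

# Two parties holding the same seed and the same synchronized clock tick
# derive the same key without it ever crossing the network.
k1 = derive_shared_key(1706486400, b"seed-042")
k2 = derive_shared_key(1706486400, b"seed-042")
```

Because the clock tick enters the derivation, the key rotates over time even though the seeds are distributed only once.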
The process of the method provided by the invention will be described in detail with reference to specific examples.
Step one, preparing a large language model joint pre-training modeling environment:
the preparation of the large language model joint pre-training modeling environment comprises the following steps:
step 101, confirming a pre-training task: specific tasks of the large language model pre-training are defined, including initial parameter configuration of the model and requirements of training data.
Step 102, determining a joint modeling participant: each joint modeling participant, including model owners and data providers, that participates in the large language model joint pre-training is determined.
Step 103, constructing a cross-domain TEE cluster: in a distributed environment, a cross-domain Trusted Execution Environment (TEE) cluster is built, each joint modeling participant starts a TEE management node, shared areas, private areas and GPU resources are planned in the TEE, and access authority information is loaded into an FPGA for controlling access of RDMA and CXL.
Step 104, setting a global clock: and introducing a global clock as a unified time scale to distribute N random number seeds for all joint modeling participants, so as to ensure the consistency of time and the safety of data.
Step 105, encryption and MAC authentication: the large language model federal pre-training task initiator selects a random seed and obtains the global clock to generate a key, encrypts the initial model, generates a MAC message authentication code, and puts the encrypted initial model together with the related identification information into the shared memory area in preparation for model distribution.
Step 106, training data loading and processing: each joint modeling participant loads data into a private area and a shared area of the TEE according to data sensitivity, token processing is carried out on the data to form vector representation, a global clock is obtained, and a local random seed is selected to encrypt the data of the shared area so as to enhance the confidentiality of the data.
Step 107, data aggregation and confusion: the large language model federal pre-training task initiator aggregates the data of the TEE shared areas, generates a key from the random-number seed number recorded in the metadata identification and the global clock, decrypts the data with that key, and confuses the data to obscure its sources; it then obtains the global clock again and selects a local random seed to re-encrypt the shared-area data, generating a public data set that is stored in the shared memory area.
Step 108, common data set pulling: each joint modeling participant pulls the shared region data into the respective local environment through the RDMA protocol in preparation for subsequent training.
Step 109, initializing the GPU: each joint modeling participant initializes the GPU to become a trusted computing resource for accelerating the training process of the model.
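The encrypt-then-MAC pattern of step 105 can be sketched as below. The hash-counter keystream merely stands in for a real cipher such as AES-GCM so that the sketch needs only the standard library; the key names and message layout are assumptions:

```python
import hashlib
import hmac

def keystream(key: bytes, n: int) -> bytes:
    # Counter-mode keystream built from SHA-256; a dependency-free stand-in
    # for a real cipher such as AES in this sketch.
    out = b""
    ctr = 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def encrypt_and_mac(model_bytes: bytes, enc_key: bytes, mac_key: bytes):
    """Encrypt the serialized model, then MAC the ciphertext (encrypt-then-MAC)."""
    ct = bytes(a ^ b for a, b in zip(model_bytes, keystream(enc_key, len(model_bytes))))
    tag = hmac.new(mac_key, ct, hashlib.sha256).digest()
    return ct, tag

def verify_and_decrypt(ct: bytes, tag: bytes, enc_key: bytes, mac_key: bytes) -> bytes:
    """Reject tampered ciphertext before decrypting."""
    if not hmac.compare_digest(tag, hmac.new(mac_key, ct, hashlib.sha256).digest()):
        raise ValueError("MAC check failed: model was tampered with")
    return bytes(a ^ b for a, b in zip(ct, keystream(enc_key, len(ct))))
```

Verifying the MAC before decryption means a recipient pulling the model from the shared memory area never processes tampered ciphertext.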
The invention provides a large language model federal pre-training method based on a trusted execution environment, which is used for joint pre-training modeling of a large language model and comprises the following steps:
step two, joint pre-training modeling of a large language model:
the joint pre-training modeling of the large language model comprises the following steps:
step 201, model initialization and memory transfer: each joint modeling participant copies the initial model directly from memory into the memory of each TEE cluster through the FPGA controller using the RDMA protocol, ensuring that the initial model is loaded into a trusted environment.
Step 202, data selection and preparation: the joint modeling participant randomly selects a part of data from public data, and simultaneously uses private data of a user as training data of the training round.
Step 203, accelerating calculation and data transmission: the FPGA controller locks the GPU for exclusive use; the program portions and training data that need accelerated computation are routed through the FPGA, decrypted, and transmitted to the GPU using the CXL technology for high-speed calculation and processing.
Step 204, distributed parallel training: in the local TEE cluster, distributed parallel training is carried out with multiple groups of data. The joint modeling participants process the data of the public area and the private area with different strategies and calculate the gradients; gradient aggregation is completed inside the private-area TEE cluster, the result is encrypted with a key generated from the global clock and a selected random seed and stored in the public shared memory, and each joint modeling participant is notified.
Step 205, clearing sensitive memory: after the calculation of the round is completed, the FPGA controller is utilized to execute the clearing operation of the sensitive memory on the GPU chip so as to ensure the data security.
Step 206, global aggregation and model update: and the aggregation node is responsible for aggregating the gradient data in the shared memory and then updating the model parameters.
Step 207, model distribution: the aggregation node places the updated model into a common area and encrypts it using a random number seed and a global clock, ready for model distribution for use in subsequent training.
Step 208, downloading and training the updated model: and downloading a new model by each joint modeling party for a new round of model training until the model reaches a convergence state.
Step 209, intermediate result encryption storage: the aggregation node generates three copies of the intermediate results, such as the model and the training state, and encrypts and stores them with a random key; the Shamir secret sharing algorithm is then used to distribute fragments of the random key to all joint modeling participants for future key recovery.
Step 210, abnormal condition processing: when an abnormal condition occurs, the joint modeling participants jointly combine the secret fragments they each hold and perform the decryption operation to recover the intermediate results, ensuring the continuity and reliability of training.
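The Shamir secret sharing used in steps 209 and 210 can be sketched over a prime field: the random key is the constant term of a random polynomial, shares are evaluations of that polynomial, and any threshold number of shares recover the key by Lagrange interpolation at zero. The field choice and share format here are illustrative:

```python
import secrets

_P = 2**127 - 1  # Mersenne prime; large enough for a short key in this sketch

def _eval_poly(coeffs, x):
    # Horner evaluation of coeffs[0] + coeffs[1]*x + ... modulo _P.
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % _P
    return acc

def split_secret(secret: int, n: int, k: int):
    """Split `secret` into n shares; any k of them recover it."""
    coeffs = [secret] + [secrets.randbelow(_P) for _ in range(k - 1)]
    return [(x, _eval_poly(coeffs, x)) for x in range(1, n + 1)]

def recover_secret(shares):
    """Lagrange interpolation of the polynomial at x = 0."""
    acc = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % _P
                den = (den * (xi - xj)) % _P
        # pow(den, _P - 2, _P) is the modular inverse of den (Fermat).
        acc = (acc + yi * num * pow(den, _P - 2, _P)) % _P
    return acc
```

With five participants and a threshold of three, any three surviving parties can jointly reconstruct the random key and decrypt the persisted intermediate results, matching the recovery flow of step 210.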
Step three, optimizing a combined pre-training large language model:
the optimization of the joint pre-trained large language model comprises the following steps:
step 301, after the large language model is trained in combination with a pre-training task, the large language model is deployed into an actual application environment for use;
step 302, collecting feedback data, continuously increasing training data, and continuously improving model precision and training efficiency.
The above examples are only one of the specific embodiments of the present invention, and the ordinary changes and substitutions made by those skilled in the art within the scope of the technical solution of the present invention should be included in the scope of the present invention.
Claims (6)
1. The large language model federal pre-training method based on the trusted execution environment is characterized by comprising the following steps of:
step 1, creating a large language model joint pre-training task, determining joint modeling participants, preparing data, and creating computing storage network resources for joint modeling;
step 2, performing joint pre-training of a large language model;
and 3, optimizing the large language model obtained by the combined pre-training.
2. The large language model federal pre-training method based on trusted execution environment according to claim 1, wherein the step 1 specifically comprises the steps of:
step 101, confirming a pre-training task: defining a specific task of pre-training of the large language model, wherein the specific task comprises initial parameter configuration of the large language model and requirement of training data;
step 102, determining a joint modeling participant: determining each joint modeling participant participating in the joint pre-training of the large language model, including an owner of the large language model and a data provider;
step 103, constructing a cross-domain TEE cluster: in a distributed environment, building a cross-domain Trusted Execution Environment (TEE) cluster, each joint modeling participant starting a TEE management node, planning shared areas, private areas and GPU resources inside the TEE, and loading the access authority information into an FPGA (field programmable gate array) for controlling the access of RDMA (remote direct memory access) and CXL (Compute Express Link);
step 104, setting a global clock: introducing a global clock as a unified time scale to distribute N random number seeds for all joint modeling participants, so as to ensure the consistency of time and the safety of data;
step 105, encryption and MAC authentication: the large language model federal pre-training task initiator selects a random seed and acquires the global clock to generate a key, encrypts the initial model while generating a MAC message authentication code, and puts the encrypted initial model together with the related identification information into the shared memory area in preparation for model distribution;
the shared memory area is divided into a private area and a shared public area; the private area stores the private data, i.e. the sensitive data, of the participants in the federal joint modeling; the shared public area is used for the efficient, rapid sharing of non-sensitive data, modeling-process metadata, global model parameters and the public data set;
step 106, training data loading and processing: each joint modeling participant loads data into a private area and a shared area of the TEE according to data sensitivity, performs Token processing on the data to form vector representation, acquires a global clock, and selects a local random seed to encrypt the data of the shared area;
step 107, data aggregation and confusion: the task initiator aggregates the data of the TEE shared areas, generates a key from the random-number seed number recorded in the metadata identification and the global clock, decrypts the data with that key, and confuses the data to obscure its sources; it then acquires the global clock again, selects a local random seed to re-encrypt the shared-area data, generates a public data set, and stores it in the shared memory area;
step 108, common data set pulling: each joint modeling participant pulls the data of the shared area into the respective local environment through an RDMA protocol to prepare for subsequent training;
step 109, initializing the GPU: each joint modeling participant initializes the GPU to become a trusted computing resource for accelerating the training process of the model.
3. The large language model federal pre-training method based on trusted execution environment according to claim 2, wherein said step 107 comprises;
data aggregation: the task initiator obtains the data from the TEE shared area of each joint modeling participant, generates a key using the global clock and the random-number seed number recorded in the metadata identification, and decrypts the data obtained from each TEE with the generated key, restoring the encrypted data to plaintext;
data confusion: on the basis of decryption, the task initiator confuses the data;
and (5) re-encrypting: re-encrypting the confused data, and generating a new key again by using the global clock and a random seed local to the task initiator;
a common dataset is generated: the processed, obfuscated and re-encrypted data is combined into a common data set.
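The obfuscation step of claim 3 can be illustrated minimally: records that have already been decrypted are pooled, stripped of their party labels, and shuffled so that the source of any individual record can no longer be inferred. The party names and record shapes below are invented for the example:

```python
import random

def obfuscate(records_by_party):
    """Pool decrypted records from all parties, drop the party labels,
    and shuffle so no record can be traced back to its source."""
    pooled = [rec for records in records_by_party.values() for rec in records]
    random.shuffle(pooled)
    return pooled

# Hypothetical per-party shared-area contents after decryption.
parties = {
    "party_a": ["sample-a1", "sample-a2"],
    "party_b": ["sample-b1"],
}
public_dataset = obfuscate(parties)
```

In the full pipeline of claim 3 this pooled list would then be re-encrypted with the freshly derived key before being written back to the shared memory area as the common data set.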
4. The large language model federal pre-training method based on trusted execution environment according to claim 2, wherein the step 2 specifically comprises the steps of:
step 201, model initialization and memory transfer: each joint modeling participant directly copies the memory of the initial model to the memory of each TEE cluster through the FPGA controller by adopting an RDMA protocol, so as to ensure that the initial model is loaded into a trusted environment;
step 202, data selection and preparation: the joint modeling participant randomly selects a part of data from public data, and simultaneously uses private data of a user as training data of the training round;
the public data refers to non-sensitive user data, desensitized versions of such data, or data from outside that has already been disclosed; each user sends this data to the central node, and after data confusion it forms the basic public data set;
step 203, accelerating calculation and data transmission: the FPGA controller locks the GPU for exclusive use; the program portions and training data that need accelerated computation are routed through the FPGA, decrypted, and transmitted to the GPU using the CXL technology for calculation and processing;
step 204, distributed parallel training: in the local TEE cluster, carrying out distributed parallel training by adopting a plurality of groups of data, processing the data of a public area and a private area by using different strategies by a joint modeling participant, calculating gradients, finishing gradient aggregation in the TEE cluster of the private area, encrypting by using a global clock and a selected random seed generation key, storing the result into a shared memory of the public area, and notifying each joint modeling participant;
the shared memory is a region to which the nodes jointly contribute to form one large memory; the memory is read and written directly, which improves the processing speed;
step 205, clearing sensitive memory: after the calculation of the round is completed, the FPGA controller is utilized to execute the clearing operation of the sensitive memory on the GPU chip so as to ensure the data security;
step 206, global aggregation and model update: the aggregation node is responsible for aggregating gradient data in the shared memory and then updating model parameters;
the gradient data is obtained as follows: a joint modeling participant of the federal modeling performs the local forward computation with its local data set, calculates the gradients of the parameters according to the configured loss function, and uploads the gradients to the aggregation node;
step 207, model distribution: the aggregation node puts the updated model into a public area, encrypts the model by using a random number seed and a global clock, and prepares the model for model distribution for subsequent training and use;
step 208, downloading and training the updated model: each joint modeling participant downloads the new model from the aggregation node; after updating the global model, the aggregation node sends it directly to the joint modeling participants, with execution accelerated by letting the participants pull the model themselves; the model is used for the new round of training until it reaches a convergence state;
step 209, intermediate result encryption storage: this task is triggered periodically; as the aggregation node trains by continuous iteration, three copies are generated of the intermediate results, namely the continuously aggregated model parameters and the training-state data, and are encrypted and stored with a random key; the Shamir secret sharing algorithm is then used to distribute fragments of the random key to all joint modeling participants in preparation for future key recovery;
step 210, abnormal condition processing: when an abnormal condition occurs, the joint modeling participants jointly combine the secret fragments they each hold and perform the decryption operation to recover the intermediate results, ensuring the continuity and reliability of training.
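The global aggregation and parameter update of step 206 can be sketched as a weighted gradient average followed by a gradient-descent step. The patent does not fix a particular aggregation rule, so the plain FedAvg-style weighted mean and the learning rate below are assumptions:

```python
def aggregate_gradients(party_gradients, weights=None):
    """Weighted average of per-party gradient vectors.
    `weights` (e.g. local dataset sizes) defaults to uniform averaging."""
    if weights is None:
        weights = [1.0] * len(party_gradients)
    total = sum(weights)
    dim = len(party_gradients[0])
    return [
        sum(w * g[i] for w, g in zip(weights, party_gradients)) / total
        for i in range(dim)
    ]

def sgd_update(params, grad, lr=0.1):
    """One gradient-descent step on the aggregated gradient."""
    return [p - lr * g for p, g in zip(params, grad)]

# Two hypothetical parties upload their locally computed gradients.
agg = aggregate_gradients([[1.0, 2.0], [3.0, 4.0]])
new_params = sgd_update([0.5, 0.5], agg)
```

The aggregation node would run exactly this pair of operations each round before re-encrypting the updated parameters and placing them in the public area for distribution (step 207).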
5. The large language model federal pre-training method based on a trusted execution environment according to claim 4, wherein the FPGA controller realizes the trustworthiness of the GPU and the secure communication and authority authentication between the TEE and the outside, and the FPGA controller comprises TPM chip management, RDMA memory sharing and CXL secure communication;
the TPM chip management is that the TPM chip in the FPGA controller is used for managing related keys and certificates of trusted computing, so that the safety and the credibility of the TEE cluster are ensured;
the RDMA memory sharing is realized by an RDMA protocol, and the FPGA controller realizes high-speed memory sharing among TEE clusters and accelerates data transmission and communication;
the CXL safety communication realizes high-speed and safe interconnection between the GPU and the TEE, improves the pre-training speed and reduces the memory delay;
the GPU is responsible for performing accelerated training of a large language model, improving training efficiency and speed, and is controlled and verified by the FPGA controller to realize GPU credibility.
6. The large language model federal pre-training method based on trusted execution environment according to claim 4, wherein the step 3 specifically comprises the following steps of;
step 301, after the large language model is trained in combination with a pre-training task, the large language model is deployed into an actual application environment for use;
step 302, collecting feedback data, continuously increasing training data, and continuously improving model precision and training efficiency.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410117882.3A CN117648998B (en) | 2024-01-29 | 2024-01-29 | Large language model federal pre-training method based on trusted execution environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117648998A true CN117648998A (en) | 2024-03-05 |
CN117648998B CN117648998B (en) | 2024-04-26 |
Family
ID=90045405
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410117882.3A Active CN117648998B (en) | 2024-01-29 | 2024-01-29 | Large language model federal pre-training method based on trusted execution environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117648998B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118114771A (en) * | 2024-04-25 | 2024-05-31 | 蚂蚁科技集团股份有限公司 | Function tool calling method and device in trusted execution environment, medium and equipment |
CN118296615A (en) * | 2024-06-05 | 2024-07-05 | 蓝象智联(杭州)科技有限公司 | Large model fine tuning method and device based on trusted execution environment |
CN118394889A (en) * | 2024-06-21 | 2024-07-26 | 之江实验室 | Large language model federal fine tuning method and device based on gradient compression |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113487042A (en) * | 2021-06-28 | 2021-10-08 | 海光信息技术股份有限公司 | Federated learning method and device and federated learning system |
CN114254386A (en) * | 2021-12-13 | 2022-03-29 | 北京理工大学 | Federated learning privacy protection system and method based on hierarchical aggregation and block chain |
WO2023035507A1 (en) * | 2021-09-07 | 2023-03-16 | 天翼电子商务有限公司 | Trusted executive environment multi-node authentication method |
CN116402169A (en) * | 2023-06-09 | 2023-07-07 | 山东浪潮科学研究院有限公司 | Federal modeling verification method, federal modeling verification device, federal modeling verification equipment and storage medium |
CN116611124A (en) * | 2023-05-25 | 2023-08-18 | 南京恒木兴信息科技有限公司 | GPU trusted execution environment construction method, system and data transmission method |
CN116957109A (en) * | 2023-07-31 | 2023-10-27 | 山东浪潮科学研究院有限公司 | Model construction method, device, equipment and medium based on federal learning |
CN117216788A (en) * | 2023-08-26 | 2023-12-12 | 北京工业大学 | Video scene identification method based on federal learning privacy protection of block chain |
CN117332247A (en) * | 2023-12-01 | 2024-01-02 | 苏州大学 | Big data transaction and quality assessment method and system using big language model as medium |
Non-Patent Citations (3)
Title |
---|
MAI, HAOHUI et al.: "Honeycomb: Secure and Efficient GPU Executions via Static Validation", Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2023, 16 October 2023 (2023-10-16) *
WU, Bingqing: "A Secret-Sharing-Based Method for Processing Private Data on Internet-of-Vehicles Public Blockchains", China Masters' Theses Full-text Database, Engineering Science and Technology II, 15 March 2023 (2023-03-15) *
JING, Weipeng; JIANG, Tao; ZHU, Liangkuan; LIU, Meiling: "Research on a Speech Recognition Acceleration Algorithm Based on GPU and Deep Belief Networks", Journal of Chinese Computer Systems, no. 03, 15 March 2018 (2018-03-15) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||