CN117648998B - Large language model federal pre-training method based on trusted execution environment - Google Patents

Large language model federal pre-training method based on trusted execution environment

Info

Publication number
CN117648998B
CN117648998B (application CN202410117882.3A)
Authority
CN
China
Prior art keywords
data
training
model
language model
tee
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410117882.3A
Other languages
Chinese (zh)
Other versions
CN117648998A (en)
Inventor
罗清彩
李辉
孙善宝
王亚宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202410117882.3A
Publication of CN117648998A
Application granted
Publication of CN117648998B
Active legal-status: Current
Anticipated expiration legal-status

Landscapes

  • Storage Device Security (AREA)

Abstract

The invention discloses a large language model federal pre-training method based on a trusted execution environment, which comprises the following steps: step 1, creating a large language model joint pre-training task, determining the joint modeling participants, preparing data, and creating the computing, storage and network resources for joint modeling; step 2, performing joint pre-training of the large language model; step 3, optimizing the large language model obtained by the joint pre-training. The method is applied to the practical scenario of multi-party federal modeling for large language model pre-training: it makes full use of RDMA and CXL technologies, builds a cross-domain trusted execution environment cluster in a distributed environment, and divides the memory into a shared area and a private area. By pooling the shared areas into one large memory that can hold the large language model, its training data and the intermediate training results, the method resolves the communication bottleneck and insufficient resource utilization of trusted modeling at large-model data scale.

Description

Large language model federal pre-training method based on trusted execution environment
Technical Field
The invention belongs to the technical field of large language model federal pre-training, and particularly relates to a large language model federal pre-training method based on a trusted execution environment.
Background
Against the background of the rapid development of artificial intelligence, natural language processing, as an important direction of AI, has shown broad application prospects in many fields, such as machine translation, sentiment analysis, intelligent dialogue and text generation, and continues to empower and promote industrial development in healthcare, finance, government affairs and other sectors.
Pre-trained large models are a key technology of natural language processing: by pre-training on large-scale data, a model can learn rich language knowledge and patterns, providing strong support for various downstream tasks. However, this technology also faces serious challenges, relating not only to technical aspects but also to data privacy, computational efficiency, security and other critical issues.
First, the data needed for large language model pre-training is huge in scale and spans multiple data sources and data owners; it may contain sensitive information such as personal privacy and business secrets. How to train a large model through multi-party collaboration while guaranteeing data privacy against disclosure and abuse therefore becomes a critical issue.
Second, the large language model pre-training process requires enormous storage space, for example for the training data sets, the large language model itself, and the intermediate training results, placing high demands on computing and storage resources; with limited resources, training slows down or cannot be carried out at all. Meanwhile, GPU acceleration plays an important role in training NLP large language models, yet the trustworthiness of the GPU is challenged and may introduce security risks, so how to guarantee GPU trustworthiness and security needs to be solved. In addition, the pre-training process of a large language model is long; node failures and abnormal interruptions can break the training process, and since the data resides in memory, the risk of losing intermediate data is significant.
Large language models are based on the Transformer. The pre-training idea is that model parameters are no longer initialized randomly; instead, a set of parameters is obtained by pre-training on a task, the model is initialized with these parameters, and training then continues. Large language model pre-training mainly predicts the next word from the preceding words and is a form of unsupervised pre-training; it involves autoregressive (AR) models, i.e. models trained left-to-right, in-context learning techniques and the like. Taking ChatGPT as an example, after large language model pre-training is completed, further fine-tuning is performed using supervised learning, a reward model and reinforcement learning.
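For readers unfamiliar with this objective, the following toy sketch (not part of the patent; plain NumPy, with made-up shapes) illustrates the next-token prediction loss that autoregressive pre-training minimizes: each position's logits are scored against the token that actually follows it.

```python
import numpy as np

def next_token_loss(logits: np.ndarray, token_ids: np.ndarray) -> float:
    """Average cross-entropy of predicting token t+1 from positions 0..t.

    logits:    (seq_len, vocab_size) scores produced by the model
    token_ids: (seq_len,) integer ids of the training sequence
    """
    pred = logits[:-1]                       # predictions at positions 0..seq_len-2
    target = token_ids[1:]                   # the tokens that actually follow
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = pred - pred.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(target)), target].mean())

# Toy usage with random numbers only, to show the shapes involved.
rng = np.random.default_rng(0)
loss = next_token_loss(rng.normal(size=(8, 100)), rng.integers(0, 100, size=8))
print(round(loss, 3))
```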
Against this background, how to make full use of cutting-edge technologies such as the Trusted Execution Environment (TEE), Compute Express Link (CXL) and Remote Direct Memory Access (RDMA), combined with means such as key management, encrypted transmission and data isolation, to perform efficient, compliant multi-party joint modeling for the large language model pre-training scenario becomes a problem to be solved.
Disclosure of Invention
In order to overcome the deficiencies of the prior art, the invention aims to provide a large language model federal pre-training method based on a trusted execution environment. It is applied to the practical scenario of multi-party federal modeling for large language model pre-training, makes full use of RDMA and CXL technologies, builds a cross-domain Trusted Execution Environment (TEE) cluster in a distributed environment, divides the memory into a shared area and a private area, and pools the shared areas into one large memory that can accommodate the large language model, its training data and the intermediate training results, thereby overcoming the communication bottleneck and insufficient resource utilization of trusted modeling at large-model data scale.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A large language model federation pre-training method based on a trusted execution environment comprises the following steps:
step 1, creating a large language model joint pre-training task, determining joint modeling participants, preparing data, and creating computing storage network resources for joint modeling;
Step 2, performing joint pre-training of a large language model;
and 3, optimizing the large language model obtained by the combined pre-training.
Step 1 specifically comprises the following steps:
Step 101, confirming the pre-training task: defining the specific task of the large language model pre-training, including the initial parameter configuration of the large language model and the requirements on the training data;
step 102, determining a joint modeling participant: determining each joint modeling participant participating in the joint pre-training of the large language model, including an owner of the large language model and a data provider;
Step 103, constructing a cross-domain TEE cluster: in a distributed environment, building a cross-domain Trusted Execution Environment (TEE) cluster; each joint modeling participant starts a TEE management node, plans a shared area, a private area and GPU resources in the TEE, and loads access authority information into an FPGA (field programmable gate array) for controlling RDMA (remote direct memory access) and CXL (Compute Express Link) access;
Step 104, setting a global clock: introducing a global clock as a unified time scale and distributing N random number seeds to all joint modeling participants, ensuring time consistency and data security;
Step 105, encryption and MAC authentication: the large language model federal pre-training task initiator selects a random seed and obtains the global clock to generate a key, encrypts the initial model, generates a MAC message authentication code, and places the encrypted initial model and the related identification information into the shared memory area in preparation for model distribution;
The shared memory area is divided into a private area and a shared public area; the private area stores the sensitive private data of the joint modeling participants, which must not leave its owner's domain; the shared public area is used for efficient, rapid sharing of non-sensitive data, modeling-process metadata, global model parameters and public data sets;
Step 106, training data loading and processing: each joint modeling participant loads its data into the private area and the shared area of the TEE according to data sensitivity, tokenizes the data into vector representations, obtains the global clock and selects a local random seed to encrypt the shared-area data, enhancing data confidentiality;
Step 107, data aggregation and confusion: the task initiator aggregates the data in the TEE shared areas, generates a key from the random number seed number identified in the metadata and the global clock, decrypts the data with that key, and obfuscates it to blur the data sources; it then obtains the global clock and selects a local random seed to re-encrypt the shared-area data, generating a public data set that is stored in the shared memory area;
The step 107 specifically includes:
Data aggregation: the task initiator obtains the data (already preprocessed, tokenized and encrypted vector representations) from the TEE shared areas of the joint modeling participants, generates a key from the global clock and the random number seed number identified in the metadata, and decrypts the obtained data with this key, restoring the encrypted data to its original form;
Data confusion: on the basis of decryption, the task initiator obfuscates the data, the purpose being to blur the data's origin and increase its privacy; the obfuscation may use different techniques, such as sampling, de-duplicating and pooling the records, adding noise, shuffling the data, or introducing other randomness.
Re-encryption: the obfuscated data is re-encrypted with a new key generated again from the global clock and a random seed local to the task initiator; this encryption keeps the obfuscated data secure in storage while still allowing it to be decrypted in the next step and used for further training of the model.
Generating the public data set: the processed, obfuscated and re-encrypted data is combined into a public data set, a collection that contains information from the different joint modeling participants but in which, because of the obfuscation, the specific source of any individual record has become ambiguous. The public data set is stored in the shared memory area so that other joint modeling participants can access and use it.
The goal of this step is to ensure that, in federal pre-training, the data can be modeled jointly without exposing individual privacy: encryption and obfuscation add security during transmission and storage, while aggregation lets the model draw on more comprehensive information from many sources.
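The patent does not fix concrete cryptographic primitives or data layouts for step 107. The sketch below is one possible reading, assuming AES-GCM (from the `cryptography` package) for encryption with built-in authentication, a SHA-256 hash of the global clock and seed as the shared key, newline-separated records, and shuffling plus de-duplication as the obfuscation step; all function and variable names are illustrative.

```python
import hashlib, os, random
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def derive_key(global_clock: int, seed: int) -> bytes:
    """Shared symmetric key derived from the global clock tick and a random-number seed."""
    return hashlib.sha256(f"{global_clock}:{seed}".encode()).digest()

def aggregate_and_obfuscate(encrypted_shards, global_clock, seed_of, local_seed):
    """Decrypt each participant's shard, shuffle records to blur provenance,
    then re-encrypt the pooled records under a key local to the initiator."""
    records = []
    for participant, (nonce, ciphertext) in encrypted_shards.items():
        key = derive_key(global_clock, seed_of[participant])
        records.extend(AESGCM(key).decrypt(nonce, ciphertext, None).splitlines())
    random.shuffle(records)                     # obfuscation by shuffling
    records = list(dict.fromkeys(records))      # optional de-duplication
    new_key = derive_key(global_clock, local_seed)
    nonce = os.urandom(12)
    blob = AESGCM(new_key).encrypt(nonce, b"\n".join(records), None)
    return nonce, blob                          # stored in the shared memory area
```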
Step 108, common data set pulling: each joint modeling participant pulls the data of the shared area into the respective local environment through an RDMA protocol to prepare for subsequent training;
Step 109, initializing the GPU: each joint modeling participant initializes the GPU to become a trusted computing resource for accelerating the training process of the large language model.
Step 2 specifically comprises the following steps:
Step 201, model initialization and memory transfer: each joint modeling participant copies the initial model's memory directly into the memory of its TEE cluster via the FPGA controller using the RDMA protocol, ensuring the model is loaded into a trusted environment;
Step 202, data selection and preparation: the joint modeling participant randomly selects a portion of the public data and, together with its own private data, uses it as the training data for this round;
The public data refers to users' non-sensitive data, desensitized versions of such data, or already-public external data; each user sends this data to the central node, where it is obfuscated to form the basic public data set. Its main purpose is to accelerate the convergence of federated learning (essentially mitigating the non-IID problem): the distribution of the public data set matches the data distribution that the final joint modeling target faces.
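As one possible illustration of step 202 (details assumed, not specified by the patent), a participant's per-round training set can be built by mixing a random slice of the obfuscated public data set, which counteracts non-IID skew, with the participant's own private data, which never leaves its domain:

```python
import random

def build_round_dataset(public_pool, private_data, public_fraction=0.3, seed=None):
    """Return this round's training samples for one joint modeling participant."""
    rng = random.Random(seed)
    k = int(len(private_data) * public_fraction)        # size of the public slice
    public_slice = rng.sample(public_pool, min(k, len(public_pool)))
    round_data = public_slice + list(private_data)
    rng.shuffle(round_data)
    return round_data
```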
Step 203, accelerating calculation and data transmission: the FPGA controller locks the GPU for exclusive use; the program segments and training data that require accelerated computation are routed through the FPGA, decrypted, and transmitted to the GPU over CXL for high-speed computation and processing;
Step 204, distributed parallel training: within the local TEE cluster, distributed parallel training is carried out on multiple groups of data; the joint modeling participant processes public-area and private-area data with different strategies, computes gradients, completes gradient aggregation inside the private-area TEE cluster, encrypts the result with a key generated from the global clock and a selected random seed, stores it into the public shared memory, and notifies each joint modeling participant;
The shared memory is the pooled region: the shared areas of the nodes together form one large memory that each node reads and writes directly, which improves processing speed;
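The following is only a single-host analogy (an assumption for illustration): the patent pools the shared regions of distributed TEE nodes over RDMA/CXL through the FPGA controller, which Python cannot reproduce; `multiprocessing.shared_memory` merely shows the idea of several processes reading and writing one named memory region directly, without copies through files or sockets.

```python
from multiprocessing import shared_memory
import numpy as np

# Writer node: publish aggregated gradients into the shared region.
grads = np.arange(4, dtype=np.float32)
shm = shared_memory.SharedMemory(name="grad_region", create=True, size=grads.nbytes)
np.ndarray(grads.shape, dtype=grads.dtype, buffer=shm.buf)[:] = grads

# Reader node: map the same region by name and read it directly.
peer = shared_memory.SharedMemory(name="grad_region")
view = np.ndarray(grads.shape, dtype=np.float32, buffer=peer.buf)
print(view)                      # [0. 1. 2. 3.]

peer.close(); shm.close(); shm.unlink()
```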
Step 205, clearing sensitive memory: after the calculation of the round is completed, the FPGA controller is utilized to execute the clearing operation of the sensitive memory on the GPU chip so as to ensure the data security;
Step 206, global aggregation and model update: the aggregation node is responsible for aggregating gradient data in a shared memory, and then updating model parameters;
The gradient data is produced by each federal modeling participant performing local forward computation on its local data set, computing the parameter gradients according to the configured loss function, and uploading them to the aggregation node. The central aggregation node can follow the FedAvg algorithm, taking a weighted average of the gradient values from the nodes to update the global model, which is the usual processing in federated learning; a sketch is given below.
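A minimal sketch of such FedAvg-style aggregation, under the assumption that gradients are plain NumPy arrays keyed by parameter name and that weights are proportional to each node's local sample count (the patent leaves these details open):

```python
import numpy as np

def fedavg_aggregate(node_gradients, node_sample_counts):
    """Weighted average of per-node gradients, as done by the aggregation node."""
    total = sum(node_sample_counts.values())
    agg = {}
    for node, grads in node_gradients.items():
        w = node_sample_counts[node] / total
        for name, g in grads.items():
            agg[name] = agg.get(name, 0.0) + w * g
    return agg

def apply_update(params, agg_grads, lr=1e-3):
    """Update the global model parameters with the aggregated gradient."""
    return {name: p - lr * agg_grads[name] for name, p in params.items()}
```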
Step 207, model distribution: the aggregation node puts the updated model into a public area, encrypts the model by using a random number seed and a global clock, and prepares the model for model distribution for subsequent training;
Step 208, downloading and training the updated model: each joint modeling participant downloads the new model from the aggregation node; the aggregation node either sends the updated global model directly to the participants or, to speed up execution, lets the participants pull it; the model is then used for a new round of training until it reaches a convergence state;
Step 209, intermediate result encryption storage: the task is triggered periodically; as the aggregation node trains through successive iterations, it generates three copies of the intermediate results (the intermediately aggregated model parameters and data such as the training status) and stores them encrypted under a random key; the random key is then split into fragments with Shamir's Secret Sharing algorithm and distributed to all joint modeling participants for future key recovery;
The Shamir secret sharing algorithm divides secret information into multiple parts, each distributed to a joint modeling participant, such that the original secret can be restored only when a certain threshold of parts is gathered. The algorithm was proposed by Adi Shamir in 1979 as an application of threshold cryptography.
The following is the basic principle of Shamir secret sharing algorithm:
secret segmentation: assuming a secret S, the algorithm splits this secret into N parts, where at least K parts are needed to recover the original secret;
Generating a polynomial: a polynomial of degree K-1 is constructed whose constant term is the secret S and whose remaining coefficients are chosen at random from a finite field (e.g. the integers modulo a prime);
Calculating the parts: several points on the polynomial are computed by choosing distinct X values, each point corresponding to one part of the secret;
Distributing the parts: the point coordinates are distributed as parts of the secret to the different joint modeling participants, each of whom knows only the coordinate values it holds;
Restoring the secret: at least K distinct parts are needed; from them the original polynomial is recovered by interpolation, yielding the original secret.
Shamir's scheme provides good security and resilience: the secret can be restored only when the threshold K is reached. It is widely used in key management, data storage and the security of distributed systems; a minimal sketch follows this description.
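A compact sketch of (K, N) Shamir secret sharing over a prime field; the particular prime (2^127 − 1) and the integer encoding of the secret are illustrative choices, since the text only requires a finite field:

```python
import random

PRIME = 2**127 - 1

def split_secret(secret: int, k: int, n: int):
    """Split `secret` into n parts, any k of which recover it."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    def f(x):  # evaluate the degree-(k-1) polynomial at x
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, f(x)) for x in range(1, n + 1)]

def recover_secret(shares):
    """Lagrange interpolation at x = 0 from at least k parts."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

shares = split_secret(123456789, k=3, n=5)
assert recover_secret(shares[:3]) == 123456789
```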
Step 210, abnormal condition handling: when an abnormal condition occurs, the joint modeling participants jointly contribute the secret fragments they hold and perform the decryption operation so as to recover the intermediate results and ensure the continuity and reliability of training.
An abnormal condition refers to the single-point-of-failure risk carried by the aggregation node as the central node, caused for example by network or hardware failure; because training a large language model usually takes a long time and substantial computing resources, failure of the aggregation node or disconnection of many participating modeling nodes can occur and must be planned for.
Step 3 specifically comprises the following steps:
Step 301, after the large language model is trained in combination with a pre-training task, the large language model is deployed into an actual application environment for use;
step 302, collecting feedback data, continuously increasing training data, and continuously improving model precision and training efficiency.
The invention has the beneficial effects that:
The invention is intended for the practical scenario of multi-party federal modeling for large language model pre-training. It makes full use of RDMA and CXL technologies, builds a cross-domain Trusted Execution Environment (TEE) cluster in a distributed environment, divides the memory into a shared area and a private area, and pools the shared areas into one large memory that can hold the large language model, its training data and the intermediate training results, thereby overcoming the communication bottleneck and insufficient resource utilization of trusted modeling at large-model data scale. The method ensures effective utilization and isolation of resources, reduces memory latency, increases pre-training speed, and strengthens data confidentiality, integrity and security.
The invention adopts the Remote Direct Memory Access (RDMA) protocol to realize high-speed communication between different TEEs, and introduces CXL technology to realize high-speed interconnection between the GPU and the TEEs, reducing memory latency and improving pre-training speed.
To ensure GPU security, a TPM chip is added to the FPGA to form a trusted controller, and authority policies are dynamically assigned according to the training task. This guarantees that the GPU is in a secure state during start-up and held exclusively while it executes, preventing malicious attacks and data tampering; after the task completes, the GPU is routed back to the TEE environment through the FPGA and the sensitive data in the on-chip GPU memory is cleared, solving the problem of residual video memory.
The method introduces a global clock, a multi-party key negotiation mechanism and random number seed distribution, performs sharding and obfuscation of shared-memory data, and combines encryption and MAC authentication to strengthen data confidentiality, integrity and security. In the large-model pre-training stage, public data and private data are selected for federal modeling, and a hierarchical aggregation strategy is adopted so that gradient aggregation is finally performed in the public area, reducing the risk of gradient leakage.
The invention adopts multi-party secret sharing and multi-copy encrypted persistent storage, ensuring the security and recoverability of intermediate data during model training and reducing the risk of data loss.
In conclusion, the method effectively addresses multiple challenges in large language model pre-training, provides a reliable and efficient solution for the practical application scenario of multi-party federal modeling of large models, and has high practical value and broad application prospects.
Drawings
FIG. 1 is a schematic diagram of the federal modeling node composition of a large model of the present invention.
FIG. 2 is a diagram illustrating a memory sharing protocol connection according to the present invention.
FIG. 3 is a schematic diagram of the functional composition of the node of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
As shown in fig. 1, according to the actual requirements of the multi-party federal modeling task for large language model pre-training, a cross-domain Trusted Execution Environment (TEE) cluster is built in a distributed environment and the memory is divided into a shared area and a private area; by pooling the shared areas into one large memory, the large model and its training data are loaded for pre-training, solving the resource limitations of trusted modeling at large-model data scale.
A global clock is set, and fast memory access is realized through the RDMA and CXL protocols, making full use of GPU and TEE resources and effectively reducing the performance bottleneck caused by memory latency. An FPGA controller and a security module are adopted, integrating mechanisms such as multi-party key negotiation, random number seed distribution, data sharding and obfuscation, encryption and MAC authentication, to guarantee GPU trustworthiness and the privacy and security of the distributed training process. In addition, a hierarchical aggregation strategy reduces the risk of gradient leakage, and by setting checkpoints and storing intermediate results with multi-copy encrypted persistence, training reliability is improved.
The main function of the large language model is to process, understand and generate natural language; in the pre-training stage it acquires rich language knowledge and patterns and provides powerful support for various downstream NLP tasks. The joint modeling participants are the organizations or institutions taking part in the joint training of the model; they contribute training data, model parameters and computing resources and cooperatively train the large language model to improve pre-training performance;
The TEE environment provides an isolated and safe computing environment, each TEE cluster node is provided with a private memory area and a shared memory area, and the TEE has the functions of remote authentication, authority control, a security module, security interconnection and the like so as to ensure the security of the computing environment;
the memory division refers to that the TEE cluster nodes divide the memory into a private area and a shared area, so that the isolation and confidentiality of data among different nodes are ensured;
The remote authentication function ensures that the TEE node can perform remote authentication, verifies the identities of other nodes, and ensures that only legal nodes can access the resources; the security module is a built-in security related function of the TEE environment, and comprises encryption and decryption, MAC operation and secret sharing functions, and is used for secure transmission and processing of data and privacy protection.
As shown in fig. 2: the security interconnection is realized by the FPGA controller between the TEE cluster nodes and between the TEE and the GPU, so that encryption and authentication are supported, and the security and authority authentication of high-speed data transmission are ensured.
The FPGA controller realizes GPU trustworthiness and secure communication and authority authentication between the TEE and the outside. It ensures the GPU is in a secure state during start-up by verifying the integrity and correctness of the GPU driver, and keeps the GPU exclusive during execution to prevent malicious attacks and data tampering; after the task completes, the GPU is routed back to the TEE environment through the FPGA and the sensitive data in the on-chip GPU memory is cleared, solving the problem of residual video memory. The controller's main functions include TPM chip management, RDMA memory sharing and CXL secure communication;
TPM chip management means that the TPM chip in the FPGA controller manages the keys and certificates related to trusted computing, ensuring the security and trustworthiness of the TEE cluster;
RDMA memory sharing means that, through the RDMA protocol, the FPGA controller realizes high-speed memory sharing among the TEE clusters and accelerates data transmission and communication;
CXL secure communication means that the FPGA controller introduces CXL technology to realize high-speed and secure interconnection between the GPU and the TEE, improving pre-training speed and reducing memory latency.
The GPU is responsible for the accelerated training of the large language model, improving training efficiency and speed; it is controlled and verified by the FPGA controller to realize GPU trustworthiness.
The TEE cluster consists of multiple Trusted Execution Environments (TEEs) belonging to the different joint modeling participants in the joint training; the federal modeling participants design the interconnection of the TEE nodes to provide remote authentication and permission control. TEE cluster remote authentication ensures that only authenticated nodes can participate in the pre-training process and designates some nodes to interact with the TEE cluster environments of the other, external joint modeling participants. The TEE cluster permission control function dynamically allocates permissions according to the training task and the joint modeling participant, so that each node can only access the data and resources it is authorized for.
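Purely for illustration (the patent prescribes no concrete data structure), a permission table of the kind the authority-control function might maintain could look like the following; the task and participant names are made up:

```python
PERMISSIONS = {
    ("pretrain-task-1", "participant-A"): {"shared_region": {"read", "write"},
                                           "private_region_A": {"read", "write"},
                                           "gpu_pool": {"execute"}},
    ("pretrain-task-1", "participant-B"): {"shared_region": {"read"},
                                           "private_region_B": {"read", "write"}},
}

def authorize(task: str, participant: str, resource: str, action: str) -> bool:
    """Check whether a node may perform `action` on `resource` for this task."""
    return action in PERMISSIONS.get((task, participant), {}).get(resource, set())

assert authorize("pretrain-task-1", "participant-B", "shared_region", "read")
assert not authorize("pretrain-task-1", "participant-B", "shared_region", "write")
```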
As shown in fig. 3: the large language model joint pre-training task runs in the TEE cluster participated by multiple parties and comprises a computing node and an aggregation/management node;
The computing nodes execute model training and training-acceleration tasks; training is carried out in the trusted execution environment, ensuring the security and trustworthiness of model training. The computing nodes are the participating nodes and are responsible for local model training computation.
The aggregation/management node is responsible for management scheduling of large language model distributed training and gradient aggregation and model distribution, including model loading, data confusion, gradient aggregation, parameter updating, model distribution, key management and model persistence;
The model loading is an operation of loading a large language model from a storage medium to a TEE cluster;
Data loading means loading the data required for large language model pre-training into the designated memory of the TEE cluster according to its privacy and security level; data confusion is responsible for jointly obfuscating the data stored in the shared area, so as to remove its attributability and identifiability, and for encrypting it with a unified key;
The gradient aggregation function is responsible for aggregating gradients obtained by calculation of the distributed training TEE nodes so as to update model parameters;
the parameter updating function is to update model parameters by using the gradient obtained by aggregation so as to be used in the next training round;
the model distribution is to distribute updated model parameters to each TEE training node by adopting an RDMA protocol;
The key management function is to introduce a global clock into the system as time reference input, combine random number seeds to generate a shared symmetric key and manage the shared symmetric key;
Model persistence stores intermediate results such as model parameters with multi-copy encrypted persistence, so that training progress can be recovered and the risk of data loss reduced when an abnormal condition occurs during training. The global clock serves as the clock synchronization mechanism of the distributed system, ensuring time consistency across nodes; it is also used for memory region identification and as part of the key material. A sketch of the key derivation and checkpoint persistence follows.
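A minimal sketch of the key-management and persistence functions described above, assuming HKDF-SHA256 for deriving the shared symmetric key from the global clock and a random seed, and AES-GCM for the multi-copy encrypted checkpoints (neither primitive is specified by the patent):

```python
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def shared_key(global_clock: int, seed: int) -> bytes:
    """Symmetric key from the global clock (time reference) and a random-number seed."""
    return HKDF(algorithm=hashes.SHA256(), length=32,
                salt=global_clock.to_bytes(8, "big"),
                info=b"llm-federated-pretraining").derive(seed.to_bytes(8, "big"))

def persist_checkpoint(model_bytes: bytes, copies: int = 3):
    """Encrypt the intermediate model under a fresh random key and store
    `copies` replicas; the key itself would then be split with Shamir sharing."""
    key, nonce = AESGCM.generate_key(bit_length=256), os.urandom(12)
    blob = AESGCM(key).encrypt(nonce, model_bytes, None)
    replicas = [(nonce, blob) for _ in range(copies)]
    return key, replicas   # key goes to split_secret(...) from the sketch above
```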
The process of the method provided by the invention will be described in detail with reference to specific examples.
Step one, preparing a large language model joint pre-training modeling environment:
The preparation of the large language model joint pre-training modeling environment comprises the following steps:
Step 101, confirming a pre-training task: specific tasks of the large language model pre-training are defined, including initial parameter configuration of the model and requirements of training data.
Step 102, determining a joint modeling participant: each joint modeling participant, including model owners and data providers, that participates in the large language model joint pre-training is determined.
Step 103, constructing a cross-domain TEE cluster: in a distributed environment, a cross-domain Trusted Execution Environment (TEE) cluster is built, each joint modeling participant starts a TEE management node, shared areas, private areas and GPU resources are planned in the TEE, and access authority information is loaded into an FPGA for controlling access of RDMA and CXL.
Step 104, setting a global clock: and introducing a global clock as a unified time scale to distribute N random number seeds for all joint modeling participants, so as to ensure the consistency of time and the safety of data.
Step 105, encryption and MAC authentication: the large language model federal pre-training task initiator selects a random seed and obtains the global clock to generate a key, encrypts the initial model, generates a MAC message authentication code, and places the encrypted initial model and the related identification information into the shared memory area in preparation for model distribution.
Step 106, training data loading and processing: each joint modeling participant loads data into a private area and a shared area of the TEE according to data sensitivity, token processing is carried out on the data to form vector representation, a global clock is obtained, and a local random seed is selected to encrypt the data of the shared area so as to enhance the confidentiality of the data.
Step 107, data aggregation and confusion: the large language model federal pre-training task initiator aggregates the data in the TEE shared areas, generates a key from the random number seed number identified in the metadata and the global clock, decrypts the data with that key and obfuscates it to blur the data sources; it then obtains the global clock and selects a local random seed to re-encrypt the shared-area data, generating a public data set that is stored in the shared memory area.
Step 108, common data set pulling: each joint modeling participant pulls the shared region data into the respective local environment through the RDMA protocol in preparation for subsequent training.
Step 109, initializing the GPU: each joint modeling participant initializes the GPU to become a trusted computing resource for accelerating the training process of the model.
The invention provides a large language model federal pre-training method based on a trusted execution environment, which is used for joint pre-training modeling of a large language model and comprises the following steps:
step two, joint pre-training modeling of a large language model:
the joint pre-training modeling of the large language model comprises the following steps:
Step 201, model initialization and memory transfer: each joint modeling participant directly copies the internal memory of the initial model to the internal memory of each TEE cluster through the FPGA controller by adopting an RDMA protocol, so as to ensure that the initial model is loaded into a trusted environment.
Step 202, data selection and preparation: the joint modeling participant randomly selects a part of data from public data, and simultaneously uses private data of a user as training data of the training round.
Step 203, accelerating calculation and data transmission: the FPGA controller locks the GPU for exclusive use; the program segments and training data that require accelerated computation are routed through the FPGA, decrypted, and transmitted to the GPU over CXL for high-speed computation and processing.
Step 204, distributed parallel training: in the local TEE cluster, multiple groups of data are adopted for distributed parallel training, the joint modeling participants process the data of the public area and the private area by using different strategies, gradient is calculated, gradient aggregation is completed in the TEE cluster of the private area, encryption is performed by a global clock and a selected random seed generation key, the result is stored in a public shared memory, and each joint modeling participant is informed.
Step 205, clearing sensitive memory: after the calculation of the round is completed, the FPGA controller is utilized to execute the clearing operation of the sensitive memory on the GPU chip so as to ensure the data security.
Step 206, global aggregation and model update: and the aggregation node is responsible for aggregating the gradient data in the shared memory and then updating the model parameters.
Step 207, model distribution: the aggregation node places the updated model into a common area and encrypts it using a random number seed and a global clock, ready for model distribution for use in subsequent training.
Step 208, downloading and training the updated model: and downloading a new model by each joint modeling party for a new round of model training until the model reaches a convergence state.
Step 209, intermediate result encryption storage: the aggregation node generates three copies of data such as a model of an intermediate result, training conditions and the like, and encrypts and stores the data by using a random key; then, a Shamir secret sharing algorithm is adopted to distribute random key fragments to all joint modeling participants for future key recovery.
Step 210, abnormal condition handling: when an abnormal condition occurs, the joint modeling participants jointly contribute the secret fragments they hold and decrypt them to recover the intermediate results, ensuring the continuity and reliability of training.
Step three, optimizing a combined pre-training large language model:
the optimization of the joint pre-trained large language model comprises the following steps:
Step 301, after the large language model is trained in combination with a pre-training task, the large language model is deployed into an actual application environment for use;
step 302, collecting feedback data, continuously increasing training data, and continuously improving model precision and training efficiency.
The above example is only one specific embodiment of the present invention; ordinary changes and substitutions made by those skilled in the art within the scope of the technical solution of the present invention shall fall within the protection scope of the present invention.

Claims (5)

1. The large language model federal pre-training method based on the trusted execution environment is characterized by comprising the following steps of:
step 1, creating a large language model joint pre-training task, determining joint modeling participants, preparing data, and creating computing storage network resources for joint modeling;
Step 2, performing joint pre-training of a large language model;
Step 3, optimizing the large language model obtained by the combined pre-training;
Step 1 specifically comprises the following steps:
Step 101, confirming a pre-training task: defining a specific task of pre-training of the large language model, wherein the specific task comprises initial parameter configuration of the large language model and requirement of training data;
Step 102, determining a joint modeling participant: determining each joint modeling participant participating in the joint pre-training of the large language model, including an owner of the large language model and a data provider;
Step 103, constructing a cross-domain TEE cluster: in a distributed environment, building a cross-domain trusted execution environment cluster; each joint modeling participant starts a TEE management node, plans a shared area, a private area and GPU resources in the TEE, and loads access authority information into an FPGA (field programmable gate array) for controlling RDMA (remote direct memory access) and CXL (Compute Express Link) access;
Step 104, setting a global clock: introducing a global clock as a unified time scale to distribute N random number seeds for all joint modeling participants, so as to ensure the consistency of time and the safety of data;
Step 105, encryption and MAC authentication: the large language model federal pre-training task initiator selects a random seed and obtains the global clock to generate a key, encrypts an initial model, generates a MAC message authentication code at the same time, and puts the encrypted initial model and related identification information into a shared memory area to prepare for model distribution;
The shared memory area is divided into a private area and a shared public area; the private area stores the sensitive private data of the joint modeling participants participating in federal modeling; the shared public area is used for efficient and rapid sharing of non-sensitive data, modeling-process metadata, global model parameters and public data sets;
Step 106, training data loading and processing: each joint modeling participant loads data into a private area and a shared area of the TEE according to data sensitivity, performs Token processing on the data to form vector representation, acquires a global clock, and selects a local random seed to encrypt the data of the shared area;
Step 107, data aggregation and confusion: the task initiator aggregates the data of the TEE shared area, generates a secret key according to the random number seed number identified by the metadata and the global clock, decrypts the data by using the secret key, obfuscates the data source, selects a local random seed to encrypt the data of the shared area again by acquiring the global clock, generates a public data set, and stores the public data set in the shared memory area;
Step 108, common data set pulling: each joint modeling participant pulls the data of the shared area into the respective local environment through an RDMA protocol to prepare for subsequent training;
step 109, initializing the GPU: each joint modeling participant initializes the GPU to become a trusted computing resource for accelerating the training process of the model.
2. The large language model federal pre-training method based on a trusted execution environment according to claim 1, wherein said step 107 comprises:
Data aggregation: the method comprises the steps that a task initiator obtains data from a TEE sharing area of each joint modeling participant, the task initiator generates a secret key by using a global clock and a random number seed number identified by metadata, the task initiator decrypts the data obtained from each TEE by using the generated secret key, and the original encrypted data is restored to an original state;
Data confusion: on the basis of decryption, the task initiator confuses the data;
and (5) re-encrypting: re-encrypting the confused data, and generating a new key again by using the global clock and a random seed local to the task initiator;
a common dataset is generated: the processed, obfuscated and re-encrypted data is combined into a common data set.
3. The large language model federal pre-training method based on trusted execution environment according to claim 1, wherein the step 2 specifically comprises the steps of:
Step 201, model initialization and memory transfer: each joint modeling participant directly copies the memory of the initial model to the memory of each TEE cluster through the FPGA controller by adopting an RDMA protocol, so as to ensure that the initial model is loaded into a trusted environment;
step 202, data selection and preparation: the joint modeling participant randomly selects a part of data from public data, and simultaneously uses private data of a user as training data of the training round;
The public data refers to users' non-sensitive data, desensitized versions of such data, or already-public external data; each user sends this data to the central node, where it is obfuscated to form the basic public data set;
Step 203, accelerating calculation and data transmission: the FPGA controller locks the GPU for exclusive use; the program segments and training data that require accelerated computation are routed through the FPGA, decrypted, and transmitted to the GPU over CXL for computation and processing;
Step 204, distributed parallel training: within the local TEE cluster, distributed parallel training is carried out on multiple groups of data; the joint modeling participant processes public-area and private-area data with different strategies, computes gradients, completes gradient aggregation inside the private-area TEE cluster, encrypts the result with a key generated from the global clock and a selected random seed, stores it into the shared memory of the public area, and notifies each joint modeling participant;
The shared memory is the pooled region: the shared areas of the nodes together form one large memory that is read and written directly, improving processing speed;
Step 205, clearing sensitive memory: after the calculation of the round is completed, the FPGA controller is utilized to execute the clearing operation of the sensitive memory on the GPU chip so as to ensure the data security;
Step 206, global aggregation and model update: the aggregation node is responsible for aggregating gradient data in the shared memory and then updating model parameters;
the gradient data is obtained by a joint modeling participant of federal modeling by utilizing a local data set to execute local forward calculation, then calculating the gradient of parameters according to a set loss function, and uploading the gradient to an aggregation node;
Step 207, model distribution: the aggregation node puts the updated model into a public area, encrypts the model by using a random number seed and a global clock, and prepares the model for model distribution for subsequent training and use;
Step 208, downloading and training the updated model: each joint modeling participant downloads the new model from the aggregation node; the aggregation node either sends the updated global model directly to the participants or, to speed up execution, lets the participants pull it; the model is then used for a new round of training until it reaches a convergence state;
Step 209, intermediate result encryption storage: the task is triggered periodically; as the aggregation node trains through successive iterations, three copies of the intermediate results (the intermediately aggregated model parameters and the training status data) are generated and stored encrypted with a random key; then a Shamir secret sharing algorithm is used to distribute random key fragments to all joint modeling participants in preparation for future key recovery;
step 210, abnormal condition processing: when abnormal conditions occur, each joint modeling participant jointly restores secret fragments owned by each joint modeling participant to carry out decryption operation so as to restore intermediate results and ensure the continuity and reliability of training.
4. The large language model federal pre-training method based on a trusted execution environment according to claim 3, wherein the FPGA controller realizes GPU trustworthiness and secure communication and authority authentication between the TEE and the outside, and the FPGA controller comprises TPM chip management, RDMA memory sharing and CXL secure communication;
the TPM chip management is that the TPM chip in the FPGA controller is used for managing related keys and certificates of trusted computing, so that the safety and the credibility of the TEE cluster are ensured;
The RDMA memory sharing is realized by an RDMA protocol, and the FPGA controller realizes high-speed memory sharing among TEE clusters and accelerates data transmission and communication;
The CXL safety communication realizes high-speed and safe interconnection between the GPU and the TEE, improves the pre-training speed and reduces the memory delay;
The GPU is responsible for performing accelerated training of a large language model, improving training efficiency and speed, and is controlled and verified by the FPGA controller to realize GPU credibility.
5. The large language model federal pre-training method based on a trusted execution environment according to claim 3, wherein said step 3 comprises the following steps:
Step 301, after the large language model is trained in combination with a pre-training task, the large language model is deployed into an actual application environment for use;
step 302, collecting feedback data, continuously increasing training data, and continuously improving model precision and training efficiency.
CN202410117882.3A 2024-01-29 2024-01-29 Large language model federal pre-training method based on trusted execution environment Active CN117648998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410117882.3A CN117648998B (en) 2024-01-29 2024-01-29 Large language model federal pre-training method based on trusted execution environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410117882.3A CN117648998B (en) 2024-01-29 2024-01-29 Large language model federal pre-training method based on trusted execution environment

Publications (2)

Publication Number Publication Date
CN117648998A (en) 2024-03-05
CN117648998B (en) 2024-04-26

Family

ID=90045405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410117882.3A Active CN117648998B (en) 2024-01-29 2024-01-29 Large language model federal pre-training method based on trusted execution environment

Country Status (1)

Country Link
CN (1) CN117648998B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113487042A (en) * 2021-06-28 2021-10-08 海光信息技术股份有限公司 Federated learning method and device and federated learning system
WO2023035507A1 (en) * 2021-09-07 2023-03-16 天翼电子商务有限公司 Trusted executive environment multi-node authentication method
CN114254386A (en) * 2021-12-13 2022-03-29 北京理工大学 Federated learning privacy protection system and method based on hierarchical aggregation and block chain
CN116611124A (en) * 2023-05-25 2023-08-18 南京恒木兴信息科技有限公司 GPU trusted execution environment construction method, system and data transmission method
CN116402169A (en) * 2023-06-09 2023-07-07 山东浪潮科学研究院有限公司 Federal modeling verification method, federal modeling verification device, federal modeling verification equipment and storage medium
CN116957109A (en) * 2023-07-31 2023-10-27 山东浪潮科学研究院有限公司 Model construction method, device, equipment and medium based on federal learning
CN117216788A (en) * 2023-08-26 2023-12-12 北京工业大学 Video scene identification method based on federal learning privacy protection of block chain
CN117332247A (en) * 2023-12-01 2024-01-02 苏州大学 Big data transaction and quality assessment method and system using big language model as medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Honeycomb: Secure and Efficient GPU Executions via Static Validation; Mai, Haohui et al.; Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2023); 2023-10-16; entire document *
一种GPU及深度置信网络的语音识别加速算法研究 (Research on a speech recognition acceleration algorithm based on GPU and deep belief networks); 景维鹏; 姜涛; 朱良宽; 刘美玲; 小型微型计算机系统 (Journal of Chinese Computer Systems); 2018-03-15 (03); entire document *
基于秘密共享的车联网公链隐私数据处理方法 (A privacy data processing method for Internet-of-Vehicles public blockchains based on secret sharing); 吴冰清; 中国优秀硕士学位论文全文数据库 工程科技Ⅱ辑 (China Master's Theses Full-text Database, Engineering Science and Technology II); 2023-03-15; entire document *

Also Published As

Publication number Publication date
CN117648998A (en) 2024-03-05

Similar Documents

Publication Publication Date Title
Wallden et al. Cyber security in the quantum era
Bonawitz et al. Practical secure aggregation for privacy-preserving machine learning
Choudhuri et al. Fairness in an unfair world: Fair multiparty computation from public bulletin boards
Gordon et al. Secure two-party computation in sublinear (amortized) time
US8171306B2 (en) Universal secure token for obfuscation and tamper resistance
CN110892396A (en) Method and apparatus for efficiently implementing distributed databases within a network
Kermanshahi et al. Multi-client cloud-based symmetric searchable encryption
CN110784320A (en) Distributed key implementation method and system and user identity management method and system
CN108111540A (en) The hierarchical access control system and method for data sharing are supported in a kind of cloud storage
Lyu et al. Towards fair and decentralized privacy-preserving deep learning with blockchain
CN111181719B (en) Hierarchical access control method and system based on attribute encryption in cloud environment
Chandran et al. Efficient, constant-round and actively secure MPC: beyond the three-party case
Chen et al. Lightweight privacy-preserving training and evaluation for discretized neural networks
Han et al. Access control mechanism for the Internet of Things based on blockchain and inner product encryption
CN108763944B (en) Multi-center large-attribute domain attribute-based encryption method capable of being safely revoked in fog computing
Dolev et al. SodsBC: a post-quantum by design asynchronous blockchain framework
Tate et al. Mobile Agent Security Through Multi-Agent Cryptographic Protocols.
Jarrous et al. Canon-mpc, a system for casual non-interactive secure multi-party computation using native client
CN117648998B (en) Large language model federal pre-training method based on trusted execution environment
CN117034287A (en) Multiparty joint modeling safety enhancement method based on privacy computing interconnection and interworking technology
Turesson et al. Privacy preserving data mining as proof of useful work: Exploring an AI/Blockchain design
Kaptchuk et al. Managing Secrets with Consensus Networks: Fairness, Ransomware and Access Control.
CN116415267A (en) Iterative updating method, device and system for joint learning model and storage medium
Fons et al. A modular reconfigurable and updateable embedded cyber security hardware solution for automotive
Hahn et al. Verifiable outsourced decryption of encrypted data from heterogeneous trust networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant