CN117648998A - Large language model federal pre-training method based on trusted execution environment - Google Patents
- Publication number
- CN117648998A (application number CN202410117882.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- training
- model
- language model
- tee
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
Abstract
The invention discloses a large language model federated pre-training method based on a trusted execution environment, comprising the following steps: step 1: create a large language model joint pre-training task, determine the joint modeling participants, prepare the data, and create the computing, storage, and network resources for joint modeling; step 2: perform joint pre-training of the large language model; step 3: optimize the large language model obtained from joint pre-training. The method targets the practical scenario of multi-party federated modeling for large language model pre-training. It makes full use of RDMA and CXL technologies to build a cross-domain trusted execution environment cluster in a distributed environment, divides memory into a shared area and a private area, and combines the shared areas into one large memory capable of holding the large language model, its training data, and intermediate training results, thereby addressing the communication bottleneck and resource under-utilization of trusted modeling at the data scale of large models.
Description
Technical Field
The invention belongs to the technical field of large language model federal pre-training, and particularly relates to a large language model federal pre-training method based on a trusted execution environment.
Background
Against the background of the rapid development of artificial intelligence, natural language processing, as a major direction of artificial intelligence, has shown broad application prospects in fields such as machine translation, sentiment analysis, intelligent dialogue, and text generation, and continues to empower the industrial development of healthcare, finance, government affairs, and other sectors.
Pre-trained large models are a key technology of natural language processing: by pre-training on large-scale data, a model can learn rich linguistic knowledge and patterns, providing strong support for a variety of downstream tasks. However, this technology also faces a serious set of challenges, concerning not only technical aspects but also data privacy, computational efficiency, security, and other critical areas.
First, the data needed for pre-training large language models is enormous in scale, spanning multiple data sources and data owners, and may contain sensitive information such as personal privacy and business secrets; how to train large models under multi-party collaboration while guaranteeing data privacy against disclosure and abuse therefore becomes a critical issue.
Second, the pre-training process of a large language model requires vast storage space, for example for the training datasets, the large language model itself, and intermediate training results, placing high demands on computing and storage resources; with limited resources, training slows down or the training requirements cannot be met at all. Meanwhile, GPU acceleration plays an important role in training large NLP language models, yet the trustworthiness of the GPU is open to challenge and may introduce security risks, so how to ensure GPU trustworthiness and security needs to be solved. In addition, the pre-training process of a large language model is long; node failures and abnormal interruptions can interrupt the training process, and because the data resides in memory, the risk of losing intermediate data is significant.
Large language models are based on the Transformer architecture. The idea of pre-training is that model parameters are no longer initialized randomly; instead, a set of parameters is first obtained by pre-training on a task, the model is initialized with those parameters, and training then continues. Large language model pre-training mainly predicts the next word from the preceding words and belongs to unsupervised pre-training; it covers autoregressive (AR) models (i.e., models that learn from left to right), in-context learning techniques, and so on. Since the advent of ChatGPT, once the pre-training of a large language model is complete, further fine-tuning can be performed using supervised learning, reward models, and reinforcement learning.
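The autoregressive objective described above can be illustrated with a deliberately tiny sketch: the training signal is simply "predict the next token from the tokens before it". Real large language models use Transformer networks; this bigram frequency counter is only a minimal stand-in for that objective, and all function names here are illustrative.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count next-token frequencies for each preceding token."""
    model = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model

def predict_next(model, token):
    """Return the most frequent continuation of `token`, or None."""
    if token not in model:
        return None
    return model[token].most_common(1)[0][0]
```

A Transformer replaces the frequency table with a learned neural conditional distribution, but the left-to-right prediction target is the same.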
In this context, how to make full use of frontier technologies such as the Trusted Execution Environment (TEE), Compute Express Link (CXL), and Remote Direct Memory Access (RDMA), combined with means such as key management, encrypted transmission, and data isolation, to perform efficient multi-party joint modeling for the large language model pre-training scenario under compliance constraints, becomes a problem to be solved.
Disclosure of Invention
In order to overcome the shortcomings of the prior art, the invention aims to provide a large language model federated pre-training method based on a trusted execution environment. Applied to the practical scenario of multi-party federated modeling for large language model pre-training, it makes full use of RDMA and CXL technologies, builds a cross-domain Trusted Execution Environment (TEE) cluster in a distributed environment, and divides memory into a shared area and a private area; by combining the shared areas into one large memory that can hold the large language model, its training data, and intermediate training results, it overcomes the communication bottleneck and resource under-utilization of trusted modeling at the data scale of large models.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a large language model federation pre-training method based on a trusted execution environment comprises the following steps:
step 1, creating a large language model joint pre-training task, determining joint modeling participants, preparing data, and creating computing storage network resources for joint modeling;
step 2, performing joint pre-training of a large language model;
and 3, optimizing the large language model obtained by the combined pre-training.
The step 1 specifically comprises the following steps:
step 101, confirming a pre-training task: defining a specific task of pre-training of the large language model, wherein the specific task comprises initial parameter configuration of the large language model and requirement of training data;
step 102, determining a joint modeling participant: determining each joint modeling participant participating in the joint pre-training of the large language model, including an owner of the large language model and a data provider;
step 103, constructing a cross-domain TEE cluster: in a distributed environment, building a cross-domain Trusted Execution Environment (TEE) cluster, starting a TEE management node by each joint modeling participant, planning a shared area, a private area and GPU resources in the TEE, and loading access authority information into an FPGA (field programmable gate array) for controlling access of RDMA (remote direct memory access) and CXL (virtual local area network);
step 104, setting a global clock: introducing a global clock as a unified time scale to distribute N random number seeds for all joint modeling participants, so as to ensure the consistency of time and the safety of data;
step 105, encryption and MAC authentication: the large language model federal pre-training task initiator selects a random seed to acquire a global clock generation key, encrypts an initial model, generates an MAC message authentication code at the same time, and puts the encrypted initial model and related identification information into a shared memory area to prepare for model distribution;
the shared memory area is divided into a private area and a shared public area, private data of participants participating in federal modeling joint modeling are stored in the private area and are sensitive data, and the data cannot go out of the domain; the shared public area is used for efficient and rapid sharing of non-sensitive data, metadata in a modeling process, global model parameters and public data sets;
step 106, training data loading and processing: each joint modeling participant loads data into a private area and a shared area of the TEE according to data sensitivity, performs Token processing on the data to form vector representation, acquires a global clock, and selects a local random seed to encrypt the data of the shared area so as to enhance the confidentiality of the data;
step 107, data aggregation and confusion: the task initiator aggregates the data of the TEE shared area, generates a secret key according to the random number seed number identified by the metadata and the global clock, decrypts the data by using the secret key, obfuscates the data source, and encrypts the data of the shared area again by acquiring the global clock, selecting a local random seed to generate a public data set and storing the public data set in the shared memory area;
the step 107 specifically includes:
data aggregation: the task initiator obtains data (the data are preprocessed, token-based and encrypted vector representations) from the TEE shared areas of the joint modeling participants, the task initiator generates a key by using a global clock and a random number seed number identified by metadata, and decrypts the data obtained from the TEEs by using the generated key, and the step restores the original encrypted data to the original state;
data confusion: based on decryption, the task initiator can mix the data (the purpose of mixing is to blur the source of the data and increase the privacy of the data), the mixing process can adopt different technologies, can mix the data in a sampling mode, can also adopt a duplication removing mode to mix the data together, and can adopt technologies of adding some noise, disturbing the data or introducing other randomness.
And (5) re-encrypting: the obfuscated data is re-encrypted, again using the global clock and a random seed local to the task initiator to generate a new key, the encryption ensuring security of the obfuscated data during storage, while still being able to be decrypted in the next step and used for further training of the model.
A common dataset is generated: the processed, obfuscated and re-encrypted data is combined into a common data set. ( A common dataset is a collection that contains information from different joint modeling participants, but the specific source of individual data therein has become ambiguous due to the presence of confusion. The common data set is stored in a shared memory area so that other joint modeling participants can access and use it. )
The goal of this step is to ensure that in federal pre-training, the data can be co-modeled without exposing individual privacy. The encryption and obfuscation process provides additional security in the transmission and storage of data, while the model can obtain more comprehensive information from many aspects by aggregating the data.
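A minimal sketch of the obfuscation sub-step, assuming pooled records are tuples of floats gathered from all participants. The de-duplicate, shuffle, and optional-noise pipeline mirrors the options listed above; all names and parameters are illustrative.

```python
import random

def obfuscate(records, noise_scale=0.0, rng=None):
    """Blur the per-participant origin of pooled records.

    records: list of tuples of floats, pooled from all participants.
    noise_scale: std-dev of optional Gaussian noise added per element.
    """
    rng = rng or random.Random(0)
    pooled = list(dict.fromkeys(records))  # de-duplicate, keeping first occurrence
    rng.shuffle(pooled)                    # break any source-based ordering
    if noise_scale:
        pooled = [tuple(x + rng.gauss(0, noise_scale) for x in row)
                  for row in pooled]
    return pooled
```

After this step a record can no longer be attributed to a particular participant by its position or by exact duplication across contributions.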
Step 108, public dataset pulling: each joint modeling participant pulls the shared-area data into its local environment via the RDMA protocol in preparation for subsequent training;
Step 109, initializing the GPU: each joint modeling participant initializes its GPU as a trusted computing resource to accelerate the training of the large language model.
The step 2 specifically comprises the following steps:
step 201, model initialization and memory transfer: each joint modeling participant directly copies the internal memory of the initial model to the internal memory of each TEE cluster through an FPGA controller by adopting an RDMA protocol, so as to ensure that the model is loaded into a trusted environment;
step 202, data selection and preparation: the joint modeling participant randomly selects a part of data from public data, and simultaneously uses private data of a user as training data of the training round;
the public data refers to user non-sensitive data or corresponding data desensitization of the user non-sensitive data or data from outside which has been disclosed, wherein the data are basic public data sets which are formed after data confusion is carried out by sending the data to a central node by each user; mainly to accelerate the convergence speed of federal learning (essentially solving the problem of nonid), the distribution of the common dataset is the same as the final joint modeling object-oriented data distribution.
Step 203, accelerated computation and data transmission: the FPGA controller locks the GPU for exclusive use; the program portions and training data that require accelerated computation are routed through the FPGA, decrypted, and transmitted to the GPU via CXL for high-speed computation and processing;
Step 204, distributed parallel training: within the local TEE cluster, distributed parallel training is performed with multiple groups of data; the joint modeling participant processes public-area and private-area data with different strategies, computes gradients, completes gradient aggregation within the private-area TEE cluster, encrypts the result with a key generated from the global clock and a selected random seed, stores it in the public shared memory, and notifies the other joint modeling participants;
the shared memory is a region that all nodes together form into one large memory and read and write directly, improving processing speed;
Step 205, clearing sensitive memory: after this round of computation, the FPGA controller performs a clearing operation on the sensitive memory of the GPU chip to ensure data security;
step 206, global aggregation and model update: the aggregation node is responsible for aggregating gradient data in a shared memory, and then updating model parameters;
the gradient data is obtained by a joint modeling participant of federal modeling by using a local data set to perform local forward calculation, then calculating the gradient of parameters according to a set loss function, and uploading the gradient to an aggregation node. The FedAvg algorithm can be consulted, the central aggregation node carries out weighted average on gradient parameter values from each node, and a global model is updated, so that the method is a common processing mode of the federal learning algorithm.
Step 207, model distribution: the aggregation node places the updated model into the public area and encrypts it using a random seed and the global clock, preparing it for distribution in subsequent training;
Step 208, downloading and training with the updated model: each joint modeling participant downloads the new model; the aggregation node updates the global model and sends it directly to the participants, with execution accelerated by having the participants pull the model, and it is used for new rounds of training until the model reaches convergence;
Step 209, encrypted storage of intermediate results: the task is triggered on a timer; as the aggregation node trains by continuous iteration, three copies are generated of the intermediate-result model (the continuously aggregated parameters) and of data such as the training state, and these are stored encrypted with a random key; then Shamir's Secret Sharing algorithm is used to distribute fragments of the random key to the joint modeling participants for future key recovery;
the Shamir secret sharing algorithm is a method that restores the original secret only if a certain threshold is reached by dividing the secret information into multiple parts, each of which is distributed to the joint modeling participants. This algorithm was proposed by Adi Shamir in 1979 as an application of threshold cryptography.
The following is the basic principle of Shamir secret sharing algorithm:
secret segmentation: assuming a secret S, the algorithm splits this secret into N parts, where at least K parts are needed to recover the original secret;
generating a polynomial: starting from a polynomial of high degree, the constant term of which is the secret S, the coefficients of which are randomly chosen from a finite field (e.g. integer modulus);
calculating the segmentation: calculating coordinates of a plurality of points by selecting different X values on the polynomial, each coordinate corresponding to a secret portion;
distribution portion: distributing the coordinates of the points as part of a secret to different joint modeling participants, each joint modeling participant only knowing the coordinate values they hold;
restoring the secret: at least K different parts are needed from which the original polynomial is restored using interpolation to obtain the original secret.
This method of Shamir provides a good security and resilience, and only when the threshold K is reached can the secret be restored. This approach has wide application in key management, data storage, and security of distributed systems.
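The five steps above can be sketched directly. This is a toy implementation over a 127-bit prime field; a production deployment would use a vetted library, authenticated shares, and keys no larger than the field modulus.

```python
import random

PRIME = 2**127 - 1  # Mersenne prime used as the finite-field modulus

def _eval_poly(coeffs, x):
    """Evaluate a polynomial (constant term first) at x, mod PRIME."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % PRIME
    return acc

def split_secret(secret, n, k):
    """Split `secret` into n shares; any k of them recover it."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    return [(x, _eval_poly(coeffs, x)) for x in range(1, n + 1)]

def recover_secret(shares):
    """Lagrange interpolation at x = 0 recovers the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

In the scheme of step 209, `secret` would be the random checkpoint-encryption key, `n` the number of joint modeling participants, and `k` the threshold needed to restore training after a failure.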
Step 210, abnormal-condition handling: when an abnormal condition occurs, the joint modeling participants jointly submit the secret fragments they hold and perform the decryption operation to recover the intermediate results, ensuring the continuity and reliability of training.
An abnormal condition refers to situations such as network faults or hardware failures in which the aggregation node, as a central node, carries a single-point-of-failure risk. Training a large language model takes long periods and substantial computing resources, so the aggregation node may fail or a large number of participating modeling nodes may disconnect, and such situations must be planned for.
The step 3 specifically comprises the following steps:
Step 301, after the joint pre-training task is completed, deploying the large language model into the actual application environment for use;
Step 302, collecting feedback data, continuously enlarging the training data, and continuously improving model accuracy and training efficiency.
The invention has the beneficial effects that:
the invention aims to be applied to the actual scene of the pretrained multiparty federal modeling of the large language model, fully utilizes RDMA and CXL technologies, builds a cross-domain Trusted Execution Environment (TEE) cluster in a distributed environment, divides a shared area and a private area for a memory, and can accommodate the large language model, training data and intermediate training results thereof by forming the shared area into the large memory, thereby overcoming the problem of communication bottleneck and resource utilization deficiency of the trusted modeling under the large data scale of the large model. The method can ensure effective utilization and isolation of resources, reduce memory delay, promote pre-training speed and enhance data confidentiality, integrity and safety.
The invention adopts Remote Direct Memory Access (RDMA) protocol, realizes high-speed communication between different TEEs, introduces CXL technology, realizes high-speed interconnection of GPU and TEEs, reduces memory delay and improves pre-training speed.
In order to ensure the safety of the GPU, a TPM chip is added in the FPGA to form a trusted controller, and authority strategies are dynamically distributed according to training tasks, so that the GPU is ensured to be in a safe state in the starting process and to be monopolized in the executing process of the GPU, malicious attacks and data tampering are prevented, meanwhile, the GPU is routed to a TEE environment through the FPGA after the tasks are executed, sensitive data of a memory on the GPU chip are cleared, and the problem of residual video memory is solved.
The method introduces a global clock, a multiparty key negotiation mechanism and random number seed distribution, performs fragmentation confusion on shared memory data, and combines encryption and MAC authentication to enhance data confidentiality, integrity and security. In the large model pre-training stage, public data and private data are selected for federal modeling, and a hierarchical aggregation strategy is adopted, so that gradient aggregation is finally carried out in a public area, and the risk of gradient leakage is reduced.
The invention adopts multiparty secret sharing and multicopy encryption storage persistence mode, ensures the safety and restorability of intermediate data in the model training process, and reduces the risk of data loss.
In conclusion, the method can effectively address a plurality of challenges in the pretraining of the large language model, provides a reliable and efficient solution for the practical application scene of the large model multi-party federal modeling, and has higher practical value and wide application prospect.
Drawings
FIG. 1 is a schematic diagram of the federal modeling node composition of a large model of the present invention.
FIG. 2 is a diagram illustrating a memory sharing protocol connection according to the present invention.
FIG. 3 is a schematic diagram of the functional composition of the node of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
As shown in fig. 1, according to the actual requirements of the multi-party federated modeling task for large language model pre-training, a cross-domain Trusted Execution Environment (TEE) cluster is built in a distributed environment and memory is divided into shared and private areas; the shared areas are combined into one large memory into which the large model and its training data are loaded for pre-training, solving the resource limitations of trusted modeling at the data scale of large models.
A global clock is set, and high-speed memory access is achieved through the RDMA and CXL protocols, fully utilizing GPU and TEE resources and effectively reducing the performance bottleneck caused by memory latency. An FPGA controller and a security module are adopted, integrating mechanisms such as multi-party key negotiation, random-seed distribution, data fragmentation and obfuscation, encryption, and MAC authentication, ensuring GPU trustworthiness and the privacy and security of the distributed training process. In addition, a hierarchical aggregation strategy reduces the risk of gradient leakage, and by setting checkpoints with multi-copy encrypted persistent storage of intermediate results, training reliability is improved.
The main function of the large language model is to process, understand, and generate natural language; in the pre-training stage it acquires rich linguistic knowledge and patterns, providing powerful support for various downstream NLP tasks. The joint modeling participants are the organizations or institutions participating in joint training of the model; they contribute training data, model parameters, and computing resources, and cooperatively train the large language model to improve pre-training performance.
The TEE environment provides an isolated and secure computing environment; each TEE cluster node has a private memory area and a shared memory area, and the TEE provides remote attestation, permission control, a security module, secure interconnection, and other functions to ensure the security of the computing environment.
Memory division means that each TEE cluster node divides its memory into a private area and a shared area, ensuring the isolation and confidentiality of data between different nodes.
The remote attestation function ensures that a TEE node can perform remote attestation and verify the identities of other nodes, so that only legitimate nodes can access the resources. The security module is the set of security-related functions built into the TEE environment, including encryption and decryption, MAC computation, and secret sharing, used for secure transmission and processing of data and for privacy protection.
As shown in fig. 2, secure interconnection is implemented by the FPGA controller between TEE cluster nodes and between the TEE and the GPU; it supports encryption and authentication, ensuring the security and permission authentication of high-speed data transmission.
The FPGA controller realizes GPU trustworthiness as well as secure communication and permission authentication between the TEE and the outside. It ensures that the GPU is in a secure state during startup by verifying the integrity and correctness of the GPU driver, and maintains exclusive use during GPU execution to prevent malicious attacks and data tampering; after a task finishes, the GPU is routed back to the TEE environment through the FPGA and the sensitive data in on-chip GPU memory is cleared, solving the problem of residual video memory. The controller mainly provides TPM chip management, RDMA memory sharing, CXL secure communication, and related functions.
TPM chip management means that the TPM chip in the FPGA controller manages the keys and certificates related to trusted computing, ensuring the security and trustworthiness of the TEE cluster.
RDMA memory sharing means that, via the RDMA protocol, the FPGA controller realizes high-speed memory sharing between TEE clusters and accelerates data transmission and communication.
CXL secure communication means that the FPGA controller introduces CXL technology to realize high-speed, secure interconnection between the GPU and the TEE, increasing pre-training speed and reducing memory latency.
The GPU is responsible for the accelerated training of the large language model, improving training efficiency and speed; it is controlled and verified by the FPGA controller to realize GPU trustworthiness.
The TEE cluster consists of multiple Trusted Execution Environments (TEEs) belonging to the different joint modeling participants in the joint training. The federated joint modeling participants design the interconnection of TEE nodes and provide remote attestation and permission control functions: TEE cluster remote attestation ensures that only attested nodes can participate in the pre-training process and designates some nodes to interact with the TEE cluster environments of other external participants, while the TEE cluster permission control function dynamically allocates permissions according to the training tasks and joint modeling participants, so that each node can access only the data and resources for which it has permission.
As shown in fig. 3: the large language model joint pre-training task runs in the TEE cluster in which the multiple parties participate, and comprises computing nodes and an aggregation/management node;
the computing nodes execute the model training and training acceleration tasks; training is carried out inside the trusted execution environment, ensuring the security and trustworthiness of model training. The computing nodes are the participating nodes and are responsible for the local model training computation.
The aggregation/management node is responsible for the management and scheduling of the distributed training of the large language model, as well as gradient aggregation and model distribution, covering model loading, data loading, data confusion, gradient aggregation, parameter updating, model distribution, key management and model persistence;
the model loading is an operation of loading a large language model from a storage medium to a TEE cluster;
the data loading is to load the data required by the pre-training of the large language model into the designated memory of the TEE cluster according to its privacy sensitivity; the data confusion is responsible for jointly obfuscating the data stored in the shared area so as to eliminate its traceability and identifiability, and for encrypting the data with a unified key;
the gradient aggregation function is responsible for aggregating gradients obtained by calculation of the distributed training TEE nodes so as to update model parameters;
the parameter updating function is to update model parameters by using the gradient obtained by aggregation so as to be used in the next training round;
the model distribution is to distribute updated model parameters to each TEE training node by adopting an RDMA protocol;
the key management function is to introduce a global clock into the system as time reference input, combine random number seeds to generate a shared symmetric key and manage the shared symmetric key;
the model persistence is to store intermediate results, such as the model parameters, in a multi-copy encrypted persistent form, so that when an abnormal condition occurs during training the training progress can be recovered and the risk of data loss is reduced. The global clock serves as the clock synchronization mechanism of the distributed system, ensuring time consistency across the nodes; it is used for the area identification of the memory and as part of the key.
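The key management function above derives a shared symmetric key from the global clock and a pre-distributed random-number seed. A minimal sketch follows; HMAC-SHA256 is our stand-in key derivation function, since the patent does not name a concrete KDF:

```python
import hashlib
import hmac

def derive_shared_key(global_clock_tick: int, seed: bytes) -> bytes:
    """Derive a 256-bit symmetric key from the synchronized clock tick and a
    pre-distributed random-number seed. HMAC-SHA256 is an assumed KDF choice."""
    msg = global_clock_tick.to_bytes(8, "big")
    return hmac.new(seed, msg, hashlib.sha256).digest()

# Two parties holding the same seed and the same synchronized clock tick
# derive the same key without it ever crossing the network.
k1 = derive_shared_key(1706486400, b"seed-042")
k2 = derive_shared_key(1706486400, b"seed-042")
```

Because the clock tick enters the derivation, the key rotates over time even though the seeds are distributed only once.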
The process of the method provided by the invention will be described in detail with reference to specific examples.
Step one, preparing a large language model joint pre-training modeling environment:
the preparation of the large language model joint pre-training modeling environment comprises the following steps:
step 101, confirming a pre-training task: specific tasks of the large language model pre-training are defined, including initial parameter configuration of the model and requirements of training data.
Step 102, determining a joint modeling participant: each joint modeling participant, including model owners and data providers, that participates in the large language model joint pre-training is determined.
Step 103, constructing a cross-domain TEE cluster: in a distributed environment, a cross-domain Trusted Execution Environment (TEE) cluster is built, each joint modeling participant starts a TEE management node, shared areas, private areas and GPU resources are planned in the TEE, and access authority information is loaded into an FPGA for controlling access of RDMA and CXL.
Step 104, setting a global clock: and introducing a global clock as a unified time scale to distribute N random number seeds for all joint modeling participants, so as to ensure the consistency of time and the safety of data.
Step 105, encryption and MAC authentication: the large language model federal pre-training task initiator selects a random seed and obtains the global clock to generate a key, encrypts the initial model, generates a MAC message authentication code, and puts the encrypted initial model together with the related identification information into the shared memory area in preparation for model distribution.
Step 106, training data loading and processing: each joint modeling participant loads data into a private area and a shared area of the TEE according to data sensitivity, token processing is carried out on the data to form vector representation, a global clock is obtained, and a local random seed is selected to encrypt the data of the shared area so as to enhance the confidentiality of the data.
Step 107, data aggregation and confusion: the large language model federal pre-training task initiator aggregates the data of the TEE shared areas, generates a key from the random-number seed number recorded in the metadata identification and the global clock, decrypts the data with that key, and confuses the data to obscure its sources; it then obtains the global clock again and selects a local random seed to re-encrypt the shared-area data, generating a public data set that is stored in the shared memory area.
Step 108, common data set pulling: each joint modeling participant pulls the shared region data into the respective local environment through the RDMA protocol in preparation for subsequent training.
Step 109, initializing the GPU: each joint modeling participant initializes the GPU to become a trusted computing resource for accelerating the training process of the model.
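The encrypt-then-MAC pattern of step 105 can be sketched as below. The hash-counter keystream merely stands in for a real cipher such as AES-GCM so that the sketch needs only the standard library; the key names and message layout are assumptions:

```python
import hashlib
import hmac

def keystream(key: bytes, n: int) -> bytes:
    # Counter-mode keystream built from SHA-256; a dependency-free stand-in
    # for a real cipher such as AES in this sketch.
    out = b""
    ctr = 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def encrypt_and_mac(model_bytes: bytes, enc_key: bytes, mac_key: bytes):
    """Encrypt the serialized model, then MAC the ciphertext (encrypt-then-MAC)."""
    ct = bytes(a ^ b for a, b in zip(model_bytes, keystream(enc_key, len(model_bytes))))
    tag = hmac.new(mac_key, ct, hashlib.sha256).digest()
    return ct, tag

def verify_and_decrypt(ct: bytes, tag: bytes, enc_key: bytes, mac_key: bytes) -> bytes:
    """Reject tampered ciphertext before decrypting."""
    if not hmac.compare_digest(tag, hmac.new(mac_key, ct, hashlib.sha256).digest()):
        raise ValueError("MAC check failed: model was tampered with")
    return bytes(a ^ b for a, b in zip(ct, keystream(enc_key, len(ct))))
```

Verifying the MAC before decryption means a recipient pulling the model from the shared memory area never processes tampered ciphertext.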
The invention provides a large language model federal pre-training method based on a trusted execution environment, which is used for joint pre-training modeling of a large language model and comprises the following steps:
step two, joint pre-training modeling of a large language model:
the joint pre-training modeling of the large language model comprises the following steps:
step 201, model initialization and memory transfer: each joint modeling participant copies the initial model directly from memory into the memory of each TEE cluster through the FPGA controller using the RDMA protocol, ensuring that the initial model is loaded into a trusted environment.
Step 202, data selection and preparation: the joint modeling participant randomly selects a part of data from public data, and simultaneously uses private data of a user as training data of the training round.
Step 203, accelerating calculation and data transmission: the FPGA controller locks the GPU for exclusive use; the program portions and training data that need accelerated computation are routed through the FPGA, decrypted, and transmitted to the GPU using the CXL technology for high-speed calculation and processing.
Step 204, distributed parallel training: in the local TEE cluster, distributed parallel training is carried out with multiple groups of data. The joint modeling participants process the data of the public area and the private area with different strategies and calculate the gradients; gradient aggregation is completed inside the private-area TEE cluster, the result is encrypted with a key generated from the global clock and a selected random seed and stored in the public shared memory, and each joint modeling participant is notified.
Step 205, clearing sensitive memory: after the calculation of the round is completed, the FPGA controller is utilized to execute the clearing operation of the sensitive memory on the GPU chip so as to ensure the data security.
Step 206, global aggregation and model update: and the aggregation node is responsible for aggregating the gradient data in the shared memory and then updating the model parameters.
Step 207, model distribution: the aggregation node places the updated model into a common area and encrypts it using a random number seed and a global clock, ready for model distribution for use in subsequent training.
Step 208, downloading and training the updated model: and downloading a new model by each joint modeling party for a new round of model training until the model reaches a convergence state.
Step 209, intermediate result encryption storage: the aggregation node generates three copies of the intermediate results, such as the model and the training state, and encrypts and stores them with a random key; the Shamir secret sharing algorithm is then used to distribute fragments of the random key to all joint modeling participants for future key recovery.
Step 210, abnormal condition processing: when an abnormal condition occurs, the joint modeling participants jointly combine the secret fragments they each hold and perform the decryption operation to recover the intermediate results, ensuring the continuity and reliability of training.
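The Shamir secret sharing used in steps 209 and 210 can be sketched over a prime field: the random key is the constant term of a random polynomial, shares are evaluations of that polynomial, and any threshold number of shares recover the key by Lagrange interpolation at zero. The field choice and share format here are illustrative:

```python
import secrets

_P = 2**127 - 1  # Mersenne prime; large enough for a short key in this sketch

def _eval_poly(coeffs, x):
    # Horner evaluation of coeffs[0] + coeffs[1]*x + ... modulo _P.
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % _P
    return acc

def split_secret(secret: int, n: int, k: int):
    """Split `secret` into n shares; any k of them recover it."""
    coeffs = [secret] + [secrets.randbelow(_P) for _ in range(k - 1)]
    return [(x, _eval_poly(coeffs, x)) for x in range(1, n + 1)]

def recover_secret(shares):
    """Lagrange interpolation of the polynomial at x = 0."""
    acc = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % _P
                den = (den * (xi - xj)) % _P
        # pow(den, _P - 2, _P) is the modular inverse of den (Fermat).
        acc = (acc + yi * num * pow(den, _P - 2, _P)) % _P
    return acc
```

With five participants and a threshold of three, any three surviving parties can jointly reconstruct the random key and decrypt the persisted intermediate results, matching the recovery flow of step 210.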
Step three, optimizing a combined pre-training large language model:
the optimization of the joint pre-trained large language model comprises the following steps:
step 301, after the large language model is trained in combination with a pre-training task, the large language model is deployed into an actual application environment for use;
step 302, collecting feedback data, continuously increasing training data, and continuously improving model precision and training efficiency.
The above examples are only one of the specific embodiments of the present invention, and the ordinary changes and substitutions made by those skilled in the art within the scope of the technical solution of the present invention should be included in the scope of the present invention.
Claims (6)
1. The large language model federal pre-training method based on the trusted execution environment is characterized by comprising the following steps of:
step 1, creating a large language model joint pre-training task, determining joint modeling participants, preparing data, and creating computing storage network resources for joint modeling;
step 2, performing joint pre-training of a large language model;
and 3, optimizing the large language model obtained by the combined pre-training.
2. The large language model federal pre-training method based on trusted execution environment according to claim 1, wherein the step 1 specifically comprises the steps of:
step 101, confirming a pre-training task: defining a specific task of pre-training of the large language model, wherein the specific task comprises initial parameter configuration of the large language model and requirement of training data;
step 102, determining a joint modeling participant: determining each joint modeling participant participating in the joint pre-training of the large language model, including an owner of the large language model and a data provider;
step 103, constructing a cross-domain TEE cluster: in a distributed environment, building a cross-domain Trusted Execution Environment (TEE) cluster, each joint modeling participant starting a TEE management node, planning shared areas, private areas and GPU resources inside the TEE, and loading the access authority information into an FPGA (field programmable gate array) for controlling the access of RDMA (remote direct memory access) and CXL (Compute Express Link);
step 104, setting a global clock: introducing a global clock as a unified time scale to distribute N random number seeds for all joint modeling participants, so as to ensure the consistency of time and the safety of data;
step 105, encryption and MAC authentication: the large language model federal pre-training task initiator selects a random seed and acquires the global clock to generate a key, encrypts the initial model while generating a MAC message authentication code, and puts the encrypted initial model together with the related identification information into the shared memory area in preparation for model distribution;
the shared memory area is divided into a private area and a shared public area; the private area stores the private data, i.e. the sensitive data, of the participants in the federal joint modeling; the shared public area is used for the efficient, rapid sharing of non-sensitive data, modeling-process metadata, global model parameters and the public data set;
step 106, training data loading and processing: each joint modeling participant loads data into a private area and a shared area of the TEE according to data sensitivity, performs Token processing on the data to form vector representation, acquires a global clock, and selects a local random seed to encrypt the data of the shared area;
step 107, data aggregation and confusion: the task initiator aggregates the data of the TEE shared areas, generates a key from the random-number seed number recorded in the metadata identification and the global clock, decrypts the data with that key, and confuses the data to obscure its sources; it then acquires the global clock again, selects a local random seed to re-encrypt the shared-area data, generates a public data set, and stores it in the shared memory area;
step 108, common data set pulling: each joint modeling participant pulls the data of the shared area into the respective local environment through an RDMA protocol to prepare for subsequent training;
step 109, initializing the GPU: each joint modeling participant initializes the GPU to become a trusted computing resource for accelerating the training process of the model.
3. The large language model federal pre-training method based on trusted execution environment according to claim 2, wherein said step 107 comprises;
data aggregation: the task initiator obtains the data from the TEE shared area of each joint modeling participant, generates a key using the global clock and the random-number seed number recorded in the metadata identification, and decrypts the data obtained from each TEE with the generated key, restoring the encrypted data to plaintext;
data confusion: on the basis of decryption, the task initiator confuses the data;
and (5) re-encrypting: re-encrypting the confused data, and generating a new key again by using the global clock and a random seed local to the task initiator;
a common dataset is generated: the processed, obfuscated and re-encrypted data is combined into a common data set.
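The obfuscation step of claim 3 can be illustrated minimally: records that have already been decrypted are pooled, stripped of their party labels, and shuffled so that the source of any individual record can no longer be inferred. The party names and record shapes below are invented for the example:

```python
import random

def obfuscate(records_by_party):
    """Pool decrypted records from all parties, drop the party labels,
    and shuffle so no record can be traced back to its source."""
    pooled = [rec for records in records_by_party.values() for rec in records]
    random.shuffle(pooled)
    return pooled

# Hypothetical per-party shared-area contents after decryption.
parties = {
    "party_a": ["sample-a1", "sample-a2"],
    "party_b": ["sample-b1"],
}
public_dataset = obfuscate(parties)
```

In the full pipeline of claim 3 this pooled list would then be re-encrypted with the freshly derived key before being written back to the shared memory area as the common data set.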
4. The large language model federal pre-training method based on trusted execution environment according to claim 2, wherein the step 2 specifically comprises the steps of:
step 201, model initialization and memory transfer: each joint modeling participant directly copies the memory of the initial model to the memory of each TEE cluster through the FPGA controller by adopting an RDMA protocol, so as to ensure that the initial model is loaded into a trusted environment;
step 202, data selection and preparation: the joint modeling participant randomly selects a part of data from public data, and simultaneously uses private data of a user as training data of the training round;
the public data refers to non-sensitive user data, desensitized versions of such data, or data from outside that has already been disclosed; each user sends this data to the central node, and after data confusion it forms the basic public data set;
step 203, accelerating calculation and data transmission: the FPGA controller locks the GPU for exclusive use; the program portions and training data that need accelerated computation are routed through the FPGA, decrypted, and transmitted to the GPU using the CXL technology for calculation and processing;
step 204, distributed parallel training: in the local TEE cluster, carrying out distributed parallel training by adopting a plurality of groups of data, processing the data of a public area and a private area by using different strategies by a joint modeling participant, calculating gradients, finishing gradient aggregation in the TEE cluster of the private area, encrypting by using a global clock and a selected random seed generation key, storing the result into a shared memory of the public area, and notifying each joint modeling participant;
the shared memory is a region to which the nodes jointly contribute to form one large memory; the memory is read and written directly, which improves the processing speed;
step 205, clearing sensitive memory: after the calculation of the round is completed, the FPGA controller is utilized to execute the clearing operation of the sensitive memory on the GPU chip so as to ensure the data security;
step 206, global aggregation and model update: the aggregation node is responsible for aggregating gradient data in the shared memory and then updating model parameters;
the gradient data is obtained as follows: a joint modeling participant of the federal modeling performs the local forward computation with its local data set, calculates the gradients of the parameters according to the configured loss function, and uploads the gradients to the aggregation node;
step 207, model distribution: the aggregation node puts the updated model into a public area, encrypts the model by using a random number seed and a global clock, and prepares the model for model distribution for subsequent training and use;
step 208, downloading and training the updated model: each joint modeling participant downloads the new model from the aggregation node; after updating the global model, the aggregation node sends it directly to the joint modeling participants, with execution accelerated by letting the participants pull the model themselves; the model is used for the new round of training until it reaches a convergence state;
step 209, intermediate result encryption storage: this task is triggered periodically; as the aggregation node trains by continuous iteration, three copies are generated of the intermediate results, namely the continuously aggregated model parameters and the training-state data, and are encrypted and stored with a random key; the Shamir secret sharing algorithm is then used to distribute fragments of the random key to all joint modeling participants in preparation for future key recovery;
step 210, abnormal condition processing: when an abnormal condition occurs, the joint modeling participants jointly combine the secret fragments they each hold and perform the decryption operation to recover the intermediate results, ensuring the continuity and reliability of training.
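The global aggregation and parameter update of step 206 can be sketched as a weighted gradient average followed by a gradient-descent step. The patent does not fix a particular aggregation rule, so the plain FedAvg-style weighted mean and the learning rate below are assumptions:

```python
def aggregate_gradients(party_gradients, weights=None):
    """Weighted average of per-party gradient vectors.
    `weights` (e.g. local dataset sizes) defaults to uniform averaging."""
    if weights is None:
        weights = [1.0] * len(party_gradients)
    total = sum(weights)
    dim = len(party_gradients[0])
    return [
        sum(w * g[i] for w, g in zip(weights, party_gradients)) / total
        for i in range(dim)
    ]

def sgd_update(params, grad, lr=0.1):
    """One gradient-descent step on the aggregated gradient."""
    return [p - lr * g for p, g in zip(params, grad)]

# Two hypothetical parties upload their locally computed gradients.
agg = aggregate_gradients([[1.0, 2.0], [3.0, 4.0]])
new_params = sgd_update([0.5, 0.5], agg)
```

The aggregation node would run exactly this pair of operations each round before re-encrypting the updated parameters and placing them in the public area for distribution (step 207).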
5. The large language model federal pre-training method based on a trusted execution environment according to claim 4, wherein the FPGA controller realizes the trustworthiness of the GPU and the secure communication and authority authentication between the TEE and the outside, and the FPGA controller comprises TPM chip management, RDMA memory sharing and CXL secure communication;
the TPM chip management is that the TPM chip in the FPGA controller is used for managing related keys and certificates of trusted computing, so that the safety and the credibility of the TEE cluster are ensured;
the RDMA memory sharing is realized by an RDMA protocol, and the FPGA controller realizes high-speed memory sharing among TEE clusters and accelerates data transmission and communication;
the CXL safety communication realizes high-speed and safe interconnection between the GPU and the TEE, improves the pre-training speed and reduces the memory delay;
the GPU is responsible for performing accelerated training of a large language model, improving training efficiency and speed, and is controlled and verified by the FPGA controller to realize GPU credibility.
6. The large language model federal pre-training method based on trusted execution environment according to claim 4, wherein the step 3 specifically comprises the following steps of;
step 301, after the large language model is trained in combination with a pre-training task, the large language model is deployed into an actual application environment for use;
step 302, collecting feedback data, continuously increasing training data, and continuously improving model precision and training efficiency.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410117882.3A CN117648998B (en) | 2024-01-29 | 2024-01-29 | Large language model federal pre-training method based on trusted execution environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117648998A true CN117648998A (en) | 2024-03-05 |
CN117648998B CN117648998B (en) | 2024-04-26 |
Family
ID=90045405
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410117882.3A Active CN117648998B (en) | 2024-01-29 | 2024-01-29 | Large language model federal pre-training method based on trusted execution environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117648998B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118114771A (en) * | 2024-04-25 | 2024-05-31 | 蚂蚁科技集团股份有限公司 | Function tool calling method and device in trusted execution environment, medium and equipment |
CN118296615A (en) * | 2024-06-05 | 2024-07-05 | 蓝象智联(杭州)科技有限公司 | Large model fine tuning method and device based on trusted execution environment |
CN118394889A (en) * | 2024-06-21 | 2024-07-26 | 之江实验室 | Large language model federal fine tuning method and device based on gradient compression |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113487042A (en) * | 2021-06-28 | 2021-10-08 | 海光信息技术股份有限公司 | Federated learning method and device and federated learning system |
CN114254386A (en) * | 2021-12-13 | 2022-03-29 | 北京理工大学 | Federated learning privacy protection system and method based on hierarchical aggregation and block chain |
WO2023035507A1 (en) * | 2021-09-07 | 2023-03-16 | 天翼电子商务有限公司 | Trusted executive environment multi-node authentication method |
CN116402169A (en) * | 2023-06-09 | 2023-07-07 | 山东浪潮科学研究院有限公司 | Federal modeling verification method, federal modeling verification device, federal modeling verification equipment and storage medium |
CN116611124A (en) * | 2023-05-25 | 2023-08-18 | 南京恒木兴信息科技有限公司 | GPU trusted execution environment construction method, system and data transmission method |
CN116957109A (en) * | 2023-07-31 | 2023-10-27 | 山东浪潮科学研究院有限公司 | Model construction method, device, equipment and medium based on federal learning |
CN117216788A (en) * | 2023-08-26 | 2023-12-12 | 北京工业大学 | Video scene identification method based on federal learning privacy protection of block chain |
CN117332247A (en) * | 2023-12-01 | 2024-01-02 | 苏州大学 | Big data transaction and quality assessment method and system using big language model as medium |
Non-Patent Citations (3)
Title |
---|
MAI, HAOHUI et al.: "Honeycomb: Secure and Efficient GPU Executions via Static Validation", Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2023, 16 October 2023 (2023-10-16) *
WU, Bingqing: "A Secret-Sharing-Based Method for Processing Private Data on Internet-of-Vehicles Public Blockchains", China Masters' Theses Full-text Database, Engineering Science and Technology II, 15 March 2023 (2023-03-15) *
JING, Weipeng; JIANG, Tao; ZHU, Liangkuan; LIU, Meiling: "Research on a Speech Recognition Acceleration Algorithm Based on GPU and Deep Belief Networks", Journal of Chinese Computer Systems, no. 03, 15 March 2018 (2018-03-15) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||