CN115297008B - Collaborative training method, device, terminal and storage medium based on intelligent computing network


Info

Publication number
CN115297008B
CN115297008B
Authority
CN
China
Prior art keywords
training
trained
collaborative
model
computing
Prior art date
Legal status
Active
Application number
CN202210793410.0A
Other languages
Chinese (zh)
Other versions
CN115297008A (en)
Inventor
张艳
王晖
王进
颜达森
易泽轩
陶恒韬
蒋芳清
秦爽
徐增林
曾炜
余跃
Current Assignee
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date
Filing date
Publication date
Application filed by Peng Cheng Laboratory
Priority to CN202210793410.0A
Publication of CN115297008A
Application granted
Publication of CN115297008B
Legal status: Active
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/145Network analysis or design involving simulating, designing, planning or modelling of a network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a collaborative training method, device, terminal and storage medium based on an intelligent computing network, wherein the method comprises the following steps: acquiring a plurality of algorithms to be trained and corresponding data sets, and generating a plurality of task groups from them; determining the terminals to be trained in the distributed intelligent collaborative computing platform according to the selected task group, and determining the algorithm to be trained and the data set corresponding to each terminal to be trained; performing collaborative training and reasoning on the models of all terminals to be trained through a collaborative training strategy spanning heterogeneous intelligent computing centers, to obtain collaborative training and reasoning results; and acquiring a multi-model fusion strategy according to the collaborative training and reasoning results, and fusing the algorithms in the trained terminals through the multi-model fusion strategy, to obtain a distributed multi-framework collaborative computing model spanning heterogeneous intelligent computing centers. The invention can realize technologies that are difficult for a single cluster, such as large-model collaborative training, multi-model fusion and large-model compression.

Description

Collaborative training method, device, terminal and storage medium based on intelligent computing network
Technical Field
The invention relates to the technical field of intelligent computing networks, and in particular to a collaborative training method, device, terminal and storage medium based on an intelligent computing network.
Background
Intelligent computing network collaborative computing refers to a computing mode in which multiple intelligent computing network users, based on intelligent computing network infrastructure and services, take on job roles abstracted from the application scenario and cooperatively use resources such as data, computing power, models and networks to complete an intelligent computing job.
Currently, scientific computation is carried by supercomputing centers, while enterprise, government and personal applications are carried by a variety of data centers. Artificial intelligence computing demand, however, is growing exponentially and is expected to account for 80% of total computing demand in the future; this demand is carried by AI computing power centers, i.e., intelligent computing centers. In the face of continuously multiplying computing power and network demands, the intelligent computing network breaks through the performance limit of single-point computing power and raises the overall scale of intelligent computing centers.
At present, industry, academia and research are actively exploring how to jointly advance intelligent computing network deployment. Technically, several points remain to be broken through and studied. First, computing power is diverse and ubiquitous: different intelligent computing centers have different software and hardware architectures and geographical distributions, and take various forms such as computing clusters, data centers, computing centers and intelligent computing centers. Second, the development of networks has made computing power more ubiquitous, yet the scale of applications is limited by the scale of a single computing center; if cross-center computing power could be used, it would produce more value for users and devices. Third, with multi-technology fusion, the intelligent computing network is becoming an important carrier of multi-technology and multi-domain fusion, and collaborative computing based on intelligent computing networks is a key technology within it.
Accordingly, there is a need in the art for improvement.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a collaborative training method, device, terminal and storage medium based on an intelligent computing network, to solve the technical problem that existing intelligent computing networks find it difficult to realize large-model collaborative training and multi-model fusion.
The technical scheme adopted for solving the technical problems is as follows:
in a first aspect, the present invention provides a collaborative training method based on an intelligent computing network, including:
acquiring a plurality of algorithms to be trained and corresponding data sets, and generating a plurality of task groups according to the acquired algorithms and data sets; wherein, each task group at least corresponds to an algorithm to be trained and a data set thereof;
determining terminals to be trained in the distributed intelligent collaborative computing platform according to the selected task group, and determining a to-be-trained algorithm and a data set corresponding to each terminal to be trained;
collaborative training and reasoning are carried out on the models of all the terminals to be trained through a collaborative training strategy of the cross heterogeneous intelligent computing center, and collaborative training and reasoning results are obtained;
and acquiring a multi-model fusion strategy according to the collaborative training and reasoning result, and fusing the algorithm in the trained terminal through the multi-model fusion strategy to obtain a collaborative calculation model of the cross-heterogeneous intelligent computation center based on the distributed multi-framework.
In one implementation manner, the collaborative training and reasoning are performed on the models of all the terminals to be trained through the collaborative training strategy of the cross heterogeneous intelligent computing center to obtain collaborative training and reasoning results, including:
dividing all terminals to be trained into heterogeneous AI clusters and heterogeneous AI architectures according to AI types;
and respectively carrying out cooperative training and reasoning on the models between the heterogeneous AI clusters and/or the heterogeneous AI architecture through the cooperative training strategy of the cross-heterogeneous intelligent computation center.
In one implementation, after dividing all the terminals to be trained into heterogeneous AI clusters and heterogeneous AI architectures according to AI type, the method includes:
initializing each heterogeneous AI architecture;
and transmitting the deep learning framework field of the corresponding algorithm into each heterogeneous AI architecture, and calling the corresponding API interface to obtain the corresponding model parameters.
In one implementation, the collaborative training and reasoning of the models between the heterogeneous AI clusters and/or the heterogeneous AI architectures through a collaborative training strategy across heterogeneous intelligent computing centers respectively includes:
acquiring memory information, communication delay information, computing capacity information and algorithm model information corresponding to each terminal to be trained;
and sending a collaborative computing resource allocation scheme to the corresponding terminal to be trained according to the memory information, the communication delay information, the computing capability information and the algorithm model information.
In one implementation, the collaborative training and reasoning of the models between the heterogeneous AI clusters and/or the heterogeneous AI architectures through a collaborative training strategy across heterogeneous intelligent computing centers includes:
dynamically adjusting the training synchronization period in each terminal to be trained;
and determining the training overtime and the failure time in each terminal to be trained, and resetting the local training of each terminal to be trained according to the training overtime and the failure time.
In one implementation, the collaborative training and reasoning of the models between the heterogeneous AI clusters and/or the heterogeneous AI architectures through a collaborative training strategy across heterogeneous intelligent computing centers respectively includes:
acquiring an evaluation instruction, and verifying an evaluation data set of each trained terminal through the evaluation instruction;
and obtaining model training precision values corresponding to the trained terminals according to the verification results.
In one implementation manner, the acquiring the multi-model fusion strategy according to the collaborative training and reasoning result, and fusing the algorithm in the trained terminal through the multi-model fusion strategy includes:
acquiring an average fusion strategy or a contribution fusion strategy according to the model training precision value;
determining a model fusion weight ratio corresponding to each trained terminal according to the average fusion strategy or the contribution fusion strategy;
and fusing algorithms in all the trained terminals according to the determined weight ratio.
In one implementation manner, the acquiring the multi-model fusion strategy according to the collaborative training and reasoning result, and fusing the algorithm in the trained terminal through the multi-model fusion strategy, and then comprises the following steps:
the fused result is sent to a corresponding post-training terminal;
setting a corresponding model initial value of the trained terminal according to the fused result, and taking the set model initial value as a model parameter of the next training.
In one implementation, the method further comprises:
and setting an infrastructure meeting the cooperative training and fusion according to the input instruction, and setting a corresponding cooperative training and fusion algorithm.
In one implementation, the distributed multi-framework based cross-heterogeneous intelligent computing center at least includes: a central processing unit (CPU), a neural processing unit (NPU) and a graphics processing unit (GPU).
In a second aspect, the present invention provides a collaborative training device based on an intelligent computing network, comprising:
the task group module is used for acquiring a plurality of algorithms to be trained and corresponding data sets and generating a plurality of task groups according to the acquired algorithms and data sets; wherein, each task group at least corresponds to an algorithm to be trained and a data set thereof;
the selection module is used for determining terminals to be trained in the distributed intelligent collaborative computing platform according to the selected task group and determining a to-be-trained algorithm and a data set corresponding to each terminal to be trained;
the collaborative training and reasoning module is used for carrying out collaborative training and reasoning on the models of all the terminals to be trained through a collaborative training strategy crossing the heterogeneous intelligent computation center to obtain collaborative training and reasoning results;
and the fusion module is used for acquiring a multi-model fusion strategy according to the collaborative training and reasoning result, and fusing the algorithm in the trained terminal through the multi-model fusion strategy to obtain a collaborative calculation model of the cross-heterogeneous intelligent computation center based on the distributed multi-framework.
In a third aspect, the present invention provides a terminal comprising: a processor and a memory storing a collaborative training program based on an intelligent computing network, which when executed by the processor is adapted to carry out the collaborative training method based on an intelligent computing network as described in the first aspect.
In a fourth aspect, the present invention also provides a storage medium, which is a computer readable storage medium storing a collaborative training program based on an intelligent computing network, which when executed by a processor is configured to implement the collaborative training method based on an intelligent computing network according to the first aspect.
The technical scheme adopted by the invention has the following effects:
the invention can realize technologies such as large model collaborative training, multi-model fusion, large model compression and the like which are difficult to realize by a single cluster, and complete collaborative computing operation across a plurality of intelligent computing centers by enabling data, computing power, models, networks and services through intelligent computing network infrastructure, thereby realizing brand new computing paradigms and business scenes such as large model cross-domain collaborative computing, multi-center model aggregation, multi-center federal learning and the like, so that the intelligent collaborative computing paradigms become the key for fully exerting the overall efficiency of the intelligent computing network and enabling the industrial scale application of artificial intelligence.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the structures shown in these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a collaborative training method based on a smart network in one implementation of the invention.
FIG. 2 is a schematic diagram of a collaborative training framework based on an intelligent computing network in one implementation of the invention.
FIG. 3 is a schematic diagram of the relationship of the modules in one implementation of the invention.
Fig. 4 is a functional schematic of a terminal in one implementation of the invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more clear and clear, the present invention will be further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Exemplary method
At present, the computing power of the intelligent computing network is diverse and ubiquitous: different intelligent computing centers have different software and hardware architectures and geographical distributions, and take various forms such as computing clusters, data centers, computing centers and intelligent computing centers. Moreover, the scale of applications is limited by the scale of a single computing center, which makes it difficult for the intelligent computing network to become an important carrier of multi-technology and multi-domain fusion, and technologies such as large-model collaborative training, multi-model fusion and large-model compression are difficult to realize.
To address these technical problems, this embodiment provides a collaborative training method based on an intelligent computing network that can realize technologies difficult for a single cluster, such as large-model collaborative training, multi-model fusion and large-model compression, and complete collaborative computing jobs spanning multiple intelligent computing centers through the intelligent computing network infrastructure. This realizes brand-new computing paradigms and business scenarios such as large-model cross-domain collaborative computing, multi-center model aggregation and multi-center federated learning, making the intelligent collaborative computing paradigm the key to fully exerting the overall efficiency of the intelligent computing network and enabling industrial-scale application of artificial intelligence.
As shown in fig. 1, the embodiment of the invention provides a collaborative training method based on an intelligent computing network, which comprises the following steps:
step S100, a plurality of algorithms to be trained and corresponding data sets are obtained, and a plurality of task groups are generated according to the obtained algorithms and data sets.
In this embodiment, the collaborative training method based on the intelligent computing network is implemented by a collaborative computing system based on the intelligent computing network, where the collaborative computing system is a distributed intelligent collaborative computing platform. The platform aims to complete collaborative computing jobs spanning multiple intelligent computing centers by enabling data, computing power, models, networks and services through the intelligent computing network infrastructure, thereby realizing brand-new computing paradigms and business scenarios such as large-model cross-domain collaborative computing, multi-center model aggregation and multi-center federated learning. The intelligent collaborative computing paradigm thus becomes the key to fully exerting the overall efficiency of the intelligent computing network and enabling industrial-scale application of artificial intelligence.
As shown in fig. 2, in the present embodiment, the collaborative computing platform based on the intelligent computing network includes: collaborative computing job management, collaborative computing frameworks, and resources; the collaborative computing job management is a management interface integrating computing power management, algorithm management, task management, data set management, model management and task visualization; the collaborative computing framework is a framework for implementing communication schemes, fusion strategies, training strategies and effect evaluation; the resources are communication resources, storage resources and computing resources distributed on each cooperative device.
In the collaborative computing platform based on the intelligent computing network, the functions of collaborative computing job management are as follows: accepting external requests; implementing identity management, authority management, algorithm management and model management; collecting and storing resource information such as CPU, GPU and NPU resources; controlling events of the distributed learning process such as parameter fusion and effect evaluation; and carrying out the various requested operations.
As shown in fig. 3, in this embodiment, in the collaborative computing platform based on the intelligent computing network, the functions of each module are as follows:
the Web is a front-end display page;
the Web API Server is a module for background management, database management and command forwarding;
proxy gRPC server is a module for implementing command forwarding and collecting information of the machine;
the Proxy is a module for realizing machine information acquisition, running state acquisition and task execution (task execution comprises task starting, task stopping and other operations, and mainly starts and stops containers);
the Agent is the collaborative training server; it runs on the task group machine and drives the training process of the clients across multiple computing centers;
the Client is the collaborative training client; it runs on the computing center where the task is located, trains on the local training data, and uploads the training results to the Agent for fusion.
Before implementing the collaborative training method based on the intelligent computing network, an infrastructure meeting collaborative training and fusion needs to be set in the collaborative computing platform based on the intelligent computing network, and corresponding collaborative training and fusion algorithms are set to obtain corresponding training components and fusion components.
Specifically, in one implementation of the present embodiment, the step S100 includes the following steps before:
step S101, setting an infrastructure meeting cooperative training and fusion according to an input instruction, and setting a corresponding cooperative training and fusion algorithm.
In this embodiment, the collaborative computing platform based on the intelligent computing network mainly consists of algorithms, computing power, data sets, task groups, tasks and models. A user writes an algorithm with the specific requirements, conforming to the basic framework for collaborative fusion training, and packs and deploys it to the corresponding Docker hub server through Docker to complete the deployment of the algorithm. Computing power is the physical foundation supporting collaborative training, and both physical machine and cloud computer deployment are supported. The data set is the training data input, and needs to be deployed on the computing power.
In this embodiment, after setting the corresponding infrastructure and algorithms, a plurality of algorithms and corresponding data sets may be added in the algorithm management interface of the collaborative computing platform; the added algorithm is referenced by its algorithm container name, and the added data set is uploaded to the platform's shared storage to realize data set sharing. In the process of adding the algorithms, the platform classifies each algorithm according to its name, thereby obtaining the type corresponding to each algorithm.
Further, after adding a plurality of algorithms and corresponding data sets, a corresponding task group can be added in the platform; when adding a task group, the name of an algorithm container can be entered and the corresponding number of tasks set. By adding task groups, a user can select in the platform the task groups that need collaborative computing training, and select the data sets that will participate in the training.
In one implementation of this embodiment, the platform acquires a plurality of algorithms to be trained and the corresponding data sets, and generates a plurality of task groups from them, where each task group corresponds to at least one algorithm to be trained and its data set. A task group can be associated with one or more users, and collaborative training can take place between different users or within the same user. It should be noted that in this embodiment a task group is equivalent to a training task: a task group specifies the algorithm and the minimum number of participating users, and a task is one instance of a user joining the collaborative training.
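For illustration only, the following Python sketch models the relationship between task groups, tasks, algorithms and data sets described above; all class and field names are hypothetical and do not represent the platform's actual code:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    """One instance of a user joining the collaborative training."""
    user_id: str      # user associated with this task
    machine_id: str   # training machine bound to this user

@dataclass
class TaskGroup:
    """Binds an algorithm to its data set and participants (hypothetical)."""
    algorithm_container: str   # name of the algorithm's Docker container
    dataset: str               # data set in the platform's shared storage
    min_users: int             # minimum number of participating users
    tasks: List[Task] = field(default_factory=list)

    def ready(self) -> bool:
        # Training can start once the minimum number of users has joined.
        return len(self.tasks) >= self.min_users
```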
As shown in fig. 1, in an implementation manner of the embodiment of the present invention, the collaborative training method based on the intelligent computing network further includes the following steps:
and step 200, determining terminals to be trained in the distributed intelligent collaborative computing platform according to the selected task group, and determining a to-be-trained algorithm and a data set corresponding to each terminal to be trained.
In this embodiment, a user may select a task group to be trained on the collaborative computing platform based on the intelligent computing network, where the platform determines a terminal to be trained in the distributed intelligent collaborative computing platform according to the task group selected by the user, that is, determines user information associated with each task in the task group, and a computing center (that is, a terminal to be trained) bound with the user information; meanwhile, the platform also determines the algorithm to be trained and the data set corresponding to each terminal to be trained, so as to deploy the algorithm to be trained and the data set in the corresponding computing center.
In this embodiment, the application scenario of the collaborative computing platform based on the intelligent computing network is:
1) Collaborative training of large models: the parameter scale of dense and sparse models reaches the trillion level and beyond; through cross-domain data-parallel and model-parallel schemes, different intelligent computing centers each complete the training computation for a part of the data or a part of the model structure, improving overall training efficiency.
2) Pre-training model compression: ultra-large-scale parameter models are difficult to apply and deploy, even in smaller computing centers; by supporting cross-center compression and distillation of pre-training models, miniaturized application of large models is realized.
3) Multi-model fusion: medium and small scale parameter models of different sizes and types are distributed across the computing centers of the intelligent computing network, and by fusing the knowledge of multiple models, a large model with stronger capability is obtained through learning.
4) Model training on 100T-scale data sets: when a 100T-level large-scale data set is computed in a single computing center, data parallelism brings a large amount of internal data transmission and reduces single-cluster computing efficiency; by storing the ultra-large-scale data set across multiple centers, the data I/O bottleneck within a single computing center is resolved.
5) Multiparty reinforcement learning: the reinforcement learning application models are distributed and deployed in different computing centers, and the joint learning and fusion of the reinforcement learning models are realized through the cooperative computation of an intelligent computing network, so that the computing capacity of a single computing center is enhanced.
This embodiment solves the problem of how to perform efficient collaborative computation based on the intelligent computing network, maximizes effective computing power, and realizes scenarios such as "computation following the data". The platform unifies systems with different frameworks: it supports multiple frameworks and parallel strategies, is simple and easy to use, and is compatible with ARM, x86 and the like.
As shown in fig. 1, in an implementation manner of the embodiment of the present invention, the collaborative training method based on the intelligent computing network further includes the following steps:
and step S300, performing cooperative training and reasoning on the models of all the terminals to be trained through a cooperative training strategy crossing heterogeneous intelligent computation centers to obtain cooperative training and reasoning results.
In this embodiment, intelligent collaborative training based on the intelligent computing network rests on the following assumptions. Network conditions: at least one computing center is configured with an external network, and the other computing centers can access that external network. Images and code can be pulled on the different clusters and scheduled simultaneously to NPU, GPU and MLU clusters, and a task submitted by a user carries the complete software and hardware environment needed to start and execute the algorithm.
Specifically, in one implementation of the present embodiment, step S300 includes the steps of:
step S301, dividing all terminals to be trained into heterogeneous AI clusters and heterogeneous AI architecture according to AI types;
step S302, respectively performing collaborative training and reasoning on the models between the heterogeneous AI clusters and/or the heterogeneous AI framework through the collaborative training strategy of the heterogeneous computing center.
This embodiment supports collaborative computing spanning heterogeneous intelligent computing centers: it can support collaborative model training and reasoning between heterogeneous AI clusters such as GPU, NPU and MLU, and between heterogeneous AI architectures such as MindSpore, PyTorch, TensorFlow and PaddlePaddle. Therefore, during training, the platform divides the computing centers into heterogeneous AI clusters and heterogeneous AI architectures according to AI type, so as to perform collaborative training and reasoning according to this classification.
During training, each computing center keeps its own data, and these data are not propagated through the platform to third parties. Each party has proprietary GPU, NPU or MLU computing power and a locally controlled deep learning framework (e.g., TensorFlow, PyTorch, MindSpore). Each computing center trains on its private data using its local computing power and deep learning framework; after training finishes, the results are fused under different fusion strategies according to the server's strategy file. The fused result is returned to each computing center and used as the initial model value for the next round of training; because the parameter characteristics of the other participants are fused into the local model during training, after several rounds the accuracy of the local model can reach the accuracy obtained when the data are trained centrally.
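The round structure described above resembles federated averaging. The following sketch illustrates one such round under that reading; the `train_locally` interface and the weighted-sum fusion rule are assumptions for illustration, not the platform's actual interfaces:

```python
def collaborative_round(centers, global_params, weights=None):
    """One collaborative round: local training on private data, then
    weighted fusion of the uploaded parameters (a sketch)."""
    local_params = []
    for center in centers:
        # Private data never leaves the center; only parameters are uploaded.
        local_params.append(center.train_locally(global_params))

    n = len(local_params)
    if weights is None:
        weights = [1.0 / n] * n  # average fusion strategy by default

    # Fuse each parameter tensor across centers according to the weights.
    fused = [sum(w * p[i] for w, p in zip(weights, local_params))
             for i in range(len(local_params[0]))]

    # The fused result is returned to every center as the initial value
    # of the model for the next round of training.
    return fused
```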
Specifically, in one implementation of the present embodiment, step S301 includes the following steps:
step S301a, initializing each of the heterogeneous AI architecture;
step S301b, the deep learning frame field of the corresponding algorithm is transmitted into each heterogeneous AI architecture, and the corresponding API interface is called to obtain the corresponding model parameters.
In one implementation of this embodiment, after the division into heterogeneous AI clusters and heterogeneous AI architectures, the interfaces can be unified across the different AI frameworks; a user only needs to add a few lines of code to turn local computing power into a collaborative computing algorithm, which is flexible and convenient to use. The method is as follows:
when a user initializes AISYNCORE.client.NumPyClientAdap, the dl_frame deep learning framework field used by the current algorithm (i.e., the collaborative training and fusion algorithm the user wrote for the platform) is passed in; AISYNCORE automatically adapts to the characteristics of the different frameworks and calls the corresponding APIs, implementing model parameter acquisition, parameter upload to the server, parameter pull-down from the server and loading into the model, model training and model testing. The user only needs to execute one line of code, AISYNCORE.client run_numlyAdap_client, to start the collaborative training task.
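Based only on the call names quoted above, a client-side integration might look as follows; the constructor arguments and the exact signature of the start call are assumptions, since only the names AISYNCORE.client.NumPyClientAdap, dl_frame and run_numlyAdap_client appear in the description:

```python
import AISYNCORE  # the platform's collaborative training client library

# Hypothetical user algorithm exposing a model in its native framework.
from my_algorithm import build_model  # assumed user code, for illustration

# Initialize the adapter, passing the dl_frame field so that AISYNCORE can
# adapt to the framework's characteristics and call the corresponding APIs.
client = AISYNCORE.client.NumPyClientAdap(
    model=build_model(),   # assumed argument name
    dl_frame="pytorch",    # deep learning framework field of the algorithm
)

# One line of code starts the co-training task: parameter acquisition,
# upload to the server, pull-down and loading into the model, training
# and testing are then handled by the adapter.
AISYNCORE.client.run_numlyAdap_client(client)
```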
Specifically, in one implementation of the present embodiment, step S302 is preceded by the steps of:
step S302a, memory information, communication delay information, computing capacity information and algorithm model information corresponding to each terminal to be trained are obtained;
step S302b, sending a collaborative computing resource allocation scheme to the corresponding terminal to be trained according to the memory information, the communication delay information, the computing capability information and the algorithm model information.
In one implementation of this embodiment, before collaborative training and reasoning, information about the devices and the training model (such as memory, communication delay and computing capability) may be collected through a Benchmark (a performance evaluation method) to provide several optimized collaborative computing resource allocation schemes for each subtask; the client runs training for several iterations on the specific hardware, from which information such as memory, communication delay and computing capability can be obtained.
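A minimal sketch of this benchmark step is shown below; `train_step`, `num_parameters` and `submit_benchmark` are assumed helper names, and memory is read here with the psutil library purely for illustration:

```python
import time
import psutil  # third-party library, used here only to read memory usage

def benchmark_terminal(model, batches, server):
    """Run a few training iterations and report the measurements an
    allocation scheme needs (memory, speed, model size) -- a sketch."""
    start = time.time()
    for batch in batches:            # a handful of iterations suffices
        model.train_step(batch)      # assumed user-provided training step
    elapsed = time.time() - start

    report = {
        "memory_mb": psutil.virtual_memory().used / 2**20,
        "sec_per_iteration": elapsed / max(len(batches), 1),
        "model_parameters": model.num_parameters(),  # assumed helper
    }
    # The platform combines the reports from all terminals into optimized
    # collaborative computing resource allocation schemes per subtask.
    server.submit_benchmark(report)  # assumed platform call
    return report
```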
Specifically, in one implementation of the present embodiment, step S302 includes the steps of:
step S302c, dynamically adjusting the training synchronization period in each terminal to be trained;
step S302d, determining the training overtime and the failure time in each terminal to be trained, and resetting the local training of each terminal to be trained according to the training overtime and the failure time.
In one implementation of this embodiment, during collaborative training and reasoning, operations such as dynamic adjustment of the synchronization period and resetting of timed-out or failed local training can be performed for each computing center. Specifically: for local training that times out or fails at a computing center, the current round can be restarted to perform local training again; or the timed-out model parameters of the current round can be fused directly with the model parameters of the next round, realizing dynamic adjustment of the synchronization period.
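The two recovery options and the period adjustment could be organized as in the sketch below; the server interface (`wait_for`, `send_restart`, `policy`, `sync_period`) is an assumption used only to make the control flow concrete:

```python
def collect_round(server, centers, timeout_s):
    """Collect one synchronization period's local results; a center that
    times out is either restarted or deferred to the next round (a sketch)."""
    results, late = {}, []
    for center in centers:
        params = server.wait_for(center, timeout=timeout_s)  # assumed API
        if params is not None:
            results[center] = params
        elif server.policy == "restart":
            server.send_restart(center)  # redo local training in this round
        else:
            late.append(center)          # defer: fuse with next round instead

    # Dynamic adjustment of the training synchronization period: lengthen
    # it when many centers time out, so fewer rounds are disrupted.
    if len(late) > len(centers) // 2:
        server.sync_period *= 1.5
    return results, late
```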
Specifically, in one implementation of the present embodiment, step S302 includes the following steps:
step S302e, acquiring an evaluation instruction, and verifying an evaluation data set of each trained terminal through the evaluation instruction;
and step S302f, obtaining model training precision values corresponding to the trained terminals according to the verification result.
In one implementation of this embodiment, after collaborative training and reasoning, the platform may verify the intermediate results (i.e., the collaborative training results) and analyze the final result (i.e., the trained algorithm model). Specifically: the evaluation data set (i.e., the evaluation data produced by the trained algorithm) is verified through the evaluation code to obtain the accuracy value of the model, and the weight ratio (contribution degree) of the model in the fusion process is determined according to this accuracy value.
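For illustration, the accuracy values could be normalized into fusion weight ratios as follows; the proportional rule is an assumption, since the description only states that the weight ratio (contribution degree) is determined from the accuracy value:

```python
def contribution_weights(accuracies):
    """Turn per-center model training accuracy values into fusion weight
    ratios: a model that evaluates better contributes more (a sketch)."""
    total = sum(accuracies)
    if total == 0:
        return [1.0 / len(accuracies)] * len(accuracies)  # fall back to average
    return [a / total for a in accuracies]

# Example: contribution_weights([0.90, 0.80, 0.70]) -> [0.375, 0.3333, 0.2917]
```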
As shown in fig. 1, in an implementation manner of the embodiment of the present invention, the collaborative training method based on the intelligent computing network further includes the following steps:
and step S400, acquiring a multi-model fusion strategy according to the collaborative training and reasoning result, and fusing the algorithm in the trained terminal through the multi-model fusion strategy to obtain a collaborative calculation model of the cross-heterogeneous intelligent computation center based on the distributed multi-framework.
In this embodiment, in the collaborative computing platform based on the intelligent computing network, job management and core computing logic may be separated, i.e., collaborative computing job management and the collaborative computing core framework are decoupled in the platform architecture. In addition, to keep invasive code to a minimum, collaborative training with multiple parallel strategies is supported, customized parallel strategies are supported, and communication primitives such as gloo-based broadcast and allreduce are supported, thereby adapting to large-model cross-domain training scenarios.
Specifically, in one implementation of the present embodiment, step S400 includes the following steps:
step S401, acquiring an average fusion strategy or a contribution fusion strategy according to the model training precision value;
step S402, determining a model fusion weight ratio corresponding to each trained terminal according to the average fusion strategy or the contribution fusion strategy;
and step S403, fusing algorithms in all the trained terminals according to the determined weight ratio.
In this embodiment, a plurality of different fusion strategies may be configured in the collaborative computing platform based on the intelligent computing network, for example an average fusion strategy, or a fusion strategy that assigns different weight ratios to different model parameters according to contribution degree; the corresponding fusion strategy is selected according to the model training accuracy value of each computing center, and the algorithms in all trained terminals are fused.
The model training accuracy values of all computing centers are sorted, the minimum and maximum accuracy values are selected, and their difference is calculated; the difference is compared against a set threshold: if the difference is large, the fusion strategy that assigns different weight ratios according to contribution degree is selected; if the difference is small, the average fusion strategy is selected.
Further, if the selected fusion strategy is an average fusion strategy, fusing algorithms of all the calculation centers according to an average principle to obtain a fusion model; if the selected fusion strategy is the fusion strategy with different weight ratios, corresponding weight ratios are distributed to the algorithms of the corresponding computation centers according to the model training precision values, and finally the algorithms of all the computation centers are fused according to the set weight ratios to obtain a fusion model.
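Putting the selection rule together, the server-side decision can be sketched as follows; the threshold value is an assumed configuration parameter, and the proportional contribution weights follow the earlier sketch:

```python
def select_fusion_weights(accuracies, threshold=0.05):
    """Choose the fusion strategy from the spread of per-center accuracy
    values, as described: a large min/max gap selects contribution-based
    weights, a small gap selects plain averaging (a sketch)."""
    gap = max(accuracies) - min(accuracies)
    n = len(accuracies)
    if gap > threshold:
        total = sum(accuracies)
        return [a / total for a in accuracies]  # contribution fusion strategy
    return [1.0 / n] * n                        # average fusion strategy
```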
In one implementation manner of the embodiment of the present invention, the collaborative training method based on the intelligent computing network further includes the following steps:
step S500, the fused result is sent to a corresponding post-training terminal;
and step S600, setting a corresponding model initial value of the trained terminal according to the fused result, and taking the set model initial value as a model parameter of the next training.
In this embodiment, each computing center trains on its private data using its local computing power and deep learning framework, and after training the results are fused under different fusion strategies through the server's strategy file. After fusion, the fusion scheme can be further optimized; the fused result is returned to each computing center as the initial model value for the next round of local training, and because the parameter characteristics of the other participants are fused into the local model during training, after several rounds the accuracy of the local model can reach the accuracy obtained when the data are trained centrally.
In an actual application scenario of the present embodiment, the collaborative training method based on the intelligent computing network may include the following steps:
step S01, preparing a fusion machine in a platform, and adding a collaborative training and fusion algorithm;
step S02, preparing corresponding training machines and training data sets in the computing center 1 (client 1);
step S03, preparing corresponding training machines and training data sets in a computing center 2 (a client 2);
step S04, distributing corresponding mechanism accounts to each computing center in the platform;
step S05, adding training machines on each computing center in the platform;
step S06, after the addition is successful, the platform sends a request for acquiring the ID of the training machine to each computing center;
step S07, installing and starting agent programs on each computing center;
step S08, each computing center uploads the corresponding training machine ID to the platform through the agent program;
step S09, adding a data set on each computing center in the platform;
step S10, creating a task group in a platform;
step S11, creating tasks in the platform according to the training machine ID of the computing center 1, and adding the tasks into a task group;
step S12, creating tasks in the platform according to the training machine ID of the computing center 2, and adding the tasks into a task group;
step S13, starting collaborative training in a platform;
step S14, starting a fusion node container in the platform;
step S15, the platform sends a training machine starting command to the computing center 1;
step S16, starting a training machine in the computing center 1;
step S17, the platform sends a training machine starting command to the computing center 2;
step S18, starting a training machine in the computing center 2;
step S19, fusion training is carried out on the computing center 1 and the computing center 2 in the platform.
In the above practical application scenario, an account may be created for an existing intelligent computing network user in the collaborative computing platform based on the intelligent computing network, and a fusion machine is prepared by the scheduler or a designated intelligent computing network user and added to the platform. In the platform, the user can add machine information (i.e., the computing centers that need collaborative computing training and fusion); after the Proxy agent software is installed, the platform can manage the user's machines: the Proxy reports basic machine information, relays commands, and performs the starting or stopping of containers.
The following technical effects are achieved through the technical scheme:
according to the embodiment, technologies such as large model collaborative training, multi-model fusion, large model compression and the like which are difficult to realize by a single cluster can be realized, collaborative computing operation which spans a plurality of intelligent computing centers is completed through intelligent computing network infrastructure, and further brand-new computing paradigms and business scenes such as large model cross-domain collaborative computing, multi-center model aggregation, multi-center federal learning and the like are realized, so that the intelligent collaborative computing paradigms become keys for fully exerting the overall efficiency of the intelligent computing network and enabling industrial scale application of artificial intelligence.
Exemplary apparatus
Based on the above embodiment, the present invention further provides a collaborative training device based on an intelligent computing network, including:
the task group module is used for acquiring a plurality of algorithms to be trained and corresponding data sets and generating a plurality of task groups according to the acquired algorithms and data sets; wherein, each task group at least corresponds to an algorithm to be trained and a data set thereof;
the selection module is used for determining terminals to be trained in the distributed intelligent collaborative computing platform according to the selected task group and determining a to-be-trained algorithm and a data set corresponding to each terminal to be trained;
the collaborative training and reasoning module is used for carrying out collaborative training and reasoning on the models of all the terminals to be trained through a collaborative training strategy crossing the heterogeneous intelligent computation center to obtain collaborative training and reasoning results;
and the fusion module is used for acquiring a multi-model fusion strategy according to the collaborative training and reasoning result, and fusing the algorithm in the trained terminal through the multi-model fusion strategy to obtain a collaborative calculation model of the cross-heterogeneous intelligent computation center based on the distributed multi-framework.
Based on the above embodiment, the present invention also provides a terminal, and a functional block diagram thereof may be shown in fig. 4.
The terminal comprises: a processor, a memory, an interface, a display screen and a communication module connected through a system bus; the processor of the terminal provides computing and control capabilities; the memory of the terminal comprises a storage medium and an internal memory; the storage medium stores an operating system and a computer program; the internal memory provides an environment for running the operating system and computer program in the storage medium; the interface is used for connecting external devices such as mobile terminals and computers; the display screen is used for displaying the corresponding collaborative training information based on the intelligent computing network; and the communication module is used for communicating with a cloud server or a mobile terminal.
The computer program, when executed by the processor, is configured to implement a collaborative training method based on an intelligent computing network.
It will be appreciated by those skilled in the art that the functional block diagram shown in fig. 4 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the terminal to which the present inventive arrangements may be applied; a particular terminal may include more or fewer components than those shown, or combine some of the components, or have a different arrangement of components.
In one embodiment, a terminal is provided, including: a processor and a memory, wherein the memory stores a collaborative training program based on an intelligent computing network, which when executed by the processor is used to implement the collaborative training method based on an intelligent computing network as above.
In one embodiment, a storage medium is provided, wherein the storage medium stores a collaborative training program based on an intelligent computing network, which when executed by a processor is used to implement the collaborative training method based on an intelligent computing network as above.
Those skilled in the art will appreciate that implementing all or part of the above methods may be accomplished by a computer program instructing the relevant hardware, the computer program being stored on a non-volatile storage medium and, when executed, comprising the steps of the method embodiments above. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory.
In summary, the invention provides a collaborative training method, device, terminal and storage medium based on an intelligent computing network, wherein the method comprises: acquiring a plurality of algorithms to be trained and corresponding data sets, and generating a plurality of task groups from them; determining the terminals to be trained in the distributed intelligent collaborative computing platform according to the selected task group, and determining the algorithm to be trained and the data set corresponding to each terminal to be trained; performing collaborative training and reasoning on the models of all terminals to be trained through a collaborative training strategy spanning heterogeneous intelligent computing centers, to obtain collaborative training and reasoning results; and acquiring a multi-model fusion strategy according to the collaborative training and reasoning results, and fusing the algorithms in the trained terminals through the multi-model fusion strategy, to obtain a distributed multi-framework collaborative computing model spanning heterogeneous intelligent computing centers. The invention can thus realize technologies that are difficult for a single cluster, such as large-model collaborative training, multi-model fusion and large-model compression.
It is to be understood that the invention is not limited in its application to the examples described above, but is capable of modification and variation in light of the above teachings by those skilled in the art, and that all such modifications and variations are intended to be included within the scope of the appended claims.

Claims (12)

1. A collaborative training method based on an intelligent computing network, characterized by comprising the following steps:
acquiring a plurality of algorithms to be trained and corresponding data sets, and generating a plurality of task groups according to the acquired algorithms and data sets; wherein, each task group at least corresponds to an algorithm to be trained and a data set thereof;
determining terminals to be trained in the distributed intelligent collaborative computing platform according to the selected task group, and determining a to-be-trained algorithm and a data set corresponding to each terminal to be trained;
collaborative training and reasoning are carried out on the models of all the terminals to be trained through a collaborative training strategy of the cross heterogeneous intelligent computing center, and collaborative training and reasoning results are obtained;
acquiring a multi-model fusion strategy according to the collaborative training and reasoning result, and fusing the algorithm in the trained terminal through the multi-model fusion strategy to obtain a collaborative calculation model of the cross-heterogeneous intelligent computation center based on a distributed multi-framework;
the method for obtaining the multi-model fusion strategy according to the collaborative training and reasoning results, and fusing the algorithm in the trained terminal through the multi-model fusion strategy comprises the following steps:
acquiring an average fusion strategy or a contribution fusion strategy according to the model training precision value;
determining a model fusion weight ratio corresponding to each trained terminal according to the average fusion strategy or the contribution fusion strategy;
and fusing algorithms in all the trained terminals according to the determined weight ratio.
2. The collaborative training method based on the intelligent computing network according to claim 1, wherein the collaborative training and reasoning are performed on the models of all the terminals to be trained through the collaborative training strategy crossing heterogeneous intelligent computing centers to obtain collaborative training and reasoning results, and the method comprises the following steps:
dividing all terminals to be trained into heterogeneous AI clusters and heterogeneous AI architectures according to AI types;
and respectively carrying out cooperative training and reasoning on the models between the heterogeneous AI clusters and/or the heterogeneous AI architecture through the cooperative training strategy of the cross-heterogeneous intelligent computation center.
3. The collaborative training method based on an intelligent computing network according to claim 2, wherein after dividing all terminals to be trained into heterogeneous AI clusters and heterogeneous AI architectures according to AI type, the method comprises:
initializing each heterogeneous AI architecture;
and transmitting the deep learning framework field of the corresponding algorithm into each heterogeneous AI architecture, and calling the corresponding API interface to obtain the corresponding model parameters.
4. The intelligent computing network-based collaborative training method according to claim 2, wherein the collaborative training and reasoning of models between the heterogeneous AI clusters and/or the heterogeneous AI architecture by a collaborative training strategy across heterogeneous intelligent computing centers, respectively, previously comprises:
acquiring memory information, communication delay information, computing capacity information and algorithm model information corresponding to each terminal to be trained; and sending a collaborative computing resource allocation scheme to the corresponding terminal to be trained according to the memory information, the communication delay information, the computing capability information and the algorithm model information.
5. The intelligent computing network-based collaborative training method according to claim 2, wherein the collaborative training and reasoning of models between the heterogeneous AI clusters and/or the heterogeneous AI architecture, respectively, through a collaborative training strategy across heterogeneous intelligent computing centers, includes:
dynamically adjusting the training synchronization period in each terminal to be trained;
and determining the training overtime and the failure time in each terminal to be trained, and resetting the local training of each terminal to be trained according to the training overtime and the failure time.
6. The collaborative training method based on the intelligent computing network according to claim 2, wherein performing collaborative training and reasoning on the models between the heterogeneous AI clusters and/or the heterogeneous AI architectures, respectively, through the collaborative training strategy across heterogeneous intelligent computing centers comprises:
acquiring an evaluation instruction, and verifying the model of each trained terminal on its evaluation data set according to the evaluation instruction;
and obtaining the model training precision value corresponding to each trained terminal according to the verification results.
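Illustratively, the evaluation step could compute a per-terminal precision value by scoring each trained model on its evaluation data set (callable models and (input, label) pairs are assumptions, not the patent's interface):

```python
def evaluate_terminals(models, eval_sets):
    """Verify each trained terminal's model on its evaluation data set and
    return the model training precision values used by the fusion step."""
    precisions = []
    for model, dataset in zip(models, eval_sets):
        correct = sum(1 for x, y in dataset if model(x) == y)
        precisions.append(correct / len(dataset))
    return precisions
```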
7. The collaborative training method based on the intelligent computing network according to claim 1, wherein after obtaining the multi-model fusion strategy according to the collaborative training and reasoning results and fusing the algorithms in the trained terminals through the multi-model fusion strategy, the method further comprises:
sending the fused result to each corresponding trained terminal;
and setting the initial model values of the corresponding trained terminal according to the fused result, and taking the set initial model values as the model parameters for the next round of training.
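A minimal sketch of this feedback step, assuming each trained terminal exposes a setter for its initial model values (an illustrative interface, not from the patent):

```python
def broadcast_fused_result(fused_params, terminals):
    """Send the fused result back to every trained terminal and install it
    as the initial model values, i.e. the parameters of the next round."""
    for t in terminals:
        t.set_initial_params(fused_params)
```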
8. The collaborative training method based on the intelligent computing network according to claim 1, further comprising:
setting, according to an input instruction, an infrastructure that satisfies the collaborative training and fusion, and setting the corresponding collaborative training and fusion algorithms.
9. The collaborative training method based on the intelligent computing network according to claim 1, wherein the distributed multi-framework-based heterogeneous intelligent computing center comprises at least: a central processing unit (CPU), a neural processing unit (NPU) and a graphics processing unit (GPU).
10. A collaborative training apparatus based on an intelligent computing network, comprising:
a task group module, used for acquiring a plurality of algorithms to be trained and their corresponding data sets, and generating a plurality of task groups from the acquired algorithms and data sets, wherein each task group corresponds to at least one algorithm to be trained and its data set;
a selection module, used for determining the terminals to be trained in the distributed intelligent collaborative computing platform according to the selected task group, and determining the algorithm to be trained and the data set corresponding to each terminal to be trained;
a collaborative training and reasoning module, used for performing collaborative training and reasoning on the models of all the terminals to be trained through a collaborative training strategy across heterogeneous intelligent computing centers to obtain collaborative training and reasoning results;
and a fusion module, used for obtaining a multi-model fusion strategy according to the collaborative training and reasoning results, and fusing the algorithms in the trained terminals through the multi-model fusion strategy to obtain a distributed multi-framework-based collaborative computing model across heterogeneous intelligent computing centers;
wherein obtaining the multi-model fusion strategy according to the collaborative training and reasoning results and fusing the algorithms in the trained terminals through the multi-model fusion strategy comprises the following steps:
acquiring an average fusion strategy or a contribution fusion strategy according to the model training precision values;
determining the model fusion weight ratio corresponding to each trained terminal according to the average fusion strategy or the contribution fusion strategy;
and fusing the algorithms in all the trained terminals according to the determined weight ratios.
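Tying the four modules together, one round of the apparatus's flow might look like the following, reusing the illustrative helpers sketched after the earlier claims (train_locally(), model and model_params() are further assumptions):

```python
def collaborative_round(terminals, eval_sets):
    """One illustrative round: local training, evaluation, fusion, feedback."""
    for t in terminals:
        t.train_locally()  # collaborative training and reasoning step
    precisions = evaluate_terminals([t.model for t in terminals], eval_sets)
    fused = fuse_models([t.model_params() for t in terminals],
                        precisions, strategy="contribution")
    broadcast_fused_result(fused, terminals)  # seeds the next training round
```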
11. A terminal, comprising: a processor and a memory storing a collaborative training program based on an intelligent computing network, wherein the program, when executed by the processor, implements the collaborative training method based on the intelligent computing network according to any one of claims 1 to 9.
12. A storage medium, characterized in that the storage medium is a computer-readable storage medium storing a collaborative training program based on an intelligent computing network, which, when executed by a processor, implements the collaborative training method based on the intelligent computing network according to any one of claims 1 to 9.
CN202210793410.0A 2022-07-07 2022-07-07 Collaborative training method, device, terminal and storage medium based on intelligent computing network Active CN115297008B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210793410.0A CN115297008B (en) 2022-07-07 2022-07-07 Collaborative training method, device, terminal and storage medium based on intelligent computing network

Publications (2)

Publication Number Publication Date
CN115297008A CN115297008A (en) 2022-11-04
CN115297008B true CN115297008B (en) 2023-08-22

Family

ID=83822848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210793410.0A Active CN115297008B (en) 2022-07-07 2022-07-07 Collaborative training method, device, terminal and storage medium based on intelligent computing network

Country Status (1)

Country Link
CN (1) CN115297008B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116074179B (en) * 2023-03-06 2023-07-14 鹏城实验室 High expansion node system based on CPU-NPU cooperation and training method
CN116595384B (en) * 2023-07-14 2023-11-24 支付宝(杭州)信息技术有限公司 Model training method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109558778A (en) * 2017-09-27 2019-04-02 武汉嫦娥信息科技有限公司 A kind of target tracking algorithm based on multi-method fusion
CN113407312A (en) * 2020-03-17 2021-09-17 阿尔法云计算(深圳)有限公司 Task cooperative processing method, device and system for model training
CN113609508A (en) * 2021-08-24 2021-11-05 上海点融信息科技有限责任公司 Block chain-based federal learning method, device, equipment and storage medium
CN114244835A (en) * 2021-11-19 2022-03-25 海南火链科技有限公司 Decentralized self-adaptive collaborative training method and device based on block chain
CN114298326A (en) * 2021-12-29 2022-04-08 杭州海康威视数字技术股份有限公司 Model training method and device and model training system
CN114330464A (en) * 2020-09-27 2022-04-12 南京大学 Multi-terminal collaborative training algorithm and system fusing meta learning
CN114694015A (en) * 2022-06-02 2022-07-01 深圳市万物云科技有限公司 General framework-based multi-task federal learning scene recognition method and related components

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11188791B2 (en) * 2019-11-18 2021-11-30 International Business Machines Corporation Anonymizing data for preserving privacy during use for federated machine learning

Similar Documents

Publication Publication Date Title
CN115297008B (en) Collaborative training method, device, terminal and storage medium based on intelligent computing network
CN113448721A (en) Network system for computing power processing and computing power processing method
Téllez et al. A tabu search method for load balancing in fog computing
Zhai et al. Toward reinforcement-learning-based service deployment of 5G mobile edge computing with request-aware scheduling
CN110658794B (en) Manufacturing execution system
CN111865622B (en) Cloud service metering and charging method and system based on rule engine cluster
CN112288423A (en) Aggregation payment method and system of distributed framework
Long et al. A novel fault-tolerant approach to web service composition upon the edge computing environment
Edinger et al. Decentralized low-latency task scheduling for ad-hoc computing
CN111597035B (en) Simulation engine time propulsion method and system based on multithreading
Qadeer et al. DDPG-edge-cloud: A deep-deterministic policy gradient based multi-resource allocation in edge-cloud system
CN116954944A (en) Distributed data stream processing method, device and equipment based on memory grid
CN115361280B (en) Method, device, equipment and storage medium for invoking calculation power network
Mora et al. Serverless computing at the edge for aiot applications
CN115001692A (en) Model updating method and device, computer readable storage medium and electronic device
WO2023209414A1 (en) Methods and apparatus for computing resource allocation
Xhafa et al. Jxta-Overlay: An interface for efficient peer selection in P2P JXTA-based systems
CN113296750A (en) Function creating method and system, and function calling method and system
CN115250276A (en) Distributed system and data processing method and device
CN117076057B (en) AI service request scheduling method, device, equipment and medium
CN116887357B (en) Computing platform management system based on artificial intelligence
CN113485718B (en) Context-aware AIoT application program deployment method in edge cloud cooperative system
KR102642396B1 (en) Batch scheduling device for deep learning inference model using limited gpu resources
CN115348324B (en) Method and device for determining optimized scheduling strategy and electronic equipment
CN114866612B (en) Electric power micro-service unloading method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant