WO2023201981A1 - Mixture-of-experts model implementation method and system, electronic device and storage medium - Google Patents

Mixture-of-experts model implementation method and system, electronic device and storage medium

Info

Publication number
WO2023201981A1
WO2023201981A1 (PCT/CN2022/119752; CN2022119752W)
Authority
WO
WIPO (PCT)
Prior art keywords
communication group
parallel communication
computing device
tensor
expert
Prior art date
Application number
PCT/CN2022/119752
Other languages
English (en)
French (fr)
Inventor
沈亮
王海峰
吴华超
巩伟宝
吴志华
于佃海
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司 filed Critical 北京百度网讯科技有限公司
Priority to EP22865889.4A (EP4287074A1)
Publication of WO2023201981A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G06N 3/045 Combinations of networks
    • G06N 3/0495 Quantised networks; Sparse networks; Compressed networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation using electronic means
    • G06N 3/08 Learning methods
    • G06N 3/098 Distributed learning, e.g. federated learning

Definitions

  • The present disclosure relates to the field of artificial intelligence technology, and in particular to mixture-of-experts model implementation methods, systems, electronic devices and storage media in fields such as deep learning and distributed storage.
  • The Mixture-of-Experts (MoE) model is a kind of neural network; unlike a general neural network, it can train multiple models separately according to the data.
  • Each such model can be called an expert network, i.e., a "mixed expert".
  • The idea of the model is to train multiple expert networks, each applied to a different part of the dataset.
  • As an emerging sparsely activated architecture, the mixture-of-experts model enables ultra-large-scale model training.
  • The present disclosure provides a mixture-of-experts model implementation method and system, an electronic device and a storage medium.
  • A mixture-of-experts model implementation method includes:
  • constructing a communication group, where the communication group includes a tensor parallel communication group;
  • the tensor parallel communication group includes at least two computing devices;
  • the sparse parameters of the computing devices in the same tensor parallel communication group are partitioned in a tensor-parallel manner; and
  • training a mixture-of-experts model based on the communication group.
  • A mixture-of-experts model implementation system includes a communication group and is configured to train a mixture-of-experts model based on the communication group;
  • the communication group includes a tensor parallel communication group, the tensor parallel communication group includes at least two computing devices, and the sparse parameters of the computing devices in the same tensor parallel communication group are partitioned in a tensor-parallel manner.
  • An electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor;
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method described above.
  • A non-transitory computer-readable storage medium stores computer instructions for causing a computer to perform the method described above.
  • A computer program product includes a computer program/instructions which, when executed by a processor, implement the method described above.
  • One embodiment of the above disclosure has the following advantage or beneficial effect: by partitioning the sparse parameters in a tensor-parallel manner, the sparse parameters (such as expert network parameters) can be prevented from becoming too large, thereby avoiding problems such as graphics memory overflow caused by a computing device being unable to hold them, and ensuring that model training proceeds normally.
  • Figure 1 is a schematic diagram of the data processing flow of an existing mixture-of-experts model;
  • Figure 2 is a flow chart of an embodiment of the mixture-of-experts model implementation method of the present disclosure;
  • Figure 3 is a schematic diagram of the data processing flow of the mixture-of-experts model after data parallelism is introduced;
  • Figure 4 is a schematic diagram of the data processing flow of the mixture-of-experts model after the tensor-parallel partitioning and data parallelism of the present disclosure are introduced;
  • Figure 5 is a schematic structural diagram of a first embodiment 500 of the mixture-of-experts model implementation system of the present disclosure;
  • Figure 6 is a schematic structural diagram of a second embodiment 600 of the mixture-of-experts model implementation system of the present disclosure;
  • Figure 7 is a schematic block diagram of an electronic device 700 that can be used to implement embodiments of the present disclosure.
  • Figure 1 is a schematic diagram of the data processing flow of an existing mixture-of-experts model.
  • After the input data is processed (i.e., computed) by the backbone network, the processing result H0 is obtained.
  • The backbone network usually consists of several fully connected layers.
  • The gating network then selects, for the processing result H0, the expert network(s) to serve as the routing network.
  • Usually the k expert networks with the highest scores are selected, where k is a positive integer; that is, one or more expert networks may be selected. As shown in the figure, assuming expert network 0 is selected, expert network 0 can then process the processing result H0 to obtain the processing result H0' as the final output result.
  • FIG. 2 is a flow chart of an embodiment of the mixture-of-experts model implementation method of the present disclosure. As shown in Figure 2, it includes the following specific implementation.
  • In step 201, a communication group is constructed.
  • The communication group includes a tensor parallel communication group.
  • The tensor parallel communication group includes at least two computing devices (workers), and the sparse parameters of the computing devices in the same tensor parallel communication group are partitioned in a tensor-parallel manner.
  • In step 202, a mixture-of-experts model is trained based on the communication group.
  • The constructed communication group may also include a data parallel communication group. A data parallel communication group includes at least two computing devices, and the computing devices in the same data parallel communication group operate in a data-parallel manner. In addition, for any tensor parallel communication group, each computing device in it is included in one data parallel communication group, and the first set of computing devices, composed of the computing devices included in all data parallel communication groups, is equal to the second set of computing devices, composed of the computing devices included in all tensor parallel communication groups.
  • That is, a hybrid communication group including tensor parallel communication groups and data parallel communication groups can be constructed.
  • The numbers of tensor parallel communication groups and of data parallel communication groups can each be greater than or equal to 2.
  • FIG. 3 is a schematic diagram of the data processing flow of the mixture-of-experts model after data parallelism is introduced.
  • Computing device 0 and computing device 1 each correspond to 3 expert networks, i.e., the total number of expert networks (total expert) is 6, and the parameters of the expert networks differ from one another.
  • The gating network can select the routing expert network from the 6 expert networks, which involves cross-card communication (i.e., communication across computing devices).
  • Taking computing device 0 as an example, the processing result H0 of the backbone network can be sent to the selected expert network 4; after processing by expert network 4, the processing result H0' is returned to computing device 0, yielding the final output result 0.
  • FIG. 4 is a schematic diagram of the data processing flow of the mixture-of-experts model after the tensor-parallel partitioning and data parallelism of the present disclosure are introduced.
  • Any computing device may include: a backbone network, configured to perform predetermined processing on input data to obtain a first processing result; a gating network, configured to select, from the expert networks of the computing devices in its data parallel communication group, an expert network to serve as the routing network, and to send the first processing result to the selected expert network; and an expert network, configured to perform predetermined processing on a received first processing result to obtain a second processing result, and to return a return result determined from the second processing result to the computing device corresponding to the received first processing result. Different computing devices in the same tensor parallel communication group correspond to the same backbone network and the same expert network.
  • For ease of distinction, the processing result of the backbone network and the processing result of the expert network are called the first processing result and the second processing result, respectively. The specific nature of the predetermined processing is not limited and can be determined according to the actual situation.
  • The dense (Dense) parameters of the computing devices in the same tensor parallel communication group may also be partitioned in the tensor-parallel manner, where the sparse parameters may include expert network parameters and the dense parameters may include backbone network parameters.
  • Suppose there are 4 computing devices in total (computing device 0 to computing device 3); the parallel scheme is 2-way data parallelism and 2-way tensor parallelism, and there are two expert networks, expert network 0 and expert network 1.
  • [computing device 0, computing device 1] and [computing device 2, computing device 3] are the 2 tensor parallel communication groups,
  • and [computing device 0, computing device 2] and [computing device 1, computing device 3] are the 2 data parallel communication groups; together, the 2 tensor parallel communication groups and the 2 data parallel communication groups form the constructed hybrid communication group.
  • The backbone network parameters are partitioned in the tensor-parallel manner, where computing device 0 and computing device 1 share the same backbone network, and computing device 2 and computing device 3 share the same backbone network.
  • Computing device 0 and computing device 1 read the same input data 0 and, after processing by the backbone network, obtain the same first processing result H_0.
  • Likewise, computing device 2 and computing device 3 read the same input data 1 and, after processing by the backbone network, obtain the same first processing result H_1.
  • Computing device 0 and computing device 1 share the same expert network 0, and computing device 2 and computing device 3 share the same expert network 1.
  • The gating network can select a routing expert network for each corresponding first processing result. Assuming the number of selected expert networks is 1, then in the data parallel communication group [computing device 0, computing device 2] there are 2 expert networks to choose from: the gating network in computing device 0 selects expert network 1 in computing device 2, and the gating network in computing device 2 selects expert network 0 in computing device 0. The gating networks in computing device 0 and computing device 2 can then send their corresponding first processing results to the selected expert networks for processing.
  • Similarly, the gating network in computing device 1 selects expert network 1 in computing device 3,
  • and the gating network in computing device 3 selects expert network 0 in computing device 1.
  • The gating networks in computing device 1 and computing device 3 can then send their corresponding first processing results to the selected expert networks for processing. Since the backbone network parameters are partitioned in the tensor-parallel manner, the routing results of computing device 1 are consistent with those of computing device 0, and the routing results of computing device 3 are consistent with those of computing device 2.
  • Since the expert network parameters are partitioned in the tensor-parallel manner,
  • the expert networks in the same tensor parallel communication group obtain the same second processing result:
  • expert network 0 in computing device 0 and computing device 1 obtains the same second processing result H_3,
  • and expert network 1 in computing device 2 and computing device 3 obtains the same second processing result H_4.
  • The obtained second processing result also needs to be returned to the corresponding computing device.
  • The expert network may return the return result determined from the second processing result to the computing device corresponding to the received first processing result.
  • The return result may be the second processing result itself, i.e., the second processing result can be returned directly as the return result.
  • Alternatively, the return result may include part of the second processing result.
  • In that case, for any tensor parallel communication group, the received parts belonging to the same second processing result can be combined to obtain the complete second processing result.
  • The parts are returned by the same expert network corresponding to the computing devices in the same tensor parallel communication group.
  • Which method to use can be determined according to actual needs, which is very flexible and convenient; preferably, the latter method can be used.
  • For example, computing device 0 may return only the first half H_3[:2] of the second processing result H_3 to computing device 2, and computing device 1 may return only the second half H_3[2:] to computing device 3; likewise, computing device 2 may return only the first half H_4[:2] of the second processing result H_4 to computing device 0, and computing device 3 may return only the second half H_4[2:] to computing device 1, thereby reducing the amount of data to be transmitted and hence the resource consumption.
  • The above takes a tensor parallel communication group containing two computing devices as an example. With three computing devices, each would return 1/3 of the same second processing result, i.e., the first, middle and last thirds respectively. Typically, the returned parts do not overlap with one another, to further reduce resource consumption.
  • Computing device 0 and computing device 1 receive H_4[:2] and H_4[2:] respectively, so H_4[:2] and H_4[2:] can be combined (via an all-gather within the tensor parallel communication group) to obtain the complete second processing result H_4; likewise, computing device 2 and computing device 3 receive H_3[:2] and H_3[2:] respectively, which can be combined to obtain the complete second processing result H_3.
  • Mixture-of-experts model training can then be completed; the training procedure can be the same as in the prior art.
  • The problem of overly large sparse parameters (such as expert network parameters) can thus be avoided, thereby avoiding problems such as graphics memory overflow caused by computing devices being unable to hold them, ensuring that model training proceeds normally, and allowing storage space to be used more effectively to support larger-scale model training.
  • Figure 5 is a schematic structural diagram of a first embodiment 500 of the mixture-of-experts model implementation system of the present disclosure. As shown in Figure 5, it includes a communication group 501 for training a mixture-of-experts model based on the communication group 501.
  • The communication group 501 may include a tensor parallel communication group; the tensor parallel communication group includes at least two computing devices, and the sparse parameters of the computing devices in the same tensor parallel communication group are partitioned in a tensor-parallel manner.
  • The communication group 501 may also include a data parallel communication group.
  • The data parallel communication group includes at least two computing devices, and the computing devices in the same data parallel communication group operate in a data-parallel manner.
  • For any tensor parallel communication group, each computing device in it is included in one data parallel communication group, and the first set of computing devices, composed of the computing devices included in all data parallel communication groups, is equal to the second set of computing devices, composed of the computing devices included in all tensor parallel communication groups.
  • That is, a hybrid communication group including tensor parallel communication groups and data parallel communication groups can be constructed.
  • The numbers of tensor parallel communication groups and of data parallel communication groups can each be greater than or equal to 2.
  • Any computing device may include: a backbone network, configured to perform predetermined processing on input data to obtain a first processing result; a gating network, configured to select, from the expert networks of the computing devices in its data parallel communication group, an expert network to serve as the routing network, and to send the first processing result to the selected expert network; and an expert network, configured to perform predetermined processing on a received first processing result to obtain a second processing result, and to return a return result determined from the second processing result to the computing device corresponding to the received first processing result. Different computing devices in the same tensor parallel communication group correspond to the same backbone network and the same expert network.
  • The dense parameters of the computing devices in the same tensor parallel communication group may also be partitioned in the tensor-parallel manner, where the sparse parameters may include expert network parameters and the dense parameters may include backbone network parameters.
  • The obtained second processing result also needs to be returned to the corresponding computing device.
  • The expert network may return the return result determined from the second processing result to the computing device corresponding to the received first processing result.
  • The return result may be the second processing result itself, i.e., the second processing result can be returned directly as the return result.
  • Alternatively, the return result may include part of the second processing result.
  • FIG. 6 is a schematic structural diagram of a second embodiment 600 of the mixture-of-experts model implementation system of the present disclosure.
  • Compared with the embodiment shown in Figure 5, it further includes an aggregation module 502, configured to, for any tensor parallel communication group, combine the received parts belonging to the same second processing result to obtain the complete second processing result, where the parts are returned by the same expert network corresponding to the computing devices in the same tensor parallel communication group.
  • The problem of overly large sparse parameters (such as expert network parameters) can thus be avoided, thereby avoiding problems such as graphics memory overflow caused by computing devices being unable to hold them, ensuring that model training proceeds normally, and allowing storage space to be used more effectively to support larger-scale model training.
  • Artificial intelligence is the discipline of studying how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking and planning).
  • Artificial intelligence hardware technology generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage and big data processing.
  • Artificial intelligence software technology mainly includes several major directions such as computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing and knowledge graph technology.
  • The data in the embodiments of the present disclosure are not targeted at any specific user and cannot reflect the personal information of any specific user.
  • The collection, storage, use, processing, transmission, provision and disclosure of the user personal information involved all comply with relevant laws and regulations and do not violate public order and good customs.
  • According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
  • FIG. 7 shows a schematic block diagram of an electronic device 700 that may be used to implement embodiments of the present disclosure.
  • Electronic devices are intended to refer to various forms of digital computers, such as laptop computers, desktop computers, workstations, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices.
  • The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementations of the disclosure described and/or claimed herein.
  • The device 700 includes a computing device 701 that can perform various appropriate actions and processes according to a computer program stored in read-only memory (ROM) 702 or loaded from the storage unit 708 into random access memory (RAM) 703. The RAM 703 can also store various programs and data required for the operation of the device 700.
  • The computing device 701, the ROM 702 and the RAM 703 are connected to one another via a bus 704.
  • An input/output (I/O) interface 705 is also connected to the bus 704.
  • Multiple components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard or mouse; an output unit 707 such as various types of displays and speakers; a storage unit 708 such as a magnetic disk or optical disc; and a communication unit 709 such as a network card, modem or wireless communication transceiver.
  • The communication unit 709 allows the device 700 to exchange information/data with other devices over computer networks such as the Internet and/or various telecommunication networks.
  • The computing device 701 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing device 701 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing devices running machine learning model algorithms, digital signal processors (DSPs), and any appropriate processors, controllers, microcontrollers, etc. The computing device 701 performs the various methods and processes described above, such as the methods described in this disclosure. For example, in some embodiments, the methods described in the present disclosure may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708.
  • In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709.
  • When the computer program is loaded into the RAM 703 and executed by the computing device 701, one or more steps of the methods described in this disclosure may be performed.
  • Alternatively, in other embodiments, the computing device 701 may be configured to perform the methods described in this disclosure in any other suitable manner (e.g., via firmware).
  • Various implementations of the systems and techniques described above may be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • These various implementations may include implementation in one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device and at least one output device, and transmit data and instructions to the storage system, the at least one input device and the at least one output device.
  • Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer or other programmable data processing apparatus, so that when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
  • A machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared or semiconductor systems, apparatuses or devices, or any suitable combination of the foregoing.
  • More specific examples of machine-readable storage media would include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • The systems and techniques described herein may be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer.
  • Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual, auditory or tactile feedback), and input from the user may be received in any form (including acoustic, voice or tactile input).
  • The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware or front-end components.
  • The components of the system may be interconnected by digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LAN), wide area networks (WAN) and the Internet.
  • Computer systems may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact over a communications network.
  • The relationship of client and server is created by computer programs running on the corresponding computers and having a client-server relationship with each other.
  • The server can be a cloud server, a server of a distributed system, or a server combined with a blockchain.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present disclosure provides a mixture-of-experts model implementation method and system, an electronic device and a storage medium, relating to artificial intelligence fields such as deep learning and distributed storage. The method may include: constructing a communication group, the communication group including a tensor parallel communication group, the tensor parallel communication group including at least two computing devices, the sparse parameters of the computing devices in the same tensor parallel communication group being partitioned in a tensor-parallel manner; and training a mixture-of-experts model based on the communication group. Applying the solution of the present disclosure can ensure, among other things, that model training proceeds normally.

Description

Mixture-of-experts model implementation method and system, electronic device and storage medium
This application claims priority to the Chinese patent application No. 202210430519.8, filed on April 22, 2022 and entitled "Mixture-of-Experts Model Implementation Method, System, Electronic Device and Storage Medium".
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and in particular to mixture-of-experts model implementation methods, systems, electronic devices and storage media in fields such as deep learning and distributed storage.
Background
The Mixture-of-Experts (MoE) model is a kind of neural network; unlike a general neural network, it can train multiple models separately according to the data, and each such model can be called an expert network. That is, the idea of the mixture-of-experts model is to train multiple expert networks, each applied to a different part of the dataset. As an emerging sparsely activated deep learning model architecture, the mixture-of-experts model enables ultra-large-scale model training.
Summary
The present disclosure provides a mixture-of-experts model implementation method and system, an electronic device and a storage medium.
A mixture-of-experts model implementation method includes:
constructing a communication group, where the communication group includes a tensor parallel communication group, the tensor parallel communication group includes at least two computing devices, and sparse parameters of the computing devices in the same tensor parallel communication group are partitioned in a tensor-parallel manner;
training a mixture-of-experts model based on the communication group.
A mixture-of-experts model implementation system includes a communication group and is configured to train a mixture-of-experts model based on the communication group;
the communication group includes a tensor parallel communication group, the tensor parallel communication group includes at least two computing devices, and sparse parameters of the computing devices in the same tensor parallel communication group are partitioned in a tensor-parallel manner.
An electronic device includes:
at least one processor; and
a memory communicatively connected to the at least one processor; where
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method described above.
A non-transitory computer-readable storage medium stores computer instructions, where the computer instructions are used to cause a computer to perform the method described above.
A computer program product includes a computer program/instructions which, when executed by a processor, implement the method described above.
One embodiment of the above disclosure has the following advantage or beneficial effect: by partitioning the sparse parameters in a tensor-parallel manner, the sparse parameters (such as expert network parameters) can be prevented from becoming too large, thereby avoiding problems such as graphics memory overflow caused by a computing device being unable to hold them, and ensuring that model training proceeds normally.
It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become easy to understand from the following description.
Brief Description of the Drawings
The drawings are used for a better understanding of the solution and do not constitute a limitation of the present disclosure. Among them:
Figure 1 is a schematic diagram of the data processing flow of an existing mixture-of-experts model;
Figure 2 is a flow chart of an embodiment of the mixture-of-experts model implementation method of the present disclosure;
Figure 3 is a schematic diagram of the data processing flow of the mixture-of-experts model after data parallelism is introduced;
Figure 4 is a schematic diagram of the data processing flow of the mixture-of-experts model after the tensor-parallel partitioning and data parallelism of the present disclosure are introduced;
Figure 5 is a schematic structural diagram of a first embodiment 500 of the mixture-of-experts model implementation system of the present disclosure;
Figure 6 is a schematic structural diagram of a second embodiment 600 of the mixture-of-experts model implementation system of the present disclosure;
Figure 7 is a schematic block diagram of an electronic device 700 that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the drawings, including various details of the embodiments to aid understanding; they should be regarded as merely exemplary. Therefore, those of ordinary skill in the art should recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. The character "/" herein generally indicates an "or" relationship between the associated objects before and after it.
Figure 1 is a schematic diagram of the data processing flow of an existing mixture-of-experts model. As shown in Figure 1, after input data is processed (i.e., computed) by the backbone network (Backbone), which usually consists of several fully connected layers, a processing result H0 is obtained. A gating network then selects, for the processing result H0, the expert network(s) to serve as the routing network, usually the k expert networks with the highest scores, where k is a positive integer; that is, one or more expert networks may be selected. As shown in the figure, assuming expert network 0 is selected, expert network 0 can then process the processing result H0 to obtain the processing result H0' as the final output result.
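To make the routing step above concrete, the following is a minimal Python sketch of top-k gating (assuming NumPy; the function name top_k_gating, the weight shapes and the softmax scoring are illustrative assumptions, not an implementation specified by the disclosure):

```python
import numpy as np

def top_k_gating(h0: np.ndarray, gate_weight: np.ndarray, k: int = 1):
    """Score every expert for the backbone output h0 and pick the top k.

    h0:          backbone output, shape (hidden_dim,)
    gate_weight: gating-network weight, shape (hidden_dim, num_experts)
    Returns the indices of the k highest-scoring experts and their
    normalized routing weights.
    """
    scores = h0 @ gate_weight                      # one score per expert
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                           # softmax over experts
    top_idx = np.argsort(probs)[-k:][::-1]         # k highest-scoring experts
    top_w = probs[top_idx] / probs[top_idx].sum()  # renormalize selected weights
    return top_idx, top_w

# Example: route one token among 6 experts, selecting k=1 as in Figure 1.
rng = np.random.default_rng(0)
h0 = rng.normal(size=16)
gate_w = rng.normal(size=(16, 6))
print(top_k_gating(h0, gate_w, k=1))
```

With k = 1 this reduces to the single-expert selection shown in Figure 1.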
For the mixture-of-experts model shown in Figure 1, in practical applications the problem of overly large expert network parameters is very likely to arise, so that a computing device cannot hold them (e.g., graphics memory overflows), which in turn affects normal training of the model.
To this end, the present disclosure proposes a mixture-of-experts model implementation method. Accordingly, Figure 2 is a flow chart of an embodiment of the mixture-of-experts model implementation method of the present disclosure. As shown in Figure 2, it includes the following specific implementation.
In step 201, a communication group is constructed; the communication group includes a tensor parallel communication group, the tensor parallel communication group includes at least two computing devices (workers), and the sparse (Sparse) parameters of the computing devices in the same tensor parallel communication group are partitioned in a tensor-parallel manner.
In step 202, a mixture-of-experts model is trained based on the communication group.
In the solution of the above method embodiment, by partitioning the sparse parameters in a tensor-parallel manner, the sparse parameters (such as expert network parameters) can be prevented from becoming too large, thereby avoiding problems such as graphics memory overflow caused by computing devices being unable to hold them, and ensuring that model training proceeds normally.
In an embodiment of the present disclosure, the constructed communication group may also include a data parallel communication group; a data parallel communication group includes at least two computing devices, and the computing devices in the same data parallel communication group operate in a data-parallel manner. In addition, for any tensor parallel communication group, each computing device in it is included in one data parallel communication group, and the first set of computing devices, composed of the computing devices included in all data parallel communication groups, is equal to the second set of computing devices, composed of the computing devices included in all tensor parallel communication groups.
That is, a hybrid communication group including tensor parallel communication groups and data parallel communication groups can be constructed; accordingly, the numbers of tensor parallel communication groups and of data parallel communication groups can each be greater than or equal to 2.
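As an illustration of such a hybrid communication group, here is a minimal sketch that derives the group memberships from a (data-parallel x tensor-parallel) grid of ranks; the grid layout and the helper name build_hybrid_groups are assumptions made for this example rather than a scheme mandated by the disclosure:

```python
def build_hybrid_groups(world_size: int, tp_degree: int):
    """Partition ranks 0..world_size-1 into tensor-parallel (TP) and
    data-parallel (DP) communication groups.

    Ranks are viewed as a (dp_degree x tp_degree) grid: each row is one
    TP group, each column is one DP group.
    """
    assert world_size % tp_degree == 0
    dp_degree = world_size // tp_degree
    tp_groups = [list(range(r * tp_degree, (r + 1) * tp_degree))
                 for r in range(dp_degree)]
    dp_groups = [list(range(c, world_size, tp_degree))
                 for c in range(tp_degree)]
    # The set of devices covered by all DP groups equals the set covered
    # by all TP groups (the set-equality constraint stated above).
    assert {r for g in dp_groups for r in g} == {r for g in tp_groups for r in g}
    return tp_groups, dp_groups

# The 4-device example discussed below: 2-way data parallel, 2-way tensor parallel.
tp, dp = build_hybrid_groups(world_size=4, tp_degree=2)
print(tp)  # [[0, 1], [2, 3]]
print(dp)  # [[0, 2], [1, 3]]
```

In a real distributed framework, each returned rank list would typically be handed to the framework's collective-communication group constructor; the helper above only computes the memberships and checks the constraint.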
By introducing data parallelism into the mixture-of-experts model, the overall training throughput can be improved; accordingly, introducing data parallelism and tensor-parallel partitioning at the same time combines the advantages of both approaches, further improving the training effect of the model.
The specific implementation of data parallelism is described below based on the mixture-of-experts model shown in Figure 1. Figure 3 is a schematic diagram of the data processing flow of the mixture-of-experts model after data parallelism is introduced. As shown in Figure 3, suppose computing device 0 and computing device 1 each correspond to 3 expert networks, i.e., the total number of expert networks (total expert) is 6, and the parameters of the expert networks differ from one another. The gating network can select the routing expert network from the 6 expert networks, which involves cross-card communication (i.e., communication across computing devices). As shown in the figure, assuming expert network 4 and expert network 1 are selected, then taking computing device 0 as an example, the processing result H0 of the backbone network can be sent to the selected expert network 4; after processing by expert network 4, the processing result H0' is returned to computing device 0, yielding the final output result 0.
Accordingly, Figure 4 is a schematic diagram of the data processing flow of the mixture-of-experts model after the tensor-parallel partitioning and data parallelism of the present disclosure are introduced.
As shown in Figure 4, in an embodiment of the present disclosure, any computing device may include: a backbone network, configured to perform predetermined processing on input data to obtain a first processing result; a gating network, configured to select, from the expert networks of the computing devices in its data parallel communication group, an expert network to serve as the routing network, and to send the first processing result to the selected expert network; and an expert network, configured to perform predetermined processing on a received first processing result to obtain a second processing result, and to return a return result determined from the second processing result to the computing device corresponding to the received first processing result. Different computing devices in the same tensor parallel communication group correspond to the same backbone network and the same expert network.
For ease of distinction, the processing result of the backbone network and the processing result of the expert network are called the first processing result and the second processing result, respectively. In addition, the specific nature of the predetermined processing is not limited and can be determined according to the actual situation.
As shown in Figure 4, in an embodiment of the present disclosure, the dense (Dense) parameters of the computing devices in the same tensor parallel communication group may also be partitioned in a tensor-parallel manner, where the sparse parameters may include expert network parameters and the dense parameters may include backbone network parameters.
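To illustrate what tensor-parallel partitioning of a parameter means in this context, the following sketch splits one fully connected layer's weight matrix column-wise across a tensor parallel communication group (assuming NumPy; the column-wise scheme and the name shard_linear are illustrative assumptions, since the disclosure does not prescribe a particular splitting axis):

```python
import numpy as np

def shard_linear(weight: np.ndarray, tp_degree: int):
    """Split an (in_dim x out_dim) weight matrix column-wise into tp_degree
    shards, one per device in the tensor parallel group. Each device then
    stores only 1/tp_degree of the parameters."""
    return np.split(weight, tp_degree, axis=1)

# An expert's FC layer with a 16x8 weight, split across 2 devices.
rng = np.random.default_rng(0)
w = rng.normal(size=(16, 8))
shards = shard_linear(w, tp_degree=2)
x = rng.normal(size=16)

# Each device computes a partial output from its shard; concatenating the
# partial outputs reproduces the full layer output x @ w.
partials = [x @ s for s in shards]
assert np.allclose(np.concatenate(partials), x @ w)
```

Each device storing only 1/tp_degree of the weight is precisely what keeps an overly large expert network from overflowing a single device's memory.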
As shown in Figure 4, suppose there are 4 computing devices in total, namely computing device 0, computing device 1, computing device 2 and computing device 3; the parallel scheme is 2-way data parallelism and 2-way tensor parallelism; and suppose there are two expert networks, expert network 0 and expert network 1. [computing device 0, computing device 1] and [computing device 2, computing device 3] are the 2 tensor parallel communication groups, and [computing device 0, computing device 2] and [computing device 1, computing device 3] are the 2 data parallel communication groups; together, the 2 tensor parallel communication groups and the 2 data parallel communication groups form the constructed hybrid communication group.
As shown in Figure 4, in each tensor parallel communication group, the backbone network parameters are partitioned in the tensor-parallel manner, where computing device 0 and computing device 1 share the same backbone network, and computing device 2 and computing device 3 share the same backbone network. In addition, computing device 0 and computing device 1 read the same input data 0 and, after processing by the backbone network, obtain the same first processing result H_0; likewise, computing device 2 and computing device 3 read the same input data 1 and, after processing by the backbone network, obtain the same first processing result H_1.
As shown in Figure 4, computing device 0 and computing device 1 share the same expert network 0, and computing device 2 and computing device 3 share the same expert network 1. In each data parallel communication group, the gating network can select a routing expert network for its corresponding first processing result. Assuming the number of selected expert networks is 1, then in the data parallel communication group [computing device 0, computing device 2] there are 2 expert networks to choose from; the gating network in computing device 0 selects expert network 1 in computing device 2, and the gating network in computing device 2 selects expert network 0 in computing device 0. The gating networks in computing device 0 and computing device 2 can then send their corresponding first processing results to the selected expert networks for processing. Similarly, in the data parallel communication group [computing device 1, computing device 3] there are 2 expert networks to choose from; the gating network in computing device 1 selects expert network 1 in computing device 3, and the gating network in computing device 3 selects expert network 0 in computing device 1; the gating networks in computing device 1 and computing device 3 can then send their corresponding first processing results to the selected expert networks for processing. Since the backbone network parameters are partitioned in the tensor-parallel manner, the routing results of computing device 1 are consistent with those of computing device 0, and the routing results of computing device 3 are consistent with those of computing device 2.
As shown in Figure 4, since the expert network parameters are partitioned in the tensor-parallel manner, after processing by the expert networks, the expert networks in the same tensor parallel communication group obtain the same second processing result: expert network 0 in computing device 0 and computing device 1 obtains the same second processing result H_3, and expert network 1 in computing device 2 and computing device 3 obtains the same second processing result H_4.
The obtained second processing result also needs to be returned to the corresponding computing device. Specifically, the expert network may return the return result determined from the second processing result to the computing device corresponding to the received first processing result.
The return result may be the second processing result itself, i.e., the second processing result can be returned directly as the return result. Alternatively, in an embodiment of the present disclosure, the return result may include part of the second processing result; accordingly, for any tensor parallel communication group, the received parts belonging to the same second processing result can be combined to obtain the complete second processing result, where the parts are returned by the same expert network corresponding to the computing devices in the same tensor parallel communication group.
Which method to use can be determined according to actual needs, which is very flexible and convenient; preferably, the latter method can be used.
In the latter method, as shown in Figure 4, since the expert networks in the same tensor parallel communication group obtain the same second processing result, each device can return only part of it: for example, computing device 0 may return only the first half H_3[:2] of the second processing result H_3 to computing device 2, and computing device 1 may return only the second half H_3[2:] to computing device 3; likewise, computing device 2 may return only the first half H_4[:2] of the second processing result H_4 to computing device 0, and computing device 3 may return only the second half H_4[2:] to computing device 1, thereby reducing the amount of data to be transmitted and hence the resource consumption.
The above takes a tensor parallel communication group containing 2 computing devices as an example; with 3 computing devices, 1/3 of the same second processing result would be returned by each, i.e., the first, middle and last thirds respectively. Typically, the returned parts do not overlap with one another, to further reduce resource consumption.
Since only parts are returned, for any tensor parallel communication group the received parts belonging to the same second processing result must also be combined to obtain the complete second processing result; that is, an all-gather (all_gather) communication is performed within the tensor parallel communication group to restore the original second processing result. As shown in Figure 4, computing device 0 and computing device 1 receive H_4[:2] and H_4[2:] respectively, so H_4[:2] and H_4[2:] can be combined to obtain the complete second processing result H_4; likewise, computing device 2 and computing device 3 receive H_3[:2] and H_3[2:] respectively, so H_3[:2] and H_3[2:] can be combined to obtain the complete second processing result H_3.
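The partial-return-plus-recombination step can be emulated locally as follows (a sketch assuming NumPy; the even split along the first axis mirrors the H_3[:2] / H_3[2:] slicing above, and the names partial_return and all_gather are illustrative, standing in for a communication library's actual collective):

```python
import numpy as np

def partial_return(second_result: np.ndarray, tp_rank: int, tp_degree: int):
    """Each device in the tensor parallel group returns only its 1/tp_degree
    slice of the (identical) second processing result."""
    return np.array_split(second_result, tp_degree, axis=0)[tp_rank]

def all_gather(parts: list[np.ndarray]) -> np.ndarray:
    """Recombine the per-device slices into the complete second result,
    as the all_gather within the tensor parallel group would."""
    return np.concatenate(parts, axis=0)

# Figure 4 example: H_4 has 4 rows; devices 0 and 1 return halves of it.
h4 = np.arange(12.0).reshape(4, 3)
parts = [partial_return(h4, r, tp_degree=2) for r in range(2)]  # H_4[:2], H_4[2:]
assert np.allclose(all_gather(parts), h4)
```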
Based on the hybrid communication group shown in Figure 4, mixture-of-experts model training can be completed; the training procedure can be the same as in the prior art.
It should be noted that, for simplicity of description, the foregoing method embodiment is expressed as a series of combined actions; however, those skilled in the art should know that the present disclosure is not limited by the described order of actions, because according to the present disclosure certain steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present disclosure.
In short, with the solution of the method embodiment of the present disclosure, the problem of overly large sparse parameters (such as expert network parameters) can be avoided, thereby avoiding problems such as graphics memory overflow caused by computing devices being unable to hold them, ensuring that model training proceeds normally, and allowing storage space to be used more effectively to support larger-scale model training.
The above describes the method embodiment; the solution of the present disclosure is further described below through a system embodiment.
Figure 5 is a schematic structural diagram of a first embodiment 500 of the mixture-of-experts model implementation system of the present disclosure. As shown in Figure 5, it includes a communication group 501 and is configured to train a mixture-of-experts model based on the communication group 501.
The communication group 501 may include a tensor parallel communication group; the tensor parallel communication group includes at least two computing devices, and the sparse parameters of the computing devices in the same tensor parallel communication group are partitioned in a tensor-parallel manner.
In the solution of the above system embodiment, by partitioning the sparse parameters in a tensor-parallel manner, the sparse parameters (such as expert network parameters) can be prevented from becoming too large, thereby avoiding problems such as graphics memory overflow caused by computing devices being unable to hold them, and ensuring that model training proceeds normally.
In an embodiment of the present disclosure, the communication group 501 may also include a data parallel communication group; a data parallel communication group includes at least two computing devices, and the computing devices in the same data parallel communication group operate in a data-parallel manner. In addition, for any tensor parallel communication group, each computing device in it is included in one data parallel communication group, and the first set of computing devices, composed of the computing devices included in all data parallel communication groups, is equal to the second set of computing devices, composed of the computing devices included in all tensor parallel communication groups.
That is, a hybrid communication group including tensor parallel communication groups and data parallel communication groups can be constructed; accordingly, the numbers of tensor parallel communication groups and of data parallel communication groups can each be greater than or equal to 2.
In an embodiment of the present disclosure, any computing device may include: a backbone network, configured to perform predetermined processing on input data to obtain a first processing result; a gating network, configured to select, from the expert networks of the computing devices in its data parallel communication group, an expert network to serve as the routing network, and to send the first processing result to the selected expert network; and an expert network, configured to perform predetermined processing on a received first processing result to obtain a second processing result, and to return a return result determined from the second processing result to the computing device corresponding to the received first processing result. Different computing devices in the same tensor parallel communication group correspond to the same backbone network and the same expert network.
In addition, in an embodiment of the present disclosure, the dense parameters of the computing devices in the same tensor parallel communication group may also be partitioned in the tensor-parallel manner, where the sparse parameters may include expert network parameters and the dense parameters may include backbone network parameters.
As described above, the obtained second processing result also needs to be returned to the corresponding computing device. Specifically, the expert network may return the return result determined from the second processing result to the computing device corresponding to the received first processing result.
The return result may be the second processing result itself, i.e., the second processing result can be returned directly as the return result.
Alternatively, in an embodiment of the present disclosure, the return result may include part of the second processing result; accordingly, Figure 6 is a schematic structural diagram of a second embodiment 600 of the mixture-of-experts model implementation system of the present disclosure.
As shown in Figure 6, compared with the embodiment shown in Figure 5, it further includes: an aggregation module 502, configured to, for any tensor parallel communication group, combine the received parts belonging to the same second processing result to obtain the complete second processing result, where the parts are returned by the same expert network corresponding to the computing devices in the same tensor parallel communication group.
For the specific workflow of the system embodiments shown in Figures 5 and 6, reference may be made to the relevant descriptions in the foregoing method embodiment, which will not be repeated here.
In short, with the solution of the system embodiment of the present disclosure, the problem of overly large sparse parameters (such as expert network parameters) can be avoided, thereby avoiding problems such as graphics memory overflow caused by computing devices being unable to hold them, ensuring that model training proceeds normally, and allowing storage space to be used more effectively to support larger-scale model training.
The solution of the present disclosure can be applied in the field of artificial intelligence, in particular in fields such as deep learning and distributed storage. Artificial intelligence is the discipline of studying how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking and planning); it involves both hardware-level and software-level technologies. Artificial intelligence hardware technology generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage and big data processing; artificial intelligence software technology mainly includes several major directions such as computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing and knowledge graph technology.
The data in the embodiments of the present disclosure are not targeted at any specific user and cannot reflect the personal information of any specific user. In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision and disclosure of the user personal information involved all comply with relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Figure 7 is a schematic block diagram of an electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, servers, blade servers, mainframe computers and other suitable computers. Electronic devices may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular phones, smart phones, wearable devices and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in Figure 7, the device 700 includes a computing device 701 that can perform various appropriate actions and processes according to a computer program stored in read-only memory (ROM) 702 or a computer program loaded from the storage unit 708 into random access memory (RAM) 703. The RAM 703 can also store various programs and data required for the operation of the device 700. The computing device 701, the ROM 702 and the RAM 703 are connected to one another via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Multiple components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard or mouse; an output unit 707 such as various types of displays and speakers; a storage unit 708 such as a magnetic disk or optical disc; and a communication unit 709 such as a network card, modem or wireless communication transceiver. The communication unit 709 allows the device 700 to exchange information/data with other devices over computer networks such as the Internet and/or various telecommunication networks.
The computing device 701 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing device 701 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing devices running machine learning model algorithms, digital signal processors (DSPs), and any appropriate processors, controllers, microcontrollers, etc. The computing device 701 performs the various methods and processes described above, such as the methods described in the present disclosure. For example, in some embodiments, the methods described in the present disclosure may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing device 701, one or more steps of the methods described in the present disclosure may be performed. Alternatively, in other embodiments, the computing device 701 may be configured to perform the methods described in the present disclosure in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described herein above may be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: being implemented in one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device and at least one output device, and transmit data and instructions to the storage system, the at least one input device and the at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer or other programmable data processing apparatus, so that when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared or semiconductor systems, apparatuses or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual, auditory or tactile feedback), and input from the user may be received in any form (including acoustic, voice or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware or front-end components. The components of the system may be interconnected by digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LAN), wide area networks (WAN) and the Internet.
A computer system may include clients and servers. Clients and servers are generally far from each other and usually interact through a communication network. The client-server relationship is produced by computer programs that run on the corresponding computers and have a client-server relationship with each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that steps may be reordered, added or deleted using the various forms of flows shown above. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in different orders, as long as the desired results of the technical solution disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The above specific implementations do not constitute a limitation on the protection scope of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions can be made according to design requirements and other factors. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

Claims (13)

  1. A mixture-of-experts model implementation method, comprising:
    constructing a communication group, wherein the communication group comprises a tensor parallel communication group, the tensor parallel communication group comprises at least two computing devices, and sparse parameters of the computing devices in the same tensor parallel communication group are partitioned in a tensor-parallel manner; and
    training a mixture-of-experts model based on the communication group.
  2. The method according to claim 1, wherein
    the communication group further comprises a data parallel communication group;
    the data parallel communication group comprises at least two computing devices, and the computing devices in the same data parallel communication group operate in a data-parallel manner; and
    for any tensor parallel communication group, each computing device therein is included in one data parallel communication group, and a first set of computing devices composed of the computing devices included in all data parallel communication groups is equal to a second set of computing devices composed of the computing devices included in all tensor parallel communication groups.
  3. The method according to claim 2, wherein any computing device comprises:
    a backbone network, configured to perform predetermined processing on input data to obtain a first processing result;
    a gating network, configured to select, from expert networks of the computing devices in its data parallel communication group, an expert network to serve as a routing network, and to send the first processing result to the selected expert network; and
    an expert network, configured to perform predetermined processing on a received first processing result to obtain a second processing result, and to return a return result determined from the second processing result to the computing device corresponding to the received first processing result;
    wherein different computing devices in the same tensor parallel communication group correspond to the same backbone network and the same expert network.
  4. The method according to claim 3, further comprising:
    partitioning dense parameters of the computing devices in the same tensor parallel communication group in the tensor-parallel manner;
    wherein the sparse parameters comprise expert network parameters, and the dense parameters comprise backbone network parameters.
  5. The method according to claim 4, wherein
    the return result comprises part of the second processing result; and
    the method further comprises: for any tensor parallel communication group, combining the received parts belonging to the same second processing result to obtain the complete second processing result, wherein the parts are returned by the same expert network corresponding to the computing devices in the same tensor parallel communication group.
  6. A mixture-of-experts model implementation system, comprising a communication group, and configured to train a mixture-of-experts model based on the communication group;
    wherein the communication group comprises a tensor parallel communication group, the tensor parallel communication group comprises at least two computing devices, and sparse parameters of the computing devices in the same tensor parallel communication group are partitioned in a tensor-parallel manner.
  7. The system according to claim 6, wherein
    the communication group further comprises a data parallel communication group;
    the data parallel communication group comprises at least two computing devices, and the computing devices in the same data parallel communication group operate in a data-parallel manner; and
    for any tensor parallel communication group, each computing device therein is included in one data parallel communication group, and a first set of computing devices composed of the computing devices included in all data parallel communication groups is equal to a second set of computing devices composed of the computing devices included in all tensor parallel communication groups.
  8. The system according to claim 7, wherein any computing device comprises:
    a backbone network, configured to perform predetermined processing on input data to obtain a first processing result;
    a gating network, configured to select, from expert networks of the computing devices in its data parallel communication group, an expert network to serve as a routing network, and to send the first processing result to the selected expert network; and
    an expert network, configured to perform predetermined processing on a received first processing result to obtain a second processing result, and to return a return result determined from the second processing result to the computing device corresponding to the received first processing result;
    wherein different computing devices in the same tensor parallel communication group correspond to the same backbone network and the same expert network.
  9. The system according to claim 8, wherein
    dense parameters of the computing devices in the same tensor parallel communication group are partitioned in the tensor-parallel manner; and
    the sparse parameters comprise expert network parameters, and the dense parameters comprise backbone network parameters.
  10. The system according to claim 9, wherein
    the return result comprises part of the second processing result; and
    the system further comprises an aggregation module, configured to, for any tensor parallel communication group, combine the received parts belonging to the same second processing result to obtain the complete second processing result, wherein the parts are returned by the same expert network corresponding to the computing devices in the same tensor parallel communication group.
  11. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-5.
  12. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-5.
  13. A computer program product, comprising a computer program/instructions, wherein the computer program/instructions, when executed by a processor, implement the method according to any one of claims 1-5.
PCT/CN2022/119752 2022-04-22 2022-09-20 Mixture-of-experts model implementation method and system, electronic device and storage medium WO2023201981A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP22865889.4A EP4287074A1 (en) 2022-04-22 2022-09-20 Mixture-of-experts model implementation method and system, electronic device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210430519.8A CN114841315A (zh) 2022-04-22 2022-04-22 混合专家模型实现方法、系统、电子设备及存储介质
CN202210430519.8 2022-04-22

Publications (1)

Publication Number Publication Date
WO2023201981A1 (zh)

Family

ID=82565543

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/119752 WO2023201981A1 (zh) 2022-04-22 2022-09-20 Mixture-of-experts model implementation method and system, electronic device and storage medium

Country Status (3)

Country Link
EP (1) EP4287074A1 (zh)
CN (1) CN114841315A (zh)
WO (1) WO2023201981A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114841315A (zh) * 2022-04-22 2022-08-02 北京百度网讯科技有限公司 混合专家模型实现方法、系统、电子设备及存储介质
CN117827418A (zh) * 2022-09-28 2024-04-05 华为云计算技术有限公司 数据处理方法、装置、系统、介质以及程序产品
CN115630677B (zh) * 2022-11-07 2023-10-13 北京百度网讯科技有限公司 任务处理方法、装置、电子设备及介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190251423A1 (en) * 2016-11-04 2019-08-15 Google Llc Mixture of experts neural networks
US20200151580A1 (en) * 2018-11-13 2020-05-14 International Business Machines Corporation Generating and managing deep tensor neural networks
CN114169427A (zh) * 2021-12-06 2022-03-11 北京百度网讯科技有限公司 基于端到端自适应的分布式训练方法、装置、设备
CN114186633A (zh) * 2021-12-10 2022-03-15 北京百度网讯科技有限公司 模型的分布式训练方法、装置、设备以及存储介质
CN114282681A (zh) * 2021-08-11 2022-04-05 腾讯科技(深圳)有限公司 多任务处理及模型的训练方法、装置、介质及设备
CN114841315A (zh) * 2022-04-22 2022-08-02 北京百度网讯科技有限公司 混合专家模型实现方法、系统、电子设备及存储介质

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114202027B (zh) * 2021-12-10 2023-05-23 北京百度网讯科技有限公司 执行配置信息的生成方法、模型训练方法和装置

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190251423A1 (en) * 2016-11-04 2019-08-15 Google Llc Mixture of experts neural networks
US20200151580A1 (en) * 2018-11-13 2020-05-14 International Business Machines Corporation Generating and managing deep tensor neural networks
CN114282681A (zh) * 2021-08-11 2022-04-05 腾讯科技(深圳)有限公司 多任务处理及模型的训练方法、装置、介质及设备
CN114169427A (zh) * 2021-12-06 2022-03-11 北京百度网讯科技有限公司 基于端到端自适应的分布式训练方法、装置、设备
CN114186633A (zh) * 2021-12-10 2022-03-15 北京百度网讯科技有限公司 模型的分布式训练方法、装置、设备以及存储介质
CN114841315A (zh) * 2022-04-22 2022-08-02 北京百度网讯科技有限公司 混合专家模型实现方法、系统、电子设备及存储介质

Also Published As

Publication number Publication date
EP4287074A1 (en) 2023-12-06
CN114841315A (zh) 2022-08-02

Similar Documents

Publication Publication Date Title
WO2023201981A1 (zh) Mixture-of-experts model implementation method and system, electronic device and storage medium
CN112561078B (zh) Distributed model training method and related apparatus
EP3913545A2 (en) Method and apparatus for updating parameter of multi-task model, and electronic device
US20220004811A1 (en) Method and apparatus of training model, device, medium, and program product
US20220276899A1 (en) Resource scheduling method, device, and storage medium
EP4016398A1 (en) Apparatus and method for distributed training model, and computer program product
US20220374776A1 (en) Method and system for federated learning, electronic device, and computer readable medium
US20230153337A1 (en) Question answering method, method of training a question answering model, electronic device, and medium
US20230084055A1 (en) Method for generating federated learning model
KR20210156243A (ko) Training method and apparatus for deep learning framework, and storage medium
US20240144570A1 (en) Method for generating drivable 3d character, electronic device and storage medium
US20220391780A1 (en) Method of federated learning, electronic device, and storage medium
US20220374678A1 (en) Method for determining pre-training model, electronic device and storage medium
CN114840734B (zh) Training method for multimodal representation model, cross-modal retrieval method and apparatus
WO2023165058A1 (zh) Mirror storage implementation method and apparatus for memory model, and storage medium
WO2023174189A1 (zh) Graph network model node classification method, apparatus, device and storage medium
WO2023221454A1 (zh) Text processing method based on attention mechanism optimization, and network model training method
WO2023221370A1 (zh) Batch task processing method and apparatus, and electronic device
WO2023155359A1 (zh) Voice chip implementation method, voice chip and related device
EP4155670A1 (en) Intersection vertex height value acquisition method and apparatus, electronic device and storage medium
WO2023015942A1 (zh) Method and apparatus for determining image features, electronic device and storage medium
CN113408304B (zh) Text translation method and apparatus, electronic device and storage medium
CN114238611A (zh) Method, apparatus, device and storage medium for outputting information
CN113344213A (zh) Knowledge distillation method and apparatus, electronic device and computer-readable storage medium
CN113361574A (zh) Training method and apparatus for data processing model, electronic device and storage medium

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2022865889

Country of ref document: EP

Effective date: 20230314