CN118132672A - Model pre-training method and device, storage medium and electronic equipment
- Publication number
- CN118132672A (application number CN202410131223.5A)
- Authority
- CN
- China
- Prior art keywords
- data
- path
- training
- under
- training data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Databases & Information Systems (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The invention provides a model pre-training method and device, a storage medium and electronic equipment. The method includes: compressing the initial data sources under each path of data in each training data to obtain compressed data sources under each path of data in each training data; masking the compressed data sources under each path of data in each training data, and calling the initial language model corresponding to each path of data to encode the mask data sources under the corresponding path of data in each training data, so as to obtain encoded data sources under each path of data in each training data; calculating a model loss value under each path of data based on the encoded data sources under each path of data in each training data and the compressed data sources under each path of data in each training data, and optimizing the model parameters in the initial language model corresponding to the corresponding path of data according to the model loss value under each path of data, so as to obtain an intermediate language model corresponding to each path of data and thereby determine a target language model corresponding to each path of data. The embodiment of the invention can conveniently perform model pre-training so as to improve model pre-training efficiency.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a model pre-training method and apparatus, a storage medium, and an electronic device.
Background
Currently, pre-trained language models have received wide attention in natural language processing. A pre-trained language model refers to a model that is pre-trained on a large amount of unsupervised data to obtain better semantic representations before training on a downstream task. However, the related art suffers from low model pre-training efficiency when performing model pre-training (i.e., modeling), especially long-text modeling. Based on this, there is currently no good solution for how to conveniently perform model pre-training so as to improve model pre-training efficiency.
Disclosure of Invention
In view of the above, the embodiment of the invention provides a model pre-training method and device, a storage medium and electronic equipment, so as to solve problems such as the low model pre-training efficiency of the related art; that is, the embodiment of the invention can conveniently perform model pre-training to improve model pre-training efficiency, and can effectively improve the training and inference speed of the model.
According to an aspect of the present invention, there is provided a model pre-training method, the method comprising:
acquiring a training data set, wherein one training data comprises an initial data source of an object under each path of data in M paths of data, one path of data corresponds to one language model, and M is a positive integer;
respectively compressing the initial data sources under each path of data in each training data included in the training data set to obtain compressed data sources under each path of data in each training data, wherein the number of compressed semantic representations in one compressed data source is smaller than the number of initial semantic representations in the corresponding initial data source;
masking the compressed data sources under each path of data in each training data respectively to obtain mask data sources under each path of data in each training data, calling the initial language model corresponding to each path of data respectively, and encoding the mask data sources under the corresponding path of data in each training data to obtain encoded data sources under each path of data in each training data;
and calculating a model loss value under each path of data based on the encoded data sources under each path of data in each training data and the compressed data sources under each path of data in each training data, and optimizing model parameters in the initial language model corresponding to the corresponding path of data according to the model loss value under each path of data to obtain an intermediate language model corresponding to each path of data, so as to determine a target language model corresponding to each path of data based on the intermediate language model corresponding to each path of data.
According to another aspect of the present invention, there is provided a model pre-training apparatus, the apparatus comprising:
an acquisition unit and a processing unit, wherein the acquisition unit is configured to acquire a training data set, one training data comprises an initial data source of an object under each path of data in M paths of data, one path of data corresponds to one language model, and M is a positive integer;
The processing unit is used for respectively compressing the initial data sources under each path of data in each training data included in the training data set to obtain compressed data sources under each path of data in each training data, and the number of compressed semantic representations in one compressed data source is smaller than the number of initial semantic representations in the corresponding initial data source;
The processing unit is further configured to mask the compressed data sources under each path of data in each training data respectively, obtain mask data sources under each path of data in each training data, call an initial language model corresponding to each path of data respectively, and encode the mask data sources under the corresponding path of data in each training data, so as to obtain encoded data sources under each path of data in each training data;
The processing unit is further configured to calculate a model loss value under each path of data based on the encoded data source under each path of data in each training data and the compressed data source under each path of data in each training data, and optimize model parameters in an initial language model corresponding to each path of data according to the model loss value under each path of data, so as to obtain an intermediate language model corresponding to each path of data, so as to determine a target language model corresponding to each path of data based on the intermediate language model corresponding to each path of data.
According to another aspect of the present invention, there is provided an electronic device comprising a processor and a memory storing a program, wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the above-mentioned method.
According to another aspect of the present invention, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the above-mentioned method.
After the training data set is acquired, the embodiment of the invention can respectively compress the initial data sources under each path of data in each training data included in the training data set to obtain compressed data sources under each path of data in each training data, wherein one training data comprises the initial data sources of an object under each path of data in M paths of data, one path of data corresponds to one language model, and M is a positive integer; the number of compressed semantic representations in one compressed data source is smaller than the number of initial semantic representations in the corresponding initial data source. Then, the compressed data sources under each path of data in each training data can be respectively masked to obtain mask data sources under each path of data in each training data, the initial language model corresponding to each path of data is respectively called, and the mask data sources under the corresponding path of data in each training data are encoded to obtain encoded data sources under each path of data in each training data. Based on this, the model loss value under each path of data can be calculated based on the encoded data sources under each path of data in each training data and the compressed data sources under each path of data in each training data, and the model parameters in the initial language model corresponding to the corresponding path of data are optimized according to the model loss value under each path of data to obtain an intermediate language model corresponding to each path of data, so that a target language model corresponding to each path of data is determined based on the intermediate language model corresponding to each path of data. Therefore, the embodiment of the invention can conveniently perform model pre-training to improve model pre-training efficiency, and can effectively improve the training and inference speed of the model.
Drawings
Further details, features and advantages of the invention are disclosed in the following description of exemplary embodiments with reference to the following drawings, in which:
FIG. 1 illustrates a flow diagram of a model pre-training method according to an exemplary embodiment of the invention;
FIG. 2 shows a schematic diagram of a pre-training according to an exemplary embodiment of the present invention;
FIG. 3 illustrates a flow diagram of another model pre-training method according to an exemplary embodiment of the present invention;
FIG. 4 shows a schematic diagram of contrastive learning according to an exemplary embodiment of the invention;
FIG. 5 shows a schematic block diagram of a model pre-training apparatus according to an exemplary embodiment of the present invention;
fig. 6 shows a block diagram of an exemplary electronic device that can be used to implement an embodiment of the invention.
Detailed Description
Embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the invention are shown in the drawings, it should be understood that the invention may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the invention. It should be understood that the drawings and embodiments of the invention are for illustration purposes only and are not intended to limit the scope of the present invention.
It should be understood that the various steps recited in the method embodiments of the present invention may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the invention is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below. It should be noted that the terms "first," "second," and the like herein are merely used for distinguishing between different devices, modules, or units and not for limiting the order or interdependence of the functions performed by such devices, modules, or units.
It should be noted that references to "a", "an" and "a plurality of" in this disclosure are illustrative rather than limiting, and those skilled in the art will appreciate that they should be construed as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the devices in the embodiments of the present invention are for illustrative purposes only and are not intended to limit the scope of such messages or information.
It should be noted that the execution body of the model pre-training method provided by the embodiment of the present invention may be one or more electronic devices, which is not limited in the present invention; the electronic device may be a terminal (i.e., a client) or a server, and when the execution body includes a plurality of electronic devices that include at least one terminal and at least one server, the model pre-training method provided by the embodiment of the present invention may be executed jointly by the terminal and the server. Accordingly, the terminals referred to herein may include, but are not limited to: smart phones, tablet computers, notebook computers, desktop computers, intelligent voice interaction devices, and the like. The server mentioned herein may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), big data and artificial intelligence platforms, and so on.
Based on the above description, an embodiment of the present invention proposes a model pre-training method that can be performed by the above-mentioned electronic device (terminal or server); or the model pre-training method may be performed jointly by the terminal and the server. For convenience of explanation, the model pre-training method is executed by the electronic device in the following description; as shown in fig. 1, the model pre-training method may include the following steps S101-S104:
s101, acquiring a training data set, wherein one training data comprises an initial data source of an object under each path of data in M paths of data, one path of data corresponds to one language model, and M is a positive integer.
Wherein one training data corresponds to one object; correspondingly, the training data set may comprise training data of each object of at least one object, i.e., the training data set may comprise an initial data source of each object under each path of data. Alternatively, the data source of an object under each path of data may also be referred to as the data source under each path of data in the training data of the corresponding object; that is, one training data may include the initial data source under each path of data, and the data source under each path of data in one training data refers to the data source, under each path of data, of the object corresponding to that training data. For example, the compressed data source under each path of data in one training data may refer to the compressed data source, under each path of data, of the object corresponding to that training data; the mask data source under each path of data in one training data may refer to the mask data source, under each path of data, of the object corresponding to that training data; the encoded data source under each path of data in one training data may refer to the encoded data source, under each path of data, of the object corresponding to that training data; and so on.
Alternatively, an object may be a user, a commodity (e.g., a book, etc.), or the like; the embodiment of the present invention is not limited thereto. Optionally, each object in at least one object may belong to the same type, for example, may be all users or all books, etc.
It should be understood that, when the value of M is 1, the embodiment of the present invention relates to one-way data, so as to implement the following determination of a target language model corresponding to one-way data; when the value of M is greater than 1, the data may be expanded to multiple paths of data, so as to implement the following determination of the target language model corresponding to each path of data, that is, the number of target language models may be multiple, one path of data corresponds to one target language model, and so on.
For example, assuming that the value of M is 2 and one object is a user, one path of data may be attribute tag data (such as gender, age, etc.) of the user, that is, an initial data source under one path of data may be used to describe attribute information of the user; accordingly, the other path of data may be search text and/or browsing text of the user, or the like, i.e. the initial data source under the other path of data may be used to describe the search content and/or browsing content of the user, or the like.
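As a purely illustrative sketch (the field names below are assumptions and not part of the publication), one training data for this M = 2 example might be laid out as follows, with one initial data source per path of data:

```python
# Illustrative layout of one training data for the M = 2 user example.
# "attribute_path" and "text_path" are hypothetical names; the publication only
# requires that each training data hold one initial data source per path of data.
training_data = {
    "object_id": "user_001",                               # the object (here: a user)
    "attribute_path": "gender: female; age: 28",           # path 1: attribute tag data of the user
    "text_path": "science fiction novels; hiking boots",   # path 2: search / browsing text of the user
}

# The training data set is then simply a collection of such per-object entries.
training_data_set = [training_data]
```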
In the embodiment of the present invention, the above-mentioned acquisition manner of the training data set may include, but is not limited to, the following several ways:
the first acquisition mode is as follows: the electronic device may store a plurality of training data in its own memory space, in which case the electronic device may select at least one training data from the plurality of training data and add the at least one training data to the training data set such that the training data set includes the at least one training data.
The second acquisition mode is as follows: the electronic device may obtain a training data download link and add training data downloaded based on the training data download link to the training data set such that the training data set includes training data downloaded based on the training data download link.
The third acquisition mode is as follows: the electronic device may obtain a training text set that includes at least one training text, and each training text may include a training sub-text of an object under each path of data. In this case, for any training text in the training text set, the electronic device may perform word segmentation processing on the training sub-text under each path of data in that training text to obtain a word segmentation result under each path of data in that training text, where one word segmentation result may include a plurality of tokens. The electronic device may then perform vectorization processing on the word segmentation result under each path of data in that training text to obtain an initial data source under each path of data in that training text, where the initial data source under any path of data in that training text may include an initial semantic representation of each token under that path of data. The initial data sources under each path of data in that training text may then be added to the training data set, so that the initial data sources under each path of data in that training text serve as one training data in the training data set, and so on.
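A minimal sketch of this third acquisition mode, assuming a whitespace word segmentation and a randomly initialized embedding table as the vectorization step (both stand in for whatever tokenizer and vectorization are actually used):

```python
import torch

def build_initial_data_source(training_sub_text, embedding, vocab):
    """Turn one training sub-text (one path of data) into its initial data source,
    i.e. one initial semantic representation per token."""
    tokens = training_sub_text.split()                       # word segmentation (placeholder tokenizer)
    ids = torch.tensor([vocab.setdefault(t, len(vocab)) for t in tokens])
    return embedding(ids)                                    # vectorization -> (num_tokens, hidden_dim)

# Usage: build the initial data source under each path of data of one training text.
vocab = {}
embedding = torch.nn.Embedding(10_000, 128)                  # hypothetical vocabulary / hidden size
training_text = {"attribute_path": "gender female age 28",
                 "text_path": "science fiction novels hiking boots"}
initial_data_sources = {path: build_initial_data_source(sub_text, embedding, vocab)
                        for path, sub_text in training_text.items()}
```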
It should be noted that one training data may correspond to one training text, and one path of data in one training data corresponds to one training sub-text in the corresponding training text. The text length of one training sub-text may be any length, which is not limited in the embodiment of the present invention; the text length of one training sub-text may be the number of tokens in the corresponding training sub-text. Alternatively, when the text length of a text is greater than a preset length threshold, the text may be regarded as a long text; the preset length threshold may be set empirically or according to actual requirements, which is not limited in the embodiment of the present invention.
Optionally, when the training sub-text in the training text corresponding to each training data is a long text, the embodiment of the present invention may implement long text modeling, which is specifically shown below, and is not described herein again.
S102, respectively compressing initial data sources under each path of data in each training data included in the training data set to obtain compressed data sources under each path of data in each training data, wherein the number of compressed semantic representations in one compressed data source is smaller than the number of initial semantic representations in the corresponding initial data source.
In the embodiment of the invention, for any training data in the training data set and any path of data in the M paths of data, the electronic device can compress the initial data source under that path of data in that training data to obtain the compressed data source under that path of data in that training data; that is, the initial data source, under that path of data, of the object corresponding to that training data can be compressed to obtain the compressed data source, under that path of data, of the object corresponding to that training data. The compressed data source under any path of data in any training data can comprise a plurality of compressed semantic representations, i.e., a plurality of compressed semantic representations under that path of data in that training data. Accordingly, the initial data source under any path of data in any training data may include a plurality of initial semantic representations under that path of data in that training data.
It should be noted that, the number of compressed semantic representations under any path of data in any training data may be smaller than the number of initial semantic representations under any path of data in any training data; optionally, when the training sub-text corresponding to any path of data in any training data is a long text, the number of compressed semantic representations under any path of data in any training data may be far smaller than the number of initial semantic representations under any path of data in any training data, so as to effectively compress the text length.
S103, masking the compressed data sources under each path of data in each training data respectively to obtain the mask data sources under each path of data in each training data, calling initial language models corresponding to each path of data respectively, and encoding the mask data sources under the corresponding paths of data in each training data to obtain the encoded data sources under each path of data in each training data.
Wherein one mask data source may include the mask semantic representations of each of H mask units; that is, the electronic device may mask the compressed semantic representation of each mask unit in one compressed data source to obtain the corresponding mask data source. Accordingly, in one mask data source, the semantic representation of each semantic unit other than the H mask units may be the compressed semantic representation of the corresponding semantic unit in the corresponding compressed data source, where H is a positive integer. Also, one encoded data source may include the encoded semantic representation of each semantic unit, and thus may include the encoded semantic representation of each of the H mask units. One semantic representation corresponds to one semantic unit, and one semantic unit may be a mask unit or a semantic unit other than a mask unit.
It should be noted that the mask units in any two data sources may be the same or different; that is, the number of mask units in any two data sources may be the same or different, which is not limited in the embodiment of the present invention; it should be appreciated that the masking units in the data sources (e.g., compressed data source and encoded data source, etc.) in the same path of data in the same training data are the same.
For example, assuming that one compressed data source includes 4 compressed semantic representations, and the mask units among the 4 compressed semantic representations include the 3rd semantic unit (i.e., the semantic unit corresponding to the 3rd compressed semantic representation), the electronic device may mask the 3rd semantic unit in the compressed data source, thereby obtaining a mask semantic representation of the 3rd semantic unit, so as to implement masking of the compressed data source and obtain the mask data source corresponding to the compressed data source; the mask data source corresponding to the compressed data source may include: the compressed semantic representation of the 1st semantic unit, the compressed semantic representation of the 2nd semantic unit, the mask semantic representation of the 3rd semantic unit, and the compressed semantic representation of the 4th semantic unit.
Alternatively, a language model may be a FLASH model (an efficient long-text model), a BERT model (a deep bi-directional pre-training model), or the like; the embodiment of the present invention is not limited thereto. In the case of long-text modeling, a FLASH model may be preferable as the language model. Alternatively, the model parameters in an initial language model may be generated randomly, set empirically, or set according to actual requirements, which is not limited in the embodiment of the present invention.
S104, calculating model loss values under each path of data based on the coding data sources under each path of data in each training data and the compression data sources under each path of data in each training data, optimizing model parameters in an initial language model corresponding to each path of data according to the model loss values under each path of data, and obtaining an intermediate language model corresponding to each path of data so as to determine a target language model corresponding to each path of data based on the intermediate language model corresponding to each path of data.
In one embodiment, for the mth path of data in the M paths of data, m ∈ [1, M], the electronic device may determine, from the encoded data source under the mth path of data in each training data, the encoded semantic representation of each mask unit under the mth path of data in the corresponding training data (i.e., the encoded semantic representation of each mask unit under the mth path of data may be determined for each object). Then, correspondingly, the model loss value under the mth path of data can be calculated based on the encoded semantic representation of each mask unit under the mth path of data in each training data and the compressed data source under the mth path of data in the corresponding training data. Specifically, the electronic device may traverse each training data in the training data set, taking the currently traversed training data as the current training data; then calculate the model loss value under the mth path of data in the current training data by using the encoded semantic representation of each mask unit under the mth path of data in the current training data and the compressed data source under the mth path of data in the current training data; and, after traversing each training data in the training data set, obtain the model loss value under the mth path of data in each training data and perform weighted summation on these model loss values to obtain the model loss value under the mth path of data. Optionally, the weight values involved in the weighted summation processes in the embodiment of the present invention may be set empirically or according to actual requirements, which is not limited in the embodiment of the present invention; for example, when the weight values of the model loss values under the mth path of data in the respective training data are the same, an averaging operation or a summation operation may be performed on the model loss values under the mth path of data in the respective training data, and so on.
Optionally, when the encoded semantic representation of each mask unit under the mth path of data in the current training data and the compressed data source under the mth path of data in the current training data are used to calculate the model loss value under the mth path of data in the current training data, then, for any mask unit under the mth path of data in the current training data, the encoded semantic representation of that mask unit and the compressed data source under the mth path of data in the current training data can be used to calculate the model loss value under that mask unit, so as to obtain the model loss value under each mask unit under the mth path of data in the current training data; weighted summation (such as an averaging operation or a summation operation) is then performed on the model loss values under each mask unit under the mth path of data in the current training data to obtain the model loss value under the mth path of data in the current training data. Specifically, the electronic device may calculate the model loss value under any mask unit using equation 1.1:
Here Ex may be an expectation over the encoded data source (i.e., over each encoded semantic representation) under the mth path of data in the current training data, or an expectation over the compressed data source under the mth path of data in the current training data, which is not limited in the embodiment of the present invention; correspondingly, f(x) may be the encoded semantic representation of any mask unit under the mth path of data in the current training data, f(x+) may be the compressed semantic representation of that mask unit under the mth path of data in the current training data, W may be the number of compressed semantic representations under the mth path of data in the current training data, f(x_j) may be the compressed semantic representation of the jth semantic unit, other than that mask unit, under the mth path of data in the current training data, and L may be the corresponding loss value, and so on.
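Equation 1.1 itself does not survive in this text of the publication; judging from the term definitions above, it is presumably an InfoNCE-style contrastive loss of roughly the following form (this reconstruction, including the dot-product similarity and the absence of a temperature factor, is an assumption):

L = -\mathbb{E}_x \left[ \log \frac{\exp(f(x) \cdot f(x^{+}))}{\sum_{j=1}^{W} \exp(f(x) \cdot f(x_j))} \right]

where, consistent with the example below, the denominator may run over the W compressed semantic representations under the mth path of data in the current training data, the positive f(x+) included.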
Illustratively, as shown in FIG. 2, assume that the number of compressed semantic representations under the mth path of data in the current training data is 4, and that the mask unit under the mth path of data in the current training data is the 3rd semantic unit; in this case, f(x+) may be the compressed semantic representation of the 3rd semantic unit under the mth path of data in the current training data. At this time, the value of W may be 4; the 1st semantic unit under the mth path of data in the current training data may be the 1st semantic unit, other than the mask unit, under the mth path of data in the current training data; the 2nd semantic unit may be the 2nd semantic unit other than the mask unit; and the 4th semantic unit may be the 3rd semantic unit other than the mask unit.
Therefore, the embodiment of the invention can perform contrastive learning between the encoded semantic representation of any mask unit under the mth path of data in the current training data and the compressed data source under the mth path of data in the current training data, so as to pull the post-mask semantic representation produced by the encoder (i.e., the encoded semantic representation) closer to the underlying pre-mask semantic representation (i.e., the compressed semantic representation of that mask unit), while pushing the encoded semantic representation of that mask unit away from the compressed semantic representations of the semantic units other than that mask unit.
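A minimal sketch of this contrastive learning step for one path of data, assuming dot-product similarity and no temperature factor (both assumptions), with the per-mask-unit losses averaged into a per-training-data loss and then into the per-path model loss value as described above:

```python
import torch
import torch.nn.functional as F

def mask_unit_loss(encoded, compressed, mask_idx):
    """Contrastive (InfoNCE-style) loss for one mask unit under one path of data.
    encoded:    (W, d) encoded semantic representations from the encoder
    compressed: (W, d) compressed semantic representations before masking
    mask_idx:   index of the mask unit whose representation was masked out"""
    anchor = encoded[mask_idx]                      # f(x): encoded representation of the mask unit
    logits = compressed @ anchor                    # similarity of f(x) to every f(x_j)
    target = torch.tensor(mask_idx)                 # the positive is the pre-mask compressed representation
    return F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

def path_model_loss(per_training_data):
    """Model loss value under one path of data: average the per-mask-unit losses within
    each training data, then average (equal-weight weighted summation) over the training data."""
    per_data = [torch.stack([mask_unit_loss(enc, comp, i) for i in mask_ids]).mean()
                for enc, comp, mask_ids in per_training_data]
    return torch.stack(per_data).mean()

# Usage with toy shapes: 2 training data, W = 4 semantic units, hidden size 8.
toy = [(torch.randn(4, 8), torch.randn(4, 8), [2]),      # 3rd semantic unit masked
       (torch.randn(4, 8), torch.randn(4, 8), [0, 3])]   # 1st and 4th semantic units masked
print(path_model_loss(toy))
```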
In another embodiment, for the mth path of data in the M paths of data, the electronic device may calculate the model loss value under the mth path of data based on the encoded data source under the mth path of data in each training data and the compressed data source under the mth path of data in each training data; specifically, for any coding semantic representation under the mth path of data in any training data in the training data set, a model loss value under a semantic unit corresponding to any coding semantic representation can be calculated based on any coding semantic representation under the mth path of data in any training data and a compressed data source under the mth path of data in any training data, so as to obtain a model loss value of each semantic unit under the mth path of data in any training data, and the model loss values of each semantic unit under the mth path of data in any training data are weighted and summed to obtain the model loss value under the mth path of data in any training data. Based on this, the model loss value for the mth data can be calculated based on the model loss value for the mth data in each training data.
The specific manner of calculating the model loss value under the semantic unit corresponding to any coding semantic representation is the same as the specific manner of calculating the model loss value under any mask unit based on any coding semantic representation under the mth path of data in any training data and the compressed data source under the mth path of data in any training data, and the embodiments of the present invention will not be repeated.
Further, when optimizing the model parameters in the initial language model corresponding to each path of data according to the model loss value under that path of data to obtain the intermediate language model corresponding to each path of data, the electronic device may, for the mth path of data in the M paths of data, optimize the model parameters in the initial language model corresponding to the mth path of data according to the model loss value under the mth path of data, so as to obtain the intermediate language model corresponding to the mth path of data.
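A sketch of this per-path optimization, assuming one independent encoder (a placeholder linear layer here) and one Adam optimizer per path of data; the optimizer choice and learning rate are assumptions:

```python
import torch

M, hidden = 2, 8
# One initial language model (placeholder encoder) and one optimizer per path of data.
models = [torch.nn.Linear(hidden, hidden) for _ in range(M)]
optims = [torch.optim.Adam(m.parameters(), lr=1e-4) for m in models]

def optimize_per_path(path_losses):
    """Optimize the model parameters of the initial language model of the mth path of data
    using only the model loss value under the mth path of data."""
    for m, loss in enumerate(path_losses):
        optims[m].zero_grad()
        loss.backward()
        optims[m].step()

# Usage with dummy losses that depend on each path's own model parameters.
dummy_losses = [model(torch.randn(1, hidden)).pow(2).mean() for model in models]
optimize_per_path(dummy_losses)
```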
After the training data set is acquired, the embodiment of the invention can respectively compress the initial data sources under each path of data in each training data included in the training data set to obtain compressed data sources under each path of data in each training data, wherein one training data comprises the initial data sources of an object under each path of data in M paths of data, one path of data corresponds to one language model, and M is a positive integer; the number of compressed semantic representations in one compressed data source is smaller than the number of initial semantic representations in the corresponding initial data source. Then, the compressed data sources under each path of data in each training data can be respectively masked to obtain mask data sources under each path of data in each training data, the initial language model corresponding to each path of data is respectively called, and the mask data sources under the corresponding path of data in each training data are encoded to obtain encoded data sources under each path of data in each training data. Based on this, the model loss value under each path of data can be calculated based on the encoded data sources under each path of data in each training data and the compressed data sources under each path of data in each training data, and the model parameters in the initial language model corresponding to the corresponding path of data are optimized according to the model loss value under each path of data to obtain an intermediate language model corresponding to each path of data, so that a target language model corresponding to each path of data is determined based on the intermediate language model corresponding to each path of data. Therefore, the embodiment of the invention can conveniently perform model pre-training to improve model pre-training efficiency, and can effectively improve the training and inference speed of the model.
Based on the above description, the embodiment of the invention also provides a more specific model pre-training method. Accordingly, the model pre-training method may be performed by the above-mentioned electronic device (terminal or server); or the model pre-training method may be performed jointly by the terminal and the server. For convenience of explanation, the model pre-training method is executed by the electronic device in the following description; referring to fig. 3, the model pre-training method may include the following steps S301 to S308:
S301, acquiring a training data set, wherein one training data comprises an initial data source of an object under each path of data in M paths of data, one path of data corresponds to one language model, and M is a positive integer.
S302, compressing initial data sources under each path of data in each training data included in the training data set to obtain compressed data sources under each path of data in each training data, wherein the number of compressed semantic representations in one compressed data source is smaller than the number of initial semantic representations in the corresponding initial data source.
Specifically, for any training data in the training data set and any path of data in the M paths of data, the electronic device may determine a text compression length, and divide the initial data source under that path of data in that training data into N semantic representation groups to be compressed according to the text compression length, wherein the number of semantic representations in one semantic representation group to be compressed is smaller than or equal to the text compression length; the number of semantic representations (i.e., initial semantic representations here) in each of the first N-1 semantic representation groups to be compressed may be equal to the text compression length, and the number of semantic representations in the Nth semantic representation group may be less than or equal to the text compression length. Further, compression processing can be performed on each of the N semantic representation groups to be compressed to obtain a compressed semantic representation of each group, thereby obtaining the compressed data source under that path of data in that training data. Alternatively, a semantic representation group to be compressed may be treated as a sentence, so as to implement sentence-level compression and, further, the sentence-level masking described below.
Alternatively, the text compression length may be set empirically or according to actual requirements, which is not limited in the embodiment of the present invention. Optionally, the value of N may be the number of sentences in the text corresponding to any path of data in any training data; in this case the number of text compression lengths may be equal to N, and one text compression length is the number of tokens in one sentence, that is, one text compression length may correspond to one sentence, and so on; the embodiment of the present invention is not limited thereto.
For example, as shown in FIG. 2, assume that the text compression length is 3 and the text length corresponding to the initial data source under any path of data in any training data is 12 tokens, with one token corresponding to one initial semantic representation. In this case, the initial semantic representations of every 3 tokens may be taken as one semantic representation group to be compressed, so that every 3 tokens are compressed into 1 representation unit (i.e., semantic unit); that is, 3 initial semantic representations may be compressed into 1 compressed semantic representation. Correspondingly, the number of compressed semantic representations under that path of data in that training data may be 4 (i.e., the text representation length may be 4), where the text representation length refers to the number of semantic representations used to represent one text. It can be seen that when the initial data source is a long-text data source, embodiments of the present invention can compress the long-text data source into a corresponding short-text data source (i.e., compressed data source).
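A sketch of this compression step, assuming mean pooling as the compression module (the publication also allows a CNN semantic compression model instead, as described below):

```python
import torch

def compress(initial_data_source, text_compression_length):
    """Divide the initial semantic representations into groups of at most
    `text_compression_length` and compress each group into one compressed
    semantic representation by mean pooling."""
    groups = torch.split(initial_data_source, text_compression_length, dim=0)   # N groups
    return torch.stack([group.mean(dim=0) for group in groups])                 # (N, d)

# Usage matching the example above: 12 tokens, compression length 3 -> 4 compressed representations.
initial_data_source = torch.randn(12, 128)
compressed_data_source = compress(initial_data_source, 3)
print(compressed_data_source.shape)   # torch.Size([4, 128])
```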
Optionally, the compressed data sources under each path of data in each training data are obtained by compressing the initial data sources under each path of data in each training data with an initial compression model; that is, the electronic device may call the initial compression model to compress the initial data sources under each path of data in each training data. Alternatively, a compression model may be a CNN (Convolutional Neural Network) semantic compression model (also referred to as a semantic compression module) or a mean pooling (averaging) module, which is not limited in the embodiment of the present invention.
In the embodiment of the invention, when the compression model is a CNN semantic compression model, the electronic equipment can optimize model parameters in the initial compression model based on model loss values under each path of data to obtain an intermediate compression model; determining a target compression model based on the intermediate compression model; specifically, the model parameters in the intermediate compression model may be continuously optimized until a compression convergence condition is reached, e.g., the number of iterations (i.e., the number of optimizations) reaches a first preset number of iterations, or the compression loss value is less than a first preset loss threshold, the compression convergence condition may be determined to be reached, and so on. Alternatively, the first preset iteration number and the first preset loss threshold may be set empirically, or may be set according to actual requirements, which is not limited in the embodiment of the present invention.
The compression loss value may be determined based on the model loss value under each path of data, that is, the electronic device may determine the compression loss value based on the model loss value under each path of data; based on this, the model parameters in the initial compression model can be optimized in a direction to reduce the compression loss value.
In one embodiment, when determining the compression loss value based on the model loss value under each path of data, the electronic device may perform weighted summation on the model loss value under each path of data to obtain a weighted summation loss value, and use the weighted summation loss value as the compression loss value; optionally, the weight value of the model loss value under each path of data in the weighted summation process may be set empirically, or may be set according to actual requirements, which is not limited in the embodiment of the present invention.
In another embodiment, the electronic device may select a maximum model loss value from the model loss values under each path of data, and use the selected model loss value as the compression loss value, and so on.
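A sketch of the two options for the compression loss value described above; the equal weights in the first option are an assumption:

```python
import torch

def compression_loss(path_model_losses, weights=None, use_max=False):
    """Option 1 (default): weighted summation of the model loss values under each path of data.
    Option 2 (use_max=True): maximum model loss value among the paths of data."""
    losses = torch.stack(list(path_model_losses))
    if use_max:
        return losses.max()
    if weights is None:
        weights = torch.full_like(losses, 1.0 / len(losses))   # assumed equal weights
    return (losses * weights).sum()

# Usage with two paths of data.
print(compression_loss([torch.tensor(0.8), torch.tensor(1.3)]))                  # weighted summation
print(compression_loss([torch.tensor(0.8), torch.tensor(1.3)], use_max=True))    # maximum
```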
S303, masking the compressed data sources under each path of data in each training data respectively to obtain the mask data sources under each path of data in each training data, calling initial language models corresponding to each path of data respectively, and encoding the mask data sources under the corresponding paths of data in each training data to obtain the encoded data sources under each path of data in each training data.
Specifically, for any training data in the training data set and any path of data in the M paths of data, the electronic device can determine a mask probability and, according to the mask probability, determine at least one mask unit from the compressed data source under that path of data in that training data, wherein one mask unit corresponds to one compressed semantic representation. Based on this, the compressed semantic representations of the determined mask units can be masked respectively to obtain the mask data source under that path of data in that training data, wherein the mask data source under that path of data in that training data comprises the mask semantic representations of the determined mask units and, correspondingly, may also comprise the compressed semantic representations of each semantic unit other than the determined mask units under that path of data in that training data.
Optionally, the mask probability may be set empirically or according to actual requirements, which is not limited in the embodiment of the present invention; an exemplary mask probability may be 15%, in which case each compressed semantic unit is masked with a probability of 15%, i.e., 15% of the compressed semantic representations of the semantic units may be randomly masked out, and so on.
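A sketch of the sentence-level masking step, assuming the 15% mask probability above and a zero vector as the placeholder mask representation (a learnable mask vector could equally be used; both choices are assumptions):

```python
import torch

def mask_compressed(compressed_data_source, mask_probability=0.15, mask_vector=None):
    """Randomly select mask units among the compressed semantic representations and replace
    their representations, returning the mask data source and the indices of the mask units."""
    if mask_vector is None:
        mask_vector = torch.zeros(compressed_data_source.size(-1))             # placeholder mask representation
    keep = torch.rand(compressed_data_source.size(0)) >= mask_probability      # True where a unit is kept
    mask_data_source = torch.where(keep.unsqueeze(-1), compressed_data_source, mask_vector)
    mask_units = (~keep).nonzero(as_tuple=True)[0].tolist()
    return mask_data_source, mask_units

# Usage: mask the 4 compressed semantic representations of one path of one training data.
mask_data_source, mask_units = mask_compressed(torch.randn(4, 128))
```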
Therefore, the embodiment of the invention masks the compressed semantic representations, so that the method mentioned in the embodiment of the invention can be regarded as a model pre-training method with sentence-level masking, i.e., a sentence-level mask pre-training method, which can improve the semantic representation capability of single-path data. On this basis, pre-training of the target language model can be implemented in the masked language model (MLM) manner, i.e., unsupervised masked-language-model pre-training can be performed on the compressed semantic representations, thereby improving pre-training efficiency and further improving the model performance of the target language model. Alternatively, the language model may correspond to two training tasks: one may be a task that uses MLM to enable a Transformer (a model that uses the attention mechanism to increase the model training speed) to fuse bi-directional features (i.e., a pre-training task), and the other may be a fine-tuning task.
Alternatively, a language model may include an encoder; optionally, during the pre-training process, a language model may also include a masking language model (an output layer); then, correspondingly, in downstream tasks, a language model may include an encoder and NLP (natural language processing) layer (another output layer); the encoder included in the language model in the downstream task may be an encoder obtained through a pre-training process.
S304, calculating model loss values under each path of data based on the coding data sources under each path of data in each training data and the compression data sources under each path of data in each training data, and optimizing model parameters in an initial language model corresponding to the corresponding path of data according to the model loss values under each path of data to obtain an intermediate language model corresponding to each path of data.
S305, based on the training data set, iteratively optimizing model parameters in the intermediate language model corresponding to each path of data until convergence conditions are reached, and obtaining the language model to be associated corresponding to each path of data.
It should be understood that after the intermediate language model corresponding to each path of data is obtained, model parameters in the intermediate language model corresponding to each path of data may be continuously optimized until a convergence condition is reached.
Optionally, when the iteration number reaches the second preset iteration number, it is determined that a convergence condition is reached, where the iteration number of the language model to be associated corresponding to each path of data may be the same or different, which is not limited by the embodiment of the present invention; that is, the second preset iteration times corresponding to any two paths of data may be the same or different, which is not limited in the embodiment of the present invention. Or determining that a convergence condition is reached when the model loss value reaches a second preset loss threshold value, and the like; it should be noted that the second preset loss threshold value corresponding to any two paths of data may be the same or different, which is not limited by the embodiment of the present invention. It can be seen that one path of data can correspond to one convergence condition, and the convergence conditions corresponding to any two paths of data can be the same or different, which is not limited by the embodiment of the invention; it should be understood that, for any one path of data in the M paths of data, when a convergence condition corresponding to any one path of data is reached, a language model to be associated corresponding to any one path of data can be obtained. Optionally, the second preset iteration times and the second preset loss threshold corresponding to each path of data may be set empirically, or may be set according to actual requirements, which is not limited in the embodiment of the present invention.
It should be understood that reaching the convergence condition corresponding to any path of data means that the number of iterations reaches the second preset number of iterations corresponding to that path of data; when the second preset numbers of iterations corresponding to the paths of data are the same, the numbers of training iterations of the initial language models corresponding to the paths of data may be the same, that is, the language models to be associated corresponding to the paths of data may be obtained at the same time.
Optionally, when determining the compression loss value based on the model loss value under each path of data, if any path of data has reached a convergence condition at the last iteration, that is, the model loss value of any path of data under the current iteration is null, the compression loss value may be determined based on the model loss value that is not null in the model loss values under each path of data; or may determine that a compression convergence condition is reached when any of the data reaches the convergence condition, and so on.
In the embodiment of the invention, when the model parameters in the intermediate language model corresponding to each path of data are iteratively optimized based on the training data set, if the compression model is a CNN semantic compression model, the compression model may be updated during the iterative process, and the compressed data source under any path of data in any training data may then be updated along with the compression model; in this case, the above-mentioned step of compressing the initial data sources under each path of data in each training data included in the training data set to obtain the compressed data sources under each path of data in each training data can be executed iteratively, so as to implement iterative training of the intermediate language model corresponding to each path of data. Alternatively, if the compression model is a mean pooling module, the compression model is not updated during the iterative process, and the compressed data source under any path of data in any training data remains unchanged; in this case, the above-mentioned step of masking the compressed data sources under each path of data in each training data to obtain the mask data sources under each path of data in each training data can be executed iteratively, so as to implement iterative training of the intermediate language model corresponding to each path of data, and so on.
S306, determining current compressed data sources under each path of data in each training data, and determining data sources to be coded under each path of data in corresponding training data based on the current compressed data sources under each path of data in each training data.
In one embodiment, the electronic device may invoke a compression model under the current system time, and perform compression processing on the initial data sources under each path of data in each training data, to obtain a current compressed data source under each path of data in each training data. Alternatively, the compression model at the current system time may be the target compression model (i.e., the compression model when the compression convergence condition is reached), or may be the compression model when the compression convergence condition is not reached (at this time, model training of the compression model may be continued based on the following associated loss value, that is, the following associated loss value may be used as the compression loss value at the current system time to perform model training), or the like; the embodiment of the present invention is not limited thereto.
In another embodiment, the electronic device may use the compressed data source under each path of data in each training data as the current compressed data source under the corresponding path of data in the corresponding training data, so as to determine the current compressed data source under each path of data in each training data, and so on. For example, when the compression model is a mean pooling module, the compressed data source under each path of data in each training data remains unchanged, so that the compressed data source under each path of data in each training data can be directly used as the current compressed data source under the corresponding path of data in the corresponding training data.
Optionally, the electronic device may use the current compressed data source under any path of data in any training data as a data source to be encoded under any path of data in any training data; or masking the current compressed data source under any path of data in any training data according to the masking probability to obtain the current masking data source under any path of data in any training data, using the current masking data source under any path of data in any training data as the data source to be encoded under any path of data in any training data, and the like.
S307, calling the language model to be associated corresponding to each path of data, and respectively encoding the data sources to be encoded under the corresponding path of data in each training data to obtain the current encoded data sources under each path of data in each training data.
Optionally, the electronic device may call an encoder in the language model to be associated corresponding to each path of data, and encode the data source to be encoded under the corresponding path of data in each training data, for example, when one data source to be encoded is a corresponding current compressed data source; or the encoder and the mask language model in the language model to be associated corresponding to each path of data (namely, the whole language model to be associated is called) can be called, and the data sources to be encoded under the corresponding path of data in each training data are respectively encoded, for example, when one data source to be encoded is the corresponding current mask data source, and the like; the embodiment of the present invention is not limited thereto.
S308, calculating an association loss value based on the current coding data source of each path of data in each training data, optimizing model parameters in the to-be-associated language model corresponding to each path of data according to the direction of reducing the association loss value, and obtaining the to-be-associated intermediate language model corresponding to each path of data so as to determine the target language model corresponding to each path of data based on the to-be-associated intermediate language model corresponding to each path of data.
Specifically, when calculating the association loss value based on the current coding data source under each path of data in each training data, the electronic device may determine at least one comparison learning group based on the current coding data source under each path of data in each training data, where one comparison learning group includes the current coding data sources under any two paths of data in each training data; then, each contrast learning group in at least one contrast learning group can be traversed, the currently traversed contrast learning group is used as a current contrast learning group, and two paths of data corresponding to the current contrast learning group are used as first path of data and second path of data. Based on the method, the loss value under the current comparison learning group can be calculated based on the current coding data source under the first path of data in each training data and the current coding data source under the second path of data in each training data; after traversing each contrast learning group in at least one contrast learning group, obtaining a loss value under each contrast learning group, and calculating an associated loss value based on the loss value under each contrast learning group. Optionally, the loss values under each comparison learning group can be weighted and summed (such as mean value operation or summation operation) to obtain an associated loss value; alternatively, the weights of the loss values under each comparison learning group may be set empirically, or may be set according to actual requirements, which is not limited in the embodiment of the present invention.
In one embodiment, when calculating a loss value under a current comparison learning group based on a current coding data source under first path data in each training data and a current coding data source under second path data in each training data, for any current coding semantic representation under first path data in any training data, a loss value of any current coding semantic representation can be calculated based on any current coding semantic representation under first path data in any training data and a current coding data source under second path data in each training data, so as to obtain a loss value of each current coding semantic representation under first path data in each training data; correspondingly, the loss value of each current coding semantic representation under the first path of data in each training data can be weighted and summed (such as mean value operation or summation operation) to obtain the loss value under the current comparison learning group.
Optionally, the electronic device may use formula 1.1 to calculate the loss value of any current encoded semantic representation, based on that current encoded semantic representation under the first path of data in any training data and the current encoded data sources under the second path of data in each training data. For example, W−1 may be the total number of current encoded semantic representations under the second path of data in the training data other than that training data, E_x may be the expectation over the current encoded semantic representations under the first path of data in each training data or the expectation over the current encoded semantic representations under the second path of data in each training data, f(x) may be that current encoded semantic representation under the first path of data in that training data, f(x⁺) may be each current encoded semantic representation under the second path of data in that training data (in this case there may be a plurality of f(x⁺), and the numerator is then a sum of a plurality of products), and f(x_j) may be the current encoded semantic representation of the j-th unit under the second path of data in the training data other than that training data.
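Formula 1.1 itself is given earlier in this disclosure and is not reproduced here. As a hedged illustration only, a contrastive loss consistent with the description above (positives taken from the second path of the same training data, negatives from the second path of the other training data) can be sketched as follows; the temperature parameter is an added assumption and is not part of the disclosure:

```python
import torch
import torch.nn.functional as F

def unit_level_contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """Loss of one current encoded semantic representation under the first path of data.

    anchor:    (d,)   one encoded semantic representation under the first path of data
    positives: (P, d) encoded semantic representations under the second path of the same training data
    negatives: (N, d) encoded semantic representations under the second path of the other training data
    """
    anchor = F.normalize(anchor, dim=-1)
    positives = F.normalize(positives, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos = torch.exp(anchor @ positives.T / temperature).sum()   # numerator: sum over the positives
    neg = torch.exp(anchor @ negatives.T / temperature).sum()   # negatives enter the denominator
    return -torch.log(pos / (pos + neg))
```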
In another embodiment, when calculating the loss value under the current comparison learning group based on the current encoded data sources under the first path of data in each training data and the current encoded data sources under the second path of data in each training data, semantic representation integration may be performed on the current encoded data source under the first path of data in each training data respectively (that is, semantic representation integration may be performed on the current encoded semantic representations under the first path of data in any training data), so as to obtain a semantic representation integration result under the first path of data in each training data, and semantic representation integration may be performed on the current encoded data source under the second path of data in each training data respectively, so as to obtain a semantic representation integration result under the second path of data in each training data. Based on the semantic representation integration result under the first path of data in any training data and the semantic representation integration results under the second path of data in each training data, the loss value of that training data under the current comparison learning group can be calculated, so as to obtain the loss value of each training data under the current comparison learning group; correspondingly, a weighted summation can be performed on the loss values of the respective training data under the current comparison learning group to obtain the loss value under the current comparison learning group, and so on. Optionally, the semantic representation integration may refer to semantic representation stitching, that is, the current encoded semantic representations in the same current encoded data source may be stitched together; or the semantic representation integration may refer to a mean operation, that is, a mean operation may be performed on the current encoded semantic representations in the same current encoded data source, and so on; the embodiment of the present invention is not limited thereto.
Optionally, the electronic device may use formula 1.1 to calculate the loss value of any training data under the current comparison learning group, based on the semantic representation integration result under the first path of data in that training data and the semantic representation integration results under the second path of data in each training data. For example, W may be the number of training data in the training data set, E_x may be the expectation over the semantic representation integration results under the first path of data in each training data or the expectation over the semantic representation integration results under the second path of data in each training data, f(x) may be the semantic representation integration result under the first path of data in that training data, f(x⁺) may be the semantic representation integration result under the second path of data in that training data, and f(x_j) may be the semantic representation integration result under the second path of data in the j-th training data in the training data set other than that training data.
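A minimal sketch of the two semantic representation integration options mentioned above (stitching, i.e. concatenation, and the mean operation); the function name and the mode argument are illustrative assumptions:

```python
def integrate(encoded_source, mode="mean"):
    """Integrate the current encoded semantic representations of one data source.

    encoded_source: (L, d) tensor, one row per current encoded semantic representation.
    mode "mean" averages the representations; mode "concat" stitches them into one vector.
    """
    if mode == "mean":
        return encoded_source.mean(dim=0)    # (d,)
    if mode == "concat":
        return encoded_source.reshape(-1)    # (L * d,)
    raise ValueError(f"unknown integration mode: {mode}")
```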
Further, when determining the target language model corresponding to each path of data based on the intermediate language model to be associated corresponding to each path of data, the electronic device may continue to perform model training (i.e., model pre-training) on the intermediate language model to be associated corresponding to each path of data until reaching the association convergence condition (e.g., the iteration number reaches a third preset iteration number or the association loss value is smaller than a third preset loss threshold value, etc.), to obtain the target language model corresponding to each path of data. Optionally, the third preset iteration number and the third preset loss threshold may be set empirically, or may be set according to actual requirements, which is not limited in the embodiment of the present invention.
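For illustration only (the function and parameter names are assumptions), the association convergence condition described above can be checked with a helper of the following form:

```python
def association_converged(iteration, association_loss,
                          third_preset_iterations, third_preset_loss_threshold):
    """Return True once the iteration count reaches the third preset iteration number
    or the association loss value falls below the third preset loss threshold."""
    return (iteration >= third_preset_iterations
            or association_loss < third_preset_loss_threshold)
```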
Therefore, the embodiment of the invention can perform contrast learning on the encoded semantic representations of each path of data, and can effectively pull the semantic space representations of the different paths of data closer, as shown in fig. 4; on this basis, after the semantic representation of the single-path data has been modeled (that is, after the language model to be associated corresponding to each path of data is obtained), the embodiment of the invention can make the subsequent merging of the semantic representations of the multi-path data sources more reasonable.
Optionally, the encoder in the target language model corresponding to each path of data may be used in the downstream task to determine a task language model corresponding to each path of data based on the encoder in the target language model corresponding to each path of data, where the task language model corresponding to each path of data may include the encoder in the target language model corresponding to the corresponding path of data. Optionally, the electronic device may further perform fine adjustment on the task language model corresponding to each path of data, so as to obtain a target task language model corresponding to each path of data; the target task language model corresponding to each path of data can be applied to a target task, and the target task can be any downstream task, which is not limited in the embodiment of the present invention.
Optionally, after determining the coded data sources under each path of data in the target data through the target language model or the target task language model corresponding to each path of data, the electronic device further performs data fusion (such as sampling or averaging the coded data sources under each path of data) on the coded data sources under each path of data in the target data, so as to obtain a fused data source (including at least one fused semantic representation) of the target data, and so on; it should be noted that, the embodiment of the present invention is not limited to the specific implementation manner of data fusion. Optionally, the target data may include an initial data source of the target object under each path of data, and the target object may be any object, which is not limited in the embodiment of the present invention.
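As one possible illustration of the data fusion step (averaging is only one of the options named above, and the function name is an assumption):

```python
import torch

def fuse_paths(encoded_sources):
    """Fuse the coded data sources of the target data across the M paths of data.

    encoded_sources: list of M tensors, each of shape (L, d); the sketch assumes the
    paths have been aligned to the same length L and simply averages them element-wise.
    """
    return torch.stack(encoded_sources, dim=0).mean(dim=0)   # fused data source, shape (L, d)
```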
After the training data set is obtained, the embodiment of the invention can respectively compress the initial data sources under each path of data in each training data included in the training data set to obtain the compressed data sources under each path of data in each training data, wherein one training data comprises the initial data sources of an object under each path of data in M paths of data, one path of data corresponds to one language model, and M is a positive integer; the number of compressed semantic representations in one compressed data source is less than the number of initial semantic representations in the corresponding initial data source. Then, the compressed data sources under each path of data in each training data can be respectively masked to obtain the mask data sources under each path of data in each training data; the initial language model corresponding to each path of data is respectively called, and the mask data sources under the corresponding path of data in each training data are encoded to obtain the encoded data sources under each path of data in each training data. Based on this, the model loss value under each path of data can be calculated based on the encoded data source under each path of data in each training data and the compressed data source under each path of data in each training data, and the model parameters in the initial language model corresponding to the corresponding path of data are optimized according to the model loss value under each path of data to obtain the intermediate language model corresponding to each path of data. Further, based on the training data set, the model parameters in the intermediate language model corresponding to each path of data can be iteratively optimized until the convergence condition is reached, so as to obtain the language model to be associated corresponding to each path of data; the current compressed data sources under each path of data in each training data are determined, and the data sources to be encoded under each path of data in the corresponding training data are determined based on the current compressed data sources under each path of data in each training data. Correspondingly, the language model to be associated corresponding to each path of data can be called, and the data sources to be encoded under the corresponding path of data in each training data are respectively encoded to obtain the current encoded data sources under each path of data in each training data; the association loss value is calculated based on the current encoded data source under each path of data in each training data, and the model parameters in the language model to be associated corresponding to each path of data are optimized in the direction of reducing the association loss value to obtain the intermediate language model to be associated corresponding to each path of data, so that the target language model corresponding to each path of data is determined based on the intermediate language model to be associated corresponding to each path of data.
Therefore, the embodiment of the invention can realize unsupervised pre-training on compressed text, that is, unsupervised pre-training of the mask language model can be carried out on the compressed semantic representations, so that the problem of slow training and reasoning caused by long text and/or multi-path data is solved through the compression processing; model pre-training is thereby facilitated, the model pre-training efficiency is improved, the training and reasoning speed of the model can be effectively improved, and the semantic representation capability for single-path data can be improved. In addition, when the value of M is greater than 1, the method mentioned in the embodiment of the invention can serve as a pre-training method that combines multiple paths of data sources: the semantic space representations of different data can be aligned, that is, the encoded semantic representations under the respective paths of data can be pulled closer, so that different data sources of the same object have a stronger association, that is, they can be in a complementary relationship; a better final semantic representation (that is, the fused semantic representation) can therefore be obtained, that is, the accuracy of the semantic representation can be improved. On the basis of this pre-training method, the semantic representations of the multiple paths of data can be aligned through the contrast learning technology, and the generalization capability on downstream tasks can be effectively improved.
Based on the description of the related embodiments of the model pre-training method, the embodiments of the present invention further provide a model pre-training apparatus, which may be a computer program (including program code) running in an electronic device; as shown in fig. 5, the model pre-training apparatus may comprise an acquisition unit 501 and a processing unit 502. The model pre-training apparatus may perform the model pre-training method shown in fig. 1 or fig. 3, i.e. the model pre-training apparatus may operate the above units:
an obtaining unit 501, configured to obtain a training data set, where one training data includes an initial data source of an object under each path of data in M paths of data, and one path of data corresponds to one language model, and M is a positive integer;
the processing unit 502 is configured to perform compression processing on initial data sources under each path of data in each training data included in the training data set, so as to obtain compressed data sources under each path of data in each training data, where the number of compressed semantic representations in one compressed data source is smaller than the number of initial semantic representations in the corresponding initial data source;
the processing unit 502 is further configured to mask the compressed data sources under each path of data in each training data, obtain mask data sources under each path of data in each training data, call an initial language model corresponding to each path of data, and encode the mask data sources under corresponding paths of data in each training data, so as to obtain encoded data sources under each path of data in each training data;
The processing unit 502 is further configured to calculate a model loss value under each path of data based on the encoded data source under each path of data in each training data and the compressed data source under each path of data in each training data, and optimize model parameters in an initial language model corresponding to each path of data according to the model loss value under each path of data, so as to obtain an intermediate language model corresponding to each path of data, so as to determine a target language model corresponding to each path of data based on the intermediate language model corresponding to each path of data.
In one embodiment, when the processing unit 502 performs compression processing on the initial data sources under each path of data in each training data included in the training data set to obtain the compressed data sources under each path of data in each training data set, the processing unit may be specifically configured to:
determining a text compression length according to any training data in the training data set and any data in the M paths of data;
According to the text compression length, carrying out data division on an initial data source under any path of data in any training data to obtain N semantic representation groups to be compressed, wherein the number of semantic representations in one semantic representation group to be compressed is smaller than or equal to the text compression length;
and respectively compressing each semantic representation group to be compressed in the N semantic representation groups to be compressed to obtain compressed semantic representations of each semantic representation group to be compressed so as to obtain a compressed data source under any path of data in any training data.
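A minimal sketch of this compression processing, assuming mean pooling as the per-group compressor (the actual compression model of the embodiment may differ, and the function name is an assumption):

```python
import torch

def compress_initial_source(initial_source, text_compression_length):
    """Compress one initial data source under one path of data.

    initial_source: (T, d) tensor of initial semantic representations.
    The source is divided into N groups of at most `text_compression_length`
    representations, and each group is reduced to one compressed semantic representation.
    """
    groups = torch.split(initial_source, text_compression_length, dim=0)
    return torch.stack([g.mean(dim=0) for g in groups], dim=0)   # (N, d) compressed data source
```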
In another embodiment, when the processing unit 502 masks the compressed data sources under the respective paths of data in the respective training data to obtain the mask data sources under the respective paths of data in the respective training data, the processing unit may be specifically configured to:
determining mask probability, and determining at least one mask unit from compressed data sources under any path of data in any training data according to the mask probability, wherein one mask unit corresponds to one compressed semantic representation;
And masking the compressed semantic representations of the determined masking units respectively to obtain masking data sources under any path of data in any one piece of training data, wherein the masking data sources under any path of data in any piece of training data comprise the masking semantic representations of the determined masking units.
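A minimal sketch of this masking step; the shared mask embedding and the guarantee of at least one mask unit are illustrative assumptions:

```python
import torch

def mask_compressed_source(compressed_source, mask_probability, mask_embedding):
    """Mask a compressed data source under one path of data.

    compressed_source: (N, d) compressed semantic representations.
    Each compressed unit is selected as a mask unit with probability `mask_probability`,
    and selected units are replaced by a shared (d,) mask embedding.
    """
    positions = torch.rand(compressed_source.size(0)) < mask_probability
    if not positions.any():                                   # keep at least one mask unit
        positions[torch.randint(compressed_source.size(0), (1,))] = True
    masked = compressed_source.clone()
    masked[positions] = mask_embedding
    return masked, positions
```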
In another embodiment, an encoded data source includes encoded semantic representations of each of H mask units, H being a positive integer; the processing unit 502 may be specifically configured to, when calculating the model loss value for each path of data based on the encoded data source for each path of data in each of the training data and the compressed data source for each path of data in each of the training data, respectively:
for the mth path of data in the M paths of data, determining the coding semantic representation of each mask unit under the mth path of data in the corresponding training data from the coding data sources under the mth path of data in each training data, wherein m ∈ [1, M];
Calculating a model loss value under the mth path of data based on the coding semantic representation of each mask unit under the mth path of data in each training data and a compressed data source under the mth path of data in the corresponding training data;
The processing unit 502 may be specifically configured to, when optimizing model parameters in the initial language model corresponding to the corresponding path data according to the model loss values under the path data to obtain an intermediate language model corresponding to the path data:
and optimizing model parameters in the initial language model corresponding to the mth path of data according to the model loss value under the mth path of data to obtain an intermediate language model corresponding to the mth path of data.
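As an illustration only, one simple realization of this per-path model loss is a reconstruction objective between each mask unit's encoded semantic representation and the corresponding original compressed semantic representation; the MSE form is an assumed stand-in, since the disclosure does not fix the concrete loss function:

```python
import torch.nn.functional as F

def model_loss_for_path(encoded_mask_units, compressed_source, mask_positions):
    """Model loss under the m-th path of data for one training data.

    encoded_mask_units: (H, d) encoded semantic representations of the H mask units.
    compressed_source:  (N, d) compressed data source under the same path, before masking.
    mask_positions:     boolean (N,) marking which compressed units were masked (sums to H).
    """
    targets = compressed_source[mask_positions]       # original representations of the mask units
    return F.mse_loss(encoded_mask_units, targets)
```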
In another embodiment, when determining the target language model corresponding to each path of data based on the intermediate language model corresponding to each path of data, the processing unit 502 may be specifically configured to:
Based on the training data set, iteratively optimizing model parameters in the intermediate language model corresponding to each path of data until convergence conditions are reached, and obtaining a language model to be associated corresponding to each path of data;
Determining current compressed data sources under the data of each path in each training data, and determining data sources to be coded under the data of each path in corresponding training data based on the current compressed data sources under the data of each path in each training data;
calling a language model to be associated corresponding to each path of data, and respectively encoding data sources to be encoded under corresponding paths of data in each training data to obtain current encoding data sources under each path of data in each training data;
Calculating an association loss value based on the current coding data source under each path of data in each training data, and optimizing model parameters in the to-be-associated language model corresponding to each path of data according to the direction of reducing the association loss value to obtain the to-be-associated intermediate language model corresponding to each path of data so as to determine a target language model corresponding to each path of data based on the to-be-associated intermediate language model corresponding to each path of data.
In another embodiment, the processing unit 502 may be specifically configured to, when calculating the association loss value based on the current encoded data source under each path of data in each of the training data:
Determining at least one comparison learning group based on the current coding data sources under the data paths in the training data, wherein one comparison learning group comprises the current coding data sources under any two data paths in the training data;
traversing each contrast learning group in the at least one contrast learning group, taking the currently traversed contrast learning group as a current contrast learning group, and taking two paths of data corresponding to the current contrast learning group as first path of data and second path of data;
Calculating a loss value under the current comparison learning group based on the current coding data source under the first path of data in each training data and the current coding data source under the second path of data in each training data;
after traversing each contrast learning group in the at least one contrast learning group, obtaining a loss value under each contrast learning group, and calculating the associated loss value based on the loss value under each contrast learning group.
In another embodiment, the compressed data source under each path of data in each training data is obtained by performing compression processing on the initial data source under each path of data in each training data through an initial compression model, and the processing unit 502 is further configured to:
optimizing model parameters in the initial compression model based on the model loss values of each path of data to obtain an intermediate compression model;
A target compression model is determined based on the intermediate compression model.
According to one embodiment of the invention, the steps involved in the method of fig. 1 or 3 may be performed by the various units in the model pre-training apparatus shown in fig. 5. For example, step S101 shown in fig. 1 may be performed by the acquisition unit 501 shown in fig. 5, and steps S102 to S104 may each be performed by the processing unit 502 shown in fig. 5. As another example, step S301 shown in fig. 3 may be performed by the acquisition unit 501 shown in fig. 5, steps S302-S308 may each be performed by the processing unit 502 shown in fig. 5, and so on.
According to another embodiment of the present invention, each unit in the model pre-training apparatus shown in fig. 5 may be separately or completely combined into one or several other units, or some unit(s) thereof may be further split into a plurality of units with smaller functions, which may achieve the same operation without affecting the implementation of the technical effects of the embodiments of the present invention. The above units are divided based on logic functions, and in practical applications, the functions of one unit may be implemented by a plurality of units, or the functions of a plurality of units may be implemented by one unit. In other embodiments of the present invention, any model pre-training apparatus may also include other units, and in practical applications, these functions may also be implemented with assistance from other units, and may be implemented by cooperation of multiple units.
According to another embodiment of the present invention, the model pre-training apparatus shown in fig. 5 may be constructed, and the model pre-training method of the embodiment of the present invention may be implemented, by running a computer program (including program code) capable of executing the steps involved in the respective methods shown in fig. 1 or fig. 3 on a general-purpose electronic device, such as a computer, that includes processing elements such as a central processing unit (CPU) and storage elements such as a random access memory (RAM) and a read-only memory (ROM). The computer program may be recorded on, for example, a computer storage medium, and loaded into and run in the above-described electronic device through the computer storage medium.
After the training data set is obtained, the embodiment of the invention can respectively compress the initial data sources under each path of data in each training data included in the training data set to obtain the compressed data sources under each path of data in each training data, wherein one training data comprises the initial data sources of an object under each path of data in M paths of data, one path of data corresponds to one language model, and M is a positive integer; the number of compressed semantic representations in one compressed data source is less than the number of initial semantic representations in the corresponding initial data source. Then, the compressed data sources under the data in each path in each training data can be respectively masked to obtain the mask data sources under the data in each path in each training data, the initial language model corresponding to each path of data is respectively called, and the mask data sources under the corresponding path of data in each training data are encoded to obtain the encoded data sources under each path of data in each training data. Based on the above, the model loss value under each path of data can be calculated based on the coding data source under each path of data in each training data and the compression data source under each path of data in each training data, and the model parameters in the initial language model corresponding to the corresponding path of data are optimized according to the model loss value under each path of data, so as to obtain the intermediate language model corresponding to each path of data, and the target language model corresponding to each path of data is determined based on the intermediate language model corresponding to each path of data. Therefore, the embodiment of the invention can conveniently perform model pre-training to improve the model pre-training efficiency, and can effectively improve the training and reasoning speed of the model.
Based on the description of the method embodiment and the apparatus embodiment, the exemplary embodiment of the present invention further provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor for causing the electronic device to perform a method according to an embodiment of the invention when executed by the at least one processor.
The exemplary embodiments of the present invention also provide a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is for causing the computer to perform a method according to an embodiment of the present invention.
The exemplary embodiments of the invention also provide a computer program product comprising a computer program, wherein the computer program, when being executed by a processor of a computer, is for causing the computer to perform a method according to an embodiment of the invention.
Referring to fig. 6, a block diagram of an electronic device 600 that may be a server or a client of the present invention will now be described, which is an example of a hardware device that may be applied to aspects of the present invention. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 6, the electronic device 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the electronic device 600 can also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
A number of components in the electronic device 600 are connected to the I/O interface 605, including: an input unit 606, an output unit 607, a storage unit 608, and a communication unit 609. The input unit 606 may be any type of device capable of inputting information to the electronic device 600, and may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device. The output unit 607 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 608 may include, but is not limited to, magnetic disks and optical disks. The communication unit 609 allows the electronic device 600 to exchange information/data with other devices through a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth devices, Wi-Fi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above. For example, in some embodiments, the model pre-training method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 600 via the ROM 602 and/or the communication unit 609. In some embodiments, the computing unit 601 may be configured to perform the model pre-training method by any other suitable means (e.g., by means of firmware).
Program code for carrying out methods of the present invention may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It is also to be understood that the foregoing is merely illustrative of the present invention and is not to be construed as limiting the scope of the invention, which is defined by the appended claims.
Claims (10)
1. A method of model pre-training, comprising:
acquiring a training data set, wherein one training data set comprises an initial data source of an object under each path of data in M paths of data, one path of data corresponds to one language model, and M is a positive integer;
Respectively compressing initial data sources under each path of data in each training data included in the training data set to obtain compressed data sources under each path of data in each training data, wherein the number of compressed semantic representations in one compressed data source is smaller than the number of initial semantic representations in the corresponding initial data source;
Masking the compressed data sources under each path of data in each training data respectively to obtain masking data sources under each path of data in each training data, calling initial language models corresponding to each path of data respectively, and encoding the masking data sources under the corresponding paths of data in each training data to obtain encoded data sources under each path of data in each training data;
And calculating model loss values under each path of data based on the coding data sources under each path of data in each training data and the compression data sources under each path of data in each training data, and optimizing model parameters in an initial language model corresponding to each path of data according to the model loss values under each path of data to obtain an intermediate language model corresponding to each path of data so as to determine a target language model corresponding to each path of data based on the intermediate language model corresponding to each path of data.
2. The method according to claim 1, wherein the compressing the initial data sources under the respective paths of data in the respective training data included in the training data set to obtain the compressed data sources under the respective paths of data in the respective training data includes:
determining a text compression length according to any training data in the training data set and any data in the M paths of data;
According to the text compression length, carrying out data division on an initial data source under any path of data in any training data to obtain N semantic representation groups to be compressed, wherein the number of semantic representations in one semantic representation group to be compressed is smaller than or equal to the text compression length;
and respectively compressing each semantic representation group to be compressed in the N semantic representation groups to be compressed to obtain compressed semantic representations of each semantic representation group to be compressed so as to obtain a compressed data source under any path of data in any training data.
3. The method according to claim 2, wherein the masking the compressed data sources under the respective paths of data in the respective training data to obtain the masked data sources under the respective paths of data in the respective training data includes:
determining mask probability, and determining at least one mask unit from compressed data sources under any path of data in any training data according to the mask probability, wherein one mask unit corresponds to one compressed semantic representation;
And masking the compressed semantic representations of the determined masking units respectively to obtain masking data sources under any path of data in any one piece of training data, wherein the masking data sources under any path of data in any piece of training data comprise the masking semantic representations of the determined masking units.
4. A method according to any one of claims 1-3, wherein one source of encoded data comprises encoded semantic representations of each of H mask units, H being a positive integer; the calculating the model loss value under each path of data based on the coding data source under each path of data in each training data and the compression data source under each path of data in each training data respectively includes:
for the mth path of data in the M paths of data, determining the coding semantic representation of each mask unit under the mth path of data in the corresponding training data from the coding data sources under the mth path of data in each training data, wherein m ∈ [1, M];
Calculating a model loss value under the mth path of data based on the coding semantic representation of each mask unit under the mth path of data in each training data and a compressed data source under the mth path of data in the corresponding training data;
the optimizing the model parameters in the initial language model corresponding to the corresponding path data according to the model loss value under the path data to obtain the intermediate language model corresponding to the path data comprises the following steps:
and optimizing model parameters in the initial language model corresponding to the mth path of data according to the model loss value under the mth path of data to obtain an intermediate language model corresponding to the mth path of data.
5. A method according to any one of claims 1-3, wherein determining the target language model corresponding to each path of data based on the intermediate language model corresponding to each path of data comprises:
Based on the training data set, iteratively optimizing model parameters in the intermediate language model corresponding to each path of data until convergence conditions are reached, and obtaining a language model to be associated corresponding to each path of data;
Determining current compressed data sources under the data of each path in each training data, and determining data sources to be coded under the data of each path in corresponding training data based on the current compressed data sources under the data of each path in each training data;
calling a language model to be associated corresponding to each path of data, and respectively encoding data sources to be encoded under corresponding paths of data in each training data to obtain current encoding data sources under each path of data in each training data;
Calculating an association loss value based on the current coding data source under each path of data in each training data, and optimizing model parameters in the to-be-associated language model corresponding to each path of data according to the direction of reducing the association loss value to obtain the to-be-associated intermediate language model corresponding to each path of data so as to determine a target language model corresponding to each path of data based on the to-be-associated intermediate language model corresponding to each path of data.
6. The method of claim 5, wherein said calculating an associated loss value based on a current source of encoded data for each of said paths of data in said respective training data comprises:
Determining at least one comparison learning group based on the current coding data sources under the data paths in the training data, wherein one comparison learning group comprises the current coding data sources under any two data paths in the training data;
traversing each contrast learning group in the at least one contrast learning group, taking the currently traversed contrast learning group as a current contrast learning group, and taking two paths of data corresponding to the current contrast learning group as first path of data and second path of data;
Calculating a loss value under the current comparison learning group based on the current coding data source under the first path of data in each training data and the current coding data source under the second path of data in each training data;
after traversing each contrast learning group in the at least one contrast learning group, obtaining a loss value under each contrast learning group, and calculating the associated loss value based on the loss value under each contrast learning group.
7. A method according to any one of claims 1 to 3, wherein the compressed data sources for each of the paths of data in each of the training data are obtained by compressing the initial data sources for each of the paths of data in each of the training data using an initial compression model, the method further comprising:
optimizing model parameters in the initial compression model based on the model loss values of each path of data to obtain an intermediate compression model;
A target compression model is determined based on the intermediate compression model.
8. A model pre-training apparatus, the apparatus comprising:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a training data set, one training data comprises an initial data source of an object under each path of data in M paths of data, one path of data corresponds to one language model, and M is a positive integer;
The processing unit is used for respectively compressing the initial data sources under each path of data in each training data included in the training data set to obtain compressed data sources under each path of data in each training data, and the number of compressed semantic representations in one compressed data source is smaller than the number of initial semantic representations in the corresponding initial data source;
The processing unit is further configured to mask the compressed data sources under each path of data in each training data respectively, obtain mask data sources under each path of data in each training data, call an initial language model corresponding to each path of data respectively, and encode the mask data sources under the corresponding path of data in each training data, so as to obtain encoded data sources under each path of data in each training data;
The processing unit is further configured to calculate a model loss value under each path of data based on the encoded data source under each path of data in each training data and the compressed data source under each path of data in each training data, and optimize model parameters in an initial language model corresponding to each path of data according to the model loss value under each path of data, so as to obtain an intermediate language model corresponding to each path of data, so as to determine a target language model corresponding to each path of data based on the intermediate language model corresponding to each path of data.
9. An electronic device, comprising:
a processor; and
A memory in which a program is stored,
Wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the method according to any of claims 1-7.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410131223.5A CN118132672A (en) | 2024-01-30 | 2024-01-30 | Model pre-training method and device, storage medium and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118132672A true CN118132672A (en) | 2024-06-04 |
Legal Events

Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |