CN117688176A - A pseudo-language family clustering method and device based on a multilingual pre-trained large model

A pseudo-language family clustering method and device based on a multilingual pre-trained large model

Info

Publication number: CN117688176A
Application number: CN202311653724.1A
Authority: CN (China)
Prior art keywords: language, pairs, similarity, pool, shared
Legal status: Granted; currently Active
Other languages: Chinese (zh)
Other versions: CN117688176B (en)
Inventors: 刘学博, 马新羽, 张民
Assignee (current and original): Harbin Institute of Technology (Shenzhen); Shenzhen Institute of Science and Technology Innovation, Harbin Institute of Technology
Application filed by the assignee with priority to CN202311653724.1A
Publication of application: CN117688176A
Application granted; publication of grant: CN117688176B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/0499 Feedforward networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of text machine translation, and in particular to a pseudo-language family clustering method and device based on a multilingual pre-trained large model. The method includes: establishing a shared language pool; computing, based on the multilingual pre-trained large model, the Fisher information matrix of each language pair in the shared language pool to obtain a representation of each language pair; computing the similarity between language pairs from the representations to obtain similarity values; and ranking the language pairs by similarity value and selecting, according to a preset boundary value, the auxiliary language pairs that satisfy the boundary value, thereby completing pseudo-language family clustering based on the multilingual pre-trained large model. The invention uses the capability of multilingual pre-training itself to represent language pairs, selects and clusters auxiliary languages more effectively, improves their generalization across different models and datasets, and ultimately improves the translation quality of low-resource language pairs under multilingual co-training.

Description

A pseudo-language family clustering method and device based on a multilingual pre-trained large model

Technical Field

The present invention relates to the technical field of machine translation, and in particular to a pseudo-language family clustering method and device based on a multilingual pre-trained large model.

Background

Neural machine translation (NMT) has become the dominant machine translation (MT) paradigm in both academic research and commercial use. In recent years, research has found that the NMT framework can naturally integrate multiple languages, and research on MT systems involving multiple languages has therefore increased dramatically. NMT systems that handle translation for more than one language pair are called multilingual NMT (MNMT) systems. The ultimate goal of MNMT research is to develop a single model for translation between as many languages as possible by efficiently exploiting the available language resources. Although MNMT brings promising improvements in translation quality, these models rely on large-scale parallel corpora. Since such corpora exist for only a few language pairs, translation performance on most low-resource languages falls far short of expectations. Related research shows that, for low-resource translation, introducing additional auxiliary language pairs in the fine-tuning stage and performing multilingual co-training can in some cases outperform conventional fine-tuning. However, follow-up studies further point out that co-training does not always bring positive effects and may sometimes even degrade translation quality, depending on the choice of co-training language pairs.

Research at home and abroad in recent years has shown that fine-tuning a model with language pairs close to the target language can improve the translation quality of the target language pair without using any data of the target pair itself, which further demonstrates that synergy exists between language pairs. However, co-training does not achieve the same effect for arbitrary language pairs, so the screening of co-training language pairs becomes a key step in improving the translation quality of MNMT on low-resource pairs. Languages in the same language family usually share a common geographical and genealogical background and therefore have many similarities at the character or word level; from a linguistic perspective, such language pairs share more identical or similar linguistic features such as characters and grammar. Current academic research in this field falls mainly into two directions. On the one hand, researchers integrate various kinds of prior knowledge, including language similarity, resource availability, language typology, and task-specific requirements. On the other hand, researchers apply language embeddings, representing each language with an embedding vector and clustering the vectors in the embedding space: for example, adding a language embedding layer to the model, building an embedding vector for each language pair after multilingual training, and then constructing language families through hierarchical clustering to improve the translation quality of language pairs; or, while keeping the pre-trained model parameters unchanged, embedding adapter structures into the model and training language-family adapters on downstream tasks to improve translation quality.

Although these methods can improve the translation quality of language pairs, they face certain difficulties in practical application. In particular, training a new model or changing the structure of the model makes these methods complicated, and they are also hard to reproduce when the original structure and data of a large language model are unavailable.

Summary of the Invention

To solve the technical problem in the prior art that training a new model or changing the model structure makes such methods complicated, and that they are hard to reproduce when the original structure and data of a large language model are unavailable, embodiments of the present invention provide a pseudo-language family clustering method and device based on a multilingual pre-trained large model. The technical solutions are as follows:

In one aspect, a pseudo-language family clustering method based on a multilingual pre-trained large model is provided. The method is implemented by a pseudo-language family clustering device based on a multilingual pre-trained large model and includes:

S1. Establish a shared language pool.

S2. Based on the multilingual pre-trained large model, compute the Fisher information matrix of each language pair in the shared language pool to obtain a representation of each language pair.

S3. Compute the similarity between language pairs from the representations to obtain similarity values.

S4. Rank the language pairs by similarity value, select the auxiliary language pairs that satisfy a preset boundary value, and complete pseudo-language family clustering based on the multilingual pre-trained large model.

Optionally, in step S1, establishing the shared language pool includes:

obtaining the TED dataset;

extracting multiple languages from the TED dataset and using the language pairs that translate these languages into English as the base dataset to establish the shared language pool.

Optionally, in step S2, computing the Fisher information matrix of each language pair in the shared language pool based on the multilingual pre-trained large model to obtain the representation of each language pair includes:

obtaining the parallel corpus corresponding to each language pair in the shared language pool and dividing the data in the parallel corpus equally into j mini-batches;

feeding the mini-batches into the multilingual pre-trained large model in turn and outputting the Fisher information matrix of each mini-batch;

after one input epoch, computing the average Fisher information matrix over the mini-batches and using the average as the estimate to obtain the Fisher information weights;

characterizing the distribution of the corresponding language pair in the shared language pool according to the Fisher information weights.

Optionally, in step S3, computing the similarity between language pairs from the representations to obtain similarity values includes:

obtaining the representations;

using the mean square error method to compute the distance between the target language pair and an auxiliary language pair, where a smaller distance means a higher similarity.

Optionally, in step S3, computing the similarity between language pairs from the representations to obtain similarity values includes:

using the Fisher information matrices to compute the KL divergence from an auxiliary language to the target language, thereby obtaining the distance between the target language pair and the auxiliary language pair, where a smaller distance means a higher similarity.

Optionally, in step S3, computing the similarity between language pairs from the representations to obtain similarity values includes:

creating a Fisher information mask by selecting the top-K parameters and assigning them the value 1 while assigning the remaining parameters the value 0;

computing the distance between the target language pair and an auxiliary language pair from the number of simultaneously activated parameters and the number of parameters activated in the target direction, where a smaller distance means a higher similarity.

Optionally, in step S4, ranking the language pairs by similarity value, selecting the auxiliary language pairs that satisfy the preset boundary value, and completing pseudo-language family clustering based on the multilingual pre-trained large model includes:

computing the similarity between all language pairs by traversal;

sorting the language pairs in descending order of similarity;

presetting an initial search radius and delineating the boundary range according to the initial search radius;

integrating the closest language pairs within the boundary into the auxiliary language list;

updating the search radius according to the similarity between the most recently added language pair and the target language pair;

repeating the radius update until no new language pair is added, obtaining the clustered pseudo-language family and completing pseudo-language family clustering based on the multilingual pre-trained large model.

In another aspect, a pseudo-language family clustering device based on a multilingual pre-trained large model is provided. The device applies the pseudo-language family clustering method based on a multilingual pre-trained large model and includes:

a language pool module, configured to establish a shared language pool;

a representation module, configured to compute, based on the multilingual pre-trained large model, the Fisher information matrix of each language pair in the shared language pool and obtain the representation of each language pair;

a similarity calculation module, configured to compute the similarity between language pairs from the representations and obtain similarity values;

a clustering module, configured to rank the language pairs by similarity value, select the auxiliary language pairs that satisfy a preset boundary value, and complete pseudo-language family clustering based on the multilingual pre-trained large model.

In another aspect, a pseudo-language family clustering device based on a multilingual pre-trained large model is provided, including a processor and a memory storing computer-readable instructions that, when executed by the processor, implement any one of the above pseudo-language family clustering methods based on a multilingual pre-trained large model.

In another aspect, a computer-readable storage medium is provided. At least one instruction is stored in the storage medium, and the at least one instruction is loaded and executed by a processor to implement any one of the above pseudo-language family clustering methods based on a multilingual pre-trained large model.

The beneficial effects of the technical solutions provided by the embodiments of the present invention include at least the following:

Aiming at the limitations of the prior art, which requires additional prior knowledge or modification of the model architecture, the present invention provides a more effective language pair clustering method for multilingual co-training. The core goal is to use the capability of multilingual pre-training itself to represent language pairs, to select and cluster auxiliary languages more effectively and improve their generalization across different models and datasets, and ultimately to improve the translation quality of low-resource language pairs under multilingual co-training.

Brief Description of the Drawings

In order to explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.

Figure 1 is a flow chart of a pseudo-language family clustering method based on a multilingual pre-trained large model provided by an embodiment of the present invention;

Figure 2 is a schematic diagram of the language pairs (XX-en) provided by an embodiment of the present invention;

Figure 3 is a schematic diagram of the distribution of the top-40% Fisher information parameters in the model structure provided by an embodiment of the present invention;

Figure 4 is a block diagram of a pseudo-language family clustering device based on a multilingual pre-trained large model provided by an embodiment of the present invention;

Figure 5 is a schematic structural diagram of a pseudo-language family clustering device based on a multilingual pre-trained large model provided by an embodiment of the present invention.

Detailed Description

The technical solutions in the present invention are described below with reference to the accompanying drawings.

In the embodiments of the present invention, words such as "exemplarily" and "for example" are used to give an example, illustration, or explanation. Any embodiment or design described as an "example" in the present invention should not be construed as preferred or advantageous over other embodiments or designs. Rather, the word "example" is intended to present a concept in a concrete way. In addition, in the embodiments of the present invention, "and/or" may mean both, or either of the two.

In the embodiments of the present invention, "image" and "picture" are sometimes used interchangeably; it should be noted that when the difference is not emphasized, the meaning conveyed is the same. "Of", "relevant", and "corresponding" are sometimes used interchangeably; it should be noted that when the difference is not emphasized, the meaning conveyed is the same.

In the embodiments of the present invention, a subscript such as W_1 may occasionally be miswritten in a non-subscript form such as W1; when the difference is not emphasized, the meaning conveyed is the same.

In order to make the technical problems, technical solutions, and advantages to be solved by the present invention clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments.

An embodiment of the present invention provides a pseudo-language family clustering method based on a multilingual pre-trained large model. The method can be implemented by a pseudo-language family clustering device based on a multilingual pre-trained large model, which may be a terminal or a server. As shown in the flow chart of Figure 1, the processing flow of the method may include the following steps:

S101. Establish a shared language pool.

In a feasible implementation, in step S101, establishing the shared language pool includes:

obtaining the TED dataset;

extracting multiple languages from the TED dataset and using the language pairs that translate these languages into English as the base dataset to establish the shared language pool.

In a feasible implementation, the study of language pair representations first requires a shared language pool that contains both high-resource and low-resource language pairs and spans multiple language families. The present invention selects the TED dataset and uses the data of 17 languages translated into English (en) as the base dataset. Together these language pairs form the shared language pool from which auxiliary languages for low-resource languages are selected in the subsequent steps. The languages span seven different language families: Balto-Slavic, Austronesian, Indo-Iranian, Turkic, Japonic, Koreanic, and Germanic, as shown in Figure 2.
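For illustration, the shared pool can be represented as a simple mapping from language codes to language families. The five low-resource targets below (fa, hi, bn, id, ms) are named later in the text; the remaining codes are assumptions added for illustration, since the full list of 17 languages appears only in Figure 2.

```python
# A minimal sketch of a shared language pool: ISO codes mapped to language
# families, each paired with English (XX-en). This subset is illustrative;
# the patent's actual pool contains 17 languages across seven families.
SHARED_POOL = {
    "fa": "Indo-Iranian",   # Persian (low-resource target)
    "hi": "Indo-Iranian",   # Hindi (low-resource target)
    "bn": "Indo-Iranian",   # Bengali (low-resource target)
    "id": "Austronesian",   # Indonesian (low-resource target)
    "ms": "Austronesian",   # Malay (low-resource target)
    "ru": "Balto-Slavic",   # the codes below are assumed pool members
    "tr": "Turkic",
    "ja": "Japonic",
    "ko": "Koreanic",
    "de": "Germanic",
}

def language_pairs(pool):
    """Return the XX-en translation directions that make up the pool."""
    return [f"{code}-en" for code in pool]
```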

S102. Based on the multilingual pre-trained large model, compute the Fisher information matrix of each language pair in the shared language pool to obtain the representation of each language pair.

In a feasible implementation, in step S102, computing the Fisher information matrix of each language pair in the shared language pool based on the multilingual pre-trained large model to obtain the representation of each language pair includes:

obtaining the parallel corpus corresponding to each language pair in the shared language pool and dividing the data in the parallel corpus equally into j mini-batches;

feeding the mini-batches into the multilingual pre-trained large model in turn and outputting the Fisher information matrix of each mini-batch;

after one input epoch, computing the average Fisher information matrix over the mini-batches and using the average as the estimate to obtain the Fisher information weights;

characterizing the distribution of the corresponding language pair in the shared language pool according to the Fisher information weights.

The present invention uses the FIM to evaluate the parameters of the pre-trained model, computing a Fisher information weight for each parameter. The magnitude of the weight indicates the parameter's importance: a parameter with a large weight is sensitive to a specific translation direction. This computation is used to evaluate and select the parameters that are sensitive to a specific translation direction, that is, the parameters that require substantial weight updates during fine-tuning, and it is an important indicator of the importance and potential value of a specific parameter. In essence, the FIM quantifies the variance of the first derivative of the log-likelihood; by measuring this quantity, one can infer the necessity of fine-tuning a specific parameter in subsequent tasks. The original formula is as follows:

$$F_\theta = \mathbb{E}\left[\nabla_\theta \log P(Y \mid X; \theta)\, \nabla_\theta \log P(Y \mid X; \theta)^{T}\right] \qquad (1)$$

where X and Y denote the input and output of the model, respectively; θ denotes the model parameters; P denotes the probability distribution of the output Y given the input X under the parameters θ; T denotes the matrix transpose; and E denotes the expectation. For the i-th parameter, using a diagonal matrix helps estimate the Fisher information matrix:

$$F_{\theta_i} = \mathbb{E}\left[\left(\frac{\partial}{\partial \theta_i} \log P(Y \mid X; \theta)\right)^{2}\right] \qquad (2)$$

Although using a diagonal matrix helps estimate the FIM, obtaining accurate probability estimates is still a difficult task. In view of this, the following formula is used to approximate the FIM:

$$F_{\theta_i} \approx \frac{1}{|D|} \sum_{(x, y) \in D} \left(\frac{\partial}{\partial \theta_i} \log P(y \mid x; \theta)\right)^{2} \qquad (3)$$

Here D denotes the whole dataset and |D| the number of samples; the dataset is divided into j mini-batches of equal size, which are fed into the model in turn for training. The present invention feeds the parallel corpus corresponding to a language into the model and computes the FIM of each mini-batch within a single epoch. During this pass, formula (3) is accumulated for each mini-batch without performing backpropagation updates. After the epoch is completed, the average FIM over the mini-batches is computed as the final estimate.

Furthermore, the present invention analyzes the distribution of high-Fisher-information parameters within the pre-trained model structure. The distribution of the top-40% parameters observed in the study is shown in Figure 3. The pre-trained model is initially divided into five parts: the encoder attention layer (E_a), the encoder fully connected layer (E_f), the decoder self-attention layer (D_a), the decoder cross-attention layer (D_c), and the decoder fully connected layer (D_f). More than 60% of these parameters are located in the feed-forward networks (FFN); the present invention therefore selects the FFN layers for the FIM computation and the similarity measurement.
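As an illustration of the estimation procedure in formula (3), the following PyTorch sketch accumulates squared log-likelihood gradients over one epoch and averages them over the mini-batches. It is a reconstruction, not code from the patent; the restriction to FFN sublayers via the fc1/fc2 parameter names is an assumption modeled on fairseq-style Transformers such as m2m100.

```python
import torch

def estimate_fim(model, dataloader, device="cuda"):
    """Estimate a diagonal FIM over the FFN parameters of a seq2seq model.

    Accumulates the squared gradients of the log-likelihood (formula (3))
    for one epoch without any optimizer step, then averages the result
    over the mini-batches, as described in the text.
    """
    model.to(device).train()
    # Keep only feed-forward sublayers; "fc1"/"fc2" is the naming used by
    # fairseq-style Transformer checkpoints (an assumption to verify).
    ffn_params = {n: p for n, p in model.named_parameters()
                  if "fc1" in n or "fc2" in n}
    fim = {n: torch.zeros_like(p) for n, p in ffn_params.items()}

    num_batches = 0
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        model.zero_grad()
        # The loss is the negative log-likelihood, so its gradient is
        # -grad(log P); squaring removes the sign difference.
        loss = model(**batch).loss
        loss.backward()
        for n, p in ffn_params.items():
            if p.grad is not None:
                fim[n] += p.grad.detach() ** 2
        num_batches += 1

    return {n: acc / num_batches for n, acc in fim.items()}
```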

S103. Compute the similarity between language pairs from the representations to obtain similarity values.

In a feasible implementation, in step S103, computing the similarity between language pairs from the representations to obtain similarity values includes:

obtaining the representations;

using the mean square error method to compute the distance between each language pair in the shared language pool and the target language pair, where a smaller distance means a higher similarity.

In a feasible implementation, the mean square error (MSE) is computed as follows, where t and a denote the target language pair and the auxiliary language pair, respectively; S(t, a) is the distance between t and a (the smaller the distance, the higher the similarity); F is the FIM; and |F^t| denotes the number of parameters:

$$S(t, a) = \frac{1}{|F^{t}|} \sum_{i} \left(F_{i}^{t} - F_{i}^{a}\right)^{2}$$
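A direct rendering of the MSE distance above, assuming both diagonal FIMs are stored as dictionaries of tensors as in the estimation sketch earlier:

```python
import torch

def mse_distance(fim_t, fim_a):
    """MSE distance between the target and auxiliary diagonal FIMs.

    fim_t, fim_a: {parameter_name: tensor} with identical keys/shapes;
    a smaller return value means a higher similarity.
    """
    t = torch.cat([v.flatten() for v in fim_t.values()])
    a = torch.cat([v.flatten() for v in fim_a.values()])
    return ((t - a) ** 2).mean().item()  # mean = sum / |F^t|
```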

In a feasible implementation, in step S103, computing the similarity between language pairs from the representations to obtain similarity values includes:

using the Fisher information matrices to compute the KL divergence between each language pair in the shared language pool and the target language pair, thereby obtaining the distance between them, where a smaller distance means a higher similarity.

In a feasible implementation, the Kullback-Leibler (KL) divergence uses the FIM directly to compute the KL divergence from a language pair in the shared language pool to the target language pair, thereby representing the distance between language pairs more accurately. The formula is as follows, where the symbols are consistent with those described for the MSE and |·| denotes the absolute value:

$$S(t, a) = \sum_{i} \left| F_{i}^{a} \log \frac{F_{i}^{a}}{F_{i}^{t}} \right|$$
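Under the same storage assumption, the reconstructed KL-style distance can be sketched as follows; the eps term is an added guard against division by zero and log of zero, not part of the original.

```python
import torch

def kl_distance(fim_t, fim_a, eps=1e-12):
    """KL-style distance from the auxiliary FIM to the target FIM,
    with absolute values taken termwise as in the reconstruction above."""
    t = torch.cat([v.flatten() for v in fim_t.values()]) + eps
    a = torch.cat([v.flatten() for v in fim_a.values()]) + eps
    return (a * torch.log(a / t)).abs().sum().item()
```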

In a feasible implementation, in step S103, computing the similarity between language pairs from the representations to obtain similarity values includes:

creating a Fisher information mask by selecting the top-K parameters and assigning them the value 1 while assigning the remaining parameters the value 0;

computing the distance between the target language pair and the auxiliary language pair from the number of simultaneously activated parameters and the number of parameters activated in the target direction, where a smaller distance means a higher similarity.

In a feasible implementation, the overlap similarity (Overlap) differs from the previous two measures in that it does not use the FIM directly: a Fisher information mask M is created by selecting the top-K parameters and assigning them the value 1 while the remaining parameters are assigned the value 0. The calculation is as follows, where Overlapping and Activate denote the number of simultaneously activated parameters and the number of parameters activated in the target direction, respectively:

$$S(t, a) = \frac{\mathrm{Overlapping}(M^{t}, M^{a})}{\mathrm{Activate}(M^{t})}$$

In the Overlap method, the present invention uses 40% as the default value of K, because this achieves the best translation results.
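A sketch of the Overlap measure with the 40% default for K. Unlike the MSE and KL distances, a larger value here means a higher similarity.

```python
import torch

def overlap_similarity(fim_t, fim_a, k=0.4):
    """Overlap similarity between top-K Fisher information masks.

    k=0.4 follows the 40% default stated in the text. Overlapping counts
    the jointly activated parameters; Activate counts those activated in
    the target direction.
    """
    t = torch.cat([v.flatten() for v in fim_t.values()])
    a = torch.cat([v.flatten() for v in fim_a.values()])
    top = int(k * t.numel())
    mask_t = torch.zeros_like(t, dtype=torch.bool)
    mask_a = torch.zeros_like(a, dtype=torch.bool)
    mask_t[torch.topk(t, top).indices] = True   # M^t
    mask_a[torch.topk(a, top).indices] = True   # M^a
    overlapping = (mask_t & mask_a).sum().item()
    activate = mask_t.sum().item()
    return overlapping / activate
```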

S104. Rank the language pairs by similarity value, select the auxiliary language pairs that satisfy the preset boundary value, and complete pseudo-language family clustering based on the multilingual pre-trained large model.

In a feasible implementation, in step S104, ranking the language pairs by similarity value, selecting the auxiliary language pairs that satisfy the preset boundary value, and completing pseudo-language family clustering based on the multilingual pre-trained large model includes:

computing the similarity between all language pairs by traversal;

sorting the language pairs in descending order of similarity;

presetting an initial search radius and delineating the boundary range according to the initial search radius;

integrating the closest language pairs within the boundary into the auxiliary language list;

updating the search radius according to the similarity between the most recently added language pair and the target language pair;

repeating the radius update until no new language pair is added, obtaining the clustered pseudo-language family and completing pseudo-language family clustering based on the multilingual pre-trained large model.

In a feasible implementation, the present invention designs a simple algorithm to select the auxiliary languages. First, the similarity between the target language pair and every other language pair is computed with one of the similarity measures above, and the language pairs are sorted by similarity; an initial search radius is then set. Within this predefined boundary, the closest language pairs are integrated into the auxiliary language list. The radius is then adjusted according to the similarity between the most recently added language pair and the target language pair. This process is repeated until no new language pair is added. The present invention refers to a language family clustered in this way as a pseudo-language family. The algorithm for selecting auxiliary language pairs is as follows (a runnable sketch is given after the steps):

1. Sort the language pairs from most to least similar (MSE and KL distances in ascending order, Overlap similarity in descending order), creating a list L;

2. Initialize Gap = |L[1]-L[0]| and add the first language to the auxiliary list;

3. Iterate from i = 2 to the end of L; in each iteration the language is selected as follows:

a) If |L[i-1]-L[i]| < Gap, add the i-th language to the auxiliary list and update Gap = |L[i-1]-L[i]|;

b) If |L[i-1]-L[i]| = Gap, add the i-th language to the auxiliary list and update Gap;

c) If |L[i-1]-L[i]| > Gap, terminate the loop;

4. The language pairs in the auxiliary list, together with the target language pair, form the pseudo-language family of the target language pair.
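Under the steps above, the selection loop can be sketched in Python as follows. This is illustrative only: the Gap update in step 3b is not fully specified in the original, so the sketch leaves Gap unchanged in that branch.

```python
def select_pseudo_family(target, scores, descending=False):
    """Select the pseudo-language family of `target` by the adaptive-gap rule.

    scores: {language_pair: score_vs_target} for every candidate pair;
    use descending=True for Overlap similarity and False for the MSE/KL
    distances, so that the most similar pairs come first.
    """
    ranked = sorted(scores, key=scores.get, reverse=descending)
    L = [scores[p] for p in ranked]
    gap = abs(L[1] - L[0])           # step 2: initialize Gap
    auxiliary = [ranked[0]]          # step 2: add the first language
    for i in range(2, len(L)):       # step 3
        diff = abs(L[i - 1] - L[i])
        if diff < gap:               # 3a: add and tighten the gap
            auxiliary.append(ranked[i])
            gap = diff
        elif diff == gap:            # 3b: add (Gap update unspecified)
            auxiliary.append(ranked[i])
        else:                        # 3c: stop expanding
            break
    return auxiliary + [target]      # step 4: the pseudo-language family
```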

In a feasible implementation, in order to evaluate the method of the present invention, the following baselines are designed on the basis of the m2m100_418M model:

1. Pre-trained: use the pre-trained model to translate the target language pair directly, without any fine-tuning;

2. FT (fine-tune): fine-tune the base model with the bilingual data of the target language pair;

3. LF (language family): fine-tune with the traditional language families shown in Figure 2, using temperature sampling with the temperature set to 1.5 (see the sampling sketch after this list);

4. LF+FT: on top of the LF method, further fine-tune with the data of the target language pair.
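Temperature sampling in the LF baseline flattens the data distribution toward low-resource pairs. A common formulation (a standard definition, not quoted from the patent) samples language pair $i$ with probability $p_i \propto (n_i / \sum_k n_k)^{1/T}$, where $n_i$ is the corpus size of pair $i$ and $T = 1.5$ here. A minimal sketch:

```python
def temperature_sampling_probs(sizes, T=1.5):
    """Per-language-pair sampling probabilities under temperature T.

    sizes: {language_pair: corpus_size}. T=1 keeps the natural data
    proportions; T>1 flattens them toward low-resource pairs.
    """
    total = sum(sizes.values())
    weights = {k: (n / total) ** (1.0 / T) for k, n in sizes.items()}
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}
```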

For the training stage, the batch size is set to 4096. For the method proposed by the present invention, the training data are upsampled to equal sizes so that each language has the same proportion in every mini-batch, except for Hindi (hi); when using the Overlap method, the same sampling as LF is used. Optimization uses the Adam optimizer with β1 = 0.98, β2 = 0.98, and ε = 10^-6. The learning rate is lr = 3e-5. In the present invention, the BLEU (bilingual evaluation understudy) score is used to evaluate translation quality.
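BLEU scores of the kind reported in Table 2 are routinely computed with the sacrebleu package; the patent does not name a specific tool, so this is only one plausible setup.

```python
import sacrebleu

def corpus_bleu_score(hypotheses, references):
    """Corpus-level BLEU over detokenized sentences.

    hypotheses: list of system outputs; references: list of reference
    translations aligned with the hypotheses.
    """
    return sacrebleu.corpus_bleu(hypotheses, [references]).score
```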

Table 1. Pseudo-language families selected by the different methods

Table 2. BLEU scores of each low-resource language pair under the different methods

Table 1 shows the pseudo-language families clustered by the method of the present invention for the low-resource language pairs in the shared language pool. Table 2 shows the test results on the TED dataset for Persian (fa), Hindi (hi), Bengali (bn), Indonesian (id), and Malay (ms) to English.

Models (1) to (4) represent the baselines in currently common fine-tuning practice, and models (5) to (7) represent implementations of the proposed method with the different similarity measures.

The present invention uses the three measures MSE, KL, and Overlap to compute the distance or similarity between language pairs, clusters the pseudo-language families accordingly, and verifies them on low-resource language pairs. The evaluation results show that all three measures further improve the BLEU score with comparable final gains, and model (7) achieves the best improvement.

Figure 4 is a block diagram of a pseudo-language family clustering device based on a multilingual pre-trained large model according to an exemplary embodiment; the device is used for the pseudo-language family clustering method based on a multilingual pre-trained large model. Referring to Figure 4, the device includes a language pool module 410, a representation module 420, a similarity calculation module 430, and a clustering module 440. For ease of explanation, Figure 4 shows only the main components of the clustering device 400:

the language pool module 410, configured to establish a shared language pool;

the representation module 420, configured to compute, based on the multilingual pre-trained large model, the Fisher information matrix of each language pair in the shared language pool and obtain the representation of each language pair;

the similarity calculation module 430, configured to compute the similarity between language pairs from the representations and obtain similarity values;

the clustering module 440, configured to rank the language pairs by similarity value, select the auxiliary language pairs that satisfy a preset boundary value, and complete pseudo-language family clustering based on the multilingual pre-trained large model.

Optionally, the language pool module 410 is configured to obtain the TED dataset;

and to extract multiple languages from the TED dataset and use the language pairs that translate these languages into English as the base dataset to establish the shared language pool.

Optionally, the representation module 420 is configured to obtain the parallel corpus corresponding to each language pair in the shared language pool and divide the data in the parallel corpus equally into j mini-batches;

feed the mini-batches into the multilingual pre-trained large model in turn and output the Fisher information matrix of each mini-batch;

after one input epoch, compute the average Fisher information matrix over the mini-batches and use the average as the estimate to obtain the Fisher information weights;

and characterize the distribution of the corresponding language pair in the shared language pool according to the Fisher information weights.

Optionally, the similarity calculation module 430 is configured to obtain the representations;

and use the mean square error method to compute the distance between each language pair in the shared language pool and the target language pair, where a smaller distance means a higher similarity.

Optionally, the similarity calculation module 430 is configured to use the Fisher information matrices to compute the KL divergence between each language pair in the shared language pool and the target language pair, obtaining the distance between them, where a smaller distance means a higher similarity.

Optionally, the similarity calculation module 430 is configured to create a Fisher information mask by selecting the top-K parameters and assigning them the value 1 while assigning the remaining parameters the value 0;

and compute the distance between each language pair in the shared language pool and the target language pair from the number of simultaneously activated parameters and the number of parameters activated in the target direction, where a smaller distance means a higher similarity.

Optionally, the clustering module 440 is configured to compute the similarity between all language pairs by traversal;

sort the language pairs in descending order of similarity;

preset an initial search radius and delineate the boundary range according to the initial search radius;

integrate the closest language pairs within the boundary into the auxiliary language list;

update the search radius according to the similarity between the most recently added language pair and the target language pair;

and repeat the radius update until no new language pair is added, obtaining the clustered pseudo-language family and completing pseudo-language family clustering based on the multilingual pre-trained large model.

Aiming at the limitations of the prior art, which requires additional prior knowledge or modification of the model architecture, the present invention provides a more effective language pair clustering method for multilingual co-training. The core goal is to use the capability of multilingual pre-training itself to represent language pairs, to select and cluster auxiliary languages more effectively and improve their generalization across different models and datasets, and ultimately to improve the translation quality of low-resource language pairs under multilingual co-training.

Figure 5 is a schematic structural diagram of a pseudo-language family clustering device based on a multilingual pre-trained large model provided by an embodiment of the present invention. As shown in Figure 5, the pseudo-language family clustering device may include the pseudo-language family clustering apparatus shown in Figure 4 above. Optionally, the pseudo-language family clustering device 510 may include a processor 2001.

Optionally, the pseudo-language family clustering device 510 may further include a memory 2002 and a transceiver 2003.

The processor 2001 may be connected to the memory 2002 and the transceiver 2003, for example, through a communication bus.

The components of the pseudo-language family clustering device 510 are described in detail below with reference to Figure 5:

The processor 2001 is the control center of the pseudo-language family clustering device 510 and may be a single processor or a collective name for multiple processing elements. For example, the processor 2001 may be one or more central processing units (CPUs), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, for example one or more digital signal processors (DSPs) or one or more field programmable gate arrays (FPGAs).

Optionally, the processor 2001 may perform various functions of the pseudo-language family clustering device 510 by running or executing software programs stored in the memory 2002 and calling data stored in the memory 2002.

In a specific implementation, as an embodiment, the processor 2001 may include one or more CPUs, such as CPU0 and CPU1 shown in Figure 5.

In a specific implementation, as an embodiment, the pseudo-language family clustering device 510 may also include multiple processors, such as the processor 2001 and the processor 2004 shown in Figure 5. Each of these processors may be a single-core processor (single-CPU) or a multi-core processor (multi-CPU). A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).

The memory 2002 is configured to store the software program for executing the solution of the present invention, and its execution is controlled by the processor 2001; for specific implementations, refer to the above method embodiments, which are not repeated here.

可选地,存储器2002可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)或者可存储信息和指令的其他类型的动态存储设备,也可以是电可擦可编程只读存储器(electrically erasable programmable read-only memory,EEPROM)、只读光盘(compactdisc read-only memory,CD-ROM)或其他光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。存储器2002可以和处理器2001集成在一起,也可以独立存在,并通过基于多语言预训练大模型的伪语言族聚类设备510的接口电路(图5中未示出)与处理器2001耦合,本发明实施例对此不作具体限定。Optionally, the memory 2002 may be a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (random access memory, RAM) or a random access memory (RAM) that can store information and instructions. Other types of dynamic storage devices for instructions can also be electrically erasable programmable read-only memory (EEPROM), compactdisc read-only memory (CD-ROM) or other optical disk storage , optical disc storage (including compressed optical disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or can be used to carry or store the desired program code in the form of instructions or data structures and Any other media capable of being accessed by a computer, without limitation. The memory 2002 can be integrated with the processor 2001, or can exist independently, and is coupled to the processor 2001 through the interface circuit (not shown in Figure 5) of the pseudo-language family clustering device 510 based on the multi-language pre-trained large model, The embodiment of the present invention does not specifically limit this.

收发器2003,用于与网络设备通信,或者与终端设备通信。Transceiver 2003 is used to communicate with network equipment or with terminal equipment.

可选地,收发器2003可以包括接收器和发送器(图5中未单独示出)。其中,接收器用于实现接收功能,发送器用于实现发送功能。Optionally, the transceiver 2003 may include a receiver and a transmitter (not shown separately in Figure 5). Among them, the receiver is used to implement the receiving function, and the transmitter is used to implement the sending function.

可选地,收发器2003可以和处理器2001集成在一起,也可以独立存在,并通过基于多语言预训练大模型的伪语言族聚类设备510的接口电路(图5中未示出)与处理器2001耦合,本发明实施例对此不作具体限定。Optionally, the transceiver 2003 can be integrated with the processor 2001, or can exist independently, and communicate with the transceiver 2003 through the interface circuit (not shown in Figure 5) of the pseudo-language family clustering device 510 based on the multi-language pre-trained large model. The processor 2001 is coupled, which is not specifically limited in the embodiment of the present invention.

需要说明的是,图5中示出的基于多语言预训练大模型的伪语言族聚类设备510的结构并不构成对该路由器的限定,实际的知识结构识别设备可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。It should be noted that the structure of the pseudo-language family clustering device 510 based on the multi-language pre-trained large model shown in Figure 5 does not constitute a limitation on the router. The actual knowledge structure recognition device may include more than what is shown in the figure. Or fewer parts, or a combination of certain parts, or a different arrangement of parts.

In addition, for the technical effects of the pseudo-language family clustering device 510 based on the multi-language pre-trained large model, reference may be made to the technical effects of the pseudo-language family clustering method based on the multi-language pre-trained large model described in the foregoing method embodiments, and details are not repeated here.

It should be understood that the processor 2001 in the embodiments of the present invention may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.

It should also be understood that the memory in the embodiments of the present invention may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example rather than limitation, many forms of RAM are available, such as static random access memory (static RAM, SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and direct rambus random access memory (direct rambus RAM, DR RAM).

The above embodiments may be implemented in whole or in part by software, hardware (such as circuits), firmware, or any combination thereof. When implemented using software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or a data center that contains one or more sets of available media. The available media may be magnetic media (for example, floppy disks, hard disks, or magnetic tapes), optical media (for example, DVDs), or semiconductor media. The semiconductor medium may be a solid-state drive.

It should be understood that the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may represent three cases: A alone, both A and B, and B alone, where A and B may be singular or plural. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects, but may also indicate an "and/or" relationship; the specific meaning can be understood from the context.

In the present invention, "at least one" means one or more, and "a plurality of" means two or more. "At least one of the following" or a similar expression refers to any combination of the listed items, including any combination of a single item or multiple items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where each of a, b, and c may be single or multiple.

It should be understood that, in the various embodiments of the present invention, the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and shall not constitute any limitation on the implementation of the embodiments of the present invention.

Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.

Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.

In the several embodiments provided by the present invention, it should be understood that the disclosed devices, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of the units is only a logical functional division, and there may be other divisions in actual implementation: multiple units or components may be combined or integrated into another device, or some features may be omitted or not performed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage media include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed by the present invention, and such changes or substitutions shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A pseudo language family clustering method based on a multilingual pre-training large model, the method comprising:
S1, establishing a shared language pool;
S2, calculating a Fisher information matrix of the language pairs in the shared language pool based on the multilingual pre-training large model, and obtaining a characterization result of the language pairs in the shared language pool;
S3, calculating the similarity between the language pairs according to the characterization result to obtain a similarity value;
S4, sorting the similarity among the language pairs according to the similarity value, selecting the auxiliary language pairs conforming to a preset boundary value, and completing the pseudo language family clustering based on the multilingual pre-training large model.
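For orientation, the four steps above can be summarized in a minimal Python sketch; it is a reading aid, not the claimed method, and every helper named here (build_shared_language_pool, fisher_information, similarity, radius_search) is a hypothetical placeholder elaborated after the corresponding dependent claims below.

```python
# Hypothetical end-to-end sketch of steps S1-S4; the helper functions
# are illustrative placeholders, not the claimed implementation.
def cluster_pseudo_language_family(model, target_pair, initial_radius):
    pool = build_shared_language_pool()                             # S1 (claim 2)
    fisher = {p: fisher_information(model, pool[p]) for p in pool}  # S2 (claim 3)
    sims = {p: similarity(fisher[p], fisher[target_pair])           # S3 (claims 4-6)
            for p in pool if p != target_pair}
    return radius_search(sims, initial_radius)                      # S4 (claim 7)
```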
2. The method according to claim 1, wherein in the step S1, the step of establishing the shared language pool includes:
acquiring a TED data set;
extracting multiple languages from the TED data set, and forming language pairs translated into English to serve as a basic data set, thereby establishing the shared language pool.
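A minimal sketch of how such a pool could be assembled; the language list and the load_ted_corpus loader are assumptions introduced for illustration only.

```python
# Illustrative sketch of the shared language pool: each entry maps an
# X->English translation direction to its parallel TED corpus.
def load_ted_corpus(src, tgt):
    """Hypothetical loader: would return (source_sentence,
    english_reference) sentence pairs for the given direction."""
    raise NotImplementedError("dataset-specific loading goes here")

def build_shared_language_pool(langs=("de", "fr", "es", "ru", "zh")):
    pool = {}
    for lang in langs:
        pool[f"{lang}-en"] = load_ted_corpus(src=lang, tgt="en")
    return pool
```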
3. The method according to claim 1, wherein in the step S2, based on the multilingual pre-training large model, a Fisher information matrix of the language pairs in the shared language pool is calculated, and the characterization result of the language pairs in the shared language pool is obtained, including:
acquiring a parallel corpus corresponding to the languages in the shared language pool, and equally dividing the data in the parallel corpus into j mini-batch data sets;
sequentially inputting the mini-batch data sets into the multilingual pre-training large model, and outputting a Fisher information matrix for each mini-batch data set;
after one full input round, calculating the average Fisher information matrix over the mini-batch data sets, and taking the average Fisher information matrix as an estimate to obtain the Fisher information weights;
and characterizing the distribution of the corresponding language pairs in the shared language pool according to the Fisher information weights.
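A common way to realize this step is the diagonal empirical Fisher approximation: accumulate squared gradients of the translation loss over the j mini-batches and average after one full round. The sketch below assumes a Hugging-Face-style model whose forward pass returns an object with a .loss field; that interface is an assumption, not stated in the claim.

```python
import torch

def fisher_information(model, minibatches):
    """Sketch: diagonal empirical Fisher F ~ E[(dL/dtheta)^2],
    averaged over j mini-batches after one full input round."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()
              if p.requires_grad}
    for batch in minibatches:                 # the j mini-batch data sets
        model.zero_grad()
        loss = model(**batch).loss            # assumes an HF-style forward
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    # the per-parameter average is the Fisher information weight that
    # characterizes the language pair's distribution in the pool
    return {n: f / max(len(minibatches), 1) for n, f in fisher.items()}
```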
4. The method according to claim 3, wherein in the step S3, the step of calculating the similarity between the language pairs according to the characterization result to obtain a similarity value includes:
obtaining the characterization result; selecting a target language pair;
and calculating the distance between each language pair in the shared language pool and the target language pair by using the mean squared error; the smaller the distance, the higher the similarity.
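Under the characterization above, the mean-squared-error variant reduces to an elementwise comparison of two Fisher weight sets; a minimal sketch, assuming both characterizations cover the same parameter names:

```python
import torch

def mse_distance(fisher_a, fisher_b):
    """Mean squared error between two Fisher characterizations;
    the smaller the distance, the higher the similarity."""
    sq_sum, count = 0.0, 0
    for name, fa in fisher_a.items():
        diff = fa - fisher_b[name]
        sq_sum += diff.pow(2).sum().item()
        count += diff.numel()
    return sq_sum / count
```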
5. The method according to claim 3, wherein in the step S3, the step of calculating the similarity between the language pairs according to the characterization result to obtain a similarity value includes:
selecting a target language pair;
and calculating the KL divergence between the language pairs in the shared language pool and the target language pair by using the Fisher information matrices to obtain the distances between the language pairs and the target language pair; the smaller the distance, the higher the similarity.
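One plausible implementation of the KL variant flattens each Fisher matrix and normalizes it into a distribution over parameters before computing the divergence; the normalization step is an assumption, not stated in the claim.

```python
import torch

def kl_distance(fisher_a, fisher_b, eps=1e-12):
    """Sketch: KL(a || b) between Fisher weight vectors normalized
    into distributions; smaller divergence means higher similarity."""
    a = torch.cat([f.flatten() for f in fisher_a.values()])
    b = torch.cat([f.flatten() for f in fisher_b.values()])
    a = a / (a.sum() + eps)                   # normalize to a distribution
    b = b / (b.sum() + eps)
    return torch.sum(a * (torch.log(a + eps) - torch.log(b + eps))).item()
```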
6. The method according to claim 3, wherein in the step S3, the step of calculating the similarity between the language pairs according to the characterization result to obtain a similarity value includes:
selecting a target language pair;
selecting the top-K parameters and assigning them a value of 1, and assigning a value of 0 to the remaining parameters, to create a Fisher information mask;
and calculating the distance between each language pair in the shared language pool and the target language pair according to the number of parameters activated simultaneously and the number of parameters activated in the target direction; the smaller the distance, the higher the similarity.
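A sketch of the mask variant: the top-K Fisher weights are set to 1 and the rest to 0, and a language pair is scored by how many parameters it activates jointly with the target direction. Using one minus the overlap ratio as the distance is an assumption; the claim only fixes the two activation counts.

```python
import torch

def mask_distance(fisher_a, fisher_b, k):
    """Sketch of the Fisher information mask comparison of claim 6."""
    def top_k_mask(fisher):
        flat = torch.cat([f.flatten() for f in fisher.values()])
        mask = torch.zeros_like(flat)
        mask[torch.topk(flat, min(k, flat.numel())).indices] = 1.0  # top-K -> 1
        return mask
    m_a, m_target = top_k_mask(fisher_a), top_k_mask(fisher_b)
    jointly_active = (m_a * m_target).sum()   # parameters activated simultaneously
    # distance = 1 - overlap ratio (assumed); smaller means more similar
    return 1.0 - (jointly_active / m_target.sum()).item()
```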
7. The method according to claim 4, 5 or 6, wherein in the step S4, sorting the similarity between the language pairs according to the similarity value, selecting the auxiliary language pairs conforming to a preset boundary value, and completing the pseudo language family clustering based on the multilingual pre-training large model include:
traversing and calculating the similarity between all language pairs;
sorting in descending order according to the similarity between the language pairs;
presetting an initial search radius, and defining a boundary range according to the initial search radius;
integrating the nearest language pair within the boundary range into an auxiliary language list;
updating the search radius according to the similarity between the most recently added language pair and the target language pair;
and repeatedly updating the search radius until no new language pairs are added, obtaining the clustered pseudo language family, and completing the pseudo language family clustering based on the multilingual pre-training large model.
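One consistent reading of this search procedure is sketched below; the exact radius-update rule (a boundary tied to the most recently admitted pair's similarity) is an assumption filled in to make the loop terminate as the claim describes.

```python
def radius_search(similarities, initial_radius):
    """Sketch of claim 7: greedily admit the nearest language pairs,
    moving the boundary with each newly added pair until no new
    pair falls inside the boundary range."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return []
    family = [ranked[0][0]]                    # the nearest pair always enters
    boundary = ranked[0][1] - initial_radius   # boundary from the initial radius
    for pair, sim in ranked[1:]:               # descending similarity
        if sim < boundary:                     # outside the boundary range: stop
            break
        family.append(pair)
        boundary = sim - initial_radius        # update from the newest member
    return family
```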
8. A pseudo-language family clustering device based on a multilingual pre-training large model, the device comprising:
the language pool module is used for establishing a shared language pool;
the characterization module is used for calculating a Fisher information matrix of the language pairs in the shared language pool based on the multilingual pre-training large model to obtain a characterization result of the language pairs in the shared language pool;
The similarity calculation module is used for calculating the similarity between the language pairs according to the characterization result to obtain a similarity value;
and the clustering module is used for sequencing the similarity among the language pairs according to the similarity value, selecting auxiliary language pairs conforming to the boundary value according to a preset boundary value, and completing pseudo language family clustering based on the multilingual pre-training large model.
9. A pseudo language family clustering device based on a multilingual pre-training large model, the pseudo language family clustering device based on the multilingual pre-training large model comprising:
a processor;
a memory having stored thereon computer readable instructions which, when executed by the processor, implement the method of any of claims 1 to 7.
10. A computer readable storage medium having stored therein program code which is callable by a processor to perform the method of any one of claims 1 to 7.
CN202311653724.1A 2023-12-04 2023-12-04 Pseudo language family clustering method and device based on multilingual pre-training large model Active CN117688176B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311653724.1A CN117688176B (en) 2023-12-04 2023-12-04 Pseudo language family clustering method and device based on multilingual pre-training large model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311653724.1A CN117688176B (en) 2023-12-04 2023-12-04 Pseudo language family clustering method and device based on multilingual pre-training large model

Publications (2)

Publication Number Publication Date
CN117688176A true CN117688176A (en) 2024-03-12
CN117688176B CN117688176B (en) 2024-09-24

Family

ID=90134409

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311653724.1A Active CN117688176B (en) 2023-12-04 2023-12-04 Pseudo language family clustering method and device based on multilingual pre-training large model

Country Status (1)

Country Link
CN (1) CN117688176B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2007217900A1 (en) * 2006-02-17 2007-08-30 Google Llc Encoding and adaptive, scalable accessing of distributed models
US20210174019A1 (en) * 2019-12-10 2021-06-10 Beijing Xiaomi Mobile Software Co., Ltd. Method, device and storage medium for training machine translation model
CN112257468A (en) * 2020-11-03 2021-01-22 沈阳雅译网络技术有限公司 Method for improving translation performance of multi-language neural machine
CN114048760A (en) * 2021-09-27 2022-02-15 中国科学院自动化研究所 Multi-language machine translation model training method, multi-language translation method and device
CN116957070A (en) * 2023-03-31 2023-10-27 腾讯科技(深圳)有限公司 Multitasking training method and device, storage medium and electronic equipment
CN117094334A (en) * 2023-08-21 2023-11-21 腾讯科技(深圳)有限公司 Data processing method, device and equipment based on large language model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XUE Qingtian; LI Junhui; GONG Zhengxian: "Multilingual Unsupervised Neural Machine Translation", Journal of Xiamen University (Natural Science), no. 02, 23 March 2020 (2020-03-23), pages 50-55 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119323264A (en) * 2024-12-19 2025-01-17 中国科学技术大学 Large language model reasoning optimization method, system, equipment and storage medium
CN119323264B (en) * 2024-12-19 2025-03-25 中国科学技术大学 Large language model reasoning optimization method, system, device and storage medium

Also Published As

Publication number Publication date
CN117688176B (en) 2024-09-24

Similar Documents

Publication Publication Date Title
US11900056B2 (en) Stylistic text rewriting for a target author
US11604956B2 (en) Sequence-to-sequence prediction using a neural network model
US11373047B2 (en) Method, system, and computer program for artificial intelligence answer
WO2020182122A1 (en) Text matching model generation method and device
JP7653419B2 (en) Detecting irrelevant utterances in chatbot systems
US20200233927A1 (en) Context-based translation retrieval via multilingual space
CN112384909B (en) Method and system for improving text-to-content recommendations using unsupervised learning
US11922947B2 (en) Systems and methods for configuring and using an audio transcript correction machine learning model
CN112084789A (en) Text processing method, device, equipment and storage medium
WO2022001724A1 (en) Data processing method and device
CN113924560B (en) Understanding query intent for medical artificial intelligence systems using semi-supervised deep learning
US11990134B2 (en) Method for configuring and using a numeric-to-alphabetic expression machine learning model
CN116011470A (en) Translation, adversarial sample generation, model robustness enhancement method and related devices
CN117688176A (en) A pseudo-language family clustering method and device based on multi-language pre-trained large model
CN111581344B (en) Interface information auditing method and device, computer equipment and storage medium
US12182498B1 (en) Redacting portions of text transcriptions generated from inverse text normalization
EP3457397B1 (en) Oos sentence generating method and apparatus
CN115376502A (en) Semantic understanding method, device, electronic device and readable storage medium
EP4495798A1 (en) Disambiguity in large language models
CN113282777A (en) Model training method and device, electronic equipment and storage medium
WO2021027257A1 (en) Computer-executed method and device using neural network for language processing
CN112241786B (en) Determination method and device for model super-parameters, computing device and medium
KR20240093221A (en) Model constructing method for image-text retrieval model and service apparatus
US20250086396A1 (en) Text processing method and related apparatus
Liu et al. Deep code search efficiency based on clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant