WO2023211525A1 - Establishing a language model adapted to a cross-lingual sequence labeling task - Google Patents

Establishing a language model adapted to a cross-lingual sequence labeling task Download PDF

Info

Publication number
WO2023211525A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
representation
language model
token
language
Prior art date
Application number
PCT/US2023/012468
Other languages
French (fr)
Inventor
Ming GONG
Linjun SHOU
Daxin Jiang
Original Assignee
Microsoft Technology Licensing, Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing, Llc
Publication of WO2023211525A1

Classifications

    • G06F40/295: Named entity recognition
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06F40/117: Tagging; Marking up; Designating a block; Setting of attributes
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/242: Dictionaries
    • G06F40/279: Recognition of textual entities
    • G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06N3/047: Probabilistic or stochastic networks
    • G06N3/0895: Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G06N3/096: Transfer learning
    • G06N3/0455: Auto-encoder networks; Encoder-decoder networks

Definitions

  • a pre-trained language model (PLM) may be deployed to various downstream tasks, and dominates natural language understanding and generation areas.
  • a PLM may be extended to a cross-lingual pre-trained language model (xPLM).
  • the xPLM may be pre-trained with token-level pre-training tasks on a large multi-lingual corpus.
  • the pre-trained xPLM may be transferred or deployed to downstream tasks.
  • Embodiments of the present disclosure propose methods, apparatuses, computer program products and computer-readable mediums for establishing a language model adapted to a cross-lingual sequence labeling task.
  • a training sentence pair including a first sentence in a first language and a second sentence in a second language may be obtained, the second sentence being a version of the first sentence in the second language.
  • At least one original span in the first sentence may be masked with at least one predefined token, to obtain the masked first sentence.
  • An input sequence for the language model may be formed with at least the masked first sentence and the second sentence.
  • An input sequence representation of the input sequence may be generated through the language model.
  • Target span prediction may be performed based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively.
  • the language model may be optimized based at least on the target span prediction.
  • FIG.1 illustrates an exemplary process of pre-training and fine-tuning a language model according to an embodiment.
  • FIG.2 illustrates an exemplary process of a Cross-lingual Language Informative Span Masking (CLISM) strategy in a pre-training stage according to an embodiment.
  • FIG.3 illustrates an example of forming an input sequence in a CLISM strategy according to an embodiment.
  • FIG.4 illustrates an exemplary process of a ContrAstive-Consistency Regularization (CACR) strategy in a pre-training stage according to an embodiment.
  • FIG.5 illustrates a flowchart of an exemplary method for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
  • FIG.6 illustrates an exemplary apparatus for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
  • FIG.7 illustrates an exemplary apparatus for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
  • a Sequence Labeling (SL) task is a task that targets span extraction, which may include, e.g., named entity recognition (NER), machine reading comprehension (MRC), question-answering, relation recognition, event extraction, etc.
  • a cross-lingual sequence labeling (xSL) task aims to extend the SL task to be performed on different languages.
  • the xSL task may require extending the boundary of the SL task to low-resource languages, which will face the challenge of limited training data for the low-resource languages.
  • An xPLM may be transferred or deployed to an xSL task, and may show a certain degree of effectiveness in the xSL task by, e.g., transferring knowledge from a high-resource language to low-resource languages.
  • when transferring a pre-trained xPLM to an xSL task in a fine-tuning stage, a discrepancy or gap in training objectives between the pre-training stage and the fine-tuning stage may arise, which makes the xPLM not well adapted to the xSL task.
  • the language model may be trained with a pre-training task such as masked language modeling (MLM), etc., wherein a training objective of the MLM requires local understanding of a masked token.
  • in the fine-tuning stage, the pre-trained language model is typically trained with distantly supervised multi-lingual task-related instances, wherein a training objective of span extraction of the xSL task requires global understanding and reasoning over, e.g., an input question and passage. This will lead to a gap between the training objective in the pre-training stage and the training objective in the fine-tuning stage, and will further lead to the resulting xPLM not being well adapted to the xSL task.
  • the embodiments of the present disclosure aim to establish a language model adapted to an xSL task.
  • the embodiments of the present disclosure may enable an xPLM to obtain characteristics adapted to the xSL task in a pre-training stage, and thus a pre-trained language model may obtain better performance for the xSL task in a fine-tuning stage.
  • the embodiments of the present disclosure propose a pre-training strategy customized for an xSL task in a pre-training stage, which may be referred to as a Cross-lingual Language Informative Span Masking (CLISM) strategy or a CLISM task.
  • the CLISM strategy eliminates a training objective gap between the pre-training stage and a fine-tuning stage for the xSL in a self-supervised approach.
  • the embodiments of the present disclosure propose a strategy for enhancing alignment capability in a pre-training stage, which may be referred to as a ContrAstive-Consistency Regularization (CACR) strategy.
  • the CACR strategy may encourage an xPLM to better capture alignment between cross-lingual representations.
  • the CACR strategy may leverage contrastive learning to encourage consistency between representations of input parallel sequences.
  • the embodiments of the present disclosure may not only eliminate a gap between an objective of pre-training and an objective of fine-tuning, but may also enhance the capability of a language model to better capture alignment between cross-lingual representations at the sentence level. According to the embodiments of the present disclosure, even with limited training data in the pre-training stage, an established language model is able to achieve better performance.
  • the language model established according to the embodiments of the present disclosure may achieve good applicability in various xSL tasks with limited training data. For example, even in the case of few-shot data settings where only a few training instances are available or zero-shot data settings where no training instance is available, the language model established according to the embodiments of the present disclosure is still able to achieve good applicability.
  • FIG.1 illustrates an exemplary process 100 of pre-training and fine-tuning a language model according to an embodiment.
  • the process 100 is performed for pre-training and fine-tuning a language model 110 in order to apply the language model 110 to an xSL task.
  • the language model 110 may be trained with a training dataset 102 which is based on multi-lingual parallel corpus.
  • the parallel corpus may be divided into multiple sub-groups.
  • Each sub-group may be referred to as a language-informative group.
  • each sub-group may include two parallel versions of the same sentence in two different languages, e.g., a sentence in a first language and a sentence in a second language.
  • a sentence in a first language will be referred to as a source language sentence
  • a sentence in a second language will be referred to as a target language sentence.
  • a target language sentence may be a version or translation of a source language sentence in a target language, or a source language sentence may be a version or translation of a target language sentence in a source language. Since a source language sentence and a target language sentence in each sub-group are in parallel, there will be multiple meaning-aligned spans or tokens between these two sentences.
  • a token may correspond to a word or a part of a word, and a span may include one or more tokens.
  • a source language sentence and a target language sentence in each sub-group form a training sentence pair.
  • a CLISM strategy 120 may be adopted in the pre-training stage.
  • the CLISM strategy may eliminate a training objective gap between the pre-training stage and a fine-tuning stage for xSL.
  • the CLISM strategy may also be referred to as a CLISM task, which is a pre-training task customized for an xSL task.
  • the CLISM task is a self-supervised task. Taking an xSL task of cross-lingual MRC (xMRC) which is similar to question-answering as an example, one focus of the CLISM task is how to create multi-lingual <question, answer> training pairs.
  • the CLISM task is not limited to this, but may be customized for any other type of xSL task in a similar approach.
  • each original span may be an n-ary span, which includes n tokens.
  • An original span may be, e.g., a named entity, a phrase, etc. It should be understood that the embodiments of the present disclosure are not limited to any specific type of original span.
  • a masked source language sentence may be obtained through performing masking on the source language sentence.
  • An input sequence for the language model 110 may be formed with at least the masked source language sentence and the corresponding unmasked target language sentence.
  • one or more selected original spans in the source language sentence may be masked by one or more predefined tokens, respectively.
  • a predefined token may be, e.g., [QUE], etc. It should be understood that although a predefined token is exemplarily expressed as [QUE] hereinafter, the embodiments of the present disclosure are not limited to any specific expressions of a predefined token, e.g., a predefined token may also be in any other expressions than [QUE]. Then, the language model 110 may be required to find the correct start position and end position of a target span corresponding to each predefined token from the target language sentence based on global contextual understanding of the masked source language sentence and the target language sentence.
  • each target span has the same meaning as the corresponding original span masked by the predefined token. Accordingly, each predefined token [QUE] may be regarded as, e.g., a question in the xMRC task, and the target span, in the target language sentence, corresponding to this predefined token [QUE] may be regarded as, e.g., an answer in the xMRC task.
  • the language model 110 needs to predict whether each token in the target language sentence is a start position of an answer (i.e., a start position token) or an end position of an answer (i.e., an end position token). This is essentially consistent with the xSL task, because span extraction in the xSL task also aims to find a start position and an end position of a target span. Thus, through the CLISM strategy, a training objective of the pre-training stage will be close to or consistent with a training objective of the subsequent fine-tuning stage for the xSL task. Thus, the pre-trained language model 110 will be more adapted to the xSL task.
  • a CACR strategy 130 may be adopted in the pre-training stage.
  • the CACR strategy may enable the language model 110 to better capture alignment of the same sentence between different languages, and avoid learning representations that are affected by noisy data.
  • noisy data may include, e.g., a predefined token [QUE] used in masking, two sentences in a training sentence pair that are not fully semantically paired, etc.
  • the CACR strategy may, through contrastive learning, make representations of sentences originating from the same training sentence pair as close as possible in a latent space.
  • a source language sentence, a masked source language sentence and a target language sentence originating from the same training sentence pair may be taken as parallel sequences, and the CACR strategy may utilize contrastive learning to encourage the language model 110 to achieve consistency among representations of the multiple sentences in the parallel sequences.
  • the CACR strategy may, through contrastive learning, make representations of sentences originating from different training sentence pairs as far as possible in a latent space.
  • the CACR strategy may utilize contrastive learning to encourage the language model 110 to implement distinctiveness between representations of a source language sentence, a masked source language sentence and a target language sentence originating from a first training sentence pair and representations of a source language sentence, a masked source language sentence and a target language sentence originating from a second training sentence pair.
  • the pre-trained language model 110 may be further trained with a training dataset 104.
  • the training dataset 104 may be specific to an xSL task 140.
  • the training dataset 104 may be a training dataset for the xMRC task. It should be understood that the embodiments of the present disclosure are not limited to any specific types of xSL task. Since a training objective gap between the pre-training stage and the fine-tuning stage for xSL is eliminated in the pre-training stage through, e.g., the CLISM strategy, etc., the pre-trained language model 110 will be more adapted to the xSL task. When performing fine-tuning for the xSL task on such pre-trained language model 110, a language model with higher applicability can be finally obtained.
  • a predefined token [QUE] may be added to an input sequence of the xSL task.
  • taking the xSL task being an MRC task as an example, the at least one predefined token [QUE] will capture enough information about the question, and a representation of the at least one predefined token [QUE] will facilitate selection of a ground-truth answer span in the xSL task.
  • any known fine-tuning process may be then applied to the pre-trained language model.
  • the resulting pre-trained language model will exhibit its performance advantages of being adapted to xSL tasks in any fine-tuning process.
  • FIG.2 illustrates an exemplary process 200 of a CLISM strategy in a pre-training stage according to an embodiment.
  • An exemplary training sentence pair 202 may be obtained from, e.g., the training dataset 102 in FIG.1.
  • the training sentence pair 202 may include a source language sentence as a first sentence and a target language sentence as a second sentence, wherein the source language sentence adopts a source language as a first language, and the target language sentence adopts a target language as a second language.
  • the target language sentence is a version of the source language sentence in the target language
  • the source language sentence is a version of the target language sentence in the source language.
  • the source language sentence may be represented as s_s.
  • the target language sentence may be represented as s_t.
  • At 210, at least one original span to be masked may be selected from the source language sentence.
  • taking the original span being a named entity as an example, a named entity span, as the original span, may be selected from the source language sentence through various named entity recognition tools. It should be understood that the embodiments of the present disclosure are neither limited to any specific type of original span, nor limited to any specific technique for selecting an original span.
  • the at least one original span selected or identified from the source language sentence s_s may form an original span set S.
  • the selecting operation at 210 may also follow some predetermined rules so that spans with semantic meanings may be selected. For example, spans which only contain stop words may be filtered out. For example, it may be specified that the boundaries of a selected span must fall on whole words. For example, a sequence length of each selected span should not exceed a predetermined maximum sequence length threshold, e.g., 10, etc.
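  • As a non-authoritative illustration of these selection rules, the following Python sketch applies the stop-word and maximum-length filters to candidate spans; the helper name select_informative_spans, the toy stop-word list, and the assumption that candidates already come from an external named entity recognition tool as whole-word spans are all illustrative assumptions rather than part of this disclosure.

```python
# Hypothetical sketch of the span-selection rules described above.
# Candidate spans (e.g., named entities) are assumed to already come from an
# external NER tool as whole-word spans, so the word-boundary rule holds by
# construction.
STOP_WORDS = {"the", "a", "an", "of", "to", "in"}   # toy stop-word list
MAX_SPAN_LEN = 10  # predetermined maximum sequence length threshold

def select_informative_spans(candidate_spans):
    """Keep candidate spans that carry semantic meaning."""
    selected = []
    for span in candidate_spans:
        if all(tok.lower() in STOP_WORDS for tok in span):
            continue  # filter out spans that only contain stop words
        if len(span) > MAX_SPAN_LEN:
            continue  # filter out spans that exceed the length threshold
        selected.append(span)
    return selected

print(select_informative_spans([["List"], ["of", "the"], ["Ottawa"]]))
# -> [['List'], ['Ottawa']]
```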
  • the selected at least one original span in the source language sentence may be masked by at least one predefined token, so as to obtain the masked source language sentence.
  • the masking operation may include: replacing the selected at least one original span with at least one predefined token. For example, for each original span in the original span set S, the original span may be replaced, in the source language sentence, by a single predefined token [QUE]. Through masking all original spans, the masked source language sentence may be obtained, which may be represented as s̃_s.
  • an input sequence for the language model may be formed with at least the masked source language sentence and the unmasked target language sentence.
  • a final input sequence X may be obtained through cascading the masked source language sentence s̃_s, the target language sentence s_t, and special tokens such as [CLS], [SEP], etc.
  • the input sequence X may be represented as X = [CLS] s̃_s [SEP] s_t [SEP], wherein [CLS] is a classification marker token and [SEP] is a sentence separation token. It should be understood that the embodiments of the present disclosure are not limited to any specific approach of utilizing the masked source language sentence and the target language sentence to form the input sequence, e.g., any special tokens may be omitted, any other special tokens may be added, etc.
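  • The masking and cascading described above can be sketched as follows. This is a minimal illustration that assumes whitespace tokenization and hypothetical helper names, and it uses the example sentences of FIG.3 below; an actual implementation would operate on the subword tokens of the underlying xPLM.

```python
# Hypothetical sketch: mask selected original spans in the source sentence
# with [QUE] and cascade the result with the target sentence as
# X = [CLS] masked_source [SEP] target [SEP].
def mask_spans(source_tokens, span_positions):
    """Replace each selected original span with a single [QUE] token.

    span_positions: list of (start, end) word indices, end exclusive.
    """
    masked = []
    i = 0
    span_end_by_start = {s: e for s, e in span_positions}
    while i < len(source_tokens):
        if i in span_end_by_start:
            masked.append("[QUE]")
            i = span_end_by_start[i]   # skip the whole original span
        else:
            masked.append(source_tokens[i])
            i += 1
    return masked

def build_input_sequence(masked_source, target_tokens):
    return ["[CLS]"] + masked_source + ["[SEP]"] + target_tokens + ["[SEP]"]

source = "List of Ottawa buildings".split()
target = "Danh sách các tòa nhà Ottawa".split()
masked = mask_spans(source, [(0, 1), (2, 3)])   # mask "List" and "Ottawa"
print(build_input_sequence(masked, target))
# -> ['[CLS]', '[QUE]', 'of', '[QUE]', 'buildings', '[SEP]',
#     'Danh', 'sách', 'các', 'tòa', 'nhà', 'Ottawa', '[SEP]']
```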
  • FIG.3 illustrates an example of forming an input sequence in a CLISM strategy according to an embodiment.
  • a training sentence pair includes a source language sentence 310 and a target language sentence 320.
  • the source language sentence 310 is an English sentence "List of Ottawa buildings".
  • the target language sentence 320 is a Vietnamese sentence "Danh sách các tòa nhà Ottawa".
  • an original span 312 "List” and an original span 314 "Ottawa” are selected from the source language sentence 310 according to the operation at 210 of FIG.2.
  • the original span 312 corresponds to or is aligned with a target span 322 "Danh sách" in the target language sentence 320, i.e., the target span 322 is the Vietnamese version of the original span 312.
  • the original span 314 corresponds to or is aligned with a target span 324 "Ottawa" in the target language sentence 320, and the target span 324 and the original span 314 have the same expression.
  • the original span 312 may be replaced by a predefined token [QUE] 332 and the original span 314 may be replaced by a predefined token [QUE] 334, thereby obtaining a masked source language sentence 330.
  • an input sequence 340 for the language model may be formed with at least the masked source language sentence 330 and the target language sentence 320.
  • the input sequence 340 is formed through sequentially cascading a special token [CLS] 342, the masked source language sentence 330, the special token [SEP] 344, the target language sentence 320 and a special token [SEP] 346.
  • the input sequence may be provided to a language model 240.
  • the language model 240 may generate an input sequence representation 242 of the input sequence.
  • the generation of the input sequence representation 242 may include generating a representation of each token in the input sequence. Since the input sequence X contains at least the masked source language sentence s̃_s and the target language sentence s_t, a representation of each token generated by the language model may be a global contextual semantic representation.
  • target span prediction may be performed based at least on the input sequence representation 242.
  • the target span prediction aims to predict a start position and an end position of at least one target span in the target language sentence, wherein the at least one target span corresponds to the at least one predefined token in the masked source language sentence respectively.
  • a start position and an end position of a target span, corresponding to the predefined token, in the target language sentence may be predicted.
  • a start position of a target span may be indicated by a token at the start position, therefore, prediction of a start position of a target span may refer to determining which token is a start position of the target span and thus serves as a start position token.
  • prediction of an end position of a target span may refer to determining which token is an end position of the target span and thus serves as an end position token.
  • target span start position probability distribution and target span end position probability distribution associated with the predefined token may be calculated based on token representations in the input sequence representation 242.
  • the target span start position probability distribution may indicate probabilities that different tokens serve as a start position of a target span.
  • the target span end position probability distribution may indicate probabilities that different tokens serve as an end position of a target span. Taking a specific token in the input sequence as an example, a start position probability that the token is a start position of a target span and an end position probability that the token is an end position of a target span may be calculated based at least on a representation of a predefined token and a representation of the specific token.
  • each predefined token [QUE] in the masked source language sentence may be regarded as a question Q, and the objective of the target span prediction is: for a given question Q corresponding to a given predefined token [QUE], a correct answer corresponding to an aligned target span in the target language sentence s_t is predicted based on the meaning of the input sequence X.
  • Multiple predefined tokens in the masked source language sentence may serve as a set of questions that need to be answered simultaneously.
  • a representation x_q of the question Q may be obtained from the input sequence representation 242. Then, a dynamic start vector s_q and a dynamic end vector e_q may be calculated according to:
    s_q = W_s x_q, e_q = W_e x_q    Equation (1)
    wherein W_s and W_e are learnable parameters.
  • the dynamic start vector s_q is a sub-representation derived from the representation x_q of the question Q for subsequent calculation of the target span start position probability distribution.
  • the dynamic end vector e_q is a sub-representation derived from the representation x_q of the question Q for subsequent calculation of the target span end position probability distribution.
  • an inner product of the learned vector s_q with each token representation in the input sequence representation of the input sequence X may be calculated according to Equation (2) to obtain the target span start position probability distribution, and an inner product of the learned vector e_q with each token representation in the input sequence representation of the input sequence X may be calculated according to Equation (3) to obtain the target span end position probability distribution, as follows:
    P_start(k | X, Q) = exp(s_q · x_k) / Σ_j exp(s_q · x_j)    Equation (2)
    P_end(k | X, Q) = exp(e_q · x_k) / Σ_j exp(e_q · x_j)    Equation (3)
  • Equation (2) calculates, for a token k at the k-th position in the input sequence X, a start position probability that the token k is a start position of a target span, and Equation (3) calculates, for a token k at the k-th position in the input sequence X, an end position probability that the token k is an end position of a target span.
  • the index j in Equation (2) and Equation (3) is an index for tokens in the input sequence X, and x_k denotes a representation of the token at the k-th position.
  • start position probabilities and end position probabilities are calculated for each token in the input sequence in the above description, optionally, start position probabilities and end position probabilities may also be calculated only for tokens in the target language sentence in the input sequence.
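  • The computation of Equations (1)-(3) can be sketched with NumPy as follows; the toy hidden size, the random values, and the variable names (x_q for a [QUE] representation, X_tok for all token representations) are illustrative assumptions.

```python
# Sketch of Equations (1)-(3): derive dynamic start/end vectors from the
# representation of a [QUE] token, then softmax inner products with every
# token representation to obtain start/end position distributions.
import numpy as np

d = 8                                   # hidden size (toy value)
rng = np.random.default_rng(0)
W_s = rng.normal(size=(d, d))           # learnable parameters (Equation 1)
W_e = rng.normal(size=(d, d))
x_q = rng.normal(size=d)                # representation of a predefined token [QUE]
X_tok = rng.normal(size=(12, d))        # representations of all tokens in X

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

s_q = W_s @ x_q                          # dynamic start vector (Equation 1)
e_q = W_e @ x_q                          # dynamic end vector   (Equation 1)

p_start = softmax(X_tok @ s_q)           # Equation (2): P(k is a start position)
p_end = softmax(X_tok @ e_q)             # Equation (3): P(k is an end position)
print(p_start.argmax(), p_end.argmax())  # most likely start/end token indices
```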
  • the language model 240 may be optimized through the performing of the target span prediction.
  • alignment may be performed between the source language sentence and the target language sentence.
  • alignment may be performed between spans, tokens or words in the source language sentence and spans, tokens or words in the target language sentence.
  • at least one target span in the target language sentence aligned with at least one selected original span in the source language sentence may at least be identified.
  • the alignment operation at 260 may be performed through any alignment tools or techniques.
  • the alignment operation at 260 may be used for determining a ground-truth target span, corresponding to a specific predefined token, in the target language sentence.
  • a target span aligned with the specific original span determined through the alignment operation at 260 may be used as a ground-truth target span corresponding to the specific predefined token.
  • the alignment operation at 260 may align the original span 312 with the target span 322; accordingly, in the case that the original span 312 is masked by the predefined token 332, a ground-truth target span corresponding to the predefined token 332 may be determined to be the target span 322.
  • the language model 240 may be optimized based at least on the target span prediction.
  • a start position probability of a ground-truth start position token of a ground-truth target span, corresponding to the predefined token, in the target language sentence may be maximized, and an end position probability of a ground-truth end position token of the ground-truth target span may be maximized.
  • the ground-truth start position token may refer to a token at a start position of the ground-truth target span
  • the ground-truth end position token may refer to a token at an end position of the ground-truth target span.
  • the optimization operation at 270 may be implemented through constructing a loss function and minimizing a loss calculated through the loss function.
  • a loss function for the CLISM strategy may be constructed as:
    ℒ_CLISM = - log P(StartIndex = a_i^s | X, Q) - log P(EndIndex = a_i^e | X, Q)    Equation (4)
    wherein a_i denotes a ground-truth target span corresponding to the i-th question Q in the target language sentence, a_i^s denotes a start position of the ground-truth target span a_i, and a_i^e denotes an end position of the ground-truth target span a_i.
  • the term - log P(StartIndex = a_i^s | X, Q) in Equation (4) represents a start position probability of a ground-truth start position token of the ground-truth target span, and the term - log P(EndIndex = a_i^e | X, Q) represents an end position probability of a ground-truth end position token of the ground-truth target span.
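  • A minimal sketch of the CLISM loss in Equation (4), assuming that the start and end position distributions have been computed as above and that the ground-truth start and end indices are already known from the alignment operation at 260; the toy distributions are illustrative.

```python
# Sketch of Equation (4): negative log-likelihood of the ground-truth
# start and end positions of the target span corresponding to one [QUE].
import numpy as np

def clism_loss(p_start, p_end, gt_start, gt_end):
    """p_start/p_end: position probability distributions over input tokens;
    gt_start/gt_end: indices of the ground-truth start/end position tokens."""
    return -np.log(p_start[gt_start]) - np.log(p_end[gt_end])

# toy example: 6-token input, ground-truth target span occupies tokens 3..4
p_start = np.array([0.05, 0.05, 0.1, 0.6, 0.1, 0.1])
p_end = np.array([0.05, 0.05, 0.1, 0.1, 0.6, 0.1])
print(clism_loss(p_start, p_end, gt_start=3, gt_end=4))  # ≈ 1.02
```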
  • the process 200 may be iteratively performed respectively for different training sentence pairs in the training dataset, so that the language model may be continuously pre-trained with different training sentence pairs.
  • the alignment operation at 260 does not have a specific performing order relative to other operations in the process 200, but only needs to be performed between the operation at 210 and the operation at 270.
  • all the above equations are exemplary and the embodiments of the present disclosure are not limited to any details of these equations, but may encompass any changes to these equations and any other equations for similar purposes.
  • FIG.4 illustrates an exemplary process 400 of a CACR strategy in a pre-training stage according to an embodiment.
  • the CACR strategy may help the language model to avoid learning representations affected by noisy data during pre-training, and may facilitate the language model to better capture alignment between representations of sentences in parallel sequences.
  • a source language sentence 402 and a target language sentence 406 may be from the same training sentence pair.
  • a masked source language sentence 404 may be obtained through performing, e.g., the masking operation at 220 in FIG.2 on the source language sentence 402.
  • a representation of the source language sentence 402, a representation of the masked source language sentence 404, and a representation of the target language sentence 406 may be obtained at least through a language model 410, and these representations are for further use in contrastive learning.
  • a hidden representation 412 of the source language sentence, a hidden representation 414 of the masked source language sentence, and a hidden representation 416 of the target language sentence may be obtained through the language model 410, respectively. Then, sequence lengths of these hidden representations may be unified by applying, e.g., an aggregation layer 420.
  • an aggregated representation 422 of the source language sentence may be obtained through applying the aggregation layer 420 to the hidden representation 412 of the source language sentence
  • an aggregated representation 424 of the masked source language sentence may be obtained through applying the aggregation layer 420 to the hidden representation 414 of the masked source language sentence
  • an aggregated representation 426 of the target language sentence may be obtained through applying the aggregation layer 420 to the hidden representation 416 of the target language sentence.
  • the aggregated representation 422, the aggregated representation 424, and the aggregated representation 426 may respectively serve as a representation of the source language sentence, a representation of the masked source language sentence, and a representation of the target language sentence that are finally obtained.
  • the hidden representation 414 of the masked source language sentence and the hidden representation 416 of the target language sentence may also be extracted directly from the input sequence representation.
  • the obtained representation of the source language sentence, the representation of the masked source language sentence, and the representation of the target language sentence may be used for contrastive learning.
  • any two of the representation of the source language sentence, the representation of the masked source language sentence and the representation of the target language sentence may be taken as a positive sample pair for contrastive learning.
  • the representation of the source language sentence, the representation of the masked source language sentence, and the representation of the target language sentence will be as close as possible in a latent space to achieve representation consistency.
  • the language model may be optimized based at least on the contrastive learning.
  • the language model in the optimization operation, may be optimized through minimizing latent space distance among a representation of a source language sentence, a representation of a masked source language sentence, and a representation of a target language sentence originating from the same training sentence pair.
  • the CACR strategy may also, through the contrastive learning, make representations of sentences originating from different training sentence pairs as far as possible in a latent space. For example, assuming that a source language sentence representation, a masked source language sentence representation and a target language sentence representation respectively corresponding to a source language sentence, a masked source language sentence and a target language sentence originating from a first training sentence pair are combined into a first representation set, and a source language sentence representation, a masked source language sentence representation and a target language sentence representation respectively corresponding to a source language sentence, a masked source language sentence and a target language sentence originating from a second training sentence pair are combined into a second representation set, the CACR strategy may make any one representation in the first representation set be as far away as possible from any one representation in the second representation set in a latent space. Accordingly, in the optimization operation, the language model may be optimized through maximizing a latent space distance between any one representation in the first representation set and any one representation in the second representation set.
  • an input sequence representation M of the input sequence may be obtained as follows:
    M = [M_cls; M_s̃; M_sep; M_t; M_sep] ∈ R^(l×d)    Equation (5)
    wherein l is the maximum input sequence length, d is a hidden size, M_s̃ ∈ R^(m×d) denotes a hidden representation of the masked source language sentence s̃_s, M_t ∈ R^(n×d) denotes a hidden representation of the target language sentence s_t, m denotes a sequence length of s̃_s, n denotes a sequence length of s_t, M_cls denotes a representation of the special token [CLS], and M_sep denotes a representation of the special token [SEP].
  • an extra aggregation layer A (e.g., mean-pooling, etc.) may be applied to obtain an aggregated representation r̃_s of the masked source language sentence and an aggregated representation r_t of the target language sentence s_t, as follows:
    r̃_s = A(M_s̃), r_t = A(M_t)    Equation (6)
    wherein r̃_s, r_t ∈ R^d.
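  • A minimal sketch of Equations (5)-(6), assuming mean-pooling as the aggregation layer; the toy sequence lengths and the slicing of M into the masked-source and target portions are illustrative assumptions.

```python
# Sketch of Equations (5)-(6): slice the hidden representations of the
# masked source sentence and the target sentence out of the input sequence
# representation M, then mean-pool each slice into a single vector.
import numpy as np

d = 8                                   # hidden size (toy value)
m, n = 5, 6                             # lengths of masked source / target sentences
rng = np.random.default_rng(0)
M = rng.normal(size=(1 + m + 1 + n + 1, d))   # [CLS] masked_source [SEP] target [SEP]

M_src = M[1:1 + m]                      # hidden representation of masked source
M_tgt = M[2 + m:2 + m + n]              # hidden representation of target sentence

def aggregate(H):
    """Aggregation layer A, here mean-pooling over the sequence dimension."""
    return H.mean(axis=0)

r_src = aggregate(M_src)                # aggregated representation of masked source
r_tgt = aggregate(M_tgt)                # aggregated representation of target
print(r_src.shape, r_tgt.shape)         # -> (8,) (8,)
```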
  • r̃_s may be used for forming a positive sample pair with r_t, while other data in a mini-batch may be used for forming negative sample pairs.
  • the language model may be optimized through constructing a loss function for r̃_s and r_t and minimizing a loss calculated by the loss function.
  • a loss function may be constructed based on, e.g., a standard contrastive learning objective, as follows:
    ℒ(r̃_s, r_t) = - log( exp(Ψ(r̃_s, r_t)/τ) / Σ_{r' ∈ B} exp(Ψ(r̃_s, r')/τ) )    Equation (7)
    wherein B is a mini-batch, τ is a temperature coefficient denoting a smoothing strategy, and Ψ(·) denotes a cosine similarity function.
  • the CACR strategy further introduces the unmasked source language sentence s_s as an input.
  • the language model may encode s_s to obtain a hidden representation M_s of the source language sentence, and then an aggregated representation r_s of the source language sentence may be obtained through applying the aggregation layer A to M_s.
  • the aggregated representation r_s of the source language sentence, the aggregated representation r̃_s of the masked source language sentence, and the aggregated representation r_t of the target language sentence may be formed into a positive threefold set (r_s, r̃_s, r_t). Every two representations in this positive threefold set are considered as a positive pair, i.e., it is desired that every two representations are as similar as possible in a latent space.
  • a final loss function ℒ_CACR for the CACR strategy may be:
    ℒ_CACR = ℒ(r̃_s, r_t) + ℒ(r_s, r̃_s) + ℒ(r_t, r_s)    Equation (8)
    wherein ℒ(r̃_s, r_t) is a loss function for r̃_s and r_t, ℒ(r_s, r̃_s) is a loss function for r_s and r̃_s, and ℒ(r_t, r_s) is a loss function for r_t and r_s, each constructed in a form similar to Equation (7).
  • the language model may be optimized through minimizing a loss calculated via the loss function ℒ_CACR. Through this approach, the language model is encouraged to learn representations that are not affected by noisy data, and is able to better capture alignment between cross-lingual representations, which may also help alleviate a training objective gap between the pre-training stage and the fine-tuning stage.
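  • A minimal sketch of Equations (7)-(8), assuming cosine similarity, a temperature of 0.1, and in-batch negatives; the function names and the toy representations are illustrative assumptions.

```python
# Sketch of Equations (7)-(8): a contrastive loss over a positive pair with
# in-batch negatives, summed over the three positive pairs formed from the
# source, masked-source and target sentence representations of one pair.
import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def contrastive_loss(anchor, positive, negatives, tau=0.1):
    """Equation (7): -log of the positive similarity normalized over the batch."""
    pos = np.exp(cos(anchor, positive) / tau)
    neg = sum(np.exp(cos(anchor, n) / tau) for n in negatives)
    return -np.log(pos / (pos + neg))

def cacr_loss(r_src, r_masked_src, r_tgt, negatives, tau=0.1):
    """Equation (8): sum of pairwise losses for the positive threefold set."""
    return (contrastive_loss(r_masked_src, r_tgt, negatives, tau)
            + contrastive_loss(r_src, r_masked_src, negatives, tau)
            + contrastive_loss(r_tgt, r_src, negatives, tau))

rng = np.random.default_rng(0)
r_s, r_ms, r_t = rng.normal(size=(3, 8))              # one training sentence pair
negatives = [rng.normal(size=8) for _ in range(7)]    # representations from other pairs
print(cacr_loss(r_s, r_ms, r_t, negatives))
```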
  • an existing pre-training task, e.g., an MLM task, may also be used for optimizing the language model.
  • the total training objective or total loss function ℒ of the language model may be defined as:
    ℒ = ℒ_CLISM + ℒ_CACR + ℒ_MLM    Equation (9)
    wherein ℒ_MLM is a loss function for the MLM task.
  • the language model may be optimized with the total loss function ℒ.
  • the embodiments of the present disclosure actually train a language model in a pre-training stage with a multi-task setting which may include, e.g., CLISM, CACR, MLM, etc. It should be understood that although Equation (9) considers the loss function ℒ_MLM of the MLM task, the embodiments of the present disclosure may also construct the total loss function ℒ with only CLISM and CACR.
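  • The multi-task objective of Equation (9) can be sketched as a simple sum of the individual losses; uniform (unweighted) summation and the optional omission of the MLM term are assumptions in this sketch.

```python
# Sketch of Equation (9): total pre-training loss as the sum of the CLISM,
# CACR and (optionally) MLM losses; uniform weighting is an assumption here.
def total_loss(l_clism, l_cacr, l_mlm=None):
    loss = l_clism + l_cacr
    if l_mlm is not None:          # the MLM term may be omitted
        loss = loss + l_mlm
    return loss

print(total_loss(1.02, 2.35, 0.87))  # -> 4.24
```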
  • fine-tuning for the xSL task may be performed on the language model in the subsequent fine-tuning stage.
  • the fine-tuning stage may be based on any known fine-tuning process.
  • the fine-tuning stage may include improvements proposed by the embodiments of the present disclosure.
  • a predefined token [QUE] may be added to an input sequence of the xSL task.
  • This processing approach may achieve better performance in the xSL task, especially in the case of few-shot data setting or zero-shot data setting.
  • an input sequence for the xMRC task may be:
    X = [CLS] question [QUE] [SEP] passage [SEP]    Equation (10)
  • since the predefined token [QUE] is taken as a question in the pre-training stage, after the pre-training stage, the predefined token [QUE] will be able to capture enough information about the question.
  • a representation of the predefined token [QUE] will facilitate selecting a correct answer span in the xMRC task.
  • Equation (10) only shows adding a single [QUE] to the input sequence, optionally, in the case that at least one predefined token [QUE] is used in the pre-training stage, at least one corresponding predefined token [QUE] may also be added to the input sequence for the xMRC task.
  • the above description gives an example in which the xSL task is an xMRC task, for any other types of xSL task, input sequences for these tasks may also be processed in a similar approach.
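  • A minimal sketch of adding the predefined token [QUE] to an xMRC fine-tuning input; the exact placement of [QUE] relative to the question and the passage, as well as the helper name build_xmrc_input, are illustrative assumptions.

```python
# Hypothetical sketch: build an xMRC fine-tuning input that reuses the
# predefined token [QUE] from pre-training.
def build_xmrc_input(question_tokens, passage_tokens, num_que=1):
    return (["[CLS]"] + question_tokens + ["[QUE]"] * num_que + ["[SEP]"]
            + passage_tokens + ["[SEP]"])

print(build_xmrc_input("Where is Ottawa ?".split(),
                       "Ottawa is the capital of Canada .".split()))
```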
  • FIG.5 illustrates a flowchart of an exemplary method 500 for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
  • a training sentence pair including a first sentence in a first language and a second sentence in a second language may be obtained, the second sentence being a version of the first sentence in the second language.
  • At 520, at least one original span in the first sentence may be masked with at least one predefined token, to obtain the masked first sentence.
  • an input sequence for the language model may be formed with at least the masked first sentence and the second sentence.
  • an input sequence representation of the input sequence may be generated through the language model.
  • target span prediction may be performed based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively.
  • the language model may be optimized based at least on the target span prediction.
  • the method 500 may further comprise: selecting the at least one original span from the first sentence.
  • the masking at least one original span may comprise: replacing the selected at least one original span with the at least one predefined token.
  • the target span prediction may comprise, for each predefined token in the at least one predefined token: calculating target span start position probability distribution and target span end position probability distribution associated with the predefined token based on token representations in the input sequence representation.
  • the calculating may comprise, for each token in the input sequence: calculating, based at least on a representation of the predefined token and a representation of the token, a start position probability that the token is a start position of a target span and an end position probability that the token is an end position of a target span.
  • the optimizing the language model based at least on the target span prediction may comprise optimizing the language model at least through: maximizing a start position probability of a ground-truth start position token of a ground-truth target span in the second sentence corresponding to the predefined token; and maximizing an end position probability of a ground-truth end position token of the ground-truth target span.
  • the method 500 may further comprise: determining the ground-truth target span corresponding to the predefined token through at least performing alignment between the first sentence and the second sentence.
  • the method 500 may further comprise: obtaining a representation of the first sentence, a representation of the masked first sentence and a representation of the second sentence through at least the language model; taking any two of the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence as a positive sample pair for contrastive learning; and optimizing the language model based at least on the contrastive learning.
  • the optimizing the language model based at least on the contrastive learning may comprise optimizing the language model at least through: minimizing latent space distances among the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence.
  • the method 500 may further comprise: optimizing the language model based at least on a masked language modeling task.
  • the method 500 may further comprise: fine-tuning the language model for the cross-lingual sequence labeling task.
  • the fine-tuning may comprise: adding the at least one predefined token to an input sequence for the cross-lingual sequence labeling task.
  • the method 500 may further comprise any steps/processes for establishing a language model adapted to a cross-lingual sequence labeling task according to the embodiments of the present disclosure described above.
  • FIG.6 illustrates an exemplary apparatus 600 for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
  • the apparatus 600 may comprise: a training sentence pair obtaining module 610, for obtaining a training sentence pair including a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language; a masking module 620, for masking at least one original span in the first sentence with at least one predefined token, to obtain the masked first sentence; an input sequence forming module 630, for forming an input sequence for the language model with at least the masked first sentence and the second sentence; an input sequence representation generating module 640, for generating an input sequence representation of the input sequence through the language model; a target span prediction performing module 650, for performing target span prediction based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively; and an optimizing module 660, for optimizing the language model based at least on the target span prediction.
  • the apparatus 600 may further comprise
  • FIG.7 illustrates an exemplary apparatus 700 for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
  • the apparatus 700 may comprise at least one processor 710.
  • the apparatus 700 may further comprise a memory 720 connected with at least one processor 710.
  • the memory 720 may store computer-executable instructions that, when executed, cause the at least one processor 710 to: obtain a training sentence pair including a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language; mask at least one original span in the first sentence with at least one predefined token, to obtain the masked first sentence; form an input sequence for the language model with at least the masked first sentence and the second sentence; generate an input sequence representation of the input sequence through the language model; perform target span prediction based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively; and optimize the language model based at least on the target span prediction.
  • the target span prediction may comprise, for each predefined token in the at least one predefined token: calculating target span start position probability distribution and target span end position probability distribution associated with the predefined token based on token representations in the input sequence representation.
  • the calculating may comprise, for each token in the input sequence: calculating, based at least on a representation of the predefined token and a representation of the token, a start position probability that the token is a start position of a target span and an end position probability that the token is an end position of a target span.
  • the optimizing the language model based at least on the target span prediction may comprise optimizing the language model at least through: maximizing a start position probability of a ground-truth start position token of a ground-truth target span in the second sentence corresponding to the predefined token; and maximizing an end position probability of a ground-truth end position token of the ground-truth target span.
  • the computer-executable instructions when executed, may further cause the at least one processor to: obtain a representation of the first sentence, a representation of the masked first sentence and a representation of the second sentence through at least the language model; take any two of the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence as a positive sample pair for contrastive learning; and optimize the language model based at least on the contrastive learning.
  • the optimizing the language model based at least on the contrastive learning may comprise optimizing the language model at least through: minimizing latent space distances among the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence.
  • the computer-executable instructions when executed, may further cause the at least one processor to: fine-tune the language model for the cross-lingual sequence labeling task.
  • the fine-tuning may comprise: adding the at least one predefined token to an input sequence for the cross-lingual sequence labeling task.
  • the at least one processor 710 may be further configured to perform any other steps/processes of the methods for establishing a language model adapted to a cross-lingual sequence labeling task according to the embodiments of the present disclosure described above.
  • the embodiments of the present disclosure propose a computer program product for establishing a language model adapted to a cross-lingual sequence labeling task.
  • the computer program product may comprise a computer program that is executed by at least one processor for: obtaining a training sentence pair including a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language; masking at least one original span in the first sentence with at least one predefined token, to obtain the masked first sentence; forming an input sequence for the language model with at least the masked first sentence and the second sentence; generating an input sequence representation of the input sequence through the language model; performing target span prediction based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively; and optimizing the language model based at least on the target span prediction.
  • the computer program may be further executed by the at least one processor for performing any other steps/processes of the methods for establishing a language model adapted to a cross-lingual sequence labeling task according to the embodiments of the present disclosure
  • the embodiments of the present disclosure may be embodied in a non-transitory computer- readable medium.
  • the non-transitory computer readable medium may comprise instructions that, when executed, cause one or more processors to perform any steps/processes of the methods for establishing a language model adapted to a cross-lingual sequence labeling task according to the embodiments of the present disclosure described above.
  • modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
  • processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the specific application and overall design constraints imposed on the system.
  • a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a micro-processor, micro-controller, digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure.
  • DSP digital signal processor
  • FPGA field-programmable gate array
  • PLD programmable logic device
  • the functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with software being executed by a microprocessor, micro-controller, DSP, or other suitable platform.
  • a computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk.
  • although a memory is shown as being separate from the processor in various aspects presented in this disclosure, a memory may also be internal to the processor (e.g., a cache or a register).

Abstract

This disclosure proposes to establish a language model adapted to a cross-lingual sequence labeling task. A training sentence pair including a first sentence in a first language and a second sentence in a second language is obtained, the second sentence being a version of the first sentence in the second language. At least one original span in the first sentence is masked by at least one predefined token. An input sequence for the language model is formed with the masked first sentence and the second sentence. An input sequence representation is generated through the language model. Target span prediction is performed based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence which corresponds to the at least one predefined token respectively. The language model is optimized based on the target span prediction.

Description

ESTABLISHING A LANGUAGE MODEL ADAPTED TO A CROSS-LINGUAL SEQUENCE LABELING TASK
BACKGROUND
A pre-trained language model (PLM) may be deployed to various downstream tasks, and PLMs dominate the natural language understanding and generation areas. A PLM may be extended to a cross-lingual pre-trained language model (xPLM). Usually, in a pre-training stage, the xPLM may be pre-trained with token-level pre-training tasks on a large multi-lingual corpus. In a further fine-tuning stage, the pre-trained xPLM may be transferred or deployed to downstream tasks.
SUMMARY
This Summary is provided to introduce a selection of concepts that are further described below in the Detailed Description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Embodiments of the present disclosure propose methods, apparatuses, computer program products and computer-readable mediums for establishing a language model adapted to a cross-lingual sequence labeling task. A training sentence pair including a first sentence in a first language and a second sentence in a second language may be obtained, the second sentence being a version of the first sentence in the second language. At least one original span in the first sentence may be masked with at least one predefined token, to obtain the masked first sentence. An input sequence for the language model may be formed with at least the masked first sentence and the second sentence. An input sequence representation of the input sequence may be generated through the language model. Target span prediction may be performed based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively. The language model may be optimized based at least on the target span prediction. It should be noted that the above one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features are only indicative of the various ways in which the principles of various aspects may be employed, and this disclosure is intended to include all such aspects and their equivalents.
BRIEF DESCRIPTION OF THE DRAWINGS
The disclosed aspects will hereinafter be described in conjunction with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.
FIG.1 illustrates an exemplary process of pre-training and fine-tuning a language model according to an embodiment.
FIG.2 illustrates an exemplary process of a Cross-lingual Language Informative Span Masking (CLISM) strategy in a pre-training stage according to an embodiment.
FIG.3 illustrates an example of forming an input sequence in a CLISM strategy according to an embodiment.
FIG.4 illustrates an exemplary process of a ContrAstive-Consistency Regularization (CACR) strategy in a pre-training stage according to an embodiment.
FIG.5 illustrates a flowchart of an exemplary method for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
FIG.6 illustrates an exemplary apparatus for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
FIG.7 illustrates an exemplary apparatus for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
DETAILED DESCRIPTION
The present disclosure will now be discussed with reference to several example implementations. It should be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.
A Sequence Labeling (SL) task is a task that targets span extraction, which may include, e.g., named entity recognition (NER), machine reading comprehension (MRC), question-answering, relation recognition, event extraction, etc. A cross-lingual sequence labeling task (xSL) aims to extend the SL task to be performed on different languages. The xSL task may require extending the boundary of the SL task to low-resource languages, which will face the challenge of limited training data for the low-resource languages.
An xPLM may be transferred or deployed to an xSL task, and may show a certain degree of effectiveness in the xSL task by, e.g., transferring knowledge from a high-resource language to low-resource languages. However, when transferring a pre-trained xPLM to an xSL task in a fine-tuning stage, a discrepancy or gap in training objectives between the pre-training stage and the fine-tuning stage may arise, which makes the xPLM not well adapted to the xSL task.
In a pre-training stage of a language model, the language model may be trained with a pre-training task such as mask language modeling (MLM), etc., wherein a training objective of the MLM requires local understanding of a masked token. In a fine-tuning stage of the pre-trained language model for an xSL task, the pre-trained language model is typically trained with distantly supervised multi-lingual task-related instances, wherein a training objective of span extraction of the xSL task requires global understanding and reasoning of, e.g., an input question and passage. This will lead to a gap between the training objective in the pre-training stage and the training objective in the fine-tuning stage, and will further lead to the resulting xPLM not being well adapted to the xSL task.
The embodiments of the present disclosure aim to establish a language model adapted to an xSL task. For example, the embodiments of the present disclosure may enable an xPLM to obtain characteristics adapted to the xSL task in a pre-training stage, and thus a pre-trained language model may obtain better performance for the xSL task in a fine-tuning stage.
In an aspect, the embodiments of the present disclosure propose a pre-training strategy customized for an xSL task in a pre-training stage, which may be referred to as a Cross-lingual Language Informative Span Masking (CLISM) strategy or a CLISM task. The CLISM strategy eliminates a training objective gap between the pre-training stage and a fine-tuning stage for the xSL task in a self-supervised approach.
In an aspect, the embodiments of the present disclosure propose a strategy for enhancing alignment capability in a pre-training stage, which may be referred to as a ContrAstive-Consistency Regularization (CACR) strategy. The CACR strategy may encourage an xPLM to better capture alignment between cross-lingual representations. For example, during pre-training, the CACR strategy may leverage contrastive learning to encourage consistency between representations of input parallel sequences.
The embodiments of the present disclosure may not only eliminate a gap between an objective of pre-training and an objective of fine-tuning, but may also enhance the capability of a language model to better capture alignment between cross-lingual representations at a sentence level. According to the embodiments of the present disclosure, even with limited training data in the pre-training stage, an established language model is able to have better performance. The language model established according to the embodiments of the present disclosure may achieve good applicability in various xSL tasks with limited training data. For example, even in the case of few-shot data settings where only a few training instances are available or zero-shot data settings where no training instance is available, the language model established according to the embodiments of the present disclosure is still able to achieve good applicability.
FIG.1 illustrates an exemplary process 100 of pre-training and fine-tuning a language model according to an embodiment. The process 100 is performed for pre-training and fine-tuning a language model 110 in order to apply the language model 110 to an xSL task.
In a pre-training stage, the language model 110 may be trained with a training dataset 102 which is based on multi-lingual parallel corpus. In an implementation, the parallel corpus may be divided into multiple sub-groups. Each sub-group may be referred to as a language-informative group. As an example, each sub-group may include two parallel versions of the same sentence in two different languages, e.g., a sentence in a first language and a sentence in a second language. Hereinafter, a sentence in a first language will be referred to as a source language sentence, and a sentence in a second language will be referred to as a target language sentence. A target language sentence may be a version or translation of a source language sentence in a target language, or a source language sentence may be a version or translation of a target language sentence in a source language. Since a source language sentence and a target language sentence in each sub-group are in parallel, there will be multiple meaning-aligned spans or tokens between these two sentences. Herein, a token may correspond to a word or a part of a word, and a span may include one or more tokens. A source language sentence and a target language sentence in each sub-group form a training sentence pair.
In an implementation, a CLISM strategy 120 may be adopted in the pre-training stage. The CLISM strategy may eliminate a training objective gap between the pre-training stage and a fine-tuning stage for xSL. The CLISM strategy may also be referred to as a CLISM task, which is a pre-training task customized for an xSL task. The CLISM task is a self-supervised task. Taking an xSL task of cross-lingual MRC (xMRC) which is similar to question-answering as an example, one focus of the CLISM task is how to create multi-lingual <question, answer> training pairs. It should be understood that although multiple parts of the following discussion take the CLISM task being customized for a question-answering-type xSL task as an example, CLISM tasks according to the embodiments of the present disclosure are not limited to this, but may be customized for any other types of xSL tasks in a similar approach.
According to the CLISM task, for a training sentence pair including a source language sentence and a target language sentence obtained from the training dataset 102, one or more selected original spans in the source language sentence may be masked. Each original span may be an n-ary span, which includes n tokens. An original span may be, e.g., a named entity, a phrase, etc. It should be understood that the embodiments of the present disclosure are not limited to any specific type of original span. A masked source language sentence may be obtained through performing masking on the source language sentence. An input sequence for the language model 110 may be formed with at least the masked source language sentence and the corresponding unmasked target language sentence. As an example, firstly, one or more selected original spans in the source language sentence may be masked by one or more predefined tokens, respectively. A predefined token may be, e.g., [QUE], etc. It should be understood that although a predefined token is exemplarily expressed as [QUE] hereinafter, the embodiments of the present disclosure are not limited to any specific expression of a predefined token, e.g., a predefined token may also be in any other expression than [QUE]. Then, the language model 110 may be required to find the correct start position and end position of a target span corresponding to each predefined token from the target language sentence based on global contextual understanding of the masked source language sentence and the target language sentence. Each target span has the same meaning as the corresponding original span masked by the predefined token. Accordingly, each predefined token [QUE] may be regarded as, e.g., a question in the xMRC task, and the target span, in the target language sentence, corresponding to this predefined token [QUE] may be regarded as, e.g., an answer in the xMRC task.
Under the CLISM strategy, the language model 110 needs to predict whether each token in the target language sentence is a start position of an answer (i.e., a start position token) or an end position of an answer (i.e., an end position token). This is essentially consistent with the xSL task, because span extraction in the xSL task also aims to find a start position and an end position of a target span. Thus, through the CLISM strategy, a training objective of the pre-training stage will be close to or consistent with a training objective of the subsequent fine-tuning stage for the xSL task. Thus, the pre-trained language model 110 will be more adapted to the xSL task.
In an implementation, a CACR strategy 130 may be adopted in the pre-training stage. The CACR strategy may enable the language model 110 to better capture alignment of the same sentence between different languages, and avoid learning representations that are affected by noisy data. Noisy data may include, e.g., a predefined token [QUE] used in masking, two sentences in a training sentence pair that are not fully semantically paired, etc. In an aspect, the CACR strategy may, through contrastive learning, make representations of sentences originating from the same training sentence pair as close as possible in a latent space. For example, a source language sentence, a masked source language sentence and a target language sentence originating from the same training sentence pair may be taken as parallel sequences, and the CACR strategy may utilize contrastive learning to encourage the language model 110 to achieve consistency among representations of the multiple sentences in the parallel sequences. In an aspect, the CACR strategy may, through contrastive learning, make representations of sentences originating from different training sentence pairs as far as possible in a latent space. For example, the CACR strategy may utilize contrastive learning to encourage the language model 110 to implement distinctiveness between representations of a source language sentence, a masked source language sentence and a target language sentence originating from a first training sentence pair and representations of a source language sentence, a masked source language sentence and a target language sentence originating from a second training sentence pair.
During a fine-tuning stage, the pre-trained language model 110 may be further trained with a training dataset 104. The training dataset 104 may be specific to an xSL task 140. Assuming that the xSL task 140 is an xMRC task, the training dataset 104 may be a training dataset for the xMRC task. It should be understood that the embodiments of the present disclosure are not limited to any specific types of xSL task. Since a training objective gap between the pre-training stage and the fine-tuning stage for xSL is eliminated in the pre-training stage through, e.g., the CLISM strategy, etc., the pre-trained language model 110 will be more adapted to the xSL task. When performing fine-tuning for the xSL task on such pre-trained language model 110, a language model with higher applicability can be finally obtained.
In an implementation, optionally, in the fine-tuning stage, a predefined token [QUE] may be added to an input sequence of the xSL task. Taking the xSL task being an MRC task as an example, since at least one predefined token [QUE] is taken as a question in the pre-training stage, after pre-training, the at least one predefined token [QUE] will capture enough information about the question. Through adding the at least one predefined token [QUE] to an input sequence of the xSL task, a representation of the at least one predefined token [QUE] will facilitate selecting a ground-truth answer span in the xSL task.
It should be understood that although exemplary description of a fine-tuning stage is included in the above description of the process 100, the embodiments of the present disclosure are not limited to any specific processing in a fine-tuning stage. For example, after a pre-training process according to the embodiments of the present disclosure has been performed on a language model, any known fine-tuning process may be then applied to the pre-trained language model. Benefiting from the pre-training process performed on the language model according to the embodiments of the present disclosure, the resulting pre-trained language model will exhibit its performance advantages of being adapted to xSL tasks in any fine-tuning process.
FIG.2 illustrates an exemplary process 200 of a CLISM strategy in a pre-training stage according to an embodiment.
An exemplary training sentence pair 202 may be obtained from, e.g., the training dataset 102 in FIG.1. The training sentence pair 202 may include a source language sentence as a first sentence and a target language sentence as a second sentence, wherein the source language sentence adopts a source language as a first language, and the target language sentence adopts a target language as a second language. The target language sentence is a version of the source language sentence in the target language, and the source language sentence is a version of the target language sentence in the source language. The source language sentence may be represented as ss, and the target language sentence may be represented as st.
At 210, at least one original span to be masked may be selected from the source language sentence. Taking the original span being a named entity as an example, a named entity span, as the original span, may be selected from the source language sentence through various named entity recognition tools. It should be understood that the embodiments of the present disclosure are neither limited to any specific types of original span, nor limited to any specific technique for selecting an original span. The at least one original span selected or identified from the source language sentence ss may form an original span set S.
In some implementations, the selecting operation at 210 may also follow some predetermined rules so that spans with semantic meanings may be selected. For example, spans which only contain stop words may be filtered out. For example, it may be specified that the boundaries of a selected span must align with word boundaries. For example, a sequence length of each selected span should not exceed a predetermined maximum sequence length threshold, e.g., 10, etc.
At 220, the selected at least one original span in the source language sentence may be masked by at least one predefined token, so as to obtain the masked source language sentence. The masking operation may include: replacing the selected at least one original span with at least one predefined token. For example, for each original span in the original span set S, the original span may be replaced, in the source language sentence, by a single predefined token [QUE]. Through masking all original spans, the masked source language sentence may be obtained, which is represented as $\hat{s}_s$.
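As a rough illustration of the selection rules at 210 and the masking operation at 220, the following Python sketch applies them to a toy, word-level example. The stop-word list, the length threshold value, the helper names and the word-level tokenization are assumptions made only for illustration and are not part of the disclosure.

```python
# Illustrative sketch only: word-level span selection and [QUE] masking.
# Candidate spans are assumed to be given as (start, end) word indices, end exclusive.

STOP_WORDS = {"the", "a", "an", "of", "in", "to"}   # illustrative subset
MAX_SPAN_LEN = 10                                    # example threshold from the text
QUE = "[QUE]"

def select_spans(words, candidate_spans):
    """Keep candidate spans that carry semantic content."""
    selected = []
    for start, end in candidate_spans:
        span_words = words[start:end]
        if all(w.lower() in STOP_WORDS for w in span_words):
            continue                                 # filter spans containing only stop words
        if end - start > MAX_SPAN_LEN:
            continue                                 # filter overly long spans
        selected.append((start, end))
    return selected

def mask_spans(words, spans):
    """Replace each selected original span with a single [QUE] token."""
    masked, i = [], 0
    for start, end in sorted(spans):
        masked.extend(words[i:start])
        masked.append(QUE)                           # one [QUE] per original span
        i = end
    masked.extend(words[i:])
    return masked

words = ["List", "of", "Ottawa", "buildings"]
spans = select_spans(words, [(0, 1), (2, 3)])
print(mask_spans(words, spans))                      # ['[QUE]', 'of', '[QUE]', 'buildings']
```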
At 230, an input sequence for the language model may be formed with at least the masked source language sentence and the unmasked target language sentence. In an implementation, a final input sequence X may be obtained through cascading the masked source language sentence $\hat{s}_s$, the target language sentence $s_t$, and special tokens such as [CLS], [SEP], etc. For example, the input sequence X may be represented as $X = \{[CLS]\ \hat{s}_s\ [SEP]\ s_t\ [SEP]\}$, wherein [CLS] is a classification marker token and [SEP] is a sentence separation token. It should be understood that the embodiments of the present disclosure are not limited to any specific approach of utilizing the masked source language sentence and the target language sentence to form the input sequence, e.g., any special tokens may be omitted, any other special tokens may be added, etc.
FIG.3 illustrates an example of forming an input sequence in a CLISM strategy according to an embodiment.
It is assumed that a training sentence pair includes a source language sentence 310 and a target language sentence 320. The source language sentence 310 is an English sentence "List of Ottawa buildings", and the target language sentence 320 is a Vietnamese sentence "Danh sach cac toa nha Ottawa".
It is assumed that an original span 312 "List" and an original span 314 "Ottawa" are selected from the source language sentence 310 according to the operation at 210 of FIG.2. The original span 312 corresponds to or is aligned with a target span 322 "Danh sach" in the target language sentence 320, i.e., the target span 322 is the Vietnamese version of the original span 312. The original span 314 corresponds to or is aligned with a target span 324 "Ottawa" in the target language sentence 320, and the target span 324 and the original span 314 have the same expression.
According to the operation at 220 of FIG.2, the original span 312 may be replaced by a predefined token [QUE] 332 and the original span 314 may be replaced by a predefined token [QUE] 334, thereby obtaining a masked source language sentence 330.
According to the operation at 230 of FIG.2, an input sequence 340 for the language model may be formed with at least the masked source language sentence 330 and the target language sentence 320. As shown in FIG.3, the input sequence 340 is formed through sequentially cascading a special token [CLS] 342, the masked source language sentence 330, the special token [SEP] 344, the target language sentence 320 and a special token [SEP] 346.
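The cascading shown in FIG.3 can be sketched in the same toy setting; token strings are kept as plain text here for readability, whereas an actual implementation would map them to vocabulary identifiers with the model's tokenizer.

```python
CLS, SEP = "[CLS]", "[SEP]"

def build_input_sequence(masked_source_tokens, target_tokens):
    """Cascade [CLS], the masked source sentence, [SEP], the target sentence, [SEP]."""
    return [CLS] + masked_source_tokens + [SEP] + target_tokens + [SEP]

masked_source = ["[QUE]", "of", "[QUE]", "buildings"]
target = ["Danh", "sach", "cac", "toa", "nha", "Ottawa"]
print(build_input_sequence(masked_source, target))
# ['[CLS]', '[QUE]', 'of', '[QUE]', 'buildings', '[SEP]',
#  'Danh', 'sach', 'cac', 'toa', 'nha', 'Ottawa', '[SEP]']
```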
It should be understood that all the elements in the example of FIG.3 are exemplary, which are only for providing additional explanation for relevant operations in the process 200 of FIG.2, but not for limiting the scope of the embodiments of the present disclosure in any approaches.
Returning to FIG.2, when the input sequence is formed at 230, the input sequence may be provided to a language model 240. The language model 240 may generate an input sequence representation 242 of the input sequence. The language model 240 may be represented as $f_\theta$. The generation of the input sequence representation 242 may include generating a representation of each token in the input sequence. Since the input sequence X contains at least the masked source language sentence $\hat{s}_s$ and the target language sentence $s_t$, a representation of each token generated by the language model may be a global contextual semantic representation.
At 250, target span prediction may be performed based at least on the input sequence representation 242. The target span prediction aims to predict a start position and an end position of at least one target span in the target language sentence, wherein the at least one target span corresponds to the at least one predefined token in the masked source language sentence respectively. For example, in the target span prediction, for each predefined token in the masked source language sentence, a start position and an end position of a target span, corresponding to the predefined token, in the target language sentence may be predicted. A start position of a target span may be indicated by a token at the start position, therefore, prediction of a start position of a target span may refer to determining which token is a start position of the target span and thus serves as a start position token. Similarly, prediction of an end position of a target span may refer to determining which token is an end position of the target span and thus serves as an end position token.
In an implementation, in the target span prediction, for each predefined token in the masked source language sentence, target span start position probability distribution and target span end position probability distribution associated with the predefined token may be calculated based on token representations in the input sequence representation 242. The target span start position probability distribution may indicate probabilities that different tokens serve as a start position of a target span. The target span end position probability distribution may indicate probabilities that different tokens serve as an end position of a target span. Taking a specific token in the input sequence as an example, a start position probability that the token is a start position of a target span and an end position probability that the token is an end position of a target span may be calculated based at least on a representation of a predefined token and a representation of the specific token.
Taking the CLISM task being customized for an xMRC task as an example, each predefined token [QUE] in the masked source language sentence may be regarded as a question Q, and the objective of the target span prediction is: for a given question Q corresponding to a given predefined token [QUE], a correct answer corresponding to an aligned target span in the target language sentence st is predicted based on the meaning of the input sequence X. Multiple predefined tokens in the masked source language sentence may serve as a set of questions that need to be answered simultaneously.
A representation $x_q$ of the question Q may be obtained from the input sequence representation 242. Then, a dynamic start vector $s_q$ and a dynamic end vector $e_q$ may be calculated according to:

$s_q = W_s x_q, \quad e_q = W_e x_q$    Equation (1)

wherein $W_s$ and $W_e$ are learnable parameters. The dynamic start vector $s_q$ is a sub-representation derived from the representation $x_q$ of the question Q for subsequent calculation of target span start position probability distribution, and the dynamic end vector $e_q$ is a sub-representation derived from the representation $x_q$ of the question Q for subsequent calculation of target span end position probability distribution.
Referring to the known standard span-extraction process, an inner product of the learned vector $s_q$ with each token representation in the input sequence representation of the input sequence X may be calculated according to Equation (2) to obtain the target span start position probability distribution, and an inner product of the learned vector $e_q$ with each token representation in the input sequence representation of the input sequence X may be calculated according to Equation (3) to obtain the target span end position probability distribution, as follows:

$P(\text{StartIndex} = k \mid X, Q) = \dfrac{\exp(s_q \cdot x_k)}{\sum_j \exp(s_q \cdot x_j)}$    Equation (2)

$P(\text{EndIndex} = k \mid X, Q) = \dfrac{\exp(e_q \cdot x_k)}{\sum_j \exp(e_q \cdot x_j)}$    Equation (3)
wherein StartIndex = k indicates that a start index is k, i.e., Equation (2) calculates, for a token k at the k-th position in the input sequence X, a start position probability that the token k is a start position of a target span; and EndIndex = k indicates that an end index is k, i.e., Equation (3) calculates, for a token k at the k-th position in the input sequence X, an end position probability that the token k is an end position of a target span. The index j in Equation (2) and Equation (3) is an index for tokens in the input sequence X. Through performing the calculations of Equation (2) and Equation (3) for each token in the input sequence, a start position probability and an end position probability calculated for each token may be obtained, and further, the target span start position probability distribution and the target span end position probability distribution may be formed.
It should be understood that although a start position probability and an end position probability are calculated for each token in the input sequence in the above description, optionally, start position probabilities and end position probabilities may also be calculated only for tokens in the target language sentence in the input sequence.
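A minimal sketch of the span prediction of Equations (1)-(3) is given below in PyTorch-style Python. The tensor shapes, the random example values and the function name are illustrative assumptions rather than the exact architecture of the disclosure.

```python
import torch
import torch.nn.functional as F

def target_span_distributions(x_q, token_reprs, W_s, W_e):
    """
    x_q:         (d,)   representation of one [QUE] token (the question Q)
    token_reprs: (l, d) representations of all tokens in the input sequence X
    W_s, W_e:    (d, d) learnable parameters of Equation (1)
    Returns start/end probability distributions over the l tokens (Equations (2)-(3)).
    """
    s_q = W_s @ x_q                                  # dynamic start vector
    e_q = W_e @ x_q                                  # dynamic end vector
    start_logits = token_reprs @ s_q                 # inner products with each token
    end_logits = token_reprs @ e_q
    return F.softmax(start_logits, dim=-1), F.softmax(end_logits, dim=-1)

# Toy example; the hidden size d and sequence length l are arbitrary.
d, l = 8, 13
W_s, W_e = torch.randn(d, d), torch.randn(d, d)
token_reprs = torch.randn(l, d)
x_q = token_reprs[1]                                 # position of a [QUE] token (assumed)
p_start, p_end = target_span_distributions(x_q, token_reprs, W_s, W_e)
```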
According to the process 200, the language model 240 may be optimized through the performing of the target span prediction. In order to perform the optimization, firstly, at 260, alignment may be performed between the source language sentence and the target language sentence. For example, at 260, alignment may be performed between spans, tokens or words in the source language sentence and spans, tokens or words in the target language sentence. Accordingly, at least one target span in the target language sentence aligned with at least one selected original span in the source language sentence may at least be identified. The alignment operation at 260 may be performed through any alignment tools or techniques. The alignment operation at 260 may be used for determining a ground-truth target span, corresponding to a specific predefined token, in the target language sentence. For example, assuming that the specific predefined token was previously used for masking a specific original span, a target span aligned with the specific original span determined through the alignment operation at 260 may be used as a ground-truth target span corresponding to the specific predefined token. Referring to the example in FIG.3, the alignment operation at 260 may align the original span 312 with the target span 322; accordingly, in the case that the original span 312 is masked by the predefined token 332, a ground-truth target span corresponding to the predefined token 332 may be determined to be the target span 322.
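The bookkeeping that turns word alignments into a ground-truth target span can be sketched as follows, assuming the alignments are already available as (source word index, target word index) pairs produced by any word-alignment tool; the helper name and the end-exclusive index convention are assumptions for illustration.

```python
def ground_truth_target_span(original_span, alignments):
    """
    original_span: (start, end) word indices of a masked span in the source sentence.
    alignments:    iterable of (source_idx, target_idx) pairs from any word aligner.
    Returns (start, end) word indices of the aligned target span, end exclusive,
    or None if no word of the original span is aligned.
    """
    start, end = original_span
    target_indices = [t for s, t in alignments if start <= s < end]
    if not target_indices:
        return None
    return min(target_indices), max(target_indices) + 1

# Example in the spirit of FIG.3: source word 0 ("List") aligns to target words 0-1
# ("Danh sach"); the alignment pairs themselves are illustrative.
alignments = [(0, 0), (0, 1), (2, 5)]
print(ground_truth_target_span((0, 1), alignments))  # (0, 2)
```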
At 270, the language model 240 may be optimized based at least on the target span prediction. In the optimization operation, for a predefined token in the masked source language sentence, a start position probability of a ground-truth start position token of a ground-truth target span, corresponding to the predefined token, in the target language sentence may be maximized, and an end position probability of a ground-truth end position token of the ground-truth target span may be maximized. The ground-truth start position token may refer to a token at a start position of the ground-truth target span, and the ground-truth end position token may refer to a token at an end position of the ground-truth target span.
In an implementation, the optimization operation at 270 may be implemented through constructing a loss function and minimizing a loss calculated through the loss function. A loss function $\mathcal{L}_{CLISM}$ for the CLISM strategy may be constructed based on, e.g., standard cross-entropy loss, as follows:

$\mathcal{L}_{CLISM} = -\log P(\text{StartIndex} = a_i^s \mid X, Q) - \log P(\text{EndIndex} = a_i^e \mid X, Q)$    Equation (4)

wherein $a_i$ denotes a ground-truth target span, in the target language sentence, corresponding to the $i$-th question Q, $a_i^s$ denotes a start position of the ground-truth target span $a_i$, and $a_i^e$ denotes an end position of the ground-truth target span $a_i$. $\log P(\text{StartIndex} = a_i^s \mid X, Q)$ in Equation (4) represents a start position probability of a ground-truth start position token of the ground-truth target span, and $\log P(\text{EndIndex} = a_i^e \mid X, Q)$ in Equation (4) represents an end position probability of a ground-truth end position token of the ground-truth target span. Through the optimization operation described above, the capability of the language model to extract spans across different languages may be enhanced during pre-training and may be smoothly transferred to an xSL task during fine-tuning.
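A corresponding sketch of the loss of Equation (4) for a single question is shown below, reusing the start/end distributions from the span-prediction sketch above; the ground-truth indices are illustrative. In practice one would typically work with logits and a numerically stable cross-entropy rather than taking logarithms of probabilities directly.

```python
import torch

def clism_loss(p_start, p_end, gt_start_idx, gt_end_idx):
    """Cross-entropy over the start/end distributions for one [QUE] question, Equation (4)."""
    return -(torch.log(p_start[gt_start_idx]) + torch.log(p_end[gt_end_idx]))

# Positions 6 and 7 as the ground-truth start/end indices are purely illustrative.
loss = clism_loss(p_start, p_end, gt_start_idx=6, gt_end_idx=7)
```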
It should be understood that all the operations in the process 200 and the order in which they are performed are exemplary, and the embodiments of the present disclosure will cover any changes to the process 200. For example, the process 200 may be iteratively performed respectively for different training sentence pairs in the training dataset, so that the language model may be continuously pre-trained with different training sentence pairs. For example, the alignment operation at 260 does not have a specific performing order relative to other operations in the process 200, but only needs to be performed between the operation at 210 and the operation at 270. For example, all the above equations are exemplary and the embodiments of the present disclosure are not limited to any details of these equations, but may encompass any changes to these equations and any other equations for similar purposes.
FIG.4 illustrates an exemplary process 400 of a CACR strategy in a pre-training stage according to an embodiment.
The CACR strategy may help the language model to avoid learning representations affected by noisy data during pre-training, and may facilitate the language model to better capture alignment between representations of sentences in parallel sequences.
A source language sentence 402 and a target language sentence 406 may be from the same training sentence pair. A masked source language sentence 404 may be obtained through performing, e.g., the masking operation at 220 in FIG.2 on the source language sentence 402.
According to the process 400, a representation of the source language sentence 402, a representation of the masked source language sentence 404, and a representation of the target language sentence 406 may be obtained at least through a language model 410, and these representations are for further use in contrastive learning.
In an implementation, firstly, a hidden representation 412 of the source language sentence, a hidden representation 414 of the masked source language sentence, and a hidden representation 416 of the target language sentence may be obtained through the language model 410, respectively. Then, sequence lengths of these hidden representations may be unified by applying, e.g., an aggregation layer 420. For example, an aggregated representation 422 of the source language sentence may be obtained through applying the aggregation layer 420 to the hidden representation 412 of the source language sentence, an aggregated representation 424 of the masked source language sentence may be obtained through applying the aggregation layer 420 to the hidden representation 414 of the masked source language sentence, and an aggregated representation 426 of the target language sentence may be obtained through applying the aggregation layer 420 to the hidden representation 416 of the target language sentence. The aggregated representation 422, the aggregated representation 424, and the aggregated representation 426 may respectively serve as a representation of the source language sentence, a representation of the masked source language sentence, and a representation of the target language sentence that are finally obtained. It should be understood that, optionally, in the case that an input sequence representation of an input sequence formed by the masked source language sentence 404 and the target language sentence 406 is obtained according to the process 200 in FIG.2, the hidden representation 414 of the masked source language sentence and the hidden representation 416 of the target language sentence may also be extracted directly from the input sequence representation.
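One common choice for the aggregation layer 420 is mean-pooling over token positions, as also noted in the exemplary implementation further below; a minimal sketch, where the padding-mask handling is an assumption.

```python
import torch

def mean_pool(hidden, attention_mask):
    """
    hidden:         (seq_len, d) hidden representation of one sentence
    attention_mask: (seq_len,)   1 for real tokens, 0 for padding
    Returns a single (d,) aggregated sentence representation.
    """
    mask = attention_mask.unsqueeze(-1).float()      # (seq_len, 1)
    summed = (hidden * mask).sum(dim=0)              # sum over non-padded positions
    count = mask.sum().clamp(min=1.0)                # avoid division by zero
    return summed / count
```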
At 430, the obtained representation of the source language sentence, the representation of the masked source language sentence, and the representation of the target language sentence may be used for contrastive learning. In an implementation, in the contrastive learning, any two of the representation of the source language sentence, the representation of the masked source language sentence and the representation of the target language sentence may be taken as a positive sample pair for contrastive learning. Thus, through the contrastive learning, the representation of the source language sentence, the representation of the masked source language sentence, and the representation of the target language sentence will be as close as possible in a latent space to achieve representation consistency.
At 440, the language model may be optimized based at least on the contrastive learning. In an implementation, in the optimization operation, the language model may be optimized through minimizing latent space distance among a representation of a source language sentence, a representation of a masked source language sentence, and a representation of a target language sentence originating from the same training sentence pair.
Moreover, optionally, the CACR strategy may also, through the contrastive learning, make representations of sentences originating from different training sentence pairs as far as possible in a latent space. For example, assuming that a source language sentence representation, a masked source language sentence representation and a target language sentence representation respectively corresponding to a source language sentence, a masked source language sentence and a target language sentence originating from a first training sentence pair are combined into a first representation set, and a source language sentence representation, a masked source language sentence representation and a target language sentence representation respectively corresponding to a source language sentence, a masked source language sentence and a target language sentence originating from a second training sentence pair are combined into a second representation set, the CACR strategy may make any one representation in the first representation set be as far away as possible from any one representation in the second representation set in a latent space. Accordingly, in the optimization operation, the language model may be optimized through maximizing a latent space distance between any one representation in the first representation set and any one representation in the second representation set.
More exemplary implementations of the CACR strategy shown in the process 400 will be presented below.
As described above in connection with FIG.2, the input sequence of the language model is $X = \{[CLS]\ \hat{s}_s\ [SEP]\ s_t\ [SEP]\}$. Through taking X as the input for the language model, an input sequence representation M of the input sequence may be obtained as follows:

$M = f_\theta(X) = \{M_{cls}, M_{\hat{s}_s}, M_{sep}, M_{s_t}, M_{sep}\} \in \mathbb{R}^{l \times d}$    Equation (5)

wherein $l$ is the maximum input sequence length, $d$ is a hidden size, $M_{\hat{s}_s} \in \mathbb{R}^{m \times d}$ denotes a hidden representation of the masked source language sentence $\hat{s}_s$, $M_{s_t} \in \mathbb{R}^{n \times d}$ denotes a hidden representation of the target language sentence $s_t$, $m$ denotes a sequence length of $\hat{s}_s$, $n$ denotes a sequence length of $s_t$, $M_{cls}$ denotes a representation of a special token [CLS], and $M_{sep}$ denotes a representation of a special token [SEP].
Due to the mismatch between m and n, $M_{\hat{s}_s}$ and $M_{s_t}$ cannot be directly used as inputs for contrastive learning. Therefore, an extra aggregation layer $\mathcal{A}$ (e.g., mean-pooling, etc.) may be applied to obtain an aggregated representation $\hat{r}_s$ of the masked source language sentence $\hat{s}_s$ and an aggregated representation $r_t$ of the target language sentence $s_t$, as follows:

$\hat{r}_s = \mathcal{A}(M_{\hat{s}_s}), \quad r_t = \mathcal{A}(M_{s_t})$    Equation (6)

wherein $\hat{r}_s, r_t \in \mathbb{R}^{d}$.
In the contrastive learning, $\hat{r}_s$ may be used for forming a positive sample pair with $r_t$, while other data in a mini-batch may be used for forming negative sample pairs. The language model may be optimized through constructing a loss function $\mathcal{L}_{\hat{r}_s, r_t}$ for $\hat{r}_s$ and $r_t$ and minimizing a loss calculated by the loss function. A loss function $\mathcal{L}_{\hat{r}_s, r_t}$ may be constructed based on, e.g., a standard contrastive learning objective, as follows:

$\mathcal{L}_{\hat{r}_s, r_t} = -\log \dfrac{\exp(\Psi(\hat{r}_s, r_t)/\tau)}{\sum_{r' \in B} \exp(\Psi(\hat{r}_s, r')/\tau)}$    Equation (7)

wherein $B$ is a mini-batch, $\tau$ is a temperature coefficient denoting a smoothing strategy, and $\Psi(\cdot)$ denotes a cosine similarity function.
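Equation (7) follows a standard InfoNCE-style contrastive objective; a minimal sketch over one mini-batch is given below, where the in-batch negative construction, the temperature value and the function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(r_a, r_b, temperature=0.1):
    """
    r_a, r_b: (batch, d) aggregated sentence representations; row i of r_a and row i of
    r_b form a positive pair, and all other rows in the mini-batch act as negatives.
    An InfoNCE-style loss in the spirit of Equation (7).
    """
    r_a = F.normalize(r_a, dim=-1)                   # cosine similarity via normalized dot products
    r_b = F.normalize(r_b, dim=-1)
    sim = (r_a @ r_b.t()) / temperature              # (batch, batch) similarity matrix
    labels = torch.arange(r_a.size(0))               # positives lie on the diagonal
    return F.cross_entropy(sim, labels)
```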
Considering that the masked source language sentence $\hat{s}_s$ is an input interfered with by noises, e.g., it includes a predefined token [QUE], if the language model is optimized only by Equation (7), this may cause the language model to learn representations affected by noisy data, and thereby may cause incorrect representation alignment between the source language sentence and the target language sentence. Therefore, the CACR strategy further introduces the unmasked source language sentence $s_s$ as an input. The language model may encode $s_s$ to obtain a hidden representation $M_{s_s}$ of the source language sentence, and then, an aggregated representation $r_s$ of the source language sentence may be obtained through applying the aggregation layer $\mathcal{A}$ to $M_{s_s}$. Accordingly, the aggregated representation $r_s$ of the source language sentence, the aggregated representation $\hat{r}_s$ of the masked source language sentence, and the aggregated representation $r_t$ of the target language sentence may be formed into a positive threefold set $\{r_s, \hat{r}_s, r_t\}$. Every two representations in this positive threefold set are considered as a positive pair, i.e., it is desired that every two representations are as similar as possible in a latent space. A final loss function $\mathcal{L}_{CACR}$ for the CACR strategy may be:

$\mathcal{L}_{CACR} = \mathcal{L}_{r_s, \hat{r}_s} + \mathcal{L}_{\hat{r}_s, r_t} + \mathcal{L}_{r_t, r_s}$    Equation (8)

wherein $\mathcal{L}_{r_s, \hat{r}_s}$ is a loss function for $r_s$ and $\hat{r}_s$, $\mathcal{L}_{\hat{r}_s, r_t}$ is a loss function for $\hat{r}_s$ and $r_t$, and $\mathcal{L}_{r_t, r_s}$ is a loss function for $r_t$ and $r_s$. The language model may be optimized through minimizing a loss calculated via the loss function $\mathcal{L}_{CACR}$. Through this approach, the language model is encouraged to learn representations that are not affected by noisy data, and is able to better capture alignment between cross-lingual representations, which may also facilitate alleviating a training objective gap between the pre-training stage and the fine-tuning stage.
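The final CACR loss of Equation (8) then simply sums the pairwise losses over the positive threefold set; a sketch reusing the contrastive_loss helper from the previous example.

```python
def cacr_loss(r_src, r_masked_src, r_tgt, temperature=0.1):
    """Sum of pairwise contrastive losses over the positive threefold set, Equation (8)."""
    return (contrastive_loss(r_src, r_masked_src, temperature)
            + contrastive_loss(r_masked_src, r_tgt, temperature)
            + contrastive_loss(r_tgt, r_src, temperature))
```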
It should be understood that all the operations in the process 400 and the order in which they are performed are exemplary, and the embodiments of the present disclosure will cover any changes to the process 400.
According to the embodiments of the present disclosure, optionally, an existing pre-training task, e.g., an MLM task, may also be retained in the pre-training stage. Accordingly, the MLM task may also be used for optimizing a language model. Thus, the total training objective or total loss function $\mathcal{L}$ of the language model may be defined as:

$\mathcal{L} = \mathcal{L}_{CLISM} + \mathcal{L}_{CACR} + \mathcal{L}_{MLM}$    Equation (9)

wherein $\mathcal{L}_{MLM}$ is a loss function for the MLM task. The language model may be optimized with the total loss function $\mathcal{L}$. In this approach, the embodiments of the present disclosure actually train a language model in a pre-training stage with a multi-task setting which may include, e.g., CLISM, CACR, MLM, etc. It should be understood that although Equation (9) considers the loss function $\mathcal{L}_{MLM}$ of the MLM task, the embodiments of the present disclosure may also construct the total loss function $\mathcal{L}$ with only $\mathcal{L}_{CLISM}$ and $\mathcal{L}_{CACR}$.
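Under the multi-task setting of Equation (9), the individual losses are simply summed; a one-line sketch, where the MLM loss is assumed to be produced by whatever MLM head the implementation uses (pass zero to train with CLISM and CACR only).

```python
def total_loss(loss_clism, loss_cacr, loss_mlm=0.0):
    """Total pre-training objective, Equation (9)."""
    return loss_clism + loss_cacr + loss_mlm
```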
After the pre-trained language model is obtained through the above-described processing in the pre-training stage according to the embodiments of the present disclosure, fine-tuning for the xSL task may be performed on the language model in the subsequent fine-tuning stage.
In an implementation, the fine-tuning stage may be based on any known fine-tuning process.
In an implementation, the fine-tuning stage may include improvements proposed by the embodiments of the present disclosure. For example, in order to be consistent with the strategy in the pre-training stage, in the fine-tuning stage, a predefined token [QUE] may be added to an input sequence of the xSL task. This processing approach may achieve better performance in the xSL task, especially in the case of few-shot data setting or zero-shot data setting. Taking the xSL task being an xMRC task as an example, an input sequence for the xMRC task may be:
$X = \{[CLS]\ Q\ [QUE]\ [SEP]\ P\ [SEP]\}$    Equation (10)

wherein Q denotes an input question, and P denotes an input passage. In the input sequence X, a predefined token [QUE] is added after the question Q.
Since the predefined token [QUE] is taken as a question in the pre-training stage, after the pre-training stage, the predefined token [QUE] will be able to capture enough information about the question. Through adding the predefined token [QUE] to the input sequence of the xMRC task, a representation of the predefined token [QUE] will facilitate selecting a correct answer span in the xMRC task. It should be understood that although the above Equation (10) only shows adding a single [QUE] to the input sequence, optionally, in the case that at least one predefined token [QUE] is used in the pre-training stage, at least one corresponding predefined token [QUE] may also be added to the input sequence for the xMRC task. Moreover, it should be understood that although the above description gives an example in which the xSL task is an xMRC task, for any other types of xSL task, input sequences for these tasks may also be processed in a similar approach.
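Building the fine-tuning input of Equation (10) mirrors the pre-training input construction; a short sketch, where the question and passage contents are purely illustrative.

```python
def build_xmrc_input(question_tokens, passage_tokens):
    """Fine-tuning input for the xMRC task, Equation (10): [CLS] Q [QUE] [SEP] P [SEP]."""
    return ["[CLS]"] + question_tokens + ["[QUE]", "[SEP]"] + passage_tokens + ["[SEP]"]

print(build_xmrc_input(["Where", "is", "Ottawa", "?"], ["Ottawa", "is", "in", "Canada", "."]))
```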
FIG.5 illustrates a flowchart of an exemplary method 500 for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
At 510, a training sentence pair including a first sentence in a first language and a second sentence in a second language may be obtained, the second sentence being a version of the first sentence in the second language.
At 520, at least one original span in the first sentence may be masked with at least one predefined token, to obtain the masked first sentence.
At 530, an input sequence for the language model may be formed with at least the masked first sentence and the second sentence. At 540, an input sequence representation of the input sequence may be generated through the language model.
At 550, target span prediction may be performed based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively.
At 560, the language model may be optimized based at least on the target span prediction.
In an implementation, the method 500 may further comprise: selecting the at least one original span from the first sentence. The masking at least one original span may comprise: replacing the selected at least one original span with the at least one predefined token.
In an implementation, the target span prediction may comprise, for each predefined token in the at least one predefined token: calculating target span start position probability distribution and target span end position probability distribution associated with the predefined token based on token representations in the input sequence representation.
The calculating may comprise, for each token in the input sequence: calculating, based at least on a representation of the predefined token and a representation of the token, a start position probability that the token is a start position of a target span and an end position probability that the token is an end position of a target span.
The optimizing the language model based at least on the target span prediction may comprise optimizing the language model at least through: maximizing a start position probability of a ground-truth start position token of a ground-truth target span in the second sentence corresponding to the predefined token; and maximizing an end position probability of a ground-truth end position token of the ground-truth target span.
The method 500 may further comprise: determining the ground-truth target span corresponding to the predefined token through at least performing alignment between the first sentence and the second sentence.
In an implementation, the method 500 may further comprise: obtaining a representation of the first sentence, a representation of the masked first sentence and a representation of the second sentence through at least the language model; taking any two of the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence as a positive sample pair for contrastive learning; and optimizing the language model based at least on the contrastive learning.
The optimizing the language model based at least on the contrastive learning may comprise optimizing the language model at least through: minimizing latent space distances among the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence.
In an implementation, the method 500 may further comprise: optimizing the language model based at least on a mask language modeling task.
In an implementation, the method 500 may further comprise: fine-tuning the language model for the cross-lingual sequence labeling task.
The fine-tuning may comprise: adding the at least one predefined token to an input sequence for the cross-lingual sequence labeling task.
It should be understood that the method 500 may further comprise any steps/processes for establishing a language model adapted to a cross-lingual sequence labeling task according to the embodiments of the present disclosure described above.
FIG.6 illustrates an exemplary apparatus 600 for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
The apparatus 600 may comprise: a training sentence pair obtaining module 610, for obtaining a training sentence pair including a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language; a masking module 620, for masking at least one original span in the first sentence with at least one predefined token, to obtain the masked first sentence; an input sequence forming module 630, for forming an input sequence for the language model with at least the masked first sentence and the second sentence; an input sequence representation generating module 640, for generating an input sequence representation of the input sequence through the language model; a target span prediction performing module 650, for performing target span prediction based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively; and an optimizing module 660, for optimizing the language model based at least on the target span prediction. Moreover, the apparatus 600 may further comprise any other modules configured to perform any steps/processes of the methods for establishing a language model adapted to a cross-lingual sequence labeling task according to the embodiments of the present disclosure described above.
FIG.7 illustrates an exemplary apparatus 700 for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
The apparatus 700 may comprise at least one processor 710. The apparatus 700 may further comprise a memory 720 connected with at least one processor 710. The memory 720 may store computer-executable instructions that, when executed, cause the at least one processor 710 to: obtain a training sentence pair including a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language; mask at least one original span in the first sentence with at least one predefined token, to obtain the masked first sentence; form an input sequence for the language model with at least the masked first sentence and the second sentence; generate an input sequence representation of the input sequence through the language model; perform target span prediction based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively; and optimize the language model based at least on the target span prediction.
In an implementation, the target span prediction may comprise, for each predefined token in the at least one predefined token: calculating target span start position probability distribution and target span end position probability distribution associated with the predefined token based on token representations in the input sequence representation.
The calculating may comprise, for each token in the input sequence: calculating, based at least on a representation of the predefined token and a representation of the token, a start position probability that the token is a start position of a target span and an end position probability that the token is an end position of a target span.
The optimizing the language model based at least on the target span prediction may comprise optimizing the language model at least through: maximizing a start position probability of a ground-truth start position token of a ground-truth target span in the second sentence corresponding to the predefined token; and maximizing an end position probability of a ground-truth end position token of the ground-truth target span.
In an implementation, the computer-executable instructions, when executed, may further cause the at least one processor to: obtain a representation of the first sentence, a representation of the masked first sentence and a representation of the second sentence through at least the language model; take any two of the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence as a positive sample pair for contrastive learning; and optimize the language model based at least on the contrastive learning. The optimizing the language model based at least on the contrastive learning may comprise optimizing the language model at least through: minimizing latent space distances among the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence.
In an implementation, the computer-executable instructions, when executed, may further cause the at least one processor to: fine-tune the language model for the cross-lingual sequence labeling task.
The fine-tuning may comprise: adding the at least one predefined token to an input sequence for the cross-lingual sequence labeling task.
Moreover, the at least one processor 710 may be further configured to perform any other steps/processes of the methods for establishing a language model adapted to a cross-lingual sequence labeling task according to the embodiments of the present disclosure described above. The embodiments of the present disclosure propose a computer program product for establishing a language model adapted to a cross-lingual sequence labeling task. The computer program product may comprise a computer program that is executed by at least one processor for: obtaining a training sentence pair including a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language; masking at least one original span in the first sentence with at least one predefined token, to obtain the masked first sentence; forming an input sequence for the language model with at least the masked first sentence and the second sentence; generating an input sequence representation of the input sequence through the language model; performing target span prediction based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively; and optimizing the language model based at least on the target span prediction. Moreover, the computer program may be further executed by the at least one processor for performing any other steps/processes of the methods for establishing a language model adapted to a cross-lingual sequence labeling task according to the embodiments of the present disclosure described above.
The embodiments of the present disclosure may be embodied in a non-transitory computer- readable medium. The non-transitory computer readable medium may comprise instructions that, when executed, cause one or more processors to perform any steps/processes of the methods for establishing a language model adapted to a cross-lingual sequence labeling task according to the embodiments of the present disclosure described above.
It should be understood that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.
In addition, the articles "a" and "an" as used in this description and appended claims, unless otherwise specified or clear from the context that they are for the singular form, should generally be interpreted as meaning "one" or "one or more."
It should also be understood that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
Processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the specific application and overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a microprocessor, a microcontroller, a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with software being executed by a microprocessor, a microcontroller, a DSP, or other suitable platform.
Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, etc. The software may reside on a computer-readable medium. A computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk. Although a memory is shown as being separate from the processor in various aspects presented in this disclosure, a memory may also be internal to the processor (e.g., a cache or a register).
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are intended to be encompassed by the claims.

Claims

1. A method for establishing a language model adapted to a cross-lingual sequence labeling task, comprising: obtaining a training sentence pair including a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language; masking at least one original span in the first sentence with at least one predefined token, to obtain the masked first sentence; forming an input sequence for the language model with at least the masked first sentence and the second sentence; generating an input sequence representation of the input sequence through the language model; performing target span prediction based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively; and optimizing the language model based at least on the target span prediction.
2. The method of claim 1, further comprising: selecting the at least one original span from the first sentence, and wherein the masking at least one original span comprises: replacing the selected at least one original span with the at least one predefined token.
3. The method of claim 1, wherein the target span prediction comprises, for each predefined token in the at least one predefined token: calculating target span start position probability distribution and target span end position probability distribution associated with the predefined token based on token representations in the input sequence representation.
4. The method of claim 3, wherein the calculating comprises: for each token in the input sequence: calculating, based at least on a representation of the predefined token and a representation of the token, a start position probability that the token is a start position of a target span and an end position probability that the token is an end position of a target span.
5. The method of claim 3, wherein the optimizing the language model based at least on the target span prediction comprises optimizing the language model at least through: maximizing a start position probability of a ground-truth start position token of a ground-truth target span in the second sentence corresponding to the predefined token; and maximizing an end position probability of a ground-truth end position token of the ground-truth target span.
6. The method of claim 5, further comprising: determining the ground-truth target span corresponding to the predefined token through at least performing alignment between the first sentence and the second sentence.
7. The method of claim 1, further comprising: obtaining a representation of the first sentence, a representation of the masked first sentence and a representation of the second sentence through at least the language model; taking any two of the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence as a positive sample pair for contrastive learning; and optimizing the language model based at least on the contrastive learning.
8. The method of claim 7, wherein the optimizing the language model based at least on the contrastive learning comprises optimizing the language model at least through: minimizing latent space distances among the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence.
9. The method of claim 1, further comprising: optimizing the language model based at least on a mask language modeling task.
10. The method of claim 1, further comprising: fine-tuning the language model for the cross-lingual sequence labeling task.
11. The method of claim 10, wherein the fine-tuning comprises: adding the at least one predefined token to an input sequence for the cross-lingual sequence labeling task.
12. An apparatus for establishing a language model adapted to a cross-lingual sequence labeling task, comprising: at least one processor; and a memory storing computer-executable instructions that, when executed, cause the at least one processor to: obtain a training sentence pair including a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language, mask at least one original span in the first sentence with at least one predefined token, to obtain the masked first sentence, form an input sequence for the language model with at least the masked first sentence and the second sentence, generate an input sequence representation of the input sequence through the language model, perform target span prediction based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively, and optimize the language model based at least on the target span prediction.
13. The apparatus of claim 12, wherein the target span prediction comprises, for each predefined token in the at least one predefined token: calculating target span start position probability distribution and target span end position probability distribution associated with the predefined token based on token representations in the input sequence representation.
14. The apparatus of claim 12, wherein the computer-executable instructions, when executed, further cause the at least one processor to: obtain a representation of the first sentence, a representation of the masked first sentence and a representation of the second sentence through at least the language model; take any two of the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence as a positive sample pair for contrastive learning; and optimize the language model based at least on the contrastive learning.
15. A computer program product for establishing a language model adapted to a cross-lingual sequence labeling task, comprising a computer program that is executed by at least one processor for: obtaining a training sentence pair including a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language; masking at least one original span in the first sentence with at least one predefined token, to obtain the masked first sentence; forming an input sequence for the language model with at least the masked first sentence and the second sentence; generating an input sequence representation of the input sequence through the language model; performing target span prediction based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively; and optimizing the language model based at least on the target span prediction.
PCT/US2023/012468 2022-04-24 2023-02-07 Establishing a language model adapted to a cross-lingual sequence labeling task WO2023211525A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210434197.4A CN116976340A (en) 2022-04-24 2022-04-24 Building a language model suitable for cross-lingual sequence labeling tasks
CN202210434197.4 2022-04-24

Publications (1)

Publication Number Publication Date
WO2023211525A1 true WO2023211525A1 (en) 2023-11-02

Family

ID=85556811

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/012468 WO2023211525A1 (en) 2022-04-24 2023-02-07 Establishing a language model adapted to a cross-lingual sequence labeling task

Country Status (2)

Country Link
CN (1) CN116976340A (en)
WO (1) WO2023211525A1 (en)

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "[PDF] Bridging the Gap between Language Models and Cross-Lingual Sequence Labeling | Semantic Scholar", 1 January 2022 (2022-01-01), XP093044031, Retrieved from the Internet <URL:https://www.semanticscholar.org/paper/Bridging-the-Gap-between-Language-Models-and-Chen-Shou/6c6ef8700507618b2851f428ef383d34ab99d7af> [retrieved on 20230503] *
CHEN NUO ET AL: "Bridging the Gap between Language Models and Cross-Lingual Sequence Labeling", PROCEEDINGS OF THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 11 April 2022 (2022-04-11), Stroudsburg, PA, USA, pages 1909 - 1923, XP093043908, Retrieved from the Internet <URL:https://arxiv.org/ftp/arxiv/papers/2204/2204.05210.pdf> DOI: 10.18653/v1/2022.naacl-main.139 *
RAM ORI ET AL: "Few-Shot Question Answering by Pretraining Span Selection", PROCEEDINGS OF THE 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (VOLUME 1: LONG PAPERS), 1 January 2021 (2021-01-01), Stroudsburg, PA, USA, pages 3066 - 3079, XP093044054, Retrieved from the Internet <URL:https://arxiv.org/pdf/2101.00438v1.pdf> DOI: 10.18653/v1/2021.acl-long.239 *
ZHU FEIDA ET AL: "Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition", ACM SYMPOSIUM ON APPLIED PERCEPTION 2020, 14 August 2021 (2021-08-14), New York, NY, USA, pages 3231 - 3239, XP093044050, ISBN: 978-1-4503-8332-5, Retrieved from the Internet <URL:https://arxiv.org/pdf/2106.00241v1.pdf> DOI: 10.1145/3447548.3467196 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236335A (en) * 2023-11-13 2023-12-15 江西师范大学 Two-stage named entity recognition method based on prompt learning
CN117236335B (en) * 2023-11-13 2024-01-30 江西师范大学 Two-stage named entity recognition method based on prompt learning
CN117744657A (en) * 2023-12-26 2024-03-22 广东外语外贸大学 Medicine adverse event detection method and system based on neural network model

Also Published As

Publication number Publication date
CN116976340A (en) 2023-10-31

Similar Documents

Publication Publication Date Title
Zhang et al. Discriminative nearest neighbor few-shot intent detection by transferring natural language inference
CN113987209B Natural language processing method, device, computing equipment and storage medium based on knowledge-guided prefix fine-tuning
Liu et al. Exploiting argument information to improve event detection via supervised attention mechanisms
Chen et al. Event extraction via dynamic multi-pooling convolutional neural networks
WO2022083423A1 (en) System and method for relation extraction with adaptive thresholding and localized context pooling
CN109992773B (en) Word vector training method, system, device and medium based on multi-task learning
WO2023211525A1 (en) Establishing a language model adapted to a cross-lingual sequence labeling task
WO2021211207A1 (en) Adversarial pretraining of machine learning models
US11715008B2 (en) Neural network training utilizing loss functions reflecting neighbor token dependencies
Wan et al. Financial causal sentence recognition based on BERT-CNN text classification
Pramanik et al. Text normalization using memory augmented neural networks
CN113743099B (en) System, method, medium and terminal for extracting terms based on self-attention mechanism
Gomes et al. BERT-and TF-IDF-based feature extraction for long-lived bug prediction in FLOSS: a comparative study
Liu et al. A Hybrid Neural Network BERT‐Cap Based on Pre‐Trained Language Model and Capsule Network for User Intent Classification
Li et al. Exploiting argument information to improve biomedical event trigger identification via recurrent neural networks and supervised attention mechanisms
CN111967253A (en) Entity disambiguation method and device, computer equipment and storage medium
Lundeqvist et al. Author profiling: A machine learning approach towards detecting gender, age and native language of users in social media
Unanue et al. Regressing word and sentence embeddings for low-resource neural machine translation
Jain et al. Detecting Twitter posts with Adverse Drug Reactions using Convolutional Neural Networks.
Wu et al. A Model Ensemble Approach with LLM for Chinese Text Classification
Shen et al. Towards domain-generalizable paraphrase identification by avoiding the shortcut learning
Nguyen et al. An Embedding Method for Sentiment Classification across Multiple Languages
Nukarinen Automated text sentiment analysis for Finnish language using deep learning
Üveges Comprehensibility and Automation: Plain Language in the Era of Digitalization
Vo et al. Development of a fake news detection tool for Vietnamese based on deep learning techniques

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23710114

Country of ref document: EP

Kind code of ref document: A1