CN116976340A - Building a language model suitable for cross-language sequence markup tasks - Google Patents
- Publication number
- CN116976340A (application number CN202210434197.4A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- language
- representation
- language model
- target segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0895—Weakly supervised learning, e.g. semi-supervised or self-supervised learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/096—Transfer learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
Abstract
The present disclosure provides methods, apparatus, computer program products, and computer readable media for building language models suitable for cross-language sequence markup tasks. A training sentence pair comprising a first sentence in a first language and a second sentence in a second language may be obtained, the second sentence being a version of the first sentence in the second language. At least one original fragment in the first sentence may be masked with at least one predefined term to obtain a masked first sentence. An input sequence of the language model may be formed using at least the masked first sentence and the second sentence. An input sequence representation of the input sequence may be generated by the language model. Target segment prediction may be performed based at least on the input sequence representation to predict a start position and an end position of at least one target segment in the second sentence, the at least one target segment corresponding to the at least one predefined entry, respectively. The language model may be optimized based at least on the target segment predictions.
Description
Background
A pre-trained language model (PLM) can be deployed to a variety of downstream tasks and has come to dominate the fields of natural language understanding and generation. A PLM can be extended to a cross-language pre-trained language model (xPLM). In general, in the pre-training phase, the xPLM may be pre-trained with vocabulary-level pre-training tasks over a large multilingual corpus. In a further fine-tuning phase, the pre-trained xPLM may be migrated or deployed to downstream tasks.
Disclosure of Invention
This summary is provided to introduce a selection of concepts that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Embodiments of the present disclosure propose methods, apparatuses, computer program products and computer readable media for building a language model suitable for cross-language sequence markup tasks. A training sentence pair comprising a first sentence in a first language and a second sentence in a second language may be obtained, the second sentence being a version of the first sentence in the second language. At least one original fragment in the first sentence may be masked with at least one predefined term to obtain a masked first sentence. An input sequence of the language model may be formed using at least the masked first sentence and the second sentence. An input sequence representation of the input sequence may be generated by the language model. Target segment prediction may be performed based at least on the input sequence representation to predict a start position and an end position of at least one target segment in the second sentence, the at least one target segment corresponding to the at least one predefined entry, respectively. The language model may be optimized based at least on the target segment predictions.
It is noted that one or more of the aspects above include the features specifically pointed out in the following detailed description and the claims. The following description and the annexed drawings set forth in detail certain illustrative features of the one or more aspects. These features are indicative of but a few of the various ways in which the principles of various aspects may be employed and the present disclosure is intended to include all such aspects and their equivalents.
Drawings
The disclosed aspects will be described below in conjunction with the drawings, which are provided to illustrate and not limit the disclosed aspects.
FIG. 1 illustrates an exemplary process of pre-training and fine-tuning a language model according to an embodiment.
FIG. 2 illustrates an exemplary process of cross-language information fragment mask (CLISM) strategy in a pre-training phase, according to an embodiment.
Fig. 3 shows an example of forming an input sequence in a CLISM policy according to an embodiment.
FIG. 4 illustrates an exemplary process of a contrast consistency regularization (CACR) strategy in a pre-training stage according to an embodiment.
FIG. 5 sets forth a flow chart illustrating an exemplary method for building a language model suitable for cross-language sequential markup tasks according to embodiments.
FIG. 6 illustrates an exemplary apparatus for building a language model suitable for cross-language sequence markup tasks, according to an embodiment.
FIG. 7 illustrates an exemplary apparatus for building a language model suitable for cross-language sequence markup tasks, according to an embodiment.
Detailed Description
The present disclosure will now be discussed with reference to various exemplary embodiments. It should be understood that the discussion of these embodiments is merely intended to enable one skilled in the art to better understand and thereby practice the examples of the present disclosure and is not intended to limit the scope of the present disclosure in any way.
Sequence tagging (SL) tasks are tasks targeting fragment extraction (span extraction), and may include, for example, named entity recognition (NER), machine reading comprehension (MRC), question answering, relationship recognition, event extraction, and the like. The cross-language sequence markup task (xSL) is intended to extend SL tasks to be executed in different languages. xSL tasks may require extending the boundaries of SL tasks to low-resource languages, which is challenging given the limited training data available in such languages.
An xPLM may be migrated or deployed to xSL tasks and may be effective to some extent in xSL tasks, for example by transferring knowledge from high-resource languages to low-resource languages. However, when a pre-trained xPLM is migrated to an xSL task in the fine-tuning phase, a difference (gap) between the training targets of the pre-training phase and the fine-tuning phase may arise, such that the xPLM is not well suited for the xSL task.
In the pre-training phase of a language model, the language model may be trained using pre-training tasks such as masked language modeling (MLM), where the training goal of MLM requires only local understanding of the masked terms (tokens). In the fine-tuning phase for an xSL task on a pre-trained language model, the pre-trained language model is typically trained using distantly supervised, task-dependent multilingual instances, where the training goal of fragment extraction in the xSL task requires global understanding of and reasoning over the input, e.g., questions and documents. This leads to a gap between the training targets of the pre-training phase and those of the fine-tuning phase, which in turn results in the resulting xPLM not being well suited for the xSL task.
Embodiments of the present disclosure are directed to building language models suitable for xSL tasks. For example, embodiments of the present disclosure may enable the xPLM to obtain characteristics suitable for the xSL task in a pre-training phase, so that a pre-trained language model may obtain better performance for the xSL task in a fine-tuning phase.
In one aspect, embodiments of the present disclosure propose a pre-training strategy tailored to xSL tasks in the pre-training phase, which may be referred to as a cross-lingual informative span masking (CLISM: Cross-Lingual Informative Span Masking) strategy or CLISM task. The CLISM strategy reduces the training target gap between the pre-training phase and the fine-tuning phase for xSL in a self-supervised manner.
In one aspect, embodiments of the present disclosure propose a strategy for enhancing alignment capability in the pre-training phase, which may be referred to as a contrastive-consistency regularization (CACR: ContrAstive-Consistency Regularization) strategy. The CACR policy may encourage the xPLM to better capture alignment between cross-language representations. For example, during pre-training, the CACR strategy may utilize contrast learning to encourage consistency between representations of the input parallel sequences.
Embodiments of the present disclosure may not only narrow the gap between pre-training targets and fine-tuning targets, but may also enhance the ability of the language model to better capture alignment between cross-language representations at the sentence level. According to embodiments of the present disclosure, the built language model can achieve better performance even if only limited training data is provided in the pre-training stage. A language model built according to embodiments of the present disclosure may achieve good applicability in various xSL tasks with limited training data. For example, a language model built according to embodiments of the present disclosure can achieve good applicability even in a few-shot data setting where only a few training examples are available, or in a zero-shot data setting where no training examples are available.
FIG. 1 illustrates an exemplary process 100 for pre-training and fine-tuning a language model according to an embodiment. The process 100 is performed for pre-training and fine-tuning the language model 110 to apply the language model 110 to xSL tasks.
In the pre-training phase, the language model 110 may be trained using a training dataset 102 based on a multilingual parallel corpus. In one implementation, the parallel corpus may be divided into a plurality of subgroups. Each subgroup may be referred to as a language information group. As an example, each subgroup may comprise two parallel versions of the same sentence in two different languages, e.g. a sentence in a first language and a sentence in a second language. The sentence in the first language will be referred to as a source language sentence and the sentence in the second language will be referred to as a target language sentence hereinafter. The target language sentence may be a version or translation of the source language sentence in the target language, or the source language sentence may be a version or translation of the target language sentence in the source language. Since the source language sentence and the target language sentence in each subgroup are parallel, there will be multiple fragments or terms aligned in meaning between the two sentences. In this context, a term may correspond to a word or a portion of a word, and a segment may include one or more terms. The source language sentence and the target language sentence in each subgroup form a training sentence pair.
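As a purely illustrative sketch of the data layout described above (the class and field names, the language codes, and the example sentences are hypothetical and not part of the disclosure), a training sentence pair drawn from one language information group might be represented as follows:

```python
from dataclasses import dataclass

@dataclass
class TrainingSentencePair:
    """One language information group: two parallel versions of the same sentence."""
    source_language: str   # first language, e.g. "en"
    target_language: str   # second language, e.g. "de"
    source_sentence: str   # source language sentence (first sentence)
    target_sentence: str   # target language sentence (second sentence), a translation of the source

# A hypothetical training sentence pair
pair = TrainingSentencePair(
    source_language="en",
    target_language="de",
    source_sentence="Berlin is the capital of Germany",
    target_sentence="Berlin ist die Hauptstadt von Deutschland",
)
print(pair.source_sentence, "||", pair.target_sentence)
```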
In one implementation, the CLISM strategy 120 may be employed in the pre-training phase. The CLISM strategy may narrow the training target gap between the pre-training phase and the fine-tuning phase for xSL. The CLISM policy may also be referred to as a CLISM task, which is a pre-training task tailored to the xSL task. The CLISM task is a self-supervised task. Taking an xSL task of cross-language MRC (xMRC), such as question answering, as an example, one concern of the CLISM task is how to create multilingual <question, answer> training pairs. It should be appreciated that while the CLISM task is exemplified at various portions of the following discussion as being tailored to an xSL task of the question-answering type, the CLISM task according to embodiments of the present disclosure is not so limited, but may be tailored to any other type of xSL task in a similar manner.
According to the CLISM task, for a training sentence pair obtained from the training data set 102 that comprises a source language sentence and a target language sentence, one or more selected original fragments in the source language sentence may be masked. Each original fragment may be an n-gram fragment that includes n terms. An original fragment may be, for example, a named entity, a phrase, or the like. It should be understood that embodiments of the present disclosure are not limited to any particular type of original fragment. By performing masking on the source language sentence, a masked source language sentence may be obtained. The input sequence of the language model 110 may be formed using at least the masked source language sentence and the corresponding unmasked target language sentence. As an example, first, one or more predefined terms may be utilized to mask the selected one or more original fragments in the source language sentence, respectively. The predefined term may be, for example, [QUE], etc. It should be appreciated that although the predefined term is hereinafter exemplarily expressed as [QUE], embodiments of the present disclosure are not limited to any specific expression of the predefined term, e.g., the predefined term may have any expression other than [QUE]. The language model 110 may then be required to find the start and end positions of the correct target segment corresponding to each predefined term in the target language sentence based on a global context understanding of the masked source language sentence and the target language sentence. Each target segment has the same meaning as the corresponding original fragment masked by the predefined term. Accordingly, each predefined term [QUE] may be considered a question in, for example, an xMRC task, and the target fragment in the target language sentence corresponding to the predefined term [QUE] may be considered an answer in, for example, an xMRC task.
Under the CLISM policy, the language model 110 needs to predict whether each term in the target language sentence is a starting position of the answer (i.e., a starting position term) or an ending position of the answer (i.e., an ending position term). This is essentially consistent with the xSL task, as the fragment extraction in the xSL task is also intended to find the start and end positions of the target fragment. Thus, by the CLISM strategy, the training goals of the pre-training phase will be close to or consistent with the training goals of the subsequent fine-tuning phase for the xSL task. Thus, the pre-trained language model 110 would be more suitable for xSL tasks.
In one implementation, the CACR strategy 130 may be employed during the pre-training phase. The CACR strategy may enable the language model 110 to better capture the alignment of the same sentence between different languages and avoid learning representations that are affected by noisy data. Noise data may include, for example, the predefined terms [QUE] used in masking, two sentences of a training sentence pair that are not completely semantically paired, and so forth. In one aspect, the CACR strategy may learn by contrast such that representations of sentences from the same training sentence pair are as close as possible in potential space. For example, source language statements, masked source language statements, and target language statements that originate from the same training statement pair may be provided as parallel sequences, and the CACR strategy may utilize contrast learning to encourage the language model 110 to achieve consistency between representations of multiple statements in the parallel sequences. In one aspect, the CACR strategy may learn by contrast such that representations of sentences from different training sentence pairs are as far apart as possible in potential space. For example, the CACR policy may utilize contrast learning to encourage the language model 110 to achieve distinguishability between representations of source language statements, masked source language statements, and target language statements originating from a first training statement pair, and representations of source language statements, masked source language statements, and target language statements originating from a second training statement pair.
In the fine-tuning stage, the training data set 104 may be employed to further train the pre-trained language model 110. The training data set 104 may be specific to the xSL task 140. Assuming the xSL task 140 is an xMRC task, the training data set 104 can be a training data set for the xMRC task. It should be appreciated that embodiments of the present disclosure are not limited to any particular type of xSL task. The pre-trained language model 110 will be more suitable for the xSL task since the training target gap between the pre-training phase and the fine-tuning phase for xSL is narrowed in the pre-training phase by, for example, the CLISM strategy or the like. When fine-tuning for the xSL task is performed on such a pre-trained language model 110, a language model with higher applicability can finally be obtained.
In one implementation, optionally, in the fine-tuning phase, a predefined entry [QUE] may be added to the input sequence of the xSL task. Taking the example of the xSL task being an xMRC task: since at least one predefined entry [QUE] is treated as a question in the pre-training phase, after pre-training the at least one predefined entry [QUE] will have captured enough information about questions. By adding the at least one predefined entry [QUE] to the input sequence of the xSL task, the representation of the at least one predefined entry [QUE] will facilitate the selection of the actual answer segment in the xSL task.
It should be appreciated that while the above description of process 100 includes an exemplary description of a fine-tuning stage, embodiments of the present disclosure are not limited to any particular fine-tuning stage processing. For example, after a pre-training process according to embodiments of the present disclosure is performed on a language model, any known fine-tuning process may in turn be applied to the pre-trained language model. Thanks to the pre-training process performed on the language model according to embodiments of the present disclosure, the resulting pre-trained language model will exhibit its performance advantages for xSL tasks in any fine-tuning process.
FIG. 2 illustrates an exemplary process 200 of a CLISM strategy in a pre-training phase, according to an embodiment.
An exemplary training sentence pair 202 may be obtained, for example, from the training data set 102 in FIG. 1. The training sentence pair 202 may include a source language sentence as the first sentence and a target language sentence as the second sentence, wherein the source language sentence is in a source language as the first language and the target language sentence is in a target language as the second language. The target language sentence is a version of the source language sentence in the target language, and the source language sentence is a version of the target language sentence in the source language. The source language sentence may be denoted $s_s$ and the target language sentence may be denoted $s_t$.
At 210, at least one original fragment to be masked may be selected from the source language sentence. Taking the example that the original fragments are named entities, named entity fragments serving as the original fragments may be selected from the source language sentence by various named entity recognition tools. It should be understood that embodiments of the present disclosure are not limited to any particular type of original fragment, nor to any particular technique for selecting an original fragment. The at least one original fragment selected or identified from the source language sentence $s_s$ may form an original fragment set S.
In some implementations, the selection operation at 210 may also follow some predetermined rules to enable selection of segments having semantic meaning. For example, segments containing only stop words may be filtered out. For example, it may be specified that the boundaries of the selected segments must be words. For example, the sequence length of each selected fragment should not exceed a predetermined maximum sequence length threshold, e.g., 10, etc.
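The following minimal sketch illustrates such a selection filter under simple assumptions (whitespace tokens, a hypothetical stop-word list, and a maximum span length of 10); the concrete rules and tools are left open by the disclosure:

```python
STOP_WORDS = {"the", "of", "a", "an", "and", "or", "to", "in"}  # hypothetical stop-word list
MAX_SPAN_LENGTH = 10  # predetermined maximum sequence length threshold

def is_selectable_span(tokens: list[str]) -> bool:
    """Apply the predetermined rules so that only spans with semantic meaning are selected."""
    if not tokens or len(tokens) > MAX_SPAN_LENGTH:
        return False            # empty or longer than the maximum sequence length threshold
    if all(t.lower() in STOP_WORDS for t in tokens):
        return False            # filter out segments containing only stop words
    return True

print(is_selectable_span(["Ottawa"]))      # True
print(is_selectable_span(["of", "the"]))   # False: only stop words
```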
At 220, the selected at least one original fragment in the source language sentence can be masked with at least one predefined term to obtain a masked source language sentence. The masking operation may include replacing the selected at least one original fragment with the at least one predefined term. For example, for each original fragment in the original fragment set S, the original fragment may be replaced in the source language sentence with a single predefined entry [QUE]. By masking all original fragments, a masked source language sentence can be obtained, which is denoted $\tilde{s}_s$.
At 230, an input sequence of the language model may be formed using at least the masked source language sentence and the unmasked target language sentence. In one implementation, the masked source language sentence $\tilde{s}_s$, the target language sentence $s_t$, and special terms such as [CLS] and [SEP] may be concatenated to obtain the final input sequence X. For example, the input sequence X may be expressed as $X = [CLS]\ \tilde{s}_s\ [SEP]\ s_t\ [SEP]$, where [CLS] is the classification term and [SEP] is the sentence-separating term. It should be understood that embodiments of the present disclosure are not limited to any particular manner of forming an input sequence using the masked source language sentence and the target language sentence, e.g., any special term may be omitted, any other special term may be added, etc.
Fig. 3 shows an example of forming an input sequence in a CLISM policy according to an embodiment.
Assume that a training sentence pair includes a source language sentence 310 and a target language sentence 320. The source language sentence 310 is the English sentence "List of Ottawa buildings" and the target language sentence 320 is the Vietnamese sentence "Danh sách các tòa nhà Ottawa".
Assume that, according to the operation at 210 of FIG. 2, the original fragment 312 "List" and the original fragment 314 "Ottawa" are selected from the source language sentence 310. The original fragment 312 corresponds to or is aligned with the target fragment 322 "Danh sách" in the target language sentence 320, i.e., the target fragment 322 is the Vietnamese version of the original fragment 312. The original fragment 314 corresponds to or is aligned with the target fragment 324 "Ottawa" in the target language sentence 320, the target fragment 324 and the original fragment 314 having the same expression.
According to the operations at 220 of FIG. 2, the original fragment 312 may be replaced with a predefined entry [QUE] 332 and the original fragment 314 may be replaced with a predefined entry [QUE] 334, resulting in the masked source language sentence 330.
According to the operations at 230 of FIG. 2, an input sequence 340 of the language model may be formed using at least the masked source language sentence 330 and the target language sentence 320. As shown in FIG. 3, the input sequence 340 is formed by concatenating, in order, the special term [CLS] 342, the masked source language sentence 330, the special term [SEP] 344, the target language sentence 320, and the special term [SEP] 346.
It should be understood that all of the elements in the example of fig. 3 are exemplary and are intended to provide additional explanation of the relevant operations in process 200 of fig. 2 only and are not intended to limit the scope of the embodiments of the present disclosure in any way.
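For illustration only, the masking and concatenation steps of FIG. 3 might be sketched as follows, assuming whitespace tokenization and inclusive token-index spans; the helper names are hypothetical and not part of the disclosure:

```python
QUE, CLS, SEP = "[QUE]", "[CLS]", "[SEP]"

def mask_fragments(source_tokens: list[str], spans: list[tuple[int, int]]) -> list[str]:
    """Replace each selected original fragment (start, end inclusive) with a single [QUE] term."""
    masked, i = [], 0
    spans = sorted(spans)
    while i < len(source_tokens):
        span = next((s for s in spans if s[0] == i), None)
        if span is not None:
            masked.append(QUE)      # the whole fragment collapses into one predefined term
            i = span[1] + 1
        else:
            masked.append(source_tokens[i])
            i += 1
    return masked

def build_input_sequence(masked_source: list[str], target_tokens: list[str]) -> list[str]:
    """X = [CLS] masked source sentence [SEP] target sentence [SEP]."""
    return [CLS] + masked_source + [SEP] + target_tokens + [SEP]

source = "List of Ottawa buildings".split()
target = "Danh sách các tòa nhà Ottawa".split()
masked = mask_fragments(source, [(0, 0), (2, 2)])   # mask "List" and "Ottawa"
print(build_input_sequence(masked, target))
# ['[CLS]', '[QUE]', 'of', '[QUE]', 'buildings', '[SEP]',
#  'Danh', 'sách', 'các', 'tòa', 'nhà', 'Ottawa', '[SEP]']
```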
Returning to FIG. 2, after the input sequence is formed at 230, the input sequence may be provided to the language model 240. The language model 240 may be denoted $\mathcal{M}$. The language model 240 may generate an input sequence representation 242 of the input sequence. The generation of the input sequence representation 242 may include generating a representation of each term in the input sequence. Since the input sequence X contains at least the masked source language sentence $\tilde{s}_s$ and the target language sentence $s_t$, the representation of each term generated by the language model may be a global context semantic representation.
At 250, target segment prediction may be performed based at least on the input sequence representation 242. The target segment prediction is intended to predict a start position and an end position of at least one target segment in the target language sentence, wherein the at least one target segment corresponds to the at least one predefined term in the masked source language sentence, respectively. For example, in target segment prediction, for each predefined term in the masked source language sentence, a start position and an end position of the target segment in the target language sentence corresponding to that predefined term may be predicted. The start position of the target segment may be indicated by the term located at the start position; thus, predicting the start position of the target segment may refer to determining which term is the start position of the target segment and is thus the start position term. Similarly, predicting the end position of the target segment may refer to determining which term is the end position of the target segment and is thus the end position term.
In one implementation, in target segment prediction, for each predefined term in the masked source language sentence, a target segment start position probability distribution and a target segment end position probability distribution associated with the predefined term may be calculated based on the term representation in the input sequence representation 242. The target segment start position probability distribution may indicate the probability that a different term is the start position of the target segment. The target segment end position probability distribution may indicate the probabilities of different terms as the end positions of the target segments. Taking a particular term in the input sequence as an example, a start position probability that the particular term is a target segment start position may be calculated based at least on a representation of the predefined term and a representation of the particular term, and an end position probability that the particular term is a target segment end position may be calculated.
Taking the example that the CLISM task is tailored to the xMRC task, each predefined entry [QUE] in the masked source language sentence can be regarded as a question Q. The goal of target segment prediction is then: for a given predefined entry [QUE], corresponding to a given question Q, predict the correct answer based on the meaning of the input sequence X, the answer corresponding to the target segment in the target language sentence $s_t$ that is aligned with the original fragment masked by that entry. Multiple predefined entries in the masked source language sentence may be used as a set of questions that need to be answered simultaneously.
A representation $x_q$ of the question Q may be obtained from the input sequence representation 242. A dynamic start vector $s_q$ and a dynamic end vector $e_q$ can then be calculated according to:
$s_q = W_s x_q, \quad e_q = W_e x_q$    Formula (1)
where $W_s$ and $W_e$ are learnable parameters. The dynamic start vector $s_q$ is a sub-representation derived from the representation $x_q$ of the question Q for the subsequent calculation of the probability distribution over the start position of the target segment, and the dynamic end vector $e_q$ is a sub-representation derived from $x_q$ for the subsequent calculation of the probability distribution over the end position of the target segment.
Referring to the known standard segment extraction process, the inner product of the learned vector $s_q$ with the representation of each term in the input sequence X may be calculated according to Formula (2) to obtain the target segment start position probability distribution, and the inner product of the learned vector $e_q$ with the representation of each term in the input sequence X may be calculated according to Formula (3) to obtain the target segment end position probability distribution, as follows:
$P(\mathrm{startindex} = k \mid Q, X) = \dfrac{\exp(s_q \cdot h_k)}{\sum_j \exp(s_q \cdot h_j)}$    Formula (2)
$P(\mathrm{endindex} = k \mid Q, X) = \dfrac{\exp(e_q \cdot h_k)}{\sum_j \exp(e_q \cdot h_j)}$    Formula (3)
where startindex = k indicates, for the term k at the k-th position in the input sequence X, the start position probability that term k is the target segment start position, i.e., Formula (2) calculates the start position probability of term k; and endindex = k indicates, for the term k at the k-th position in the input sequence X, the end position probability that term k is the target segment end position, i.e., Formula (3) calculates the end position probability of term k. The index j in Formula (2) and Formula (3) ranges over the terms in the input sequence X, and $h_k$ and $h_j$ denote the representations of the k-th and j-th terms in the input sequence representation. By performing the calculations of Formula (2) and Formula (3) for each term in the input sequence, the start position probability and the end position probability of each term can be obtained, and the target segment start position probability distribution and the target segment end position probability distribution can thereby be formed.
It should be appreciated that although in the above description the start position probability and the end position probability are calculated for each term in the input sequence, alternatively the start position probability and the end position probability may be calculated for only terms in the target language sentence in the input sequence.
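A toy sketch of the target segment prediction of Formulas (1)-(3) is given below; the randomly generated representations and the square shapes of the parameter matrices W_s and W_e are simplifications for illustration rather than details fixed by the disclosure:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def target_segment_distributions(H, x_q, W_s, W_e):
    """Formulas (1)-(3): dynamic start/end vectors, then start/end position distributions
    from inner products of those vectors with each term representation in H."""
    s_q = W_s @ x_q                      # dynamic start vector, Formula (1)
    e_q = W_e @ x_q                      # dynamic end vector,  Formula (1)
    p_start = softmax(H @ s_q)           # Formula (2)
    p_end = softmax(H @ e_q)             # Formula (3)
    return p_start, p_end

# Toy usage with random representations (L = 13 terms, hidden size d = 8)
rng = np.random.default_rng(0)
H = rng.normal(size=(13, 8))             # input sequence representation
x_q = H[1]                               # representation of the first [QUE] term
W_s, W_e = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
p_start, p_end = target_segment_distributions(H, x_q, W_s, W_e)
print(p_start.shape, round(float(p_start.sum()), 6))   # (13,) 1.0
```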
According to process 200, the language model 240 may be optimized through the execution of target segment prediction. To perform the optimization, an alignment may first be performed between the source language sentence and the target language sentence at 260. For example, at 260, alignment may be performed between a fragment, term, or word in the source language sentence and a fragment, term, or word in the target language sentence. Accordingly, at least one target fragment in the target language sentence that is aligned with the selected at least one original fragment in the source language sentence may be identified. The alignment operation at 260 may be performed by any alignment tool or technique. The alignment operation at 260 may be used to determine the actual (ground-truth) target fragment in the target language sentence that corresponds to a particular predefined term. For example, assuming that the particular predefined term was previously used to mask a particular original fragment, the target fragment that is aligned with that particular original fragment, as determined by the alignment operation at 260, may be taken as the actual target fragment corresponding to that particular predefined term. Referring to the example in FIG. 3, the alignment operation at 260 may align the original fragment 312 with the target fragment 322; accordingly, in the event that the original fragment 312 is masked by the predefined term 332, the actual target fragment corresponding to the predefined term 332 may be determined to be the target fragment 322.
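For illustration, determining the actual target fragment from a precomputed fragment-level alignment might be sketched as follows; the alignment dictionary here is hypothetical and would in practice be produced by any alignment tool:

```python
def real_target_segment(alignment: dict[tuple[int, int], tuple[int, int]],
                        masked_original_span: tuple[int, int]) -> tuple[int, int] | None:
    """Look up the real (ground-truth) target segment aligned with a masked original fragment."""
    return alignment.get(masked_original_span)

# Hypothetical fragment-level alignment for the FIG. 3 pair (inclusive token-index spans):
alignment = {
    (0, 0): (0, 1),   # "List"   -> "Danh sách"
    (2, 2): (5, 5),   # "Ottawa" -> "Ottawa"
}
print(real_target_segment(alignment, (0, 0)))   # (0, 1)
```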
At 270, the language model 240 may be optimized based at least on the target segment predictions. In the optimizing operation, for a predefined term in the masked source language sentence, a starting position probability of a real starting position term of a real target segment corresponding to the predefined term in the target language sentence may be maximized, and an ending position probability of a real ending position term of the real target segment may be maximized. The real start position entry may refer to an entry located at a start position of the real target segment, and the real end position entry may refer to an entry located at an end position of the real target segment.
In one implementation, the optimization operation at 270 may be achieved by constructing a loss function and minimizing the loss calculated by the loss function. A loss function $\mathcal{L}_{CLISM}$ for the CLISM policy may be constructed based on, for example, the standard cross-entropy loss, as follows:
$\mathcal{L}_{CLISM} = -\sum_i \big[ \log P(\mathrm{startindex} = k^{start}_{a_i} \mid Q_i, X) + \log P(\mathrm{endindex} = k^{end}_{a_i} \mid Q_i, X) \big]$    Formula (4)
where $a_i$ denotes the real target fragment in the target language sentence corresponding to the i-th question Q, $k^{start}_{a_i}$ denotes the start position of the real target segment $a_i$, and $k^{end}_{a_i}$ denotes the end position of the real target segment $a_i$. In Formula (4), the first probability represents the start position probability of the real start position entry of the real target segment, and the second probability represents the end position probability of the real end position entry of the real target segment. Through the optimization operations described above, the ability of the language model to extract fragments across different languages may be enhanced during pre-training and may be smoothly migrated to the xSL task during fine-tuning.
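A minimal sketch of the loss of Formula (4) is shown below; the plain summation without normalization and the toy probability distributions are assumptions for illustration:

```python
import numpy as np

def clism_loss(start_dists, end_dists, real_start_positions, real_end_positions):
    """Formula (4): negative log-likelihood of the real start/end position terms of the
    real target segments, summed over the questions ([QUE] terms) of one input sequence."""
    loss = 0.0
    for p_start, p_end, k_start, k_end in zip(start_dists, end_dists,
                                              real_start_positions, real_end_positions):
        loss += -np.log(p_start[k_start]) - np.log(p_end[k_end])
    return loss

# Toy usage: two questions over a 13-term input sequence
rng = np.random.default_rng(1)
start_dists = [rng.dirichlet(np.ones(13)) for _ in range(2)]
end_dists = [rng.dirichlet(np.ones(13)) for _ in range(2)]
print(clism_loss(start_dists, end_dists,
                 real_start_positions=[6, 11], real_end_positions=[7, 11]))
```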
It should be understood that all operations in process 200 and the order of execution thereof are exemplary, and that embodiments of the present disclosure will encompass any modification to process 200. For example, process 200 may be performed iteratively for different pairs of training sentences in a training dataset, respectively, so that the language model may be continuously pre-trained with different pairs of training sentences. For example, the alignment operation at 260 does not have a particular order of execution relative to other operations in process 200, but rather is merely performed between the operation at 210 and the operation at 270. For example, all of the above formulas are exemplary, and embodiments of the present disclosure are not limited to any details of these formulas, but may encompass any modification to these formulas and any other formulas for similar purposes.
Fig. 4 illustrates an exemplary process 400 of CACR strategy in a pre-training phase, according to an embodiment.
The CACR strategy may help the language model avoid learning representations that are affected by noise data during pre-training, and may cause the language model to better capture the alignment between the representations of the sentences in the parallel sequences.
The source language sentence 402 and the target language sentence 406 may be from the same training sentence pair. The masked source language statement 404 may be obtained by performing a masking operation on the source language statement 402, such as 220 in fig. 2.
According to process 400, a representation of a source language sentence 402, a representation of a masked source language sentence 404, and a representation of a target language sentence 406 may be obtained at least by a language model 410 for use in turn in contrast learning.
In one implementation, first, a hidden representation 412 of the source language sentence, a hidden representation 414 of the masked source language sentence, and a hidden representation 416 of the target language sentence may be obtained, respectively, by the language model 410. The sequence lengths of these hidden representations may then be unified by applying, for example, an aggregation layer 420. For example, an aggregate representation 422 of the source language sentence may be obtained by applying the aggregation layer 420 to the hidden representation 412 of the source language sentence, an aggregate representation 424 of the masked source language sentence may be obtained by applying the aggregation layer 420 to the hidden representation 414 of the masked source language sentence, and an aggregate representation 426 of the target language sentence may be obtained by applying the aggregation layer 420 to the hidden representation 416 of the target language sentence. The aggregate representation 422, the aggregate representation 424, and the aggregate representation 426 may respectively serve as the finally obtained representation of the source language sentence, representation of the masked source language sentence, and representation of the target language sentence. It should be appreciated that, optionally, where an input sequence representation of an input sequence formed by the masked source language sentence 404 and the target language sentence 406 is obtained in accordance with the process 200 of FIG. 2, the hidden representation 414 of the masked source language sentence and the hidden representation 416 of the target language sentence may also be extracted directly from the input sequence representation.
At 430, the obtained representation of the source language sentence, representation of the masked source language sentence, and representation of the target language sentence may be used for contrast learning. In one implementation, in the contrast learning, any two of the representation of the source language sentence, the representation of the masked source language sentence, and the representation of the target language sentence may be used as a positive sample pair. Thus, through contrast learning, the representation of the source language sentence, the representation of the masked source language sentence, and the representation of the target language sentence will be as close as possible in the potential space, to achieve representation consistency.
At 440, the language model may be optimized based at least on the contrast learning. In one implementation, in an optimization operation, the language model may be optimized by minimizing potential spatial distances between representations of source language statements, representations of masked source language statements, and representations of target language statements that originate from the same training statement pair.
Furthermore, the CACR strategy may optionally also learn by contrast such that representations of sentences originating from different training sentence pairs are as far apart as possible in potential space. For example, assume that the representations corresponding to the source language sentence, the masked source language sentence, and the target language sentence of a first training sentence pair form a first set of representations, and the representations corresponding to the source language sentence, the masked source language sentence, and the target language sentence of a second training sentence pair form a second set of representations. The CACR policy may then cause any representation in the first set to be as far away as possible in the potential space from any representation in the second set. Accordingly, in the optimization operation, the language model may be optimized by maximizing the potential-space distance between any representation in the first set and any representation in the second set.
In the following, further exemplary implementations of the CACR policy shown in process 400 will be given.
As described above in connection with FIG. 2, the input of the language model $\mathcal{M}$ is the input sequence X. By taking X as the input of the language model, the input sequence representation H of the input sequence can be obtained as follows:
$H = \mathcal{M}(X) = \big[\, h_{[CLS]},\, H^{\tilde{s}_s},\, h_{[SEP]},\, H^{s_t},\, h_{[SEP]} \,\big]$    Formula (5)
where $H \in \mathbb{R}^{L \times d}$, L is the maximum input sequence length, d is the hidden size, $H^{\tilde{s}_s} \in \mathbb{R}^{m \times d}$ denotes the hidden representation of the masked source language sentence $\tilde{s}_s$, $H^{s_t} \in \mathbb{R}^{n \times d}$ denotes the hidden representation of the target language sentence $s_t$, m denotes the sequence length of $\tilde{s}_s$, n denotes the sequence length of $s_t$, $h_{[CLS]}$ denotes the representation of the special term [CLS], and $h_{[SEP]}$ denotes the representation of the special term [SEP].
Due to the mismatch between m and n, $H^{\tilde{s}_s}$ and $H^{s_t}$ cannot be directly used as inputs to the contrast learning. Thus, an additional aggregation layer $g(\cdot)$ (e.g., average pooling, etc.) can be applied to obtain an aggregate representation $\tilde{r}_s$ of the masked source language sentence $\tilde{s}_s$ and an aggregate representation $r_t$ of the target language sentence $s_t$, as follows:
$\tilde{r}_s = g\big(H^{\tilde{s}_s}\big), \quad r_t = g\big(H^{s_t}\big)$    Formula (6)
where $\tilde{r}_s \in \mathbb{R}^d$ and $r_t \in \mathbb{R}^d$.
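For illustration, the aggregation layer g(·) realized as average pooling might be sketched as follows, with toy hidden representations and an assumed hidden size of 8:

```python
import numpy as np

def aggregate(hidden_representation: np.ndarray) -> np.ndarray:
    """Aggregation layer g(.) as average pooling over the sequence dimension,
    mapping an (m, d) hidden representation to a single d-dimensional vector."""
    return hidden_representation.mean(axis=0)

rng = np.random.default_rng(2)
H_masked_source = rng.normal(size=(5, 8))    # hidden representation of the masked source sentence (m=5, d=8)
H_target = rng.normal(size=(6, 8))           # hidden representation of the target sentence (n=6, d=8)
r_masked_source = aggregate(H_masked_source)     # aggregate representation, shape (8,)
r_target = aggregate(H_target)                   # aggregate representation, shape (8,)
print(r_masked_source.shape, r_target.shape)     # (8,) (8,)
```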
In the contrast learning, $\tilde{r}_s$ and $r_t$ may be composed into a positive sample pair, while the other data in the mini-batch are used to compose negative sample pairs. The language model is optimized by constructing a loss function for $\tilde{r}_s$ and $r_t$ and minimizing the loss calculated by the loss function. The loss function $\mathcal{L}(\tilde{r}_s, r_t)$ for $\tilde{r}_s$ and $r_t$ may be constructed based on, for example, a standard contrast learning objective, as follows:
$\mathcal{L}(\tilde{r}_s, r_t) = -\log \dfrac{\exp\big(\psi(\tilde{r}_s, r_t)/\tau\big)}{\sum_{r' \in B} \exp\big(\psi(\tilde{r}_s, r')/\tau\big)}$    Formula (7)
where B is the mini-batch, $\tau$ is a temperature coefficient representing a smoothing strategy, and $\psi(\cdot)$ denotes the cosine similarity function.
Considering that the masked source language sentence $\tilde{s}_s$ is an input subject to noise interference, e.g., it includes the predefined entry [QUE], optimizing the language model using only Formula (7) would likely result in the language model learning representations affected by the noise data, which may cause incorrect alignment of the representations between the source language sentence and the target language sentence. Thus, the CACR strategy further introduces the unmasked source language sentence $s_s$ as an input. The language model can encode $s_s$ to obtain a hidden representation $H^{s_s}$ of the source language sentence, and the aggregation layer $g(\cdot)$ can then be applied to $H^{s_s}$ to obtain an aggregate representation $r_s$ of the source language sentence. Accordingly, the aggregate representation $r_s$ of the source language sentence, the aggregate representation $\tilde{r}_s$ of the masked source language sentence, and the aggregate representation $r_t$ of the target language sentence may compose a ternary positive example set $(r_s, \tilde{r}_s, r_t)$. Every two representations in the ternary positive example set are considered a positive sample pair, i.e., it is desirable that every two representations be as similar as possible in the potential space. The final loss function $\mathcal{L}_{CACR}$ of the CACR strategy may be:
$\mathcal{L}_{CACR} = \mathcal{L}(\tilde{r}_s, r_t) + \mathcal{L}(r_s, \tilde{r}_s) + \mathcal{L}(r_t, r_s)$    Formula (8)
where $\mathcal{L}(\tilde{r}_s, r_t)$ is the loss function for $\tilde{r}_s$ and $r_t$, $\mathcal{L}(r_s, \tilde{r}_s)$ is the loss function for $r_s$ and $\tilde{r}_s$, and $\mathcal{L}(r_t, r_s)$ is the loss function for $r_t$ and $r_s$, each constructed as in Formula (7). The language model is optimized by minimizing the loss calculated via the loss function $\mathcal{L}_{CACR}$. In this way, the language model may be encouraged to learn representations that are not affected by noise data and to better capture alignment between cross-language representations, which also helps reduce the training target gap between the pre-training phase and the fine-tuning phase.
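The following sketch illustrates Formulas (7) and (8) under the assumption of a standard in-batch InfoNCE-style objective with cosine similarity and temperature; the batch construction, dimensions, and function names are toy values chosen for illustration:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(anchor, positive, in_batch_negatives, tau=0.1):
    """Formula (7), assumed InfoNCE form: pull the positive pair together and push the
    anchor away from the other (negative) representations in the mini-batch."""
    sims = np.array([cosine(anchor, positive)] +
                    [cosine(anchor, neg) for neg in in_batch_negatives]) / tau
    sims -= sims.max()                       # numerical stability
    return float(-np.log(np.exp(sims[0]) / np.exp(sims).sum()))

def cacr_loss(r_source, r_masked_source, r_target, in_batch_negatives, tau=0.1):
    """Formula (8): sum of the pairwise losses over the ternary positive example set."""
    return (contrastive_loss(r_masked_source, r_target, in_batch_negatives, tau)
            + contrastive_loss(r_source, r_masked_source, in_batch_negatives, tau)
            + contrastive_loss(r_target, r_source, in_batch_negatives, tau))

# Toy usage: representations of one training sentence pair plus 7 in-batch negatives
rng = np.random.default_rng(3)
r_source = rng.normal(size=8)
r_masked_source = r_source + 0.1 * rng.normal(size=8)
r_target = r_source + 0.1 * rng.normal(size=8)
negatives = [rng.normal(size=8) for _ in range(7)]
print(cacr_loss(r_source, r_masked_source, r_target, negatives))
```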
It should be understood that all operations in process 400 and the order of execution thereof are exemplary, and that embodiments of the present disclosure will encompass any modification to process 400.
According to embodiments of the present disclosure, existing pre-training tasks, e.g., the MLM task, may also optionally be retained during the pre-training phase. Accordingly, the MLM task may also be used to optimize the language model. Thus, the total training objective or total loss function $\mathcal{L}_{total}$ of the language model can be defined as:
$\mathcal{L}_{total} = \mathcal{L}_{CLISM} + \mathcal{L}_{CACR} + \mathcal{L}_{MLM}$    Formula (9)
where $\mathcal{L}_{MLM}$ is the loss function of the MLM task. The total loss function $\mathcal{L}_{total}$ can be utilized to optimize the language model. In this manner, embodiments of the present disclosure actually train the language model in the pre-training phase with a multitask setup, which may include, for example, CLISM, CACR, MLM, etc.
It should be appreciated that although Formula (9) considers the loss function $\mathcal{L}_{MLM}$ of the MLM task, embodiments of the present disclosure may also construct the total loss function $\mathcal{L}_{total}$ using only $\mathcal{L}_{CLISM}$ and $\mathcal{L}_{CACR}$.
After the pre-trained language model is obtained through the process of the pre-training phase described above in accordance with embodiments of the present disclosure, fine-tuning for the xSL task may be performed on the language model in a subsequent fine-tuning phase.
In one implementation, the fine-tuning stage may be based on any known fine-tuning process.
In one implementation, the fine-tuning stage may include improvements as set forth by embodiments of the present disclosure. For example, to be consistent with the strategy in the pre-training phase, in the fine-tuning phase, a predefined entry [QUE] may be added to the input sequence of the xSL task. This approach may enable better performance in the xSL task, especially in the case of a few-sample data setting or a zero-sample data setting. Taking the xMRC task as an example, the input sequence of the xMRC task may be:
$X = [CLS]\ Q\ [QUE]\ [SEP]\ P\ [SEP]$    Formula (10)
where Q represents the input question and P represents the input passage. In the input sequence X, the predefined entry [QUE] is added after the question Q.
Since the predefined entry [QUE] is used as a question in the pre-training phase, the predefined entry [QUE] will be able to capture enough information about the question after the pre-training phase. By adding the predefined entry [QUE] to the input sequence of the xMRC task, the representation of the predefined entry [QUE] will help to select the correct answer segment in the xMRC task. It should be appreciated that while Formula (10) above only shows the addition of a single [QUE] to the input sequence, alternatively, where at least one predefined entry [QUE] is used in the pre-training phase, a corresponding at least one predefined entry [QUE] may also be added to the input sequence of the xMRC task. Furthermore, it should be appreciated that while an example is given above in which the xSL task is an xMRC task, the input sequence may be processed in a similar manner for any other type of xSL task.
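For illustration, forming the fine-tuning input sequence of Formula (10) might be sketched as follows, assuming whitespace tokenization; the question/passage pair is hypothetical:

```python
QUE, CLS, SEP = "[QUE]", "[CLS]", "[SEP]"

def build_xmrc_input(question_tokens: list[str], passage_tokens: list[str]) -> list[str]:
    """Formula (10): the predefined term [QUE] is appended after the question Q,
    before the [SEP] that separates the question from the passage P."""
    return [CLS] + question_tokens + [QUE] + [SEP] + passage_tokens + [SEP]

print(build_xmrc_input("What is the capital of Canada ?".split(),
                       "Ottawa is the capital city of Canada .".split()))
```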
FIG. 5 illustrates a flowchart of an exemplary method 500 for building a language model suitable for cross-language sequential markup tasks, according to an embodiment.
At 510, a training sentence pair comprising a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language, may be obtained.
At 520, at least one original fragment in the first sentence may be masked with at least one predefined term to obtain a masked first sentence.
At 530, an input sequence of the language model may be formed using at least the masked first statement and the second statement.
At 540, an input sequence representation of the input sequence may be generated by the language model.
At 550, target segment prediction may be performed based at least on the input sequence representation to predict a start position and an end position of at least one target segment in the second sentence, the at least one target segment corresponding to the at least one predefined entry, respectively.
At 560, the language model may be optimized based at least on the target segment predictions.
In one implementation, the method 500 may further include: the at least one original fragment is selected from the first sentence. The masking at least one original segment may include: the selected at least one original segment is replaced with the at least one predefined entry.
In one implementation, the target segment prediction may include, for each of the at least one predefined vocabulary entry: a target segment start position probability distribution and a target segment end position probability distribution associated with the predefined vocabulary entry are calculated based on the vocabulary entry representations in the input sequence representation.
The calculating may include, for each term in the input sequence: a start position probability that the term is a target segment start position and an end position probability that the term is a target segment end position are calculated based at least on the representation of the predefined term and the representation of the term.
The optimizing the language model based at least on the target segment predictions may include optimizing the language model by at least: maximizing the initial position probability of a real initial position entry of a real target segment corresponding to the predefined entry in the second sentence; and maximizing an end position probability of a real end position entry of the real target segment.
The method 500 may further include: the real target segment corresponding to the predefined entry is determined at least by performing an alignment between the first sentence and the second sentence.
In one implementation, the method 500 may further include: obtaining, by at least the language model, a representation of the first sentence, a representation of the masked first sentence, and a representation of the second sentence; taking any two of the representation of the first sentence, the representation of the masked first sentence, and the representation of the second sentence as positive sample pairs for contrast learning; and optimizing the language model based at least on the contrast learning.
The optimizing the language model based at least on the contrast learning may include optimizing the language model by at least: minimizing potential spatial distances between the representation of the first statement, the representation of the masked first statement, and the representation of the second statement.
In one implementation, the method 500 may further include: the language model is optimized based at least on masking language modeling tasks.
In one implementation, the method 500 may further include: the language model fine-tuning is used for the cross-language sequence markup task.
The fine tuning may include: the at least one predefined entry is added to the input sequence of the cross-language sequence tagging task.
It should be appreciated that the method 500 may also include any steps/processes for building a language model suitable for cross-language sequence markup tasks in accordance with embodiments of the present disclosure described above.
FIG. 6 illustrates an exemplary apparatus 600 for building a language model suitable for cross-language sequence markup tasks, according to an embodiment.
The apparatus 600 may include: a training sentence pair obtaining module 610, configured to obtain a training sentence pair including a first sentence in a first language and a second sentence in a second language, where the second sentence is a version of the first sentence in the second language; a masking module 620 for masking at least one original fragment in the first sentence with at least one predefined entry to obtain a masked first sentence; an input sequence forming module 630 for forming an input sequence of the language model using at least the masked first sentence and the second sentence; an input sequence representation generation module 640 for generating an input sequence representation of the input sequence by the language model; a target segment prediction execution module 650 for performing target segment prediction based at least on the input sequence representation to predict a start position and an end position of at least one target segment in the second sentence, the at least one target segment corresponding to the at least one predefined entry, respectively; and an optimization module 660 for optimizing the language model based at least on the target segment predictions. In addition, the apparatus 600 may also include any other modules configured to perform any of the steps/processes of the method for building a language model suitable for cross-language sequential markup tasks according to embodiments of the present disclosure described above.
FIG. 7 illustrates an exemplary apparatus 700 for building a language model suitable for cross-language sequence markup tasks, according to an embodiment.
The apparatus 700 may include at least one processor 710. The apparatus 700 may further include a memory 720 coupled to the at least one processor 710. The memory 720 may store computer-executable instructions that, when executed, cause the at least one processor 710 to: obtaining a training sentence pair comprising a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language; masking at least one original segment in the first sentence with at least one predefined entry to obtain a masked first sentence; forming an input sequence of the language model using at least the masked first sentence and the second sentence; generating an input sequence representation of the input sequence by the language model; performing target segment prediction based at least on the input sequence representation to predict a start position and an end position of at least one target segment in the second sentence, the at least one target segment corresponding to the at least one predefined entry, respectively; and optimizing the language model based at least on the target segment prediction.
In one implementation, the target segment prediction may include, for each of the at least one predefined entry: calculating, based on the entry representations in the input sequence representation, a target segment start position probability distribution and a target segment end position probability distribution associated with the predefined entry.
The calculating may include, for each entry in the input sequence: calculating, based at least on the representation of the predefined entry and the representation of the entry, a start position probability that the entry is a target segment start position and an end position probability that the entry is a target segment end position.
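One plausible realization of this calculation is sketched below: each entry representation is scored against the predefined entry's representation through separate start and end projection heads, and the scores are normalized with a softmax. The projection layers and parameter names are assumptions, not the disclosure's stated architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

class SpanScorer(nn.Module):
    """For one predefined entry, computes the probability that each entry of
    the input sequence is the target segment start or end position."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.start_proj = nn.Linear(hidden_dim, hidden_dim)
        self.end_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, entry_reps, predefined_rep):
        # entry_reps:     [seq_len, hidden_dim] input sequence representation
        # predefined_rep: [hidden_dim] representation of the predefined entry
        start_scores = self.start_proj(entry_reps) @ predefined_rep  # [seq_len]
        end_scores = self.end_proj(entry_reps) @ predefined_rep      # [seq_len]
        return F.softmax(start_scores, dim=-1), F.softmax(end_scores, dim=-1)
```

In training, the unnormalized scores would typically be kept and fed to a cross-entropy loss, as in the loss sketch given earlier in this section.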
The optimizing the language model based at least on the target segment prediction may include optimizing the language model by at least: maximizing the start position probability of a real start position entry of a real target segment in the second sentence that corresponds to the predefined entry; and maximizing the end position probability of a real end position entry of the real target segment.
In one implementation, the computer-executable instructions, when executed, may also cause the at least one processor to: obtaining, by at least the language model, a representation of the first sentence, a representation of the masked first sentence, and a representation of the second sentence; taking any two of the representation of the first sentence, the representation of the masked first sentence, and the representation of the second sentence as a positive sample pair for contrastive learning; and optimizing the language model based at least on the contrastive learning.
The optimizing the language model based at least on the contrastive learning may include optimizing the language model by at least: minimizing latent-space distances between the representation of the first sentence, the representation of the masked first sentence, and the representation of the second sentence.
In one implementation, the computer-executable instructions, when executed, may also cause the at least one processor to: fine-tuning the language model for the cross-language sequence markup task.
The fine-tuning may include: adding the at least one predefined entry to the input sequence of the cross-language sequence markup task.
In addition, the at least one processor 710 may also be configured to perform any other steps/processes of the method for building a language model suitable for cross-language sequence markup tasks according to the embodiments of the present disclosure described above.
Embodiments of the present disclosure propose a computer program product for building a language model suitable for cross-language sequence markup tasks. The computer program product may comprise a computer program for execution by at least one processor to: obtaining training sentence pairs comprising a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language; masking at least one original segment in the first sentence with at least one predefined entry to obtain a masked first sentence; forming an input sequence of the language model using at least the masked first sentence and the second sentence; generating an input sequence representation of the input sequence by the language model; performing target segment prediction based at least on the input sequence representation to predict a start position and an end position of at least one target segment in the second sentence, the at least one target segment corresponding to the at least one predefined entry, respectively; and optimizing the language model based at least on the target segment prediction. Furthermore, the computer program may also be executable by at least one processor for performing any other steps/processes for building a language model suitable for cross-language sequence markup tasks according to the embodiments of the present disclosure described above.
Embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may include instructions that, when executed, cause one or more processors to perform any steps/processes of a method for building a language model suitable for cross-language sequence markup tasks according to the embodiments of the present disclosure described above.
It should be understood that all operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or to the order of such operations, but rather should cover all other equivalent variations under the same or similar concepts.
In addition, the articles "a" and "an" as used in this specification and the appended claims should generally be construed to mean "one" or "one or more" unless specified otherwise or clear from context to be directed to a singular form.
It should also be understood that all of the modules in the apparatus described above may be implemented in various ways. These modules may be implemented as hardware, software, or a combination thereof. Furthermore, any of these modules may be functionally further divided into sub-modules or combined together.
The processor has been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and the overall design constraints imposed on the system. As an example, a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, a microcontroller, a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gate logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described in this disclosure. The functions of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software that is executed by a microprocessor, a microcontroller, a DSP, or other suitable platform.
Software should be construed broadly to mean instructions, instruction sets, code segments, program code, programs, subroutines, software modules, applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, and the like. The software may reside in a computer-readable medium. A computer-readable medium may include, for example, a memory, which may be, for example, a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk. Although the memory is shown separate from the processor in various aspects presented in this disclosure, the memory may also be located internal to the processor (e.g., in a cache or register).
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Accordingly, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described in the disclosure that are known or later come to be known to those of ordinary skill in the art are intended to be encompassed by the claims.
Claims (20)
1. A method for building a language model suitable for cross-language sequence markup tasks, comprising:
obtaining training sentence pairs comprising a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language;
masking at least one original segment in the first sentence with at least one predefined entry to obtain a masked first sentence;
forming an input sequence of the language model using at least the masked first sentence and the second sentence;
generating an input sequence representation of the input sequence by the language model;
performing target segment prediction based at least on the input sequence representation to predict a start position and an end position of at least one target segment in the second sentence, the at least one target segment corresponding to the at least one predefined entry, respectively; and
optimizing the language model based at least on the target segment prediction.
2. The method of claim 1, further comprising:
selecting the at least one original segment from the first sentence, and
wherein said masking at least one original segment comprises: replacing the selected at least one original segment with the at least one predefined entry.
3. The method of claim 1, wherein the target segment prediction comprises, for each of the at least one predefined entry:
calculating, based on the entry representations in the input sequence representation, a target segment start position probability distribution and a target segment end position probability distribution associated with the predefined entry.
4. The method of claim 3, wherein the calculating comprises, for each entry in the input sequence:
calculating, based at least on the representation of the predefined entry and the representation of the entry, a start position probability that the entry is a target segment start position and an end position probability that the entry is a target segment end position.
5. The method of claim 3, wherein the optimizing the language model based at least on the target segment prediction comprises optimizing the language model by at least:
maximizing the start position probability of a real start position entry of a real target segment in the second sentence that corresponds to the predefined entry; and
maximizing the end position probability of the real end position entry of the real target segment.
6. The method of claim 5, further comprising:
determining the real target segment corresponding to the predefined entry at least by performing an alignment between the first sentence and the second sentence.
7. The method of claim 1, further comprising:
obtaining, by at least the language model, a representation of the first sentence, a representation of the masked first sentence, and a representation of the second sentence;
taking any two of the representation of the first sentence, the representation of the masked first sentence, and the representation of the second sentence as a positive sample pair for contrastive learning; and
optimizing the language model based at least on the contrastive learning.
8. The method of claim 7, wherein the optimizing the language model based at least on the contrastive learning comprises optimizing the language model by at least:
minimizing latent-space distances between the representation of the first sentence, the representation of the masked first sentence, and the representation of the second sentence.
9. The method of claim 1, further comprising:
optimizing the language model based at least on a masked language modeling task.
10. The method of claim 1, further comprising:
fine-tuning the language model for the cross-language sequence markup task.
11. The method of claim 10, wherein the fine-tuning comprises:
adding the at least one predefined entry to the input sequence of the cross-language sequence markup task.
12. An apparatus for building a language model suitable for cross-language sequence markup tasks, comprising:
at least one processor; and
a memory storing computer-executable instructions that, when executed, cause the at least one processor to:
obtaining a training sentence pair comprising a first sentence in a first language and a second sentence in a second language, said second sentence being a version of said first sentence in said second language,
masking at least one original segment in the first sentence with at least one predefined entry to obtain a masked first sentence,
forming an input sequence of the language model using at least the masked first sentence and the second sentence,
generating an input sequence representation of the input sequence by the language model,
performing target segment prediction based at least on the input sequence representation to predict a start position and an end position of at least one target segment in the second sentence, the at least one target segment corresponding to the at least one predefined entry, respectively, and
optimizing the language model based at least on the target segment prediction.
13. The apparatus of claim 12, wherein the target segment prediction comprises, for each of the at least one predefined entry:
calculating, based on the entry representations in the input sequence representation, a target segment start position probability distribution and a target segment end position probability distribution associated with the predefined entry.
14. The apparatus of claim 13, wherein the calculating comprises, for each entry in the input sequence:
calculating, based at least on the representation of the predefined entry and the representation of the entry, a start position probability that the entry is a target segment start position and an end position probability that the entry is a target segment end position.
15. The apparatus of claim 13, wherein the optimizing the language model based at least on the target segment prediction comprises optimizing the language model by at least:
maximizing the start position probability of a real start position entry of a real target segment in the second sentence that corresponds to the predefined entry; and
maximizing the end position probability of the real end position entry of the real target segment.
16. The apparatus of claim 12, wherein the computer-executable instructions, when executed, further cause the at least one processor to:
obtaining, by at least the language model, a representation of the first sentence, a representation of the masked first sentence, and a representation of the second sentence;
taking any two of the representation of the first sentence, the representation of the masked first sentence, and the representation of the second sentence as a positive sample pair for contrastive learning; and
optimizing the language model based at least on the contrastive learning.
17. The apparatus of claim 16, wherein the optimizing the language model based at least on the contrastive learning comprises optimizing the language model by at least:
minimizing latent-space distances between the representation of the first sentence, the representation of the masked first sentence, and the representation of the second sentence.
18. The apparatus of claim 12, wherein the computer-executable instructions, when executed, further cause the at least one processor to:
fine-tuning the language model for the cross-language sequence markup task.
19. The apparatus of claim 18, wherein the fine-tuning comprises:
adding the at least one predefined entry to the input sequence of the cross-language sequence markup task.
20. A computer program product for building a language model suitable for cross-language sequence markup tasks, comprising a computer program for execution by at least one processor for:
obtaining training sentence pairs comprising a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language;
masking at least one original segment in the first sentence with at least one predefined entry to obtain a masked first sentence;
forming an input sequence of the language model using at least the masked first sentence and the second sentence;
generating an input sequence representation of the input sequence by the language model;
performing target segment prediction based at least on the input sequence representation to predict a start position and an end position of at least one target segment in the second sentence, the at least one target segment corresponding to the at least one predefined entry, respectively; and
optimizing the language model based at least on the target segment prediction.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210434197.4A CN116976340A (en) | 2022-04-24 | 2022-04-24 | Building a language model suitable for cross-language sequence markup tasks |
PCT/US2023/012468 WO2023211525A1 (en) | 2022-04-24 | 2023-02-07 | Establishing a language model adapted to a cross-lingual sequence labeling task |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210434197.4A CN116976340A (en) | 2022-04-24 | 2022-04-24 | Building a language model suitable for cross-language sequence markup tasks |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116976340A (en) | 2023-10-31 |
Family
ID=85556811
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210434197.4A (Pending) CN116976340A (en) | Building a language model suitable for cross-language sequence markup tasks | 2022-04-24 | 2022-04-24 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN116976340A (en) |
WO (1) | WO2023211525A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117236335B (en) * | 2023-11-13 | 2024-01-30 | 江西师范大学 | Two-stage named entity recognition method based on prompt learning |
CN117744657B (en) * | 2023-12-26 | 2024-06-18 | 广东外语外贸大学 | Medicine adverse event detection method and system based on neural network model |
- 2022-04-24: CN application CN202210434197.4A filed; published as CN116976340A (status: pending)
- 2023-02-07: PCT application PCT/US2023/012468 filed; published as WO2023211525A1 (status: unknown)
Also Published As
Publication number | Publication date |
---|---|
WO2023211525A1 (en) | 2023-11-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |