WO2023211525A1 - Establishing a language model adapted to a cross-lingual sequence labeling task - Google Patents

Establishing a language model adapted to a cross-lingual sequence labeling task Download PDF

Info

Publication number
WO2023211525A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
representation
language model
token
language
Prior art date
Application number
PCT/US2023/012468
Other languages
French (fr)
Inventor
Ming GONG
Linjun SHOU
Daxin Jiang
Original Assignee
Microsoft Technology Licensing, Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing, Llc
Publication of WO2023211525A1

Classifications

    • G06F40/295: Named entity recognition
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06F40/117: Tagging; Marking up; Designating a block; Setting of attributes
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/242: Dictionaries
    • G06F40/279: Recognition of textual entities
    • G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06N3/047: Probabilistic or stochastic networks
    • G06N3/0895: Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G06N3/096: Transfer learning
    • G06N3/0455: Auto-encoder networks; Encoder-decoder networks

Definitions

  • a pre-trained language model (PLM) may be deployed to various downstream tasks, and dominates natural language understanding and generation areas.
  • a PLM may be extended to a cross-lingual pre-trained language model (xPLM).
  • the xPLM may be pre-trained with token-level pre-training tasks on a large multi-lingual corpus.
  • the pre-trained xPLM may be transferred or deployed to downstream tasks.
  • Embodiments of the present disclosure propose methods, apparatuses, computer program products and computer-readable mediums for establishing a language model adapted to a cross-lingual sequence labeling task.
  • a training sentence pair including a first sentence in a first language and a second sentence in a second language may be obtained, the second sentence being a version of the first sentence in the second language.
  • At least one original span in the first sentence may be masked with at least one predefined token, to obtain the masked first sentence.
  • An input sequence for the language model may be formed with at least the masked first sentence and the second sentence.
  • An input sequence representation of the input sequence may be generated through the language model.
  • Target span prediction may be performed based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively.
  • the language model may be optimized based at least on the target span prediction.
  • FIG.1 illustrates an exemplary process of pre-training and fine-tuning a language model according to an embodiment.
  • FIG.2 illustrates an exemplary process of a Cross-lingual Language Informative Span Masking (CLISM) strategy in a pre-training stage according to an embodiment.
  • FIG.3 illustrates an example of forming an input sequence in a CLISM strategy according to an embodiment.
  • FIG.4 illustrates an exemplary process of a ContrAstive-Consistency Regularization (CACR) strategy in a pre-training stage according to an embodiment.
  • FIG.5 illustrates a flowchart of an exemplary method for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
  • FIG.6 illustrates an exemplary apparatus for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
  • FIG.7 illustrates an exemplary apparatus for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
  • a Sequence Labeling (SL) task is a task that targets span extraction, which may include, e.g., named entity recognition (NER), machine reading comprehension (MRC), question-answering, relation recognition, event extraction, etc.
  • a cross-lingual sequence labeling (xSL) task aims to extend the SL task to be performed on different languages.
  • the xSL task may require extending the boundary of the SL task to low-resource languages, which will face the challenge of limited training data for the low-resource languages.
  • An xPLM may be transferred or deployed to an xSL task, and may show a certain degree of effectiveness in the xSL task by, e.g., transferring knowledge from a high-resource language to low-resource languages.
  • when transferring a pre-trained xPLM to an xSL task in a fine-tuning stage, a discrepancy or gap in training objectives between the pre-training stage and the fine-tuning stage may arise, which makes the xPLM not well adapted to the xSL task.
  • the language model may be trained with a pre-training task such as masked language modeling (MLM), etc., wherein a training objective of the MLM requires local understanding of a masked token.
  • in the fine-tuning stage, the pre-trained language model is typically trained with distantly supervised multi-lingual task-related instances, wherein a training objective of span extraction of the xSL task requires global understanding and reasoning over, e.g., an input question and passage. This will lead to a gap between the training objective in the pre-training stage and the training objective in the fine-tuning stage, and will further lead to the resulting xPLM not being well adapted to the xSL task.
  • the embodiments of the present disclosure aim to establish a language model adapted to an xSL task.
  • the embodiments of the present disclosure may enable an xPLM to obtain characteristics adapted to the xSL task in a pre-training stage, and thus a pre-trained language model may obtain better performance for the xSL task in a fine-tuning stage.
  • the embodiments of the present disclosure propose a pre-training strategy customized for an xSL task in a pre-training stage, which may be referred to as a Cross-lingual Language Informative Span Masking (CLISM) strategy or a CLISM task.
  • the CLISM strategy eliminates a training objective gap between the pre-training stage and a fine-tuning stage for the xSL in a self-supervised approach.
  • the embodiments of the present disclosure propose a strategy for enhancing alignment capability in a pre-training stage, which may be referred to as a ContrAstive-Consistency Regularization (CACR) strategy.
  • the CACR strategy may encourage an xPLM to better capture alignment between cross-lingual representations.
  • the CACR strategy may leverage contrastive learning to encourage consistency between representations of input parallel sequences.
  • the embodiments of the present disclosure may not only eliminate a gap between an objective of pre-training and an objective of fine-tuning, but may also enhance the capability of a language model to better capture alignment between cross-lingual representations at the sentence level. According to the embodiments of the present disclosure, even with limited training data in the pre-training stage, an established language model is able to achieve better performance.
  • the language model established according to the embodiments of the present disclosure may achieve good applicability in various xSL tasks with limited training data. For example, even in the case of few-shot data settings where only a few training instances are available or zero-shot data settings where no training instance is available, the language model established according to the embodiments of the present disclosure is still able to achieve good applicability.
  • FIG.1 illustrates an exemplary process 100 of pre-training and fine-tuning a language model according to an embodiment.
  • the process 100 is performed for pre-training and fine-tuning a language model 110 in order to apply the language model 110 to an xSL task.
  • the language model 110 may be trained with a training dataset 102 which is based on multi-lingual parallel corpus.
  • the parallel corpus may be divided into multiple sub-groups.
  • Each sub-group may be referred to as a language-informative group.
  • each sub-group may include two parallel versions of the same sentence in two different languages, e.g., a sentence in a first language and a sentence in a second language.
  • a sentence in a first language will be referred to as a source language sentence
  • a sentence in a second language will be referred to as a target language sentence.
  • a target language sentence may be a version or translation of a source language sentence in a target language, or a source language sentence may be a version or translation of a target language sentence in a source language. Since a source language sentence and a target language sentence in each sub-group are in parallel, there will be multiple meaning-aligned spans or tokens between these two sentences.
  • a token may correspond to a word or a part of a word, and a span may include one or more tokens.
  • a source language sentence and a target language sentence in each sub-group form a training sentence pair.
  • a CLISM strategy 120 may be adopted in the pre-training stage.
  • the CLISM strategy may eliminate a training objective gap between the pre-training stage and a fine-tuning stage for xSL.
  • the CLISM strategy may also be referred to as a CLISM task, which is a pre-training task customized for an xSL task.
  • the CLISM task is a self-supervised task. Taking an xSL task of cross-lingual MRC (xMRC) which is similar to question-answering as an example, one focus of the CLISM task is how to create multi-lingual <question, answer> training pairs.
  • the CLISM task is not limited to this, but may be customized for any other type of xSL task in a similar approach.
  • each original span may be an n-ary span, which includes n tokens.
  • An original span may be, e.g., a named entity, a phrase, etc. It should be understood that the embodiments of the present disclosure are not limited to any specific type of original span.
  • a masked source language sentence may be obtained through performing masking on the source language sentence.
  • An input sequence for the language model 110 may be formed with at least the masked source language sentence and the corresponding unmasked target language sentence.
  • one or more selected original spans in the source language sentence may be masked by one or more predefined tokens, respectively.
  • a predefined token may be, e.g., [QUE], etc. It should be understood that although a predefined token is exemplarily expressed as [QUE] hereinafter, the embodiments of the present disclosure are not limited to any specific expressions of a predefined token, e.g., a predefined token may also be in any other expressions than [QUE]. Then, the language model 110 may be required to find the correct start position and end position of a target span corresponding to each predefined token from the target language sentence based on global contextual understanding of the masked source language sentence and the target language sentence.
  • each target span has the same meaning as the corresponding original span masked by the predefined token. Accordingly, each predefined token [QUE] may be regarded as, e.g., a question in the xMRC task, and the target span, in the target language sentence, corresponding to this predefined token [QUE] may be regarded as, e.g., an answer in the xMRC task.
  • the language model 110 needs to predict whether each token in the target language sentence is a start position of an answer (i.e., a start position token) or an end position of an answer (i.e., an end position token). This is essentially consistent with the xSL task, because span extraction in the xSL task also aims to find a start position and an end position of a target span. Thus, through the CLISM strategy, a training objective of the pre-training stage will be close to or consistent with a training objective of the subsequent fine-tuning stage for the xSL task. Thus, the pre-trained language model 110 will be more adapted to the xSL task.
  • a CACR strategy 130 may be adopted in the pre-training stage.
  • the CACR strategy may enable the language model 110 to better capture alignment of the same sentence between different languages, and avoid learning representations that are affected by noisy data.
  • noisy data may include, e.g., a predefined token [QUE] used in masking, two sentences in a training sentence pair that are not fully semantically paired, etc.
  • the CACR strategy may, through contrastive learning, make representations of sentences originating from the same training sentence pair as close as possible in a latent space.
  • a source language sentence, a masked source language sentence and a target language sentence originating from the same training sentence pair may be taken as parallel sequences, and the CACR strategy may utilize contrastive learning to encourage the language model 110 to achieve consistency among representations of the multiple sentences in the parallel sequences.
  • the CACR strategy may, through contrastive learning, make representations of sentences originating from different training sentence pairs as far as possible in a latent space.
  • the CACR strategy may utilize contrastive learning to encourage the language model 110 to implement distinctiveness between representations of a source language sentence, a masked source language sentence and a target language sentence originating from a first training sentence pair and representations of a source language sentence, a masked source language sentence and a target language sentence originating from a second training sentence pair.
  • the pre-trained language model 110 may be further trained with a training dataset 104.
  • the training dataset 104 may be specific to an xSL task 140.
  • the training dataset 104 may be a training dataset for the xMRC task. It should be understood that the embodiments of the present disclosure are not limited to any specific types of xSL task. Since a training objective gap between the pre-training stage and the fine-tuning stage for xSL is eliminated in the pre-training stage through, e.g., the CLISM strategy, etc., the pre-trained language model 110 will be more adapted to the xSL task. When performing fine-tuning for the xSL task on such pre-trained language model 110, a language model with higher applicability can be finally obtained.
  • a predefined token [QUE] may be added to an input sequence of the xSL task.
  • taking the xSL task being an MRC task as an example, the at least one predefined token [QUE] will capture enough information about the question, and a representation of the at least one predefined token [QUE] will facilitate selection of a ground-truth answer span in the xSL task.
  • any known fine-tuning process may be then applied to the pre-trained language model.
  • the resulting pre-trained language model will exhibit its performance advantages of being adapted to xSL tasks in any fine-tuning process.
  • FIG.2 illustrates an exemplary process 200 of a CLISM strategy in a pre-training stage according to an embodiment.
  • An exemplary training sentence pair 202 may be obtained from, e.g., the training dataset 102 in FIG.1.
  • the training sentence pair 202 may include a source language sentence as a first sentence and a target language sentence as a second sentence, wherein the source language sentence adopts a source language as a first language, and the target language sentence adopts a target language as a second language.
  • the target language sentence is a version of the source language sentence in the target language
  • the source language sentence is a version of the target language sentence in the source language.
  • the source language sentence may be represented as s_s.
  • the target language sentence may be represented as s_t.
  • At 210, at least one original span to be masked may be selected from the source language sentence.
  • taking the original span being a named entity as an example, a named entity span, as the original span, may be selected from the source language sentence through various named entity recognition tools. It should be understood that the embodiments of the present disclosure are neither limited to any specific type of original span, nor limited to any specific technique for selecting an original span.
  • the at least one original span selected or identified from the source language sentence s_s may form an original span set S.
  • the selecting operation at 210 may also follow some predetermined rules so that spans with semantic meanings may be selected. For example, spans which only contain stop words may be filtered out. For example, it may be specified that the boundaries of a selected span must fall on whole words. For example, a sequence length of each selected span should not exceed a predetermined maximum sequence length threshold, e.g., 10, etc.
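  • As a non-authoritative illustration of these selection rules, the following Python sketch applies the stop-word and maximum-length filters to candidate spans; the helper name select_informative_spans, the toy stop-word list, and the assumption that candidates already come from an external named entity recognition tool as whole-word spans are all illustrative assumptions rather than part of this disclosure.

```python
# Hypothetical sketch of the span-selection rules described above.
# Candidate spans (e.g., named entities) are assumed to already come from an
# external NER tool as whole-word spans, so the word-boundary rule holds by
# construction.
STOP_WORDS = {"the", "a", "an", "of", "to", "in"}   # toy stop-word list
MAX_SPAN_LEN = 10  # predetermined maximum sequence length threshold

def select_informative_spans(candidate_spans):
    """Keep candidate spans that carry semantic meaning."""
    selected = []
    for span in candidate_spans:
        if all(tok.lower() in STOP_WORDS for tok in span):
            continue  # filter out spans that only contain stop words
        if len(span) > MAX_SPAN_LEN:
            continue  # filter out spans that exceed the length threshold
        selected.append(span)
    return selected

print(select_informative_spans([["List"], ["of", "the"], ["Ottawa"]]))
# -> [['List'], ['Ottawa']]
```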
  • the selected at least one original span in the source language sentence may be masked by at least one predefined token, so as to obtain the masked source language sentence.
  • the masking operation may include: replacing the selected at least one original span with at least one predefined token. For example, for each original span in the original span set S, the original span may be replaced, in the source language sentence, by a single predefined token [QUE]. Through masking all original spans, the masked source language sentence may be obtained, which may be represented as s̃_s.
  • an input sequence for the language model may be formed with at least the masked source language sentence and the unmasked target language sentence.
  • a final input sequence X may be obtained through cascading the masked source language sentence s̃_s, the target language sentence s_t, and special tokens such as [CLS], [SEP], etc.
  • the input sequence X may be represented as X = [CLS] s̃_s [SEP] s_t [SEP], wherein [CLS] is a classification marker token and [SEP] is a sentence separation token. It should be understood that the embodiments of the present disclosure are not limited to any specific approach of utilizing the masked source language sentence and the target language sentence to form the input sequence, e.g., any special tokens may be omitted, any other special tokens may be added, etc.
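  • The masking and cascading described above can be sketched as follows. This is a minimal illustration that assumes whitespace tokenization and hypothetical helper names, and it uses the example sentences of FIG.3 below; an actual implementation would operate on the subword tokens of the underlying xPLM.

```python
# Hypothetical sketch: mask selected original spans in the source sentence
# with [QUE] and cascade the result with the target sentence as
# X = [CLS] masked_source [SEP] target [SEP].
def mask_spans(source_tokens, span_positions):
    """Replace each selected original span with a single [QUE] token.

    span_positions: list of (start, end) word indices, end exclusive.
    """
    masked = []
    i = 0
    span_end_by_start = {s: e for s, e in span_positions}
    while i < len(source_tokens):
        if i in span_end_by_start:
            masked.append("[QUE]")
            i = span_end_by_start[i]   # skip the whole original span
        else:
            masked.append(source_tokens[i])
            i += 1
    return masked

def build_input_sequence(masked_source, target_tokens):
    return ["[CLS]"] + masked_source + ["[SEP]"] + target_tokens + ["[SEP]"]

source = "List of Ottawa buildings".split()
target = "Danh sách các tòa nhà Ottawa".split()
masked = mask_spans(source, [(0, 1), (2, 3)])   # mask "List" and "Ottawa"
print(build_input_sequence(masked, target))
# -> ['[CLS]', '[QUE]', 'of', '[QUE]', 'buildings', '[SEP]',
#     'Danh', 'sách', 'các', 'tòa', 'nhà', 'Ottawa', '[SEP]']
```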
  • FIG.3 illustrates an example of forming an input sequence in a CLISM strategy according to an embodiment.
  • a training sentence pair includes a source language sentence 310 and a target language sentence 320.
  • the source language sentence 310 is an English sentence "List of Ottawa buildings".
  • the target language sentence 320 is a Vietnamese sentence "Danh sách các tòa nhà Ottawa".
  • an original span 312 "List” and an original span 314 "Ottawa” are selected from the source language sentence 310 according to the operation at 210 of FIG.2.
  • the original span 312 corresponds to or is aligned with a target span 322 "Danh sách" in the target language sentence 320, i.e., the target span 322 is the Vietnamese version of the original span 312.
  • the original span 314 corresponds to or is aligned with a target span 324 "Ottawa" in the target language sentence 320, and the target span 324 and the original span 314 have the same expression.
  • the original span 312 may be replaced by a predefined token [QUE] 332 and the original span 314 may be replaced by a predefined token [QUE] 334, thereby obtaining a masked source language sentence 330.
  • an input sequence 340 for the language model may be formed with at least the masked source language sentence 330 and the target language sentence 320.
  • the input sequence 340 is formed through sequentially cascading a special token [CLS] 342, the masked source language sentence 330, the special token [SEP] 344, the target language sentence 320 and a special token [SEP] 346.
  • the input sequence may be provided to a language model 240.
  • the language model 240 may generate an input sequence representation 242 of the input sequence.
  • the generation of the input sequence representation 242 may include generating a representation of each token in the input sequence. Since the input sequence X contains at least the masked source language sentence s̃_s and the target language sentence s_t, a representation of each token generated by the language model may be a global contextual semantic representation.
  • target span prediction may be performed based at least on the input sequence representation 242.
  • the target span prediction aims to predict a start position and an end position of at least one target span in the target language sentence, wherein the at least one target span corresponds to the at least one predefined token in the masked source language sentence respectively.
  • a start position and an end position of a target span, corresponding to the predefined token, in the target language sentence may be predicted.
  • a start position of a target span may be indicated by a token at the start position, therefore, prediction of a start position of a target span may refer to determining which token is a start position of the target span and thus serves as a start position token.
  • prediction of an end position of a target span may refer to determining which token is an end position of the target span and thus serves as an end position token.
  • target span start position probability distribution and target span end position probability distribution associated with the predefined token may be calculated based on token representations in the input sequence representation 242.
  • the target span start position probability distribution may indicate probabilities that different tokens serve as a start position of a target span.
  • the target span end position probability distribution may indicate probabilities that different tokens serve as an end position of a target span. Taking a specific token in the input sequence as an example, a start position probability that the token is a start position of a target span and an end position probability that the token is an end position of a target span may be calculated based at least on a representation of a predefined token and a representation of the specific token.
  • each predefined token [QUE] in the masked source language sentence may be regarded as a question Q, and the objective of the target span prediction is: for a given question Q corresponding to a given predefined token [QUE], a correct answer corresponding to an aligned target span in the target language sentence s_t is predicted based on the meaning of the input sequence X.
  • Multiple predefined tokens in the masked source language sentence may serve as a set of questions that need to be answered simultaneously.
  • a representation x_q of the question Q may be obtained from the input sequence representation 242. Then, a dynamic start vector s_q and a dynamic end vector e_q may be calculated according to:
    s_q = W_s x_q, e_q = W_e x_q    Equation (1)
    wherein W_s and W_e are learnable parameters.
  • the dynamic start vector s_q is a sub-representation derived from the representation x_q of the question Q for subsequent calculation of the target span start position probability distribution.
  • the dynamic end vector e_q is a sub-representation derived from the representation x_q of the question Q for subsequent calculation of the target span end position probability distribution.
  • an inner product of the learned vector s_q with each token representation in the input sequence representation of the input sequence X may be calculated according to Equation (2) to obtain the target span start position probability distribution, and an inner product of the learned vector e_q with each token representation in the input sequence representation of the input sequence X may be calculated according to Equation (3) to obtain the target span end position probability distribution, as follows:
    P_start(k | X, Q) = exp(s_q · x_k) / Σ_j exp(s_q · x_j)    Equation (2)
    P_end(k | X, Q) = exp(e_q · x_k) / Σ_j exp(e_q · x_j)    Equation (3)
  • Equation (2) calculates, for a token k at the k-th position in the input sequence X, a start position probability that the token k is a start position of a target span, and Equation (3) calculates, for a token k at the k-th position in the input sequence X, an end position probability that the token k is an end position of a target span.
  • the index j in Equation (2) and Equation (3) is an index for tokens in the input sequence X, and x_k denotes a representation of the token at the k-th position.
  • start position probabilities and end position probabilities are calculated for each token in the input sequence in the above description, optionally, start position probabilities and end position probabilities may also be calculated only for tokens in the target language sentence in the input sequence.
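  • The computation of Equations (1)-(3) can be sketched with NumPy as follows; the toy hidden size, the random values, and the variable names (x_q for a [QUE] representation, X_tok for all token representations) are illustrative assumptions.

```python
# Sketch of Equations (1)-(3): derive dynamic start/end vectors from the
# representation of a [QUE] token, then softmax inner products with every
# token representation to obtain start/end position distributions.
import numpy as np

d = 8                                   # hidden size (toy value)
rng = np.random.default_rng(0)
W_s = rng.normal(size=(d, d))           # learnable parameters (Equation 1)
W_e = rng.normal(size=(d, d))
x_q = rng.normal(size=d)                # representation of a predefined token [QUE]
X_tok = rng.normal(size=(12, d))        # representations of all tokens in X

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

s_q = W_s @ x_q                          # dynamic start vector (Equation 1)
e_q = W_e @ x_q                          # dynamic end vector   (Equation 1)

p_start = softmax(X_tok @ s_q)           # Equation (2): P(k is a start position)
p_end = softmax(X_tok @ e_q)             # Equation (3): P(k is an end position)
print(p_start.argmax(), p_end.argmax())  # most likely start/end token indices
```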
  • the language model 240 may be optimized through the performing of the target span prediction.
  • alignment may be performed between the source language sentence and the target language sentence.
  • alignment may be performed between spans, tokens or words in the source language sentence and spans, tokens or words in the target language sentence.
  • at least one target span in the target language sentence aligned with at least one selected original span in the source language sentence may at least be identified.
  • the alignment operation at 260 may be performed through any alignment tools or techniques.
  • the alignment operation at 260 may be used for determining a ground-truth target span, corresponding to a specific predefined token, in the target language sentence.
  • a target span aligned with the specific original span determined through the alignment operation at 260 may be used as a ground-truth target span corresponding to the specific predefined token.
  • the alignment operation at 260 may align the original span 312 with the target span 322; accordingly, in the case that the original span 312 is masked by the predefined token 332, a ground-truth target span corresponding to the predefined token 332 may be determined to be the target span 322.
  • the language model 240 may be optimized based at least on the target span prediction.
  • a start position probability of a ground-truth start position token of a ground-truth target span, corresponding to the predefined token, in the target language sentence may be maximized, and an end position probability of a ground-truth end position token of the ground-truth target span may be maximized.
  • the ground-truth start position token may refer to a token at a start position of the ground-truth target span
  • the ground-truth end position token may refer to a token at an end position of the ground-truth target span.
  • the optimization operation at 270 may be implemented through constructing a loss function and minimizing a loss calculated through the loss function.
  • a loss function for the CLISM strategy may be constructed as:
    ℒ_CLISM = - log P(StartIndex = a_i^s | X, Q) - log P(EndIndex = a_i^e | X, Q)    Equation (4)
    wherein a_i denotes a ground-truth target span corresponding to the i-th question Q in the target language sentence, a_i^s denotes a start position of the ground-truth target span a_i, and a_i^e denotes an end position of the ground-truth target span a_i.
  • the term - log P(StartIndex = a_i^s | X, Q) in Equation (4) represents a start position probability of a ground-truth start position token of the ground-truth target span, and the term - log P(EndIndex = a_i^e | X, Q) represents an end position probability of a ground-truth end position token of the ground-truth target span.
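  • A minimal sketch of the CLISM loss in Equation (4), assuming that the start and end position distributions have been computed as above and that the ground-truth start and end indices are already known from the alignment operation at 260; the toy distributions are illustrative.

```python
# Sketch of Equation (4): negative log-likelihood of the ground-truth
# start and end positions of the target span corresponding to one [QUE].
import numpy as np

def clism_loss(p_start, p_end, gt_start, gt_end):
    """p_start/p_end: position probability distributions over input tokens;
    gt_start/gt_end: indices of the ground-truth start/end position tokens."""
    return -np.log(p_start[gt_start]) - np.log(p_end[gt_end])

# toy example: 6-token input, ground-truth target span occupies tokens 3..4
p_start = np.array([0.05, 0.05, 0.1, 0.6, 0.1, 0.1])
p_end = np.array([0.05, 0.05, 0.1, 0.1, 0.6, 0.1])
print(clism_loss(p_start, p_end, gt_start=3, gt_end=4))  # ≈ 1.02
```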
  • the process 200 may be iteratively performed respectively for different training sentence pairs in the training dataset, so that the language model may be continuously pre-trained with different training sentence pairs.
  • the alignment operation at 260 does not have a specific performing order relative to other operations in the process 200, but only needs to be performed between the operation at 210 and the operation at 270.
  • all the above equations are exemplary and the embodiments of the present disclosure are not limited to any details of these equations, but may encompass any changes to these equations and any other equations for similar purposes.
  • FIG.4 illustrates an exemplary process 400 of a CACR strategy in a pre-training stage according to an embodiment.
  • the CACR strategy may help the language model to avoid learning representations affected by noisy data during pre-training, and may facilitate the language model to better capture alignment between representations of sentences in parallel sequences.
  • a source language sentence 402 and a target language sentence 406 may be from the same training sentence pair.
  • a masked source language sentence 404 may be obtained through performing, e.g., the masking operation at 220 in FIG.2 on the source language sentence 402.
  • a representation of the source language sentence 402, a representation of the masked source language sentence 404, and a representation of the target language sentence 406 may be obtained at least through a language model 410, and these representations are for further use in contrastive learning.
  • a hidden representation 412 of the source language sentence, a hidden representation 414 of the masked source language sentence, and a hidden representation 416 of the target language sentence may be obtained through the language model 410, respectively. Then, sequence lengths of these hidden representations may be unified by applying, e.g., an aggregation layer 420.
  • an aggregated representation 422 of the source language sentence may be obtained through applying the aggregation layer 420 to the hidden representation 412 of the source language sentence
  • an aggregated representation 424 of the masked source language sentence may be obtained through applying the aggregation layer 420 to the hidden representation 414 of the masked source language sentence
  • an aggregated representation 426 of the target language sentence may be obtained through applying the aggregation layer 420 to the hidden representation 416 of the target language sentence.
  • the aggregated representation 422, the aggregated representation 424, and the aggregated representation 426 may respectively serve as a representation of the source language sentence, a representation of the masked source language sentence, and a representation of the target language sentence that are finally obtained.
  • the hidden representation 414 of the masked source language sentence and the hidden representation 416 of the target language sentence may also be extracted directly from the input sequence representation.
  • the obtained representation of the source language sentence, the representation of the masked source language sentence, and the representation of the target language sentence may be used for contrastive learning.
  • any two of the representation of the source language sentence, the representation of the masked source language sentence and the representation of the target language sentence may be taken as a positive sample pair for contrastive learning.
  • the representation of the source language sentence, the representation of the masked source language sentence, and the representation of the target language sentence will be as close as possible in a latent space to achieve representation consistency.
  • the language model may be optimized based at least on the contrastive learning.
  • the language model in the optimization operation, may be optimized through minimizing latent space distance among a representation of a source language sentence, a representation of a masked source language sentence, and a representation of a target language sentence originating from the same training sentence pair.
  • the CACR strategy may also, through the contrastive learning, make representations of sentences originating from different training sentence pairs as far as possible in a latent space. For example, assuming that a source language sentence representation, a masked source language sentence representation and a target language sentence representation respectively corresponding to a source language sentence, a masked source language sentence and a target language sentence originating from a first training sentence pair are combined into a first representation set, and a source language sentence representation, a masked source language sentence representation and a target language sentence representation respectively corresponding to a source language sentence, a masked source language sentence and a target language sentence originating from a second training sentence pair are combined into a second representation set, the CACR strategy may make any one representation in the first representation set be as far away as possible from any one representation in the second representation set in a latent space. Accordingly, in the optimization operation, the language model may be optimized through maximizing a latent space distance between any one representation in the first representation set and any one representation in the second representation set.
  • an input sequence representation M of the input sequence may be obtained as follows:
    M = [M_cls; M_s̃; M_sep; M_t; M_sep] ∈ R^(l×d)    Equation (5)
    wherein l is the maximum input sequence length, d is a hidden size, M_s̃ ∈ R^(m×d) denotes a hidden representation of the masked source language sentence s̃_s, M_t ∈ R^(n×d) denotes a hidden representation of the target language sentence s_t, m denotes a sequence length of s̃_s, n denotes a sequence length of s_t, M_cls denotes a representation of the special token [CLS], and M_sep denotes a representation of the special token [SEP].
  • an extra aggregation layer A (e.g., mean-pooling, etc.) may be applied to obtain an aggregated representation r̃_s of the masked source language sentence and an aggregated representation r_t of the target language sentence s_t, as follows:
    r̃_s = A(M_s̃), r_t = A(M_t)    Equation (6)
    wherein r̃_s, r_t ∈ R^d.
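  • A minimal sketch of Equations (5)-(6), assuming mean-pooling as the aggregation layer; the toy sequence lengths and the slicing of M into the masked-source and target portions are illustrative assumptions.

```python
# Sketch of Equations (5)-(6): slice the hidden representations of the
# masked source sentence and the target sentence out of the input sequence
# representation M, then mean-pool each slice into a single vector.
import numpy as np

d = 8                                   # hidden size (toy value)
m, n = 5, 6                             # lengths of masked source / target sentences
rng = np.random.default_rng(0)
M = rng.normal(size=(1 + m + 1 + n + 1, d))   # [CLS] masked_source [SEP] target [SEP]

M_src = M[1:1 + m]                      # hidden representation of masked source
M_tgt = M[2 + m:2 + m + n]              # hidden representation of target sentence

def aggregate(H):
    """Aggregation layer A, here mean-pooling over the sequence dimension."""
    return H.mean(axis=0)

r_src = aggregate(M_src)                # aggregated representation of masked source
r_tgt = aggregate(M_tgt)                # aggregated representation of target
print(r_src.shape, r_tgt.shape)         # -> (8,) (8,)
```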
  • r̃_s may be used for forming a positive sample pair with r_t, while other data in a mini-batch may be used for forming negative sample pairs.
  • the language model may be optimized through constructing a loss function for r̃_s and r_t and minimizing a loss calculated by the loss function.
  • a loss function may be constructed based on, e.g., a standard contrastive learning objective, as follows:
    ℒ(r̃_s, r_t) = - log( exp(Ψ(r̃_s, r_t)/τ) / Σ_{r' ∈ B} exp(Ψ(r̃_s, r')/τ) )    Equation (7)
    wherein B is a mini-batch, τ is a temperature coefficient denoting a smoothing strategy, and Ψ(·) denotes a cosine similarity function.
  • the CACR strategy further introduces the unmasked source language sentence s_s as an input.
  • the language model may encode s_s to obtain a hidden representation M_s of the source language sentence, and then an aggregated representation r_s of the source language sentence may be obtained through applying the aggregation layer A to M_s.
  • the aggregated representation r_s of the source language sentence, the aggregated representation r̃_s of the masked source language sentence, and the aggregated representation r_t of the target language sentence may be formed into a positive threefold set (r_s, r̃_s, r_t). Every two representations in this positive threefold set are considered as a positive pair, i.e., it is desired that every two representations are as similar as possible in a latent space.
  • a final loss function ℒ_CACR for the CACR strategy may be:
    ℒ_CACR = ℒ(r̃_s, r_t) + ℒ(r_s, r̃_s) + ℒ(r_t, r_s)    Equation (8)
    wherein ℒ(r̃_s, r_t) is a loss function for r̃_s and r_t, ℒ(r_s, r̃_s) is a loss function for r_s and r̃_s, and ℒ(r_t, r_s) is a loss function for r_t and r_s, each constructed in a form similar to Equation (7).
  • the language model may be optimized through minimizing a loss calculated via the loss function ℒ_CACR. Through this approach, the language model is encouraged to learn representations that are not affected by noisy data, and is able to better capture alignment between cross-lingual representations, which may also help alleviate a training objective gap between the pre-training stage and the fine-tuning stage.
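  • A minimal sketch of Equations (7)-(8), assuming cosine similarity, a temperature of 0.1, and in-batch negatives; the function names and the toy representations are illustrative assumptions.

```python
# Sketch of Equations (7)-(8): a contrastive loss over a positive pair with
# in-batch negatives, summed over the three positive pairs formed from the
# source, masked-source and target sentence representations of one pair.
import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def contrastive_loss(anchor, positive, negatives, tau=0.1):
    """Equation (7): -log of the positive similarity normalized over the batch."""
    pos = np.exp(cos(anchor, positive) / tau)
    neg = sum(np.exp(cos(anchor, n) / tau) for n in negatives)
    return -np.log(pos / (pos + neg))

def cacr_loss(r_src, r_masked_src, r_tgt, negatives, tau=0.1):
    """Equation (8): sum of pairwise losses for the positive threefold set."""
    return (contrastive_loss(r_masked_src, r_tgt, negatives, tau)
            + contrastive_loss(r_src, r_masked_src, negatives, tau)
            + contrastive_loss(r_tgt, r_src, negatives, tau))

rng = np.random.default_rng(0)
r_s, r_ms, r_t = rng.normal(size=(3, 8))              # one training sentence pair
negatives = [rng.normal(size=8) for _ in range(7)]    # representations from other pairs
print(cacr_loss(r_s, r_ms, r_t, negatives))
```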
  • an existing pre-training task, e.g., an MLM task, may also be used for optimizing the language model.
  • the total training objective or total loss function ℒ of the language model may be defined as:
    ℒ = ℒ_CLISM + ℒ_CACR + ℒ_MLM    Equation (9)
    wherein ℒ_MLM is a loss function for the MLM task.
  • the language model may be optimized with the total loss function ℒ.
  • the embodiments of the present disclosure actually train a language model in a pre-training stage with a multi-task setting which may include, e.g., CLISM, CACR, MLM, etc. It should be understood that although Equation (9) considers the loss function ℒ_MLM of the MLM task, the embodiments of the present disclosure may also construct the total loss function ℒ with only CLISM and CACR.
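  • The multi-task objective of Equation (9) can be sketched as a simple sum of the individual losses; uniform (unweighted) summation and the optional omission of the MLM term are assumptions in this sketch.

```python
# Sketch of Equation (9): total pre-training loss as the sum of the CLISM,
# CACR and (optionally) MLM losses; uniform weighting is an assumption here.
def total_loss(l_clism, l_cacr, l_mlm=None):
    loss = l_clism + l_cacr
    if l_mlm is not None:          # the MLM term may be omitted
        loss = loss + l_mlm
    return loss

print(total_loss(1.02, 2.35, 0.87))  # -> 4.24
```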
  • fine-tuning for the xSL task may be performed on the language model in the subsequent fine-tuning stage.
  • the fine-tuning stage may be based on any known fine-tuning process.
  • the fine-tuning stage may include improvements proposed by the embodiments of the present disclosure.
  • a predefined token [QUE] may be added to an input sequence of the xSL task.
  • This processing approach may achieve better performance in the xSL task, especially in the case of few-shot data setting or zero-shot data setting.
  • an input sequence for the xMRC task may be:
    X = [CLS] question [QUE] [SEP] passage [SEP]    Equation (10)
  • since the predefined token [QUE] is taken as a question in the pre-training stage, after the pre-training stage, the predefined token [QUE] will be able to capture enough information about the question.
  • a representation of the predefined token [QUE] will facilitate selecting a correct answer span in the xMRC task.
  • Equation (10) only shows adding a single [QUE] to the input sequence, optionally, in the case that at least one predefined token [QUE] is used in the pre-training stage, at least one corresponding predefined token [QUE] may also be added to the input sequence for the xMRC task.
  • the above description gives an example in which the xSL task is an xMRC task, for any other types of xSL task, input sequences for these tasks may also be processed in a similar approach.
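  • A minimal sketch of adding the predefined token [QUE] to an xMRC fine-tuning input; the exact placement of [QUE] relative to the question and the passage, as well as the helper name build_xmrc_input, are illustrative assumptions.

```python
# Hypothetical sketch: build an xMRC fine-tuning input that reuses the
# predefined token [QUE] from pre-training.
def build_xmrc_input(question_tokens, passage_tokens, num_que=1):
    return (["[CLS]"] + question_tokens + ["[QUE]"] * num_que + ["[SEP]"]
            + passage_tokens + ["[SEP]"])

print(build_xmrc_input("Where is Ottawa ?".split(),
                       "Ottawa is the capital of Canada .".split()))
```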
  • FIG.5 illustrates a flowchart of an exemplary method 500 for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
  • a training sentence pair including a first sentence in a first language and a second sentence in a second language may be obtained, the second sentence being a version of the first sentence in the second language.
  • At 520, at least one original span in the first sentence may be masked with at least one predefined token, to obtain the masked first sentence.
  • an input sequence for the language model may be formed with at least the masked first sentence and the second sentence.
  • an input sequence representation of the input sequence may be generated through the language model.
  • target span prediction may be performed based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively.
  • the language model may be optimized based at least on the target span prediction.
  • the method 500 may further comprise: selecting the at least one original span from the first sentence.
  • the masking at least one original span may comprise: replacing the selected at least one original span with the at least one predefined token.
  • the target span prediction may comprise, for each predefined token in the at least one predefined token: calculating target span start position probability distribution and target span end position probability distribution associated with the predefined token based on token representations in the input sequence representation.
  • the calculating may comprise, for each token in the input sequence: calculating, based at least on a representation of the predefined token and a representation of the token, a start position probability that the token is a start position of a target span and an end position probability that the token is an end position of a target span.
  • the optimizing the language model based at least on the target span prediction may comprise optimizing the language model at least through: maximizing a start position probability of a ground-truth start position token of a ground-truth target span in the second sentence corresponding to the predefined token; and maximizing an end position probability of a ground-truth end position token of the ground-truth target span.
  • the method 500 may further comprise: determining the ground-truth target span corresponding to the predefined token through at least performing alignment between the first sentence and the second sentence.
  • the method 500 may further comprise: obtaining a representation of the first sentence, a representation of the masked first sentence and a representation of the second sentence through at least the language model; taking any two of the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence as a positive sample pair for contrastive learning; and optimizing the language model based at least on the contrastive learning.
  • the optimizing the language model based at least on the contrastive learning may comprise optimizing the language model at least through: minimizing latent space distances among the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence.
  • the method 500 may further comprise: optimizing the language model based at least on a masked language modeling task.
  • the method 500 may further comprise: fine-tuning the language model for the cross-lingual sequence labeling task.
  • the fine-tuning may comprise: adding the at least one predefined token to an input sequence for the cross-lingual sequence labeling task.
  • the method 500 may further comprise any steps/processes for establishing a language model adapted to a cross-lingual sequence labeling task according to the embodiments of the present disclosure described above.
  • FIG.6 illustrates an exemplary apparatus 600 for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
  • the apparatus 600 may comprise: a training sentence pair obtaining module 610, for obtaining a training sentence pair including a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language; a masking module 620, for masking at least one original span in the first sentence with at least one predefined token, to obtain the masked first sentence; an input sequence forming module 630, for forming an input sequence for the language model with at least the masked first sentence and the second sentence; an input sequence representation generating module 640, for generating an input sequence representation of the input sequence through the language model; a target span prediction performing module 650, for performing target span prediction based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively; and an optimizing module 660, for optimizing the language model based at least on the target span prediction.
  • the apparatus 600 may further comprise
  • FIG.7 illustrates an exemplary apparatus 700 for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
  • the apparatus 700 may comprise at least one processor 710.
  • the apparatus 700 may further comprise a memory 720 connected with at least one processor 710.
  • the memory 720 may store computer-executable instructions that, when executed, cause the at least one processor 710 to: obtain a training sentence pair including a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language; mask at least one original span in the first sentence with at least one predefined token, to obtain the masked first sentence; form an input sequence for the language model with at least the masked first sentence and the second sentence; generate an input sequence representation of the input sequence through the language model; perform target span prediction based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively; and optimize the language model based at least on the target span prediction.
  • the target span prediction may comprise, for each predefined token in the at least one predefined token: calculating target span start position probability distribution and target span end position probability distribution associated with the predefined token based on token representations in the input sequence representation.
  • the calculating may comprise, for each token in the input sequence: calculating, based at least on a representation of the predefined token and a representation of the token, a start position probability that the token is a start position of a target span and an end position probability that the token is an end position of a target span.
  • the optimizing the language model based at least on the target span prediction may comprise optimizing the language model at least through: maximizing a start position probability of a ground-truth start position token of a ground-truth target span in the second sentence corresponding to the predefined token; and maximizing an end position probability of a ground-truth end position token of the ground-truth target span.
  • the computer-executable instructions when executed, may further cause the at least one processor to: obtain a representation of the first sentence, a representation of the masked first sentence and a representation of the second sentence through at least the language model; take any two of the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence as a positive sample pair for contrastive learning; and optimize the language model based at least on the contrastive learning.
  • the optimizing the language model based at least on the contrastive learning may comprise optimizing the language model at least through: minimizing latent space distances among the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence.
  • the computer-executable instructions when executed, may further cause the at least one processor to: fine-tune the language model for the cross-lingual sequence labeling task.
  • the fine-tuning may comprise: adding the at least one predefined token to an input sequence for the cross-lingual sequence labeling task.
  • the at least one processor 710 may be further configured to perform any other steps/processes of the methods for establishing a language model adapted to a cross-lingual sequence labeling task according to the embodiments of the present disclosure described above.
  • the embodiments of the present disclosure propose a computer program product for establishing a language model adapted to a cross-lingual sequence labeling task.
  • the computer program product may comprise a computer program that is executed by at least one processor for: obtaining a training sentence pair including a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language; masking at least one original span in the first sentence with at least one predefined token, to obtain the masked first sentence; forming an input sequence for the language model with at least the masked first sentence and the second sentence; generating an input sequence representation of the input sequence through the language model; performing target span prediction based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively; and optimizing the language model based at least on the target span prediction.
  • the computer program may be further executed by the at least one processor for performing any other steps/processes of the methods for establishing a language model adapted to a cross-lingual sequence labeling task according to the embodiments of the present disclosure
  • the embodiments of the present disclosure may be embodied in a non-transitory computer- readable medium.
  • the non-transitory computer readable medium may comprise instructions that, when executed, cause one or more processors to perform any steps/processes of the methods for establishing a language model adapted to a cross-lingual sequence labeling task according to the embodiments of the present disclosure described above.
  • modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
  • processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the specific application and overall design constraints imposed on the system.
  • a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a micro-processor, micro-controller, digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure.
  • DSP digital signal processor
  • FPGA field-programmable gate array
  • PLD programmable logic device
  • the functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with software being executed by a microprocessor, micro-controller, DSP, or other suitable platform.
  • a computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk.
  • although a memory is shown as being separate from the processor in various aspects presented in this disclosure, a memory may also be internal to the processor (e.g., a cache or a register).

Abstract

This disclosure proposes to establish a language model adapted to a cross-lingual sequence labeling task. A training sentence pair including a first sentence in a first language and a second sentence in a second language is obtained, the second sentence being a version of the first sentence in the second language. At least one original span in the first sentence is masked by at least one predefined token. An input sequence for the language model is formed with the masked first sentence and the second sentence. An input sequence representation is generated through the language model. Target span prediction is performed based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence which corresponds to the at least one predefined token respectively. The language model is optimized based on the target span prediction.

Description

ESTABLISHING A LANGUAGE MODEL ADAPTED TO A CROSS-LINGUAL SEQUENCE LABELING TASK
BACKGROUND
A pre-trained language model (PLM) may be deployed to various downstream tasks, and PLMs dominate the natural language understanding and generation areas. A PLM may be extended to a cross-lingual pre-trained language model (xPLM). Usually, in a pre-training stage, the xPLM may be pre-trained with token-level pre-training tasks on a large multi-lingual corpus. In a further fine-tuning stage, the pre-trained xPLM may be transferred or deployed to downstream tasks.
SUMMARY
This Summary is provided to introduce a selection of concepts that are further described below in the Detailed Description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Embodiments of the present disclosure propose methods, apparatuses, computer program products and computer-readable mediums for establishing a language model adapted to a cross-lingual sequence labeling task. A training sentence pair including a first sentence in a first language and a second sentence in a second language may be obtained, the second sentence being a version of the first sentence in the second language. At least one original span in the first sentence may be masked with at least one predefined token, to obtain the masked first sentence. An input sequence for the language model may be formed with at least the masked first sentence and the second sentence. An input sequence representation of the input sequence may be generated through the language model. Target span prediction may be performed based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively. The language model may be optimized based at least on the target span prediction. It should be noted that the above one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features are only indicative of the various ways in which the principles of various aspects may be employed, and this disclosure is intended to include all such aspects and their equivalents.
BRIEF DESCRIPTION OF THE DRAWINGS
The disclosed aspects will hereinafter be described in conjunction with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.
FIG.1 illustrates an exemplary process of pre-training and fine-tuning a language model according to an embodiment.
FIG.2 illustrates an exemplary process of a Cross-lingual Language Informative Span Masking (CLISM) strategy in a pre-training stage according to an embodiment.
FIG.3 illustrates an example of forming an input sequence in a CLISM strategy according to an embodiment.
FIG.4 illustrates an exemplary process of a ContrAstive-Consistency Regularization (CACR) strategy in a pre-training stage according to an embodiment.
FIG.5 illustrates a flowchart of an exemplary method for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
FIG.6 illustrates an exemplary apparatus for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
FIG.7 illustrates an exemplary apparatus for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
DETAILED DESCRIPTION
The present disclosure will now be discussed with reference to several example implementations. It should be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.
A Sequence Labeling (SL) task is a task that targets span extraction, which may include, e.g., named entity recognition (NER), machine reading comprehension (MRC), question-answering, relation recognition, event extraction, etc. A cross-lingual sequence labeling task (xSL) aims to extend the SL task to be performed on different languages. The xSL task may require extending the boundary of the SL task to low-resource languages, which will face the challenge of limited training data for the low-resource languages.
An xPLM may be transferred or deployed to an xSL task, and may show a certain degree of effectiveness in the xSL task by, e.g., transferring knowledge from a high-resource language to low-resource languages. However, when transferring a pre-trained xPLM to an xSL task in a fine-tuning stage, a discrepancy or gap in training objectives between the pre-training stage and the fine-tuning stage may arise, which makes the xPLM not well adapted to the xSL task.
In a pre-training stage of a language model, the language model may be trained with a pre-training task such as mask language modeling (MLM), etc., wherein a training objective of the MLM requires local understanding of a masked token. In a fine-tuning stage of the pre-trained language model for an xSL task, the pre-trained language model is typically trained with distantly supervised multi-lingual task-related instances, wherein a training objective of span extraction of the xSL task requires global understanding and reasoning of, e.g., an input question and passage. This will lead to a gap between the training objective in the pre-training stage and the training objective in the fine-tuning stage, and will further lead to the resulting xPLM not being well adapted to the xSL task.
The embodiments of the present disclosure aim to establish a language model adapted to an xSL task. For example, the embodiments of the present disclosure may enable an xPLM to obtain characteristics adapted to the xSL task in a pre-training stage, and thus a pre-trained language model may obtain better performance for the xSL task in a fine-tuning stage.
In an aspect, the embodiments of the present disclosure propose a pre-training strategy customized for an xSL task in a pre-training stage, which may be referred to as a Cross-lingual Language Informative Span Masking (CLISM) strategy or a CLISM task. The CLISM strategy eliminates a training objective gap between the pre-training stage and a fine-tuning stage for the xSL task in a self-supervised approach.
In an aspect, the embodiments of the present disclosure propose a strategy for enhancing alignment capability in a pre-training stage, which may be referred to as a ContrAstive-Consistency Regularization (CACR) strategy. The CACR strategy may encourage an xPLM to better capture alignment between cross-lingual representations. For example, during pre-training, the CACR strategy may leverage contrastive learning to encourage consistency between representations of input parallel sequences.
The embodiments of the present disclosure may not only eliminate a gap between an objective of pre-training and an objective of fine-tuning, but may also enhance the capability of a language model to better capture alignment between cross-lingual representations at a sentence level. According to the embodiments of the present disclosure, even with limited training data in the pre-training stage, an established language model is able to have better performance. The language model established according to the embodiments of the present disclosure may achieve good applicability in various xSL tasks with limited training data. For example, even in the case of few-shot data settings where only a few training instances are available or zero-shot data settings where no training instance is available, the language model established according to the embodiments of the present disclosure is still able to achieve good applicability.
FIG.1 illustrates an exemplary process 100 of pre-training and fine-tuning a language model according to an embodiment. The process 100 is performed for pre-training and fine-tuning a language model 110 in order to apply the language model 110 to an xSL task.
In a pre-training stage, the language model 110 may be trained with a training dataset 102 which is based on multi-lingual parallel corpus. In an implementation, the parallel corpus may be divided into multiple sub-groups. Each sub-group may be referred to as a language-informative group. As an example, each sub-group may include two parallel versions of the same sentence in two different languages, e.g., a sentence in a first language and a sentence in a second language. Hereinafter, a sentence in a first language will be referred to as a source language sentence, and a sentence in a second language will be referred to as a target language sentence. A target language sentence may be a version or translation of a source language sentence in a target language, or a source language sentence may be a version or translation of a target language sentence in a source language. Since a source language sentence and a target language sentence in each sub-group are in parallel, there will be multiple meaning-aligned spans or tokens between these two sentences. Herein, a token may correspond to a word or a part of a word, and a span may include one or more tokens. A source language sentence and a target language sentence in each sub-group form a training sentence pair.
In an implementation, a CLISM strategy 120 may be adopted in the pre-training stage. The CLISM strategy may eliminate a training objective gap between the pre-training stage and a fine-tuning stage for xSL. The CLISM strategy may also be referred to as a CLISM task, which is a pre-training task customized for an xSL task. The CLISM task is a self-supervised task. Taking an xSL task of cross-lingual MRC (xMRC) which is similar to question-answering as an example, one focus of the CLISM task is how to create multi-lingual <question, answer> training pairs. It should be understood that although multiple parts of the following discussion take the CLISM task being customized for a question-answering-type xSL task as an example, CLISM tasks according to the embodiments of the present disclosure are not limited to this, but may be customized for any other types of xSL tasks in a similar approach.
According to the CLISM task, for a training sentence pair including a source language sentence and a target language sentence obtained from the training dataset 102, one or more selected original spans in the source language sentence may be masked. Each original span may be an n-ary span, which includes n tokens. An original span may be, e.g., a named entity, a phrase, etc. It should be understood that the embodiments of the present disclosure are not limited to any specific type of original span. A masked source language sentence may be obtained through performing masking on the source language sentence. An input sequence for the language model 110 may be formed with at least the masked source language sentence and the corresponding unmasked target language sentence. As an example, firstly, one or more selected original spans in the source language sentence may be masked by one or more predefined tokens, respectively. A predefined token may be, e.g., [QUE], etc. It should be understood that although a predefined token is exemplarily expressed as [QUE] hereinafter, the embodiments of the present disclosure are not limited to any specific expression of a predefined token, e.g., a predefined token may also be in any other expression than [QUE]. Then, the language model 110 may be required to find the correct start position and end position of a target span corresponding to each predefined token from the target language sentence based on global contextual understanding of the masked source language sentence and the target language sentence. Each target span has the same meaning as the corresponding original span masked by the predefined token. Accordingly, each predefined token [QUE] may be regarded as, e.g., a question in the xMRC task, and the target span, in the target language sentence, corresponding to this predefined token [QUE] may be regarded as, e.g., an answer in the xMRC task.
Under the CLISM strategy, the language model 110 needs to predict whether each token in the target language sentence is a start position of an answer (i.e., a start position token) or an end position of an answer (i.e., an end position token). This is essentially consistent with the xSL task, because span extraction in the xSL task also aims to find a start position and an end position of a target span. Thus, through the CLISM strategy, a training objective of the pre-training stage will be close to or consistent with a training objective of the subsequent fine-tuning stage for the xSL task. Thus, the pre-trained language model 110 will be more adapted to the xSL task.
In an implementation, a CACR strategy 130 may be adopted in the pre-training stage. The CACR strategy may enable the language model 110 to better capture alignment of the same sentence between different languages, and avoid learning representations that are affected by noisy data. Noisy data may include, e.g., a predefined token [QUE] used in masking, two sentences in a training sentence pair that are not fully semantically paired, etc. In an aspect, the CACR strategy may, through contrastive learning, make representations of sentences originating from the same training sentence pair as close as possible in a latent space. For example, a source language sentence, a masked source language sentence and a target language sentence originating from the same training sentence pair may be taken as parallel sequences, and the CACR strategy may utilize contrastive learning to encourage the language model 110 to achieve consistency among representations of the multiple sentences in the parallel sequences. In an aspect, the CACR strategy may, through contrastive learning, make representations of sentences originating from different training sentence pairs as far as possible in a latent space. For example, the CACR strategy may utilize contrastive learning to encourage the language model 110 to implement distinctiveness between representations of a source language sentence, a masked source language sentence and a target language sentence originating from a first training sentence pair and representations of a source language sentence, a masked source language sentence and a target language sentence originating from a second training sentence pair.
During a fine-tuning stage, the pre-trained language model 110 may be further trained with a training dataset 104. The training dataset 104 may be specific to an xSL task 140. Assuming that the xSL task 140 is an xMRC task, the training dataset 104 may be a training dataset for the xMRC task. It should be understood that the embodiments of the present disclosure are not limited to any specific types of xSL task. Since a training objective gap between the pre-training stage and the fine-tuning stage for xSL is eliminated in the pre-training stage through, e.g., the CLISM strategy, etc., the pre-trained language model 110 will be more adapted to the xSL task. When performing fine-tuning for the xSL task on such pre-trained language model 110, a language model with higher applicability can be finally obtained.
In an implementation, optionally, in the fine-tuning stage, a predefined token [QUE] may be added to an input sequence of the xSL task. Taking the xSL task being an MRC task as an example, since at least one predefined token [QUE] is taken as a question in the pre-training stage, after pre-training, the at least one predefined token [QUE] will capture enough information about the question. Through adding the at least one predefined token [QUE] to an input sequence of the xSL task, a representation of the at least one predefined token [QUE] will facilitate selecting a ground-truth answer span in the xSL task.
It should be understood that although exemplary description of a fine-tuning stage is included in the above description of the process 100, the embodiments of the present disclosure are not limited to any specific processing in a fine-tuning stage. For example, after a pre-training process according to the embodiments of the present disclosure has been performed on a language model, any known fine-tuning process may be then applied to the pre-trained language model. Benefiting from the pre-training process performed on the language model according to the embodiments of the present disclosure, the resulting pre-trained language model will exhibit its performance advantages of being adapted to xSL tasks in any fine-tuning process.
FIG.2 illustrates an exemplary process 200 of a CLISM strategy in a pre-training stage according to an embodiment.
An exemplary training sentence pair 202 may be obtained from, e.g., the training dataset 102 in FIG.1. The training sentence pair 202 may include a source language sentence as a first sentence and a target language sentence as a second sentence, wherein the source language sentence adopts a source language as a first language, and the target language sentence adopts a target language as a second language. The target language sentence is a version of the source language sentence in the target language, and the source language sentence is a version of the target language sentence in the source language. The source language sentence may be represented as ss, and the target language sentence may be represented as st.
At 210, at least one original span to be masked may be selected from the source language sentence. Taking the original span being a named entity as an example, a named entity span, as the original span, may be selected from the source language sentence through various named entity recognition tools. It should be understood that the embodiments of the present disclosure are neither limited to any specific types of original span, nor limited to any specific technique for selecting an original span. The at least one original span selected or identified from the source language sentence ss may form an original span set S.
In some implementations, the selecting operation at 210 may also follow some predetermined rules so that spans with semantic meanings may be selected. For example, spans which only contain stop words may be filtered out. For example, it may be specified that the boundaries of a selected span must align with word boundaries. For example, a sequence length of each selected span should not exceed a predetermined maximum sequence length threshold, e.g., 10, etc.
At 220, the selected at least one original span in the source language sentence may be masked by at least one predefined token, so as to obtain the masked source language sentence. The masking operation may include: replacing the selected at least one original span with at least one predefined token. For example, for each original span in the original span set S, the original span may be replaced, in the source language sentence, by a single predefined token [QUE]. Through masking all original spans, the masked source language sentence may be obtained, which is represented as $\hat{s}_s$.
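As a rough illustration of the selection rules at 210 and the masking operation at 220, the following Python sketch applies them to a toy, word-level example. The stop-word list, the length threshold value, the helper names and the word-level tokenization are assumptions made only for illustration and are not part of the disclosure.

```python
# Illustrative sketch only: word-level span selection and [QUE] masking.
# Candidate spans are assumed to be given as (start, end) word indices, end exclusive.

STOP_WORDS = {"the", "a", "an", "of", "in", "to"}   # illustrative subset
MAX_SPAN_LEN = 10                                    # example threshold from the text
QUE = "[QUE]"

def select_spans(words, candidate_spans):
    """Keep candidate spans that carry semantic content."""
    selected = []
    for start, end in candidate_spans:
        span_words = words[start:end]
        if all(w.lower() in STOP_WORDS for w in span_words):
            continue                                 # filter spans containing only stop words
        if end - start > MAX_SPAN_LEN:
            continue                                 # filter overly long spans
        selected.append((start, end))
    return selected

def mask_spans(words, spans):
    """Replace each selected original span with a single [QUE] token."""
    masked, i = [], 0
    for start, end in sorted(spans):
        masked.extend(words[i:start])
        masked.append(QUE)                           # one [QUE] per original span
        i = end
    masked.extend(words[i:])
    return masked

words = ["List", "of", "Ottawa", "buildings"]
spans = select_spans(words, [(0, 1), (2, 3)])
print(mask_spans(words, spans))                      # ['[QUE]', 'of', '[QUE]', 'buildings']
```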
At 230, an input sequence for the language model may be formed with at least the masked source language sentence and the unmasked target language sentence. In an implementation, a final input sequence X may be obtained through cascading the masked source language sentence $\hat{s}_s$, the target language sentence $s_t$, and special tokens such as [CLS], [SEP], etc. For example, the input sequence X may be represented as $X = \{[CLS]\ \hat{s}_s\ [SEP]\ s_t\ [SEP]\}$, wherein [CLS] is a classification marker token and [SEP] is a sentence separation token. It should be understood that the embodiments of the present disclosure are not limited to any specific approach of utilizing the masked source language sentence and the target language sentence to form the input sequence, e.g., any special tokens may be omitted, any other special tokens may be added, etc.
FIG.3 illustrates an example of forming an input sequence in a CLISM strategy according to an embodiment.
It is assumed that a training sentence pair includes a source language sentence 310 and a target language sentence 320. The source language sentence 310 is an English sentence "List of Ottawa buildings", and the target language sentence 320 is a Vietnamese sentence "Danh sach cac toa nha Ottawa".
It is assumed that an original span 312 "List" and an original span 314 "Ottawa" are selected from the source language sentence 310 according to the operation at 210 of FIG.2. The original span 312 corresponds to or is aligned with a target span 322 "Danh sach" in the target language sentence 320, i.e., the target span 322 is the Vietnamese version of the original span 312. The original span 314 corresponds to or is aligned with a target span 324 "Ottawa" in the target language sentence 320, and the target span 324 and the original span 314 have the same expression.
According to the operation at 220 of FIG.2, the original span 312 may be replaced by a predefined token [QUE] 332 and the original span 314 may be replaced by a predefined token [QUE] 334, thereby obtaining a masked source language sentence 330.
According to the operation at 230 of FIG.2, an input sequence 340 for the language model may be formed with at least the masked source language sentence 330 and the target language sentence 320. As shown in FIG.3, the input sequence 340 is formed through sequentially cascading a special token [CLS] 342, the masked source language sentence 330, the special token [SEP] 344, the target language sentence 320 and a special token [SEP] 346.
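The cascading shown in FIG.3 can be sketched in the same toy setting; token strings are kept as plain text here for readability, whereas an actual implementation would map them to vocabulary identifiers with the model's tokenizer.

```python
CLS, SEP = "[CLS]", "[SEP]"

def build_input_sequence(masked_source_tokens, target_tokens):
    """Cascade [CLS], the masked source sentence, [SEP], the target sentence, [SEP]."""
    return [CLS] + masked_source_tokens + [SEP] + target_tokens + [SEP]

masked_source = ["[QUE]", "of", "[QUE]", "buildings"]
target = ["Danh", "sach", "cac", "toa", "nha", "Ottawa"]
print(build_input_sequence(masked_source, target))
# ['[CLS]', '[QUE]', 'of', '[QUE]', 'buildings', '[SEP]',
#  'Danh', 'sach', 'cac', 'toa', 'nha', 'Ottawa', '[SEP]']
```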
It should be understood that all the elements in the example of FIG.3 are exemplary, which are only for providing additional explanation for relevant operations in the process 200 of FIG.2, but not for limiting the scope of the embodiments of the present disclosure in any approaches.
Returning to FIG.2, when the input sequence is formed at 230, the input sequence may be provided to a language model 240. The language model 240 may generate an input sequence representation 242 of the input sequence. The language model 240 may be represented as $f_\theta$. The generation of the input sequence representation 242 may include generating a representation of each token in the input sequence. Since the input sequence X contains at least the masked source language sentence $\hat{s}_s$ and the target language sentence $s_t$, a representation of each token generated by the language model may be a global contextual semantic representation.
At 250, target span prediction may be performed based at least on the input sequence representation 242. The target span prediction aims to predict a start position and an end position of at least one target span in the target language sentence, wherein the at least one target span corresponds to the at least one predefined token in the masked source language sentence respectively. For example, in the target span prediction, for each predefined token in the masked source language sentence, a start position and an end position of a target span, corresponding to the predefined token, in the target language sentence may be predicted. A start position of a target span may be indicated by a token at the start position, therefore, prediction of a start position of a target span may refer to determining which token is a start position of the target span and thus serves as a start position token. Similarly, prediction of an end position of a target span may refer to determining which token is an end position of the target span and thus serves as an end position token.
In an implementation, in the target span prediction, for each predefined token in the masked source language sentence, target span start position probability distribution and target span end position probability distribution associated with the predefined token may be calculated based on token representations in the input sequence representation 242. The target span start position probability distribution may indicate probabilities that different tokens serve as a start position of a target span. The target span end position probability distribution may indicate probabilities that different tokens serve as an end position of a target span. Taking a specific token in the input sequence as an example, a start position probability that the token is a start position of a target span and an end position probability that the token is an end position of a target span may be calculated based at least on a representation of a predefined token and a representation of the specific token.
Taking the CLISM task being customized for an xMRC task as an example, each predefined token [QUE] in the masked source language sentence may be regarded as a question Q, and the objective of the target span prediction is: for a given question Q corresponding to a given predefined token [QUE], a correct answer corresponding to an aligned target span in the target language sentence st is predicted based on the meaning of the input sequence X. Multiple predefined tokens in the masked source language sentence may serve as a set of questions that need to be answered simultaneously.
A representation $x_q$ of the question Q may be obtained from the input sequence representation 242. Then, a dynamic start vector $s_q$ and a dynamic end vector $e_q$ may be calculated according to:

$s_q = W_s x_q, \quad e_q = W_e x_q$    Equation (1)

wherein $W_s$ and $W_e$ are learnable parameters. The dynamic start vector $s_q$ is a sub-representation derived from the representation $x_q$ of the question Q for subsequent calculation of target span start position probability distribution, and the dynamic end vector $e_q$ is a sub-representation derived from the representation $x_q$ of the question Q for subsequent calculation of target span end position probability distribution.
Referring to the known standard span-extraction process, an inner product of the learned vector $s_q$ with each token representation in the input sequence representation of the input sequence X may be calculated according to Equation (2) to obtain the target span start position probability distribution, and an inner product of the learned vector $e_q$ with each token representation in the input sequence representation of the input sequence X may be calculated according to Equation (3) to obtain the target span end position probability distribution, as follows:

$P(\text{StartIndex} = k \mid X, Q) = \dfrac{\exp(s_q \cdot x_k)}{\sum_j \exp(s_q \cdot x_j)}$    Equation (2)

$P(\text{EndIndex} = k \mid X, Q) = \dfrac{\exp(e_q \cdot x_k)}{\sum_j \exp(e_q \cdot x_j)}$    Equation (3)
wherein StartIndex = k indicates that a start index is k, i.e., Equation (2) calculates, for a token k at the k-th position in the input sequence X, a start position probability that the token k is a start position of a target span; and EndIndex = k indicates that an end index is k, i.e., Equation (3) calculates, for a token k at the k-th position in the input sequence X, an end position probability that the token k is an end position of a target span. The index j in Equation (2) and Equation (3) is an index for tokens in the input sequence X. Through performing the calculations of Equation (2) and Equation (3) for each token in the input sequence, a start position probability and an end position probability calculated for each token may be obtained, and further, the target span start position probability distribution and the target span end position probability distribution may be formed.
It should be understood that although a start position probability and an end position probability are calculated for each token in the input sequence in the above description, optionally, start position probabilities and end position probabilities may also be calculated only for tokens in the target language sentence in the input sequence.
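A minimal sketch of the span prediction of Equations (1)-(3) is given below in PyTorch-style Python. The tensor shapes, the random example values and the function name are illustrative assumptions rather than the exact architecture of the disclosure.

```python
import torch
import torch.nn.functional as F

def target_span_distributions(x_q, token_reprs, W_s, W_e):
    """
    x_q:         (d,)   representation of one [QUE] token (the question Q)
    token_reprs: (l, d) representations of all tokens in the input sequence X
    W_s, W_e:    (d, d) learnable parameters of Equation (1)
    Returns start/end probability distributions over the l tokens (Equations (2)-(3)).
    """
    s_q = W_s @ x_q                                  # dynamic start vector
    e_q = W_e @ x_q                                  # dynamic end vector
    start_logits = token_reprs @ s_q                 # inner products with each token
    end_logits = token_reprs @ e_q
    return F.softmax(start_logits, dim=-1), F.softmax(end_logits, dim=-1)

# Toy example; the hidden size d and sequence length l are arbitrary.
d, l = 8, 13
W_s, W_e = torch.randn(d, d), torch.randn(d, d)
token_reprs = torch.randn(l, d)
x_q = token_reprs[1]                                 # position of a [QUE] token (assumed)
p_start, p_end = target_span_distributions(x_q, token_reprs, W_s, W_e)
```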
According to the process 200, the language model 240 may be optimized through the performing of the target span prediction. In order to perform the optimization, firstly, at 260, alignment may be performed between the source language sentence and the target language sentence. For example, at 260, alignment may be performed between spans, tokens or words in the source language sentence and spans, tokens or words in the target language sentence. Accordingly, at least one target span in the target language sentence aligned with at least one selected original span in the source language sentence may at least be identified. The alignment operation at 260 may be performed through any alignment tools or techniques. The alignment operation at 260 may be used for determining a ground-truth target span, corresponding to a specific predefined token, in the target language sentence. For example, assuming that the specific predefined token was previously used for masking a specific original span, a target span aligned with the specific original span determined through the alignment operation at 260 may be used as a ground-truth target span corresponding to the specific predefined token. Referring to the example in FIG.3, the alignment operation at 260 may align the original span 312 with the target span 322; accordingly, in the case that the original span 312 is masked by the predefined token 332, a ground-truth target span corresponding to the predefined token 332 may be determined to be the target span 322.
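The bookkeeping that turns word alignments into a ground-truth target span can be sketched as follows, assuming the alignments are already available as (source word index, target word index) pairs produced by any word-alignment tool; the helper name and the end-exclusive index convention are assumptions for illustration.

```python
def ground_truth_target_span(original_span, alignments):
    """
    original_span: (start, end) word indices of a masked span in the source sentence.
    alignments:    iterable of (source_idx, target_idx) pairs from any word aligner.
    Returns (start, end) word indices of the aligned target span, end exclusive,
    or None if no word of the original span is aligned.
    """
    start, end = original_span
    target_indices = [t for s, t in alignments if start <= s < end]
    if not target_indices:
        return None
    return min(target_indices), max(target_indices) + 1

# Example in the spirit of FIG.3: source word 0 ("List") aligns to target words 0-1
# ("Danh sach"); the alignment pairs themselves are illustrative.
alignments = [(0, 0), (0, 1), (2, 5)]
print(ground_truth_target_span((0, 1), alignments))  # (0, 2)
```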
At 270, the language model 240 may be optimized based at least on the target span prediction. In the optimization operation, for a predefined token in the masked source language sentence, a start position probability of a ground-truth start position token of a ground-truth target span, corresponding to the predefined token, in the target language sentence may be maximized, and an end position probability of a ground-truth end position token of the ground-truth target span may be maximized. The ground-truth start position token may refer to a token at a start position of the ground-truth target span, and the ground-truth end position token may refer to a token at an end position of the ground-truth target span.
In an implementation, the optimization operation at 270 may be implemented through constructing a loss function and minimizing a loss calculated through the loss function. A loss function $\mathcal{L}_{CLISM}$ for the CLISM strategy may be constructed based on, e.g., standard cross-entropy loss, as follows:

$\mathcal{L}_{CLISM} = -\log P(\text{StartIndex} = a_i^s \mid X, Q) - \log P(\text{EndIndex} = a_i^e \mid X, Q)$    Equation (4)

wherein $a_i$ denotes a ground-truth target span, in the target language sentence, corresponding to the $i$-th question Q, $a_i^s$ denotes a start position of the ground-truth target span $a_i$, and $a_i^e$ denotes an end position of the ground-truth target span $a_i$. $\log P(\text{StartIndex} = a_i^s \mid X, Q)$ in Equation (4) represents a start position probability of a ground-truth start position token of the ground-truth target span, and $\log P(\text{EndIndex} = a_i^e \mid X, Q)$ in Equation (4) represents an end position probability of a ground-truth end position token of the ground-truth target span. Through the optimization operation described above, the capability of the language model to extract spans across different languages may be enhanced during pre-training and may be smoothly transferred to an xSL task during fine-tuning.
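A corresponding sketch of the loss of Equation (4) for a single question is shown below, reusing the start/end distributions from the span-prediction sketch above; the ground-truth indices are illustrative. In practice one would typically work with logits and a numerically stable cross-entropy rather than taking logarithms of probabilities directly.

```python
import torch

def clism_loss(p_start, p_end, gt_start_idx, gt_end_idx):
    """Cross-entropy over the start/end distributions for one [QUE] question, Equation (4)."""
    return -(torch.log(p_start[gt_start_idx]) + torch.log(p_end[gt_end_idx]))

# Positions 6 and 7 as the ground-truth start/end indices are purely illustrative.
loss = clism_loss(p_start, p_end, gt_start_idx=6, gt_end_idx=7)
```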
It should be understood that all the operations in the process 200 and the order in which they are performed are exemplary, and the embodiments of the present disclosure will cover any changes to the process 200. For example, the process 200 may be iteratively performed respectively for different training sentence pairs in the training dataset, so that the language model may be continuously pre-trained with different training sentence pairs. For example, the alignment operation at 260 does not have a specific performing order relative to other operations in the process 200, but only needs to be performed between the operation at 210 and the operation at 270. For example, all the above equations are exemplary and the embodiments of the present disclosure are not limited to any details of these equations, but may encompass any changes to these equations and any other equations for similar purposes.
FIG.4 illustrates an exemplary process 400 of a CACR strategy in a pre-training stage according to an embodiment.
The CACR strategy may help the language model to avoid learning representations affected by noisy data during pre-training, and may facilitate the language model to better capture alignment between representations of sentences in parallel sequences.
A source language sentence 402 and a target language sentence 406 may be from the same training sentence pair. A masked source language sentence 404 may be obtained through performing, e.g., the masking operation at 220 in FIG.2 on the source language sentence 402.
According to the process 400, a representation of the source language sentence 402, a representation of the masked source language sentence 404, and a representation of the target language sentence 406 may be obtained at least through a language model 410, and these representations are for further use in contrastive learning.
In an implementation, firstly, a hidden representation 412 of the source language sentence, a hidden representation 414 of the masked source language sentence, and a hidden representation 416 of the target language sentence may be obtained through the language model 410, respectively. Then, sequence lengths of these hidden representations may be unified by applying, e.g., an aggregation layer 420. For example, an aggregated representation 422 of the source language sentence may be obtained through applying the aggregation layer 420 to the hidden representation 412 of the source language sentence, an aggregated representation 424 of the masked source language sentence may be obtained through applying the aggregation layer 420 to the hidden representation 414 of the masked source language sentence, and an aggregated representation 426 of the target language sentence may be obtained through applying the aggregation layer 420 to the hidden representation 416 of the target language sentence. The aggregated representation 422, the aggregated representation 424, and the aggregated representation 426 may respectively serve as a representation of the source language sentence, a representation of the masked source language sentence, and a representation of the target language sentence that are finally obtained. It should be understood that, optionally, in the case that an input sequence representation of an input sequence formed by the masked source language sentence 404 and the target language sentence 406 is obtained according to the process 200 in FIG.2, the hidden representation 414 of the masked source language sentence and the hidden representation 416 of the target language sentence may also be extracted directly from the input sequence representation.
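One common choice for the aggregation layer 420 is mean-pooling over token positions, as also noted in the exemplary implementation further below; a minimal sketch, where the padding-mask handling is an assumption.

```python
import torch

def mean_pool(hidden, attention_mask):
    """
    hidden:         (seq_len, d) hidden representation of one sentence
    attention_mask: (seq_len,)   1 for real tokens, 0 for padding
    Returns a single (d,) aggregated sentence representation.
    """
    mask = attention_mask.unsqueeze(-1).float()      # (seq_len, 1)
    summed = (hidden * mask).sum(dim=0)              # sum over non-padded positions
    count = mask.sum().clamp(min=1.0)                # avoid division by zero
    return summed / count
```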
At 430, the obtained representation of the source language sentence, the representation of the masked source language sentence, and the representation of the target language sentence may be used for contrastive learning. In an implementation, in the contrastive learning, any two of the representation of the source language sentence, the representation of the masked source language sentence and the representation of the target language sentence may be taken as a positive sample pair for contrastive learning. Thus, through the contrastive learning, the representation of the source language sentence, the representation of the masked source language sentence, and the representation of the target language sentence will be as close as possible in a latent space to achieve representation consistency.
At 440, the language model may be optimized based at least on the contrastive learning. In an implementation, in the optimization operation, the language model may be optimized through minimizing latent space distance among a representation of a source language sentence, a representation of a masked source language sentence, and a representation of a target language sentence originating from the same training sentence pair.
Moreover, optionally, the CACR strategy may also, through the contrastive learning, make representations of sentences originating from different training sentence pairs as far as possible in a latent space. For example, assuming that a source language sentence representation, a masked source language sentence representation and a target language sentence representation respectively corresponding to a source language sentence, a masked source language sentence and a target language sentence originating from a first training sentence pair are combined into a first representation set, and a source language sentence representation, a masked source language sentence representation and a target language sentence representation respectively corresponding to a source language sentence, a masked source language sentence and a target language sentence originating from a second training sentence pair are combined into a second representation set, the CACR strategy may make any one representation in the first representation set be as far away as possible from any one representation in the second representation set in a latent space. Accordingly, in the optimization operation, the language model may be optimized through maximizing a latent space distance between any one representation in the first representation set and any one representation in the second representation set.
More exemplary implementations of the CACR strategy shown in the process 400 will be presented below.
As described above in connection with FIG.2, the input sequence of the language model is $X = \{[CLS]\ \hat{s}_s\ [SEP]\ s_t\ [SEP]\}$. Through taking X as the input for the language model, an input sequence representation M of the input sequence may be obtained as follows:

$M = f_\theta(X) = \{M_{cls}, M_{\hat{s}_s}, M_{sep}, M_{s_t}, M_{sep}\} \in \mathbb{R}^{l \times d}$    Equation (5)

wherein $l$ is the maximum input sequence length, $d$ is a hidden size, $M_{\hat{s}_s} \in \mathbb{R}^{m \times d}$ denotes a hidden representation of the masked source language sentence $\hat{s}_s$, $M_{s_t} \in \mathbb{R}^{n \times d}$ denotes a hidden representation of the target language sentence $s_t$, $m$ denotes a sequence length of $\hat{s}_s$, $n$ denotes a sequence length of $s_t$, $M_{cls}$ denotes a representation of a special token [CLS], and $M_{sep}$ denotes a representation of a special token [SEP].
Due to the mismatch between m and n, $M_{\hat{s}_s}$ and $M_{s_t}$ cannot be directly used as inputs for contrastive learning. Therefore, an extra aggregation layer $\mathcal{A}$ (e.g., mean-pooling, etc.) may be applied to obtain an aggregated representation $\hat{r}_s$ of the masked source language sentence $\hat{s}_s$ and an aggregated representation $r_t$ of the target language sentence $s_t$, as follows:

$\hat{r}_s = \mathcal{A}(M_{\hat{s}_s}), \quad r_t = \mathcal{A}(M_{s_t})$    Equation (6)

wherein $\hat{r}_s, r_t \in \mathbb{R}^{d}$.
In the contrastive learning, $\hat{r}_s$ may be used for forming a positive sample pair with $r_t$, while other data in a mini-batch may be used for forming negative sample pairs. The language model may be optimized through constructing a loss function $\mathcal{L}_{\hat{r}_s, r_t}$ for $\hat{r}_s$ and $r_t$ and minimizing a loss calculated by the loss function. A loss function $\mathcal{L}_{\hat{r}_s, r_t}$ may be constructed based on, e.g., a standard contrastive learning objective, as follows:

$\mathcal{L}_{\hat{r}_s, r_t} = -\log \dfrac{\exp(\Psi(\hat{r}_s, r_t)/\tau)}{\sum_{r' \in B} \exp(\Psi(\hat{r}_s, r')/\tau)}$    Equation (7)

wherein $B$ is a mini-batch, $\tau$ is a temperature coefficient denoting a smoothing strategy, and $\Psi(\cdot)$ denotes a cosine similarity function.
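Equation (7) follows a standard InfoNCE-style contrastive objective; a minimal sketch over one mini-batch is given below, where the in-batch negative construction, the temperature value and the function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(r_a, r_b, temperature=0.1):
    """
    r_a, r_b: (batch, d) aggregated sentence representations; row i of r_a and row i of
    r_b form a positive pair, and all other rows in the mini-batch act as negatives.
    An InfoNCE-style loss in the spirit of Equation (7).
    """
    r_a = F.normalize(r_a, dim=-1)                   # cosine similarity via normalized dot products
    r_b = F.normalize(r_b, dim=-1)
    sim = (r_a @ r_b.t()) / temperature              # (batch, batch) similarity matrix
    labels = torch.arange(r_a.size(0))               # positives lie on the diagonal
    return F.cross_entropy(sim, labels)
```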
Considering that the masked source language sentence $\hat{s}_s$ is an input interfered with by noises, e.g., it includes a predefined token [QUE], if the language model is optimized only by Equation (7), this may cause the language model to learn representations affected by noisy data, and thereby may cause incorrect representation alignment between the source language sentence and the target language sentence. Therefore, the CACR strategy further introduces the unmasked source language sentence $s_s$ as an input. The language model may encode $s_s$ to obtain a hidden representation $M_{s_s}$ of the source language sentence, and then, an aggregated representation $r_s$ of the source language sentence may be obtained through applying the aggregation layer $\mathcal{A}$ to $M_{s_s}$. Accordingly, the aggregated representation $r_s$ of the source language sentence, the aggregated representation $\hat{r}_s$ of the masked source language sentence, and the aggregated representation $r_t$ of the target language sentence may be formed into a positive threefold set $\{r_s, \hat{r}_s, r_t\}$. Every two representations in this positive threefold set are considered as a positive pair, i.e., it is desired that every two representations are as similar as possible in a latent space. A final loss function $\mathcal{L}_{CACR}$ for the CACR strategy may be:

$\mathcal{L}_{CACR} = \mathcal{L}_{r_s, \hat{r}_s} + \mathcal{L}_{\hat{r}_s, r_t} + \mathcal{L}_{r_t, r_s}$    Equation (8)

wherein $\mathcal{L}_{r_s, \hat{r}_s}$ is a loss function for $r_s$ and $\hat{r}_s$, $\mathcal{L}_{\hat{r}_s, r_t}$ is a loss function for $\hat{r}_s$ and $r_t$, and $\mathcal{L}_{r_t, r_s}$ is a loss function for $r_t$ and $r_s$. The language model may be optimized through minimizing a loss calculated via the loss function $\mathcal{L}_{CACR}$. Through this approach, the language model is encouraged to learn representations that are not affected by noisy data, and is able to better capture alignment between cross-lingual representations, which may also facilitate alleviating a training objective gap between the pre-training stage and the fine-tuning stage.
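The final CACR loss of Equation (8) then simply sums the pairwise losses over the positive threefold set; a sketch reusing the contrastive_loss helper from the previous example.

```python
def cacr_loss(r_src, r_masked_src, r_tgt, temperature=0.1):
    """Sum of pairwise contrastive losses over the positive threefold set, Equation (8)."""
    return (contrastive_loss(r_src, r_masked_src, temperature)
            + contrastive_loss(r_masked_src, r_tgt, temperature)
            + contrastive_loss(r_tgt, r_src, temperature))
```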
It should be understood that all the operations in the process 400 and the order in which they are performed are exemplary, and the embodiments of the present disclosure will cover any changes to the process 400.
According to the embodiments of the present disclosure, optionally, an existing pre-training task, e.g., an MLM task, may also be retained in the pre-training stage. Accordingly, the MLM task may also be used for optimizing a language model. Thus, the total training objective or total loss function $\mathcal{L}$ of the language model may be defined as:

$\mathcal{L} = \mathcal{L}_{CLISM} + \mathcal{L}_{CACR} + \mathcal{L}_{MLM}$    Equation (9)

wherein $\mathcal{L}_{MLM}$ is a loss function for the MLM task. The language model may be optimized with the total loss function $\mathcal{L}$. In this approach, the embodiments of the present disclosure actually train a language model in a pre-training stage with a multi-task setting which may include, e.g., CLISM, CACR, MLM, etc. It should be understood that although Equation (9) considers the loss function $\mathcal{L}_{MLM}$ of the MLM task, the embodiments of the present disclosure may also construct the total loss function $\mathcal{L}$ with only $\mathcal{L}_{CLISM}$ and $\mathcal{L}_{CACR}$.
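Under the multi-task setting of Equation (9), the individual losses are simply summed; a one-line sketch, where the MLM loss is assumed to be produced by whatever MLM head the implementation uses (pass zero to train with CLISM and CACR only).

```python
def total_loss(loss_clism, loss_cacr, loss_mlm=0.0):
    """Total pre-training objective, Equation (9)."""
    return loss_clism + loss_cacr + loss_mlm
```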
After the pre-trained language model is obtained through the above-described processing in the pre-training stage according to the embodiments of the present disclosure, fine-tuning for the xSL task may be performed on the language model in the subsequent fine-tuning stage.
In an implementation, the fine-tuning stage may be based on any known fine-tuning process.
In an implementation, the fine-tuning stage may include improvements proposed by the embodiments of the present disclosure. For example, in order to be consistent with the strategy in the pre-training stage, in the fine-tuning stage, a predefined token [QUE] may be added to an input sequence of the xSL task. This processing approach may achieve better performance in the xSL task, especially in the case of few-shot data setting or zero-shot data setting. Taking the xSL task being an xMRC task as an example, an input sequence for the xMRC task may be:
$X = \{[CLS]\ Q\ [QUE]\ [SEP]\ P\ [SEP]\}$    Equation (10)

wherein Q denotes an input question, and P denotes an input passage. In the input sequence X, a predefined token [QUE] is added after the question Q.
Since the predefined token [QUE] is taken as a question in the pre-training stage, after the pre-training stage, the predefined token [QUE] will be able to capture enough information about the question. Through adding the predefined token [QUE] to the input sequence of the xMRC task, a representation of the predefined token [QUE] will facilitate selecting a correct answer span in the xMRC task. It should be understood that although the above Equation (10) only shows adding a single [QUE] to the input sequence, optionally, in the case that at least one predefined token [QUE] is used in the pre-training stage, at least one corresponding predefined token [QUE] may also be added to the input sequence for the xMRC task. Moreover, it should be understood that although the above description gives an example in which the xSL task is an xMRC task, for any other types of xSL task, input sequences for these tasks may also be processed in a similar approach.
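Building the fine-tuning input of Equation (10) mirrors the pre-training input construction; a short sketch, where the question and passage contents are purely illustrative.

```python
def build_xmrc_input(question_tokens, passage_tokens):
    """Fine-tuning input for the xMRC task, Equation (10): [CLS] Q [QUE] [SEP] P [SEP]."""
    return ["[CLS]"] + question_tokens + ["[QUE]", "[SEP]"] + passage_tokens + ["[SEP]"]

print(build_xmrc_input(["Where", "is", "Ottawa", "?"], ["Ottawa", "is", "in", "Canada", "."]))
```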
FIG.5 illustrates a flowchart of an exemplary method 500 for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
At 510, a training sentence pair including a first sentence in a first language and a second sentence in a second language may be obtained, the second sentence being a version of the first sentence in the second language.
At 520, at least one original span in the first sentence may be masked with at least one predefined token, to obtain the masked first sentence.
At 530, an input sequence for the language model may be formed with at least the masked first sentence and the second sentence. At 540, an input sequence representation of the input sequence may be generated through the language model.
At 550, target span prediction may be performed based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively.
At 560, the language model may be optimized based at least on the target span prediction.
In an implementation, the method 500 may further comprise: selecting the at least one original span from the first sentence. The masking at least one original span may comprise: replacing the selected at least one original span with the at least one predefined token.
In an implementation, the target span prediction may comprise, for each predefined token in the at least one predefined token: calculating target span start position probability distribution and target span end position probability distribution associated with the predefined token based on token representations in the input sequence representation.
The calculating may comprise, for each token in the input sequence: calculating, based at least on a representation of the predefined token and a representation of the token, a start position probability that the token is a start position of a target span and an end position probability that the token is an end position of a target span.
The optimizing the language model based at least on the target span prediction may comprise optimizing the language model at least through: maximizing a start position probability of a ground-truth start position token of a ground-truth target span in the second sentence corresponding to the predefined token; and maximizing an end position probability of a ground-truth end position token of the ground-truth target span.
The method 500 may further comprise: determining the ground-truth target span corresponding to the predefined token through at least performing alignment between the first sentence and the second sentence.
In an implementation, the method 500 may further comprise: obtaining a representation of the first sentence, a representation of the masked first sentence and a representation of the second sentence through at least the language model; taking any two of the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence as a positive sample pair for contrastive learning; and optimizing the language model based at least on the contrastive learning.
The optimizing the language model based at least on the contrastive learning may comprise optimizing the language model at least through: minimizing latent space distances among the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence.
In an implementation, the method 500 may further comprise: optimizing the language model based at least on a mask language modeling task.
In an implementation, the method 500 may further comprise: fine-tuning the language model for the cross-lingual sequence labeling task.
The fine-tuning may comprise: adding the at least one predefined token to an input sequence for the cross-lingual sequence labeling task.
It should be understood that the method 500 may further comprise any steps/processes for establishing a language model adapted to a cross-lingual sequence labeling task according to the embodiments of the present disclosure described above.
FIG.6 illustrates an exemplary apparatus 600 for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
The apparatus 600 may comprise: a training sentence pair obtaining module 610, for obtaining a training sentence pair including a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language; a masking module 620, for masking at least one original span in the first sentence with at least one predefined token, to obtain the masked first sentence; an input sequence forming module 630, for forming an input sequence for the language model with at least the masked first sentence and the second sentence; an input sequence representation generating module 640, for generating an input sequence representation of the input sequence through the language model; a target span prediction performing module 650, for performing target span prediction based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively; and an optimizing module 660, for optimizing the language model based at least on the target span prediction. Moreover, the apparatus 600 may further comprise any other modules configured to perform any steps/processes of the methods for establishing a language model adapted to a cross-lingual sequence labeling task according to the embodiments of the present disclosure described above.
FIG.7 illustrates an exemplary apparatus 700 for establishing a language model adapted to a cross-lingual sequence labeling task according to an embodiment.
The apparatus 700 may comprise at least one processor 710. The apparatus 700 may further comprise a memory 720 connected with at least one processor 710. The memory 720 may store computer-executable instructions that, when executed, cause the at least one processor 710 to: obtain a training sentence pair including a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language; mask at least one original span in the first sentence with at least one predefined token, to obtain the masked first sentence; form an input sequence for the language model with at least the masked first sentence and the second sentence; generate an input sequence representation of the input sequence through the language model; perform target span prediction based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively; and optimize the language model based at least on the target span prediction.
In an implementation, the target span prediction may comprise, for each predefined token in the at least one predefined token: calculating target span start position probability distribution and target span end position probability distribution associated with the predefined token based on token representations in the input sequence representation.
The calculating may comprise, for each token in the input sequence: calculating, based at least on a representation of the predefined token and a representation of the token, a start position probability that the token is a start position of a target span and an end position probability that the token is an end position of a target span.
The optimizing the language model based at least on the target span prediction may comprise optimizing the language model at least through: maximizing a start position probability of a ground-truth start position token of a ground-truth target span in the second sentence corresponding to the predefined token; and maximizing an end position probability of a ground-truth end position token of the ground-truth target span.
In an implementation, the computer-executable instructions, when executed, may further cause the at least one processor to: obtain a representation of the first sentence, a representation of the masked first sentence and a representation of the second sentence through at least the language model; take any two of the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence as a positive sample pair for contrastive learning; and optimize the language model based at least on the contrastive learning. The optimizing the language model based at least on the contrastive learning may comprise optimizing the language model at least through: minimizing latent space distances among the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence.
In an implementation, the computer-executable instructions, when executed, may further cause the at least one processor to: fine-tune the language model for the cross-lingual sequence labeling task.
The fine-tuning may comprise: adding the at least one predefined token to an input sequence for the cross-lingual sequence labeling task.
Moreover, the at least one processor 710 may be further configured to perform any other steps/processes of the methods for establishing a language model adapted to a cross-lingual sequence labeling task according to the embodiments of the present disclosure described above. The embodiments of the present disclosure propose a computer program product for establishing a language model adapted to a cross-lingual sequence labeling task. The computer program product may comprise a computer program that is executed by at least one processor for: obtaining a training sentence pair including a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language; masking at least one original span in the first sentence with at least one predefined token, to obtain the masked first sentence; forming an input sequence for the language model with at least the masked first sentence and the second sentence; generating an input sequence representation of the input sequence through the language model; performing target span prediction based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively; and optimizing the language model based at least on the target span prediction. Moreover, the computer program may be further executed by the at least one processor for performing any other steps/processes of the methods for establishing a language model adapted to a cross-lingual sequence labeling task according to the embodiments of the present disclosure described above.
The embodiments of the present disclosure may be embodied in a non-transitory computer- readable medium. The non-transitory computer readable medium may comprise instructions that, when executed, cause one or more processors to perform any steps/processes of the methods for establishing a language model adapted to a cross-lingual sequence labeling task according to the embodiments of the present disclosure described above.
It should be understood that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.
In addition, the articles "a" and "an" as used in this description and appended claims, unless otherwise specified or clear from the context that they are for the singular form, should generally be interpreted as meaning "one" or "one or more."
It should also be understood that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
Processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the specific application and overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a microprocessor, a microcontroller, a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with software being executed by a microprocessor, a microcontroller, a DSP, or other suitable platform.
Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, etc. The software may reside on a computer-readable medium. A computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk. Although a memory is shown as being separate from the processor in various aspects presented in this disclosure, a memory may also be internal to the processor (e.g., a cache or a register).
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are intended to be encompassed by the claims.

Claims

1. A method for establishing a language model adapted to a cross-lingual sequence labeling task, comprising: obtaining a training sentence pair including a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language; masking at least one original span in the first sentence with at least one predefined token, to obtain the masked first sentence; forming an input sequence for the language model with at least the masked first sentence and the second sentence; generating an input sequence representation of the input sequence through the language model; performing target span prediction based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively; and optimizing the language model based at least on the target span prediction.
2. The method of claim 1, further comprising: selecting the at least one original span from the first sentence, and wherein the masking at least one original span comprises: replacing the selected at least one original span with the at least one predefined token.
3. The method of claim 1, wherein the target span prediction comprises, for each predefined token in the at least one predefined token: calculating target span start position probability distribution and target span end position probability distribution associated with the predefined token based on token representations in the input sequence representation.
4. The method of claim 3, wherein the calculating comprises: for each token in the input sequence: calculating, based at least on a representation of the predefined token and a representation of the token, a start position probability that the token is a start position of a target span and an end position probability that the token is an end position of a target span.
5. The method of claim 3, wherein the optimizing the language model based at least on the target span prediction comprises optimizing the language model at least through: maximizing a start position probability of a ground-truth start position token of a ground-truth target span in the second sentence corresponding to the predefined token; and maximizing an end position probability of a ground-truth end position token of the ground-truth target span.
6. The method of claim 5, further comprising: determining the ground-truth target span corresponding to the predefined token through at least performing alignment between the first sentence and the second sentence.
7. The method of claim 1, further comprising: obtaining a representation of the first sentence, a representation of the masked first sentence and a representation of the second sentence through at least the language model; taking any two of the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence as a positive sample pair for contrastive learning; and optimizing the language model based at least on the contrastive learning.
8. The method of claim 7, wherein the optimizing the language model based at least on the contrastive learning comprises optimizing the language model at least through: minimizing latent space distances among the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence.
9. The method of claim 1, further comprising: optimizing the language model based at least on a mask language modeling task.
10. The method of claim 1, further comprising: fine-tuning the language model for the cross-lingual sequence labeling task.
11. The method of claim 10, wherein the fine-tuning comprises: adding the at least one predefined token to an input sequence for the cross-lingual sequence labeling task.
12. An apparatus for establishing a language model adapted to a cross-lingual sequence labeling task, comprising: at least one processor; and a memory storing computer-executable instructions that, when executed, cause the at least one processor to: obtain a training sentence pair including a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language, mask at least one original span in the first sentence with at least one predefined token, to obtain the masked first sentence, form an input sequence for the language model with at least the masked first sentence and the second sentence, generate an input sequence representation of the input sequence through the language model, perform target span prediction based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively, and optimize the language model based at least on the target span prediction.
13. The apparatus of claim 12, wherein the target span prediction comprises, for each predefined token in the at least one predefined token: calculating target span start position probability distribution and target span end position probability distribution associated with the predefined token based on token representations in the input sequence representation.
14. The apparatus of claim 12, wherein the computer-executable instructions, when executed, further cause the at least one processor to: obtain a representation of the first sentence, a representation of the masked first sentence and a representation of the second sentence through at least the language model; take any two of the representation of the first sentence, the representation of the masked first sentence and the representation of the second sentence as a positive sample pair for contrastive learning; and optimize the language model based at least on the contrastive learning.
15. A computer program product for establishing a language model adapted to a cross-lingual sequence labeling task, comprising a computer program that is executed by at least one processor for: obtaining a training sentence pair including a first sentence in a first language and a second sentence in a second language, the second sentence being a version of the first sentence in the second language; masking at least one original span in the first sentence with at least one predefined token, to obtain the masked first sentence; forming an input sequence for the language model with at least the masked first sentence and the second sentence; generating an input sequence representation of the input sequence through the language model; performing target span prediction based at least on the input sequence representation, to predict a start position and an end position of at least one target span in the second sentence, the at least one target span corresponding to the at least one predefined token respectively; and optimizing the language model based at least on the target span prediction.
PCT/US2023/012468 2022-04-24 2023-02-07 Establishing a language model adapted to a cross-lingual sequence labeling task WO2023211525A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210434197.4A CN116976340A (en) 2022-04-24 2022-04-24 Building a language model suitable for cross-lingual sequence labeling tasks
CN202210434197.4 2022-04-24

Publications (1)

Publication Number Publication Date
WO2023211525A1 true WO2023211525A1 (en) 2023-11-02

Family

ID=85556811

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/012468 WO2023211525A1 (en) 2022-04-24 2023-02-07 Establishing a language model adapted to a cross-lingual sequence labeling task

Country Status (2)

Country Link
CN (1) CN116976340A (en)
WO (1) WO2023211525A1 (en)

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "[PDF] Bridging the Gap between Language Models and Cross-Lingual Sequence Labeling | Semantic Scholar", 1 January 2022 (2022-01-01), XP093044031, Retrieved from the Internet <URL:https://www.semanticscholar.org/paper/Bridging-the-Gap-between-Language-Models-and-Chen-Shou/6c6ef8700507618b2851f428ef383d34ab99d7af> [retrieved on 20230503] *
CHEN NUO ET AL: "Bridging the Gap between Language Models and Cross-Lingual Sequence Labeling", PROCEEDINGS OF THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 11 April 2022 (2022-04-11), Stroudsburg, PA, USA, pages 1909 - 1923, XP093043908, Retrieved from the Internet <URL:https://arxiv.org/ftp/arxiv/papers/2204/2204.05210.pdf> DOI: 10.18653/v1/2022.naacl-main.139 *
RAM ORI ET AL: "Few-Shot Question Answering by Pretraining Span Selection", PROCEEDINGS OF THE 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (VOLUME 1: LONG PAPERS), 1 January 2021 (2021-01-01), Stroudsburg, PA, USA, pages 3066 - 3079, XP093044054, Retrieved from the Internet <URL:https://arxiv.org/pdf/2101.00438v1.pdf> DOI: 10.18653/v1/2021.acl-long.239 *
ZHU FEIDA ET AL: "Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition", ACM SYMPOSIUM ON APPLIED PERCEPTION 2020, 14 August 2021 (2021-08-14), New York, NY, USA, pages 3231 - 3239, XP093044050, ISBN: 978-1-4503-8332-5, Retrieved from the Internet <URL:https://arxiv.org/pdf/2106.00241v1.pdf> DOI: 10.1145/3447548.3467196 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236335A (en) * 2023-11-13 2023-12-15 江西师范大学 Two-stage named entity recognition method based on prompt learning
CN117236335B (en) * 2023-11-13 2024-01-30 江西师范大学 Two-stage named entity recognition method based on prompt learning
CN117744657A (en) * 2023-12-26 2024-03-22 广东外语外贸大学 Medicine adverse event detection method and system based on neural network model

Also Published As

Publication number Publication date
CN116976340A (en) 2023-10-31

Similar Documents

Publication Publication Date Title
Zhang et al. Discriminative nearest neighbor few-shot intent detection by transferring natural language inference
CN113987209B Natural language processing method, device, computing equipment and storage medium based on knowledge-guided prefix fine-tuning
Liu et al. Exploiting argument information to improve event detection via supervised attention mechanisms
Chen et al. Event extraction via dynamic multi-pooling convolutional neural networks
WO2022083423A1 (en) System and method for relation extraction with adaptive thresholding and localized context pooling
CN109992773B (en) Word vector training method, system, device and medium based on multi-task learning
WO2023211525A1 (en) Establishing a language model adapted to a cross-lingual sequence labeling task
WO2021211207A1 (en) Adversarial pretraining of machine learning models
US11715008B2 (en) Neural network training utilizing loss functions reflecting neighbor token dependencies
Wan et al. Financial causal sentence recognition based on BERT-CNN text classification
Pramanik et al. Text normalization using memory augmented neural networks
CN113743099B (en) System, method, medium and terminal for extracting terms based on self-attention mechanism
Gomes et al. BERT-and TF-IDF-based feature extraction for long-lived bug prediction in FLOSS: a comparative study
Liu et al. A Hybrid Neural Network BERT‐Cap Based on Pre‐Trained Language Model and Capsule Network for User Intent Classification
Li et al. Exploiting argument information to improve biomedical event trigger identification via recurrent neural networks and supervised attention mechanisms
CN111967253A (en) Entity disambiguation method and device, computer equipment and storage medium
Lundeqvist et al. Author profiling: A machine learning approach towards detecting gender, age and native language of users in social media
Unanue et al. Regressing word and sentence embeddings for low-resource neural machine translation
Jain et al. Detecting Twitter posts with Adverse Drug Reactions using Convolutional Neural Networks.
Wu et al. A Model Ensemble Approach with LLM for Chinese Text Classification
Shen et al. Towards domain-generalizable paraphrase identification by avoiding the shortcut learning
Nguyen et al. An Embedding Method for Sentiment Classification across Multiple Languages
Nukarinen Automated text sentiment analysis for Finnish language using deep learning
Üveges Comprehensibility and Automation: Plain Language in the Era of Digitalization
Vo et al. Development of a fake news detection tool for Vietnamese based on deep learning techniques

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23710114

Country of ref document: EP

Kind code of ref document: A1