CN115906854A - Multi-level adversarial cross-language named entity recognition model training method - Google Patents

Multi-level adversarial cross-language named entity recognition model training method

Info

Publication number
CN115906854A
CN115906854A
Authority
CN
China
Prior art keywords
data
language
word
level
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211679089.XA
Other languages
Chinese (zh)
Inventor
赵易淳
康光梁
都金涛
祝慧佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202211679089.XA priority Critical patent/CN115906854A/en
Publication of CN115906854A publication Critical patent/CN115906854A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a cross-language named entity recognition model training method based on multi-level adversarial learning. The method mainly comprises the following steps: translating tagged source language data into tagged target language data through an external word-to-word translation model; constructing additional data, such as code-switched data and out-of-order data, and feeding it into a multi-level adversarial network to perform in-domain adversarial training of mBERT; and fine-tuning the adversarially trained mBERT on three groups of data respectively, and then performing multi-model distillation to obtain a student model.

Description

Multi-level adversarial cross-language named entity recognition model training method
Technical Field
The present invention relates to named entity recognition, and more particularly, to a method, system, apparatus, and medium for training a cross-language named entity recognition model based on multi-level adversarial learning.
Background
With the development of the internet, overseas business scenarios increasingly involve multilingual demands. However, language resources remain scarce for many languages: most of the world's population uses a small number of mainstream languages, while many low-resource languages are spoken by only small populations. Existing natural language processing methods usually require large manually labeled data sets, which leads to high annotation costs. Therefore, how to effectively perform named entity recognition for resource-scarce languages based on the data resources of resource-rich languages (e.g., English, Chinese) has become a challenge for cross-language natural language processing.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
One or more embodiments of the present specification achieve the above objects by the following technical solutions.
In one aspect, a method for multi-level adversarial cross-language named entity recognition model training is provided, comprising: creating a plurality of data sets, the plurality of data sets comprising: a tagged source language data set, a tagged target language data set, a code-switched data set, and an out-of-order data set; adversarially training mBERT through a multi-level adversarial network using, at least in part, the created tagged source language data, the code-switched data set, and the out-of-order data set, to obtain an adversarially trained mBERT, wherein the multi-level adversarial network includes a word level, a sentence level, and a word-order level; and fine-tuning the adversarially trained mBERT on multiple sets of data to obtain a corresponding plurality of teacher models with different tendencies, and distilling the plurality of teacher models to obtain a student model.
In another aspect, a system for multi-level adversarial cross-language named entity recognition model training is provided, comprising: a data set creation module configured to create a plurality of data sets, the plurality of data sets comprising: a tagged source language data set, a tagged target language data set, a code-switched data set, and an out-of-order data set; an adversarial training module configured to adversarially train mBERT through a multi-level adversarial network using, at least in part, the created tagged source language data, the code-switched data set, and the out-of-order data set, to obtain an adversarially trained mBERT, wherein the multi-level adversarial network comprises a word level, a sentence level, and a word-order level; and a distillation module configured to fine-tune the adversarially trained mBERT on multiple sets of data to obtain a corresponding plurality of teacher models with different tendencies, a student model being obtained by distilling the plurality of teacher models.
In yet another aspect, an apparatus for multi-level adversarial cross-language named entity recognition model training is provided, comprising: a memory; and a processor configured to perform any of the methods described above.
In yet another aspect, a computer-readable storage medium storing instructions that, when executed by a computer, cause the computer to perform the above-described method is provided.
In yet another aspect, a method of performing named entity recognition on input data using the student model described above is provided, comprising: receiving, by the student model, a data set comprising one or more sentences, the one or more sentences comprising non-entity word portions and/or entity word portions; performing, by the student model, named entity recognition on the received data set; and outputting, by the student model, a label for each word in each sentence in the data set, the label indicating whether the word is an entity or a non-entity.
These and other features and advantages will become apparent upon reading the following detailed description and upon reference to the accompanying drawings. It is to be understood that both the foregoing general description and the following detailed description are explanatory only and are not restrictive of aspects as claimed.
Drawings
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only some typical aspects of this invention and are therefore not to be considered limiting of its scope, for the description may admit to other equally effective aspects.
FIG. 1 shows a flowchart 100 of a multi-level adversarial cross-language named entity recognition model training method according to one embodiment of the invention.
FIG. 2 illustrates a data flow diagram 200 of a multi-level adversarial cross-language named entity recognition model training technique.
FIG. 3 illustrates a method 300 for named entity recognition of input data using the model derived by the present invention, according to one embodiment of the present invention.
FIG. 4 illustrates a block diagram of a system 400 for multi-level adversarial cross-language named entity recognition model training, according to one embodiment of the invention.
Fig. 5 shows a schematic block diagram of an apparatus 500 for implementing a system or method according to one or more embodiments of the invention.
Detailed Description
The present invention will be described in detail below with reference to the attached drawings, and the features of the present invention will be further apparent from the following detailed description.
The following detailed description refers to the accompanying drawings that illustrate exemplary embodiments of the invention. The scope of the invention is not, however, limited to these embodiments, but is defined by the appended claims. Accordingly, embodiments other than those shown in the drawings, such as modified versions of the illustrated embodiments, are encompassed by the present invention.
References in the specification to "one embodiment," "an example embodiment," etc., indicate that the embodiment may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the relevant art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
Hereinafter, technical terms appearing in the present invention are briefly described. No deviation from their conventional interpretation in the art and/or their general understanding by those skilled in the art is intended.
Named Entity Recognition (NER): the task of identifying entities with specific meanings in text, mainly including person names, place names, organization names, proper nouns, and the like. Generally speaking, named entity recognition identifies named entities of three major categories (entity, time, and number) and seven minor categories (person name, organization name, place name, time, date, currency, and percentage) in the text to be processed.
Code-Switching (Code-Switch): here, a code refers to a language system used in communication. Code-switching refers to the use of two or more languages by a speaker within the same communication session.
Generative Adversarial Network (GAN): an important class of generative models in deep learning. Its core idea is adversarial training: a generator is responsible for producing samples, and a discriminator is responsible for distinguishing them. Both networks (generator and discriminator) are trained simultaneously and compete in a minimax game. This adversarial approach approximates otherwise intractable loss functions through adversarial learning, and is widely used in generating data such as images, videos, natural language, and music.
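As background (this is the standard objective from the original GAN literature, not a formula recited in the patent), the generator G and discriminator D play the minimax game min_G max_D V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))], where the discriminator tries to maximize its classification accuracy and the generator tries to fool it.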
Model Distillation: a technique that transfers the knowledge learned by one large model (or several models) into a single lightweight model that is easier to deploy. In brief, the small model learns to reproduce the predictions of the large model. Specifically, the original large model is called the teacher model, the new small model is called the student model, and the probability distribution output by the teacher model serves as a soft label.
Multilingual BERT (mBERT): a deep learning model released by Google. mBERT is pre-trained on text in about 100 languages simultaneously, and this multilingual training allows the model to be applied to a variety of cross-lingual language tasks.
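As an illustration (not part of the patent; the checkpoint name below refers to the publicly released mBERT weights on the Hugging Face hub), mBERT can be loaded and used to encode a sentence as follows:

```python
# Minimal sketch: load mBERT with the Hugging Face "transformers" library and
# encode one sentence into per-token feature vectors.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

inputs = tokenizer("Full name of company is Loxley Publications Plc.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, num_subword_tokens, 768): one vector per subword
```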
In the prior art, the adversarial model AdvPicker (ACL 2021) uses only word-level multilingual adversarial training, and its adversarial input contains only source language and target language data, so every word in a given sentence belongs to the same language. This easily leaks contextual information during adversarial training and weakens the adversarial effect. In addition, most distillation approaches perform only single-model distillation for the target language, so the distilled information is limited. Although Unitrans (IJCAI 2020) performs multi-model distillation, its distillation process is complicated because it combines hard and soft labels.
Therefore, the invention provides a multi-level adversarial cross-language named entity recognition model training scheme. By constructing code-switched data, a single sentence in the word-level adversarial process can contain words from multiple languages, which enhances robustness. Meanwhile, two additional adversarial networks at the sentence level and the word-order level are designed to perform adversarial training along more dimensions. In the distillation process, various data sets are grouped and matched, and a plurality of teacher models are fine-tuned, so that more diverse information can be distilled.
FIG. 1 shows a flowchart 100 of a multi-level adversarial cross-language named entity recognition model training method according to one embodiment of the invention. FIG. 2 illustrates a data flow diagram 200 of a multi-level adversarial cross-language named entity recognition model training technique. The present invention will be described below with reference to FIG. 1 and FIG. 2.
At step 105, a plurality of data sets is created, which may include: a tagged source language data set, a tagged target language data set, a code-switched data set, and an out-of-order data set.
Specifically, referring to FIG. 2, the data set creation step can be divided into the following processes:
a. The original tagged source language data set is translated into a tagged target language data set through an open-source word-to-word translation model (e.g., one provided by Facebook). The tags on the source language data are copied directly onto the corresponding translated words, so that the target language data is also tagged.
For example, in the context of the present invention, each word in the source language data may be tagged using the BIO tagging scheme: tag B (Beginning) represents the beginning of a named entity or a single-word named entity; tag I (Inside) represents a middle or end position of a named entity; and tag O (Outside) represents a non-named entity. Of course, those skilled in the art will fully understand that other labeling schemes commonly used in NER (e.g., BMES, BIOES, etc.) can also be used.
In the context of the present invention, the source language is English and the target language is German, but those skilled in the art will fully understand that the source and target languages may be any of a variety of languages. Suppose the labeled English sample sentence "Full name of company is Loxley Publications Plc." is translated word-by-word to obtain the labeled German sample sentence "volle Namen der company ist Middleton publishing Plc.". Here, "Loxley Publications Plc" is an organization (ORG) entity, i.e., the entity word portion, and "Full name of company is" is the non-entity word portion. Similarly, in the corresponding German translation, "Middleton publishing Plc" is the entity word portion and "volle Namen der company ist" is the non-entity word portion.
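For illustration (a hypothetical annotation constructed for this description, not data recited in the patent), the BIO tags for the running example and its word-by-word translation, with tags copied from the English words to their German counterparts, could look like this:

```python
# Hypothetical BIO annotation for the running example; tags are copied
# word-by-word from the English source to the German translation.
english = ["Full", "name", "of", "company", "is", "Loxley", "Publications", "Plc."]
german  = ["volle", "Namen", "der", "company", "ist", "Middleton", "publishing", "Plc."]
tags    = ["O",    "O",    "O",  "O",       "O",  "B-ORG",  "I-ORG",        "I-ORG"]
```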
b. All entity word portions and all non-entity word portions in the tagged source language data set are separately translated word-by-word into the target language, so as to construct two tagged code-switched data sets. Each sentence in these two code-switched tagged data sets contains both the source language and the target language.
In the first code-switched data set, the entity word portion of each source language sentence is replaced with its target language translation; for example, "Full name of company is Loxley Publications Plc." is code-switched into "Full name of company is Middleton publishing Plc.". In the second code-switched data set, the non-entity word portion of each source language sentence is replaced with its target language translation; for example, "Full name of company is Loxley Publications Plc." is code-switched into "volle Namen der company ist Loxley Publications Plc.".
c. The data in the tagged source language data set is shuffled within the boundaries delimited by the entity word portions, so as to construct a tagged out-of-order data set. The interiors of entity word portions and non-entity word portions are each shuffled with a certain probability. For example, for the sentence "Full name of company is Loxley Publications Plc." above, the entity word portion is "Loxley Publications Plc."; the non-entity word portion "Full name of company is" has boundary words "Full" and "is", and the entity word portion has boundary words "Loxley" and "Plc". Shuffling within these boundaries, the non-entity word portion "Full name of company is" may become "company name Full of is" and the entity word portion "Loxley Publications Plc" may become "Publications Plc Loxley", yielding the out-of-order sentence "company name Full of is Publications Plc Loxley".
Those skilled in the art will fully understand that the above shuffling is merely illustrative, and different shuffling results can be obtained with different probabilities. One possible way to construct the code-switched and out-of-order variants is sketched in the example below.
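The following is a minimal sketch (illustrative only; the helper functions, the toy dictionary, and the shuffle probability are assumptions rather than the patent's implementation) of how the code-switched and out-of-order variants of a BIO-tagged sentence might be constructed:

```python
import random

def split_spans(words, tags):
    """Split a BIO-tagged sentence into (is_entity, word_list) spans."""
    spans, cur, cur_is_ent = [], [], None
    for w, t in zip(words, tags):
        is_ent = t != "O"
        starts_new = t.startswith("B-") or (cur_is_ent is not None and is_ent != cur_is_ent)
        if cur and starts_new:
            spans.append((cur_is_ent, cur))
            cur = []
        cur.append(w)
        cur_is_ent = is_ent
    if cur:
        spans.append((cur_is_ent, cur))
    return spans

def code_switch(words, tags, translate, switch_entities=True):
    """Replace entity (or non-entity) spans with their word-by-word translation."""
    out = []
    for is_ent, span in split_spans(words, tags):
        out += [translate(w) for w in span] if is_ent == switch_entities else span
    return out

def shuffle_in_boundaries(words, tags, p=0.5):
    """Shuffle the word order inside each entity / non-entity span with probability p."""
    out = []
    for _, span in split_spans(words, tags):
        span = span[:]
        if random.random() < p:
            random.shuffle(span)
        out += span
    return out

# Toy word-to-word "translation model" (illustrative only).
toy_dict = {"Full": "volle", "name": "Namen", "of": "der", "is": "ist"}
translate = lambda w: toy_dict.get(w, w)

words = ["Full", "name", "of", "company", "is", "Loxley", "Publications", "Plc."]
tags  = ["O", "O", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG"]
print(code_switch(words, tags, translate, switch_entities=False))  # non-entity portion in German
print(shuffle_in_boundaries(words, tags))                          # shuffled within span boundaries
```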
Returning to FIG. 1, at step 110, the tagged source language data, the code-switched data sets, and the out-of-order data set created in step 105 are used, at least in part, to adversarially train mBERT through a multi-level adversarial network, yielding an adversarially trained mBERT. The multi-level adversarial network comprises a word level, a sentence level, and a word-order level.
Referring to FIG. 2, in the adversarial training step, mBERT is first used as a generator to encode each sentence in the created data sets and in an unlabeled target language data set (obtained from an external source), producing a corresponding feature vector for each sentence. The training then proceeds as follows:
a. will be provided with
Figure BDA00040183083000000511
And &>
Figure BDA00040183083000000512
The mBERT encoded sentence in (1) is input to a word level discriminator for discriminating whether each word in the sentence belongs to a source language or a target language, with a training loss of L DIS1
b. The mBERT-encoded sentences of the tagged source language data set, the first code-switched data set, and the unlabeled target language data set are input to a sentence-level discriminator, which judges whether each input sentence belongs to the source language, the target language, or is a code-switched sentence, with training loss L_DIS2.
c. The mBERT-encoded sentences of the tagged source language data set and the out-of-order data set are input to a word-order-level discriminator, which judges whether each sentence is out of order, with training loss L_DIS3.
d. At the same time during adversarial training, supervised source language NER training is carried out on the tagged source language data, with training loss L_NER.
For mBERT as the generator, the training loss is the NER task loss minus the sum of the three discriminator losses: L_E = L_NER - (L_DIS1 + L_DIS2 + L_DIS3).
During adversarial training, the parameters of each discriminator and of the mBERT generator are updated according to these losses. For example, for the NER task, the parameters of the mBERT generator and the NER classifier may be updated based on L_NER; for the adversarial tasks, the parameters of the mBERT generator may be updated based on L_E, the parameters of the word-level discriminator based on L_DIS1, the parameters of the sentence-level discriminator based on L_DIS2, and the parameters of the word-order-level discriminator based on L_DIS3.
Thus, as the parameters are updated, the three discriminators classify the input data increasingly well, while mBERT as the generator learns to encode increasingly similar vector representations for the different kinds of input data.
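A minimal PyTorch-style sketch (illustrative only; the module name, classifier heads, and batch fields are assumptions, not the patent's code) of how the generator and discriminator losses described above could be computed in one forward pass:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel

class MultiLevelAdversarialTrainer(torch.nn.Module):
    def __init__(self, hidden=768, n_tags=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")  # mBERT generator
        self.ner_head = torch.nn.Linear(hidden, n_tags)   # token-level NER classifier
        self.word_disc = torch.nn.Linear(hidden, 2)       # word level: source vs. target language
        self.sent_disc = torch.nn.Linear(hidden, 3)       # sentence level: source / target / code-switched
        self.order_disc = torch.nn.Linear(hidden, 2)      # word-order level: in-order vs. shuffled

    def forward(self, batch):
        h = self.encoder(**batch["inputs"]).last_hidden_state   # (B, T, H) token vectors
        sent = h[:, 0]                                           # [CLS] vector for sentence-level heads

        l_ner  = F.cross_entropy(self.ner_head(h).transpose(1, 2), batch["ner_tags"])
        l_dis1 = F.cross_entropy(self.word_disc(h).transpose(1, 2), batch["word_lang"])
        l_dis2 = F.cross_entropy(self.sent_disc(sent), batch["sent_type"])
        l_dis3 = F.cross_entropy(self.order_disc(sent), batch["is_shuffled"])

        # Generator loss L_E = L_NER - (L_DIS1 + L_DIS2 + L_DIS3); in a full training
        # loop the discriminators would be updated on their own losses while the
        # encoder is updated on l_gen (and on l_ner for the supervised NER task).
        l_gen = l_ner - (l_dis1 + l_dis2 + l_dis3)
        return l_gen, (l_dis1, l_dis2, l_dis3), l_ner
```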
Returning to FIG. 1, at step 115, the adversarially trained mBERT obtained at step 110 is fine-tuned on multiple sets of data to obtain a corresponding plurality of teacher models with different tendencies, and a student model is obtained by distilling the plurality of teacher models.
Referring to FIG. 2, the multi-model distillation step can be divided into the following processes:
a. The adversarially trained mBERT is fine-tuned separately on three groups of labeled data to obtain three teacher models with different tendencies.
Specifically, the sentences in the first group of labeled data (the tagged target language data set together with the out-of-order data set) share the same content but have different word orders, so this group focuses more on word order. The sentences in the second group (the tagged target language data set together with the first code-switched data set) share the same entity word portions but have non-entity word portions in different languages, so this group focuses more on the non-entity word portions. The sentences in the third group (the tagged target language data set together with the second code-switched data set) share the same non-entity word portions but have entity word portions in different languages, so this group focuses more on the entity word portions.
b. The probabilities output by the three teacher models on the unlabeled target language data are averaged to serve as soft labels for distillation.
For example, the probability W_1 output by teacher model 1, the probability W_2 output by teacher model 2, and the probability W_3 output by teacher model 3 can be summed and divided by 3, and the resulting average is used as the soft label for distillation.
c. A fresh, non-adversarially-trained mBERT is reloaded as the student model. On the unlabeled target language data, the Mean Square Error (MSE) loss between the probabilities output by this mBERT and the soft labels output by the teacher models is calculated, and the final student model is obtained through training and applied to downstream tasks (e.g., translation, information extraction, and the like). The student model can perform named entity recognition on sentences in the target language.
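A minimal sketch (illustrative only; tensor shapes and function names are assumptions) of the soft-label averaging and MSE distillation loss might look like this:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list):
    """MSE between student probabilities and the averaged teacher soft labels.

    student_logits: (B, T, n_tags) logits from the student mBERT + NER head.
    teacher_logits_list: list of (B, T, n_tags) logits from the fine-tuned teachers.
    """
    # Soft label = average of the teachers' probability outputs (sum / 3 for three teachers).
    teacher_probs = torch.stack([F.softmax(t, dim=-1) for t in teacher_logits_list]).mean(dim=0)
    student_probs = F.softmax(student_logits, dim=-1)
    return F.mse_loss(student_probs, teacher_probs)
```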
Those skilled in the art will appreciate that although three teacher models are employed in the present invention, a different number of teacher models can be employed to capture different tendencies.
In summary, the method first translates tagged source language data into tagged target language data through an external word-to-word translation model; then constructs additional data, such as code-switched data and out-of-order data, and feeds it into a multi-level adversarial network to perform in-domain adversarial training of mBERT; and finally fine-tunes the adversarially trained mBERT on three groups of data respectively and performs multi-model distillation to obtain a student model, which is applied to downstream tasks.
FIG. 3 illustrates a method 300 for named entity recognition of input data using the model derived by the present invention, according to one embodiment of the present invention. The model is trained by the method described above with respect to FIG. 1. In summary, the model can be trained by: translating tagged source language data into tagged target language data through an external word-to-word translation model; constructing additional data, such as code-switched data and out-of-order data, and feeding it into a multi-level adversarial network to perform in-domain adversarial training of mBERT; and fine-tuning the adversarially trained mBERT on three groups of data respectively and then performing multi-model distillation to obtain a student model, which serves as the model used here.
At 305, the model receives a data set, the data set including one or more sentences. The one or more sentences may include non-entity word portions and/or entity word portions.
At 310, the model performs named entity recognition on the received data set.
At 315, the model outputs a label for each word in each sentence in the dataset. The label can indicate whether the word is an entity or a non-entity.
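As an illustration (hypothetical usage, assuming the distilled student was saved as a standard token classification checkpoint; the path below is a placeholder), applying the student model to a target language sentence could look like this:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# "path/to/student-model" is a placeholder for wherever the distilled student was saved.
tokenizer = AutoTokenizer.from_pretrained("path/to/student-model")
model = AutoModelForTokenClassification.from_pretrained("path/to/student-model")

sentence = "volle Namen der company ist Middleton publishing Plc."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                    # (1, T, n_tags)
pred_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, pid in zip(tokens, pred_ids):
    print(tok, model.config.id2label[int(pid)])        # e.g. "Middleton B-ORG"
```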
In summary, the present invention uses a multi-level adversarial mechanism to perform adversarial training along three dimensions, namely the word level, the sentence level, and the word-order level, so that the pre-trained model mBERT encodes the source language and the target language more similarly, which helps transfer label information from the source language to the target language. In addition, through multi-model distillation, the adversarially trained mBERT is further fine-tuned on multiple groups of data with different emphases, so that a more diverse and more robust model is distilled. Distillation uses only the averaged outputs of multiple teacher models as soft labels, which is more concise than the previous Unitrans model.
In tests, experiments were carried out with models obtained by this method on the CoNLL-2002 and CoNLL-2003 data sets, with English as the source language and Spanish, Dutch, and German as target languages, respectively. The results on Spanish and Dutch exceed those of the current SOTA models, and the result on German is close to SOTA.
FIG. 4 illustrates a block diagram of a system 400 for multi-level adversarial cross-language named entity recognition model training, according to one embodiment of the invention. As shown in FIG. 4, the system 400 may include a data set creation module 405, an adversarial training module 410, and a distillation module 415. The specific details of each module may be found in the description of the relevant operations above.
According to one embodiment of the invention, the data set creation module 405 may be configured to create a plurality of data sets, which may include: a tagged source language data set, a tagged target language data set, a code-switched data set, and an out-of-order data set.
According to one embodiment of the invention, the adversarial training module 410 may be configured to adversarially train mBERT through a multi-level adversarial network using, at least in part, the created tagged source language data, the code-switched data set, and the out-of-order data set, to obtain an adversarially trained mBERT. The multi-level adversarial network comprises a word level, a sentence level, and a word-order level.
According to one embodiment of the invention, the distillation module 415 may be configured to fine-tune the adversarially trained mBERT on multiple sets of data to obtain a corresponding plurality of teacher models with different tendencies, a student model being obtained by distilling the plurality of teacher models.
Fig. 5 shows a schematic block diagram of an apparatus 500 for implementing a system or method in accordance with one or more embodiments of the invention. The apparatus may include a processor 510 configured to perform any of the methods described above, and a memory 515.
The apparatus 500 may include a network connection element 525, which may include, for example, a network connection device that connects to other devices through a wired connection or a wireless connection. The wireless connection may be, for example, a WiFi connection, a Bluetooth connection, a 3G/4G/5G network connection, or the like.
The device may also optionally include other peripheral elements 520 such as input devices (e.g., keyboard, mouse), output devices (e.g., display), etc. For example, in a method based on user input, a user may perform an input operation via an input device. The corresponding information may also be output to the user via an output device.
Each of these modules may communicate with each other directly or indirectly, e.g., via one or more buses such as bus 505.
Also, the present application discloses a computer-readable storage medium comprising computer-executable instructions stored thereon, which, when executed by a processor, cause the processor to perform the method of the embodiments described herein.
Additionally, an apparatus is disclosed that includes a processor and a memory having stored thereon computer-executable instructions that, when executed by the processor, cause the processor to perform the methods of the embodiments described herein.
Additionally, a system comprising means for implementing the methods of the embodiments described herein is also disclosed.
It will be appreciated that methods according to one or more embodiments of the specification can be implemented in software, firmware, or a combination thereof.
It should be understood that the embodiments in the present specification are described in a progressive manner, and the same or similar parts in the embodiments are referred to each other, and each embodiment is described with emphasis on the differences from the other embodiments. In particular, as to the apparatus and system embodiments, since they are substantially similar to the method embodiments, the description is relatively simple and reference may be made to some descriptions of the method embodiments for related points.
It should be understood that the above description describes particular embodiments of the present specification. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
It should be understood that describing an element herein in the singular or showing only one such element in the figures does not mean that the number of that element is limited to one. Furthermore, modules or elements described or illustrated herein as separate may be combined into a single module or element, and modules or elements described or illustrated herein as single may be split into multiple modules or elements.
It is also to be understood that the terms and expressions employed herein are used as terms of description and not of limitation, and that the embodiment or embodiments of the specification are not limited to those terms and expressions. The use of such terms and expressions is not intended to exclude any equivalents of the features shown and described (or portions thereof), and it is recognized that various modifications may be made within the scope of the claims. Other modifications, variations, and alternatives are also possible. Accordingly, the claims should be looked to in order to cover all such equivalents.
Also, it should be noted that while the present invention has been described with reference to specific exemplary embodiments, it should be understood by those skilled in the art that the above embodiments are merely illustrative of one or more embodiments of the present invention, and various changes and substitutions of equivalents may be made without departing from the spirit of the present invention, and therefore, it is intended that all changes and modifications to the above embodiments be included within the scope of the appended claims.

Claims (13)

1. A method for multi-level adversarial cross-language named entity recognition model training, comprising:
creating a plurality of data sets, the plurality of data sets comprising: a tagged source language data set, a tagged target language data set, a code-switched data set, and an out-of-order data set;
adversarially training mBERT through a multi-level adversarial network using, at least in part, the created tagged source language data, the code-switched data set, and the out-of-order data set, to obtain an adversarially trained mBERT, wherein the multi-level adversarial network includes a word level, a sentence level, and a word-order level; and
fine-tuning the adversarially trained mBERT on multiple sets of data to obtain a corresponding plurality of teacher models having different tendencies, and distilling the plurality of teacher models to obtain a student model.
2. The method of claim 1, wherein creating a plurality of data sets further comprises:
constructing the tagged target language data set by translating the tagged source language data set through a word-to-word translation model.
3. The method of claim 2, wherein creating a plurality of data sets further comprises:
constructing the code-switched data set by word-to-word translation of all entity word portions and all non-entity word portions, respectively, of the tagged source language data set into the target language, each sentence in the code-switched data set including both the source language and the target language.
4. The method of claim 3, wherein the code-switched data set comprises a first code-switched data set and a second code-switched data set;
wherein in each sentence of said first code-switched data set, the entity word portion of the source language sentence is replaced with the translated target language; and
wherein in each sentence of said second code-switched data set, the non-entity word portion of the source language sentence is replaced with the translated target language.
5. The method of claim 1, wherein creating a plurality of data sets further comprises:
constructing the out-of-order data set by shuffling the word order of data in the tagged source language data set within the boundaries delimited by the entity word portions.
6. The method of claim 4, wherein the adversarial training through the multi-level adversarial network further comprises:
inputting mBERT-encoded sentences of the tagged source language data set, the first code-switched data set, and an untagged target language data set into a word-level discriminator, wherein the word-level discriminator is used for judging whether each word in each input sentence belongs to the source language or the target language, with a training loss of L_DIS1;
inputting mBERT-encoded sentences of the tagged source language data set, the first code-switched data set, and the untagged target language data set into a sentence-level discriminator, the sentence-level discriminator being configured to determine whether each input sentence belongs to the source language, belongs to the target language, or is a code-switched sentence, with a training loss of L_DIS2; and
inputting mBERT-encoded sentences of the tagged source language data set and the out-of-order data set into a word-order-level discriminator, wherein the word-order-level discriminator judges whether each input sentence is out of order, with a training loss of L_DIS3.
7. The method of claim 6, wherein the adversarial training through the multi-level adversarial network further comprises:
performing supervised source language NER training on the tagged source language data set, with a training loss of L_NER.
8. The method of claim 7, wherein, in the adversarial training, the training loss of the mBERT is L_E = L_NER - (L_DIS1 + L_DIS2 + L_DIS3).
9. The method of claim 4, wherein the multiple sets of data comprise a first set of data, a second set of data, and a third set of data;
wherein the first set of data comprises the tagged target language data set and the out-of-order data set;
the second set of data comprises the tagged target language data set and the first code-switched data set; and
the third set of data comprises the tagged target language data set and the second code-switched data set.
10. A system for multi-level adversarial cross-language named entity recognition model training, comprising:
a data set creation module configured to create a plurality of data sets, the plurality of data sets comprising: a tagged source language data set, a tagged target language data set, a code-switched data set, and an out-of-order data set;
an adversarial training module configured to adversarially train mBERT through a multi-level adversarial network using, at least in part, the created tagged source language data, the code-switched data set, and the out-of-order data set, to obtain an adversarially trained mBERT, wherein the multi-level adversarial network includes a word level, a sentence level, and a word-order level; and
a distillation module configured to fine-tune the adversarially trained mBERT on multiple sets of data to obtain a corresponding plurality of teacher models having different tendencies, a student model being obtained by distilling the plurality of teacher models.
11. A method of performing named entity recognition on input data using the student model obtained by the method of any one of claims 1-9, comprising:
receiving, by the student model, a data set comprising one or more sentences, the one or more sentences comprising non-entity word portions and/or entity word portions;
performing, by the student model, named entity recognition on the received data set; and
outputting, by the student model, a label for each word in each sentence in the data set, the label indicating whether the word is an entity or a non-entity.
12. An apparatus for multi-level adversarial cross-language named entity recognition model training, comprising:
a memory; and
a processor configured to perform the method of any one of claims 1-9.
13. A computer-readable storage medium storing instructions that, when executed by a computer, cause the computer to perform the method of any of claims 1-9.
CN202211679089.XA 2022-12-26 2022-12-26 Multi-level confrontation-based cross-language named entity recognition model training method Pending CN115906854A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211679089.XA CN115906854A (en) 2022-12-26 2022-12-26 Multi-level confrontation-based cross-language named entity recognition model training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211679089.XA CN115906854A (en) 2022-12-26 2022-12-26 Multi-level confrontation-based cross-language named entity recognition model training method

Publications (1)

Publication Number Publication Date
CN115906854A true CN115906854A (en) 2023-04-04

Family

ID=86485923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211679089.XA Pending CN115906854A (en) 2022-12-26 2022-12-26 Multi-level confrontation-based cross-language named entity recognition model training method

Country Status (1)

Country Link
CN (1) CN115906854A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116805004A (en) * 2023-08-22 2023-09-26 中国科学院自动化研究所 Zero-resource cross-language dialogue model training method, device, equipment and medium
CN116805004B (en) * 2023-08-22 2023-11-14 中国科学院自动化研究所 Zero-resource cross-language dialogue model training method, device, equipment and medium

Similar Documents

Publication Publication Date Title
JP5128629B2 (en) Part-of-speech tagging system, part-of-speech tagging model training apparatus and method
CN109359297B (en) Relationship extraction method and system
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN113254610B (en) Multi-round conversation generation method for patent consultation
CN114757182A (en) BERT short text sentiment analysis method for improving training mode
CN111401084A (en) Method and device for machine translation and computer readable storage medium
CN114676234A (en) Model training method and related equipment
CN111414745A (en) Text punctuation determination method and device, storage medium and electronic equipment
CN114676255A (en) Text processing method, device, equipment, storage medium and computer program product
JP2022128441A (en) Augmenting textual data for sentence classification using weakly-supervised multi-reward reinforcement learning
CN111597807B (en) Word segmentation data set generation method, device, equipment and storage medium thereof
CN113239710A (en) Multi-language machine translation method and device, electronic equipment and storage medium
CN115293138A (en) Text error correction method and computer equipment
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN115293139A (en) Training method of voice transcription text error correction model and computer equipment
Kim et al. Zero‐anaphora resolution in Korean based on deep language representation model: BERT
Liu et al. Cross-domain slot filling as machine reading comprehension: A new perspective
CN115906854A (en) Multi-level confrontation-based cross-language named entity recognition model training method
CN114118113A (en) Machine translation method based on context recognition
Lin et al. Chinese story generation of sentence format control based on multi-channel word embedding and novel data format
CN116483314A (en) Automatic intelligent activity diagram generation method
CN115859961A (en) Curriculum concept extraction method for admiration lessons
CN111090720B (en) Hot word adding method and device
CN114548117A (en) Cause-and-effect relation extraction method based on BERT semantic enhancement
CN114692596A (en) Deep learning algorithm-based bond information analysis method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination