CN114386395A - Sequence labeling method and device for multi-language text and electronic equipment - Google Patents

Sequence labeling method and device for multi-language text and electronic equipment

Info

Publication number
CN114386395A
CN114386395A (application number CN202011112593.2A)
Authority
CN
China
Prior art keywords
language
model
annotation
training
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011112593.2A
Other languages
Chinese (zh)
Inventor
王新宇
蒋勇
阮巴赫
王涛
黄非
黄忠强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN202011112593.2A priority Critical patent/CN114386395A/en
Publication of CN114386395A publication Critical patent/CN114386395A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

One or more embodiments of the present specification provide a method, an apparatus, and an electronic device for sequence tagging of multilingual texts, including: obtaining training results of a plurality of single language models for respective language data sets; constructing a training sample set according to all language data sets and training results thereof; training a multi-language model by using the training sample set until the multi-language model converges; and performing sequence labeling on the text by using the converged multi-language model.

Description

Sequence labeling method and device for multi-language text and electronic equipment
Technical Field
One or more embodiments of the present disclosure relate to the technical field of computer applications, and in particular, to a method and an apparatus for sequence tagging of multilingual texts, and an electronic device.
Background
On e-commerce platforms serving international buyers, commodity description information usually involves the languages of multiple countries. When searching for goods, a buyer enters a sentence describing the actual need in a client provided by the platform, and the platform retrieves the most relevant goods based on a relevance algorithm. Sequence labeling is a key step in this relevance computation: the sentence entered by the buyer is labeled to extract key information, and matching is then performed on the extracted key information to obtain relevance scores between the goods and the actual need.
Existing sequence labeling modules usually adopt one sequence labeling model per language. However, such a model performs poorly on input in any language other than its own, which makes it difficult to meet business requirements.
Disclosure of Invention
The specification provides a sequence labeling method of a multilingual text, which comprises the following steps:
obtaining training results of a plurality of single language models for respective language data sets;
constructing a training sample set according to all language data sets and training results thereof;
training a multi-language model by using the training sample set until the multi-language model converges;
and performing sequence labeling on the text by using the converged multi-language model.
Optionally, before obtaining the training results of the plurality of single language models for the respective language data sets, the method further includes:
obtaining a data set for a first language, wherein data in the data set is a sentence with a sequence labeling result;
performing sequence annotation on the data set by using a single language model of a first language, and calculating annotation loss;
and updating the model parameters of the single language model of the first language according to the annotation loss until the single language model of the first language converges.
Optionally, the obtaining training results of the plurality of single language models for the corresponding language data sets includes:
and inputting a first sentence in the data set of the first language into the converged single language model of the first language to obtain a sequence labeling result of the first sentence.
Optionally, the sequence labeling result includes posterior probability distributions of labels corresponding to the respective words in the first sentence.
Optionally, determining whether the monolingual model of the first language converges is performed by:
and if the annotation loss is less than a preset threshold value, determining that the single language model of the first language is converged.
Optionally, the training a multi-language model using the training sample set until the multi-language model converges includes:
performing sequence annotation on the training sample set by using the multi-language model, and calculating a first annotation loss;
performing sequence annotation on the training result by using the multi-language model, and calculating a second annotation loss;
and weighting the first annotation loss and the second annotation loss based on a preset weight, and determining the multi-language model to be converged under the condition that the loss obtained by weighting is less than a preset threshold value.
Optionally, the monolingual model is a conditional random field.
Optionally, the multi-language model is a BERT model-based conditional random field;
the training of the multilingual model using the training sample set includes:
carrying out semantic representation calculation on the sentences in the training sample set by the BERT model, inputting semantic representation results of the sentences into the conditional random field, and carrying out sequence labeling on the sentences by the conditional random field based on the semantic representation results.
The present specification also proposes a device for sequential labeling of multilingual texts, said device comprising:
the first acquisition module is used for acquiring training results of a plurality of single language models for corresponding language data sets;
the construction module is used for constructing a training sample set according to all language data sets and training results thereof;
the training module is used for training the multi-language model by using the training sample set until the multi-language model converges;
and the labeling module is used for performing sequence labeling on the text by using the converged multi-language model.
Optionally, the apparatus further comprises:
a second obtaining module, configured to obtain a data set for a first language before obtaining training results of multiple single language models for corresponding language data sets, where data in the data set is a sentence with a sequence labeling result;
the calculation module is used for carrying out sequence annotation on the data set by using a single language model of a first language and calculating annotation loss;
and the updating module is used for updating the model parameters of the single language model of the first language according to the annotation loss until the single language model of the first language is converged.
Optionally, the first obtaining module is specifically configured to:
and inputting a first sentence in the data set of the first language into the converged single language model of the first language to obtain a sequence labeling result of the first sentence.
Optionally, the sequence labeling result includes posterior probability distributions of labels corresponding to the respective words in the first sentence.
Optionally, determining whether the monolingual model of the first language converges is performed by:
and if the annotation loss is less than a preset threshold value, determining that the single language model of the first language is converged.
Optionally, the training module is specifically configured to:
performing sequence annotation on the training sample set by using the multi-language model, and calculating a first annotation loss;
performing sequence annotation on the training result by using the multi-language model, and calculating a second annotation loss;
and weighting the first annotation loss and the second annotation loss based on a preset weight, and determining the multi-language model to be converged under the condition that the loss obtained by weighting is less than a preset threshold value.
Optionally, the monolingual model is a conditional random field.
Optionally, the multi-language model is a BERT model-based conditional random field;
the training module is specifically configured to:
carrying out semantic representation calculation on the sentences in the training sample set by the BERT model, inputting semantic representation results of the sentences into the conditional random field, and carrying out sequence labeling on the sentences by the conditional random field based on the semantic representation results.
This specification also proposes an electronic device including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements the steps of the above method by executing the executable instructions.
The present specification also contemplates a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the above-described method.
In the above technical solution, the training results of a plurality of single language models for their respective language data sets are obtained, a training sample set is constructed from all of the language data sets and their training results, and a multi-language model is trained on the training sample set until it converges; the converged multi-language model is then used to perform sequence labeling on text. In this way, when the multi-language model handles a sequence labeling task for multi-language text, it can reach an accuracy comparable to that of a single-language model handling text of its own language; that is, the accuracy of sequence labeling of multi-language text by the multi-language model is improved.
Drawings
FIG. 1 is a schematic diagram of a system for sequence annotation of multilingual text, according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flowchart of a method for sequence tagging of multilingual text according to an exemplary embodiment of the present disclosure;
FIG. 3 is a hardware block diagram of an electronic device with a sequential labeling apparatus for multiple language texts according to an exemplary embodiment of the present disclosure;
FIG. 4 is a block diagram of a device for labeling sequences of multiple language texts according to an exemplary embodiment of the present specification.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of one or more embodiments of the specification, as detailed in the claims which follow.
It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described herein. In some other embodiments, the method may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps for description in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.
The present specification aims to provide a technical solution for obtaining training results of a plurality of single language models for corresponding language data sets, then constructing a training sample set according to all language data sets and the training results thereof, and training a multilingual model by using the training sample set until the multilingual model converges, thereby performing sequence labeling on a text by using the converged multilingual model.
In specific implementation, in order to obtain a multi-language model for performing sequence labeling on multi-language texts, a plurality of single-language models can be set; wherein each single language model corresponds to a language. In this case, for each single language model, the single language model may be trained based on a dataset of a language corresponding to the single language model, and a training result of the single language model with respect to the dataset of the language may be obtained.
When it is determined that all of the single language models are trained and training results of the trained single language models are obtained, a training sample set may be constructed based on the multi-language sentences to which corresponding sequence labeling results are previously labeled and the training results of the trained single language models, and the set multi-language models may be trained based on the training sample set until it is determined that the multi-language models converge.
In the case that the multi-language model convergence is determined, the multi-language model can be considered to be trained completely, so that the subsequent sequence labeling task can be executed based on the trained multi-language model.
In the above technical solution, the training results of a plurality of single language models for their respective language data sets are obtained, a training sample set is constructed from all of the language data sets and their training results, and a multi-language model is trained on the training sample set until it converges; the converged multi-language model is then used to perform sequence labeling on text. In this way, when the multi-language model handles a sequence labeling task for multi-language text, it can reach an accuracy comparable to that of a single-language model handling text of its own language; that is, the accuracy of sequence labeling of multi-language text by the multi-language model is improved.
Referring to fig. 1, fig. 1 is a schematic diagram of a system for labeling sequences of multiple language texts according to an exemplary embodiment of the present disclosure. Referring to fig. 2 in conjunction with fig. 1, fig. 2 is a flowchart illustrating a method for labeling a sequence of a multilingual text according to an exemplary embodiment of the present disclosure. The sequence labeling method of the multi-language text can be applied to a server side for executing a sequence labeling task of the multi-language text; the server may be deployed on an electronic device, and the electronic device may specifically be a server or a computer, which is not limited in this specification. The sequence labeling method of the multi-language text can comprise the following steps:
step 201, obtaining training results of a plurality of single language models for corresponding language data sets;
step 202, constructing a training sample set according to all language data sets and training results thereof;
step 203, training a multi-language model by using the training sample set until the multi-language model is converged;
and step 204, performing sequence annotation on the text by using the converged multi-language model.
In the embodiment, in order to obtain a multilingual model for performing sequence labeling on multilingual texts, a plurality of single language models can be set; wherein each single language model corresponds to a language. In this case, for each single language model, the single language model may be trained based on a dataset of a language corresponding to the single language model, and a training result of the single language model with respect to the dataset of the language may be obtained.
In one embodiment shown, a preset number of sentences in multiple languages (e.g., Chinese, English, French, German, Russian, etc.) may be obtained as training samples; the preset number can be preset by a technician according to actual requirements.
It should be noted that, on one hand, any one of the sentences in the plurality of languages is pre-labeled with the corresponding sequence labeling result.
In practical applications, the sequence annotation task may be understood as follows: for a sentence, the words are ordered by position to form a text sequence, and that text sequence is then labeled. The labeling may assign each word its part of speech, or assign each word a category; which of these is used can be preset by a technician according to actual requirements, and this specification does not limit it. In addition, since a sentence can itself be regarded as a text sequence, the sequence labeling result obtained by performing sequence labeling on the sentence can also be output in sequence form; in this case, the sequence annotation result may be referred to as an annotation sequence.
Taking an english sentence as an example, assuming that the sentence is "Bob drank coffee at Starbucks", the words at the positions in the sentence are as shown in table 1 below:
Position 1    Position 2    Position 3    Position 4    Position 5
Bob           drank         coffee        at            Starbucks

TABLE 1
That is, all words in the sentence may be ordered in order from positions 1 to 5, resulting in a text sequence corresponding to the sentence: bob, drank, coffee, at, Starbucks.
After the sentence is subjected to sequence annotation, the obtained sequence annotation result can be shown in the following table 2:
Position 1    Position 2    Position 3    Position 4    Position 5
Bob           drank         coffee        at            Starbucks
Noun          Verb          Noun          Preposition   Noun

TABLE 2
That is, the sequence annotation result (i.e., annotation sequence) of the sentence may be: noun, verb, noun, preposition, noun.
On the other hand, for any one of the sentences in the plurality of languages, the language type to which the sentence belongs is the language type to which the text in the sentence belongs. For example, if the text in a sentence is a chinese text, the language type to which the sentence belongs is chinese; assuming that the text in a certain sentence is English text, the language type to which the sentence belongs is English; and so on.
In the case where the sentences in the plurality of languages are acquired, a plurality of data sets may be further created based on the sentences in the plurality of languages as a plurality of training sample sets.
It should be noted that, in the data sets, all sentences in the same data set belong to the same language type, and sentences in different data sets belong to different language types.
That is, for the above-described sentences of a plurality of languages, one data set may be created based on the sentences in which all the belonged languages are of the same kind. In this case, since there are sentences in a plurality of languages, there are correspondingly a plurality of created data sets, i.e., the number of created data sets is the same as the number of kinds of languages to which the sentences belong.
For example, assume that the number of sentences acquired is 100; further assume that there are 50 chinese sentences, 30 english sentences, and 20 french sentences. In this case, it is possible to create a data set 1 based on the 50 chinese sentences, a data set 2 based on the 30 english sentences, and a data set 3 based on the 20 french sentences; that is, the 50 chinese sentences are included in data set 1, the 30 english sentences are included in data set 2, and the 20 french sentences are included in data set 3. Since there are three types of languages to which the 100 sentences belong, there are correspondingly three data sets created.
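As an illustrative sketch only (the variable names and the toy data below are assumptions, not part of this specification), grouping the pre-labeled sentences by language into per-language data sets can be expressed as follows:

```python
# Minimal sketch: build one data set per language from pre-labeled sentences.
# Each sample is (language, text sequence, annotation sequence); sentences of the
# same language go into the same data set, so the number of data sets equals the
# number of languages represented in the samples.
from collections import defaultdict

labeled_sentences = [
    ("en", ["Bob", "drank", "coffee", "at", "Starbucks"],
           ["noun", "verb", "noun", "preposition", "noun"]),
    # ... Chinese, French, ... sentences with their annotation sequences
]

datasets = defaultdict(list)
for lang, tokens, tags in labeled_sentences:
    datasets[lang].append((tokens, tags))

# e.g. datasets["en"] now holds every English sentence together with its labeled sequence
```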
In the case where the plurality of data sets are created, the plurality of single language models may be trained based on the plurality of data sets until it is determined that the respective single language models converge.
That is, for any one of the plurality of single language models, a data set belonging to a language category corresponding to the single language model may be obtained, and the single language model may be trained based on the data set until it is determined that the single language model converges. For example, assuming that the language type corresponding to a single language model is chinese, the single language model may be trained based on a chinese dataset until it is determined that the single language model converges; assuming that the language type corresponding to a certain single language model is English, training the single language model based on an English data set until the single language model is determined to be converged; and so on.
Specifically, for a single language model, a data set belonging to a language category corresponding to the single language model may be input into the single language model, i.e., the data set is subjected to sequence labeling by using the single language model, so as to obtain a sequence labeling result of each sentence in the data set. Subsequently, a tagging loss may be calculated based on the sequence tagging result of each sentence in the data set obtained by the single language model and the sequence tagging result that each sentence in the data set is pre-tagged, and the model parameters of the single language model may be updated according to the tagging loss until it is determined that the single language model converges.
In practical applications, the single language model may be a Conditional Random Field (CRF).
To determine whether the single language model has converged, a loss of sequence labeling results for each sentence in the dataset from the single language model, relative to the sequence labeling results that were previously labeled for each sentence in the dataset, can be calculated as a labeling loss based on a common loss function of the conditional random field.
Subsequently, it may be determined whether the calculated loss of annotation is less than a preset threshold. If the calculated annotation loss is less than a preset threshold, convergence of the single language model may be determined. If the calculated annotation loss is greater than or equal to a preset threshold, the single language model is considered to be not converged, so that the model parameters of the single language model can be updated, the updated single language model is used for carrying out sequence annotation on the data set again, the sequence annotation result of each sentence in the data set obtained by the updated single language model is determined, and whether the annotation loss of the sequence annotation result which is relative to each sentence in the data set and is annotated in advance is smaller than the preset threshold or not is determined; and so on. The preset threshold may be preset by a technician, or may be a default value, which is not limited in this specification.
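A minimal sketch of this single-language training loop is given below. It assumes the pytorch-crf package for the CRF layer and uses a simple embedding plus linear layer as a stand-in encoder; the class and function names are illustrative, not part of this specification.

```python
# Minimal sketch: train one single language model (a CRF tagger) on its own data set,
# computing the annotation loss and updating parameters until the loss drops below a
# preset threshold (the convergence criterion described above).
import torch
from torch import nn
from torchcrf import CRF  # assumption: pytorch-crf package

class MonolingualTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.emissions = nn.Linear(emb_dim, num_tags)    # per-token label scores
        self.crf = CRF(num_tags, batch_first=True)

    def annotation_loss(self, token_ids, tag_ids, mask):
        scores = self.emissions(self.embed(token_ids))
        # pytorch-crf returns the log-likelihood; its negative is the annotation loss
        return -self.crf(scores, tag_ids, mask=mask, reduction='mean')

    def decode(self, token_ids, mask):
        scores = self.emissions(self.embed(token_ids))
        return self.crf.decode(scores, mask=mask)        # sequence labeling result

def train_until_converged(model, batches, threshold=0.05, max_epochs=50):
    """Update model parameters until the average annotation loss is below the threshold."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(max_epochs):
        total = 0.0
        for token_ids, tag_ids, mask in batches:
            optimizer.zero_grad()
            loss = model.annotation_loss(token_ids, tag_ids, mask)
            loss.backward()
            optimizer.step()
            total += loss.item()
        if total / len(batches) < threshold:             # convergence check
            break
    return model
```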
In the case where it is determined that the single language model converges with respect to any one of the plurality of single language models, it is considered that the training of the single language model is completed, and a training result of the trained single language model with respect to a data set belonging to a language type corresponding to the single language model can be acquired.
In practical applications, the training result may specifically include: the single language model predicts a posterior probability distribution of labels corresponding to each word in each sentence with respect to each sentence in the dataset. In addition, the training results may further include: the single language model predicts a sequence annotation result of each sentence with respect to each sentence in the data set.
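The sketch below illustrates one way to collect such a training result; it reuses the illustrative MonolingualTagger above and, as a simplifying assumption, approximates the per-token posterior with a softmax over emission scores (exact CRF marginals would require the forward-backward algorithm).

```python
# Minimal sketch: for a converged single language model, produce a posterior
# probability distribution over labels for every token of every sentence.
import torch
import torch.nn.functional as F

@torch.no_grad()
def soft_targets(model, token_ids):
    """Return a (batch, seq_len, num_tags) tensor of per-token label distributions."""
    scores = model.emissions(model.embed(token_ids))
    return F.softmax(scores, dim=-1)   # approximates q_t(y_i = j | x) per position i
```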
In this embodiment, when it is determined that all of the single language models are trained and training results of the trained single language models are obtained, a training sample set may be constructed based on sentences of the multiple languages (i.e., sentences used for training the multiple single language models) to which corresponding sequence labeling results are pre-labeled and training results of the trained single language models, and a preset multilingual model may be trained based on the training sample set until it is determined that the multilingual model converges.
For example, suppose that a chinese-based data set 1 trains a single language model 1 corresponding to chinese to obtain a trained single language model 1 and a corresponding training result 1; training a single language model 2 corresponding to English based on an English data set 2 to obtain a training result 2 corresponding to the trained single language model 2; the single language model 3 corresponding to the french is trained based on the french dataset 3, and a training result 3 corresponding to the trained single language model 3 is obtained. In this case, a chinese sentence in the data set 1, an english sentence in the data set 2, a french sentence in the data set 3, and the training results 1, 2, 3 may be constructed as a training sample set, and the preset multilingual model may be trained based on the training sample set until it is determined that the multilingual model converges.
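A minimal sketch of assembling such a training sample set is shown below; datasets and teacher_posteriors are illustrative names standing for the per-language data sets and the training results of the converged single language models.

```python
# Minimal sketch: combine every sentence of every language data set with its
# pre-labeled annotation sequence and with the training result (per-token label
# distributions) produced by the corresponding converged single language model.
def build_training_sample_set(datasets, teacher_posteriors):
    """datasets[lang]: list of (tokens, tags); teacher_posteriors[lang]: list of per-token distributions."""
    training_samples = []
    for lang, samples in datasets.items():
        for (tokens, tags), posterior in zip(samples, teacher_posteriors[lang]):
            training_samples.append({
                "language": lang,
                "tokens": tokens,
                "hard_tags": tags,               # pre-labeled annotation sequence
                "teacher_posterior": posterior,  # training result of the single language model
            })
    return training_samples
```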
In one embodiment shown, the multi-lingual model may be a conditional random field based on a BERT (Bidirectional Encoder Representations from Transformers) model.
In this case, the BERT model performs semantic representation calculation on each sentence in the training sample set in advance, the semantic representation result of each sentence output by the BERT model is input into the conditional random field, and the conditional random field performs sequence labeling on each sentence based on the semantic representation result of each sentence.
It should be noted that since the BERT model is applicable to a plurality of languages, the BERT model-based conditional random field is also applicable to a plurality of languages accordingly.
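A minimal sketch of such a BERT-based CRF is given below. It assumes the Hugging Face transformers library and the pytorch-crf package; the checkpoint name and class name are illustrative only.

```python
# Minimal sketch: a multilingual BERT encoder produces a semantic representation for
# each token, a linear layer maps it to emission scores, and a CRF layer performs the
# sequence labeling on top of those scores.
import torch
from torch import nn
from torchcrf import CRF                     # assumption: pytorch-crf package
from transformers import AutoModel           # assumption: Hugging Face transformers

class MultilingualBertCrf(nn.Module):
    def __init__(self, num_tags, pretrained="bert-base-multilingual-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(pretrained)
        self.emissions = nn.Linear(self.encoder.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def emission_scores(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.emissions(hidden)                    # (batch, seq_len, num_tags)

    def nll_loss(self, input_ids, attention_mask, tag_ids):
        # first annotation loss: negative log-likelihood against the hard labels
        scores = self.emission_scores(input_ids, attention_mask)
        return -self.crf(scores, tag_ids, mask=attention_mask.bool(), reduction='mean')

    def decode(self, input_ids, attention_mask):
        scores = self.emission_scores(input_ids, attention_mask)
        return self.crf.decode(scores, mask=attention_mask.bool())
```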
In one illustrated embodiment, to determine whether the multi-language model has been trained to convergence, on one hand, based on the commonly used loss function of the conditional random field, the loss of the sequence annotation results produced by the multi-language model for the sentences in the training sample set, relative to the sequence annotation results pre-annotated for those sentences, may be calculated (referred to as the first loss); on the other hand, the loss of the sequence annotation results produced by the multi-language model for those sentences, relative to the training results of the plurality of single language models, may be calculated (referred to as the second loss).
Specifically, the second loss may be calculated using the following equation:
L_pos = - Σ_i Σ_{j=1}^{|V|} q_t(y_i = j | x) · log q_s(y_i = j | x)

where L_pos denotes the second loss; q_t(y_i = j | x) denotes the probability that the single language model labels position i as j; q_s(y_i = j | x) denotes the probability that the multi-language model labels position i as j; and |V| denotes the size of the annotation set.
It should be noted that, for a single language model, the probability of the single language model being labeled as j at the position i can be obtained by performing data analysis on the posterior probability distribution of labels corresponding to words in each sentence predicted by the single language model with respect to each sentence in the data set.
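The sketch below computes this second loss; as an assumption, the multi-language model's distribution q_s is approximated by a softmax over its emission scores.

```python
# Minimal sketch: per-token cross-entropy between the teacher posteriors
# q_t(y_i = j | x) and the student distribution q_s(y_i = j | x), summed over the
# |V| labels and averaged over the non-padding positions.
import torch
import torch.nn.functional as F

def posterior_loss(student_emissions, teacher_posteriors, mask):
    """student_emissions, teacher_posteriors: (batch, seq_len, |V|); mask: (batch, seq_len)."""
    log_q_s = F.log_softmax(student_emissions, dim=-1)
    per_token = -(teacher_posteriors * log_q_s).sum(dim=-1)   # sum over the label set |V|
    per_token = per_token * mask.float()                       # ignore padding positions
    return per_token.sum() / mask.float().sum()
```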
After the first loss and the second loss are calculated, weighted addition may be performed on the first loss and the second loss based on a preset weight, and when the loss obtained by the weighted addition is smaller than a preset threshold, convergence of the multi-language model is determined; the weight may be preset by a technician, or may be a default value, which is not limited in this specification; the preset threshold may be preset by a technician, or may be a default value, which is not limited in this specification.
Specifically, the first loss and the second loss may be weighted and added using the following formula:
L_all = λ·L_pos + (1 - λ)·L_NLL

where L_all denotes the loss obtained by the weighted addition; L_pos denotes the second loss; L_NLL denotes the first loss; λ is the weight set for the second loss; and (1 - λ) is the weight set for the first loss. That is, the weights of the first loss and the second loss sum to 1.
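For illustration, the weighted combination and the convergence check can be sketched as follows (the values of λ and the threshold are assumptions, to be preset by a technician):

```python
# Minimal sketch: L_all = lambda * L_pos + (1 - lambda) * L_NLL, with the
# multi-language model considered converged once L_all is below a preset threshold.
def combined_loss(l_nll, l_pos, lam=0.5):
    return lam * l_pos + (1.0 - lam) * l_nll

def has_converged(l_all, threshold=0.05):
    return l_all < threshold
```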
In this embodiment, when it is determined that the multi-language model converges, it may be considered that the training of the multi-language model is completed, so that the subsequent sequence tagging task may be executed based on the trained multi-language model.
It should be further noted that the above sequence tagging method for multi-language text can be effectively applied to numerous industries such as e-commerce, telecommunication, government affairs, finance, education, entertainment, health, tourism, etc. For example, in these industries, it is often desirable to provide machine translation services to users; in the machine translation service, the sequence labeling method of the multi-language text can be adopted to perform sequence labeling on sentences of various languages input by a user, so that semantic analysis, segmented translation and the like can be performed according to a sequence labeling result, and a translation result closer to the actual semantic expressed by the user is output to the user.
In the above technical solution, the training results of a plurality of single language models for their respective language data sets are obtained, a training sample set is constructed from all of the language data sets and their training results, and a multi-language model is trained on the training sample set until it converges; the converged multi-language model is then used to perform sequence labeling on text. In this way, when the multi-language model handles a sequence labeling task for multi-language text, it can reach an accuracy comparable to that of a single-language model handling text of its own language; that is, the accuracy of sequence labeling of multi-language text by the multi-language model is improved.
Corresponding to the embodiment of the sequence labeling method of the multi-language text, the specification also provides an embodiment of a sequence labeling device of the multi-language text.
The embodiment of the device for marking the sequence of the multilingual texts can be applied to electronic equipment. The device embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. Taking a software implementation as an example, as a logical device, the device is formed by reading, by a processor of the electronic device where the device is located, a corresponding computer program instruction in the nonvolatile memory into the memory for operation. From a hardware aspect, as shown in fig. 3, the hardware structure diagram of the electronic device where the multi-language text sequence labeling apparatus is located in this specification is shown, except for the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 3, the electronic device where the apparatus is located in the embodiment may generally include other hardware according to the actual function labeled by the multi-language text sequence, which is not described again.
Referring to fig. 4, fig. 4 is a block diagram of a device for labeling sequences of multiple language texts according to an exemplary embodiment of the present disclosure. The sequence annotation device 40 for multilingual texts can be applied to the electronic device shown in fig. 3, and comprises:
a first obtaining module 401, configured to obtain training results of multiple single language models for corresponding language data sets;
a construction module 402, configured to construct a training sample set according to all language data sets and training results thereof;
a training module 403, configured to train a multi-language model using the training sample set until the multi-language model converges;
and the labeling module 404 is configured to perform sequence labeling on the text by using the converged multi-language model.
In this embodiment, the apparatus 40 further includes:
a second obtaining module 405, configured to obtain a data set for a first language before obtaining training results of multiple single language models for corresponding language data sets, where data in the data set is a sentence with a sequence tagging result;
a calculating module 406, configured to perform sequence labeling on the data set by using a monolingual model of a first language, and calculate a labeling loss;
and an updating module 407, configured to update the model parameters of the monolingual model of the first language according to the annotation loss until the monolingual model of the first language converges.
In this embodiment, the first obtaining module 401 is specifically configured to:
and inputting a first sentence in the data set of the first language into the converged single language model of the first language to obtain a sequence labeling result of the first sentence.
In this embodiment, the sequence labeling result includes a posterior probability distribution of the label corresponding to each word in the first sentence.
In this embodiment, whether the monolingual model of the first language converges is determined by:
and if the annotation loss is less than a preset threshold value, determining that the single language model of the first language is converged.
In this embodiment, the training module 403 is specifically configured to:
performing sequence annotation on the training sample set by using the multi-language model, and calculating a first annotation loss;
performing sequence annotation on the training result by using the multi-language model, and calculating a second annotation loss;
and weighting the first annotation loss and the second annotation loss based on a preset weight, and determining the multi-language model to be converged under the condition that the loss obtained by weighting is less than a preset threshold value.
In this embodiment, the monolingual model is a conditional random field.
In this embodiment, the multi-language model is a conditional random field based on a BERT model;
the training module 403 is specifically configured to:
carrying out semantic representation calculation on the sentences in the training sample set by the BERT model, inputting semantic representation results of the sentences into the conditional random field, and carrying out sequence labeling on the sentences by the conditional random field based on the semantic representation results.
The implementation process of the functions and actions of each module in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, wherein the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in the specification. One of ordinary skill in the art can understand and implement it without inventive effort.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present description to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments herein. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
The above description is only for the purpose of illustrating the preferred embodiments of the one or more embodiments of the present disclosure, and is not intended to limit the scope of the one or more embodiments of the present disclosure, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the one or more embodiments of the present disclosure should be included in the scope of the one or more embodiments of the present disclosure.

Claims (18)

1. A method of sequence annotation for multilingual text, the method comprising:
obtaining training results of a plurality of single language models for respective language data sets;
constructing a training sample set according to all language data sets and training results thereof;
training a multi-language model by using the training sample set until the multi-language model converges;
and performing sequence labeling on the text by using the converged multi-language model.
2. The method of claim 1, prior to obtaining training results for a plurality of single language models for respective language data sets, the method further comprising:
obtaining a data set for a first language, wherein data in the data set is a sentence with a sequence labeling result;
performing sequence annotation on the data set by using a single language model of a first language, and calculating annotation loss;
and updating the model parameters of the single language model of the first language according to the annotation loss until the single language model of the first language converges.
3. The method of claim 2, the obtaining training results for a plurality of single language models for respective language data sets, comprising:
and inputting a first sentence in the data set of the first language into the converged single language model of the first language to obtain a sequence labeling result of the first sentence.
4. The method of claim 3, wherein the sequence annotation result comprises a posterior probability distribution of the annotation corresponding to each word in the first sentence.
5. The method of claim 2, determining whether the monolingual model of the first language converges by:
and if the annotation loss is less than a preset threshold value, determining that the single language model of the first language is converged.
6. The method of claim 1, the training a multi-language model using the training sample set until the multi-language model converges, comprising:
performing sequence annotation on the training sample set by using the multi-language model, and calculating a first annotation loss;
performing sequence annotation on the training result by using the multi-language model, and calculating a second annotation loss;
and weighting the first annotation loss and the second annotation loss based on a preset weight, and determining the multi-language model to be converged under the condition that the loss obtained by weighting is less than a preset threshold value.
7. The method of any one of claims 1-6, the monolingual model being a conditional random field.
8. The method of any of claims 1-6, the multilingual model being a BERT model-based conditional random field;
the training of the multilingual model using the training sample set includes:
carrying out semantic representation calculation on the sentences in the training sample set by the BERT model, inputting semantic representation results of the sentences into the conditional random field, and carrying out sequence labeling on the sentences by the conditional random field based on the semantic representation results.
9. An apparatus for sequence annotation of multilingual text, said apparatus comprising:
the first acquisition module is used for acquiring training results of a plurality of single language models for corresponding language data sets;
the construction module is used for constructing a training sample set according to all language data sets and training results thereof;
the training module is used for training the multi-language model by using the training sample set until the multi-language model converges;
and the labeling module is used for performing sequence labeling on the text by using the converged multi-language model.
10. The apparatus of claim 9, further comprising:
a second obtaining module, configured to obtain a data set for a first language before obtaining training results of multiple single language models for corresponding language data sets, where data in the data set is a sentence with a sequence labeling result;
the calculation module is used for carrying out sequence annotation on the data set by using a single language model of a first language and calculating annotation loss;
and the updating module is used for updating the model parameters of the single language model of the first language according to the annotation loss until the single language model of the first language is converged.
11. The apparatus of claim 10, wherein the first obtaining module is specifically configured to:
and inputting a first sentence in the data set of the first language into the converged single language model of the first language to obtain a sequence labeling result of the first sentence.
12. The apparatus of claim 11, wherein the sequence annotation result comprises a posterior probability distribution of annotations corresponding to respective words in the first sentence.
13. The apparatus of claim 10, determining whether the monolingual model of the first language converges by:
and if the annotation loss is less than a preset threshold value, determining that the single language model of the first language is converged.
14. The apparatus of claim 9, wherein the training module is specifically configured to:
performing sequence annotation on the training sample set by using the multi-language model, and calculating a first annotation loss;
performing sequence annotation on the training result by using the multi-language model, and calculating a second annotation loss;
and weighting the first annotation loss and the second annotation loss based on a preset weight, and determining the multi-language model to be converged under the condition that the loss obtained by weighting is less than a preset threshold value.
15. The apparatus of any one of claims 9-14, the monolingual model being a conditional random field.
16. The apparatus of any of claims 9-14, the multilingual model being a BERT model-based conditional random field;
the training module is specifically configured to:
carrying out semantic representation calculation on the sentences in the training sample set by the BERT model, inputting semantic representation results of the sentences into the conditional random field, and carrying out sequence labeling on the sentences by the conditional random field based on the semantic representation results.
17. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements the method of any one of claims 1 to 8 by executing the executable instructions.
18. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method of any one of claims 1 to 8.
CN202011112593.2A 2020-10-16 2020-10-16 Sequence labeling method and device for multi-language text and electronic equipment Pending CN114386395A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011112593.2A CN114386395A (en) 2020-10-16 2020-10-16 Sequence labeling method and device for multi-language text and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011112593.2A CN114386395A (en) 2020-10-16 2020-10-16 Sequence labeling method and device for multi-language text and electronic equipment

Publications (1)

Publication Number Publication Date
CN114386395A true CN114386395A (en) 2022-04-22

Family

ID=81194255

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011112593.2A Pending CN114386395A (en) 2020-10-16 2020-10-16 Sequence labeling method and device for multi-language text and electronic equipment

Country Status (1)

Country Link
CN (1) CN114386395A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110941945A (en) * 2019-12-02 2020-03-31 百度在线网络技术(北京)有限公司 Language model pre-training method and device
CN111695344A (en) * 2019-02-27 2020-09-22 阿里巴巴集团控股有限公司 Text labeling method and device
CN111738004A (en) * 2020-06-16 2020-10-02 中国科学院计算技术研究所 Training method of named entity recognition model and named entity recognition method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111695344A (en) * 2019-02-27 2020-09-22 阿里巴巴集团控股有限公司 Text labeling method and device
CN110941945A (en) * 2019-12-02 2020-03-31 百度在线网络技术(北京)有限公司 Language model pre-training method and device
CN111738004A (en) * 2020-06-16 2020-10-02 中国科学院计算技术研究所 Training method of named entity recognition model and named entity recognition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XINYU WANG et al.: "Structure-level knowledge distillation for multilingual sequence labeling", arXiv, 4 May 2020 (2020-05-04), pages 1-14 *

Similar Documents

Publication Publication Date Title
US10417350B1 (en) Artificial intelligence system for automated adaptation of text-based classification models for multiple languages
JP6829559B2 (en) Named place name dictionary for documents for named entity extraction
US20190287142A1 (en) Method, apparatus for evaluating review, device and storage medium
CN105988990B (en) Chinese zero-reference resolution device and method, model training method and storage medium
US20070118351A1 (en) Apparatus, method and computer program product for translating speech input using example
JP6462970B1 (en) Classification device, classification method, generation method, classification program, and generation program
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
CN108475264B (en) Machine translation method and device
CN106778878B (en) Character relation classification method and device
US11651015B2 (en) Method and apparatus for presenting information
US20210004438A1 (en) Identifying entity attribute relations
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
US20220391647A1 (en) Application-specific optical character recognition customization
Korpusik et al. Data collection and language understanding of food descriptions
CN114528413B (en) Knowledge graph updating method, system and readable storage medium supported by crowdsourced marking
CN110413996B (en) Method and device for constructing zero-index digestion corpus
CN110516175B (en) Method, device, equipment and medium for determining user label
CN109614494B (en) Text classification method and related device
CN114386395A (en) Sequence labeling method and device for multi-language text and electronic equipment
CN113919354A (en) Natural language enhancement processing method and device for text countermeasure
JP5342574B2 (en) Topic modeling apparatus, topic modeling method, and program
CN112579774A (en) Model training method, model training device and terminal equipment
JP5398638B2 (en) Symbol input support device, symbol input support method, and program
Saquete et al. Combining automatic acquisition of knowledge with machine learning approaches for multilingual temporal recognition and normalization
CN117951303B (en) Text information relevance analysis method and equipment based on generation type large model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination