CN113836271A - Method and product for natural language processing - Google Patents

Method and product for natural language processing

Info

Publication number
CN113836271A
CN113836271A (application number CN202111146400.XA)
Authority
CN
China
Prior art keywords
language
corpus
representation
semantic
code vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111146400.XA
Other languages
Chinese (zh)
Other versions
CN113836271B (en)
Inventor
杨惠云
陈华栋
周浩
李磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202111146400.XA priority Critical patent/CN113836271B/en
Publication of CN113836271A publication Critical patent/CN113836271A/en
Priority to PCT/CN2022/119325 priority patent/WO2023051284A1/en
Application granted granted Critical
Publication of CN113836271B publication Critical patent/CN113836271B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present disclosure relate to methods and products for natural language processing. The method includes: generating a first semantic code vector based on a representation of a corpus in a first language; generating a second semantic code vector based on a representation of a corpus in a second language different from the first language; generating a mixed semantic vector by mixing the first semantic code vector and the second semantic code vector; and generating a mixed representation of the corpus in the second language based on the mixed semantic vector. Embodiments of the disclosure also relate to a method and an apparatus for training a natural language model. With the method, the accuracy of conversion between different languages is effectively improved and the cost of semantic learning is reduced, so that the execution results of downstream tasks are more accurate and the overhead of computing resources is reduced.

Description

Method and product for natural language processing
Technical Field
Embodiments of the present disclosure relate to the field of natural language processing technology, and more particularly, to a method, apparatus, device, medium, and program product for semantic conversion of different languages.
Background
Some pre-trained natural language models can handle tasks of converting corpora between different languages. However, both the source language and the target language need to be widely used languages, and the corpus in the target language still loses semantic information. Especially when the two languages belong to different language families, this loss of semantic information can be very pronounced and may even prevent downstream tasks from continuing to execute. Moreover, when there are not enough pre-labeled corpora available as sample data, a corresponding natural language model cannot be trained at all.
To improve the accuracy of a natural language model, the model may be trained using more pre-labeled corpora. However, the cost of obtaining pre-labeled corpora is typically high. Moreover, more training data makes the model more complex, and the overhead of computing resources is correspondingly large. Similar problems exist in other models that need to perform cross-language conversion tasks.
Disclosure of Invention
Embodiments of the present disclosure provide a method, apparatus, device, medium, and program product for natural language processing.
In a first aspect of the disclosure, a method for natural language processing is provided. The method comprises the following steps: generating a first semantic code vector based on the representation of the corpus in the first language; generating a second semantic code vector based on a representation of a corpus in a second language different from the first language; generating a mixed semantic vector by mixing the first semantic code vector and the second semantic code vector; and generating a mixed representation of the corpus in the second language based on the mixed semantic vector.
In a second aspect of the disclosure, a method for training a natural language processing model is provided. The method includes: acquiring sample data, the sample data including a representation of a corpus in a first language and a representation of a corpus in a second language; acquiring sample labels pre-labeled for the corpus in the first language and the corpus in the second language; and training the natural language processing model using the sample data and the sample labels.
In a third aspect of the present disclosure, an apparatus for natural language processing is provided. The device includes: a first semantic vector module configured to generate a first semantic code vector based on a representation of a corpus of a first language; a second semantic vector module configured to generate a second semantic code vector based on a representation of a corpus in a second language different from the first language; a mixed semantic vector module configured to generate a mixed semantic vector by mixing the first semantic encoding vector and the second semantic encoding vector; and a mixed representation module configured to generate a mixed representation of the corpus in the second language based on the mixed semantic vector.
In a fourth aspect of the present disclosure, an apparatus for training a natural language processing model is provided. The apparatus includes: a sample data module configured to acquire sample data, the sample data including a representation of a corpus in a first language and a representation of a corpus in a second language; a sample label module configured to acquire sample labels pre-labeled for the corpus in the first language and the corpus in the second language; and a training module configured to train the natural language processing model using the sample data and the sample labels.
In a fifth aspect of the present disclosure, an electronic device is provided. The electronic device includes: a memory and a processor; wherein the memory is for storing one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method according to the first aspect or the second aspect.
In a sixth aspect of the disclosure, a computer-readable storage medium is provided. The computer readable storage medium has stored thereon one or more computer instructions, wherein the one or more computer instructions are executed by a processor to implement the method according to the first or second aspect.
In a seventh aspect of the disclosure, a computer program product is provided. The computer program product comprises one or more computer instructions, wherein the one or more computer instructions are executed by a processor to implement the method according to the first aspect or the second aspect.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements, and wherein:
FIG. 1 illustrates a schematic diagram of a use environment for a natural language processing method in accordance with certain embodiments of the present disclosure;
FIG. 2 illustrates a flow diagram of a natural language processing method in accordance with certain embodiments of the present disclosure;
FIG. 3 illustrates a visualization schematic of the differences in representations of linguistic data across languages, in accordance with certain embodiments of the present disclosure;
FIG. 4 illustrates a flow diagram of a method of training a natural language processing model in accordance with certain embodiments of the present disclosure;
FIG. 5 illustrates a visualization schematic of the accuracy of a conversion of a representation of a corpus of cross-languages, in accordance with certain embodiments of the present disclosure;
FIG. 6 illustrates a block diagram of a natural language processing device, in accordance with certain embodiments of the present disclosure;
FIG. 7 illustrates a block diagram of an apparatus for training a natural language processing model in accordance with certain embodiments of the present disclosure; and
FIG. 8 illustrates a block diagram of a computing system in which one or more embodiments of the disclosure may be implemented.
Throughout the drawings, the same or similar reference numbers refer to the same or similar elements.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
The term "language" as used in this disclosure refers to the category of languages, such as english, chinese, french, and the like. The term "corpus" as used in this disclosure refers to a form of presentation of a language, such as text presented in words, which has a conceptual content and meaning that can be understood by a user who is in the language. A corpus may also be information or data of some nature. Examples of types of information or data include, without limitation, voice, video, text, pictures or documents, and so forth. The term "representing" as used in this disclosure refers to mapping a corpus into corresponding vectors, e.g., word-embedded vectors, for processing by a computing system. Examples of technologies that can be used to map the corpus into a representation may be a known word2vec technology or a one hot technology, and other methods may also be used to map the corpus into a representation corresponding to the corpus, which is not limited in this disclosure.
The term "convert" as used herein refers to converting between any two types of information or data. Examples of conversion include, without limitation, translation between two languages, conversion between speech and text, conversion between text and pictures, and so forth. In the context of the present disclosure, for the purpose of convenience of discussion and description, a translation process between different languages is mainly taken as an example of the conversion process. In general, the conversion process can be implemented by means of a corresponding conversion model. Thus, the term "model" or "layer" will sometimes be used in the following description to refer to the respective conversion process.
The term "training" or "learning" as used herein refers to a process of optimizing system performance using experience or data. For example, machine translation systems may gradually optimize translation performance, such as improving translation accuracy, through a training or learning process. In the context of the present disclosure, the terms "training" or "learning" may be used interchangeably for purposes of discussion convenience.
The term "natural language processing method/model" as used herein refers to a method/model that is built based on a priori knowledge associated with the syntax, grammar, lexical, etc. of a particular language, and may be used to generate a conversion result during the conversion process. The conversion result may include generating a corpus of the target language and may also include generating a representation of the corpus of the target language, which may continue to be used by other subjects for other tasks, such as classification tasks, labeling tasks, and the like.
As used herein, the terms "comprises," comprising, "and variations thereof are intended to be open-ended, i.e.," including, but not limited to. The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment". Relevant definitions for other terms will be given in the following description.
The inventors have noted that, in natural language processing models, corpora (e.g., text) in different languages are usually mapped into vectors and, after a series of processing steps, converted from vectors back into text. Therefore, the accuracy with which corpora of different languages with associated (e.g., identical) semantics are represented as vectors is an important factor in whether the language conversion is accurate. Differences in the accuracy of the representations of cross-language corpora lead to significant differences in language conversion performance, and the converted corpus may even lose its semantics.
When training a natural language model, if a large number of sample labels based on pre-labeled corpora and a large amount of sample data in the source language are lacking, a natural language processing model with good performance cannot be trained. However, because the work of labeling corpora is tedious and enormous, the cost of obtaining pre-labeled corpora is very high. Meanwhile, labeled corpora can realistically only be produced for languages with large numbers of users. For the many long-tailed languages (i.e., minor languages, or even regional languages), hardly anyone is willing to label data. Therefore, using the labeled corpus of one language to support cross-language conversion into multiple languages becomes meaningful.
Even with pre-trained natural language processing models, such as the known BERT model, there are cases where the corpus in the converted language loses semantics. There may also be situations where the source language cannot be converted into a long-tailed language at all, because there are not enough labeled corpora, and vice versa.
The inventors have also found that when the two languages belong to different language families, such as a Sino-Tibetan language and an Indo-European language, or when at least one of them is a dialect, the difference between the vector representations of corpora of the different languages with associated (e.g., identical) semantics can be very large, which also affects the accuracy of the natural language processing.
Conventional methods, however, improve conversion accuracy by adding sample data and sample labels, at the cost of increased computing resource overhead, such as a more complex model, and an increased cost of acquiring labeled corpora.
In embodiments of the present disclosure, processing performance will be improved by increasing the accuracy of the representation on vectors of corpora in different languages with associated semantics, without relying on large amounts of annotation data. This is different from the accuracy of conversion that is improved by adding training data in conventional natural language processing. Thus, the principles of operation and mechanisms of the present disclosure are significantly different from any known method.
In some embodiments of the present disclosure, a method for natural language processing is provided. The method generates a mixed semantic vector by mixing semantic encoding vectors of representations of corpora of different languages, and generates a mixed representation based on the mixed semantic vector. This allows for reduced differences in the representation of the corpus in different languages with associated semantics over vectors, thereby improving conversion accuracy and conversion efficiency, and reducing the overhead of computational resources.
In the description that follows, certain embodiments will be discussed with reference to language translation processes, e.g., between English, Chinese, and so on. It will nevertheless be understood that no limitation of the scope of the disclosure is thereby intended; these embodiments are discussed only so that the principles and concepts disclosed herein can be understood by way of example.
FIG. 1 illustrates a schematic diagram of a use environment 100 for a natural language processing method in accordance with certain embodiments of the present disclosure.
As shown in fig. 1, an executing agent (such as a computing system) obtains (e.g., receives) a representation 101 of a corpus in a first language and generates a first semantic code vector 103 from it. In parallel or sequentially, the executing agent obtains (e.g., receives) a representation 102 of a corpus in a second language different from the first language and generates a second semantic code vector 104. By mixing the first semantic code vector 103 and the second semantic code vector 104, a mixed semantic vector 105 is generated. Based on the mixed semantic vector 105, a mixed representation 106 of the corpus in the second language is generated.
Any suitable method may be used to generate the first semantic code vector 103 or the second semantic code vector 104. As an example, in some embodiments, the first semantic code vector 103 or the second semantic code vector 104 may be generated using a BERT model-based multi-head (Multihead) mechanism.
As an example, the representation 101 is input into a multi-head layer. In this layer, the input representation 101 is compressed, encoded, and semantically extracted. The generated first semantic code vector may be a vector of smaller dimension that carries semantic features corresponding to the semantics implied by the corpus; it is an abstract expression of those semantics in a vector space and may also be referred to as a hidden state or hidden vector.
As an example, in the same way, a second semantic code vector 104 may be generated.
Note that although the multi-head layer may be based on the known BERT model, it can be made more suitable for the conversion tasks discussed herein by applying the training methods described herein. In other words, the multi-head layer may be trained specifically for these conversion tasks, which may make the conversion more accurate, as described in more detail below.
One example of mixing the first semantic code vector 103 and the second semantic code vector 104 is to mix the first semantic code vector 103 and the second semantic code vector 104 by a mixing ratio λ associated with both. During mixing, the mixing degree of the semantics of the first language and the semantics of the second language can be controlled by adjusting the mixing ratio λ, so that a mixed semantic vector 105 in which the associated semantics of the two languages are fused is generated.
As an example of generating a hybrid representation, the mixed semantic vector 105 may be mapped to a mixed representation 106 in the same space as the representation of the corpus of the second language. The mixed representation 106 is more accurate because it fuses in the associated semantics of the first language, which narrows the differences, described above, between the vector representations of corpora in different languages that have associated semantics.
As an example of applying a hybrid representation, the mixed representation 106 may be input to a normalization layer and an output layer to generate data suitable for interfacing with downstream tasks. For example, for a classification task, the output of the output layer may be the classification result and the probability corresponding to that result. For a translation task, the output of the output layer may be the translated corpus. The present disclosure does not limit how the mixed representation of the corpus related to the target language is applied.
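The flow of fig. 1 can be summarized with the following hedged sketch in Python (PyTorch). The single linear "encoders", the fixed mixing ratio, the pooled one-vector-per-sentence representations, and the linear decoder are simplifying assumptions for illustration only; they are not the actual components of the disclosed model, and the adaptive mixing described around formulas (1) to (6) below is omitted here.

# Simplified sketch of FIG. 1: encode both representations, mix them, decode.
import torch
import torch.nn as nn

REP_DIM, SEM_DIM = 128, 32   # assumed dimensions

encode_src = nn.Linear(REP_DIM, SEM_DIM)   # representation 101 -> first semantic code vector 103
encode_tgt = nn.Linear(REP_DIM, SEM_DIM)   # representation 102 -> second semantic code vector 104
layer_norm = nn.LayerNorm(SEM_DIM)
decode_tgt = nn.Linear(SEM_DIM, REP_DIM)   # mixed semantic vector 105 -> mixed representation 106

def mixed_representation(rep_src: torch.Tensor, rep_tgt: torch.Tensor, lam: float = 0.3) -> torch.Tensor:
    h_src = encode_src(rep_src)                             # first semantic code vector
    h_tgt = encode_tgt(rep_tgt)                             # second semantic code vector
    h_mix = layer_norm(lam * h_src + (1.0 - lam) * h_tgt)   # mixed semantic vector
    return decode_tgt(h_mix)                                # mixed representation

rep_en = torch.randn(1, REP_DIM)   # pooled representation of the English corpus
rep_zh = torch.randn(1, REP_DIM)   # pooled representation of the Chinese corpus
print(mixed_representation(rep_en, rep_zh).shape)  # torch.Size([1, 128])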
FIG. 2 illustrates a flow diagram of a natural language processing method 200 in accordance with certain embodiments of the present disclosure. For ease of presentation, the language translation and processing performed by method 200 will be described with english and chinese as examples. As noted above, however, this is merely exemplary and is not intended to limit the scope of the present disclosure in any way. Embodiments of the method 200 described herein can be used for translation and processing between any other suitable languages as well.
As described above, a statement in English (e.g., "Today is Sunday") may be converted into a representation 101, e.g., mapped to a vector. By way of example only, this vector may be a 128-dimensional vector. At block 201, a first semantic code vector 103 is generated based on the representation 101. For example, the first semantic code vector 103 may be a vector of smaller dimension, such as a 32-dimensional vector.
Likewise, the sentence "today is sunday" in Chinese may also be converted into a representation 102, e.g. mapped as a vector. For example only, the vector may be a 128-dimensional vector, for example. At block 202, a second semantic code vector 104 is generated based on the representation 102. For example, the second semantic code vector 104 may also be a smaller-dimensional vector, such as a 32-dimensional vector.
The generated first semantic code vector 103 and second semantic code vector 104 (such as 103 and 104 shown in fig. 1) may represent a mapping of corpora of different languages on another space, and the mapping includes semantic features.
It should be understood that in general, block 201 and block 202 may be performed in parallel, but may also be performed sequentially, as the present disclosure is not limited thereto.
At block 203, a mixed semantic vector 105 (105 shown in fig. 1) is generated by mixing the first semantic code vector 103 and the second semantic code vector 104. As an example, the first semantic code vector 103 and the second semantic code vector 104 may be weighted mixed in a mixing ratio.
Additionally or alternatively, in some embodiments, one example of generating the mixed semantic vector 105 is to mix the first semantic code vector 103 and the second semantic code vector 104 with a mixing ratio λ. Here, the first semantic code vector 103 and the second semantic code vector 104 have associated (e.g., the same) semantics (e.g., the English corpus "Today is Sunday" and the corresponding Chinese corpus).
The mixed semantic vector 105 includes both semantic features of the corpus of the source language and semantic features of the corresponding corpus of the target language. This reduces the probability of losing semantics during language conversion, and because the mixed semantic vector has a small dimension, it is convenient to compute and store.
Additionally or alternatively, in some embodiments, semantic cross-correlations between the corpus of the source language and the corpus of the target language may be extracted through a multi-head attention mechanism, such as by the following formulas:

H_l^S = \mathrm{Multihead}(H_{l-1}^S, H_{l-1}^S, H_{l-1}^S)    (1)

H_l^T = \mathrm{Multihead}(H_{l-1}^T, H_{l-1}^T, H_{l-1}^T)    (2)

\tilde{H}_l^T = \mathrm{Multihead}(H_l^T, H_l^S, H_l^S)    (3)

where S denotes the source language (i.e., the first language); T denotes the target language (i.e., the second language); H denotes a semantic code vector; l denotes the index of a layer among the layers used by the natural language processing method/model; \tilde{H}_l^T denotes the semantic code vector corresponding to the representation of the corpus of the target language that is associated with the semantics of the corpus of the source language; and Multihead is an operator denoting a multi-head attention operation.

As an example, \tilde{H}_l^T may be generated by the multi-head attention mechanism using the second semantic code vector 104 as the query vector (Q in FIG. 1) and the first semantic code vector 103 as the key vector (K in FIG. 1) and the value vector (V in FIG. 1).
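A hedged sketch of the cross-attention step of formula (3) follows, using PyTorch's nn.MultiheadAttention: the second (target-language) semantic code vector is the query Q and the first (source-language) semantic code vector supplies the key K and value V. The dimensions, head count, and random inputs are illustrative assumptions.

# Illustrative cross-attention: Q from the target, K and V from the source.
import torch
import torch.nn as nn

SEM_DIM, NUM_HEADS = 32, 4
cross_attn = nn.MultiheadAttention(embed_dim=SEM_DIM, num_heads=NUM_HEADS, batch_first=True)

h_src = torch.randn(1, 6, SEM_DIM)   # first semantic code vectors  (batch, src_len, dim)
h_tgt = torch.randn(1, 5, SEM_DIM)   # second semantic code vectors (batch, tgt_len, dim)

h_tilde_tgt, attn_weights = cross_attn(query=h_tgt, key=h_src, value=h_src)
print(h_tilde_tgt.shape)   # torch.Size([1, 5, 32])
print(attn_weights.shape)  # torch.Size([1, 5, 6])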
Additionally or alternatively, in some embodiments, the mixing process may be determined by using the following formula:

H_{\mathrm{mix},l}^T = \mathrm{LN}\big(\lambda \cdot \tilde{H}_l^T + (1 - \lambda) \cdot H_l^T\big)    (4)

where the mixing ratio λ is between 0 and 1, and LN is an operator denoting a normalization operation.
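A minimal sketch of formula (4) under the same assumptions: the cross-attended target vector and the original target vector are mixed with a fixed ratio λ and then normalized. In the disclosure, λ can instead be computed adaptively, as formula (6) below describes.

# Illustrative blending per formula (4): mix, then normalize.
import torch
import torch.nn as nn

SEM_DIM = 32
layer_norm = nn.LayerNorm(SEM_DIM)

def mix(h_tilde_tgt: torch.Tensor, h_tgt: torch.Tensor, lam: float = 0.3) -> torch.Tensor:
    return layer_norm(lam * h_tilde_tgt + (1.0 - lam) * h_tgt)

print(mix(torch.randn(1, 5, SEM_DIM), torch.randn(1, 5, SEM_DIM)).shape)  # torch.Size([1, 5, 32])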
Additionally or alternatively, in some embodiments, a weight A associated with the semantic importance between the corpus of the first language and the corpus of the second language may be determined based on the first semantic code vector 103 and the second semantic code vector 104. Based on the entropy associated with the weight A, the mixing ratio λ is determined.
Additionally or alternatively, in some embodiments, the entropy associated with the weight A may be determined by using the following formula:

H(A) = -\frac{1}{I} \sum_{i=1}^{I} \sum_{j=1}^{J} A_{ij} \log A_{ij}    (5)

where I is the number of word sequences in the target language; J is the number of word sequences in the source language; i indexes the i-th word sequence in the target language; j indexes the j-th word sequence in the source language; and H is the information entropy operator.

Specifically, in one embodiment, and by way of example only, the entropy of A may be obtained by first determining A through the following equation:

A = \mathrm{softmax}\left( \frac{H^T (H^S)^{\top}}{\sqrt{n}} \right)

where softmax is the normalized exponential function; n is the number of sequences; and \top denotes the transpose operator.
It can be seen that the degree of association in semantic importance between the corpus of the first language and the corpus of the second language can be determined through the weight A, and the mixing ratio λ is generated by calculating the information entropy associated with the weight A. Since the information entropy reflects the degree of semantic loss (which can also be understood as translation quality) from the corpus of the first language to the corpus of the second language, the mixing ratio λ can be used to control the degree of mixing, and the degree of mixing can be adjusted so that the translation process reaches the best quality of language conversion.
Additionally or alternatively, in some embodiments, the mixing ratio λ may be determined by using the following formula:
λ = λ_O · σ[(H(A) + H(A^⊤))W + b]    (6)

where W and b are parameters that can be obtained by training; σ is the sigmoid function; and λ_O is the maximum value of the mixing ratio λ. As an example, λ_O may be 0.5.
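The following sketch ties formulas (5) and (6) together: an attention-weight matrix A between the target and source semantic code vectors is computed, its entropy and the entropy of its transpose are taken, and λ is obtained through a trainable affine map, a sigmoid, and the upper bound λ_O. The scaled dot-product form of A, the entropy averaging, and all dimensions are assumptions consistent with the symbol definitions above rather than a verbatim restatement of the patent formulas.

# Illustrative computation of the mixing ratio lambda from attention entropy.
import torch
import torch.nn as nn

SEM_DIM, LAMBDA_O = 32, 0.5
to_lambda = nn.Linear(2, 1)   # plays the role of the trainable parameters (W, b)

def entropy(attn: torch.Tensor) -> torch.Tensor:
    # Mean row-wise entropy of a row-stochastic attention matrix.
    return -(attn * attn.clamp_min(1e-9).log()).sum(dim=-1).mean()

def mixing_ratio(h_tgt: torch.Tensor, h_src: torch.Tensor) -> torch.Tensor:
    scores = h_tgt @ h_src.transpose(-1, -2) / SEM_DIM ** 0.5
    a = scores.softmax(dim=-1)                       # A: (tgt_len, src_len)
    a_t = scores.transpose(-1, -2).softmax(dim=-1)   # attention in the reverse direction
    feats = torch.stack([entropy(a), entropy(a_t)])  # [H(A), H(A^T)]
    return LAMBDA_O * torch.sigmoid(to_lambda(feats))

lam = mixing_ratio(torch.randn(5, SEM_DIM), torch.randn(6, SEM_DIM))
print(float(lam))   # a value in (0, 0.5)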
Additionally or alternatively, in some embodiments, mixing the first semantic code vector 103 and the second semantic code vector 104 may include: sampling representations of corpora in a first language and corpora in a second language; and mixing a first semantic code vector 103 corresponding to the sampled representation 101 of the corpus in the first language with a second semantic code vector 104 corresponding to the sampled representation 102 of the corpus in the second language.
The corpus used when training the natural language processing model differs from the corpus used when applying the natural language processing model (i.e., in the inference phase); this may be referred to herein as exposure bias. In order to reduce the influence of exposure bias, the natural language processing method of the present disclosure proposes a sampling scheme. In particular, a portion of the representation of the corpus in the second language may be selected and fed into a portion of the representation of the corpus in the first language to reduce such exposure bias. The number of samples is controlled by a probability threshold, as described in detail below.
Additionally or alternatively, in some embodiments, the corpus of the second language comprises a translation corpus from a corpus of the first language to a corpus of the second language. Due to the richness of language expression or the insufficient number of labeled corpora, some training data can be artificially constructed, so that the translation effect is better.
By way of example, when translating the English sentence "Today is Sunday" into Chinese, there may be many translation results, i.e., several Chinese phrasings with the same meaning. These translated corpora can therefore also be determined as corpora of the second language, so that different expressions of the same semantics can be learned.
Additionally or alternatively, in some embodiments, the representations of the corpora in the first language and the representations of the corpora in the second language may be processed in batches. A probability threshold p is determined based on a function associated with an exponent of the batch size, and the number of samples of the representation of the corpus in the first language and the representation of the corpus in the second language is adjusted based on the probability threshold p.
As an example, the probability threshold p may be determined by an inverse sigmoid decay function associated with an exponent of the batch size.
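The disclosure only states that the probability threshold p follows a reverse (inverse) sigmoid decay tied to an exponent related to the batch; one common schedule of that shape, in the style of scheduled sampling, is sketched below. The functional form, the use of the batch index as the decay variable, and the constant k are assumptions for illustration.

# Illustrative inverse sigmoid decay for the probability threshold p.
import math

def probability_threshold(batch_index: int, k: float = 100.0) -> float:
    """Decays smoothly from about 1.0 toward 0.0 as the batch index grows."""
    return k / (k + math.exp(batch_index / k))

for step in (0, 100, 500, 1000):
    print(step, round(probability_threshold(step), 3))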
At block 204, a mixed representation 106 of the corpus in the second language, such as 106 shown in fig. 1, may be generated based on the mixed semantic vector 105. As an example, the mixed semantic vector 105 may be mapped to a mixed representation 106 of the corpus of the second language using a known linear normalization function and decoder.
By the method of the present disclosure, differences in the representation of the cross-language corpus with associated semantics may be reduced.
FIG. 3 illustrates a visualization diagram of the differences in representations of linguistic data across languages, in accordance with certain embodiments of the present disclosure.
As can be seen from fig. 3, compared with related-art methods, the natural language processing method of the present disclosure reduces the differences between the representations of English (en) and those of Chinese (zh), Urdu (ur), and Swahili (sw) to within the concentric circles, with clearly no discrete outliers far from the center of the circle.
Because the hybrid semantic vector includes both semantic features of the corpus of the source language and semantic features of the corresponding corpus of the target language, differences in representations of the cross-language corpus with associated semantics may be reduced. This may increase the accuracy of the language conversion. Meanwhile, as a large amount of training data is not needed, the expenditure of computing resources is reduced.
The present disclosure also proposes a method of training a natural language model on which the natural language processing method described above can be run.
FIG. 4 illustrates a flow diagram of a method 400 of training a natural language processing model in accordance with certain embodiments of the present disclosure.
At block 401, sample data is acquired. The sample data includes a representation of a corpus in a first language and a representation of a corpus in a second language. As an example, English text crawled from web pages on the Internet may be used as the corpus.
At block 402, sample labels pre-labeled for the corpus in the first language and the corpus in the second language are acquired. As an example, English and Chinese text crawled from web pages on the Internet may be used as the corpora, and the labels obtained after annotation are used as the sample labels.
At block 403, a natural language processing model is trained using the sample data and the sample labels. As an example, the back-propagation (BP) algorithm or another known training algorithm may be used. The natural language processing model is trained to learn relationships between Chinese and English, such as grammar, syntax, lexicon, word senses, and the like.
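A hedged sketch of one training step for block 403 follows: the model consumes the paired representations, and a cross-entropy loss against the pre-labeled sample labels is back-propagated. ToyModel, the pooled 128-dimensional inputs, and the optimizer settings are placeholders for illustration; they are not the disclosed model, whose internals are described elsewhere in this document.

# Illustrative supervised training step (block 403).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyModel(nn.Module):
    """Stand-in for the natural language processing model (illustrative only)."""
    def __init__(self, rep_dim: int = 128, num_classes: int = 3):
        super().__init__()
        self.classify = nn.Linear(2 * rep_dim, num_classes)

    def forward(self, rep_src: torch.Tensor, rep_tgt: torch.Tensor) -> torch.Tensor:
        return self.classify(torch.cat([rep_src, rep_tgt], dim=-1))

def train_step(model, optimizer, rep_src, rep_tgt, labels):
    optimizer.zero_grad()
    logits = model(rep_src, rep_tgt)
    loss = F.cross_entropy(logits, labels)   # supervised by the sample labels
    loss.backward()                          # back-propagation (BP)
    optimizer.step()
    return loss.item()

model = ToyModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
rep_src, rep_tgt = torch.randn(4, 128), torch.randn(4, 128)
labels = torch.randint(0, 3, (4,))
print(train_step(model, optimizer, rep_src, rep_tgt, labels))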
It is understood that, in general, blocks 401 and 402 may be performed in parallel, but may also be performed sequentially, as the present disclosure is not limited thereto.
In the prior art, when training a natural language model, only the representation of the corpus in the source language (i.e., the first language) is used as the training data input to the model, and the pre-labeled corpus in the target language (i.e., the second language) is used as the sample label for the model output. Unlike this, the training method of the present disclosure mixes the semantics of the corpora of the two languages from the start, so that the trained natural language model can be better used for the conversion tasks discussed herein. Semantic features of corpora in different languages with associated semantics can be learned better, making the conversion more accurate. Further, since there is no need to improve the quality of language conversion at the expense of more training data, the overhead of computing resources can be reduced.
Additionally or alternatively, in some embodiments, the training method may further include block 404.
At block 404, a sum of a task loss function associated with cross entropy of the representation of the corpus in the first language and the representation of the corpus in the second language and a consistency loss function associated with mean square error or relative entropy of the representation of the corpus in the first language and the representation of the corpus in the second language is determined as a target loss function, and a natural language processing model is trained.
As an example, the target loss function L may be determined by the following formula:

L = L_{task} + \mathrm{MSE}(R^S, R^T) + \mathrm{KL}(P^S \,\|\, P^T)    (7)

where L_{task} is the task loss function; R is the mean pooling of the semantic code vectors; P is the probability of the candidate corpus of the second language at conversion time; MSE is the mean square error; and KL is the relative entropy. The second term (MSE) and the third term (KL) in equation (7) may exist simultaneously, or only one of the two terms may be present.
Additionally or alternatively, L_{task} may be associated with a cross entropy of the representation of the corpus in the first language and the representation of the corpus in the second language. As an example, L_{task} can be determined by the following formula:

L_{task} = -\sum_{c=1}^{C} y_c \log p_c    (8)

or

L_{task} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} y_{n,c} \log p_{n,c}    (9)

where C is the number of sample tags; N is the length of the representation of the corpus; y denotes the sample label; and p denotes the predicted probability.
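A hedged sketch of the target loss of formula (7): the task cross-entropy of formula (8)/(9) plus a consistency term, computed here as the MSE between mean-pooled semantic code vectors and the KL divergence between the two output distributions. The tensor shapes, the use of both consistency terms at once, and the variable names are illustrative assumptions; per the description above, MSE and KL may be used together or individually.

# Illustrative target loss combining task, MSE, and KL terms (formula (7)).
import torch
import torch.nn.functional as F

def target_loss(logits_src, logits_tgt, labels, h_src, h_tgt):
    task = F.cross_entropy(logits_tgt, labels)              # cf. formulas (8)/(9)
    r_src, r_tgt = h_src.mean(dim=1), h_tgt.mean(dim=1)     # mean pooling R of the code vectors
    mse = F.mse_loss(r_src, r_tgt)
    kl = F.kl_div(F.log_softmax(logits_tgt, dim=-1),
                  F.softmax(logits_src, dim=-1), reduction="batchmean")
    return task + mse + kl

logits_src, logits_tgt = torch.randn(2, 3), torch.randn(2, 3)
labels = torch.tensor([0, 2])
h_src, h_tgt = torch.randn(2, 6, 32), torch.randn(2, 5, 32)
print(target_loss(logits_src, logits_tgt, labels, h_src, h_tgt))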
Additionally or alternatively, in some embodiments, the sample data may further include sample data formed by combining the representation corresponding to the pre-labeled corpus of the first language with the representation corresponding to the pre-labeled corpus of the second language, wherein the corpus of the second language includes a translation corpus from the corpus of the first language to the corpus of the second language.
Thus, in this training method, the input training data makes the model aware of the semantic pairing relationship between the source language and the target language from the start, and through the translation corpora the natural language model learns more ways of expressing the same semantics. Moreover, the multi-head layer discussed above can likewise be trained to be more suitable for the conversion tasks discussed herein.
Thus, this can further reduce the difference in representations of corpora of different languages having associated semantics as compared to the corpora not being translated, thereby providing at least one advantage as described herein.
FIG. 5 illustrates a visualization schematic of the accuracy of the conversion of a representation of a corpus of cross-languages, according to certain embodiments of the present disclosure.
As can be seen from fig. 5, the natural language processing model generated by the training method of the present disclosure makes the distribution of the centroids of the representations of the same semantics across corpora of different languages more concentrated than in the prior art. This means that the representations of the corpora of these languages differ less and are more accurate, yielding the advantages described above.
Table 1 below provides an exemplary comparison of the execution results, i.e., the translation quality, of prior-art natural language processing methods and the method of the present disclosure. XLM-R, Trans-train, and Filter are comparative methods. "High resource" denotes languages with a large number of users and a large amount of pre-labeled corpus, such as English; "medium resource" denotes languages with a medium number of users and a medium amount of pre-labeled corpus, such as Thai; and "low resource" denotes languages with a small number of users and little pre-labeled corpus, such as Swahili.
Resource level                      High    Medium  Low     Average
XLM-R                               82.4    79.7    73.7    79.2
Trans-train                         84.7    83.4    79.2    82.9
Filter                              85.7    84.3    80.5    83.9
Method of the present disclosure    86.8    85.7    82.0    85.3

TABLE 1
As can be seen from table 1, the method of the present disclosure provides the highest performance score. This means that the method of the present disclosure can effectively improve translation quality (i.e., performance of the conversion of cross-language semantics).
FIG. 6 illustrates a block diagram of a natural language processing device 600, in accordance with certain embodiments of the present disclosure.
The apparatus comprises a first semantic vector module 601 configured to generate a first semantic code vector based on a representation of a corpus of a first language. The apparatus also includes a second semantic vector module 602 configured to generate a second semantic code vector based on a representation of a corpus in a second language different from the first language. The apparatus further comprises a mixed semantic vector module 603 configured to generate a mixed semantic vector by mixing the first semantic encoding vector and the second semantic encoding vector. And the apparatus further comprises a mixed representation module 604 configured to generate a mixed representation of the corpus in the second language based on the mixed semantic vector.
As an example, the mixing process may be determined by equation (4).
Additionally or alternatively, generating the hybrid representation in the second language may comprise: the first semantic code vector and the second semantic code vector are mixed based on a mixing ratio of the first semantic code vector and the second semantic code vector, wherein the first semantic code vector and the second semantic code vector have associated semantics.
Additionally or alternatively, the apparatus may further include a mixing proportion module 605 configured to determine a weight associated with semantic importance between the corpus of the first language and the corpus of the second language based on the first semantic code vector and the second semantic code vector; and determining a mixing proportion based on the entropy associated with the weight.
As an example, the weight and the mixing ratio may be determined by formula (5) and formula (6).
Additionally or alternatively, mixing the first semantic code vector and the second semantic code vector may include: sampling representations of corpora in a first language and corpora in a second language; and blending a first semantic code vector corresponding to the sampled representation of the corpus of the first language with a second semantic code vector corresponding to the sampled representation of the corpus of the second language.
Additionally or alternatively, the corpus of the second language may include: from a corpus in a first language to a translated corpus in a second language.
Additionally or alternatively, the apparatus may be further configured to process the representation of the corpus in the first language and the representation of the corpus in the second language in batches, and the apparatus further comprises a probability threshold module 606 configured to determine a probability threshold p based on a function associated with an exponent of the batch size, and to adjust the number of samples of the representation of the corpus in the first language and the representation of the corpus in the second language based on the probability threshold p.
As an example, the probability threshold p may be determined by an inverse sigmoid decay function associated with an exponent of the batch size.
With the apparatus 600 of the present disclosure, differences in representations of corpora of different languages with associated semantics may be reduced, thereby achieving at least one advantage as the natural language processing method 200 described above.
FIG. 7 illustrates a block diagram of an apparatus 700 for training a natural language processing model in accordance with certain embodiments of the present disclosure. The apparatus includes a sample data module 701 configured to obtain sample data, the sample data including a representation of a corpus of a first language and a representation of a corpus of a second language. The apparatus also includes a sample tag module 702 configured to obtain sample tags pre-labeled for the corpus in the first language and the corpus in the second language. The apparatus also includes a training module 703 configured to train the natural language processing model using the sample data and the sample labels.
Additionally or alternatively, the apparatus may further include a loss function module 704 configured to determine a sum of a task loss function and a consistency loss function as a target loss function, train the natural language processing model, wherein the task loss function is associated with a cross entropy of the representation of the corpus of the first language and the representation of the corpus of the second language, and the consistency loss function is associated with a mean square error or a relative entropy of the representation of the corpus of the first language and the representation of the corpus of the second language.
As an example, the target loss function may be determined by equation (7) as described above. The task loss function may be determined by equation (8) or equation (9) as described above.
Additionally or alternatively, the sample data may further include sample data formed by combining the representation corresponding to the pre-labeled corpus of the first language with the representation corresponding to the pre-labeled corpus of the second language, wherein the corpus of the second language includes a translation corpus from the corpus of the first language to the corpus of the second language.
By way of example, when translating the English sentence "Today is Sunday" into Chinese, there may be many translation results, i.e., several Chinese phrasings with the same meaning. These may be determined as corpora of the second language.
It can be seen that the natural language processing model generated by the training apparatus of the present disclosure mixes the semantics of the corpora of the two languages from the beginning, so that the trained natural language model can be better used for the conversion task discussed herein. Semantic features of corpora of different languages with associated semantics may be better learned, making the conversion more accurate. And the overhead of computational resources can be reduced since there is no need to improve the quality of language conversion at the expense of increasing training data.
FIG. 8 illustrates a block diagram of a computing system 800 in which one or more embodiments of the disclosure may be implemented. The methods 200 and 400 illustrated in fig. 2 and 4 may be implemented by the computing system 800. The computing system 800 shown in fig. 8 is only an example, and should not be construed as limiting the scope or functionality of use of the implementations described herein.
As shown in fig. 8, computing system 800 is in the form of a general purpose computing device. Components of computing system 800 may include, but are not limited to, one or more processors or processing units 800, memory 820, one or more input devices 830, one or more output devices 840, storage 850, and one or more communication units 860. The processing unit 800 may be a real or virtual processor and may be capable of performing various processes in accordance with programs stored in the memory 820. In a multi-processing system, multiple processing units execute computer-executable instructions to increase processing power.
Computing system 800 typically includes a number of computer media. Such media may be any available media that is accessible by computing system 800 and includes, but is not limited to, volatile and non-volatile media, removable and non-removable media. The memory 820 may be volatile memory (e.g., registers, cache, Random Access Memory (RAM)), non-volatile memory (e.g., Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory), or some combination thereof. Storage 850 may be removable or non-removable, and may include machine-readable media, such as a flash drive, a magnetic disk, or any other medium, which may be capable of being used to store information and which may be accessed within computing system 800.
The computing system 800 may further include additional removable/non-removable, volatile/nonvolatile computer system storage media. Although not shown in FIG. 8, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, non-volatile optical disk may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 820 may include at least one program product having (e.g., at least one) set of program modules that are configured to carry out the functions of various embodiments described herein.
A program/utility tool 822 having a set of one or more execution modules 824 may be stored, for example, in the memory 820. The execution modules 824 may include, but are not limited to, an operating system, one or more application programs, other program modules, and program data. Each of these examples, or some combination thereof, may include an implementation of a networking environment. The execution modules 824 generally perform the functions and/or methods of the embodiments of the subject matter described herein, such as the method 200.
The input unit 830 may be one or more of various input devices. For example, the input unit 830 may include a user device such as a mouse, a keyboard, a trackball, or the like. A communication unit 860 enables communication over a communication medium to another computing entity. Additionally, the functionality of the components of computing system 800 may be implemented in a single computing cluster or in multiple computing machines capable of communicating over a communication connection. Thus, the computing system 800 may operate in a networked environment using logical connections to one or more other servers, network personal computers (PCs), or another general network node. By way of example, and not limitation, communication media include wired or wireless networking technologies.
Computing system 800 can also communicate with one or more external devices (not shown), such as storage devices, display devices, etc., as desired, one or more devices that enable a user to interact with computing system 800, or any device (e.g., network card, modem, etc.) that enables computing system 800 to communicate with one or more other computing devices. Such communication may be performed via input/output (I/O) interfaces (not shown).
The functions described herein may be performed, at least in part, by one or more hardware logic components. By way of example, and not limitation, illustrative types of hardware logic components that may be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
Program code for implementing the methodologies of the subject matter described herein may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Further, while operations are depicted in a particular order, this should be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the subject matter described herein. Certain features that are described in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Some example implementations of the present disclosure are listed below.
In certain embodiments of the first aspect, a method for natural language processing is provided. The method comprises the following steps: generating a first semantic code vector based on the representation of the corpus in the first language; generating a second semantic code vector based on a representation of a corpus in a second language different from the first language; generating a mixed semantic vector by mixing the first semantic code vector and the second semantic code vector; and generating a mixed representation of the corpus in the second language based on the mixed semantic vector.
In some embodiments, generating the hybrid representation in the second language comprises: the first semantic code vector and the second semantic code vector are mixed based on a mixing ratio of the first semantic code vector and the second semantic code vector, wherein the first semantic code vector and the second semantic code vector have associated semantics.
In certain embodiments, the method further comprises: determining a weight associated with semantic importance between the corpus of the first language and the corpus of the second language based on the first semantic encoding vector and the second semantic encoding vector; and determining a mixing proportion based on the entropy associated with the weight.
In some embodiments, mixing the first semantic code vector and the second semantic code vector comprises: sampling representations of corpora in a first language and corpora in a second language; and blending a first semantic code vector corresponding to the sampled representation of the corpus of the first language with a second semantic code vector corresponding to the sampled representation of the corpus of the second language.
In some embodiments, the corpus in the second language comprises: from a corpus in a first language to a translated corpus in a second language.
In certain embodiments, the method further comprises: batching representations of the corpora in the first language and representations of the corpora in the second language; determining a probability threshold based on a function associated with an index of a size of a batch; and adjusting a number of samples of the representation of the corpus in the first language and the representation of the corpus in the second language based on a probability threshold.
In some embodiments of the second aspect, a method for training a natural language processing model is provided. The method includes: acquiring sample data, the sample data including a representation of a corpus in a first language and a representation of a corpus in a second language; acquiring sample labels pre-labeled for the corpus in the first language and the corpus in the second language; and training the natural language processing model using the sample data and the sample labels.
In certain embodiments, the method further comprises: determining a sum of a task loss function and a consistency loss function as a target loss function for training the natural language processing model, wherein the task loss function is associated with a cross entropy of the representation of the corpus in the first language and the representation of the corpus in the second language, and the consistency loss function is associated with a mean square error or a relative entropy of the representation of the corpus in the first language and the representation of the corpus in the second language.
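A minimal sketch of such a target loss, assuming the task loss sums the cross entropy of both languages' predictions against shared labels and the consistency loss compares the two output distributions, is:

```python
import torch
import torch.nn.functional as F

def target_loss(logits_first: torch.Tensor,
                logits_second: torch.Tensor,
                labels: torch.Tensor,
                consistency: str = "kl") -> torch.Tensor:
    # Task loss: cross entropy of each language's predictions against the
    # shared labels (summing the two terms is an assumption of this sketch).
    task = F.cross_entropy(logits_first, labels) + F.cross_entropy(logits_second, labels)
    if consistency == "mse":
        cons = F.mse_loss(logits_first, logits_second)
    else:
        # Relative entropy (KL divergence) between the two output distributions.
        cons = F.kl_div(F.log_softmax(logits_first, dim=-1),
                        F.softmax(logits_second, dim=-1),
                        reduction="batchmean")
    return task + cons
```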
In some embodiments, the sample data further comprises sample data formed by combining the representation corresponding to the pre-labeled corpus in the first language with the representation corresponding to the pre-labeled corpus in the second language, wherein the corpus in the second language comprises a translation corpus translated from the corpus in the first language into the second language.
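For illustration, combined sample data could be assembled by pairing each pre-labeled first-language representation with the representation of its translation and attaching the shared label; the dictionary layout is an assumption of this sketch:

```python
def build_samples(reps_first: list, reps_second_translated: list, labels: list) -> list:
    """Pair each first-language representation with the representation of its
    translation corpus and attach the shared pre-labeled sample label."""
    assert len(reps_first) == len(reps_second_translated) == len(labels)
    return [
        {"first": r1, "second": r2, "label": y}
        for r1, r2, y in zip(reps_first, reps_second_translated, labels)
    ]
```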
In an embodiment of the third aspect, an apparatus for natural language processing is provided. The apparatus comprises: a first semantic vector module configured to generate a first semantic code vector based on a representation of a corpus in a first language; a second semantic vector module configured to generate a second semantic code vector based on a representation of a corpus in a second language different from the first language; a mixed semantic vector module configured to generate a mixed semantic vector by mixing the first semantic code vector and the second semantic code vector; and a mixed representation module configured to generate a mixed representation of the corpus in the second language based on the mixed semantic vector.
In certain embodiments, the apparatus further comprises: a mixing ratio module configured to determine a weight associated with semantic importance between the corpus in the first language and the corpus in the second language based on the first semantic code vector and the second semantic code vector, and to determine the mixing ratio based on an entropy associated with the weight.
In some embodiments, mixing the first semantic code vector and the second semantic code vector comprises: sampling the representations of the corpus in the first language and the corpus in the second language; and blending the first semantic code vector corresponding to the sampled representation of the corpus in the first language with the second semantic code vector corresponding to the sampled representation of the corpus in the second language.
In some embodiments, the corpus in the second language comprises a translation corpus translated from the corpus in the first language into the second language.
In some embodiments, the apparatus is further configured to batch the representations of the corpus in the first language and the representations of the corpus in the second language, and the apparatus further comprises: a probability threshold module configured to determine a probability threshold based on a function associated with an index of a size of a batch, and to adjust the number of samples of the representation of the corpus in the first language and the representation of the corpus in the second language based on the probability threshold.
In an embodiment of the fourth aspect, an apparatus for training a natural language processing model is provided. The apparatus comprises: a sample data module configured to acquire sample data, wherein the sample data comprises a representation of a corpus in a first language and a representation of a corpus in a second language; a sample label module configured to acquire sample labels pre-labeled for the corpus in the first language and the corpus in the second language; and a training module configured to train the natural language processing model using the sample data and the sample labels.
In certain embodiments, the apparatus further comprises: a loss function module configured to determine a sum of a task loss function and a consistency loss function as a target loss function and to train the natural language processing model with the target loss function, wherein the task loss function is associated with a cross entropy of the representation of the corpus in the first language and the representation of the corpus in the second language, and the consistency loss function is associated with a mean square error or a relative entropy of the representation of the corpus in the first language and the representation of the corpus in the second language.
In some embodiments, the sample data further comprises sample data formed by combining the representation corresponding to the pre-labeled corpus in the first language with the representation corresponding to the pre-labeled corpus in the second language, wherein the corpus in the second language comprises a translation corpus translated from the corpus in the first language into the second language.
In an embodiment of the fifth aspect, an electronic device is provided. The electronic device includes: a memory and a processor; wherein the memory is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method according to the first aspect or the second aspect.
In an embodiment of the sixth aspect, a computer-readable storage medium is provided. The computer readable storage medium has stored thereon one or more computer instructions, wherein the one or more computer instructions are executed by a processor to implement the method according to the first or second aspect.
In an embodiment of the seventh aspect, a computer program product is provided. The computer program product comprises one or more computer instructions which, when executed by a processor, implement the method according to the first or second aspect.
Although the disclosure has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (21)

1. A method for natural language processing, comprising:
generating a first semantic code vector based on a representation of a corpus in a first language;
generating a second semantic code vector based on a representation of a corpus in a second language different from the first language;
generating a mixed semantic vector by mixing the first semantic code vector and the second semantic code vector; and
generating a mixed representation of the corpus of the second language based on the mixed semantic vector.
2. The method of claim 1, wherein generating the mixed representation of the corpus of the second language comprises:
mixing the first semantic code vector and the second semantic code vector based on a mixing ratio of the first semantic code vector and the second semantic code vector, wherein the first semantic code vector and the second semantic code vector have associated semantics.
3. The method of claim 2, further comprising:
determining a weight associated with semantic importance between the corpus of the first language and the corpus of the second language based on the first semantic code vector and the second semantic code vector; and
determining the mixing ratio based on an entropy associated with the weight.
4. The method of claim 2, wherein mixing the first semantic code vector and the second semantic code vector comprises:
sampling representations of the corpora in the first language and the corpora in the second language; and
blending the first semantic code vector corresponding to the sampled representation of the corpus of the first language with the second semantic code vector corresponding to the sampled representation of the corpus of the second language.
5. The method of claim 4, wherein the corpus of the second language comprises:
a translation corpus from a corpus of the first language to a corpus of the second language.
6. The method of claim 4 or 5, further comprising:
batching representations of the corpora in the first language and representations of the corpora in the second language;
determining a probability threshold based on a function associated with an index of a size of a batch; and
adjusting a number of samples of the representation of the corpus of the first language and the representation of the corpus of the second language based on the probability threshold.
7. A method for training a natural language processing model, comprising:
acquiring sample data, wherein the sample data comprises a representation of a corpus of a first language and a representation of a corpus of a second language;
acquiring sample labels pre-labeled for the linguistic data of the first language and the linguistic data of the second language; and
training the natural language processing model using the sample data and the sample labels.
8. The method of claim 7, further comprising:
determining a sum of a task loss function and a consistency loss function as a target loss function, and training the natural language processing model, wherein the task loss function is associated with cross entropy of the representation of the corpus of the first language and the representation of the corpus of the second language, and the consistency loss function is associated with mean square error or relative entropy of the representation of the corpus of the first language and the representation of the corpus of the second language.
9. The method of claim 7, wherein the sample data further comprises:
sample data formed by combining the representation corresponding to the pre-labeled corpus of the first language with the representation corresponding to the pre-labeled corpus of the second language, wherein the corpus of the second language comprises a translation corpus from the corpus of the first language to the corpus of the second language.
10. An apparatus for natural language processing, comprising:
a first semantic vector module configured to generate a first semantic code vector based on a representation of a corpus of a first language;
a second semantic vector module configured to generate a second semantic code vector based on a representation of a corpus in a second language different from the first language;
a mixed semantic vector module configured to generate a mixed semantic vector by mixing the first semantic code vector and the second semantic code vector; and
a mixed representation module configured to generate a mixed representation of the corpus of the second language based on the mixed semantic vector.
11. The apparatus of claim 10, wherein generating the mixed representation of the corpus of the second language comprises:
mixing the first semantic code vector and the second semantic code vector based on a mixing ratio of the first semantic code vector and the second semantic code vector, wherein the first semantic code vector and the second semantic code vector have associated semantics.
12. The apparatus of claim 11, further comprising:
a mixing ratio module configured to determine a weight associated with semantic importance between the corpus of the first language and the corpus of the second language based on the first semantic code vector and the second semantic code vector; and
determine the mixing ratio based on an entropy associated with the weight.
13. The apparatus of claim 11, wherein mixing the first semantic code vector and the second semantic code vector comprises:
sampling representations of the corpora in the first language and the corpora in the second language; and
blending the first semantic code vector corresponding to the sampled representation of the corpus of the first language with the second semantic code vector corresponding to the sampled representation of the corpus of the second language.
14. The apparatus of claim 13, wherein the corpus of the second language comprises:
a translation corpus from a corpus of the first language to a corpus of the second language.
15. The apparatus according to claim 13 or 14, further configured to batch representations of corpora in the first language and representations of corpora in the second language, and further comprising:
a probability threshold module configured to determine a probability threshold based on a function associated with an index of a size of a batch; and
adjust the number of samples of the representation of the corpus of the first language and the representation of the corpus of the second language based on the probability threshold.
16. An apparatus for training a natural language processing model, comprising:
a sample data module configured to acquire sample data, wherein the sample data comprises a representation of a corpus of a first language and a representation of a corpus of a second language;
a sample label module configured to acquire sample labels pre-labeled for the corpus of the first language and the corpus of the second language; and
a training module configured to train the natural language processing model using the sample data and the sample labels.
17. The apparatus of claim 16, further comprising:
a loss function module configured to determine a sum of a task loss function and a consistency loss function as a target loss function and to train the natural language processing model with the target loss function, wherein the task loss function is associated with a cross entropy of the representation of the corpus of the first language and the representation of the corpus of the second language, and the consistency loss function is associated with a mean square error or a relative entropy of the representation of the corpus of the first language and the representation of the corpus of the second language.
18. The apparatus of claim 16, wherein the sample data further comprises:
sample data formed by combining the representation corresponding to the pre-labeled corpus of the first language with the representation corresponding to the pre-labeled corpus of the second language, wherein the corpus of the second language comprises a translation corpus from the corpus of the first language to the corpus of the second language.
19. An electronic device, comprising:
a memory and a processor;
wherein the memory is to store one or more computer instructions, wherein the one or more computer instructions are to be executed by the processor to implement the method of any of claims 1 to 6 or claims 7 to 9.
20. A computer readable storage medium having one or more computer instructions stored thereon, wherein the one or more computer instructions are executed by a processor to implement the method of any of claims 1-6 or claims 7-9.
21. A computer program product comprising one or more computer instructions which, when executed by a processor, implement the method of any one of claims 1 to 6 or claims 7 to 9.
CN202111146400.XA 2021-09-28 2021-09-28 Method and product for natural language processing Active CN113836271B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111146400.XA CN113836271B (en) 2021-09-28 2021-09-28 Method and product for natural language processing
PCT/CN2022/119325 WO2023051284A1 (en) 2021-09-28 2022-09-16 Natural language processing method and product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111146400.XA CN113836271B (en) 2021-09-28 2021-09-28 Method and product for natural language processing

Publications (2)

Publication Number Publication Date
CN113836271A true CN113836271A (en) 2021-12-24
CN113836271B CN113836271B (en) 2023-08-15

Family

ID=78967295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111146400.XA Active CN113836271B (en) 2021-09-28 2021-09-28 Method and product for natural language processing

Country Status (2)

Country Link
CN (1) CN113836271B (en)
WO (1) WO2023051284A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023051284A1 (en) * 2021-09-28 2023-04-06 北京有竹居网络技术有限公司 Natural language processing method and product

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117521652B (en) * 2024-01-05 2024-04-12 一站发展(北京)云计算科技有限公司 Intelligent matching system and method based on natural language model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104834686A (en) * 2015-04-17 2015-08-12 中国科学院信息工程研究所 Video recommendation method based on hybrid semantic matrix
US20170212890A1 (en) * 2016-01-26 2017-07-27 International Business Machines Corporation Generation of a natural language resource using a parallel corpus
CN110795913A (en) * 2019-09-30 2020-02-14 北京大米科技有限公司 Text encoding method and device, storage medium and terminal
CN110825829A (en) * 2019-10-16 2020-02-21 华南理工大学 Method for realizing autonomous navigation of robot based on natural language and semantic map
CN112200889A (en) * 2020-10-30 2021-01-08 上海商汤智能科技有限公司 Sample image generation method, sample image processing method, intelligent driving control method and device
CN112287688A (en) * 2020-09-17 2021-01-29 昆明理工大学 English-Burmese bilingual parallel sentence pair extraction method and device integrating pre-training language model and structural features
CN112580373A (en) * 2020-12-26 2021-03-30 内蒙古工业大学 High-quality Mongolian unsupervised neural machine translation method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3616083A4 (en) * 2017-04-23 2021-01-13 Nuance Communications, Inc. Multi-lingual semantic parser based on transferred learning
CN111368559A (en) * 2020-02-28 2020-07-03 北京字节跳动网络技术有限公司 Voice translation method and device, electronic equipment and storage medium
CN111460838B (en) * 2020-04-23 2023-09-22 腾讯科技(深圳)有限公司 Pre-training method, device and storage medium of intelligent translation model
CN112001181B (en) * 2020-07-17 2024-02-13 北京百度网讯科技有限公司 Training method, device, equipment and storage medium for multilingual semantic representation model
CN113158695A (en) * 2021-05-06 2021-07-23 上海极链网络科技有限公司 Semantic auditing method and system for multi-language mixed text
CN113836271B (en) * 2021-09-28 2023-08-15 北京有竹居网络技术有限公司 Method and product for natural language processing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
姚亮等 [YAO Liang et al.]: "基于语义分布相似度的翻译模型领域自适应研究" [Research on domain adaptation of translation models based on semantic distribution similarity], 《山东大学学报(理学版)》 [Journal of Shandong University (Natural Science)], no. 07, 31 May 2016 (2016-05-31) *

Also Published As

Publication number Publication date
WO2023051284A1 (en) 2023-04-06
CN113836271B (en) 2023-08-15

Similar Documents

Publication Publication Date Title
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN111813802B (en) Method for generating structured query statement based on natural language
CN110956018B (en) Training method of text processing model, text processing method, text processing device and storage medium
US11861307B2 (en) Request paraphrasing system, request paraphrasing model and request determining model training method, and dialogue system
CN110737758A (en) Method and apparatus for generating a model
CN111401084B (en) Method and device for machine translation and computer readable storage medium
KR102254612B1 (en) method and device for retelling text, server and storage medium
WO2020050893A1 (en) Natural language question answering
WO2023051284A1 (en) Natural language processing method and product
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
JP7335300B2 (en) Knowledge pre-trained model training method, apparatus and electronic equipment
CN114818891B (en) Small sample multi-label text classification model training method and text classification method
WO2023051148A1 (en) Method and apparatus for multilingual processing
JP2021033995A (en) Text processing apparatus, method, device, and computer-readable storage medium
CN111814451A (en) Text processing method, device, equipment and storage medium
CN114780703A (en) Method for determining question-answering model, question-answering method, device, medium and equipment
CN114036950A (en) Medical text named entity recognition method and system
CN114780582A (en) Natural answer generating system and method based on form question and answer
CN108664464B (en) Method and device for determining semantic relevance
CN114722833A (en) Semantic classification method and device
CN107798386B (en) Multi-process collaborative training based on unlabeled data
CN113362809B (en) Voice recognition method and device and electronic equipment
CN114936564A (en) Multi-language semantic matching method and system based on alignment variational self-coding
CN114676699A (en) Entity emotion analysis method and device, computer equipment and storage medium
CN114595687A (en) Laos language text regularization method based on BilSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant