WO2022135080A1

WO2022135080A1 - Corpus sample determination method and apparatus, electronic device, and storage medium

Info

Publication number: WO2022135080A1
Application number: PCT/CN2021/134269
Authority: WO
Inventors: 曹军; 许润昕; 王明轩; 李磊
Original assignee: 北京有竹居网络技术有限公司
Priority date: 2020-12-23
Filing date: 2021-11-30
Publication date: 2022-06-30
Also published as: CN112668339A

Abstract

The present invention provides a corpus sample determination method and apparatus, an electronic device, and a storage medium. The corpus sample determination method comprises: constructing semantic vectors of a source-side corpus and a target-side corpus in a general domain in a text corpus; retrieving, in the constructed semantic vectors, a candidate vector set similar to a target semantic vector in a target domain, wherein the candidate vector set comprises a semantic vector of at least one side of at least one domain; and determining, according to the target semantic vector and the candidate vector set, a corpus sample corresponding to the target domain.

Description

Corpus sample determination method, device, electronic device and storage medium

This application claims the priority of the Chinese Patent Application No. 202011538595.8 filed with the China Patent Office on December 23, 2020, the entire contents of which are incorporated herein by reference.

Technical Field

The present disclosure relates to machine learning technology, for example, to a method, apparatus, electronic device and storage medium for determining a corpus sample.

Background Art

Machine translation refers to the translation of source text in one language into target text in another language. With the rapid development of deep learning technology, the quality of neural machine translation has continuously improved, and it plays an increasingly important role in daily life and industrial production. Building a translation model requires a large number of corpus samples for training, so that the model can learn the semantic features of different corpora and effectively translate the source text into the target text. However, syntax and semantics differ across fields. For example, medicine, law, and economics each have many professional terms, and more specialized translation models need to be trained to ensure translation accuracy.

A monolingual corpus in the target domain can be selected to construct corpus samples, or can be used to construct a pseudo-parallel corpus in the target domain as corpus samples for training the translation model, so as to achieve domain adaptation. However, the corpus samples these methods use to train the translation model cover only a single specific target domain. Corpus samples in a specific domain are difficult to obtain, the sample size is usually small, and the semantic features of the samples lack variety, making it difficult to train the translation model adequately.

Summary of the Invention

The present disclosure provides a method, device, electronic device and storage medium for determining corpus samples, which improve the diversity of corpus samples.

A corpus sample determination method is provided, including:

constructing semantic vectors of the source-end corpus and the target-end corpus of a general domain in a corpus;

retrieving, from the constructed semantic vectors, a candidate vector set similar to a target semantic vector of a target domain, wherein the candidate vector set includes a semantic vector of at least one end of at least one domain; and

determining a corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.

A corpus sample determination device is also provided, including:

a construction module, configured to construct semantic vectors of the source-end corpus and the target-end corpus of a general domain in a corpus;

a retrieval module, configured to retrieve, from the constructed semantic vectors, a candidate vector set similar to a target semantic vector of a target domain, wherein the candidate vector set includes a semantic vector of at least one end of at least one domain; and

a sample determination module, configured to determine a corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.

Also provided is an electronic device comprising:

one or more processors;

a storage apparatus configured to store one or more programs;

when the one or more programs are executed by the one or more processors, the one or more processors implement the method provided by the embodiments of the present disclosure.

A computer-readable medium is also provided, which stores a computer program that, when executed by a processing apparatus, implements the method provided by the embodiments of the present disclosure.

Description of the Drawings

FIG. 1 is a schematic flowchart of a method for determining a corpus sample according to Embodiment 1 of the present disclosure;

FIG. 2 is a schematic flowchart of a method for determining a corpus sample according to Embodiment 2 of the present disclosure;

FIG. 3 is a schematic diagram of recalling corpus in the general domain according to Embodiment 2 of the present disclosure;

FIG. 4 is a schematic flowchart of a method for determining a corpus sample according to Embodiment 3 of the present disclosure;

FIG. 5 is a schematic diagram of a translation model according to Embodiment 3 of the present disclosure;

FIG. 6 is a schematic structural diagram of an apparatus for determining a corpus sample according to Embodiment 4 of the present disclosure;

FIG. 7 is a schematic structural diagram of an electronic device according to Embodiment 5 of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described below with reference to the accompanying drawings. Although some embodiments of the present disclosure are shown in the drawings, the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. The drawings and embodiments of the present disclosure are only for exemplary purposes, and are not intended to limit the protection scope of the present disclosure.

The multiple steps described in the method embodiments of the present disclosure may be performed in different orders and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this regard.

As used herein, the term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to". The term "based on" means "based at least in part on." The term "one embodiment" means "at least one embodiment."

The modifiers "a" and "a plurality" mentioned in the present disclosure are illustrative rather than restrictive, and should be read as "one or more" unless the context dictates otherwise.

The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are only for illustrative purposes, and are not intended to limit the scope of these messages or information.

In the following embodiments, optional features and examples are provided in each embodiment, and the features described in the embodiments can be combined to form multiple optional solutions; each numbered embodiment should not be regarded as only a single technical solution. Furthermore, the embodiments of this disclosure and the features of the embodiments may be combined with each other provided there is no conflict.

Embodiment 1

FIG. 1 is a schematic flowchart of a method for determining a corpus sample according to Embodiment 1 of the present disclosure. The method is applicable to selecting corpus samples for machine translation in a target domain, for example, selecting corpus samples from a corpus covering different fields in order to train a translation model for the target domain. The method can be performed by a corpus sample determination apparatus, which can be implemented in software and/or hardware and is generally integrated in an electronic device. In this embodiment, the electronic device includes a notebook computer, a tablet computer, a desktop computer, a server, and the like.

As shown in FIG. 1 , a method for determining a corpus sample provided by Embodiment 1 of the present disclosure includes the following steps.

S110. Construct semantic vectors of the source-end corpus and the target-end corpus of the general domain in the corpus.

In this embodiment, a corpus entry is the basic unit of the text corpus, and may take the form of a character, a word, a phrase, a sentence, or the like. The entries in the text corpus come from different fields, such as law, medicine, mathematics, and computer science. The corpora of these different fields together constitute the corpus of the general domain, where the general domain covers both the target domain and non-target domains. The corpus can be divided into source-end corpus and target-end corpus according to the actual scenario. For example, when translating English into Chinese, "Hello" can serve as the source-end corpus and, correspondingly, "你好" can serve as the target-end corpus. "Hello" and "你好" belong to different ends but have similar semantics, and form a mapping relationship in the translation process. A source-end corpus and the target-end corpus with similar semantics form a parallel corpus pair.

The general-domain corpus can be used to train the translation model. The translation model can be built on a deep neural network; after large-scale training, it learns the features of the source-end corpus and the target-end corpus, as well as the mapping relationship between them. For an input source-end corpus in any field, the translation model can translate it and output the corresponding target-end corpus.

The process of constructing semantic vectors can be understood as the process of encoding the source corpus and target corpus in the general domain to extract corpus features. Through encoding, the corpus in the corpus can be projected into a common semantic vector space, and the semantic vector of each corpus corresponds to a point in the semantic vector space. For example, a semantic vector can be represented as a three-dimensional vector [x, y, z]. If the semantics of the two corpora are similar and belong to the same domain, the distance between the semantic vectors corresponding to the two corpora is small, while the distance between the semantic vectors corresponding to corpora with different semantics or different domains is larger.

In this embodiment, two corpora being similar, their semantics being similar, and their semantic vectors being similar can all be understood as the similarity of the semantic vectors being greater than or equal to a set threshold. In an example, Euclidean distance, cosine similarity, or the like can be used to evaluate the similarity between semantic vectors. For example, the smaller the Euclidean distance between two semantic vectors in the semantic vector space, the higher their similarity; when the Euclidean distance is less than or equal to a set distance threshold, the two semantic vectors are similar semantic vectors. As another example, when the cosine similarity of two semantic vectors is greater than or equal to a set threshold, the two semantic vectors are similar semantic vectors.
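The two similarity criteria described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function names, the three-dimensional vectors, and the threshold values are all assumptions chosen for the example.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two semantic vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero semantic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_similar_by_distance(u, v, max_distance):
    """Similar when the Euclidean distance does not exceed the set threshold."""
    return euclidean_distance(u, v) <= max_distance

def is_similar_by_cosine(u, v, min_similarity):
    """Similar when the cosine similarity reaches the set threshold."""
    return cosine_similarity(u, v) >= min_similarity

# Hypothetical 3-dimensional semantic vectors (illustrative values only).
math_matrix = [1.0, 0.0, 0.0]
bio_matrix = [0.8, 0.6, 0.0]
unrelated = [0.0, 0.0, 1.0]

print(is_similar_by_cosine(math_matrix, bio_matrix, 0.7))   # True
print(is_similar_by_distance(math_matrix, unrelated, 0.7))  # False
```

Note that the two criteria point in opposite directions: a distance is compared against an upper bound, while a similarity is compared against a lower bound.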

The method of this embodiment does not distinguish between languages, domains, or source and target ends, and uniformly encodes the general-domain corpus. The corpus features extracted on this basis are more comprehensive, allowing the translation model to fully learn the features of corpora in different languages, in different domains, and at different ends. In practical applications, the method can be applied to any field and supports any translation direction, for example, from English to Chinese or from Chinese to English.

The same source-end corpus may have different meanings in different fields and therefore correspond to different target-end corpora. For ease of understanding, take word-level corpora as an example, with English as the source-end corpus and Chinese as the target-end corpus. In the field of mathematics, "Matrix" means a mathematical matrix, and the corresponding target-end corpus is the Chinese term for that sense; in the field of biology, "Matrix" has a biological sense, and the corresponding target-end corpus is a different Chinese term; in the field of geography, "Matrix" has yet another sense, with its own Chinese term; and so on. All of these corpora belong to the general-domain corpus. In the semantic vector space, the semantic vector of "Matrix" is close to the semantic vector of each of these Chinese terms. For a translation model trained on the general-domain corpus, the input "Matrix" may therefore be translated into any one of the three Chinese terms. However, to obtain a translation model suited to the target domain (such as mathematics), at least the parallel corpus formed by "Matrix" in its mathematical sense and the Chinese term for a mathematical matrix needs to be selected and used to further train the translation model obtained from the general-domain corpus, so as to adjust the network parameters of the translation model.

S120. From the constructed semantic vectors, retrieve a candidate vector set similar to the target semantic vector of the target domain, wherein the candidate vector set includes a semantic vector of at least one end of at least one domain.

In this embodiment, the target semantic vector refers to the semantic vector obtained by encoding corpus belonging to the target domain, and includes the semantic vector of the source-end corpus of the target domain and the semantic vector of the target-end corpus of the target domain. For example, if the target domain is mathematics and the translation direction is English to Chinese, the semantic vectors obtained by encoding "Matrix" in its mathematical sense and the Chinese term for a mathematical matrix are both target semantic vectors. Here, "Matrix" in its mathematical sense is the source-end corpus of the target domain, and the Chinese term is the target-end corpus of the target domain.

In this embodiment, semantic vectors similar to the target semantic vector are retrieved in the semantic vector space to form a candidate vector set, and the corpora corresponding to the target semantic vector and to the candidate vector set are used together as corpus samples, thereby expanding the diversity of the corpus samples. The candidate vector set includes semantic vectors of at least one end of at least one domain. For example, suppose the target semantic vectors are those of "Matrix" in its mathematical sense and of the corresponding Chinese term. The semantic vector space contains semantic vectors similar to them, such as those of "Matrix" in its biological and geographical senses and of the corresponding Chinese terms. These semantic vectors constitute the candidate vector set, which covers the source end and the target end of the biology and geography domains. The candidate vector set thus contains non-target-domain semantic vectors that are similar to the target semantic vector of the target domain. These vectors come from the source-end corpus and target-end corpus of the general domain and satisfy the similarity condition with respect to the target semantic vector.

In an example, the target semantic vector is obtained by encoding the corpus of the target domain in combination with the features of that corpus, which differs from the unified encoding applied to the general domain in S110. The process of retrieving the candidate vector set can be understood as follows: the corpus of the target domain is encoded, taking its features into account, to obtain the target semantic vector, and semantic vectors close to the target semantic vector are retrieved in the semantic vector space to form the candidate vector set. On this basis, the corpus corresponding to the candidate vector set can also be used as corpus samples for training the translation model, so that the translation model learns the features of the corpus in the target domain and in non-target domains more accurately from the expanded corpus samples. The model can then perform translation tasks in the target domain more accurately, avoid confusing the features of different domains, and improve the accuracy and professionalism of the translation results.

S130. Determine a corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.

In this embodiment, when selecting corpus samples for the target domain, the corpus samples can be determined not only from the target semantic vectors belonging to the target domain; corpus samples in the general domain can also be recalled according to the retrieved candidate vector set, thereby providing richer and more comprehensive features of the target-domain and non-target-domain corpus.

The process of determining corpus samples can be understood as determining the mapping relationships among the corpora corresponding to the target semantic vector and the candidate vector set, so as to form input samples and output samples for training the translation model. The translation model learns the translation rule from the input samples (that is, the source-end corpus in the corpus samples) to the output samples (that is, the target-end corpus in the corpus samples). For example, in the field of mathematics, when the input is "Matrix" in its mathematical sense, the translation model should output the Chinese term for a mathematical matrix rather than the Chinese term for the biological or geographical sense. If it does not, the network parameters of the translation model still need to be iteratively trained and adjusted until the model outputs the correct target-end corpus for the source-end corpus in the corpus samples of the target domain. Once training is complete, the translation model has learned the features and translation rules of the target semantic vector and of the non-target-domain semantic vectors similar to it, can distinguish corpora with similar semantics but different domains, and can be applied to the target domain to perform translation tasks accurately.

Through the above technical solution, a corpus sample corresponding to the target domain is jointly constructed by using the target semantic vector and a set of similar candidate vectors, thereby expanding the scale of the corpus sample and improving the diversity of the corpus sample.
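One way to picture S130 is as a lookup that turns the selected entries (target-domain entries plus recalled candidates) into source-to-target training pairs. The sketch below is an assumption-laden illustration: the `Entry` record, the `pair_id` key linking parallel entries, and the placeholder target-end strings are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    text: str
    domain: str
    side: str     # "source" or "target"
    pair_id: int  # hypothetical key: entries sharing a pair_id are parallel

def to_training_pairs(selected, corpus):
    """Return (source_text, target_text) pairs for every selected entry."""
    wanted = {entry.pair_id for entry in selected}
    sources = {e.pair_id: e.text for e in corpus if e.side == "source"}
    targets = {e.pair_id: e.text for e in corpus if e.side == "target"}
    # Keep only pairs for which both ends exist in the corpus.
    return [(sources[p], targets[p]) for p in sorted(wanted)
            if p in sources and p in targets]

# Placeholder strings stand in for the Chinese target-end terms.
corpus = [
    Entry("Matrix", "mathematics", "source", 1),
    Entry("<zh: mathematical matrix>", "mathematics", "target", 1),
    Entry("Matrix", "biology", "source", 2),
    Entry("<zh: biological matrix>", "biology", "target", 2),
    Entry("Contract", "law", "source", 3),
    Entry("<zh: contract>", "law", "target", 3),
]
target_entries = [corpus[0], corpus[1]]
candidate_entries = [corpus[2]]  # pretend retrieval recalled only this entry

pairs = to_training_pairs(target_entries + candidate_entries, corpus)
print(pairs)
```

Here a recalled source-end entry brings its target-end counterpart into the sample, so the non-target-domain mapping is available to the model alongside the target-domain mapping.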

The method for determining corpus samples provided in this embodiment improves the recall of non-target-domain corpus samples from the general-domain corpus by retrieving a candidate vector set. This expands the scale of the corpus samples and yields corpus samples with rich features, from which the translation model can fully learn the features of both the target-domain corpus samples and the recalled non-target-domain corpus samples. The corpus samples obtained for the target domain on this basis are not limited by language or translation direction, and provide a highly reliable basis for training the translation model.

Embodiment 2

FIG. 2 is a schematic flowchart of a method for determining a corpus sample according to Embodiment 2 of the present disclosure. On the basis of Embodiment 1, Embodiment 2 describes the process of determining the candidate vector set. In this embodiment, constructing the semantic vectors of the source-end corpus and the target-end corpus of the general domain in the corpus includes: encoding the source-end corpus and the target-end corpus of the general domain in the corpus respectively, according to the semantics of the corpus and the domain to which it belongs, to obtain the semantic vectors corresponding to the source-end corpus and the target-end corpus of the general domain. By uniformly encoding the general-domain corpus, the translation model can fully learn the features of corpora in different fields and at different ends; in practical applications, it can be applied to any field and supports any translation direction.

Optionally, retrieving, from the constructed semantic vectors, a candidate vector set similar to the target semantic vector of the target domain includes: calculating the similarity between the target semantic vector and each of the constructed semantic vectors; and determining the candidate vector set according to the resulting similarities. By calculating similarity and retrieving a candidate vector set composed of similar semantic vectors, the scale and diversity of the corpus samples are expanded.

Optionally, before retrieving the candidate vector set similar to the target semantic vector of the target domain, the method further includes: encoding the source-end corpus and the target-end corpus of the target domain according to the semantics of the corpus to obtain the target semantic vector. The target semantic vector is obtained by encoding according to the features of the target-domain corpus, which fully takes into account the professionalism and particularity of different domains, so that the translation model can learn the features of the target domain in greater depth.

For details not described in this embodiment, refer to Embodiment 1.

As shown in FIG. 2 , a method for determining a corpus sample provided by Embodiment 2 of the present disclosure includes the following steps.

S210. According to the semantics of the corpus and the domain to which it belongs, encode the source-end corpus and the target-end corpus of the general domain in the corpus, respectively, to obtain semantic vectors corresponding to the source-end corpus and the target-end corpus of the general domain in the corpus.

In this embodiment, the general-domain corpus is uniformly encoded according to semantics and domain, regardless of language or of source or target end, and the resulting semantic vectors carry both the semantics of the corpus and domain-related information. If two corpora have similar semantics and belong to the same domain, the similarity between their semantic vectors is high, whereas the similarity between the semantic vectors of corpora with different semantics or different domains is relatively low. By uniformly encoding the general-domain corpus, the translation model can fully learn the features of corpora in different fields and at different ends; in practical applications, it can be applied to any field and supports any translation direction.

S220. According to the semantics of the corpus, encode the source-end corpus and the target-end corpus of the target domain to obtain a target semantic vector.

In this embodiment, the target semantic vector is obtained by encoding according to the features of the target-domain corpus, and serves as the basis for retrieving the candidate vector set or recalling similar corpus. This fully takes into account the professionalism and particularity of different fields, so that the translation model can learn the features of the target domain in greater depth.

S230. Calculate the similarity between the target semantic vector and the constructed multiple semantic vectors.

In this embodiment, the candidate vector set is retrieved by calculating the similarity between the target semantic vector and the constructed multiple semantic vectors. The similarity is related to the distance between the semantic vectors, and can be expressed based on the cosine similarity or the Euclidean distance of the semantic vectors.

S240. Determine a candidate vector set according to the multiple degrees of similarity.

In this embodiment, semantic vectors similar to the target semantic vector are selected to form the candidate vector set. For example, the semantic vectors in the semantic vector space whose similarity to the target semantic vector is greater than or equal to a set threshold constitute the candidate vector set; or a set number of semantic vectors with the highest similarity to the target semantic vector are selected in the semantic vector space to constitute the candidate vector set; or a predetermined proportion of the semantic vectors with the highest similarity to the target semantic vector is selected in the semantic vector space to constitute the candidate vector set; and so on.
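The selection rules for S240 (by threshold, by a set number as in the nearest-neighbor variant, or by a predetermined proportion) can be sketched as below. The function names and the pre-scored input list are assumptions made for illustration; the similarity scores are taken as already computed in S230.

```python
import math

def select_by_threshold(scored, threshold):
    """Keep every item whose similarity reaches the set threshold."""
    return [item for item, score in scored if score >= threshold]

def select_top_k(scored, k):
    """Keep the k items with the highest similarity."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in ranked[:k]]

def select_top_proportion(scored, proportion):
    """Keep the most similar items up to a fraction of the whole (rounded up)."""
    k = math.ceil(len(scored) * proportion)
    return select_top_k(scored, k)

# Hypothetical (vector id, similarity to the target semantic vector) pairs.
scored = [("a", 0.95), ("b", 0.80), ("c", 0.40), ("d", 0.10)]

print(select_by_threshold(scored, 0.8))     # ['a', 'b']
print(select_top_k(scored, 3))              # ['a', 'b', 'c']
print(select_top_proportion(scored, 0.25))  # ['a']
```

A threshold bounds the quality of the candidates but not their number, whereas a fixed count or proportion bounds the number but may admit weakly similar vectors when few good matches exist.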

S250. Determine a corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.

On the basis of the above, retrieving a candidate vector set similar to the target semantic vector of the target domain includes at least one of the following: retrieving a source-end semantic vector of a non-target domain that is similar to the source-end semantic vector of the target domain; retrieving a target-end semantic vector of a non-target domain that is similar to the source-end semantic vector of the target domain; retrieving a source-end semantic vector of a non-target domain that is similar to the target-end semantic vector of the target domain; and retrieving a target-end semantic vector of a non-target domain that is similar to the target-end semantic vector of the target domain.

In this embodiment, in the process of retrieving the candidate vector set for the target semantic vector, the monolingual corpus of the target domain (including the source-end corpus and the target-end corpus of the target domain) is encoded to obtain the target semantic vector. A candidate vector set similar to the target semantic vector is retrieved from the semantic vectors of the general domain, and the corpus of the target domain and the general-domain corpus recalled according to the candidate vector set are used together as corpus samples. The semantic vectors constructed for the general-domain corpus are essentially multilingual semantic vectors, which capture features common to multiple languages and multiple domains. Corresponding to the four retrieval methods above, there are four recall methods that improve the recall of corpus samples: recalling the source-end corpus of the general domain according to the source-end corpus of the target domain; recalling the target-end corpus of the general domain according to the source-end corpus of the target domain; recalling the source-end corpus of the general domain according to the target-end corpus of the target domain; and recalling the target-end corpus of the general domain according to the target-end corpus of the target domain.

For example, the target domain is mathematics, the translation direction is English to Chinese, and the corpus of non-target domain can be recalled in any of the following ways:

According to "Matrix" in its mathematical sense in the target domain, "Matrix" in its biological sense and "Matrix" in its geographical sense in the non-target domains can be recalled;

According to "Matrix" in its mathematical sense in the target domain, the Chinese terms for the biological and geographical senses in the non-target domains can be recalled;

According to the Chinese term for a mathematical matrix in the target domain, "Matrix" in its biological sense and "Matrix" in its geographical sense in the non-target domains can be recalled;

According to the Chinese term for a mathematical matrix in the target domain, the Chinese terms for the biological and geographical senses in the non-target domains can be recalled.

On this basis, the corpus samples include not only the mapping from the source-end corpus of the target domain to the target-end corpus of the target domain, but also the source-end corpus and the target-end corpus of non-target domains. Used as training data, these corpus samples provide the features and mapping relationship between the source-end corpus and the target-end corpus of the target domain, as well as the features and mapping relationship between the source-end corpus and the target-end corpus of non-target domains.
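The four recall directions can be expressed as one routine parameterized by the querying end and the recalled end. The sketch below assumes a shared vector space in which source-end and target-end vectors are directly comparable, as the text describes; the dictionary layout, the vectors, the descriptive placeholder texts, and the 0.9 threshold are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def recall(query_side, result_side, target_domain, entries, threshold):
    """Recall non-target-domain entries on result_side that are similar to
    any target-domain entry on query_side."""
    queries = [e for e in entries
               if e["domain"] == target_domain and e["side"] == query_side]
    pool = [e for e in entries
            if e["domain"] != target_domain and e["side"] == result_side]
    return [e["text"] for e in pool
            if any(cosine(e["vector"], q["vector"]) >= threshold for q in queries)]

# Hypothetical entries; texts are descriptive placeholders, vectors are made up.
entries = [
    {"text": "Matrix (math sense)", "domain": "math", "side": "source", "vector": [1.0, 0.0, 0.0]},
    {"text": "zh term (math sense)", "domain": "math", "side": "target", "vector": [0.9, 0.1, 0.0]},
    {"text": "Matrix (biology sense)", "domain": "biology", "side": "source", "vector": [0.9, 0.2, 0.0]},
    {"text": "zh term (biology sense)", "domain": "biology", "side": "target", "vector": [0.8, 0.3, 0.0]},
    {"text": "unrelated legal sentence", "domain": "law", "side": "source", "vector": [0.0, 0.0, 1.0]},
]

directions = [("source", "source"), ("source", "target"),
              ("target", "source"), ("target", "target")]
recalled = {d: recall(d[0], d[1], "math", entries, 0.9) for d in directions}
for direction, texts in recalled.items():
    print(direction, texts)
```

With these values, each direction recalls the biology entry on the requested end, while the unrelated legal sentence is never recalled.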

FIG. 3 is a schematic diagram of recalling corpus in the general domain according to Embodiment 2 of the present disclosure. As shown in FIG. 3, target semantic vectors are constructed for the corpus of the target domain, and semantic vectors are constructed for the corpus of the general domain to form a semantic vector space. By calculating the similarity between the target semantic vector and the semantic vectors in the semantic vector space, a candidate vector set similar to the target semantic vector is retrieved. According to this set, corpus similar to the source-end corpus and the target-end corpus of the target domain is recalled from the general-domain corpus, and the recalled corpus together with the corpus of the target domain constitutes the corpus samples.

The method for determining corpus samples provided in Embodiment 2 of the present disclosure uniformly encodes the general-domain corpus, so that the translation model can fully learn the features of corpora in different fields and at different ends; in practical applications it can be applied to any field and supports any translation direction. By calculating similarity and retrieving a candidate vector set composed of similar semantic vectors, the scale and diversity of the corpus samples are expanded, and the features and mapping relationships from the source-end corpus to the target-end corpus of non-target domains are added to the corpus samples, so that the translation model can fully learn the features of both the target-domain corpus samples and the recalled non-target-domain corpus samples. The target semantic vector is obtained by encoding according to the features of the target-domain corpus, which fully takes into account the professionalism and particularity of different domains, so that the translation model can learn and distinguish domain-specific features in a more targeted way.

Embodiment 3

FIG. 4 is a schematic flowchart of a method for determining a corpus sample according to Embodiment 3 of the present disclosure. Embodiment 3 On the basis of the above embodiment, the process of determining the corpus sample is described, and it is clarified how to determine the source-end corpus and the target-end corpus in the corpus sample.

Optionally, determining a candidate vector set according to a plurality of similarities, including: based on a nearest neighbor (k-Nearest Neighbor) search algorithm, retrieving a set number of semantic vectors with the highest similarity to the target semantic vector in the constructed semantic vector, Constitute the candidate vector set. On the basis of ensuring the similarity between the target semantic vector and the candidate vector set, the diversity of corpus samples is expanded, and the features and mapping relationships of the source corpus in the non-target domain to the target corpus are added to the corpus samples, so that the corpus The samples have richer features and more specialized training value.

Optionally, after determining the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set, the method further includes: training a translation model according to the corpus sample, wherein the translation model is established according to the source-end corpus and the target-end corpus of the general domain in the text corpus. Training the translation model with the corpus samples improves its professionalism and reliability in different domains. Because the corpus samples include corpora of both the target domain and non-target domains, the applicability of the translation model to any domain is also improved, and there is no need to select separate corpus samples for training in each domain.

For details not described in this embodiment, refer to the above embodiments.

As shown in Figure 4, a method for determining a corpus sample provided in Embodiment 3 of the present disclosure includes the following steps:

S310. Construct semantic vectors of the source-end corpus and the target-end corpus of the general domain in the text corpus.

S320. Calculate the similarity between the target semantic vector and the constructed multiple semantic vectors.

S330. Based on the nearest neighbor search algorithm, retrieve a set number of semantic vectors with the highest similarity to the target semantic vector from the constructed semantic vectors, to form a candidate vector set.

In this embodiment, based on the nearest neighbor search algorithm, the one or more semantic vectors closest to the target semantic vector in the semantic vector space are regarded as similar semantic vectors, the corpora corresponding to these neighboring semantic vectors are regarded as similar corpora, and these corpora are added to the corpus samples together.

Retrieving a candidate vector set similar to the target semantic vector of the target domain includes at least one of the following: retrieving a source-end semantic vector of a non-target domain that is similar to the source-end semantic vector of the target domain; retrieving a target-end semantic vector of a non-target domain that is similar to the source-end semantic vector of the target domain; retrieving a source-end semantic vector of a non-target domain that is similar to the target-end semantic vector of the target domain; and retrieving a target-end semantic vector of a non-target domain that is similar to the target-end semantic vector of the target domain.
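The retrieval in S320 and S330 can be illustrated with a minimal sketch. The disclosure does not fix a similarity measure or an index structure, so the cosine similarity, the brute-force ranking, and the names `Entry`, `cosine_similarity`, `retrieve_candidates`, and `exclude_domain` below are assumptions made for illustration only.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Entry:
    """One general-domain sentence with its (hypothetical) semantic vector."""
    text: str
    side: str                   # "source" or "target" end of the corpus
    domain: str
    vector: Tuple[float, ...]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_candidates(
    query: Sequence[float],
    index: List[Entry],
    k: int,
    exclude_domain: Optional[str] = None,
) -> List[Tuple[float, Entry]]:
    """Return the k entries most similar to the query vector.

    Entries of `exclude_domain` (for example the target domain itself) are
    skipped, so that only non-target-domain vectors are recalled. Entries on
    either end are eligible, which covers all four retrieval directions.
    """
    pool = [e for e in index if e.domain != exclude_domain]
    scored = [(cosine_similarity(query, e.vector), e) for e in pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A brute-force scan like this is linear in the number of general-domain vectors; for a large corpus an approximate nearest neighbor index would be the more practical choice, which the disclosure leaves open.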

S340. Determine a corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.

In this embodiment, the process of determining the corpus sample includes: taking, based on the target semantic vector and the retrieved candidate vector set, the corpus of the target domain and the recalled general-domain corpora as the corpus sample; and determining the mapping relationships among the corpora in the corpus sample, that is, which corpora serve as pre-translation corpora and which serve as translated corpora when the translation model is trained.

Determining the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set includes: taking the corpus corresponding to the source-end semantic vector of the target domain as the pre-translation corpus, and taking the corpus corresponding to the target-end semantic vector of the target domain that is similar to the source-end semantic vector of the target domain as the translated corpus. It further includes at least one of the following:

taking the corpus corresponding to the source-end semantic vector of a non-target domain that is similar to the source-end semantic vector of the target domain as the pre-translation corpus, and taking the corpus corresponding to the target-end semantic vector of a non-target domain that is similar to the target-end semantic vector of the target domain as the translated corpus;

taking the corpus corresponding to the source-end semantic vector of a non-target domain that is similar to the source-end semantic vector of the target domain as the pre-translation corpus, and taking the corpus corresponding to the target-end semantic vector of a non-target domain that is similar to the source-end semantic vector of the target domain as the translated corpus;

taking the corpus corresponding to the source-end semantic vector of a non-target domain that is similar to the target-end semantic vector of the target domain as the pre-translation corpus, and taking the corpus corresponding to the target-end semantic vector of a non-target domain that is similar to the target-end semantic vector of the target domain as the translated corpus;

taking the corpus corresponding to the source-end semantic vector of a non-target domain that is similar to the target-end semantic vector of the target domain as the pre-translation corpus, and taking the corpus corresponding to the target-end semantic vector of a non-target domain that is similar to the source-end semantic vector of the target domain as the translated corpus.

For the corpus of the target domain, the source corpus of the target domain is used as the pre-translation corpus, and the target-end corpus of the target domain is used as the translated corpus.

For the corpora of non-target domains, the source-end corpus of the non-target domain is used as the pre-translation corpus, and the target-end corpus of the non-target domain is used as the translated corpus. The source-end corpus of the non-target domain may have been recalled according to the source-end corpus of the target domain or according to the target-end corpus of the target domain; likewise, the target-end corpus of the non-target domain may have been recalled according to the source-end corpus of the target domain or according to the target-end corpus of the target domain.

For example, the target domain is mathematics, the translation direction is English to Chinese, and the recalled non-target-domain corpora include the English word "Matrix" whose actual meaning is a mathematical matrix, the English word "Matrix" whose actual meaning is "mother", and the Chinese words meaning "matrix" and "mother".

For the corpus of the target domain, "Matrix" whose actual meaning is a matrix is used as the pre-translation corpus, and the Chinese word meaning "matrix" is used as the translated corpus. For the corpora of non-target domains, "Matrix" whose actual meaning is a matrix is used as the pre-translation corpus and the Chinese word meaning "matrix" is used as the translated corpus; "Matrix" whose actual meaning is "mother" is used as the pre-translation corpus and the Chinese word meaning "mother" is used as the translated corpus.
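The assembly of pre-translation and translated corpora in S340 can be sketched as follows. The disclosure does not state how a recalled sentence is matched with its counterpart on the other end, so this sketch assumes that every general-domain sentence belongs to an aligned sentence pair identified by a hypothetical `pair_id`; `SentencePair` and `build_samples` are likewise illustrative names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class SentencePair:
    """An aligned pair; `source` is pre-translation, `target` is translated."""
    pair_id: int
    domain: str
    source: str
    target: str


def build_samples(
    target_domain_pairs: List[SentencePair],
    general_pairs: Dict[int, SentencePair],
    recalled: Iterable[Tuple[int, str]],
) -> List[Tuple[str, str]]:
    """Combine target-domain pairs with recalled non-target-domain pairs.

    `recalled` holds (pair_id, side) tuples produced by retrieval. Whichever
    end was recalled, the pair's source-end text is used as the
    pre-translation corpus and its target-end text as the translated corpus.
    """
    samples = [(p.source, p.target) for p in target_domain_pairs]
    seen = set()
    for pair_id, _side in recalled:
        if pair_id in seen:        # the same pair may be recalled from both ends
            continue
        seen.add(pair_id)
        pair = general_pairs[pair_id]
        samples.append((pair.source, pair.target))
    return samples
```

Deduplicating on `pair_id` keeps a pair from being counted twice when both of its ends are recalled, which would otherwise overweight it during training.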

S350: Train the translation model according to the corpus samples, wherein the translation model is established according to the source-end corpus and the target-end corpus of the general domain in the corpus.

In this embodiment, the source-end corpus and the target-end corpus of the general domain in the text corpus are used to train a cross-language translation model for the general domain. The translation model is then trained with the corpus samples determined for the target domain, and its network parameters are adjusted, to achieve domain adaptation of the translation model. When the corpus samples are used to train the translation model, not only are the source-end corpus and the target-end corpus of the target domain used as input samples and output samples, but the recalled source-end corpora and target-end corpora of non-target domains are also used as input samples and output samples. This yields a more professional translation model that can support accurate translation in any domain, any translation direction, and any language.

Exemplarily, the translation model includes a multi-layer semantic encoder and a single-layer semantic decoder, wherein the encoder and the decoder may be implemented using a recurrent neural network (Recurrent Neural Network, RNN) architecture, such as a long short-term memory network (Long Short-Term Memory, LSTM) or a gated recurrent unit (Gated Recurrent Unit, GRU), or using a Transformer model, etc. The general-domain corpora of all languages and all translation directions in the text corpus are used to train the same model.

FIG. 5 is a schematic diagram of a translation model according to Embodiment 3 of the present disclosure. As shown in FIG. 5, the translation model includes an encoding network, which extracts the semantic features of the corpora (x₁, x₂, ..., x_N) in the corpus samples, and a decoding network, which decodes the semantic features; that is, according to the semantic features of multiple corpora, it determines the target-end corpus with the highest similarity for each source-end corpus and obtains the mapping relationship between the source-end corpus and the target-end corpus. In practical applications, if the corpora y₀ and y₁ to be translated are input, the translation model encodes y₀ and y₁ according to its encoding rules, and the decoding network decodes according to the features extracted by the encoding and finds the corresponding corpora y₂ and y₃, respectively, as the translation results. The translation model is established based on the general-domain corpora in the text corpus, and its network parameters are adjusted through training on the corpus samples for the target domain, so that it is more professional, is applicable to any professional domain, and achieves higher translation accuracy.

Embodiment 3 of the present disclosure provides a method for determining a corpus sample. The corpus of the target domain and the recalled general-domain corpora together form the corpus sample of the target domain, and the features and mapping relationships from source-end corpora to target-end corpora in non-target domains are added to the corpus sample, so that the corpus sample has richer features and more professional training value. A translation model is obtained through preliminary training by encoding the general-domain corpora into multilingual semantic vectors, and the corpus sample of the target domain is then used to train the translation model, which improves the professionalism and reliability of the translation model in different domains and supports accurate translation in any domain, any translation direction, and any language.

Embodiment 4

FIG. 6 is a schematic structural diagram of a corpus sample determination apparatus provided in Embodiment 4 of the present disclosure. The apparatus is applicable to selecting corpus samples for machine translation in a specific domain, for example, selecting corpus samples from different domains in order to train a domain-specific translation model. The apparatus can be implemented by software and/or hardware, and is generally integrated on an electronic device.

As shown in FIG. 6 , the apparatus includes: a construction module 410 , a retrieval module 420 and a sample determination module 430 .

The construction module 410 is configured to construct semantic vectors of the source-end corpus and the target-end corpus of the general domain in the text corpus;

The retrieval module 420 is configured to retrieve, from the constructed semantic vectors, a candidate vector set similar to the target semantic vector of the target domain, wherein the candidate vector set includes the semantic vector of at least one end of at least one domain;

The sample determination module 430 is configured to determine the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.

In this embodiment, the construction module 410 constructs the semantic vectors of the source-end corpus and the target-end corpus of the general domain in the text corpus; the retrieval module 420 retrieves, from the constructed semantic vectors, a candidate vector set similar to the target semantic vector of the target domain, wherein the candidate vector set includes the semantic vector of at least one end of at least one domain; and the sample determination module 430 determines the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set. With this technical solution, the target semantic vector and the candidate vector set similar to it are used jointly to construct the corpus sample corresponding to the target domain, thereby expanding the scale of the corpus sample and improving its diversity.
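How the three modules cooperate can be sketched as a small pipeline. This is only an illustration under stated assumptions: the class names mirror modules 410, 420, and 430, but their interfaces, the pluggable `encoder` callable, the cosine similarity, and the function `determine_corpus_sample` are hypothetical and are not taken from the disclosure.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Encoder = Callable[[str], Vector]


def _cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ConstructionModule:
    """Counterpart of module 410: encodes general-domain sentences."""

    def __init__(self, encoder: Encoder) -> None:
        self.encoder = encoder

    def construct(self, sentences: List[str]) -> List[Tuple[str, Vector]]:
        return [(s, self.encoder(s)) for s in sentences]


class RetrievalModule:
    """Counterpart of module 420: returns the k most similar vectors."""

    def __init__(self, k: int) -> None:
        self.k = k

    def retrieve(
        self, target_vec: Vector, vectors: List[Tuple[str, Vector]]
    ) -> List[Tuple[str, Vector]]:
        ranked = sorted(
            vectors, key=lambda item: _cosine(target_vec, item[1]), reverse=True
        )
        return ranked[: self.k]


class SampleDeterminationModule:
    """Counterpart of module 430: merges target and recalled sentences."""

    def determine(
        self, target_sentences: List[str], candidates: List[Tuple[str, Vector]]
    ) -> List[str]:
        samples = list(target_sentences)
        for sentence, _vec in candidates:
            if sentence not in samples:    # keep each sentence only once
                samples.append(sentence)
        return samples


def determine_corpus_sample(
    general: List[str], target: List[str], encoder: Encoder, k: int
) -> List[str]:
    """Run construction, retrieval, and sample determination in sequence."""
    vectors = ConstructionModule(encoder).construct(general)
    retrieval = RetrievalModule(k)
    candidates: List[Tuple[str, Vector]] = []
    for sentence in target:
        candidates.extend(retrieval.retrieve(encoder(sentence), vectors))
    return SampleDeterminationModule().determine(target, candidates)
```

Passing the encoder in as a callable keeps the sketch independent of any particular encoding model; in practice it would be replaced by the multilingual semantic encoder described in the embodiments.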

On the basis of the above, the construction module 410 is configured to:

encode the source-end corpus and the target-end corpus of the general domain in the text corpus respectively, according to the semantics of the corpora and the domains to which they belong, to obtain the semantic vectors corresponding to the source-end corpus and the target-end corpus of the general domain.

On the basis of the above, the retrieval module 420 includes:

a computing unit, configured to calculate the similarity between the target semantic vector and the constructed multiple semantic vectors;

A set determination unit, configured to determine the candidate vector set according to the multiple degrees of similarity.

On the basis of the above, the set determination unit is configured to:

retrieve, based on the nearest neighbor search algorithm, a set number of semantic vectors with the highest similarity to the target semantic vector from the constructed semantic vectors, to form the candidate vector set.

On the basis of the above, the target semantic vector includes the source-end semantic vector of the target domain and the target-end semantic vector of the target domain;

Retrieving a candidate vector set similar to the target semantic vector of the target domain includes at least one of the following:

retrieving a source-end semantic vector of a non-target domain that is similar to the source-end semantic vector of the target domain;

retrieving a target-end semantic vector of a non-target domain that is similar to the source-end semantic vector of the target domain;

retrieving a source-end semantic vector of a non-target domain that is similar to the target-end semantic vector of the target domain;

retrieving a target-end semantic vector of a non-target domain that is similar to the target-end semantic vector of the target domain.

On the basis of the above, the sample determination module 430 is configured to:

take the corpus corresponding to the source-end semantic vector of the target domain as the pre-translation corpus, and take the corpus corresponding to the target-end semantic vector of the target domain that is similar to the source-end semantic vector of the target domain as the translated corpus;

The sample determination module 430 is further configured to perform at least one of the following:

taking the corpus corresponding to the source-end semantic vector of a non-target domain that is similar to the source-end semantic vector of the target domain as the pre-translation corpus, and taking the corpus corresponding to the target-end semantic vector of a non-target domain that is similar to the target-end semantic vector of the target domain as the translated corpus;

taking the corpus corresponding to the source-end semantic vector of a non-target domain that is similar to the source-end semantic vector of the target domain as the pre-translation corpus, and taking the corpus corresponding to the target-end semantic vector of a non-target domain that is similar to the source-end semantic vector of the target domain as the translated corpus;

taking the corpus corresponding to the source-end semantic vector of a non-target domain that is similar to the target-end semantic vector of the target domain as the pre-translation corpus, and taking the corpus corresponding to the target-end semantic vector of a non-target domain that is similar to the target-end semantic vector of the target domain as the translated corpus;

taking the corpus corresponding to the source-end semantic vector of a non-target domain that is similar to the target-end semantic vector of the target domain as the pre-translation corpus, and taking the corpus corresponding to the target-end semantic vector of a non-target domain that is similar to the source-end semantic vector of the target domain as the translated corpus.

On the basis of the above, the apparatus further includes:

an encoding module, configured to encode the source-end corpus and the target-end corpus of the target domain according to the semantics of the corpora, to obtain the target semantic vector, before the candidate vector set similar to the target semantic vector of the target domain is retrieved.

On the basis of the above, the apparatus further includes:

a training module, configured to train a translation model according to the corpus sample after the corpus sample corresponding to the target domain is determined according to the target semantic vector and the candidate vector set, wherein the translation model is established according to the source-end corpus and the target-end corpus of the general domain in the text corpus.

The above apparatus for determining a corpus sample can execute the method for determining a corpus sample provided by any embodiment of the present disclosure, and has functional modules and effects corresponding to the execution method.

Embodiment 5

FIG. 7 is a schematic structural diagram of an electronic device according to Embodiment 5 of the present disclosure, and shows an electronic device 600 suitable for implementing an embodiment of the present disclosure. The electronic device 600 in the embodiment of the present disclosure includes a notebook computer, a tablet computer, a desktop computer, a server, and the like. The electronic device 600 shown in FIG. 7 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.

As shown in FIG. 7, the electronic device 600 may include one or more processing devices (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (Read-Only Memory, ROM) 602 or a program loaded from the storage device 608 into a random access memory (Random Access Memory, RAM) 603. The one or more processing devices 601 implement the methods provided by the present disclosure. The RAM 603 also stores various programs and data necessary for the operation of the electronic device 600. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An Input/Output (I/O) interface 605 is also connected to the bus 604.

The following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), speaker, vibrator, etc.; storage devices 608 including, for example, a magnetic tape, hard disk, etc., arranged to store one or more programs; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 7 shows the electronic device 600 with various devices, it is not required to implement or have all of the illustrated devices. More or fewer devices may alternatively be implemented or provided.

According to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from the network via the communication device 609 , or from the storage device 608 , or from the ROM 602 . When the computer program is executed by the processing apparatus 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.

The computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium can be, for example, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. Computer readable storage media can include: electrical connections with one or more wires, portable computer disks, hard disks, RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), flash memory, optical fibers , portable compact disk read-only memory (Compact Disc Read-Only Memory, CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above. In this disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In the present disclosure, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including electromagnetic signals, optical signals, or any suitable combination of the foregoing. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device . The program code embodied on the computer-readable medium may be transmitted by any suitable medium, including: electric wire, optical fiber cable, radio frequency (RF), etc., or any suitable combination of the above. The storage medium may be a non-transitory storage medium.

The above-mentioned computer-readable medium may be included in the above-mentioned electronic device 600;

The above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, they cause the electronic device 600 to: construct semantic vectors of the source-end corpus and the target-end corpus of the general domain in the text corpus; retrieve, from the constructed semantic vectors, a candidate vector set similar to the target semantic vector of the target domain, wherein the candidate vector set includes the semantic vector of at least one end of at least one domain; and determine the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.

Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which contains one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in special purpose hardware-based systems that perform the specified functions or operations, or special purpose hardware implemented in combination with computer instructions.

The modules involved in the embodiments of the present disclosure may be implemented in software or hardware. In some cases, the name of a module does not constitute a limitation on the module itself.

The functions described above may be performed, at least in part, by one or more hardware logic components. For example, exemplary types of hardware logic components that may be used include: Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), Application Specific Standard Product (ASSP), System on Chip (SOC), Complex Programmable Logic Device (CPLD), etc.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media may include electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. Machine-readable storage media include one or more wire-based electrical connections, portable computer disks, hard disks, RAM, ROM, EPROM, flash memory, optical fibers, portable CD-ROMs, optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

According to one or more embodiments of the present disclosure, Example 1 provides a method for determining a corpus sample, including:

According to one or more embodiments of the present disclosure, Example 2, according to the method described in Example 1, constructs the semantic vector of the source-end corpus and the target-end corpus of the general domain in the corpus, including:

According to one or more embodiments of the present disclosure, Example 3 According to the method described in Example 1, in the constructed semantic vector, retrieving a set of candidate vectors similar to the target semantic vector of the target domain, including:

calculating the similarity between the target semantic vector and the constructed multiple semantic vectors;

The set of candidate vectors is determined according to a plurality of degrees of similarity.

According to one or more embodiments of the present disclosure, Example 4, according to the method of Example 3, determines the candidate vector set according to a plurality of similarities, including:

According to one or more embodiments of the present disclosure, Example 5 is according to the method of Example 1,

The target semantic vector includes the source semantic vector of the target domain and the target semantic vector of the target domain;

According to one or more embodiments of the present disclosure, Example 6 is according to the method of Example 5,

The corpus sample corresponding to the target domain is determined according to the target semantic vector and the candidate vector set, including:

Also include at least one of the following:

According to one or more embodiments of the present disclosure, Example 7, according to the method of Example 1, before retrieving a set of candidate vectors similar to the target semantic vector of the target domain, further includes:

According to the semantics of the corpus, the source corpus and the target corpus of the target domain are encoded to obtain the target semantic vector.

According to one or more embodiments of the present disclosure, Example 8 is according to the method of Example 1,

After determining the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set, the method further includes:

A translation model is trained according to the corpus samples, wherein the translation model is established according to the source-end corpus and the target-end corpus of the general domain in the corpus.

According to one or more embodiments of the present disclosure, Example 9 provides an apparatus for determining a corpus sample, including:

According to one or more embodiments of the present disclosure, Example 10 provides an electronic device comprising:

one or more processing devices;

storage means arranged to store one or more programs;

The one or more programs, when executed by the one or more processing devices, cause the one or more processing devices to implement the method as described in any of Examples 1-8.

According to one or more embodiments of the present disclosure, Example 11 provides a computer-readable medium having stored thereon a computer program that, when executed by a processing apparatus, implements the method as described in any one of Examples 1-8.

Claims

A corpus sample determination method, comprising:

constructing semantic vectors of the source-end corpus and the target-end corpus of the general domain in a text corpus;

retrieving, from the constructed semantic vectors, a candidate vector set similar to the target semantic vector of the target domain, wherein the candidate vector set includes the semantic vector of at least one end of at least one domain;

determining a corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.
The method according to claim 1, wherein constructing the semantic vectors of the source-end corpus and the target-end corpus of the general domain in the text corpus comprises:

encoding the source-end corpus and the target-end corpus of the general domain in the text corpus respectively, according to the semantics of the corpora and the domains to which they belong, to obtain the semantic vectors corresponding to the source-end corpus and the target-end corpus of the general domain.
The method according to claim 1, wherein retrieving, from the constructed semantic vectors, a candidate vector set similar to the target semantic vector of the target domain comprises:

calculating the similarity between the target semantic vector and each of the constructed semantic vectors;

determining the candidate vector set according to the plurality of similarities.
The method according to claim 3, wherein determining the candidate vector set according to the plurality of similarities comprises:

retrieving, based on a nearest neighbor search algorithm, a set number of semantic vectors with the highest similarity to the target semantic vector from the constructed semantic vectors, to form the candidate vector set.
The method according to claim 1, wherein the target semantic vector comprises a source-side semantic vector of the target domain and a target-side semantic vector of the target domain; and

Retrieving the candidate vector set similar to the target semantic vector of the target domain comprises at least one of the following:

Retrieving a source-side semantic vector of a non-target domain that is similar to the source-side semantic vector of the target domain;

Retrieving a target-side semantic vector of a non-target domain that is similar to the source-side semantic vector of the target domain;

Retrieving a source-side semantic vector of a non-target domain that is similar to the target-side semantic vector of the target domain;

Retrieving a target-side semantic vector of a non-target domain that is similar to the target-side semantic vector of the target domain.
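As a non-limiting sketch, the four retrieval directions listed above can be expressed as four queries against two pools of non-target-domain vectors. It assumes that source-side and target-side vectors lie in a shared semantic space and that cosine similarity is the measure; neither is stated in the claim. All names (`retrieve_four_directions`, `general_src`, `general_tgt`, and the toy vectors) are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def _cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def _top_k(query: Vector, pool: Sequence[Vector], k: int) -> List[int]:
    """Indices of the k pool vectors most similar to the query."""
    order = sorted(range(len(pool)), key=lambda i: _cosine(query, pool[i]), reverse=True)
    return order[:k]


def retrieve_four_directions(
    in_src: Vector,                 # source-side vector of the target domain
    in_tgt: Vector,                 # target-side vector of the target domain
    general_src: Sequence[Vector],  # source-side vectors of non-target domains
    general_tgt: Sequence[Vector],  # target-side vectors of non-target domains
    k: int,
) -> Dict[str, List[int]]:
    """Run the four query/pool combinations; keys read 'query side -> pool side'."""
    return {
        "src->src": _top_k(in_src, general_src, k),
        "src->tgt": _top_k(in_src, general_tgt, k),
        "tgt->src": _top_k(in_tgt, general_src, k),
        "tgt->tgt": _top_k(in_tgt, general_tgt, k),
    }


# Toy vectors for illustration only.
general_src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
general_tgt = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
hits = retrieve_four_directions([1.0, 0.1], [0.9, 0.2], general_src, general_tgt, k=1)
```

Because the claim requires only "at least one" of the four retrievals, a caller could equally compute a single entry of the returned dictionary.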
The method according to claim 5, wherein determining the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set comprises:

Taking the corpus corresponding to the source-side semantic vector of the target domain as a pre-translation corpus, and taking the corpus corresponding to the target-side semantic vector of the target domain that is similar to the source-side semantic vector of the target domain as a translated corpus;

And further comprises at least one of the following:

Taking the corpus corresponding to the source-side semantic vector of the non-target domain that is similar to the source-side semantic vector of the target domain as a pre-translation corpus, and taking the corpus corresponding to the target-side semantic vector of the non-target domain that is similar to the target-side semantic vector of the target domain as a translated corpus;

Taking the corpus corresponding to the source-side semantic vector of the non-target domain that is similar to the source-side semantic vector of the target domain as a pre-translation corpus, and taking the corpus corresponding to the target-side semantic vector of the non-target domain that is similar to the source-side semantic vector of the target domain as a translated corpus;

Taking the corpus corresponding to the source-side semantic vector of the non-target domain that is similar to the target-side semantic vector of the target domain as a pre-translation corpus, and taking the corpus corresponding to the target-side semantic vector of the non-target domain that is similar to the target-side semantic vector of the target domain as a translated corpus;

Taking the corpus corresponding to the source-side semantic vector of the non-target domain that is similar to the target-side semantic vector of the target domain as a pre-translation corpus, and taking the corpus corresponding to the target-side semantic vector of the non-target domain that is similar to the source-side semantic vector of the target domain as a translated corpus.
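For illustration, assembling corpus samples from the in-domain pair and one of the retrieved source-side/target-side combinations might look like the sketch below. The claim does not state how a retrieved source-side corpus is matched with a retrieved target-side corpus; pairing the two hit lists by rank is purely an assumption made here, and the function name, parameters, and placeholder strings are hypothetical.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (pre-translation corpus, translated corpus)


def build_corpus_samples(
    in_domain_pair: Pair,
    general_src_texts: Sequence[str],
    general_tgt_texts: Sequence[str],
    src_hits: Sequence[int],
    tgt_hits: Sequence[int],
) -> List[Pair]:
    """Combine the in-domain pair with pairs built from retrieved indices.

    src_hits index general_src_texts (retrieved source-side vectors);
    tgt_hits index general_tgt_texts (retrieved target-side vectors).
    """
    samples: List[Pair] = [in_domain_pair]
    # Assumption: the i-th retrieved source-side corpus is paired with the
    # i-th retrieved target-side corpus; extra hits on either side are dropped.
    for s, t in zip(src_hits, tgt_hits):
        samples.append((general_src_texts[s], general_tgt_texts[t]))
    return samples


# Placeholder strings stand in for real sentences.
samples = build_corpus_samples(
    in_domain_pair=("in-domain source", "in-domain target"),
    general_src_texts=["gs0", "gs1", "gs2"],
    general_tgt_texts=["gt0", "gt1", "gt2"],
    src_hits=[2, 0],
    tgt_hits=[2, 1],
)
```

Each of the four combinations in the claim corresponds to a different choice of which retrieval result supplies `src_hits` and which supplies `tgt_hits`.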
The method according to claim 1, wherein before retrieving the candidate vector set similar to the target semantic vector of the target domain, the method further comprises:

Encoding a source-side corpus and a target-side corpus of the target domain according to the semantics of each corpus, to obtain the target semantic vector.
The method according to claim 1, wherein after determining the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set, the method further comprises:

Training a translation model according to the corpus sample, wherein the translation model is established according to the source-side corpus and the target-side corpus of the general domain in the text corpus.
A corpus sample determination apparatus, comprising:

A construction module, configured to construct semantic vectors of a source-side corpus and a target-side corpus of a general domain in a text corpus;

A retrieval module, configured to retrieve, from the constructed semantic vectors, a candidate vector set similar to a target semantic vector of a target domain, wherein the candidate vector set comprises a semantic vector of at least one side of at least one domain; and

A sample determination module, configured to determine a corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.
An electronic device comprising:

one or more processors;

a storage apparatus configured to store one or more programs;

wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method for determining a corpus sample according to any one of claims 1-8.
A computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, the method for determining a corpus sample according to any one of claims 1-8 is implemented.