WO2022135080A1 - Corpus sample determination method and apparatus, electronic device, and storage medium - Google Patents

Corpus sample determination method and apparatus, electronic device, and storage medium

Info

Publication number
WO2022135080A1
WO2022135080A1 PCT/CN2021/134269 CN2021134269W WO2022135080A1 WO 2022135080 A1 WO2022135080 A1 WO 2022135080A1 CN 2021134269 W CN2021134269 W CN 2021134269W WO 2022135080 A1 WO2022135080 A1 WO 2022135080A1
Authority
WO
WIPO (PCT)
Prior art keywords
corpus
target
semantic vector
domain
vector
Prior art date
Application number
PCT/CN2021/134269
Other languages
French (fr)
Chinese (zh)
Inventor
曹军
许润昕
王明轩
李磊
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2022135080A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Definitions

  • The present disclosure relates to machine learning technology, for example to a method, apparatus, electronic device, and storage medium for determining a corpus sample.
  • Machine translation refers to the translation of source text in one language into target text in another language.
  • The quality of neural machine translation has improved steadily, and it plays an increasingly important role in daily life and in industrial production environments.
  • A large number of corpus samples is needed to train a translation model, so that the model learns the semantic features of different corpora and can effectively translate source text into target text.
  • Syntax and semantics differ between fields. For example, medicine, law, and economics each have many professional terms, and more specialized translation models need to be trained to ensure translation accuracy.
  • Monolingual corpus in the target domain can be selected to construct a corpus sample, or it can be used to construct a pseudo-parallel corpus in the target domain that serves as a corpus sample for training the translation model, thereby achieving domain adaptation.
  • The corpus samples these methods use to train the translation model cover only a single specific target domain.
  • Corpus samples in a specific domain are difficult to obtain, the sample size is usually small, and their semantic features are too uniform, making it difficult to train the translation model adequately.
  • The present disclosure provides a method, apparatus, electronic device, and storage medium for determining corpus samples, which improve the diversity of corpus samples.
  • A corpus sample determination method is provided, including:
  • constructing semantic vectors of the source-end corpus and the target-end corpus of the general domain in a corpus;
  • retrieving, among the constructed semantic vectors, a candidate vector set similar to the target semantic vector of the target domain, wherein the candidate vector set includes semantic vectors of at least one end of at least one domain;
  • determining a corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.
  • A corpus sample determination apparatus is provided, including:
  • a construction module configured to construct semantic vectors of the source-end corpus and the target-end corpus of the general domain in a corpus;
  • a retrieval module configured to retrieve, among the constructed semantic vectors, a candidate vector set similar to the target semantic vector of the target domain, wherein the candidate vector set includes semantic vectors of at least one end of at least one domain;
  • a sample determination module configured to determine the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.
  • An electronic device is provided, comprising:
  • one or more processors;
  • a storage apparatus arranged to store one or more programs;
  • when the one or more programs are executed by the one or more processors, the one or more processors implement the method provided by the embodiments of the present disclosure.
  • A computer-readable medium is provided, which stores a computer program that, when executed by a processing apparatus, implements the method provided by the embodiments of the present disclosure.
  • FIG. 1 is a schematic flowchart of a method for determining a corpus sample according to Embodiment 1 of the present disclosure
  • FIG. 2 is a schematic flowchart of a method for determining a corpus sample according to Embodiment 2 of the present disclosure
  • FIG. 3 is a schematic diagram of recalling corpus in the general field according to Embodiment 2 of the present disclosure
  • FIG. 4 is a schematic flowchart of a method for determining a corpus sample according to Embodiment 3 of the present disclosure
  • FIG. 5 is a schematic diagram of a translation model according to Embodiment 3 of the present disclosure.
  • FIG. 6 is a schematic structural diagram of an apparatus for determining a corpus sample according to Embodiment 4 of the present disclosure
  • FIG. 7 is a schematic structural diagram of an electronic device according to Embodiment 5 of the present disclosure.
  • method embodiments of the present disclosure may be performed in different orders and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this regard.
  • the term “including” and variations thereof are open-ended inclusions, i.e., “including but not limited to”.
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment.”
  • FIG. 1 is a schematic flowchart of a method for determining a corpus sample according to Embodiment 1 of the present disclosure.
  • The method can be applied when selecting corpus samples for machine translation in a target field, for example selecting corpus samples from corpora that span different fields in order to train a translation model for the target domain.
  • The method can be performed by a corpus sample determination apparatus, which can be implemented in software and/or hardware and is generally integrated in an electronic device. In this embodiment, the electronic device includes a notebook computer, a tablet computer, a desktop computer, a server, and the like.
  • a method for determining a corpus sample provided by Embodiment 1 of the present disclosure includes the following steps.
  • A corpus entry is the basic unit constituting the corpus, and it can take the form of words, phrases, sentences, or the like.
  • the corpus in the corpus comes from different fields, such as law, medicine, mathematics, computer and other fields.
  • the corpus in these different fields together constitute the corpus of the general field, where the general field covers the target field and the non-target field.
  • The entries in the corpus can be divided into source-end corpus and target-end corpus according to the actual scenario. For example, when translating English into Chinese, "Hello" can serve as the source-end corpus and its Chinese equivalent as the target-end corpus. The two belong to different ends but have similar semantics, and they form a mapping relationship in the translation process.
  • the source corpus with similar semantics and the corresponding target corpus form a set of parallel corpus (Parallel Corpus).
  • the corpus in the general field can be used to train the translation model.
  • The translation model can be constructed based on a deep neural network. After large-scale training, it can learn the characteristics of the source-end corpus and the target-end corpus, as well as the mapping relationship between them. For an input source-end corpus in any field, the translation model can translate it and output the corresponding target-end corpus.
  • the process of constructing semantic vectors can be understood as the process of encoding the source corpus and target corpus in the general domain to extract corpus features.
  • the corpus in the corpus can be projected into a common semantic vector space, and the semantic vector of each corpus corresponds to a point in the semantic vector space.
  • a semantic vector can be represented as a three-dimensional vector [x, y, z]. If the semantics of the two corpora are similar and belong to the same domain, the distance between the semantic vectors corresponding to the two corpora is small, while the distance between the semantic vectors corresponding to corpora with different semantics or different domains is larger.
  • Saying that two corpus entries are similar, that their semantics are similar, or that their semantic vectors are similar can each be understood as meaning that the similarity between their semantic vectors is greater than or equal to a set threshold.
  • Euclidean distance, cosine similarity, and the like can be used to evaluate the similarity between semantic vectors. For example, the smaller the Euclidean distance between two semantic vectors in the semantic vector space, the higher their similarity; when the Euclidean distance is less than or equal to a set distance threshold, the two semantic vectors are similar. As another example, when the cosine similarity of two semantic vectors in the semantic vector space is greater than or equal to a set threshold, the two semantic vectors are similar.
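As a rough illustration of the two similarity measures just described, the following sketch compares toy three-dimensional semantic vectors. The function names, the vector values, and the thresholds are assumptions chosen for illustration; the disclosure does not fix any of them.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similar_by_distance(u, v, max_distance):
    """Similar when the Euclidean distance is at most the set threshold."""
    return euclidean_distance(u, v) <= max_distance


def similar_by_cosine(u, v, min_similarity):
    """Similar when the cosine similarity is at least the set threshold."""
    return cosine_similarity(u, v) >= min_similarity


# Hypothetical vectors: the same word in two domains, and an unrelated entry.
matrix_math = [1.0, 0.0, 0.0]
matrix_biology = [0.8, 0.6, 0.0]
unrelated = [0.0, 0.0, 1.0]

print(similar_by_distance(matrix_math, matrix_biology, 0.7))
print(similar_by_cosine(matrix_math, unrelated, 0.75))
```

Note that the two measures point in opposite directions: a smaller distance means more similar, while a larger cosine value means more similar, so the threshold comparison flips accordingly.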
  • The method of this embodiment does not distinguish between languages, domains, or source and target ends; it uniformly encodes the general-domain corpus in the corpus.
  • The corpus features extracted in this way are more comprehensive, so the translation model can fully learn the characteristics of corpora in different languages, in different domains, and at different ends. In practical application the model can be applied to any field and supports any translation direction, for example from English to Chinese and also from Chinese to English.
  • The same source-end corpus may have different meanings in different fields and therefore correspond to different target-end corpora.
  • Taking word-level corpus as an example, with English as the source-end corpus and Chinese as the target-end corpus: in mathematics, "Matrix" denotes the mathematical matrix, and the corresponding target-end corpus is the Chinese term for that sense; in biology and in geography, "Matrix" carries different senses, each with its own Chinese term as the corresponding target-end corpus. All of these entries belong to the general-domain corpus.
  • the target semantic vector refers to the semantic vector obtained by encoding the corpus belonging to the target domain, including the semantic vector corresponding to the source corpus of the target domain and the semantic vector corresponding to the target corpus of the target domain.
  • For example, if the target field is mathematics and the translation direction is English to Chinese, the semantic vectors obtained by encoding the English "Matrix" in its mathematical sense and the corresponding Chinese term are both target semantic vectors: the English "Matrix" in its mathematical sense is the source-end corpus of the target domain, and the Chinese term is the target-end corpus of the target domain.
  • a semantic vector similar to the target semantic vector is retrieved in the semantic vector space to form a candidate vector set, and the target semantic vector and the candidate vector set are taken together as corpus samples, thereby expanding the diversity of corpus samples.
  • the set of candidate vectors includes semantic vectors for at least one end of at least one domain.
  • For example, the target semantic vectors include those of the English "Matrix" in its mathematical sense and of the corresponding Chinese term. In the semantic vector space, some semantic vectors are similar to these target semantic vectors, such as those of "Matrix" in its biological and geographical senses and of the corresponding Chinese terms. These semantic vectors constitute the candidate vector set, covering the source and target ends in the fields of biology and geography.
  • The candidate vector set contains semantic vectors that belong to non-target domains but are similar to the target semantic vector of the target domain. These non-target-domain semantic vectors come from the source-end corpus and target-end corpus of the general domain in the corpus and satisfy the condition of being similar to the target semantic vector.
  • the target semantic vector is obtained by combining the feature encoding of the corpus in the target domain, which is different from the semantic vector obtained by unified encoding for the general domain feature in S110.
  • The process of retrieving the candidate vector set can be understood as follows: the corpus in the target domain is encoded, taking its characteristics into account, to obtain the target semantic vector, and the semantic vectors close to the target semantic vector in the semantic vector space are retrieved to form the candidate vector set.
  • The corpus corresponding to the candidate vector set can also be used as a corpus sample to train the translation model. Based on the extended corpus sample, the translation model can learn the characteristics of the corpus in the target domain and the non-target domains more accurately, perform translation tasks in the target domain more accurately, avoid confusing the features of different domains, and improve the accuracy and professionalism of the translation results.
  • When selecting corpus samples for the target domain, corpus samples can be determined not only according to the target semantic vectors belonging to the target domain but also by recalling general-domain corpus according to the retrieved candidate vector set, which provides richer and more comprehensive features of the target-domain and non-target-domain corpus.
  • The process of determining corpus samples can be understood as determining the mapping relationship between the corpus entries corresponding to the target semantic vector and to the candidate vector set, so as to form input samples and output samples for training the translation model. From these, the translation model learns the translation rule from the input samples (i.e., the source-end corpus in the corpus sample) to the output samples (i.e., the target-end corpus in the corpus sample).
  • For example, when the input is "Matrix" in its mathematical sense, the translation model should output the Chinese term for the mathematical matrix rather than the terms for its senses in other domains. If it does not, the network parameters of the translation model still need to be trained and adjusted iteratively until the model outputs the correct target-end corpus for each source-end corpus in the corpus samples of the target domain. At that point training is complete: the translation model has learned the features and translation rules of the target semantic vectors and of the similar non-target-domain semantic vectors, can distinguish corpora with similar semantics but different domains, and can perform translation tasks accurately in the target domain.
  • a corpus sample corresponding to the target domain is jointly constructed by using the target semantic vector and a set of similar candidate vectors, thereby expanding the scale of the corpus sample and improving the diversity of the corpus sample.
  • By retrieving a candidate vector set, the method for determining corpus samples improves the recall of non-target-domain corpus samples from the general-domain corpus. This expands the scale of the corpus samples and yields samples with rich features, from which the translation model can fully learn the characteristics of both the target-domain corpus samples and the recalled non-target-domain corpus samples.
  • The corpus samples obtained for the target domain on this basis are not limited by language or translation direction, and they provide a reliable basis for training the translation model.
  • FIG. 2 is a schematic flowchart of a method for determining a corpus sample according to Embodiment 2 of the present disclosure.
  • this Embodiment 2 describes the process of determining a candidate vector set.
  • Constructing the semantic vectors of the source-end corpus and the target-end corpus of the general domain in the corpus includes: encoding the source-end corpus and the target-end corpus of the general domain separately, according to the semantics of each corpus entry and the domain to which it belongs, to obtain the corresponding semantic vectors.
  • the translation model can fully learn the characteristics of the corpus in different fields and at different ends. In practical applications, it can be applied to any field and supports any translation direction.
  • Retrieving a set of candidate vectors similar to the target semantic vector of the target domain includes: calculating the similarity between the target semantic vector and each of the constructed semantic vectors; and determining the candidate vector set according to these similarities.
  • the scale and diversity of corpus samples are expanded by calculating similarity and retrieving a set of candidate vectors composed of similar semantic vectors.
  • the method before retrieving a set of candidate vectors similar to the target semantic vector of the target domain, the method further includes: encoding the source-end corpus and the target-end corpus of the target domain according to the semantics of the corpus to obtain the target semantic vector.
  • the target semantic vector is obtained by coding according to the characteristics of the corpus in the target domain, which fully considers the speciality and particularity of different domains, so that the translation model can learn the characteristics of the target domain more deeply.
  • a method for determining a corpus sample provided by Embodiment 2 of the present disclosure includes the following steps.
  • The general-domain corpus in the corpus is uniformly encoded according to semantics and domain, regardless of language and of source or target end, and the resulting semantic vectors carry both the semantics of the corpus and domain-related information. If two corpus entries have similar semantics and belong to the same domain, the similarity between their semantic vectors is high; the similarity between the semantic vectors of entries with different semantics or different domains is relatively low.
  • the translation model can fully learn the characteristics of the corpus in different fields and at different ends. In practical application, it can be applied to any field and supports any translation direction.
  • The target semantic vector is obtained by encoding according to the characteristics of the corpus in the target domain and serves as the basis for retrieving the candidate vector set and recalling similar corpus. This fully considers the specialization and particularity of different fields, so that the translation model can learn the characteristics of the target domain in greater depth.
  • the candidate vector set is retrieved by calculating the similarity between the target semantic vector and the constructed multiple semantic vectors.
  • the similarity is related to the distance between the semantic vectors, and can be expressed based on the cosine similarity or the Euclidean distance of the semantic vectors.
  • a semantic vector similar to the target semantic vector is selected to form a candidate vector set.
  • For example, the semantic vectors whose similarity to the target semantic vector is greater than or equal to a set threshold constitute the candidate vector set; or a set number of semantic vectors with the highest similarity to the target semantic vector constitute the candidate vector set; or a predetermined proportion of the semantic vectors in the semantic vector space with the highest similarity to the target semantic vector constitute the candidate vector set, and the like.
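The selection rules above can be sketched as follows, assuming the similarity between the target semantic vector and every constructed semantic vector has already been computed. The identifiers, the scores, and the choice to round the proportion up are illustrative assumptions, not details taken from the disclosure.

```python
import math


def select_by_threshold(scores, threshold):
    """Keep every id whose similarity is at least `threshold`."""
    return {vid for vid, score in scores.items() if score >= threshold}


def select_top_count(scores, count):
    """Keep the `count` ids with the highest similarity."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:count])


def select_top_proportion(scores, proportion):
    """Keep the most similar `proportion` (0..1) of all ids, rounded up."""
    count = math.ceil(len(scores) * proportion)
    return select_top_count(scores, count)


# Hypothetical similarities to one target semantic vector.
scores = {"bio_src": 0.91, "bio_tgt": 0.88, "geo_src": 0.74, "law_src": 0.12}

print(select_by_threshold(scores, 0.8))
print(select_top_count(scores, 3))
print(select_top_proportion(scores, 0.5))
```

A threshold gives a candidate set whose size depends on the data, whereas a fixed count or proportion gives a predictable size but may admit weakly similar vectors when few good matches exist.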
  • Retrieving a set of candidate vectors similar to the target semantic vector of the target domain includes at least one of the following: retrieving source-end semantic vectors of non-target domains that are similar to the source-end semantic vector of the target domain; retrieving target-end semantic vectors of non-target domains that are similar to the source-end semantic vector of the target domain; retrieving source-end semantic vectors of non-target domains that are similar to the target-end semantic vector of the target domain; retrieving target-end semantic vectors of non-target domains that are similar to the target-end semantic vector of the target domain.
  • The monolingual corpus in the target domain (including the source-end corpus and the target-end corpus of the target domain) is encoded to obtain the target semantic vector. The set of candidate vectors similar to the target semantic vector is then retrieved from the semantic vectors constructed for the general-domain corpus, and the target-domain corpus and the general-domain corpus recalled according to the candidate vector set are taken together as corpus samples.
  • the semantic vector constructed for the corpus in the general domain is essentially a multilingual semantic vector, and the features common to multiple languages and multiple domains are extracted.
  • Recall can proceed in any of four directions: recalling the source-end corpus of the general domain according to the source-end corpus of the target domain; recalling the target-end corpus of the general domain according to the source-end corpus of the target domain; recalling the source-end corpus of the general domain according to the target-end corpus of the target domain; and recalling the target-end corpus of the general domain according to the target-end corpus of the target domain.
  • For example, if the target domain is mathematics and the translation direction is English to Chinese, non-target-domain corpus can be recalled in any of the following ways:
  • according to the English "Matrix" in its mathematical sense in the target domain, the English "Matrix" in its non-target-domain senses can be recalled;
  • the "Matrix" entries of the non-target domains, in their respective senses, can be recalled;
  • according to the Chinese term for the mathematical matrix in the target domain, the Chinese terms for the non-target-domain senses can be recalled.
  • The corpus samples therefore cover not only the mapping from source-end corpus to target-end corpus in the target domain but also the source-end and target-end corpus of the non-target domains. Used as training data, such a sample provides the features and mapping relationship between source-end and target-end corpus in the target domain as well as those in the non-target domains.
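The four recall directions can be illustrated with a small sketch in which every entry carries its domain, its end, and a vector in a shared multilingual space. The `Entry` type, the placeholder texts, the vector values, and the use of cosine similarity with a threshold are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    text: str
    domain: str
    end: str  # "source" or "target"
    vector: tuple


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def recall(entries, target_domain, query_end, recall_end, min_similarity):
    """Recall non-target-domain entries at `recall_end` that are similar to
    any target-domain entry at `query_end`."""
    queries = [e for e in entries
               if e.domain == target_domain and e.end == query_end]
    pool = [e for e in entries
            if e.domain != target_domain and e.end == recall_end]
    return [c for c in pool
            if any(cosine(q.vector, c.vector) >= min_similarity for q in queries)]


# Hypothetical entries; "zh term" stands in for the Chinese target-end text.
entries = [
    Entry("Matrix [math]", "math", "source", (1.0, 0.0, 0.0)),
    Entry("zh term [math]", "math", "target", (0.9, 0.1, 0.0)),
    Entry("Matrix [biology]", "biology", "source", (0.8, 0.6, 0.0)),
    Entry("zh term [biology]", "biology", "target", (0.7, 0.7, 0.0)),
    Entry("Contract [law]", "law", "source", (0.0, 0.0, 1.0)),
]

for query_end in ("source", "target"):
    for recall_end in ("source", "target"):
        hits = recall(entries, "math", query_end, recall_end, 0.7)
        print(query_end, "->", recall_end, [e.text for e in hits])
```

Because source-end and target-end entries live in one vector space, the same function covers all four directions; only the two end filters change.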
  • FIG. 3 is a schematic diagram of recalling corpus in the general field according to the second embodiment of the present disclosure.
  • A target semantic vector is constructed for the corpus in the target domain, and semantic vectors are constructed for the corpus in the general domain to form a semantic vector space. By calculating the similarity between the target semantic vector and the semantic vectors in this space, a set of candidate vectors similar to the target semantic vector is retrieved. According to this set, corpus similar to the source-end and target-end corpus of the target domain is recalled from the general-domain corpus and, together with the target-domain corpus, constitutes the corpus sample.
  • By uniformly encoding the general-domain corpus, the method for determining corpus samples enables the translation model to fully learn the characteristics of corpora in different fields and at different ends, so that in practical application it can be applied to any field and supports any translation direction. By calculating similarity and retrieving a candidate vector set composed of similar semantic vectors, the scale and diversity of the corpus samples are expanded, and the feature mapping from non-target-domain source-end corpus to target-end corpus is added to the corpus samples, so that the translation model can fully learn the characteristics of both the target-domain and the recalled non-target-domain corpus samples. Because the target semantic vector is encoded from the characteristics of the target-domain corpus, the specialization and particularity of each domain are taken into account, which lets the translation model learn and distinguish domain-specific features in a more targeted way.
  • FIG. 4 is a schematic flowchart of a method for determining a corpus sample according to Embodiment 3 of the present disclosure.
  • On the basis of the above embodiments, Embodiment 3 describes the process of determining the corpus sample and clarifies how the source-end corpus and the target-end corpus in the corpus sample are determined.
  • Determining the candidate vector set according to the multiple similarities includes: based on a k-nearest neighbor (kNN) search algorithm, retrieving from the constructed semantic vectors a set number of semantic vectors with the highest similarity to the target semantic vector, which constitute the candidate vector set.
  • In this way the diversity of corpus samples is expanded, and the features and mapping relationships from non-target-domain source-end corpus to target-end corpus are added to the corpus samples, giving the samples richer features and greater training value.
  • The method further includes: training a translation model according to the corpus sample, wherein the translation model is established according to the source-end corpus and the target-end corpus of the general domain in the corpus.
  • the corpus samples include corpus in the target domain and non-target domain, the applicability of the translation model to any domain is also improved, and there is no need to select independent corpus samples for training in each domain.
  • a method for determining a corpus sample includes the following steps:
  • The one or more semantic vectors closest to the target semantic vector in the semantic vector space are regarded as similar semantic vectors, and the corpus corresponding to these adjacent semantic vectors is regarded as similar corpus and added to the corpus samples.
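A brute-force version of this nearest-neighbor retrieval might look like the sketch below. It uses Euclidean distance and takes the union of the neighbors of every target semantic vector; the function names, identifiers, and vector values are hypothetical, and the disclosure does not specify how the search is implemented.

```python
import heapq
import math


def k_nearest(query, vectors, k):
    """Return the ids of the k vectors closest to `query` (Euclidean distance)."""
    return heapq.nsmallest(
        k, vectors, key=lambda vid: math.dist(query, vectors[vid])
    )


def candidate_set(target_vectors, general_vectors, k):
    """Union of the k nearest neighbours of every target semantic vector."""
    candidates = set()
    for query in target_vectors:
        candidates.update(k_nearest(query, general_vectors, k))
    return candidates


# Hypothetical general-domain semantic vectors keyed by id.
general_vectors = {
    "bio_src": (0.8, 0.6, 0.0),
    "bio_tgt": (0.7, 0.7, 0.0),
    "geo_src": (0.6, 0.0, 0.8),
    "law_src": (0.0, 0.0, 1.0),
}
# Hypothetical target semantic vectors (source end and target end).
target_vectors = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0)]

print(sorted(candidate_set(target_vectors, general_vectors, k=2)))
```

This exhaustive scan compares each query with every stored vector, which is adequate for a small illustration; a large corpus would normally call for an indexed or approximate nearest-neighbor search instead.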
  • Retrieving a set of candidate vectors similar to the target semantic vector of the target domain includes at least one of the following: retrieving source-end semantic vectors of non-target domains that are similar to the source-end semantic vector of the target domain; retrieving target-end semantic vectors of non-target domains that are similar to the source-end semantic vector of the target domain; retrieving source-end semantic vectors of non-target domains that are similar to the target-end semantic vector of the target domain; retrieving target-end semantic vectors of non-target domains that are similar to the target-end semantic vector of the target domain.
  • The process of determining the corpus sample includes taking the target-domain corpus and the recalled general-domain corpus, identified through the target semantic vector and the retrieved candidate vector set, as the corpus sample, and determining the mapping relationship between entries in the corpus sample, that is, which entries serve as pre-translation corpus and which as post-translation corpus when training the translation model.
  • Determining the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set includes: taking the corpus corresponding to the source-end semantic vector of the target domain as pre-translation corpus, and taking the corpus corresponding to the target-end semantic vector of the target domain as post-translation corpus. It also includes at least one of the following:
  • taking the corpus corresponding to a non-target-domain source-end semantic vector that is similar to the source-end semantic vector of the target domain as pre-translation corpus, and the corpus corresponding to a non-target-domain target-end semantic vector that is similar to the target-end semantic vector of the target domain as post-translation corpus;
  • taking the corpus corresponding to a non-target-domain source-end semantic vector that is similar to the source-end semantic vector of the target domain as pre-translation corpus, and the corpus corresponding to a non-target-domain target-end semantic vector that is similar to the source-end semantic vector of the target domain as post-translation corpus;
  • taking the corpus corresponding to a non-target-domain source-end semantic vector that is similar to the target-end semantic vector of the target domain as pre-translation corpus, and the corpus corresponding to a non-target-domain target-end semantic vector that is similar to the target-end semantic vector of the target domain as post-translation corpus;
  • taking the corpus corresponding to a non-target-domain source-end semantic vector that is similar to the target-end semantic vector of the target domain as pre-translation corpus, and the corpus corresponding to a non-target-domain target-end semantic vector that is similar to the source-end semantic vector of the target domain as post-translation corpus.
  • the source corpus of the target domain is used as the pre-translation corpus, and the target-end corpus of the target domain is used as the translated corpus.
  • the source corpus in the non-target domain is used as the pre-translation corpus, and the target corpus in the non-target domain is used as the translated corpus.
  • the source-end corpus of the non-target domain may be recalled according to the source-end corpus of the target domain, or according to the target-end corpus of the target domain; likewise, the target-end corpus of the non-target domain may be recalled according to the source-end corpus of the target domain, or according to the target-end corpus of the target domain.
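The four recall directions described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: cosine similarity is an assumed similarity measure (the text only says "similar"), the two-dimensional vectors are made up, and the names `recall_four_directions`, `most_similar`, and the `en:`/`zh:` keys are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(query, pool):
    """Return the key in pool whose vector is most similar to query."""
    return max(pool, key=lambda key: cosine(query, pool[key]))

def recall_four_directions(target_src_vec, target_tgt_vec, general_src, general_tgt):
    """Recall one non-target-domain entry for each (query end, recalled end) pair."""
    return {
        ("source", "source"): most_similar(target_src_vec, general_src),
        ("source", "target"): most_similar(target_src_vec, general_tgt),
        ("target", "source"): most_similar(target_tgt_vec, general_src),
        ("target", "target"): most_similar(target_tgt_vec, general_tgt),
    }

# Made-up semantic vectors for source-end and target-end entries of the general domain.
general_src = {"en:matrix-math": [1.0, 0.0], "en:matrix-mother": [0.0, 1.0]}
general_tgt = {"zh:matrix-math": [0.9, 0.1], "zh:matrix-mother": [0.1, 0.9]}

recalled = recall_four_directions([1.0, 0.1], [0.8, 0.2], general_src, general_tgt)
for direction, key in recalled.items():
    print(direction, "->", key)
```

Because the source-end and target-end vectors are assumed to live in one shared multilingual space, the same similarity function serves all four directions.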
  • for example, the target domain is mathematics and the translation direction is English to Chinese
  • the recalled non-target-domain corpus includes "Matrix" whose real meaning is the mathematical matrix, "Matrix" whose real meaning is "mother", and the Chinese words for "matrix" and for "mother".
  • for the corpus in the target domain, "Matrix" with the real meaning of the mathematical matrix is used as the pre-translation corpus, and the Chinese word for "matrix" is used as the translated corpus; for the corpus in the non-target domain, "Matrix" with the real meaning of the mathematical matrix is used as the pre-translation corpus and the Chinese word for "matrix" as the translated corpus, while "Matrix" whose real meaning is "mother" is used as the pre-translation corpus and the Chinese word for "mother" as the translated corpus.
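The "Matrix" example can be written out as data. The sketch below is only illustrative: the Chinese strings 矩阵 ("matrix") and 母体 ("mother body") are an assumption about the words meant in the example, and the function name `assemble_samples` and the `origin` tag are hypothetical.

```python
def assemble_samples(target_pairs, recalled_pairs):
    """Tag each (pre-translation, translated) pair with its origin and concatenate them."""
    samples = [
        {"pre": pre, "post": post, "origin": "target-domain"}
        for pre, post in target_pairs
    ]
    samples += [
        {"pre": pre, "post": post, "origin": "non-target-domain"}
        for pre, post in recalled_pairs
    ]
    return samples

# Target domain: mathematics, translation direction: English to Chinese.
target_pairs = [("Matrix", "矩阵")]
# Pairs recalled from the non-target domain (assumed wording, see lead-in).
recalled_pairs = [("Matrix", "矩阵"), ("Matrix", "母体")]

samples = assemble_samples(target_pairs, recalled_pairs)
for sample in samples:
    print(sample)
```

Keeping the origin tag is a design choice of this sketch; it makes it easy to inspect how much of the final sample set came from recall.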
  • the source-end corpus and the target-end corpus of the general domain in the corpus are used to train a cross-language translation model for the general domain; the translation model is then trained with the corpus samples determined for the target domain, and its network parameters are adjusted to achieve domain adaptation of the translation model.
  • when the corpus samples are used to train the translation model, not only are the source-end corpus and the target-end corpus of the target domain used as input samples and output samples, but the recalled source-end corpus and target-end corpus of the non-target domain are used as input samples and output samples as well, so as to obtain a more professional translation model that can support accurate translation in any field, any translation direction, and any language.
  • the translation model includes a multi-layer semantic encoder and a single-layer semantic decoder, wherein the encoder and the decoder can be implemented using a recurrent neural network (RNN) architecture, such as a Long Short-Term Memory (LSTM) network or a Gated Recurrent Unit (GRU), or using a Transformer model, etc. The corpora of all languages and all translation directions in the general domain in the corpus are trained on the same model.
  • FIG. 5 is a schematic diagram of a translation model according to Embodiment 3 of the present disclosure.
  • the translation model includes an encoding network for extracting the semantic features of the corpus (x1, x2, ..., xN) in the corpus samples, and a decoding network for decoding the semantic features, that is, determining, according to the semantic features of multiple corpora, the target-end corpus with the highest similarity for each source-end corpus, thereby obtaining the mapping relationship between the source-end corpus and the target-end corpus.
  • y0 and y1 are encoded by the translation model according to the coding rules, and the decoding network decodes the features extracted by the encoding to find the corresponding corpus y2 and y3, respectively, as the corresponding translation results.
  • the translation model is established based on the general-domain corpus in the corpus and is then trained on the corpus samples for the target domain to adjust its network parameters, so that it is more professional, applicable to any professional domain, and achieves higher translation accuracy.
  • the third embodiment of the present disclosure provides a method for determining a corpus sample.
  • the corpus of the target domain and the recalled general-domain corpus together form the corpus sample of the target domain; adding the features and mapping relationships of the source-end and target-end corpus of the non-target domain gives the corpus samples richer features and more professional training value. By encoding the general-domain corpus into multilingual semantic vectors, a translation model is obtained through preliminary training, and the corpus samples of the target domain are then used to train this translation model, which improves the professionalism and reliability of the translation model for translation in different fields and can support accurate translation in any field, any translation direction, and any language.
  • FIG. 6 is a schematic structural diagram of a corpus sample determination device provided in Embodiment 4 of the present disclosure.
  • the device can be applied to the case of selecting corpus samples for machine translation in a specific field, and is used for selecting corpus samples in different fields.
  • the apparatus can be implemented by software and/or hardware, and is generally integrated on electronic equipment.
  • the apparatus includes: a construction module 410 , a retrieval module 420 and a sample determination module 430 .
  • the construction module 410 is configured to construct semantic vectors of the source-end corpus and the target-end corpus of the general domain in the corpus;
  • the retrieval module 420 is configured to retrieve a candidate vector set similar to the target semantic vector of the target domain in the constructed semantic vector, wherein the candidate vector set includes the semantic vector of at least one end of at least one domain;
  • the sample determination module 430 is configured to determine the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.
  • the construction module 410 constructs the semantic vectors of the source-end corpus and the target-end corpus of the general domain in the corpus; the retrieval module 420 retrieves, from the constructed semantic vectors, a candidate vector set similar to the target semantic vector of the target domain, wherein the candidate vector set includes semantic vectors of at least one end of at least one domain; and the sample determination module 430 determines the corpus samples corresponding to the target domain according to the target semantic vector and the candidate vector set.
  • the building module 410 is set as:
  • the source corpus and the target corpus of the general domain in the corpus are encoded respectively, and the semantic vectors corresponding to the source corpus and the target corpus of the general domain in the corpus are obtained.
  • the retrieval module 420 includes:
  • a computing unit configured to calculate the similarity between the target semantic vector and the constructed multiple semantic vectors
  • a set determination unit configured to determine the candidate vector set according to the multiple degrees of similarity.
  • the set determination unit is set to:
  • a set number of semantic vectors with the highest similarity to the target semantic vector are retrieved from the constructed semantic vectors to form the candidate vector set.
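The behaviour of the computing unit and the set determination unit can be illustrated with a top-k search. This is a sketch under stated assumptions: cosine similarity stands in for the unspecified similarity measure, `k` plays the role of the "set number", and the function names and vector ids are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def retrieve_top_k(target_vector, constructed_vectors, k):
    """Return the ids of the k constructed vectors most similar to target_vector."""
    scored = [
        (cosine_similarity(target_vector, vec), vec_id)
        for vec_id, vec in constructed_vectors.items()
    ]
    # Highest similarity first; keep only the first k entries.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [vec_id for _, vec_id in scored[:k]]

# Made-up semantic vectors of general-domain corpus entries.
general_vectors = {
    "law-src-1": [0.9, 0.1, 0.0],
    "math-src-7": [0.1, 0.9, 0.2],
    "med-tgt-3": [0.0, 0.2, 0.9],
}

candidates = retrieve_top_k([0.0, 1.0, 0.1], general_vectors, k=2)
print(candidates)
```

A linear scan like this is only practical for small corpora; a large corpus would normally call for an approximate nearest-neighbour index, which the text does not specify.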
  • the target semantic vector includes the source semantic vector of the target domain and the target semantic vector of the target domain;
  • the retrieval of a set of candidate vectors similar to the target semantic vector of the target domain includes at least one of the following:
  • the sample determination module 430 is set to:
  • the sample determination module 430 is further configured to be at least one of the following:
  • the corpus corresponding to the source semantic vector of the non-target domain that is similar to the source semantic vector of the target domain is used as the pre-translation corpus, and the corpus corresponding to the target-end semantic vector of the non-target domain that is similar to the target-end semantic vector of the target domain is used as the translated corpus;
  • the corpus corresponding to the source semantic vector of the non-target domain that is similar to the target-end semantic vector of the target domain is used as the pre-translation corpus, and the corpus corresponding to the target-end semantic vector of the non-target domain that is similar to the target-end semantic vector of the target domain is used as the translated corpus;
  • the encoding module is configured to encode the source corpus and the target corpus of the target domain according to the semantics of the corpus to obtain the target semantic vector before retrieving the candidate vector set similar to the target semantic vector of the target domain.
  • the training module is configured to train a translation model according to the corpus sample after the corpus sample corresponding to the target domain is determined according to the target semantic vector and the candidate vector set, wherein the translation model is established according to the source-end corpus and the target-end corpus of the general domain in the corpus.
  • the above apparatus for determining a corpus sample can execute the method for determining a corpus sample provided by any embodiment of the present disclosure, and has functional modules and effects corresponding to the execution method.
  • FIG. 7 is a schematic structural diagram of an electronic device according to Embodiment 5 of the present disclosure.
  • FIG. 7 shows a schematic structural diagram of an electronic device 600 suitable for implementing an embodiment of the present disclosure.
  • the electronic device 600 in the embodiment of the present disclosure includes a notebook computer, a tablet computer, a desktop computer, a server, and the like.
  • the electronic device 600 shown in FIG. 7 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • the electronic device 600 may include one or more processing devices (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from the storage device 608 into a random access memory (RAM) 603.
  • One or more processing devices 601 implement methods as provided by the present disclosure.
  • in the RAM 603, various programs and data necessary for the operation of the electronic device 600 are also stored.
  • the processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604.
  • An input/output (I/O) interface 605 is also connected to the bus 604.
  • The following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), speakers, a vibrator, etc.; storage devices 608 including, for example, magnetic tape, hard disk, etc., arranged to store one or more programs; and communication devices 609. The communication devices 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data.
  • Although FIG. 7 shows the electronic device 600 having various devices, it is not required to implement or have all of the illustrated devices. More or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 609 , or from the storage device 608 , or from the ROM 602 .
  • when the computer program is executed by the processing apparatus 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • the computer-readable storage medium can be, for example, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above.
  • Computer-readable storage media can include: electrical connections with one or more wires, portable computer disks, hard disks, RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), flash memory, optical fibers, portable Compact Disc Read-Only Memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
  • the program code embodied on the computer-readable medium may be transmitted by any suitable medium, including: electric wire, optical fiber cable, radio frequency (RF), etc., or any suitable combination of the above.
  • the storage medium may be a non-transitory storage medium.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device 600;
  • the above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, they cause the electronic device 600 to: construct semantic vectors of the source-end corpus and the target-end corpus of the general domain in the corpus; retrieve, from the constructed semantic vectors, a candidate vector set similar to the target semantic vector of the target domain, wherein the candidate vector set includes the semantic vector of at least one end of at least one domain; and determine the corpus samples corresponding to the target domain according to the target semantic vector and the candidate vector set.
  • Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or can be connected to an external computer (for example, via the Internet using an Internet service provider).
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which contains one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the blocks may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations can be implemented in special purpose hardware-based systems that perform the specified functions or operations, or special purpose hardware implemented in combination with computer instructions.
  • the modules involved in the embodiments of the present disclosure may be implemented in software or hardware. Among them, the name of the module does not constitute a limitation of the module itself in one case.
  • FPGA Field Programmable Gate Array
  • ASIC Application Specific Integrated Circuit
  • ASSP Application Specific Standard Parts
  • SOC System on Chip
  • CPLD Complex Programmable Logic Device
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • Machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, RAM, ROM, EPROM, flash memory, optical fibers, portable CD-ROMs, optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • Example 1 provides a method for determining a corpus sample, including:
  • the constructed semantic vector retrieve a candidate vector set similar to the target semantic vector of the target domain, wherein the candidate vector set includes the semantic vector of at least one end of at least one domain;
  • a corpus sample corresponding to the target domain is determined according to the target semantic vector and the candidate vector set.
  • Example 2 constructs the semantic vector of the source-end corpus and the target-end corpus of the general domain in the corpus, including:
  • the source corpus and the target corpus of the general domain in the corpus are encoded respectively, and the semantic vectors corresponding to the source corpus and the target corpus of the general domain in the corpus are obtained.
  • Example 3 According to the method described in Example 1, in the constructed semantic vector, retrieving a set of candidate vectors similar to the target semantic vector of the target domain, including:
  • the set of candidate vectors is determined according to a plurality of degrees of similarity.
  • Example 4 determines the candidate vector set according to a plurality of similarities, including:
  • a set number of semantic vectors with the highest similarity to the target semantic vector are retrieved from the constructed semantic vectors to form the candidate vector set.
  • Example 5 is according to the method of Example 1,
  • the target semantic vector includes the source semantic vector of the target domain and the target semantic vector of the target domain;
  • the retrieval of a set of candidate vectors similar to the target semantic vector of the target domain includes at least one of the following:
  • Example 6 is according to the method of Example 5,
  • the corpus sample corresponding to the target domain is determined according to the target semantic vector and the candidate vector set, including:
  • the corpus corresponding to the source semantic vector of the non-target domain that is similar to the source semantic vector of the target domain is used as the pre-translation corpus, and the corpus corresponding to the target-end semantic vector of the non-target domain that is similar to the target-end semantic vector of the target domain is used as the translated corpus;
  • the corpus corresponding to the source semantic vector of the non-target domain that is similar to the target-end semantic vector of the target domain is used as the pre-translation corpus, and the corpus corresponding to the target-end semantic vector of the non-target domain that is similar to the target-end semantic vector of the target domain is used as the translated corpus;
  • Example 7 according to the method of Example 1, before retrieving a set of candidate vectors similar to the target semantic vector of the target domain, further includes:
  • the source corpus and the target corpus of the target domain are encoded to obtain the target semantic vector.
  • Example 8 is according to the method of Example 1,
  • the method further includes:
  • a translation model is trained according to the corpus samples, wherein the translation model is established according to the source-end corpus and the target-end corpus of the general domain in the corpus.
  • Example 9 provides an apparatus for determining a corpus sample, including:
  • the construction module is configured to construct semantic vectors of the source-end corpus and the target-end corpus of the general domain in the corpus;
  • the retrieval module is configured to retrieve a candidate vector set similar to the target semantic vector of the target domain in the constructed semantic vector, wherein the candidate vector set includes the semantic vector of at least one end of at least one domain;
  • the sample determination module is configured to determine the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.
  • Example 10 provides an electronic device comprising:
  • storage means arranged to store one or more programs
  • the one or more programs when executed by the one or more processing devices, cause the one or more processing devices to implement the method as described in any of Examples 1-8.
  • Example 11 provides a computer-readable medium having stored thereon a computer program that, when executed by a processing apparatus, implements the method as described in any one of Examples 1-8.

Abstract

The present invention provides a corpus sample determination method and apparatus, an electronic device, and a storage medium. The corpus sample determination method comprises: constructing semantic vectors of a source-side corpus and a target-side corpus in a general domain in a text corpus; retrieving, in the constructed semantic vectors, a candidate vector set similar to a target semantic vector in a target domain, wherein the candidate vector set comprises a semantic vector of at least one side of at least one domain; and determining, according to the target semantic vector and the candidate vector set, a corpus sample corresponding to the target domain.

Description

Corpus sample determination method, apparatus, electronic device and storage medium
This application claims priority to Chinese Patent Application No. 202011538595.8, filed with the China Patent Office on December 23, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to machine learning technology, for example, to a corpus sample determination method, apparatus, electronic device and storage medium.
Background
Machine translation refers to translating source text in one language into target text in another language. With the rapid development of deep learning technology, the quality of neural-network-based machine translation has been continuously improving, and it plays an increasingly important role in daily life and in industrial production environments. In the process of building a translation model, a large number of corpus samples are needed to train the model so that it learns the semantic features of different corpora and can effectively translate the source text into the target text. However, syntax and semantics differ between fields. For example, fields such as medicine, law, and economics contain many specialized terms, so more specialized translation models need to be trained to ensure translation accuracy.
Corpus samples can be constructed by selecting monolingual corpus in the target domain, or monolingual corpus in the target domain can be used to construct pseudo-parallel corpus in the target domain as corpus samples for training the translation model, so as to achieve domain adaptation. However, the corpus samples these methods use to train the translation model all target a single specific domain: corpus samples in a specific domain are hard to obtain, the sample size is usually small, and the semantic features of the corpus samples are too homogeneous, making it difficult to train the translation model adequately.
Summary
The present disclosure provides a corpus sample determination method, apparatus, electronic device and storage medium, which improve the diversity of corpus samples.
A corpus sample determination method is provided, including:
constructing semantic vectors of the source-end corpus and the target-end corpus of the general domain in a corpus;
retrieving, from the constructed semantic vectors, a candidate vector set similar to the target semantic vector of the target domain, wherein the candidate vector set includes the semantic vector of at least one end of at least one domain;
determining the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.
A corpus sample determination apparatus is also provided, including:
a construction module, configured to construct semantic vectors of the source-end corpus and the target-end corpus of the general domain in a corpus;
a retrieval module, configured to retrieve, from the constructed semantic vectors, a candidate vector set similar to the target semantic vector of the target domain, wherein the candidate vector set includes the semantic vector of at least one end of at least one domain;
a sample determination module, configured to determine the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.
An electronic device is also provided, including:
one or more processors;
a storage device, configured to store one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method provided by the embodiments of the present disclosure.
A computer-readable medium is also provided, which stores a computer program that, when executed by a processing apparatus, implements the method provided by the embodiments of the present disclosure.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method for determining a corpus sample according to Embodiment 1 of the present disclosure;
FIG. 2 is a schematic flowchart of a method for determining a corpus sample according to Embodiment 2 of the present disclosure;
FIG. 3 is a schematic diagram of recalling corpus in the general domain according to Embodiment 2 of the present disclosure;
FIG. 4 is a schematic flowchart of a method for determining a corpus sample according to Embodiment 3 of the present disclosure;
FIG. 5 is a schematic diagram of a translation model according to Embodiment 3 of the present disclosure;
FIG. 6 is a schematic structural diagram of an apparatus for determining a corpus sample according to Embodiment 4 of the present disclosure;
FIG. 7 is a schematic structural diagram of an electronic device according to Embodiment 5 of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described below with reference to the accompanying drawings. Although some embodiments of the present disclosure are shown in the drawings, the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. The drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the protection scope of the present disclosure.
The multiple steps described in the method embodiments of the present disclosure may be performed in different orders and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this regard.
As used herein, the term "including" and its variations are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment".
The modifiers "a" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive, and should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.
In the following embodiments, each embodiment provides both optional features and examples. The features described in the embodiments can be combined to form multiple optional solutions, and each numbered embodiment should not be regarded as only a single technical solution. Furthermore, the embodiments of the present disclosure and the features of the embodiments may be combined with each other where no conflict arises.
实施例一Example 1
图1为本公开实施例一提供的一种语料样本确定方法的流程示意图,该方法可适用于针对目标领域的机器翻译选取语料样本的情况,例如用于在涉及不同领域的语料库中选取语料样本,用于训练目标领域的翻译模型的情况。该方法可以由语料样本确定装置来执行,其中该装置可由软件和/或硬件实现,并一般集成在电子设备上,在本实施例中电子设备包括笔记本电脑、平板电脑、台式计算机、服务器等。FIG. 1 is a schematic flowchart of a method for determining a corpus sample according to Embodiment 1 of the present disclosure. The method can be applied to the case of selecting corpus samples for machine translation in a target field, for example, for selecting corpus samples in corpora involving different fields. , for training a translation model in the target domain. The method can be performed by a corpus sample determination device, wherein the device can be implemented by software and/or hardware, and is generally integrated on an electronic device, in this embodiment, the electronic device includes a notebook computer, a tablet computer, a desktop computer, a server, and the like.
As shown in FIG. 1, the method for determining a corpus sample provided by Embodiment 1 of the present disclosure includes the following steps.
S110. Construct semantic vectors of the source-end corpus and the target-end corpus of the general domain in the corpus.
In this embodiment, a corpus entry is the basic unit of the corpus and may take the form of a character, a word, a phrase, a sentence, or the like. The entries in the corpus come from different domains, such as law, medicine, mathematics and computer science. Together, the entries of these different domains constitute the corpus of the general domain, where the general domain covers both the target domain and non-target domains. Depending on the actual scenario, the entries in the corpus can be divided into source-end corpus and target-end corpus. For example, when translating English into Chinese, "Hello" can serve as source-end corpus and, correspondingly, "你好" can serve as target-end corpus. "Hello" and "你好" belong to different ends but have similar semantics, and form a mapping relationship in the translation process. A source-end corpus entry and the semantically similar target-end corpus entry constitute a pair of parallel corpus (Parallel Corpus).
The corpus of the general domain can be used to train a translation model. The translation model can be built on a deep neural network and, after large-scale training, can learn the features of the source-end corpus and the target-end corpus as well as the mapping relationship between them. For source-end corpus of any domain given as input, the translation model can translate it and output the corresponding target-end corpus.
Constructing the semantic vectors can be understood as encoding the source-end corpus and the target-end corpus of the general domain to extract their features. Through encoding, the entries in the corpus are projected into a common semantic vector space, and the semantic vector of each entry corresponds to a point in that space. For example, a semantic vector may be represented as a three-dimensional vector [x, y, z]. If two entries have similar semantics and belong to the same domain, the distance between their semantic vectors is small, whereas the distance between the semantic vectors of entries with different semantics or different domains is larger.
In this embodiment, "similar corpus", "similar semantics" and "similar semantic vectors" can all be understood to mean that the similarity between the semantic vectors is greater than or equal to a set threshold. In one example, the Euclidean distance, the cosine similarity, or the like may be used as the measure of similarity between semantic vectors. For example, the smaller the Euclidean distance between two semantic vectors in the semantic vector space, the higher their similarity; when the Euclidean distance between two semantic vectors is less than or equal to a set distance threshold, the two are similar semantic vectors. As another example, when the cosine similarity of two semantic vectors in the semantic vector space is greater than or equal to a set threshold, the two are similar semantic vectors.
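The two similarity criteria described above can be sketched as follows. This is only an illustrative sketch: the function names and the threshold parameters `max_distance` and `min_similarity` are assumptions introduced here, and the disclosure does not prescribe concrete threshold values.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two semantic vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_similar_by_distance(u, v, max_distance):
    """Similar when the Euclidean distance is at most the set distance threshold."""
    return euclidean_distance(u, v) <= max_distance


def is_similar_by_cosine(u, v, min_similarity):
    """Similar when the cosine similarity is at least the set threshold."""
    return cosine_similarity(u, v) >= min_similarity
```

Note that the two criteria point in opposite directions: a smaller distance means more similar, while a larger cosine value means more similar, so the comparison operator differs between the two checks.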
The method of this embodiment does not distinguish between languages, domains, or the source end and target end, and uniformly encodes the corpus of the general domain. The features extracted on this basis are more comprehensive, allowing the translation model to fully learn the features of corpus in different languages, different domains and different ends. In practical applications the model can be applied to any domain and supports any translation direction, for example from English to Chinese or from Chinese to English.
The same source-end corpus entry may have different meanings in different domains and correspond to different target-end corpus entries. For ease of understanding, consider corpus in the form of words, with English source-end corpus and Chinese target-end corpus. In mathematics, "Matrix" means a matrix, and the corresponding target-end corpus is "矩阵"; in biology, it means a parent body, and the corresponding target-end corpus is "母体"; in geography, it means a substrate, and the corresponding target-end corpus is "基质". All of these entries belong to the corpus of the general domain. In the semantic vector space, the distance between the semantic vectors of "Matrix" and "矩阵", the distance between those of "Matrix" and "母体", and the distance between those of "Matrix" and "基质" are all small. For a translation model trained on the corpus of the general domain, the translation result output for the input "Matrix" may therefore be any one of "矩阵", "母体" and "基质". To obtain a translation model suited to a target domain (for example, mathematics), at least the parallel pair consisting of "Matrix" in its mathematical sense and "矩阵" needs to be selected and used to further train the model obtained from the general-domain corpus, so as to adjust the network parameters of the translation model.
S120. Among the constructed semantic vectors, retrieve a candidate vector set similar to the target semantic vector of the target domain, wherein the candidate vector set includes semantic vectors of at least one end of at least one domain.
In this embodiment, a target semantic vector is a semantic vector obtained by encoding corpus belonging to the target domain, and includes the semantic vector corresponding to the source-end corpus of the target domain and the semantic vector corresponding to the target-end corpus of the target domain. For example, if the target domain is mathematics and the translation direction is English to Chinese, the semantic vectors obtained by encoding "Matrix" in its mathematical sense and "矩阵" are both target semantic vectors, where "Matrix" in its mathematical sense is the source-end corpus of the target domain and "矩阵" is the target-end corpus of the target domain.
In this embodiment, semantic vectors similar to the target semantic vector are retrieved in the semantic vector space to form a candidate vector set, and the target semantic vector and the candidate vector set are used together to form the corpus samples, thereby increasing the diversity of the corpus samples. The candidate vector set includes semantic vectors of at least one end of at least one domain. For example, suppose the target semantic vectors are those of "Matrix" in its mathematical sense and of "矩阵". In the semantic vector space there are semantic vectors similar to these target semantic vectors, such as those corresponding to "Matrix" in the sense of a substrate, "Matrix" in the sense of a parent body, "基质" and "母体". These semantic vectors constitute the candidate vector set, which covers the source end and the target end of the biology and geography domains. The candidate vector set thus contains semantic vectors of non-target domains that are similar to the target semantic vector of the target domain; they come from the source-end corpus and the target-end corpus of the general domain in the corpus and satisfy the similarity condition with respect to the target semantic vector.
In one example, the target semantic vector is obtained by encoding that takes into account the features of the corpus within the target domain, and therefore differs from the semantic vectors obtained in S110 by uniform encoding based on general-domain features. Retrieving the candidate vector set can be understood as follows: the corpus of the target domain is encoded, taking into account its features within the target domain, to obtain the target semantic vector, and semantic vectors close to the target semantic vector are then retrieved in the semantic vector space to form the candidate vector set. On this basis, the corpus corresponding to the candidate vector set can also be used as corpus samples for training the translation model. With the expanded corpus samples, the translation model can learn the features of target-domain and non-target-domain corpus more accurately, and can therefore perform translation tasks in the target domain more accurately, avoid confusing the features of different domains, and improve the accuracy and professionalism of the translation results.
S130. Determine the corpus samples corresponding to the target domain according to the target semantic vector and the candidate vector set.
In this embodiment, when corpus samples are selected for the target domain, they can be determined not only from the target semantic vectors belonging to the target domain; corpus of the general domain can also be recalled according to the retrieved candidate vector set and added to the corpus samples, thereby providing richer and more comprehensive features of target-domain and non-target-domain corpus.
Determining the corpus samples can be understood as determining the mapping relationship between the corpus corresponding to the target semantic vector and to the candidate vector set, so as to form input samples and output samples for training the translation model. From these, the translation model learns the translation rules from the input samples (that is, the source-end corpus in the corpus samples) to the output samples (that is, the target-end corpus in the corpus samples). For example, for the mathematics domain, when "Matrix" in its mathematical sense is input, the translation model should correctly output "矩阵" rather than "基质" or "母体". If "矩阵" is not output, the network parameters of the translation model need further iterative training and adjustment until the model outputs the correct target-end corpus for every source-end corpus entry in the corpus samples of the target domain, at which point training is complete. The translation model has then fully learned the features and translation rules of the target semantic vectors and of the similar non-target-domain semantic vectors, can effectively distinguish the features of corpus that are semantically close but belong to different domains, and can be applied to the target domain to perform translation tasks accurately.
With the above technical solution, the corpus samples corresponding to the target domain are constructed jointly from the target semantic vector and the candidate vector set similar to it, which enlarges the corpus samples and increases their diversity.
By retrieving the candidate vector set, the method for determining a corpus sample provided by this embodiment improves the recall of non-target-domain corpus samples from the general-domain corpus. This enlarges the corpus samples and yields feature-rich samples from which the translation model can fully learn the features of both the target-domain corpus samples and the recalled non-target-domain corpus samples. The corpus samples obtained for the target domain on this basis are not restricted by language or translation direction and provide a highly reliable basis for training the translation model.
Embodiment 2
FIG. 2 is a schematic flowchart of a method for determining a corpus sample according to Embodiment 2 of the present disclosure. On the basis of Embodiment 1, Embodiment 2 describes the process of determining the candidate vector set. In this embodiment, constructing the semantic vectors of the source-end corpus and the target-end corpus of the general domain in the corpus includes: encoding the source-end corpus and the target-end corpus of the general domain in the corpus respectively, according to the semantics of the corpus and the domain to which it belongs, to obtain the semantic vectors corresponding to the source-end corpus and the target-end corpus of the general domain. By uniformly encoding the corpus of the general domain, the translation model can fully learn the features of corpus in different domains and at different ends; in practical applications it can be applied to any domain and supports any translation direction.
Optionally, retrieving, among the constructed semantic vectors, a candidate vector set similar to the target semantic vector of the target domain includes: calculating the similarity between the target semantic vector and each of the constructed semantic vectors; and determining the candidate vector set according to the calculated similarities. Calculating the similarities and retrieving a candidate vector set composed of similar semantic vectors increases the size and diversity of the corpus samples.
Optionally, before retrieving the candidate vector set similar to the target semantic vector of the target domain, the method further includes: encoding the source-end corpus and the target-end corpus of the target domain according to the semantics of the corpus to obtain the target semantic vector. Obtaining the target semantic vector by encoding according to the features of the corpus within the target domain takes full account of the professional and specific nature of different domains, so that the translation model learns the features of the target domain in greater depth.
For details not fully described in this embodiment, refer to Embodiment 1.
As shown in FIG. 2, the method for determining a corpus sample provided by Embodiment 2 of the present disclosure includes the following steps.
S210. According to the semantics of the corpus and the domain to which it belongs, encode the source-end corpus and the target-end corpus of the general domain in the corpus respectively, to obtain the semantic vectors corresponding to the source-end corpus and the target-end corpus of the general domain.
In this embodiment, the corpus of the general domain is uniformly encoded according to its semantics and domain, without distinguishing between languages or between the source end and the target end, so that the resulting semantic vectors contain both semantic and domain-related information. If two corpus entries have similar semantics and belong to the same domain, the similarity between their semantic vectors is high, whereas the similarity between the semantic vectors of entries with different semantics or different domains is low. By uniformly encoding the corpus of the general domain, the translation model can fully learn the features of corpus in different domains and at different ends; in practical applications it can be applied to any domain and supports any translation direction.
S220. According to the semantics of the corpus, encode the source-end corpus and the target-end corpus of the target domain to obtain the target semantic vector.
In this embodiment, the target semantic vector is obtained by encoding according to the features of the corpus within the target domain and serves as the basis for retrieving the candidate vector set, that is, for recalling similar corpus. This takes full account of the professional and specific nature of different domains, so that the translation model learns the features of the target domain in greater depth.
S230. Calculate the similarity between the target semantic vector and each of the constructed semantic vectors.
In this embodiment, the candidate vector set is retrieved by calculating the similarity between the target semantic vector and each of the constructed semantic vectors. The similarity is related to the distance between semantic vectors and may be expressed by their cosine similarity or Euclidean distance.
S240. Determine the candidate vector set according to the calculated similarities.
In this embodiment, semantic vectors similar to the target semantic vector are selected to form the candidate vector set. For example, the candidate vector set may consist of the semantic vectors in the semantic vector space whose similarity to the target semantic vector is greater than or equal to a set threshold; or of a set number of semantic vectors with the highest similarity to the target semantic vector; or of a predetermined proportion of the semantic vectors in the semantic vector space, namely those with the highest similarity to the target semantic vector.
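The three selection strategies just listed (threshold, fixed number, fixed proportion) can be sketched as below. The sketch assumes that the similarities have already been computed and are supplied as `(item, score)` pairs; the function names and the rounding-up of the proportion are illustrative assumptions, not details given in the disclosure.

```python
import math


def select_by_threshold(scored, min_similarity):
    """Keep every item whose similarity is at least the set threshold."""
    return [item for item, score in scored if score >= min_similarity]


def select_top_k(scored, k):
    """Keep the k items with the highest similarity, best first."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in ranked[:k]]


def select_top_fraction(scored, fraction):
    """Keep the given proportion of items (rounded up), best first."""
    k = math.ceil(len(scored) * fraction)
    return select_top_k(scored, k)


# Hypothetical similarity scores between a target semantic vector
# and four general-domain semantic vectors.
scored = [("a", 0.9), ("b", 0.4), ("c", 0.7), ("d", 0.1)]
by_threshold = select_by_threshold(scored, 0.7)
by_count = select_top_k(scored, 2)
by_fraction = select_top_fraction(scored, 0.25)
```

The threshold variant yields a candidate set whose size depends on the data, whereas the other two fix the size in advance, which may be preferable when the training budget is bounded.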
S250. Determine the corpus samples corresponding to the target domain according to the target semantic vector and the candidate vector set.
On the basis of the above, retrieving a candidate vector set similar to the target semantic vector of the target domain includes at least one of the following: retrieving source-end semantic vectors of a non-target domain that are similar to a source-end semantic vector of the target domain; retrieving target-end semantic vectors of a non-target domain that are similar to a source-end semantic vector of the target domain; retrieving source-end semantic vectors of a non-target domain that are similar to a target-end semantic vector of the target domain; and retrieving target-end semantic vectors of a non-target domain that are similar to a target-end semantic vector of the target domain.
In this embodiment, when the candidate vector set is retrieved for the target semantic vector, the monolingual corpus of the target domain (including both its source-end corpus and its target-end corpus) is encoded to obtain the target semantic vectors; the candidate vector set of general-domain vectors similar to the target semantic vectors is retrieved from the semantic vectors already constructed; and the corpus of the target domain, together with the general-domain corpus recalled according to the candidate vector set, is used as the corpus samples. The semantic vectors constructed for the general-domain corpus are in essence multilingual semantic vectors, which capture features common to multiple languages and multiple domains. Corresponding to the four retrieval modes above, this embodiment has four recall modes that improve the recall of corpus samples: recalling source-end corpus of the general domain according to source-end corpus of the target domain; recalling target-end corpus of the general domain according to source-end corpus of the target domain; recalling source-end corpus of the general domain according to target-end corpus of the target domain; and recalling target-end corpus of the general domain according to target-end corpus of the target domain.
For example, if the target domain is mathematics and the translation direction is English to Chinese, corpus of non-target domains can be recalled in any of the following ways:
according to "Matrix" in its mathematical sense in the target domain, "Matrix" in the sense of a substrate and "Matrix" in the sense of a parent body in non-target domains can be recalled;
according to "Matrix" in its mathematical sense in the target domain, "基质" and "母体" in non-target domains can be recalled;
according to "矩阵" in the target domain, "Matrix" in the sense of a substrate and "Matrix" in the sense of a parent body in non-target domains can be recalled;
according to "矩阵" in the target domain, "基质" and "母体" in non-target domains can be recalled.
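The four recall modes above can be sketched as one routine that iterates over the query end (source or target) and the candidate end (source or target). This is a minimal sketch under stated assumptions: the `Entry` record, the two-dimensional toy vectors, the domain labels and the 0.9 cosine threshold are all hypothetical values chosen for illustration, and a real system would obtain the vectors from the encoder described in this disclosure.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    text: str
    side: str      # "source" or "target"
    domain: str    # e.g. "math", "geography", "law"
    vector: tuple  # semantic vector (toy values here)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall(query, index, candidate_side, target_domain, min_similarity):
    """Recall non-target-domain entries at one end that are similar to the query."""
    return [
        e for e in index
        if e.domain != target_domain
        and e.side == candidate_side
        and cosine(query.vector, e.vector) >= min_similarity
    ]


def recall_all_directions(target_entries, index, target_domain, min_similarity):
    """Run all four (query end, candidate end) combinations."""
    results = {}
    for query in target_entries:
        for candidate_side in ("source", "target"):
            key = (query.side, candidate_side)
            results.setdefault(key, []).extend(
                recall(query, index, candidate_side, target_domain, min_similarity)
            )
    return results


# Toy data: target domain is mathematics, translation direction English to Chinese.
target_entries = [
    Entry("Matrix", "source", "math", (1.0, 0.0)),
    Entry("矩阵", "target", "math", (0.95, 0.1)),
]
index = [
    Entry("Matrix", "source", "geography", (0.9, 0.3)),
    Entry("基质", "target", "geography", (0.9, 0.3)),
    Entry("Contract", "source", "law", (0.0, 1.0)),
]
recalled = recall_all_directions(target_entries, index, "math", 0.9)
```

Keying the result by `(query end, candidate end)` keeps the four recall modes separate, so any subset of them can be enabled, in line with the "at least one of" wording.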
On this basis, in addition to pairs from source-end corpus of the target domain to target-end corpus of the target domain, the corpus samples also include pairs from source-end corpus of non-target domains to target-end corpus of non-target domains. Used as training data, these corpus samples provide the features and mapping relationships from source-end corpus to target-end corpus both for the target domain and for the non-target domains.
FIG. 3 is a schematic diagram of recalling corpus of the general domain according to Embodiment 2 of the present disclosure. As shown in FIG. 3, target semantic vectors are constructed for the corpus of the target domain, and semantic vectors are constructed for the corpus of the general domain to form a semantic vector space. By calculating the similarity between a target semantic vector and the semantic vectors in the semantic vector space, the candidate vector set similar to the target semantic vector is retrieved. Accordingly, corpus similar to the source-end corpus and the target-end corpus of the target domain is recalled from the general-domain corpus and, together with the corpus of the target domain, constitutes the corpus samples.
In the method for determining a corpus sample provided by Embodiment 2 of the present disclosure, uniformly encoding the corpus of the general domain allows the translation model to fully learn the features of corpus in different domains and at different ends, so that in practical applications it can be applied to any domain and supports any translation direction. Calculating the similarities and retrieving a candidate vector set composed of similar semantic vectors increases the size and diversity of the corpus samples and adds to them the features and mapping relationships from source-end corpus to target-end corpus in non-target domains, from which the translation model can fully learn the features of both the target-domain corpus samples and the recalled non-target-domain corpus samples. Obtaining the target semantic vector by encoding according to the features of the corpus within the target domain takes full account of the professional and specific nature of different domains, so that the translation model learns and distinguishes domain-specific features in a more targeted manner.
Embodiment 3
FIG. 4 is a schematic flowchart of a method for determining a corpus sample according to Embodiment 3 of the present disclosure. On the basis of the above embodiments, Embodiment 3 describes the process of determining the corpus samples and specifies how the source-end corpus and the target-end corpus in the corpus samples are determined.
Optionally, determining the candidate vector set according to the calculated similarities includes: based on a k-nearest neighbor (k-Nearest Neighbor) search algorithm, retrieving, among the constructed semantic vectors, a set number of semantic vectors with the highest similarity to the target semantic vector to form the candidate vector set. While ensuring the similarity between the target semantic vector and the candidate vector set, this increases the diversity of the corpus samples and adds to them the features and mapping relationships from source-end corpus to target-end corpus in non-target domains, so that the corpus samples have richer features and greater domain-specific training value.
Optionally, after determining the corpus samples corresponding to the target domain according to the target semantic vector and the candidate vector set, the method further includes: training a translation model according to the corpus samples, wherein the translation model is built from the source-end corpus and the target-end corpus of the general domain in the corpus. Using the corpus samples to train the translation model improves the professionalism and reliability of its translations in different domains. Because the corpus samples include corpus of both the target domain and non-target domains, the applicability of the translation model to any domain is also improved, without the need to select a separate set of corpus samples for training in each domain.
For details not fully described in this embodiment, refer to the above embodiments.
As shown in FIG. 4, the method for determining a corpus sample provided by Embodiment 3 of the present disclosure includes the following steps:
S310. Construct semantic vectors of the source-end corpus and the target-end corpus of the general domain in the corpus.
S320. Calculate the similarity between the target semantic vector and each of the constructed semantic vectors.
S330. Based on the nearest neighbor search algorithm, retrieve, among the constructed semantic vectors, a set number of semantic vectors with the highest similarity to the target semantic vector to form the candidate vector set.
In this embodiment, based on the nearest neighbor search algorithm, the one or more semantic vectors in the semantic vector space that are closest to the target semantic vector are regarded as similar semantic vectors, and the corpus corresponding to these neighboring semantic vectors is regarded as similar corpus and added to the corpus samples.
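The nearest neighbor retrieval described above can be illustrated with a brute-force search. This is only a sketch: it scans every vector and ranks by cosine similarity, whereas the disclosure does not specify the metric or the index structure used, and a large corpus would more likely use an approximate nearest neighbor index. The function names and toy vectors are assumptions introduced here.

```python
import heapq
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def k_nearest(query, vectors, k):
    """Return the keys of the k vectors most similar to the query, best first.

    `vectors` maps a corpus key to its semantic vector.
    """
    scored = ((cosine_similarity(query, vec), key) for key, vec in vectors.items())
    return [key for _, key in heapq.nlargest(k, scored)]


# Toy semantic vector space with three entries.
space = {"a": (1.0, 0.0), "b": (0.0, 1.0), "c": (0.8, 0.6)}
neighbors = k_nearest((1.0, 0.0), space, 2)
```

Here `k` plays the role of the "set number" of semantic vectors; `heapq.nlargest` avoids sorting the whole space when `k` is small relative to the number of vectors.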
Here, retrieving a candidate vector set similar to the target semantic vector of the target domain includes at least one of the following: retrieving source-end semantic vectors of a non-target domain that are similar to a source-end semantic vector of the target domain; retrieving target-end semantic vectors of a non-target domain that are similar to a source-end semantic vector of the target domain; retrieving source-end semantic vectors of a non-target domain that are similar to a target-end semantic vector of the target domain; and retrieving target-end semantic vectors of a non-target domain that are similar to a target-end semantic vector of the target domain.
S340. Determine the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.
In this embodiment, determining the corpus sample includes: taking, according to the target semantic vector and the retrieved candidate vector set, the corpus of the target domain together with the recalled general-domain corpus as the corpus sample; and determining the mapping relationship among the corpus in the corpus sample, that is, which corpus serves as pre-translation corpus and which serves as post-translation corpus when the translation model is trained.
Determining the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set includes: taking the corpus corresponding to the source-side semantic vector of the target domain as pre-translation corpus, and taking the corpus corresponding to the target-side semantic vector of the target domain that is similar to the source-side semantic vector of the target domain as post-translation corpus. It further includes at least one of the following: taking the corpus corresponding to a non-target-domain source-side semantic vector that is similar to the source-side semantic vector of the target domain as pre-translation corpus, and taking the corpus corresponding to a non-target-domain target-side semantic vector that is similar to the target-side semantic vector of the target domain as post-translation corpus; taking the corpus corresponding to a non-target-domain source-side semantic vector that is similar to the source-side semantic vector of the target domain as pre-translation corpus, and taking the corpus corresponding to a non-target-domain target-side semantic vector that is similar to the source-side semantic vector of the target domain as post-translation corpus; taking the corpus corresponding to a non-target-domain source-side semantic vector that is similar to the target-side semantic vector of the target domain as pre-translation corpus, and taking the corpus corresponding to a non-target-domain target-side semantic vector that is similar to the target-side semantic vector of the target domain as post-translation corpus; taking the corpus corresponding to a non-target-domain source-side semantic vector that is similar to the target-side semantic vector of the target domain as pre-translation corpus, and taking the corpus corresponding to a non-target-domain target-side semantic vector that is similar to the source-side semantic vector of the target domain as post-translation corpus.
For the corpus of the target domain, the source-side corpus of the target domain is used as pre-translation corpus, and the target-side corpus of the target domain is used as post-translation corpus.
For the corpus of a non-target domain, the non-target-domain source-side corpus is used as pre-translation corpus, and the non-target-domain target-side corpus is used as post-translation corpus. The non-target-domain source-side corpus may have been recalled using either the source-side corpus or the target-side corpus of the target domain; likewise, the non-target-domain target-side corpus may have been recalled using either the source-side corpus or the target-side corpus of the target domain.
For example, the target domain is mathematics and the translation direction is English to Chinese. The recalled non-target-domain corpus includes "Matrix" whose actual meaning is "基质" (substrate), "Matrix" whose actual meaning is "母体" (parent body), "基质", and "母体".
For the corpus of the target domain, "Matrix" whose actual meaning is "矩阵" (the mathematical matrix) is used as pre-translation corpus, and "矩阵" is used as post-translation corpus. For the corpus of the non-target domains, "Matrix" whose actual meaning is "基质" is used as pre-translation corpus and "基质" as post-translation corpus; "Matrix" whose actual meaning is "母体" is used as pre-translation corpus and "母体" as post-translation corpus.
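The "Matrix" example can be sketched as follows. This is only an illustration under stated assumptions: the `ParallelPair` structure, the `build_corpus_samples` function, the domain labels "biology" and "general", and the idea that each recalled entry keeps a link to its aligned counterpart are all hypothetical and are not specified by the disclosure.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ParallelPair:
    """One aligned entry of the parallel corpus (hypothetical structure)."""
    source: str  # source-side text, used as pre-translation corpus
    target: str  # target-side text, used as post-translation corpus
    domain: str


def build_corpus_samples(
    target_domain_pairs: Iterable[ParallelPair],
    general_pairs: List[ParallelPair],
    recalled: Iterable[Tuple[int, str]],
) -> List[Tuple[str, str]]:
    """Return (pre-translation, post-translation) training pairs.

    `recalled` holds (index into general_pairs, side) hits, where side is
    "source" or "target" depending on which end's semantic vector was
    retrieved. Whichever end was hit, the source side stays the input and
    the target side stays the output.
    """
    samples: List[Tuple[str, str]] = []
    seen = set()

    def add(pair: ParallelPair) -> None:
        key = (pair.source, pair.target)
        if key not in seen:  # a pair recalled from both ends is kept once
            seen.add(key)
            samples.append(key)

    for pair in target_domain_pairs:
        add(pair)
    for index, side in recalled:
        if side not in ("source", "target"):
            raise ValueError(f"unknown side: {side!r}")
        add(general_pairs[index])
    return samples


target_domain = [ParallelPair("Matrix", "矩阵", "mathematics")]
general = [
    ParallelPair("Matrix", "基质", "biology"),  # domain label is illustrative
    ParallelPair("Matrix", "母体", "general"),  # domain label is illustrative
]
# Each general-domain pair was recalled once from each end.
hits = [(0, "source"), (0, "target"), (1, "source"), (1, "target")]

for pre, post in build_corpus_samples(target_domain, general, hits):
    print(pre, "->", post)
```

Deduplicating on the (source, target) text pair is a design choice of this sketch; it keeps a pair that was recalled through several retrieval directions from being over-weighted in training.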
S350. Train the translation model according to the corpus sample, where the translation model is established from the general-domain source-side corpus and target-side corpus in the corpus.
In this embodiment, a cross-lingual general-domain translation model is first trained using the general-domain source-side corpus and target-side corpus in the corpus. This translation model is then trained with the corpus sample determined for the target domain, which adjusts its network parameters and achieves domain adaptation of the translation model. Training with the corpus sample uses not only the source-side and target-side corpus of the target domain as input samples and output samples, respectively, but also the recalled non-target-domain source-side and target-side corpus as input samples and output samples, respectively. The result is a more specialized translation model that can support accurate translation in any domain, any translation direction, and any language.
Exemplarily, the translation model includes a multi-layer semantic encoder and a single-layer semantic decoder. The encoder and decoder may be implemented with a recurrent neural network (RNN) architecture, such as a long short-term memory (LSTM) network or a gated recurrent unit (GRU), or with a Transformer model. The general-domain corpus of all languages and all translation directions in the corpus is trained on the same model.
FIG. 5 is a schematic diagram of a translation model provided in Embodiment 3 of the present disclosure. As shown in FIG. 5, the translation model includes an encoding network, configured to extract semantic features of the corpus (x_1, x_2, …, x_N) in the corpus sample, and a decoding network, configured to decode the semantic features, that is, to determine for each source-side corpus the target-side corpus with the highest similarity according to the semantic features of the multiple corpus, yielding the mapping relationship between source-side corpus and target-side corpus. In practical use, if corpus y_0 and y_1 to be translated are input, the translation model encodes y_0 and y_1 according to its encoding rules, and the decoding network decodes the features extracted by the encoding to find the corresponding corpus y_2 and y_3, respectively, as the translation results. The translation model is established from the general-domain corpus in the corpus, and its network parameters are adjusted by training on the corpus sample for the target domain, so it is more specialized, applicable to any professional domain, and more accurate in translation.
In the corpus sample determination method provided in Embodiment 3 of the present disclosure, the corpus of the target domain and the recalled general-domain corpus together form the corpus sample of the target domain. This adds to the corpus sample the features of, and the source-to-target mapping relationships in, the non-target-domain corpus, giving the corpus sample richer features and more specialized training value. The general-domain corpus is encoded into multilingual semantic vectors and a translation model is obtained by preliminary training; the corpus sample of the target domain is then used to train that translation model. This improves the specialization and reliability of the translation model for translation in different domains, and can support accurate translation in any domain, any translation direction, and any language.
Embodiment 4
FIG. 6 is a schematic structural diagram of a corpus sample determination apparatus provided in Embodiment 4 of the present disclosure. The apparatus is applicable to selecting corpus samples for machine translation in a specific domain, that is, selecting corpus samples from a corpus covering different domains in order to train a domain-specific translation model. The apparatus may be implemented in software and/or hardware and is generally integrated in an electronic device.
As shown in FIG. 6, the apparatus includes a construction module 410, a retrieval module 420, and a sample determination module 430.
The construction module 410 is configured to construct semantic vectors for the general-domain source-side corpus and target-side corpus in the corpus;
the retrieval module 420 is configured to retrieve, from the constructed semantic vectors, a candidate vector set similar to the target semantic vector of the target domain, where the candidate vector set includes semantic vectors of at least one side of at least one domain;
the sample determination module 430 is configured to determine the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.
In this embodiment, the construction module 410 constructs semantic vectors for the general-domain source-side corpus and target-side corpus in the corpus; the retrieval module 420 retrieves, from the constructed semantic vectors, a candidate vector set similar to the target semantic vector of the target domain, where the candidate vector set includes semantic vectors of at least one side of at least one domain; and the sample determination module 430 determines the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set. With this technical solution, the target semantic vector and the candidate vector set similar to it are used together to build the corpus sample corresponding to the target domain, which enlarges the corpus sample and increases its diversity.
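How the three modules might be wired together can be shown with a toy sketch. Everything here is hypothetical: the class name `CorpusSampleApparatus`, the injected `encode` and `similarity` callables, and the character-count "encoder" in the usage lines merely stand in for a real semantic encoder, which the disclosure does not specify at this level of detail.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


class CorpusSampleApparatus:
    """Toy stand-in for modules 410, 420 and 430 (all names hypothetical)."""

    def __init__(
        self,
        encode: Callable[[str], Vector],
        similarity: Callable[[Vector, Vector], float],
        top_k: int,
    ) -> None:
        self.encode = encode
        self.similarity = similarity
        self.top_k = top_k

    def construct(self, general_corpus: List[str]) -> List[Tuple[str, Vector]]:
        # Construction module 410: one semantic vector per corpus entry.
        return [(text, self.encode(text)) for text in general_corpus]

    def retrieve(
        self, target_vector: Vector, constructed: List[Tuple[str, Vector]]
    ) -> List[str]:
        # Retrieval module 420: the top_k entries most similar to the target.
        ranked = sorted(
            constructed,
            key=lambda item: self.similarity(target_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[: self.top_k]]

    def determine(self, target_texts: List[str], candidates: List[str]) -> List[str]:
        # Sample determination module 430: target-domain corpus plus recalls.
        samples = list(target_texts)
        for text in candidates:
            if text not in samples:
                samples.append(text)
        return samples

    def run(self, target_texts: List[str], general_corpus: List[str]) -> List[str]:
        constructed = self.construct(general_corpus)
        candidates: List[str] = []
        for text in target_texts:
            candidates.extend(self.retrieve(self.encode(text), constructed))
        return self.determine(target_texts, candidates)


def toy_encode(text: str) -> Vector:
    # Placeholder features only; a real system would use a semantic encoder.
    return [float(len(text)), float(text.count("a"))]


def neg_squared_distance(a: Vector, b: Vector) -> float:
    return -sum((x - y) ** 2 for x, y in zip(a, b))


apparatus = CorpusSampleApparatus(toy_encode, neg_squared_distance, top_k=2)
print(apparatus.run(["ananas"], ["banana", "kiwi", "papaya", "fig"]))
# -> ['ananas', 'banana', 'papaya']
```

Passing the encoder and the similarity function in as parameters keeps the module boundaries visible and lets either be swapped without touching the pipeline.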
On the basis of the above, the construction module 410 is configured to:
encode the general-domain source-side corpus and target-side corpus in the corpus separately, according to the semantics of the corpus and the domain to which it belongs, to obtain the semantic vectors corresponding to the general-domain source-side corpus and target-side corpus in the corpus.
On the basis of the above, the retrieval module 420 includes:
a calculation unit, configured to calculate the similarity between the target semantic vector and each of the constructed semantic vectors;
a set determination unit, configured to determine the candidate vector set according to the calculated similarities.
On the basis of the above, the set determination unit is configured to:
retrieve, based on a nearest neighbor search algorithm, a set number of semantic vectors with the highest similarity to the target semantic vector from the constructed semantic vectors, to form the candidate vector set.
On the basis of the above, the target semantic vector includes the source-side semantic vector of the target domain and the target-side semantic vector of the target domain;
retrieving the candidate vector set similar to the target semantic vector of the target domain includes at least one of the following:
retrieving a source-side semantic vector of a non-target domain that is similar to the source-side semantic vector of the target domain;
retrieving a target-side semantic vector of a non-target domain that is similar to the source-side semantic vector of the target domain;
retrieving a source-side semantic vector of a non-target domain that is similar to the target-side semantic vector of the target domain;
retrieving a target-side semantic vector of a non-target domain that is similar to the target-side semantic vector of the target domain.
On the basis of the above, the sample determination module 430 is configured to:
take the corpus corresponding to the source-side semantic vector of the target domain as pre-translation corpus, and take the corpus corresponding to the target-side semantic vector of the target domain that is similar to the source-side semantic vector of the target domain as post-translation corpus;
the sample determination module 430 is further configured to perform at least one of the following:
take the corpus corresponding to a non-target-domain source-side semantic vector that is similar to the source-side semantic vector of the target domain as pre-translation corpus, and take the corpus corresponding to a non-target-domain target-side semantic vector that is similar to the target-side semantic vector of the target domain as post-translation corpus;
take the corpus corresponding to a non-target-domain source-side semantic vector that is similar to the source-side semantic vector of the target domain as pre-translation corpus, and take the corpus corresponding to a non-target-domain target-side semantic vector that is similar to the source-side semantic vector of the target domain as post-translation corpus;
take the corpus corresponding to a non-target-domain source-side semantic vector that is similar to the target-side semantic vector of the target domain as pre-translation corpus, and take the corpus corresponding to a non-target-domain target-side semantic vector that is similar to the target-side semantic vector of the target domain as post-translation corpus;
take the corpus corresponding to a non-target-domain source-side semantic vector that is similar to the target-side semantic vector of the target domain as pre-translation corpus, and take the corpus corresponding to a non-target-domain target-side semantic vector that is similar to the source-side semantic vector of the target domain as post-translation corpus.
On the basis of the above, the apparatus further includes:
an encoding module, configured to, before the candidate vector set similar to the target semantic vector of the target domain is retrieved, encode the source-side corpus and target-side corpus of the target domain according to the semantics of the corpus to obtain the target semantic vector.
On the basis of the above, the apparatus further includes:
a training module, configured to, after the corpus sample corresponding to the target domain is determined according to the target semantic vector and the candidate vector set, train a translation model according to the corpus sample, where the translation model is established from the general-domain source-side corpus and target-side corpus in the corpus.
The above corpus sample determination apparatus can execute the corpus sample determination method provided in any embodiment of the present disclosure, and has the functional modules and effects corresponding to the executed method.
Embodiment 5
FIG. 7 is a schematic structural diagram of an electronic device provided in Embodiment 5 of the present disclosure, showing an electronic device 600 suitable for implementing the embodiments of the present disclosure. The electronic device 600 in the embodiments of the present disclosure includes a notebook computer, a tablet computer, a desktop computer, a server, and the like. The electronic device 600 shown in FIG. 7 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 7, the electronic device 600 may include one or more processing devices (such as a central processing unit or a graphics processor) 601. The processing device 601 may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The one or more processing devices 601 implement the method provided by the present disclosure. The RAM 603 also stores various programs and data required for the operation of the electronic device 600. The processing device 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following devices may be connected to the I/O interface 605: an input device 606 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, or a gyroscope; an output device 607 including, for example, a liquid crystal display (LCD), a speaker, or a vibrator; a storage device 608 including, for example, a magnetic tape or a hard disk, the storage device 608 being configured to store one or more programs; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 7 shows an electronic device 600 with multiple devices, it is not required that all of the illustrated devices be implemented or present. More or fewer devices may alternatively be implemented or present.
According to the embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, the embodiments of the present disclosure include a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 609, installed from the storage device 608, or installed from the ROM 602. When the computer program is executed by the processing device 601, the above functions defined in the method of the embodiments of the present disclosure are performed.
The computer-readable medium of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. A computer-readable storage medium may include: an electrical connection with one or more wires, a portable computer disk, a hard disk, RAM, ROM, erasable programmable read-only memory (EPROM), flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave and carrying computer-readable program code. Such a propagated data signal may take many forms, including an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and it may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted over any suitable medium, including an electric wire, an optical cable, radio frequency (RF), or any suitable combination of the above. The storage medium may be a non-transitory storage medium.
The above computer-readable medium may be included in the electronic device 600, or it may exist separately without being assembled into the electronic device 600.
The above computer-readable medium carries one or more programs. When the one or more programs are executed by the electronic device, they cause the electronic device 600 to: construct semantic vectors for the general-domain source-side corpus and target-side corpus in the corpus; retrieve, from the constructed semantic vectors, a candidate vector set similar to the target semantic vector of the target domain, where the candidate vector set includes semantic vectors of at least one side of at least one domain; and determine the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. Each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. Each block in the block diagrams and/or flowcharts, and each combination of such blocks, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a module does not constitute a limitation on the module itself.
The functions described above may be performed, at least in part, by one or more hardware logic components. For example, exemplary types of hardware logic components that may be used include: a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), an application specific standard product (ASSP), a system on chip (SOC), a complex programmable logic device (CPLD), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. A machine-readable storage medium includes an electrical connection based on one or more wires, a portable computer disk, a hard disk, RAM, ROM, EPROM, flash memory, an optical fiber, a portable CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the above.
根据本公开的一个或多个实施例,示例1提供了一种语料样本确定方法,包括:According to one or more embodiments of the present disclosure, Example 1 provides a method for determining a corpus sample, including:
构建语料库中通用领域的源端语料以及目标端语料的语义向量;Construct the source corpus of the general domain and the semantic vector of the target corpus in the corpus;
在构建的语义向量中,检索与目标领域的目标语义向量相似的候选向量集合,其中,所述候选向量集合包括至少一个领域的至少一端的语义向量;In the constructed semantic vector, retrieve a candidate vector set similar to the target semantic vector of the target domain, wherein the candidate vector set includes the semantic vector of at least one end of at least one domain;
根据所述目标语义向量和所述候选向量集合确定目标领域对应的语料样本。A corpus sample corresponding to the target domain is determined according to the target semantic vector and the candidate vector set.
根据本公开的一个或多个实施例,示例2根据示例1所述的方法,构建语料库中通用领域的源端语料以及目标端语料的语义向量,包括:According to one or more embodiments of the present disclosure, Example 2, according to the method described in Example 1, constructs the semantic vector of the source-end corpus and the target-end corpus of the general domain in the corpus, including:
根据语料的语义以及所属领域,分别对所述语料库中通用领域的源端语料以及目标端语料进行编码,得到语料库中通用领域的源端语料以及目标端语料对应的语义向量。According to the semantics of the corpus and the domain to which it belongs, the source corpus and the target corpus of the general domain in the corpus are encoded respectively, and the semantic vectors corresponding to the source corpus and the target corpus of the general domain in the corpus are obtained.
According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 1, wherein retrieving, from the constructed semantic vectors, the candidate vector set similar to the target semantic vector of the target domain includes:
calculating the similarity between the target semantic vector and each of the plurality of constructed semantic vectors; and
determining the candidate vector set according to the plurality of similarities.
According to one or more embodiments of the present disclosure, Example 4 provides the method of Example 3, wherein determining the candidate vector set according to the plurality of similarities includes:
retrieving, based on a nearest neighbor search algorithm, a set number of semantic vectors having the highest similarity to the target semantic vector from the constructed semantic vectors, to form the candidate vector set.
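The similarity computation of Example 3 and the top-k retrieval of Example 4 can be sketched as follows. This is only an illustration under stated assumptions: cosine similarity is assumed as the similarity measure, the search is an exact brute-force scan rather than any particular nearest neighbor index, and the names `cosine_similarity`, `retrieve_candidates`, and the sentence ids are hypothetical.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve_candidates(target_vector, constructed_vectors, k):
    """Return the ids of the k constructed vectors most similar to target_vector.

    constructed_vectors maps a (hypothetical) sentence id to its semantic vector.
    """
    best = heapq.nlargest(
        k,
        constructed_vectors.items(),
        key=lambda item: cosine_similarity(target_vector, item[1]),
    )
    return [sentence_id for sentence_id, _ in best]


# Toy general-domain vectors; real semantic vectors would come from an encoder.
general_vectors = {"g1": [1.0, 0.0], "g2": [0.9, 0.1], "g3": [0.0, 1.0]}
candidates = retrieve_candidates([1.0, 0.0], general_vectors, k=2)
```

For a large corpus, the exact scan would typically be swapped for an approximate nearest neighbor index; the set number `k` plays the same role either way.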
According to one or more embodiments of the present disclosure, Example 5 provides the method of Example 1, wherein
the target semantic vector includes a source-side semantic vector of the target domain and a target-side semantic vector of the target domain; and
retrieving the candidate vector set similar to the target semantic vector of the target domain includes at least one of the following:
retrieving source-side semantic vectors of a non-target domain that are similar to the source-side semantic vector of the target domain;
retrieving target-side semantic vectors of a non-target domain that are similar to the source-side semantic vector of the target domain;
retrieving source-side semantic vectors of a non-target domain that are similar to the target-side semantic vector of the target domain;
retrieving target-side semantic vectors of a non-target domain that are similar to the target-side semantic vector of the target domain.
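The four retrieval directions listed in Example 5 can be sketched as below. The sketch assumes that source-side and target-side vectors live in a shared embedding space so that cross-side comparison is meaningful, and it assumes cosine similarity; both are assumptions made for illustration. The function names and the key labels (read as "query side -> retrieved side") are hypothetical. Example 5 requires only at least one of the four retrievals; the sketch runs all four for clarity.

```python
import math


def _cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, pool, k):
    """Ids of the k vectors in pool (id -> vector) most similar to query."""
    ranked = sorted(pool, key=lambda vid: _cosine(query, pool[vid]), reverse=True)
    return ranked[:k]


def retrieve_four_ways(domain_src_vec, domain_tgt_vec, other_src, other_tgt, k):
    """Run the four retrievals of Example 5 against non-target-domain vectors.

    domain_src_vec / domain_tgt_vec: target-domain source-side / target-side vector.
    other_src / other_tgt: non-target-domain source-side / target-side pools.
    """
    return {
        "src->src": top_k(domain_src_vec, other_src, k),
        "src->tgt": top_k(domain_src_vec, other_tgt, k),
        "tgt->src": top_k(domain_tgt_vec, other_src, k),
        "tgt->tgt": top_k(domain_tgt_vec, other_tgt, k),
    }


# Toy vectors for illustration only.
other_src = {"s1": [1.0, 0.0], "s2": [0.0, 1.0]}
other_tgt = {"t1": [1.0, 0.1], "t2": [0.1, 1.0]}
retrieved = retrieve_four_ways([1.0, 0.0], [0.0, 1.0], other_src, other_tgt, k=1)
```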
According to one or more embodiments of the present disclosure, Example 6 provides the method of Example 5, wherein
determining the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set includes:
taking the corpus corresponding to the source-side semantic vector of the target domain as the pre-translation corpus, and taking the corpus corresponding to a target-side semantic vector of the target domain that is similar to that source-side semantic vector as the post-translation corpus;
and further includes at least one of the following:
taking the corpus corresponding to non-target-domain source-side semantic vectors similar to the source-side semantic vector of the target domain as the pre-translation corpus, and taking the corpus corresponding to non-target-domain target-side semantic vectors similar to the target-side semantic vector of the target domain as the post-translation corpus;
taking the corpus corresponding to non-target-domain source-side semantic vectors similar to the source-side semantic vector of the target domain as the pre-translation corpus, and taking the corpus corresponding to non-target-domain target-side semantic vectors similar to the source-side semantic vector of the target domain as the post-translation corpus;
taking the corpus corresponding to non-target-domain source-side semantic vectors similar to the target-side semantic vector of the target domain as the pre-translation corpus, and taking the corpus corresponding to non-target-domain target-side semantic vectors similar to the target-side semantic vector of the target domain as the post-translation corpus;
taking the corpus corresponding to non-target-domain source-side semantic vectors similar to the target-side semantic vector of the target domain as the pre-translation corpus, and taking the corpus corresponding to non-target-domain target-side semantic vectors similar to the source-side semantic vector of the target domain as the post-translation corpus.
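The four optional combinations in Example 6 differ only in which retrieval supplies the pre-translation side and which supplies the post-translation side. A minimal sketch of that bookkeeping follows. The option labels A to D, the retrieval keys (read as "query side -> retrieved side", for example `"tgt->src"` means non-target-domain source-side vectors retrieved with the target-domain target-side vector), and the function name are hypothetical. How individual pre- and post-translation sentences are aligned into pairs is not specified in this passage, so the sketch only collects the two sides.

```python
# (retrieval used for the pre-translation side, retrieval used for the post-translation side)
OPTIONS = {
    "A": ("src->src", "tgt->tgt"),
    "B": ("src->src", "src->tgt"),
    "C": ("tgt->src", "tgt->tgt"),
    "D": ("tgt->src", "src->tgt"),
}


def build_samples(retrieved, source_text, target_text, options=("A",)):
    """Collect pre- and post-translation corpus for each selected option.

    retrieved: retrieval key -> list of retrieved sentence ids.
    source_text / target_text: id -> sentence for the source / target side.
    """
    samples = {}
    for name in options:
        pre_key, post_key = OPTIONS[name]
        samples[name] = {
            "pre_translation": [source_text[i] for i in retrieved[pre_key]],
            "post_translation": [target_text[i] for i in retrieved[post_key]],
        }
    return samples


# Toy data for illustration only.
retrieved = {
    "src->src": ["s1"], "src->tgt": ["t1"],
    "tgt->src": ["s2"], "tgt->tgt": ["t2"],
}
source_text = {"s1": "source sentence 1", "s2": "source sentence 2"}
target_text = {"t1": "target sentence 1", "t2": "target sentence 2"}
samples = build_samples(retrieved, source_text, target_text, options=("A", "D"))
```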
According to one or more embodiments of the present disclosure, Example 7 provides the method of Example 1, which, before retrieving the candidate vector set similar to the target semantic vector of the target domain, further includes:
encoding the source-side corpus and the target-side corpus of the target domain according to the semantics of the corpus, to obtain the target semantic vector.
According to one or more embodiments of the present disclosure, Example 8 provides the method of Example 1, which,
after determining the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set, further includes:
training a translation model according to the corpus sample, wherein the translation model is established according to the source-side corpus and the target-side corpus of the general domain in the corpus.
According to one or more embodiments of the present disclosure, Example 9 provides an apparatus for determining a corpus sample, including:
a construction module, configured to construct semantic vectors of the source-side corpus and the target-side corpus of the general domain in a corpus;
a retrieval module, configured to retrieve, from the constructed semantic vectors, a candidate vector set similar to a target semantic vector of a target domain, wherein the candidate vector set includes semantic vectors of at least one side of at least one domain; and
a sample determination module, configured to determine a corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.
According to one or more embodiments of the present disclosure, Example 10 provides an electronic device, including:
one or more processing devices; and
a storage device, configured to store one or more programs;
wherein the one or more programs, when executed by the one or more processing devices, cause the one or more processing devices to implement the method of any one of Examples 1-8.
According to one or more embodiments of the present disclosure, Example 11 provides a computer-readable medium storing a computer program that, when executed by a processing device, implements the method of any one of Examples 1-8.

Claims (11)

  1. A method for determining a corpus sample, comprising:
    constructing semantic vectors of a source-side corpus and a target-side corpus of a general domain in a corpus;
    retrieving, from the constructed semantic vectors, a candidate vector set similar to a target semantic vector of a target domain, wherein the candidate vector set comprises semantic vectors of at least one side of at least one domain; and
    determining a corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.
  2. The method according to claim 1, wherein constructing the semantic vectors of the source-side corpus and the target-side corpus of the general domain in the corpus comprises:
    encoding the source-side corpus and the target-side corpus of the general domain in the corpus separately, according to the semantics of each corpus entry and the domain to which it belongs, to obtain the semantic vectors corresponding to the source-side corpus and the target-side corpus of the general domain in the corpus.
  3. The method according to claim 1, wherein retrieving, from the constructed semantic vectors, the candidate vector set similar to the target semantic vector of the target domain comprises:
    calculating the similarity between the target semantic vector and each of the plurality of constructed semantic vectors; and
    determining the candidate vector set according to the plurality of similarities.
  4. The method according to claim 3, wherein determining the candidate vector set according to the plurality of similarities comprises:
    retrieving, based on a nearest neighbor search algorithm, a set number of semantic vectors having the highest similarity to the target semantic vector from the constructed semantic vectors, to form the candidate vector set.
  5. The method according to claim 1, wherein the target semantic vector comprises a source-side semantic vector of the target domain and a target-side semantic vector of the target domain; and
    retrieving the candidate vector set similar to the target semantic vector of the target domain comprises at least one of the following:
    retrieving source-side semantic vectors of a non-target domain that are similar to the source-side semantic vector of the target domain;
    retrieving target-side semantic vectors of a non-target domain that are similar to the source-side semantic vector of the target domain;
    retrieving source-side semantic vectors of a non-target domain that are similar to the target-side semantic vector of the target domain;
    retrieving target-side semantic vectors of a non-target domain that are similar to the target-side semantic vector of the target domain.
  6. The method according to claim 5, wherein determining the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set comprises:
    taking the corpus corresponding to the source-side semantic vector of the target domain as the pre-translation corpus, and taking the corpus corresponding to a target-side semantic vector of the target domain that is similar to that source-side semantic vector as the post-translation corpus;
    and further comprises at least one of the following:
    taking the corpus corresponding to non-target-domain source-side semantic vectors similar to the source-side semantic vector of the target domain as the pre-translation corpus, and taking the corpus corresponding to non-target-domain target-side semantic vectors similar to the target-side semantic vector of the target domain as the post-translation corpus;
    taking the corpus corresponding to non-target-domain source-side semantic vectors similar to the source-side semantic vector of the target domain as the pre-translation corpus, and taking the corpus corresponding to non-target-domain target-side semantic vectors similar to the source-side semantic vector of the target domain as the post-translation corpus;
    taking the corpus corresponding to non-target-domain source-side semantic vectors similar to the target-side semantic vector of the target domain as the pre-translation corpus, and taking the corpus corresponding to non-target-domain target-side semantic vectors similar to the target-side semantic vector of the target domain as the post-translation corpus;
    taking the corpus corresponding to non-target-domain source-side semantic vectors similar to the target-side semantic vector of the target domain as the pre-translation corpus, and taking the corpus corresponding to non-target-domain target-side semantic vectors similar to the source-side semantic vector of the target domain as the post-translation corpus.
  7. The method according to claim 1, further comprising, before retrieving the candidate vector set similar to the target semantic vector of the target domain:
    encoding the source-side corpus and the target-side corpus of the target domain according to the semantics of the corpus, to obtain the target semantic vector.
  8. The method according to claim 1, further comprising, after determining the corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set:
    training a translation model according to the corpus sample, wherein the translation model is established according to the source-side corpus and the target-side corpus of the general domain in the corpus.
  9. An apparatus for determining a corpus sample, comprising:
    a construction module, configured to construct semantic vectors of a source-side corpus and a target-side corpus of a general domain in a corpus;
    a retrieval module, configured to retrieve, from the constructed semantic vectors, a candidate vector set similar to a target semantic vector of a target domain, wherein the candidate vector set comprises semantic vectors of at least one side of at least one domain; and
    a sample determination module, configured to determine a corpus sample corresponding to the target domain according to the target semantic vector and the candidate vector set.
  10. An electronic device, comprising:
    one or more processors; and
    a storage device, configured to store one or more programs;
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for determining a corpus sample according to any one of claims 1-8.
  11. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method for determining a corpus sample according to any one of claims 1-8.
PCT/CN2021/134269 2020-12-23 2021-11-30 Corpus sample determination method and apparatus, electronic device, and storage medium WO2022135080A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011538595.8 2020-12-23
CN202011538595.8A CN112668339A (en) 2020-12-23 2020-12-23 Corpus sample determination method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022135080A1 true WO2022135080A1 (en) 2022-06-30

Family

ID=75408449

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/134269 WO2022135080A1 (en) 2020-12-23 2021-11-30 Corpus sample determination method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN112668339A (en)
WO (1) WO2022135080A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501829A (en) * 2023-06-29 2023-07-28 北京法伯宏业科技发展有限公司 Data management method and system based on artificial intelligence large language model platform

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112668339A (en) * 2020-12-23 2021-04-16 北京有竹居网络技术有限公司 Corpus sample determination method and device, electronic equipment and storage medium
CN113378583A (en) * 2021-07-15 2021-09-10 北京小米移动软件有限公司 Dialogue reply method and device, dialogue model training method and device, and storage medium
CN114817517B (en) * 2022-05-30 2022-12-20 北京海天瑞声科技股份有限公司 Corpus acquisition method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577399A (en) * 2013-11-05 2014-02-12 北京百度网讯科技有限公司 Method and device for extension of data in bilingual corpuses
CN110334197A (en) * 2019-06-28 2019-10-15 科大讯飞股份有限公司 Corpus processing method and relevant apparatus
CN111611374A (en) * 2019-02-25 2020-09-01 北京嘀嘀无限科技发展有限公司 Corpus expansion method and device, electronic equipment and storage medium
CN112101026A (en) * 2019-06-18 2020-12-18 掌阅科技股份有限公司 Corpus sample set construction method, computing device and computer storage medium
CN112541076A (en) * 2020-11-09 2021-03-23 北京百度网讯科技有限公司 Method and device for generating extended corpus of target field and electronic equipment
CN112668339A (en) * 2020-12-23 2021-04-16 北京有竹居网络技术有限公司 Corpus sample determination method and device, electronic equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501829A (en) * 2023-06-29 2023-07-28 北京法伯宏业科技发展有限公司 Data management method and system based on artificial intelligence large language model platform
CN116501829B (en) * 2023-06-29 2023-09-19 北京法伯宏业科技发展有限公司 Data management method and system based on artificial intelligence large language model platform

Also Published As

Publication number Publication date
CN112668339A (en) 2021-04-16

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21909079; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 21909079; Country of ref document: EP; Kind code of ref document: A1)