CN109670190B - Translation model construction method and device - Google Patents

Info

Publication number
CN109670190B
CN109670190B (application CN201811590009.7A)
Authority
CN
China
Prior art keywords
language
translation
model
material set
translation word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811590009.7A
Other languages
Chinese (zh)
Other versions
CN109670190A (en)
Inventor
朱晓宁 (Zhu Xiaoning)
张睿卿 (Zhang Ruiqing)
何中军 (He Zhongjun)
吴华 (Wu Hua)
王海峰 (Wang Haifeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201811590009.7A priority Critical patent/CN109670190B/en
Publication of CN109670190A publication Critical patent/CN109670190A/en
Application granted granted Critical
Publication of CN109670190B publication Critical patent/CN109670190B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a translation model construction method and device. The method comprises the following steps: when the number of translation word pairs in a first positive example corpus is smaller than a threshold value, a negative example corpus is randomly generated from the translation word pairs in the acquired first positive example corpus, where the translation word pairs in the first positive example corpus and the negative example corpus each comprise a source-language word and a corresponding target-language word; machine learning is performed on the first positive example corpus and the negative example corpus to generate a classification model; and the classification model is used to prune a preset translation model to generate a translation model corresponding to the source language and the target language. Thus, when bilingual corpus for the source language and the target language is scarce, a classification model is obtained from the available source-target translation word pairs, and the source-target translation model obtained by way of a reference language is filtered through the classification model, so that noise in the translation model is greatly reduced and its translation quality is improved.

Description

Translation model construction method and device
Technical Field
The present disclosure relates to the field of machine translation technologies, and in particular, to a method and an apparatus for constructing a translation model.
Background
When constructing a translation model, a large-scale bilingual corpus is generally used for training in order to achieve high translation quality. However, for low-resource language pairs it is difficult to obtain a large-scale bilingual corpus, and a translation model trained on a small-scale bilingual corpus is of low quality.
Disclosure of Invention
The application provides a translation model construction method and device, which address the problem that a translation model trained on a small-scale bilingual corpus has low translation quality.
In one aspect, an embodiment of the present application provides a method for constructing a translation model, including:
when the number of translation word pairs in the first positive example corpus is smaller than a threshold value, randomly generating a negative example corpus from the translation word pairs in the acquired first positive example corpus, where the translation word pairs in the first positive example corpus and the negative example corpus each comprise a source-language word and a corresponding target-language word;
performing machine learning on the first positive example corpus and the negative example corpus to generate a classification model;
pruning a preset translation model by using the classification model to generate a translation model corresponding to the source language and the target language;
The preset translation model is obtained by fusing a first translation model and a second translation model, where the first translation model is trained on a second positive example corpus comprising the source language and a reference language, and the second translation model is trained on a third positive example corpus comprising the reference language and the target language.
According to the translation model construction method, when the number of translation word pairs in the first positive example corpus is smaller than a threshold value, a negative example corpus is randomly generated from the translation word pairs in the acquired first positive example corpus, where the translation word pairs in both corpora each comprise a source-language word and a corresponding target-language word. Machine learning is performed on the first positive example corpus and the negative example corpus to generate a classification model, and the classification model is used to prune a preset translation model to generate a translation model corresponding to the source language and the target language. The preset translation model is obtained by fusing a first translation model, trained on a second positive example corpus comprising the source language and a reference language, with a second translation model, trained on a third positive example corpus comprising the reference language and the target language. Therefore, when bilingual corpus for the source language and the target language is scarce, a classification model is obtained from the source-target translation word pairs, and the source-target translation model obtained by way of the reference language is filtered through the classification model, so that noise in the translation model is greatly reduced and its translation quality is improved.
Another embodiment of the present application proposes a translation model building device, including:
the first generation module is configured to, when the number of translation word pairs in the first positive example corpus is smaller than a threshold value, randomly generate a negative example corpus from the translation word pairs in the acquired first positive example corpus, where the translation word pairs in the first positive example corpus and the negative example corpus each comprise a source-language word and a corresponding target-language word;
the second generation module is configured to perform machine learning on the first positive example corpus and the negative example corpus to generate a classification model;
the third generation module is configured to prune a preset translation model by using the classification model to generate a translation model corresponding to the source language and the target language;
the preset translation model is obtained by fusing a first translation model and a second translation model, where the first translation model is trained on a second positive example corpus comprising the source language and a reference language, and the second translation model is trained on a third positive example corpus comprising the reference language and the target language.
According to the translation model construction device, when the number of translation word pairs in the first positive example corpus is smaller than a threshold value, a negative example corpus is randomly generated from the translation word pairs in the acquired first positive example corpus, where the translation word pairs in both corpora each comprise a source-language word and a corresponding target-language word. Machine learning is performed on the first positive example corpus and the negative example corpus to generate a classification model, and the classification model is used to prune a preset translation model to generate a translation model corresponding to the source language and the target language. The preset translation model is obtained by fusing a first translation model, trained on a second positive example corpus comprising the source language and a reference language, with a second translation model, trained on a third positive example corpus comprising the reference language and the target language. Therefore, when bilingual corpus for the source language and the target language is scarce, a classification model is obtained from the source-target translation word pairs, and the source-target translation model obtained by way of the reference language is filtered through the classification model, so that noise in the translation model is greatly reduced and its translation quality is improved.
Another embodiment of the present application provides a computer device comprising a processor and a memory;
wherein the processor, by reading the executable program code stored in the memory, runs a program corresponding to the executable program code, so as to implement the translation model construction method described in the embodiments of the above aspect.
Another embodiment of the present application proposes a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the translation model construction method described in the embodiments of the above aspect.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic flow chart of a method for constructing a translation model according to an embodiment of the present application;
FIG. 2 is a flowchart of another method for constructing a translation model according to an embodiment of the present application;
FIG. 3 is a flowchart of another method for constructing a translation model according to an embodiment of the present application;
FIG. 4 is a flowchart of another method for constructing a translation model according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a translation model building device according to an embodiment of the present application;
fig. 6 illustrates a block diagram of an exemplary computer device suitable for use in implementing embodiments of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present application and are not to be construed as limiting the present application.
Translation model construction methods and apparatuses according to embodiments of the present application are described below with reference to the accompanying drawings.
To address the problem in the related art that, for low-resource language pairs with little corpus, a translation model trained directly on the bilingual corpus is of low quality, the embodiments of the present application provide a translation model construction method.
In this method, when bilingual corpus for the source language and the target language is scarce, a classification model is obtained from the source-target translation word pairs, and the source-target translation model obtained by way of a reference language is filtered through the classification model, so that noise in the translation model is greatly reduced and its translation quality is improved.
Fig. 1 is a schematic flow chart of a method for constructing a translation model according to an embodiment of the present application.
The translation model construction method of the embodiments of the present application may be executed by the translation model construction device provided by the embodiments of the present application, so that a classification model obtained from source-target translation word pairs is used to filter the source-target translation model obtained by way of a reference language, thereby improving the translation quality of the translation model.
As shown in fig. 1, the translation model construction method includes:
and step 101, when the number of the translation word pairs in the first positive example language material set is smaller than a threshold value, randomly generating a negative example language material set according to each translation word pair in the acquired first positive example language material set.
When a translation model is built, a large-scale bilingual corpus is usually used for training. For a low-resource language pair such as Chinese-Japanese, however, Chinese-Japanese corpus is scarce, and a Chinese-Japanese translation model trained on a small amount of such corpus has relatively low translation quality.
In the present application, when corpus for the source language and the target language is scarce, a preset translation model of the source language and the target language can be obtained by way of a reference language, and that model is then filtered using the source-target corpus to obtain the final source-target translation model.
Specifically, when the number of translation word pairs in the first positive example corpus is smaller than the threshold value, a negative example corpus may be randomly generated from the translation word pairs in the first positive example corpus. The translation word pairs in the first positive example corpus and the negative example corpus each comprise a source-language word and a corresponding target-language word; that is, the translation word pairs in both corpora are word pairs of the source language and the target language.
In this embodiment, the translation word pairs in the positive example corpus are correct inter-translation word pairs, while the translation word pairs in the negative example corpus are not inter-translation word pairs.
When the negative example corpus is randomly generated from the translation word pairs in the first positive example corpus, the source-language words or the target-language words of different translation word pairs in the first positive example corpus are randomly interchanged to produce the negative example corpus.
For example, suppose the source language is Chinese and the target language is Japanese. Interchanging the source-language words (or the target-language words) of two correct translation word pairs in the first positive example corpus yields two new word pairs whose words are no longer translations of each other; these can be used as translation word pairs in the negative example corpus.
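A minimal sketch of this random-interchange construction of negative examples (the function name, data layout, and toy word pairs are illustrative assumptions, not the patent's implementation):

```python
import random

def generate_negative_corpus(positive_pairs, seed=42):
    """Build a negative example corpus by randomly recombining the
    source-language and target-language words of the positive pairs;
    any recombination that is not itself a positive pair is, by
    construction, not an inter-translation word pair."""
    rng = random.Random(seed)                 # seeded for reproducibility
    sources = [s for s, _ in positive_pairs]
    targets = [t for _, t in positive_pairs]
    positives = set(positive_pairs)
    negatives = set()
    while len(negatives) < len(positive_pairs):
        pair = (rng.choice(sources), rng.choice(targets))
        if pair not in positives:             # keep only wrong pairings
            negatives.add(pair)
    return sorted(negatives)

# toy Chinese-Japanese word pairs (illustrative only)
positive = [("银行", "銀行"), ("河岸", "河原"), ("学校", "学校")]
negative = generate_negative_corpus(positive)
```

Seeding the generator keeps the negative corpus reproducible across runs, which matters when the classification model is retrained and compared.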
Step 102, machine learning is performed on the first positive example corpus and the negative example corpus to generate a classification model.
In this embodiment, machine learning may be performed using pairs of translated words in the first positive example material set and the negative example material set, and a classification model may be generated. The classification model is used for judging whether the translation word pairs of the source language and the target language are legal word pairs, namely, whether the translation word pairs are word pairs with an inter-translation relationship.
When machine learning is performed, classification training can be carried out with algorithms such as support vector machines, decision trees, Bayesian classifiers, or K-nearest neighbors to obtain the classification model.
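Any binary classifier over the pair features fits the role described above. The sketch below uses a tiny logistic-regression stand-in (chosen only because it fits in a few lines; the patent itself names SVM, decision tree, Bayesian, and K-nearest-neighbor algorithms):

```python
import math

def train_classifier(examples, labels, epochs=500, lr=0.1):
    """Train a minimal logistic-regression classifier by per-example
    gradient descent on the positive and negative example features."""
    dim = len(examples[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - y                        # gradient of the log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def legal_pair_probability(model, features):
    """Probability that a translation word pair is a legal
    (inter-translating) word pair, as output by the classifier."""
    w, b = model
    z = sum(wi * xi for wi, xi in zip(w, features)) + b
    return 1.0 / (1.0 + math.exp(-z))

# toy feature vectors: [length ratio, inter-translation probability]
positive = [[1.0, 0.90], [0.8, 0.80], [1.2, 0.95]]  # correct pairs
negative = [[3.0, 0.05], [0.2, 0.10], [2.5, 0.00]]  # swapped pairs
model = train_classifier(positive + negative, [1, 1, 1, 0, 0, 0])
```

The trained model then scores each candidate pair, and the score drives the pruning described in step 103.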
Step 103, the preset translation model is pruned by using the classification model to generate a translation model corresponding to the source language and the target language.
In this embodiment, the preset translation model is a translation model of the source language and the target language obtained by means of the reference language. Wherein the reference language may be regarded as a bridge between the source language and the target language.
Specifically, the preset translation model is obtained by fusing a first translation model of a source language and a reference language with a second translation model of the reference language and a target language. The first translation model is trained by using a second positive example language material set comprising a source language and a reference language, and the second translation model is trained by using a third positive example language material set comprising the reference language and a target language.
It will be appreciated that the translation word pairs in the second positive example corpus are pairs of the source language and the reference language, and the translation word pairs in the third positive example corpus are pairs of the reference language and the target language.
Taking the construction of a Chinese-Japanese translation model as an example: if the number of acquired Chinese-Japanese translation word pairs is smaller than the threshold value, large-scale Chinese-English and English-Japanese translation word pairs can be acquired instead. A Chinese-English translation model trained on the Chinese-English word pairs is then fused with an English-Japanese translation model trained on the English-Japanese word pairs to obtain the preset Chinese-Japanese translation model.
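The patent does not spell out the fusion formula. One common way to fuse a source-pivot model with a pivot-target model is to marginalize translation probabilities over the pivot, P(t|s) = sum over p of P(p|s) * P(t|p). A sketch under that assumption, with invented toy probabilities:

```python
def fuse_via_pivot(src2pivot, pivot2tgt):
    """Compose two phrase tables through the pivot (reference) language:
    P(t|s) = sum_p P(p|s) * P(t|p).  This marginalization is one common
    pivot-translation choice, not necessarily the patent's formula."""
    fused = {}
    for (s, p), prob1 in src2pivot.items():
        for (p2, t), prob2 in pivot2tgt.items():
            if p == p2:
                fused[(s, t)] = fused.get((s, t), 0.0) + prob1 * prob2
    return fused

# Chinese words for "bank" and "riverbank" both map to English "bank"
zh_en = {("银行", "bank"): 1.0, ("河岸", "bank"): 1.0}
# English "bank" maps to two different Japanese words
en_ja = {("bank", "銀行"): 0.6, ("bank", "河原"): 0.4}
zh_ja = fuse_via_pivot(zh_en, en_ja)
# every Chinese word now maps to every Japanese word: exactly the kind
# of noise the classification model is later used to prune away
```

The toy output makes the ambiguity problem of the next paragraph concrete: fusion manufactures pairs that are not mutual translations.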
There may be ambiguity in the reference language. For example, the Chinese words for "bank" (the financial institution) and "riverbank" may both be translated into the English word "bank", while English "bank" may in turn be translated into several different Japanese words. Noise is therefore introduced when the Chinese-English and English-Japanese translation models are fused: fusion may produce a Chinese-Japanese translation word pair linking Chinese "bank" with the Japanese word for "riverbank", even though the two have no inter-translation relationship.
Although the number of source-target translation word pairs is smaller than the threshold value, those pairs are of high inter-translation quality. To improve the translation quality of the translation model, in this embodiment the classification model trained on the positive and negative example corpora of the source and target languages can be used to prune the preset translation model and filter out its noise, generating the translation model corresponding to the source language and the target language. For example, the classification model can filter out translation word pairs in the preset translation model's bilingual database that have no inter-translation relationship.
Because noise in a preset translation model is filtered by using the classification model, the translation quality of the translation model of the source language and the target language can be improved.
Machine learning on the first positive example corpus and the randomly generated negative example corpus yields a model that outputs the probability that a source-target translation word pair is a legal word pair, and the preset translation model can be pruned according to this probability. Fig. 2 is a schematic flow chart of another method for constructing a translation model according to an embodiment of the present application.
As shown in fig. 2, pruning the preset translation model by using the classification model includes:
step 201, inputting each translation word pair in the bilingual database in the preset translation model into the classification model respectively to determine the probability that each translation word pair is a legal word pair.
In this embodiment, the bilingual database in the preset translation model includes a translation word pair of the source language and the target language.
When the preset translation model is pruned, each translation word pair in the bilingual database can be input into the classification model to obtain the probability that the pair is a legal word pair. Legal word pairs are translation word pairs with an inter-translation relationship.
Step 202, the preset translation model is pruned according to the obtained legal word pairs.
In this embodiment, the probability that each translation word pair is a legal word pair is compared with a preset threshold probability; translation word pairs whose probability is greater than the threshold are retained as legal word pairs, and the other translation word pairs in the bilingual database are deleted. This completes the pruning of the preset translation model and generates the translation model corresponding to the source language and the target language.
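Steps 201-202 amount to a threshold filter over classifier scores. A minimal sketch (the scorer below is a stand-in for the trained classification model, and the stored feature values are invented):

```python
def prune_bilingual_database(database, score_fn, threshold=0.5):
    """Retain only the translation word pairs whose legal-word-pair
    probability (as produced by the classification model) exceeds the
    threshold probability; all other pairs are deleted."""
    return {pair: feats for pair, feats in database.items()
            if score_fn(feats) > threshold}

# bilingual database of the fused (preset) model; the second feature is
# a stored inter-translation probability, reused here as a stand-in for
# the classifier's output (an assumption for illustration)
database = {("银行", "銀行"): [1.0, 0.90],
            ("银行", "河原"): [1.0, 0.05],   # fusion noise
            ("河岸", "河原"): [1.1, 0.85]}
pruned = prune_bilingual_database(database, score_fn=lambda f: f[1])
```

Passing the scorer in as a function keeps the pruning step independent of which classification algorithm (SVM, decision tree, etc.) produced the probabilities.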
In the embodiment of the application, the probability that each translation word pair in the bilingual database in the preset translation model is a legal word pair is determined through the classification model, the legal word pair is determined according to the probability, and pruning processing is carried out on the preset translation model according to the legal word pair, so that the translation model obtained through fusion is filtered, and the translation quality of the translation model is improved.
In one embodiment of the present application, features may be extracted from the translation word pairs and used for machine learning to obtain the classification model; the classification model then recognizes the features of the translation word pairs in the preset translation model's bilingual database in order to prune the model. Fig. 3 is a schematic flow chart of another method for constructing a translation model according to an embodiment of the present application.
As shown in fig. 3, the translation model construction method includes:
step 301, when the number of the translation word pairs in the first positive example corpus is smaller than a threshold value, randomly generating a negative example corpus according to each translation word pair in the acquired first positive example corpus.
In this embodiment, step 301 is similar to step 101, and will not be described herein.
Step 302, parsing each translation word pair in the first positive example corpus and the negative example corpus to determine a feature set of each translation word pair.
In this embodiment, features may be extracted from each translation word pair, and these features distinguish the translation word pairs of the positive example corpus from those of the negative example corpus.
Before machine learning is performed on the feature sets to generate the classification model, each translation word pair in the first positive example corpus and the negative example corpus can be parsed to obtain its features; the collection of those features is the pair's feature set. The feature set comprises one or more of: the source-language phrase length, the target-language phrase length, the length ratio of the translation word pair, and the inter-translation probability value of the translation word pair.
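A sketch of the feature set named above, assuming character-level phrase lengths (the patent does not fix the unit of length):

```python
def extract_features(src_phrase, tgt_phrase, trans_prob):
    """Feature set listed in the text: source-language phrase length,
    target-language phrase length, length ratio of the pair, and the
    pair's inter-translation probability value."""
    src_len, tgt_len = len(src_phrase), len(tgt_phrase)
    ratio = src_len / tgt_len if tgt_len else 0.0
    return [src_len, tgt_len, ratio, trans_prob]

features = extract_features("银行", "ぎんこう", 0.9)
```

The same function serves both step 302 (features for training pairs) and step 304 (features for pairs in the bilingual database), which is what lets one classification model handle both.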
Step 303, machine learning is performed on the feature set of each translation word pair in the first positive example corpus and the negative example corpus to generate a classification model.
In this embodiment, a classification algorithm is used to train a feature set of each translation word pair in the first positive example corpus and the negative example corpus, so as to generate a classification model.
Step 304, each translation word pair in the bilingual database of the preset translation model is parsed to determine the feature set of each translation word pair.
In this embodiment, each translation word pair in the bilingual database of the preset translation model is parsed to determine its features, and the collection of those features is its feature set.
In step 305, the feature set of each translation word pair in the bilingual database in the preset translation model is identified by using the classification model, and the preset translation model is pruned to generate the translation model corresponding to the source language and the target language.
In this embodiment, the feature set of each translation word pair in the preset translation model's bilingual database may be input into the classification model, which outputs the probability that the pair is a legal word pair. The legal word pairs in the bilingual database are determined according to these probabilities and retained, while the other translation word pairs are deleted, completing the pruning of the preset translation model and generating the translation model corresponding to the source language and the target language.
In one embodiment of the present application, before the negative example corpus is randomly generated from the translation word pairs in the acquired first positive example corpus, the first positive example corpus of the source language and the target language may be obtained from a corpus, and the negative example corpus then generated from it. The corpus may include a plurality of bilingual phrase pairs, for example Chinese-English phrase pairs, English-Japanese phrase pairs, Chinese-Japanese phrase pairs, and Chinese-Russian phrase pairs.
Specifically, when a user initiates a translation model construction request, the translation model construction device may receive the request, which includes a source language type and a target language type. For example, if the construction request specifies Chinese as the source language type and Japanese as the target language type, the translation model to be constructed is a Chinese-Japanese translation model.
Then, according to the source language type and the target language type, phrase pairs of those two types can be obtained from the corpus; source-language words and target-language words are extracted from the source-language and target-language phrases to obtain translation word pairs containing the source language and the target language, and the resulting translation word pairs form the first positive example corpus. After the first positive example corpus is obtained, the negative example corpus can be randomly generated from its translation word pairs.
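A sketch of selecting the first positive example corpus from a multilingual corpus, assuming the corpus is stored as (source type, target type, source word, target word) records (the layout, and the skipping of phrase-to-word extraction, are simplifying assumptions):

```python
def build_first_positive_corpus(corpus, src_type, tgt_type):
    """Pick out the word pairs whose language types match the source
    and target types named in the construction request."""
    return [(s, t) for (ls, lt, s, t) in corpus
            if ls == src_type and lt == tgt_type]

corpus = [("zh", "en", "银行", "bank"),
          ("en", "ja", "bank", "銀行"),
          ("zh", "ja", "银行", "銀行"),
          ("zh", "ja", "学校", "学校")]
first_positive = build_first_positive_corpus(corpus, "zh", "ja")
```

In the patent's fuller flow, the selected records would be phrase pairs from which word pairs are then extracted; the sketch collapses those two steps into one.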
In one embodiment of the present application, before pruning the preset translation model by using the classification model, a reference language may be determined, and the preset translation model may be obtained by using the reference language. Fig. 4 is a schematic flow chart of another method for constructing a translation model according to an embodiment of the present application.
As shown in fig. 4, before pruning the preset translation model by using the classification model, the translation model construction method further includes:
step 401, determining a target reference language according to the number of each first-class translation word pair and the number of the corresponding second-class translation word pairs in the corpus.
Because the number of source-target translation word pairs is smaller than the threshold value, a translation model trained directly on those pairs would have low translation quality.
In the embodiment of the application, the translation model of the source language and the target language can be constructed by means of an intermediate language type. When determining the target reference language serving as the intermediate language type, first-class translation word pairs and the corresponding second-class translation word pairs are obtained from the corpus, and the target reference language is determined according to the number of each first-class translation word pair and the number of the corresponding second-class translation word pairs in the corpus.
It can be appreciated that the method for obtaining the first-class translation word pairs and the second-class translation word pairs from the corpus is similar to the method for obtaining the first positive example corpus from the corpus, and is therefore not repeated here.
The first-class translation word pair and the corresponding second-class translation word pair contain the same reference language: the first-class pair contains the source language and the reference language, and the second-class pair contains the reference language and the target language. For example, if the first-class translation word pair is a Chinese-English pair, the corresponding second-class pair is an English-Japanese pair; if the first-class pair is a Chinese-Russian pair, the corresponding second-class pair is a Russian-Japanese pair.
In order to improve the translation quality of the translation model, the number of each first-class translation word pair and the number of the corresponding second-class translation word pairs can be compared with a preset number, and a reference language whose first-class and second-class translation word pair counts both exceed the preset number is taken as the target reference language.
It should be noted that if the translation word pair counts for multiple candidate reference languages meet the number requirement, the reference language whose first-class and corresponding second-class translation word pair counts have the largest sum may be selected as the target reference language.
For example, suppose the number of Chinese-English translation word pairs and the number of corresponding English-Japanese translation word pairs in the corpus both exceed the preset number, and the number of Chinese-Russian translation word pairs and the number of corresponding Russian-Japanese translation word pairs also exceed the preset number. If the sum of the Chinese-English and English-Japanese pair counts is greater than the sum of the Chinese-Russian and Russian-Japanese pair counts, the reference language English, contained in the Chinese-English and English-Japanese translation word pairs, is taken as the target reference language.
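The selection rule above can be sketched as follows; this is a hedged illustration, and the function name and the dictionary layout (counts keyed by candidate reference language) are assumptions:

```python
def choose_reference_language(pair_counts, preset_number):
    """pair_counts maps a candidate reference language to a tuple
    (number of source-reference pairs, number of reference-target pairs).
    Keep candidates where both counts exceed the preset number, then
    pick the candidate with the largest combined count."""
    candidates = {
        lang: n_first + n_second
        for lang, (n_first, n_second) in pair_counts.items()
        if n_first > preset_number and n_second > preset_number
    }
    if not candidates:
        return None  # no reference language satisfies the number requirement
    return max(candidates, key=candidates.get)

counts = {"English": (900_000, 800_000), "Russian": (500_000, 400_000)}
target_ref = choose_reference_language(counts, 100_000)
```

With the example counts above, English wins because the sum of its two pair counts is the largest among the candidates that clear the preset number.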
Step 402, training a first translation model using a second positive example corpus formed from the first-class translation word pairs containing the reference language, and training a second translation model using a third positive example corpus formed from the second-class translation word pairs containing the reference language.
In this embodiment, a first translation model is obtained by training on the second positive example corpus composed of first-class translation word pairs containing the target reference language, and a second translation model is obtained by training on the third positive example corpus composed of second-class translation word pairs containing the target reference language. The first translation model is a translation model between the source language and the target reference language, and the second translation model is a translation model between the target reference language and the target language.
In this embodiment, the first translation model and the second translation model are obtained by training on large-scale translation word pairs of the source language and the reference language, and of the reference language and the target language, respectively, and therefore have high translation quality.
After the first translation model and the second translation model are obtained, they are fused to obtain the preset translation model of the source language and the target language. During fusion, the translation word pairs in the bilingual database of the first translation model and the translation word pairs in the bilingual database of the second translation model can be correspondingly combined to obtain the bilingual database of the preset translation model.
For example, Table 1 shows translation word pairs in the bilingual database of a Chinese-English translation model, Table 2 shows translation word pairs in the bilingual database of an English-Japanese translation model, and Table 3 shows the translation word pairs obtained by fusing the pairs in Table 1 and Table 2.
TABLE 1
Chinese       | English
river bank    | riverside
river bank    | bank
bank          | bank
deposit       | bank
deposit       | deposit

TABLE 2
English       | Japanese
riverside     | (Japanese word for "riverside")
bank          | (Japanese word for "riverside")
bank          | (Japanese word for "bank")
bank          | (Japanese word for "deposit")
deposit       | (Japanese word for "deposit")

TABLE 3
Chinese (English pivot)  | Japanese
river bank (riverside)   | (Japanese word for "riverside")
river bank (bank)        | (Japanese word for "riverside")
river bank (bank)        | (Japanese word for "bank")
river bank (bank)        | (Japanese word for "deposit")
bank (bank)              | (Japanese word for "riverside")
bank (bank)              | (Japanese word for "bank")
bank (bank)              | (Japanese word for "deposit")
deposit (bank)           | (Japanese word for "riverside")
deposit (bank)           | (Japanese word for "bank")
deposit (bank)           | (Japanese word for "deposit")
deposit (deposit)        | (Japanese word for "deposit")
As can be seen from Table 3, the ambiguity of the English word "bank" causes the fused Chinese-Japanese translation word pairs to include pairs with no inter-translation relationship, such as "river bank" paired with the Japanese word for "bank" or "deposit", and "deposit" paired with the Japanese word for "river bank" or "bank".
Therefore, after the preset translation model is obtained by fusing the first translation model and the second translation model, the preset translation model can be pruned to filter out illegal translation word pairs, that is, pairs without an inter-translation relationship, thereby improving the translation quality of the translation model constructed by means of the reference language.
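The pruning step can be sketched as filtering the fused bilingual database by the classification model's probability that each pair is legal. The score function below is a stand-in for the trained classifier, and its interface is an assumption:

```python
def prune_phrase_table(fused_pairs, score_fn, threshold=0.5):
    """Keep only fused (source, pivot, target) entries whose probability
    of being a legal inter-translation pair, as judged by score_fn,
    reaches the threshold."""
    return [(s, p, t) for (s, p, t) in fused_pairs if score_fn(s, t) >= threshold]

# Toy stand-in classifier: a whitelist of known inter-translation pairs.
legal = {("river bank", "ja_riverside"), ("bank", "ja_bank"),
         ("deposit", "ja_deposit")}
score = lambda s, t: 1.0 if (s, t) in legal else 0.0

fused = [("river bank", "riverside", "ja_riverside"),
         ("river bank", "bank", "ja_deposit"),
         ("deposit", "bank", "ja_bank"),
         ("deposit", "deposit", "ja_deposit")]
pruned = prune_phrase_table(fused, score)
```

In practice the score function would be the classification model trained on the first positive example corpus and the negative example corpus, evaluated on each pair's feature set.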
In order to achieve the above embodiments, the embodiments of the present application further provide a translation model building device. Fig. 5 is a schematic structural diagram of a translation model building device according to an embodiment of the present application.
As shown in fig. 5, the translation model constructing apparatus includes: a first generation module 510, a second generation module 520, and a third generation module 530.
A first generation module 510, configured to randomly generate a negative example corpus according to each translation word pair in the obtained first positive example corpus when the number of translation word pairs in the first positive example corpus is less than a threshold, where each of the translation word pairs in the first positive example corpus and the negative example corpus includes a source language and a corresponding target language;
The second generation module 520 is configured to perform machine learning on the first positive example corpus and the negative example corpus to generate a classification model;
a third generating module 530, configured to prune a preset translation model by using the classification model to generate a translation model corresponding to the source language and the target language;
the preset translation model is a translation model obtained by fusing a first translation model and a second translation model, the first translation model being trained using a second positive example corpus comprising the source language and a reference language, and the second translation model being trained using a third positive example corpus comprising the reference language and the target language.
In one possible implementation manner of the embodiment of the present application, the third generating module 530 is specifically configured to:
respectively inputting each translation word pair in a bilingual database in a preset translation model into a classification model to determine the probability that each translation word pair is a legal word pair;
and pruning the preset translation model according to the obtained probabilities.
In one possible implementation manner of the embodiment of the application, the apparatus may further include:
the first determining module is used for parsing each translation word pair in the first positive example corpus and the negative example corpus to determine a feature set of each translation word pair;
The first determining module is further configured to parse each translation word pair in the bilingual database in the preset translation model to determine a feature set of each translation word pair.
In one possible implementation manner of the embodiment of the present application, the feature set of each translation word pair includes at least one of the following features: the source language phrase length, the target language phrase length, the translation word pair length ratio, the translation word pair inter-translation probability value, and the translation probability of each word in the translation word pair.
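The listed features can be assembled into a feature vector per translation word pair. A minimal sketch, assuming whitespace-tokenized phrases and precomputed inter-translation probabilities (the per-word translation probabilities are omitted for brevity; the function name and dictionary keys are assumptions):

```python
def pair_features(src_phrase, tgt_phrase, p_src_given_tgt, p_tgt_given_src):
    """Build the feature set named above: phrase lengths, their ratio,
    and the inter-translation probability values (assumed to come from
    word-alignment statistics computed elsewhere)."""
    src_len = len(src_phrase.split())
    tgt_len = len(tgt_phrase.split())
    return {
        "src_len": src_len,
        "tgt_len": tgt_len,
        "len_ratio": src_len / tgt_len,
        "p_src_given_tgt": p_src_given_tgt,
        "p_tgt_given_src": p_tgt_given_src,
    }

features = pair_features("river bank", "riverside", 0.3, 0.5)
```

The same feature extraction would be applied both to the training corpora (positive and negative) and to the pairs in the preset translation model's bilingual database before classification.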
In one possible implementation manner of the embodiment of the present application, the first generating module 510 is specifically configured to randomly exchange the target languages in the translation word pairs of the first positive example corpus to generate the negative example corpus.
In one possible implementation manner of the embodiment of the present application, the apparatus further includes:
the first acquisition module is used for acquiring a translation model construction request, wherein the construction request comprises a source language type and a target language type;
the second acquisition module is used for acquiring the first positive example corpus from the corpus according to the source language type and the target language type.
In one possible implementation manner of the embodiment of the application, the apparatus may further include:
The second determining module is used for determining a target reference language according to the number of each first-class translation word pair and the number of the corresponding second-class translation word pairs in the corpus;
the first class translation word pairs and the corresponding second class translation word pairs contain the same reference language, the first class translation word pairs contain the source language and the corresponding reference language, and the second class translation word pairs contain the reference language and the corresponding target language;
the training module is used for training the first translation model using a second positive example corpus formed from the first-class translation word pairs containing the reference language, and training the second translation model using a third positive example corpus formed from the second-class translation word pairs containing the reference language.
The foregoing explanation of the embodiment of the translation model building method is also applicable to the translation model building device of this embodiment, and therefore will not be repeated here.
According to the translation model construction device of this embodiment, when the number of translation word pairs in the first positive example corpus is smaller than a threshold, a negative example corpus is randomly generated from the translation word pairs in the first positive example corpus, where each translation word pair in both corpora includes a source language and a corresponding target language. Machine learning is performed on the first positive example corpus and the negative example corpus to generate a classification model, and the classification model is used to prune a preset translation model to generate a translation model corresponding to the source language and the target language. The preset translation model is obtained by fusing a first translation model, trained on a second positive example corpus comprising the source language and a reference language, and a second translation model, trained on a third positive example corpus comprising the reference language and the target language. Therefore, when bilingual corpus for the source language and the target language is scarce, a classification model is obtained from the available translation word pairs of the two languages and used to filter the translation model built by means of the reference language, which greatly reduces noise in the translation model and improves its translation quality.
In order to implement the above embodiments, the embodiments of the present application further provide a computer device, including a processor and a memory;
wherein the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, for implementing the translation model constructing method as described in the above embodiment.
Fig. 6 illustrates a block diagram of an exemplary computer device suitable for use in implementing embodiments of the present application. The computer device 12 shown in fig. 6 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present application.
As shown in FIG. 6, the computer device 12 is in the form of a general purpose computing device. Components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (hereinafter ISA) bus, the Micro Channel Architecture (hereinafter MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (hereinafter VESA) local bus, and the Peripheral Component Interconnect (hereinafter PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (Random Access Memory; hereinafter: RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 6, commonly referred to as a "hard disk drive"). Although not shown in fig. 6, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a compact disk read only memory (Compact Disc Read Only Memory; hereinafter CD-ROM), digital versatile read only optical disk (Digital Video Disc Read Only Memory; hereinafter DVD-ROM), or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the present application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods in the embodiments described herein.
The computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the computer device 12, and/or any devices (e.g., network card, modem, etc.) that enable the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, the computer device 12 may also communicate with one or more networks such as a local area network (Local Area Network; hereinafter LAN), a wide area network (Wide Area Network; hereinafter WAN) and/or a public network such as the Internet via the network adapter 20. As shown, network adapter 20 communicates with other modules of computer device 12 via bus 18. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with computer device 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, implementing the methods mentioned in the foregoing embodiments.
In order to implement the above embodiment, the present application further proposes a non-transitory computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the translation model building method as described in the above embodiment.
In the description of this specification, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" is at least two, such as two, three, etc., unless explicitly defined otherwise.
Any process or method descriptions in flowcharts or otherwise described herein may be understood as representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present application includes additional implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
Logic and/or steps represented in the flowcharts or otherwise described herein, for example, an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like. Although embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the application; variations, modifications, substitutions, and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the application.

Claims (10)

1. The translation model construction method is characterized by comprising the following steps of:
when the number of the translation word pairs in the first positive example material set is smaller than a threshold value, randomly generating a negative example material set according to each translation word pair in the acquired first positive example material set, wherein each of the translation word pairs in the first positive example material set and the negative example material set comprises a source language and a corresponding target language, the translation word pairs in the first positive example material set are inter-translation word pairs, and the translation word pairs in the negative example material set are non-inter-translation word pairs;
performing machine learning on the first positive example corpus and the negative example corpus to generate a classification model;
pruning a preset translation model by using the classification model to generate a translation model corresponding to the source language and the target language;
the preset translation model is a translation model obtained by fusing a first translation model and a second translation model, the first translation model being trained using a second positive example material set comprising the source language and a reference language, the second translation model being trained using a third positive example material set comprising the reference language and the target language, the reference language being a language other than the source language and the target language, the translation word pairs in the second positive example material set comprising the source language and the reference language, and the translation word pairs in the third positive example material set comprising the reference language and the target language.
2. The method of claim 1, wherein pruning a preset translation model using the classification model comprises:
respectively inputting each translation word pair in a bilingual database in the preset translation model into the classification model to determine the probability that each translation word pair is a legal word pair;
and pruning the preset translation model according to the obtained probabilities.
3. The method of claim 1, wherein prior to machine learning the first positive corpus and the negative corpus, further comprising:
analyzing each translation word pair in the first positive example corpus and the negative example corpus to determine a feature set of each translation word pair;
before pruning the preset translation model by using the classification model, the method further comprises the following steps:
and analyzing each translation word pair in the bilingual database in the preset translation model to determine a feature set of each translation word pair.
4. The method of claim 3, wherein the feature set of each translation word pair includes at least one of the following features: source language phrase length, target language phrase length, translation word pair length ratio, and translation word pair inter-translation probability value.
5. The method of any one of claims 1-4, wherein randomly generating a negative example corpus from each pair of translated words in the acquired first positive example corpus comprises:
and randomly exchanging target languages in each translation word pair of the first positive example corpus to generate the negative example corpus.
6. The method of any of claims 1-4, wherein before randomly generating the negative corpus from each of the translated word pairs in the obtained first positive corpus, further comprises:
acquiring a translation model construction request, wherein the construction request comprises a source language type and a target language type;
and acquiring the first positive example material set from a corpus according to the source language type and the target language type.
7. The method of claim 6, wherein before pruning the preset translation model using the classification model, the method further comprises:
determining a target reference language according to the number of each first-class translation word pair and the number of the corresponding second-class translation word pairs in the corpus;
the first class translation word pairs and the corresponding second class translation word pairs contain the same reference language, the first class translation word pairs contain the source language and the corresponding reference language, and the second class translation word pairs contain the reference language and the corresponding target language;
and training the first translation model using a second positive example material set formed from the first-class translation word pairs containing the reference language, and training the second translation model using a third positive example material set formed from the second-class translation word pairs containing the reference language.
8. A translation model constructing apparatus, comprising:
the first generation module is used for randomly generating a negative example language material set according to each translation word pair in the obtained first positive example language material set when the number of the translation word pairs in the first positive example language material set is smaller than a threshold value, wherein the translation word pairs in the first positive example language material set and the negative example language material set respectively comprise a source language and a corresponding target language, the translation word pairs in the first positive example language material set are inter-translation word pairs, and the translation word pairs in the negative example language material set are non-inter-translation word pairs;
the second generation module is used for performing machine learning on the first positive example corpus and the negative example corpus to generate a classification model;
the third generation module is used for pruning a preset translation model by utilizing the classification model so as to generate a translation model corresponding to the source language and the target language;
The preset translation model is a translation model obtained by fusing a first translation model and a second translation model, the first translation model being trained using a second positive example material set comprising the source language and a reference language, the second translation model being trained using a third positive example material set comprising the reference language and the target language, the reference language being a language other than the source language and the target language, the translation word pairs in the second positive example material set comprising the source language and the reference language, and the translation word pairs in the third positive example material set comprising the reference language and the target language.
9. A computer device comprising a processor and a memory;
wherein the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory for implementing the translation model constructing method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor implements a translation model building method according to any of claims 1-7.
CN201811590009.7A 2018-12-25 2018-12-25 Translation model construction method and device Active CN109670190B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811590009.7A CN109670190B (en) 2018-12-25 2018-12-25 Translation model construction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811590009.7A CN109670190B (en) 2018-12-25 2018-12-25 Translation model construction method and device

Publications (2)

Publication Number Publication Date
CN109670190A (en) 2019-04-23
CN109670190B (en) 2023-05-16

Family

ID=66146043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811590009.7A Active CN109670190B (en) 2018-12-25 2018-12-25 Translation model construction method and device

Country Status (1)

Country Link
CN (1) CN109670190B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046677B (en) * 2019-12-09 2021-07-20 北京字节跳动网络技术有限公司 Method, device, equipment and storage medium for obtaining translation model
CN111259676A (en) * 2020-01-10 2020-06-09 苏州交驰人工智能研究院有限公司 Translation model training method and device, electronic equipment and storage medium
CN111898389B (en) * 2020-08-17 2023-09-19 腾讯科技(深圳)有限公司 Information determination method, information determination device, computer equipment and storage medium
CN113139391B (en) * 2021-04-26 2023-06-06 北京有竹居网络技术有限公司 Translation model training method, device, equipment and storage medium
CN113591492B (en) * 2021-06-30 2023-03-24 北京百度网讯科技有限公司 Corpus generation method and device, electronic equipment and storage medium
CN113988089B (en) * 2021-10-18 2024-08-02 浙江香侬慧语科技有限责任公司 Machine translation method, device and medium based on K nearest neighbor

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102789451A (en) * 2011-05-16 2012-11-21 北京百度网讯科技有限公司 Individualized machine translation system, method and translation model training method
CN103123634A (en) * 2011-11-21 2013-05-29 北京百度网讯科技有限公司 Copyright resource identification method and copyright resource identification device
CN103544147A (en) * 2013-11-06 2014-01-29 北京百度网讯科技有限公司 Translation model training method and device
CN104505090A (en) * 2014-12-15 2015-04-08 北京国双科技有限公司 Method and device for recognizing sensitive words in speech
CN104915337A (en) * 2015-06-18 2015-09-16 中国科学院自动化研究所 Translation text integrity evaluation method based on bilingual text structure information
CN105068997A (en) * 2015-07-15 2015-11-18 清华大学 Parallel corpus construction method and device
CN106202059A (en) * 2015-05-25 2016-12-07 松下电器(美国)知识产权公司 Machine translation method and machine translation apparatus
CN106294684A (en) * 2016-08-06 2017-01-04 上海高欣计算机系统有限公司 Text classification method based on word vectors, and terminal device
CN107590237A (en) * 2017-09-11 2018-01-16 桂林电子科技大学 A knowledge graph representation learning method based on dynamic translation principles
CN107885760A (en) * 2016-12-21 2018-04-06 桂林电子科技大学 A representation learning method based on multiple semantic knowledge graphs
CN108845994A (en) * 2018-06-07 2018-11-20 南京大学 Utilize the neural machine translation system of external information and the training method of translation system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5640773B2 (en) * 2011-01-28 2014-12-17 富士通株式会社 Information collation apparatus, information collation method, and information collation program

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Negative expression translation in Japanese and Chinese machine translation";Hong Zhang.etc;《2008 International Conference on Natural Language Processing and Knowledge Engineering》;20090502;全文 *
"智能语音对话系统中基于规则和统计的语义识别";过冰;《中国优秀硕士学位论文全文数据库信息科技辑》;20150415;I136-175 *
"统计机器翻译中短语切分的新方法";何中军等;《第三届学生计算语言学研讨会论文集 》;20060801;第408-412页 *

Also Published As

Publication number Publication date
CN109670190A (en) 2019-04-23

Similar Documents

Publication Publication Date Title
CN109670190B (en) Translation model construction method and device
Hassan et al. Achieving human parity on automatic chinese to english news translation
Saunders et al. Reducing gender bias in neural machine translation as a domain adaptation problem
Chollampatt et al. A multilayer convolutional encoder-decoder neural network for grammatical error correction
US20230394242A1 (en) Automated translation of subject matter specific documents
US11372942B2 (en) Method, apparatus, computer device and storage medium for verifying community question answer data
US11031009B2 (en) Method for creating a knowledge base of components and their problems from short text utterances
JP2022177242A (en) Method for training text recognition model, method for recognizing text, and device for recognizing text
US20090106015A1 (en) Statistical machine translation processing
US9311299B1 (en) Weakly supervised part-of-speech tagging with coupled token and type constraints
CN109635305A (en) Voice translation method and device, equipment and storage medium
CN111814493B (en) Machine translation method, device, electronic equipment and storage medium
CN110032734B (en) Training method and device for similar meaning word expansion and generation of confrontation network model
CN110569335A (en) triple verification method and device based on artificial intelligence and storage medium
Rikters et al. Training and adapting multilingual NMT for less-resourced and morphologically rich languages
CN112183117B (en) Translation evaluation method and device, storage medium and electronic equipment
CN110245361B (en) Phrase pair extraction method and device, electronic equipment and readable storage medium
KR101646414B1 (en) Lengthy Translation Service Apparatus and Method of same
CN112232057B (en) Method, device, medium and equipment for generating countermeasure sample based on text expansion
CN113627159A (en) Method, device, medium and product for determining training data of error correction model
Fatima et al. Cross-lingual Science Journalism: Select, Simplify and Rewrite Summaries for Non-expert Readers
CN111091915A (en) Medical data processing method and device, storage medium and electronic equipment
Wang et al. Listen, Decipher and Sign: Toward Unsupervised Speech-to-Sign Language Recognition
Eo et al. Word-level quality estimation for Korean-English neural machine translation
Park et al. Unsupervised abstractive dialogue summarization with word graphs and POV conversion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant