WO2018214956A1 - Machine translation method, apparatus, and storage medium - Google Patents

Machine translation method, apparatus, and storage medium

Info

Publication number
WO2018214956A1
WO2018214956A1 (PCT/CN2018/088387)
Authority
WO
WIPO (PCT)
Prior art keywords
sample
feature
translation
document
preset
Prior art date
Application number
PCT/CN2018/088387
Other languages
English (en)
French (fr)
Inventor
涂兆鹏
刘晓华
李航
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP18806246.7A priority Critical patent/EP3617908A4/en
Publication of WO2018214956A1 publication Critical patent/WO2018214956A1/zh
Priority to US16/694,239 priority patent/US20200089774A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/42Data-driven translation
    • G06F40/47Machine-assisted translation, e.g. using translation memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/42Data-driven translation
    • G06F40/44Statistical methods, e.g. probability models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/51Translation evaluation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present disclosure relates to the field of communication network technologies, and in particular, to a machine translation method, apparatus, and storage medium.
  • Machine translation refers to a translation method in which translation is performed automatically by a machine translation apparatus.
  • machine translation devices include statistical machine translation devices and neural network machine translation devices.
  • translation is performed by a statistical machine translation device, or translation is performed by a neural network machine translation device.
  • the process of translating by the statistical machine translation device may be: splitting the source document to be translated into at least one phrase, respectively translating each phrase, obtaining each translation segment, and splicing each translation segment into a target document.
  • the process of translating through the neural network machine translation device may be: vectorizing each sentence in the source document to be translated, and passing each vectorized vector sentence into a network layer, and converting it into a computer-understandable representation.
  • the target document is generated through multiple layers of complex conduction operations.
  • The statistical machine translation device splices translation segments into the target document, resulting in low fluency of the target document, while the translation generated by the neural network machine translation device does not completely reflect the meaning of the source document and often exhibits missing translation or over-translation, resulting in low fidelity of the translation. It can be seen that the above machine translation methods have poor accuracy.
  • a machine translation method comprising:
  • converting the source document into a plurality of target documents by a plurality of machine translation apparatuses, wherein one machine translation apparatus is configured to translate the source document into one target document, and the target document includes at least one character of a target language;
  • the source language is different from the target language
  • the target document with the highest recommendation degree is output according to the recommendation degree of each of the target documents.
  • A plurality of machine translation apparatuses convert the source document into a plurality of target documents, the recommendation degree of each target document is determined, and the target document with the highest recommendation degree is output. Because the source document is translated by a plurality of machine translation apparatuses and the target document is output according to the recommendation degree of each target document, the accuracy of machine translation is improved.
  • the determining the recommendation degree of each target document according to the feature value of each preset feature of each target document includes:
  • The recommendation degree of each target document is determined by a preset recommendation degree algorithm according to the feature value of each preset feature of the target document, the reference feature weight of each preset feature, and the reference feature offset. Because the reference feature weight and the reference feature offset of each preset feature are taken into account, the accuracy of the determined recommendation degree of each target document is improved, and the target document is output according to these recommendation degrees, thereby improving the accuracy of machine translation.
  • the method before the determining the recommendation degree of each target document according to the feature value of each preset feature of each target document, the method further includes:
  • The reference feature weight and the reference feature offset of each preset feature are obtained by training on the first sample document set and the first sample translation set, which improves the accuracy of the determined reference feature weight and reference feature offset of each preset feature.
  • the determining, according to the first sample document set, the second sample translation set includes:
  • converting each sample document in the first sample document set into a plurality of sample translation sets by the plurality of machine translation devices, wherein one sample translation set includes at least one sample translation of the target language obtained by one machine translation device translating each of the sample documents;
  • the second set of sample translations is determined based on the degree of recommendation of each of the sample translations.
  • the determining, according to the first error recommendation rate, the initial feature weight of each preset feature, and the initial feature offset, the reference feature weight and the reference feature offset of each of the preset features includes:
  • the initial feature weight and the initial feature offset of each of the preset features are respectively determined as the reference feature weight and the reference feature offset of each of the preset features; or,
  • the initial feature weight and the initial feature offset of each preset feature are updated by a preset iterative algorithm until the second incorrect recommendation rate satisfies the preset condition.
  • The second error recommendation rate is determined according to the updated initial feature weight and the updated initial feature offset, and the feature weight and the feature offset obtained when the second error recommendation rate meets the preset condition are determined as the reference feature weight and the reference feature offset of each preset feature.
  • The reference feature weight and the reference feature offset of each preset feature are determined according to the first error recommendation rate and the preset iterative algorithm, which improves the accuracy of the determined reference feature weight and reference feature offset of each preset feature.
  • the determining, according to the first sample translation set and the second sample translation set, the first error recommendation rate includes:
  • the first sample number being the number of sample documents included in the second sample document set
  • the second sample number being the number of sample documents included in the first sample document set
  • The first error recommendation rate is determined from the sample number ratio between the first sample number and the second sample number and the recommendation coefficient of each sample document in the second sample document set, which improves the accuracy of the determined first error recommendation rate.
  • the determining, according to a recommendation degree of each sample translation in the third sample translation set, a recommendation coefficient of each sample document in the second sample document set includes:
  • The recommendation coefficient of each sample document in the second sample document set is determined according to the recommendation degree ratio and the preset recommendation weight of that sample document, which improves the accuracy of the determined recommendation coefficient of each sample document.
  • the preset feature includes a first type of preset feature and/or a second type of preset feature, where the first type of preset feature is used to evaluate the fluency of a target document, and the second type of preset feature is used to evaluate the fidelity of the target document;
  • the determining the feature value of each preset feature of each target translation separately includes:
  • the preset feature includes a first type of preset feature and a second type of preset feature, and subsequently combines the first type of preset feature and the second type of preset feature to determine a recommendation degree of each target document. The accuracy of the determined degree of recommendation for each target document is improved.
  • a machine translation apparatus comprising:
  • An obtaining unit configured to acquire a source document to be translated, where the source document includes at least one character of a source language
  • a translation unit configured to convert the source document into a plurality of target documents by a plurality of machine translation devices, wherein a machine translation device is configured to translate the source document into a target document, the target document including a target At least one character of the language, the source language being different from the target language;
  • a determining unit configured to respectively determine feature values of each preset feature of each target document, wherein the feature value of any preset feature of any target document is used to evaluate the fluency and/or fidelity of that target document;
  • the determining unit is further configured to determine a recommendation degree of each target document according to a feature value of each preset feature of each target document;
  • an output unit configured to output a target document with the highest recommendation according to the recommendation degree of each target document.
  • the determining unit is further configured to determine the recommendation degree of each target document by using a preset recommendation degree algorithm according to the feature value of each preset feature of each target document, the reference feature weight of each preset feature, and the reference feature offset, wherein the reference feature weight and the reference feature offset of each preset feature are obtained by training according to the first sample document set and the first sample translation set
  • the first sample document set includes at least one sample document to be translated
  • the first sample translation set includes a reference translation corresponding to each sample document.
  • the device further includes:
  • the obtaining unit is further configured to acquire the first sample document set and the first sample translation set;
  • the determining unit is further configured to determine, according to the first sample document set, a second sample translation set, where the second sample translation set includes a sample translation corresponding to each sample document;
  • the determining unit is further configured to determine a first error recommendation rate according to the first sample translation set and the second sample translation set;
  • the determining unit is further configured to determine a reference feature weight and a reference feature offset of each of the preset features according to the first error recommendation rate, the initial feature weight of each preset feature, and the initial feature offset.
  • the translation unit is further configured to convert each sample document in the first sample document set into a plurality of sample translation sets by using the plurality of machine translation devices, respectively.
  • a sample translation set includes a machine translation device translating each of the sample documents into at least one sample translation of the target language;
  • the determining unit is further configured to respectively determine feature values of each preset feature of each sample translation in the plurality of sample translation sets; according to feature values of each preset feature of each sample translation, Determining, by the initial feature weight and the initial feature offset of each preset feature, a recommendation degree of each of the sample translations; determining the second sample translation set according to the recommendation degree of each of the sample translations.
  • the determining unit is further configured to: if the first error recommendation rate meets a preset condition, determine the initial feature weight and the initial feature offset of each preset feature respectively as the reference feature weight and the reference feature offset of each of the preset features; or
  • the determining unit is further configured to: if the first error recommendation rate does not meet the preset condition, update the initial feature weight and the initial feature offset of each preset feature by using a preset iterative algorithm until the second error recommendation rate satisfies the preset condition, where the second error recommendation rate is determined according to the updated initial feature weight and the updated initial feature offset, and determine the feature weight and the feature offset obtained when the second error recommendation rate satisfies the preset condition as the reference feature weight and the reference feature offset of each of the preset features.
  • the determining unit is further configured to determine, according to the first sample translation set and the second sample translation set, a third sample translation set and a second sample document set,
  • the third sample translation set includes the sample translations that differ between the first sample translation set and the second sample translation set, and the second sample document set includes the sample documents corresponding to those differing sample translations; determine a recommendation coefficient of each sample document in the second sample document set according to a recommendation degree of each sample translation in the third sample translation set; and determine a sample number ratio between the first sample number and the second sample number,
  • where the first sample number is the number of sample documents included in the second sample document set and the second sample number is the number of sample documents included in the first sample document set; and determine the product of the sample number ratio and the recommendation coefficients of the sample documents in the second sample document set to obtain the first error recommendation rate.
  • the determining unit is further configured to: determine a recommendation weight of each sample document in the second sample document set according to a recommendation degree of each sample translation in the third sample translation set; for each sample document in the second sample document set, determine a ratio of the recommendation weight of the sample document to a preset recommendation degree to obtain a recommendation degree ratio of the sample document; and select the minimum of the recommendation degree ratio of the sample document and a preset recommendation weight as the recommendation coefficient of the sample document.
  • the preset feature includes a first type of preset feature and/or a second type of preset feature, where the first type of preset feature is used to evaluate the fluency of a target document, and the second type of preset feature is used to evaluate the fidelity of the target document;
  • the determining unit is further configured to extract feature values of each of the first type of preset features of each target translation by using an extraction algorithm of each first type of preset feature; and/or respectively a second type of preset feature extraction algorithm, extracting feature values of each second type of preset feature of each target translation;
  • the determining unit is further configured to compose the feature values of each first type of preset feature of each target translation and/or the feature values of each second type of preset feature of each target translation into the feature values of each preset feature of each target document.
  • a machine translation apparatus comprising: a processing component that further includes one or more processors, and memory resources, represented by a memory, for storing instructions executable by the processing component, for example, an application.
  • An application stored in the memory may include one or more modules each corresponding to a set of instructions.
  • the processing component is configured to execute instructions to perform the machine translation method described in the first aspect above.
  • In a fourth aspect, a system chip is provided, including an input/output interface, at least one processor, a memory, and a bus; the input/output interface is connected to the at least one processor and the memory through the bus, the input/output interface is used to acquire the source document to be translated and to output the target document, and the at least one processor executes the instructions stored in the memory, so that the machine translation system performs the machine translation method described in the first aspect above.
  • a computer readable storage medium storing a computer program, where the program, when executed by a processor, implements the machine translation method described in any one of the implementations of the first aspect.
  • The technical solutions provided by the embodiments of the present disclosure have the following beneficial effects: in the embodiments of the present disclosure, the source document is converted into a plurality of target documents by multiple machine translation devices, the recommendation degree of each target document is determined, and the target document with the highest recommendation degree is output.
  • Because the target documents are produced by a plurality of machine translation devices and the output target document is selected according to the recommendation degree of each target document, the accuracy of machine translation is improved.
  • FIG. 1 is a schematic diagram of a machine translation system provided by an embodiment of the present disclosure
  • FIG. 2 is a flowchart of a machine translation method provided by an embodiment of the present disclosure
  • FIG. 3 is a flowchart of a machine translation method provided by an embodiment of the present disclosure.
  • FIG. 4 is a schematic structural diagram of a machine translation apparatus according to an embodiment of the present disclosure.
  • FIG. 5 is a block diagram of a machine translation apparatus according to an embodiment of the present disclosure.
  • FIG. 6 is a block diagram of a system chip according to an embodiment of the present disclosure.
  • Embodiments of the present disclosure provide a machine translation system.
  • the machine translation system includes a recommendation device 10 and a plurality of machine translation devices 20; each machine translation device 20 is coupled to the recommendation device 10.
  • Each of the machine translation device 20 and the recommendation device 10 may be connected by wire or wirelessly.
  • Each machine translation device 20 is configured to receive a source document to be translated, convert the source document into a target document, and send the target document to the recommendation device 10.
  • the plurality of machine translation apparatuses 20 may be various types of machine translation apparatuses, for example, the plurality of machine translation apparatuses 20 include the statistical machine translation apparatus 20 or the neural network machine translation apparatus 20.
  • the recommendation device 10 is configured to receive the target document sent by each machine translation device 20, and respectively determine feature values of each preset feature of each target document, wherein the feature value of any preset feature of any target document is used to evaluate the fluency and/or fidelity of that target document, and the recommendation degree of each target document is determined based on the feature values of each preset feature of that target document.
  • the recommendation device 10 is further configured to output a target document with the highest degree of recommendation according to the degree of recommendation of each target document.
  • the source document includes at least one character of the source language
  • the target document includes at least one character of the target language
  • the source language and the target language are different.
  • the source language and the target language can be set and changed as needed.
  • the source language is not specifically limited.
  • the source language can be Chinese, English, Japanese, or French.
  • the target language can be English, Japanese or French.
  • In the embodiment of the present disclosure, when determining the recommendation degree of each target document, the recommendation device determines the feature value of each preset feature of each target document, where the feature value of any preset feature is used to evaluate the fluency and/or fidelity of the target document, and then determines the recommendation degree of each target document by a preset recommendation degree algorithm according to the feature value of each preset feature of the target document, the reference feature weight of each preset feature, and the reference feature offset. Therefore, before performing the machine translation method provided by the embodiments of the present disclosure, the machine translation system needs to determine the reference feature weight and the reference feature offset of each preset feature. Referring to FIG. 2, the process in which the machine translation system determines the reference feature weight and the reference feature offset of each preset feature includes:
  • Step 201 The machine translation system acquires a first sample document set and a first sample translation set, where the first sample document set includes at least one sample document to be translated, and the first sample translation set includes a reference corresponding to each sample document. Translation.
  • The machine translation system acquires sample data, the sample data including the first sample document set and the first sample translation set.
  • the first sample document set includes at least one sample document to be translated, and the first sample translation set includes a reference translation corresponding to each sample document in the first sample set.
  • the reference translation refers to the standard translation.
  • the user labels at least one sample document, inputs at least one sample document to the machine translation system, the machine translation system receives at least one sample document input by the user, and composes at least one sample document into the first sample document set.
  • After the machine translation system acquires the first sample document set, for each sample document in the first sample document set, the sample document is converted into a plurality of sample translations by a plurality of machine translation devices. For the plurality of sample translations corresponding to each sample document, the user labels the reference translation from among those sample translations; the machine translation system obtains the reference translation of each sample document labeled by the user, and composes the reference translations corresponding to the sample documents into the first sample translation set.
  • each sample document includes at least one character of the source language
  • the sample translation includes at least one character of the target language
  • the source language and the target language are different.
  • the source language may be set and changed according to requirements.
  • the source language is not specifically limited; for example, the source language may be Chinese, English, Japanese, or French.
  • the target language may be set and changed as needed.
  • the target language is not specifically limited; for example, the target language may be English, Japanese, or French.
  • Step 202 The machine translation system determines a second sample translation set according to the first sample document set, where the second sample translation set includes a sample translation corresponding to each sample document.
  • the second sample translation set is a set of sample translations obtained by the machine translation system for translating each sample document and recommending the sample translation.
  • This step can be implemented by the following steps 2021-2024, including:
  • Step 2021 The machine translation system converts each sample document in the first sample document set into a plurality of sample translation sets through a plurality of machine translation devices.
  • One sample translation set includes at least one sample translation of the target language obtained by one machine translation device translating each sample document; that is, for each machine translation device, the machine translation device converts each sample document in the first sample document set into at least one sample translation, and the converted sample translations are composed into one sample translation set.
  • the first sample document set includes a sample document A, a sample document B, and a sample document C;
  • the machine translation devices are a neural network translation device and a statistical translation device; the neural network translation device converts the sample document A, the sample document B, and the sample document C into sample translations of the target language, obtaining a sample translation A1, a sample translation B1, and a sample translation C1, which are composed into a sample translation set 1; the statistical translation device converts the sample document A,
  • the sample document B, and the sample document C into sample translations of the target language, obtaining a sample translation A2, a sample translation B2, and a sample translation C2, which are composed into a sample translation set 2.
  • Step 2022 The machine translation system determines feature values of each preset feature of each sample translation in the plurality of sample translation sets, respectively.
  • the preset features include a first type of preset features and a second type of preset features.
  • The first type of preset features is used to assess the fluency of the sample translation; the second type of preset features is used to assess the fidelity of the sample translation.
  • the first type of preset features includes a translation language model and/or a reordering model.
  • the second type of preset features includes unregistered words, reconstruction, translation length, coverage, and/or lexicalization probability. Correspondingly, this step can be:
  • the machine translation system extracts feature values of each of the first type of preset features of each sample translation by using an extraction algorithm of each first type of preset feature; and/or an extraction algorithm by each second preset feature And extracting feature values of each of the second type of preset features of each sample translation.
  • The machine translation system composes the feature values of each first type of preset feature of each sample translation and/or the feature values of each second type of preset feature of each sample translation into the feature values of each preset feature of each sample translation.
  • the step of the machine translation system acquiring the feature values of the preset features of the sample translation may be:
  • The machine translation system obtains a translation language model score for the sample translation, where the higher the translation language model score, the more fluent the translation and the better its quality.
  • the step of the machine translation system acquiring the feature value of the preset feature of the sample translation may be:
  • the machine translation system obtains a reordering model score for the sample translation.
  • A major problem of statistical translation devices is that reordering is difficult, so the translation is usually a sequential splicing of segments and reads like machine output, whereas the neural network translation device does a better job in this respect and produces a smoother translation. Therefore, a reordering model score is obtained for the sample translation; in general, the higher the reordering model score, the better the translation quality.
  • the step of the machine translation system acquiring the feature value of the preset feature of the sample translation may be:
  • the machine translation system obtains the number of unregistered words in the sample translation.
  • An unregistered word refers to an untranslated word; unregistered words are a significant problem for neural network translation devices, and they are generally caused by uncommon words in the sample document, which appear so infrequently that they are difficult for the machine translation system to translate. The unregistered-word problem is more serious in neural network translation devices. In general, the more unregistered words appear in the sample translation, the worse the quality of the sample translation.
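  • The following is an illustrative sketch (not taken from the patent) of counting unregistered words in a sample translation; it assumes, purely for illustration, that untranslated words surface either as <unk> placeholders or as tokens that still contain source-script (here CJK) characters.

        import re

        CJK = re.compile(r'[\u4e00-\u9fff]')  # source-script characters for a Chinese source document

        def unregistered_word_count(sample_translation):
            """Count target-side tokens that were left untranslated."""
            count = 0
            for token in sample_translation.split():
                if token == '<unk>' or CJK.search(token):
                    count += 1
            return count

        print(unregistered_word_count('the 气旋 moved <unk> toward the coast'))  # prints 2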
  • the feature value of the preset feature is a reconstruction score
  • the step of the machine translation system acquiring the feature value of the preset feature of the sample translation may be:
  • the sample translation includes at least one character of the target language; the machine translation system translates the sample translation into the source language to obtain a reconstructed document, the reconstructed document including at least one character of the source language; calculating the sample document and the reconstructed document The degree of similarity is determined as the reconstructed score of the sample translation.
  • The machine translation system re-translates the sample translation into the source language to obtain a reconstructed document, and obtains the reconstruction score of the sample translation from the similarity between the sample document and the reconstructed document.
  • The reconstruction score is a good indicator of the fidelity of the sample translation; in general, the higher the reconstruction score of the sample translation, the higher its fidelity and the better its quality.
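  • Below is a hedged sketch of the reconstruction score. The patent does not fix the similarity metric or the back-translation interface, so token-level Jaccard similarity is used here as a stand-in and back_translate is a hypothetical callable wrapping whatever target-to-source translation device is available.

        def reconstruction_score(sample_document, sample_translation, back_translate):
            """back_translate: any callable translating the target language back into the source language."""
            reconstructed = back_translate(sample_translation)
            src_tokens = set(sample_document.split())
            rec_tokens = set(reconstructed.split())
            if not src_tokens or not rec_tokens:
                return 0.0
            # Jaccard similarity between source document and reconstructed document (a stand-in metric)
            return len(src_tokens & rec_tokens) / len(src_tokens | rec_tokens)

        # Toy usage with a dummy "back-translator" just to show the call shape.
        print(reconstruction_score('the cat sat', 'le chat était assis', lambda t: 'the cat sat'))  # 1.0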
  • the preset feature includes a length of the translation, and the feature value of the preset feature is the translation length score, and the step of the machine translation system acquiring the feature value of the preset feature of the sample translation may be:
  • The machine translation system obtains the number of reference characters included in the sample translation according to the number of characters included in the sample document corresponding to the sample translation, and determines the difference between the number of characters included in the sample translation and the number of reference characters as the translation length score of the sample translation.
  • The machine translation system stores a correspondence between the number of characters included in a sample document and the number of reference characters included in the corresponding translation. Correspondingly, the step in which the machine translation system acquires the number of reference characters included in the sample translation according to the number of characters included in the sample document corresponding to the sample translation can be:
  • The machine translation system obtains, according to the number of characters included in the sample document corresponding to the sample translation, the number of reference characters included in the sample translation from the correspondence between the number of characters included in the sample document and the number of reference characters included in the translation.
  • The number of reference characters included in sample translations of different target languages may differ; therefore, the machine translation system may also take the target language into account when obtaining the number of reference characters included in the sample translation. Correspondingly, the step in which the machine translation system acquires the number of reference characters included in the sample translation according to the number of characters included in the sample document corresponding to the sample translation can be:
  • The machine translation system acquires the number of reference characters included in the sample translation according to the number of characters included in the sample document corresponding to the sample translation and the target language.
  • In this case, the machine translation system stores a correspondence among the number of characters included in the sample document, the target language, and the number of reference characters included in the translation. Correspondingly, the machine translation system obtains, according to the number of characters included in the sample document corresponding to the sample translation and the target language, the number of reference characters included in the sample translation from that correspondence.
  • Missing translations by the neural network translation device result in short sample translations, so the sample translation can be evaluated to some extent by its translation length. In general, for the same sample document, if the length of the sample translation obtained by the neural network translation device is close to the length of the sample translation obtained by the statistical translation device, the neural network translation device is less likely to have missed part of the translation.
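  • An illustrative sketch of the translation length score follows; the lookup table reference_length_table and its values are assumptions standing in for the stored correspondence between source character counts (optionally together with the target language) and reference character counts.

        reference_length_table = {
            # (source character count, target language) -> reference character count (illustrative values)
            (10, 'en'): 18,
            (20, 'en'): 35,
        }

        def translation_length_score(sample_document, sample_translation, target_language='en'):
            key = (len(sample_document), target_language)
            reference_chars = reference_length_table.get(key, len(sample_document))  # fallback guess
            # Difference between the translation's character count and the reference character count
            return len(sample_translation) - reference_chars

        print(translation_length_score('0123456789', 'a translation of it', 'en'))  # 19 - 18 = 1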
  • the feature value of the preset feature is the value of the coverage rate
  • the step of the machine translation system acquiring the feature value of the preset feature of the sample translation may be:
  • The machine translation system obtains a first word number and a second word number, where the first word number is the number of words included in the sample document and the second word number is the number of words in the sample document that have been translated; the machine translation system calculates the ratio of the second word number to the first word number and determines the ratio as the coverage of the sample translation.
  • The coverage is the proportion of the sample document that has been translated; this feature is designed to address the missing-translation phenomenon that frequently occurs with neural network translation devices. In general, the higher the coverage of the sample translation, the better the quality of the sample translation.
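  • A toy sketch of the coverage feature follows; deciding whether a source word has been translated requires an alignment or dictionary resource that the patent does not specify, so a small source-to-target dictionary is assumed here purely for illustration.

        def coverage(sample_document_words, sample_translation, dictionary):
            """dictionary: toy source-word -> set of acceptable target words (stands in for an alignment)."""
            if not sample_document_words:
                return 0.0
            target_tokens = set(sample_translation.lower().split())
            translated = sum(1 for w in sample_document_words if dictionary.get(w, set()) & target_tokens)
            # Proportion of source words judged to have been translated
            return translated / len(sample_document_words)

        toy_dict = {'猫': {'cat'}, '坐在': {'sat', 'sits'}, '垫子上': {'mat'}}
        print(coverage(['猫', '坐在', '垫子上'], 'the cat sat on the mat', toy_dict))  # 1.0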
  • the feature value of the preset feature is the value of the lexicalization probability
  • the step of the machine translation system acquiring the feature value of the preset feature of the sample translation may be:
  • The machine translation system calculates the degree of matching between the sample document and the sample translation, and determines the degree of matching as the lexicalization probability of the sample translation.
  • Alternatively, the machine translation system translates the sample translation back into the source language to obtain a reconstructed document, the reconstructed document including at least one character of the source language, calculates the coverage of the sample translation and the coverage of the reconstructed document, and determines the sum of the two coverages as the lexicalization probability of the sample translation.
  • Step 2023 The machine translation system determines the recommendation degree of each sample translation according to the feature value of each preset feature of each sample translation, the initial feature weight of each preset feature, and the initial feature offset.
  • The machine translation system determines the recommendation degree of each sample translation by a preset recommendation degree algorithm according to the feature value of each preset feature of the sample translation, the initial feature weight of each preset feature, and the initial feature offset.
  • the preset recommendation degree algorithm may be set and changed as needed.
  • The preset recommendation degree algorithm is not specifically limited; for example, the preset recommendation degree algorithm may be a MultiLayer Perceptron (MLP) algorithm, an artificial neural network (ANN), or the like.
  • this step can be:
  • The machine translation system determines the recommendation degree of each sample translation by the following Formula 1 according to the feature value of each preset feature of the sample translation, the initial feature weight of each preset feature, and the initial feature offset.
  • f(x) is the recommendation degree of the sample translation
  • x is the feature value of the preset features
  • b(1) and b(2) are respectively the initial feature weights of the two preset features, and W(1) and W(2) are respectively the initial feature offsets of the two preset features.
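  • The image of Formula 1 is not reproduced in this text, so the exact expression cannot be recovered here; the sketch below assumes the conventional two-layer perceptron form f(x) = w2·tanh(W1·x + b1) + b2 over the vector of preset-feature values, and the parameter names, sizes, and tanh activation are assumptions rather than details taken from the patent.

        import numpy as np

        def recommendation_degree(x, weights_1, offsets_1, weights_2, offsets_2):
            """Score one translation from its preset-feature values with a two-layer perceptron."""
            hidden = np.tanh(weights_1 @ x + offsets_1)      # first layer
            return float(weights_2 @ hidden + offsets_2)     # second layer -> scalar recommendation degree

        rng = np.random.default_rng(0)
        x = rng.random(5)                           # feature values of one sample translation
        W1, b1 = rng.random((4, 5)), rng.random(4)  # 4 hidden units, 5 features: illustrative sizes only
        w2, b2 = rng.random(4), rng.random()
        print(recommendation_degree(x, W1, b1, w2, b2))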
  • Step 2024 The machine translation system determines a second sample translation set according to the recommendation degree of each sample translation.
  • For each sample document, the machine translation system selects, according to the recommendation degree of each sample translation corresponding to the sample document, the sample translation with the highest recommendation degree from the sample translations corresponding to that sample document, and
  • composes the highest-recommendation sample translation corresponding to each sample document into the second sample translation set.
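  • A minimal sketch of step 2024 under an assumed data layout: for each sample document, the candidate sample translation with the highest recommendation degree is kept, and the kept translations form the second sample translation set.

        def build_second_sample_translation_set(candidates):
            """candidates: {sample_document: [(sample_translation, recommendation_degree), ...]}"""
            return {doc: max(pairs, key=lambda p: p[1])[0] for doc, pairs in candidates.items()}

        print(build_second_sample_translation_set({
            'doc_A': [('translation A1', 1.0), ('translation A2', 21.0)],
            'doc_B': [('translation B1', 7.0), ('translation B2', 5.0)],
        }))  # {'doc_A': 'translation A2', 'doc_B': 'translation B1'}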
  • Step 203 The machine translation system determines a first error recommendation rate according to the first sample translation set and the second sample translation set.
  • This step can be implemented in the following first manner or the second manner; for the first implementation manner, this step can be:
  • the machine translation system determines a first sample number and a second sample number, the first sample number being the number of sample translations included in the first sample translation set (or the second sample translation set), and the second sample number being
  • the number of sample translations that differ between the first sample translation set and the second sample translation set; the ratio of the second sample number to the first sample number is determined as the first error recommendation rate.
  • this step can be implemented by the following steps 2031-2034, including:
  • Step 2031 The machine translation system determines the third sample translation set and the second sample document set according to the first sample translation set and the second sample translation set.
  • the third sample translation set includes different sample translations in the first sample translation set and the second sample translation set
  • the second sample document set includes the sample documents corresponding to those differing sample translations, that is, the second sample document set includes the sample document corresponding to each sample translation in the third sample translation set.
  • Step 2032 The machine translation system determines a recommendation coefficient of each sample document in the second sample document set according to a recommendation degree of each sample translation in the third sample translation set.
  • A plurality of sample translations in the third sample translation set may correspond to one sample document in the second sample document set. In this step, the machine translation system determines the recommendation weight of each sample document in the second sample document set according to the recommendation degrees of the sample translations in the third sample translation set. For each sample document in the second sample document set, the machine translation system determines the ratio of the recommendation weight of the sample document to the preset recommendation degree to obtain the recommendation degree ratio of the sample document, and selects the minimum of the recommendation degree ratio and the preset recommendation weight as the recommendation coefficient of the sample document.
  • the preset recommendation degree and the preset recommendation weight may be set and changed as needed.
  • the preset recommendation degree and the preset recommendation weight are not specifically limited.
  • the preset recommendation degree is 40 or 20, and the preset recommendation weight may be 0.8 or 1 or the like.
  • For example, the machine translation devices include a neural network translation device and a statistical translation device; the preset recommendation degree is 40 and the preset recommendation weight is 1; the sample translations corresponding to the sample document A are the sample translation A1 and the sample translation A2; the recommendation degree of the sample translation A1 is 1 and the recommendation degree of the sample translation A2 is 21. The machine translation system calculates the difference between the recommendation degree of the sample translation A1 and that of the sample translation A2, obtaining a recommendation degree difference of 20; determines the ratio of the recommendation degree difference to the preset recommendation degree as 0.5; and selects 0.5, the minimum of 0.5 and the preset recommendation weight 1, as the recommendation coefficient of the sample document.
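  • The worked example above can be reproduced with the following sketch; treating the recommendation-degree difference as the recommendation weight is this sketch's reading of the text, and the function name and default values are illustrative.

        def recommendation_coefficient(degree_1, degree_2,
                                       preset_recommendation_degree=40,
                                       preset_recommendation_weight=1.0):
            recommendation_weight = abs(degree_1 - degree_2)              # |1 - 21| = 20 in the example
            ratio = recommendation_weight / preset_recommendation_degree  # 20 / 40 = 0.5
            return min(ratio, preset_recommendation_weight)               # min(0.5, 1) = 0.5

        print(recommendation_coefficient(1, 21))  # 0.5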
  • By default, each translation device is treated as equally important; however, if the recommendation degrees of the sample translations obtained by the different translation devices differ little, a wrong classification has little effect on the recommendation result, whereas if the recommendation degrees of the sample translations obtained by the different translation devices differ greatly, a wrong classification has a large effect on the recommendation result. Therefore, in the embodiment of the present disclosure, the machine translation system determines the recommendation coefficient of each sample document and subsequently combines the recommendation coefficient of each sample document when determining the reference feature weight and the reference feature offset, which improves the accuracy of the determined reference feature weight and reference feature offset.
  • Step 2033 The machine translation system determines a sample number ratio between the first sample number and the second sample number.
  • the machine translation system acquires the first sample number and the second sample number, and determines a sample number ratio between the first sample number and the second sample number.
  • the first sample number is the number of sample documents included in the second sample document set
  • the second sample number is the number of sample documents included in the first sample document set.
  • Step 2034 The machine translation system determines a product of a sample number ratio value and a recommendation coefficient of each sample document in the second sample document set to obtain a first error recommendation rate.
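  • A hedged sketch of steps 2031 to 2034 follows. The translation is ambiguous about how the sample number ratio and the recommendation coefficients are combined, so the sketch takes the wording literally as a product; the dict-based data layout is an assumption.

        def first_error_recommendation_rate(reference_translations, recommended_translations,
                                            recommendation_coefficients):
            """All arguments are dicts keyed by sample document (an assumed data layout)."""
            wrong_docs = [doc for doc, ref in reference_translations.items()
                          if recommended_translations.get(doc) != ref]    # second sample document set
            rate = len(wrong_docs) / len(reference_translations)          # sample number ratio
            for doc in wrong_docs:
                rate *= recommendation_coefficients[doc]                  # literal "product" reading
            return rate

        print(first_error_recommendation_rate(
            {'doc_A': 'ref A', 'doc_B': 'ref B'},
            {'doc_A': 'translation A2', 'doc_B': 'ref B'},
            {'doc_A': 0.5}))  # (1/2) * 0.5 = 0.25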
  • Step 204 The machine translation system determines the reference feature weight and the reference feature offset of each preset feature according to the first error recommendation rate, the initial feature weight of each preset feature, and the initial feature offset.
  • the machine translation system determines whether the first error recommendation rate satisfies a preset condition; if the first error recommendation rate satisfies a preset condition, determining an initial feature weight and an initial feature offset of each preset feature as each preset respectively The baseline feature weight of the feature and the reference feature offset.
  • If the first error recommendation rate does not satisfy the preset condition, the initial feature weight and the initial feature offset of each preset feature are updated by a preset iterative algorithm, a second error recommendation rate is determined according to the updated initial feature weight and the updated initial feature offset, and it is determined whether the second error recommendation rate satisfies the preset condition. If the second error recommendation rate satisfies the preset condition, the feature weight and the feature offset of each preset feature at this time are respectively determined as the reference feature weight and the reference feature offset of each preset feature. If the second error recommendation rate does not satisfy the preset condition, the initial feature weight and the initial feature offset of each preset feature are updated again until the second error recommendation rate satisfies the preset condition.
  • The preset condition may be that the error recommendation rate is lower than a first preset threshold, or that the difference between two successively obtained error recommendation rates is lower than a second preset threshold; the first preset threshold and the second preset threshold may be equal.
  • the first preset threshold and the second preset threshold may be set and changed as needed.
  • the first preset threshold and the second preset threshold are not specifically limited.
  • the first preset threshold is 0.2 or 0.3
  • the second preset threshold may be 0.1 or 0.15 or the like.
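  • The "preset iterative algorithm" is not specified in this text; as a clearly labeled stand-in, the sketch below updates the flattened feature weights and offsets by random perturbation, keeping an update whenever it lowers the error recommendation rate, and stops once the rate drops below a preset threshold or an iteration budget runs out.

        import random

        def fit_reference_parameters(initial_params, error_rate_fn,
                                     threshold=0.2, step=0.05, max_iters=1000, seed=0):
            """initial_params: flat list of all feature weights and offsets; error_rate_fn: params -> rate."""
            rng = random.Random(seed)
            best, best_rate = list(initial_params), error_rate_fn(initial_params)
            for _ in range(max_iters):
                if best_rate <= threshold:                       # preset condition met
                    break
                candidate = [p + rng.uniform(-step, step) for p in best]
                rate = error_rate_fn(candidate)
                if rate < best_rate:                             # keep updates that reduce the error rate
                    best, best_rate = candidate, rate
            return best, best_rate

        # Toy usage: drive a single parameter toward 1.0 with a synthetic error-rate function.
        print(fit_reference_parameters([0.0], lambda p: abs(1.0 - p[0])))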
  • The reference feature weight and the reference feature offset of each preset feature are obtained by training on the first sample document set and the first sample translation set, which improves the accuracy of the determined reference feature weight and reference feature offset of each preset feature.
  • Embodiments of the present disclosure provide a machine translation method, which is applied in a machine translation system.
  • the method includes:
  • Step 301 The machine translation system acquires a source document to be translated, and the source document includes at least one character of the source language.
  • When the user needs to translate a source document, the user terminal sends the source document to be translated to the machine translation system, and the machine translation system receives the source document sent by the user terminal.
  • Step 302 The machine translation system converts the source document into a plurality of target documents by using a plurality of machine translation devices.
  • a machine translation device is used to translate the source document into a target document; the target document includes at least one character of the target language, and the source language and the target language are different.
  • Step 303 The machine translation system respectively determines feature values of each preset feature of each target document, wherein the feature value of any preset feature of any target document is used to evaluate the fluency and/or fidelity of that target document.
  • the preset features include a first type of preset features and a second type of preset features, the first type of preset features is used to evaluate the fluency of the target document, and the second type of preset features is used to evaluate the fidelity of the target document;
  • the machine translation system extracts the feature values of each of the first type of preset features of each target translation by using an extraction algorithm of each of the first type of preset features; and/or, respectively, through each of the second types of preset features Extracting an algorithm to extract feature values of each second type of preset feature of each target translation;
  • The feature values of each first type of preset feature of each target translation and/or the feature values of each second type of preset feature of each target translation constitute the feature values of each preset feature of each target document.
  • Step 304 The machine translation system determines the recommendation degree of each target document by using a preset recommendation degree algorithm according to the feature value of each preset feature of each target document, the feature weight of each preset feature, and the feature offset.
  • This process is the same as the process in which the machine translation system determines the recommendation degree of each sample translation according to the feature value of each preset feature of each sample translation, the initial feature weight of each preset feature, and the initial feature offset, and is not described here again.
  • Step 305 The machine translation system outputs the target document with the highest recommendation according to the recommendation degree of each target document.
  • the machine translation system selects the target document with the highest recommendation from each target document according to the recommendation degree of each target document, and outputs the target document with the highest recommendation degree.
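  • An end-to-end sketch of steps 301 to 305 under assumed interfaces: each machine translation device is any callable mapping a source document to a target document, extract_features maps a target document to its preset-feature values, and score is a recommendation-degree function such as the two-layer perceptron sketched earlier; none of these names come from the patent.

        def translate_and_recommend(source_document, translation_devices, extract_features, score):
            candidates = [device(source_document) for device in translation_devices]   # step 302
            degrees = [score(extract_features(doc)) for doc in candidates]             # steps 303-304
            best = max(range(len(candidates)), key=lambda i: degrees[i])               # step 305
            return candidates[best]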
  • In the embodiment of the present disclosure, the source document is converted into a plurality of target documents by a plurality of machine translation apparatuses, the recommendation degree of each target document is determined, and the target document with the highest recommendation degree is output. Because the target documents are produced by a plurality of machine translation apparatuses and the output is selected according to the recommendation degree of each target document, the accuracy of machine translation is improved.
  • the embodiment of the present disclosure provides a machine translation apparatus.
  • the apparatus includes an acquisition unit 401, a translation unit 402, a determination unit 403, and an output unit 404.
  • the obtaining unit 401 is configured to perform the above steps 201 and 301 and its alternatives.
  • the translation unit 402 is configured to perform the above step 302 and its alternatives.
  • the determining unit 403 is configured to perform the above steps 202, 203, 204, 303, and 304 and their alternatives.
  • the output unit 404 is configured to perform the above step 305 and its alternatives.
  • In the embodiment of the present disclosure, the source document is converted into a plurality of target documents by a plurality of machine translation apparatuses, the recommendation degree of each target document is determined, and the target document with the highest recommendation degree is output. Because the target documents are produced by a plurality of machine translation apparatuses and the output is selected according to the recommendation degree of each target document, the accuracy of machine translation is improved.
  • The machine translation apparatus provided by the foregoing embodiment is illustrated only by the division of the functional modules described above; in practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above.
  • The machine translation apparatus provided by the foregoing embodiment and the machine translation method embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, and details are not described herein again.
  • FIG. 5 is a block diagram of a machine translation apparatus 500, according to an exemplary embodiment.
  • device 500 can be provided as a server.
  • apparatus 500 includes a processing component 522 that further includes one or more processors, and memory resources represented by memory 532 for storing instructions executable by processing component 522, such as an application.
  • An application stored in memory 532 can include one or more modules each corresponding to a set of instructions.
  • processing component 522 is configured to execute instructions to perform the machine translation method described above.
  • Apparatus 500 can also include a power supply component 526 configured to perform power management of apparatus 500, a wired or wireless network interface 550 configured to connect apparatus 500 to the network, and an input/output (I/O) interface 558.
  • Device 500 can operate based on an operating system stored in the memory 532, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
  • the embodiment of the present disclosure provides a system chip, which is applied to a machine translation system.
  • the system chip includes: an input/output interface 601, at least one processor 602, a memory 603, and a bus 604.
  • the input/output interface 601 is coupled to the at least one processor 602 and the memory 603 through the bus 604; the input/output interface 601 is used to acquire the source document to be translated and to output the target document, and the at least one processor 602 executes the instructions stored in the memory 603, so that the machine translation system performs the machine translation method described above.
  • The processor in each of the above embodiments may be a central processing unit (CPU), a general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure.
  • the processor may also be a combination of computing functions, for example, including one or more microprocessor combinations, a combination of a DSP and a microprocessor, and the like.
  • The steps of a method or algorithm described in connection with the present disclosure may be implemented directly in hardware, or by a processor executing software instructions.
  • the software instructions may consist of corresponding software modules, which may be stored in RAM, flash memory, ROM, EPROM, EEPROM, a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art.
  • An exemplary storage medium is coupled to the processor to enable the processor to read information from, and write information to, the storage medium.
  • the storage medium can also be an integral part of the processor.
  • the processor and the storage medium can be located in an ASIC. Additionally, the ASIC can be located in the terminal.
  • the processor and the storage medium can also exist as discrete components in the terminal.
  • the functions described herein can be implemented in hardware, software, firmware, or any combination thereof.
  • the functions may be stored in a computer readable medium or transmitted as one or more instructions or code on a computer readable medium.
  • Computer readable media include both computer storage media and communication media, where communication media include any medium that facilitates transfer of a computer program from one location to another.
  • a storage medium may be any available media that can be accessed by a general purpose or special purpose computer.
  • a person skilled in the art may understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing related hardware; the program may be stored in a computer readable storage medium.
  • the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

本公开提供了一种机器翻译方法、装置及存储介质,属于通信网络技术领域。所述方法包括:获取待翻译的源文档,该源文档包括源语种的至少一个字符;分别通过多个机器翻译装置,将该源文档转换为多个目标文档,其中,一个机器翻译装置用于将该源文档翻译为一个目标文档,目标文档包括目标语种的至少一个字符,源语种和目标语种不同;分别确定每个目标文档的每个预设特征的特征值;根据每个目标文档的每个预设特征的特征值,确定每个目标文档的推荐度;根据每个目标文档的推荐度,输出推荐度最高的目标文档。本公开由于通过多个机器翻译装置翻译目标文档,根据每个目标文档的推荐度,输出目标文档,从而提高了机器翻译的准确性。

Description

机器翻译方法、装置及存储介质
本申请要求于2017年05月26日提交中国国家知识产权局、申请号为201710386617.5、发明名称为“机器翻译方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本公开涉及通信网络技术领域,特别涉及一种机器翻译方法、装置及存储介质。
背景技术
随着社会科技与经济的高速发展,不同语种之间的信息交流已经成为信息交往中的重要组成部分,随之而来的对各种语言服务,尤其是翻译服务的需求也越来越广泛。然而目前翻译人员尤其是高端翻译人员严重紧缺;并且,翻译人员在进行翻译时,需要花费大量的时间来查询和翻译专业词汇,导致翻译的效率低以及成本高。因此,机器翻译作为一种自动翻译方法,已经成为辅助人工翻译的重要工具;其中,机器翻译是指通过机器翻译装置进行自动翻译的翻译方法。
目前,机器翻译装置包括统计机器翻译装置和神经网络机器翻译装置。现有技术中进行机器翻译时,通过统计机器翻译装置进行翻译,或者通过神经网络机器翻译装置进行翻译。其中,通过统计机器翻译装置进行翻译的过程可以为:将待翻译的源文档拆分成至少一个短语,分别对每个短语进行翻译,得到每个译文片段,将每个译文片段拼接成目标文档。通过神经网络机器翻译装置进行翻译的过程可以为:将待翻译的源文档中的每个句子向量化,将向量化后的每个句子在网络中层层传递,转化为计算机可以理解的表示形式,再经过多层复杂的传导运算,生成目标文档。
在实现本公开的过程中,发明人发现现有技术至少存在以下问题:
统计机器翻译装置是将每个译文片段拼接成目标文档,导致目标文档流畅度低;而神经网络机器翻译装置生成的译文不能完全反映源文档的意思,经常出现遗漏翻译或者过度翻译等情况,导致翻译的忠实度低。由此可见,上述机器翻译方法的准确性差。
发明内容
为了解决现有技术的问题,本公开实施例提供了一种机器翻译方法、装置及存储介质。所述技术方案如下:
第一方面,提供了一种机器翻译方法,所述方法包括:
获取待翻译的源文档,所述源文档包括源语种的至少一个字符;
分别通过多个机器翻译装置,将所述源文档转换为多个目标文档,其中,一个机器翻译装置用于将所述源文档翻译为一个目标文档,所述目标文档包括目标语种的至少一个字符,所述源语种和所述目标语种不同;
分别确定每个目标文档的每个预设特征的特征值，其中，任一目标文档的任一所述预设特征的特征值用于评估所述任一目标文档的流畅度和/或忠实度；
根据所述每个目标文档的每个预设特征的特征值,确定所述每个目标文档的推荐度;
根据所述每个目标文档的推荐度,输出推荐度最高的目标文档。
在本公开实施例中,通过多个机器翻译装置,将源文档转换为目标文档,确定每个目标文档的推荐度,输出推荐度最高的目标文档;由于通过多个机器翻译装置翻译目标文档,根据每个目标文档的推荐度,输出目标文档,从而提高了机器翻译的准确性。
在一种可能的实现方式中,所述根据所述每个目标文档的每个预设特征的特征值,确定所述每个目标文档的推荐度,包括:
分别根据所述每个目标文档的每个预设特征的特征值、所述每个预设特征的基准特征权重和基准特征偏置,通过预设推荐度算法,确定所述每个目标文档的推荐度,所述每个预设特征的基准特征权重和基准特征偏置为根据第一样本文档集合和第一样本译文集合训练得到的,所述第一样本文档集合包括待翻译的至少一个样本文档,所述第一样本译文集合包括每个样本文档对应的参考译文。
在本公开实施例中,根据每个目标文档的每个预设特征的特征值、每个预设特征的基准特征权重和基准特征偏置,通过预设推荐度算法,确定每个目标文档的推荐度。由于结合了每个预设特征的基准特征权重和基准特征偏置,因此,可以提高确定出的每个目标文档的推荐度,进而根据每个目标文档的推荐度,输出目标文档,提高了机器翻译的准确性。
在一种可能的实现方式中,所述根据所述每个目标文档的每个预设特征的特征值,确定所述每个目标文档的推荐度之前,所述方法还包括:
获取所述第一样本文档集合和所述第一样本译文集合;
根据所述第一样本文档集合,确定第二样本译文集合,所述第二样本译文集合包括所述每个样本文档对应的样本译文;
根据所述第一样本译文集合和所述第二样本译文集合,确定第一错误推荐率;
根据所述第一错误推荐率、所述每个预设特征的初始特征权重和初始特征偏置,确定所述每个预设特征的基准特征权重和基准特征偏置。
在本公开实施例中,通过第一样本文档集合和第一样本译文集合,训练出每个预设特征的基准特征权重和基准特征偏置,提高了确定出的每个预设特征的基准特征权重和基准特征偏置的准确性。
在一种可能的实现方式中,所述根据所述第一样本文档集合,确定第二样本译文集合,包括:
分别通过所述多个机器翻译装置,将所述第一样本文档集合中的每个样本文档转换为多个样本译文集合,其中,一个样本译文集合包括一个机器翻译装置将所述每个样本文档翻译为所述目标语种的至少一个样本译文;
分别确定所述多个样本译文集合中的每个样本译文的每个预设特征的特征值;
根据所述每个样本译文的每个预设特征的特征值、所述每个预设特征的初始特征权重和初始特征偏置,确定所述每个样本译文的推荐度;
根据所述每个样本译文的推荐度,确定所述第二样本译文集合。
在一种可能的实现方式中,所述根据所述第一错误推荐率、所述每个预设特征的初始特征权重和初始特征偏置,确定所述每个预设特征的基准特征权重和基准特征偏置,包括:
如果所述第一错误推荐率满足预设条件,将所述每个预设特征的初始特征权重、初始特征偏置分别确定为所述每个预设特征的基准特征权重和基准特征偏置;或者,
如果所述第一错误推荐率不满足预设条件,通过预设迭代算法,更新所述每个预设特征的初始特征权重和初始特征偏置,直到第二错误推荐率满足预设条件,所述第二错误推荐率为根据更新后的初始特征权重和更新后的初始特征偏置确定得到的,将所述第二错误推荐率满足预设条件时的特征权重和特征偏置确定为所述每个预设特征的基准特征权重和基准特征偏置。
在本公开实施例中,根据第一错误推荐率和预设迭代算法,确定每个预设特征的基准特征权重和基准特征偏置,提高了确定出的每个预设特征的基准特征权重和基准特征偏置的准确性。
在一种可能的实现方式中,所述根据所述第一样本译文集合和第二样本译文集合,确定第一错误推荐率,包括:
根据所述第一样本译文集合和所述第二样本译文集合,确定第三样本译文集合和第二样本文档集合,所述第三样本译文集合包括所述第一样本译文集合和所述第二样本译文集合中不同的样本译文,所述第二样本文档集合包括所述不同的样本译文对应的样本文档;
根据所述第三样本译文集合中的每个样本译文的推荐度,确定所述第二样本文档集合中的每个样本文档的推荐系数;
确定第一样本数目和第二样本数目之间的样本数目比值,所述第一样本数目为所述第二样本文档集合包括的样本文档的数目,所述第二样本数目为所述第一样本文档集合包括的样本文档的数目;
确定所述样本数目比值与所述第二样本文档集合中的每个样本文档的推荐系数的乘积,得到所述第一错误推荐率。
在本公开实施例中,结合确定第一样本数目和第二样本数目之间的样本数目比值以及第二样本文档集合中的每个样本文档的推荐度,确定第一错误推荐率,提高了确定出的第一错误推荐率的准确性。
在一种可能的实现方式中,所述根据所述第三样本译文文档集合中的每个样本译文文档的推荐度,确定所述第二样本文档集合中的每个样本文档的推荐系数,包括:
根据所述第三样本译文集合中的每个样本译文的推荐度,确定所述第二样本文档集合中的每个样本文档的推荐权重;
确定所述第二样本文档集合中的每个样本文档的推荐权重和预设推荐度的比值,得到所述第二样本文档集合中的每个样本文档的推荐度比值;
对于所述第二样本文档集合中的每个样本文档,确定所述样本文档的推荐权重和预设推荐度的比值,得到所述样本文档的推荐度比值,从所述样本文档的推荐度比值和预设推荐权重中选择最小值作为所述样本文档的推荐系数。
在本公开实施例中,根据第二样本文档集合中的每个样本文档的推荐度比值和预设推荐权重,确定每个样本文档的推荐系数,提高了确定出的每个样本文档的推荐系数的准确性。
在一种可能的实现方式中，所述预设特征包括第一类预设特征和/或第二类预设特征，所述第一类预设特征用于评估目标文档的流畅度，所述第二类预设特征用于评估所述目标文档的忠实度；
所述分别确定每个目标译文的每个预设特征的特征值,包括:
分别通过每个第一类预设特征的提取算法,提取所述每个目标译文的每个第一类预设特征的特征值;和/或,分别通过每个第二类预设特征的提取算法,提取所述每个目标译文的每个第二类预设特征的特征值;
将所述每个目标译文的每个第一类预设特征的特征值和/或所述每个目标译文的每个第二类预设特征的特征值组成所述每个目标特征的每个预设特征的特征值。
在本公开实施例中,预设特征包括第一类预设特征和第二类预设特征,后续结合第一类预设特征和第二类预设特征,确定每个目标文档的推荐度,提高了确定出的每个目标文档的推荐度的准确性。
第二方面,提供了一种机器翻译装置,所述装置包括:
获取单元,用于获取待翻译的源文档,所述源文档包括源语种的至少一个字符;
翻译单元,用于分别通过多个机器翻译装置,将所述源文档转换为多个目标文档,其中,一个机器翻译装置用于将所述源文档翻译为一个目标文档,所述目标文档包括目标语种的至少一个字符,所述源语种和所述目标语种不同;
确定单元,用于分别确定每个目标文档的每个预设特征的特征值,其中,任一目标文档的任一所述预设特征的特征值用于评估所述任一目标文档的流畅度和/或忠实度;
所述确定单元，还用于根据所述每个目标文档的每个预设特征的特征值，确定所述每个目标文档的推荐度；
输出单元,用于根据所述每个目标文档的推荐度,输出推荐度最高的目标文档。
在一种可能的实现方式中,所述确定单元,还用于分别根据所述每个目标文档的每个预设特征的特征值、所述每个预设特征的基准特征权重和基准特征偏置,通过预设推荐度算法,确定所述每个目标文档的推荐度,所述每个预设特征的基准特征权重和基准特征偏置为根据第一样本文档集合和第一样本译文集合训练得到的,所述第一样本文档集合包括待翻译的至少一个样本文档,所述第一样本译文集合包括每个样本文档对应的参考译文。
在一种可能的实现方式中,所述装置还包括:
所述获取单元,还用于获取所述第一样本文档集合和所述第一样本译文集合;
所述确定单元,还用于根据所述第一样本文档集合,确定第二样本译文集合,所述第二样本译文集合包括所述每个样本文档对应的样本译文;
所述确定单元,还用于根据所述第一样本译文集合和所述第二样本译文集合,确定第一错误推荐率;
所述确定单元,还用于根据所述第一错误推荐率、所述每个预设特征的初始特征权重和初始特征偏置,确定所述每个预设特征的基准特征权重和基准特征偏置。
在一种可能的实现方式中,所述翻译单元,还用于分别通过所述多个机器翻译装置,将所述第一样本文档集合中的每个样本文档转换为多个样本译文集合,其中,一个样本译文集合包括一个机器翻译装置将所述每个样本文档翻译为所述目标语种的至少一个样本译文;
所述确定单元,还用于分别确定所述多个样本译文集合中的每个样本译文的每个预设 特征的特征值;根据所述每个样本译文的每个预设特征的特征值、所述每个预设特征的初始特征权重和初始特征偏置,确定所述每个样本译文的推荐度;根据所述每个样本译文的推荐度,确定所述第二样本译文集合。
在一种可能的实现方式中,所述确定单元,还用于如果所述第一错误推荐率满足预设条件,将所述每个预设特征的初始特征权重、初始特征偏置分别确定为所述每个预设特征的基准特征权重和基准特征偏置;或者,
所述确定单元,还用于如果所述第一错误推荐率不满足预设条件,通过预设迭代算法,更新所述每个预设特征的初始特征权重和初始特征偏置,直到第二错误推荐率满足预设条件,所述第二错误推荐率为根据更新后的初始特征权重和更新后的初始特征偏置确定得到的,将所述第二错误推荐率满足预设条件时的特征权重和特征偏置确定为所述每个预设特征的基准特征权重和基准特征偏置。
在一种可能的实现方式中,所述确定单元,还用于根据所述第一样本译文集合和所述第二样本译文集合,确定第三样本译文集合和第二样本文档集合,所述第三样本译文集合包括所述第一样本译文集合和所述第二样本译文集合中不同的样本译文,所述第二样本文档集合包括所述不同的样本译文对应的样本文档;根据所述第三样本译文集合中的每个样本译文的推荐度,确定所述第二样本文档集合中的每个样本文档的推荐系数;确定第一样本数目和第二样本数目之间的样本数目比值,所述第一样本数目为所述第二样本文档集合包括的样本文档的数目,所述第二样本数目为所述第一样本文档集合包括的样本文档的数目;确定所述样本数目比值与所述第二样本文档集合中的每个样本文档的推荐系数的乘积,得到所述第一错误推荐率。
在一种可能的实现方式中,所述确定单元,还用于根据所述第三样本译文集合中的每个样本译文的推荐度,确定所述第二样本文档集合中的每个样本文档的推荐权重;对于所述第二样本文档集合中的每个样本文档,确定所述样本文档的推荐权重和预设推荐度的比值,得到所述样本文档的推荐度比值;从所述样本文档的推荐度比值和预设推荐权重中选择最小值作为所述样本文档的推荐系数。
在一种可能的实现方式中,所述预设特征包括第一类预设特征和/或第二类预设特征,所述第一类预设特征用于评估目标文档的流畅度,所述第二类预设特征用于评估所述目标文档的忠实度;
所述确定单元,还用于分别通过每个第一类预设特征的提取算法,提取所述每个目标译文的每个第一类预设特征的特征值;和/或,分别通过每个第二类预设特征的提取算法,提取所述每个目标译文的每个第二类预设特征的特征值;
所述确定单元,还用于将所述每个目标译文的每个第一类预设特征的特征值和/或所述每个目标译文的每个第二类预设特征的特征值组成所述每个目标特征的每个预设特征的特征值。
第三方面，提供了一种机器翻译装置，所述装置包括：处理组件，其进一步包括一个或多个处理器，以及由存储器所代表的存储器资源，用于存储可由处理组件执行的指令，例如应用程序。存储器中存储的应用程序可以包括一个或一个以上的每一个对应于一组指令的模块。此外，处理组件被配置为执行指令，以执行上述第一方面所述的机器翻译方法。
第四方面,提供了一种系统芯片,所述系统芯片包括输入输出接口、至少一个处理器、存储器和总线;输入输出接口通过总线与至少一个处理器和存储器相连,输入输出接口用于获取待翻译的源文档以及输出目标文档,至少一个处理器执行存储器中存储的指令,使得机器翻译系统执行上述第一方面所述的机器翻译方法。
第五方面,提供了一种计算机可读存储介质,计算机可读存储介质上存储有计算机程序,所述程序被处理器执行时实现如第一方面任一实现方式所述的机器翻译方法。
本公开实施例提供的技术方案带来的有益效果是：在本公开实施例中，通过多个机器翻译装置，将源文档转换为多个目标文档，确定每个目标文档的推荐度，输出推荐度最高的目标文档；由于通过多个机器翻译装置翻译目标文档，根据每个目标文档的推荐度，输出目标文档，从而提高了机器翻译的准确性。
附图说明
图1是本公开实施例提供的机器翻译系统的示意图;
图2是本公开实施例提供的机器翻译方法流程图;
图3是本公开实施例提供的机器翻译方法流程图;
图4是本公开实施例提供的机器翻译装置结构示意图;
图5是本公开实施例提供的机器翻译装置的框图;
图6是本公开实施例提供的系统芯片的框图。
具体实施方式
为使本公开的目的、技术方案和优点更加清楚,下面将结合附图对本公开实施方式作进一步地详细描述。
上述所有可选技术方案,可以采用任意结合形成本公开的可选实施例,在此不再一一赘述。
本公开实施例提供了一种机器翻译系统,参见图1,该机器翻译系统包括:推荐装置10和多个机器翻译装置20;每个机器翻译装置20与推荐装置10连接。其中,每个机器翻译装置20与推荐装置10可以通过有线连接,也可以通过无线连接。
每个机器翻译装置20用于接收待翻译的源文档,将源文档转换为目标文档,并将目标文档发送至推荐装置10。其中,多个机器翻译装置20可以为多种类型的机器翻译装置,例如,多个机器翻译装置20包括统计机器翻译装置20或者神经网络机器翻译装置20。
推荐装置10用于接收每个机器翻译装置20发送的目标文档,分别确定每个目标文档的每个预设特征的特征值,其中,任一目标文档的任一预设特征的特征值用于评估任一目标文档的流畅度和/或忠实度,根据每个目标文档的每个预设特征的特征值,确定每个目标文档的推荐度。
推荐装置10还用于根据每个目标文档的推荐度,输出推荐度最高的目标文档。
其中,源文档包括源语种的至少一个字符,目标文档包括目标语种的至少一个字符,源语种和目标语种不同。源语种和目标语种都可以根据需要进行设置并更改,在本公开实施例中,对源语种不作具体限定。例如,源语种可以为汉语、英语、日语或者法语等。目标语种可以为英语、日语或者法语等。
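The recommendation flow described above can be summarized in a short sketch. The following Python fragment only illustrates the data flow (each translation apparatus 20 produces a candidate target document, the recommendation apparatus 10 extracts feature values and scores each candidate, and the highest-scoring candidate is output); the names `engines`, `extract_features` and `recommendation_score` are placeholders assumed for the sketch, not interfaces defined by this disclosure.

```python
from typing import Callable, List

def recommend_translation(
    source_doc: str,
    engines: List[Callable[[str], str]],                  # e.g. one SMT wrapper and one NMT wrapper
    extract_features: Callable[[str, str], List[float]],  # (source, candidate) -> preset feature values
    recommendation_score: Callable[[List[float]], float], # feature values -> recommendation degree
) -> str:
    """Translate with every engine, score every candidate, output the best one."""
    candidates = [engine(source_doc) for engine in engines]
    scored = [(recommendation_score(extract_features(source_doc, c)), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]
```

In this reading, the recommendation apparatus 10 corresponds to `extract_features` plus `recommendation_score`, and each machine translation apparatus 20 corresponds to one entry of `engines`.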
在本公开实施例中,推荐装置在确定每个目标文档的推荐度时,分别确定每个目标文档的每个预设特征的特征值,任一预设特征的特征值用于评估目标文档的流畅度和/或忠实度;根据每个目标文档的每个预设特征的特征值、每个预设特征的基准特征权重和基准特征偏置,通过预设推荐度算法,确定每个目标文档的推荐度。因此,在通过本公开实施例提供的机器翻译方法之前,机器翻译系统需要确定每个预设特征的基准特征权重和基准特征偏置。参见图2,机器翻译系统确定每个预设特征的基准特征权重和基准特征偏置的过程包括:
步骤201:机器翻译系统获取第一样本文档集合和第一样本译文集合,第一样本文档集合包括待翻译的至少一个样本文档,第一样本译文集合包括每个样本文档对应的参考译文。
为了训练推荐装置的参数(每个预设特征的基准特征权重和基准特征偏置),用户通过机器翻译系统翻译源文档之前,机器翻译系统获取样本数据,该样本数据包括第一样本文档集合和第一样本译文集合。第一样本文档集合包括待翻译的至少一个样本文档,第一样本译文集合包括第一样本集合中的每个样本文档对应的参考译文。其中,参考译文是指标准译文。
在本步骤之前,用户标注至少一个样本文档,向机器翻译系统输入至少一个样本文档,机器翻译系统接收用户输入的至少一个样本文档,将至少一个样本文档组成第一样本文档集合。
机器翻译系统获取第一样本文档集合之后,对于第一样本文档集合中的每个样本文档,分别通过多个机器翻译装置,将该样本文档转换为多个样本译文。对于每个样本文档对应的多个样本译文,用户根据该样本文档对应的多个样本译文,从该多个样本译文中标注参考译文;机器翻译系统获取用户标注的该样本文档的参考译文,将每个样本文档对应的参考译文组成第一样本译文集合。
需要说明的是,每个样本文档包括源语种的至少一个字符,样本译文包括目标语种的至少一个字符;源语种和目标语种不同。其中,源语种可以根据需要进行设置并更改,在本公开实施例中,对源语种不作具体限定;例如,源语种可以为汉语、英语、日语或者法语等。目标语种可以根据需要进行设置并更改,在本公开实施例中,对目标语种不作具体限定;例如,目标语种可以为英语、日语或者法语等。
步骤202:机器翻译系统根据第一样本文档集合,确定第二样本译文集合,第二样本译文集合包括每个样本文档对应的样本译文。
第二样本译文集合为机器翻译系统翻译每个样本文档并推荐样本译文得到的样本译文集合。本步骤可以通过以下步骤2021-2024实现,包括:
步骤2021:机器翻译系统通过多个机器翻译装置,将第一样本文档集合中的每个样本文档转换为多个样本译文集合。
一个样本译文集合包括一个机器翻译装置将每个样本文档翻译为目标语种的至少一个样本译文；对于每个机器翻译装置，该机器翻译装置将第一样本文档集合中的每个样本文档转换为至少一个样本译文，将转换得到的至少一个样本译文组成样本译文集合。
例如,第一样本文档集合中包括样本文档A、样本文档B和样本文档C;机器翻译装置分别为神经网络翻译装置和统计翻译装置;则神经网络翻译装置分别将样本文档A、样本文档B和样本文档C转换为目标语种的样本译文,得到样本译文A1、样本译文B1和样本译文C1,将样本译文A1、样本译文B1和样本译文C1组成样本译文集合1;统计翻译装置分别将样本文档A、样本文档B和样本文档C转换为目标语种的样本译文,得到样本译文A2、样本译文B2和样本译文C2,将样本译文A2、样本译文B2和样本译文C2组成样本译文集合2。
步骤2022:机器翻译系统分别确定多个样本译文集合中的每个样本译文的每个预设特征的特征值。
预设特征包括第一类预设特征和第二类预设特征。第一类预设特征用于评估样本译文的流畅度;第二类预设特征用于评估样本译文的忠实度。其中,第一类预设特征包括译文语言模型和/或调序模型等。第二类预设特征包括未登录词、重构、译文长度、覆盖率和/或词汇化概率等。相应的,本步骤可以为:
机器翻译系统分别通过每个第一类预设特征的提取算法,提取每个样本译文的每个第一类预设特征的特征值;和/或,通过每个第二预设特征的提取算法,提取每个样本译文的每个第二类预设特征的特征值。机器翻译系统将每个样本译文的每个第一类预设特征的特征值和/或每个样本译文的每个第二类预设特征的特征值组成每个样本译文的每个预设特征的特征值。
对于多个样本译文集合中的每个样本译文;预设特征包括译文语言模型时,机器翻译系统获取该样本译文的预设特征的特征值的步骤可以为:
机器翻译系统获取该样本译文的译文语言模型得分。其中,译文语言模型得分越高,译文越流畅,质量越好。
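The disclosure does not commit to a particular language model for this score. As a hedged illustration only, a length-normalized, add-one-smoothed bigram model could serve as the fluency feature, with higher (less negative) values indicating a more fluent translation; the counts and vocabulary would come from monolingual target-language text.

```python
import math
from collections import Counter
from typing import List

def bigram_lm_score(tokens: List[str],
                    bigram_counts: Counter,   # counts of (prev_word, word) pairs
                    unigram_counts: Counter,  # counts of single words
                    vocab_size: int) -> float:
    """Average log-probability of the translation under an add-one-smoothed bigram model."""
    if len(tokens) < 2:
        return 0.0
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        numerator = bigram_counts[(prev, word)] + 1
        denominator = unigram_counts[prev] + vocab_size
        log_prob += math.log(numerator / denominator)
    return log_prob / (len(tokens) - 1)
```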
预设特征包括调序模型时,机器翻译系统获取该样本译文的预设特征的特征值的步骤可以为:
机器翻译系统获取该样本译文的调序模型得分。其中,统计翻译装置的一个主要问题是调序困难,导致译文一般是顺序拼接,给人以机器翻译的感觉;而神经网络翻译装置这方面就做的很好,译文顺畅。所以通过获取样本译文的调序模型得分,调序模型得分越高,译文质量就越好。
预设特征包括未登录词时,机器翻译系统获取该样本译文的预设特征的特征值的步骤可以为:
机器翻译系统获取该样本译文中的未登录词数量。其中，未登录词是指未被翻译的词；未登录词是神经网络翻译装置的一个重要问题，未登录词一般是由样本文档中的不常见词引起的，该类词出现次数较少，很难被机器翻译系统翻译，而未登录词在神经网络翻译装置中问题更严重。一般来说，未登录词在样本译文中出现的数量越多，样本译文的质量越差。
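A minimal way to compute this feature, under the assumption (not stated in the disclosure) that untranslated words either appear as an explicit unknown marker or fall outside the target-language vocabulary:

```python
from typing import List, Set

def unknown_word_count(translation_tokens: List[str],
                       target_vocab: Set[str],
                       unk_marker: str = "<unk>") -> int:
    """Number of tokens in the candidate that were left untranslated."""
    return sum(1 for tok in translation_tokens
               if tok == unk_marker or tok not in target_vocab)
```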
预设特征包括重构时,预设特征的特征值即为重构得分,则机器翻译系统获取该样本译文的预设特征的特征值的步骤可以为:
该样本译文包括目标语种的至少一个字符;机器翻译系统将该样本译文翻译为源语种,得到重构文档,该重构文档包括源语种的至少一个字符;计算该样本文档和该重构文档之间的相似度,将该相似度确定为该样本译文的重构得分。
其中，机器翻译系统将该样本译文重新翻译为原文，得到重构文档，通过该样本文档和该重构文档的相似度，得到该样本译文的重构得分；重构得分是一种很好的评价样本译文忠实度的指标，一般来说，样本译文的重构得分越高，表示该样本译文的忠实度越高，质量越好。
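A sketch of the reconstruction score under two assumptions not fixed by the disclosure: the back-translation is performed by any engine for the reverse direction (`back_translate` below), and the similarity between the original and the reconstructed document is measured with `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher
from typing import Callable

def reconstruction_score(source_doc: str,
                         translation: str,
                         back_translate: Callable[[str], str]) -> float:
    """Back-translate the candidate into the source language and compare it with the original document."""
    reconstructed = back_translate(translation)                      # target language -> source language
    return SequenceMatcher(None, source_doc, reconstructed).ratio()  # similarity in [0, 1]
```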
预设特征包括译文长度,预设特征的特征值即为译文长度得分,则机器翻译系统获取该样本译文的预设特征的特征值的步骤可以为:
机器翻译系统根据该样本译文对应的样本文档包括的字符数,获取该样本译文包括的基准字符数,将该样本译文包括的字符数与该基准字符数之间的差值确定为该样本译文的译文长度得分。
机器翻译系统存储样本文档包括的字符数和译文文档包括的基准字符数的对应关系;相应的,机器翻译系统根据该样本译文对应的样本文档包括的字符数,获取该样本译文包括的基准字符数的步骤可以为:
机器翻译系统根据该样本译文对应的样本文档包括的字符数,从样本文档包括的字符数和译文文档包括的基准字符数的对应关系中获取该样本文档包括的基准字符数。
由于不同语种的样本译文包括的基准字符数可能不同;因此,机器翻译系统还可以结合目标语种,获取该样本译文包括的基准字符数;相应的,机器翻译系统根据该样本译文对应的样本文档包括的字符数,获取该样本译文包括的基准字符数的步骤可以为:
机器翻译系统根据该样本译文对应的样本文档包括的字符数和目标语种,获取该样本译文包括的基准字符数。
机器翻译系统存储样本文档包括的字符数、目标语种和译文文档包括的基准字符数的对应关系;相应的,机器翻译系统根据该样本译文对应的样本文档包括的字符数和目标语种,获取该样本译文包括的基准字符数的步骤可以为:
机器翻译系统根据该样本译文对应的样本文档包括的字符数和目标语种,从样本文档包括的字符数、目标语种和译文文档包括的基准字符数的对应关系中,获取该样本译文包括的基准字符数。
其中,针对神经网络翻译装置的遗漏翻译导致样本译文偏短的特点,在本公开实施例中,通过译文长度可以在一定程度上评估该样本译文是否存在漏译现象;一般情况下,对于同一样本文档,神经网络翻译装置翻译该样本文档得到的样本译文的译文长度与统计翻译装置翻译该样本文档得到的样本译文的译文长度相近时,神经网络翻译装置不太可能出现漏译现象。
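The length feature only needs the stored correspondence between the source document's character count (optionally together with the target language) and a baseline translation length. The dictionary layout and the fallback used below are assumptions made for this sketch.

```python
from typing import Dict, Tuple

def translation_length_score(source_doc: str,
                             translation: str,
                             target_lang: str,
                             baseline_table: Dict[Tuple[int, str], int]) -> int:
    """Difference between the candidate's character count and the stored baseline character count."""
    baseline = baseline_table.get((len(source_doc), target_lang), len(source_doc))
    return len(translation) - baseline
```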
预设特征包括覆盖率时,预设特征的特征值即为该覆盖率的值;则机器翻译系统获取该样本译文的预设特征的特征值的步骤可以为:
机器翻译系统获取第一词语数目和第二词语数目，第一词语数目为该样本文档包括的词语的数目，第二词语数目为样本文档中已翻译的词语数目；机器翻译系统计算第二词语数目与第一词语数目的比值，将该比值确定为该样本译文的覆盖率。
其中，覆盖率为样本文档被翻译的比例；该覆盖率是针对神经网络翻译装置经常出现的漏译现象设计的，一般来说，样本译文的覆盖率越高，样本译文的质量越好。
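The coverage feature is simply the translated share of the source words. How a source word is judged to be translated (for example via word alignments or attention weights) is not specified by the disclosure, so the sketch takes the set of covered source positions as given.

```python
from typing import List, Set

def coverage_rate(source_tokens: List[str], covered_positions: Set[int]) -> float:
    """Share of source words that received a translation (second word count / first word count)."""
    if not source_tokens:
        return 0.0
    covered = covered_positions & set(range(len(source_tokens)))
    return len(covered) / len(source_tokens)
```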
预设特征包括词汇化概率时,预设特征的特征值即为该词汇化概率的值;则机器翻译系统获取该样本译文的预设特征的特征值的步骤可以为:
机器翻译系统计算该样本文档和该样本译文之间的匹配度,将该匹配度确定为该样本译文的词汇率。
机器翻译系统将该样本译文翻译为源语种,得到重构文档,该重构文档包括源语种的至少一个字符;计算该样本译文的覆盖率以及该重构文档的覆盖率,将该样本译文的覆盖率和该重构文档的覆盖率之和确定为该样本译文的词汇率。
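Both readings of the lexicalization feature given above are sketched below. The bilingual lexicon used for the matching degree is an assumption, since the disclosure does not name a specific lexical model; the second variant simply adds the two coverage values, as the text states.

```python
from typing import Dict, List, Set

def lexical_match_degree(source_tokens: List[str],
                         translation_tokens: List[str],
                         lexicon: Dict[str, Set[str]]) -> float:
    """Variant 1: share of source words with at least one known translation present in the candidate."""
    if not source_tokens:
        return 0.0
    present = set(translation_tokens)
    matched = sum(1 for word in source_tokens if lexicon.get(word, set()) & present)
    return matched / len(source_tokens)

def lexical_score_from_coverage(translation_coverage: float,
                                reconstruction_coverage: float) -> float:
    """Variant 2: sum of the candidate's coverage and the reconstructed document's coverage."""
    return translation_coverage + reconstruction_coverage
```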
步骤2023:机器翻译系统根据每个样本译文的每个预设特征的特征值、每个预设特征的初始特征权重和初始特征偏置,确定每个样本译文的推荐度。
机器翻译系统根据每个样本文档的每个预设特征的特征值、每个预设特征的初始特征权重和初始特征偏置,通过预设推荐度算法,确定每个样本译文的推荐度。
预设推荐度算法可以根据需要进行设置并更改，在本公开实施例中，对预设推荐度算法不作具体限定；例如，预设推荐度算法可以为多层感知机算法（MultiLayer Perceptron，MLP）或者人工神经网络算法（Artificial Neural Network，ANN）等。
当预设推荐度算法为MLP时,本步骤可以为:
对于每个样本文档,机器翻译系统根据该样本文档的每个预设特征的特征值、每个预设特征的初始特征权重和初始特征偏置,通过以下公式一,确定每个样本文档的推荐度。
公式一：f(x)=G(b^(2)+W^(2)(s(b^(1)+W^(1)x)))
其中，f(x)为该样本译文的推荐度，x为预设特征的特征值；b^(1)和b^(2)分别为两个预设特征的初始特征权重，W^(1)和W^(2)分别为两个预设特征的初始特征偏置。
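Formula 1 (公式一) above is a standard single-hidden-layer perceptron. The NumPy transcription below treats W^(1) and W^(2) as matrices applied to the feature and hidden vectors and b^(1) and b^(2) as added vectors, and assumes tanh for the inner function s and softmax for the outer function G; those two choices are common MLP defaults, not requirements of the disclosure.

```python
import numpy as np

def recommendation_degree(x: np.ndarray,
                          W1: np.ndarray, b1: np.ndarray,
                          W2: np.ndarray, b2: np.ndarray) -> np.ndarray:
    """f(x) = G(b2 + W2 (s(b1 + W1 x))) with s = tanh and G = softmax (assumed)."""
    hidden = np.tanh(b1 + W1 @ x)            # s(b^(1) + W^(1) x)
    logits = b2 + W2 @ hidden                # b^(2) + W^(2) (...)
    shifted = np.exp(logits - logits.max())  # numerically stable softmax as G
    return shifted / shifted.sum()
```

Here x is the vector of preset-feature values of one candidate translation, and the output can be read as that candidate's recommendation degree.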
步骤2024:机器翻译系统根据每个样本译文的推荐度,确定第二样本译文集合。
对于每个样本文档,机器翻译系统根据该样本文档对应的每个样本译文的推荐度,从该样本文档对应的每个样本译文中选择推荐度最高的样本译文,将每个样本文档对应的推荐度最高的样本译文组成第二样本译文集合。
步骤203:机器翻译系统根据第一样本译文集合和第二样本译文集合,确定第一错误推荐率。
本步骤可以通过以下第一种方式或者第二种方式实现;对于第一种实现方式,本步骤可以为:
机器翻译系统确定第一样本数目和第二样本数目,第一样本数目为第一样本译文集合(或者第二样本译文集合)包括的样本译文的数目,第二样本数目为第一样本译文集合和第二样本译文集合中不相同的样本译文的数目;将第二样本数目和第一样本数目的比值确定为第一错误推荐率。
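A direct transcription of the first way, assuming the reference translations (first set) and the recommended translations (second set) are aligned lists over the same sample documents:

```python
from typing import List

def first_error_recommendation_rate(reference_translations: List[str],
                                    recommended_translations: List[str]) -> float:
    """Share of sample documents whose recommended translation differs from the reference translation."""
    assert len(reference_translations) == len(recommended_translations)
    if not reference_translations:
        return 0.0
    mismatches = sum(1 for ref, rec in zip(reference_translations, recommended_translations)
                     if ref != rec)
    return mismatches / len(reference_translations)
```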
对于第二种实现方式,本步骤可以通过以下步骤2031-2034实现,包括:
步骤2031:机器翻译系统根据第一样本译文集合和第二样本译文集合,确定第三样本译文集合和第二样本文档集合。
其中,第三样本译文集合包括第一样本译文集合和第二样本译文集合中不同的样本译文,第二样本文档集合包括不同的样本译文对应的样本文档,也即第二样本文档集合包括第三样本译文集合中的每个样本译文对应的样本文档。
步骤2032:机器翻译系统根据第三样本译文集合中的每个样本译文的推荐度,确定第二样本文档集合中的每个样本文档的推荐系数。
第三样本译文集合中的一个样本译文对应第二样本文档集合中的多个样本文档;在本步骤中,对于第三样本译文集合中的每个样本文档,机器翻译系统根据第三样本译文集合中的每个样本译文的推荐度,确定第二样本文档集合中的每个样本文档的推荐权重。对于第二样本文档集合中的每个样本文档,确定该样本文档的推荐权重和预设推荐度的比值,得到该样本文档的推荐度比值,从该样本文档的推荐度比值和预设推荐权重中选择最小值作为该样本文档的推荐系数。
预设推荐度和预设推荐权重可以根据需要进行设置并更改,在本公开实施例中,对预设推荐度和预设推荐权重不作具体限定。例如,预设推荐度为40或者20,预设推荐权重可以为0.8或者1等。
例如,机器翻译装置包括神经网络翻译装置和统计翻译装置;预设推荐度为40,预设推荐权重为1;样本文档A对应的样本译文分别为样本译文A1和样本译文A2;样本译文A1的推荐度为1,样本译文A2的推荐度为21,则机器翻译系统计算样本译文A1的推荐度和样本译文A2的推荐度之差,得到推荐度差值为20;确定推荐度差值和预设推荐度的比值为0.5,从0.5和1(预设推荐权重)中选择最小值0.5作为该样本文档的推荐系数。
在传统的分类模型中，每个翻译装置都是同等重要的；然而，如果每个翻译装置得到的样本译文的推荐度相差不大，此时即使分类错误，对推荐结果影响也不大；如果每个翻译装置得到的样本译文的推荐度相差较大，此时如果分类错误，对推荐结果影响较大；因此，在本公开实施例中，机器翻译系统确定每个样本文档的推荐系数，后续结合每个样本文档的推荐系数确定基准特征权重和基准特征偏置，提高了确定出的基准特征权重和基准特征偏置的准确性。
步骤2033:机器翻译系统确定第一样本数目和第二样本数目之间的样本数目比值。
机器翻译系统获取第一样本数目和第二样本数目,确定第一样本数目和第二样本数目之间的样本数目比值。其中,第一样本数目为第二样本文档集合包括的样本文档的数目,第二样本数目为第一样本文档集合包括的样本文档的数目。
步骤2034:机器翻译系统确定样本数目比值与第二样本文档集合中的每个样本文档的推荐系数的乘积,得到第一错误推荐率。
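A hedged sketch of the second way. Each wrongly recommended sample document is weighted by min(score gap / preset recommendation degree, preset recommendation weight), as in the worked example above (preset degree 40, preset weight 1); the final "product of the sample-number ratio and each document's recommendation coefficient" is read here as the ratio multiplied by the average coefficient, which is one plausible interpretation rather than the only one.

```python
from typing import List

def second_error_recommendation_rate(score_gaps: List[float],  # recommendation-degree gap per wrongly recommended document
                                     total_sample_docs: int,
                                     preset_degree: float = 40.0,
                                     preset_weight: float = 1.0) -> float:
    """Coefficient-weighted error recommendation rate."""
    if total_sample_docs == 0 or not score_gaps:
        return 0.0
    coefficients = [min(gap / preset_degree, preset_weight) for gap in score_gaps]
    ratio = len(score_gaps) / total_sample_docs          # sample-number ratio
    return ratio * (sum(coefficients) / len(coefficients))
```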
步骤204:机器翻译系统根据该第一错误推荐率、每个预设特征的初始特征权重和初始特征偏置,确定每个预设特征的基准特征权重和基准特征偏置。
机器翻译系统确定该第一错误推荐率是否满足预设条件;如果该第一错误推荐率满足预设条件,将每个预设特征的初始特征权重、初始特征偏置分别确定为每个预设特征的基准特征权重和基准特征偏置。
如果该第一错误推荐率不满足预设条件,通过预设迭代算法,更新每个预设特征的初始特征权重和初始特征偏置,根据更新后的初始特征权重和更新后的初始特征偏置,确定第二错误推荐率,确定第二错误推荐率是否满足预设条件;如果第二错误推荐率满足预设条件,将此时的每个预设特征的特征权重和特征偏置分别确定为每个预设特征的基准特征权重和基准特征偏置。如果第二错误推荐率不满足预设条件,再次更新每个预设特征的初始特征权重和初始特征偏置,直到第二错误推荐率满足预设条件。
预设条件可以为错误推荐率低于第一预设阈值或者相邻两次得到的错误推荐率之间的差值低于第二预设阈值；第一预设阈值和第二预设阈值可以相等，也可以不相等；并且，第一预设阈值和第二预设阈值都可以根据需要进行设置并更改，在本公开实施例中，对第一预设阈值和第二预设阈值都不作具体限定。例如，第一预设阈值为0.2或者0.3，第二预设阈值可以为0.1或者0.15等。
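The preset iterative algorithm itself is left open by the disclosure, so the sketch below only shows the stopping logic of step 204 (accept the current feature weights and biases once the error recommendation rate is below a first threshold, or once the change between two consecutive error rates falls below a second threshold), with the actual parameter update abstracted behind `update_step`. The default thresholds 0.2 and 0.1 reuse the example values mentioned above; the iteration cap is an assumption.

```python
from typing import Callable, Tuple

Params = Tuple  # (feature weights, feature biases) in whatever structure the scoring model uses

def train_baseline_parameters(params: Params,
                              error_rate: Callable[[Params], float],
                              update_step: Callable[[Params, float], Params],  # one step of the preset iterative algorithm
                              first_threshold: float = 0.2,
                              second_threshold: float = 0.1,
                              max_iterations: int = 1000) -> Params:
    """Return the baseline feature weights and biases once the preset condition is met."""
    previous = error_rate(params)
    if previous < first_threshold:
        return params                                   # the initial values already satisfy the condition
    for _ in range(max_iterations):
        params = update_step(params, previous)
        current = error_rate(params)
        if current < first_threshold or abs(previous - current) < second_threshold:
            return params
        previous = current
    return params
```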
在本公开实施例中,通过第一样本文档集合和第一样本译文集合,训练出每个预设特征的基准特征权重和基准特征偏置,提高了确定出的每个预设特征的基准特征权重和基准特征偏置的准确性。
本公开实施例提供了一种机器翻译方法,该方法应用在机器翻译系统中,参见图3,该方法包括:
步骤301:机器翻译系统获取待翻译的源文档,源文档包括源语种的至少一个字符。
当用户对源文档进行翻译时,用户终端向机器翻译系统发送该待翻译的源文档;机器翻译系统接收用户终端发送的该源文档。
步骤302:机器翻译系统分别通过多个机器翻译装置,将源文档转换为多个目标文档。
其中,一个机器翻译装置用于将源文档翻译为一个目标文档;目标文档包括目标语种的至少一个字符,源语种和目标语种不同。
步骤303:机器翻译系统分别确定每个目标文档的每个预设特征的特征值,其中,任一目标文档的任一预设特征的特征值用于评估任一目标文档的流畅度或者忠实度。
预设特征包括第一类预设特征和第二类预设特征,第一类预设特征用于评估目标文档的流畅度,第二类预设特征用于评估目标文档的忠实度;相应的,本步骤可以为:
机器翻译系统分别通过每个第一类预设特征的提取算法,提取每个目标译文的每个第一类预设特征的特征值;和/或,分别通过每个第二类预设特征的提取算法,提取每个目标译文的每个第二类预设特征的特征值;
将每个目标译文的每个第一类预设特征的特征值和/或每个目标译文的每个第二类预设特征的特征值组成每个目标特征的每个预设特征的特征值。
需要说明的是,本步骤和步骤2022中机器翻译系统确定样本译文的每个预设特征的特征值的过程相同,在此不再赘述。
步骤304:机器翻译系统根据每个目标文档的每个预设特征的特征值、每个预设特征的特征权重和特征偏置,通过预设推荐度算法,确定每个目标文档的推荐度。
需要说明的是,本步骤和步骤2023中机器翻译系统根据每个样本文档的每个预设特征的特征值、每个预设特征的初始特征权重和初始特征偏置,确定每个样本译文的推荐度的过程相同,在此不再赘述。
步骤305:机器翻译系统根据每个目标文档的推荐度,输出推荐度最高的目标文档。
机器翻译系统根据每个目标文档的推荐度,从每个目标文档中选择推荐度最高的目标文档,输出推荐度最高的目标文档。
在本公开实施例中，通过多个机器翻译装置，将源文档转换为多个目标文档，确定每个目标文档的推荐度，输出推荐度最高的目标文档；由于通过多个机器翻译装置翻译目标文档，根据每个目标文档的推荐度，输出目标文档，从而提高了机器翻译的准确性。
本公开实施例提供了一种机器翻译装置,参见图4,该装置包括:获取单元401、翻译单元402、确定单元403和输出单元404。
获取单元401用于执行上述步骤201和301及其可选方案。
翻译单元402用于执行上述步骤302及其可选方案。
确定单元403用于执行上述步骤202、203、204、303和304及其可选方案。
输出单元404用于执行上述步骤305及其可选方案。
在本公开实施例中，通过多个机器翻译装置，将源文档转换为多个目标文档，确定每个目标文档的推荐度，输出推荐度最高的目标文档；由于通过多个机器翻译装置翻译目标文档，根据每个目标文档的推荐度，输出目标文档，从而提高了机器翻译的准确性。
需要说明的是:上述实施例提供的机器翻译装置在机器翻译时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的机器翻译装置与机器翻译方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
图5是根据一示例性实施例示出的一种机器翻译装置500的框图。例如，装置500可以被提供为一服务器。参照图5，装置500包括处理组件522，其进一步包括一个或多个处理器，以及由存储器532所代表的存储器资源，用于存储可由处理组件522执行的指令，例如应用程序。存储器532中存储的应用程序可以包括一个或一个以上的每一个对应于一组指令的模块。此外，处理组件522被配置为执行指令，以执行上述机器翻译方法。
装置500还可以包括一个电源组件526被配置为执行装置500的电源管理，一个有线或无线网络接口550被配置为将装置500连接到网络，和一个输入输出(I/O)接口558。装置500可以基于存储在存储器532中的操作系统进行操作，例如Windows Server™、Mac OS X™、Unix™、Linux™、FreeBSD™或类似。
本公开实施例提供了一种系统芯片,应用于机器翻译系统中,参见图6,该系统芯片包括:输入输出接口601、至少一个处理器602、存储器603和总线604;输入输出接口601通过总线604与至少一个处理器602和存储器603相连,输入输出接口601用于获取待翻译的源文档以及输出目标文档,至少一个处理器602执行存储器603中存储的指令,使得机器翻译系统执行上述机器翻译方法。
在一个可能的实现方式中,上述各个实施例中的处理器可以是中央处理器(CPU),通用处理器、数字信号处理器(DSP)、专用集成电路(ASIC),现场可编程门阵列(FPGA)或者其他可编程逻辑器件、晶体管逻辑器件,硬件部件或者其任意组合。其可以实现或执行结合本申请公开内容所描述的各种示例性的逻辑方框,模块和电路。所述处理器也可以是实现计算功能的组合,例如包含一个或多个微处理器组合,DSP和微处理器的组合等等。
结合本公开内容所描述的方法或者算法的步骤可以硬件的方式来实现,也可以是由处 理器执行软件指令的方式来实现。软件指令可以由相应的软件模块组成,软件模块可以被存放于RAM存储器、闪存、ROM存储器、EPROM存储器、EEPROM存储器、寄存器、硬盘、移动硬盘、CD-ROM或者本领域熟知的任何其它形式的存储介质中。一种示例性的存储介质耦合至处理器,从而使处理器能够从该存储介质读取信息,且可向该存储介质写入信息。当然,存储介质也可以是处理器的组成部分。处理器和存储介质可以位于ASIC中。另外,该ASIC可以位于终端中。当然,处理器和存储介质也可以作为分立组件存在于终端中。
本领域技术人员应该可以意识到,在上述一个或多个示例中,本申请所描述的功能可以用硬件、软件、固件或它们的任意组合来实现。当使用软件实现时,可以将这些功能存储在计算机可读介质中或者作为计算机可读介质上的一个或多个指令或代码进行传输。计算机可读介质包括计算机存储介质和通信介质,其中通信介质包括便于从一个地方向另一个地方传送计算机程序的任何介质。存储介质可以是通用或专用计算机能够存取的任何可用介质。
本公开中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上所述仅为本公开的可选实施例,并不用以限制本公开,凡在本公开的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本公开的保护范围之内。

Claims (18)

  1. 一种机器翻译方法,其特征在于,所述方法包括:
    获取待翻译的源文档,所述源文档包括源语种的至少一个字符;
    分别通过多个机器翻译装置,将所述源文档转换为多个目标文档,其中,一个机器翻译装置用于将所述源文档翻译为一个目标文档,所述目标文档包括目标语种的至少一个字符,所述源语种和所述目标语种不同;
    分别确定每个目标文档的每个预设特征的特征值,其中,任一目标文档的任一所述预设特征的特征值用于评估所述任一目标文档的流畅度和/或忠实度;
    根据所述每个目标文档的每个预设特征的特征值,确定所述每个目标文档的推荐度;
    根据所述每个目标文档的推荐度,输出推荐度最高的目标文档。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述每个目标文档的每个预设特征的特征值,确定所述每个目标文档的推荐度,包括:
    分别根据所述每个目标文档的每个预设特征的特征值、所述每个预设特征的基准特征权重和基准特征偏置,通过预设推荐度算法,确定所述每个目标文档的推荐度,所述每个预设特征的基准特征权重和基准特征偏置为根据第一样本文档集合和第一样本译文集合训练得到的,所述第一样本文档集合包括待翻译的至少一个样本文档,所述第一样本译文集合包括每个样本文档对应的参考译文。
  3. 根据权利要求2所述的方法,其特征在于,所述根据所述每个目标文档的每个预设特征的特征值,确定所述每个目标文档的推荐度之前,所述方法还包括:
    获取所述第一样本文档集合和所述第一样本译文集合;
    根据所述第一样本文档集合,确定第二样本译文集合,所述第二样本译文集合包括所述每个样本文档对应的样本译文;
    根据所述第一样本译文集合和所述第二样本译文集合,确定第一错误推荐率;
    根据所述第一错误推荐率、所述每个预设特征的初始特征权重和初始特征偏置,确定所述每个预设特征的基准特征权重和基准特征偏置。
  4. 根据权利要求3所述的方法,其特征在于,所述根据所述第一样本文档集合,确定第二样本译文集合,包括:
    分别通过所述多个机器翻译装置,将所述第一样本文档集合中的每个样本文档转换为多个样本译文集合,其中,一个样本译文集合包括一个机器翻译装置将所述每个样本文档翻译为所述目标语种的至少一个样本译文;
    分别确定所述多个样本译文集合中的每个样本译文的每个预设特征的特征值;
    根据所述每个样本译文的每个预设特征的特征值、所述每个预设特征的初始特征权重和初始特征偏置,确定所述每个样本译文的推荐度;
    根据所述每个样本译文的推荐度,确定所述第二样本译文集合。
  5. 根据权利要求3所述的方法,其特征在于,所述根据所述第一错误推荐率、所述每个预设特征的初始特征权重和初始特征偏置,确定所述每个预设特征的基准特征权重和基准特征偏置,包括:
    如果所述第一错误推荐率满足预设条件,将所述每个预设特征的初始特征权重、初始特征偏置分别确定为所述每个预设特征的基准特征权重和基准特征偏置;或者,
    如果所述第一错误推荐率不满足预设条件,通过预设迭代算法,更新所述每个预设特征的初始特征权重和初始特征偏置,直到第二错误推荐率满足预设条件,所述第二错误推荐率为根据更新后的初始特征权重和更新后的初始特征偏置确定得到的,将所述第二错误推荐率满足预设条件时的特征权重和特征偏置确定为所述每个预设特征的基准特征权重和基准特征偏置。
  6. 根据权利要求3所述的方法,其特征在于,所述根据所述第一样本译文集合和第二样本译文集合,确定第一错误推荐率,包括:
    根据所述第一样本译文集合和所述第二样本译文集合,确定第三样本译文集合和第二样本文档集合,所述第三样本译文集合包括所述第一样本译文集合和所述第二样本译文集合中不同的样本译文,所述第二样本文档集合包括所述不同的样本译文对应的样本文档;
    根据所述第三样本译文集合中的每个样本译文的推荐度,确定所述第二样本文档集合中的每个样本文档的推荐系数;
    确定第一样本数目和第二样本数目之间的样本数目比值,所述第一样本数目为所述第二样本文档集合包括的样本文档的数目,所述第二样本数目为所述第一样本文档集合包括的样本文档的数目;
    确定所述样本数目比值与所述第二样本文档集合中的每个样本文档的推荐系数的乘积,得到所述第一错误推荐率。
  7. 根据权利要求6所述的方法,其特征在于,所述根据所述第三样本译文文档集合中的每个样本译文文档的推荐度,确定所述第二样本文档集合中的每个样本文档的推荐系数,包括:
    根据所述第三样本译文集合中的每个样本译文的推荐度,确定所述第二样本文档集合中的每个样本文档的推荐权重;
    对于所述第二样本文档集合中的每个样本文档,确定所述样本文档的推荐权重和预设推荐度的比值,得到所述样本文档的推荐度比值,从所述样本文档的推荐度比值和预设推荐权重中选择最小值作为所述样本文档的推荐系数。
  8. 根据权利要求1所述的方法,其特征在于,所述预设特征包括第一类预设特征和/或第二类预设特征,所述第一类预设特征用于评估目标文档的流畅度,所述第二类预设特征用于评估所述目标文档的忠实度;
    所述分别确定每个目标译文的每个预设特征的特征值,包括:
    分别通过每个第一类预设特征的提取算法,提取所述每个目标译文的每个第一类预设特征的特征值;和/或,分别通过每个第二类预设特征的提取算法,提取所述每个目标译文的每 个第二类预设特征的特征值;
    将所述每个目标译文的每个第一类预设特征的特征值和/或所述每个目标译文的每个第二类预设特征的特征值组成所述每个目标特征的每个预设特征的特征值。
  9. 一种机器翻译装置,其特征在于,所述装置包括:
    获取单元,用于获取待翻译的源文档,所述源文档包括源语种的至少一个字符;
    翻译单元,用于分别通过多个机器翻译装置,将所述源文档转换为多个目标文档,其中,一个机器翻译装置用于将所述源文档翻译为一个目标文档,所述目标文档包括目标语种的至少一个字符,所述源语种和所述目标语种不同;
    确定单元,用于分别确定每个目标文档的每个预设特征的特征值,其中,任一目标文档的任一所述预设特征的特征值用于评估所述任一目标文档的流畅度和/或忠实度;
    所述确定模块,还用于根据所述每个目标文档的每个预设特征的特征值,确定所述每个目标文档的推荐度;
    输出单元,用于根据所述每个目标文档的推荐度,输出推荐度最高的目标文档。
  10. 根据权利要求9所述的装置,其特征在于,
    所述确定单元,还用于分别根据所述每个目标文档的每个预设特征的特征值、所述每个预设特征的基准特征权重和基准特征偏置,通过预设推荐度算法,确定所述每个目标文档的推荐度,所述每个预设特征的基准特征权重和基准特征偏置为根据第一样本文档集合和第一样本译文集合训练得到的,所述第一样本文档集合包括待翻译的至少一个样本文档,所述第一样本译文集合包括每个样本文档对应的参考译文。
  11. 根据权利要求10所述的装置,其特征在于,所述装置还包括:
    所述获取单元,还用于获取所述第一样本文档集合和所述第一样本译文集合;
    所述确定单元,还用于根据所述第一样本文档集合,确定第二样本译文集合,所述第二样本译文集合包括所述每个样本文档对应的样本译文;
    所述确定单元,还用于根据所述第一样本译文集合和所述第二样本译文集合,确定第一错误推荐率;
    所述确定单元,还用于根据所述第一错误推荐率、所述每个预设特征的初始特征权重和初始特征偏置,确定所述每个预设特征的基准特征权重和基准特征偏置。
  12. 根据权利要求11所述的装置,其特征在于,
    所述翻译单元,还用于分别通过所述多个机器翻译装置,将所述第一样本文档集合中的每个样本文档转换为多个样本译文集合,其中,一个样本译文集合包括一个机器翻译装置将所述每个样本文档翻译为所述目标语种的至少一个样本译文;
    所述确定单元,还用于分别确定所述多个样本译文集合中的每个样本译文的每个预设特征的特征值;根据所述每个样本译文的每个预设特征的特征值、所述每个预设特征的初始特征权重和初始特征偏置,确定所述每个样本译文的推荐度;根据所述每个样本译文的推荐度,确定所述第二样本译文集合。
  13. 根据权利要求11所述的装置,其特征在于,
    所述确定单元,还用于如果所述第一错误推荐率满足预设条件,将所述每个预设特征的初始特征权重、初始特征偏置分别确定为所述每个预设特征的基准特征权重和基准特征偏置;或者,
    所述确定单元,还用于如果所述第一错误推荐率不满足预设条件,通过预设迭代算法,更新所述每个预设特征的初始特征权重和初始特征偏置,直到第二错误推荐率满足预设条件,所述第二错误推荐率为根据更新后的初始特征权重和更新后的初始特征偏置确定得到的,将所述第二错误推荐率满足预设条件时的特征权重和特征偏置确定为所述每个预设特征的基准特征权重和基准特征偏置。
  14. 根据权利要求11所述的装置,其特征在于,
    所述确定单元,还用于根据所述第一样本译文集合和所述第二样本译文集合,确定第三样本译文集合和第二样本文档集合,所述第三样本译文集合包括所述第一样本译文集合和所述第二样本译文集合中不同的样本译文,所述第二样本文档集合包括所述不同的样本译文对应的样本文档;根据所述第三样本译文集合中的每个样本译文的推荐度,确定所述第二样本文档集合中的每个样本文档的推荐系数;确定第一样本数目和第二样本数目之间的样本数目比值,所述第一样本数目为所述第二样本文档集合包括的样本文档的数目,所述第二样本数目为所述第一样本文档集合包括的样本文档的数目;确定所述样本数目比值与所述第二样本文档集合中的每个样本文档的推荐系数的乘积,得到所述第一错误推荐率。
  15. 根据权利要求14所述的装置,其特征在于,
    所述确定单元,还用于根据所述第三样本译文集合中的每个样本译文的推荐度,确定所述第二样本文档集合中的每个样本文档的推荐权重;对于所述第二样本文档集合中的每个样本文档,确定所述样本文档的推荐权重和预设推荐度的比值,得到所述样本文档的推荐度比值;从所述每个样本文档的推荐度比值和预设推荐权重中选择最小值作为所述样本文档的推荐系数。
  16. 根据权利要求10所述的装置,其特征在于,所述预设特征包括第一类预设特征和/或第二类预设特征,所述第一类预设特征用于评估目标文档的流畅度,所述第二类预设特征用于评估所述目标文档的忠实度;
    所述确定单元,还用于分别通过每个第一类预设特征的提取算法,提取所述每个目标译文的每个第一类预设特征的特征值;和/或,分别通过每个第二类预设特征的提取算法,提取所述每个目标译文的每个第二类预设特征的特征值;
    所述确定单元,还用于将所述每个目标译文的每个第一类预设特征的特征值和/或所述每个目标译文的每个第二类预设特征的特征值组成所述每个目标特征的每个预设特征的特征值。
  17. 一种机器翻译装置,其特征在于,所述装置包括:处理组件,一个或多个处理器,以及由存储器所代表的存储器资源,用于存储可由所述处理组件的执行的指令;
    所述处理组件被配置为执行指令,以执行如权利要求1-8任一所述的机器翻译方法。
  18. 一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机程序,其特征在于,所述程序被处理器执行时实现如权利要求1-8任一所述的机器翻译方法。
PCT/CN2018/088387 2017-05-26 2018-05-25 机器翻译方法、装置及存储介质 WO2018214956A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP18806246.7A EP3617908A4 (en) 2017-05-26 2018-05-25 AUTOMATIC TRANSLATION METHOD AND APPARATUS, AND INFORMATION MEDIUM
US16/694,239 US20200089774A1 (en) 2017-05-26 2019-11-25 Machine Translation Method and Apparatus, and Storage Medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710386617.5A CN108932231B (zh) 2017-05-26 2017-05-26 机器翻译方法及装置
CN201710386617.5 2017-05-26

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/694,239 Continuation US20200089774A1 (en) 2017-05-26 2019-11-25 Machine Translation Method and Apparatus, and Storage Medium

Publications (1)

Publication Number Publication Date
WO2018214956A1 true WO2018214956A1 (zh) 2018-11-29

Family

ID=64396250

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/088387 WO2018214956A1 (zh) 2017-05-26 2018-05-25 机器翻译方法、装置及存储介质

Country Status (4)

Country Link
US (1) US20200089774A1 (zh)
EP (1) EP3617908A4 (zh)
CN (1) CN108932231B (zh)
WO (1) WO2018214956A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109558604B (zh) * 2018-12-17 2022-06-14 北京百度网讯科技有限公司 一种机器翻译方法、装置、电子设备及存储介质
CN111104807B (zh) * 2019-12-06 2024-05-24 北京搜狗科技发展有限公司 一种数据处理方法、装置和电子设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070219774A1 (en) * 2002-12-04 2007-09-20 Microsoft Corporation System and method for machine learning a confidence metric for machine translation
CN102789451A (zh) * 2011-05-16 2012-11-21 北京百度网讯科技有限公司 一种个性化的机器翻译系统、方法及训练翻译模型的方法
WO2013148930A1 (en) * 2012-03-29 2013-10-03 Lionbridge Technologies, Inc. Methods and systems for multi-engine machine translation
CN103678285A (zh) * 2012-08-31 2014-03-26 富士通株式会社 机器翻译方法和机器翻译系统

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8326598B1 (en) * 2007-03-26 2012-12-04 Google Inc. Consensus translations from multiple machine translation systems
US9201871B2 (en) * 2010-06-11 2015-12-01 Microsoft Technology Licensing, Llc Joint optimization for machine translation system combination
JP2014078132A (ja) * 2012-10-10 2014-05-01 Toshiba Corp 機械翻訳装置、方法およびプログラム
CN103646019A (zh) * 2013-12-31 2014-03-19 哈尔滨理工大学 一种多个机器翻译系统融合的方法及装置
US10067936B2 (en) * 2014-12-30 2018-09-04 Facebook, Inc. Machine translation output reranking

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070219774A1 (en) * 2002-12-04 2007-09-20 Microsoft Corporation System and method for machine learning a confidence metric for machine translation
CN102789451A (zh) * 2011-05-16 2012-11-21 北京百度网讯科技有限公司 一种个性化的机器翻译系统、方法及训练翻译模型的方法
WO2013148930A1 (en) * 2012-03-29 2013-10-03 Lionbridge Technologies, Inc. Methods and systems for multi-engine machine translation
CN103678285A (zh) * 2012-08-31 2014-03-26 富士通株式会社 机器翻译方法和机器翻译系统

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3617908A4

Also Published As

Publication number Publication date
CN108932231A (zh) 2018-12-04
EP3617908A4 (en) 2020-05-13
EP3617908A1 (en) 2020-03-04
CN108932231B (zh) 2023-07-18
US20200089774A1 (en) 2020-03-19

Similar Documents

Publication Publication Date Title
CN109241524B (zh) 语义解析方法及装置、计算机可读存储介质、电子设备
US10394851B2 (en) Methods and systems for mapping data items to sparse distributed representations
CN109325229B (zh) 一种利用语义信息计算文本相似度的方法
US20160196258A1 (en) Semantic Similarity Evaluation Method, Apparatus, and System
CN108628830B (zh) 一种语义识别的方法和装置
US20220318275A1 (en) Search method, electronic device and storage medium
CN110874536B (zh) 语料质量评估模型生成方法和双语句对互译质量评估方法
WO2019028990A1 (zh) 代码元素的命名方法、装置、电子设备及介质
EP3620994A1 (en) Methods, apparatuses, devices, and computer-readable storage media for determining category of entity
CN112883193A (zh) 一种文本分类模型的训练方法、装置、设备以及可读介质
CA2971884C (en) Method and device for general machine translation engine-oriented individualized translation
WO2022156180A1 (zh) 相似文本确定方法及相关设备
CN111950303B (zh) 医疗文本翻译方法、装置及存储介质
CN111488742B (zh) 用于翻译的方法和装置
WO2022174496A1 (zh) 基于生成模型的数据标注方法、装置、设备及存储介质
WO2021159812A1 (zh) 癌症分期信息处理方法、装置及存储介质
WO2020010996A1 (zh) 超链接的处理方法和装置及存储介质
WO2018214956A1 (zh) 机器翻译方法、装置及存储介质
WO2022141872A1 (zh) 文献摘要生成方法、装置、计算机设备及存储介质
WO2022022049A1 (zh) 文本长难句的压缩方法、装置、计算机设备及存储介质
WO2024087297A1 (zh) 文本情感分析方法、装置、电子设备及存储介质
CN111814496A (zh) 文本处理方法、装置、设备及存储介质
WO2023061441A1 (zh) 文本的量子线路确定方法、文本分类方法及相关装置
WO2021098491A1 (zh) 知识图谱的生成方法、装置、终端以及存储介质
CN114118049B (zh) 信息获取方法、装置、电子设备及存储介质

Legal Events

Date Code Title Description
121   Ep: the epo has been informed by wipo that ep was designated in this application
      Ref document number: 18806246
      Country of ref document: EP
      Kind code of ref document: A1
NENP  Non-entry into the national phase
      Ref country code: DE
ENP   Entry into the national phase
      Ref document number: 2018806246
      Country of ref document: EP
      Effective date: 20191128