WO2020087655A1 - Translation method, apparatus and device, and readable storage medium - Google Patents

Publication number
WO2020087655A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
source language
training
translation
language training
Application number
PCT/CN2018/119329
Other languages
French (fr)
Chinese (zh)
Inventor
孔常青
高建清
刘俊华
胡国平
Original Assignee
科大讯飞股份有限公司
Application filed by 科大讯飞股份有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Abstract

Disclosed are a translation method, apparatus, and device, and a readable storage medium. The method comprises: obtaining a source language text to be translated; performing sentence segmentation on the source language text according to the current translation scenario, so that the segmented source language text better conforms to the current translation scenario; and translating the segmented source language text to obtain a target language text. Compared with existing translation methods, the present application adds a sentence segmentation optimization step: the source language text is re-segmented with the current translation scenario taken into account, so that its segmentation is better optimized, and the target language text obtained by translating the segmented text is of higher quality.

Description

Translation method, apparatus, device, and readable storage medium
Technical field
This application claims priority to Chinese patent application No. 201811276866.X, filed with the China Patent Office on October 30, 2018 and entitled "Translation method, apparatus, device, and readable storage medium", the entire contents of which are incorporated herein by reference.
Background
Text translation is the process of translating a source language text to be translated into a target language text. The sentence segmentation of a source language text to be translated is often not standardized and depends on where the text comes from. For example, a source language text obtained through speech recognition is segmented mainly according to pauses in the speech, so its segmentation is often affected by the speaker's habits.
When the prior art performs machine translation on source language text whose sentence segmentation has not been optimized in this way, the quality of the machine translation suffers greatly.
Summary of the invention
In view of this, the present application provides a translation method, apparatus, device, and readable storage medium, to solve the problem that the sentence segmentation of existing source language text to be translated is not optimized, which results in low machine translation quality.
To achieve the above object, the following solutions are proposed:
A translation method, comprising:
obtaining a source language text to be translated;
performing sentence segmentation on the source language text according to the current translation scenario to obtain a segmented source language text;
translating the segmented source language text to obtain a target language text.
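The three steps of the method can be sketched as a small pipeline. This is only an illustrative sketch: the function names `translate_text`, `toy_segment`, and `toy_translate` are hypothetical, and the toy segmenter and translator merely stand in for the text segmentation model and the machine translation model described in the application.

```python
from typing import Callable


def translate_text(
    source_text: str,
    scene: str,
    segment: Callable[[str, str], str],
    translate: Callable[[str], str],
) -> str:
    """Re-segment the obtained source text for the scene, then translate it."""
    segmented = segment(source_text, scene)  # scene-aware sentence segmentation
    return translate(segmented)              # translation of the segmented text


# Hypothetical stand-ins, for illustration only.
def toy_segment(text: str, scene: str) -> str:
    # Assumed behaviour: a "conference" scene prefers terminating punctuation.
    return text.replace(",", ".") if scene == "conference" else text


def toy_translate(text: str) -> str:
    # Placeholder for a real machine translation model.
    return text.upper()


result = translate_text("hello, world.", "conference", toy_segment, toy_translate)
print(result)
```

Passing the segmenter and translator in as callables keeps the sketch independent of any particular model.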
Preferably, performing sentence segmentation on the source language text according to the current translation scenario to obtain the segmented source language text comprises:
inputting the source language text into a preset text segmentation model to obtain the segmented source language text output by the text segmentation model;
wherein the text segmentation model is trained with a source language training text as training data and with a sentence segmentation result of the source language training text that conforms to the current translation scenario as the training label.
Preferably, the process of determining the text segmentation model comprises:
obtaining a source language training text;
determining a sentence segmentation result of the source language training text that conforms to the current translation scenario, as a target sentence segmentation result;
training the text segmentation model with the source language training text as training data and the target sentence segmentation result as the training label.
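One way a training pair could be built from a source language training text and its target sentence segmentation result is sketched below. It assumes a sequence-labeling formulation in which each non-punctuation character is labeled with the punctuation mark that should follow it, or `"O"` for none. The application does not prescribe this encoding, and `PUNCT`, `strip_punct`, `make_labels`, and `make_training_pair` are hypothetical names.

```python
from typing import List, Tuple

# Assumed punctuation inventory (ASCII and full-width marks).
PUNCT = set(",.?!;:、，。？！；：")


def strip_punct(text: str) -> List[str]:
    """Return the characters of text with punctuation removed."""
    return [ch for ch in text if ch not in PUNCT]


def make_labels(target_segmented: str) -> List[str]:
    """Label each non-punctuation character with the mark that follows it."""
    labels: List[str] = []
    for ch in target_segmented:
        if ch in PUNCT:
            if labels:
                labels[-1] = ch  # attach the mark to the preceding character
        else:
            labels.append("O")   # no punctuation after this character (so far)
    return labels


def make_training_pair(source_text: str, target_segmented: str) -> Tuple[List[str], List[str]]:
    """Build (tokens, labels); both texts must agree apart from punctuation."""
    tokens = strip_punct(source_text)
    if tokens != strip_punct(target_segmented):
        raise ValueError("texts differ in more than punctuation")
    return tokens, make_labels(target_segmented)


tokens, labels = make_training_pair("今天开会，讨论预算。", "今天开会。讨论预算。")
print(list(zip(tokens, labels)))
```

A model trained on such pairs would predict a label per character, from which the segmented text can be rebuilt.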
Preferably, determining the sentence segmentation result of the source language training text that conforms to the current translation scenario, as the target sentence segmentation result, comprises:
obtaining a target language training text, which is the translation of the source language training text in the current translation scenario;
changing the sentence segmentation of the source language training text with reference to a set segmentation-change mode to obtain changed source language training texts, the changed source language training texts and the source language training text together forming the candidate source language training texts;
translating each candidate source language training text with a preset machine translation model to obtain a machine translation result of each candidate source language training text;
determining the similarity between the machine translation result of each candidate source language training text and the target language training text, and taking the candidate source language training text with the highest similarity as the target sentence segmentation result.
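The selection of the target sentence segmentation result can be sketched as follows. The application does not fix a similarity measure, so this sketch assumes the character-level ratio from Python's `difflib.SequenceMatcher`. The function name `pick_target_segmentation` and the dictionary `toy_mt`, which stands in for the preset machine translation model, are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, Sequence


def pick_target_segmentation(
    candidates: Sequence[str],
    reference_translation: str,
    translate: Callable[[str], str],
) -> str:
    """Return the candidate whose machine translation is closest to the reference."""

    def score(candidate: str) -> float:
        # Assumed similarity measure: difflib's ratio in [0, 1].
        return SequenceMatcher(None, translate(candidate), reference_translation).ratio()

    return max(candidates, key=score)


# Toy lookup table standing in for a machine translation model.
toy_mt = {
    "我们开会，讨论预算。": "We meeting discuss the budget.",
    "我们开会。讨论预算。": "We have a meeting. We discuss the budget.",
}
reference = "We have a meeting. We discuss the budget."

best = pick_target_segmentation(list(toy_mt), reference, toy_mt.__getitem__)
print(best)
```

In practice a translation-quality metric could replace the character-level ratio; the selection logic would stay the same.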
Preferably, changing the sentence segmentation of the source language training text with reference to the set segmentation-change mode to obtain the changed source language training text comprises:
determining the non-terminating punctuation marks contained in the source language training text;
replacing each non-terminating punctuation mark contained in the source language training text with a terminating punctuation mark to obtain the changed source language training text.
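A minimal sketch of candidate generation is given below. It assumes that every subset of the non-terminating marks may be replaced, so the original text and all changed variants are produced together. The application only states that non-terminating marks are replaced with terminating ones, so the subset enumeration and the mapping `NON_TERMINATING` are assumptions, and `candidate_segmentations` is a hypothetical name.

```python
from itertools import product
from typing import List

# Assumed mapping from non-terminating marks to a terminating replacement.
NON_TERMINATING = {",": ".", ";": ".", "，": "。", "；": "。", "、": "。"}


def candidate_segmentations(text: str) -> List[str]:
    """Return the original text plus every variant with some marks replaced."""
    positions = [i for i, ch in enumerate(text) if ch in NON_TERMINATING]
    candidates = []
    # Each position is either kept (False) or replaced (True).
    for choice in product([False, True], repeat=len(positions)):
        chars = list(text)
        for pos, replace in zip(positions, choice):
            if replace:
                chars[pos] = NON_TERMINATING[chars[pos]]
        candidates.append("".join(chars))
    return candidates


candidates = candidate_segmentations("a，b，c。")
print(candidates)
```

The number of candidates doubles with each non-terminating mark, so a real system would likely cap or prune the enumeration for long texts.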
Preferably, translating each candidate source language training text with the preset machine translation model to obtain the machine translation result of each candidate source language training text comprises:
dividing each candidate source language training text into clauses according to the terminating punctuation marks it contains to obtain a clause sequence;
translating each clause in the clause sequence of the candidate source language training text separately with the preset machine translation model to obtain a machine translation result of each clause;
merging the machine translation results of the clauses in the order of the clauses in the clause sequence to obtain the machine translation result of the candidate source language training text.
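Clause division, clause-by-clause translation, and merging can be sketched as below. The set of terminating marks in the regular expression and the names `split_clauses` and `translate_by_clause` are assumptions, and `str.upper` is used only as a placeholder for the machine translation model.

```python
import re
from typing import Callable, List

# A clause is a run of non-terminating characters followed by its terminating marks.
# Assumed terminating marks: 。？！ and their ASCII counterparts.
_CLAUSE_RE = re.compile(r"[^。？！.?!]+[。？！.?!]*")


def split_clauses(text: str) -> List[str]:
    """Divide text into clauses at terminating punctuation marks."""
    return [c.strip() for c in _CLAUSE_RE.findall(text) if c.strip()]


def translate_by_clause(text: str, translate: Callable[[str], str], joiner: str = " ") -> str:
    """Translate each clause separately and merge the results in order."""
    return joiner.join(translate(clause) for clause in split_clauses(text))


clauses = split_clauses("我们开会。讨论预算，然后休息！")
merged = translate_by_clause("hi there. ok?", str.upper)
print(clauses)
print(merged)
```

This simple pattern also splits at periods inside numbers such as "3.5", which a production splitter would need to handle.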
Preferably, before training the text segmentation model with the source language training text as training data and the target sentence segmentation result as the training label, the method further comprises:
obtaining a result of manually annotating the source language training text with punctuation, to obtain a manually annotated source language training text;
training a text segmentation model with the source language training text as training data and the manually annotated source language training text as the training label, to obtain a preliminary text segmentation model;
in which case, training the text segmentation model with the source language training text as training data and the target sentence segmentation result as the training label comprises:
training the preliminary text segmentation model with the source language training text as training data and the target sentence segmentation result as the training label.
Preferably, translating the segmented source language text to obtain the target language text comprises:
dividing the segmented source language text into clauses according to the terminating punctuation marks it contains to obtain a clause sequence;
translating each clause in the clause sequence of the segmented source language text separately with a preset machine translation model to obtain a machine translation result of each clause;
merging the machine translation results of the clauses in the order of the clauses in the clause sequence to obtain the target language text.
A translation apparatus, comprising:
a source language text obtaining unit, configured to obtain a source language text to be translated;
a text segmentation unit, configured to perform sentence segmentation on the source language text according to the current translation scenario to obtain a segmented source language text;
a source language text translation unit, configured to translate the segmented source language text to obtain a target language text.
Preferably, the text segmentation unit comprises:
a model reference unit, configured to input the source language text into a preset text segmentation model to obtain the segmented source language text output by the text segmentation model;
wherein the text segmentation model is trained with a source language training text as training data and with a sentence segmentation result of the source language training text that conforms to the current translation scenario as the training label.
Preferably, the apparatus further comprises a text segmentation model determination unit, configured to determine the text segmentation model; the text segmentation model comprises:
a source language training text obtaining unit, configured to obtain a source language training text;
a sentence segmentation result determination unit, configured to determine a sentence segmentation result of the source language training text that conforms to the current translation scenario, as a target sentence segmentation result;
a first model training unit, configured to train the text segmentation model with the source language training text as training data and the target sentence segmentation result as the training label.
Preferably, the sentence segmentation result determination unit comprises:
a target language training text obtaining unit, configured to obtain a target language training text, which is the translation of the source language training text in the current translation scenario;
a segmentation changing unit, configured to change the sentence segmentation of the source language training text with reference to a set segmentation-change mode to obtain changed source language training texts, the changed source language training texts and the source language training text together forming the candidate source language training texts;
a source language training text translation unit, configured to translate each candidate source language training text with a preset machine translation model to obtain a machine translation result of each candidate source language training text;
a similarity determination unit, configured to determine the similarity between the machine translation result of each candidate source language training text and the target language training text, and to take the candidate source language training text with the highest similarity as the target sentence segmentation result.
Preferably, the segmentation changing unit comprises:
a non-terminating punctuation determination unit, configured to determine the non-terminating punctuation marks contained in the source language training text;
a non-terminating punctuation replacement unit, configured to replace each non-terminating punctuation mark contained in the source language training text with a terminating punctuation mark to obtain the changed source language training text.
Preferably, the source language training text translation unit comprises:
a first clause division unit, configured to divide each candidate source language training text into clauses according to the terminating punctuation marks it contains to obtain a clause sequence;
a first clause translation unit, configured to translate each clause in the clause sequence of the candidate source language training text separately with the preset machine translation model to obtain a machine translation result of each clause;
a first translation result merging unit, configured to merge the machine translation results of the clauses in the order of the clauses in the clause sequence to obtain the machine translation result of the candidate source language training text.
Preferably, the text segmentation model further comprises:
a manual annotation result obtaining unit, configured to obtain a result of manually annotating the source language training text with punctuation, to obtain a manually annotated source language training text;
a second model training unit, configured to train a text segmentation model with the source language training text as training data and the manually annotated source language training text as the training label, to obtain a preliminary text segmentation model;
in which case the first model training unit is specifically configured to:
train the preliminary text segmentation model with the source language training text as training data and the target sentence segmentation result as the training label.
Preferably, the source language text translation unit comprises:
a second clause division unit, configured to divide the segmented source language text into clauses according to the terminating punctuation marks it contains to obtain a clause sequence;
a second clause translation unit, configured to translate each clause in the clause sequence of the segmented source language text separately with a preset machine translation model to obtain a machine translation result of each clause;
a second translation result merging unit, configured to merge the machine translation results of the clauses in the order of the clauses in the clause sequence to obtain the target language text.
A translation device, comprising a memory and a processor;
the memory being configured to store a program;
the processor being configured to execute the program to implement the steps of the translation method described above.
A readable storage medium storing a computer program which, when executed by a processor, implements the steps of the translation method described above.
As can be seen from the above technical solutions, the translation method provided in the embodiments of the present application, on obtaining the source language text to be translated, further segments the source language text according to the current translation scenario, so that the segmented source language text better conforms to that scenario. Compared with existing translation methods, the present application adds a sentence segmentation optimization step: the source language text is re-segmented with the current translation scenario taken into account, so that its segmentation is better optimized, and the target language text obtained by translating the segmented text is of higher quality.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a translation method disclosed in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a translation apparatus disclosed in an embodiment of the present application;
FIG. 3 is a block diagram of the hardware structure of a translation device disclosed in an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
Text translation is the process of translating a source language text to be translated into a target language text. The sentence segmentation of the source language text varies with its source; the following description takes as an example a source language text obtained by performing speech recognition on speech to be translated. In different translation scenarios, different ways of segmenting the source language text affect the quality of the target language text translated from it. For example, the translation of a source language text may differ between contexts. Likewise, the translation may differ between occasions: in a conference, the translation is expected to be more rigorous and standardized, whereas in a chat it may be more casual and colloquial.
In the prior art, the source language text to be translated is fed directly into a machine translation model, yet its sentence segmentation is not standardized. For example, a source language text obtained through speech recognition may be affected by the speaker's speaking habits, so its segmentation is not optimized and does not take the current translation scenario into account, and the resulting translation is of low quality. The present application therefore provides an optimized translation method, which can be applied to electronic devices with data processing capabilities.
The translation method of the present application is described below with reference to FIG. 1. The method may include:
Step S100: obtain a source language text to be translated.
Specifically, the source language text to be translated can be obtained in several ways, for example as source language text uploaded by the user, or as text obtained by performing speech recognition on received user speech data. Taking speech translation as an example, voice endpoint detection can be applied to the acquired real-time speech to obtain speech segments. The speech segments are then recognized, and the recognized text serves as the source language text to be translated.
Here, the source language is the language of the text to be translated. Correspondingly, the language after translation is defined as the target language. The purpose of the present application is to translate the source language text into a target language text.
Step S110: perform sentence segmentation on the source language text according to the current translation scenario to obtain a segmented source language text.
It can be understood that the sentence segmentation of the source language text obtained in the previous step (that is, its punctuation) may be affected by the speaker's speaking habits. The segmentation is not standardized and does not take the current translation scenario into account, so translating the obtained source language text directly yields a low-quality result.
This step therefore adds sentence segmentation processing of the source language text. The processing takes the current translation scenario into account, so that the segmentation of the resulting source language text better conforms to that scenario. The sentence segmentation processing is described in detail below.
Step S120: translate the segmented source language text to obtain a target language text.
In general, a machine translation model can be used to translate the segmented source language text obtained in the previous step into the target language text.
On this basis, the embodiments of the present application can also, according to the user's needs, synthesize the target language text into speech and play it back, thereby converting source language speech into target language speech.
With the translation method provided in the embodiments of the present application, once the source language text to be translated is obtained, it is further segmented according to the current translation scenario, so that the segmented source language text better conforms to that scenario. Compared with existing translation methods, the present application adds a sentence segmentation optimization step: the source language text is re-segmented with the current translation scenario taken into account, so that its segmentation is better optimized, and the target language text obtained by translating the segmented text is of higher quality.
Another embodiment of the present application describes step S110 above, in which the source language text is segmented according to the current translation scenario to obtain the segmented source language text.
First, it can be understood that sentence segmentation has particular characteristics in different translation scenarios, so the present application may preset a segmentation rule for each translation scenario. For example, in a conference setting, short sentences, and therefore terminal punctuation marks, may need to be used as much as possible. The segmentation rule for the conference scenario can then specify that the number of terminal punctuation marks used is greater than the number of non-terminal punctuation marks.
Here, punctuation marks are divided into two classes, terminal and non-terminal, according to whether they close a complete expression of sentence meaning. Terminal punctuation marks indicate that the meaning is completely expressed, such as the period, question mark, and exclamation mark. Non-terminal punctuation marks indicate that it is not, such as the comma and the enumeration comma (、).
Based on this, when performing the current translation, the preset correspondence can be queried to determine the segmentation rule for the current translation scenario. After the source language text to be translated is obtained, it is segmented according to the determined rule to obtain the segmented source language text, which therefore meets the needs of the current translation scenario.
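The lookup of a preset rule by translation scenario can be sketched as follows. This is only an illustration under stated assumptions: the punctuation inventories, the scenario name "conference", and the names `SEGMENTATION_RULES` and `rule_for` are hypothetical, and the rule is expressed as a predicate that checks whether a text satisfies the conference rule described above (more terminal than non-terminal marks). The application does not prescribe how a rule is represented.

```python
from typing import Callable, Dict

# Hypothetical punctuation inventories for Chinese source text.
TERMINAL = frozenset("。?!")
NON_TERMINAL = frozenset(",、")


def more_terminal_than_non_terminal(text: str) -> bool:
    """Conference-style rule: terminal marks must outnumber non-terminal marks."""
    terminal = sum(ch in TERMINAL for ch in text)
    non_terminal = sum(ch in NON_TERMINAL for ch in text)
    return terminal > non_terminal


# Hypothetical preset correspondence: scenario name -> rule predicate.
SEGMENTATION_RULES: Dict[str, Callable[[str], bool]] = {
    "conference": more_terminal_than_non_terminal,
}


def rule_for(scenario: str) -> Callable[[str], bool]:
    """Query the preset correspondence for the current translation scenario."""
    return SEGMENTATION_RULES[scenario]
```

For instance, `rule_for("conference")` accepts a text segmented into three short sentences and rejects the same text joined by two commas.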
Further, the embodiments of the present application provide another way to segment the source language text, namely using a machine learning model. The detailed process is as follows:
The machine learning model that performs segmentation in this embodiment is referred to as the text segmentation model. It may use any existing machine learning architecture, such as a BLSTM model or a Self-Attention model under the sequence labeling framework, or a sequence generation model under the Encode-Decode (encoder-decoder) framework; a combination of several existing architectures may also be used.
If a model under the sequence labeling framework is used, the input of the model is each word in the text sequence and the output is the punctuation class for each word. The punctuation class may be null, comma, period, question mark, and so on, where null means that no punctuation is added after the word.
If a model under the Encode-Decode framework is used, the input of the model may be a text sequence without punctuation and the output is a text sequence containing punctuation, that is, the input text sequence with punctuation added by the model. Which form of machine learning model to use can be chosen according to application needs and is not strictly limited by the present application.
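The two input/output formats described above can be illustrated with a short sketch that builds one training example of each kind from a punctuated sentence. The label names, the "O" symbol for the null class, and the assumption that the text has already been split into word tokens are all hypothetical choices made for illustration; the application does not fix a label inventory or a tokenizer.

```python
from typing import List, Tuple

# Hypothetical label inventory; "O" stands for the null class (no punctuation).
PUNCT_LABELS = {",": "COMMA", "。": "PERIOD", "?": "QUESTION"}


def to_sequence_labels(tokens: List[str]) -> List[Tuple[str, str]]:
    """Pair each word with the class of the punctuation mark that follows it."""
    pairs: List[Tuple[str, str]] = []
    for tok in tokens:
        if tok in PUNCT_LABELS:
            if pairs:  # attach the mark to the preceding word
                word, _ = pairs[-1]
                pairs[-1] = (word, PUNCT_LABELS[tok])
        else:
            pairs.append((tok, "O"))
    return pairs


def to_seq2seq_pair(tokens: List[str]) -> Tuple[str, str]:
    """Return (unpunctuated input, punctuated output) for an Encode-Decode model."""
    source = "".join(t for t in tokens if t not in PUNCT_LABELS)
    target = "".join(tokens)
    return source, target


# Assumed pre-tokenized example sentence.
tokens = ["今天", "天气", "不错", ",", "我", "想", "去", "爬山", "。"]
```

In the sequence labeling form the model predicts one class per word, whereas in the Encode-Decode form it generates the whole punctuated string.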
Further, once the structure of the text segmentation model has been determined, training data must be obtained to train it. In the embodiments of the present application, a large number of source language training texts can be collected as training data; the set of source language training texts is denoted T1. It is also necessary to determine, for each source language training text in T1, the segmentation result that fits the current translation scenario, which serves as the training label of that text; the training labels are used together with the training data to train the text segmentation model. It can be understood that the training data in this embodiment may be extracted from the source language text to be translated. Training data can also be obtained in other ways, for example by selecting part of the text from existing materials.
The text segmentation model trained on the above training data and training labels is able to segment an input sample according to the needs of the current translation scenario and output a segmentation result that fits it. The obtained source language text can therefore be input into the text segmentation model to obtain the segmented source language text output by the model, that is, the source language text after segmentation optimization.
Another embodiment of the present application elaborates on how the text segmentation model is determined. The process may include:
A1. Obtain source language training texts.
As described above, the set of source language training texts is denoted T1.
A2. Determine the segmentation result of the source language training text that fits the current translation scenario, as the target segmentation result.
The set of target segmentation results of the source language training texts that fit the current translation scenario is denoted T2. T2 is the result of re-segmenting each source language training text in T1.
A3. Train the text segmentation model using the source language training texts as training data and the target segmentation results as training labels.
Building on the determination process above, the embodiments of the present application provide another way to determine the text segmentation model, in which the following steps are added before A3:
A4. Obtain the result of manually annotating the source language training text with punctuation, yielding the manually annotated source language training text.
Specifically, after the source language training text is obtained, punctuation can be annotated manually to obtain the manually annotated source language training text.
A5. Train a text segmentation model using the source language training texts as training data and the manually annotated source language training texts as training labels, to obtain a preliminary text segmentation model.
That is, on the basis of step A4, a text segmentation model is first trained on the manual annotations, and the result is the preliminary text segmentation model.
On this basis, step A3 above may specifically include:
Training the preliminary text segmentation model using the source language training texts as training data and the target segmentation results as training labels.
Specifically, a model adaptation (adaptive update) method may be used to update the parameters of the preliminary text segmentation model, with the source language training texts as training data and the target segmentation results as training labels.
This update method increases the amount of training data the model sees, which in turn improves the trained text segmentation model.
Another embodiment of the present application describes step A2 above, in which the segmentation result of the source language training text that fits the current translation scenario is determined as the target segmentation result.
As explained above, a segmentation rule can be preset for each translation scenario. In this embodiment, the source language training texts can therefore be segmented according to the rule for the current translation scenario to obtain the segmentation result of each source language training text.
In addition, this embodiment provides another optional implementation, which may specifically include:
A21. Obtain the target language training text, that is, the translation of the source language training text in the current translation scenario.
Specifically, the target language training text can be determined by human translation: a human translator translates the source language training text according to the current translation scenario to obtain the target language training text.
A22. With reference to a set segmentation modification method, modify the segmentation of the source language training text to obtain modified source language training texts; the modified source language training texts together with the original source language training text form the candidate source language training texts.
Specifically, the embodiments of the present application may preset a segmentation modification method and then modify the segmentation of the source language training text accordingly.
It can be understood that, with a suitably chosen modification method, multiple modified source language training texts can be derived. The segmentation of the source language training text that fits the current translation scenario is either the segmentation of the original text itself or that of one of the modified texts.
In other words, this step expands the source language training text into a set of candidate source language training texts, which contains the segmentation that fits the current translation scenario.
A23. Translate each candidate source language training text using a preset machine translation model to obtain the machine translation result of each candidate.
A24. Determine the similarity between the machine translation result of each candidate source language training text and the target language training text, and take the candidate source language training text with the highest similarity as the target segmentation result.
Specifically, the target language training text is the translation of the source language training text in the current translation scenario. In this step it is therefore taken as the reference, and the similarity between each candidate's machine translation result and the target language training text is determined. The higher a candidate's similarity, the better that candidate fits the current translation scenario. The candidate source language training text with the highest similarity can therefore be selected as the target segmentation result of the source language training text.
Optionally, the similarity in this step may be computed by BLEU scoring: with the target language training text as the reference, the machine translation result of each candidate source language training text is scored, and a higher score indicates a higher similarity to the target language training text.
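To make the scoring step concrete, the following is a simplified, sentence-level stand-in for BLEU. It is not the full BLEU definition: it uses a single reference, add-one smoothing on every n-gram order, and assumes the texts are already tokenized; the function names are hypothetical. In practice an established BLEU implementation would likely be used instead.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of the given order in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis: List[str], reference: List[str], max_n: int = 4) -> float:
    """Simplified sentence-level BLEU with add-one smoothing (single reference)."""
    if not hypothesis:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        total = sum(hyp.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty discourages hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(hypothesis))
    return brevity_penalty * math.exp(log_precision)
```

A machine translation result that shares more n-grams with the target language training text receives a higher score, which is the property step A24 relies on.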
Further, this embodiment describes an optional way to modify the segmentation of the source language training text to obtain the modified source language training texts, which may specifically include:
A221. Determine the non-terminal punctuation marks contained in the source language training text.
Specifically, for each source language training text T1_j (j = 1…n) in the set T1, where n is the number of source language training texts in T1, determine the number M of non-terminal punctuation marks contained in T1_j.
A222. Replace the non-terminal punctuation marks contained in the source language training text with terminal punctuation marks to obtain the modified source language training texts.
It can be understood that any non-terminal punctuation mark in T1_j can be replaced with a terminal punctuation mark.
With this replacement scheme, if T1_j contains M non-terminal punctuation marks, each mark is either kept or replaced, so the original source language training text together with the modified texts obtained by replacement forms a candidate set of 2^M candidate source language training texts.
An example follows:
Consider the source language training text “今天天气不错,我想去爬山,你去吗?” (“The weather is nice today, I want to go hiking, are you coming?”). Its two commas are non-terminal punctuation marks, and each can be replaced with a terminal punctuation mark such as a period, giving 2^2 = 4 candidate source language training texts:
1. 今天天气不错,我想去爬山,你去吗? (The weather is nice today, I want to go hiking, are you coming?)
2. 今天天气不错。我想去爬山,你去吗? (The weather is nice today. I want to go hiking, are you coming?)
3. 今天天气不错,我想去爬山。你去吗? (The weather is nice today, I want to go hiking. Are you coming?)
4. 今天天气不错。我想去爬山。你去吗? (The weather is nice today. I want to go hiking. Are you coming?)
Of the four candidates, the first is the source language training text itself, and the second to fourth are the modified source language training texts obtained by punctuation replacement.
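The enumeration of the 2^M candidates can be sketched as below. The non-terminal inventory, the choice of the period as the replacement mark, and the function name `candidate_segmentations` are assumptions made for illustration; the order in which candidates are produced is also not specified by the application.

```python
from itertools import product
from typing import List

# Hypothetical non-terminal punctuation inventory.
NON_TERMINAL = frozenset(",、")


def candidate_segmentations(text: str, terminal: str = "。") -> List[str]:
    """Enumerate all 2^M texts obtained by keeping or replacing each non-terminal mark."""
    positions = [i for i, ch in enumerate(text) if ch in NON_TERMINAL]
    candidates = []
    for choice in product((False, True), repeat=len(positions)):
        chars = list(text)
        for pos, replace in zip(positions, choice):
            if replace:
                chars[pos] = terminal
        candidates.append("".join(chars))
    return candidates
```

Because the candidate set grows exponentially in M, a practical system would presumably cap M or prune candidates for long training texts; that limit is not discussed in the source.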
Based on the implementation of A22 described above, the embodiments of the present application further describe an optional implementation of A23, translating each candidate source language training text using a preset machine translation model, which may specifically include:
A231. Divide each candidate source language training text into clauses at the terminal punctuation marks it contains, to obtain a clause sequence.
Specifically, for each candidate source language training text, traverse the terminal punctuation marks it contains from the beginning and treat each as a split point, dividing the candidate into clauses. The clauses, in the order in which they appear in the candidate, form the clause sequence.
A232. Using the preset machine translation model, translate each clause in the clause sequence of the candidate source language training text separately to obtain the machine translation result of each clause.
A233. Merge the machine translation results of the clauses in clause-sequence order to obtain the machine translation result of the candidate source language training text.
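Steps A231 to A233 can be sketched as follows. The machine translation model is represented by an arbitrary callable `translate`, which is a placeholder rather than a real API, and the terminal punctuation inventory and the space used to join clause translations are assumptions.

```python
from typing import Callable, List

# Hypothetical terminal punctuation inventory.
TERMINAL = frozenset("。?!")


def split_clauses(text: str) -> List[str]:
    """A231: split after every terminal punctuation mark, keeping the mark."""
    clauses, start = [], 0
    for i, ch in enumerate(text):
        if ch in TERMINAL:
            clauses.append(text[start:i + 1])
            start = i + 1
    if start < len(text):  # trailing text without a terminal mark
        clauses.append(text[start:])
    return clauses


def translate_by_clause(text: str, translate: Callable[[str], str], sep: str = " ") -> str:
    """A232 and A233: translate each clause separately, then merge in order."""
    return sep.join(translate(clause) for clause in split_clauses(text))
```

Passing a stub such as `lambda clause: f"<{clause}>"` as `translate` shows which units would be sent to the translation model and in what order the results are merged.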
Since there are 2^M candidate source language training texts and each is translated in this way, 2^M machine translation results are finally obtained.
With the above processing, some non-terminal punctuation marks are converted into terminal ones, so terminal punctuation occurs more often. Because machine translation translates the content preceding each terminal punctuation mark as one unit, the scheme shortens the wait for a terminal punctuation mark, which speeds up the output of translation results, reduces the time the user perceives waiting for them, and improves the user experience.
The implementation of A23 is illustrated with the same example:
For convenience, the four candidate source language training texts above are referred to as candidate texts 1 to 4.
Candidate text 1: a terminal punctuation mark appears only at the end of the sentence and none appears inside it, so the sentence cannot be split further; in other words, the only clause is candidate text 1 itself. Candidate text 1 is therefore fed to the machine translation model as a single sentence.
Candidate text 2: “不错” is followed by a period, so the sentence can be split into two clauses:
Clause 21: 今天天气不错。 (The weather is nice today.)
Clause 22: 我想去爬山,你去吗? (I want to go hiking, are you coming?)
The two clauses are fed to the machine translation model separately, and the machine translation results are merged to obtain the machine translation result of candidate text 2.
Candidate text 3: “爬山” is followed by a period, so the sentence can be split into two clauses:
Clause 31: 今天天气不错,我想去爬山。 (The weather is nice today, I want to go hiking.)
Clause 32: 你去吗? (Are you coming?)
The two clauses are fed to the machine translation model separately, and the machine translation results are merged to obtain the machine translation result of candidate text 3.
Candidate text 4: “不错” and “爬山” are each followed by a period, so the sentence can be split into three clauses:
Clause 41: 今天天气不错。 (The weather is nice today.)
Clause 42: 我想去爬山。 (I want to go hiking.)
Clause 43: 你去吗? (Are you coming?)
The three clauses are fed to the machine translation model separately, and the machine translation results are merged to obtain the machine translation result of candidate text 4.
Further, suppose candidate texts 1 to 4 are scored with the BLEU method and receive scores of 0.1, 0.2, 0.3, and 0.4 respectively. Candidate text 4, which has the highest score, is then selected as the target segmentation result of the source language training text that fits the current translation scenario.
Thus, the source language training text is: “今天天气不错,我想去爬山,你去吗?”
and the target segmentation result is: “今天天气不错。我想去爬山。你去吗?”
This source language training text and its target segmentation result can be used as training data and training label, respectively, to train the text segmentation model.
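The selection of the training label in this worked example can be sketched as below. The function `select_target_segmentation` and its parameters are hypothetical; the translation model and the similarity measure are passed in as callables, and the demonstration uses stubs that simply reproduce the assumed scores 0.1 to 0.4 from the example rather than computing real BLEU values.

```python
from typing import Callable, Sequence, Tuple


def select_target_segmentation(
    candidates: Sequence[str],
    reference: str,
    translate: Callable[[str], str],
    similarity: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Return the candidate whose machine translation is most similar to the reference."""
    scored = [(similarity(translate(c), reference), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


candidates = [
    "今天天气不错,我想去爬山,你去吗?",
    "今天天气不错。我想去爬山,你去吗?",
    "今天天气不错,我想去爬山。你去吗?",
    "今天天气不错。我想去爬山。你去吗?",
]
# Scores assumed in the worked example, not computed here.
example_scores = dict(zip(candidates, (0.1, 0.2, 0.3, 0.4)))

label, score = select_target_segmentation(
    candidates,
    reference="",                    # unused by the stub scorer below
    translate=lambda text: text,     # stub: identity in place of a real MT model
    similarity=lambda hyp, ref: example_scores[hyp],
)
training_pair = (candidates[0], label)  # (training data, training label)
```

With real components, `translate` would wrap the preset machine translation model and `similarity` would compare its output against the human-translated target language training text.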
Yet another embodiment of the present application describes step S120 above, in which the segmented source language text is translated to obtain the target language text.
As described in the above embodiments, a machine translation model can be used to translate the segmented source language text. Specifically, the segmented source language text is first divided into clauses at the terminal punctuation marks it contains to obtain a clause sequence. Each clause in the sequence is then translated separately using the preset machine translation model to obtain the machine translation result of each clause. Finally, the machine translation results of the clauses are merged in clause-sequence order to obtain the machine translation result of the segmented source language text, that is, the target language text.
As the above embodiments show, the present application segments the source language text with the current translation scenario taken into account, so the segmented source language text better fits that scenario, and translating the segmented text yields target language text of higher quality.
Further, by replacing punctuation during segmentation, the present application converts some non-terminal punctuation marks into terminal ones, so terminal punctuation occurs more often. Because machine translation translates the content preceding each terminal punctuation mark as one unit, the scheme shortens the wait for a terminal punctuation mark, which speeds up the output of translation results, reduces the time the user perceives waiting for them, and improves the user experience.
The translation apparatus provided in the embodiments of the present application is described below. The translation apparatus described below and the translation method described above may be referred to in correspondence with each other.
Refer to FIG. 2, which is a schematic structural diagram of a translation apparatus disclosed in an embodiment of the present application.
As shown in FIG. 2, the apparatus may include:
a source language text obtaining unit 11, configured to obtain the source language text to be translated;
a text segmentation unit 12, configured to segment the source language text according to the current translation scenario to obtain the segmented source language text;
a source language text translation unit 13, configured to translate the segmented source language text to obtain the target language text.
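One way to picture how the three units cooperate is the small sketch below, in which each unit is modeled as a callable. The class name `TranslationApparatus`, the field names, and the callable signatures are illustrative assumptions only; the application describes the units functionally and does not prescribe any software structure.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TranslationApparatus:
    obtain_text: Callable[[], str]       # source language text obtaining unit 11
    segment: Callable[[str, str], str]   # text segmentation unit 12: (text, scenario) -> segmented text
    translate: Callable[[str], str]      # source language text translation unit 13

    def run(self, scenario: str) -> str:
        """Obtain the source text, segment it for the scenario, then translate it."""
        source = self.obtain_text()
        segmented = self.segment(source, scenario)
        return self.translate(segmented)


# Stub units for demonstration; real units would wrap the trained models.
apparatus = TranslationApparatus(
    obtain_text=lambda: "今天天气不错,我想去爬山",
    segment=lambda text, scenario: text.replace(",", "。") if scenario == "conference" else text,
    translate=lambda text: f"[translated] {text}",
)
```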
Optionally, the text segmentation unit may include:
a model reference unit, configured to input the source language text into a preset text segmentation model to obtain the segmented source language text output by the text segmentation model;
wherein the text segmentation model is obtained by training with source language training texts as training data and with the segmentation results of the source language training texts that fit the current translation scenario as training labels.
Optionally, the translation apparatus of the present application may further include a text segmentation model determination unit, configured to determine the text segmentation model. The text segmentation model determination unit may include:
a source language training text obtaining unit, configured to obtain source language training texts;
a segmentation result determination unit, configured to determine the segmentation result of the source language training text that fits the current translation scenario, as the target segmentation result;
a first model training unit, configured to train the text segmentation model using the source language training texts as training data and the target segmentation results as training labels.
Optionally, the segmentation result determination unit may include:
a target language training text obtaining unit, configured to obtain the target language training text, that is, the translation of the source language training text in the current translation scenario;
a segmentation modification unit, configured to modify, with reference to a set segmentation modification method, the segmentation of the source language training text to obtain modified source language training texts, the modified source language training texts together with the source language training text forming the candidate source language training texts;
a source language training text translation unit, configured to translate each candidate source language training text using a preset machine translation model to obtain the machine translation result of each candidate source language training text;
a similarity determination unit, configured to determine the similarity between the machine translation result of each candidate source language training text and the target language training text, and to take the candidate source language training text with the highest similarity as the target segmentation result.
Optionally, the segmentation modification unit may include:
a non-terminal punctuation determination unit, configured to determine the non-terminal punctuation marks contained in the source language training text;
a non-terminal punctuation replacement unit, configured to replace the non-terminal punctuation marks contained in the source language training text with terminal punctuation marks to obtain the modified source language training texts.
Optionally, the source language training text translation unit may include:
a first clause division unit, configured to divide each candidate source language training text into clauses at the terminal punctuation marks it contains, to obtain a clause sequence;
a first clause translation unit, configured to translate, using the preset machine translation model, each clause in the clause sequence of the candidate source language training text separately to obtain the machine translation result of each clause;
a first translation result merging unit, configured to merge the machine translation results of the clauses in clause-sequence order to obtain the machine translation result of the candidate source language training text.
Optionally, the above text segmentation model determination unit may further include:
A manual annotation result obtaining unit, configured to obtain the result of manually annotating the source language training text with punctuation, yielding the manually annotated source language training text;
A second model training unit, configured to train a text segmentation model with the source language training text as training data and the manually annotated source language training text as the training label, obtaining a preliminary text segmentation model. On this basis, the above first model training unit may specifically be configured to:
Train the preliminary text segmentation model with the source language training text as training data and the target sentence segmentation result as the training label.
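The two training stages can be outlined without committing to a model architecture. In the sketch below, the per-character label encoding is an assumption (the text does not say how labels are represented), and `fit` is a hypothetical training routine supplied by the caller that takes a model and training pairs and returns the updated model.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Model = TypeVar("Model")
Pair = Tuple[str, List[str]]

PUNCTUATION = frozenset(",、;:。!?")  # illustrative set, not from the source


def to_labels(segmented: str) -> Pair:
    """Turn a punctuated text into (bare characters, one label per character).

    Each label is the punctuation mark that follows the character, or "O"
    when no mark follows it.
    """
    chars: List[str] = []
    labels: List[str] = []
    for ch in segmented:
        if ch in PUNCTUATION:
            if labels:
                labels[-1] = ch  # attach the mark to the preceding character
        else:
            chars.append(ch)
            labels.append("O")
    return "".join(chars), labels


def two_stage_fit(
    fit: Callable[[Model, Sequence[Pair]], Model],
    model: Model,
    manual_pairs: Sequence[Pair],
    target_pairs: Sequence[Pair],
) -> Model:
    """Train on manual annotations first, then continue on target segmentations."""
    preliminary = fit(model, manual_pairs)  # preliminary text segmentation model
    return fit(preliminary, target_pairs)   # scenario-specific model
```

The first stage gives the model general punctuation knowledge from manual annotation; the second stage adapts it to the segmentation that suits the translation scenario.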
Optionally, the above source language text translation unit may include:
A second clause dividing unit, configured to divide the segmented source language text into clauses at the terminating punctuation marks it contains, obtaining a clause sequence;
A second clause translation unit, configured to translate each clause in the clause sequence of the segmented source language text separately with a preset machine translation model, obtaining a machine translation result for each clause;
A second translation result merging unit, configured to merge the machine translation results of the clauses in the order of the clauses in the clause sequence, obtaining the target language text.
The translation apparatus provided in the embodiments of the present application may be applied to a translation device, such as a PC terminal, a cloud platform, a server, or a server cluster. Optionally, FIG. 3 shows a block diagram of the hardware structure of the translation device. Referring to FIG. 3, the hardware structure of the translation device may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4;
In the embodiments of the present application, there is at least one each of the processor 1, the communication interface 2, the memory 3, and the communication bus 4, and the processor 1, the communication interface 2, and the memory 3 communicate with one another through the communication bus 4;
The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
The memory 3 may include high-speed RAM, and may also include non-volatile memory, for example at least one disk memory;
The memory stores a program, and the processor can call the program stored in the memory, the program being used to:
Obtain the source language text to be translated;
Segment the source language text into sentences according to the current translation scenario, obtaining the segmented source language text;
Translate the segmented source language text, obtaining the target language text.
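The three program steps can be wired together as below. This is a sketch under stated assumptions: keeping one segmentation model per scenario in a dictionary is an illustrative choice, and the names `segmenters` and `translate` are hypothetical placeholders for the trained segmentation models and the translation step.

```python
from typing import Callable, Mapping


def translate_for_scenario(
    source_text: str,
    scenario: str,
    segmenters: Mapping[str, Callable[[str], str]],
    translate: Callable[[str], str],
) -> str:
    """Segment the source text for the given scenario, then translate it."""
    try:
        segment = segmenters[scenario]
    except KeyError:
        raise ValueError(f"no segmentation model for scenario {scenario!r}") from None
    segmented = segment(source_text)  # segmented source language text
    return translate(segmented)       # target language text
```

Keeping segmentation as a separate, scenario-keyed step means the translation model itself does not need to be retrained per scenario.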
Optionally, for the refined and extended functions of the program, refer to the description above.
An embodiment of the present application further provides a readable storage medium, which may store a program suitable for execution by a processor, the program being used to:
Obtain the source language text to be translated;
Segment the source language text into sentences according to the current translation scenario, obtaining the segmented source language text;
Translate the segmented source language text, obtaining the target language text.
Optionally, for the refined and extended functions of the program, refer to the description above.
Finally, it should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (18)

  1. A translation method, characterized by comprising:
    obtaining a source language text to be translated;
    segmenting the source language text into sentences according to a current translation scenario, to obtain a segmented source language text;
    translating the segmented source language text, to obtain a target language text.
  2. The method according to claim 1, characterized in that segmenting the source language text into sentences according to the translation scenario, to obtain the segmented source language text, comprises:
    inputting the source language text into a preset text segmentation model, to obtain the segmented source language text output by the text segmentation model;
    wherein the text segmentation model is obtained by training with a source language training text as training data, and with a sentence segmentation result of the source language training text that matches the current translation scenario as a training label.
  3. The method according to claim 2, characterized in that the process of determining the text segmentation model comprises:
    obtaining a source language training text;
    determining a sentence segmentation result of the source language training text that matches the current translation scenario, as a target sentence segmentation result;
    training a text segmentation model with the source language training text as training data and the target sentence segmentation result as a training label.
  4. The method according to claim 3, characterized in that determining the sentence segmentation result of the source language training text that matches the current translation scenario, as the target sentence segmentation result, comprises:
    obtaining a target language training text that is the translation of the source language training text in the current translation scenario;
    changing the sentence segmentation of the source language training text with reference to a set segmentation change method, to obtain a changed source language training text, the changed source language training text and the source language training text together forming candidate source language training texts;
    translating each candidate source language training text with a preset machine translation model, to obtain a machine translation result of each candidate source language training text;
    determining the similarity between the machine translation result of each candidate source language training text and the target language training text, and taking the candidate source language training text with the highest similarity as the target sentence segmentation result.
  5. The method according to claim 4, characterized in that changing the sentence segmentation of the source language training text with reference to the set segmentation change method, to obtain the changed source language training text, comprises:
    determining the non-terminating punctuation marks contained in the source language training text;
    replacing each non-terminating punctuation mark contained in the source language training text with a terminating punctuation mark, to obtain the changed source language training text.
  6. The method according to claim 5, characterized in that translating each candidate source language training text with the preset machine translation model, to obtain the machine translation result of each candidate source language training text, comprises:
    dividing each candidate source language training text into clauses at the terminating punctuation marks it contains, to obtain a clause sequence;
    translating each clause in the clause sequence of the candidate source language training text separately with the preset machine translation model, to obtain a machine translation result of each clause;
    merging the machine translation results of the clauses in the order of the clauses in the clause sequence, to obtain the machine translation result of the candidate source language training text.
  7. The method according to claim 3, characterized in that, before training the text segmentation model with the source language training text as training data and the target sentence segmentation result as the training label, the method further comprises:
    obtaining a result of manually annotating the source language training text with punctuation, to obtain a manually annotated source language training text;
    training a text segmentation model with the source language training text as training data and the manually annotated source language training text as a training label, to obtain a preliminary text segmentation model;
    and then, training the text segmentation model with the source language training text as training data and the target sentence segmentation result as the training label comprises:
    training the preliminary text segmentation model with the source language training text as training data and the target sentence segmentation result as the training label.
  8. The method according to claim 1, characterized in that translating the segmented source language text, to obtain the target language text, comprises:
    dividing the segmented source language text into clauses at the terminating punctuation marks it contains, to obtain a clause sequence;
    translating each clause in the clause sequence of the segmented source language text separately with a preset machine translation model, to obtain a machine translation result of each clause;
    merging the machine translation results of the clauses in the order of the clauses in the clause sequence, to obtain the target language text.
  9. A translation apparatus, characterized by comprising:
    a source language text obtaining unit, configured to obtain a source language text to be translated;
    a text segmentation unit, configured to segment the source language text into sentences according to a current translation scenario, to obtain a segmented source language text;
    a source language text translation unit, configured to translate the segmented source language text, to obtain a target language text.
  10. The apparatus according to claim 9, characterized in that the text segmentation unit comprises:
    a model reference unit, configured to input the source language text into a preset text segmentation model, to obtain the segmented source language text output by the text segmentation model;
    wherein the text segmentation model is obtained by training with a source language training text as training data, and with a sentence segmentation result of the source language training text that matches the current translation scenario as a training label.
  11. The apparatus according to claim 10, characterized by further comprising: a text segmentation model determination unit, configured to determine the text segmentation model; the text segmentation model comprises:
    a source language training text obtaining unit, configured to obtain a source language training text;
    a segmentation result determination unit, configured to determine a sentence segmentation result of the source language training text that matches the current translation scenario, as a target sentence segmentation result;
    a first model training unit, configured to train a text segmentation model with the source language training text as training data and the target sentence segmentation result as a training label.
  12. The apparatus according to claim 11, characterized in that the segmentation result determination unit comprises:
    a target language training text obtaining unit, configured to obtain a target language training text that is the translation of the source language training text in the current translation scenario;
    a sentence segmentation changing unit, configured to change the sentence segmentation of the source language training text with reference to a set segmentation change method, to obtain a changed source language training text, the changed source language training text and the source language training text together forming candidate source language training texts;
    a source language training text translation unit, configured to translate each candidate source language training text with a preset machine translation model, to obtain a machine translation result of each candidate source language training text;
    a similarity determination unit, configured to determine the similarity between the machine translation result of each candidate source language training text and the target language training text, and to take the candidate source language training text with the highest similarity as the target sentence segmentation result.
  13. The apparatus according to claim 12, characterized in that the sentence segmentation changing unit comprises:
    a non-terminating punctuation determining unit, configured to determine the non-terminating punctuation marks contained in the source language training text;
    a non-terminating punctuation replacement unit, configured to replace each non-terminating punctuation mark contained in the source language training text with a terminating punctuation mark, to obtain the changed source language training text.
  14. The apparatus according to claim 13, characterized in that the source language training text translation unit comprises:
    a first clause dividing unit, configured to divide each candidate source language training text into clauses at the terminating punctuation marks it contains, to obtain a clause sequence;
    a first clause translation unit, configured to translate each clause in the clause sequence of the candidate source language training text separately with the preset machine translation model, to obtain a machine translation result of each clause;
    a first translation result merging unit, configured to merge the machine translation results of the clauses in the order of the clauses in the clause sequence, to obtain the machine translation result of the candidate source language training text.
  15. The apparatus according to claim 11, characterized in that the text segmentation model further comprises:
    a manual annotation result obtaining unit, configured to obtain a result of manually annotating the source language training text with punctuation, to obtain a manually annotated source language training text;
    a second model training unit, configured to train a text segmentation model with the source language training text as training data and the manually annotated source language training text as a training label, to obtain a preliminary text segmentation model;
    and then the first model training unit is specifically configured to:
    train the preliminary text segmentation model with the source language training text as training data and the target sentence segmentation result as the training label.
  16. The apparatus according to claim 9, characterized in that the source language text translation unit comprises:
    a second clause dividing unit, configured to divide the segmented source language text into clauses at the terminating punctuation marks it contains, to obtain a clause sequence;
    a second clause translation unit, configured to translate each clause in the clause sequence of the segmented source language text separately with a preset machine translation model, to obtain a machine translation result of each clause;
    a second translation result merging unit, configured to merge the machine translation results of the clauses in the order of the clauses in the clause sequence, to obtain the target language text.
  17. A translation device, characterized by comprising a memory and a processor;
    the memory being configured to store a program;
    the processor being configured to execute the program, to implement the steps of the translation method according to any one of claims 1-8.
  18. A readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps of the translation method according to any one of claims 1-8 are implemented.
PCT/CN2018/119329 2018-10-30 2018-12-05 Translation method, apparatus and device, and readable storage medium WO2020087655A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811276866.X 2018-10-30
CN201811276866.XA CN109408833A (en) 2018-10-30 2018-10-30 A kind of interpretation method, device, equipment and readable storage medium storing program for executing

Publications (1)

Publication Number Publication Date
WO2020087655A1 true WO2020087655A1 (en) 2020-05-07

Family

ID=65470039

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/119329 WO2020087655A1 (en) 2018-10-30 2018-12-05 Translation method, apparatus and device, and readable storage medium

Country Status (2)

Country Link
CN (1) CN109408833A (en)
WO (1) WO2020087655A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110321532A (en) * 2019-06-06 2019-10-11 数译(成都)信息技术有限公司 Language pre-processes punctuate method, computer equipment and computer readable storage medium
CN110232194B (en) * 2019-06-17 2024-04-09 安徽听见科技有限公司 Translation display method, device, equipment and readable storage medium
CN112151019A (en) * 2019-06-26 2020-12-29 阿里巴巴集团控股有限公司 Text processing method and device and computing equipment
CN113591491B (en) * 2020-04-30 2023-12-26 阿里巴巴集团控股有限公司 Speech translation text correction system, method, device and equipment
CN111611811B (en) * 2020-05-25 2023-01-13 腾讯科技(深圳)有限公司 Translation method, translation device, electronic equipment and computer readable storage medium
CN111654658B (en) * 2020-06-17 2022-04-15 平安科技(深圳)有限公司 Audio and video call processing method and system, coder and decoder and storage device
CN112232091B (en) * 2020-10-14 2021-11-16 文思海辉智科科技有限公司 Content matching method and device and readable storage medium
CN112560510B (en) * 2020-12-10 2023-12-01 科大讯飞股份有限公司 Translation model training method, device, equipment and storage medium
CN112668346A (en) * 2020-12-24 2021-04-16 科大讯飞股份有限公司 Translation method, device, equipment and storage medium
CN113378586B (en) * 2021-07-15 2023-03-28 北京有竹居网络技术有限公司 Speech translation method, translation model training method, device, medium, and apparatus
CN113660432A (en) * 2021-08-17 2021-11-16 安徽听见科技有限公司 Translation subtitle production method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030055626A1 (en) * 2001-09-19 2003-03-20 International Business Machines Corporation Sentence segmentation method and sentence segmentation apparatus, machine translation system, and program product using sentence segmentation method
CN101458681A (en) * 2007-12-10 2009-06-17 株式会社东芝 Voice translation method and voice translation apparatus
CN103530284A (en) * 2013-09-22 2014-01-22 中国专利信息中心 Short sentence segmenting device, machine translation system and corresponding segmenting method and translation method
CN107247706A (en) * 2017-06-16 2017-10-13 中国电子技术标准化研究院 Text punctuate method for establishing model, punctuate method, device and computer equipment
CN108628819A (en) * 2017-03-16 2018-10-09 北京搜狗科技发展有限公司 Treating method and apparatus, the device for processing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10303777B2 (en) * 2016-08-08 2019-05-28 Netflix, Inc. Localization platform that leverages previously translated content

Also Published As

Publication number Publication date
CN109408833A (en) 2019-03-01

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18939010

Country of ref document: EP

Kind code of ref document: A1