WO2020087655A1 - Translation method, apparatus and device, and readable storage medium - Google Patents

Translation method, apparatus and device, and readable storage medium

Info

Publication number
WO2020087655A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
source language
training
translation
language training
Prior art date
Application number
PCT/CN2018/119329
Other languages
English (en)
Chinese (zh)
Inventor
孔常青
高建清
刘俊华
胡国平
Original Assignee
科大讯飞股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 科大讯飞股份有限公司
Publication of WO2020087655A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities

Definitions

  • the text segmentation model is obtained by training with the source language training text as the training data and with the sentence segmentation result of the source language training text that matches the current translation scene as the training label.
  • the sentence breaking of the source language training text is changed to obtain changed source language training text; the candidate source language training texts consist of the changed source language training text and the original source language training text;
  • using a preset machine translation model to translate each candidate source language training text, to obtain a machine translation result for each candidate source language training text, includes:
  • translating the source language text after sentence segmentation to obtain the target language text includes:
  • a text segmentation model determination unit, which is used to determine a text segmentation model
  • the text segmentation model includes:
  • the sentence segmentation result determination unit includes:
  • a non-terminating punctuation determining unit configured to determine the non-terminating punctuation included in the source language training text
  • a second model training unit configured to use the source language training text as training data and the manually labeled source language training text as training labels to train a text segmentation model, obtaining a preliminary text segmentation model;
  • the second clause translation unit is used to translate each clause in the clause sequence of the source language text after the sentence segmentation by using a preset machine translation model to obtain a machine translation result of each clause;
  • the sentence breaking of the source language text obtained in the previous step (that is, the punctuation in the source language text) may be affected by the speaker's speaking habits.
  • the sentence breaking is not standardized and does not take the current translation scenario into account. If the obtained source language text is translated directly, the quality of the translation result is low.
  • a sentence segmentation step is added for the source language text, and this step takes the current translation scenario into account, so that the sentence breaking of the source language text after segmentation better fits the current translation scenario.
  • the embodiments of the present application can also synthesize the target language text into speech, according to the needs of the user, and then play it back, realizing the conversion from source language speech to target language speech.
  • the embodiments of the present application also provide another way of segmenting the source language text into sentences: the segmentation can be performed using a machine learning model. The detailed process is as follows:
  • the machine learning model for sentence segmentation in this embodiment is defined as a text sentence segmentation model. It can use existing machine learning models of various structures, such as the BLSTM model or the Self-Attention model under the sequence annotation framework, or a sequence generation model under the Encode-Decode (codec) framework; a combination of several existing model structures can also be used.
  • part of the non-terminating punctuation can be converted into terminating punctuation, so the probability that terminating punctuation occurs increases. Because machine translation translates the content that precedes a terminating punctuation mark, the scheme of the present application shortens the wait for terminating punctuation, which speeds up the output of translation results, reduces the time users perceive themselves waiting for translation results, and improves the user experience.
  • the first model training unit is used to train the text segmentation model by using the source language training text as training data and the target sentence segmentation result as a training label.
  • a manual labeling result obtaining unit, which is used to obtain the result of manually punctuating the source language training text, yielding the manually labeled source language training text;
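
The candidate-generation step described in the definitions above can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the punctuation sets, the function names `candidate_texts` and `select_label`, and the `translate` and `score` callables are all hypothetical, and the definitions do not say how the target sentence segmentation result is chosen among the candidates, so picking the candidate whose machine translation scores highest is an assumption made here for illustration.

```python
from itertools import product
from typing import Callable, List

# Assumed punctuation inventory; the source does not enumerate the marks.
NON_TERMINATING = {",", ";", ":"}
TERMINATING = "."


def candidate_texts(text: str) -> List[str]:
    """Return the original text plus every variant in which some subset of
    its non-terminating punctuation is replaced by a terminating mark.

    The first element is always the unchanged text. The number of
    candidates doubles with each non-terminating mark, so this exhaustive
    enumeration only suits short training sentences.
    """
    positions = [i for i, ch in enumerate(text) if ch in NON_TERMINATING]
    candidates = []
    for flips in product([False, True], repeat=len(positions)):
        chars = list(text)
        for pos, flip in zip(positions, flips):
            if flip:
                chars[pos] = TERMINATING
        candidates.append("".join(chars))
    return candidates


def select_label(
    text: str,
    translate: Callable[[str], str],
    score: Callable[[str], float],
) -> str:
    """Translate every candidate and return the candidate whose translation
    scores highest. Both callables are placeholders: `translate` stands in
    for the preset machine translation model, `score` for whatever quality
    measure is used (an assumption, not stated in the source)."""
    return max(candidate_texts(text), key=lambda c: score(translate(c)))
```

The selected candidate would then serve as the training label paired with the original source language training text. The sketch does not re-capitalize the word that follows a newly inserted terminating mark.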

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a translation method, apparatus and device, and a readable storage medium. The method comprises: obtaining a source language text to be translated; and further performing sentence segmentation on the source language text according to the current translation scene, so that the source language text obtained after sentence segmentation better fits the current translation scene. Compared with existing translation methods, the present invention adds a sentence segmentation optimization step for the obtained source language text: the source language text is segmented again with the current translation scene taken into account, which improves its sentence breaking, and the re-segmented source language text is then translated, so that the resulting target language text is of higher quality.
PCT/CN2018/119329 2018-10-30 2018-12-05 Translation method, apparatus and device, and readable storage medium WO2020087655A1 (fr)
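
The two steps of the abstract, re-segmenting the source text and then translating it, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the claimed method: `resegment` stands in for the scene-aware sentence segmentation step and `translate` for a preset machine translation model, both hypothetical callables supplied by the caller, and the set of terminating punctuation marks is assumed.

```python
import re
from typing import Callable, List

# Assumed terminating punctuation (ASCII and full-width forms).
_CLAUSE_END = re.compile(r"(?<=[.!?。！？])\s*")


def split_clauses(text: str) -> List[str]:
    """Split text into clauses, cutting after each terminating mark.

    Relies on re.split accepting patterns that can match the empty
    string, which requires Python 3.7 or later.
    """
    return [part for part in _CLAUSE_END.split(text) if part]


def translate_text(
    source: str,
    resegment: Callable[[str], str],
    translate: Callable[[str], str],
) -> str:
    """Re-segment the source text, then translate it clause by clause.

    `resegment` returns the source text with revised punctuation;
    `translate` maps one source clause to one target clause.
    """
    segmented = resegment(source)
    return " ".join(translate(clause) for clause in split_clauses(segmented))
```

Because each clause is translated as soon as its terminating mark is available, a segmentation step that produces more terminating marks lets a streaming system emit partial translations earlier.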

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811276866.XA CN109408833A (zh) 2018-10-30 2018-10-30 Translation method, apparatus, device and readable storage medium
CN201811276866.X 2018-10-30

Publications (1)

Publication Number Publication Date
WO2020087655A1 (fr) 2020-05-07

Family

ID=65470039

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/119329 WO2020087655A1 (fr) 2018-10-30 2018-12-05 Translation method, apparatus and device, and readable storage medium

Country Status (2)

Country Link
CN (1) CN109408833A (fr)
WO (1) WO2020087655A1 (fr)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
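CN110321532A (zh) * 2019-06-06 2019-10-11 数译(成都)信息技术有限公司 Language preprocessing sentence segmentation method, computer device and computer-readable storage medium
CN112084795A (zh) * 2019-06-12 2020-12-15 阿里巴巴集团控股有限公司 Translation system, and method and apparatus for invoking a translation service
CN110232194B (zh) * 2019-06-17 2024-04-09 安徽听见科技有限公司 Translation display method, apparatus, device and readable storage medium
CN112151019B (zh) * 2019-06-26 2024-09-20 阿里巴巴集团控股有限公司 Text processing method, apparatus and computing device
CN113591491B (zh) * 2020-04-30 2023-12-26 阿里巴巴集团控股有限公司 Speech translation text correction system, method, apparatus and device
CN111611811B (zh) * 2020-05-25 2023-01-13 腾讯科技(深圳)有限公司 Translation method, apparatus, electronic device and computer-readable storage medium
CN111654658B (zh) * 2020-06-17 2022-04-15 平安科技(深圳)有限公司 Audio/video call processing method and system, codec, and storage device
CN112232091B (zh) * 2020-10-14 2021-11-16 文思海辉智科科技有限公司 Content matching method and apparatus, and readable storage medium
CN112560510B (zh) * 2020-12-10 2023-12-01 科大讯飞股份有限公司 Translation model training method, apparatus, device and storage medium
CN112668346B (zh) * 2020-12-24 2024-04-30 中国科学技术大学 Translation method, apparatus, device and storage medium
CN113392657A (zh) * 2021-06-18 2021-09-14 北京爱奇艺科技有限公司 Training sample augmentation method, apparatus, computer device and storage medium
CN113378586B (zh) * 2021-07-15 2023-03-28 北京有竹居网络技术有限公司 Speech translation method, translation model training method, apparatus, medium and device
CN113660432B (zh) * 2021-08-17 2024-05-28 安徽听见科技有限公司 Translated subtitle production method, apparatus, electronic device and storage medium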

Citations (5)

Publication number Priority date Publication date Assignee Title
US20030055626A1 (en) * 2001-09-19 2003-03-20 International Business Machines Corporation Sentence segmentation method and sentence segmentation apparatus, machine translation system, and program product using sentence segmentation method
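CN101458681A (zh) * 2007-12-10 2009-06-17 株式会社东芝 Speech translation method and speech translation apparatus
CN103530284A (zh) * 2013-09-22 2014-01-22 中国专利信息中心 Short-sentence segmentation apparatus, machine translation system, and corresponding segmentation method and translation method
CN107247706A (zh) * 2017-06-16 2017-10-13 中国电子技术标准化研究院 Text sentence segmentation model building method, sentence segmentation method, apparatus and computer device
CN108628819A (zh) * 2017-03-16 2018-10-09 北京搜狗科技发展有限公司 Processing method and apparatus, and apparatus for processing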

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US10303777B2 (en) * 2016-08-08 2019-05-28 Netflix, Inc. Localization platform that leverages previously translated content

Also Published As

Publication number Publication date
CN109408833A (zh) 2019-03-01

Similar Documents

Publication Publication Date Title
WO2020087655A1 (fr) Translation method, apparatus and device, and readable storage medium
US20210280190A1 (en) Human-machine interaction
CN105869629B (zh) Speech recognition method and apparatus
US20200193217A1 (en) Method for determining sentence similarity
WO2018157703A1 (fr) Natural language semantic extraction method and device, and computer storage medium
WO2019232991A1 (fr) Method for recognizing conference speech as text, electronic device and storage medium
CN107301170B (zh) Artificial-intelligence-based sentence segmentation method and apparatus
CN111402861B (zh) Speech recognition method, apparatus, device and storage medium
CN110415680B (zh) Simultaneous interpretation method, simultaneous interpretation apparatus and electronic device
CN109976702A (zh) Speech recognition method, apparatus and terminal
WO2020103447A1 (fr) Linked storage method and apparatus for video information, computer device and storage medium
CN113536007A (zh) Virtual avatar generation method, apparatus, device and storage medium
CN112560510A (zh) Translation model training method, apparatus, device and storage medium
WO2021159655A1 (fr) Data attribute filling method, apparatus and device, and computer-readable storage medium
CN110633475A (zh) Computer-scene-based natural language understanding method, apparatus, system and storage medium
CN110728983B (zh) Information display method, apparatus, device and readable storage medium
CN112101003B (zh) Sentence text segmentation method, apparatus, device and computer-readable storage medium
WO2020199590A1 (fr) Mood detection analysis method and related device
KR20190074508A (ko) Data crowdsourcing method for a dialogue model for chatbots
CN109408621B (zh) Dialogue sentiment analysis method and system
CN112530417A (zh) Speech signal processing method, apparatus, electronic device and storage medium
CN113553833B (zh) Text error correction method, apparatus and electronic device
CN110162794A (zh) Word segmentation method and server
CN113851106B (zh) Audio playback method, apparatus, electronic device and readable storage medium
WO2022267451A1 (fr) Neural-network-based automatic speech recognition method, device and readable storage medium

Legal Events

Date Code Title Description
NENP Non-entry into the national phase Ref country code: DE
122 EP: PCT application non-entry in European phase Ref document number: 18939010; Country of ref document: EP; Kind code of ref document: A1