CN102945232A - Training corpus quality evaluation and selection method for statistical machine translation - Google Patents
- Publication number: CN102945232A
- Authority: CN (China)
- Prior art keywords: sentence, quality, translation, phrase, quality assessment
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Machine Translation (AREA)
Abstract
Description
Data | 2 | 1 | 0 | ALL |
---|---|---|---|---|
CWMT | 156,544 | 474,356 | 104,476 | 735,376 |
NIST | 919,143 | 121,460 | 8,670 | 1,049,273 |
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210469172.4A CN102945232B (zh) | 2012-11-16 | 2012-11-16 | Training corpus quality evaluation and selection method for statistical machine translation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210469172.4A CN102945232B (zh) | 2012-11-16 | 2012-11-16 | Training corpus quality evaluation and selection method for statistical machine translation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102945232A (zh) | 2013-02-27 |
CN102945232B (zh) | 2015-01-21 |
Family
ID=47728179
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210469172.4A Active CN102945232B (zh) | Training corpus quality evaluation and selection method for statistical machine translation | 2012-11-16 | 2012-11-16 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102945232B (zh) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102193912A (zh) * | 2010-03-12 | 2011-09-21 | Fujitsu Ltd. | Method for establishing a phrase segmentation model, statistical machine translation method, and decoder |
US20120226489A1 (en) * | 2011-03-02 | 2012-09-06 | Bbn Technologies Corp. | Automatic word alignment |
Non-Patent Citations (4)
Title |
---|
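Hao Zhang et al.: "The Impact of Parsing Accuracy on Syntax-based SMT", Natural Language Processing and Knowledge Engineering (NLP-KE), 2010 International Conference on * |
Yao Shujie et al.: "Training corpus selection for statistical machine translation based on sentence-pair quality and coverage", Journal of Chinese Information Processing * |
Chen Yidong et al.: "A preliminary study of parallel corpus processing: a ranking model", Journal of Chinese Information Processing * |
Huang Jin et al.: "Training data selection and optimization for statistical translation systems based on information retrieval methods", Journal of Chinese Information Processing * |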
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
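CN105190609A (zh) * | 2013-06-03 | 2015-12-23 | National Institute of Information and Communications Technology | Translation device, learning device, translation method, and storage medium |
US9678939B2 (en) | 2013-12-04 | 2017-06-13 | International Business Machines Corporation | Morphology analysis for machine translation |
CN103631773A (zh) * | 2013-12-16 | 2014-03-12 | Harbin Institute of Technology | Statistical machine translation method based on a domain similarity measure |
CN105446958A (zh) * | 2014-07-18 | 2016-03-30 | Fujitsu Ltd. | Word alignment method and word alignment device |
CN104731777A (zh) * | 2015-03-31 | 2015-06-24 | NetEase Youdao Information Technology (Beijing) Co., Ltd. | Translation evaluation method and device |
CN105335358A (zh) * | 2015-11-18 | 2016-02-17 | Chengdu Youyi Information Technology Co., Ltd. | Corpus grade evaluation method for use in a translation system |
CN105512114A (zh) * | 2015-12-14 | 2016-04-20 | Tsinghua University | Method and system for screening parallel sentence pairs |
CN105512114B (zh) * | 2015-12-14 | 2018-06-15 | Tsinghua University | Method and system for screening parallel sentence pairs |
CN107066452A (zh) * | 2016-01-29 | 2017-08-18 | Panasonic Intellectual Property Management Co., Ltd. | Translation assistance method, translation assistance device, translation device, and translation assistance program |
CN107066452B (zh) * | 2016-01-29 | 2021-11-05 | Panasonic Intellectual Property Management Co., Ltd. | Translation assistance method, translation assistance device, translation device, and recording medium |
CN105930432A (zh) * | 2016-04-19 | 2016-09-07 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Training method and device for a sequence labeling tool |
CN105930432B (zh) * | 2016-04-19 | 2020-01-07 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Training method and device for a sequence labeling tool |
CN107526727A (zh) * | 2017-07-31 | 2017-12-29 | Soochow University | Language generation method based on statistical machine translation |
CN107526727B (zh) * | 2017-07-31 | 2021-01-19 | Soochow University | Language generation method based on statistical machine translation |
CN107491444A (zh) * | 2017-08-18 | 2017-12-19 | Nanjing University | Parallelized word alignment method based on bilingual word embedding |
JP2019149030A (ja) * | 2018-02-27 | 2019-09-05 | Nippon Telegraph and Telephone Corporation | Learning quality estimation device, method, and program |
WO2019167794A1 (ja) * | 2018-02-27 | 2019-09-06 | Nippon Telegraph and Telephone Corporation | Learning quality estimation device, method, and program |
CN108537246A (zh) * | 2018-02-28 | 2018-09-14 | Chengdu Youyi Information Technology Co., Ltd. | Method and system for classifying parallel corpora by translation quality |
CN110874536A (zh) * | 2018-08-29 | 2020-03-10 | Alibaba Group Holding Ltd. | Method for generating a corpus quality evaluation model and method for evaluating the mutual translation quality of bilingual sentence pairs |
CN110874536B (zh) * | 2018-08-29 | 2023-06-27 | Alibaba Group Holding Ltd. | Method for generating a corpus quality evaluation model and method for evaluating the mutual translation quality of bilingual sentence pairs |
WO2021098397A1 (zh) * | 2019-11-21 | 2021-05-27 | Tencent Technology (Shenzhen) Company Limited | Data processing method, device, and storage medium |
US12164879B2 (en) | 2019-11-21 | 2024-12-10 | Tencent Technology (Shenzhen) Company Limited | Data processing method, device, and storage medium |
CN111178091A (zh) * | 2019-12-20 | 2020-05-19 | Shenyang Yayi Network Technology Co., Ltd. | Multi-dimensional Chinese-English bilingual data cleaning method |
CN111178091B (zh) * | 2019-12-20 | 2023-05-09 | Shenyang Yayi Network Technology Co., Ltd. | Multi-dimensional Chinese-English bilingual data cleaning method |
CN111159356A (zh) * | 2019-12-31 | 2020-05-15 | Chongqing Heguan Technology Co., Ltd. | Knowledge graph construction method based on teaching content |
CN111159356B (zh) * | 2019-12-31 | 2023-06-09 | Chongqing Heguan Technology Co., Ltd. | Knowledge graph construction method based on teaching content |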
Also Published As
Publication number | Publication date |
---|---|
CN102945232B (zh) | 2015-01-21 |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | Effective date of registration: 2022-02-14. Address after: 110004 1001 - (1103), block C, No. 78, Sanhao Street, Heping District, Shenyang City, Liaoning Province. Patentee after: Calf Yazhi (Shenyang) Technology Co.,Ltd. Address before: Room 1517, No. 55, Sanhao Street, Heping District, Shenyang, Liaoning 110003. Patentee before: SHENYANG YAYI NETWORK TECHNOLOGY CO.,LTD. |
TR01 | Transfer of patent right | Effective date of registration: 2022-07-15. Address after: 110004 11 / F, block C, Neusoft computer city, 78 Sanhao Street, Heping District, Shenyang City, Liaoning Province. Patentee after: SHENYANG YAYI NETWORK TECHNOLOGY CO.,LTD. Address before: 110004 1001 - (1103), block C, No. 78, Sanhao Street, Heping District, Shenyang City, Liaoning Province. Patentee before: Calf Yazhi (Shenyang) Technology Co.,Ltd. |
PE01 | Entry into force of the registration of the contract for pledge of patent right | Denomination of invention: Quality Evaluation and Selection of Training Corpus for statistical machine translation. Effective date of registration: 2023-05-08. Granted publication date: 2015-01-21. Pledgee: China Construction Bank Shenyang Hunnan sub branch. Pledgor: SHENYANG YAYI NETWORK TECHNOLOGY CO.,LTD. Registration number: Y2023210000101 |