CN111178091B - A multi-dimensional Chinese-English bilingual data cleaning method - Google Patents
- Publication number
- CN111178091B (application number CN201911323592.XA)
- Authority
- CN
- China
- Prior art keywords
- chinese
- english
- word
- bilingual
- dictionary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
Description
Claims (3)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911323592.XA CN111178091B (zh) | 2019-12-20 | 2019-12-20 | A multi-dimensional Chinese-English bilingual data cleaning method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911323592.XA CN111178091B (zh) | 2019-12-20 | 2019-12-20 | A multi-dimensional Chinese-English bilingual data cleaning method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111178091A (zh) | 2020-05-19 |
CN111178091B (zh) | 2023-05-09 |
Family
ID=70652073
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911323592.XA Active CN111178091B (zh) | A multi-dimensional Chinese-English bilingual data cleaning method | 2019-12-20 | 2019-12-20 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111178091B (zh) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
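CN112084796B (zh) * | 2020-09-15 | 2021-04-09 | 南京文图景信息科技有限公司 | A Chinese translation method for multilingual place-name roots based on the Transformer deep learning model |
CN112818110B (zh) * | 2020-12-31 | 2024-05-24 | 鹏城实验室 | Text filtering method, device, and computer storage medium |
CN113177420A (zh) * | 2021-04-29 | 2021-07-27 | 同方知网(北京)技术有限公司 | A method for constructing a Chinese-English bilingual dictionary based on academic literature |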
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
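CN102945232A (zh) * | 2012-11-16 | 2013-02-27 | 沈阳雅译网络技术有限公司 | Training corpus quality evaluation and selection method for statistical machine translation |
CN103678285A (zh) * | 2012-08-31 | 2014-03-26 | 富士通株式会社 | Machine translation method and machine translation system |
CN104750820A (zh) * | 2015-04-24 | 2015-07-01 | 中译语通科技(北京)有限公司 | A corpus filtering method and device |
CN106649564A (zh) * | 2016-11-10 | 2017-05-10 | 中科院合肥技术创新工程院 | A method and device for extracting mutually translated multi-word expressions |
CN108874785A (zh) * | 2018-06-01 | 2018-11-23 | 清华大学 | A translation processing method and system |
CN109739956A (zh) * | 2018-11-08 | 2019-05-10 | 第四范式(北京)技术有限公司 | Corpus cleaning method, apparatus, device, and medium |
CN109858029A (zh) * | 2019-01-31 | 2019-06-07 | 沈阳雅译网络技术有限公司 | A data preprocessing method for improving overall corpus quality |
CN109933808A (zh) * | 2019-01-31 | 2019-06-25 | 沈阳雅译网络技术有限公司 | A neural machine translation method based on dynamically configured decoding |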
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150286632A1 (en) * | 2014-04-03 | 2015-10-08 | Xerox Corporation | Predicting the quality of automatic translation of an entire document |
CN106383818A (zh) * | 2015-07-30 | 2017-02-08 | 阿里巴巴集团控股有限公司 | A machine translation method and device |
Non-Patent Citations (2)
Title |
---|
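Eray Yıldız et al. "The Effect of Parallel Corpus Quality vs Size in English-to-Turkish SMT". ResearchGate. 2014, full text. * |
Yao Jianmin, Zhou Ming, Zhao Tiejun, Li Sheng. "A sentence-similarity-based machine translation evaluation method and an analysis of its effectiveness". Journal of Computer Research and Development (计算机研究与发展). 2004, (07), full text. * |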
Also Published As
Publication number | Publication date |
---|---|
CN111178091A (zh) | 2020-05-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
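CN108197111B (zh) | | An automatic text summarization method based on fused semantic clustering |
CN111178091B (zh) | | A multi-dimensional Chinese-English bilingual data cleaning method |
CN107133213B (zh) | | An algorithm-based method and system for automatic text summary extraction |
CN106598940A (zh) | | Text similarity algorithm based on globally optimized keyword quality |
CN108681574B (zh) | | A text-summary-based method and system for answer selection in non-factoid question answering |
US20070174040A1 (en) | | Word alignment apparatus, example sentence bilingual dictionary, word alignment method, and program product for word alignment |
CN106610951A (zh) | | Improved text similarity algorithm based on semantic analysis |
CN112257460B (zh) | | A pivot-based Chinese-Vietnamese jointly trained neural machine translation method |
CN107102983B (zh) | | A word vector representation method for Chinese concepts based on network knowledge sources |
CN106598941A (zh) | | An algorithm for globally optimizing text keyword quality |
Lakmal et al. | | Word embedding evaluation for Sinhala |
CN106610952A (zh) | | A hybrid method for extracting text feature words |
CN111339753B (zh) | | An adaptive Chinese new-word recognition method and system |
CN103020045A (zh) | | A statistical machine translation method based on predicate-argument structure |
Dou et al. | | UniSAr: A unified structure-aware autoregressive language model for text-to-SQL |
Zhao et al. | | Knowledge-enhanced self-supervised prototypical network for few-shot event detection |
CN107038155A (zh) | | Method for extracting text features based on an improved small-world network model |
CN107092595A (zh) | | A new keyword extraction technique |
CN106126501B (zh) | | A noun word sense disambiguation method and device based on dependency constraints and knowledge |
Miao et al. | | Improving accuracy of key information acquisition for social media text summarization |
Chen et al. | | Word embedding evaluation datasets and Wikipedia title embedding for Chinese |
CN107102986A (zh) | | Keyword extraction technique for multiple topics in a document |
Bungum et al. | | A survey of domain adaptation in machine translation: Towards a refinement of domain space |
CN114880521A (zh) | | Video description method and medium based on autonomously optimized alignment of visual and language semantics |
Galinsky et al. | | Improving neural models for natural language processing in Russian with synonyms |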
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB03 | Change of inventor or designer information | Inventor after: Du Quan; Bi Dong. Inventor before: Du Quan; Bi Dong; Zhu Jingbo; Xiao Tong; Zhang Chunliang |
GR01 | Patent grant | |
PE01 | Entry into force of the registration of the contract for pledge of patent right | Denomination of invention: A multidimensional bilingual data cleaning method in Chinese and English. Granted publication date: 2023-05-09. Pledgee: China Construction Bank Shenyang Hunnan sub branch. Pledgor: SHENYANG YAYI NETWORK TECHNOLOGY CO.,LTD. Registration number: Y2024210000102 |