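CN114997148B - A method for constructing a pre-training model for Chinese spelling proofreading based on contrastive learning - Google Patents
A method for constructing a pre-training model for Chinese spelling proofreading based on contrastive learning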
- Publication number
- CN114997148B (application number CN202210941108.5A)
- Authority
- CN
- China
- Prior art keywords
- words
- mask
- input text
- model
- confusion
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Machine Translation (AREA)
Abstract
Description
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210941108.5A CN114997148B (zh) | 2022-08-08 | 2022-08-08 | A method for constructing a pre-training model for Chinese spelling proofreading based on contrastive learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210941108.5A CN114997148B (zh) | 2022-08-08 | 2022-08-08 | A method for constructing a pre-training model for Chinese spelling proofreading based on contrastive learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114997148A (zh) | 2022-09-02 |
CN114997148B (zh) | 2022-11-04 |
Family
ID=83023303
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210941108.5A (Active) CN114997148B (zh) | A method for constructing a pre-training model for Chinese spelling proofreading based on contrastive learning | 2022-08-08 | 2022-08-08 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114997148B (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116030418B (zh) * | 2023-02-14 | 2023-09-12 | 北京建工集团有限责任公司 | System and method for monitoring the operating state of a truck crane |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20150007647A (ko) * | 2013-07-12 | 2015-01-21 | 부산대학교 산학협력단 | 교정 어휘 쌍을 이용한 통계적 문맥 철자오류 교정 장치 및 방법 |
CN113901797A (zh) * | 2021-10-18 | 2022-01-07 | 广东博智林机器人有限公司 | 文本纠错方法、装置、设备及存储介质 |
CN114330238A (zh) * | 2021-08-03 | 2022-04-12 | 腾讯科技(深圳)有限公司 | 文本处理方法、文本处理装置、电子设备及存储介质 |
CN114548053A (zh) * | 2022-02-21 | 2022-05-27 | 中科院成都信息技术股份有限公司 | 一种基于编辑方法的文本对比学习纠错系统、方法及装置 |
CN114781358A (zh) * | 2022-04-18 | 2022-07-22 | 润联软件系统(深圳)有限公司 | 基于强化学习的文本纠错方法、装置、设备及存储介质 |
CN114861635A (zh) * | 2022-05-10 | 2022-08-05 | 广东外语外贸大学 | 一种中文拼写纠错方法、装置、设备及存储介质 |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114861637B (zh) * | 2022-05-18 | 2023-06-16 | 北京百度网讯科技有限公司 | Method and apparatus for generating a spelling error correction model, and spelling error correction method and apparatus |
- 2022-08-08: Application CN202210941108.5A filed in CN; patent CN114997148B (zh), status Active
Also Published As
Publication number | Publication date |
---|---|
CN114997148A (zh) | 2022-09-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
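CN114444479B (zh) | | End-to-end Chinese speech text error correction method, apparatus, and storage medium |
CN111639489A (zh) | | Chinese text error correction system, method, apparatus, and computer-readable storage medium |
CN111310447B (zh) | | Grammatical error correction method, apparatus, electronic device, and storage medium |
CN110276069B (zh) | | Automatic Chinese Braille error detection method, system, and storage medium |
CN113268576B (zh) | | Deep-learning-based method and apparatus for extracting department semantic information |
CN111651978A (zh) | | Entity-based lexical checking method and apparatus, computer device, and storage medium |
CN116306600B (zh) | | MacBert-based Chinese text error correction method |
CN113255331B (zh) | | Text error correction method, apparatus, and storage medium |
CN114997148B (zh) | | A method for constructing a pre-training model for Chinese spelling proofreading based on contrastive learning |
CN114818668A (zh) | | Method, apparatus, and computer device for correcting personal names in speech-transcribed text |
CN115034218A (zh) | | Chinese grammatical error diagnosis method based on multi-stage training and edit-level voting |
CN113673228A (zh) | | Text error correction method, apparatus, computer storage medium, and computer program product |
CN115034208A (zh) | | BERT-based method and system for repairing Chinese ASR output text |
CN111160026B (zh) | | Model training method and apparatus, and method and apparatus for text processing |
CN115017335A (zh) | | Knowledge graph construction method and system |
CN116522165B (zh) | | Public opinion text matching system and method based on a Siamese structure |
CN113076740A (zh) | | Synonym mining method and apparatus for the government services domain |
CN115757815A (zh) | | Knowledge graph construction method, apparatus, and storage medium |
CN116028608A (zh) | | Question-answering interaction method, apparatus, computer device, and readable storage medium |
CN112966501B (zh) | | New word discovery method, system, terminal, and medium |
CN111782773B (zh) | | Text matching method and apparatus based on a cascade mode |
CN112784536B (zh) | | Processing method, system, and storage medium for a math word problem solving model |
CN114548049A (zh) | | Number regularization method, apparatus, device, and storage medium |
CN111428475A (zh) | | Method for constructing a word segmentation lexicon, word segmentation method, apparatus, and storage medium |
CN114548080B (zh) | | Chinese typo correction method and system based on word segmentation enhancement |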
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | Effective date of registration: 20230321. Address after: 200137 Room 301AB, No. 10, Lane 198, Zhangheng Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai. Patentee after: SHANGHAI MDATA INFORMATION TECHNOLOGY Co.,Ltd. Address before: 410205 569 Yuelu Avenue, Changsha City, Hunan Province. Patentee before: Hunan University of Technology |
CP03 | Change of name, title or address | Address after: Room 301AB, No. 10, Lane 198, Zhangheng Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai 201204. Patentee after: Shanghai Mido Technology Co.,Ltd. Address before: 200137 Room 301AB, No. 10, Lane 198, Zhangheng Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai. Patentee before: SHANGHAI MDATA INFORMATION TECHNOLOGY Co.,Ltd. |
PE01 | Entry into force of the registration of the contract for pledge of patent right | Denomination of invention: A method for constructing a pre-training model for Chinese spelling proofreading based on contrastive learning. Granted publication date: 20221104. Pledgee: Bank of Communications Ltd. Shanghai New District Branch. Pledgor: Shanghai Mido Technology Co.,Ltd. Registration number: Y2024310000145 |