CN115398436B - Noise data augmentation for natural language processing - Google Patents

Noise data augmentation for natural language processing

Info

Publication number
CN115398436B
CN115398436B CN202080099408.2A CN202080099408A CN115398436B CN 115398436 B CN115398436 B CN 115398436B CN 202080099408 A CN202080099408 A CN 202080099408A CN 115398436 B CN115398436 B CN 115398436B
Authority
CN
China
Prior art keywords
text
original
utterance
utterances
intent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202080099408.2A
Other languages
English (en)
Chinese (zh)
Other versions
CN115398436A (zh)
Inventor
E·L·贾拉勒丁
V·比什诺伊
M·E·约翰逊
T·L·杜翁
洪宇衡
B·S·文纳科塔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oracle International Corp
Original Assignee
Oracle International Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-03-30
Filing date
2020-09-11
Publication date
2025-08-05
Application filed by Oracle International Corp
Publication of CN115398436A
Application granted
Publication of CN115398436B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/05 Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering
    • G10L2015/0633 Creating reference templates; Clustering using lexical or orthographic knowledge sources
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0638 Interactive procedures
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
    • G10L2015/227 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of the speaker; Human-factor methodology

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
CN202080099408.2A 2020-03-30 2020-09-11 Noise data augmentation for natural language processing Active CN115398436B (zh)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US202063002066P 2020-03-30 2020-03-30
US63/002,066 2020-03-30
PCT/US2020/050342 WO2021201907A1 (en) 2020-03-30 2020-09-11 Noise data augmentation for natural language processing

Publications (2)

Publication Number Publication Date
CN115398436A (zh) 2022-11-25
CN115398436B (zh) 2025-08-05

Family

ID=72659890

Family Applications (1)

Application Number Priority Date Filing Date Title
CN202080099408.2A 2020-03-30 2020-09-11 Noise data augmentation for natural language processing (Active, CN115398436B (zh))

Country Status (5)

Country Link
US (2) US11538457B2
EP (1) EP4128010A1
JP (2) JP7721559B2
CN (1) CN115398436B
WO (1) WO2021201907A1

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118642630A (zh) * 2018-08-21 2024-09-13 谷歌有限责任公司 Method for automated assistant invocation
US11538457B2 (en) * 2020-03-30 2022-12-27 Oracle International Corporation Noise data augmentation for natural language processing
US11556788B2 (en) * 2020-06-15 2023-01-17 International Business Machines Corporation Text-based response environment action selection
US11599721B2 (en) * 2020-08-25 2023-03-07 Salesforce, Inc. Intelligent training set augmentation for natural language processing tasks
DK202170043A1 (en) * 2021-01-29 2022-12-12 A P Moeller Mærsk As A method for autonomous reconciliation of invoice data and related electronic device
US12026471B2 (en) * 2021-04-16 2024-07-02 Accenture Global Solutions Limited Automated generation of chatbot
US12242816B2 (en) * 2021-06-30 2025-03-04 Microsoft Technology Licensing, Llc Task-action prediction engine for a task management system
US12321428B2 (en) * 2021-07-08 2025-06-03 Nippon Telegraph And Telephone Corporation User authentication device, user authentication method, and user authentication computer program
EP4363965A1 (en) * 2021-08-06 2024-05-08 Siemens Aktiengesellschaft Source code synthesis for domain specific languages from natural language text
US12468938B2 (en) * 2021-09-21 2025-11-11 International Business Machines Corporation Training example generation to create new intents for chatbots
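CN114491048B (zh) * 2022-02-16 2025-08-15 北京微播易科技股份有限公司 Data augmentation method, and training method and apparatus for a text classification model
CN115878765B (zh) * 2022-04-18 2024-09-13 北京中关村科金技术有限公司 Method and apparatus for mining debt-collection scripts incorporating intent-recognition noise reduction
CN114881130A (zh) * 2022-04-26 2022-08-09 华北电力大学 Method for grading relay protection defect texts based on a Bagging model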
US12451141B2 (en) 2022-06-08 2025-10-21 International Business Machines Corporation Generating multi-turn dialog datasets
US12579448B2 (en) 2022-06-22 2026-03-17 Oracle International Corporation Techniques for positive entity aware augmentation using two-stage augmentation
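CN117668216A (zh) * 2022-08-12 2024-03-08 南方电网大数据服务有限公司 Intent recognition model training method, intent recognition method, and apparatus
CN116150311A (zh) * 2022-08-16 2023-05-23 马上消费金融股份有限公司 Text matching model training method, intent recognition method, and apparatus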
US12499385B2 (en) * 2022-08-22 2025-12-16 Oracle International Corporation Adaptive training data augmentation to facilitate training named entity recognition models
CN115909354B (zh) * 2022-11-11 2023-11-10 北京百度网讯科技有限公司 Text generation model training method, text acquisition method, and apparatus
US12512089B2 (en) * 2022-12-07 2025-12-30 International Business Machines Corporation Testing cascaded deep learning pipelines comprising a speech-to-text model and a text intent classifier
JP2024098791A (ja) * 2023-01-11 2024-07-24 株式会社東芝 Information processing apparatus, information processing method, and information processing program
US12231378B2 (en) * 2023-06-08 2025-02-18 Sap Se Realtime conversation AI insights and deployment
US20250008021A1 (en) * 2023-06-28 2025-01-02 Jpmorgan Chase Bank, N.A. Systems and methods for artificial intelligence-based coaching using microlearning
US12367342B1 (en) * 2025-01-15 2025-07-22 Conversational AI Ltd Automated analysis of computerized conversational agent conversational data

Citations (3)

Publication number Priority date Publication date Assignee Title
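CN105786798A (zh) * 2016-02-25 2016-07-20 上海交通大学 Method for understanding natural language intent in human-computer interaction
CN108073574A (zh) * 2016-11-16 2018-05-25 三星电子株式会社 Method and device for processing natural language and training a natural language model
CN110209791A (zh) * 2019-06-12 2019-09-06 百融云创科技股份有限公司 Multi-turn dialogue intelligent voice interaction system and apparatus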

Family Cites Families (11)

Publication number Priority date Publication date Assignee Title
US20110289025A1 (en) * 2010-05-19 2011-11-24 Microsoft Corporation Learning user intent from rule-based training data
US20160055240A1 (en) 2014-08-22 2016-02-25 Microsoft Corporation Orphaned utterance detection system and method
WO2016055240A1 (en) * 2014-10-06 2016-04-14 Zentrum Mikroelektronik Dresden Ag Pulsed linear power converter
US10510336B2 (en) * 2017-06-12 2019-12-17 International Business Machines Corporation Method, apparatus, and system for conflict detection and resolution for competing intent classifiers in modular conversation system
CN107515857B (zh) 2017-08-31 2020-08-18 科大讯飞股份有限公司 Semantic understanding method and system based on customized skills
US10303978B1 (en) * 2018-03-26 2019-05-28 Clinc, Inc. Systems and methods for intelligently curating machine learning training data and improving machine learning model performance
US10726204B2 (en) * 2018-05-24 2020-07-28 International Business Machines Corporation Training data expansion for natural language classification
US11093707B2 (en) * 2019-01-15 2021-08-17 International Business Machines Corporation Adversarial training data augmentation data for text classifiers
CN110223674B (zh) * 2019-04-19 2023-05-26 平安科技(深圳)有限公司 Speech corpus training method, apparatus, computer device, and storage medium
CN110457447A (zh) * 2019-05-15 2019-11-15 国网浙江省电力有限公司电力科学研究院 Task-oriented dialogue system for the power grid
US11538457B2 (en) * 2020-03-30 2022-12-27 Oracle International Corporation Noise data augmentation for natural language processing

Non-Patent Citations (1)

Title
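"Analysis of the impact of text noise and label noise in document classification"; 池田大志, 藤本拓, 吉村健; Proceedings of the 26th Annual Meeting of the Association for Natural Language Processing; 2020-01-09; pp. 221-224 *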

Also Published As

Publication number Publication date
JP7721559B2 (ja) 2025-08-12
JP2023519713A (ja) 2023-05-12
JP2025170253A (ja) 2025-11-18
US11538457B2 (en) 2022-12-27
WO2021201907A1 (en) 2021-10-07
US20210304733A1 (en) 2021-09-30
US20230169955A1 (en) 2023-06-01
CN115398436A (zh) 2022-11-25
US11972755B2 (en) 2024-04-30
EP4128010A1 (en) 2023-02-08

Similar Documents

Publication Publication Date Title
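CN115398436B (zh) Noise data augmentation for natural language processing
CN114424185B (zh) Stop word data augmentation for natural language processing
CN115398437B (zh) Improved out-of-domain (OOD) detection techniques
CN116724305B (zh) Integration of context tags with named entity recognition models
CN116583837B (zh) Distance-based logit values for natural language processing
CN116547676B (zh) Enhanced logits for natural language processing
CN116635862A (zh) Out-of-domain data augmentation for natural language processing
CN115917553A (zh) Entity-level data augmentation for robust named entity recognition in chatbots
CN116615727A (zh) Keyword data augmentation tool for natural language processing
CN118265981B (zh) Systems and techniques for handling long text for pre-trained language models
CN118202344A (zh) Deep learning techniques for extracting embedded data from documents
CN118235143A (zh) Path dropout for natural language processing
CN118215920A (zh) Wide and deep network for language detection using hash embeddings
CN119183573A (zh) Entity-aware data augmentation techniques
CN118251668A (zh) Rule-based techniques for extracting question-answer pairs from data
CN119768794A (zh) Adaptive training data augmentation to facilitate training named entity recognition models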

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant