WO2021134550A1 - Human merging and training of multiple speech recognition outputs - Google Patents
Human merging and training of multiple speech recognition outputs
- Publication number
- WO2021134550A1 (PCT/CN2019/130694; CN2019130694W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- rate
- speech
- result
- interval
- recognition
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/01—Correction of time axis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/04—Time compression or expansion
- G10L21/043—Time compression or expansion by changing speed
Definitions
- The present invention relates to the human merging and training of multiple speech recognition outputs, and in particular to a method in which a human merges multiple speech recognition outputs to improve the output result, while the result of the human merging and modification is fed back as material for speech recognition training.
- AI: Artificial Intelligence
- With a speech recognition input method, performance is affected by the rate of speech, by the microphone being blocked (for example, the Huawei phone microphone sits at the bottom right and, when the phone is held in the right hand, is covered by the right little finger supporting it, which degrades sound pickup), by the sound source being too far from the microphone, and by server load, network delay, environmental noise, and other factors, so any single speech recognition input method can perform very unstably. Users need a convenient way to call multiple AI applications at the same time, drawing on the strengths of each and balancing out incidental adverse effects (such as server load and network link delay).
- Pattern recognition can be used to highlight the differences between multiple AI output results, allowing human users to focus on where the results differ and on their relative merits.
- For an AI application such as a speech recognition input method, every engine recognizes the same language in the same word order, with even the pauses identical; because the word order and linguistic structure of the multiple output results are exactly the same, pattern recognition is sufficient.
- A user who often uses a speech recognition input method will find that the recognition rate is very low when speaking quickly and high when deliberately slowing down. Although there are individual differences, people's habits are difficult to change.
- The ergonomic design principle should be that machines adapt to humans, not that humans adapt to machines.
- A particular speech recognition result can be taken as the default, either selected by the human user or based on statistics over a large number of users' selections; the other speech recognition results are then merged into it, or it is modified directly.
- When the human user hovers the mouse over a highlighted difference, or applies a predefined gesture or touch (such as two fingers defining the modification range), the differing clause or passage from the other speech recognition results is presented, for example on a floating layer.
- When the user clicks or touches the differing clause or passage of a particular speech recognition result, it replaces the corresponding differing clause or passage of the base speech recognition result.
- When the human user clicks the mouse, or applies another predefined gesture or touch, the text can be modified manually.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Machine Translation (AREA)
Abstract
A method for human merging and training of multiple speech recognition outputs: a human merges multiple speech recognition outputs to improve the output result, and the result of the human merging and modification is fed back as material for speech recognition training.
Description
The present invention relates to the human merging and training of multiple speech recognition outputs, and in particular to a method in which a human merges multiple speech recognition outputs to improve the output result, while the result of the human merging and modification is fed back as material for speech recognition training.
In the smartphone era, weak artificial intelligence (hereinafter AI) has been widely applied, for example in automatic translation, speech recognition, face recognition, AI beautification, AI song identification, AI voice changing, video face swapping, and audio/video synthesis. However, the accuracy with which these weak AI applications perform their tasks is not yet sufficient to fully replace humans.
For example, a speech recognition input method is affected by many factors: the rate of speech; the microphone being blocked (for example, the Huawei phone microphone sits at the bottom right and, when the phone is held in the right hand, is covered by the right little finger supporting it, degrading sound pickup); the sound source being too far from the microphone; server load; network delay; and environmental noise. As a result, any single speech recognition input method can perform very unstably. Users need a convenient way to call multiple AI applications at the same time, so as to draw on the strengths of each and balance out incidental adverse effects (such as server load and network link delay).
Summary of the Invention
It is conceivable to invoke multiple speech recognition input methods at the same time, for example iFlytek and the Baidu input method, present the multiple corresponding output results to the user, have the human user pick the best sub-result sentence by sentence, and finally feed the human user's work product back as material for AI training.
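As a rough illustration of invoking several engines at once, the Python sketch below runs two recognizers in parallel and tolerates a slow or failed one. The recognize_engine_a and recognize_engine_b functions are hypothetical stubs; the actual vendor SDK or HTTP calls (iFlytek, Baidu, and so on) are not part of this document and would replace them.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for two speech recognition engines; the real vendor
# SDK or HTTP calls would replace these stubs.
def recognize_engine_a(audio_path: str) -> str:
    raise NotImplementedError("call engine A here")

def recognize_engine_b(audio_path: str) -> str:
    raise NotImplementedError("call engine B here")

def recognize_all(audio_path: str, timeout_s: float = 10.0) -> dict:
    """Run several recognizers on the same audio in parallel and collect
    whatever comes back; a slow or failed engine simply yields None."""
    engines = {"engine_a": recognize_engine_a, "engine_b": recognize_engine_b}
    results = {}
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {name: pool.submit(fn, audio_path) for name, fn in engines.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=timeout_s)
            except Exception:  # network delay, server overload, engine error, ...
                results[name] = None
    return results
```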
To make it more efficient for the human user to merge multiple speech recognition input results, pattern recognition can be used to highlight the differences between the multiple AI output results, letting the human user focus on where the results differ and on their relative merits. For an AI application such as a speech recognition input method, every engine recognizes the same language in the same word order, with even the pauses identical; because the word order and linguistic structure of the multiple output results are exactly the same, pattern recognition is sufficient.
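A minimal sketch of the difference-highlighting step, assuming a character-level diff (Python's standard difflib) as the pattern recognition applied to two engines' outputs; the example strings are invented for illustration.

```python
import difflib

def highlight_differences(base: str, other: str):
    """Character-level diff of two recognition outputs; every span whose tag
    is not 'equal' is a difference the UI would highlight for the user."""
    matcher = difflib.SequenceMatcher(a=base, b=other, autojunk=False)
    return [(tag, base[i1:i2], other[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

# Two engines agree except on one homophone clause (invented example).
print(highlight_differences("今天天气很好我们去公园", "今天天气很好我们去工员"))
# -> [('replace', '公园', '工员')]
```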
However, it should be understood that this summary may not contain all aspects and embodiments of the present invention, is not meant to be limiting in any way, and the invention as disclosed herein will be understood by a person of ordinary skill in the art to encompass obvious improvements and modifications thereto.
The present invention will now be described more fully hereinafter. The invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the invention to those skilled in the art.
It should be understood that various changes may be made to the function and arrangement of elements without departing from the spirit and scope set forth in the appended claims. The embodiments are therefore examples or implementations of the invention, not its only implementations. Occurrences of "one embodiment", "an embodiment", or "some embodiments" do not necessarily all refer to the same embodiment. Although various features of the invention may be described in the context of a single embodiment, the features may also be provided separately or in any suitable combination. Conversely, although the invention may be described herein, for clarity, in the context of separate embodiments, it may also be implemented in a single embodiment or in any combination of embodiments.
Unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will further be understood that terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art and the present disclosure, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Reference terms such as "left", "right", "up", "down", "front", and "back" are intended to be used relative to the orientation of specific features, structures, or elements depicted in the embodiments. Such directional terminology obviously has no particular meaning with respect to the actual use of the device, since the device can be used by a user, or by multiple users, in multiple orientations.
A user who frequently uses a speech recognition input method will find that the recognition rate is very low when speaking quickly and high when deliberately slowing down. Individual differences certainly exist, yet people's habits are hard to change. The ergonomic design principle should be that machines adapt to humans, not that humans adapt to machines.
A particular speech recognition result may be taken as the default, either selected by the human user or based on statistics over a large number of users' selections, and the other speech recognition results are merged into it, or it is modified. When the human user hovers the mouse over a highlighted difference, or applies a predefined gesture or touch (such as two fingers defining the modification range), the differing clause or passage from the other speech recognition results is presented, for example on a floating layer. When the user clicks or touches the differing clause or passage of a particular speech recognition result, it replaces the corresponding differing clause or passage of the base speech recognition result. When the human user clicks the mouse, or applies another predefined gesture or touch, the text can be modified manually.
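Continuing the same character-level diff assumption, the sketch below shows one way the clause replacement and training feedback could fit together: spans the user picks are spliced into the base result, and the base text, the alternative, the merged text, and the choices are serialized as a training record. The choose_alternative callback is a hypothetical stand-in for the click or touch on the floating layer.

```python
import difflib
import json

def merge_selection(base: str, alternative: str, choose_alternative):
    """Walk the differing spans between the default (base) result and another
    engine's result; wherever choose_alternative returns True, splice the
    alternative span into the base text. Returns the merged text and a JSON
    training record of the human's choices."""
    matcher = difflib.SequenceMatcher(a=base, b=alternative, autojunk=False)
    merged, choices = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.append(base[i1:i2])
            continue
        base_span, alt_span = base[i1:i2], alternative[j1:j2]
        picked = alt_span if choose_alternative(base_span, alt_span) else base_span
        merged.append(picked)
        choices.append({"base": base_span, "alternative": alt_span, "picked": picked})
    merged_text = "".join(merged)
    record = json.dumps({"base": base, "alternative": alternative,
                         "merged": merged_text, "choices": choices},
                        ensure_ascii=False)
    return merged_text, record

# Stand-in for the user's touch on the floating layer: always take the alternative.
text, record = merge_selection("今天天气很好我们去工员",
                               "今天天气很好我们去公园",
                               lambda base_span, alt_span: True)
```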
The above description covers only embodiments of the present invention and is not intended to limit its scope. Various changes and modifications in accordance with the claims and specification of this disclosure remain within the scope of the claimed invention. Furthermore, each embodiment and claim does not necessarily include all of the disclosed advantages or features. In addition, the abstract and the title are provided only to facilitate searching of patent documents and are not intended to limit the scope of the claimed invention in any way.
Claims (3)
- An input method for improving the speech recognition rate, comprising: α. deriving, from statistical results, the speech-rate interval with the best recognition accuracy; β. detecting the speech rate of the input speech in real time and, when the speech rate falls outside said interval of best accuracy, adjusting the speech rate into said interval of best accuracy; γ. outputting the speech recognition result based on the adjusted speech rate.
- An input method for improving the speech recognition rate, comprising: α. deriving, from statistical results, the speech-rate interval with the best recognition accuracy; β. detecting the speech rate of the input speech in real time and, when the speech rate falls outside said interval of best accuracy, adjusting the speech rate into said interval of best accuracy; γ. while the speech rate is being adjusted, performing speech recognition at the original speech rate and outputting result one; δ. outputting the speech recognition result based on the adjusted speech rate, marked as result two; ε. highlighting the differences between said result one and said result two; στ. having the user select the accurate result, and using said result one, said result two, and the user's selection as artificial-intelligence training material for the input method.
- An input method for improving the speech recognition rate, comprising: α. deriving, from statistical results, the speech-rate interval with the best recognition accuracy; β. detecting the speech rate of the input speech in real time and, when the speech rate falls outside said interval of best accuracy, adjusting the speech rate into said interval of best accuracy, and selecting multiple speech rates within said interval of best accuracy for multiple simultaneous recognitions; γ. while the speech rate is being adjusted, performing speech recognition at the original speech rate and outputting result one; δ. outputting the speech recognition results based on the adjusted speech rates, marked as result two, result three, and so on; ε. highlighting the differences between said result one, said result two, result three, and so on; στ. having the user select the accurate result, and using said result one, said result two, said result three, and so on, together with the user's selection, as artificial-intelligence training material for the input method.
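A minimal sketch of the rate-adjustment pipeline in the claims above, assuming librosa for loading and time-stretching the audio and a hypothetical recognize() stand-in for any speech recognition engine; the optimal-rate interval and the characters-per-second estimate are illustrative assumptions rather than values taken from this document.

```python
import librosa

BEST_RATE = (3.0, 5.0)  # assumed best-accuracy interval, in recognized characters per second

def recognize(audio, sample_rate):
    """Hypothetical stand-in for any speech recognition engine."""
    raise NotImplementedError

def estimate_speech_rate(text: str, duration_s: float) -> float:
    # Crude estimate: recognized characters per second of audio.
    return len(text) / duration_s if duration_s > 0 else 0.0

def recognize_with_rate_adjustment(path: str) -> dict:
    audio, sr = librosa.load(path, sr=16000)
    duration = len(audio) / sr
    result_one = recognize(audio, sr)                 # recognition at the original rate
    rate = estimate_speech_rate(result_one, duration)
    results = {"original": result_one}
    if rate > 0 and not (BEST_RATE[0] <= rate <= BEST_RATE[1]):
        # Time-stretch the audio so the effective speech rate lands inside the
        # best-accuracy interval, then recognize again (claim 3 would repeat
        # this for several target rates within the interval).
        target = sum(BEST_RATE) / 2
        stretched = librosa.effects.time_stretch(audio, rate=target / rate)
        results["adjusted"] = recognize(stretched, sr)  # recognition at the adjusted rate
    return results
```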
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2019/130694 WO2021134550A1 (zh) | 2019-12-31 | 2019-12-31 | Human merging and training of multiple speech recognition outputs |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2019/130694 WO2021134550A1 (zh) | 2019-12-31 | 2019-12-31 | Human merging and training of multiple speech recognition outputs |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021134550A1 true WO2021134550A1 (zh) | 2021-07-08 |
Family
ID=76686048
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/130694 WO2021134550A1 (zh) | 2019-12-31 | 2019-12-31 | 多个语音识别输出的人类合并和训练 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2021134550A1 (zh) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6338038B1 (en) * | 1998-09-02 | 2002-01-08 | International Business Machines Corp. | Variable speed audio playback in speech recognition proofreader |
CN109671433A (zh) * | 2019-01-10 | 2019-04-23 | Tencent Technology (Shenzhen) Co., Ltd. | Keyword detection method and related apparatus |
CN109979474A (zh) * | 2019-03-01 | 2019-07-05 | Gree Electric Appliances, Inc. of Zhuhai | Speech device, and method, apparatus, and storage medium for correcting a user's speech rate |
CN110033769A (zh) * | 2019-04-23 | 2019-07-19 | Nubia Technology Co., Ltd. | Input speech processing method, terminal, and computer-readable storage medium |
CN110060665A (zh) * | 2019-03-15 | 2019-07-26 | Shanghai PPDai Financial Information Service Co., Ltd. | Speech rate detection method and apparatus, and readable storage medium |
CN110473525A (zh) * | 2019-09-16 | 2019-11-19 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for obtaining speech training samples |
CN110556098A (zh) * | 2019-07-23 | 2019-12-10 | Ping An Technology (Shenzhen) Co., Ltd. | Speech recognition result testing method and apparatus, computer device, and medium |
- 2019-12-31: WO PCT/CN2019/130694 patent/WO2021134550A1/zh, active, Application Filing
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230111509A1 (en) | Detecting a trigger of a digital assistant | |
AU2018100384B4 (en) | Intelligent automated assistant | |
US10049668B2 (en) | Applying neural network language models to weighted finite state transducers for automatic speech recognition | |
WO2020087655A1 (zh) | Translation method, apparatus and device, and readable storage medium | |
DK201770338A1 (en) | Intelligent automated assistant for media exploration | |
CN110674320B (zh) | Retrieval method, apparatus, and electronic device | |
TW200847131A (en) | Method and module for improving personal speech recognition capability | |
EP4085452A1 (en) | Speech recognition | |
DK201770421A1 (en) | DETECTING A TRIGGER OF A DIGITAL ASSISTANT | |
Choe et al. | A survey study on the utilization status and user perception of the VUI of smartphones | |
WO2021134550A1 (zh) | Human merging and training of multiple speech recognition outputs | |
WO2021134549A1 (zh) | Human merging and training of multiple artificial intelligence outputs | |
Bakst et al. | Modeling the effect of palate shape on the articulatory-acoustics mapping | |
WO2021134546A1 (zh) | Input method for improving the speech recognition rate | |
EP4354339A3 (en) | Automated assistant for generating, in response to a request from a user, application input content using application data from other sources | |
TW201937480A (zh) | System and method for adaptively adjusting voice input waiting time | |
WO2021134551A1 (zh) | Human merging and training of multiple machine translation outputs | |
Murphy et al. | Adaptive time windows for real-time crowd captioning | |
CN109522425A (zh) | Method, apparatus, and storage device for adjusting a multimedia environment | |
JP2020067584A (ja) | Communication device and control program for communication device | |
US20180336191A1 (en) | Method for multi-sense fusion using synchrony | |
Deselaers et al. | Polite mode for a virtual assistant | |
US20240233712A1 (en) | Speech Recognition Biasing | |
Winn et al. | Backwards and indirect context effects in accommodating gender differences in speech | |
Wilson et al. | Generalization in VOT imitation: Feature adaptation or acoustic-phonetic covariation? |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: The EPO has been informed by WIPO that EP was designated in this application | Ref document number: 19958339; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: PCT application non-entry in European phase | Ref document number: 19958339; Country of ref document: EP; Kind code of ref document: A1 |