WO2023115588A1 - Speech interaction method and apparatus, and storage medium - Google Patents

Speech interaction method and apparatus, and storage medium

Info

Publication number
WO2023115588A1
WO2023115588A1 PCT/CN2021/141405 CN2021141405W WO2023115588A1 WO 2023115588 A1 WO2023115588 A1 WO 2023115588A1 CN 2021141405 W CN2021141405 W CN 2021141405W WO 2023115588 A1 WO2023115588 A1 WO 2023115588A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
text
voice
timer
duration
Application number
PCT/CN2021/141405
Inventor
唐瑞雪
高益
聂为然
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Priority to CN202180041317.8A (patent CN116670760A)
Priority to PCT/CN2021/141405 (patent WO2023115588A1)
Publication of WO2023115588A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Abstract

A speech interaction method and apparatus, and a storage medium. The method comprises: acquiring a first audio signal, wherein the first audio signal comprises a first speech command (S410); determining the duration of a first timer according to first text corresponding to the first speech command (S420); starting the first timer (S430); acquiring a second audio signal, wherein a start moment of the second audio signal is equal to or later than an end moment of the first audio signal (S440); when text corresponding to a speech command in the second audio signal is blank, determining an end moment of the first timer as a speech endpoint (S450); and after the speech endpoint is determined, responding to the first speech command (S460). By means of the method, a speech endpoint can be flexibly determined, such that the problem of a speech response delay being too long due to noise can be alleviated, and premature truncation of speech interaction due to the fact that a user pauses during speech is reduced.

Description

Method, apparatus, and storage medium for voice interaction
Technical field
This application relates to the field of human-computer interaction, and more specifically, to a voice interaction method, apparatus, and storage medium.
Background
Speech recognition is widely used in devices such as smart home devices and smart in-vehicle devices to provide a natural human-machine voice interaction experience. When automatic speech recognition (ASR) identifies the valid speech segments in an audio signal, it relies on start-point detection and end-point detection, that is, detecting where speech begins and where it ends. End-point detection often suffers from excessive delay or premature truncation caused by background noise, differences in users' speaking rates, and pauses in users' speech.
Summary
Embodiments of this application provide a voice interaction method, apparatus, and storage medium that can improve the user's experience of voice responses.
According to a first aspect, a voice interaction method is provided. The method includes: acquiring a first audio signal, where the first audio signal includes a first voice instruction; determining a duration of a first timer according to first text corresponding to the first voice instruction; starting the first timer; acquiring a second audio signal, where a start moment of the second audio signal is equal to or later than an end moment of the first audio signal; when text corresponding to a voice instruction in the second audio signal is empty, determining an end moment of the first timer as a voice endpoint; and after the voice endpoint is determined, responding to the first voice instruction.
In this embodiment of the application, the timer duration can be determined from the text corresponding to the voice instruction in the voice interaction, and the voice endpoint can be flexibly determined from that timer and the second audio signal. This alleviates excessively long voice response delays caused by noise and reduces premature truncation of the voice interaction caused by pauses in the user's speech. Further, because the system response delay is shortened, voice instructions are responded to more quickly, which improves user experience.
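The steps of the first aspect can be sketched as one round of endpoint detection. This is a minimal illustration under stated assumptions, not the patented implementation: the names `detect_endpoint`, `EndpointResult`, and `toy_duration` are hypothetical, the timer is assumed to start at the end moment of the first audio signal, and the rule that maps text to a duration is a toy stand-in for whatever the real system uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EndpointResult:
    endpoint_time: Optional[float]  # None means no endpoint was determined
    timer_duration: float


def detect_endpoint(
    first_text: str,
    first_audio_end: float,
    second_text: str,
    duration_for_text: Callable[[str], float],
) -> EndpointResult:
    """One round of endpoint detection (hypothetical helper)."""
    # The timer duration depends on the text recognized so far.
    duration = duration_for_text(first_text)
    # Assumption: the timer starts when the first audio signal ends.
    timer_end = first_audio_end + duration
    # If the second audio signal yields no text, the end moment of the
    # timer is taken as the voice endpoint.
    if second_text.strip() == "":
        return EndpointResult(endpoint_time=timer_end, timer_duration=duration)
    return EndpointResult(endpoint_time=None, timer_duration=duration)


def toy_duration(text: str) -> float:
    """Toy rule (assumption): a trailing function word looks unfinished."""
    words = text.split()
    unfinished = bool(words) and words[-1].lower() in {"to", "and", "the"}
    return 1.2 if unfinished else 0.4
```

With this sketch, an unfinished-looking instruction such as "navigate to" gets a longer timer than "play music", so a brief pause after "to" is less likely to cut the user off.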
With reference to the first aspect, in some implementations of the first aspect, the method further includes: when the energy of an audio frame of the second audio signal is less than or equal to a first threshold, the end moment of the first timer may be determined as the voice endpoint.
In this embodiment, whether the second audio signal includes a voice instruction can be determined from the energy of its audio frames, which can reduce the misjudgment rate for voice endpoints.
With reference to the first aspect, in some implementations of the first aspect, determining the end moment of the first timer as the voice endpoint when the text corresponding to the voice instruction in the second audio signal is empty includes: when the text corresponding to the voice instruction in the second audio signal is empty and the energy of an audio frame of the second audio signal is less than or equal to the first threshold, determining the end moment of the first timer as the voice endpoint.
In this embodiment, combining the energy of the audio frames of the second audio signal with the text obtained from that signal makes it possible to determine more accurately whether the second audio signal includes a voice instruction, which can improve the accuracy of the determined voice endpoint.
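The combined condition (empty text and low frame energy) can be sketched as follows. The source does not define how frame energy is computed or how multiple frames are aggregated, so two assumptions are made here: energy is the mean squared amplitude of a frame, and every frame must be at or below the threshold. The function names are hypothetical.

```python
from typing import Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one audio frame (assumed definition)."""
    if not frame:
        return 0.0
    return sum(sample * sample for sample in frame) / len(frame)


def second_signal_is_silent(
    recognized_text: str,
    frames: Sequence[Sequence[float]],
    first_threshold: float,
) -> bool:
    """True when both conditions for declaring the endpoint hold."""
    text_empty = recognized_text.strip() == ""
    # Assumption: all frames must be at or below the threshold.
    energy_low = all(frame_energy(f) <= first_threshold for f in frames)
    return text_empty and energy_low
```

Requiring both conditions guards against the case where the recognizer returns no text but the user is still audibly speaking, and against the case where low-level noise is present but nothing was said.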
With reference to the first aspect, in some implementations of the first aspect, the method further includes: acquiring second text, where the second text is displayed on a display screen; and when the first text corresponding to the first voice instruction matches the second text, performing the operation indicated by the second text.
In this embodiment, performing the operation indicated by the second text implements a "what you see is what you can say" function: the user can interact with the user equipment by voice without touching it, which can improve user experience. In addition, the matching between the first text corresponding to the voice instruction and the second text displayed on the display screen can be performed before voice endpoint detection, rather than after a complete voice instruction has been determined. This can significantly shorten the response time to the user's voice instruction and thus improve user experience. Moreover, when no match is found, the first text can still be used for voice endpoint detection, so endpoint detection is not affected.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: acquiring second text displayed on a display screen. Determining the duration of the first timer according to the first text corresponding to the first voice instruction includes: when the first text corresponding to the first voice instruction does not match the second text, determining the duration of the first timer according to the first text corresponding to the first voice instruction.
In this embodiment, when the voice instruction given by the user cannot be matched to the text displayed on the display screen, the duration of the first timer can be determined according to the voice instruction in the voice interaction, and the user's voice instruction is then responded to. The response to the user's voice instruction is therefore not limited to operations indicated by on-screen text and has a wider scope of application.
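The two on-screen text branches (match and execute, or no match and fall back to the timer) can be sketched together. The source does not say how matching is done, so exact matching after case and whitespace normalization is an assumption, and `normalize` and `handle_first_text` are hypothetical names.

```python
from typing import Callable, Iterable, Tuple, Union


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed matching rule)."""
    return " ".join(text.lower().split())


def handle_first_text(
    first_text: str,
    on_screen_texts: Iterable[str],
    duration_for_text: Callable[[str], float],
) -> Tuple[str, Union[str, float]]:
    """Return ("execute", matched text) or ("start_timer", duration)."""
    wanted = normalize(first_text)
    for candidate in on_screen_texts:
        if normalize(candidate) == wanted:
            # Match: act on the on-screen item before any endpoint detection.
            return ("execute", candidate)
    # No match: fall back to text-driven endpoint detection.
    return ("start_timer", duration_for_text(first_text))
```

Because the match is attempted first, a recognized on-screen label can be acted on without waiting for any timer to expire.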
With reference to the first aspect, in some implementations of the first aspect, the method further includes: acquiring a third audio signal, where the third audio signal includes the audio signal received within a first preset time, and a start moment of the first preset time is equal to or later than the end moment of the first audio signal. Determining the duration of the first timer according to the first text corresponding to the first voice instruction includes: when the third audio signal does not include a voice instruction, determining the duration of the first timer according to the first text corresponding to the first voice instruction.
In this embodiment, the third audio signal received within the first preset time is acquired, and the duration of the first timer is determined from the first text only after it is determined that the third audio signal includes no voice instruction. This reduces how often voice endpoint detection runs and thus saves the resources it consumes.
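A small sketch of the preset-window gate follows. It assumes the recognizer reports timestamped text segments and that the first preset time begins exactly at the end moment of the first audio signal (the source only requires that it begin at or after that moment). The name `timer_duration_after_preset_window` and the `Segment` type are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# (time at which the text was recognized, recognized text); assumed format
Segment = Tuple[float, str]


def timer_duration_after_preset_window(
    first_text: str,
    first_audio_end: float,
    preset_time: float,
    segments: Sequence[Segment],
    duration_for_text: Callable[[str], float],
) -> Optional[float]:
    """Return a timer duration only if the preset window contained no speech."""
    window_start = first_audio_end
    window_end = window_start + preset_time
    speech_in_window = any(
        window_start <= t <= window_end and bool(text.strip())
        for t, text in segments
    )
    if speech_in_window:
        # The user kept talking: skip endpoint detection for now.
        return None
    return duration_for_text(first_text)
```

Returning `None` while the user is still speaking means the more expensive text-driven timer logic runs only when a pause has actually been observed.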
With reference to the first aspect, in some implementations of the first aspect, before the first audio signal is acquired, the method further includes: acquiring a fourth audio signal, where the fourth audio signal includes a third voice instruction; determining a duration of a second timer according to third text corresponding to the third voice instruction; starting the second timer, and acquiring a fifth audio signal while the second timer is running, where an end moment of the second timer is earlier than or equal to a start moment of the first timer; and when text corresponding to a voice instruction in the fifth audio signal is non-empty, determining the first audio signal according to the fourth audio signal and the fifth audio signal, where the first audio signal includes the fourth audio signal and the fifth audio signal.
During voice interaction, the user may pause several times, so determining the voice endpoint may take several rounds of detection. When a round of voice endpoint detection fails, detection can be performed again based on the audio signal in the voice interaction, until a voice endpoint is successfully detected and the voice instruction is responded to. To distinguish the audio signals and text used across rounds, the text used in the current round of endpoint detection is defined as the first text, and the corresponding audio signal as the first audio signal. The text used to determine the timer in an earlier round is defined as the third text, and the corresponding audio signal as the fourth audio signal. The fourth audio signal may be part of the first audio signal, and the third text may be part of the first text.
In this embodiment, the text used in an earlier failed round of voice endpoint detection becomes part of the first text used in the current round. This makes full use of the voice instructions acquired during the voice interaction and their corresponding text, so the first timer can be determined more accurately. That in turn improves the accuracy of voice endpoint detection, avoids inappropriate truncation of the voice interaction, and mitigates the effects of noise and pauses in the user's speech.
With reference to the first aspect, in some implementations of the first aspect, the start moment of the first audio signal is earlier than or equal to the start moment of the fourth audio signal, and the end moment of the first audio signal is equal to or later than the end moment of the fifth audio signal.
In this embodiment, for the time period from the start moment of the fourth audio signal to the end moment of the fifth audio signal, the first audio signal may include all voice instructions in the audio signal within that period. This improves the accuracy of the determined first timer, and thus the accuracy of voice endpoint detection and, further, user experience.
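The multi-round behaviour can be sketched as a loop that keeps merging newly recognized text into the accumulated text and recomputes the timer each time, until a round yields empty text. This is a simplified simulation under assumptions: each element of `pieces` stands for the text recognized during one timer period, audio handling is omitted, and `accumulate_until_endpoint` is a hypothetical name.

```python
from typing import Callable, Iterable, List, Tuple


def accumulate_until_endpoint(
    pieces: Iterable[str],
    duration_for_text: Callable[[str], float],
) -> Tuple[str, List[float], bool]:
    """Merge text across rounds; stop at the first empty round after speech.

    Returns (accumulated text, timer duration used in each round,
    whether an endpoint was found).
    """
    accumulated = ""
    durations: List[float] = []
    for piece in pieces:
        piece = piece.strip()
        if accumulated and not piece:
            # Nothing was said while the timer ran: endpoint reached.
            return accumulated, durations, True
        if piece:
            # Earlier text becomes part of the text used in this round.
            accumulated = f"{accumulated} {piece}".strip()
            durations.append(duration_for_text(accumulated))
    return accumulated, durations, False
```

For example, with pieces `["navigate to", "the airport", ""]`, the first round sets a timer from "navigate to", the second round merges in "the airport" and sets a new timer from the full text, and the empty third round marks the endpoint.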
With reference to the first aspect, in some implementations of the first aspect, determining the duration of the first timer according to the first text corresponding to the first voice instruction includes: inputting the first text corresponding to the first voice instruction into a prediction model to obtain the semantic completeness of the first text; and determining the duration of the first timer according to the semantic completeness of the first text.
Optionally, semantic completeness may refer to the degree to which semantics are complete. For example, the semantic completeness of the first text may refer to the degree to which the semantics of the first text are complete. Optionally, first information may be used to represent the semantic completeness.
In this embodiment, inputting the first text into the prediction model yields the semantic completeness of the first text. Whether the corresponding voice instruction is complete can then be judged from that semantic completeness, so the voice endpoint can be determined flexibly.
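The source does not state how semantic completeness maps to a timer duration. The sketch below assumes completeness is a score in [0, 1] and that more complete text warrants a shorter wait, using linear interpolation between two hypothetical bounds (`min_s` and `max_s`, in seconds). Both the direction and the linear form are illustrative assumptions.

```python
def timer_duration_from_completeness(
    completeness: float,
    min_s: float = 0.3,
    max_s: float = 1.5,
) -> float:
    """Map a completeness score in [0, 1] to a wait time in seconds.

    Assumption: complete text (1.0) gets the shortest timer and clearly
    incomplete text (0.0) gets the longest.
    """
    if not 0.0 <= completeness <= 1.0:
        raise ValueError("completeness must be in [0, 1]")
    if min_s > max_s:
        raise ValueError("min_s must not exceed max_s")
    return max_s - completeness * (max_s - min_s)
```

Under this mapping, a score of 0.5 gives a timer halfway between the two bounds.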
According to a second aspect, a voice interaction apparatus is provided. The apparatus includes: an acquisition module, configured to acquire a first audio signal, where the first audio signal includes a first voice instruction, and further configured to acquire a second audio signal, where a start moment of the second audio signal is equal to or later than an end moment of the first audio signal; and a processing module, configured to: determine a duration of a first timer according to first text corresponding to the first voice instruction; start the first timer; when text corresponding to a voice instruction in the second audio signal is empty, determine an end moment of the first timer as a voice endpoint; and after the voice endpoint is determined, respond to the first voice instruction.
With reference to the second aspect, in some implementations of the second aspect, the processing module may be further configured to: when the energy of an audio frame of the second audio signal is less than or equal to a first threshold, determine the end moment of the first timer as the voice endpoint.
With reference to the second aspect, in some implementations of the second aspect, the processing module is specifically configured to: when the text corresponding to the voice instruction in the second audio signal is empty and the energy of an audio frame of the second audio signal is less than or equal to the first threshold, determine the end moment of the first timer as the voice endpoint.
With reference to the second aspect, in some implementations of the second aspect, the acquisition module is further configured to acquire second text displayed on a display screen; and the processing module is specifically configured to: when the first text corresponding to the first voice instruction does not match the second text, determine the duration of the first timer according to the first text corresponding to the first voice instruction.
With reference to the second aspect, in some implementations of the second aspect, the acquisition module is further configured to acquire a third audio signal, where the third audio signal includes the audio signal received within a first preset time, and a start moment of the first preset time is equal to or later than the end moment of the first audio signal; and the processing module is specifically configured to: when the third audio signal does not include a voice instruction, determine the duration of the first timer according to the first text corresponding to the first voice instruction.
With reference to the second aspect, in some implementations of the second aspect, the acquisition module is further configured to: before acquiring the first audio signal, acquire a fourth audio signal, where the fourth audio signal includes a third voice instruction; and acquire a fifth audio signal while a second timer is running. The processing module is further configured to: determine a duration of the second timer according to third text corresponding to the third voice instruction; start the second timer, where an end moment of the second timer is earlier than or equal to a start moment of the first timer; and when text corresponding to a voice instruction in the fifth audio signal is non-empty, determine the first audio signal according to the fourth audio signal and the fifth audio signal, where the first audio signal includes the fourth audio signal and the fifth audio signal.
With reference to the second aspect, in some implementations of the second aspect, the start moment of the first audio signal is earlier than or equal to the start moment of the fourth audio signal, and the end moment of the first audio signal is equal to or later than the end moment of the fifth audio signal.
With reference to the second aspect, in some implementations of the second aspect, the processing module is specifically configured to: input the first text corresponding to the first voice instruction into a prediction model to obtain the semantic completeness of the first text; and determine the duration of the first timer according to the semantic completeness of the first text.
According to a third aspect, a method for training a prediction model for voice interaction is provided. The method includes: acquiring a text data set, where the text data set includes a plurality of fourth texts, each fourth text is labeled with first information, and the first information represents the semantic completeness of that text; and performing model training according to the text data set to obtain a prediction model, where the prediction model is used to predict the semantic completeness of a voice instruction.
In this embodiment, model training can be performed on the text data set to obtain a prediction model. Through training, the prediction model learns the relationship between text and its semantic completeness from the fourth texts in the text data set. In the prediction stage, the model can then predict the semantic completeness of the text to be analyzed, so that during voice interaction, whether the user intends to continue speaking can be determined from the semantic completeness of the text corresponding to the voice instruction in the audio signal.
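To make the shape of the training data concrete, the sketch below fits a deliberately tiny stand-in model on (text, completeness label) pairs. The source does not specify the model architecture, so `LastWordCompletenessModel` is purely hypothetical: it averages the labels of training texts that end in the same word, and falls back to the overall mean label for unseen words. A real system would presumably use a learned text model instead.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


class LastWordCompletenessModel:
    """Toy stand-in for a prediction model (not the patent's model)."""

    def __init__(self) -> None:
        self._by_last_word: Dict[str, float] = {}
        self._default = 0.5

    def fit(self, dataset: Iterable[Tuple[str, float]]) -> "LastWordCompletenessModel":
        sums: Dict[str, float] = defaultdict(float)
        counts: Dict[str, int] = defaultdict(int)
        total, n = 0.0, 0
        for text, label in dataset:
            words = text.split()
            if not words:
                continue
            key = words[-1].lower()
            sums[key] += label
            counts[key] += 1
            total += label
            n += 1
        # Average label per final word; overall mean for unseen words.
        self._by_last_word = {k: sums[k] / counts[k] for k in sums}
        if n:
            self._default = total / n
        return self

    def predict(self, text: str) -> float:
        words = text.split()
        if not words:
            return 0.0
        return self._by_last_word.get(words[-1].lower(), self._default)
```

The predicted score can then feed whatever rule turns semantic completeness into a timer duration.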
With reference to the third aspect, in some implementations of the third aspect, the method further includes: acquiring a text corpus, where the text corpus includes a plurality of texts with complete semantics; and determining the text data set according to the text corpus.
In this embodiment, the text data set is determined from the text corpus, so only texts with complete semantics need to be prepared. This reduces the number of texts that must be prepared to build the text data set and simplifies its construction.
With reference to the third aspect, in some implementations of the third aspect, determining the text data set according to the text corpus may include: determining one or more fourth texts according to a text with complete semantics in the text corpus; and determining the text data set according to the plurality of fourth texts determined from the plurality of texts with complete semantics in the text corpus.
In this embodiment, determining one or more fourth texts from a text with complete semantics in the text corpus simplifies determining the semantic completeness of those fourth texts, and thus simplifies determining and labeling the first information.
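One plausible way to derive several fourth texts from each complete sentence is to take its prefixes. The source does not say prefixes are used at this point, so this is an assumption; word-level splitting on whitespace is also an assumption (a Chinese corpus would more likely be split by character or by a tokenizer). The names `prefixes` and `build_candidates` are hypothetical.

```python
from typing import Iterable, List


def prefixes(sentence: str) -> List[str]:
    """All non-empty word-level prefixes of a sentence, shortest first."""
    words = sentence.split()
    return [" ".join(words[:i]) for i in range(1, len(words) + 1)]


def build_candidates(corpus: Iterable[str]) -> List[str]:
    """Collect the distinct prefixes of every complete sentence, in order."""
    seen = set()
    candidates: List[str] = []
    for sentence in corpus:
        for prefix in prefixes(sentence):
            if prefix not in seen:
                seen.add(prefix)
                candidates.append(prefix)
    return candidates
```

Only complete sentences need to be collected by hand; the incomplete training texts fall out of them automatically.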
With reference to the third aspect, in some implementations of the third aspect, the method further includes: determining a trie (dictionary tree) according to the text corpus, where the trie includes a plurality of nodes; and determining the semantic completeness of a fourth text according to the number of child nodes of a node in the trie.
For example, a fourth text corresponding to a node in the trie can be determined from that node; for instance, the fourth text may be the text that ends at that node. Optionally, the semantic completeness of the fourth text corresponding to a node may be determined according to the number of child nodes of that node in the trie.
In this embodiment, building the trie makes it possible to determine the number of child nodes of a node, and from it the semantic completeness of the fourth text in the text data set that corresponds to that node, which makes determining the semantic completeness of fourth texts more efficient.
With reference to the third aspect, in some implementations of the third aspect, determining the semantic completeness of the fourth text according to the number of child nodes of a node in the trie includes: determining the semantic completeness of the fourth text according to the number of child nodes of the node in the trie and the tail node marker corresponding to the text with complete semantics.
In this embodiment, using both the number of child nodes in the trie and the tail node marker allows the semantic completeness of the fourth text to be determined at a finer granularity, so training yields a more accurate prediction model.
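The trie-based labeling can be sketched as follows. The source says the label depends on a node's child count and on the tail node marker but gives no formula, so the scoring rule here is hypothetical: a node that is not the end of any complete sentence scores 0.0, and a tail node scores `1 / (1 + number of children)`, so a complete sentence that is also the prefix of longer sentences scores below 1.0. Word-level nodes are likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class TrieNode:
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    is_tail: bool = False  # True if a complete sentence ends at this node


def build_trie(corpus: Iterable[str]) -> TrieNode:
    root = TrieNode()
    for sentence in corpus:
        node = root
        for word in sentence.split():
            node = node.children.setdefault(word, TrieNode())
        node.is_tail = True
    return root


def label_prefixes(root: TrieNode) -> List[Tuple[str, float]]:
    """Return (prefix text, completeness label) for every non-root node."""
    labelled: List[Tuple[str, float]] = []
    stack: List[Tuple[List[str], TrieNode]] = [([], root)]
    while stack:
        words, node = stack.pop()
        if words:
            # Hypothetical rule combining tail marker and child count.
            score = 1.0 / (1 + len(node.children)) if node.is_tail else 0.0
            labelled.append((" ".join(words), score))
        for word, child in node.children.items():
            stack.append((words + [word], child))
    return sorted(labelled)
```

Each returned pair is one labeled training example of the kind the text data set is described as containing.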
According to a fourth aspect, an apparatus for training a prediction model for voice interaction is provided. The apparatus includes an acquisition module and a training module. The acquisition module may be configured to acquire a text data set, where the text data set includes a plurality of fourth texts, each fourth text is labeled with first information, and the first information may be used to represent the semantic completeness of the fourth text. The training module may be configured to perform model training according to the text data set to obtain a prediction model, where the prediction model is used to predict the semantic completeness of a voice instruction.
With reference to the fourth aspect, in some implementations of the fourth aspect, the acquisition module may be further configured to acquire a text corpus, where the text corpus may include a plurality of texts with complete semantics; and the apparatus may further include a processing module, which may be configured to determine the text data set according to the text corpus.
With reference to the fourth aspect, in some implementations of the fourth aspect, the processing module is specifically configured to: determine one or more fourth texts according to a text with complete semantics in the text corpus; and determine the text data set according to the plurality of fourth texts determined from the plurality of texts with complete semantics in the text corpus.
With reference to the fourth aspect, in some implementations of the fourth aspect, the processing module may be further configured to: determine a trie (dictionary tree) according to the text corpus, where the trie includes a plurality of nodes; and determine the semantic completeness of a fourth text according to the number of child nodes of a node in the trie.
For example, a fourth text corresponding to a node in the trie can be determined from that node; for instance, the fourth text may be the text that ends at that node. Optionally, the semantic completeness of the fourth text corresponding to a node may be determined according to the number of child nodes of that node in the trie.
With reference to the fourth aspect, in some implementations of the fourth aspect, the processing module may be further configured to: determine the semantic completeness of the fourth text according to the number of child nodes of a node in the trie and the tail node marker determined from the text with complete semantics.
According to a fifth aspect, an apparatus is provided. The apparatus includes a processor and a memory, where the memory is configured to store program instructions, and the processor is configured to invoke the program instructions to perform the method in the first aspect or any possible implementation of the first aspect. The apparatus may be deployed in any device or system capable of voice endpoint detection, such as a voice interaction system, a speech recognition system, a voice assistant, or a smart speaker. For example, it may be a terminal device such as a mobile phone, an in-vehicle terminal, or a wearable device, or a device with computing capability such as a computer, a host, or a server. The apparatus may also be a chip.
According to a sixth aspect, an apparatus is provided. The apparatus includes: a memory, configured to store a program; and a processor, configured to execute the program stored in the memory. When the program stored in the memory is executed, the processor is configured to perform the method in the third aspect or any possible implementation of the third aspect. The apparatus may be a device with computing capability, such as a computer, a host, or a server. The apparatus may also be a chip.
According to a seventh aspect, a terminal device is provided. The terminal device may include the apparatus in the second aspect or any possible implementation of the second aspect, or the apparatus in the fifth aspect or any possible implementation of the fifth aspect.
For example, the terminal device may include one or more of the following: a computer, a smartphone, a tablet computer, a personal digital assistant (PDA), a wearable device, a smart speaker, a television, a drone, a vehicle, an in-vehicle chip, an in-vehicle apparatus (for example, a head unit or an on-board computer), or a robot.
With reference to the seventh aspect, in some implementations of the seventh aspect, the terminal device may be a mobile phone or a vehicle.
The apparatus in any of the second to seventh aspects, or in any possible implementation of any of those aspects, may be an in-vehicle chip, an in-vehicle apparatus (for example, a head unit or an on-board computer), or a car. The car in the embodiments of this application can be understood as one kind of vehicle, and the solutions proposed in the embodiments can also be applied to other vehicles or apparatuses.
According to an eighth aspect, an electronic device is provided. The electronic device may include the apparatus in the fourth aspect or any possible implementation of the fourth aspect, or the apparatus in the sixth aspect or any possible implementation of the sixth aspect.
With reference to the eighth aspect, in some implementations of the eighth aspect, the electronic device may be a cloud service device.
According to a ninth aspect, a computer-readable medium is provided. The computer-readable medium stores program code to be executed by a device, and the program code includes instructions for performing the method in the first aspect or any possible implementation of the first aspect, or the method in the third aspect or any implementation of the third aspect.
According to a tenth aspect, a computer program product containing instructions is provided. When the computer program product runs on a computer, the computer is caused to perform the method in the first aspect or any possible implementation of the first aspect, or the method in the third aspect or any implementation of the third aspect.
Description of Drawings
Fig. 1 shows an application scenario of voice interaction provided by an embodiment of this application.
Fig. 2 is a schematic diagram of a method, provided in this application, for detecting a speech endpoint using voice activity detection.
Fig. 3 is a schematic flowchart of a method for training a prediction model for voice interaction provided by an embodiment of this application.
Fig. 4 is a schematic diagram of a trie determined from an exemplary text corpus, provided by an embodiment of this application.
Fig. 5 is a schematic diagram of an input format of a prediction model provided in this application.
Fig. 6 is a schematic flowchart of a voice interaction method provided by an embodiment of this application.
Fig. 7 is a schematic diagram of an audio signal in voice interaction provided by an embodiment of this application.
Fig. 8 is a schematic diagram of another audio signal in voice interaction provided by an embodiment of this application.
Fig. 9 is a schematic diagram of a method for determining the classification of audio frames provided by an embodiment of this application.
Fig. 10 is a schematic diagram of another audio signal in voice interaction provided by an embodiment of this application.
Fig. 11 is another schematic flowchart of the voice interaction method provided by an embodiment of this application.
Fig. 12 is another schematic flowchart of the voice interaction method provided by an embodiment of this application.
Fig. 13 is an exemplary schematic diagram of a user interface on a display screen provided by an embodiment of this application.
Fig. 14 is a schematic structural diagram of a voice interaction apparatus provided by an embodiment of this application.
Fig. 15 is a schematic structural diagram of an apparatus for training a prediction model for voice interaction provided by an embodiment of this application.
Fig. 16 is an example structural diagram of an apparatus provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below with reference to the accompanying drawings.
Fig. 1 shows an application scenario of voice interaction provided by an embodiment of this application. As shown in Fig. 1, the scenario may include a user and a user device, and the user and the user device may interact by voice. The user device may be a device that supports voice interaction, such as a vehicle-mounted terminal, a smartphone, a smart robot, or a vehicle, or another device with a voice interaction function, such as a smart speaker, a smart home device, a smart television, or a desktop computer. For brevity, further examples are omitted. For example, the device may perform speech recognition. It should be understood that the embodiments of this application do not limit the type of user device.
Optionally, one user may interact by voice with the user device, multiple users may interact by voice with the user device, the interaction may take place through another user device, or multiple users may interact by voice with multiple user devices at the same time. For example, a user may interact through a microphone with a user device that has a speech recognition function; recorded audio may also be played back through a recorder, and a user device with a voice interaction function may capture, recognize, and respond to that audio. It should be understood that these ways for a user to interact by voice with a user device are only examples given for ease of description, and the embodiments of this application do not limit them.
For example, the user device may run an application that supports voice interaction, such as a navigation application, a voice assistant, or an intelligent question-answering application. The embodiments of this application do not limit this. For example, the user device may be one or more terminal devices among a computer, a smartphone, a tablet computer, a personal digital assistant, a wearable device, a smart speaker, a television, a drone, a vehicle, a vehicle-mounted chip, a vehicle-mounted device (such as a head unit or an on-board computer), or a robot.
For example, the application scenario may further include a speech detection platform, which may provide background services for applications that support voice interaction. For instance, the speech detection platform may obtain a prediction model by training a model, and the user device may obtain the prediction model trained by the platform and use it during voice interaction for tasks such as speech recognition and speech endpoint detection. For brevity, further examples are omitted. It should be understood that the user device itself may also have these functions, that is, voice interaction may be implemented without background services from a speech detection platform. The embodiments of this application do not limit this.
During voice interaction, speech endpoint detection can be used to decide when to respond to a voice command. After the user speaks, endpoint detection on the audio determines a speech start point and a speech end point, and the audio between the two can be extracted as one voice command. Voice interaction may be initiated by the user. For example, it may be triggered in a push-to-talk manner, where the user starts the interaction by pressing a physical or virtual button; or it may be triggered by voice wake-up, where the user starts the interaction by speaking a wake-up word. The speech start point (also called the speech front endpoint) is therefore relatively easy to detect accurately. Voice interaction may also be initiated by the user device. For example, after announcing information by voice, the user device may ask the user for a decision (such as "Warning: the left camera may be dirty. Clean it automatically?"). The speech end point (also called the speech back endpoint) can be determined by automatic machine detection, for example based on voice activity detection.
Voice activity detection (VAD) can be used to detect whether the signal within a given time window is a speech signal. For example, Fig. 2 is a schematic diagram of a method, provided in this application, for detecting a speech endpoint using VAD. The audio signal is shown in (a) of Fig. 2, and (b) of Fig. 2 is the corresponding VAD output. The audio signal in time periods where the VAD output is low may be determined to be a non-speech signal (or simply non-speech); for example, the audio for which the VAD output in (b) of Fig. 2 is 0 is determined to be non-speech. Conversely, the audio signal in the other time periods may be determined to be a speech signal.
For example, once a certain duration of non-speech has been detected using VAD, the speech endpoint can be determined and the voice command obtained so far can be responded to, for instance by performing the operation indicated by the command and/or ending the voice interaction. This duration may be called the trailing silence duration and may be set to a fixed value. It is an important parameter of this detection approach. For example, the trailing silence duration may be 800 milliseconds (ms): when more than 800 ms of non-speech is detected using VAD, speech is determined to have ended and the speech endpoint is triggered. However, it is difficult to choose one fixed duration that suits all scenarios and environments. If the trailing silence duration is set too long, the user perceives a long delay; if it is set too short, the user's voice command is easily truncated. Even if different durations are set for different service types, a pause in the user's speech during voice interaction can still easily cause the voice command to be truncated.
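The fixed trailing-silence approach described above can be sketched as follows. This is only an illustration under stated assumptions: the VAD output is taken to be one 0/1 flag per frame, the 10 ms frame length is an assumed value, and the function name `find_endpoint` is hypothetical rather than taken from this application.

```python
from typing import List, Optional

def find_endpoint(vad_flags: List[int],
                  frame_ms: int = 10,
                  tail_silence_ms: int = 800) -> Optional[int]:
    """Return the time in ms at which the endpoint is triggered, or None.

    vad_flags holds one VAD decision per frame (1 = speech, 0 = non-speech).
    The endpoint fires once speech has been seen and the following run of
    non-speech frames reaches the trailing silence duration.
    """
    frames_needed = tail_silence_ms // frame_ms
    silent_run = 0
    seen_speech = False
    for index, flag in enumerate(vad_flags):
        if flag:
            seen_speech = True
            silent_run = 0          # any speech frame resets the silence run
        elif seen_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return (index + 1) * frame_ms
    return None

# 500 ms speech, a 300 ms pause, 200 ms speech, then 1 s of silence.
flags = [1] * 50 + [0] * 30 + [1] * 20 + [0] * 100
print(find_endpoint(flags, tail_silence_ms=800))  # 1800
print(find_endpoint(flags, tail_silence_ms=200))  # 700
```

With the 800 ms setting the endpoint fires 800 ms after the user has actually finished, which is the perceived delay; with the 200 ms setting it fires inside the 300 ms pause, so the second part of the command is cut off.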
In the embodiments of this application, the duration of a timer can be determined from the text corresponding to the user's voice command, and the speech endpoint can be determined flexibly from that timer and a second audio signal. This avoids inappropriate truncation of voice commands and the execution errors that such truncation causes, so the method adapts to a wider range of scenarios and environments. It can also shorten the system response delay, which speeds up the response to voice commands and improves user experience.
The application scenario of voice interaction has been described above by way of example. The following describes, also by way of example, the flow of a method for detecting a speech endpoint in this scenario.
For example, a prediction model may be used to detect the speech endpoint in voice interaction. In the embodiments of this application, the speech endpoint may refer to the speech end point.
Use of the prediction model may include a model training stage and a model prediction stage. In the training stage, training the prediction model yields the relationship between a text and its semantic completeness, producing a prediction model with high prediction accuracy.
For example, Fig. 3 is a schematic flowchart of a method for training a prediction model for voice interaction provided by an embodiment of this application. The method 200 may include:
S210: Obtain a text data set. The text data set includes multiple texts, each labeled with first information, where the first information may be used to indicate the semantic completeness of the text.
For ease of description, a text labeled with first information is defined as a fourth text. In other words, the text data set includes multiple fourth texts, and may further be a collection of fourth texts. Semantic completeness indicates how complete the semantics of a text are. For example, text 1, "我要打游戏" ("I want to play a game"), may have complete semantics; it may be the text corresponding to a voice command issued by the user during voice interaction. Truncating text 1 at different points yields text 2, "我要打" ("I want to play"), and text 3, "我要" ("I want"). Neither text 2 nor text 3 has complete semantics, but text 2 is semantically more complete than text 3. For brevity, further examples are omitted.
It should be understood that "the text data set includes multiple fourth texts" may mean that the text data set includes only fourth texts, or that it also includes texts other than fourth texts. The embodiments of this application do not limit this.
For example, the user device may run a voice interaction application. When the speech detection platform provides background services for applications that support voice interaction, the platform obtains the text data set and trains the prediction model for voice interaction. The platform may train the prediction model online or offline; the embodiments of this application do not limit this. As one example, the speech detection platform may include an acquisition module and a training module. The acquisition module may be used to obtain the text data set, and the training module may be used to train a model offline on the text data set to obtain the prediction model; the user device may later obtain the prediction model and use it to determine speech endpoints. As another example, users may take part in improving the prediction model: the speech detection platform may obtain online multiple voice commands that users upload through their user devices, so that the text data set used for training is continuously updated with the users' voice commands and the resulting prediction model better matches the users' habits of expression. As yet another example, the speech detection platform may include a chip that obtains the text data set and performs model training to obtain the prediction model.
For example, after obtaining the text data set, the user device may itself use it for model training to obtain a prediction model, and then use that prediction model for voice interaction. For brevity, details are omitted here. It should be understood that the above is only an example given for ease of description, and the embodiments of this application do not limit this.
For ease of description, the remainder of the embodiments of this application takes as an example the case where the speech detection platform obtains the text data set and performs model training. It should be understood that this application does not limit this.
For example, the text data set may be determined from a text corpus that includes multiple texts with complete semantics. Optionally, the text corpus may be a collection of texts with complete semantics, that is, each text in the corpus may form a complete sentence. As one example, the acquisition module of the speech detection platform may obtain a text corpus that includes texts with complete semantics such as "打开音乐" ("turn on the music") and "我要打游戏" ("I want to play a game"). The speech detection platform may further include a processing module that determines the text data set from the text corpus, so that the training module can train the model on the text data set. As another example, the speech detection platform may include a chip that obtains the text corpus and determines the text data set. As yet another example, the voice interaction scenario may further include a preprocessing system that carries out the preparation needed before the speech detection platform trains the prediction model. After obtaining the text corpus, the preprocessing system determines the text data set from it, and the speech detection platform trains the prediction model after obtaining the text data set determined by the preprocessing system. It should be understood that the above description of the text corpus is only an example, and the embodiments of this application do not limit how the text corpus is obtained.
For example, determining the text data set from the text corpus may mean determining one or more fourth texts of the text data set from each text with complete semantics in the corpus. Multiple fourth texts can thus be determined from the multiple texts with complete semantics in the corpus, which determines the text data set.
For example, a text in the text corpus may be divided into one or more nodes, where one node may contain one word or character. Taking "打开音乐" ("turn on the music") as an example, the preprocessing system may divide it into the nodes "打", "开", "音", and "乐". Further, the last node of a text may indicate that the text ends there. For example, the last node "乐" of the text "打开音乐" may indicate that this text with complete semantics ends at this node; that is, the node may carry a tail node marker for the text, which may be denoted T_tail. The tail node marker T_tail indicates that the text ending at this node has complete semantics and belongs to the text corpus. It should be understood that the above description of the text corpus is only an example, and this application does not limit this.
For example, based on the nodes into which a text with complete semantics in the corpus is divided, one or more fourth texts belonging to the text data set can be determined. Because the corpus includes multiple texts with complete semantics, multiple fourth texts can be determined, and these make up the text data set. The first node of a text in the corpus may be taken as the start node, and each of that node and the nodes after it may in turn be taken as the last node to determine a fourth text. Taking the corpus text "打开空调" ("turn on the air conditioner") as an example, its nodes are "打", "开", "空", and "调". With the node "打" as the start node and each of "打", "开", "空", and "调" as the last node, four fourth texts belonging to the text data set can be determined: "打", "打开", "打开空", and "打开空调". The semantic completeness of these fourth texts can then be determined. For brevity, details are omitted here. It should be understood that the above method for determining one or more texts in the text data set is only an example, and this application does not limit this.
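The construction of fourth texts from one corpus text can be sketched as follows. This is a minimal illustration, assuming one character per node; the function name `prefix_texts` is hypothetical and does not come from this application.

```python
from typing import List

def prefix_texts(sentence: str) -> List[str]:
    """Return every prefix of `sentence`, shortest first.

    Assumption for illustration: each character is one node, the first
    character is the start node, and each node in turn is the last node.
    """
    return [sentence[:end] for end in range(1, len(sentence) + 1)]

print(prefix_texts("打开空调"))  # ['打', '打开', '打开空', '打开空调']
```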
Optionally, to make it easier to determine the text data set from the text corpus, multiple fourth texts may be determined from a trie built from the corpus, thereby determining the text data set. For example, after each text in the corpus has been divided into one or more nodes, a component such as the preprocessing system, the processing module of the speech detection platform, or a chip may build a trie from those nodes, and the fourth texts and their semantic completeness may be determined from the trie.
For example, the text corpus includes multiple texts with complete semantics, each of which can be divided into one or more nodes. Determining the trie from the corpus may therefore mean determining multiple nodes from the texts with complete semantics in the corpus and building the trie from those nodes. Fig. 4 is a schematic diagram of a trie determined from an exemplary text corpus, provided by an embodiment of this application. The corpus includes five texts with complete semantics: "我要打游戏" ("I want to play a game"), "打电话" ("make a phone call"), "打开音乐" ("turn on the music"), "打开空调" ("turn on the air conditioner"), and "打开空调加热" ("turn on the air conditioner heating"). In the trie shown in Fig. 4, the root node 262 may contain no word or character, and every node other than the root node 262 may contain exactly one character; for example, node 263 may contain the character "打" and node 264 may contain the character "开". Further, connecting in order the nodes on the whole path from the root node to a given node yields the text whose last node is that node, also called the text ending at that node, which may be taken as the text corresponding to the node. For ease of description, the remainder of the embodiments of this application takes the text ending at a node as the text corresponding to that node; that is, wherever "the text corresponding to a node" appears below, it may be replaced by "the text ending at that node". For example, in the trie shown in Fig. 4, following the path from the root node 262 to node 266 "调" gives "打开空调" as the text ending at node 266; likewise, following the path from the root node 262 to node 268 "热" gives "打开空调加热" as the text corresponding to node 268. Optionally, a node may carry a tail node marker T_tail. For example, as shown in Fig. 4, the texts "打开音乐" and "打开空调" corresponding to the node "乐" and node 266 "调" both have complete semantics and are contained in the corpus of Fig. 4, so these nodes may carry tail node markers. For brevity, further examples are omitted. It should be understood that the above method of determining the trie from the text corpus is only an example, and this application does not limit this.
Further, the text data set can be determined from the trie. For example, the trie shown in Fig. 4, determined from the corpus {"我要打游戏", "打电话", "打开音乐", "打开空调", "打开空调加热"}, contains 16 nodes, including "我", "要", and "打". Taking the root node 262 as the start of the text and each of the other nodes as the last node of the text yields 15 fourth texts, such as "我", "我要", and "我要打". These 15 fourth texts form a text data set. For brevity, they are not all listed here.
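The trie of Fig. 4 and the fourth texts derived from it can be sketched as follows. The class and function names (`TrieNode`, `build_trie`, `node_texts`) are hypothetical and chosen only for illustration; the sketch assumes one character per node and a boolean flag standing in for the tail node marker T_tail.

```python
from typing import Dict, List

class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_tail = False  # stands in for T_tail: a corpus text ends here

def build_trie(corpus: List[str]) -> TrieNode:
    """Insert every corpus text character by character under an empty root."""
    root = TrieNode()
    for text in corpus:
        node = root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
        node.is_tail = True
    return root

def node_texts(node: TrieNode, prefix: str = "") -> List[str]:
    """Text ending at every non-root node: the characters on the root path."""
    texts: List[str] = []
    for ch, child in node.children.items():
        texts.append(prefix + ch)
        texts.extend(node_texts(child, prefix + ch))
    return texts

corpus = ["我要打游戏", "打电话", "打开音乐", "打开空调", "打开空调加热"]
root = build_trie(corpus)
fourth_texts = node_texts(root)

print(len(fourth_texts))   # 15 non-root nodes, so 16 nodes with the root
print(fourth_texts[:3])    # ['我', '我要', '我要打']
```

Shared prefixes such as "打开空调" are stored once, which is why the five corpus texts produce 15 distinct fourth texts rather than one per character of every text.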
For example, the trie determined from the text corpus may include multiple nodes, and the semantic completeness of the fourth texts in the text data set can be determined from the number of child nodes of the nodes in the trie.
For example, the semantic completeness of the text corresponding to a node can be determined from the number of child nodes of that node in the trie. When the number of child nodes of a node is 0, the text corresponding to the node may have complete semantics. As shown in Fig. 4, node 264 "开" has 6 child nodes, including the nodes "音" and "乐" and node 265 "空", and node 265 "空" has 3 child nodes. The texts corresponding to node 264 and node 265 do not have complete semantics and cannot form complete sentences; moreover, the text corresponding to node 264 "开" is semantically less complete than the text corresponding to node 265 "空". As another example, node 268 "热" has 0 child nodes, and its corresponding text "打开空调加热" has complete semantics and can form a complete sentence.
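The child-node counts used in this example can be reproduced with a short sketch. In the Fig. 4 example the count for a node covers every node below it (6 for node 264 "开", 3 for node 265 "空"), so the sketch counts all descendants rather than only direct children. It works on the set of distinct prefixes, each of which corresponds to one non-root trie node; the function name `descendant_counts` is hypothetical.

```python
from typing import Dict, List

def descendant_counts(corpus: List[str]) -> Dict[str, int]:
    """Map each node's text to the number of trie nodes below that node."""
    # Every distinct prefix of a corpus text is one non-root trie node.
    prefixes = {text[:end] for text in corpus
                for end in range(1, len(text) + 1)}
    # Nodes below a node are the longer prefixes that extend its text.
    return {
        p: sum(1 for q in prefixes if len(q) > len(p) and q.startswith(p))
        for p in prefixes
    }

corpus = ["我要打游戏", "打电话", "打开音乐", "打开空调", "打开空调加热"]
counts = descendant_counts(corpus)

print(counts["打开"])          # 6, node 264
print(counts["打开空"])        # 3, node 265
print(counts["打开空调加热"])  # 0, node 268
```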
进一步地,可以根据文本数据集中的第四文本的语义完整度,对该第四文本标注第一信息,该第一信息可以用于表示该文本的语义完整度。Further, according to the semantic completeness of the fourth text in the text data set, the fourth text may be marked with first information, and the first information may be used to indicate the semantic completeness of the text.
示例性,可以根据字典树中的节点的子节点数目,确定该节点对应的文本的语义完整度,从而可以确定文本数据集中的第四文本的语义完整度,也就是说,可以根据字典树中的节点的子节点的数量,确定该节点对应的第四文本的第一信息,并可以对该第四文本标注其对应的第一信息。Exemplarily, the semantic completeness of the text corresponding to the node can be determined according to the number of child nodes of the node in the dictionary tree, so that the semantic completeness of the fourth text in the text data set can be determined, that is to say, according to the The number of child nodes of the node of is determined, the first information of the fourth text corresponding to the node is determined, and the corresponding first information can be marked on the fourth text.
应理解,以上根据文本语料集基于字典树确定文本数据集的方法只是示例,本申请对此不做限定。It should be understood that the above method of determining the text data set based on the dictionary tree according to the text corpus is just an example, which is not limited in the present application.
Optionally, the first information may represent semantic completeness numerically. Exemplarily, to facilitate representation and statistics, the number of child nodes of a node in the dictionary tree may be mapped to the interval [0, 1] to generate first frequency information of the node. The first frequency information reflects how many child nodes the node has and may be used as the first information of the text corresponding to the node, thereby indicating the semantic completeness of the text. In other words, first frequency information is first information that represents semantic completeness numerically. For example, as shown in Figure 4, nodes such as "戏" and "话" have 0 child nodes; nodes such as "游" and "电" have 1 child node; node 269 "打" and node 266 "调" have 2 child nodes; the node "要" and node 265 "空" have 3 child nodes; the node "我" has 4 child nodes; node 264 "开" has 6 child nodes; and node 263 "打" has 9 child nodes, for a total of 15 nodes. The number of child nodes of the node "我" is not 0, so its corresponding text does not have complete semantics. According to cumulative probability distribution statistics, 13 nodes have a number of child nodes less than or equal to that of the node "我", so the first frequency information of the node "我" may be 13/15, approximately 0.867, and this value may be used as the first information of the fourth text "我" corresponding to the node. As another example, 14 nodes have a number of child nodes less than or equal to that of node 264 "开", so the first information of the text "打开" corresponding to node 264 "开" may be 14/15, approximately 0.933. In this way, for example, a preprocessing system or a processing module of a speech detection platform can label each fourth text with its corresponding first information. It should be understood that the above method of determining the first information of the fourth text is only an example; other methods may also be used to determine the first information according to the number of child nodes, which is not limited in the embodiments of this application.
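The cumulative-distribution mapping in the example above can be reproduced with a short sketch. It assumes that the "number of child nodes" in the Figure 4 example counts every node below a given node (not only direct children), since that reading matches the counts 4, 6, and 9 given above. The function names are hypothetical.

```python
from fractions import Fraction


def build_trie(corpus):
    """Build a character-level dictionary tree as nested dicts."""
    root = {}
    for sentence in corpus:
        node = root
        for ch in sentence:
            node = node.setdefault(ch, {})
    return root


def descendant_counts(node, prefix="", out=None):
    """Map each prefix text to the number of nodes below its last node."""
    if out is None:
        out = {}
    total = 0
    for ch, child in node.items():
        text = prefix + ch
        descendant_counts(child, text, out)
        total += 1 + out[text]
    if prefix:  # the root itself corresponds to no text
        out[prefix] = total
    return out


def first_frequency(counts):
    """Fraction of nodes whose count is less than or equal to this node's count."""
    values = list(counts.values())
    return {
        text: Fraction(sum(v <= c for v in values), len(values))
        for text, c in counts.items()
    }


corpus = ["我要打游戏", "打电话", "打开音乐", "打开空调", "打开空调加热"]
counts = descendant_counts(build_trie(corpus))
freq = first_frequency(counts)
print(freq["我"], round(float(freq["我"]), 3))      # 13/15 0.867
print(freq["打开"], round(float(freq["打开"]), 3))  # 14/15 0.933
```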
Optionally, the first information may represent semantic completeness in the form of a label.
Exemplarily, the first information may be a first label or a second label, where the first label may indicate that the text has complete semantics and the second label may indicate that the text does not have complete semantics. For example, as shown in Figure 4, the node "乐" has 0 child nodes, so the text "打开音乐" ending with this node has complete semantics, and its first information may be the first label. The text "打开空" ending with node 265 "空" does not have complete semantics, and its first information may be the second label.
Exemplarily, the first information may also be a third label, which may indicate that the text may have complete semantics in some contexts but not in others; that is, whether the text has complete semantics cannot be determined from its content alone. For example, the text corpus corresponding to the dictionary tree shown in Figure 4 includes text A "打开空调" and text B "打开空调加热", both of which have complete semantics. The text "打开空调" ending with node 266 "调" may be the whole of text A, in which case the text corresponding to node 266 has the full semantics of text A. It may also be a part of text B, in which case the text corresponding to node 266 represents only part of the semantics of text B and cannot represent its full semantics. Therefore, the first information of the fourth text "打开空调" may be the third label. It should be understood that the above manner of determining the first information of the fourth text is only an example for ease of description, and is not limited in the embodiments of this application.
It should be understood that the first label, the second label, and the third label may be in any data format, such as numbers, letters, or character strings. For example, the first label may be "complete", the second label may be "other", and the third label may be "part". For instance, as shown in Figure 4, the text "打开空调加热" ending with node 268 "热" has complete semantics, so the first information of the text may be complete; the text "打开空" ending with node 265 "空" does not have complete semantics, so its first information may be other. It should be understood that this is not limited in the embodiments of this application.
For ease of description, the following embodiments of this application use complete as the first label, other as the second label, and part as the third label. That is, wherever complete, other, or part appears below, it can be replaced by the first label, the second label, or the third label, respectively.
Exemplarily, when the number of child nodes of a node is 0, that is, when the node has no child nodes, the first information of the text corresponding to the node may be determined to be complete. For example, as shown in Figure 4, node 268 "热" has 0 child nodes, and the text "打开空调加热" ending with this node has complete semantics, so the first information of the text may be complete. For brevity, details are not repeated here.
Optionally, the semantic completeness of the text corresponding to a node may be determined according to the number of child nodes of the node and a tail node marker, so as to determine the first information to be labeled.
Exemplarily, when the number of child nodes of a node is not 0 and the node includes a tail node marker, the first information of the text corresponding to the node may be determined to be the third label. For example, as shown in Figure 4, node 266 "调" has 2 child nodes and can also serve as the tail node of the semantically complete text "打开空调"; that is, the node may include a tail node marker. The first information of its corresponding text can therefore be determined to be part. For brevity, details are not repeated here. Optionally, when a node includes a tail node marker, the number of child nodes of the node can reflect how semantically complete the text ending with the node is: the more child nodes, the lower the semantic completeness of the text.
Exemplarily, when the number of child nodes of a node is greater than 0 and the node does not include a tail node marker, the first information of the text corresponding to the node may be determined to be the second label, indicating that the text ending with the node does not have complete semantics. For example, as shown in Figure 4, node 265 "空" has 3 child nodes, its corresponding first frequency information is greater than 0, and the node has no tail node marker. The label information of the node is therefore determined to be other, indicating that the text "打开空" ending with node 265 "空" does not have complete semantics. For brevity, further examples are not given here.
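The three labeling rules above (no child nodes: complete; child nodes plus a tail node marker: part; child nodes without a tail node marker: other) can be sketched as follows. The `TAIL` key and the function names are hypothetical choices made for this illustration only.

```python
TAIL = "$"  # hypothetical key marking that a corpus sentence ends at this node


def build_trie(corpus):
    """Build a nested-dict dictionary tree and record tail node markers."""
    root = {}
    for sentence in corpus:
        node = root
        for ch in sentence:
            node = node.setdefault(ch, {})
        node[TAIL] = True
    return root


def label_texts(node, prefix="", labels=None):
    """Assign complete / part / other to the text ending at every node."""
    if labels is None:
        labels = {}
    for ch, child in node.items():
        if ch == TAIL:
            continue
        text = prefix + ch
        has_children = any(key != TAIL for key in child)
        if not has_children:
            labels[text] = "complete"  # first label
        elif TAIL in child:
            labels[text] = "part"      # third label
        else:
            labels[text] = "other"     # second label
        label_texts(child, text, labels)
    return labels


corpus = ["我要打游戏", "打电话", "打开音乐", "打开空调", "打开空调加热"]
labels = label_texts(build_trie(corpus))
print(labels["打开空调加热"], labels["打开空调"], labels["打开空"])
# complete part other
```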
Optionally, the first information may combine a number and a label to represent semantic completeness. Exemplarily, the first frequency information may be combined with the first, second, and third labels to represent the semantic completeness of a text. For brevity, details are not repeated here. It should be understood that this is not limited in the embodiments of this application.
Optionally, the first information of a fourth text may be adjusted according to the number of nodes included in the text. Exemplarily, when a label combined with a number is used as the first information to represent the semantic completeness of a text, the first information of the text may be adjusted if the length of the text, that is, the number of nodes included in the text, is less than or equal to a length threshold. For example, suppose the length threshold is 10 nodes and the text is "打开空调" as shown in Figure 4. The first information of the text may be part and 8/15, where 8/15 is the first frequency information of the node "调". Because the text includes 4 nodes, which is less than the length threshold, the number in its first information may be adjusted from 8/15 to 0.4. It should be understood that the above manner of adjusting the first information is only an example, and is not limited in the embodiments of this application.
The frequency with which semantically complete texts appear in the text corpus may differ from how users actually speak; for example, users may use short sentences more often than long ones. In the embodiments of this application, adjusting the first information of the fourth text can make the text data set better match the actual voice interaction process, so that a more accurate prediction model can be obtained after training.
In the embodiments of this application, the text data set is determined from the text corpus, so that only texts with complete semantics need to be prepared, which reduces the number of texts that must be prepared to construct the text data set. Further, one or more fourth texts can be determined from the texts in the text corpus, which simplifies determining the semantic completeness of the fourth texts and therefore simplifies determining and labeling the first information.
It should be understood that the text data set may also be obtained in other ways. For example, a corpus may be constructed directly that includes both texts with complete semantics and texts without complete semantics, and the first information of each text may be determined according to its semantic completeness, thereby determining the text data set. It should be understood that this application does not limit the method of obtaining the text data set.
S220: Perform model training according to the text data set to obtain a prediction model, where the prediction model is used to predict the semantic completeness of a voice instruction.
Exemplarily, the prediction model may predict the semantic completeness of the text corresponding to a voice instruction to determine how semantically complete the voice instruction is, and thereby determine whether the user intends to continue speaking. Exemplarily, the prediction model may determine the first information of an input text and determine, from the output first information, whether the text has complete semantics. This indicates whether the voice instruction corresponding to the text is a complete voice instruction, and therefore whether the user intends to continue speaking.
The prediction model may be an artificial intelligence (AI) model. The prediction model may be of various types; for example, it may include at least one of a neural network, a support vector machine, a linear regression model, a logistic regression model, a decision tree, or a random forest. Exemplarily, the prediction model may be a neural network, such as a convolutional neural network or a recurrent neural network. It should be understood that the above prediction models are only examples for ease of description, and are not limited in the embodiments of this application.
Optionally, the prediction model may be a bidirectional encoder representations from transformers (BERT) model. The input of the model may be [CLS] + text + [SEP], where [CLS] is a special character marking the start of a piece of text and [SEP] is a special character marking its end. Exemplarily, Figure 5 is a schematic diagram of an input format of a prediction model provided by this application, where the prediction model may be a BERT model. Depending on the point in time, the text to be predicted may be "打开天窗" ("open the sunroof"), "打开天窗到" ("open the sunroof to"), "打开天窗到百分之" ("open the sunroof to ... percent"), or "打开天窗到百分之六十" ("open the sunroof to sixty percent"), and the text may be divided into nodes such as "打", "开", "天", and "窗". For brevity, details are not repeated here.
Exemplarily, during voice interaction, the user instruction in the acquired audio signal becomes progressively more complete over time, so new streaming text results are obtained continuously. These streaming text results (also called streaming text) may be input into the prediction model to determine the semantic completeness of the current text and thereby determine whether the user intends to continue speaking. For example, as shown in Figure 5, suppose voice interaction starts at second 0 (s) and, at 2 s, the text corresponding to the voice instruction in the acquired audio signal is "打开天窗到". The text may be input into the prediction model in the format shown in Figure 5, [CLS] + "打" "开" "天" "窗" "到" + [SEP], to predict its semantic completeness, from which it can be determined that the user intends to continue speaking. For another example, at 5 s the text corresponding to the voice instruction in the acquired audio signal is "打开天窗到百分之六十". The text may be input into the prediction model in the format shown in Figure 5 to predict its semantic completeness, from which it can be determined that the user does not intend to continue speaking, and therefore that the complete voice instruction issued by the user in the voice interaction is "打开天窗到百分之六十". It should be understood that the above input to the BERT model is only an example, and is not limited in the embodiments of this application.
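The input format in Figure 5 can be illustrated with a minimal sketch. It assumes that each character is one input unit, as the figure suggests; a real BERT tokenizer may split text differently. The name `to_model_input` is hypothetical.

```python
def to_model_input(text):
    """Wrap a streaming text result as [CLS] + characters + [SEP]."""
    return ["[CLS]"] + list(text) + ["[SEP]"]


# Streaming text results obtained at successive points in time.
streaming_results = ["打开天窗", "打开天窗到", "打开天窗到百分之", "打开天窗到百分之六十"]

for text in streaming_results:
    print(to_model_input(text))
```

Each new streaming result is formatted and scored independently, so the prediction always reflects the most recent transcript.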
It should be understood that details of the BERT model can be found in the related art and, for brevity, are not repeated in this application. In the embodiments of this application, exemplarily, the output of the model may be the first information of the text, which represents the semantic completeness of the text. For example, after the text is input into the BERT model in the format shown in Figure 5, [CLS] + "打开天窗到百分之六十" + [SEP], the model may output the first information of the text. The first information may be 0, that is, it may represent semantic completeness numerically; or it may be complete, that is, it may represent semantic completeness as a label. This application does not limit this.
Exemplarily, model training may be implemented in various ways. In some embodiments, model training may include multiple iterations. One iteration may include the following steps:
S305: Input a fourth text from the text data set into the prediction model, process the text with the prediction model, and output a prediction result.
S310: Calculate a first loss value with a loss function according to the prediction result and the first information of the fourth text. The first loss value represents the deviation between the prediction result and the first information; the larger the deviation, the larger the first loss value.
S315: Adjust the parameters of the prediction model according to the first loss value.
The above describes one iteration of training. After each iteration, the speech detection platform may check whether a training termination condition is met. If the condition is not met, the next iteration is performed; if it is met, the prediction model used in the current iteration is output as the trained prediction model.
The training termination condition may be that the number of iterations reaches a target number, that the loss function satisfies a preset condition, or that the model's performance on a validation data set has not improved for a period of time. The target number may be a preset number of iterations that determines when training ends and avoids wasting training resources. The preset condition may be that the value of the loss function remains unchanged or does not decrease for a period of time during training, which indicates that training has achieved its effect, that is, the prediction model has acquired the ability to determine from sentence text whether the user intends to continue speaking. The validation data set may be distinct from the text data set and may be used to evaluate the training effect.
It should be understood that the above manner of training the model is only an example, and this application does not limit it.
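The iteration S305 to S315 and two of the termination conditions (a target number of iterations and a loss that stops decreasing) can be sketched as follows. This is only an illustration under strong assumptions: a toy one-feature logistic model stands in for the prediction model, the feature is the number of child nodes of the text's last node, the target is 1 for a semantically complete text and 0 otherwise, and all names and hyperparameter values are hypothetical. The validation-set criterion is not shown.

```python
import math


def predict(params, x):
    """S305: the toy model outputs a completeness score in (0, 1)."""
    w, b = params
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def train(samples, lr=0.5, max_iters=300, patience=20, min_delta=1e-6):
    """Run gradient-descent iterations until a termination condition is met."""
    params = (0.0, 0.0)
    best_loss, stale = float("inf"), 0
    for iteration in range(1, max_iters + 1):
        loss = grad_w = grad_b = 0.0
        for x, y in samples:
            p = predict(params, x)
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
            # S310: cross-entropy loss grows with the prediction error.
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
            grad_w += (p - y) * x
            grad_b += p - y
        n = len(samples)
        loss /= n
        # S315: adjust the parameters according to the loss gradient.
        params = (params[0] - lr * grad_w / n, params[1] - lr * grad_b / n)
        # Termination: the loss has not decreased for `patience` iterations.
        if best_loss - loss > min_delta:
            best_loss, stale = loss, 0
        else:
            stale += 1
        if stale >= patience:
            return params, iteration, "loss_plateau"
    # Termination: the target number of iterations has been reached.
    return params, max_iters, "max_iterations"


# (number of child nodes, 1 if the text is semantically complete else 0)
samples = [(0, 1), (0, 1), (1, 0), (2, 0), (3, 0), (6, 0)]
params, iterations, reason = train(samples)
print(iterations, reason, predict(params, 0) > 0.5, predict(params, 3) < 0.5)
```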
The embodiments of this application provide a method for training a prediction model for speech detection, in which model training is performed on the acquired text data set to obtain the prediction model. Through training, the prediction model learns the relationship between a text and its semantic completeness from the texts in the text data set. In the prediction stage, the semantic completeness of a text to be analyzed can then be predicted with the prediction model. During voice interaction, determining the semantic completeness of the text corresponding to the voice instruction in the audio signal makes it possible to determine whether the user intends to continue speaking, and therefore to accurately determine the voice instruction to respond to.
Exemplarily, Figure 6 is a schematic flowchart of a voice interaction method provided by an embodiment of this application. The method 400 may include steps S410 to S460.
S410: Acquire a first audio signal, where the first audio signal includes a first voice instruction.
The first audio signal may be an audio signal used to determine a voice endpoint, and the first voice instruction may be the voice instruction contained in the first audio signal.
Exemplarily, when a user equipment supports voice interaction, the user equipment may acquire the audio signal in the voice interaction. For example, taking voice interaction between a person and a vehicle as the scenario, the vehicle may include a voice interaction apparatus that includes an acquisition module and a processing module, and the acquisition module may acquire the first audio signal. For another example, the vehicle may include one or more processors that may be used to execute the method 400 and may acquire the first audio signal. For still another example, the vehicle may include a chip for voice interaction that may be used to execute the method 400 and may acquire the first audio signal. For brevity, further examples are not given here. It should be understood that the above description of the scenario is only an example, and is not limited in the embodiments of this application.
Exemplarily, during voice interaction, the user equipment may acquire audio signals continuously. When voice endpoint detection is performed, part or all of the audio signal already acquired in the voice interaction may be used as the first audio signal, where the acquired audio signal includes a voice instruction. For example, in a voice interaction starting at time 0, the audio signal may be acquired continuously from time 0 until the voice interaction ends, and the audio signal includes a voice instruction. When voice endpoint detection is performed at time 3, the audio signal from time 0 to time 3 has been acquired, and part or all of it may be used as the first audio signal for determining the voice endpoint. For instance, the audio signal between time 1 and time 3 may be used as the first audio signal, or the audio signal between time 0 and time 3 may be used as the first audio signal. If a voice endpoint can be determined from the first audio signal, the voice endpoint may be taken as the end moment of this voice interaction. If a voice endpoint cannot be determined from the first audio signal, that is, the voice interaction has not ended, the audio signal continues to be acquired; voice endpoint detection may be performed again, for example at time 5, and part or all of the audio signal between time 0 and time 5 may be used as a new first audio signal for determining the voice endpoint again. It should be understood that time 0 < time 1 < time 3 < time 5; that is, time 0 is the earliest and time 5 is the latest.
It should be understood that the above method of acquiring the first audio signal is only an example for ease of description, and is not limited in the embodiments of this application.
S420: Determine the duration of a first timer according to a first text corresponding to the first voice instruction.
It should be understood that the first text corresponding to the first voice instruction may be obtained before the duration of the first timer is determined.
Exemplarily, the first text is the text corresponding to the first voice instruction and may be obtained by performing speech recognition on the voice instruction in the first audio signal. For example, taking voice interaction between a person and a vehicle as the scenario, the vehicle may include a voice interaction apparatus that includes an acquisition module and a processing module. In the voice interaction, the user utters the voice instruction "dǎkāitiānchuāng" (that is, "打开天窗", "open the sunroof"). The acquisition module may acquire the first audio signal including the voice instruction, and the processing module may perform speech recognition on the first audio signal to determine that the first text corresponding to the voice instruction is "打开天窗", and may determine the duration of the first timer according to the first text. For another example, the vehicle may further include a speech recognition apparatus that performs speech recognition on the audio signal to obtain the first text corresponding to the voice instruction in the audio signal, and the processing module may determine the duration of the first timer according to the first text. The speech recognition apparatus may also be located inside the voice interaction apparatus, that is, it may be embodied as a speech recognition module of the voice interaction apparatus, which is not limited in the embodiments of this application. For brevity, further examples are not given here.
Exemplarily, during voice interaction, audio signals may be acquired continuously, streaming text results may be obtained through automatic speech recognition, and the first text may be determined according to the streaming text results. For example, in a voice interaction starting at time 0, audio signals may be acquired continuously and automatic speech recognition may be performed on the audio acquired so far. At time 1, the acquired audio contains no voice instruction, so the streaming text result at that time is empty. At time 2, the streaming text result is "open". At time 3, the real-time streaming text result is "open the sunroof". When voice endpoint detection is triggered at time 3, the streaming text result "open the sunroof" at that time may be used as the first text. Correspondingly, the audio signal between time 0 and time 3, or between time 1 and time 3, may be used as the first audio signal. In other words, the first text may first be determined according to the streaming text result, and the corresponding first audio signal may then be determined from the acquired audio based on that first text. For another example, the streaming text result may include timestamps. After the first text is determined according to the streaming text result, the corresponding first audio signal may be determined from the acquired audio based on the timestamps, which avoids the effects of long speech recognition time and long delay. It should be understood that time 0 < time 1 < time 2 < time 3, that is, time 0 is the earliest and time 3 is the latest.
It should be understood that the above manners of obtaining the first text are merely examples for ease of description, and are not limited in the embodiments of this application.
Exemplarily, because speech recognition takes processing time, the time at which the first text is obtained may be equal to or later than the time at which the first audio signal is obtained, which is not limited in the embodiments of this application.
Exemplarily, the duration of the first timer may be determined according to the prediction model trained by method 200. For example, the first text may be input into the prediction model, first information of the first text may be obtained from the prediction model, and the duration of the first timer may be determined according to the first information. For another example, the first text may first be converted into the input format required by the prediction model and then input into the prediction model, and the duration of the first timer is determined from the result. The first timer may be used to determine the voice endpoint.
In the model training stage, the prediction model is trained on a text data set and learns the mapping between a text and its semantic completeness. Therefore, in step S420, the prediction model may recognize the first text based on the learned mapping, determine the semantic completeness of the first text, and thereby determine the first information of the first text, so as to judge whether the user intends to continue speaking.
Optionally, the duration of the first timer may be determined according to the first information of the first text. For example, in a scenario where a person interacts with a vehicle by voice, the vehicle may include a voice interaction device, and the voice interaction device may include an acquisition module and a processing module. After the first text is determined, the processing module may input the first text into the prediction model trained by method 200 to obtain the first information of the first text, which represents the semantic completeness of the first text. The processing module may then determine the duration of the first timer according to the first information.
Exemplarily, when the first information represents semantic completeness in the form of a label, the duration of the first timer may be determined according to the label. For example, suppose the first, second, and third labels are complete, other, and part, respectively. When the first information is complete, the prediction model indicates that the first text is semantically complete, and the user is considered unlikely to continue speaking. A short first timer duration (for example, 400 ms) may therefore be set to reduce the delay of the voice interaction and improve user experience. When the first information is other, the prediction model indicates that the first text is not semantically complete, and the user is considered to intend to continue speaking. A long first timer duration (for example, 1500 ms) may therefore be set to avoid truncating the user's speech prematurely and the incorrect execution of voice instructions that this could cause, which accommodates speaking habits such as slow speech or frequent pauses. When the first information is part, the prediction model indicates that the first text is fairly complete semantically, and a moderate first timer duration (for example, 800 ms) may be set to provide a good user experience while balancing delay against premature truncation of the user's speech. For brevity, further examples are not enumerated.
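The label-based mapping above can be sketched as a simple lookup. This is a minimal Python illustration, not the applicant's implementation: the durations are the example values given in the text, while the names `TIMER_MS_BY_LABEL` and `timer_duration_from_label`, and the fallback for unknown labels, are assumptions.

```python
# Example durations (ms) taken from the text; names are hypothetical.
TIMER_MS_BY_LABEL = {
    "complete": 400,   # semantically complete: user unlikely to continue
    "part": 800,       # fairly complete: moderate wait
    "other": 1500,     # incomplete: user likely to continue speaking
}


def timer_duration_from_label(label: str) -> int:
    """Return the first-timer duration in milliseconds for a completeness label."""
    # Assumption: an unknown label falls back to the longest wait, so that
    # the user's speech is not truncated prematurely.
    return TIMER_MS_BY_LABEL.get(label, 1500)
```

Falling back to the longest duration is a conservative design choice: an unnecessary wait costs some latency, whereas a premature cut-off may cause the wrong instruction to be executed.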
Exemplarily, when the first information represents semantic completeness as a number, the duration of the first timer may be determined according to that number. For example, if the first information determined by the prediction model is 0, the first text may be considered semantically complete, and a short first timer duration (for example, 400 ms) may be set to reduce the delay of the voice interaction and improve user experience. If the first information (for example, first frequency information of 0.58 obtained for the first text from the prediction model) is greater than or equal to a second threshold (for example, 0.4), the first text may be considered not semantically complete, and a long first timer duration (for example, 1500 ms) may be set. When the first information is greater than 0 and less than the second threshold, the first text may be considered fairly complete semantically, and a moderate first timer duration may be set to provide a good user experience. It should be understood that the above method of determining the duration of the first timer according to the first information is merely an example, which is not limited in this application.
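The numeric case can be sketched in the same way. The following Python fragment is only an illustration under stated assumptions: the function name is hypothetical, the 400 ms and 1500 ms values and the 0.4 threshold are the examples from the text, and the 800 ms "moderate" value is assumed here because the text does not give a number for this case.

```python
def timer_duration_from_score(score: float, second_threshold: float = 0.4) -> int:
    """Map a numeric completeness score to a first-timer duration in ms.

    A score of 0 means semantically complete; a score at or above
    second_threshold means not semantically complete.
    """
    if score <= 0:
        return 400    # complete: short wait
    if score >= second_threshold:
        return 1500   # incomplete: long wait
    return 800        # assumed moderate value for 0 < score < threshold
```

For instance, the example score of 0.58 from the text exceeds the threshold of 0.4 and therefore maps to the long duration.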
Exemplarily, the first information may represent semantic completeness by combining a label and a number, and the duration of the first timer may be determined accordingly. Exemplarily, the duration of the first timer may be determined according to the first frequency information in combination with the label. For example, when the label in the first information is part, the duration of the first timer may be set to 1200 ms if the first frequency information is 0.3, and to 500 ms if the first frequency information is 0.05. In this way, the duration of the first timer can be set at a finer granularity, which better balances delay against premature truncation of speech and provides a better experience for the user.
It should be understood that the above methods of determining the duration of the first timer according to the first information are merely examples for ease of description, and are not limited in this application.
It should be understood that the above method of determining the duration of the first timer based on the prediction model is merely an example, and other methods may also be used to determine the duration of the first timer.
Exemplarily, after the first text is obtained, the duration of the first timer may be determined by querying a database. The database may include multiple texts and the first timer durations corresponding to those texts. For example, the database may include the texts of sentences commonly used in voice interaction. After the first text is determined, the processing module may determine the duration of the first timer according to how the first text matches the texts in the database. For brevity, further examples are not enumerated here.
Exemplarily, after the first text is obtained, punctuation may be added to the first text according to its structure, the parts of speech of its words, and so on. When no suitable punctuation mark can be added at the end of the first text, the first text may be considered not semantically complete, and a long first timer duration (for example, 1500 ms) may be set. When a punctuation mark (for example, a period or a comma) can be added at the end of the first text, the duration of the first timer may be set according to the added punctuation mark. For example, a period may indicate that the first text is semantically complete, and a short first timer duration (for example, 500 ms) may be set. For another example, a comma may indicate that the first text is fairly complete semantically, and a moderate first timer duration (for example, 800 ms) may be set. For brevity, further examples are not enumerated.
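The punctuation-based rule above can be sketched as follows. This Python fragment assumes that a separate punctuation-restoration step (not shown, and not specified in the text) has already produced the punctuated first text; the function name is hypothetical, and the durations are the example values from the text.

```python
def timer_duration_from_punctuation(punctuated_text: str) -> int:
    """Choose a first-timer duration (ms) from the trailing punctuation mark."""
    # Take the last non-whitespace character; empty string if there is none.
    last = punctuated_text.rstrip()[-1:]
    if last in ("。", "."):
        return 500    # period: semantically complete
    if last in (",", ","):
        return 800    # comma: fairly complete
    return 1500       # no trailing punctuation could be added: incomplete
```

Both full-width and ASCII punctuation are checked so that the sketch works for Chinese and English recognition output alike.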
It should be understood that the above methods of determining the duration of the first timer according to the first text are merely examples for ease of description, and are not limited in the embodiments of this application.
Optionally, a third audio signal may be acquired, and when the third audio signal does not include a voice instruction, the duration of the first timer may be determined according to the first text. The third audio signal includes the audio signal received within a first preset time, and the start time of the first preset time is the end time of the first audio signal. For example, in a scenario where a person interacts with a vehicle by voice, the vehicle may include a voice interaction device, and the voice interaction device may include an acquisition module and a processing module. The acquisition module may acquire the third audio signal, and the processing module may determine whether the third audio signal includes a voice instruction. For another example, the voice interaction device may include a processor, and the processor may execute method 400, acquire the third audio signal, and determine whether the third audio signal includes a voice instruction. For brevity, details are not repeated here.
Exemplarily, whether the third audio signal includes a voice instruction may be determined according to whether the text recognized from the third audio signal is empty. For example, in a voice interaction starting at time 0, audio signals may be acquired continuously from time 0, and the streaming text results produced by automatic speech recognition may be obtained in real time. At time 3, the streaming text result is "open the sunroof". When voice endpoint detection is triggered at time 3, the audio signal between time 0 and time 3 may be used as the first audio signal, and the streaming text result "open the sunroof" may be used as the first text. Within the first preset time starting at time 3 (for example, the first preset time ends at time 4), the audio signal between time 3 and time 4 may be used as the third audio signal, and speech recognition may be performed on it separately. If the recognized text result is empty, it may be determined that the third audio signal does not include a voice instruction. For another example, to avoid long processing time or delay in speech recognition, the first audio signal may be determined according to the timestamps of the streaming text result at time 3. If the text result obtained by speech recognition of the audio signal received within the first preset time after the end of the first audio signal, that is, the third audio signal, is empty, it may be determined that the third audio signal does not include a voice instruction. For another example, the processing time and delay of speech recognition may be regarded as essentially constant. When the streaming text result at time 4 is "open the sunroof", the same as at time 3, that is, when the streaming text result is not updated within the first preset time after time 3, the third audio signal may be considered not to contain a voice instruction. For brevity, further examples are not enumerated. It should be understood that time 0 < time 3 < time 4, that is, time 0 is the earliest and time 4 is the latest.
Exemplarily, whether the third audio signal includes a voice instruction may be determined according to the energy of the audio frames of the third audio signal. For example, when the energy of the audio frames of the third audio signal is less than or equal to a preset threshold (for example, the first threshold), it may be determined that the third audio signal does not include a voice instruction.
It should be understood that the above methods of determining whether the third audio signal includes a voice instruction are merely examples for ease of description, and are not limited in the embodiments of this application.
Exemplarily, when the third audio signal includes a voice instruction, it may be determined that a new voice instruction was still received after the first audio signal. A new first audio signal may then be determined, and voice endpoint detection may be performed again.
In the embodiments of this application, the duration of the first timer is determined according to the first text only after it is determined that the third audio signal does not include a voice instruction. This reduces how often voice endpoint detection is performed, and thereby saves the resources that voice endpoint detection occupies.
S430: Start the first timer.
Exemplarily, the first timer may be started after the duration of the first timer is determined, and the start time of the first timer may be no earlier than the end time of the first audio signal.
Exemplarily, during voice interaction, voice endpoint detection may be performed when silence occurs in the received audio signal. In other words, part or all of the acquired audio signal may be used as the first audio signal, the first text corresponding to the first voice instruction in it may be determined, and, after the duration of the first timer is determined, the first timer may be started so as to determine the voice endpoint.
S440: Acquire a second audio signal, where the start time of the second audio signal is later than the end time of the first audio signal.
Exemplarily, the start time of the acquired second audio signal may be no earlier than the end time of the first audio signal, and the end time of the second audio signal may be the same as the end time of the first timer.
Exemplarily, the start time and end time of the second audio signal may be the same as those of the first timer, and the start time of the second audio signal may be equal to the end time of the first audio signal. Exemplarily, FIG. 7 is a schematic diagram of audio signals in a voice interaction provided by an embodiment of this application. For example, during voice interaction, audio signals may be acquired continuously, and the second audio signal is acquired while the first timer is running. The portion of the continuously acquired audio whose start time and end time are the same as those of the first timer may be used as the second audio signal, such as the second audio signal shown in FIG. 7. The second audio signal may or may not include a voice instruction, which is not limited in this application.
Exemplarily, the start time and end time of the second audio signal may be the same as those of the first timer, and the start time of the second audio signal may be later than the end time of the first audio signal. Exemplarily, obtaining the first text and determining the duration of the first timer may take some time. Therefore, when the start time of the second audio signal is the same as the start time of the first timer, there may be an interval between the end time of the first audio signal and the start time of the first timer. Exemplarily, FIG. 8 is a schematic diagram of audio signals in another voice interaction provided by an embodiment of this application; see, for example, the second audio signal shown in FIG. 8. For another example, the interval between the end time of the first audio signal and the start time of the first timer may be equal to the first preset time, and the end time of the third audio signal may be the start time of the second audio signal. This reduces how often voice endpoint detection is performed while avoiding repeated processing of part of the audio signal. For brevity, further examples are not enumerated here.
Exemplarily, the second audio signal is acquired while the first timer is running, and the start time of the second audio signal may be earlier than the start time of the first timer. For example, when the audio signal in the voice interaction is acquired continuously, the first timer may be started later than the end time of the first audio signal. For the second audio signal, the previously determined end time of the first audio signal may be used as its start time, the audio signal in the voice interaction continues to be acquired while the first timer runs, and the end time of the first timer may be used as the end time of the second audio signal. When speech recognition responds slowly, or determining the duration of the first timer takes a long time, this approach avoids voice endpoint detection errors caused by an inappropriate choice of the second audio signal.
Exemplarily, the second audio signal is acquired while the first timer is running. The start time of the second audio signal may be later than the end time of the first audio signal, and the end time of the second audio signal may be earlier than the end time of the first timer. For brevity, details are not repeated here.
It should be understood that the above methods of acquiring the second audio signal are merely examples for ease of description, and are not limited in the embodiments of this application.
S450: When the text corresponding to the voice instruction in the second audio signal is empty, determine the end time of the first timer as the voice endpoint.
Exemplarily, the voice instruction to be executed can be determined by determining the voice endpoint. For example, in a voice interaction, the user issues the voice instruction "dǎkāitiānchuāng" (that is, "open the sunroof"). After determining the voice endpoint based on the first audio signal that includes this voice instruction, the processing module may determine this instruction as the voice instruction to be executed and respond to it. For brevity, details are not repeated here.
Exemplarily, the text corresponding to the voice instruction in the second audio signal may be obtained through speech recognition. It should be understood that, for methods of obtaining text from an audio signal, reference may be made to the related art, which is not limited in the embodiments of this application.
Exemplarily, the voice instruction in an audio signal can be learned by performing speech recognition on the audio signal, and it may not be possible to know the voice instruction accurately before speech recognition is performed. Therefore, the text corresponding to the voice instruction in the second audio signal being empty may mean that the second audio signal does not include a voice instruction, or that no voice instruction is obtained after speech recognition is performed on the second audio signal. In other words, it may mean that no corresponding text result is obtained after speech recognition is performed on the second audio signal. Correspondingly, the text corresponding to the voice instruction in the second audio signal not being empty may mean that a corresponding text result is obtained after speech recognition is performed on the second audio signal.
Exemplarily, speech recognition may be performed on the acquired second audio signal. If no text result is obtained, it may be determined that the text corresponding to the voice instruction in the second audio signal is empty. For brevity, details are not repeated here.
Exemplarily, during voice endpoint detection, when the text corresponding to the voice instruction in the second audio signal is empty, the end time of the first timer may be determined as the voice endpoint, and the voice instruction in the voice interaction may then be responded to. When the text corresponding to the voice instruction in the second audio signal is not empty, a new voice instruction was still received after the first audio signal used in the current voice endpoint detection. Using the end time of the first timer as the voice endpoint in this case could truncate the user's voice instruction prematurely, so a new first audio signal may be determined and voice endpoint detection may be performed again.
Exemplarily, speech recognition may be performed on the second audio signal to determine whether the text corresponding to the voice instruction in the second audio signal is empty. For brevity, details are not repeated here.
Exemplarily, during voice interaction, audio signals may be acquired continuously and streaming text results may be obtained through real-time automatic speech recognition. In this case, the text result corresponding to the second audio signal being empty may mean that the streaming text result is not updated while the first timer is running. For example, in a scenario where a person interacts with a vehicle by voice, the vehicle may include a voice interaction device, and the voice interaction device may include an acquisition module and a processing module. In a voice interaction starting at time 0, audio signals may be acquired continuously after time 0, and real-time streaming text results may be obtained through automatic speech recognition. At time 3, the streaming text result is "open the sunroof", which may be used as the first text. When the response time of speech recognition is short, the audio signal acquired between time 0 and time 3 may be regarded as the first audio signal; alternatively, the first audio signal may be determined according to timestamps. After the duration of the first timer is determined according to the first text, the first timer may be started at time 4. If, when the first timer ends (for example, at time 5), the streaming text result at time 5 has not been updated relative to time 4, the text corresponding to the voice instruction in the second audio signal may be considered empty, and the end time of the first timer may be determined as the voice endpoint, so that the voice instruction in the voice interaction can be responded to. For example, the processing module may send the voice instruction to a vehicle control module such as an electronic control unit (ECU), and the ECU may drive the sunroof motor until the sunroof is open. If the streaming text result at time 4 has been updated relative to time 3, it may be considered that the voice instruction issued by the user before time 3 was incomplete, or that the user issued a new voice instruction after the first audio signal. In this case, the first timer need not be started, which reduces the number of voice endpoint detections. If the streaming text result is updated while the first timer is running, that is, the streaming text result at time 5 differs from that at time 4, it may be determined that the user has issued a new voice instruction and that the text corresponding to the voice instruction in the second audio signal is not empty. The first timer may then be stopped or paused, and the current voice endpoint detection ends.
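The streaming-text check described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the applicant's implementation: the function name, the representation of streaming results as `(timestamp_ms, cumulative_text)` pairs, and the use of millisecond timestamps are all hypothetical.

```python
from typing import List, Optional, Tuple


def detect_voice_endpoint(
    partials: List[Tuple[int, str]],
    timer_start_ms: int,
    timer_ms: int,
) -> Optional[int]:
    """Return the timer end time as the voice endpoint, or None.

    partials holds (timestamp_ms, cumulative_text) streaming ASR results
    sorted by time. None means the text changed while the timer was
    running, so a new round of endpoint detection is needed.
    """
    timer_end_ms = timer_start_ms + timer_ms

    # Latest streaming result available when the timer starts.
    text_at_start = ""
    for ts, text in partials:
        if ts <= timer_start_ms:
            text_at_start = text

    # Any different result inside the timer window counts as new speech.
    for ts, text in partials:
        if timer_start_ms < ts <= timer_end_ms and text != text_at_start:
            return None

    return timer_end_ms
```

Because the check only compares text snapshots, it does not need to slice the audio into first and second audio signals, which is the simplification the surrounding text attributes to the streaming approach.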
In the embodiments of this application, audio signals can be acquired continuously during voice interaction, and the first audio signal and the second audio signal can be identified from the acquired audio as the situation requires. Determining whether the second audio signal includes a voice instruction from whether the streaming text result is updated, and determining the voice endpoint accordingly, therefore removes the need to identify the first audio signal and the second audio signal explicitly, which reduces the complexity of system operation and saves the resources consumed by the method. Moreover, because this approach relies only on the streaming text results, the voice endpoint decision does not depend on the internal algorithm of speech recognition and is applicable to any ASR engine.
It should be understood that the above methods of determining the voice endpoint are merely examples for ease of description, and are not limited in the embodiments of this application.
Optionally, when the energy of the audio frames of the second audio signal is less than or equal to the first threshold, the end time of the first timer may be determined as the voice endpoint.
For example, the energy of the audio frames of the second audio signal may be determined by short-time energy analysis. Further, the condition that the energy of the audio frames of the second audio signal is less than or equal to the first threshold may mean that the energy of a single audio frame is not greater than the first threshold, that the energy of each of multiple audio frames is not greater than the first threshold, that the energy of every audio frame of the second audio signal is not greater than the first threshold, or that a weighted average of the energies of multiple audio frames is not greater than the first threshold; this is not limited in the embodiments of this application. For example, after the second audio signal is divided into frames, the energy of one or more of its audio frames may be obtained by short-time energy analysis. When the energy of an audio frame is greater than the first threshold, the audio frame may be considered to include a voice instruction. Alternatively, the energies of multiple audio frames may be weighted and averaged to obtain a short-time average energy, and when that short-time average energy is greater than the first threshold, the multiple audio frames may be considered to include a voice instruction. Alternatively, when the energy of every audio frame in the second audio signal is less than or equal to the first threshold, the second audio signal may be considered not to include a voice instruction. Alternatively, when the number or proportion of audio frames in the second audio signal whose energy is less than or equal to the first threshold exceeds a certain limit, the second audio signal may be considered not to include a voice instruction. For brevity, further examples are omitted. It should be understood that the above methods for determining the energy of an audio frame are only examples given for ease of description, and the embodiments of this application are not limited thereto.
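The energy-based decisions described above can be sketched as follows. This is a minimal illustration, not the implementation of this application: it assumes non-overlapping frames, defines short-time energy as the sum of squared samples in a frame, and uses hypothetical function and parameter names (`frame_energies`, `ratio_limit`, and so on).

```python
from typing import List, Sequence


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Short-time energy of each complete, non-overlapping frame (assumed framing)."""
    return [
        sum(x * x for x in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def no_speech_all_frames(energies: Sequence[float], first_threshold: float) -> bool:
    """True when every frame energy is less than or equal to the first threshold."""
    return all(e <= first_threshold for e in energies)


def no_speech_by_ratio(energies: Sequence[float], first_threshold: float,
                       ratio_limit: float) -> bool:
    """True when the share of low-energy frames exceeds ratio_limit (hypothetical limit)."""
    if not energies:
        return True
    quiet = sum(1 for e in energies if e <= first_threshold)
    return quiet / len(energies) > ratio_limit


def weighted_average_energy(energies: Sequence[float],
                            weights: Sequence[float]) -> float:
    """Weighted short-time average energy over several frames."""
    return sum(e * w for e, w in zip(energies, weights)) / sum(weights)
```

Which of these rules is used, and the values of the threshold and the limit, are design choices that the description leaves open.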
For example, an audio frame in the second audio signal may be classified based on its energy. An audio frame whose energy is greater than the first threshold may be determined to be a first-type audio frame, indicating that the frame includes a voice instruction, that is, the user clearly issued a voice instruction during the period in which the frame was captured. An audio frame whose energy is less than or equal to a third threshold may be determined to be a second-type audio frame, indicating that the frame clearly does not contain a voice instruction; the third threshold may be less than or equal to the first threshold. As another example, when the third threshold is less than the first threshold, an audio frame whose energy is greater than the third threshold and less than or equal to the first threshold may be determined to be a third-type audio frame, indicating that it cannot be determined whether the frame includes a voice instruction. It should be understood that the above method for classifying the audio frames of an audio signal is only an example given for ease of description, and the embodiments of this application are not limited thereto.
It should be understood that the classification of an audio frame may be represented in any data format, such as a number, a letter, or a character string. For example, "speech (SPE)" may denote a first-type audio frame, "silence (SIL)" may denote a second-type audio frame, and "neutral (NEU)" may denote a third-type audio frame. For brevity, further examples are omitted.
For ease of description, the following embodiments use SPE for the first-type audio frame, SIL for the second-type audio frame, and NEU for the third-type audio frame. That is, every SPE described hereafter may be replaced with "first-type audio frame", every SIL with "second-type audio frame", and every NEU with "third-type audio frame".
For example, FIG. 9 is a schematic diagram of a method for determining the classification of audio frames according to an embodiment of this application. The audio signal may be divided into a part consisting of first-type audio frames (the SPE part), a part consisting of second-type audio frames (the SIL part), and a part consisting of third-type audio frames (the NEU part); these may also be called the first-type, second-type, and third-type parts of the audio signal. As shown in FIG. 9, based on the energy of the audio frames and on different thresholds, such as the first threshold and the third threshold, the part of the audio signal whose energy is higher than the first threshold is determined as the SPE part, the part whose energy is lower than the third threshold is determined as the SIL part, and the part whose energy lies between the first threshold and the third threshold is determined as the NEU part. The first threshold and the third threshold may be fixed values, or may be determined according to an ambient energy value, where the ambient energy value may be the energy of audio frames of ambient noise in the voice interaction environment. It should be understood that the audio signal may be classified in real time while it is being acquired. It should also be understood that other approaches in the related art may be used to classify audio signals by energy, which is not limited in this application.
It should be understood that if the second audio signal includes a first-type audio frame, the second audio signal may be considered to include a voice instruction. The current speech endpoint detection may then be ended, and speech endpoint detection may be performed again after a new first audio signal is acquired.
For example, when the second audio signal does not include any first-type audio frame, the end moment of the first timer may be determined as the speech endpoint. For brevity, details are not repeated here.
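The three-way classification and the resulting check on the second audio signal can be sketched as below. The labels SPE, SIL, and NEU come from the description; the function names and the choice to pass precomputed frame energies are assumptions made for illustration only.

```python
from typing import Iterable


def classify_frame(energy: float, first_threshold: float,
                   third_threshold: float) -> str:
    """Label one frame as SPE, SIL, or NEU from its energy.

    Assumes third_threshold <= first_threshold, as stated in the description.
    """
    if third_threshold > first_threshold:
        raise ValueError("third_threshold must not exceed first_threshold")
    if energy > first_threshold:
        return "SPE"   # first-type frame: clearly contains a voice instruction
    if energy <= third_threshold:
        return "SIL"   # second-type frame: clearly contains no voice instruction
    return "NEU"       # third-type frame: undetermined


def contains_speech(energies: Iterable[float], first_threshold: float,
                    third_threshold: float) -> bool:
    """True if any frame of the signal is classified as SPE."""
    return any(
        classify_frame(e, first_threshold, third_threshold) == "SPE"
        for e in energies
    )
```

If `contains_speech` returns False for the second audio signal, the end moment of the first timer would be taken as the speech endpoint; otherwise the current detection would be ended and retried later.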
For example, when the text corresponding to the voice instruction in the second audio signal is empty and the energy of the audio frames in the second audio signal is less than or equal to the first threshold, the end moment of the first timer may be determined as the speech endpoint. For brevity, details are not repeated here.
It should be understood that the above method for determining the speech endpoint according to the energy of the audio frames of the second audio signal is only an example, and the embodiments of this application are not limited thereto.
In the embodiments of this application, the speech endpoint can be set flexibly according to the text information of the audio signal, which mitigates the effects of background noise and of the user's speaking habits and thereby improves user experience. In addition, determining the speech endpoint from the text information corresponding to the audio signal in combination with the energy of the audio frames can improve the accuracy of the detected speech endpoint.
Optionally, before the duration of the first timer is determined according to the first text, a second text may be acquired, where the second text is text displayed on a display screen. For example, when the method is applied to a vehicle, the second text may be text displayed on an in-vehicle display, such as the central control screen or a headrest display mounted on a seat. As another example, when the method is applied to a terminal device that includes a display screen, such as a mobile phone or a tablet computer, the second text may be text displayed on the screen of the terminal device or on a display screen associated with the terminal device. As yet another example, when the method is applied to a chip, the chip may acquire the second text displayed on its associated display screen. For brevity, further examples are omitted; it should be understood that the embodiments of this application are not limited thereto.
For example, when the first text matches the second text, the current speech endpoint detection may be ended and the operation corresponding to the second text may be performed. Suppose the display screen shows the music being played and the displayed text includes "next song". After the user enters voice interaction by speaking the wake-up word and says "play the next song", the first text can be obtained from the acquired audio signal. When the first text includes "next song", the first text matches the second text, and the operation corresponding to the on-screen text "next song", namely playing the next song, may be performed directly. This may also be referred to as a "visible-and-speakable" mode. For brevity, further examples are omitted. In the embodiments of this application, directly performing the operation corresponding to the second text allows a faster response to the user's voice instruction and improves user experience.
It should be understood that the first text matching the second text may mean that the first text is the same as or similar to the second text, that the first text includes the second text, or that the first text and the second text include the same keyword; this is not limited in the embodiments of this application.
For example, when the first text does not match the second text, the duration of the first timer may be determined according to the first text. For brevity, details are not repeated here.
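The matching criteria listed above (same or similar text, inclusion, shared keyword) might be combined as in the following sketch. The similarity measure (`difflib.SequenceMatcher` from the Python standard library), the 0.8 similarity threshold, and the `keywords` argument are assumptions chosen for illustration; the description does not prescribe a specific matching algorithm.

```python
from difflib import SequenceMatcher
from typing import Iterable


def texts_match(first_text: str, second_text: str,
                keywords: Iterable[str] = (),
                min_similarity: float = 0.8) -> bool:
    """Return True if the recognized text matches the on-screen text."""
    a, b = first_text.strip(), second_text.strip()
    if not a or not b:
        return False
    if a == b:
        return True                      # identical
    if b in a:
        return True                      # first text includes second text
    if SequenceMatcher(None, a, b).ratio() >= min_similarity:
        return True                      # similar (assumed measure and threshold)
    # shared keyword, e.g. a hot word taken from the interface
    return any(k and k in a and k in b for k in keywords)
```

For instance, `texts_match("play the next song", "next song")` succeeds through inclusion, after which the operation bound to the on-screen text could be performed without waiting for the first timer.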
For example, in a single voice interaction the user may pause several times while giving a voice instruction, so several attempts may be needed to determine the speech endpoint. When a speech endpoint detection fails, that is, when it is determined that the user has not yet given a complete voice instruction, speech endpoint detection may be performed again on the continuously acquired audio signal until a speech endpoint is successfully detected, after which the voice instruction is responded to. It should be understood that each speech endpoint detection determines the first audio signal and the first text used in that detection. Across multiple detections, the first audio signals and first texts may be related; for example, the first audio signal used in the current detection may include the first audio signal used in the previous detection. They may also be unrelated; for example, the first audio signal used in the current detection may not include the first audio signal used in the previous detection. This is not limited in the embodiments of this application.
For ease of understanding and description, the embodiments of this application distinguish the audio signals and texts used in multiple speech endpoint detections. For example, the text used in the current speech endpoint detection is defined as the first text and its corresponding audio signal as the first audio signal; the text used in one or more earlier speech endpoint detections is defined as the third text and its corresponding audio signal as the fourth audio signal.
For example, before the first audio signal is acquired, a fourth audio signal may be acquired, where the fourth audio signal may include a third voice instruction. The duration of a second timer may be determined according to a third text corresponding to the third voice instruction, and after that duration is determined, the second timer may be started and a fifth audio signal may be acquired. When the text corresponding to the voice instruction in the fifth audio signal is non-empty, the first audio signal may be determined according to the fourth audio signal and the fifth audio signal, and the first audio signal may include the fourth audio signal and the fifth audio signal. Here, the fourth audio signal can be understood as the "first audio signal" used in an earlier speech endpoint detection, for example the previous one; the third voice instruction can be understood as the "first voice instruction" contained in that earlier "first audio signal"; the third text can be understood as the "first text" used in that earlier detection; the second timer can be understood as the "first timer" used in that earlier detection, where the start moment of the first timer is not earlier than the end moment of the second timer; and the fifth audio signal can be understood as the "second audio signal" acquired in that earlier detection. In other words, after an earlier attempt to determine the speech endpoint has failed, the speech endpoint may be determined again according to the first audio signal determined in the current detection.
For example, the first audio signal may be determined according to the fourth audio signal and the fifth audio signal. In a voice interaction in which the audio signal is acquired continuously, the first audio signal used in the current speech endpoint detection may be determined according to the "first audio signal" and "second audio signal" used in the previous detection, that is, the old first audio signal and the old second audio signal, which are the fourth audio signal and the fifth audio signal.
For example, determining the first audio signal according to the fourth audio signal and the fifth audio signal is described with reference to FIG. 10, which is a schematic diagram of audio signals in another voice interaction according to an embodiment of this application. The first audio signal may include only the fourth audio signal and the fifth audio signal, as shown in (a) of FIG. 10. As another example, because the audio signal of the voice interaction can be acquired continuously, the first audio signal may include the fourth audio signal, the fifth audio signal, and the audio signal between them, as shown in (b) of FIG. 10. As another example, the first audio signal may further include an audio signal after the fourth audio signal and the fifth audio signal, as shown in (c) and (d) of FIG. 10. As yet another example, if the start point of the "first audio signal" used in the previous speech endpoint detection (that is, the fourth audio signal) is later than the start point of the voice interaction, the start point of the first audio signal in the current detection may be earlier than the start point of the fourth audio signal; for instance, the start point of the voice interaction may be used as the start moment of the first audio signal. For brevity, further examples are omitted. It should be understood that the above methods for obtaining the first audio signal are only examples given for ease of description, and the embodiments of this application are not limited thereto.
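One way to picture the options above is to treat each audio signal as a time interval and derive the span of the new first audio signal from the old ones. The sketch below is only an illustration: the `Segment` type, the function name, and the optional `interaction_start` and `latest_end` parameters are hypothetical, and actual audio buffering is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Segment:
    """Time span of an audio signal, in seconds from an arbitrary origin."""
    start: float
    end: float


def first_signal_span(fourth: Segment, fifth: Segment,
                      interaction_start: Optional[float] = None,
                      latest_end: Optional[float] = None) -> Segment:
    """Span of the new first audio signal built from the fourth and fifth signals.

    By default the span covers the fourth signal, the fifth signal, and any
    audio between them. If interaction_start is given, the span may begin
    earlier than the fourth signal; if latest_end is given, it may extend to
    audio acquired after the fifth signal.
    """
    start = fourth.start if interaction_start is None else min(interaction_start, fourth.start)
    end = fifth.end if latest_end is None else max(latest_end, fifth.end)
    return Segment(start, end)
```

For example, `first_signal_span(Segment(1.0, 2.0), Segment(2.3, 3.0))` yields a span from 1.0 s to 3.0 s, which includes the 0.3 s of audio between the two signals.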
S460. After the speech endpoint is determined, respond to the first voice instruction.
For example, responding to the first voice instruction may mean responding only to the first voice instruction, or may mean that the voice instruction responded to includes the first voice instruction; in other words, the first voice instruction may be part of the voice instruction responded to. Because speech endpoint detection may be performed several times during a voice interaction, after the speech endpoint is determined, the response may cover the voice instruction acquired since the start moment of the voice interaction. Since the start moment of the first audio signal may be later than the start moment of the voice interaction, the first voice instruction in the first audio signal may be only part of the voice instruction responded to. It should be understood that this is not limited in the embodiments of this application.
For example, responding to the first voice instruction may be performing the operation indicated by the first voice instruction. Taking voice interaction between a person and a vehicle as an example, when the first voice instruction is "open the sunroof", the processing module may instruct the vehicle controller to perform the operation, and the sunroof motor may accordingly be started until the sunroof is open. As another example, when the first voice instruction is "search for location A", the vehicle may display a map on its central control screen with location A highlighted, may display multiple navigation routes from the current location to location A, and may play "Please select a navigation route" through the speaker so that the user can give a new voice instruction. As yet another example, when the first voice instruction is "next song", the vehicle may switch the music being played and keep the voice interaction silent; if the user gives no new voice instruction within a period of time (for example, 10 s), the voice interaction may be ended, and if the user gives a new voice instruction within that period, the new voice instruction can be acquired in time. It should be understood that the above ways of responding to the first voice instruction are only examples given for ease of description, and the embodiments of this application are not limited thereto.
The embodiments of this application provide a voice interaction method in which the duration of the first timer is determined from text. Because the text indicates whether the user intends to continue speaking, the speech endpoint can be determined flexibly. This avoids excessive system delay caused by noise and also avoids premature truncation of the voice interaction caused by pauses in the user's speech, so that the voice instruction in the voice interaction can be obtained accurately while the system delay is shortened.
For example, FIG. 11 is another schematic flowchart of the voice interaction method according to an embodiment of this application. The method 500 may include some or all of steps S510 to S580.
S510. Start speech recognition.
For example, after a voice interaction starts, the audio signal of the voice interaction may be acquired continuously until the voice interaction ends, and speech recognition may be performed on the audio signal during the voice interaction. The voice interaction may start after the user speaks the wake-up word, and a speech recognition module may be invoked once the voice interaction starts, so that speech recognition can be performed on the audio signal as it is acquired and the recognition result can be obtained. For brevity, details are not repeated here; it should be understood that the embodiments of this application are not limited thereto.
S520. Perform speech recognition on the audio signal to obtain a streaming text result.
For example, a streaming text result can be obtained by performing speech recognition on the continuously acquired audio signal. Optionally, the streaming text result may be used to determine the first text, and the continuously acquired audio signal may be used to determine the first audio signal. For instance, the streaming text result at a given moment may be determined as the first text, and the audio signal acquired before that moment may be used as the first audio signal. For brevity, details are not repeated here.
S530. Set a third timer according to a first preset time.
S535. If the streaming text result is not updated before the third timer expires, go to S540; if the streaming text result is updated, the third timer may be reset and the flow goes to S520.
For example, if, when the third timer expires, the streaming text result has not been updated since the third timer was started, the third audio signal received within the first preset time may be considered not to include a voice instruction. The streaming text result at that moment may then be determined as the first text, and the first audio signal may be determined accordingly from the continuously acquired audio signal. For a description of the third audio signal, refer to step S420; for brevity, details are not repeated here.
It should be understood that setting the third timer reduces how often speech endpoint detection is performed and thereby saves the resources used in determining the speech endpoint.
S540. Determine the duration of the first timer according to the first text, based on a prediction model.
For example, the first text may be input into the prediction model to obtain first information of the first text, where the first information may be used to characterize the semantic completeness of the first text. The duration of the first timer may then be determined according to the first information.
For example, the prediction model may be a prediction model trained according to the method 200. For brevity, details are not repeated here.
For example, for a description of the first text and the first timer, refer to steps S410 to S420; for brevity, details are not repeated here.
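As a rough illustration of step S540, the sketch below maps a semantic-completeness score to a timer duration. It assumes the first information is a score in [0, 1] and that a more complete text should lead to a shorter wait; the linear mapping, the function name, and the default bounds of 0.2 s and 1.5 s are all hypothetical and are not taken from this application.

```python
def first_timer_duration(completeness: float,
                         min_wait: float = 0.2,
                         max_wait: float = 1.5) -> float:
    """Map a completeness score in [0, 1] to a wait time in seconds.

    A score of 1 (text looks complete) gives min_wait; a score of 0
    (user is likely to continue speaking) gives max_wait.
    """
    c = min(1.0, max(0.0, completeness))  # clamp out-of-range scores
    return max_wait - c * (max_wait - min_wait)
```

In practice the score would come from the prediction model, and the mapping could equally be a lookup table or a step function.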
S550. Start the first timer.
For example, the first timer may be started after its duration is determined. For brevity, details are not repeated here.
S560. If the streaming text result is updated before the first timer expires, go to S520; if the streaming text result is not updated, go to S570.
For example, if the streaming text result is not updated before the first timer expires, the text corresponding to the voice instruction in the acquired second audio signal may be considered empty, and the end moment of the first timer may be used as the speech endpoint. If the streaming text result is updated before the first timer expires, the text corresponding to the voice instruction in the second audio signal may be considered non-empty; the current speech endpoint detection may then be ended, and speech endpoint detection may subsequently be performed again according to the updated streaming text result.
It should be understood that if the streaming text result is updated before the first timer expires, the first timer may be paused, stopped, or reset, which is not limited in the embodiments of this application.
For example, for a description of acquiring the second audio signal and of whether the text corresponding to the voice instruction in the second audio signal is empty, refer to steps S430 to S440; for brevity, details are not repeated here.
S570. Determine whether to stop speech recognition according to the classification of the audio signal. If the current audio signal includes a first-type audio frame, go to S520; otherwise, go to S580.
For example, while the first timer is running, the audio signal continuously received since the end moment of the first audio signal may be acquired. If, when the first timer expires, that audio signal includes an audio frame classified as SPE, the audio signal following the first audio signal may be considered to include a voice instruction, and the flow may go to S520 so that the text corresponding to that voice instruction can be obtained. If that audio signal does not include any audio frame classified as SPE, the speech endpoint may be determined according to the end moment of the first timer.
For example, for a description of the classification of the audio signal, refer to step S450; for brevity, details are not repeated here.
In the embodiments of this application, when the speech recognition latency is large, checking the classification of the audio frames avoids misjudging the speech endpoint because of that latency and improves detection accuracy. Moreover, combining the text with the classification of the audio frames can improve the accuracy of the determined speech endpoint.
S580. Respond to the voice instruction.
For example, after the speech endpoint is determined, the first text may be sent to a semantic understanding module, which analyzes and executes the instruction given by the user in the voice interaction. This is not limited in the embodiments of this application.
In the embodiments of this application, the duration of the first timer can be set flexibly according to the semantic completeness of the text corresponding to the voice instruction in the audio signal, and the end point of the voice interaction can be determined on that basis. This balances the delay caused by determining the speech end point too late against the user's pauses during the voice interaction, giving the user a better experience. Because the classification of the current audio is also taken into account when the speech end point is determined, the influence of background noise on that decision is mitigated and the accuracy of speech endpoint detection is improved. Furthermore, the embodiments of this application use only the streaming text result output by the ASR engine for endpoint detection and send the stop-recognition instruction from outside the engine, without relying on the internal algorithm of the ASR engine. The method can therefore be applied to any ASR engine and adapts well to different ASR engines.
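The control flow of steps S520 to S580 can be simulated offline as in the sketch below. It is an illustration under stated assumptions rather than the implementation of method 500: observations are given as a pre-recorded list of (time in ms, streaming text so far, frame label) tuples, timers are modelled as deadlines, and `duration_for_text` is a hypothetical stand-in for the prediction model of S540.

```python
from typing import Callable, Iterable, Optional, Tuple

Observation = Tuple[int, str, str]  # (time in ms, streaming text so far, frame label)


def detect_endpoint(observations: Iterable[Observation],
                    preset_ms: int,
                    duration_for_text: Callable[[str], int]
                    ) -> Optional[Tuple[int, str]]:
    """Return (endpoint time in ms, first text), or None if no endpoint is found."""
    text = ""
    third_deadline: Optional[int] = None  # expiry of the third timer (S530)
    first_deadline: Optional[int] = None  # expiry of the first timer (S550)
    speech_seen = False                   # SPE frame seen while the first timer runs
    for t, new_text, label in observations:
        if new_text != text:
            # S535 / S560: text updated, so drop the first timer and reset the third
            text = new_text
            third_deadline = t + preset_ms
            first_deadline = None
            continue
        if first_deadline is None:
            if text and third_deadline is not None and t >= third_deadline:
                # S540 / S550: text is stable, start the first timer
                first_deadline = third_deadline + duration_for_text(text)
                speech_seen = False
            else:
                continue
        if label == "SPE":
            speech_seen = True
        if t >= first_deadline:
            if speech_seen:
                # S570: speech energy without new text yet, go back to S520
                first_deadline = None
                third_deadline = t + preset_ms
            else:
                # S580: the end of the first timer is the speech endpoint
                return first_deadline, text
    return None
```

With a 200 ms preset time and a fixed 300 ms first timer, a user who says "open the", pauses, and then adds "sunroof" before the first timer expires is not cut off; the endpoint is reported only after the completed text has stayed unchanged and no SPE frame has appeared while the first timer runs.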
For example, FIG. 12 is another schematic flowchart of the voice interaction method according to an embodiment of this application. The method 600 may include some or all of steps S610 to S660.
S610. Start speech recognition.
It should be understood that step S610 may correspond to step S510; for brevity, details are not repeated here.
S615. Acquire interface hot words.
示例性地,以用户与车辆进行语音交互为例,车辆包括显示屏,可以获取该车辆的显示屏中文本,界面热词可以是该显示屏所显示的控件对应的文本,应理解,该界面热词可以作为第二文本。Exemplarily, taking the voice interaction between the user and the vehicle as an example, the vehicle includes a display screen, and the text in the display screen of the vehicle can be obtained, and the interface hot word can be the text corresponding to the control displayed on the display screen. It should be understood that the interface Hot words can be used as the second text.
S620. Perform speech recognition on the audio signal to obtain a streaming text result.
It should be understood that step S620 may correspond to step S520; for brevity, details are not repeated here.
Optionally, S630. If the streaming text result is not empty, the method may proceed to S634; if the streaming text result is empty, the method may return to S620.
For example, the interface hot words may be obtained at the same time as the streaming text result, before it, or after it. That is, step S615 and some or all of steps S620 to S630 may be performed simultaneously; step S615 may be performed first; or some or all of steps S620 to S630 may be performed first. This application is not limited in this respect.
S634. Match the obtained interface hot words against the obtained streaming text result. If an interface hot word matches the streaming text result, the method may proceed to S636; if no interface hot word matches the streaming text result, the method proceeds to S635.
To briefly illustrate how interface hot words are matched against the streaming text result, FIG. 13 is a schematic diagram of an example user interface of a display screen provided by an embodiment of this application. The display screen may be used in a vehicle, and its user interface may display different kinds of information such as maps, music, radio, and driving settings. It should be understood that this user interface is only an example and does not limit the embodiments of this application; for example, it may also include other information such as lighting and vehicle driving parameters. When the user taps a control on the user interface, the vehicle can perform the operation corresponding to that control. For example, as shown in FIG. 13, after the user taps "Song 1" under the "Music" control, the vehicle can start music playback and play "Song 1". For brevity, further examples are not listed here.
For example, after the user turns on the speech recognition function, the streaming text result and the interface hot words can be obtained and matched against each other. For instance, after the user starts speech recognition by saying the wake-up word, the interface hot words shown in FIG. 13, such as "Map", "Frequent place 1", "Music", and "Song 1", can be obtained. When the streaming text result obtained from the audio signal (for example, "play Song 1") includes "Song 1", the streaming text result matches the obtained interface hot words, and the method may proceed to S636. Alternatively, when the streaming text result obtained from the audio signal (for example, "open the car window") does not include any of the interface hot words shown in FIG. 13, the streaming text result can be used as the first text for determining the duration of the first timer; that is, when the first text does not match the second text, the method may proceed to S640.
It should be understood that the above way of obtaining interface hot words is only an example for ease of description and does not limit the embodiments of this application. For a description of matching the first text with the second text, refer to step S450; details are not repeated here.
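The matching in S634 can be illustrated with a minimal Python sketch. It assumes the simplest of the matching criteria mentioned in this application, namely that the streaming text result contains a hot word verbatim; the function name `match_hotword` and the English hot-word strings are hypothetical and chosen only for illustration.

```python
def match_hotword(stream_text, hotwords):
    """Return the longest interface hot word contained in stream_text, or None.

    Assumption: "match" means plain substring containment. Longer hot words
    are tried first so that a longer label wins over a shorter one that
    happens to be a prefix of it.
    """
    for word in sorted(hotwords, key=len, reverse=True):
        if word and word in stream_text:
            return word
    return None


# Hypothetical hot words, loosely following the FIG. 13 example.
HOTWORDS = ["Map", "Frequent place 1", "Music", "Song 1"]

matched = match_hotword("play Song 1", HOTWORDS)        # a hot word is found
unmatched = match_hotword("open the car window", HOTWORDS)  # no hot word found
```

In this sketch a non-`None` result corresponds to the branch toward S636, and `None` corresponds to the branch in which the streaming text result is used as the first text.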
Optionally, S636. Determine the classification of the audio signal. If the current audio signal is classified as a first-type audio signal, the method returns to S620; otherwise, the method may proceed to S638.
For example, because audio signals can be acquired continuously during voice interaction, the audio signal received while the interface hot words are obtained, the streaming text result is obtained, and the two are matched (referred to as the updated audio signal) may contain a voice instruction. If, after the matching between the interface hot words and the streaming text result is completed, the updated audio signal includes audio frames classified as SPE, that is, it includes a voice instruction from the user, the method may return to S620 to obtain the streaming text result corresponding to the updated audio signal; otherwise, the method may proceed to S638.
For example, when it is determined that an interface hot word matches the streaming text result, the classification of the audio frame at the current moment in the continuously acquired audio signal may be determined according to the energy of that frame. If the frame is classified as SPE, the method may return to S620; otherwise, the method may proceed to S638.
It should be understood that determining the classification of the audio signal avoids ignoring a new instruction issued by the user while the interface hot words are being matched against the streaming text result, and thus avoids performing an operation that clearly deviates from the user's actual intention.
For example, for the method of classifying audio signals, refer to step S450. The embodiments of this application are not limited in this respect.
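As a rough illustration of energy-based frame classification, the following Python sketch labels a frame as SPE when its energy exceeds a threshold. The mean-square energy definition, the label `NON_SPE`, and the threshold value are assumptions made for this sketch; the actual classification method is the one described for step S450.

```python
def frame_energy(samples):
    """Mean-square energy of one audio frame (assumed definition)."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def classify_frame(samples, threshold):
    """Label a frame "SPE" if its energy exceeds threshold, else "NON_SPE".

    "NON_SPE" is a placeholder label used only in this sketch.
    """
    return "SPE" if frame_energy(samples) > threshold else "NON_SPE"


def contains_speech(frames, threshold):
    """True if any frame of the updated audio signal is classified as SPE."""
    return any(classify_frame(f, threshold) == "SPE" for f in frames)
```

With such a helper, S636 reduces to checking `contains_speech` on the frames received during matching: a true result leads back to S620, a false result leads on to S638.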
S638. The visible-and-speakable module may perform the operation indicated by the interface hot word.
For example, a first message may be sent to the visible-and-speakable module, where the first message indicates that the operation indicated by the successfully matched interface hot word is to be performed. Accordingly, the visible-and-speakable module may perform that operation itself or may instruct an execution device to perform it. The embodiments of this application are not limited in this respect.
In the embodiments of this application, performing the operation indicated by the interface hot word implements a "what you see is what you can say" function: the user can interact with the in-vehicle terminal through voice alone, without touching the terminal, which can improve the user experience. In addition, because the interface hot words can be matched against the voice instruction before voice endpoint detection, that is, during the voice interaction rather than after it ends, the response time of the visible-and-speakable mode can be shortened significantly, improving the user experience.
It should be understood that when an obtained interface hot word matches the streaming text result, the operation indicated by the interface hot word may be performed directly; that is, after step S634 completes the matching of the interface hot words against the streaming text result, the method may also proceed directly to S638.
Optionally, S635. If the interface hot words do not match the streaming text result, a third timer may be set according to the first preset time.
For example, for a description of step S635, refer to step S530; for brevity, details are not repeated here.
S637. If the streaming text result is not updated before the third timer expires, the method may proceed to S640; if the streaming text result is updated, the method may return to S620.
For example, for a description of step S637, refer to step S535; for brevity, details are not repeated here.
S640. Based on the prediction model, determine the duration of the first timer according to the first text.
For example, for a description of step S640, refer to step S540; for brevity, details are not repeated here.
S645. Start the first timer.
S650. If the streaming text result is updated before the first timer expires, the method may return to S620; if the streaming text result is not updated, the method proceeds to S655.
For example, for a description of step S650, refer to step S560; for brevity, details are not repeated here.
S655. Determine whether to stop speech recognition according to the classification of the audio signal. If the current audio signal includes a first-type audio frame, the method returns to S620; otherwise, the method proceeds to S660.
For example, for a description of step S655, refer to step S570; for brevity, details are not repeated here.
S660. Respond to the voice instruction.
For example, for a description of step S660, refer to step S580; for brevity, details are not repeated here.
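The branching among the decision steps of method 600 can be summarized as a small transition function. The Python sketch below only restates the jumps described for S630, S634, S636, S637, S650, and S655; the function name `next_step` and its keyword flags are hypothetical, and optional shortcuts (such as proceeding directly from S634 to S638) are not modeled.

```python
def next_step(step, *, text_nonempty=False, hotword_match=False,
              speech_frame=False, text_updated=False):
    """Return the label of the step that follows a decision step of method 600.

    Each flag is the outcome of the check made at that step:
      text_nonempty: the streaming text result is not empty (S630)
      hotword_match: an interface hot word matches the text (S634)
      speech_frame:  a first-type (SPE) audio frame is present (S636, S655)
      text_updated:  the text changed before the timer expired (S637, S650)
    """
    if step == "S630":
        return "S634" if text_nonempty else "S620"
    if step == "S634":
        return "S636" if hotword_match else "S635"
    if step == "S636":
        return "S620" if speech_frame else "S638"
    if step == "S637":
        return "S620" if text_updated else "S640"
    if step == "S650":
        return "S620" if text_updated else "S655"
    if step == "S655":
        return "S620" if speech_frame else "S660"
    raise ValueError(f"{step} is not a decision step in this sketch")
```

For instance, a quiet pause after a non-matching instruction follows S634, S635, S637, S640, S645, S650, S655, and finally S660, while any update to the text or any detected speech frame along the way returns the flow to S620.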
It should be understood that method 400 above may be combined with method 500 and method 600; the embodiments of this application are not limited in this respect.
The embodiments of this application further provide an apparatus for implementing any of the above methods. For example, an apparatus is provided that includes units for implementing the steps performed by the user equipment, vehicle, voice interaction apparatus, and so on in any of the above methods. For example, refer to FIG. 14, which is a schematic structural diagram of a voice interaction apparatus provided by an embodiment of this application. The apparatus 700 may include an acquisition module 710 and a processing module 720.
The acquisition module 710 may be configured to acquire a first audio signal, which may include a first voice instruction, and may further be configured to acquire a second audio signal whose start moment is equal to or later than the end moment of the first audio signal. The processing module 720 may be configured to: determine the duration of a first timer according to the first text corresponding to the first voice instruction; start the first timer; when the text corresponding to the voice instruction in the second audio signal is empty, determine the end moment of the first timer as the voice endpoint; and, after the voice endpoint is determined, respond to the first voice instruction.
For example, for a description of responding to the first voice instruction, refer to step S460; for brevity, details are not repeated here.
For example, when the processing module 720 determines that the text corresponding to the voice instruction in the second audio signal is not empty, it can be determined that a new voice instruction was still received after the end moment of the first audio signal, so this round of voice endpoint detection has failed and the voice endpoint cannot be determined from the first timer. The acquisition module 710 may then acquire a new first audio signal, and the processing module 720 may perform voice endpoint detection again according to the new first audio signal, until the voice endpoint is determined.
Optionally, the processing module 720 may be configured to: when the energy of the audio frames of the second audio signal is less than or equal to a first threshold, determine the end moment of the first timer as the voice endpoint.
Further, the processing module 720 is specifically configured to: when the text corresponding to the voice instruction in the second audio signal is empty and the energy of the audio frames of the second audio signal is less than or equal to the first threshold, determine the end moment of the first timer as the voice endpoint.
For example, the end moment of the first timer may be determined as the voice endpoint when the second audio signal does not include a first-type audio frame.
For example, for a description of the second audio signal, refer to step S450; for brevity, details are not repeated here.
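The combined condition used by the processing module 720 can be written as a single predicate. The Python sketch below assumes that "the energy of the audio frames" refers to every frame of the second audio signal and that the per-frame energies have already been computed; the function and parameter names are hypothetical.

```python
def is_voice_endpoint(second_text, frame_energies, first_threshold):
    """Decide whether the end moment of the first timer is the voice endpoint.

    True only if the text recognized from the second audio signal is empty
    AND every frame energy is at or below first_threshold (assumed reading).
    """
    text_is_empty = second_text.strip() == ""
    all_quiet = all(e <= first_threshold for e in frame_energies)
    return text_is_empty and all_quiet
```

When the predicate is false, the current round of endpoint detection has failed and a new first audio signal would be acquired for the next round.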
Optionally, the acquisition module 710 is further configured to acquire a second text, where the second text may be displayed on a display screen; the processing module 720 is specifically configured to: when the first text corresponding to the first voice instruction does not match the second text, determine the duration of the first timer according to the first text corresponding to the first voice instruction.
Optionally, the processing module 720 may further be configured to perform the operation indicated by the second text when the first text corresponding to the first voice instruction matches the second text.
For example, the operation indicated by the second text may be sent to a control device or an execution device so that the device can perform it.
For example, for a description of the second text, refer to step S450; for brevity, details are not repeated here. The first text matching the second text may mean that the first text is identical or similar to the second text, that the first text includes the second text, or that the first text and the second text include the same keyword. The embodiments of this application are not limited in this respect.
Optionally, the acquisition module 710 is further configured to acquire a third audio signal, where the third audio signal includes the audio signal received within the first preset time, and the start moment of the first preset time is equal to or later than the end moment of the first audio signal; the processing module 720 is specifically configured to: when the third audio signal does not include a voice instruction, determine the duration of the first timer according to the first text corresponding to the first voice instruction.
For example, when the third audio signal includes a voice instruction, it can be determined that a new voice instruction was still received after the first audio signal. The current round of voice endpoint detection can therefore be ended, and a new first audio signal can be determined in order to perform voice endpoint detection again.
For example, for a description of the third audio signal, refer to step S420; for brevity, details are not repeated here.
For example, the user may pause several times during a voice interaction, so several attempts may be needed to determine the voice endpoint. When a round of voice endpoint detection fails, detection can be performed again on the continuously acquired audio signal until the voice endpoint is successfully detected, after which the first voice instruction is responded to. The audio signals used in these multiple rounds of voice endpoint detection may or may not be related to one another.
To distinguish the audio signals and texts used in multiple rounds of voice endpoint detection, the text used in the current round is defined as the first text and its corresponding audio signal as the first audio signal; the text used in one or more earlier rounds is defined as the third text and its corresponding audio signal as the fourth audio signal.
Optionally, the acquisition module 710 is further configured to: before acquiring the first audio signal, acquire a fourth audio signal that includes a third voice instruction; and acquire a fifth audio signal while a second timer is running. The processing module 720 may further be configured to: determine the duration of the second timer according to the third text corresponding to the third voice instruction; start the second timer, where the end moment of the second timer is earlier than or equal to the start moment of the first timer; and, when the text corresponding to the voice instruction in the fifth audio signal is not empty, determine the first audio signal according to the fourth audio signal and the fifth audio signal, where the first audio signal includes the fourth audio signal and the fifth audio signal.
Optionally, the start moment of the first audio signal is earlier than or equal to the start moment of the fourth audio signal, and the end moment of the first audio signal is equal to or later than the end moment of the fifth audio signal.
For example, for descriptions of the third voice instruction, the second timer, the third text, the fifth audio signal, and so on, refer to step S450; for brevity, details are not repeated here.
Optionally, the processing module 720 is specifically configured to: input the first text corresponding to the first voice instruction into the prediction model to obtain the semantic completeness of the first text; and determine the duration of the first timer according to the semantic completeness of the first text.
For example, the prediction model may be a prediction model obtained through training by method 200. For a description of the method for training the prediction model, refer to steps S210 to S220; for brevity, details are not repeated here.
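This passage does not state how a semantic completeness value is converted into a timer duration, so the following Python sketch shows only one plausible mapping. It assumes that completeness is a score in the range 0 to 1 and that a more complete text should wait for a shorter time; the linear form and the bounds of 0.5 s and 1.5 s are illustrative assumptions, not values from this application.

```python
def timer_duration(completeness, min_s=0.5, max_s=1.5):
    """Map a semantic completeness score to a first-timer duration in seconds.

    Assumptions for this sketch: completeness lies in [0, 1]; a complete
    text (1.0) gets the shortest wait and an incomplete one (0.0) the longest.
    Out-of-range scores are clamped.
    """
    c = min(max(completeness, 0.0), 1.0)
    return max_s - c * (max_s - min_s)
```

Under these assumptions, a text that the prediction model scores as clearly unfinished keeps the recognizer listening for longer, while a text scored as complete lets the voice endpoint be declared sooner.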
For example, the apparatus may be applied to a terminal device that can perform voice interaction with the user. The terminal device may specifically include one or more of a computer, a smartphone, a tablet computer, a personal digital assistant, a wearable device, a smart speaker, a television, a drone, a vehicle, an in-vehicle chip, an in-vehicle device (for example, a head unit or an in-vehicle computer), a robot, or a similar device. For example, the terminal device may be a mobile phone or a vehicle, or another electronic device; for brevity, these are not listed one by one. It should be understood that the above terminal devices are only examples for ease of description and do not limit the embodiments of this application.
It should be understood that the voice interaction apparatus shown in FIG. 14 can be used to implement the voice interaction method 400 described above, and can also be used to implement the voice interaction methods described in method 500 and method 600. For the specific steps, refer to the above descriptions of FIG. 6 to FIG. 13; for brevity, details are not repeated here.
For example, the embodiments of this application further provide an apparatus for implementing method 200. For example, an apparatus is provided that includes units for implementing the steps performed by the user equipment, the voice detection platform, and so on in any of the above methods. For example, refer to FIG. 15, which is a schematic structural diagram of an apparatus for training a prediction model for voice interaction provided by an embodiment of this application. As shown in FIG. 15, the apparatus 800 may include an acquisition module 810 and a training module 820.
The acquisition module 810 may be configured to acquire a text data set, where the text data set includes a plurality of fourth texts, each fourth text is labeled with first information, and the first information may be used to represent the semantic completeness of the text. The training module 820 may be configured to perform model training according to the text data set to obtain a prediction model, where the prediction model is used to predict the semantic completeness of a voice instruction.
For example, for a description of the text data set and the first information, refer to step S210; for brevity, details are not repeated here.
Optionally, the acquisition module 810 may further be configured to acquire a text corpus, which may include a plurality of texts with complete semantics. The apparatus 800 may further include a processing module 830 (not shown in FIG. 15), which may be configured to determine the text data set according to the text corpus.
Optionally, the processing module 830 may specifically be configured to: determine one or more fourth texts according to a text with complete semantics in the text corpus; and determine the text data set according to the plurality of fourth texts determined from the plurality of texts with complete semantics in the text corpus.
Optionally, the processing module 830 may further be configured to: determine a dictionary tree (trie) according to the text corpus, where the dictionary tree includes a plurality of nodes; and determine the semantic completeness of a fourth text according to the number of child nodes of a node in the dictionary tree.
For example, one or more nodes may be determined from one text with complete semantics in the text corpus, and the plurality of nodes of the dictionary tree may be determined from the plurality of texts with complete semantics in the text corpus.
For example, for the above description of the text corpus and the dictionary tree, refer to step S210; for brevity, details are not repeated here.
Optionally, the processing module 830 may further be configured to: determine the semantic completeness of the fourth text according to the number of child nodes of a node in the dictionary tree and the tail-node marks determined from the texts with complete semantics.
For example, for a description of the tail-node mark, refer to step S210; for brevity, details are not repeated here.
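To make the dictionary-tree idea concrete, the Python sketch below builds a trie from a small corpus of complete texts, marks the tail node of each text, and scores a prefix from its tail mark and its number of child nodes. The scoring rule (0 for a non-tail node, otherwise 1 divided by one plus the child count), the whitespace tokenization, and the sample corpus are all assumptions made for illustration; the actual rule is the one described for step S210.

```python
class TrieNode:
    """One node of the dictionary tree."""

    def __init__(self):
        self.children = {}    # token -> TrieNode
        self.is_tail = False  # True if a complete text ends at this node


def build_trie(corpus):
    """Build a dictionary tree from texts with complete semantics."""
    root = TrieNode()
    for text in corpus:
        node = root
        for token in text.split():  # assumed tokenization: whitespace
            node = node.children.setdefault(token, TrieNode())
        node.is_tail = True         # tail-node mark for the complete text
    return root


def completeness(root, text):
    """Illustrative semantic completeness label for a (partial) text.

    Assumed rule: 0.0 if the text is unknown or does not end at a tail node;
    otherwise 1 / (1 + number of child nodes), so a complete text that can
    still be continued scores lower than one that cannot.
    """
    node = root
    for token in text.split():
        node = node.children.get(token)
        if node is None:
            return 0.0
    if not node.is_tail:
        return 0.0
    return 1.0 / (1 + len(node.children))


# Hypothetical corpus of complete instructions.
CORPUS = ["play music", "play music louder", "open the window"]
TRIE = build_trie(CORPUS)
```

Labels produced this way for prefixes of the corpus texts could then serve as the first information attached to the fourth texts in the text data set.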
For example, the apparatus 800 may be used in the voice detection platform described in the embodiment of FIG. 1, and the voice detection platform may provide background services for the voice interaction between the user and the terminal device. The embodiments of this application are not limited in this respect.
It should be understood that the apparatus shown in FIG. 15 for training the prediction model used in voice interaction can be used to implement method 200. For the specific steps, refer to the above descriptions of FIG. 3 to FIG. 5; for brevity, details are not repeated here.
It should be understood that the division of the above apparatus into units or modules is only a division of logical functions; in an actual implementation, they may be fully or partially integrated into one physical entity or may be physically separate. In addition, a unit or module in the apparatus may be implemented in the form of a processor invoking software. For example, the apparatus includes a processor connected to a memory that stores instructions, and the processor invokes the instructions stored in the memory to implement any of the above methods or the functions of the units of the apparatus, where the processor is, for example, a general-purpose processor such as a central processing unit (CPU) or a microprocessor, and the memory is a memory inside or outside the apparatus. Alternatively, a unit in the apparatus may be implemented in the form of a hardware circuit, and the functions of some or all of the units may be implemented through the design of the hardware circuit, which may be understood as one or more processors. For example, in one implementation, the hardware circuit is an application-specific integrated circuit (ASIC), and the functions of some or all of the above units are implemented through the design of the logical relationships among the components in the circuit. As another example, in another implementation, the hardware circuit may be implemented by a programmable logic device (PLD); taking a field-programmable gate array (FPGA) as an example, it may include a large number of logic gate circuits whose interconnections are configured through a configuration file, thereby implementing the functions of some or all of the above units. All units of the above apparatus may be implemented entirely in the form of a processor invoking software, entirely in the form of hardware circuits, or partly in the form of a processor invoking software with the remainder in the form of hardware circuits.
在本申请实施例中,处理器是一种具有信号的处理能力的电路,在一种实现中,处理器可以是具有指令读取与运行能力的电路,例如CPU、微处理器、图形处理器(graphics processing unit,GPU)(可以理解为一种微处理器)、或数字信号处理器(digital singnal processor,DSP)等;在另一种实现中,处理器可以通过硬件电路的逻辑关系实现一定功能,该硬件电路的逻辑关系是固定的或可以重构的,例如处理器为专用集成电路ASIC或可编程逻辑器件PLD实现的硬件电路,例如FPGA。在可重构的硬件电路中,处理器加载配置文档,实现硬件电路配置的过程,可以理解为处理器加载指令,以实现以上部分或全部单元的功能的过程。此外,还可以是针对人工智能设计的硬件电路,其可以理解为一种ASIC,例如神经网络处理单元(Neural Network Processing Unit,NPU)、张量处理单元(Tensor Processing Unit,TPU)、深度学习处理单元(Deep learning Processing Unit,DPU)等。In the embodiment of the present application, the processor is a circuit with signal processing capabilities. In one implementation, the processor may be a circuit with instruction reading and execution capabilities, such as CPU, microprocessor, graphics processor (graphics processing unit, GPU) (can be understood as a microprocessor), or digital signal processor (digital signal processor, DSP), etc.; in another implementation, the processor can realize a certain Function, the logical relationship of the hardware circuit is fixed or reconfigurable, for example, the processor is a hardware circuit implemented by an application-specific integrated circuit ASIC or a programmable logic device PLD, such as FPGA. In a reconfigurable hardware circuit, the process of the processor loading the configuration file to realize the configuration of the hardware circuit can be understood as the process of the processor loading instructions to realize the functions of some or all of the above units. In addition, it can also be a hardware circuit designed for artificial intelligence, which can be understood as an ASIC, such as a neural network processing unit (Neural Network Processing Unit, NPU), a tensor processing unit (Tensor Processing Unit, TPU), deep learning processing Unit (Deep learning Processing Unit, DPU), etc.
It can be seen that each unit in the above apparatus may be one or more processors (or processing circuits) configured to implement the above method, for example a CPU, GPU, NPU, TPU, DPU, microprocessor, DSP, ASIC, or FPGA, or a combination of at least two of these processor forms.
In addition, the units in the above apparatus may be integrated together in whole or in part, or may be implemented independently. In one implementation, these units are integrated together and implemented in the form of a system-on-a-chip (SOC). The SOC may include at least one processor for implementing any of the above methods or the functions of the units of the apparatus. The at least one processor may be of different types, for example a CPU and an FPGA, a CPU and an artificial intelligence processor, or a CPU and a GPU.
Exemplarily, FIG. 16 is a structural example diagram of an apparatus 1300 provided in an embodiment of the present application. The apparatus 1300 includes a processor 1302, a communication interface 1303, and a memory 1304. One example of the apparatus 1300 is a chip. Another example of the apparatus 1300 is a computing device.
The processor 1302, the memory 1304, and the communication interface 1303 may communicate with one another through a bus. Executable code is stored in the memory 1304, and the processor 1302 reads the executable code in the memory 1304 to execute the corresponding method. The memory 1304 may also include other software modules required for running processes, such as an operating system.
For example, the executable code in the memory 1304 is used to implement the methods shown in FIG. 3 to FIG. 13, and the processor 1302 reads the executable code in the memory 1304 to execute the methods shown in FIG. 3 to FIG. 13.
The processor 1302 may be a CPU. The memory 1304 may include a volatile memory (VM), such as a random access memory (RAM). The memory 1304 may also include a non-volatile memory (NVM), such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD).
In this application, the term "at least one" means one or more, and the term "multiple" means two or more.
In this application, terms such as "first" and "second" are used to distinguish between identical or similar items whose effects and functions are basically the same. It should be understood that "first", "second", and "nth" have no logical or temporal dependency on one another, and do not limit quantity or order of execution. For example, "first text" and "second text" are used only for distinction and do not mean that the "first text" and the "second text" have different priorities.
It should be understood that, in the embodiments of the present application, the magnitude of the sequence number of each process does not imply an order of execution. The execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
It should be understood that determining B according to A does not mean that B is determined only according to A; B may also be determined according to A and/or other information.
It should be understood that the term "and/or" in this document merely describes an association between associated objects and indicates that three relationships may exist. For example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone. In addition, the character "/" in this document generally indicates an "or" relationship between the associated objects before and after it.
It should be understood that, in the various embodiments of the present application, the magnitude of the sequence numbers of the above processes does not imply an order of execution. The execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
The terms "component", "module", "system", and the like used in this specification refer to a computer-related entity: hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device itself may be components. One or more components may reside within a process and/or a thread of execution, and a component may be located on one computer and/or distributed between two or more computers. In addition, these components may execute from various computer-readable media having various data structures stored thereon. A component may communicate through local and/or remote processes, for example according to a signal having one or more data packets (for example, data from two components interacting with another component in a local system, in a distributed system, and/or across a network, such as the Internet, which interacts with other systems by way of signals).
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are executed by hardware or by software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementations should not be regarded as going beyond the scope of the present application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For example, the division into units is merely a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disc.
The above is only a specific implementation of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (17)

  1. A voice interaction method, characterized by comprising:
    acquiring a first audio signal, where the first audio signal includes a first voice instruction;
    determining a duration of a first timer according to first text corresponding to the first voice instruction;
    starting the first timer;
    acquiring a second audio signal, where a start moment of the second audio signal is equal to or later than an end moment of the first audio signal;
    when text corresponding to a voice instruction in the second audio signal is empty, determining an end moment of the first timer as a voice endpoint; and
    after the voice endpoint is determined, responding to the first voice instruction.
  2. The method according to claim 1, characterized in that the determining an end moment of the first timer as a voice endpoint when text corresponding to a voice instruction in the second audio signal is empty comprises:
    when the text corresponding to the voice instruction in the second audio signal is empty, and the energy of the audio frames of the second audio signal is less than or equal to a first threshold, determining the end moment of the first timer as the voice endpoint.
  3. The method according to claim 1 or 2, characterized in that the method further comprises:
    acquiring second text displayed on a display screen;
    where the determining a duration of a first timer according to first text corresponding to the first voice instruction comprises:
    when the first text corresponding to the first voice instruction does not match the second text, determining the duration of the first timer according to the first text corresponding to the first voice instruction.
  4. The method according to any one of claims 1 to 3, characterized in that the method further comprises:
    acquiring a third audio signal, where the third audio signal includes an audio signal received within a first preset time, and a start moment of the first preset time is equal to or later than the end moment of the first audio signal;
    where the determining a duration of a first timer according to first text corresponding to the first voice instruction comprises:
    when the third audio signal does not include a voice instruction, determining the duration of the first timer according to the first text corresponding to the first voice instruction.
  5. The method according to any one of claims 1 to 4, characterized in that, before the acquiring a first audio signal, the method further comprises:
    acquiring a fourth audio signal, where the fourth audio signal includes a third voice instruction;
    determining a duration of a second timer according to third text corresponding to the third voice instruction;
    starting the second timer and acquiring a fifth audio signal while the second timer is running, where an end moment of the second timer is earlier than or equal to a start moment of the first timer; and
    when text corresponding to a voice instruction in the fifth audio signal is not empty, determining the first audio signal according to the fourth audio signal and the fifth audio signal, where the first audio signal includes the fourth audio signal and the fifth audio signal.
  6. The method according to claim 5, characterized in that a start moment of the first audio signal is earlier than or equal to a start moment of the fourth audio signal, and the end moment of the first audio signal is equal to or later than an end moment of the fifth audio signal.
  7. The method according to any one of claims 1 to 6, characterized in that the determining a duration of a first timer according to first text corresponding to the first voice instruction comprises:
    inputting the first text corresponding to the first voice instruction into a prediction model to obtain a semantic completeness of the first text; and
    determining the duration of the first timer according to the semantic completeness of the first text.
  8. A voice interaction apparatus, characterized in that the apparatus comprises:
    an acquisition module, configured to acquire a first audio signal, where the first audio signal includes a first voice instruction, and further configured to acquire a second audio signal, where a start moment of the second audio signal is equal to or later than an end moment of the first audio signal; and
    a processing module, configured to: determine a duration of a first timer according to first text corresponding to the first voice instruction; start the first timer; when text corresponding to a voice instruction in the second audio signal is empty, determine an end moment of the first timer as a voice endpoint; and after the voice endpoint is determined, respond to the first voice instruction.
  9. The apparatus according to claim 8, characterized in that the processing module is specifically configured to:
    when the text corresponding to the voice instruction in the second audio signal is empty, and the energy of the audio frames of the second audio signal is less than or equal to a first threshold, determine the end moment of the first timer as the voice endpoint.
  10. The apparatus according to claim 8 or 9, characterized in that the acquisition module is further configured to:
    acquire second text displayed on a display screen;
    and the processing module is specifically configured to:
    when the first text corresponding to the first voice instruction does not match the second text, determine the duration of the first timer according to the first text corresponding to the first voice instruction.
  11. The apparatus according to any one of claims 8 to 10, characterized in that the acquisition module is further configured to:
    acquire a third audio signal, where the third audio signal includes an audio signal received within a first preset time, and a start moment of the first preset time is equal to or later than the end moment of the first audio signal;
    and the processing module is specifically configured to:
    when the third audio signal does not include a voice instruction, determine the duration of the first timer according to the first text corresponding to the first voice instruction.
  12. The apparatus according to any one of claims 8 to 11, characterized in that the acquisition module is further configured to:
    before acquiring the first audio signal, acquire a fourth audio signal, where the fourth audio signal includes a third voice instruction; and
    acquire a fifth audio signal while a second timer is running;
    and the processing module is further configured to:
    determine a duration of the second timer according to third text corresponding to the third voice instruction;
    start the second timer, where an end moment of the second timer is earlier than or equal to a start moment of the first timer; and
    when text corresponding to a voice instruction in the fifth audio signal is not empty, determine the first audio signal according to the fourth audio signal and the fifth audio signal, where the first audio signal includes the fourth audio signal and the fifth audio signal.
  13. The apparatus according to claim 12, characterized in that a start moment of the first audio signal is earlier than or equal to a start moment of the fourth audio signal, and the end moment of the first audio signal is equal to or later than an end moment of the fifth audio signal.
  14. The apparatus according to any one of claims 8 to 13, characterized in that the processing module is specifically configured to:
    input the first text corresponding to the first voice instruction into a prediction model to obtain a semantic completeness of the first text; and
    determine the duration of the first timer according to the semantic completeness of the first text.
  15. An apparatus, characterized by comprising a processor and a memory, where the memory is configured to store program instructions, and the processor is configured to invoke the program instructions to execute the method according to any one of claims 1 to 7.
  16. A computer program product, characterized by comprising computer program code, where the computer program code, when run on a computer, causes the computer to execute the method according to any one of claims 1 to 7.
  17. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program code, and the program code, when run on a computer, causes the computer to execute the method according to any one of claims 1 to 7.
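
As an informal illustration only, and not part of the claims, the endpoint logic of claims 1, 2, 5, and 7 might be sketched in Python as follows. Everything named here is hypothetical: the `AudioSegment` container, the `timer_duration` mapping (a linear map with 0.3 s and 1.5 s bounds), the `completeness_model` callable standing in for the prediction model, and the energy threshold value. The claims do not say what happens when the recognized text is empty but the frame energy exceeds the threshold; this sketch assumes the timer is simply restarted.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Sequence, Tuple


@dataclass
class AudioSegment:
    """Hypothetical container: captured audio plus the text recognized in it."""
    start: float                           # start moment, in seconds
    end: float                             # end moment, in seconds
    text: str                              # "" when no voice instruction was recognized
    frame_energies: Sequence[float] = ()   # per-frame energy, arbitrary units


def timer_duration(completeness: float,
                   min_wait: float = 0.3,
                   max_wait: float = 1.5) -> float:
    """Map a semantic-completeness score in [0, 1] to a wait time in seconds.

    Assumption: the more complete the text, the shorter the wait. The linear
    mapping and both bounds are illustrative and do not come from the claims.
    """
    c = min(max(completeness, 0.0), 1.0)
    return min_wait + (1.0 - c) * (max_wait - min_wait)


def find_endpoint(first: AudioSegment,
                  followups: Iterable[AudioSegment],
                  completeness_model: Callable[[str], float],
                  energy_threshold: float = 0.01,
                  ) -> Tuple[Optional[float], str]:
    """Return (voice endpoint, accumulated text); the endpoint is None if not found."""
    text = first.text
    timer_start = first.end
    for seg in followups:
        # Claims 1 and 7: the timer duration depends on the text recognized so far.
        duration = timer_duration(completeness_model(text))
        quiet = all(e <= energy_threshold for e in seg.frame_energies)
        if seg.text == "" and quiet:
            # Claims 1 and 2: empty text and low energy, so the timer's end
            # moment is taken as the voice endpoint.
            return timer_start + duration, text
        # Claim 5: speech continued, so merge the segments and restart the timer.
        # Assumption: a noisy segment with empty text also restarts the timer.
        text = f"{text} {seg.text}".strip()
        timer_start = seg.end
    return None, text


if __name__ == "__main__":
    # Toy stand-in for the prediction model: "complete" once a destination is named.
    toy_model = lambda t: 1.0 if t.endswith("airport") else 0.0
    segments = [AudioSegment(1.0, 1.8, "the airport", [0.5]),
                AudioSegment(1.8, 2.5, "", [0.001])]
    print(find_endpoint(AudioSegment(0.0, 1.0, "navigate to"), segments, toy_model))
```

In this sketch an incomplete fragment such as "navigate to" gets the long wait, so a pause mid-sentence does not truncate the command, while a complete command gets the short wait, which keeps the response delay low.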
PCT/CN2021/141405 2021-12-25 2021-12-25 Speech interaction method and apparatus, and storage medium WO2023115588A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180041317.8A CN116670760A (en) 2021-12-25 2021-12-25 Voice interaction method, device and storage medium
PCT/CN2021/141405 WO2023115588A1 (en) 2021-12-25 2021-12-25 Speech interaction method and apparatus, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/141405 WO2023115588A1 (en) 2021-12-25 2021-12-25 Speech interaction method and apparatus, and storage medium

Publications (1)

Publication Number Publication Date
WO2023115588A1 true WO2023115588A1 (en) 2023-06-29

Family

ID=86901127

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/141405 WO2023115588A1 (en) 2021-12-25 2021-12-25 Speech interaction method and apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN116670760A (en)
WO (1) WO2023115588A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109346074A (en) * 2018-10-15 2019-02-15 百度在线网络技术(北京)有限公司 A kind of method of speech processing and system
US20190318759A1 (en) * 2018-04-12 2019-10-17 Qualcomm Incorporated Context-based detection of end-point of utterance
US20190385636A1 (en) * 2018-06-13 2019-12-19 Baidu Online Network Technology (Beijing) Co., Ltd. Voice activity detection method and apparatus
CN110689877A (en) * 2019-09-17 2020-01-14 华为技术有限公司 Voice end point detection method and device
CN110910863A (en) * 2019-11-29 2020-03-24 上海依图信息技术有限公司 Method, device and equipment for extracting audio segment from audio file and storage medium
CN112995419A (en) * 2021-02-05 2021-06-18 支付宝(杭州)信息技术有限公司 Voice conversation processing method and system
CN113345473A (en) * 2021-06-24 2021-09-03 科大讯飞股份有限公司 Voice endpoint detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN116670760A (en) 2023-08-29

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 202180041317.8

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21968717

Country of ref document: EP

Kind code of ref document: A1