WO2023273230A1 - Voice interaction method, voice interaction device, and storage medium - Google Patents

Voice interaction method, voice interaction device, and storage medium

Info

Publication number
WO2023273230A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
voice
target
audio
area
Prior art date
Application number
PCT/CN2021/140554
Other languages
English (en)
French (fr)
Inventor
董天旭
Original Assignee
达闼机器人股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 达闼机器人股份有限公司
Publication of WO2023273230A1 publication Critical patent/WO2023273230A1/zh

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 - Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 - Microphone arrays; Beamforming
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Definitions

  • the embodiments of the present application relate to the field of human-computer interaction, and in particular to a voice interaction method, a voice interaction device, and a storage medium.
  • voice interaction is a natural and friendly way of interacting.
  • Voice interaction based on speech recognition is gradually gaining acceptance and is widely used in many everyday scenarios, such as in-vehicle voice systems, smart TVs and speakers, and intelligent robots.
  • voice interaction is divided into near-field voice interaction and far-field voice interaction: near-field voice interaction, such as the voice input methods on mobile phones, is already very mature.
  • far-field voice interaction mainly picks up sound through a far-field microphone array, enhances the speech present at the target position, and suppresses audio from other positions, thereby enhancing the target speech.
  • the TBRR method judges whether speech is present in the target direction from the energy ratio between the fixed-beamforming output signal and an interference reference signal; however, judging whether speech is present in the target direction from an energy ratio requires setting a preset parameter with high accuracy, and that parameter must be determined by jointly considering the arrangement of the microphone array and the type of noise, which not only increases the amount of computation but also means that different microphone arrays, and different situations, require different preset parameters, making this determination method complicated and less accurate.
  • the purpose of the embodiments of the present application is to provide a voice interaction method, a voice interaction device, and a storage medium, so that the process of judging whether there is voice in a target area is more convenient and has higher accuracy.
  • an embodiment of the present application provides a voice interaction method applied to a voice interaction device, the voice interaction device including a microphone array composed of a plurality of microphones, the method including the following steps: receiving, through the microphone array, wake-up audio from the area outside the voice interaction device; determining the target area where the wake-up audio is located; receiving the current audio of the area outside the voice interaction device through each of the microphones to obtain the audio signal corresponding to the current audio at each microphone; if a voice signal is present in the current audio, determining the area where the voice signal is located; if the area where the voice signal is located and the target area satisfy a preset condition, obtaining the voice signal from the plurality of audio signals; and performing speech recognition on the voice signal.
  • An embodiment of the present application also provides a voice interaction device, including: at least one processor; and a memory communicatively connected to the at least one processor; where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the voice interaction method described above.
  • the embodiment of the present application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above voice interaction method.
  • the embodiment of the present application also provides a computer program that, when executed by a processor, implements the above voice interaction method.
  • FIG. 1 is a schematic flowchart of a voice interaction method according to an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a voice interaction method according to an embodiment of the present application
  • FIG. 3 is a schematic flowchart of a voice interaction method according to an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a voice interaction method according to an embodiment of the present application.
  • Fig. 5 is a schematic structural diagram of a voice interaction device according to an embodiment of the present application.
  • An embodiment of the present application relates to a voice interaction method applied to a voice interaction device, where the voice interaction device includes a microphone array composed of a plurality of microphones; the specific flow of the voice interaction method in this embodiment is shown in FIG. 1 and includes the following steps:
  • Step 101 receiving wake-up audio from an external area of a voice interaction device through a microphone array.
  • the microphone array of the voice interaction device is used to receive audio emitted outside the voice interaction device, including wake-up audio; wake-up audio refers to the wake-up speech that activates the voice interaction device, and after the voice interaction device receives the wake-up audio, it starts the speech recognition process.
  • Step 102 Determine the target area where the wake-up audio is located.
  • the voice interaction device needs to determine the position where the wake-up audio is emitted, that is, the position of the target area, so as to determine the area where the user is located.
  • Step 103 Receive the current audio in the external area of the voice interaction device through each microphone, and obtain an audio signal corresponding to the current audio at each microphone.
  • each microphone of the microphone array can receive the current audio of the area outside the voice interaction device. Since the microphones are placed at different positions, the audio signals received by the microphones also differ; therefore, in this embodiment, after each microphone receives the current audio, one audio signal corresponding to the current audio is obtained per microphone.
  • Step 104 If there is a voice signal in the current audio, determine the area where the voice signal is located.
  • generally, after waking up the voice interaction device in the target area, the user will continue to issue voice commands in the target area; of course, the user may also go off to deal with other matters after waking up the device, so the current audio that the voice interaction device continues to receive after being woken up may or may not contain voice.
  • a sound source localization algorithm is used to judge whether a voice signal is present in the current audio and, when it is determined that a voice signal is present, the sound source localization algorithm is used to determine the area where the voice signal is located.
  • Step 105 if the area where the voice signal is located and the target area satisfy the preset condition, the voice signal is obtained from the multiple audio signals.
  • the preset condition may be that the degree of overlap between the area where the voice signal is located and the target area is greater than a preset threshold; the voice signal in the target area is obtained by performing certain signal processing operations on the multiple audio signals.
  • Step 106 performing speech recognition according to the speech signal.
  • after obtaining the voice signal of the target area, the voice interaction device performs speech recognition on the voice signal, thereby recognizing the voice command carried by the voice signal and performing the corresponding operation according to the voice command.
  • the data length of the current audio is 10 ms to 30 ms.
  • the voice interaction device needs to acquire the current audio in real time; therefore, the current audio acquired each time is only a short audio segment.
  • the data length of the current audio can be set to 10 ms to 30 ms and can be adjusted by the user according to actual needs.
  • before obtaining the speech signal from the multiple audio signals, the method also includes: using a fixed beamforming module to process the multiple audio signals to obtain a combined audio signal, the combined audio signal representing the sum of all audio signals in the target area; and using a differential matrix module to process the multiple audio signals to obtain an interference reference signal.
  • Step 201 receiving wake-up audio from an external area of a voice interaction device through a microphone array.
  • Step 202 determine the target area where the wake-up audio is located.
  • Step 203 Receive current audio in the external area of the voice interaction device through each microphone, and obtain an audio signal corresponding to the current audio at each microphone.
  • Step 204 use the fixed beamforming module to process multiple audio signals to obtain a combined audio signal; the combined audio signal represents the sum of all audio signals in the target area.
  • the voice interaction device includes a fixed beamforming module.
  • the fixed beamforming module is a delay-sum beamformer.
  • the fixed beamforming module averages the audio signals received by each microphone to obtain a combined audio signal. That is, the multiple audio signals are added and divided by the number of microphones to obtain a combined audio signal, thereby eliminating the inconsistency of the multiple audio signals due to different positions of the microphones. It should be noted that, in the case where the voice signal location and the target area meet the preset conditions, the combined audio signal represents the sum of the voice signal and the interference signal in the target area.
  • Step 205 using the differential matrix module to process multiple audio signals to obtain an interference reference signal.
  • the voice interaction device includes a differential matrix module, which removes the voice signal by subtracting the audio signals of oppositely placed microphones and uses the remaining signal, from which the voice signal has been eliminated, as the interference reference signal.
  • Step 206 if there is a voice signal in the current audio, determine the area where the voice signal is located.
  • Step 207 when the area where the speech signal is located and the target area satisfy the preset condition, input the interference reference signal into the target adaptive interference canceller model to obtain the target interference signal; the target interference signal represents the predicted interference signal present in the current target area.
  • the filter used in the target adaptive interference canceller in this embodiment is a normalized least mean square (NLMS) adaptive filter; NLMS has better convergence and stability, which can improve the accuracy of the target adaptive interference canceller and thereby the accuracy of speech recognition. In practice, other types of filters may also be used, such as the least mean square (LMS) filter or the recursive least squares (RLS) filter.
  • the adaptive interference canceller of the voice interaction device contains a target adaptive interference canceller model; this model has been trained before the voice interaction takes place, its coefficients are already relatively well refined, and it can be applied in a specific voice interaction process. Therefore, in this embodiment, the interference reference signal is input into the target adaptive interference canceller model to obtain its output, that is, the target interference signal, which represents the predicted interference signal present in the current target area.
  • step 208 a voice signal is obtained from the difference between the combined audio signal and the target interference signal.
  • since the combined audio signal represents the sum of the speech signal and the interference signal in the target area, and the target interference signal represents the predicted interference signal present in the current target area, subtracting the target interference signal from the combined audio signal yields the voice signal in the target area, filtering out noise as far as possible.
  • Step 209 performing speech recognition according to the speech signal.
  • steps 201 to 204 and step 209 are the same as steps 101 to 104 and step 106 of the previous embodiment, and to avoid repetition, details are not repeated here.
  • after receiving the current audio from the area outside the voice interaction device through each microphone and obtaining the audio signal corresponding to the current audio at each microphone, the method further includes: when there is no voice signal in the current audio, taking the combined audio signal and the interference reference signal as a pair of training samples, training the target adaptive interference canceller model with the training samples at a first learning rate, and updating the target adaptive interference canceller model.
  • Step 301 receiving the wake-up audio from the external area of the voice interaction device through the microphone array.
  • Step 302 determine the target area where the wake-up audio is located.
  • Step 303 Receive current audio in the external area of the voice interaction device through each microphone, and obtain an audio signal corresponding to the current audio at each microphone.
  • Step 304 use the fixed beamforming module to process multiple audio signals to obtain a combined audio signal; the combined audio signal represents the sum of all audio signals in the target area.
  • Step 305 using the differential matrix module to process multiple audio signals to obtain an interference reference signal. After step 305, go to step 306 and step 310 respectively.
  • Step 306 if there is a voice signal in the current audio, determine the area where the voice signal is located.
  • Step 307 when the area where the speech signal is located and the target area satisfy the preset condition, input the interference reference signal into the target adaptive interference canceller model to obtain the target interference signal; the target interference signal represents the predicted interference signal present in the current target area.
  • step 308 a voice signal is obtained according to the difference between the combined audio signal and the target interference signal.
  • Step 309 performing speech recognition according to the speech signal.
  • Step 310 when there is no speech signal in the current audio, use the combined audio signal and the interference reference signal as a pair of training samples, train the target adaptive interference canceller model with the training samples at the first learning rate, and update the target adaptive interference canceller model.
  • the input data of the target adaptive interference canceller model is the interference reference signal sample, and the output data of the target adaptive interference canceller model is the audio signal sample; the two are used as a pair of training samples to train the model, thereby updating the target adaptive interference canceller model, refining its internal coefficients, and improving its accuracy.
  • to make full use of the samples, the target adaptive interference canceller model can also learn again in this case, continuously refining its coefficients and improving its accuracy.
  • steps 301 to 309 are the same as the steps 201 to 209 in the previous embodiment, and will not be repeated here to avoid repetition.
  • an NLMS filter with a larger learning rate learns faster, and one with a smaller learning rate learns more slowly; and the faster the learning, the lower its fineness. It must therefore be ensured that the audio signal samples serving as the output data of the target adaptive interference canceller model contain no voice; that is, the acquired interference reference signal and combined audio signal should be used as a pair of training samples only when no voice is present in the area outside the voice interaction device.
  • this embodiment therefore performs model training only when there is no voice outside the voice interaction device, and the first learning rate can be set relatively large, so as to take into account both learning speed and learning accuracy.
  • after updating the target adaptive interference canceller model, the method further includes: training the target adaptive interference canceller model with the training samples again, and updating the target adaptive interference canceller model again.
  • through repeated learning, the target adaptive interference canceller model is further updated, improving its accuracy.
  • learning twice has the same convergence speed as doubling the learning rate while keeping the small error of the single learning rate.
  • when the area where the voice signal is located and the target area do not satisfy the preset condition, that is, when the area where the voice signal is located does not coincide with the target area, the combined audio signal and the interference reference signal are taken as a pair of training samples, and the target adaptive interference canceller model is trained with the training samples at the first learning rate and updated.
  • since the voice interaction device will not use this part of the voice signal as a voice command, the voice signal is treated as an interference signal; to further improve the accuracy of the target adaptive interference canceller model and make full use of the samples, the model can also learn again in this case, continuously refining its coefficients and improving its accuracy.
  • learning and updating can be performed twice or more.
  • the steps of determining that a voice signal is present in the current audio and determining the area where the voice signal is located are performed by a multi-source localization algorithm.
  • a multi-source localization algorithm represents a sound source localization algorithm with high reliability; it is more accurate and can improve the accuracy of detecting whether voice information is present in the target area.
  • after determining the area where the voice signal is located, the method also includes: when the area where the voice signal is located and the target area satisfy the preset condition, that is, when the two areas coincide, the voice signal is the target voice command signal, and the learning and updating of the target adaptive interference canceller model are stopped, that is, the learning rate of the target adaptive interference canceller model is set to 0.
  • the steps of determining that a voice signal is present in the current audio and determining the area where the voice signal is located are performed by a single-source localization algorithm; after determining the area where the voice signal is located, the method also includes: when the area where the voice signal is located and the target area satisfy the preset condition, taking the combined audio signal and the interference reference signal as a pair of training samples, training the target adaptive interference canceller model with the training samples at a second learning rate, and updating the target adaptive interference canceller model; the second learning rate is smaller than the first learning rate.
  • Step 401 receiving wake-up audio from an external area of the voice interaction device through a microphone array.
  • Step 402 determine the target area where the wake-up audio is located.
  • Step 403 Receive the current audio in the external area of the voice interaction device through each microphone, and obtain an audio signal corresponding to the current audio at each microphone.
  • Step 404 using the fixed beamforming module to process multiple audio signals to obtain a combined audio signal; the combined audio signal represents the sum of the speech signal and the interference signal in the target area.
  • Step 405 using the differential matrix module to process multiple audio signals to obtain an interference reference signal.
  • Step 406 if there is a voice signal in the current audio, determine the area where the voice signal is located.
  • Step 407 when the area where the voice signal is located and the target area satisfy the preset condition, input the interference reference signal into the target adaptive interference canceller model to obtain the target interference signal; the target interference signal represents the predicted interference signal present in the current target area.
  • step 408 a voice signal is obtained according to the difference between the combined audio signal and the target interference signal.
  • Step 409 performing speech recognition according to the speech signal.
  • Step 410 when the area where the speech signal is located and the target area satisfy the preset condition, the combined audio signal and the interference reference signal are used as a pair of training samples, and the target adaptive interference canceller model is trained with the training samples at the second learning rate and updated.
  • the second learning rate is smaller than the first learning rate.
  • a single-source localization algorithm represents a sound source localization algorithm with high computational efficiency but slightly lower reliability. Specifically, the single-source localization algorithm takes the direction of the most energetic sound at the current moment as the direction of the speech; its computation is therefore simpler than that of a multi-source localization algorithm, but its accuracy is lower.
  • because the reliability of the single-source localization algorithm is slightly lower, this embodiment cannot simply stop the learning and updating of the target adaptive interference canceller model; therefore, when the single-source localization algorithm is used, the target adaptive interference canceller model is trained again: the combined audio signal and the interference reference signal are used as a pair of training samples, the model is trained with them at the second learning rate, and the model is updated. Since the accuracy of the single-source localization algorithm is poorer, a smaller learning rate can be used: setting the second learning rate smaller than the first learning rate, that is, reducing the learning rate of the adaptive interference canceller model, improves the accuracy of this round of learning.
  • steps 401 to 409 are the same as steps 301 to 309 in the previous embodiment, and to avoid repetition, details are not repeated here.
  • the step division of the above methods is only for clarity of description; in implementation, steps may be merged into one or split into multiple steps, and all such variants fall within the protection scope of this patent as long as they contain the same logical relationship; adding insignificant modifications to, or introducing insignificant designs into, an algorithm or flow without changing its core design also falls within the protection scope of this patent.
  • An embodiment of the present application relates to a voice interaction device, as shown in FIG. 5, including at least one processor 501; and a memory 502 communicatively connected to the at least one processor 501; where the memory 502 stores instructions executable by the at least one processor 501, and the instructions are executed by the at least one processor 501 so that the at least one processor 501 can perform the above voice interaction method.
  • the memory 502 and the processor 501 are connected by a bus, and the bus may include any number of interconnected buses and bridges, and the bus connects one or more processors 501 and various circuits of the memory 502 together.
  • the bus may also connect together various other circuits such as peripherals, voltage regulators, and power management circuits, all of which are well known in the art and therefore will not be further described herein.
  • the bus interface provides an interface between the bus and the transceivers.
  • a transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing means for communicating with various other devices over a transmission medium.
  • the data processed by the processor 501 is transmitted on the wireless medium through the antenna, and further, the antenna also receives the data and transmits the data to the processor 501 .
  • Processor 501 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interface, voltage regulation, power management and other control functions. And the memory 502 may be used to store data used by the processor 501 when performing operations.
  • An embodiment of the present application relates to a computer-readable storage medium storing a computer program.
  • the above method embodiments are implemented when the computer program is executed by the processor.
  • An embodiment of the present application relates to a computer program.
  • the above method embodiments are implemented when the computer program is executed by the processor.
  • the program is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods of the various embodiments of the present application.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

A voice interaction method, a voice interaction device, and a storage medium, relating to the field of human-computer interaction. The voice interaction method includes the following steps: receiving, through a microphone array, wake-up audio from an area outside the voice interaction device (101, 201, 301, 401); determining the target area where the wake-up audio is located (102, 202, 302, 402); receiving the current audio of the area outside the voice interaction device through each microphone to obtain the audio signal corresponding to the current audio at each microphone (103, 203, 303, 403); if a voice signal is present in the current audio, determining the area where the voice signal is located (104, 206, 306, 406); if the area where the voice signal is located and the target area satisfy a preset condition, obtaining the voice signal from the multiple audio signals (105); and performing speech recognition on the voice signal (106, 209, 309, 409). Compared with the related art, which judges whether speech is present in the target area by means of an energy ratio, there is no need to set different preset parameters for different situations; the steps are more convenient and the accuracy is higher.

Description

Voice interaction method, voice interaction device, and storage medium
Cross-Reference
This application is based on, and claims priority to, the Chinese patent application with application number 2021107321060 filed on June 29, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of human-computer interaction, and in particular to a voice interaction method, a voice interaction device, and a storage medium.
Background
At present, voice interaction is a natural and friendly way of interacting. Voice interaction based on speech recognition is gradually gaining acceptance and is widely used in many everyday scenarios, such as in-vehicle voice systems, smart TVs and speakers, and intelligent robots. Voice interaction is divided into near-field voice interaction and far-field voice interaction: near-field voice interaction, such as the voice input methods on mobile phones, is already very mature. Far-field voice interaction mainly picks up sound through a far-field microphone array, enhances the speech present at the target position, and suppresses audio from other positions, thereby enhancing the target speech.
In the related art, there is a transient beam-to-reference ratio (TBRR) method for enhancing speech. The TBRR method judges whether speech is present in the target direction from the energy ratio between the fixed-beamforming output signal and an interference reference signal. However, judging whether speech is present in the target direction from an energy ratio requires setting a preset parameter with high accuracy, and this parameter must be determined by jointly considering the arrangement of the microphone array and the type of noise. This not only increases the amount of computation; different microphone arrays also require different preset parameters, and different situations require different preset parameters, which makes this determination method complicated and less accurate.
Summary
The purpose of the embodiments of the present application is to provide a voice interaction method, a voice interaction device, and a storage medium, so that the process of judging whether speech is present in a target area is more convenient and more accurate.
To solve the above technical problem, an embodiment of the present application provides a voice interaction method applied to a voice interaction device, where the voice interaction device includes a microphone array composed of a plurality of microphones. The method includes the following steps: receiving, through the microphone array, wake-up audio from an area outside the voice interaction device; determining the target area where the wake-up audio is located; receiving the current audio of the area outside the voice interaction device through each microphone to obtain the audio signal corresponding to the current audio at each microphone; if a voice signal is present in the current audio, determining the area where the voice signal is located; if the area where the voice signal is located and the target area satisfy a preset condition, obtaining the voice signal from the plurality of audio signals; and performing speech recognition on the voice signal.
An embodiment of the present application further provides a voice interaction device, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the voice interaction method described above.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the voice interaction method described above.
An embodiment of the present application further provides a computer program, which, when executed by a processor, implements the voice interaction method described above.
In this embodiment, it is judged whether speech is present in the current audio; when speech is present, the area where the speech is located is determined; and when the area where the speech is located and the target area satisfy the preset condition, it is judged that speech is present in the target area, so that the corresponding voice signal is obtained from the multiple audio signals and voice interaction is performed. Compared with the related art, which judges whether speech is present in the target area by means of an energy ratio, there is no need to set different preset parameters for different situations, and the process of judging whether speech is present in the target area is more convenient and more accurate.
Brief Description of the Drawings
One or more embodiments are illustrated by way of the figures in the corresponding drawings. These illustrations do not constitute a limitation on the embodiments; elements with the same reference numerals in the drawings denote similar elements, and unless otherwise stated, the figures in the drawings are not drawn to scale.
FIG. 1 is a schematic flowchart of a voice interaction method according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a voice interaction method according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of a voice interaction method according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a voice interaction method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a voice interaction device according to an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions, and advantages of the embodiments of the present application clearer, the embodiments of the present application are described in detail below with reference to the accompanying drawings. However, those of ordinary skill in the art will appreciate that many technical details are set out in the embodiments so that the reader can better understand the present application; even without these technical details and the various changes and modifications based on the following embodiments, the technical solutions claimed in the present application can still be realized. The division into the following embodiments is for convenience of description and shall not constitute any limitation on the specific implementations of the present application; the embodiments may be combined with, and refer to, one another provided they do not contradict.
An embodiment of the present application relates to a voice interaction method applied to a voice interaction device, where the voice interaction device includes a microphone array composed of a plurality of microphones. The specific flow of the voice interaction method of this embodiment is shown in FIG. 1 and includes the following steps:
Step 101: Receive, through the microphone array, wake-up audio from the area outside the voice interaction device.
Specifically, the microphone array of the voice interaction device is used to receive audio emitted outside the voice interaction device, including wake-up audio. Wake-up audio refers to the wake-up speech that activates the voice interaction device; after the voice interaction device receives the wake-up audio, it starts the speech recognition process.
Step 102: Determine the target area where the wake-up audio is located.
Specifically, after the voice interaction device is woken up, it needs to determine the position from which the wake-up audio was emitted, that is, the position of the target area, so as to determine the area where the user is located.
Step 103: Receive the current audio of the area outside the voice interaction device through each microphone to obtain the audio signal corresponding to the current audio at each microphone.
Specifically, each microphone of the microphone array can receive the current audio of the area outside the voice interaction device. Since the microphones are placed at different positions, the audio signal received by each microphone also differs; therefore, in this embodiment, after each microphone receives the current audio, one audio signal corresponding to the current audio is obtained per microphone.
Step 104: If a voice signal is present in the current audio, determine the area where the voice signal is located.
Generally speaking, after waking up the voice interaction device in the target area, the user will continue to issue voice commands in the target area. Of course, the user may also go off to deal with other matters after waking up the device, so the current audio that the voice interaction device continues to receive may or may not contain speech.
Specifically, this embodiment uses a sound source localization algorithm to judge whether a voice signal is present in the current audio and, when it is determined that a voice signal is present, uses the sound source localization algorithm to determine the area where the voice signal is located.
Step 105: If the area where the voice signal is located and the target area satisfy a preset condition, obtain the voice signal from the multiple audio signals.
Specifically, after the area where the voice signal is located is determined, it is judged whether that area and the target area satisfy a preset condition, where the preset condition may be that the degree of overlap between the area where the voice signal is located and the target area is greater than a preset threshold. When the preset condition is satisfied, the voice signal is obtained from the multiple audio signals by performing certain signal processing operations on them, thereby obtaining the voice signal within the target area.
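For illustration only (this is not part of the patent text), the preset condition could be sketched as a simple direction-of-arrival comparison; the function, its names, and the 15-degree threshold are assumptions introduced here:

```python
def regions_match(speech_doa_deg: float, target_doa_deg: float,
                  threshold_deg: float = 15.0) -> bool:
    """Hypothetical preset condition: treat the area where the voice signal
    is located as coinciding with the target area when their directions of
    arrival differ by no more than a threshold (15 degrees is an assumption)."""
    diff = abs((speech_doa_deg - target_doa_deg + 180.0) % 360.0 - 180.0)
    return diff <= threshold_deg
```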
Step 106: Perform speech recognition on the voice signal.
Specifically, after the voice signal of the target area is obtained, the voice interaction device performs speech recognition on the voice signal, thereby recognizing the voice command carried by the voice signal and executing the corresponding operation according to the voice command.
In this embodiment, it is judged whether speech is present in the current audio; when speech is present, the area where the speech is located is determined; and when that area and the target area satisfy the preset condition, it is judged that speech is present in the target area, so that the corresponding voice signal is obtained from the multiple audio signals and voice interaction is performed. Compared with the related art, which judges whether speech is present in the target area by means of an energy ratio, there is no need to set different preset parameters for different situations, and the judging process is more convenient and more accurate.
In one embodiment, the data length of the current audio is 10 ms to 30 ms. Specifically, the voice interaction device needs to acquire the current audio in real time, so the current audio acquired each time is only a short audio segment. In this embodiment, the data length of the current audio can be set to 10 ms to 30 ms, and the user can adjust it according to actual needs.
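As a minimal sketch of such real-time framing (illustrative only; the 16 kHz sampling rate is an assumption, and any frame length in the stated 10 ms to 30 ms range works the same way):

```python
SAMPLE_RATE_HZ = 16000                          # assumed sampling rate
FRAME_MS = 20                                   # within the 10 ms to 30 ms range
FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 320 samples per frame

def frames(stream, frame_len=FRAME_LEN):
    """Yield fixed-length frames of 'current audio' from a sample iterator."""
    buf = []
    for sample in stream:
        buf.append(sample)
        if len(buf) == frame_len:
            yield buf
            buf = []
```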
In one embodiment, before the voice signal is obtained from the multiple audio signals, the method further includes: processing the multiple audio signals with a fixed beamforming module to obtain a combined audio signal, where the combined audio signal represents the sum of all audio signals in the target area; and processing the multiple audio signals with a differential matrix module to obtain an interference reference signal.
The specific flow of this embodiment is shown in FIG. 2 and includes the following steps:
Step 201: Receive, through the microphone array, wake-up audio from the area outside the voice interaction device.
Step 202: Determine the target area where the wake-up audio is located.
Step 203: Receive the current audio of the area outside the voice interaction device through each microphone to obtain the audio signal corresponding to the current audio at each microphone.
Step 204: Process the multiple audio signals with the fixed beamforming module to obtain a combined audio signal; the combined audio signal represents the sum of all audio signals in the target area.
Specifically, the voice interaction device includes a fixed beamforming module, which is a delay-and-sum beamformer. The fixed beamforming module averages the audio signals received by the microphones to obtain the combined audio signal, that is, the multiple audio signals are added together and divided by the number of microphones; this eliminates the inconsistencies among the multiple audio signals caused by the different microphone positions. It should be noted that, when the area where the voice signal is located and the target area satisfy the preset condition, the combined audio signal represents the sum of the voice signal and the interference signal in the target area.
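A minimal sketch of this delay-and-sum step, assuming the per-microphone signals have already been delay-compensated toward the target area (the function name and array layout are illustrative):

```python
import numpy as np

def fixed_beamform(aligned: np.ndarray) -> np.ndarray:
    """Delay-and-sum beamforming: add the (already delay-compensated) signals
    of all microphones together and divide by the number of microphones.
    `aligned` has shape (num_microphones, num_samples)."""
    return aligned.mean(axis=0)
```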
Step 205: Process the multiple audio signals with the differential matrix module to obtain an interference reference signal.
Specifically, the voice interaction device includes a differential matrix module. By subtracting the audio signals of oppositely placed microphones from each other, the voice signal is removed, and the remaining signal, from which the voice signal has been eliminated, is used as the interference reference signal.
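Under the same assumptions, the differential (blocking) matrix might be sketched as follows; the opposite-pair indexing assumes an even microphone count with microphone i placed opposite microphone i + m // 2, which is an illustrative layout rather than the patent's:

```python
import numpy as np

def interference_reference(aligned: np.ndarray) -> np.ndarray:
    """Subtract the signals of oppositely placed microphone pairs; the aligned
    target speech cancels in each difference, leaving a reference signal that
    mainly contains interference."""
    m = aligned.shape[0]
    diffs = [aligned[i] - aligned[i + m // 2] for i in range(m // 2)]
    return np.mean(diffs, axis=0)
```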
Step 206: If a voice signal is present in the current audio, determine the area where the voice signal is located.
Step 207: If the area where the voice signal is located and the target area satisfy the preset condition, input the interference reference signal into the target adaptive interference canceller model to obtain a target interference signal; the target interference signal represents the predicted interference signal present in the current target area.
Specifically, the filter used in the target adaptive interference canceller in this embodiment is a normalized least mean square (NLMS) adaptive filter. NLMS has good convergence and stability, which can improve the accuracy of the target adaptive interference canceller and thereby the accuracy of speech recognition. Of course, in practical applications other types of filters may also be used, such as the least mean square (LMS) filter or the recursive least squares (RLS) filter.
Specifically, the adaptive interference canceller of the voice interaction device contains a target adaptive interference canceller model. This model has been trained before the voice interaction takes place, its coefficients are already relatively well refined, and it can be applied in a specific voice interaction process. Therefore, in this embodiment, the interference reference signal is input into the target adaptive interference canceller model to obtain its output, namely the target interference signal, which represents the predicted interference signal present in the current target area.
Step 208: Obtain the voice signal from the difference between the combined audio signal and the target interference signal.
Specifically, when the area where the voice signal is located and the target area satisfy the preset condition, the combined audio signal represents the sum of the voice signal and the interference signal in the target area, and the target interference signal represents the predicted interference signal present in the current target area. After the output of the target adaptive interference canceller model, namely the target interference signal, is obtained, subtracting the target interference signal from the combined audio signal yields the voice signal in the target area, thereby filtering out, as far as possible, the noise present in the voice signal.
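Steps 207 and 208 together can be sketched with an NLMS filter as follows (an illustrative implementation rather than the patent's code; the tap count and step size are assumptions): the filter predicts the target interference signal from the interference reference signal, and the residual left after subtracting that prediction from the combined audio signal is the speech estimate.

```python
import numpy as np

class NLMSCanceller:
    """Normalized least mean square (NLMS) adaptive interference canceller."""

    def __init__(self, taps: int = 64, mu: float = 0.5, eps: float = 1e-8):
        self.w = np.zeros(taps)    # adaptive coefficients, refined by training
        self.buf = np.zeros(taps)  # delay line of recent reference samples
        self.mu = mu               # learning rate (step size)
        self.eps = eps             # regularizer against division by zero

    def step(self, ref: float, combined: float, adapt: bool = True) -> float:
        self.buf = np.roll(self.buf, 1)
        self.buf[0] = ref
        predicted = self.w @ self.buf   # predicted target interference signal
        speech = combined - predicted   # combined audio minus predicted interference
        if adapt:                       # update only when adaptation is allowed
            norm = self.buf @ self.buf + self.eps
            self.w += (self.mu / norm) * speech * self.buf
        return speech
```

In the update, the step size is normalized by the energy of the reference samples in the delay line; this normalization is what gives NLMS its convergence and stability advantage over plain LMS.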
Step 209: Perform speech recognition on the voice signal.
Steps 201 to 204 and step 209 above are the same as steps 101 to 104 and step 106 of the previous embodiment, and are not repeated here to avoid repetition.
In one embodiment, after the current audio from the area outside the voice interaction device is received through each microphone and the audio signal corresponding to the current audio at each microphone is obtained, the method further includes: when no voice signal is present in the current audio, taking the combined audio signal and the interference reference signal as a pair of training samples, training the target adaptive interference canceller model with the training samples at a first learning rate, and updating the target adaptive interference canceller model.
The specific flow of this embodiment is shown in FIG. 3 and includes the following steps:
Step 301: Receive, through the microphone array, wake-up audio from the area outside the voice interaction device.
Step 302: Determine the target area where the wake-up audio is located.
Step 303: Receive the current audio of the area outside the voice interaction device through each microphone to obtain the audio signal corresponding to the current audio at each microphone.
Step 304: Process the multiple audio signals with the fixed beamforming module to obtain a combined audio signal; the combined audio signal represents the sum of all audio signals in the target area.
Step 305: Process the multiple audio signals with the differential matrix module to obtain an interference reference signal. After step 305, proceed to step 306 and step 310 respectively.
Step 306: If a voice signal is present in the current audio, determine the area where the voice signal is located.
Step 307: If the area where the voice signal is located and the target area satisfy the preset condition, input the interference reference signal into the target adaptive interference canceller model to obtain the target interference signal; the target interference signal represents the predicted interference signal present in the current target area.
Step 308: Obtain the voice signal from the difference between the combined audio signal and the target interference signal.
Step 309: Perform speech recognition on the voice signal.
Step 310: When no voice signal is present in the current audio, take the combined audio signal and the interference reference signal as a pair of training samples, train the target adaptive interference canceller model with the training samples at the first learning rate, and update the target adaptive interference canceller model.
Specifically, the input data of the target adaptive interference canceller model is the interference reference signal sample, and the output data of the target adaptive interference canceller model is the audio signal sample; the two are used as a pair of training samples to train the model, thereby updating the target adaptive interference canceller model, refining its internal coefficients, and improving its accuracy.
Specifically, since no voice signal is present in the current audio, the audio received by each microphone also contains no voice signal, and no voice information can be obtained at this time. To further improve the accuracy of the target adaptive interference canceller model and make full use of the samples, in this case the model can also learn again, continuously refining its coefficients and improving its accuracy.
Steps 301 to 309 above are the same as steps 201 to 209 of the previous embodiment, and are not repeated here to avoid repetition.
It should be noted that an NLMS filter with a larger learning rate learns faster, and one with a smaller learning rate learns more slowly; and the faster the learning, the lower its fineness. It must therefore be ensured that the audio signal samples serving as the output data of the target adaptive interference canceller model contain no speech; that is, the acquired interference reference signal and combined audio signal should be used as a pair of training samples only when no speech is present in the area outside the voice interaction device. Therefore, to satisfy the requirements of both learning speed and learning precision, this embodiment performs model training only when no speech is present outside the voice interaction device, and the first learning rate can be set relatively large, taking both learning speed and learning precision into account.
In one embodiment, after the target adaptive interference canceller model is updated, the method further includes: training the target adaptive interference canceller model with the training samples again, and updating the target adaptive interference canceller model again. Repeated learning further updates the target adaptive interference canceller model and improves its accuracy. Notably, learning twice has the same convergence speed as doubling the learning rate, while keeping the small error of the single learning rate.
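Building on the NLMSCanceller sketch above (still illustrative, with assumed parameter values), the no-speech training step could look like this; running the same frame through the update twice mirrors the repeated learning described here:

```python
def train_on_silence(canceller, ref_frame, combined_frame,
                     passes=2, mu_first=0.5):
    """When the current audio contains no voice signal, adapt the canceller on
    the (interference reference, combined audio) frame pair at the first
    learning rate; a second pass over the same frame converges like a doubled
    learning rate while keeping the error of the single rate."""
    canceller.mu = mu_first
    for _ in range(passes):
        for ref, combined in zip(ref_frame, combined_frame):
            canceller.step(ref, combined, adapt=True)
```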
In one embodiment, when a voice signal is present in the current audio, after the area where the voice signal is located is determined, the method further includes: when the area where the voice signal is located and the target area do not satisfy the preset condition, taking the combined audio signal and the interference reference signal as a pair of training samples, training the target adaptive interference canceller model with the training samples at the first learning rate, and updating the target adaptive interference canceller model. Specifically, when the area where the voice signal is located and the target area do not satisfy the preset condition, the area where the voice signal is located does not coincide with the target area; in that case the voice interaction device will not treat this voice signal as a voice command issued by the user, and the voice signal is treated as an interference signal. To further improve the accuracy of the target adaptive interference canceller model and make full use of the samples, in this case the model can also learn again, continuously refining its coefficients and improving its accuracy. Of course, after the target adaptive interference canceller model is updated, learning and updating can also be performed two or more times.
In one embodiment, the steps of determining that a voice signal is present in the current audio and determining the area where the voice signal is located are performed by a multi-source localization algorithm. A multi-source localization algorithm represents a sound source localization algorithm of high reliability; it is more accurate and can improve the accuracy of detecting whether voice information is present in the target area. Therefore, in this embodiment, after the area where the voice signal is located is determined, the method further includes: when the area where the voice signal is located and the target area satisfy the preset condition, that is, when the two areas coincide, the voice signal is the target voice command signal, and the learning and updating of the target adaptive interference canceller model are stopped, that is, the learning rate of the target adaptive interference canceller model is set to 0.
In one embodiment, the steps of determining that a voice signal is present in the current audio and determining the area where the voice signal is located are performed by a single-source localization algorithm. After the area where the voice signal is located is determined, the method further includes: when the area where the voice signal is located and the target area satisfy the preset condition, taking the combined audio signal and the interference reference signal as a pair of training samples, training the target adaptive interference canceller model with the training samples at a second learning rate, and updating the target adaptive interference canceller model; the second learning rate is smaller than the first learning rate.
The specific flow of this embodiment is shown in FIG. 4 and includes the following steps:
Step 401: Receive, through the microphone array, wake-up audio from the area outside the voice interaction device.
Step 402: Determine the target area where the wake-up audio is located.
Step 403: Receive the current audio of the area outside the voice interaction device through each microphone to obtain the audio signal corresponding to the current audio at each microphone.
Step 404: Process the multiple audio signals with the fixed beamforming module to obtain a combined audio signal; the combined audio signal represents the sum of the voice signal and the interference signal in the target area.
Step 405: Process the multiple audio signals with the differential matrix module to obtain an interference reference signal.
Step 406: If a voice signal is present in the current audio, determine the area where the voice signal is located.
Step 407: If the area where the voice signal is located and the target area satisfy the preset condition, input the interference reference signal into the target adaptive interference canceller model to obtain the target interference signal; the target interference signal represents the predicted interference signal present in the current target area.
Step 408: Obtain the voice signal from the difference between the combined audio signal and the target interference signal.
Step 409: Perform speech recognition on the voice signal.
Step 410: When the area where the voice signal is located and the target area satisfy the preset condition, take the combined audio signal and the interference reference signal as a pair of training samples, train the target adaptive interference canceller model with the training samples at the second learning rate, and update the target adaptive interference canceller model; the second learning rate is smaller than the first learning rate.
It should be noted that a single-source localization algorithm represents a sound source localization algorithm with high computational efficiency but slightly lower reliability. Specifically, the single-source localization algorithm takes the direction of the most energetic sound at the current moment as the direction of the speech; its computation is therefore simpler than that of a multi-source localization algorithm, but its accuracy is lower.
Because the reliability of the single-source localization algorithm used in this embodiment is slightly lower, the learning and updating of the target adaptive interference canceller model cannot simply be stopped. Therefore, when the single-source localization algorithm is used, this embodiment trains the target adaptive interference canceller model again: the combined audio signal and the interference reference signal are taken as a pair of training samples, the model is trained with them at the second learning rate, and the model is updated. Since the accuracy of the single-source localization algorithm is poorer, a smaller learning rate can be used to train the model; setting the second learning rate smaller than the first learning rate, that is, reducing the learning rate of the adaptive interference canceller model, improves the precision of this round of learning.
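The learning-rate policy of the above embodiments can be summarized as a small decision function (a sketch; the concrete values of mu1 and mu2 are assumptions, and only their ordering, the freeze at 0, and the four cases come from the text):

```python
def select_learning_rate(has_speech: bool, in_target_area: bool,
                         multi_source_localizer: bool,
                         mu1: float = 0.5, mu2: float = 0.05) -> float:
    """Choose the adaptation rate for the target adaptive interference
    canceller model according to the current situation."""
    if not has_speech:
        return mu1   # no speech anywhere: safe to adapt at the first rate
    if not in_target_area:
        return mu1   # speech outside the target area: treat it as interference
    if multi_source_localizer:
        return 0.0   # reliable localizer confirms target speech: stop learning
    return mu2       # single-source localizer: keep adapting, but cautiously
```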
Steps 401 to 409 above are the same as steps 301 to 309 of the previous embodiment, and are not repeated here to avoid repetition.
The division of the steps of the above methods is only for clarity of description; in implementation, steps may be merged into one or split into multiple steps, and all such variants fall within the protection scope of this patent as long as they contain the same logical relationship. Adding insignificant modifications to, or introducing insignificant designs into, an algorithm or flow without changing its core design also falls within the protection scope of this patent.
An embodiment of the present application relates to a voice interaction device, as shown in FIG. 5, including at least one processor 501 and a memory 502 communicatively connected to the at least one processor 501, where the memory 502 stores instructions executable by the at least one processor 501, and the instructions are executed by the at least one processor 501 to enable the at least one processor 501 to perform the voice interaction method described above.
The memory 502 and the processor 501 are connected by a bus. The bus may include any number of interconnected buses and bridges, and links together the various circuits of one or more processors 501 and of the memory 502. The bus may also link together various other circuits, such as peripherals, voltage regulators, and power management circuits, all of which are well known in the art and therefore not described further here. A bus interface provides an interface between the bus and a transceiver. The transceiver may be one element or multiple elements, for example multiple receivers and transmitters, providing a unit for communicating with various other apparatuses over a transmission medium. Data processed by the processor 501 is transmitted over a wireless medium via an antenna; further, the antenna also receives data and transfers the data to the processor 501.
The processor 501 is responsible for managing the bus and for general processing, and may also provide various functions including timing, peripheral interfacing, voltage regulation, power management, and other control functions. The memory 502 may be used to store data used by the processor 501 when performing operations.
An embodiment of the present application relates to a computer-readable storage medium storing a computer program. The computer program, when executed by a processor, implements the above method embodiments.
An embodiment of the present application relates to a computer program. The computer program, when executed by a processor, implements the above method embodiments.
That is, those skilled in the art will understand that all or part of the steps of the methods of the above embodiments can be implemented by instructing the relevant hardware through a program. The program is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Those of ordinary skill in the art will understand that the above embodiments are specific embodiments for implementing the present application, and that in practical applications various changes in form and detail may be made to them without departing from the spirit and scope of the present application.

Claims (11)

  1. A voice interaction method, applied to a voice interaction device, wherein the voice interaction device comprises a microphone array composed of a plurality of microphones; the method comprising:
    receiving, through the microphone array, wake-up audio from an area outside the voice interaction device;
    determining a target area where the wake-up audio is located;
    receiving current audio of the area outside the voice interaction device through each of the microphones to obtain an audio signal corresponding to the current audio at each of the microphones;
    if a voice signal is present in the current audio, determining an area where the voice signal is located;
    if the area where the voice signal is located and the target area satisfy a preset condition, obtaining the voice signal from the plurality of audio signals; and
    performing speech recognition on the voice signal.
  2. The voice interaction method according to claim 1, wherein before the obtaining the voice signal from the plurality of audio signals, the method further comprises:
    processing the plurality of audio signals with a fixed beamforming module to obtain a combined audio signal, the combined audio signal representing the sum of all audio signals in the target area; and
    processing the plurality of audio signals with a differential matrix module to obtain an interference reference signal;
    wherein the obtaining the voice signal from the plurality of audio signals specifically comprises:
    inputting the interference reference signal into a target adaptive interference canceller model to obtain a target interference signal, the target interference signal representing a predicted interference signal present in the current target area; and
    obtaining the voice signal from the difference between the combined audio signal and the target interference signal.
  3. The voice interaction method according to claim 2, wherein after the receiving the current audio from the area outside the voice interaction device through each of the microphones to obtain the audio signal corresponding to the current audio at each of the microphones, the method further comprises:
    when no voice signal is present in the current audio, taking the combined audio signal and the interference reference signal as a pair of training samples, training the target adaptive interference canceller model with the training samples at a first learning rate, and updating the target adaptive interference canceller model.
  4. The voice interaction method according to claim 2 or 3, wherein after the determining, when a voice signal is present in the current audio, the area where the voice signal is located, the method further comprises:
    when the area where the voice signal is located and the target area do not satisfy the preset condition, taking the combined audio signal and the interference reference signal as a pair of training samples, training the target adaptive interference canceller model with the training samples at the first learning rate, and updating the target adaptive interference canceller model.
  5. The voice interaction method according to claim 3 or 4, wherein after the updating the target adaptive interference canceller model, the method further comprises:
    training the target adaptive interference canceller model with the training samples again, and updating the target adaptive interference canceller model again.
  6. The voice interaction method according to any one of claims 2 to 5, wherein the steps of determining that the voice signal is present in the current audio and determining the area where the voice signal is located are performed by a multi-source localization algorithm;
    after the determining the area where the voice signal is located, the method further comprises:
    when the area where the voice signal is located and the target area satisfy the preset condition, the voice signal being a target voice command signal, stopping the learning and updating of the target adaptive interference canceller model.
  7. The voice interaction method according to any one of claims 2 to 5, wherein the steps of determining that the voice signal is present in the current audio and determining the area where the voice signal is located are performed by a single-source localization algorithm;
    after the determining the area where the voice signal is located, the method further comprises:
    when the area where the voice signal is located and the target area satisfy the preset condition, taking the combined audio signal and the interference reference signal as a pair of training samples, training the target adaptive interference canceller model with the training samples at a second learning rate, and updating the target adaptive interference canceller model, the second learning rate being smaller than the first learning rate.
  8. The voice interaction method according to any one of claims 1 to 7, wherein the data length of the current audio is 10 ms to 30 ms.
  9. A voice interaction device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor, wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the voice interaction method according to any one of claims 1 to 8.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the voice interaction method according to any one of claims 1 to 8.
  11. A computer program, wherein the computer program, when executed by a processor, implements the voice interaction method according to any one of claims 1 to 8.
PCT/CN2021/140554 2021-06-29 2021-12-22 Voice interaction method, voice interaction device, and storage medium WO2023273230A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110732106.0 2021-06-29
CN202110732106.0A CN115223548B (zh) 2021-06-29 2021-06-29 语音交互方法、语音交互设备及存储介质

Publications (1)

Publication Number Publication Date
WO2023273230A1 true WO2023273230A1 (zh) 2023-01-05

Family

ID=83606944

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/140554 WO2023273230A1 (zh) 2021-06-29 2021-12-22 语音交互方法、语音交互设备及存储介质

Country Status (2)

Country Link
CN (1) CN115223548B (zh)
WO (1) WO2023273230A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107591151A (zh) * 2017-08-22 2018-01-16 百度在线网络技术(北京)有限公司 远场语音唤醒方法、装置和终端设备
US20180130485A1 (en) * 2016-11-08 2018-05-10 Samsung Electronics Co., Ltd. Auto voice trigger method and audio analyzer employing the same
US20190080692A1 (en) * 2017-09-12 2019-03-14 Intel Corporation Simultaneous multi-user audio signal recognition and processing for far field audio
CN109599124A (zh) * 2018-11-23 2019-04-09 腾讯科技(深圳)有限公司 一种音频数据处理方法、装置及存储介质
CN109697987A (zh) * 2018-12-29 2019-04-30 苏州思必驰信息科技有限公司 一种外接式的远场语音交互装置及实现方法
CN112735462A (zh) * 2020-12-30 2021-04-30 科大讯飞股份有限公司 分布式麦克风阵列的降噪方法和语音交互方法

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107146614B (zh) * 2017-04-10 2020-11-06 北京猎户星空科技有限公司 一种语音信号处理方法、装置及电子设备
CN107464564B (zh) * 2017-08-21 2023-05-26 腾讯科技(深圳)有限公司 语音交互方法、装置及设备
US9973849B1 (en) * 2017-09-20 2018-05-15 Amazon Technologies, Inc. Signal quality beam selection
JP7162470B2 (ja) * 2018-08-21 2022-10-28 清水建設株式会社 会話音声レベル通知システム及び会話音声レベル通知方法
CN109920405A (zh) * 2019-03-05 2019-06-21 百度在线网络技术(北京)有限公司 多路语音识别方法、装置、设备及可读存储介质
CN112309378B (zh) * 2019-07-24 2023-11-03 广东美的白色家电技术创新中心有限公司 语音识别设备及其唤醒响应方法、计算机存储介质
CN111599361A (zh) * 2020-05-14 2020-08-28 宁波奥克斯电气股份有限公司 一种唤醒方法、装置、计算机存储介质及空调器
CN112309395A (zh) * 2020-09-17 2021-02-02 广汽蔚来新能源汽车科技有限公司 人机对话方法、装置、机器人、计算机设备和存储介质
CN112188368A (zh) * 2020-09-29 2021-01-05 深圳创维-Rgb电子有限公司 定向增强声音的方法及系统
CN112951261B (zh) * 2021-03-02 2022-07-01 北京声智科技有限公司 声源定位方法、装置及语音设备

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180130485A1 (en) * 2016-11-08 2018-05-10 Samsung Electronics Co., Ltd. Auto voice trigger method and audio analyzer employing the same
CN107591151A (zh) * 2017-08-22 2018-01-16 百度在线网络技术(北京)有限公司 远场语音唤醒方法、装置和终端设备
US20190080692A1 (en) * 2017-09-12 2019-03-14 Intel Corporation Simultaneous multi-user audio signal recognition and processing for far field audio
CN109599124A (zh) * 2018-11-23 2019-04-09 腾讯科技(深圳)有限公司 一种音频数据处理方法、装置及存储介质
CN109697987A (zh) * 2018-12-29 2019-04-30 苏州思必驰信息科技有限公司 一种外接式的远场语音交互装置及实现方法
CN112735462A (zh) * 2020-12-30 2021-04-30 科大讯飞股份有限公司 分布式麦克风阵列的降噪方法和语音交互方法

Also Published As

Publication number Publication date
CN115223548B (zh) 2023-03-14
CN115223548A (zh) 2022-10-21

Similar Documents

Publication Publication Date Title
US11798531B2 (en) Speech recognition method and apparatus, and method and apparatus for training speech recognition model
US20230031491A1 (en) Voice Awakening Method and Apparatus, Device, and Medium
CN106910500B (zh) 对带麦克风阵列的设备进行语音控制的方法及设备
WO2021136037A1 (zh) 语音唤醒方法、设备及系统
CN102938254B (zh) 一种语音信号增强系统和方法
CN101903948B (zh) 用于基于多麦克风的语音增强的系统、方法及设备
CN107464565B (zh) 一种远场语音唤醒方法及设备
CN109461456B (zh) 一种提升语音唤醒成功率的方法
CN104810021A (zh) 应用于远场识别的前处理方法和装置
US20210020188A1 (en) Echo Cancellation Using A Subset of Multiple Microphones As Reference Channels
US20190333498A1 (en) Processing audio signals
CN111402877B (zh) 基于车载多音区的降噪方法、装置、设备和介质
CN103347136B (zh) 手机节能处理方法和装置
US10540973B2 (en) Electronic device for performing operation corresponding to voice input
CN107240396B (zh) 说话人自适应方法、装置、设备及存储介质
KR102555801B1 (ko) 노이즈 제거 알고리즘 디버깅 방법, 장치 및 전자기기
WO2020238203A1 (zh) 降噪方法、降噪装置及可实现降噪的设备
WO2023273230A1 (zh) 语音交互方法、语音交互设备及存储介质
CN113096677B (zh) 一种智能降噪的方法及相关设备
CN110867178B (zh) 一种多通道远场语音识别方法
CN108877828A (zh) 语音增强方法/系统、计算机可读存储介质及电子设备
WO2022052691A1 (zh) 基于多设备的语音处理方法、介质、电子设备及系统
CN111739515A (zh) 语音识别方法、设备、电子设备和服务器、相关系统
CN102655558B (zh) 一种双端发音鲁棒结构及其消除声学回声的方法
CN111586512B (zh) 一种防啸叫方法、电子设备及计算机可读存储介质

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21948155

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE