WO2023093029A1 - Wake-up word energy calculation method and system, voice wake-up system, and storage medium - Google Patents

Wake-up word energy calculation method and system, voice wake-up system, and storage medium

Info

Publication number
WO2023093029A1
Authority
WO
WIPO (PCT)
Prior art keywords
wake
word
energy
spectrum
voice
Prior art date
Application number
PCT/CN2022/101249
Other languages
English (en)
French (fr)
Inventor
贾基东
Original Assignee
青岛海尔科技有限公司
海尔智家股份有限公司
Priority date
Filing date
Publication date
Application filed by 青岛海尔科技有限公司, 海尔智家股份有限公司
Publication of WO2023093029A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

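Embodiments of the present disclosure provide a wake-up word energy calculation method and system, a voice wake-up system, and a storage medium. The method includes: acquiring a wake-up word audio signal; performing a first conversion on the wake-up word audio signal to obtain a short-time energy spectrum of the wake-up word audio; taking the logarithm of the short-time energy spectrum to obtain a log spectrum of the wake-up word audio; inputting the log spectrum into a preset neural network model so that the model generates a predicted probability matrix from the log spectrum; binarizing the predicted probability matrix to obtain a binary matrix; and performing a second conversion on the short-time energy spectrum and the binary matrix to determine the wake-up word speech energy of the wake-up word audio signal. By introducing the preset neural network model to estimate the wake-up word speech component, the present disclosure distinguishes noise time-frequency points from wake-up word time-frequency points more accurately across application scenarios, which improves the robustness and accuracy of the computed wake-up word energy under background noise.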

Description

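Wake-up word energy calculation method and system, voice wake-up system, and storage medium
The present disclosure claims priority to Chinese patent application No. 202111425576.9, filed with the China Patent Office on November 26, 2021 and entitled "Wake-up word energy calculation method and system, voice wake-up system, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of smart homes, and in particular to a wake-up word energy calculation method and system, a voice wake-up system, and a storage medium.
Background
With the spread of smart homes, more and more household electronic devices embed a voice assistant so that they can be controlled by voice, raising the level of home intelligence. However, when several electronic devices share the same or similar wake-up keywords, a single wake-up keyword spoken by the user often causes multiple devices to respond at once, which degrades the user experience.
Summary
The purpose of the embodiments of the present disclosure is to provide a wake-up word energy calculation method and system, a voice wake-up system, and a storage medium, so as to improve the accuracy and robustness of wake-up word energy calculation under background noise. The specific technical solutions are as follows:
The wake-up word energy calculation method and system, voice wake-up system, and storage medium provided by the embodiments of the present disclosure introduce a preset neural network model to estimate the wake-up word speech component in the wake-up word audio. Compared with the prior art, the present disclosure therefore distinguishes noise time-frequency points from wake-up word time-frequency points in the wake-up word audio more accurately across application scenarios, which improves the robustness and accuracy of the computed wake-up word energy under background noise. In addition, the preset neural network model allows the internal parameters involved in computing the wake-up word energy to be updated dynamically for different application scenarios, which improves the applicability of the present disclosure to those scenarios. Finally, because the present disclosure can be deployed on an existing distributed voice wake-up system without modifying the hardware, its general applicability is further improved. In summary, the present disclosure improves the accuracy and robustness of wake-up word energy calculation under background noise.
Of course, a product or method implementing the present disclosure does not necessarily need to achieve all of the above advantages at the same time.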
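Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present disclosure or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present disclosure; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a wake-up word energy calculation method provided by an embodiment of the present disclosure;
FIG. 2 is a block diagram of a wake-up word energy calculation system provided by an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present disclosure.
An embodiment of the present disclosure provides a wake-up word energy calculation method. As shown in FIG. 1, the method includes: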
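S101: Acquire a wake-up word audio signal.
Optionally, in one embodiment of the present disclosure, the device that acquires the wake-up word audio signal may be a sound collection device deployed on a smart home electronic device.
Optionally, in another embodiment of the present disclosure, the wake-up word audio signal may be an audio signal that contains both the speech signal of the wake-up keyword and the scene noise signal of the scene in which the distributed voice wake-up system is located.
S102: Perform a first conversion on the wake-up word audio signal to obtain the short-time energy spectrum of the wake-up word audio.
Optionally, in one embodiment of the present disclosure, the first conversion may include a short-time Fourier transform (STFT), a modulus operation, and a squaring operation. The first conversion may proceed as follows: apply the STFT to the wake-up word audio signal to obtain its short-time spectrum, then apply the modulus and squaring operations to that short-time spectrum to obtain the short-time energy spectrum of the wake-up word audio signal.
The STFT is suited to analyzing the spectrum of slowly time-varying signals. The speech signal is first divided into frames, and a Fourier transform is then applied to each frame. Each frame of speech can thus be regarded as a segment cut from a different stationary waveform, and the short-time spectrum of each frame approximates the spectrum of that stationary waveform. Applying the modulus and squaring operations to the short-time spectrum then yields the short-time energy spectrum, which characterizes how the energy of the speech signal is distributed over frequency. The STFT, modulus, and squaring operations are common speech preprocessing techniques and are not described further here.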
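The first conversion of step S102 (framing, per-frame Fourier transform, modulus, square) can be sketched with NumPy as follows. The frame length, hop size, and Hann window are illustrative assumptions; the source does not specify them, and the function name is hypothetical.

```python
import numpy as np

def short_time_energy_spectrum(signal, frame_len=512, hop=256):
    """Return |STFT|^2 with shape (num_frames, frame_len // 2 + 1).

    frame_len, hop and the Hann window are assumed values, not from the source.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        # Zero-pad short inputs so that at least one full frame exists.
        signal = np.pad(signal, (0, frame_len - signal.size))
    num_frames = 1 + (signal.size - frame_len) // hop
    window = np.hanning(frame_len)
    # Build an index matrix so each row selects one overlapping frame.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    frames = signal[idx] * window
    spectrum = np.fft.rfft(frames, axis=1)  # short-time spectrum (complex)
    return np.abs(spectrum) ** 2            # modulus, then square
```

Each row of the result is one frame and each column one frequency bin, so the matrix is indexed by time-frequency point, which is the layout the later steps operate on.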
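S103: Take the logarithm of the short-time energy spectrum to obtain the log spectrum of the wake-up word audio.
Optionally, in one embodiment of the present disclosure, taking the logarithm of the short-time energy spectrum converts the wake-up word audio signal from time-domain data into log-spectrum features and compresses the dynamic range of the wake-up word feature data in the signal. This keeps the log-spectrum data of the wake-up word audio that is used in the subsequent neural network computation complete, and thereby improves the accuracy of the final wake-up word energy.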
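A small sketch of step S103 shows the dynamic-range compression numerically. The natural logarithm and the small additive constant `eps` (which avoids taking the log of zero) are assumptions; the source names neither a base nor a floor.

```python
import numpy as np

def log_spectrum(energy_spec, eps=1e-10):
    """Element-wise natural log of an energy spectrum; eps is an assumed floor."""
    return np.log(np.asarray(energy_spec, dtype=np.float64) + eps)

# Energies spanning twelve orders of magnitude shrink to a span of about 28.
example = np.array([[1e-6, 1.0, 1e6]])
compressed = log_spectrum(example)
```

The output has the same shape as the input, so every value still corresponds to one time-frequency point.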
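S104: Input the log spectrum into a preset neural network model so that the model generates a predicted probability matrix from the log spectrum.
Optionally, in one embodiment of the present disclosure, the preset neural network model may be a convolutional neural network (CNN). The present disclosure builds a CNN-based classification network that models scene noise and wake-up word audio, computes for each time-frequency point in the log spectrum of the input wake-up word audio the probability that the point belongs to the wake-up data, and maps these probabilities into a probability matrix.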
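The source does not describe the CNN architecture, so the following is only a toy stand-in for step S104: a single zero-padded 2-D convolution followed by a sigmoid, which yields one probability per input time-frequency point. The layer count, kernel, bias, and function names are all hypothetical, and a trained model would supply the weights.

```python
import numpy as np

def conv2d_same(x, kernel):
    """2-D cross-correlation with zero padding; output shape equals input shape.

    Assumes an odd-sized kernel.
    """
    x = np.asarray(x, dtype=np.float64)
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def predict_probability_matrix(log_spec, kernel, bias=0.0):
    """Hypothetical one-layer 'CNN': convolution, then sigmoid into (0, 1)."""
    z = conv2d_same(log_spec, np.asarray(kernel, dtype=np.float64)) + bias
    z = np.clip(z, -50.0, 50.0)  # keep exp() well inside floating-point range
    return 1.0 / (1.0 + np.exp(-z))
```

The sigmoid output can be read as the per-point probability of belonging to the wake-up data, matching the interpretation of the probability matrix given in the text.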
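The prior art obtains the threshold for distinguishing scene noise from wake-up audio by computing the energy of multiple frames of data. This in effect assumes that the scene noise is stationary and much lower in energy than the wake-up word. In practical application scenarios these assumptions are hard to satisfy, so the computed wake-up word energy is seriously inaccurate. Moreover, the coefficients and parameters used to compute the threshold in the prior art are usually obtained in a preset static scenario and are not updated for the actual application scenario after deployment, which reduces their general applicability and makes the computed wake-up word energy even less accurate. By introducing a CNN to estimate the wake-up word speech component in the wake-up word audio, the present disclosure, compared with the prior art, adapts to different application scenarios and dynamically adjusts its internal parameters for them, which improves the accuracy of the final wake-up word energy.
S105: Binarize the predicted probability matrix to obtain a binary matrix.
Optionally, in one embodiment of the present disclosure, because the dimensions of the predicted probability matrix generated in step S104 differ from those of the short-time energy spectrum of the wake-up word audio generated in step S102, the scalar used as the wake-up word energy cannot be obtained. A matrix binarization operation therefore converts the predicted probability matrix into a binary matrix with the same dimensions as the short-time energy spectrum of the wake-up word audio. During binarization, according to the preset threshold in the preset neural network model, elements of the predicted probability matrix greater than the preset threshold may be set to 1 and elements not greater than it set to 0. This reduces the interfering data in the data used to compute the wake-up word speech energy and improves the accuracy of the final wake-up word speech energy.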
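The thresholding rule of step S105 (greater than the threshold gives 1, otherwise 0) is a one-line NumPy operation. The default threshold of 0.5 is an assumption for illustration; the source only says the threshold is preset.

```python
import numpy as np

def binarize(prob_matrix, threshold=0.5):
    """Set elements strictly greater than threshold to 1, all others to 0."""
    return (np.asarray(prob_matrix) > threshold).astype(np.uint8)

mask = binarize([[0.2, 0.5], [0.51, 0.9]])
```

An element exactly equal to the threshold maps to 0, which follows the "not greater than" wording of the text.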
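S106: Perform a second conversion on the short-time energy spectrum and the binary matrix to determine the wake-up word speech energy of the wake-up word audio signal.
Optionally, in one embodiment of the present disclosure, the second conversion includes, but is not limited to, a matrix Hadamard product and a summation over the matrix dimensions. The Hadamard product is a commonly used element-wise matrix multiplication. Taking the Hadamard product of the short-time energy spectrum and the binary matrix yields a two-dimensional matrix; summing this matrix over both of its dimensions yields the wake-up word speech energy. The Hadamard product selects the time-frequency points of the short-time energy spectrum that belong to the wake-up data, which improves the accuracy of the final wake-up word speech energy.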
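The second conversion of step S106 can be sketched as an element-wise product followed by a sum over both axes. The function name and the explicit shape check are illustrative additions, not part of the source.

```python
import numpy as np

def wake_word_energy(energy_spec, binary_mask):
    """Hadamard product of spectrum and mask, summed over both dimensions."""
    energy_spec = np.asarray(energy_spec, dtype=np.float64)
    binary_mask = np.asarray(binary_mask)
    if energy_spec.shape != binary_mask.shape:
        raise ValueError("energy spectrum and binary matrix must share a shape")
    masked = energy_spec * binary_mask  # keeps only wake-up time-frequency points
    return float(masked.sum())          # scalar wake-up word speech energy

energy = wake_word_energy([[1.0, 2.0], [3.0, 4.0]], [[1, 0], [0, 1]])
```

In this toy input only the two diagonal points are marked as wake-up data, so the result is 1.0 + 4.0.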
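By introducing a preset neural network model to estimate the wake-up word speech component in the wake-up word audio, the present disclosure, compared with the prior art, distinguishes noise time-frequency points from wake-up word time-frequency points in the wake-up word audio more accurately across application scenarios, which improves the robustness and accuracy of the computed wake-up word energy under background noise. In addition, the preset neural network model allows the internal parameters involved in computing the wake-up word energy to be updated dynamically for different application scenarios, which improves the applicability of the present disclosure to those scenarios. Finally, because the present disclosure can be deployed on an existing distributed voice wake-up system without modifying the hardware, its general applicability is further improved. In summary, the present disclosure improves the accuracy and robustness of wake-up word energy calculation under background noise.
Optionally, the training process of the preset neural network model includes:
inputting the log spectrum of noisy speech data into an initial neural network for processing to obtain a predicted training probability matrix;
computing the error between the training probability matrix and a label matrix based on a cross-entropy loss function;
iteratively updating the initial neural network according to the error with a preset optimization algorithm until a training stop condition is met, to obtain the preset neural network model.
Optionally, in one embodiment of the present disclosure, the noisy speech data used in training the preset neural network model may be synthesized: noise training data is inserted into the wake-up word speech training data at a preset signal-to-noise ratio to obtain the noisy speech data. The training data may be recordings of wake-up word speech without background noise.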
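One plausible reading of "inserting noise at a preset signal-to-noise ratio" is additive mixing with the noise scaled to reach the target SNR. That reading, the tiling of short noise clips, and the function name are assumptions; the source does not spell out the mixing procedure.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech so the mixture has the requested SNR in dB."""
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Repeat or crop the noise so it covers the whole clean recording.
    reps = int(np.ceil(clean.size / noise.size))
    noise = np.tile(noise, reps)[:clean.size]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose the gain so that p_clean / (gain^2 * p_noise) == 10^(snr_db / 10).
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```

The sketch assumes the noise clip is not silent; an all-zero noise signal would make the gain undefined.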
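Optionally, in another embodiment of the present disclosure, the label matrix may be obtained by applying the first conversion, the logarithm, and binarization to the training data used to train the initial neural network. After the log spectrum of the training data is obtained, the preset threshold is selected according to the background noise level of the noise training data; time-frequency points of the log spectrum greater than the preset threshold are set to 1, and time-frequency points less than the preset threshold are set to 0. Because the label matrix is generated by converting the training data, the time-frequency points in the label matrix all belong to the wake-up data. The present disclosure uses the label matrix, together with the cross-entropy loss function and an optimization algorithm based on adaptive moment estimation (Adam), to iteratively update the initial neural network, which improves the accuracy with which the preset neural network screens and identifies the wake-up data in wake-up word audio.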
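The label construction and the loss can be sketched as below. The element-wise binary form of the cross-entropy, the clipping constant `eps`, and the function names are assumptions; the source only names a cross-entropy loss, and the Adam update itself is left to a training framework.

```python
import numpy as np

def label_matrix(clean_log_spec, threshold):
    """1 where the clean log spectrum exceeds the threshold, else 0."""
    return (np.asarray(clean_log_spec) > threshold).astype(np.float64)

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean element-wise binary cross-entropy between probabilities and labels."""
    pred = np.clip(np.asarray(pred, dtype=np.float64), eps, 1.0 - eps)
    target = np.asarray(target, dtype=np.float64)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))

labels = label_matrix([[-5.0, 2.0], [3.0, -1.0]], threshold=0.0)
```

A prediction that matches the labels gives a loss near zero, while an uninformative prediction of 0.5 everywhere gives ln 2, which is a convenient sanity check during training.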
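Optionally, in another embodiment of the present disclosure, the training stop condition may be that the loss of the initial neural network on the validation set no longer decreases within a preset period.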
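This stop condition resembles what is commonly called early stopping. A minimal sketch, assuming the "preset period" is counted in epochs (the `patience` parameter and the function name are hypothetical):

```python
def should_stop(val_losses, patience=5):
    """True if none of the last `patience` losses beats the earlier best."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before
```

The training loop would append the validation loss after each epoch and break once `should_stop` returns True.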
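Optionally, before the initial neural network is trained, the method further includes training data processing and/or training data feature extraction.
The training data processing includes:
performing the first conversion on the wake-up word speech training data to obtain the short-time energy spectrum of the training data; taking the logarithm of that short-time energy spectrum to obtain the log spectrum of the training data; and binarizing the log spectrum of the training data to obtain the label matrix.
The training data feature extraction includes:
inserting noise training data into the training data at a signal-to-noise ratio to obtain the noisy speech data; performing the first conversion on the noisy speech data to obtain its short-time energy spectrum; and taking the logarithm of that short-time energy spectrum to obtain the log spectrum of the noisy speech data.
Optionally, inputting the log spectrum into the preset neural network model so that the model generates a predicted probability matrix from the log spectrum includes:
the preset neural network maps the time-frequency points of the received log spectrum into the predicted probability matrix, each element of which represents the probability that the corresponding time-frequency point belongs to the wake-up data.
Optionally, binarizing the predicted probability matrix to obtain a binary matrix includes:
binarizing the predicted probability matrix according to the preset threshold in the preset neural network model to obtain the binary matrix, where binarization checks whether each element of the predicted probability matrix is greater than the preset threshold: if so, the element is set to 1; if not, it is set to 0.
Optionally, the method is applied to a distributed voice wake-up system, and the method further includes:
each of multiple electronic devices in the distributed voice wake-up system computes its own wake-up word speech energy according to the method and compares it with the wake-up word speech energy of the other devices; the device with the largest wake-up word speech energy performs the wake-up operation, and all other devices do not.
Optionally, in one embodiment of the present disclosure, every electronic device in the distributed voice wake-up system is configured to execute the wake-up word energy calculation method described above. The electronic devices include, but are not limited to, smart home electronic devices and smart communication devices.
Optionally, in another embodiment of the present disclosure, a device decides whether to perform the wake-up operation from its own computed wake-up word speech energy and the wake-up word speech energies computed by the other devices in the distributed voice wake-up system:
Figure PCTCN2022101249-appb-000001
Here, E_i is the wake-up word speech energy of device i, and max E_j is the largest of the wake-up word speech energies computed by the devices j other than device i. When E_i > max E_j, the device determines that its own wake-up word speech energy is the largest and performs the wake-up operation; the other devices do not. Deploying the method on every electronic device of the distributed voice wake-up system improves the system's robustness to background noise.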
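The per-device decision E_i > max E_j can be sketched as a small function that each device evaluates locally. How the energies are exchanged between devices is not described in the source; the function name and the behavior when no other device reports are assumptions.

```python
def should_wake(own_energy, other_energies):
    """True if this device's energy is strictly greater than every other one."""
    others = list(other_energies)
    if not others:
        return True  # assumed behavior: a lone device always responds
    return own_energy > max(others)
```

Because the comparison is strict, two devices reporting exactly the same energy would both stay silent; the source does not say how such a tie should be resolved.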
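Optionally, the wake-up word audio signal is an audio signal that contains both the speech signal of the wake-up keyword and the scene noise signal of the scene in which the distributed voice wake-up system is located.
By introducing a preset neural network model to estimate the wake-up word speech component in the wake-up word audio, the present disclosure, compared with the prior art, distinguishes noise time-frequency points from wake-up word time-frequency points in the wake-up word audio more accurately across application scenarios, which improves the robustness and accuracy of the computed wake-up word energy under background noise. In addition, the preset neural network model allows the internal parameters involved in computing the wake-up word energy to be updated dynamically for different application scenarios, which improves the applicability of the present disclosure to those scenarios. Finally, because the present disclosure can be deployed on an existing distributed voice wake-up system without modifying the hardware, its general applicability is further improved. In summary, the present disclosure improves the accuracy and robustness of wake-up word energy calculation under background noise.
Corresponding to the above embodiment of the wake-up word energy calculation method, the present disclosure further provides a wake-up word energy calculation system applied to a distributed voice wake-up system. As shown in FIG. 2, the system includes:
a signal acquisition module 201, configured to acquire a wake-up word audio signal;
a first conversion module 202, configured to perform a first conversion on the wake-up word audio signal to obtain the short-time energy spectrum of the wake-up word audio;
a second conversion module 203, configured to take the logarithm of the short-time energy spectrum to obtain the log spectrum of the wake-up word audio;
a matrix generation module 204, configured to input the log spectrum into a preset neural network model so that the model generates a predicted probability matrix from the log spectrum;
a third conversion module 205, configured to binarize the predicted probability matrix to obtain a binary matrix;
a fourth conversion module 206, configured to perform a second conversion on the short-time energy spectrum and the binary matrix to determine the wake-up word speech energy of the wake-up word audio signal.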
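The data flow through modules 201 to 206 can be sketched end to end as a single function. The `model` argument stands in for the trained network and is any callable that returns one probability per time-frequency point; the framing parameters, Hann window, natural log, and 0.5 threshold are illustrative assumptions rather than values from the source.

```python
import numpy as np

def compute_wake_word_energy(signal, model, threshold=0.5,
                             frame_len=512, hop=256, eps=1e-10):
    """Chain the six modules; `model` maps a log spectrum to probabilities."""
    signal = np.asarray(signal, dtype=np.float64)        # output of module 201
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")
    num_frames = 1 + (signal.size - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)
    energy = np.abs(np.fft.rfft(frames, axis=1)) ** 2    # module 202
    log_spec = np.log(energy + eps)                      # module 203
    prob = np.asarray(model(log_spec))                   # module 204
    mask = (prob > threshold).astype(np.float64)         # module 205
    return float((energy * mask).sum())                  # module 206
```

Passing a model that returns all ones recovers the total signal energy, and one that returns all zeros gives 0.0, which bounds the result of any real mask.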
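Optionally, the system further includes:
a model training module, configured to: input the log spectrum of noisy speech data into an initial neural network for processing to obtain a predicted training probability matrix; compute the error between the training probability matrix and a label matrix based on a cross-entropy loss function; and iteratively update the initial neural network according to the error with a preset optimization algorithm until a training stop condition is met, to obtain the preset neural network model.
Optionally, the system further includes:
a training data processing module, configured to: perform the first conversion on the wake-up word speech training data to obtain the short-time energy spectrum of the training data; take the logarithm of that short-time energy spectrum to obtain the log spectrum of the training data; and binarize the log spectrum of the training data to obtain the label matrix;
and/or a training data feature extraction module, configured to: insert noise training data into the training data at a signal-to-noise ratio to obtain the noisy speech data; perform the first conversion on the noisy speech data to obtain its short-time energy spectrum; and take the logarithm of that short-time energy spectrum to obtain the log spectrum of the noisy speech data.
Optionally, the matrix generation module 204 is configured such that:
the preset neural network in the matrix generation module 204 maps the time-frequency points of the received log spectrum into the predicted probability matrix, each element of which represents the probability that the corresponding time-frequency point belongs to the wake-up data.
Optionally, the third conversion module 205 is configured to:
binarize the predicted probability matrix according to the preset threshold in the preset neural network model to obtain the binary matrix, where binarization checks whether each element of the predicted probability matrix is greater than the preset threshold: if so, the element is set to 1; if not, it is set to 0.
Optionally, the system further includes:
a device wake-up module, configured to control the multiple electronic devices in the distributed voice wake-up system so that each device computes its own wake-up word speech energy according to the method and compares it with the wake-up word speech energy of the other devices; the device with the largest wake-up word speech energy performs the wake-up operation, and all other devices do not.
Optionally, the wake-up word audio signal is an audio signal that contains both the speech signal of the wake-up keyword and the scene noise signal of the scene in which the distributed voice wake-up system is located.
An embodiment of the present disclosure provides a voice wake-up system, which includes:
multiple electronic devices, each configured to execute instructions so as to implement the wake-up word energy calculation method of any of the above.
An embodiment of the present disclosure provides a computer-readable storage medium. When the instructions in the computer-readable storage medium are executed by a processor of an electronic device, the device is able to perform the wake-up word energy calculation method of any of the above.
Memory may take the form of non-permanent memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip. Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
Those skilled in the art will appreciate that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
It should be noted that relational terms such as "first" and "second" are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. The terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, article, or device. Without further limitation, an element introduced by "includes a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes it.
The embodiments in this specification are described in a related manner; for parts that are the same or similar across embodiments, reference may be made between them, and each embodiment focuses on its differences from the others. In particular, the system embodiment is described briefly because it is largely similar to the method embodiment; for relevant details, see the corresponding description of the method embodiment.
The above are merely embodiments of the present application and are not intended to limit it. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of its claims.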

Claims (19)

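  1. A wake-up word energy calculation method, the method comprising:
    acquiring a wake-up word audio signal;
    performing a first conversion on the wake-up word audio signal to obtain a short-time energy spectrum of the wake-up word audio;
    taking the logarithm of the short-time energy spectrum to obtain a log spectrum of the wake-up word audio;
    inputting the log spectrum into a preset neural network model, so that the preset neural network model generates a predicted probability matrix according to the log spectrum;
    binarizing the predicted probability matrix to obtain a binary matrix;
    performing a second conversion on the short-time energy spectrum and the binary matrix to determine the wake-up word speech energy of the wake-up word audio signal.
  2. The method according to claim 1, wherein the training process of the preset neural network model comprises:
    inputting a log spectrum of noisy speech data into an initial neural network for processing to obtain a predicted training probability matrix;
    calculating an error value between the training probability matrix and a label matrix based on a cross-entropy loss function;
    iteratively updating the initial neural network according to the error value by using a preset optimization algorithm until a training stop condition is met, to obtain the preset neural network model.
  3. The method according to claim 2, wherein, before the initial neural network is trained, the method further comprises: training data processing and/or training data feature extraction,
    wherein the training data processing comprises:
    performing the first conversion on training data of wake-up word speech to obtain a short-time energy spectrum of the training data; taking the logarithm of the short-time energy spectrum of the training data to obtain a log spectrum of the training data; and binarizing the log spectrum of the training data to obtain the label matrix;
    wherein the training data feature extraction comprises:
    adding noise training data to the training data according to a signal-to-noise ratio to obtain the noisy speech data; performing the first conversion on the noisy speech data to obtain a short-time energy spectrum of the noisy speech data; and taking the logarithm of the short-time energy spectrum of the noisy speech data to obtain the log spectrum of the noisy speech data.
  4. The method according to claim 1, wherein inputting the log spectrum into the preset neural network model, so that the preset neural network model generates the predicted probability matrix according to the log spectrum, comprises:
    mapping, by the preset neural network, the time-frequency points of the received log spectrum to the predicted probability matrix, wherein each element of the predicted probability matrix represents the probability that the time-frequency point corresponding to that element belongs to wake-up data.
  5. The method according to claim 4, wherein binarizing the predicted probability matrix to obtain the binary matrix comprises:
    binarizing the predicted probability matrix according to a preset threshold in the preset neural network model to obtain the binary matrix, wherein the binarizing determines whether each element of the predicted probability matrix is greater than the preset threshold; if so, the element is set to 1; if the element is not greater than the preset threshold, the element is set to 0.
  6. The method according to claim 1, wherein the method is applied to a distributed voice wake-up system, and the method further comprises:
    calculating, by each of multiple electronic devices in the distributed voice wake-up system, the wake-up word speech energy of that device according to the method, and comparing the wake-up word speech energy of that device with the wake-up word speech energy of the other devices, wherein the device with the largest wake-up word speech energy performs a wake-up operation, and the devices other than the device performing the wake-up operation do not perform the wake-up operation.
  7. The method according to claim 1, wherein the wake-up word audio signal is an audio signal that contains a speech signal including a wake-up keyword and a scene noise signal of the scene in which the distributed voice wake-up system is located.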
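  8. An electronic device, the electronic device comprising:
    a signal acquisition module, configured to acquire a wake-up word audio signal;
    a first conversion module, configured to perform a first conversion on the wake-up word audio signal to obtain a short-time energy spectrum of the wake-up word audio;
    a second conversion module, configured to take the logarithm of the short-time energy spectrum to obtain a log spectrum of the wake-up word audio;
    a matrix generation module, configured to input the log spectrum into a preset neural network model, so that the preset neural network model generates a predicted probability matrix according to the log spectrum;
    a third conversion module, configured to binarize the predicted probability matrix to obtain a binary matrix;
    a fourth conversion module, configured to perform a second conversion on the short-time energy spectrum and the binary matrix to determine the wake-up word speech energy of the wake-up word audio signal.
  9. The electronic device according to claim 8, wherein the matrix generation module is configured such that:
    the preset neural network in the matrix generation module maps the time-frequency points of the received log spectrum to the predicted probability matrix, wherein each element of the predicted probability matrix represents the probability that the time-frequency point corresponding to that element belongs to wake-up data.
  10. The electronic device according to claim 9, wherein the third conversion module is configured to:
    binarize the predicted probability matrix according to a preset threshold in the preset neural network model to obtain the binary matrix, wherein the binarizing determines whether each element of the predicted probability matrix is greater than the preset threshold; if so, the element is set to 1; if the element is not greater than the preset threshold, the element is set to 0.
  11. The electronic device according to claim 8, wherein the electronic device further comprises:
    a device wake-up module, configured to compare the wake-up word speech energy of this device with the wake-up word speech energy of other devices, to perform a wake-up operation when this device is the device with the largest wake-up word speech energy, and not to perform the wake-up operation when this device is not the device with the largest wake-up word speech energy.
  12. The electronic device according to claim 8, wherein the wake-up word audio signal is an audio signal that contains a speech signal including a wake-up keyword and a scene noise signal of the scene in which the distributed voice wake-up system is located.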
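  13. A wake-up word energy calculation system, the system being applied to a distributed voice wake-up system, the system comprising:
    a signal acquisition module, configured to acquire a wake-up word audio signal;
    a first conversion module, configured to perform a first conversion on the wake-up word audio signal to obtain a short-time energy spectrum of the wake-up word audio;
    a second conversion module, configured to take the logarithm of the short-time energy spectrum to obtain a log spectrum of the wake-up word audio;
    a matrix generation module, configured to input the log spectrum into a preset neural network model, so that the preset neural network model generates a predicted probability matrix according to the log spectrum;
    a third conversion module, configured to binarize the predicted probability matrix to obtain a binary matrix;
    a fourth conversion module, configured to perform a second conversion on the short-time energy spectrum and the binary matrix to determine the wake-up word speech energy of the wake-up word audio signal.
  14. The wake-up word energy calculation system according to claim 13, wherein the matrix generation module is configured such that:
    the preset neural network in the matrix generation module maps the time-frequency points of the received log spectrum to the predicted probability matrix, wherein each element of the predicted probability matrix represents the probability that the time-frequency point corresponding to that element belongs to wake-up data.
  15. The wake-up word energy calculation system according to claim 14, wherein the third conversion module is configured to:
    binarize the predicted probability matrix according to a preset threshold in the preset neural network model to obtain the binary matrix, wherein the binarizing determines whether each element of the predicted probability matrix is greater than the preset threshold; if so, the element is set to 1; if the element is not greater than the preset threshold, the element is set to 0.
  16. The wake-up word energy calculation system according to claim 13, wherein the wake-up word energy calculation system further comprises:
    a device wake-up module, configured to control multiple electronic devices in the distributed voice wake-up system so that each device calculates its own wake-up word speech energy and compares it with the wake-up word speech energy of the other devices, wherein the device with the largest wake-up word speech energy performs a wake-up operation, and the devices other than the device performing the wake-up operation do not perform the wake-up operation.
  17. The wake-up word energy calculation system according to claim 13, wherein the wake-up word audio signal is an audio signal that contains a speech signal including a wake-up keyword and a scene noise signal of the scene in which the distributed voice wake-up system is located.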
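  18. A voice wake-up system, the system comprising:
    multiple electronic devices, the electronic devices being configured to execute instructions to implement the wake-up word energy calculation method according to any one of claims 1 to 7.
  19. A computer-readable storage medium, wherein, when instructions in the computer-readable storage medium are executed by a processor of an electronic device, the device is enabled to perform the wake-up word energy calculation method according to any one of claims 1 to 7.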
PCT/CN2022/101249 2021-11-26 2022-06-24 Wake-up word energy calculation method, system, voice wake-up system, and storage medium WO2023093029A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111425576.9 2021-11-26
CN202111425576.9A CN114093347A (zh) 2021-11-26 2021-11-26 Wake-up word energy calculation method, system, voice wake-up system, and storage medium

Publications (1)

Publication Number Publication Date
WO2023093029A1 (zh) 2023-06-01

Family

ID=80305091

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/101249 WO2023093029A1 (zh) 2021-11-26 2022-06-24 Wake-up word energy calculation method, system, voice wake-up system, and storage medium

Country Status (2)

Country Link
CN (1) CN114093347A (zh)
WO (1) WO2023093029A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114093347A (zh) * 2021-11-26 2022-02-25 青岛海尔科技有限公司 Wake-up word energy calculation method, system, voice wake-up system, and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
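CN110570858A (zh) * 2019-09-19 2019-12-13 芋头科技(杭州)有限公司 Voice wake-up method and apparatus, smart speaker, and computer-readable storage medium
CN111667838A (zh) * 2020-06-22 2020-09-15 清华大学 Low-power analog-domain feature vector extraction method for voiceprint recognition
CN111739521A (zh) * 2020-06-19 2020-10-02 腾讯科技(深圳)有限公司 Electronic device wake-up method and apparatus, electronic device, and storage medium
CN112509568A (zh) * 2020-11-26 2021-03-16 北京华捷艾米科技有限公司 Voice wake-up method and apparatus
CN113450771A (zh) * 2021-07-15 2021-09-28 维沃移动通信有限公司 Wake-up method, model training method, and apparatus
CN113516990A (zh) * 2020-04-10 2021-10-19 华为技术有限公司 Speech enhancement method, neural network training method, and related device
CN114093347A (zh) * 2021-11-26 2022-02-25 青岛海尔科技有限公司 Wake-up word energy calculation method, system, voice wake-up system, and storage medium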

Also Published As

Publication number Publication date
CN114093347A (zh) 2022-02-25

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22897125

Country of ref document: EP

Kind code of ref document: A1