WO2023246223A1 - Speech enhancement method and apparatus for distributed wake-up, and storage medium - Google Patents

Speech enhancement method and apparatus for distributed wake-up, and storage medium

Info

Publication number
WO2023246223A1
Authority
WO
WIPO (PCT)
Prior art keywords
frequency domain
domain data
beams
speech enhancement
microphone
Prior art date
Application number
PCT/CN2023/085266
Other languages
French (fr)
Chinese (zh)
Inventor
邓邱伟
郝斌
王迪
张丽
Original Assignee
青岛海尔科技有限公司
青岛海尔智能家电科技有限公司
海尔智家股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 青岛海尔科技有限公司, 青岛海尔智能家电科技有限公司, 海尔智家股份有限公司 filed Critical 青岛海尔科技有限公司
Publication of WO2023246223A1 publication Critical patent/WO2023246223A1/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/26 Speech to text systems
    • G10L17/00 Speaker identification or verification
    • G10L17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A speech enhancement method and apparatus for distributed wake-up, and a storage medium, which relate to the technical field of smart homes. The method comprises: acquiring N pieces of frequency-domain data corresponding to N microphones of a microphone array, wherein the N pieces of frequency-domain data are obtained by means of performing a Fourier transform on speech data which is received by the microphone array (S202); determining a delayed addition beam and a delayed subtraction beam of the N pieces of frequency-domain data in each sound pickup direction among M sound pickup directions, so as to obtain M delayed addition beams and M delayed subtraction beams, wherein the N pieces of frequency-domain data all correspond to delayed addition beams and delayed subtraction beams in different sound pickup directions, N is an integer greater than 3, and M is an integer greater than 2 (S204); and inputting the signal amplitudes of the M delayed addition beams and the signal amplitudes of the M delayed subtraction beams into a speech enhancement model, so as to enhance target speech in the speech data by means of the speech enhancement model (S206).

Description

Speech enhancement method and apparatus for distributed wake-up, and storage medium
The present disclosure claims priority to the Chinese patent application No. 202210700223.3, filed with the Chinese Patent Office on June 20, 2022 and entitled "Speech Enhancement Method and Apparatus for Distributed Wake-up, and Storage Medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of smart homes, and in particular to a speech enhancement method and apparatus for distributed wake-up, and a storage medium.
Background
With the development of technology, intelligent voice devices have gradually entered daily life. For an intelligent voice device to hear sound, it needs to rely on a microphone array.
In the related art, sound source separation for a microphone array can be implemented with techniques such as beamforming and AuxIva. However, the low-frequency main lobe of the beams of a four-microphone linear array is wide, so the quality of speech signal processing is poor, and AuxIva on a four-microphone linear array is computationally expensive, making it difficult to meet real-time requirements.
No effective solution has yet been proposed for the problems in the related art that the low-frequency main lobe of the microphone array's beams is wide and the quality of speech signal processing is poor.
Summary
Embodiments of the present disclosure provide a speech enhancement method and apparatus for distributed wake-up, and a storage medium, so as to at least solve the problems in the related art that the low-frequency main lobe of the microphone array's beams is wide and the quality of speech signal processing is poor.
According to an embodiment of the present disclosure, a speech enhancement method for distributed wake-up is provided, including: acquiring N pieces of frequency-domain data corresponding to N microphones of a microphone array, where the N pieces of frequency-domain data are obtained by performing a Fourier transform on speech data received by the microphone array; determining a delayed addition beam and a delayed subtraction beam of the N pieces of frequency-domain data in each of M sound pickup directions, to obtain M delayed addition beams and M delayed subtraction beams, where the N pieces of frequency-domain data correspond to a delayed addition beam and a delayed subtraction beam in each of the different sound pickup directions, N is an integer greater than 3, and M is an integer greater than 2; and inputting the signal amplitudes of the M delayed addition beams and the signal amplitudes of the M delayed subtraction beams into a speech enhancement model, so as to enhance target speech in the speech data by means of the speech enhancement model.
According to another embodiment of the present disclosure, a speech enhancement apparatus for distributed wake-up is further provided, including: an acquisition module, configured to acquire N pieces of frequency-domain data corresponding to N microphones of a microphone array, where the N pieces of frequency-domain data are obtained by performing a Fourier transform on speech data received by the microphone array; a determination module, configured to determine a delayed addition beam and a delayed subtraction beam of the N pieces of frequency-domain data in each of M sound pickup directions, to obtain M delayed addition beams and M delayed subtraction beams, where the N pieces of frequency-domain data correspond to a delayed addition beam and a delayed subtraction beam in each of the different sound pickup directions, N is an integer greater than 3, and M is an integer greater than 2; and an input module, configured to input the signal amplitudes of the M delayed addition beams and the signal amplitudes of the M delayed subtraction beams into a speech enhancement model, so as to enhance target speech in the speech data by means of the speech enhancement model.
According to yet another aspect of the embodiments of the present disclosure, a computer-readable storage medium is further provided. The computer-readable storage medium stores a computer program, where the computer program is configured to perform the above speech enhancement method for distributed wake-up when run.
According to yet another aspect of the embodiments of the present disclosure, an electronic apparatus is further provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor performs the above speech enhancement method for distributed wake-up by means of the computer program.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
To describe the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is apparent that a person of ordinary skill in the art can still derive other drawings from these drawings without creative effort.
Figure 1 is a schematic diagram of a hardware environment of a speech enhancement method for distributed wake-up according to an embodiment of the present disclosure;
Figure 2 is a flowchart of a speech enhancement method for distributed wake-up according to an embodiment of the present disclosure;
Figure 3 is a schematic diagram (1) of a speech enhancement method for distributed wake-up according to an embodiment of the present disclosure;
Figure 4 is a schematic diagram (2) of a speech enhancement method for distributed wake-up according to an embodiment of the present disclosure;
Figure 5 is a structural block diagram of a speech enhancement apparatus for distributed wake-up according to an embodiment of the present disclosure;
Figure 6 is a structural block diagram of an optional electronic apparatus according to an embodiment of the present disclosure.
Detailed Description of the Embodiments
To enable those skilled in the art to better understand the solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are described below clearly and completely with reference to the accompanying drawings of the embodiments. Apparently, the described embodiments are only some rather than all of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative effort shall fall within the protection scope of the present disclosure.
It should be noted that the terms "first", "second" and the like in the specification, the claims and the above drawings of the present disclosure are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present disclosure described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that includes a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product or device.
According to one aspect of the embodiments of the present disclosure, a speech enhancement method for distributed wake-up is provided. The method is widely applicable to whole-house intelligent digital control application scenarios such as Smart Home, smart home appliances, smart home device ecosystems, and Intelligence House ecosystems. Optionally, in this embodiment, the above speech enhancement method for distributed wake-up can be applied to a hardware environment composed of a terminal device 102 and a server 104 as shown in Figure 1. As shown in Figure 1, the server 104 is connected to the terminal device 102 through a network and can be used to provide services (such as application services) for the terminal or a client installed on the terminal. A database can be set up on the server or independently of the server to provide data storage services for the server 104, and cloud computing and/or edge computing services can be configured on the server or independently of the server to provide data computing services for the server 104.
The above network may include, but is not limited to, at least one of the following: a wired network and a wireless network. The wired network may include, but is not limited to, at least one of the following: a wide area network, a metropolitan area network, and a local area network; the wireless network may include, but is not limited to, at least one of the following: WIFI (Wireless Fidelity) and Bluetooth. The terminal device 102 may be, but is not limited to, a PC, a mobile phone, a tablet computer, a smart air conditioner, a smart range hood, a smart refrigerator, a smart oven, a smart stove, a smart washing machine, a smart water heater, a smart washing device, a smart dishwasher, a smart projection device, a smart TV, a smart clothes-drying rack, smart curtains, smart audio-video equipment, a smart socket, smart audio equipment, a smart speaker, smart fresh-air equipment, smart kitchen and bathroom equipment, smart bathroom fixtures, a smart sweeping robot, a smart window-cleaning robot, a smart mopping robot, a smart air purifier, a smart steamer, a smart microwave oven, a smart kitchen appliance, a smart purifier, a smart water dispenser, a smart door lock, and the like.
This embodiment provides a speech enhancement method for distributed wake-up, applied to a terminal device. Figure 2 is a flowchart of the speech enhancement method for distributed wake-up according to an embodiment of the present disclosure, and the flow includes the following steps:
Step S202: acquire N pieces of frequency-domain data corresponding to N microphones of a microphone array, where the N pieces of frequency-domain data are obtained by performing a Fourier transform on speech data received by the microphone array.
It should be noted that the N microphones respectively receive speech data x1, x2, x3, ..., xn, and the speech data x1(t), x2(t), x3(t), ..., xn(t) respectively received by the N microphones are Fourier-transformed to obtain the corresponding N pieces of frequency-domain data X1(f), X2(f), X3(f), ..., XN(f).
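As a minimal sketch of this step (the frame length, hop size and variable names here are illustrative assumptions rather than values taken from this paragraph; the concrete STFT parameters of the example embodiment appear later in the text), the per-microphone frequency-domain data can be computed as follows:

```python
import numpy as np

def stft_per_mic(x, frame_len=512, hop=256):
    """Short-time Fourier transform of one microphone signal x_n(t).

    Returns an array of shape (num_frames, frame_len // 2 + 1) holding the
    frequency-domain data X_n(f) for each frame (a Hann analysis window is assumed).
    """
    window = np.hanning(frame_len)
    num_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(num_frames)])
    return np.fft.rfft(frames, axis=-1)

def mic_array_spectra(mic_signals):
    """mic_signals: list of N equally long 1-D arrays x_1 ... x_N.

    Returns the N pieces of frequency-domain data stacked as (N, frames, bins).
    """
    return np.stack([stft_per_mic(x) for x in mic_signals])
```

The stacked output of mic_array_spectra() is then the raw material from which the target frequency-domain data and the beams of the following steps are built.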
Step S204: determine a delayed addition beam and a delayed subtraction beam of the N pieces of frequency-domain data in each of M sound pickup directions, to obtain M delayed addition beams and M delayed subtraction beams, where the N pieces of frequency-domain data correspond to a delayed addition beam and a delayed subtraction beam in each of the different sound pickup directions, N is an integer greater than 3, and M is an integer greater than 2.
Step S206: input the signal amplitudes of the M delayed addition beams and the signal amplitudes of the M delayed subtraction beams into a speech enhancement model, so as to enhance target speech in the speech data by means of the speech enhancement model.
Through the above steps, N pieces of frequency-domain data corresponding to the N microphones of the microphone array are acquired, the delayed addition beam and the delayed subtraction beam of the N pieces of frequency-domain data in each of the M sound pickup directions are determined to obtain M delayed addition beams and M delayed subtraction beams, and the signal amplitudes of the M delayed addition beams and of the M delayed subtraction beams are input into the speech enhancement model, so that the target speech in the speech data is enhanced by the speech enhancement model. This solves the problems in the related art that the low-frequency main lobe of the microphone array's beams is wide and the quality of speech signal processing is poor: in the embodiments of the present disclosure, the delayed addition beams and delayed subtraction beams of the M sound pickup directions are used as the input of the speech enhancement model, which improves the quality of speech signal processing.
Optionally, determining the delayed addition beam and the delayed subtraction beam of the N pieces of frequency-domain data in each of the M sound pickup directions to obtain the M delayed addition beams and the M delayed subtraction beams includes: for each of the M sound pickup directions, determining target frequency-domain data according to the N pieces of frequency-domain data, and determining a weight vector corresponding to the target frequency-domain data according to the time delays between the N microphones in that sound pickup direction, where the target frequency-domain data is used to indicate the array signal corresponding to the N pieces of frequency-domain data; and determining the M delayed addition beams and the M delayed subtraction beams according to the target frequency-domain data and the weight vector corresponding to the target frequency-domain data.
That is, the target frequency-domain data is determined from the N pieces of frequency-domain data. Once the target frequency-domain data and the weight vector corresponding to the target frequency-domain data are determined, the M delayed addition beams and the M delayed subtraction beams can be determined according to the target frequency-domain data and the weight vector.
Optionally, determining the target frequency-domain data according to the N pieces of frequency-domain data includes: determining a first matrix corresponding to the N pieces of frequency-domain data, where the row information of the first matrix is used to indicate the N pieces of frequency-domain data; and determining the target frequency-domain data according to the first matrix.
Specifically, the target frequency-domain data X(f, θ) is determined according to the following formula:
X(f, θ) = [X1(f), X2(f), X3(f), ..., XN(f)]^T;
where X1(f), X2(f), X3(f), ..., XN(f) respectively denote the N pieces of frequency-domain data, and f is the frequency.
The N pieces of frequency-domain data are arranged in matrix form to obtain the target frequency-domain data X(f, θ). For example, when N is 4, X(f, θ) = [X1(f), X2(f), X3(f), X4(f)]^T, where θ is the sound pickup direction. For example, when the microphone array is a four-microphone linear array, θ may specifically be 30°, 90° and 150°; when the microphone array is a two-microphone linear array, θ may specifically be 45° and 135°.
Further, determining the weight vector corresponding to the target frequency-domain data according to the time delays between the N microphones of the microphone array in each sound pickup direction includes: determining the time delay of each of the N microphones relative to a target microphone and determining a sub-weight vector corresponding to each microphone according to the time delay, where the target microphone is the microphone that first receives the speech data; determining a second matrix corresponding to the sub-weight vectors, where the column information of the second matrix is used to indicate the sub-weight vector corresponding to each microphone; and determining the weight vector corresponding to the target frequency-domain data according to the number N of microphones in the microphone array and the second matrix.
Optionally, determining the time delay of each of the N microphones relative to the target microphone includes: determining the abscissa of any one of the N microphones on the coordinate axes and the ordinate of that microphone on the coordinate axes; determining a first product of the abscissa and the cosine of the sound pickup direction of that microphone, and a second product of the ordinate and the sine of the sound pickup direction of that microphone; determining the time delay of that microphone relative to the target microphone according to the speed of sound, the first product and the second product, where the coordinate point of the target microphone is the origin of the coordinate axes; and performing the above determination steps cyclically until the time delay of each of the N microphones relative to the target microphone is determined.
Specifically, the weight vector d(f, θ) corresponding to the target frequency-domain data is determined according to the corresponding formula (shown as an image in the original publication and not reproduced in this text), where τ21 is the time delay of microphone 2 relative to microphone 1, τ31 is the time delay of microphone 3 relative to microphone 1, and τN1 is the time delay of microphone N relative to microphone 1, with τN1 = (aN·cosθ + bN·sinθ)/c, where θ is the sound pickup direction, c is the speed of sound, microphone 1 is the target microphone, aN is the abscissa of microphone N on the coordinate axes, and bN is the ordinate of microphone N on the coordinate axes.
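Because the weight-vector matrix itself is not reproduced in the extracted text, the sketch below reconstructs a standard delay-and-sum steering vector that is consistent with the surrounding definitions (the per-microphone delays relative to microphone 1 and the division by the number of microphones N); the phase convention and the speed-of-sound value are assumptions:

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s, assumed value for c

def mic_delays(coords, theta):
    """Delay tau_n1 of every microphone relative to microphone 1.

    coords: (N, 2) array of microphone coordinates (a_n, b_n), with microphone 1
            at the origin of the coordinate axes (so tau_11 == 0).
    theta:  sound pickup direction in radians.
    Follows the description above: (a_n * cos(theta) + b_n * sin(theta)) / c.
    """
    a, b = coords[:, 0], coords[:, 1]
    return (a * np.cos(theta) + b * np.sin(theta)) / SPEED_OF_SOUND

def steering_vector(coords, theta, f):
    """Assumed delay-and-sum weight vector d(f, theta).

    Reconstructed as (1/N) * [1, e^{-j2*pi*f*tau_21}, ..., e^{-j2*pi*f*tau_N1}]^T;
    the sign of the phase term is a convention choice not fixed by the text.
    """
    tau = mic_delays(coords, theta)
    return np.exp(-2j * np.pi * f * tau) / len(coords)
```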
For example, when the microphone array is a four-microphone linear array, corresponding weight vectors d(f, θ) are obtained in this way for the 30° direction, the 90° direction and the 150° direction (the three resulting weight-vector formulas appear as images in the original publication).
Optionally, determining the M delayed addition beams and the M delayed subtraction beams according to the target frequency-domain data and the weight vector corresponding to the target frequency-domain data includes: determining the convolution result of the target frequency-domain data of each sound pickup direction and the weight vector corresponding to the target frequency-domain data, and determining the M delayed addition beams according to the convolution result; and determining the complex conjugate of the convolution result of the target frequency-domain data of each sound pickup direction and the weight vector corresponding to the target frequency-domain data, and determining the M delayed subtraction beams according to the complex conjugate result.
Specifically, determining the M delayed addition beams and the M delayed subtraction beams according to X(f, θ) and the weight vector corresponding to X(f, θ) includes: for the M delayed addition beams, the delayed addition beam bsum(f, θ) corresponding to each sound pickup direction θ is determined by the formula bsum(f, θ) = d(f, θ)*X(f, θ); and for the M delayed subtraction beams, the delayed subtraction beam bsub(f, θ) corresponding to each sound pickup direction θ is determined by the formula bsub(f, θ) = conj[d(f, θ)]*X(f, θ).
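A small sketch of this step, interpreting the products d(f, θ)*X(f, θ) and conj[d(f, θ)]*X(f, θ) as inner products over the N microphone channels (an assumption about the notation; shapes and names are illustrative):

```python
import numpy as np

def delay_beams(X_frame, d_all):
    """Delayed addition and delayed subtraction beams for one STFT frame.

    X_frame: (N, bins) complex array; row n holds X_n(f) for this frame.
    d_all:   (bins, N) array of weight vectors d(f, theta), one per frequency bin
             (for example produced by the steering-vector sketch above).
    """
    b_sum = np.einsum('fn,nf->f', d_all, X_frame)           # d(f,theta) * X(f,theta)
    b_sub = np.einsum('fn,nf->f', np.conj(d_all), X_frame)  # conj[d(f,theta)] * X(f,theta)
    return b_sum, b_sub

# The model input uses the signal amplitudes, e.g. Sum = np.abs(b_sum).
```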
In an exemplary embodiment, in the case where M = 3, after the signal amplitudes of the M delayed addition beams and of the M delayed subtraction beams are input into the speech enhancement model so that the target speech in the speech data is enhanced by the speech enhancement model, the method further includes: performing linear filtering on first speech enhancement data in a first sound pickup direction according to a first preset algorithm to obtain a speech enhancement result in the first sound pickup direction, where the first preset algorithm includes taking the delayed addition beam of the first sound pickup direction as the main beam and taking the speech enhancement results in the second and third sound pickup directions as noise parameters; performing linear filtering on second speech enhancement data in the second sound pickup direction according to a second preset algorithm to obtain a speech enhancement result in the second sound pickup direction, where the second preset algorithm includes taking the delayed addition beam of the second sound pickup direction as the main beam and taking the speech enhancement results in the first and third sound pickup directions as noise parameters; and performing linear filtering on third speech enhancement data in the third sound pickup direction according to a third preset algorithm to obtain a speech enhancement result in the third sound pickup direction, where the third preset algorithm includes taking the delayed addition beam of the third sound pickup direction as the main beam and taking the speech enhancement results in the first and second sound pickup directions as noise parameters; the first speech enhancement data, the second speech enhancement data and the third speech enhancement data are output results of the speech enhancement model.
It should be noted that microphone array signal processing usually consists of two parts: beamforming and post-filtering. To further improve the noise reduction effect, after the speech enhancement data is obtained, the speech enhancement data is filtered, generally with the NLMS or RLS method. Specifically, when linear filtering is performed on the first speech enhancement data in the first sound pickup direction, the delayed addition beam of the first sound pickup direction is taken as the main beam, the speech enhancement results in the second and third sound pickup directions are taken as noise parameters, and the first speech enhancement data is linearly filtered with the NLMS or RLS method; when linear filtering is performed on the second speech enhancement data in the second sound pickup direction, the delayed addition beam of the second sound pickup direction is taken as the main beam, the speech enhancement results in the first and third sound pickup directions are taken as noise parameters, and the second speech enhancement data is linearly filtered with the NLMS or RLS method; and when linear filtering is performed on the third speech enhancement data in the third sound pickup direction, the delayed addition beam of the third sound pickup direction is taken as the main beam, the speech enhancement results in the first and second sound pickup directions are taken as noise parameters, and the third speech enhancement data is linearly filtered with the NLMS or RLS method.
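The text names NLMS or RLS without giving the filter details, so the following is a generic NLMS adaptive noise canceller sketch rather than the patented post-filter; the tap count, step size and real-valued signal model are assumptions:

```python
import numpy as np

def nlms_cancel(main, refs, taps=8, mu=0.1, eps=1e-8):
    """Normalized LMS noise canceller (a generic sketch, not the disclosed filter).

    main: (T,) samples of the main-beam signal.
    refs: (R, T) noise-reference signals (the enhanced outputs of the other directions).
    Returns e[t] = main[t] - w^T u[t], where u stacks the last `taps` samples of
    every reference and the weights w are adapted sample by sample.
    """
    R, T = refs.shape
    w = np.zeros(R * taps)
    out = np.zeros(T)
    for t in range(T):
        # build the stacked reference vector u[t], zero-padded at the start
        u = np.zeros(R * taps)
        for r in range(R):
            past = refs[r, max(0, t - taps + 1): t + 1][::-1]
            u[r * taps: r * taps + len(past)] = past
        y = w @ u                         # estimated noise component
        e = main[t] - y                   # error = filtered (enhanced) output
        w += mu * e * u / (u @ u + eps)   # NLMS weight update
        out[t] = e
    return out
```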
To better understand the process of the above speech enhancement method for distributed wake-up, the implementation flow of the method is described below with reference to optional embodiments, which are not intended to limit the technical solutions of the embodiments of the present disclosure.
This embodiment provides a speech enhancement method for distributed wake-up in which the microphone array is a four-microphone linear array. Figure 3 is a schematic diagram (1) of the speech enhancement method for distributed wake-up according to an embodiment of the present disclosure; as shown in Figure 3, it illustrates the beam directions of the four-microphone linear array.
Taking a four-microphone linear array as an example of the microphone array, Figure 4 is a schematic diagram (2) of the speech enhancement method for distributed wake-up according to an embodiment of the present disclosure. As shown in Figure 4, the specific steps are as follows:
Step S401: x1, x2, x3 and x4 are subjected to a short-time Fourier transform (STFT) to obtain frequency-domain data X1, X2, X3 and X4, where x1, x2, x3 and x4 respectively denote the time-domain data of the four microphones (equivalent to the speech data in the above embodiments).
Step S402: delayed addition beams SumN and delayed subtraction beams SubN are formed in the 30°, 90° and 150° directions respectively, where Y1 denotes the concatenation of the 30° beam results, i.e. Y1 = [Sum1, Sub1]; Y2 denotes the concatenation of the 90° beam results, i.e. Y2 = [Sum2, Sub2]; and Y3 denotes the concatenation of the 150° beam results, i.e. Y3 = [Sum3, Sub3].
It should be noted that the target frequency-domain data X(f, θ) is determined according to the following formula:
X(f, θ) = [X1(f), X2(f), X3(f), X4(f)]^T;
where X1(f), X2(f), X3(f) and X4(f) respectively denote the four pieces of frequency-domain data, and f is the frequency corresponding to the target frequency-domain data.
The weight vector d(f, θ) corresponding to the target frequency-domain data is determined according to the corresponding formula (shown as an image in the original publication), where τ21 is the time delay of microphone 2 relative to microphone 1, τ31 is the time delay of microphone 3 relative to microphone 1, τ41 is the time delay of microphone 4 relative to microphone 1, aN is the abscissa of the microphone on the coordinate axes, and bN is the ordinate of the microphone on the coordinate axes.
bsum(f, θ) = d(f, θ)*X(f, θ); bsub(f, θ) = conj[d(f, θ)]*X(f, θ), where bsum denotes the delayed addition beam (complex-valued) and bsub denotes the delayed subtraction beam (complex-valued).
Sum1 = abs[bsum(θ = 30°)];
Sum2 = abs[bsum(θ = 90°)];
Sum3 = abs[bsum(θ = 150°)]. It should be noted that SumN is an array of real numbers.
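Putting steps S401 and S402 and the magnitude values above together, one frame of the network input for the four-microphone example could be assembled as below; the helper names, the precomputed steering vectors and the 257-bin size (which follows from the 512-point frame mentioned later) are assumptions:

```python
import numpy as np

PICKUP_DIRECTIONS = (30.0, 90.0, 150.0)   # degrees, as in the example above

def frame_features(X_frame, steering):
    """Network input for one STFT frame of the four-microphone linear array.

    X_frame:  (4, 257) complex array holding X1..X4 for this frame
              (257 bins assumed, matching a 512-point frame).
    steering: dict mapping direction in degrees -> (257, 4) weight vectors d(f, theta).
    Returns Y = [|Sum1|, |Sub1|, |Sum2|, |Sub2|, |Sum3|, |Sub3|], length 6 * 257 = 1542.
    """
    parts = []
    for theta in PICKUP_DIRECTIONS:
        d = steering[theta]
        b_sum = np.einsum('fn,nf->f', d, X_frame)            # delayed addition beam
        b_sub = np.einsum('fn,nf->f', np.conj(d), X_frame)   # delayed subtraction beam
        parts.extend([np.abs(b_sum), np.abs(b_sub)])         # SumN and SubN magnitudes
    return np.concatenate(parts)
```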
Step S403: take the absolute values of Y1, Y2 and Y3.
Step S404: construct a multi-dimensional array as the input of the speech enhancement model, and output Mask1, Mask2 and Mask3 for the delayed addition beams.
It should be noted that in this example of the present disclosure, a sampling rate of 16000 is used, the STFT frame length is 512, the frame shift is 256, and the Hanning window length is 512. The length of Y1 is 257*2 = 514, and the length of Y is 514*3 = 1542. The speech enhancement model uses two groups of stacked TCNs, each group containing two sets of TCN blocks, with a convolution kernel of 3 and dilation rates {1, 2, 5, 9}, followed by three fully connected layers Fc with a sigmoid activation function.
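One way to realize a model with these hyper-parameters is sketched below in PyTorch; the block wiring, channel count, normalization and the reading of "three fully connected layers" as three parallel sigmoid output heads (one mask per direction) are assumptions, since the text does not fix these details:

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """A simple dilated 1-D convolution block with a residual connection."""
    def __init__(self, channels, dilation, kernel_size=3):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2          # keep the frame axis length
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation, padding=pad)
        self.act = nn.PReLU()

    def forward(self, x):                                # x: (batch, channels, frames)
        return x + self.act(self.conv(x))

class MaskEstimator(nn.Module):
    """Sketch of the mask network: stacked TCN blocks plus three sigmoid heads."""
    def __init__(self, in_dim=1542, channels=256, bins=257,
                 dilations=(1, 2, 5, 9), repeats=2):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, channels, 1)       # per-frame feature projection
        blocks = [TCNBlock(channels, d) for _ in range(repeats) for d in dilations]
        self.tcn = nn.Sequential(*blocks)
        # one head per direction: masks for the 30, 90 and 150 degree addition beams
        self.heads = nn.ModuleList([nn.Linear(channels, bins) for _ in range(3)])

    def forward(self, y):                                # y: (batch, frames, 1542)
        h = self.tcn(self.proj(y.transpose(1, 2))).transpose(1, 2)
        return [torch.sigmoid(head(h)) for head in self.heads]  # each (batch, frames, 257)
```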
The model structure of the speech enhancement model uses stacked TCN blocks, and the outermost layer has three outputs, corresponding to mask values in [0, 1] for the delayed addition beams of the 30°, 90° and 150° directions. Mask1 denotes the masking value of bsum(θ = 30°). Because of spectral leakage, the delayed addition beam result contains noise from other directions, and Mask1 can suppress the noise from those directions. Out1 = bsum(θ = 30°)*Mask1 denotes the enhanced result for the 30° direction; Out2 = bsum(θ = 90°)*Mask2 denotes the enhanced result for the 90° direction; and Out3 = bsum(θ = 150°)*Mask3 denotes the enhanced result for the 150° direction.
Step S405: determine the loss function of the speech enhancement model.
The loss function is the MSE loss. T1, T2 and T3 respectively denote, for each direction, the delayed addition beam result when there is only the clean speech of the target sound source / the beam result of the noisy speech when noise is present.
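A minimal sketch of such a training loss, assuming the three per-direction MSE terms are simply summed (the text only states that MSE loss is used with T1, T2 and T3 as targets):

```python
import torch
import torch.nn.functional as F

def mask_loss(outs, targets):
    """MSE training loss over the three directions.

    outs:    [Out1, Out2, Out3] masked delayed-addition beam results (tensors).
    targets: [T1, T2, T3] corresponding clean-speech beam results for the same directions.
    Summing the three per-direction terms is an assumption made for this sketch.
    """
    return sum(F.mse_loss(o, t) for o, t in zip(outs, targets))
```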
It should be noted that after the speech enhancement signals Out1, Out2 and Out3 are obtained, they are input into a post-processing module to be filtered: with Sum1 as the main beam and Out2 and Out3 as noise references, filtering with the NLMS or RLS method gives the 30° output result; with Sum2 as the main beam and Out1 and Out3 as noise references, the 90° output result is obtained; and with Sum3 as the main beam and Out1 and Out2 as noise references, the 150° output result is obtained.
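The per-direction wiring of this post-processing can be sketched as follows, reusing an adaptive canceller such as the NLMS sketch shown earlier; names and shapes are illustrative:

```python
import numpy as np

def post_filter_all(sums, outs, canceller):
    """Post-processing wiring for the three directions.

    sums:      {30: Sum1, 90: Sum2, 150: Sum3} delayed-addition main beams.
    outs:      {30: Out1, 90: Out2, 150: Out3} model-enhanced signals.
    canceller: an adaptive filter called as canceller(main, refs), for example the
               nlms_cancel() sketch shown earlier (an RLS filter would be an alternative).
    """
    results = {}
    for theta in (30, 90, 150):
        refs = np.stack([outs[other] for other in (30, 90, 150) if other != theta])
        results[theta] = canceller(sums[theta], refs)   # e.g. 30: main Sum1, refs Out2, Out3
    return results
```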
In the embodiments of the present disclosure, the delayed addition and delayed subtraction beams of the three directions are concatenated as the input of the model. Compared with using only the delayed addition beams as the input, the model can learn more spatial information, which is conducive to convergence. Meanwhile, the stacked-TCN structure can process time series and has a wider receptive field than an LSTM. Compared with directly taking the model output as the final result, linearly filtering the estimated noise components with the post-processing module can effectively reduce speech distortion and helps improve the speech quality, wake-up rate and recognition rate.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and certainly also by hardware, but in many cases the former is the better implementation. Based on such an understanding, the technical solutions of the present disclosure essentially, or the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device or the like) to perform the methods of the embodiments of the present disclosure.
Figure 5 is a structural block diagram of a speech enhancement apparatus for distributed wake-up according to an embodiment of the present disclosure. As shown in Figure 5, the apparatus includes:
an acquisition module 52, configured to acquire N pieces of frequency-domain data corresponding to N microphones of a microphone array, where the N pieces of frequency-domain data are obtained by performing a Fourier transform on speech data received by the microphone array;
a determination module 54, configured to determine a delayed addition beam and a delayed subtraction beam of the N pieces of frequency-domain data in each of M sound pickup directions, to obtain M delayed addition beams and M delayed subtraction beams, where the N pieces of frequency-domain data correspond to a delayed addition beam and a delayed subtraction beam in each of the different sound pickup directions, N is an integer greater than 3, and M is an integer greater than 2; and
an input module 56, configured to input the signal amplitudes of the M delayed addition beams and the signal amplitudes of the M delayed subtraction beams into a speech enhancement model, so as to enhance target speech in the speech data by means of the speech enhancement model.
With the above apparatus, N pieces of frequency-domain data corresponding to the N microphones of the microphone array are acquired, the delayed addition beam and the delayed subtraction beam of the N pieces of frequency-domain data in each of the M sound pickup directions are determined to obtain M delayed addition beams and M delayed subtraction beams, and the signal amplitudes of the M delayed addition beams and of the M delayed subtraction beams are input into the speech enhancement model so that the target speech in the speech data is enhanced by the speech enhancement model. This solves the problems in the related art that the low-frequency main lobe of the microphone array's beams is wide and the quality of speech signal processing is poor: in the embodiments of the present disclosure, the delayed addition beams and delayed subtraction beams of the M sound pickup directions are used as the input of the speech enhancement model, which improves the quality of speech signal processing.
In an exemplary embodiment, the determination module 54 is configured to: for each of the M sound pickup directions, determine target frequency-domain data according to the N pieces of frequency-domain data, and determine a weight vector corresponding to the target frequency-domain data according to the time delays between the N microphones in that sound pickup direction, where the target frequency-domain data is used to indicate the array signal corresponding to the N pieces of frequency-domain data; and determine the M delayed addition beams and the M delayed subtraction beams according to the target frequency-domain data and the weight vector corresponding to the target frequency-domain data.
In an exemplary embodiment, the determination module 54 is configured to determine a first matrix corresponding to the N pieces of frequency-domain data, where the row information of the first matrix is used to indicate the N pieces of frequency-domain data, and to determine the target frequency-domain data according to the first matrix.
In an exemplary embodiment, the determination module 54 is configured to: determine the time delay of each of the N microphones relative to a target microphone and determine a sub-weight vector corresponding to each microphone according to the time delay, where the target microphone is the microphone that first receives the speech data; determine a second matrix corresponding to the sub-weight vectors, where the column information of the second matrix is used to indicate the sub-weight vector corresponding to each microphone; and determine the weight vector corresponding to the target frequency-domain data according to the number N of microphones in the microphone array and the second matrix.
In an exemplary embodiment, the determination module 54 is configured to perform the following determination steps: determining first frequency-domain data corresponding to any one of the N microphones and second frequency-domain data corresponding to the target microphone; determining a first product of the first frequency-domain data and the cosine of the sound pickup direction of that microphone, and a second product of the second frequency-domain data and the sine of the sound pickup direction of that microphone; determining the time delay of that microphone relative to the target microphone according to the speed of sound, the first product and the second product; and performing the determination steps cyclically until the time delay of each of the N microphones relative to the target microphone is determined.
In an exemplary embodiment, the determination module 54 is configured to determine the convolution result of the target frequency-domain data of each sound pickup direction and the weight vector corresponding to the target frequency-domain data and determine the M delayed addition beams according to the convolution result, and to determine the complex conjugate of the convolution result of the target frequency-domain data of each sound pickup direction and the weight vector corresponding to the target frequency-domain data and determine the M delayed subtraction beams according to the complex conjugate result.
In an exemplary embodiment, in the case where M = 3, the determination module 54 is configured to: perform linear filtering on first speech enhancement data in a first sound pickup direction according to a first preset algorithm to obtain a speech enhancement result in the first sound pickup direction, where the first preset algorithm includes taking the delayed addition beam of the first sound pickup direction as the main beam and taking the speech enhancement results in the second and third sound pickup directions as noise parameters; perform linear filtering on second speech enhancement data in the second sound pickup direction according to a second preset algorithm to obtain a speech enhancement result in the second sound pickup direction, where the second preset algorithm includes taking the delayed addition beam of the second sound pickup direction as the main beam and taking the speech enhancement results in the first and third sound pickup directions as noise parameters; and perform linear filtering on third speech enhancement data in the third sound pickup direction according to a third preset algorithm to obtain a speech enhancement result in the third sound pickup direction, where the third preset algorithm includes taking the delayed addition beam of the third sound pickup direction as the main beam and taking the speech enhancement results in the first and second sound pickup directions as noise parameters; the first speech enhancement data, the second speech enhancement data and the third speech enhancement data are output results of the speech enhancement model.
Embodiments of the present disclosure further provide a storage medium, which includes a stored program, where the program performs any one of the above methods when run.
Optionally, in this embodiment, the above storage medium may be configured to store program code for performing the following steps:
S1: acquire N pieces of frequency-domain data corresponding to N microphones of a microphone array, where the N pieces of frequency-domain data are obtained by performing a Fourier transform on speech data received by the microphone array;
S2: determine a delayed addition beam and a delayed subtraction beam of the N pieces of frequency-domain data in each of M sound pickup directions, to obtain M delayed addition beams and M delayed subtraction beams, where the N pieces of frequency-domain data correspond to a delayed addition beam and a delayed subtraction beam in each of the different sound pickup directions, N is an integer greater than 3, and M is an integer greater than 2;
S3: input the signal amplitudes of the M delayed addition beams and the signal amplitudes of the M delayed subtraction beams into a speech enhancement model, so as to enhance target speech in the speech data by means of the speech enhancement model.
根据本公开实施例的又一个方面,还提供了一种用于实施上述语义转换方法的电子装置,如图6所示,该电子装置包括存储器602和处理器604,该存储器602中存储有计算机程序,该处理器604被设置为通过计算机程序执行上述任一项方法实施例中的步骤。According to yet another aspect of the embodiment of the present disclosure, an electronic device for implementing the above semantic conversion method is also provided. As shown in Figure 6, the electronic device includes a memory 602 and a processor 604. The memory 602 stores a computer Program, the processor 604 is configured to execute the steps in any of the above method embodiments through a computer program.
可选地,在本实施例中,上述电子装置可以位于计算机网络的多个网络设备中的至少一个网络设备。Optionally, in this embodiment, the above-mentioned electronic device may be located in at least one network device among multiple network devices of the computer network.
可选地,在本实施例中,上述处理器可以被设置为通过计算机程序执行以下步骤:Optionally, in this embodiment, the above-mentioned processor may be configured to perform the following steps through a computer program:
S1: obtain N pieces of frequency-domain data corresponding to the N microphones of a microphone array, where the N pieces of frequency-domain data are obtained by performing a Fourier transform on the speech data received by the microphone array;

S2: determine, for each of M pick-up directions, a delay-and-sum beam and a delay-and-subtract beam of the N pieces of frequency-domain data, to obtain M delay-and-sum beams and M delay-and-subtract beams, where the N pieces of frequency-domain data have a corresponding delay-and-sum beam and delay-and-subtract beam in each pick-up direction, N is an integer greater than 3, and M is an integer greater than 2;

S3: input the signal magnitudes of the M delay-and-sum beams and the M delay-and-subtract beams into a speech enhancement model, so that the target speech in the speech data is enhanced by the speech enhancement model.
Optionally, those of ordinary skill in the art will understand that the structure shown in Figure 6 is merely illustrative; the electronic device may also be a terminal device such as a smartphone (e.g., an Android or iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), or a PAD. Figure 6 does not limit the structure of the electronic device. For example, the electronic device may include more or fewer components than shown in Figure 6 (such as a network interface), or may have a configuration different from that shown in Figure 6.
The memory 602 may be used to store software programs and modules, such as the program instructions/modules corresponding to the speech enhancement method and apparatus for distributed wake-up in the embodiments of the present disclosure. The processor 604 runs the software programs and modules stored in the memory 602 to perform various functional applications and data processing, that is, to implement the above speech enhancement method. The memory 602 may include a high-speed random access memory, and may further include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 602 may further include memory located remotely from the processor 604, and such remote memory may be connected to the terminal through a network. Examples of the network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof. As an example, as shown in Figure 6, the memory 602 may include, but is not limited to, the acquisition module 52, the determination module 54, and the input module 56 of the above speech enhancement apparatus for distributed wake-up; it may further include other module units of that apparatus, which are not described again in this example.
Optionally, the transmission device 606 is configured to receive or send data via a network. Specific examples of the network may include wired networks and wireless networks. In one example, the transmission device 606 includes a network interface controller (NIC), which can be connected to other network devices and a router through a network cable so as to communicate with the Internet or a local area network. In another example, the transmission device 606 is a radio frequency (RF) module, which communicates with the Internet wirelessly.

In addition, the electronic device further includes: a display 608, configured to display the frequency-domain data; and a connection bus 610, configured to connect the module components of the electronic device.
Optionally, in this embodiment, the storage medium may include, but is not limited to, various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

Optionally, for specific examples in this embodiment, reference may be made to the examples described in the above embodiments and optional implementations, which are not repeated here.
Obviously, those skilled in the art should understand that the above modules or steps of the present disclosure may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that described here, or they may each be made into an individual integrated circuit module, or multiple modules or steps among them may be made into a single integrated circuit module. Thus, the present disclosure is not limited to any specific combination of hardware and software.

The above are only preferred embodiments of the present disclosure. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present disclosure, and such improvements and refinements shall also fall within the protection scope of the present disclosure.

Claims (16)

  1. A speech enhancement method for distributed wake-up, comprising:

    obtaining N pieces of frequency-domain data corresponding to N microphones of a microphone array, wherein the N pieces of frequency-domain data are obtained by performing a Fourier transform on speech data received by the microphone array;

    determining, for each of M pick-up directions, a delay-and-sum beam and a delay-and-subtract beam of the N pieces of frequency-domain data, to obtain M delay-and-sum beams and M delay-and-subtract beams, wherein the N pieces of frequency-domain data have a corresponding delay-and-sum beam and delay-and-subtract beam in each pick-up direction, N is an integer greater than 3, and M is an integer greater than 2; and

    inputting signal magnitudes of the M delay-and-sum beams and signal magnitudes of the M delay-and-subtract beams into a speech enhancement model, so as to enhance target speech in the speech data through the speech enhancement model.
  2. The speech enhancement method for distributed wake-up according to claim 1, wherein determining, for each of the M pick-up directions, the delay-and-sum beam and the delay-and-subtract beam of the N pieces of frequency-domain data to obtain the M delay-and-sum beams and the M delay-and-subtract beams comprises:

    for each of the M pick-up directions, determining target frequency-domain data according to the N pieces of frequency-domain data, and determining a weight vector corresponding to the target frequency-domain data according to time delays between the N microphones of the microphone array in that pick-up direction, wherein the target frequency-domain data is used to indicate an array signal corresponding to the N pieces of frequency-domain data; and

    determining the M delay-and-sum beams and the M delay-and-subtract beams according to the target frequency-domain data and the weight vector corresponding to the target frequency-domain data.
  3. The speech enhancement method for distributed wake-up according to claim 2, wherein determining the target frequency-domain data according to the N pieces of frequency-domain data comprises:

    determining a first matrix corresponding to the N pieces of frequency-domain data, wherein row information of the first matrix is used to indicate the N pieces of frequency-domain data; and

    determining the target frequency-domain data according to the first matrix.
  4. The speech enhancement method for distributed wake-up according to claim 2, wherein determining the weight vector corresponding to the target frequency-domain data according to the time delays between the N microphones of the microphone array in each pick-up direction comprises:

    determining a time delay of each of the N microphones relative to a target microphone, and determining a sub-weight vector corresponding to each microphone according to the time delay, wherein the target microphone is the microphone that first receives the speech data;

    determining a second matrix corresponding to the sub-weight vectors, wherein column information of the second matrix is used to indicate the sub-weight vector corresponding to each microphone; and

    determining the weight vector corresponding to the target frequency-domain data according to the number N of microphones in the microphone array and the second matrix.
  5. The speech enhancement method for distributed wake-up according to claim 4, wherein determining the time delay of each of the N microphones relative to the target microphone comprises:

    a determining step: determining an abscissa of any one of the N microphones on a coordinate axis and an ordinate of that microphone on the coordinate axis; determining a first product of the abscissa and a cosine of the pick-up direction of that microphone, and a second product of the ordinate and a sine of the pick-up direction of that microphone; and determining the time delay of that microphone relative to the target microphone according to the speed of sound, the first product and the second product, wherein the coordinate point of the target microphone is the origin of the coordinate axes; and

    performing the determining step in a loop until the time delay of each of the N microphones relative to the target microphone is determined.
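Written out under the geometry of claim 5 (target microphone at the origin of the coordinate axes, pick-up direction θ, speed of sound c), the time delay of the n-th microphone with coordinates (x_n, y_n) can be read as

$$\tau_n = \frac{x_n\cos\theta + y_n\sin\theta}{c}.$$

This is only one natural reading: the claim names the two products and the speed of sound but does not fix the exact way they are combined.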
  6. The speech enhancement method for distributed wake-up according to claim 2, wherein determining the M delay-and-sum beams and the M delay-and-subtract beams according to the target frequency-domain data and the weight vector corresponding to the target frequency-domain data comprises:

    determining, for each pick-up direction, a convolution result of the target frequency-domain data and the weight vector corresponding to the target frequency-domain data, and determining the M delay-and-sum beams according to the convolution result; and

    determining, for each pick-up direction, a complex conjugate of the convolution result of the target frequency-domain data and the weight vector corresponding to the target frequency-domain data, and determining the M delay-and-subtract beams according to the complex conjugate.
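In symbols, one possible reading of claim 6 is the following, where X(f) denotes the target frequency-domain data, w_m(f) the weight vector of the m-th pick-up direction, and the claimed "convolution result" is taken to be the frequency-domain weighted combination of the array signal (an assumption; the claim does not spell the operation out):

$$Y_m^{+}(f) = \mathbf{w}_m(f)^{H}\,\mathbf{X}(f), \qquad Y_m^{-}(f)\ \text{is obtained from}\ \bigl(Y_m^{+}(f)\bigr)^{*},$$

with $Y_m^{+}$ the delay-and-sum beam and $Y_m^{-}$ the delay-and-subtract beam of that direction.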
  7. The speech enhancement method for distributed wake-up according to claim 1, wherein, when M = 3, after inputting the signal magnitudes of the M delay-and-sum beams and the signal magnitudes of the M delay-and-subtract beams into the speech enhancement model to enhance the target speech in the speech data through the speech enhancement model, the method further comprises:

    performing linear filtering on first speech enhancement data in a first pick-up direction according to a first preset algorithm to obtain a speech enhancement result in the first pick-up direction, wherein the first preset algorithm comprises: using the delay-and-sum beam of the first pick-up direction as a main beam, and using the speech enhancement results in a second pick-up direction and a third pick-up direction as noise parameters;

    performing linear filtering on second speech enhancement data in the second pick-up direction according to a second preset algorithm to obtain a speech enhancement result in the second pick-up direction, wherein the second preset algorithm comprises: using the delay-and-sum beam of the second pick-up direction as a main beam, and using the speech enhancement results in the first pick-up direction and the third pick-up direction as noise parameters; and

    performing linear filtering on third speech enhancement data in the third pick-up direction according to a third preset algorithm to obtain a speech enhancement result in the third pick-up direction, wherein the third preset algorithm comprises: using the delay-and-sum beam of the third pick-up direction as a main beam, and using the speech enhancement results in the first pick-up direction and the second pick-up direction as noise parameters;

    wherein the first speech enhancement data, the second speech enhancement data and the third speech enhancement data are output results of the speech enhancement model.
  8. A speech enhancement apparatus for distributed wake-up, comprising:

    an acquisition module, configured to obtain N pieces of frequency-domain data corresponding to N microphones of a microphone array, wherein the N pieces of frequency-domain data are obtained by performing a Fourier transform on speech data received by the microphone array;

    a determination module, configured to determine, for each of M pick-up directions, a delay-and-sum beam and a delay-and-subtract beam of the N pieces of frequency-domain data, to obtain M delay-and-sum beams and M delay-and-subtract beams, wherein the N pieces of frequency-domain data have a corresponding delay-and-sum beam and delay-and-subtract beam in each pick-up direction, N is an integer greater than 3, and M is an integer greater than 2; and

    an input module, configured to input signal magnitudes of the M delay-and-sum beams and signal magnitudes of the M delay-and-subtract beams into a speech enhancement model, so as to enhance target speech in the speech data through the speech enhancement model.
  9. The speech enhancement apparatus for distributed wake-up according to claim 8, wherein:

    the determination module is configured to: for each of the M pick-up directions, determine target frequency-domain data according to the N pieces of frequency-domain data, and determine a weight vector corresponding to the target frequency-domain data according to time delays between the N microphones in that pick-up direction, wherein the target frequency-domain data is used to indicate an array signal corresponding to the N pieces of frequency-domain data; and determine the M delay-and-sum beams and the M delay-and-subtract beams according to the target frequency-domain data and the weight vector corresponding to the target frequency-domain data.
  10. The speech enhancement apparatus for distributed wake-up according to claim 9, wherein:

    the determination module is configured to determine a first matrix corresponding to the N pieces of frequency-domain data, wherein row information of the first matrix is used to indicate the N pieces of frequency-domain data, and to determine the target frequency-domain data according to the first matrix.
  11. The speech enhancement apparatus for distributed wake-up according to claim 9, wherein:

    the determination module is configured to: determine a time delay of each of the N microphones relative to a target microphone, and determine a sub-weight vector corresponding to each microphone according to the time delay, wherein the target microphone is the microphone that first receives the speech data; determine a second matrix corresponding to the sub-weight vectors, wherein column information of the second matrix is used to indicate the sub-weight vector corresponding to each microphone; and determine the weight vector corresponding to the target frequency-domain data according to the number N of microphones in the microphone array and the second matrix.
  12. The speech enhancement apparatus for distributed wake-up according to claim 11, wherein:
    the determination module is configured to perform a determining step: determining an abscissa of any one of the N microphones on a coordinate axis and an ordinate of that microphone on the coordinate axis; determining a first product of the abscissa and a cosine of the pick-up direction of that microphone, and a second product of the ordinate and a sine of the pick-up direction of that microphone; and determining the time delay of that microphone relative to the target microphone according to the speed of sound, the first product and the second product; and to perform the determining step in a loop until the time delay of each of the N microphones relative to the target microphone is determined.
  13. The speech enhancement apparatus for distributed wake-up according to claim 9, wherein:

    the determination module is configured to: determine, for each pick-up direction, a convolution result of the target frequency-domain data and the weight vector corresponding to the target frequency-domain data, and determine the M delay-and-sum beams according to the convolution result; and determine, for each pick-up direction, a complex conjugate of that convolution result, and determine the M delay-and-subtract beams according to the complex conjugate.
  14. The speech enhancement apparatus for distributed wake-up according to claim 8, wherein:

    when M = 3, the determination module is configured to: perform linear filtering on first speech enhancement data in a first pick-up direction according to a first preset algorithm to obtain a speech enhancement result in the first pick-up direction, wherein the first preset algorithm comprises: using the delay-and-sum beam of the first pick-up direction as a main beam, and using the speech enhancement results in a second pick-up direction and a third pick-up direction as noise parameters; perform linear filtering on second speech enhancement data in the second pick-up direction according to a second preset algorithm to obtain a speech enhancement result in the second pick-up direction, wherein the second preset algorithm comprises: using the delay-and-sum beam of the second pick-up direction as a main beam, and using the speech enhancement results in the first pick-up direction and the third pick-up direction as noise parameters; and perform linear filtering on third speech enhancement data in the third pick-up direction according to a third preset algorithm to obtain a speech enhancement result in the third pick-up direction, wherein the third preset algorithm comprises: using the delay-and-sum beam of the third pick-up direction as a main beam, and using the speech enhancement results in the first pick-up direction and the second pick-up direction as noise parameters; wherein the first speech enhancement data, the second speech enhancement data and the third speech enhancement data are output results of the speech enhancement model.
  15. A computer-readable storage medium, comprising a stored program, wherein, when the program is run, the method according to any one of claims 1 to 7 is performed.
  16. An electronic device, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to perform the method according to any one of claims 1 to 7 through the computer program.
PCT/CN2023/085266 2022-06-20 2023-03-30 Speech enhancement method and apparatus for distributed wake-up, and storage medium WO2023246223A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210700223.3 2022-06-20
CN202210700223.3A CN117292700A (en) 2022-06-20 2022-06-20 Voice enhancement method and device for distributed wakeup and storage medium

Publications (1)

Publication Number Publication Date
WO2023246223A1 true WO2023246223A1 (en) 2023-12-28

Family

ID=89243192

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/085266 WO2023246223A1 (en) 2022-06-20 2023-03-30 Speech enhancement method and apparatus for distributed wake-up, and storage medium

Country Status (2)

Country Link
CN (1) CN117292700A (en)
WO (1) WO2023246223A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712818A (en) * 2020-12-29 2021-04-27 苏州科达科技股份有限公司 Voice enhancement method, device and equipment
CN113050035A (en) * 2021-03-12 2021-06-29 云知声智能科技股份有限公司 Two-dimensional directional pickup method and device
CN113393856A (en) * 2020-03-11 2021-09-14 华为技术有限公司 Sound pickup method and device and electronic equipment
US20220068288A1 (en) * 2018-12-14 2022-03-03 Nippon Telegraph And Telephone Corporation Signal processing apparatus, signal processing method, and program

Also Published As

Publication number Publication date
CN117292700A (en) 2023-12-26


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23825878

Country of ref document: EP

Kind code of ref document: A1