WO2022246737A1 - Speech enhancement method and system - Google Patents

Speech enhancement method and system

Info

Publication number
WO2022246737A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
coefficient
speech
target
sound source
Prior art date
Application number
PCT/CN2021/096375
Other languages
English (en)
French (fr)
Inventor
肖乐
张承乾
廖风云
齐心
Original Assignee
深圳市韶音科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市韶音科技有限公司 filed Critical 深圳市韶音科技有限公司
Priority to CN202180088314.XA priority Critical patent/CN116724352A/zh
Priority to PCT/CN2021/096375 priority patent/WO2022246737A1/zh
Publication of WO2022246737A1 publication Critical patent/WO2022246737A1/zh
Priority to US18/354,715 priority patent/US20230360664A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0264Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166Microphone arrays; Beamforming
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering

Definitions

  • This description relates to the field of computer technology, in particular to a processing method and system for speech enhancement.
  • The method may include obtaining a first signal and a second signal of the target speech, where the first signal is a signal of the target speech collected at a first position and the second signal is a signal of the target speech collected at a second position.
  • The method may further include processing the first signal and the second signal to determine a first coefficient based on a target speech position, the first position, and the second position, and determining, based on the first signal and the second signal, a plurality of parameters related to a plurality of sound source directions, each parameter corresponding to a probability that sound emitted from the corresponding sound source direction formed the first signal and the second signal.
  • The method may further include determining a second coefficient based on the plurality of parameters and the target speech position, and processing the first signal and/or the second signal based on the first coefficient and the second coefficient to obtain a speech-enhanced first output speech signal corresponding to the target speech.
  • Processing the first signal and the second signal to determine the first coefficient based on the target speech position, the first position, and the second position may include: performing a differential operation on the first signal and the second signal, based on the target speech position, the first position, and the second position, to obtain a signal pointing in a first direction and a signal pointing in a second direction, where the signal pointing in the first direction and the signal pointing in the second direction contain the effective signal in different proportions; determining, based on the signal pointing in the first direction and the signal pointing in the second direction, a third signal corresponding to the effective signal; and determining the first coefficient based on the third signal.
  • Determining the third signal corresponding to the effective signal may include: performing an adaptive difference operation on the signal pointing in the first direction and the signal pointing in the second direction to determine a fourth signal; and enhancing low-frequency components of the fourth signal to obtain the third signal.
  • the method may further include: updating an adaptive parameter of the adaptive difference operation based on the fourth signal, the signal directed in the first direction, and the signal directed in the second direction.
  • Processing the first signal and the second signal to determine the first coefficient based on the target speech position, the first position, and the second position may include: performing a differential operation on the first signal and the second signal, based on the target speech position, the first position, and the second position, to obtain a signal pointing in a first direction and a signal pointing in a second direction, where the signal pointing in the first direction and the signal pointing in the second direction contain the effective signal in different proportions; determining an estimated signal-to-noise ratio of the target speech based on the signal pointing in the first direction and the signal pointing in the second direction; and determining the first coefficient based on the estimated signal-to-noise ratio.
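  • As an illustration of the step above, a minimal sketch follows. It assumes the estimated signal-to-noise ratio is formed as the power ratio of the two directional signals and mapped to the first coefficient with a Wiener-style gain; both the ratio form and the mapping are assumptions for illustration, not the claimed formulas.

```python
import numpy as np

def first_coefficient_from_snr(sig_first_dir: np.ndarray, sig_second_dir: np.ndarray,
                               eps: float = 1e-12) -> np.ndarray:
    """Illustrative mapping from an estimated SNR to the first coefficient (per frequency bin).

    sig_first_dir:  frequency-domain signal pointing in the first (target) direction
    sig_second_dir: frequency-domain signal pointing in the second (opposite) direction
    """
    # Estimated SNR taken as the power ratio of the two directional signals (assumption).
    snr_est = np.abs(sig_first_dir) ** 2 / (np.abs(sig_second_dir) ** 2 + eps)
    # Wiener-style mapping of the SNR to a gain in [0, 1] (assumption).
    return snr_est / (1.0 + snr_est)
```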
  • Determining the plurality of parameters related to the plurality of sound source directions based on the first signal and the second signal may include: performing, for each sound source direction, a differential operation on the first signal and the second signal based on that sound source direction, the first position, and the second position to determine the parameter related to that sound source direction.
  • Determining the second coefficient based on the plurality of parameters and the target speech position may include: determining a synthesized sound source direction based on the plurality of parameters; and determining the second coefficient based on the synthesized sound source direction and the target speech position.
  • Determining the second coefficient based on the synthesized sound source direction and the target speech position may include: determining whether the target speech position lies in the synthesized sound source direction; in response to the target speech position lying in the synthesized sound source direction, setting the second coefficient to a first value; and in response to the target speech position not lying in the synthesized sound source direction, setting the second coefficient to a second value.
  • Determining the second coefficient based on the synthesized sound source direction and the target speech position may include determining the second coefficient through a regression function based on the angle between the target speech position and the synthesized sound source direction.
  • The method may further include smoothing the second coefficient based on a smoothing factor.
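  • A minimal sketch of such smoothing, assuming a first-order recursive (exponential) form and an illustrative smoothing factor:

```python
def smooth_second_coefficient(previous: float, current: float, alpha: float = 0.9) -> float:
    """First-order recursive smoothing across frames; alpha is the assumed smoothing factor."""
    return alpha * previous + (1.0 - alpha) * current
```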
  • the method may further include performing at least one of the following operations on the first signal and the second signal: framing the first signal and the second signal; performing window smoothing on the first signal and the second signal; and transforming the first signal and the second signal into a frequency domain.
  • the method may further include: determining at least one target subband signal in the first output speech signal; and processing the at least one target subband signal based on a single microphone filtering algorithm to obtain a second output voice signal.
  • Determining the at least one target subband signal in the first output speech signal may include: obtaining a plurality of subband signals based on the first output speech signal; calculating a signal-to-noise ratio for each of the subband signals; and determining the target subband signal based on the signal-to-noise ratio of each subband signal.
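  • The sketch below illustrates one way this selection could work, assuming a uniform split of the spectrum into subbands, an externally supplied noise power spectral density, and a threshold under which a subband is treated as a target for single-microphone filtering; all three are assumptions, since the description does not fix them.

```python
import numpy as np

def select_target_subbands(first_output_spectrum: np.ndarray, noise_psd: np.ndarray,
                           n_subbands: int = 8, snr_threshold_db: float = 5.0):
    """Split the first output spectrum into subbands and pick target subbands by SNR."""
    bands = np.array_split(np.arange(len(first_output_spectrum)), n_subbands)
    target_bands = []
    for band in bands:
        signal_power = np.sum(np.abs(first_output_spectrum[band]) ** 2)
        noise_power = np.sum(noise_psd[band]) + 1e-12
        snr_db = 10.0 * np.log10(signal_power / noise_power + 1e-12)
        # Here low-SNR subbands are treated as the targets for further filtering (assumption).
        if snr_db < snr_threshold_db:
            target_bands.append(band)
    return target_bands
```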
  • The method may further include: processing the first signal and/or the second signal based on a single-microphone filtering algorithm to determine a third coefficient; and processing the first output speech signal based on the third coefficient to obtain a third output speech signal.
  • The method may further include: determining a fourth coefficient based on an energy difference between the first signal and the second signal; and processing the first signal and/or the second signal based on the first coefficient, the second coefficient, and the fourth coefficient to obtain a speech-enhanced fourth output speech signal corresponding to the target speech.
  • Determining the fourth coefficient based on the energy difference between the first signal and the second signal may include: obtaining a noise power spectral density based on a silent interval in the first signal and the second signal; obtaining the energy difference based on a first power spectral density of the first signal, a second power spectral density of the second signal, and the noise power spectral density; and determining the fourth coefficient based on the energy difference and the noise power spectral density.
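  • A hedged sketch of a power-level-difference style fourth coefficient follows. The noise power spectral density is taken as already estimated from a silent interval, and the specific way the difference and the noise PSD combine into a gain is an assumed form, not the patent's formula.

```python
import numpy as np

def fourth_coefficient(psd_first: np.ndarray, psd_second: np.ndarray,
                       psd_noise: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Illustrative fourth coefficient from the inter-channel energy difference.

    psd_first / psd_second: power spectral densities of the first and second signals
    psd_noise: noise power spectral density estimated from a silent interval
    """
    # Energy (power level) difference between the two channels; negative values are clipped.
    energy_diff = np.maximum(psd_first - psd_second, 0.0)
    # Gain in [0, 1] that grows with the energy difference relative to the noise floor
    # (assumed combination of the difference and the noise PSD).
    return energy_diff / (energy_diff + psd_noise + eps)
```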
  • The system may include at least one storage medium including a set of instructions, and at least one processor in communication with the at least one storage medium. When executing the set of instructions, the at least one processor may cause the system to: acquire a first signal and a second signal of the target speech, where the first signal is a signal of the target speech collected at a first position and the second signal is a signal of the target speech collected at a second position; process the first signal and the second signal to determine a first coefficient based on a target speech position, the first position, and the second position; determine, based on the first signal and the second signal, a plurality of parameters related to a plurality of sound source directions, each parameter corresponding to a probability that sound emitted from the corresponding sound source direction formed the first signal and the second signal; determine a second coefficient based on the plurality of parameters and the target speech position; and process the first signal and/or the second signal based on the first coefficient and the second coefficient to obtain a speech-enhanced first output speech signal corresponding to the target speech.
  • the system may include an acquisition module, a processing module and a generation module.
  • The acquisition module may be used to acquire a first signal and a second signal of the target speech, where the first signal is a signal of the target speech collected at a first position and the second signal is a signal of the target speech collected at a second position.
  • The processing module may be configured to: process the first signal and the second signal to determine a first coefficient based on the target speech position, the first position, and the second position; determine, based on the first signal and the second signal, a plurality of parameters related to a plurality of sound source directions, each parameter corresponding to a probability that sound emitted from the corresponding sound source direction formed the first signal and the second signal; and determine a second coefficient based on the plurality of parameters and the target speech position.
  • the generation module may be configured to process the first signal and/or the second signal based on the first coefficient and the second coefficient to obtain a speech-enhanced first output speech signal corresponding to the target speech.
  • One of the embodiments of this specification provides a non-transitory computer-readable medium that may include executable instructions. When executed by at least one processor, the executable instructions may cause the at least one processor to perform the methods described in this specification.
  • FIG. 1 is a schematic diagram of an application scenario of a speech enhancement system according to some embodiments of this specification.
  • FIG. 2 is a schematic diagram of exemplary hardware and/or software components of an exemplary computing device according to some embodiments of this specification.
  • FIG. 3 is a schematic diagram of exemplary hardware and/or software components of an exemplary mobile device according to some embodiments of this specification.
  • FIG. 4 is an exemplary block diagram of a speech enhancement system according to some embodiments of this specification.
  • FIG. 5 is an exemplary flowchart of a speech enhancement method according to some embodiments of this specification.
  • FIG. 6 is a schematic diagram of exemplary dual microphones according to some embodiments of this specification.
  • FIG. 7 is a schematic diagram of the filtering effect of the ANF algorithm at different noise angles according to some embodiments of this specification.
  • FIG. 8 is an exemplary flowchart of a method for determining a first coefficient according to some embodiments of this specification.
  • FIG. 9 is an exemplary flowchart of a method for determining a first coefficient according to some embodiments of this specification.
  • FIG. 10 is an exemplary flowchart of a method for determining a second coefficient according to some embodiments of this specification.
  • FIG. 11 is an exemplary flowchart of a single-microphone filtering method according to some embodiments of this specification.
  • FIG. 12 is an exemplary flowchart of a single-microphone filtering method according to some embodiments of this specification.
  • FIG. 13 is an exemplary flowchart of a speech enhancement method according to some embodiments of this specification.
  • the terms “a”, “an”, and/or “the” are not limited to the singular and may include the plural unless the context clearly indicates otherwise.
  • the terms “comprise” and “include” only indicate the inclusion of clearly identified steps and elements; these steps and elements do not constitute an exclusive list, and the method or device may also contain other steps or elements.
  • the term “based on” means “based at least in part on”.
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one further embodiment”.
  • Fig. 1 is a schematic diagram of an application scenario of a speech enhancement system according to some embodiments of the present specification.
  • The speech enhancement system 100 shown in the embodiments of this specification can be applied in various software, systems, platforms, and devices to implement speech signal enhancement processing. For example, it can be applied to perform speech enhancement processing on user speech signals acquired by various software, systems, platforms, and devices, or to perform speech enhancement processing during voice calls made with devices such as mobile phones, tablets, computers, and earphones.
  • The embodiments of this specification propose a speech enhancement system and method that can implement, for example, enhancement processing of the target speech in the above-mentioned voice call scenario.
  • the speech enhancement system 100 may include a processing device 110, a collection device 120, a terminal 130, a storage device 140, and a network 150.
  • processing device 110 may process data and/or information obtained from other devices or system components.
  • the processing device 110 may execute program instructions based on these data, information and/or processing results to perform one or more functions described in this specification.
  • the processing device 110 may acquire and process the first signal and the second signal of the target speech, and output the speech-enhanced output speech signal.
  • processing device 110 may be a single processing device or a group of processing devices, such as a server or a group of servers.
  • the set of processing devices may be centralized or distributed (eg, processing device 110 may be a distributed system).
  • processing device 110 may be local or remote.
  • the processing device 110 may access information and/or data in the collection device 120 , the terminal 130 , and the storage device 140 through the network 150 .
  • the processing device 110 may be directly connected to the collection device 120, the terminal 130, and the storage device 140 to access stored information and/or data.
  • the processing device 110 may be implemented on a cloud platform.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, inter-cloud, multi-cloud, etc., or any combination thereof.
  • the processing device 110 may be implemented on a computing device as shown in FIG. 2 of this specification.
  • processing device 110 may be implemented on one or more components of computing device 200 as shown in FIG. 2 .
  • processing device 110 may include a processing engine 112 .
  • The processing engine 112 may process data and/or information related to speech enhancement to perform one or more methods or functions described in this specification. For example, the processing engine 112 may acquire a first signal and a second signal of the target speech, where the first signal is a signal of the target speech collected at a first position and the second signal is a signal of the target speech collected at a second position. In some embodiments, the processing engine 112 may process the first signal and/or the second signal to obtain a speech-enhanced output speech signal corresponding to the target speech.
  • processing engine 112 may include one or more processing engines (eg, a single-chip processing engine or a multi-chip processor).
  • The processing engine 112 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), an application-specific instruction-set processor (ASIP), a graphics processing unit (GPU), a physics processing unit (PPU), a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic device (PLD), a controller, a microcontroller unit, a reduced instruction set computer (RISC), a microprocessor, etc., or any combination thereof.
  • the processing engine 112 may be integrated in the collection device 120 or the terminal 130 .
  • the collection device 120 may be used to collect speech signals of the target speech, for example, to collect the first signal and the second signal of the target speech.
  • the collection device 120 may be a single collection device, or a collection device group composed of a plurality of collection devices.
  • The collection device 120 may be a device such as a cell phone, headset, walkie-talkie, tablet, or computer.
  • the collection device 120 may include at least two microphones, and the at least two microphones are separated by a certain distance. When the collection device 120 collects the voice of the user, the at least two microphones may simultaneously collect sounds from the user's mouth at different positions.
  • The at least two microphones may include a first microphone and a second microphone. The first microphone may be located closer to the user's mouth, the second microphone may be located farther from the user's mouth, and the line connecting the second microphone and the first microphone may extend toward the position of the user's mouth.
  • the collection device 120 can convert the collected voice into an electrical signal, and send it to the processing device 110 for processing.
  • the first microphone and the second microphone may convert the collected user voice into a first signal and a second signal respectively.
  • the processing device 110 may implement speech enhancement processing based on the first signal and the second signal.
  • the collection device 120 may transmit information and/or data with the processing device 110 , the terminal 130 , and the storage device 140 through the network 150 .
  • collection device 120 may be directly connected to processing device 110 or storage device 140 to transfer information and/or data.
  • the collection device 120 and the processing device 110 may be different parts of the same electronic device (for example, earphones, glasses, etc.), and are connected by metal wires.
  • The terminal 130 may be a terminal used by a user or another entity; for example, it may be a terminal used by the sound source (a person or another entity) corresponding to the target speech, or a terminal used by another user or entity conducting a voice call with the sound source corresponding to the target speech.
  • the terminal 130 may include a mobile device 130-1, a tablet 130-2, a laptop 130-3, etc. or any combination thereof.
  • the mobile device 130-1 may include smart home devices, wearable devices, smart mobile devices, virtual reality devices, augmented reality devices, etc. or any combination thereof.
  • smart home devices may include smart lighting devices, smart electrical control devices, smart monitoring devices, smart TVs, smart cameras, walkie-talkies, etc., or any combination thereof.
  • wearable devices may include smart bracelets, smart footwear, smart glasses, smart helmets, smart watches, smart headphones, smart wear, smart backpacks, smart accessories, etc., or any combination thereof.
  • a smart mobile device may include a smart phone, a personal digital assistant (PDA), a gaming device, a navigation device, a point of sale (POS), etc., or any combination thereof.
  • the virtual reality device and/or the augmented reality device may include a virtual reality helmet, virtual reality glasses, virtual reality goggles, augmented virtual reality helmet, augmented reality glasses, augmented reality goggles, etc. or any combination thereof.
  • The terminal 130 may acquire/receive speech signals of the target speech, such as the first signal and the second signal. In some embodiments, the terminal 130 may acquire/receive the speech-enhanced output speech signal of the target speech. In some embodiments, the terminal 130 may acquire/receive the speech signals of the target speech, such as the first signal and the second signal, directly from the collection device 120 or the storage device 140, or through the network 150.
  • Similarly, the terminal 130 may obtain/receive the speech-enhanced output speech signal of the target speech directly from the processing device 110 or the storage device 140, or through the network 150.
  • terminal 130 may send instructions to processing device 110 , and processing device 110 may execute instructions from terminal 130 .
  • the terminal 130 may send to the processing device 110 one or more instructions for implementing the speech enhancement method for the target speech, so that the processing device 110 executes one or more operations/steps of the speech enhancement method.
  • Storage device 140 may store data and/or information obtained from other devices or system components.
  • the storage device 140 may store speech signals of the target speech, such as the first signal and the second signal, and may also store the speech-enhanced output speech signal of the target speech.
  • storage device 140 may store data obtained from collection device 120 .
  • storage device 140 may store data retrieved from processing device 110 .
  • storage device 140 may store data and/or instructions for execution or use by processing device 110 to perform the exemplary methods described in this specification.
  • the storage device 140 may include mass storage, removable storage, volatile read-write storage, read-only memory (ROM), etc., or any combination thereof.
  • Exemplary mass storage may include magnetic disks, optical disks, solid state disks, and the like.
  • Exemplary removable storage may include flash drives, floppy disks, optical disks, memory cards, compact disks, magnetic tape, and the like.
  • Exemplary volatile read-write memory may include random access memory (RAM).
  • Exemplary RAMs may include dynamic RAM (DRAM), double data rate synchronous dynamic RAM (DDR SDRAM), static RAM (SRAM), thyristor RAM (T-RAM), zero-capacitor RAM (Z-RAM), and the like.
  • Exemplary ROMs may include mask ROM (MROM), programmable ROM (PROM), erasable programmable ROM (PEROM), electrically erasable programmable ROM (EEPROM), compact disc ROM (CD-ROM), digital versatile disc ROM (DVD-ROM), and the like.
  • the storage device 140 may be implemented on a cloud platform.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an internal cloud, a multi-layer cloud, etc., or any combination thereof.
  • the storage device 140 can be connected to the network 150 to communicate with one or more components in the speech enhancement system 100 (eg, the processing device 110 , the collection device 120 , the terminal 130 ). One or more components in the speech enhancement system 100 can access data or instructions stored in the storage device 140 through the network 150 . In some embodiments, the storage device 140 may be directly connected or communicated with one or more components in the speech enhancement system 100 (for example, the processing device 110, the collection device 120, and the terminal 130). In some embodiments, storage device 140 may be part of processing device 110 .
  • one or more components of speech enhancement system 100 may have permission to access storage device 140 .
  • one or more components of the speech enhancement system 100 may read and/or modify information related to the target speech when one or more conditions are met.
  • Network 150 may facilitate the exchange of information and/or data.
  • One or more components in the speech enhancement system 100 can transmit/receive information and/or data to/from other components in the speech enhancement system 100 through the network 150.
  • For example, the processing device 110 can obtain the first signal and the second signal of the target speech from the collection device 120 or the storage device 140 through the network 150, and the terminal 130 can obtain the speech-enhanced output speech signal of the target speech from the processing device 110 or the storage device 140 through the network 150.
  • the network 150 may be any form of wired or wireless network or any combination thereof.
  • The network 150 may include a cable network, a wired network, a fiber-optic network, a telecommunications network, an intranet, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), a Bluetooth network, a ZigBee network, a near field communication (NFC) network, a Global System for Mobile Communications (GSM) network, a code division multiple access (CDMA) network, a time division multiple access (TDMA) network, a general packet radio service (GPRS) network, an Enhanced Data Rates for GSM Evolution (EDGE) network, a wideband code division multiple access (WCDMA) network, a high-speed downlink packet access (HSDPA) network, a long term evolution (LTE) network, a user datagram protocol (UDP) network, a transmission control protocol/Internet protocol (TCP/IP) network, a short message service (SMS) network, a wireless application protocol (WAP) network, an ultra wideband (UWB) network, etc., or any combination thereof.
  • speech enhancement system 100 may include one or more network access points.
  • speech enhancement system 100 may include wired or wireless network access points, such as base stations and/or wireless access points 150-1, 150-2, ..., through which one or more components of speech enhancement system 100 may be connected to the network 150 to exchange data and/or information.
  • the components may be implemented by electrical and/or electromagnetic signals.
  • The collection device 120 may generate an encoded electrical signal and then send the electrical signal to an output port. If the collection device 120 communicates with another device (e.g., the processing device 110) via a wired network or a data transmission line, the output port may be physically connected to a cable, which further transmits the electrical signal to an input port of that device. If the collection device 120 communicates with another device via a wireless network, the output port of the collection device 120 may be one or more antennas that convert the electrical signal into an electromagnetic signal.
  • When an electronic device (such as the collection device 120 and/or the processing device 110) processes instructions, issues instructions, and/or performs actions, the instructions and/or actions are carried out through electrical signals.
  • When the processing device 110 retrieves or saves data from a storage medium (e.g., the storage device 140), it may send an electrical signal to the read/write device of the storage medium, which may read or write structured data in the storage medium.
  • the structured data can be transmitted to the processor in the form of electrical signals through the bus of the electronic device.
  • an electrical signal may refer to one electrical signal, a series of electrical signals and/or at least two discontinuous electrical signals.
  • FIG. 2 is a schematic diagram of an exemplary computing device 200 shown in accordance with some embodiments of the present specification.
  • processing device 110 may be implemented on computing device 200 .
  • computing device 200 may include memory 210 , processor 220 , input/output (I/O) 230 , and communication port 240 .
  • Memory 210 may store data/information obtained from acquisition device 120 , terminal 130 , storage device 140 or any other component of speech enhancement system 100 .
  • the memory 210 may include mass memory, removable memory, volatile read-write memory, read-only memory (ROM), etc., or any combination thereof.
  • Exemplary mass storage may include magnetic disks, optical disks, solid state disks, and the like.
  • Exemplary removable storage may include flash drives, floppy disks, optical disks, memory cards, compact disks, magnetic tape, and the like.
  • Exemplary volatile read-write memory may include random access memory (RAM).
  • Exemplary RAMs may include dynamic RAM (DRAM), double data rate synchronous dynamic RAM (DDR SDRAM), static RAM (SRAM), thyristor RAM (T-RAM), zero-capacitor RAM (Z-RAM), and the like.
  • Exemplary ROMs may include mask ROM (MROM), programmable ROM (PROM), erasable programmable ROM (PEROM), electrically erasable programmable ROM (EEPROM), compact disc ROM (CD-ROM), digital versatile disc ROM (DVD-ROM), and the like.
  • memory 210 may store one or more programs and/or instructions to perform the exemplary methods described in this specification.
  • the memory 210 may store programs executable by the processing device 110 to implement the speech enhancement method.
  • Processor 220 may execute computer instructions (program code) and perform functions of processing device 110 in accordance with the techniques described in this specification.
  • Computer instructions may include, for example, routines, programs, objects, components, signals, data structures, procedures, modules, and functions, which perform particular functions described herein.
  • processor 220 may process data acquired from acquisition device 120 , terminal 130 , storage device 140 and/or any other components of speech enhancement system 100 .
  • the processor 220 may process the first signal and the second signal of the target speech acquired from the acquisition device 120 to obtain a speech-enhanced output speech signal.
  • the output speech signal may be stored in storage device 140, memory 210, or the like.
  • The output speech signal can be output through the I/O 230 to a playback device such as a speaker.
  • processor 220 may execute instructions obtained from terminal 130 .
  • processor 220 may include one or more hardware processors, such as microcontrollers, microprocessors, reduced instruction set computers (RISCs), application specific integrated circuits (ASICs), application specific instruction set processors (ASIP ), Central Processing Unit (CPU), Graphics Processing Unit (GPU), Physical Processing Unit (PPU), Microcontroller Unit, Digital Signal Processor (DSP), Field Programmable Gate Array (FPGA), Advanced RISC Machine (ARM ), programmable logic device (PLD), any circuit or processor capable of performing one or more functions, etc., or any combination thereof.
  • For purposes of illustration only, only one processor is depicted in the computing device 200. It should be noted, however, that the computing device 200 in this specification may also include multiple processors. Therefore, operations and/or method steps described in this specification as performed by one processor may also be performed jointly or separately by multiple processors. For example, if in this specification the processor of the computing device 200 performs operation A and operation B at the same time, it should be understood that operation A and operation B may also be performed jointly or separately by two or more different processors of the computing device (for example, a first processor performs operation A and a second processor performs operation B, or the first processor and the second processor perform operations A and B together).
  • I/O 230 may input or output signals, data and/or information. In some embodiments, I/O 230 may enable a user to interact with processing device 110 . In some embodiments, I/O 230 may include input devices and output devices. Exemplary input devices may include a keyboard, mouse, touch screen, microphone, etc., or combinations thereof. Exemplary output devices may include display devices, speakers, printers, projectors, etc., or combinations thereof. Exemplary display devices may include liquid crystal displays (LCDs), light emitting diode (LED) based displays, monitors, flat panel displays, curved screens, television devices, cathode ray tubes (CRTs), etc., or combinations thereof.
  • Communication port 240 may interface with a network (eg, network 150 ) to facilitate data communication.
  • the communication port 240 can establish a connection between the processing device 110 and the collection device 120 , the terminal 130 or the storage device 140 .
  • This connection can be a wired connection, a wireless connection, or a combination of both to enable data transmission and reception.
  • a wired connection may include electrical cables, fiber optic cables, telephone lines, etc., or any combination thereof.
  • Wireless connections may include Bluetooth, Wi-Fi, WiMax, WLAN, ZigBee, mobile networks (eg, 3G, 4G, 5G, etc.), etc., or combinations thereof.
  • the communication port 240 may be a standardized communication port, such as RS232, RS485, and the like.
  • communication port 240 may be a specially designed communication port.
  • the communication port 240 can be designed according to the voice signal to be transmitted.
  • FIG. 3 is a schematic diagram of exemplary hardware and/or software components of an exemplary mobile device 300 on which the terminal 130 may be implemented according to some embodiments of the present specification.
  • mobile device 300 may include communication unit 310 , display unit 320 , graphics processing unit (GPU) 330 , central processing unit (CPU) 340 , input/output 350 , memory 360 and storage 370 .
  • Central processing unit (CPU) 340 may include interface circuits and processing circuits similar to processor 220 .
  • Any other suitable components, including but not limited to a system bus or a controller (not shown), may also be included within the mobile device 300.
  • In some embodiments, a mobile operating system 362 (e.g., iOS™, Android™, Windows Phone™, etc.) and one or more application programs 364 may be loaded from the storage 370 into the memory 360 for execution by the central processing unit (CPU) 340.
  • The application programs 364 may include a browser or any other suitable mobile application for receiving and presenting information related to the target speech or its speech enhancement from the speech enhancement system on the mobile device 300. Interaction with signals and/or data may be accomplished via the input/output device 350 and provided to the processing engine 112 and/or other components of the speech enhancement system 100 via the network 150.
  • A computer hardware platform may be used as a hardware platform for one or more elements (e.g., the modules of the processing device 110 described in FIG. 1). Since these hardware elements, operating systems, and programming languages are common, it can be assumed that those skilled in the art are familiar with these techniques and are able to provide the information required for speech enhancement according to the techniques described herein.
  • a computer with a user interface can be used as a personal computer (PC) or other type of workstation or terminal device.
  • A computer with a user interface may also be used as a processing device such as a server. It is believed that those skilled in the art are familiar with the structure, programming, and general operation of this type of computer device; therefore, no additional explanation is provided with respect to the drawings.
  • Fig. 4 is an exemplary block diagram of a speech enhancement system according to some embodiments of the present specification.
  • speech enhancement system 100 may be implemented on processing device 110 .
  • the processing device 110 may include an acquisition module 410 , a processing module 420 and a generation module 430 .
  • The acquisition module 410 may be used to obtain the first signal and the second signal of the target speech.
  • the target speech may include the speech produced by the target sound source.
  • The first signal and the second signal may be collected by different acquisition devices (e.g., different microphones).
  • the first signal may be a target speech signal collected by the first microphone (or front microphone) based on the first position
  • the second signal may be a target speech signal collected by the second microphone (or rear microphone) based on the second position.
  • The acquisition module 410 may directly obtain the first signal and the second signal of the target speech from the different collection devices.
  • the first signal and the second signal may be stored in a storage device (eg, the storage device 140, the memory 210, the memory 370, an external storage device, etc.).
  • The acquisition module 410 may obtain the first signal and the second signal from the storage device.
  • The processing module 420 may be configured to process the first signal and the second signal to determine a first coefficient based on the target speech position, the first position, and the second position. In some embodiments, the processing module 420 may be configured to determine the first coefficient based on an Adaptive Null-Forming (ANF) algorithm. For example, the processing module 420 may be configured to perform a differential operation on the first signal and the second signal based on the target speech position, the first position, and the second position, to obtain a signal pointing in the first direction and a signal pointing in the second direction. The signal pointing in the first direction and the signal pointing in the second direction contain the effective signal in different proportions.
  • the processing module 420 may be configured to determine a fourth signal based on the signal pointing in the first direction and the signal pointing in the second direction. For example, the processing module 420 may filter the signal pointing to the first direction and the signal pointing to the second direction through an adaptive filter based on a Wiener filtering algorithm to determine the fourth signal. In some embodiments, the processing module 420 may enhance the low frequency components in the fourth signal to obtain the third signal. Further, the processing module 420 may be configured to determine the first coefficient based on the third signal. For example, the processing module 420 may determine the ratio of the third signal to the first signal or the second signal as the first coefficient. Alternatively or additionally, the processing module 420 may update an adaptive parameter of the adaptive difference operation based on the fourth signal, the signal pointing in the first direction, and the signal pointing in the second direction.
  • the processing module 420 may also determine an estimated signal-to-noise ratio of the target speech based on the signal directed in the first direction and the signal directed in the second direction.
  • the estimated signal-to-noise ratio may be a ratio of a signal pointing in a first direction to a signal pointing in a second direction. Further, the processing module 420 determines the first coefficient based on the estimated signal-to-noise ratio.
  • the processing module 420 may also be configured to determine multiple parameters related to multiple sound source directions based on the first signal and the second signal.
  • the multiple sound source directions may include preset sound source directions.
  • the processing module 420 may perform a differential operation on the first signal and the second signal based on each sound source direction, the first position and the second position, to determine parameters related to each sound source direction.
  • the parameters may include a likelihood function. Each parameter may correspond to a probability of sound emanating from a sound source direction to form the first signal and the second signal.
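  • The sketch below shows one plausible reading of this step: for each candidate direction, the inter-microphone delay implied by that direction is used in a differential (null-steering) operation, and the residual energy is turned into a likelihood-like parameter. The plane-wave delay model, the sign convention, and the exp(-energy) likelihood form are assumptions for illustration.

```python
import numpy as np

def direction_parameters(sig1: np.ndarray, sig2: np.ndarray, freqs: np.ndarray,
                         mic_distance: float, directions_deg, speed_of_sound: float = 343.0):
    """Likelihood-like parameter per candidate sound source direction.

    sig1, sig2: frequency-domain frames from the first and second microphones
    freqs:      frequency of each bin in Hz
    """
    params = []
    for theta in np.deg2rad(np.asarray(directions_deg, dtype=float)):
        # Inter-microphone delay for a plane wave arriving from direction theta (assumed model).
        tau = mic_distance * np.cos(theta) / speed_of_sound
        # Differential operation: align the second signal with the first for this direction
        # and subtract, which nulls sound actually coming from theta.
        residual = sig1 - sig2 * np.exp(1j * 2.0 * np.pi * freqs * tau)
        energy = float(np.mean(np.abs(residual) ** 2))
        # Small residual energy -> high probability that the sound came from this direction.
        params.append(np.exp(-energy))
    return np.asarray(params)
```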
  • the processing module 420 may also be configured to determine a second coefficient based on the multiple parameters and the target speech position. In some embodiments, in order to determine the second coefficient, the processing module 420 may determine the synthetic sound source direction based on the multiple parameters.
  • the synthetic sound source can be regarded as a virtual sound source formed by the combination of the target sound source and the noise source.
  • the processing module 420 may determine the parameter with the largest numerical value among the plurality of parameters. The parameter with the largest value may indicate that a sound is emitted from a corresponding sound source direction to form the first signal and the second signal with the highest probability. Therefore, the processing module 420 may determine that the sound source direction corresponding to the parameter with the largest value is the direction of the synthesized sound source.
  • the processing module 420 may determine the second coefficient based on the synthesized sound source direction and the target voice position. For example, the processing module 420 may determine whether the target voice position is located in the direction of the synthesized sound source, or whether the target voice position is within a certain angle range of the direction of the synthesized sound source. The second coefficient is set to a first value in response to the target voice position being located in the synthesized sound source direction or within a certain angle range of the synthesized sound source direction. In response to the target voice position not being located in the synthesized sound source direction or not within a certain angle range of the synthesized sound source direction, the second coefficient is set to a second value. Alternatively or additionally, the processing module 420 may smooth the second coefficient based on a smoothing factor. For another example, the processing module 420 may determine the second coefficient through a regression function based on the angle between the target voice position and the synthesized sound source direction.
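  • A minimal sketch of both variants described above, with an assumed angular tolerance, assumed first/second values, and an assumed cosine roll-off standing in for the unspecified regression function:

```python
import numpy as np

def second_coefficient_two_valued(params, directions_deg, target_deg: float,
                                  tolerance_deg: float = 30.0,
                                  first_value: float = 1.0, second_value: float = 0.1) -> float:
    """Two-valued variant: compare the synthesized sound source direction with the target."""
    directions_deg = np.asarray(directions_deg, dtype=float)
    synth_deg = directions_deg[int(np.argmax(params))]  # most probable (synthesized) direction
    if abs(synth_deg - target_deg) <= tolerance_deg:    # target lies in (or near) that direction
        return first_value
    return second_value

def second_coefficient_regression(synth_deg: float, target_deg: float) -> float:
    """Regression variant: map the angle between the two directions smoothly onto [0, 1]."""
    angle = np.deg2rad(abs(synth_deg - target_deg))
    return float(max(np.cos(angle), 0.0))
```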
  • the generating module 430 may be configured to process the first signal and/or the second signal based on the first coefficient and the second coefficient to obtain a speech-enhanced first output speech signal corresponding to the target speech.
  • The generation module 430 may be configured to perform weighting processing on the first signal and/or the second signal based on the first coefficient and the second coefficient. For example, the generation module 430 may assign a corresponding weight to the third signal (which is obtained based on the first signal and the second signal) according to the value of the first coefficient, and assign a corresponding weight to the first signal or the third signal according to the value of the second coefficient.
  • the generation module 430 may further process the weighted signal to obtain the first output speech signal after speech enhancement.
  • the first output speech signal may be a weighted average of the third signal and the first signal.
  • the first output voice signal may be a weighted product of the third signal and the first signal.
  • the first output speech signal may be the larger value of the weighted third signal and the first signal.
  • the generation module 430 may weight the third signal based on the first coefficient, and then perform weighting processing again based on the second coefficient.
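  • Two of the combination strategies mentioned above, sketched under the assumption that the signals are per-frequency-bin arrays; the sequential weighting matches the last variant, and the mixing weight in the averaged variant is illustrative only:

```python
import numpy as np

def first_output_sequential(sig3: np.ndarray, coeff1: np.ndarray, coeff2: float) -> np.ndarray:
    """Weight the third signal by the first coefficient, then again by the second coefficient."""
    return coeff2 * (coeff1 * sig3)

def first_output_weighted_average(sig1: np.ndarray, sig3: np.ndarray,
                                  coeff1: np.ndarray, coeff2: float,
                                  mix: float = 0.5) -> np.ndarray:
    """Weighted average of the weighted third signal and the first signal (mix is assumed)."""
    return mix * (coeff1 * coeff2 * sig3) + (1.0 - mix) * sig1
```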
  • The above description of the processing device 110 and its modules is only for convenience of description and does not limit this specification to the scope of the illustrated embodiments. It can be understood that, after understanding the principle of the system, those skilled in the art may combine the various modules arbitrarily, or form a subsystem to connect with other modules, without departing from this principle.
  • the acquisition module 410 and the processing module 420 disclosed in FIG. 4 may be different modules in one system, or one module may realize the functions of the above-mentioned two or more modules.
  • For example, the acquisition module 410 and the processing module 420 may be two separate modules, or a single module may have both the function of acquiring the target speech and the function of processing it. Such variations are within the protection scope of this specification.
  • system and its modules shown in FIG. 4 can be implemented in various ways.
  • the system and its modules may be implemented by hardware, software, or a combination of software and hardware.
  • the hardware part can be implemented by using dedicated logic;
  • the software part can be stored in a memory and executed by an appropriate instruction execution system, such as a microprocessor or specially designed hardware.
  • For example, the software may be provided as processor control code on a carrier medium such as a magnetic disk, CD, or DVD-ROM, on a programmable memory such as read-only memory (firmware), or on a data carrier such as an optical or electronic signal carrier.
  • The system and its modules in this specification may be implemented not only by hardware circuits such as very-large-scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field programmable gate arrays and programmable logic devices, but also by software executed by various types of processors, or by a combination of the above hardware circuits and software (for example, firmware).
  • Fig. 5 is an exemplary flowchart of a speech enhancement method according to some embodiments of this specification.
  • the method 500 may be executed by the processing device 110 , the processing engine 112 , or the processor 220 .
  • The method 500 may be stored in a storage device (for example, the storage device 140 or a storage unit of the processing device 110) in the form of a program or instructions; the method 500 may be implemented when the processing device 110, the processing engine 112, the processor 220, or the modules shown in FIG. 4 execute the program or instructions.
  • method 500 may be accomplished with one or more additional operations/steps not described below, and/or without one or more operations/steps discussed below. Additionally, the order of operations/steps shown in FIG. 5 is not limiting.
  • In step 510, the processing device 110 (for example, the acquisition module 410) may acquire the first signal and the second signal of the target speech.
  • the target speech may include the speech produced by the target sound source.
  • the target sound source can be a user, a robot (for example, an automatic answering robot, a robot that converts human input data such as text, gesture, etc. into a voice signal broadcast, etc.), or other creatures and devices that can send out voice information.
  • speech from the target sound source can be used as an effective signal.
  • the target speech may also include useless or disturbing noise signals, for example, noise generated by the surrounding environment or sounds from other sound sources other than the target sound source. Exemplary noise may include additive noise, white noise, multiplicative noise, etc., or any combination thereof.
  • Additive noise refers to an independent noise signal that has nothing to do with the speech signal
  • multiplicative noise refers to a noise signal proportional to the speech signal
  • white noise refers to a noise signal whose power spectrum is a constant.
  • When the target speech includes noise signals, the target speech may be a composite of the effective signal and the noise signals.
  • the synthesized signal may be equivalent to a speech signal emitted by a synthesized sound source of the target sound source and the noise sound source.
  • FIG. 6 is a schematic diagram of exemplary dual microphones according to some embodiments of this specification.
  • As shown in FIG. 6, the angle formed between the direction from the target sound source (such as the user's mouth) to the dual microphones (for example, the direction from the target sound source to the first microphone A) and the line connecting the dual microphones is θ.
  • The first signal Sig1 may be the signal of the target speech collected at the first position by the first microphone A (or front microphone), and the second signal Sig2 may be the signal of the target speech collected at the second position by the second microphone B (or rear microphone).
  • the first position and the second position may be two positions with a distance d and different distances from the target sound source (such as the user's mouth).
  • d can be set according to actual needs, for example, in a specific scene, d can be set to not less than 0.5cm, or not less than 1cm.
  • The first signal or the second signal may include an electrical signal (or an electrical signal generated after further processing) generated by the collection device after receiving the target speech, which may reflect the position information of the target speech relative to the collection device.
  • the first signal and the second signal may be representations of the target speech in the time domain.
  • the processing device 110 may frame the signals acquired by the first microphone A and the second microphone B to obtain the first signal and the second signal respectively.
  • For example, the processing device 110 may divide the signal acquired by the first microphone into multiple segments in the time domain (for example, equal or overlapping segments with a duration of 10-30 ms); each segment can be used as a frame signal, and the first signal can include one or more frame signals.
  • the first signal and the second signal may be representations of the target speech in the frequency domain.
  • the processing device 110 may perform Fast Fourier Transform (Fast Fourier Transform, FFT) on the above-mentioned one or more frame signals to obtain the first signal or the second signal.
  • the frame signal may be subjected to windowing and smoothing processing.
  • For example, the processing device 110 may multiply the frame signal by a window function, so as to periodically extend the frame signal and obtain a periodic continuous signal.
  • window functions may include rectangular windows, Hanning windows, flat top windows, exponential windows, and the like.
  • the windowed and smoothed frame signal may be further subjected to FFT transformation to generate the first signal or the second signal.
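  • The following is a minimal illustrative sketch (in Python, assuming numpy) of the preprocessing described above, i.e., framing, Hanning windowing, and per-frame FFT; the sampling rate, frame length, and hop size are hypothetical example values, not values prescribed by this specification.

```python
import numpy as np

def preprocess(signal, fs=16000, frame_ms=20, hop_ms=10):
    """Split a time-domain signal into overlapping frames, apply a Hanning
    window to each frame, and return the per-frame FFT spectra."""
    frame_len = int(fs * frame_ms / 1000)    # e.g. 320 samples at 16 kHz
    hop_len = int(fs * hop_ms / 1000)        # 50% overlap between frames
    window = np.hanning(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window   # windowing / smoothing
        spectra.append(np.fft.rfft(frame))                  # FFT of one frame
    return np.array(spectra)   # shape: (num_frames, frame_len // 2 + 1)
```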
  • the difference between the first signal and the second signal may be related to the intensity, signal amplitude, phase difference, etc. of the target speech and noise signals at different collection positions.
  • the processing device 110 may process the first signal and the second signal to determine first coefficients based on the target speech position, the first position and the second position.
  • the processing device 110 may determine the first coefficient based on an Adaptive Null-Forming (ANF) algorithm.
  • the ANF algorithm may include two differential beamformers and an adaptive filter.
  • the processing device 110 may perform a differential operation on the first signal and the second signal through the two differential beamformers based on the target speech position, the first position, and the second position, to obtain a signal pointing in the first direction and a signal pointing in the second direction.
  • the processing device 110 may perform time-delay processing on the first signal and the second signal based on the target voice position, the first position and the second position according to the differential microphone principle.
  • a differential operation is performed on the time-delayed first signal and the second signal to obtain a signal pointing to the first direction and a signal pointing to the second direction.
  • the signal directed in the first direction is a signal directed in the direction of the target sound source
  • the signal directed in the second direction is a signal directed in a direction opposite to the target sound source.
  • the signal directed in the first direction and the signal directed in the second direction contain effective signals in different proportions.
  • the effective signal refers to the voice emitted by the target sound source.
  • the signal pointing in the first direction may contain a larger proportion of valid signals (and/or a smaller proportion of noise signals).
  • the signals pointing in the second direction may contain a smaller proportion of effective signals (and/or a greater proportion of noise signals).
  • the signal directed in the first direction and the signal directed in the second direction may correspond to two directed microphones.
  • the processing device 110 may determine a third signal corresponding to the effective signal based on the signal pointing in the first direction and the signal pointing in the second direction.
  • the processing device 110 may perform an adaptive difference operation on the signal pointing in the first direction and the signal pointing in the second direction to determine the fourth signal.
  • the processing device 110 may filter the signal pointing in the first direction and the signal pointing in the second direction through the adaptive filter based on the Wiener filtering algorithm, and determine the fourth signal.
  • the processing device 110 may adjust the parameters of the adaptive filter so that the zero point of the cardioid diagram corresponding to the fourth signal points to the noise direction. In some embodiments, the processing device 110 may enhance the low frequency components in the fourth signal, so as to obtain the third signal. Further, the processing device 110 may determine the first coefficient based on the third signal. For example, the first coefficient may be a ratio of the third signal to the first signal or the second signal. For more content about determining the first coefficient based on the third signal, refer to FIG. 8 and its description, and details are not repeated here.
  • the processing device 110 may determine an estimated signal-to-noise ratio of the target speech based on the signal directed in the first direction and the signal directed in the second direction.
  • the estimated signal-to-noise ratio may be a ratio between the signal pointing in the first direction and the signal pointing in the second direction.
  • the processing device 110 may determine the first coefficient based on the estimated signal-to-noise ratio.
  • the processing device 110 may determine the first coefficient based on a mapping relationship between the estimated signal-to-noise ratio and the first coefficient.
  • the mapping relationship can be in various forms, for example, a mapping relationship database or a relational function. For more content about determining the first coefficient based on the estimated signal-to-noise ratio, refer to FIG. 9 and its description, and details are not repeated here.
  • the first coefficient may reflect the impact of the noise signal on the valid signal.
  • the estimated signal-to-noise ratio may be a ratio between the signal pointing in the first direction and the signal pointing in the second direction.
  • the signal pointing in the first direction may contain a relatively large proportion of valid signals (and/or a relatively small proportion of noise signals).
  • the signals pointing in the second direction may contain a smaller proportion of effective signals (and/or a greater proportion of noise signals). Therefore, the magnitude of the noise signal can affect the value of the estimated signal-to-noise ratio and thus the value of the first coefficient.
  • The larger the noise signal, the smaller the value of the estimated signal-to-noise ratio, and the value of the first coefficient determined according to the estimated signal-to-noise ratio changes accordingly. Therefore, the first coefficient can reflect the influence of the noise signal on the effective signal.
  • the first coefficient is related to the noise source direction. For example, when the direction of the noise source is close to the direction of the target sound source, the first coefficient may have a larger value; when the direction of the noise source deviates from the direction of the target sound source by a larger angle, the first coefficient may have a smaller value.
  • the noise source direction is the direction of the noise source relative to the dual microphones
  • the target sound source direction is the direction of the target sound source (such as the user's mouth) relative to the dual microphones.
  • the processing device 110 may process a third signal corresponding to the effective signal according to the first coefficient.
  • the first coefficient may represent the weight of the third signal corresponding to the effective signal in the speech enhancement process. As an example only, when the first coefficient is "1", it means that the third signal can be completely retained as a part of the enhanced speech signal; when the first coefficient is "0", it means that the third signal is completely filtered out.
  • When the angle difference between the noise source direction and the target sound source direction is relatively large, the first coefficient may have a smaller value, and the third signal processed according to the first coefficient may be weakened or removed; when the angle difference between the noise source direction and the target sound source direction is small, the first coefficient may have a larger value, and the third signal processed according to the first coefficient may be retained as a part of the enhanced speech signal. Therefore, when the angle difference between the direction of the noise source and the direction of the target sound source is large, the ANF algorithm can have a better filtering effect.
  • FIG. 7 is a schematic diagram of the filtering effect of the ANF algorithm at different noise angles according to some embodiments of this specification.
  • the noise angle refers to the angle between the noise source direction and the target sound source direction.
  • Figures a-f represent the filtering effects of the ANF algorithm when the noise angles are 180°, 150°, 120°, 90°, 60°, and 30°, respectively. According to Fig. 7, it can be seen that when the noise angle is large (for example, 180°, 150°, 120°, 90°), the filtering effect of the ANF algorithm is better. When the noise angle is small (for example, 60°, 30°), the filtering effect of the ANF algorithm is poor.
  • the processing device 110 may determine a plurality of parameters related to a plurality of sound source directions based on the first signal and the second signal.
  • the multiple sound source directions may include preset sound source directions.
  • the multiple sound source directions may have preset incident angles (eg, 0°, 30°, 60°, 90°, 120°, 180°, etc.).
  • the direction of the sound source can be selected and/or adjusted according to actual needs, and there is no limitation here.
  • the processing device 110 may perform a differential operation on the first signal and the second signal based on each sound source direction, the first position and the second position, and determine Parameters related to source direction.
  • each sound source direction in the multiple sound source directions may correspond to a time delay, and the processing device 110 may perform a differential operation on the first signal and the second signal based on the time delay.
  • the processing device 110 may calculate a parameter related to the sound source direction based on the difference operation.
  • the parameters may include a likelihood function.
  • the processing device 110 may calculate a likelihood function corresponding to each sound source direction in the plurality of sound source directions.
  • the likelihood function may correspond to a probability of sound emanating from the direction of the sound source to form the first signal and the second signal.
  • the processing device 110 may determine a second coefficient based on the plurality of parameters and the target speech position.
  • the processing device 110 may determine the direction of the synthesized sound source based on the plurality of parameters.
  • the synthetic sound source can be considered as a virtual sound source formed by the combination of the target sound source and the noise source; that is to say, the signals (for example, the first signal and the second signal) may be regarded as being equivalently generated by the synthetic sound source at the dual microphones.
  • the processing device 110 may determine the parameter with the largest value among the plurality of parameters.
  • the parameter with the largest value may indicate that a sound is emitted from a corresponding sound source direction to form the first signal and the second signal with the highest probability. Therefore, the processing device 110 may determine that the sound source direction corresponding to the parameter with the largest value is the direction of the synthesized sound source.
  • the processing device 110 may construct a plurality of directional microphones whose poles point in the direction of the plurality of sound sources. The response of each of the plurality of directional microphones is a cardioid.
  • the cardioid diagrams corresponding to the multiple sound source directions may be referred to as simulated cardioid diagrams.
  • the poles of each simulated cardioid can point in the direction of the corresponding sound source.
  • the processing device 110 may calculate a likelihood function corresponding to each sound source direction in the plurality of sound source directions.
  • the response of the likelihood function corresponding to multiple sound source directions may be a cardioid graph.
  • the cardioid corresponding to the likelihood function may be called a synthetic cardioid (or an actual cardioid).
  • the poles of the synthetic cardioid point in the direction of the synthetic sound source.
  • the processing device 110 may determine the simulated cardioid whose pole is closest to the pole of the actual cardioid, and determine the sound source direction corresponding to that simulated cardioid as the direction of the synthesized sound source.
  • the processing device 110 may determine the second coefficient based on the synthesized sound source direction and the target voice position. For example, the processing device 110 may determine whether the target voice position is located in the direction of the synthesized sound source, or whether the target voice position is within a certain angle range of the direction of the synthesized sound source. The second coefficient is set to a first value in response to the target voice position being located in the synthesized sound source direction or within a certain angle range of the synthesized sound source direction. In response to the target voice position not being located in the synthesized sound source direction or not within a certain angle range of the synthesized sound source direction, the second coefficient is set to a second value.
  • the processing device 110 may smooth the second coefficients based on a smoothing factor.
  • the processing device 110 may determine the second coefficient through a regression function based on an angle between the target voice position and the synthesized sound source direction. For more details about determining the second coefficient, refer to FIG. 10 and its description, which will not be repeated here.
  • the second coefficient may reflect the direction of the synthesized sound source relative to the target sound source, thereby attenuating or removing synthesized sound sources that are not in the direction of the target sound source and/or synthesized sound sources that are offset by an angle relative to the direction of the target sound source.
  • the second coefficient may be used to filter out noise whose angle difference between the direction of the noise source and the direction of the target sound source exceeds a certain threshold. For example, when the angle difference between the noise source direction and the target sound source direction exceeds a certain threshold, the second coefficient may have a smaller value; when the angle difference between the noise source direction and the target sound source direction is less than a certain threshold, the second coefficient may have a larger value.
  • the processing device 110 may process the first signal or the third signal according to the second coefficient.
  • the second coefficient may represent the weight of the first signal in the speech enhancement process. As an example only, when the second coefficient is "1", it means that the first signal can be completely retained as a part of the enhanced speech signal; when the second coefficient is "0", it means that the first signal is completely filtered out.
  • Step 550 the processing device 110 (for example, the generating module 430) may process the first signal and/or the second signal based on the first coefficient and the second coefficient to obtain the first output speech signal corresponding to the speech-enhanced target speech.
  • the processing device 110 may perform weighting processing on the first signal and/or the second signal based on the first coefficient and the second coefficient. Taking the first coefficient as an example, the processing device 110 may assign corresponding weights to the third signal acquired based on the first signal and the second signal according to the value of the first coefficient. For example, the processing device 110 may assign a corresponding weight to the third signal according to the range of the first coefficient. For another example, the processing device 110 may directly use the value of the first coefficient as the weight of the third signal. For another example, when the value of the first coefficient is smaller than the preset first coefficient threshold, the processing device 110 may set the weight of the third signal to 0.
  • the processing device 110 may assign corresponding weights to the first signal or the third signal according to the value of the second coefficient.
  • the processing device 110 may further process the above-mentioned weighted signal to obtain the speech-enhanced first output speech signal.
  • the first output speech signal may be a weighted average of the third signal and the first signal.
  • the first output voice signal may be a weighted product of the third signal and the first signal.
  • the first output speech signal may be the larger value of the weighted third signal and the first signal.
  • the generation module 430 may weight the third signal based on the first coefficient, and then perform weighting processing again based on the second coefficient.
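  • As an illustration only, the sketch below shows one possible way of applying the first coefficient to the third signal and then the second coefficient to the result; the threshold value and the combination rule are hypothetical examples, not the only weighting forms described above.

```python
import numpy as np

def apply_coefficients(third_signal, first_coeff, second_coeff,
                       first_coeff_threshold=0.1):
    """Weight the third signal by the first coefficient, then by the
    second coefficient; zero it out below a hypothetical threshold."""
    w1 = 0.0 if first_coeff < first_coeff_threshold else first_coeff
    weighted = third_signal * w1          # weighting based on the first coefficient
    return weighted * second_coeff        # weighting again based on the second coefficient
```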
  • the speech enhancement method 500 may also include a single microphone filtering process.
  • the processing device 110 may perform single-mic filtering on the first output voice signal based on a single-mic filtering algorithm.
  • the processing device 110 may process the first signal and/or the second signal based on a single-mic filter algorithm to obtain a third coefficient, and filter the first output speech signal based on the third coefficient.
  • For more content, refer to FIG. 11, FIG. 12 and their descriptions; details will not be repeated here.
  • the processing device 110 may also perform speech enhancement processing based on a fourth coefficient. For example, the processing device 110 may determine the fourth coefficient based on the energy difference between the first signal and the second signal, and process the first signal and/or the second signal based on any one or a combination of the first coefficient, the second coefficient, and the fourth coefficient to obtain the speech-enhanced output speech signal. For more content about speech enhancement based on the fourth coefficient, refer to FIG. 13 and its description, and details are not repeated here.
  • the above speech enhancement method 500 may be implemented on the first signal and/or the second signal obtained after preprocessing (for example, framing, windowing and smoothing, FFT transformation, etc.). That is to say, the first output speech signal may be a single-frame speech signal.
  • the speech enhancement method 500 may also include post-processing. Exemplary post-processing may include inverse FFT transform, frame stitching, and the like. After the post-processing process, the processing device 110 can obtain continuous output speech signals.
  • Fig. 8 is an exemplary flowchart of a method for determining a first coefficient according to some embodiments of the present specification.
  • the method 800 may be executed by the processing device 110 , the processing engine 112 , or the processor 220 .
  • the method 800 may be stored in a storage device (for example, the storage device 140 or the storage unit of the processing device 110) in the form of a program or instructions; when the processing device 110, the processing engine 112, the processor 220, or the modules shown in FIG. 4 execute the program or instructions, the method 800 may be implemented.
  • operation 520 described in method 500 may be implemented by method 800 .
  • method 800 may be accomplished with one or more additional operations/steps not described below, and/or without one or more operations/steps discussed below. Additionally, the order of operations/steps shown in FIG. 8 is not limiting.
  • the processing device 110 may determine the first coefficient based on an Adaptive Null-Forming (ANF) algorithm.
  • the ANF algorithm may include two differential beamformers and an adaptive filter.
  • the two differential beamformers may perform differential processing on the first signal and the second signal to form a signal pointing in the first direction and a signal pointing in the second direction.
  • the adaptive filter may perform adaptive filtering on the signal directed in the first direction and the signal directed in the second direction to obtain a third signal corresponding to the effective signal.
  • method 800 may include:
  • Step 810 the processing device 110 may perform a differential operation on the first signal and the second signal based on the target speech position, the first position and the second position, to obtain a signal pointing in the first direction and a Signal pointing in the second direction.
  • the processing device 110 may perform time-delay processing on the first signal and the second signal based on the target voice position, the first position, and the second position according to the differential microphone principle. For example, as shown in Figure 6, the distance between the front microphone A and the rear microphone B is d, and the target sound source has an incident angle θ; the propagation time of the target sound source between the front microphone A and the rear microphone B can then be expressed as:
  • the propagation time τ may serve as a time delay between the first signal and the second signal.
  • t represents each time point of the current frame
  • sig1 represents the first signal
  • sig2 represents the second signal
  • the first signal sig1 and the second signal sig2 are respectively time-delayed by the time delay τ and differenced, so that the signal x_s pointing in the first direction and the signal x_n pointing in the second direction can be obtained.
  • the signal x_s pointing in the first direction may correspond to a first directional microphone; the response of the first directional microphone is a cardioid whose pole points to the direction of the target sound source.
  • the signal x_n pointing in the second direction may correspond to a second directional microphone; the response of the second directional microphone is a cardioid whose zero point points to the direction of the target sound source.
  • the signal x_s pointing in the first direction and the signal x_n pointing in the second direction may contain different proportions of effective signals.
  • the signal x_s pointing in the first direction may contain a larger proportion of effective signals (and/or a smaller proportion of noise signals).
  • the signal x_n pointing in the second direction may contain a smaller proportion of effective signals (and/or a greater proportion of noise signals).
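  • A minimal sketch of the delay-and-subtract differential beamforming step described above; the far-field delay model τ = d·cos(θ)/c and the integer-sample delay are simplifying assumptions for illustration, not the exact expressions of this specification.

```python
import numpy as np

def differential_beamform(sig1, sig2, d=0.01, theta=0.0, fs=16000, c=343.0):
    """Form a target-pointing signal x_s (pole toward the target) and a
    noise-pointing signal x_n (null toward the target) from two microphones."""
    tau = d * np.cos(theta) / c                 # assumed far-field propagation delay
    delay = max(1, int(round(tau * fs)))        # delay in whole samples (illustrative)
    x_s = sig1[delay:] - sig2[:-delay]          # delay sig2, subtract: front cardioid
    x_n = sig2[delay:] - sig1[:-delay]          # delay sig1, subtract: rear cardioid
    return x_s, x_n
```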
  • the processing device 110 may perform an adaptive difference operation on the signal pointing in the first direction and the signal pointing in the second direction to determine a fourth signal.
  • the processing device 110 may filter the signal pointing in the first direction and the signal pointing in the second direction through the adaptive filter based on a Wiener filtering algorithm.
  • the adaptive filter may be a Least Mean Square (LMS) filter.
  • the processing device 110 may perform adaptive filtering (that is, an adaptive difference operation) on the signal x_s pointing in the first direction and the signal x_n pointing in the second direction through an LMS filter, to determine the fourth signal.
  • the fourth signal may be a signal after filtering out noise.
  • an exemplary process of adaptive difference operation may be as shown in the following formula:
  • y represents the fourth signal (that is, the output signal of the LMS filter)
  • w represents the adaptive parameter of the adaptive difference operation (that is, the coefficient of the LMS filter).
  • the processing device 110 may update the adaptive parameter w of the adaptive difference operation based on the fourth signal y, the signal x_s pointing in the first direction, and the signal x_n pointing in the second direction. For example, based on each frame of the first signal and the second signal, the processing device 110 may acquire a signal x_s pointing in the first direction and a signal x_n pointing in the second direction. Further, the processing device 110 may update the adaptive parameter w through a gradient descent method, so that the loss function (for example, the mean square error loss function) of the adaptive difference operation gradually converges.
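  • A minimal sketch of the adaptive difference operation with an LMS update; the single-tap form y = x_s − w·x_n and the step size mu are assumptions for illustration, not the exact formulation of this specification.

```python
import numpy as np

def anf_lms(x_s, x_n, mu=0.01):
    """Adaptively subtract the noise-pointing signal x_n from the
    target-pointing signal x_s using a one-tap LMS filter."""
    w = 0.0
    y = np.zeros_like(x_s, dtype=float)
    for n in range(len(x_s)):
        y[n] = x_s[n] - w * x_n[n]        # adaptive difference (fourth signal)
        w += mu * y[n] * x_n[n]           # gradient-descent update of the weight w
    return y, w
```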
  • the processing device 110 may enhance the low-frequency components in the fourth signal to obtain the third signal.
  • the differential beamformer may have high-pass filtering properties.
  • When the differential beamformer is used to perform differential processing on the first signal and the second signal, the low-frequency components in the first signal and the second signal may be weakened.
  • Accordingly, the low-frequency components in the fourth signal y obtained after the adaptive difference operation are weakened.
  • the processing device 110 may enhance the low frequency components in the fourth signal y through a compensation filter.
  • the compensation filter may be as shown in the following formula:
  • W_EQ represents the compensation filter
  • ω represents the frequency of the fourth signal y
  • ω_c represents the cutoff frequency of the high-pass filter.
  • an exemplary value of ω_c may be:
  • c represents the speed of sound propagation
  • d represents the distance between two microphones.
  • the processing device 110 may filter the fourth signal y based on the compensation filter W_EQ to obtain the third signal.
  • the third signal may be the product of the fourth signal y and the compensation filter W_EQ.
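  • Because the exact W_EQ expression is not reproduced above, the sketch below uses an illustrative frequency-domain equalizer that boosts components below an assumed cutoff ω_c derived from c and d; it is a sketch under those assumptions, not the compensation filter of this specification.

```python
import numpy as np

def compensate_low_freq(y, d=0.01, c=343.0, fs=16000):
    """Boost the low-frequency components of the fourth signal y in the
    frequency domain (illustrative equalizer, not the exact W_EQ)."""
    Y = np.fft.rfft(y)
    omega = 2 * np.pi * np.fft.rfftfreq(len(y), 1.0 / fs)   # angular frequency axis
    omega_c = c / d                                          # assumed cutoff value
    gain = np.where(omega < omega_c,
                    omega_c / np.maximum(omega, 1e-6),       # boost below the cutoff
                    1.0)                                      # leave higher bands unchanged
    gain = np.minimum(gain, 20.0)                             # cap the boost near DC
    return np.fft.irfft(Y * gain, n=len(y))
```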
  • step 840 the processing device 110 may determine the first coefficient based on the third signal.
  • the processing device 110 may determine a ratio of the third signal to the first signal or the second signal, and determine the first coefficient according to the ratio.
  • the operation of the method 800 in the above embodiment is to process the first signal and the second signal in the time domain. It should be understood that one or more operations in method 800 may also be performed in the frequency domain. For example, the delay processing performed on the first signal and the second signal in the time domain may also be equivalent phase shifting performed on the first signal and the second signal in the frequency domain.
  • step 830 is not necessary, that is, the fourth signal obtained in step 820 may be directly used as the third signal without low-frequency enhancement.
  • Fig. 9 is an exemplary flowchart of a method for determining a first coefficient according to some embodiments of the present specification.
  • the method 900 may be executed by the processing device 110 , the processing engine 112 , or the processor 220 .
  • the method 900 may be stored in a storage device (for example, the storage device 140 or the storage unit of the processing device 110) in the form of a program or instructions; when the processing device 110, the processing engine 112, the processor 220, or the modules shown in FIG. 4 execute the program or instructions, the method 900 may be implemented.
  • operation 520 described in method 500 may be implemented by method 900 .
  • method 900 may be accomplished with one or more additional operations/steps not described below, and/or without one or more operations/steps discussed below. Additionally, the sequence of operations/steps shown in FIG. 9 is not limiting. As shown in FIG. 9, method 900 may include:
  • the processing device 110 may perform a differential operation on the first signal and the second signal based on the target speech position, the first position and the second position, to obtain a signal pointing in the first direction and a signal pointing in the second direction.
  • the signal directed in the first direction and the signal directed in the second direction contain different proportions of valid signals.
  • step 910 may be performed in the same manner as step 810 described in FIG. 8, which will not be repeated here.
  • the processing device 110 may determine an estimated signal-to-noise ratio of the target speech based on the signal pointing in the first direction and the signal pointing in the second direction.
  • the signals pointing in the first direction may contain a larger proportion of valid signals (and/or a smaller proportion of noise signals).
  • the signals pointing in the second direction may contain a smaller proportion of effective signals (and/or a greater proportion of noise signals).
  • the estimated signal-to-noise ratio may be expressed as a ratio between the signal pointing in the first direction and the signal pointing in the second direction (i.e., x_s/x_n).
  • different estimated signal-to-noise ratios may correspond to different synthetic sound source incident angles θ.
  • a larger estimated signal-to-noise ratio may correspond to a smaller synthetic sound source incident angle θ.
  • the incident angle θ of the synthetic sound source can reflect the effect of the noise signal on the effective signal. For example, when the noise signal has a greater influence on the effective signal (for example, the angle difference between the noise source direction and the target sound source direction is larger), the synthetic sound source incident angle θ can have a larger value; when the noise signal has a smaller influence on the effective signal (for example, the angle difference between the direction of the noise source and the direction of the target sound source is small), the synthetic sound source incident angle θ may have a smaller value. Therefore, the estimated signal-to-noise ratio can reflect the direction of the synthesized sound source, and further reflect the influence of the noise signal on the effective signal.
  • the processing device 110 may determine the first coefficient based on the estimated signal-to-noise ratio.
  • the processing device 110 may determine the first coefficient based on a mapping relationship between the estimated signal-to-noise ratio and the first coefficient.
  • the mapping relationship can be in various forms, for example, a mapping relationship database or a relational function.
  • different noise source directions may correspond to different synthetic sound source incident angles θ, and correspondingly may correspond to different estimated signal-to-noise ratios. That is to say, the estimated signal-to-noise ratio may be related to the direction of the noise source (that is, related to the degree of influence of the noise signal on the effective signal). Thus, for different estimated signal-to-noise ratios, different first coefficients can be determined. For example, when the estimated signal-to-noise ratio is small, the corresponding synthetic sound source incident angle θ may have a larger value, indicating that the noise signal has a greater influence on the effective signal. Correspondingly, the third signal corresponding to the effective signal may contain a larger proportion of the noise signal.
  • the third signal can be weakened or removed by determining the value of the first coefficient.
  • the processing device 110 may process a third signal corresponding to the effective signal according to the first coefficient.
  • the first coefficient may represent the weight of the third signal corresponding to the effective signal in the speech enhancement process.
  • a mapping relationship database between the estimated signal-to-noise ratio and the first coefficient may be established. The processing device 110 may search the database based on the estimated signal-to-noise ratio to determine the first coefficient.
  • the processing device 110 may also determine the first coefficient based on a relationship function between the estimated signal-to-noise ratio and the first coefficient.
  • the relational function may be as shown in the following formula:
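  • The exact relational function referenced above is not reproduced here; purely as an illustration, the sketch below maps an estimated signal-to-noise ratio to a first coefficient in [0, 1] through a sigmoid, with the slope and midpoint as hypothetical parameters.

```python
import numpy as np

def first_coeff_from_snr(x_s, x_n, midpoint_db=0.0, slope=0.5, eps=1e-12):
    """Map an estimated SNR (energy ratio of the target-pointing signal to
    the noise-pointing signal) to a first coefficient in [0, 1]."""
    snr = (np.sum(x_s ** 2) + eps) / (np.sum(x_n ** 2) + eps)     # estimated SNR
    snr_db = 10.0 * np.log10(snr)
    return 1.0 / (1.0 + np.exp(-slope * (snr_db - midpoint_db)))  # illustrative sigmoid mapping
```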
  • Fig. 10 is an exemplary flowchart of a method for determining a second coefficient according to some embodiments of the present specification.
  • the method 1000 may be executed by the processing device 110, the processing engine 112, or the processor 220.
  • the method 1000 may be stored in a storage device (for example, the storage device 140 or the storage unit of the processing device 110) in the form of a program or instructions; when the processing device 110, the processing engine 112, the processor 220, or the modules shown in FIG. 4 execute the program or instructions, the method 1000 may be implemented.
  • operations 530 and 540 described in method 500 may be implemented by method 1000 .
  • method 1000 may be accomplished with one or more additional operations/steps not described below, and/or without one or more operations/steps discussed below. Additionally, the order of operations/steps shown in FIG. 10 is not limiting. As shown in Figure 10, method 1000 may include:
  • the processing device 110 may perform a differential operation on the first signal and the second signal based on each sound source direction, the first position and the second position, to determine parameters related to each sound source direction.
  • the multiple sound source directions may include preset sound source directions.
  • the direction of the sound source can be selected and/or adjusted according to actual needs, and there is no limitation here.
  • the processing device 110 may perform a differential operation on the first signal and the second signal based on each sound source direction, the first position and the second position.
  • the processing device 110 may perform a differential operation on the first signal and the second signal to construct multiple directional microphones.
  • the poles of the plurality of directional microphones may point in the directions of the plurality of sound sources.
  • the response of each of the plurality of directional microphones is a cardioid.
  • the cardioid diagrams corresponding to the multiple sound source directions may be referred to as simulated cardioid diagrams.
  • the poles of each simulated cardioid can point in the direction of the corresponding sound source.
  • the processing device 110 may calculate parameters related to the sound source direction based on the differential operation.
  • the parameters may include a likelihood function.
  • the processing device 110 may calculate a likelihood function corresponding to each sound source direction in the plurality of sound source directions.
  • the likelihood function may be as shown in the following formula:
  • LH_i(f, t) represents the likelihood function corresponding to frequency f at time t when the sound source direction is θ_i; sig1(f, t) and sig2(f, t) are the time-frequency domain representations of the first signal and the second signal, respectively; and -2πfτ_i in exp(-i2πfτ_i)·sig1(f, t) represents the phase difference of the sound propagating from the first position to the second position when the sound source direction is θ_i.
  • the likelihood function may correspond to the probability of sound emanating from the direction of the sound source to form the first signal and the second signal.
  • the responses of the likelihood functions corresponding to the directions of the multiple sound sources may be cardioid graphs.
  • the cardioid corresponding to the likelihood function may be called a synthetic cardioid (or an actual cardioid).
  • the processing device 110 may determine a synthetic sound source direction based on the multiple parameters.
  • the processing device 110 may determine the parameter with the largest value among the plurality of parameters. For example, the processing device 110 can calculate the likelihood functions LH_1(f, t), LH_2(f, t), ..., LH_n(f, t) corresponding to the plurality of sound source directions.
  • the likelihood function LH_i(f, t) may correspond to the probability that a sound is emitted from a sound source direction θ_i to form the first signal and the second signal.
  • the likelihood function with the largest numerical value may indicate that a sound is emitted from a corresponding sound source direction to form the first signal and the second signal with the largest probability.
  • the processing device 110 may determine the direction of the synthesized sound source based on the plurality of directional microphones.
  • the response of each directional microphone is a simulated cardioid.
  • the poles of the simulated cardioid map point to the corresponding sound source directions.
  • the responses of the likelihood functions corresponding to multiple sound source directions may also be a cardioid (referred to as a composite cardioid).
  • the poles of the synthetic cardioid point in the direction of the synthetic sound source.
  • the processing device 110 may compare the synthetic cardioid with the above-mentioned plurality of simulated cardioids, and determine the simulated cardioid whose pole is closest to the pole of the actual cardioid.
  • the direction of the sound source corresponding to the simulated cardioid graph may be determined as the direction of the synthetic sound source.
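  • A minimal sketch of scanning a set of candidate directions and selecting the one with the largest likelihood-like score; the steered cross-spectral score and the delay model τ_i = d·cos(θ_i)/c are assumptions for illustration, not the LH_i(f, t) formula of this specification.

```python
import numpy as np

def synth_source_direction(sig1_f, sig2_f, freqs, d=0.01, c=343.0,
                           angles_deg=(0, 30, 60, 90, 120, 150, 180)):
    """Pick the candidate direction whose steered combination of the two
    single-frame spectra has the largest magnitude (a likelihood-like score)."""
    best_angle, best_score = None, -np.inf
    for ang in angles_deg:
        tau = d * np.cos(np.deg2rad(ang)) / c            # assumed delay for this direction
        steered = np.exp(-1j * 2 * np.pi * freqs * tau) * sig1_f + sig2_f
        score = np.sum(np.abs(steered))                  # illustrative LH-like score
        if score > best_score:
            best_angle, best_score = ang, score
    return best_angle
```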
  • Step 1030 the processing device 110 may determine the second coefficient based on the synthesized sound source direction and the target voice position.
  • the processing device 110 may determine whether the target speech position is located in the direction of the synthesized sound source.
  • the processing device 110 may determine whether the direction of the synthesized sound source is 0°, and if the direction of the synthesized sound source is 0°, it may determine that the target voice position is located in the direction of the synthesized sound source. In some embodiments, the processing device 110 may determine whether the likelihood function with the largest value is located in the set where the target sound source dominates.
  • the processing device 110 may determine that the target speech position is located in the direction of the synthesized sound source.
  • the set dominated by the target sound source may be shown in the following formula:
  • LH_0(f, t) represents the likelihood function value when the target speech position is located in the direction of the synthesized sound source.
  • the signal corresponding to the time-frequency point (for example, the first signal or the third signal) may be a signal sent from the direction of the target sound source.
  • the target-sound-source-dominant set shown in the above formula (8) is merely an example.
  • the target voice position may not be located on the extension line of the line connecting the two microphones (that is, the incident angle is not 0°).
  • the angle between the target voice position and the extension line of the dual microphone connection is 30 degrees.
  • the processing device 110 may set the second coefficient to a first value (for example, 1). In response to the target speech position not being located in the synthesized sound source direction, the processing device 110 may set the second coefficient to a second value (for example, 0). In some embodiments, the processing device 110 may process the corresponding first signal or the third signal obtained after ANF filtering based on the second coefficient. For example, taking the first signal as an example, the second coefficient may be used as a weight of the first signal. For example, when the second coefficient is 1, it may indicate that the first signal is retained.
  • the first signal corresponding to the direction of the synthesized sound source may be regarded as a noise signal.
  • the processing device 110 may set the second coefficient to zero. Therefore, the processing device 110 may filter or attenuate the corresponding noise signal when the target speech position is not located in the direction of the synthesized sound source based on the second coefficient.
  • the second coefficients may constitute a masking matrix for filtering the input signal (eg, the first signal or the third signal).
  • the masking matrix may be as shown in the following formula:
  • the above masking matrix M is a binarization matrix, which can directly remove input signals judged to be noise. Therefore, the speech signal processed based on the masking matrix M may suffer from problems such as spectrum leakage and speech discontinuity.
  • the processing device 110 may smooth the second coefficients based on a smoothing factor. For example, based on a smoothing factor, the processing device 110 may smooth the second coefficients in the time domain.
  • the time domain smoothing process can be shown as the following formula:
  • α represents the smoothing factor
  • M(f, t-1) represents the masking matrix corresponding to the previous frame
  • M(f, t) represents the masking matrix corresponding to the current frame.
  • the smoothing factor α can be used to weight the masking matrix of the previous frame and the masking matrix of the current frame to obtain a smoothed masking matrix corresponding to the current frame.
  • the processing device 110 may also smooth the second coefficients in the frequency domain. For example, the processing device 110 may smooth the second coefficients using a sliding Hamming window.
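  • A minimal sketch of smoothing a binary masking matrix in time with a smoothing factor α and in frequency with a sliding Hamming window; the first-order recursive form and the window length are assumptions for illustration, not the exact smoothing rule of this specification.

```python
import numpy as np

def smooth_mask(raw_mask, alpha=0.7, ham_len=5):
    """raw_mask: binary masking matrix of shape (num_freqs, num_frames).
    Returns a mask smoothed over time (recursive) and frequency (Hamming)."""
    smoothed = np.zeros(raw_mask.shape, dtype=float)
    prev = np.zeros(raw_mask.shape[0])
    for t in range(raw_mask.shape[1]):
        prev = alpha * prev + (1.0 - alpha) * raw_mask[:, t]   # time smoothing
        smoothed[:, t] = prev
    ham = np.hamming(ham_len)
    ham /= ham.sum()
    for t in range(smoothed.shape[1]):                          # frequency smoothing
        smoothed[:, t] = np.convolve(smoothed[:, t], ham, mode="same")
    return smoothed
```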
  • the processing device 110 may determine that the direction of the synthetic sound source is 30°. Further, the processing device 110 may determine the second coefficient through a regression function based on an angle between the target sound source direction and the synthesized sound source direction. For example, the processing device 110 may construct a regression function between said angle and the second coefficient.
  • the regression function may comprise a smooth regression function, eg, a linear regression function.
  • the value of the regression function may decrease as the angle between the direction of the target sound source and the direction of the synthesized sound source increases. In this way, when the input signal is processed with the second coefficient as the weight, the input signal when the angle between the direction of the target sound source and the direction of the synthesized sound source is relatively large can be weakened or removed, thereby achieving the purpose of noise removal.
  • the processing device 110 may acquire the target voice signal through dual microphones, and filter the target voice signal based on a dual microphone filtering algorithm. For example, when the angle difference between the noise signal and the effective signal is relatively large (that is, the angle difference between the noise source direction and the target sound source direction is relatively large), the processing device 110 may perform filtering based on the first coefficient to remove the noise Signal. When the angle difference between the noise signal and the valid signal is small, the processing device 110 may perform filtering based on the second coefficient. In this way, the processing device 110 can basically filter out noise signals in the target speech signal.
  • the acquired first output speech signal after the above dual microphone filtering process may include residual noise.
  • the first output speech signal may include continuous noise on the magnitude spectrum. Therefore, in some embodiments, the processing device 110 may also perform post-filtering on the first output speech signal based on a single microphone filtering algorithm.
  • Fig. 11 is an exemplary flowchart of a single-microphone (single-mic) filtering method according to some embodiments of the present specification.
  • the method 1100 may be executed by the processing device 110 , the processing engine 112 , or the processor 220 .
  • the method 1100 may be stored in a storage device (for example, the storage device 140 or the storage unit of the processing device 110) in the form of a program or instructions; when the processing device 110, the processing engine 112, the processor 220, or the modules shown in FIG. 4 execute the program or instructions, the method 1100 may be implemented.
  • method 1100 may be accomplished with one or more additional operations/steps not described below, and/or without one or more operations/steps discussed below. Additionally, the order of operations/steps shown in FIG. 11 is not limiting. As shown in Figure 11, method 1100 may include:
  • the processing device 110 may determine at least one target subband signal in the first output speech signal.
  • the processing device 110 may determine the at least one target subband signal based on a signal-to-noise ratio of each subband signal in the first output speech signal. In some embodiments, the processing device 110 may acquire a plurality of sub-band signals based on the first output speech signal. For example, the processing device 110 may divide the first output voice signal into subbands based on the signal frequency band, and acquire multiple subband signals. As an example only, the processing device 110 may divide the first output speech signal into subbands according to the frequency band category of low frequency, medium frequency or high frequency, or may also divide the first output speech signal according to a specific frequency bandwidth (for example, every 2kHz is regarded as a frequency band) The voice signal is divided into subbands.
  • the processing device 110 may perform subband division based on signal frequency points of the first output voice signal.
  • the signal frequency point may refer to the value after the decimal point in the frequency value of the signal. For example, if the frequency value of the signal is 72.810, then the signal frequency point of the signal is 810.
  • Dividing the subbands based on signal frequency points may be to divide the signal into subbands according to a specific signal frequency point width, for example, signal frequency points 810-830 are used as a subband, and signal frequency points 600-620 are used as a subband.
  • the processing device 110 may obtain multiple sub-band signals by filtering, or perform sub-band division by other algorithms or devices to obtain multiple sub-band signals, which is not limited here.
  • the processing device 110 may calculate the signal-to-noise ratio of each sub-band signal.
  • Signal-to-noise ratio can refer to the ratio of speech signal energy to noise signal energy.
  • the signal energy may be signal power, other energy data obtained based on the signal power, and the like.
  • The larger the signal-to-noise ratio, the smaller the noise contained in the speech signal.
  • the signal-to-noise ratio of the sub-band signal can be the ratio of the energy of the pure speech signal (i.e., the effective signal) to the energy of the noise signal in the sub-band signal, or it can be the ratio of the energy of the noise-containing sub-band signal to the energy of the noise signal.
  • the processing device 110 may calculate the signal-to-noise ratio of each sub-band signal through a signal-to-noise ratio estimation algorithm. For example, for each subband signal, the processing device 110 may calculate the noise signal value in the subband signal based on a noise estimation algorithm. Exemplary noise estimation algorithms may include minimum pursuit algorithms, temporal recursive averaging algorithms, etc., or combinations thereof. Further, the processing device 110 may calculate the signal-to-noise ratio based on the original subband signal and noise signal values. In some embodiments, the processing device 110 may use the trained SNR estimation model to calculate the SNR of each sub-band signal.
  • Exemplary SNR estimation models may include, but are not limited to, a multi-layer perceptron (MLP), a decision tree (DT), a deep neural network (DNN), a support vector machine (SVM), a K-nearest neighbor (KNN) algorithm, and any other algorithm or model that can perform feature extraction and/or classification.
  • the SNR estimation model can be obtained by using training samples to train an initial model.
  • the training samples may include speech signal samples (for example, at least one historical speech signal, each of which contains a noise signal) and label values of the speech signal samples (for example, the signal-to-noise ratio of the historical speech signal v1 is 0.5, and the signal-to-noise ratio of the historical speech signal v2 is 0.6).
  • the speech signal samples are processed by the model to obtain a predicted signal-to-noise ratio. A loss function is constructed based on the predicted signal-to-noise ratio and the label value of the corresponding training sample, and the model parameters are adjusted based on the loss function to reduce the difference between the predicted signal-to-noise ratio and the label value. For example, the model parameters may be updated or adjusted based on a gradient descent method or the like.
  • When a preset condition is satisfied, the training ends, and a trained SNR estimation model is obtained.
  • the preset condition may be that the result of the loss function converges or is smaller than a preset threshold.
  • the processing device 110 may determine the target subband signal based on the signal-to-noise ratio of each subband signal. In some embodiments, the processing device 110 may determine the target sub-band signal based on a signal-to-noise ratio threshold. For example, for each sub-band signal, the processing device 110 may determine whether the signal-to-noise ratio of the sub-band signal is less than a signal-to-noise ratio threshold. In response to the signal-to-noise ratio of the sub-band signal being less than a signal-to-noise ratio threshold, the processing device 110 may determine that the sub-band signal is a target sub-band signal.
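  • A minimal sketch of selecting target sub-band signals whose signal-to-noise ratio falls below a threshold; the sub-band boundaries, the noise estimate, and the threshold value are illustrative assumptions.

```python
import numpy as np

def find_target_subbands(spectrum, noise_psd, band_edges, snr_threshold=1.0):
    """Return the indices of sub-bands whose SNR is below the threshold.
    spectrum, noise_psd: magnitude-squared spectra on the same frequency grid.
    band_edges: list of (start_bin, end_bin) tuples defining the sub-bands."""
    targets = []
    for i, (lo, hi) in enumerate(band_edges):
        sig_energy = np.sum(spectrum[lo:hi])
        noise_energy = np.sum(noise_psd[lo:hi]) + 1e-12
        if sig_energy / noise_energy < snr_threshold:   # low-SNR band -> needs filtering
            targets.append(i)
    return targets
```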
  • the processing device 110 may also determine the at least one target subband signal based on a preset subband range.
  • the preset sub-band range may be a preset frequency range.
  • the preset frequency range may be determined based on empirical values.
  • the empirical value may be a value obtained during a voice analysis process. As an example only, it may be found during the voice analysis process that the voice signal in the frequency range of 3000-4000 Hz usually contains a large proportion of noise; accordingly, the preset frequency range may at least include 3000-4000 Hz.
  • the processing device 110 may process the at least one target subband signal based on a single microphone filtering algorithm to obtain a second output speech signal.
  • the processing device 110 may process the at least one target subband signal based on a single-microphone filtering algorithm, thereby filtering out noise in the at least one target subband signal, and obtaining a noise-reduced second output speech signal .
  • Exemplary single-mic filtering algorithms may include spectral subtraction, Wiener filtering algorithms, minimum-controlled recursive averaging algorithms, speech generation model algorithms, etc., or combinations thereof.
  • the processing device 110 may further process the first output voice signal obtained through the dual-mic filtering algorithm according to the single-mic filtering algorithm.
  • the processing device 110 may filter part of the sub-band signals (for example, signals of a specific frequency) in the first output speech signal according to the single-mic filtering algorithm, so as to weaken or filter out the noise signal in the first output speech signal, thereby implementing a correction and/or supplement to the dual-mic filtering algorithm.
  • step 1110 may be omitted.
  • the processing device 110 may not only perform filtering processing on the target subband signal, but may also directly perform filtering processing on the entire first output voice signal.
  • the processing device 110 may automatically detect a noise signal in the first output voice signal based on an automatic noise detection algorithm, and perform filtering processing on the detected noise signal by using a single microphone filtering algorithm.
  • Fig. 12 is an exemplary flowchart of a single-mic filtering method according to some embodiments of the present specification.
  • the method 1200 may be executed by the processing device 110 , the processing engine 112 , or the processor 220 .
  • the method 1200 may be stored in a storage device (for example, the storage device 140 or the storage unit of the processing device 110) in the form of a program or instructions; when the processing device 110, the processing engine 112, the processor 220, or the modules shown in FIG. 4 execute the program or instructions, the method 1200 may be implemented.
  • method 1200 may be accomplished with one or more additional operations/steps not described below, and/or without one or more operations/steps discussed below. Additionally, the order of operations/steps shown in FIG. 12 is not limiting. As shown in Figure 12, method 1200 may include:
  • the processing device 110 may process the first signal and/or the second signal based on a single-mic filtering algorithm, and determine a third coefficient.
  • the processing device 110 may determine the third coefficient based on any one of the first signal and the second signal. For example, the processing device 110 may determine the third coefficient based on the first signal, or may determine the third coefficient based on the second signal. In some embodiments, the processing device 110 may determine the third coefficient based on the first signal and the second signal. For example, the processing device 110 may determine a first value of the third coefficient based on the first signal, determine a second value of the third coefficient based on the second signal, and then determine the third coefficient based on the first value and the second value ( For example, averaging, weighted sum, etc.).
  • the processing device 110 may process the first signal based on a single-mic filtering algorithm.
  • Exemplary single-mic filtering algorithms may include spectral subtraction, Wiener filtering algorithms, minimum-controlled recursive averaging algorithms, speech generation model algorithms, etc., or combinations thereof.
  • the processing device 110 may obtain a noise signal and an effective signal in the first signal based on a single-mic filtering algorithm, and determine a signal-to-noise ratio corresponding to the first signal based on at least two of the noise signal, the effective signal, and the first signal.
  • the signal-to-noise ratio corresponding to the first signal may include a priori signal-to-noise ratio, a posteriori signal-to-noise ratio, and the like.
  • the prior signal-to-noise ratio may be an energy ratio of an effective signal to a noise signal.
  • the posterior signal-to-noise ratio may be an energy ratio of the effective signal to the first signal.
  • the processing device 110 may determine the third coefficient based on the signal-to-noise ratio corresponding to the first signal. For example, the processing device 110 may determine the gain coefficient corresponding to the single microphone filtering algorithm based on the priori signal-to-noise ratio and/or the posteriori signal-to-noise ratio, and determine the third coefficient based on the gain coefficient.
  • the processing device 110 may directly use the gain coefficient as the third coefficient.
  • the processing device 110 may determine a mapping relationship between the gain coefficient and the third coefficient, and determine the third coefficient based on the mapping relationship.
  • the gain coefficient here may refer to the transfer function in the single-mic filter algorithm.
  • the transfer function can filter the speech signal with the noise signal to obtain the effective signal.
  • the transfer function may be in the form of a matrix, and the noise signal in the speech signal may be filtered out by multiplying the transfer function by the speech signal with the noise signal.
  • the third coefficient can be used to remove noise in the speech signal.
  • the processing device 110 may also perform a weighted combination of the prior SNR and the posterior SNR by a smoothing factor based on a logistic regression algorithm (for example, a sigmoid function), to obtain a smoothed SNR.
  • Based on the smoothed signal-to-noise ratio, the gain coefficient corresponding to the single-mic filtering algorithm is determined as the third coefficient. Therefore, the third coefficient may have better smoothness, so that strong musical noise may be avoided when the single-mic filtering algorithm is used for filtering.
  • the processing device 110 may process the first output speech signal based on the third coefficient to obtain a third output speech signal.
  • the processing device 110 may multiply the third coefficient by the first output speech signal to obtain the third output speech signal.
  • the third coefficient may be a gain coefficient obtained based on a single-mic filter algorithm. By multiplying the gain coefficient with the first output speech signal, noise signals in the first output speech signal can be filtered out.
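  • A minimal sketch of the smoothing and gain-application steps is given below. The specific way the sigmoid is used to derive the smoothing factor is an assumption; the text above only states that a logistic-regression-type function and a smoothing factor are used to combine the two SNRs.

```python
import numpy as np

def smoothed_third_coefficient(prio_snr, post_snr, k=1.0):
    """Sketch: blend the a priori and a posteriori SNRs with a sigmoid-derived weight."""
    weight = 1.0 / (1.0 + np.exp(-k * (prio_snr - post_snr)))    # smoothing factor per bin (assumed form)
    snr_smooth = weight * prio_snr + (1.0 - weight) * post_snr   # weighted combination of the two SNRs
    return snr_smooth / (1.0 + snr_smooth)                       # gain used as the third coefficient

def apply_third_coefficient(first_output_spectrum, third_coefficient):
    """Multiply the per-bin gain with the first output speech spectrum (one STFT frame)."""
    return third_coefficient * first_output_spectrum
```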
  • Fig. 13 is an exemplary flowchart of a speech enhancement method according to some embodiments of the present specification.
  • the method 1300 may be executed by the processing device 110 , the processing engine 112 , or the processor 220 .
  • the method 1300 may be stored in a storage device (for example, the storage device 140 or the storage unit of the processing device 110) in the form of programs or instructions, and the method 1300 may be implemented when the processing device 110, the processing engine 112, the processor 220, or the modules shown in FIG. 4 execute the programs or instructions.
  • method 1300 may be accomplished with one or more additional operations/steps not described below, and/or without one or more operations/steps discussed below. Additionally, the order of operations/steps shown in FIG. 13 is not limiting. As shown in Figure 13, method 1300 may include:
  • Step 1310: the processing device 110 may acquire the first signal and the second signal of the target speech.
  • Step 1320: the processing device 110 may process the first signal and the second signal based on the target speech position, the first position, and the second position to determine a first coefficient.
  • Step 1330: the processing device 110 may determine multiple parameters related to multiple sound source directions based on the first signal and the second signal.
  • Step 1340: the processing device 110 may determine a second coefficient based on the multiple parameters and the target speech position.
  • Steps 1310-1340 may be performed in the same manner as steps 510-540 described in FIG. 5, and are not repeated here.
  • Step 1350: the processing device 110 (for example, the processing module 420) may determine a fourth coefficient based on the energy difference between the first signal and the second signal.
  • the processing device 110 may obtain the noise power spectral density based on the silent interval in the first signal and the second signal.
  • the silent interval may be a speech signal interval in which there is no effective signal (that is, no speech is emitted by the target sound source). In the silent interval, since there is no voice of the target sound source, the first signal and the second signal acquired by the two microphones only contain noise components.
  • the processing device 110 may determine the silent interval in the first signal and the second signal based on a voice activity detection (VAD) algorithm. In some embodiments, the processing device 110 may respectively determine one or more speech intervals in the first signal and the second signal as silent intervals.
  • for each of the first signal and the second signal, the processing device 110 may directly use an interval at the beginning of the signal (for example, the first 200 ms or 300 ms) as the silent interval. Further, the processing device 110 may acquire the noise power spectral density based on the silent interval. In some embodiments, when the noise signal source is far away from the two microphones, it may be considered that the noise signals received by the two microphones are similar or identical. Therefore, the processing device 110 may obtain the noise power spectral density based on the silent interval corresponding to either the first signal or the second signal. In some embodiments, the processing device 110 may obtain the noise power spectral density based on a periodogram algorithm. Alternatively or additionally, the processing device 110 may transform the first signal and/or the second signal to the frequency domain based on an FFT, so that the noise power spectral density may be obtained based on the periodogram algorithm in the frequency domain.
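  • The sketch below estimates the noise power spectral density by averaging periodograms over an assumed initial silent interval of one microphone signal. The 300 ms interval, the FFT size, and the Hann window are illustrative choices, not values fixed by the disclosure.

```python
import numpy as np

def noise_psd_from_silence(x, fs, silent_ms=300, n_fft=512, hop=256):
    """Sketch: average periodograms over an assumed noise-only interval at the start of x.

    x  -- one microphone signal (first or second signal), time domain
    fs -- sampling rate in Hz
    """
    silent = x[:int(fs * silent_ms / 1000)]                      # assumed silent (noise-only) interval
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(silent) - n_fft + 1, hop):
        spec = np.fft.rfft(silent[start:start + n_fft] * window)
        frames.append(np.abs(spec) ** 2 / np.sum(window ** 2))   # periodogram of one frame
    return np.mean(frames, axis=0)                               # averaged noise power spectral density
```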
  • the processing device 110 may acquire the energy difference based on the first power spectral density of the first signal, the second power spectral density of the second signal, and the noise power spectral density. In some embodiments, the processing device 110 may determine the first power spectral density of the first signal and the second power spectral density of the second signal based on a periodogram algorithm. In some embodiments, the processing device 110 may obtain the energy difference based on a power level difference (PLD) algorithm. The PLD algorithm assumes that the two microphones are relatively far apart, so the energy difference between the effective signal in the first signal and the effective signal in the second signal is relatively large, while the noise signals in the first signal and the second signal are the same or similar. Thus, the energy difference between the first signal and the second signal can be expressed as a function related to the effective signal in the first signal.
  • the processing device 110 may determine the fourth coefficient based on the energy difference and the noise power spectral density. In some embodiments, the processing device 110 may determine the gain coefficient based on the PLD algorithm, and determine the gain coefficient as the fourth coefficient.
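  • A rough sketch of a PLD-style gain is shown below. The normalised difference and the Wiener-like mapping to a gain are assumptions; the disclosure only states that a gain coefficient (used as the fourth coefficient) is determined from the energy difference and the noise power spectral density.

```python
import numpy as np

def pld_gain(psd_first, psd_second, psd_noise, eps=1e-10):
    """Sketch: a power-level-difference (PLD) style gain used as the fourth coefficient.

    psd_first, psd_second -- power spectral densities of the two microphone signals
    psd_noise             -- noise PSD estimated from a silent interval
    """
    delta = np.maximum(psd_first - psd_second, 0.0)      # energy difference, dominated by the effective signal
    gain = delta / np.maximum(delta + psd_noise, eps)    # assumed Wiener-like mapping to a gain
    return np.clip(gain, 0.0, 1.0)
```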
  • Step 1360: the processing device 110 (for example, the generating module 430) may process the first signal and/or the second signal based on the first coefficient, the second coefficient, and the fourth coefficient to obtain a speech-enhanced fourth output speech signal corresponding to the target speech.
  • the processing device 110 may perform gain compensation on the first signal and/or the second signal based on the fourth coefficient to obtain an estimated effective signal.
  • the estimated effective signal may be the product of the fourth coefficient and the first signal and/or the second signal.
  • the processing device 110 may, based on the first coefficient, the second coefficient, and the fourth coefficient, perform weighting processing on the output signals obtained from the first signal and/or the second signal (for example, the third signal and the estimated effective signal).
  • the processing device 110 may perform weighting processing on the third signal, the first signal, and the estimated effective signal based on the first coefficient, the second coefficient, and the fourth coefficient, respectively, and determine the fourth output speech signal according to the weighted signals.
  • the fourth output speech signal may be the average of the weighted signals.
  • the fourth output speech signal may be the largest value among the weighted signals.
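  • The two combination rules above (average and largest value) can be sketched as follows. The per-bin weighting by the three coefficients follows the description, while treating each input as a single STFT frame and the choice between the two rules are assumptions for brevity.

```python
import numpy as np

def fourth_output(third_signal, first_signal, estimated_effective,
                  coef1, coef2, coef4, mode="mean"):
    """Sketch: weight the candidate spectra by their coefficients and combine them per bin."""
    weighted = np.stack([coef1 * third_signal,          # third signal weighted by the first coefficient
                         coef2 * first_signal,          # first signal weighted by the second coefficient
                         coef4 * estimated_effective])  # estimated effective signal weighted by the fourth coefficient
    if mode == "mean":
        return weighted.mean(axis=0)                    # average of the weighted signals
    pick = np.argmax(np.abs(weighted), axis=0)          # index of the largest weighted value per bin
    return weighted[pick, np.arange(weighted.shape[1])]
```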
  • the processing device 110 may process the first signal and/or the second signal based on the first coefficient and the second coefficient to obtain the speech-enhanced first output speech signal corresponding to the target speech, and then process the first output speech signal based on the fourth coefficient to obtain the fourth output speech signal.
  • the processing device 110 may also determine the fourth coefficient based on a power difference between the first signal and the second signal. In some embodiments, the processing device 110 may also determine the fourth coefficient based on an amplitude difference between the first signal and the second signal.
  • the possible beneficial effects of the embodiments of this specification include but are not limited to: (1) the target speech signal is processed based on the ANF algorithm, which causes relatively little damage to the target speech signal and can effectively filter the noise signal when the angle difference between the effective signal and the noise signal is large; (2) the target speech signal is processed based on the distribution probability algorithm, which can effectively filter noise signals near the target sound source when the angle difference between the effective signal and the noise signal is small; (3) the target speech signal is processed by a combination of dual-mic filtering and single-mic filtering, which can effectively filter out the noise remaining after dual-mic filtering.
  • aspects of this specification can be illustrated and described in terms of several patentable categories or situations, including any new and useful process, machine, product, or composition of matter, or any new and useful improvement thereof.
  • various aspects of this specification may be entirely executed by hardware, may be entirely executed by software (including firmware, resident software, microcode, etc.), or may be executed by a combination of hardware and software.
  • the above hardware or software may be referred to as “block”, “module”, “engine”, “unit”, “component” or “system”.
  • aspects of this specification may be embodied as a computer product comprising computer readable program code on one or more computer readable media.
  • numbers describing quantities of components and attributes are used in some embodiments. It should be understood that such numbers used in the description of the embodiments are, in some examples, modified by terms such as "about", "approximately", or "substantially". Unless otherwise stated, "about", "approximately", or "substantially" indicates that the number allows for a variation of ±20%. Accordingly, in some embodiments, the numerical data used in the specification and claims are approximations that can vary depending upon the desired characteristics of individual embodiments. In some embodiments, numerical data should take into account the specified significant digits and adopt a general method of retaining digits. Although the numerical ranges and data used in certain embodiments of this specification to confirm the breadth of the ranges are approximations, in specific embodiments such numerical values are set as precisely as practicable.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

This specification provides a voice enhancement method. The method includes: acquiring a first signal and a second signal of a target speech, the first signal being a signal of the target speech collected at a first position and the second signal being a signal of the target speech collected at a second position; processing the first signal and the second signal based on a target speech position, the first position, and the second position to determine a first coefficient; determining, based on the first signal and the second signal, a plurality of parameters related to a plurality of sound source directions, each parameter corresponding to a probability that sound is emitted from one sound source direction to form the first signal and the second signal; determining a second coefficient based on the plurality of parameters and the target speech position; and processing the first signal and/or the second signal based on the first coefficient and the second coefficient to obtain a speech-enhanced first output speech signal corresponding to the target speech.

Description

一种语音增强方法和系统 技术领域
本说明书涉及计算机技术领域,特别涉及语音增强的处理方法和系统。
背景技术
随着语音处理技术的发展,在通讯、语音采集等技术领域,对语音信号的质量要求越来越高。在进行语音通话和语音信号采集等场景中,会存在环境噪声、他人语音等各种噪声信号干扰,导致采集的目标语音不是干净的语音信号,影响了语音信号的质量,导致听不清语音、通话质量不高等问题。
因此,亟需提供一种语音增强方法和系统。
发明内容
本说明书实施例之一提供一种语音增强方法。所述方法可以包括获取目标语音的第一信号和第二信号,所述第一信号为基于第一位置采集的所述目标语音的信号,所述第二信号为基于第二位置采集的所述目标语音的信号。所述方法还可以包括基于目标语音位置、所述第一位置和所述第二位置,处理所述第一信号和所述第二信号以确定第一系数;基于所述第一信号和所述第二信号,确定与多个声源方向有关的多个参数,每个参数对应从一个声源方向发出声音以形成所述第一信号和所述第二信号的概率。所述方法还可以包括基于所述多个参数和所述目标语音位置,确定第二系数;以及基于所述第一系数和所述第二系数,处理所述第一信号和/或第二信号以获取所述目标语音对应的语音增强后的第一输出语音信号。
在一些实施例中,所述基于目标语音位置、所述第一位置和所述第二位置,处理所述第一信号和所述第二信号以确定第一系数,可以包括:基于所述目标语音位置、所述第一位置和所述第二位置,对所述第一信号和所述第二信号进行差分运算,获取指向第一方向的信号和指向第二方向的信号,所述指向第一方向的信号和所述指向第二方向的信号含有不同比例的有效信号;基于所述指向第一方向的信号和所述指向第二方向的信号,确定与所述有效信号对应的第三信号;以及基于所述第三信号,确定所述第一系数。
在一些实施例中,所述确定与所述有效信号对应的第三信号可以包括:对所述指向第一方向的信号和所述指向第二方向的信号进行自适应差分运算,确定第四信号;以及对第四信号中的低频成分进行增强,获取所述第三信号。
在一些实施例中,所述方法还可以包括:基于所述第四信号、所述指向第一方向的信号和所述指向第二方向的信号,更新所述自适应差分运算的自适应参数。
在一些实施例中,所述基于目标语音位置、所述第一位置和所述第二位置,处理所述第一信号和所述第二信号以确定第一系数,可以包括:基于所述目标语音位置、所述第一位置和所述第二位置,对所述第一信号和所述第二信号进行差分运算,获取指向第一方向的信号和指向第二方向的信号,所述指向第一方向的信号和所述指向第二方向的信号含有不同比例的有效信号;基于所述指向第一方向的信号和所述指向第二方向的信号,确定所述目标语音的估计信噪比;以及基于所述估计信噪比,确定所述第一系数。
在一些实施例中,所述基于所述第一信号和所述第二信号,确定与多个声源方向有关的多个参数,可以包括:基于每个声源方向、所述第一位置和所述第二位置,对所述第一信号和所述第二信号进行差分运算,确定与每个声源方向有关的参数。
在一些实施例中,所述基于所述多个参数和所述目标语音位置,确定第二系数,可以包括:基于所述多个参数,确定合成声源方向;以及基于所述合成声源方向和所述目标语音位置,确定所述第二系数。
在一些实施例中,所述基于所述合成声源方向和所述目标语音位置,确定所述第二系数,可以包括:判断所述目标语音位置是否位于合成声源方向;响应于所述目标语音位置位于所述合成声源方向,将所述第二系数设为第一数值;以及响应于所述目标语音位置不位于所述合成声源方向,将所述第二系数设为第二数值。
在一些实施例中,所述基于所述合成声源方向和所述目标语音位置,确定所述第二系数,可以包括:基于所述目标语音位置和所述合成声源方向之间的角度,通过回归函数确定所述第二系数。
在一些实施例中,所述方法还可以包括基于平滑因子,对所述第二系数进行平滑。
在一些实施例中,所述方法还可以包括对所述第一信号和所述第二信号执行以下操作中的至少一个:对所述第一信号和所述第二信号进行分帧;对所述第一信号和所述第二信号进行加窗平滑;以及将所述第一信号和所述第二信号转换到频域。
在一些实施例中,所述方法可以进一步包括:确定所述第一输出语音信号中至少一个目标子带信号;以及基于单麦滤波算法,处理所述至少一个目标子带信号,获取第二输出语音信号。
在一些实施例中,所述确定所述第一输出语音信号中至少一个目标子带信号,可以包括:基于所述第一输出语音信号,获取多个子带信号;计算每一个所述子带信号的信噪比;以及基于每一个所述子带信号的信噪比,确定所述目标子带信号。
在一些实施例中,所述方法可以进一步包括:基于单麦滤波算法处理所述第一信号和/或所述第二信号,确定第三系数;以及基于所述第三系数,处理所述第一输出语音信号,获取第三输出语音信号。
在一些实施例中,所述方法还可以包括:基于所述第一信号和所述第二信号的能量差,确定第四系数;以及基于所述第一系数、所述第二系数和所述第四系数,处理所述第一信号和/或第二信号以获取所述目标语音对应的语音增强后的第四输出语音信号。
在一些实施例中,所述基于所述第一信号和所述第二信号的能量差,确定第四系数,可以包括:基于所述第一信号和所述第二信号中的无声区间,获取噪声功率谱密度;基于所述第一信号的第一功率谱密度、所述第二信号的第二功率谱密度和所述噪声功率谱密度,获取所述能量差;以及基于所述能量差和所述噪声功率谱密度,确定所述第四系数。
本说明书实施例之一提供一种语音增强系统。所述系统可以包括:包括一组指令的至少一个存储介质;以及与至少一个存储介质通信的至少一个处理器,其中,当执行所述一组指令时,所述至少一个处理器可以使所述系统:获取目标语音的第一信号和第二信号,所述第一信号为基于第一位置采集的所述目标语音的信号,所述第二信号为基于第二位置采集的所述目标语音的信号;基于目标语音位置、所述第一位置和所述第二位置,处理所述第一信号和所述第二信号以确定第一系数;基于所述第一信号和所述第二信号,确定与多个声源方向有关的多个参数,每个参数对应从一个声源方向发出声音以形成所述第一信号和所述第二信号的概率;基于所述多个参数和所述目标语音位置,确定第二系数;以及基于所述第一系数和所述第二系数,处理所述第一信号和/或第二信号以获取所述目标语音对应的语音增强后的第一输出语音信号。
本说明书实施例之一提供一种语音增强系统。所述系统可以包括获取模块、处理模块以及生成模块。所述获取模块可以用于获取目标语音的第一信号和第二信号,所述第一信号为基于第一位置采集的所述目标语音的信号,所述第二信号为基于第二位置采集的所述目标语音的信号。所述处理模块可以用于基于目标语音位置、所述第一位置和所述第二位置,处理所述第一信号和所述第二信号以确定第一系数;基于所述第一信 号和所述第二信号,确定与多个声源方向有关的多个参数,每个参数对应从一个声源方向发出声音以形成所述第一信号和所述第二信号的概率;以及基于所述多个参数和所述目标语音位置,确定第二系数。所述生成模块可以用于基于所述第一系数和所述第二系数,处理所述第一信号和/或第二信号以获取所述目标语音对应的语音增强后的第一输出语音信号。
本说明书实施例之一提供一种非暂时性计算机可读介质,可以包括可执行指令,当由至少一个处理器执行时,所述可执行指令可以使所述至少一个处理器执行本说明书所述的方法。
附加的特征将在下面的描述中部分地阐述,并且对于本领域技术人员来说,通过查阅以下内容和附图将变得显而易见,或者可以通过实例的产生或操作来了解。本发明的特征可以通过实践或使用以下详细实例中阐述的方法、工具和组合的各个方面来实现和获得。
附图说明
本说明书将以示例性实施例的方式进一步说明,这些示例性实施例将通过附图进行详细描述。这些实施例并非限制性的,在这些实施例中,相同的编号表示相同的结构,其中:
图1是根据本说明书一些实施例所示的语音增强系统的应用场景示意图;
图2是根据本说明书的一些实施例所示的示例性计算设备的示例性硬件和/或软件组件的示意图;
图3是根据本说明书的一些实施例所示的示例性移动设备的示例性硬件和/或软件组件的示意图;
图4是根据本说明书一些实施例所示的语音增强系统的示例性框图;
图5是根据本说明书一些实施例所示的语音增强方法的示例性流程图;
图6是根据本说明书一些实施例所示的示例性双麦克风的示意图;
图7是根据本说明书一些实施例所示的ANF算法在不同噪声角度时的滤波效果示意图;
图8是根据本说明书一些实施例所示的确定第一系数方法的示例性流程图;
图9是根据本说明书一些实施例所示的确定第一系数方法的示例性流程图;
图10是根据本说明书一些实施例所示的确定第二系数方法的示例性流程图;
图11是根据本说明书一些实施例所示的单麦滤波方法的示例性流程图;
图12是根据本说明书一些实施例所示的单麦滤波方法的示例性流程图;
图13是根据本说明书一些实施例所示的语音增强方法的示例性流程图。
具体实施例
为了更清楚地说明本说明书的实施例的技术方案,下面将对实施例描述中所需要使用的附图作简单的介绍。显而易见地,下面描述中的附图仅仅是本说明书的一些示例或实施例,对于本领域的普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图将本说明书应用于其他类似情景。应当理解,给出这些示例性实施例仅仅是为了使相关领域的技术人员能够更好地理解进而实现本发明,而并非以任何方式限制本发明的范围。除非从语言环境中显而易见或另做说明,图中相同标号代表相同结构或操作。
如本说明书和权利要求书中所示,除非上下文明确提示例外情形,“一”、“一个”、“一种”和/或“该”等词并非特指单数,也可包括复数。一般说来,术语“包括”与“包含”仅提示包括已明确标识的步骤和元素,而这些步骤和元素不构成一个排它性的罗列,方法或者设备也可能包含其他的步骤或元素。术语“基于”是“至少部分地基于”。术语“一个实施例”表示“至少一个实施例”;术语“另一实施例”表示“至少一个另外的实施例”。
本说明书中使用了流程图用来说明根据本说明书的实施例的系统所执行的操作。应当理解的是,前面或后面操作不一定按照顺序来精确地执行。相反,可以按照倒序或同时处理各个步骤。同时,也可以将其他操作添加到这些过程中,或从这些过程移除某一步或数步操作。
图1是根据本说明书一些实施例所示的语音增强系统的应用场景示意图。本说明书实施例所示的语音增强系统100可以应用在各种软件、系统、平台、设备中以实现语音信号的增强处理。例如,可以应用在对各种软件、系统、平台、设备获取的用户语音信号进行语音增强处理,还可以应用在使用设备(如手机、平板、计算机、耳机等)进行语音通话时进行语音增强处理。
在语音通话场景中,会存在环境噪声、他人语音等各种噪声信号干扰,导致采集的目标语音不是干净的语音信号。为了提高语音通话的质量,需要对目标语音进行噪声滤除、语音信号增强等语音增强处理以得到干净的语音信号。本说明书实施例提出一种语音增强的系统和方法,可以实现对例如上述语音通话场景中的目标语音进行增强处理。
如图1所示,语音增强系统100可以包括处理设备110、采集设备120、终端 130、存储设备140以及网络150。
在一些实施例中,处理设备110可以处理从其他设备或系统组件获得的数据和/或信息。处理设备110可以基于这些数据、信息和/或处理结果执行程序指令,以执行一个或多个本说明书中描述的功能。例如,处理设备110可以获取目标语音的第一信号和第二信号并进行处理,输出语音增强后的输出语音信号。
在一些实施例中,处理设备110可以是单个处理设备或者处理设备组,例如服务器或服务器组。所述处理设备组可以是集中式的或分布式的(例如,处理设备110可以是分布式的系统)。在一些实施例中,处理设备110可以是本地的或远程的。例如,处理设备110可以通过网络150访问采集设备120、终端130、存储设备140中的信息和/或数据。再例如,处理设备110可以直接连接到采集设备120、终端130、存储设备140以访问存储的信息和/或数据。在一些实施例中,处理设备110可以在云平台上实现。仅作为示例,所述云平台可以包括私有云、公共云、混合云、社区云、分布云、云之间、多重云等或其任意组合。在一些实施例中,处理设备110可以在与本说明书图2所示的计算设备上实现。例如,处理设备110可以在如图2所示的计算设备200中的一个或多个部件上实现。
在一些实施例中,处理设备110可以包括处理引擎112。处理引擎112可处理与语音增强有关的数据和/或信息以执行一个或多个本说明书中描述的方法或功能。例如,处理引擎112可以获取目标语音的第一信号和第二信号,所述第一信号为基于第一位置采集的所述目标语音的信号,所述第二信号为基于第二位置采集的所述目标语音的信号。在一些实施例中,处理引擎112可以处理所述第一信号和/或第二信号以获取目标语音对应的语音增强后的输出语音信号。
在一些实施例中,处理引擎112可以包括一个或以上处理引擎(例如,单芯片处理引擎或多芯片处理器)。仅作为示例,处理引擎112可以包括中央处理单元(CPU)、专用集成电路(ASIC)、专用指令集处理器(ASIP)、图像处理单元(GPU)、物理运算处理单元(PPU)、数字信号处理器(DSP)、现场可程序门阵列(FPGA)、可程序逻辑装置(PLD)、控制器、微控制器单元、精简指令集计算机(RISC)、微处理器等或以上任意组合。在一些实施例中,处理引擎112可以集成在采集设备120或终端130中。
在一些实施例中,采集设备120可以用于采集目标语音的语音信号,例如用于采集目标语音的第一信号和第二信号。在一些实施例中,采集设备120可以是单个的采 集设备,或者是多个采集设备构成的采集设备组。在一些实施例中,采集设备120可以是包含一个或多个麦克风或其它声音传感器(例如,120-1、120-2,...,120-n)的设备(例如,手机、耳机、对讲机、平板、计算机等)。例如,采集设备120可以包括至少两个麦克风,所述至少两个麦克风之间相隔一定的距离。当采集设备120对用户语音进行采集时,所述至少两个麦克风可以在不同的位置同时采集来自用户嘴部的声音。所述至少两个麦克风可以包括第一麦克风和第二麦克风。第一麦克风可以位于距离用户嘴部较近的位置,第二麦克风可以位于距离用户嘴部较远的位置,第二麦克风与第一麦克风的连线可以向用户嘴部所在的位置延伸。
采集设备120可以将采集的语音转换为电信号,并发送至处理设备110进行处理。例如,上述第一麦克风和第二麦克风可以将采集得到用户语音分别转化为第一信号和第二信号。处理设备110可以基于第一信号和第二信号实现对语音的增强处理。
在一些实施例中,采集设备120可以通过网络150与处理设备110、终端130、存储设备140进行传输信息和/或数据。在一些实施例中,采集设备120可以直接连接到处理设备110或存储设备140以传输信息和/或数据。例如,采集设备120和处理设备110可以是同一个电子设备(例如,耳机、眼镜等)上的不同部分,并通过金属导线连接。
在一些实施例中,终端130可以是用户或其它实体使用的终端,例如可以是目标语音对应的声源(人或其它实体)使用的终端,也可以是与目标语音对应的声源(人或其它实体)进行语音通话的其它用户或实体使用的终端。
在一些实施例中,终端130可以包括移动设备130-1、平板电脑130-2、笔记本电脑130-3等或其任意组合。在一些实施例中,移动设备130-1可以包括智能家居设备、可穿戴设备、智能移动设备、虚拟现实设备、增强现实设备等或其任意组合。在一些实施例中,智能家居设备可以包括智能照明设备、智能电器控制设备、智能监控设备、智能电视、智能摄像机、对讲机等或其任意组合。在一些实施例中,可穿戴设备可以包括智能手镯、智能鞋袜、智能眼镜、智能头盔、智能手表、智能耳机、智能穿着、智能背包、智能配件等或其任意组合。在一些实施例中,智能移动设备可以包括智能电话、个人数字助理(PDA)、游戏设备、导航设备、销售点(POS)等或其任意组合。在一些实施例中,虚拟现实设备和/或增强现实设备可以包括虚拟现实头盔、虚拟现实眼镜、虚拟现实眼罩、增强型虚拟现实头盔、增强现实眼镜、增强现实眼罩等或其任意组合。
在一些实施例中,终端130可以获取/接收目标语音的语音信号,如第一信号和 第二信号。在一些实施例中,终端130可以获取/接收目标语音的语音增强后的输出语音信号。在一些实施例中,终端130可以直接从采集设备120、存储设备140获取/接收目标语音的语音信号,如第一信号和第二信号,或者终端130可以通过网络150从采集设备120、存储设备140获取/接收目标语音的语音信号,如第一信号和第二信号。在一些实施例中,终端130可以直接从处理设备110、存储设备140获取/接收目标语音的语音增强后的输出语音信号,或者终端130可以通过网络150从处理设备110、存储设备140获取/接收目标语音的语音增强后的输出语音信号。
在一些实施例中,终端130可以向处理设备110发送指令,处理设备110可以执行来自终端130指令。例如,终端130可以向处理设备110发送实现目标语音的语音增强方法的一个或多个指令,以令处理设备110执行语音增强方法的一个或多个操作/步骤。
存储设备140可以存储从其他设备或系统组件中获得的数据和/或信息。例如,存储设备140可以存储目标语音的语音信号,如第一信号和第二信号,还可以存储目标语音的语音增强后的输出语音信号。在一些实施例中,存储设备140可以存储从采集设备120获取的数据。在一些实施例中,存储设备140可以存储从处理设备110获取的数据。在一些实施例中,存储设备140可以存储处理设备110用于执行或使用以完成本说明书中描述的示例性方法的数据和/或指令。在一些实施例中,存储设备140可以包括大容量存储器、可移动存储器、易失性读写存储器、只读存储器(ROM)等或其任意组合。示例性的大容量储存器可以包括磁盘、光盘、固态磁盘等。示例性可移动存储器可以包括闪存驱动器、软盘、光盘、存储卡、压缩盘、磁带等。示例性的挥发性只读存储器可以包括随机存取内存(RAM)。示例性的RAM可包括动态RAM(DRAM)、双倍速率同步动态RAM(DDRSDRAM)、静态RAM(SRAM)、闸流体RAM(T-RAM)和零电容RAM(Z-RAM)等。示例性的ROM可以包括掩模ROM(MROM)、可编程ROM(PROM)、可擦除可编程ROM(PEROM)、电子可擦除可编程ROM(EEPROM)、光盘ROM(CD-ROM)和数字通用磁盘ROM等。在一些实施例中,所述存储设备140可以在云平台上实现。仅作为示例,所述云平台可以包括私有云、公共云、混合云、社区云、分布云、内部云、多层云等或其任意组合。
在一些实施例中,存储设备140可以连接到网络150以与语音增强系统100中的一个或以上组件(例如,处理设备110、采集设备120、终端130)通信。语音增强系统100中的一个或以上组件可以通过网络150访问存储设备140中存储的数据或指令。 在一些实施例中,存储设备140可以与语音增强系统100中的一个或以上组件(例如,处理设备110、采集设备120、终端130)直接连接或通信。在一些实施例中,存储设备140可以是处理设备110的一部分。
在一些实施例中,语音增强系统100的一个或以上组件(例如,处理设备110、采集设备120、终端130)可以具有访问存储设备140的许可。在一些实施例中,语音增强系统100的一个或以上组件可以在满足一个或以上条件时读取和/或修改与目标语音相关的信息。
网络150可以促进信息和/或数据的交换。在一些实施例中,语音增强系统100中的一个或以上组件(例如,处理设备110、采集设备120、终端130和存储设备140)可以通过网络150向/从语音增强系统100中的其他组件发送/接收信息和/或数据。例如,处理设备110可以通过网络150从采集设备120或存储设备140获取目标语音的第一信号和第二信号,终端130可以通过网络150从处理设备110或存储设备140获取目标语音的语音增强后的输出语音信号。在一些实施例中,网络150可以为任意形式的有线或无线网络或其任意组合。仅作为示例,网络150可以包括缆线网络、有线网络、光纤网络、远程通信网络、内部网络、互联网、局域网(LAN)、广域网(WAN)、无线局域网(WLAN)、城域网(MAN)、广域网(WAN)、公共交换电话网络(PSTN)、蓝牙网络、紫蜂网络、近场通讯(NFC)网络、全球移动通讯系统(GSM)网络、码分多址(CDMA)网络、时分多址(TDMA)网络、通用分组无线服务(GPRS)网络、增强数据速率GSM演进(EDGE)网络、宽带码分多址接入(WCDMA)网络、高速下行分组接入(HSDPA)网络、长期演进(LTE)网络、用户数据报协议(UDP)网络、传输控制协议/互联网协议(TCP/IP)网络、短讯息服务(SMS)网络、无线应用协议(WAP)网络、超宽带(UWB)网络、红外线等或其任意组合。在一些实施例中,语音增强系统100可以包括一个或以上网络接入点。例如,语音增强系统100可以包括有线或无线网络接入点,例如基站和/或无线接入点150-1、150-2、...,语音增强系统100的一个或以上组件可以通过其连接到网络150以交换数据和/或信息。
本领域普通技术人员将理解,当语音增强系统100的元件或组件执行时,组件可以通过电信号和/或电磁信号执行。例如,当采集设备120向处理设备110发送目标语音的第一信号和第二信号时,采集设备120可以生成编码的电信号。然后,采集设备120可以将电信号发送到输出端口。若采集设备120经由有线网络或数据传输线与采集设备120通信,则输出端口可物理连接至电缆,其进一步将电信号传输给采集设备120 的输入端口。如果采集设备120经由无线网络与采集设备120通信,则采集设备120的输出端口可以是一个或以上天线,其将电信号转换为电磁信号。在电子设备内,例如采集设备120和/或处理设备110,当处理指令、发出指令和/或执行动作时,所述指令和/或动作通过电信号进行。例如,当处理设备110从存储介质(例如,存储设备140)检索或保存数据时,它可以将电信号发送到存储介质的读/写设备,其可以在存储介质中读取或写入结构化数据。该结构化数据可以通过电子设备的总线,以电信号的形式传输至处理器。此处,电信号可以指一个电信号、一系列电信号和/或至少两个不连续的电信号。
图2是根据本说明书的一些实施例所示的示例性计算设备200的示意图。在一些实施例中,可以在计算设备200上实现处理设备110。如图2所示,计算设备200可以包括存储器210、处理器220、输入/输出(I/O)230和通信端口240。
存储器210可以存储从采集设备120、终端130、存储设备140或语音增强系统100的任何其他组件获得的数据/信息。在一些实施例中,存储器210可以包括大容量存储器、可移动存储器、易失性读写存储器、只读存储器(ROM)等或其任意组合。示例性的大容量储存器可以包括磁盘、光盘、固态磁盘等。示例性可移动存储器可以包括闪存驱动器、软盘、光盘、存储卡、压缩盘、磁带等。示例性的挥发性只读存储器可以包括随机存取内存(RAM)。示例性的RAM可包括动态RAM(DRAM)、双倍速率同步动态RAM(DDR SDRAM)、静态RAM(SRAM)、闸流体RAM(T-RAM)和零电容RAM(Z-RAM)等。示例性的ROM可以包括掩模ROM(MROM)、可编程ROM(PROM)、可擦除可编程ROM(PEROM)、电子可擦除可编程ROM(EEPROM)、光盘ROM(CD-ROM)和数字通用磁盘ROM等。在一些实施例中,存储器210可以存储一个或多个程序和/或指令以执行本说明书中描述的示例性方法。例如,存储器210可以存储处理设备110可执行以实现语音增强方法的程序。
处理器220可以根据本说明书描述的技术执行计算机指令(程序代码)并执行处理设备110的功能。计算机指令可以包括例如例程、程序、对象、组件、信号、数据结构、过程、模块和功能,其执行本文描述的特定功能。例如,处理器220可以处理从采集设备120、终端130、存储设备140和/或语音增强系统100的任何其他组件获取的数据。例如,处理器220可以处理从采集设备120获取的目标语音的第一信号和第二信号,以得到语音增强后的输出语音信号。在一些实施例中,可将输出语音信号存储在存储设备140、存储器210等中。在一些实施例中,可通过I/O230将输出语音信号输出给 扬声器等播报设备。在一些实施例中,处理器220可以执行从终端130获得的指令。
在一些实施例中,处理器220可以包括一个或多个硬件处理器,例如微控制器、微处理器、精简指令集计算机(RISC)、专用集成电路(ASIC)、专用指令集处理器(ASIP)、中央处理单元(CPU)、图形处理单元(GPU)、物理处理单元(PPU)、微控制器单元、数字信号处理器(DSP)、现场可编程门阵列(FPGA)、高级RISC机器(ARM)、可编程逻辑设备(PLD)、能够执行一个或多个功能的任何电路或处理器等或其任意组合。
仅出于说明的目的,在计算设备200中仅描述了一个处理器。然而,应当注意,本说明书中的计算设备200也可以包括多个处理器。因此,如本说明书中所描述的由一个处理器执行的操作和/或方法步骤也可以由多个处理器联合或分别执行。例如,如果在本说明书中,计算设备200的处理器同时执行操作A和操作B,则应当理解,操作A和操作B也可以由计算设备中的两个或更多个不同的处理器联合或分开地执行。例如,第一处理器执行操作A,第二处理器执行操作B,或者第一处理器和第二处理器共同执行操作A和B。
I/O230可以输入或输出信号、数据和/或信息。在一些实施例中,I/O230可以使用户能够与处理设备110交互。在一些实施例中,I/O230可以包括输入设备和输出设备。示例性的输入设备可以包括键盘、鼠标、触摸屏、麦克风等或其组合。示例性的输出设备可以包括显示设备、扬声器、打印机、投影仪等或其组合。示例性的显示设备可以包括液晶显示器(LCD)、基于发光二极管(LED)的显示器、显示器、平板显示器、曲面屏、电视设备、阴极射线管(CRT)等或其组合。
通信端口240可以与网络(例如,网络150)连接,以促进数据通信。通信端口240可以在处理设备110与采集设备120、终端130或存储设备140之间建立连接。该连接可以是有线连接、无线连接或两者的组合,以实现数据传输和接收。有线连接可以包括电缆、光缆、电话线等或其任何组合。无线连接可以包括蓝牙、Wi-Fi、WiMax、WLAN、ZigBee、移动网络(例如3G、4G、5G等)等或其组合。在一些实施例中,通信端口240可以是标准化的通信端口,例如RS232、RS485等。在一些实施例中,通信端口240可以是专门设计的通信端口。例如,可以根据需要传输的语音信号来设计通信端口240。
图3是根据本说明书的一些实施例所示的可以在其上实现终端130的示例性移动设备300的示例性硬件和/或软件组件的示意图。如图3所示,移动设备300可以包 括通信单元310、显示单元320、图形处理单元(GPU)330、中央处理单元(CPU)340、输入/输出350、内存360和存储器370。
中央处理单元(CPU)340可以包括接口电路和类似于处理器220的处理电路。在一些实施例中,任何其他合适的组件,包括但不限于系统总线或控制器(未示出),也可包括在移动设备300内。在一些实施例中,移动操作系统362(例如,IOS TM、Andro TM、Windows Phone TM等)和一个或以上应用程序364可以从存储器370加载到内存360中,以便由中央处理单元(CPU)340执行。应用程序364可以包括浏览器或任何其他合适的移动应用程序,用于从移动设备300上的语音增强系统接收和呈现与目标语音、目标语音的语音增强有关的信息。信号和/或数据的交互可以通过输入/输出设备350实现,并通过网络150提供给处理引擎112和/或语音增强系统100的其他组件。
为了实现上述各种模块、单元及其功能,计算机硬件平台可以用作一个或以上元件(例如,图1中描述的处理设备110的模块)的硬件平台。由于这些硬件元件、操作系统和程序语言是常见的,因此可以假设本领域技术人员熟悉这些技术并且他们能够根据本文中描述的技术提供路线规划中所需的信息。具有用户界面的计算机可以用作个人计算机(PC)或其他类型的工作站或终端设备。在正确编程之后,具有用户界面的计算机可以用作处理设备如服务器。可以认为本领域技术人员也可以熟悉这种类型的计算机设备的这种结构、程序或一般操作。因此,没有针对附图描述额外的解释。
图4是根据本说明书一些实施例所示的语音增强系统的示例性框图。在一些实施例中,语音增强系统100可以在处理设备110上实施。如图4所示,处理设备110可以包括获取模块410、处理模块420以及生成模块430。
获取模块410可以用于获取目标语音的第一信号和第二信号。在一些实施例中,目标语音可以包括目标声源所发出的语音。在一些实施例中,可以用不同的采集设备(例如,不同的麦克风)在不同位置采集目标语音的信号。例如,第一信号可以是第一麦克风(或前麦克风)基于第一位置采集的目标语音的信号,第二信号可以是第二麦克风(或后麦克风)基于第二位置采集的目标语音的信号。在一些实施例中,获取模块410可以从所述不同的采集设备直接获取目标语音的第一信号和第二信号。在一些实施例中,所述第一信号和第二信号可以存储在存储设备(例如,存储设备140、存储器210、存储器370、外接存储设备等)中。获取模块410可以从所述存储设备获取所述第一信号和第二信号。
处理模块420可以用于基于目标语音位置、所述第一位置和所述第二位置,处 理所述第一信号和所述第二信号以确定第一系数。在一些实施例中,处理模块420可以用于基于自适应零点形成(Adaptive Null-Forming,ANF)算法确定所述第一系数。例如,处理模块420可以用于基于目标语音位置、第一位置和第二位置,对第一信号和第二信号进行差分运算,获取指向第一方向的信号和指向第二方向的信号。所述指向第一方向的信号和所述指向第二方向的信号含有不同比例的有效信号。进一步地,处理模块420可以用于基于指向第一方向的信号和指向第二方向的信号,确定第四信号。例如,处理模块420可以基于维纳滤波算法,通过自适应滤波器对指向第一方向的信号和指向第二方向的信号进行滤波,确定第四信号。在一些实施例中,处理模块420可以对第四信号中的低频成分进行增强,以获取第三信号。进一步地,处理模块420可以用于基于所述第三信号,确定第一系数。例如,处理模块420可以将第三信号与第一信号或第二信号的比值确定为第一系数。可选地或附加地,处理模块420可以基于第四信号、指向第一方向的信号和指向第二方向的信号,更新自适应差分运算的自适应参数。
在一些实施例中,为了确定所述第一系数,处理模块420还可以基于指向第一方向的信号和指向第二方向的信号,确定目标语音的估计信噪比。例如,所述估计信噪比可以是指向第一方向的信号与指向第二方向的信号的比值。进一步地,处理模块420基于所述估计信噪比,确定第一系数。
处理模块420还可以用于基于所述第一信号和所述第二信号,确定与多个声源方向有关的多个参数。在一些实施例中,所述多个声源方向可以包括预设的声源方向。在一些实施例中,处理模块420可以基于每个声源方向、第一位置和第二位置,对第一信号和第二信号进行差分运算,确定与每个声源方向有关的参数。在一些实施例中,所述参数可以包括似然函数。每个参数可以对应从一个声源方向发出声音以形成第一信号和第二信号的概率。
处理模块420还可以用于基于所述多个参数和所述目标语音位置,确定第二系数。在一些实施例中,为了确定第二系数,处理模块420可以基于所述多个参数,确定合成声源方向。合成声源可以认为是由目标声源和噪声源综合形成的虚拟声源。仅作为示例,处理模块420可以确定所述多个参数中数值最大的参数。所述数值最大的参数可以表示从与其对应的声源方向发出声音以形成所述第一信号和所述第二信号的概率最大。由此,处理模块420可以确定所述数值最大的参数对应的声源方向为合成声源的方向。进一步地,处理模块420可以基于所述合成声源方向和目标语音位置,确定所述第二系数。例如,处理模块420可以判断所述目标语音位置是否位于合成声源方向,或者 所述目标语音位置是否在合成声源方向的一定角度范围之内。响应于所述目标语音位置位于所述合成声源方向或者在合成声源方向的一定角度范围之内,将所述第二系数设为第一数值。响应于所述目标语音位置不位于所述合成声源方向或者不在合成声源方向的一定角度范围之内,将所述第二系数设为第二数值。可选地或附加地,处理模块420可以基于平滑因子,对所述第二系数进行平滑。再例如,处理模块420可以基于所述目标语音位置和所述合成声源方向之间的角度,通过回归函数确定所述第二系数。
生成模块430可以用于基于所述第一系数和所述第二系数,处理所述第一信号和/或第二信号以获取所述目标语音对应的语音增强后的第一输出语音信号。在一些实施例中,生成模块430可以用于基于第一系数和第二系数,对第一信号和/或第二信号进行加权处理。例如,生成模块430可以根据第一系数的值,对基于第一信号和第二信号获取的第三信号赋予相应的权重,并根据第二系数的值,对所述第一信号或所述第三信号赋予相应的权重。生成模块430可以进一步地处理上述加权后的信号以获取语音增强后的第一输出语音信号。例如,所述第一输出语音信号可以是加权后的第三信号和第一信号的平均值。再例如,第一输出语音信号可以是加权后的第三信号和第一信号的乘积。再例如,第一输出语音信号可以是加权后的第三信号和第一信号两者中较大的值。再例如,生成模块430可以基于第一系数对所述第三信号加权后,再基于所述第二系数进行再次加权处理。
需要注意的是,以上对于处理设备110及其模块的描述,仅为描述方便,并不能把本说明书限制在所举实施例范围之内。可以理解,对于本领域的技术人员来说,在了解该系统的原理后,可以在不背离这一原理的情况下,对各个模块进行任意组合,或者构成子系统与其他模块连接。例如,图4中披露的获取模块410和处理模块420可以是一个系统中的不同模块,也可以是一个模块实现上述的两个或两个以上模块的功能。例如,获取模块410和处理模块420可以是两个模块,也可以是一个模块同时具有获取目标语音以及处理目标语音的功能。诸如此类的变形,均在本说明书的保护范围之内。
应当理解,图4所示的系统及其模块可以利用各种方式来实现。例如,在一些实施例中,系统及其模块可以通过硬件、软件或者软件和硬件的结合来实现。其中,硬件部分可以利用专用逻辑来实现;软件部分则可以存储在存储器中,由适当的指令执行系统,例如微处理器或者专用设计硬件来执行。本领域技术人员可以理解上述的方法和系统可以使用计算机可执行指令和/或包含在处理器控制代码中来实现,例如在诸如磁盘、CD或DVD-ROM的载体介质、诸如只读存储器(固件)的可编程的存储器或者诸 如光学或电子信号载体的数据载体上提供了这样的代码。本说明书的系统及其模块不仅可以有诸如超大规模集成电路或门阵列、诸如逻辑芯片、晶体管等的半导体、或者诸如现场可编程门阵列、可编程逻辑设备等的可编程硬件设备的硬件电路实现,也可以用例如由各种类型的处理器所执行的软件实现,还可以由上述硬件电路和软件的结合(例如,固件)来实现。
图5是根据本说明书一些实施例所示的语音增强方法的示例性流程图。在一些实施例中,方法500可以由处理设备110、处理引擎112、处理器220执行。例如,方法500可以以程序或指令的形式存储在存储设备(例如,存储设备140或处理设备110的存储单元)中,当处理设备110、处理引擎112、处理器220或图4所示的模块执行程序或指令时,可以实现方法500。在一些实施例中,方法500可以利用以下未描述的一个或以上附加操作/步骤,和/或不通过以下所讨论的一个或以上操作/步骤完成。另外,如图5所示的操作/步骤的顺序并非限制性的。
步骤510,处理设备110(例如,获取模块410)可以获取目标语音的第一信号和第二信号。
在一些实施例中,目标语音可以包括目标声源所发出的语音。目标声源可以是用户、机器人(例如,自动应答机器人、将人的输入数据如文本、手势等转换为语音信号播报的机器人等)、或者能够发出语音信息的其它生物和设备。在一些实施例中,目标声源所发出的语音可以作为有效信号。在一些实施例中,目标语音还可以包括无用或带来干扰的噪声信号,例如,周围环境产生的噪声或者目标声源外其他声源的声音。示例性的噪声可以包括加性噪声、白噪声、乘性噪声等或其任意的组合。加性噪声是指与语音信号无关的独立噪声信号,乘性噪声是指与语音信号成正比的噪声信号,白噪声是指噪声的功率谱为一常数的噪声信号。在一些实施例中,当目标语音中包括噪声信号时,目标语音可以是有效信号与噪声信号的合成信号。所述合成信号可以等效为由目标声源和噪声声源的合成声源所发出的语音信号。
在一些实施例中,可以用不同的采集设备(例如,不同的麦克风)在不同位置采集目标语音的信号。以双麦克风为例,图6是根据本说明书一些实施例所示的示例性双麦克风的示意图。如图6所示,目标声源(如用户的嘴部)位于双麦克风的左上方,目标声源指向双麦克风的方向(例如目标声源指向第一麦克风A的方向)与双麦克风的连线形成的夹角为θ。第一信号Sig 1可以为第一麦克风A(或前麦克风)基于第一位置采集的目标语音的信号,第二信号Sig 2可以为第二麦克风B(或后麦克风)基于第 二位置采集的目标语音的信号。仅作为示例,第一位置和第二位置可以是距离为d且相对于目标声源(如用户的嘴部)距离不同的两个位置。d可以根据实际需求设置,例如,在特定的场景下,d可以被设置为不小于0.5cm,或者不小于1cm。在一些实施例中,第一信号或第二信号可以包括采集设备在接收到目标语音后所生成的电信号(或者经过进一步处理后所生成的电信号),其可以反映目标语音相对于采集设备的位置信息。在一些实施例中,第一信号和第二信号可以是目标语音在时域上的呈现。例如,处理设备110可以对第一麦克风A和第二麦克风B获取的信号进行分帧以分别获得第一信号和第二信号。以获取第一信号为例,处理设备110可以将第一麦克风获取的信号在时域上分成多个分段(例如,平均分成或交叠分成时长为10-30ms的多个分段),每个分段可以作为一帧信号,第一信号可以包括其中一帧或多帧信号。在一些可替代的实施例中,第一信号和第二信号可以是目标语音在频域上的呈现。例如,处理设备110可以对上述一帧或多帧信号进行快速傅里叶变换(Fast Fourier Transform,FFT)以获得第一信号或第二信号。可选地,在对帧信号进行FFT之前,可以先对帧信号进行加窗平滑处理。具体地,处理设备110可以将帧信号与窗函数相乘,以对帧信号进行周期扩张,获得周期性的连续信号。示例性的窗函数可以包括矩形窗、汉宁窗、平顶窗、指数窗等。加窗平滑后的帧信号可以进一步进行FFT变换而生成第一信号或第二信号。
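A minimal sketch of the pre-processing described above (framing, window smoothing, FFT) is given below; the 20 ms frame length, 10 ms hop, and Hann window are illustrative values within the ranges mentioned in the text, not parameters fixed by the disclosure.

```python
import numpy as np

def stft_frames(x, fs, frame_ms=20, hop_ms=10):
    """Sketch of the pre-processing: framing, window smoothing, FFT of each frame."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    window = np.hanning(frame)                       # one possible smoothing window
    spectra = []
    for start in range(0, len(x) - frame + 1, hop):
        spectra.append(np.fft.rfft(x[start:start + frame] * window))
    return np.array(spectra)                         # one row per frame: the frequency-domain signal
```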
在一些实施例中,第一信号和第二信号的差异可以与目标语音和噪声信号在不同采集位置的强度、信号幅值、相位差异等相关。
步骤520,处理设备110(例如,处理模块420)可以基于目标语音位置、所述第一位置和所述第二位置,处理所述第一信号和所述第二信号以确定第一系数。
在一些实施例中,处理设备110可以基于自适应零点形成(Adaptive Null-Forming,ANF)算法确定所述第一系数。所述ANF算法可以包括两个差分波束形成器以及自适应滤波器。在一些实施例中,处理设备110可以基于所述目标语音位置、所述第一位置和所述第二位置,通过所述两个差分波束形成器对所述第一信号和所述第二信号进行差分运算,获取指向第一方向的信号和指向第二方向的信号。例如,处理设备110可以根据差分麦克风原理,基于目标语音位置、第一位置和第二位置对第一信号和第二信号进行时延处理。并对时延后的第一信号和第二信号进行差分运算,获取指向第一方向的信号和指向第二方向的信号。在一些实施例中,指向第一方向的信号为指向目标声源方向的信号,指向第二方向的信号为指向与目标声源相反方向的信号。所述指向第一方向的信号和指向第二方向的信号含有不同比例的有效信号。所述有效信号是指目标声 源所发出的语音。例如,所述指向第一方向的信号中可以含有较大比例的有效信号(和/或较小比例的噪声信号)。所述指向第二方向的信号中可以含有较小比例的有效信号(和/或较大比例的噪声信号)。在一些实施例中,所述指向第一方向的信号和指向第二方向的信号可以对应两个指向麦克风。进一步的,处理设备110可以基于所述指向第一方向的信号和指向第二方向的信号,确定与有效信号对应的第三信号。例如,处理设备110可以对所述指向第一方向的信号和指向第二方向的信号进行自适应差分运算,确定第四信号。仅作为示例,在所述自适应差分运算过程中,处理设备110可以基于维纳滤波算法,通过所述自适应滤波器对指向第一方向的信号和指向第二方向的信号进行滤波,确定第四信号。在一些实施例中,在所述自适应差分运算过程中,处理设备110可以调整自适应滤波器的参数,使第四信号对应的心形图零点指向噪声方向。在一些实施例中,处理设备110可以对第四信号中的低频成分进行增强,以获取所述第三信号。进一步地,处理设备110可以基于所述第三信号,确定所述第一系数。例如,第一系数可以是第三信号与第一信号或第二信号的比值。关于基于第三信号确定所述第一系数的更多内容可以参见图8及其描述,此处不再赘述。
在一些实施例中,为了确定所述第一系数,处理设备110可以基于所述指向第一方向的信号和所述指向第二方向的信号,确定所述目标语音的估计信噪比。例如,所述估计信噪比可以是所述指向第一方向的信号与所述指向第二方向的信号之间的比值。进一步地,处理设备110可以基于所述估计信噪比确定第一系数。在一些实施例中,处理设备110可以基于估计信噪比与第一系数之间的映射关系确定所述第一系数。所述映射关系可以是多种形式,例如,映射关系数据库或关系函数。关于基于估计信噪比确定第一系数的更多内容可以参见图9及其描述,此处不再赘述。
在一些实施例中,第一系数可以反映噪声信号对有效信号的影响。以基于估计信噪比确定的第一系数为例,所述估计信噪比可以是所述指向第一方向的信号与所述指向第二方向的信号之间的比值。所述指向第一方向的信号中可能含有较大比例的有效信号(和/或较小比例的噪声信号)。所述指向第二方向的信号中可能含有较小比例的有效信号(和/或较大比例的噪声信号)。因此,噪声信号的大小可以影响估计信噪比的值,从而影响第一系数的值。例如,噪声信号越大,估计信噪比的值越小,根据估计信噪比确定的第一系数的值也会相应的变化。由此,第一系数可以反映噪声信号对有效信号的影响。
在一些实施例中,第一系数与噪声源方向有关。例如,当噪声源方向接近目标 声源方向时,第一系数可以具有较大的值;当噪声源方向偏离目标声源方向较大角度时,第一系数可以具有较小的值。。所述噪声源方向为噪声源相对于双麦克风的方向,目标声源方向为目标声源(如用户的嘴部)相对于双麦克风的方向。处理设备110可以根据该第一系数处理与有效信号对应的第三信号。例如,第一系数可以表示与有效信号对应的第三信号在语音增强过程中的权重。仅仅作为示例,当第一系数为“1”时,表示第三信号可以被完全保留而作为增强后的语音信号中的一部分;当第一系数为“0”时,表示从增强后的语音信号中完全滤除第三信号。
在一些实施例中,当噪声源方向与目标声源方向之间的角度差较大时,第一系数可以具有较小的值,根据该第一系数处理的第三信号可以被减弱或去除;当噪声源方向与目标声源方向之间的角度差较小时,第一系数可以具有较大的值,根据该第一系数处理的第三信号可以被保留作为增强后的语音信号中的一部分。由此,当噪声源方向与目标声源方向之间的角度差较大时,ANF算法可以具有较好的滤波效果图7是根据本说明书一些实施例所示的ANF算法在不同噪声角度时的滤波效果示意图。所述噪声角度是指噪声源方向与目标声源方向之间的角度。如图7所示,图a-f分别表示在噪声角度为180°、150°、120°、90°、60°、30°时,ANF算法的滤波效果。根据图7可知,当噪声角度较大时(例如,180°、150°、120°、90°),ANF算法的滤波效果较好。当噪声角度较小时(例如,60°、30°),ANF算法的滤波效果较差。
步骤530,处理设备110(例如,处理模块420)可以基于所述第一信号和所述第二信号,确定与多个声源方向有关的多个参数。
在一些实施例中,所述多个声源方向可以包括预设的声源方向。例如,所述多个声源方向可以具有预设的入射角(例如,0°、30°、60°、90°、120°、180°等)。所述声源方向可以根据实际需求选择和/或调整,此处不做限制。在一些实施例中,处理设备110可以基于每个声源方向、所述第一位置和所述第二位置,对所述第一信号和所述第二信号进行差分运算,确定与每个声源方向有关的参数。例如,所述多个声源方向中的每个声源方向可以对应一个时延,处理设备110可以基于该时延对所述第一信号和第二信号进行差分运算。进一步地,处理设备110可以基于所述差分运算计算与所述声源方向有关的参数。在一些实施例中,所述参数可以包括似然函数。例如,对于当前帧的每一个信号点,处理设备110可以计算出与所述多个声源方向中的每个声源方向对应的似然函数。在一些实施例中,所述似然函数可以对应从所述声源方向发出声音以形成所述第一信号和所述第二信号的概率。仅作为示例,声源方向为θ=30°时的似然函数值经过 归一化处理后为0.8,可以表示从声源方向为θ=30°发出声音以形成所述第一信号和所述第二信号的概率为80%。
步骤540,处理设备110(例如,处理模块420)可以基于所述多个参数和所述目标语音位置,确定第二系数。
在一些实施例中,处理设备110可以基于所述多个参数确定合成声源的方向。所述合成声源可以认为是由目标声源和噪声源综合形成的虚拟声源,也就是说,由目标声源和噪声源共同在双麦克风处产生的信号(例如,所述第一信号和所述第二信号)可以等效为是由该合成声源在双麦克风处产生的。
在一些实施例中,为了确定合成声源的方向,处理设备110可以确定所述多个参数中数值最大的参数。所述数值最大的参数可以表示从与其对应的声源方向发出声音以形成所述第一信号和所述第二信号的概率最大。由此,处理设备110可以确定所述数值最大的参数对应的声源方向为合成声源的方向。作为另一示例,为了确定合成声源的方向,处理设备110可以构建极点指向所述多个声源方向的多个指向麦克风。所述多个指向麦克风中的每个麦克风的响应为心形图。为便于描述,所述多个声源方向对应的心形图可以称为模拟心形图。每个模拟心形图的极点可以指向对应的声源方向。进一步地,基于所述第一信号和所述第二信号,处理设备110可以计算出与所述多个声源方向中的每个声源方向对应的似然函数。所述对应多个声源方向的似然函数的响应可以为心形图。为便于描述,所述似然函数对应的心形图可以称为合成心形图(或实际心形图)。合成心形图的极点指向合成声源的方向。处理设备110可以确定与实际心形图的极点指向最接近的模拟心形图,并确定所述模拟心形图对应的声源方向为合成声源的方向。
在一些实施例中,处理设备110可以基于所述合成声源方向和所述目标语音位置,确定所述第二系数。例如,处理设备110可以判断所述目标语音位置是否位于合成声源方向,或者所述目标语音位置是否在合成声源方向的一定角度范围之内。响应于所述目标语音位置位于所述合成声源方向或者在合成声源方向的一定角度范围之内,将所述第二系数设为第一数值。响应于所述目标语音位置不位于所述合成声源方向或者不在合成声源方向的一定角度范围之内,将所述第二系数设为第二数值。可选地或附加地,处理设备110可以基于平滑因子,对所述第二系数进行平滑。再例如,处理设备110可以基于所述目标语音位置和所述合成声源方向之间的角度,通过回归函数确定所述第二系数。关于确定第二系数的更多内容可以参见图10及其描述,此处不再赘述。
在一些实施例中,第二系数可以反映合成声源相对于目标声源的方向,从而减 弱或去除不在目标声源方向上的合成声源和/或相对于目标声源方向偏差一定角度的合成声源。在一些实施例中,第二系数可以用于滤除噪声源方向与目标声源方向之间角度差超过一定阈值的噪声。例如,当噪声源方向与目标声源方向之间角度差超过一定阈值时,第二系数可以具有较小的值;当噪声源方向与目标声源方向之间角度差小于一定阈值时,第二系数可以具有较大的值。处理设备110可以根据该第二系数处理第一信号或第三信号。例如,第二系数可以表示第一信号在语音增强过程中的权重。仅仅作为示例,当第二系数为“1”时,表示第一信号可以被完全保留而作为增强后的语音信号中的一部分;当第二系数为“0”时,表示从增强后的语音信号中完全滤除第一信号。
步骤550,处理设备110(例如,生成模块430)可以基于所述第一系数和所述第二系数,处理所述第一信号和/或第二信号以获取所述目标语音对应的语音增强后的第一输出语音信号。
在一些实施例中,处理设备110可以基于所述第一系数和所述第二系数,对所述第一信号和/或第二信号进行加权处理。以第一系数为例,处理设备110可以根据第一系数的值,对基于第一信号和第二信号获取的第三信号赋予相应的权重。例如,处理设备110可以根据第一系数所在范围赋予第三信号相应的权重。再例如,处理设备110可以将所述第一系数的值直接作为第三信号的权重。再例如,当第一系数的值小于预设的第一系数阈值时,处理设备110可以将第三信号的权重设为0。以第二系数为例,处理设备110可以根据第二系数的值,对第一信号或第三信号赋予相应的权重。处理设备110可以进一步地处理上述加权后的信号以获取语音增强后的第一输出语音信号。例如,所述第一输出语音信号可以是加权后的第三信号和第一信号的平均值。再例如,第一输出语音信号可以是加权后的第三信号和第一信号的乘积。再例如,第一输出语音信号可以是加权后的第三信号和第一信号两者中较大的值。再例如,生成模块430可以基于第一系数对所述第三信号加权后,再基于所述第二系数进行再次加权处理。
需要注意的是,以上对于语音增强方法500的描述,仅为描述方便,并不能把本说明书限制在所举实施例范围之内。可以理解,对于本领域的技术人员来说,在了解该方法的原理后,可以在不背离这一原理的情况下,对各个步骤进行任意组合,或者,可以增加或删减任意步骤。
在一些实施例中,语音增强方法500还可以包括单麦滤波过程。例如,处理设备110可以基于单麦滤波算法对所述第一输出语音信号进行单麦滤波。再例如,处理设备110可以基于单麦滤波算法处理所述第一信号和/或第二信号以获取第三系数,并基 于所述第三系数对所述第一输出语音信号进行滤波。关于单麦滤波过程的更多内容可以参见图11、图12及其描述,此处不再赘述。
在一些实施例中,处理设备110还可以基于第四系数进行语音增强处理。例如,处理设备110基于第一信号和第二信号的能量差,确定第四系数。并基于第一系数、第二系数和第四系数中的任意一个或其组合,处理所述第一信号和/或第二信号以获取所述语音增强后的输出语音信号。关于基于第四系数进行语音增强的更多内容可以参见图13及其描述,此处不再赘述。
在一些实施例中,上述语音增强方法500可以在预处理(例如,分帧、加窗平滑、FFT变换等)后获得的第一信号和/或第二信号上实施。也就是说,所述第一输出语音信号可以是单帧语音信号。由此,语音增强方法500还可以包括后处理过程。示例性的后处理可以包括逆FFT变换、帧拼接等。经过所述后处理过程,处理设备110可以获得连续的输出语音信号。
图8是根据本说明书一些实施例所示的确定第一系数方法的示例性流程图。在一些实施例中,方法800可以由处理设备110、处理引擎112、处理器220执行。例如,方法800可以以程序或指令的形式存储在存储设备(例如,存储设备140或处理设备110的存储单元)中,当处理设备110、处理引擎112、处理器220或图4所示的模块执行程序或指令时,可以实现方法800。在一些实施例中,方法500中所述的操作520可以通过方法800实施。在一些实施例中,方法800可以利用以下未描述的一个或以上附加操作/步骤,和/或不通过以下所讨论的一个或以上操作/步骤完成。另外,如图8所示的操作/步骤的顺序并非限制性的。
在一些实施例中,处理设备110可以基于自适应零点形成(Adaptive Null-Forming,ANF)算法确定所述第一系数。所述ANF算法可以包括两个差分波束形成器以及自适应滤波器。所述两个差分波束形成器可以对所述第一信号和所述第二信号进行差分处理,形成指向第一方向的信号和指向第二方向的信号。所述自适应滤波器可以对指向第一方向的信号和指向第二方向的信号进行自适应滤波,获得与有效信号对应的第三信号。如图8所示,方法800可以包括:
步骤810,处理设备110可以基于所述目标语音位置、所述第一位置和所述第二位置,对所述第一信号和所述第二信号进行差分运算,获取指向第一方向的信号和指向第二方向的信号。
在一些实施例中,处理设备110可以根据差分麦克风原理,基于目标语音位置、 第一位置和第二位置对第一信号和第二信号进行时延处理。例如,如图6所述,前麦克风A与后麦克风B之间的距离为d,目标声源具有入射角θ,则目标声源在前麦克风A与后麦克风B之间的传播时间可以表示为:
τ=dcosθ/c,  (1)
其中,c为声音传播速度。所述传播时间τ可以作为第一信号和第二信号间的时延。当θ=0°时,τ=d/c。根据差分麦克风原理可以获取指向第一方向的信号和指向第二方向的信号:
x s(t)=sig1(t)-sig2(t-τ),  (2)
x n(t)=sig2(t)-sig1(t-τ),  (3)
其中,t表示当前帧的每一个时间点,sig1表示第一信号,sig2表示第二信号。
根据上述公式(2)和(3),通过时延τ分别对第一信号sig1和第二信号sig2进行时延后做差分,可以获取指向第一方向的信号x s和指向第二方向的信号x n。所述指向第一方向的信号x s可以对应第一指向麦克风,所述第一指向麦克风的响应为心形图,其极点指向目标声源方向。所述指向第二方向的信号x n可以对应第二指向麦克风,所述第二指向麦克风的响应为心形图,其零点指向目标声源方向。
在一些实施例中,所述指向第一方向的信号x s和所述指向第二方向的信号x n可以含有不同比例的有效信号。例如,所述指向第一方向的信号x s中可以含有较大比例的有效信号(和/或较小比例的噪声信号)。所述指向第二方向的信号x n中可以含有较小比例的有效信号(和/或较大比例的噪声信号)。
步骤820,处理设备110可以对所述指向第一方向的信号和所述指向第二方向的信号进行自适应差分运算,确定第四信号。
在一些实施例中,处理设备110可以基于维纳滤波算法,通过所述自适应滤波器对指向第一方向的信号和指向第二方向的信号进行滤波。所述自适应滤波器可以是最小均方差(LeastMeanSquare,LMS)滤波器。在进行滤波时,指向第一方向的信号x s可以作为LMS滤波器的期望信号,指向第二方向的信号x n可以作为LMS滤波器的参考噪声。基于所述期望信号和参考噪声,处理设备110可以通过LMS滤波器对所述指向第一方向的信号x s和所述指向第二方向的信号x n进行自适应滤波(即自适应差分运算),确定第四信号。所述第四信号可以是滤除噪声之后的信号。在一些实施例中,自适应差分运算的示例性过程可以如下述公式所示:
y=x s-wx n,  (4)
其中,y表示第四信号(即LMS滤波器的输出信号),w表示自适应差分运算的自适应参数(即LMS滤波器的系数)。
在一些实施例中,在所述自适应差分运算过程中,处理设备110可以基于所述第四信号y、所述指向第一方向的信号x s和所述指向第二方向的信号x n,更新所述自适应差分运算的自适应参数w。例如,基于每一帧第一信号和第二信号,处理设备110可以获取指向第一方向的信号x s和指向第二方向的信号x n。进一步地,处理设备110可以通过梯度下降法对自适应参数w进行更新,使得自适应差分运算的损失函数(例如,均方差损失函数)逐渐收敛。
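The delay-and-subtract beams of Eqs. (1)-(3) and the adaptive subtraction of Eq. (4) can be sketched as follows. The microphone spacing, the step size, and the normalised form of the LMS update are assumptions for illustration; the text above only requires that the adaptive parameter w be updated (for example, by gradient descent) so that the output power decreases.

```python
import numpy as np

def anf_frame(sig1, sig2, fs, w=0.0, d=0.015, c=343.0, mu=0.05, eps=1e-8):
    """Sketch of Eqs. (1)-(4): directional beams by delay-and-subtract, then an LMS update.

    sig1, sig2 -- one time-domain frame from the front / rear microphone (target at theta = 0)
    w          -- adaptive coefficient carried over from the previous frame
    d, c       -- microphone spacing (m) and speed of sound (m/s); d is an assumed value
    """
    delay = max(1, int(round(fs * d / c)))          # tau = d*cos(theta)/c in samples, at least one sample
    x_s = sig1[delay:] - sig2[:-delay]              # Eq. (2): beam pointing towards the target direction
    x_n = sig2[delay:] - sig1[:-delay]              # Eq. (3): beam pointing away from the target direction
    y = x_s - w * x_n                               # Eq. (4): the fourth signal
    w = w + 2.0 * mu * np.mean(y * x_n) / (np.mean(x_n ** 2) + eps)  # normalised LMS step towards min E[y^2]
    return y, w
```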
步骤830,处理设备110可以对第四信号中的低频成分进行增强,获取第三信号。
在一些实施例中,所述差分波束形成器可以具有高通滤波的特性。在使用所述差分波束形成器对所述第一信号和所述第二信号进行差分处理时,可能会减弱所述第一信号和所述第二信号中的低频成分。相应地,通过自适应差分运算后获取的第四信号y中的低频成分被减弱。在一些实施例中,处理设备110可以通过补偿滤波器增强第四信号y中的低频成分。仅作为示例,所述补偿滤波器可以如下述公式所示:
Figure PCTCN2021096375-appb-000001
其中,W EQ表示补偿滤波器,ω表示第四信号y的频率,ω c表示高通滤波的截止频率。在一些实施例中,示例性的ω c取值可以为:
ω c=0.5πc/d,  (6)
其中,c表示声音传播速度,d表示双麦克风间距。
在一些实施例中,处理设备110可以基于所述补偿滤波器W EQ对所述第四信号y进行滤波,获取第三信号。例如,第三信号可以是第四信号y与补偿滤波器W EQ的乘积。
步骤840,处理设备110可以基于所述第三信号,确定所述第一系数。
在一些实施例中,处理设备110可以确定第三信号与第一信号或第二信号的比值,并根据所述比值确定第一系数。
需要注意的是,以上对于方法800的描述,仅为描述方便,并不能把本说明书限制在所举实施例范围之内。可以理解,对于本领域的技术人员来说,在了解该方法的原理后,可以在不背离这一原理的情况下,对各个步骤进行任意组合,或者,可以增加 或删减任意步骤。
仅仅为了说明,上述实施例中方法800的操作是在时域上对第一信号和第二信号进行处理。应当理解,方法800中的一个或以上的操作也可以在频域上进行。例如,在时域上对第一信号和第二信号进行的时延处理也可以是在频域上对第一信号和第二信号进行等效的相移。在一些实施例中,步骤830不是必须的,即步骤820中获得的第四信号可以不经过低频增强而直接作为第三信号使用。
图9是根据本说明书一些实施例所示的确定第一系数方法的示例性流程图。在一些实施例中,方法900可以由处理设备110、处理引擎112、处理器220执行。例如,方法900可以以程序或指令的形式存储在存储设备(例如,存储设备140或处理设备110的存储单元)中,当处理设备110、处理引擎112、处理器220或图4所示的模块执行程序或指令时,可以实现方法900。在一些实施例中,方法500中所述的操作520可以通过方法900实施。在一些实施例中,方法900可以利用以下未描述的一个或以上附加操作/步骤,和/或不通过以下所讨论的一个或以上操作/步骤完成。另外,如图9所示的操作/步骤的顺序并非限制性的。如图9所示,方法900可以包括:
步骤910,处理设备110可以基于目标语音位置、第一位置和第二位置,对第一信号和第二信号进行差分运算,获取指向第一方向的信号和指向第二方向的信号。所述指向第一方向的信号和所述指向第二方向的信号含有不同比例的有效信号。
在一些实施例中,可以通过执行图8所描述的步骤810来执行步骤910,此处不再赘述。
步骤920,处理设备110可以基于所述指向第一方向的信号和所述指向第二方向的信号,确定所述目标语音的估计信噪比。
在一些实施例中,所述指向第一方向的信号中可以含有较大比例的有效信号(和/或较小比例的噪声信号)。所述指向第二方向的信号中可以含有较小比例的有效信号(和/或较大比例的噪声信号)。所述估计信噪比可以表示为指向第一方向的信号与所述指向第二方向的信号之间的比值(即x s/x n)。在一些实施例中,不同的估计信噪比可以对应不同的合成声源入射角θ。例如,较大的估计信噪比可以对应较小的合成声源入射角θ。在一些实施例中,合成声源入射角θ可以反映噪声信号对有效信号的影响。例如,当噪声信号对有效信号影响较大时(例如,噪声源方向与目标声源方向之间的角度差较大),合成声源入射角θ可以具有较大的值;当噪声信号对有效信号影响较小时(例如,噪声源方向与目标声源方向之间的角度差较小),合成声源入射角θ可以具有较小的值。 由此,所述估计信噪比可以反映合成声源的方向,并进一步反映噪声信号对有效信号的影响。
步骤930,处理设备110可以基于所述估计信噪比,确定所述第一系数。
在一些实施例中,处理设备110可以基于估计信噪比与第一系数之间的映射关系确定所述第一系数。所述映射关系可以是多种形式,例如,映射关系数据库或关系函数。
在一些实施例中,不同的噪声源方向可以对应不同的合成声源入射角θ,相应地可以对应不同的估计信噪比。也就是说,估计信噪比可以与噪声源方向有关(也即与噪声信号对有效信号的影响程度有关)。由此,对于不同的估计信噪比,可以确定不同的第一系数。例如,当估计信噪比较小时,对应的合成声源入射角θ可以具有较大的值,表示噪声信号对有效信号的影响较大。相应地,与有效信号对应的第三信号可能含有较大比例的噪声信号。由此,可以通过确定第一系数的值以减弱或去除所述第三信号。处理设备110可以根据该第一系数处理与有效信号对应的第三信号。例如,第一系数可以表示与有效信号对应的第三信号在语音增强过程中的权重。仅仅作为示例,当第一系数为“1”时,表示第三信号可以被完全保留而作为增强后的语音信号中的一部分;当第一系数为“0”时,表示从增强后的语音信号中完全滤除第三信号。在一些实施例中,可以建立估计信噪比与第一系数之间的映射关系数据库。处理设备110可以基于估计信噪比检索所述数据库,从而确定所述第一系数。
在一些实施例中,处理设备110还可以基于估计信噪比与第一系数之间的关系函数确定所述第一系数。例如,所述关系函数可以如下述公式所示:
Figure PCTCN2021096375-appb-000002
其中,
Figure PCTCN2021096375-appb-000003
表示所述估计信噪比。
需要注意的是,以上对于方法900的描述,仅为描述方便,并不能把本说明书限制在所举实施例范围之内。可以理解,对于本领域的技术人员来说,在了解该方法的原理后,可以在不背离这一原理的情况下,对各个步骤进行任意组合,或者,可以增加或删减任意步骤。
图10是根据本说明书一些实施例所示的确定第二系数方法的示例性流程图。在 一些实施例中,方法1000可以由处理设备110、处理引擎112、处理器220执行。例如,方法1000可以以程序或指令的形式存储在存储设备(例如,存储设备140或处理设备110的存储单元)中,当处理设备110、处理引擎112、处理器220或图4所示的模块执行程序或指令时,可以实现方法1000。在一些实施例中,方法500中所述的操作530及540可以通过方法1000实施。在一些实施例中,方法1000可以利用以下未描述的一个或以上附加操作/步骤,和/或不通过以下所讨论的一个或以上操作/步骤完成。另外,如图10所示的操作/步骤的顺序并非限制性的。如图10所示,方法1000可以包括:
步骤1010,处理设备110可以基于每个声源方向、第一位置和第二位置,对第一信号和第二信号进行差分运算,确定与每个声源方向有关的参数。
在一些实施例中,所述多个声源方向可以包括预设的声源方向。所述声源方向可以根据实际需求选择和/或调整,此处不做限制。例如,所述多个声源可以具有预设的入射角θ=(θ 1,θ 2,...,θ n)(例如,0°、30°、60°、90°、120°、150°、180°等)。处理设备110可以基于每个声源方向、所述第一位置和所述第二位置,对所述第一信号和所述第二信号进行差分运算。例如,所述多个声源方向中的每个声源方向可以对应一个时延,处理设备110可以构建与所述多个声源方向对应的时延组合τ=(τ 1,τ 2,...,τ n)。基于所述时延组合,处理设备110可以对所述第一信号和第二信号进行差分运算,构建多个指向麦克风。所述多个指向麦克风的极点可以指向所述多个声源方向。所述多个指向麦克风中的每个麦克风的响应为心形图。为便于描述,所述多个声源方向对应的心形图可以称为模拟心形图。每个模拟心形图的极点可以指向对应的声源方向。
在一些实施例中,处理设备110可以基于所述差分运算计算与所述声源方向有关的参数。所述参数可以包括似然函数。例如,对于当前帧的每一个信号点,处理设备110可以计算出与所述多个声源方向中的每个声源方向对应的似然函数。例如,所述似然函数可以如下述公式所示:
LH i(f,t)=-|exp(-i2πfθ i)sig1(f,t)-sig2(f,t)| 2,  (8)
其中,LH i(f,t)表示在t时刻f频率对应的似然函数,是在声源方向为时第一信号的时频域表sig1(f,t)、sig2(f,t)分别为声源方向为θ i时第一信号和第二信号的时频域表达,exp(-i2πfθ i)sig1(f,t)中的-2πfθ i表示在声源方向为θ i时声源传播至第二位置相对于第一位置的相位差。
在一些实施例中,所述似然函数可以对应从所述声源方向发出声音以形成所述 第一信号和所述第二信号的概率。仅作为示例,声源方向为θ=30°时的似然函数值为0.8,可以表示从声源方向为θ=30°发出声音以形成所述第一信号和所述第二信号的概率为80%。在一些实施例中,所述对应多个声源方向的似然函数的响应可以为心形图。为便于描述,所述似然函数对应的心形图可以称为合成心形图(或实际心形图)。
步骤1020,处理设备110可以基于所述多个参数,确定合成声源方向。
在一些实施例中,为了确定合成声源的方向,处理设备110可以确定所述多个参数中数值最大的参数。例如,处理设备110可以计算出与所述多个声源方向θ=(θ 1,θ 2,...,θ n)中的每个声源方向对应的似然函数LH 1(f,t),LH 2(f,t),...,LH n(f,t)。似然函数LH i(f,t)可以对应从声源方向θ i发出声音以形成所述第一信号和所述第二信号的概率。所述数值最大的似然函数可以表示从与其对应的声源方向发出声音以形成所述第一信号和所述第二信号的概率最大。由此,处理设备110可以确定所述数值最大的似然函数对应的声源方向为合成声源的方向。例如,如果θ i=30°时的似然函数值最大,处理设备110可以确定合成声源的方向为30°。
在一些实施例中,处理设备110可以基于所述多个指向麦克风确定合成声源的方向。根据上述实施例,所述多个指向麦克风可以分别与预设的声源方向θ=(θ 1,θ 2,...,θ n)对应。每个指向麦克风的响应为模拟心形图。所述模拟心形图的极点指向对应的声源方向。进一步地,所述对应多个声源方向的似然函数的响应也可以是心形图(称为合成心形图)。所述合成心形图的极点指向合成声源的方向。处理设备110可以将所述合成心形图与上述多个模拟心形图比较,确定与实际心形图的极点指向最接近的模拟心形图。所述模拟心形图对应的声源方向可以确定为合成声源的方向。
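Eq. (8) can be evaluated for a set of candidate directions and the direction with the largest likelihood taken per time-frequency point, as in the sketch below. The candidate angles and the microphone spacing are illustrative assumptions, and the delay tau computed from each candidate angle plays the role of the direction-dependent phase term in Eq. (8).

```python
import numpy as np

def dominant_direction_per_bin(spec1, spec2, freqs, d=0.015, c=343.0,
                               angles_deg=(0, 30, 60, 90, 120, 150, 180)):
    """Sketch of Eq. (8): likelihoods for candidate directions, argmax per frequency bin.

    spec1, spec2 -- STFT coefficients of the first / second signal for one frame
    freqs        -- centre frequency (Hz) of every STFT bin
    """
    likelihoods = []
    for angle in angles_deg:
        tau = d * np.cos(np.deg2rad(angle)) / c                   # inter-microphone delay for this direction
        likelihoods.append(-np.abs(np.exp(-2j * np.pi * freqs * tau) * spec1 - spec2) ** 2)
    likelihoods = np.stack(likelihoods)                           # shape: (n_angles, n_bins)
    best = np.argmax(likelihoods, axis=0)                         # dominant direction index per bin
    return np.asarray(angles_deg)[best], likelihoods
```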
步骤1030,处理设备110可以基于所述合成声源方向和所述目标语音位置,确定所述第二系数。
在一些实施例中,为了确定所述第二系数,处理设备110可以判断所述目标语音位置是否位于合成声源方向。例如,所述目标语音位置可以位于双麦克风的延长线上。也就是说,目标语音位置对应的声源方向为θ=0°。处理设备110可以判断所述合成声源方向是否为0°,如果合成声源方向为0°,那么可以判断所述目标语音位置位于合成声源方向。在一些实施例中,处理设备110可以判断所述数值最大的似然函数是否位于目标声源占主导的集合。如果所述数值最大的似然函数位于目标声源占主导的集合,处理设备110可以确定所述目标语音位置位于合成声源方向。在一些实施例中,所述目标声源占主导的集合可以如下述公式所示:
Figure PCTCN2021096375-appb-000004
其中,LH 0(f,t)表示目标语音位置位于合成声源方向时的似然函数值。基于集合
Figure PCTCN2021096375-appb-000005
处理设备110可以确定时频点(f,t),使得该时频点(f,t)对应的似然函数在θ=0°时取得最大值maxLH i(f,t)。此时,该时频点对应的信号(例如,第一信号或第三信号)可以为从目标声源方向发出的信号。
需要注意的是,以上公式(8)所示的目标声源占主导的集合
Figure PCTCN2021096375-appb-000006
仅仅作为示例。在公式(8)中,目标语音位置位于双麦克风连线的延长线上(即θ=0°),因此基于上述方法确定的时频点(f,t)在θ=0°时取得最大值。可选地或附加地,目标语音位置可以不位于双麦克风连线的延长线上(即θ≠0°)。例如,目标语音位置与双麦克风连线延长线的夹角为30度。此时公式(8)中的LH 0(f,t)就可以是θ=30°时的似然函数值。也就是说,此时根据目标声源占主导的集合求出的时频点(f,t)应当使得似然函数在θ=30°时取得最大值。
在一些实施例中,响应于所述目标语音位置位于所述合成声源方向,处理设备110可以将所述第二系数设为第一数值(例如,1)。响应于所述目标语音位置不位于所述合成声源方向,处理设备110可以将所述第二系数设为第二数值(例如,0)。在一些实施例中,处理设备110可以基于所述第二系数处理对应的第一信号或经过ANF滤波后获取的第三信号。例如,以第一信号为例,所述第二系数可以作为第一信号的权重。例如,当第二系数为1时,可以表示保留所述第一信号。相反,当目标语音位置不位于合成声源方向时,所述合成声源方向对应的第一信号可以认为是噪声信号。处理设备110可以将第二系数为0。由此,处理设备110可以基于所述第二系数滤除或减弱目标语音位置不位于合成声源方向时对应的噪声信号。
在一些实施例中,所述第二系数可以组成用于过滤输入信号(例如,第一信号或第三信号)的掩蔽矩阵。例如,所述掩蔽矩阵可以如下述公式所示:
Figure PCTCN2021096375-appb-000007
上述掩蔽矩阵M为二值化矩阵,可以直接去除被判断为噪声的输入信号。因此,基于掩蔽矩阵M处理的语音信号可能会造成频谱泄漏、语音不连续等问题。在一些实施例中,处理设备110可以基于平滑因子,对所述第二系数进行平滑。例如,基于平滑因子,处理设备110可以在时域上对所述第二系数进行平滑。时域平滑过程可以如下述公式所示:
Figure PCTCN2021096375-appb-000008
其中,α表示平滑因子,M(f,t-1)表示对应前一帧的掩蔽矩阵,M(f,t)表示对应当前帧的掩蔽矩阵。平滑因子α可以用于对前一帧的掩蔽矩阵和当前帧的掩蔽矩阵进行加权处理,从而得到对应当前帧的平滑的掩蔽矩阵
Figure PCTCN2021096375-appb-000009
在一些实施例中,处理设备110还可以在频域上对所述第二系数进行平滑。例如,处理设备110可以使用滑动汉明窗对所述第二系数进行平滑。
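The binary mask formed from the second coefficients, its temporal smoothing with the factor alpha, and a sliding-window frequency smoothing can be sketched as follows; the recursion and the Hamming-window width are assumed forms of the weighted combination described above, not an exact formulation from this disclosure.

```python
import numpy as np

def smoothed_mask(dominant_angle, prev_mask, target_angle=0, alpha=0.7, win=5):
    """Sketch: second-coefficient mask with temporal and frequency smoothing.

    dominant_angle -- per-bin synthesized-source direction from the likelihood step
    prev_mask      -- smoothed mask of the previous frame
    """
    mask = (dominant_angle == target_angle).astype(float)    # 1 where the target direction dominates, else 0
    mask = alpha * prev_mask + (1.0 - alpha) * mask          # temporal smoothing over consecutive frames (assumed form)
    kernel = np.hamming(win)
    kernel /= kernel.sum()
    mask = np.convolve(mask, kernel, mode="same")            # sliding Hamming-window smoothing over frequency
    return np.clip(mask, 0.0, 1.0)
```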
在一些实施例中,处理设备110可以基于所述目标语音位置和所述合成声源方向之间的角度,通过回归函数确定所述第二系数。例如,对于每一个时频点,处理设备110可以计算出与所述多个声源方向θ=(θ 1,θ 2,...,θ n)中的每个声源方向对应的似然函数LH 1(f,t),LH 2(f,t),...,LH n(f,t)。并确定数值最大的似然函数对应的声源方向为合成声源的方向。例如,如果θ i=30°时的似然函数值最大,处理设备110可以确定合成声源的方向为30°。进一步地,处理设备110可以基于所述目标声源方向和所述合成声源方向之间的角度,通过回归函数确定所述第二系数。例如,处理设备110可以构建所述角度与第二系数间的回归函数。在一些实施例中,所述回归函数可以包括平滑的回归函数,例如,线性回归函数。仅作为示例,所述回归函数的值可以随着目标声源方向与合成声源方向之间的角度的增大而减小。这样,当以第二系数作为权重处理输入信号时,可以减弱或去除目标声源方向与合成声源方向之间的角度的较大时的输入信号,从而达到去除噪声的目的。
需要注意的是,以上对于方法1000的描述,仅为描述方便,并不能把本说明书限制在所举实施例范围之内。可以理解,对于本领域的技术人员来说,在了解该方法的原理后,可以在不背离这一原理的情况下,对各个步骤进行任意组合,或者,可以增加或删减任意步骤。
根据本说明书的一些实施例,处理设备110可以通过双麦克风获取目标语音信号,并基于双麦滤波算法对目标语音信号进行滤波。例如,当噪声信号与有效信号之间的角度差较大(即噪声源方向与目标声源方向之间的角度差较大)时,处理设备110可以基于第一系数进行滤波,去除所述噪声信号。当噪声信号与有效信号之间的角度差较小时,处理设备110可以基于第二系数进行滤波。如此,处理设备110可以基本滤除目标语音信号中的噪声信号。在一些实施例中,经过上述双麦滤波过程的获取的第一输出语音信号可能会包括遗留噪声。例如,在部分频率子带(例如,中高频子带)上,第一输出语音信号可能包括幅度谱上连续的噪声。因此,在一些实施例中,处理设备110还 可以基于单麦滤波算法对第一输出语音信号进行后置滤波。
图11是根据本说明书一些实施例所示的单麦滤波方法的示例性流程图。在一些实施例中,方法1100可以由处理设备110、处理引擎112、处理器220执行。例如,方法1100可以以程序或指令的形式存储在存储设备(例如,存储设备140或处理设备110的存储单元)中,当处理设备110、处理引擎112、处理器220或图4所示的模块执行程序或指令时,可以实现方法1100。在一些实施例中,方法1100可以利用以下未描述的一个或以上附加操作/步骤,和/或不通过以下所讨论的一个或以上操作/步骤完成。另外,如图11所示的操作/步骤的顺序并非限制性的。如图11所示,方法1100可以包括:
步骤1110,处理设备110(例如,处理模块420)可以确定第一输出语音信号中至少一个目标子带信号。
在一些实施例中,处理设备110可以基于第一输出语音信号中每个子带信号的信噪比确定所述至少一个目标子带信号。在一些实施例中,处理设备110可以基于第一输出语音信号,获取多个子带信号。例如,处理设备110可以基于信号频段对第一输出语音信号进行子带划分,获取多个子带信号。仅作为示例,处理设备110可以按照低频、中频或高频的频段类别对第一输出语音信号进行子带划分,或者也可以按照特定的频带宽度(例如,每2kHz作为一个频带)对第一输出语音信号进行子带的划分。再例如,处理设备110可以基于第一输出语音信号的信号频点进行子带划分。信号频点可以指信号的频率值中小数点之后的数值,例如信号的频率值为72.810,则该信号的信号频点为810。基于信号频点进行子带划分可以是按照特定的信号频点宽度对信号进行子带的划分,例如,信号频点810~830作为一个子带,信号频点600~620作为一个子带。在一些实施例中,处理设备110可以通过滤波的方式获取多个子带信号,也可以通过其它的算法或器件进行子带划分,获取多个子带信号,此处不做限制。
进一步地,处理设备110可以计算每一个子带信号的信噪比。信噪比(signal-noise ratio,SNR)可以指语音信号能量与噪声信号能量的比值。信号能量可以是信号功率、基于信号功率得到的其它能量数据等。在一些实施例中,信噪比越大,说明语音信号中包含的噪声越小。在一些实施例中,子带信号的信噪比可以是子带信号中纯净的语音信号(即有效信号)的能量与噪声信号能量的比值,也可以是含有噪声的子带信号的能量与噪声信号能量的比值。在一些实施例中,处理设备110可以通过信噪比估计算法计算每一个子带信号的信噪比。例如,对于每一个子带信号,处理设备110可以基于噪声估计算法计算得到子带信号中的噪声信号值。示例性的噪声估计算法可以包括最小值跟踪 算法、时间递归平均算法等或其组合。进一步地,处理设备110可以基于原始子带信号和噪声信号值计算得到信噪比。在一些实施例中,处理设备110可以采用训练得到的信噪比估计模型计算每一个子带信号的信噪比。示例性的信噪比估计模型可以包括但不限于多层感知机(Multi-Layer Perception,MLP)、决策树(Decision Tree,DT)、深度神经网络(Deep Neural Network,DNN)、支持向量机(Support Vector Machine,SVM)、K最近邻算法(K-Nearest Neighbor,KNN)等任何可以进行特征提取和/或分类的算法或者模型。在一些实施例中,信噪比估计模型可以通过采用训练样本训练初始模型得到。训练样本可以包括语音信号样本(例如,至少一个历史语音信号,每个历史语音信号中含有噪声信号),以及语音信号样本的标签值(例如,历史语音信号v1的信噪比为0.5,历史语音信号v2的信噪比为0.6)。利用模型处理语音信号样本,得到预测的信噪比。基于预测的信噪比与对应训练样本的标签值构造损失函数,基于损失函数调整模型参数,以减小预测的目标信噪比与标签值之间的差异。例如,可以基于梯度下降法等进行模型参数更新或调整。如此进行多轮迭代训练,当训练的模型满足预设条件时,训练结束,得到训练后的信噪比估计模型。其中,预设条件可以是损失函数结果收敛或小于预设阈值等。
进一步地,处理设备110可以基于每一个所述子带信号的信噪比,确定所述目标子带信号。在一些实施例中,处理设备110可以基于信噪比阈值确定所述目标子带信号。例如,对于每一个子带信号,处理设备110可以确定所述子带信号的信噪比是否小于信噪比阈值。响应于所述子带信号的信噪比小于信噪比阈值,处理设备110可以确定所述子带信号为目标子带信号。
在一些实施例中,处理设备110还可以基于预设子带范围确定所述至少一个目标子带信号。例如,所述预设子带范围可以是预设频率范围。所述预设频率范围可以基于经验值确定。所述经验值可以是在语音分析处理过程得到的经验值。仅作为示例,在语音分析处理过程发现3000-4000Hz的频率范围内的语音信号通常含有较大比例的噪声,相应地,所述预设频率范围可以至少包括3000-4000Hz。
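A sketch of the SNR-threshold-based selection of target subbands described above (before the preset-range alternative) is given below. The subband boundaries, the threshold value, and the simple spectral-subtraction style SNR estimate are illustrative assumptions; the disclosure also allows model-based SNR estimators.

```python
import numpy as np

def target_subbands(power_spectrum, noise_psd, band_edges, snr_threshold=2.0, eps=1e-10):
    """Sketch: split the first output spectrum into subbands and keep the low-SNR ones.

    band_edges -- bin indices delimiting the subbands, e.g. [0, 32, 64, 128, 257] (assumed)
    """
    targets = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        signal_power = np.sum(power_spectrum[lo:hi])
        noise_power = np.sum(noise_psd[lo:hi])
        snr = max(signal_power - noise_power, 0.0) / max(noise_power, eps)
        if snr < snr_threshold:
            targets.append((lo, hi))            # low-SNR subband selected for single-mic filtering
    return targets
```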
步骤1120,处理设备110(例如,生成模块430)可以基于单麦滤波算法,处理所述至少一个目标子带信号,获取第二输出语音信号。
在一些实施例中,处理设备110可以基于单麦滤波算法处理所述至少一个目标子带信号,从而滤除所述至少一个目标子带信号中的噪声,获取降噪后的第二输出语音信号。示例性的单麦滤波算法可以包括谱减法、维纳滤波算法、最小值控制的递归平均 算法、语音生成模型算法等或其组合。
根据上述实施例,处理设备110可以根据单麦滤波算法进一步处理通过双麦滤波算法获取的第一输出语音信号。例如,处理设备110可以根据单麦滤波算法对中第一输出语音信号中的部分子带信号(例如,特定频率的信号)进行滤波,从而可以减弱或滤除第一输出语音信号中的噪声信号,实现对双麦滤波算法的修正和/或补充。
需要注意的是,以上对于方法1100的描述,仅为描述方便,并不能把本说明书限制在所举实施例范围之内。可以理解,对于本领域的技术人员来说,在了解该方法的原理后,可以在不背离这一原理的情况下,对各个步骤进行任意组合,或者,可以增加或删减任意步骤。例如,步骤1110可以省略。处理设备110可以不仅仅是针对目标子带信号进行滤波处理,还可以直接对第一输出语音信号整体进行滤波处理。再例如,处理设备110可以基于噪声自动检测算法自动检测出第一输出语音信号中的噪声信号,并通过单麦滤波算法对检测出的噪声信号进行滤波处理。
图12是根据本说明书一些实施例所示的单麦滤波方法的示例性流程图。在一些实施例中,方法1200可以由处理设备110、处理引擎112、处理器220执行。例如,方法1200可以以程序或指令的形式存储在存储设备(例如,存储设备140或处理设备110的存储单元)中,当处理设备110、处理引擎112、处理器220或图4所示的模块执行程序或指令时,可以实现方法1200。在一些实施例中,方法1200可以利用以下未描述的一个或以上附加操作/步骤,和/或不通过以下所讨论的一个或以上操作/步骤完成。另外,如图12所示的操作/步骤的顺序并非限制性的。如图12所示,方法1200可以包括:
步骤1210,处理设备110(例如,处理模块420)可以基于单麦滤波算法处理所述第一信号和/或所述第二信号,确定第三系数。
在一些实施例中,处理设备110可以基于第一信号和第二信号中的任意一个确定第三系数。例如,处理设备110可以基于第一信号确定第三系数,或者可以基于第二信号确定第三系数。在一些实施例中,处理设备110可以基于第一信号和第二信号确定第三系数。例如,处理设备110可以基于第一信号确定第三系数额的第一数值,基于第二信号确定第三系数额的第二数值,然后基于第一数值和第二数值确定所述第三系数(例如,求平均值、加权求和等)。
以第一系数为例,在一些实施例中,处理设备110可以基于单麦滤波算法处理所述第一信号。示例性的单麦滤波算法可以包括谱减法、维纳滤波算法、最小值控制的递归平均算法、语音生成模型算法等或其组合。仅作为示例,处理设备110可以基于单 麦滤波算法得到第一信号中的噪声信号和有效信号,并基于所述噪声信号、有效信号以及第一信号中的至少两个确定与第一信号对应的信噪比。所述与第一信号对应的信噪比可以包括先验信噪比、后验信噪比等。所述先验信噪比可以是有效信号与噪声信号的能量比值。所述后验信噪比可以是有效信号与第一信号的能量比值。进一步地,处理设备110可以基于与第一信号对应的信噪比确定所述第三系数。例如,处理设备110可以基于所述先验信噪比和/或后验信噪比确定所述单麦滤波算法对应的增益系数,并基于所述增益系数确定所述第三系数。例如,处理设备110可以直接将所述增益系数作为第三系数。再例如,处理设备110可以确定增益系数与第三系数之间的映射关系,并基于所述映射关系确定第三系数。这里的增益系数可以指单麦滤波算法中的传递函数。所述传递函数可以对带有噪声信号的语音信号进行滤波,获取有效信号。例如,传递函数可以是矩阵的形式,通过将传递函数与带有噪声信号的语音信号相乘,可以滤除所述语音信号中的噪声信号。相应地,所述第三系数可以用于去除语音信号中的噪声。
可选地或附加地,处理设备110还可以基于逻辑回归算法(例如,sigmoid函数),通过平滑因子对所述先验信噪比和后验信噪比进行加权组合,获取平滑信噪比。并基于平滑信噪比确定所述单麦滤波算法对应的增益系数作为第三系数。由此,所述第三系数可以具有较好的平滑性,从而可以避免在使用单麦滤波算法进行滤波时产生较强的音乐噪声。
步骤1220,处理设备110(例如,生成模块430)可以基于所述第三系数,处理所述第一输出语音信号,获取第三输出语音信号。
在一些实施例中,处理设备110可以将所述第三系数与第一输出语音信号相乘,获取所述第三输出语音信号。例如,根据步骤1210所述,第三系数可以是基于单麦滤波算法获取的增益系数。通过将所述增益系数与第一输出语音信号相乘,可以滤除第一输出语音信号中的噪声信号。
需要注意的是,以上对于方法1200的描述,仅为描述方便,并不能把本说明书限制在所举实施例范围之内。可以理解,对于本领域的技术人员来说,在了解该方法的原理后,可以在不背离这一原理的情况下,对各个步骤进行任意组合,或者,可以增加或删减任意步骤。
图13是根据本说明书一些实施例所示的语音增强方法的示例性流程图。在一些实施例中,方法1300可以由处理设备110、处理引擎112、处理器220执行。例如,方法1300可以以程序或指令的形式存储在存储设备(例如,存储设备140或处理设备110 的存储单元)中,当处理设备110、处理引擎112、处理器220或图4所示的模块执行程序或指令时,可以实现方法1300。在一些实施例中,方法1300可以利用以下未描述的一个或以上附加操作/步骤,和/或不通过以下所讨论的一个或以上操作/步骤完成。另外,如图13所示的操作/步骤的顺序并非限制性的。如图13所示,方法1300可以包括:
步骤1310,处理设备110可以获取目标语音的第一信号和第二信号。
步骤1320,处理设备110可以基于目标语音位置、所述第一位置和所述第二位置,处理所述第一信号和所述第二信号以确定第一系数。
步骤1330,处理设备110可以基于所述第一信号和所述第二信号,确定与多个声源方向有关的多个参数。
步骤1340,处理设备110可以基于所述多个参数和所述目标语音位置,确定第二系数。
在一些实施例中,可以通过执行图5所描述的步骤510-540来执行步骤1310-1340,此处不再赘述。
步骤1350,处理设备110(例如,处理模块420)可以基于所述第一信号和所述第二信号的能量差,确定第四系数。
在一些实施例中,为了确定第四系数,处理设备110可以基于第一信号和第二信号中的无声区间,获取噪声功率谱密度。所述无声区间可以是不存在有效信号(即目标声源未发出语音)的语音信号区间。在无声区间内,由于不存在目标声源的语音,此时两个麦克风获取的第一信号和第二信号中仅含有噪声成分。在一些实施例中,处理设备110可以基于语音活动检测(Voice Activity Detection,VAD)算法确定所述第一信号和第二信号中的无声区间。在一些实施例中,处理设备110可以分别确定第一信号和第二信号中的一个或多个语音区间作为无声区间。例如,对于第一信号和第二信号中的每一个,处理设备110可以直接将该信号开始的一段(例如,200ms,300ms等)语音区间作为无声区间。进一步地,处理设备110可以基于所述无声区间获取噪声功率密度谱。在一些实施例中,当噪声信号源与双麦克风距离较远时,可以认为双麦克风所接收的噪声信号相似或相同。因此,处理设备110可以基于第一信号或第二信号中的任一个对应的无声区间获取噪声功率谱密度。在一些实施例中,处理设备110可以基于周期图算法获取所述噪声功率谱密度。可选地或附加地,处理设备110可以基于FFT变换将第一信号和/或第二信号变换到频域,从而可以在频域上基于周期图算法获取所述噪声功率谱密度。
进一步地,处理设备110可以基于所述第一信号的第一功率谱密度、所述第二信号的第二功率谱密度和所述噪声功率谱密度,获取所述能量差。在一些实施例中,处理设备110可以基于周期图算法确定所述第一信号的第一功率谱密度以及所述第二信号的第二功率谱密度。在一些实施例中,处理设备110可以基于能量差(Power Level Difference,PLD)算法获取所述能量差。在所述PLD算法中,可以假设双麦克风距离较远,因而第一信号中的有效信号与第二信号中的有效信号能量差较大,且第一信号与第二信号中的噪声信号相同或相似。由此,第一信号与第二信号的能量差可以表示为第一信号中的有效信号相关的函数。
进一步地,处理设备110可以基于所述能量差和所述噪声功率谱密度,确定所述第四系数。在一些实施例中,处理设备110可以基于PLD算法确定增益系数,并将所述增益系数确定为第四系数。
步骤1360,处理设备110(例如,生成模块430)可以基于所述第一系数、所述第二系数和所述第四系数,处理所述第一信号和/或第二信号以获取所述目标语音对应的语音增强后的第四输出语音信号。
在一些实施例中,处理设备110可以基于第四系数对第一信号和/或第二信号进行增益补偿,获取估计的有效信号。例如,所述估计的有效信号可以是第四系数与第一信号和/或第二信号的乘积。在一些实施例中,处理设备110可以基于所述第一系数、所述第二系数和所述第四系数,对基于所述第一信号和/或第二信号获取的输出信号(例如,第三信号、估计的有效信号)进行加权处理。例如,处理设备110可以基于第一系数、第二系数和第四系数分别对第三信号、第一信号和估计的有效信号进行加权处理,并根据加权处理后的信号确定第四输出语音信号。例如,第四输出语音信号可以是加权后的信号的平均值。再例如,第四输出语音信号可以是加权后的信号中较大的值。
在一些实施例中,处理设备110可以基于第一系数和第二系数,处理第一信号和/或第二信号以获取所述目标语音对应的语音增强后的第一输出语音信号,再基于第四系数对第一输出语音信号进行处理,以第四输出语音信号。
需要注意的是,以上对于方法1300的描述,仅为描述方便,并不能把本说明书限制在所举实施例范围之内。可以理解,对于本领域的技术人员来说,在了解该方法的原理后,可以在不背离这一原理的情况下,对各个步骤进行任意组合,或者,可以增加或删减任意步骤。在一些实施例中,处理设备110还可以基于所述第一信号和所述第二信号的功率差确定所述第四系数。在一些实施例中,处理设备110还可以基于所述第一 信号和所述第二信号的幅度差确定所述第四系数。
本说明书实施例可能带来的有益效果包括但不限于:(1)基于ANF算法处理目标语音信号,对目标语音信号的损害比较小,且当有效信号和噪声信号的角度差较大时,可以对噪声信号进行有效的滤波;(2)基于分布概率算法处理目标语音信号,可以在有效信号和噪声信号的角度差较小时,对目标声源附近的噪声信号进行有效的滤波;(3)采用双麦滤波与单麦滤波相结合的方式处理目标语音信号,可以有效滤除双麦滤波后的残留噪声。
上文已对基本概念做了描述,显然,对于本领域技术人员来说,上述发明披露仅仅作为示例,而并不构成对本说明书的限定。虽然此处并没有明确说明,本领域技术人员可能会对本说明书进行各种修改、改进和修正。该类修改、改进和修正在本说明书中被建议,所以该类修改、改进、修正仍属于本说明书示范实施例的精神和范围。
同时,本说明书使用了特定词语来描述本说明书的实施例。如“一个实施例”、“一实施例”和/或“一些实施例”意指与本说明书至少一个实施例相关的某一特征、结构或特点。因此,应强调并注意的是,本说明书中在不同位置两次或多次提及的“一实施例”或“一个实施例”或“一替代性实施例”并不一定是指同一实施例。此外,本说明书的一个或多个实施例中的某些特征、结构或特点可以进行适当的组合。
此外,本领域技术人员可以理解,本说明书的各方面可以通过若干具有可专利性的种类或情况进行说明和描述,包括任何新的和有用的工序、机器、产品或物质的组合或对他们的任何新的和有用的改进。相应地,本说明书的各个方面可以完全由硬件执行、可以完全由软件(包括固件、常驻软件、微码等)执行、也可以由硬件和软件组合执行。以上硬件或软件均可被称为“数据块”、“模块”、“引擎”、“单元”、“组件”或“系统”。此外,本说明书的各方面可能表现为位于一个或多个计算机可读介质中的计算机产品,该产品包括计算机可读程序编码。
此外,除非权利要求中明确说明,本说明书处理元素和序列的顺序、数字字母的使用或其他名称的使用,并非用于限定本说明书流程和方法的顺序。尽管上述披露中通过各种示例讨论了一些目前认为有用的发明实施例,但应当理解的是,该类细节仅起到说明的目的,附加的权利要求并不仅限于披露的实施例,相反,权利要求旨在覆盖所有符合本说明书实施例实质和范围的修正和等价组合。例如,虽然以上所描述的系统组件可以通过硬件设备实现,但是也可以只通过软件的解决方案得以实现,如在现有的服务器或移动设备上安装所描述的系统。
同理,应当注意的是,为了简化本说明书披露的表述,从而帮助对一个或多个发明实施例的理解,前文对本说明书实施例的描述中,有时会将多种特征归并至一个实施例、附图或对其的描述中。但是,这种披露方法并不意味着本说明书对象所需要的特征比权利要求中提及的特征多。实际上,实施例的特征要少于上述披露的单个实施例的全部特征。
一些实施例中使用了描述成分、属性数量的数字,应当理解的是,此类用于实施例描述的数字,在一些示例中使用了修饰词“大约”、“近似”或“大体上”等来修饰。除非另外说明,“大约”、“近似”或“大体上”表明数字允许有±20%的变化。相应地,在一些实施例中,说明书和权利要求中使用的数值数据均为近似值,该近似值根据个别实施例所需特点可以发生改变。在一些实施例中,数值数据应考虑规定的有效数位并采用一般位数保留的方法。尽管本说明书一些实施例中用于确认其范围广度的数值域和数据为近似值,在具体实施例中,此类数值的设定在可行范围内尽可能精确。

Claims (34)

  1. 一种语音增强方法,其特征在于,所述方法包括:
    获取目标语音的第一信号和第二信号,所述第一信号为基于第一位置采集的所述目标语音的信号,所述第二信号为基于第二位置采集的所述目标语音的信号;
    基于目标语音位置、所述第一位置和所述第二位置,处理所述第一信号和所述第二信号以确定第一系数;
    基于所述第一信号和所述第二信号,确定与多个声源方向有关的多个参数,每个参数对应从一个声源方向发出声音以形成所述第一信号和所述第二信号的概率;
    基于所述多个参数和所述目标语音位置,确定第二系数;以及
    基于所述第一系数和所述第二系数,处理所述第一信号和/或第二信号以获取所述目标语音对应的语音增强后的第一输出语音信号。
  2. 如权利要求1所述的方法,所述基于目标语音位置、所述第一位置和所述第二位置,处理所述第一信号和所述第二信号以确定第一系数,包括:
    基于所述目标语音位置、所述第一位置和所述第二位置,对所述第一信号和所述第二信号进行差分运算,获取指向第一方向的信号和指向第二方向的信号,所述指向第一方向的信号和所述指向第二方向的信号含有不同比例的有效信号;
    基于所述指向第一方向的信号和所述指向第二方向的信号,确定与所述有效信号对应的第三信号;以及
    基于所述第三信号,确定所述第一系数。
  3. 如权利要求2所述的方法,其特征在于,所述确定与所述有效信号对应的第三信号包括:
    对所述指向第一方向的信号和所述指向第二方向的信号进行自适应差分运算,确定第四信号;以及
    对第四信号中的低频成分进行增强,获取所述第三信号。
  4. 如权利要求3所述的方法,所述方法还包括:
    基于所述第四信号、所述指向第一方向的信号和所述指向第二方向的信号,更新所述自适应差分运算的自适应参数。
  5. 如权利要求1所述的方法,所述基于目标语音位置、所述第一位置和所述第二位置,处理所述第一信号和所述第二信号以确定第一系数,包括:
    基于所述目标语音位置、所述第一位置和所述第二位置,对所述第一信号和所述第二信号进行差分运算,获取指向第一方向的信号和指向第二方向的信号,所述指向第一方向的信号和所述指向第二方向的信号含有不同比例的有效信号;
    基于所述指向第一方向的信号和所述指向第二方向的信号,确定所述目标语音的估计信噪比;以及
    基于所述估计信噪比,确定所述第一系数。
  6. 如权利要求1所述的方法,所述基于所述第一信号和所述第二信号,确定与多个声源方向有关的多个参数,包括:
    基于每个声源方向、所述第一位置和所述第二位置,对所述第一信号和所述第二信号进行差分运算,确定与每个声源方向有关的参数。
  7. 如权利要求1所述的方法,其特征在于,所述基于所述多个参数和所述目标语音位置,确定第二系数,包括:
    基于所述多个参数,确定合成声源方向;以及
    基于所述合成声源方向和所述目标语音位置,确定所述第二系数。
  8. 如权利要求7所述的方法,其特征在于,所述基于所述合成声源方向和所述目标语音位置,确定所述第二系数,包括:
    判断所述目标语音位置是否位于合成声源方向;
    响应于所述目标语音位置位于所述合成声源方向,将所述第二系数设为第一数值;以及
    响应于所述目标语音位置不位于所述合成声源方向,将所述第二系数设为第二数值。
  9. 如权利要求7所述的方法,其特征在于,所述基于所述合成声源方向和所述目标语音位置,确定所述第二系数,包括:
    基于所述目标语音位置和所述合成声源方向之间的角度,通过回归函数确定所述第 二系数。
  10. 如权利要求7所述的方法,其特征在于,还包括:
    基于平滑因子,对所述第二系数进行平滑。
  11. 如权利要求1所述方法,其特征在于,所述方法还包括对所述第一信号和所述第二信号执行以下操作中的至少一个:
    对所述第一信号和所述第二信号进行分帧;
    对所述第一信号和所述第二信号进行加窗平滑;以及
    将所述第一信号和所述第二信号转换到频域。
  12. 如权利要求1所述的方法,其特征在于,所述方法进一步包括:
    确定所述第一输出语音信号中至少一个目标子带信号;以及
    基于单麦滤波算法,处理所述至少一个目标子带信号,获取第二输出语音信号。
  13. 如权利要求12所述的方法,其特征在于,所述确定所述第一输出语音信号中至少一个目标子带信号,包括:
    基于所述第一输出语音信号,获取多个子带信号;
    计算每一个所述子带信号的信噪比;以及
    基于每一个所述子带信号的信噪比,确定所述目标子带信号。
  14. 如权利要求1所述的方法,其特征在于,所述方法进一步包括:
    基于单麦滤波算法处理所述第一信号和/或所述第二信号,确定第三系数;以及
    基于所述第三系数,处理所述第一输出语音信号,获取第三输出语音信号。
  15. 如权利要求11所述的方法,其特征在于,所述方法还包括:
    基于所述第一信号和所述第二信号的能量差,确定第四系数;以及
    基于所述第一系数、所述第二系数和所述第四系数,处理所述第一信号和/或第二信号以获取所述目标语音对应的语音增强后的第四输出语音信号。
  16. 如权利要求15所述的方法,其特征在于,所述基于所述第一信号和所述第二信号的能量差,确定第四系数,包括:
    基于所述第一信号和所述第二信号中的无声区间,获取噪声功率谱密度;
    基于所述第一信号的第一功率谱密度、所述第二信号的第二功率谱密度和所述噪声功率谱密度,获取所述能量差;以及
    基于所述能量差和所述噪声功率谱密度,确定所述第四系数。
  17. 一种语音增强系统,其特征在于,所述系统包括:
    包括一组指令的至少一个存储介质;以及
    与至少一个存储介质通信的至少一个处理器,其中,当执行所述一组指令时,所述至少一个处理器使所述系统:
    获取目标语音的第一信号和第二信号,所述第一信号为基于第一位置采集的所述目标语音的信号,所述第二信号为基于第二位置采集的所述目标语音的信号;
    基于目标语音位置、所述第一位置和所述第二位置,处理所述第一信号和所述第二信号以确定第一系数;
    基于所述第一信号和所述第二信号,确定与多个声源方向有关的多个参数,每个参数对应从一个声源方向发出声音以形成所述第一信号和所述第二信号的概率;
    基于所述多个参数和所述目标语音位置,确定第二系数;以及
    基于所述第一系数和所述第二系数,处理所述第一信号和/或第二信号以获取所述目标语音对应的语音增强后的第一输出语音信号。
  18. 如权利要求17所述的系统,其特征在于,为了基于目标语音位置、所述第一位置和所述第二位置,处理所述第一信号和所述第二信号以确定第一系数,所述至少一个处理器使所述系统:
    基于所述目标语音位置、所述第一位置和所述第二位置,对所述第一信号和所述第二信号进行差分运算,获取指向第一方向的信号和指向第二方向的信号,所述指向第一方向的信号和所述指向第二方向的信号含有不同比例的有效信号;
    基于所述指向第一方向的信号和所述指向第二方向的信号,确定与所述有效信号对应的第三信号;以及
    基于所述第三信号,确定所述第一系数。
  19. 如权利要求18所述的系统,其特征在于,为了确定与所述有效信号对应的第三信号,所述至少一个处理器使所述系统:
    对所述指向第一方向的信号和所述指向第二方向的信号进行自适应差分运算,确定第四信号;以及
    对第四信号中的低频成分进行增强,获取所述第三信号。
  20. 如权利要求19所述的系统,其特征在于,所述至少一个处理器进一步使所述系统:
    基于所述第四信号、所述指向第一方向的信号和所述指向第二方向的信号,更新所述自适应差分运算的自适应参数。
  21. 如权利要求17所述的系统,其特征在于,为了基于目标语音位置、所述第一位置和所述第二位置,处理所述第一信号和所述第二信号以确定第一系数,所述至少一个处理器使所述系统:
    基于所述目标语音位置、所述第一位置和所述第二位置,对所述第一信号和所述第二信号进行差分运算,获取指向第一方向的信号和指向第二方向的信号,所述指向第一方向的信号和所述指向第二方向的信号含有不同比例的有效信号;
    基于所述指向第一方向的信号和所述指向第二方向的信号,确定所述目标语音的估计信噪比;以及
    基于所述估计信噪比,确定所述第一系数。
  22. 如权利要求17所述的系统,其特征在于,为了基于所述第一信号和所述第二信号,确定与多个声源方向有关的多个参数,所述至少一个处理器使所述系统:
    基于每个声源方向、所述第一位置和所述第二位置,对所述第一信号和所述第二信号进行差分运算,确定与每个声源方向有关的参数。
  23. 如权利要求17所述的系统,其特征在于,为了基于所述多个参数和所述目标语音位置,确定第二系数,所述至少一个处理器使所述系统:
    基于所述多个参数,确定合成声源方向;以及
    基于所述合成声源方向和所述目标语音位置,确定所述第二系数。
  24. 如权利要求23所述的系统,其特征在于,为了基于所述合成声源方向和所述目标语音位置,确定所述第二系数,所述至少一个处理器使所述系统:
    判断所述目标语音位置是否位于合成声源方向;
    响应于所述目标语音位置位于所述合成声源方向,将所述第二系数设为第一数值;以及
    响应于所述目标语音位置不位于所述合成声源方向,将所述第二系数设为第二数值。
  25. 如权利要求23所述的系统,其特征在于,为了基于所述合成声源方向和所述目标语音位置,确定所述第二系数,所述至少一个处理器使所述系统:
    基于所述目标语音位置和所述合成声源方向之间的角度,通过回归函数确定所述第二系数。
  26. 如权利要求23所述的系统,其特征在于,所述至少一个处理器进一步使所述系统:
    基于平滑因子,对所述第二系数进行平滑。
  27. 如权利要求17所述系统,其特征在于,所述至少一个处理器进一步使所述系统对所述第一信号和所述第二信号执行以下操作中的至少一个:
    对所述第一信号和所述第二信号进行分帧;
    对所述第一信号和所述第二信号进行加窗平滑;以及
    将所述第一信号和所述第二信号转换到频域。
  28. 如权利要求17所述的系统,其特征在于,所述至少一个处理器进一步使所述系统:
    确定所述第一输出语音信号中至少一个目标子带信号;以及
    基于单麦滤波算法,处理所述至少一个目标子带信号,获取第二输出语音信号。
  29. 如权利要求28所述的系统,其特征在于,为了确定所述第一输出语音信号中至少一个目标子带信号,所述至少一个处理器使所述系统:
    基于所述第一输出语音信号,获取多个子带信号;
    计算每一个所述子带信号的信噪比;以及
    基于每一个所述子带信号的信噪比,确定所述目标子带信号。
  30. 如权利要求17所述的系统,其特征在于,所述至少一个处理器进一步使所述系统:
    基于单麦滤波算法处理所述第一信号和/或所述第二信号,确定第三系数;以及
    基于所述第三系数,处理所述第一输出语音信号,获取第三输出语音信号。
  31. 如权利要求27所述的系统,其特征在于,所述至少一个处理器进一步使所述系统:
    基于所述第一信号和所述第二信号的能量差,确定第四系数;以及
    基于所述第一系数、所述第二系数和所述第四系数,处理所述第一信号和/或第二信号以获取所述目标语音对应的语音增强后的第四输出语音信号。
  32. 如权利要求31所述的系统,其特征在于,为了基于所述第一信号和所述第二信号的能量差,确定第四系数,所述至少一个处理器使所述系统:
    基于所述第一信号和所述第二信号中的无声区间,获取噪声功率谱密度;
    基于所述第一信号的第一功率谱密度、所述第二信号的第二功率谱密度和所述噪声功率谱密度,获取所述能量差;以及
    基于所述能量差和所述噪声功率谱密度,确定所述第四系数。
  33. 一种语音增强系统,包括:
    获取模块,用于获取目标语音的第一信号和第二信号,所述第一信号为基于第一位置采集的所述目标语音的信号,所述第二信号为基于第二位置采集的所述目标语音的信号;
    处理模块,用于
    基于目标语音位置、所述第一位置和所述第二位置,处理所述第一信号和所 述第二信号以确定第一系数;
    基于所述第一信号和所述第二信号,确定与多个声源方向有关的多个参数,每个参数对应从一个声源方向发出声音以形成所述第一信号和所述第二信号的概率;以及
    基于所述多个参数和所述目标语音位置,确定第二系数;以及
    生成模块,用于基于所述第一系数和所述第二系数,处理所述第一信号和/或第二信号以获取所述目标语音对应的语音增强后的第一输出语音信号。
  34. 一种非暂时性计算机可读介质,包括可执行指令,当由至少一个处理器执行时,所述可执行指令使所述至少一个处理器执行权利要求1-16中任一项所述的方法。
PCT/CN2021/096375 2021-05-27 2021-05-27 一种语音增强方法和系统 WO2022246737A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202180088314.XA CN116724352A (zh) 2021-05-27 2021-05-27 一种语音增强方法和系统
PCT/CN2021/096375 WO2022246737A1 (zh) 2021-05-27 2021-05-27 一种语音增强方法和系统
US18/354,715 US20230360664A1 (en) 2021-05-27 2023-07-19 Methods and systems for voice enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/096375 WO2022246737A1 (zh) 2021-05-27 2021-05-27 一种语音增强方法和系统

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/354,715 Continuation US20230360664A1 (en) 2021-05-27 2023-07-19 Methods and systems for voice enhancement

Publications (1)

Publication Number Publication Date
WO2022246737A1 true WO2022246737A1 (zh) 2022-12-01

Family

ID=84229394

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096375 WO2022246737A1 (zh) 2021-05-27 2021-05-27 一种语音增强方法和系统

Country Status (3)

Country Link
US (1) US20230360664A1 (zh)
CN (1) CN116724352A (zh)
WO (1) WO2022246737A1 (zh)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510426A (zh) * 2009-03-23 2009-08-19 北京中星微电子有限公司 一种噪声消除方法及系统
CN102509552A (zh) * 2011-10-21 2012-06-20 浙江大学 一种基于联合抑制的麦克风阵列语音增强方法
CN110856072A (zh) * 2019-12-04 2020-02-28 北京声加科技有限公司 一种耳机通话降噪方法及耳机
CN111063366A (zh) * 2019-12-26 2020-04-24 紫光展锐(重庆)科技有限公司 降低噪声的方法、装置、电子设备及可读存储介质
CN112116918A (zh) * 2020-09-27 2020-12-22 北京声加科技有限公司 语音信号增强处理方法和耳机
CN112802486A (zh) * 2020-12-29 2021-05-14 紫光展锐(重庆)科技有限公司 一种噪声抑制方法、装置及电子设备

Also Published As

Publication number Publication date
CN116724352A (zh) 2023-09-08
US20230360664A1 (en) 2023-11-09

Similar Documents

Publication Publication Date Title
CN111418010B (zh) 一种多麦克风降噪方法、装置及终端设备
CN107464564B (zh) 语音交互方法、装置及设备
US20220230651A1 (en) Voice signal dereverberation processing method and apparatus, computer device and storage medium
US11048472B2 (en) Dynamically adjustable sound parameters
EP3526979B1 (en) Method and apparatus for output signal equalization between microphones
WO2016078369A1 (zh) 移动终端通话语音降噪方法及装置、存储介质
US9489963B2 (en) Correlation-based two microphone algorithm for noise reduction in reverberation
WO2019100500A1 (zh) 语音信号降噪方法及设备
US11069366B2 (en) Method and device for evaluating performance of speech enhancement algorithm, and computer-readable storage medium
CN106663445A (zh) 声音处理装置、声音处理方法及程序
WO2017152601A1 (zh) 一种麦克风确定方法和终端
EP4254979A1 (en) Active noise reduction method, device and system
TWI818493B (zh) 語音增強方法、系統和裝置
CN114898762A (zh) 基于目标人的实时语音降噪方法、装置和电子设备
CN114203163A (zh) 音频信号处理方法及装置
WO2022256577A1 (en) A method of speech enhancement and a mobile computing device implementing the method
CN113963716A (zh) 通话式门铃的音量均衡方法、装置、设备和可读存储介质
WO2022105571A1 (zh) 语音增强方法、装置、设备及计算机可读存储介质
US11164591B2 (en) Speech enhancement method and apparatus
WO2024041512A1 (zh) 音频降噪方法、装置、电子设备及可读存储介质
WO2022246737A1 (zh) 一种语音增强方法和系统
TW201312551A (zh) 語音增強方法
CN112562717B (zh) 啸叫检测方法、装置、存储介质、计算机设备
CN114363753A (zh) 耳机的降噪方法、装置、耳机及存储介质
CN115410590A (zh) 一种语音增强方法和系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21942315

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180088314.X

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE