CN117392994B — Audio signal processing method, device, equipment and storage medium

- Publication number: CN117392994B (application CN202311697438.5A)
- Authority: CN (China)
- Prior art keywords: noise, masking, signal, frequency points, sound source
- Legal status: Active
Classifications
- G10L21/0216 — Noise filtering characterised by the method used for estimating noise
- G10L21/0232 — Noise filtering: processing in the frequency domain
- G10L21/0316 — Speech enhancement, e.g. noise reduction or echo cancellation, by changing the amplitude
- G10L2021/02165 — Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal
- H04R1/1083 — Earpieces; earphones: reduction of ambient noise
- H04R2499/13 — Acoustic transducers and sound field adaptation in vehicles
Abstract
The application provides an audio signal processing method, device, equipment and storage medium. The method comprises the following steps: acquiring a sound source signal to be output and an audio signal collected from the environment, wherein the audio signal comprises an environmental sound source signal and environmental noise; determining noise estimation values corresponding to a plurality of frequency points in the audio signal; determining noise masking values corresponding to the plurality of frequency points; determining, based on the noise estimation values and the noise masking values corresponding to the plurality of frequency points, masking intensity values of the environmental noise with respect to the sound source signal to be output at the plurality of frequency points; determining suppression gain values corresponding to the frequency points according to the masking intensity values; and generating a first noise reduction signal from the suppression gain values of the plurality of frequency points and the sound source signal to be output. The method adjusts the suppression gain values in combination with the auditory masking effect of the real-time playing environment of the sound source signal to be output, improving the clarity and intelligibility of the finally played signal.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an audio signal processing method, apparatus, device, and storage medium.
Background
In the prior art, noise reduction algorithms mainly analyze and suppress the noise types and noise components of a sound source signal. In daily applications such as audio calls, however, the two parties are in different acoustic environments and converse through call terminals and a data transmission network. When a user's acoustic environment is noisy (for example, the din of a subway carriage, crowd noise in a supermarket, road traffic, or outdoor rain), the user may be unable to hear the other party clearly. These sounds have a masking effect: a louder sound masks a quieter one. In the frequency domain this appears as follows: after the other party's voice is played through the speaker or earphone, the voice components in individual frequency bands are completely masked by the environmental noise components, so the local listener cannot hear the other party clearly. The masking effect also manifests as mutual masking between adjacent frequency bands; for example, a low-frequency sound can mask a high-frequency sound if its energy is strong enough, and vice versa.
The prior art lacks efficient noise reduction schemes for complex environments.
Disclosure of Invention
The embodiments of the present application provide an audio signal processing method, an audio signal processing device, an electronic device, a computer program product and a computer readable storage medium, which can adjust the suppression gain value corresponding to each frequency point in combination with the sound masking effect of the actual playing environment of an audio signal, improving the clarity and intelligibility of the finally played audio signal.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides an audio signal processing method, which comprises the following steps:
acquiring a sound source signal to be output, and acquiring an audio signal collected from the environment, wherein the audio signal comprises an environmental sound source signal and environmental noise;
determining noise estimation values respectively corresponding to a plurality of frequency points in the audio signal;
determining noise masking values corresponding to the plurality of frequency points respectively;
determining, based on the noise estimation values and the noise masking values respectively corresponding to the plurality of frequency points, masking intensity values of the environmental noise with respect to the sound source signal to be output at the plurality of frequency points;
determining suppression gain values corresponding to the frequency points according to the masking intensity values corresponding to the frequency points;
and generating a first noise reduction signal through the suppression gain values of the plurality of frequency points and the sound source signal to be output.
An embodiment of the present application provides an audio signal processing apparatus, including:
the data acquisition module is used for acquiring a sound source signal to be output and acquiring an audio signal acquired from the environment, wherein the audio signal comprises an environment sound source signal and environment noise;
the data processing module is used for determining noise estimation values corresponding to a plurality of frequency points in the audio signal respectively;
the data processing module is further used for determining noise masking values corresponding to the plurality of frequency points respectively;
the data processing module is further configured to determine, based on the noise estimation values and the noise masking values respectively corresponding to the plurality of frequency points, masking intensity values of the environmental noise with respect to the sound source signal to be output at the plurality of frequency points;
the data processing module is further used for determining suppression gain values corresponding to the plurality of frequency points according to the masking intensity values corresponding to the plurality of frequency points respectively;
and the generation processing module is used for generating a first noise reduction signal through the suppression gain values of the plurality of frequency points and the sound source signal to be output.
An embodiment of the present application provides an electronic device, including:
a memory for storing computer executable instructions;
and the processor is used for realizing the audio signal processing method provided by the embodiment of the application when executing the computer executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium storing a computer program or computer executable instructions for implementing the audio signal processing method provided by the embodiment of the application when being executed by a processor.
Embodiments of the present application provide a computer program product, including a computer program or computer executable instructions, which when executed by a processor, implement the audio signal processing method provided in the embodiments of the present application.
The embodiment of the application has the following beneficial effects:
In combination with the auditory masking effect in the real-time playing environment of the sound source signal to be output, the suppression gain value corresponding to each frequency point is adjusted in a targeted manner according to the degree to which the environmental noise masks that frequency point (represented by the masking intensity value), and the energy at each frequency point of the sound source signal to be output is then attenuated using the suppression gain value. This reduces the masking influence of strong environmental noise, achieves targeted noise reduction for frequency points of different masking intensity, and thus improves the clarity and intelligibility of the finally played sound source signal.
Drawings
Fig. 1A is a schematic diagram of a first architecture of an audio signal processing system architecture according to an embodiment of the present application;
fig. 1B is a second structural schematic diagram of an audio signal processing system architecture according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a terminal provided in an embodiment of the present application;
fig. 3A is a schematic structural diagram of an earphone according to an embodiment of the present application;
FIG. 3B is a schematic illustration of a first vehicle environment provided by an embodiment of the present application;
FIG. 3C is a schematic illustration of a second vehicle-mounted environment provided by an embodiment of the present application;
fig. 4A is a schematic flow chart of a first process of an audio signal processing method according to an embodiment of the present disclosure;
fig. 4B is a schematic diagram of a second flow of the audio signal processing method according to the embodiment of the present application;
fig. 4C is a third flow chart of an audio signal processing method according to an embodiment of the present disclosure;
fig. 4D is a fourth flowchart of an audio signal processing method according to an embodiment of the present application;
fig. 4E is a fifth flowchart of an audio signal processing method according to an embodiment of the present application;
fig. 4F is a sixth flowchart of an audio signal processing method according to an embodiment of the present application;
fig. 4G is a seventh flowchart of an audio signal processing method according to an embodiment of the present application;
Fig. 4H is an eighth flowchart of an audio signal processing method according to an embodiment of the present application;
fig. 4I is a ninth flowchart of an audio signal processing method according to an embodiment of the present application;
fig. 5 is a schematic flow chart of audio signal processing in a voice call according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function, and works together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
Unless defined otherwise, all technical and scientific terms used in the embodiments of the present application have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the embodiments of the application is for the purpose of describing the embodiments of the application only and is not intended to be limiting of the application.
Before further describing the embodiments of the present application in detail, the terms involved in the embodiments of the present application are explained; the following interpretations apply throughout.
1) Speech noise reduction is a signal processing technique that is capable of recognizing and removing noise from speech signals. Common speech noise reduction algorithms include frequency domain based algorithms (e.g., spectral subtraction, minimum mean square error algorithms, etc.), time domain based algorithms (e.g., wiener filters, kalman filters, etc.), subband based algorithms (e.g., wavelet transform noise reduction algorithms, etc.), and the like. The choice of these algorithms depends on factors such as the type of noise, the relative strengths of the noise and the speech signal.
2) Auditory masking effect: the perception of one auditory signal (e.g., a sound) is masked by another stronger or more prominent auditory signal, so that the weaker signal is perceptually ignored or becomes less distinct. Masking may occur across frequency, time or space, causing the perception of certain sounds to be disturbed or suppressed. For example, when a mobile terminal is used for audio or video calls, environmental noise such as the din of a subway carriage, crowd noise in a supermarket, road traffic or outdoor rain has a masking effect: the louder sound masks the quieter one. In the frequency domain, after the other party's voice is played through the speaker or earphone, the voice components in individual frequency bands are completely masked by environmental noise components, so the local listener cannot hear the other party clearly. The masking effect also manifests as mutual masking between adjacent frequency bands; for example, a low-frequency sound can mask a high-frequency sound if its energy is strong enough, and vice versa.
3) The sound source signal refers to the sound of the source that produces the speech, that is, the sound signal generated by the speaker's oral cavity or vocal cords. It may be, for example, the analog electrical signal of a speaker's voice captured by a microphone during a voice call, or the digital audio signal of that voice after digitization.
4) The suppression gain value is a parameter controlling the degree to which a voice noise reduction method suppresses noise. It determines how much of the noise component the noise reduction algorithm removes when processing a voice signal, and is typically expressed as a value between 0 and 1.
5) The frequency point is a representation of carrier frequency used in radio communication and gives the numerical value of a specific frequency directly. The conversion formula between frequency and frequency point is:

$$ f = k \cdot \Delta f + f_{\text{low}} $$

where $k$ is the frequency-point index, $\Delta f$ is the selected bandwidth, i.e. the signal bandwidth used in communication, whose size determines the amount of information a channel can carry, and $f_{\text{low}}$ is the lower-limit frequency. The carrier frequency can be controlled more accurately through the frequency point.
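As an illustration of this conversion, the following minimal sketch maps a frequency-point index to its frequency; the function name and parameter values are hypothetical, since the patent does not fix a concrete bandwidth:

```python
def bin_to_freq(k: int, bandwidth_hz: float, f_low_hz: float = 0.0) -> float:
    """Frequency = frequency point x selected bandwidth + lower-limit frequency."""
    return k * bandwidth_hz + f_low_hz

# For example, a 16 kHz signal analysed with a 512-point FFT gives
# 16000 / 512 = 31.25 Hz per frequency point, so point 100 lies at:
print(bin_to_freq(100, 31.25))  # 3125.0 Hz
```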
Prior-art noise reduction algorithms mainly fall into two classes: algorithms based on traditional filtering noise reduction models, and algorithms based on deep learning networks. The former are mainly frequency-domain algorithms (e.g., spectral subtraction, minimum mean square error), time-domain algorithms (e.g., Wiener filters, Kalman filters) and subband algorithms (e.g., wavelet transform noise reduction). These algorithms identify the noise type of the current frame and the component proportions of noise and speech through statistical methods, thereby suppressing the noise component of the noisy signal. Deep-learning-based voice noise reduction models a noisy speech signal with a deep neural network (Deep Neural Networks, DNN) or convolutional neural network (Convolutional Neural Networks, CNN) to generate a cleaner speech signal. Such an algorithm typically adopts supervised learning, using noisy speech signals as input and the corresponding clean speech signals as output to train the network; during training, the network learns the mapping between input and output and thus adaptively learns how to remove noise. However, denoising methods that only analyze and suppress the noise type and noise components of the sound source signal ignore the influence of the acoustic masking effect during daily use.
In the prior art, for the problem that the surrounding noise environment prevents a listener from hearing the source sound due to the acoustic masking effect, some schemes make improvements through noise estimation and equalization of the source signal, such as the adaptive voice quality enhancement scheme (Adaptive Voice Quality Enhancement, AVQ). The sound source signal acquired by the microphone undergoes echo cancellation and noise estimation to obtain an energy estimate for each frequency band of the surrounding noise; whether the sound source signal is masked is judged from the energy of each noise band, and if so, the energy of each band of the sound source signal is boosted by equalization (EQ) so that the signal is no longer masked by the surrounding noise. Although AVQ enhances the sound source signal through equalization to escape masking by environmental noise, if the environmental noise is strong the sound source signal must be amplified by a large factor to escape masking completely; this easily amplifies the signal into clipping and serious distortion. AVQ therefore alleviates part of the peripheral-noise masking problem, but its disadvantages are also obvious.
In order to solve the above problems, embodiments of the present application provide an audio signal processing method, apparatus, electronic device, computer readable storage medium and computer program product that take the auditory masking effect into consideration: the suppression gain values corresponding to the frequency points of the sound source signal are adjusted in combination with the auditory masking effect in the actual playing environment, and the energy at each frequency point is then attenuated by the suppression gain values. This reduces the masking effect of strong environmental noise on the sound source signal, thereby improving the clarity and intelligibility of the finally played sound source signal.
The electronic device provided in the embodiments of the present application may be implemented as various types of user terminals, such as a notebook computer, a tablet computer, a desktop computer, a set-top box, a mobile device (for example, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, or a portable game device), a smart phone, a smart speaker, a smart voice interaction device, a smart home appliance, a smart watch, a smart television, a vehicle-mounted terminal, a smart headset, or an aircraft. In the following, an exemplary application in which the electronic device is implemented as a terminal will be described.
Referring to fig. 1A, fig. 1A is a schematic diagram of a first architecture of an audio signal processing system architecture provided in an embodiment of the present application, and in an example, fig. 1A relates to a server 100, a terminal device 200, and a network 300. The terminal device 200 is connected to the server 100 through a network 300, wherein the network 300 may be a wide area network or a local area network, or a combination of both.
In some embodiments, the terminal device or the server may implement the audio signal processing method provided in the embodiments of the present application by running various computer executable instructions or computer programs. For example, the computer-executable instructions may be commands at the micro-program level, machine instructions, or software instructions. The computer program may be a native program or a software module in an operating system; a Native Application (APP), i.e. a program that needs to be installed in an operating system to run, such as an instant messaging client; or an applet embedded in any APP, i.e. a program that can be run only by downloading into a browser environment. In general, the computer-executable instructions may be any form of instructions and the computer program may be any form of application, module, or plug-in.
In some embodiments, the server 100 may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDNs), and basic cloud computing services such as big data and artificial intelligence platforms, where the cloud services may be interaction processing services for a terminal to call.
In some embodiments, multiple servers may be organized into a blockchain network, and server 100 may be a node on the blockchain network, where there may be an information connection between each node in the blockchain network, and where information may be transferred between the nodes via the information connection. The data related to the audio signal processing method provided in the embodiment of the present application may be stored in a blockchain.
In some embodiments, the audio signal processing system provided in the embodiments of the present application may be implemented cooperatively by a server and a terminal. Taking a voice call as an example, referring to fig. 1A, the terminal device 200-1 performs voice or video communication with the terminal device 200-2 through an instant messaging client. A sensor device of the terminal device 200-1 (for example, its built-in microphone) collects the sound source signal to be output, which carries the effective sound local to the terminal device 200-1, that is, the sound the far end needs to hear (for example, the voice of the user of the terminal device 200-1). The sound source signal to be output is transmitted to the terminal device 200-2 through the background server 100 of the instant messaging client. In response to receiving it, the terminal device 200-2 collects the audio signal in its environment through its own sensor device; the audio signal comprises an environmental sound source signal and environmental noise, where the environmental sound source signal is the effective sound in the environment of the terminal device 200-2 (that is, the sound the user of the terminal device 200-1 needs to hear, for example the voice of the user of the terminal device 200-2) and the environmental noise is the noise in that environment. Next, the masking intensity values of the environmental noise with respect to the sound source signal to be output are obtained by the audio signal processing method provided in the embodiments of the present application, suppression gain values are determined from the masking intensity values, and a first noise reduction signal is obtained from the suppression gain values and the sound source signal to be output. The first noise reduction signal then undergoes conventional automatic gain control (Auto Gain Control, AGC) and digital-to-analog conversion in the terminal device 200-2 and is output through a speaker, so that the sound source signal to be output sent by the terminal device 200-1 is more easily understood by the listener at the terminal device 200-2.
In other embodiments, referring to fig. 1B, fig. 1B is a second structural schematic diagram of an architecture of the audio signal processing system provided in the embodiment of the present application. The terminal device 200-1 performs voice or video communication with the terminal device 200-2 through an instant messaging client, and collects the sound source signal to be output through the microphone of the connected external earphone 300-1. The sound source signal to be output is transmitted to the terminal device 200-2 via the background server 100 of the instant messaging client. In response to receiving it, the terminal device 200-2 collects the audio signal in its environment through the microphone of the external earphone 300-2 connected to it; the audio signal comprises an environmental sound source signal and environmental noise, where the environmental sound source signal is the effective sound in the environment of the terminal device 200-2 (that is, the sound the user of the terminal device 200-1 needs to hear) and the environmental noise is the noise in that environment. Then the masking intensity values of the environmental noise with respect to the sound source signal to be output are obtained by the audio signal processing method provided in the embodiments of the present application, suppression gain values are determined from the masking intensity values, and a first noise reduction signal is obtained from the suppression gain values and the sound source signal to be output. After conventional automatic gain control (Auto Gain Control, AGC) and digital-to-analog conversion, the signal is output through the earphone, so that the sound source signal to be output from the terminal device 200-1 is more easily understood by the listener at the terminal device 200-2.
In some embodiments, taking an in-vehicle call scenario as an example, the audio signal processing system provided in the embodiments of the present application may be implemented by a server and a terminal in cooperation. Referring to fig. 3B and fig. 3C, fig. 3B is a schematic view of a first in-vehicle environment and fig. 3C a schematic view of a second in-vehicle environment provided by embodiments of the present application. In-vehicle communication is performed through the microphone 320 (which may integrate a speaker) in the in-vehicle environment shown in fig. 3B, or through the microphone 330 (which may integrate a speaker) in the in-vehicle environment shown in fig. 3C. The terminal device 200-1 performs voice or video communication with the terminal device 200-2 through an in-vehicle communication client or another instant messaging client, and transmits the sound source signal to be output to the terminal device 200-2 via the background server 100 of that client. In response to receiving it, the terminal device 200-2 collects the audio signal in the in-vehicle environment through the vehicle-mounted microphone 320 shown in fig. 3B or the microphone 330 shown in fig. 3C; the audio signal comprises an environmental sound source signal (the effective sound in the environment of the terminal device 200-2, that is, the sound the user of the terminal device 200-1 needs to hear, for example the voice of the user of the terminal device 200-2) and environmental noise (the noise in that environment). Next, the masking intensity values of the environmental noise with respect to the sound source signal to be output are obtained by the audio signal processing method provided in the embodiments of the present application, suppression gain values are determined from the masking intensity values, and a first noise reduction signal is obtained from the suppression gain values and the sound source signal to be output. The first noise reduction signal then undergoes conventional automatic gain control (Auto Gain Control, AGC) and digital-to-analog conversion in the terminal device 200-2 and is output through a speaker, so that the sound source signal to be output sent by the terminal device 200-1 is more easily understood by the listener at the terminal device 200-2.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a terminal device provided in an embodiment of the present application, and a terminal device 200 shown in fig. 2 includes: at least one processor 210, a memory 250, at least one network interface 220, and a user interface 230. The various components in terminal device 200 are coupled together by bus system 240. It is understood that the bus system 240 is used to enable connected communications between these components. The bus system 240 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled as bus system 240 in fig. 2.
The processor 210 may be an integrated circuit chip with signal processing capabilities such as a general purpose processor, such as a microprocessor or any conventional processor, a digital signal processor (Digital Signal Processor, DSP), or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like.
The user interface 230 includes one or more output devices 231, including one or more speakers and/or one or more visual displays, that enable presentation of media content. The user interface 230 also includes one or more input devices 232, including user interface components that facilitate user input, such as a keyboard, mouse, microphone (e.g., a feedforward microphone in an external headset 310 shown in FIG. 3A or an internal microphone of a terminal device, etc.), touch screen display, camera, other input buttons and controls.
The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 250 optionally includes one or more storage devices physically located remote from processor 210.
Memory 250 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The non-volatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a random access Memory (Random Access Memory, RAM). The memory 250 described in embodiments of the present application is intended to comprise any suitable type of memory.
In some embodiments, memory 250 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 251 including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
a network communication module 252 for reaching other electronic devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 including: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), etc.;
A presentation module 253 for enabling presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 231 (e.g., a display screen, speakers, etc.) associated with the user interface 230;
an input processing module 254 for detecting one or more user inputs or interactions from one of the one or more input devices 232 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided in the embodiments of the present application may be implemented in software, and fig. 2 shows an audio signal processing apparatus 255 stored in a memory 250, which may be software in the form of a program, a plug-in, or the like, including the following software modules: the data acquisition module 2551, the data processing module 2552 and the generation module 2553 are logical, and thus may be arbitrarily combined or further split according to the implemented functions. The functions of the respective modules will be described hereinafter.
In other embodiments, the apparatus provided by the embodiments of the present application may be implemented in hardware, and by way of example, the apparatus provided by the embodiments of the present application may be a processor in the form of a hardware decoding processor that is programmed to perform the audio signal processing method provided by the embodiments of the present application, e.g., the processor in the form of a hardware decoding processor may employ one or more application specific integrated circuits (Application Specific Integrated Circuit, ASIC), digital signal processors (Digital Signal Processor, DSP), programmable logic devices (Programmable Logic Device, PLD), complex programmable logic devices (Complex Programmable Logic Device, CPLD), field programmable gate arrays (Field-Programmable Gate Array, FPGA), or other electronic components.
The audio signal processing method provided by the embodiment of the present application will be described below with reference to an exemplary application and implementation of the terminal provided by the embodiment of the present application, and with the terminal as an execution body. Referring to fig. 4A, fig. 4A is a schematic flow chart of a first procedure of the audio signal processing method according to the embodiment of the present application, which may be executed by the terminal device described above, and will be described with reference to the steps shown in fig. 4A.
In step 101, a sound source signal to be output is acquired, and an audio signal acquired from the environment is acquired, wherein the audio signal includes an ambient sound source signal and ambient noise.
Taking a scenario in which a user performs a voice or video call through a mobile terminal as an example, referring to fig. 1A, user 1 uses the terminal device 200-1 to perform voice or video communication with user 2, the holder of the terminal device 200-2, through an instant messaging client. A sensor device of the terminal device 200-1 (for example, its built-in microphone) collects the sound source signal to be output, which is transmitted to the terminal device 200-2 through the background server 100 of the instant messaging client. In response to receiving it, the terminal device 200-2 collects the audio signal in its environment through its own sensor device; the audio signal includes the environmental sound source signal and environmental noise, where the environmental sound source signal is, for example, the sound source signal of user 2 collected by the microphone of the terminal device 200-2 during the voice call, and includes the voice information of user 2.
Taking a scenario in which a user wears an earphone for a voice or video call as an example, referring to fig. 1B, user 1 uses the terminal device 200-1 to perform voice or video communication with user 2, the holder of the terminal device 200-2, through an instant messaging client. The terminal device 200-1 collects the sound source signal to be output through the microphone of the connected external earphone 300-1. Referring to fig. 3A, fig. 3A is a schematic structural diagram of the earphone provided in the embodiment of the present application: the feedforward microphone captures the external environmental noise in the environment; the feedback microphone captures the residual ambient noise around the ear that the feedforward microphone does not capture; and the talk microphone is used for voice input during a call. The terminal device 200-1 then transmits the sound source signal to the terminal device 200-2 via the background server 100 of the instant messaging client, and the terminal device 200-2, in response to receiving it, collects the audio signal in its environment through the microphones of the external earphone 300-2 connected to it (e.g., the feedforward and feedback microphones of the earphone 310 shown in fig. 3A).
Taking a voice or video call performed by a user in a vehicle-mounted scenario as an example, referring to fig. 3B and fig. 3C, in-vehicle communication may be performed through the microphone 320 (which may integrate a speaker) in the in-vehicle environment shown in fig. 3B or through the microphone 330 (which may integrate a speaker) in the in-vehicle environment shown in fig. 3C, and the vehicle-mounted terminal may collect the audio signal through the vehicle-mounted microphone.
The sound source signal to be output may itself include noise components, such as electromagnetic interference from communication devices, wiring and power supplies during a voice call, mechanical noise from microphones or earphones, or distortion and noise introduced when a compression algorithm compresses the voice during the call.
In step 102, noise estimation values corresponding to a plurality of frequency points in the audio signal are determined.
In some embodiments, referring to fig. 4B, step 102 shown in fig. 4A may be implemented by the following steps 1021 through 1027, which are specifically described below.
In step 1021, the audio signal is divided into a plurality of audio frames.
In some embodiments, the audio signal is first subjected to silence removal (e.g., silence in the audio signal is removed by an endpoint detection method), the signal is then divided into fixed-length audio frames by a framing operation (e.g., 20 ms per frame), and a window function (e.g., a Hamming or Hanning window) is applied to each audio frame to reduce abrupt changes at the frame boundaries. The Hamming window function is:

$$ w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1 \tag{1} $$

where $N$ is the length of the window function (the same as the length of an audio frame) and $w(n)$ is the value of the window function at index $n$; the same notation is used below.
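A minimal Python sketch of the framing and windowing step described above; the function name and the 20 ms / 10 ms frame parameters are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

def frame_and_window(x: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Split a mono signal into overlapping frames and apply a Hamming window."""
    n = np.arange(frame_len)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))  # equation (1)
    num_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(num_frames)])
    return frames * w  # windowed frames, shape (num_frames, frame_len)

# e.g., 20 ms frames with a 10 ms hop at a 16 kHz sampling rate:
# frames = frame_and_window(signal, frame_len=320, hop=160)
```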
In step 1022, the plurality of audio frames are converted from the time domain to the frequency domain, resulting in frequency domain representations corresponding to the plurality of frequency points in the audio signal, respectively.
In some embodiments, the audio frames are Fourier transformed to obtain frequency domain representations corresponding to the plurality of frequency points in the audio signal, where the Fourier transform formula is as follows:

$$ X(\lambda,k) = \sum_{n=0}^{N-1} x_{\lambda}(n)\,w(n)\,e^{-j 2\pi k n / N} \tag{2} $$

where $x_{\lambda}(n)$ is the $n$-th sample of the $\lambda$-th audio frame, $k = 0, 1, \ldots, N-1$ is the index of a frequency point, and $\lambda$ is the sequence number of the audio frame; the same notation is used below.
In step 1023, a short-time power spectrum corresponding to each of the plurality of frequency points is obtained by frequency domain representation.
In some embodiments, the short-time power spectrum corresponding to each of the plurality of frequency bins is obtained by:
$$ P(\lambda,k) = \left|X(\lambda,k)\right|^{2} \tag{3} $$

where $P(\lambda,k)$ is the short-time power spectrum of the $k$-th frequency point of the $\lambda$-th frame, obtained by squaring the magnitude of the $k$-th frequency point of the $\lambda$-th frame.
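Steps 1022 and 1023 can be sketched together; this assumes the windowed frames from the previous sketch and uses a real FFT for brevity (equation (2) is the full DFT):

```python
import numpy as np

def short_time_power_spectrum(frames: np.ndarray) -> np.ndarray:
    """P(lambda, k) = |X(lambda, k)|^2 for each frame, per equations (2)-(3)."""
    X = np.fft.rfft(frames, axis=-1)  # equation (2): per-frame Fourier transform
    return np.abs(X) ** 2             # equation (3): squared magnitude
```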
In step 1024, the short-time power spectrum is smoothed to obtain noise power spectrums corresponding to the frequency points.
In some embodiments, first, the short-term power spectrum is frequency domain smoothed according to the following equation:
$$ S_f(\lambda,k) = \sum_{i=-w}^{w} b(i)\,P(\lambda,k+i) \tag{4} $$

where $b = \{b(-w), \ldots, b(w)\}$ is a group of frequency-domain smoothing weighting factors over a window of size $2w+1$, which can be adjusted according to the specific application scenario; $b(i)$ weights the value at the $i$-th position to the left of position $k$ ($i<0$, the previous $w$ positions) or to the right ($i>0$, the following $w$ positions). The meaning of equation (4) is to replace the value of the sequence $P(\lambda,\cdot)$ at position $k$ with a weighted average over the $2w+1$ surrounding positions, yielding $S_f(\lambda,k)$.
Next, the short-time power spectrum is time-domain smoothed according to the following equation:
$$ S(\lambda,k) = \alpha_s\,S(\lambda-1,k) + (1-\alpha_s)\,S_f(\lambda,k) \tag{5} $$

where $S(\lambda,k)$ is the moving average of the $k$-th frequency point at the $\lambda$-th time step, i.e. the noise power spectrum corresponding to the frequency point, and $\alpha_s$ is the time-domain smoothing coefficient, taking a value between 0 and 1. The meaning of equation (5) is that the output of the next time step is a weighted average of the historical data and the current input.
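A sketch of the two smoothing passes, assuming an illustrative three-point frequency window b = [0.25, 0.5, 0.25] and alpha_s = 0.8 (the patent leaves both configurable):

```python
import numpy as np

def smooth_power_spectrum(P, b=(0.25, 0.5, 0.25), alpha_s=0.8):
    """Frequency-domain smoothing (eq. (4)) followed by recursive
    time-domain smoothing (eq. (5)) of a (frames x bins) power spectrum."""
    S = np.empty_like(P)
    for lam in range(P.shape[0]):
        s_f = np.convolve(P[lam], b, mode="same")  # equation (4)
        S[lam] = s_f if lam == 0 else alpha_s * S[lam - 1] + (1 - alpha_s) * s_f  # eq. (5)
    return S
```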
In step 1025, a minimum power spectrum value is obtained over a plurality of time windows through the noise power spectrum, wherein each time window includes a plurality of audio frames.
In some embodiments, the minimum power spectrum value in each time window may be obtained by a minimum value iteration method, and the minimum power spectrum value obtaining process is as follows:
$$
\begin{aligned}
\lambda \bmod D = 0:\quad & S_{\min}(\lambda,k) = \min\{S_{\mathrm{tmp}}(\lambda-1,k),\,S(\lambda,k)\}, \quad S_{\mathrm{tmp}}(\lambda,k) = S(\lambda,k);\\
\lambda \bmod D \ne 0:\quad & S_{\min}(\lambda,k) = \min\{S_{\min}(\lambda-1,k),\,S(\lambda,k)\}, \quad S_{\mathrm{tmp}}(\lambda,k) = \min\{S_{\mathrm{tmp}}(\lambda-1,k),\,S(\lambda,k)\}
\end{aligned} \tag{6}
$$

where $S_{\min}$ is the minimum power spectrum value, $D$ is the length of a time window in frames (e.g., taking 0.5 seconds as a time window), each time window comprising a plurality of audio frames, $S_{\mathrm{tmp}}$ is a temporary power spectrum value used during the iteration, and $S(\lambda,k)$ is the output of equation (5). The overall meaning of expression (6) is: if $\lambda$ divided by $D$ leaves remainder 0, then $S_{\min}$ equals the smaller of $S_{\mathrm{tmp}}$ and $S$, and $S_{\mathrm{tmp}}$ is reset to $S$; if the remainder is not 0, then $S_{\min}$ equals the smaller of $S_{\min}$ and $S$, and $S_{\mathrm{tmp}}$ equals the smaller of $S_{\mathrm{tmp}}$ and $S$. The purpose of equation (6) is to update $S_{\min}$ and $S_{\mathrm{tmp}}$ by constant iteration and finally obtain the minimum power spectrum value.
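The minimum-value iteration of equation (6) can be sketched as follows; D is the number of frames per time window (for 0.5 s windows of 20 ms frames, D = 25):

```python
import numpy as np

def track_minimum(S: np.ndarray, D: int = 25) -> np.ndarray:
    """Windowed minimum tracking per equation (6)."""
    s_tmp = S[0].copy()
    s_min = S[0].copy()
    S_min = np.empty_like(S)
    for lam in range(S.shape[0]):
        if lam % D == 0:
            s_min = np.minimum(s_tmp, S[lam])  # restart the minimum search
            s_tmp = S[lam].copy()              # from the temporary value
        else:
            s_min = np.minimum(s_min, S[lam])
            s_tmp = np.minimum(s_tmp, S[lam])
        S_min[lam] = s_min
    return S_min
```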
In step 1026, a voice presence probability value corresponding to each of the plurality of frequency points is obtained by the minimum power spectrum value.
In some embodiments, first, signal-to-noise ratio predicted values corresponding to a plurality of frequency points are obtained by the following equation:
$$ S_r(\lambda,k) = \frac{S(\lambda,k)}{S_{\min}(\lambda,k)} \tag{7} $$

where $S_r(\lambda,k)$ is the signal-to-noise ratio predicted value of the $k$-th frequency point of the $\lambda$-th frame, $S(\lambda,k)$ the noise power spectrum of the $k$-th frequency point of the $\lambda$-th frame, and $S_{\min}(\lambda,k)$ the minimum power spectrum value of the $k$-th frequency point of the $\lambda$-th frame.
Next, the initial voice existence probability corresponding to each frequency point is obtained through the following formula:
$$ I(\lambda,k) = \begin{cases} 1, & S_r(\lambda,k) > \delta \\ 0, & S_r(\lambda,k) \le \delta \end{cases} \tag{8} $$

where $\delta$ is the presence-probability threshold; the initial speech presence probability $I(\lambda,k)$ of the $k$-th frequency point of the $\lambda$-th frame is obtained by comparing the signal-to-noise ratio predicted value against the threshold. Specifically, when the signal-to-noise ratio predicted value of the $k$-th frequency point of the $\lambda$-th frame is greater than the presence-probability threshold, the initial speech presence probability of that frequency point is 1; when it is less than or equal to the threshold, the initial speech presence probability is 0.
Finally, smoothing the obtained initial voice probability according to the following formula to obtain voice existence probability values corresponding to a plurality of frequency points respectively:
$$ \hat{p}(\lambda,k) = \alpha_p\,\hat{p}(\lambda-1,k) + (1-\alpha_p)\,I(\lambda,k) \tag{9} $$

where $\alpha_p$ is the smoothing coefficient and $\hat{p}(\lambda,k)$ is the speech presence probability value of the $k$-th frequency point of the $\lambda$-th frame.
In step 1027, noise estimation values respectively corresponding to the plurality of frequency points are obtained through the voice existence probability value.
In some embodiments, the noise estimation values respectively corresponding to the plurality of frequency points are obtained by using the voice existence probability values through the following formula:
$$ \hat{N}(\lambda,k) = \hat{p}(\lambda,k)\,\hat{N}(\lambda-1,k) + \bigl(1-\hat{p}(\lambda,k)\bigr)\,P(\lambda,k) \tag{10} $$

where $\hat{N}(\lambda,k)$ is the noise estimate of the $k$-th frequency point of the $\lambda$-th frame, $\hat{p}(\lambda,k)$ the speech presence probability value of that frequency point, $\hat{N}(\lambda-1,k)$ the noise estimate of the previous frame, and $P(\lambda,k)$ the short-time power spectrum of the $k$-th frequency point of the $\lambda$-th frame.
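Equations (7)-(10) together form a minima-controlled recursive noise estimator; the following compact sketch assumes illustrative values for the threshold delta and smoothing coefficient alpha_p, since the patent does not fix them:

```python
import numpy as np

def estimate_noise(P, S, S_min, delta=5.0, alpha_p=0.2):
    """Noise estimate per frequency point via equations (7)-(10).
    P: short-time power spectrum; S: smoothed noise power spectrum;
    S_min: minimum power spectrum values (all arrays are frames x bins)."""
    p_hat = np.zeros(P.shape[1])
    N_hat = P[0].copy()  # initialise the noise estimate from the first frame
    N = np.empty_like(P)
    for lam in range(P.shape[0]):
        snr = S[lam] / np.maximum(S_min[lam], 1e-12)  # equation (7)
        I = (snr > delta).astype(float)               # equation (8)
        p_hat = alpha_p * p_hat + (1 - alpha_p) * I   # equation (9)
        N_hat = p_hat * N_hat + (1 - p_hat) * P[lam]  # equation (10)
        N[lam] = N_hat
    return N
```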
With continued reference to fig. 4A, in step 103, noise masking values corresponding to the plurality of frequency bins, respectively, are determined.
In some embodiments, referring to fig. 4C, step 103 shown in fig. 4A may be implemented by performing the following steps 1031 to 1036 for each of a plurality of frequency points, which are specifically described below.
In step 1031, a critical band index value of the frequency bin is determined.
In some embodiments, a critical frequency band, commonly measured in Bark, is a psychoacoustic scale used to describe the human ear's perception of sound frequency; the Bark domain is defined as shown in Table 1:

TABLE 1. The Bark domain comprises the 24 standard critical bands, whose band edges lie at approximately 20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000 and 15500 Hz.
The linear frequency corresponding to the frequency point can be converted into a critical band index value by the following formula:
$$ z(f) = 13\arctan(0.76 f) + 3.5\arctan\!\left(\left(\tfrac{f}{7.5}\right)^{2}\right) \tag{11} $$

where $\arctan$ denotes the inverse tangent function and $f$ is expressed in kilohertz (kHz).
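A direct transcription of equation (11), with the frequency supplied in Hz and converted to kHz internally:

```python
import numpy as np

def hz_to_bark(f_hz) -> np.ndarray:
    """Critical band (Bark) index value for a frequency, per equation (11)."""
    f_khz = np.asarray(f_hz, dtype=float) / 1000.0
    return 13.0 * np.arctan(0.76 * f_khz) + 3.5 * np.arctan((f_khz / 7.5) ** 2)

# e.g., hz_to_bark([100.0, 1000.0, 4000.0]) -> approx. [0.99, 8.51, 17.25]
```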
In step 1032, a signal power spectrum corresponding to the critical band index value is determined.
In some embodiments, the signal power spectrum corresponding to the critical band index value is calculated by:
$$ B(\lambda,i) = \sum_{k=bl_i}^{bh_i} P(\lambda,k) \tag{12} $$

where $bl_i$ and $bh_i$ are the frequency-point indices corresponding to the lower and upper limit frequencies of the $i$-th Bark critical band, and $B(\lambda,i)$ is the power spectrum of the $i$-th critical band of the $\lambda$-th frame.
In step 1033, a preset expansion function is obtained.
In some embodiments, the preset expansion function formula is as follows:
$$ SF(\Delta z) = 15.81 + 7.5(\Delta z + 0.474) - 17.5\sqrt{1 + (\Delta z + 0.474)^{2}} \quad \text{(dB)} \tag{13} $$

where $SF$ is the expansion (spreading) function, $\Delta z = z_i - z_j$, $z_i$ is the critical band index value of the masked signal, and $z_j$ is the critical band index value of the masking signal.
In step 1034, an extended spectrum of frequency points is determined from the signal power spectrum and the extended function.
In some embodiments, the spread spectrum may be expressed as:
$$ C(\lambda,i) = B(\lambda,i) * SF(i) \tag{14} $$

where $*$ denotes convolution across the critical bands and $C(\lambda,i)$ is the spread spectrum of the $i$-th critical band index value of the $\lambda$-th frame.
In step 1035, critical band mask values are determined by expanding the spectrum.
In some embodiments, the critical band mask value is obtained by:
(15)
wherein T(n,i) represents the critical band masking value of the i-th critical band index value of the n-th frame.
In step 1036, a noise masking value is determined from the critical band masking value.
In some embodiments, the noise masking value is equal to the maximum of the critical band masking value and the absolute hearing threshold calculated as follows:
T_abs(i) = 3.64 · (f_i)^(−0.8) − 6.5 · exp(−0.6 · (f_i − 3.3)²) + 10⁻³ · (f_i)⁴  (in dB SPL)  (16)
wherein f_i = z⁻¹(i), and z⁻¹(·) is the function that converts a critical band index value into a frequency: its input is the critical band index value, and its output is the corresponding frequency value.
The noise masking value may be expressed as follows:
T_m(n,i) = max(T(n,i), T_abs(i))  (17)
Finally, the noise masking value expressed as a sound pressure level is converted into the electronic domain by:
(18)
wherein,first, theFrame frequencyNoise masking values of frequency bins of (a).
With continued reference to fig. 4A, in step 104, a masking intensity value of the environmental noise at each of the plurality of frequency points corresponding to the sound source signal to be output is determined based on the noise estimation value and the noise masking value corresponding to each of the plurality of frequency points.
In some embodiments, referring to fig. 4D, step 104 shown in fig. 4A may be implemented by the following steps 1041 to 1042, which are specifically described below.
In step 1041, a ratio of the noise estimation value corresponding to the frequency bin to the noise masking value corresponding to the frequency bin is determined.
In step 1042, the ratio is taken as the masking intensity value of the frequency point, wherein the masking intensity value is positively correlated with the degree to which the environmental noise masks the sound source signal to be output at that frequency point.
In some embodiments, the masking intensity value may be represented by the following formula:
η(n,k) = λ_d(n,k) / M(n,k)  (19)
wherein η(n,k) represents the masking intensity value of the k-th frequency point of the n-th frame, λ_d(n,k) represents the noise estimate of the k-th frequency point of the n-th frame, and M(n,k) represents the noise masking value of the k-th frequency point of the n-th frame.
Here, the conversion formula between frequency points and frequencies can be adopted: frequency = frequency point index × selected communication bandwidth + lower limit frequency. After the frequency point k is converted into the frequency f, the subsequent calculation is performed by equation (18).
With continued reference to fig. 4A, in step 105, suppression gain values corresponding to the plurality of frequency points are determined according to the masking intensity values corresponding to the plurality of frequency points, respectively.
In some embodiments, referring to fig. 4E, step 105 shown in fig. 4A may be implemented by the following steps 1051A through 1052A, which are described in detail below.
In step 1051A, original gain values corresponding to the plurality of frequency points are obtained, wherein the original gain values are obtained by performing noise estimation processing on noise components included in the sound source signal to be output.
In some embodiments, as described above, the sound source signal to be output includes noise components, such as electromagnetic interference from communication devices, wires and power supplies, mechanical noise from microphones and headphones, or distortion and noise introduced when a compression algorithm compresses the speech during a call. The original gain value is obtained by performing noise estimation processing (e.g., a wavelet transform noise reduction algorithm or spectral subtraction) on the noise components included in the sound source signal to be output.
In step 1052A, the original gain value is subjected to nonlinear transformation by using masking intensity values corresponding to the frequency points, so as to obtain suppression gain values corresponding to the frequency points.
In some embodiments, the original gain values are nonlinearly transformed using the masking intensity values corresponding to the plurality of frequency points to obtain the suppression gain values corresponding to the plurality of frequency points. Nonlinearly transforming the original gain value of a frequency point means that the original gain value is first subjected to nonlinear processing and then compared with a preset minimum suppression gain value, the larger of the two finally being taken as the suppression gain value of that frequency point. For example, the suppression gain value corresponding to each frequency point can be determined by the following formula:
G̃(n,k) = max(N(G(n,k), η(n,k)), G_min)  (20)
wherein G̃(n,k) is the final suppression gain value for the k-th frequency point of the n-th frame; G(n,k) corresponds to the original gain value above; N(·) denotes the nonlinear processing, which is built on a monotonically increasing function of the masking intensity value, i.e., the larger the input value η(n,k), the larger the output value, with a maximum output upper limit of 1; and G_min is the minimum suppression gain value, used to prevent excessive suppression during noise reduction from causing hollow-sounding output.
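A minimal sketch of steps 1051A to 1052A, assuming a tanh-shaped monotone mapping and an illustrative way of combining the original gain with the masking intensity (the text fixes only the monotonicity, the output upper limit of 1, and the max with the minimum suppression gain):

```python
import numpy as np

def suppression_gain(orig_gain, masking_intensity, g_min=0.1):
    """Nonlinear gain transform in the spirit of equation (20).

    The combination below (shrinking the original gain as the masking
    intensity grows) is one plausible reading; the tanh shape and the
    g_min value are assumptions, not the claimed form.
    """
    phi = np.tanh(masking_intensity)          # monotone increasing, upper limit 1
    adjusted = orig_gain * (1.0 - 0.5 * phi)  # stronger masking -> stronger suppression
    return np.maximum(adjusted, g_min)        # floor avoids hollow-sounding output
```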
Through steps 1051A to 1052A, the suppression gain values corresponding to the frequency points are determined according to the masking intensity values corresponding to the frequency points, and the suppression gain value of each frequency point is adjusted in a targeted manner, thereby achieving accurate noise reduction.
In some embodiments, referring to fig. 4F, step 105 shown in fig. 4A may be implemented by the following steps 1051B through 1052B, which are described in detail below.
In step 1051B, the plurality of frequency points are divided into a plurality of masking intervals according to endpoint masking intensity values preset in the plurality of masking intervals and masking intensity values respectively corresponding to the plurality of frequency points.
In some embodiments, the plurality of frequency points are divided into a plurality of masking intervals according to endpoint masking intensity values preset for the plurality of masking intervals and the masking intensity values respectively corresponding to the plurality of frequency points. For example, the interval where the masking intensity value η is less than or equal to 1.5 is a weak masking interval, the interval where η lies within 1.5–2.5 is a medium masking interval, and the interval where η is greater than or equal to 2.5 is a strong masking interval. The division of masking intervals here is merely an example; the embodiments of the present application do not limit the specific endpoint masking intensity values between masking intervals or the specific number of masking intervals obtained by the division.
In step 1052B, a uniform suppression gain value is assigned for each frequency point in each masking interval, wherein the suppression gain values assigned for different masking intervals are different, and the suppression gain values assigned for the masking intervals are positively correlated with the masking intensity values corresponding to the frequency points included in the masking interval.
In some embodiments, a uniform suppression gain value is assigned to each frequency point in each masking interval. Continuing the above example, the suppression gain value assigned to each frequency point in the weak masking interval (η ≤ 1.5) may be 0.9, the suppression gain value assigned to each frequency point in the medium masking interval (1.5 < η < 2.5) may be 0.7, and the suppression gain value assigned to each frequency point in the strong masking interval (η ≥ 2.5) may be 0.5. The suppression gain values assigned to frequency points in different masking intervals here are only examples; the embodiments of the present application do not limit their specific values. For example, the suppression gain values in the different masking intervals may also be assigned by linearly transforming the original gain values: in the weak masking interval the original gain value is used directly as the suppression gain value of the frequency points in that interval, in the medium masking interval 0.9 times the original gain value is used, and in the strong masking interval 0.8 times the original gain value is used.
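A sketch of the interval-based assignment of steps 1051B to 1052B, using the example endpoint values (1.5, 2.5) and gains (0.9, 0.7, 0.5) from the text; the boundary handling at exactly 1.5 and 2.5 is approximate:

```python
import numpy as np

def interval_gains(masking_intensity,
                   edges=(1.5, 2.5),          # example endpoint masking intensity values
                   gains=(0.9, 0.7, 0.5)):    # weak / medium / strong intervals
    """Assign one uniform suppression gain per masking interval."""
    eta = np.asarray(masking_intensity)
    idx = np.digitize(eta, edges)  # 0 = weak, 1 = medium, 2 = strong
    return np.asarray(gains)[idx]
```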
In some embodiments, when the number of masking intervals is two, the plurality of masking intervals include a strong masking interval and a weak masking interval, the masking intensity value of the strong masking interval being greater than that of the weak masking interval. Sound source components of the sound source signal to be output that fall within the strong masking interval are completely masked by the noise components of the environmental noise in that interval, whereas sound source components within the weak masking interval are only partially masked. As an example, the suppression gain values within the weak and strong masking intervals may be set as follows: for each frequency point in the weak masking interval, the original gain value is used as the suppression gain value; for each frequency point in the strong masking interval, the suppression gain value is obtained as in formula (20) of step 1052A and the related description.
Through steps 1051B to 1052B, the suppression gain values of the frequency points are assigned in units of masking intervals, so that gain values are set per masking interval divided by masking intensity value. Compared with a noise reduction scheme that determines suppression gain values frequency point by frequency point, this simplifies the calculation process of generating the noise reduction signal and improves processing efficiency.
With continued reference to fig. 4A, in step 106, a first noise reduction signal is generated by the suppression gain values of the plurality of frequency points and the sound source signal to be output.
In some embodiments, referring to fig. 4G, step 106 shown in fig. 4A may be implemented by the following steps 1061 to 1063, which are specifically described below.
In step 1061, the suppression gain values of the multiple frequency points are multiplied by the frequency domain signal values of the corresponding frequency points in the sound source signal to be output, so as to obtain a first frequency domain noise reduction signal of the sound source signal to be output.
In some embodiments, the original sound source signal to be output is Fourier transformed to obtain a frequency domain signal, i.e., a complex value at each frequency point, and the complex value of each frequency point is multiplied by the corresponding suppression gain value to obtain the first frequency domain noise reduction signal of the sound source signal to be output.
In step 1062, the first frequency domain noise reduction signal is converted to a first time domain signal.
In some embodiments, the first time domain signal is obtained by performing an inverse fourier transform process on the first frequency domain noise reduction signal.
In step 1063, an automatic gain control process is performed on the first time domain signal to obtain a first noise reduction signal.
In some embodiments, the first time domain signal is subjected to automatic gain control (Auto Gain Control, AGC) processing. The signal energy level of the first time domain signal is detected (e.g., by a root mean square or peak detection method), and the AGC derives the gain value to be applied from the time domain signal energy level of the current frame and the target energy range: when the signal energy is too low, the gain is increased to raise the signal-to-noise ratio; conversely, when the energy is too high, the gain is reduced to avoid signal overflow. The obtained gain value is then applied to the first time domain signal (e.g., by multiplication) to obtain the first noise reduction signal.
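Steps 1061 to 1063 amount to a per-bin multiply, an inverse transform, and AGC; a sketch, with the target RMS level as a hypothetical parameter and a deliberately simplified AGC:

```python
import numpy as np

def first_noise_reduction_signal(spectrum, gains, target_rms=0.1):
    """Apply per-bin suppression gains, return to the time domain, then AGC.

    spectrum: complex rFFT of the sound source signal to be output
    gains:    suppression gain value per frequency point
    """
    denoised_spec = spectrum * gains            # step 1061: per-bin multiply
    time_sig = np.fft.irfft(denoised_spec)      # step 1062: inverse Fourier transform
    rms = np.sqrt(np.mean(time_sig ** 2)) + 1e-12
    agc_gain = target_rms / rms                 # step 1063: level-based gain (simplified AGC)
    return time_sig * agc_gain
```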
In some embodiments, referring to fig. 4H, the following steps 107 to 110 may also be performed before performing step 104 shown in fig. 4A, as described in detail below.
In step 107, the audio signal is divided into a plurality of audio frames.
Here, the audio signal is divided into a plurality of audio frames, see the explanation of step 1021 above, after which the following processing of steps 108 to 110 is performed for each audio frame.
In step 108, noise estimation values respectively corresponding to a plurality of frequency points included in the audio frame are obtained.
Here, see the description of step 1022 to step 1027 above.
In step 109, the noise estimation values corresponding to the plurality of frequency points included in the audio frame are added together to obtain an overall noise estimation value of the audio frame.
In some embodiments, the noise estimation values corresponding to the multiple frequency points included in the audio frame are added, and the calculation formula is as follows:
λ_frame(n) = Σ_k λ_d(n,k)  (21)
wherein λ_frame(n) represents the overall noise estimate of the n-th frame, and λ_d(n,k) represents the noise estimate of the k-th frequency point of the n-th frame.
In step 110, in response to the overall noise estimate of the audio frame being greater than the noise threshold, a transition is made to a process for determining masking intensity values of the ambient noise at a plurality of frequency bins corresponding respectively to the sound source signals to be output.
In some embodiments, when the overall noise estimate of the audio frame is greater than the preset noise threshold, processing shifts to determining the masking intensity values of the environmental noise corresponding to the sound source signal to be output at the plurality of frequency points; here, refer to the description of steps 103 to 105.
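A sketch of the per-frame branching of steps 109 to 111; the noise threshold value is an assumption:

```python
import numpy as np

def needs_masking_aware_gain(noise_estimate, noise_threshold=1e-3):
    """Equation (21) plus the threshold test of steps 109 to 111.

    noise_estimate: per-bin noise estimates lambda_d(n, k) of one frame.
    Returns True when the frame should go through the masking-intensity
    path (steps 103 to 105); otherwise the original gains are used as-is.
    """
    frame_noise = np.sum(noise_estimate)  # equation (21)
    return frame_noise > noise_threshold  # threshold value is hypothetical
```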
In some embodiments, referring to fig. 4I, in performing step 110 shown in fig. 4H, the following steps 111 to 114 may also be performed, which are specifically described below.
In step 111, in response to the overall noise estimation value of the audio frame being less than or equal to the noise threshold, the original gain values corresponding to the plurality of frequency points are obtained, and the original gain values are used as the suppression gain values corresponding to the plurality of frequency points.
In some embodiments, when the overall noise estimate of the audio frame is less than or equal to the preset noise threshold, the original gain values respectively corresponding to the plurality of frequency points are acquired; the acquisition of the original gain values corresponding to the frequency points is as described in step 1051A.
In step 112, the suppression gain values of the plurality of frequency points are multiplied by the frequency domain signal values of the corresponding frequency points in the sound source signal to be output, so as to obtain a second frequency domain noise reduction signal of the sound source signal to be output.
In some embodiments, the suppression gain values of the plurality of frequency points are multiplied by the frequency domain signal values of the corresponding frequency points in the sound source signal to be output, so as to obtain the second frequency domain noise reduction signal of the sound source signal to be output, for specific implementation, see the description of step 1061 above.
In step 113, the second frequency domain noise reduction signal is converted to a second time domain signal.
In some embodiments, the second frequency domain noise reduction signal is converted to a second time domain signal, for a specific implementation, see the description of step 1062 above.
In step 114, an automatic gain control process is performed on the second time domain signal to obtain a second noise reduction signal.
In some embodiments, the second time domain signal is subjected to automatic gain control processing to obtain the second noise reduction signal; for a specific implementation, see the description of step 1063 above.
Through steps 107 to 114, for frequency points contained in audio frames whose overall noise estimate is lower than the noise threshold, relatively mild denoising is performed by taking the original gain value as the final suppression gain value; for frequency points contained in audio frames whose overall noise estimate is higher than the noise threshold, the original gain values are nonlinearly transformed and then used as the final suppression gain values, performing strongly suppressive denoising. This achieves targeted denoising of each frequency point for audio frames with different overall noise estimates, with the beneficial effects of reducing the masking influence that audio frames with higher overall noise estimates (i.e., audio frames completely masked by the environmental noise signal) would exert on other audio frames in the sound source signal to be output, and of reducing the loss of signal energy of audio frames with lower overall noise estimates in the sound source signal to be output.
In some embodiments, the audio signal further includes an echo signal, and before performing step 102 shown in fig. 4A, the audio signal may be subjected to an echo cancellation process to obtain an audio signal from which the echo signal is removed, where the audio signal from which the echo signal is removed is used to determine the noise estimate instead of the audio signal before removing the echo signal.
In some embodiments, the audio signal from which the echo signal is removed may be obtained by performing echo cancellation processing on the audio signal, for example, by a frequency domain correlation analysis method, a reverberation compensation method (Reverberation Compensation), or a Dual microphone echo cancellation (Dual-Microphone Echo Cancellation) method.
Through steps 101 to 106, the degree to which different frequency points in the sound source signal to be output are masked (characterized by masking intensity values) is determined in combination with the auditory masking effect in the real-time playing environment of the sound source signal to be output. The original gain values corresponding to the frequency points contained in audio frames whose overall noise estimate is greater than the noise threshold are nonlinearly processed to obtain suppression gain values with stronger suppression intensity, while for frequency points contained in audio frames whose overall noise estimate is less than the noise threshold, the original gain values with relatively mild suppression intensity are used as the suppression gain values. In this way, the suppression gain value corresponding to each frequency point is adjusted in a targeted manner, the energy at each frequency point in the audio signal is then attenuated using the suppression gain values, and the masking influence of strong environmental noise is reduced, achieving targeted noise reduction of frequency points with different masking intensities and thereby improving the clarity and intelligibility of the finally played sound source signal to be output.
An exemplary application of the embodiments of the present application in a voice call application scenario will be described below.
In the audio and video call process using a mobile terminal, the call is easily interfered with by acoustic noise in the external environment, such as noisy sounds in subway carriages, supermarket announcements, road traffic noise, or outdoor rain. Because sound has a masking effect, i.e., a louder sound can mask a quieter one, this manifests in the frequency domain as follows: after the remote party's voice is played through a speaker or earphone, the sound components of individual frequency bins are completely masked by the environmental noise components, so that the local listener cannot hear the remote party clearly. The audio signal processing method provided by the embodiments of the present application performs noise reduction processing on the audio data collected by the local listener's device during the voice call, so that a clearer voice signal is output and the listener can effectively hear the other party under the surrounding environmental noise.
Referring to fig. 5, fig. 5 is a schematic flow chart of audio signal processing in a voice call according to an embodiment of the present application, and is specifically described below.
In step 201, a sound source signal to be output is acquired, and audio data is acquired.
In some embodiments, in response to the local listener acquiring, via an instant messaging client, the sound source signal to be output transmitted by the speaker in a voice call, the terminal device of the local listener collects audio data (e.g., via the feedforward and feedback microphones of the external headset 310 shown in fig. 3A, or an internal microphone of the terminal device), where the audio data includes an environmental sound source signal and environmental noise. Meanwhile, the sound source signal to be output includes noise components, such as electromagnetic interference from communication devices, wires and power supplies, mechanical noise from microphones and headphones, or distortion and noise introduced when a compression algorithm compresses the speech during the call.
In step 202, audio data preprocessing is performed.
In some embodiments, the audio data is subjected to silence removal processing, and echo cancellation processing is performed on the audio data by a frequency domain correlation analysis method, a reverberation compensation method (Reverberation Compensation), or a Dual-microphone echo cancellation (Dual-Microphone Echo Cancellation) method, so as to obtain audio data from which echo signals are removed.
In step 203, a noise estimate is obtained.
Here, reference is made to the description of step 102 above.
In step 204, a noise masking value is obtained.
Here, reference is made to the description of step 103 above.
In step 205, strong and weak masking intervals are identified.
In some embodiments, the audio signal is divided into a plurality of audio frames, the noise estimation values corresponding to the plurality of frequency points included in the audio frames are added to obtain the overall noise estimation value of the audio frame, the calculation formula and description refer to the above steps 107 to 109, the frequency points included in the audio frames with the overall noise estimation value of the audio frame being greater than the noise threshold are divided into strong masking intervals, and the frequency points included in the other audio frames are divided into weak masking intervals.
In some examples, as an alternative implementation, the plurality of frequency points may be further divided into a plurality of masking intervals according to endpoint masking intensity values preset for the plurality of masking intervals and masking intensity values respectively corresponding to the plurality of frequency points, where, see the description of step 1052B above.
In step 206, different noise reduction strategies are employed for different masking intervals.
In some embodiments, in the weak masking interval, the original gain value for each frequency bin is taken as the suppression gain value, where the acquisition of the original gain value is described with reference to step 1051A above; in the strong masking interval, the original gain value is subjected to nonlinear transformation to obtain suppression gain values corresponding to the plurality of frequency points respectively, where the original gain value is subjected to nonlinear transformation, see the description of step 1052A above.
In some examples, as an alternative implementation, a uniform suppression gain value may also be allocated for each frequency point in each masking interval, where the suppression gain values allocated for different masking intervals are different, and the suppression gain values allocated for the masking intervals are positively correlated with the masking intensity values corresponding to the frequency points included in the masking interval, where reference is made to the description of step 1052B above.
In some embodiments, the suppression gain values of a plurality of frequency points in different masking intervals are multiplied by the frequency domain signal values of the corresponding frequency points in the sound source signal to be output respectively to obtain a frequency domain noise reduction signal of the sound source signal to be output, and the frequency domain noise reduction signal is subjected to inverse fourier transform processing to obtain a time domain signal, where, refer to the description of steps 1061 to 1062 above.
In step 207, an automatic gain process is performed.
In some embodiments, taking the above example, the time domain signal output in step 206 is subjected to an automatic gain control process (Auto Gain Control, AGC), which obtains the gain value that should be applied by obtaining the time domain signal energy level (e.g., by a root mean square or peak detection method) of the current frame and the target energy range, e.g., when the signal energy is too low, the gain will be increased to improve the signal to noise ratio; otherwise, if the energy is too high, the gain is reduced to avoid signal overflow, and the obtained gain value is applied to the time domain signal (for example, by multiplication), so as to obtain the noise-reduced sound source signal to be output.
In step 208, the noise-reduced sound source signal to be output is played.
In some embodiments, the noise-reduced sound source signal to be output is played through a speaker of the terminal device of the local listener.
Continuing with the description below of an exemplary architecture in which the audio signal processing device 255 provided in embodiments of the present application is implemented as a software module, in some embodiments, as shown in fig. 2, the audio signal processing device 255 stored in the memory 250 may include:
the data acquisition module 2551 is configured to acquire a sound source signal to be output, and acquire an audio signal acquired from an environment, where the audio signal includes an environmental sound source signal and environmental noise.
The data processing module 2552 is configured to determine noise estimation values corresponding to a plurality of frequency points in the audio signal.
The generating module 2553 is configured to generate a first noise reduction signal through the suppression gain values of the multiple frequency points and the sound source signal to be output.
In some embodiments, the data processing module 2552 is further configured to divide the audio signal into a plurality of audio frames; converting the plurality of audio frames from a time domain to a frequency domain to obtain frequency domain representations corresponding to a plurality of frequency points in the audio signal respectively; acquiring short-time power spectrums corresponding to the frequency points respectively through the frequency domain representation; smoothing the short-time power spectrum to obtain noise power spectrums corresponding to the frequency points respectively; obtaining a minimum power spectrum value in a plurality of time windows through the noise power spectrum, wherein each time window comprises a plurality of audio frames; acquiring voice existence probability values corresponding to the frequency points respectively through the minimum power spectrum value; and acquiring noise estimated values respectively corresponding to the plurality of frequency points through the voice existence probability value.
In some embodiments, the data processing module 2552 is further configured to determine noise masking values corresponding to the plurality of frequency bins respectively.
In some embodiments, the data processing module 2552 is further configured to perform, for each of the plurality of bins, the following processing: determining a critical band index value of the frequency point; determining a signal power spectrum corresponding to the critical frequency band index value; acquiring a preset expansion function; determining an expansion spectrum of the frequency point through the signal power spectrum and the expansion function; determining a critical band masking value through the extended spectrum; the noise masking value is determined by the critical band masking value.
In some embodiments, the data processing module 2552 is further configured to determine masking intensity values of the environmental noise corresponding to the sound source signal to be output on the plurality of frequency points, based on the noise estimation values and the noise masking values corresponding to the plurality of frequency points, respectively.
In some embodiments, the data processing module 2552 is further configured to perform, for each of the frequency bins, the following processing: determining a ratio of the noise estimation value corresponding to the frequency point to the noise masking value corresponding to the frequency point; and taking the ratio as a masking intensity value of the frequency point, wherein the masking intensity value is positively correlated with the masking intensity value of the environmental noise on the sound source signal to be output at the frequency point.
In some embodiments, the data processing module 2552 is further configured to determine suppression gain values corresponding to the plurality of frequency points respectively according to masking intensity values corresponding to the plurality of frequency points respectively.
In some embodiments, the data processing module 2552 is further configured to obtain original gain values corresponding to the plurality of frequency points, where the original gain values are obtained by performing noise estimation processing on noise components included in the sound source signal to be output; and carrying out nonlinear transformation on the original gain value through the masking intensity values respectively corresponding to the plurality of frequency points to obtain the inhibition gain values respectively corresponding to the plurality of frequency points.
In some embodiments, the data processing module 2552 is further configured to divide the plurality of frequency points into a plurality of masking intervals according to endpoint masking intensity values preset in the plurality of masking intervals and masking intensity values respectively corresponding to the plurality of frequency points; and allocating a unified suppression gain value for each frequency point in each masking interval, wherein the suppression gain values allocated for different masking intervals are different, and the suppression gain values allocated for the masking intervals are positively correlated with the masking intensity values corresponding to the frequency points included in the masking interval.
In some embodiments, the data processing module 2552 is further configured to, when the number of masking intervals is two, the plurality of masking intervals includes a strong masking interval and a weak masking interval, and the masking intensity value of the strong masking interval is greater than the masking intensity value of the weak masking interval; the sound source component of the sound source signal to be output in the strong masking interval is completely masked by the noise component of the environmental noise in the strong masking interval; the sound source component of the sound source signal to be output in the weak masking section is partially masked by the noise component of the environmental noise in the weak masking section.
In some embodiments, the data processing module 2552 is further configured to divide the audio signal into a plurality of audio frames; the following processing is performed for each of the audio frames: acquiring the noise estimation values respectively corresponding to the plurality of frequency points included in the audio frame; adding the noise estimation values respectively corresponding to a plurality of frequency points included in the audio frame to obtain an overall noise estimation value of the audio frame; and responding to the whole noise estimated value of the audio frame being larger than a noise threshold, and switching to the process of determining masking intensity values of the environmental noise on the plurality of frequency points, which correspond to the sound source signals to be output respectively.
In some embodiments, the data processing module 2552 is further configured to, in response to the overall noise estimate of the audio frame being less than or equal to the noise threshold, obtain original gain values corresponding to the plurality of frequency points, respectively, and use the original gain values as suppression gain values corresponding to the plurality of frequency points, respectively; multiplying the suppression gain values of the plurality of frequency points with the frequency domain signal values corresponding to the frequency points in the sound source signal to be output respectively to obtain a second frequency domain noise reduction signal of the sound source signal to be output; converting the second frequency domain noise reduction signal to a second time domain signal; and performing automatic gain control processing on the second time domain signal to obtain a second noise reduction signal.
In some embodiments, the data processing module 2552 is further configured to perform echo cancellation processing on the audio signal to obtain an audio signal from which the echo signal is removed, where the audio signal from which the echo signal is removed is used to replace the audio signal before the echo signal is removed to determine the noise estimation value.
In some embodiments, the generating module 2553 is further configured to multiply the suppression gain values of the plurality of frequency points with the frequency domain signal values corresponding to the frequency points in the sound source signal to be output, so as to obtain a first frequency domain noise reduction signal of the sound source signal to be output; converting the first frequency domain noise reduction signal into a first time domain signal; and performing automatic gain control processing on the first time domain signal to obtain a first noise reduction signal.
The embodiment of the application provides a computer program product, which comprises computer executable instructions, the computer executable instructions are stored in a computer readable storage medium, a processor of an electronic device reads the computer executable instructions from the computer readable storage medium, and the processor executes the computer executable instructions, so that the electronic device executes the audio signal processing method of the embodiment of the application.
The present embodiments provide a computer-readable storage medium storing computer-executable instructions or a computer program stored therein, which when executed by a processor, cause the processor to perform the audio signal processing method provided by the embodiments of the present application, for example, the audio signal processing method as shown in fig. 4A.
In some embodiments, the computer readable storage medium may be RAM, ROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; but may be a variety of devices including one or any combination of the above memories.
In some embodiments, computer-executable instructions may be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, in the form of programs, software modules, scripts, or code, and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, computer-executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language (Hyper Text Markup Language, HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, computer-executable instructions may be deployed to be executed on one electronic device or on multiple electronic devices located at one site or, alternatively, on multiple electronic devices distributed across multiple sites and interconnected by a communication network.
In summary, by considering the auditory masking effect and adjusting the suppression gain value corresponding to each frequency point in combination with the auditory masking effect in the actual playing environment of the sound source signal to be output according to the embodiment of the application, then attenuating the energy on each frequency point in the sound source signal to be output according to the suppression gain value, the masking effect of strong environmental noise on the sound source signal to be output can be reduced, and therefore the beneficial effects of improving the definition and the understandability of the finally played sound source signal to be output are achieved.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modifications, equivalent substitutions, improvements, etc. that are within the spirit and scope of the present application are intended to be included within the scope of the present application.
Claims (14)
1. A method of audio signal processing, the method comprising:
acquiring a sound source signal to be output and acquiring an audio signal acquired from the environment, wherein the audio signal comprises the environment sound source signal and environment noise;
determining noise estimation values respectively corresponding to a plurality of frequency points in the audio signal;
determining noise masking values corresponding to the plurality of frequency points respectively;
determining masking intensity values of the environmental noise on the plurality of frequency points, which correspond to the sound source signals to be output, based on the noise estimated values and the noise masking values which correspond to the plurality of frequency points, respectively;
obtaining original gain values corresponding to the frequency points respectively, wherein the original gain values are obtained by carrying out noise estimation processing on noise components included in the sound source signals to be output;
nonlinear transformation is carried out on the original gain value through the masking intensity values respectively corresponding to the plurality of frequency points, so that suppression gain values respectively corresponding to the plurality of frequency points are obtained;
And generating a first noise reduction signal through the suppression gain values respectively corresponding to the plurality of frequency points and the sound source signal to be output.
2. The method of claim 1, wherein the determining masking intensity values of the ambient noise at the plurality of frequency points corresponding to the sound source signal to be output, based on the noise estimate values and the noise masking values corresponding to the plurality of frequency points, respectively, comprises:
the following processing is performed for each of the frequency bins:
determining a ratio of the noise estimation value corresponding to the frequency point to the noise masking value corresponding to the frequency point;
and taking the ratio as a masking intensity value of the frequency point, wherein the masking intensity value is positively correlated with the masking intensity value of the environmental noise corresponding to the sound source signal to be output at the frequency point.
3. The method according to claim 2, wherein the method further comprises:
dividing the plurality of frequency points into a plurality of masking intervals according to endpoint masking intensity values preset in the masking intervals and masking intensity values respectively corresponding to the plurality of frequency points;
and allocating a unified suppression gain value for each frequency point in each masking interval, wherein the suppression gain values allocated for different masking intervals are different, and the suppression gain values allocated for the masking intervals are positively correlated with the masking intensity values corresponding to the frequency points included in the masking interval.
4. The method of claim 3, wherein the step of,
when the number of the masking intervals is two, the plurality of masking intervals include a strong masking interval and a weak masking interval, and the masking intensity value of the strong masking interval is greater than that of the weak masking interval; the sound source component of the sound source signal to be output in the strong masking interval is completely masked by the noise component of the environmental noise in the strong masking interval; the sound source component of the sound source signal to be output in the weak masking section is partially masked by the noise component of the environmental noise in the weak masking section.
5. The method according to any one of claims 1 to 4, wherein the generating a first noise reduction signal by the suppression gain values of the plurality of frequency points and the sound source signal to be output includes:
multiplying the suppression gain values of the plurality of frequency points with the frequency domain signal values corresponding to the frequency points in the sound source signal to be output respectively to obtain a first frequency domain noise reduction signal of the sound source signal to be output;
converting the first frequency domain noise reduction signal into a first time domain signal;
and performing automatic gain control processing on the first time domain signal to obtain a first noise reduction signal.
6. The method according to any one of claims 1 to 4, wherein before said determining that the ambient noise corresponds to masking intensity values of the sound source signal to be output at the plurality of frequency points, respectively, the method further comprises:
dividing the audio signal into a plurality of audio frames;
the following processing is performed for each of the audio frames:
acquiring the noise estimation values respectively corresponding to the plurality of frequency points included in the audio frame;
adding the noise estimation values respectively corresponding to a plurality of frequency points included in the audio frame to obtain an overall noise estimation value of the audio frame;
and responding to the overall noise estimated value of the audio frame being larger than a noise threshold, and switching to the process of determining masking intensity values of the environmental noise on the plurality of frequency points, which correspond to the sound source signals to be output respectively.
7. The method of claim 6, the method further comprising:
in response to the overall noise estimation value of the audio frame being smaller than or equal to the noise threshold, acquiring original gain values respectively corresponding to the plurality of frequency points, and taking the original gain values as suppression gain values respectively corresponding to the plurality of frequency points;
Multiplying the suppression gain values of the plurality of frequency points with the frequency domain signal values corresponding to the frequency points in the sound source signal to be output respectively to obtain a second frequency domain noise reduction signal of the sound source signal to be output;
converting the second frequency domain noise reduction signal to a second time domain signal;
and performing automatic gain control processing on the second time domain signal to obtain a second noise reduction signal.
8. The method of any of claims 1 to 4, wherein the audio signal further comprises an echo signal, and wherein prior to said determining noise estimates for each of a plurality of frequency bins in the audio signal, the method further comprises:
and performing echo cancellation processing on the audio signal to obtain an audio signal from which the echo signal is removed, wherein the audio signal from which the echo signal is removed is used for replacing the audio signal before the echo signal is removed to determine the noise estimation value.
9. The method according to any one of claims 1 to 4, wherein determining noise estimates corresponding to a plurality of frequency points in the audio signal, respectively, includes:
dividing the audio signal into a plurality of audio frames;
converting the plurality of audio frames from a time domain to a frequency domain to obtain frequency domain representations corresponding to a plurality of frequency points in the audio signal respectively;
Acquiring short-time power spectrums corresponding to the frequency points respectively through the frequency domain representation;
smoothing the short-time power spectrum to obtain noise power spectrums corresponding to the frequency points respectively;
obtaining a minimum power spectrum value in a plurality of time windows through the noise power spectrum, wherein each time window comprises a plurality of audio frames;
acquiring voice existence probability values corresponding to the frequency points respectively through the minimum power spectrum value;
and acquiring noise estimated values respectively corresponding to the plurality of frequency points through the voice existence probability value.
10. The method according to any one of claims 1 to 4, wherein determining noise masking values to which the plurality of frequency points respectively correspond comprises:
the following processing is performed for each of the plurality of frequency points:
determining a critical band index value of the frequency point;
determining a signal power spectrum corresponding to the critical frequency band index value;
acquiring a preset expansion function;
determining an expansion spectrum of the frequency point through the signal power spectrum and the expansion function;
determining a critical band masking value through the extended spectrum;
the noise masking value is determined by the critical band masking value.
11. An audio signal processing apparatus, the apparatus comprising:
the data acquisition module is used for acquiring a sound source signal to be output and acquiring an audio signal acquired from the environment, wherein the audio signal comprises an environment sound source signal and environment noise;
the data processing module is used for determining noise estimation values corresponding to a plurality of frequency points in the audio signal respectively;
the data processing module is further used for determining noise masking values corresponding to the plurality of frequency points respectively;
the data processing module is further configured to determine masking intensity values of the environmental noise on the plurality of frequency points, where the masking intensity values correspond to the sound source signals to be output respectively, based on the noise estimation values and the noise masking values corresponding to the plurality of frequency points respectively;
the data processing module is further configured to obtain original gain values corresponding to the multiple frequency points, where the original gain values are obtained by performing noise estimation processing on noise components included in the sound source signal to be output; nonlinear transformation is carried out on the original gain value through the masking intensity values respectively corresponding to the plurality of frequency points, so that suppression gain values respectively corresponding to the plurality of frequency points are obtained;
And the generation processing module is used for generating a first noise reduction signal through the suppression gain values of the plurality of frequency points and the sound source signal to be output.
12. An electronic device, the electronic device comprising:
a memory for storing computer executable instructions;
a processor for implementing the audio signal processing method of any one of claims 1 to 10 when executing computer executable instructions stored in the memory.
13. A computer-readable storage medium storing computer-executable instructions or a computer program, which when executed by a processor implements the audio signal processing method of any one of claims 1 to 10.
14. A computer program product comprising computer executable instructions or a computer program, which when executed by a processor implements the audio signal processing method of any of claims 1 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311697438.5A CN117392994B (en) | 2023-12-12 | 2023-12-12 | Audio signal processing method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311697438.5A CN117392994B (en) | 2023-12-12 | 2023-12-12 | Audio signal processing method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117392994A CN117392994A (en) | 2024-01-12 |
CN117392994B true CN117392994B (en) | 2024-03-01 |
Family
ID=89441406
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311697438.5A Active CN117392994B (en) | 2023-12-12 | 2023-12-12 | Audio signal processing method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117392994B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109727605A (en) * | 2018-12-29 | 2019-05-07 | 苏州思必驰信息科技有限公司 | Handle the method and system of voice signal |
CN110164467A (en) * | 2018-12-18 | 2019-08-23 | 腾讯科技(深圳)有限公司 | The method and apparatus of voice de-noising calculate equipment and computer readable storage medium |
CN110265046A (en) * | 2019-07-25 | 2019-09-20 | 腾讯科技(深圳)有限公司 | A kind of coding parameter regulation method, apparatus, equipment and storage medium |
CN113160845A (en) * | 2021-03-29 | 2021-07-23 | 南京理工大学 | Speech enhancement algorithm based on speech existence probability and auditory masking effect |
CN114067822A (en) * | 2020-08-07 | 2022-02-18 | 腾讯科技(深圳)有限公司 | Call audio processing method and device, computer equipment and storage medium |
KR20230120734A (en) * | 2022-02-10 | 2023-08-17 | 주식회사 이엠텍 | Voice enhancing method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10319390B2 (en) * | 2016-02-19 | 2019-06-11 | New York University | Method and system for multi-talker babble noise reduction |
- 2023-12-12 CN CN202311697438.5A patent/CN117392994B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110164467A (en) * | 2018-12-18 | 2019-08-23 | 腾讯科技(深圳)有限公司 | The method and apparatus of voice de-noising calculate equipment and computer readable storage medium |
CN109727605A (en) * | 2018-12-29 | 2019-05-07 | 苏州思必驰信息科技有限公司 | Handle the method and system of voice signal |
CN110265046A (en) * | 2019-07-25 | 2019-09-20 | 腾讯科技(深圳)有限公司 | A kind of coding parameter regulation method, apparatus, equipment and storage medium |
CN114067822A (en) * | 2020-08-07 | 2022-02-18 | 腾讯科技(深圳)有限公司 | Call audio processing method and device, computer equipment and storage medium |
CN113160845A (en) * | 2021-03-29 | 2021-07-23 | 南京理工大学 | Speech enhancement algorithm based on speech existence probability and auditory masking effect |
KR20230120734A (en) * | 2022-02-10 | 2023-08-17 | 주식회사 이엠텍 | Voice enhancing method |
Also Published As
Publication number | Publication date |
---|---|
CN117392994A (en) | 2024-01-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113676803B (en) | Active noise reduction method and device | |
CN103871421B (en) | A kind of self-adaptation noise reduction method and system based on subband noise analysis | |
CN103236263B (en) | Method, system and mobile terminal for improving call quality | |
JP6150988B2 (en) | Audio device including means for denoising audio signals by fractional delay filtering, especially for "hands free" telephone systems | |
CN109493877B (en) | Voice enhancement method and device of hearing aid device | |
CN111986693A (en) | Audio signal processing method and device, terminal equipment and storage medium | |
CN110931007A (en) | Voice recognition method and system | |
JP2014513320A (en) | Method and apparatus for attenuating dominant frequencies in an audio signal | |
WO2022256577A1 (en) | A method of speech enhancement and a mobile computing device implementing the method | |
CN113838471A (en) | Noise reduction method and system based on neural network, electronic device and storage medium | |
CN114974299A (en) | Training and enhancing method, device, equipment and medium of speech enhancement model | |
CN112997249B (en) | Voice processing method, device, storage medium and electronic equipment | |
CN111667842B (en) | Audio signal processing method and device | |
CN117392994B (en) | Audio signal processing method, device, equipment and storage medium | |
EP3830823B1 (en) | Forced gap insertion for pervasive listening | |
CN101625870B (en) | Automatic noise suppression (ANS) method, ANS device, method for improving audio quality of monitoring system and monitoring system | |
US20230320903A1 (en) | Ear-worn device and reproduction method | |
CN115103258A (en) | Wind noise detection method and device and earphone | |
CN114220451A (en) | Audio denoising method, electronic device, and storage medium | |
EP4258263A1 (en) | Apparatus and method for noise suppression | |
CN113299308B (en) | Voice enhancement method and device, electronic equipment and storage medium | |
AU2019321519B2 (en) | Dual-microphone methods for reverberation mitigation | |
WO2023220918A1 (en) | Audio signal processing method and apparatus, storage medium and vehicle | |
CN112218206B (en) | Sound control method, device, equipment and medium based on film loudspeaker | |
US20240331716A1 (en) | Low-latency noise suppression |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |