CN117714581A - Audio signal processing method and electronic equipment - Google Patents

Audio signal processing method and electronic equipment

Info

Publication number
CN117714581A
Authority
CN
China
Prior art keywords
signal
masking
current frame
target audio
sound field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311017599.5A
Other languages
Chinese (zh)
Inventor
丁歌
杨枭
高荣荣
王瑞亮
程有宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honor Device Co Ltd filed Critical Honor Device Co Ltd
Priority to CN202311017599.5A
Publication of CN117714581A
Legal status: Pending

Landscapes

  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

An embodiment of the present application provides an audio signal processing method and an electronic device. The method is applied to an electronic device that includes N target audio playing devices, where N is an integer greater than or equal to 2, and includes the following steps: acquiring a target audio signal, where the target audio signal is the audio signal to be played by the N target audio playing devices; generating a masking signal from the target audio signal; performing sound field control processing on the target audio signal and the masking signal respectively, and then performing superposition processing to obtain N channels of signals to be played in one-to-one correspondence with the N target audio playing devices; and playing the corresponding signal to be played through each of the N target audio playing devices. The energy of the sound field formed by the signals to be played in a first spatial region is greater than the energy of the sound field they form in a second spatial region, and the first spatial region is closer to the target audio playing devices than the second spatial region. The method can protect user privacy and improve user experience.

Description

Audio signal processing method and electronic equipment
Technical Field
The application relates to the technical field of electronics, in particular to an audio signal processing method and electronic equipment.
Background
Communicating through electronic devices such as mobile phones is a common scenario in daily life. However, during a call, sound leakage from the mobile phone can expose the content of the user's conversation, which is detrimental to the protection of the user's privacy.
Disclosure of Invention
The application provides an audio signal processing method and electronic equipment, which can reduce sound leakage, protect user privacy and improve user experience.
In a first aspect, the present application provides an audio signal processing method, applied to an electronic device that includes N target audio playing devices, where N is an integer greater than or equal to 2. The method includes: acquiring a target audio signal, where the target audio signal is the audio signal to be played by the N target audio playing devices; generating a masking signal from the target audio signal; performing sound field control processing on the target audio signal and the masking signal respectively, and then performing superposition processing to obtain N channels of signals to be played in one-to-one correspondence with the N target audio playing devices; and playing the corresponding signal to be played through each of the N target audio playing devices. The energy of the sound field formed by the signals to be played in a first spatial region is greater than the energy of the sound field they form in a second spatial region, where the first spatial region is closer to the N target audio playing devices than the second spatial region.
Alternatively, the target audio playback device may differ depending on the electronic device's configuration and audio system settings. In one specific embodiment, the electronic device includes a receiver (i.e., an earpiece) for playing audio in the handheld (held-to-ear) mode and a speaker for playing audio in the hands-free mode. In this case, the target audio playback device is the receiver. In another specific embodiment, the electronic device includes only a speaker, which plays audio in both the handheld mode and the hands-free mode. In this case, the target audio playback device is the speaker.
Optionally, the target audio signal may be a call downlink signal output by a call application, a voice audio signal output by an instant messaging application, or an audio signal output by another application, for example, a recording audio signal output by a recording application. The embodiments of the present application do not set any limitation on the specific source of the target audio signal.
Alternatively, the first spatial region may be an in-ear region, and the second spatial region may be a sound leakage region.
According to the audio signal processing method of the present application, a masking signal is generated from the target audio signal; sound field control processing is performed on the target audio signal and the masking signal respectively, and the results are superposed to obtain N channels of signals to be played. After the N channels of signals to be played are played through the N target audio playing devices, the energy of the sound field they form in the first spatial region is greater than the energy of the sound field they form in the second spatial region. Because the first spatial region is closer to the N target audio playing devices than the second spatial region, when a user listens to audio with the electronic device (for example, answers a call), the user is closer to the target audio playing devices than the people nearby, so the sound those people hear (i.e., the leaked sound) is quieter than the sound the user hears. Sound leakage is thus reduced, user privacy protected, and user experience improved. In addition, the masking signal psychoacoustically masks the target audio signal and reduces its intelligibility, further protecting user privacy and improving user experience.
In a possible implementation, the first spatial region is the spatial region within a preset range around the N target audio playing devices; the second spatial region does not overlap with the first spatial region and is located on the side of the first spatial region away from the electronic device.
The first spatial region may be the in-ear region, that is, when the target audio playing device of the electronic device is close to the user's ear, the region comprising the ear canal and a preset range around the ear. The second spatial region may be the sound leakage region, that is, the region surrounding the first spatial region on the side away from the electronic device.
That is, the energy of the sound field formed by the signal to be played in the in-ear region (i.e., the in-ear sound field) is greater than the energy of the sound field formed by the signal to be played in the sound leakage region (i.e., the sound leakage sound field), so that the sound heard by surrounding people is smaller than the sound heard by the user, the sound leakage is reduced, the privacy of the user is protected, and the user experience is improved.
In one possible implementation manner, the superposition processing is performed after the sound field control processing is performed on the target audio signal and the masking signal, so as to obtain N paths of signals to be played corresponding to N target audio playing devices one to one, where the method includes: performing first sound field control processing on the target audio signals to obtain N paths of target output signals which are in one-to-one correspondence with N target audio playing devices, wherein the energy of a sound field formed by the target output signals in a first space region is larger than that of a sound field formed by the target output signals in a second space region; performing second sound field control processing on the masking signals to obtain N paths of masking output signals which are in one-to-one correspondence with N target audio playing devices, wherein the energy of a sound field formed by the masking output signals in the second space region is larger than that of a sound field formed by the masking output signals in the first space region; and superposing the target output signals corresponding to the target audio playing devices and the corresponding masking output signals to obtain N paths of signals to be played, which are in one-to-one correspondence with the N target audio playing devices.
In this implementation, on the one hand, sound field control of the target audio signal makes the energy of the sound field it forms in the first spatial region greater than the energy of the sound field it forms in the second spatial region. This reduces sound leakage, protects user privacy, and improves user experience. On the other hand, sound field control of the masking signal makes the sound field it finally forms have greater energy in the second spatial region than in the first spatial region. As a result, the target audio signal is psychoacoustically masked only slightly in the first spatial region but strongly in the second spatial region, so the intelligibility of the sound in the second spatial region is reduced while the effect on the first spatial region is minimized. In other words, user privacy is protected while the impact on the user's own listening experience is kept small, improving user experience.
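The filter-then-superpose step above can be sketched as follows. This is a minimal illustration in which sound field control is reduced to a plain per-device FIR filter; the function names, filter coefficients, and signal values are assumptions for illustration, not the patent's implementation.

```python
# Sketch: per-device sound field control (FIR filtering) followed by
# superposition of the filtered target and masking signals.
# All names and coefficients here are illustrative assumptions.

def fir_filter(signal, coeffs):
    """Direct-form FIR convolution, output truncated to the input length."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * signal[n - k]
        out.append(acc)
    return out

def build_playback_signals(target, masking, target_filters, masking_filters):
    """Return the N signals to be played: for each device, the filtered
    target signal superposed with the filtered masking signal."""
    channels = []
    for h_target, h_mask in zip(target_filters, masking_filters):
        t_out = fir_filter(target, h_target)   # first sound field control
        m_out = fir_filter(masking, h_mask)    # second sound field control
        channels.append([a + b for a, b in zip(t_out, m_out)])
    return channels
```

With N = 2 devices and trivial one-tap filters, each output channel is simply the sum of the (scaled) target and masking signals, one channel per playing device.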
In one possible implementation, an electronic device includes a first filter bank and a second filter bank; performing first sound field control processing on the target audio signals to obtain N paths of target output signals corresponding to N target audio playing devices one by one, wherein the first sound field control processing comprises the following steps: performing first sound field control processing on the target audio signal through a first filter bank to obtain N paths of target output signals; performing second sound field control processing on the masking signals to obtain N paths of masking output signals corresponding to N target audio playing devices one by one, wherein the second sound field control processing comprises the following steps: and performing second sound field control processing on the masking signals through a second filter bank to obtain N paths of masking output signals.
That is, the first filter bank is the filter bank used for sound field control of the target audio signal, and the second filter bank is the filter bank used for sound field control of the masking signal. The first filter bank includes N first filters in one-to-one correspondence with the N target audio playing devices; the second filter bank includes N second filters in one-to-one correspondence with the N target audio playing devices.
In a possible implementation, the first filter bank is generated by taking the first spatial region as the bright zone and the second spatial region as the dark zone, performing a sound field control test or simulation, maximizing the ratio of the sound field energy in the bright zone to that in the dark zone, and solving for the filter coefficients. The second filter bank is generated by taking the second spatial region as the bright zone and the first spatial region as the dark zone, performing a sound field control test or simulation, maximizing the ratio of the sound field energy in the bright zone to that in the dark zone, and solving for the filter coefficients.
That is, the bright and dark zones are swapped between the first filter bank and the second filter bank when solving for their coefficients. The first filter bank generated in this way makes the energy of the sound field formed by the target audio signal in the first spatial region (the in-ear sound field of the target audio signal) large and the energy of the sound field formed by the target audio signal in the second spatial region (the leaky sound field of the target audio signal) small. The second filter bank generated in this way makes the energy of the sound field formed by the masking signal in the first spatial region (the in-ear sound field of the masking signal) small and the energy of the sound field formed by the masking signal in the second spatial region (the leaky sound field of the masking signal) large. Therefore, after the target audio signal and the masking signal are superposed, the intelligibility of the finally output sound in the in-ear sound field is unaffected while its intelligibility in the leaky sound field is reduced, so user privacy is effectively protected and user experience improved without affecting the user's own listening experience.
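Maximizing the bright-to-dark-zone energy ratio as described above corresponds to the standard acoustic contrast control formulation; the following is a hedged sketch of that formulation, not the patent's own derivation, and the symbols ($\mathbf{w}$, $\mathbf{G}_B$, $\mathbf{G}_D$, $\lambda$) are illustrative.

```latex
% w: stacked filter coefficients for the N playing devices;
% G_B, G_D: acoustic transfer matrices from the devices to the sample
% points of the bright zone and the dark zone, respectively.
\[
  \mathbf{w}^{\star}
  = \arg\max_{\mathbf{w}}
    \frac{\mathbf{w}^{H}\,\mathbf{G}_B^{H}\mathbf{G}_B\,\mathbf{w}}
         {\mathbf{w}^{H}\,\mathbf{G}_D^{H}\mathbf{G}_D\,\mathbf{w}}
\]
% With a small regularizer lambda for numerical stability, w* is the
% dominant eigenvector of the matrix below:
\[
  \left(\mathbf{G}_D^{H}\mathbf{G}_D + \lambda\mathbf{I}\right)^{-1}
  \mathbf{G}_B^{H}\mathbf{G}_B
\]
```

Swapping $\mathbf{G}_B$ and $\mathbf{G}_D$ in this ratio yields the second filter bank, mirroring the bright/dark-zone exchange described above.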
In a possible implementation, the spectrum of the masking signal is consistent with the spectrum of the target audio signal. The masking signal then masks the target audio signal more effectively, and the protection of the user's privacy is correspondingly better.
In a possible implementation, generating a masking signal from the target audio signal includes: generating, from the target audio signal, an initial masking signal whose spectrum is consistent with that of the target audio signal; and performing spectrum correction on the initial masking signal to obtain the masking signal.
In this implementation, the masking signal is obtained by spectrum-correcting the initial masking signal before sound field control is applied to it, so the spectral distortion introduced during sound field control can be compensated. This improves the masking effect and hence the final privacy protection.
In one possible implementation, generating, from the target audio signal, an initial masking signal whose spectrum is consistent with that of the target audio signal includes: buffering the target audio signal for a duration equal to a preset masking frame length to obtain the signal to be masked of the current frame; windowing the signal to be masked of the current frame and then time-reversing it to obtain the reversed signal of the current frame; acquiring the reversed signal of the previous frame, i.e., the most recent reversed signal before that of the current frame; superposing the reversed signal of the current frame and the reversed signal of the previous frame to obtain the basic masking signal of the current frame; and adjusting the energy of the basic masking signal of the current frame according to a preset energy ratio threshold and the energy of the signal to be masked of the current frame to obtain the initial masking signal of the current frame.
Optionally, the preset masking frame length is greater than the frame length of the target audio signal; for example, the preset masking frame length may be a value between 50 ms and 300 ms.
In this implementation, the target audio signal is processed in real time to obtain the initial masking signal, so the spectrum of the finally obtained masking signal matches that of the target audio signal more closely, which improves the masking effect and hence the privacy protection. Moreover, superposing the reversed signal of the current frame with that of the previous frame improves the smoothness of the resulting basic masking signal, and thus the smoothness of the finally obtained masking signal.
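The window/reverse/overlap steps above can be sketched as follows. This is a minimal illustration assuming a Hann window; the window choice and function names are assumptions, not specified by the patent.

```python
import math

def hann_window(n):
    """Symmetric Hann window of length n (assumes n >= 2)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def basic_masking_frame(frame_to_mask, prev_reversed):
    """Window the current frame's signal to be masked, time-reverse it, and
    superpose it with the previous frame's reversed signal. Returns the
    basic masking signal and the current reversed signal (to be carried
    into the next call as prev_reversed)."""
    window = hann_window(len(frame_to_mask))
    windowed = [s * w for s, w in zip(frame_to_mask, window)]
    reversed_frame = windowed[::-1]                      # time reversal
    basic = [a + b for a, b in zip(reversed_frame, prev_reversed)]
    return basic, reversed_frame
```

Time reversal preserves the frame's magnitude spectrum while scrambling its temporal structure, which is why the resulting signal is spectrally consistent with the target yet unintelligible.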
In a possible implementation manner, according to a preset energy ratio threshold and the energy of a signal to be masked of a current frame, adjusting the energy of a basic masking signal of the current frame to obtain an initial masking signal of the current frame, including: adjusting the energy of the current frame basic masking signal to obtain a current frame masking adjustment signal, wherein the energy ratio of the current frame to-be-masked signal and the current frame masking adjustment signal is equal to a preset energy ratio threshold; respectively determining whether valid signals exist in a signal to be masked of the current frame and a masking adjustment signal of the current frame; if no effective signal exists in the signal to be masked of the current frame, reducing the amplitude of the masking adjustment signal of the current frame to obtain an initial masking signal of the current frame, or taking 0 or a null value as the initial masking signal of the current frame; if the effective signal exists in the signal to be masked of the current frame and the effective signal does not exist in the masking adjustment signal of the current frame, acquiring a preset signal from a preset masking signal library, and adjusting the frequency spectrum of the preset signal according to the frequency spectrum of the signal to be masked of the current frame to obtain the initial masking signal of the current frame.
In this implementation, if no valid signal exists in the signal to be masked of the current frame, the amplitude of the masking adjustment signal of the current frame is reduced to obtain the initial masking signal of the current frame, or 0 or a null value is used as the initial masking signal, and the frame is subsequently not masked. This prevents silent or very quiet portions of the target audio signal from being over-masked, which would affect the user's listening experience. If a valid signal exists in the signal to be masked of the current frame but no valid signal exists in the masking adjustment signal of the current frame, the masking adjustment signal is a silent or very quiet signal that cannot effectively mask the frame. In that case, a preset signal is obtained from a preset masking signal library and its spectrum is adjusted to be consistent with that of the signal to be masked of the current frame, yielding the initial masking signal of the current frame. This prevents ineffective masking and enhances the masking effect.
In a possible implementation manner, determining whether valid signals exist in the signal to be masked of the current frame and the masking adjustment signal of the current frame respectively includes: determining whether the energy of the signal to be masked of the current frame is greater than or equal to a first energy threshold; if the energy of the signal to be masked of the current frame is larger than or equal to a first energy threshold value, determining that an effective signal exists in the signal to be masked of the current frame; if the energy of the signal to be masked of the current frame is smaller than a first energy threshold, determining that no effective signal exists in the signal to be masked of the current frame; determining whether the energy of the current frame masking adjustment signal is greater than or equal to a second energy threshold; if the energy of the current frame masking adjustment signal is greater than or equal to a second energy threshold, determining that a valid signal exists in the current frame masking adjustment signal; if the energy of the current frame masking adjustment signal is less than the second energy threshold, determining that no valid signal exists in the current frame masking adjustment signal.
In a possible implementation, generating a masking signal from a target audio signal includes: acquiring a preset signal from a preset masking signal library; and adjusting the frequency spectrum of the preset signal according to the frequency spectrum of the target audio signal to obtain a masking signal.
In this implementation, the masking signal is generated from a signal obtained from the preset masking signal library, so the masking signal can be obtained simply and quickly, which simplifies the algorithm.
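The spectrum adjustment of the preset signal can be sketched as imposing the target signal's magnitude spectrum on the preset signal while keeping the preset signal's phase. A plain DFT is used here for self-containedness; the approach and names are illustrative assumptions, not the patent's specified algorithm.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def match_spectrum(preset, target):
    """Impose the target signal's magnitude spectrum on the preset signal,
    keeping the preset signal's phase in each frequency bin."""
    p_spec, t_spec = dft(preset), dft(target)
    shaped = []
    for p, t in zip(p_spec, t_spec):
        phase = p / abs(p) if abs(p) > 1e-12 else 1.0
        shaped.append(abs(t) * phase)    # target magnitude, preset phase
    return idft(shaped)
```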
In a second aspect, the present application provides an apparatus, which is included in an electronic device, and which has a function of implementing the electronic device behavior in the first aspect and possible implementations of the first aspect. The functions may be realized by hardware, or may be realized by hardware executing corresponding software. The hardware or software includes one or more modules or units corresponding to the functions described above. Such as a receiving module or unit, a processing module or unit, etc.
In a third aspect, the present application provides an electronic device, the electronic device comprising: a processor, a memory, and an interface; the processor, the memory and the interface cooperate with each other such that the electronic device performs any one of the methods of the technical solutions of the first aspect.
In a fourth aspect, the present application provides a chip comprising a processor. The processor is configured to read and execute a computer program stored in the memory to perform the method of the first aspect and any possible implementation thereof.
Optionally, the chip further comprises a memory, and the memory is connected with the processor through a circuit or a wire.
Further optionally, the chip further comprises a communication interface.
In a fifth aspect, the present application provides a computer readable storage medium, in which a computer program is stored, which when executed by a processor causes the processor to perform any one of the methods of the first aspect.
In a sixth aspect, the present application provides a computer program product comprising: computer program code which, when run on an electronic device, causes the electronic device to perform any one of the methods of the solutions of the first aspect.
Drawings
Fig. 1 is a schematic structural diagram of an electronic device 100 according to an embodiment of the present application;
Fig. 2 is a block diagram of an example audio system architecture of the electronic device 100 according to an embodiment of the present application;
Fig. 3 is a flowchart of an example audio signal processing method according to an embodiment of the present application;
Fig. 4 is a schematic diagram of an example audio signal processing method according to an embodiment of the present application;
Fig. 5 is a flowchart of another example audio signal processing method according to an embodiment of the present application;
Fig. 6 is a flowchart of another audio signal processing method according to an embodiment of the present application;
Fig. 7 is a schematic diagram of an example sound field sampling point distribution according to an embodiment of the present application;
Fig. 8 is a diagram of the sound field correspondence between the first filter bank and the second filter bank when their coefficients are solved.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings. In the description of the embodiments of the present application, unless otherwise indicated, "/" denotes an "or" relationship; for example, A/B may represent A or B. "And/or" merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, in the description of the embodiments of the present application, "plurality" means two or more.
The terms "first," "second," "third," and the like are used below for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined with "first," "second," or "third" may explicitly or implicitly include one or more of that feature.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more, but not all, embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
For a better understanding of the embodiments of the present application, terms or concepts that may be referred to in the embodiments are explained below.
A sound field (sound field) refers to the space occupied by an elastic medium in which sound waves exist, i.e., the portion of the medium in which sound waves propagate.
Sound field control refers to reproducing a spatial sound field, or controlling the spatial distribution of a sound field, by adjusting and controlling the amplitude and phase of the input signal of each audio playing device.
The target-to-mask ratio (TMR) is the energy ratio of the target signal to the masking signal.
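As a small worked example, the TMR can be expressed on a decibel scale; the function name and frame representation below are illustrative.

```python
import math

def tmr_db(target_frame, masking_frame):
    """Target-to-mask ratio in dB: energy ratio of the target signal to the
    masking signal, with energy taken as the sum of squared samples."""
    e_target = sum(s * s for s in target_frame)
    e_mask = sum(s * s for s in masking_frame)
    return 10.0 * math.log10(e_target / e_mask)
```

For instance, a target frame with four times the energy of the masking frame gives a TMR of about 6 dB.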
The audio signal processing method provided by the embodiment of the application can be applied to electronic equipment with a receiver and/or a loudspeaker, such as a mobile phone, a tablet personal computer, wearable equipment, a personal digital assistant (personal digital assistant, PDA) and the like. Or, the method for processing the audio signal provided by the embodiment of the application can also be applied to audio playing equipment such as headphones and the like. The embodiment of the application does not limit the specific type of the electronic equipment, and the mobile phone is mainly used as an example for the following description.
Fig. 1 is a schematic structural diagram of an electronic device 100 according to an embodiment of the present application. The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display 194, and a subscriber identity module (subscriber identification module, SIM) card interface 195, etc. The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It is to be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a memory, a video codec, a digital signal processor (digital signal processor, DSP), an audio processor (audio signal processor, ADSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
The controller may be a neural hub and a command center of the electronic device 100, among others. The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache. The memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from this memory, avoiding repeated accesses and reducing the waiting time of the processor 110, thereby improving system efficiency.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (subscriber identity module, SIM) interface, and/or a universal serial bus (universal serial bus, USB) interface, among others.
The I2S interface may be used for audio communication. In some embodiments, the processor 110 may contain multiple sets of I2S buses. The processor 110 may be coupled to the audio module 170 via an I2S bus to enable communication between the processor 110 and the audio module 170. In some embodiments, the audio module 170 may transmit an audio signal to the wireless communication module 160 through the I2S interface, to implement a function of answering a call through the bluetooth headset.
PCM interfaces may also be used for audio communication to sample, quantize and encode analog signals. In some embodiments, the audio module 170 and the wireless communication module 160 may be coupled through a PCM bus interface. In some embodiments, the audio module 170 may also transmit audio signals to the wireless communication module 160 through the PCM interface to implement a function of answering a call through the bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.
The UART interface is a universal serial data bus for asynchronous communications. The bus may be a bi-directional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is typically used to connect the processor 110 with the wireless communication module 160. For example: the processor 110 communicates with a bluetooth module in the wireless communication module 160 through a UART interface to implement a bluetooth function. In some embodiments, the audio module 170 may transmit an audio signal to the wireless communication module 160 through a UART interface, to implement a function of playing music through a bluetooth headset.
It should be understood that the interfacing relationship between the modules illustrated in the embodiments of the present application is only illustrative, and does not limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also use different interfacing manners, or a combination of multiple interfacing manners in the foregoing embodiments.
The wireless communication function of the electronic device may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like. The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device may be used to cover a single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed into a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G, etc. applied on an electronic device. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA), etc. The mobile communication module 150 may receive electromagnetic waves from the antenna 1, perform processes such as filtering, amplifying, and the like on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation. The mobile communication module 150 can amplify the signal modulated by the modem processor, and convert the signal into electromagnetic waves through the antenna 1 to radiate. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be provided in the same device as at least some of the modules of the processor 110.
The modem processor may include a modulator and a demodulator. The modulator is used for modulating the low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing. The low-frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs sound signals through an audio playing device (such as, but not limited to, the speaker 170A, the receiver 170B, etc.), or displays images or video through the display screen 194. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be independent of the processor 110 and provided in the same device as the mobile communication module 150 or another functional module.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the electronic device 100. The external memory card communicates with the processor 110 through an external memory interface 120 to implement data storage functions. For example, files such as music, video, etc. are stored in an external memory card.
The internal memory 121 may be used to store computer-executable program code that includes instructions. The processor 110 executes various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121. The internal memory 121 may include a storage program area and a storage data area. The storage program area may store an application program (such as a sound playing function, an image playing function, etc.) required for at least one function of the operating system, etc. The storage data area may store data created during use of the electronic device 100 (e.g., audio data, phonebook, etc.), and so on. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (universal flash storage, UFS), and the like.
The electronic device 100 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, and ADSP, etc. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or a portion of the functional modules of the audio module 170 may be disposed in the processor 110.
The speaker 170A, also referred to as a "horn," is used to convert audio electrical signals into sound signals. The electronic device 100 may play music through the speaker 170A, or answer a call or voice message hands-free, i.e., play audio in the hands-free mode. Optionally, the speaker 170A may also be used to play audio in the handheld mode.
The receiver 170B, also referred to as an "earpiece", is used to convert an audio electrical signal into a sound signal. When the electronic device 100 answers a call or voice message in the handheld mode, the user may listen by placing the receiver 170B close to the ear. That is, the receiver 170B may be used to play audio in the handheld mode.
It will be appreciated that in some embodiments, the electronic device 100 may not include the receiver 170B, but only the speaker 170A. In this case, the speaker 170A may be used to play audio in both the handheld mode and the hands-free mode. Such a speaker is also known as a "two-in-one speaker".
In the embodiment of the present application, the number of speakers 170A and/or receivers 170B that can be used to play audio in the handheld mode is at least two. At least two speakers 170A and/or receivers 170B are located at different locations of the electronic device 100.
The earphone interface 170D is used to connect a wired earphone. The headset interface 170D may be the USB interface 130, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
Fig. 2 is an audio system architecture diagram of an example electronic device 100 according to an embodiment of the present application. It should be understood that the hierarchical architecture of the system may include several layers, each layer having a distinct role and division of work, and that the layers communicate via software interfaces.
In the embodiment of the application, an Android system-based Audio (Audio) architecture is referred to as an "Android Audio system architecture" or an "Android Audio framework", and is hereinafter referred to as an "Audio system" for short. Specifically, the audio system can realize the functions of audio data collection and output, audio stream control, audio equipment management, volume adjustment and the like.
It should be appreciated that the electronic device 100 may play audio for different applications based on the audio system. The external playback interface of the audio system is the audio track (AudioTrack): each application creates a corresponding audio track in the application framework layer, and each audio track can further call the audio system engine (AudioFlinger) to realize audio playing. The following is a detailed description with reference to fig. 2.
Illustratively, as shown in fig. 2, the Android audio system architecture of the embodiment of the present application includes an application layer, an application framework layer (framework), a hardware abstraction layer, and a kernel layer from top to bottom.
As shown in fig. 2, the application layer may include a series of application packages, such as application packages corresponding to applications such as conversations, instant messaging, music, etc. The instant messaging application may be, for example, any of a variety of instant messaging applications.
The application framework layer may provide an application programming interface (application programming interface, API) and programming framework, i.e., an "audio system framework," for the application of the application layer. The application framework layer may also include some predefined functions, classes, or interfaces.
As shown in fig. 2, in the embodiment of the present application, the application framework layer may be further divided into a Java layer and a Native (Native) layer. Specifically, the Java layer may include application programming interfaces (application programming interface, API) of Java classes; in other words, the Java layer may include Java APIs corresponding to a plurality of different applications for controlling playback of audio streams, such as the Java API of the telephony application, the Java API of the instant messaging application, the Java API of the music application, and the like.
The Native layer includes the Audio track (AudioTrack) corresponding to each application, the audio system engine (AudioFlinger), and the audio policy service (Audio Policy Service). The Native layer can parse and compile the rc file. Illustratively, when a user connects to a call, the audio system is triggered, and in particular the audio service (Audio server) of the audio system is triggered. The configuration file of the audio service is audioserver.rc. The Native layer may parse the audioserver.rc file and compile it into the audio service executable program. Each application may create a corresponding Audio track at the Native layer, and the Audio track may output the decoded audio data.
The audio service may load the audio system engine at the start-up of the audio system. In this process, the audio service may call the audio system engine and the audio policy service, respectively, and initialize the audio system engine and the audio policy service.
It should be understood that the audio policy service and the audio system engine are the two major basic services of the audio system. The audio policy service is the maker of audio system policies, and is responsible for policy selection for audio device switching, the volume adjustment policy, and the like. For example, the audio policy service may play a routing role: it has the right to decide to which audio playing device (speaker, headphone, earpiece, etc.) the audio stream generated by a certain Audio Track is finally transmitted. For example, in the embodiment of the present application, when the audio policy service determines that the call mode is the handheld mode according to the selection of the user in the call application, it decides that the audio playing devices are the speakers and/or receivers used for playing audio in the handheld mode (hereinafter referred to as target audio playing devices), and then the audio stream generated by the Audio Track corresponding to the call application is finally transmitted to the target audio playing devices.
The audio system engine is the executor of the audio system's playback policies, and is responsible for managing the devices for input audio streams, managing the devices for output audio streams, switching audio modes, loading audio parameters, processing and transmitting audio stream data, and the like. During the initialization of the audio system engine, the default volume parameters of the audio system can be obtained, for example, the volume levels corresponding to the call stream, system stream, ring stream, alarm stream (alarm clock), notification stream, dial-key stream, and the like.
It should be appreciated that during the initialization of the audio policy service, the audio configuration in the setup may be parsed and the deployment of the entire audio system is completed using the interface provided by the audio system engine, thereby providing the underlying support for the use of audio devices by later upper layer applications.
It should also be appreciated that it is the audio policy service that actually invokes the audio system engine: it may invoke functions within the audio system engine, i.e., call the audio system engine interface, through the Binder mechanism.
For example, in the embodiment of the present application, when a certain application outputs an Audio stream, the Audio signals included in the Audio stream may be identified as a certain stream type, and the decoded Audio stream data is output to the Audio system engine through the Audio track corresponding to the application. The audio system engine may load a volume corresponding to the audio stream type based on the audio stream type.
In this embodiment of the present application, the audio system engine may further perform sound field control and psychoacoustic masking processing on the audio signal output by the corresponding Audio track when determining, according to the decision result of the audio policy service, that the voice output mode is the handheld mode, so as to output audio signals with different parameters (such as amplitude and phase) to each speaker and/or receiver.
In particular, referring to fig. 2, the audio system engine may include a mask generation module, a spectral correction module, a sound field control module, and an audio superposition module. The mask generation module is used for generating an initial mask signal according to the target audio signal. The spectrum correction module is used for performing spectrum correction on the initial masking signal to obtain a masking signal. Alternatively, the spectral correction module may be implemented by a finite impulse response (finite impulse response, FIR) filter or an infinite impulse response (infinite impulse response, IIR) filter, or the like.
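As an illustration of the IIR option mentioned above, a one-pole recursive smoother is about the simplest possible spectral-shaping filter. This is only a sketch: the coefficient value and the choice of response are illustrative assumptions, not values from this application, and a real spectrum correction module would use coefficients designed for the masking signal's target spectrum.

```python
import numpy as np

def iir_one_pole(x, a=0.3):
    """One-pole recursive (IIR) smoother: y[n] = (1 - a) * x[n] + a * y[n-1].

    The coefficient `a` here is illustrative only; real correction filters
    would be designed to keep the masking spectrum aligned with the target.
    """
    y = np.empty(len(x), dtype=float)
    prev = 0.0
    for n, xn in enumerate(x):
        prev = (1.0 - a) * xn + a * prev
        y[n] = prev
    return y

# Step response: rises from 0.7 toward 1.0, showing the low-pass character.
corrected = iir_one_pole(np.ones(8))
```

An FIR correction filter would replace the recursion with a finite convolution; the IIR form trades exact linear phase for far fewer coefficients, which matters on a power-constrained audio DSP.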
The sound field control module is used for respectively performing sound field control on the target audio signal and the masking signal. The target audio signal refers to an audio signal that needs to be output through the target audio playback device. Optionally, the target audio signal may be a call downlink signal output by a call application, a voice audio signal output by an instant messaging application, or an audio signal output by another application, for example, a recording audio signal output by a recording application. The embodiments of the present application do not set any limitation on the specific source of the target audio signal.
The sound field control module is used for controlling the sound fields of the target audio signal and the masking signal respectively, in subareas, based on the distribution positions of the target audio playing devices. In particular, the sound field control module may comprise a first filter bank and a second filter bank. The first filter bank includes N first control filters. The second filter bank comprises N second control filters. Here, N is the number of target audio playback devices shown in fig. 1, and N is an integer greater than or equal to 2. The N first control filters are in one-to-one correspondence with the N second control filters and in one-to-one correspondence with the N target audio playing devices. The N first control filters are used for respectively processing the target audio signal and outputting N paths of target output signals with different parameters, so as to realize sound field control of the target audio signal. The N second control filters in the second filter bank are used for respectively processing the masking signal and outputting N paths of masking output signals with different parameters, so as to realize sound field control of the masking signal.
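The fan-out described above — one input signal passed through N control filters to produce N device-specific output channels — can be sketched as follows. The tap values and N = 2 are illustrative assumptions; real coefficients come from the sound field control optimization.

```python
import numpy as np

def sound_field_filter_bank(signal, filter_taps):
    """Apply each of N control filters to one input signal, yielding one
    output channel per target playback device (amplitude/phase differ)."""
    return [np.convolve(signal, taps, mode="same") for taps in filter_taps]

# N = 2 playback devices; the FIR taps below are placeholders, not designed
# coefficients -- the first passes the signal through, the second scales it.
frame = np.random.default_rng(0).standard_normal(480)
bank = [np.array([1.0, 0.0]), np.array([0.0, 0.8])]
outputs = sound_field_filter_bank(frame, bank)
```

The same function serves both filter banks; only the tap sets differ, since the first bank is optimized for the target audio signal and the second for the masking signal.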
The audio superposition module is used for superposing the target audio signal output by each first control filter and the masking output signal output by the corresponding second control filter to obtain one path of signal to be played. And finally, each path of signal to be played is played by the corresponding target audio playing device.
It may be understood that in the embodiment of the present application, the spectrum correction module, the sound field control module, the audio superposition module, and the like may be software modules, or may be a module formed by combining software and hardware, which is not limited in any way.
As shown in fig. 2, in an embodiment of the present application, the hardware abstraction layer may include an Audio hardware abstraction layer (Audio hardware abstraction layer, audio HAL). The audio hardware abstraction layer may be responsible for interactions with the audio hardware devices, which may be invoked directly by the audio system engine.
The audio hardware abstraction layer can select the audio hardware device for the given type of audio stream and call the driver of that hardware device in the kernel layer, so as to drive the audio hardware device to play at the volume corresponding to the audio stream type loaded by the audio system engine.
The kernel layer is a layer between hardware and software. The kernel layer contains at least audio drivers and the like. In particular, the audio drivers may include an earpiece driver, a speaker driver, a headphone interface driver, and the like. The earpiece driver may include one or more drivers. When there is one earpiece driver, that one driver may drive at least two earpieces. When there are a plurality of earpiece drivers, each earpiece may be driven by one earpiece driver. The embodiments of the present application are not limited in any way in this regard.
In addition, the audio system may also include the audio hardware devices of the electronic device 100, which may be invoked by the corresponding drivers of the kernel layer. Hardware devices such as the speaker and the receiver of the mobile phone play audio based on the volume loaded by the audio system engine.
The audio hardware device of the electronic device 100 may include a speaker 170A and/or an earphone (receiver 170B) shown in fig. 1, a wired earphone connected through an earphone interface 170D, and a playing device capable of playing audio, which is not limited in the embodiment of the present application.
For easy understanding, the following embodiments of the present application will take an electronic device having the structure shown in fig. 1 and fig. 2 as an example, and take a call scenario as an example with reference to the accompanying drawings, to specifically describe a processing method of an audio signal provided in the embodiments of the present application.
Fig. 3 is a flowchart of an exemplary audio signal processing method provided in an embodiment of the present application, and fig. 4 is a schematic diagram of an exemplary audio signal processing method provided in an embodiment of the present application, please refer to fig. 3 and fig. 4 together, where the method includes:
s101, the call application outputs a call downlink signal to a sound field control module and a masking generation module in the audio system engine.
The call downlink signal is the audio signal, transmitted by the peer device, that the call application receives. The telephony application creates a corresponding Audio track at the application framework layer. During the call, the Audio track corresponding to the call application calls the audio system engine and outputs the call downlink signal to the audio system engine. Specifically, the call application may output the call downlink signal to the sound field control module and the mask generation module in the audio system engine, respectively.
It will be appreciated that the call downlink signal is an audio stream, which the call application can output to the sound field control module frame by frame. In a specific embodiment, the frame length of the call downlink signal may be, for example, 1 ms or 20 ms.
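A rough sketch of cutting the downlink audio stream into fixed-length frames follows; the 48 kHz sample rate and the dropping of a trailing partial frame are assumptions for illustration, not details from this application.

```python
import numpy as np

def split_into_frames(samples, sample_rate=48000, frame_ms=20):
    """Split a downlink audio stream into fixed-length frames.

    At an assumed 48 kHz, a 20 ms frame holds 960 samples; trailing samples
    that do not fill a whole frame are simply dropped in this sketch.
    """
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    return samples[: n_frames * frame_len].reshape(n_frames, frame_len)

# One second of audio -> 50 frames of 960 samples each.
frames = split_into_frames(np.zeros(48000))
```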
S102, performing sound field control on the call downlink signals by a first filter bank in the sound field control module to obtain N paths of downlink output signals.
It can be appreciated that when a user makes a call using the target audio playback device, the sound field can be divided into an in-ear sound field and a leaky sound field. The in-ear sound field, also called the ear canal sound field or target sound field, refers to the sound field within a preset range in and around the user's ear canal. The leaky sound field, also referred to as the external sound field, refers to the sound field outside the in-ear sound field (on the side away from the target audio playback device). In this embodiment of the present application, the sound of the in-ear sound field is the sound that the user needs to hear, i.e., the sound of interest to the user, so the intelligibility of the in-ear sound field sound should be as high as possible. The sound of the leaky sound field is leaked sound that surrounding persons are not intended to hear, so the intelligibility of the leaky sound field sound should be as low as possible.
Based on this, in the embodiment of the present application, the area around the electronic device is divided into the in-ear area (also referred to as a first spatial area) and the sound leakage area (also referred to as a second spatial area) in advance. The in-ear region is a region corresponding to an in-ear sound field, namely a region in a preset range around a target audio playing device of the electronic equipment. The sound leakage area is the area corresponding to the sound leakage field, namely the area outside the in-ear area (the side far away from the target audio playing device). It is understood that the specific ranges of the in-ear region and the leaky-tone region can be divided according to actual needs.
The sound field control module performs zoned sound field control over the in-ear region and the sound leakage region. In this step, the first filter bank takes the in-ear area as the bright area and the sound leakage area as the dark area; a sound field control test or simulation is performed, and the filter coefficients are solved by maximizing the energy ratio of the bright area to the dark area, thereby generating the filter bank. The first filter bank can control the sound field of the call downlink signal, so that the in-ear region generates a call downlink signal sound field with energy as large as possible, and the sound leakage region generates a call downlink signal sound field with energy as small as possible. That is, the first filter bank controls the sound field of the call downlink signal such that the energy ratio of the in-ear sound field formed in the in-ear region to the leaky sound field formed in the sound leakage region is as large as possible. Therefore, the intelligibility and clarity of the call sound heard by the user are high while those of the call sound heard by surrounding people are low, which effectively prevents voice leakage and protects the call privacy of the user.
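In the sound field control literature, the bright-area/dark-area energy-ratio maximization described above is commonly posed as a generalized Rayleigh-quotient problem (often called acoustic contrast control), whose solution is the dominant eigenvector of G_d^-1 G_b. A minimal numerical sketch with hypothetical acoustic transfer matrices:

```python
import numpy as np

def contrast_control_weights(H_bright, H_dark, reg=1e-6):
    """Filter weights maximizing bright-area vs. dark-area sound energy.

    G_b and G_d are spatial correlation matrices built from (hypothetical)
    acoustic transfer functions; the optimal weights are the dominant
    eigenvector of G_d^-1 G_b.
    """
    G_b = H_bright.conj().T @ H_bright
    # Small regularization keeps G_d invertible.
    G_d = H_dark.conj().T @ H_dark + reg * np.eye(H_dark.shape[1])
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(G_d, G_b))
    return eigvecs[:, np.argmax(eigvals.real)].real

# Toy setup: 4 control points in the bright area, 3 in the dark area,
# N = 2 playback devices. The transfer matrices are random stand-ins.
rng = np.random.default_rng(0)
H_b = rng.standard_normal((4, 2))
H_d = rng.standard_normal((3, 2))
w = contrast_control_weights(H_b, H_d)
```

Swapping the roles of `H_bright` and `H_dark` gives the second filter bank's coefficients, exactly the bright/dark exchange the application describes for the masking signal.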
Specifically, referring to fig. 4, the sound field control module includes a first filter bank and a second filter bank. The first filter bank includes N first control filters: first control filter 1, first control filter 2, … …, and first control filter N. The second filter bank includes N second control filters: second control filter 1, second control filter 2, … …, and second control filter N. The first control filter 1 corresponds to the second control filter 1; both correspond to the target audio playing device 1 and are used for controlling the amplitude, the phase, and the like of the audio signal input to the target audio playing device 1. The first control filter 2 corresponds to the second control filter 2; both correspond to the target audio playing device 2 and are used for controlling the amplitude, the phase, etc. of the audio signal input to the target audio playing device 2; … …; the first control filter N corresponds to the second control filter N; both correspond to the target audio playing device N and are used for controlling the amplitude, the phase, etc. of the audio signal input to the target audio playing device N.
Referring to fig. 4, the call application may input the call downlink signals to the respective first control filters in the first filter bank, where a signal obtained by processing the call downlink signals by the first control filter 1 is referred to as a downlink output signal 1, a signal obtained by processing the call downlink signals by the first control filter 2 is referred to as a downlink output signal 2 … …, and a signal obtained by processing the call downlink signals by the first control filter N is referred to as a downlink output signal N.
That is, the N first control filters in the first filter bank are used for performing sound field control on the call downlink signal, so that parameters (including amplitude and phase) of the call downlink signal finally output by different target audio playing devices are different, and the energy ratio of the in-ear sound field and the leaky sound field of the call downlink signal is as large as possible.
S103, the sound field control module outputs N paths of downlink output signals to the audio superposition module respectively.
S104, the masking generation module generates an initial masking signal according to the call downlink signal, and the frequency spectrum of the initial masking signal is consistent with that of the call downlink signal.
Optionally, the masking generating module may acquire a preset signal from a preset masking signal library, and adjust a frequency spectrum of the preset signal according to the downlink call signal, so as to obtain an initial masking signal. Alternatively, the preset signal in the preset masking signal library may be, for example, white noise, music, or sea wave sound, which is not limited in the embodiment of the present application.
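One plausible way to adjust a preset signal's spectrum to follow the call downlink signal — keep the preset's phase and impose the reference's magnitude spectrum — can be sketched as follows. This construction is an assumption for illustration, not the method prescribed by the application.

```python
import numpy as np

def shape_to_spectrum(preset, reference):
    """Reshape a preset masking signal's spectrum to match a reference.

    Keeps the preset's phase and replaces its FFT magnitude with that of
    the reference frame (white noise and a stand-in downlink frame below).
    """
    P = np.fft.rfft(preset)
    R = np.fft.rfft(reference)
    phase = np.exp(1j * np.angle(P))
    return np.fft.irfft(np.abs(R) * phase, n=len(preset))

noise = np.random.default_rng(1).standard_normal(960)   # white-noise preset
speech = np.random.default_rng(2).standard_normal(960)  # stand-in downlink frame
initial_masking = shape_to_spectrum(noise, speech)
```

The result keeps the preset's noise-like character while its magnitude spectrum tracks the downlink frame, which is what makes the masker perceptually effective against that frame.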
Optionally, the masking generation module may also process the call downlink signal to generate an initial masking signal, which will be described in further detail in the following embodiments.
It can be understood that in the embodiment of the present application, the spectrum agreement may be that the spectrums are identical, or that the spectrum correlation is higher than a preset threshold.
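The looser criterion — spectral correlation above a preset threshold — might be checked as follows. The 0.9 threshold and the use of magnitude-spectrum correlation are assumptions; the application does not fix a specific measure.

```python
import numpy as np

def spectra_consistent(a, b, threshold=0.9):
    """Treat two signals' spectra as 'consistent' when the correlation
    coefficient of their magnitude spectra exceeds a preset threshold
    (0.9 here is an assumed value, not one given in the application)."""
    A = np.abs(np.fft.rfft(a))
    B = np.abs(np.fft.rfft(b))
    corr = np.corrcoef(A, B)[0, 1]
    return corr >= threshold

# A signal is trivially consistent with a scaled copy of itself.
x = np.sin(2 * np.pi * 440 * np.arange(960) / 48000)
```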
S105, the mask generation module outputs an initial mask signal to the spectrum correction module.
S106, the spectrum correction module performs spectrum correction on the initial masking signal to obtain a masking signal.
And S107, the spectrum correction module outputs the masking signal to the sound field control module.
S108, performing sound field control on the masking signals by a second filter bank in the sound field control module to obtain N paths of masking output signals.
In this step, the second filter bank takes the sound leakage area as the bright area and the in-ear area as the dark area; a sound field control test or simulation is performed, the sound field energy ratio of the bright area to the dark area is maximized, and the filter coefficients are solved to generate the filter bank. That is, relative to the first filter bank, the second filter bank exchanges the bright and dark areas during sound field control. In this way, the second filter bank performs sound field control on the masking signal, so that the in-ear region generates a masking signal sound field with energy as small as possible, and the sound leakage region generates a masking signal sound field with energy as large as possible. That is, the second filter bank controls the sound field of the masking signal such that the energy ratio of the leaky sound field of the masking signal to its in-ear sound field is as large as possible. Thus, the final masking signal masks the call downlink signal only weakly in the in-ear area, so the intelligibility and clarity of the call sound heard by the user are not affected, while it masks the call downlink signal strongly in the sound leakage area, effectively reducing the intelligibility and clarity of the voice there, thereby effectively preventing voice leakage and protecting the call privacy of the user.
Specifically, referring to fig. 4, the spectrum correction module may input the masking signals to respective second control filters in the second filter bank, wherein a signal obtained by processing the masking signals by the second control filter 1 is referred to as a masking output signal 1, a signal obtained by processing the masking signals by the second control filter 2 is referred to as a masking output signal 2 … …, and a signal obtained by processing the masking signals by the second control filter N is referred to as a masking output signal N.
That is, the N second control filters in the second filter bank are used to perform sound field control on the masking signal, so that the parameters (including amplitude and phase) of the masking signal finally output by different target audio playing devices are different, and thus the energy ratio of the leaky sound field formed by the masking signal in the sound leakage area to the in-ear sound field formed by the masking signal in the in-ear area is as large as possible.
S109, the sound field control module outputs N paths of masking output signals to the audio superposition module respectively.
S110, the audio superposition module superposes the downlink output signals and the corresponding masking output signals to obtain signals to be played corresponding to each target audio playing device.
As described above, the N first control filters are in one-to-one correspondence with the N second control filters. Thus, the N downlink output signals are in one-to-one correspondence with the N masking output signals.
Specifically, referring to fig. 4, the audio superposition module superimposes the downlink output signal 1 and the masking output signal 1 to obtain a signal to be played 1 corresponding to the target audio playing device 1; superimposes the downlink output signal 2 and the masking output signal 2 to obtain a signal to be played 2 corresponding to the target audio playing device 2; … …; and superimposes the downlink output signal N and the masking output signal N to obtain a signal to be played N corresponding to the target audio playing device N.
It can be appreciated that the sound field control module processes the call downlink signal and the masking signal frame by frame, while the audio superposition module may superpose multiple frames at a time. Specifically, the frame length of the masking signal is generally greater than that of the call downlink signal, and may be n times the latter, where n is a positive integer; the frame length of the masking output signal is thus n times that of the downlink output signal. When superposing signals, the audio superposition module therefore superimposes the downlink output signals corresponding to n frames of the call downlink signal with the masking output signal corresponding to those same n frames. For example, the downlink output signals corresponding to the 1st to 50th frames of the call downlink signal are superimposed with the masking output signal corresponding to those frames, so as to obtain the signal to be played.
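The superposition of n downlink-output frames with one masking-output frame can be sketched as follows (illustrative Python with toy frame sizes; the function name is hypothetical):

```python
def superpose(downlink_frames, masking_frame):
    """Overlay n downlink-output frames onto one masking-output frame.

    downlink_frames: list of n equal-length sample lists whose total
    length matches masking_frame (the masking frame length is n times
    the downlink frame length). Returns the signal to be played.
    """
    flat = [s for frame in downlink_frames for s in frame]
    assert len(flat) == len(masking_frame)
    return [a + b for a, b in zip(flat, masking_frame)]

# n = 2 downlink frames of 3 samples vs one 6-sample masking frame.
to_play = superpose([[1, 2, 3], [4, 5, 6]], [10, 10, 10, 10, 10, 10])
```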
As described above, the first filter bank makes the energy ratio of the in-ear sound field formed by the call downlink signal in the in-ear area to the leaky sound field it forms in the sound leakage area as large as possible; that is, the energy of the in-ear sound field formed by the call downlink signal is greater than the energy of its leaky sound field. The second filter bank makes the energy ratio of the leaky sound field formed by the masking signal in the sound leakage area to the in-ear sound field it forms in the in-ear area as large as possible; that is, the energy of the in-ear sound field formed by the masking signal is smaller than the energy of its leaky sound field. Therefore, for the signal to be played obtained by superposing the call downlink signal and the masking signal, the energy of the sound field formed in the in-ear area is far greater than the energy of the sound field formed in the sound leakage area. This preserves the hearing in the user's ear canal while reducing the intelligibility of the leaked sound, thereby protecting the privacy of the user.
S111, the audio superposition module outputs each path of signals to be played to the audio hardware abstraction layer.
S112, the audio hardware abstraction layer accesses a driver of the target audio playing device corresponding to the kernel layer, so that each target audio playing device is driven to play a corresponding signal to be played.
Specifically, referring to fig. 4, the target audio playing device 1 plays the signal to be played 1; the target audio playing device 2 plays the signal to be played 2; ...; the target audio playing device N plays the signal to be played N.
In summary, according to the audio signal processing method provided by the embodiment of the application, on one hand, the sound field control module performs sound field control on the call downlink signal, so that the energy ratio of the in-ear sound field of the call downlink signal to its leaky sound field is as large as possible. This reduces sound leakage, protects the privacy of the user, and improves the user experience. On the other hand, a masking signal is generated and subjected to sound field control by the sound field control module, so that the energy ratio of the leaky sound field of the masking signal to its in-ear sound field is as large as possible. In this way, the call downlink signal is psychoacoustically masked only to a small extent in the in-ear area and to a large extent in the sound leakage area, reducing the intelligibility of the leaked sound while minimizing the influence on the in-ear area; that is, the privacy of the user is protected while the influence on the user's hearing is reduced, improving the user experience. Thirdly, since the second control filter introduces spectral distortion during sound field control, in this embodiment the masking signal is generated by performing spectral correction on the initial masking signal before inputting it into the sound field control module, so that the spectral distortion introduced by sound field control is corrected, improving the masking effect and thus the final privacy protection.
Optionally, in some embodiments, the spectral correction of the initial masking signal may also be performed after the sound field control, which is not limited in any way by the embodiments of the present application.
It will be appreciated that the order in which the electronic device performs the steps described above is not limited, and for example, in a specific embodiment, the steps S102 and S103 to S108 may be performed simultaneously.
The generation process of the initial masking signal is explained below.
Fig. 5 is a schematic flowchart of another audio signal processing method according to an embodiment of the present application. As shown in fig. 5, the step of the masking generation module generating the initial masking signal according to the call downlink signal in step S104 may include the following steps. All of the following steps are performed by the masking generation module, which is not repeated below.
S201, buffering a call downlink signal with a duration of a preset masking frame length to obtain a signal to be masked of a current frame.
The masking frame length, i.e., the frame length of the finally generated masking signal, is preset. Optionally, the preset masking frame length may be greater than 50ms and less than 300ms, so that the masking signal has a better masking effect.
Specifically, starting from receiving the call downlink signal output by the call application, the masking generation module buffers the call downlink signal according to the preset masking frame length, and each buffered segment of the call downlink signal with a duration of the preset masking frame length is called one frame of the signal to be masked. For ease of distinction, in the embodiment of the present application, the signal to be masked obtained by the most recent buffering is referred to as the signal to be masked of the current frame (or the current-frame signal to be masked). The last frame of the signal to be masked before the current frame is referred to as the signal to be masked of the previous frame (or the previous-frame signal to be masked).
As described above, the call application may output the call downlink signal to the mask generation module frame by frame, and the frame length of the call downlink signal may be, for example, 1ms, 20ms, or the like. Taking the frame length of the call downlink signal as 1ms and the preset masking frame length as 50ms as an example, the masking generation module caches the call downlink signal according to the preset masking frame length, and after each 50ms (i.e. 50 frames) of call downlink signal is cached, a section of call downlink signal with the duration of 50ms is obtained, and the 50ms call downlink signal is called a frame of signal to be masked.
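The buffering in step S201 can be sketched as follows (illustrative Python; the class name and sample counts are assumptions mirroring the 1 ms / 50 ms example):

```python
class MaskBuffer:
    """Accumulate short call-downlink frames into masking-length frames.

    samples_per_mask_frame corresponds to the preset masking frame
    length; a hypothetical helper, not the patent's implementation.
    """
    def __init__(self, samples_per_mask_frame):
        self.n = samples_per_mask_frame
        self.buf = []

    def push(self, downlink_frame):
        """Return a full signal-to-be-masked frame, or None if more
        downlink frames are still needed."""
        self.buf.extend(downlink_frame)
        if len(self.buf) >= self.n:
            frame, self.buf = self.buf[: self.n], self.buf[self.n:]
            return frame
        return None

# Toy scale: 2-sample downlink frames, 4-sample masking frames.
buf = MaskBuffer(samples_per_mask_frame=4)
frames = [buf.push(f) for f in ([1, 2], [3, 4], [5, 6])]
```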
It will be appreciated that the signal to be masked of the current frame is a segment of the target audio signal (the call downlink signal) that is simultaneously input to the first filter bank in the sound field control module; it is the signal that the final masking signal is intended to mask.
S202, carrying out time reversal after windowing the signal to be masked of the current frame to obtain a reversal signal of the current frame.
Alternatively, a hamming window, a rectangular window, a hanning window, or the like may be used in the windowing process, which is not limited in any way in the embodiments of the present application.
The signal obtained by windowing and then time-inverting a signal to be masked is called an inverted signal. For ease of distinction, the signal obtained by windowing and time-inverting the signal to be masked of the current frame is referred to as the current-frame inverted signal, and the signal obtained by windowing and time-inverting the signal to be masked of the previous frame is referred to as the previous-frame inverted signal.
S203, the current frame inversion signal and the previous frame inversion signal are overlapped to obtain a current frame basic masking signal.
In this step, the current frame inversion signal and the previous frame inversion signal are superimposed, so that the smoothness of the obtained basic masking signal can be improved, and the smoothness of the finally obtained masking signal can be improved.
Optionally, when the current frame inversion signal is the first frame inversion signal, the previous frame inversion signal may be defaulted to be 0, and the current frame base masking signal is obtained by superposition.
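Steps S202 and S203 can be sketched as follows (illustrative Python assuming a Hann window; the patent allows other windows and does not give the exact overlap arithmetic):

```python
import math

def hann(n):
    """Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1))
            for i in range(n)]

def inverted_frame(frame):
    """Window the frame, then reverse it in time (step S202)."""
    w = hann(len(frame))
    return list(reversed([s * wi for s, wi in zip(frame, w)]))

def base_masking_frame(current_inv, previous_inv=None):
    """Superimpose current and previous inverted frames (step S203).
    With no previous frame, it defaults to zero as in the text."""
    if previous_inv is None:
        previous_inv = [0.0] * len(current_inv)
    return [a + b for a, b in zip(current_inv, previous_inv)]

inv = inverted_frame([1.0, 1.0, 1.0, 1.0])
base = base_masking_frame(inv)  # first frame: previous defaults to 0
```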
S204, adjusting the energy of the basic masking signal of the current frame according to a preset TMR (target-to-masker ratio) threshold (also called a preset energy ratio threshold) and the energy of the signal to be masked of the current frame, to obtain the initial masking signal of the current frame.
Alternatively, the preset TMR threshold may be, for example, 0dB.
It is understood that the masking effect of the masking signal on the target audio signal is related to the TMR. The energy of the basic masking signal is adjusted so that the energy ratio of the current-frame signal to be masked to the basic masking signal equals the preset threshold, which improves the masking effect of the masking signal and thus the effect of call privacy protection.
The steps S201 to S204 are repeatedly executed, so that the corresponding initial masking signal can be continuously generated according to the call downlink signal.
In this embodiment, the call downlink signal is processed in real time to obtain an initial masking signal, so that the spectrum consistency of the finally obtained masking signal and the call downlink signal is higher, the masking effect of the masking signal is improved, and the effect of protecting call privacy is improved.
As described above, a larger preset masking frame length can improve the masking effect of the masking signal. However, in an actual call, a larger preset masking frame length introduces a time delay (time difference) between the masking signal and the call downlink signal, so that the generated masking signal no longer corresponds to the spectrum of the actual call downlink signal. As a result, either the leaked sound is not masked (an ineffective masking phenomenon), or the call downlink signal the user expects to hear is excessively masked, degrading the user's hearing experience. For this reason, in the embodiment of the present application, acoustic energy detection may be combined into the generation of the initial masking signal so that it is effective and accurate, improving the masking effect of the final masking signal. This is described below with reference to the accompanying drawings.
Fig. 6 is a schematic flowchart of an audio signal processing method according to another embodiment of the present application. As shown in fig. 6, step S204, adjusting the energy of the basic masking signal of the current frame according to the preset TMR threshold and the energy of the signal to be masked of the current frame to obtain the initial masking signal of the current frame, includes:
S301, adjusting the energy of a current frame basic masking signal to obtain a current frame masking adjustment signal, wherein the energy ratio of a current frame to-be-masked signal and the current frame masking adjustment signal is equal to a preset TMR threshold.
That is, the energy of the current frame base masking signal is adjusted such that the energy ratio of the current frame to-be-masked signal to the current frame base masking signal is equal to a preset TMR threshold, and the adjusted current frame base masking signal is referred to as a current frame masking adjustment signal.
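The energy adjustment in step S301 can be sketched as follows (illustrative Python; energy as the sum of squared samples is an assumption, and the 0 dB default mirrors the example above):

```python
import math

def adjust_to_tmr(to_mask, base_mask, tmr_db=0.0):
    """Scale the base masking frame so that the target-to-masker
    energy ratio equals the preset TMR threshold (hypothetical
    helper for step S301)."""
    e_target = sum(s * s for s in to_mask)
    e_mask = sum(s * s for s in base_mask)
    tmr_lin = 10.0 ** (tmr_db / 10.0)
    gain = math.sqrt(e_target / (tmr_lin * e_mask))
    return [gain * s for s in base_mask]

# 0 dB threshold: the adjusted masker matches the target's energy.
adj = adjust_to_tmr([2.0, 0.0], [1.0, 0.0])
```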
S302, determining whether a valid signal exists in a signal to be masked of a current frame according to a call energy threshold; if not, executing step S303; if yes, step S304 is executed.
Optionally, the call energy threshold may be a preset value, or may be determined by a related module in the electronic device according to voice activity detection, which is not limited in the embodiment of the present application.
Specifically, it is judged whether the energy of the signal to be masked of the current frame is greater than or equal to the call energy threshold; if so, it is determined that a valid signal exists in the signal to be masked of the current frame; if not, it is determined that no valid signal exists in the signal to be masked of the current frame.
It can be understood that the presence of a valid signal in a frame of a sound signal indicates that the frame is loud, while its absence indicates that the frame is silent or quiet. A valid signal in the current-frame signal to be masked means the frame is loud and needs to be masked; no valid signal means the frame is silent or quiet and needs no masking, or only a lesser degree of it. If a call downlink signal without a valid signal were masked normally, the intelligibility of the call to the user would be reduced, affecting the user's hearing.
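The valid-signal decision in step S302 can be sketched as follows (illustrative Python; the threshold value and units are assumptions):

```python
def has_valid_signal(frame, energy_threshold):
    """Step S302-style check: a frame contains a valid signal when
    its energy reaches the call energy threshold (hypothetical
    helper; in practice the threshold may come from voice activity
    detection)."""
    return sum(s * s for s in frame) >= energy_threshold

loud = has_valid_signal([0.5, 0.5], 0.4)   # energy 0.5: valid
quiet = has_valid_signal([0.1, 0.1], 0.4)  # energy 0.02: not valid
```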
S303, reducing the amplitude of the current frame masking adjustment signal to obtain a current frame initial masking signal, or deleting the current frame masking adjustment signal and setting the current frame initial masking signal to be null or 0.
This step is executed if no valid signal exists in the signal to be masked of the current frame, i.e., the frame is silent or quiet and needs no masking, or only a lesser degree of it. Accordingly, the amplitude of the current-frame masking adjustment signal is reduced to obtain the current-frame initial masking signal, so that the masking signal obtained after its spectral correction masks this frame only to a small extent. Alternatively, the current-frame masking adjustment signal is deleted and the current-frame initial masking signal is set to null or 0, so that this frame is subsequently not masked at all. This prevents a silent or quiet call downlink signal from being excessively masked, which would affect the user's hearing.
S304, determining whether a valid signal exists in the current frame masking adjustment signal according to the masking energy threshold; if not, step S305 is executed, and if yes, no processing is performed.
Alternatively, the masking energy threshold may be determined based on the call energy threshold and the adjustment value of the energy in step S301. Specifically, the masking energy threshold may be a sum of the call energy threshold and the energy adjustment value. Wherein the energy adjustment value is the energy difference between the current frame masking adjustment signal and the current frame base masking signal. In step S301, if the energy of the current frame base masking signal is increased (i.e., the energy of the current frame masking adjustment signal > the energy of the current frame base masking signal), the energy adjustment value is positive; if the energy of the current frame base masking signal is reduced (i.e., the energy of the current frame masking adjustment signal < the energy of the current frame base masking signal), the energy adjustment value is negative.
Specifically, it is judged whether the energy of the current-frame masking adjustment signal is greater than or equal to the masking energy threshold; if so, it is determined that a valid signal exists in the current-frame masking adjustment signal; if not, it is determined that no valid signal exists in the current-frame masking adjustment signal.
S305, acquiring a preset signal from a preset masking signal library, and adjusting the frequency spectrum of the preset signal according to the frequency spectrum of the signal to be masked of the current frame to obtain an initial masking signal of the current frame.
That is, this step is performed if a valid signal exists in the current-frame signal to be masked but not in the current-frame masking adjustment signal. The absence of a valid signal in the current-frame masking adjustment signal indicates that it is a silent or quiet signal that cannot effectively mask this frame. Therefore, a preset signal is obtained directly from the preset masking signal library, and its spectrum is adjusted to be consistent with the spectrum of the current-frame signal to be masked, yielding the current-frame initial masking signal. This prevents the ineffective masking phenomenon and enhances the masking effect.
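The spectral adjustment in step S305 can be sketched as follows (illustrative Python; the FFT magnitude-substitution approach is one plausible realisation, since the patent only requires the spectra to be made consistent):

```python
import numpy as np

def shape_to_spectrum(preset, reference):
    """Impose the magnitude spectrum of the frame to be masked onto
    a preset masking signal while keeping the preset's phase
    (hypothetical realisation of step S305)."""
    P = np.fft.rfft(preset)
    mag_ref = np.abs(np.fft.rfft(reference))
    phase = np.exp(1j * np.angle(P))
    return np.fft.irfft(mag_ref * phase, n=len(preset))

# Shape random preset noise to the spectrum of a toy reference frame.
rng = np.random.default_rng(0)
shaped = shape_to_spectrum(rng.standard_normal(8), np.ones(8))
```

After shaping, the magnitude spectrum of `shaped` matches that of the reference frame, which is the consistency the text asks for.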
It will be appreciated that if the initial masking signal is not generated according to the process shown in steps S201 to S204, but is generated directly based on the preset signal in the preset masking signal library, in this case, acoustic energy detection may be performed according to a process similar to the process in steps S301 to S304, so as to generate an effective and accurate initial masking signal, so as to improve the masking effect of the masking signal. The difference is that in this case, the signals generated from the preset signals all have valid signals, and thus there is no case of step S305, and step S305 is not performed.
The following describes a generation process of the sound field control module.
As described above, the in-ear area and the sound leakage area may be divided in advance. When generating the sound field control module, a preset number of sound field sampling points can be set in the in-ear area and the sound leakage area respectively, and a simulation or test of sound field control is carried out based on these sampling points to solve the optimal filter coefficients, thereby obtaining the filter banks and in turn the sound field control module. Optionally, the filter coefficients may be solved by an acoustic energy contrast method, a sound pressure matching method, or the like, or by an algorithm fusing the two; this embodiment mainly takes the acoustic energy contrast method as an example.
Fig. 7 is an exemplary schematic diagram of sound field sampling point distribution according to an embodiment of the present application. As shown in fig. 7, the sampling points of the in-ear area are shown as black dots, and the sampling points are points within a preset range around the ear canal and the ear of the user when the user holds the electronic device to answer the call. The sampling points of the leaky tone region are shown as triangles. The sampling points of the leaky tone region are distributed outside the in-ear region (i.e., in a direction away from the electronic device).
In one embodiment, taking the sound field sampling points shown in fig. 7 as an example, the first filter bank and the second filter bank may be generated by solving the following procedure:
1) Solution generation of a first filter bank
And taking the in-ear area sampling point as a bright area, taking the sound leakage area sampling point as a dark area, maximizing the energy ratio of the bright area to the dark area, and solving the optimal filter coefficient to obtain the first filter bank. The specific solving process comprises the following steps:
a. setting a target frequency band of the first filter bank.
The target frequency band of the first filter bank refers to the frequency range that the first filter bank can process; for example, the target frequency band may be f1 to f2, where f1 may be, for example, 100 Hz and f2 may be, for example, 4 kHz.
b. And determining a transmission function matrix of the audio signal in the target frequency band in the bright area and the dark area.
Optionally, the transfer function matrix for transmitting audio signals in the target frequency band from each target audio playing device to the NB sampling points in the bright area (referred to as the bright-area transfer function matrix G_B) can be determined by test or simulation; likewise, the transfer function matrix for transmitting audio signals in the target frequency band from each target audio playing device to the ND sampling points in the dark area (referred to as the dark-area transfer function matrix G_D) can be determined.
Here NB is the number of bright-area sampling points and ND is the number of dark-area sampling points. In this embodiment, the bright-area sampling points are the in-ear area sampling points, and the dark-area sampling points are the sound leakage area sampling points. Taking the sound field sampling points shown in fig. 7 as an example, NB is 3 and ND is 12.
The bright-area transfer function matrix G_B then has size NFFT×NB×N, and the dark-area transfer function matrix G_D has size NFFT×ND×N, where NFFT is the number of Fourier-transform frequency points in the target frequency band and N is the number of target audio playing devices.
c. And determining a space correlation matrix of the light area and the dark area according to the transfer function matrix of the light area and the dark area.
Specifically, the bright-area spatial correlation matrix R_B is determined from the bright-area transfer function matrix G_B based on formula (1):

R_B = G_B^H · G_B    (1)

where H denotes the conjugate transpose.

Similarly, the dark-area spatial correlation matrix R_D is determined from the dark-area transfer function matrix G_D based on formula (2):

R_D = G_D^H · G_D    (2)
d. And solving the first filter bank according to the space correlation matrix of the bright area and the dark area.
Specifically, the first filter bank Q is expressed as formula (3), solved per frequency point:

Q = argmax_q (q^H · R_B · q) / (q^H · R_D · q)    (3)

i.e., at each frequency point Q is the eigenvector corresponding to the maximum eigenvalue of R_D^{-1} · R_B. It is understood that the first filter bank Q has size NFFT×N, where the first control filter corresponding to the i-th target audio playing device is Q(:, i).
2) Solution generation of a second filter bank
The sampling points of the sound leakage area are taken as the bright area and the sampling points of the in-ear area as the dark area, the energy ratio of the bright area to the dark area is maximized, and the optimal filter coefficients are solved to obtain the second filter bank. That is, when solving, the second filter bank simply swaps the bright and dark areas used for the first filter bank.
Illustratively, fig. 8 shows the sound field correspondence of the first filter bank and the second filter bank when solving for generation. As shown in fig. 8, sound field control is performed using in-ear region sampling points as bright regions and leaky sound region sampling points as dark regions, and filter coefficients are solved to obtain a first filter bank. And taking the sampling points of the sound leakage area as a bright area, taking the sampling points of the in-ear area as a dark area to perform sound field control, and solving the filter coefficients to obtain a second filter bank.
In this way, the sound field of the target audio signal (such as a call downlink signal) is controlled by the first filter bank, so that the energy of the in-ear sound field of the target audio signal is larger, and the energy of the sound leakage sound field of the target audio signal is smaller; the second filter bank is used for controlling the sound field of the masking signal, so that the energy of the sound field in the ear of the masking signal is smaller, and the energy of the sound leakage field of the masking signal is larger. After the target audio signal and the masking signal are overlapped, the in-ear sound field intelligibility of the finally output sound is not affected, and the intelligibility of the leaky sound field is reduced, so that the user privacy is effectively protected and the user experience is improved under the condition that the hearing feeling of the user is not affected.
The solving process of the second filter bank is referred to the steps a to d, and will not be described in detail.
Examples of the audio signal processing method provided in the embodiments of the present application are described above in detail. It will be appreciated that the electronic device, in order to achieve the above-described functions, includes corresponding hardware and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application in conjunction with the embodiments, but such implementation is not to be considered as outside the scope of this application.
The embodiment of the present application may divide the functional modules of the electronic device according to the above method examples, for example, may divide each function into each functional module corresponding to each function, for example, a detection unit, a processing unit, a display unit, or the like, or may integrate two or more functions into one module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present application, the division of the modules is schematic, which is merely a logic function division, and other division manners may be implemented in actual implementation.
It should be noted that, all relevant contents of each step related to the above method embodiment may be cited to the functional description of the corresponding functional module, which is not described herein.
The electronic device provided in this embodiment is configured to execute the above-described audio signal processing method, so that the same effects as those of the above-described implementation method can be achieved.
In case an integrated unit is employed, the electronic device may further comprise a processing module, a storage module and a communication module. The processing module can be used for controlling and managing the actions of the electronic equipment. The memory module may be used to support the electronic device to execute stored program code, data, etc. And the communication module can be used for supporting the communication between the electronic device and other devices.
The processing module may be a processor or a controller, which may implement or perform the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. A processor may also be a combination that performs computing functions, for example, a combination of one or more microprocessors, or a combination of a digital signal processor (digital signal processing, DSP) and a microprocessor, and the like. The storage module may be a memory. The communication module may be a radio frequency circuit, a Bluetooth chip, a Wi-Fi chip, or another device that interacts with other electronic devices.
In one embodiment, when the processing module is a processor and the storage module is a memory, the electronic device according to this embodiment may be a device having the structure shown in fig. 1.
The embodiment of the application also provides a computer readable storage medium, in which a computer program is stored, which when executed by a processor, causes the processor to execute the method for processing an audio signal according to any of the embodiments.
The present application also provides a computer program product, which when run on a computer, causes the computer to perform the above-mentioned related steps to implement the method for processing an audio signal in the above-mentioned embodiments.
In addition, embodiments of the present application also provide an apparatus, which may be specifically a chip, a component, or a module, and may include a processor and a memory connected to each other; the memory is configured to store computer-executable instructions, and when the device is running, the processor may execute the computer-executable instructions stored in the memory, so that the chip performs the method for processing an audio signal in each of the method embodiments described above.
The electronic device, the computer readable storage medium, the computer program product or the chip provided in this embodiment are used to execute the corresponding method provided above, so that the beneficial effects thereof can be referred to the beneficial effects in the corresponding method provided above, and will not be described herein.
It will be appreciated by those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and the parts shown as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated unit is implemented as a software functional unit and sold or used as a stand-alone product, it may be stored in a readable storage medium. Based on this understanding, the part of the technical solution of the embodiments of the present application that in essence contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing is merely a specific embodiment of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. A method for processing an audio signal, applied to an electronic device, the electronic device including N target audio playing devices, where N is an integer greater than or equal to 2, the method comprising:
acquiring target audio signals, wherein the target audio signals are audio signals which need to be played by the N target audio playing devices;
generating a masking signal from the target audio signal;
respectively performing sound field control processing on the target audio signals and the masking signals, and then performing superposition processing to obtain N paths of signals to be played, which are in one-to-one correspondence with the N target audio playing devices;
respectively playing the corresponding signals to be played through the N target audio playing devices; the energy of a sound field formed by the signal to be played in a first space area is larger than that of a sound field formed by the signal to be played in a second space area, wherein the first space area is closer to the N target audio playing devices than the second space area.
2. The method of claim 1, wherein the first spatial region is a spatial region within a predetermined range around the N target audio playback devices, the second spatial region does not overlap the first spatial region, and the second spatial region is located on a side of the first spatial region away from the electronic device.
3. The method according to claim 1 or 2, wherein the performing the sound field control processing on the target audio signal and the masking signal, respectively, and then performing the superposition processing, obtains N paths of signals to be played, which are in one-to-one correspondence with the N target audio playing devices, includes:
performing first sound field control processing on the target audio signals to obtain N paths of target output signals which are in one-to-one correspondence with the N target audio playing devices, wherein the energy of a sound field formed by the target output signals in the first space region is larger than that of a sound field formed by the target output signals in the second space region;
performing second sound field control processing on the masking signals to obtain N paths of masking output signals which are in one-to-one correspondence with the N target audio playing devices, wherein the energy of a sound field formed by the masking output signals in the second space region is larger than that of a sound field formed by the masking output signals in the first space region;
and superposing the target output signals corresponding to the target audio playing devices and the corresponding masking output signals to obtain the N paths of signals to be played, which are in one-to-one correspondence with the N target audio playing devices.
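As a rough illustration of the processing in claims 1 and 3, the two sound field control processes can be realised as per-device FIR filter banks whose outputs are summed channel by channel. The sketch below assumes mono input signals and already-designed filters; the function and variable names are illustrative, and the patent does not prescribe FIR filtering in particular.

```python
import numpy as np

def make_playback_signals(target, masking, target_filters, masking_filters):
    """Filter the target and masking signals through per-device filter
    banks and superpose the results into N signals to be played.

    target, masking:  1-D arrays of the same length L
    target_filters:   array of shape (N, K), one FIR filter per device
    masking_filters:  array of shape (N, K)
    returns:          array of shape (N, L), one channel per device
    """
    n_devices = target_filters.shape[0]
    length = len(target)
    out = np.zeros((n_devices, length))
    for i in range(n_devices):
        # per-device target output signal (first sound field control)
        t_out = np.convolve(target, target_filters[i])[:length]
        # per-device masking output signal (second sound field control)
        m_out = np.convolve(masking, masking_filters[i])[:length]
        out[i] = t_out + m_out  # superposition, as in claim 3
    return out
```

Each row of the result is one of the N paths of signals to be played.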
4. A method according to claim 3, characterized in that the electronic device comprises a first filter bank and a second filter bank;
the performing a first sound field control process on the target audio signals to obtain N paths of target output signals corresponding to the N target audio playing devices one by one, where the first sound field control process includes:
performing first sound field control processing on the target audio signal through the first filter bank to obtain the N paths of target output signals;
the second sound field control processing is performed on the masking signals to obtain N paths of masking output signals corresponding to the N target audio playing devices one by one, including:
and performing second sound field control processing on the masking signal through the second filter bank to obtain the N paths of masking output signals.
5. The method according to claim 4, wherein the first filter bank is a filter bank generated by taking the first spatial region as a bright region and the second spatial region as a dark region, performing a sound field control test or simulation, maximizing the ratio of the sound field energy of the bright region to that of the dark region, and solving for the filter coefficients;
and the second filter bank is a filter bank generated by taking the second spatial region as a bright region and the first spatial region as a dark region, performing a sound field control test or simulation, maximizing the ratio of the sound field energy of the bright region to that of the dark region, and solving for the filter coefficients.
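The bright-region/dark-region energy-ratio maximization described in claim 5 corresponds to what the acoustics literature calls acoustic contrast control: given transfer matrices from the N devices to control points in each region, the optimal weight vector is the leading generalized eigenvector of the two spatial correlation matrices. The sketch below is one standard way to solve it; all names, and the regularization, are assumptions rather than the patent's stated procedure.

```python
import numpy as np

def design_contrast_weights(H_bright, H_dark, reg=1e-6):
    """Find device weights w maximising the ratio of sound field energy
    in the bright region to that in the dark region.

    H_bright, H_dark: (control points x N devices) transfer matrices
                      obtained from a sound field test or simulation.
    """
    R_b = H_bright.conj().T @ H_bright          # bright-region correlation
    R_d = H_dark.conj().T @ H_dark              # dark-region correlation
    R_d = R_d + reg * np.eye(R_d.shape[0])      # regularise for invertibility
    # Generalised eigenproblem R_b w = lambda R_d w, solved by whitening:
    L = np.linalg.cholesky(R_d)
    Li = np.linalg.inv(L)
    vals, vecs = np.linalg.eigh(Li @ R_b @ Li.conj().T)
    return Li.conj().T @ vecs[:, -1]            # eigenvector of largest ratio
```

Swapping the bright and dark arguments yields the weights for the second filter bank, which steers masking energy away from the first spatial region.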
6. The method according to any one of claims 1 to 5, wherein the frequency spectrum of the masking signal is consistent with that of the target audio signal.
7. The method of claim 6, wherein generating a masking signal from the target audio signal comprises:
generating an initial masking signal according to the target audio signal, wherein the frequency spectrum of the initial masking signal is consistent with that of the target audio signal;
and carrying out frequency spectrum correction on the initial masking signal to obtain the masking signal.
8. The method of claim 7, wherein generating, according to the target audio signal, an initial masking signal whose frequency spectrum is consistent with that of the target audio signal comprises:
the target audio signal with the buffer time length being the preset masking frame length is obtained, and a signal to be masked of the current frame is obtained;
performing windowing and then time reversal on the signal to be masked of the current frame to obtain a current frame reversed signal;
acquiring a previous frame reversed signal, wherein the previous frame reversed signal is the latest frame reversed signal preceding the current frame reversed signal;
superposing the current frame reversed signal and the previous frame reversed signal to obtain a current frame basic masking signal;
and adjusting the energy of the current frame basic masking signal according to a preset energy ratio threshold and the energy of the current frame to-be-masked signal to obtain the current frame initial masking signal.
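The per-frame pipeline of claim 8 (buffer a frame, window and time-reverse it, superpose it with the previous reversed frame, then rescale to a preset energy ratio) can be sketched as follows. The choice of a Hann window and the interpretation of the threshold as the ratio E(target frame)/E(masking frame) are assumptions, as the claim fixes neither.

```python
import numpy as np

def frame_masking_signal(cur_frame, prev_rev, energy_ratio=4.0):
    """One step of the masking-signal generation of claim 8 (a sketch).

    cur_frame:    current frame of the signal to be masked
    prev_rev:     windowed, time-reversed signal of the previous frame
    energy_ratio: preset threshold E(target frame) / E(masking frame)
    Returns (masking_frame, cur_rev); cur_rev is passed to the next call
    as prev_rev.
    """
    win = np.hanning(len(cur_frame))
    cur_rev = (cur_frame * win)[::-1]      # window, then time-reverse
    base = cur_rev + prev_rev              # superpose with previous frame
    e_sig = np.sum(cur_frame ** 2)
    e_base = np.sum(base ** 2)
    if e_base == 0.0:
        return base, cur_rev               # silent frame: nothing to scale
    # adjust energy so E(cur_frame) / E(mask) equals the preset threshold
    gain = np.sqrt(e_sig / (energy_ratio * e_base))
    return base * gain, cur_rev
```

Because the masking frame is built from reversed, overlapped pieces of the target itself, its spectrum tracks that of the target, which is the property claim 6 requires.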
9. The method of claim 8, wherein adjusting the energy of the current frame base masking signal according to a preset energy ratio threshold and the energy of the current frame to be masked signal to obtain the current frame initial masking signal comprises:
adjusting the energy of the current frame basic masking signal to obtain a current frame masking adjustment signal, wherein the energy ratio of the current frame to-be-masked signal to the current frame masking adjustment signal is equal to the preset energy ratio threshold;
determining, respectively, whether a valid signal exists in the signal to be masked of the current frame and in the masking adjustment signal of the current frame;
if no valid signal exists in the signal to be masked of the current frame, reducing the amplitude of the current frame masking adjustment signal to obtain the current frame initial masking signal, or taking 0 or a null value as the current frame initial masking signal;
if a valid signal exists in the signal to be masked of the current frame but no valid signal exists in the current frame masking adjustment signal, acquiring a preset signal from a preset masking signal library and adjusting the frequency spectrum of the preset signal according to the frequency spectrum of the signal to be masked of the current frame to obtain the current frame initial masking signal.
10. The method of claim 9, wherein determining, respectively, whether a valid signal exists in the current frame signal to be masked and in the current frame masking adjustment signal comprises:
determining whether the energy of the signal to be masked of the current frame is greater than or equal to a first energy threshold;
if the energy of the signal to be masked of the current frame is greater than or equal to the first energy threshold, determining that a valid signal exists in the signal to be masked of the current frame;
if the energy of the signal to be masked of the current frame is smaller than the first energy threshold, determining that no valid signal exists in the signal to be masked of the current frame;
determining whether the energy of the current frame masking adjustment signal is greater than or equal to a second energy threshold;
if the energy of the current frame masking adjustment signal is greater than or equal to the second energy threshold, determining that a valid signal exists in the current frame masking adjustment signal;
and if the energy of the current frame masking adjustment signal is smaller than the second energy threshold, determining that no valid signal exists in the current frame masking adjustment signal.
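The threshold tests of claim 10 and the branch logic of claim 9 reduce to a few lines. In the sketch below the attenuation factor and the use of None to signal the fall-back to the preset masking library are illustrative assumptions, not values given in the claims.

```python
import numpy as np

def has_valid_signal(frame, energy_threshold):
    """Claim 10's check: a frame contains a valid signal iff its energy
    is greater than or equal to the threshold."""
    return float(np.sum(np.asarray(frame) ** 2)) >= energy_threshold

def choose_masking(cur_frame, mask_adj, thr_signal, thr_mask, att=0.25):
    """Branch logic of claim 9 (sketch)."""
    if not has_valid_signal(cur_frame, thr_signal):
        # no valid target signal: attenuate the mask (or output silence)
        return mask_adj * att
    if not has_valid_signal(mask_adj, thr_mask):
        # valid target but degenerate mask: caller falls back to a preset
        # signal from the masking-signal library, per claim 9
        return None
    return mask_adj
```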
11. The method of claim 6, wherein generating a masking signal from the target audio signal comprises:
acquiring a preset signal from a preset masking signal library;
and adjusting the frequency spectrum of the preset signal according to the frequency spectrum of the target audio signal to obtain the masking signal.
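One possible reading of claim 11's spectrum adjustment is to impose the target's magnitude spectrum on the preset signal while keeping the preset's phase, so the mask matches the target spectrally without duplicating its waveform. The bin-by-bin FFT mapping below is that assumption made concrete, not the patent's stated method.

```python
import numpy as np

def shape_to_spectrum(preset, target):
    """Adjust the magnitude spectrum of a preset signal (e.g. noise from
    a masking-signal library) to match that of the target audio signal,
    keeping the preset's phase."""
    n = min(len(preset), len(target))
    P = np.fft.rfft(preset[:n])
    T = np.fft.rfft(target[:n])
    phase = np.exp(1j * np.angle(P))        # keep the preset's phase
    return np.fft.irfft(np.abs(T) * phase, n=n)
```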
12. An electronic device, comprising: a processor, a memory, and an interface;
the processor, the memory and the interface cooperate to cause the electronic device to perform the method of any one of claims 1 to 11.
13. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes an electronic device to perform the method of any one of claims 1 to 11.
CN202311017599.5A 2023-08-11 2023-08-11 Audio signal processing method and electronic equipment Pending CN117714581A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311017599.5A CN117714581A (en) 2023-08-11 2023-08-11 Audio signal processing method and electronic equipment

Publications (1)

Publication Number Publication Date
CN117714581A true CN117714581A (en) 2024-03-15

Family

ID=90152185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311017599.5A Pending CN117714581A (en) 2023-08-11 2023-08-11 Audio signal processing method and electronic equipment

Country Status (1)

Country Link
CN (1) CN117714581A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021175267A1 (en) * 2020-03-03 2021-09-10 华为技术有限公司 Method for implementing active noise cancellation, apparatus, and electronic device
WO2021184920A1 (en) * 2020-03-20 2021-09-23 华为技术有限公司 Method and apparatus for masking sound, and terminal device
CN113497849A (en) * 2020-03-20 2021-10-12 华为技术有限公司 Sound masking method and device and terminal equipment
CN116320133A (en) * 2023-03-31 2023-06-23 Oppo广东移动通信有限公司 Audio playing method and device, electronic equipment and storage medium
CN116346977A (en) * 2023-03-31 2023-06-27 Oppo广东移动通信有限公司 Audio playing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113676804B (en) Active noise reduction method and device
US7317805B2 (en) Telephone with integrated hearing aid
TWI508056B (en) Portable audio device
US7761292B2 (en) Method and apparatus for disturbing the radiated voice signal by attenuation and masking
US9779716B2 (en) Occlusion reduction and active noise reduction based on seal quality
JP2003188985A (en) Wireless headset, helmet using the same, internet telephone system, on-board hands-free system, communication system, credit processing system, and cash service system
WO2021238354A1 (en) Sound leakage canceling method and electronic device
JP2008263383A (en) Apparatus and method for canceling generated sound
CN114466097A (en) Mobile terminal capable of preventing sound leakage and sound output method of mobile terminal
US20150312674A1 (en) Portable terminal and portable terminal system
CN114727212A (en) Audio processing method and electronic equipment
CN114157945A (en) Data processing method and related device
CN113301544B (en) Method and equipment for voice intercommunication between audio equipment
CN116208879A (en) Earphone with active noise reduction function and active noise reduction method
CN113038318B (en) Voice signal processing method and device
JP2012095047A (en) Speech processing unit
CN106657621B (en) Self-adaptive adjusting device and method for sound signal
CN101360155A (en) Active noise silencing control system
CN115623121B (en) Communication method, electronic equipment, chip system and storage medium
CN117714581A (en) Audio signal processing method and electronic equipment
CN115225997A (en) Sound playing method, device, earphone and storage medium
WO2024016229A1 (en) Audio processing method and electronic device
CN116048448B (en) Audio playing method and electronic equipment
CN116546126B (en) Noise suppression method and electronic equipment
WO2023160204A1 (en) Audio processing method, and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination