CN117079659A

CN117079659A - Audio processing method and related device

Info

Publication number: CN117079659A
Application number: CN202310990548.4A
Authority: CN
Inventors: 许剑峰
Original assignee: Honor Device Co Ltd
Current assignee: Honor Device Co Ltd
Priority date: 2023-03-28
Filing date: 2023-03-28
Publication date: 2023-11-17
Also published as: CN116013349B; CN116013349A

Abstract

The embodiment of the application provides an audio processing method and a related device, and relates to the technical field of terminals. The method comprises the following steps: the electronic equipment acquires a first audio signal; the electronic equipment separates a human voice component and a non-human voice component of the first audio signal; the electronic equipment performs energy suppression on tooth sounds in the voice component; and the electronic equipment mixes sound according to the non-human sound component and the human sound component after tooth sound suppression to obtain a second audio signal. Therefore, the suppression of tooth sound can be realized, and meanwhile, the damage to non-human sound components can be avoided, so that tone distortion is reduced, and user experience is improved.

Description

Audio processing method and related device

The present application is a divisional application, the application number of which is 202310309529.0, the application date of which is 2023, 3, 28, and the entire contents of which are incorporated herein by reference.

Technical Field

The application relates to the technical field of terminals, in particular to an audio processing method and a related device.

Background

When a user uses the electronic equipment to surf the internet, watch videos and listen to songs, tooth sounds in voice can exist in the played audio, so that the definition and the intelligibility of voice are reduced, and the voice quality is reduced.

In one implementation, the electronic device may gain suppress the input audio, but still produce more tonal distortion.

Disclosure of Invention

According to the audio processing method and the related device, the electronic equipment can firstly perform voice/non-voice separation on the input audio to obtain the voice component and the non-voice component, and further the electronic equipment can perform tooth sound suppression on the voice component, so that the suppression of the tooth sound can be realized, meanwhile, the damage to the non-voice component can be avoided, the tone distortion is reduced, and the user experience is improved.

In a first aspect, an audio processing method provided by an embodiment of the present application includes:

the electronic equipment acquires a first audio signal; the electronic equipment separates a human voice component and a non-human voice component of the first audio signal; the electronic equipment performs energy suppression on tooth sounds in the voice component; and the electronic equipment mixes sound according to the non-human sound component and the human sound component after tooth sound suppression to obtain a second audio signal. Therefore, the suppression of tooth sound can be realized, and meanwhile, the damage to non-human sound components can be avoided, so that tone distortion is reduced, and user experience is improved.

In one possible implementation, after the electronic device acquires the first audio signal, the electronic device is included to transform the first audio signal from a time domain signal to a frequency domain signal; the electronic device separating a human voice component and a non-human voice component of the first audio signal, including the electronic device separating the human voice component and the non-human voice component of the first audio signal in the frequency domain; the electronic equipment performs energy suppression on the tooth sound in the human voice component, including the electronic equipment performs energy suppression on the tooth sound in the human voice component in a frequency domain; the electronic equipment mixes sound according to the non-human sound component and the human sound component after the tooth sound suppression to obtain a second audio signal, and the electronic equipment transforms the mixed sound signal from a frequency domain signal to a time domain signal after mixing sound according to the non-human sound component and the human sound component after the tooth sound suppression in a frequency domain to obtain the second audio signal. In this way, since the tooth sound suppression in the frequency domain can be performed without generating additional delay, the time delay alignment processing process for the non-human voice can be reduced, and the calculation force can be saved.

In one possible implementation, the electronic device energy-suppresses the pitch in the vocal component, including the electronic device energy-suppressing the pitch in the vocal component according to the perceived energy and the spectrum of the vocal component; the perceptual energy is the perceptual energy of the tooth audio band in the first audio signal, the perceptual energy is in direct proportion to the first energy, the perceptual energy is in inverse proportion to the second energy, the first energy is the energy of the tooth audio band in the human sound component, and the second energy is the energy of the tooth audio band in the non-human sound component. In this way, as the masking effect of the non-human sound component on the tooth sound is considered, the effect can perform smaller energy suppression or no energy suppression on the unobvious tooth sound, and the tone quality of the original first audio signal is ensured.

In one possible implementation, the perceived energy satisfies the following formula:

where EV' (i) is a value of the perceptible energy, EV (i) is a value of the first energy, EU (i) is a value of the second energy, i is a tooth tone subband number, the tooth tone subband includes a plurality of tooth tone subbands, and epsilon is a control parameter of the perceptible energy. Therefore, the degree of the masking effect can be controlled by introducing epsilon into the perceivable energy formula, the tone quality is ensured, and the user experience is improved.

In a possible implementation, the control parameters epsilon of the perceived energy corresponding to the different tooth tone sub-bands i are different. In this way, different parameters are adopted in different frequency bands to control the degree of the masking effect, so that the influence of the masking effect on the audio signal can be reduced, and the user experience is improved.

In one possible implementation, the electronic device performs energy suppression on the tooth sound in the human voice component, and satisfies the following formula:

where V '(k) is the spectrum of the human voice component after tooth pitch suppression, V (k) is the spectrum of the human voice component, EV' (i) is the value of the perceived energy, m is the suppression degree parameter, and then (i) is the tooth pitch energy suppression threshold value of the i-th tooth pitch band. Thus, the voice signal of the tooth audio frequency section can be prevented from being greatly restrained, and obvious tone distortion is generated.

In a possible implementation, the suppression degree parameters m corresponding to different tooth sound sub-bands i are different, and the electronic device performs energy suppression on tooth sound in the human voice component, so that the following formula is satisfied:

where V '(k) is the spectrum of the human voice component after tooth pitch suppression, V (k) is the spectrum of the human voice component, EV' (i) is the value of the perceived energy, m (i) is the suppression degree parameter, and then (i) is the tooth pitch energy suppression threshold value of the i-th tooth pitch band. Therefore, different parameters are adopted in different frequency bands to control the upper limit of the tooth sound frequency band inhibition degree, the influence of the electronic equipment on the tooth sound inhibition degree can be reduced, and further user experience is improved.

In one possible implementation, before the electronic device performs energy suppression on the tooth sound in the voice component, the electronic device sets a flag bit according to whether the tooth sound is included in the voice component, wherein the flag bit includes a first value or a second value, the first value indicates that the tooth sound exists in the voice component, and the second value indicates that the tooth sound does not exist in the voice component; the electronic device energy suppressing the tooth sound in the human voice component includes: if the flag bit is the first value, the electronic equipment performs energy suppression on the tooth sound in the human voice component. Therefore, the electronic equipment can determine whether to carry out energy suppression on the tooth sound in the voice component according to the zone bit, and can more accurately judge whether the tooth sound exists in the tooth sound frequency band, so that the accuracy of tooth sound suppression is improved.

In one possible implementation, if the flag bit is the second value, the electronic device does not energy suppress the tooth sound in the human voice component. Therefore, the electronic equipment can not inhibit tooth sound for the frequency band without tooth sound, so that unnecessary calculation is reduced, and calculation force is saved.

wherein V '(k) is the spectrum of the human voice component after tooth sound suppression, V (k) is the spectrum of the human voice component, EV' (i) is the value of the perceived energy, m (i) is the suppression degree parameter, and then (i) is the tooth sound energy suppression threshold value of the ith tooth sound band, and flag is the flag bit. Therefore, the electronic equipment can more accurately judge whether the tooth sound exists in the tooth sound frequency band, and the electronic equipment can not inhibit the tooth sound for the frequency band without the tooth sound, so that the accuracy of the tooth sound inhibition is improved, and the user experience is improved.

In one possible implementation, before the electronic device mixes the non-human voice component with the human voice component after tooth sound suppression, the method includes: and the electronic equipment performs time delay alignment on the non-human voice component and the human voice component after tooth sound suppression. In this way, the time delay due to tooth sound suppression of the human voice signal can be reduced.

In one possible implementation, the electronic device performs time delay alignment on the non-human voice component and the human voice component after tooth sound suppression, including buffering silence for a period of time before the non-human voice component, where the period of time is a time delay period generated when tooth sound suppression is performed on the human voice component. Therefore, the electronic equipment can make the relative time delay of the human voice signal and the non-human voice signal identical by buffering the silence for a period of time before the non-human voice component, so that the time delay caused by tooth sound suppression of the human voice signal is reduced, and the user experience is improved.

In a second aspect, an embodiment of the present application provides an apparatus for audio processing, where the apparatus may be a terminal device, or may be a chip or a chip system in the terminal device. The apparatus may include a processing unit. The processing unit is configured to implement any method related to processing performed by the terminal device in the first aspect or any possible implementation manner of the first aspect. When the apparatus is a terminal device, the processing unit may be a processor. The apparatus may further comprise a storage unit, which may be a memory. The storage unit is configured to store instructions, and the processing unit executes the instructions stored in the storage unit, so that the terminal device implements the method described in the first aspect or any one of possible implementation manners of the first aspect. When the apparatus is a chip or a system of chips within a terminal device, the processing unit may be a processor. The processing unit executes the instructions stored by the storage unit to cause the terminal device to implement the method described in the first aspect or any one of the possible implementations of the first aspect. The memory unit may be a memory unit (e.g., a register, a cache, etc.) in the chip, or a memory unit (e.g., a read-only memory, a random access memory, etc.) located outside the chip in the terminal device.

In a possible implementation manner, a processing unit is configured to acquire a first audio signal; and is further configured to separate a human voice component and a non-human voice component of the first audio signal; the device is also used for energy inhibition of tooth sound in the human voice component; and the audio mixing unit is also used for mixing the voice according to the non-voice component and the voice component after tooth sound suppression to obtain a second audio signal.

In a possible implementation, the processing unit is configured to transform the first audio signal from a time domain signal to a frequency domain signal; but also for transforming the mixing signal from a frequency domain signal to a time domain signal.

In a possible implementation, the processing unit is configured to perform energy suppression on the tooth sound in the human voice component according to the perceived energy and the frequency spectrum of the human voice component.

in a possible implementation, the control parameters epsilon of the perceived energy corresponding to the different tooth tone sub-bands i are different.

In a possible implementation manner, the processing unit is configured to perform energy suppression on the tooth sound in the human voice component, and the following formula is satisfied:

in a possible implementation manner, the suppression degree parameters m corresponding to different tooth sound sub-bands i are different, and the processing unit is used for performing energy suppression on tooth sound in the human voice component, so that the following formula is satisfied:

In a possible implementation manner, the processing unit is configured to set a flag bit according to whether the voice component includes a tooth sound, and specifically is further configured to perform energy suppression on the tooth sound in the voice component if the flag bit is a first value.

In a possible implementation manner, the processing unit is configured to not perform energy suppression on the tooth sound in the voice component if the flag bit is the second value.

in a possible implementation manner, the processing unit is configured to perform time delay alignment on the non-human voice component and the human voice component after tooth sound suppression.

In a possible implementation, the processing unit is configured to buffer silence for a period of time before the non-human voice component.

In a third aspect, an embodiment of the present application provides a terminal device, including a processor and a memory, where the memory is configured to store code instructions, and where the processor is configured to execute the code instructions to perform the audio processing method described in the first aspect or any one of the possible implementation manners of the first aspect.

In a fourth aspect, embodiments of the present application provide a computer readable storage medium having stored therein a computer program or instructions which, when run on a computer, cause the computer to perform the audio processing method described in the first aspect or any one of the possible implementations of the first aspect.

In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program which, when run on a computer, causes the computer to perform the audio processing method described in the first aspect or any one of the possible implementations of the first aspect.

In a sixth aspect, the present application provides a chip or chip system comprising at least one processor and a communication interface, the communication interface and the at least one processor being interconnected by wires, the at least one processor being for running a computer program or instructions to perform the audio processing method described in the first aspect or any one of the possible implementations of the first aspect. The communication interface in the chip can be an input/output interface, a pin, a circuit or the like.

In one possible implementation, the chip or chip system described above further includes at least one memory, where the at least one memory has instructions stored therein. The memory may be a memory unit within the chip, such as a register, a cache, etc., or may be a memory unit of the chip (e.g., a read-only memory, a random access memory, etc.).

It should be understood that, the second aspect to the sixth aspect of the present application correspond to the technical solutions of the first aspect of the present application, and the advantages obtained by each aspect and the corresponding possible embodiments are similar, and are not repeated.

Drawings

Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;

fig. 2 is a schematic software structure of an electronic device according to an embodiment of the present application;

fig. 3 is a schematic diagram of gain suppression of input audio by an electronic device according to an embodiment of the present application;

fig. 4 is a schematic diagram of suppressing tooth noise by an electronic device in a frequency domain according to an embodiment of the present application;

fig. 5 is a schematic diagram of an audio processing method according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a human voice/non-human voice separation based on an NN network method provided by an embodiment of the application;

FIG. 7 is a schematic diagram of another audio processing method according to an embodiment of the present application;

FIG. 8 is a schematic diagram of another audio processing method according to an embodiment of the present application;

FIG. 9 is a schematic diagram of another audio processing method according to an embodiment of the present application;

fig. 10 is a schematic diagram of a specific audio processing method according to an embodiment of the present application;

Fig. 11 is a schematic structural diagram of a chip according to an embodiment of the present application.

Detailed Description

In order to facilitate the clear description of the technical solutions of the embodiments of the present application, the following simply describes some terms and techniques involved in the embodiments of the present application:

1. terminology

In embodiments of the present application, the words "first," "second," and the like are used to distinguish between identical or similar items that have substantially the same function and effect. For example, the first chip and the second chip are merely for distinguishing different chips, and the order of the different chips is not limited. It will be appreciated by those of skill in the art that the words "first," "second," and the like do not limit the amount and order of execution, and that the words "first," "second," and the like do not necessarily differ.

It should be noted that, in the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.

In the embodiments of the present application, "at least one" means one or more, and "a plurality" means two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a alone, a and B together, and B alone, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.

2. Electronic equipment

The electronic device of the embodiment of the application can also be any form of terminal device, for example, the electronic device can include a handheld device with an audio function, a vehicle-mounted device and the like. For example, some electronic devices are: a mobile phone, tablet, palm, notebook, mobile internet device (mobile internet device, MID), wearable device, virtual Reality (VR) device, augmented reality (augmented reality, AR) device, wireless terminal in industrial control (industrial control), wireless terminal in unmanned (self driving), wireless terminal in teleoperation (remote medical surgery), wireless terminal in smart grid (smart grid), wireless terminal in transportation security (transportation safety), wireless terminal in smart city (smart city), wireless terminal in smart home (smart home), cellular phone, cordless phone, session initiation protocol (session initiation protocol, SIP) phone, wireless local loop (wireless local loop, WLL) station, personal digital assistant (personal digital assistant, PDA), handheld device with wireless communication function, public computing device or other processing device connected to wireless modem, vehicle-mounted device, wearable device, electronic device in the 5G network or evolving land mobile network (public land mobile network), and the like, without limiting the application.

By way of example, and not limitation, in embodiments of the application, the electronic device may also be a wearable device. The wearable device can also be called as a wearable intelligent device, and is a generic name for intelligently designing daily wear by applying wearable technology and developing wearable devices, such as glasses, gloves, watches, clothes, shoes and the like. The wearable device is a portable device that is worn directly on the body or integrated into the clothing or accessories of the user. The wearable device is not only a hardware device, but also can realize a powerful function through software support, data interaction and cloud interaction. The generalized wearable intelligent device includes full functionality, large size, and may not rely on the smart phone to implement complete or partial functionality, such as: smart watches or smart glasses, etc., and focus on only certain types of application functions, and need to be used in combination with other devices, such as smart phones, for example, various smart bracelets, smart jewelry, etc. for physical sign monitoring.

In addition, in the embodiment of the application, the electronic equipment can also be electronic equipment in an internet of things (internet of things, ioT) system, and the IoT is an important component of the development of future information technology, and the main technical characteristics of the IoT are that the article is connected with a network through a communication technology, so that the man-machine interconnection and the intelligent network of the internet of things are realized.

The electronic device in the embodiment of the application may also be referred to as: a User Equipment (UE), a Mobile Station (MS), a Mobile Terminal (MT), an access terminal, a subscriber unit, a subscriber station, a mobile station, a remote terminal, a mobile device, a user terminal, a wireless communication device, a user agent, or a user equipment, etc.

In an embodiment of the present application, the electronic device or each network device includes a hardware layer, an operating system layer running on top of the hardware layer, and an application layer running on top of the operating system layer. The hardware layer includes hardware such as a central processing unit (central processing unit, CPU), a memory management unit (memory management unit, MMU), and a memory (also referred to as a main memory). The operating system may be any one or more computer operating systems that implement business processes through processes (processes), such as a Linux operating system, a Unix operating system, an Android operating system, an IOS operating system, or a windows operating system, etc. The application layer comprises applications such as a browser, an address book, word processing software, instant messaging software and the like.

By way of example, fig. 1 shows a schematic diagram of an electronic device.

The electronic device may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display 194, and a subscriber identity module (subscriber identification module, SIM) card interface 195, etc. The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.

It should be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the electronic device. In other embodiments of the application, the electronic device may include more or less components than illustrated, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.

The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.

A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.

In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (GPIO) interface, a SIM card interface, and/or a USB interface, among others.

It should be understood that the connection relationship between the modules illustrated in the embodiments of the present application is only illustrative, and does not limit the structure of the electronic device. In other embodiments of the present application, the electronic device may also use different interfacing manners, or a combination of multiple interfacing manners in the foregoing embodiments.

The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the electronic device. The external memory card communicates with the processor 110 through an external memory interface 120 to implement data storage functions. For example, files such as music, video, etc. are stored in an external memory card.

The internal memory 121 may be used to store computer-executable program code that includes instructions. The internal memory 121 may include a storage program area and a storage data area. The storage program area may store an application program (such as a sound playing function, an image playing function, etc.) required for at least one function of the operating system, etc. The storage data area may store data created during use of the electronic device (e.g., audio data, phonebook, etc.), and so forth. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (universal flash storage, UFS), and the like. The processor 110 performs various functional applications of the electronic device and data processing by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor. For example, the method of the embodiments of the present application may be performed.

The electronic device implements display functions via a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information. The electronic device may implement shooting functions through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.

The electronic device may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as audio playback or recording, etc.

The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The speaker 170A, also referred to as a "horn," is used to convert an audio electrical signal into a sound signal, and may include 1 or N speakers 170A, N being a positive integer greater than 1 in the electronic device. The electronic device may listen to music, video, or hands-free conversation, etc., through speaker 170A. A receiver 170B, also referred to as a "earpiece", is used to convert the audio electrical signal into a sound signal. When the electronic device picks up a phone call or voice message, the voice can be picked up by placing the receiver 170B close to the human ear. Microphone 170C, also known as a "microphone" or "microphone," is used to convert sound signals into electrical signals. The earphone interface 170D is used to connect a wired earphone.

Fig. 2 is a software configuration block diagram of an electronic device according to an embodiment of the present application. The layered architecture divides the software into several layers, each with distinct roles and branches. The layers communicate with each other through a software interface. In some embodiments, the Android system is divided into five layers, from top to bottom, an application layer, an application framework layer, an Zhuoyun row (Android run) and system libraries, a hardware abstraction layer, and a kernel layer, respectively.

The application layer may include a series of application packages. As shown in fig. 2, the application package may include application programs such as an audio application, a video application, a social application, and the like. Applications may include system applications and three-way applications.

The application framework layer provides an application programming interface (application programming interface, API) and programming framework for application programs of the application layer. The application framework layer includes a number of predefined functions.

As shown in FIG. 2, the application framework layer may include a window manager, a resource manager, a notification manager, a content provider, a view system, and the like.

The window manager is used for managing window programs. The window manager may obtain the display screen size, determine if there is a status bar, lock screen, touch screen, drag screen, intercept screen, etc.

The resource manager provides various resources for the application program, such as localization strings, icons, pictures, layout files, video files, and the like.

The notification manager allows the application to display notification information in a status bar, can be used to communicate notification type messages, can automatically disappear after a short dwell, and does not require user interaction. Such as notification manager is used to inform that the download is complete, message alerts, etc. The notification manager may also be a notification in the form of a chart or scroll bar text that appears on the system top status bar, such as a notification of a background running application, or a notification that appears on the screen in the form of a dialog window. Such as prompting text messages in status bars, sounding prompts, vibrating electronic devices, flashing indicator lights, etc.

The content provider is used for realizing the function of data sharing among different application programs, allowing one program to access the data in the other program, and simultaneously ensuring the safety of the accessed data.

The view system may be responsible for interface rendering and event handling for the application.

Android runtimes include core libraries and virtual machines. Android run time is responsible for scheduling and management of the Android system.

The core library consists of two parts: one part is a function which needs to be called by java language, and the other part is a core library of android.

The application layer and the application framework layer run in a virtual machine. The virtual machine executes java files of the application program layer and the application program framework layer as binary files. The virtual machine is used for executing the functions of object life cycle management, stack management, thread management, security and exception management, garbage collection and the like.

The system library may include a plurality of functional modules. For example: media libraries (media libraries), function libraries (function libraries), audio and video processing libraries, etc.

Media libraries support a variety of commonly used audio, video format playback and recording, still image files, and the like. The media library may support a variety of audio video encoding formats, such as: MPEG4, h.264, MP3, AAC, AMR, JPG, PNG, etc.

The function library provides multiple service API interfaces for the developer, and is convenient for the developer to integrate and realize various functions quickly.

The hardware abstraction layer is a layer of abstracted structure between the kernel layer and the Android run. The hardware abstraction layer may be a package for hardware drivers that provides a unified interface for the invocation of upper layer applications.

The kernel layer is a layer between hardware and software. The kernel layer may include an audio driver, a video driver, a camera driver, and the like.

The embodiment of the application is only illustrated by an android system, and in other operating systems (such as a Windows system, an IOS system and the like), the scheme of the application can be realized as long as the functions realized by the functional modules are similar to those of the embodiment of the application.

It should be noted that, the processing of frequency band suppression, time-frequency conversion, separation of human voice and non-human voice, tooth voice suppression, time delay alignment, masking effect calculation, audio mixing, time-frequency conversion and the like performed on the input audio in the embodiment of the application can be implemented in a plurality of software architecture layers of the electronic equipment. For example, the application related to audio and video in the application layer of the electronic device can perform the above-mentioned processing, and in addition, the audio and video processing library in the system library of the electronic device, the audio and video module of the hardware abstraction layer, and the audio and video driver of the driver layer can all perform the above-mentioned processing, thereby realizing the suppression of the tooth sound in the input audio. The embodiment of the present application is not particularly limited, and the software level of tooth noise suppression is embodied.

When a user uses the electronic equipment to surf the internet, watch videos and listen to songs, tooth sounds in voice can exist in the played audio, so that the definition and the intelligibility of voice are reduced, and the voice quality is reduced. The tooth sound is understood to mean that when the sound-producing initial consonant is j, q, x, zh, ch, sh, z, c, s, the tip of the tongue is pressed against the upper incisor teeth, and the airflow rubs with the teeth to produce the tooth sound.

Because the frequency range of tooth tones in the audio frequency is between 4kHz and 10kHz, the tooth tones belong to the middle-high frequency band, are sensitive areas of human ears, and are easy to bring harsh hearing. On the one hand, some electronic devices adopt a side sound emitting mode, so that obvious peaks and valleys are formed near 4.3kHz and 9 kHz-10 kHz, and tooth sounds in audio are obvious, so that hearing is affected. On the other hand, when the user uses the earphone, because the earphone is attached to the ear, the tooth sound can be increased along with the shortening of the distance, and the tooth sound in the audio can be obvious, so that the voice quality of the human voice is reduced.

In one implementation, as shown in FIG. 3, the electronic device may perform fixed gain suppression for the 4 kHz-10 kHz and/or 5 kHz-12 kHz toothed audio segments of the input audio. Wherein, the input audio comprises a human sound component and a non-human sound component, and the non-human sound component can comprise bird sounds, musical instrument sounds, special effect sounds, noise and the like. The suppression of the fixed gain of the input audio by the electronic device includes suppression of both the human voice component and the non-human voice component. Illustratively, the electronic device may implement suppression of the input audio through one or more infinite impulse response (infinite impulse response, IIR) filters/equalizers. However, this implementation may result in the frequency bands of the non-human voice component also being suppressed, producing more tone color distortion.

In another implementation, as shown in FIG. 4, the electronic device may convert input audio, including both human and non-human voice components, from a time domain signal to a frequency domain signal. The input signal may be analyzed in the frequency domain for energy envelope, zero crossing rate, and/or tooth audio band energy, etc., and then the tooth audio band energy may be adaptively suppressed based on the analysis. The energy envelope, the zero-crossing rate and the like can indicate the possibility that the signal in the frequency band is tooth sound, and when the signal in the frequency band is tooth sound, the electronic equipment can calculate the frequency band energy of the tooth sound. If the tooth sound frequency band energy is larger, the tooth sound is obvious, and the tooth sound can be restrained. However, this implementation does not consider the influence on the non-human sound component, and may suppress some of the non-human sound component as tooth tones, resulting in more tone color distortion.

In view of this, according to the audio processing method provided by the embodiment of the application, the electronic device can firstly perform the separation of human voice and non-human voice on the input audio to obtain the human voice component and the non-human voice component, and further the electronic device can perform tooth sound suppression on the human voice component, so that the suppression of tooth sound can be realized, and meanwhile, the damage to the non-human voice component can be avoided, thereby reducing tone color distortion and improving user experience.

Specifically, fig. 5 shows an audio processing method provided by the embodiment of the present application. The audio processing method can comprise processing methods such as human voice/non-human voice separation, tooth voice suppression, sound mixing and the like.

(1) Separation of human voice/non-human voice.

The electronic device may perform the human/non-human separation of the input audio, and in a possible implementation, the electronic device may perform the human/non-human separation of the input audio using a conventional signal processing method, for example, the conventional signal processing method may include a signal processing method based on correlation analysis, a voice activity detection (voice activity detection, VAD) method, and the like. The electronic device may further adopt a Neural Network (NN) based method to perform the separation of human voice and non-human voice on the input audio, and the embodiment of the application is not limited.

The following describes a process of separating human voice from non-human voice on input audio, taking an NN network based method as an example.

As shown in fig. 6, the human voice/non-human voice separation of input audio based on the NN network method may include (a) a training side and (b) an extraction side. The training side can be understood as a training process of the NN network, the NN network realizes the learning of the voice data through training of a large amount of voice data, and the accuracy of extracting the voice by the NN network can be continuously improved. The extraction side can be understood as a process using the NN network, and can achieve separation of human voice/non-human voice.

(a) Training the side.

The NN network method can perform time-frequency conversion on the voice signal to obtain a frequency domain voice signal, the frequency domain voice signal can have a corresponding real data mask (group-trunk mask), and the group-trunk mask can be understood as accurate voice data and can be used as a standard for measuring the accuracy of training data.

In a possible implementation, the time-frequency conversion may be implemented by using a fast fourier transform (fast fourier transformation, FFT), a discrete fourier transform (discrete fourier transform, DFT), a modified discrete cosine transform (modified discrete cosine transform, MDCT), and the like. Taking the DFT method as an example, the time-frequency conversion may satisfy the following formula:

where X (N) may be an input time domain signal, N may be the number of consecutive time domain samples of the DFT processing the input time domain signal at a time, typically the number of consecutive time domain samples may be a signal of one or two frames or so, and X (k) is an output complex spectrum signal, and may include a real part and an imaginary part:

X(k)＝X _R (k)+jX _I (k)。

the sum of squares is calculated for the real and imaginary parts, and then the sum of squares is calculated, so that the frequency spectrum amplitude can be obtained:

during training, the NN network method may add other non-human voice data to human voice data, for example, the non-human voice data may include bird voice, musical instrument voice, special effect voice, noise and other data, generate a mixed signal, and perform time-frequency conversion on the mixed signal to obtain a mixed signal in a frequency domain. The mixed signal of the frequency domain can obtain a human voice mask based on a human voice extraction method of an NN network model. The human voice extraction method based on the NN network model may include a method of full-band audio low-complexity speech enhancement framework (deep filter net) based on depth filtering, a method of load balancing (Dmucs), a method of convolution time domain audio separation network (convolutional time-domain audioseparation network, conv tasnet) and the like, and the human voice extraction method adopted by specific electronic devices is not limited in the embodiment of the present application.

It can be understood that the human voice mask obtained by the human voice extraction method based on the NN network model can be the weight of the human voice signal duty ratio in the mixed signal, and the human voice mask can be understood as the result of training the human voice data by the NN network.

Further, the penalty function may calculate an error between the group-try mask and the voice mask and feed the error back to the NN network. The NN network can perform reverse reasoning to update the mask weight of the NN network, so that the voice mask is continuously close to the group-trunk mask, that is, the voice data obtained by the NN network gradually approaches to the real voice data, and the calculation accuracy of the NN network is continuously improved.

(b) And an extraction side.

The NN network method can perform time-frequency conversion on the mixed signal to obtain a mixed signal in a frequency domain, and the mixed signal in the frequency domain can obtain a voice mask through an NN network model. And multiplying the mixed signal of the frequency domain by the voice mask to obtain the voice signal of the frequency domain. After the human voice signal in the frequency domain is obtained, subtracting the human voice signal in the frequency domain from the mixed signal in the frequency domain, so that the non-human voice signal in the frequency domain can be obtained.

That is, by extracting the human voice/non-human voice separation of the input audio through the NN network method, the mixed signal spectrum X (k) may be decomposed into the human voice signal spectrum V (k) and the non-human voice signal spectrum U (k), and the following relationship may be satisfied:

X(k)＝V(k)+U(k)k＝0，1...N-1。

Further, by performing frequency-time conversion on the human voice signal in the frequency domain, the human voice signal in the time domain can be output; the non-human voice signal in the time domain can be output by performing frequency-time conversion on the non-human voice signal in the frequency domain. Thus, the NN network method can realize the separation of human voice and non-human voice of input audio.

In a possible implementation, the frequency-to-time conversion may be implemented using the inverse of the time-to-frequency conversion. Taking the DFT method as an example, the inverse process of the DFT method may be an inverse discrete fourier transform (inverse discrete fourier transform, IDFT), satisfying the following formula:

where X (k) is an input complex spectrum signal, N may be the number of consecutive frequency domain samples of the DFT processing the input frequency domain signal at a time, and X (N) may be an output time domain signal.

(2) Tooth sound suppression.

After the human voice component is separated, the electronic device may perform tooth sound suppression on the human voice component only, and the specific tooth sound suppression method may include the time domain audio processing method corresponding to fig. 3 and the frequency domain audio processing method corresponding to fig. 4, and the electronic device may also use other tooth sound suppression methods, which is not limited by the embodiment of the present application.

(3) And (5) mixing.

After the tooth sound suppression is completed on the voice component, the electronic equipment can carry out sound mixing processing on the non-voice component and the voice component after the tooth sound suppression to obtain an output signal after the tooth sound suppression.

For example, if the non-human voice signal spectrum is U (k) and the human voice signal spectrum after tooth sound suppression is V '(k), after the mixing process, a mixed spectrum X' (k) after tooth sound suppression can be obtained, which satisfies the following formula:

X′(k)＝V′(k)+U(k)k＝0，1...，N-1。

therefore, by processing methods such as voice/non-voice separation, tooth sound suppression, mixing and the like on the input audio, the tooth sound can be suppressed, meanwhile, the damage to non-voice components can be avoided, tone distortion is reduced, and user experience is improved.

It can be understood that, if the above-mentioned time-domain audio processing method corresponding to fig. 3 is adopted, a certain time delay may be introduced into the input audio signal, for example, when the tooth sound suppression is performed through the IIR filter, a time delay of a plurality of sample points may be introduced; or converting the human voice signal in the time domain into the human voice signal in the frequency domain, after tooth sound suppression is performed on the human voice signal in the frequency domain, converting the human voice signal in the frequency domain into the human voice signal in the time domain, and usually introducing about one frame of time delay.

For this case, fig. 7 shows another audio processing method according to the embodiment of the present application, which can address the problem of delay in the audio processing method. The electronic device may add a time delay alignment process to the non-human voice signal based on the corresponding embodiment of fig. 5. The electronic device may implement the delay alignment process by buffering the non-human voice signal, and specifically, the electronic device may buffer silence for a period of time before the non-human voice component, where the period of time may be a period of time delay generated when the human voice component is subjected to tooth suppression. The electronic device may also use other means of implementing a time delay alignment process. The specific delay alignment processing method is not limited in the embodiment of the application. In this way, the relative delays of the human voice signal and the non-human voice signal can be made the same, thereby reducing the delay caused by tooth sound suppression of the human voice signal.

It will be appreciated that if both the separation of human/non-human voice and tooth pitch suppression are implemented in the frequency domain, the electronic device may not perform time delay alignment processing on the non-human voice. This is because the electronic device can be implemented without passing through the filter when the tooth suppression is performed in the frequency domain, and thus no delay occurs due to the tooth suppression performed by the filter.

Therefore, fig. 8 shows another audio processing method provided by the embodiment of the present application, where before the separation of human voice and non-human voice and the suppression of tooth voice, the electronic device may perform time-frequency conversion on the input audio first, and does not perform time delay alignment processing on the non-human voice. In this way, the tooth sound suppression in the frequency domain can not generate extra delay, and the time delay alignment processing process for the non-human voice is reduced, so that the calculation force is saved.

In the audio processing methods shown in fig. 3 and 4, the masking effect of the non-human voice component on the human voice component tooth band is not considered, that is, if the non-human voice component energy is much larger than the human voice component in the tooth band, the tooth tone may not be heard or heard obviously by the user, and thus it is unnecessary to suppress the tooth tone.

In view of this, fig. 9 shows that the embodiment of the present application provides another audio processing method, and the electronic device may combine to consider the masking effect of the non-human voice component on the tooth-tone frequency band of the human voice component, and then perform tooth-tone suppression, so that for the audio with insignificant tooth tone, the electronic device may not perform the suppression processing on the tooth tone, so that the audio with insignificant tooth tone may not be suppressed, thereby ensuring the tone as much as possible.

Specifically, after performing time-frequency conversion on the input audio and performing separation of human voice and non-human voice in the frequency domain, the electronic device may perform tooth voice suppression frequency band energy calculation on the human voice part in the tooth voice band and perform tooth voice masking frequency band energy calculation on the non-human voice part, and then may calculate the perceived energy of the masked tooth voice band.

It will be appreciated that the frequency band of the input audio may be divided into several sub-bands, e.g. the electronic device may divide the frequency band into several sub-bands according to the methods of bark spectrum, equivalent matrix bandwidth (equivalent rectangular bandwidth, ERB), octave, 1/3 octave, uniform sub-band width, etc. Taking the input signal as a 48kHz sampling rate, the frame length as 480 samples, the DFT input sample as 960, and the band division method as a bark spectrum as an example, as shown in table 1 below, the electronic device may divide the frequency domain signal of the input audio into 24 subbands (corresponding to bark subband numbers 0 to 23 in table 1). The frequency band related to tooth sound can comprise 5 frequency bands (corresponding to tooth sound analysis sub-band numbers 0-4 in table 1), and the total tooth sound analysis frequency band range can comprise 4 kHz-10.5 kHz (corresponding to sub-band starting frequency 4000 Hz-sub-band cut-off frequency 10500Hz in table 1).

TABLE 1

The electronic device may calculate the energy of the human voice spectrum V (k) of the tooth sound suppression band, and the human voice energy EV (i) of the tooth sound suppression band may satisfy the following formula:

where i may represent a tooth pitch analysis subband number, and may be an integer of 0 to 4.

The electronic device may calculate the energy of the non-human voice spectrum U (k) of the tooth-tone suppression band, and the non-human voice energy EU (i) of the tooth-tone suppression band may satisfy the following formula:

After computing the human voice energy EV (i) and the non-human voice energy EU (i) of the tooth voice suppression band, the electronic device may compute the masked tooth voice band perceptible energy EV '(i), and the tooth voice band perceptible energy EV' (i) may satisfy the following formula:

here, epsilon may be a positive real number, and epsilon may take on a value of 0.0001,0.003,0.09,0.5,1.0, for example. Epsilon can be used as a parameter to control the extent of the masking effect. Illustratively, when the epsilon value is smaller, the calculated perceived energy EV' (i) value is relatively smaller, the masking effect of the non-human voice component on the human voice component is relatively obvious, and the tooth pitch is relatively insignificant; when the epsilon value is larger, the calculated perceived energy EV' (i) value is relatively larger, the masking effect of the non-human voice component on the human voice component is relatively insignificant, and the tooth sound is relatively significant.

It will be appreciated that when the electronic device performs discrete sampling on the input audio signal, each sample of the audio signal may be obtained, where the samples may be represented by a power of 2, for example, by a power of 16 to 2, and the maximum value range of the sample of the audio signal may be positive or negative to 15 to 2, and may be represented by M, which may also be understood as the upper limit of the value range of the input time domain signal x (n), for example, the upper limit may be the power of 15 to 2. To avoid having an epsilon value that is too large and a tooth pitch that is too pronounced, the electronic device may set an upper limit on the epsilon possible value. The upper limit of epsilon may be less than the maximum value M of the input time domain signal x (n), e.g. the upper limit of epsilon may be less than or equal to M/10.

In addition, epsilon is also related to the hardware of the electronic equipment, and different electronic equipment can adjust epsilon differently. Epsilon can be preset by the electronic equipment, and the specific epsilon is valued, so that the embodiment of the application is not limited.

The masking effect of the non-human voice component on the human voice component may be different for different frequency bands, and therefore the electronic device may take different epsilon values in different frequency bands, i.e. epsilon may be changed to epsilon (i). In this case, the tooth audio band perceptible energy EV' (i) may satisfy the following formula:

Where i may represent a tooth pitch analysis subband number, each may be an integer from 0 to 4, and ε (i) may be a positive real number. It will be appreciated that the values of i may be different and the corresponding epsilon (i) values may be different.

For an ideal sound producing device, the higher the frequency, the less sensitive the human ear hearing is to the audio signal, and the less sensitive the human ear hearing is to the tooth pitch, so when the value of i is larger, the frequency is also relatively larger, and the overall value of epsilon (i) can be correspondingly larger.

Similarly, the electronic device may set the upper limit of ε (i) to be less than or equal to M/10. The value of epsilon (i) is also related to the hardware of the electronic device, and different electronic devices can adjust the value of epsilon (i) differently. Epsilon (i) can be preset by the electronic equipment, and the specific epsilon (i) is valued, so that the embodiment of the application is not limited.

Since speaker sound emitting devices are generally not particularly desirable and may accentuate the tooth tones in certain specific higher frequency bands (e.g., 8.5 kHz-10.5 kHz), it is not excluded that the value of epsilon (i) (e.g., epsilon (4) is 1.2) is smaller than the value of epsilon (i) (e.g., epsilon (3) is 2.3) in certain specific higher frequency bands (e.g., 8.5 kHz-10.5 kHz).

In this way, different parameters are adopted in different frequency bands to control the degree of the masking effect, so that the influence of the masking effect on the audio signal can be reduced, and the user experience is improved.

After computing the masked tooth voice band perceptible energy EV ' (i), the electronic device may perform adaptive suppression processing on the tooth voice band according to the value of EV ' (i), and specifically, the human voice signal spectrum V ' (k) after tooth voice band suppression may satisfy the following formula:

where i may represent a tooth pitch analysis subband number, and may each be an integer from 0 to 4, and then (i) is a tooth pitch energy suppression threshold for the ith tooth pitch band. the value of the v (i) can be preset by the electronic device, in a possible implementation manner, some audio sequences which are easy to generate tooth sounds can be played on the electronic device, for audio sequences with obvious tooth sounds, the energy on the tooth sound sub-band (sequence number i) of the audio sequences can be counted, so that the initial value of the v (i) can be obtained, for example, the median of the band energy of the audio sequences can be obtained, or the average of the band energy of the audio sequences can be obtained. In addition, fine tuning may be performed on the electronic device based on the initial value of the v (i), which may be turned down if there is still a noticeable tooth pick, or may be turned up slightly. The values of the specific values of the values (i) are not limited by the embodiment of the application.

From the above formula, when the perceived energy EV '(i) is smaller than the threshold of the V (i), it is explained that the perceived energy EV' (i) of the tooth sound band is smaller and the tooth sound is relatively insignificant, so the electronic device may not perform energy suppression on the human voice component spectrum V (k). When the perceived energy EV ' (i) is greater than the threshold of the V (i), and the perceived energy EV ' (i) is greater than the threshold of the V (i) by a factor of 4, the perceived energy EV ' (i) of the tooth sound band is greater, the tooth sound is relatively noticeable, and therefore, the electronic device can perform less energy suppression on the human voice component spectrum V (k). When the perceived energy EV ' (i) is greater than the threshold of 4 times of the V (i), it is indicated that the perceived energy EV ' (i) of the tooth band is large and the tooth tone is significant, and therefore, the electronic device may perform a large energy suppression on the human voice component spectrum V (k), for example, the human voice signal spectrum V ' (k) after tooth band suppression may be half of the human voice component spectrum V (k) before tooth band suppression.

It can be understood that 4 times of the v (i) can be used as a judgment condition for the EV' (i) value in the above formula, however, because different electronic devices may have different threshold settings, the 4 times of the v (i) fixed value reduces the flexibility of performing adaptive suppression processing on the tooth audio band, and therefore, the electronic device can flexibly value the multiple of the v (i). By way of example, the electronic device may control the upper limit of the degree of suppression of the tooth audio band by the parameter m, so that too much suppression of the human voice signal of the tooth audio band may be avoided, resulting in significant tone distortion. The human voice signal spectrum V' (k) after the tooth audio band is suppressed, which is specifically related to the parameter m, may satisfy the following formula:

Where m may be a positive real number greater than 1, for example, m may take on the value 1.0001,2,3,3.2,4.7, etc. For example, if m is 3, it means that the suppression of the finger tooth audio band may be at most 3 times.

It will be appreciated that the less pronounced the suppression of tooth tones is if the value of m is small, and the more pronounced the suppression of tooth tones is if the value of m is large. In order to avoid too much suppression of the human voice signal of the tooth audio segment, obvious tone distortion is generated, and the value of m cannot be too large, in a possible implementation, the value of m can be within 1-10. Therefore, the electronic equipment can control the tooth sound suppression within a reasonable range, and the user has better hearing experience.

In addition, the value of m is also related to the hardware of the electronic device, and different electronic devices can adjust the value of m differently. The value of m can be preset by the electronic device, and the embodiment of the application is not limited.

In addition, because the degree of the upper limit of the suppression required by the tooth noise may be different in different frequency bands, the electronic device may take different values for m in different frequency bands, that is, m may be changed to m (i). In this case, the human voice signal spectrum V' (k) after the tooth audio band suppression specifically related to m (i) may satisfy the following formula:

Where i may represent a tooth pitch analysis subband number, each may be an integer from 0 to 4, and m (i) may be a positive real number. It will be appreciated that the values of i may be different and the corresponding values of m (i) may be different. Similarly, the electronic device can set the value range of m (i) to be within 1-10. The value of m (i) is also related to the hardware of the electronic device, and different electronic devices can adjust the value of m (i) differently. In a possible implementation, m (i) may be first valued in the vicinity of 2, and may be readjusted for different electronic devices until the tooth pitch is not apparent. m (i) may be preset by the electronic device, and the specific value of m (i) is not limited in the embodiment of the present application.

Therefore, different parameters are adopted in different frequency bands to control the upper limit of the tooth sound frequency band inhibition degree, the influence of the electronic equipment on the tooth sound inhibition degree can be reduced, and further user experience is improved.

In addition, the tooth sound band is large in energy, and does not necessarily represent that the sound component is tooth sound, and for example, there is a possibility that some people may have overlapping of voiced sound and tooth sound. In order to improve accuracy of tooth sound suppression, the embodiment of the application can introduce specific frequency spectrum flatness characteristics (specific spectrum flatness measure, SSFM) to the tooth audio segment of the human voice component. SSFM may satisfy the following formula:

Where k may be the number of tooth audio bands, k is a positive integer, for example k may take on a value of 5.

In a possible implementation, the SSFM value may be compared to a threshold thr for determining whether tooth tones are present. For example, if the SSFM value is greater than the threshold thr, then a tooth sound may be considered to be present, with flag being 1; if the SSFM value is less than or equal to the threshold thr, then no tooth noise may be considered to be present, the flag is marked as 0, and the specific flag may satisfy the following formula:

where thr may be a positive real number greater than 1, for example thr may take values of 4.0,5.8, 10, 11.3, etc. the value of thr may be preset by the electronic device, and the embodiment of the present application is not limited.

The human voice signal spectrum V' (k) after the tooth audio band is specifically related to flag is suppressed may satisfy the following formula:

when flag is 0, it may be considered that there is no tooth sound, and the electronic device may not suppress the tooth sound band. When flag is 1, it can be considered that there is a tooth sound, and the electronic device can suppress the tooth sound band. Therefore, the electronic equipment can more accurately judge whether the tooth sound exists in the tooth sound frequency band, and the electronic equipment can not inhibit the tooth sound for the frequency band without the tooth sound, so that the accuracy of the tooth sound inhibition is improved.

After the tooth sound suppression processing is completed, the electronic device can perform sound mixing processing and frequency-time conversion processing on the human voice part and the non-human voice part, so that output audio is obtained. Specific mixing processing and frequency-time conversion processing may refer to the related description in the embodiment corresponding to fig. 5, and will not be described again.

The method according to the embodiment of the present application will be described in detail by way of specific examples. The following embodiments may be combined with each other or implemented independently, and the same or similar concepts or processes may not be described in detail in some embodiments.

Fig. 10 shows an audio processing method of an embodiment of the present application. The method comprises the following steps:

s1001, the electronic equipment acquires a first audio signal.

In the embodiment of the application, the first audio signal can be understood as an input audio signal of the electronic device.

S1002, the electronic device separates a human voice component and a non-human voice component of the first audio signal.

In the embodiment of the application, the electronic equipment can adopt a traditional signal processing method or different methods based on an NN method and the like to separate the voice from the non-voice of the input audio. The specific method for implementing the separation of human voice and non-human voice may refer to the related description of the separation part of human voice and non-human voice in the embodiment corresponding to fig. 5, which is not repeated.

And S1003, the electronic equipment performs energy suppression on the tooth sound in the human voice component.

In the embodiment of the present application, the electronic device may perform energy suppression on the tooth sound in the human voice component, which may include the time domain audio processing method corresponding to fig. 3 and the frequency domain audio processing method corresponding to fig. 4, and other tooth sound suppression methods may also be used in the electronic device.

S1004, the electronic equipment mixes sound according to the non-human sound component and the human sound component with the tooth sound suppressed to obtain a second audio signal.

In the embodiment of the present application, the method for mixing audio may refer to the description related to the audio mixing portion in the embodiment (3) corresponding to fig. 5, which is not repeated. The second audio signal may be understood as an output audio signal of the electronic device.

It may be understood that the non-human voice component may be the non-human voice component separated by the electronic device in step S1002, or may be a non-human voice component obtained by some processing performed by the electronic device on the separated non-human voice component, where the processing includes an amplification or a reduction of a gain of the non-human voice component, and the embodiment of the present application is not limited to this processing.

In addition, the electronic device mixes sound according to the non-human sound component and the human sound component after the tooth sound suppression, and may include: mixing according to the sum of the non-human voice component spectrum and the human voice component spectrum after tooth sound suppression; or mixing the sound according to different weights of the non-human sound component and the human sound component after tooth sound suppression; or mixing according to the multiple of the sum of the non-human voice component and the human voice component after tooth sound suppression, and the specific electronic equipment performs the calculation process of mixing according to the non-human voice component and the human voice component after tooth sound suppression.

The electronic equipment firstly performs voice/non-voice separation on the input audio to obtain a voice component and a non-voice component, and then the electronic equipment can perform tooth sound suppression on the voice component, so that the suppression on the tooth sound can be realized, meanwhile, the damage to the non-voice component can be avoided, the tone distortion is reduced, and the user experience is improved.

Optionally, on the basis of the embodiment corresponding to fig. 10, after the electronic device in step S1001 acquires the first audio signal, the method may include: the electronic device transforms the first audio signal from a time domain signal to a frequency domain signal; the separating the human voice component and the non-human voice component of the first audio signal by the electronic device may include: the electronic equipment separates a human sound component and a non-human sound component of the first audio signal in a frequency domain; the electronic device energy suppressing of the tooth sound in the human voice component may include: the electronic equipment performs energy suppression on tooth sounds in the human voice component in a frequency domain; the electronic device mixes sound according to the non-human sound component and the human sound component after tooth sound suppression to obtain a second audio signal, which may include: after the electronic equipment mixes sound according to the non-human sound component and the human sound component after tooth sound suppression in the frequency domain, the electronic equipment transforms the mixed sound signal from the frequency domain signal to the time domain signal to obtain a second audio signal.

In the embodiment of the application, the electronic equipment performs time-frequency conversion on the first audio signal, so that the separation of the human voice and the non-human voice and the tooth sound suppression are realized in the frequency domain, and therefore, the tooth sound suppression in the frequency domain can not generate extra delay, and further, the time delay alignment processing process of the non-human voice can be reduced, thereby saving the calculation force.

Optionally, based on the embodiment corresponding to fig. 10, the electronic device in step S1003 may perform energy suppression on the tooth sound in the voice component, and may include: the electronic equipment suppresses the energy of the tooth sound in the human voice component according to the perceived energy and the frequency spectrum of the human voice component; the perceptual energy is the perceptual energy of the tooth audio band in the first audio signal, the perceptual energy is in direct proportion to the first energy, the perceptual energy is in inverse proportion to the second energy, the first energy is the energy of the tooth audio band in the human sound component, and the second energy is the energy of the tooth audio band in the non-human sound component.

In the embodiment of the application, the energy suppression of the tooth sound in the human voice component is related to the perceived energy and the frequency spectrum of the human voice component, for example, when the perceived energy is smaller, the tooth sound is relatively unobvious, and the electronic equipment can perform smaller energy suppression or no energy suppression on the human voice component; when the perceived energy is larger, the tooth sound is relatively obvious, and the electronic equipment can restrain the human voice component more energy.

The perceived energy is proportional to the energy of the tooth audio band in the human voice component and the perceived energy is inversely proportional to the energy of the tooth audio band in the non-human voice component. When the energy of the tooth sound frequency band in the human voice component is larger, the tooth sound is obvious, and the perceived energy is larger at the moment; when the energy of the tooth sound band in the non-human sound component is large, the non-human sound component can mask the tooth sound so that the tooth sound becomes insignificant, and the perceived energy is small.

The embodiment of the application considers the masking effect of the non-human sound component on the tooth sound, and the effect can carry out smaller energy inhibition or no energy inhibition on the unobvious tooth sound, thereby ensuring the tone quality of the original first audio signal.

Alternatively, on the basis of the embodiment corresponding to fig. 10, it may include: the perceived energy satisfies the following formula:

where EV' (i) is a value of the perceptible energy, EV (i) is a value of the first energy, EU (i) is a value of the second energy, i is a tooth tone subband number, the tooth tone subband includes a plurality of tooth tone subbands, and epsilon is a control parameter of the perceptible energy.

In the embodiment of the present application, the formula of the perceptible energy may refer to the description of the perceptible energy in the embodiment corresponding to fig. 9, which is not repeated. The degree of masking effect can be controlled by introducing epsilon into the perceivable energy formula, so that tone quality is ensured, and user experience is improved.

Alternatively, on the basis of the embodiment corresponding to fig. 10, it may include: the control parameters epsilon of the perceived energy corresponding to the different tooth pitch sub-bands i are different.

In the embodiment of the present application, the control parameters epsilon of the perceived energy corresponding to different tooth tone sub-bands i are different, and epsilon (i) may be used to represent epsilon (i), and specific epsilon (i) may refer to the related description of epsilon (i) in the embodiment corresponding to fig. 9, which is not repeated. In this way, different parameters are adopted in different frequency bands to control the degree of the masking effect, so that the influence of the masking effect on the audio signal can be reduced, and the user experience is improved.

Optionally, based on the embodiment corresponding to fig. 10, the electronic device in step S1003 performs energy suppression on the tooth sound in the voice component, and may satisfy the following formula:

where V '(k) is the spectrum of the human voice component after tooth pitch suppression, V (k) is the spectrum of the human voice component, EV' (i) is the value of the perceived energy, m is the suppression degree parameter, and then (i) is the tooth pitch energy suppression threshold value of the i-th tooth pitch band.

In the embodiment of the present application, the energy suppression formula may refer to the description related to the energy suppression formula with m in the embodiment corresponding to fig. 9, which is not repeated. The electronic equipment controls the upper limit of the suppression degree of the tooth audio frequency band through the parameter m, so that the excessive suppression of the human voice signal of the tooth audio frequency band can be avoided, and obvious tone distortion is generated.

Optionally, based on the embodiment corresponding to fig. 10, the suppression degree parameters m corresponding to different tooth sound sub-bands i are different, and the electronic device in step S1003 performs energy suppression on tooth sound in the human voice component, so that the following formula may be satisfied:

where V '(k) is the spectrum of the human voice component after tooth pitch suppression, V (k) is the spectrum of the human voice component, EV' (i) is the value of the perceived energy, m (i) is the suppression degree parameter, and then (i) is the tooth pitch energy suppression threshold value of the i-th tooth pitch band.

In the embodiment of the present application, the energy suppression formula may refer to the description related to the energy suppression formula with m (i) in the embodiment corresponding to fig. 9, which is not repeated. Therefore, different parameters are adopted in different frequency bands to control the upper limit of the tooth sound frequency band inhibition degree, the influence of the electronic equipment on the tooth sound inhibition degree can be reduced, and further user experience is improved.

Optionally, before the electronic device in step S1003 performs energy suppression on the tooth sound in the voice component, on the basis of the embodiment corresponding to fig. 10, the method may include: the electronic equipment sets a zone bit according to whether the voice component comprises tooth sound or not, wherein the zone bit comprises a first value or a second value, the first value indicates that the voice component comprises tooth sound, and the second value indicates that the voice component does not comprise tooth sound; the electronic device energy suppressing of the tooth sound in the human voice component may include: if the flag bit is the first value, the electronic equipment performs energy suppression on the tooth sound in the human voice component.

In the embodiment of the present application, whether the noise component includes the tooth noise may be determined according to the spectrum flatness feature SSFM, or may be determined according to other manners, which is not limited by the embodiment of the present application. Specific SSFM may refer to the related description in the embodiment corresponding to fig. 9, and will not be described again.

The flag bit may indicate whether the voice component includes tooth voice, and the value of the flag bit may be a first value or a second value, which may be understood that the data type of the first value or the second value may be integer type, boolean type, character string type, etc., which is not limited in the embodiment of the present application. By way of example, the first value or the second value may be integer, e.g. a first value of 1 indicates the presence of a tooth sound, and the electronic device may energy suppress the tooth sound in the human voice component; a second value of 0 indicates that no tooth sound is present and the electronic device may not energy suppress tooth sound in the human voice component. The value of the first value or the second value is not limited in this embodiment of the present application.

Therefore, the electronic equipment can determine whether to carry out energy suppression on the tooth sound in the voice component according to the zone bit, and can more accurately judge whether the tooth sound exists in the tooth sound frequency band, so that the accuracy of tooth sound suppression is improved.

Optionally, on the basis of the embodiment corresponding to fig. 10, the method further includes: if the flag bit is the second value, the electronic device does not perform energy suppression on the tooth sound in the human voice component.

In the embodiment of the application, the electronic equipment does not need to inhibit the tooth sound for the frequency band without the tooth sound, so that unnecessary calculation is reduced, and the calculation force is saved.

wherein V '(k) is the spectrum of the human voice component after tooth sound suppression, V (k) is the spectrum of the human voice component, EV' (i) is the value of the perceived energy, m (i) is the suppression degree parameter, and then (i) is the tooth sound energy suppression threshold value of the ith tooth sound band, and flag is the flag bit.

In the embodiment of the present application, the energy suppression formula may refer to the description related to the energy suppression formula with the flag bit in the embodiment corresponding to fig. 9, and will not be repeated. Therefore, the electronic equipment can more accurately judge whether the tooth sound exists in the tooth sound frequency band, and the electronic equipment can not inhibit the tooth sound for the frequency band without the tooth sound, so that the accuracy of the tooth sound inhibition is improved, and the user experience is improved.

Optionally, on the basis of the embodiment corresponding to fig. 10, before the electronic device in step S1004 mixes the non-human voice component with the human voice component after the tooth sound suppression, the method may include: and the electronic equipment performs time delay alignment on the non-human voice component and the human voice component after tooth sound suppression.

In the embodiment of the present application, the time delay alignment may refer to the related description in the embodiment corresponding to fig. 7, which is not repeated. The time delay alignment may be such that the relative time delays of the human voice signal and the non-human voice signal are the same, thereby reducing the time delay due to tooth pitch suppression of the human voice signal.

Optionally, on the basis of the embodiment corresponding to fig. 10, the performing, by the electronic device, time delay alignment on the non-human voice component and the human voice component after tooth sound suppression may include: the electronic device buffers silence for a period of time before the non-human voice component, wherein the period of time is a time delay period generated when the human voice component is subjected to tooth sound suppression.

In the embodiment of the application, the electronic equipment can make the relative time delay of the human voice signal and the non-human voice signal identical by buffering the silence for a period of time before the non-human voice component, thereby reducing the time delay caused by tooth sound suppression of the human voice signal and improving the user experience.

It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.

The foregoing description of the solution provided by the embodiments of the present application has been mainly presented in terms of a method. To achieve the above functions, it includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the present application may be implemented in hardware or a combination of hardware and computer software, as the method steps of the examples described in connection with the embodiments disclosed herein. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The embodiment of the application can divide the functional modules of the device for realizing the method according to the method example, for example, each functional module can be divided corresponding to each function, and two or more functions can be integrated in one processing module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present application, the division of the modules is schematic, which is merely a logic function division, and other division manners may be implemented in actual implementation.

Fig. 11 is a schematic structural diagram of a chip according to an embodiment of the present application. Chip 1100 includes one or more (including two) processors 1101, communication lines 1102, a communication interface 1103, and a memory 1104.

In some implementations, the memory 1104 stores the following elements: executable modules or data structures, or a subset thereof, or an extended set thereof.

The method described in the above embodiments of the present application may be applied to the processor 1101 or implemented by the processor 1101. The processor 1101 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuitry in hardware in the processor 1101 or instructions in software. The processor 1101 may be a general purpose processor (e.g., a microprocessor or a conventional processor), a digital signal processor (digital signal processing, DSP), an application specific integrated circuit (application specific integrated circuit, ASIC), an off-the-shelf programmable gate array (field-programmable gate array, FPGA) or other programmable logic device, discrete gates, transistor logic, or discrete hardware components, and the processor 1101 may implement or perform the methods, steps, and logic blocks related to the disclosed processes in the embodiments of the present application.

The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a state-of-the-art storage medium such as random access memory, read-only memory, programmable read-only memory, or charged erasable programmable memory (electrically erasable programmable read only memory, EEPROM). The storage medium is located in the memory 1104, and the processor 1101 reads information in the memory 1104 and performs the steps of the above method in combination with its hardware.

The processor 1101, the memory 1104, and the communication interface 1103 may communicate with each other via a communication line 1102.

In the above embodiments, the instructions stored by the memory for execution by the processor may be implemented in the form of a computer program product. The computer program product may be written in the memory in advance, or may be downloaded in the form of software and installed in the memory.

Embodiments of the present application also provide a computer program product comprising one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, fiber optic, digital subscriber line (digital subscriber line, DSL), or wireless (e.g., infrared, wireless, microwave, etc.), or semiconductor medium (e.g., solid state disk, SSD)) or the like.

The embodiment of the application also provides a computer readable storage medium. The methods described in the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. Computer readable media can include computer storage media and communication media and can include any medium that can transfer a computer program from one place to another. The storage media may be any target media that is accessible by a computer.

As one possible design, the computer-readable medium may include compact disk read-only memory (CD-ROM), RAM, ROM, EEPROM, or other optical disk memory; the computer readable medium may include disk storage or other disk storage devices. Moreover, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, digital versatile disc (digital versatiledisc, DVD), floppy disk and blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.

Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims

1. A method of audio processing, the method comprising:

the electronic equipment acquires a first audio signal;

the electronic device separates a human voice component and a non-human voice component of the first audio signal;

the electronic equipment performs energy suppression on tooth sounds in the voice component;

the electronic equipment mixes sound according to the non-human sound component and the human sound component after tooth sound suppression to obtain a second audio signal;

The electronic device performs energy suppression on tooth sounds in the human voice component, including:

the electronic equipment suppresses the energy of the tooth sound in the human voice component according to the perceived energy and the frequency spectrum of the human voice component;

wherein the perceptible energy is a perceptible energy of a tooth audio band in the first audio signal, the perceptible energy being proportional to a first energy, the perceptible energy being inversely proportional to a second energy, the first energy being an energy of a tooth audio band in the human voice component, the second energy being an energy of a tooth audio band in the non-human voice component;

the smaller the perceived energy, the more obvious the masking effect of the non-human voice component on the human voice component, the less obvious the tooth pitch in the human voice component; the greater the perceived energy, the less pronounced the masking effect of the non-human voice component on the human voice component, the more pronounced the tooth pitch in the human voice component.

2. The method of claim 1, wherein after the electronic device obtains the first audio signal, comprising:

the electronic device transforms the first audio signal from a time domain signal to a frequency domain signal;

The electronic device separating a human voice component and a non-human voice component of the first audio signal, comprising: the electronic device separating the human voice component and the non-human voice component of the first audio signal in a frequency domain;

the electronic device performs energy suppression on tooth sounds in the human voice component, including: the electronic equipment performs energy suppression on tooth sounds in the human voice component in a frequency domain;

the electronic equipment mixes sound according to the non-human sound component and the human sound component after tooth sound suppression to obtain a second audio signal, and the method comprises the following steps: and after the electronic equipment mixes the sound in the frequency domain according to the non-human sound component and the human sound component after the tooth sound suppression, the electronic equipment converts the mixed sound signal from a frequency domain signal to a time domain signal to obtain the second audio signal.

3. The method according to claim 1 or 2, wherein the perceptible energy satisfies the following formula:

wherein EV' (i) is the value of the perceptible energy, EV (i) is the value of the first energy, EU (i) is the value of the second energy, i is a tooth tone subband number, the tooth tone subband includes a plurality of tooth tone subbands, and epsilon is a control parameter of the perceptible energy.

4. A method according to claim 3, characterized in that the control parameters epsilon of the perceived energy corresponding to different tooth sound sub-bands i are different.

5. The method of any one of claims 1-4, wherein the electronic device energy suppresses the tooth pitch in the human voice component, satisfying the following formula:

wherein V '(k) is the frequency spectrum of the human voice component after the tooth noise suppression, V (k) is the frequency spectrum of the human voice component, EV' (i) is the value of the perceivable energy, m is the suppression degree parameter, and then (i) is the tooth noise energy suppression threshold value of the ith tooth noise band.

6. The method according to any one of claims 1-4, wherein the suppression degree parameters m corresponding to different tooth sound sub-bands i are different, and the electronic device performs energy suppression on tooth sound in the human voice component, so as to satisfy the following formula:

wherein V '(k) is the frequency spectrum of the human voice component after the tooth sound suppression, V (k) is the frequency spectrum of the human voice component, EV' (i) is the value of the perceivable energy, m (i) is the suppression degree parameter, and then (i) is the tooth sound energy suppression threshold value of the ith tooth sound band.

7. The method of any of claims 1-4, wherein before the electronic device energy-suppresses the tooth pitch in the human voice component, comprising:

The electronic equipment sets a zone bit according to whether the voice component comprises tooth sound or not, wherein the zone bit comprises a first value or a second value, the first value indicates that the voice component comprises tooth sound, and the second value indicates that the voice component does not comprise tooth sound;

the electronic device energy suppressing the tooth sound in the human voice component includes: and if the flag bit is the first value, the electronic equipment performs energy suppression on the tooth sound in the voice component.

8. The method as recited in claim 7, further comprising:

and if the flag bit is the second value, the electronic equipment does not inhibit the energy of the tooth sound in the voice component.

9. The method of claim 8, wherein the electronic device energy suppresses the tooth pitch in the human voice component to satisfy the following formula:

wherein V '(k) is the frequency spectrum of the human voice component after the tooth sound suppression, V (k) is the frequency spectrum of the human voice component, EV' (i) is the value of the perceivable energy, m (i) is the suppression degree parameter, then (i) is the tooth sound energy suppression threshold value of the ith tooth sound band, and flag is the flag bit.

10. The method of claim 1, wherein the electronic device includes, prior to mixing the non-human voice component with the tooth suppressed human voice component: and the electronic equipment performs time delay alignment on the non-human voice component and the human voice component after tooth sound suppression.

11. The method of claim 10, wherein the electronic device time-delay aligns the non-human voice component and the tooth-suppressed human voice component, comprising:

the electronic device caches silence for a period of time before the non-human voice component, wherein the period of time is a time delay period generated when the human voice component is subjected to tooth sound suppression.

12. An electronic device, comprising: a memory for storing a computer program and a processor for executing the computer program to perform the method of any of claims 1-11.

13. A computer readable storage medium storing instructions that, when executed, cause a computer to perform the method of any one of claims 1-11.