WO2023000778A9 - Audio signal processing method and related electronic device - Google Patents


Info

Publication number
WO2023000778A9
WO2023000778A9 (PCT/CN2022/092367; CN2022092367W)
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
audio
type
frequency domain
signal
Prior art date
Application number
PCT/CN2022/092367
Other languages
French (fr)
Chinese (zh)
Other versions
WO2023000778A1 (en)
Inventor
胡贝贝 (Hu Beibei)
许剑峰 (Xu Jianfeng)
Original Assignee
北京荣耀终端有限公司 (Beijing Honor Device Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京荣耀终端有限公司 (Beijing Honor Device Co., Ltd.)
Publication of WO2023000778A1 publication Critical patent/WO2023000778A1/en
Publication of WO2023000778A9 publication Critical patent/WO2023000778A9/en

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003: Changing voice quality, e.g. pitch or formants
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • G10L21/0316: Speech enhancement by changing the amplitude
    • G10L21/0324: Details of processing therefor
    • G10L21/034: Automatic adjustment
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M1/00: Substation equipment, e.g. for use by subscribers
    • H04M1/72: Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724: User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403: User interfaces with means for local support of applications that increase the functionality
    • H04M1/7243: User interfaces with interactive means for internal management of messages
    • H04M1/72433: User interfaces for voice messaging, e.g. dictaphones
    • H04M1/72442: User interfaces for playing music files
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233: Processing of audio elementary streams

Definitions

  • the present application relates to the field of audio signal processing, in particular to an audio signal processing method and related electronic equipment.
  • For such audio signals, the frequency-point energy in the frequency domain is concentrated in a narrow bandwidth (for example, in piano music), the frequency-domain energy distribution is uneven, and the duration is long; subjectively, a buzzing noise can be heard, mainly because the excessively concentrated, long-lasting narrow-band energy causes the speaker to produce nonlinear distortion during electro-acoustic conversion.
  • the sound source corresponding to this type of audio signal is called the first type of sound source.
  • Traditional methods of processing the first type of sound source can falsely suppress other types of sound sources, such as human voices, which produces an audible subjective artifact when such a sound source transitions to or from the first type of sound source. Therefore, how to suppress the noise of the first type of sound source while avoiding false suppression of other sound sources is a problem of concern to technicians.
  • An embodiment of the present application provides an audio signal processing method, which solves the problem that, while suppressing audio signals prone to noise, other types of audio signals are wrongly suppressed and thereby distorted.
  • An embodiment of the present application provides a method for processing an audio signal, including: acquiring an audio signal; when the tonality value of the audio signal is greater than or equal to a first threshold and the sound source type of the audio signal is the first type of sound source, processing the audio signal using a first type of suppression strategy; otherwise, processing the audio signal using a second type of suppression strategy.
  • In this way, whether the audio signal is a signal prone to noise (the first type of sound source) is determined based on whether the sound source type of the audio signal is the first type of sound source.
  • the first type of suppression strategy is to suppress a single peak or multiple peaks in the frequency domain of the audio signal.
  • the second type of suppression strategy is to suppress a single peak or multiple peaks in the audio signal; or not to suppress the audio signal.
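The strategy selection described above can be sketched as follows; the threshold value and the string labels are illustrative assumptions, not values from the application:

```python
FIRST_THRESHOLD = 0.85  # hypothetical tonality threshold

def choose_strategy(tonality_value, source_type):
    """Pick a suppression strategy for one audio frame.

    A frame is treated as noise-prone only when it is both highly
    tonal and classified as the first type of sound source.
    """
    if tonality_value >= FIRST_THRESHOLD and source_type == "first":
        return "first_strategy"   # suppress single/multiple frequency-domain peaks
    return "second_strategy"      # milder suppression, or pass-through
```

Both conditions must hold: a highly tonal human voice, for example, still takes the second strategy because its source type is not the first type.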
  • the method includes: performing tonality calculation on the audio signal to obtain a tonality value of the audio signal.
  • In this way, it is beneficial for the electronic device to adopt different suppression strategies for the audio signal based on the tonality value and the sound source type of the audio signal.
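The application does not spell out the tonality formula here. As a hedged illustration, spectral flatness (the ratio of the geometric to the arithmetic mean of the power spectrum) is a common proxy: flatness is near 1 for noise-like frames and near 0 for tonal ones, so `1 - flatness` behaves like a tonality value:

```python
import numpy as np

def tonality_value(frame, eps=1e-12):
    """Illustrative tonality measure: 1 - spectral flatness.

    Spectral flatness = geometric mean / arithmetic mean of the
    power spectrum; tonal content drives it toward 0, so this
    function returns values near 1 for tonal frames.
    """
    power = np.abs(np.fft.rfft(frame)) ** 2 + eps
    flatness = np.exp(np.mean(np.log(power))) / np.mean(power)
    return 1.0 - flatness
```

A pure sine scores close to 1 under this measure, while white noise scores much lower, which is the ordering the first-threshold comparison relies on.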
  • Before the audio signal is processed using the first type of suppression strategy, the method also includes: performing peak detection on the audio signal, the peak detection being used to obtain the peak information of the audio signal in the frequency domain.
  • the electronic device can acquire the peak value of the audio signal, calculate the difference gain according to the peak value information, and suppress the audio signal according to the difference gain.
  • the sound source type of the audio signal is the first type of sound source and the tonality value of the audio signal is greater than or equal to the first threshold, peak suppression is performed on the audio signal to change the speaker input signal and reduce playback noise.
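One plausible form of the peak-detection step above, sketched with NumPy; the number of peaks kept is an arbitrary choice, not one fixed by the application:

```python
import numpy as np

def detect_peaks(frame, num_peaks=3):
    """Find the largest spectral peaks of one audio frame.

    Returns (bin index, magnitude) pairs, strongest first. A bin
    counts as a local peak when it exceeds both neighbours.
    """
    mag = np.abs(np.fft.rfft(frame))
    local = (mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:])
    idx = np.flatnonzero(local) + 1
    order = idx[np.argsort(mag[idx])[::-1][:num_peaks]]
    return [(int(i), float(mag[i])) for i in order]
```

The returned peak information (at least the maximum peak, per the text above) is what the difference-gain calculation would consume.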
  • an embodiment of the present application provides an electronic device, which includes: one or more processors and a memory; the memory is coupled to the one or more processors, and the memory is used to store computer program codes,
  • the computer program code includes computer instructions, and the one or more processors invoke the computer instructions to cause the electronic device to perform: acquiring an audio signal; when the tonality value of the audio signal is greater than or equal to a first threshold and the sound source type of the audio signal is the first type of sound source, processing the audio signal using a first type of suppression strategy; otherwise, processing the audio signal using a second type of suppression strategy.
  • the one or more processors are further configured to invoke the computer instructions so that the electronic device executes: performing tonality calculation on the audio signal to obtain the tonality value of the audio signal.
  • the one or more processors are further configured to invoke the computer instructions to make the electronic device execute: performing peak detection on the audio signal, the peak detection being used to acquire the peak information of the audio signal in the frequency domain.
  • the one or more processors are further configured to call the computer instruction so that the electronic device executes: calculating the difference between the peak value of the audio signal and the second threshold;
  • the peak value includes at least the maximum peak value of the audio signal in the frequency domain;
  • the difference gain of the peak value is calculated based on the difference value;
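The application does not give the gain formula; as a hedged sketch, one simple mapping computes the difference between the peak level and the second threshold in dB, and pulls a peak that exceeds the threshold back down to it:

```python
def difference_gain(peak_db, second_threshold_db):
    """Illustrative difference gain (linear scale).

    The peak is attenuated by the amount, in dB, that it exceeds
    the second threshold; peaks below the threshold are untouched.
    This linear mapping is an assumption, not the patent's formula.
    """
    diff = peak_db - second_threshold_db
    if diff <= 0:
        return 1.0                 # peak under threshold: leave as-is
    return 10.0 ** (-diff / 20.0)  # bring the peak down to the threshold
```

Applied to the frequency-domain peak bins, this gain suppresses only the over-threshold energy rather than the whole signal.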
  • an embodiment of the present application provides an electronic device, including: a touch screen, a camera, one or more processors, and one or more memories; the one or more processors and the touch screen , the camera, the one or more memories are coupled, the one or more memories are used to store computer program codes, the computer program codes include computer instructions, and when the one or more processors execute the computer instructions , so that the electronic device executes the method described in the first aspect or any possible implementation manner of the first aspect.
  • an embodiment of the present application provides a chip system, which is applied to an electronic device; the chip system includes one or more processors, and the processor is used to call computer instructions so that the electronic device executes the method described in the first aspect or any possible implementation manner of the first aspect.
  • An embodiment of the present application provides a computer program product containing instructions; when the computer program product is run on an electronic device, the electronic device is caused to execute the method described in the first aspect or any possible implementation manner of the first aspect.
  • An embodiment of the present application provides a computer-readable storage medium, including instructions; when the instructions are run on the electronic device, the electronic device executes the method described in the first aspect or any possible implementation manner of the first aspect.
  • FIGS. 1A-1C are schematic diagrams of an application scenario provided by the embodiment of the present application.
  • FIG. 2 is a system architecture diagram of an electronic device processing audio signals provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of a hardware structure of an electronic device 100 provided by an embodiment of the present application.
  • FIG. 4 is a software structural block diagram of the electronic device 100 provided by the embodiment of the present application.
  • FIGS. 5A-5D are diagrams of tonality value calculation results provided by an embodiment of the present application.
  • FIG. 6 is a flow chart of processing an audio signal provided by an embodiment of the present application.
  • FIGS. 7A-7C are diagrams of the audio application startup interface provided by the embodiment of the present application.
  • FIG. 8 is a waveform diagram of a frequency domain signal provided by an embodiment of the present application.
  • a unit may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or distributed between two or more computers.
  • these units can execute from various computer readable media having various data structures stored thereon.
  • a unit may, for example, communicate through local and/or remote processes based on a signal having one or more data packets (e.g., data from one unit interacting with another unit in a local system, in a distributed system, and/or across a network, such as the Internet, with other systems by way of the signal).
  • FIG. 1A when the electronic device 100 detects an input operation (for example, click) on the music application icon 1011 , it will enter the main interface 102 of the music application as shown in FIG. 1B .
  • the electronic device 100 displays a music playing interface 103 as shown in FIG. 1C, and the music application plays music at this time. While the music application is playing music, the electronic device 100 processes the audio signal of the music in real time, so as to ensure that the music played by the music application does not produce noise, thereby bringing a good music experience to the user.
  • FIG. 2 is a system architecture diagram of an electronic device processing audio signals provided by an embodiment of the present application.
  • the system architecture includes an audio application, a mixing thread module, and an audio driver.
  • the audio application may be music player software or video software.
  • an audio application plays audio through a speaker, it processes the audio signal in real time.
  • the audio application sends the audio signal to the audio mixing thread module, and the audio mixing thread module detects whether the sound source type of the audio signal is the first type of sound source. If so, the audio signal is processed (e.g., the energy of the audio signal is suppressed). Then, the mixing thread module sends the processed audio signal to the audio driver, the audio driver sends it to the speaker, and the speaker outputs audio.
  • the electronic device can complete the processing of the audio signal in real time, so that the audio emitted through the speaker has no noise.
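The mixing-thread flow above can be sketched as a toy class; the class name, the string label, and the 0.5 attenuation stand-in are all illustrative, not details from the application:

```python
class MixerThread:
    """Toy sketch of the mixing-thread flow: receive a frame from the
    audio application, suppress it if its sound source is the first
    (noise-prone) type, then hand it to the audio driver callback."""

    def __init__(self, driver):
        self.driver = driver  # callable standing in for the audio driver

    def on_frame(self, frame, source_type):
        if source_type == "first":
            # Stand-in for the real peak-suppression processing.
            frame = [0.5 * s for s in frame]
        self.driver(frame)
```

In the real system the "driver" is the kernel audio driver feeding the speaker; here a plain callback keeps the sketch self-contained.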
  • FIG. 3 is a schematic diagram of a hardware structure of an electronic device 100 provided by an embodiment of the present application.
  • the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone jack 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, bone conduction sensor 180M, etc.
  • the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the electronic device 100 .
  • the electronic device 100 may include more or fewer components than shown in the figure, or combine certain components, or separate certain components, or arrange different components.
  • the illustrated components can be realized in hardware, software or a combination of software and hardware.
  • the processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. The different processing units may be independent devices, or may be integrated in one or more processors.
  • the wireless communication function of the electronic device 100 can be realized by the antenna 1 , the antenna 2 , the mobile communication module 150 , the wireless communication module 160 , a modem processor, a baseband processor, and the like.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in electronic device 100 may be used to cover single or multiple communication frequency bands. Different antennas can also be multiplexed to improve the utilization of the antennas.
  • Antenna 1 can be multiplexed as a diversity antenna of a wireless local area network.
  • the antenna may be used in conjunction with a tuning switch.
  • the mobile communication module 150 can provide wireless communication solutions including 2G/3G/4G/5G applied on the electronic device 100 .
  • the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA) and the like.
  • the mobile communication module 150 can receive electromagnetic waves through the antenna 1, filter and amplify the received electromagnetic waves, and send them to the modem processor for demodulation.
  • the mobile communication module 150 can also amplify the signals modulated by the modem processor, and convert them into electromagnetic waves and radiate them through the antenna 1 .
  • at least part of the functional modules of the mobile communication module 150 may be set in the processor 110 .
  • at least part of the functional modules of the mobile communication module 150 and at least part of the modules of the processor 110 may be set in the same device.
  • the wireless communication module 160 can provide wireless communication solutions applied on the electronic device 100, including wireless local area network (WLAN) (such as a Wi-Fi network), Bluetooth (BT), BLE broadcasting, global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2 , frequency-modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110 .
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110 , frequency-modulate it, amplify it, and convert it into electromagnetic waves through the antenna 2 for radiation.
  • the electronic device 100 realizes the display function through the GPU, the display screen 194 , and the application processor.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor. GPUs are used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
  • the display screen 194 is used to display images, videos and the like.
  • the display screen 194 includes a display panel.
  • the display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, quantum dot light-emitting diodes (QLED), etc.
  • the electronic device 100 may include 1 or N display screens 194 , where N is a positive integer greater than 1.
  • the electronic device 100 can realize the shooting function through the ISP, the camera 193 , the video codec, the GPU, the display screen 194 and the application processor.
  • the ISP is used for processing the data fed back by the camera 193 .
  • the light is transmitted to the photosensitive element of the camera through the lens, and the light signal is converted into an electrical signal, and the photosensitive element of the camera transmits the electrical signal to the ISP for processing, and converts it into an image visible to the naked eye.
  • ISP can also perform algorithm optimization on image noise, brightness, and skin color.
  • ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
  • the ISP may be located in the camera 193 .
  • Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
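The per-frequency-point energy mentioned above is the kind of quantity obtained via a Fourier transform; a minimal NumPy sketch:

```python
import numpy as np

def bin_energy(frame, k):
    """Energy of frequency bin k of one frame, obtained from a
    (fast) Fourier transform of the time-domain samples."""
    spectrum = np.fft.rfft(frame)
    return float(np.abs(spectrum[k]) ** 2)
```

For a real sine landing exactly on bin k, the rfft magnitude at that bin is N/2, so the bin energy is (N/2)^2 while off-peak bins carry essentially none.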
  • the NPU is a neural-network (NN) computing processor.
  • NN neural-network
  • Applications such as intelligent cognition of the electronic device 100 can be realized through the NPU, such as image recognition, face recognition, speech recognition, text understanding, and the like.
  • the electronic device 100 can implement audio functions through the audio module 170 , the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor. Such as music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into analog audio signal output, and is also used to convert analog audio input into digital audio signal.
  • the audio module 170 may also be used to encode and decode audio signals.
  • the audio module 170 may be set in the processor 110 , or some functional modules of the audio module 170 may be set in the processor 110 .
  • The speaker 170A, also referred to as a "horn", is used to convert audio electrical signals into sound signals.
  • Electronic device 100 can listen to music through speaker 170A, or listen to hands-free calls.
  • The receiver 170B, also called an "earpiece", is used to convert audio electrical signals into sound signals.
  • the receiver 170B can be placed close to the human ear to receive the voice.
  • The microphone 170C, also called a "mike" or "mic", is used to convert sound signals into electrical signals.
  • the user can put his mouth close to the microphone 170C to make a sound, and input the sound signal to the microphone 170C.
  • the electronic device 100 may be provided with at least one microphone 170C. In some other embodiments, the electronic device 100 may be provided with two microphones 170C, which may also implement a noise reduction function in addition to collecting sound signals. In some other embodiments, the electronic device 100 can also be provided with three, four or more microphones 170C to realize sound signal collection, noise reduction, sound source identification, and directional recording functions.
  • the pressure sensor 180A is used to sense the pressure signal and convert the pressure signal into an electrical signal.
  • pressure sensor 180A may be disposed on display screen 194 .
  • the air pressure sensor 180C is used to measure air pressure.
  • the electronic device 100 calculates the altitude based on the air pressure value measured by the air pressure sensor 180C to assist positioning and navigation.
  • the magnetic sensor 180D includes a Hall sensor.
  • the electronic device 100 may use the magnetic sensor 180D to detect the opening and closing of the flip leather case.
  • the acceleration sensor 180E can detect the acceleration of the electronic device 100 in various directions (generally three axes).
  • the magnitude and direction of gravity can be detected when the electronic device 100 is stationary. It can also be used to identify the posture of electronic devices, and can be used in applications such as horizontal and vertical screen switching, pedometers, etc.
  • the fingerprint sensor 180H is used to collect fingerprints.
  • the electronic device 100 can use the collected fingerprint characteristics to implement fingerprint unlocking, access to application locks, take pictures with fingerprints, answer incoming calls with fingerprints, and the like.
  • Touch sensor 180K also known as "touch panel”.
  • the touch sensor 180K can be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, also called a “touch screen”.
  • the touch sensor 180K is used to detect a touch operation on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • Visual output related to the touch operation can be provided through the display screen 194 .
  • the touch sensor 180K may also be disposed on the surface of the electronic device 100 , which is different from the position of the display screen 194 .
  • the bone conduction sensor 180M can acquire vibration signals. In some embodiments, the bone conduction sensor 180M can acquire the vibration signal of the vibrating bone mass of the human voice.
  • the software system of the electronic device 100 may adopt a layered architecture, an event-driven architecture, a micro-kernel architecture, a micro-service architecture, or a cloud architecture.
  • the software structure of the electronic device 100 is exemplarily described by taking an Android system with a layered architecture as an example.
  • FIG. 4 is a block diagram of the software structure of the electronic device 100 provided by the embodiment of the present application.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate through software interfaces.
  • the Android system is divided into four layers, which are, from top to bottom: the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
  • the application layer can consist of a series of application packages.
  • the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, and audio applications.
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions. As shown in Figure 4, the application framework layer can include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, a mixing thread module (Mixer Thread module) and the like.
  • the mixing thread module is used to receive the audio signal sent by the audio application and process the audio signal.
  • a window manager is used to manage window programs.
  • the window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, capture the screen, etc.
  • Content providers are used to store and retrieve data and make it accessible to applications.
  • Said data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebook, etc.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on.
  • the view system can be used to build applications.
  • a display interface can consist of one or more views.
  • a display interface including a text message notification icon may include a view for displaying text and a view for displaying pictures.
  • the phone manager is used to provide communication functions of the electronic device 100 . For example, the management of call status (including connected, hung up, etc.).
  • the resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and so on.
  • the notification manager enables the application to display notification information in the status bar, which can be used to convey notification-type messages, and can automatically disappear after a short stay without user interaction.
  • the notification manager is used to notify the download completion, message reminder, etc.
  • the notification manager can also be a notification that appears on the top status bar of the system in the form of a chart or scroll bar text, such as a notification of an application running in the background, or a notification that appears on the screen in the form of a dialog window.
  • For example, prompting text information in the status bar, issuing a prompt sound, vibrating the electronic device, flashing the indicator light, and so on.
  • the Android Runtime includes core library and virtual machine. The Android runtime is responsible for the scheduling and management of the Android system.
  • the core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
  • the application layer and the application framework layer run in virtual machines.
  • the virtual machine executes the Java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
  • a system library can include multiple function modules. For example: surface manager (surface manager), media library (Media Libraries), 3D graphics processing library (eg: OpenGL ES), 2D graphics engine (eg: SGL), etc.
  • the surface manager is used to manage the display subsystem and provides the fusion of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of various commonly used audio and video formats, as well as still image files, etc.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing, etc.
  • 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer includes at least a display driver, a camera driver, an audio driver, and a sensor driver.
  • the speaker of a small electronic device is relatively small in size, and the diaphragm vibration amplitude it allows is correspondingly small.
  • when audio is played at a high volume, the diaphragm vibration amplitude of the speaker may exceed this maximum value, which makes the sound prone to breaking and produces a sizzling noise.
  • therefore, the sound source is usually processed so that the speaker produces less noise when the sound is played out.
  • the sound sources are divided into four categories, namely: the first type of sound source, the second type of sound source, the third type of sound source and the fourth type of sound source.
  • the characteristics of the first type of sound source are: the audio signal of this sound source is unevenly distributed on the frequency spectrum, and the energy is concentrated in the middle and low frequencies, the energy is relatively strong, and the duration is long, such as the sound of a piano. When the audio of this type of sound source is played back through the speaker, it is easy to generate noise.
  • the characteristics of the second type of sound source are: the audio signal of the sound source is unevenly distributed on the frequency spectrum, and the energy is mainly concentrated in the middle and low frequencies, but the energy is relatively weak, such as human voice.
  • the characteristic of the third type of sound source is that the audio signal of this sound source is unevenly distributed in the spectrum and has strong energy, but the energy concentration is transient, that is, the energy lasts for a short time, for example, the sound of drums.
  • the characteristic of the fourth type of sound source is that the audio signal of the sound source is evenly distributed on the frequency spectrum.
  • the first type of sound source has uneven energy distribution, large energy and long duration.
  • electronic equipment uses speakers to play back the audio of the first type of sound source, which is more likely to produce noise.
  • therefore, the electronic device suppresses the audio signal of the first type of sound source before playback, sends the suppressed audio signal to the speaker, and the speaker then outputs the audio, thereby suppressing noise during audio playback.
  • the method for the electronic device to process audio signals is: divide the input audio signal into frames, perform time-frequency transformation to obtain frequency domain signals, perform tonality calculation on each frame of the frequency domain signal to obtain its tonality value, and compare the tonality value with a preset first threshold to judge whether the distribution of the audio signal in the frequency domain is uniform. If the tonality value of a frame's frequency domain signal is greater than or equal to the first threshold, the frame's frequency domain signal is unevenly distributed over the frequency spectrum and needs to be suppressed, so the energy of the frame's frequency domain signal is suppressed according to the relevant strategy, that is, peak suppression.
  • the first threshold may be obtained based on historical data, may also be obtained based on empirical values, and may also be obtained based on experimental data tests, which is not limited in this embodiment of the present application.
  • Fig. 5A is a diagram of the tonality calculation results of yangqin (hammered dulcimer) audio. In Fig. 5A, if the first threshold is set to 0.7, the tonality of the yangqin audio generally exceeds the first threshold.
  • the electronic device will therefore judge the yangqin audio as an audio signal that needs to be suppressed.
  • however, although the yangqin audio signal is unevenly distributed in the frequency domain, its energy concentration is strongly transient, so when the electronic device plays the yangqin audio through the speaker, noise is not likely to be generated. If the yangqin audio signal is suppressed, the timbre of the played-back yangqin audio may be distorted.
  • in addition, for sound sources of the same type, the tonality calculation results of the audio signals may differ greatly.
  • for example, both the drum piece Drum Poem and ordinary drum sound belong to the same type of sound source (both are the third type of sound source), yet their tonality calculation results are quite different, owing to differences in their energy distribution in the frequency domain.
  • as a result, the tonality judgment of the audio signal may miss detections.
  • the first threshold value is 0.7
  • suppose there is a piece of piano audio (the first type of sound source) in the played audio, and its tonality calculation result is shown in Figure 5D.
  • the tonality value of the piano audio signal is less than 0.7 between frame 496 and frame 562, so within this interval the electronic device will not suppress the piano audio signal, while in the other frames it will. This causes the volume of the played piano audio to change between frame 496 and frame 562: the volume of the piano sound changes abruptly, giving the user an extremely poor listening experience. Therefore, if the tonality result is used as the only factor to decide whether to suppress the audio signal, the first threshold is difficult to select, and missed detections or false detections easily occur, so that sound sources that produce noise are left unsuppressed or sound sources that do not produce noise are suppressed.
  • an embodiment of the present application provides a method for processing an audio signal.
  • by identifying the sound source type of the audio, it is judged whether the sound source is the first type of sound source; if it is, the peak of the audio signal of the first type of sound source is suppressed, and the suppressed audio signal is sent to the speaker for output.
  • FIG. 6 is a flow chart for processing audio signals provided by an embodiment of the present application.
  • the specific process for processing audio signals is as follows:
  • Step S601 start the audio application.
  • when the electronic device 100 detects an input operation (for example, a click) on the audio application icon 7011, the electronic device 100 displays the startup interface 702 shown in FIG. 7B; while the startup interface 702 is displayed, the audio application starts to launch.
  • when the electronic device displays the main interface 703 of the audio application as shown in FIG. 7C, the startup of the audio application is completed.
  • the audio application shown in FIG. 7A-FIG. 7C is a music application, and the audio application may also be a video application, and may also be other applications capable of playing audio. This embodiment of the present application is only for illustration and not limitation.
  • Step S602 the audio application sends an audio signal to the mixing thread module.
  • Step S603 The audio mixing thread module divides the audio signal into frames to obtain M frames of audio signals.
  • the electronic device processes the audio signal in real time, and the speaker then outputs the processed audio signal in the form of audio.
  • the signal is divided into frames, for example, 10ms is a frame.
  • Step S604 the audio mixing thread module performs time-frequency conversion on the audio signal of the nth frame to obtain the frequency domain signal of the audio signal of the frame.
  • the audio mixing thread module can obtain the frequency domain signal of the audio signal by performing a Fourier Transform (FT) or a Fast Fourier Transform (FFT) on the audio signal.
  • the audio mixing thread module can also obtain the frequency domain signal by performing a Mel spectrum transformation on the audio signal.
  • the audio mixing thread module can also obtain the frequency domain signal by performing a Modified Discrete Cosine Transform (MDCT) on the audio signal.
  • the embodiment of the present application takes the time-frequency transformation of the audio signal by FFT as an example for description. Before performing the FFT, overlapping and windowing can be applied to each frame signal, in order to reduce spectrum leakage during the frequency domain transformation and reduce frequency domain processing distortion.
  • after the audio mixing thread module performs time-frequency conversion on the audio signal, it can obtain all frequency components of the frequency domain signal of the audio signal, which facilitates analyzing and calculating the different frequencies of the signal.
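As a rough illustration of steps S603 and S604, the following sketch frames a signal and computes a windowed FFT spectrum. The 48 kHz sample rate, the Hann window, and all names are illustrative assumptions, not details given in the embodiment.

```python
import numpy as np

SAMPLE_RATE = 48000
FRAME_LEN = SAMPLE_RATE // 100  # 10 ms frame, as in the example above

def frame_to_spectrum(frame: np.ndarray) -> np.ndarray:
    """Window one 10 ms frame and return its magnitude spectrum."""
    window = np.hanning(len(frame))     # windowing reduces spectrum leakage
    return np.abs(np.fft.rfft(frame * window))

frame = np.random.randn(FRAME_LEN)      # stand-in for one frame of PCM audio
spectrum = frame_to_spectrum(frame)
print(spectrum.shape)                   # one bin per frequency component
```

The magnitude spectrum (241 bins for a 480-sample frame) is what the subsequent tonality calculation and peak detection would operate on.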
  • Step S605 The sound mixing thread module performs tonality calculation on the frequency domain signal to obtain the tonality value of the frequency domain signal.
  • the sound mixing thread module sequentially calculates the tonality of the frequency domain signal of the nth frame, and obtains the corresponding tonality value.
  • the purpose of performing tonality calculation on the frequency domain signal is to determine whether the energy distribution of the frame's audio signal in the frequency domain is uniform. If the tonality value of the frequency domain signal is greater than or equal to the first threshold, the energy distribution of the frame's audio signal in the frequency domain is judged to be uneven; otherwise, it is judged to be uniform.
  • the first threshold may be obtained based on empirical values, may also be obtained based on historical data, and may also be obtained based on experimental data, which is not limited in this embodiment of the present application.
  • the mixing thread module calculates the tonality value as follows:
  • the mixing thread module calculates the flatness Flatness of the frequency domain signal according to formula (1), which is as follows:

    Flatness = (∏_{n=0}^{N-1} x(n))^{1/N} / ((1/N) · ∑_{n=0}^{N-1} x(n))    (1)
  • N is the length of the FFT transform of the audio signal
  • x(n) is the energy value of the nth frequency point of the frequency domain signal of the frame
  • Flatness is used to represent the energy distribution of the frequency domain signal in the frequency domain: the larger the Flatness, the more uniform the distribution; the smaller the Flatness, the more uneven the distribution.
  • the mixing thread module calculates the first parameter SFMdB according to formula (2), which is as follows:

    SFMdB = 10 · log10(Flatness)    (2)
  • the sound mixing thread module calculates the tonality value α of the frame's frequency domain signal according to formula (3), which is as follows:

    α = min(SFMdB / SFMdBMax, 1)    (3)
  • the value of SFMdBMax may be obtained from historical values, empirical values, or experimental data, which is not limited in this embodiment of the present application.
  • SFMdBMax can be set to -60dB.
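Formulas (1) to (3) can be sketched as follows, with SFMdBMax = -60 dB as suggested above; the small floor value guarding against log(0) is an added assumption.

```python
import numpy as np

def tonality(energy: np.ndarray, sfm_db_max: float = -60.0) -> float:
    """Tonality value of one frame's frequency-domain energy x(n)."""
    energy = np.maximum(energy, 1e-12)                  # avoid log(0)
    # Formula (1): Flatness = geometric mean / arithmetic mean
    flatness = np.exp(np.mean(np.log(energy))) / np.mean(energy)
    sfm_db = 10.0 * np.log10(flatness)                  # formula (2)
    return min(sfm_db / sfm_db_max, 1.0)                # formula (3)

flat = np.ones(256)                      # evenly distributed spectrum
peaky = np.full(256, 1e-6)
peaky[10] = 1.0                          # energy concentrated at one bin
print(tonality(flat), tonality(peaky))   # low value vs. high value
```

A flat, noise-like spectrum yields a tonality near 0, while a spectrum whose energy is concentrated in a narrow band yields a value closer to 1, which is then compared against the first threshold.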
  • Step S606 The sound mixing thread module obtains the label of the frequency domain signal based on the neural network.
  • the sound mixing thread module takes the frame frequency domain signal as an input of the neural network, and the neural network outputs a label of the frame frequency domain signal, and the label is used to indicate the audio source type of the frame frequency domain signal.
  • the labels include a first label, a second label, a third label and a fourth label: the first label indicates that the sound source type of the frequency domain signal is the first type of sound source, the second label indicates the second type of sound source, the third label indicates the third type of sound source, and the fourth label indicates the fourth type of sound source.
  • the first label is 0, the second label is 1, the third label is 2, and the fourth label is 3 as an example for description.
  • the neural network is a trained neural network.
  • the neural network can be trained offline. The training process is: select a large number of frequency domain signals with a frame length of 10 ms (frequency domain signals with other frame lengths can also be selected, which is not limited in the embodiments of the present application) of the first type of sound source (for example, piano sound), the second type of sound source (for example, human voice), the third type of sound source (for example, drum sound) and the fourth type of sound source, and use them as training samples.
  • when the frequency domain signal of the first type of sound source is used as the input of the neural network, the neural network outputs a label for that signal; the output label is compared with label 0 to obtain a deviation value Fn1, which represents the degree to which the label output by the neural network differs from label 0.
  • the internal parameters of the neural network are then adjusted based on Fn1, so that the neural network outputs label 0 for audio signals of the first type of sound source.
  • the neural network is likewise trained with the other training samples (the frequency domain signals of the second, third and fourth types of sound source), so that when the neural network receives an input frequency domain signal, it can output the corresponding label.
  • when a frame contains more than one sound source, the label of the sample signal can be determined according to the intensity of the sound sources. For example, in a frame of the frequency-domain sample signal, if the sound of the piano is clearly louder than the human voice, the sound source of that sample is determined as the first type of sound source, and the label is set to 0.
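Purely for illustration, a frame classifier of the kind described could have the following shape. The layer sizes and the random weights are stand-in assumptions (the described system uses weights obtained by the offline training above), and the embodiment does not specify a network architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
N_BINS, HIDDEN, N_LABELS = 241, 32, 4     # 4 labels: source types 0..3

# Stand-in weights; in the described system these come from offline training.
W1, b1 = rng.normal(size=(HIDDEN, N_BINS)), np.zeros(HIDDEN)
W2, b2 = rng.normal(size=(N_LABELS, HIDDEN)), np.zeros(N_LABELS)

def classify_frame(spectrum: np.ndarray) -> int:
    """Map one frame's magnitude spectrum to a source-type label 0..3."""
    h = np.maximum(W1 @ spectrum + b1, 0.0)   # ReLU hidden layer
    logits = W2 @ h + b2
    return int(np.argmax(logits))             # most likely label

label = classify_frame(rng.random(N_BINS))
print(label)
```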
  • Step S607 The sound mixing thread module judges whether the frequency domain signal is the first type of sound source based on the tonality value of the frequency domain signal and the label of the frequency domain signal.
  • if the judgment is yes, execute step S608; if the judgment is no, execute step S610.
  • for some audio signals, such as the sound of a pipa, the neural network may wrongly judge the signal as the first type of sound source and output label 0, when in fact the sound of a pipa belongs to the third type of sound source.
  • to avoid such misjudgments, the sound mixing thread module also judges whether the energy distribution of the frame's frequency domain signal in the frequency domain is uniform. Only when the label output by the neural network is 0 and the mixing thread module judges the energy distribution of the frame's frequency domain signal in the frequency domain to be uneven will the mixing thread module determine that the frame's frequency domain signal is of the first type of sound source.
  • that is, only when the tonality value of the frame is greater than or equal to the first threshold and the label is 0 does the mixing thread module judge that the sound source of the frame's frequency domain signal is the first type of sound source; otherwise, it is not the first type of sound source.
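The combined judgment of step S607 reduces to a simple conjunction. This sketch uses the example threshold of 0.7 from the text; the function name is illustrative.

```python
FIRST_THRESHOLD = 0.7  # example first-threshold value from the text

def is_first_type_source(tonality_value: float, label: int) -> bool:
    """First-type source only if the network outputs label 0 AND the
    tonality value indicates an uneven frequency-domain distribution."""
    return label == 0 and tonality_value >= FIRST_THRESHOLD

print(is_first_type_source(0.9, 0))  # uneven spectrum, label 0 -> True
print(is_first_type_source(0.9, 2))  # drum-like label          -> False
print(is_first_type_source(0.5, 0))  # even spectrum            -> False
```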
  • Step S608 The sound mixing thread module performs peak detection on the frequency domain signal.
  • the sound mixing thread module detects the peak value of the frequency domain signal of the frame, that is, obtains the amplitudes of the peak and valley of the frequency domain signal of the frame in the frequency domain.
  • Fig. 8 is a waveform diagram of the frame's frequency domain signal. In the waveform diagram there are X peaks and Y valleys. The purpose of peak detection is to obtain the amplitudes of these X peaks and Y valleys; sorted from large to small, the peaks are called the largest peak, the second largest peak, the third largest peak, and so on.
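Peak detection as described (finding the X peaks and ordering them by amplitude) can be sketched as below; treating any bin larger than both neighbours as a peak is an added simplifying assumption.

```python
import numpy as np

def find_peaks_sorted(spec: np.ndarray) -> list:
    """Return indices of local maxima, largest peak first."""
    peaks = [i for i in range(1, len(spec) - 1)
             if spec[i] > spec[i - 1] and spec[i] > spec[i + 1]]
    return sorted(peaks, key=lambda i: spec[i], reverse=True)

spec = np.array([0.1, 0.5, 0.2, 0.9, 0.3, 0.4, 0.1])
print(find_peaks_sorted(spec))  # [3, 1, 5]: largest, second largest, third
```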
  • Step S609 The sound mixing thread module processes the frequency domain signal using the first type of suppression strategy to obtain a processed frequency domain signal.
  • the sound mixing thread module performs single-peak suppression or multi-peak suppression on the peaks of the frame's frequency domain signal. If single-peak suppression is performed, the largest peak of the frame's frequency domain signal is suppressed; if multi-peak suppression is performed, at least the largest peak and the second largest peak of the frame's frequency domain signal are suppressed.
  • the specific method for the sound mixing thread module to suppress a peak is: find the peak in the frequency domain, calculate the difference between the energy of the peak and the second threshold, calculate a difference gain based on that difference, and multiply the original frequency point by the difference gain to reduce the energy of the corresponding frequency point. For example, if the current largest peak is -10 dB and the second threshold is set to -15 dB, the difference for the largest peak is -5 dB; converted to a linear value, the gain is about 0.562, so the original frequency point is multiplied by 0.562 to reduce its energy.
  • the second threshold is a preset maximum peak value, which can be obtained based on empirical values, historical data, or experimental data, and is not limited in this embodiment of the present application.
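The worked example above (-10 dB peak, -15 dB second threshold, linear gain of about 0.562) can be reproduced as follows; the amplitude-dB convention (20·log10) and the names are assumptions consistent with that example.

```python
import numpy as np

def suppress_peak(spec: np.ndarray, peak_bin: int, threshold_db: float) -> np.ndarray:
    """Attenuate one peak bin down to the second threshold."""
    peak_db = 20.0 * np.log10(spec[peak_bin])
    diff_db = threshold_db - peak_db          # e.g. -15 - (-10) = -5 dB
    gain = 10.0 ** (diff_db / 20.0)           # -5 dB -> about 0.562
    out = spec.copy()
    if gain < 1.0:                            # only attenuate, never boost
        out[peak_bin] = out[peak_bin] * gain
    return out

spec = np.array([0.05, 10 ** (-10 / 20.0), 0.02])  # largest peak at -10 dB
out = suppress_peak(spec, 1, -15.0)
print(20 * np.log10(out[1]))                       # peak now at -15 dB
```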
  • Step S610 The sound mixing thread module processes the frequency domain signal using a second type of suppression strategy to obtain a processed frequency domain signal.
  • the sound mixing thread module adopts the second type of suppression strategy for the frame's frequency domain signal. The second type of suppression strategy is: the sound mixing thread module may suppress the peak of the frame's frequency domain signal, or may not suppress the frame's frequency domain signal at all.
  • the difference between the difference gain of the frequency domain signal of the frame and the difference gain of the frequency domain signal of the previous frame should be within a reasonable range.
  • the audio source type of the n-1th frame frequency domain signal is the first type of audio source, which needs to be suppressed
  • the difference gain is 0.5
  • the nth frame frequency domain signal is human voice (the second type of sound source)
  • the range of the difference gain of the nth frame's audio signal is 0.7-0.8.
  • if the difference gain of the nth frame's frequency domain signal is higher than 0.8, the energy difference between the suppressed (n-1)th frame audio signal and the suppressed nth frame audio signal may be too large, causing an abrupt volume change when the speaker plays back the two frames of audio (for example, the sound suddenly becomes louder). If the difference gain of the nth frame's frequency domain signal is lower than 0.7, the energy of the frame's signal may be suppressed excessively, and when the speaker plays back the frame's audio, the volume of the human voice will be very low.
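The cross-frame gain limiting in this example amounts to clamping the current frame's difference gain into an allowed range. The 0.7-0.8 range for a previous gain of 0.5 is taken directly from the text; how the range is derived in general is not specified, so this is only a sketch.

```python
def clamp_gain(raw_gain: float, lo: float, hi: float) -> float:
    """Clamp the current frame's difference gain into [lo, hi]."""
    return min(max(raw_gain, lo), hi)

# Previous frame gain 0.5; allowed range for this frame is 0.7-0.8.
print(clamp_gain(0.95, 0.7, 0.8))  # clamped to 0.8: no sudden volume jump
print(clamp_gain(0.55, 0.7, 0.8))  # raised to 0.7: voice not over-suppressed
print(clamp_gain(0.75, 0.7, 0.8))  # already in range: unchanged
```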
  • Step S611 The sound mixing thread module performs frequency-time transformation on the processed frequency-domain signal to obtain a single-frame audio signal.
  • Step S612 the sound mixing thread module sends the single-frame audio signal to the audio driver.
  • Step S613 the sound mixing thread module executes step S604 again for the next frame of the audio signal.
  • Step S614 The audio driver sends the single-frame audio signal to the speaker.
  • Step S615 The speaker plays the audio corresponding to the single-frame audio signal.
  • in summary, the audio processing method provided by the embodiment of the present application combines a neural network with the traditional detection algorithm: the neural network identifies the sound source type of the audio signal, which solves the misjudgments and missed judgments of the traditional algorithm and the difficulty of tuning the tonality threshold; and by applying different suppression gains and suppression times to different audio signals, the method changes the speaker input signal to reduce playback noise and reduce suppression distortion of the different audio signals, while maintaining the maximum playback loudness of the original signal.
  • all or part of them may be implemented by software, hardware, firmware or any combination thereof.
  • when implemented using software, the above embodiments may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the processes or functions according to the present application will be generated in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server or data center to another website, computer, server or data center by wired (e.g., coaxial cable, optical fiber, DSL) or wireless (e.g., infrared, radio, microwave) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk), etc.
  • the processes can be completed by computer programs to instruct related hardware.
  • the programs can be stored in computer-readable storage media.
  • when the programs are executed, the processes of the foregoing method embodiments may be performed.
  • the aforementioned storage medium includes: ROM or random access memory RAM, magnetic disk or optical disk, and other various media that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Business, Economics & Management (AREA)
  • Business, Economics & Management (AREA)
  • Telephone Function (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

An audio signal processing method, an electronic device, a computer readable storage medium, and a computer program product. The method comprises: obtaining an audio signal; when the tonality value of the audio signal is greater than or equal to a first threshold and the audio source type of the audio signal is a first-type audio source, processing the audio signal by using a first-type suppression policy; and otherwise, processing the audio signal by using a second-type suppression policy. The method solves the problem that in the process of suppressing an audio signal that easily generates noise, the audio signals generate suppression distortions because other types of audio signals are wrongly suppressed.

Description

An audio signal processing method and related electronic device
This application claims the priority of the Chinese patent application with application number 202110815051X, entitled "An audio signal processing method and related electronic device", filed with the China Patent Office on July 19, 2021, the entire contents of which are incorporated in this application by reference.
Technical Field
The present application relates to the field of audio signal processing, and in particular to an audio signal processing method and related electronic device.
Background
When a certain type of signal is played through the built-in speaker of a small mobile electronic device (such as a mobile phone or tablet computer), that is, a signal whose frequency-point energy in the frequency domain is concentrated in a narrow bandwidth (for example, piano-like music) and whose frequency-domain energy distribution lasts a long time, a sizzling noise can subjectively be heard. This is mainly because the excessively concentrated, long-duration narrow-band energy causes the speaker to produce nonlinear distortion during electro-acoustic conversion. The sound source corresponding to this type of audio signal is called the first type of sound source.
To reduce the generation of this kind of noise, traditional processing of the first type of sound source causes the problem of falsely suppressing other types of sound sources, such as human voices, so that the subjective loudness of such sources fluctuates during transitions with the first type of sound source. Therefore, reducing the noise of the first type of sound source while avoiding false suppression of other sound sources is a problem of concern to technicians.
Summary of the Invention
The embodiment of the present application provides an audio signal processing method, which solves the problem that, in the process of suppressing audio signals prone to generating noise, other types of audio signals are wrongly suppressed, causing suppression distortion of the audio signals.
In a first aspect, an embodiment of the present application provides a method for processing an audio signal, including: acquiring an audio signal; when the tonality value of the audio signal is greater than or equal to a first threshold and the sound source type of the audio signal is the first type of sound source, processing the audio signal using a first type of suppression strategy; otherwise, processing the audio signal using a second type of suppression strategy.
In the above embodiment, based on the sound source type and the tonality value of the audio signal, it is judged whether the signal is one prone to generating noise (the first type of sound source), and different suppression strategies are adopted depending on whether the sound source type is the first type. While maintaining the maximum playback loudness of the original signal, the speaker input signal is changed to reduce playback noise and reduce suppression distortion of different audio signals.
With reference to the first aspect, in a possible implementation manner, the first type of suppression strategy is to suppress a single peak or multiple peaks of the audio signal in the frequency domain.
With reference to the first aspect, in a possible implementation manner, the second type of suppression strategy is to suppress a single peak or multiple peaks in the audio signal, or not to suppress the audio signal.
With reference to the first aspect, in a possible implementation manner, after acquiring the audio signal, the method includes: performing tonality calculation on the audio signal to obtain the tonality value of the audio signal. In this way, it is beneficial for the electronic device to adopt different suppression strategies for the audio signal based on its tonality value and sound source type.
With reference to the first aspect, in a possible implementation manner, performing tonality calculation on the audio signal to obtain the tonality value of the audio signal includes: calculating the flatness of the audio signal according to formula (1):

    Flatness = (∏_{n=0}^{N-1} x(n))^{1/N} / ((1/N) · ∑_{n=0}^{N-1} x(n))    (1)

where N is the length of the time-frequency transformation of the audio signal, x(n) is the energy value of the nth frequency point of the audio signal in the frequency domain, and Flatness is the flatness of the audio signal; calculating the first parameter of the audio signal according to the formula SFMdB = 10 · log10(Flatness), where SFMdB is the first parameter; and calculating the tonality value of the audio signal according to formula (3):

    α = min(SFMdB / SFMdBMax, 1)    (3)

where α is the tonality value of the audio signal and SFMdBMax is the maximum value of the first parameter. In this way, it is beneficial for the electronic device to adopt different suppression strategies for the audio signal based on its tonality value and sound source type.
With reference to the first aspect, in a possible implementation manner, before the audio signal is processed using the first type of suppression strategy, the method further includes: performing peak detection on the audio signal, where the peak detection is used to obtain peak information of the audio signal in the frequency domain. In this way, the electronic device can obtain the peak of the audio signal, calculate a difference gain according to the peak information, and suppress the audio signal according to the difference gain.
With reference to the first aspect, in a possible implementation manner, processing the audio signal using the first type of suppression strategy specifically includes: calculating the difference between the peak of the audio signal and a second threshold, where the peak includes at least the maximum peak of the audio signal in the frequency domain; calculating a difference gain of the peak based on the difference; and suppressing the peak according to the formula W' = W * f, where f is the difference gain, W is the peak before suppression, and W' is the peak after suppression. In this way, when the sound source type of the audio signal is the first type of sound source and the tonality value of the audio signal is greater than or equal to the first threshold, peak suppression is performed on the audio signal to change the speaker input signal and reduce playback noise.
In a second aspect, an embodiment of the present application provides an electronic device, including: one or more processors and a memory. The memory is coupled to the one or more processors and is used to store computer program code, which includes computer instructions. The one or more processors invoke the computer instructions to cause the electronic device to perform: acquiring an audio signal; when the tonality value of the audio signal is greater than or equal to a first threshold and the audio source type of the audio signal is the first type of audio source, processing the audio signal using the first type of suppression strategy; otherwise, processing the audio signal using the second type of suppression strategy.
With reference to the second aspect, in a possible implementation, the one or more processors are further configured to invoke the computer instructions to cause the electronic device to perform: performing a tonality calculation on the audio signal to obtain the tonality value of the audio signal.
With reference to the second aspect, in a possible implementation, the one or more processors are further configured to invoke the computer instructions to cause the electronic device to perform: calculating the flatness of the audio signal according to the formula

Flatness = (Π_{n=1..N} x(n))^(1/N) / ((1/N) · Σ_{n=1..N} x(n))

where N is the length of the time-frequency transform of the audio signal, x(n) is the energy value of the audio signal at frequency bin n in the frequency domain, and Flatness is the flatness of the audio signal; calculating the first parameter of the audio signal according to the formula SFMdB = 10·log10(Flatness), where SFMdB is the first parameter; and calculating the tonality value of the audio signal according to the formula

α = min(SFMdB / SFMdBMax, 1)

where α is the tonality value of the audio signal and SFMdBMax is the maximum value of the first parameter.
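The flatness-to-tonality computation above can be sketched as follows. The value SFMdBMax = -60 dB used below follows the common psychoacoustic-model convention and is an assumption here, not a value fixed by the text.

```python
import numpy as np

def tonality(x, sfm_db_max=-60.0):
    """Tonality value alpha of one frame of frequency-bin energies x(n).

    sfm_db_max -- assumed maximum of the first parameter SFMdB; the -60 dB
                  default is a common convention, not fixed by the text.
    """
    x = np.asarray(x, dtype=float)
    # Flatness: geometric mean of the bin energies over their arithmetic mean.
    geometric_mean = np.exp(np.mean(np.log(np.maximum(x, 1e-12))))
    arithmetic_mean = np.mean(x)
    flatness = geometric_mean / arithmetic_mean
    # First parameter: SFMdB = 10 * log10(Flatness); <= 0 since Flatness <= 1.
    sfm_db = 10.0 * np.log10(flatness)
    # Tonality value: alpha = min(SFMdB / SFMdBMax, 1).
    return min(sfm_db / sfm_db_max, 1.0)
```

A noise-like frame (flat spectrum) gives Flatness close to 1, SFMdB close to 0, and α close to 0; a tone-like frame (energy concentrated in a few bins) gives a strongly negative SFMdB and α close to 1, which is what the comparison against the first threshold relies on.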
With reference to the second aspect, in a possible implementation, the one or more processors are further configured to invoke the computer instructions to cause the electronic device to perform: performing peak detection on the audio signal, where the peak detection is used to obtain peak information of the audio signal in the frequency domain.
With reference to the second aspect, in a possible implementation, the one or more processors are further configured to invoke the computer instructions to cause the electronic device to perform: calculating the difference between the peak value of the audio signal and a second threshold, where the peak value includes at least the maximum peak value of the audio signal in the frequency domain; calculating a difference gain for the peak value based on the difference; and suppressing the peak value according to the formula W′ = W * f, where f is the difference gain, W is the peak value before suppression, and W′ is the peak value after suppression.
In a third aspect, an embodiment of the present application provides an electronic device, including: a touch screen, a camera, one or more processors, and one or more memories. The one or more processors are coupled to the touch screen, the camera, and the one or more memories. The one or more memories are used to store computer program code, which includes computer instructions. When the one or more processors execute the computer instructions, the electronic device is caused to perform the method described in the first aspect or any possible implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a chip system applied to an electronic device. The chip system includes one or more processors, and the processors are used to invoke computer instructions to cause the electronic device to perform the method described in the first aspect or any possible implementation of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product containing instructions. When the computer program product runs on an electronic device, the electronic device is caused to perform the method described in the first aspect or any possible implementation of the first aspect.
In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium including instructions. When the instructions run on an electronic device, the electronic device is caused to perform the method described in the first aspect or any possible implementation of the first aspect.
Description of Drawings
FIG. 1A to FIG. 1C are schematic diagrams of an application scenario provided by an embodiment of the present application;
FIG. 2 is a system architecture diagram of an electronic device processing an audio signal provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of the hardware structure of the electronic device 100 provided by an embodiment of the present application;
FIG. 4 is a block diagram of the software structure of the electronic device 100 provided by an embodiment of the present application;
FIG. 5A to FIG. 5D are diagrams of tonality value calculation results provided by an embodiment of the present application;
FIG. 6 is a flowchart of processing an audio signal provided by an embodiment of the present application;
FIG. 7A to FIG. 7C are diagrams of an audio application startup interface provided by an embodiment of the present application;
FIG. 8 is a waveform diagram of a frequency domain signal provided by an embodiment of the present application.
Detailed Description of Embodiments
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with that embodiment may be included in at least one embodiment of the present application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein may be combined with other embodiments. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
The terms "first", "second", "third", and the like in the specification, claims, and accompanying drawings of the present application are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having", as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, product, or device that comprises a series of steps or units may optionally also include steps or units that are not listed, or other steps or units inherent to the process, method, product, or device.
The accompanying drawings show only the parts relevant to the present application rather than the entire content. Before discussing the exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes operations (or steps) as sequential processing, many of the operations may be performed in parallel, concurrently, or simultaneously. In addition, the order of the operations may be rearranged. A process may be terminated when its operations are completed, but may also have additional steps not included in the drawing. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, and the like.
The terms "component", "module", "system", "unit", and the like used in this specification denote a computer-related entity: hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a unit may be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, a program, and/or a program distributed between two or more computers. In addition, these units may execute from various computer-readable media having various data structures stored thereon. The units may communicate through local and/or remote processes, for example according to a signal having one or more data packets (e.g., data from a second unit interacting with another unit in a local system, in a distributed system, and/or across a network such as the Internet, which interacts with other systems by means of signals).
Below, an application scenario in which an electronic device processes an audio signal is introduced with reference to FIG. 1A to FIG. 1C.
In FIG. 1A, when the electronic device 100 detects an input operation (for example, a tap) on the music application icon 1011, it enters the main interface 102 of the music application shown in FIG. 1B. As shown in FIG. 1B, after the user searches for a singer or song name on the main interface 102, the electronic device 100 displays the music playing interface 103 shown in FIG. 1C, at which point the music application plays music. While the music application is playing music, the electronic device 100 processes the audio signal of the music in real time to ensure that the played music is free of noise, thereby providing the user with a good listening experience.
FIG. 1A to FIG. 1C above introduced an application scenario in which an electronic device processes an audio signal. Below, the system architecture with which the electronic device processes audio is introduced. Please refer to FIG. 2, which is a system architecture diagram of an electronic device processing an audio signal provided by an embodiment of the present application. As shown in FIG. 2, the system architecture includes an audio application, a mixing thread module, and an audio driver. Exemplarily, the audio application may be an application such as music player software or video software.
When an audio application plays audio through the speaker, the audio signal is processed in real time. First, the audio application sends the audio signal to the mixing thread module, and the mixing thread module detects whether the audio source type of the audio signal is the first type of audio source. If so, the audio signal is processed (for example, the energy of the audio signal is suppressed). Then, the mixing thread module sends the processed audio signal to the audio driver, the audio driver sends the processed audio signal to the speaker, and the speaker outputs the audio. In this way, the electronic device can process the audio signal in real time so that the audio played through the speaker is free of noise.
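The per-frame decision made in the mixing thread can be sketched as follows. Everything here is a placeholder sketch: the threshold value, the source-type label, and both strategy bodies are hypothetical stand-ins, since the embodiments define the actual strategies elsewhere.

```python
def first_type_suppression(frame):
    # Stand-in: attenuate the frame (the real first-type strategy performs
    # frequency-domain peak suppression, W' = W * f).
    return [0.5 * v for v in frame]

def second_type_suppression(frame):
    # Stand-in: pass the frame through (the real second-type strategy differs).
    return list(frame)

def process_frame(frame, alpha, source_type,
                  first_threshold=0.7, first_type_label="FIRST_TYPE"):
    """Dispatch one audio frame to a suppression strategy.

    alpha is the tonality value in [0, 1]; first_threshold and
    first_type_label are hypothetical placeholder values.
    """
    if alpha >= first_threshold and source_type == first_type_label:
        return first_type_suppression(frame)
    return second_type_suppression(frame)
```

Only frames that are both tonal enough and from the first type of audio source take the first branch; all other frames fall through to the second strategy, matching the otherwise-clause of the first aspect.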
The structure of the electronic device 100 is introduced below. Please refer to FIG. 3, which is a schematic diagram of the hardware structure of the electronic device 100 provided by an embodiment of the present application.
The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone jack 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It can be understood that the structure illustrated in this embodiment of the present invention does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may include more or fewer components than shown in the figure, combine certain components, split certain components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. The different processing units may be independent devices or may be integrated in one or more processors.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like.
The antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single communication frequency band or multiple communication frequency bands. Different antennas may also be multiplexed to improve antenna utilization. For example, the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, an antenna may be used in combination with a tuning switch.
The mobile communication module 150 may provide wireless communication solutions applied to the electronic device 100, including 2G/3G/4G/5G. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like. The mobile communication module 150 may receive electromagnetic waves through the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 150 may also amplify signals modulated by the modem processor and convert them into electromagnetic waves for radiation through the antenna 1. In some embodiments, at least some functional modules of the mobile communication module 150 may be arranged in the processor 110. In some embodiments, at least some functional modules of the mobile communication module 150 and at least some modules of the processor 110 may be arranged in the same device.
The wireless communication module 160 may provide wireless communication solutions applied to the electronic device 100, including wireless local area networks (WLAN) (such as a Wi-Fi network), Bluetooth (BT), BLE broadcasting, global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like. The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering on the electromagnetic wave signals, and sends the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be sent from the processor 110, perform frequency modulation and amplification on it, and convert it into electromagnetic waves for radiation through the antenna 2.
The electronic device 100 implements the display function through the GPU, the display screen 194, the application processor, and the like. The GPU is a microprocessor for image processing and is connected to the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, quantum dot light-emitting diodes (QLED), or the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
The electronic device 100 may implement the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, and the like.
The ISP is used to process data fed back by the camera 193. For example, when taking a photo, the shutter is opened, light is transmitted through the lens to the photosensitive element of the camera, the light signal is converted into an electrical signal, and the photosensitive element of the camera transmits the electrical signal to the ISP for processing, converting it into an image visible to the naked eye. The ISP may also perform algorithm optimization on the noise, brightness, and skin tone of the image, and may also optimize parameters such as the exposure and color temperature of the shooting scene. In some embodiments, the ISP may be arranged in the camera 193.
The digital signal processor is used to process digital signals; in addition to digital image signals, it may also process other digital signals. For example, when the electronic device 100 performs frequency-bin selection, the digital signal processor is used to perform a Fourier transform on the frequency-bin energy, and so on.
The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example the transfer mode between neurons in the human brain, it processes input information quickly and can also learn continuously by itself. Applications such as intelligent cognition of the electronic device 100, for example image recognition, face recognition, speech recognition, and text understanding, can be implemented through the NPU.
The electronic device 100 may implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone jack 170D, the application processor, and the like.
The audio module 170 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be arranged in the processor 110, or some functional modules of the audio module 170 may be arranged in the processor 110.
The speaker 170A, also called a "loudspeaker", is used to convert audio electrical signals into sound signals. The electronic device 100 can play music or take a hands-free call through the speaker 170A.
The receiver 170B, also called an "earpiece", is used to convert audio electrical signals into sound signals. When the electronic device 100 answers a call or a voice message, the voice can be heard by placing the receiver 170B close to the ear.
The microphone 170C, also called a "mic", is used to convert sound signals into electrical signals. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In still other embodiments, the electronic device 100 may be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement directional recording, and the like.
The pressure sensor 180A is used to sense pressure signals and can convert the pressure signals into electrical signals. In some embodiments, the pressure sensor 180A may be arranged on the display screen 194.
The air pressure sensor 180C is used to measure air pressure. In some embodiments, the electronic device 100 calculates the altitude from the air pressure value measured by the air pressure sensor 180C to assist in positioning and navigation.
The magnetic sensor 180D includes a Hall sensor. The electronic device 100 may use the magnetic sensor 180D to detect the opening and closing of a flip cover.
The acceleration sensor 180E can detect the magnitude of the acceleration of the electronic device 100 in various directions (generally along three axes). When the electronic device 100 is stationary, the magnitude and direction of gravity can be detected. It can also be used to identify the posture of the electronic device and is applied to landscape/portrait switching, pedometers, and other applications.
The fingerprint sensor 180H is used to collect fingerprints. The electronic device 100 can use the collected fingerprint characteristics to implement fingerprint unlocking, access to application locks, fingerprint photography, fingerprint-based call answering, and the like.
The touch sensor 180K is also called a "touch panel". The touch sensor 180K may be arranged on the display screen 194; the touch sensor 180K and the display screen 194 form a touch screen, also called a "touchscreen". The touch sensor 180K is used to detect a touch operation on or near it. The touch sensor can pass the detected touch operation to the application processor to determine the type of touch event. Visual output related to the touch operation can be provided through the display screen 194. In other embodiments, the touch sensor 180K may also be arranged on the surface of the electronic device 100 at a position different from that of the display screen 194.
The bone conduction sensor 180M can acquire vibration signals. In some embodiments, the bone conduction sensor 180M can acquire the vibration signal of the vibrating bone mass of the human vocal part.
The software system of the electronic device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. This embodiment of the present invention takes the Android system with a layered architecture as an example to exemplarily describe the software structure of the electronic device 100. FIG. 4 is a block diagram of the software structure of the electronic device 100 provided by an embodiment of the present application. The layered architecture divides the software into several layers, each with a clear role and division of labor. The layers communicate through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
The application layer may include a series of application packages. As shown in FIG. 4, the application packages may include applications such as Camera, Gallery, Calendar, Phone, Maps, Navigation, WLAN, Bluetooth, Music, Video, Messages, and audio applications.
The application framework layer provides an application programming interface (API) and a programming framework for the applications in the application layer. The application framework layer includes some predefined functions. As shown in FIG. 4, the application framework layer may include a window manager, content providers, a view system, a phone manager, a resource manager, a notification manager, a mixing thread module (Mixer Thread module), and the like.
The mixing thread module is used to receive the audio signal sent by the audio application and process the audio signal.
窗口管理器用于管理窗口程序。窗口管理器可以获取显示屏大小,判断是否有状态栏,锁定屏幕,截取屏幕等。A window manager is used to manage window programs. The window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, capture the screen, etc.
内容提供器用来存放和获取数据,并使这些数据可以被应用程序访问。所述数据可以包括视频,图像,音频,拨打和接听的电话,浏览历史和书签,电话簿等。Content providers are used to store and retrieve data and make it accessible to applications. Said data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebook, etc.
视图系统包括可视控件,例如显示文字的控件,显示图片的控件等。视图系统可用于构建应用程序。显示界面可以由一个或多个视图组成的。例如,包括短信通知图标的显示界面,可以包括显示文字的视图以及显示图片的视图。The view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on. The view system can be used to build applications. A display interface can consist of one or more views. For example, a display interface including a text message notification icon may include a view for displaying text and a view for displaying pictures.
电话管理器用于提供电子设备100的通信功能。例如通话状态的管理(包括接通,挂断等)。The phone manager is used to provide communication functions of the electronic device 100 . For example, the management of call status (including connected, hung up, etc.).
资源管理器为应用程序提供各种资源,比如本地化字符串,图标,图片,布局文件,视频文件等等。The resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and so on.
通知管理器使应用程序可以在状态栏中显示通知信息,可以用于传达告知类型的消息,可以短暂停留后自动消失,无需用户交互。比如通知管理器被用于告知下载完成,消息提醒等。通知管理器还可以是以图表或者滚动条文本形式出现在系统顶部状态栏的通知,例如后台运行的应用程序的通知,还可以是以对话窗口形式出现在屏幕上的通知。例如在状态栏提示文本信息,发出提示音,电子设备振动,指示灯闪烁等。The notification manager enables the application to display notification information in the status bar, which can be used to convey notification-type messages, and can automatically disappear after a short stay without user interaction. For example, the notification manager is used to notify the download completion, message reminder, etc. The notification manager can also be a notification that appears on the top status bar of the system in the form of a chart or scroll bar text, such as a notification of an application running in the background, or a notification that appears on the screen in the form of a dialog window. For example, prompting text information in the status bar, issuing a prompt sound, vibrating the electronic device, and flashing the indicator light, etc.
Android Runtime包括核心库和虚拟机。Android runtime负责安卓系统的调度和管理。Android Runtime includes core library and virtual machine. The Android runtime is responsible for the scheduling and management of the Android system.
核心库包含两部分：一部分是java语言需要调用的功能函数，另一部分是安卓的核心库。The core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
应用程序层和应用程序框架层运行在虚拟机中。虚拟机将应用程序层和应用程序框架层的java文件执行为二进制文件。虚拟机用于执行对象生命周期的管理,堆栈管理,线程管理,安全和异常的管理,以及垃圾回收等功能。The application layer and the application framework layer run in virtual machines. The virtual machine executes the java files of the application program layer and the application program framework layer as binary files. The virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
系统库可以包括多个功能模块。例如:表面管理器(surface manager),媒体库(Media Libraries),三维图形处理库(例如:OpenGL ES),2D图形引擎(例如:SGL)等。A system library can include multiple function modules. For example: surface manager (surface manager), media library (Media Libraries), 3D graphics processing library (eg: OpenGL ES), 2D graphics engine (eg: SGL), etc.
表面管理器用于对显示子系统进行管理,并且为多个应用程序提供了2D和3D图层的融合。The surface manager is used to manage the display subsystem and provides the fusion of 2D and 3D layers for multiple applications.
媒体库支持多种常用的音频,视频格式回放和录制,以及静态图像文件等。媒体库可以支持多种音视频编码格式,例如:MPEG4,H.264,MP3,AAC,AMR,JPG,PNG等。The media library supports playback and recording of various commonly used audio and video formats, as well as still image files, etc. The media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
三维图形处理库用于实现三维图形绘图,图像渲染,合成,和图层处理等。The 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing, etc.
2D图形引擎是2D绘图的绘图引擎。2D graphics engine is a drawing engine for 2D drawing.
内核层是硬件和软件之间的层。内核层至少包含显示驱动,摄像头驱动,音频驱动,传感器驱动。The kernel layer is the layer between hardware and software. The kernel layer includes at least a display driver, a camera driver, an audio driver, and a sensor driver.
当电子设备通过内置扬声器外放音频时，由于设备的尺寸限制，扬声器的尺寸比较小，其允许的膜振幅度较小。当外放音频的响度过大时，会导致扬声器的膜振幅度超出最大值，从而使得在大音量下播放声音容易出现破音，产生类似“呲呲”的声音。为了解决上述问题，通常对音源进行处理，从而使得扬声器在外放声音时能够减小杂音。When an electronic device plays audio through a built-in speaker, the speaker is relatively small due to the size limitation of the device, so its allowable diaphragm vibration amplitude is small. When the loudness of the played audio is too high, the diaphragm vibration amplitude of the speaker exceeds its maximum, so sound played at high volume is prone to breaking, producing a crackling noise similar to "呲呲". To solve this problem, the sound source is usually processed so that the speaker produces less noise when playing sound.
一般，将音源分为四类，分别为：第一类音源、第二类音源、第三类音源以及第四类音源。第一类音源的特点是：该音源的音频信号在频谱上的分布不均匀，能量集中在中低频，能量较强，且持续时间较长，例如钢琴声。这类音源的音频通过扬声器进行回放时，容易产生杂音。第二类音源的特点是：该音源的音频信号在频谱上的分布不均匀，且能量主要集中在中低频，但能量较弱，例如，人声。第三类音源的特点是，该音源的音频信号在频谱上分布不均匀，且能量较强，但是，能量集中具备瞬态性，即能量持续的时间较短，例如，鼓声。第四类音源的特点是：该音源的音频信号在频谱上的分布均匀。对于上述四类音源，第一类音源由于能量分布不均、能量大且持续时间较长，相较于其它三类音源，电子设备使用扬声器回放第一类音源的音频，更加容易发出杂音。因此，为了解决上述问题，若音频信号中包括第一类音源的音频信号，电子设备会在扬声器回放音频之前，压制第一类音源的音频信号，再将压制后的音频信号发送给扬声器，再由扬声器输出音频，从而抑制回放音频时产生的杂音。Generally, sound sources are divided into four categories: the first type, the second type, the third type, and the fourth type. The first type of sound source is characterized in that its audio signal is unevenly distributed over the frequency spectrum, its energy is concentrated in the middle and low frequencies, the energy is strong, and it lasts a long time, for example, the sound of a piano. When the audio of this type of source is played back through the speaker, noise is easily generated. The second type of sound source is characterized in that its audio signal is unevenly distributed over the frequency spectrum and its energy is mainly concentrated in the middle and low frequencies, but the energy is weak, for example, a human voice. The third type of sound source is characterized in that its audio signal is unevenly distributed over the spectrum and its energy is strong, but the energy concentration is transient, that is, the energy lasts only a short time, for example, the sound of drums. The fourth type of sound source is characterized in that its audio signal is evenly distributed over the frequency spectrum. Among the above four types, the first type has uneven energy distribution, large energy, and long duration; compared with the other three types, an electronic device playing back audio of the first type through a speaker is more likely to produce noise. Therefore, to solve this problem, if the audio signal includes an audio signal of the first type of sound source, the electronic device suppresses that audio signal before the speaker plays back the audio, then sends the suppressed audio signal to the speaker, and the speaker outputs the audio, thereby suppressing the noise generated during playback.
电子设备处理音频信号的方法为：对输入的音频信号进行分帧，再进行时频变换，得到频域信号，对每帧频域信号进行调性计算，得到每帧频域信号的调性值，将调性值与设定的第一阈值进行比较，进而判断该音频信号在频域内的分布是否均匀。若频域信号的调性值大于或等于第一阈值，则说明该帧频域信号在频谱内的分布不均匀，表明该帧频域信号需要进行压制，并根据相关策略压制该帧频域信号的能量，即进行峰值压制。若频域信号的调性值小于第一阈值，则判断该帧频域信号在频谱内的分布均匀，该帧频域信号不需要压制。其中，第一阈值可以基于历史数据得到，也可以基于经验值得到，还可以基于实验数据测试得到，本申请实施例对此不做限制。The method for the electronic device to process the audio signal is: divide the input audio signal into frames, then perform a time-frequency transform to obtain frequency domain signals, perform a tonality calculation on each frame of the frequency domain signal to obtain its tonality value, and compare the tonality value with a set first threshold to judge whether the distribution of the audio signal in the frequency domain is uniform. If the tonality value of the frequency domain signal is greater than or equal to the first threshold, the frame's frequency domain signal is unevenly distributed over the spectrum, indicating that the frame needs to be suppressed, and the energy of the frame's frequency domain signal is suppressed according to a relevant strategy, that is, peak suppression is performed. If the tonality value is smaller than the first threshold, the frame's frequency domain signal is judged to be evenly distributed over the spectrum, and the frame does not need to be suppressed. The first threshold may be obtained based on historical data, empirical values, or experimental data tests, which is not limited in this embodiment of the present application.
在上述处理音频信号的方法中，电子设备仅是对音频信号通过调性判断的结果决定音频信号是否需要进行压制。但是，仅通过调性判断音频信号是否应该进行压制是不准确和不全面的，因为，音频外放是否会产生杂音不仅与音频信号的调性相关，还与音频信号频域能量的强弱以及音频信号在频域能量持续的时间长短有关。例如，图5A是扬琴音频的调性计算结果图，在图5A中，若设定第一阈值为0.7，扬琴音频的调性值总体上是超过第一阈值的，根据上述处理音频信号的方法，电子设备会将扬琴音频判断为需要压制的音频信号。但是，实际上，扬琴音频信号虽然在频域上分布不均，但是，由于其能量集中的瞬态性较强，电子设备通过扬声器回放扬琴音频时，不易产生杂音，若将扬琴音频信号进行压制，可能会造成回放的扬琴音频的音色失真。另外，对于同一类的音源，其音频信号的调性计算结果可能有巨大的差异，例如，图5B为鼓声的调性计算结果图，图5C是鼓诗音频的调性计算结果图，鼓诗和鼓声都属于同类音源（都属于第三类音源），鼓声和鼓诗的调性计算结果差异大，很难选取合适的第一阈值来同时表征鼓声音频和鼓诗音频在频域内的分布情况。另外，对音频信号的调性判断可能发生漏检的情况，例如，第一阈值为0.7，在外放音频中存在一段钢琴（第一类音源）音频，其调性计算结果如图5D所示，在图5D中，在496帧与562帧期间，钢琴音频信号的调性值小于0.7，那么在这段时域范围内，电子设备不会对该钢琴音频信号进行压制，但在其它帧，电子设备会对该钢琴音频信号进行压制，会导致回放的钢琴音频音量在496帧到562帧这段时间内，钢琴声的音量会发生突变，给用户极差的听觉体验。因此，若仅以调性结果作为是否对音频信号进行压制的唯一判断因素，存在选取第一阈值难、容易发生漏检或误检的问题，进而压制没有产生杂音的音源或者不压制产生杂音的音源。In the above method for processing an audio signal, the electronic device decides whether the audio signal needs to be suppressed based only on the tonality judgment. However, judging whether an audio signal should be suppressed by tonality alone is inaccurate and incomplete, because whether the played-back audio produces noise is related not only to the tonality of the audio signal but also to the strength of the audio signal's frequency-domain energy and to how long that energy lasts. For example, FIG. 5A shows the tonality calculation result of yangqin (dulcimer) audio. In FIG. 5A, if the first threshold is set to 0.7, the tonality value of the yangqin audio generally exceeds the first threshold, so according to the above method, the electronic device would judge the yangqin audio as an audio signal that needs to be suppressed. In fact, although the yangqin audio signal is unevenly distributed in the frequency domain, its energy concentration is strongly transient, so when the electronic device plays back yangqin audio through the speaker, noise is not easily generated; suppressing the yangqin audio signal may distort the timbre of the played-back yangqin audio. In addition, for the same type of sound source, the tonality calculation results of the audio signals may differ greatly. For example, FIG. 5B shows the tonality calculation result of drum sound, and FIG. 5C shows the tonality calculation result of drum-poem audio. Drum poem and drum sound belong to the same type of sound source (both are the third type), yet their tonality results differ greatly, making it difficult to select a suitable first threshold that simultaneously characterizes the frequency-domain distributions of both the drum audio and the drum-poem audio. Furthermore, the tonality judgment of the audio signal may miss detections. For example, with the first threshold at 0.7, suppose the played-back audio contains a piece of piano (first type of sound source) audio whose tonality calculation result is shown in FIG. 5D. In FIG. 5D, between frames 496 and 562, the tonality value of the piano audio signal is less than 0.7, so within this time range the electronic device would not suppress the piano audio signal, while in other frames it would; as a result, the volume of the played-back piano audio changes abruptly during frames 496 to 562, giving the user an extremely poor listening experience. Therefore, if the tonality result is the only factor used to decide whether to suppress the audio signal, selecting the first threshold is difficult and missed or false detections are likely, leading to suppressing sources that do not produce noise or failing to suppress sources that do.
为了解决上述问题,本申请实施例提供了一种音频信号的处理方法。通过识别音频的音源类型,判断该音源是否为第一类音源,若为第一类音源,则对第一类音源的音频信号进行峰值压制,并将压制后的音频信号发送给扬声器输出。In order to solve the above problem, an embodiment of the present application provides a method for processing an audio signal. By identifying the sound source type of the audio, it is judged whether the sound source is the first type of sound source, and if it is the first type of sound source, the peak value of the audio signal of the first type of sound source is suppressed, and the suppressed audio signal is sent to the speaker for output.
下面,结合图6,对电子设备处理音频信号的具体流程进行说明。请参见图6,图6是本申请实施例提供的一种处理音频信号的流程图,处理音频信号的具体流程为:Next, with reference to FIG. 6 , a specific process of processing an audio signal by an electronic device will be described. Please refer to FIG. 6. FIG. 6 is a flow chart for processing audio signals provided by an embodiment of the present application. The specific process for processing audio signals is as follows:
步骤S601:音频应用启动。Step S601: start the audio application.
示例性的，如图7A所示，当电子设备100检测到针对音频应用图标7011的输入操作（例如，单击）后，电子设备100显示如图7B所示的启动界面702，在显示启动界面702的过程中，音频应用开始启动。当电子设备显示如图7C所示的音频应用的主界面703时，音频应用启动完成。其中，图7A-图7C所示的音频应用为音乐应用，音频应用也可以为视频应用，还可以为其它能够播放音频的应用，本申请实施例仅作举例说明，不做限制。Exemplarily, as shown in FIG. 7A, when the electronic device 100 detects an input operation (for example, a tap) on the audio application icon 7011, the electronic device 100 displays the startup interface 702 shown in FIG. 7B; while the startup interface 702 is displayed, the audio application is starting. When the electronic device displays the main interface 703 of the audio application as shown in FIG. 7C, the audio application has finished starting. The audio application shown in FIG. 7A to FIG. 7C is a music application; the audio application may also be a video application or another application capable of playing audio, and this embodiment of the present application is merely illustrative and not limiting.
步骤S602:音频应用向混音线程模块发送音频信号。Step S602: the audio application sends an audio signal to the mixing thread module.
步骤S603:混音线程模块将所述音频信号进行分帧处理,得到M帧音频信号。Step S603: The audio mixing thread module divides the audio signal into frames to obtain M frames of audio signals.
具体地，电子设备是实时处理音频信号，扬声器再将处理好的音频信号以音频的形式输出。考虑信号的短时平稳性以及回放的实时性，即回放时候不希望引入太大延时，所以对信号进行分帧处理，例如，10ms为一帧。Specifically, the electronic device processes the audio signal in real time, and the speaker then outputs the processed audio signal in the form of audio. Considering the short-time stationarity of the signal and the real-time requirement of playback, that is, no excessive delay should be introduced during playback, the signal is divided into frames, for example, 10 ms per frame.
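The framing step above can be sketched as follows (a minimal Python illustration; the 48 kHz sample rate is an assumption for demonstration and is not specified in this application):

```python
import numpy as np

def split_into_frames(signal, sample_rate=48000, frame_ms=10):
    """Split a 1-D audio signal into consecutive frames of frame_ms milliseconds.
    Trailing samples that do not fill a whole frame are dropped."""
    frame_len = sample_rate * frame_ms // 1000           # e.g. 480 samples at 48 kHz
    n_frames = len(signal) // frame_len                  # M frames
    return signal[:n_frames * frame_len].reshape(n_frames, frame_len)

# one second of audio at 48 kHz -> 100 frames of 10 ms each
audio = np.zeros(48000)
frames = split_into_frames(audio)
print(frames.shape)  # (100, 480)
```

Each row of the returned array is then processed independently as one frame, which keeps the per-frame latency at the frame duration.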
步骤S604:混音线程模块将第n帧音频信号进行时频变换,得到该帧音频信号的频域信号。Step S604: the audio mixing thread module performs time-frequency conversion on the audio signal of the nth frame to obtain the frequency domain signal of the audio signal of the frame.
具体地，混音线程模块可以通过对音频信号进行傅里叶变换（Fourier Transform，FT）或快速傅里叶变换（Fast Fourier Transform，FFT），得到音频信号的频域信号，混音线程模块也可以通过对音频信号进行梅尔谱变换，得到频域信号，混音线程模块还可以通过对音频信号进行改进离散余弦变换（Modified Discrete Cosine Transform，MDCT），得到频域信号，本申请实施例以通过FFT对音频信号进行时频变换为例，进行说明。在进行FFT之前，对于每帧信号可以进行交叠、加窗，目的是为了减少频域变换时的频谱泄露，减少频域处理失真。混音线程模块在将音频信号进行时频变换后，就可以得到该音频信号的频域信号的所有组成的频率成分，便于对信号的不同频率进行分析计算。Specifically, the audio mixing thread module may obtain the frequency domain signal of the audio signal by performing a Fourier transform (FT) or a fast Fourier transform (FFT) on the audio signal; it may also obtain the frequency domain signal by performing a Mel-spectrum transform, or a modified discrete cosine transform (MDCT), on the audio signal. The embodiment of the present application takes the time-frequency transform of the audio signal by FFT as an example for description. Before performing the FFT, each frame of the signal may be overlapped and windowed, in order to reduce spectral leakage during the frequency domain transform and reduce frequency domain processing distortion. After the audio mixing thread module performs the time-frequency transform on the audio signal, it obtains all constituent frequency components of the frequency domain signal, which facilitates analyzing and calculating the different frequencies of the signal.
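The windowing-plus-FFT step can be sketched as below (a non-authoritative Python illustration; the Hann window and the 480-sample frame at an assumed 48 kHz rate are choices made for the example, not requirements of this application):

```python
import numpy as np

def frame_to_spectrum(frame):
    """Apply a Hann window to one audio frame and return its magnitude spectrum.
    Windowing reduces spectral leakage in the time-frequency transform."""
    windowed = frame * np.hanning(len(frame))
    return np.abs(np.fft.rfft(windowed))

# a 1 kHz tone, one 10 ms frame at 48 kHz: exactly 10 cycles per frame
frame = np.sin(2 * np.pi * 1000 * np.arange(480) / 48000)
spectrum = frame_to_spectrum(frame)
peak_bin = int(np.argmax(spectrum))
print(peak_bin * 48000 / 480)  # bin index converted to Hz: 1000.0
```

With a 480-point FFT the bin spacing is 100 Hz, so the tone lands exactly on bin 10, which is why the peak bin maps back to 1000 Hz.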
步骤S605:混音线程模块对所述频域信号进行调性计算,得到所述频域信号的调性值。Step S605: The sound mixing thread module performs tonality calculation on the frequency domain signal to obtain the tonality value of the frequency domain signal.
具体地，混音线程模块依次对第n帧频域信号进行调性计算，并得到相应的调性值。混音线程模块对该频域信号进行调性计算的目的是为了判断该帧音频信号在频域内的能量分布是否均匀。若该频域信号的调性值大于或等于第一阈值，则判断该帧音频信号在频域内的能量分布不均匀，若该频域信号的调性值小于第一阈值，则判断该帧音频信号在频域内的能量分布均匀。其中，第一阈值可以是基于经验值得到，也可以是基于历史数据得到，还可以是基于实验数据得到，本申请实施例对此不做限制。混音线程模块计算调性值的方法为：Specifically, the sound mixing thread module performs the tonality calculation on the frequency domain signal of the nth frame and obtains the corresponding tonality value. The purpose of this calculation is to judge whether the energy distribution of the frame's audio signal in the frequency domain is uniform. If the tonality value of the frequency domain signal is greater than or equal to the first threshold, the energy distribution of the frame's audio signal in the frequency domain is judged to be uneven; if the tonality value is smaller than the first threshold, the energy distribution is judged to be uniform. The first threshold may be obtained based on empirical values, historical data, or experimental data, which is not limited in this embodiment of the present application. The sound mixing thread module calculates the tonality value as follows:
混音线程模块根据公式(1)计算频域信号的平坦度Flatness,公式(1)如下所示:The mixing thread module calculates the flatness Flatness of the frequency domain signal according to the formula (1), and the formula (1) is as follows:
Flatness = [∏(n=0…N-1) x(n)]^(1/N) / [(1/N)·∑(n=0…N-1) x(n)]   (1)
其中，N为将音频信号进行FFT变换的长度，x(n)为该帧频域信号第n个频点的能量值，Flatness用于表示频域信号在频域内的能量分布情况，Flatness越大，分布越均匀，Flatness越小，分布越不均匀。然后，混音线程模块根据公式(2)，计算第一参数SFMdB，公式(2)如下所示：Here, N is the FFT length of the audio signal, x(n) is the energy value of the nth frequency bin of the frame's frequency domain signal, and Flatness represents the energy distribution of the frequency domain signal in the frequency domain: the larger the Flatness, the more uniform the distribution; the smaller the Flatness, the more uneven the distribution. Then, the mixing thread module calculates the first parameter SFMdB according to formula (2), which is as follows:
SFMdB = 10·log10(Flatness)   (2)
然后,混音线程模块根据公式(3)计算该帧频域信号的调性值α,公式(3)如下所示:Then, the sound mixing thread module calculates the tonal value α of the frequency domain signal of the frame according to the formula (3), and the formula (3) is as follows:
α = min(SFMdB / SFMdBMax, 1)   (3)
其中,SFMdBMax的取值可以由历史值得到,也可以由经验值得到,还可以由实验数据得到,本申请实施例对此不做限制。优选地,SFMdBMax可以设置为-60dB。Wherein, the value of SFMdBMax may be obtained from historical values, empirical values, or experimental data, which is not limited in this embodiment of the present application. Preferably, SFMdBMax can be set to -60dB.
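Assuming equation (3) takes the usual spectral-flatness form α = min(SFMdB/SFMdBMax, 1), the whole tonality computation of equations (1)-(3) can be sketched in Python as follows (the small epsilon guarding the logarithms and the 256-bin test spectra are assumptions made for the example; SFMdBMax uses the -60 dB value suggested above):

```python
import numpy as np

SFM_DB_MAX = -60.0  # preferred value suggested in the text

def tonality(bin_energy):
    """Tonality value per equations (1)-(3): spectral flatness (geometric mean
    over arithmetic mean of the bin energies), converted to dB, divided by
    SFMdBMax, and clipped to at most 1."""
    x = np.asarray(bin_energy, dtype=float) + 1e-12   # epsilon guards log of zero
    flatness = np.exp(np.mean(np.log(x))) / np.mean(x)   # equation (1)
    sfm_db = 10.0 * np.log10(flatness)                   # equation (2)
    return min(sfm_db / SFM_DB_MAX, 1.0)                 # equation (3)

flat = np.ones(256)          # uniform spectrum: noise-like
peaky = np.full(256, 1e-12)
peaky[10] = 1.0              # energy concentrated in one bin: tonal
print(tonality(flat))        # close to 0
print(tonality(peaky))       # 1.0
```

A uniform spectrum gives Flatness near 1 and hence SFMdB near 0, so α is near 0; a highly concentrated spectrum drives SFMdB well below SFMdBMax and the min-clip in equation (3) pins α at 1.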
步骤S606:混音线程模块基于神经网络获取所述频域信号的标签。Step S606: The sound mixing thread module obtains the label of the frequency domain signal based on the neural network.
具体地，混音线程模块将所述该帧频域信号作为神经网络的输入，神经网络输出该帧频域信号的标签，该标签用于指示该帧频域信号的音源类型。所述标签包括第一标签、第二标签、第三标签和第四标签，第一标签用于指示频域信号的音源类型为第一类音源，第二标签用于指示频域信号的音源类型为第二类音源，第三标签用于指示频域信号的音源类型为第三类音源，第四标签用于指示频域信号的音源类型为第四类音源。本申请实施例以第一标签为0，第二标签为1，第三标签为2，第四标签为3为例，进行说明。其中，该神经网络是已训练好的神经网络。Specifically, the sound mixing thread module takes the frame's frequency domain signal as the input of the neural network, and the neural network outputs a label for the frame's frequency domain signal; the label indicates the sound source type of the frame's frequency domain signal. The labels include a first label, a second label, a third label, and a fourth label: the first label indicates that the sound source type of the frequency domain signal is the first type of sound source, the second label indicates the second type, the third label indicates the third type, and the fourth label indicates the fourth type. The embodiment of the present application is described taking the first label as 0, the second label as 1, the third label as 2, and the fourth label as 3 as an example. The neural network here is an already trained neural network.
示例性的，神经网络可以进行离线训练，神经网络的训练过程为：选取大量帧长为10ms（也可以选取其它帧长的频域信号，本申请实施例不做限制）的第一类音源的频域信号（例如，钢琴声）、第二类音源的频域信号（例如，人声）、第三类音源的频域信号（例如，鼓声）以及第四类音源的频域信号作为训练样本。当将第一类音源的频域信号作为神经网络的输入后，神经网络会输出该频域信号的标签，将神经网络输出的标签与标签0进行对比，得到一个偏差值Fn1，Fn1用于表征神经网络输出的标签与标签0的差异程度。Exemplarily, the neural network may be trained offline. The training process is: select, as training samples, a large number of frequency domain signals with a frame length of 10 ms (frequency domain signals of other frame lengths may also be selected, which is not limited in this embodiment) of the first type of sound source (for example, piano sound), the second type of sound source (for example, human voice), the third type of sound source (for example, drum sound), and the fourth type of sound source. When a frequency domain signal of the first type of sound source is input to the neural network, the neural network outputs a label for that signal; comparing the output label with label 0 yields a deviation value Fn1, which represents the degree to which the label output by the neural network differs from label 0.
然后，基于所述Fn1调节神经网络内部的参数，从而使得神经网络输出的第一类音源的音频信号的标签为标签0。同理，通过其它训练样本（第二类音源的频域信号、第三类音源的频域信号以及第四类音源的频域信号）训练神经网络，使得神经网络接收输入的频域信号时，可以输出对应的标签。Then, the internal parameters of the neural network are adjusted based on Fn1, so that the label the neural network outputs for the audio signal of the first type of sound source is label 0. Similarly, the neural network is trained with the other training samples (the frequency domain signals of the second, third, and fourth types of sound source), so that when the neural network receives an input frequency domain signal, it can output the corresponding label.
需要说明的，在训练样本中，一帧频域信号可能存在多种类型的音源。例如，在一首音乐中，歌手在不停地唱歌，歌曲的伴奏为钢琴声，那么，在这首音乐中，存在钢琴（第一类音源）和人声（第二类音源）这两类音源。这时，如果在一帧频域样本信号中存在多类音源，可以根据音源的强度来确定该样本信号的标签。例如，在一帧频域样本信号中，若钢琴声明显大于人声，将该频域样本信号的音源确定为第一类音源，设置标签为0。It should be noted that, in the training samples, multiple types of sound sources may exist in one frame of the frequency domain signal. For example, in a piece of music a singer sings continuously while the accompaniment is a piano; in this piece of music there are therefore two types of sound source: the piano (the first type) and the human voice (the second type). In this case, if multiple types of sound sources exist in one frame of the frequency-domain sample signal, the label of the sample signal can be determined according to the intensity of the sound sources. For example, in one frame of the frequency-domain sample signal, if the piano sound is obviously louder than the human voice, the sound source of that frequency-domain sample signal is determined as the first type of sound source, and its label is set to 0.
步骤S607:混音线程模块基于所述频域信号的调性值和所述频域信号的标签判断所述频域信号是否为第一类音源。Step S607: The sound mixing thread module judges whether the frequency domain signal is the first type of sound source based on the tonality value of the frequency domain signal and the label of the frequency domain signal.
具体地,若判断为是,执行步骤S608,若判断为否,执行步骤S610。Specifically, if the judgment is yes, execute step S608; if the judgment is no, execute step S610.
由于神经网络的训练样本有限，且音源的种类很多，例如包括钢琴声、扬琴声、口琴声、琵琶声等，当输入神经网络未训练过音源的频域信号时，神经网络输出的标签的准确性不高。例如，当输入琵琶声时，神经网络可能判断其为第一类音源，输出标签0，实际上，琵琶声为第三类音源。为了解决上述问题，在神经网络判断该帧频域信号的音源类型为第一类音源（输出标签0）后，混音线程模块同时也会判断该帧频域信号在频域上的能量分布是否不均匀，只有在神经网络输出的标签为0，且混音线程模块判断该帧频域信号在频域上的能量分布不均匀的情况下，混音线程模块才会确定该帧频域信号的类型为第一类音源。因此，若该帧频域信号的调性值大于或等于第一阈值且神经网络输出标签为0时，混音线程模块判断该帧频域信号的音源为第一类音源，反之，不为第一类音源。Due to the limited training samples of the neural network and the many kinds of sound sources, for example, piano, yangqin, harmonica, and pipa sounds, when a frequency domain signal of a sound source the network has not been trained on is input, the accuracy of the label output by the neural network is not high. For example, when the sound of a pipa is input, the neural network may judge it to be the first type of sound source and output label 0, while in fact the pipa sound is the third type of sound source. To solve this problem, after the neural network judges that the sound source type of the frame's frequency domain signal is the first type (outputs label 0), the sound mixing thread module also judges whether the energy distribution of the frame's frequency domain signal in the frequency domain is uneven. Only when the label output by the neural network is 0 and the sound mixing thread module judges that the energy distribution of the frame's frequency domain signal is uneven does the module determine that the frame's frequency domain signal is of the first type of sound source. Therefore, if the tonality value of the frame's frequency domain signal is greater than or equal to the first threshold and the neural network outputs label 0, the sound mixing thread module judges the sound source of the frame's frequency domain signal to be the first type of sound source; otherwise, it is not the first type of sound source.
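The combined check of step S607 can be expressed as a short sketch (Python; the label values follow the 0/1/2/3 convention above, and the 0.7 threshold is the example value used in the figures, not a fixed requirement):

```python
FIRST_THRESHOLD = 0.7  # example first-threshold value from the text's figures

def is_first_type_source(label, tonality_value, threshold=FIRST_THRESHOLD):
    """The frame is treated as a first-type source only when the network
    labels it 0 AND the tonality confirms an uneven spectral distribution."""
    return label == 0 and tonality_value >= threshold

print(is_first_type_source(0, 0.85))  # both conditions met -> True
print(is_first_type_source(0, 0.55))  # network says piano, tonality disagrees -> False
print(is_first_type_source(2, 0.85))  # drum-like label -> False
```

Requiring both conditions is what guards against a misclassified pipa frame being suppressed: a label of 0 alone is not enough.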
步骤S608:混音线程模块对所述频域信号进行峰值检测。Step S608: The sound mixing thread module performs peak detection on the frequency domain signal.
具体地,混音线程模块对该帧频域信号峰值检测,即获取该帧频域信号在频域内的波峰和波谷的幅值。例如,图8是该帧频域信号的波形图,在该波形图中,包括X个波峰和Y个波谷,峰值检测的目的就是为了获取这X个波峰和Y个波谷的幅值,幅值从大到小依次称作最大峰值、次大峰值、第三峰值……。Specifically, the sound mixing thread module detects the peak value of the frequency domain signal of the frame, that is, obtains the amplitudes of the peak and valley of the frequency domain signal of the frame in the frequency domain. For example, Fig. 8 is a waveform diagram of the frequency domain signal of this frame. In the waveform diagram, there are X peaks and Y valleys. The purpose of peak detection is to obtain the amplitudes of these X peaks and Y valleys. From large to small, they are called the largest peak, the second largest peak, the third peak...
混音线程模块对频域信号进行峰值检测的一种方法为：根据信号的频域能量分布，通过对其求导数得到极值的方法。例如：假设时域第n帧信号为x(n)，FFT长度为N，其对应频域信号频点能量为X(k)，k=0,1,2…N-1。各频点累积能量为E[m] = ∑(k=0…m) X(k)，m=0,1,2…N-1，频点总能量为Y，那么在设定寻找峰值的频点范围m内，能量比值为R[m]=E[m]/Y，m=0,1,2…N-1，然后对能量比值进行求导得到R[m]*，寻找R[m]*中的最大值以及次大值即表示最大峰以及次大峰值所在频点位置。One method for the mixing thread module to detect peaks in the frequency domain signal is to obtain extreme values by taking the derivative of the signal's frequency-domain energy distribution. For example, assume the nth frame signal in the time domain is x(n), the FFT length is N, and the bin energies of the corresponding frequency domain signal are X(k), k=0,1,2…N-1. The cumulative energy up to each bin is E[m] = ∑(k=0…m) X(k), m=0,1,2…N-1, and the total energy over the bins is Y. Within the set bin range m for peak searching, the energy ratio is R[m]=E[m]/Y, m=0,1,2…N-1. The derivative of the energy ratio is then taken to obtain R[m]*, and the maximum and second maximum values in R[m]* indicate the bin positions of the largest and second-largest peaks.
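Under the stated definitions, this cumulative-energy peak search can be sketched as follows (a Python illustration; the toy bin energies are invented for demonstration):

```python
import numpy as np

def find_top_peaks(bin_energy, n_peaks=2):
    """Locate the largest and second-largest peaks via the cumulative-energy
    ratio R[m] = E[m]/Y; its discrete derivative recovers each bin's share of
    the total energy, and the biggest derivative values mark the peak bins."""
    X = np.asarray(bin_energy, dtype=float)
    E = np.cumsum(X)               # E[m]: energy accumulated up to bin m
    Y = E[-1]                      # total energy over the search range
    R = E / Y                      # energy ratio R[m]
    dR = np.diff(R, prepend=0.0)   # derivative R[m]*, i.e. X(m)/Y per bin
    return np.argsort(dR)[::-1][:n_peaks]

X = np.array([0.1, 0.2, 5.0, 0.3, 2.5, 0.1])
print(find_top_peaks(X))  # bins 2 and 4: largest and second-largest peaks
```

Since the derivative of the cumulative ratio is just each bin's energy share, the search reduces to ranking bins by that share.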
步骤S609:混音线程模块对所述频域信号采用第一类压制策略进行处理,得到处理后的频域信号。Step S609: The sound mixing thread module processes the frequency domain signal using the first type of suppression strategy to obtain a processed frequency domain signal.
具体地,所述第一类压制策略为:混音线程模块对该帧频域信号的峰值进行单峰值压制或多峰值压制。若对该帧频域信号进行单峰值压制,则对该帧频域信号的最大峰进行压制。若对该帧频域信号进行多峰值压制,则至少对该帧频域信号的最大峰值和次大峰值进行压制。Specifically, the first type of suppression strategy is: the sound mixing thread module performs single peak suppression or multi-peak suppression on the peak value of the frequency domain signal of the frame. If the single peak suppression is performed on the frequency domain signal of the frame, the maximum peak of the frequency domain signal of the frame is suppressed. If multi-peak suppression is performed on the frequency domain signal of the frame, at least the maximum peak value and the second maximum peak value of the frequency domain signal of the frame frame are suppressed.
混音线程模块压制峰值的具体方法为：根据频域寻找到峰值，计算该峰值的能量与第二阈值的差值，基于所述差值计算差值增益。原始频点乘以差值增益从而减少对应频点能量，例如，检测当前最大峰值是-10dB，第二阈值设定为-15dB，那么最大峰值差值为-5dB，转换到线性值其差值增益约为0.562，那么原始频点乘以0.562达到减少频点能量的目的。需要说明的是，第二阈值是预设的最大峰值，可以基于经验值得到，也可以基于历史数据得到，还可以基于实验数据得到，本申请实施例不做任何限制。The specific method by which the sound mixing thread module suppresses a peak is: find the peak in the frequency domain, calculate the difference between the energy of the peak and the second threshold, and calculate a difference gain based on that difference. The original frequency bin is multiplied by the difference gain to reduce the energy of the corresponding bin. For example, if the detected current maximum peak is -10 dB and the second threshold is set to -15 dB, the maximum peak difference is -5 dB; converted to a linear value, the difference gain is about 0.562, so multiplying the original bin by 0.562 reduces the bin energy. It should be noted that the second threshold is a preset maximum peak value, which may be obtained based on empirical values, historical data, or experimental data, and is not limited in this embodiment of the present application.
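The worked example above (a -10 dB peak against a -15 dB second threshold giving a linear gain of about 0.562) corresponds to the standard dB-to-amplitude conversion, sketched here in Python (the no-suppression branch when the peak is already below the threshold is an assumption for completeness):

```python
def suppression_gain(peak_db, threshold_db):
    """Difference gain for peak suppression: the dB gap between the detected
    peak and the second threshold, converted to a linear amplitude gain."""
    diff_db = threshold_db - peak_db   # negative when the peak exceeds the threshold
    if diff_db >= 0:
        return 1.0                     # peak already below threshold: no suppression
    return 10.0 ** (diff_db / 20.0)    # dB difference -> linear amplitude gain

# the worked example from the text: -10 dB peak, -15 dB threshold -> -5 dB gap
gain = suppression_gain(-10.0, -15.0)
print(round(gain, 3))  # 0.562
```

Multiplying the peak bin by this gain pulls its level down exactly to the second threshold.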
步骤S610:混音线程模块对所述频域信号采用第二类压制策略进行处理,得到处理后的频域信号。Step S610: The sound mixing thread module processes the frequency domain signal using a second type of suppression strategy to obtain a processed frequency domain signal.
Specifically, when the sound-source type of the frame of frequency-domain signal is not the first type of sound source, the mixing thread module applies the second type of suppression strategy to the frame, namely: the mixing thread module may suppress the peaks of the frame of frequency-domain signal, or may leave the frame unsuppressed.
Before a frame of frequency-domain signal is suppressed, peak detection must be performed on it, and the peak suppression must take into account that audio signals in adjacent frames are strongly correlated. Therefore, the difference between this frame's difference gain and the previous frame's difference gain should stay within a reasonable range. For example, suppose the allowed range of that gain difference is 0.2 to 0.3, the sound-source type of the (n-1)-th frame of frequency-domain signal is the first type of sound source and it is suppressed with a difference gain of 0.5, and the n-th frame of frequency-domain signal is human voice (the second type of sound source). If the n-th frame is to be suppressed, its difference gain must then lie in the range 0.7 to 0.8. If the difference gain of the n-th frame were higher than 0.8, the energy gap between the suppressed (n-1)-th frame and the suppressed n-th frame could become too large, producing an abrupt volume change (for example, the sound suddenly becoming louder) when the speaker plays back these two frames. If the difference gain of the n-th frame were lower than 0.7, the frame's energy could be over-suppressed, making the human voice very quiet when the speaker plays back that frame.
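Read literally, the worked example above constrains this frame's difference gain to a fixed offset window around the previous frame's gain. A minimal sketch of that clamping follows; the function name is hypothetical, and the 0.2-0.3 window is taken from the example rather than mandated by the method:

```python
def clamp_frame_gain(prev_gain, desired_gain, min_diff=0.2, max_diff=0.3):
    """Keep the change in difference gain between adjacent frames within
    [min_diff, max_diff]: capping the change avoids an abrupt volume jump
    at the frame boundary, and flooring it avoids over-suppression."""
    if desired_gain >= prev_gain:
        low, high = prev_gain + min_diff, prev_gain + max_diff
    else:
        low, high = prev_gain - max_diff, prev_gain - min_diff
    return min(max(desired_gain, low), high)

# previous frame (first-type source) had gain 0.5; the current human-voice
# frame must end up in [0.7, 0.8], as in the example above
assert clamp_frame_gain(0.5, 0.9) == 0.8
assert clamp_frame_gain(0.5, 0.75) == 0.75
```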
Step S611: the mixing thread module performs a frequency-to-time transform on the processed frequency-domain signal to obtain a single frame of audio signal.
Step S612: the mixing thread module sends the single frame of audio signal to the audio driver.
Step S613: the mixing thread module updates n according to the formula n = n + 1.
Specifically, when n is not equal to 0, the mixing thread module executes step S604.
Step S614: the audio driver sends the single frame of audio signal to the speaker.
Step S615: the speaker plays the audio corresponding to the single frame of audio signal.
The audio processing method provided by the embodiments of this application combines a neural network with a conventional detection algorithm: the neural network identifies the sound-source type of the audio signal, avoiding the misjudgments and missed detections of conventional algorithms as well as the difficulty of tuning an upper threshold for the tonality value. By applying different suppression gains and application times to different audio signals, the method changes the speaker input signal so as to reduce playback noise and reduce suppression distortion across different audio signals, while preserving the maximum playback loudness of the original signal.
The above embodiments may be implemented wholly or partly in software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center over a wired connection (such as coaxial cable, optical fiber, or digital subscriber line) or a wireless connection (such as infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state drive), among others.
A person of ordinary skill in the art will understand that all or part of the procedures of the above method embodiments can be carried out by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and when executed it may include the procedures of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as ROM, random-access memory (RAM), magnetic disks, or optical discs.
In summary, the above description is merely an embodiment of the technical solution of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made in accordance with the disclosure of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

  1. A method for processing an audio signal, characterized by comprising:
    obtaining an audio signal;
    when the tonality value of the audio signal is greater than or equal to a first threshold and the sound-source type of the audio signal is a first type of sound source, processing the audio signal with a first type of suppression strategy;
    otherwise, processing the audio signal with a second type of suppression strategy.
  2. The method according to claim 1, characterized in that the first type of suppression strategy is to suppress a single peak or multiple peaks of the audio signal in the frequency domain.
  3. The method according to any one of claims 1-2, characterized in that the second type of suppression strategy is to suppress a single peak or multiple peaks in the audio signal; or
    to perform no suppression processing on the audio signal.
  4. The method according to any one of claims 1-3, characterized in that, after the obtaining an audio signal, the method comprises:
    performing a tonality calculation on the audio signal to obtain the tonality value of the audio signal.
  5. The method according to claim 4, characterized in that the performing a tonality calculation on the audio signal to obtain the tonality value of the audio signal comprises:
    according to the formula

    Flatness = ( ∏_{n=0}^{N-1} x(n) )^{1/N} / ( (1/N) ∑_{n=0}^{N-1} x(n) ),

    calculating the flatness of the audio signal, where N is the length of the time-frequency transform of the audio signal, x(n) is the energy value of the n-th frequency bin of the audio signal in the frequency domain, and Flatness is the flatness of the audio signal;
    calculating a first parameter of the audio signal according to the formula SFMdB = 10·log₁₀(Flatness), where SFMdB is the first parameter;
    according to the formula

    α = min( SFMdB / SFMdBMax, 1 ),

    calculating the tonality value of the audio signal, where α is the tonality value of the audio signal and SFMdBMax is the maximum value of the first parameter.
  6. The method according to any one of claims 1-5, characterized in that, before the processing the audio signal with the first type of suppression strategy, the method further comprises:
    performing peak detection on the audio signal, the peak detection being used to obtain peak information of the audio signal in the frequency domain.
  7. The method according to claim 6, characterized in that the processing the audio signal with the first type of suppression strategy specifically comprises:
    calculating the difference between a peak of the audio signal and a second threshold, the peak comprising at least the maximum peak of the audio signal in the frequency domain;
    calculating a difference gain for the peak based on the difference;
    suppressing the peak according to the formula W′ = W·f, where f is the difference gain, W is the peak before suppression, and W′ is the peak after suppression.
  8. An electronic device, characterized by comprising a memory, a processor, and a touchscreen, wherein:
    the touchscreen is configured to display content;
    the memory is configured to store a computer program, the computer program comprising program instructions; and
    the processor is configured to invoke the program instructions to cause the electronic device to perform the method according to any one of claims 1-7.
  9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
  10. A computer program product comprising instructions, characterized in that, when the computer program product runs on an electronic device, it causes the electronic device to perform the method according to any one of claims 1-7.
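For illustration, the tonality computation recited in claims 4-5 can be sketched as follows. This is a non-authoritative sketch: the value of −60 dB for SFMdBMax is an assumption borrowed from the classic MPEG psychoacoustic model, since the claims do not fix a number, and the function name is hypothetical:

```python
import numpy as np

def tonality(x, sfm_db_max=-60.0):
    """Compute the tonality value alpha of one frame from the energies x(n)
    of its N frequency bins: spectral flatness (geometric mean over
    arithmetic mean), then SFMdB = 10*log10(Flatness), then
    alpha = min(SFMdB / SFMdBMax, 1)."""
    x = np.asarray(x, dtype=float)
    flatness = np.exp(np.mean(np.log(x))) / np.mean(x)  # flatness in (0, 1]
    sfm_db = 10.0 * np.log10(flatness)                  # first parameter, in dB
    return min(sfm_db / sfm_db_max, 1.0)                # tonality value alpha

# a perfectly flat (noise-like) spectrum has flatness 1, SFMdB 0, tonality 0;
# a strongly peaked (tonal) spectrum drives the tonality toward 1
assert abs(tonality([1.0] * 8)) < 1e-9
```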
PCT/CN2022/092367 2021-07-19 2022-05-12 Audio signal processing method and related electronic device WO2023000778A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110815051.X 2021-07-19
CN202110815051.XA CN115641870A (en) 2021-07-19 2021-07-19 Audio signal processing method and related electronic equipment

Publications (2)

Publication Number Publication Date
WO2023000778A1 WO2023000778A1 (en) 2023-01-26
WO2023000778A9 true WO2023000778A9 (en) 2023-06-15

Family

ID=84939464

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/092367 WO2023000778A1 (en) 2021-07-19 2022-05-12 Audio signal processing method and related electronic device

Country Status (2)

Country Link
CN (1) CN115641870A (en)
WO (1) WO2023000778A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116233696B (en) * 2023-05-05 2023-09-15 荣耀终端有限公司 Airflow noise suppression method, audio module, sound generating device and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9565508B1 (en) * 2012-09-07 2017-02-07 MUSIC Group IP Ltd. Loudness level and range processing
US9672843B2 (en) * 2014-05-29 2017-06-06 Apple Inc. Apparatus and method for improving an audio signal in the spectral domain
KR20170030384A (en) * 2015-09-09 2017-03-17 삼성전자주식회사 Apparatus and Method for controlling sound, Apparatus and Method for learning genre recognition model
CN108322868B (en) * 2018-01-19 2020-07-07 瑞声科技(南京)有限公司 Method for improving sound quality of piano played by loudspeaker
KR20230144650A (en) * 2018-09-07 2023-10-16 그레이스노트, 인코포레이티드 Methods and Apparatus for Dynamic Volume Adjustment via Audio Classification
CN109616135B (en) * 2018-11-14 2021-08-03 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method, device and storage medium
CN111343540B (en) * 2020-03-05 2021-07-20 维沃移动通信有限公司 Piano audio processing method and electronic equipment
CN112767967A (en) * 2020-12-30 2021-05-07 深延科技(北京)有限公司 Voice classification method and device and automatic voice classification method

Also Published As

Publication number Publication date
WO2023000778A1 (en) 2023-01-26
CN115641870A (en) 2023-01-24

Similar Documents

Publication Publication Date Title
JP7222112B2 (en) Singing recording methods, voice correction methods, and electronic devices
US11880628B2 (en) Screen mirroring display method and electronic device
CN109994127B (en) Audio detection method and device, electronic equipment and storage medium
CN109547848B (en) Loudness adjustment method and device, electronic equipment and storage medium
CN109243479B (en) Audio signal processing method and device, electronic equipment and storage medium
CN109003621B (en) Audio processing method and device and storage medium
CN111986691A (en) Audio processing method and device, computer equipment and storage medium
WO2023000778A9 (en) Audio signal processing method and related electronic device
US20240031766A1 (en) Sound processing method and apparatus thereof
CN109961802B (en) Sound quality comparison method, device, electronic equipment and storage medium
WO2022143258A1 (en) Voice interaction processing method and related apparatus
WO2020062014A1 (en) Method for inputting information into input box and electronic device
WO2024093515A1 (en) Voice interaction method and related electronic device
CN116055982B (en) Audio output method, device and storage medium
WO2023061330A1 (en) Audio synthesis method and apparatus, and device and computer-readable storage medium
WO2022007757A1 (en) Cross-device voiceprint registration method, electronic device and storage medium
CN114974213A (en) Audio processing method, electronic device and storage medium
CN113707162A (en) Voice signal processing method, device, equipment and storage medium
CN113840034B (en) Sound signal processing method and terminal device
CN115359156B (en) Audio playing method, device, equipment and storage medium
RU2777617C1 (en) Song recording method, sound correction method and electronic device
CN116546126B (en) Noise suppression method and electronic equipment
WO2024051638A1 (en) Sound-field calibration method, and electronic device and system
WO2024046416A1 (en) Volume adjustment method, electronic device and system
WO2023142784A1 (en) Volume control method, electronic device and readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22844945

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE