WO2023000778A9 - Audio signal processing method and related electronic device - Google Patents

Audio signal processing method and related electronic device

Info

Publication number
WO2023000778A9
WO2023000778A9 · PCT/CN2022/092367 · CN2022092367W
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
audio
type
frequency domain
signal
Prior art date
Application number
PCT/CN2022/092367
Other languages
English (en)
Chinese (zh)
Other versions
WO2023000778A1 (fr)
Inventor
胡贝贝
许剑峰
Original Assignee
北京荣耀终端有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京荣耀终端有限公司 filed Critical 北京荣耀终端有限公司
Publication of WO2023000778A1
Publication of WO2023000778A9


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003Changing voice quality, e.g. pitch or formants
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0324Details of processing therefor
    • G10L21/034Automatic adjustment
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M1/00Substation equipment, e.g. for use by subscribers
    • H04M1/72Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72433User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for voice messaging, e.g. dictaphones
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M1/00Substation equipment, e.g. for use by subscribers
    • H04M1/72Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/72442User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality for playing music files
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams

Definitions

  • the present application relates to the field of audio signal processing, in particular to an audio signal processing method and related electronic equipment.
  • For this type of audio signal, the frequency-bin energy in the frequency domain is concentrated in a narrow bandwidth (for example, as in piano music), and the frequency-domain energy distribution is uneven.
  • The duration is also long, and subjectively a buzzing noise can be heard, mainly because the excessively concentrated, long-lasting narrow-band energy causes the speaker to produce nonlinear distortion during electro-acoustic conversion.
  • The sound source corresponding to this type of audio signal is called the first type of sound source.
  • The traditional method of processing the first type of sound source falsely suppresses other types of sound sources, such as human voices, which degrades the subjective listening experience when such a sound source transitions to or from the first type of sound source. Therefore, how to suppress the audio signal of the first type of sound source while avoiding false suppression of other sound sources is a problem of concern to engineers.
  • The embodiments of the present application provide an audio signal processing method, which solves the problem that, in the process of suppressing audio signals prone to noise, other types of audio signals are wrongly suppressed and therefore distorted.
  • In a first aspect, an embodiment of the present application provides an audio signal processing method, including: acquiring an audio signal; when the tonality value of the audio signal is greater than or equal to a first threshold and the sound source type of the audio signal is the first type of sound source, processing the audio signal using the first type of suppression strategy; otherwise, processing the audio signal using the second type of suppression strategy.
  • In this way, whether the audio signal is a signal prone to noise (the first type of sound source) is determined based on both the tonality value of the audio signal and whether its sound source type is the first type of sound source.
  • the first type of suppression strategy is to suppress a single peak or multiple peaks in the frequency domain of the audio signal.
  • the second type of suppression strategy is to suppress a single peak or multiple peaks in the audio signal; or not to suppress the audio signal.
  • the method includes: performing tonality calculation on the audio signal to obtain a tonality value of the audio signal.
  • In this way, it is beneficial for the electronic device to adopt different suppression strategies for the audio signal based on the tonality value and the sound source type of the audio signal.
  • In one possible implementation, before the audio signal is processed using the first type of suppression strategy, the method further includes: performing peak detection on the audio signal, where the peak detection is used to obtain peak information of the audio signal in the frequency domain.
  • In this way, the electronic device can acquire the peak value of the audio signal, calculate the difference gain according to the peak information, and suppress the audio signal according to the difference gain.
  • When the sound source type of the audio signal is the first type of sound source and the tonality value of the audio signal is greater than or equal to the first threshold, peak suppression is performed on the audio signal to change the speaker input signal and reduce playback noise.
  • In a second aspect, an embodiment of the present application provides an electronic device, which includes: one or more processors and a memory; the memory is coupled to the one or more processors, and the memory is used to store computer program code.
  • The computer program code includes computer instructions, and the one or more processors invoke the computer instructions to cause the electronic device to perform: acquiring an audio signal; when the tonality value of the audio signal is greater than or equal to a first threshold and the sound source type of the audio signal is the first type of sound source, processing the audio signal using the first type of suppression strategy; otherwise, processing the audio signal using the second type of suppression strategy.
  • The one or more processors are further configured to invoke the computer instructions so that the electronic device executes: performing tonality calculation on the audio signal to obtain the tonality value of the audio signal.
  • The one or more processors are further configured to invoke the computer instructions so that the electronic device executes: performing peak detection on the audio signal, where the peak detection is used to acquire the peak information of the audio signal in the frequency domain.
  • the one or more processors are further configured to call the computer instruction so that the electronic device executes: calculating the difference between the peak value of the audio signal and the second threshold;
  • the peak value includes at least the maximum peak value of the audio signal in the frequency domain;
  • the difference gain of the peak value is calculated based on the difference value;
  • In a third aspect, an embodiment of the present application provides an electronic device, including: a touch screen, a camera, one or more processors, and one or more memories; the one or more processors are coupled to the touch screen, the camera, and the one or more memories; the one or more memories are used to store computer program code, the computer program code includes computer instructions, and when the one or more processors execute the computer instructions, the electronic device is caused to execute the method described in the first aspect or any possible implementation of the first aspect.
  • In a fourth aspect, an embodiment of the present application provides a chip system applied to an electronic device; the chip system includes one or more processors, and the processors are used to invoke computer instructions to cause the electronic device to execute the method described in the first aspect or any possible implementation of the first aspect.
  • In a fifth aspect, an embodiment of the present application provides a computer program product containing instructions.
  • When the computer program product is run on an electronic device, the electronic device is caused to execute the method described in the first aspect or any possible implementation of the first aspect.
  • In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium including instructions; when the instructions are run on the electronic device, the electronic device executes the method described in the first aspect or any possible implementation of the first aspect.
  • FIGS. 1A-1C are schematic diagrams of an application scenario provided by the embodiment of the present application.
  • FIG. 2 is a system architecture diagram of an electronic device processing audio signals provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of a hardware structure of an electronic device 100 provided by an embodiment of the present application.
  • FIG. 4 is a software structural block diagram of the electronic device 100 provided by the embodiment of the present application.
  • Fig. 5A-Fig. 5D are diagrams of calculation results of tonality value provided by the embodiment of the present application.
  • Fig. 6 is a flow chart of processing an audio signal provided by an embodiment of the present application.
  • FIGS. 7A-7C are diagrams of the audio application startup interface provided by the embodiment of the present application.
  • FIG. 8 is a waveform diagram of a frequency domain signal provided by an embodiment of the present application.
  • A unit may be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, a program, and/or a component distributed between two or more computers.
  • these units can execute from various computer readable media having various data structures stored thereon.
  • A unit may, for example, communicate through local and/or remote processes based on a signal having one or more data packets (for example, data from one unit interacting with another unit in a local system or a distributed system, and/or interacting with other systems across a network such as the Internet by way of the signal).
  • As shown in FIG. 1A, when the electronic device 100 detects an input operation (for example, a click) on the music application icon 1011, it enters the main interface 102 of the music application shown in FIG. 1B.
  • The electronic device 100 then displays a music playing interface 103 as shown in FIG. 1C, and the music application plays music. While the music application is playing music, the electronic device 100 processes the audio signal of the music in real time to ensure that the played music contains no noise, thereby giving the user a good music experience.
  • FIG. 2 is a system architecture diagram of an electronic device processing audio signals provided by an embodiment of the present application.
  • the system architecture includes an audio application, a mixing thread module, and an audio driver.
  • the audio application may be music player software or video software.
  • When an audio application plays audio through a speaker, the audio signal is processed in real time.
  • the audio application sends the audio signal to the audio mixing thread module, and the audio mixing thread module detects whether the audio source type of the audio signal is the first type of audio source. If so, the audio signal is processed (eg, the energy of the audio signal is suppressed). Then, the mixing thread module sends the processed audio signal to the audio driver, and the audio driver sends the processed audio signal to the speaker, and the speaker outputs audio.
  • the electronic device can complete the processing of the audio signal in real time, so that the audio emitted through the speaker has no noise.
  • FIG. 3 is a schematic diagram of a hardware structure of an electronic device 100 provided by an embodiment of the present application.
  • The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone jack 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identity module (SIM) card interface 195, and the like.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, bone conduction sensor 180M, etc.
  • the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the electronic device 100 .
  • the electronic device 100 may include more or fewer components than shown in the figure, or combine certain components, or separate certain components, or arrange different components.
  • the illustrated components can be realized in hardware, software or a combination of software and hardware.
  • The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices, or may be integrated in one or more processors.
  • the wireless communication function of the electronic device 100 can be realized by the antenna 1 , the antenna 2 , the mobile communication module 150 , the wireless communication module 160 , a modem processor, a baseband processor, and the like.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in electronic device 100 may be used to cover single or multiple communication frequency bands. Different antennas can also be multiplexed to improve the utilization of the antennas.
  • Antenna 1 can be multiplexed as a diversity antenna of a wireless local area network.
  • the antenna may be used in conjunction with a tuning switch.
  • the mobile communication module 150 can provide wireless communication solutions including 2G/3G/4G/5G applied on the electronic device 100 .
  • the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA) and the like.
  • the mobile communication module 150 can receive electromagnetic waves through the antenna 1, filter and amplify the received electromagnetic waves, and send them to the modem processor for demodulation.
  • the mobile communication module 150 can also amplify the signals modulated by the modem processor, and convert them into electromagnetic waves and radiate them through the antenna 1 .
  • at least part of the functional modules of the mobile communication module 150 may be set in the processor 110 .
  • at least part of the functional modules of the mobile communication module 150 and at least part of the modules of the processor 110 may be set in the same device.
  • The wireless communication module 160 can provide wireless communication solutions applied on the electronic device 100, including wireless local area network (WLAN) (such as a Wi-Fi network), Bluetooth (BT), BLE broadcasting, global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2 , frequency-modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110 .
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110 , frequency-modulate it, amplify it, and convert it into electromagnetic waves through the antenna 2 for radiation.
  • the electronic device 100 realizes the display function through the GPU, the display screen 194 , and the application processor.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor. GPUs are used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
  • the display screen 194 is used to display images, videos and the like.
  • the display screen 194 includes a display panel.
  • The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, quantum dot light-emitting diodes (QLED), or the like.
  • the electronic device 100 may include 1 or N display screens 194 , where N is a positive integer greater than 1.
  • the electronic device 100 can realize the shooting function through the ISP, the camera 193 , the video codec, the GPU, the display screen 194 and the application processor.
  • the ISP is used for processing the data fed back by the camera 193 .
  • the light is transmitted to the photosensitive element of the camera through the lens, and the light signal is converted into an electrical signal, and the photosensitive element of the camera transmits the electrical signal to the ISP for processing, and converts it into an image visible to the naked eye.
  • ISP can also perform algorithm optimization on image noise, brightness, and skin color.
  • ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
  • the ISP may be located in the camera 193 .
  • Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
  • the NPU is a neural-network (NN) computing processor.
  • Applications such as intelligent cognition of the electronic device 100 can be realized through the NPU, such as image recognition, face recognition, speech recognition, text understanding, and the like.
  • the electronic device 100 can implement audio functions through the audio module 170 , the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor. Such as music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into analog audio signal output, and is also used to convert analog audio input into digital audio signal.
  • the audio module 170 may also be used to encode and decode audio signals.
  • the audio module 170 may be set in the processor 110 , or some functional modules of the audio module 170 may be set in the processor 110 .
  • The speaker 170A, also referred to as a "horn", is used to convert audio electrical signals into sound signals.
  • Electronic device 100 can listen to music through speaker 170A, or listen to hands-free calls.
  • The receiver 170B, also called an "earpiece", is used to convert audio electrical signals into sound signals.
  • the receiver 170B can be placed close to the human ear to receive the voice.
  • The microphone 170C, also called a "mic", is used to convert sound signals into electrical signals.
  • the user can put his mouth close to the microphone 170C to make a sound, and input the sound signal to the microphone 170C.
  • the electronic device 100 may be provided with at least one microphone 170C. In some other embodiments, the electronic device 100 may be provided with two microphones 170C, which may also implement a noise reduction function in addition to collecting sound signals. In some other embodiments, the electronic device 100 can also be provided with three, four or more microphones 170C to realize sound signal collection, noise reduction, sound source identification, and directional recording functions.
  • the pressure sensor 180A is used to sense the pressure signal and convert the pressure signal into an electrical signal.
  • pressure sensor 180A may be disposed on display screen 194 .
  • the air pressure sensor 180C is used to measure air pressure.
  • the electronic device 100 calculates the altitude based on the air pressure value measured by the air pressure sensor 180C to assist positioning and navigation.
  • the magnetic sensor 180D includes a Hall sensor.
  • the electronic device 100 may use the magnetic sensor 180D to detect the opening and closing of the flip leather case.
  • the acceleration sensor 180E can detect the acceleration of the electronic device 100 in various directions (generally three axes).
  • the magnitude and direction of gravity can be detected when the electronic device 100 is stationary. It can also be used to identify the posture of electronic devices, and can be used in applications such as horizontal and vertical screen switching, pedometers, etc.
  • the fingerprint sensor 180H is used to collect fingerprints.
  • the electronic device 100 can use the collected fingerprint characteristics to implement fingerprint unlocking, access to application locks, take pictures with fingerprints, answer incoming calls with fingerprints, and the like.
  • The touch sensor 180K is also known as a "touch panel".
  • The touch sensor 180K can be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touchscreen.
  • the touch sensor 180K is used to detect a touch operation on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • Visual output related to the touch operation can be provided through the display screen 194 .
  • the touch sensor 180K may also be disposed on the surface of the electronic device 100 , which is different from the position of the display screen 194 .
  • the bone conduction sensor 180M can acquire vibration signals. In some embodiments, the bone conduction sensor 180M can acquire the vibration signal of the vibrating bone mass of the human voice.
  • the software system of the electronic device 100 may adopt a layered architecture, an event-driven architecture, a micro-kernel architecture, a micro-service architecture, or a cloud architecture.
  • the software structure of the electronic device 100 is exemplarily described by taking an Android system with a layered architecture as an example.
  • FIG. 4 is a block diagram of the software structure of the electronic device 100 provided by the embodiment of the present application.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate through software interfaces.
  • In some embodiments, the Android system is divided into four layers, which are, from top to bottom, the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
  • the application layer can consist of a series of application packages.
  • the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, and audio applications.
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions. As shown in Figure 4, the application framework layer can include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, a mixing thread module (Mixer Thread module) and the like.
  • the mixing thread module is used to receive the audio signal sent by the audio application and process the audio signal.
  • a window manager is used to manage window programs.
  • the window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, capture the screen, etc.
  • Content providers are used to store and retrieve data and make it accessible to applications.
  • Said data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebook, etc.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on.
  • the view system can be used to build applications.
  • a display interface can consist of one or more views.
  • a display interface including a text message notification icon may include a view for displaying text and a view for displaying pictures.
  • the phone manager is used to provide communication functions of the electronic device 100 . For example, the management of call status (including connected, hung up, etc.).
  • the resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and so on.
  • the notification manager enables the application to display notification information in the status bar, which can be used to convey notification-type messages, and can automatically disappear after a short stay without user interaction.
  • the notification manager is used to notify the download completion, message reminder, etc.
  • The notification manager may also present a notification in the top status bar of the system in the form of a graph or scroll-bar text (for example, a notification of an application running in the background), or a notification that appears on the screen in the form of a dialog window.
  • For example, text information is prompted in the status bar, a prompt sound is issued, the electronic device vibrates, or the indicator light flashes.
  • the Android Runtime includes core library and virtual machine. The Android runtime is responsible for the scheduling and management of the Android system.
  • The core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
  • the application layer and the application framework layer run in virtual machines.
  • the virtual machine executes the java files of the application program layer and the application program framework layer as binary files.
  • the virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
  • a system library can include multiple function modules. For example: surface manager (surface manager), media library (Media Libraries), 3D graphics processing library (eg: OpenGL ES), 2D graphics engine (eg: SGL), etc.
  • the surface manager is used to manage the display subsystem and provides the fusion of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of various commonly used audio and video formats, as well as still image files, etc.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing, etc.
  • 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer includes at least a display driver, a camera driver, an audio driver, and a sensor driver.
  • The size of the speaker in an electronic device is relatively small, and the diaphragm vibration amplitude it allows is correspondingly small.
  • When certain audio is played at high volume, the diaphragm vibration amplitude of the speaker exceeds this maximum value, which makes the sound prone to breaking and produces a buzzing noise.
  • Therefore, the sound source is usually processed so that the noise is reduced when the speaker plays the sound.
  • the sound sources are divided into four categories, namely: the first type of sound source, the second type of sound source, the third type of sound source and the fourth type of sound source.
  • the characteristics of the first type of sound source are: the audio signal of this sound source is unevenly distributed on the frequency spectrum, and the energy is concentrated in the middle and low frequencies, the energy is relatively strong, and the duration is long, such as the sound of a piano. When the audio of this type of sound source is played back through the speaker, it is easy to generate noise.
  • the characteristics of the second type of sound source are: the audio signal of the sound source is unevenly distributed on the frequency spectrum, and the energy is mainly concentrated in the middle and low frequencies, but the energy is relatively weak, such as human voice.
  • the characteristic of the third type of sound source is that the audio signal of this sound source is unevenly distributed in the spectrum and has strong energy, but the energy concentration is transient, that is, the energy lasts for a short time, for example, the sound of drums.
  • the characteristic of the fourth type of sound source is that the audio signal of the sound source is evenly distributed on the frequency spectrum.
  • the first type of sound source has uneven energy distribution, large energy and long duration.
  • When electronic equipment uses a speaker to play back the audio of the first type of sound source, noise is more likely to be produced.
  • Therefore, the electronic device suppresses the audio signal of the first type of sound source before the speaker plays back the audio, and then sends the suppressed audio signal to the speaker; the speaker then outputs the audio, so that noise during audio playback is suppressed.
  • The method for the electronic equipment to process the audio signal is as follows: divide the input audio signal into frames, perform time-frequency transformation to obtain frequency domain signals, perform tonality calculation on each frame of the frequency domain signal to obtain its tonality value, and compare the tonality value with a set first threshold to judge whether the distribution of the audio signal in the frequency domain is uniform. If the tonality value of a frame's frequency domain signal is greater than or equal to the first threshold, the frequency domain signal of that frame is unevenly distributed in the frequency spectrum and needs to be suppressed, and the energy of the frame's frequency domain signal is suppressed according to the relevant strategy, that is, peak suppression is performed.
  • the first threshold may be obtained based on historical data, may also be obtained based on empirical values, and may also be obtained based on experimental data tests, which is not limited in this embodiment of the present application.
  • Fig. 5A is a diagram of the tonality calculation results of yangqin (hammered dulcimer) audio. In Fig. 5A, if the first threshold is set to 0.7, the tonality of the yangqin audio generally exceeds the first threshold.
  • Consequently, the electronic device judges the yangqin audio to be an audio signal that needs to be suppressed.
  • However, although the yangqin audio signal is unevenly distributed in the frequency domain, its energy concentration is strongly transient, so when the electronic device plays back the yangqin audio through the speaker, noise is not easily generated. If the yangqin audio signal is suppressed, the timbre of the played-back yangqin audio may be distorted.
  • In addition, the tonality calculation results of audio signals belonging to the same type of sound source may differ greatly.
  • For example, both the drum piece Drum Poem and ordinary drum sounds belong to the third type of sound source, yet their tonality calculation results are quite different, owing to differences in their energy distribution in the frequency domain.
  • As a result, the tonality judgment of the audio signal may be missed.
  • For example, suppose the first threshold is 0.7,
  • and there is a piece of piano audio (the first type of sound source) in the played audio whose tonality calculation result is shown in Figure 5D.
  • The tonality value of the piano audio signal is less than 0.7 between frame 496 and frame 562, so within this interval the electronic device does not suppress the piano audio signal, while in the other frames it does. This causes the volume of the played piano audio to change abruptly around frames 496 and 562, giving the user an extremely poor listening experience. Therefore, if the tonality result alone is used to decide whether to suppress the audio signal, the first threshold is difficult to select, and missed detection or false detection easily occurs, so that sound sources that do not produce noise are suppressed, or sound sources that do produce noise are not.
  • an embodiment of the present application provides a method for processing an audio signal.
  • By identifying the sound source type of the audio, it is judged whether the sound source is the first type of sound source; if it is, the peak of the audio signal of the first type of sound source is suppressed, and the suppressed audio signal is sent to the speaker for output.
  • FIG. 6 is a flow chart for processing audio signals provided by an embodiment of the present application.
  • the specific process for processing audio signals is as follows:
  • Step S601 start the audio application.
  • As shown in FIG. 7A, when the electronic device 100 detects an input operation (for example, a click) on the audio application icon 7011, the electronic device 100 displays the startup interface 702 shown in FIG. 7B; while the interface 702 is displayed, the audio application is starting.
  • When the electronic device displays the main interface 703 of the audio application as shown in FIG. 7C, the startup of the audio application is complete.
  • the audio application shown in FIG. 7A-FIG. 7C is a music application, and the audio application may also be a video application, and may also be other applications capable of playing audio. This embodiment of the present application is only for illustration and not limitation.
  • Step S602 the audio application sends an audio signal to the mixing thread module.
  • Step S603 The audio mixing thread module divides the audio signal into frames to obtain M frames of audio signals.
  • the electronic device processes the audio signal in real time, and the speaker then outputs the processed audio signal in the form of audio.
  • Specifically, the audio signal is divided into frames; for example, each frame is 10 ms long.
  • Step S604 the audio mixing thread module performs time-frequency conversion on the audio signal of the nth frame to obtain the frequency domain signal of the audio signal of the frame.
  • The audio mixing thread module obtains the frequency domain signal of the audio signal by performing a Fourier transform (FT) or a fast Fourier transform (FFT) on the audio signal. It may also obtain the frequency domain signal by performing a Mel spectrum transformation or a modified discrete cosine transform (MDCT) on the audio signal.
  • In the embodiments of the present application, time-frequency transformation of the audio signal by FFT is taken as an example. Before the FFT is performed, overlapping and windowing can be applied to each frame signal to reduce spectrum leakage during the frequency domain transformation and reduce frequency domain processing distortion.
  • After the audio mixing thread module performs time-frequency conversion on the audio signal, all frequency components of the frequency domain signal can be obtained, which facilitates analyzing and calculating the signal at different frequencies (an illustrative sketch follows).
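The patent gives no code for this step; the following is a minimal illustrative sketch of the framing, windowing, and FFT described above. The 48 kHz sample rate, 10 ms frame length, 50% overlap, and Hann window are assumptions, not values stated in the patent.

```python
import numpy as np

def frames_to_spectra(x, fs=48000, frame_ms=10, overlap=0.5):
    """Split an audio signal into overlapping frames, window each frame to
    reduce spectrum leakage, and return one complex FFT spectrum per frame."""
    frame_len = int(fs * frame_ms / 1000)          # e.g. 480 samples per 10 ms frame
    hop = int(frame_len * (1.0 - overlap))         # 50% overlap between frames
    window = np.hanning(frame_len)                 # windowing reduces leakage
    n_frames = max(0, 1 + (len(x) - frame_len) // hop)
    spectra = np.empty((n_frames, frame_len), dtype=complex)
    for i in range(n_frames):
        frame = x[i * hop : i * hop + frame_len]
        spectra[i] = np.fft.fft(frame * window)    # time-frequency transformation
    return spectra
```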
  • Step S605 The sound mixing thread module performs tonality calculation on the frequency domain signal to obtain the tonality value of the frequency domain signal.
  • the sound mixing thread module sequentially calculates the tonality of the frequency domain signal of the nth frame, and obtains the corresponding tonality value.
  • The purpose of performing tonality calculation on the frequency domain signal is to determine whether the energy distribution of the frame's audio signal in the frequency domain is uniform. If the tonality value of the frequency domain signal is greater than or equal to the first threshold, the energy distribution of the frame's audio signal in the frequency domain is judged to be uneven; otherwise, it is judged to be uniform.
  • the first threshold may be obtained based on empirical values, may also be obtained based on historical data, and may also be obtained based on experimental data, which is not limited in this embodiment of the present application.
  • the mixing thread module calculates the tonality value as follows:
  • the mixing thread module calculates the flatness Flatness of the frequency domain signal according to the formula (1), and the formula (1) is as follows:
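The formula body is not reproduced in this text. A standard form of the spectral flatness measure, consistent with the symbol definitions that follow, is (a reconstruction, not a verbatim quotation of the patent):

$$\mathrm{Flatness} = \frac{\left(\prod_{n=0}^{N-1} x(n)\right)^{1/N}}{\dfrac{1}{N}\sum_{n=0}^{N-1} x(n)} \qquad (1)$$

that is, the ratio of the geometric mean to the arithmetic mean of the frequency-bin energies.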
  • N is the length of the FFT transform of the audio signal
  • x(n) is the energy value of the nth frequency point of the frequency domain signal of the frame
  • Flatness is used to represent the energy distribution of the frequency domain signal in the frequency domain: the larger the Flatness, the more uniform the distribution; the smaller the Flatness, the more uneven the distribution.
  • the mixing thread module calculates the first parameter SFMdB according to the formula (2), and the formula (2) is as follows:
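The formula body is likewise missing here; the conventional spectral flatness measure in decibels, consistent with the SFMdBMax value given below, is:

$$\mathrm{SFM_{dB}} = 10 \log_{10}(\mathrm{Flatness}) \qquad (2)$$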
  • Then, the sound mixing thread module calculates the tonality value α of the frame's frequency domain signal according to formula (3), which is as follows:
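The formula body is missing here as well; in the standard tonality calculation (as in the MPEG psychoacoustic model, which this description closely follows) the tonality coefficient is:

$$\alpha = \min\left(\frac{\mathrm{SFM_{dB}}}{\mathrm{SFM_{dBMax}}},\ 1\right) \qquad (3)$$

so that a perfectly flat, noise-like spectrum gives α = 0 and a strongly tonal spectrum gives α close to 1.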
  • the value of SFMdBMax may be obtained from historical values, empirical values, or experimental data, which is not limited in this embodiment of the present application.
  • SFMdBMax can be set to -60dB.
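Putting formulas (1)-(3) together, the per-frame tonality calculation might be sketched as follows, assuming the standard definitions reconstructed above and SFMdBMax = −60 dB:

```python
import numpy as np

def tonality(bin_energy, sfm_db_max=-60.0):
    """Tonality value of one frame computed from its frequency-bin energies.
    Returns a value in [0, 1]: near 0 for a flat (noise-like) spectrum,
    near 1 for a strongly tonal (unevenly distributed) spectrum."""
    e = np.asarray(bin_energy, dtype=float) + 1e-12        # guard against log(0)
    flatness = np.exp(np.mean(np.log(e))) / np.mean(e)     # formula (1)
    sfm_db = 10.0 * np.log10(flatness)                     # formula (2)
    return min(sfm_db / sfm_db_max, 1.0)                   # formula (3)
```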
  • Step S606 The sound mixing thread module obtains the label of the frequency domain signal based on the neural network.
  • the sound mixing thread module takes the frame frequency domain signal as an input of the neural network, and the neural network outputs a label of the frame frequency domain signal, and the label is used to indicate the audio source type of the frame frequency domain signal.
  • The labels include a first label, a second label, a third label, and a fourth label: the first label is used to indicate that the sound source type of the frequency domain signal is the first type of sound source, the second label indicates the second type, the third label the third type, and the fourth label the fourth type.
  • the first label is 0, the second label is 1, the third label is 2, and the fourth label is 3 as an example for description.
  • the neural network is a trained neural network.
  • The neural network can be trained offline. The training process is as follows: select a large number of frequency domain signals with a frame length of 10 ms (frequency domain signals with other frame lengths can also be selected, which is not limited in the embodiments of the present application) of the first type of sound source (for example, piano sound), the second type of sound source (for example, human voice), the third type of sound source (for example, drum sound), and the fourth type of sound source, and use them as training samples.
  • When the frequency domain signal of the first type of sound source is used as the input of the neural network, the neural network outputs a label for the signal, and the output label is compared with label 0 to obtain a deviation value Fn1, which represents the degree to which the label output by the neural network differs from label 0.
  • The internal parameters of the neural network are then adjusted based on Fn1, so that the label the neural network outputs for the audio signal of the first type of sound source is label 0.
  • The neural network is likewise trained with the other training samples (the frequency domain signals of the second, third, and fourth types of sound sources), so that when the neural network receives an input frequency domain signal, it can output the corresponding label (an illustrative training sketch is given below).
  • the label of the sample signals can be determined according to the intensity of the sound sources. For example, in a frame of frequency-domain sample signal, if the sound of the piano is obviously louder than the human voice, the sound source of the frequency-domain sample signal is determined as the first type of sound source, and the label is set to 0.
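The patent does not specify the network architecture or training details. Purely as an illustration, a minimal four-class frame classifier and one training step might look like this in PyTorch; the layer sizes, optimizer, and learning rate are assumptions, and the cross-entropy loss plays the role of the deviation value Fn1:

```python
import torch
import torch.nn as nn

# Labels: 0 = first type, 1 = second type, 2 = third type, 3 = fourth type.
class SourceClassifier(nn.Module):
    def __init__(self, n_bins=480, n_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bins, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),   # one logit per sound source type
        )

    def forward(self, spectrum):         # spectrum: (batch, n_bins) magnitudes
        return self.net(spectrum)

model = SourceClassifier()
loss_fn = nn.CrossEntropyLoss()          # deviation between output and label
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(spectra, labels):
    """One optimization step: compare outputs with labels and adjust the
    internal parameters of the network based on the deviation."""
    optimizer.zero_grad()
    loss = loss_fn(model(spectra), labels)   # plays the role of Fn1
    loss.backward()
    optimizer.step()
    return loss.item()
```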
  • Step S607 The sound mixing thread module judges whether the frequency domain signal is the first type of sound source based on the tonality value of the frequency domain signal and the label of the frequency domain signal.
  • If the judgment is yes, step S608 is executed; if the judgment is no, step S610 is executed.
  • For some audio signals, for example the sound of a pipa, the neural network may judge the signal to be the first type of sound source and output label 0, when in fact the sound of a pipa is the third type of sound source.
  • Therefore, the sound mixing thread module also judges whether the energy distribution of the frame's frequency domain signal in the frequency domain is uniform. Only when the label output by the neural network is 0 and the mixing thread module judges that the energy distribution of the frame's frequency domain signal in the frequency domain is uneven does the mixing thread module determine that the frame's frequency domain signal is of the first type of sound source.
  • That is, when the tonality value of the frame is greater than or equal to the first threshold and the label is 0, the mixing thread module judges that the sound source of the frame's frequency domain signal is the first type of sound source; otherwise, it is not the first type of sound source (the decision is sketched below).
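Combining the two conditions of step S607, the decision can be sketched as follows; the threshold value 0.7 is taken from the Fig. 5 discussion above and is purely illustrative:

```python
FIRST_THRESHOLD = 0.7  # illustrative value from the Fig. 5 discussion

def is_first_type_source(label, tonality_value, threshold=FIRST_THRESHOLD):
    """First type of sound source only if the neural network outputs label 0
    AND the tonality test judges the energy distribution to be uneven."""
    return label == 0 and tonality_value >= threshold
```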
  • Step S608 The sound mixing thread module performs peak detection on the frequency domain signal.
  • the sound mixing thread module detects the peak value of the frequency domain signal of the frame, that is, obtains the amplitudes of the peak and valley of the frequency domain signal of the frame in the frequency domain.
  • Fig. 8 is a waveform diagram of the frame's frequency domain signal. In the waveform diagram there are X peaks and Y valleys, and the purpose of peak detection is to obtain the amplitudes of these X peaks and Y valleys. Sorted from large to small, the peaks are called the largest peak, the second largest peak, the third largest peak, and so on (a sketch follows below).
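A sketch of this peak detection step; the use of scipy here is an assumption about one possible implementation, not something stated in the patent:

```python
import numpy as np
from scipy.signal import find_peaks

def detect_peaks(magnitude_db):
    """Return the indices of the spectrum's local maxima, sorted so that the
    largest peak comes first, the second largest second, and so on."""
    peaks, _ = find_peaks(magnitude_db)                    # local maxima
    return peaks[np.argsort(magnitude_db[peaks])[::-1]]    # sort, largest first
```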
  • Step S609 The sound mixing thread module processes the frequency domain signal using the first type of suppression strategy to obtain a processed frequency domain signal.
  • Specifically, the sound mixing thread module performs single-peak suppression or multi-peak suppression on the peaks of the frame's frequency domain signal. If single-peak suppression is performed on the frame's frequency domain signal, the largest peak of the frame's frequency domain signal is suppressed. If multi-peak suppression is performed, at least the largest peak and the second largest peak of the frame's frequency domain signal are suppressed.
  • The specific method for the sound mixing thread module to suppress a peak is: find the peak in the frequency domain, calculate the difference between the energy of the peak and the second threshold, and calculate the difference gain based on the difference. The original frequency bin is then multiplied by the difference gain to reduce the energy of the corresponding frequency bin. For example, if the current maximum peak value is −10 dB and the second threshold is set to −15 dB, the difference for the maximum peak is −5 dB; converted to a linear value, the gain is about 0.562, and the original frequency bin is multiplied by 0.562 to reduce the bin energy (see the sketch below).
  • the second threshold is a preset maximum peak value, which can be obtained based on empirical values, historical data, or experimental data, and is not limited in this embodiment of the present application.
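A minimal sketch of this difference-gain computation, reproducing the −10 dB / −15 dB worked example above; the amplitude convention gain = 10^(diff/20) is an assumption consistent with the stated result of about 0.562:

```python
def difference_gain(peak_db, second_threshold_db):
    """Linear gain that pulls a peak down toward the second threshold."""
    diff_db = second_threshold_db - peak_db   # e.g. -15 - (-10) = -5 dB
    return 10.0 ** (diff_db / 20.0)           # -5 dB is a linear gain of ~0.562

gain = difference_gain(-10.0, -15.0)          # about 0.562
# suppressed_bin = original_bin * gain        # reduces the bin's energy
```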
  • Step S610 The sound mixing thread module processes the frequency domain signal using a second type of suppression strategy to obtain a processed frequency domain signal.
  • Specifically, the sound mixing thread module adopts the second type of suppression strategy for the frame's frequency domain signal. The second type of suppression strategy is: the sound mixing thread module may suppress the peak of the frame's frequency domain signal, or may not suppress the frame's frequency domain signal at all.
  • If the peak is suppressed, the difference between the difference gain of the frame's frequency domain signal and that of the previous frame's frequency domain signal should be within a reasonable range.
  • For example, suppose the sound source type of the (n−1)-th frame's frequency domain signal is the first type of sound source, which needs to be suppressed with a difference gain of 0.5, while the n-th frame's frequency domain signal is human voice (the second type of sound source); then a reasonable range for the difference gain of the n-th frame's audio signal is 0.7-0.8.
  • If the difference gain of the n-th frame's frequency domain signal is higher than 0.8, the energy difference between the suppressed (n−1)-th frame's audio signal and the n-th frame's audio signal may be too large, causing an abrupt volume change (for example, the sound suddenly becomes louder) when the speaker plays back the two frames. If the difference gain of the n-th frame's frequency domain signal is lower than 0.7, the energy of the frame's signal may be suppressed excessively, and the volume of the human voice will be very low when the speaker plays back the frame's audio (a smoothing sketch follows below).
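One simple way to keep the inter-frame gain change within a reasonable range is to clamp the per-frame step, as in this illustrative sketch; the maximum step of 0.3 is an assumption chosen to reproduce the 0.7-0.8 example above:

```python
def smooth_gain(gain, prev_gain, max_step=0.3):
    """Limit how much the difference gain may change from one frame to the
    next, so the played-back volume does not change abruptly."""
    return min(max(gain, prev_gain - max_step), prev_gain + max_step)

# Previous frame suppressed with gain 0.5; an unsuppressed human-voice frame
# requesting gain 1.0 is limited to 0.8, matching the 0.7-0.8 range above.
print(smooth_gain(1.0, 0.5))  # -> 0.8
```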
  • Step S611 The sound mixing thread module performs frequency-time transformation on the processed frequency-domain signal to obtain a single-frame audio signal.
  • Step S612 the sound mixing thread module sends the single-frame audio signal to the audio driver.
  • Step S613 The sound mixing thread module returns to step S604 to process the next frame of the audio signal.
  • Step S614 The audio driver sends the single-frame audio signal to the speaker.
  • Step S615 The speaker plays the audio corresponding to the single-frame audio signal.
  • In summary, the audio processing method provided by the embodiments of the present application combines a neural network with a traditional detection algorithm: the neural network identifies the sound source type of the audio signal, which solves the misjudgment and missed judgment caused by the traditional algorithm and the difficulty of tuning the tonality threshold; and by applying different suppression gains and suppression timing to different audio signals, the speaker input signal is changed to reduce playback noise while maintaining the maximum loudness of the original signal and reducing suppression distortion of the different audio signals.
  • all or part of them may be implemented by software, hardware, firmware or any combination thereof.
  • When implemented using software, it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the processes or functions according to the present application will be generated in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, DSL) or wireless (e.g., infrared, radio, microwave) means.
  • The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk), etc.
  • All or part of the processes may be implemented by a computer program instructing related hardware.
  • the programs can be stored in computer-readable storage media.
  • When the programs are executed, the processes of the foregoing method embodiments may be included.
  • The aforementioned storage medium includes various media that can store program code, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Telephone Function (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

Disclosed are an audio signal processing method, an electronic device, a computer-readable storage medium, and a computer program product. The method comprises the steps of: obtaining an audio signal; when the tonality value of the audio signal is greater than or equal to a first threshold and the sound source type of the audio signal is a first-type sound source, processing the audio signal by means of a first-type suppression policy; and otherwise, processing the audio signal by means of a second-type suppression policy. The method solves the problem whereby, in the process of suppressing an audio signal that easily generates noise, other types of audio signals are erroneously suppressed, causing suppression distortion.
PCT/CN2022/092367 2021-07-19 2022-05-12 Audio signal processing method and related electronic device WO2023000778A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110815051.XA CN115641870A (zh) Audio signal processing method and related electronic device
CN202110815051.X 2021-07-19

Publications (2)

Publication Number Publication Date
WO2023000778A1 (fr) 2023-01-26
WO2023000778A9 (fr) 2023-06-15

Family

ID=84939464

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/092367 WO2023000778A1 (fr) 2021-07-19 2022-05-12 Audio signal processing method and related electronic device

Country Status (2)

Country Link
CN (1) CN115641870A (fr)
WO (1) WO2023000778A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116233696B * 2023-05-05 2023-09-15 荣耀终端有限公司 Airflow noise suppression method, audio module, sound-producing device, and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9565508B1 * 2012-09-07 2017-02-07 MUSIC Group IP Ltd. Loudness level and range processing
US9672843B2 * 2014-05-29 2017-06-06 Apple Inc. Apparatus and method for improving an audio signal in the spectral domain
KR20170030384A * 2015-09-09 2017-03-17 삼성전자주식회사 Apparatus and method for adjusting sound, and apparatus and method for training a genre recognition model
CN108322868B * 2018-01-19 2020-07-07 瑞声科技(南京)有限公司 Method for improving the sound quality of piano sound played through a speaker
EP3847542A4 * 2018-09-07 2022-06-01 Gracenote, Inc. Methods and apparatus for dynamic volume adjustment via audio classification
CN109616135B * 2018-11-14 2021-08-03 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method, apparatus, and storage medium
CN111343540B * 2020-03-05 2021-07-20 维沃移动通信有限公司 Piano audio processing method and electronic device
CN112767967A * 2020-12-30 2021-05-07 深延科技(北京)有限公司 Voice classification method and apparatus, and automatic voice classification method

Also Published As

Publication number Publication date
WO2023000778A1 (fr) 2023-01-26
CN115641870A (zh) 2023-01-24

Similar Documents

Publication Publication Date Title
JP7222112B2 (ja) Song recording method, audio correction method, and electronic device
CN108335703B (zh) Method and apparatus for determining accent positions in audio data
US11284151B2 Loudness adjustment method and apparatus, and electronic device and storage medium
CN109003621B (zh) Audio processing method, apparatus, and storage medium
CN109243479B (zh) Audio signal processing method, apparatus, electronic device, and storage medium
CN109065068B (zh) Audio processing method, apparatus, and storage medium
CN115881118B (zh) Voice interaction method and related electronic device
CN111986691A (zh) Audio processing method and apparatus, computer device, and storage medium
WO2023000778A9 (fr) Audio signal processing method and related electronic device
US20240031766A1 Sound processing method and apparatus thereof
CN109961802B (zh) Sound quality comparison method, apparatus, electronic device, and storage medium
CN116055982B (zh) Audio output method, device, and storage medium
CN115359156B (zh) Audio playback method, apparatus, device, and storage medium
WO2023061330A1 (fr) Audio synthesis method and apparatus, device, and computer-readable storage medium
WO2022007757A1 (fr) Cross-device voiceprint registration method, electronic device, and storage medium
CN114974213A (zh) Audio processing method, electronic device, and storage medium
CN113840034B (zh) Sound signal processing method and terminal device
RU2777617C1 (ru) Song recording method, sound correction method, and electronic device
CN116546126B (zh) Noise suppression method and electronic device
WO2024051638A1 (fr) Sound field calibration method, electronic device, and system
WO2024046416A1 (fr) Volume adjustment method, electronic device, and system
WO2023142784A1 (fr) Volume control method, electronic device, and readable storage medium
WO2022206643A1 (fr) Method for estimating angle of arrival of signal, and electronic device
CN114464212A (zh) Noise detection method for audio signal and related electronic device
CN115802244A (zh) Virtual bass generation method, medium, and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22844945

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE