WO2023016053A1 - Sound signal processing method and electronic device - Google Patents

Sound signal processing method and electronic device

Info

Publication number
WO2023016053A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
signal
audio
electronic device
sound
Prior art date
Application number
PCT/CN2022/095354
Other languages
English (en)
French (fr)
Inventor
玄建永
刘镇亿
高海宽
Original Assignee
北京荣耀终端有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京荣耀终端有限公司
Priority to EP22855039.8A (published as EP4280211A1)
Publication of WO2023016053A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 - Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 - Control of cameras or camera modules
    • H04N23/63 - Control of cameras or camera modules by using electronic viewfinders
    • H04N23/631 - Graphical user interfaces [GUI] specially adapted for controlling image capture or setting capture parameters
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Definitions

  • The present application relates to the field of electronic technology, and in particular, to a sound signal processing method and an electronic device.
  • Existing electronic devices collect the sound signals around them when recording video, but some of those sound signals are interference signals that the user does not want.
  • For example, when the electronic device records the user's selfie short video or a live broadcast, it collects both the user's own voice and the sound of the surrounding environment, so the recorded selfie sound is not clear enough: there is more interference, and the sound quality recorded by the electronic device is lower.
  • Embodiments of the present application provide a sound signal processing method and an electronic device, which can reduce interfering sound signals in the sound of a recorded video, and improve the quality of the sound signal of a recorded video.
  • In a first aspect, an embodiment of the present application provides a sound signal processing method.
  • The method is applied to an electronic device that includes a camera and a microphone.
  • The first target object is within the shooting range of the camera, and the second target object is not within the shooting range of the camera.
  • That the first target object is within the shooting range of the camera may mean that the first target object is within the field of view of the camera.
  • the method includes: the electronic device activates the camera.
  • a preview interface is displayed, and the preview interface includes a first control.
  • a first operation on a first control is detected. In response to the first operation, shooting is started.
  • a shooting interface is displayed, the shooting interface includes a first image, the first image is an image collected by the camera in real time, the first image includes the first target object, and the first image does not include the second target object.
  • the first moment may be any moment during the shooting process.
  • the microphone collects a first audio, the first audio includes a first audio signal and a second audio signal, the first audio signal corresponds to the first target object, and the second audio signal corresponds to the second target object.
  • A second operation on the first control of the shooting interface is detected.
  • In response to the second operation, shooting is stopped and the first video is saved, wherein the first video includes the first image and the second audio at the first moment, the second audio includes the first audio signal and the third audio signal, and the third audio signal is obtained by the electronic device processing the second audio signal, the energy of the third audio signal being smaller than the energy of the second audio signal.
  • the electronic device collects sound signals around the electronic device through a microphone.
  • When recording video, the electronic device collects sound signals within the field of view of the camera, sound signals outside that field of view, and environmental noise.
  • The sound signals outside the field of view of the camera and the ambient noise become interference signals in the recorded video.
  • For example, when the electronic device records the sound signal (that is, the second audio signal) of the second target object (such as the non-target object 1 or the non-target object 2), the energy of the second audio signal can be reduced to obtain the third audio signal.
  • In this way, the electronic device can process the sound signal of the recorded video (such as the sound signal collected by the microphone) and reduce the energy of the interference signal (such as the energy of the second audio signal), so that when the recorded video file is played, the energy of the output third audio signal is lower than the energy of the second audio signal. This reduces the interfering sound signals in the sound of the recorded video and improves the quality of its sound signal.
  • the third audio signal is obtained by the electronic device processing the second audio signal, including: configuring a gain of the second audio signal to be less than 1.
  • the third audio signal is obtained according to the second audio signal and the gain of the second audio signal.
  • the third audio signal is obtained by the electronic device processing the second audio signal, including: the electronic device calculates a probability that the second audio signal is within the target orientation.
  • the target orientation is an orientation within the field of view range of the camera when the video is recorded.
  • the first target object is within the target location and the second target object is not within the target location.
  • The electronic device determines the gain of the second audio signal according to the probability that the second audio signal is within the target azimuth. If the probability that the second audio signal is within the target orientation is greater than a preset probability threshold, the gain of the second audio signal is equal to 1; if the probability is less than or equal to the preset probability threshold, the gain of the second audio signal is less than 1.
  • the electronic device obtains the third audio signal according to the energy of the second audio signal and the gain of the second audio signal.
  • the electronic device may determine the gain of the second audio signal according to the probability of the second audio signal being within the target azimuth, so as to reduce the energy of the second audio signal to obtain the third audio signal.
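  • As an illustration of this design, the following is a minimal sketch (the function name and the placeholder signal are assumptions, not the patent's implementation; the threshold 0.8 and gain floor 0.2 match the example parameter values given later in this description):

```python
import numpy as np

def gain_from_probability(p_target: float, p_th: float = 0.8, g_min: float = 0.2) -> float:
    """Gain 1 if the second audio signal is likely within the target azimuth, else a gain < 1."""
    return 1.0 if p_target > p_th else g_min

second_audio = np.random.randn(1024).astype(np.float32)      # placeholder second audio signal
third_audio = gain_from_probability(p_target=0.3) * second_audio
assert np.sum(third_audio ** 2) < np.sum(second_audio ** 2)  # energy(third) < energy(second)
```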
  • the first audio further includes a fourth audio signal, where the fourth audio signal is a diffuse field noise audio signal.
  • the second audio further includes a fifth audio signal, and the fifth audio signal is a diffuse field noise audio signal.
  • the fifth audio signal is obtained by processing the fourth audio signal by the electronic device, and the energy of the fifth audio signal is smaller than the energy of the fourth audio signal.
  • the fifth audio signal is obtained by the electronic device processing the fourth audio signal, including: configuring a gain of the fourth audio signal to be less than 1.
  • a fifth audio signal is obtained according to the energy of the fourth audio signal and the gain of the fourth audio signal.
  • Alternatively, the fifth audio signal is obtained by the electronic device processing the fourth audio signal as follows: the fourth audio signal is suppressed to obtain a sixth audio signal, and the sixth audio signal is compensated to obtain the fifth audio signal.
  • The sixth audio signal is a diffuse field noise audio signal; its energy is smaller than the energy of the fourth audio signal and also smaller than the energy of the fifth audio signal.
  • The energy of the sixth audio signal obtained by suppressing the fourth audio signal may be very small, making the residual diffuse field noise unstable. By performing noise compensation on the sixth audio signal, the energy of the fifth audio signal finally obtained from the fourth audio signal is more stable, which sounds better to the user.
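  • A minimal sketch of this suppress-then-compensate step (all names and gain values below are assumptions chosen for illustration; the compensation here simply floors the effective gain, consistent with the gain floor behavior described later for diffuse field noise):

```python
import numpy as np

def suppress_and_compensate(noise_bins: np.ndarray,
                            raw_gain: float = 0.05,   # strong suppression (assumed value)
                            g_floor: float = 0.2):    # stable noise floor (assumed value)
    """Suppress diffuse field noise to get the sixth audio signal, then compensate
    (floor the effective gain) so the fifth audio signal's energy stays stable."""
    sixth = raw_gain * noise_bins                  # energy may be too small and unstable
    fifth = max(raw_gain, g_floor) * noise_bins    # compensated to a stable level
    return sixth, fifth
```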
  • the above method further includes: at the first moment, after the microphone collects the first audio, the electronic device processes the first audio to obtain the second audio. That is to say, the electronic device can process the audio signal in real time when the audio signal is collected.
  • the method further includes: after stopping shooting in response to the second operation, the electronic device processes the first audio to obtain the second audio. That is to say, the electronic device can obtain a sound signal from the video file when the recording of the video file ends. Then, the audio signal is processed frame by frame according to time sequence.
  • In another aspect, an embodiment of the present application provides a sound signal processing method.
  • The method is applied to an electronic device.
  • the method includes: the electronic device acquires a first sound signal.
  • the first sound signal is a sound signal of recorded video.
  • the electronic device processes the first sound signal to obtain the second sound signal.
  • When the electronic device plays the recorded video file, it outputs the second sound signal.
  • the energy of the sound signal in the non-target direction in the second sound signal is lower than the energy of the sound signal in the non-target direction in the first sound signal.
  • Non-target orientations are orientations outside the camera's field of view when recording video.
  • Generally, the electronic device collects the sound signals around it through a microphone. For example, when recording video, it collects sound signals within the field of view of the camera, sound signals outside that field of view, and environmental noise.
  • In this solution, the electronic device can process the sound signal of the recorded video (such as the sound signal collected by the microphone) and suppress the sound signals in non-target directions, so that when the recorded video file is played, the energy of the non-target-orientation sound signal in the output second sound signal is lower than the energy of the non-target-orientation sound signal in the first sound signal. This reduces the interfering sound signals in the sound of the recorded video and improves the quality of its sound signal.
  • the acquiring the first sound signal by the electronic device includes: the electronic device collects the first sound signal in real time through a microphone in response to the first operation.
  • the first operation is used to trigger the electronic device to start video recording or live broadcast.
  • the electronic device may collect the first sound signal in real time through the microphone when the video recording function of the camera is activated and video recording starts.
  • For example, in a live broadcast application (such as Douyin or Kuaishou), the electronic device can collect sound signals in real time through a microphone.
  • Each time the electronic device collects a frame of the sound signal, it processes that frame.
  • the foregoing method further includes: the electronic device records a video file.
  • Acquiring the first sound signal by the electronic device includes: the electronic device acquires the first sound signal from the video file in response to the end of recording the video file.
  • an electronic device may obtain a sound signal from a video file when the recording of the video file ends. Then, the audio signal is processed frame by frame according to time sequence.
  • the acquiring the first sound signal by the electronic device includes: the electronic device acquires the first sound signal from a video file saved by the electronic device in response to the second operation.
  • the second operation is used to trigger the electronic device to process the video file to improve the sound quality of the video file.
  • the electronic device processes the sound in the video file stored locally on the electronic device.
  • When the electronic device detects that the user has instructed it to process the video file (for example, by clicking the "Denoise Processing" option button in the video file operation interface), the electronic device starts acquiring the sound signal of the video file.
  • the audio signal is processed frame by frame according to time sequence.
  • the first sound signal includes multiple time-frequency voice signals.
  • Processing the first sound signal by the electronic device to obtain the second sound signal includes: the electronic device identifies the orientation of each time-frequency voice signal in the first sound signal; if the orientation of the first time-frequency voice signal in the first sound signal is a non-target orientation, the electronic device reduces the energy of the first time-frequency voice signal to obtain the second sound signal.
  • the first time-frequency voice signal is any one of multiple time-frequency voice signals in the first sound signal.
  • the first sound signal includes multiple time-frequency voice signals.
  • Processing the first sound signal by the electronic device to obtain the second sound signal includes: the electronic device calculating the probability that each time-frequency voice signal in the first sound signal is within the target orientation.
  • the target orientation is an orientation within the field of view range of the camera when the video is recorded.
  • The electronic device determines the gain of the second time-frequency voice signal according to the probability that the second time-frequency voice signal in the first sound signal is within the target orientation, where the second time-frequency voice signal is any one of the multiple time-frequency voice signals in the first sound signal. If the probability that the second time-frequency voice signal is within the target azimuth is greater than the preset probability threshold, the gain of the second time-frequency voice signal is equal to 1; if that probability is less than or equal to the preset probability threshold, the gain of the second time-frequency voice signal is less than 1.
  • the electronic device obtains the second sound signal according to each time-frequency voice signal in the first sound signal and the corresponding gain.
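  • A hedged per-bin sketch of this rule (the array shapes and names are assumptions; the orientation-probability model itself is outside the snippet):

```python
import numpy as np

def apply_orientation_gains(X: np.ndarray,   # complex time-frequency bins, shape (frames, freqs)
                            P: np.ndarray,   # P(t, f): probability of lying in the target orientation
                            p_th: float = 0.8,
                            g_min: float = 0.2) -> np.ndarray:
    """Gain 1 for bins likely inside the target orientation, g_min for the rest;
    the second sound signal is each time-frequency voice signal times its gain."""
    gains = np.where(P > p_th, 1.0, g_min)
    return X * gains
```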
  • the energy of the diffuse field noise in the second sound signal is lower than the energy of the diffuse field noise in the first sound signal. It should be understood that by reducing the energy of the non-target azimuth sound signal in the first sound signal, not all diffuse field noise can be reduced. In order to ensure the quality of the sound signal of the recorded video, it is also necessary to reduce the diffuse field noise, so as to improve the signal-to-noise ratio of the sound signal of the recorded video.
  • the first sound signal includes multiple time-frequency voice signals.
  • The electronic device processes the first sound signal to obtain the second sound signal, including: the electronic device identifies whether each time-frequency voice signal in the first sound signal is diffuse field noise; if the third time-frequency voice signal in the first sound signal is diffuse field noise, the electronic device reduces the energy of the third time-frequency voice signal to obtain the second sound signal.
  • the third time-frequency voice signal is any one of the multiple time-frequency voice signals in the first sound signal.
  • the first sound signal includes multiple time-frequency voice signals.
  • the electronic device processes the first sound signal to obtain the second sound signal, and further includes: the electronic device identifies whether each time-frequency voice signal in the first sound signal is diffuse field noise.
  • The electronic device determines the gain of the fourth time-frequency speech signal according to whether the fourth time-frequency speech signal in the first sound signal is diffuse field noise, where the fourth time-frequency speech signal is any one of the multiple time-frequency speech signals in the first sound signal. If the fourth time-frequency speech signal is diffuse field noise, its gain is less than 1; if it is a coherent signal, its gain is equal to 1.
  • the electronic device obtains the second sound signal according to each time-frequency voice signal in the first sound signal and the corresponding gain.
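  • A sketch of this diffuse-field gain rule; note that this passage does not specify the detector, so the inter-microphone coherence test below is an assumed, commonly used stand-in, and all names and values are illustrative:

```python
import numpy as np

def diffuse_field_gains(is_diffuse: np.ndarray, g_noise: float = 0.2) -> np.ndarray:
    """Gain < 1 for diffuse field noise bins, gain 1 for coherent bins."""
    return np.where(is_diffuse, g_noise, 1.0)

def diffuse_mask_from_coherence(X1: np.ndarray, X2: np.ndarray,
                                coh_th: float = 0.6, alpha: float = 0.8) -> np.ndarray:
    """Assumed detector (not specified in this passage): magnitude-squared coherence
    between two microphones, smoothed over frames; low coherence suggests diffuse noise."""
    frames, freqs = X1.shape
    S12 = np.zeros(freqs, dtype=complex)
    S11 = np.zeros(freqs)
    S22 = np.zeros(freqs)
    mask = np.zeros((frames, freqs), dtype=bool)
    for t in range(frames):
        S12 = alpha * S12 + (1 - alpha) * X1[t] * np.conj(X2[t])
        S11 = alpha * S11 + (1 - alpha) * np.abs(X1[t]) ** 2
        S22 = alpha * S22 + (1 - alpha) * np.abs(X2[t]) ** 2
        coherence = np.abs(S12) ** 2 / (S11 * S22 + 1e-12)
        mask[t] = coherence < coh_th
    return mask
```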
  • In another implementation, the first sound signal includes multiple time-frequency voice signals, and the electronic device processes the first sound signal to obtain the second sound signal as follows. The electronic device calculates the probability that each time-frequency voice signal in the first sound signal is within the target orientation, where the target orientation is the orientation within the field of view of the camera when the video is recorded, and the electronic device identifies whether each time-frequency voice signal in the first sound signal is diffuse field noise.
  • The electronic device determines the gain of the fifth time-frequency voice signal according to the probability that the fifth time-frequency voice signal in the first sound signal is within the target orientation and according to whether the fifth time-frequency voice signal is diffuse field noise, where the fifth time-frequency voice signal is any one of the multiple time-frequency voice signals in the first sound signal. If the probability that the fifth time-frequency voice signal is within the target orientation is greater than the preset probability threshold and the signal is coherent, its gain is equal to 1; if that probability is greater than the preset probability threshold but the signal is diffuse field noise, its gain is less than 1; and if that probability is less than or equal to the preset probability threshold, its gain is less than 1.
  • The electronic device obtains the second sound signal according to each time-frequency voice signal in the first sound signal and the corresponding gain.
  • Determining, by the electronic device, the gain of the fifth time-frequency voice signal according to the probability that the fifth time-frequency voice signal in the first sound signal is within the target orientation and according to whether the fifth time-frequency voice signal is diffuse field noise includes the following. The electronic device determines the first gain of the fifth time-frequency voice signal according to the probability that the fifth time-frequency voice signal is within the target orientation: if that probability is greater than the preset probability threshold, the first gain is equal to 1; if that probability is less than or equal to the preset probability threshold, the first gain is less than 1.
  • The electronic device determines the second gain of the fifth time-frequency voice signal according to whether the fifth time-frequency voice signal is diffuse field noise: if it is diffuse field noise, the second gain is less than 1; if it is a coherent signal, the second gain is equal to 1.
  • The electronic device then determines the gain of the fifth time-frequency voice signal from the first gain and the second gain: the gain of the fifth time-frequency voice signal is the product of its first gain and its second gain.
  • If the fifth time-frequency speech signal is diffuse field noise and the product of its first gain and second gain is less than a preset gain value, then the gain of the fifth time-frequency speech signal is equal to the preset gain value.
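  • A minimal sketch of this combined gain (the names and the preset gain value are assumptions for illustration):

```python
def combined_gain(g_orientation: float, g_diffuse: float,
                  is_diffuse: bool, g_preset: float = 0.1) -> float:
    """Total gain = orientation gain x diffuse-noise gain, floored at a preset
    gain value for diffuse field noise so the residual noise stays stable."""
    g = g_orientation * g_diffuse
    if is_diffuse and g < g_preset:
        g = g_preset
    return g
```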
  • In another aspect, an embodiment of the present application provides an electronic device.
  • the electronic device includes: a microphone; a camera; one or more processors; a memory; a communication module.
  • the microphone is used to collect sound signals during video recording or live broadcast;
  • the camera is used to collect image signals during video recording or live broadcast.
  • the communication module is used for communicating with external devices.
  • One or more computer programs are stored in the memory, the one or more computer programs comprising instructions. When the instructions are executed by the processor, the electronic device is caused to execute the method described in the first aspect and any possible implementation manner thereof.
  • an embodiment of the present application provides a chip system, where the chip system is applied to an electronic device.
  • The chip system includes one or more interface circuits and one or more processors.
  • the interface circuit and the processor are interconnected by wires.
  • the interface circuit is for receiving a signal from the memory of the electronic device and sending the signal to the processor, the signal including computer instructions stored in the memory.
  • When the processor executes the computer instructions, the electronic device executes the method described in the first aspect and any possible implementation manner thereof.
  • An embodiment of the present application provides a computer storage medium, the computer storage medium includes computer instructions, and when the computer instructions are run on a foldable electronic device, the electronic device executes the method described in the first aspect and any possible implementation manner thereof.
  • an embodiment of the present application provides a computer program product.
  • When the computer program product runs on a computer, the computer executes the method described in the first aspect and any possible design manner thereof.
  • FIG. 1 is an application scene diagram of the sound signal processing method provided by an embodiment of the present application;
  • FIG. 2 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;
  • FIG. 3 is a schematic diagram of the microphone positions of an electronic device provided by an embodiment of the present application;
  • FIG. 4 is a flowchart 1 of the sound signal processing method provided by an embodiment of the present application;
  • FIG. 5 is a schematic diagram comparing a time-domain sound signal collected by a microphone of an electronic device with the frequency-domain sound signal converted from it, provided by an embodiment of the present application;
  • FIG. 6 is a diagram of the correspondence between speech frames and frequency points of a frequency-domain sound signal involved in an embodiment of the present application;
  • FIG. 7 is a flowchart 2 of the sound signal processing method provided by an embodiment of the present application;
  • FIG. 8 is a scene diagram 1 of the sound signals involved in the sound signal processing method provided by an embodiment of the present application;
  • FIG. 9 is a probability distribution diagram of the nth frame of speech over 36 orientations provided by an embodiment of the present application;
  • FIG. 10 is a scene diagram 2 of the sound signals involved in the sound signal processing method provided by an embodiment of the present application;
  • FIG. 11 is a speech spectrogram before and after implementing the method shown in FIG. 7 in an embodiment of the present application;
  • FIG. 12 is a flowchart 3 of the sound signal processing method provided by an embodiment of the present application;
  • FIG. 13 is a flowchart 4 of the sound signal processing method provided by an embodiment of the present application;
  • FIG. 14 is a scene diagram 3 of the sound signals involved in the sound signal processing method provided by an embodiment of the present application;
  • FIG. 15 is a scene diagram 4 of the sound signals involved in the sound signal processing method provided by an embodiment of the present application;
  • FIG. 16 is a comparative speech spectrogram of performing the methods shown in FIG. 7 and FIG. 13 according to an embodiment of the present application;
  • FIG. 17 is a flowchart 5 of the sound signal processing method provided by an embodiment of the present application;
  • FIG. 18 is a scene diagram 5 of the sound signals involved in the sound signal processing method provided by an embodiment of the present application;
  • FIG. 19 is a comparative speech spectrogram of performing the methods shown in FIG. 13 and FIG. 17 according to an embodiment of the present application;
  • FIG. 20 is a flowchart 5 of the sound signal processing method provided by an embodiment of the present application;
  • FIG. 21A is an interface diagram 1 involved in the sound signal processing method provided by an embodiment of the present application;
  • FIG. 21B is a scene diagram 1 of the sound signal processing method provided by an embodiment of the present application;
  • FIG. 22 is a flowchart 6 of the sound signal processing method provided by an embodiment of the present application;
  • FIG. 23 is a flowchart 7 of the sound signal processing method provided by an embodiment of the present application;
  • FIG. 24 is an interface diagram 2 involved in the sound signal processing method provided by an embodiment of the present application;
  • FIG. 25 is a scene diagram 2 of the sound signal processing method provided by an embodiment of the present application;
  • FIG. 26 is a flowchart 8 of the sound signal processing method provided by an embodiment of the present application;
  • FIG. 27 is an interface diagram 3 involved in the sound signal processing method provided by an embodiment of the present application;
  • FIG. 28 is an interface diagram 4 involved in the sound signal processing method provided by an embodiment of the present application;
  • FIG. 29 is an interface diagram 5 involved in the sound signal processing method provided by an embodiment of the present application;
  • FIG. 30 is an interface diagram 6 involved in the sound signal processing method provided by an embodiment of the present application;
  • FIG. 31 is a schematic structural diagram of a chip system provided by an embodiment of the present application.
  • Target object: an object within the field of view of a camera (such as a front camera), such as a person, an animal, and the like.
  • The shooting range of the camera is determined by the camera's field of view (FOV).
  • Non-target object: an object that is not within the camera's field of view. Taking the front camera as an example, objects on the back of the phone are non-target objects.
  • Diffuse field noise: during video recording or audio recording, the sound emitted by target objects or non-target objects is reflected by walls, floors, or ceilings, and these reflections form diffuse field noise.
  • The terms "first" and "second" are used for descriptive purposes only, and cannot be understood as indicating or implying relative importance or implicitly specifying the quantity of the indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of those features. In the description of the embodiments, unless otherwise specified, "plurality" means two or more.
  • Existing electronic devices collect sound signals around the electronic device when recording video, but some sound signals are interference signals, which are not what the user wants.
  • When the electronic device uses a camera (such as a front camera or a rear camera) to record video, it will collect the sound of the target object within the FOV of the camera, the sound of non-target objects outside the FOV of the camera, and some environmental noise.
  • The sound of the non-target objects may become a source of interference, affecting the sound quality of the video recorded by the electronic device.
  • For example, a user may conveniently use the front camera of an electronic device to record a short video.
  • As shown in FIG. 1, when the user uses the front camera of the electronic device to take a selfie and record a short video, there may be children playing behind the electronic device (that is, the non-target object 1 in FIG. 1).
  • On the same side as the user (that is, the target object in FIG. 1), there may also be other objects, such as a barking dog or a little girl singing and dancing (that is, the non-target object 2 in FIG. 1). Therefore, the electronic device will inevitably record the sound of the non-target object 1 or the non-target object 2 during recording and shooting.
  • the user expects that the recorded short video can highlight his own voice (that is, the voice of the target object) and suppress non-self voices, such as the voices of non-target object 1 and non-target object 2 in FIG. 1 .
  • Otherwise, the short video recorded by the electronic device contains loud and harsh noise, which interferes with the sound of the target object and degrades the sound quality of the recorded short video.
  • An embodiment of the present application provides a sound signal processing method, which can be applied to electronic devices, can suppress sounds from outside the camera's field of view during selfie recording, and can improve the signal-to-noise ratio of the selfie voice.
  • In the scene shown in FIG. 1, the sound signal processing method provided by the embodiment of the present application can remove the sounds of the non-target objects 1 and 2 while retaining the sound of the target object, and can also reduce the effect of the diffuse field noise 1 and the diffuse field noise 2 on the sound of the target object, thereby improving the signal-to-noise ratio of the recorded audio signal of the target object, smoothing the background noise, and giving the user a better listening experience.
  • the sound signal processing method provided in the embodiment of the present application can be used for video shooting of a front camera of an electronic device, and can also be used for video shooting of a rear camera of an electronic device.
  • The electronic device may be a mobile phone, a tablet computer, a wearable device (such as a smart watch), a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), another mobile terminal, or a device such as a professional camera; the embodiments of the present application place no restriction on the specific type of the electronic device.
  • FIG. 2 shows a schematic structural diagram of the electronic device 100 .
  • The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone jack 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, bone conduction sensor 180M, etc.
  • The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices, or may be integrated in one or more processors.
  • the controller may be the nerve center and command center of the electronic device 100 .
  • the controller can generate an operation control signal according to the instruction opcode and timing signal, and complete the control of fetching and executing the instruction.
  • a memory may also be provided in the processor 110 for storing instructions and data.
  • the memory in processor 110 is a cache memory.
  • The memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs the instructions or data again, it can call them directly from this memory, which avoids repeated accesses, reduces the waiting time of the processor 110, and thus improves system efficiency.
  • the electronic device 100 realizes the display function through the GPU, the display screen 194 , and the application processor.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor. GPUs are used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
  • the display screen 194 is used to display images, videos and the like.
  • the display screen 194 includes a display panel.
  • The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, quantum dot light-emitting diodes (QLED), or the like.
  • the electronic device 100 may include 1 or N display screens 194 , where N is a positive integer greater than 1.
  • the display screen 194 may be used to display a preview interface and a shooting interface in the shooting mode.
  • the electronic device 100 can realize the shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, and an application processor.
  • the ISP is used for processing the data fed back by the camera 193 .
  • When shooting, light is transmitted through the lens to the photosensitive element of the camera, which converts the light signal into an electrical signal and transmits it to the ISP for processing, where it is converted into an image visible to the naked eye.
  • ISP can also perform algorithm optimization on image noise, brightness, and skin color.
  • ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
  • the ISP may be located in the camera 193 .
  • Camera 193 is used to capture still images or video.
  • the object generates an optical image through the lens and projects it to the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the light signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal.
  • the ISP outputs the digital image signal to the DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other image signals.
  • the electronic device 100 may include 1 or N cameras 193 , where N is a positive integer greater than 1.
  • the camera 193 may also include a depth camera for measuring the object distance of the object to be photographed, and other cameras.
  • the depth camera may include a three-dimensional (3 dimensions, 3D) depth-sensing camera, a time of flight (time of flight, TOF) depth camera, or a binocular depth camera.
  • Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
  • Video codecs are used to compress or decompress digital video.
  • the electronic device 100 may support one or more video codecs.
  • The electronic device 100 can play or record videos in various encoding formats, for example: moving picture experts group (MPEG)-1, MPEG-2, MPEG-3, MPEG-4, and so on.
  • the internal memory 121 may be used to store computer-executable program codes including instructions.
  • the processor 110 executes various functional applications and data processing of the electronic device 100 by executing instructions stored in the internal memory 121 .
  • the internal memory 121 may include an area for storing programs and an area for storing data.
  • the stored program area can store an operating system, at least one application program required by a function (such as a sound playing function, an image playing function, etc.) and the like.
  • the storage data area can store data created during the use of the electronic device 100 (such as audio data, phonebook, etc.) and the like.
  • The internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
  • The processor 110 causes the electronic device 100 to execute the method provided in the embodiments of the present application, as well as various functional applications and data processing, by executing the instructions stored in the internal memory 121 and/or the instructions stored in the memory provided in the processor.
  • the electronic device 100 can implement audio functions through the audio module 170 , the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor. Such as music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into analog audio signal output, and is also used to convert analog audio input into digital audio signal.
  • the audio module 170 may also be used to encode and decode audio signals.
  • the audio module 170 may be set in the processor 110 , or some functional modules of the audio module 170 may be set in the processor 110 .
  • The speaker 170A, also referred to as a "horn", is used to convert audio electrical signals into sound signals.
  • Electronic device 100 can listen to music through speaker 170A, or listen to hands-free calls.
  • The receiver 170B, also called an "earpiece", is used to convert audio electrical signals into sound signals.
  • the receiver 170B can be placed close to the human ear to receive the voice.
  • The microphone 170C, also called a "mic", is used to convert sound signals into electrical signals.
  • the user can put his mouth close to the microphone 170C to make a sound, and input the sound signal to the microphone 170C.
  • the electronic device 100 can be provided with one or more microphones 170C.
  • In some embodiments, the electronic device 100 can be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify the direction of the sound source, realize a directional recording function, suppress sounds from non-target directions, and so on.
  • the earphone interface 170D is used for connecting wired earphones.
  • The earphone interface 170D can be a USB interface 130, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.
  • The touch sensor 180K is also known as a "touch panel".
  • The touch sensor 180K can be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 together form a touch screen.
  • the touch sensor 180K is used to detect a touch operation on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • Visual output related to the touch operation can be provided through the display screen 194 .
  • the touch sensor 180K may also be disposed on the surface of the electronic device 100 , which is different from the position of the display screen 194 .
  • the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the electronic device 100 .
  • the electronic device 100 may include more or fewer components than shown in the figure, or combine certain components, or separate certain components, or arrange different components.
  • the illustrated components can be realized in hardware, software or a combination of software and hardware.
  • three microphones are provided in the mobile phone 300 , namely, the top microphone 301 , the bottom microphone 302 and the back microphone 303 , which are used to collect the user's voice signal when the user makes a call or records audio and video files.
  • The top microphone 301, the bottom microphone 302, and the back microphone 303 collect the sound signals of the recording environment where the user is located (including the sound of the target object, the sound of non-target objects, and the noise generated by the environment).
  • When the mobile phone 300 records video and collects audio in the orientation shown in FIG. 3, the top microphone 301 can collect the audio signal of the left channel, and the bottom microphone 302 can collect the audio signal of the right channel.
  • Alternatively, in other orientations, the top microphone 301 can collect the right-channel voice signal, and the bottom microphone 302 can collect the left-channel voice signal.
  • Which channel's voice signal each of the three microphones in the mobile phone 300 collects may vary with the usage scenario.
  • the above description in the embodiment of the present application is only for illustration and does not constitute a limitation.
  • the sound signal collected by the back microphone 303 can be combined with the sound signals collected by the top microphone 301 and the bottom microphone 302 to determine the orientation of the sound signal collected by the mobile phone.
  • In the scene shown in FIG. 1, the mobile phone 300 can collect, through the three microphones shown in FIG. 3, the sound of the target object, the sound of the non-target object 1, the sound of the non-target object 2, and the diffuse field noise 1 and the diffuse field noise 2 formed when the sounds of the above-mentioned target object and non-target objects are reflected by the environment.
  • During a selfie recording, the user's main subject is himself or herself (for example, the target object shown in FIG. 1), so the user does not wish to capture the sounds of non-target objects (such as the non-target object 1 and the non-target object 2 shown in FIG. 1).
  • the sound signal processing method in the embodiment of the present application can make the mobile phone process the collected sound signal, suppress the sound signal of the non-target object, highlight the sound of the target object, and improve the sound quality of the captured video.
  • a sound signal processing method provided in the embodiment of the present application includes:
  • the mobile phone acquires a sound signal.
  • the mobile phone can collect sound signals through the three microphones (namely, the top microphone 301 , the bottom microphone 302 and the back microphone 303 ) as shown in FIG. 3 .
  • the sound signal may also be referred to as an audio signal.
  • the sound signal collected by the microphone is a time-domain signal, which is used to represent the variation of the amplitude of the sound signal with time.
  • The sound signal collected by the microphone can be converted into a frequency-domain signal through a Fourier transform, such as a fast Fourier transform (FFT) or a discrete Fourier transform (DFT), as shown in (b) of FIG. 5.
  • As shown in (a) of FIG. 5, the time-domain signal is represented by time/amplitude, where the abscissa is the sampling time and the ordinate is the amplitude of the sound signal.
  • As shown in (b) of FIG. 5, in the frequency-domain representation the abscissa is time, the ordinate is frequency, and the value at each coordinate point is the sound signal energy.
  • For example, the sound signal energy at the time-frequency position 511 is relatively high, while the sound signal energy at the time-frequency position 512 (the white box on the upper side, noted here because it cannot be clearly shown in the figure) is low.
  • A frame of the sound signal can be transformed into a frequency-domain signal including multiple (e.g., 1024 or 512) frequency-domain sampling points (i.e., frequency points) through FFT or DFT.
  • multiple time-frequency points may be used to represent the sound signal collected by the microphone.
  • A box in FIG. 6 represents a time-frequency bin.
  • the abscissa in FIG. 6 represents the frame number of the audio signal (which may be referred to as a speech frame), and the ordinate represents the frequency point of the audio signal.
  • X_L(t, f) can be used to represent the left-channel time-frequency voice signal, that is, the sound signals corresponding to different time-frequency points in the left-channel sound signal collected by the top microphone 301.
  • X_R(t, f) can be used to represent the right-channel time-frequency voice signal, that is, the sound signal energy corresponding to different time-frequency points in the right-channel sound signal collected by the bottom microphone 302; and X_back(t, f) represents the mixed-channel time-frequency voice signal, that is, the sound signals corresponding to different time-frequency points in the left-right surround sound signal collected by the back microphone 303.
  • Here, t represents the frame number of the sound signal (which may be referred to as a speech frame), and f represents the frequency point of the sound signal.
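  • As a hedged illustration of this time-frequency representation (a sketch only; the frame length, hop size, window, and names are assumptions rather than the patent's parameters):

```python
import numpy as np

def stft(x: np.ndarray, frame_len: int = 1024, hop: int = 512) -> np.ndarray:
    """Convert a time-domain sound signal into time-frequency bins X(t, f) via FFT:
    one row per speech frame t, one column per frequency point f."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    X = np.empty((n_frames, frame_len // 2 + 1), dtype=complex)
    for t in range(n_frames):
        X[t] = np.fft.rfft(x[t * hop : t * hop + frame_len] * window)
    return X

# X_L, X_R, X_back would be obtained by applying stft() to each microphone's samples.
```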
  • the mobile phone processes the collected sound signals to suppress sound signals in non-target directions.
  • the sound signal collected by the mobile phone includes the sound of the target object and the sound of the non-target object.
  • the sound of non-target objects is not what the user wants and needs to be suppressed.
  • The target object is within the field of view of the camera (such as the front camera), and the field of view of the camera can be used as the target orientation of the embodiments of the present application; therefore, the above-mentioned non-target orientation refers to an orientation that is not within the field of view of the camera (such as the front camera).
  • The calculation of the time-frequency point gain of the sound signal may be realized by sequentially performing sound source orientation probability calculation and target orientation probability calculation.
  • The sound signal in the target direction and the sound signal in the non-target direction can be distinguished by the difference in the time-frequency point gains of the sound signals.
  • The suppression of sound signals in non-target orientations may include three calculation processes: sound source orientation probability calculation, target orientation probability calculation, and time-frequency point gain g_mask(t, f) calculation, as follows:
  • For example, the direction directly in front of the mobile phone screen is the 0° direction, the direction directly behind the screen is the 180° direction, the direction directly to the right of the screen is the 90° direction, and the direction directly to the left of the screen is the 270° direction.
  • the 360° spatial orientation formed by the front, rear, left, and right sides of the screen of the mobile phone may be divided into multiple spatial orientations.
  • the spatial orientation of 360° can be divided into 36 spatial orientations at intervals of 10°, as shown in Table 1 below.
  • The target orientation is the [310°, 50°] azimuth.
  • the target object photographed by the mobile phone is usually located within the angle range of the field of view FOV of the front camera, that is, within the target azimuth.
  • Suppressing sound signals in non-target orientations refers to suppressing sounds of objects located outside the angle range of the field of view FOV of the front camera, such as non-target object 1 and non-target object 2 shown in FIG. 1 .
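  • A small sketch of this 10° discretization, including a membership test for a wrap-around target azimuth such as [310°, 50°] (the function names are assumptions):

```python
def orientation_index(angle_deg: float) -> int:
    """Map an angle to one of 36 spatial orientations at 10° intervals (indices 0..35)."""
    return int(angle_deg % 360) // 10

def in_target_orientation(angle_deg: float, start: float = 310.0, end: float = 50.0) -> bool:
    """True if the angle lies in the target azimuth [start, end], handling the 0° wrap-around."""
    a = angle_deg % 360
    return (a >= start or a <= end) if start > end else (start <= a <= end)

assert in_target_orientation(340.0) and not in_target_orientation(150.0)
```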
  • the environmental noise may be in the target azimuth or in the non-target azimuth.
  • the diffuse field noise 1 and the diffuse field noise 2 may be substantially the same noise.
  • In the embodiments of the present application, the diffuse field noise 1 is taken as the environmental noise in the non-target azimuth, and the diffuse field noise 2 is taken as the environmental noise in the target azimuth.
  • FIG. 8 is a schematic diagram of the spatial orientations of the target object, the non-target object 1, the non-target object 2, the diffuse field noise 1, and the diffuse field noise 2: the spatial orientation of the target object is approximately the 340° direction, that of the non-target object 1 is approximately the 150° direction, that of the non-target object 2 is approximately the 60° direction, that of the diffuse field noise 1 is approximately the 230° direction, and that of the diffuse field noise 2 is approximately the 30° direction.
  • When the mobile phone is recording the target object, the microphones (such as the top microphone 301, the bottom microphone 302, and the back microphone 303 in FIG. 3) will collect the sound of the target object, the sounds of non-target objects (such as the non-target object 1 and the non-target object 2), and the ambient noise (such as the diffuse field noise 1 and the diffuse field noise 2).
  • The collected sound signals can be converted into frequency-domain signals (hereinafter collectively referred to as time-frequency speech signals) by FFT or DFT, namely the left-channel time-frequency speech signal X_L(t, f), the right-channel time-frequency speech signal X_R(t, f), and the back mixed-channel time-frequency speech signal X_back(t, f).
  • The time-frequency speech signals X_L(t, f), X_R(t, f), and X_back(t, f) collected by the three microphones can be combined into one time-frequency speech signal X(t, f).
  • The time-frequency voice signal X(t, f) can be input into a sound source orientation probability calculation model to calculate the probability P_k(t, f) of the input time-frequency voice signal in each orientation, where t represents the frame number of the sound (that is, the speech frame), f represents the frequency point, and k is the spatial orientation number.
  • the sound source azimuth probability calculation model is used to calculate the sound source azimuth probability.
  • the sound source azimuth probability calculation model may be a complex angular central Gaussian mixture model (Complex Angular Central Gaussian Mixture Model, cACGMM).
  • FIG. 9 is a schematic diagram of the probability that the sound signals corresponding to the 1024 frequency points in the nth frame of the sound signal exist in each of the 36 spatial orientations.
  • each small box represents the probability of a sound signal corresponding to a certain frequency point in the nth frame of speech signal in a certain spatial orientation.
  • the dotted box in FIG. 9 indicates that the sum of the probabilities of the sound signal corresponding to the frequency point 3 in the nth frame of the speech signal in 36 spatial orientations is 1.
  • For example, the field of view FOV of the front camera of the mobile phone is [310°, 50°]. The target object is what the mobile phone intends to record and shoot, so the target object is usually located within the FOV of the front camera.
  • Therefore, the probability that the sound signal of the target object comes from the [310°, 50°] orientation is the highest; an example of the specific probability distribution is given in Table 2.
  • Non-target objects (such as the non-target object 1 or the non-target object 2) are usually outside the field of view angle FOV of the front camera, so the probability of a non-target object appearing within the FOV is small, and may be lower than 0.5, or even 0.
  • Since the diffuse field noise 1 is the environmental noise in the non-target azimuth, the probability of the diffuse field noise 1 appearing in the field of view FOV of the front camera is small, and may be lower than 0.5, or even 0. Since the diffuse field noise 2 is the environmental noise within the target azimuth, the probability of the diffuse field noise 2 appearing in the FOV of the front camera is relatively high, and may be higher than 0.8, or even 1.
The target orientation probability refers to the sum of the probabilities that the above time-frequency speech signal exists in each orientation within the target orientation, which may also be referred to as the spatial clustering probability of the target orientation. The spatial clustering probability P(t,f) of the time-frequency speech signal in the target orientation can be calculated by the following formula (1):

    P(t,f) = Σ_{k=k1}^{k2} P_k(t,f)    (1)

where k1 to k2 are the angle indices of the target orientation, which may also be called the spatial orientation numbers of the target orientation; P_k(t,f) is the probability that the current time-frequency speech signal exists in orientation k; and P(t,f) is the sum of the probabilities that the current time-frequency speech signal exists in the target orientation.
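As a sketch of formula (1), assume 36 spatial orientations at 10° resolution (as in FIG. 9) and a target orientation of [310°, 50°], which wraps around 0°; the shapes and names below are illustrative:

```python
import numpy as np

def target_orientation_probability(P_k, target_bins):
    """Formula (1): sum P_k(t, f) over the orientation indices k1..k2
    that fall within the target orientation."""
    return P_k[..., target_bins].sum(axis=-1)

# P_k: (n_frames, n_bins, 36) orientation probabilities, e.g. from a cACGMM;
# each (t, f) slice sums to 1 over the 36 spatial orientations.
P_k = np.random.dirichlet(np.ones(36), size=(92, 513))

angles = np.arange(36) * 10                     # 0°, 10°, ..., 350°
target_bins = np.where((angles >= 310) | (angles < 50))[0]
P = target_orientation_probability(P_k, target_bins)   # P(t, f) in [0, 1]
```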
Take the FOV of the front camera of the mobile phone as [310°, 50°] and the target orientation as [310°, 50°] as an example. The sum of probabilities P(t,f) that the time-frequency speech signals of non-target objects exist in the target orientation may be less than 0.5, or even 0.
Since diffuse field noise 1 is environmental noise in a non-target orientation, its time-frequency speech signal is unlikely to exist in the target orientation, so the sum of probabilities P(t,f) that the time-frequency speech signal of diffuse field noise 1 exists in the target orientation may be less than 0.5, or even 0. Since diffuse field noise 2 is environmental noise within the target orientation, its time-frequency speech signal has a higher probability of existing in the target orientation, so the sum of probabilities P(t,f) that the time-frequency speech signal of diffuse field noise 2 exists in the target orientation may be greater than 0.8, or even 1.
The main purpose of suppressing sound signals from non-target orientations is to preserve the sound signal of the target object and suppress the sound signals of non-target objects. Target objects are within the FOV of the front camera, so their sound signals mostly come from the target orientation; that is, the probability of the target object's sound appearing in the target orientation is usually high. Non-target objects are usually not within the FOV of the front camera, so their sound signals mostly come from non-target orientations; that is, the probability of a non-target object's sound appearing in the target orientation is usually low.
Based on this, the time-frequency point gain can be determined by the following formula (2):

    g_mask(t,f) = 1,          if P(t,f) > P_th
    g_mask(t,f) = g_mask-min, if P(t,f) ≤ P_th    (2)

where P_th is a preset probability threshold, which can be configured by parameters, for example P_th is set to 0.8; and g_mask-min is the time-frequency point gain when the current time-frequency speech signal is located in a non-target orientation, which can also be configured by parameters, for example g_mask-min is set to 0.2.

When the sum of probabilities P(t,f) that the time-frequency speech signal exists in the target orientation is greater than the probability threshold P_th, the current time-frequency speech signal can be considered to be within the target orientation, and it most likely comes from the target object; setting the time-frequency point gain g_mask(t,f) to 1 in this case preserves the sound of the target object to the greatest extent. When P(t,f) is less than or equal to the probability threshold P_th, the current time-frequency speech signal can be considered not to be within the target orientation, and it most likely comes from a non-target object (such as non-target object 1 or non-target object 2); configuring the time-frequency point gain g_mask(t,f) as g_mask-min, for example 0.2, effectively suppresses the sound of non-target objects (such as non-target object 1 or non-target object 2).

It should be noted that the time-frequency speech signal of environmental noise may exist within the target orientation, such as diffuse field noise 2, or in a non-target orientation, such as diffuse field noise 1. Therefore, the time-frequency point gain g_mask(t,f) of the time-frequency speech signal of diffuse field noise 2 may be 1, while that of diffuse field noise 1 may be g_mask-min, such as 0.2. That is to say, the energy of environmental noise within the target orientation cannot be reduced by suppressing the sound of non-target orientations as described above.
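The gain rule of formula (2) can be sketched as follows, using the example parameter values above (names are illustrative):

```python
import numpy as np

def mask_gain(P, p_th=0.8, g_mask_min=0.2):
    """Formula (2): gain 1 for bins whose target-orientation probability
    exceeds the threshold, g_mask_min for bins in non-target orientations."""
    return np.where(P > p_th, 1.0, g_mask_min)
```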
Exemplarily, a mobile phone has two speakers, namely a speaker at the top of the screen of the mobile phone (hereinafter referred to as speaker 1) and a speaker at the bottom of the mobile phone (hereinafter referred to as speaker 2). When the mobile phone outputs audio (that is, a sound signal), speaker 1 can be used to output the left-channel audio signal and speaker 2 can be used to output the right-channel audio signal. Alternatively, speaker 1 can be used to output the right-channel audio signal and speaker 2 the left-channel audio signal; this embodiment of the present application places no particular limitation on this. It can be understood that when the electronic device has only one speaker, it can output (left-channel audio signal + right-channel audio signal)/2, or output the sum (left-channel audio signal + right-channel audio signal), or output the two channels after fusion in some other way, which is not limited in this application.
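A single-speaker downmix of the kind described above could look like the following hypothetical helper:

```python
def single_speaker_output(y_left, y_right, mode="average"):
    """Output (L + R) / 2, or the plain sum L + R, on one speaker."""
    return (y_left + y_right) / 2.0 if mode == "average" else y_left + y_right
```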
Accordingly, the output sound signal can be divided into a left-channel audio output signal Y_L(t,f) and a right-channel audio output signal Y_R(t,f). After the processing of suppressing the sound of non-target orientations, the output sound signals Y_L(t,f) and Y_R(t,f) can be calculated by the following formula (3) and formula (4):

    Y_L(t,f) = g_mask(t,f) · X_L(t,f)    (3)
    Y_R(t,f) = g_mask(t,f) · X_R(t,f)    (4)
For the target object, the energy of the left-channel audio output signal Y_L(t,f) is equal to the energy of the left-channel time-frequency speech signal X_L(t,f), and the energy of the right-channel audio output signal Y_R(t,f) is equal to the energy of the right-channel time-frequency speech signal X_R(t,f); that is, the sound signal of the target object as shown in FIG. 10 is completely preserved. For a non-target object, the energy of Y_L(t,f) is equal to 0.2 times the energy of X_L(t,f), and the energy of Y_R(t,f) is equal to 0.2 times the energy of X_R(t,f); that is, the sound signal of the non-target object as shown in FIG. 10 is effectively suppressed. For diffuse field noise 2, the energy of Y_L(t,f) is equal to the energy of X_L(t,f), and the energy of Y_R(t,f) is equal to the energy of X_R(t,f); that is, the sound signal of diffuse field noise 2 shown in FIG. 10 is not suppressed. For diffuse field noise 1, the energy of Y_L(t,f) is equal to 0.2 times the energy of X_L(t,f), and the energy of Y_R(t,f) is equal to 0.2 times the energy of X_R(t,f); that is, the sound signal of diffuse field noise 1 as shown in FIG. 10 is effectively suppressed. In summary, the time-frequency speech signals within the target orientation (such as the time-frequency speech signal of the target object and the time-frequency speech signal of diffuse field noise 2) are completely preserved, and the time-frequency speech signals outside the target orientation (such as the time-frequency speech signals of non-target object 1, non-target object 2 and diffuse field noise 1) are effectively suppressed.
(a) in FIG. 11 shows the time-frequency speech signal before the processing of 401, and (b) in FIG. 11 shows the time-frequency speech signal after the processing of 401; the boxed regions in (a) and (b) of FIG. 11 mark time-frequency speech signals in non-target orientations. Comparing (a) and (b) in FIG. 11, it can be seen that after 401 suppresses the sound signals of non-target orientations, the time-frequency speech signals of the non-target orientations are suppressed, that is, their energy is significantly reduced.
However, the time-frequency speech signal output through the above time-frequency point gain g_mask(t,f) calculation only suppresses the time-frequency speech signals of non-target orientations; environmental noise within the target orientation (such as diffuse field noise 2) may still remain, so the environmental noise in the output time-frequency speech signal is still relatively large, the signal-to-noise ratio of the output speech signal is small, and the quality of the speech signal is low. For this reason, the sound signal processing method provided by the embodiments of the present application can further suppress the diffuse field noise, thereby improving the signal-to-noise ratio of the output speech signal and improving the clarity of the speech signal.
In this case, the sound signal processing method provided in the embodiments of the present application may include 400-401 and 1201-1203. After the sound signal is collected, step 1201 may be performed to process the collected sound signal so as to suppress the diffuse field noise. Suppressing the diffuse field noise may be realized by first calculating a coherent-to-diffuse power ratio (CDR), and then calculating from it the time-frequency point gain g_cdr(t,f) used when suppressing the diffuse field noise. The coherent signals in the sound signal (such as the sound signal of the target object and the sound signals of non-target objects) and the diffuse field noise are distinguished by the difference in the time-frequency point gain g_cdr(t,f). Specifically, the suppression of diffuse field noise can include two calculation processes, the calculation of the coherent-to-diffuse ratio CDR and the calculation of the time-frequency point gain g_cdr(t,f), as follows.
The coherent-to-diffuse power ratio refers to the power ratio of the coherent signal (that is, the speech signal of the target object or of a non-target object) to the diffuse field noise. It can be calculated from the above left-channel time-frequency speech signal X_L(t,f), right-channel time-frequency speech signal X_R(t,f) and back mixed-channel time-frequency speech signal X_back(t,f) using existing techniques, such as the method described in "Coherent-to-Diffuse Power Ratio Estimation for Dereverberation".
Since the sound signal of the target object is a coherent signal, its coherent-to-diffuse ratio tends to infinity (∞); likewise, the coherent-to-diffuse ratio of the sound signal of a non-target object also tends to infinity. For diffuse field noise (such as diffuse field noise 1 or diffuse field noise 2), the coherent-to-diffuse ratio is 0.
The main purpose of suppressing diffuse field noise is to preserve the sound of coherent signals (such as the target object) and reduce the energy of the diffuse field noise. Based on the coherent-to-diffuse ratio, the time-frequency point gain g_cdr(t,f) of coherent signals (that is, the sound signal of the target object and the sound signals of non-target objects) and the time-frequency point gain g_cdr(t,f) of the diffuse field noise can be determined; that is, coherent signals and incoherent signals are distinguished through the time-frequency point gain g_cdr(t,f). Specifically, the time-frequency point gain g_cdr(t,f) of a coherent signal may be kept at 1, and the time-frequency point gain g_cdr(t,f) of an incoherent signal may be reduced, for example set to 0.3. In this way, the sound signal of the target object can be preserved, and the diffuse field noise can be suppressed to reduce the energy of the diffuse field noise.
For example, the time-frequency point gain can be calculated by the following formula (5):

    g_cdr(t,f) = max( g_cdr-min, 1 − μ/(CDR(t,f) + 1) )    (5)

where g_cdr-min is the minimum gain after suppressing the diffuse field noise, which can be configured through parameters, for example g_cdr-min can be set to 0.3; g_cdr(t,f) is the time-frequency point gain after suppressing the diffuse field noise; and μ is an overestimation factor, which can be configured by parameters, for example μ is set to 1. For a coherent signal, the time-frequency point gain g_cdr(t,f) is 1, so the coherent signal is completely preserved. The time-frequency point gain g_cdr(t,f) of diffuse field noise is 0.3, so the diffuse field noise is effectively suppressed; that is, the energy of the diffuse field noise is significantly reduced compared with that before the processing.
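The following sketch illustrates such a CDR-to-gain mapping; the exact estimator form is an assumption modeled on common CDR-based spectral gains, using the floor g_cdr-min and the overestimation factor μ defined above:

```python
import numpy as np

def cdr_gain(cdr, g_cdr_min=0.3, mu=1.0):
    """CDR-based gain: coherent bins (CDR -> infinity) approach gain 1,
    while diffuse bins (CDR -> 0) are floored at g_cdr_min."""
    return np.maximum(g_cdr_min, 1.0 - mu / (cdr + 1.0))

print(cdr_gain(np.array([0.0, 1e6])))   # [0.3, ~1.0]
```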
As described above, the main purpose of suppressing the sound signals of non-target orientations in 401 is to retain the sound signal of the target object and suppress the sound signals of non-target objects, while the main purpose of suppressing the diffuse field noise in 1201 is to suppress the diffuse field noise and protect the coherent signals (that is, the sound signals of the target object and of non-target objects). Therefore, in the sound signal processing method shown in FIG. 12 or FIG. 13, the time-frequency point gain g_mask(t,f) obtained by suppressing the sound signals of non-target orientations in 401 and the time-frequency point gain g_cdr(t,f) obtained by suppressing the diffuse field noise in step 1201 can be fused according to the following formula (6), giving the fused gain g_mix(t,f):

    g_mix(t,f) = g_mask(t,f) · g_cdr(t,f)    (6)

where g_mix(t,f) is the mixing gain after the gain fusion calculation. By using the fused gain g_mix(t,f) to process the input time-frequency speech signal, a clear time-frequency speech signal of the target object can be obtained, which reduces the energy of the diffuse field noise and improves the signal-to-noise ratio of the audio (sound) signal. In this way, the gains of the time-frequency speech signal of the target object, the time-frequency speech signals of non-target objects, the time-frequency speech signal of diffuse field noise 1 and the time-frequency speech signal of diffuse field noise 2 are fused to calculate the fusion gain g_mix(t,f).
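The fusion can be sketched as a per-bin product, a form consistent with the worked energies below (for example, 0.2 × 0.3 = 0.06 for diffuse field noise 1):

```python
def fused_gain(g_mask, g_cdr):
    """Formula (6): fuse the orientation mask gain with the CDR gain."""
    return g_mask * g_cdr

# target object:         1.0 * 1.0 = 1.0   (fully preserved)
# non-target object:     0.2 * 1.0 = 0.2
# diffuse field noise 2: 1.0 * 0.3 = 0.3
# diffuse field noise 1: 0.2 * 0.3 = 0.06
```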
Combining the fusion gain g_mix(t,f) of each sound signal obtained from the above calculation with the sound input signals collected by the microphones, such as the left-channel time-frequency speech signal X_L(t,f) and the right-channel time-frequency speech signal X_R(t,f), the output sound signals after the fusion processing of suppressing the sound of non-target objects and suppressing the diffuse field noise, such as Y_L(t,f) and Y_R(t,f), can be obtained. The processed sound signals Y_L(t,f) and Y_R(t,f) can be respectively calculated by the following formula (7) and formula (8):

    Y_L(t,f) = g_mix(t,f) · X_L(t,f)    (7)
    Y_R(t,f) = g_mix(t,f) · X_R(t,f)    (8)
For the target object, the energy of the left-channel audio output signal Y_L(t,f) is equal to the energy of the left-channel time-frequency speech signal X_L(t,f), and the energy of the right-channel audio output signal Y_R(t,f) is equal to the energy of the right-channel time-frequency speech signal X_R(t,f); that is, the sound signal of the target object as shown in FIG. 15 is completely preserved. For a non-target object, the energy of Y_L(t,f) is equal to 0.2 times the energy of X_L(t,f), and the energy of Y_R(t,f) is equal to 0.2 times the energy of X_R(t,f); that is, the sound signal of the non-target object as shown in FIG. 15 is effectively suppressed. For diffuse field noise 2, the energy of Y_L(t,f) is equal to 0.3 times the energy of X_L(t,f), and the energy of Y_R(t,f) is equal to 0.3 times the energy of X_R(t,f); that is, the sound signal of diffuse field noise 2 as shown in FIG. 15 is effectively suppressed. For diffuse field noise 1, the energy of Y_L(t,f) is equal to 0.06 times the energy of X_L(t,f), and the energy of Y_R(t,f) is equal to 0.06 times the energy of X_R(t,f); that is, the sound signal of diffuse field noise 1 as shown in FIG. 15 is effectively suppressed.
In this way, the time-frequency speech signal of the target object within the target orientation is completely preserved, the time-frequency speech signals outside the target orientation (such as the time-frequency speech signals of non-target object 1, non-target object 2 and diffuse field noise 1) are effectively suppressed, and the time-frequency speech signal of diffuse field noise 2 within the target orientation is also effectively suppressed, thereby improving the signal-to-noise ratio of the speech signal and improving the clarity of the selfie sound.
(a) in FIG. 16 is the time-frequency speech signal obtained by only suppressing the sound signals of non-target orientations, and (b) in FIG. 16 is the time-frequency speech signal obtained after the fusion processing of suppressing the sound signals of non-target orientations and suppressing the diffuse field noise. Comparing (a) and (b) in FIG. 16, the background of the time-frequency speech signal in the spectrogram of (b) is darker, meaning that the energy of the background noise (that is, the diffuse field noise) is smaller. Therefore, it can be determined that, after the fusion processing of suppressing the sound signals of non-target orientations and suppressing the diffuse field noise, the noise energy of the output sound signal is reduced and the signal-to-noise ratio of the sound signal is greatly improved.
However, after the fusion gain g_mix(t,f) in 1202 is calculated, there is a large gap between the fusion gains g_mix(t,f) of the time-frequency speech signals of diffuse field noise 1 and diffuse field noise 2, so the background noise of the output audio signal is not stable. For this reason, noise compensation may be performed on the diffuse field noise after the secondary noise reduction, so as to make the background noise of the audio signal more stable. As shown in FIG. 17, the above sound signal processing method may then include 400-401, 1201-1202 and 1701-1702.
In 1701, the diffuse field noise can be compensated by the following formula (9):

    g_out(t,f) = max( g_mix(t,f), g_min ), for time-frequency points of diffuse field noise
    g_out(t,f) = g_mix(t,f), otherwise    (9)

where g_min is the minimum gain of the diffuse field noise (that is, a preset gain value), which can be configured through parameters, for example g_min can be set to 0.3; and g_out(t,f) is the time-frequency point gain of the time-frequency speech signal after the diffuse field noise compensation. The time-frequency speech signals of diffuse field noise 1 and diffuse field noise 2 are subjected to the diffuse field noise compensation calculation to obtain the time-frequency point gain g_out(t,f). After compensation, the time-frequency point gain of diffuse field noise 1 increases from 0.06 to 0.3, and the time-frequency point gain of diffuse field noise 2 remains at 0.3, so that the background noise of the output sound signal (for example, diffuse field noise 1 and diffuse field noise 2) is more stable and the user's sense of hearing is better.
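A sketch of formula (9) follows; identifying diffuse bins by their CDR gain having hit the floor is an assumption, since the text does not spell out the classifier:

```python
import numpy as np

def compensated_gain(g_mix, g_cdr, g_min=0.3, g_cdr_min=0.3):
    """Formula (9): raise the fused gain of diffuse-noise bins to at least
    g_min; coherent bins keep their fused gain unchanged."""
    is_diffuse = g_cdr <= g_cdr_min
    return np.where(is_diffuse, np.maximum(g_mix, g_min), g_mix)

# diffuse field noise 1: 0.06 -> 0.3, diffuse field noise 2: 0.3 -> 0.3,
# non-target object (coherent): 0.2 -> 0.2
```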
In 1702, the difference from the above 1203 is that the processed sound signals, such as the left-channel audio output signal Y_L(t,f) and the right-channel audio output signal Y_R(t,f), are calculated based on the time-frequency point gain g_out(t,f) of the time-frequency speech signal after diffuse field noise compensation. Specifically, they can be calculated by the following formula (10) and formula (11):

    Y_L(t,f) = g_out(t,f) · X_L(t,f)    (10)
    Y_R(t,f) = g_out(t,f) · X_R(t,f)    (11)

where X_L(t,f) is the left-channel time-frequency speech signal collected by the microphones, and X_R(t,f) is the right-channel time-frequency speech signal collected by the microphones.
For the target object, the energy of the left-channel audio output signal Y_L(t,f) is equal to the energy of the left-channel time-frequency speech signal X_L(t,f), and the energy of the right-channel audio output signal Y_R(t,f) is equal to the energy of the right-channel time-frequency speech signal X_R(t,f); that is, the sound signal of the target object as shown in FIG. 18 is completely preserved. For a non-target object, the energy of Y_L(t,f) is equal to 0.2 times the energy of X_L(t,f), and the energy of Y_R(t,f) is equal to 0.2 times the energy of X_R(t,f); that is, the sound signal of the non-target object as shown in FIG. 18 is effectively suppressed. For diffuse field noise 2, the energy of Y_L(t,f) is equal to 0.3 times the energy of X_L(t,f), and the energy of Y_R(t,f) is equal to 0.3 times the energy of X_R(t,f); that is, the sound signal of diffuse field noise 2 as shown in FIG. 18 is effectively suppressed. For diffuse field noise 1, the energy of Y_L(t,f) is equal to 0.3 times the energy of X_L(t,f), and the energy of Y_R(t,f) is equal to 0.3 times the energy of X_R(t,f); that is, the sound signal of diffuse field noise 1 as shown in FIG. 18 is effectively suppressed.
In this way, after the time-frequency point gain g_out(t,f) with diffuse field noise compensation is calculated, the gap between the time-frequency point gains g_out(t,f) of diffuse field noise 1 and diffuse field noise 2 is eliminated, so that the background noise of the output sound signal (such as diffuse field noise 1 and diffuse field noise 2) is more stable and the user's sense of hearing is better. For comparison, (a) shows the time-frequency speech signal output without the diffuse field noise compensation processing, and (b) shows the time-frequency speech signal output after the diffuse field noise compensation processing.
Scenario 1: a scene in which the user uses the front camera to record video. In this scenario, the mobile phone processes each frame of image data as it is collected, and processes the collected audio data each time it collects the audio data corresponding to one frame of image. As shown in FIG. 20, a sound signal processing method provided in the embodiments of the present application may include:
2001. The mobile phone starts the recording function of the front camera and starts recording.

When the user wants to use the mobile phone to record, he or she can activate the video recording function of the mobile phone. For example, the mobile phone can start a camera application, or start another application with a video recording function (such as Douyin or Kuaishou), thereby activating the video recording function of that application. For example, after the mobile phone detects that the user clicks the camera icon 2101 shown in (a) in FIG. 21A, it starts the video recording function of the camera application and displays the preview interface of the front camera recording shown in (b) in FIG. 21A. Alternatively, while displaying the desktop or an interface of another application, the mobile phone detects the user's voice command to open the camera application, starts the recording function, and displays the preview interface of the front camera recording shown in (b) in FIG. 21A. The mobile phone can also activate the video recording function in response to other touch operations, voice commands or shortcut gestures of the user; the embodiments of the present application do not limit the operation that triggers the mobile phone to start the video recording function.

When the mobile phone displays the preview interface of the front camera recording shown in (b) in FIG. 21A and detects that the user clicks the video recording button 2102 shown in (b) in FIG. 21A, it starts the front camera recording, displays the recording interface of the front camera recording shown in (c) in FIG. 21A, and starts the recording timer.
2002. The mobile phone collects the N-th frame of image and processes the N-th frame of image.

During recording, the processing in the mobile phone can be divided into an image stream and an audio stream. The image stream is used to collect image data and perform image processing operations on each frame of image; the audio stream is used to collect audio data and perform sound pickup and denoising processing on each frame of audio data. For the image stream, after collecting the first frame of image, the mobile phone may process the first frame of image, for example with image denoising, tone mapping and other processing; after collecting the second frame of image, the mobile phone can process the second frame of image; and so on, after collecting the N-th frame of image, the mobile phone can process the N-th frame of image, where N is a positive integer.
2003. The mobile phone collects the audio corresponding to the N-th frame of image and processes the audio corresponding to the N-th frame of image.

The image data collected by the mobile phone is the image of the target object, while the collected audio data will include not only the sound of the target object but may also include the sounds of non-target objects (such as non-target object 1 and non-target object 2) and environmental noise (such as diffuse field noise 1 and diffuse field noise 2).
Exemplarily, suppose one frame of image is 30 milliseconds (ms) and one frame of audio is 10 ms, with frames counted from the start of recording; then the audio corresponding to the N-th frame of image consists of the (3N-2)-th, (3N-1)-th and 3N-th frames of audio. For example, the audio corresponding to the first frame of image is the first, second and third frames of audio; the audio corresponding to the second frame of image is the fourth, fifth and sixth frames of audio; and the audio corresponding to the 10th frame of image is the 28th, 29th and 30th frames of audio.
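This mapping can be written directly as a trivial helper (names assumed):

```python
def audio_frames_for_image(n):
    """With 30 ms image frames and 10 ms audio frames, image frame N
    (1-indexed) corresponds to audio frames 3N-2, 3N-1 and 3N."""
    return (3 * n - 2, 3 * n - 1, 3 * n)

assert audio_frames_for_image(1) == (1, 2, 3)
assert audio_frames_for_image(10) == (28, 29, 30)
```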
Therefore, to process the audio corresponding to the first frame of image, the mobile phone needs to process the first, second and third frames of audio separately. Taking the first frame of audio as an example, the mobile phone may execute the sound signal processing method shown in FIG. 7 above to process the first frame of audio, suppressing the sound of non-target objects to highlight the sound of the target object. Alternatively, the mobile phone can execute the sound signal processing method shown in FIG. 13 above, to suppress the sound of non-target objects to highlight the sound of the target object, and to suppress the diffuse field noise to reduce the noise energy in the audio signal and improve the signal-to-noise ratio of the audio signal. Alternatively, the mobile phone can execute the sound signal processing method shown in FIG. 17 above, to suppress the sound of non-target objects to highlight the sound of the target object, to suppress the diffuse field noise to reduce the noise energy in the audio signal and improve the signal-to-noise ratio of the audio signal, and also to smooth the background noise (that is, the ambient noise) so that the user's sense of hearing is better. For the second and third frames of audio, the mobile phone can likewise execute the sound signal processing method shown in FIG. 7, FIG. 13 or FIG. 17 to process the audio signal. It should be understood that the above step 2003 may correspond to step 400 in FIG. 7, FIG. 13 or FIG. 17. The audio corresponding to any subsequent image frame can be processed according to the above audio processing procedure for the first frame of image, which will not be repeated here.

It should be noted that, when processing the audio corresponding to the N-th frame of image, the above method can be executed each time one frame of audio is collected; alternatively, the collection of the 3 frames of audio corresponding to the N-th frame of image can be completed first, and each of those 3 frames of audio then processed separately, which is not particularly limited in the embodiments of the present application.
2004. The mobile phone synthesizes the processed N-th frame of image and the audio corresponding to the processed N-th frame of image to obtain the N-th frame of video data.

Specifically, the mobile phone can take the N-th frame of image from the image stream and take the (3N-2)-th, (3N-1)-th and 3N-th frames of audio from the audio stream, and then combine the (3N-2)-th, (3N-1)-th and 3N-th frames of audio with the N-th frame of image in timestamp order, synthesizing them into the N-th frame of video data.
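Schematically, this per-frame synthesis amounts to ordering the processed image packet and its three audio packets by timestamp (a simplified sketch; a real recorder would write them into a media container):

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class MediaPacket:
    timestamp_ms: int
    kind: str = field(compare=False)        # "image" or "audio"
    payload: bytes = field(compare=False, repr=False)

def mux_nth_frame(image_packet, audio_packets):
    """Combine the Nth processed image with audio frames 3N-2..3N in
    timestamp order, yielding one frame of video data."""
    return sorted([image_packet, *audio_packets])
```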
Afterwards, when the mobile phone detects that the user clicks the end recording button 2103 shown in (a) in FIG. 21B, the mobile phone responds to the user's click operation, ends the recording, and displays the preview interface of the front camera recording shown in (b) in FIG. 21B. Specifically, in response to the user's click operation on the end recording button 2103, the mobile phone stops the collection of images and audio; after the processing of the last frame of image and of the audio corresponding to the last frame of image is completed and the last frame of video data is obtained, the mobile phone synthesizes the first frame of video data through the last frame of video data in timestamp order and saves the result as video file A. The preview file displayed in the preview window 2104 in the preview interface of the front camera recording shown in (b) in FIG. 21B is the video file A. After the mobile phone detects the user's click operation on the preview window 2104 shown in (b) in FIG. 21B, the mobile phone can, in response to that click operation, display the video file playback interface shown in (c) in FIG. 21B and play video file A. The sound signal of the played video file A no longer contains the sound signals of non-target objects (such as non-target object 1 and non-target object 2), and the background noise (such as diffuse field noise 1 and diffuse field noise 2) in the sound signal of video file A is small and smooth, which can give the user a good sense of hearing. It should be noted that the scenario shown in FIG. 21B is the scenario of executing the sound signal processing method shown in FIG. 17 above; the scenarios of executing the sound signal processing method shown in FIG. 7 or FIG. 13 are not shown.
In other embodiments, after the mobile phone completes the collection and processing of the N-th frame of image and of the audio corresponding to the N-th frame of image, it does not immediately synthesize the N-th frame of image and the corresponding audio into the N-th frame of video data; instead, after the recording ends, all the images and all the audio are synthesized into video file A in timestamp order. Such a video recording method provided by the embodiments of the present application differs from the video recording method shown in FIG. 20 above in that, after the mobile phone starts recording, it first collects all the image data and audio data, then processes and synthesizes the image data and audio data, and finally saves the synthesized video file.
The method includes:

2201. The mobile phone starts the recording function of the front camera and starts recording. The method for starting the video recording can refer to the description in 2001 above, which will not be repeated here.

2202. The mobile phone collects image data and audio data respectively.

During the video recording process of the mobile phone, the processing can be divided into an image stream and an audio stream. The image stream is used to collect multiple frames of image data during video recording; the audio stream is used to collect multiple frames of audio data during video recording. For example, the mobile phone sequentially collects the first frame of image, the second frame of image, and so on up to the last frame of image, and sequentially collects the first frame of audio, the second frame of audio, and so on up to the last frame of audio. The image data collected by the mobile phone is the image of the target object, while the collected audio data will include not only the sound of the target object but may also include the sounds of non-target objects (such as non-target object 1 and non-target object 2) and environmental noise (such as diffuse field noise 1 and diffuse field noise 2).
2203. The mobile phone processes the collected image data and audio data respectively.

When the mobile phone detects that the user clicks the end recording button 2103 shown in (a) in FIG. 21B, the mobile phone responds to the user's click operation, ends the recording, and displays the preview interface of the front camera recording shown in (b) in FIG. 21B. In response to the user's click operation on the end recording button 2103, the mobile phone processes the collected image data and the collected audio data. Specifically, the mobile phone may separately process each frame of image in the collected image data, for example with image denoising, tone mapping and other processing, to obtain the processed image data. The mobile phone may also process each frame of audio in the collected audio data. For example, the mobile phone may execute the sound signal processing method shown in FIG. 7 above to process each frame of audio in the collected audio data, suppressing the sound of non-target objects to highlight the sound of the target object. Alternatively, the mobile phone may execute the sound signal processing method shown in FIG. 13 above to process each frame of audio in the collected audio data, suppressing the sound of non-target objects to highlight the sound of the target object, and suppressing the diffuse field noise to reduce the noise energy in the audio signal and improve the signal-to-noise ratio of the audio signal. Alternatively, the mobile phone may execute the sound signal processing method shown in FIG. 17 above to process each frame of audio in the collected audio data, suppressing the sound of non-target objects to highlight the sound of the target object, suppressing the diffuse field noise to reduce the noise energy in the audio signal and improve the signal-to-noise ratio of the audio signal, and also smoothing the background noise (that is, the environmental noise) so that the user's sense of hearing is better. In this way, the processed audio data can be obtained. It should be understood that steps 2202 and 2203 may correspond to step 400 in FIG. 7, FIG. 13 or FIG. 17.
2204. The mobile phone synthesizes the processed image data and the processed audio data to obtain video file A.

The processed image data and the processed audio data need to be synthesized into a video file before they can be shared or played by the user. Therefore, after the mobile phone executes the above step 2203 to obtain the processed image data and the processed audio data, the processed image data and the processed audio data can be synthesized to form video file A.

2205. The mobile phone saves the video file A.

When video file A is played, the sound signal of the played video file A no longer contains the sound signals of non-target objects (such as non-target object 1 and non-target object 2), and the background noise (such as diffuse field noise 1 and diffuse field noise 2) in its sound signal is small and smooth, which can give the user a good sense of hearing.
Scenario 2: a scene in which the user uses the front camera for a live broadcast. The data collected during a live broadcast is displayed to the user in real time, so the images and audio collected during the live broadcast are processed in real time, and the processed image and audio data are presented to the user promptly.
The live broadcast scenario involves mobile phone A, a server and mobile phone B, where both mobile phone A and mobile phone B communicate with the server. Mobile phone A can be the live video recording device, used to record the audio and video and transmit them to the server. Mobile phone B can be the live broadcast display device, used to obtain the audio and video from the server and display their content on the live broadcast interface for users to watch.
As shown in FIG. 23, a video recording method provided in the embodiments of the present application is applied to mobile phone A. The method can include:

Mobile phone A starts the live video recording of the front camera and starts the live broadcast. When the user wants to use the mobile phone for a live video broadcast, he or she can start a live video application on the mobile phone, such as Douyin or Kuaishou, and start the live video recording. Taking the Douyin application as an example, after the mobile phone detects that the user clicks the start live broadcast button 2401 shown in (a) in FIG. 24, it displays the live video capture interface shown in (b) in FIG. 24. At this point, the Douyin application starts to collect image data and sound data.
Mobile phone A collects the N-th frame of image and processes the N-th frame of image. This processing procedure is similar to step 2002 shown in FIG. 20 above and will not be repeated here.

Mobile phone A collects the audio corresponding to the N-th frame of image and processes the audio corresponding to the N-th frame of image. The N-th frame of image collected by the mobile phone is the image of the target object, while the audio corresponding to the N-th frame of image will include not only the sound of the target object but also the sounds of non-target objects (such as non-target object 1 and non-target object 2) and environmental noise (such as diffuse field noise 1 and diffuse field noise 2). This processing procedure is similar to step 2003 shown in FIG. 20 above and will not be repeated here.
Mobile phone A synthesizes the processed N-th frame of image and the audio corresponding to the processed N-th frame of image to obtain the N-th frame of video data. Mobile phone A then sends the N-th frame of video data to the server, so that mobile phone B can display the N-th frame of video data. The server is usually a server of the live broadcast application, such as a server of the Douyin application. After obtaining the N-th frame of video data from the server, mobile phone B can display the N-th frame of video on the live broadcast display interface for the user to watch. As shown in FIG. 25, the audio output by mobile phone B is the processed audio signal, in which only the sound of the target object (that is, the subject of the live broadcast) remains. It should be noted that the scenario shown in FIG. 25 is the scenario of executing the sound signal processing method shown in FIG. 17 above; the scenarios of executing the sound signal processing method shown in FIG. 7 or FIG. 13 are not shown.
Scenario 3: a scene in which sound pickup processing is performed on video files in the mobile phone album. An electronic device (such as a mobile phone) can perform the sound signal processing method of FIG. 7, FIG. 13 or FIG. 17 on a video file stored in the mobile phone album, using the saved original sound data, to remove the sounds of non-target objects, or to remove the diffuse field noise and make the noise more stable, so that the user's hearing experience is better. The saved original sound data refers to the sound data collected by the microphones of the mobile phone when the video file in the album of the mobile phone was recorded. Accordingly, an embodiment of the present application further provides a sound pickup processing method for performing sound pickup processing on video files in the mobile phone album. The sound pickup processing method includes:
2601. The mobile phone acquires the first video file in the album.

When the user wants to perform sound pickup processing on the audio of a video file in the mobile phone album, so as to remove the sounds of non-target objects or remove the diffuse field noise, the user can select the video file to be processed from the mobile phone album. For example, after the mobile phone detects that the user clicks the preview box 2701 of the first video file shown in (a) in FIG. 27, the mobile phone responds to the user's click operation and displays the operation interface for the first video file shown in (b) in FIG. 27. In this operation interface, the user can play, share, collect, edit or delete the first video file, and can also perform the sound pickup and denoising operation on the first video file. When the mobile phone detects that the user clicks the "more" option 2801 shown in (a) in FIG. 28, the mobile phone responds to the user's click operation and displays the operation selection box 2802 shown in (b) in FIG. 28. The operation selection box 2802 contains a "noise removal" option button 2803, and the user can click the "noise removal" option button 2803 to perform sound pickup and noise removal processing on the first video file. In response to the user's click operation on the "noise removal" option button 2803, the mobile phone can display a waiting interface while the sound pickup and noise removal processing is performed.
2602. The mobile phone separates the first video file into first image data and first audio data.

The goal of performing sound pickup and denoising on the first video file is to remove the sounds of non-target objects and suppress the background noise (that is, the diffuse field noise), so the mobile phone needs to separate out the audio data in the first video file in order to perform sound pickup and denoising processing on it. Specifically, the image data and audio data in the first video file may be separated into first image data and first audio data. The first image data may be the set of images from the first frame to the last frame in the image stream of the first video file, and the first audio data may be the set of audio from the first frame to the last frame in the audio stream of the first video file.
2603. The mobile phone performs image processing on the first image data to obtain second image data, and processes the first audio data to obtain second audio data.

The mobile phone may process each frame of image in the first image data, for example with image denoising, tone mapping and other processing, to obtain the second image data; the second image data is the set of images obtained after processing each frame of image in the first image data. The mobile phone may also process each frame of audio in the first audio data. For example, the mobile phone may execute the sound signal processing method shown in FIG. 7 above to process each frame of audio in the first audio data, suppressing the sound of non-target objects to highlight the sound of the target object. Alternatively, the mobile phone may execute the sound signal processing method shown in FIG. 13 above to process each frame of audio in the first audio data, suppressing the sound of non-target objects to highlight the sound of the target object, and suppressing the diffuse field noise to reduce the noise energy in the audio signal and improve the signal-to-noise ratio of the audio signal. Alternatively, the mobile phone may execute the sound signal processing method shown in FIG. 17 above to process each frame of audio in the first audio data, suppressing the sound of non-target objects to highlight the sound of the target object, suppressing the diffuse field noise to reduce the noise energy in the audio signal and improve the signal-to-noise ratio of the audio signal, and also smoothing the background noise (that is, the environmental noise) so that the user's sense of hearing is better. In this way, the second audio data can be obtained; the second audio data is the set of audio obtained after processing each frame of audio in the first audio data. It should be understood that steps 2601 to 2603 may correspond to step 400 in FIG. 7, FIG. 13 or FIG. 17.
2604. The mobile phone synthesizes the second image data and the second audio data into a second video file.

The processed second image data and second audio data need to be synthesized into a video file before they can be shared or played by the user. Therefore, after the mobile phone executes the above step 2603 to obtain the second image data and the second audio data, the second image data and the second audio data may be synthesized to form a second video file. At this point, the second video file is the first video file after sound pickup and denoising.

2605. The mobile phone saves the second video file.

For example, the mobile phone can display the file saving tab 3001 shown in FIG. 30. In the file saving tab 3001, the user may be prompted with "The denoising process is complete. Do you want to replace the original file?", and a first option button 3002 and a second option button 3003 are provided for the user to choose. "Yes" may be displayed on the first option button 3002, indicating that the processed second video file will replace the original first video file; "No" may be displayed on the second option button 3003, indicating that the processed second video file will be saved as a separate file without replacing the first video file, that is, both the first video file and the second video file are retained. If the mobile phone detects the user's click operation on the first option button 3002 shown in FIG. 30, the mobile phone responds by replacing the original first video file with the processed second video file. If the mobile phone detects the user's click operation on the second option button 3003 shown in FIG. 30, the mobile phone responds by saving the first video file and the second video file separately.
It should be noted that one frame of image does not necessarily correspond to one frame of audio; in some embodiments, one frame of image corresponds to multiple frames of audio, or multiple frames of image correspond to one frame of audio. For example, one frame of image may correspond to three frames of audio, so when synthesizing in real time as shown in FIG. 23, the mobile phone can synthesize the processed N-th frame of image with the processed audio of the (3N-2)-th, (3N-1)-th and 3N-th frames to obtain the N-th frame of video data.
An embodiment of the present application further provides another sound signal processing method. The method is applied to an electronic device that includes a camera and a microphone. A first target object is within the shooting range of the camera, and a second target object is not within the shooting range of the camera; the first target object being within the shooting range of the camera may mean that the first target object is within the range of the field of view of the camera. The first target object may be the target object in the foregoing embodiments, and the second target object may be non-target object 1 or non-target object 2 in the foregoing embodiments. The method includes:
The electronic device activates the camera and displays a preview interface, where the preview interface includes a first control; the first control may be the video recording button 2102 shown in (b) in FIG. 21A, or the start live broadcast button 2401 shown in (a) in FIG. 24. When a first operation on the first control is detected, shooting starts; the first operation may be the user's click operation on the first control. At a first moment, a shooting interface is displayed; the shooting interface includes a first image, the first image is an image collected by the camera in real time, the first image includes the first target object, and the first image does not include the second target object. The first moment may be any moment during the shooting process, and the first image may be each frame of image in the methods shown in FIG. 20, FIG. 22 and FIG. 23.
At the first moment, the microphone collects a first audio; the first audio includes a first audio signal and a second audio signal, the first audio signal corresponds to the first target object, and the second audio signal corresponds to the second target object. For example, the first audio signal may be the sound signal of the target object, and the second audio signal may be the sound signal of non-target object 1 or of non-target object 2.
When a second operation on the first control of the shooting interface is detected, the electronic device stops shooting in response to the second operation and saves the first video. The first control on the shooting interface may be the end recording button 2103 shown in (a) in FIG. 21B, and the second operation may be the user's click operation on that control. At the first moment, the first video includes the first image and a second audio; the second audio includes the first audio signal and a third audio signal, where the third audio signal is obtained by the electronic device processing the second audio signal, and the energy of the third audio signal is smaller than the energy of the second audio signal. For example, the third audio signal may be the processed sound signal of non-target object 1 or of non-target object 2.
In some embodiments, the first audio further includes a fourth audio signal, which is a diffuse field noise audio signal, and the second audio further includes a fifth audio signal, which is also a diffuse field noise audio signal. The fifth audio signal is obtained by the electronic device processing the fourth audio signal, and the energy of the fifth audio signal is smaller than the energy of the fourth audio signal. For example, the fourth audio signal may be the sound signal of diffuse field noise 1 or of diffuse field noise 2 in the foregoing embodiments, and the fifth audio signal may be the processed sound signal of diffuse field noise 1 or of diffuse field noise 2 obtained after the electronic device executes the method of FIG. 13 or FIG. 17 above.

In some embodiments, the fifth audio signal being obtained by the electronic device processing the fourth audio signal includes: suppressing the fourth audio signal to obtain a sixth audio signal, and compensating the sixth audio signal to obtain the fifth audio signal. The sixth audio signal is a diffuse field noise audio signal; the energy of the sixth audio signal is smaller than the energy of the fourth audio signal, and smaller than the energy of the fifth audio signal. For example, the sixth audio signal may be the sound signal of diffuse field noise 1 obtained after performing the processing shown in FIG. 13, and the fifth audio signal in this case may be the sound signal of diffuse field noise 1 obtained after performing the processing shown in FIG. 17.
An embodiment of the present application further provides an electronic device. The electronic device includes a microphone, a camera, one or more processors, a memory and a communication module. The microphone is used to collect sound signals during video recording or live broadcast, and the camera is used to collect image signals during video recording or live broadcast. The communication module is used for communicating with external devices. One or more computer programs are stored in the memory, and the one or more computer programs include instructions. When the processor executes the instructions, the electronic device is caused to execute the functions or steps performed by the mobile phone in the above method embodiments.
An embodiment of the present application further provides a chip system, which can be applied to foldable electronic devices. The chip system includes at least one processor 3101 and at least one interface circuit 3102, and the processor 3101 and the interface circuit 3102 can be interconnected through wires. The interface circuit 3102 may be used to receive signals from other devices, such as the memory of the electronic device, or may be used to send signals to other devices (such as the processor 3101). For example, the interface circuit 3102 can read the instructions stored in the memory and send the instructions to the processor 3101; when the instructions are executed by the processor 3101, the electronic device can be caused to execute the steps in the foregoing embodiments. Of course, the chip system may also include other discrete devices, which is not specifically limited in the embodiments of the present application.
An embodiment of the present application further provides a computer storage medium. The computer storage medium includes computer instructions which, when run on the above foldable electronic device, cause the electronic device to execute the functions or steps performed by the mobile phone in the above method embodiments. An embodiment of the present application further provides a computer program product which, when run on a computer, causes the computer to execute the functions or steps performed by the mobile phone in the above method embodiments.
Each functional unit in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit. The technical solutions of the embodiments of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a flash memory, a removable hard disk, a read-only memory, a random access memory, a magnetic disk or an optical disk.

Abstract

本申请实施例提供了一种声音信号处理方法及电子设备,能够减少录制视频的声音中的干扰声音信号,提高录制视频的声音信号的质量。该方法应用于电子设备。该方法包括:电子设备获取第一声音信号。第一声音信号为录制视频的声音信号。电子设备对第一声音信号进行处理,得到第二声音信号。电子设备在播放录制的视频文件时,输出第二声音信号。其中,第二声音信号中非目标方位的声音信号的能量低于第一声音信号中非目标方位的声音信号的能量。非目标方位为录制视频时摄像头的视场角范围外的方位。

Description

一种声音信号处理方法及电子设备
本申请要求于2021年08月12日提交国家知识产权局、申请号为202110927121.0、发明名称为“一种声音信号处理方法及电子设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及电子技术领域,尤其涉及一种声音信号处理方法及电子设备。
背景技术
目前,电子设备的录像功能已成为人们经常使用的功能。随着短视频、直播社交软件(如快手、抖音等应用)的发展,录制高质量的视频文件成为了需求。
现有的电子设备在录制视频时会采集电子设备周围的声音信号,但是有一些声音信号属于干扰信号,并不是用户想要的。以前置摄像头录像为例,电子设备在录制用户的自拍短视频或者直播时,电子设备会采集到用户自己的声音,还会采集到周围环境的声音,从而导致电子设备录制的自拍声音不够清晰,存在较多的干扰,电子设备录制的声音质量较低。
发明内容
本申请实施例提供了一种声音信号处理方法及电子设备,能够减少录制视频的声音中的干扰声音信号,提高录制视频的声音信号的质量。
为达到上述目的,本申请提供如下技术方案:
第一方面,本申请实施例提供一种声音信号处理方法。该方法应用于电子设备,电子设备包括摄像头和麦克风。第一目标对象在摄像头的拍摄范围内,第二目标对象不在摄像头的拍摄范围内。其中,第一目标对象在摄像头的拍摄范围内,可以指第一目标对象位于摄像头的视场角的范围内。该方法包括:电子设备启动相机。显示预览界面,预览界面包括第一控件。检测到对第一控件的第一操作。响应于第一操作,开始拍摄。在第一时刻,显示拍摄界面,拍摄界面包括第一图像,第一图像为摄像头实时采集的图像,第一图像包括第一目标对象,第一图像不包括第二目标对象。其中,第一时刻可以是拍摄过程中的任意时刻。在第一时刻,麦克风采集第一音频,第一音频包括第一音频信号和第二音频信号,第一音频信号对应第一目标对象,第二音频信号对应第二目标对象。检测到对拍摄界面的第一控件的第二操作。响应于第二操作,停止拍摄,保存第一视频,其中,第一视频的第一时刻处包括第一图像和第二音频,第二音频包括第一音频信号和第三音频信号,第三音频信号是电子设备对第二音频信号进行处理得到的,第三音频信号的能量小于第二音频信号的能量。
一般而言,用户在使用电子设备录制视频时,电子设备通过麦克风会采集到电子设备周围的声音信号。例如,电子设备会采集到录制视频时摄像头的视场角范围内的声音信号,电子设备也会采集到录制视频时摄像头的视场角范围外的声音信号,电子设备还会采集到环境噪声。在此情况下,录制视频时摄像头的视场角范围外的声音信号以及环境噪声就会成为干扰信号。
示例性地,当电子设备录制到第二目标对象(如非目标对象1或非目标对象)的声音信号(即第二音频信号),则可以降低第二音频信号的能量而得到第三音频信号。如此一来,在本申请实施例中,电子设备可以对录制视频的声音信号(如麦克风采集到的声音信号)进行处理,并降低干扰信号的能量(如第二音频信号的能量),从而使得播放录制的视频文件时,输出的第三音频信号的能量,低于第二音频信号中非目标方位的声音信号的能量,以减少录制视频的声音信号中的干扰声音信号,提高录制视频的声音信号的质量。
在一种可能的实现方式中,第三音频信号是电子设备对第二音频信号进行处理得到的,包括:配置第二音频信号的增益小于1。根据第二音频信号和第二音频信号的增益,得到所述第三音频信号。
在一种可能的实现方式中,第三音频信号是电子设备对第二音频信号进行处理得到的,包括:电子设备计算第二音频信号在目标方位内的概率。其中,目标方位为录制视频时摄像头的视场角范围内的方位。第一目标对象在目标方位内,第二目标对象不在所述目标方位内。电子设备根据第二音频信号在目标方位内的概率,确定第二音频信号的增益。其中,若第二音频信号在目标方位内的概率大于预设概率阈值,则第二音频信号的增益等于1。若第二音频信号在目标方位内的概率小于或等于预设概率阈值,则第二音频信号的增益小于1。电子设备根据第二音频信号的能量和第二音频信号的增益,得到第三音频信号。
在该方案中,电子设备可以根据第二音频信号在目标方位内的概率大小,确定第二音频信号的增益,以便降低第二音频信号的能量,得到第三音频信号。
在一种可能的实现方式中,第一音频还包括第四音频信号,第四音频信号为扩散场噪声音频信号。第二音频还包括第五音频信号,第五音频信号为扩散场噪声音频信号。其中,第五音频信号是电子设备对第四音频信号进行处理得到的,第五音频信号的能量小于第四音频信号的能量。
在一种可能的实现方式中,第五音频信号是电子设备对第四音频信号进行处理得到的,包括:配置第四音频信号的增益小于1。根据第四音频信号的能量和第四音频信号的增益,得到第五音频信号。
在一种可能的实现方式中,第五音频信号是电子设备对第四音频信号进行处理得到的,包括:对第四音频信号进行抑制处理得到第六音频信号。对第六音频信号进行补偿处理得到第五音频信号。其中,第六音频信号为扩散场噪声音频信号,第六音频信号的能量小于第四音频信号,第六音频信号小于所述第五音频信号。
需要说明的是,在对第四音频信号进行处理的过程中,对第四音频信号进行处理得到的第六音频信号的能量可能会非常小,从而使得扩散场噪声不平稳。因此,通过对第六音频信号进行噪声补偿,可以使对第四音频信号进行处理后得到的第五音频信号的能量更加平稳,使用户的听感更佳。
在一种可能的实现方式中,上述方法还包括:在第一时刻,麦克风采集第一音频之后,电子设备处理第一音频得到第二音频。也就是说,电子设备可以在采集到音频信号时对音频信号实时处理。
在一种可能的实现方式中,方法还包括:在响应于第二操作,停止拍摄之后,电 子设备处理第一音频得到第二音频。也就是说,电子设备可以视频文件录制结束时,从视频文件中获取声音信号。然后,对该声音信号按照时间先后顺序一帧一帧处理。
第二方面,本申请实施例提供一种声音信号处理方法。该方法应用于电子设备。该方法包括:电子设备获取第一声音信号。第一声音信号为录制视频的声音信号。电子设备对第一声音信号进行处理,得到第二声音信号。电子设备在播放录制的视频文件时,输出第二声音信号。其中,第二声音信号中非目标方位的声音信号的能量低于第一声音信号中非目标方位的声音信号的能量。非目标方位为录制视频时摄像头的视场角范围外的方位。
一般而言,用户在使用电子设备录制视频时,电子设备通过麦克风会采集到电子设备周围的声音信号。例如,电子设备会采集到录制视频时摄像头的视场角范围内的声音信号,电子设备也会采集到录制视频时摄像头的视场角范围外的声音信号,电子设备还会采集到环境噪声。
在本申请实施例中,电子设备可以对录制视频的声音信号(如麦克风采集到的声音信号)进行处理,抑制该声音信号中的非目标方位的声音信号,使播放录制的视频文件时,输出的第二声音信号中非目标方位的声音信号的能量,低于第一声音信号中非目标方位的声音信号的能量,以减少录制视频的声音信号中的干扰声音信号,提高录制视频的声音信号的质量。
在一种可能的实现方式中,电子设备获取第一声音信号,包括:电子设备响应于第一操作,通过麦克风实时采集第一声音信号。其中,第一操作用于触发电子设备开始录像或者开始直播。
例如,电子设备可以在启动相机的录像功能并开始录像时,通过麦克风实时采集第一声音信号。又例如,电子设备可以在启动直播应用(如抖音、快手)开始视频直播时,通过麦克风实时采集声音信号。在录像或直播的过程中,电子设备每采集一帧声音信号,便处理一帧声音信号。
在一种可能的实现方式中,在电子设备获取第一声音信号之前,上述方法还包括:电子设备录制视频文件。电子设备获取第一声音信号,包括:电子设备响应于视频文件录制结束,从视频文件中获取第一声音信号。
例如，电子设备可以在视频文件录制结束时，从视频文件中获取声音信号。然后，对该声音信号按照时间先后顺序一帧一帧处理。
在一种可能的实现方式中,电子设备获取第一声音信号,包括:电子设备响应于第二操作,从电子设备保存的视频文件中获取第一声音信号。其中,第二操作用于触发电子设备处理视频文件提升视频文件的音质。
例如,电子设备对电子设备本地保存的视频文件中的声音进行处理,当电子设备检测到用户指示处理上述视频文件时(如点击视频文件操作界面中的“去噪处理”选项按钮),电子设备开始获取视频文件的声音信号。并且,对该声音信号按照时间先后顺序一帧一帧处理。
在一种可能的实现方式中，第一声音信号包括多个时频语音信号。电子设备对第一声音信号进行处理，得到第二声音信号，包括：电子设备识别第一声音信号中各个时频语音信号的方位。若第一声音信号中第一时频语音信号的方位为非目标方位，电子设备则降低第一时频语音信号的能量，得到第二声音信号。第一时频语音信号为第一声音信号中多个时频语音信号中的任意一个。
在一种可能的实现方式中,第一声音信号包括多个时频语音信号。电子设备对第一声音信号进行处理,得到第二声音信号,包括:电子设备计算第一声音信号中各个时频语音信号在目标方位内的概率。其中,目标方位为录制视频时摄像头的视场角范围内的方位。电子设备根据第一声音信号中第二时频语音信号在目标方位内的概率,确定第二时频语音信号的增益;其中,第二时频语音信号为第一声音信号中多个时频语音信号中的任意一个;若第二时频语音信号在目标方位内的概率大于预设概率阈值,则第二时频语音信号的增益等于1;若第二时频语音信号在目标方位内的概率小于或等于预设概率阈值,则第二时频语音信号的增益小于1。电子设备根据第一声音信号中每个时频语音信号和对应的增益,得到第二声音信号。
在一种可能的实现方式中,第二声音信号中扩散场噪声的能量低于第一声音信号中扩散场噪声的能量。应理解,通过降低第一声音信号中的非目标方位的声音信号的能量,并不能够降低所有的扩散场噪声。为了保证录制视频的声音信号质量,还需要降低扩散场噪声,以提高录制视频的声音信号的信噪比。
在一种可能的实现方式中，第一声音信号包括多个时频语音信号。电子设备对第一声音信号进行处理，得到第二声音信号，包括：电子设备识别第一声音信号中各个时频语音信号是否为扩散场噪声。若第一声音信号中第三时频语音信号为扩散场噪声，则电子设备降低第三时频语音信号的能量，得到第二声音信号。第三时频语音信号为第一声音信号中多个时频语音信号中的任意一个。
在一种可能的实现方式中,第一声音信号包括多个时频语音信号。电子设备对第一声音信号进行处理,得到第二声音信号,还包括:电子设备识别第一声音信号中各个时频语音信号是否为扩散场噪声。电子设备根据第一声音信号中的第四时频语音信号是否为扩散场噪声,确定第四时频语音信号的增益;其中,第四时频语音信号为第一声音信号中多个时频语音信号中的任意一个;若第四时频语音信号为扩散场噪声,则第四时频语音信号的增益小于1;若第四时频语音信号为相干信号,则第四时频语音信号的增益等于1。电子设备根据第一声音信号中每个时频语音信号和对应的增益,得到第二声音信号。
在一种可能的实现方式中,第一声音信号包括多个时频语音信号;电子设备对第一声音信号进行处理得到第二声音信号,还包括:电子设备计算第一声音信号中各个时频语音信号在目标方位内的概率;其中,目标方位为录制视频时摄像头的视场角范围内的方位。电子设备识别第一声音信号中各个时频语音信号是否为扩散场噪声。电子设备根据第一声音信号中第五时频语音信号在目标方位内的概率,以及第五时频语音信号是否为扩散场噪声,确定第五时频语音信号的增益;其中,第五时频语音信号为第一声音信号中多个时频语音信号中的任意一个;若第五时频语音信号在目标方位内的概率大于预设概率阈值,且第五时频语音信号为相干信号,则第五时频语音信号的增益等于1;若第五时频语音信号在目标方位内的概率大于预设概率阈值,且第五时频语音信号为扩散场噪声,则第五时频语音信号的增益小于1;若第五时频语音信号在目标方位内的概率小于或等于预设概率阈值,则第五时频语音信号的增益小于1。 电子设备根据第一声音信号中每个时频语音信号和对应的增益,得到第二声音信号。
在一种可能的实现方式中,电子设备根据第一声音信号中第五时频语音信号在目标方位内的概率,以及第五时频语音信号是否为扩散场噪声,确定第五时频语音信号的增益,包括:电子设备根据第五时频语音信号在目标方位内的概率,确定第五时频语音信号的第一增益;其中,若第五时频语音信号在目标方位内的概率大于预设概率阈值,则第五时频语音信号的第一增益等于1;若第五时频语音信号在目标方位内的概率小于或等于预设概率阈值,则第五时频语音信号的第一增益小于1。电子设备根据第五时频语音信号是否为扩散场噪声,确定第五时频语音信号的第二增益;其中,若第五时频语音信号为扩散场噪声,则第五时频语音信号的第二增益小于1;若第五时频语音信号为相干信号,则第五时频语音信号的第二增益等于1。电子设备根据第五时频语音信号的第一增益和第二增益,确定第五时频语音信号的增益;其中,第五时频语音信号的增益为第五时频语音信号的第一增益和第二增益的乘积。
在一种可能的实现方式中,若第五时频语音信号为扩散场噪声,且第五时频语音信号的第一增益和第二增益的乘积小于预设增益值,则第五时频语音信号的增益等于预设增益值。
第三方面,本申请实施例提供一种电子设备。该电子设备包括:麦克风;摄像头;一个或多个处理器;存储器;通信模块。其中,麦克风用于采集录像或直播时的声音信号;摄像头用于采集录像或直播时的图像信号。通信模块用于与外接设备通信。存储器中存储有一个或多个计算机程序,一个或多个计算机程序包括指令。当指令被处理器执行时,使得电子设备执行如第一方面及其任一种可能的实现方式所述的方法。
第四方面，本申请实施例提供一种芯片系统，该芯片系统应用于电子设备。该芯片系统包括一个或多个接口电路和一个或多个处理器。该接口电路和处理器通过线路互联。该接口电路用于从电子设备的存储器接收信号，并向处理器发送该信号，该信号包括存储器中存储的计算机指令。当处理器执行所述计算机指令时，电子设备执行如第一方面及其任一种可能的实现方式所述的方法。
第五方面，本申请实施例提供一种计算机存储介质，该计算机存储介质包括计算机指令，当所述计算机指令在电子设备上运行时，使得所述电子设备执行如第一方面及其任一种可能的实现方式所述的方法。
第六方面,本申请实施例提供一种计算机程序产品,当所述计算机程序产品在计算机上运行时,使得所述计算机执行如第一方面及其任一种可能的设计方式所述的方法。
可以理解地,上述提供的第三方面所述的电子设备,第四方面所述的芯片系统,第五方面所述的计算机存储介质,第六方面所述的计算机程序产品所能达到的有益效果,可参考如第一方面及其任一种可能的实现方式中的有益效果,此处不再赘述。
附图说明
图1为本申请实施例提供的声音信号处理方法的应用场景图;
图2为本申请实施例提供的一种电子设备的结构示意图;
图3为本申请实施例提供的一种电子设备的麦克风位置示意图;
图4为本申请实施例提供的声音信号处理方法的流程图一;
图5为本申请实施例提供的一种电子设备的麦克风采集的时域声音信号转换为频域声音信号的对比示意图;
图6为本申请实施例中涉及的频域声音信号的语音帧与频点的对应关系图;
图7为本申请实施例提供的声音信号处理方法的流程图二;
图8为本申请实施例提供的声音信号处理方法中涉及的声音信号的场景图一;
图9为本申请实施例提供的第n帧语音在36个方位的概率分布图;
图10为本申请实施例提供的声音信号处理方法中涉及的声音信号的场景图二;
图11为本申请实施例执行图7所示的方法前后的语音频谱图;
图12为本申请实施例提供的声音信号处理方法的流程图三;
图13为本申请实施例提供的声音信号处理方法的流程图四;
图14为本申请实施例提供的声音信号处理方法中涉及的声音信号的场景图三;
图15为本申请实施例提供的声音信号处理方法中涉及的声音信号的场景图四;
图16为本申请实施例执行图7和图13所示的方法的对比语音频谱图;
图17为本申请实施例提供的声音信号处理方法的流程图五;
图18为本申请实施例提供的声音信号处理方法中涉及的声音信号的场景图五;
图19为本申请实施例执行图13和图17所示的方法的对比语音频谱图;
图20为本申请实施例提供的声音信号处理方法的流程图六；
图21A为本申请实施例提供的声音信号处理方法涉及的界面图一;
图21B为本申请实施例提供的声音信号处理方法的场景图一;
图22为本申请实施例提供的声音信号处理方法的流程图七；
图23为本申请实施例提供的声音信号处理方法的流程图八；
图24为本申请实施例提供的声音信号处理方法涉及的界面图二;
图25为本申请实施例提供的声音信号处理方法的场景图二;
图26为本申请实施例提供的声音信号处理方法的流程图九；
图27为本申请实施例提供的声音信号处理方法涉及的界面图三;
图28为本申请实施例提供的声音信号处理方法涉及的界面图四;
图29为本申请实施例提供的声音信号处理方法涉及的界面图五;
图30为本申请实施例提供的声音信号处理方法涉及的界面图六;
图31为本申请实施例提供的一种芯片系统的结构示意图。
具体实施方式
为了便于理解,示例性的给出了部分与本申请实施例相关概念的说明以供参考。如下所示:
目标对象：在摄像头（如前置摄像头）的视野范围内的对象，如人物、动物等。其中，摄像头的视野范围由摄像头的视场角（field of view，FOV）决定。摄像头的FOV越大，摄像头的视野范围则越大。
非目标对象:不在摄像头的视野范围内的对象。以前置摄像头为例,手机背面的对象则为非目标对象。
扩散场噪声：在视频录制或音频录制的过程中，目标对象或非目标对象发出的声音经过墙面、地面或天花板等反射后形成的噪声。
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述。其中,在本申请实施例的描述中,除非另有说明,“/”表示或的意思,例如,A/B可以表示A或B;本文中的“和/或”仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,在本申请实施例的描述中,“多个”是指两个或多于两个。
以下,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。在本实施例的描述中,除非另有说明,“多个”的含义是两个或两个以上。
目前,随着短视频、直播社交软件(如快手、抖音等应用)的发展,电子设备的录像功能已成为人们经常使用的功能,能够录制高质量的视频文件的电子设备成为了需求。
现有的电子设备在录制视频时会采集电子设备周围的声音信号,但是有一些声音信号属于干扰信号,并不是用户想要的,例如电子设备在使用摄像头(如前置摄像头或后置摄像头)录制拍摄视频时,电子设备会采集到摄像头的视场角FOV内的目标对象的声音,也会采集到摄像头的视场角FOV外的非目标对象的声音,还会采集到一些环境噪声。在此情况下,非目标对象的声音便可能成为了干扰对象,影响电子设备录制的视频的声音质量。
以前置摄像头录像为例,通常情况下,电子设备的前置摄像头用于方便用户自拍录制短视频或小视频。如图1所示,用户在使用电子设备的前置摄像头自拍录制短视频时,电子设备的背面可能会存在小朋友在玩耍(即图1中的非目标对象1)。在用户(即图1中的目标对象)的同侧,也可能会存在其他对象,例如小狗在叫,又例如小女孩在唱歌跳舞(即图1中的非目标对象2)。因此,电子设备在录制拍摄的过程中,不可避免地会录制到非目标对象1或非目标对象2的声音。然而,对于用户来说,用户更期望录制的短视频能够突出自己的声音(即目标对象的声音),抑制非自己的声音,例如图1中非目标对象1和非目标对象2发出的声音。
此外,由于拍摄环境的影响,在电子设备拍摄的过程中,电子设备录制的短视频中可能会存在很多由环境引起的噪声,如图1所示的扩散场噪声1和扩散场噪声2,从而使得电子设备录制的短视频存在较大且刺耳的噪声,对目标对象的声音产生干扰,影响录制的短视频的声音质量。
为解决上述问题,本申请实施例提供了一种声音信号处理方法,可以应用于电子设备,能够抑制自拍录像时非镜头内的语音,提升自拍语音的信噪比。以图1所示的录像拍摄场景为例,本申请实施例提供的声音信号处理方法可以去除非目标对象1和非目标对象2的声音,保留目标对象的声音,并且还可以降低扩散场噪声1和扩散场噪声2对目标对象的声音的影响,从而提高录像后目标对象的音频信号的信噪比,平滑背景噪声,使用户听感更佳。
本申请实施例提供的声音信号处理方法,可以用于电子设备的前置摄像头的录像拍摄,也可以用于电子设备的后置摄像头的录像拍摄。该电子设备可以是手机、平板电脑、可穿戴设备(例如智能手表)、车载设备、增强现实(augmented reality,AR) /虚拟现实(virtual reality,VR)设备、笔记本电脑、超级移动个人计算机(ultra-mobile personal computer,UMPC)、上网本或个人数字助理(personal digital assistant,PDA)等移动终端,也可以是专业的相机等设备,本申请实施例对电子设备的具体类型不作任何限制。
示例性的,图2示出了电子设备100的一种结构示意图。电子设备100可以包括处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,充电管理模块140,电源管理模块141,电池142,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,传感器模块180,按键190,马达191,指示器192,摄像头193,显示屏194,以及用户标识模块(subscriber identification module,SIM)卡接口195等。其中传感器模块180可以包括压力传感器180A,陀螺仪传感器180B,气压传感器180C,磁传感器180D,加速度传感器180E,距离传感器180F,接近光传感器180G,指纹传感器180H,温度传感器180J,触摸传感器180K,环境光传感器180L,骨传导传感器180M等。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,存储器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
其中,控制器可以是电子设备100的神经中枢和指挥中心。控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。
处理器110中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器110中的存储器为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据,可从存储器中直接调用。避免了重复存取,减少了处理器110的等待时间,因而提高了系统的效率。
电子设备100通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
显示屏194用于显示图像，视频等。显示屏194包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD)，有机发光二极管(organic light-emitting diode,OLED)，有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode,AMOLED)，柔性发光二极管(flex light-emitting diode,FLED)，Mini-LED，Micro-LED，Micro-OLED，量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中，电子设备100可以包括1个或N个显示屏194，N为大于1的正整数。在本申请的实施例中，显示屏194可以用于显示拍摄模式下的预览界面和拍摄界面等。
电子设备100可以通过ISP，摄像头193，视频编解码器，GPU，显示屏194以及应用处理器等实现拍摄功能。
ISP用于处理摄像头193反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度,肤色进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头193中。
摄像头193用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。在一些实施例中,电子设备100可以包括1个或N个摄像头193,N为大于1的正整数。
此外，摄像头193还可以包括用于测量待拍摄对象的物距的深度摄像头，以及其他摄像头。例如，深度摄像头可以包括三维(3 dimensions,3D)深感摄像头、飞行时间(time of flight,TOF)深度摄像头或双目深度摄像头等。
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当电子设备100在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。
视频编解码器用于对数字视频压缩或解压缩。电子设备100可以支持一种或多种视频编解码器。这样,电子设备100可以播放或录制多种编码格式的视频,例如:动态图像专家组(moving picture experts group,MPEG)1,MPEG2,MPEG3,MPEG4等。
内部存储器121可以用于存储计算机可执行程序代码,可执行程序代码包括指令。处理器110通过运行存储在内部存储器121的指令,从而执行电子设备100的各种功能应用以及数据处理。内部存储器121可以包括存储程序区和存储数据区。其中,存储程序区可存储操作系统,至少一个功能所需的应用程序(比如声音播放功能,图像播放功能等)等。存储数据区可存储电子设备100使用过程中所创建的数据(比如音频数据,电话本等)等。此外,内部存储器121可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件,闪存器件,通用闪存存储器(universal flash storage,UFS)等。在另一些实施例中,处理器110通过运行存储在内部存储器121的指令,和/或存储在设置于处理器中的存储器的指令,来使得电子设备100执行本申请实施例中提供的方法,以及各种功能应用和数据处理。
电子设备100可以通过音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,以及应用处理器等实现音频功能。例如音乐播放,录音等。
音频模块170用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。音频模块170还可以用于对音频信号编码和解码。在一些实施例中,音频模块170可以设置于处理器110中,或将音频模块170的部分功能模块设置于处理器110中。
扬声器170A，也称“喇叭”，用于将音频电信号转换为声音信号。电子设备100可以通过扬声器170A收听音乐，或收听免提通话。
受话器170B,也称“听筒”,用于将音频电信号转换成声音信号。当电子设备100接听电话或语音信息时,可以通过将受话器170B靠近人耳接听语音。
麦克风170C,也称“话筒”,“传声器”,用于将声音信号转换为电信号。当拨打电话、发送语音信息或录制音视频文件时,用户可以通过人嘴靠近麦克风170C发声,将声音信号输入到麦克风170C。电子设备100可以设置一个或多个麦克风170C,例如,电子设备100可以设置三个,四个或更多麦克风170C,实现采集声音信号,降噪,还可以识别声音来源方向,实现定向录音功能以及抑制非目标方向的声音等。
耳机接口170D用于连接有线耳机。耳机接口170D可以是USB接口130,也可以是3.5mm的开放移动电子设备平台(open mobile terminal platform,OMTP)标准接口,美国蜂窝电信工业协会(cellular telecommunications industry association of the USA,CTIA)标准接口。
触摸传感器180K,也称“触控面板”。触摸传感器180K可以设置于显示屏194,由触摸传感器180K与显示屏194组成触摸屏,也称触控屏。触摸传感器180K用于检测作用于其上或附近的触摸操作。触摸传感器可以将检测到的触摸操作传递给应用处理器,以确定触摸事件类型。可以通过显示屏194提供与触摸操作相关的视觉输出。在另一些实施例中,触摸传感器180K也可以设置于电子设备100的表面,与显示屏194所处的位置不同。
可以理解的是,本申请实施例示意的结构并不构成对电子设备100的具体限定。在本申请另一些实施例中,电子设备100可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
下面以电子设备为手机300,并且以电子设备的前置摄像头录制拍摄为例,对本申请实施例提供的声音信号处理方法进行详细说明。
示例性地,如图3所示,手机300中设置有三个麦克风,分别为顶麦301、底麦302和背麦303,用于在用户拨打电话、录制音视频文件时采集用户的声音信号。在手机录音或录制视频的过程中,顶麦301、底麦302和背麦303分别采集用户所处的录制环境的声音信号(包括目标对象的声音、非目标对象的声音,以及环境产生的噪声)。例如,当手机300以图3所示的方向录制视频采集音频时,顶麦301可以采集左声道声音信号,底麦302可以采集右声道声音信号。相反地,当手机300在图3所示的方向旋转180度之后录制视频采集音频时,顶麦301可以采集右声道语音信号,底麦302可以采集左声道语音信号。对此,手机300中的三个麦克风分别采集哪个声道的语音信号,可以依使用场景的不同而不同,本申请实施例中的上述描述仅仅为示意,不构成限定。
此外,背麦303采集的声音信号可以与顶麦301以及底麦302采集的声音信号结合,用于确定手机采集的声音信号的方位。
以图1所示的录像拍摄场景为例，手机300通过图3所示的三个麦克风（即顶麦301、底麦302和背麦303）可以采集到目标对象的声音、非目标对象1的声音、非目标对象2的声音，以及上述目标对象、非目标对象的声音经过环境的反射形成的扩散场噪声1和扩散场噪声2等。
应理解,在图1所示录像拍摄场景中,用户主要的拍摄对象为自己(例如,图1所示的目标对象),因此用户在录像拍摄过程中并不希望采集到非目标对象(如图1所示的非目标对象1和非目标对象2)的声音。本申请实施例中的声音信号处理方法,可以使手机对采集到的声音信号进行处理,抑制非目标对象的声音信号,突出目标对象的声音,提高拍摄的视频中声音的质量。
在一些实施例中,如图4所示,本申请实施例提供的一种声音信号处理方法包括:
400、手机获取声音信号。
示例性地,用户在进行录像拍摄的过程中,手机可以通过如图3所示的三个麦克风(即顶麦301、底麦302和背麦303)采集声音信号。以下实施例中,声音信号也可以称为音频信号。
通常情况下，如图5中的(a)所示，麦克风采集的声音信号是时域信号，用于表征声音信号的幅度随时间的变化情况。为了便于对麦克风采集的声音信号进行分析和处理，可以将麦克风采集的声音信号通过傅里叶变换，如快速傅里叶变换(fast Fourier transform,FFT)或离散傅里叶变换(discrete Fourier transform,DFT)，转换为频域信号，例如图5中的(b)所示。在图5中的(a)中，时域信号采用时间/幅度表示，其中横坐标为采样时间，纵坐标为声音信号的幅度。上述图5中的(a)所示的声音信号经过FFT或DFT变换后，可以转换为如图5中的(b)所示的语音频谱图对应的频域信号。在该语音频谱图中，横坐标为时间、纵坐标为频率，横坐标与纵坐标的坐标点值为声音信号能量。例如，图5中的(b)中，时频信号位置511处的声音数据能量较高，时频信号位置512（上边的白方框，由于图中无法清晰示出，故特此解释说明）处的声音数据能量较低。
应理解,声音信号的幅值越大,则声音信号的能量越大,声音的分贝也就越高。
需要说明的是,在麦克风采集的声音信号通过FFT或DFT变换为频域信号时,会对声音信号进行分帧,并分别处理每一帧声音信号。一帧声音信号可以通过FFT或DFT变换为包括多个(如1024个、512个)频域采样点(即频点)的频域信号。如图6所示,麦克风采集的声音信号转换为1024个频点的频域信号后,可以用多个时频点来表示上述麦克风采集的声音信号。例如,图6中的一个方框代表一个时频点。图6的横坐标表示声音信号的帧数(可称为语音帧),纵坐标表示声音信号的频点。
在上述三个麦克风，即顶麦301、底麦302和背麦303分别采集的声音信号转换为频域信号后，可以用X_L(t,f)表示左声道时频语音信号，即代表顶麦301采集的左声道声音信号中，不同时频点对应的声音信号。类似地，可以用X_R(t,f)表示右声道时频语音信号，即代表底麦302采集的右声道声音信号中，不同时频点对应的声音信号；可以用X_后(t,f)表示左右声道时频语音信号，即代表背麦303采集的左右声道环绕立体声音信号中，不同时频点对应的声音信号。其中，t表示声音信号的帧数（可称为语音帧），f表示声音信号的频点。
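为便于理解上述由时域到时频域的转换过程，下面给出一段Python示意代码（仅为最小化示例，帧长1024、帧移512、采样率48kHz以及测试信号均为假设值，并非本申请的实际实现）：

```python
import numpy as np

def stft(x, frame_len=1024, hop=512):
    """对时域信号x分帧、加窗并做FFT，返回形状为(帧数, 频点数)的复数时频矩阵。"""
    window = np.hanning(frame_len)
    n_frames = (len(x) - frame_len) // hop + 1
    frames = np.stack([x[i * hop: i * hop + frame_len] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)  # 每帧得到frame_len//2+1个频点

fs = 48000                                       # 假设采样率为48kHz
t_axis = np.arange(fs) / fs                      # 1秒测试信号
x_top = np.sin(2 * np.pi * 440 * t_axis)         # 顶麦（如左声道）
x_bot = np.sin(2 * np.pi * 440 * t_axis + 0.1)   # 底麦（如右声道）
x_back = 0.1 * np.random.randn(fs)               # 背麦

X_L, X_R, X_back = stft(x_top), stft(x_bot), stft(x_back)
print(X_L.shape)  # (92, 513)，即92帧、每帧513个频点
```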
401、手机对采集的声音信号进行处理,抑制非目标方位的声音信号。
由上文描述可知，手机采集的声音信号包括目标对象的声音和非目标对象的声音。然而，非目标对象的声音并非是用户想要的，需要抑制。通常情况下，目标对象处于摄像头（如前置摄像头）的视场角内，摄像头的视场角可以作为本申请实施例的目标方位，因此上述非目标方位是指不在摄像头（如前置摄像头）的视场角内的方位。
示例性地，抑制非目标方位的声音信号，可以依次通过声源方位概率计算和目标方位概率计算，实现声音信号的时频点增益计算。可以通过声音信号的时频点增益的大小不同，区分目标方位的声音信号和非目标方位的声音信号。如图7所示，抑制非目标方位的声音信号可以包括声源方位概率计算、目标方位概率计算和时频点增益g_mask(t,f)计算三个计算过程，具体如下：
(1)声源方位概率计算
示例性地,在本申请实施例中,假设手机进行视频录制拍摄时,手机屏幕正前方的方向为0°方向,手机屏幕正后方的方向为180°方向,手机屏幕的正右方向为90°方向,手机屏幕的正左方向为270°方向。
在本申请实施例中，手机屏幕前后左右形成的360°的空间方位可以分为多个空间方位。例如，可以以10°为空间方位间隔，将360°的空间方位分为36个空间方位，具体如下表1。
表1 360°空间方位分为36个空间方位的方向对照表
（原文此处为表1图片，未能完整恢复。可恢复的信息为：36个空间方位以10°为间隔划分，序号k对应方位角[(k-1)×10°, k×10°)，例如序号1对应[0°,10°)，序号32对应[310°,320°)，序号36对应[350°,360°)。）
以手机采用前置摄像头录像拍摄为例,假设手机的前置摄像头的视场角FOV的角度为[310°,50°],则目标方位为手机屏幕前后左右形成的360°空间方位中的[310°,50°]方位。手机拍摄的目标对象通常位于前置摄像头的视场角FOV的角度范围内,即位于目标方位内。抑制非目标方位的声音信号是指,抑制位于前置摄像头的视场角FOV的角度范围之外的对象的声音,例如图1所示的非目标对象1和非目标对象2。
对于环境噪声(如扩散场噪声1和扩散场噪声2)来说,环境噪声可能在目标方位内,也可能在非目标方位内。需要说明的是,扩散场噪声1和扩散场噪声2实质上可以是同一种噪声。在本申请实施例中,为了区别说明,以扩散场噪声1作为非目标方位内的环境噪声,以扩散场噪声2作为目标方位内的环境噪声。
示例性地,以图1所示的拍摄场景为例,如图8所示为目标对象、非目标对象1、非目标对象2、扩散场噪声1和扩散场噪声2所在的空间方位示意图。例如,目标对象的空间方位大约为340°方向,非目标对象1的空间方位大约为150°方向,非目标对象2的空间方位大约为60°方向,扩散场噪声1的空间方位大约为230°方向,扩散场噪声2的空间方位大约为30°方向。
手机在录制拍摄目标对象时，麦克风（如图3中的顶麦301、底麦302和背麦303）会采集到目标对象的声音、非目标对象（如非目标对象1和非目标对象2）的声音以及环境噪声（如扩散场噪声1和扩散场噪声2）。采集的声音信号可以通过FFT或DFT转换为频域信号（下文统称为时频语音信号），分别为左声道时频语音信号X_L(t,f)、右声道时频语音信号X_R(t,f)和背部混合声道时频语音信号X_后(t,f)。
三个麦克风采集的时频语音信号X_L(t,f)、X_R(t,f)和X_后(t,f)，可以合成为一个时频语音信号X(t,f)。时频语音信号X(t,f)可以输入至声源方位概率计算模型，计算得到输入的时频语音信号在各方位存在的概率P_k(t,f)。其中，t代表声音的帧数（即语音帧），f代表频点，k为空间方位序号。声源方位概率计算模型用于计算声源的方位概率，例如，声源方位概率计算模型可以为复角中心高斯混合模型（Complex Angular Central Gaussian Mixture Model，cACGMM）。以空间方位个数K为36为例，1≤k≤36，且
∑_{k=1}^{K} P_k(t,f)=1。
也就是说,在同一帧、同一频点上声音信号在36个方位存在的概率总和为1。如图9所示,为第n帧声音信号中1024个频点对应的声音信号在36个空间方位上存在的概率示意图。图9中,每一个小方框代表第n帧语音信号中某一个频点对应的声音信号,在某一个空间方位上的概率。图9中的虚线框中表示第n帧语音信号中的频点3对应的声音信号在36个空间方位上的概率总和为1。
示例性地,假设手机的前置摄像头的视场角FOV为[310°,50°],因目标对象属于手机想要录制拍摄的对象,因此目标对象通常均位于前置摄像头的视场角FOV内。这样一来,目标对象的声音信号来自[310°,50°]方位的概率最大,具体的概率分布可以为表2中的示例。
表2目标对象的音源在36个空间方位的概率分布
（原文此处为表2图片，未能完整恢复。可恢复的信息为：目标对象的声音信号的概率集中在目标方位[310°,50°]内，即序号32~36、1~6对应的方位，其中非零概率为0.4、0.3、0.3，其余方位的概率为0，各方位概率总和为1。）
对于非目标对象(如非目标对象1或非目标对象2)来说,通常情况下,非目标对象出现在前置摄像头的视场角FOV的概率较小,可能会低于0.5,甚至为0。
对于环境噪声(如扩散场噪声1和扩散场噪声2)来说,由于扩散场噪声1为非目标方位内的环境噪声,因此扩散场噪声1出现在前置摄像头的视场角FOV的概率小,可能会低于0.5,甚至为0。由于扩散场噪声2为目标方位内的环境噪声,因此扩散场噪声2出现在前置摄像头的视场角FOV的概率较大,可能会高于0.8,甚至为1。
应理解,上述关于目标对象、非目标对象以及环境噪声在目标方位出现的概率均为举例,并不对本申请实施例构成限定。
(2)目标方位概率计算
目标方位概率计算是指:上述时频语音信号在目标方位内的各方位存在的概率总和,也可以称为目标方位的空间聚类概率。因此,上述时频语音信号在目标方位的空间聚类概率P(t,f)可以通过如下公式(一)进行计算:
P(t,f)=∑_{k=k1}^{k2} P_k(t,f)；　　　　　　　　公式（一）
其中，k1~k2为目标方位的角度索引，也可以为目标方位的空间方位序号（当目标方位跨越0°方向时，求和相应地在序号上回绕）。P_k(t,f)为当前时频语音信号在k方位上存在的概率；P(t,f)为当前时频语音信号在目标方位存在的概率总和。
示例性地,依然以手机屏幕的正前方为0°方向,手机的前置摄像头的视场角FOV为[310°,50°],即目标方位为[310°,50°]为例。
对于目标对象，以上述表2所示的目标对象的声音在36个空间方位的概率分布为例，k1~k2对应序号32、33、34、35、36、1、2、3、4、5、6，这些方位上的概率总和即为目标对象的时频语音信号在目标方位存在的概率总和P(t,f)，为0.4+0.3+0.3=1。
采用类似的计算方法，可以计算非目标对象的时频语音信号在目标方位存在的概率总和P(t,f)，也可以计算环境噪声（如扩散场噪声1和扩散场噪声2）的时频语音信号在目标方位存在的概率总和P(t,f)。
对于非目标对象,非目标对象的时频语音信号在目标方位存在的概率总和P(t,f)可能小于0.5,甚至为0。
对于环境噪声,如扩散场噪声1,扩散场噪声1为非目标方位内的环境噪声,扩散场噪声1的时频语音信号在目标方位存在的概率较小,因此扩散场噪声1的时频语音信号在目标方位存在的概率总和P(t,f)可能小于0.5,甚至为0。
对于环境噪声,如扩散场噪声2,扩散场噪声2为目标方位内的环境噪声,扩散场噪声2的时频语音信号在目标方位存在的概率较大,因此扩散场噪声2的时频语音信号在目标方位存在的概率总和P(t,f)可能大于0.8,甚至为1。
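下面用一段Python示意代码演示公式（一）的目标方位概率求和（其中方位概率P_k(t,f)用随机的归一化概率代替，实际中应由cACGMM等声源方位概率计算模型给出；目标方位序号沿用上文[310°,50°]的示例）：

```python
import numpy as np

def target_prob(P, k_idx):
    """P: 形状为(K, 帧数, 频点数)的方位概率；k_idx: 目标方位的序号列表（已处理跨0°的回绕）。
    返回每个时频点在目标方位内存在的概率总和P(t,f)。"""
    return P[k_idx].sum(axis=0)

K, T, F = 36, 10, 513
# 用Dirichlet分布构造随机概率，保证每个时频点对36个方位求和为1
P = np.random.dirichlet(np.ones(K), size=(T, F)).transpose(2, 0, 1)  # (K, T, F)
k_idx = [31, 32, 33, 34, 35, 0, 1, 2, 3, 4, 5]   # 序号32~36、1~6对应的0起始下标
P_target = target_prob(P, k_idx)                  # 形状(T, F)
```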
（3）时频点增益g_mask(t,f)计算
由上文描述可知，抑制非目标方位的声音信号的主要目的是要保留目标对象的声音信号，并且抑制非目标对象的声音信号。通常情况下，目标对象均处于前置摄像头的视场角FOV内，因此目标对象的声音信号大都来自于目标方位，即目标对象的声音出现在目标方位的概率通常较大。相反地，非目标对象通常都不会处于前置摄像头的视场角FOV内，因此非目标对象的声音信号大都来自于非目标方位，即非目标对象的声音出现在目标方位的概率通常较小。
基于此，可以通过上述目标方位聚类概率P(t,f)实现当前时频点增益g_mask(t,f)计算，具体可参考如下公式（二）：
若P(t,f)＞P_th，则g_mask(t,f)=1；若P(t,f)≤P_th，则g_mask(t,f)=g_mask-min。　　　　公式（二）
其中，P_th为预设概率阈值，可通过参数进行配置，例如P_th设置为0.8；g_mask-min为当前时频语音信号位于非目标方位时的时频点增益，可通过参数配置，例如g_mask-min设置为0.2。
当当前时频语音信号在目标方位存在的概率总和P(t,f)大于概率阈值P_th时，可以认为当前时频语音信号在目标方位内，即当前时频语音信号的时频点增益g_mask(t,f)=1。相应地，当时频语音信号在目标方位存在的概率总和P(t,f)小于或等于概率阈值P_th时，可以认为当前时频语音信号不在目标方位内。在此情况下，可以将设置的参数g_mask-min，作为当前时频语音信号不在目标方位时的时频点增益g_mask(t,f)，例如g_mask(t,f)=g_mask-min=0.2。
如此一来，若当前时频语音信号在目标方位内，则当前时频语音信号最大可能性来自于目标对象，因此将当前时频语音信号在目标方位内时的时频点增益g_mask(t,f)配置为1，能够最大程度保留目标对象的声音。若当前时频语音信号不在目标方位内，则当前时频语音信号最大可能性来自于非目标对象（如非目标对象1或非目标对象2），因此将当前时频语音信号不在目标方位内时的时频点增益g_mask(t,f)配置为0.2，能够有效抑制非目标对象（如非目标对象1或非目标对象2）的声音。
应理解，对于环境噪声的时频语音信号，可能存在于目标方位内，如扩散场噪声2；也可能存在于非目标方位内，如扩散场噪声1。因此，环境噪声，如扩散场噪声2的时频语音信号的时频点增益g_mask(t,f)较大可能为1；环境噪声，如扩散场噪声1的时频语音信号的时频点增益g_mask(t,f)较大可能为g_mask-min，如0.2。也就是说，上述抑制非目标方位的声音的处理并不能抑制目标方位内环境噪声的能量。
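与公式（二）对应的时频点增益计算可以用如下Python代码示意（阈值P_th=0.8与最小增益g_mask-min=0.2沿用上文的参数示例）：

```python
import numpy as np

def mask_gain(P_target, p_th=0.8, g_mask_min=0.2):
    """公式（二）：目标方位概率P(t,f)大于阈值P_th的时频点，增益取1；否则取g_mask-min。"""
    return np.where(P_target > p_th, 1.0, g_mask_min)

# 示例：三个时频点在目标方位内的概率分别为1.0、0.3、0.9
print(mask_gain(np.array([1.0, 0.3, 0.9])))  # -> [1.  0.2 1. ]
```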
402、输出处理后的声音信号。
通常情况下，手机具有两个扬声器，分别为手机屏幕顶部的扬声器（下文称为扬声器1）和手机底部的扬声器（下文称为扬声器2）。其中，当手机输出音频（即声音信号）时，扬声器1可以用于输出左声道音频信号，扬声器2可以用于输出右声道音频信号。当然，当手机输出音频时，扬声器1也可以用于输出右声道音频信号，扬声器2可以用于输出左声道音频信号。对此，本申请实施例不做特殊限定。可以理解的，当电子设备仅有一个扬声器时，可以输出(左声道音频信号+右声道音频信号)/2，也可以输出(左声道音频信号+右声道音频信号)，还可以将左、右声道音频信号进行融合后输出，本申请不进行限定。
为使手机录制拍摄的音频信号能够被扬声器1和扬声器2输出，经过上述方法对声音信号进行处理后，输出的声音信号可以分为左声道音频输出信号Y_L(t,f)和右声道音频输出信号Y_R(t,f)。
示例性地，根据上述计算得到的各类声音信号的时频点增益g_mask(t,f)，并结合麦克风采集的声音输入信号，如左声道时频语音信号X_L(t,f)或右声道时频语音信号X_R(t,f)，可以得到经过抑制非目标对象的声音处理后的声音信号，如Y_L(t,f)和Y_R(t,f)。具体地，处理后输出的声音信号Y_L(t,f)和Y_R(t,f)可以通过如下公式（三）和公式（四）分别计算：
Y_L(t,f)=X_L(t,f)*g_mask(t,f)；　　　　　　　　公式（三）
Y_R(t,f)=X_R(t,f)*g_mask(t,f)。　　　　　　　　公式（四）
例如，对于目标对象，左声道音频输出信号Y_L(t,f)的能量等于左声道时频语音信号X_L(t,f)的能量；右声道音频输出信号Y_R(t,f)的能量等于右声道时频语音信号X_R(t,f)的能量，即如图10所示的目标对象的声音信号得到完整的保留。
对于非目标对象，左声道音频输出信号Y_L(t,f)的能量等于左声道时频语音信号X_L(t,f)的能量的0.2倍；右声道音频输出信号Y_R(t,f)的能量等于右声道时频语音信号X_R(t,f)的能量的0.2倍，即如图10所示的非目标对象的声音信号得到有效抑制。
对于环境噪声，如处于目标方位内的扩散场噪声2，左声道音频输出信号Y_L(t,f)的能量等于左声道时频语音信号X_L(t,f)的能量；右声道音频输出信号Y_R(t,f)的能量等于右声道时频语音信号X_R(t,f)的能量，即如图10所示的扩散场噪声2的声音信号并未得到抑制。
对于环境噪声，如处于非目标方位内的扩散场噪声1，左声道音频输出信号Y_L(t,f)的能量等于左声道时频语音信号X_L(t,f)的能量的0.2倍；右声道音频输出信号Y_R(t,f)的能量等于右声道时频语音信号X_R(t,f)的能量的0.2倍，即如图10所示的扩散场噪声1的声音信号得到有效抑制。
综上所述，如图10所示，经过抑制非目标方位的声音信号中的时频点增益g_mask(t,f)计算，在目标方位内的时频语音信号（如目标对象的时频语音信号，以及扩散场噪声2的时频语音信号）均得到完整保留，不在目标方位内的时频语音信号（如非目标对象1的时频语音信号、非目标对象2的时频语音信号、扩散场噪声1的时频语音信号）得到了有效抑制。例如，图11中的(a)为经过上述401处理之前的时频语音信号，图11中的(b)为经过上述401处理之后的时频语音信号。其中，图11中的(a)和图11中的(b)中的方框处为非目标方位的时频语音信号。对比图11中的(a)和图11中的(b)可以看出，经过上述401抑制非目标方位的声音信号的处理后，非目标方位的时频语音信号得到抑制，即方框内的时频语音信号的能量明显降低。
应理解，经过上述时频点增益g_mask(t,f)计算输出的时频语音信号，仅仅抑制了非目标方位的时频语音信号，但目标方位内可能还存在环境噪声（如扩散场噪声2），使得输出的时频语音信号的环境噪声仍然较大，输出的声音信号的信噪比较小，语音信号质量较低。
基于此,在另一些实施例中,本申请实施例提供的声音信号处理方法还可以通过抑制扩散场噪声,来提高输出的语音信号的信噪比,提高语音信号的清晰度。如图12所示,本申请实施例提供的声音信号处理方法可以包括400-401、1201-1203。
1201、对采集的声音信号进行处理,抑制扩散场噪声。
在上述400采集声音信号之后，便可以执行1201对采集的声音信号进行处理，抑制扩散场噪声。示例性地，抑制扩散场噪声可以经过相干扩散比（coherent-to-diffuse power ratio，CDR）计算，实现抑制扩散场噪声时的时频点增益g_cdr(t,f)计算。通过时频点增益g_cdr(t,f)的大小不同，区分声音信号中的相干信号（如目标对象的声音信号和非目标对象的声音信号）和扩散场噪声。如图13所示，抑制扩散场噪声可以包括相干扩散比CDR计算和时频点增益g_cdr(t,f)计算两个计算过程，具体如下：
(1)相干扩散比CDR计算
相干扩散比（coherent-to-diffuse power ratio，CDR）是指相干信号（即目标对象或非目标对象的语音信号）与扩散场噪声的功率比例。将上述左声道时频语音信号X_L(t,f)、右声道时频语音信号X_R(t,f)和背部混合声道时频语音信号X_后(t,f)采用现有技术，如去混响的相干扩散功率比估计（Coherent-to-Diffuse Power Ratio Estimation for Dereverberation），实现相干扩散比CDR(t,f)的计算。
示例性地，在上述图1所示的拍摄场景中，对于目标对象的声音信号来说，目标对象的声音信号的相干扩散比CDR(t,f)为无穷大∞。对于非目标对象（如非目标对象1或非目标对象2）的声音信号来说，非目标对象的声音信号的相干扩散比也为无穷大∞。对于扩散场噪声（如扩散场噪声1或扩散场噪声2）来说，扩散场噪声的相干扩散比为0。
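作为参考，下面给出一个极简的CDR估计示意（Python）。注意这只是假设相干信号到达两个麦克风的时延近似为零时的一种简化估计，实际实现可采用上文提到的去混响相干扩散功率比估计等方法；其中麦克间距、声速等数值均为假设：

```python
import numpy as np

def cdr_estimate(X1, X2, gamma_d, eps=1e-12):
    """由两路时频信号X1、X2（形状为(帧数, 频点数)）估计各频点的相干扩散比（极简示意）。
    gamma_d为理想扩散场的实值相干函数；功率谱沿时间帧做平均以近似期望。
    假设相干信号到达两麦的时延近似为0，则CDR ≈ (gamma_d - Re{Γx}) / (Re{Γx} - 1)。"""
    phi11 = np.mean(np.abs(X1) ** 2, axis=0)           # 通道1自功率谱
    phi22 = np.mean(np.abs(X2) ** 2, axis=0)           # 通道2自功率谱
    phi12 = np.mean(X1 * np.conj(X2), axis=0)          # 互功率谱
    gamma_x = phi12 / (np.sqrt(phi11 * phi22) + eps)   # 复相干函数
    cdr = (gamma_d - np.real(gamma_x)) / (np.real(gamma_x) - 1.0 - eps)
    return np.maximum(cdr, 0.0)   # 相干信号处CDR趋于很大，纯扩散场噪声处趋于0

# 理想扩散场相干函数的一个常见取法：双麦间距d=0.02m、声速c=343m/s时
# f_axis = np.fft.rfftfreq(1024, 1/48000); gamma_d = np.sinc(2 * f_axis * 0.02 / 343)
```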
（2）时频点增益g_cdr(t,f)计算
由上文描述可知，抑制扩散场噪声的主要目的是保留相干信号（如目标对象）的声音，降低扩散场噪声的能量。
示例性地，可以通过相干扩散比CDR(t,f)确定相干信号（即目标对象的声音信号、非目标对象的声音信号）的时频点增益g_cdr(t,f)，以及非相干信号（即扩散场噪声）的时频点增益g_cdr(t,f)，即通过时频点增益g_cdr(t,f)，区分相干信号或非相干信号。
示例性地，可以将相干信号的时频点增益g_cdr(t,f)保留为1，将非相干信号的时频点增益g_cdr(t,f)降低，如设置为0.3。这样，便能够保留目标对象的声音信号，并抑制扩散场噪声，以降低扩散场噪声的能量。
例如，可以采用如下公式（五）计算时频点增益g_cdr(t,f)：
g_cdr(t,f)=max(g_cdr-min，1−μ/(CDR(t,f)+1))；　　　　公式（五）
其中，g_cdr-min为抑制扩散场噪声后的最小增益，可通过参数配置，例如可以设置g_cdr-min为0.3。g_cdr(t,f)为抑制扩散场噪声后的时频点增益。μ为过估因子，可通过参数配置，例如设置μ为1。
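公式（五）的计算可以用如下Python代码示意（g_cdr-min=0.3、μ=1沿用上文的参数示例）：

```python
import numpy as np

def cdr_gain(cdr, mu=1.0, g_cdr_min=0.3):
    """公式（五）：g_cdr(t,f)=max(g_cdr-min, 1-mu/(CDR(t,f)+1))。
    CDR趋于无穷大时增益趋于1（保留相干信号）；CDR为0时增益为g_cdr-min（抑制扩散场噪声）。"""
    return np.maximum(g_cdr_min, 1.0 - mu / (cdr + 1.0))

print(cdr_gain(np.array([np.inf, 0.0])))  # -> [1.  0.3]
```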
如此一来，对于目标对象，由于目标对象的声音信号的相干扩散比CDR(t,f)为无穷大∞，代入上述公式（五），则目标对象的声音信号抑制扩散场噪声后的时频点增益g_cdr(t,f)=1。
对于非目标对象（如非目标对象1和非目标对象2），由于非目标对象的声音信号的相干扩散比CDR(t,f)也为无穷大∞，代入上述公式（五），则非目标对象的声音信号抑制扩散场噪声后的时频点增益g_cdr(t,f)=1。
对于扩散场噪声（如扩散场噪声1和扩散场噪声2），由于扩散场噪声的相干扩散比CDR(t,f)为0，代入上述公式（五），则扩散场噪声的时频点增益g_cdr(t,f)=0.3。
由此可见，如图14所示，经过时频点增益g_cdr(t,f)计算，相干信号（如目标对象的声音信号和非目标对象的声音信号）的时频点增益g_cdr(t,f)为1，则能够完整保留相干信号。然而，经过时频点增益g_cdr(t,f)计算，扩散场噪声（如扩散场噪声1和扩散场噪声2）的时频点增益g_cdr(t,f)为0.3，因此扩散场噪声得到了有效的抑制，即扩散场噪声的能量相比处理之前明显降低。
1202、将抑制非目标方位的声音和抑制扩散场噪声进行融合处理。
应理解，上述401中抑制非目标方位的声音信号的主要目的在于保留目标对象的声音信号，抑制非目标对象的声音信号。上述1201中抑制扩散场噪声的主要目的在于抑制扩散场噪声，保护相干信号（即目标对象或非目标对象的声音信号）。因而，如图12或图13所示的声音信号处理方法中，在执行1201抑制扩散场噪声之后，可以将上述401抑制非目标方位的声音信号得到的时频点增益g_mask(t,f)，以及上述1201抑制扩散场噪声得到的时频点增益g_cdr(t,f)进行融合计算，得到融合增益g_mix(t,f)。根据融合后的增益g_mix(t,f)对输入的时频语音信号进行处理，可以得到清晰的目标对象的时频语音信号，降低扩散场噪声的能量，提高音频（声音）信号的信噪比。
示例性地，可以采用如下公式（六）进行融合增益g_mix(t,f)计算：
g_mix(t,f)=g_mask(t,f)*g_cdr(t,f)；　　　　　　　　公式（六）
其中，g_mix(t,f)为增益融合计算之后的混合增益。
示例性地，依然以上述图1所示的拍摄场景为例，分别对目标对象的时频语音信号、非目标对象（如非目标对象1和非目标对象2）的时频语音信号、以及扩散场噪声1的时频语音信号、扩散场噪声2的时频语音信号进行增益融合后的融合增益g_mix(t,f)进行计算。
例如：
目标对象的时频语音信号：g_mask(t,f)=1，g_cdr(t,f)=1，则g_mix(t,f)=1；
非目标对象的时频语音信号：g_mask(t,f)=0.2，g_cdr(t,f)=1，则g_mix(t,f)=0.2；
扩散场噪声1的时频语音信号：g_mask(t,f)=0.2，g_cdr(t,f)=0.3，则g_mix(t,f)=0.06；
扩散场噪声2的时频语音信号：g_mask(t,f)=1，g_cdr(t,f)=0.3，则g_mix(t,f)=0.3。
由上述融合增益g_mix(t,f)计算可知，通过对上述抑制非目标方位的声音信号得到的时频点增益g_mask(t,f)，以及上述抑制扩散场噪声得到的时频点增益g_cdr(t,f)进行融合计算，可以将能量较大的扩散场噪声进行抑制，如扩散场噪声2。
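上述融合计算可以用几行Python代码复现（数值沿用上文的例子）：

```python
# 公式（六）的融合增益计算示意
g_mask = {"目标对象": 1.0, "非目标对象": 0.2, "扩散场噪声1": 0.2, "扩散场噪声2": 1.0}
g_cdr  = {"目标对象": 1.0, "非目标对象": 1.0, "扩散场噪声1": 0.3, "扩散场噪声2": 0.3}
g_mix  = {k: round(g_mask[k] * g_cdr[k], 4) for k in g_mask}
print(g_mix)  # {'目标对象': 1.0, '非目标对象': 0.2, '扩散场噪声1': 0.06, '扩散场噪声2': 0.3}
```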
1203、输出处理后的声音信号。
示例性地，执行上述1202中的融合增益g_mix(t,f)计算之后，可以根据上述计算得到的各类声音信号的融合增益g_mix(t,f)，并结合麦克风采集的声音输入信号，如左声道时频语音信号X_L(t,f)或右声道时频语音信号X_R(t,f)，得到经过抑制非目标对象的声音以及抑制扩散场噪声融合处理后的声音信号，如Y_L(t,f)和Y_R(t,f)。具体地，处理后的声音信号Y_L(t,f)和Y_R(t,f)可以通过如下公式（七）和公式（八）分别计算：
Y_L(t,f)=X_L(t,f)*g_mix(t,f)；　　　　　　　　公式（七）
Y_R(t,f)=X_R(t,f)*g_mix(t,f)。　　　　　　　　公式（八）
例如，对于目标对象，左声道音频输出信号Y_L(t,f)的能量等于左声道时频语音信号X_L(t,f)的能量；右声道音频输出信号Y_R(t,f)的能量等于右声道时频语音信号X_R(t,f)的能量，即如图15所示的目标对象的声音信号得到完整的保留。
对于非目标对象，左声道音频输出信号Y_L(t,f)的能量等于左声道时频语音信号X_L(t,f)的能量的0.2倍；右声道音频输出信号Y_R(t,f)的能量等于右声道时频语音信号X_R(t,f)的能量的0.2倍，即如图15所示的非目标对象的声音信号得到有效抑制。
对于环境噪声，如处于目标方位内的扩散场噪声2，左声道音频输出信号Y_L(t,f)的能量等于左声道时频语音信号X_L(t,f)的能量的0.3倍；右声道音频输出信号Y_R(t,f)的能量等于右声道时频语音信号X_R(t,f)的能量的0.3倍，即如图15所示的扩散场噪声2的声音信号得到有效抑制。
对于环境噪声，如处于非目标方位内的扩散场噪声1，左声道音频输出信号Y_L(t,f)的能量等于左声道时频语音信号X_L(t,f)的能量的0.06倍；右声道音频输出信号Y_R(t,f)的能量等于右声道时频语音信号X_R(t,f)的能量的0.06倍，即如图15所示的扩散场噪声1的声音信号得到有效抑制。
综上所述，如图15所示，经过抑制非目标方位的声音信号中的时频点增益g_mask(t,f)计算，在目标方位内的目标对象的时频语音信号得到完整保留，不在目标方位内的时频语音信号（如非目标对象1的时频语音信号、非目标对象2的时频语音信号、扩散场噪声1的时频语音信号）得到了有效抑制。并且位于目标方位内的扩散场噪声2的时频语音信号也得到了有效抑制，从而提高了语音信号的信噪比，提高了自拍声音的清晰度。例如，图16中的(a)为仅抑制了非目标方位的声音信号后得到的时频语音信号，图16中的(b)为经过抑制非目标方位的声音信号与抑制扩散场噪声融合处理后得到的时频语音信号。对比图16中的(a)和图16中的(b)可以看出，图16中的(a)所示的语音频谱图的背景颜色稍浅，说明其对应的时频语音信号的背景噪声能量较大；图16中的(b)所示的语音频谱图的背景颜色更深，说明其对应的时频语音信号的背景噪声（即扩散场噪声）能量较小。因此，可以确定经过抑制非目标方位的声音信号与抑制扩散场噪声融合处理后，输出的声音信号的噪声能量有所降低，声音信号的信噪比大幅提高。
需要说明的是，在嘈杂环境下仅依靠上述1202中的融合增益g_mix(t,f)计算进行目标对象的声音与非目标对象的声音的分离，会出现背景噪声（即环境噪声）不平稳的问题。例如，经过上述1202中的融合增益g_mix(t,f)计算之后，扩散场噪声1和扩散场噪声2的时频语音信号的融合增益g_mix(t,f)存在较大的差距，从而使得输出的音频信号的背景噪声不平稳。
为了解决经过上述处理后音频信号的背景噪声不平稳的问题，在另一些实施例中，可以对扩散场噪声进行噪声补偿，进而进行二次降噪处理，使音频信号的背景噪声更加平稳。如图17所示，上述声音信号处理方法可以包括400-401、1201-1202、1701-1702。
1701、对扩散场噪声进行补偿。
示例性地，在扩散场噪声补偿阶段，可以通过如下公式（九）对扩散场噪声进行补偿：
g_out(t,f)=MAX(g_mix(t,f)，MIN(1−g_cdr(t,f)，g_min))　　　　公式（九）
其中，g_min为扩散场噪声的最小增益（即预设增益值），可通过参数配置，例如可以将g_min设置为0.3。g_out(t,f)为进行扩散场噪声补偿后时频语音信号的时频点增益。
示例性地，依然以上述图1所示的拍摄场景为例，分别对目标对象的时频语音信号、非目标对象（如非目标对象1和非目标对象2）的时频语音信号、以及扩散场噪声1的时频语音信号、扩散场噪声2的时频语音信号进行扩散场噪声补偿计算，得到时频点增益g_out(t,f)。
例如：
目标对象的时频语音信号：g_mask(t,f)=1，g_cdr(t,f)=1，g_mix(t,f)=1，则g_out(t,f)=1；
非目标对象的时频语音信号：g_mask(t,f)=0.2，g_cdr(t,f)=1，g_mix(t,f)=0.2，则g_out(t,f)=0.2；
扩散场噪声1的时频语音信号：g_mask(t,f)=0.2，g_cdr(t,f)=0.3，g_mix(t,f)=0.06，则g_out(t,f)=0.3；
扩散场噪声2的时频语音信号：g_mask(t,f)=1，g_cdr(t,f)=0.3，g_mix(t,f)=0.3，则g_out(t,f)=0.3。
由此可见，经过扩散场噪声补偿后，扩散场噪声1的时频点增益由0.06增加至0.3，扩散场噪声2的时频点增益保持在0.3，从而可以使输出的声音信号的背景噪声（如扩散场噪声1和扩散场噪声2）更加平稳，使用户的听感更佳。
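公式（九）的噪声补偿可以用如下Python代码示意，并复现上文的数值例子：

```python
import numpy as np

def compensate(g_mix, g_cdr, g_min=0.3):
    """公式（九）：g_out = max(g_mix, min(1 - g_cdr, g_min))，为扩散场噪声设置增益下限。"""
    return np.maximum(g_mix, np.minimum(1.0 - g_cdr, g_min))

# 依次为：目标对象、非目标对象、扩散场噪声1、扩散场噪声2
print(compensate(np.array([1.0, 0.2, 0.06, 0.3]), np.array([1.0, 1.0, 0.3, 0.3])))
# -> [1.  0.2 0.3 0.3]，可见扩散场噪声1的增益由0.06补偿至0.3
```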
1702、输出处理后的声音信号。
与上述1203不同的是，处理后的声音信号，如左声道音频输出信号Y_L(t,f)和右声道音频输出信号Y_R(t,f)，是根据扩散场噪声补偿后时频语音信号的时频点增益g_out(t,f)来计算的。具体可通过如下公式（十）和公式（十一）计算：
Y_R(t,f)=X_R(t,f)*g_out(t,f)；　　　　　　　　公式（十）
Y_L(t,f)=X_L(t,f)*g_out(t,f)；　　　　　　　　公式（十一）
其中，X_R(t,f)为麦克风采集的右声道时频语音信号，X_L(t,f)为麦克风采集的左声道时频语音信号。
例如，对于目标对象，左声道音频输出信号Y_L(t,f)的能量等于左声道时频语音信号X_L(t,f)的能量；右声道音频输出信号Y_R(t,f)的能量等于右声道时频语音信号X_R(t,f)的能量，即如图18所示的目标对象的声音信号得到完整的保留。
对于非目标对象，左声道音频输出信号Y_L(t,f)的能量等于左声道时频语音信号X_L(t,f)的能量的0.2倍；右声道音频输出信号Y_R(t,f)的能量等于右声道时频语音信号X_R(t,f)的能量的0.2倍，即如图18所示的非目标对象的声音信号得到有效抑制。
对于环境噪声，如处于目标方位内的扩散场噪声2，左声道音频输出信号Y_L(t,f)的能量等于左声道时频语音信号X_L(t,f)的能量的0.3倍；右声道音频输出信号Y_R(t,f)的能量等于右声道时频语音信号X_R(t,f)的能量的0.3倍，即如图18所示的扩散场噪声2的声音信号得到有效抑制。
对于环境噪声，如处于非目标方位内的扩散场噪声1，左声道音频输出信号Y_L(t,f)的能量等于左声道时频语音信号X_L(t,f)的能量的0.3倍；右声道音频输出信号Y_R(t,f)的能量等于右声道时频语音信号X_R(t,f)的能量的0.3倍，即如图18所示的扩散场噪声1的声音信号得到有效抑制。
综上所述，如图18所示，经过扩散场噪声补偿的时频点增益g_out(t,f)计算，减小了扩散场噪声1和扩散场噪声2的时频点增益g_out(t,f)的差距，使得输出的声音信号的背景噪声（如扩散场噪声1和扩散场噪声2）更加平稳，使用户的听感更佳。例如，图19中的(a)所示为未经过扩散场噪声补偿处理输出的时频语音信号，图19中的(b)所示为经过扩散场噪声补偿处理输出的时频语音信号。对比图19中的(a)和图19中的(b)可以看出，图19中的(b)所示的语音频谱图的背景颜色更均匀，说明背景噪声的能量更加均匀。因而，可以确定经过扩散场噪声补偿处理后，输出的声音信号的背景噪声（如扩散场噪声）更加均匀平滑，从而使用户听感更佳。
需要说明的是，经过上述图7、图13和图17所示的声音信号处理方法的处理后，得到所有的左声道音频输出信号Y_L(t,f)和右声道音频输出信号Y_R(t,f)，再经过快速傅里叶逆变换（inverse fast Fourier transform，IFFT）或离散傅里叶逆变换（inverse discrete Fourier transform，IDFT），将输出的时频语音信号变换为时域信号，便可以通过手机的扬声器1和扬声器2输出。
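与前文stft对应的逆变换（IFFT加重叠相加）可以用如下Python代码示意（帧长与帧移沿用前文的假设值）；将处理后的Y_L(t,f)、Y_R(t,f)分别经此逆变换，即可得到左、右声道的时域输出信号：

```python
import numpy as np

def istft(X, frame_len=1024, hop=512):
    """与前文stft对应的逆变换：对每帧做IFFT、加合成窗并重叠相加，再按窗的平方和归一化。"""
    window = np.hanning(frame_len)
    n_frames = X.shape[0]
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    norm = np.zeros_like(out)
    frames = np.fft.irfft(X, n=frame_len, axis=-1)
    for i in range(n_frames):
        out[i * hop: i * hop + frame_len] += frames[i] * window
        norm[i * hop: i * hop + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)
```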
以上是对本申请实施例提供的声音信号处理方法的具体过程的介绍,以下再结合不同的应用场景对如何使用上述声音信号处理方法进行说明。
场景一:用户使用前置摄像头录像的场景。
在一些实施例中,针对上述场景一,手机每采集一帧图像数据即对采集的一帧图像数据进行处理,或者每采集一帧图像数据对应的音频数据即对采集的音频数据进行处理。示例性地,如图20所示,本申请实施例提供的一种声音信号的处理方法可以包括:
2001、手机启动前置摄像头录像功能,并启动录像。
在本申请的实施例中,用户想要使用手机进行录像拍摄时,可以启动手机的录像功能。例如,手机可以启动相机应用,或者启动具有录像功能的其他应用(比如抖音或快手等AR应用),从而启动应用的录像功能。
示例性地,手机检测到用户点击图21A中的(a)所示的相机图标2101的操作后,启动相机应用的录像功能,并显示如图21A中的(b)所示的前置摄像头录像的预览界面。再示例性地,手机显示桌面或非相机应用的界面,检测到用户打开相机应用的语音指令后启动录像功能,并显示如图21A中的(b)所示的前置摄像头录像的预览界面。
需要说明的是,手机还可以响应于用户的其他触摸操作、语音指令或快捷手势等操作启动录像功能,本申请实施例对触发手机启动录像功能的操作不作限定。
当手机显示如图21A中的(b)所示的前置摄像头录像的预览界面时,手机检测到用户点击图21A中的(b)所示的录像按钮2102的操作后,启动前置摄像头录像,并显示如图21A中的(c)所示的前置摄像头录像的录像界面,并开始录像计时。
2002、手机采集第N帧图像,并对第N帧图像进行处理。
示例性地,手机在进行视频录制的过程中,可以分为图像流和音频流。其中图像流用于采集图像数据,并分别对每一帧图像进行图像处理操作。音频流用于采集音频数据,并分别对每一帧音频数据进行拾音去噪处理。
示例性地，以第1帧图像为例，当手机采集完第1帧图像之后，手机可以对第1帧图像进行处理，如图像去噪、色调映射（tonemapping）等处理。当手机采集完第2帧图像之后，手机可以对第2帧图像进行处理。以此类推，当手机采集完第N帧图像之后，手机可以对第N帧图像进行处理。其中，N为正整数。
2003、手机采集第N帧图像对应的音频,并对第N帧图像对应的音频进行处理。
以上述图1所示的拍摄环境和拍摄对象为例,手机在如图21B中的(a)所示的视频录制界面展示的视频录制过程中,手机采集的图像数据为目标对象的图像,手机采集的音频数据不仅会包括目标对象的声音,还可能包括非目标对象(如非目标对象1和非目标对象2)的声音,以及环境噪声(如扩散场噪声1和扩散场噪声2)。
以一帧图像为30毫秒(ms),一帧音频为10毫秒,从启动录像开始帧计数,则第N帧图像对应的音频为第3N-2帧、第3N-1帧和第3N帧音频。例如,第1帧图像对应的音频为第1帧、第2帧和第3帧音频。又例如,第2帧图像对应的音频为第4帧、第5帧和第6帧音频。再例如,第10帧图像对应的音频为第28帧、第29帧和第30帧音频。
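上述图像帧与音频帧的对应关系可以用一个简单的Python函数示意（假设一帧图像对应3帧音频，帧号从1开始计数）：

```python
def audio_frames_for_image(n, ratio=3):
    """第n帧图像对应的音频帧号：第ratio*n-(ratio-1)帧到第ratio*n帧。"""
    return [ratio * n - (ratio - 1) + i for i in range(ratio)]

print(audio_frames_for_image(1))   # [1, 2, 3]
print(audio_frames_for_image(10))  # [28, 29, 30]
```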
以第1帧图像对应的音频为例,手机对第1帧图像对应的音频进行处理,需要分别对第1帧音频、第2帧音频和第3帧音频进行处理。
示例性地,当采集完第1帧音频时,手机可以执行上述图7所示的声音信号处理方法,对第1帧音频进行处理,抑制非目标对象的声音,以突出目标对象的声音;或者手机可以执行上述图13所示的声音信号处理方法,抑制非目标对象的声音,以突出目标对象的声音,并且抑制扩散场噪声,以降低音频信号中的噪声能量,提高音频信号的信噪比;又或者手机可以执行上述图17所示的声音信号处理方法,抑制非目标对象的声音,以突出目标对象的声音,并且抑制扩散场噪声,以降低音频信号中的噪声能量,提高音频信号的信噪比,还能够平滑背景噪声(即环境噪声),使用户听感更佳。
类似地,当采集完第2帧音频或第3帧音频,手机也可以执行上述图7或图13或图17所示的声音信号处理方法,对音频信号进行处理。应理解,上述2003可以对应图7或图13或图17中的400步骤。
应理解,对于后续任意一帧图像对应的音频,均可以按照上述第1帧图像对应的音频处理过程进行处理,此处不再一一赘述。当然,在本申请实施例中,在对第N帧图像对应的音频进行处理时,可以每采集一帧音频即执行上述方法进行处理,也可以在第N帧图像对应的3帧音频采集完成,再分别对3帧音频中的每一帧音频进行处理,本申请实施例不做特殊限制。
2004、手机合成处理后的第N帧图像和处理后的第N帧图像对应的音频,得到第N帧视频数据。
示例性地，当第N帧图像处理完成后，并且第N帧图像对应的音频，如第3N-2帧、第3N-1帧和第3N帧音频也处理完成后，手机可以从图像流中获取第N帧图像，并且从音频流中获取第3N-2帧、第3N-1帧和第3N帧音频，然后将第3N-2帧、第3N-1帧和第3N帧音频按照时间戳顺序，同第N帧图像进行合成，合成为第N帧视频数据。
2005、结束录制时，待最后一帧图像以及最后一帧图像对应的音频处理完成，得到最后一帧视频数据后，合成第一帧视频数据至最后一帧视频数据，并保存为视频文件A。
在此过程中，手机响应于用户对结束录制按钮2103的点击操作，停止图像和音频的采集，并且在最后一帧图像以及最后一帧图像对应的音频处理完成，得到最后一帧视频数据后，按照时间戳顺序合成第一帧视频数据至最后一帧视频数据，并保存为视频文件A。此时，在图21B中的(b)所示的前置摄像头录像的预览界面中的预览窗2104显示的预览文件，即为视频文件A。
当手机检测到用户对图21B中的(b)所示的预览窗2104的点击操作后,手机响应于用户对预览窗2104的点击操作,可以显示如图21B中的(c)所示的视频文件播放界面,以播放视频文件A。
当手机播放视频文件A时,播放出的视频文件A的声音信号已经没有了非目标对象(如非目标对象1和非目标对象2)的声音信号,并且播放出视频文件A的声音信号中的背景噪声(如扩散场噪声1和扩散场噪声2)较小且平滑,能够给用户良好的听感。
应理解,在图21B中展示的场景为执行上述图17所示的声音信号处理方法的场景,未示出执行图7或图13所示的声音信号处理方法的场景。
需要说明的是，在上述图20所示的方法中，手机完成第N帧图像的采集和处理，以及第N帧图像对应的音频的采集和处理之后，也可以不将第N帧图像以及第N帧图像对应的音频合成为第N帧视频数据，而是在结束录制，且最后一帧图像以及最后一帧图像对应的音频处理完成后，再按照时间戳顺序将所有的图像以及所有的音频合成为视频文件A。
在另一些实施例中,针对上述场景一,也可以等到视频文件录制完成后,执行对视频文件中的音频数据的拾音去噪处理操作。
示例性地,如图22所示,本申请实施例提供的录像拍摄方法与上述图20所示的录像拍摄方法不同的是:手机在启动录像后,先采集完所有的图像数据和音频数据后,再对图像数据和音频数据进行处理并合成,最后保存合成的视频文件。具体地,该方法包括:
2201、手机启动前置摄像头录像功能,并启动录像。
示例性地,手机启动前置摄像头录像功能,并启动录像的方法可以参照上述2001中的描述,此处不再赘述。
2202、手机分别采集图像数据和音频数据。
示例性地,手机在进行视频录制的过程中,可以分为图像流和音频流。其中图像流用于采集视频录制过程中的多帧图像数据。音频流用于采集视频录制过程中的多帧音频数据。
例如，在图像流中，手机依次采集第1帧图像、第2帧图像，……最后1帧图像。在音频流中，手机依次采集第1帧音频、第2帧音频，……最后1帧音频。
以上述图1所示的拍摄环境和拍摄对象为例,手机在如图21B中的(a)所示的视频录制界面展示的视频录制过程中,手机采集的图像数据为目标对象的图像,手机采集的音频数据不仅会包括目标对象的声音,还可能包括非目标对象(如非目标对象1和非目标对象2)的声音,以及环境噪声(如扩散场噪声1和扩散场噪声2)。
2203、结束录制时,手机分别对采集的图像数据和音频数据进行处理。
示例性地,当手机检测到用户点击图21B中的(a)所示的结束录制按钮2103的操作后,手机响应于用户的点击操作,结束录制,并显示如图21B中的(b)所示的前置摄像头录像的预览界面。
在此过程中,手机响应于用户对结束录制按钮2103的点击操作,手机分别对采集的图像数据和采集的音频数据进行处理。
示例性地,手机可以对采集的图像数据中的每1帧图像分别进行处理,如图像去噪、色调映射(tonemapping)等处理,得到处理后的图像数据。
示例性地，手机还可以对采集的音频数据中的每1帧音频进行处理。例如，执行上述图7所示的声音信号处理方法对采集的音频数据中的每1帧音频进行处理，抑制非目标对象的声音，以突出目标对象的声音。又例如，执行上述图13所示的声音信号处理方法对采集的音频数据中的每1帧音频进行处理，抑制非目标对象的声音，以突出目标对象的声音，并且抑制扩散场噪声，以降低音频信号中的噪声能量，提高音频信号的信噪比。再例如，执行上述图17所示的声音信号处理方法对采集的音频数据中的每1帧音频进行处理，抑制非目标对象的声音，以突出目标对象的声音，并且抑制扩散场噪声，以降低音频信号中的噪声能量，提高音频信号的信噪比，还能够平滑背景噪声（即环境噪声），使用户听感更佳。
手机对采集的音频数据中的每1帧音频均处理完成后,可以得到处理后的音频数据。
应理解,上述2202和2203可以对应图7或图13或图17中的400步骤。
2204、手机合成处理后的图像数据和处理后的音频数据，得到视频文件A。
应理解,处理后的图像数据和处理后的音频数据需要合成为视频文件,才能够供用户分享或播放。因此,在手机执行上述2203得到处理后的图像数据和处理后的音频数据之后,可以对处理后的图像数据和处理后的音频数据进行合成,形成视频文件A。
2205、手机保存视频文件A。
此时,手机可以保存视频文件A。示例性地,当手机检测到用户对图21B中的(b)所示的预览窗2104的点击操作后,手机响应于用户对预览窗2104的点击操作,可以显示如图21B中的(c)所示的视频文件播放界面,以播放视频文件A。
当手机播放视频文件A时,播放出的视频文件A的声音信号已经没有了非目标对象(如非目标对象1和非目标对象2)的声音信号,并且播放出视频文件A的声音信号中的背景噪声(如扩散场噪声1和扩散场噪声2)较小且平滑,能够给用户良好的听感。
场景二:用户使用前置摄像头进行直播的场景。
在该场景中，直播采集的数据会实时展示给用户观看，因此直播采集的图像和音频会实时进行处理，并将处理后的图像和音频数据及时展示给用户观看。在该场景中至少包括手机A、服务器和手机B。其中，手机A和手机B均与服务器通信。手机A可以为直播录像设备，用于录制音视频文件并传输至服务器。手机B可以为直播显示设备，用于从服务器中获取音视频文件，并在直播界面中显示音视频文件内容，以供用户观看。
示例性地,针对上述场景二,如图23所示,本申请实施例提供的一种录像拍摄方法,应用于手机A。该方法可以包括:
2301、手机A启动前置摄像头直播录像,并开启直播。
在本申请实施例中,用户想要使用手机进行直播录像时,可以启动手机中的直播应用,如抖音或快手,开启直播录像。
示例性地,以抖音应用为例,手机检测到用户点击图24中的(a)所示的开启视频直播按钮2401的操作后,启动抖音应用的视频直播功能,并显示如图24中的(b)所示的视频直播采集界面。此时,抖音应用会采集图像数据和声音数据。
2302、手机A采集第N帧图像,并对第N帧图像进行处理。
此处理过程与上述图20所示的2002类似，此处不再赘述。
2303、手机A采集第N帧图像对应的音频,并对第N帧图像对应的音频进行处理。
以上述图1所示的拍摄环境和拍摄对象为例,如图25所示,在视频直播采集界面中,手机采集的第N帧图像为目标对象的图像,手机采集的第N帧图像对应的音频中不仅会包括目标对象的声音,还可能包括非目标对象(如非目标对象1和非目标对象2)的声音,以及环境噪声(如扩散场噪声1和扩散场噪声2)。
此处理过程与上述图20所示的2003类似，此处不再赘述。
2304、手机A合成处理后的第N帧图像和处理后的第N帧图像对应的音频,得到第N帧视频数据。
此处理过程与上述图20所示的2004类似，此处不再赘述。应理解，处理后的第N帧图像和处理后的第N帧图像对应的音频合成为第N帧视频数据后，即可展示给用户观看。
2305、手机A发送第N帧视频数据至服务器,以使手机B显示第N帧视频数据。
示例性地,当得到第N帧视频数据后,手机A可以将第N帧视频数据发送至服务器。应理解,该服务器通常为直播应用的服务器,如抖音应用的服务器。当观看直播的用户打开手机B的直播应用,如抖音应用时,手机B可以在直播显示界面显示第N帧视频,以供用户观看。
需要说明的是,经过上述2303对第N帧图像对应的音频,如第3N-2帧音频、第3N-1帧和第3N帧音频处理完成后,如图25所示,在手机B输出的第N帧视频数据中的音频信号为经过处理的音频信号,仅保留有目标对象(即直播拍摄对象)的声音。
应理解,在图25中展示的场景为执行上述图17所示的声音信号处理方法的场景,未示出执行图7或图13所示的声音信号处理方法的场景。
场景三:对手机相册中的视频文件进行拾音处理的场景。
在某些情况下，电子设备（如手机）在视频文件录制过程中不支持对录制的声音数据进行处理。为提高视频文件的声音数据的清晰度、抑制噪声信号、提高信噪比，电子设备可以对保存在手机相册中的视频文件及保存的原始声音数据执行上述图7、图13或图17的声音信号处理方法，去除非目标对象的声音或去除扩散场噪声，使噪声更平稳，以使用户的听感更佳。其中，保存的原始声音数据是指录制上述手机相册中的视频文件时，手机的麦克风采集的声音数据。
示例性地,本申请实施例还提供一种拾音处理方法,用于对手机相册中的视频文件进行拾音处理。例如,如图26所示,该拾音处理方法包括:
2601、手机获取相册中的第一视频文件。
在本申请的实施例中，当用户想要对手机相册中的视频文件的音频进行拾音处理，以去除非目标对象的声音或去除扩散场噪声时，可以从手机的相册中选择想要处理的视频文件，以执行拾音处理。
示例性地,手机检测到用户点击图27中的(a)所示的第一视频文件的预览框2701的操作后,手机响应于用户的点击操作,显示如图27中的(b)所示的对第一视频文件的操作界面。在该操作界面中,用户可以对第一视频文件进行播放、分享、收藏、编辑或删除。
示例性地,在图27中的(b)所示的对第一视频文件的操作界面中,用户还可以对第一视频文件进行拾音去噪操作。例如,手机检测到用户点击图28中的(a)所示的“更多”选项2801的操作后,手机响应于用户的点击操作,显示如图28中的(b)所示的操作选择框2802。在操作选择框2802中设置有“去噪处理”选项按钮2803,用户可以点击“去噪处理”选项按钮2803,以对第一视频文件进行拾音去噪处理。
示例性地,手机检测到用户点击图29中的(a)所示的“去噪处理”选项按钮2803的操作后,手机响应于用户的点击操作,可以获取第一视频文件,以对第一视频文件进行拾音去噪处理。此时,手机可以显示如图29中的(b)所示的执行拾音去噪处理过程的等待界面,并在后台执行下述2602至2604,对第一视频文件进行拾音去噪处理。
2602、手机将第一视频文件分离为第一图像数据和第一音频数据。
应理解,对第一视频文件进行拾音去噪的目标是去除非目标对象的声音,抑制背景噪声(即扩散场噪声),因此手机需要将第一视频文件中的音频数据分离出来,以便对第一视频文件中的音频数据进行拾音去噪处理。
示例性地,当手机获取到第一视频文件后,可以将第一视频文件中的图像数据和音频数据分离为第一图像数据和第一音频数据。其中,第一图像数据可以为第一视频文件的图像流中第1帧到最后1帧图像的集合。第一音频数据可以为第一视频文件的音频流中第1帧音频到最后1帧音频的集合。
2603、手机对第一图像数据进行图像处理,得到第二图像数据;对第一音频数据进行处理,得到第二音频数据。
示例性地,手机可以对第一图像数据中的每1帧图像进行处理,如图像去噪、色调映射(tonemapping)等处理,得到第二图像数据。其中,第二图像数据为第一图像数据中的每1帧图像经过处理后得到的图像的集合。
示例性地，手机还可以对第一音频数据中的每1帧音频进行处理。例如，执行上述图7所示的声音信号处理方法对第一音频数据中的每1帧音频进行处理，抑制非目标对象的声音，以突出目标对象的声音。又例如，执行上述图13所示的声音信号处理方法对第一音频数据中的每1帧音频进行处理，抑制非目标对象的声音，以突出目标对象的声音，并且抑制扩散场噪声，以降低音频信号中的噪声能量，提高音频信号的信噪比。再例如，执行上述图17所示的声音信号处理方法对第一音频数据中的每1帧音频进行处理，抑制非目标对象的声音，以突出目标对象的声音，并且抑制扩散场噪声，以降低音频信号中的噪声能量，提高音频信号的信噪比，还能够平滑背景噪声（即环境噪声），使用户听感更佳。
手机对第一音频数据中的每1帧音频均处理完成后,可以得到第二音频数据。其中,第二音频数据为第一音频数据中的每一帧音频经过处理后得到的音频的集合。
应理解,上述2601至2603均可以对应图7或图13或图17中的400步骤。
2604、手机将第二图像数据和第二音频数据合成为第二视频文件。
应理解,处理完成后的第二图像数据和第二音频数据需要合成为视频文件,才能够供用户分享或播放。因此,当手机执行上述2603得到第二图像数据和第二音频数据后,可以对第二图像数据和第二音频数据进行合成,形成第二视频文件。此时,第二视频文件即为经过拾音去噪后的第一视频文件。
2605、手机保存第二视频文件。
示例性地,当手机执行2604,将第二图像数据和第二音频数据合成为第二视频文件后,拾音去噪处理已完成,手机可以显示如图30所示的文件保存选项卡3001。在文件保存选项卡3001中,可以提示用户“去噪处理完成,是否替换原有文件?”,并且设置有第一选项按钮3002和第二选项按钮3003,以供用户选择。其中,第一选项按钮3002上可以显示“是”,以指示用户将原有的第一视频文件替换为处理后的第二视频文件。第二选项按钮3003上可以显示“否”,以指示用户将处理后的第二视频文件另存,不替换第一视频文件,即第一视频文件和第二视频文件均保留。
示例性地,若手机检测到用户对图30中所示的第一选项按钮3002的点击操作,手机响应于用户对第一选项按钮3002的点击操作,将原有的第一视频文件替换为处理后的第二视频文件。若手机检测到用户对图30中所示的第二选项按钮3003的点击操作,手机响应于用户对第二选项按钮3003的点击操作,分别保存第一视频文件和第二视频文件。
可以理解的，上述实施例并不限于一帧图像对应一帧音频的情况，一帧图像对应多帧音频，或者多帧图像对应一帧音频也同样适用。例如，可以是一帧图像对应三帧音频，那么如图23中实时合成时，可以是手机合成处理后的第N帧图像和处理后的第3N-2帧、第3N-1帧和第3N帧的音频，得到第N帧视频数据。
本申请实施例还提供另一种声音信号处理方法。该方法应用于电子设备,电子设备包括摄像头和麦克风。第一目标对象在摄像头的拍摄范围内,第二目标对象不在摄像头的拍摄范围内。其中,第一目标对象在摄像头的拍摄范围内,可以指第一目标对象位于摄像头的视场角的范围内。例如,第一目标对象可以为上述实施例中的目标对象。第二目标对象可以为上述实施例中的非目标对象1或非目标对象2。
该方法包括:
电子设备启动相机。
显示预览界面,预览界面包括第一控件。其中,第一控件可以是图21A中的(b)所示的录像按钮2102,也可以是图24中的(a)所示的开启视频直播按钮2401。
检测到对第一控件的第一操作。响应于第一操作,开始拍摄。其中,第一操作,可以是用户对第一控件的点击操作。
在第一时刻,显示拍摄界面,拍摄界面包括第一图像,第一图像为摄像头实时采集的图像,第一图像包括第一目标对象,第一图像不包括第二目标对象。其中,第一时刻可以是拍摄过程中的任意时刻。第一图像可以是图20、图22、图23所示的方法中的每一帧图像。第一目标对象可以是上述实施例中的目标对象。第二目标对象可以是上述实施例中的非目标对象1或非目标对象2。
在第一时刻,麦克风采集第一音频,第一音频包括第一音频信号和第二音频信号,第一音频信号对应第一目标对象,第二音频信号对应第二目标对象。以图1所示的场景为例,第一音频信号可以为目标对象的声音信号。第二音频信号可以是非目标对象1的声音信号或非目标对象2的声音信号。
检测到对拍摄界面的第一控件的第二操作。其中,拍摄界面的第一控件可以是图21B中的(a)所示的结束录制按钮2103。第二操作可以是用户对拍摄界面的第一控件的点击操作。
响应于第二操作,停止拍摄,保存第一视频,其中,第一视频的第一时刻处包括第一图像和第二音频,第二音频包括第一音频信号和第三音频信号,第三音频信号是电子设备对第二音频信号进行处理得到的,第三音频信号的能量小于第二音频信号的能量。例如,第三音频信号可以为处理后的非目标对象1的声音信号或非目标对象2的声音信号。
可选地,第一音频还可以包括第四音频信号,第四音频信号为扩散场噪声音频信号;第二音频还包括第五音频信号,第五音频信号为扩散场噪声音频信号。其中,第五音频信号是电子设备对第四音频信号进行处理得到的。第五音频信号的能量小于所述第四音频信号的能量。例如,第四音频信号可以为上述实施例中的扩散场噪声1的声音信号,或者为上述实施例中的扩散场噪声2的声音信号。第五音频信号可以是电子设备执行上述图13或图17后得到的处理后的扩散场噪声1的声音信号或者扩散场噪声2的声音信号。
可选地,第五音频信号是电子设备对第四音频信号进行处理得到的,包括:对第四音频信号进行抑制处理得到第六音频信号。例如,第六音频信号可以是执行图13得到的处理后的扩散场噪声1的声音信号。
对第六音频信号进行补偿处理得到第五音频信号。其中，第六音频信号为扩散场噪声音频信号，第六音频信号的能量小于第四音频信号的能量，且第六音频信号的能量小于所述第五音频信号的能量。例如，此时的第五音频信号可以是执行图17得到的处理后的扩散场噪声1的声音信号。
本申请另一些实施例提供了一种电子设备，该电子设备包括：麦克风；摄像头；一个或多个处理器；存储器；通信模块。其中，麦克风用于采集录像或直播时的声音信号；摄像头用于采集录像或直播时的图像信号。通信模块用于与外接设备通信。存储器中存储有一个或多个计算机程序，一个或多个计算机程序包括指令。当处理器执行计算机指令时，电子设备可执行上述方法实施例中手机执行的各个功能或者步骤。
本申请实施例还提供一种芯片系统。该芯片系统可以应用于电子设备。如图31所示，该芯片系统包括至少一个处理器3101和至少一个接口电路3102。处理器3101和接口电路3102可通过线路互联。例如，接口电路3102可用于从其它装置（例如电子设备的存储器）接收信号。又例如，接口电路3102可用于向其它装置（例如处理器3101）发送信号。示例性的，接口电路3102可读取存储器中存储的指令，并将该指令发送给处理器3101。当所述指令被处理器3101执行时，可使得电子设备执行上述实施例中的各个步骤。当然，该芯片系统还可以包含其他分立器件，本申请实施例对此不作具体限定。
本申请实施例还提供一种计算机存储介质，该计算机存储介质包括计算机指令，当所述计算机指令在上述电子设备上运行时，使得该电子设备执行上述方法实施例中手机执行的各个功能或者步骤。
本申请实施例还提供一种计算机程序产品,当所述计算机程序产品在计算机上运行时,使得所述计算机执行上述方法实施例中手机执行的各个功能或者步骤。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请实施例各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
上述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:快闪存储器、移动硬盘、只读存储器、随机存取存储器、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请实施例的具体实施方式,但本申请实施例的保护范围并不局限于此,任何在本申请实施例揭露的技术范围内的变化或替换,都应涵盖在本申请实施例的保护范围之内。因此,本申请实施例的保护范围应以所述权利要求的保护范围为准。

Claims (10)

  1. 一种声音信号处理方法,其特征在于,应用于电子设备,所述电子设备包括摄像头和麦克风;第一目标对象在所述摄像头的拍摄范围内,第二目标对象不在所述摄像头的拍摄范围内,所述方法包括:
    所述电子设备启动相机;
    显示预览界面,所述预览界面包括第一控件;
    检测到对所述第一控件的第一操作;
    响应于所述第一操作,开始拍摄;
    在第一时刻,显示拍摄界面,所述拍摄界面包括第一图像,所述第一图像为所述摄像头实时采集的图像,所述第一图像包括所述第一目标对象,所述第一图像不包括所述第二目标对象;
在所述第一时刻，所述麦克风采集第一音频，所述第一音频包括第一音频信号和第二音频信号，所述第一音频信号对应所述第一目标对象，所述第二音频信号对应所述第二目标对象；
    检测到对所述拍摄界面的第一控件的第二操作;
    响应于第二操作,停止拍摄,保存第一视频,其中,
    所述第一视频的第一时刻处包括第一图像和第二音频,所述第二音频包括第一音频信号和第三音频信号,所述第三音频信号是所述电子设备对第二音频信号进行处理得到的,所述第三音频信号的能量小于所述第二音频信号的能量。
  2. 根据权利要求1所述的方法,其特征在于,
    所述第一音频还包括第四音频信号,所述第四音频信号为扩散场噪声音频信号;
    所述第二音频还包括第五音频信号,所述第五音频信号为扩散场噪声音频信号,其中,
    所述第五音频信号是所述电子设备对第四音频信号进行处理得到的,所述第五音频信号的能量小于所述第四音频信号的能量。
  3. 根据权利要求2所述的方法,其特征在于,所述第五音频信号是所述电子设备对第四音频信号进行处理得到的,包括:
    对所述第四音频信号进行抑制处理得到第六音频信号;
对所述第六音频信号进行补偿处理得到所述第五音频信号，其中，所述第六音频信号为扩散场噪声音频信号，所述第六音频信号的能量小于所述第四音频信号的能量，且所述第六音频信号的能量小于所述第五音频信号的能量。
  4. 根据权利要求2所述的方法,其特征在于,所述第五音频信号是所述电子设备对第四音频信号进行处理得到的,包括:
    配置所述第四音频信号的增益小于1;
    根据所述第四音频信号和所述第四音频信号的增益,得到所述第五音频信号。
  5. 根据权利要求1-4中任一项所述的方法,其特征在于,所述第三音频信号是所述电子设备对第二音频信号进行处理得到的,包括:
    配置所述第二音频信号的增益小于1;
    根据所述第二音频信号和所述第二音频信号的增益,得到所述第三音频信号。
  6. 根据权利要求1-5中任一项所述的方法,其特征在于,所述第三音频信号是所述电子设备对第二音频信号进行处理得到的,包括:
    所述电子设备计算所述第二音频信号在目标方位内的概率;其中,所述目标方位为录制视频时摄像头的视场角范围内的方位;所述第一目标对象在所述目标方位内,所述第二目标对象不在所述目标方位内;
    所述电子设备根据所述第二音频信号在所述目标方位内的概率,确定所述第二音频信号的增益;其中,若所述第二音频信号在所述目标方位内的概率大于预设概率阈值,则所述第二音频信号的增益等于1;若所述第二音频信号在所述目标方位内的概率小于或等于所述预设概率阈值,则所述第二音频信号的增益小于1;
    所述电子设备根据所述第二音频信号和所述第二音频信号的增益,得到所述第三音频信号。
  7. 根据权利要求1-6中任一项所述的方法,其特征在于,所述方法还包括:
    在所述第一时刻,所述麦克风采集到所述第一音频后,所述电子设备处理所述第一音频得到所述第二音频。
  8. 根据权利要求1-6中任一项所述的方法,其特征在于,所述方法还包括:
    在响应于第二操作,停止拍摄之后,所述电子设备处理所述第一音频得到所述第二音频。
  9. 一种电子设备,其特征在于,所述电子设备包括:
    麦克风;
    摄像头;
    一个或多个处理器;
    存储器;
    通信模块;
    其中,所述麦克风用于采集录像或直播时的声音信号;所述摄像头用于采集录像或直播时的图像信号;所述通信模块用于与外接设备通信;所述存储器中存储有一个或多个计算机程序,所述一个或多个计算机程序包括指令,当所述指令被所述处理器执行时,使得所述电子设备执行如权利要求1-8中任一项所述的方法。
  10. 一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,其特征在于,当所述指令在电子设备上运行时,使得所述电子设备执行如权利要求1-8中任一项所述的方法。
PCT/CN2022/095354 2021-08-12 2022-05-26 一种声音信号处理方法及电子设备 WO2023016053A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP22855039.8A EP4280211A1 (en) 2021-08-12 2022-05-26 Sound signal processing method and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110927121.0 2021-08-12
CN202110927121.0A CN115914517A (zh) 2021-08-12 2021-08-12 一种声音信号处理方法及电子设备

Publications (1)

Publication Number Publication Date
WO2023016053A1 true WO2023016053A1 (zh) 2023-02-16


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104699445A (zh) * 2013-12-06 2015-06-10 华为技术有限公司 一种音频信息处理方法及装置
WO2016183791A1 (zh) * 2015-05-19 2016-11-24 华为技术有限公司 一种语音信号处理方法及装置
CN106653041A (zh) * 2017-01-17 2017-05-10 北京地平线信息技术有限公司 音频信号处理设备、方法和电子设备
CN109036448A (zh) * 2017-06-12 2018-12-18 华为技术有限公司 一种声音处理方法和装置


Also Published As

Publication number Publication date
EP4280211A1 (en) 2023-11-22
CN115914517A (zh) 2023-04-04
