WO2022062531A1 - Method, device and system for acquiring a multi-channel audio signal - Google Patents

Method, device and system for acquiring a multi-channel audio signal

Info

Publication number
WO2022062531A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
target
main
channel
additional
Prior art date
Application number
PCT/CN2021/103110
Other languages
English (en)
French (fr)
Inventor
王文东
Original Assignee
Oppo广东移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oppo广东移动通信有限公司
Priority to EP21870910.3A (EP4220637A4)
Publication of WO2022062531A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones
    • H04R3/005 Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S3/00 Systems employing more than two channels, e.g. quadraphonic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/305 Electronic adaptation of stereophonic audio signals to reverberation of the listening space
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0356 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for synchronising with other signals, e.g. video signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2420/00 Details of connection covered by H04R, not provided for in its groups
    • H04R2420/01 Input selection or mixing for amplifiers or loudspeakers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/01 Multi-channel, i.e. more than two input channels, sound reproduction with two speakers wherein the multi-channel information is substantially preserved
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/15 Aspects of sound capture and related signal processing for recording or reproduction

Definitions

  • the present invention relates to the technical field of audio, and in particular, to a method, device and system for acquiring a multi-channel audio signal.
  • Embodiments of the present invention provide a method, device and system for acquiring multi-channel audio signals, which can use the relationship between distributed audio signals to suppress ambient sound and improve the recording effect of audio signals.
  • an embodiment of the present invention provides a method for acquiring a multi-channel audio signal, including:
  • the target audio signal is obtained by performing ambient sound suppression processing on the first additional audio signal and the main audio signal;
  • Multi-channel rendering is performed on the target audio signal to obtain the target multi-channel audio signal
  • the ambient multi-channel audio signal and the target multi-channel audio signal are mixed to obtain a mixed multi-channel audio signal.
  • a device for acquiring a multi-channel audio signal including:
  • the acquisition module is used to acquire the main audio signal collected when the main device shoots the video of the target shooting object, and perform the first multi-channel rendering to obtain the environmental multi-channel audio signal; and to acquire the audio signal collected by the additional device and determine the first additional audio signal, wherein the distance between the additional device and the target shooting object is less than a first threshold;
  • a processing module for performing ambient sound suppression processing through the first additional audio signal and the main audio signal to obtain a target audio signal
  • Multi-channel rendering is performed on the target audio signal to obtain the target multi-channel audio signal
  • the ambient multi-channel audio signal and the target multi-channel audio signal are mixed to obtain a mixed multi-channel audio signal.
  • a terminal device including: a processor, a memory, and a computer program stored in the memory and runnable on the processor, the computer program, when executed by the processor, implementing the multi-channel audio signal acquisition method of the first aspect.
  • a terminal device comprising: the multi-channel audio signal acquisition device and the main device as in the second aspect,
  • the main device is used to collect the main audio signal when shooting video, and send the main audio signal to the multi-channel audio signal acquisition device.
  • a fifth aspect provides a multi-channel audio signal acquisition system, the system comprising: the multi-channel audio signal acquisition device as in the second aspect, a main device and an additional device, wherein the main device and the additional device each establish a communication connection with the multi-channel audio signal acquisition device;
  • the main device is used to collect the main audio signal when shooting video, and send the main audio signal to the multi-channel audio signal acquisition device;
  • an additional device for collecting the second additional audio signal and sending the second additional audio signal to the multi-channel audio signal acquisition device
  • the distance between the additional device and the target shooting object is less than a first threshold.
  • a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for acquiring a multi-channel audio signal according to the first aspect.
  • the main audio signal collected when the main device shoots the video can be acquired, and multi-channel rendering can be performed on it to obtain the environmental multi-channel audio signal; the first additional audio signal is determined; environmental sound suppression processing is performed using the first additional audio signal and the main audio signal to obtain the target audio signal; multi-channel rendering is performed on the target audio signal to obtain the target multi-channel audio signal; and the environmental multi-channel audio signal and the target multi-channel audio signal are mixed to obtain a mixed multi-channel audio signal.
  • in this solution, distributed audio signals are obtained from the main device and the additional device, and the relationship between them is used: environmental sound suppression processing is performed using the first additional audio signal (obtained from the audio signal collected by the additional device) and the main audio signal collected by the main device, which suppresses environmental sound during recording and yields the target multi-channel audio signal. The environmental multi-channel audio signal (obtained by multi-channel rendering of the main audio signal) is then mixed with the target multi-channel audio signal. This not only mixes the distributed audio signals and simulates a point-like auditory target in the spatial sound field, but also suppresses environmental sound, so the recording effect of the audio signal can be improved.
  • FIG. 1 is a schematic diagram of a multi-channel audio signal acquisition system provided by an embodiment of the present invention
  • FIG. 2A is a schematic diagram 1 of a method for acquiring a multi-channel audio signal provided by an embodiment of the present invention
  • FIG. 2B is a schematic interface diagram of a terminal device provided by an embodiment of the present invention.
  • FIG. 3 is a schematic diagram 2 of a method for acquiring a multi-channel audio signal provided by an embodiment of the present invention
  • FIG. 4 is a schematic diagram of a device for acquiring a multi-channel audio signal provided by an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a terminal device provided by an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present invention.
  • words such as “exemplary” or “for example” are used to mean serving as an example, illustration or explanation. Any embodiment or design described as “exemplary” or “such as” in the embodiments of the present invention should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as “exemplary” or “such as” is intended to present the related concepts in a specific manner.
  • the meaning of “plurality” refers to two or more.
  • Embodiments of the present invention provide a method, device, and system for acquiring a multi-channel audio signal, which can be applied in a video shooting scene, especially in a situation with multiple sound sources or a relatively noisy environment for video shooting.
  • the distributed audio signals are mixed, simulating a point-like auditory target in the spatial sound field while also suppressing the ambient sound, so that the recording effect of the audio signal can be improved.
  • FIG. 1 is a schematic diagram of a multi-channel audio signal acquisition system provided by an embodiment of the present invention. The system may include a main device, an additional device, and an audio processing device (which may be the multi-channel audio signal acquisition device in the embodiment of the present invention). The additional device in FIG. 1 is a TWS Bluetooth headset, which can be used to collect audio streams (that is, the additional audio signals in the embodiment of the present invention), and the main device can be used to collect video streams and audio streams (that is, the main audio signal in the embodiment of the present invention).
  • the audio processing device may include the following modules: object tracking, scene sound source classification, delay compensation, adaptive filtering, spatial filtering, binaural rendering and mixer, etc. The specific function of each module is described in conjunction with the multi-channel audio signal acquisition method in the following embodiments and is not repeated here.
  • the main device and the audio processing device in the embodiment of the present invention may be two independent devices.
  • the main device and the audio processing device may also be one integrated device, for example, may be a terminal device that integrates the functions of the main device and the audio processing device.
  • the additional device and the terminal device, or the additional device and the audio processing device, may be connected through wireless communication, for example through a Bluetooth connection or a WiFi connection; this is not specifically limited in the embodiments of the present invention.
  • the terminal device in the embodiment of the present invention may include: a mobile phone, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a handheld computer, a netbook, a personal digital assistant (PDA), and wearable devices (such as watches, wristbands, glasses, helmets, headbands, etc.).
  • the specific form of the terminal devices is not particularly limited in this embodiment of the present application.
  • the additional device may be a terminal device independent of the main device and the audio processing device; it may be a portable terminal device, for example a Bluetooth headset or a wearable device (such as a watch, wristband, glasses, helmet or headband).
  • the main device can shoot video, obtain the main audio signal and send it to the audio processing device, while the additional device, which is relatively close to a target shooting object in the video shooting scene (for example, the distance between the two is less than the first threshold), obtains the additional audio signal and sends it to the audio processing device.
  • the target shooting object may be a certain person or a certain musical instrument in the video shooting scene.
  • the target shooting object can be any shooting object.
  • FIG. 2A is a schematic diagram of a method for acquiring a multi-channel audio signal provided in an embodiment of the present invention.
  • the method may be executed by the audio processing device (i.e., the multi-channel audio signal acquisition device) of the system described above, or by a terminal device; in the latter case, the main device may be a functional module or functional entity in the terminal device that collects audio and video.
  • in the following, the terminal device is used as the execution subject for exemplary description.
  • the method includes:
  • the distance between the target shooting object and the additional device may be smaller than the first threshold.
  • the user can place the additional device on the target shooting object to be tracked, start video shooting on the terminal device, and select the target shooting object by tapping it in the video content displayed on the screen.
  • the sound pickup module on the main device and the sound pickup module on the additional device can then start recording and collect audio signals.
  • the sound pickup module on the main device may be a microphone array, and the main audio signal is collected through the microphone array.
  • the sound pickup module on the additional device can be a microphone.
  • FIG. 2B may be a schematic diagram of an interface of a terminal device, in which video content is displayed on the screen of the terminal device.
  • on a mobile phone, the user can tap the displayed person 21 in the interface to determine the person 21 as the target shooting object; the person 21 can carry a Bluetooth headset (that is, the above-mentioned additional device) to collect the audio signal near the person 21 and send it to the terminal device.
  • multi-channel may refer to two-channel, four-channel, 5.1 or more channels.
  • the main audio signal can be binaurally rendered through a head related transfer function (HRTF) to obtain an ambient binaural audio signal.
  • the binaural renderer in FIG. 1 may be used to perform binaural rendering on the main audio signal to obtain an ambient binaural audio signal.
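As a rough illustration of HRTF-based binaural rendering, the sketch below filters one mono channel with a pair of head-related impulse responses (the time-domain form of an HRTF). The two-tap impulse responses and the function names are made up for the example and are not taken from the patent; a real renderer would use measured responses and would also handle the multi-microphone main signal.

```python
def convolve(signal, kernel):
    """Full linear convolution of two sequences (direct form)."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def binaural_render(mono, hrir_left, hrir_right):
    """Render a mono signal to (left, right) by filtering with an HRIR pair."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy HRIR pair (hypothetical values): a source on the listener's left
# reaches the right ear one sample later and attenuated.
hrir_left = [1.0, 0.0]
hrir_right = [0.0, 0.6]

left, right = binaural_render([1.0, 0.5, 0.25], hrir_left, hrir_right)
```

The interaural delay and level difference encoded in the two impulse responses are what give the rendered signal its perceived direction.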
  • acquiring an audio signal collected by an additional device on the target object, and determining the first additional audio signal may include two implementations:
  • a first implementation manner acquiring a second additional audio signal collected by an additional device on the target photographic object, and determining the second additional audio signal as the first additional audio signal;
  • the second implementation manner acquiring the second additional audio signal collected by the additional device on the target photographic object, and aligning the second additional audio signal with the main audio signal in the time domain to obtain the first additional audio signal.
  • the system delay can be obtained by testing.
  • the actual delay may be obtained from the estimated sound wave propagation delay (that is, the delay between the above-mentioned main audio signal and the second additional audio signal) in combination with the system delay, and the main audio signal is aligned with the second additional audio signal according to the actual delay to obtain the first additional audio signal.
  • the delay compensator in FIG. 1 can be used to align the additional audio signal with the main audio signal in the time domain according to the time delay between the main audio signal and the second additional audio signal to obtain the first additional audio signal.
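The time-domain alignment can be sketched as follows. The text does not say how the delay is estimated or how the system delay is combined with it, so this example assumes a plain cross-correlation search and a simple sum; the function names, the toy signals and the zero system delay are all hypothetical.

```python
def cross_correlation(main, additional, lag):
    """Correlation of main[n + lag] with additional[n] over their overlap."""
    total = 0.0
    for n, a in enumerate(additional):
        m = n + lag
        if 0 <= m < len(main):
            total += a * main[m]
    return total


def estimate_delay(main, additional, max_lag):
    """Lag (in samples) at which main best matches additional."""
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: cross_correlation(main, additional, lag))


def shift(signal, lag):
    """Delay (lag > 0) or advance (lag < 0) a signal, keeping its length."""
    n = len(signal)
    if lag >= 0:
        return ([0.0] * lag + list(signal))[:n]
    return (list(signal[-lag:]) + [0.0] * (-lag))[:n]


# Toy data: the main device hears the same pulse 3 samples later.
additional = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
main = [0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0]

propagation_lag = estimate_delay(main, additional, max_lag=4)
system_lag = 0  # hypothetical measured system delay, in samples
actual_lag = propagation_lag + system_lag  # assumed combination: a simple sum
first_additional = shift(additional, actual_lag)
```

In practice the search would run on short frames and the sign convention of the lag depends on which stream arrives later, which this toy example does not model.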
  • depending on whether the target shooting object is within the shooting field of view of the main device, the ambient sound suppression processing performed using the first additional audio signal and the main audio signal obtains the target audio signal in different ways.
  • (1) when the target shooting object is within the shooting field of view of the main device, spatial filtering is performed on the main audio signal for the area outside the shooting field of view to obtain a reverse focus audio signal; using the reverse focus audio signal as a reference signal, adaptive filtering is performed on the first additional audio signal to obtain the target audio signal.
  • in this case, spatially filtering the main audio signal for the area outside the shooting field of view suppresses the sound components of the target shooting object contained in the main audio signal and yields a purer ambient sound audio signal. The reverse focus audio signal is then used as a reference signal for adaptive filtering of the first additional audio signal, which can further suppress the ambient sound in the additional audio signal.
  • (2) when the target shooting object is outside the shooting field of view of the main device, spatial filtering is performed on the main audio signal for the area within the shooting field of view to obtain a focused audio signal; the first additional audio signal is used as a reference signal, and adaptive filtering is performed on the focused audio signal to obtain the target audio signal.
  • in this case, spatially filtering the main audio signal for the area within the shooting field of view suppresses part of the ambient sound in the main audio signal. Adaptive filtering of the focused audio signal with the first additional audio signal as the reference signal can then further suppress the ambient sound from outside the focus area that cannot be completely suppressed in the focused audio signal, especially the sound component from the location of the target shooting object contained in the ambient sound.
  • the spatial filter in FIG. 1 can be used to spatially filter the main audio signal to obtain a directionally enhanced audio signal.
  • when the target shooting object is within the shooting field of view of the main device, the main purpose of spatial filtering is to obtain a purer ambient audio signal, so the target area of spatial filtering is the area outside the shooting field of view, and the obtained signal is called the reverse focus audio signal. When the target shooting object is outside the shooting field of view of the main device, a close-up audio signal of the area within the shooting field of view is needed, so the target area of spatial filtering is the area within the shooting field of view, and the obtained signal is the focused audio signal.
  • the spatial filtering method may be a beamforming-based method, such as a minimum variance distortionless response (MVDR) method or a generalized sidelobe canceller (GSC) method.
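To make the MVDR option concrete, here is a minimal single-frequency sketch for a two-microphone array using the standard weight formula w = R⁻¹d / (dᴴR⁻¹d). The array geometry, frequency, interferer direction and loading factor are illustrative assumptions, not parameters from the patent.

```python
import cmath
import math


def steering_vector(freq_hz, mic_spacing_m, angle_rad, speed_of_sound=343.0):
    """Far-field steering vector of a two-microphone array for one frequency."""
    delay = mic_spacing_m * math.sin(angle_rad) / speed_of_sound
    return [1.0 + 0.0j, cmath.exp(-2j * math.pi * freq_hz * delay)]


def invert_2x2(m):
    """Inverse of a 2x2 complex matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mvdr_weights(noise_cov, steer):
    """MVDR weights w = R^-1 d / (d^H R^-1 d) for one frequency bin."""
    r_inv = invert_2x2(noise_cov)
    r_inv_d = [sum(r_inv[i][j] * steer[j] for j in range(2)) for i in range(2)]
    denom = sum(steer[i].conjugate() * r_inv_d[i] for i in range(2))
    return [x / denom for x in r_inv_d]


def beam_response(weights, steer):
    """Beamformer response w^H d toward a given steering vector."""
    return sum(w.conjugate() * s for w, s in zip(weights, steer))


# Look straight ahead; one interferer at 60 degrees plus a small diagonal loading.
look = steering_vector(1000.0, 0.05, 0.0)
interferer = steering_vector(1000.0, 0.05, math.pi / 3)
noise_cov = [
    [interferer[i] * interferer[j].conjugate() + (0.01 if i == j else 0.0)
     for j in range(2)]
    for i in range(2)
]
weights = mvdr_weights(noise_cov, look)
look_gain = abs(beam_response(weights, look))
interferer_gain = abs(beam_response(weights, interferer))
```

The look direction passes with unit gain while the interferer is strongly attenuated; steering the look direction inside or outside the field of view would correspond to the focused and reverse focus signals respectively.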
  • two groups of adaptive filters are included, and they are used to obtain the target audio signals in the above two cases respectively.
  • only one group of adaptive filters is enabled at a time, according to whether the target shooting object is within the shooting field of view.
  • when the target shooting object is within the shooting field of view of the main device, the adaptive filter acting on the first additional audio signal is enabled, and the reverse focus audio signal is input as a reference signal to further suppress ambient sound in the first additional audio signal, so that sound near the target shooting object is more prominent.
  • when the target shooting object is outside the shooting field of view of the main device, the adaptive filter acting on the focused audio signal is enabled, and the first additional audio signal is input as a reference signal to further suppress sound from outside the shooting field of view in the focused audio signal, especially sound from the location of the target shooting object.
  • the adaptive filtering method may be a least mean square (LMS) method or the like.
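A minimal LMS noise canceller for case (1) might look like the sketch below: the first additional audio signal is the primary input and the reverse focus signal is the reference. The tap count, step size, synthetic signals and leakage path are illustrative assumptions only.

```python
import math
import random


def lms_cancel(primary, reference, num_taps=8, step=0.005):
    """Subtract the part of `primary` that is predictable from `reference`."""
    weights = [0.0] * num_taps
    history = [0.0] * num_taps  # most recent reference samples, newest first
    output = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate  # the error is the cleaned output sample
        weights = [w + step * error * h for w, h in zip(weights, history)]
        output.append(error)
    return output


def mean_power(samples):
    return sum(s * s for s in samples) / len(samples)


# Synthetic data: ambient noise leaks into the additional microphone
# through a short FIR path (hypothetical coefficients).
rng = random.Random(0)
n = 4000
ambient = [rng.gauss(0.0, 1.0) for _ in range(n)]
target = [math.sin(2 * math.pi * 0.01 * i) for i in range(n)]
leak = [0.5, -0.3, 0.2]
leaked = [sum(leak[k] * ambient[i - k] for k in range(len(leak)) if i - k >= 0)
          for i in range(n)]
additional = [t + a for t, a in zip(target, leaked)]

cleaned = lms_cancel(additional, ambient)

half = n // 2  # compare after the filter has had time to converge
noise_before = mean_power(leaked[half:])
noise_after = mean_power([c - t for c, t in zip(cleaned, target)][half:])
```

For case (2) the roles would swap: the focused audio signal becomes the primary input and the first additional audio signal the reference.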
  • the three groups of binaural renderers in FIG. 1 act on the main audio signal, the target audio signal after adaptive filtering in the above case (1), and the target audio signal after adaptive filtering in the above case (2), respectively, to obtain three sets of binaural signals: the ambient binaural signal, the additional binaural signal and the focused binaural signal.
  • the binaural renderer acting on the target audio signal of case (1) and the binaural renderer acting on the target audio signal of case (2) are not enabled at the same time; which one is enabled depends on whether the target shooting object is within the shooting field of view of the main device.
  • the binaural renderer acting on the main audio signal is always enabled.
  • when the target shooting object is within the shooting field of view of the main device, the binaural renderer acting on the target audio signal obtained in case (1) is enabled.
  • when the target shooting object is outside the shooting field of view of the main device, the binaural renderer acting on the target audio signal obtained in case (2) is enabled.
  • the above binaural renderer may contain a decorrelator and a convolver, and an HRTF corresponding to the target position is required to simulate the perception of the auditory target in the desired direction and distance.
  • the scene sound source classification module can be used to determine rendering rules according to the determined current scene and the sound source type of the target shooting object; the determined rendering rules are applied to the decorrelator to obtain different rendering styles, and the azimuth and distance between the additional device and the main device can be used to control HRTF generation.
  • the HRTF corresponding to a specific location can be obtained by interpolating on a pre-stored set of HRTFs, or it can be obtained using a deep neural network (DNN) based approach.
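Interpolating on a pre-stored HRTF set can be sketched, under strong simplifications, as a linear blend of the two stored impulse responses nearest in azimuth. The table layout, the two-tap responses and the function name are hypothetical; practical systems usually interpolate in more careful ways (for example, treating delay and magnitude separately) and also account for elevation and distance.

```python
def interpolate_hrir(hrir_table, azimuth_deg):
    """Linearly blend the stored HRIRs at the two nearest azimuths.

    hrir_table maps an azimuth in degrees, in [0, 360), to an impulse response.
    """
    angles = sorted(hrir_table)
    azimuth_deg %= 360.0
    # Nearest stored angles below and above, wrapping around the circle.
    lower = max((a for a in angles if a <= azimuth_deg), default=angles[-1] - 360.0)
    upper = min((a for a in angles if a > azimuth_deg), default=angles[0] + 360.0)
    frac = (azimuth_deg - lower) / (upper - lower)
    low_ir = hrir_table[lower % 360.0]
    high_ir = hrir_table[upper % 360.0]
    return [(1.0 - frac) * lo + frac * hi for lo, hi in zip(low_ir, high_ir)]


# Toy table with four stored directions (hypothetical two-tap responses).
table = {
    0: [1.0, 0.0],
    90: [0.0, 1.0],
    180: [0.5, 0.5],
    270: [0.2, 0.8],
}
front_right = interpolate_hrir(table, 45.0)
wrapped = interpolate_hrir(table, 315.0)
```

The same lookup would be run once per ear, with the azimuth supplied by the target tracking stage.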
  • mixing the environmental multi-channel audio signal and the target multi-channel audio signal refers to adding the two signals according to a gain. Specifically, the signal sampling points of the environmental multi-channel audio signal and the corresponding signal sampling points of the target multi-channel audio signal are added according to the gain.
  • the gain may be a preset fixed value or a variable gain.
  • variable gain may be specifically determined according to the shooting field of view.
  • the mixer in FIG. 1 is used to mix two of the aforementioned three sets of binaural signals.
  • when the target shooting object is within the shooting field of view of the main device, the ambient binaural signal and the additional binaural signal are mixed; when the target shooting object is outside the shooting field of view of the main device, the ambient binaural signal and the focused binaural signal are mixed.
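The gain-weighted mixing and the choice between the additional and focused binaural signals can be sketched as below. The function names, the default gains and the boolean `target_in_view` flag are assumptions made for the example; the text only states that the gain may be fixed or variable.

```python
def mix(ambient, target, ambient_gain=1.0, target_gain=1.0):
    """Sample-wise weighted sum of two multi-channel signals.

    Each signal is a list of channels; each channel is a list of samples.
    """
    return [
        [ambient_gain * a + target_gain * t for a, t in zip(amb_ch, tgt_ch)]
        for amb_ch, tgt_ch in zip(ambient, target)
    ]


def select_and_mix(ambient, additional, focused, target_in_view, target_gain=1.0):
    """Mix the ambient signal with whichever target rendering is active."""
    target = additional if target_in_view else focused
    return mix(ambient, target, target_gain=target_gain)


# Two-channel toy signals (left, right), two samples each.
ambient = [[1.0, 2.0], [3.0, 4.0]]
additional = [[10.0, 20.0], [30.0, 40.0]]
focused = [[0.0, 0.0], [2.0, 2.0]]

in_view_mix = select_and_mix(ambient, additional, focused, True, target_gain=0.5)
out_of_view_mix = select_and_mix(ambient, additional, focused, False)
```

A variable gain tied to the shooting field of view could be passed in as `target_gain` per frame.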
  • in summary, the main audio signal collected when the main device shoots the video can be acquired, and the first multi-channel rendering can be performed on it to obtain the environmental multi-channel audio signal; the audio signal collected by the additional device is acquired, and the first additional audio signal is determined; ambient sound suppression processing is performed using the first additional audio signal and the main audio signal to obtain the target audio signal; the second multi-channel rendering is performed on the target audio signal to obtain the target multi-channel audio signal; and the environmental multi-channel audio signal and the target multi-channel audio signal are mixed to obtain a mixed multi-channel audio signal.
  • in this way, distributed audio signals are obtained from the main device and the additional device, and the relationship between them is used to suppress environmental sound during recording: the first additional audio signal (obtained from the audio signal collected by the additional device) and the main audio signal collected by the main device are subjected to environmental sound suppression processing to obtain the target multi-channel audio signal, which is then mixed with the environmental multi-channel audio signal (obtained by multi-channel rendering of the main audio signal). This not only mixes the distributed audio signals and simulates a point-like auditory target in the spatial sound field, but also suppresses the ambient sound, so that the recording effect of the audio signal can be improved.
  • an embodiment of the present invention further provides a method for acquiring a multi-channel audio signal, which includes:
  • the terminal device can perform the above 301 and 302, and the terminal device can continuously respond to the change of the main device's shooting field of view and track the movement of the target object in the shooting field of view.
  • the video data (including the main audio signal) captured by the main device and the second additional audio signal collected by the additional device may be acquired.
  • the current scene category and the target shooting object category may be determined according to the above-mentioned video data and/or the second additional audio signal, and a rendering rule matching the current scene category and the target shooting object category may be determined. Multi-channel rendering is then performed on the subsequent audio signals according to the determined rendering rule.
  • perform multi-channel rendering on the target audio signal according to the determined rendering rule to obtain the target multi-channel audio signal which may include:
  • multi-channel rendering is performed on the target audio signal to obtain the target multi-channel audio signal.
  • perform multi-channel rendering on the main audio signal according to the determined rendering rule to obtain an environmental multi-channel audio signal which may include:
  • the first multi-channel rendering is performed on the main audio signal according to the second rendering rule matching the current scene category, so as to obtain the environmental multi-channel audio signal.
  • the scene sound source classification module can include two paths, one using video stream information and the other using audio stream information. Both paths consist of a scene analyzer and a vocal/instrument classifier.
  • the scene analyzer can analyze the type of space where the current user is located from video or audio, such as small room, medium room, large room, concert hall, stadium, outdoor, etc.
  • the vocal/instrument classifier analyzes the types of sound sources near the current target object from the video or audio, such as male, female, children or accordion, guitar, bass, piano, keyboard and percussion.
  • both the scene analyzer and the vocal/instrument classifier can be DNN-based methods.
  • the video input is the image of each frame, and the audio input can be the Mel spectrum of the sound or the Mel-frequency cepstral coefficients (MFCC).
  • the rendering rules to be used in the next binaural rendering module can also be determined according to the spatial scene analysis and the results obtained by the vocal/instrument classifier, combined with the user's preference settings.
  • the above-mentioned first multi-channel transfer function may be an HRTF function.
  • the binaural renderer in FIG. 1 may have a set of preset HRTF functions and binaural rendering methods; the preset HRTF function is determined according to the microphone array on the main device, and the HRTF is used to binaurally render the main audio signal to obtain an ambient binaural audio signal.
  • the target tracking module in FIG. 1 consists of a visual target tracker and an audio target tracker, and can use visual data and/or audio signals to determine the position of the target shooting object and to estimate the azimuth and distance between the target shooting object and the main device.
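  • when the target shooting object is within the shooting field of view of the main device, the visual data and the audio signals can both be used to determine the position of the target shooting object; in this case the visual target tracker and the audio target tracker are enabled at the same time.
  • when the target shooting object is outside the shooting field of view of the main device, the audio signal can be used to determine the position of the target shooting object, and only the audio target tracker is enabled.
  • alternatively, only one of the visual data and the audio signals may be used to determine the position of the target shooting object.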
  • the first distance is the target distance between the target photographed object and the main device determined last time.
  • the sound source direction finding and beamformer can be used to perform beamforming processing on the main audio signal towards the target azimuth to obtain a beamforming signal, and the delay estimator then determines the first time delay between the beamforming signal and the second additional audio signal.
  • the video data obtained at this time includes the target object.
  • the position of the target photographed object in the video frame can be combined with prior information such as the camera parameters (for example, focal length) and the zoom scale (different shooting fields of view correspond to different zoom scales) to obtain the above-mentioned first azimuth angle.
  • the audio signal can also be used to estimate the azimuth angle and the distance between the target photographed object and the main device, to obtain the second azimuth angle.
  • the target azimuth angle is obtained by smoothing the first azimuth angle and the second azimuth angle.
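As a rough sketch of how a first azimuth angle might be derived from the position of the target in the frame, the Python example below assumes an ideal pinhole camera whose focal length is expressed in pixels and scaled by the zoom factor. The function name, the parameters, and the pinhole model itself are assumptions for illustration; the text above does not specify the geometry.

```python
import math


def azimuth_from_pixel(x_pixel, image_width, focal_length_px, zoom=1.0):
    """Horizontal angle in degrees between the optical axis and a pixel column.

    Assumes an ideal pinhole camera; zooming in by `zoom` is modeled as
    multiplying the effective focal length by the same factor.
    """
    offset = x_pixel - image_width / 2.0  # pixels right of the image center
    return math.degrees(math.atan2(offset, focal_length_px * zoom))


# A target at the image center lies on the optical axis.
center_angle = azimuth_from_pixel(960, 1920, 1000.0)
# The same pixel offset corresponds to a smaller angle at a higher zoom level.
wide_angle = azimuth_from_pixel(1460, 1920, 1000.0, zoom=1.0)
tele_angle = azimuth_from_pixel(1460, 1920, 1000.0, zoom=2.0)
```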
  • a rough distance estimation can be performed to obtain the above-mentioned second distance.
  • from the second distance, the speed of sound, and the predicted system delay, the above-mentioned second time delay can be obtained; the delay between the second additional audio signal and the main audio signal (i.e., the first time delay) is also calculated. By smoothing the first time delay and the second time delay, the target delay can be obtained.
  • the smoothing process may refer to averaging. When the target azimuth angle is obtained by smoothing the first azimuth angle and the second azimuth angle, the average of the first azimuth angle and the second azimuth angle can be used as the target azimuth angle; likewise, when the target delay is obtained by smoothing the first delay and the second delay, the average of the first delay and the second delay can be used as the target delay.
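The two steps above (deriving a delay from a distance, then smoothing by averaging) can be sketched as follows. The speed-of-sound value, the function names, and the choice to treat the system delay as a simple additive term are assumptions for this example; the sign convention of the system delay is not stated in the text.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C (assumed value)


def delay_from_distance(distance_m, system_delay_s=0.0):
    """Acoustic propagation delay plus an assumed additive system delay."""
    return distance_m / SPEED_OF_SOUND + system_delay_s


def smooth(first_estimate, second_estimate):
    """Smoothing by plain averaging of two estimates."""
    return 0.5 * (first_estimate + second_estimate)


# Second delay from a video-based distance estimate of 3.43 m.
second_delay = delay_from_distance(3.43, system_delay_s=0.005)
# Target delay: average of an audio-based first delay and the second delay.
target_delay = smooth(0.017, second_delay)
# Target azimuth: average of the first and second azimuth angles (degrees).
target_azimuth = smooth(28.0, 32.0)
```

Plain averaging of angles is only safe when both estimates are far from the ±180° wrap-around; a circular mean would be needed otherwise.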
  • the visual target tracker in FIG. 1 can use the captured video to detect the target azimuth and target distance between the target photographed object and the main device.
  • the advantage of using a visual target tracker is that its tracking results are more accurate than audio target trackers in noisy environments or when there are a large number of sound sources.
  • the visual target tracker and the audio target tracker are simultaneously used to detect the target azimuth and target distance between the target photographed object and the main device, which can further improve the accuracy.
  • the first distance is the target distance between the target photographed object and the main device determined last time.
  • the active time of the audio signal refers to a time period in which a valid audio signal exists in the audio signal.
  • the first active time of the second additional audio signal may refer to the time period in which a valid audio signal is present in the second additional audio signal.
  • the valid audio signal may refer to a human voice, a musical instrument sound, or the like. Exemplarily, it may be the sound of the target photographed object.
  • the time delay between the second additional audio signal and the main audio signal may be determined according to the first distance and the speed of sound; then, according to this time delay and the first active time, the second active time in the main audio signal that corresponds to the first active time may be determined.
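A minimal sketch of this mapping, assuming that the sound of the target reaches the nearby additional device first and reaches the main device one propagation delay later, so the active interval is shifted forward in time. The function name and the omission of any system delay are assumptions for illustration.

```python
def map_active_time(first_active, first_distance_m, speed_of_sound=343.0):
    """Shift an active interval from the additional signal to the main signal.

    first_active     -- (start_s, end_s) of valid audio in the additional signal
    first_distance_m -- last known target distance in meters
    """
    start, end = first_active
    delay = first_distance_m / speed_of_sound  # propagation delay in seconds
    return (start + delay, end + delay)


# Valid audio between 1.0 s and 2.0 s, target last seen 34.3 m away.
second_active = map_active_time((1.0, 2.0), 34.3)
```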
  • the video data obtained at this time does not include the target shooting object, and an audio signal can be used to determine the position of the target shooting object at this time.
  • the audio target tracker can use the main audio signal and the additional audio signal to estimate the target azimuth and target distance between the target photographed object and the main device, which can specifically include sound source direction finding, beamforming, delay estimation, and so on.
  • the target azimuth can be obtained by estimating the direction of arrival (DOA) of the main audio signal.
  • the second additional audio signal can be analyzed first to obtain the time during which a valid audio signal is present in it (which may refer to the active part of the audio signal containing the sound of the target photographed object), that is, the above-mentioned first active time; then, according to the previously estimated target distance, the delay between the second additional audio signal and the main audio signal (that is, the first delay) is obtained, and the first active time is mapped to the corresponding second active time in the main audio signal.
  • DOA estimation is performed to obtain the azimuth angle between the target photographed object and the main device, and this azimuth angle is used as the above-mentioned target azimuth angle.
  • the generalized cross-correlation (GCC) method with phase transform (PHAT) weighting
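As a hedged illustration of GCC-PHAT delay estimation, the sketch below forms the cross-spectrum of two signals, normalizes it to unit magnitude (the PHAT weighting), transforms it back to the lag domain, and picks the lag with the largest value. It uses a naive DFT in pure Python for readability; the names `dft` and `gcc_phat_delay` and the `max_lag` search window are assumptions for this example, and a practical implementation would use an FFT library.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def gcc_phat_delay(sig, ref, max_lag):
    """Return the lag in samples by which `sig` lags `ref`."""
    n = len(sig) + len(ref)  # zero-pad so circular correlation does not wrap
    spec_sig = dft(list(sig) + [0.0] * (n - len(sig)))
    spec_ref = dft(list(ref) + [0.0] * (n - len(ref)))
    cross = [s * r.conjugate() for s, r in zip(spec_sig, spec_ref)]
    # PHAT weighting: keep only the phase of the cross-spectrum.
    whitened = [c / abs(c) if abs(c) > 1e-12 else 0.0 for c in cross]
    correlation = [v.real for v in dft(whitened, inverse=True)]
    lags = range(-max_lag, max_lag + 1)
    # Negative lags live at the end of the circular correlation buffer.
    return max(lags, key=lambda lag: correlation[lag % n])
```

A positive result means `sig` lags `ref`; dividing the lag by the sampling rate gives the delay in seconds.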
  • the multi-channel main audio signal is passed through a fixed-direction beamformer to obtain a beamforming signal, with directional enhancement toward the above-mentioned target azimuth angle, to improve the delay estimation performed next.
  • the beamforming method can be delay-and-sum, or minimum variance distortionless response (MVDR).
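A minimal delay-and-sum sketch is given below for a uniform linear microphone array, using integer sample delays. The array geometry, the angle convention (0° is broadside, positive angles mean the wavefront reaches higher-indexed microphones later), and the function names are assumptions for illustration; the text does not specify the array layout of the main device.

```python
import math


def steering_delays(num_mics, spacing_m, azimuth_deg, sample_rate, c=343.0):
    """Integer sample delays of each microphone relative to the earliest one."""
    tau = [m * spacing_m * math.sin(math.radians(azimuth_deg)) / c
           for m in range(num_mics)]
    earliest = min(tau)
    return [round((t - earliest) * sample_rate) for t in tau]


def delay_and_sum(channels, delays):
    """Advance each channel by its delay, then average across channels."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(length)
    ]


# Two microphones; the pulse reaches the second one two samples later.
mic0 = [0, 0, 1, 0, 0, 0, 0, 0]
mic1 = [0, 0, 0, 0, 1, 0, 0, 0]
beam = delay_and_sum([mic0, mic1], [0, 2])
```

Signals arriving from the steered direction add coherently after alignment, while signals from other directions are attenuated by the averaging.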
  • the estimation of the time difference of arrival (TDOA) is also performed only during the active time of the second additional audio signal. According to the first delay, the speed of sound, and the predicted system delay, the distance between the target photographed object and the main device, that is, the above-mentioned target distance, can be obtained.
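The distance computation described above can be sketched as below, assuming that the predicted system delay is simply subtracted from the measured first delay before multiplying by the speed of sound. That sign convention, the speed-of-sound value, and the function name are assumptions for this example.

```python
def distance_from_delay(first_delay_s, system_delay_s=0.0, speed_of_sound=343.0):
    """Target distance in meters from a measured delay in seconds."""
    acoustic_delay = first_delay_s - system_delay_s
    # A negative acoustic delay is not physical, so clamp the result at zero.
    return max(0.0, acoustic_delay * speed_of_sound)


# 20 ms measured delay, of which 10 ms is attributed to the system.
target_distance = distance_from_delay(0.02, system_delay_s=0.01)
```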
  • the first time delay is used as the target time delay between the main audio signal and the second additional audio signal, and the second additional audio signal is aligned with the main audio signal in the time domain according to the first time delay, resulting in a first additional audio signal.
  • the delay compensator in FIG. 1 can align the second additional audio signal with the main audio signal in the time domain according to the above-mentioned first delay to obtain the first additional audio signal.
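Time-domain alignment of this kind can be sketched as a sample shift with zero padding. The example assumes the delay is rounded to a whole number of samples and that a positive delay means the additional signal must be delayed to line up with the main signal; fractional-delay interpolation is not shown, and the function name is hypothetical.

```python
def align_to_main(additional, delay_s, sample_rate):
    """Shift the additional signal by `delay_s`, keeping its original length."""
    shift = round(delay_s * sample_rate)
    n = len(additional)
    if shift >= 0:
        # Delay: prepend zeros and drop the samples pushed past the end.
        return ([0.0] * shift + list(additional))[:n]
    # Advance: drop leading samples and pad the end with zeros.
    return (list(additional)[-shift:] + [0.0] * (-shift))[:n]


# Delay a four-sample signal by 2 ms at a 1 kHz sampling rate (two samples).
aligned = align_to_main([1.0, 2.0, 3.0, 4.0], 0.002, 1000)
```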
  • when the target photographed object is within the shooting field of view, the main purpose of spatial filtering is to obtain a purer ambient audio signal, so the target area of spatial filtering is outside the shooting field of view, and the obtained signal is hereinafter referred to as the reverse focus audio signal; when the target photographed object is outside the shooting field of view, the close-up audio signal within the shooting field of view needs to be obtained through spatial filtering, so the target area of spatial filtering is within the shooting field of view, and the resulting signal is hereinafter referred to as the focus audio signal.
  • the change of the shooting field of view of the main device can be followed, so that the local audio signal is directionally enhanced.
  • two sets of adaptive filters act on the focused audio signal and the additional audio signal, respectively. Only one set of adaptive filters is enabled based on changes in the target's field of view.
  • the adaptive filter acting on the additional audio signal is activated, and the reverse focus audio signal is input as the reference signal to further suppress the ambient sound from the additional audio signal, so that the sound of the target photographed object is more prominent.
  • an adaptive filter is activated on the focus audio signal, and an additional audio signal is input as a reference signal to further suppress sounds outside the field of view from the focus audio signal.
  • the method of adaptive filtering can be least mean squares (LMS) and so on.
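A minimal LMS sketch of this kind of reference-based suppression follows. The primary input is the signal to be cleaned (for example, the additional audio signal) and the reference input carries the sound to be removed (for example, the reverse focus audio signal). The tap count, step size, and function name are assumptions chosen for the example and would need tuning in practice.

```python
def lms_cancel(primary, reference, num_taps=8, mu=0.05):
    """Subtract the part of `primary` that is predictable from `reference`."""
    weights = [0.0] * num_taps
    history = [0.0] * num_taps  # most recent reference sample first
    output = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate  # residual after removing the estimated leakage
        # LMS update: move the weights along the error gradient.
        weights = [w + mu * error * h for w, h in zip(weights, history)]
        output.append(error)
    return output
```

The returned error sequence is the filtered output: once the filter has converged, components correlated with the reference are attenuated and the uncorrelated target sound remains.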
  • a mixed gain controller can determine the mixing gains according to the user's shooting field of view, that is, the proportion of the two sets of signals in the mixed signal. For example, when the zoom level of the camera is increased, that is, the field of view is reduced, the gain of the ambient binaural audio signal decreases, and the gain of the additional binaural audio signal (i.e., the target multi-channel audio signal determined when the target photographed object is within the field of view) or of the focused binaural audio signal (i.e., the target multi-channel audio signal determined when the target photographed object is out of the field of view) increases. In this way, when the video field of view is focused on the specified area, the audio is also focused on the specified area.
  • the size of the shooting field of view is determined according to the shooting parameters of the main device (such as the zoom level of the camera), and based on this, the first gain of the environmental multi-channel audio signal and the second gain of the target multi-channel audio signal are determined , so that when the video shooting field of view is focused on the specified area, the audio will also be focused on the specified area, so as to create an "immersive, sound and image moving" effect.
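One possible way to derive the first and second gains from the zoom level is a linear crossfade, sketched below. The linear mapping, the zoom range of 1x to 10x, and the function names are assumptions for illustration; the text only states that the ambient gain decreases and the target gain increases as the zoom level grows.

```python
def mix_gains(zoom, min_zoom=1.0, max_zoom=10.0):
    """Return (ambient_gain, target_gain) for a given zoom level."""
    zoom = min(max(zoom, min_zoom), max_zoom)  # clamp to the supported range
    target_gain = (zoom - min_zoom) / (max_zoom - min_zoom)
    return 1.0 - target_gain, target_gain


def mix(ambient, target, zoom):
    """Mix two multi-channel signals (lists of channels) sample by sample."""
    ambient_gain, target_gain = mix_gains(zoom)
    return [
        [ambient_gain * a + target_gain * t for a, t in zip(ch_a, ch_t)]
        for ch_a, ch_t in zip(ambient, target)
    ]


# Binaural (two-channel) example at a mid-range zoom level.
mixed = mix([[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]], zoom=5.5)
```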
  • the multi-channel audio signal acquisition method provided by the embodiment of the present invention is a distributed recording and audio focusing method that can create a more realistic sense of presence.
  • the method can simultaneously utilize the microphone array on the main device and the microphone on the additional device (TWS Bluetooth headset) in the terminal device to perform distributed collection and fusion of audio.
  • the microphone array in the terminal device collects the spatial audio of the location of the main device (that is, the main audio signal involved in the embodiment of the present invention), and the TWS Bluetooth headset can be set on the target object to be tracked, and follow the target object.
  • when the final output binaural audio signal is played in stereo headphones, it can simulate the spatial sound field and the point-like auditory target at the specified position at the same time.
  • an embodiment of the present invention provides an apparatus 400 for acquiring a multi-channel audio signal, and the apparatus includes:
  • the acquisition module 401 is used to acquire the main audio signal collected when the main device shoots the video of the target photographed object, and perform the first multi-channel rendering to obtain the environmental multi-channel audio signal; and to acquire the audio signal collected by the additional device and determine the first additional audio signal, wherein the distance between the additional device and the target photographed object is less than the first threshold;
  • a processing module 402 configured to perform ambient sound suppression processing through the first additional audio signal and the main audio signal to obtain a target audio signal
  • the ambient multi-channel audio signal and the target multi-channel audio signal are mixed to obtain a mixed multi-channel audio signal.
  • the processing module 402 is specifically configured to determine the first gain of the environmental multi-channel audio signal and the second gain of the target multi-channel audio signal according to the shooting parameters of the main device;
  • the ambient multi-channel audio signal and the target multi-channel audio signal are mixed to obtain the mixed multi-channel audio signal.
  • the acquisition module 401 is specifically configured to acquire the main audio signal collected by the microphone array on the main device;
  • the first multi-channel transfer function is generated according to the microphone array formation on the master device,
  • multi-channel rendering is performed on the main audio signal to obtain the ambient multi-channel audio signal.
  • the acquiring module 401 is specifically configured to acquire a second additional audio signal collected by an additional device on the target photographed object, and determine the second additional audio signal as the first additional audio signal;
  • the second additional audio signal collected by the additional device is acquired, and the second additional audio signal is aligned with the main audio signal in the time domain to obtain the first additional audio signal.
  • the processing module 402 is specifically configured to obtain the target azimuth angle between the target photographed object and the main device;
  • the second additional audio signal is aligned with the main audio signal in the time domain to obtain the first additional audio signal.
  • the processing module 402 is specifically configured to obtain the target distance and target azimuth between the target photographed object and the main device;
  • Multi-channel rendering is performed on the target audio signal according to the second multi-channel transfer function to obtain the target multi-channel audio signal.
  • the acquiring module 401 is specifically configured to acquire the first active time of the second additional audio signal and the first distance when it is detected that the target photographed object is outside the shooting field of view of the main device, the first distance being the last determined target distance between the target photographed object and the main device;
  • the angle of arrival is estimated using the main audio signal in the second active time to obtain the target azimuth angle between the target object and the main device.
  • the acquisition module 401 is specifically configured to perform beamforming processing on the main audio signal towards the target azimuth when it is detected that the target photographed object is outside the photographing field of view of the main device, to obtain a beamforming signal;
  • the target distance between the target object and the main device is calculated.
  • the processing module 402 is specifically configured to perform spatial filtering on the main audio signal in the area within the shooting field of view according to the shooting field of view of the main device when it is detected that the target photographed object is outside the shooting field of view of the main device, to obtain a focused audio signal;
  • adaptive filtering is performed on the focused audio signal to obtain a target audio signal.
  • the acquisition module 401 is specifically configured to, when it is detected that the target photographed object is within the shooting field of view of the main device, determine the first azimuth angle between the target photographed object and the main device according to the video information and shooting parameters acquired by the main device;
  • the first azimuth angle and the second azimuth angle are smoothed to obtain the target azimuth angle.
  • the acquiring module 401 is specifically configured to determine the second distance between the target object and the main device according to the video information acquired by the main device when it is detected that the target object is within the shooting field of view of the main device;
  • the second time delay is calculated
  • the target distance is calculated.
  • the processing module 402 is configured to perform spatial filtering on the main audio signal in the area outside the shooting field of view according to the shooting field of view of the main device when it is detected that the target photographed object is within the shooting field of view of the main device, to obtain the reverse focus audio signal;
  • adaptive filtering is performed on the first additional audio signal to obtain a target audio signal.
  • the processing module 402 is specifically configured to acquire the video data captured by the main device and the second additional audio signal collected by the additional device;
  • multi-channel rendering is performed on the target audio signal to obtain the target multi-channel audio signal.
  • the processing module 402 is specifically configured to acquire the main audio signal collected when the main device shoots the video of the target object;
  • An embodiment of the present invention provides a terminal device, including: a processor, a memory, and a computer program stored in the memory and executable on the processor, where the computer program, when executed by the processor, implements the multi-channel audio signal acquisition method provided by the above-mentioned method embodiment.
  • an embodiment of the present invention further provides a terminal device, where the terminal device includes the foregoing apparatus 400 for acquiring a multi-channel audio signal and a main device 500 .
  • the main device is used to collect a main audio signal when shooting a video, and send the main audio signal to the multi-channel audio signal acquisition device.
  • an embodiment of the present invention further provides a terminal device, which includes but is not limited to: a radio frequency (RF) circuit 601, a memory 602, an input unit 603, a display unit 604, a sensor 605, an audio circuit 606, a wireless fidelity (WiFi) module 607, a processor 608, a Bluetooth module 609, a camera 610, and other components.
  • the radio frequency circuit 601 includes a receiver 6011 and a transmitter 6012 .
  • the RF circuit 601 can be used for receiving and sending signals while information is transmitted and received or during a call. In particular, downlink information received from the base station is passed to the processor 608 for processing, and uplink data is sent to the base station.
  • the RF circuit 601 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like.
  • the RF circuit 601 can also communicate with the network and other devices through wireless communication.
  • the above-mentioned wireless communication can use any communication standard or protocol, including but not limited to global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, short message service (SMS), and so on.
  • the memory 602 can be used to store software programs and modules, and the processor 608 executes various functional applications and data processing of the terminal device by running the software programs and modules stored in the memory 602 .
  • the memory 602 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playback function or an image playback function), and the like; the data storage area may store data created through use of the terminal device (such as audio signals or a phonebook), and the like.
  • memory 602 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
  • the input unit 603 may be used to receive input numerical or character information, and generate key signal input related to user setting and function control of the terminal device.
  • the input unit 603 may include a touch panel 6031 and other input devices 6032 .
  • the touch panel 6031, also referred to as a touch screen, can collect the user's touch operations on or near it (such as operations performed by the user on or near the touch panel 6031 with a finger, a stylus, or any other suitable object or accessory), and drive the corresponding connection device according to a preset program.
  • the touch panel 6031 may include two parts, a touch detection device and a touch controller.
  • the touch detection device detects the user's touch position, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, and then sends them to the processor 608.
  • the touch panel 6031 can be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave.
  • the input unit 603 may also include other input devices 6032 .
  • other input devices 6032 may include, but are not limited to, one or more of physical keyboards, function keys (such as volume control keys, switch keys, etc.), trackballs, mice, joysticks, and the like.
  • the display unit 604 may be used to display information input by the user or information provided to the user and various menus of the terminal device.
  • the display unit 604 may include a display panel 6041.
  • the display panel 6041 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.
  • the touch panel 6031 can cover the display panel 6041. When the touch panel 6031 detects a touch operation on or near it, it transmits the operation to the processor 608 to determine the touch event, and the processor 608 then provides a corresponding visual output on the display panel 6041 according to the touch event.
  • although the touch panel 6031 and the display panel 6041 are shown in the figure as two independent components that realize the input and output functions of the terminal device, in some embodiments the touch panel 6031 and the display panel 6041 can be integrated to realize the input and output functions of the terminal device.
  • the terminal device may also include at least one sensor 605, such as a light sensor, a motion sensor, and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 6041 according to the brightness of the ambient light, and the proximity sensor may turn off the display panel 6041 and/or the backlight when the terminal device is moved to the ear.
  • the accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications that identify the attitude of the terminal device (such as switching between landscape and portrait orientation, related games, and magnetometer attitude calibration) and for vibration-recognition functions (such as a pedometer or tap detection); other sensors that can also be configured on the terminal device, such as gyroscopes, barometers, hygrometers, thermometers, and infrared sensors, are not described here.
  • the terminal device may include an acceleration sensor, a depth sensor, or a distance sensor, or the like.
  • the audio circuit 606, the speaker 6061, and the microphone 6062 can provide an audio interface between the user and the terminal device.
  • on the one hand, the audio circuit 606 can convert the received audio signal into an electrical signal and transmit it to the speaker 6061, which converts it into a sound signal for output; on the other hand, the microphone 6062 converts the collected sound signal into an electrical signal, which is received by the audio circuit 606 and converted into an audio signal; the audio signal is then output to the processor 608 for processing and sent to, for example, another terminal device through the RF circuit 601, or output to the memory 602 for further processing.
  • the above-mentioned microphone 6062 may be a microphone array.
  • WiFi is a short-distance wireless transmission technology
  • the terminal device can help users to send and receive emails, browse web pages, and access streaming media through the WiFi module 607, which provides users with wireless broadband Internet access.
  • although FIG. 6 shows the WiFi module 607, it can be understood that it is not an essential part of the terminal device and can be omitted as required without changing the essence of the invention.
  • the processor 608 is the control center of the terminal device. It uses various interfaces and lines to connect the parts of the entire terminal device, and performs the various functions of the terminal device and processes data by running or executing the software programs and/or modules stored in the memory 602 and calling the data stored in the memory 602, thereby monitoring the terminal device as a whole.
  • the processor 608 may include one or more processing units; preferably, the processor 608 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the above-mentioned modem processor may also not be integrated into the processor 608.
  • the terminal device also includes a Bluetooth module 609, which is used for short-distance wireless communication, and is divided into a Bluetooth data module and a Bluetooth voice module according to functions.
  • Bluetooth module refers to the basic circuit set of chips integrated with Bluetooth function, which is used for wireless network communication. It can be roughly divided into three types: data transmission module, Bluetooth audio module, Bluetooth audio + data combination module and so on.
  • the terminal device may also include other functional modules, which will not be repeated here.
  • the microphone 6062 can be used to collect the main audio signal, and the terminal device can be connected to the additional device through the WiFi module 607 or the Bluetooth module 609, and receive the second additional audio signal collected by the additional device.
  • the processor 608 is configured to obtain the main audio signal and perform multi-channel rendering to obtain the environmental multi-channel audio signal; obtain the audio signal collected by the additional device and determine the first additional audio signal; perform environmental sound suppression processing through the first additional audio signal and the main audio signal to obtain a target audio signal; perform multi-channel rendering on the target audio signal to obtain a target multi-channel audio signal; and mix the environmental multi-channel audio signal and the target multi-channel audio signal to obtain a mixed multi-channel audio signal.
  • the distance between the additional device and the target photographed object is less than a first threshold.
  • the foregoing processor 608 may also be used to implement other processes implemented by the terminal device in the foregoing method embodiments, and details are not described herein again.
  • An embodiment of the present invention further provides a multi-channel audio signal acquisition system, the system includes: a multi-channel audio signal acquisition device, a main device, and an additional device, wherein the main device and the additional device each establish a communication connection with the multi-channel audio signal acquisition device;
  • the main device is used to collect a main audio signal when shooting a video of the target object, and send the main audio signal to the multi-channel audio signal acquisition device;
  • the additional device is configured to collect a second additional audio signal and send the second additional audio signal to the multi-channel audio signal acquisition device.
  • the multi-channel audio signal acquisition system may be as shown in FIG. 1 above, wherein the audio processing device in FIG. 1 may be a multi-channel audio signal acquisition apparatus.
  • Embodiments of the present invention further provide a computer-readable storage medium, including: a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method for acquiring a multi-channel audio signal in the foregoing method embodiment is implemented.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • the technical solution of the present invention, in essence, or the part that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present invention.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • Stereophonic System (AREA)

Abstract

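A multi-channel audio signal acquisition method, comprising: acquiring a main audio signal collected when a main device shoots video of a target photographed object, and performing first multi-channel rendering to obtain an ambient multi-channel audio signal (201); acquiring an audio signal collected by an additional device on the target photographed object, and determining a first additional audio signal (202); performing ambient sound suppression processing on the first additional audio signal and the main audio signal to obtain a target audio signal (203); performing second multi-channel rendering on the target audio signal to obtain a target multi-channel audio signal (204); and mixing the ambient multi-channel audio signal and the target multi-channel audio signal to obtain a mixed multi-channel audio signal (205). A corresponding apparatus, system, terminal device, and computer-readable storage medium are also disclosed.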

Description

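Multi-channel audio signal acquisition method, apparatus, and system
Technical Field
The present invention relates to the field of audio technology, and in particular to a multi-channel audio signal acquisition method, apparatus, and system.
Background
With the advancement of technology, people have placed higher demands on the photography and recording performance of mobile devices. With the growing popularity of true wireless stereo (TWS) Bluetooth headsets, a distributed audio capture scheme has emerged. This scheme uses the microphone on a TWS Bluetooth headset to capture high-quality close-up audio signals far from the user, and mixes and binaurally renders them together with the spatial audio signals collected by the microphone array on the main device, simulating a point-like auditory target in the spatial sound field and creating a more realistic immersive experience. However, this scheme merely mixes the distributed audio signals and does not suppress ambient sound. When a mobile device is used to shoot video where there are multiple sound sources or in a relatively noisy environment, the sound the user is actually interested in is mixed with various unrelated sound sources, or even drowned in background noise. Existing schemes may therefore yield poor recording quality because of the influence of ambient sound.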
发明内容
本发明实施例提供了一种多通道音频信号获取方法、装置及系统,可以采用分布式音频信号之间的关系,对环境声进行抑制处理,提高音频信号的录音效果。
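To solve the above technical problem, the embodiments of the present invention are implemented as follows:
In a first aspect, an embodiment of the present invention provides a multi-channel audio signal acquisition method, comprising:
acquiring a main audio signal collected by a main device while it shoots video, and performing multi-channel rendering on it to obtain an ambient multi-channel audio signal;
acquiring an audio signal collected by an additional device, and determining a first additional audio signal, wherein the distance between the additional device and the target subject is less than a first threshold;
performing ambient sound suppression using the first additional audio signal and the main audio signal to obtain a target audio signal;
performing multi-channel rendering on the target audio signal to obtain a target multi-channel audio signal; and
mixing the ambient multi-channel audio signal with the target multi-channel audio signal to obtain a mixed multi-channel audio signal.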
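In a second aspect, a multi-channel audio signal acquisition apparatus is provided, comprising:
an acquisition module, configured to acquire a main audio signal collected by a main device while it shoots video of a target subject, and perform first multi-channel rendering on it to obtain an ambient multi-channel audio signal; and to acquire an audio signal collected by an additional device and determine a first additional audio signal, wherein the distance between the additional device and the target subject is less than a first threshold; and
a processing module, configured to perform ambient sound suppression using the first additional audio signal and the main audio signal to obtain a target audio signal;
perform multi-channel rendering on the target audio signal to obtain a target multi-channel audio signal; and
mix the ambient multi-channel audio signal with the target multi-channel audio signal to obtain a mixed multi-channel audio signal.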
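In a third aspect, a terminal device is provided, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the multi-channel audio signal acquisition method of the first aspect.
In a fourth aspect, a terminal device is provided, comprising the multi-channel audio signal acquisition apparatus of the second aspect and a main device,
wherein the main device is configured to collect a main audio signal while shooting video and to send the main audio signal to the multi-channel audio signal acquisition apparatus.
In a fifth aspect, a multi-channel audio signal acquisition system is provided, comprising the multi-channel audio signal acquisition apparatus of the second aspect, a main device, and an additional device, wherein the main device and the additional device each establish a communication connection with the multi-channel audio signal acquisition apparatus;
the main device is configured to collect a main audio signal while shooting video and to send the main audio signal to the multi-channel audio signal acquisition apparatus;
the additional device is configured to collect a second additional audio signal and to send the second additional audio signal to the multi-channel audio signal acquisition apparatus;
and the distance between the additional device and the target subject is less than a first threshold.
In a sixth aspect, a computer-readable storage medium is provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the multi-channel audio signal acquisition method of the first aspect.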
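In the embodiments of the present invention, a main audio signal collected by a main device while shooting video can be acquired and rendered to multiple channels to obtain an ambient multi-channel audio signal; an audio signal collected by an additional device whose distance from the target subject is less than a first threshold can be acquired, and a first additional audio signal determined from it; ambient sound suppression can be performed using the first additional audio signal and the main audio signal to obtain a target audio signal; the target audio signal can be rendered to multiple channels to obtain a target multi-channel audio signal; and the ambient multi-channel audio signal can be mixed with the target multi-channel audio signal to obtain a mixed multi-channel audio signal. With this scheme, distributed audio signals are obtained from the main device and the additional device, and the relationship between them is exploited: ambient sound suppression is performed on the first additional audio signal (derived from the audio collected by the additional device) and the main audio signal (collected by the main device) to suppress ambient sound picked up during recording, yielding the target multi-channel audio signal. When the ambient multi-channel audio signal (obtained by multi-channel rendering of the main audio signal) is then mixed with the target multi-channel audio signal, the distributed audio signals are combined, a point-like auditory target in the spatial sound field is simulated, and ambient sound is suppressed, which improves the recording quality of the audio signal.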
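Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments and the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and other drawings can be derived from them.
FIG. 1 is a schematic diagram of a multi-channel audio signal acquisition system according to an embodiment of the present invention;
FIG. 2A is a first schematic diagram of a multi-channel audio signal acquisition method according to an embodiment of the present invention;
FIG. 2B is a schematic diagram of an interface of a terminal device according to an embodiment of the present invention;
FIG. 3 is a second schematic diagram of a multi-channel audio signal acquisition method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a multi-channel audio signal acquisition apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the hardware structure of a terminal device according to an embodiment of the present invention.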
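Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
In the embodiments of the present invention, words such as "exemplary" or "for example" indicate an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" should not be construed as preferred over, or more advantageous than, other embodiments or designs. Rather, such words are intended to present the related concept in a concrete manner. In addition, unless otherwise stated, "multiple" in the description of the embodiments means two or more.
The term "and/or" herein merely describes an association between associated objects and covers three cases. For example, "A and/or B" can mean A alone, both A and B, or B alone.
Embodiments of the present invention provide a multi-channel audio signal acquisition method, apparatus, and system that can be applied to video shooting, in particular where there are multiple sound sources or the environment is noisy. They mix the distributed audio signals, simulate a point-like auditory target in the spatial sound field, and also suppress ambient sound, which improves the recording quality of the audio signal.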
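FIG. 1 is a schematic diagram of a multi-channel audio signal acquisition system according to an embodiment of the present invention. The system may include a main device, an additional device, and an audio processing device (which may be the multi-channel audio signal acquisition apparatus of the embodiments). The additional device in FIG. 1 is a TWS Bluetooth earphone, which collects an audio stream (the additional audio signal of the embodiments). The main device collects a video stream and an audio stream (the main audio signal of the embodiments). The audio processing device may include the following modules: target tracking, scene and sound source classification, delay compensation, adaptive filtering, spatial filtering, binaural rendering, and a mixer. The function of each module is described together with the multi-channel audio signal acquisition method in the embodiments below and is not repeated here.
It should be noted that the main device and the audio processing device may be two independent devices. Optionally, they may be integrated into one device, for example a terminal device that combines the functions of the main device and the audio processing device.
In the embodiments of the present invention, the additional device may connect to the terminal device, or to the audio processing device, wirelessly, for example over Bluetooth or WiFi. The embodiments do not limit the connection type.
The terminal device in the embodiments of the present invention may include a mobile phone, tablet computer, notebook computer, ultra-mobile personal computer (UMPC), handheld computer, netbook, personal digital assistant (PDA), wearable device (such as a watch, wristband, glasses, helmet, or headband), and the like. The embodiments of this application place no particular limitation on the specific form of the terminal device.
In the embodiments of the present invention, the additional device may be a terminal device independent of the main device and the audio processing device. It may be a portable terminal device, for example a Bluetooth earphone or a wearable device (such as a watch, wristband, glasses, helmet, or headband).
In a video shooting scene, the main device shoots video, obtains the main audio signal, and sends it to the audio processing device. The additional device is close to a target subject in the scene (for example, the distance between them is less than the first threshold); it obtains the additional audio signal and sends it to the audio processing device.
Optionally, the target subject may be a person or a musical instrument in the video shooting scene.
Optionally, a video shooting scene usually contains multiple subjects, and the target subject may be one of them.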
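FIG. 2A is a schematic diagram of a multi-channel audio signal acquisition method according to an embodiment of the present invention. By way of example, the method may be executed by the audio processing device shown in FIG. 1 (that is, the multi-channel audio signal acquisition apparatus), or by a terminal device that integrates the functions of the audio processing device and the main device shown in FIG. 1. In the latter case the main device may be the functional module or functional entity in the terminal device that captures audio and video. The following embodiments use the terminal device as the executing entity.
The method is described in detail below. As shown in FIG. 2A, the method includes:
201. Acquire the main audio signal collected by the main device while it shoots video of the target subject, and perform first multi-channel rendering on it to obtain an ambient multi-channel audio signal.
The distance between the target subject and the additional device may be less than the first threshold.
Optionally, the user places the additional device on the target subject to be tracked, starts video shooting on the terminal device, and selects the target subject by tapping it in the video content shown on the screen. The sound pickup module on the main device in the terminal device and the sound pickup module on the additional device can then start recording and collect audio signals.
Optionally, the sound pickup module on the main device may be a microphone array, which collects the main audio signal. The sound pickup module on the additional device may be a microphone.
FIG. 2B is a schematic diagram of an interface of the terminal device, whose screen can display the video content. On a mobile phone, the user can tap person 21 shown in the interface to select person 21 as the target subject. Person 21 may carry a Bluetooth earphone (the additional device described above), which collects the audio signal near person 21 and sends it to the terminal device.
In the embodiments of the present invention, "multi-channel" may mean two channels, four channels, 5.1 channels, or more.
When the audio signal to be obtained is a two-channel signal, binaural rendering can be performed on the main audio signal using a head-related transfer function (HRTF) to obtain an ambient binaural audio signal.
By way of example, the binaural renderer in FIG. 1 can perform binaural rendering on the main audio signal to obtain the ambient binaural audio signal.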
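As a minimal sketch of what HRTF-based binaural rendering involves, the following assumes the HRTF is available in the time domain as a pair of head-related impulse responses (HRIRs) and that the input has already been reduced to one channel. The function names and the toy impulse responses are illustrative assumptions, not taken from the patent.

```python
def convolve(signal, kernel):
    """Direct-form linear convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def binaural_render(mono, hrir_left, hrir_right):
    """Return (left, right) ear signals for one source direction."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy impulse responses (hypothetical values): the right ear hears the
# source one sample later and at half the amplitude of the left ear.
left, right = binaural_render([1.0, 0.5], [1.0, 0.0], [0.0, 0.5])
```

A practical renderer would typically use measured HRIRs and FFT-based convolution; the direct form is used here only to keep the sketch dependency-free.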
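202. Acquire the audio signal collected by the additional device, and determine the first additional audio signal.
Optionally, acquiring the audio signal collected by the additional device on the target subject and determining the first additional audio signal can be implemented in two ways:
First implementation: acquire the second additional audio signal collected by the additional device on the target subject, and use the second additional audio signal as the first additional audio signal.
Second implementation: acquire the second additional audio signal collected by the additional device on the target subject, and align the second additional audio signal with the main audio signal in the time domain to obtain the first additional audio signal.
Because the main device and the additional device may be some distance apart, there may be a delay between the main audio signal and the second additional audio signal. The two signals can be aligned in the time domain according to that delay to obtain the first additional audio signal.
An audio capture system, such as the multi-channel audio signal acquisition system shown in FIG. 1, usually also has a system delay (for example, the delay introduced by Bluetooth transmission and by the decoding module). The system delay can be obtained by measurement. Optionally, in the embodiments of the present invention, the estimated acoustic propagation delay (the delay between the main audio signal and the second additional audio signal described above) can be combined with the system delay to obtain the actual delay, and the main audio signal and the second additional audio signal are aligned in the time domain according to the actual delay to obtain the first additional audio signal.
The delay compensator in FIG. 1 aligns the second additional audio signal with the main audio signal in the time domain, according to the delay between the two signals, to obtain the first additional audio signal.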
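A minimal sketch of the time-domain alignment step, assuming the actual delay has already been converted to a whole number of samples. The description does not state how the acoustic and system delays are combined (sign convention, clock reference), so that computation is deliberately left to the caller; the `align` helper and its parameters are hypothetical.

```python
def align(additional, lag, length):
    """Shift `additional` by `lag` samples and fit it to `length` samples.

    A positive lag delays the signal, a negative lag advances it.
    Samples shifted in from outside the signal are filled with zeros.
    """
    out = [0.0] * length
    for n in range(length):
        src = n - lag
        if 0 <= src < len(additional):
            out[n] = additional[src]
    return out


# Delay a three-sample signal by two samples inside a five-sample frame.
aligned = align([1.0, 2.0, 3.0], lag=2, length=5)
```

Fractional-sample delays would need interpolation, which this sketch does not attempt.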
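203. Perform ambient sound suppression on the first additional audio signal and the main audio signal to obtain the target audio signal.
In the embodiments of the present invention, the way ambient sound suppression is performed using the first additional audio signal and the main audio signal depends on whether the target subject is inside or outside the shooting field of view of the main device.
(1) The target subject is inside the shooting field of view of the main device.
According to the shooting field of view of the main device, spatially filter the main audio signal toward the region outside the field of view to obtain a reverse-focus audio signal. Then, using the reverse-focus audio signal as the reference signal, apply adaptive filtering to the first additional audio signal to obtain the target audio signal.
This approach first spatially filters the main audio signal toward the region outside the field of view. The resulting reverse-focus audio signal suppresses the component of the main audio signal that comes from the target subject's position, giving a cleaner ambient sound signal. Adaptive filtering of the first additional audio signal with the reverse-focus audio signal as the reference then further suppresses the ambient sound in the additional audio signal.
(2) The target subject is outside the shooting field of view of the main device.
According to the shooting field of view of the main device, spatially filter the main audio signal toward the region inside the field of view to obtain a focused audio signal. Then, using the first additional audio signal as the reference signal, apply adaptive filtering to the focused audio signal to obtain the target audio signal.
This approach first spatially filters the main audio signal toward the region inside the field of view. The resulting focused audio signal suppresses part of the ambient sound in the main audio signal. Adaptive filtering of the focused audio signal with the first additional audio signal as the reference then further suppresses sound from outside the focus region that spatial filtering could not remove completely, in particular the component that comes from the target subject's position.
The spatial filter in FIG. 1 spatially filters the main audio signal to obtain a directionally enhanced audio signal. When the target subject is inside the shooting field of view of the main device, a high-quality close-up audio signal is already available from the first additional audio signal, so the main purpose of spatial filtering is to obtain a cleaner ambient audio signal. The target region of spatial filtering is therefore the region outside the field of view, and the result is called the reverse-focus audio signal. When the target subject is outside the shooting field of view, spatial filtering is needed to obtain a close-up audio signal of the region inside the field of view. The target region is therefore the region inside the field of view, and the result is the focused audio signal.
The spatial filtering method may be based on beamforming, for example the minimum variance distortionless response (MVDR) method or beamforming with a generalized sidelobe canceller (GSC).
FIG. 1 includes two groups of adaptive filters, one for each of the two cases above. Only one group is enabled at a time, depending on where the target subject is relative to the field of view. When the target subject is inside the shooting field of view of the main device, the adaptive filter acting on the first additional audio signal is enabled and the reverse-focus audio signal is fed in as the reference, further suppressing ambient sound in the first additional audio signal so that sound near the target subject stands out. When the target subject is outside the shooting field of view, the adaptive filter acting on the focused audio signal is enabled and the first additional audio signal is fed in as the reference, further suppressing sound from outside the field of view, in particular sound from the target subject's position, in the focused audio signal.
The adaptive filtering method may be, for example, the least mean squares (LMS) method.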
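The adaptive filtering step can be illustrated with a plain LMS noise canceller. This is a sketch under stated assumptions, not the patent's implementation: `primary` stands for the signal to be cleaned (the first additional audio signal in case (1), the focused audio signal in case (2)), `reference` stands for the reference signal (the reverse-focus signal or the first additional audio signal respectively), and the tap count and step size are arbitrary illustrative values.

```python
import random


def lms_cancel(primary, reference, taps=4, mu=0.05):
    """Remove from `primary` the part that is linearly predictable
    from `reference`; the returned residual is the enhanced signal."""
    weights = [0.0] * taps
    history = [0.0] * taps          # most recent reference samples
    residual = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate        # what the reference cannot explain
        weights = [w + mu * error * h for w, h in zip(weights, history)]
        residual.append(error)
    return residual


# Toy check: the primary signal is nothing but scaled reference noise,
# so the residual should decay toward zero as the filter converges.
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
primary = [0.8 * n for n in noise]
residual = lms_cancel(primary, noise)
```

In practice a normalized step size (NLMS) is often preferred because it is less sensitive to the level of the reference signal; plain LMS is shown because it is the method the description names.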
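204. Perform second multi-channel rendering on the target audio signal to obtain the target multi-channel audio signal.
By way of example, the three groups of binaural renderers in FIG. 1 act on the main audio signal, on the target audio signal obtained by adaptive filtering in case (1), and on the target audio signal obtained by adaptive filtering in case (2), producing three groups of binaural signals: the ambient binaural signal, the additional binaural signal, and the focused binaural signal.
Because cases (1) and (2) never occur at the same time, the binaural renderer for the target audio signal of case (1) and the binaural renderer for the target audio signal of case (2) need not be enabled simultaneously. They are enabled selectively as the target subject moves relative to the shooting field of view of the main device. The binaural renderer acting on the main audio signal is always enabled.
Specifically, when the target subject is inside the shooting field of view of the main device, the binaural renderer for the target audio signal of case (1) is enabled. When the target subject is outside the field of view, the binaural renderer for the target audio signal of case (2) is enabled.
Optionally, each binaural renderer may contain a decorrelator and a convolver, and requires the HRTF corresponding to the target position in order to simulate the perception of an auditory target at the desired direction and distance.
Optionally, the scene and sound source classification module determines a rendering rule from the identified current scene and the sound source type of the target subject. The rendering rule is applied to the decorrelator to obtain different rendering styles, and the azimuth and distance between the additional device and the main device are used to control HRTF generation. The HRTF for a specific position can be obtained by interpolating over a pre-stored set of HRTFs, or by a method based on a deep neural network (DNN).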
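Obtaining an HRTF for an arbitrary position by interpolating a pre-stored set can be sketched as follows. The sketch assumes the stored set is a table of time-domain impulse responses indexed by azimuth only, and uses simple linear interpolation between the two nearest entries; the table layout and function name are hypothetical, and real systems often interpolate in other domains and over elevation and distance as well.

```python
def interpolate_hrir(table, azimuth_deg):
    """Linearly interpolate an impulse response for `azimuth_deg`.

    `table` maps a measured azimuth (degrees) to a list of samples.
    Azimuths outside the measured range are clamped to the nearest end.
    """
    angles = sorted(table)
    azimuth_deg = min(max(azimuth_deg, angles[0]), angles[-1])
    for lo, hi in zip(angles, angles[1:]):
        if lo <= azimuth_deg <= hi:
            t = (azimuth_deg - lo) / (hi - lo)
            return [(1 - t) * a + t * b
                    for a, b in zip(table[lo], table[hi])]
    return list(table[angles[0]])   # table with a single entry


# Two stored responses (toy values) measured 30 degrees apart.
stored = {0: [1.0, 0.0], 30: [0.0, 1.0]}
halfway = interpolate_hrir(stored, 15)
```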
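205. Mix the ambient multi-channel audio signal with the target multi-channel audio signal to obtain the mixed multi-channel audio signal.
In the embodiments of the present invention, mixing the ambient multi-channel audio signal with the target multi-channel audio signal means adding the two signals according to their gains. Specifically, the gain-weighted samples of the ambient multi-channel audio signal are added, sample by sample, to the gain-weighted samples of the target multi-channel audio signal.
The gain may be a preset fixed value or a variable gain.
Optionally, the variable gain may be determined from the shooting field of view.
The mixer in FIG. 1 mixes two of the three groups of binaural signals described above. When the target subject is inside the shooting field of view of the main device, the ambient binaural signal is mixed with the additional binaural signal. When the target subject is outside the field of view, the ambient binaural signal is mixed with the focused binaural signal.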
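A minimal sketch of the gain-weighted, sample-by-sample mix described above, assuming both signals have the same number of channels and are already time-aligned. The `mix` function and the sample values are illustrative only.

```python
def mix(ambient, target, ambient_gain, target_gain):
    """Gain-weighted sum of two multi-channel signals.

    Each signal is a list of channels; each channel is a list of samples.
    """
    return [
        [ambient_gain * a + target_gain * t for a, t in zip(ch_a, ch_t)]
        for ch_a, ch_t in zip(ambient, target)
    ]


# Two-channel (binaural) example with two samples per channel.
mixed = mix([[1.0, 1.0], [0.0, 2.0]],
            [[2.0, 0.0], [4.0, 4.0]],
            ambient_gain=0.5, target_gain=0.25)
```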
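In the embodiments of the present invention, a main audio signal collected by a main device while shooting video can be acquired and subjected to first multi-channel rendering to obtain an ambient multi-channel audio signal; an audio signal collected by an additional device whose distance from the target subject is less than a first threshold can be acquired, and a first additional audio signal determined from it; ambient sound suppression can be performed using the first additional audio signal and the main audio signal to obtain a target audio signal; second multi-channel rendering can be performed on the target audio signal to obtain a target multi-channel audio signal; and the ambient multi-channel audio signal can be mixed with the target multi-channel audio signal to obtain a mixed multi-channel audio signal. With this scheme, distributed audio signals are obtained from the main device and the additional device, and the relationship between them is exploited: ambient sound suppression is performed on the first additional audio signal (derived from the audio collected by the additional device) and the main audio signal (collected by the main device) to suppress ambient sound picked up during recording, yielding the target multi-channel audio signal. When the ambient multi-channel audio signal (obtained by multi-channel rendering of the main audio signal) is then mixed with the target multi-channel audio signal, the distributed audio signals are combined, a point-like auditory target in the spatial sound field is simulated, and ambient sound is suppressed, which improves the recording quality of the audio signal.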
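As shown in FIG. 3, an embodiment of the present invention further provides a multi-channel audio signal acquisition method, which includes:
301. Acquire the main audio signal collected by the microphone array on the main device.
302. Acquire the second additional audio signal collected by the additional device.
After the user selects the target subject on the main device and starts shooting video, the terminal device can perform steps 301 and 302. The terminal device can continuously respond to changes in the shooting field of view of the main device and track the movement of the target subject relative to the field of view.
Optionally, the video data captured by the main device (including the main audio signal) and the second additional audio signal collected by the additional device can be acquired.
Further, the current scene category and the target subject category can be determined from the video data and/or the second additional audio signal, and a rendering rule matching the current scene category and the target subject category can be determined. Subsequent audio signals are then rendered to multiple channels according to the determined rendering rule.
Optionally, according to the determined rendering rule, second multi-channel rendering is performed on the target audio signal to obtain the target multi-channel audio signal, and first multi-channel rendering is performed on the main audio signal to obtain the ambient multi-channel audio signal.
Optionally, performing multi-channel rendering on the target audio signal according to the determined rendering rule to obtain the target multi-channel audio signal may include:
acquiring the video data captured by the main device and the second additional audio signal collected by the additional device;
determining the current scene category and the target subject category; and
performing multi-channel rendering on the target audio signal using a first rendering rule that matches the current scene category and the target subject category, to obtain the target multi-channel audio signal.
Optionally, performing multi-channel rendering on the main audio signal according to the determined rendering rule to obtain the ambient multi-channel audio signal may include:
acquiring the main audio signal collected by the main device while it shoots video of the target subject;
determining the current scene category; and
performing first multi-channel rendering on the main audio signal using a second rendering rule that matches the current scene category, to obtain the ambient multi-channel audio signal.
In FIG. 1, the scene and sound source classification module may contain two paths, one using the video stream and the other using the audio stream. Each path consists of a scene analyzer and a voice/instrument classifier. The scene analyzer determines, from the video or audio, the type of space the user is in, such as a small room, medium room, large room, concert hall, stadium, or outdoors. The voice/instrument classifier determines, from the video or audio, the type of sound source near the target subject, such as a male voice, female voice, child's voice, accordion, guitar, bass, piano, keyboard, or percussion.
Optionally, both the scene analyzer and the voice/instrument classifier may be DNN-based. The video input is the image of each frame, and the audio input may be the Mel spectrum of the sound or its Mel-frequency cepstral coefficients (MFCC).
Optionally, the results of the spatial scene analysis and of the voice/instrument classifier can be combined with the user's preference settings to determine the rendering rule used by the subsequent binaural rendering module.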
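303. Generate a first multi-channel transfer function according to the geometry of the microphone array on the main device, and perform multi-channel rendering on the main audio signal according to the first multi-channel transfer function to obtain the ambient multi-channel audio signal.
It should be noted that when "multi-channel" in the embodiments means two channels, the first multi-channel transfer function may be an HRTF.
In the embodiments of the present invention, the binaural renderer in FIG. 1 may hold a set of preset HRTFs and binaural rendering methods. A preset HRTF is selected according to the geometry of the microphone array on the main device and used to binaurally render the main audio signal, giving the ambient binaural audio signal.
304. Determine whether the target subject is inside the shooting field of view of the main device.
If the target subject is detected inside the shooting field of view of the main device, perform steps 305 to 312 and 320 to 323 below. If the target subject is detected outside the shooting field of view, perform steps 313 to 319 and 320 to 323 below.
The target tracking module in FIG. 1 consists of a visual target tracker and an audio target tracker. It uses visual data and/or audio signals to determine the position of the target subject and to estimate the azimuth and distance between the target subject and the main device. When the target subject is inside the shooting field of view of the main device, visual data and audio signals can be used together to determine its position, and both trackers are enabled. When the target subject is outside the field of view, its position can be determined from audio signals, and only the audio target tracker need be enabled.
Optionally, when the target subject is inside the shooting field of view of the main device, its position may also be determined from only one of the visual data and the audio signals.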
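305. Determine a first azimuth between the target subject and the main device from the video information and shooting parameters obtained by the main device. Acquire a first active time of the second additional audio signal and a first distance, and determine a second active time of the main audio signal from the first active time and the first distance.
The first distance is the most recently determined target distance between the target subject and the main device.
306. Perform direction-of-arrival estimation on the main audio signal within the second active time to obtain a second azimuth between the target subject and the main device, and smooth the first azimuth and the second azimuth to obtain the target azimuth.
307. Determine a second distance between the target subject and the main device from the video information obtained by the main device, and calculate a second delay from the second distance and the speed of sound.
308. Beamform the main audio signal toward the target azimuth to obtain a beamformed signal, and determine a first delay between the beamformed signal and the second additional audio signal.
In FIG. 1, the sound source direction finder and beamformer beamforms the main audio signal toward the target azimuth to obtain the beamformed signal, and the delay estimator then determines the first delay between the beamformed signal and the second additional audio signal.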
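Beamforming toward a known direction can be sketched with a delay-and-sum beamformer, one of the methods the description names for this step. The sketch assumes the steering delays for the target azimuth have already been converted to whole, non-negative sample counts per microphone; deriving them from the array geometry is outside its scope, and the function name is hypothetical.

```python
def delay_and_sum(channels, delays):
    """Advance each microphone channel by its steering delay (in
    samples) and average, so sound from the steered direction adds
    coherently while sound from other directions is attenuated."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]


# The second microphone receives the same wavefront one sample later.
mic_signals = [[1.0, 2.0, 3.0, 0.0],
               [0.0, 1.0, 2.0, 3.0]]
beam = delay_and_sum(mic_signals, delays=[0, 1])
```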
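309. Smooth the second delay and the first delay to obtain the target delay, and calculate the target distance from the target delay and the speed of sound.
When the target subject is inside the shooting field of view of the main device, the captured video data contains the target subject. The first azimuth can then be obtained from the position of the target subject in the video frame, combined with prior information such as camera parameters (for example, focal length) and the zoom scale (different shooting fields of view correspond to different zoom scales). The azimuth and distance between the target subject and the main device can also be estimated from the audio signals, giving the second azimuth. Smoothing the first azimuth and the second azimuth gives the target azimuth.
Further, a rough distance estimate, the second distance, can be obtained by comparing the size of the target subject in the video frame with a pre-recorded typical size for that subject, again combined with prior information such as camera parameters (for example, focal length) and the zoom scale. The second delay is obtained from the second distance, the speed of sound, and the known system delay. The delay between the second additional audio signal and the main audio signal (the first delay) is then calculated, and smoothing the first delay and the second delay gives the target delay.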
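The two visual estimates above can be sketched with a pinhole camera model. This is an assumption made for illustration: the description only says that position or size in the frame is combined with camera parameters and zoom scale. Here the focal length is expressed in pixels (so a change in zoom would appear as a change in `focal_length_px`), and all names and numbers are hypothetical.

```python
import math


def azimuth_from_pixel(x_pixel, image_width, focal_length_px):
    """Horizontal angle (degrees) of a pixel column relative to the
    optical axis; positive values are to the right of the centre."""
    offset = x_pixel - image_width / 2.0
    return math.degrees(math.atan2(offset, focal_length_px))


def distance_from_size(real_height_m, pixel_height, focal_length_px):
    """Pinhole relation: distance = focal length * real size / image size."""
    return focal_length_px * real_height_m / pixel_height


# A subject 1000 px right of centre with a 1000 px focal length sits
# at 45 degrees; a 1.7 m person spanning 170 px is about 10 m away.
first_azimuth = azimuth_from_pixel(1960, 1920, 1000.0)
second_distance = distance_from_size(1.7, 170, 1000.0)
```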
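In the embodiments of the present invention, smoothing may mean taking the average. For example, the target azimuth may be the average of the first azimuth and the second azimuth, and the target delay may be the average of the first delay and the second delay.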
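A short sketch of averaging as the smoothing step, followed by the delay-to-distance conversion of step 309. The description says the known system delay is also taken into account but does not give the sign convention, so `distance_from_delay` expects the purely acoustic part of the delay; the speed of sound value and function names are assumptions.

```python
def smooth(a, b):
    """Smoothing as described: the plain average of two estimates.
    For azimuths this assumes the two angles do not straddle the
    wrap-around point (for example 350 and 10 degrees)."""
    return 0.5 * (a + b)


def distance_from_delay(acoustic_delay_s, speed_of_sound=343.0):
    """Distance travelled by sound during `acoustic_delay_s` seconds.
    Any system delay must be removed by the caller beforehand."""
    return acoustic_delay_s * speed_of_sound


target_azimuth = smooth(30.0, 34.0)      # visual and audio estimates
target_delay = smooth(0.009, 0.011)      # second delay and first delay
target_distance = distance_from_delay(target_delay)
```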
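When the target subject is inside the shooting field of view of the main device, the visual target tracker in FIG. 1 can use the captured video to detect the target azimuth and target distance between the target subject and the main device. The advantage of the visual target tracker is that in noisy environments, or when there are many sound sources, its tracking results are more accurate than those of the audio target tracker.
Further, using the visual target tracker and the audio target tracker together to detect the target azimuth and target distance between the target subject and the main device can improve accuracy further.
310. Align the second additional audio signal with the main audio signal in the time domain according to the target delay to obtain the first additional audio signal.
311. According to the shooting field of view of the main device, spatially filter the main audio signal toward the region outside the field of view to obtain the reverse-focus audio signal.
312. Using the reverse-focus audio signal as the reference signal, apply adaptive filtering to the first additional audio signal to obtain the target audio signal.
313. Acquire the first active time of the second additional audio signal and the first distance, and determine the second active time of the main audio signal from the first active time and the first distance.
The first distance is the most recently determined target distance between the target subject and the main device.
In the embodiments of the present invention, the active time of an audio signal is the time period in which the signal contains valid audio. Accordingly, the first active time of the second additional audio signal may be the time period in which the second additional audio signal contains valid audio.
Optionally, valid audio may be a human voice, the sound of a musical instrument, or the like. By way of example, it may be the sound of the target subject.
In the embodiments of the present invention, the delay between the second additional audio signal and the main audio signal can be determined from the first distance and the speed of sound. From that delay and the first active time, the portion of the main audio signal that corresponds to the first active time of the second additional audio signal, that is, the second active time, can be determined.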
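Mapping the first active time onto the main audio signal can be sketched as a shift by the propagation delay. The sketch assumes both streams share a common time base and that the only offset is the acoustic delay over the previously estimated distance (system delay is ignored here); the function name and interval representation are hypothetical.

```python
def map_active_interval(first_active, last_distance_m,
                        speed_of_sound=343.0):
    """Shift an active interval (start_s, end_s) of the additional
    signal to the corresponding interval of the main signal, which
    hears the same sound later by the propagation delay."""
    start, end = first_active
    delay = last_distance_m / speed_of_sound
    return (start + delay, end + delay)


# The target was last estimated 34.3 m away, so sound active from
# 1.0 s to 2.0 s at the earphone reaches the main device 0.1 s later.
second_active = map_active_interval((1.0, 2.0), 34.3)
```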
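314. Perform direction-of-arrival estimation on the main audio signal within the second active time to obtain the target azimuth between the target subject and the main device.
315. Beamform the main audio signal toward the target azimuth to obtain a beamformed signal, and determine the first delay between the beamformed signal and the second additional audio signal.
316. Calculate the target distance between the target subject and the main device from the first delay and the speed of sound.
When the target subject is outside the shooting field of view of the main device, the captured video data does not contain the target subject, so its position is determined from the audio signals.
In FIG. 1, the audio target tracker uses the main audio signal and the additional audio signal to estimate the target azimuth and target distance between the target subject and the main device. This may include sound source direction finding, beamforming, and delay estimation.
Specifically, the target azimuth can be obtained by direction-of-arrival (DOA) estimation on the main audio signal. To keep a noisy environment or multiple sound sources from affecting DOA estimation, the second additional audio signal is analyzed first to find the time of its active portion, in which valid audio (for example, the sound of the target subject) is present. This is the first active time. The delay between the second additional audio signal and the main audio signal (an estimate of the first delay, based on the previously estimated target distance) is then used to map the first active time to the second active time in the main audio signal. Next, the segment of the main audio signal within the second active time is extracted and DOA estimation is performed on it, giving the azimuth between the target subject and the main device, which is used as the target azimuth.
Optionally, DOA estimation may first estimate the time difference of arrival (TDOA) using the generalized cross-correlation (GCC) method with phase transform (PHAT) weighting, and then combine it with the geometry of the microphone array to obtain the DOA. Once the DOA estimate is available, the multi-channel main audio signal is passed through a fixed-direction beamformer to obtain the beamformed signal, which is directionally enhanced toward the target azimuth to improve the accuracy of the subsequent delay estimation. The beamforming method may be delay-and-sum or minimum variance distortionless response (MVDR). The first delay is likewise estimated with the TDOA method, between the beamformed main audio signal and the second additional audio signal, and again only within the active time of the second additional audio signal. The distance between the target subject and the main device, that is, the target distance, is obtained from the first delay, the speed of sound, and the known system delay.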
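The GCC-PHAT delay estimate and the conversion from TDOA to DOA can be sketched as below. The sketch uses a naive discrete Fourier transform so that it needs only the standard library, which is practical for short frames only. The two-microphone far-field formula in `doa_from_tdoa` is an assumption, since the description does not specify the array geometry; all function names are hypothetical.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat_lag(sig, ref):
    """Lag in samples by which `sig` trails `ref` (negative if it leads)."""
    n = len(sig) + len(ref)                     # zero-pad: no wrap-around
    a = dft(list(sig) + [0.0] * (n - len(sig)))
    b = dft(list(ref) + [0.0] * (n - len(ref)))
    cross = [x * y.conjugate() for x, y in zip(a, b)]
    # PHAT weighting: discard magnitude, keep only the phase.
    cross = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    corr = [v.real for v in dft(cross, inverse=True)]
    peak = max(range(n), key=lambda i: corr[i])
    return peak if peak < n // 2 else peak - n


def doa_from_tdoa(tdoa_s, mic_spacing_m, speed_of_sound=343.0):
    """Far-field arrival angle (degrees from broadside) for two mics."""
    s = speed_of_sound * tdoa_s / mic_spacing_m
    return math.degrees(math.asin(max(-1.0, min(1.0, s))))


ref = [1.0, -2.0, 3.0, 0.5, -1.5, 2.5, -0.5, 1.0]
sig = [0.0] * 3 + ref                           # same signal, 3 samples late
lag = gcc_phat_lag(sig, ref)
```

Dividing the lag by the sample rate gives the delay in seconds; a production implementation would use an FFT and sub-sample peak interpolation.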
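317. Align the second additional audio signal with the main audio signal in the time domain according to the first delay to obtain the first additional audio signal.
When the target subject is outside the shooting field of view of the main device, the first delay is used as the target delay between the main audio signal and the second additional audio signal, and the second additional audio signal is aligned with the main audio signal in the time domain according to the first delay to obtain the first additional audio signal.
The delay compensator in FIG. 1 aligns the second additional audio signal with the main audio signal in the time domain according to the first delay to obtain the first additional audio signal.
318. According to the shooting field of view of the main device, spatially filter the main audio signal toward the region inside the field of view to obtain the focused audio signal.
319. Using the first additional audio signal as the reference signal, apply adaptive filtering to the focused audio signal to obtain the target audio signal.
When the target subject is inside the shooting field of view, a high-quality close-up audio signal is already available from the additional audio signal, so the main purpose of spatial filtering is to obtain a cleaner ambient audio signal. The target region of spatial filtering is therefore outside the field of view, and the result is referred to as the reverse-focus audio signal. When the target subject is outside the shooting field of view, spatial filtering is needed to obtain a close-up audio signal from inside the field of view. The target region is therefore the field of view itself, and the result is referred to as the focused audio signal.
Further, because spatial filtering takes the shooting field of view of the main device into account, it can follow changes in the field of view and directionally enhance the corresponding part of the audio signal.
In FIG. 1, the two groups of adaptive filters act on the focused audio signal and the additional audio signal respectively. Only one group is enabled at a time, depending on where the target is relative to the shooting field of view. When the target is inside the field of view, the adaptive filter acting on the additional audio signal is enabled and the reverse-focus audio signal is fed in as the reference, further suppressing ambient sound in the additional audio signal so that sound near the target subject stands out. When the target is outside the field of view, the adaptive filter acting on the focused audio signal is enabled and the additional audio signal is fed in as the reference, further suppressing sound from outside the field of view in the focused audio signal. The adaptive filtering method may be, for example, least mean squares (LMS).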
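320. Generate a second multi-channel transfer function from the target distance and the target azimuth.
321. Perform multi-channel rendering on the target audio signal according to the second multi-channel transfer function to obtain the target multi-channel audio signal.
322. Determine a first gain for the ambient multi-channel audio signal and a second gain for the target multi-channel audio signal from the shooting parameters of the main device.
323. Mix the ambient multi-channel audio signal with the target multi-channel audio signal according to the first gain and the second gain to obtain the mixed multi-channel audio signal.
In FIG. 1, a mixing gain controller determines the mixing gains, that is, the proportion of each of the two signal groups in the mix, from the user's shooting field of view. For example, when the camera zoom level increases, which narrows the field of view, the gain of the ambient binaural audio signal decreases, while the gain of the additional binaural audio signal (the target multi-channel audio signal determined when the target subject is inside the field of view) or of the focused binaural audio signal (the target multi-channel audio signal determined when the target subject is outside the field of view) increases. In this way, as the video field of view focuses on a given region, the audio focuses on the same region.
In the embodiments of the present invention, the size of the shooting field of view is determined from the shooting parameters of the main device (such as the camera zoom level) and is used to determine the first gain of the ambient multi-channel audio signal and the second gain of the target multi-channel audio signal. As the video field of view focuses on a given region, the audio focuses on that region as well, creating an effect of being on the scene with the sound following the picture.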
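One possible mapping from zoom level to the two mixing gains is sketched below. The description only states the direction of the change (zooming in lowers the ambient gain and raises the target gain); the linear crossfade, the gain ranges, and the zoom limits are all hypothetical choices made for illustration.

```python
def mixing_gains(zoom, min_zoom=1.0, max_zoom=10.0):
    """Return (ambient_gain, target_gain) for a camera zoom level.

    Hypothetical linear crossfade: the ambient gain falls from 1.0 to
    0.5 and the target gain rises from 0.5 to 1.0 as the zoom goes
    from `min_zoom` to `max_zoom`. Out-of-range zooms are clamped.
    """
    z = min(max(zoom, min_zoom), max_zoom)
    t = (z - min_zoom) / (max_zoom - min_zoom)
    return 1.0 - 0.5 * t, 0.5 + 0.5 * t


wide_gains = mixing_gains(1.0)     # widest field of view
tele_gains = mixing_gains(10.0)    # narrowest field of view
```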
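The multi-channel audio signal acquisition method provided by the embodiments of the present invention is a distributed recording and audio focusing method that creates a more realistic sense of presence. It uses the microphone array on the main device in the terminal device together with the microphone on the additional device (a TWS Bluetooth earphone) to capture and fuse audio in a distributed manner. The microphone array in the terminal device captures the spatial audio at the main device's position (the main audio signal of the embodiments). The TWS Bluetooth earphone is placed on the target subject to be tracked and, moving with the target subject, captures a high-quality close-up audio signal at a distance (the first additional audio signal of the embodiments). Taking into account changes in the field of view (FOV) during shooting, the two captured signals are adaptively filtered to suppress ambient sound, and the spatial audio signal is spatially filtered toward a specified region for directional enhancement. Visual and audio localization are combined to track and locate the target of interest. The spatial audio, the high-quality close-up audio, and the directionally enhanced audio are each rendered binaurally with HRTFs and upmixed or downmixed, giving three groups of binaural signals: the ambient binaural signal, the additional binaural signal, and the focused binaural signal. Finally, the mixing proportions of these binaural signals are determined from the size of the FOV, and the signals are mixed.
This technical solution has the following benefits:
When the final binaural audio signal is played over stereo earphones, it simulates both the spatial sound field and a point-like auditory target at a specified position.
Using distributed audio signals gives better directional enhancement, with stronger suppression of interfering and ambient sound when focusing.
The audio follows changes in the FOV and better focuses on and tracks the sound the user is interested in, creating an immersive experience of being on the scene with the sound following the picture.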
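As shown in FIG. 4, an embodiment of the present invention provides a multi-channel audio signal acquisition apparatus 400, comprising:
an acquisition module 401, configured to acquire the main audio signal collected by the main device while it shoots video of the target subject, and perform first multi-channel rendering on it to obtain the ambient multi-channel audio signal; and to acquire the audio signal collected by the additional device and determine the first additional audio signal, wherein the distance between the additional device and the target subject is less than the first threshold; and
a processing module 402, configured to perform ambient sound suppression using the first additional audio signal and the main audio signal to obtain the target audio signal;
perform second multi-channel rendering on the target audio signal to obtain the target multi-channel audio signal; and
mix the ambient multi-channel audio signal with the target multi-channel audio signal to obtain the mixed multi-channel audio signal.
Optionally, the processing module 402 is specifically configured to determine the first gain of the ambient multi-channel audio signal and the second gain of the target multi-channel audio signal from the shooting parameters of the main device; and
mix the ambient multi-channel audio signal with the target multi-channel audio signal according to the first gain and the second gain to obtain the mixed multi-channel audio signal.
Optionally, the acquisition module 401 is specifically configured to acquire the main audio signal collected by the microphone array on the main device;
generate the first multi-channel transfer function according to the geometry of the microphone array on the main device; and
perform multi-channel rendering on the main audio signal according to the first multi-channel transfer function to obtain the ambient multi-channel audio signal.
Optionally, the acquisition module 401 is specifically configured to acquire the second additional audio signal collected by the additional device on the target subject and use the second additional audio signal as the first additional audio signal;
or
acquire the second additional audio signal collected by the additional device and align the second additional audio signal with the main audio signal in the time domain to obtain the first additional audio signal.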
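Optionally, the processing module 402 is specifically configured to acquire the target azimuth between the target subject and the main device;
beamform the main audio signal toward the target azimuth to obtain a beamformed signal;
determine the target delay between the main audio signal and the second additional audio signal; and
align the second additional audio signal with the main audio signal in the time domain according to the target delay to obtain the first additional audio signal.
Optionally, the processing module 402 is specifically configured to acquire the target distance and the target azimuth between the target subject and the main device;
generate the second multi-channel transfer function from the target distance and the target azimuth; and
perform multi-channel rendering on the target audio signal according to the second multi-channel transfer function to obtain the target multi-channel audio signal.
Optionally, the acquisition module 401 is specifically configured to, when the target subject is detected outside the shooting field of view of the main device, acquire the first active time of the second additional audio signal and the first distance, the first distance being the most recently determined target distance between the target subject and the main device;
determine the second active time of the main audio signal from the first active time and the first distance; and
perform direction-of-arrival estimation on the main audio signal within the second active time to obtain the target azimuth between the target subject and the main device.
Optionally, the acquisition module 401 is specifically configured to, when the target subject is detected outside the shooting field of view of the main device, beamform the main audio signal toward the target azimuth to obtain a beamformed signal;
determine the first delay between the beamformed signal and the second additional audio signal; and
calculate the target distance between the target subject and the main device from the first delay and the speed of sound.
Optionally, the processing module 402 is specifically configured to, when the target subject is detected outside the shooting field of view of the main device, spatially filter the main audio signal toward the region inside the field of view, according to the shooting field of view of the main device, to obtain the focused audio signal; and
apply adaptive filtering to the focused audio signal, using the first additional audio signal as the reference signal, to obtain the target audio signal.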
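Optionally, the acquisition module 401 is specifically configured to, when the target subject is detected inside the shooting field of view of the main device, determine the first azimuth between the target subject and the main device from the video information and shooting parameters obtained by the main device;
acquire the first active time of the second additional audio signal and the first distance, the first distance being the most recently determined target distance between the target subject and the main device;
determine the second active time of the main audio signal from the first active time and the first distance;
perform direction-of-arrival estimation on the main audio signal within the second active time to obtain the second azimuth between the target subject and the main device; and
smooth the first azimuth and the second azimuth to obtain the target azimuth.
Optionally, the acquisition module 401 is specifically configured to, when the target subject is detected inside the shooting field of view of the main device, determine the second distance between the target subject and the main device from the video information obtained by the main device;
calculate the second delay from the second distance and the speed of sound;
beamform the main audio signal toward the target azimuth to obtain a beamformed signal;
determine the first delay between the beamformed signal and the second additional audio signal;
smooth the second delay and the first delay to obtain the target delay; and
calculate the target distance from the target delay and the speed of sound.
Optionally, the processing module 402 is configured to, when the target subject is detected inside the shooting field of view of the main device, spatially filter the main audio signal toward the region outside the field of view, according to the shooting field of view of the main device, to obtain the reverse-focus audio signal; and
apply adaptive filtering to the first additional audio signal, using the reverse-focus audio signal as the reference signal, to obtain the target audio signal.
Optionally, the processing module 402 is specifically configured to acquire the video data captured by the main device and the second additional audio signal collected by the additional device;
determine the current scene category and the target subject category; and
perform multi-channel rendering on the target audio signal using the first rendering rule that matches the current scene category and the target subject category, to obtain the target multi-channel audio signal.
Optionally, the processing module 402 is specifically configured to acquire the main audio signal collected by the main device while it shoots video of the target subject;
determine the current scene category; and
perform first multi-channel rendering on the main audio signal using the second rendering rule that matches the current scene category, to obtain the ambient multi-channel audio signal.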
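An embodiment of the present invention provides a terminal device, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the multi-channel audio signal acquisition method provided by the method embodiments above.
As shown in FIG. 5, an embodiment of the present invention further provides a terminal device that includes the multi-channel audio signal acquisition apparatus 400 described above and a main device 500.
The main device is configured to collect the main audio signal while shooting video and to send the main audio signal to the multi-channel audio signal acquisition apparatus.
As shown in FIG. 6, an embodiment of the present invention further provides a terminal device that includes, but is not limited to, a radio frequency (RF) circuit 601, a memory 602, an input unit 603, a display unit 604, a sensor 605, an audio circuit 606, a wireless fidelity (WiFi) module 607, a processor 608, a Bluetooth module 609, and a camera 610. The RF circuit 601 includes a receiver 6011 and a transmitter 6012. A person skilled in the art will understand that the terminal device structure shown in FIG. 6 does not limit the terminal device, which may include more or fewer components than shown, combine some components, or arrange the components differently.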
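The RF circuit 601 receives and sends signals while information is being transmitted or received, or during a call. In particular, it receives downlink information from a base station and passes it to the processor 608 for processing, and it sends uplink data to the base station. The RF circuit 601 typically includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), and a duplexer. The RF circuit 601 can also communicate with networks and other devices by wireless communication, which may use any communication standard or protocol, including but not limited to global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, and short messaging service (SMS).
The memory 602 stores software programs and modules. The processor 608 runs the software programs and modules stored in the memory 602 to execute the various functional applications and data processing of the terminal device. The memory 602 may mainly include a program storage area and a data storage area. The program storage area may store the operating system and the applications required by at least one function (such as sound playback or image playback). The data storage area may store data created through use of the terminal device (such as audio signals or a phone book). In addition, the memory 602 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
The input unit 603 receives entered numeric or character information and generates key signal input related to user settings and function control of the terminal device. Specifically, the input unit 603 may include a touch panel 6031 and other input devices 6032. The touch panel 6031, also called a touch screen, collects the user's touch operations on or near it (such as operations performed on or near the touch panel 6031 with a finger, stylus, or any other suitable object or accessory) and drives the corresponding connected apparatus according to a preset program. Optionally, the touch panel 6031 may include a touch detection apparatus and a touch controller. The touch detection apparatus detects the position of the user's touch and the signal produced by the touch operation, and passes the signal to the touch controller. The touch controller receives the touch information from the touch detection apparatus, converts it into contact coordinates, sends them to the processor 608, and receives and executes commands from the processor 608. The touch panel 6031 may be implemented with resistive, capacitive, infrared, surface acoustic wave, or other technologies. Besides the touch panel 6031, the input unit 603 may include other input devices 6032, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume keys or a power key), a trackball, a mouse, and a joystick.
The display unit 604 displays information entered by the user, information provided to the user, and the various menus of the terminal device. The display unit 604 may include a display panel 6041, which may optionally be a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 6031 may cover the display panel 6041. When the touch panel 6031 detects a touch operation on or near it, it passes the operation to the processor 608 to determine the type of touch event, and the processor 608 then provides the corresponding visual output on the display panel 6041 according to the type of touch event. Although in FIG. 6 the touch panel 6031 and the display panel 6041 are two separate components that implement the input and output functions of the terminal device, in some embodiments the touch panel 6031 and the display panel 6041 may be integrated to implement the input and output functions.
The terminal device may further include at least one sensor 605, such as a light sensor, a motion sensor, or another sensor. Specifically, the light sensor may include an ambient light sensor, which adjusts the brightness of the display panel 6041 according to the ambient light level, and a proximity sensor, which turns off the display panel 6041 and/or the backlight when the terminal device is moved to the ear. As one kind of motion sensor, an accelerometer detects the magnitude of acceleration in each direction (usually three axes) and, when stationary, the magnitude and direction of gravity. It can be used in applications that recognize the attitude of the terminal device (such as switching between landscape and portrait, related games, and magnetometer attitude calibration) and in vibration recognition functions (such as a pedometer or tap detection). Other sensors that the terminal device may include, such as a gyroscope, barometer, hygrometer, thermometer, or infrared sensor, are not described here. In the embodiments of the present invention, the terminal device may include an acceleration sensor, a depth sensor, a distance sensor, or the like.
The audio circuit 606, a speaker 6061, and a microphone 6062 provide the audio interface between the user and the terminal device. The audio circuit 606 converts received audio signals into electrical signals and passes them to the speaker 6061, which converts them into sound. In the other direction, the microphone 6062 converts collected sound into electrical signals, which the audio circuit 606 receives and converts into audio signals. The audio signals are output to the processor 608 for processing and then sent through the RF circuit 601 to, for example, another terminal device, or output to the memory 602 for further processing. The microphone 6062 may be a microphone array.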
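WiFi is a short-range wireless transmission technology. Through the WiFi module 607, the terminal device lets the user send and receive email, browse web pages, access streaming media, and so on, providing wireless broadband Internet access. Although FIG. 6 shows the WiFi module 607, it is not an essential part of the terminal device and can be omitted as needed without changing the essence of the invention.
The processor 608 is the control center of the terminal device. It connects all parts of the terminal device through various interfaces and lines, and executes the functions of the terminal device and processes data by running or executing the software programs and/or modules stored in the memory 602 and calling the data stored in the memory 602, thereby monitoring the terminal device as a whole. Optionally, the processor 608 may include one or more processing units. Preferably, the processor 608 integrates an application processor, which mainly handles the operating system, user interface, and applications, and a modem processor, which mainly handles wireless communication. The modem processor need not be integrated into the processor 608.
The terminal device further includes a Bluetooth module 609, which is used for short-range wireless communication and, by function, is either a Bluetooth data module or a Bluetooth voice module. A Bluetooth module is the basic circuit set of a chip with integrated Bluetooth functionality, used for wireless network communication. Bluetooth modules fall broadly into three types: data transmission modules, Bluetooth audio modules, and combined Bluetooth audio and data modules.
Although not shown, the terminal device may include other functional modules, which are not described here.
In the embodiments of the present invention, the microphone 6062 can collect the main audio signal. The terminal device can connect to the additional device through the WiFi module 607 or the Bluetooth module 609 and receive the second additional audio signal collected by the additional device.
The processor 608 is configured to acquire the main audio signal and perform multi-channel rendering on it to obtain the ambient multi-channel audio signal; acquire the audio signal collected by the additional device and determine the first additional audio signal; perform ambient sound suppression using the first additional audio signal and the main audio signal to obtain the target audio signal; perform multi-channel rendering on the target audio signal to obtain the target multi-channel audio signal; and mix the ambient multi-channel audio signal with the target multi-channel audio signal to obtain the mixed multi-channel audio signal. The distance between the additional device and the target subject is less than the first threshold.
Optionally, the processor 608 may also implement the other processes performed by the terminal device in the method embodiments above, which are not repeated here.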
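An embodiment of the present invention further provides a multi-channel audio signal acquisition system, comprising a multi-channel audio signal acquisition apparatus, a main device, and an additional device, wherein the main device and the additional device each establish a communication connection with the multi-channel audio signal acquisition apparatus;
the main device is configured to collect the main audio signal while shooting video of the target subject and to send the main audio signal to the multi-channel audio signal acquisition apparatus; and
the additional device is configured to collect the second additional audio signal and to send the second additional audio signal to the multi-channel audio signal acquisition apparatus.
By way of example, the multi-channel audio signal acquisition system may be as shown in FIG. 1, where the audio processing device in FIG. 1 may be the multi-channel audio signal acquisition apparatus.
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the multi-channel audio signal acquisition method of the method embodiments above.
To help a person skilled in the art better understand the solutions of the present invention, the technical solutions in the embodiments are described with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained on the basis of these embodiments shall fall within the scope of protection of the present invention.
A person skilled in the art will clearly understand that, for convenience and brevity, the specific working processes of the systems, apparatuses, and units described above can be found in the corresponding processes of the method embodiments and are not repeated here.
In the embodiments provided by the present invention, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units can be selected as needed to achieve the purpose of the solution of an embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may each exist physically on their own, or may be integrated two or more to a unit. The integrated unit can be implemented in hardware or as a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention, in essence, or the part that contributes over the prior art, or all or part of the technical solution, can be embodied as a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The storage medium includes USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store program code.
The embodiments above are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments can still be modified, or some of their technical features replaced with equivalents, and that such modifications or replacements do not take the essence of the corresponding technical solutions outside the spirit and scope of the technical solutions of the embodiments of the present invention.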

Claims (32)

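  1. A multi-channel audio signal acquisition method, comprising:
    acquiring a main audio signal collected by a main device while it shoots video of a target subject, and performing first multi-channel rendering on it to obtain an ambient multi-channel audio signal;
    acquiring an audio signal collected by an additional device, and determining a first additional audio signal, wherein the distance between the additional device and the target subject is less than a first threshold;
    performing ambient sound suppression on the first additional audio signal and the main audio signal to obtain a target audio signal;
    performing second multi-channel rendering on the target audio signal to obtain a target multi-channel audio signal; and mixing the ambient multi-channel audio signal with the target multi-channel audio signal to obtain a mixed multi-channel audio signal.
  2. The method according to claim 1, wherein mixing the ambient multi-channel audio signal with the target multi-channel audio signal to obtain the mixed multi-channel audio signal comprises:
    determining a first gain of the ambient multi-channel audio signal and a second gain of the target multi-channel audio signal from shooting parameters of the main device;
    mixing the ambient multi-channel audio signal with the target multi-channel audio signal according to the first gain and the second gain to obtain the mixed multi-channel audio signal.
  3. The method according to claim 1, wherein acquiring the main audio signal collected by the main device while it shoots video of the target subject, and performing first multi-channel rendering on it to obtain the ambient multi-channel audio signal, comprises:
    acquiring the main audio signal collected by a microphone array on the main device;
    generating a first multi-channel transfer function according to the geometry of the microphone array on the main device;
    performing first multi-channel rendering on the main audio signal according to the first multi-channel transfer function to obtain the ambient multi-channel audio signal.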
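  4. The method according to claim 1, wherein acquiring the audio signal collected by the additional device and determining the first additional audio signal comprises:
    acquiring a second additional audio signal collected by the additional device, and using the second additional audio signal as the first additional audio signal;
    or
    acquiring a second additional audio signal collected by the additional device, and aligning the second additional audio signal with the main audio signal in the time domain to obtain the first additional audio signal.
  5. The method according to claim 4, wherein aligning the second additional audio signal with the main audio signal in the time domain to obtain the first additional audio signal comprises:
    acquiring a target azimuth between the target subject and the main device;
    determining a target delay between the main audio signal and the second additional audio signal;
    aligning the second additional audio signal with the main audio signal in the time domain according to the target delay to obtain the first additional audio signal.
  6. The method according to claim 1, wherein performing second multi-channel rendering on the target audio signal to obtain the target multi-channel audio signal comprises:
    acquiring a target distance and a target azimuth between the target subject and the main device;
    generating a second multi-channel transfer function from the target distance and the target azimuth;
    performing second multi-channel rendering on the target audio signal according to the second multi-channel transfer function to obtain the target multi-channel audio signal.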
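  7. The method according to claim 6, wherein, when the target subject is detected inside the shooting field of view of the main device, acquiring the target azimuth between the target subject and the main device comprises:
    determining a first azimuth between the target subject and the main device from video information and shooting parameters obtained by the main device;
    acquiring a first active time of the second additional audio signal and a first distance, the first distance being the most recently determined target distance between the target subject and the main device;
    determining a second active time of the main audio signal from the first active time and the first distance;
    performing direction-of-arrival estimation on the main audio signal within the second active time to obtain a second azimuth between the target subject and the main device;
    smoothing the first azimuth and the second azimuth to obtain the target azimuth.
  8. The method according to claim 7, wherein acquiring the target distance between the target subject and the main device comprises:
    determining a second distance between the target subject and the main device from the video information obtained by the main device;
    calculating a second delay from the second distance and the speed of sound;
    beamforming the main audio signal toward the target azimuth to obtain a beamformed signal;
    determining a first delay between the beamformed signal and the second additional audio signal;
    smoothing the second delay and the first delay to obtain a target delay;
    calculating the target distance from the target delay and the speed of sound.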
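  9. The method according to any one of claims 1 to 8, wherein, when the target subject is detected inside the shooting field of view of the main device, performing ambient sound suppression using the first additional audio signal and the main audio signal to obtain the target audio signal comprises:
    spatially filtering the main audio signal toward the region outside the shooting field of view, according to the shooting field of view of the main device, to obtain a reverse-focus audio signal;
    applying adaptive filtering to the first additional audio signal, using the reverse-focus audio signal as the reference signal, to obtain the target audio signal.
  10. The method according to claim 6, wherein, when the target subject is detected outside the shooting field of view of the main device, acquiring the target azimuth between the target subject and the main device comprises:
    acquiring a first active time of the second additional audio signal and a first distance, the first distance being the most recently determined target distance between the target subject and the main device;
    determining a second active time of the main audio signal from the first active time and the first distance;
    performing direction-of-arrival estimation on the main audio signal within the second active time to obtain the target azimuth between the target subject and the main device.
  11. The method according to claim 6, wherein, when the target subject is detected outside the shooting field of view of the main device, acquiring the target distance between the target subject and the main device comprises:
    beamforming the main audio signal toward the target azimuth to obtain a beamformed signal;
    determining a first delay between the beamformed signal and the second additional audio signal;
    calculating the target distance between the target subject and the main device from the first delay and the speed of sound.
  12. The method according to any one of claims 1 to 6, 10, and 11, wherein, when the target subject is detected outside the shooting field of view of the main device, performing ambient sound suppression using the first additional audio signal and the main audio signal to obtain the target audio signal comprises:
    spatially filtering the main audio signal toward the region inside the shooting field of view, according to the shooting field of view of the main device, to obtain a focused audio signal;
    applying adaptive filtering to the focused audio signal, using the first additional audio signal as the reference signal, to obtain the target audio signal.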
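  13. The method according to claim 1, wherein performing multi-channel rendering on the target audio signal to obtain the target multi-channel audio signal comprises:
    acquiring video data captured by the main device and a second additional audio signal collected by the additional device;
    determining a current scene category and a target subject category;
    performing second multi-channel rendering on the target audio signal using a first rendering rule that matches the current scene category and the target subject category, to obtain the target multi-channel audio signal.
  14. The method according to claim 1, wherein acquiring the main audio signal collected by the main device while it shoots video of the target subject, and performing first multi-channel rendering on it to obtain the ambient multi-channel audio signal, comprises:
    acquiring the main audio signal collected by the main device while it shoots video of the target subject;
    determining a current scene category;
    performing first multi-channel rendering on the main audio signal using a second rendering rule that matches the current scene category, to obtain the ambient multi-channel audio signal.
  15. A multi-channel audio signal acquisition apparatus, comprising:
    an acquisition module, configured to acquire a main audio signal collected by a main device while it shoots video of a target subject, and perform first multi-channel rendering on it to obtain an ambient multi-channel audio signal; and to acquire an audio signal collected by an additional device and determine a first additional audio signal, wherein the distance between the additional device and the target subject is less than a first threshold;
    a processing module, configured to perform ambient sound suppression using the first additional audio signal and the main audio signal to obtain a target audio signal;
    perform second multi-channel rendering on the target audio signal to obtain a target multi-channel audio signal;
    mix the ambient multi-channel audio signal with the target multi-channel audio signal to obtain a mixed multi-channel audio signal.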
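  16. A terminal device, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein, when the computer program is executed by the processor, the processor is configured to:
    acquire a main audio signal collected by a main device while it shoots video of a target subject, and perform first multi-channel rendering on it to obtain an ambient multi-channel audio signal;
    acquire an audio signal collected by an additional device, and determine a first additional audio signal, wherein the distance between the additional device and the target subject is less than a first threshold;
    perform ambient sound suppression on the first additional audio signal and the main audio signal to obtain a target audio signal;
    perform second multi-channel rendering on the target audio signal to obtain a target multi-channel audio signal; and mix the ambient multi-channel audio signal with the target multi-channel audio signal to obtain a mixed multi-channel audio signal.
  17. The terminal device according to claim 16, wherein the processor is specifically configured to:
    determine a first gain of the ambient multi-channel audio signal and a second gain of the target multi-channel audio signal from shooting parameters of the main device;
    mix the ambient multi-channel audio signal with the target multi-channel audio signal according to the first gain and the second gain to obtain the mixed multi-channel audio signal.
  18. The terminal device according to claim 16, wherein the processor is specifically configured to:
    acquire the main audio signal collected by a microphone array on the main device;
    generate a first multi-channel transfer function according to the geometry of the microphone array on the main device;
    perform first multi-channel rendering on the main audio signal according to the first multi-channel transfer function to obtain the ambient multi-channel audio signal.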
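  19. The terminal device according to claim 16, wherein the processor is specifically configured to:
    acquire a second additional audio signal collected by the additional device, and use the second additional audio signal as the first additional audio signal;
    or
    acquire a second additional audio signal collected by the additional device, and align the second additional audio signal with the main audio signal in the time domain to obtain the first additional audio signal.
  20. The terminal device according to claim 19, wherein the processor is specifically configured to:
    acquire a target azimuth between the target subject and the main device;
    determine a target delay between the main audio signal and the second additional audio signal;
    align the second additional audio signal with the main audio signal in the time domain according to the target delay to obtain the first additional audio signal.
  21. The terminal device according to claim 16, wherein the processor is specifically configured to:
    acquire a target distance and a target azimuth between the target subject and the main device;
    generate a second multi-channel transfer function from the target distance and the target azimuth;
    perform second multi-channel rendering on the target audio signal according to the second multi-channel transfer function to obtain the target multi-channel audio signal.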
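  22. The terminal device according to claim 21, wherein the processor is specifically configured to:
    determine a first azimuth between the target subject and the main device from video information and shooting parameters obtained by the main device;
    acquire a first active time of the second additional audio signal and a first distance, the first distance being the most recently determined target distance between the target subject and the main device;
    determine a second active time of the main audio signal from the first active time and the first distance;
    perform direction-of-arrival estimation on the main audio signal within the second active time to obtain a second azimuth between the target subject and the main device;
    smooth the first azimuth and the second azimuth to obtain the target azimuth.
  23. The terminal device according to claim 22, wherein the processor is specifically configured to:
    determine a second distance between the target subject and the main device from the video information obtained by the main device;
    calculate a second delay from the second distance and the speed of sound;
    beamform the main audio signal toward the target azimuth to obtain a beamformed signal;
    determine a first delay between the beamformed signal and the second additional audio signal;
    smooth the second delay and the first delay to obtain a target delay;
    calculate the target distance from the target delay and the speed of sound.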
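  24. The terminal device according to any one of claims 16 to 23, wherein the processor is specifically configured to:
    spatially filter the main audio signal toward the region outside the shooting field of view, according to the shooting field of view of the main device, to obtain a reverse-focus audio signal;
    apply adaptive filtering to the first additional audio signal, using the reverse-focus audio signal as the reference signal, to obtain the target audio signal.
  25. The terminal device according to claim 22, wherein the processor is specifically configured to:
    acquire a first active time of the second additional audio signal and a first distance, the first distance being the most recently determined target distance between the target subject and the main device;
    determine a second active time of the main audio signal from the first active time and the first distance;
    perform direction-of-arrival estimation on the main audio signal within the second active time to obtain the target azimuth between the target subject and the main device.
  26. The terminal device according to claim 22, wherein the processor is specifically configured to:
    beamform the main audio signal toward the target azimuth to obtain a beamformed signal;
    determine a first delay between the beamformed signal and the second additional audio signal;
    calculate the target distance between the target subject and the main device from the first delay and the speed of sound.
  27. The terminal device according to any one of claims 16 to 22, 25, and 26, wherein the processor is specifically configured to:
    spatially filter the main audio signal toward the region inside the shooting field of view, according to the shooting field of view of the main device, to obtain a focused audio signal;
    apply adaptive filtering to the focused audio signal, using the first additional audio signal as the reference signal, to obtain the target audio signal.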
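  28. The terminal device according to claim 16, wherein the processor is specifically configured to:
    acquire video data captured by the main device and a second additional audio signal collected by the additional device;
    determine a current scene category and a target subject category;
    perform second multi-channel rendering on the target audio signal using a first rendering rule that matches the current scene category and the target subject category, to obtain the target multi-channel audio signal.
  29. The terminal device according to claim 16, wherein the processor is specifically configured to:
    acquire the main audio signal collected by the main device while it shoots video of the target subject;
    determine a current scene category;
    perform first multi-channel rendering on the main audio signal using a second rendering rule that matches the current scene category, to obtain the ambient multi-channel audio signal.
  30. A terminal device, comprising the multi-channel audio signal acquisition apparatus according to claim 15 and a main device,
    wherein the main device is configured to collect a main audio signal while shooting video of a target subject and to send the main audio signal to the multi-channel audio signal acquisition apparatus.
  31. A multi-channel audio signal acquisition system, comprising the multi-channel audio signal acquisition apparatus according to claim 15, a main device, and an additional device, wherein the main device and the additional device each establish a communication connection with the multi-channel audio signal acquisition apparatus,
    the main device is configured to collect a main audio signal while shooting video of a target subject and to send the main audio signal to the multi-channel audio signal acquisition apparatus;
    the additional device is configured to collect a second additional audio signal and to send the second additional audio signal to the multi-channel audio signal acquisition apparatus;
    wherein the distance between the additional device and the target subject is less than a first threshold.
  32. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the multi-channel audio signal acquisition method according to any one of claims 1 to 14.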
PCT/CN2021/103110 2020-09-25 2021-06-29 Multi-channel audio signal acquisition method, apparatus, and system WO2022062531A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP21870910.3A EP4220637A4 (en) 2020-09-25 2021-06-29 METHOD AND APPARATUS FOR ACQUIRING MULTI-CHANNEL AUDIO SIGNAL, AND SYSTEM

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011027264.8A CN114255781A (zh) 2020-09-25 2020-09-25 Multi-channel audio signal acquisition method, apparatus, and system
CN202011027264.8 2020-09-25

Publications (1)

Publication Number Publication Date
WO2022062531A1 true WO2022062531A1 (zh) 2022-03-31

Family

ID=80790688

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/103110 WO2022062531A1 (zh) 2020-09-25 2021-06-29 Multi-channel audio signal acquisition method, apparatus, and system

Country Status (3)

Country Link
EP (1) EP4220637A4 (zh)
CN (1) CN114255781A (zh)
WO (1) WO2022062531A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116668892B (zh) * 2022-11-14 2024-04-12 荣耀终端有限公司 Audio signal processing method, electronic device, and readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102516625B1 (ko) * 2015-01-30 2023-03-30 디티에스, 인코포레이티드 System and method for capturing, encoding, distributing, and decoding immersive audio
GB2543275A (en) * 2015-10-12 2017-04-19 Nokia Technologies Oy Distributed audio capture and mixing
GB2567244A (en) * 2017-10-09 2019-04-10 Nokia Technologies Oy Spatial audio signal processing

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102969003A (zh) * 2012-11-15 2013-03-13 东莞宇龙通信科技有限公司 Method and apparatus for extracting sound during video capture
CN104599674A (zh) * 2014-12-30 2015-05-06 西安乾易企业管理咨询有限公司 System and method for directional recording during video capture
CN108352155A (zh) * 2015-09-30 2018-07-31 惠普发展公司,有限责任合伙企业 Suppressing ambient sound
CN108370471A (zh) * 2015-10-12 2018-08-03 诺基亚技术有限公司 Distributed audio capture and mixing
US20170359467A1 (en) * 2016-06-10 2017-12-14 Glen A. Norris Methods and Apparatus to Assist Listeners in Distinguishing Between Electronically Generated Binaural Sound and Physical Environment Sound
CN110089131A (zh) * 2016-11-16 2019-08-02 诺基亚技术有限公司 Distributed audio capture and mixing control
CN108389586A (zh) * 2017-05-17 2018-08-10 宁波桑德纳电子科技有限公司 Remote sound collection apparatus, monitoring apparatus, and remote sound collection method
US20190222950A1 (en) * 2017-06-30 2019-07-18 Apple Inc. Intelligent audio rendering for video recording
CN110970057A (zh) * 2018-09-29 2020-04-07 华为技术有限公司 Sound processing method, apparatus, and device
CN111050269A (zh) * 2018-10-15 2020-04-21 华为技术有限公司 Audio processing method and electronic device
EP3683794A1 (en) * 2019-01-15 2020-07-22 Nokia Technologies Oy Audio processing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4220637A4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116095465A (zh) * 2022-05-25 2023-05-09 荣耀终端有限公司 Video recording method, apparatus, and storage medium
CN116095465B (zh) * 2022-05-25 2023-10-20 荣耀终端有限公司 Video recording method, apparatus, and storage medium

Also Published As

Publication number Publication date
EP4220637A1 (en) 2023-08-02
EP4220637A4 (en) 2024-01-24
CN114255781A (zh) 2022-03-29

Similar Documents

Publication Publication Date Title
US10397722B2 (en) Distributed audio capture and mixing
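WO2021037129A1 (zh) Sound collection method and apparatus
JP7229925B2 (ja) Gain control in spatial audio systems
JP6400566B2 (ja) System and method for displaying a user interface
US10257611B2 (en) Stereo separation and directional suppression with omni-directional microphones
US20170208415A1 (en) System and method for determining audio context in augmented-reality applications
WO2014161309A1 (zh) Method and apparatus for sound source localization on a mobile terminal
WO2022062531A1 (zh) Multi-channel audio signal acquisition method, apparatus, and system
WO2021103672A1 (zh) Audio data processing method and apparatus, electronic device, and storage medium
US20190149919A1 (en) Distributed Audio Capture and Mixing Controlling
US9832587B1 (en) Assisted near-distance communication using binaural cues
WO2018234625A1 (en) DETERMINATION OF TARGETED SPACE AUDIOS PARAMETERS AND SPACE AUDIO READING
WO2022057365A1 (zh) Noise reduction method, terminal device, and computer-readable storage medium
WO2023197646A1 (zh) Audio signal processing method and electronic device
EP3917160A1 (en) Capturing content
US11646046B2 (en) Psychoacoustic enhancement based on audio source directivity
WO2024027315A1 (zh) Audio processing method and apparatus, electronic device, storage medium, and program product
WO2023088156A1 (zh) Sound speed correction method and apparatus
CN110428802B (zh) Sound reverberation method and apparatus, computer device, and computer storage medium
EP3840403A1 (en) Rotating camera and microphone configurations
CN117636928A (zh) Sound pickup apparatus and related audio enhancement method
CN117153180A (zh) Sound signal processing method and apparatus, storage medium, and electronic device
CN117098060A (zh) Orientation information determination method and apparatus, electronic device, storage medium, and chip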
Peltola Lisätyn audiotodellisuuden sovellukset ulkokäytössä

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21870910

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2021870910

Country of ref document: EP

Effective date: 20230425