WO2012068960A1 - Audio processing method and apparatus in video communication - Google Patents

Audio processing method and apparatus in video communication

Info

Publication number
WO2012068960A1
WO2012068960A1 (PCT/CN2011/082127)
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
video communication
audio
sound source
depth
Prior art date
Application number
PCT/CN2011/082127
Other languages
English (en)
French (fr)
Inventor
岳中辉
Original Assignee
华为终端有限公司 (Huawei Device Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为终端有限公司 (Huawei Device Co., Ltd.)
Priority to EP11843560.1A (published as EP2566194A4)
Publication of WO2012068960A1
Priority to US13/693,823 (published as US9113034B2)

Links

Classifications

    • H04M 3/568: Conference facilities; audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • H04N 5/60: Receiver circuitry for the sound signals
    • H04N 7/15: Conference systems (two-way television systems)
    • H04N 7/152: Conference systems; multipoint control units therefor
    • H04N 21/42203: Client input peripherals; sound input device, e.g. microphone
    • H04N 21/4788: Supplemental services; communicating with other users, e.g. chatting
    • H04N 21/64784: Control signaling between network components and server or clients; data processing by the network
    • H04N 21/8106: Monomedia components involving special audio data, e.g. different tracks for different languages
    • H04R 27/00: Public address systems
    • H04S 7/30: Control circuits for electronic adaptation of the sound field
    • H04S 2400/11: Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H04S 2400/13: Aspects of volume control, not necessarily automatic, in stereophonic sound systems
    • H04S 2400/15: Aspects of sound capture and related signal processing for recording or reproduction

Definitions

  • Embodiments of the present invention relate to the field of communications technologies, and in particular, to an audio processing method and apparatus in video communication.
  • The video conferencing service is a means of multimedia communication: by holding a conference using television equipment and a communication network, images, voice, and data can be exchanged between two or more sites simultaneously.
  • A video conferencing service is generally composed of video terminal equipment, a transmission network, and multipoint control units (hereinafter: MCUs).
  • Video terminal equipment mainly includes video input/output devices, audio input/output devices, video codecs, audio codecs, information communication devices, and multiplexing/signal-splitting devices.
  • The basic function of a video terminal device is to compress and encode the image signal captured by the local camera and the sound signal picked up by the microphone, then send them to the transmission network for delivery to the remote conference site; at the same time, it decodes the digital signal sent from the remote site, restoring analog image and sound signals.
  • The video conferencing service has realized long-distance audio and video communication, and telepresence systems have appeared that give remote communication a face-to-face effect.
  • Current telepresence systems use video conferencing technology to transmit images and sound remotely and cooperate with overall peripherals, for example using large LCD televisions to achieve "life size" rendering and using certain camera processing techniques to achieve "eye-to-eye" contact; combined with an overall conference room decoration scheme, this makes the remote presence appear realistic.
  • Although existing telepresence systems can achieve a fairly realistic effect, for the newly emerged two-row or multi-row telepresence the front and rear rows are separated by a certain distance, while existing systems can only make the image correspond to the sound direction within a single plane; the sounds of the front and rear rows are all emitted from the same plane, and without looking at the image one cannot tell whether a sound comes from the front row or the rear row, which makes the sound's sense of presence unrealistic.
  • Embodiments of the present invention provide an audio processing method and apparatus in video communication, so that sounds from different rows in multi-row video communication can be distinguished.
  • An embodiment of the present invention provides an audio processing method in video communication, including: acquiring audio data in the video communication and sound source position information corresponding to the audio data; and performing depth processing on the audio data according to the sound source position information.
  • An embodiment of the present invention further provides an audio processing method in video communication, including: acquiring audio data of the local end of the video communication and sound source position information corresponding to the audio data; and sending the audio data and the sound source position information to a video communication control unit or a peer end of the video communication, so that the video communication control unit or the peer end performs depth processing on the audio data according to the sound source position information.
  • An embodiment of the present invention provides an audio processing apparatus in video communication, including:
  • a first acquiring module, configured to acquire audio data in the video communication and sound source position information corresponding to the audio data; and
  • a processing module, configured to perform depth processing on the audio data according to the sound source position information acquired by the first acquiring module.
  • An embodiment of the present invention further provides an audio processing apparatus in video communication, including:
  • a second acquiring module, configured to acquire audio data of the local end of the video communication and sound source position information corresponding to the audio data; and
  • a second sending module, configured to send the audio data and the sound source position information acquired by the second acquiring module to a video communication control unit or a peer end of the video communication, so that the video communication control unit or the peer end of the video communication performs depth processing on the audio data according to the sound source position information.
  • With the audio processing method and apparatus in video communication according to the embodiments of the present invention, audio data in the video communication and the corresponding sound source position information are first acquired, and the audio data is then depth-processed according to the acquired sound source position information so that it carries a sense of depth matching that information; sounds emitted by objects at different positions in the video communication can thereby be distinguished.
  • FIG. 1 is a flowchart of Embodiment 1 of an audio processing method in video communication according to the present invention;
  • FIG. 2 is a flowchart of Embodiment 2 of the audio processing method in video communication according to the present invention;
  • FIG. 3 is a flowchart of Embodiment 3 of the audio processing method in video communication according to the present invention;
  • FIG. 4 is a schematic diagram of the embodiment shown in FIG. 3;
  • FIG. 5 is a flowchart of Embodiment 4 of the audio processing method in video communication according to the present invention;
  • FIG. 6 is a schematic diagram of Embodiment 1 of an audio processing apparatus in video communication according to the present invention;
  • FIG. 7 is a schematic diagram of Embodiment 2 of the audio processing apparatus in video communication according to the present invention;
  • FIG. 8 is a schematic diagram of Embodiment 3 of the audio processing apparatus in video communication according to the present invention.
  • FIG. 1 is a flowchart of Embodiment 1 of the audio processing method in video communication according to the present invention. As shown in FIG. 1, the method includes:
  • Step 101: Acquire audio data in the video communication and sound source position information corresponding to the audio data.
  • The method embodiments can be applied to video communication with multiple sound sources at different front/rear positions, for example two-row or multi-row video conferences, or 3D video conferences.
  • The following takes a multi-row video conference as an example; the video conference may be a two-party or multi-party session, and other scenarios may refer to the description of this scenario.
  • The audio processing apparatus acquires the audio data of the current speaker in the video communication and the sound source position information corresponding to the audio data; the sound source position information is the position of the object corresponding to the audio data relative to the first row in the video communication, that is, the distance of the current speaker from the first row. When the speaker sits in the first row, the sound source position information is 0.
  • The audio data acquired by the audio processing apparatus in this embodiment may be the local end's audio data, in which case the apparatus is a device at the audio data collecting end; it may be the peer end's audio data, in which case the apparatus is a device at the audio data playing end; the apparatus may also be a device at the MCU end of the video communication, configured to obtain audio data from the collecting end, process it, and send it to the playing end.
  • When the acquired audio data is the local end's: the apparatus can acquire the local end's current audio data and the corresponding sound source position information through the local pickup device (for example, a microphone).
  • When the acquired audio data is the peer end's: the apparatus acquires the peer end's audio data and sound source position information by receiving them from the peer end (the collecting end of the audio data); these were acquired by the peer end's audio processing apparatus through its pickup devices.
  • When the audio processing apparatus is a device at the MCU end: the apparatus receives, from one end of the video communication, that end's audio data and sound source position information, thereby acquiring them.
  • Step 102: Perform depth processing on the audio data according to the sound source position information.
  • The sense of depth is the human ear's perception of distance and depth.
  • Distance is the ear's perception of how far away a particular sound source is; depth describes the perception of the front-to-back extent of the entire sound scene.
  • A sense of depth means that sounds coming from the front carry a layered sense of near and far.
  • When a sound has a sense of depth, the ear can judge from the sound alone how far away its source is; that is, the user can tell the front/rear position of the speaker in the video communication.
  • After acquiring the audio data and the sound source position information, the audio processing apparatus performs depth processing on the audio data according to the sound source position information, so that the audio data carries a sense of depth corresponding to that information; the user can then tell, from the sound alone, the front/rear position in the video of the object the audio data corresponds to.
  • After the depth processing, if the audio processing apparatus is located at the collecting end of the audio data or at the MCU end, it sends the depth-processed audio data to the playing end, so that the peer end plays the audio data; if the apparatus is located at the playing end, it plays the audio data directly.
  • The depth processing of the audio data may include: (1) using a depth-sense control algorithm to control parameters such as the loudness of the sound, the energy ratio of direct sound to reverberant sound, and the amount of high-frequency attenuation; or (2) processing the sound with Wave Field Synthesis so that the processed sound carries a sense of depth.
  • When the depth-sense control algorithm is used, the processing may be performed at the audio data collecting end or at the audio data playing end; when wave field synthesis is used, the processing is performed at the playing end of the audio data.
  • In this embodiment of the present invention, the audio processing apparatus first acquires the audio data in the video communication and the corresponding sound source position information, then depth-processes the audio data according to that information so that it carries a matching sense of depth; sounds emitted by objects at different front/rear positions in the video communication can thereby be distinguished.
  • FIG. 2 is a flowchart of Embodiment 2 of the audio processing method in video communication according to the present invention. As shown in FIG. 2, the method includes:
  • Step 201: Acquire audio data of the local end of the video communication and sound source position information corresponding to the audio data.
  • The scenario to which this embodiment applies is one in which the depth processing of the audio data is performed by the playing end of the audio data or by the MCU end of the video communication.
  • The executing body of this embodiment is the collecting end of the audio data.
  • The audio processing apparatus at the local end of the video communication acquires the local end's current audio data through a sound pickup device, and acquires the sound source position information corresponding to the audio data through the identifier of that pickup device.
  • Participants at different front/rear positions correspond to different pickup devices, so the sound source position information corresponding to the audio data picked up by a device can be derived from which device it is.
  • Step 202: Send the audio data and the sound source position information to a video communication control unit or the peer end of the video communication, so that the video communication control unit or the peer end performs depth processing on the audio data according to the sound source position information.
  • The audio processing apparatus sends the acquired local audio data and sound source position information to a video communication control unit, such as an MCU, which then performs the depth processing on the audio data according to the sound source position information; alternatively, the apparatus sends the acquired local audio data and sound source position information to the peer end of the video communication, which then performs the depth processing.
  • In this embodiment of the present invention, the audio processing apparatus acquires the local end's audio data and the corresponding sound source position information in the video communication, then sends them out, so that the MCU or the peer end of the video communication depth-processes the audio data according to that information and obtains audio data with a matching sense of depth; the listener can thus distinguish sounds emitted by objects at different front/rear positions in the video communication.
  • As these embodiments show, the depth processing may be performed at the collecting end of the audio data, at the MCU end, or at the playing end of the audio data; the embodiments of the present invention are described in detail below according to where the depth processing is performed.
  • FIG. 3 is a flowchart of Embodiment 3 of the audio processing method in video communication according to the present invention.
  • FIG. 4 is a schematic diagram of the embodiment shown in FIG. 3.
  • In this embodiment, the audio data is depth-processed at the collecting end of the audio data.
  • As shown in FIG. 3, the method includes:
  • Step 301: Acquire, through different sound pickup devices, first audio data with different sound source position information in the video communication and the sound source position information corresponding to the first audio data.
  • The first audio data is the audio data of the local end of the video communication in this embodiment.
  • At the local end, the audio data can be picked up by pickup devices such as microphones; the microphones may pick up the audio data in various ways, as long as the audio data of participants in different rows can be identified. For example, each participant at the local end uses one microphone, the audio data picked up by each microphone corresponds to sound source position information, and the audio data of participants in the same row corresponds to the same sound source position information; or each row of participants shares one or several microphones, and the audio data picked up by each row's microphones corresponds to the same sound source position information.
  • Whether each row of participants can share one or several microphones depends on the directivity and sensitivity of the microphones.
  • After a microphone picks up the first audio data, pre-processing such as echo cancellation and noise suppression may be applied to it.
  • The audio processing apparatus acquires the pre-processed first audio data and, from the microphone it corresponds to, obtains the sound source position information corresponding to the first audio data; microphones of different rows correspond to different sound source position information.
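  • For illustration only, here is a minimal sketch (not from the patent) of how a collecting end might attach sound source position information to audio according to which microphone picked it up; the microphone identifiers and row distances are assumptions borrowed from the FIG. 4 example later in the text:

```python
# Hypothetical mapping from pickup-device identifier to sound source position
# information (distance in metres from the first row). Microphones in the
# same row share one value; the front row maps to 0.
MIC_ROW_DISTANCE_M = {
    "M1": 0.0, "M2": 0.0, "M3": 0.0,   # front row: position information is 0
    "M4": 1.5, "M5": 1.5, "M6": 1.5,   # rear row, 1.5 m behind the front row
}

def source_position(mic_id: str) -> float:
    """Sound source position information for audio picked up by mic_id."""
    return MIC_ROW_DISTANCE_M[mic_id]
```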
  • Step 302: Perform depth processing on the first audio data through a depth-sense control algorithm according to the sound source position information.
  • The sense of depth is mainly related to two factors, the loudness of the sound and the energy ratio of direct sound to reverberant sound, and it is also related to the amount of high-frequency attenuation.
  • Loudness is simply the volume. Sound intensity follows an inverse-square law with distance: each doubling of the distance attenuates the level by about 6 dB, so a distant sound is strongly attenuated and the volume reaching the ear is small. Accordingly, from the distance between the front and rear rows, the attenuation of the sound can be calculated, so that the front-row and rear-row sounds are played at different volumes.
  • To convey a sense of depth, the energy ratio of direct sound to reverberant sound must also be adjusted, which can be done by controlling the delay and reverberation of the front-row and rear-row sounds; adding delayed reverberation creates a virtual sense of space. When there is more direct sound, the listener feels the sound comes from nearby; when there is more reverberation, a more pronounced sense of space forms and the sound seems to come from far away.
  • In addition, because high-frequency sound waves have shorter wavelengths than low-frequency ones, they are attenuated more than low-frequency waves when they meet obstacles in a room; the amount of high-frequency attenuation is therefore also a factor affecting the sense of depth.
  • The depth-sense control algorithm may be as follows: first obtain the parameters of the room environment, such as the room size and the room reverberation time; then compute the system transfer function from those parameters; then control three factors of that transfer function, namely the loudness, the energies of direct and reverberant sound, and the amount of high-frequency attenuation.
  • These three factors control the depth effect of a sound so that it matches the sound's source position; for example, the listener can tell whether the sound comes from 1 m away or from 2 m away.
  • The depth-sense control algorithm may be pre-configured in the audio processing apparatus, so that whenever the apparatus acquires audio data and its sound source position information, it can depth-process the audio data according to that information and the algorithm.
  • For example, when the sound source position information is 1 m, the algorithm can adjust the loudness, the delay-and-reverberation ratio, and the high-frequency attenuation to produce audio data with a 1 m depth effect; when the sound source position information is 0, the source sits in the front row, so the sound from that source needs no depth processing.
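  • A toy sketch of how these three cues might be combined follows; it is an illustration under stated assumptions, not the patented algorithm. The mixing law, reverb tail, and cutoff schedule are invented for the example:

```python
# Combine the three depth cues named above: distance-dependent loudness,
# direct-to-reverb energy ratio, and high-frequency attenuation.
import numpy as np

FS = 16000  # sample rate in Hz (assumed)

def depth_process(x: np.ndarray, distance_m: float) -> np.ndarray:
    """Give mono audio x a depth effect of distance_m metres behind row one."""
    if distance_m <= 0:
        return x  # position information 0: front row, no depth processing
    # 1) Loudness: roughly -6 dB per doubling of distance.
    direct = x * 10.0 ** (-6.0 * np.log2(1.0 + distance_m) / 20.0)
    # 2) Direct/reverb ratio: delay the signal, give it a crude decaying
    #    tail, and mix in more of it the farther away the source is.
    delay = int(FS * distance_m / 343.0)            # propagation delay samples
    delayed = np.concatenate([np.zeros(delay), x])[: len(x)]
    tail = np.exp(-np.arange(int(0.05 * FS)) / (0.01 * FS))
    reverb = np.convolve(delayed, tail)[: len(x)] / tail.sum()
    wet = min(0.9, 0.3 * distance_m)                # assumed mixing law
    y = (1.0 - wet) * direct + wet * reverb
    # 3) High-frequency attenuation: one-pole low-pass whose cutoff falls
    #    with distance (8 kHz at 0 m, 4 kHz at 1 m, and so on).
    alpha = np.exp(-2.0 * np.pi * (8000.0 / (1.0 + distance_m)) / FS)
    out = np.empty_like(y)
    prev = 0.0
    for i, s in enumerate(y):
        prev = (1.0 - alpha) * s + alpha * prev     # simple IIR low-pass
        out[i] = prev
    return out
```

  With this sketch, audio tagged with position information 1 m comes out quieter, more reverberant, and duller than front-row audio, which is the qualitative behaviour the algorithm description above calls for.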
  • Step 303: Send the depth-processed first audio data to the peer end of the video communication, so that the peer end plays the depth-processed first audio data.
  • The audio processing apparatus sends the depth-processed first audio data to the peer end of the video communication, that is, the playing end of the first audio data, so that the playing end plays it.
  • It should be noted that when the video communication is controlled by an MCU, the depth processing of the audio data (step 302) may take place at the collecting end of the audio data or at the MCU end; when the video communication has no MCU control, it takes place at the collecting end.
  • When the MCU end performs the depth processing, the collecting end sends the audio data acquired in step 301 and the corresponding sound source position information to the MCU, which depth-processes the audio data and then sends it to the playing end.
  • However, when any two or more streams of microphone audio are sent to the peer end over one data stream, the depth processing can only be performed at the collecting end, and after the audio processing apparatus at the collecting end depth-processes the audio data, a mixing-and-switching step follows; that is, step 303 above can be replaced with the following step:
  • Step 303': The audio processing apparatus mixes and switches the depth-processed audio data, and then sends the switched-out one or two channels of data to the peer end of the video communication.
  • For example, when the system uses mono or two-channel encoding, the multiple depth-processed audio streams are mixed, one or two data signals are then switched out according to a preset strategy, and the switched-out signals are encoded and sent to the peer end. After the peer end receives and decodes the signals, playing them directly yields sound with a sense of depth.
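  • As a hedged illustration of this mixing-and-switching step for a mono system (the patent does not specify the "preset strategy"; the energy-based choice below is just one conceivable strategy):

```python
# Mix N depth-processed streams, or switch out the most active one, before
# mono encoding. Selection by short-term energy is an assumed strategy.
import numpy as np

def mix_and_switch(streams: list, strategy: str = "loudest") -> np.ndarray:
    n = min(len(s) for s in streams)
    clipped = [np.asarray(s[:n], dtype=float) for s in streams]
    if strategy == "mix":                    # plain mix-down of all streams
        return np.sum(clipped, axis=0) / len(clipped)
    # "loudest": switch out the most active stream by short-term energy
    return max(clipped, key=lambda s: float(np.mean(s ** 2)))
```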
  • Step 304: The peer end decodes the received audio data and then plays it.
  • The peer end first decodes the received audio data and then outputs the decoded audio data through the loudspeakers.
  • This embodiment is described below with reference to FIG. 4. The participants at the first end sit in two rows (front and rear): first front-row participant 1, second front-row participant 2, and third front-row participant 3 are picked up by the front row's first microphone M1, second microphone M2, and third microphone M3 respectively, and the front-row microphone data receives no depth processing; first rear-row participant 4, second rear-row participant 5, and third rear-row participant 6 are picked up by the rear row's fourth microphone M4, fifth microphone M5, and sixth microphone M6 respectively.
  • The distance between the front and rear rows is 1.5 m, and a uniform 1.5 m depth effect is added to the data collected by the rear-row microphones; when the front-row sound is played at the second end, it is perceived as coming from the loudspeakers themselves, while the rear-row sound is perceived as coming from 1.5 m behind the loudspeakers.
  • The loudspeakers are placed in one plane and may sit above, below, to the left of, or to the right of the video display device. For the first end, the first end is the local end and the second end is the peer end; for the second end, the second end is the local end and the first end is the peer end.
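  • Tying the earlier sketches together for this FIG. 4 setup (the helper `capture_frame` is hypothetical and stands in for whatever the pickup path delivers; it is not from the patent or any real API):

```python
# Rear-row microphone M4: its audio gets the uniform 1.5 m depth effect,
# while a front-row microphone (position information 0) passes through as-is.
frame = capture_frame("M4")                    # capture_frame is assumed
processed = depth_process(frame, source_position("M4"))   # 1.5 m depth effect
```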
  • In this embodiment of the present invention, the audio processing apparatus first acquires the audio data in the video communication and the corresponding sound source position information, then depth-processes the audio data according to that information into audio data with a matching sense of depth, and then sends the processed audio data to the peer end of the video communication for playback; the listener at the peer end can thus tell from the sound alone the front/rear position of the speaker in the video communication.
  • FIG. 5 is a flowchart of Embodiment 4 of the audio processing method in video communication according to the present invention.
  • In this embodiment, the audio data is depth-processed at the playing end of the audio data.
  • As shown in FIG. 5, the method includes:
  • Step 501: The collecting end of the audio data picks up, through different sound pickup devices, second audio data with different sound source position information in the video communication and the sound source position information corresponding to the second audio data.
  • The second audio data is the audio data of the peer end of the video communication in this embodiment.
  • In this embodiment, the collecting end of the audio data is the peer end, and the playing end of the audio data is the local end.
  • For how the pickup devices pick up the audio data, refer to the description of step 301 in the embodiment shown in FIG. 3.
  • After picking up the audio data, the pickup devices send it to the audio processing apparatus at the collecting end, which obtains the sound source position information corresponding to the audio data through the identifiers of the pickup devices.
  • Participants at different front/rear positions correspond to different pickup devices, so the sound source position information corresponding to the audio data picked up by a device can be derived from which device it is.
  • Step 502: Acquire the second audio data in the video communication and the sound source position information corresponding to the second audio data.
  • The audio processing apparatus at the collecting end encodes the acquired second audio data and sound source position information and sends them to the audio processing apparatus at the playing end, which acquires the second audio data and the sound source position information by decoding.
  • Step 503: The audio processing apparatus at the playing end performs depth processing on the second audio data according to the sound source position information.
  • Specifically, this step may include step a or step b.
  • Step a: The audio processing apparatus at the playing end performs depth processing on the second audio data through the depth-sense control algorithm according to the sound source position information.
  • For details, refer to step 302 in the embodiment shown in FIG. 3.
  • Step b: The audio processing apparatus at the playing end performs wave field synthesis processing on the second audio data according to the sound source position information, to form the depth-processed second audio data.
  • Wave Field Synthesis performs acoustic-wave synthesis using Huygens' principle.
  • The principle is that every point on a wave front can be regarded as a new wave source; these new sources have the same rate and wavelength as the original source, and after superposition they form a new wave front at the next instant.
  • Wave field synthesis can reproduce a sound field faithfully.
  • Based on acoustic wave theory, wave field synthesis can generate a wave front using a loudspeaker matrix formed by multiple loudspeakers placed in one plane, the wave front being the wave surface farthest from the source; each loudspeaker in the matrix is fed a signal, computed by the Rayleigh reconstruction integral, that corresponds to its position; each loudspeaker generates sound waves according to that signal; and the superposition of the waves generated by all the loudspeakers reconstructs an accurate replica of the original wave front below the overlap frequency.
  • The overlap frequency is determined by the spacing between the loudspeakers.
  • The sound field reconstructed with wave field synthesis preserves the temporal and spatial properties of the original sound field throughout the listening space.
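  • As an aside that is not stated in the patent text: a common rule of thumb for this overlap (spatial aliasing) frequency of an array with uniform loudspeaker spacing $\Delta x$ is

$$ f_{\text{alias}} \approx \frac{c}{2\,\Delta x}, \qquad \text{e.g. } \Delta x = 0.4\ \text{m},\ c = 343\ \text{m/s} \;\Rightarrow\; f_{\text{alias}} \approx 429\ \text{Hz}. $$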
  • The process by which the playing end's audio processing apparatus performs wave field synthesis on the second audio data according to the sound source position information may be as follows:
  • The playing end plays sound through a loudspeaker array; multiple loudspeakers (for example, six) may be used, placed below the video display device, the exact number being determined by the algorithm and the actual application scenario. According to the sound source position information of the audio data, the audio processing apparatus applies different computations to the audio data and outputs the results to the multiple loudspeakers, which play simultaneously; the sounds superpose, and the emitted sound forms a wave front that can virtualize the position of the original source, thereby recovering a sound source with a sense of depth.
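  • A minimal delay-and-attenuate sketch of such loudspeaker-array feeds follows; it is a common simplification of wave field synthesis driving functions, not the Rayleigh-integral computation the text mentions, and the array geometry and speaker count are assumptions:

```python
# Drive a horizontal line of loudspeakers so their superposed output
# approximates a wavefront from a virtual source src_depth_m BEHIND the array.
import numpy as np

C = 343.0    # speed of sound in m/s
FS = 48000   # sample rate in Hz (assumed)

def wfs_feeds(x: np.ndarray, src_depth_m: float, n_speakers: int = 6,
              spacing_m: float = 0.4) -> list:
    """One driving signal per speaker for a source behind the array centre."""
    positions = (np.arange(n_speakers) - (n_speakers - 1) / 2.0) * spacing_m
    feeds = []
    for sx in positions:
        d = np.hypot(sx, src_depth_m)       # virtual source -> speaker distance
        delay = int(round(FS * d / C))      # later arrival for farther speakers
        gain = 1.0 / max(d, spacing_m)      # rough 1/r amplitude weighting
        f = np.zeros(len(x) + delay)
        f[delay:] = gain * x
        feeds.append(f)
    return feeds
```

  Under these simplifications, feeding six speakers with `wfs_feeds(x, 1.5)` would make the superposed sound appear to originate about 1.5 m behind the array, matching the rear-row depth used in the FIG. 4 example.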
  • Step 504: Play the depth-processed second audio data.
  • After the playing end's audio processing apparatus depth-processes the second audio data through the depth-sense control algorithm, it plays the processed second audio data; alternatively, when the second audio data is depth-processed through wave field synthesis, what the loudspeakers play is the processed second audio data.
  • In this embodiment of the present invention, the audio processing apparatus first acquires the peer end's audio data in the video communication and the corresponding sound source position information, then depth-processes the audio data according to that information into audio data with a matching sense of depth, and then plays the processed audio data; the listener can thus tell from the sound alone the front/rear position of the speaker in the video communication.
  • A person of ordinary skill in the art may understand that all or part of the steps of the foregoing method embodiments may be implemented by hardware related to program instructions; the foregoing program may be stored in a computer-readable storage medium and, when executed, performs the steps of the foregoing method embodiments.
  • The foregoing storage medium includes any medium capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
  • FIG. 6 is a schematic diagram of Embodiment 1 of the audio processing apparatus in video communication according to the present invention.
  • As shown in FIG. 6, the audio processing apparatus includes a first acquiring module 61 and a processing module 63.
  • The first acquiring module 61 is configured to acquire audio data in the video communication and sound source position information corresponding to the audio data.
  • The processing module 63 is configured to perform depth processing on the audio data according to the sound source position information acquired by the first acquiring module 61.
  • The audio processing apparatus in this embodiment may be a device at the audio data collecting end, a device at the audio data playing end, or a device at the MCU end of the video communication.
  • In this embodiment of the present invention, the first acquiring module first acquires the audio data in the video communication and the corresponding sound source position information, and the processing module then depth-processes the audio data according to that information so that it carries a matching sense of depth; sounds emitted by objects at different front/rear positions in the video communication can thereby be distinguished.
  • FIG. 7 is a schematic diagram of Embodiment 2 of the audio processing apparatus in video communication according to the present invention.
  • On the basis of the embodiment shown in FIG. 6, the processing module 63 may specifically include a first processing unit 631 and/or a second processing unit 633; further, the audio processing apparatus may also include a first sending module 65 and a playing module 67.
  • The first processing unit 631 is configured to, when the audio data is the first audio data of the local end or the sending end of the video communication, perform depth processing on the first audio data through the depth-sense control algorithm according to the sound source position information acquired by the first acquiring module.
  • When the audio processing apparatus is located at the MCU end, the two parties of the video communication may be called the sending end and the receiving end according to the direction of the data stream.
  • The second processing unit 633 is configured to, when the audio data is the second audio data of the peer end of the video communication, perform depth processing on the second audio data through the depth-sense control algorithm according to the sound source position information, or configured to, when the audio data is the second audio data of the peer end, perform wave field synthesis processing on the second audio data according to the sound source position information to form the depth-processed second audio data.
  • The first sending module 65 is configured to send the first audio data depth-processed by the first processing unit 631 to the peer end of the video communication, so that the peer end plays the depth-processed first audio data.
  • When the audio processing apparatus is located at the playing end of the audio data, that is, when the depth processing is performed at the playing end, the apparatus further includes a playing module 67, configured to play the second audio data depth-processed by the second processing unit 633.
  • When the audio processing apparatus is located at the collecting end of the audio data, that is, when the depth processing is performed at the collecting end, the first acquiring module 61 may specifically be configured to acquire, through different sound pickup devices, audio data with different sound source position information in the video communication and the sound source position information corresponding to the audio data.
  • When the audio processing apparatus is located at the MCU end, that is, when the depth processing is performed at the MCU end, the first acquiring module 61 may specifically be configured to receive the audio data sent by the collecting end of the audio data and the sound source position information corresponding to the audio data.
  • When the audio processing apparatus is located at the playing end of the audio data, that is, when the depth processing is performed at the playing end, the first acquiring module 61 may specifically be configured to receive the audio data sent by the collecting end of the audio data and the sound source position information corresponding to the audio data.
  • In this embodiment of the present invention, the audio processing apparatus first acquires the audio data in the video communication and the corresponding sound source position information, then depth-processes the audio data according to that information into audio data with a matching sense of depth, and then sends the processed audio data to the peer end of the video communication for playback; the listener at the peer end can thus tell from the sound alone the front/rear position of the speaker in the video communication.
  • FIG. 8 is a schematic diagram of Embodiment 3 of the audio processing apparatus in video communication according to the present invention.
  • As shown in FIG. 8, the audio processing apparatus includes a second acquiring module 81 and a second sending module 83.
  • The audio processing apparatus provided by this embodiment can be applied in the following scenario: in the video communication, the depth processing of the audio data is performed by the playing end of the audio data or by the MCU end of the video communication, and the audio processing apparatus provided by this embodiment sits at the collecting end of the audio data.
  • The second acquiring module 81 is configured to acquire audio data of the local end of the video communication and sound source position information corresponding to the audio data.
  • The second sending module 83 is configured to send the audio data and the sound source position information acquired by the second acquiring module 81 to a video communication control unit or the peer end of the video communication, so that the video communication control unit or the peer end performs depth processing on the audio data according to the sound source position information.
  • In this embodiment of the present invention, the audio processing apparatus acquires the local end's audio data and the corresponding sound source position information in the video communication, then sends them out, so that the MCU or the peer end of the video communication depth-processes the audio data according to that information and obtains audio data with a matching sense of depth; the listener can thus distinguish sounds emitted by objects at different front/rear positions in the video communication.

Description

Audio processing method and apparatus in video communication

This application claims priority to Chinese Patent Application No. 201010561696.7, filed with the Chinese Patent Office on November 26, 2010 and entitled "Audio processing method and apparatus in video communication", which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

Embodiments of the present invention relate to the field of communications technologies, and in particular, to an audio processing method and apparatus in video communication.

BACKGROUND

The video conferencing service is a means of multimedia communication: by holding a conference using television equipment and a communication network, images, voice, and data can be exchanged between two or more sites simultaneously. A video conferencing service is generally composed of video terminal equipment, a transmission network, and multipoint control units (hereinafter: MCUs). Video terminal equipment mainly includes video input/output devices, audio input/output devices, video codecs, audio codecs, information communication devices, and multiplexing/signal-splitting devices. The basic function of a video terminal device is to compress and encode the image signal captured by the local camera and the sound signal picked up by the microphone, then send them to the transmission network for delivery to the remote conference site; at the same time, it receives the digital signal sent from the remote site and decodes it, restoring analog image and sound signals.

The video conferencing service has realized long-distance audio and video communication. With continuous technical progress and development, telepresence systems have appeared that give remote communication a face-to-face effect. Current telepresence systems use video conferencing technology to transmit images and sound remotely and cooperate with overall peripherals, for example using large LCD televisions to achieve "life size" rendering and using certain camera processing techniques to achieve "eye-to-eye" contact during communication; combined with an overall conference room decoration scheme, this makes the remote presence appear realistic.

Although today's telepresence systems can achieve a fairly realistic effect, for the newly emerged two-row or multi-row telepresence the front and rear rows are separated by a certain distance, while existing telepresence systems can only make the image correspond to the sound direction within a single plane; that is, the sounds of both the front and rear rows are emitted from the same plane. Without looking at the image, one cannot tell whether a sound comes from the front row or the rear row, which makes the sound's sense of presence unrealistic.

SUMMARY
Embodiments of the present invention provide an audio processing method and apparatus in video communication, so that sounds from different rows in multi-row video communication can be distinguished.

An embodiment of the present invention provides an audio processing method in video communication, including: acquiring audio data in video communication and sound source position information corresponding to the audio data; and performing depth processing on the audio data according to the sound source position information.

An embodiment of the present invention further provides an audio processing method in video communication, including: acquiring audio data of the local end of the video communication and sound source position information corresponding to the audio data; and sending the audio data and the sound source position information to a video communication control unit or a peer end of the video communication, so that the video communication control unit or the peer end of the video communication performs depth processing on the audio data according to the sound source position information.

An embodiment of the present invention provides an audio processing apparatus in video communication, including: a first acquiring module, configured to acquire audio data in video communication and sound source position information corresponding to the audio data; and a processing module, configured to perform depth processing on the audio data according to the sound source position information acquired by the first acquiring module.

An embodiment of the present invention further provides an audio processing apparatus in video communication, including: a second acquiring module, configured to acquire audio data of the local end of the video communication and sound source position information corresponding to the audio data; and a second sending module, configured to send the audio data and the sound source position information acquired by the second acquiring module to a video communication control unit or a peer end of the video communication, so that the video communication control unit or the peer end of the video communication performs depth processing on the audio data according to the sound source position information.

With the audio processing method and apparatus in video communication according to the embodiments of the present invention, audio data in the video communication and the corresponding sound source position information are first acquired, and the audio data is then depth-processed according to the acquired sound source position information so that it carries a sense of depth matching that information; sounds emitted by objects at different front/rear positions in the video communication can thereby be distinguished.

BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

FIG. 1 is a flowchart of Embodiment 1 of an audio processing method in video communication according to the present invention;

FIG. 2 is a flowchart of Embodiment 2 of the audio processing method in video communication according to the present invention;

FIG. 3 is a flowchart of Embodiment 3 of the audio processing method in video communication according to the present invention;

FIG. 4 is a schematic diagram of the embodiment shown in FIG. 3;

FIG. 5 is a flowchart of Embodiment 4 of the audio processing method in video communication according to the present invention;

FIG. 6 is a schematic diagram of Embodiment 1 of an audio processing apparatus in video communication according to the present invention;

FIG. 7 is a schematic diagram of Embodiment 2 of the audio processing apparatus in video communication according to the present invention;

FIG. 8 is a schematic diagram of Embodiment 3 of the audio processing apparatus in video communication according to the present invention.

DETAILED DESCRIPTION

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
FIG. 1 is a flowchart of Embodiment 1 of the audio processing method in video communication according to the present invention. As shown in FIG. 1, the method includes:

Step 101: Acquire audio data in the video communication and sound source position information corresponding to the audio data.

The method embodiments can be applied to video communication with multiple sound sources at different front/rear positions, for example two-row or multi-row video conferences, or 3D video conferences. The following takes a multi-row video conference as an example; the video conference may be a two-party or multi-party session, and other scenarios may refer to the description of this scenario.

The audio processing apparatus acquires the audio data of the current speaker in the video communication and the sound source position information corresponding to the audio data. The sound source position information is the position of the object corresponding to the audio data relative to the first row in the video communication; in other words, it is the distance of the current speaker from the first row, and when the speaker sits in the first row the sound source position information is 0.

The audio data acquired by the audio processing apparatus in this embodiment may be audio data of the local end, in which case the apparatus is a device at the audio data collecting end; it may be audio data of the peer end, in which case the apparatus is a device at the audio data playing end; in addition, the apparatus may be a device at the MCU end of the video communication, configured to acquire audio data from the collecting end, process it, and send it to the playing end.

When the audio data acquired by the audio processing apparatus is the local end's audio data: the apparatus can acquire the local end's current audio data and the corresponding sound source position information through the local sound pickup device (for example, a microphone).

When the audio data acquired by the audio processing apparatus is the peer end's audio data: the apparatus acquires the peer end's audio data and sound source position information by receiving them from the peer end (the collecting end of the audio data); the audio data and sound source position information sent by the peer end were acquired by the peer end's audio processing apparatus through its pickup devices.

When the audio processing apparatus is a device at the MCU end: the apparatus receives, from one end of the video communication, that end's audio data and sound source position information, thereby acquiring them.

Step 102: Perform depth processing on the audio data according to the sound source position information.

The sense of depth is the human ear's perception of distance and depth. Distance is the ear's perception of how far away a particular sound source is; depth describes the perception of the front-to-back extent of the entire sound scene. A sense of depth means that sounds coming from the front carry a layered sense of near and far. When a sound has a sense of depth, the ear can judge from the sound alone how far away its source is; that is, the user can tell from the sound whether the speaker sits in a front or rear row of the video communication.

After acquiring the audio data and the sound source position information, the audio processing apparatus performs depth processing on the audio data according to the sound source position information, so that the audio data carries a sense of depth corresponding to that information; the user can then tell, from the sound alone, the front/rear position in the video of the object the audio data corresponds to.

After the depth processing, if the audio processing apparatus is located at the collecting end of the audio data or at the MCU end, it sends the depth-processed audio data to the playing end so that the peer end plays it; if the apparatus is located at the playing end, it plays the audio data directly.

The depth processing of the audio data may include: (1) using a depth-sense control algorithm to control parameters such as the loudness of the sound, the energy ratio of direct sound to reverberant sound, and the amount of high-frequency attenuation; or (2) processing the sound with Wave Field Synthesis so that the processed sound carries a sense of depth.

It should be noted that when the depth-sense control algorithm is used, the processing may be performed at the collecting end or at the playing end of the audio data, whereas wave field synthesis processing is performed at the playing end of the audio data.

In this embodiment of the present invention, the audio processing apparatus first acquires the audio data in the video communication and the corresponding sound source position information, then depth-processes the audio data according to that information so that it carries a matching sense of depth; sounds emitted by objects at different front/rear positions in the video communication can thereby be distinguished.
FIG. 2 is a flowchart of Embodiment 2 of the audio processing method in video communication according to the present invention. As shown in FIG. 2, the method includes:

Step 201: Acquire audio data of the local end of the video communication and sound source position information corresponding to the audio data.

The scenario to which this embodiment applies is one in which the depth processing of the audio data is performed by the playing end of the audio data or by the MCU end of the video communication; the executing body of this embodiment is the collecting end of the audio data.

The audio processing apparatus at the local end of the video communication acquires the local end's current audio data through a sound pickup device, and acquires the sound source position information corresponding to the audio data through the identifier of that pickup device. Participants at different front/rear positions correspond to different pickup devices, so the sound source position information corresponding to the audio data picked up by a device can be derived from which device it is.

Step 202: Send the audio data and the sound source position information to a video communication control unit or the peer end of the video communication, so that the video communication control unit or the peer end performs depth processing on the audio data according to the sound source position information.

The audio processing apparatus sends the acquired local audio data and sound source position information to a video communication control unit, for example an MCU, which then performs depth processing on the audio data according to the sound source position information; alternatively, the apparatus sends the acquired local audio data and sound source position information to the peer end of the video communication, which then performs the depth processing.

In this embodiment of the present invention, the audio processing apparatus acquires the local end's audio data and the corresponding sound source position information in the video communication and sends them out, so that the MCU or the peer end of the video communication performs depth processing on the audio data according to the sound source position information and obtains audio data with a matching sense of depth; the listener can thus distinguish sounds emitted by objects at different front/rear positions in the video communication.

As the descriptions of the above embodiments show, in the solutions provided by the embodiments of the present invention, the depth processing of the audio data may be performed at the collecting end of the audio data, at the MCU end, or at the playing end of the audio data. The embodiments of the present invention are described in detail below according to where the depth processing is performed.
FIG. 3 is a flowchart of Embodiment 3 of the audio processing method in video communication according to the present invention, and FIG. 4 is a schematic diagram of the embodiment shown in FIG. 3. In this embodiment, the audio data is depth-processed at the collecting end of the audio data. As shown in FIG. 3, the method includes:

Step 301: Acquire, through different sound pickup devices, first audio data with different sound source position information in the video communication and the sound source position information corresponding to the first audio data.

The first audio data is the audio data of the local end of the video communication in this embodiment.

At the local end of the video communication (that is, the collecting end of the audio data), the local audio data can be picked up by pickup devices, which may be microphones. The microphones may pick up the audio data in various ways, as long as the audio data of participants in different rows can be identified. For example, each participant at the local end uses one microphone, and the audio data picked up by each microphone corresponds to sound source position information, where the audio data of participants in the same row corresponds to the same sound source position information; or each row of participants shares one or several microphones, and the audio data picked up by each row's microphones corresponds to the same sound source position information. Whether each row of participants can share one or several microphones depends on the directivity and sensitivity of the microphones.

After a microphone picks up the local end's first audio data, pre-processing such as echo cancellation and noise suppression may be applied to it. The audio processing apparatus acquires the pre-processed first audio data and, from the microphone the first audio data corresponds to, obtains the sound source position information corresponding to the first audio data; microphones of different rows correspond to different sound source position information.

Step 302: Perform depth processing on the first audio data through a depth-sense control algorithm according to the sound source position information.

The sense of depth is mainly related to two factors, the loudness of the sound and the energy ratio of direct sound to reverberant sound, and is also related to the amount of high-frequency attenuation.

Loudness is simply the volume. Sound intensity follows an inverse-square law with distance: each doubling of the distance attenuates the level by about 6 dB, so a distant sound is strongly attenuated and the volume reaching the ear is small. Accordingly, from the distance between the front and rear rows, the attenuation of the sound can be calculated, so that the front-row and rear-row sounds are played at different volumes.

To convey a sense of depth, the energy ratio of direct sound to reverberant sound must also be adjusted; this can be done by controlling the delay and reverberation of the front-row and rear-row sounds. Adding delayed reverberation can create a virtual sense of space: when there is more direct sound, the listener feels the sound comes from nearby; when there is more reverberation, a more pronounced sense of space forms and the listener feels the sound comes from far away.

In addition, because high-frequency sound waves have shorter wavelengths than low-frequency ones, they are attenuated more than low-frequency waves when they meet obstacles in a room; the amount of high-frequency attenuation is therefore also a factor affecting the sense of depth.

The depth-sense control algorithm may be as follows: first obtain the parameters of the room environment, such as the room size and the room reverberation time; then compute the system transfer function from those parameters; then, by controlling three factors of that transfer function, namely the loudness, the energies of direct and reverberant sound, and the amount of high-frequency attenuation, the depth effect of a sound can be controlled so that it matches the sound's source position. For example, the listener can tell whether the sound comes from 1 m away or from 2 m away.

The depth-sense control algorithm may be pre-configured in the audio processing apparatus, so that whenever the apparatus acquires audio data and its sound source position information, it can depth-process the audio data according to that information and the algorithm. For example, when the sound source position information is 1 m, the algorithm can adjust the loudness, the delay-and-reverberation ratio, and the high-frequency attenuation to produce audio data with a 1 m depth effect; when the sound source position information is 0, the source sits in the front row, so the sound from that source needs no depth processing.

Step 303: Send the depth-processed first audio data to the peer end of the video communication, so that the peer end plays the depth-processed first audio data.

The audio processing apparatus sends the depth-processed first audio data to the peer end of the video communication, that is, the playing end of the first audio data, so that the playing end plays it.

It should be noted that when the video communication is controlled by an MCU, the depth processing of the audio data by the audio processing apparatus of this embodiment (step 302) may take place at the collecting end of the audio data or at the MCU end; when the video communication has no MCU control, it takes place at the collecting end. When the MCU end performs the depth processing, the collecting end sends the audio data acquired in step 301 and the corresponding sound source position information to the MCU end, which depth-processes the audio data and then sends the depth-processed audio data to the playing end.

However, when any two or more streams of microphone audio are sent to the peer end over one data stream, the depth processing described above can only be performed at the collecting end, and after the audio processing apparatus at the collecting end depth-processes the audio data, a mixing-and-switching step follows; that is, step 303 above may be replaced by the following step:

Step 303': The audio processing apparatus mixes and switches the depth-processed audio data, and then sends the switched-out one or two channels of data to the peer end of the video communication.

For example, when the system uses mono or two-channel encoding, the multiple depth-processed audio streams are mixed, one or two data signals are then switched out according to a preset strategy, and the switched-out signals are encoded and sent to the peer end. After the peer end receives and decodes the signals, playing them directly yields sound with a sense of depth.

Step 304: The peer end decodes the received audio data and then plays it.

The peer end first decodes the received audio data and then outputs the decoded audio data through loudspeakers.

This embodiment is described below with reference to FIG. 4. As shown in FIG. 4, the participants at the first end sit in two rows (front and rear). First front-row participant 1, second front-row participant 2, and third front-row participant 3 are picked up by the front row's first microphone M1, second microphone M2, and third microphone M3 respectively, and the front-row microphone data receives no depth processing; first rear-row participant 4, second rear-row participant 5, and third rear-row participant 6 are picked up by the rear row's fourth microphone M4, fifth microphone M5, and sixth microphone M6 respectively. The distance between the front and rear rows is 1.5 m, and a uniform 1.5 m depth effect is added to the data collected by the rear-row microphones. As a result, when the sound picked up by the front-row microphones is played at the second end, the second end's participants perceive it as coming from the loudspeakers themselves, while the sound picked up by the rear-row microphones is perceived as coming from 1.5 m behind the loudspeakers. The loudspeakers are placed in one plane and may sit above, below, to the left of, or to the right of the video display device. For the first end, the first end is the local end and the second end is the peer end; for the second end, the second end is the local end and the first end is the peer end.

In this embodiment of the present invention, the audio processing apparatus first acquires the audio data in the video communication and the corresponding sound source position information, then depth-processes the audio data according to that information into audio data with a matching sense of depth, and then sends the processed audio data to the peer end of the video communication for playback; the listener at the peer end can thus tell from the sound alone the front/rear position of the speaker in the video communication.
FIG. 5 is a flowchart of Embodiment 4 of the audio processing method in video communication according to the present invention. In this embodiment, the audio data is depth-processed at the playing end of the audio data. As shown in FIG. 5, the method includes:

Step 501: The collecting end of the audio data picks up, through different sound pickup devices, second audio data with different sound source position information in the video communication and the sound source position information corresponding to the second audio data.

The second audio data is the audio data of the peer end of the video communication in this embodiment; in this embodiment, the collecting end of the audio data is the peer end and the playing end of the audio data is the local end.

For how the pickup devices pick up the audio data, refer to the description of step 301 in the embodiment shown in FIG. 3.

After picking up the audio data, the pickup devices send it to the audio processing apparatus at the collecting end, and that apparatus obtains the sound source position information corresponding to the audio data through the identifiers of the pickup devices. Participants at different front/rear positions correspond to different pickup devices, so the sound source position information corresponding to the audio data picked up by a device can be derived from which device it is.

Step 502: Acquire the second audio data in the video communication and the sound source position information corresponding to the second audio data.

The audio processing apparatus at the collecting end encodes the acquired second audio data and sound source position information and sends them to the audio processing apparatus at the playing end, which acquires the second audio data and the sound source position information by decoding.

Step 503: The audio processing apparatus at the playing end performs depth processing on the second audio data according to the sound source position information.

Specifically, this step may include step a or step b.

Step a: The audio processing apparatus at the playing end performs depth processing on the second audio data through the depth-sense control algorithm according to the sound source position information.

For details, refer to step 302 in the embodiment shown in FIG. 3.

Step b: The audio processing apparatus at the playing end performs wave field synthesis processing on the second audio data according to the sound source position information, to form the depth-processed second audio data.

Wave Field Synthesis performs acoustic-wave synthesis using Huygens' principle. The principle behind wave field synthesis is that every point on a wave front can be regarded as a new wave source; these new sources have the same rate and wavelength as the original source, and after superposition they form a new wave front at the next instant. Wave field synthesis can reproduce a sound field faithfully.

Based on acoustic wave theory, wave field synthesis can generate a wave front using a loudspeaker matrix formed by multiple loudspeakers placed in one plane, the wave front being the wave surface farthest from the source. Each loudspeaker in the matrix is fed a signal, computed by the Rayleigh reconstruction integral, that corresponds to its position; each loudspeaker generates sound waves according to that signal; and the superposition of the waves generated by all the loudspeakers reconstructs an accurate replica of the original wave front below the overlap frequency. The overlap frequency is determined by the spacing between the loudspeakers. The sound field reconstructed with wave field synthesis preserves the temporal and spatial properties of the original sound field throughout the listening space.

The process by which the playing end's audio processing apparatus performs wave field synthesis on the second audio data according to the sound source position information may be as follows:

The playing end plays sound through a loudspeaker array; multiple loudspeakers (for example, six) may be used, placed below the video display device, the exact number being determined by the algorithm and the actual application scenario. According to the sound source position information of the audio data, the audio processing apparatus applies different computations to the audio data and outputs the results to the multiple loudspeakers, which play simultaneously; the sounds superpose, and the emitted sound forms a wave front that can virtualize the position of the original source, thereby recovering a sound source with a sense of depth.

Step 504: Play the depth-processed second audio data.

After the playing end's audio processing apparatus depth-processes the second audio data through the depth-sense control algorithm, it plays the processed second audio data; alternatively, when the second audio data is depth-processed through wave field synthesis, what the loudspeakers play is the processed second audio data.

In this embodiment of the present invention, the audio processing apparatus first acquires the peer end's audio data in the video communication and the corresponding sound source position information, then depth-processes the audio data according to that information into audio data with a matching sense of depth, and then plays the processed audio data; the listener can thus tell from the sound alone the front/rear position of the speaker in the video communication.
A person of ordinary skill in the art may understand that all or part of the steps of the foregoing method embodiments may be implemented by hardware related to program instructions. The foregoing program may be stored in a computer-readable storage medium and, when executed, performs the steps of the foregoing method embodiments. The foregoing storage medium includes any medium capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
FIG. 6 is a schematic diagram of Embodiment 1 of the audio processing apparatus in video communication according to the present invention. As shown in FIG. 6, the audio processing apparatus includes a first acquiring module 61 and a processing module 63.

The first acquiring module 61 is configured to acquire audio data in the video communication and sound source position information corresponding to the audio data.

The processing module 63 is configured to perform depth processing on the audio data according to the sound source position information acquired by the first acquiring module 61.

The audio processing apparatus in this embodiment may be a device at the audio data collecting end, a device at the audio data playing end, or a device at the MCU end of the video communication. For the working procedure and principle of each module in this embodiment, refer to the description of method Embodiment 1 above; details are not repeated here.

In this embodiment of the present invention, the first acquiring module first acquires the audio data in the video communication and the corresponding sound source position information, and the processing module then depth-processes the audio data according to that information so that it carries a matching sense of depth; sounds emitted by objects at different front/rear positions in the video communication can thereby be distinguished.

FIG. 7 is a schematic diagram of Embodiment 2 of the audio processing apparatus in video communication according to the present invention. On the basis of the embodiment shown in FIG. 6, as shown in FIG. 7, the processing module 63 may specifically include a first processing unit 631 and/or a second processing unit 633; further, the audio processing apparatus may also include a first sending module 65 and a playing module 67.

The first processing unit 631 is configured to, when the audio data is the first audio data of the local end or the sending end of the video communication, perform depth processing on the first audio data through the depth-sense control algorithm according to the sound source position information acquired by the first acquiring module. When the audio processing apparatus is located at the MCU end, the two parties of the video communication may be called the sending end and the receiving end according to the direction of the data stream.

The second processing unit 633 is configured to, when the audio data is the second audio data of the peer end of the video communication, perform depth processing on the second audio data through the depth-sense control algorithm according to the sound source position information, or configured to, when the audio data is the second audio data of the peer end of the video communication, perform wave field synthesis processing on the second audio data according to the sound source position information to form the depth-processed second audio data.

The first sending module 65 is configured to send the first audio data depth-processed by the first processing unit 631 to the peer end of the video communication, so that the peer end plays the depth-processed first audio data.

When the audio processing apparatus is located at the playing end of the audio data, that is, when the depth processing is performed at the playing end, the apparatus further includes a playing module 67, configured to play the second audio data depth-processed by the second processing unit 633.

Specifically:

When the audio processing apparatus is located at the collecting end of the audio data, that is, when the depth processing is performed at the collecting end, the first acquiring module 61 may specifically be configured to acquire, through different sound pickup devices, audio data with different sound source position information in the video communication and the sound source position information corresponding to the audio data.

When the audio processing apparatus is located at the MCU end, that is, when the depth processing is performed at the MCU end, the first acquiring module 61 may specifically be configured to receive the audio data sent by the collecting end of the audio data and the sound source position information corresponding to the audio data.

When the audio processing apparatus is located at the playing end of the audio data, that is, when the depth processing is performed at the playing end, the first acquiring module 61 may specifically be configured to receive the audio data sent by the collecting end of the audio data and the sound source position information corresponding to the audio data.

For the working procedure and principle of each module in this embodiment, refer to the descriptions of the method embodiments above; details are not repeated here.

In this embodiment of the present invention, the audio processing apparatus first acquires the audio data in the video communication and the corresponding sound source position information, then depth-processes the audio data according to that information into audio data with a matching sense of depth, and then sends the processed audio data to the peer end of the video communication for playback; the listener at the peer end can thus tell from the sound alone the front/rear position of the speaker in the video communication.

FIG. 8 is a schematic diagram of Embodiment 3 of the audio processing apparatus in video communication according to the present invention. As shown in FIG. 8, the audio processing apparatus includes a second acquiring module 81 and a second sending module 83.

The audio processing apparatus provided by this embodiment can be applied in the following scenario: in the video communication, the depth processing of the audio data is performed by the playing end of the audio data or by the MCU end of the video communication, and the audio processing apparatus provided by this embodiment is placed at the collecting end of the audio data.

The second acquiring module 81 is configured to acquire audio data of the local end of the video communication and sound source position information corresponding to the audio data.

The second sending module 83 is configured to send the audio data and the sound source position information acquired by the second acquiring module 81 to a video communication control unit or the peer end of the video communication, so that the video communication control unit or the peer end performs depth processing on the audio data according to the sound source position information.

For the working procedure and principle of each module in this embodiment, refer to the description of method Embodiment 2 above; details are not repeated here.

In this embodiment of the present invention, the audio processing apparatus acquires the local end's audio data and the corresponding sound source position information in the video communication, then sends them out, so that the MCU or the peer end of the video communication performs depth processing on the audio data according to the sound source position information and obtains audio data with a matching sense of depth; the listener can thus distinguish sounds emitted by objects at different front/rear positions in the video communication.

Finally, it should be noted that the above embodiments are merely intended to describe the technical solutions of the present invention rather than to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements may be made to some of their technical features, without departing from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. An audio processing method in video communication, comprising:
acquiring audio data in video communication and sound source position information corresponding to the audio data; and
performing depth processing on the audio data according to the sound source position information.
2. The audio processing method in video communication according to claim 1, wherein the sound source position information is relative position information between an object corresponding to the audio data and a first row in the video communication.
3. The audio processing method in video communication according to claim 1 or 2, wherein the audio data is first audio data of a local end or a sending end of the video communication, and the performing depth processing on the audio data according to the sound source position information comprises:
performing depth processing on the first audio data through a depth-sense control algorithm according to the sound source position information.
4. The audio processing method in video communication according to claim 3, further comprising, after the performing depth processing on the audio data according to the sound source position information:
sending the depth-processed first audio data to a peer end of the video communication, so that the peer end plays the depth-processed first audio data.
5. The audio processing method in video communication according to claim 3, wherein the acquiring audio data in video communication and sound source position information corresponding to the audio data comprises:
acquiring, through different sound pickup devices, first audio data with different sound source position information in the video communication and the sound source position information corresponding to the first audio data.
6. The audio processing method in video communication according to claim 3, wherein the depth-sense control algorithm comprises: performing depth processing on the audio data by adjusting the loudness of the sound, the energy ratio of the direct sound to the reverberant sound, and the amount of high-frequency attenuation of the sound.
7. The audio processing method in video communication according to claim 1 or 2, wherein the audio data is second audio data of a peer end of the video communication, and the performing depth processing on the audio data according to the sound source position information comprises: performing depth processing on the second audio data through a depth-sense control algorithm according to the sound source position information; or
performing wave field synthesis processing on the second audio data according to the sound source position information, to form the depth-processed second audio data.
8. The audio processing method in video communication according to claim 7, further comprising, after the performing depth processing on the audio data according to the sound source position information:
playing the depth-processed second audio data.
9. An audio processing method in video communication, comprising:
acquiring audio data of a local end of the video communication and sound source position information corresponding to the audio data; and sending the audio data and the sound source position information to a video communication control unit or a peer end of the video communication, so that the video communication control unit or the peer end of the video communication performs depth processing on the audio data according to the sound source position information.
10. An audio processing apparatus in video communication, comprising:
a first acquiring module, configured to acquire audio data in video communication and sound source position information corresponding to the audio data; and
a processing module, configured to perform depth processing on the audio data according to the sound source position information acquired by the first acquiring module.
11. The audio processing apparatus in video communication according to claim 10, wherein the processing module comprises:
a first processing unit, configured to, when the audio data is first audio data of a local end or a sending end of the video communication, perform depth processing on the first audio data through a depth-sense control algorithm according to the sound source position information acquired by the first acquiring module.
12. The audio processing apparatus in video communication according to claim 10 or 11, wherein the processing module comprises:
a second processing unit, configured to, when the audio data is second audio data of a peer end of the video communication, perform depth processing on the second audio data through a depth-sense control algorithm according to the sound source position information, or configured to, when the audio data is second audio data of a peer end of the video communication, perform wave field synthesis processing on the second audio data according to the sound source position information, to form the depth-processed second audio data.
13. The audio processing apparatus in video communication according to claim 11, further comprising:
a first sending module, configured to send the first audio data depth-processed by the processing module to a peer end of the video communication, so that the peer end plays the depth-processed first audio data.
14. The audio processing apparatus in video communication according to claim 12, further comprising:
a playing module, configured to play the second audio data depth-processed by the processing module.
15. An audio processing apparatus in video communication, comprising:
a second acquiring module, configured to acquire audio data of a local end of the video communication and sound source position information corresponding to the audio data; and a second sending module, configured to send the audio data and the sound source position information acquired by the second acquiring module to a video communication control unit or a peer end of the video communication, so that the video communication control unit or the peer end of the video communication performs depth processing on the audio data according to the sound source position information.
PCT/CN2011/082127 2010-11-26 2011-11-14 Audio processing method and apparatus in video communication WO2012068960A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP11843560.1A EP2566194A4 (en) 2010-11-26 2011-11-14 METHOD AND DEVICE FOR AUDIO PROCESSING IN VIDEO COMMUNICATION
US13/693,823 US9113034B2 (en) 2010-11-26 2012-12-04 Method and apparatus for processing audio in video communication

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201010561696.7A 2010-11-26 2011-11-14 Audio processing method and apparatus in video communication
CN201010561696.7 2010-11-26

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/693,823 Continuation US9113034B2 (en) 2010-11-26 2012-12-04 Method and apparatus for processing audio in video communication

Publications (1)

Publication Number Publication Date
WO2012068960A1 true WO2012068960A1 (zh) 2012-05-31

Family

ID=46093118

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/082127 WO2012068960A1 (zh) Audio processing method and apparatus in video communication

Country Status (4)

Country Link
US (1) US9113034B2 (zh)
EP (1) EP2566194A4 (zh)
CN (1) CN102480671B (zh)
WO (1) WO2012068960A1 (zh)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2680615B1 (en) * 2012-06-25 2018-08-08 LG Electronics Inc. Mobile terminal and audio zooming method thereof
US10203839B2 (en) * 2012-12-27 2019-02-12 Avaya Inc. Three-dimensional generalized space
CN104375811B (zh) * 2013-08-13 2019-04-26 腾讯科技(深圳)有限公司 Sound effect processing method and apparatus
WO2015058799A1 (en) * 2013-10-24 2015-04-30 Telefonaktiebolaget L M Ericsson (Publ) Arrangements and method thereof for video retargeting for video conferencing
CN104036789B (zh) * 2014-01-03 2018-02-02 北京智谷睿拓技术服务有限公司 Multimedia processing method and multimedia apparatus
CN104731059A (zh) * 2015-01-29 2015-06-24 曾戊忠 Stereophonic user interface
CN104867359B (zh) * 2015-06-02 2017-04-19 阔地教育科技有限公司 Audio processing method and system in a live recording and broadcasting system
CN105263093B (zh) * 2015-10-12 2018-06-26 深圳东方酷音信息技术有限公司 Omnidirectional sound collection apparatus, editing apparatus and system
WO2017124225A1 (zh) * 2016-01-18 2017-07-27 王晓光 Person tracking method and system for video network conferencing
CN105761721A (zh) * 2016-03-16 2016-07-13 广东佳禾声学科技有限公司 Speech coding method carrying position information
CN106774930A (zh) * 2016-12-30 2017-05-31 中兴通讯股份有限公司 Data processing method and apparatus, and collection device
DE102018100895A1 (de) * 2018-01-16 2019-07-18 Zoe Life Technologies Holding AG Currency units for knowledge
CN108922538B (zh) * 2018-05-29 2023-04-07 平安科技(深圳)有限公司 Conference information recording method and apparatus, computer device, and storage medium
CN108777832B (zh) * 2018-06-13 2021-02-09 上海艺瓣文化传播有限公司 Real-time 3D sound field construction and mixing system based on video object tracking
CN109194999B (zh) * 2018-09-07 2021-07-09 深圳创维-Rgb电子有限公司 Method, apparatus, device and medium for co-locating sound and image
CN109327795B (zh) * 2018-11-13 2021-09-14 Oppo广东移动通信有限公司 Sound effect processing method and related product
CN109660911A (zh) * 2018-11-27 2019-04-19 Oppo广东移动通信有限公司 Recording sound effect processing method and apparatus, mobile terminal, and storage medium
CN110035250A (zh) * 2019-03-29 2019-07-19 维沃移动通信有限公司 Audio processing method, processing device, terminal, and computer-readable storage medium
US11030479B2 (en) * 2019-04-30 2021-06-08 Sony Interactive Entertainment Inc. Mapping visual tags to sound tags using text similarity
CN112584299A (zh) * 2020-12-09 2021-03-30 重庆邮电大学 Immersive conference system based on multi-exciter flat-panel loudspeakers
CN112911198B (zh) * 2021-01-18 2023-04-14 广州佰锐网络科技有限公司 Processing system for intelligent audio noise reduction in video communication

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1717955A * 2002-12-02 2006-01-04 汤姆森许可贸易公司 Method for describing the composition of audio signals
CN1929593A * 2005-09-07 2007-03-14 宝利通公司 Spatially correlated audio in multipoint videoconferencing
CN101350931A * 2008-08-27 2009-01-21 深圳华为通信技术有限公司 Method and apparatus for generating and playing audio signals, and audio signal processing system

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3483086B2 (ja) 1996-03-22 2004-01-06 日本電信電話株式会社 Voice teleconferencing apparatus
US6829018B2 (en) * 2001-09-17 2004-12-07 Koninklijke Philips Electronics N.V. Three-dimensional sound creation assisted by visual information
GB2397736B (en) * 2003-01-21 2005-09-07 Hewlett Packard Co Visualization of spatialized audio
US8687820B2 (en) * 2004-06-30 2014-04-01 Polycom, Inc. Stereo microphone processing for teleconferencing
JP2007019907A (ja) * 2005-07-08 2007-01-25 Yamaha Corp Audio transmission system and communication conference apparatus
CN101268715B (zh) * 2005-11-02 2012-04-18 雅马哈株式会社 Teleconferencing apparatus
US8315366B2 (en) * 2008-07-22 2012-11-20 Shoretel, Inc. Speaker identification and representation for a phone
US20100328419A1 (en) * 2009-06-30 2010-12-30 Walter Etter Method and apparatus for improved matching of auditory space to visual space in video viewing applications
WO2011080907A1 (ja) * 2009-12-28 2011-07-07 パナソニック株式会社 Display device and method, recording medium, transmitting device and method, and playback device and method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1717955A * 2002-12-02 2006-01-04 汤姆森许可贸易公司 Method for describing the composition of audio signals
CN1929593A * 2005-09-07 2007-03-14 宝利通公司 Spatially correlated audio in multipoint videoconferencing
CN101350931A * 2008-08-27 2009-01-21 深圳华为通信技术有限公司 Method and apparatus for generating and playing audio signals, and audio signal processing system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP2566194A4 *

Also Published As

Publication number Publication date
CN102480671A (zh) 2012-05-30
US9113034B2 (en) 2015-08-18
EP2566194A1 (en) 2013-03-06
EP2566194A4 (en) 2013-08-21
US20130093837A1 (en) 2013-04-18
CN102480671B (zh) 2014-10-08

Similar Documents

Publication Publication Date Title
WO2012068960A1 (zh) Audio processing method and apparatus in video communication
US20230216965A1 (en) Audio Conferencing Using a Distributed Array of Smartphones
CN101384105B (zh) Method, device and system for three-dimensional sound reproduction
JP2975687B2 (ja) Method for transmitting audio and video signals between a first and a second station, station, video conference system, and method for transmitting audio signals between a first and a second station
US8073125B2 (en) Spatial audio conferencing
US8705778B2 (en) Method and apparatus for generating and playing audio signals, and system for processing audio signals
US7184559B2 (en) System and method for audio telepresence
WO2012142975A1 (zh) Audio signal processing method for conference site terminal, conference site terminal, and video conference system
US9025002B2 (en) Method and apparatus for playing audio of attendant at remote end and remote video conference system
US20130182064A1 (en) Method for operating a conference system and device for a conference system
US20050280701A1 (en) Method and system for associating positional audio to positional video
WO2010022658A1 (zh) Method, apparatus and system for transmitting and playing multi-viewpoint media content
WO2011153905A1 (zh) Method and apparatus for mixing processing of audio signals
JP2015530037A (ja) Video conference display method and apparatus
CN101631032B (zh) Method, apparatus and system for realizing a multilingual conference
JP7070910B2 (ja) Video conference system
WO2010045869A1 (zh) Method, system and apparatus for processing 3D audio signals
Hollier et al. Spatial audio technology for telepresence
Kang et al. Realistic audio teleconferencing using binaural and auralization techniques
Lim et al. An approach to immersive audio rendering with wave field synthesis for 3D multimedia content
WO2023042671A1 (ja) Sound signal processing method, terminal, sound signal processing system, and management device
JP2023043497A (ja) Remote conference system
Hirahara et al. Personal auditory tele-existence system using a TeleHead
CN115002401A (zh) Information processing method, electronic device, conference system, and medium
Rimell Immersive spatial audio for telepresence applications: system design and implementation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11843560

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2011843560

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE