WO2010045869A1 - Method, system and apparatus for 3D audio signal processing - Google Patents

Method, system and apparatus for 3D audio signal processing

Info

Publication number
WO2010045869A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio stream
audio
terminal
server
identifier
Prior art date
Application number
PCT/CN2009/074528
Other languages
English (en)
French (fr)
Inventor
詹五洲
王东琦
Original Assignee
华为终端有限公司
Priority date
Filing date
Publication date
Priority claimed from CN200810217091.9A (external priority; patent CN101547265B)
Priority claimed from CN2008101712402A (external priority; patent CN101384105B)
Application filed by 华为终端有限公司 (Huawei Device Co., Ltd.)
Priority to EP09821590.8A (patent EP2337328B1)
Publication of WO2010045869A1
Priority to US13/090,417 (patent US8965015B2)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 3/00 Automatic or semi-automatic exchanges
    • H04M 3/42 Systems providing special services or facilities to subscribers
    • H04M 3/56 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/15 Aspects of sound capture and related signal processing for recording or reproduction

Definitions

  • the present invention relates to the field of audio processing technologies, and in particular, to a method, system and apparatus for 3D audio signal processing.
  • the audio streams in the audio conference are processed with 3D sound processing: each audio stream is assigned a sound image position, and the gains of the audio stream in the left and right channels are adjusted according to that position, which in turn creates a three-dimensional sound effect.
  • each terminal needs to receive conference data of other terminals, and then perform 3D positioning processing on the audio data.
  • the terminal 2 receives the conference data of the terminal 1 and the terminal 3, and the terminal 2 performs 3D positioning processing on the audio data to determine the orientations of the terminal 1 and the terminal 3.
  • Another solution in the prior art is to adopt a centralized networking structure. Referring to FIG. 2, in the conference system of FIG.
  • the server performs 3D positioning processing on the audio streams sent to each participant terminal according to the situation of that terminal, and sends the processed audio streams to the corresponding participant terminals.
  • the present invention provides a signal processing method, a server, a terminal, and a system for a 3D audio conference, which are used to solve the problems in the prior art that too many transmission channels are required and that a terminal cannot localize the sound image positions of other terminals.
  • An embodiment of the present invention provides a signal processing method for a 3D audio conference, the method comprising: a server acquiring, for a terminal, at least one audio stream relative to that terminal, and allocating an identifier to the acquired at least one audio stream;
  • the server combining the acquired at least one audio stream relative to the terminal with the identifier corresponding to the at least one audio stream, and sending the combination to the target terminal.
  • the embodiment of the invention further provides a server for signal processing of a 3D audio conference, the server comprising:
  • An audio stream obtaining unit, configured to acquire, for a terminal, an audio stream relative to that terminal;
  • an identifier allocation unit, configured to allocate an identifier to the acquired audio stream relative to the terminal;
  • a combining and sending unit, configured to combine the acquired audio stream relative to the terminal with the identifier corresponding to the audio stream, and send the combination to a target terminal.
  • the embodiment of the invention further provides a terminal for implementing signal processing of a 3D audio conference, the terminal comprising: An acquiring unit, configured to acquire at least one audio stream carrying the identifier;
  • An audio processing unit, configured to extract identification information from the at least one audio stream acquired by the acquiring unit, split the audio streams according to the identification information, and separately decode the resulting multiple audio streams;
  • a sound image position allocating unit configured to allocate a sound image position to the decoded multi-channel audio stream according to the identification information extracted by the audio processing unit
  • a 3D sound processing unit configured to perform 3D sound processing on the decoded multi-channel audio stream according to the allocated sound image position.
  • the embodiment of the invention further provides a signal processing method for a 3D audio conference, the method comprising:
  • the embodiment of the present invention further provides a 3D audio conference system, the system including: a server, configured to acquire, for a terminal, at least one audio stream relative to that terminal; allocate an identifier to the acquired at least one audio stream; and combine the acquired at least one audio stream with the identifier corresponding to it and send the combination to a target terminal;
  • at least one target terminal, configured to acquire the at least one audio stream carrying the identifier, extract the identifier of each audio stream, and split off the audio streams having the same identifier according to the identifier;
  • the target terminal allocates a sound image position to each of the split audio streams according to the extracted identification information, decodes the split audio streams, and performs 3D sound processing on them according to their sound image position information.
  • FIG. 1 is a schematic diagram of a network of a distributed 3D audio conference system used in the prior art
  • FIG. 2 is a schematic diagram of a network of a centralized 3D audio conference system used in the prior art
  • FIG. 3 is a schematic flowchart of Embodiment 1 of the method according to the present invention
  • FIG. 4 is a schematic flowchart of Embodiment 2 of a method according to the present invention;
  • FIG. 5A is a schematic diagram of a corresponding system networking structure according to Embodiment 2 of the method of the present invention;
  • FIG. 5B is a schematic diagram of another corresponding system networking structure according to Embodiment 2 of the method of the present invention;
  • FIG. 6 is a schematic diagram of the system networking structure corresponding to Embodiment 3 of the method of the present invention;
  • FIG. 7 is a schematic flowchart of Embodiment 3 of a method according to the present invention;
  • FIG. 8 is a schematic structural diagram of the system networking corresponding to Embodiment 4 of the method according to the present invention;
  • FIG. 9 is a schematic flowchart of Embodiment 4 of a method according to the present invention;
  • FIG. 10 is a schematic flowchart of Embodiment 5 of a method according to the present invention;
  • FIG. 11 is a schematic structural diagram of 3D sound processing in an embodiment of a method according to the present invention.
  • FIG. 12 is a schematic flowchart of a method of acquiring sound source orientation information in an embodiment of the method of the present invention;
  • FIG. 13 is a schematic block diagram of a blind source separation method in Embodiment 6 of the method of the present invention;
  • FIG. 14 is a schematic diagram of a microphone array capturing a sound signal in Embodiment 6 of the method of the present invention
  • FIG. 15 is a schematic structural diagram of Embodiment 1 of a system according to the present invention;
  • FIG. 16 is a schematic structural diagram of Embodiment 1 of a server according to the present invention;
  • FIG. 17 is a schematic structural diagram of the audio stream acquiring unit in the server embodiment 1 shown in FIG. 16;
  • FIG. 18 is a schematic structural diagram of the identifier allocation unit in the server embodiment 1 shown in FIG. 16;
  • FIG. 19 is a schematic structural diagram of the combining and sending unit in the server embodiment 1 shown in FIG. 16;
  • FIG. 20 is a schematic structural diagram of Embodiment 1 of an apparatus according to the present invention;
  • FIG. 21 is a schematic structural diagram of an audio processing unit in the device embodiment 1 shown in FIG. 20;
  • FIG. 22 is a schematic structural diagram of Embodiment 2 of the device according to the present invention;
  • Figure 23 is a schematic structural view of Embodiment 3 of the device of the present invention.
  • Fig. 24 is a view showing the configuration of the orientation calculating unit shown in Fig. 23.
  • method embodiment 1 of the present invention can be illustrated by the flowchart of FIG. 3.
  • 301: For a terminal, the server obtains at least one audio stream relative to that terminal.
  • acquiring the audio streams relative to the terminal may specifically be done as follows: the server calculates the energy of the multiple audio streams relative to the terminal, and selects the at least one audio stream with the largest energy according to the calculated energies.
  • obtaining the several highest-energy audio streams for a terminal is only one implementation; all the audio streams can also be obtained.
  • the latter implementation requires no energy calculation and directly acquires the relevant audio streams.
  • 302: The server allocates an identifier to the acquired at least one audio stream relative to the terminal.
  • when the server allocates identifiers to the at least one audio stream, the site number or the terminal number may specifically be used as the identifier of each audio stream; of course, the identifier may also be assigned manually by the conference administrator, or in real time by the conference management system.
  • because each terminal acquires different audio streams, in order to distinguish multiple audio streams from the same site, each of the multiple audio streams of the same site is assigned a serial number, which may be the terminal number corresponding to the audio stream.
  • the server allocates a sequence number to each terminal connected to it; when the audio streams relative to a terminal are obtained according to step 301, the identifier allocated to an audio stream may be the terminal number of the terminal the audio stream corresponds to. In this way, the audio streams acquired by each terminal can be distinguished more effectively.
  • if an audio stream carries the orientation information of its sound source, the identifier assigned to the audio stream may be one combining the terminal number and the orientation information.
  • the orientation information is generally carried in the header of the audio stream's RTP packets (RTP, the Real-time Transport Protocol, can be used to transmit data with high real-time requirements such as video and audio).
  • after obtaining the audio stream according to step 301, the server obtains the orientation information by inspecting the RTP header of the audio stream, for example by checking a flag in a header field to determine whether orientation information is present in the packet header; the flag in that field is set by the terminal, and detection can be as simple as testing whether its value is 0 or 1. Those skilled in the art can implement various detection methods from common technical knowledge.
  • the terminal number of the terminal corresponding to the audio stream and the orientation information in the audio stream are then combined into an identifier and assigned to the audio stream. Since the orientation information in each audio stream is necessarily different, an audio stream can also be assigned an identifier combining the site number with the orientation information.
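As an illustration of the identifier scheme described above, the sketch below checks the standard RTP extension (X) bit and builds a combined identifier. Using the X bit to flag orientation data, and the `T<terminal>:<azimuth>` string format, are assumptions made for this example, not the patent's defined encoding.

```python
from typing import Optional

def has_rtp_extension(packet: bytes) -> bool:
    """True if the RTP fixed header's X (extension) bit is set.

    In this sketch the terminal is assumed to set this bit when the
    packet carries sound-source orientation information."""
    return bool(packet[0] & 0x10)

def make_identifier(terminal_no: int, azimuth_deg: Optional[int]) -> str:
    """Combine a terminal number and optional orientation info into one
    identifier string (a hypothetical encoding)."""
    if azimuth_deg is None:
        return f"T{terminal_no}"
    return f"T{terminal_no}:{azimuth_deg}"
```

The first header byte of an RTP packet carries version, padding, extension and CSRC-count fields, so masking with `0x10` isolates the extension flag.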
  • the identifier of the audio stream in the embodiment of the present invention is only a code for assigning the audio stream, and the purpose is to distinguish the audio stream, and therefore, according to an embodiment of the present invention, Other methods for obtaining an identifier, for which the embodiment of the present invention is not limited.
  • the server combines the acquired at least one audio stream relative to the terminal with the identifier corresponding to the at least one audio stream, and sends the combination to the target terminal.
  • the at least one audio stream and its corresponding identifiers may be combined in the following ways:
  • each obtained mono audio stream is coded, and the coded mono streams are combined into one multi-channel code stream;
  • the identifiers allocated in step 302 to the audio streams carried in the channels are added to the frame headers of the multi-channel code stream.
  • when the server combines the audio streams relative to the terminal with their corresponding identifiers, all of them may be combined loosely, all may be combined tightly, or loose and tight combination may be mixed.
  • the identifier of an audio stream can be placed in the protocol header of the IP packet or in the frame header of the audio frame.
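A tight combination of this kind could, for example, place a small header carrying the identifier in front of each coded mono frame of the multi-channel stream. The byte layout below (1-byte identifier length, identifier, 2-byte payload length) is hypothetical, chosen only to make the idea concrete:

```python
import struct

def mux_frames(frames):
    """Pack several coded mono frames into one multi-channel stream.
    frames: {identifier (str): coded frame (bytes)}.
    Each frame is preceded by a small header carrying its identifier:
    1-byte id length, the id itself, then a 2-byte payload length."""
    out = bytearray()
    for ident, payload in frames.items():
        id_bytes = ident.encode()
        out += struct.pack("!B", len(id_bytes)) + id_bytes
        out += struct.pack("!H", len(payload)) + payload
    return bytes(out)

def demux_frames(stream):
    """Inverse of mux_frames: split the combined stream back by identifier."""
    frames, i = {}, 0
    while i < len(stream):
        id_len = stream[i]; i += 1
        ident = stream[i:i + id_len].decode(); i += id_len
        (plen,) = struct.unpack("!H", stream[i:i + 2]); i += 2
        frames[ident] = stream[i:i + plen]; i += plen
    return frames
```

Because the identifier travels in every frame header, the receiver can split the stream without any out-of-band signalling.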
  • the technical solution of the embodiment of the present invention enables the terminal to freely locate the sound image positions of other terminals according to the received audio streams of other terminals and the identifiers assigned to those streams; in particular, when an audio stream carries the orientation information of its sound source, the terminal can locate the sound image position more accurately according to that orientation information.
  • the method embodiment 2 of the present invention mainly describes the case of a single server; its processing procedure can be illustrated by the flowchart of FIG. 4.
  • 401: The server obtains the audio stream corresponding to each terminal.
  • each terminal generally corresponds to a site; each terminal acquires the audio stream of its site, and the server acquires the corresponding audio stream from each terminal.
  • 402: The server calculates the energy of each obtained audio stream and selects the at least one audio stream with the largest energy;
  • the server performs an energy calculation on the audio streams corresponding to the respective terminals acquired in 401 and, according to the result of the calculation, selects the at least one audio stream with the largest energy as the finally selected audio stream.
  • calculating the energy of every audio stream and selecting the highest-energy streams in this way is only one implementation; alternatively, the energies need not be calculated, and the audio streams of all participating sites are selected as the selected audio streams.
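The energy calculation and selection of step 402 can be sketched as follows. Using the mean-square value of one frame of PCM samples as its energy is an assumption; the patent text does not fix a particular energy measure.

```python
def frame_energy(samples):
    """Mean-square energy of one audio frame (a sequence of PCM samples)."""
    return sum(x * x for x in samples) / len(samples)

def select_loudest(streams, n=1):
    """Given {identifier: frame samples}, return the identifiers of the n
    streams with the largest energy, loudest first."""
    return sorted(streams, key=lambda k: frame_energy(streams[k]),
                  reverse=True)[:n]
```

In practice the energy would be smoothed over several frames before selection, so that streams are not swapped on every short pause.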
  • 403: The server acquires the identifier information corresponding to the selected at least one audio stream;
  • corresponding identification information is obtained for the selected at least one audio stream.
  • the identification of a selected audio stream may specifically use the site number or the terminal number corresponding to the stream. If an acquired audio stream carries the orientation information of the sound source corresponding to its audio signal, a combined identifier of terminal number and orientation information, or of site number and orientation information, is used as the identifier of the stream. Generally, if there is only one terminal in a site, the site number is used as the identifier; otherwise, the combined identifier of terminal number (or site number) and orientation information is used.
  • the orientation information corresponding to each audio stream can be obtained by inspecting the RTP header of the audio stream.
  • the identifier of an audio stream in the embodiment of the present invention is only a code assigned to distinguish the stream; for example, it may also be allocated manually by the conference administrator or in real time by the conference management system. Other methods of obtaining an identifier are therefore possible, and the embodiment of the present invention is not limited in this respect.
  • the server combines the selected audio stream with the acquired identification information.
  • the selected at least one audio stream is combined with the identification information acquired for it.
  • the ways to combine include:
  • each selected mono audio stream is coded, and the coded mono streams are combined into one multi-channel code stream;
  • the identifier obtained in step 403 for the audio stream carried in each channel is added to the frame header of the multi-channel code stream.
  • the audio streams and their identifiers may all be combined loosely, all combined tightly, or combined with a mixture of loose and tight combination.
  • the server sends the audio streams combined with the identifier information to the corresponding target terminals according to the corresponding sending policy.
  • when sending the audio streams combined with the identification information to the corresponding target terminals, the following policy may be adopted:
  • the audio streams sent to a target terminal are the selected audio streams other than the one acquired from that terminal;
  • if the selected audio streams do not include the audio stream acquired from a certain terminal, all the selected audio streams are sent to that terminal.
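The sending policy above reduces to a one-line filter: drop the terminal's own stream if it was selected, otherwise pass everything through. The string identifiers are hypothetical.

```python
def streams_for_terminal(selected, own_identifier):
    """Apply the sending policy: a terminal never receives its own audio
    stream back; if its stream was not among the selected ones, it simply
    receives all selected streams."""
    return [ident for ident in selected if ident != own_identifier]
```

This matches the FIG. 5A example: with streams 2 and 3 selected, terminal 2 receives only stream 3 while terminal 1 receives both.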
  • FIG. 5A includes four terminals and one server, and each terminal corresponds to one site, so the terminal number here is also the site number.
  • the dotted lines from each terminal to the server mean that each terminal uploads the audio stream it collects to the server; the solid lines from the server to each terminal mean that the server sends the selected audio streams to each terminal.
  • it is assumed that terminals 2 and 3 are the terminals corresponding to the highest-energy audio streams; the server therefore sends audio streams 2 and 3 to terminal 1 and to terminal 4, sends audio stream 3 to terminal 2, and sends audio stream 2 to terminal 3.
  • in FIG. 5B, terminals 1, 2 and 3 belong to one site (as shown by the dotted line in the figure), and terminal 4 is another site.
  • the meanings of the dotted and solid lines are the same as in FIG. 5A.
  • assume terminal 2 is the terminal corresponding to the highest-energy audio stream within that site; the server then sends audio stream 4 to terminal 1, terminal 2 and terminal 3, and sends audio stream 2 to terminal 4. In this example the terminals do not correspond one-to-one with sites, so the terminal number here is not the site number.
  • the technical solution of the embodiment of the present invention enables the terminal to freely locate the sound image positions of other terminals according to the received audio streams of other terminals and the identifiers assigned to those streams; in particular, when an audio stream carries the orientation information of its sound source, the terminal can locate the sound image position more accurately according to that orientation information.
  • the method embodiment 3 of the present invention is mainly directed to the case in which a plurality of servers are cascaded; the structure can be illustrated by FIG. 6.
  • from FIG. 6 we can see that there are three servers and four terminals: terminal 1 and terminal 2 belong to server 2, terminal 3 and terminal 4 belong to server 3, and server 2 and server 3 are cascaded through server 1.
  • the server 1 can be regarded as a master server, and the server 2 and the server 3 are regarded as slave servers of the server 1.
  • the primary server acquires the audio streams uploaded by the slave servers.
  • the primary server decomposes each audio stream obtained from a slave server into multiple audio streams, where the number of channels of the decomposed stream is the number of terminals under that slave server.
  • since the audio stream obtained from a slave server is made up of the streams uploaded by each terminal of that slave server, it can be decomposed into the different audio streams according to the specific terminals.
  • the primary server calculates the energy of each decomposed audio stream and selects the at least one audio stream with the largest energy;
  • the implementation process of calculating the energy of the decomposed audio stream and selecting the at least one audio stream with the largest energy is similar to the 402 in the method embodiment 2 of the present invention, and details are not described herein again.
  • the primary server acquires the identifier information corresponding to the selected at least one audio stream.
  • the primary server obtains, from the slave servers, the identifier information corresponding to the selected at least one audio stream.
  • the manner of obtaining is similar to 403 in the method embodiment 2 of the present invention, and details are not described herein again.
  • 705. The primary server combines the selected audio stream with the acquired identification information.
  • the primary server sends the audio stream combined with the identifier information to a corresponding terminal according to a corresponding sending policy.
  • the method embodiment 3 of the present invention only gives a form of server cascading composed of three servers; cascading of more servers can also be completed according to the process of this embodiment.
  • the technical solution of the embodiment of the present invention enables the terminal to freely locate the sound image positions of other terminals according to the received audio streams of other terminals and the identifiers assigned to those streams; in particular, when an audio stream carries the orientation information of its sound source, the terminal can locate the sound image position more accurately according to that orientation information.
  • the method embodiment 4 of the present invention is mainly directed to the case in which at least one terminal is combined with a plurality of cascaded servers; the structure can be illustrated by FIG. 8.
  • FIG. 8 includes a total of six terminals: terminals 1 and 2 are under the jurisdiction of slave server 2, terminals 3 and 4 are under the jurisdiction of slave server 3, and terminals 5 and 6 are directly connected to master server 1.
  • the primary server acquires the audio streams uploaded by the slave servers and the audio streams of the terminals it directly manages.
  • the primary server decomposes each audio stream obtained from a slave server into multiple audio streams, where the number of channels of the decomposed stream is not greater than the number of terminals under that slave server.
  • the stream can be decomposed into different audio streams according to the specific terminals.
  • the number of channels of the decomposed audio stream may be smaller than the number of terminals under the slave server: the number of channels is determined by whether the different terminals are emitting sound, and when some terminals contribute no site sound, the number of decomposed channels is smaller than the number of terminals under the slave server.
  • the primary server calculates the energy of the audio streams decomposed from the streams acquired from the slave servers, and of the audio streams obtained from the directly managed terminals, and selects the at least one audio stream with the largest energy;
  • the implementation of this step is similar to 402 in the method embodiment 2 of the present invention, and details are not described herein again.
  • the primary server obtains the identifier information corresponding to the selected at least one audio stream.
  • the implementation process of this step is similar to 403 in the method embodiment 2 of the present invention, and details are not described herein again.
  • the primary server combines the selected audio stream with the acquired identification information.
  • the primary server sends the audio stream combined with the identifier information to a corresponding terminal or a slave server according to a corresponding sending policy.
  • the method embodiment 4 of the present invention only gives a form of server cascading composed of three servers with two terminals under the direct jurisdiction of the main server; cascading of more servers, and more terminals directly managed by the main server, can also be implemented according to the process of this embodiment.
  • the technical solution of the embodiment of the present invention enables the terminal to freely locate the sound image positions of other terminals according to the received audio streams of other terminals and the identifiers assigned to those streams; in particular, when an audio stream carries the orientation information of its sound source, the terminal can locate the sound image position more accurately according to that orientation information.
  • the method in this embodiment is directed to the processing performed by the terminal on the received audio stream.
  • the actual process is specifically as follows:
  • At least one audio stream carrying the identifier is obtained, for example, at least one audio stream that is sent by the receiving server and carries the identifier.
  • the identification information is extracted from the protocol header of the IP packet of the obtained audio stream, or from the frame header of the audio frame.
  • the audio streams are split according to the extracted identification information.
  • since the identification information differs between different audio streams, the streams are split by identifier, and streams carrying the same identifier are assigned to the same decoding module.
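The splitting step can be sketched as routing each received frame into a per-identifier queue, so that all frames sharing an identifier reach the same decoder:

```python
from collections import defaultdict

def split_by_identifier(received):
    """Demultiplex received (identifier, frame) pairs: every frame with the
    same identifier is routed to the same per-stream queue, so each queue
    can be handed to its own decoding module."""
    queues = defaultdict(list)
    for identifier, frame in received:
        queues[identifier].append(frame)
    return dict(queues)
```

Frame order within each queue is preserved, which matters because the decoder consumes the frames of one stream sequentially.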
  • the sound image position may be allocated by using the identification information of the audio stream extracted in step 1001.
  • the allocation of the sound image position can be specified in advance by the user, that is, a certain sound image position is fixedly assigned to a certain terminal; it can also be assigned automatically.
  • the automatic allocation can be performed according to the following principles:
  • if the audio signal energy of a terminal is large, the middle sound image position is allocated; in FIG. 11, this is the virtual sound image position in front of the TV. The benefit of this allocation is that the sound image position matches the image being viewed.
  • if the audio signal energy of a terminal is small, the sound image positions on the two sides are allocated.
  • such a terminal may be carrying only noise, and placing it at the sides separates the noise from the voice of the far-end speaker, thereby ensuring the clarity of the speaker's voice.
  • when the identifier includes only the terminal number: if the terminal number in the audio stream matches the terminal being viewed, a sound image position matching the image is allocated; corresponding to FIG. 11, that is the sound image position between the two front speakers (P2 and P3). If they do not match, the side sound image positions are assigned; corresponding to FIG. 11, the sound image position between the two speakers P1 and P2 can be assigned.
  • when the identifier includes the terminal number and the orientation information: the terminal number is checked first, as above. If the terminal number in the audio stream matches the terminal being viewed, a sound image position matching the image is allocated, corresponding in FIG. 11 to the position between the two front speakers (P2 and P3); if not, the side sound image positions are assigned, corresponding in FIG. 11 to the position between the two speakers P1 and P2. Since the audio stream identifier further includes the orientation information of the sound source, the terminal number and the orientation information together allow a more precise sound image allocation.
  • for example, suppose the horizontal orientation of the sound source is the middle-left position, which means the speaker in the image should also be at the middle-left;
  • the sound image of the audio stream is then assigned to the middle-left position relative to the image; corresponding to FIG. 11, that is the middle-left position between the two front speakers (P2 and P3).
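The allocation rules above can be sketched as a function mapping a stream identifier and the currently viewed terminal to a pan position. The `T<no>` / `T<no>:<azimuth>` identifier format and the numeric pan scale are assumptions for this sketch, not the patent's encoding.

```python
def allocate_image_position(ident, viewed_terminal):
    """Allocate a sound image position from a stream identifier.
    Returns a pan value in [-1, 1]: 0 is the centre (matching the viewed
    image), +/-1 are the far sides. A stream from a terminal that is not
    on screen is pushed to a side; a matching stream with orientation
    info is refined toward the source's azimuth within the centre zone."""
    parts = ident.split(":")
    terminal = parts[0]
    if terminal != viewed_terminal:
        return 1.0                         # off-screen: one side (sketch)
    if len(parts) == 2:                    # orientation info available
        azimuth = float(parts[1])          # assumed range -90..90 degrees
        return max(-0.5, min(0.5, azimuth / 180.0))
    return 0.0                             # on screen, no orientation
```

A fuller implementation would alternate off-screen streams between the two sides rather than always using one, and would also apply the energy-based rules described earlier.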
  • the audio streams routed to the same decoding module according to the identification information in step 1002 are decoded, and the decoded audio streams are subjected to 3D sound processing using the sound image position information assigned in step 1003.
  • the 3D sound processing used throughout the method embodiments of the present invention is described here and will not be repeated elsewhere.
  • the purpose of 3D sound processing is to create a stereo sound field by using the left and right speakers.
  • the specific process of 3D sound processing can be illustrated by the following example, see Figure 11:
  • In Fig. 11, the distance between the speakers p1 and p2 is d, and the distance between the virtual sound image v1 and the speaker p1 is w. Assuming an audio stream s1 is assigned the sound image position v1, s1 multiplied by a gain g1 is sent to p1, and s1 multiplied by a gain g2 is sent to p2, where g1 and g2 are computed as:
  w/d = (g1 - g2)/(g1 + g2)  (1)
  c = g1 × g1 + g2 × g2  (2)
  • In formulas (1) and (2), g1 is the left-channel amplitude gain, g2 is the right-channel amplitude gain, and c is a fixed value, which can be equal to 1, for example.
  • Once the gains of the left and right channels are computed, the stereo sound field can be simulated.
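Solving equations (1) and (2) for the two gains can be sketched as follows. With r = w/d, equation (1) rearranges to g2 = g1·(1 − r)/(1 + r), and substituting into (2) fixes g1; the code implements the patent's equations exactly as given.

```python
import math

def pan_gains(w, d, c=1.0):
    """Left/right amplitude gains (g1, g2) satisfying equations (1) and (2)."""
    ratio = (1.0 - w / d) / (1.0 + w / d)      # g2/g1, from equation (1)
    g1 = math.sqrt(c / (1.0 + ratio * ratio))  # from equation (2)
    return g1, g1 * ratio
```

For example, `pan_gains(0.3, 1.0)` returns a gain pair whose difference-over-sum ratio is 0.3 and whose squared sum is 1, as the two formulas require.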
  • The present embodiment not only enables the terminal to freely position the sound images of other terminals according to the received audio streams and the identifiers assigned to them, but also separates the mixed audio signals corresponding to different sound sources and calculates the position information of each, so that after the sound is played out the receiving-side terminal can faithfully simulate and reproduce the original real sound field.
  • FIG. 12 is a flowchart of a method for acquiring the orientation information of the sound sources corresponding to the audio signals in the audio stream according to an embodiment of the present invention. The method includes the following steps:
  • 1201: Acquire multiple audio signals from the local sound sources. This means using a microphone array composed of multiple microphones to collect the voice signals of multiple people (i.e., multiple sound sources) speaking at the same time, thereby capturing multiple sound signals and converting them into multiple audio signals.
  • Here "local" may refer to the site where the microphone array is located.
  • 1202: Separate the acquired multiple audio signals by sound source to obtain the audio signals corresponding to the respective sound sources. In step 1202, the separation of the acquired multi-channel audio signals by sound source is performed using a blind source separation method.
  • FIG. 13 is a basic block diagram of the blind source separation method used in FIG. 12.
  • Blind source separation means recovering or separating the source signals only from the observed mixed signals, according to the statistical characteristics of the input signals, without knowing any a priori information about the source signals or the transmission channel. That is, the source signals cannot be observed directly; only the mixed signals are available, and how the different source signals were mixed is also unknown.
  • a typical observed signal is the output of a series of sensors, and each sensor receives a different combination of source signals.
  • the main task of blind source separation is to recover the source signal from the observed data.
  • In the embodiment of the present invention, the microphone array collects the voice signals of several people speaking at the same time, obtaining multi-channel speech; blind source separation then recovers each person's voice signal from the multi-channel speech, i.e., separates the audio signals corresponding to the multiple sound sources.
  • The basic principle of blind source separation is that the observed signals, after passing through a separation system, can recover or separate the source signals.
  • As shown in Fig. 13, N mutually statistically independent unknown source signals s = [s1(t), s2(t), ..., sN(t)]^T are transmitted through an unknown mixing system H and detected by M sensors, yielding M observed signals x = [x1(t), x2(t), ..., xM(t)]^T. The task of blind source separation is to pass the observed signals through a signal separator (i.e., a separation algorithm) so that the output y = [y1(t), y2(t), ..., yN(t)]^T is a copy or an estimate of the source signals. The main methods currently used for blind source separation are independent component analysis, entropy maximization, and nonlinear principal component analysis.
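As a rough, self-contained sketch of the independent-component-analysis approach mentioned above (the patent does not prescribe a particular algorithm), a symmetric FastICA iteration with a tanh nonlinearity can unmix a two-microphone instantaneous mixture. The mixing matrix `H` below is an invented example.

```python
import numpy as np

def fastica(X, iters=200, seed=0):
    """Symmetric FastICA with tanh nonlinearity. X: (n_sources, n_samples) mixtures."""
    X = X - X.mean(axis=1, keepdims=True)
    # Whitening: decorrelate and normalize the mixtures.
    d, E = np.linalg.eigh(X @ X.T / X.shape[1])
    Z = (E @ np.diag(1.0 / np.sqrt(d)) @ E.T) @ X
    W, _ = np.linalg.qr(np.random.default_rng(seed).normal(size=(Z.shape[0],) * 2))
    for _ in range(iters):
        G = np.tanh(W @ Z)
        # Fixed-point update: E{g(WZ)Z^T} - diag(E{g'(WZ)}) W
        W_new = G @ Z.T / Z.shape[1] - np.diag((1.0 - G ** 2).mean(axis=1)) @ W
        U, _, Vt = np.linalg.svd(W_new)  # symmetric decorrelation
        W = U @ Vt
    return W @ Z  # estimated source signals (up to order and sign)

# Two independent sources mixed by a hypothetical 2x2 system H.
t = np.linspace(0, 8, 2000)
S = np.vstack([np.sin(2 * np.pi * 5 * t), np.sign(np.sin(2 * np.pi * 3 * t))])
H = np.array([[1.0, 0.6], [0.4, 1.0]])
Y = fastica(H @ S)
```

Real conference audio is a convolutive rather than instantaneous mixture, so a practical system would use a frequency-domain variant; this sketch only illustrates the principle.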
  • Step 1203, calculating the orientation information corresponding to the respective sound sources according to the acquired multiple audio signals and the positional relationship between the devices used to acquire them, specifically includes: estimating the relative delays with which the multiple audio signals propagate to the respective acquisition devices; and calculating the orientation information corresponding to each sound source from the estimated relative delays together with the known positional relationship between the acquisition devices.
  • Figure 14 is a schematic illustration of the microphone array shown in Figure 12 capturing sound signals.
  • Because the distance between a sound source and each microphone differs, the sound signals from a sound source reach the different microphones in the microphone array at different times. For example, with two sound sources, the sound signal emitted by sound source 1 arrives at each microphone in the array at a different time, and so does the sound signal emitted by sound source 2; consequently, the audio signals corresponding to the same sound source output by different microphones are also offset in time.
  • the relative delay between the audio signals corresponding to the respective sound sources is first estimated, and then the orientation of each sound source is determined by using the estimated relative delay and the positional relationship between the known microphones.
  • the most widely used delay estimation algorithm is the Generalized Cross Correlation (GCC).
  • The generalized cross-correlation method obtains the cross-correlation function between two audio signals by computing their cross-power spectrum, weighting it in the frequency domain to suppress noise and reflections, and then inverse-transforming back to the time domain.
  • The peak position of the cross-correlation function gives the relative delay between the two audio signals; combining this delay with the known positions of the microphones yields the orientation information of the sound source.
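The delay estimate can be sketched as follows, using the common PHAT weighting as the frequency-domain weighting function (the patent does not fix a particular weighting) and a standard far-field delay-to-angle conversion for one microphone pair; the sampling rate and microphone spacing in the usage note are illustrative values.

```python
import numpy as np

def gcc_phat(x, y, fs):
    """Relative delay of y with respect to x (seconds) via GCC with PHAT weighting."""
    n = len(x) + len(y)
    R = np.conj(np.fft.rfft(x, n)) * np.fft.rfft(y, n)  # cross-power spectrum
    R /= np.abs(R) + 1e-12                              # PHAT: keep phase only
    cc = np.fft.irfft(R, n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs     # peak position = delay

def doa_from_delay(tau, mic_distance, speed_of_sound=343.0):
    """Far-field angle of arrival (radians) for a two-microphone pair."""
    return np.arcsin(np.clip(speed_of_sound * tau / mic_distance, -1.0, 1.0))
```

For microphones spaced `mic_distance` apart the physical delay is confined to ±`mic_distance`/c, hence the clipping before the arcsine.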
  • 1204: Send an audio stream including the audio signals and the orientation information corresponding to the local sound sources.
  • Here, the orientation information may be placed in the RTP header of the audio stream, so that the audio stream carrying the orientation information is transmitted. When the orientation is placed in the packet header, a flag may be set in the corresponding header field so that, upon receiving the audio stream, the server detects the orientation information in the packet header according to the flag; alternatively, a value of 0 or 1 may mark whether orientation information is present in the packet header. Those skilled in the art can arrange this according to common technical knowledge, so that the server can detect the orientation information in the packet header after receiving the audio stream.
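As a toy illustration of this flag-based detection (the byte layout below is an invented frame format for the sketch, not the RTP header itself), the sender can set one flag bit and the receiver can branch on it:

```python
import struct

ORIENTATION_FLAG = 0x01  # hypothetical flag bit marking "orientation present"

def pack_frame(payload, azimuth_deg=None):
    """Prefix a frame with a 1-byte flag and, if present, a signed 16-bit azimuth."""
    if azimuth_deg is None:
        return struct.pack("!B", 0) + payload
    return struct.pack("!Bh", ORIENTATION_FLAG, azimuth_deg) + payload

def unpack_frame(frame):
    """Return (azimuth_or_None, payload), mirroring the server-side detection."""
    if frame[0] & ORIENTATION_FLAG:
        (azimuth,) = struct.unpack("!h", frame[1:3])
        return azimuth, frame[3:]
    return None, frame[1:]
```

A real implementation would carry this inside an RTP header extension rather than a private prefix; the sketch only shows the flag-then-detect pattern the text describes.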
  • It should be noted that the method described in this embodiment obtains the orientation information of the sound sources, so it does not conflict with the 3D sound processing method of the foregoing embodiments. The method of this embodiment may be carried out before the 3D sound processing, for example obtaining the orientation information of the sound sources in the scene where the microphone array is located before step 1001 of Embodiment 5, while the far end is listening to the local end.
  • It may equally be completed after the 3D sound processing of Embodiment 5, when the local end is replying to the far end's speech, since it operates on the state of the sound sources in the scene where the microphone array is located. It can therefore be inferred that the method of this embodiment can coexist with the method of Embodiment 5, for example in the same terminal, so that both the method of Embodiment 5 and the method of this embodiment can be implemented.
  • The terminal of this embodiment can obtain the orientation information of the sound sources and place the acquired orientation information in the header of the audio stream before sending, so that the server can allocate the orientation information carried in the audio stream to the stream as part of its identifier.
  • A system embodiment of the present invention can be illustrated with reference to FIG. 15.
  • The server 1200 is configured to: acquire, for a terminal, at least one audio stream relative to the terminal; allocate identifiers to the acquired at least one audio stream relative to the terminal; and combine the acquired at least one audio stream with the identifiers corresponding to the at least one audio stream and send the combination to the target terminal;
  • At least one target terminal 1300 is configured to: acquire the at least one audio stream carrying the identifiers; extract the identifiers of the audio streams and split the streams having the same identifier according to the identifiers; allocate a sound image position to each split audio stream according to the extracted identification information; and decode the split audio streams and perform 3D sound processing on them according to their sound image position information.
  • The target terminal 1300 is further configured to: acquire multiple audio signals from the various sound sources at the site where the terminal is located; separate the acquired signals by sound source to obtain the audio signals corresponding to the respective sources; calculate the orientation information corresponding to each sound source from the acquired signals and the positional relationship between the devices used to acquire them; and send to the server an audio stream including the audio signals and orientation information corresponding to the respective sound sources.
  • Building on system embodiment 1, this system embodiment includes a primary server, i.e., server 1 in FIG. 6, configured to: acquire, for a terminal, at least one audio stream relative to the terminal; allocate identifiers to the acquired at least one audio stream; combine the acquired at least one audio stream with the corresponding identifiers and send the combination to the terminal; and decompose the combined, identified audio streams received from the at least one slave server into multiple audio streams. It also includes at least one slave server, i.e., server 2 and server 3 in FIG. 6, configured to acquire the audio streams of the terminals or other servers under its own jurisdiction and combine the acquired audio streams with their identifiers.
  • the technical solution of the embodiment of the present invention enables the terminal to freely locate the sound image positions of other terminals according to the received audio streams of other terminals and the identifiers assigned by the audio streams.
  • This embodiment mainly provides a server for realizing signal processing of a 3D audio conference, and the server includes:
  • an audio stream obtaining unit 161, configured to acquire, for a terminal, at least one audio stream relative to the terminal;
  • An identifier allocation unit 162 configured to allocate an identifier to the acquired at least one audio stream relative to the terminal;
  • a combination sending unit 163, configured to combine the acquired at least one audio stream relative to the terminal with the identifier corresponding to the at least one audio stream and send the combination to the target terminal.
  • Referring to FIG. 17, the audio stream obtaining unit 161 of this embodiment may include: an audio stream energy acquiring module 1611, configured to acquire the energies of multiple audio streams relative to the terminal; and an audio stream selection module 1612, configured to select the at least one audio stream with the largest energy according to the acquired energies of the multiple audio streams.
  • The audio stream obtaining unit 161 of this embodiment may further include:
  • a detecting module 1613, configured to detect, from the packet header of the obtained audio stream, the orientation information of the sound source corresponding to the audio signal in the stream.
  • the identifier allocating unit 162 of this embodiment may include:
  • a site/terminal number obtaining module 1621, configured to obtain the site number of the site from which the at least one highest-energy audio stream originates, and/or the terminal number of the terminal at that site;
  • an identifier combination module 1622, configured to combine the orientation information detected by the detecting module 1613 with the site number or terminal number obtained by the site/terminal number obtaining module 1621; and an identifier distribution module 1623, configured to allocate the site number or terminal number obtained by the site/terminal number obtaining module 1621 to the audio stream as a first identifier, or to allocate the combined identifier formed by the identifier combination module 1622 to the audio stream as a second identifier.
  • Referring to FIG. 19, the combination sending unit 163 specifically includes the following modules: a first combining module 1631, configured to leave the selected audio streams unmodified and, when performing protocol encapsulation of each frame of audio data, add the identifier assigned to the at least one audio stream to the protocol packet header; and/or a second combining module 1632, configured to transcode the selected mono audio streams, integrate the transcoded mono streams into one multi-channel stream, and add the identifiers assigned to the at least one audio stream for the respective channels to the frame header of the multi-channel stream.
  • The technical solution of this embodiment enables the terminal to freely position the sound images of other terminals according to the received audio streams and the identifiers assigned to them; in particular, when an audio stream includes the orientation information of its sound source, the terminal can position the sound image more accurately according to that orientation information.
  • The embodiment of the present invention further provides a terminal for implementing signal processing of a 3D audio conference. Referring to FIG. 20, the terminal includes:
  • an obtaining unit 171, configured to acquire at least one audio stream carrying an identifier;
  • an audio processing unit 172, configured to extract identification information from the at least one audio stream acquired by the obtaining unit 171, split the audio streams according to the identification information, and decode the split streams separately;
  • a sound image position allocating unit 173, configured to allocate sound image positions to the decoded audio streams according to the identification information extracted by the audio processing unit; when the identifier includes the orientation information of the corresponding sound source, the sound image position allocating unit allocates the sound image positions more accurately based on the orientation information;
  • the 3D sound processing unit 174 is configured to perform 3D sound processing on the decoded multi-channel audio stream according to the allocated sound image position.
  • Referring to FIG. 21, the audio processing unit 172 specifically includes: an identifier extraction module 1721, configured to extract the identification information from the acquired identified audio streams; a distribution module 1722, configured to split the audio streams according to the extracted identification information; and a decoding module 1723, configured to decode the split audio streams separately.
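The extract-and-split step can be pictured as grouping received frames by their identifier before handing each group to its decoder. The `(identifier, payload)` tuples below are a simplified stand-in for the real packets, not the patent's wire format.

```python
from collections import defaultdict

def split_by_identifier(frames):
    """frames: iterable of (identifier, payload); returns identifier -> ordered payload list."""
    streams = defaultdict(list)
    for ident, payload in frames:
        streams[ident].append(payload)  # same identifier -> same decoder queue
    return dict(streams)
```

Each resulting list is then decoded as one audio stream and given its own sound image position.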
  • The technical solution of this embodiment enables the terminal to freely position the sound images of other terminals according to the received audio streams and the identifiers assigned to them; in particular, when an audio stream includes the orientation information of its corresponding sound source, the sound images can be allocated more accurately through the orientation information, and 3D sound processing is then performed on the decoded audio streams based on the allocated sound image positions.
  • Device embodiment 2: referring to FIG. 22, the terminal may further include an audio encoding unit 175, configured to encode the acquired audio signals.
  • Device embodiment 3: referring to FIG. 23, the terminal may further include a multi-channel audio signal acquiring unit 176, configured to acquire multiple audio signals from the local sound sources, either before the terminal receives the multi-channel audio streams sent by the server or after the terminal performs 3D sound processing on the received audio streams;
  • a sound source separating unit 177, configured to separate the acquired multi-channel audio signals by sound source to obtain the audio signals corresponding to the respective sources;
  • an orientation calculating unit 178, configured to calculate the orientation information corresponding to the respective sound sources from the acquired multiple audio signals and the positional relationship between the devices used to acquire them;
  • a transmitting unit 179, configured to send an audio stream including the audio signals and the orientation information corresponding to the respective sound sources.
  • Referring to FIG. 24, the orientation calculating unit 178 in this embodiment can include: a delay estimation module 1781, configured to estimate the relative delays with which the multi-channel audio signals propagate to the respective devices used to acquire them; and a sound source localization module 1782, configured to calculate the orientation information corresponding to the respective sound sources from the relative delays estimated by the delay estimation module 1781 and the positional relationship between the acquisition devices.
  • The technical solution of this embodiment enables the terminal to freely position the sound images of other terminals according to the received audio streams and the identifiers assigned to them, and to separate the mixed audio signals corresponding to different sound sources and calculate their position information, so that after the sound is played out the receiving-side terminal can faithfully simulate and reproduce the original real sound field.
  • the steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, a software module executed by a processor, or a combination of both.
  • The software module can reside in random access memory (RAM), flash memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Description

A method, system and apparatus for 3D audio signal processing. This application claims priority to Chinese application No. 200810217091.9, filed on October 20, 2008 and entitled "Signal processing method, device and system for a 3D audio conference", and to Chinese application No. 200810171240.2, filed on October 27, 2008 and entitled "Method, apparatus and system for three-dimensional sound reproduction", both of which are incorporated herein by reference in their entirety. Technical field
The present invention relates to the field of audio processing technology, and in particular to a method, system and apparatus for 3D audio signal processing.
Background
Current audio conference systems are usually monophonic or two-channel and lack a sense of spatial presence; in multipoint conferences the various sounds are mixed together, reducing the clarity of the sound.
In the prior art, 3D sound processing is applied to the audio streams of an audio conference: using the sound image position allocated to each audio stream, and according to the positional relationship of the audio streams at the various sound image positions, the gains of each stream in the left and right channels are adjusted so as to create a stereo sound effect.
Regarding how to network a 3D audio conference, one prior-art solution adopts a distributed networking structure in which every terminal must receive the conference data of the other terminals and then perform 3D positioning on that audio data, so that the user perceives different audio streams as coming from different directions. Referring to Fig. 1, terminal 2 receives the conference data of terminal 1 and terminal 3 and performs 3D positioning on this audio data to determine the directions of terminal 1 and terminal 3. Another prior-art solution adopts a centralized networking structure; referring to Fig. 2, the conference system contains one server and several terminals, all terminals send their own audio data to the server, and the server performs 3D positioning on the audio streams sent to each participating terminal according to that terminal's situation and sends the processed streams to the corresponding terminal.
In the course of completing the present invention, the inventors found that the prior art has at least the following problems: in the distributed 3D audio conference, because the audio data is processed in a distributed manner at each terminal, many transmission channels are necessarily required, so it is only suitable for small conferences with a few sites; in the centralized 3D audio conference, because all processing is done on the server, the configuration of each terminal's playback equipment must be known in advance, and a terminal cannot freely position the sound images of other terminals.
Summary of the invention
In view of the problems in the prior art, the present invention provides a signal processing method, server, terminal and system for a 3D audio conference, to solve the prior-art problems that too many transmission channels are required and that a terminal cannot freely position the sound images of other terminals.
An embodiment of the present invention provides a signal processing method for a 3D audio conference, the method including: for a terminal, acquiring, by a server, at least one audio stream relative to the terminal; allocating, by the server, identifiers to the acquired at least one audio stream relative to the terminal;
and combining, by the server, the acquired at least one audio stream relative to the terminal with the identifiers corresponding to the at least one audio stream, and sending the combination to a target terminal.
An embodiment of the present invention further provides a server for signal processing of a 3D audio conference, the server including:
an audio stream obtaining unit, configured to acquire, for a terminal, an audio stream relative to the terminal; an identifier allocation unit, configured to allocate an identifier to the acquired audio stream relative to the terminal;
and a combination sending unit, configured to combine the acquired audio stream relative to the terminal with the identifier corresponding to the audio stream and send the combination to a target terminal.
An embodiment of the present invention further provides a terminal for implementing signal processing of a 3D audio conference, the terminal including: an obtaining unit, configured to acquire at least one audio stream carrying an identifier;
an audio processing unit, configured to extract identification information from the at least one audio stream acquired by the obtaining unit, split the audio streams according to the identification information, and decode the split streams separately;
a sound image position allocating unit, configured to allocate sound image positions to the decoded audio streams according to the identification information extracted by the audio processing unit;
and a 3D sound processing unit, configured to perform 3D sound processing on the decoded audio streams according to the allocated sound image positions.
An embodiment of the present invention further provides a signal processing method for a 3D audio conference, the method including:
acquiring multiple audio streams carrying identifiers, and extracting identification information from the acquired streams;
splitting the audio streams having the same identifier according to the extracted identification information; allocating sound image positions to the split audio streams according to the extracted identification information; and decoding the split audio streams and performing 3D sound processing on the decoded streams according to their sound image position information.
An embodiment of the present invention further provides a 3D audio conference system, the system including: a server, configured to acquire, for a terminal, at least one audio stream relative to the terminal, allocate identifiers to the acquired at least one audio stream, and combine the acquired at least one audio stream with the corresponding identifiers and send the combination to a target terminal;
and at least one target terminal, configured to acquire the at least one audio stream carrying the identifiers, extract the identifiers of the streams, split the streams having the same identifier according to the identifiers, allocate sound image positions to the split streams according to the extracted identification information, decode the split streams, and perform 3D sound processing on them according to their sound image position information. With the technical solution of the embodiments of the present invention, a terminal can freely position the sound images of other terminals according to the received audio streams of the other terminals and the identifiers assigned to those streams.
Brief description of the drawings
The drawings described here are provided for further understanding of the present invention and form a part of this application; they do not limit the present invention. In the drawings:
Fig. 1 is a network diagram of a distributed 3D audio conference system in the prior art; Fig. 2 is a network diagram of a centralized 3D audio conference system in the prior art; Fig. 3 is a flowchart of method embodiment 1 of the present invention;
Fig. 4 is a flowchart of method embodiment 2 of the present invention;
Fig. 5a is a diagram of one system networking structure corresponding to method embodiment 2 of the present invention; Fig. 5b is a diagram of another system networking structure corresponding to method embodiment 2 of the present invention; Fig. 6 is a diagram of the system networking structure corresponding to method embodiment 3 of the present invention;
Fig. 7 is a flowchart of method embodiment 3 of the present invention;
Fig. 8 is a diagram of the system networking structure corresponding to method embodiment 4 of the present invention;
Fig. 9 is a flowchart of method embodiment 4 of the present invention;
Fig. 10 is a flowchart of method embodiment 5 of the present invention;
Fig. 11 is a structural diagram of the 3D sound processing in the method embodiments of the present invention;
Fig. 12 is a flowchart of method embodiment 6 of the present invention;
Fig. 13 is a basic block diagram of the blind source separation method in method embodiment 6 of the present invention; Fig. 14 is a diagram of the microphone array capturing sound signals in method embodiment 6 of the present invention; Fig. 15 is a structural diagram of system embodiment 1 of the present invention;
Fig. 16 is a structural diagram of server embodiment 1 of the present invention;
Fig. 17 is a structural diagram of the audio stream obtaining unit in server embodiment 1 shown in Fig. 16; Fig. 18 is a structural diagram of the identifier allocation unit shown in Fig. 16;
Fig. 19 is a structural diagram of the combination sending unit in server embodiment 1 shown in Fig. 16; Fig. 20 is a structural diagram of device embodiment 1 of the present invention;
Fig. 21 is a structural diagram of the audio processing unit in device embodiment 1 shown in Fig. 20; Fig. 22 is a structural diagram of device embodiment 2 of the present invention;
Fig. 23 is a structural diagram of device embodiment 3 of the present invention;
Fig. 24 is a structural diagram of the orientation calculating unit shown in Fig. 23.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in further detail below with reference to the embodiments and the drawings. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not limit it.
Method embodiments
Method embodiment 1
Method embodiment 1 of the present invention can be illustrated with reference to Fig. 3.
301: For a terminal, the server acquires at least one audio stream relative to the terminal. In implementing 301, acquiring, by the server and for a terminal, the audio streams relative to the terminal may specifically be: the server acquires the energies of multiple audio streams relative to the terminal, and selects the at least one stream with the largest energy according to the acquired energies.
It will be appreciated that acquiring the few highest-energy audio streams for a terminal is only one implementation; all the audio streams may also be acquired, in which case no energy calculation is needed and the relevant streams are obtained directly.
302: The server allocates identifiers to the acquired at least one audio stream relative to the terminal.
In implementing 302, the identifier the server allocates to the at least one audio stream relative to the terminal may specifically be a site number or a terminal number; of course, it may also be allocated manually by the conference administrator or in real time by the conference management system.
For example, when a site has only one terminal, using the site number to identify the at least one highest-energy audio stream acquired from that terminal in step 301 causes no confusion. However, when a site has several terminals, the site number can no longer be used to identify the audio streams. Since each terminal acquires a different audio stream, in order to distinguish the multiple streams from the same site, each of them is assigned a sequence number, which may be the terminal number corresponding to the stream. The server assigns a sequence number to each terminal connected to it; when an audio stream relative to a certain terminal is acquired according to step 301, the identifier allocated to the stream in this step may be the terminal number of the corresponding terminal. In this way the streams acquired from the various terminals can be distinguished more effectively.
If the audio stream acquired from the terminal in step 301 also carries the orientation information of the sound source corresponding to the audio signal in the stream, the identifier allocated to the stream may be a combination of the terminal number and the orientation information. The orientation information is generally carried in the RTP (Real-time Transport Protocol, commonly used to transmit video, audio and other data with strict real-time requirements) packet header of the stream. After the stream relative to the terminal is acquired according to step 301, the server obtains the orientation information by inspecting the RTP packet header, for example by checking a flag in a header field — the flag being set by the terminal — to judge whether orientation information is present; detection by setting a 0 or 1 value is also possible, and those skilled in the art can implement various detection methods with ordinary technical knowledge. The terminal number of the corresponding terminal is then combined with the orientation information in the stream into an identifier, which is allocated to the stream. Since the orientation information in each stream is necessarily different, an identifier combining the site number with the orientation information may also be used.
From the above it can be understood that in the embodiments of the present invention the identifier of an audio stream is merely a code assigned to the stream in order to distinguish it; other methods of obtaining identifiers can be derived from the embodiments of the present invention, and the embodiments impose no restriction on this.
303: The server combines the acquired at least one audio stream relative to the terminal with the identifiers corresponding to the at least one audio stream and sends the combination to the target terminal.
In implementing 303, the server may combine the acquired at least one audio stream relative to the terminal with the corresponding identifiers in the following ways:
a loose combination, in which the acquired audio code streams are not modified at all, and when each frame of audio data is protocol-encapsulated, the identifier allocated to the at least one audio stream in step 302 is added to the protocol packet header;
and/or
a tight combination, in which the acquired mono audio code streams are transcoded and integrated into one multi-channel code stream, and the identifiers allocated in step 302 to the at least one audio stream for the respective channels are added to the frame header of the multi-channel code stream.
It should be noted that the server may combine the audio streams relative to the terminal with their identifiers entirely in the loose manner, entirely in the tight manner, or in a mixture of loose and tight combination.
The identifier of an audio stream may be placed in the protocol header of the IP packet or in the frame header of the audio frame.
With the technical solution of this embodiment of the present invention, a terminal can freely position the sound images of other terminals according to the received audio streams of the other terminals and the identifiers assigned to those streams; in particular, when an audio stream includes the orientation information of its sound source, the terminal can position the sound image more accurately according to that orientation information. Method embodiment 2
Method embodiment 2 of the present invention mainly describes the case of a single server; its processing can be illustrated with the flowchart drawn in Fig. 4.
401: The server acquires the audio streams corresponding to the respective terminals.
In implementing 401, each terminal generally corresponds to a site; each terminal acquires the audio stream of its site, and the server serving the terminals acquires the audio streams corresponding to them.
402: The server calculates the energy of the acquired audio streams and selects the at least one stream with the largest energy.
In implementing 402, the server calculates the energy of each audio stream acquired in 401 and, according to the results, selects the at least one highest-energy stream as the finally selected stream;
wherein the following methods may be used in calculating the energy of an audio stream:
(1) calculate the audio energy of the decoded stream within one frame in the time domain, and average over multiple frames; or
(2) calculate the audio energy of the decoded stream within the relevant frequency range in the frequency domain, and average over multiple frames; or
(3) decode the quantization factors of the stream and thereby estimate its energy. The above calculations fall into two classes: decoding-based methods, corresponding to (1) and (2), and non-decoding estimation, corresponding to (3). Two classes are used because of protocol differences: for some audio protocols (e.g., G.723.1 and G.729) the energy of the stream can only be computed by fully decoding it, whereas for others (e.g., G.722.1 and AAC-LD) decoding only certain parameters of the stream suffices to estimate its energy.
After the energy of the streams has been estimated, the at least one highest-energy stream can be selected as the chosen stream according to the conference policy.
It will be appreciated that computing the energy of each stream and selecting the highest-energy streams is only one implementation; the energies need not be computed, and the audio streams of all participating sites may instead all be taken as the selected streams.
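Method (1) above — time-domain frame energy averaged over several frames — followed by the selection of the highest-energy streams can be sketched as follows; the 160-sample frame (20 ms at 8 kHz) is an illustrative choice, not mandated by the text.

```python
import numpy as np

def average_frame_energy(samples, frame_len=160):
    """Per-frame time-domain energy of decoded PCM, averaged over whole frames."""
    n_frames = len(samples) // frame_len
    frames = np.reshape(np.asarray(samples, dtype=np.float64)[:n_frames * frame_len],
                        (n_frames, frame_len))
    return float(np.mean(np.sum(frames ** 2, axis=1)))

def select_loudest(streams, k=2):
    """streams: dict id -> decoded samples; returns ids of the k highest-energy streams."""
    return sorted(streams, key=lambda i: average_frame_energy(streams[i]), reverse=True)[:k]
```

For codecs where full decoding is unnecessary (method (3)), the same `select_loudest` shape applies with a quantization-factor-based energy estimate substituted for `average_frame_energy`.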
403: The server obtains the identification information corresponding to the selected at least one audio stream.
In implementing 403, the identification information corresponding to the selected at least one stream is obtained.
Specifically, the identification information of a selected stream may be the site number or terminal number corresponding to the stream; if the acquired stream includes the orientation information of the sound source corresponding to its audio signal, a "terminal number plus orientation information" or "site number plus orientation information" combined identifier may also be used. The site number alone is generally used as the identifier when the site has only one terminal.
The "terminal number plus orientation information" or "site number plus orientation information" combined identifier is generally used when the site has one or more terminals and the streams contain the orientation information of the sound sources corresponding to their audio signals. The orientation information corresponding to each stream can be obtained by inspecting its RTP packet header.
It will be appreciated that in the embodiments of the present invention the identifier of an audio stream is merely a code assigned to the stream in order to distinguish it; for example, it may also be allocated manually by the conference administrator or in real time by the conference management system. Other methods of obtaining identifiers can therefore be derived from the embodiments of the present invention, and the embodiments impose no restriction on this.
404: The server combines the selected audio streams with the obtained identification information. In implementing 404, the selected at least one stream is combined with the obtained identification information of the selected streams.
The ways of combining include:
a loose combination, in which the acquired audio code streams are not modified at all, and when each frame of audio data is protocol-encapsulated, the identifier of the at least one stream obtained in step 403 is added to the protocol packet header;
and/or
a tight combination, in which the acquired mono audio code streams are transcoded and integrated into one multi-channel code stream, and the stream identifiers for the respective channels, i.e., the identifiers obtained in step 403, are added to the frame header of the multi-channel code stream.
It should be noted that the streams relative to the terminal may be combined with their identifiers entirely in the loose manner, entirely in the tight manner, or in a mixture of loose and tight combination.
405: The server sends the streams combined with the identification information to the corresponding target terminals according to the appropriate sending policy.
In implementing 405, the streams combined with the identification information are sent to the corresponding target terminals, for example using the following policy:
namely, if the selected streams include the stream acquired from a particular terminal, the streams sent to that terminal are the remaining selected streams with that terminal's own stream removed; if the selected streams do not include the stream acquired from a particular terminal, all the selected streams are sent to that terminal.
To explain this sending policy more clearly, refer to Fig. 5a. Fig. 5a contains four terminals and one server, each terminal corresponding to a site, so the terminal numbers here are also site numbers. The dashed lines from the terminals to the server mean that each terminal uploads the audio stream it has captured; the solid lines from the server to the terminals mean that the server sends the selected streams down to the terminals. Suppose that, after the server's calculation, terminals 2 and 3 are the terminals corresponding to the highest-energy streams; the server then sends streams 2 and 3 to terminals 1 and 4, sends stream 3 to terminal 2, and sends stream 2 to terminal 3.
As shown in Fig. 5b, there are still four terminals and one server, but terminals 1, 2 and 3 belong to one site (shown by the dashed box) and terminal 4 is another site. The dashed and solid lines have the same meaning as in Fig. 5a. Suppose that, after the server's calculation, terminal 2 is the terminal corresponding to the highest-energy stream in its site; the server then sends stream 4 to terminals 1, 2 and 3, and sends stream 2 to terminal 4. In this example the terminals do not correspond one-to-one with sites, so the terminal numbers here are not site numbers.
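The sending policy of step 405 reduces to excluding a terminal's own stream from the selected set; a minimal sketch, with stream identifiers standing in for the streams themselves:

```python
def streams_for_target(selected_ids, target_id):
    """Step-405 policy: a terminal never receives its own audio stream back."""
    return [s for s in selected_ids if s != target_id]
```

This reproduces the Fig. 5a example: with streams 2 and 3 selected, terminal 2 receives only stream 3 while terminals 1 and 4 receive both.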
With the technical solution of this embodiment, a terminal can freely position the sound images of other terminals according to the received audio streams and the identifiers assigned to them; in particular, when an audio stream includes the orientation information of its sound source, the terminal can position the sound image more accurately according to that orientation information. Method embodiment 3
Method embodiment 3 of the present invention mainly describes the case of several cascaded servers; the structure can be illustrated with Fig. 6.
In Fig. 6 there are three servers and four terminals: terminals 1 and 2 belong to server 2, terminals 3 and 4 belong to server 3, and servers 2 and 3 are cascaded through server 1. Server 1 can be regarded as the primary server, and servers 2 and 3 as its slave servers.
For the cascaded multi-server case, the processing, referring to the flowchart of Fig. 7, is:
701: The primary server acquires the audio streams uploaded by the slave servers.
702: The primary server decomposes the streams acquired from a slave server into multiple audio streams, the number of decomposed streams being the number of terminals under that slave server.
In implementing 702, since the streams acquired from a slave server were uploaded by the terminals of that slave server, they can be decomposed into different streams according to the specific terminals.
703: The primary server calculates the energy of the decomposed streams and selects the at least one stream with the largest energy.
In implementing 703, calculating the energy of the decomposed streams and selecting the at least one highest-energy stream is similar to 402 in method embodiment 2 of the present invention and is not repeated here.
704: The primary server obtains the identification information corresponding to the selected at least one stream. In implementing 704, the primary server obtains this identification information through the slave servers, in a manner similar to 403 in method embodiment 2, which is not repeated here. 705: The primary server combines the selected streams with the obtained identification information.
In implementing 705, this step is similar to 404 in method embodiment 2 and is not repeated here.
706: The primary server sends the streams combined with the identification information to the corresponding terminals according to the appropriate sending policy.
Since this step is similar to 405 in method embodiment 2, it is not repeated here.
It will be appreciated that method embodiment 3 only presents a cascade of three servers; cascades of more servers can likewise be implemented according to the process of this embodiment.
With the technical solution of this embodiment, a terminal can freely position the sound images of other terminals according to the received audio streams and the identifiers assigned to them; in particular, when an audio stream includes the orientation information of its sound source, the terminal can position the sound image more accurately. Method embodiment 4
Method embodiment 4 of the present invention mainly describes the case combining at least one terminal with cascaded servers; the structure can be illustrated with Fig. 8.
As can be seen from Fig. 8, there are three servers: server 1 is the primary server and servers 2 and 3 are slave servers, forming a server cascade. Fig. 8 also includes six terminals: terminals 1 and 2 are under the jurisdiction of slave server 2, terminals 3 and 4 under slave server 3, and terminals 5 and 6 are directly connected to primary server 1.
The process, referring to Fig. 9, is:
901: The primary server acquires the audio streams uploaded by the slave servers and the streams of the terminals directly under its jurisdiction. 902: The primary server decomposes the streams acquired from a slave server into multiple audio streams, the number of decomposed streams being no greater than the number of terminals under that slave server.
In implementing 902, since the streams acquired from a slave server were uploaded by its terminals, they can be decomposed into different streams according to the specific terminals. The number of decomposed streams may be smaller than the number of terminals under the slave server: it is determined by whether each terminal is producing sound, and when some terminals have no site sound, the number of decomposed streams is smaller than the number of terminals under the slave server.
903: The primary server calculates the energies of the streams decomposed from those acquired from the slave servers and of the streams acquired from the terminals directly under its jurisdiction, and selects the at least one stream with the largest energy.
In implementing 903, this calculation and selection is similar to 402 in method embodiment 2 and is not repeated here.
904: The primary server obtains the identification information corresponding to the selected at least one stream. In implementing 904, this step is similar to 403 in method embodiment 2 and is not repeated here.
905: The primary server combines the selected streams with the obtained identification information.
In implementing 905, this step is similar to 404 in method embodiment 2 and is not repeated here.
906: The primary server sends the streams combined with the identification information to the corresponding terminals or slave servers according to the appropriate sending policy.
Since this step is similar to 405 in method embodiment 2, it is not repeated here. It will be appreciated that method embodiment 4 only presents a cascade of three servers plus two terminals under the primary server; cascades of more servers, and more terminals under the primary server, can likewise be implemented according to the process of this embodiment.
With the technical solution of this embodiment, a terminal can freely position the sound images of other terminals according to the received audio streams and the identifiers assigned to them; in particular, when an audio stream includes the orientation information of its sound source, the terminal can position the sound image more accurately.
Method embodiment 5
This method embodiment concerns the terminal's processing of the received audio streams; referring to Fig. 10, the process is specifically:
1001: Acquire at least one audio stream carrying identifiers, and extract the identification information from the acquired streams.
In implementing 1001, at least one stream carrying identifiers is first acquired, for example by receiving such streams sent by the server; the identification information is then extracted from the protocol header of the IP packets of the acquired streams, or from the frame headers of the audio frames.
1002: Split the streams having the same identifier according to the extracted identification information. In implementing 1002, since different streams have different identification information, the streams are split by identifier, and streams with the same identifier are assigned to the same decoding module.
1003: Allocate sound image positions to the split streams according to the extracted identification information. In implementing 1003, the identification information extracted in step 1001 can be used to allocate the sound image positions.
Sound image positions may be designated in advance by the user, i.e., a certain position is fixedly assigned to a certain terminal, or they may be allocated automatically according to the following principles:
When the identifier includes only the site number: (1) if the identifier of a stream matches the terminal being viewed, the middle sound image position is allocated; in Fig. 11 this is the virtual sound image position in front of the television. The benefit of this allocation is that the sound image position matches the image being viewed.
(2) If the audio signal energy of a terminal is large, a front sound image position is allocated, which ensures that the far-end speaker's voice comes from the front.
(3) If the audio signal energy of a terminal is small, side sound image positions are allocated; such a terminal may only be producing noise, and placing it at the sides separates the noise from the far-end speaker's voice, preserving the clarity of the speaker's voice.
When the identifier includes only the terminal number: if the terminal number in the stream matches the terminal being viewed, a sound image position matching the image is allocated; corresponding to Fig. 11, a position between the two front speakers (P2 and P3). If they do not match, side positions are allocated; corresponding to Fig. 11, a position between speakers P1 and P2 may be assigned.
When the identifier includes the terminal number and the orientation information: allocation is first performed according to the terminal number — if the terminal number in the stream matches the terminal being viewed, a position matching the image is allocated, i.e., between the two front speakers (P2 and P3) in Fig. 11; otherwise side positions are allocated, e.g., between P1 and P2. Since the identifier also includes the orientation information of the stream, the terminal number and orientation information together allow a more accurate allocation. For example, after the terminal-number allocation, if the terminal number of a stream matches the terminal being viewed and its horizontal orientation is middle-left, the speaker in the image should also be at the middle-left, and the stream's sound image can be allocated at the middle-left relative to the image — in Fig. 11, the middle-left position between the two front speakers (P2 and P3).
1004: Decode the split streams and perform 3D sound processing on the decoded streams according to their sound image position information.
In implementing 1004, the streams grouped into the same stream according to the same identification information in step 1002 are decoded, and 3D sound processing is performed on the decoded streams using the sound image position information allocated in 1003. The method embodiments of the present invention all use 3D sound processing, which is not repeated elsewhere. The purpose of 3D sound processing is to create a stereo sound field using the left and right speakers; its specific process can be illustrated with the following example, referring to Fig. 11:
In Fig. 11 the distance between speakers p1 and p2 is d, and the distance between the virtual sound image v1 and speaker p1 is w. Assuming an audio stream s1 is allocated the sound image position v1, s1 multiplied by a gain g1 can be sent to p1 and s1 multiplied by a gain g2 sent to p2, with g1 and g2 computed as: w/d = (g1 - g2)/(g1 + g2)  (1)
c = g1 × g1 + g2 × g2  (2)
In formulas (1) and (2), g1 is the left-channel amplitude gain, g2 is the right-channel amplitude gain, and c is a fixed value, for example equal to 1.
Once the gain information of the left and right channels is computed, the stereo sound field can be simulated.
This embodiment not only enables the terminal to freely position the sound images of other terminals according to the received audio streams and the identifiers assigned to them, but also separates the mixed audio signals corresponding to different sound sources and calculates the position information of the signals corresponding to the different sources, so that after the sound is played out the receiving-side terminal can faithfully simulate and reproduce the original real sound field. Method embodiment 6
An embodiment of the present invention provides a method for obtaining the orientation information of the sound sources corresponding to the audio signals in an audio stream, as shown in Fig. 12, which is a flowchart of the method. The flowchart includes the following steps:
1201: Acquire multiple audio signals from the local sound sources.
In implementing step 1201, acquiring the multiple audio signals from the local sound sources means using a microphone array composed of several microphones to collect the voice signals of several people (i.e., several sound sources) speaking locally at the same time, thereby capturing multiple sound signals and converting them into multiple audio signals. Here "local" may refer to the site where the microphone array is located. 1202: Separate the acquired multiple audio signals by sound source to obtain the audio signals corresponding to the respective sound sources.
In implementing step 1202, the acquired signals are separated by sound source using a blind source separation method.
The blind source separation method is described in detail below:
Fig. 13 is a basic block diagram of the blind source separation method shown in Fig. 12. Blind source separation means recovering or separating the source signals only from the observed mixed signals, according to the statistical characteristics of the input signals, without knowing any a priori information about the source signals or the transmission channel. That is, the source signals cannot be observed; only the mixed signals are available, and how the different source signals were mixed is also unknown. The typical observed signals are the outputs of a series of sensors, each sensor receiving a different combination of the source signals. The main task of blind source separation is to recover the source signals from the observed data. In the embodiment of the present invention, the microphone array collects the voice signals of several people speaking at the same time, obtaining multi-channel speech, and blind source separation recovers each person's voice signal from it, i.e., separates the audio signals corresponding to the several sound sources. The basic principle of blind source separation is that the observed signals, after passing through a separation system, can recover or separate the source signals. As shown in Fig. 13, N mutually statistically independent unknown source signals s = [s1(t), s2(t), ..., sN(t)]^T are transmitted through an unknown mixing system H and detected by M sensors, yielding M observed signals x = [x1(t), x2(t), ..., xM(t)]^T. The task of blind source separation is to pass the observed signals through a signal separator (i.e., a separation algorithm) so that the output y = [y1(t), y2(t), ..., yN(t)]^T is a copy or an estimate of the source signals.
The three main current approaches to blind source separation are independent component analysis, entropy maximization, and nonlinear principal component analysis.
1203. Compute the azimuth information corresponding to each sound source from the acquired audio signals and the positional relationship between the devices used to acquire the audio signals from the sound sources;
In step 1203, computing the azimuth information corresponding to each sound source from the acquired audio signals and the positional relationship between the acquisition devices specifically includes: estimating the relative delays with which the audio signals arrive at the respective devices used to acquire the audio signals from the sound sources; and computing the azimuth information corresponding to each sound source from the estimated relative delays and the positional relationship between the acquisition devices.
The sound-source localization algorithm based on delay estimation is described in detail below:
Fig. 14 is a schematic diagram of the microphone array of Fig. 12 capturing sound signals. As shown in Fig. 14, because the distance between a sound source and each microphone differs, the sound emitted by a source reaches the different microphones of the array at different times. For example, with two sound sources, the sound from source 1 reaches the individual microphones of the array at different times, as does the sound from source 2, so the audio signals corresponding to the same source are output by different microphones at different times. Therefore, the relative delays between the audio signals corresponding to each source are estimated first, and the estimated delays are then combined with the known positions of the microphones to determine the azimuth of each source. The most widely used delay-estimation algorithm is the Generalized Cross Correlation (GCC) method. It computes the cross-power spectrum of two audio signals, weights it in the frequency domain to suppress noise and reflections, and transforms it back to the time domain to obtain the cross-correlation function of the two signals. The position of the peak of the cross-correlation function is the relative delay between the two signals. Once the delays between the audio signals are obtained, combining them with the known positions of the microphones yields the azimuth information corresponding to each source.
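The GCC step can be sketched as follows, using the common PHAT weighting (dividing the cross-power spectrum by its magnitude) as the frequency-domain weighting. The far-field formula arcsin(c·τ/d) for a two-microphone pair is a standard result, not spelled out in the patent; the function names are illustrative.

```python
import numpy as np

def gcc_phat_delay(x1, x2, fs):
    """Relative delay (seconds) of x2 with respect to x1, via GCC-PHAT."""
    n = len(x1)
    f1, f2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = f2 * np.conj(f1)                    # cross-power spectrum
    cross /= np.maximum(np.abs(cross), 1e-12)   # PHAT weighting
    cc = np.fft.irfft(cross, n)                 # back to time domain
    k = int(np.argmax(cc))                      # peak position = delay in samples
    if k > n // 2:                              # map circular lag to signed lag
        k -= n
    return k / fs

def azimuth(tau, mic_dist, c=343.0):
    """Far-field direction of arrival (radians) for a microphone pair
    separated by mic_dist meters, given the relative delay tau."""
    return np.arcsin(np.clip(c * tau / mic_dist, -1.0, 1.0))
```

On a test signal delayed by 5 samples at 8 kHz, the peak of the PHAT-weighted cross-correlation lands exactly at lag 5, giving τ = 625 µs; with a 30 cm microphone spacing this corresponds to a source well off to one side of the pair.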
1204. Send an audio stream that includes the audio signals and azimuth information corresponding to the local sound sources.
Here, the azimuth information may be placed in the RTP packet header of the audio stream, so that the stream carrying the azimuth information is sent. When the azimuth is placed in the packet header, a flag may be set in the corresponding header field so that the server, on receiving the audio stream, detects the azimuth information in the header according to the flag; alternatively, a 0 or 1 value may mark whether the header carries azimuth information. In short, a person skilled in the art can configure this using ordinary technical knowledge, so that the server detects the azimuth information in the header after receiving the audio stream.
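The patent leaves the exact wire format to the implementer. One possible encoding follows the RTP fixed header of RFC 3550: set the X (extension) bit as the flag, and append a one-word header extension carrying the azimuth. The profile constant, field sizes, and degree encoding below are assumptions for illustration, not from the patent:

```python
import struct

AZIMUTH_EXT_PROFILE = 0xBEDE  # illustrative extension profile, not from the patent

def add_azimuth(rtp_header, azimuth_deg):
    """Set the X bit of a 12-byte RTP fixed header and append a one-word
    extension carrying the source azimuth (degrees, -90..90), so that the
    receiving server can detect the azimuth from the flag."""
    first = rtp_header[0] | 0x10                       # X (extension) bit
    ext = struct.pack("!HHbxxx", AZIMUTH_EXT_PROFILE, 1, azimuth_deg)
    return bytes([first]) + rtp_header[1:] + ext

def read_azimuth(packet, header_len=12):
    """Return the azimuth in degrees if the X bit marks our extension, else None."""
    if not packet[0] & 0x10:                           # flag absent: no azimuth
        return None
    profile, _words, azi = struct.unpack_from("!HHb", packet, header_len)
    if profile != AZIMUTH_EXT_PROFILE:
        return None
    return azi
```

The server side only needs to test the flag bit and unpack the extension word, matching the detection behavior described above.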
It should be noted that the method of this embodiment obtains the azimuth information of the sound sources, and therefore does not conflict with the 3D sound processing methods of the preceding embodiments. The method of this embodiment may be carried out before the 3D sound processing, for example obtaining the azimuth information of the sound sources at the site of the microphone array before step 1001 of Embodiment 5, in which case the far end is listening to the local end; or it may be carried out after the 3D sound processing of Embodiment 5, in which case the local end is replying to what the far end said, based mainly on the state of the sound sources at the site of the microphone array. It follows that the method of this embodiment can fully coexist with the method of Embodiment 5, for example in the same terminal, which can then implement both the method of Embodiment 5 and the method of this embodiment.
The terminal of this embodiment can obtain the azimuth information of the sound sources and place the obtained azimuth information in the packet header of the audio stream it sends, so that the server can assign it to the audio stream as an identifier according to the azimuth information in the stream. System Embodiments
System Embodiment 1
The system embodiment of the present invention can be explained with reference to Fig. 15.
The server 1200 is configured to obtain, for a terminal, at least one audio stream relative to the terminal; assign identifiers to the obtained at least one audio stream relative to the terminal; and combine the obtained at least one audio stream relative to the terminal with the corresponding identifiers and send them to a target terminal.
The at least one target terminal 1300 is configured to obtain the at least one identified audio stream, extract the identifiers of the audio streams, demultiplex the audio streams having the same identifier according to the identifiers, allocate a sound-image position to each demultiplexed stream according to the extracted identifier information, decode the demultiplexed streams, and perform 3D sound processing on the demultiplexed streams according to their sound-image position information. The target terminal 1300 is further configured to acquire multiple audio signals from the sound sources at the site where the terminal is located, perform sound-source separation on the acquired signals to obtain the audio signal corresponding to each sound source, compute the azimuth information corresponding to each sound source from the acquired signals and the positional relationship between the acquisition devices, and send to the server an audio stream including the audio signals and azimuth information corresponding to the sound sources.
With the technical solution of this embodiment of the present invention, a terminal can freely position the sound images of other terminals according to the received audio streams and their assigned identifiers. System Embodiment 2
Referring to the structural diagram of Fig. 6, on the basis of System Embodiment 1 this system embodiment includes a master server, namely server 1 in Fig. 6, configured to obtain, for a terminal, at least one audio stream relative to the terminal, assign identifiers to the obtained streams, combine the streams with their corresponding identifiers and send them to the terminal, and further configured to decompose the combined, identified audio streams from the at least one slave server into multiple audio streams; and at least one slave server, namely servers 2 and 3 in Fig. 6, configured to obtain the audio streams of the terminals or other servers under its own administration, and combine the obtained audio streams with the identifiers of those streams.
With the technical solution of this embodiment of the present invention, a terminal can freely position the sound images of other terminals according to the received audio streams and their assigned identifiers. Apparatus Embodiments
Server Embodiment
This embodiment provides a server for signal processing in a 3D audio conference. Referring to Fig. 16, the server includes:
an audio stream obtaining unit 161, configured to obtain, for a terminal, at least one audio stream relative to the terminal;
an identifier assignment unit 162, configured to assign identifiers to the obtained at least one audio stream relative to the terminal; and
a combining and sending unit 163, configured to combine the obtained at least one audio stream relative to the terminal with the corresponding identifiers and send them to the terminal.
As shown in Fig. 17, the audio stream obtaining unit 161 of this embodiment may include: an audio stream energy obtaining module 1611, configured to obtain the energies of the multiple audio streams relative to the terminal; and an audio stream selection module 1612, configured to select, according to the obtained energies of the multiple audio streams, the at least one audio stream with the highest energy.
The audio stream obtaining unit 161 of this embodiment may further include:
a detection module 1613, configured to detect, in the headers of the obtained audio streams, the azimuth information of the sound sources corresponding to the audio signals.
As shown in Fig. 18, the identifier assignment unit 162 of this embodiment may include:
a site/terminal number obtaining module 1621, configured to obtain the site number of the site and/or the terminal number of the site terminal of each of the at least one highest-energy audio stream;
an identifier combining module 1622, configured to combine the azimuth information detected by the detection module 1613 with the site number or terminal number obtained by the site/terminal number obtaining module 1621 into a second identifier; and an identifier assignment module 1623, configured to assign the site number or terminal number obtained by the site/terminal number obtaining module 1621 to the audio stream as a first identifier, and further configured to assign the second identifier combined by the identifier combining module 1622 to the audio stream.
The combining and sending unit 163 specifically includes the following modules, with reference to Fig. 19: a first combining module 1631, configured to leave the selected audio code streams unchanged and, when each frame of audio data is encapsulated according to the protocol, add the identifier assigned to the at least one audio stream to the protocol packet header; and/or a second combining module 1632, configured to transcode the selected mono audio code streams, integrate the transcoded mono streams into one multi-channel code stream, and add to the frame header of the multi-channel stream the identifiers assigned to the at least one audio stream for the respective channels.
With the technical solution of this embodiment of the present invention, a terminal can freely position the sound images of other terminals according to the received audio streams and their assigned identifiers; in particular, when an audio stream includes the azimuth information of a sound source, the terminal can position the sound image more accurately according to that azimuth information. Device Embodiments
Device Embodiment 1
An embodiment of the present invention further provides a terminal for signal processing in a 3D audio conference. Referring to Fig. 20, it includes:
an obtaining unit 171, configured to obtain at least one audio stream carrying an identifier;
an audio processing unit 172, configured to extract identifier information from the at least one audio stream obtained by the obtaining unit 171, demultiplex the audio streams according to the identifier information, and decode the multiple audio streams separately;
a sound-image position allocation unit 173, configured to allocate sound-image positions to the decoded multiple audio streams according to the identifier information extracted by the audio processing unit; when the identifier includes the azimuth information of the corresponding sound source, the sound-image position allocation unit allocates the sound-image positions accurately according to the azimuth information; and
a 3D sound processing unit 174, configured to perform 3D sound processing on the decoded multiple audio streams according to the allocated sound-image positions.
In implementing this embodiment of the present invention, the audio processing unit 172 specifically includes, with reference to Fig. 21: an identifier extraction module 1721, configured to extract identifier information from the obtained identified multiple audio streams; an allocation module 1722, configured to allocate the audio streams according to the extracted identifier information; and a decoding module 1723, configured to decode the multiple audio streams separately.
With the technical solution of this embodiment of the present invention, a terminal can freely position the sound images of other terminals according to the received audio streams and their assigned identifiers; in particular, when an audio stream includes the azimuth information of the corresponding sound source, the sound images can be allocated more accurately using that azimuth information, and 3D sound processing is then performed on the decoded multiple audio streams according to the allocated sound-image positions. Device Embodiment 2
On the basis of Device Embodiment 1 above, the terminal may further include, with reference to Fig. 22: an audio encoding unit 175, configured to encode the acquired audio signals. Device Embodiment 3
On the basis of Device Embodiments 1 and 2 above, the terminal may further include, with reference to Fig. 23: a multi-channel audio signal obtaining unit 176, configured to acquire multiple audio signals from the local sound sources before the terminal receives the multiple audio streams sent by the server or after the terminal performs 3D sound processing on the received multiple audio streams; a sound-source separation unit 177, configured to perform sound-source separation on the acquired audio signals to obtain the audio signal corresponding to each sound source; an azimuth computation unit 178, configured to compute the azimuth information corresponding to each sound source from the acquired audio signals and the positional relationship between the devices used to acquire the audio signals from the sound sources; and a sending unit 179, configured to send an audio stream including the audio signals and azimuth information corresponding to the sound sources. As shown in Fig. 24, the azimuth computation unit 178 of this embodiment may include a delay estimation module 1781, configured to estimate the relative delays with which the audio signals arrive at the respective devices used to acquire the audio signals from the sound sources, and a sound-source localization module 1782, configured to compute the azimuth information corresponding to each sound source from the relative delays estimated by the delay estimation module 1781 and the positional relationship between the devices used to acquire the audio signals from the sound sources.
With the technical solution of this embodiment of the present invention, not only can a terminal freely position the sound images of other terminals according to the received audio streams and their assigned identifiers, but the audio signals corresponding to different sound sources that are mixed together can also be separated, and the position information of the audio signals corresponding to the different sources computed, so that after the sound is output the receiving terminal can faithfully reproduce the original sound field.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The specific embodiments described above further detail the objectives, technical solutions, and beneficial effects of the present invention. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A signal processing method for a 3D audio conference, characterized in that the method comprises: obtaining, by a server and for a terminal, at least one audio stream relative to the terminal; and assigning, by the server, identifiers to the obtained at least one audio stream relative to the terminal;
combining, by the server, the obtained at least one audio stream relative to the terminal with the identifiers corresponding to the at least one audio stream, and sending them to a target terminal.
2. The signal processing method according to claim 1, wherein the audio stream carries azimuth information of the sound source corresponding to the audio signal.
3. The signal processing method according to claim 2, wherein the assigning, by the server, of identifiers to the obtained at least one audio stream relative to the terminal comprises:
combining, by the server, the azimuth information with a terminal number into a first identifier and assigning it to the at least one highest-energy audio stream among the obtained audio streams; or
combining, by the server, the azimuth information with a site number into a second identifier and assigning it to the at least one highest-energy audio stream among the obtained audio streams; or
assigning, by the server, a terminal number or a site number as a third identifier to the at least one highest-energy audio stream among the obtained audio streams.
4. The signal processing method according to claim 2, wherein the combining, by the server, of the obtained at least one audio stream relative to the terminal with the identifiers corresponding to the at least one audio stream specifically comprises the following manners:
loose combination, i.e., leaving the obtained audio code streams unchanged and, when each frame of audio data is encapsulated according to the protocol, adding the identifier assigned to the at least one audio stream to the packet header of the audio data frame;
and/or
tight combination, i.e., transcoding the obtained mono audio code streams, integrating the transcoded mono audio code streams into one multi-channel code stream, and adding to the frame header of the multi-channel code stream the identifiers assigned to the at least one audio stream for the respective channels.
5. The signal processing method according to claim 2, wherein the obtaining, by the server and for a terminal, of at least one audio stream relative to the terminal specifically comprises one of the following manners:
in the case of a single server, obtaining, by the single server, the at least one highest-energy audio stream among the audio streams respectively sent by multiple terminals; or
in the case of multiple cascaded servers, obtaining, by the master server among the multiple servers, the at least one highest-energy audio stream among the multiple audio streams obtained by the multiple slave servers corresponding to the master server, the multiple audio streams obtained by each slave server being the audio streams respectively sent by the multiple terminals corresponding to that slave server; or
in the case of at least one terminal coexisting with multiple cascaded servers, obtaining, by the master server among the multiple servers, the audio streams sent by the at least one terminal, and obtaining, by the master server, the multiple audio streams obtained by the multiple slave servers corresponding to the master server, the multiple audio streams obtained by each slave server being the audio streams respectively sent by the multiple terminals corresponding to that slave server; and obtaining, by the master server, the at least one highest-energy audio stream from the obtained audio streams sent by the at least one terminal and the obtained multiple audio streams of the multiple slave servers.
6. A server for signal processing in a 3D audio conference, characterized in that the server comprises:
an audio stream obtaining unit, configured to obtain, for a terminal, at least one audio stream relative to the terminal; an identifier assignment unit, configured to assign identifiers to the obtained at least one audio stream relative to the terminal; and
a combining and sending unit, configured to combine the obtained at least one audio stream relative to the terminal with the identifiers corresponding to the at least one audio stream and send them to a target terminal.
7. The server according to claim 6, wherein the audio stream obtaining unit further comprises:
a detection module, configured to detect, in the headers of the obtained at least one audio stream, the azimuth information of the sound sources corresponding to the audio signals.
8. The server according to claim 7, wherein the identifier assignment unit comprises:
a site/terminal number obtaining module, configured to obtain the site number of the site and/or the terminal number of the site terminal of each of the at least one highest-energy audio stream; and
an identifier assignment module, configured to assign the site number or terminal number obtained by the site/terminal number obtaining module to the audio stream as a first identifier.
9. The server according to claim 8, wherein the identifier assignment unit further comprises:
an identifier combining module, configured to combine the azimuth information detected by the detection module with the site number or terminal number obtained by the site/terminal number obtaining module into a second identifier;
wherein the identifier assignment module is further configured to assign the second identifier combined by the identifier combining module to the audio stream.
10. The server according to claim 8 or 9, wherein the combining and sending unit specifically comprises the following modules:
a first combining module, configured to leave the selected audio code streams unchanged and, when each frame of audio data is encapsulated according to the protocol, add the identifier assigned to the at least one audio stream to the protocol packet header;
and/or
a second combining module, configured to transcode the selected mono audio code streams, integrate the transcoded mono audio code streams into one multi-channel code stream, and add to the frame header of the multi-channel code stream the identifiers assigned to the at least one audio stream for the respective channels.
11. A signal processing method for a 3D audio conference, characterized in that the method comprises: obtaining at least one audio stream carrying an identifier, and extracting identifier information from the obtained at least one audio stream;
demultiplexing the audio streams having the same identifier according to the extracted identifier information; allocating a sound-image position to each demultiplexed audio stream according to the extracted identifier information; and decoding the demultiplexed audio streams, and performing 3D sound processing on the decoded audio streams according to the sound-image position information of the audio streams.
12. The signal processing method according to claim 11, wherein the demultiplexing of the audio streams having the same identifier information according to the extracted identifier information is specifically:
reading the identifier information in the audio streams; and
allocating the audio streams, according to the identifier information read, to the audio stream channels having the same identifier information.
13. The signal processing method according to claim 11, wherein before the multiple audio streams are obtained or after the 3D sound processing is performed, the method further comprises:
acquiring multiple audio signals from the local sound sources;
performing sound-source separation on the acquired audio signals to obtain the audio signal corresponding to each sound source;
computing the azimuth information corresponding to each sound source from the acquired audio signals and the positional relationship between the devices used to acquire the audio signals from the sound sources; and
sending an audio stream that includes the audio signals and azimuth information corresponding to the local sound sources.
14. The signal processing method according to claim 13, wherein the computing of the azimuth information corresponding to each sound source from the acquired audio signals and the positional relationship between the devices used to acquire the audio signals from the local sound sources specifically comprises:
estimating the relative delays with which the audio signals arrive at the respective devices used to acquire the audio signals from the sound sources; and
computing the azimuth information corresponding to each sound source from the estimated relative delays and the positional relationship between the devices used to acquire the audio signals from the sound sources.
15. A terminal for signal processing in a 3D audio conference, characterized in that the terminal comprises:
an obtaining unit, configured to obtain at least one audio stream carrying an identifier;
an audio processing unit, configured to extract identifier information from the at least one audio stream obtained by the obtaining unit, demultiplex the audio streams according to the identifier information, and decode the multiple audio streams separately;
a sound-image position allocation unit, configured to allocate sound-image positions to the decoded multiple audio streams according to the identifier information extracted by the audio processing unit; and
a 3D sound processing unit, configured to perform 3D sound processing on the decoded multiple audio streams according to the allocated sound-image positions.
16. The terminal according to claim 15, wherein the audio processing unit specifically comprises:
an identifier extraction module, configured to extract identifier information from the multiple audio streams obtained by the obtaining unit;
an allocation module, configured to allocate the audio streams according to the extracted identifier information; and
a decoding module, configured to decode the multiple audio streams separately.
17. The terminal according to claim 16, wherein the terminal further comprises: an audio encoding unit, configured to encode the acquired audio signals.
18. The terminal according to claim 17, wherein the terminal further comprises: a multi-channel audio signal obtaining unit, configured to acquire multiple audio signals from the local sound sources before the terminal obtains the multiple audio streams sent by the server or after the terminal performs 3D sound processing on the obtained multiple audio streams;
a sound-source separation unit, configured to perform sound-source separation on the acquired audio signals to obtain the audio signal corresponding to each sound source;
an azimuth computation unit, configured to compute the azimuth information corresponding to each sound source from the acquired audio signals and the positional relationship between the devices used to acquire the audio signals from the sound sources; and a sending unit, configured to send an audio stream including the audio signals and azimuth information corresponding to the local sound sources.
19. The terminal according to claim 18, wherein the azimuth computation unit comprises:
a delay estimation module, configured to estimate the relative delays with which the audio signals arrive at the respective devices used to acquire the audio signals from the sound sources; and
a sound-source localization module, configured to compute the azimuth information corresponding to each sound source from the estimated relative delays and the positional relationship between the devices used to acquire the audio signals from the sound sources.
20. A 3D audio conference system, characterized in that the system comprises: a server, configured to obtain, for a terminal, at least one audio stream relative to the terminal, assign identifiers to the obtained at least one audio stream relative to the terminal, and combine the obtained at least one audio stream relative to the terminal with the identifiers corresponding to the at least one audio stream and send them to a target terminal; and
at least one target terminal, configured to obtain the at least one identified audio stream, extract the identifiers of the audio streams, demultiplex the audio streams having the same identifier according to the identifiers, allocate a sound-image position to each demultiplexed audio stream according to the extracted identifier information, decode the demultiplexed audio streams, and perform 3D sound processing on the demultiplexed audio streams according to the sound-image position information of the audio streams.
21. The conference system according to claim 20, wherein the server is a master server, and the conference system further comprises:
at least one slave server, configured to obtain the audio streams of the terminals or other servers under its own administration, and combine the obtained audio streams with the identifiers of the audio streams;
wherein the master server is further configured to decompose the combined, identified audio streams of the at least one slave server into multiple audio streams.
22. The conference system according to claim 20 or 21, wherein the target terminal is further configured to acquire multiple audio signals from the sound sources at the site where the terminal is located, perform sound-source separation on the acquired audio signals to obtain the audio signal corresponding to each sound source, compute the azimuth information corresponding to each sound source from the acquired audio signals and the positional relationship between the devices used to acquire the audio signals from the sound sources, and send to the server an audio stream including the audio signals and azimuth information corresponding to the sound sources.
PCT/CN2009/074528 2008-10-20 2009-10-20 Method, system and apparatus for processing 3D audio signals WO2010045869A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP09821590.8A EP2337328B1 (en) 2008-10-20 2009-10-20 Method, system and apparatus for processing 3d audio signal
US13/090,417 US8965015B2 (en) 2008-10-20 2011-04-20 Signal processing method, system, and apparatus for 3-dimensional audio conferencing

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN200810217091.9A CN101547265B (zh) 2008-10-20 2008-10-20 Signal processing method, device and system for a 3D audio conference
CN200810217091.9 2008-10-20
CN2008101712402A CN101384105B (zh) 2008-10-27 2008-10-27 Method, apparatus and system for three-dimensional sound reproduction
CN200810171240.2 2008-10-27

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/090,417 Continuation US8965015B2 (en) 2008-10-20 2011-04-20 Signal processing method, system, and apparatus for 3-dimensional audio conferencing

Publications (1)

Publication Number Publication Date
WO2010045869A1 true WO2010045869A1 (zh) 2010-04-29

Family

ID=42118961

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2009/074528 WO2010045869A1 (zh) 2008-10-20 2009-10-20 一种3d音频信号处理的方法、系统和装置

Country Status (3)

Country Link
US (1) US8965015B2 (zh)
EP (1) EP2337328B1 (zh)
WO (1) WO2010045869A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013122386A1 (en) 2012-02-15 2013-08-22 Samsung Electronics Co., Ltd. Data transmitting apparatus, data receiving apparatus, data transreceiving system, data transmitting method, data receiving method and data transreceiving method
WO2013122387A1 (en) 2012-02-15 2013-08-22 Samsung Electronics Co., Ltd. Data transmitting apparatus, data receiving apparatus, data transceiving system, data transmitting method, and data receiving method
WO2013122385A1 (en) 2012-02-15 2013-08-22 Samsung Electronics Co., Ltd. Data transmitting apparatus, data receiving apparatus, data transreceiving system, data transmitting method, data receiving method and data transreceiving method
US9191516B2 (en) * 2013-02-20 2015-11-17 Qualcomm Incorporated Teleconferencing using steganographically-embedded audio data
US10321256B2 (en) 2015-02-03 2019-06-11 Dolby Laboratories Licensing Corporation Adaptive audio construction
US11700335B2 (en) * 2021-09-07 2023-07-11 Verizon Patent And Licensing Inc. Systems and methods for videoconferencing with spatial audio
AU2022364987A1 (en) * 2021-10-12 2024-02-22 Qsc, Llc Multi-source audio processing systems and methods

Citations (6)

Publication number Priority date Publication date Assignee Title
CN1722752A (zh) * 2004-07-13 2006-01-18 乐金电子(中国)研究开发中心有限公司 用于三方通话的信号处理系统
CN1849834A (zh) * 2003-09-11 2006-10-18 索尼爱立信移动通讯股份有限公司 具有呼叫方定位识别的便携式设备的多方呼叫
CN101133441A (zh) * 2005-02-14 2008-02-27 弗劳恩霍夫应用研究促进协会 音源的参数联合编码
US20080051029A1 (en) * 2006-08-25 2008-02-28 Bradley James Witteman Phone-based broadcast audio identification
CN101384105A (zh) * 2008-10-27 2009-03-11 深圳华为通信技术有限公司 三维声音重现的方法、装置及系统
CN101547265A (zh) * 2008-10-20 2009-09-30 深圳华为通信技术有限公司 一种3d音频会议的信号处理方法、设备以及系统

Family Cites Families (10)

Publication number Priority date Publication date Assignee Title
US1849834A (en) * 1929-05-01 1932-03-15 Selden Co Production of pelargonic aldehyde
GB2416955B (en) * 2004-07-28 2009-03-18 Vodafone Plc Conference calls in mobile networks
DE102005008366A1 (de) 2005-02-23 2006-08-24 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Vorrichtung und Verfahren zum Ansteuern einer Wellenfeldsynthese-Renderer-Einrichtung mit Audioobjekten
US20060221869A1 (en) * 2005-03-29 2006-10-05 Teck-Kuen Chua System and method for audio multicast
US20070026025A1 (en) * 2005-04-12 2007-02-01 Aquegel Cosmetics, Llc Topical ointment and method for making and using same
JP2007019907A (ja) * 2005-07-08 2007-01-25 Yamaha Corp 音声伝達システム、および通信会議装置
EP1761110A1 (en) 2005-09-02 2007-03-07 Ecole Polytechnique Fédérale de Lausanne Method to generate multi-channel audio signals from stereo signals
EP1954019A1 (en) * 2007-02-01 2008-08-06 Research In Motion Limited System and method for providing simulated spatial sound in a wireless communication device during group voice communication sessions
US8385233B2 (en) * 2007-06-12 2013-02-26 Microsoft Corporation Active speaker identification
GB2452021B (en) * 2007-07-19 2012-03-14 Vodafone Plc identifying callers in telecommunication networks

Patent Citations (6)

Publication number Priority date Publication date Assignee Title
CN1849834A (zh) * 2003-09-11 2006-10-18 索尼爱立信移动通讯股份有限公司 具有呼叫方定位识别的便携式设备的多方呼叫
CN1722752A (zh) * 2004-07-13 2006-01-18 乐金电子(中国)研究开发中心有限公司 用于三方通话的信号处理系统
CN101133441A (zh) * 2005-02-14 2008-02-27 弗劳恩霍夫应用研究促进协会 音源的参数联合编码
US20080051029A1 (en) * 2006-08-25 2008-02-28 Bradley James Witteman Phone-based broadcast audio identification
CN101547265A (zh) * 2008-10-20 2009-09-30 深圳华为通信技术有限公司 一种3d音频会议的信号处理方法、设备以及系统
CN101384105A (zh) * 2008-10-27 2009-03-11 深圳华为通信技术有限公司 三维声音重现的方法、装置及系统

Also Published As

Publication number Publication date
US8965015B2 (en) 2015-02-24
EP2337328A1 (en) 2011-06-22
US20110194701A1 (en) 2011-08-11
EP2337328B1 (en) 2014-12-03
EP2337328A4 (en) 2013-07-24

Similar Documents

Publication Publication Date Title
CN101384105B (zh) Method, apparatus and system for three-dimensional sound reproduction
WO2010045869A1 (zh) Method, system and apparatus for processing 3D audio signals
JP5990345B1 (ja) Generating a surround sound field
US9763004B2 (en) Systems and methods for audio conferencing
CN103733602B (zh) System and method for muting audio associated with a source
US8705778B2 (en) Method and apparatus for generating and playing audio signals, and system for processing audio signals
US9049339B2 (en) Method for operating a conference system and device for a conference system
WO2012068960A1 (zh) Audio processing method and apparatus in video communication
US20090110212A1 (en) Audio Transmission System and Communication Conference Device
WO2014180371A1 (zh) Conference control method, apparatus and conference system
CN102890936A (zh) Audio processing method, terminal device and system
KR20140103290A (ko) Method and apparatus for echo cancellation in a conference system
JP5912294B2 (ja) Video conference device
CN102186049A (zh) Audio signal processing method for a site terminal, site terminal and video conference system
US20070109977A1 (en) Method and apparatus for improving listener differentiation of talkers during a conference call
WO2014094461A1 (zh) Method, apparatus and system for processing audio and video information in a video conference
CN105247854A (zh) Method and system for associating an external device to a video conference session
WO2010105695A1 (en) Multi channel audio coding
CN109218948B (zh) Hearing aid system, system signal processing unit and method for generating an enhanced electrical audio signal
CN109729109A (zh) Voice transmission method and apparatus, storage medium and electronic apparatus
EP2207311A1 (en) Voice communication device
JP3898673B2 (ja) Voice communication system, method, program and voice reproduction device
CN107195308B (zh) Audio mixing method, apparatus and system for an audio/video conference system
CN101547265B (zh) Signal processing method, device and system for a 3D audio conference
JP2010166424A (ja) Multipoint conference system, server device, audio mixing device, and multipoint conference service providing method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09821590

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 1779/KOLNP/2011

Country of ref document: IN

WWE Wipo information: entry into national phase

Ref document number: 2009821590

Country of ref document: EP