US20140369528A1 - Mixing decision controlling decode decision - Google Patents

Mixing decision controlling decode decision

Info

Publication number
US20140369528A1
Authority
US
United States
Prior art keywords
audio
frame
level data
frames
decoded
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/348,278
Inventor
Lars Henrik Ellner
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Application filed by Google LLC
Priority to US13/348,278
Assigned to GOOGLE INC. (assignment of assignors interest; assignor: ELLNER, LARS HENRIK)
Publication of US20140369528A1
Assigned to GOOGLE LLC (change of name; assignor: GOOGLE INC.)
Status: Abandoned


Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M 3/00 - Automatic or semi-automatic exchanges
    • H04M 3/42 - Systems providing special services or facilities to subscribers
    • H04M 3/56 - Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M 3/568 - Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities; audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants

Definitions

  • a variation of the RTP profile extension described herein may be used to provide a VAD decision rendered at the client to an audio mixing apparatus for purposes of making a mixing decision.
  • An example includes signaling the VAD decision in the payload header of the applicable audio codec involved.
  • an audio codec may be designed such that the VAD of a given audio frame can be detected with significantly fewer processing resources (e.g., CPU cycles) than are required to fully decode the frame.
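  • As an illustration of the codec-payload variant, one could imagine a payload header whose first byte carries the VAD flag, so that the mixer inspects a single byte instead of running the full decoder. The sketch below assumes that hypothetical layout; it does not describe any real codec's bitstream.

```python
def peek_vad_from_payload(payload: bytes) -> bool:
    """Read a hypothetical VAD flag from the first byte of an encoded audio
    frame without invoking the decoder. Bit 0x80 of byte 0 is assumed to mean
    'voice present'; a real codec would define its own payload header."""
    if not payload:
        return False  # an empty payload is treated like a DTX/no-voice frame
    return bool(payload[0] & 0x80)
```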
  • one or more other embodiments relate to signaling VAD decisions for audio frames to the audio mixer in a separate stream (e.g., out of band signaling).
  • FIG. 1 illustrates an example audio mixing environment in which various embodiments of the present disclosure may be implemented.
  • an audio mixer 130 receives from clients 105 A, 105 B, 105 C through 105 N (where “N” is an arbitrary number) client audio signals 120 A, 120 B, 120 C through 120 N, respectively.
  • the audio mixer 130 sends to each of the clients 105 A, 105 B, 105 C through 105 N a mixed audio signal 125 which, as will be described in greater detail herein, may be comprised of a subset of the incoming client audio signals 120 A, 120 B, 120 C through 120 N.
  • the clients 105 A, 105 B, 105 C through 105 N are participants (e.g., users, individuals, etc.) in a communication session (e.g., audio conference, audio conferencing session, etc.), where the clients 105 A, 105 B, 105 C through 105 N are communicating with each other by sending and receiving audio signals via the audio mixer 130 .
  • the audio mixer 130 may receive incoming audio signals from some or all of the clients 105 A, 105 B, 105 C through 105 N participating in the session.
  • the audio mixer 130 may only mix (e.g., combine) a select subset of such incoming audio signals (e.g., client audio signals 120 A, 120 B, 120 C through 120 N) to send back to the clients 105 A, 105 B, 105 C through 105 N in the form of the mixed audio signal 125 .
  • the decision (sometimes referred to herein as the “mixing decision”) as to which of the incoming client audio signals 120 A, 120 B, 120 C through 120 N should be included in the mixed audio signal 125 depends on one or more of a variety of factors, considerations, and criteria.
  • the mixed audio signal 125 sent to each of the clients 105 A, 105 B, 105 C through 105 N may vary between the clients, depending on whether or not a particular client's own audio (e.g., client audio signals 120 A, 120 B, 120 C through 120 N) was included in the mixed audio generated by the audio mixer 130 .
  • For example, if client audio signal 120 B was mixed as a result of a mixing algorithm applied by the audio mixer 130 , but client audio signal 120 A was not mixed, then the mixed audio signal 125 sent to client 105 A will be different from the mixed audio signal sent to client 105 B.
  • the mixed audio signal 125 sent to client 105 A will contain the mixed audio of all the clients whose audio was included in the mix, while the mixed audio signal 125 sent to client 105 B will be similar but with client 105 B's own audio filtered out (e.g., since the client does not want to hear his or her own audio).
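  • One way to picture this per-client output is to subtract each recipient's own contribution from the common mix. The sketch below assumes decoded frames are lists of 16-bit PCM samples keyed by client id; the function name and frame length are illustrative, not taken from the disclosure.

```python
from typing import Dict, Iterable, List

def per_client_mixes(mixed: Dict[str, List[int]],
                     recipients: Iterable[str],
                     frame_len: int = 160) -> Dict[str, List[int]]:
    """Build the outgoing frame for each recipient: the sum of all mixed
    clients' samples, minus the recipient's own contribution (if any)."""
    out: Dict[str, List[int]] = {}
    for recipient in recipients:
        acc = [0] * frame_len
        for sender, samples in mixed.items():
            if sender == recipient:
                continue  # a client does not hear his or her own audio
            for i, s in enumerate(samples[:frame_len]):
                acc[i] += s
        # clamp to the 16-bit PCM range after summation
        out[recipient] = [max(-32768, min(32767, v)) for v in acc]
    return out
```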
  • FIG. 2 illustrates an example audio mixing apparatus along with incoming and outgoing data flows according to at least some embodiments of the present disclosure.
  • the example audio mixer 230 shown includes a mixer control unit 240 , a mixer unit 260 , and a receiver unit 235 .
  • the receiver unit 235 includes a decoder 255 and a packet buffer 265 .
  • a set of client audio signals 220 A, 220 B, 220 C, through 220 N may be received at the receiver unit 235 and processed by the decoder 255 .
  • the client audio signals 220 A, 220 B, 220 C, through 220 N may be received at the audio mixer 230 and, more specifically, at the receiver unit 235 , from one or more audio channels that provide the client audio signals 220 A, 220 B, 220 C, through 220 N as audio packets containing segments of the audio signals.
  • the client audio signals 220 A, 220 B, 220 C, through 220 N may be RTP packets containing data corresponding to segments of audio signals (e.g., generated by clients 105 A through 105 N as shown in FIG. 1 ).
  • RTP packets comprising the incoming client audio signals 220 A, 220 B, 220 C, through 220 N may have extended RTP headers containing VAD data.
  • the audio mixer 230 produces, as output, one or more mixed audio signals 225 .
  • the mixed audio signals 225 may be generated as a result of a mixing algorithm being applied by the audio mixer 230 .
  • the control unit 240 includes a memory 245 , a decoded frame set 270 , an encoded frame set 275 , and a VAD decision set 280 .
  • the mixer control unit 240 also includes, or is operably connected to, a voice activity detection unit 250 .
  • the voice activity detection unit 250 may be configured to perform a variety of operations on audio frames received at the audio mixer 230 from the client audio signals 220 A, 220 B, 220 C, through 220 N.
  • the audio frame may be decoded by the decoder unit 255 before being sent to the voice activity detection unit 250 for voice-activity-detection (VAD) processing.
  • one or more of the Decoded Frame Set 270 , the Encoded Frame Set 275 , and the VAD Decision Set 280 may be designated portions of a physical memory of the audio mixer 230 , buffers implemented in a physical memory of the audio mixer 230 , or may be stored in such designated portions or buffers, or may be any combination of the same. Additionally, because the Decoded Frame Set 270 and the Encoded Frame Set 275 may store decoded and encoded frames, respectively, while the VAD Decision Set 280 stores data related to voice activity, the Decoded Frame Set 270 and the Encoded Frame Set 275 may be contained in a memory type different than a memory type containing the VAD Decision Set 280 .
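  • A minimal sketch of how the three sets might be represented in memory, assuming plain in-process containers keyed by client id (the disclosure leaves the concrete memory layout open):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class MixerState:
    """In-memory stand-ins for the Decoded Frame Set 270, the Encoded Frame
    Set 275, and the VAD Decision Set 280, each keyed by client id."""
    decoded_frames: Dict[str, List[int]] = field(default_factory=dict)  # set D
    encoded_frames: Dict[str, bytes] = field(default_factory=dict)      # set E
    vad_decisions: Dict[str, bool] = field(default_factory=dict)        # set V

    def clear(self) -> None:
        """Reset all three sets at the end of a mix cycle (cf. step 450)."""
        self.decoded_frames.clear()
        self.encoded_frames.clear()
        self.vad_decisions.clear()
```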
  • the audio mixer 230 may also include other audio mixing components in addition to or instead of the example components illustrated in FIG. 2 .
  • Such other components may similarly be designed or configured to be capable of combining (e.g., mixing) audio signals received from a plurality of participants communicating with each other during a communication session (e.g., an audio conference) based on an audio mixing algorithm such as the one described herein.
  • FIG. 3 illustrates an example process for receiving and storing a VAD decision (e.g., VAD data) from a client participating in an audio conference.
  • the example process illustrated in FIG. 3 and described in greater detail below may be performed by an audio mixing apparatus or conferencing server (e.g., audio mixer 130 as shown in FIG. 1 ).
  • the process may be performed by the audio mixing apparatus during an audio conferencing session in which a number of clients (e.g., clients 105 A, 105 B, 105 C through 105 N as shown in FIG. 1 ) participating in the session are sending audio streams to the mixing apparatus in the form of RTP packets containing encoded audio frames (e.g., audio data generated and encoded by, for example, microphones or other audio capture devices being used by the participating clients).
  • an audio frame is received (e.g., at an audio mixer) from a client during a given mix period (e.g., mix cycle, mixing window, etc.).
  • the audio frame received at step 300 is an encoded frame contained in an RTP packet, and may be contained in the RTP packet along with one or more additional encoded audio frames.
  • the audio frame received may be contained in one of a plurality of RTP packets transmitted from clients and received at an audio mixer during the particular mix period.
  • the RTP packets received from the clients may be stored in a buffer of the audio mixer (e.g., packet buffer 265 of audio mixer 230 as shown in FIG. 2 ).
  • a determination is made in step 305 as to whether a VAD decision (e.g., VAD data indicating an audio level of the received frame) can be extracted without decoding the frame.
  • clients participating in the audio conference send audio data to the audio conference server as RTP packets with an extended RTP header in which the clients can indicate an audio level (e.g., a VAD decision) of the packets' payload.
  • the RTP header extension carries the VAD decision of the audio contained in the RTP payload of the packet to which the header extension corresponds.
  • the determination made in step 305 may include determining whether the audio frame was received from a client using the extended RTP header. If so, then the VAD decision can be extracted from the extended RTP header without decoding the frame and the process moves to step 330 where the encoded frame is stored.
  • the determination made in step 305 about whether a VAD decision (e.g., VAD data indicating an audio level of the received frame) can be extracted without decoding the frame may be performed by a receiver unit of the audio mixer (e.g., receiver unit 235 as shown in FIG. 2 ), or by some other component or element of the audio mixer.
  • such a receiver unit may be designed or configured in a manner such that it is capable of determining whether or not a given packet is received with an extended header attribute indicating that the packet includes an RTP header extension as described above.
  • one or more other components of the audio mixer may be responsible for determining whether or not a VAD decision can be extracted from a received frame without decoding the frame in addition to or instead of a receiver unit of the audio mixer as described above.
  • numerous other approaches may be used to render such a determination in addition to or instead of by way of examining a received packet for an extended header attribute.
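  • A minimal sketch of such a receiver-side check, assuming packets are parsed into a simple structure that exposes header-extension elements as an id-to-bytes mapping and that the audio level occupies a one-byte element with a VAD flag in its top bit; the structure, extension id, and byte layout are assumptions of this sketch, not details fixed by the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical extension element id carrying the audio level / VAD flag.
AUDIO_LEVEL_EXT_ID = 1

@dataclass
class RtpPacket:
    ssrc: int                                   # identifies the sending client
    payload: bytes                              # encoded audio frame(s)
    extensions: Dict[int, bytes] = field(default_factory=dict)

def extract_vad_without_decoding(packet: RtpPacket) -> Optional[bool]:
    """Return the VAD decision carried in the extended RTP header, or None if
    the packet was not sent with the audio-level extension (cf. step 305)."""
    element = packet.extensions.get(AUDIO_LEVEL_EXT_ID)
    if element is None or len(element) < 1:
        return None     # mixer must fall back to decoding the frame
    # Assumed layout: top bit = voice activity flag, low 7 bits = audio level.
    return bool(element[0] & 0x80)
```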
  • the encoded frame may be stored (e.g., by a mixer control unit, such as mixer control unit 240 of the example audio mixer 230 as shown in FIG. 2 ) in a set of all encoded frames which, at least in the example process shown in FIG. 3 , may be represented as “E”.
  • the set of all encoded frames, E, may correspond to the encoded frame set 275 of the example audio mixer 230 shown in FIG. 2 .
  • In step 335 , the VAD decision (which was determined to be extractable without decoding the frame in step 305 ) is extracted from the encoded audio frame and stored in a set of all VAD decisions, represented as “V” in the example process shown. Similar to the set of all encoded frames, E, the set of all VAD decisions, V, may correspond to the VAD decision set 280 of the example audio mixer 230 illustrated in FIG. 2 . After the VAD decision has been extracted and stored in step 335 , the process moves to step 300 and repeats for the next received audio frame.
  • At step 305 , if it is instead determined that a VAD decision cannot be extracted without decoding the received audio frame, then the process goes to step 310 where the audio frame is decoded.
  • the audio frame may be decoded in step 310 using any state-of-the-art decoder (e.g., decoder 255 of the example audio mixer 230 shown in FIG. 2 ) suitable for the purpose, as will be appreciated by those skilled in the art.
  • Next, in step 315 , voice-activity-detection (VAD) processing is performed on the decoded audio frame to obtain a VAD decision for the frame.
  • the detection of audio activity can be performed in a number of different ways.
  • the VAD in step 315 can be based on one or more energy criteria indicating that an audio (e.g., voice) activity level in the decoded frame is above a particular background noise level.
  • the detection of voice activity in step 315 of the process may be performed by some other entity or component within, or connected to, the audio mixing apparatus.
  • the VAD described in connection with step 315 may be based on information received along with the audio frame. For example, there may be a scenario where although a VAD decision cannot be extracted from the audio frame in step 305 without first decoding the frame in step 310 , the determination of voice activity in step 315 may still be based on data generated as a result of processing that occurred remotely from the audio mixer (e.g., at the audio source, such as the client). It should be noted that in any of the various embodiments of the present disclosure, detecting voice activity in a received audio frame may be performed in accordance with the VAD procedure described in granted U.S. Pat. No. 6,993,481.
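  • One simple realization of such an energy criterion compares a frame's mean sample energy against a running estimate of the background noise level. In the sketch below, the margin, smoothing factor, and initial noise floor are arbitrary illustrative values; they are not taken from the disclosure or from U.S. Pat. No. 6,993,481.

```python
from typing import List

class EnergyVad:
    """Toy energy-based VAD: a frame counts as voice when its mean energy
    exceeds the tracked background noise level by a fixed margin."""

    def __init__(self, margin: float = 4.0, smoothing: float = 0.95):
        self.noise_floor = 1e4      # running estimate of background energy
        self.margin = margin        # how far above the floor counts as voice
        self.smoothing = smoothing

    def is_voice(self, samples: List[int]) -> bool:
        if not samples:
            return False
        energy = sum(s * s for s in samples) / len(samples)
        voiced = energy > self.margin * self.noise_floor
        if not voiced:
            # adapt the noise floor only on frames judged to be noise
            self.noise_floor = (self.smoothing * self.noise_floor
                                + (1.0 - self.smoothing) * energy)
        return voiced
```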
  • In step 320 , the decoded frame is stored in a set of all decoded frames, which is represented as “D” in at least the example process illustrated in FIG. 3 .
  • the set of all decoded frames, D, may correspond to the Decoded Frame Set 270 of the example audio mixer 230 illustrated in FIG. 2 .
  • In step 325 , the VAD decision obtained for the decoded frame in step 315 is stored in the set of all VAD decisions, V, as described above with respect to step 335 . Once the VAD decision is stored in V, the process returns to step 300 and repeats for the next audio frame received.
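  • Putting steps 300 through 335 together, the per-frame handling of FIG. 3 might be condensed as follows. The helper callables stand in for the header-extension parsing (step 305 ), the codec decoder (step 310 ), and the VAD processing (step 315 ); they are assumptions of this sketch rather than components named by the disclosure.

```python
from typing import Callable, Dict, List, Optional

def handle_received_frame(client_id: str,
                          packet,                                   # RtpPacket-like object
                          extract_vad: Callable[[object], Optional[bool]],
                          decode: Callable[[bytes], List[int]],
                          is_voice: Callable[[List[int]], bool],
                          E: Dict[str, bytes],
                          D: Dict[str, List[int]],
                          V: Dict[str, bool]) -> None:
    """Store either the encoded frame (set E) or the decoded frame (set D),
    plus a VAD decision (set V), for one received audio frame."""
    header_vad = extract_vad(packet)            # step 305
    if header_vad is not None:
        E[client_id] = packet.payload           # step 330: keep the frame encoded
        V[client_id] = header_vad               # step 335
    else:
        samples = decode(packet.payload)        # step 310
        D[client_id] = samples                  # step 320
        V[client_id] = is_voice(samples)        # steps 315 and 325
```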
  • FIG. 4 illustrates an example process for rendering a mixing decision based on available VAD decisions received from clients participating in an audio conference.
  • the example process illustrated in FIG. 4 and described in greater detail below may be performed by an audio mixing apparatus (e.g., a conferencing server, such as audio mixer 130 as shown in FIG. 1 ).
  • the process may be performed by the audio mixing apparatus during an audio conferencing session in which a number of clients (e.g., clients 105 A, 105 B, 105 C through 105 N as shown in FIG. 1 ) participating in the session are sending audio streams to the mixing apparatus in the form of RTP packets.
  • the process may be performed by the audio mixing apparatus following the receipt of such packets from clients and the storage of encoded frames, decoded frames, and VAD decisions, as described above with respect to FIG. 3 .
  • the example process shown includes steps that may be performed during a particular mix cycle (e.g., mix period, mix instance, etc.) of many mix cycles that collectively comprise an audio conferencing session involving multiple participants.
  • the process begins at step 400 where it is determined that a mixing decision is to be made. For example, as described above with respect to the process illustrated in FIG. 3 , a determination that a mixing decision is to be made may occur following the receipt of audio data packets at an audio mixer (e.g., audio mixer 130 shown in FIG. 1 ) from participating clients, and following the processing and storage of encoded frames, decoded frames, and VAD decisions obtained by the audio mixer from those data packets.
  • the frames and/or VAD decisions are stored in one or more buffers of the audio mixer (e.g., decoded frame set 270 , encoded frame set 275 , and VAD decision set 280 as shown in FIG. 2 ).
  • a set (e.g., subset) of decoded audio frames, a set (e.g., subset) of encoded audio frames, and a set (e.g., subset) of VAD decisions corresponding to each of the decoded and encoded sets are retrieved from one or more buffers of the audio mixer, where for purposes of the present description the set of decoded audio frames is represented as D′, the set of encoded audio frames is represented as E′, and the set of VAD decisions is represented as V′.
  • D′ may be retrieved from a set of all decoded frames
  • E′ may be retrieved from a set of all encoded frames
  • V′ may be retrieved from a set of all VAD decisions (e.g., the set of all decoded frames D, the set of all encoded frames E, and the set of all VAD decisions V as shown in FIG. 3 ).
  • D′, E′, and V′ may be retrieved from the decoded frame set 270 , encoded frame set 275 , and VAD decision set 280 , respectively.
  • a mixing decision algorithm (which is sometimes referred to herein simply as a “mixing decision”) may be applied based on the set of VAD decisions V′ retrieved in step 405 .
  • the mixing decision algorithm is applied in order to determine which of the audio frames in D′ and/or E′ are to be included in the mixing operation for the given mix cycle.
  • a number of different mixing decision algorithms may be used in step 410 .
  • the mixing decision could be to mix all, or a subset of all, clients that have sent audio streams from which a positive VAD decision has been extracted or for which a positive VAD decision has been rendered.
  • the mixing decision rendered in step 410 may partially depend on an analysis of signal energy present in the audio frames comprising D′ and E′.
  • the mixing decision may involve only mixing a specific number of participants with the highest signal energy in their respective audio frames. It should be understood by those skilled in the art that a variety of other mixing decision algorithms may also be applied in step 410 of the process illustrated in FIG. 4 in addition to or instead of the example algorithms described above.
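  • As an illustration of the two example criteria above (a positive VAD decision, highest signal energy), the selection of step 410 might be sketched as below. The cap on mixed participants and the assumption that a per-client energy (or header-reported level) is available without decoding are choices of this sketch only.

```python
from typing import Dict, List

def mixing_decision(V: Dict[str, bool],
                    energies: Dict[str, float],
                    max_mixed: int = 3) -> List[str]:
    """Keep clients with a positive VAD decision and, if there are more than
    max_mixed of them, keep only the loudest ones (cf. step 410)."""
    candidates = [cid for cid, voiced in V.items() if voiced]
    candidates.sort(key=lambda cid: energies.get(cid, 0.0), reverse=True)
    return candidates[:max_mixed]
```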
  • In step 415 , the audio frames to be mixed, based on the mixing decision made in step 410 , are stored in a set of encoded and decoded frames for mix (which for purposes of the present description is represented as M). Additionally, in at least some embodiments, any audio frames that are not stored as part of M may be discarded following step 415 .
  • In step 420 , a determination is made as to whether the set of encoded and decoded frames for mix, M, is empty (e.g., whether M contains any remaining audio frames to be included in the mixing operation for the present mix cycle). If it is determined in step 420 that M is not empty, then in step 425 an audio frame is removed from M. After an audio frame is removed from M in step 425 , it is determined in step 430 whether or not the removed audio frame is a decoded audio frame (e.g., a frame from the set of decoded frames D′).
  • If the removed audio frame is a decoded frame, then in step 435 the decoded audio frame is stored in a set of decoded frames for mix, represented as m for purposes of the present description.
  • If the removed audio frame is instead still encoded, the process goes to step 440 where the audio frame is decoded.
  • the decoded audio frame is then stored in the set of decoded frames for mix, m, in step 435 .
  • Following step 435 , the process returns to step 420 where it is again determined whether the set of encoded and decoded frames for mix, M, is empty. If M is not empty, then steps 425 through 435 are repeated for an audio frame that remains in M. However, if in step 420 it is determined that M is empty, the process continues to step 445 where a mixing algorithm is applied to all of the audio frames in m to generate one or more mixed audio streams.
  • the mix operation of step 445 may be performed by a mixer unit (e.g., mixer unit 260 of the example audio mixer shown in FIG. 2 ).
  • the mixing algorithm applied in step 445 of the process is based on the earlier application of the mixing decision algorithm in step 410 of the process, as described above. Stated differently, the mixing decision algorithm applied in step 410 determines the audio frames to be mixed by application of the mixing algorithm in step 445 .
  • the mix operation performed in step 445 may produce (e.g., generate) as output one or more mixed audio signals (e.g., mixed audio signal 125 as shown in FIG. 1 or mixed audio signals 225 as shown in FIG. 2 ).
  • the one or more mixed audio signals generated from mixing all of the frames included in the set of decoded frames for mix, m, may be sent from an audio mixer to the clients participating in the audio conference (e.g., clients 105 A through 105 N as shown in FIG. 1 ).
  • Following step 445 , the process goes to step 450 where the set of all decoded frames D, the set of all encoded frames E, and the set of all VAD decisions V (as described above with respect to the process shown in FIG. 3 ) are cleared for the start of the next mix cycle in step 400 .
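  • A compressed sketch of the mix cycle of FIG. 4 (steps 400 through 450 ), under the same assumptions as the earlier sketches: frames selected for the mix are decoded only if they are still encoded, summed into one output frame, and the per-cycle sets are cleared afterwards. The decode callable and the additive mix are placeholders for whatever codec and mixing algorithm a deployment actually uses.

```python
from typing import Callable, Dict, List

def run_mix_cycle(selected: List[str],                 # client ids chosen in step 410
                  E: Dict[str, bytes],                 # set of all encoded frames
                  D: Dict[str, List[int]],             # set of all decoded frames
                  V: Dict[str, bool],                  # set of all VAD decisions
                  decode: Callable[[bytes], List[int]],
                  frame_len: int = 160) -> List[int]:
    """Decode the selected frames that are still encoded (steps 420-440),
    mix everything in m (step 445), then clear D, E, and V (step 450)."""
    m: Dict[str, List[int]] = {}
    for cid in selected:
        if cid in D:                      # step 430: already decoded
            m[cid] = D[cid]
        elif cid in E:                    # step 440: decode on demand
            m[cid] = decode(E[cid])
    mixed = [0] * frame_len               # step 445: simple additive mix
    for samples in m.values():
        for i, s in enumerate(samples[:frame_len]):
            mixed[i] += s
    mixed = [max(-32768, min(32767, v)) for v in mixed]
    D.clear(); E.clear(); V.clear()       # step 450
    return mixed
```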
  • FIG. 5 is a block diagram illustrating an example computing device 500 that is arranged for determining a mixing decision to combine (e.g., mix) audio signals received from a plurality of communicating users based on voice-activity-detection (VAD) data contained in the received signals in accordance with one or more embodiments of the present disclosure.
  • computing device 500 typically includes one or more processors 510 and system memory 520 .
  • a memory bus 530 may be used for communicating between the processor 510 and the system memory 520 .
  • processor 510 can be of any type including but not limited to a microprocessor ( ⁇ P), a microcontroller ( ⁇ C), a digital signal processor (DSP), or any combination thereof.
  • Processor 510 may include one or more levels of caching, such as a level one cache 511 and a level two cache 512 , a processor core 513 , and registers 514 .
  • the processor core 513 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP Core), or any combination thereof.
  • a memory controller 515 can also be used with the processor 510 , or in some embodiments the memory controller 515 can be an internal part of the processor 510 .
  • system memory 520 can be of any type including but not limited to volatile memory (e.g., RAM), non-volatile memory (e.g., ROM, flash memory, etc.) or any combination thereof.
  • System memory 520 typically includes an operating system 521 , one or more applications 522 , and program data 524 .
  • application 522 includes a multipath routing algorithm 523 that is configured to receive and store audio frames based on one or more characteristics of the frames (e.g., encoded, decoded, contain VAD decision, etc.).
  • the multipath routing algorithm is further arranged to identify candidate sets of audio frames for consideration in a mixing decision (e.g., by an audio mixer, such as example audio mixer 230 shown in FIG. 2 ) and select from among those candidate sets audio frames to include in a mixed audio signal (e.g., mixed audio signal 125 shown in FIG. 1 ) based on information and data contained in the audio frames (e.g., VAD decisions).
  • Program Data 524 may include multipath routing data 525 that is useful for identifying received audio frames and categorizing the frames into one or more sets based on specific characteristics (e.g., whether a frame is encoded, decoded, contains a VAD decision, etc.).
  • application 522 can be arranged to operate with program data 524 on an operating system 521 such that a received audio frame is analyzed to determine its characteristics before being stored in an appropriate set of audio frames (e.g., decoded frame set 270 or encoded frame set 275 as shown in FIG. 2 ).
  • Computing device 500 can have additional features and/or functionality, and additional interfaces to facilitate communications between the basic configuration 501 and any required devices and interfaces.
  • a bus/interface controller 540 can be used to facilitate communications between the basic configuration 501 and one or more data storage devices 550 via a storage interface bus 541 .
  • the data storage devices 550 can be removable storage devices 551 , non-removable storage devices 552 , or any combination thereof. Examples of removable storage and non-removable storage devices include magnetic disk devices such as flexible disk drives and hard-disk drives (HDD), optical disk drives such as compact disk (CD) drives or digital versatile disk (DVD) drives, solid state drives (SSD), tape drives and the like.
  • Example computer storage media can include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, and/or other data.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 500 . Any such computer storage media can be part of computing device 500 .
  • Computing device 500 can also include an interface bus 542 for facilitating communication from various interface devices (e.g., output interfaces, peripheral interfaces, communication interfaces, etc.) to the basic configuration 501 via the bus/interface controller 540 .
  • Example output devices 560 include a graphics processing unit 561 and an audio processing unit 562 , either or both of which can be configured to communicate to various external devices such as a display or speakers via one or more A/V ports 563 .
  • Example peripheral interfaces 570 include a serial interface controller 571 or a parallel interface controller 572 , which can be configured to communicate with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device, etc.) or other peripheral devices (e.g., printer, scanner, etc.) via one or more I/O ports 573 .
  • An example communication device 580 includes a network controller 581 , which can be arranged to facilitate communications with one or more other computing devices 590 over a network communication (not shown) via one or more communication ports 582 .
  • the communication connection is one example of communication media.
  • Communication media may typically be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media.
  • a “modulated data signal” can be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • communication media can include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared (IR) and other wireless media.
  • computer readable media can include both storage media and communication media.
  • Computing device 500 can be implemented as a portion of a small-form factor portable (or mobile) electronic device such as a cell phone, a personal data assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application specific device, or a hybrid device that includes any of the above functions.
  • Computing device 500 can also be implemented as a personal computer including both laptop computer and non-laptop computer configurations.
  • Portions of the embodiments described herein may be implemented via, for example, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), or digital signal processors (DSPs).
  • some aspects of the embodiments described herein, in whole or in part, can be equivalently implemented in integrated circuits, as one or more computer programs running on one or more computers (e.g., as one or more programs running on one or more computer systems), as one or more programs running on one or more processors (e.g., as one or more programs running on one or more microprocessors), as firmware, or as virtually any combination thereof.
  • designing the circuitry and/or writing the code for the software and/or firmware would be well within the skill of one skilled in the art in light of the present disclosure.
  • Examples of a signal-bearing medium include, but are not limited to, the following: a recordable-type medium such as a floppy disk, a hard disk drive, a Compact Disc (CD), a Digital Video Disk (DVD), a digital tape, a computer memory, etc.; and a transmission-type medium such as a digital and/or an analog communication medium (e.g., a fiber optic cable, a waveguide, a wired communications link, a wireless communication link, etc.).
  • a typical data processing system generally includes one or more of a system unit housing, a video display device, a memory such as volatile and non-volatile memory, processors such as microprocessors and digital signal processors, computational entities such as operating systems, drivers, graphical user interfaces, and applications programs, one or more interaction devices, such as a touch pad or screen, and/or control systems including feedback loops and control motors (e.g., feedback for sensing position and/or velocity; control motors for moving and/or adjusting components and/or quantities).
  • a typical data processing system may be implemented utilizing any suitable commercially available components, such as those typically found in data computing/communication and/or network computing/communication systems.

Abstract

Methods, systems, and apparatus are provided for combining (e.g., mixing) audio signals received from a plurality of participants communicating with each other during a communication session (e.g., an audio conference) based on voice-activity-detection (VAD) data contained in extended headers of Real-time Transport Protocol (RTP) packets. An audio mixing apparatus receives RTP packets from connected clients and extracts the VAD data included in the extended RTP headers to render a mixing decision. Once a mixing decision has been made, the audio frames to be mixed are decoded while other received frames are discarded, thereby preventing processing resources from being wasted to decode frames that are never used.

Description

    FIELD OF THE INVENTION
  • The present disclosure generally relates to methods, systems, and apparatus for mixing audio signals. More specifically, aspects of the present disclosure relate to an audio mixing apparatus mixing incoming audio signals based on voice-activity-detection data received with the incoming audio signals.
  • BACKGROUND
  • In audio conferencing systems designed to handle communications involving multiple participants, an audio mixer receives audio streams from most or all of the conference participants and selectively mixes some of the received streams for forwarding to other participants. Depending on the size of the conference, the audio mixer can receive a large number of incoming audio streams at the same time. Because there is usually only a need to mix a subset of these incoming streams, a substantial number of processing resources are often wasted decoding audio frames that will never be used.
  • One approach uses voice-activity-detection (VAD) to determine which audio conference participants to mix. However, VAD is performed in the signal domain, and therefore a frame of an incoming audio stream for a particular participant must be decoded in order to determine if the frame contains voice data. As such, under this approach a decision regarding whether or not a frame should be included in a mix requires that the frame be decoded, and thus resources are expended regardless. This expenditure of resources potentially imposes limits on the size and/or number of audio conferences that a given audio mixer can support.
  • Another approach relates to using an extended Real-time Transport Protocol (RTP) header to determine which participants in an audio conference should be relayed to other participants, where the RTP header extension contains a VAD decision provided by each of the (far-end) participants connected to the relay server. This approach proposes using such an RTP header extension to render relay decisions, where the server does not perform any audio mixing.
  • SUMMARY
  • This Summary introduces a selection of concepts in a simplified form in order to provide a basic understanding of some aspects of the present disclosure. This Summary is not an extensive overview of the disclosure, and is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. This Summary merely presents some of the concepts of the disclosure as a prelude to the Detailed Description provided below.
  • One embodiment of the present disclosure relates to a method for mixing audio signals comprising: receiving, at an audio mixing apparatus, audio packets from a plurality of clients in communication with the audio mixing apparatus; retrieving audio level data contained in an extended packet header of each of the received audio packets; selecting audio frames to be mixed based on the audio level data retrieved from the extended packet headers of the audio packets; decoding the audio frames selected to be mixed; and generating a mixed audio stream by mixing the decoded audio frames.
  • In another embodiment of the disclosure, the method for mixing audio signals further comprises, in response to receiving the audio packets from the plurality of clients, determining, for each audio frame contained in the audio packets, whether audio level data can be extracted from the audio frame without decoding the audio frame.
  • In another embodiment of the disclosure, the method for mixing audio signals further comprises: in response to determining that audio level data can be extracted from the audio frame without decoding the audio frame, extracting audio level data from the audio frame; storing the audio level data extracted from the audio frame in an audio level data set; and storing the audio frame in an encoded audio frames set.
  • In yet another embodiment of the disclosure, the method for mixing audio signals further comprises: in response to determining that no audio level data can be extracted from the audio frame without decoding the audio frame, decoding the audio frame; performing audio level processing on the decoded audio frame to obtain audio level data for the decoded audio frame; storing the audio level data obtained for the decoded audio frame in the audio level data set; and storing the decoded audio frame in a decoded audio frames set.
  • In still another embodiment of the disclosure, the method for mixing audio signals further comprises: selecting a group of encoded audio frames from the encoded audio frames set and a group of decoded audio frames from the decoded audio frames set; retrieving, from the audio level data set, the audio level data stored for the group of encoded audio frames and the group of decoded audio frames; applying a mixing decision algorithm based on the audio level data retrieved from the audio level data set; and determining a set of encoded and decoded audio frames to be mixed based on the applied mixing decision algorithm.
  • In another embodiment of the disclosure, the method for mixing audio signals further comprises storing the received audio packets in a buffer of the audio mixing apparatus.
  • In another embodiment of the disclosure, the method for mixing audio signals further comprises discarding the audio frames not selected to be mixed.
  • Another embodiment of the disclosure relates to an audio mixing apparatus configured to perform operations comprising: receiving audio packets from a plurality of clients in communication with the audio mixing apparatus; retrieving audio level data contained in an extended packet header of each of the received audio packets; selecting audio frames to be mixed based on the audio level data retrieved from the extended packet headers of the audio packets; decoding the audio frames selected to be mixed; and generating a mixed audio stream by mixing the decoded audio frames.
  • In another embodiment of the disclosure, the audio mixing apparatus is further configured to perform operations comprising, in response to receiving the audio packets from the plurality of clients, determining, for each audio frame contained in the audio packets, whether audio level data can be extracted from the audio frame without decoding the audio frame.
  • In yet another embodiment of the disclosure, the audio mixing apparatus is further configured to perform operations comprising: in response to determining that audio level data can be extracted from the audio frame without decoding the audio frame, extracting audio level data from the audio frame; storing the audio level data extracted from the audio frame in an audio level data set; and storing the audio frame in an encoded audio frames set.
  • In still another embodiment of the disclosure, the audio mixing apparatus is further configured to perform operations comprising: in response to determining that no audio level data can be extracted from the audio frame without decoding the audio frame, decoding the audio frame; performing audio level processing on the decoded audio frame to obtain audio level data for the decoded audio frame; storing the audio level data obtained for the decoded audio frame in the audio level data set; and storing the decoded audio frame in a decoded audio frames set.
  • In another embodiment of the disclosure, the audio mixing apparatus is further configured to perform operations comprising: selecting a group of encoded audio frames from the encoded audio frames set and a group of decoded audio frames from the decoded audio frames set; retrieving, from the audio level data set, the audio level data stored for the group of encoded audio frames and the group of decoded audio frames; applying a mixing decision algorithm based on the audio level data retrieved from the audio level data set; and determining a set of encoded and decoded audio frames to be mixed based on the applied mixing decision algorithm.
  • In still another embodiment of the disclosure, the audio mixing apparatus is further configured to perform operations comprising storing the received audio packets in a buffer of the audio mixing apparatus.
  • In another embodiment of the disclosure, the audio mixing apparatus is further configured to perform operations comprising discarding the audio frames not selected to be mixed.
  • In other embodiments of the disclosure, the methods and apparatuses described herein may optionally include one or more of the following additional features: the audio packets received from the plurality of clients are Real-Time Transport Protocol (RTP) packets, and/or the audio level data contained in the extended packet header of each of the received audio packets includes voice data corresponding to one of the plurality of clients.
  • Further scope of applicability of the present invention will become apparent from the Detailed Description given below. However, it should be understood that the Detailed Description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this Detailed Description.
  • BRIEF DESCRIPTION OF DRAWINGS
  • These and other objects, features and characteristics of the present disclosure will become more apparent to those skilled in the art from a study of the following Detailed Description in conjunction with the appended claims and drawings, all of which form a part of this specification. In the drawings:
  • FIG. 1 is a block diagram illustrating an example audio mixing environment in which various embodiments of the present disclosure may be implemented.
  • FIG. 2 is a block diagram illustrating an example audio mixing apparatus along with incoming and outgoing data flows according to one or more embodiments described herein.
  • FIG. 3 is a flowchart illustrating an example method for receiving and storing a voice-activity-detection decision from a client participating in an audio conferencing session according to one or more embodiments described herein.
  • FIG. 4 is a flowchart illustrating an example method for rendering a mixing decision based on available voice-activity detection decisions received from clients participating in an audio conferencing session according to one or more embodiments described herein.
  • FIG. 5 is a block diagram illustrating an example computing device arranged for rendering an audio mixing decision according to one or more embodiments described herein.
  • The headings provided herein are for convenience only and do not necessarily affect the scope or meaning of the claimed invention.
  • In the drawings, the same reference numerals and any acronyms identify elements or acts with the same or similar structure or functionality for ease of understanding and convenience. The drawings will be described in detail in the course of the following Detailed Description.
  • DETAILED DESCRIPTION
  • Various examples of the invention will now be described. The following description provides specific details for a thorough understanding and enabling description of these examples. One skilled in the relevant art will understand, however, that the invention may be practiced without many of these details. Likewise, one skilled in the relevant art will also understand that the invention can include many other obvious features not described in detail herein. Additionally, some well-known structures or functions may not be shown or described in detail below, so as to avoid unnecessarily obscuring the relevant description.
  • Embodiments of the present disclosure relate to methods, systems, and apparatus for combining (e.g., mixing) audio signals received from a plurality of participants communicating with each other during a communication session (e.g., an audio conference) based on voice-activity-detection (VAD) data contained in the received signals. In at least some arrangements, participants (e.g., users, clients, individuals, etc.) in an audio conference communicate by sending Real-Time Transport Protocol (RTP) packets containing audio data (e.g., voice data) to an audio mixing apparatus (sometimes referred to herein as an “audio mixer” or simply a “mixer” for purposes of brevity).
  • One or more embodiments relate to determining audio signals to be included in an audio mixing decision based on VAD data contained in an extended profile of received data packets containing the audio signals. In at least one implementation, a number of participants (e.g., all, none, or any number of participants) in an audio conference may send outgoing audio signals (e.g., audio streams) to an audio mixer (e.g., a server) as RTP packets with an extended RTP-profile (sometimes also referred to herein as “RTP header extension” or “extended RTP header”). The participants can use the extended RTP-profile of the outgoing audio data packets to indicate an audio level (e.g., VAD data) of the packets' payload. Among other advantages, this approach reduces the processing load for the audio mixer since the audio mixer does not need to decode audio streams that are not going to be included in a mixed audio stream. Once a mixing decision has been rendered, received audio streams that should be mixed can be decoded by the audio mixer while other received audio streams can be discarded.
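  • By way of illustration only, the following is a minimal sketch of how a mixer might read such an audio level indication from an extended RTP header. The disclosure does not fix a wire format, so the sketch assumes a one-byte header extension element in the style of the client-to-mixer audio level indication of RFC 6464 (one-byte extension profile 0xBEDE per RFC 8285), in which one bit flags voice activity and seven bits carry the level in -dBov; the extension ID, the layout, and the function name are assumptions, not part of this disclosure.

```python
import struct

def parse_audio_level(rtp_packet: bytes, level_ext_id: int = 1):
    """Return (voice_flag, level_dbov) from an RTP header extension, or None.

    Sketch only: assumes a one-byte header extension element (RFC 6464 style)
    whose single data byte holds a voice-activity bit and a 7-bit level.
    """
    if len(rtp_packet) < 12:
        return None
    first_byte = rtp_packet[0]
    if not (first_byte & 0x10):                  # X bit: no header extension
        return None
    csrc_count = first_byte & 0x0F               # CC field
    ext_start = 12 + 4 * csrc_count
    profile, length_words = struct.unpack_from("!HH", rtp_packet, ext_start)
    if profile != 0xBEDE:                        # one-byte extension profile
        return None
    data = rtp_packet[ext_start + 4: ext_start + 4 + 4 * length_words]
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0:                               # padding byte
            i += 1
            continue
        ext_id, ext_len = b >> 4, (b & 0x0F) + 1
        if ext_id == level_ext_id and i + 1 < len(data):
            level_byte = data[i + 1]
            return bool(level_byte & 0x80), level_byte & 0x7F
        i += 1 + ext_len
    return None
```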
  • In at least one embodiment, an audio mixer being used in an audio conference receives RTP packets containing audio data sent to the mixer from clients participating in the conference. The audio mixer stores the received RTP packets in a buffer. As will be described in greater detail below, the audio mixer may retrieve (e.g., extract, receive, determine, etc.) a VAD decision (e.g., VAD data) from the extended RTP header of the received packets.
  • In one embodiment, if a client is not using the extended RTP header (e.g., audio from the client is not being transmitted in RTP packets configured with the extended header as described herein), then the audio mixer may decode the frames belonging to that client's audio stream, perform VAD processing on the decoded frames, and store the decoded frames and the VAD decision (e.g., in a decoded frames set and a VAD decision set, respectively, both of which will be described in greater detail below). In another embodiment, if a Discontinuous Transmission (DTX) frame is received for one stream from a particular client, then the audio mixer may set the VAD decision to, for example, “not voice” and then store the received frame.
  • Additionally, if no frame is received at the audio mixer for one stream from a client, then, in accordance with at least some embodiments, the audio mixer may approximate a VAD decision for that client. For example, the audio mixer may use the client's previous VAD decision to approximate a VAD decision for the instance in which no frame was received. In one embodiment, the previous VAD decision may be reused for some number of consecutive frames, with subsequent VAD decisions set to “not voice” until a frame is received again. In still another embodiment, the VAD decision may be set to “not voice” as soon as no frame is received during the mixing window. In any such embodiment, the process (e.g., the audio mixer) stores the VAD decision in a set of VAD decisions, as further described herein.
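  • As a purely illustrative sketch of the fallback behavior described above (reusing the previous decision for a few missing frames, and treating DTX frames as silence), per-client bookkeeping might look as follows; the class name, the carry-over threshold, and the use of a boolean VAD decision are assumptions made for the example.

```python
class VadTracker:
    """Per-client VAD bookkeeping for DTX and missing frames (sketch only)."""

    MAX_CARRYOVER = 3   # assumed number of windows the last decision is reused

    def __init__(self):
        self.last_decision = False   # False means "not voice"
        self.missed = 0

    def on_frame(self, decision: bool) -> bool:
        # A normal frame arrived with a VAD decision: record and return it.
        self.last_decision = decision
        self.missed = 0
        return decision

    def on_dtx_frame(self) -> bool:
        # A DTX frame signals silence: the decision for this window is "not voice".
        self.last_decision = False
        self.missed = 0
        return False

    def on_missing_frame(self) -> bool:
        # No frame arrived: reuse the previous decision for a few windows,
        # then fall back to "not voice" until a frame is received again.
        self.missed += 1
        return self.last_decision if self.missed <= self.MAX_CARRYOVER else False
```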
  • In one or more embodiments, determining whether or not a particular audio stream should be considered in a mixing decision (e.g., the decision as to which incoming client audio signals to mix or combine into a mixed audio signal sent to clients in a given mix cycle) may partially depend on an analysis of signal energy present in the frame. For example, in at least one implementation where the signal energy of a frame is analyzed or measured, the mixing decision may involve mixing only a specific number of participants with the highest signal energy in their respective audio frames.
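  • A small sketch of such an energy-based selection is shown below; the energy measure (mean of squared PCM samples) and the cap of three mixed participants are illustrative choices rather than values taken from this disclosure.

```python
import heapq

def select_loudest_clients(frames_by_client, max_mixed=3):
    """Pick the clients whose current frames have the highest signal energy.

    frames_by_client maps a client id to a sequence of PCM samples; only the
    max_mixed clients with the greatest frame energy are kept for mixing.
    """
    def energy(samples):
        return sum(s * s for s in samples) / max(len(samples), 1)

    loudest = heapq.nlargest(max_mixed, frames_by_client.items(),
                             key=lambda item: energy(item[1]))
    return [client_id for client_id, _ in loudest]
```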
  • Furthermore, in at least one embodiment of the disclosure, a variation of the RTP profile extension described herein may be used to provide a VAD decision rendered at the client to an audio mixing apparatus for purposes of making a mixing decision. An example includes signaling the VAD decision in the payload header of the applicable audio codec involved. In another embodiment, an audio codec may be designed such that the VAD of a given audio frame can be detected with significantly fewer processing resources (e.g., CPU cycles) than are required to fully decode the frame. Additionally, one or more other embodiments relate to signaling VAD decisions for audio frames to the audio mixer in a separate stream (e.g., out-of-band signaling).
  • FIG. 1 illustrates an example audio mixing environment in which various embodiments of the present disclosure may be implemented. As shown in the example environment of FIG. 1, an audio mixer 130 receives from clients 105A, 105B, 105C through 105N (where “N” is an arbitrary number) client audio signals 120A, 120B, 120C through 120N, respectively. In turn, the audio mixer 130 sends to each of the clients 105A, 105B, 105C through 105N a mixed audio signal 125 which, as will be described in greater detail herein, may be comprised of a subset of the incoming client audio signals 120A, 120B, 120C through 120N.
  • In at least one scenario, the clients 105A, 105B, 105C through 105N are participants (e.g., users, individuals, etc.) in a communication session (e.g., audio conference, audio conferencing session, etc.), where the clients 105A, 105B, 105C through 105N are communicating with each other by sending and receiving audio signals via the audio mixer 130. For example, in a given mix cycle (e.g., mix session, mix period, etc.) of an audio conferencing session, the audio mixer 130 may receive incoming audio signals from some or all of the clients 105A, 105B, 105C through 105N participating in the session. However, the audio mixer 130 may only mix (e.g., combine) a select subset of such incoming audio signals (e.g., client audio signals 120A, 120B, 120C through 120N) to send back to the clients 105A, 105B, 105C through 105N in the form of the mixed audio signal 125. As will be described in further detail below, the decision (sometimes referred to herein as the “mixing decision”) as to which of the incoming client audio signals 120A, 120B, 120C through 120N should be included in the mixed audio signal 125 depends on one or more of a variety of factors, considerations, and criteria.
  • Furthermore, it should be noted that the mixed audio signal 125 sent to each of the clients 105A, 105B, 105C through 105N may vary between the clients, depending on whether or not a particular client's own audio (e.g., client audio signals 120A, 120B, 120C through 120N) was included in the mixed audio generated by the audio mixer 130. For example, if client audio signal 120B was mixed as a result of a mixing algorithm applied by the audio mixer 130, but client audio signal 120A was not mixed, then the mixed audio signal 125 sent to client 105A will be different from the mixed audio signal sent to client 105B. In such a scenario, the mixed audio signal 125 sent to client 105A will contain the mixed audio of all the clients whose audio was included in the mix, while the mixed audio signal 125 sent to client 105B will be similar but with client 105B's own audio filtered out (e.g., since the client does not want to hear his or her own audio).
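  • The per-client difference can be illustrated with the following sketch, in which the mix for each receiver is built by summing the selected senders while skipping the receiver's own contribution; clipping, normalization, and re-encoding are omitted, and the function name is an assumption.

```python
def per_client_mixes(all_clients, decoded_frames, mixed_clients):
    """Build the outgoing mix for each client, omitting that client's own audio.

    all_clients lists every participant; decoded_frames maps each client whose
    audio was selected for mixing to an equal-length list of PCM samples.
    """
    frame_len = len(next(iter(decoded_frames.values()), []))
    mixes = {}
    for receiver in all_clients:
        mix = [0] * frame_len
        for sender in mixed_clients:
            if sender == receiver:
                continue                 # a client should not hear its own audio
            for i, sample in enumerate(decoded_frames[sender]):
                mix[i] += sample
        mixes[receiver] = mix
    return mixes
```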
  • FIG. 2 illustrates an example audio mixing apparatus along with incoming and outgoing data flows according to at least some embodiments of the present disclosure. The example audio mixer 230 shown includes a mixer control unit 240, a mixer unit 260, and a receiver unit 235.
  • In one or more embodiments, the receiver unit 235 includes a decoder 255 and a packet buffer 265. A set of client audio signals 220A, 220B, 220C, through 220N (where “N” is an arbitrary number) may be received at the receiver unit 235 and processed by the decoder 255. The client audio signals 220A, 220B, 220C, through 220N may be received at the audio mixer 230 and, more specifically, at the receiver unit 235, from one or more audio channels that provide the client audio signals 220A, 220B, 220C, through 220N as audio packets containing segments of the audio signals. For example, the client audio signals 220A, 220B, 220C, through 220N may be RTP packets containing data corresponding to segments of audio signals (e.g., generated by clients 105A through 105N as shown in FIG. 1). In accordance with various embodiments of the disclosure, such RTP packets comprising the incoming client audio signals 220A, 220B, 220C, through 220N may have extended RTP headers containing VAD data. Additionally, the audio mixer 230 produces as output one or more mixed audio signals 225. The mixed audio signals 225 may be generated as a result of a mixing algorithm being applied by the audio mixer 230.
  • In at least the example embodiment shown in FIG. 2, the mixer control unit 240 includes a memory 245, a decoded frame set 270, an encoded frame set 275, and a VAD decision set 280. Depending on the implementation, the mixer control unit 240 also includes, or is operably connected to, a voice activity detection unit 250. The voice activity detection unit 250 may be configured to perform a variety of operations on audio frames received at the audio mixer 230 from the client audio signals 220A, 220B, 220C, through 220N. For example, where an audio frame received at the audio mixer 230 does not include an extended RTP header (which is described in greater detail herein), the audio frame may be decoded by the decoder unit 255 before being sent to the voice activity detection unit 250 for voice-activity-detection (VAD) processing.
  • In some embodiments, one or more of the Decoded Frame Set 270, the Encoded Frame Set 275, and the VAD Decision Set 280 may be designated portions of a physical memory of the audio mixer 230, buffers implemented in a physical memory of the audio mixer 230, or may be stored in such designated portions or buffers, or may be any combination of the same. Additionally, because the Decoded Frame Set 270 and the Encoded Frame Set 275 may store decoded and encoded frames, respectively, while the VAD Decision Set 280 stores data related to voice activity, the Decoded Frame Set 270 and the Encoded Frame Set 275 may be contained in a memory type different from the memory type containing the VAD Decision Set 280. It should be understood that numerous other types and variations of memory, databases, and data storage spaces may also be configured for use as the Decoded Frame Set 270, the Encoded Frame Set 275, and/or the VAD Decision Set 280 in addition to or instead of the examples described above.
  • In one or more embodiments, the audio mixer 230 may also include other audio mixing components in addition to or instead of the example components illustrated in FIG. 2. Such other components may similarly be designed or configured to be capable of combining (e.g., mixing) audio signals received from a plurality of participants communicating with each other during a communication session (e.g., an audio conference) based on an audio mixing algorithm such as the one described herein.
  • FIG. 3 illustrates an example process for receiving and storing a VAD decision (e.g., VAD data) from a client participating in an audio conference. In at least some embodiments of the present disclosure, the example process illustrated in FIG. 3 and described in greater detail below may be performed by an audio mixing apparatus or conferencing server (e.g., audio mixer 130 as shown in FIG. 1). In one example scenario, the process may be performed by the audio mixing apparatus during an audio conferencing session in which a number of clients (e.g., clients 105A, 105B, 105C through 105N as shown in FIG. 1) participating in the session are sending audio streams to the mixing apparatus in the form of RTP packets containing encoded audio frames (e.g., audio data generated and encoded by, for example, microphones or other audio capture devices being used by the participating clients).
  • The process begins in step 300 where an audio frame is received (e.g., at an audio mixer) from a client during a given mix period (e.g., mix cycle, mixing window, etc.). In at least some embodiments, the audio frame received at step 300 is an encoded frame contained in an RTP packet, and may be contained in the RTP packet along with one or more additional encoded audio frames. For example, the audio frame received may be contained in one of a plurality of RTP packets transmitted from clients and received at an audio mixer during the particular mix period. Depending on the implementation, the RTP packets received from the clients may be stored in a buffer of the audio mixer (e.g., packet buffer 265 of audio mixer 230 as shown in FIG. 2).
  • Once the audio frame is received in step 300, a determination is made in step 305 as to whether a VAD decision (e.g., VAD data indicating an audio level of the received frame) can be extracted without decoding the frame. In accordance with various embodiments of the disclosure, clients participating in the audio conference send audio data to the audio conference server as RTP packets with an extended RTP header in which the clients can indicate an audio level (e.g., a VAD decision) of the packets' payload. In other words, the RTP header extension carries the VAD decision of the audio contained in the RTP payload of the packet to which the header extension corresponds. Accordingly, in at least one embodiment, the determination made in step 305 may include determining whether the audio frame was received from a client using the extended RTP header. If so, then the VAD decision can be extracted from the extended RTP header without decoding the frame and the process moves to step 330 where the encoded frame is stored.
  • In one embodiment, the determination made in step 305 about whether a VAD decision (e.g., VAD data indicating an audio level of the received frame) can be extracted without decoding the frame may be performed by a receiver unit of the audio mixer (e.g., receiver unit 235 as shown in FIG. 2), or by some other component or element of the audio mixer. Depending on the particular audio mixer used, such a receiver unit may be designed or configured in a manner such that it is capable of determining whether or not a given packet is received with an extended header attribute indicating that the packet includes an RTP header extension as described above. In another embodiment, one or more other components of the audio mixer may be responsible for determining whether or not a VAD decision can be extracted from a received frame without decoding the frame, in addition to or instead of a receiver unit of the audio mixer as described above. Furthermore, numerous other approaches may be used to render such a determination in addition to, or instead of, examining a received packet for an extended header attribute.
  • In step 330, the encoded frame may be stored (e.g., by a mixer control unit, such as mixer control unit 240 of the example audio mixer 230 as shown in FIG. 2) in a set of all encoded frames which, at least in the example process shown in FIG. 3, may be represented as “E”. In one implementation, the set of all encoded frames, E, may correspond to the encoded frame set 275 of the example audio mixer 230 shown in FIG. 2.
  • Following step 330, the process continues to step 335 where the VAD decision (which was determined to be extractable without decoding the frame in step 305) is extracted from the encoded audio frame and stored in a set of all VAD decisions, represented as “V” in the example process shown. Similar to the set of all encoded frames, E, the set of all VAD decisions, V, may correspond to the VAD decision set 280 of the example audio mixer 230 illustrated in FIG. 2. After the VAD decision has been extracted and stored in step 335, the process returns to step 300 and repeats for the next received audio frame.
  • In step 305, if it is instead determined that a VAD decision cannot be extracted without decoding the received audio frame, then the process goes to step 310 where the audio frame is decoded. The audio frame may be decoded in step 310 using any state of the art decoder (e.g., decoder 255 of the example audio mixer 230 shown in FIG. 2) suitable for the purpose, as will be appreciated by those skilled in the art.
  • After the received audio frame is decoded in step 310, voice-activity-detection (VAD) is performed on the decoded frame in step 315. In at least some embodiments, VAD may be performed on the decoded frame by a voice activity detection unit contained in, or operably connected to, the audio mixer (e.g., voice activity detection unit 250 of audio mixer 230 as shown in FIG. 2). In any of the various embodiments described herein, the detection of audio activity (e.g., voice activity, which can indicate a presence or absence of speech based on the particular level of activity detected or measured) can be performed in a number of different ways. For example, the VAD in step 315 can be based on one or more energy criteria indicating that an audio (e.g., voice) activity level in the decoded frame is above a particular background noise level. Additionally, the detection of voice activity in step 315 of the process may be performed by some other entity or component within, or connected to, the audio mixing apparatus.
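  • As one hedged illustration of such an energy criterion, a frame can be compared against a slowly adapting background-noise estimate; the smoothing factor and the margin below are illustrative values, not parameters taken from this disclosure.

```python
class EnergyVad:
    """Toy energy-based VAD: flags voice when frame energy clears the noise floor."""

    def __init__(self, alpha=0.95, margin=4.0):
        self.noise_floor = None
        self.alpha = alpha     # smoothing factor for the background-noise estimate
        self.margin = margin   # frame energy must exceed the floor by this factor

    def is_voice(self, samples) -> bool:
        energy = sum(s * s for s in samples) / max(len(samples), 1)
        if self.noise_floor is None:
            self.noise_floor = energy      # first frame seeds the noise estimate
            return False
        voice = energy > self.margin * self.noise_floor
        if not voice:
            # Adapt the noise floor only on frames judged to be background.
            self.noise_floor = (self.alpha * self.noise_floor
                                + (1.0 - self.alpha) * energy)
        return voice
```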
  • Furthermore, in at least some embodiments the VAD described in connection with step 315 may be based on information received along with the audio frame. For example, there may be a scenario where although a VAD decision cannot be extracted from the audio frame in step 305 without first decoding the frame in step 310, the determination of voice activity in step 315 may still be based on data generated as a result of processing that occurred remotely from the audio mixer (e.g., at the audio source, such as the client). It should be noted that in any of the various embodiments of the present disclosure, detecting voice activity in a received audio frame may be performed in accordance with the VAD procedure described in granted U.S. Pat. No. 6,993,481.
  • The process continues to step 320 where the decoded frame is stored in a set of all decoded frames, which is represented as “D” in at least the example process illustrated in FIG. 3. Similar to the set of all encoded frames, E, the set of all decoded frames, D, may correspond to the Decoded Frame Set 270 of the example audio mixer 230 illustrated in FIG. 2.
  • In step 325, the VAD decision obtained for the decoded frame in step 315 is stored in the set of all VAD decisions, V, as described above with respect to step 335. Once the VAD decision is stored in V, the process returns to step 300 and repeats for the next audio frame received.
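  • Putting the branches of FIG. 3 together, a single receive pass might be sketched as follows; the container type, the state dictionary, and the decoder, VAD, and header-parser interfaces are all assumptions introduced for illustration (the header parser could be a thin wrapper around the parse_audio_level sketch shown earlier).

```python
from collections import namedtuple

# Illustrative container for a received packet: raw RTP bytes plus codec payload.
Packet = namedtuple("Packet", ["data", "payload"])

def handle_incoming_frame(packet, client_id, state, decoder, vad, extract_header_vad):
    """One pass of the FIG. 3 flow for a single received frame (sketch only).

    state carries the three sets named in the text: "E" (encoded frames),
    "D" (decoded frames), and "V" (VAD decisions).
    """
    E, D, V = state["E"], state["D"], state["V"]
    decision = extract_header_vad(packet.data)        # step 305
    if decision is not None:
        E[client_id] = packet.payload                 # step 330: store encoded frame
        V[client_id] = decision                       # step 335: store VAD decision
    else:
        pcm = decoder.decode(packet.payload)          # step 310: decode the frame
        V[client_id] = vad.is_voice(pcm)              # steps 315/325: VAD and store
        D[client_id] = pcm                            # step 320: store decoded frame
    return state
```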
  • FIG. 4 illustrates an example process for rendering a mixing decision based on available VAD decisions received from clients participating in an audio conference. As with the example process illustrated in FIG. 3, in at least some embodiments of the present disclosure, the example process illustrated in FIG. 4 and described in greater detail below may be performed by an audio mixing apparatus (e.g., conferencing server, such as audio mixer 130 as shown in FIG. 1).
  • In one scenario, the process may be performed by the audio mixing apparatus during an audio conferencing session in which a number of clients (e.g., clients 105A, 105B, 105C through 105N as shown in FIG. 1) participating in the session are sending audio streams to the mixing apparatus in the form of RTP packets. For example, the process may be performed by the audio mixing apparatus following the receipt of such packets from clients and the storage of encoded frames, decoded frames, and VAD decisions, as described above with respect to FIG. 3. The example process shown includes steps that may be performed during a particular mix cycle (e.g., mix period, mix instance, etc.) of many mix cycles that collectively comprise an audio conferencing session involving multiple participants.
  • The process begins at step 400 where it is determined that a mixing decision is to be made. For example, as described above with respect to the process illustrated in FIG. 3, a determination that a mixing decision is to be made may occur following the receipt of audio data packets at an audio mixer (e.g., audio mixer 130 shown in FIG. 1) from participating clients, and following the processing and storage of encoded frames, decoded frames, and VAD decisions obtained by the audio mixer from those data packets. In at least one embodiment, the frames and/or VAD decisions are stored in one or more buffers of the audio mixer (e.g., decoded frame set 270, encoded frame set 275, and VAD decision set 280 as shown in FIG. 2).
  • In step 405, a set (e.g., subset) of decoded audio frames, a set (e.g., subset) of encoded audio frames, and a set (e.g., subset) of VAD decisions corresponding to each of the decoded and encoded sets are retrieved from one or more buffers of the audio mixer, where for purposes of the present description the set of decoded audio frames is represented as D′, the set of encoded audio frames is represented as E′, and the set of VAD decisions is represented as V′. In one embodiment, D′ may be retrieved from a set of all decoded frames, E′ may be retrieved from a set of all encoded frames, and V′ may be retrieved from a set of all VAD decisions (e.g., the set of all decoded frames D, the set of all encoded frames E, and the set of all VAD decisions V as shown in FIG. 3). With reference to the example audio mixer shown in FIG. 2, in at least one implementation one or more of D′, E′, and V′ may be retrieved from the decoded frame set 270, encoded frame set 275, and VAD decision set 280, respectively.
  • The process then goes to step 410 where a mixing decision algorithm (which is sometimes referred to herein simply as a “mixing decision”) may be applied based on the set of VAD decisions V′ retrieved in step 405. In at least one embodiment, the mixing decision algorithm is applied in order to determine which of the audio frames in D′ and/or E′ are to be included in the mixing operation for the given mix cycle. A number of different mixing decision algorithms may be used in step 410. For example, in one embodiment the mixing decision could be to mix all, or a subset of all, clients that have sent audio streams from which a positive VAD decision has been extracted or for which a positive VAD decision has been rendered.
  • In another embodiment, the mixing decision rendered in step 410 may partially depend on an analysis of signal energy present in the audio frames comprising D′ and E′. For example, in an implementation where the signal energy of a frame is analyzed and/or measured, the mixing decision may involve only mixing a specific number of participants with the highest signal energy in their respective audio frames. It should be understood by those skilled in the art that a variety of other mixing decision algorithms may also be applied in step 410 of the process illustrated in FIG. 4 in addition to or instead of the example algorithms described above.
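  • One possible combination of these two criteria is sketched below: keep the clients whose retrieved VAD decisions are positive and, if the mix is capped, prefer the loudest of them; the cap, the energy map, and the function name are illustrative assumptions.

```python
def mixing_decision(vad_decisions, energies, max_mixed=None):
    """Choose which clients to mix in this cycle based on V' (sketch only).

    vad_decisions maps client id -> VAD decision (truthy means voice);
    energies maps client id -> a per-frame energy estimate, used only when
    max_mixed limits how many clients may be mixed.
    """
    voiced = [c for c, decision in vad_decisions.items() if decision]
    if max_mixed is None or len(voiced) <= max_mixed:
        return set(voiced)
    # Break ties on signal energy: keep only the loudest max_mixed clients.
    voiced.sort(key=lambda c: energies.get(c, 0.0), reverse=True)
    return set(voiced[:max_mixed])
```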
  • The process continues to step 415 where the audio frames to be mixed, based on the mixing decision made in step 410, are stored in a set of encoded and decoded frames for mix (which for purposes of the present description is represented as M). Additionally, in at least some embodiments, any audio frames that are not stored as part of M may be discarded following step 415.
  • In step 420, a determination is made as to whether the set of encoded and decoded frames for mix, M, is empty (e.g., whether M contains any remaining audio frames to be included in the mixing operation for the present mix cycle). If it is determined in step 420 that M is not empty, then in step 425 an audio frame is removed from M. After an audio frame is removed from M in step 425, it is determined in step 430 whether or not the removed audio frame is a decoded audio frame (e.g., a frame from the set of decoded frames D′).
  • If it is found in step 430 that the audio frame removed from M is a decoded audio frame, then in step 435 the decoded audio frame is stored in a set of decoded frames for mix, represented as m for purposes of the present description. On the other hand, if in step 430 it is determined that the audio frame removed from M is not a decoded frame (e.g., the audio frame removed from M is an encoded audio frame from the set of encoded audio frames E′), then the process goes to step 440 where the audio frame is decoded. Once the audio frame is decoded in step 440, the decoded audio frame is stored in the set of decoded frames for mix, m, in step 435.
  • After step 435, the process returns to step 420 where it is again determined whether the set of encoded and decoded frames for mix, M, is empty. If M is not empty, then steps 425 through 435 are repeated for an audio frame that remains in M. However, if in step 420 it is determined that M is empty, the process continues to step 445 where a mixing algorithm is applied to all of the audio frames in m to generate one or more mixed audio streams. In at least one implementation, the mix operation of step 445 may be performed by a mixer unit (e.g., mixer unit 260 of the example audio mixer shown in FIG. 2). It should be noted that the mixing algorithm applied in step 445 of the process is based on the earlier application of the mixing decision algorithm in step 410 of the process, as described above. Stated differently, the mixing decision algorithm applied in step 410 determines the audio frames to be mixed by application of the mixing algorithm in step 445.
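  • A compact sketch of steps 420 through 445 follows: the set-for-mix M is drained, frames that are still encoded are decoded on the way into m, and the frames in m are then summed into a single mixed frame. The list-of-tuples representation of M and the decoder interface are assumptions, and per-client filtering of a participant's own audio (discussed with FIG. 1) is left out for brevity.

```python
def decode_and_mix(M, decoder):
    """Drain the set of frames for mix, M, and mix the decoded frames (sketch).

    M is a list of (client_id, frame, is_decoded) entries; decoder.decode()
    stands in for the codec used by the audio mixer.
    """
    m = []                                    # set of decoded frames for mix
    while M:                                  # steps 420 through 440
        client_id, frame, is_decoded = M.pop()
        m.append(frame if is_decoded else decoder.decode(frame))
    if not m:
        return []
    mixed = [0] * len(m[0])                   # step 445: sum sample by sample
    for pcm in m:
        for i, sample in enumerate(pcm):
            mixed[i] += sample
    return mixed
```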
  • Although not shown as being part of the process in FIG. 4, in accordance with embodiments of the present disclosure, the mix operation performed in step 445 may produce (e.g., generate) as output one or more mixed audio signals (e.g., mixed audio signal 125 as shown in FIG. 1 or mixed audio signals 225 as shown in FIG. 2). The one or more mixed audio signals generated from mixing all of the frames included in the set of decoded frames for mix, m, may be sent from an audio mixer to the clients participating in the audio conference (e.g., clients 105A through 105N as shown in FIG. 1).
  • Following the mix operation performed in step 445, the process goes to step 450 where the set of all decoded frames D, the set of all encoded frames E, and the set of all VAD decisions V (as described above with respect to the process shown in FIG. 3) are cleared for the start of the next mix cycle in step 400.
  • FIG. 5 is a block diagram illustrating an example computing device 500 that is arranged for determining a mixing decision to combine (e.g., mix) audio signals received from a plurality of communicating users based on voice-activity-detection (VAD) data contained in the received signals in accordance with one or more embodiments of the present disclosure. In a very basic configuration 501, computing device 500 typically includes one or more processors 510 and system memory 520. A memory bus 530 may be used for communicating between the processor 510 and the system memory 520.
  • Depending on the desired configuration, processor 510 can be of any type including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. Processor 510 may include one or more levels of caching, such as a level one cache 511 and a level two cache 512, a processor core 513, and registers 514. The processor core 513 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP Core), or any combination thereof. A memory controller 515 can also be used with the processor 510, or in some embodiments the memory controller 515 can be an internal part of the processor 510.
  • Depending on the desired configuration, the system memory 520 can be of any type including but not limited to volatile memory (e.g., RAM), non-volatile memory (e.g., ROM, flash memory, etc.) or any combination thereof. System memory 520 typically includes an operating system 521, one or more applications 522, and program data 524. In at least some embodiments, application 522 includes a multipath routing algorithm 523 that is configured to receive and store audio frames based on one or more characteristics of the frames (e.g., encoded, decoded, contain VAD decision, etc.). The multipath routing algorithm is further arranged to identify candidate sets of audio frames for consideration in a mixing decision (e.g., by an audio mixer, such as example audio mixer 230 shown in FIG. 2) and select from among those candidate sets audio frames to include in a mixed audio signal (e.g., mixed audio signal 125 shown in FIG. 1) based on information and data contained in the audio frames (e.g., VAD decisions).
  • Program Data 524 may include multipath routing data 525 that is useful for identifying received audio frames and categorizing the frames into one or more sets based on specific characteristics (e.g., whether a frame is encoded, decoded, contains a VAD decision, etc.). In some embodiments, application 522 can be arranged to operate with program data 524 on an operating system 521 such that a received audio frame is analyzed to determine its characteristics before being stored in an appropriate set of audio frames (e.g., decoded frame set 270 or encoded frame set 275 as shown in FIG. 2).
  • Computing device 500 can have additional features and/or functionality, and additional interfaces to facilitate communications between the basic configuration 501 and any required devices and interfaces. For example, a bus/interface controller 540 can be used to facilitate communications between the basic configuration 501 and one or more data storage devices 550 via a storage interface bus 541. The data storage devices 550 can be removable storage devices 551, non-removable storage devices 552, or any combination thereof. Examples of removable storage and non-removable storage devices include magnetic disk devices such as flexible disk drives and hard-disk drives (HDD), optical disk drives such as compact disk (CD) drives or digital versatile disk (DVD) drives, solid state drives (SSD), tape drives and the like. Example computer storage media can include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, and/or other data.
  • System memory 520, removable storage 551 and non-removable storage 552 are all examples of computer storage media. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 500. Any such computer storage media can be part of computing device 500.
  • Computing device 500 can also include an interface bus 542 for facilitating communication from various interface devices (e.g., output interfaces, peripheral interfaces, communication interfaces, etc.) to the basic configuration 501 via the bus/interface controller 540. Example output devices 560 include a graphics processing unit 561 and an audio processing unit 562, either or both of which can be configured to communicate to various external devices such as a display or speakers via one or more A/V ports 563. Example peripheral interfaces 570 include a serial interface controller 571 or a parallel interface controller 572, which can be configured to communicate with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device, etc.) or other peripheral devices (e.g., printer, scanner, etc.) via one or more I/O ports 573.
  • An example communication device 580 includes a network controller 581, which can be arranged to facilitate communications with one or more other computing devices 590 over a network communication (not shown) via one or more communication ports 582. The communication connection is one example of a communication media. Communication media may typically be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media. A “modulated data signal” can be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media can include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared (IR) and other wireless media. The term computer readable media as used herein can include both storage media and communication media.
  • Computing device 500 can be implemented as a portion of a small-form factor portable (or mobile) electronic device such as a cell phone, a personal data assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application specific device, or a hybrid device that includes any of the above functions. Computing device 500 can also be implemented as a personal computer including both laptop computer and non-laptop computer configurations.
  • There is little distinction left between hardware and software implementations of aspects of systems; the use of hardware or software is generally (but not always, in that in certain contexts the choice between hardware and software can become significant) a design choice representing cost versus efficiency tradeoffs. There are various vehicles by which processes and/or systems and/or other technologies described herein can be effected (e.g., hardware, software, and/or firmware), and the preferred vehicle will vary with the context in which the processes and/or systems and/or other technologies are deployed. For example, if an implementer determines that speed and accuracy are paramount, the implementer may opt for a mainly hardware and/or firmware vehicle; if flexibility is paramount, the implementer may opt for a mainly software implementation. In one or more other scenarios, the implementer may opt for some combination of hardware, software, and/or firmware.
  • The foregoing detailed description has set forth various embodiments of the devices and/or processes via the use of block diagrams, flowcharts, and/or examples. Insofar as such block diagrams, flowcharts, and/or examples contain one or more functions and/or operations, it will be understood by those skilled within the art that each function and/or operation within such block diagrams, flowcharts, or examples can be implemented, individually and/or collectively, by a wide range of hardware, software, firmware, or virtually any combination thereof.
  • In one or more embodiments, several portions of the subject matter described herein may be implemented via Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), digital signal processors (DSPs), or other integrated formats. However, those skilled in the art will recognize that some aspects of the embodiments described herein, in whole or in part, can be equivalently implemented in integrated circuits, as one or more computer programs running on one or more computers (e.g., as one or more programs running on one or more computer systems), as one or more programs running on one or more processors (e.g., as one or more programs running on one or more microprocessors), as firmware, or as virtually any combination thereof. Those skilled in the art will further recognize that designing the circuitry and/or writing the code for the software and/or firmware would be well within the skill of one skilled in the art in light of the present disclosure.
  • Additionally, those skilled in the art will appreciate that the mechanisms of the subject matter described herein are capable of being distributed as a program product in a variety of forms, and that an illustrative embodiment of the subject matter described herein applies regardless of the particular type of signal-bearing medium used to actually carry out the distribution. Examples of a signal-bearing medium include, but are not limited to, the following: a recordable-type medium such as a floppy disk, a hard disk drive, a Compact Disc (CD), a Digital Video Disk (DVD), a digital tape, a computer memory, etc.; and a transmission-type medium such as a digital and/or an analog communication medium (e.g., a fiber optic cable, a waveguide, a wired communications link, a wireless communication link, etc.).
  • Those skilled in the art will also recognize that it is common within the art to describe devices and/or processes in the fashion set forth herein, and thereafter use engineering practices to integrate such described devices and/or processes into data processing systems. That is, at least a portion of the devices and/or processes described herein can be integrated into a data processing system via a reasonable amount of experimentation. Those having skill in the art will recognize that a typical data processing system generally includes one or more of a system unit housing, a video display device, a memory such as volatile and non-volatile memory, processors such as microprocessors and digital signal processors, computational entities such as operating systems, drivers, graphical user interfaces, and applications programs, one or more interaction devices, such as a touch pad or screen, and/or control systems including feedback loops and control motors (e.g., feedback for sensing position and/or velocity; control motors for moving and/or adjusting components and/or quantities). A typical data processing system may be implemented utilizing any suitable commercially available components, such as those typically found in data computing/communication and/or network computing/communication systems.
  • With respect to the use of substantially any plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for sake of clarity.
  • While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.

Claims (18)

We claim:
1. A method for mixing audio signals comprising:
receiving, at an audio mixing apparatus, audio packets from a plurality of clients in communication with the audio mixing apparatus;
retrieving audio level data contained in an extended packet header of each of the received audio packets;
selecting audio frames to be mixed based on the audio level data retrieved from the extended packet headers of the audio packets;
decoding the audio frames selected to be mixed; and
generating a mixed audio stream by mixing the decoded audio frames.
2. The method of claim 1, further comprising:
responsive to receiving the audio packets from the plurality of clients, determining, for each audio frame contained in the audio packets, whether audio level data can be extracted from the audio frame without decoding the audio frame.
3. The method of claim 2, further comprising:
responsive to determining that audio level data can be extracted from the audio frame without decoding the audio frame, extracting audio level data from the audio frame;
storing the audio level data extracted from the audio frame in an audio level data set; and
storing the audio frame in an encoded audio frames set.
4. The method of claim 3, further comprising:
responsive to determining that no audio level data can be extracted from the audio frame without decoding the audio frame, decoding the audio frame;
performing audio level processing on the decoded audio frame to obtain audio level data for the decoded audio frame;
storing the audio level data obtained for the decoded audio frame in the audio level data set; and
storing the decoded audio frame in a decoded audio frames set.
5. The method of claim 4, further comprising:
selecting a group of encoded audio frames from the encoded audio frames set and a group of decoded audio frames from the decoded audio frames set;
retrieving, from the audio level data set, the audio level data stored for the group of encoded audio frames and the group of decoded audio frames;
applying a mixing decision algorithm based on the audio level data retrieved from the audio level data set; and
determining a set of encoded and decoded audio frames to be mixed based on the applied mixing decision algorithm.
6. The method of claim 1, further comprising storing the received audio packets in a buffer of the audio mixing apparatus.
7. The method of claim 1, further comprising discarding the audio frames not selected to be mixed.
8. The method of claim 1, wherein the audio packets received from the plurality of clients are Real-Time Transport Protocol (RTP) packets.
9. The method of claim 1, wherein the audio level data contained in the extended packet header of each of the received audio packets includes voice data corresponding to one of the plurality of clients.
10. An audio mixing apparatus configured to perform operations comprising:
receiving audio packets from a plurality of clients in communication with the audio mixing apparatus;
retrieving audio level data contained in an extended packet header of each of the received audio packets;
selecting audio frames to be mixed based on the audio level data retrieved from the extended packet headers of the audio packets;
decoding the audio frames selected to be mixed; and
generating a mixed audio stream by mixing the decoded audio frames.
11. The audio mixing apparatus of claim 10, further configured to perform operations comprising, responsive to receiving the audio packets from the plurality of clients, determining, for each audio frame contained in the audio packets, whether audio level data can be extracted from the audio frame without decoding the audio frame.
12. The audio mixing apparatus of claim 11, further configured to perform operations comprising:
responsive to determining that audio level data can be extracted from the audio frame without decoding the audio frame, extracting audio level data from the audio frame;
storing the audio level data extracted from the audio frame in an audio level data set; and
storing the audio frame in an encoded audio frames set.
13. The audio mixing apparatus of claim 12, further configured to perform operations comprising:
responsive to determining that no audio level data can be extracted from the audio frame without decoding the audio frame, decoding the audio frame;
performing audio level processing on the decoded audio frame to obtain audio level data for the decoded audio frame;
storing the audio level data obtained for the decoded audio frame in the audio level data set; and
storing the decoded audio frame in a decoded audio frames set.
14. The audio mixing apparatus of claim 13, further configured to perform operations comprising:
selecting a group of encoded audio frames from the encoded audio frames set and a group of decoded audio frames from the decoded audio frames set;
retrieving, from the audio level data set, the audio level data stored for the group of encoded audio frames and the group of decoded audio frames;
applying a mixing decision algorithm based on the audio level data retrieved from the audio level data set; and
determining a set of encoded and decoded audio frames to be mixed based on the applied mixing decision algorithm.
15. The audio mixing apparatus of claim 10, further configured to perform operations comprising storing the received audio packets in a buffer of the audio mixing apparatus.
16. The audio mixing apparatus of claim 10, further configured to perform operations comprising discarding the audio frames not selected to be mixed.
17. The audio mixing apparatus of claim 10, wherein the audio packets received from the plurality of clients are Real-Time Transport Protocol (RTP) packets.
18. The audio mixing apparatus of claim 10, wherein the audio level data contained in the extended packet header of each of the received audio packets includes voice data corresponding to one of the plurality of clients.
US13/348,278 2012-01-11 2012-01-11 Mixing decision controlling decode decision Abandoned US20140369528A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/348,278 US20140369528A1 (en) 2012-01-11 2012-01-11 Mixing decision controlling decode decision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US13/348,278 US20140369528A1 (en) 2012-01-11 2012-01-11 Mixing decision controlling decode decision

Publications (1)

Publication Number Publication Date
US20140369528A1 true US20140369528A1 (en) 2014-12-18

Family

ID=52019239

Family Applications (1)

Application Number Title Priority Date Filing Date
US13/348,278 Abandoned US20140369528A1 (en) 2012-01-11 2012-01-11 Mixing decision controlling decode decision

Country Status (1)

Country Link
US (1) US20140369528A1 (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130343548A1 (en) * 2012-06-25 2013-12-26 Calgary Scientific Inc. Method and system for multi-channel mixing for transmission of audio over a network
US9282420B2 (en) * 2012-06-25 2016-03-08 Calgary Scientific Inc. Method and system for multi-channel mixing for transmission of audio over a network
US20150333817A1 (en) * 2012-10-26 2015-11-19 Icom Incorporated Relaying device and communication system
US9742483B2 (en) * 2012-10-26 2017-08-22 Icom Incorporated Relaying device
US20140270263A1 (en) * 2013-03-15 2014-09-18 Dts, Inc. Automatic multi-channel music mix from multiple audio stems
US9640163B2 (en) * 2013-03-15 2017-05-02 Dts, Inc. Automatic multi-channel music mix from multiple audio stems
US11196868B2 (en) * 2016-02-18 2021-12-07 Tencent Technology (Shenzhen) Company Limited Audio data processing method, server, client and server, and storage medium
CN109644192A (en) * 2016-08-25 2019-04-16 谷歌有限责任公司 Audio transmission with the compensation of speech detection cycle duration
US10269371B2 (en) * 2016-08-25 2019-04-23 Google Llc Techniques for decreasing echo and transmission periods for audio communication sessions
US20180061437A1 (en) * 2016-08-25 2018-03-01 Google Inc. Techniques for decreasing echo and transmission periods for audio communication sessions
CN114257571A (en) * 2016-08-25 2022-03-29 谷歌有限责任公司 Audio delivery with voice detection period duration compensation
US10375131B2 (en) * 2017-05-19 2019-08-06 Cisco Technology, Inc. Selectively transforming audio streams based on audio energy estimate
WO2019062541A1 (en) * 2017-09-26 2019-04-04 华为技术有限公司 Real-time digital audio signal mixing method and device
US10620904B2 (en) 2018-09-12 2020-04-14 At&T Intellectual Property I, L.P. Network broadcasting for selective presentation of audio content
US20220208210A1 (en) * 2019-02-19 2022-06-30 Sony Interactive Entertainment Inc. Sound output control apparatus, sound output control system, sound output control method, and program
CN110995946A (en) * 2019-12-25 2020-04-10 苏州科达科技股份有限公司 Sound mixing method, device, equipment, system and readable storage medium
CN116471263A (en) * 2023-05-12 2023-07-21 杭州全能数字科技有限公司 Real-time audio routing method for video system

Legal Events

Date Code Title Description
AS Assignment

Owner name: GOOGLE INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:ELLNER, LARS HENRIK;REEL/FRAME:027528/0633

Effective date: 20120110

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: GOOGLE LLC, CALIFORNIA

Free format text: CHANGE OF NAME;ASSIGNOR:GOOGLE INC.;REEL/FRAME:044142/0357

Effective date: 20170929