WO2019241778A1 - Audiovisual livestream system and method with latency management and social media-type user interface mechanics - Google Patents

Info

Publication number
WO2019241778A1
Authority
WO
WIPO (PCT)
Prior art keywords
vocal
captured
audiovisual
performance
video
Prior art date
Application number
PCT/US2019/037479
Other languages
French (fr)
Inventor
Anton Holmberg
Benjamin HERSH
Jeannie Yang
Yuning Woo
Wang Liang
Perry R. Cook
Jeffrey C. Smith
Original Assignee
Smule, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Smule, Inc.
Priority to CN201980052977.9A (CN112567758A)
Priority to EP19819554.7A (EP3808096A4)
Publication of WO2019241778A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/36 Accompaniment arrangements
    • G10H1/361 Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
    • G10H1/366 Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems with means for modifying or correcting the external signal, e.g. pitch correction, reverberation, changing a singer's voice
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0008 Associated control or indicating means
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0033 Recording/reproducing or transmission of music for electrophonic musical instruments
    • G10H1/0041 Recording/reproducing or transmission of music for electrophonic musical instruments in coded form
    • G10H1/0058 Transmission between separate instruments or between individual components of a musical system
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 Server components or server architectures
    • H04N21/218 Source of audio or video content, e.g. local disk arrays
    • H04N21/2187 Live feed
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/236 Assembling of a multiplex stream, e.g. transport stream, by combining a video stream with other content or additional data, e.g. inserting a URL [Uniform Resource Locator] into a video stream, multiplexing software data into a video stream; Remultiplexing of multiplex streams; Insertion of stuffing bits into the multiplex stream, e.g. to obtain a constant bit-rate; Assembling of a packetised elementary stream
    • H04N21/2368 Multiplexing of audio and video streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/242 Synchronization processes, e.g. processing of PCR [Program Clock References]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85 Assembly of content; Generation of multimedia applications
    • H04N21/854 Content authoring
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/155 Musical effects
    • G10H2210/245 Ensemble, i.e. adding one or more voices, also instrumental voices
    • G10H2210/251 Chorus, i.e. automatic generation of two or more extra voices added to the melody, e.g. by a chorus effect processor or multiple voice harmonizer, to produce a chorus or unison effect, wherein individual sounds from multiple sources with roughly the same timbre converge and are perceived as one
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/325 Musical pitch modification
    • G10H2210/331 Note pitch correction, i.e. modifying a note pitch or replacing it by the closest one in a given scale
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00 Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/171 Transmission of musical instrument data, control or status information; Transmission, remote access or control of music data for electrophonic musical instruments
    • G10H2240/175 Transmission of musical instrument data, control or status information; Transmission, remote access or control of music data for electrophonic musical instruments for jam sessions or musical collaboration through a network, e.g. for composition, ensemble playing or repeating; Compensation of network or internet delays therefor
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00 Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/171 Transmission of musical instrument data, control or status information; Transmission, remote access or control of music data for electrophonic musical instruments
    • G10H2240/201 Physical layer or hardware aspects of transmission to or from an electrophonic musical instrument, e.g. voltage levels, bit streams, code words or symbols over a physical link connecting network nodes or instruments
    • G10H2240/241 Telephone transmission, i.e. using twisted pair telephone lines or any type of telephone network
    • G10H2240/251 Mobile telephone transmission, i.e. transmitting, accessing or controlling music data wirelessly via a wireless or mobile telephone receiver, analog or digital, e.g. DECT GSM, UMTS

Definitions

  • the invention relates generally to capture, processing and/or broadcast of multi-performer audiovisual performances and, in particular, to techniques suitable for managing
  • AppleTV® devices support audio and video processing quite capably, while at the same time providing platforms suitable for advanced user interfaces. Indeed, applications such as the Smule Ocarina™, Leaf Trombone®, I Am T-Pain™, AutoRap®, Smule (fka Sing! Karaoke™), Guitar! By Smule®, and Magic Piano® apps available from Smule, Inc. have shown that advanced digital acoustic techniques may be delivered using such devices in ways that provide compelling musical experiences.
  • vocal performances including vocal music
  • the vocal performances of collaborating contributors are captured (together with performance synchronized video) in the context of a karaoke-style presentation of lyrics and in correspondence with audible renderings of a backing track.
  • vocals and typically synchronized video
  • vocal interactions are captured as part of a live or unscripted performance with vocal interactions (e.g., a duet or dialog) between collaborating contributors.
  • non-negligible network communication latencies will exist between at least some of the collaborating contributors, particularly where those contributors are geographically separated.
  • a technical challenge exists to manage latencies and the captured audiovisual content in such a way that a combined audiovisual performance nonetheless can be disseminated (e.g., broadcast) in a manner that presents to recipients, listeners and/or viewers as a live interactive collaboration.
  • actual and non-negligible network communication latency is (in effect) masked in one direction between a guest and host performer and tolerated in the other direction.
  • a captured audiovisual performance of a guest performer on a “live show” internet broadcast of a host performer could include a guest + host duet sung in apparent real-time synchrony.
  • the guest could be a performer who has popularized a particular musical performance.
  • the guest could be an amateur vocalist given the opportunity to sing “live” (though remote) with the popular artist or group virtually “in studio” as (or with) the show’s host.
  • the host performs in apparent synchrony with (though temporally lagged from, in an absolute sense) the guest and the apparently synchronously performed vocals are captured and mixed with the guest’s contribution for broadcast or dissemination.
  • the result is an apparently live interactive performance (at least from the perspective of the host and the recipients, listeners and/or viewers of the disseminated or broadcast performance).
  • the non-negligible network communication latency from guest-to-host is masked, it will be understood that latency exists and is tolerated in the host-to-guest direction.
  • host-to-guest latency, while discernible (and perhaps quite noticeable) to the guest, need not be apparent in the apparently live broadcast or other dissemination. It has been discovered that lagged audible rendering of host vocals (or more generally, of the host’s captured audiovisual performance) need not psychoacoustically interfere with the guest’s performance.
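The one-directional masking described above can be illustrated with a minimal sketch. It assumes integer sample values, a fixed guest-to-host latency measured in ticks, and a host who sings exactly in time with whatever is heard; the function and variable names are hypothetical and are not taken from the patent.

```python
def guest_mix(guest_vocals, backing):
    """Guest device: mix captured guest vocals with the backing track."""
    return [g + b for g, b in zip(guest_vocals, backing)]


def host_sync_broadcast(received_mix, host_vocal_at, latency):
    """Host device: sing against the received mix and mix again for broadcast.

    received_mix[t] is the guest mix at musical position t. It reaches the
    host `latency` ticks later in wall-clock time. Because the host sings
    against what is audibly rendered, host vocals line up with musical
    position t, not with wall-clock time.
    """
    out = []
    for t, sample in enumerate(received_mix):
        wall_clock = t + latency  # when the host actually hears position t
        out.append((wall_clock, sample + host_vocal_at(t)))
    return out


backing = [1, 2, 3, 4]
guest = [10, 20, 30, 40]
host = lambda t: 100 * (t + 1)  # stand-in for host vocals at position t

near = host_sync_broadcast(guest_mix(guest, backing), host, latency=0)
far = host_sync_broadcast(guest_mix(guest, backing), host, latency=7)
```

In this toy model the mixed content is identical for both latencies; only the wall-clock time at which it becomes available shifts, which is why the lag is not apparent to the broadcast audience.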
  • Performance synchronized video may be captured and included in a combined audiovisual performance that constitutes the apparently live broadcast, wherein visuals may be based, at least in part, on time-varying, computationally-defined audio features extracted from (or computed over) captured vocal audio. In some cases or embodiments, these computationally-defined audio features are selective, over the course of a coordinated audiovisual mix, for particular synchronized video of one or more of the contributing vocalists (or prominence thereof).
  • vocal audio can be pitch-corrected in real time at the guest performer’s device (or more generally, at a portable computing device such as a mobile phone, personal digital assistant, laptop computer, notebook computer, pad-type computer or netbook, or on a content or media application server) in accord with pitch correction settings.
  • pitch correction settings code a particular key or scale for the vocal performance or for portions thereof.
  • pitch correction settings include a score-coded melody and/or harmony sequence supplied with, or for association with, the lyrics and backing tracks. Harmony notes or chords may be coded as explicit targets or relative to the score coded melody or even actual pitches sounded by a vocalist, if desired.
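As a rough illustration of pitch correction settings that code a key or scale, the sketch below snaps a detected vocal frequency to the nearest note of a chosen scale. It assumes equal temperament with A4 = 440 Hz and a major scale by default; the function names and the `tonic_pc` parameter are illustrative assumptions, and a real-time corrector would also need pitch detection and resynthesis, which are omitted here.

```python
import math

MAJOR = (0, 2, 4, 5, 7, 9, 11)  # scale degrees as semitone offsets from the tonic


def hz_to_midi(freq_hz):
    """Convert a frequency to a (fractional) MIDI note number."""
    return 69 + 12 * math.log2(freq_hz / 440.0)


def midi_to_hz(note):
    """Convert a MIDI note number to a frequency in Hz."""
    return 440.0 * 2 ** ((note - 69) / 12)


def snap_to_scale(freq_hz, tonic_pc=0, scale=MAJOR):
    """Return the frequency of the in-scale note closest to freq_hz.

    tonic_pc is the pitch class of the key (0 = C, 9 = A, and so on).
    """
    midi = hz_to_midi(freq_hz)
    base = math.floor(midi)
    # Search one octave either side of the sung pitch for in-scale notes.
    candidates = [n for n in range(base - 12, base + 13)
                  if (n - tonic_pc) % 12 in scale]
    target = min(candidates, key=lambda n: abs(n - midi))
    return midi_to_hz(target)
```

For example, in C major a slightly sharp A4 at 450 Hz would be pulled back to 440 Hz, while 480 Hz is closer to B4 and would be moved up to about 493.88 Hz.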
  • a content server or service for the host can further mediate coordinated performances by manipulating and mixing the uploaded audiovisual content of multiple contributing vocalists for further broadcast or other dissemination.
  • uploads may include pitch-corrected vocal performances (with or without harmonies), dry (i.e., uncorrected) vocals, and/or control tracks of user key and/or pitch correction selections, etc.
  • Synthesized harmonies and/or additional vocals may also be included in the mix.
  • Geocoding of captured vocal performances (or individual contributions to a combined performance) and/or listener feedback may facilitate animations or display artifacts in ways that are suggestive of a performance or endorsement emanating from a particular geographic locale on a user-manipulable globe. In this way, implementations of the described functionality can transform otherwise mundane mobile devices and living room or entertainment systems into social instruments that foster a unique sense of global connectivity, collaboration and community.
  • a collaboration method for a livestream broadcast of a coordinated audiovisual work of first and second performers captured at respective geographically-distributed, first and second devices includes:
  • the performance synchronized video included in the received media encoding is captured in connection with the vocal capture at the first device, the method further includes capturing, at the second device, video that is performance synchronized with the captured second performer vocals, and the audiovisual broadcast mix is an audiovisual mix of captured audio and video of at least the first and second performers.
  • the method further includes capturing, at the second device, second performer video that is performance synchronized with the captured second performer vocals; and compositing the second performer video with video for the first performer in the supplied audiovisual broadcast mix.
  • the first and second performer video compositing includes, for at least some portions of the supplied audiovisual broadcast mix, a computational blurring of image frames of first and second performer video at a visual boundary therebetween.
  • the method further includes dynamically varying in a course of the audiovisual broadcast mix relative visual prominence of one or the other of the first and second performers.
  • the dynamic varying is, at least partially, in correspondence with time varying vocal part codings in an audio score corresponding to and temporally synchronized with the backing audio track. In some cases or embodiments, the dynamic varying is, at least partially, based on evaluation of a computationally defined audio feature of either or both of the first and second performer vocals.
  • the first device is associated with the second device as a current livestream guest, and the second device operates as a current livestream host, the current livestream host controlling association and dissociation of particular devices from the audience as the current livestream guest.
  • the current livestream host selects from a queue of requests from the audience to associate as the current livestream guest.
  • the first device operates in a livestream guest role and the second device operates in a livestream host role, the method further comprising either or both of: the second device releasing the livestream host role for assumption by another device; and the second device passing the livestream host role to a particular device selected from a set comprising the first device and the audience.
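One way to picture prominence driven by a computationally defined audio feature is sketched below. It assumes RMS energy over a short frame of vocal samples as the feature and a clamped share of screen width as the prominence measure; both choices, and all names, are assumptions made for illustration, since the text does not specify a particular feature or mapping.

```python
import math


def rms(frame):
    """Root-mean-square level of a frame of audio samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def first_performer_share(first_frame, second_frame, floor=0.25):
    """Fraction of the composite frame width given to the first performer.

    The louder vocalist gets more visual prominence, but neither performer
    drops below `floor` of the width. Silence on both sides splits evenly.
    """
    a, b = rms(first_frame), rms(second_frame)
    if a + b == 0:
        return 0.5
    share = a / (a + b)
    return min(1 - floor, max(floor, share))
```

A compositor could evaluate this once per video frame (ideally with smoothing, which is left out here) and resize the two performers' viewports accordingly.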
  • the method further includes accessing a machine readable encoding of musical structure that includes at least musical section boundaries coded for temporal alignment with the vocal audio captured at the first and second devices; and applying a first visual effect schedule to at least a portion of audiovisual broadcast mix, wherein the applied visual effect schedule encodes differing visual effects for differing musical structure elements of the first audiovisual performance encoding and provides visual effect transitions in temporal alignment with at least some of the coded musical section boundaries. In some cases or embodiments, the differing visual effects encoded by the applied visual effect schedule include for a given element thereof, one or more of: a particle-based effect or lens flare; transitions between distinct source videos; animations or motion of a frame within a source video; vector graphics or images of patterns or textures; and color, saturation or contrast. In some cases or embodiments, the associated musical structure encodes musical sections of differing types; and the applied visual effect schedule defines differing visual effects for different ones of the encoded musical sections.
  • the associated musical structure encodes events or transitions; and the applied visual effect schedule defines differing visual effects for different ones of the encoded events or transitions.
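A visual effect schedule keyed to coded section boundaries might be looked up as in the following sketch. The section list, the effect names and the `effect_at` helper are all hypothetical; the only idea carried over from the text is that effects change at section boundaries and differ by section type.

```python
from bisect import bisect_right

# Hypothetical musical structure: (start time in seconds, section type).
SECTIONS = [(0.0, "intro"), (12.5, "verse"), (40.0, "chorus"), (65.0, "verse")]

# Hypothetical schedule: one visual effect per section type.
SCHEDULE = {
    "intro": "vector_pattern",
    "verse": "color_grade",
    "chorus": "particle_burst",
}


def effect_at(t, sections=SECTIONS, schedule=SCHEDULE):
    """Return the scheduled effect for playback time t, or None before the start."""
    starts = [start for start, _ in sections]
    i = bisect_right(starts, t) - 1  # index of the last section starting at or before t
    if i < 0:
        return None
    return schedule.get(sections[i][1])
```

Because the lookup is driven by the coded boundaries, the effect switches exactly at 40.0 seconds when the chorus begins, and both verses share the same treatment.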
  • the associated musical structure encodes group parts, and the applied visual effect schedule is temporally selective for particular
  • the method is performed, at least in part, on a handheld mobile device communicatively coupled to a content server or service platform in some embodiments.
  • the method is embodied, at least in part, as a computer program product encoding of instructions executable on the second device as part of a cooperative system including a content server or service platform to which a plurality of geographically-distributed, network-connected, vocal capture devices, including the second device, are communicatively coupled.
  • dissemination of an apparently live broadcast of a joint performance of geographically-distributed first and second performers includes first and second devices coupled by a communication network with non-negligible peer-to-peer latency for transmission of audiovisual content.
  • the first device is communicatively coupled to supply to the second device an audiovisual performance mixed with a backing audio track and including (1) vocal audio of the first performer captured against the backing audio track and (2) video that is performance synchronized therewith.
  • the second device is communicatively configured to receive a media encoding of the mixed audiovisual performance and to audibly render at least audio portions of the mixed audiovisual performance, to capture thereagainst vocal audio of the second performer, and to mix the captured second performer vocal audio with the received mixed audiovisual performance for transmission as the apparently live broadcast.
  • the second device is further configured to capture second performer video that is performance synchronized with the captured second performer vocals and to composite the second performer video with video for the first performer in the supplied audiovisual broadcast mix.
  • the first and second performer video compositing includes, for at least some portions of the supplied audiovisual broadcast mix, a computational blurring of image frames of first and second performer video at a visual boundary therebetween.
  • the first and second performer video compositing includes dynamically varying in a course of the audiovisual broadcast mix relative visual prominence of one or the other of the first and second performers. In some cases or embodiments, the dynamic varying is, at least partially, in correspondence with time varying vocal part codings in an audio score corresponding to and temporally synchronized with the backing audio track. In some cases or embodiments, the dynamic varying is, at least partially, based on evaluation of a computationally defined audio feature of either or both of the first and second performer vocals.
  • the first device is associated with the second device as a current livestream guest
  • the second device operates as a current livestream host, the current livestream host controlling association and dissociation of particular devices from the audience as the current livestream guest.
  • the current livestream host selects from a queue of requests from the audience to associate as the current livestream guest.
  • the system further includes a video compositor that accesses a machine readable encoding of musical structure including at least musical section boundaries coded for temporal alignment with the vocal audio captured at the first and second devices and that applies a first visual effect schedule to at least a portion of audiovisual broadcast mix, wherein the applied visual effect schedule encodes differing visual effects for differing musical structure elements of the first audiovisual performance encoding and provides visual effect transitions in temporal alignment with at least some of the coded musical section boundaries.
  • the video compositor is hosted either on the second device or on a content server or service platform through which the apparently live performance is supplied.
  • a different vocal part selection is presented for each of the performers; and responsive to, and in correspondence with, gestures by either or both of the first and second performers at the respective geographically-distributed devices, the vocal part selections are updated, wherein assignment of a particular vocal part selection to the respective first or second performers is changeable until one or the other of the first and second performers gestures a start of vocal capture on a respective
  • a user interface method for social media includes (1) as part of a user interface visual on a touchscreen display of a client device, presenting on the touchscreen display, live video captured using a camera of the client device; (2) responsive to a first touchscreen gesture by a user of the client device, initiating capture of a snippet of the live video and presenting, as part of the user interface visual, a progress indication in correspondence with an accreting capture of the snippet; and (3) responsive to a second touchscreen gesture by the user of the client device, transmitting the captured snippet to a network-coupled service platform as a posting in multiuser social media thread.
  • the method further includes presenting the multiuser social media thread on the touchscreen display, the presented multiuser social media thread including the captured snippet together with posted temporally-ordered content from other users received via the network-coupled service platform, wherein the posted content from at least one other user includes one or more of text and captured snippet of video from the at least one other user.
  • the captured snippet is a fixed-length snippet, and the method further includes visually updating the progress indication in correspondence with the portion of the fixed-length snippet captured.
  • the first touchscreen gesture is a maintained contact, by the user, with a first visually presented feature on the touchscreen display, and wherein the second touchscreen gesture includes release of the maintained contact.
  • the first touchscreen gesture is a first tap-type contact, by the user, with a first visually presented feature on the touchscreen display, and wherein the second touchscreen gesture is a second tap-type contact following the first tap-type contact on the touchscreen display.
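The hold-to-capture and release-to-post interaction can be modelled as a small state machine. The sketch below assumes a five-second fixed-length snippet and a UI loop that calls `tick` with elapsed time; the class, its method names and the returned dictionary are hypothetical stand-ins for real camera capture and network posting.

```python
class SnippetCapture:
    """Toy model of the press, progress and release interaction."""

    def __init__(self, max_seconds=5.0):
        self.max_seconds = max_seconds  # assumed fixed snippet length
        self.elapsed = 0.0
        self.state = "idle"

    def press(self):
        """First gesture: start capturing from the live camera view."""
        if self.state == "idle":
            self.state = "capturing"
            self.elapsed = 0.0

    def tick(self, dt):
        """Advance capture time; capture stops accreting at the fixed length."""
        if self.state == "capturing":
            self.elapsed = min(self.max_seconds, self.elapsed + dt)

    @property
    def progress(self):
        """Fraction of the fixed-length snippet captured so far (0.0 to 1.0)."""
        return self.elapsed / self.max_seconds

    def release(self):
        """Second gesture: post the snippet; returns a stand-in for the posting."""
        if self.state != "capturing":
            return None
        self.state = "posted"
        return {"duration": self.elapsed}
```

The tap-then-tap variant described above maps onto the same states, with the second tap calling `release` in place of lifting the finger.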
  • the method further includes presenting the multiuser social media thread on the touchscreen display in correspondence with a livestreamed audiovisual broadcast mix.
  • a method for capture of at least a portion of a coordinated multi-vocal performance of first and second performers at respective first and second geographically-distributed devices includes (1) as part of a user interface visual on either or both of the first and second devices, presenting for a current song selection, a different vocal part selection for each of the performers; and (2) responsive to, and in correspondence with, gestures by either or both of the first and second performers at the respective geographically-distributed devices, updating the vocal part selections, wherein assignment of a particular vocal part selection to the respective first or second performers is changeable until one or the other of the first and second performers gestures a start of vocal capture on a respective geographically-distributed device, whereupon a then-current assignment of particular vocal part selections to the respective first or second performers is fixed for duration of capture of the coordinated multi-vocal performance.
  • the method further includes updating the vocal part selections at the second device in correspondence with a gestured selection communicated from the first device; and supplying the first device with updates to the vocal part selections in correspondence with a gestured selection at the second device. In some cases or embodiments, the method further includes changing the current song selection and, in correspondence therewith, updating on either or both of the first and second devices the user interface visual.
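The rule that part assignments stay changeable until someone starts capture, and are fixed afterwards, can be sketched as follows. The sketch assumes exactly two performers and two parts, so selecting a part for one performer gives the other part to the peer; the class and part names are hypothetical, and synchronizing the state between the two devices is not shown.

```python
class PartAssignment:
    """Two-performer, two-part assignment that locks at start of capture."""

    PARTS = ("part_1", "part_2")

    def __init__(self):
        self.parts = {"first": "part_1", "second": "part_2"}
        self.locked = False

    def select(self, performer, part):
        """Apply a gestured selection; returns False once capture has started."""
        if self.locked or performer not in self.parts or part not in self.PARTS:
            return False
        other = "second" if performer == "first" else "first"
        remaining = self.PARTS[0] if part == self.PARTS[1] else self.PARTS[1]
        # Keep the selections different: the peer takes whichever part is left.
        self.parts[performer] = part
        self.parts[other] = remaining
        return True

    def start_capture(self):
        """Either performer starts capture; the current assignment is fixed."""
        self.locked = True
        return dict(self.parts)
```

In a two-device setting each gesture would be sent to the peer and replayed through `select` so that both user interfaces show the same assignment before it is locked.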
  • the change in current song selection is triggered by one or the other of the first and second performers on a respective one of the first and second devices.
  • the method further includes triggering the change in current song selection based on a periodic or recurring event.
  • the change in current song selection selects from a library of song selections based on one or more of coded interests and performance history of either or both of the first and second performers.
  • the method further includes receiving at the second device, a media encoding of a mixed audio performance (i) including vocal audio captured at the first device from a first one of the performers and (ii) mixed with a backing audio track for the current song selection; at the second device, audibly rendering the received mixed audio performance and capturing thereagainst vocal audio from a second one of the performers; and mixing the captured second performer vocal audio with the received mixed audio performance to provide a broadcast mix that includes the captured vocal audio of the first and second performers and the backing audio track without apparent temporal lag therebetween.
  • the method further includes, at the second device, visually presenting in correspondence with the audible rendering, lyrics and score-coded note targets for the current song selection, wherein the visually presented lyrics and note targets correspond to the assignment of a particular vocal part at the start of vocal capture.
  • the received media encoding includes video that is performance synchronized with the captured first performer vocals
  • the method further includes capturing, at the host device, video that is performance synchronized with the captured second performer vocals
  • the broadcast mix is an audiovisual mix of captured audio and video of at least the first and second performers.
  • the method further includes capturing, at the host device, second performer video that is performance synchronized with the captured second performer vocals; and compositing the second performer video with video for the first performer in the supplied audiovisual broadcast mix.
  • the method further includes supplying the broadcast mix to a service platform configured to livestream the broadcast mix to plural recipient devices constituting an audience.
  • FIG. 1 depicts information flows amongst illustrative mobile phone-type portable computing devices in a host and guest configuration for livestreaming a duet-type group audiovisual performance in accordance with some embodiments of the present invention(s).
  • FIG. 2 is a flow graph depicting the flow of audio signals captured and processed at respective guest and host devices coupled in a “host sync” peer-to-peer configuration for generation of a group audiovisual performance livestream in accordance with some embodiments of the present invention(s).
  • FIG. 3 is a flow graph depicting the flow of audio signals captured and processed at respective guest and host devices coupled in a “shared latency” peer-to-peer configuration for generation of a group audiovisual performance livestream in accordance with some embodiments of the present invention(s).
  • FIG. 4 is a flow diagram illustrating, for an audiovisual performance captured at a guest or host device in accordance with some embodiments of the present invention(s), optional real time continuous pitch-correction and harmony generation signal flows that may be performed based on score-coded pitch correction settings.
  • FIG. 5 is a functional block diagram of hardware and software components executable at an illustrative mobile phone-type portable computing device to facilitate processing and communication of a captured audiovisual performance for use in a multi-vocalist
  • FIG. 6 illustrates features of a mobile device that may serve as a platform for execution of software implementations of at least some audiovisual performance capture and/or livestream performance devices in accordance with some embodiments of the present invention(s).
  • FIGs. 7 A and 7B illustrate a video presentation of livestream content for which a
  • FIGs. 8A and 8B illustrate a selfie chat interaction mechanism in which a capture viewport is presented on screen and a user interaction mechanic is supported whereby a user holds a touchscreen-presented button or other feature to capture a video snippet and releases to post the video snippet in a social media interaction in accordance with some embodiments of the present invention(s).
  • FIGs. 9A, 9B and 9C illustrate a user part selection and coordination mechanism in which a song roulette and/or gestured selections by user-performers on geographically-distributed devices provide corresponding song and/or part selections on peer devices for a livestream performance in accordance with some embodiments of the present invention(s).
  • FIG. 10 is a network diagram that illustrates cooperation of exemplary devices in accordance with some embodiments of the present invention(s). Skilled artisans will appreciate that elements or features in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions or prominence of some of the illustrated elements or features may be
  • Audiovisual performances including vocal music are captured and coordinated with performances of other users in ways that can create compelling user and listener experiences.
  • duets with a host performer may be supported in a sing-with-the-artist style audiovisual livestream in which aspiring vocalists request or queue particular songs for a live radio show entertainment format.
  • the developed techniques provide a communications latency-tolerant mechanism for synchronizing vocal performances captured at geographically-separated devices (e.g., at globally-distributed, but network-connected mobile phones or tablets or at audiovisual capture devices geographically separated from a live studio).
  • livestream content will typically include performance-synchronized video captured in connection with vocals.
  • network-connected mobile phones are illustrated as audiovisual capture devices, it will be appreciated based on the description herein that audiovisual capture and viewing devices may include suitably-configured computers, smart TVs and/or living room style set-top box configurations, and even intelligent virtual assistance devices with audio and/or audiovisual capture devices or capabilities.
  • audiovisual capture applications need not be limited to vocal duets, but may be adapted to other forms of group performance in which one or more successive performances are accreted to a prior performance to produce a livestream.
  • the vocal performances of collaborating contributors are captured (together with performance synchronized video) in the context of a karaoke-style presentation of lyrics and in correspondence with audible renderings of a backing track. In some cases, vocals (and typically synchronized video) are captured as part of a live or unscripted performance with vocal interactions (e.g., a duet or dialog) between collaborating contributors.
  • a captured audiovisual performance of a guest performer on a “live show” internet broadcast of a host performer could include a guest + host duet sung in apparent real-time synchrony.
  • the host could be a performer who has popularized a particular musical performance.
  • the guest could be an amateur vocalist given the opportunity to sing “live” (though remote) with the popular artist or group “in studio” as (or with) the show’s host.
  • the host performs in apparent synchrony with (though temporally lagged from, in an absolute sense) the guest and the apparently synchronously performed vocals are captured and mixed with the guest’s contribution for broadcast or dissemination.
  • the result is an apparently live interactive performance (at least from the perspective of the host and the recipients, listeners and/or viewers of the disseminated or broadcast performance).
  • the non-negligible network communication latency from guest-to-host is masked, it will be understood that latency exists and is tolerated in the host-to-guest direction.
  • host-to-guest latency, while discernible (and perhaps quite noticeable) to the guest, need not be apparent in the apparently live broadcast or other dissemination. It has been discovered that lagged audible rendering of host vocals (or more generally, of the host’s captured audiovisual performance) need not psychoacoustically interfere with the guest’s performance.
  • some embodiments in accordance with the present invention(s) may provide host/guest control logic that allows a host to “pass the mic” such that a new user (in some cases a user selected by the current host and, in other cases, a user who “picks up the mic” after the current host “drops the mic”) may take over as host.
  • some embodiments in accordance with the present invention(s) may provide host/guest control logic that queues guests (and/or aspiring hosts) and automatically assigns queued users to appropriate roles.
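The queueing and “pass the mic” behavior described above can be sketched as a small role queue. This is a minimal illustration only: the class name `RoleQueue`, its methods, and the first-come-first-served policy are assumptions made for the example and are not taken from the disclosure.

```python
from collections import deque


class RoleQueue:
    """Hypothetical host/guest role assignment; the policy here is an assumption."""

    def __init__(self, host):
        self.host = host        # current host controls the livestream
        self.guest = None       # current guest, if any
        self.waiting = deque()  # queued guests and aspiring hosts, oldest first

    def join(self, user):
        """Queue a user and fill the guest slot if it is empty."""
        self.waiting.append(user)
        self._fill_guest()

    def _fill_guest(self):
        if self.guest is None and self.waiting:
            self.guest = self.waiting.popleft()

    def pass_the_mic(self, new_host=None):
        """Hand the host role to a chosen user, or to whoever is next in line."""
        if new_host is None:
            # Nobody was chosen: the current guest (else the next queued user)
            # picks up the mic that the host dropped.
            if self.guest is not None:
                new_host, self.guest = self.guest, None
            elif self.waiting:
                new_host = self.waiting.popleft()
            else:
                return  # nobody is available, so the host keeps the mic
        elif new_host == self.guest:
            self.guest = None
        elif new_host in self.waiting:
            self.waiting.remove(new_host)
        self.host = new_host
        self._fill_guest()


queue = RoleQueue("ana")
for name in ("ben", "cy", "dee"):
    queue.join(name)
queue.pass_the_mic()  # the guest "ben" becomes host and "cy" moves up to guest
```

In this sketch the outgoing host simply returns to the audience; a real deployment might instead re-queue that user.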
  • vocal audio of individual host- and guest-role performers is captured together with performance synchronized video in a karaoke-style user interface framework and coordinated with audiovisual contributions of the other users to form duet-style or glee club-style group audiovisual performances.
  • performances of individual users may be captured (together with performance synchronized video) on mobile devices, television-type display and/or set-top box equipment in the context of karaoke-style presentations of lyrics in correspondence with audible renderings of a backing track. In some cases or embodiments, score-coded continuous pitch correction may be provided as well as user selectable audio and/or video effects.
  • karaoke-style vocal performance capture using portable handheld devices provides illustrative context.
  • embodiments of the present invention are not limited thereto, pitch-corrected, karaoke-style, vocal capture using mobile phone-type and/or television-type audiovisual equipment provides a useful descriptive context.
  • iPhone™ handhelds available from Apple Inc. or, more generally, handhelds 101A, 101B operating as guest and host devices, respectively
  • execute software that operates in coordination with a content server 110 to provide vocal capture.
  • the configuration optionally provides continuous real-time, score-coded pitch correction and harmonization of the captured vocals.
  • Performance synchronized video may also be captured using a camera provided by, or in connection with, a computer, a television or other audiovisual equipment (not specifically shown) or connected set-top box equipment such as an Apple TV™ device. In some embodiments, performance synchronized video may be captured using an on-board camera provided by a handheld paired with connected set-top box equipment.
  • a current host user of current host device 101B at least partially controls the content of a live stream 122 that is buffered for, and streamed to, an audience on devices 120A, 120B ... 120N.
  • a current guest user of current guest device 101A contributes to the group audiovisual performance mix 111 that is supplied (eventually via content server 110) by current host device 101B as live stream 122.
  • devices 120A, 120B ... 120N and, indeed, current guest and host devices 101A, 101B are, for simplicity, illustrated as handheld devices such as mobile phones, persons of skill in the art having benefit of the present disclosure will appreciate that any given member of the audience may receive livestream 122 on any suitable computer, smart television, tablet, via a set-top box or other streaming media capable client.
  • Content that is mixed to form group audiovisual performance mix 111 is captured, in the illustrated configuration, in the context of karaoke-style performance capture wherein lyrics 102, optional pitch cues 105 and, typically, a backing track 107 are supplied from content server 110 to either or both of current guest device 101A and current host device 101B.
  • a current host typically exercises ultimate control over the live stream, e.g., by selecting a particular user (or users) from the audience to act as the current guest(s), by selecting a particular song from a request queue (and/or vocal parts thereof for particular users), and/or by starting, stopping or pausing the group AV performance.
  • the guest user may (in some embodiments) start/stop/pause the roll of backing track 107A for local audible rendering and otherwise control the content of guest mix 106 (backing track roll mixed with captured guest audiovisual content) supplied to current host device 101B.
  • Roll of lyrics 102A and optional pitch cues 105A at current guest device 101A is in temporal correspondence with the backing track 107A, and is likewise subject to start/stop/pause control by the current guest.
  • backing audio and/or video may be rendered from a media store such as an iTunes™ library resident on or accessible from a handheld, set-top box, etc.
  • song requests 132 are audience-sourced and conveyed by signaling paths to content selection and guest queue control logic 112 of content server 110.
  • controls 131 and guest controls 133 are illustrated as bi-directional signaling paths. Other queuing and control logic configurations consistent with the operations described, including host or guest controlled queueing and/or song selection, will be appreciated based on the present disclosure.
  • current host device 101B receives and audibly renders guest mix 106 as a backing track against which the current host’s audiovisual performance is captured at current host device 101B.
  • Roll of lyrics 102B and optional pitch cues 105B at current host device 101B is in temporal correspondence with the backing track, here guest mix 106.
  • To facilitate synchronization to the guest mix 106 in view of temporal lag in the peer-to-peer communications channel between current guest
  • marker beacons may be encoded in the guest mix to provide the appropriate phase control of lyrics 102B and optional pitch cues 105B on screen.
  • phase analysis of any backing track 107A included in guest mix 106 may be used to provide the appropriate phase control of lyrics 102B and optional pitch cues 105B on screen at current host device 101B.
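To make the phase control of on-screen lyrics concrete, the following sketch assumes that a marker beacon decoded from the guest mix carries a song position together with the sample index at which it was heard. The function names, the beacon representation and the timed-lyrics list are all hypothetical; the disclosure does not specify a beacon format.

```python
import bisect


def position_from_beacon(beacon_song_s, beacon_sample, current_sample, sample_rate):
    """Song position (seconds) of the sample now being rendered from the guest mix.

    beacon_song_s is the song position the (hypothetical) beacon encodes, and
    beacon_sample is the index in the received stream where it was decoded.
    """
    return beacon_song_s + (current_sample - beacon_sample) / sample_rate


def current_lyric(timed_lyrics, song_position_s):
    """Return the lyric line whose start time most recently passed, or None."""
    starts = [start for start, _ in timed_lyrics]
    index = bisect.bisect_right(starts, song_position_s) - 1
    return timed_lyrics[index][1] if index >= 0 else None


lyrics = [(0.0, "first line"), (4.0, "second line"), (8.5, "third line")]
# One second (44100 samples) after a beacon that marked song position 4.0 s.
position = position_from_beacon(4.0, 441000, 485100, 44100)
line = current_lyric(lyrics, position)
```

Because the position is derived from the received stream rather than from a local clock, the lyric roll follows the guest mix regardless of how long that mix spent in transit.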
  • temporal lag in the peer-to-peer communications channel between current guest device 101A and current host device 101B affects both guest mix 106 and communications in the opposing direction (e.g., host mic 103C signal encodings).
  • Any of a variety of communications channels may be used to convey audiovisual signals and controls between current guest device 101A and current host device 101B, as well as between the guest and host devices 101A, 101B and content server 110 and between audience devices 120A, 120B ... 120N and content server 110.
  • telecommunications carrier wireless facilities and/or wireless local area networks and respective wide-area network gateways may provide
  • User vocals 103A and 103B are captured at respective handhelds 101A, 101B, and may be optionally pitch-corrected continuously and in real-time and audibly rendered mixed with the locally-appropriate backing track (e.g., backing track 107A at current guest device 101A and guest mix 106 at current host device 101B) to provide the user with an improved tonal quality rendition of his/her own vocal performance.
  • Pitch correction is typically based on score-coded note sets or cues (e.g., the pitch and harmony cues 105A, 105B visually displayed at current guest device 101A and at current host device 101B, respectively), which provide continuous pitch-correction algorithms executing on the respective device with performance-synchronized sequences of target notes in a current key or scale.
  • score-coded harmony note sequences provide pitch-shifting algorithms with additional targets (typically coded as offsets relative to a lead melody note track and typically scored only for selected portions thereof) for pitch-shifting to harmony versions of the user’s own captured vocals.
  • pitch correction settings may be characteristic of a particular artist such as the artist that performed vocals associated with the particular backing track.
  • lyrics, melody and harmony track note sets and related timing and control information may be encapsulated in an appropriate container or object (e.g., in a Musical Instrument Digital Interface, MIDI, or JavaScript Object Notation, JSON, type format) for supply together with the backing track(s).
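As an illustration of such a container, the sketch below encodes a tiny score as JSON and expands harmony entries, coded as semitone offsets relative to the lead melody, into absolute target notes. The field names (`melody`, `harmony`, `melody_index`, `offset`, and so on) are invented for this example; the actual container layout is not given in the text.

```python
import json

# Hypothetical score container: melody notes carry timing, MIDI pitch and lyric
# syllables; harmony entries are offsets relative to a melody note and are
# scored only for selected notes.
SCORE_JSON = """
{
  "key": "C major",
  "melody": [
    {"start_s": 0.0, "dur_s": 0.5, "midi": 60, "lyric": "Hel-"},
    {"start_s": 0.5, "dur_s": 0.5, "midi": 64, "lyric": "lo"}
  ],
  "harmony": [
    {"melody_index": 1, "offset": 3}
  ]
}
"""


def harmony_targets(score):
    """Expand offset-coded harmony entries into absolute, timed target notes."""
    targets = []
    for entry in score["harmony"]:
        lead = score["melody"][entry["melody_index"]]
        targets.append({
            "start_s": lead["start_s"],
            "dur_s": lead["dur_s"],
            "midi": lead["midi"] + entry["offset"],
        })
    return targets


score = json.loads(SCORE_JSON)
targets = harmony_targets(score)
```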
  • devices 101A and 101B may display lyrics and even visual cues related to target notes, harmonies and currently detected vocal pitch in correspondence with an audible performance of the backing track(s) so as to facilitate a karaoke-style vocal performance by a user.
  • m4a may be downloaded from the content server (if not already available or cached based on prior download) and, in turn, used to provide background music, synchronized lyrics and, in some situations or embodiments, score-coded note tracks for continuous, real-time pitch-correction while the user sings.
  • harmony note tracks may be score coded for harmony shifts to
  • performance together with performance synchronized video is saved locally, on the handheld device or set-top box, as one or more audiovisual files and is subsequently compressed and encoded for communication (e.g., as guest mix 106 or group audiovisual performance mix 111 or constituent encodings thereof) to content server 110 as an MPEG-4
  • MPEG-4 is one suitable standard for the coded representation
  • social network constructs may at least partially supplant or inform host control of the pairings of geographically-distributed vocalists and/or formation of geographically-distributed virtual glee clubs. For example,
  • individual vocalists may perform as current host and guest users in a manner captured (with vocal audio and performance synchronized video) and eventually streamed as a live stream 122 to an audience.
  • Such captured audiovisual content may, in turn, be distributed to social media contacts of the vocalist, members of the audience etc., via an open call mediated by the content server. In this way, the vocalists themselves,
  • members of the audience may invite others to join in a coordinated audiovisual performance, or as members of an audience or guest queue.
  • vocals captured, pitch-corrected (and possibly, though not necessarily, harmonized) may themselves be mixed (as with guest mix 106) to produce a“backing track” used to motivate, guide or frame subsequent vocal capture.
  • additional vocalists may be invited to sing a particular part (e.g., tenor, part B in duet, etc.) or simply to sing, the subsequent vocal capture device (e.g., current host device 101B in the
  • FIG. 1) may pitch shift and place their captured vocals into one or more positions within a duet or virtual glee club.
  • the backing track (e.g., backing track 107A) can provide the synchronization timeline for temporally-phased vocal capture performed at the respective peer devices (guest device 101A and host device 101B) and minimize (or eliminate) the perceived latency for the users thereof.
  • FIG. 2 is a flow graph depicting the flow of audio signals captured and processed at respective guest and host devices coupled in a “host sync” peer-to-peer configuration for generation of a group audiovisual performance livestream in accordance with some embodiments of the present invention(s).
  • FIG. 2 (and later, FIG. 3) each emphasize in the form of a teaching example the audio signal/data components and flows that provide synchronization and temporal alignment for an apparently live performance, a person of skill in the art having benefit of the present disclosure will appreciate that corresponding audio performance synchronized video may be captured (as in FIG. 1) and that corresponding video signal/data components and flows may, in like manner, be conveyed between guest and host devices, though not explicitly shown in FIGs. 2 and 3.
  • FIG. 2 illustrates how an exemplary configuration of guest and host devices 101A and 101B (recall FIG. 1) and audiovisual signal flows therebetween (e.g., guest mix 106 and host mic audio 103C) during a peer-to-peer session provide a user experience in which the host device vocalist (at host device 101B) always hears guest vocals (captured from guest mic local input 103A) and backing track 107A in perfect synchronization.
  • the audio stream (including the remote guest mic mixed with the backing track) supplied to the host device 101B and mixed as the livestreamed (122) multi-vocal performance exhibits zero (or negligible) latency to the host vocalist or to the audience.
  • a key to masking actual latencies is to include track 107A in the audio mix supplied from guest device 101A to the broadcaster’s device, host device 101B.
  • This audio flow ensures that the guest's voice and backing track are always synced from the broadcaster's point of view (based on audible rendering at host speaker or headset 240B).
  • the guest may still perceive that the broadcaster is singing slightly out of sync if the network delay is significant.
  • the multi-vocal mix of host vocals with guest vocals and the backing track is in sync when livestreamed to an audience.
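A short worked example may help show why the “host sync” flow masks latency for the host and the audience while leaving it perceptible to the guest. The function below is only an arithmetic sketch; its name, the dictionary keys and the example delays are assumptions chosen for illustration.

```python
def host_sync_offsets(guest_to_host_s, host_to_guest_s):
    """Offsets (seconds) between vocals and backing track at each listening point."""
    # The guest sings against its local backing track and sends both in one mix,
    # so the two stay locked together however long the mix spends in transit.
    guest_vocals_at_host = 0.0
    # The host sings against the received mix and the broadcast mix is formed at
    # the host, so host vocals line up with that same backing track.
    host_vocals_in_broadcast = 0.0
    # Host vocals sung at a given song position reach the guest only after the
    # guest mix has travelled to the host and the host mic signal has travelled
    # back, by which time the guest's local backing track has moved on.
    host_vocals_at_guest = guest_to_host_s + host_to_guest_s
    return {
        "guest_vocals_at_host": guest_vocals_at_host,
        "host_vocals_in_broadcast": host_vocals_in_broadcast,
        "host_vocals_at_guest": host_vocals_at_guest,
    }


# Example with assumed one-way delays of 120 ms and 80 ms.
offsets = host_sync_offsets(0.120, 0.080)
```

With these assumed delays, only the guest hears the other vocalist late, by roughly the full round trip of 200 ms.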
  • FIG. 3 is a flow graph depicting the flow of audio signals captured and processed at respective guest and host devices coupled in an alternative “shared latency” peer-to-peer configuration for generation of a group audiovisual performance livestream in accordance with some embodiments of the present invention(s). More specifically, FIG. 3 illustrates how an exemplary configuration of guest and host devices 101A and 101B (recall FIG. 1) and audiovisual signal flows therebetween (e.g., guest mix 106 and host mic audio 103C) during a peer-to-peer session combine to limit the guest and host vocalist’s perception of the other vocalist’s audio delay to just a one-way lag (nominally one half of the full audio round-trip-travel delay) behind the backing track.
  • the guest device 101A sends periodic timing messages to the host containing the current position in the song, and the host device 101B adjusts the playback position of the song accordingly.
  • NTP network time protocol
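One plausible reading of the timing messages and the reference to network time protocol is sketched below: the host estimates the offset between its clock and the guest's clock using the standard NTP four-timestamp calculation, then projects the guest's reported song position forward to the present. The function names and the message fields are assumptions; the text does not describe the message format or the exact adjustment rule.

```python
def clock_offset_and_delay(t0, t1, t2, t3):
    """Standard NTP-style estimate from one request/response exchange.

    t0: host sends request (host clock)    t1: guest receives it (guest clock)
    t2: guest sends reply (guest clock)    t3: host receives it (host clock)
    Returns (offset, delay), where offset is guest clock minus host clock.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


def guest_position_now(reported_position_s, sent_at_guest_s, host_now_s, offset_s):
    """Estimate the guest's current song position when a timing message arrives.

    reported_position_s and sent_at_guest_s come from the (hypothetical) timing
    message; host_now_s is the host clock at the moment of adjustment.
    """
    elapsed = (host_now_s + offset_s) - sent_at_guest_s
    return reported_position_s + max(elapsed, 0.0)


offset, delay = clock_offset_and_delay(100.00, 105.05, 105.06, 100.11)
# The guest reported song position 30.0 s at guest time 205.0; the host clock reads 200.08.
target_position = guest_position_now(30.0, 205.0, 200.08, offset)
```

Setting the host's playback position to this estimate would keep both backing tracks running at about the same absolute time, which is consistent with each vocalist hearing the other only one one-way lag late.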
  • performance synchronized video may also be conveyed in audiovisual data encodings that include the more explicitly illustrated audio and in flows analogous to those more explicitly illustrated for audio signal/data components.
  • audio signals are captured, conveyed and mixed
  • video captured at respective devices is composited in correspondence with the audio with which it is performance synchronized as a temporal alignment and distributed performer synchronization baseline.
  • Video compositing is typically performed at the host device, but in some cases may be performed using facilities of a content server or service platform (recall FIG. 1).
  • computer readable encodings of musical structure may guide the compositing function, affecting selection, visual placement and/or prominence of performance synchronized video in the apparently live performance.
  • selection, visual placement and/or prominence may be in correspondence with score-coded musical structure such as group (or duet A/B) parts, musical sections, melody/harmony positions of captured or pitch-shifted audio and/or with computationally determined audio features of the vocal audio captured at guest or host device or both.
  • FIG. 4 is a flow diagram illustrating real-time continuous score-coded pitch-correction and harmony generation for a captured vocal performance in accordance with some
  • a user/vocalist (e.g., the guest or host vocalist at guest device 101A or host device 101B, recall FIG. 1) sings along with a backing track karaoke style.
  • the operant backing track is backing track 107A
  • the operant backing track is guest mix 106 which, at least in embodiments employing the “host sync” method, conveys the original backing track mixed with guest vocals. In either case, vocals captured (251) from a microphone input 201 may optionally be continuously pitch-corrected (252) and harmonized (255) in real-time for mix (253) with the operant backing track audibly rendered at one or more acoustic transducers 202.
  • Both pitch correction and added harmonies are chosen to correspond to a score 207, which in the illustrated configuration, is wirelessly communicated (281) to the device(s) (e.g., from content server 110 to guest device 101A or via guest device 101A to host device 101B, recall FIG. 1) on which vocal capture and pitch-correction is to be performed, together with lyrics 208 and an audio encoding of the operant backing track 209 (e.g., backing track 107A or guest mix 106).
  • content selection and guest queue control logic 112 is selective for melody or harmony note selections at the respective guest and host devices 101A and 101B
  • the note (in a current scale or key) that is closest to that sounded by the user/vocalist is determined based on score 207. While this closest note may typically be a main pitch corresponding to the score-coded vocal melody, it need not be. Indeed, in some cases, the user/vocalist may intend to sing harmony and the sounded notes may more closely approximate a harmony track.
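The closest-note selection described above can be sketched as follows, assuming pitches are compared on the MIDI semitone scale and harmony targets are given as semitone offsets from the current melody note. The function names and the candidate set are assumptions for illustration, not the disclosed algorithm.

```python
import math


def hz_to_midi(frequency_hz):
    """Fractional MIDI note number for a frequency (A4 = 440 Hz = note 69)."""
    return 69.0 + 12.0 * math.log2(frequency_hz / 440.0)


def midi_to_hz(note):
    """Frequency in Hz of a MIDI note number."""
    return 440.0 * 2.0 ** ((note - 69.0) / 12.0)


def closest_target(detected_hz, melody_note, harmony_offsets=()):
    """Pick the score-coded note nearest the sounded pitch.

    Candidates are the current melody note plus any harmony notes coded as
    semitone offsets from it. Returns (target_note, pitch_shift_ratio), where
    the ratio is the factor by which the sounded pitch would be shifted.
    """
    candidates = [melody_note] + [melody_note + off for off in harmony_offsets]
    sounded = hz_to_midi(detected_hz)
    target = min(candidates, key=lambda note: abs(note - sounded))
    return target, midi_to_hz(target) / detected_hz


# A slightly sharp A4 snaps to the melody note; a pitch near C#5 snaps to the harmony.
melody_target, melody_ratio = closest_target(450.0, 69, harmony_offsets=(4,))
harmony_target, harmony_ratio = closest_target(540.0, 69, harmony_offsets=(4,))
```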
  • handheld device 101 (e.g., current guest device 101A or current host device 101B, recall FIG. 1)
  • handheld device 101 may itself capture both vocal audio and performance
  • FIG. 5 illustrates basic signal processing flows (350) in accord with certain implementations suitable for a mobile phone-type handheld device 101 to capture vocal audio and performance synchronized video, to generate pitch-corrected and optionally harmonized vocals for audible rendering (locally and/or at a remote target device), and to communicate with a content server or service platform 110.
  • pitch-detection and pitch-correction have a rich technological history in the music and voice coding arts. Indeed, a wide variety of feature picking, time-domain and even frequency-domain techniques have been employed in the art and may be employed in some embodiments in accord with the present invention.
  • pitch-detection methods calculate an average magnitude difference function (AMDF) and execute logic to pick a peak that corresponds to an estimate of the pitch period.
  • PSOLA pitch shift overlap add
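For readers unfamiliar with the average magnitude difference function, the following is a minimal pure-Python sketch of AMDF-based pitch estimation on a single frame. With the AMDF as defined here the pitch period appears as a deep valley (a peak if the function is inverted), so the sketch searches for the first deep valley. The search range, the 10% threshold and the function name are assumptions and not parameters from the disclosure.

```python
import math


def amdf_pitch(samples, sample_rate, f_min=80.0, f_max=1000.0):
    """Estimate pitch in Hz from one frame using the AMDF; None for silence."""
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), len(samples) - 1)
    values = []
    for lag in range(lag_min, lag_max + 1):
        n = len(samples) - lag
        # Average absolute difference between the frame and itself shifted by lag.
        values.append(sum(abs(samples[i] - samples[i + lag]) for i in range(n)) / n)
    if not values or max(values) == 0.0:
        return None
    lo, hi = min(values), max(values)
    threshold = lo + 0.1 * (hi - lo)
    # Take the first lag that dips below the threshold, then walk to the bottom
    # of that valley; preferring the first valley avoids octave-down errors at
    # multiples of the true period.
    k = next(i for i, v in enumerate(values) if v <= threshold)
    while k + 1 < len(values) and values[k + 1] < values[k]:
        k += 1
    return sample_rate / (lag_min + k)


# A 200 Hz sine sampled at 8 kHz has a period of exactly 40 samples.
frame = [math.sin(2.0 * math.pi * 200.0 * i / 8000.0) for i in range(1024)]
estimate = amdf_pitch(frame, 8000)
```

The estimate is quantized to whole-sample lags; practical detectors usually interpolate around the valley for finer resolution.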
  • FIG. 6 illustrates features of a mobile device that may serve as a platform for execution of software implementations in accordance with some embodiments of the present invention. More specifically, FIG. 6 is a block diagram of a mobile device 400 that is generally consistent with commercially-available versions of an iPhone™ mobile digital device.
  • embodiments of the present invention are certainly not limited to iPhone deployments or applications (or even to iPhone-type devices), the iPhone device platform, together with its rich complement of sensors, multimedia facilities, application programmer interfaces and wireless application delivery model, provides a highly capable platform on which to deploy certain implementations. Based on the description herein, persons of ordinary skill in the art will appreciate a wide range of additional mobile device platforms that may be suitable (now or hereafter) for a given implementation or deployment of the inventive techniques described herein.
  • mobile device 400 includes a display 402 that can be sensitive to haptic and/or tactile contact with a user.
  • Touch-sensitive display 402 can support multi-touch features, processing multiple simultaneous touch points, including processing data related to the pressure, degree and/or position of each touch point. Such processing facilitates gestures and interactions with multiple fingers and other interactions.
  • touch-sensitive display technologies can also be used, e.g., a display in which contact is made using a stylus or other pointing device.
  • mobile device 400 presents a graphical user interface on the touch-sensitive display 402, providing the user access to various system objects and for conveying information.
  • the graphical user interface can include one or more display objects 404, 406
  • the display objects 404, 406 are graphic representations of system objects. Examples of system objects include device functions, applications, windows, files, alerts, events, or other identifiable system objects.
  • applications when executed, provide at least some of the digital acoustic functionality described herein.
  • the mobile device 400 supports network connectivity including, for example, both mobile radio and wireless internetworking functionality to enable the user to travel with the mobile device 400 and its associated network-enabled functions.
  • the mobile device 400 can interact with other devices in the vicinity (e.g., via Wi-Fi, Bluetooth, etc.).
  • mobile device 400 can be configured to interact with peers or a base station for one or more devices. As such, mobile device 400 may grant or deny network access to other wireless devices.
  • Mobile device 400 includes a variety of input/output (I/O) devices, sensors and transducers.
  • a speaker 460 and a microphone 462 are typically included to facilitate audio, such as the capture of vocal performances and audible rendering of backing tracks and mixed pitch-corrected vocal performances as described elsewhere herein.
  • speaker 460 and microphone 462 may provide appropriate transducers for techniques described herein.
  • An external speaker port 464 can be included to facilitate hands-free voice functionalities, such as speaker phone functions.
  • An audio jack 466 can also be included for use of headphones and/or a microphone.
  • an external speaker and/or microphone may be used as a transducer for the techniques described herein.
  • a proximity sensor 468 can be included to facilitate the detection of user positioning of mobile device 400.
  • an ambient light sensor 470 can be utilized to facilitate adjusting brightness of the touch-sensitive display 402.
  • An accelerometer 472 can be utilized to detect movement of mobile device 400, as indicated by the directional arrow 474. Accordingly, display objects and/or media can be presented according to a detected orientation, e.g., portrait or landscape.
  • mobile device 400 may include circuitry and sensors for supporting a location determining capability, such as that provided by the global positioning system (GPS) or other positioning systems (e.g., systems using Wi-Fi access points, television signals, cellular grids, Uniform Resource Locators (URLs)) to facilitate geocodings described herein.
  • Mobile device 400 also includes a camera lens and imaging sensor 480.
  • instances of a camera lens and sensor 480 are located on front and back surfaces of the mobile device 400. The cameras allow capture of still images and/or video for association with captured pitch-corrected vocals.
  • Mobile device 400 can also include one or more wireless communication subsystems, such as an 802.11 b/g/n/ac communication device, and/or a Bluetooth™ communication device 488.
  • Other communication protocols can also be supported, including other 802.x communication protocols (e.g., WiMax, Wi-Fi, 3G), fourth generation protocols and modulations (4G-LTE) and beyond (e.g., 5G), code division multiple access (CDMA), global system for mobile communications (GSM), Enhanced Data GSM Environment (EDGE), etc.
  • a port device 490, e.g., a Universal Serial Bus (USB) port, a docking port, or some other wired port connection, can be included and used to establish a wired connection to other computing devices, such as other communication devices 400, network access devices, a personal computer, a printer, or other processing devices capable of receiving and/or transmitting data.
  • Port device 490 may also allow mobile device 400 to synchronize with a host device using one or more protocols, such as, for example, the TCP/IP, HTTP, UDP and any other known protocol.
  • FIGs. 7A and 7B illustrate a video presentation of livestream content for which a
  • compositing of images for first and second performers is performed at a host device in accordance with some embodiments of the present invention(s).
  • performance synchronized video of performers captured at respective host and guest devices is composited to provide visuals of the apparently live performance.
  • An image blurring at boundary technique is illustrated.
  • FIGs. 8A and 8B illustrate a selfie chat interaction mechanism in which a capture viewport is presented on screen and a user interaction mechanic is supported whereby a user holds a touchscreen-presented button or other feature to capture a video snippet and releases to post the video snippet in a social media interaction in accordance with some embodiments of the present invention(s).
  • Some embodiments support alternative or additional gesture mechanics such as tap-to-start and tap-to-stop with a post confirmation.
  • FIGs. 9A, 9B and 9C illustrate a user part selection and coordination mechanism in which gestured selections by user-performers on geographically-distributed devices provide part selections on peer devices for a livestream performance in accordance with some embodiments of the present invention(s).
  • Some embodiments support alternative or additional gesture mechanics including song selection such as based on coded interests of one or more of the user-performers, history of performances. In some cases or
  • song selection presents as a pseudorandom song roulette selection triggered by one or more of the user-performers or automatically such as based on expiry of a timer.
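One way a pseudorandom song roulette could yield the same selection on every peer device is to seed each device's generator with a value shared over the session. The sketch below assumes such a shared seed exists; the function name, the seed exchange and the example catalog are hypothetical.

```python
import random


def song_roulette(candidates, shared_seed):
    """Pick one song so that every peer using the same seed gets the same result."""
    rng = random.Random(shared_seed)  # private generator, independent of global state
    # Sorting first makes the choice independent of each device's list order.
    return rng.choice(sorted(candidates))


catalog = ["Song A", "Song B", "Song C", "Song D"]
# Guest and host hold the list in different orders but share the seed.
guest_pick = song_roulette(catalog, shared_seed=1234)
host_pick = song_roulette(list(reversed(catalog)), shared_seed=1234)
```

A timer expiry or a gesture on either device could trigger the draw, with the seed conveyed alongside the trigger.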
  • FIG. 10 illustrates respective instances (701, 720A, 720B and 711) of computing devices programmed (or programmable) with vocal audio and video capture code, user interface code, pitch correction code, an audio rendering pipeline and playback code in accord with the functional descriptions herein.
  • Device instance 701 is depicted operating in a vocal audio and performance-synchronized video capture mode
  • device instances 720A and 720B are depicted as operating in a mode that receives livestreamed mixed audiovisual performances.
  • television-type display and/or set-top box equipment 720B is depicted operating in a livestream receiving mode, such equipment and computer 711 may operate as part of a vocal audio and performance synchronized video capture facility (as guest device 101A or host device 101B, recall FIG. 1).
  • Each of the aforementioned devices communicate via wireless data transport and/or intervening networks 704 with a server 712 or service platform that hosts storage and/or functionality explained herein with regard to content server 110.
  • Captured, pitch-corrected vocal performances mixed with performance- synchronized video to define a multi-vocalist audiovisual performance as described herein may (optionally) be livestreamed and audiovisualiy rendered at laptop computer 711.
  • Embodiments in accordance with the present invention may take the form of, and/or be provided as, a computer program product encoded in a machine-readable medium as instruction sequences and other functional constructs of software, which may in turn be executed in a computational system (such as an iPhone handheld, mobile or portable computing device, media application platform, set-top box, or content server platform) to perform methods described herein.
  • In general, a machine readable medium can include tangible articles that encode information in a form (e.g., as applications, source or object code, functionally descriptive information, etc.) readable by a machine (e.g., a computer, computational facilities of a mobile or portable computing device, media device or streamer, etc.) as well as non-transitory storage incident to transmission of the information.
  • A machine-readable medium may include, but need not be limited to, magnetic storage medium (e.g., disks and/or tape storage); optical storage medium (e.g., CD-ROM, DVD, etc.); magneto-optical storage medium; read only memory (ROM); random access memory (RAM); erasable programmable memory (e.g., EPROM and EEPROM); flash memory; or other types of medium suitable for storing electronic instructions, operation sequences, functionally descriptive information encodings, etc. In general, plural instances may be provided for components, operations or structures described herein as a single instance. Boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations.

Abstract

Techniques have been developed to facilitate the livestreaming of group audiovisual performances. Audiovisual performances including vocal music are captured and coordinated with performances of other users in ways that can create compelling user and listener experiences. For example, in some cases or embodiments, duets with a host performer may be supported in a sing-with-the-artist style audiovisual livestream in which aspiring vocalists request or queue particular songs for a live radio show entertainment format. The developed techniques provide a communications latency-tolerant mechanism for synchronizing vocal performances captured at geographically-separated devices (e.g., at globally-distributed, but network-connected mobile phones or tablets or at audiovisual capture devices geographically separated from a live studio).

Description

AUDIOVISUAL LIVESTREAM SYSTEM AND METHOD WITH LATENCY MANAGEMENT AND SOCIAL MEDIA-TYPE USER INTERFACE MECHANICS
TECHNICAL FIELD
The invention relates generally to capture, processing and/or broadcast of multi-performer audiovisual performances and, in particular, to techniques suitable for managing transmission latency for audiovisual content captured in the context of a near real-time audiovisual collaboration of multiple, geographically-distributed performers.
BACKGROUND ART
The installed base of mobile phones, personal media players, and portable computing devices, together with media streamers and television set-top boxes, grows in sheer number and computational power each day. Hyper-ubiquitous and deeply entrenched in the lifestyles of people around the world, many of these devices transcend cultural and economic barriers. Computationally, these computing devices offer speed and storage capabilities comparable to engineering workstation or workgroup computers from less than ten years ago, and typically include powerful media processors, rendering them suitable for real-time sound synthesis and other musical applications. Partly as a result, some portable handheld devices, such as iPhone®, iPad®, iPod Touch® and other iOS® or Android devices, as well as media application platforms and set-top box (STB) type devices such as AppleTV® devices, support audio and video processing quite capably, while at the same time providing platforms suitable for advanced user interfaces. Indeed, applications such as the Smule Ocarina™, Leaf Trombone®, I Am T-Pain™, AutoRap®, Smule (fka Sing! Karaoke™), Guitar! By Smule®, and Magic Piano® apps available from Smule, Inc. have shown that advanced digital acoustic techniques may be delivered using such devices in ways that provide compelling musical experiences.
Smule (fka Sing! Karaoke™) implementations have previously demonstrated accretion of vocal performances captured on a non-real-time basis with respect to each other using geographically-distributed, handheld devices, as well as implementations where more tightly-coupled coordination between portable handheld devices and a local media application platform (e.g., in-room) is supported, typically with short-range, negligible-latency communications on a same local- or personal-area network segment. Improved techniques and functional capabilities are desired to extend an intimate sense of “now” or “liveness” to collaborative vocal performances, where the performers are separated by more significant geographic distances and notwithstanding non-negligible communication latencies between devices.
As researchers seek to transition their innovations to commercial applications deployable to modern handheld devices and media application platforms within the real-world constraints imposed by processor, memory and other limited computational resources thereof and/or within communications bandwidth and transmission latency constraints typical of wireless and wide-area networks, significant practical challenges present. For example, while applications such as Smule (fka Sing! Karaoke™) have demonstrated the promise of post-performance audiovisual mixes to simulate vocal duets or collaborative vocal performances of larger numbers of performers, creating a sense of now and live collaboration has proved elusive without physical co-location. Improved techniques and functional capabilities are desired, particularly relative to management of communication latencies and captured audiovisual content in such a way that a combined audiovisual performance nonetheless can be disseminated (e.g., broadcast) in a manner that presents to recipients, listeners and/or viewers as a live interactive collaboration of geographically-distributed performers. Audience involvement and participation constructs that provide an intimate sense of “now” or “liveness” are also desired.
DISCLOSURE OF THE INVENTION(S)
It has been discovered that, despite practical limitations imposed by mobile device platforms and other media application execution environments, audiovisual performances, including vocal music, may be captured and coordinated with those of other users in ways that create compelling user and listener experiences. In some cases, the vocal performances of collaborating contributors are captured (together with performance synchronized video) in the context of a karaoke-style presentation of lyrics and in correspondence with audible renderings of a backing track. In some cases, vocals (and typically synchronized video) are captured as part of a live or unscripted performance with vocal interactions (e.g., a duet or dialog) between collaborating contributors. In either case, it is envisioned that non-negligible network communication latencies will exist between at least some of the collaborating contributors, particularly where those contributors are geographically separated. As a result, a technical challenge exists to manage latencies and the captured audiovisual content in such a way that a combined audiovisual performance nonetheless can be disseminated (e.g., broadcast) in a manner that presents to recipients, listeners and/or viewers as a live interactive collaboration. In one technique for accomplishing this facsimile of live interactive performance collaboration, actual and non-negligible network communication latency is (in effect) masked in one direction between a guest and host performer and tolerated in the other direction. For example, a captured audiovisual performance of a guest performer on a “live show” internet broadcast of a host performer could include a guest + host duet sung in apparent real-time synchrony. In some cases, the guest could be a performer who has popularized a particular musical performance.
In some cases, the guest could be an amateur vocalist given the opportunity to sing “live” (though remote) with the popular artist or group virtually “in studio” as (or with) the show’s host. Notwithstanding a non-negligible network communication latency from guest-to-host involved in the conveyance of the guest’s audiovisual contribution stream (perhaps 200-500 ms or more), the host performs in apparent synchrony with (though temporally lagged from, in an absolute sense) the guest and the apparently synchronously performed vocals are captured and mixed with the guest’s contribution for broadcast or dissemination.
The result is an apparently live interactive performance (at least from the perspective of the host and the recipients, listeners and/or viewers of the disseminated or broadcast performance). Although the non-negligible network communication latency from guest-to-host is masked, it will be understood that latency exists and is tolerated in the host-to-guest direction. However, host-to-guest latency, while discernible (and perhaps quite noticeable) to the guest, need not be apparent in the apparently live broadcast or other dissemination. It has been discovered that lagged audible rendering of host vocals (or more generally, of the host’s captured audiovisual performance) need not psychoacoustically interfere with the guest’s performance.
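The one-directional masking described above can be illustrated with simple timeline arithmetic. The following is a minimal sketch only, not the disclosed implementation: the function name `host_sync_timeline`, its parameters, and the example latency values are hypothetical, and it assumes the guest starts the backing track at wall-clock time zero and the host sings against the guest's mix as it arrives.

```python
def host_sync_timeline(song_pos_ms, guest_to_host_ms, host_to_guest_ms):
    """Wall-clock times (ms) at which one song position is sung and heard.

    Hypothetical illustration: the guest starts the backing track at
    wall-clock time 0, so the guest sings song position p at time p.
    """
    guest_sings_at = song_pos_ms
    # Guest vocals plus backing track reach the host one network hop later.
    host_hears_guest_at = guest_sings_at + guest_to_host_ms
    # The host sings against the received mix, so the host vocal for this
    # song position is captured when the guest vocal is audibly rendered.
    host_sings_at = host_hears_guest_at
    # Both vocals therefore land on the same song position in the broadcast mix.
    lag_in_broadcast_ms = host_sings_at - host_hears_guest_at
    # The guest hears the host only after a second network hop.
    guest_hears_host_at = host_sings_at + host_to_guest_ms
    lag_heard_by_guest_ms = guest_hears_host_at - guest_sings_at
    return {
        "host_sings_at": host_sings_at,
        "lag_in_broadcast_ms": lag_in_broadcast_ms,
        "lag_heard_by_guest_ms": lag_heard_by_guest_ms,
    }


timeline = host_sync_timeline(song_pos_ms=10_000,
                              guest_to_host_ms=300,
                              host_to_guest_ms=250)
print(timeline)
```

Under these assumed values, the guest-to-host latency does not appear in the broadcast mix, while the guest hears the host lagged by the sum of both one-way latencies.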
Performance synchronized video may be captured and included in a combined audiovisual performance that constitutes the apparently live broadcast, wherein visuals may be based, at least in part, on time-varying, computationally-defined audio features extracted from (or computed over) captured vocal audio. In some cases or embodiments, these computationally-defined audio features are selective, over the course of a coordinated audiovisual mix, for particular synchronized video of one or more of the contributing vocalists (or prominence thereof).
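As one illustration of how an audio feature could select visual prominence, the sketch below compares windowed RMS energy of two vocal tracks. RMS energy is an assumed stand-in for the "computationally-defined audio features" mentioned above (the source does not name a specific feature), and the function names `rms` and `prominence_by_window` are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of audio samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def prominence_by_window(first_vocals, second_vocals, window):
    """For each window of samples, name the performer with the higher RMS.

    Assumption: the louder vocalist in a window is given visual prominence;
    ties favor the first performer.
    """
    length = min(len(first_vocals), len(second_vocals))
    choices = []
    for start in range(0, length, window):
        level_first = rms(first_vocals[start:start + window])
        level_second = rms(second_vocals[start:start + window])
        choices.append("first" if level_first >= level_second else "second")
    return choices


# The first performer sings in the first window, the second in the next.
first = [0.5, -0.5, 0.5, -0.5, 0.0, 0.0, 0.0, 0.0]
second = [0.0, 0.0, 0.0, 0.0, 0.5, -0.5, 0.5, -0.5]
print(prominence_by_window(first, second, window=4))
```

A practical system would likely smooth such decisions over time to avoid rapid switching; that refinement is omitted here.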
Optionally, and in some cases or embodiments, vocal audio can be pitch-corrected in real time at the guest performer’s device (or more generally, at a portable computing device such as a mobile phone, personal digital assistant, laptop computer, notebook computer, pad-type computer or netbook, or on a content or media application server) in accord with pitch correction settings. In some cases, pitch correction settings code a particular key or scale for the vocal performance or for portions thereof. In some cases, pitch correction settings include a score-coded melody and/or harmony sequence supplied with, or for association with, the lyrics and backing tracks. Harmony notes or chords may be coded as explicit targets or relative to the score coded melody or even actual pitches sounded by a vocalist, if desired.
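Where pitch correction settings code a particular key or scale, the correction target for a sounded pitch can be pictured as the nearest note of that scale. The sketch below is a simplified illustration under stated assumptions, not the disclosed pitch corrector: it assumes equal temperament with A4 = 440 Hz and MIDI-style note numbering, and the names `snap_to_scale`, `key_root` and `MAJOR_SCALE` are hypothetical.

```python
import math

A4_HZ = 440.0                        # assumed tuning reference
MAJOR_SCALE = (0, 2, 4, 5, 7, 9, 11)  # semitone offsets from the tonic


def snap_to_scale(freq_hz, key_root=0, scale=MAJOR_SCALE):
    """Return the frequency of the scale note nearest to freq_hz.

    key_root is the pitch class of the tonic (0 = C, 2 = D, ...).
    """
    # Fractional MIDI note number of the sounded pitch (A4 = 69).
    midi = 69 + 12 * math.log2(freq_hz / A4_HZ)
    nearest = round(midi)
    # Any scale has a member within an octave, so search +/- 6 semitones.
    candidates = [n for n in range(nearest - 6, nearest + 7)
                  if (n - key_root) % 12 in scale]
    target = min(candidates, key=lambda n: abs(n - midi))
    return A4_HZ * 2 ** ((target - 69) / 12)


# A slightly sharp A4 (450 Hz) is pulled back to 440 Hz in C major.
print(snap_to_scale(450.0))
# 285 Hz lies between C4 and D4 and is closer to D4 (about 293.66 Hz).
print(snap_to_scale(285.0))
```

A score-coded melody, as also described above, would replace the scale search with a lookup of the target note at the current position in the score.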
Using uploaded vocals captured at guest performer devices such as the aforementioned portable computing devices, a content server or service for the host can further mediate coordinated performances by manipulating and mixing the uploaded audiovisual content of multiple contributing vocalists for further broadcast or other dissemination. Depending on the goals and implementation of a particular system, in addition to video content, uploads may include pitch-corrected vocal performances (with or without harmonies), dry (i.e., uncorrected) vocals, and/or control tracks of user key and/or pitch correction selections, etc.
Synthesized harmonies and/or additional vocals (e.g., vocals captured from another vocalist at still other locations and optionally pitch-shifted to harmonize with other vocals) may also be included in the mix. Geocoding of captured vocal performances (or individual contributions to a combined performance) and/or listener feedback may facilitate animations or display artifacts in ways that are suggestive of a performance or endorsement emanating from a particular geographic locale on a user-manipulable globe. In this way, implementations of the described functionality can transform otherwise mundane mobile devices and living room or entertainment systems into social instruments that foster a unique sense of global connectivity, collaboration and community.
Audiovisual Livestream
In some embodiments in accordance with the present invention(s), a collaboration method for a livestream broadcast of a coordinated audiovisual work of first and second performers captured at respective geographically-distributed, first and second devices includes:
(1) receiving at the second device, a media encoding of an audiovisual performance mixed with a backing audio track and including (i) vocal audio captured at the first device from a first one of the performers and (ii) video that is performance synchronized with the captured first performer vocals;
(2) at the second device, audibly rendering the received mixed audio performance and capturing thereagainst vocal audio from a second one of the performers;
(3) mixing the captured second performer vocal audio with the received mixed audio performance to provide a broadcast audiovisual mix that includes the captured vocal audio of the first and second performers and the backing audio track without apparent temporal lag therebetween; and
(4) supplying the audiovisual broadcast mix to a service platform configured to livestream the broadcast audiovisual mix to plural recipient devices constituting an audience.
In some cases or embodiments, the performance synchronized video included in the received media encoding is captured in connection with the vocal capture at the first device, the method further includes capturing, at the second device, video that is performance synchronized with the captured second performer vocals, and the audiovisual broadcast mix is an audiovisual mix of captured audio and video of at least the first and second performers.
In some embodiments, the method further includes capturing, at the second device, second performer video that is performance synchronized with the captured second performer vocals; and compositing the second performer video with video for the first performer in the supplied audiovisual broadcast mix. In some cases or embodiments, the first and second performer video compositing includes, for at least some portions of the supplied audiovisual broadcast mix, a computational blurring of image frames of first and second performer video at a visual boundary therebetween. In some embodiments, the method further includes dynamically varying in a course of the audiovisual broadcast mix relative visual prominence of one or the other of the first and second performers. In some cases or embodiments, the dynamic varying is, at least partially, in correspondence with time varying vocal part codings in an audio score corresponding to and temporally synchronized with the backing audio track. In some cases or embodiments, the dynamic varying is, at least partially, based on evaluation of a computationally defined audio feature of either or both of the first and second performer vocals.
In some cases or embodiments, the first device is associated with the second device as a current livestream guest, and the second device operates as a current livestream host, the current livestream host controlling association and dissociation of particular devices from the audience as the current livestream guest. In some cases or embodiments, the current livestream host selects from a queue of requests from the audience to associate as the current livestream guest. In some cases or embodiments, the first device operates in a livestream guest role and the second device operates in a livestream host role, the method further comprising either or both of: the second device releasing the livestream host role for assumption by another device; and the second device passing the livestream host role to a particular device selected from a set comprising the first device and the audience.
In some embodiments, the method further includes accessing a machine readable encoding of musical structure that includes at least musical section boundaries coded for temporal alignment with the vocal audio captured at the first and second devices; and applying a first visual effect schedule to at least a portion of audiovisual broadcast mix, wherein the applied visual effect schedule encodes differing visual effects for differing musical structure elements of the first audiovisual performance encoding and provides visual effect transitions in temporal alignment with at least some of the coded musical section boundaries. In some cases or embodiments, the differing visual effects encoded by the applied visual effect schedule include for a given element thereof, one or more of: a particle-based effect or lens flare; transitions between distinct source videos; animations or motion of a frame within a source video; vector graphics or images of patterns or textures; and color, saturation or contrast. In some cases or embodiments, the associated musical structure encodes musical sections of differing types; and the applied visual effect schedule defines differing visual effects for different ones of the encoded musical sections. In some cases or embodiments, the associated musical structure encodes events or transitions; and the applied visual effect schedule defines differing visual effects for different ones of the encoded events or transitions. In some cases or embodiments, the associated musical structure encodes group parts, and the applied visual effect schedule is temporally selective for particular performance synchronized video in correspondence with the encoded musical structure.
In some embodiments, the method is performed, at least in part, on a handheld mobile device communicatively coupled to a content server or service platform. In some embodiments, the method is embodied, at least in part, as a computer program product encoding of instructions executable on the second device as part of a cooperative system including a content server or service platform to which a plurality of geographically-distributed, network-connected, vocal capture devices, including the second device, are communicatively coupled.
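The visual effect schedule described above, in which effects change in temporal alignment with coded musical section boundaries, can be sketched as a lookup from playback time to section type to effect. This is a minimal sketch under assumptions: the data layout, the function name `effect_at`, and the section and effect labels are hypothetical and are not taken from the disclosed encoding.

```python
from bisect import bisect_right


def effect_at(time_s, sections, schedule):
    """Return the visual effect scheduled for the section containing time_s.

    sections: list of (start_time_s, section_type), sorted by start time.
    schedule: mapping from section_type to an effect label.
    Returns None before the first section or for an unscheduled type.
    """
    starts = [start for start, _ in sections]
    index = bisect_right(starts, time_s) - 1
    if index < 0:
        return None
    section_type = sections[index][1]
    return schedule.get(section_type)


# Hypothetical musical structure and effect schedule.
sections = [(0.0, "verse"), (30.0, "chorus"), (60.0, "verse")]
schedule = {"verse": "soft_blur", "chorus": "lens_flare"}

for t in (29.9, 30.0, 75.0):
    print(t, effect_at(t, sections, schedule))
```

Because the lookup is keyed on section boundaries, an effect transition occurs exactly at a coded boundary (here, at 30.0 seconds), which is the alignment property the passage describes.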
In some embodiments in accordance with the present invention(s), a system for dissemination of an apparently live broadcast of a joint performance of geographically-distributed first and second performers includes first and second devices coupled by a communication network with non-negligible peer-to-peer latency for transmission of audiovisual content. The first device is communicatively coupled to supply to the second device an audiovisual performance mixed with a backing audio track and including (1) vocal audio of the first performer captured against the backing audio track and (2) video that is performance synchronized therewith. The second device is communicatively configured to receive a media encoding of the mixed audiovisual performance and to audibly render at least audio portions of the mixed audiovisual performance, to capture thereagainst vocal audio of the second performer, and to mix the captured second performer vocal audio with the received mixed audiovisual performance for transmission as the apparently live broadcast.
In some cases or embodiments, the second device is further configured to capture second performer video that is performance synchronized with the captured second performer vocals and to composite the second performer video with video for the first performer in the supplied audiovisual broadcast mix. In some cases or embodiments, the first and second performer video compositing includes, for at least some portions of the supplied audiovisual broadcast mix, a computational blurring of image frames of first and second performer video at a visual boundary therebetween. In some cases or embodiments, the first and second performer video compositing includes dynamically varying in a course of the audiovisual broadcast mix relative visual prominence of one or the other of the first and second performers. In some cases or embodiments, the dynamic varying is, at least partially, in correspondence with time varying vocal part codings in an audio score corresponding to and temporally synchronized with the backing audio track. In some cases or embodiments, the dynamic varying is, at least partially, based on evaluation of a computationally defined audio feature of either or both of the first and second performer vocals. In some cases or embodiments, the first device is associated with the second device as a current livestream guest, and the second device operates as a current livestream host, the current livestream host controlling association and dissociation of particular devices from the audience as the current livestream guest. In some cases or embodiments, the current livestream host selects from a queue of requests from the audience to associate as the current livestream guest.
In some embodiments, the system further includes a video compositor that accesses a machine readable encoding of musical structure including at least musical section boundaries coded for temporal alignment with the vocal audio captured at the first and second devices and that applies a first visual effect schedule to at least a portion of audiovisual broadcast mix, wherein the applied visual effect schedule encodes differing visual effects for differing musical structure elements of the first audiovisual performance encoding and provides visual effect transitions in temporal alignment with at least some of the coded musical section boundaries. In some cases or embodiments, the video compositor is hosted either on the second device or on a content server or service platform through which the apparently live performance is supplied. In some cases or embodiments, as part of a user interface visual on either or both of the first and second devices, for a current song selection, a different vocal part selection is presented for each of the performers; and responsive to, and in correspondence with, gestures by either or both of the first and second performers at the respective geographically-distributed devices, the vocal part selections are updated, wherein assignment of a particular vocal part selection to the respective first or second performers is changeable until one or the other of the first and second performers gestures a start of vocal capture on a respective geographically-distributed device, whereupon a then-current assignment of particular vocal part selections to the respective first or second performers is fixed for duration of capture of the coordinated multi-vocal performance.
Chat Mechanism for Social Media:
In some embodiments in accordance with the present invention(s), a user interface method for social media includes (1) as part of a user interface visual on a touchscreen display of a client device, presenting on the touchscreen display, live video captured using a camera of the client device; (2) responsive to a first touchscreen gesture by a user of the client device, initiating capture of a snippet of the live video and presenting, as part of the user interface visual, a progress indication in correspondence with an accreting capture of the snippet; and (3) responsive to a second touchscreen gesture by the user of the client device, transmitting the captured snippet to a network-coupled service platform as a posting in a multiuser social media thread. In some cases or embodiments, the method further includes presenting the multiuser social media thread on the touchscreen display, the presented multiuser social media thread including the captured snippet together with posted temporally-ordered content from other users received via the network-coupled service platform, wherein the posted content from at least one other user includes one or more of text and captured snippet of video from the at least one other user. In some cases or embodiments, the captured snippet is a fixed-length snippet, and the method further includes visually updating the progress indication in correspondence with the portion of the fixed-length snippet captured. In some cases or embodiments, the first touchscreen gesture is a maintained contact, by the user, with a first visually presented feature on the touchscreen display, and wherein the second touchscreen gesture includes release of the maintained contact.
In some cases or embodiments, the first touchscreen gesture is a first tap-type contact, by the user, with a first visually presented feature on the touchscreen display, and wherein the second touchscreen gesture is a second tap-type contact following the first tap-type contact on the touchscreen display. In some cases or embodiments, the method further includes presenting the multiuser social media thread on the touchscreen display in correspondence with a livestreamed audiovisual broadcast mix.
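The hold-to-capture and release-to-post mechanic described above can be modeled as a small state machine. The sketch below is illustrative only and makes several assumptions: the class name `SnippetCapture`, its methods, the frame-count limit standing in for a fixed snippet length, and the `post` callback standing in for transmission to the service platform are all hypothetical.

```python
class SnippetCapture:
    """Models press (first gesture), frame capture, and release (second gesture)."""

    def __init__(self, max_frames):
        self.max_frames = max_frames  # assumed fixed snippet length, in frames
        self.frames = []
        self.recording = False

    def press(self):
        """First gesture: begin a new capture."""
        self.frames = []
        self.recording = True

    def on_frame(self, frame):
        """Accrete a camera frame while recording; return progress in [0, 1]."""
        if self.recording and len(self.frames) < self.max_frames:
            self.frames.append(frame)
        return self.progress()

    def progress(self):
        """Fraction of the fixed-length snippet captured so far."""
        return len(self.frames) / self.max_frames

    def release(self, post):
        """Second gesture: stop and hand the snippet to the post callback."""
        if not self.recording:
            return False
        self.recording = False
        post(list(self.frames))
        return True


thread = []                      # stands in for the multiuser thread
capture = SnippetCapture(max_frames=4)
capture.press()
capture.on_frame("frame0")
print(capture.on_frame("frame1"))  # progress indication after two frames
capture.release(thread.append)
print(thread)
```

The tap-to-start and tap-to-stop variant maps onto the same model by calling `press` on the first tap and `release` on the second.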
Audiovisual Collaboration with User Part Arbitration:
In some embodiments in accordance with the present invention(s), a method for capture of at least a portion of a coordinated multi-vocal performance of first and second performers at respective first and second geographically-distributed devices includes (1) as part of a user interface visual on either or both of the first and second devices, presenting for a current song selection, a different vocal part selection for each of the performers; and (2) responsive to, and in correspondence with, gestures by either or both of the first and second performers at the respective geographically-distributed devices, updating the vocal part selections, wherein assignment of a particular vocal part selection to the respective first or second performers is changeable until one or the other of the first and second performers gestures a start of vocal capture on a respective geographically-distributed device, whereupon a then-current assignment of particular vocal part selections to the respective first or second performers is fixed for duration of capture of the coordinated multi-vocal performance. In some cases or embodiments, the method further includes updating the vocal part selections at the second device in correspondence with a gestured selection communicated from the first device; and supplying the first device with updates to the vocal part selections in correspondence with a gestured selection at the second device. In some cases or embodiments, the method further includes changing the current song selection and, in correspondence therewith, updating on either or both of the first and second devices the user interface visual. In some cases or embodiments, the change in current song selection is triggered by one or the other of the first and second performers on a respective one of the first and second devices. In some cases or embodiments, the method further includes triggering the change in current song selection based on a periodic or recurring event.
In some cases or embodiments, the change in current song selection selects from a library of song selections based on one or more of coded interests and performance history of either or both of the first and second performers. In some cases or embodiments, beginning with the start of vocal capture, the method further includes receiving at the second device, a media encoding of a mixed audio performance (i) including vocal audio captured at the first device from a first one of the performers and (ii) mixed with a backing audio track for the current song selection; at the second device, audibly rendering the received mixed audio performance and capturing thereagainst vocal audio from a second one of the performers; and mixing the captured second performer vocal audio with the received mixed audio performance to provide a broadcast mix that includes the captured vocal audio of the first and second performers and the backing audio track without apparent temporal lag therebetween.
In some cases or embodiments, the method further includes, at the second device, visually presenting in correspondence with the audible rendering, lyrics and score-coded note targets for the current song selection, wherein the visually presented lyrics and note targets correspond to the assignment of a particular vocal part at the start of vocal capture. In some cases or embodiments, the received media encoding includes video that is performance synchronized with the captured first performer vocals, wherein the method further includes capturing, at the host device, video that is performance synchronized with the captured second performer vocals, and wherein the broadcast mix is an audiovisual mix of captured audio and video of at least the first and second performers.
In some cases or embodiments, the method further includes capturing, at the host device, second performer video that is performance synchronized with the captured second performer vocals; and compositing the second performer video with video for the first performer in the supplied audiovisual broadcast mix. In some cases or embodiments, the method further includes supplying the broadcast mix to a service platform configured to livestream the broadcast mix to plural recipient devices constituting an audience.
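The part arbitration behavior described in this section (assignments remain changeable until either performer gestures a start of vocal capture, then are fixed) can be sketched as follows. This is a minimal single-process illustration under assumptions: the class name `PartArbiter`, the performer keys and the part labels are hypothetical, and the network exchange of selections between the two devices is omitted.

```python
class PartArbiter:
    """Tracks which vocal part each of two performers holds."""

    def __init__(self, parts=("A", "B")):
        # Each performer is presented a different part for the current song.
        self.assignment = {"first": parts[0], "second": parts[1]}
        self.locked = False

    def select(self, performer, part):
        """Apply a gestured part selection; the peer receives the other part.

        Selections are ignored once capture has started, and for unknown
        performers or parts.
        """
        valid = (performer in self.assignment
                 and part in self.assignment.values())
        if self.locked or not valid:
            return dict(self.assignment)
        if self.assignment[performer] != part:
            other = "second" if performer == "first" else "first"
            self.assignment[performer], self.assignment[other] = (
                self.assignment[other], self.assignment[performer])
        return dict(self.assignment)

    def start_capture(self):
        """Either performer's start gesture fixes the current assignment."""
        self.locked = True
        return dict(self.assignment)


arbiter = PartArbiter()
print(arbiter.select("second", "A"))  # parts swap before capture starts
arbiter.start_capture()
print(arbiter.select("first", "A"))   # ignored: assignment is now fixed
```

In a distributed deployment each device would apply the same updates as selections are communicated between peers, so both user interface visuals show the same assignment at the start of capture.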
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention(s) are illustrated by way of examples and not limitation with reference to the accompanying figures, in which like references generally indicate similar elements or features.
FIG. 1 depicts information flows amongst illustrative mobile phone-type portable computing devices in a host and guest configuration for livestreaming a duet-type group audiovisual performance in accordance with some embodiments of the present invention(s).
FIG. 2 is a flow graph depicting the flow of audio signals captured and processed at respective guest and host devices coupled in a “host sync” peer-to-peer configuration for generation of a group audiovisual performance livestream in accordance with some embodiments of the present invention(s).
FIG. 3 is a flow graph depicting the flow of audio signals captured and processed at respective guest and host devices coupled in a “shared latency” peer-to-peer configuration for generation of a group audiovisual performance livestream in accordance with some embodiments of the present invention(s).
FIG. 4 is a flow diagram illustrating, for an audiovisual performance captured at a guest or host device in accordance with some embodiments of the present invention(s), optional real time continuous pitch-correction and harmony generation signal flows that may be performed based on score-coded pitch correction settings.
FIG. 5 is a functional block diagram of hardware and software components executable at an illustrative mobile phone-type portable computing device to facilitate processing and communication of a captured audiovisual performance for use in a multi-vocalist livestreaming configuration of network-connected devices in accordance with some embodiments of the present invention(s).
FIG. 6 illustrates features of a mobile device that may serve as a platform for execution of software implementations of at least some audiovisual performance capture and/or livestream performance devices in accordance with some embodiments of the present invention(s).
FIGs. 7A and 7B illustrate a video presentation of livestream content for which a compositing of images for first and second performers is performed at a host device in accordance with some embodiments of the present invention(s).
FIGs. 8A and 8B illustrate a selfie chat interaction mechanism in which a capture viewport is presented on screen and a user interaction mechanic is supported whereby a user holds a touchscreen-presented button or other feature to capture a video snippet and releases to post the video snippet in a social media interaction in accordance with some embodiments of the present invention(s).
FIGs. 9A, 9B and 9C illustrate a user part selection and coordination mechanism in which a song roulette and/or gestured selections by user-performers on geographically-distributed devices provide corresponding song and/or part selections on peer devices for a livestream performance in accordance with some embodiments of the present invention(s).
FIG. 10 is a network diagram that illustrates cooperation of exemplary devices in accordance with some embodiments of the present invention(s).
Skilled artisans will appreciate that elements or features in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions or prominence of some of the illustrated elements or features may be exaggerated relative to other elements or features in an effort to help to improve understanding of embodiments of the present invention. Likewise, a multiplicity of data and control flows (including constituent signals or encodings) will be understood consistent with the descriptions notwithstanding illustration in the drawings of a single flow for simplicity or to avoid complexity that might otherwise obscure description of the inventive concepts.
DESCRIPTION
Techniques have been developed to facilitate the livestreaming of group audiovisual performances. Audiovisual performances including vocal music are captured and coordinated with performances of other users in ways that can create compelling user and listener experiences. For example, in some cases or embodiments, duets with a host performer may be supported in a sing-with-the-artist style audiovisual livestream in which aspiring vocalists request or queue particular songs for a live radio show entertainment format. The developed techniques provide a communications latency-tolerant mechanism for synchronizing vocal performances captured at geographically-separated devices (e.g., at globally-distributed, but network-connected mobile phones or tablets or at audiovisual capture devices geographically separated from a live studio).
While audio-only embodiments are certainly contemplated, it is envisioned that livestream content will typically include performance-synchronized video captured in connection with vocals. In addition, while network-connected mobile phones are illustrated as audiovisual capture devices, it will be appreciated based on the description herein that audiovisual capture and viewing devices may include suitably-configured computers, smart TVs and/or living room style set-top box configurations, and even intelligent virtual assistance devices with audio and/or audiovisual capture devices or capabilities. Finally, while applications to vocal music are described in detail, it will be appreciated based on the description herein that audio or audiovisual capture applications need not be limited to vocal duets, but may be adapted to other forms of group performance in which one or more successive performances are accreted to a prior performance to produce a livestream.
In some cases, the vocal performances of collaborating contributors are captured (together with performance synchronized video) in the context of a karaoke-style presentation of lyrics and in correspondence with audible renderings of a backing track. In some cases, vocals (and typically synchronized video) are captured as part of a live or unscripted performance with vocal interactions (e.g., a duet or dialog) between collaborating contributors. In each case, it is envisioned that non-negligible network communication latencies will exist between at least some of the collaborating contributors, particularly where those contributors are geographically separated. As a result, a technical challenge exists to manage latencies and the captured audiovisual content in such a way that a combined audiovisual performance nonetheless can be disseminated (e.g., broadcast) in a manner that presents to recipients, listeners and/or viewers as a live interactive collaboration.
In one technique for accomplishing this facsimile of live interactive performance collaboration, actual and non-negligible network communication latency is (in effect) masked in one direction between a guest and host performer and tolerated in the other direction. For example, a captured audiovisual performance of a guest performer on a “live show” internet broadcast of a host performer could include a guest + host duet sung in apparent real-time synchrony. In some cases, the host could be a performer who has popularized a particular musical performance. In some cases, the guest could be an amateur vocalist given the opportunity to sing “live” (though remote) with the popular artist or group “in studio” as (or with) the show’s host. Notwithstanding a non-negligible network communication delay from guest-to-host (perhaps 200-500 ms or more) to convey the guest’s audiovisual contribution, the host performs in apparent synchrony with (though temporally lagged from, in an absolute sense) the guest and the apparently synchronously performed vocals are captured and mixed with the guest’s contribution for broadcast or dissemination.
The result is an apparently live interactive performance (at least from the perspective of the host and the recipients, listeners and/or viewers of the disseminated or broadcast performance). Although the non-negligible network communication latency from guest-to-host is masked, it will be understood that latency exists and is tolerated in the host-to-guest direction. However, host-to-guest latency, while discernible (and perhaps quite noticeable) to the guest, need not be apparent in the apparently live broadcast or other dissemination. It has been discovered that lagged audible rendering of host vocals (or more generally, of the host’s captured audiovisual performance) need not psychoacoustically interfere with the guest’s performance.
Although much of the description herein presumes, for purposes of illustration, a fixed host performer on a particular host device, it will be appreciated based on the description herein that some embodiments in accordance with the present invention(s) may provide host/guest control logic that allows a host to “pass the mic” such that a new user (in some cases a user selected by the current host and in other cases, a user who “picks up the mic” after the current host “drops the mic”) may take over as host. Likewise, it will be appreciated based on the description herein that some embodiments in accordance with the present invention(s) may provide host/guest control logic that queues guests (and/or aspiring hosts) and automatically assigns queued users to appropriate roles.
In some cases or embodiments, vocal audio of individual host- and guest-role performers is captured together with performance synchronized video in a karaoke-style user interface framework and coordinated with audiovisual contributions of the other users to form duet-style or glee club-style group audiovisual performances. For example, the vocal performances of individual users may be captured (together with performance synchronized video) on mobile devices, television-type display and/or set-top box equipment in the context of karaoke-style presentations of lyrics in correspondence with audible renderings of a backing track. In some cases or embodiments, score-coded continuous pitch correction may be provided as well as user selectable audio and/or video effects. Consistent with the foregoing, but without limitation as to any particular embodiment claimed, karaoke-style vocal performance capture using portable handheld devices provides illustrative context.
Although embodiments of the present invention are not limited thereto, pitch-corrected, karaoke-style, vocal capture using mobile phone-type and/or television-type audiovisual equipment provides a useful descriptive context. For example, in some embodiments such as illustrated in FIG. 1, iPhone™ handhelds available from Apple Inc. (or more generally, handhelds 101A, 101B operating as guest and host devices, respectively) execute software that operates in coordination with a content server 110 to provide vocal capture. The configuration optionally provides continuous real-time, score-coded pitch correction and harmonization of the captured vocals. Performance synchronized video may also be captured using a camera provided by, or in connection with, a computer, a television or other audiovisual equipment (not specifically shown) or connected set-top box equipment such as an Apple TV™ device. In some embodiments, performance synchronized video may be captured using an on-board camera provided by a handheld paired with connected set-top box equipment.
In the illustration of FIG. 1, a current host user of current host device 101B at least partially controls the content of a live stream 122 that is buffered for, and streamed to, an audience on devices 120A, 120B ... 120N. In the illustrated configuration, a current guest user of current guest device 101A contributes to the group audiovisual performance mix 111 that is supplied (eventually via content server 110) by current host device 101B as live stream 122. Although devices 120A, 120B ... 120N and, indeed, current guest and host devices 101A, 101B are, for simplicity, illustrated as handheld devices such as mobile phones, persons of skill in the art having benefit of the present disclosure will appreciate that any given member of the audience may receive livestream 122 on any suitable computer, smart television, tablet, via a set-top box or other streaming media capable client.
Content that is mixed to form group audiovisual performance mix 111 is captured, in the illustrated configuration, in the context of karaoke-style performance capture wherein lyrics 102, optional pitch cues 105 and, typically, a backing track 107 are supplied from content server 110 to either or both of current guest device 101A and current host device 101B. A current host (on current host device 101B) typically exercises ultimate control over the live stream, e.g., by selecting a particular user (or users) from the audience to act as the current guest(s), by selecting a particular song from a request queue (and/or vocal parts thereof for particular users), and/or by starting, stopping or pausing the group AV performance. Once the current host selects or approves a guest and/or song, the guest user may (in some embodiments) start/stop/pause the roll of backing track 107A for local audible rendering and otherwise control the content of guest mix 106 (backing track roll mixed with captured guest audiovisual content) supplied to current host device 101B. Roll of lyrics 102A and optional pitch cues 105A at current guest device 101A is in temporal correspondence with the backing track 107A, and is likewise subject to start/stop/pause control by the current guest. In some cases or situations, backing audio and/or video may be rendered from a media store such as an iTunes™ library resident or accessible from a handheld, set-top box, etc.
Typically, song requests 132 are audience-sourced and conveyed by signaling paths to content selection and guest queue control logic 112 of content server 110. Host controls 131 and guest controls 133 are illustrated as bi-directional signaling paths. Other queuing and control logic configurations consistent with the operations described, including host or guest controlled queueing and/or song selection, will be appreciated based on the present disclosure.
In the illustrated configuration of FIG. 1, and notwithstanding a non-negligible temporal lag (typically 100-250 ms, but possibly more), current host device 101B receives and audibly renders guest mix 106 as a backing track against which the current host’s audiovisual performance is captured at current host device 101B. Roll of lyrics 102B and optional pitch cues 105B at current host device 101B is in temporal correspondence with the backing track, here guest mix 106. To facilitate synchronization to the guest mix 106 in view of temporal lag in the peer-to-peer communications channel between current guest device 101A and current host device 101B as well as for guest-side start/stop/pause control, marker beacons may be encoded in the guest mix to provide the appropriate phase control of lyrics 102B and optional pitch cues 105B on screen. Alternatively, phase analysis of any backing track 107A included in guest mix 106 (or any bleed through, if the backing track is separately encoded or conveyed) may be used to provide the appropriate phase control of lyrics 102B and optional pitch cues 105B on screen at current host device 101B.
It will be understood that temporal lag in the peer-to-peer communications channel between current guest device 101A and current host device 101B affects both guest mix 106 and communications in the opposing direction (e.g., host mic 103C signal encodings). Any of a variety of communications channels may be used to convey audiovisual signals and controls between current guest device 101A and current host device 101B, as well as between the guest and host devices 101A, 101B and content server 110 and between audience devices 120A, 120B ... 120N and content server 110. For example, respective telecommunications carrier wireless facilities and/or wireless local area networks and respective wide-area network gateways (not specifically shown) may provide communications to and from devices 101A, 101B, 120A, 120B ... 120N. Based on the description herein, persons of skill in the art will recognize that any of a variety of data communications facilities, including 802.11 Wi-Fi, Bluetooth™, 4G-LTE, 5G, or other communications, wireless, wired data networks, wired or wireless audiovisual interconnects such as in accord with HDMI, AVI, Wi-Di standards or facilities may be employed, individually or in combination, to facilitate communications and/or audiovisual rendering described herein.
User vocals 103A and 103B are captured at respective handhelds 101A, 101B, and may be optionally pitch-corrected continuously and in real-time and audibly rendered mixed with the locally-appropriate backing track (e.g., backing track 107A at current guest device 101A and guest mix 106 at current host device 101B) to provide the user with an improved tonal quality rendition of his/her own vocal performance. Pitch correction is typically based on score-coded note sets or cues (e.g., the pitch and harmony cues 105A, 105B visually displayed at current guest device 101A and at current host device 101B, respectively), which provide continuous pitch-correction algorithms executing on the respective device with performance-synchronized sequences of target notes in a current key or scale. In addition to performance-synchronized melody targets, score-coded harmony note sequences (or sets) provide pitch-shifting algorithms with additional targets (typically coded as offsets relative to a lead melody note track and typically scored only for selected portions thereof) for pitch-shifting to harmony versions of the user’s own captured vocals. In some cases, pitch correction settings may be characteristic of a particular artist such as the artist that performed vocals associated with the particular backing track. In general, lyrics, melody and harmony track note sets and related timing and control information may be encapsulated in an appropriate container or object (e.g., in a Musical Instrument Digital Interface, MIDI, or JavaScript Object Notation, JSON, type format) for supply together with the backing track(s).
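As a minimal sketch of how harmony targets coded as offsets relative to a lead melody track might be resolved, consider the following Python fragment. The JSON layout, the field names (`melody`, `harmony_offsets`, `semitones`) and the helper `harmony_targets` are hypothetical and chosen only for illustration; the description does not specify a schema.

```python
import json

# Hypothetical score container. Field names and layout are assumptions
# made for illustration, not a format defined by the description.
SCORE_JSON = """
{
  "melody": [
    {"start": 0.0, "duration": 0.5, "midi": 60},
    {"start": 0.5, "duration": 0.5, "midi": 62},
    {"start": 1.0, "duration": 1.0, "midi": 64}
  ],
  "harmony_offsets": [
    {"start": 1.0, "duration": 1.0, "semitones": [4, 7]}
  ]
}
"""


def harmony_targets(score):
    """Resolve harmony offsets against the melody notes they overlap.

    Returns (start_time, [target MIDI notes]) pairs. Harmony is scored
    only for selected portions, so melody notes outside any harmony
    region produce no targets.
    """
    targets = []
    for region in score["harmony_offsets"]:
        region_end = region["start"] + region["duration"]
        for note in score["melody"]:
            note_end = note["start"] + note["duration"]
            # Half-open interval overlap test.
            if note["start"] < region_end and region["start"] < note_end:
                start = max(note["start"], region["start"])
                targets.append((start, [note["midi"] + s for s in region["semitones"]]))
    return targets


score = json.loads(SCORE_JSON)
```

Here only the third melody note falls inside the scored harmony region, so a pitch shifter would receive two additional targets (a third and a fifth above that note) for that span alone.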
Using such information, devices 101A and 101B (as well as associated audiovisual displays and/or set-top box equipment, not specifically shown) may display lyrics and even visual cues related to target notes, harmonies and currently detected vocal pitch in correspondence with an audible performance of the backing track(s) so as to facilitate a karaoke-style vocal performance by a user. Thus, if an aspiring vocalist selects “When I Was Your Man” as popularized by Bruno Mars, your_man.json and your_man.m4a may be downloaded from the content server (if not already available or cached based on prior download) and, in turn, used to provide background music, synchronized lyrics and, in some situations or embodiments, score-coded note tracks for continuous, real-time pitch-correction while the user sings. Optionally, at least for certain embodiments or genres, harmony note tracks may be score coded for harmony shifts to captured vocals. Typically, a captured pitch-corrected (possibly harmonized) vocal performance together with performance synchronized video is saved locally, on the handheld device or set-top box, as one or more audiovisual files and is subsequently compressed and encoded for communication (e.g., as guest mix 106 or group audiovisual performance mix 111 or constituent encodings thereof) to content server 110 as an MPEG-4 container file. MPEG-4 is one suitable standard for the coded representation and transmission of digital multimedia content for the internet, mobile networks and advanced broadcast applications. Other suitable codecs, compression techniques, coding formats and/or containers may be employed if desired.
As will be appreciated by persons of skill in the art having benefit of the present disclosure, performances of multiple vocalists (including performance synchronized video) may be accreted and combined, such as to form a duet-style performance, glee club, or vocal jam session. In some embodiments of the present invention, social network constructs may at least partially supplant or inform host control of the pairings of geographically-distributed vocalists and/or formation of geographically-distributed virtual glee clubs. For example, relative to FIG. 1, individual vocalists may perform as current host and guest users in a manner captured (with vocal audio and performance synchronized video) and eventually streamed as a live stream 122 to an audience. Such captured audiovisual content may, in turn, be distributed to social media contacts of the vocalist, members of the audience etc., via an open call mediated by the content server. In this way, the vocalists themselves, members of the audience (and/or the content server or service platform on their behalf) may invite others to join in a coordinated audiovisual performance, or as members of an audience or guest queue.
Where supply and use of backing tracks is illustrated and described herein, it will be understood that vocals captured, pitch-corrected (and possibly, though not necessarily, harmonized) may themselves be mixed (as with guest mix 106) to produce a “backing track” used to motivate, guide or frame subsequent vocal capture. Furthermore, additional vocalists may be invited to sing a particular part (e.g., tenor, part B in duet, etc.) or simply to sing, and the subsequent vocal capture device (e.g., current host device 101B in the configuration of FIG. 1) may pitch shift and place their captured vocals into one or more positions within a duet or virtual glee club.
Based on the description herein, persons of skill in the art will appreciate a variety of host-guest synchronization methods that tolerate non-negligible temporal lag in the peer-to-peer communications channel between guest device 101A and host device 101B. As illustrated in the context of FIG. 1, the backing track (e.g., backing track 107A) can provide the synchronization timeline for temporally-phased vocal capture performed at the respective peer devices (guest device 101A and host device 101B) and minimize (or eliminate) the perceived latency for the users thereof.
FIG. 2 is a flow graph depicting the flow of audio signals captured and processed at respective guest and host devices coupled in a “host sync” peer-to-peer configuration for generation of a group audiovisual performance livestream in accordance with some embodiments of the present invention(s). Although FIG. 2 (and later, FIG. 3) each emphasize in the form of a teaching example the audio signal/data components and flows that provide synchronization and temporal alignment for an apparently live performance, a person of skill in the art having benefit of the present disclosure will appreciate that corresponding audio performance synchronized video may be captured (as in FIG. 1) and that corresponding video signal/data components and flows may, in like manner, be conveyed between guest and host devices, though not explicitly shown in FIGs. 2 and 3.
More specifically, FIG. 2 illustrates how an exemplary configuration of guest and host devices 101A and 101B (recall FIG. 1) and audiovisual signal flows therebetween (e.g., guest mix 106 and host mic audio 103C) during a peer-to-peer session provide a user experience in which the host device vocalist (at host device 101B) always hears guest vocals (captured from guest mic local input 103A) and backing track 107A in perfect synchronization. While the guest will perceive the host's accreted vocals delayed (in the mix supplied at guest speaker or headset 240A) by a full audio round-trip-travel (RTT) delay, the audio stream (including the remote guest mic mixed with the backing track) supplied to the host device 101B and mixed as the livestreamed (122) multi-vocal performance exhibits zero (or negligible) latency to the host vocalist or to the audience.
A key to masking actual latencies is to include backing track 107A in the audio mix supplied from guest device 101A to the broadcaster’s device, host device 101B. This audio flow ensures that the guest's voice and backing track are always synced from the broadcaster's point of view (based on audible rendering at host speaker or headset 240B). The guest may still perceive that the broadcaster is singing slightly out of sync if the network delay is significant. However, as long as the guest focuses on singing in time with the backing track instead of the host's slightly delayed voice, the multi-vocal mix of host vocals with guest vocals and the backing track is in sync when livestreamed to an audience.
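The masking effect of the “host sync” flow can be illustrated with a toy model, offered here only as a sketch under simplifying assumptions: audio is represented as plain lists of samples, the network is modeled as a pure delay measured in samples, and the host is assumed to sing exactly in time with what is heard. The helper names (`mix`, `delay`, `host_sync_livestream`) are hypothetical.

```python
def mix(*tracks):
    """Sum sample sequences, treating shorter ones as padded with silence."""
    length = max(len(t) for t in tracks)
    return [sum(t[i] if i < len(t) else 0.0 for t in tracks) for i in range(length)]


def delay(track, samples):
    """Model a pure transport delay by prepending silence."""
    return [0.0] * samples + list(track)


def host_sync_livestream(backing, guest_vocals, host_vocals, network_delay):
    """Toy model of the host-sync flow.

    The guest mixes vocals with the backing track before sending, so the
    two stay locked together regardless of transport delay. The host
    sings against the mix as it arrives, which places host vocals on the
    same (delayed) timeline.
    """
    guest_mix = mix(backing, guest_vocals)          # formed at the guest device
    received = delay(guest_mix, network_delay)      # as heard at the host device
    host_on_arrival = delay(host_vocals, network_delay)
    return mix(received, host_on_arrival)           # mix that is livestreamed


backing = [1.0, 0.0, 1.0, 0.0]
guest = [0.0, 2.0, 0.0, 2.0]
host = [4.0, 4.0, 4.0, 4.0]
no_lag = host_sync_livestream(backing, guest, host, 0)
lagged = host_sync_livestream(backing, guest, host, 3)
```

In this model the lagged livestream is simply the zero-lag livestream shifted later in absolute time, so the audience hears the same relative alignment of backing track, guest vocals and host vocals in both cases.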
FIG. 3 is a flow graph depicting the flow of audio signals captured and processed at respective guest and host devices coupled in an alternative “shared latency” peer-to-peer configuration for generation of a group audiovisual performance livestream in accordance with some embodiments of the present invention(s). More specifically, FIG. 3 illustrates how an exemplary configuration of guest and host devices 101A and 101B (recall FIG. 1) and audiovisual signal flows therebetween (e.g., guest mix 106 and host mic audio 103C) during a peer-to-peer session combine to limit the guest and host vocalist’s perception of the other vocalist’s audio delay to just a one-way lag (nominally one half of the full audio round-trip-travel delay) behind the backing track.
This limited perception of delay is accomplished by playing the backing track locally on both devices and working to keep them in sync in real-time. The guest device 101A sends periodic timing messages to the host containing the current position in the song, and the host device 101B adjusts the playback position of the song accordingly.
We have experimented with two different approaches to keeping the backing tracks in sync on the two devices (guest and host devices 101A and 101B):
• Method 1: We adjust the playback position we receive on the host-side by the one-way network delay, which is approximated as the network RTT/2.
• Method 2: We synchronize the clocks of the two devices using network time protocol (NTP). This way we don't need to adjust the timing messages based on the one-way network delay; we simply add an NTP time stamp to each song timing message.
For “shared latency” configurations, method 2 has proven more stable than method 1. As an optimization, to avoid excessive timing adjustments, the host only updates the backing track playback position if it is currently more than 50 ms off from the guest's backing track playback position.
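The two position estimates and the 50 ms guard described above can be sketched as follows. This is an illustrative sketch only: the `TimingMessage` structure, its field names and the function names are assumptions, and method 2 presumes that both devices already read a common clock (for example, one disciplined by NTP) rather than showing the NTP exchange itself.

```python
from dataclasses import dataclass
from typing import Optional

DRIFT_THRESHOLD_S = 0.050  # 50 ms threshold stated in the text


@dataclass
class TimingMessage:
    """Hypothetical periodic timing message sent from guest to host."""
    song_position_s: float             # guest's backing-track position when sent
    sent_at_s: Optional[float] = None  # shared-clock send time (method 2 only)


def expected_position_method1(msg, rtt_s):
    """Method 1: advance the reported position by an estimated one-way delay (RTT/2)."""
    return msg.song_position_s + rtt_s / 2.0


def expected_position_method2(msg, now_s):
    """Method 2: advance the reported position by the elapsed time on the shared clock."""
    return msg.song_position_s + (now_s - msg.sent_at_s)


def maybe_resync(host_position_s, expected_position_s, threshold_s=DRIFT_THRESHOLD_S):
    """Return the playback position the host should use.

    The host position is left untouched unless it has drifted from the
    expected guest position by more than the threshold.
    """
    if abs(host_position_s - expected_position_s) > threshold_s:
        return expected_position_s
    return host_position_s
```

For instance, with a reported position of 10.0 s and a measured RTT of 300 ms, method 1 expects the guest to be at about 10.15 s when the message arrives; a host currently at 10.13 s would keep playing undisturbed, while a host at 10.0 s would jump to the expected position.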
As described herein, though video signal/data components may not be specifically illustrated in all drawings, a person of skill in the art having benefit of the present disclosure will understand that performance synchronized video may also be conveyed in audiovisual data encodings that include the more explicitly illustrated audio and in flows analogous to those more explicitly illustrated for audio signal/data components. Just as audio signals are captured, conveyed and mixed, video captured at respective devices is composited in correspondence with the audio with which it is performance synchronized as a temporal alignment and distributed performer synchronization baseline. Video compositing is typically performed at the host device, but in some cases may be performed using facilities of a content server or service platform (recall FIG. 1). In some embodiments, computer readable encodings of musical structure may guide the compositing function, affecting selection, visual placement and/or prominence of performance synchronized video in the apparently live performance. In general, selection, visual placement and/or prominence may be in correspondence with score-coded musical structure such as group (or duet A/B) parts, musical sections, melody/harmony positions of captured or pitch-shifted audio and/or with computationally determined audio features of the vocal audio captured at guest or host device or both.
FIG. 4 is a flow diagram illustrating real-time continuous score-coded pitch-correction and harmony generation for a captured vocal performance in accordance with some embodiments of the present invention(s). In the illustrated configuration, a user/vocalist (e.g., the guest or host vocalist at guest device 101A or host device 101B, recall FIG. 1) sings along with a backing track karaoke style. In the case of the guest vocalist at the current guest device 101A, the operant backing track is backing track 107A, whereas for the host vocalist at the current host device 101B, the operant backing track is guest mix 106 which, at least in embodiments employing the “host sync” method, conveys the original backing track mixed with guest vocals. In either case, vocals captured (251) from a microphone input 201 may optionally be continuously pitch-corrected (252) and harmonized (255) in real-time for mix (253) with the operant backing track audibly rendered at one or more acoustic transducers 202.
Both pitch correction and added harmonies are chosen to correspond to a score 207, which in the illustrated configuration, is wirelessly communicated (281) to the device(s) (e.g., from content server 110 to guest device 101A or via guest device 101A to host device 101B, recall FIG. 1) on which vocal capture and pitch-correction is to be performed, together with lyrics 208 and an audio encoding of the operant backing track 209 (e.g., backing track 107A or guest mix 106). In some cases or embodiments, content selection and guest queue control logic 112 is selective for melody or harmony note selections at the respective guest and host devices 101A and 101B.
In some embodiments of techniques described herein, the note (in a current scale or key) that is closest to that sounded by the user/vocalist is determined based on score 207. While this closest note may typically be a main pitch corresponding to the score-coded vocal melody, it need not be. Indeed, in some cases, the user/vocalist may intend to sing harmony and the sounded notes may more closely approximate a harmony track.
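A minimal sketch of such closest-note selection follows, assuming that the score supplies the currently active melody and harmony targets as MIDI note numbers and that pitch distance is measured in semitones using standard equal-tempered tuning (A4 = 440 Hz). The function names are hypothetical and the sketch is not presented as the method of any particular embodiment.

```python
import math


def hz_to_midi(freq_hz):
    """Convert a frequency to a fractional MIDI note number (A4 = 440 Hz = 69)."""
    return 69.0 + 12.0 * math.log2(freq_hz / 440.0)


def midi_to_hz(midi):
    """Convert a MIDI note number to a frequency in Hz."""
    return 440.0 * 2.0 ** ((midi - 69.0) / 12.0)


def closest_target(detected_hz, candidate_midi_notes):
    """Pick the score-coded candidate nearest (in semitones) to the sounded pitch.

    Candidates may include both the melody note and any harmony notes
    active at the current score position, so a vocalist singing harmony
    is corrected toward the harmony track rather than the melody.
    """
    detected = hz_to_midi(detected_hz)
    return min(candidate_midi_notes, key=lambda note: abs(note - detected))


def correction_ratio(detected_hz, target_midi):
    """Frequency ratio a pitch shifter would apply to reach the target note."""
    return midi_to_hz(target_midi) / detected_hz
```

With a melody target of A4 (MIDI 69) and a harmony target of C5 (MIDI 72), a sounded pitch of 450 Hz resolves to the melody note, whereas a sounded pitch of 520 Hz resolves to the harmony note.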
Audiovisual Capture at Handheld Device
Although performance synchronized video capture need not be supported in all embodiments, handheld device 101 (e.g., current guest device 101A or current host device 101B, recall FIG. 1) may itself capture both vocal audio and performance synchronized video. Thus, FIG. 5 illustrates basic signal processing flows (350) in accord with certain implementations suitable for a mobile phone-type handheld device 101 to capture vocal audio and performance synchronized video, to generate pitch-corrected and optionally harmonized vocals for audible rendering (locally and/or at a remote target device), and to communicate with a content server or service platform 110.
Based on the description herein, persons of ordinary skill in the art will appreciate suitable allocations of signal processing techniques (sampling, filtering, decimation, etc.) and data representations to functional blocks (e.g., decoder(s) 352, digital-to-analog (D/A) converter 351, capture 353, 353A and encoder 355) of a software executable to provide signal processing flows 350 illustrated in FIG. 5. Likewise, relative to FIG. 4, the signal processing flows 250 and illustrative score coded note targets (including harmony note targets), persons of ordinary skill in the art will appreciate suitable allocations of signal processing techniques and data representations to functional blocks and signal processing constructs (e.g., decoder(s) 258, capture 251, digital-to-analog (D/A) converter 258, mixers 253, 254, and encoder 257) that may be implemented at least in part as software executable on a handheld or other portable computing device.
As will be appreciated by persons of ordinary skill in the art, pitch-detection and pitch-correction have a rich technological history in the music and voice coding arts. Indeed, a wide variety of feature picking, time-domain and even frequency-domain techniques have been employed in the art and may be employed in some embodiments in accord with the present invention. With this in mind, and recognizing that multi-vocalist synchronization techniques in accordance with the present invention(s) are generally independent of any particular pitch-detection or pitch-correction technology, the present description does not seek to exhaustively inventory the wide variety of signal processing techniques that may be suitable in various designs or implementations in accord with the present description. Instead, we simply note that in some embodiments in accordance with the present inventions, pitch-detection methods calculate an average magnitude difference function (AMDF) and execute logic to pick a peak that corresponds to an estimate of the pitch period. Building on such estimates, pitch shift overlap add (PSOLA) techniques are used to facilitate resampling of a waveform to produce a pitch-shifted variant while reducing aperiodic effects of a splice.
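By way of illustration only, an AMDF-based pitch period estimate might be sketched as below. The sketch assumes a mono frame of floating-point samples spanning several pitch periods; the search range, the 10% depth threshold and the function names are arbitrary illustrative choices. Because the AMDF dips toward zero at lags equal to the pitch period, the sketch selects the first deep valley, which is equivalent to picking a peak of the inverted function.

```python
import math


def amdf(frame, lag):
    """Average magnitude difference between the frame and itself shifted by lag."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n


def estimate_pitch_hz(frame, sample_rate, min_hz=80.0, max_hz=1000.0):
    """Estimate pitch from the first deep valley of the AMDF.

    Valleys recur at every multiple of the pitch period, so the sketch
    takes the smallest lag that falls near the global minimum and then
    walks down to the bottom of that valley.
    """
    min_lag = int(sample_rate / max_hz)
    max_lag = min(int(sample_rate / min_hz), len(frame) - 1)
    values = {lag: amdf(frame, lag) for lag in range(min_lag, max_lag + 1)}
    floor, ceiling = min(values.values()), max(values.values())
    threshold = floor + 0.1 * (ceiling - floor)  # illustrative depth threshold
    lag = next(l for l in range(min_lag, max_lag + 1) if values[l] <= threshold)
    while lag < max_lag and values[lag + 1] < values[lag]:
        lag += 1
    return sample_rate / lag


# Synthetic test tone: 200 Hz sine sampled at 8 kHz (period of 40 samples).
SAMPLE_RATE = 8000
frame = [math.sin(2.0 * math.pi * 200.0 * i / SAMPLE_RATE) for i in range(400)]
```

A practical detector would typically add voicing detection, interpolation between integer lags and smoothing across frames, none of which is attempted here.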
FIG. 6 illustrates features of a mobile device that may serve as a platform for execution of software implementations in accordance with some embodiments of the present invention. More specifically, FIG. 6 is a block diagram of a mobile device 400 that is generally consistent with commercially-available versions of an iPhone™ mobile digital device.
Although embodiments of the present invention are certainly not limited to iPhone deployments or applications (or even to iPhone-type devices), the iPhone device platform, together with its rich complement of sensors, multimedia facilities, application programmer interfaces and wireless application delivery model, provides a highly capable platform on which to deploy certain implementations. Based on the description herein, persons of ordinary skill in the art will appreciate a wide range of additional mobile device platforms that may be suitable (now or hereafter) for a given implementation or deployment of the inventive techniques described herein.
Summarizing briefly, mobile device 400 includes a display 402 that can be sensitive to haptic and/or tactile contact with a user. Touch-sensitive display 402 can support multi-touch features, processing multiple simultaneous touch points, including processing data related to the pressure, degree and/or position of each touch point. Such processing facilitates gestures and interactions with multiple fingers and other interactions. Of course, other touch-sensitive display technologies can also be used, e.g., a display in which contact is made using a stylus or other pointing device. Typically, mobile device 400 presents a graphical user interface on the touch-sensitive display 402, providing the user access to various system objects and conveying information. In some implementations, the graphical user interface can include one or more display objects 404, 406. In the example shown, the display objects 404, 406 are graphic representations of system objects. Examples of system objects include device functions, applications, windows, files, alerts, events, or other identifiable system objects. In some embodiments of the present invention, applications, when executed, provide at least some of the digital acoustic functionality described herein.
Typically, the mobile device 400 supports network connectivity including, for example, both mobile radio and wireless internetworking functionality to enable the user to travel with the mobile device 400 and its associated network-enabled functions. In some cases, the mobile device 400 can interact with other devices in the vicinity (e.g., via Wi-Fi, Bluetooth, etc.). For example, mobile device 400 can be configured to interact with peers or a base station for one or more devices. As such, mobile device 400 may grant or deny network access to other wireless devices.
Mobile device 400 includes a variety of input/output (I/O) devices, sensors and transducers. For example, a speaker 460 and a microphone 462 are typically included to facilitate audio, such as the capture of vocal performances and audible rendering of backing tracks and mixed pitch-corrected vocal performances as described elsewhere herein. In some embodiments of the present invention, speaker 460 and microphone 462 may provide appropriate transducers for techniques described herein. An external speaker port 464 can be included to facilitate hands-free voice functionalities, such as speaker phone functions.
An audio jack 466 can also be included for use of headphones and/or a microphone. In some embodiments, an external speaker and/or microphone may be used as a transducer for the techniques described herein.
Other sensors can also be used or provided. A proximity sensor 468 can be included to facilitate the detection of user positioning of mobile device 400. In some implementations, an ambient light sensor 470 can be utilized to facilitate adjusting brightness of the touch-sensitive display 402. An accelerometer 472 can be utilized to detect movement of mobile device 400, as indicated by the directional arrow 474. Accordingly, display objects and/or media can be presented according to a detected orientation, e.g., portrait or landscape. In some implementations, mobile device 400 may include circuitry and sensors for supporting a location determining capability, such as that provided by the global positioning system (GPS) or other positioning systems (e.g., systems using Wi-Fi access points, television signals, cellular grids, Uniform Resource Locators (URLs)) to facilitate geocodings described herein. Mobile device 400 also includes a camera lens and imaging sensor 480. In some implementations, instances of a camera lens and sensor 480 are located on front and back surfaces of the mobile device 400. The cameras allow capture of still images and/or video for association with captured pitch-corrected vocals.
Mobile device 400 can also include one or more wireless communication subsystems, such as an 802.11 b/g/n/ac communication device, and/or a Bluetooth™ communication device 488. Other communication protocols can also be supported, including other 802.x communication protocols (e.g., WiMax, Wi-Fi, 3G), fourth generation protocols and modulations (4G-LTE) and beyond (e.g., 5G), code division multiple access (CDMA), global system for mobile communications (GSM), Enhanced Data GSM Environment (EDGE), etc. A port device 490, e.g., a Universal Serial Bus (USB) port, or a docking port, or some other wired port connection, can be included and used to establish a wired connection to other computing devices, such as other communication devices 400, network access devices, a personal computer, a printer, or other processing devices capable of receiving and/or transmitting data. Port device 490 may also allow mobile device 400 to synchronize with a host device using one or more protocols, such as, for example, TCP/IP, HTTP, UDP or any other known protocol.
User Interface Examples
FIGs. 7A and 7B illustrate a video presentation of livestream content for which a compositing of images for first and second performers is performed at a host device in accordance with some embodiments of the present invention(s). In the illustrations, it will be appreciated that performance synchronized video of performers captured at respective host and guest devices are composited to provide visuals of the apparently live performance. An image blurring at boundary technique is illustrated.
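As a rough illustration of compositing two performer images with softening at the visual boundary between them, consider the following Python sketch. It is not the compositing of any particular embodiment: a linear cross-fade (feathering) across a band of overlapping columns is used here as a simple stand-in for the illustrated image blurring, and the function name, the frame representation (lists of rows of grayscale values) and the `blend_width` parameter are assumptions made for the example.

```python
def composite_side_by_side(left, right, blend_width):
    """Place two equal-size grayscale frames side by side, cross-fading the seam.

    Frames are lists of rows of floats. The rightmost `blend_width` columns of
    `left` overlap the leftmost `blend_width` columns of `right`, so the output
    is 2 * width - blend_width columns wide.
    """
    height = len(left)
    width = len(left[0])
    if len(right) != height or len(right[0]) != width:
        raise ValueError("frames must have identical dimensions")
    if not 0 <= blend_width <= width:
        raise ValueError("blend_width must be between 0 and the frame width")
    seam_start = width - blend_width  # first output column of the overlap band
    out = []
    for y in range(height):
        row = []
        for x in range(2 * width - blend_width):
            if x < seam_start:
                row.append(left[y][x])  # pure first-performer pixel
            elif x >= width:
                row.append(right[y][x - seam_start])  # pure second-performer pixel
            else:
                # Inside the band: weight ramps linearly from left to right image.
                t = (x - seam_start + 1) / (blend_width + 1)
                row.append((1 - t) * left[y][x] + t * right[y][x - seam_start])
        out.append(row)
    return out


# Usage: a black frame and a white frame, four pixels wide, two-pixel band.
composited = composite_side_by_side([[0.0] * 4], [[1.0] * 4], blend_width=2)
```

A production compositor would more likely operate on GPU textures and might apply a true spatial blur along the seam; the sketch only shows the idea of hiding a hard edge between host and guest video.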
FIGs. 8A and 8B illustrate a selfie chat interaction mechanism in which a capture viewport is presented on screen and a user interaction mechanic is supported whereby a user holds a touchscreen-presented button or other feature to capture a video snippet and releases to post the video snippet in a social media interaction in accordance with some embodiments of the present invention(s). Some embodiments support alternative or additional gesture mechanics such as tap-to-start and tap-to-stop with a post confirmation.
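The hold-to-capture, release-to-post mechanic can be modeled as a small state machine, sketched below in Python purely for illustration. The class and method names, the `max_seconds` snippet length and the in-memory `posted` list (standing in for transmission to a service platform) are all hypothetical and are not drawn from any described implementation.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class CaptureState(Enum):
    IDLE = auto()
    CAPTURING = auto()
    POSTED = auto()


@dataclass
class SnippetCapture:
    """Hypothetical model of hold-to-capture, release-to-post."""
    max_seconds: float = 5.0          # assumed snippet length limit
    state: CaptureState = CaptureState.IDLE
    elapsed: float = 0.0
    posted: list = field(default_factory=list)  # stand-in for the thread

    def press(self):
        """First gesture: user touches and holds the capture feature."""
        if self.state is not CaptureState.CAPTURING:
            self.state = CaptureState.CAPTURING
            self.elapsed = 0.0

    def tick(self, dt):
        """Advance capture time; returns progress in [0, 1] for the indicator."""
        if self.state is CaptureState.CAPTURING:
            self.elapsed = min(self.max_seconds, self.elapsed + dt)
        return self.elapsed / self.max_seconds

    def release(self):
        """Second gesture: releasing the contact posts the captured snippet."""
        if self.state is CaptureState.CAPTURING:
            self.posted.append(self.elapsed)
            self.state = CaptureState.POSTED


# Usage: hold for two seconds of a four-second maximum, then release to post.
capture = SnippetCapture(max_seconds=4.0)
capture.press()
capture.tick(1.0)
progress = capture.tick(1.0)
capture.release()
```

A tap-to-start, tap-to-stop variant could reuse the same states by mapping the first tap to `press` and the second tap to `release`, with a confirmation step before posting.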
FIGs. 9A, 9B and 9C illustrate a user part selection and coordination mechanism in which gestured selections by user-performers on geographically-distributed devices provide part selections on peer devices for a livestream performance in accordance with some embodiments of the present invention(s). Some embodiments support alternative or additional gesture mechanics including song selection, such as based on coded interests of one or more of the user-performers or history of performances. In some cases or embodiments, song selection presents as a pseudorandom song roulette selection triggered by one or more of the user-performers or automatically such as based on expiry of a timer.
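The coordination of part selections between peer devices, and a pseudorandom song roulette, might be sketched as follows. This Python sketch is illustrative only: the class name, the "host" and "guest" keys, the two-part assumption and the `seed` parameter are hypothetical, and network propagation of selections between devices is omitted.

```python
import random


class PartSelection:
    """Hypothetical shared state for two-part vocal part selection."""

    def __init__(self, parts=("part A", "part B")):
        self.parts = tuple(parts)
        self.assignment = {"host": self.parts[0], "guest": self.parts[1]}
        self.locked = False

    def select(self, performer, part):
        """Apply a gestured selection; the peer receives the other part."""
        if self.locked:
            return False  # assignment is fixed once capture has started
        if performer not in self.assignment or part not in self.parts:
            raise ValueError("unknown performer or part")
        other = "guest" if performer == "host" else "host"
        self.assignment[performer] = part
        self.assignment[other] = next(p for p in self.parts if p != part)
        return True

    def start_capture(self):
        """Either performer gesturing a start fixes the current assignment."""
        self.locked = True


def song_roulette(library, seed=None):
    """Pseudorandomly pick a song; a seed makes the pick reproducible."""
    return random.Random(seed).choice(list(library))


# Usage: the guest picks part A, so the host is given part B, then capture starts.
selection = PartSelection()
selection.select("guest", "part A")
selection.start_capture()
song = song_roulette(["song 1", "song 2", "song 3"], seed=7)
```

In a deployed system each accepted selection would be communicated to the peer device so that both user interfaces show the same assignment; a timer expiry could call `song_roulette` to trigger automatic song changes.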
An Exemplary Mobile Device
FIG. 10 illustrates respective instances (701, 720A, 720B and 711) of computing devices programmed (or programmable) with vocal audio and video capture code, user interface code, pitch correction code, an audio rendering pipeline and playback code in accord with the functional descriptions herein. Device instance 701 is depicted operating in a vocal audio and performance-synchronized video capture mode, while device instances 720A and 720B are depicted as operating in a mode that receives livestreamed mixed audiovisual performances. Though television-type display and/or set-top box equipment 720B is depicted operating in a livestream receiving mode, such equipment and computer 711 may operate as part of a vocal audio and performance synchronized video capture facility (as guest device 101A or host device 101B, recall FIG. 1). Each of the aforementioned devices communicates via wireless data transport and/or intervening networks 704 with a server 712 or service platform that hosts storage and/or functionality explained herein with regard to content server 110. Captured, pitch-corrected vocal performances mixed with performance-synchronized video to define a multi-vocalist audiovisual performance as described herein may (optionally) be livestreamed and audiovisually rendered at laptop computer 711.
OTHER EMBODIMENTS
While the invention(s) is (are) described with reference to various embodiments, it will be understood that these embodiments are illustrative and that the scope of the invention(s) is not limited to them. Many variations, modifications, additions, and improvements are possible. For example, while pitch correction vocal performances captured in accord with a karaoke-style interface have been described, other variations will be appreciated.
Furthermore, while certain illustrative signal processing techniques have been described in the context of certain illustrative applications, persons of ordinary skill in the art will recognize that it is straightforward to modify the described techniques to accommodate other suitable signal processing techniques and effects.
Embodiments in accordance with the present invention may take the form of, and/or be provided as, a computer program product encoded in a machine-readable medium as instruction sequences and other functional constructs of software, which may in turn be executed in a computational system (such as an iPhone handheld, mobile or portable computing device, media application platform, set-top box, or content server platform) to perform methods described herein. In general, a machine readable medium can include tangible articles that encode information in a form (e.g., as applications, source or object code, functionally descriptive information, etc.) readable by a machine (e.g., a computer, computational facilities of a mobile or portable computing device, media device or streamer, etc.) as well as non-transitory storage incident to transmission of the information. A machine-readable medium may include, but need not be limited to, magnetic storage medium (e.g., disks and/or tape storage); optical storage medium (e.g., CD-ROM, DVD, etc.); magneto-optical storage medium; read only memory (ROM); random access memory (RAM); erasable programmable memory (e.g., EPROM and EEPROM); flash memory; or other types of medium suitable for storing electronic instructions, operation sequences, functionally descriptive information encodings, etc.
In general, plural instances may be provided for components, operations or structures described herein as a single instance. Boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in the exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the invention(s).

Claims

WHAT IS CLAIMED IS:
1. A collaboration method for a livestream broadcast of a coordinated audiovisual work of first and second performers captured at respective geographically-distributed, first and second devices, the method comprising:
receiving at the second device, a media encoding of an audiovisual performance mixed with a backing audio track and including (i) vocal audio captured at the first device from a first one of the performers and (ii) video that is performance synchronized with the captured first performer vocals;
at the second device, audibly rendering the received mixed audio performance and capturing thereagainst vocal audio from a second one of the performers;
mixing the captured second performer vocal audio with the received mixed audio performance to provide a broadcast audiovisual mix that includes the captured vocal audio of the first and second performers and the backing audio track without apparent temporal lag therebetween; and
supplying the audiovisual broadcast mix to a service platform configured to livestream the broadcast audiovisual mix to plural recipient devices constituting an audience.
2. The method of claim 1,
wherein the performance synchronized video included in the received media encoding is captured in connection with the vocal capture at the first device,
wherein the method further includes capturing, at the second device, video that is performance synchronized with the captured second performer vocals, and
wherein the audiovisual broadcast mix is an audiovisual mix of captured audio and video of at least the first and second performers.
3. The method of claim 1, further comprising:
capturing, at the second device, second performer video that is performance synchronized with the captured second performer vocals; and
compositing the second performer video with video for the first performer in the supplied audiovisual broadcast mix.
4. The method of claim 3,
wherein the first and second performer video compositing includes, for at least some portions of the supplied audiovisual broadcast mix, a computational blurring of image frames of first and second performer video at a visual boundary therebetween.
5. The method of claim 3, further comprising:
dynamically varying in a course of the audiovisual broadcast mix relative visual prominence of one or the other of the first and second performers.
6. The method of claim 5,
wherein the dynamic varying is, at least partially, in correspondence with time varying vocal part codings in an audio score corresponding to and temporally synchronized with the backing audio track.
7. The method of claim 5,
wherein the dynamic varying is, at least partially, based on evaluation of a computationally defined audio feature of either or both of the first and second performer vocals.
8. The method of claim 1,
wherein the first device is associated with the second device as a current livestream guest, and
wherein the second device operates as a current livestream host, the current livestream host controlling association and dissociation of particular devices from the audience as the current livestream guest.
9. The method of claim 2,
wherein the current livestream host selects from a queue of requests from the audience to associate as the current livestream guest.
10. The method of claim 1, wherein the first device operates in a livestream guest role and the second device operates in a livestream host role, the method further comprising either or both of:
the second device releasing the livestream host role for assumption by another device; and
the second device passing the livestream host role to a particular device selected from a set comprising the first device and the audience.
11. The method of claim 1, further comprising:
accessing a machine readable encoding of musical structure that includes at least musical section boundaries coded for temporal alignment with the vocal audio captured at the first and second devices; and
applying a first visual effect schedule to at least a portion of audiovisual broadcast mix, wherein the applied visual effect schedule encodes differing visual effects for differing musical structure elements of the first audiovisual performance encoding and provides visual effect transitions in temporal alignment with at least some of the coded musical section boundaries.
12. The method of claim 11, wherein the differing visual effects encoded by the applied visual effect schedule include for a given element thereof, one or more of:
a particle-based effect or lens flare;
transitions between distinct source videos;
animations or motion of a frame within a source video;
vector graphics or images of patterns or textures; and
color, saturation or contrast.
13. The method of claim 11,
wherein the associated musical structure encodes musical sections of differing types; and
wherein the applied visual effect schedule defines differing visual effects for different ones of the encoded musical sections.
14. The method of claim 11,
wherein the associated musical structure encodes events or transitions; and
wherein the applied visual effect schedule defines differing visual effects for different ones of the encoded events or transitions.
15. The method of claim 11,
wherein the associated musical structure encodes group parts, and
wherein the applied visual effect schedule is temporally selective for particular performance synchronized video in correspondence with the encoded musical structure.
16. The method of any of claims 1 to 15, performed, at least in part, on a handheld mobile device communicatively coupled to a content server or service platform.
17. The method of any of claims 1 to 15, embodied, at least in part, as a computer program product encoding of instructions executable on the second device as part of a cooperative system including a content server or service platform to which a plurality of geographically-distributed, network-connected, vocal capture devices, including the second device, are communicatively coupled.
18. A system for dissemination of an apparently live broadcast of a joint performance of geographically-distributed first and second performers, the system comprising:
first and second devices coupled by a communication network with non-negligible peer-to-peer latency for transmission of audiovisual content;
the first device communicatively coupled to supply to the second device an audiovisual performance mixed with a backing audio track and including (1) vocal audio of the first performer captured against the backing audio track and (2) video that is performance synchronized therewith; and
the second device communicatively configured to receive a media encoding of the mixed audiovisual performance and to audibly render at least audio portions of the mixed audiovisual performance, to capture thereagainst vocal audio of the second performer, and to mix the captured second performer vocal audio with the received mixed audiovisual performance for transmission as the apparently live broadcast.
19. The system of claim 18,
wherein the second device is further configured to capture second performer video that is performance synchronized with the captured second performer vocals and to composite the second performer video with video for the first performer in the supplied audiovisual broadcast mix.
20. The system of claim 19,
wherein the first and second performer video compositing includes, for at least some portions of the supplied audiovisual broadcast mix, a computational blurring of image frames of first and second performer video at a visual boundary therebetween.
21. The system of claim 19,
wherein the first and second performer video compositing includes dynamically varying in a course of the audiovisual broadcast mix relative visual prominence of one or the other of the first and second performers.
22. The system of claim 21,
wherein the dynamic varying is, at least partially, in correspondence with time varying vocal part codings in an audio score corresponding to and temporally synchronized with the backing audio track.
23. The system of claim 21,
wherein the dynamic varying is, at least partially, based on evaluation of a computationally defined audio feature of either or both of the first and second performer vocals.
24. The system of claim 18,
wherein the first device is associated with the second device as a current livestream guest, and
wherein the second device operates as a current livestream host, the current livestream host controlling association and dissociation of particular devices from the audience as the current livestream guest.
25. The system of claim 24,
wherein the current livestream host selects from a queue of requests from the audience to associate as the current livestream guest.
26. The system of claim 18, further comprising:
a video compositor that accesses a machine readable encoding of musical structure including at least musical section boundaries coded for temporal alignment with the vocal audio captured at the first and second devices and that applies a first visual effect schedule to at least a portion of audiovisual broadcast mix, wherein the applied visual effect schedule encodes differing visual effects for differing musical structure elements of the first audiovisual performance encoding and provides visual effect transitions in temporal alignment with at least some of the coded musical section boundaries.
27. The system of claim 26, wherein the video compositor is hosted either on the second device or on a content server or service platform through which the apparently live performance is supplied.
28. The system of claim 18, wherein, as part of a user interface visual on either or both of the first and second devices, for a current song selection, a different vocal part selection is presented for each of the performers; and
wherein responsive to, and in correspondence with, gestures by either or both of the first and second performers at the respective geographically-distributed devices, the vocal part selections are updated, wherein assignment of a particular vocal part selection to the respective first or second performers is changeable until one or the other of the first and second performers gestures a start of vocal capture on a respective geographically-distributed device, whereupon a then-current assignment of particular vocal part selections to the respective first or second performers is fixed for duration of capture of the coordinated multi-vocal performance.
29. A user interface method for social media, the method comprising:
as part of a user interface visual on a touchscreen display of a client device, presenting on the touchscreen display, live video captured using a camera of the client device;
responsive to a first touchscreen gesture by a user of the client device, initiating capture of a snippet of the live video and presenting, as part of the user interface visual, a progress indication in correspondence with an accreting capture of the snippet; and
responsive to a second touchscreen gesture by the user of the client device, transmitting the captured snippet to a network-coupled service platform as a posting in a multiuser social media thread.
30. The method of claim 29, further comprising:
presenting the multiuser social media thread on the touchscreen display, the presented multiuser social media thread including the captured snippet together with posted temporally-ordered content from other users received via the network-coupled service platform, wherein the posted content from at least one other user includes one or more of text and captured snippet of video from the at least one other user.
31. The method of claim 29,
wherein the captured snippet is a fixed-length snippet, and
wherein the method further includes visually updating the progress indication in correspondence with the portion of the fixed-length snippet captured.
32. The method of claim 29,
wherein the first touchscreen gesture is a maintained contact, by the user, with a first visually presented feature on the touchscreen display, and
wherein the second touchscreen gesture includes release of the maintained contact.
33. The method of claim 29,
wherein the first touchscreen gesture is a first tap-type contact, by the user, with a first visually presented feature on the touchscreen display, and
wherein the second touchscreen gesture is a second tap-type contact following the first tap-type contact on the touchscreen display.
34. The method of claim 29, further comprising:
presenting the multiuser social media thread on the touchscreen display in correspondence with a livestreamed audiovisual broadcast mix.
35. A method for capture of at least a portion of a coordinated multi-vocal performance of first and second performers at respective first and second geographically-distributed devices, the method comprising:
as part of a user interface visual on either or both of the first and second devices, presenting for a current song selection, a different vocal part selection for each of the performers; and
responsive to, and in correspondence with, gestures by either or both of the first and second performers at the respective geographically-distributed devices, updating the vocal part selections, wherein assignment of a particular vocal part selection to the respective first or second performers is changeable until one or the other of the first and second performers gestures a start of vocal capture on a respective geographically-distributed device, whereupon a then-current assignment of particular vocal part selections to the respective first or second performers is fixed for duration of capture of the coordinated multi-vocal performance.
36. The method of claim 35, further comprising:
updating the vocal part selections at the second device in correspondence with a gestured selection communicated from the first device; and
supplying the first device with updates to the vocal part selections in correspondence with a gestured selection at the second device.
37. The method of claim 35, further comprising:
changing the current song selection and, in correspondence therewith, updating on either or both of the first and second devices the user interface visual.
38. The method of claim 37,
wherein the change in current song selection is triggered by one or the other of the first and second performers on a respective one of the first and second devices.
39. The method of claim 37, further comprising:
triggering the change in current song selection based on a periodic or recurring event.
40. The method of claim 37,
wherein the change in current song selection selects from a library of song selections based on one or more of coded interests and performance history of either or both of the first and second performers.
41. The method of claim 35, further comprising:
beginning with the start of vocal capture, receiving at the second device, a media encoding of a mixed audio performance (i) including vocal audio captured at the first device from a first one of the performers and (ii) mixed with a backing audio track for the current song selection;
at the second device, audibly rendering the received mixed audio performance and capturing thereagainst vocal audio from a second one of the performers; and
mixing the captured second performer vocal audio with the received mixed audio performance to provide a broadcast mix that includes the captured vocal audio of the first and second performers and the backing audio track without apparent temporal lag therebetween.
42. The method of claim 41, further comprising:
at the second device, visually presenting in correspondence with the audible rendering, lyrics and score-coded note targets for the current song selection, wherein the visually presented lyrics and note targets correspond to the assignment of a particular vocal part at the start of vocal capture.
43. The method of claim 41,
wherein the received media encoding includes video that is performance synchronized with the captured first performer vocals,
wherein the method further includes capturing, at the host device, video that is performance synchronized with the captured second performer vocals, and
wherein the broadcast mix is an audiovisual mix of captured audio and video of at least the first and second performers.
44. The method of claim 41, further comprising:
capturing, at the host device, second performer video that is performance synchronized with the captured second performer vocals; and
compositing the second performer video with video for the first performer in the supplied audiovisual broadcast mix.
45. The method of claim 41, further comprising:
supplying the broadcast mix to a service platform configured to livestream the broadcast mix to plural recipient devices constituting an audience.
PCT/US2019/037479 2018-06-15 2019-06-17 Audiovisual livestream system and method with latency management and social media-type user interface mechanics WO2019241778A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980052977.9A CN112567758A (en) 2018-06-15 2019-06-17 Audio-visual live streaming system and method with latency management and social media type user interface mechanism
EP19819554.7A EP3808096A4 (en) 2018-06-15 2019-06-17 Audiovisual livestream system and method with latency management and social media-type user interface mechanics

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862685727P 2018-06-15 2018-06-15
US62/685,727 2018-06-15

Publications (1)

Publication Number Publication Date
WO2019241778A1 true WO2019241778A1 (en) 2019-12-19

Family

ID=68842358

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/037479 WO2019241778A1 (en) 2018-06-15 2019-06-17 Audiovisual livestream system and method with latency management and social media-type user interface mechanics

Country Status (3)

Country Link
EP (1) EP3808096A4 (en)
CN (1) CN112567758A (en)
WO (1) WO2019241778A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020117823A1 (en) 2018-12-03 2020-06-11 Smule, Inc. Augmented reality filters for captured audiovisual performances
US20220208157A1 (en) * 2019-04-29 2022-06-30 Paul Andersson System and method for providing electronic musical scores
US20230005462A1 (en) * 2018-05-21 2023-01-05 Smule, Inc. Audiovisual collaboration system and method with seed/join mechanic

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20070016901A (en) * 2005-08-05 2007-02-08 주식회사 오아시스미디어 Internet broadcasting system of Real-time multilayer multimedia image integrated system and Method thereof
US20080270541A1 (en) * 2006-04-24 2008-10-30 Ellis Barlow Keener Interactive audio/video method on the internet
CN102456340A (en) 2010-10-19 2012-05-16 盛大计算机(上海)有限公司 Karaoke in-pair singing method based on internet and system thereof
WO2015103415A1 (en) 2013-12-31 2015-07-09 Smule, Inc. Computationally-assisted musical sequencing and/or composition techniques for social music challenge or competition
US20160057316A1 (en) 2011-04-12 2016-02-25 Smule, Inc. Coordinating and mixing audiovisual content captured from geographically distributed performers
KR101605497B1 (en) * 2014-11-13 2016-03-22 유영재 A Method of collaboration using apparatus for musical accompaniment
US20160358595A1 (en) * 2015-06-03 2016-12-08 Smule, Inc. Automated generation of coordinated audiovisual work based on content captured geographically distributed performers

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8234395B2 (en) * 2003-07-28 2012-07-31 Sonos, Inc. System and method for synchronizing operations among a plurality of independently clocked digital data processing devices
US20090113022A1 (en) * 2007-10-24 2009-04-30 Yahoo! Inc. Facilitating music collaborations among remote musicians
CN101853498B (en) * 2009-03-31 2012-01-11 华为技术有限公司 Image synthetizing method and image processing device
GB2506404B (en) * 2012-09-28 2015-03-18 Memeplex Ltd Automatic audio mixing
ITMI20121617A1 (en) * 2012-09-28 2014-03-29 St Microelectronics Srl METHOD AND SYSTEM FOR SIMULTANEOUS PLAYING OF AUDIO TRACKS FROM A PLURALITY OF DIGITAL DEVICES.
WO2014175482A1 (en) * 2013-04-24 2014-10-30 (주)씨어스테크놀로지 Musical accompaniment device and musical accompaniment system using ethernet audio transmission function
US9331799B2 (en) * 2013-10-07 2016-05-03 Bose Corporation Synchronous audio playback
CN106601220A (en) * 2016-12-08 2017-04-26 天脉聚源(北京)传媒科技有限公司 Method and device for recording antiphonal singing of multiple persons

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3808096A4

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230005462A1 (en) * 2018-05-21 2023-01-05 Smule, Inc. Audiovisual collaboration system and method with seed/join mechanic
WO2020117823A1 (en) 2018-12-03 2020-06-11 Smule, Inc. Augmented reality filters for captured audiovisual performances
EP3892001A4 (en) * 2018-12-03 2022-12-28 Smule, Inc. Augmented reality filters for captured audiovisual performances
US20220208157A1 (en) * 2019-04-29 2022-06-30 Paul Andersson System and method for providing electronic musical scores

Also Published As

Publication number Publication date
EP3808096A1 (en) 2021-04-21
EP3808096A4 (en) 2022-06-15
CN112567758A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
US11553235B2 (en) Audiovisual collaboration method with latency management for wide-area broadcast
US11683536B2 (en) Audiovisual collaboration system and method with latency management for wide-area broadcast and social media-type user interface mechanics
US11394855B2 (en) Coordinating and mixing audiovisual content captured from geographically distributed performers
US11158296B2 (en) Automated generation of coordinated audiovisual work based on content captured geographically distributed performers
US20230335094A1 (en) Audio-visual effects system for augmentation of captured performance based on content thereof
US10943574B2 (en) Non-linear media segment capture and edit platform
US20220051448A1 (en) Augmented reality filters for captured audiovisual performances
EP3808096A1 (en) Audiovisual livestream system and method with latency management and social media-type user interface mechanics
US20220122573A1 (en) Augmented Reality Filters for Captured Audiovisual Performances
WO2016070080A1 (en) Coordinating and mixing audiovisual content captured from geographically distributed performers
CN111345044B (en) Audiovisual effects system for enhancing a performance based on content of the performance captured

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19819554; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
WWE WIPO information: entry into national phase (Ref document number: 2019819554; Country of ref document: EP)