CN112567758A - Audio-visual live streaming system and method with latency management and social media type user interface mechanism

Info

Publication number
CN112567758A
Authority
CN
China
Prior art keywords
performer
vocal
audio
performance
captured
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201980052977.9A
Other languages
Chinese (zh)
Inventor
Anton Holmberg
Benjamin Hersh
Jeannie Yang
Yuning Wu
Wang Liang
Perry R. Cook
Jeffrey C. Smith
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Smule Inc
Original Assignee
Smule Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Smule Inc filed Critical Smule Inc
Publication of CN112567758A

Classifications

    • G10H1/361 Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
    • G10H1/366 Recording/reproducing of accompaniment for use with an external source, with means for modifying or correcting the external signal, e.g. pitch correction, reverberation, changing a singer's voice
    • G10H1/0008 Details of electrophonic musical instruments; associated control or indicating means
    • G10H1/0058 Transmission between separate instruments or between individual components of a musical system
    • G11B27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • H04N21/2187 Live feed
    • H04N21/2368 Multiplexing of audio and video streams
    • H04N21/242 Synchronization processes, e.g. processing of PCR [Program Clock References]
    • H04N21/854 Content authoring
    • G10H2210/251 Chorus, i.e. automatic generation of two or more extra voices added to the melody
    • G10H2210/331 Note pitch correction, i.e. modifying a note pitch or replacing it by the closest one in a given scale
    • G10H2240/175 Transmission, remote access or control of music data for electrophonic musical instruments for jam sessions or musical collaboration through a network; compensation of network or internet delays therefor
    • G10H2240/251 Mobile telephone transmission, i.e. transmitting, accessing or controlling music data wirelessly via a wireless or mobile telephone receiver

Abstract

Techniques have been developed to facilitate the live streaming of group audiovisual performances. Audiovisual performances including vocal music are captured and coordinated with the performances of other users in ways that can create compelling user and listener experiences. For example, in some cases or embodiments, duets with a host performer may be supported in a sing-with-the-artist style audiovisual live stream in which aspiring vocalists request or queue particular songs for a live broadcast entertainment format. The developed techniques provide a communications-latency-tolerant mechanism for synchronizing, in performance, vocals captured at geographically separated devices (e.g., at globally distributed but network-connected mobile phones or tablets, or at audiovisual capture devices geographically separated from a live studio).

Description

Audio-visual live streaming system and method with latency management and social media type user interface mechanism
Technical Field
The present invention relates generally to the capture, processing, and/or broadcast of multi-performer audiovisual performances and, in particular, to techniques suitable for managing transmission latency for audiovisual content captured in the context of near-real-time audiovisual collaboration among multiple geographically distributed performers.
Background
The installed base of mobile phones, personal media players, portable computing devices, streaming media players, and television set-top boxes grows in sheer number and computational power each day. Ubiquitous and deeply entrenched in the lifestyles of people around the world, many of these devices transcend cultural and economic barriers. Computationally, they offer speed and storage capabilities comparable to engineering workstation or workgroup computers from less than ten years ago, and typically include powerful media processors, rendering them suitable for real-time sound synthesis and other musical applications. Partly as a result, media application platforms such as iPhone®, iPod Touch®, and other iOS® or Android devices, as well as portable handheld devices and set-top box (STB) type devices such as Apple TV®, support audio and video processing quite capably while providing platforms suitable for advanced user interfaces. Indeed, applications such as Smule Ocarina™, Leaf Trombone®, I Am T-Pain™, Smule (fka Sing! Karaoke™), Guitar! By Smule®, and Magic Piano® available from Smule, Inc. have shown that such devices can be used to deliver advanced digital acoustic techniques in ways that provide compelling musical experiences.
Embodiments of Smule (fka Sing! Karaoke™) have previously demonstrated the accretion of vocal performances captured, on a non-real-time basis with respect to each other, using geographically distributed handheld devices, as well as tighter couplings between portable handheld devices hosting local media application platforms and typically communicating (e.g., within a room) over the same local- or personal-area network segment with negligible short-range latency. Improved techniques and functional capabilities are desired to extend the sense of intimacy of "now" or "live" to collaborative vocal performances in which the performers are separated by more significant geographic distances and non-negligible communication latencies exist between devices.
Significant practical challenges exist as researchers seek to transition their innovations into commercial applications deployable to modern handheld devices and media application platforms, within the real-world constraints imposed by processors, memory, and other limited computing resources, and/or within the communication bandwidth and transmission latency constraints typical of wireless and wide-area networks. For example, while applications such as Smule (fka Sing! Karaoke™) have demonstrated that post-performance audiovisual mixing can simulate vocal duets or collaborative vocal performances by larger numbers of performers, creating the perception of now-and-live collaboration has proven elusive absent physical co-location.
Improved technical and functional capabilities are desired, particularly with respect to communication latency and the management of captured audiovisual content, so that a combined audiovisual performance can nonetheless be disseminated (e.g., broadcast) as a live interactive collaborative presentation of geographically distributed performers to audiences, listeners, and/or viewers. Approaches that provide audiences with a sense of "now" or "live" intimacy, and that allow audience members to engage in and participate in the performance, are likewise desirable.
Disclosure of Invention
It has been discovered that, despite practical limitations imposed by mobile device platforms and other media application execution environments, audiovisual performances including vocal music can be captured and coordinated with the audiovisual performances of other users in ways that create compelling user and listener experiences. In some cases, vocal performances (and performance-synchronized video) of collaborating contributors are captured in the context of a karaoke-style presentation of lyrics and in correspondence with an audible rendering of a backing track. In some cases, vocals (and, typically, synchronized video) are captured as part of a live or improvisational performance involving vocal interactions (e.g., duets or dialogues) between the collaborating contributors. In either case, it is envisioned that non-negligible network communication latencies will exist between at least some of the collaborating contributors, particularly where those contributors are geographically separated. As a result, a technical challenge exists in managing latencies and the captured audiovisual content such that a combined audiovisual performance can nonetheless be disseminated (e.g., broadcast) as a live interactive collaborative presentation to audiences, listeners, and/or viewers.
In one technique for achieving such a facsimile of live interactive performance collaboration, actual and non-negligible network communication latency is (in effect) masked in one direction between a guest and a host performer and tolerated in the other direction. For example, an audiovisual performance captured from a guest performer on a "live show" internet broadcast of a host performer can include a guest + host duet sung in apparent real-time synchrony. In some cases, the host may be a performer who has popularized a particular musical performance. In some cases, the guest may be an amateur vocalist given the opportunity to sing "live" (though remote) with the popular artist or group that is actually "in the studio" as (or with) the show's host. Notwithstanding the non-negligible network communication delay from guest to host (perhaps 200 ms or more) for conveyance of the guest's audiovisual contribution stream, the host performs in apparent synchrony with the guest (though lagging the guest in an absolute temporal sense), and the apparently synchronous vocals are captured and mixed with the guest's contribution for broadcast or dissemination.
The result is an apparently live interactive performance, at least from the perspective of the host and the audience of listeners and/or viewers of the broadcast performance. While the non-negligible network communication latency from guest to host is masked, latency in the host-to-guest direction simply exists and is tolerated. The host-to-guest latency, while discernible (and possibly quite noticeable) to the guest, need not be apparent in the apparently live broadcast or other dissemination. It has been found that a delayed audible rendering of host vocals (or, more generally, of the host's captured audiovisual performance) need not psychoacoustically interfere with the guest's performance.
Performance-synchronized video can be captured and included in the combined audiovisual performance that constitutes the apparent live broadcast, and visuals can be based, at least in part, on time-varying, computationally defined audio features extracted from (or computed over) the captured vocal audio. In some cases or embodiments, these computationally defined audio features are selective, in the coordinated audiovisual mix, for one or more particular contributing vocalists' synchronized videos (or their prominence).
Optionally, and in some cases or embodiments, the vocal audio may be pitch-corrected in real time at the guest performer's device (or more generally, at a device such as a mobile phone, personal digital assistant, laptop computer, notebook computer, tablet computer, or netbook, or on a content or media application server) in accord with pitch correction settings. In some cases, the pitch correction settings code a particular key or scale for the vocal performance or for portions thereof. In some cases, the pitch correction settings include a score-coded melody and/or harmony sequence supplied with, or for association with, the lyrics and backing tracks. Harmony notes or chords may be coded as explicit targets, or relative to the score-coded melody, or even relative to actual pitches sounded by a vocalist, if desired.
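By way of illustration, score-coded pitch correction of the kind described above can be sketched as snapping each detected pitch to the nearest target in a coded key or scale, with harmony targets coded as offsets from the melody. The sketch below is a minimal, hypothetical rendering; the MIDI-number representation and function names are assumptions of this illustration, not the implementation described herein.

```python
# Hypothetical sketch: snap detected vocal pitches to score-coded targets.
# The scale/melody representation (MIDI note numbers) is an assumption.

def nearest_target(detected_midi: float, targets: list[int]) -> int:
    """Return the score-coded target note closest to the detected pitch."""
    return min(targets, key=lambda note: abs(note - detected_midi))

def correct_pitch(detected_midi: float, scale_degrees: set[int],
                  harmony_offset: int = 0) -> int:
    """Snap a detected pitch to the current key/scale, then apply any
    score-coded harmony offset (coded relative to the melody)."""
    octave, _ = divmod(round(detected_midi), 12)
    candidates = [12 * octave + d for d in scale_degrees]
    candidates += [12 * (octave + 1) + d for d in scale_degrees]
    return nearest_target(detected_midi, candidates) + harmony_offset

# Example: C-major scale, detected pitch slightly flat of E4 (MIDI 64).
C_MAJOR = {0, 2, 4, 5, 7, 9, 11}
print(correct_pitch(63.6, C_MAJOR))       # -> 64 (snapped to E4)
print(correct_pitch(63.6, C_MAJOR, -4))   # -> 60 (harmony a major third below)
```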
Using uploaded vocals captured at guest performer devices such as the aforementioned portable computing devices, a content server or service for the host can further mediate coordinated performances by manipulating and mixing the uploaded audiovisual content of multiple contributing vocalists for further broadcast or other dissemination. Depending on the goals and implementation of a particular system, uploads may include, in addition to video content, pitch-corrected vocal performances (with or without harmonies), dry (i.e., uncorrected) vocals, and/or control tracks of user key and/or pitch correction selections, etc.
Synthesized harmonies and/or additional vocals (e.g., vocals captured from another vocalist at still another location, and optionally pitch-shifted to harmonize with other vocals) may also be included in the mix. Geocoding of captured vocal performances (or of individual contributions to a combined performance) and/or audience feedback may facilitate animations or display artifacts in ways suggestive of a performance or appreciation emanating from a particular geographic locale on a user-manipulable globe. In this way, implementations of the described functionality can transform otherwise mundane mobile devices and living room or entertainment systems into social instruments that foster a unique sense of global connectivity, collaboration, and community.
Live stream collaboration
In some embodiments according to the invention, a collaboration method for live-stream broadcast of a coordinated audiovisual work of first and second performers captured at respective geographically distributed first and second devices includes: (1) receiving, at the second device, a media encoding of an audiovisual performance mixed with a backing audio track, the media encoding including (i) vocal audio captured at the first device from a first one of the performers and (ii) video synchronized in performance with the captured first-performer vocals; (2) at the second device, audibly rendering the received mixed audio performance and capturing, against it, vocal audio from a second one of the performers; (3) mixing the captured second-performer vocal audio with the received mixed audio performance to provide a broadcast audiovisual mix that includes the captured first- and second-performer vocal audio and the backing audio track without apparent temporal lag therebetween; and (4) supplying the audiovisual broadcast mix to a service platform configured to live-stream it to a plurality of audience devices constituting the audience.
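Expressed as code, steps (1)-(4) above might take the following shape at the second (host) device. The sketch is hypothetical: Frame, capture_host_vocals, and push_to_livestream are illustrative stand-ins, not an actual API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Frame:
    """One buffer of samples positioned on the shared backing-track timeline."""
    t: float                  # timeline position in seconds
    samples: List[float]

def mix(a: Frame, b: Frame) -> Frame:
    # Both frames sit at the same backing-track time, so summing them yields
    # a broadcast mix with no apparent lag between the two vocal parts.
    return Frame(a.t, [x + y for x, y in zip(a.samples, b.samples)])

def host_broadcast(guest_mix_stream: Iterable[Frame],
                   capture_host_vocals: Callable[[Frame], Frame],
                   push_to_livestream: Callable[[Frame], None]) -> None:
    for guest_frame in guest_mix_stream:                   # (1) receive guest mix
        host_frame = capture_host_vocals(guest_frame)      # (2) render & capture against it
        push_to_livestream(mix(guest_frame, host_frame))   # (3) mix and (4) supply

# Toy usage with synthetic frames standing in for real capture/network I/O.
frames = [Frame(t=i * 0.02, samples=[0.1] * 4) for i in range(3)]
host_broadcast(frames,
               capture_host_vocals=lambda f: Frame(f.t, [0.05] * len(f.samples)),
               push_to_livestream=lambda f: print(f"push frame @ {f.t:.2f}s"))
```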
In some cases or embodiments, the performance-synchronized video included in the received media encoding is captured in connection with the vocals captured at the first device, and the method further includes capturing, at the second device, video synchronized in performance with the captured second-performer vocals, the audiovisual broadcast mix being an audiovisual mix of captured audio and video of at least the first and second performers.
In some embodiments, the method further includes: capturing, at the second device, second-performer video synchronized in performance with the captured second-performer vocals; and compositing the second-performer video with the first-performer video in the supplied audiovisual broadcast mix. In some cases or embodiments, for at least some portions of the supplied audiovisual broadcast mix, the compositing of first- and second-performer video includes computational blurring of image frames of the first- and second-performer video at a visual boundary therebetween.
In some embodiments, the method further includes dynamically varying the relative visual prominence of one or the other of the first and second performers during the audiovisual broadcast mix. In some cases or embodiments, the dynamic variation coincides, at least in part, with coding of time-varying vocal parts in a vocal score corresponding to, and temporally synchronized with, the backing audio track. In some cases or embodiments, the dynamic variation is based, at least in part, on evaluation of computationally defined audio features of either or both of the first-performer vocals and the second-performer vocals.
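One plausible realization of such computationally defined audio features is a smoothed loudness comparison. The sketch below, in which all names and the RMS-window choice are assumptions, weights each performer's relative visual prominence by recent vocal energy.

```python
import math
from collections import deque

class ProminenceTracker:
    """Assign relative visual prominence from recent vocal energy.
    A smoothed RMS comparison is one assumed realization of the
    'computationally defined audio features' described above."""

    def __init__(self, window_frames: int = 20):
        self.energy = {"host": deque(maxlen=window_frames),
                       "guest": deque(maxlen=window_frames)}

    def update(self, who: str, samples: list[float]) -> None:
        rms = math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))
        self.energy[who].append(rms)

    def prominence(self) -> dict[str, float]:
        avg = {w: (sum(e) / len(e) if e else 0.0) for w, e in self.energy.items()}
        total = sum(avg.values()) or 1.0
        return {w: v / total for w, v in avg.items()}  # fractions summing to 1

tracker = ProminenceTracker()
tracker.update("host", [0.2, -0.3, 0.25])
tracker.update("guest", [0.05, -0.04, 0.06])
print(tracker.prominence())   # host gets the larger share of the frame
```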
In some cases or embodiments, the first device is associated with the second device as a current live-stream guest, and the second device operates as a current live-stream host that controls association and dissociation of particular devices from the audience as the current live-stream guest. In some cases or embodiments, the current live-stream host selects from a queue of requests from audience members to be associated as the current live-stream guest. In some cases or embodiments, the first device operates in a live-stream guest role and the second device operates in a live-stream host role, and the method further includes one or both of: the second device releasing the live-stream host role for assumption by another device; and the second device passing the live-stream host role to a particular device selected from a set that includes the first device and the audience.
In some embodiments, the method further includes: accessing a machine-readable encoding of musical structure that includes at least musical section boundaries coded for temporal alignment with the vocal audio captured at the first and second devices; and applying a first visual effects schedule to at least a portion of the audiovisual broadcast mix, wherein the applied visual effects schedule codes differing visual effects for different coded musical structure elements of the first audiovisual performance and provides visual effect transitions temporally aligned with at least some of the coded musical section boundaries.
In some cases or embodiments, the differing visual effects coded by the applied visual effects schedule include, for a given element thereof, one or more of: particle-based effects or lens flare; transitions between different source videos; animation or motion of frames within a source video; vector graphics or images of patterns or textures; and color, saturation, or contrast. In some cases or embodiments, the associated musical structure codes different types of musical sections, and the applied visual effects schedule defines differing visual effects for different ones of the coded musical sections. In some cases or embodiments, the associated musical structure codes events or transitions, and the applied visual effects schedule defines differing visual effects for different ones of the coded events or transitions. In some cases or embodiments, the associated musical structure codes a group part, and the applied visual effects schedule is temporally selective for particular performance-synchronized video coincident with the coded musical structure.
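A visual effects schedule keyed to a machine-readable musical structure might be represented as sketched below; the section encoding, effect names, and boundary handling are illustrative assumptions only.

```python
# Hypothetical machine-readable musical structure and effects schedule.
# Section boundaries are seconds on the backing-track timeline.
music_structure = [
    (0.0,  "intro"), (8.0, "verse"), (24.0, "chorus"),
    (40.0, "verse"), (56.0, "chorus"), (72.0, "outro"),
]

# Differing visual effects for different section types, with a
# transition applied at each coded section boundary.
effects_schedule = {
    "intro":  {"effect": "lens_flare",      "transition": "fade"},
    "verse":  {"effect": "soft_saturation", "transition": "cut"},
    "chorus": {"effect": "particle_burst",  "transition": "crossfade"},
    "outro":  {"effect": "desaturate",      "transition": "fade"},
}

def effect_at(t: float) -> dict:
    """Return the scheduled effect for timeline position t (seconds)."""
    current = music_structure[0][1]
    for start, section in music_structure:
        if t >= start:
            current = section
    return effects_schedule[current]

print(effect_at(30.0))   # chorus -> particle burst, crossfade at boundary
```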
In some embodiments, the method is performed at least in part on a handheld mobile device communicatively coupled to a content server or service platform. In some embodiments, the method is at least partially embodied as a computer program product code of instructions executable on a second device that is part of a collaboration system that includes a content server or service platform to which a plurality of geographically distributed, network-connected vocal capture devices (including the second device) are communicatively coupled.
In some embodiments according to the invention, a system for dissemination of an apparently live broadcast of a combined performance of geographically distributed first and second performers includes first and second devices coupled by a communication network with non-negligible peer-to-peer latency for transmission of audiovisual content. The first device is communicatively coupled to supply to the second device an audiovisual performance mixed with a backing audio track, the audiovisual performance including (1) vocal audio of the first performer captured against the backing audio track and (2) video synchronized in performance with the vocal audio. The second device is communicatively configured to receive the media encoding of the mixed audiovisual performance and to audibly render at least the audio portion thereof, to capture vocal audio of the second performer against it, and to mix the captured second-performer vocal audio with the received mixed audiovisual performance for transmission as the apparently live broadcast.
In some cases or embodiments, the second device is further configured to capture second-performer video synchronized in performance with the captured second-performer vocals and to composite the second-performer video with the first-performer video in the supplied audiovisual broadcast mix. In some cases or embodiments, for at least some portions of the supplied audiovisual broadcast mix, the compositing of first- and second-performer video includes computational blurring of image frames of the first- and second-performer video at a visual boundary therebetween.
In some cases or embodiments, the compositing of the first- and second-performer video includes dynamically varying the relative visual prominence of one or the other of the first and second performers during the audiovisual broadcast mix. In some cases or embodiments, the dynamic variation coincides, at least in part, with coding of time-varying vocal parts in a vocal score corresponding to, and temporally synchronized with, the backing audio track. In some cases or embodiments, the dynamic variation is based, at least in part, on evaluation of computationally defined audio features of either or both of the first-performer vocals and the second-performer vocals.
In some cases or embodiments, the first device is associated with the second device as a current live-stream guest, and the second device operates as a current live-stream host that controls association and dissociation of particular devices from the audience as the current live-stream guest. In some cases or embodiments, the current live-stream host selects from a queue of requests from audience members to be associated as the current live-stream guest.
In some embodiments, the system further includes a video compositor that accesses a machine-readable encoding of musical structure, including at least musical section boundaries coded for temporal alignment with the vocal audio captured at the first and second devices, and applies a first visual effects schedule to at least a portion of the audiovisual broadcast mix, wherein the applied visual effects schedule codes differing visual effects for different coded musical structure elements of the first audiovisual performance and provides visual effect transitions temporally aligned with at least some of the coded musical section boundaries. In some cases or embodiments, the video compositor is hosted on the second device or on a content server or service platform from which the apparently live performance is served.
In some cases or embodiments, as part of a user interface visual presented on either or both of the first and second devices, differing vocal part selections are presented for each performer for a current song selection and are updated in response to, and in correspondence with, gestures made by either or both of the first and second performers at the respective geographically distributed devices, wherein the assignment of a particular vocal part selection to the respective first or second performer remains changeable until one or the other of the first and second performers makes a start-vocal-capture gesture at the respective device, whereupon the then-current assignments of particular vocal part selections to the respective first and second performers are fixed for the capture duration of the coordinated multi-vocal performance.
Selfie chat mechanism for social media:
In some embodiments according to the invention, a user interface method for social media includes: (1) presenting, as part of a user interface visual on a touchscreen display of a client device, live video captured using a camera of the client device; (2) responsive to a first touchscreen gesture by a user of the client device, initiating capture of a snippet of the live video and presenting, as part of the user interface visual, a progress indication corresponding to the contemporaneous capture of the snippet; and (3) responsive to a second touchscreen gesture by the user of the client device, sending the captured snippet to a network-coupled service platform as a post in a multi-user social media thread.
In some cases or embodiments, the method further includes presenting the multi-user social media thread on the touchscreen display, the presented thread including the captured snippet together with temporally ordered posted content received from other users via the network-coupled service platform, wherein the posted content from at least one other user includes one or more of text, video, and captured snippets from the at least one other user.
In some cases or embodiments, the captured snippet is a fixed-length snippet, and the method further includes visually updating the progress indication in correspondence with the portion of the fixed-length snippet captured so far. In some cases or embodiments, the first touchscreen gesture is a contact maintained by the user with a first visually presented feature on the touchscreen display, and the second touchscreen gesture includes release of the maintained contact. In some cases or embodiments, the first touchscreen gesture is a first tap-type contact by the user with a first visually presented feature on the touchscreen display, and the second touchscreen gesture is a second tap-type contact on the touchscreen display immediately following the first tap-type contact. In some cases or embodiments, the method further includes presenting the multi-user social media thread on the touchscreen display in mixed correspondence with a live-streamed audiovisual broadcast.
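The hold-to-capture, release-to-post gesture handling described above might be modeled as a small state machine; in the following sketch, the event names and the fixed snippet length are assumptions for illustration.

```python
class SnippetCapture:
    """Hold-to-capture, release-to-post gesture handling (illustrative).
    A fixed maximum snippet length drives the progress indication."""

    MAX_SECONDS = 10.0   # assumed fixed snippet length

    def __init__(self, post_to_thread):
        self.post_to_thread = post_to_thread
        self.elapsed = 0.0
        self.recording = False

    def on_touch_down(self):              # first gesture: begin capture
        self.recording, self.elapsed = True, 0.0

    def on_frame(self, dt: float) -> float:
        """Advance capture; returns progress in [0, 1] for the UI indicator."""
        if self.recording:
            self.elapsed = min(self.elapsed + dt, self.MAX_SECONDS)
        return self.elapsed / self.MAX_SECONDS

    def on_touch_up(self):                # second gesture: post the snippet
        if self.recording:
            self.recording = False
            self.post_to_thread(duration=self.elapsed)

cap = SnippetCapture(post_to_thread=lambda duration: print(f"posted {duration:.1f}s clip"))
cap.on_touch_down()
for _ in range(90):                       # ~3 s of 30 fps UI frames
    progress = cap.on_frame(1 / 30)
cap.on_touch_up()                         # -> posted 3.0s clip
```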
Audiovisual collaboration with user part arbitration:
In some embodiments according to the invention, a method for capturing at least a portion of a coordinated vocal performance of first and second performers at respective geographically distributed first and second devices includes: (1) presenting, as part of a user interface visual on either or both of the first and second devices, a differing vocal part selection for each performer for a current song; and (2) updating vocal part selections in response to, and in correspondence with, gestures made by either or both of the first and second performers at the respective geographically distributed devices, wherein the assignment of a particular vocal part selection to the respective first or second performer remains changeable until one or the other of the first and second performers makes a start-vocal-capture gesture at the respective device, whereupon the then-current assignments of particular vocal part selections to the respective first and second performers are fixed for the capture duration of the coordinated multi-vocal performance.
In some cases or embodiments, the method further includes: updating the vocal part selections at the second device in correspondence with a gesture selection conveyed from the first device; and supplying, to the first device, an update of the vocal part selections corresponding to a gesture selection at the second device. In some cases or embodiments, the method further includes changing the current song selection and, in correspondence therewith, updating the user interface visuals on either or both of the first and second devices. In some cases or embodiments, the change in current song selection is triggered by one or the other of the first and second performers on a respective one of the first and second devices.
In some cases or embodiments, the method further comprises: a change in current song selection is triggered based on a periodic or recurring event. In some cases or embodiments, the change in current song selection is selected from a library of song selections based on one or more of the coded interests and performance history of either or both of the first performer and the second performer.
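The part-selection arbitration described above, in which assignments remain fluid until a start-vocal-capture gesture fixes them, might be sketched as follows (class and method names are assumed for illustration):

```python
class PartArbiter:
    """Vocal part assignments stay changeable until capture starts."""

    def __init__(self, parts=("Part A", "Part B")):
        self.assignment = {"performer1": parts[0], "performer2": parts[1]}
        self.locked = False

    def select_part(self, performer: str, part: str) -> bool:
        """A gesture from either device updates the shared assignment,
        unless capture has already begun."""
        if self.locked:
            return False
        self.assignment[performer] = part
        return True

    def start_capture(self) -> dict:
        """The start-vocal-capture gesture fixes assignments for the
        duration of the coordinated multi-vocal performance."""
        self.locked = True
        return dict(self.assignment)

arbiter = PartArbiter()
arbiter.select_part("performer2", "Part A")          # swap requested via gesture
fixed = arbiter.start_capture()
print(arbiter.select_part("performer1", "Part B"))   # False: assignments locked
```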
In some cases or embodiments, from the start of vocal capture, the method further includes: receiving, at the second device, a media encoding of a mixed audio performance that (i) includes vocal audio captured at the first device from a first one of the performers and (ii) is mixed with a backing audio track for the current song selection; at the second device, audibly rendering the received mixed audio performance and capturing, against it, vocal audio from a second one of the performers; and mixing the captured second-performer vocal audio with the received mixed audio performance to provide a broadcast mix that includes the captured vocal audio of the first and second performers and the backing audio track without apparent temporal lag therebetween.
In some cases or embodiments, the method further includes visually presenting, at the second device and in correspondence with the audible rendering, lyrics and score-coded note targets for the current song selection, wherein the visually presented lyrics and note targets correspond to the vocal part assignments as of the start of vocal capture. In some cases or embodiments, the received media encoding includes video synchronized in performance with the captured first-performer vocals, and the method further includes capturing, at the host device, video synchronized in performance with the captured second-performer vocals, the broadcast mix being an audiovisual mix of captured audio and video of at least the first and second performers.
In some cases or embodiments, the method further includes: capturing, at the host device, second-performer video synchronized in performance with the captured second-performer vocals; and compositing the second-performer video with the first-performer video in the supplied audiovisual broadcast mix. In some cases or embodiments, the method further includes supplying the broadcast mix to a service platform configured to live-stream it to a plurality of audience devices constituting the audience.
Drawings
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer generally to similar elements or features.
Fig. 1 depicts information flows amongst illustrative mobile phone-type portable computing devices in a host and guest configuration for live streaming a duet-type group audiovisual performance in accordance with some embodiments of the present invention.
Fig. 2 is a flow diagram depicting the flow of audio signals captured and processed at respective guest and host devices coupled in a "host synchronous" peer-to-peer configuration for generation of a live stream of a group audiovisual performance in accordance with some embodiments of the present invention.
Fig. 3 is a flow diagram depicting the flow of audio signals captured and processed at respective guest and host devices coupled in a "shared latency" peer-to-peer configuration for generation of a live stream of a group audiovisual performance in accordance with some embodiments of the present invention.
Fig. 4 is a flow diagram illustrating optional real-time continuous pitch correction and harmony generation signal flows that may be performed based on score-coded pitch correction settings for an audiovisual performance captured at a guest or host device in accordance with some embodiments of the present invention.
Fig. 5 is a functional block diagram of hardware and software components executable at an illustrative mobile phone-type portable computing device to facilitate processing and communication of captured audiovisual performances for use in a multi-singer live streaming configuration of a network-connected device in accordance with some embodiments of the present invention.
Fig. 6 illustrates features of a mobile device that may serve as a platform for performing software implementations of at least some audiovisual performance capturing and/or live-streaming devices according to some embodiments of the invention.
Figs. 7A and 7B illustrate visual presentations of live-stream content in which images of first and second performers are composited at a host device in accordance with some embodiments of the present invention.
Figs. 8A and 8B illustrate a selfie chat interaction mechanism in which a capture viewport is presented on screen and a user interaction mechanism is supported whereby a user holds a touchscreen-presented button or other feature to capture a video clip and releases it to post the video clip in a social media interaction, in accordance with some embodiments of the present invention.
Figs. 9A, 9B, and 9C illustrate a user part selection and coordination mechanism in which song wheel and/or gesture selections made by user performers on geographically distributed devices provide corresponding song and/or part selections for a live-streamed performance on a peer device, in accordance with some embodiments of the present invention.
FIG. 10 is a network diagram illustrating cooperation of exemplary devices according to some embodiments of the invention.
Skilled artisans will appreciate that elements or features in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions or prominence of some of the illustrated elements or features may be exaggerated relative to other elements or features in an effort to improve understanding of embodiments of the present invention. Likewise, where a single flow is illustrated for simplicity or to avoid complexity that might otherwise obscure description of the inventive concepts, a multiplicity of data and control flows (including constituent signals or encodings) will be understood to be consistent with the description.
Detailed Description
Techniques have been developed to facilitate the live streaming of group audiovisual performances. Audiovisual performances including vocal music are captured and coordinated with the performances of other users in ways that can create compelling user and listener experiences. For example, in some cases or embodiments, duets with a host performer may be supported in a sing-with-the-artist style audiovisual live stream in which aspiring vocalists request or queue particular songs for a live broadcast entertainment format. The developed techniques provide a communications-latency-tolerant mechanism for synchronizing, in performance, vocals captured at geographically separated devices (e.g., at globally distributed but network-connected mobile phones or tablets, or at audiovisual capture devices geographically separated from a live studio).
Although audio-only embodiments are certainly contemplated, it is envisioned that live-stream content will typically include video captured in synchrony with the vocal performance. In addition, while network-connected mobile phones are illustrated as audiovisual capture devices, it will be appreciated based on the description herein that audiovisual capture and viewing devices may include suitably configured computers, smart TVs and/or living room-style set-top box configurations, and even intelligent virtual assistant devices with audio and/or audiovisual capture capabilities. Finally, while applications to vocal music are described in detail, it will be appreciated based on the description herein that audio or audiovisual capture applications need not be limited to vocal duets, but may be adapted to other forms of group performance in which one or more successive performances are accreted to a prior performance to produce a live stream.
In some cases, vocal performances (and performance-synchronized video) of collaborating contributors are captured in the context of a karaoke-style presentation of lyrics and in correspondence with an audible rendering of a backing track. In some cases, vocals (and, typically, synchronized video) are captured as part of a live or improvisational performance involving vocal interactions (e.g., duets or dialogues) between the collaborating contributors. In each case, it is envisioned that non-negligible network communication latencies will exist between at least some of the collaborating contributors, particularly where those contributors are geographically separated. As a result, a technical challenge exists in managing latencies and the captured audiovisual content such that a combined audiovisual performance can nonetheless be disseminated (e.g., broadcast) as a live interactive collaborative presentation to audiences, listeners, and/or viewers.
In one technique for achieving such a facsimile of live interactive performance collaboration, actual and non-negligible network communication latency is (in effect) masked in one direction between a guest and a host performer and tolerated in the other direction. For example, an audiovisual performance captured from a guest performer on a "live show" internet broadcast of a host performer can include a guest + host duet sung in apparent real-time synchrony. In some cases, the host may be a performer who has popularized a particular musical performance. In some cases, the guest may be an amateur vocalist given the opportunity to sing "live" (though remote) with the popular artist or group that is actually "in the studio" as (or with) the show's host. Notwithstanding the non-negligible network communication delay from guest to host (which may be 200 ms or more) for conveyance of the guest's audiovisual contribution stream, the host performs in apparent synchrony with the guest (though lagging the guest in an absolute temporal sense), and the apparently synchronous vocals are captured and mixed with the guest's contribution for broadcast or dissemination.
The result is an apparently live interactive performance, at least from the perspective of the host and the audience of listeners and/or viewers of the broadcast performance. While the non-negligible network communication latency from guest to host is masked, latency in the host-to-guest direction simply exists and is tolerated. The host-to-guest latency, while discernible (and possibly quite noticeable) to the guest, need not be apparent in the apparently live broadcast or other dissemination. It has been found that a delayed audible rendering of host vocals (or, more generally, of the host's captured audiovisual performance) need not psychoacoustically interfere with the guest's performance.
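The masked-versus-tolerated latency bookkeeping can be made concrete with simple arithmetic; the millisecond values below are assumptions for illustration only.

```python
# Illustrative latency bookkeeping (all values assumed, in milliseconds).
guest_to_host = 250      # masked: host sings against the delayed guest mix
host_to_guest = 250      # tolerated: guest hears host vocals late

# The guest sings at backing-track time t; the host hears that audio
# guest_to_host ms later but sings in apparent synchrony with it, so the
# broadcast mix carries no apparent lag between the two vocal parts.
apparent_lag_in_broadcast = 0

# The guest, however, hears the host's answering vocals only after a
# full audio round trip:
round_trip = guest_to_host + host_to_guest
print(f"guest hears host vocals {round_trip} ms late; "
      f"broadcast lag: {apparent_lag_in_broadcast} ms")
```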
Although, for purposes of illustration, much of the description herein assumes a fixed host performer on a particular host device, it will be appreciated based on the description herein that some embodiments in accordance with the present invention may provide host/guest control logic that allows a host to "pass the mic," such that a new user (in some cases a user selected by the current host, and in other cases a user who "picks up the mic" after the current host "drops the mic") takes over as host. Likewise, it will be appreciated based on the description herein that some embodiments in accordance with the present invention may provide host/guest control logic that queues guests (and/or aspiring hosts) and automatically assigns queued users to appropriate roles.
In some cases or embodiments, vocal audio of individual host- and guest-role performers is captured, together with performance-synchronized video, in a karaoke-style user interface framework and coordinated with the audiovisual contributions of other users to form duet-style or glee club-style group audiovisual performances. For example, the vocal performances of individual users may be captured (together with performance-synchronized video) on mobile devices, television-type displays, and/or set-top box equipment in the context of karaoke-style presentations of lyrics in correspondence with audible renderings of a backing track. In some cases or embodiments, score-coded continuous pitch correction may be provided, as well as user-selectable audio and/or video effects. Consistent with the foregoing, but without limitation as to any particular claimed embodiment, karaoke-style vocal performance capture using portable handheld devices provides an illustrative context.
Karaoke-style vocal performance capture
Although embodiments of the present invention are not so limited, pitch-corrected, karaoke-style vocal capture using mobile phone-type and/or television-type audiovisual equipment provides a useful descriptive context. For example, in some embodiments such as that illustrated in FIG. 1, iPhone™ handheld devices available from Apple Inc. (or more generally, handheld devices 101A, 101B operating as guest and host devices, respectively) execute software that cooperates with content server 110 to provide vocal capture. The configuration optionally provides continuous real-time, score-coded pitch correction and harmonization of the captured vocals. Performance-synchronized video may also be captured using a camera provided by, or in connection with, a computer, a television or other audiovisual equipment (not specifically shown), or a connected set-top box such as an Apple TV™ device. In some embodiments, performance-synchronized video may be captured using an on-board camera provided by a handheld device paired with the connected set-top box equipment.
In the illustration of FIG. 1, the current host user of current host device 101B at least partially controls the content of live stream 122, which is buffered and streamed to an audience on devices 120A, 120B … 120N. In the illustrated configuration, the current guest user of current guest device 101A contributes to the group audiovisual performance mix 111 that is supplied (eventually via content server 110) by current host device 101B as live stream 122. Although devices 120A, 120B … 120N, and indeed the current guest and host devices 101A, 101B, are illustrated as handheld devices such as mobile phones, persons of skill in the art having benefit of the present disclosure will appreciate that any given audience member may receive live stream 122 on any suitable computer, smart television, or tablet device, via a set-top box, or using other streaming media-capable clients.
In the illustrated configuration, the content that is mixed to form group audiovisual performance mix 111 is captured in the context of karaoke-style performance capture, wherein lyrics 102, optional pitch cues 105, and, typically, a backing track 107 are supplied from content server 110 to either or both of current guest device 101A and current host device 101B. The current host (on current host device 101B) typically exercises ultimate control over the live stream, e.g., by selecting a particular user (or users) from the audience to act as the current guest(s), by selecting a particular song from a request queue (and/or vocal parts thereof for particular users), and by starting, stopping, or pausing the group AV performance. Once the current host selects or approves a guest and/or song, the guest user may (in some embodiments) start/stop/pause the roll of backing track 107A for local audible rendering and otherwise control the content of guest mix 106 (the backing track roll mixed with captured guest audiovisual content) supplied to current host device 101B. Roll of lyrics 102A and optional pitch cues 105A at current guest device 101A corresponds temporally to backing track 107A and is likewise subject to start/stop/pause control by the current guest. In some cases or situations, backing audio and/or video may be rendered from a media store, such as an iTunes™ library, resident on or accessible from the handheld device, a set-top box, etc.
In general, song requests 132 are audience-sourced and conveyed via a signaling path to the content selection and guest queue control logic 112 of content server 110. Host controls 131 and guest controls 133 are illustrated as bidirectional signaling paths. Other queuing and control logic configurations consistent with the operations described, including host- or guest-controlled queueing and/or song selection, will be appreciated based on the present disclosure.
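Content selection and guest queue control logic (112) of the kind described might be sketched as follows; the operations and names are illustrative assumptions rather than an actual server implementation.

```python
from collections import deque

class GuestQueue:
    """Hypothetical sketch of guest-queue / 'pass the mic' control logic."""

    def __init__(self, host: str):
        self.host = host
        self.current_guest = None
        self.requests = deque()            # audience-sourced song requests

    def request(self, viewer: str, song: str) -> None:
        self.requests.append((viewer, song))

    def admit_next_guest(self):
        """Host control: associate the next queued viewer as live guest."""
        if self.requests:
            self.current_guest, song = self.requests.popleft()
            return self.current_guest, song

    def pass_the_mic(self) -> None:
        """Host drops the mic; the current guest takes over hosting."""
        if self.current_guest:
            self.host, self.current_guest = self.current_guest, None

q = GuestQueue(host="hostA")
q.request("viewer7", "When I Was Your Man")
print(q.admit_next_guest())    # ('viewer7', 'When I Was Your Man')
q.pass_the_mic()
print(q.host)                  # viewer7 now hosts
```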
In the illustrated configuration of FIG. 1, and notwithstanding a non-negligible time lag (typically 100-250 ms, but possibly more) in the peer-to-peer communication channel, guest mix 106 supplies current host device 101B with the audio timeline against which host vocals are captured. Roll of lyrics 102B and optional pitch cues 105B at current host device 101B corresponds temporally to the backing track (here, guest mix 106). To facilitate synchronization with guest mix 106, and to accommodate guest-side start/stop/pause control given the time lag in the peer-to-peer communication channel between current guest device 101A and current host device 101B, marker beacons may be encoded in the guest mix to provide appropriate phasing of the on-screen lyrics 102B and optional pitch cues 105B. Alternatively, phase analysis of any backing track 107A included in guest mix 106 (or of any bleed-through, where the backing track is separately encoded or conveyed) may be used to provide the appropriate phasing of on-screen lyrics 102B and optional pitch cues 105B at current host device 101B.
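Marker beacons of the kind described might carry backing-track timestamps within the guest mix; the following sketch of host-side lyric phasing assumes such a beacon format purely for illustration.

```python
# Hypothetical host-side lyric phasing from marker beacons embedded in
# the guest mix; the beacon format (track-time stamps) is assumed.

def lyric_phase(beacon_track_time: float, seconds_since_beacon: float,
                lyric_times: list[float]) -> int:
    """Return the index of the lyric line to highlight, derived from the
    backing-track time coded in the most recent beacon."""
    now = beacon_track_time + seconds_since_beacon
    index = 0
    for i, t in enumerate(lyric_times):
        if now >= t:
            index = i
    return index

lyric_times = [0.0, 4.5, 9.0, 13.5]       # line start times on the track
print(lyric_phase(beacon_track_time=8.8, seconds_since_beacon=0.4,
                  lyric_times=lyric_times))   # -> 2 (third line)
```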
It will be appreciated that the time lag in the peer-to-peer communication channel between current guest device 101A and current host device 101B affects both guest mix 106 and communications in the opposite direction (e.g., the host microphone 103C signal encoding). Any of a variety of communication channels may be used to convey audiovisual signals and controls between current guest device 101A and current host device 101B, between the guest and host devices 101A, 101B and content server 110, and between audience devices 120A, 120B … 120N and content server 110. For example, respective telecommunications carrier wireless facilities and/or wireless local area networks and respective wide-area network gateways (not specifically shown) may provide communications to and from devices 101A, 101B, 120A, 120B … 120N. Based on the description herein, persons of skill in the art will recognize that any of a variety of data communications facilities, including 802.11 Wi-Fi, Bluetooth™, 4G-LTE, 5G, or other communications, wireless or wired data networks, and wired or wireless audiovisual interconnects such as those conformant with HDMI, AVI, or Wi-Di standards or facilities, may be employed, individually or in combination, to facilitate the communications and/or audiovisual rendering described herein.
User vocals 103A and 103B are captured at the respective handheld devices 101A, 101B and may, optionally, be continuously pitch-corrected in real time and audibly rendered mixed with the locally appropriate backing track (e.g., backing track 107A at current guest device 101A and guest mix 106 at current host device 101B) to provide the user with an improved tonal rendering of his/her own vocal performance. Pitch correction is typically based on score-coded note sets or cues (e.g., the pitch and harmony cues 105A, 105B visually displayed at current guest device 101A and at current host device 101B, respectively), which provide continuous pitch-correction algorithms executing on the respective devices with performance-synchronized sequences of target notes in a current key or scale. In addition to performance-synchronized melody targets, score-coded harmony note sequences (or sets) provide the pitch-shifting algorithms with additional targets (typically coded as offsets relative to a lead melody note track, and typically scored only for selected portions thereof) for pitch shifting toward harmony versions of the user's own captured vocals. In some cases, the pitch correction settings may be characteristic of a particular artist, such as the artist who performed the vocals associated with a particular backing track.
In general, lyrics, melody, and harmony track note sets and related timing and control information may be encapsulated in an appropriate container or object (e.g., in a Musical Instrument Digital Interface (MIDI) or Java Script Object Notation (json)-type format) for supply together with the backing track(s). Using such information, devices 101A and 101B (as well as associated audiovisual displays and/or set-top box equipment, not specifically shown) may display lyrics, and even visual cues related to target notes, harmonies, and currently detected vocal pitch, in correspondence with an audible performance of the backing track(s), so as to facilitate a karaoke-style vocal performance by a user. Thus, if an aspiring vocalist selects "When I Was Your Man" as popularized by Bruno Mars, you_man.json and you_man.m4a may be downloaded from the content server (if not already available or cached based on a prior download) and, in turn, used to provide background music, synchronized lyrics, and, in some cases or embodiments, score-coded note tracks for continuous, real-time pitch correction while the user sings. Optionally, at least for certain embodiments or genres, harmony note tracks may be score-coded for harmony shifts of the captured vocals. Typically, a captured pitch-corrected (possibly harmonized) vocal performance, together with performance-synchronized video, is saved locally, on the handheld device or set-top box, as one or more audiovisual files and is subsequently compressed and encoded for communication (e.g., as guest mix 106 or group audiovisual performance mix 111, or constituent encodings thereof) to content server 110 as an MPEG-4 container file. MPEG-4 is a standard suitable for the coded representation and transmission of digital multimedia content for internet, mobile network, and advanced broadcast applications. Other suitable codecs, compression techniques, coding formats, and/or containers may be employed, if desired.
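A json-type score container such as the hypothetical you_man.json might carry lyrics, melody, and harmony tracks on the backing-track timeline. The field layout below is an assumed illustration only, not the actual file format.

```python
import json

# Assumed illustrative layout for a score container like you_man.json;
# times are seconds on the backing-track timeline, pitches MIDI notes.
score_json = """{
  "key": "C",
  "lyrics":  [{"t": 12.0, "text": "first lyric line"}],
  "melody":  [{"t": 12.0, "dur": 0.4, "pitch": 65}],
  "harmony": [{"t": 12.0, "dur": 0.4, "offset": -4}]
}"""

score = json.loads(score_json)
for note, harm in zip(score["melody"], score["harmony"]):
    target = note["pitch"]
    print(f"t={note['t']}s melody target {target}, "
          f"harmony target {target + harm['offset']}")
```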
As will be appreciated by those skilled in the art having the benefit of this disclosure, performances of multiple singers (including performance-synchronized video) may be accreted and combined, such as to form a duet-style performance, a glee-club-style chorus, or a vocal jam session. In some embodiments of the invention, social network constructs may at least partially supplant or inform moderator control over the pairing of geographically distributed singers and/or the formation of geographically distributed virtual glee clubs. For example, referring to fig. 1, individual singers may perform as current host and guest users in a manner that is captured (with performance-synchronized vocal audio and video) and ultimately streamed to an audience as a live stream 122. Such captured audiovisual content may, in turn, be distributed to the singers' social media contacts, members of the audience, etc., through an open call mediated by the content server. In this manner, the performers themselves, members of the audience (and/or the content server or service platform on their behalf) may invite others to join the coordinated audiovisual performance, or as members of an audience or guest queue.
Although the supply and use of accompaniment tracks is illustrated and described herein, it should be understood that captured, pitch-corrected (and possibly, though not necessarily, harmonized) vocals may themselves be mixed (as in the guest mix 106) to produce an "accompaniment track" used to motivate, guide or frame subsequent vocal capture. Furthermore, additional singers may be invited to sing a particular vocal part (e.g., tenor, part B in a duet, etc.) or simply to sing along, with vocals captured at subsequent devices (e.g., the current moderator device 101B in the configuration of fig. 1) pitch-shifted and placed into one or more positions within the duet or virtual glee club.
Synchronization methods
Based on the description herein, those skilled in the art will appreciate various moderator-guest synchronization methods that tolerate a non-negligible time lag in the peer-to-peer communication channel between the guest device 101A and the moderator device 101B. As illustrated in the context of fig. 1, an accompaniment track (e.g., accompaniment track 107A) may provide a synchronized timeline for temporally segmented vocal capture performed at respective peer devices (guest device 101A and moderator device 101B) and minimize (or eliminate) perceived latency for its users.
FIG. 2 is a flow diagram depicting audio signals captured and processed at respective guest and moderator devices coupled in a "host synchronous" peer-to-peer configuration for generating a live stream of a community audiovisual performance, in accordance with some embodiments of the invention. Although figs. 2 and 3 each emphasize, by way of teaching example, the provision of synchronized and time-aligned audio signal/data components and streams for an apparent live performance, persons of ordinary skill in the art having the benefit of this disclosure will appreciate that, although not explicitly shown in figs. 2 and 3, video performance-synchronized with the corresponding audio may be captured (as in fig. 1), and corresponding video signal/data components and streams may be communicated between guest and moderator devices in an analogous manner.
More specifically, fig. 2 illustrates an exemplary configuration of the guest device 101A and the moderator device 101B (recall fig. 1) and how the audiovisual signals streamed between them (e.g., the guest mix 106 and the moderator microphone audio 103C) during a peer-to-peer session provide a user experience in which the singer at the moderator device 101B consistently hears the guest vocals (captured from the guest microphone local input 103A) and the accompaniment track 107A in perfect synchronization. While the guest will perceive the moderator's vocals as delayed (in the mix supplied at the guest speaker or headset 240A) by the full audio round-trip-travel (RTT) delay, the audio stream supplied to the moderator device 101B, and on to the audience, presents the multi-vocal performance mix as a live stream (122) with zero (or negligible) apparent latency.
The key to masking the actual latency is the inclusion of the accompaniment track 107A in the audio mix supplied from the guest device 101A to the host device 101B. This audio stream ensures that, from the broadcaster's perspective (based on the audible rendering at the moderator speaker or headset 240B), the guest's vocals and the accompaniment track are always synchronized. If the network delay is significant, the guest may still perceive the broadcaster as singing slightly out of sync. However, as long as the guest focuses on singing along with the accompaniment track rather than with the slightly delayed vocals of the broadcaster, the broadcaster's vocals remain synchronized with the multi-vocal mix of guest vocals and accompaniment track as they are streamed live to the audience.
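A minimal sketch of this "host synchronous" mixing flow follows, assuming simple frame-based PCM helpers and NumPy; the function names are ours, for exposition, and not those of any actual implementation.

import numpy as np

def mix(*frames):
    # Sum time-aligned PCM frames; a real mixer would apply gain and limiting.
    return np.sum(frames, axis=0)

def guest_outbound(guest_vocal, backing_track):
    # The guest sends its vocal ALREADY mixed with accompaniment track 107A,
    # so whatever arrives at the host is internally synchronized regardless
    # of how long the network path takes.
    return mix(guest_vocal, backing_track)

def host_broadcast(received_guest_mix, host_vocal):
    # The host overlays its own vocal on the received mix; the result can be
    # streamed live (122) with no apparent lag between vocals and track.
    return mix(received_guest_mix, host_vocal)

# Toy usage: 10 ms frames at 48 kHz.
guest_vocal = np.zeros(480)
backing = np.zeros(480)
host_vocal = np.zeros(480)
live_frame = host_broadcast(guest_outbound(guest_vocal, backing), host_vocal)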
Fig. 3 is a flow diagram depicting audio signals captured and processed at respective guest and moderator devices coupled in an alternative "shared latency" peer-to-peer configuration for generating a live stream of a community audiovisual performance, in accordance with some embodiments of the present invention. More specifically, fig. 3 illustrates an exemplary configuration of the guest device 101A and the moderator device 101B (recall fig. 1) and how the audiovisual signals streamed between them (e.g., the guest mix 106 and the moderator microphone audio 103C) combine during a peer-to-peer session to limit each singer's perceived audio delay relative to the other singer to a one-way lag (nominally half the full audio round-trip-travel delay) behind the accompaniment track.
This limited perception of delay is achieved by playing the accompaniment tracks locally on both devices and striving to keep them synchronized in real time. The guest device 101A sends periodic timing messages to the host containing the current location in the song, and the host device 101B adjusts the playback position of the song accordingly.
We have experimented with two different approaches to keeping the accompaniment tracks synchronized on the two devices (guest device 101A and moderator device 101B):

Method 1: the playback position received at the host side is adjusted by the one-way network delay, approximated as RTT/2.

Method 2: the Network Time Protocol (NTP) is used to synchronize the clocks of the two devices. Rather than adjusting timing messages for one-way network delay, we simply add an NTP timestamp to each song timing message.

For the "shared latency" configuration, Method 2 has proven more stable than Method 1. As an optimization, to avoid excessive timing adjustments, the moderator updates the accompaniment track playback position only when it differs from the guest's accompaniment track playback position by more than 50 ms.
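The following sketch illustrates Method 2's timing-message handling under stated assumptions (an NTP clock offset already estimated on each device; helper names are ours, not those of an actual implementation).

import time

SYNC_THRESHOLD_S = 0.050   # only correct drift beyond the ~50 ms threshold

def guest_timing_message(song_pos_s, ntp_offset_s):
    # The guest stamps each message with NTP-corrected wall-clock time, so
    # no explicit RTT/2 one-way-delay estimate (Method 1) is needed.
    return {"song_pos": song_pos_s, "ntp_time": time.time() + ntp_offset_s}

def host_playback_position(msg, host_pos_s, ntp_offset_s):
    # Advance the reported guest position by the message's time in flight.
    in_flight_s = (time.time() + ntp_offset_s) - msg["ntp_time"]
    guest_pos_now = msg["song_pos"] + in_flight_s
    if abs(guest_pos_now - host_pos_s) > SYNC_THRESHOLD_S:
        return guest_pos_now    # jump host playback to match the guest
    return host_pos_s           # close enough: avoid excessive adjustment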
As noted herein, while video signal/data components may not be specifically illustrated in all of the figures, persons of ordinary skill in the art having the benefit of this disclosure will appreciate that performance-synchronized video may likewise be conveyed in audiovisual data encodings that include the more explicitly illustrated audio, and in streams analogous to those illustrated for the audio signal/data components. Just as the audio signals are captured, communicated and mixed, performance-synchronized video captured at the respective devices is composited in correspondence with the time-aligned audio that serves as the synchronization baseline for the distributed performers. Video compositing is typically performed at the moderator device, but in some cases may be performed using facilities of the content server or service platform (recall fig. 1). In some embodiments, a computer-readable encoding of musical structure may direct the compositing function, thereby affecting selection, visual arrangement and/or prominence of performance-synchronized video in the apparent live performance. In general, selection, visual placement and/or prominence may accord with a score-coded musical structure, such as group (or duet A/B) parts, musical sections, or melody/harmony positions of captured or pitch-shifted audio, and/or with computationally determined audio features of the vocal audio captured at the guest device, the host device, or both.
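As a sketch of how a score-coded structure might drive compositing prominence, consider the following; the section encoding and part labels are hypothetical, chosen only to illustrate the idea.

def prominent_performer(t_ms, sections):
    # sections: (start_ms, end_ms, part) spans from the musical structure,
    # where part "A"/"B" marks duet solo parts and "AB" marks group singing.
    for start_ms, end_ms, part in sections:
        if start_ms <= t_ms < end_ms:
            return {"A": "guest", "B": "host"}.get(part, "both")
    return "both"

# e.g. guest prominent in the first section, both performers in the second
print(prominent_performer(5000, [(0, 10000, "A"), (10000, 20000, "AB")]))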
Score-coded pitch tracks
Fig. 4 is a flow diagram illustrating real-time continuous score-coded pitch correction and harmony generation for a captured vocal performance, in accordance with some embodiments of the present invention. In the illustrated configuration, a user/singer (e.g., the guest or host singer at guest device 101A or moderator device 101B, recalling fig. 1) sings along with an accompaniment track, karaoke style. For the guest singer at the current guest device 101A, the operative accompaniment track is the accompaniment track 107A, whereas for the host singer at the current moderator device 101B, the operative accompaniment track is the guest mix 106, which conveys the original accompaniment track mixed with the guest vocals, at least in embodiments employing the "host synchronous" method. In either case, vocals captured (251) from the microphone input 201 may optionally be continuously pitch-corrected (252) and harmonized (255) in real time for mixing (253) with the operative accompaniment track audibly rendered at the one or more acoustic transducers 202.
Both the pitch correction and the added harmonies are chosen to correspond to a score 207, which, in the illustrated configuration, is wirelessly communicated (261), together with the lyrics 208 and an audio encoding of the operative accompaniment track 209 (e.g., the accompaniment track 107A or the guest mix 106), to the device on which vocal capture and pitch correction are to be performed (e.g., from the content server 110 to the guest device 101A, or to the moderator device 101B via the guest device 101A, recalling fig. 1). In some cases or embodiments, the content selection and guest queue control logic 112 may be selective for melody or harmony note selections at the respective guest device 101A and moderator device 101B.
In some embodiments of the techniques described herein, the note (in the current key or scale) closest to that sounded by the user/singer is determined based on the score 207. While this closest note may typically be the main pitch corresponding to the score-coded vocal melody, it need not be. Indeed, in some cases, the user/singer may intend to sing a harmony, and the notes sounded may more closely approximate a harmony track.
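In code, such nearest-target selection might look like the following sketch (MIDI note numbers; the helper names are ours, not those of an actual implementation).

def nearest_target_note(detected_midi, targets_midi):
    # targets_midi: score-coded notes active at this instant -- the melody
    # note plus any harmony notes scored for this span (recall score 207).
    return min(targets_midi, key=lambda n: abs(n - detected_midi))

def shift_ratio(detected_midi, targets_midi):
    # Resampling ratio that would move the sung pitch onto the target.
    target = nearest_target_note(detected_midi, targets_midi)
    return 2.0 ** ((target - detected_midi) / 12.0)

# A singer slightly sharp of D4 (MIDI 62.4), with melody D4 and harmony F#4:
print(shift_ratio(62.4, [62, 66]))   # small downward correction toward D4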
Audio-visual capture at handheld devices
Although not required to support performance-synchronized video capture in all embodiments, the handheld device 101 (e.g., the current guest device 101A or the current moderator device 101B, recall fig. 1) may itself capture both vocal audio and performance-synchronized video. Thus, fig. 5 illustrates a basic signal processing flow (350), in accordance with some embodiments, suitable for a mobile-phone-type handheld device 101 to capture vocal audio and performance-synchronized video, to generate pitch-corrected and optionally harmonized vocals for audible rendering (locally and/or at a remote target device), and to communicate with a content server or service platform 110.
Based on the description herein, one of ordinary skill in the art will appreciate suitable allocations of signal processing techniques (sampling, filtering, decimation, etc.) and data representations to functional blocks of software (e.g., decoder 352, digital-to-analog (D/A) converter 351, capture 353, 353A and encoder 355) executable to provide the signal processing flow 350 illustrated in fig. 5. Likewise, with respect to the signal processing flow 250 and illustrative score-coded note targets (including harmony note targets) of fig. 4, one of ordinary skill in the art will appreciate suitable allocations of signal processing techniques and data representations to functional blocks and signal processing constructs (e.g., decoder 258, capture 251, digital-to-analog (D/A) converter 256, mixers 253, 254 and encoder 257) that may be implemented, at least in part, as software executable on a handheld or other portable computing device.
As will be appreciated by those of ordinary skill in the art, pitch detection and pitch correction have a rich technological history in the music and voice coding arts. Indeed, a wide variety of feature-picking, time-domain and even frequency-domain techniques have been employed in the art and may be employed in some embodiments in accordance with the present invention. With this in mind, and recognizing that the multi-singer synchronization techniques in accordance with the present invention are generally independent of any particular pitch detection or pitch correction technology, this description does not seek to exhaustively inventory the wide variety of signal processing techniques that may be suitable for various designs or implementations in accordance therewith. Instead, we simply note that, in some embodiments in accordance with the present invention, pitch detection methods compute an average magnitude difference function (AMDF) and execute logic to pick a peak corresponding to an estimate of the pitch period. Building on such estimates, pitch-synchronous overlap-add (PSOLA) techniques are used to facilitate resampling of a waveform to produce pitch-shifted variants while reducing aperiodic artifacts of the splicing.
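A minimal AMDF sketch consistent with the above follows; windowing, interpolation and octave-error handling, which practical detectors add, are omitted here for brevity.

import numpy as np

def amdf_pitch(frame, sr=44100, fmin=80.0, fmax=1000.0):
    # For each candidate lag, compute the mean |x[n] - x[n - lag]|; the
    # deepest valley marks the best pitch-period estimate.
    lags = np.arange(int(sr / fmax), int(sr / fmin))
    d = [np.mean(np.abs(frame[lag:] - frame[:-lag])) for lag in lags]
    best_lag = lags[int(np.argmin(d))]
    return sr / best_lag   # pitch estimate in Hz

# Toy check: a 220 Hz sine comes back near 220 Hz.
t = np.arange(2048) / 44100.0
print(amdf_pitch(np.sin(2 * np.pi * 220 * t)))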
Exemplary Mobile device
FIG. 6 illustrates features of a mobile device that may serve as a platform for execution of software implementations in accordance with some embodiments of the present invention. More specifically, FIG. 6 is a block diagram of a mobile device 400 that is generally consistent with a commercially available version of an iPhone™ mobile digital device. While embodiments of the present invention are certainly not limited to iPhone deployments or applications (or even to iPhone-type devices), the iPhone device platform, together with its rich complement of sensors, multimedia facilities, application programmer interfaces and wireless application delivery model, provides a highly capable platform on which to deploy certain implementations. Based on the description herein, persons of ordinary skill in the art will appreciate a wide range of additional mobile device platforms that may be suitable (now or hereafter) for a given implementation or deployment of the inventive techniques described herein.
Summarizing briefly, mobile device 400 includes a display 402 that may be sensitive to haptic and/or tactile contact with a user. Touch-sensitive display 402 may support multi-touch features, processing multiple simultaneous touch points, including processing data related to the pressure, degree and/or position of each touch point. Such processing facilitates gestures and interactions with multiple fingers, as well as other interactions. Of course, other touch-sensitive display technologies may also be used, e.g., a display in which contact is made using a stylus or other pointing device.
Generally, the mobile device 400 presents a graphical user interface on the touch-sensitive display 402, thereby providing the user with access to various system objects and for communicating information. In some implementations, the graphical user interface can include one or more display objects 404, 406. In the example shown, the display objects 404, 406 are graphical representations of system objects. Examples of system objects include device functions, applications, windows, files, alarms, events, or other identifiable system objects. In some embodiments of the invention, the application, when executed, provides at least some of the digital acoustic functionality described herein.
In general, the mobile device 400 supports network connectivity, including for example both mobile radio and wireless interconnection functionality, to enable a user to travel with the functionality of the mobile device 400 and its associated support network. In some cases, mobile device 400 can interact with other devices in the vicinity (e.g., via Wi-Fi, Bluetooth, etc.). For example, mobile device 400 can be configured to interact with a peer or base station for one or more devices. Thus, the mobile device 400 can grant or deny network access to other wireless devices.
The mobile device 400 includes various input/output (I/O) devices, sensors and transducers. For example, a speaker 460 and a microphone 462 are typically included to facilitate audio capture, such as of vocal performances, and audible rendering of accompaniment tracks and mixed pitch-corrected vocal performances, as described elsewhere herein. In some embodiments of the invention, the speaker 460 and microphone 462 may provide appropriate transducers for the techniques described herein. An external speaker port 464 may be included to facilitate hands-free voice functionality, such as speakerphone functionality. An audio jack 466 may also be included to facilitate the use of a headset and/or microphone. In some embodiments, an external speaker and/or microphone may be used as transducers for the techniques described herein.
Other sensors may also be used or provided. A proximity sensor 468 may be included to facilitate detection of user positioning of the mobile device 400. In some implementations, an ambient light sensor 470 may be utilized to facilitate adjusting the brightness of the touch-sensitive display 402. An accelerometer 472 may be utilized to detect movement of the mobile device 400, as indicated by the directional arrow 474. Accordingly, display objects and/or media may be presented according to a detected orientation (e.g., portrait or landscape). In some implementations, the mobile device 400 may include circuitry and sensors to support location-determining capability, such as that provided by a Global Positioning System (GPS) or other positioning systems (e.g., systems using Wi-Fi access points, television signals, cellular grids, Uniform Resource Locators (URLs)), to facilitate geocoding as described herein. The mobile device 400 also includes a camera lens and imaging sensor 480. In some implementations, instances of the camera lens and sensor 480 are located on the front and back surfaces of the mobile device 400. The cameras allow still images and/or video to be captured for association with captured pitch-corrected vocals.
Mobile device 400 may also include one or more wireless communication subsystems, such as an 802.11b/g/n/ac communication device and/or a Bluetooth™ communication device 488. Other communication protocols may also be supported, including other 802.x communication protocols (e.g., WiMax, Wi-Fi, 3G), fourth-generation protocols and modulations (4G-LTE) and beyond (e.g., 5G), Code Division Multiple Access (CDMA), Global System for Mobile communications (GSM), Enhanced Data GSM Environment (EDGE), etc. A port device 490 (e.g., a Universal Serial Bus (USB) port or a docking port) or some other wired port connection may be included and used to establish a wired connection to other computing devices, such as other communication devices 400, network access devices, a personal computer, a printer, or other processing devices capable of receiving and/or transmitting data. The port device 490 may also allow the mobile device 400 to synchronize with a host device using one or more protocols, such as, e.g., TCP/IP, HTTP, UDP and any other known protocol.
Example user interface
Figures 7A and 7B illustrate compositing of images of first and second performers in video presentations of livestreamed content at a moderator device, in accordance with some embodiments of the invention. In these illustrations, it should be appreciated that performance-synchronized video of performers captured at the respective moderator and guest devices is composited to provide the visual appearance of an apparent live performance. A blurring technique at image boundaries is illustrated.
Figs. 8A and 8B illustrate a selfie-chat interaction mechanic, in accordance with some embodiments of the present invention, that presents a capture viewport on screen and supports a user interaction whereby the user holds a touchscreen-presented button or other feature to capture a video clip and releases it to post the clip in a social media interaction. Some embodiments support alternative or additional gesture mechanics, such as tap-to-start and tap-to-stop, with confirmation before posting.
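A sketch of the hold-to-capture state handling this implies appears below; the class, event and helper names are assumptions for exposition, not an actual UI framework API.

class SelfieChatCapture:
    # Models press-and-hold capture with release-to-post, plus an assumed
    # fixed maximum clip length for driving the progress indication.
    def __init__(self, max_len_s=15.0):
        self.max_len_s = max_len_s
        self.recording = False
        self.clip = []

    def on_press(self):               # finger down on the capture button
        self.recording = True
        self.clip = []

    def on_frame(self, frame, elapsed_s):
        if self.recording and elapsed_s <= self.max_len_s:
            self.clip.append(frame)
        return min(elapsed_s / self.max_len_s, 1.0)   # progress fraction

    def on_release(self):             # finger up: stop capture and post
        self.recording = False
        return self.clip              # handed off as the social-media post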
Figs. 9A, 9B and 9C illustrate a vocal part selection and coordination mechanic in which gesture selections by user performers at geographically distributed devices provide vocal part selections for a livestreamed performance on peer devices, in accordance with some embodiments of the present invention. Some embodiments support alternative or additional gesture mechanics, including song selection, e.g., based on coded interests or the performance histories of one or more of the user performers. In some cases or embodiments, song selection is presented as a pseudo-random, roulette-style song selection triggered by one or more of the user performers or automatically, e.g., on expiration of a timer.
Exemplary Mobile device
Fig. 10 illustrates respective instances (701, 720A, 720B and 711) of computing devices programmed (or programmable) with vocal audio and video capture code, user interface code, pitch correction code, an audio rendering pipeline and playback code in accordance with the functional descriptions herein. Device instance 701 is depicted operating in a vocal audio and performance-synchronized video capture mode, while device instances 720A and 720B are depicted operating in a mode that receives the livestreamed mixed audiovisual performance. Although television-type display and/or set-top box equipment 720B is depicted operating in the live stream receiving mode, such equipment, as well as computer 711, may operate as part of a vocal audio and performance-synchronized video capture facility (as guest device 101A or moderator device 101B, recall fig. 1). Each of the aforementioned devices communicates, via wireless data transmission and/or intervening networks 704, with a server 712 or service platform that hosts storage and/or functionality described herein with regard to the content server 110. Captured, pitch-corrected vocal performances mixed with performance-synchronized video, defining a multi-singer audiovisual performance as described herein, may (optionally) be streamed live and audiovisually rendered at the laptop computer 711.
Other embodiments
While the present invention has been described with reference to various embodiments, it should be understood that these embodiments are illustrative and that the scope of the invention is not limited to them. Many variations, modifications, additions, and improvements are possible. For example, while pitch-corrected vocal performances captured from a karaoke style interface have been described, other variations will be appreciated. Moreover, while certain illustrative signal processing techniques have been described in the context of certain illustrative applications, those of ordinary skill in the art will recognize that it would be straightforward to modify the described techniques to accommodate other suitable signal processing techniques and effects.
Embodiments in accordance with the present invention may take the form of, and/or be provided as, a computer program product encoded in a machine-readable medium as instruction sequences and other functional constructs of software, which may in turn be executed in a computational system (such as an iPhone handheld, a mobile or portable computing device, a media application platform, a set-top box, or a content server platform) to perform the methods described herein. In general, a machine-readable medium may include tangible articles that encode information in a form (e.g., as applications, source or object code, functionally descriptive information, etc.) readable by a machine (e.g., a computer, computational facilities of a mobile or portable computing device, a media device or streamer, etc.), as well as non-transitory storage incident to transmission of the information. A machine-readable medium may include, but need not be limited to, magnetic storage media (e.g., disk and/or tape storage); optical storage media (e.g., CD-ROM, DVD, etc.); magneto-optical storage media; read-only memory (ROM); random access memory (RAM); erasable programmable memory (e.g., EPROM and EEPROM); flash memory; or other types of media suitable for storing electronic instructions, operation sequences, functionally descriptive information encodings, etc.
In general, plural instances may be provided for components, operations or structures described herein as a single instance. Boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the claims that follow. In general, structures and functionality presented as separate components in the exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions and improvements may fall within the scope of the invention as defined in the claims that follow.

Claims (45)

1. A collaboration method for capturing live streaming broadcasts of coordinated audiovisual works of a first performer and a second performer at respective geographically distributed first and second devices, the method comprising:
at a second device, receiving media encoding of an audiovisual performance mixed with accompanying audio tracks, the audiovisual performance comprising the following: (i) audio of vocal music captured from the first performer at the first device and (ii) video synchronized in performance with the captured vocal music of the first performer;
at the second device, audibly rendering the received mixed audio performance and capturing vocal audio from the second performer against it;
mixing the captured second performer vocal music audio with the received mixed audio performance to provide a broadcast audiovisual mix comprising the captured first and second performer vocal music audio and the accompaniment audio tracks without an apparent time lag therebetween; and
supplying the broadcast audiovisual mix to a service platform configured to livestream the broadcast audiovisual mix to a plurality of audience devices that constitute an audience.
2. The method of claim 1,
wherein the performance-synchronized video included in the received media encoding is captured along with the vocal music captured at the first device,
wherein the method further comprises: capturing, at the second device, video performance-synchronized with the captured second performer vocals, and
wherein the audio-visual broadcast mix is an audio-visual mix of the captured audio and video of at least the first and second performers.
3. The method of claim 1, further comprising:
capturing, at the second device, second performer video performance-synchronized with the captured second performer vocals; and
compositing the second performer video with the first performer video in the supplied audiovisual broadcast mix.
4. The method of claim 3,
wherein, for at least some portion of the supplied mix of audiovisual broadcasts, the compositing of the first and second performer videos comprises a computational blurring of image frames of the first and second performer videos at a visual boundary therebetween.
5. The method of claim 3, further comprising:
dynamically varying the relative visual prominence of one or the other of the first and second performers during the course of the audiovisual broadcast mix.
6. The method of claim 5,
wherein the dynamic variation accords, at least in part, with time-varying vocal part codings of a musical score corresponding to, and temporally synchronized with, the accompaniment audio track.
7. The method of claim 5,
wherein the dynamic change is based at least in part on an evaluation of computationally defined audio characteristics of either or both of the first and second performer vocals.
8. The method of claim 1,
wherein the first device is associated with the second device as a current live stream guest, and
wherein the second device operates as the current live stream host, controlling association and dissociation of particular devices from the audience as the current live stream guest.
9. The method of claim 2,
wherein the current live stream host selects from a queue of requests from audience members to associate as the current live stream guest.
10. The method of claim 1, wherein the first device operates in a live stream guest role and the second device operates in a live stream moderator role, the method further comprising any one or both of:
the second device releasing the live stream moderator role for occupancy by another device; and
the second device passing the live stream moderator role to a particular device selected from a set that includes the first device and devices of the audience.
11. The method of claim 1, further comprising:
accessing machine-readable code for a musical structure, the musical structure including at least musical segment boundaries encoded to be temporally aligned with vocal music audio captured at the first and second devices; and
applying a first visual effects schedule to at least a portion of the audiovisual broadcast mix, wherein the applied visual effects schedule encodes different visual effects of different music structure elements encoded for the first audiovisual performance and provides a visual effects transition that is temporally aligned with at least some of the encoded musical segment boundaries.
12. The method of claim 11, wherein the different visual effects encoded by the applied visual effects schedule include, for a given element thereof, one or more of:
particle-based effects or lens glare;
conversion between different source videos;
animation or motion of frames within the source video;
vector graphics or images of patterns or textures; and
color, saturation, or contrast.
13. The method of claim 11,
wherein the associated music structure encodes differing types of musical sections; and
wherein the applied visual effects schedule defines different visual effects for different ones of the encoded musical section types.
14. The method of claim 11,
wherein the associated music structure encodes events or transitions; and
wherein the applied visual effects schedule defines different visual effects for different ones of the encoded events or transitions.
15. The method of claim 11,
wherein the associated music structure encodes group parts, and
wherein the applied visual effects schedule is temporally selective for particular performance-synchronized video coinciding with the encoded music structure.
16. The method of any of claims 1-15, performed at least in part on a handheld mobile device communicatively coupled to a content server or service platform.
17. The method of any one of claims 1 to 15, at least partially embodied as a computer program product encoding instructions executable on the second device as part of a collaborative system comprising a content server or service platform to which a plurality of geographically distributed, network-connected vocal capture devices, including the second device, are communicatively coupled.
18. A system for disseminating an apparent live broadcast of a combined performance of a first performer and a second performer that are geographically distributed, the system comprising:
a first device and a second device coupled by a communication network having non-negligible peer-to-peer latency for transmission of audiovisual content;
the first device communicatively coupled to supply to the second device an audiovisual performance mixed with an accompaniment audio track, the audiovisual performance comprising: (1) vocal audio of the first performer captured against the accompaniment audio track and (2) video performance-synchronized with the vocal audio; and
the second device communicatively configured to receive a media encoding of the mixed audiovisual performance, to audibly render at least an audio portion thereof, to capture vocal audio of the second performer against it, and to mix the captured second performer vocal audio with the received mixed audiovisual performance for transmission as the apparent live broadcast.
19. The system of claim 18,
wherein the second device is further configured to capture second performer video performance-synchronized with the captured second performer vocals, and to composite the second performer video with the video of the first performer in the supplied audiovisual broadcast mix.
20. The system as set forth in claim 19, wherein,
wherein, for at least some portion of the supplied mix of audiovisual broadcasts, the compositing of the first and second performer videos comprises a computational blurring of image frames of the first and second performer videos at a visual boundary therebetween.
21. The system as set forth in claim 19, wherein,
wherein the compositing of the first and second performer videos includes dynamically changing the relative visual prominence of one or the other of the first and second performers during the audiovisual broadcast mix.
22. The system of claim 21,
wherein the dynamic change accords, at least in part, with time-varying vocal part codings of a musical score corresponding to, and temporally synchronized with, the accompaniment audio track.
23. The system of claim 21,
wherein the dynamic change is based at least in part on an evaluation of computationally defined audio characteristics of either or both of the first and second performer vocals.
24. The system of claim 18,
wherein the first device is associated with the second device as a current live stream guest, and
wherein the second device operates as the current live stream host, controlling association and dissociation of particular devices from the audience as the current live stream guest.
25. The system of claim 24,
wherein the current live stream host selects from a queue of requests from audience members to associate as the current live stream guest.
26. The system of claim 18, further comprising:
a video compositor accessing machine readable encoding of a music structure and applying a first visual effects schedule to at least a portion of an audiovisual broadcast mix, the music structure including at least music segment boundaries encoded to be temporally aligned with vocal music audio captured at the first and second devices, wherein the applied visual effects schedule encodes different visual effects of different music structure elements encoded for the first audiovisual performance and provides a visual effects transition that is temporally aligned with at least some of the encoded music segment boundaries.
27. The system of claim 26, wherein the video compositor is hosted on the second device or on the content server or service platform through which the apparent live performance is served.
28. The system of claim 18,
wherein, as part of a user interface viewable on either or both of the first and second devices, different vocal part selections are presented for each of the performers for a current song selection; and
wherein the vocal part selections are updated responsive to, and in correspondence with, gestures by either or both of the first and second performers at the respective geographically distributed devices, wherein assignment of a particular vocal part selection to the respective first or second performer is changeable until one or the other of the first and second performers gestures to initiate vocal capture at his or her respective geographically distributed device, whereupon the assignment of particular vocal part selections to the respective first and second performers is fixed for the duration of the coordinated multi-vocal performance capture.
29. A user interface method for social media, the method comprising:
presenting, as part of a user interface view on a touchscreen display of a client device, live video captured using a camera of the client device;
in response to a first touchscreen gesture made by a user of a client device, initiating capture of a snippet of live video and presenting a progress indication corresponding to a contemporaneous capture of the snippet as part of a user interface view; and
in response to a second touchscreen gesture made by the user of the client device, sending the captured snippet to a network-coupled service platform as a post in a multi-user social media thread.
30. The method of claim 29, further comprising:
presenting the multi-user social media thread on the touchscreen display, the presented multi-user social media thread comprising the captured snippet and temporally ordered posted content received from other users via the network-coupled service platform, wherein the posted content from at least one other user comprises one or more of text and a captured video snippet from the at least one other user.
31. The method of claim 29,
wherein the captured snippet is a fixed-length snippet, and
wherein the method further comprises visually updating the progress indication in correspondence with the portion of the fixed-length snippet captured so far.
32. The method of claim 29,
wherein the first touch screen gesture is a contact maintained by a user with a first visually presented feature on the touch screen display, and wherein the second touch screen gesture includes a release of the maintained contact.
33. The method of claim 29,
wherein the first touch screen gesture is a first tap type contact by the user with a first visually presented feature on the touch screen display, and wherein the second touch screen gesture is a second tap type contact on the touch screen display after the first tap type contact.
34. The method of claim 29, further comprising:
presenting the multi-user social media thread on the touchscreen display in correspondence with a livestreamed audiovisual broadcast mix.
35. A method for capturing at least a portion of a coordinated multi-vocal performance of a first performer and a second performer at respective geographically distributed first and second devices, the method comprising:
presenting, as part of a user interface viewable on either or both of the first and second devices, different vocal part selections for each of the performers for a current song selection; and
wherein the vocal part selections are updated responsive to, and in correspondence with, gestures by either or both of the first and second performers at the respective geographically distributed devices, wherein assignment of a particular vocal part selection to the respective first or second performer is changeable until one or the other of the first and second performers gestures to initiate vocal capture at his or her respective geographically distributed device, whereupon the assignment of particular vocal part selections to the respective first and second performers is fixed for the duration of the coordinated multi-vocal performance capture.
36. The method of claim 35, further comprising:
updating the vocal part selection at the second device corresponding to the gesture selection transmitted from the first device; and
supplying the first device with updates to the vocal part selections corresponding to gesture selections at the second device.
37. The method of claim 35, further comprising:
changing the current song selection and, in correspondence therewith, updating user interface visuals on either or both of the first and second devices.
38. The method of claim 37,
wherein a change in the current song selection is triggered by one or the other of the first and second performers on a respective one of the first and second devices.
39. The method of claim 37, further comprising:
triggering a change in the current song selection based on a periodic or recurring event.
40. The method of claim 37,
wherein the change in current song selection selects from a library of song selections based on one or more of the encoded interests and performance history of either or both of the first and second performers.
41. The method of claim 35, further comprising:
receiving, at the second device, beginning with the initiation of vocal capture, a media encoding of a mixed audio performance (i) including vocal audio captured at the first device from the first performer and (ii) mixed with an accompaniment audio track for the current song selection;
at the second device, audibly rendering the received mixed audio performance and capturing vocal audio from the second performer against it; and
mixing the captured second performer vocal audio with the received mixed audio performance to provide a broadcast mix that includes the captured first and second performer vocal audio and the accompaniment audio track without apparent time lag therebetween.
42. The method of claim 41, further comprising:
at the second device, visually presenting lyrics and score-coded note targets for the current song selection in correspondence with the audible rendering, wherein the visually presented lyrics and note targets correspond to the vocal part assignment in effect at initiation of vocal capture.
43. The method of claim 41,
wherein the received media encoding comprises video synchronized in performance with the captured first performer vocal music,
wherein the method further comprises: capturing, at the moderator device, video performance-synchronized with the captured second performer vocals, and
wherein the broadcast mix is an audio-visual mix of the captured audio and video of at least the first and second performers.
44. The method of claim 41, further comprising:
capturing, at the moderator device, second performer video performance-synchronized with the captured second performer vocals; and
compositing the second performer video with the first performer video in the supplied audiovisual broadcast mix.
45. The method of claim 41, further comprising:
supplying the broadcast mix to a service platform configured to livestream the broadcast mix to a plurality of audience devices that constitute the audience.
CN201980052977.9A 2018-06-15 2019-06-17 Audio-visual live streaming system and method with latency management and social media type user interface mechanism Pending CN112567758A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201862685727P 2018-06-15 2018-06-15
US62/685,727 2018-06-15
PCT/US2019/037479 WO2019241778A1 (en) 2018-06-15 2019-06-17 Audiovisual livestream system and method with latency management and social media-type user interface mechanics

Publications (1)

Publication Number Publication Date
CN112567758A true CN112567758A (en) 2021-03-26

Family

ID=68842358

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980052977.9A Pending CN112567758A (en) 2018-06-15 2019-06-17 Audio-visual live streaming system and method with latency management and social media type user interface mechanism

Country Status (3)

Country Link
EP (1) EP3808096A4 (en)
CN (1) CN112567758A (en)
WO (1) WO2019241778A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11250825B2 (en) * 2018-05-21 2022-02-15 Smule, Inc. Audiovisual collaboration system and method with seed/join mechanic
WO2020117823A1 (en) 2018-12-03 2020-06-11 Smule, Inc. Augmented reality filters for captured audiovisual performances
CN114072872A (en) * 2019-04-29 2022-02-18 保罗·安德森 System and method for providing electronic music score

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100748060B1 (en) * 2005-08-05 2007-08-09 주식회사 오아시스미디어 Internet broadcasting system of Real-time multilayer multimedia image integrated system and Method thereof
US8019815B2 (en) * 2006-04-24 2011-09-13 Keener Jr Ellis Barlow Interactive audio/video method on the internet
KR101605497B1 (en) * 2014-11-13 2016-03-22 유영재 A Method of collaboration using apparatus for musical accompaniment

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102281294A (en) * 2003-07-28 2011-12-14 索诺斯公司 System and method for synchronizing operations among a plurality of independently clocked digital data processing devices
US20090113022A1 (en) * 2007-10-24 2009-04-30 Yahoo! Inc. Facilitating music collaborations among remote musicians
CN101853498A (en) * 2009-03-31 2010-10-06 华为技术有限公司 Image synthetizing method and image processing device
CN102456340A (en) * 2010-10-19 2012-05-16 盛大计算机(上海)有限公司 Karaoke in-pair singing method based on internet and system thereof
US20160057316A1 (en) * 2011-04-12 2016-02-25 Smule, Inc. Coordinating and mixing audiovisual content captured from geographically distributed performers
GB201217381D0 (en) * 2012-09-28 2012-11-14 Memeplex Ltd Automatic audio mixing
US20140094944A1 (en) * 2012-09-28 2014-04-03 Stmicroelectronics S.R.I. Method and system for simultaneous playback of audio tracks from a plurality of digital devices
WO2014175482A1 (en) * 2013-04-24 2014-10-30 (주)씨어스테크놀로지 Musical accompaniment device and musical accompaniment system using ethernet audio transmission function
CN105850094A (en) * 2013-10-07 2016-08-10 伯斯有限公司 Synchronous audio playback
WO2015103415A1 (en) * 2013-12-31 2015-07-09 Smule, Inc. Computationally-assisted musical sequencing and/or composition techniques for social music challenge or competition
CN108040497A (en) * 2015-06-03 2018-05-15 思妙公司 Content based on the performing artist's capture being distributed from strange land automatically generates the audio-video work of coordination
CN106601220A (en) * 2016-12-08 2017-04-26 天脉聚源(北京)传媒科技有限公司 Method and device for recording antiphonal singing of multiple persons

Also Published As

Publication number Publication date
EP3808096A1 (en) 2021-04-21
EP3808096A4 (en) 2022-06-15
WO2019241778A1 (en) 2019-12-19

Similar Documents

Publication Publication Date Title
US11553235B2 (en) Audiovisual collaboration method with latency management for wide-area broadcast
US11683536B2 (en) Audiovisual collaboration system and method with latency management for wide-area broadcast and social media-type user interface mechanics
US11394855B2 (en) Coordinating and mixing audiovisual content captured from geographically distributed performers
US20230335094A1 (en) Audio-visual effects system for augmentation of captured performance based on content thereof
US10424283B2 (en) Automated generation of coordinated audiovisual work based on content captured from geographically distributed performers
US10943574B2 (en) Non-linear media segment capture and edit platform
US20220051448A1 (en) Augmented reality filters for captured audiovisual performances
CN112567758A (en) Audio-visual live streaming system and method with latency management and social media type user interface mechanism
US20220122573A1 (en) Augmented Reality Filters for Captured Audiovisual Performances
WO2016070080A1 (en) Coordinating and mixing audiovisual content captured from geographically distributed performers
WO2020006556A9 (en) Audiovisual collaboration system and method with seed/join mechanic
CN111345044B (en) Audiovisual effects system for enhancing a performance based on content of the performance captured

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination