CN113039573A - Audio-visual collaboration system and method with seed/join mechanism - Google Patents

Audio-visual collaboration system and method with seed/join mechanism

Info

Publication number
CN113039573A
Authority
CN
China
Prior art keywords
audio
user
media
performance
lyrics
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201980056174.0A
Other languages
Chinese (zh)
Inventor
David Steinwedel
Andrea Slobodian
Jeffrey C. Smith
Perry R. Cook
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Smule Inc
Original Assignee
Smule Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US16/418,659 external-priority patent/US10943574B2/en
Application filed by Smule Inc filed Critical Smule Inc
Publication of CN113039573A publication Critical patent/CN113039573A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 - Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 - Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031 - Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B27/10 - Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/34 - Indicating arrangements

Abstract

User interface techniques provide a mechanism by which a user vocalist may seed subsequent performances of other users (e.g., joiners). The seed may be a full-length seed that spans most or all of a pre-existing audio (or audiovisual) work and mixes the user's captured media content for at least some portions of the work in order to seed further contributions of one or more joiners. A short seed may span less than all (and in some cases, far less than all) of the audio (or audiovisual) work. For example, a verse, chorus, refrain, phrase, or other limited "block" of an audio (or audiovisual) work may constitute the seed. The seeding user's call invites other users to join the full-length or short seed by singing along, singing a particular vocal or musical part, singing a harmony or other duet part, rapping, talking, clapping, recording video, adding a video clip from their camera roll, and the like. The resulting group performance, whether full-length or merely a segment, may be posted, livestreamed, or otherwise disseminated in a social network.

Description

Audio-visual collaboration system and method with seed/join mechanism
Technical Field
The present invention(s) relate generally to capture and/or processing of audiovisual performances and, more particularly, to user interface techniques suitable for capture and manipulation of media segments encoding audio and/or visual performances, with options for non-linear capture, re-capture, scrubbing, or lip-sync, for use in a seed-and-join collaboration mechanism.
Background
The installed base of mobile phones, personal media players, and portable computing devices, together with media streamers and television set-top boxes, grows in sheer number and computational power each day. Ubiquitous and deeply entrenched in the lifestyles of people around the world, many of these devices transcend cultural and economic barriers. Computationally, they offer speed and storage capabilities comparable to engineering workstations or workgroup computers of less than ten years ago, and typically include powerful media processors, making them suitable for real-time sound synthesis and other music applications. Indeed, some modern devices (such as iPhone®, iPod Touch®, and other iOS® or Android devices) support audio and video processing quite capably, while providing platforms suitable for advanced user interfaces.
Applications such as Smule Ocarina™, Leaf Trombone™, I Am T-Pain™, AutoRap®, Smule (fka Sing! Karaoke™), Guitar! By Smule®, and Magic Piano®, available from Smule, Inc., have shown that such devices can deliver advanced digital acoustic technology in ways that provide a compelling musical experience. Significant practical challenges remain, however, as researchers seek to transition their innovations into commercial applications deployable to modern handheld devices and media application platforms within the real-world constraints imposed by processor, memory, and other limited computing resources, and/or within the communication bandwidth and transmission latency constraints typical of wireless networks. Improved technical and functional capabilities are needed, particularly with respect to audiovisual content and user interfaces.
Disclosure of Invention
It has been discovered that, despite practical limitations imposed by mobile device platforms and media application execution environments, audiovisual performances, including vocal music, may be captured and coordinated with the audiovisual content of other users' performances in ways that create compelling user experiences. In some cases, the vocal performances of individual users are captured (together with performance-synchronized video) on mobile devices in the context of a karaoke-style presentation of lyrics in correspondence with an audible rendering of an accompaniment track. For example, user interface designs may facilitate performance capture by visually presenting lyrics and pitch cues to the user vocalist while providing a temporally synchronized audible rendering of the audio accompaniment.
Building on these and related techniques, user interface improvements are envisioned to provide a mechanism by which a user vocalist may seed subsequent performances of other users (e.g., joiners). In some cases, the seed may be a full-length seed that spans most or all of a pre-existing audio (or audiovisual) work and mixes the user's captured media content for at least some portions of the work in order to seed further contributions of one or more joiners. In some cases, short seeds may be employed that span less than all (and in some cases, far less than all) of the audio (or audiovisual) work. For example, in some cases or embodiments, a verse, chorus, refrain, phrase, or other limited "block" of an audio (or audiovisual) work may constitute the seed. Whatever the extent or scope of the seed, the seeding user may solicit (or call on) others to join. Typically, the call invites other users to join the full-length or short seed by singing along, singing a particular vocal or musical part, singing a harmony or other duet part, rapping, talking, clapping, recording video, adding a video clip from their camera roll, and the like. The resulting group performance, whether full-length or merely a segment, may be posted, livestreamed, or otherwise disseminated in a social network.
The seeding user may select the seed or seed portion using scrubbing techniques that allow audiovisual content, optionally including pitch cues, a waveform- or envelope-style performance timeline, lyrics, video, and/or other temporally synchronized content, to be traversed forward and backward during recording, during editing, and/or in playback. In this way, recapture of selected portions of a performance, coordination of group parts, and overdubbing may all be facilitated. Direct scrolling to arbitrary points in the performance timeline, lyrics, pitch cues, and other temporally synchronized content lets the user move easily through a capture or audiovisual editing session. For selections or embodiments involving short seeds, scrubbing techniques may be employed to set the start and stop points that delimit a particular seed portion or block. Likewise, in the case of a full-length seed, scrubbing techniques may be employed to set start and stop points that delimit the portion of the performance timeline to which joiners are invited to contribute. In some cases, such as in short-capture duet guidance, the user vocalist may be guided through the performance timeline, lyrics, pitch cues, and other temporally synchronized content in correspondence with group part information. In some or all cases, the scrubber allows the user vocalist to move conveniently forward and backward through the temporally synchronized content. In some cases, temporally synchronized video capture and/or playback is also supported in connection with the scrubber. Note that while scrubbing may be provided for synchronized traversal of multiple media lanes (e.g., accompaniment audio, vocals, lyrics, pitch cues, and/or group part information), single-media scrubbing is also contemplated.
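To make the scrubbing mechanism concrete, what follows is a minimal sketch, in Swift, of a scrubber that maps horizontal pan gestures onto a clamped timeline position from which each synchronized pane (lyrics, pitch cues, audio envelope) derives its display state. All type and parameter names (Scrubber, secondsPerPoint, TimedEvent) are illustrative assumptions, not Smule's implementation.

import Foundation

// Minimal scrubber sketch: one clamped time position drives every
// temporally synchronized pane.
struct TimedEvent { let time: TimeInterval; let payload: String }

final class Scrubber {
    private(set) var position: TimeInterval = 0   // current capture/playback point
    let duration: TimeInterval
    let secondsPerPoint: TimeInterval             // gesture-to-time scale (assumed)

    init(duration: TimeInterval, secondsPerPoint: TimeInterval = 0.02) {
        self.duration = duration
        self.secondsPerPoint = secondsPerPoint
    }

    // A leftward pan (negative deltaX) moves backward; rightward moves forward.
    func handlePan(deltaX: Double) {
        position = min(max(0, position + deltaX * secondsPerPoint), duration)
    }

    // Index of the event at or before `position`; lyrics, pitch cues, and the
    // envelope all render from the same answer, which keeps them in visual sync.
    func currentIndex(in track: [TimedEvent]) -> Int? {
        track.lastIndex { $0.time <= position }
    }
}

let lyrics = [TimedEvent(time: 0.0, payload: "Verse line 1"),
              TimedEvent(time: 2.4, payload: "Verse line 2"),
              TimedEvent(time: 4.1, payload: "Chorus line 1")]
let scrubber = Scrubber(duration: 30)
scrubber.handlePan(deltaX: 150)                   // ~3 s forward
print(scrubber.currentIndex(in: lyrics) ?? -1)    // -> 1 ("Verse line 2")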
Scrubbing techniques need not be employed in all cases or embodiments. In some cases or embodiments, portions of the performance timeline (often portions corresponding to musical sections) may be pre-marked and labeled for user selection. The marking/labeling may come from human or automated sources. For example, particular sections may be marked or labeled by the user who originally uploaded a track or the corresponding lyrics, or by a media content curator. Additionally or alternatively, particular sections may be demarked or labeled by machine-learning robots trained to recognize sections and boundaries (e.g., from backing audio or vocal tracks or lyrics, or based on crowd-sourced data, such as where users tend to sing the most or the loudest).
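As an illustration of crowd-data-based section labeling, the sketch below scans per-second vocal energy, assumed to be averaged across many community performances, for the fixed-length window where users sing the loudest, a plausible chorus candidate. The heuristic and all names are assumptions for illustration; the patent does not specify this algorithm.

import Foundation

// Find the w-second window with the highest mean crowd vocal energy.
func loudestSection(energyPerSecond e: [Double], windowSeconds w: Int) -> Range<Int>? {
    guard e.count >= w, w > 0 else { return nil }
    var sum = e[0..<w].reduce(0, +)
    var best = sum, bestStart = 0
    for start in stride(from: 1, through: e.count - w, by: 1) {
        sum += e[start + w - 1] - e[start - 1]      // slide the window one second
        if sum > best { best = sum; bestStart = start }
    }
    return bestStart..<(bestStart + w)
}

let energy = [0.2, 0.3, 0.2, 0.9, 1.0, 0.95, 0.9, 0.4, 0.3]
print(loudestSection(energyPerSecond: energy, windowSeconds: 4) ?? 0..<0)  // 3..<7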
Beyond user interface and platform features designed to facilitate non-linear capture and editing of media segments, it is envisioned that collaboration features may be provided to allow multiple users to contribute media content and/or other temporally synchronized information to an evolving performance timeline. To facilitate collaboration and/or accretion of content, a shared service platform may expose media content and performance timeline data as a multi-user, concurrent-access database. Alternatively or additionally, collaboration may be facilitated by posting a performance timeline, particularly once it has been at least partially defined with seed audio or video, for other users to join (e.g., via the shared service platform or otherwise, in peer-to-peer fashion); those users may then capture, edit, and add further media segments, lyric information, pitch tracks, vocal part designations, and/or media-segment- or performance/style/genre-mapped audio or video effects/filters to the performance timeline. In some cases, further capture, editing, and accretion to the performance timeline is accomplished using the user interface and platform features described herein that facilitate non-linear capture and editing of media segments for karaoke-style audiovisual performance content and data.
These and other user interface improvements will be understood by persons of skill in the art having benefit of the present disclosure in connection with other aspects of an audiovisual performance capture system. Optionally, in some cases or embodiments, vocal audio may be pitch-corrected in real time in accord with pitch correction settings at the mobile device (or more generally, at a portable computing device such as a mobile phone, personal digital assistant, laptop, notebook, tablet, or netbook) or on a content or media application server. In some cases, the pitch correction settings code a particular key or scale for the vocal performance or for portions thereof. In some cases, the pitch correction settings include a score-coded melody and/or harmony sequence supplied with, or for association with, the lyrics and accompaniment track. Harmony notes or chords may be coded as explicit targets, if desired, or coded relative to the score-coded melody or even to the actual pitches sounded by the vocalist.
Based on the compelling and transformative nature of pitch-corrected vocals, performance-synchronized video, and score-coded harmony mixes, a user/vocalist may overcome an otherwise natural shyness or anxiety about sharing his or her vocal performance. Instead, even geographically dispersed vocalists are encouraged to share with friends and family, or to collaborate and contribute vocal performances as part of a social music network. In some implementations, these interactions are facilitated through social network- and/or eMail-mediated sharing of performances and invitations to join in a group performance. In some implementations, livestreaming may be supported. Living-room-style, big-screen user interfaces may facilitate these interactions. Using uploaded vocals captured at clients such as the aforementioned portable computing devices, a content server (or service) can mediate such coordinated performances by manipulating and mixing the uploaded audiovisual content of multiple contributing vocalists. Depending on the goals and implementation of a particular system, uploads may include, in addition to video content, pitch-corrected vocal performances (with or without harmonies), dry (i.e., uncorrected) vocals, and/or control tracks of user key and/or pitch correction selections, etc.
Social music may be mediated in any of a variety of ways. For example, in some implementations, a first user's vocal performance, captured against an accompaniment track at a portable computing device and typically pitch-corrected in accord with score-coded melody and/or harmony cues, is supplied as a seed to other potential vocal performers. Performance-synchronized video is also captured and may be supplied with the pitch-corrected, captured vocals. The supplied vocals are mixed with accompanying instrumentals/vocals and form the accompaniment track for capture of a second user's vocals. Often, successive vocal contributors are geographically separated and may be unknown (at least a priori) to one another, yet the intimacy of the vocals together with the collaborative experience itself tends to minimize the distance. As successive vocal performances and video are captured (e.g., at respective portable computing devices) and accreted as part of the social music experience, the accompaniment track against which individual vocals are captured may evolve to include previously captured vocals of other contributors.
In some cases, a complete performance, or a complete performance of a particular vocal part (e.g., part A or part B of a duet), may constitute the seed for a social music collaboration. However, using the techniques described herein, even small or isolated portions of an overall performance (e.g., a refrain, verse, intro, outro, duet or group part, or other limited portion, selection, or segment of a larger performance) may conveniently be captured, re-captured, or edited for use as a collaboration seed, whether or not it constitutes a complete performance timeline. In some cases, the selected portions, points, or pre-marked/labeled segment boundaries may correspond to elements of musical structure. Accordingly, embodiments in accordance with the present invention(s) may facilitate "mini-seed" collaboration mechanisms for geographically dispersed performers in a social music network.
In some cases, compelling visual animations and/or facilities for listener comment and ranking, as well as for forming or accreting duets, glee clubs, or choral groups, are provided in association with an audible rendering of a vocal performance (e.g., one captured and pitch-corrected at another similarly configured mobile device) mixed with accompanying instrumentals and/or vocals. Synthetic harmonies and/or additional vocals (e.g., vocals captured from another vocalist at another time and place and optionally pitch-shifted to harmonize with other vocals) may also be included in the mix. Audio or visual filters or effects may be applied or re-applied post-capture for dissemination or posting of content. In some cases, the disseminated or posted content may take the form of a collaboration request or open call to other vocalists. Geocoding of captured vocal performances (or individual contributions to a combined performance) and/or listener feedback may facilitate animations or display artifacts in ways that suggest performances or approvals emanating from particular geographic locales on a user-manipulable globe. In these ways, implementations of the described functionality can transform otherwise mundane mobile devices and living room or entertainment systems into social instruments that foster a unique sense of global connectivity, collaboration, and community.
In some embodiments in accordance with the present invention(s), a system includes first and second media capture devices communicatively coupled via respective network communication interfaces for multi-performer collaboration relative to a baseline media encoding of an audio work. The first media capture device provides a user interface by which its first user selects a seed portion of the audio work, and is configured to capture at least vocal audio performed by the first user against an audible rendering, on the first media capture device, of at least the seed portion of the audio work. The second media capture device is configured to (i) receive, via its network communication interface, an indication of the seed portion selected by the first user at the first media capture device, and (ii) capture media content of a second user performing against an audible rendering, on the second media capture device, of the seed portion mixed with the captured vocal audio of the first user.
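A hedged sketch of the "indication of the seed portion" such a second device might receive follows, expressed as Swift Codable types; the wire format and every field name are hypothetical assumptions, not a published schema.

import Foundation

// Hypothetical seed-portion indication: which work, which time block the
// seeder selected, where the seed mix lives, and what joiners are asked to add.
struct SeedPortionIndication: Codable {
    let workID: String               // baseline media encoding of the audio work
    let startSeconds: Double         // seed portion start within the work
    let endSeconds: Double           // seed portion end
    let seedMixURL: URL              // seed vocals mixed over the backing track
    let requestedContent: [String]   // e.g. ["vocal_harmony", "video"]
}

let indication = SeedPortionIndication(
    workID: "arrangement-12345",
    startSeconds: 42.0, endSeconds: 61.5,
    seedMixURL: URL(string: "https://example.com/seeds/12345.m4a")!,
    requestedContent: ["vocal_duet_part_B"])
let json = try! JSONEncoder().encode(indication)
print(String(data: json, encoding: .utf8)!)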
In some cases or embodiments, the user interface of the first media capture device further allows the first user to specify one or more types of media content to be captured from the second user's performance against the audible rendering, on the second media capture device, of the seed portion mixed with the first user's captured vocal audio. In some cases or embodiments, the specified one or more types of media content to be captured are selected from a set that includes: vocal audio, vocal harmony, or a vocal duet part; rap, talk, clap, or percussion; and video. In some cases or embodiments, the user interface of the first media capture device further allows the first user to post the seed portion, as a collaboration request, to other geographically distributed users and media capture devices, including the second user, for capture and addition of further vocal audio, video, or performance-synchronized audiovisual content.
In some embodiments, the system further includes a service platform communicatively coupled to the first and second media capture devices and configured to supply, for audible or audiovisual rendering on at least a third, communicatively coupled device, a media encoding of the multi-performer collaboration of at least the first and second users, the media encoding based on the audio work but temporally limited to the seed portion thereof selected by the first user.
In some embodiments, the system further includes a media content scrubber on the first media capture device by which the first user marks start and stop points in a performance timeline to delimit, and thereby select, the seed portion. In some cases or embodiments, the media content scrubber presents to the first user a temporally synchronized representation of two or more of: an audio envelope of accompaniment audio and/or vocals; lyrics; one or more pitch tracks; and duet or other group part notations.
In some embodiments, the system further includes a user interface on the first media capture device by which the first user selects the seed portion from pre-marked or labeled sections of the audio work. In some cases or embodiments, the pre-marked or labeled sections of the audio work are supplied by a service platform communicatively coupled to the first and second media capture devices, the sections having been marked or labeled based on one or more of: a musical structure coded for the audio work; machine learning algorithms applied to backing audio, vocal audio, or lyrics of, or corresponding to, the audio work; crowd-sourced data; and data supplied by a user-uploader of the audio work or by a third-party curator thereof.
In some cases or embodiments, the baseline media encoding of the audio work also encodes synchronized video content. In some cases or embodiments, the first media capture device is further configured to capture performance-synchronized video content. In some cases or embodiments, the first and second media capture devices are mobile phone-type portable computing devices executing application software that, in at least one operating mode thereof, provides a karaoke-style presentation of a performance timeline, including lyrics, on a multi-touch-sensitive display in temporal correspondence with an audible rendering of the audio work, and captures vocals and/or performance-synchronized video of the respective first or second user via on-board audio and video interfaces of the respective mobile phone-type portable computing device.
In some embodiments in accordance with the present invention(s), a method includes using a portable computing device for media segment capture in connection with a karaoke-style presentation, on a multi-touch-sensitive display thereof, of a performance timeline including lyrics and a pitch track temporally synchronized with an audio track. The method further includes: responsive to gesture control on the multi-touch-sensitive display, designating a subset of the lyrics for a joiner; and posting the performance timeline, with the lyrics subset designation, as a collaboration request for a joining remote user to capture and add further vocal audio content on a second, remote portable computing device configured for further media segment capture in connection with the performance timeline. In some cases or embodiments, the portable computing device is configured with user interface facilities executable to provide (i) start/stop control of media segment capture and (ii) scrubbing interactions for temporal position control within the performance timeline.
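The lyric-subset designation might be modeled as below: a minimal, assumed data model in which gesture-selected lyric lines are tagged with a vocal part before the timeline is posted as a collaboration request. Names and structure are illustrative only.

import Foundation

// Assumed lyric-line model with an optional vocal-part tag.
struct LyricLine { let time: TimeInterval; let text: String; var part: String? }

// Tag the lines the seeder highlighted; the tagged timeline is what gets posted.
func assign(part: String, toLines range: Range<Int>, in lyrics: inout [LyricLine]) {
    for i in range { lyrics[i].part = part }
}

var lyrics = [LyricLine(time: 0.0, text: "First verse line", part: nil),
              LyricLine(time: 3.2, text: "Second verse line", part: nil),
              LyricLine(time: 6.4, text: "Chorus line", part: nil)]
assign(part: "B", toLines: 1..<3, in: &lyrics)   // joiner sings lines 1 and 2
print(lyrics.map { "\($0.text) [\($0.part ?? "any")]" })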
In some embodiments, the method further includes adding at least one media segment to the performance timeline beginning at a first position scrubbed to within the performance timeline, the first position being neither the beginning of the performance timeline nor the most recent stop or pause position therein. In some embodiments, the method further includes capturing vocal audio at the portable computing device beginning at the first, scrubbed-to position in the performance timeline and in correspondence with the karaoke-style presentation, on the multi-touch-sensitive display, of at least the temporally synchronized lyrics and pitch track, wherein the added at least one media segment includes the captured vocal audio.
In some cases or embodiments, the added at least one media segment includes one or more of: video or still images; video captured at the portable computing device beginning at the first, scrubbed-to position in the performance timeline and in correspondence with the karaoke-style presentation of lyrics and pitch track on the multi-touch-sensitive display together with a temporally synchronized audible rendering of the audio track; and performance-synchronized audio and visual media content captured at the portable computing device.
In some embodiments, the method further comprises saving the performance timeline including the added at least one media segment to a service platform coupled to the network. In some embodiments, the method further includes retrieving a previously saved version of the performance timeline from a service platform coupled to the network.
In some cases or embodiments, the performance timeline is posted for the joining remote user via a network-coupled service platform. In some cases or embodiments, the designation of the lyrics subset is responsive to a first user gesture control on the multi-touch-sensitive display that selects a particular vocal part for the joiner. In some cases or embodiments, the designation of the lyrics subset is responsive to a second user gesture control on the multi-touch-sensitive display that delimits, within the lyrics, the subset to which the joiner's further media segment is to correspond.
In some embodiments in accordance with the present invention(s), a method includes using a first portable computing device for media segment capture in connection with a karaoke-style presentation, on a multi-touch-sensitive display thereof, of a performance timeline including lyrics and a pitch track temporally synchronized with an audio track, wherein the performance timeline includes a lyrics subset designation made by a prior user on a second, remote computing device, the lyrics subset designation at least partially parameterizing a collaboration request for capture and addition, on the first portable computing device, of further vocal audio content by a user joining the performance timeline. In some cases or embodiments, the first portable computing device is configured with user interface facilities executable to provide (i) start/stop control of media segment capture and (ii) scrubbing interactions for temporal position control within the performance timeline.
In some embodiments, the method further includes capturing at least one vocal audio media segment beginning at a first position scrubbed to within the performance timeline, the first position being neither the beginning of the performance timeline nor the most recent stop or pause position therein. In some embodiments, the method further includes updating the performance timeline to include the captured at least one vocal audio media segment, and posting the updated performance timeline for a further joining remote user via a network-coupled service platform.
In some cases or embodiments, the lyrics subset designation is selective for a particular vocal part. In some cases or embodiments, the lyrics subset designation delimits, within the lyrics, the subset to which the further vocal audio content of the user joining the performance timeline is to correspond.
In some embodiments in accordance with the present invention(s), a method includes: using a portable computing device for media content capture in connection with a karaoke-style presentation, on a multi-touch-sensitive display thereof, of temporally synchronized lyrics, pitch, and an audio track, the portable computing device configured with user interface facilities executable to provide (i) start/stop control of media segment capture and (ii) scrubbing interactions for temporal position control within a performance timeline; capturing at least one audio segment using the portable computing device; responsive to a first user gesture control on the multi-touch-sensitive display, entering one or more segments of lyrics and aligning the entered lyrics segments with the performance timeline; responsive to a second user gesture control on the multi-touch-sensitive display, moving forward or backward in a visually synchronized presentation of the performance timeline on the multi-touch-sensitive display; and, after the moving, capturing at least one further audio segment and pitch-detecting the captured audio segment to produce at least a portion of a pitch track.
In some cases or embodiments, the capture of the at least one audio segment is freestyle, without scrolling of either lyrics or pitch track. In some cases or embodiments, the captured freestyle audio segment includes performance-synchronized video. In some cases or embodiments, the captured freestyle audio segment includes either or both of instrumental accompaniment audio and vocal audio.
In some embodiments, the method further includes: responsive to a third user gesture control on the multi-touch-sensitive display, moving forward or backward in the visually synchronized presentation of the performance timeline on the multi-touch-sensitive display; and, after the moving, designating a subset of the lyrics for a first vocal part. In some embodiments, the method further includes posting the performance timeline as a collaboration request for capture and addition of vocal audio content of one or more vocalists on remote portable computing devices.
Drawings
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer generally to similar elements or features.
Fig. 1 depicts information flows amongst illustrative mobile phone-type portable computing devices in preparation of a group audiovisual performance using non-linear audiovisual capture and/or editing, in accordance with some embodiments of the present invention(s).
FIG. 2 depicts in somewhat greater detail an exemplary user interface with visually synchronized presentation of lyrics, pitch cues, and a scrubber in connection with a vocal capture session on a portable computing device.
FIG. 3 illustrates exemplary user interfaces in connection with vocal capture scroll behavior, wherein a current point in the presentation of the performance timeline and lyrics moves forward or backward in correspondence with user gestures on a touchscreen of the portable computing device.
Fig. 4 illustrates exemplary user interfaces in connection with a pause in vocal capture.
FIG. 5 illustrates a further exemplary user interface with a scrubber for moving forward or backward in correspondence with user gestures on a touchscreen of a portable computing device.
FIG. 6 illustrates a time-indexed traversal mechanism in accordance with some embodiments of the present invention(s).
Fig. 7 presents certain illustrative variations on the scrubbing mechanism(s) described with reference to certain of the preceding figures.
Fig. 8 illustrates use of a captured vocal performance as an audio seed to which a user adds video and ultimately updates the performance timeline to add or change flow or vocal part selections.
Fig. 9 depicts an illustrative sequence that includes additional multi-user collaboration aspects.
Fig. 10 depicts an illustrative sequence with multi-user collaboration in which video created or captured by a user serves as the initial seed performance.
FIG. 11 depicts exemplary special invite options that include a user's designation of a particular vocal part which a joiner is directed to sing or for which the joiner is to supply audio.
FIG. 12 depicts the freeform creation of an arrangement according to some embodiments of the invention.
FIG. 13 illustrates a short seed collaboration flow in accordance with some embodiments of the invention.
Figs. 14 and 15 illustrate exemplary techniques for capturing, coordinating, and/or mixing audiovisual content.
Figure 16 illustrates features of a mobile phone type device that may be used as a platform for executing a software implementation according to some embodiments of the invention.
FIG. 17 illustrates a system in which devices and associated service platforms may operate according to some embodiments of the invention.
Skilled artisans will appreciate that elements or features in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the illustrated elements or features may be exaggerated relative to other elements or features in an effort to improve understanding of embodiments of the present invention(s).
Detailed Description
Techniques have been developed to facilitate the capture, pitch correction, harmonization, encoding, and rendering of audiovisual performances. Vocal audio, together with performance-synchronized video, may be captured and coordinated with the audiovisual contributions of other users to form multi-performer, duet-style, or glee-club-style audiovisual performances. Non-linear capture and/or editing of individual segments or portions of a performance timeline allows freeform collaboration by multiple contributors, typically with independent and geographically dispersed audio and/or video capture. In some cases, audio and video may be captured separately and associated after capture. In some cases, the performances of individual users (audio, video, or, in some cases, performance-synchronized audio and video) are captured on mobile devices, television displays, and/or set-top box equipment in the context of a karaoke-style presentation of lyrics in correspondence with an audible rendering of an accompanying song or vocal performance. One contributor's captured audio, video, or audiovisual content may serve as the seed for a group performance.
Karaoke-style vocal performance capture
FIG. 1 depicts information flows amongst illustrative mobile phone-type portable computing devices (101A, 101B) and a content server 110, in accordance with some embodiments of the present invention(s). In the illustrated flows, lyrics 102, pitch cues 105, and an accompaniment track 107 are supplied to one or more of the portable computing devices (101A, 101B) to facilitate vocal (and, in some cases, audiovisual) capture. User interfaces of the respective devices provide scrubbers (103A, 103B) whereby a given user vocalist may, using gestures on a touchscreen, control forward and backward traversal of temporally synchronized content (e.g., audio, lyrics, pitch cues, etc.). In some cases, the scrubber controls also allow forward and backward traversal of performance-synchronized video.
Although embodiments of the present invention(s) are not so limited, mobile phone-hosted, pitch-corrected, karaoke-style vocal capture provides a useful descriptive context. For example, in some embodiments consistent with that illustrated in FIG. 1, iPhone™ handhelds available from Apple Inc. (or more generally, portable computing devices 101A, 101B) host software that executes in cooperation with content server 110 to provide vocal capture, often with continuous real-time, score-coded pitch correction and/or harmonization of the captured vocals. Performance-synchronized (or synchronizable) video may be captured using an on-board camera. In some embodiments, audio, video, and/or audiovisual content may be captured using, or in connection with, cameras configured with a television (or other audiovisual equipment) or a connected set-top box (not specifically shown in FIG. 1). In some embodiments, a conventional desktop/laptop computer may be suitably configured and may host an application or web application to support some of the functionality described herein.
Capture of a two-part performance is illustrated (e.g., as a duet in which audiovisual content 106A and 106B is captured from individual vocalists); however, persons of skill in the art having benefit of the present disclosure will appreciate that the techniques may also be employed for solo and larger multi-part performances. In general, audiovisual content may be posted, streamed, or may initiate or be captured in response to a collaboration request. In the illustrated configuration, content selection, group performance accretion, and dissemination of the captured audiovisual performances are all coordinated via content server 110. In the illustrated design, a content selection and performance accretion module 112 of content server 110 performs audio mixing and video stitching, while an audiovisual rendering/stream control module 113 supplies the group audiovisual performance mix 111 to downstream audiences. In other embodiments, peer-to-peer communications may be employed for at least some of the illustrated flows.
In some cases, a wireless local area network may support communications between an instance of portable computing device 101A, audiovisual and/or set-top box equipment, and a wide-area network gateway (not specifically shown), which in turn communicates with remote device 101B and/or content server 110. Although FIG. 1 depicts a configuration in which content server 110 plays an intermediary role between portable computing devices 101A and 101B, persons of skill in the art having benefit of the present disclosure will appreciate that peer-to-peer communications between devices 101A and 101B may also or alternatively be supported. Skilled persons will further appreciate that any of a variety of data communications facilities, including 802.11 Wi-Fi, Bluetooth™, 4G-LTE, 5G or other wireless communications, wired data networks, and/or wired or wireless audiovisual interconnects, may be employed, individually or in combination, to facilitate the communications and/or audiovisual renderings described herein.
As is typical of karaoke-style applications (such as the Smule app available from Smule, Inc.), a backing track of instrumentals and/or vocals may be audibly rendered for a user/vocalist to sing against. In such cases, lyrics (102A, 102B) may be displayed in correspondence with the local audible rendering to facilitate a karaoke-style vocal performance by a given user. Note that, in general, individual users may perform the same or different parts in a group performance, and the audio or audiovisual captures need not be, and typically are not, simultaneous. In some embodiments, the audio or audiovisual captures contributed by performers may be independent and asynchronous, often spanning time zones and continents. In other embodiments, however, livestreaming techniques may be employed. In the illustrated configuration of FIG. 1, lyrics, timing information, pitch and harmony cues, backing tracks (e.g., instrumentals/vocals), performance-coordinated video, and the like may all be sourced from a network-connected content server 110. In some cases or situations, backing audio and/or video may be rendered from a media store, such as a music library, resident on or accessible from the handheld, set-top box, content server, etc.
User vocals or audiovisual content 106A, 106B are captured at respective devices 101A, 101B, optionally pitch-corrected continuously and in real time (at the handheld or using computing facilities of audiovisual display and/or set-top box equipment, not specifically shown), and audibly rendered to provide the user with an improved tonal quality rendition of his/her own vocal performance. Pitch correction is typically based on score-coded note sets or cues (e.g., pitch and harmony cues 105), which provide a continuous pitch correction algorithm with a performance-synchronized sequence of target notes in a current key or scale. In addition to performance-synchronized melody targets, score-coded harmony note sequences (or sets) provide pitch-shifting algorithms with additional targets (typically coded as offsets relative to a lead melody note track and typically scored only for selected portions thereof) for pitch-shifting to harmony versions of the user's own captured vocals. In some embodiments, note/pitch targets and score-coded timing information may be used to evaluate vocal performance quality.
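By way of illustration, the following sketch computes the pitch-shift ratio needed to snap a detected vocal pitch onto the score-coded target note active at the capture time. It is a simplified stand-in, assuming a sorted note track and leaving the actual time-domain shifting (e.g., PSOLA- or phase-vocoder-style resynthesis) to an audio engine; all names are assumptions.

import Foundation

// Score-coded target note: active over [start, end), pitch as a MIDI number.
struct ScoredNote { let start: TimeInterval; let end: TimeInterval; let midi: Int }

func hz(fromMIDI m: Double) -> Double { 440.0 * pow(2.0, (m - 69.0) / 12.0) }

// Ratio (target Hz / detected Hz) to move the detected pitch onto the
// score-coded target at time t; 1.0 means no shift (unvoiced or no target).
func correctionRatio(detectedHz: Double, at t: TimeInterval,
                     track: [ScoredNote]) -> Double {
    guard detectedHz > 0,
          let note = track.first(where: { $0.start <= t && t < $0.end })
    else { return 1.0 }
    return hz(fromMIDI: Double(note.midi)) / detectedHz
}

let melody = [ScoredNote(start: 0.0, end: 1.5, midi: 60),   // C4
              ScoredNote(start: 1.5, end: 3.0, midi: 64)]   // E4
print(correctionRatio(detectedHz: 268.0, at: 0.7, track: melody)) // ~0.976, toward C4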
Lyrics 102, melody and harmony track note sets 105, and related timing and control information may be encapsulated in an appropriate container or object (e.g., in a Musical Instrument Digital Interface (MIDI) or JavaScript Object Notation (JSON) type format) for supply together with the backing track(s) 107. Using such information, portable computing devices 101A, 101B may display lyrics (102A, 102B), and even visual cues (105A, 105B) related to target notes, harmonies, and currently detected vocal pitch, in correspondence with an audible performance of the backing track(s) so as to facilitate a karaoke-style vocal performance by a user. Thus, if an aspiring vocalist selects "When I Was Your Man" as popularized by Bruno Mars, your_man.json and your_man.m4a may be downloaded from the content server (if not already available or cached based on a prior download) and used, in turn, to provide background music, synchronized lyrics and, in some situations or embodiments, score-coded note tracks for continuous, real-time pitch correction while the user sings. Optionally, at least for certain embodiments or genres, harmony note tracks may be score-coded for harmony shifts of the captured vocals.
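A sketch of what such a JSON score container might hold appears below, expressed as Swift Codable types. Field names and layout are assumptions; the actual your_man.json schema is not published. The key point is the shared time base that lets lyrics, note tracks, and sections scroll and scrub together.

import Foundation

// Assumed score container: every element carries a time (t) on one shared base.
struct Score: Codable {
    struct Lyric: Codable { let t: Double; let text: String }
    struct Note: Codable { let t: Double; let dur: Double; let midi: Int }
    struct Section: Codable { let t: Double; let label: String }
    let title: String
    let lyrics: [Lyric]
    let melody: [Note]
    let harmony: [Note]
    let sections: [Section]
}

let json = """
{"title": "Example Song",
 "lyrics": [{"t": 12.0, "text": "First line of the verse"}],
 "melody": [{"t": 12.0, "dur": 0.5, "midi": 62}],
 "harmony": [{"t": 12.0, "dur": 0.5, "midi": 65}],
 "sections": [{"t": 12.0, "label": "verse"}]}
""".data(using: .utf8)!
let score = try! JSONDecoder().decode(Score.self, from: json)
print(score.sections.first!.label)   // "verse"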
Typically, the captured pitch-corrected (and possibly harmonized) vocal performance, together with performance-synchronized video, is saved locally, on the handheld device or set-top box, as one or more audio or audiovisual files and is subsequently compressed and encoded for upload (106A, 106B) to content server 110 as an MPEG-4 container file. While MPEG-4 is an exemplary standard for coded representation and transmission of digital multimedia content for the Internet, mobile networks, and advanced broadcast applications, other suitable codecs, compression techniques, coding formats, and/or containers may be employed, if desired. Depending on the implementation, encodings of dry vocals and/or pitch-corrected vocals may be uploaded (106A, 106B) to content server 110. In general, such vocals (encoded, e.g., in MPEG-4 containers or otherwise), whether already pitch-corrected or pitch-corrected at content server 110, can then be mixed, e.g., with backing audio and other captured (and possibly pitch-shifted) vocal performances, to produce files or streams of quality or coding characteristics selected in accord with capabilities or limitations of a particular target or network. In some embodiments, audio processing and mixing and/or video synchronization and stitching may be performed at a server or service platform, such as content server 110, to supply a composited, multi-performer audiovisual work.
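The server-side mixdown can be pictured as below: a naive sketch that sums time-aligned vocal sample buffers over a backing track with per-track gains and hard-clips the result. Real pipelines decode MPEG-4/AAC and apply proper limiting; buffer layout and gain values here are assumptions.

import Foundation

// Sum time-aligned vocal buffers over the backing track, then clamp to [-1, 1].
func mix(backing: [Float], vocals: [[Float]],
         backingGain: Float = 0.8, vocalGain: Float = 0.9) -> [Float] {
    var out = backing.map { $0 * backingGain }
    for track in vocals {
        for i in 0..<min(out.count, track.count) {
            out[i] += track[i] * vocalGain
        }
    }
    return out.map { max(-1.0, min(1.0, $0)) }    // hard clip, for simplicity
}

let backing: [Float] = [0.1, -0.2, 0.3, 0.0]
let seedVocal: [Float] = [0.2, 0.1, -0.1, 0.4]
let joinerVocal: [Float] = [0.0, 0.3, 0.2, -0.2]
print(mix(backing: backing, vocals: [seedVocal, joinerVocal]))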
Non-linear segment capture and/or editing
FIG. 2 depicts in somewhat greater detail an exemplary user interface presentation of lyrics 102A, pitch cues 105A, and scrubber 103A in connection with a vocal capture session on portable computing device 101A (recall FIG. 1). The current vocal capture point is marked (281A, 281B, 281C) in each of several frames of reference (e.g., in lyrics 102A, in pitch cues 105A, and in an audio envelope depiction of the performance timeline within scrubber 103A). Any of a variety of marking techniques or symbologies may be employed. In general, the particular form of user interface markings and symbologies is a matter of design choice, but may include color cues (such as for word, line, or syllable position 281B within lyrics 102A), vertical or horizontal bar-type markers (see markings 281A, 281C in pitch cues 105A and scrubber 103A of the user interface presentation of FIG. 2), or other techniques.
As will be appreciated with reference to the figures and description that follow, the exemplary user interface presentation of FIG. 2 (and variations thereon) provides a mechanism whereby a user may, based on on-screen gestures, control movement forward and backward within the performance timeline. By manipulating current position 281C forward or backward within scrubber 103A, the vocal capture point is moved forward or backward in the performance timeline, respectively. Correspondingly, lyrics 102A and pitch cues 105A advance or roll back in visual synchrony. Likewise, position within the backing track and/or captured audio, video, or audiovisual content advances or rolls back. In this way, on-screen user interface manipulations by the user of portable computing device 101A move forward and backward and facilitate non-linear traversals of the performance timeline. For example, rather than initiating vocal, video, or audiovisual capture from the beginning of the performance timeline, or restarting from the most recent stop or pause position, the user may move forward or backward to an arbitrary point within the performance timeline. Re-recording, overdubbing, and/or selective capture of only particular sections or portions of a performance are facilitated by the non-linear access so provided. In some embodiments, non-linear access allows audio and video to be captured in separate passes.
Current position 281C within scrubber 103A, visually presented as an audio envelope of the performance timeline, may be manipulated laterally with leftward (backward in time) and rightward (forward in time) swipe-type gestures on the touchscreen display of portable computing device 101A. User interface gesture conventions are a matter of design choice, and other gestures may be employed to similar or complementary effect, if desired. In some embodiments, current position may also (or alternatively) be manipulated with gestures in the pitch track 105A or lyrics 102A panes of the display. In each case, presentations of the on-screen elements (e.g., pitch track 105A, lyrics 102A, and the audio envelope of the performance timeline) are visually synchronized such that forward or backward motion of one results in corresponding forward or backward motion of the others. If capture is initiated or restarted, each of the on-screen elements (e.g., pitch track 105A, lyrics 102A, and the audio envelope of the performance timeline) scrolls forward in temporal correspondence from the consistent and visually synchronized start point within the performance timeline. In embodiments or display modes that provide performance-synchronized video, video scrolling or capture may likewise be initiated from the visually synchronized start point within the performance timeline.
FIG. 3 illustrates a further exemplary user interface mechanism in connection with vocal capture scroll behavior, wherein a current point (281B) in the presentation of lyrics 102A and its corresponding point (281C) in the performance timeline move forward or backward in correspondence with user gestures on the touchscreen of portable computing device 101A (recall FIG. 1). While the illustrated embodiment provides an expanded presentation of lyrics 102A with pitch cues hidden, other embodiments may apportion screen real estate differently. In connection with the on-screen presentation of the lyrics, the user vocalist expresses user interface gestures for scrolling forward and backward within the lyrics using upward and downward motions on the touchscreen of portable computing device 101A. In some cases or embodiments, fine-grained (line-level, word-level, or syllable-level) movement within the lyrics, with visually synchronized traversal of other display features (e.g., the audio envelope of scrubber 103A), may be the preferred mechanism by which a user vocalist traverses the performance timeline during capture or re-capture. As before, touchscreen gestures provide synchronized movement within lyrics 102A and the performance timeline. In some embodiments, additional or alternative gesture expressions may be supported.
While the exemplary user interface features emphasize lyrics and pitch cues, elements of musical structure, such as sections, group parts, A/B parts of a duet, etc., may also be used to demark points in the performance timeline to which current position may be advanced or rolled back. In some cases or embodiments, such advances may be automated or scripted. In some cases, the user interface may support "seeking" to a next or prior point of musical structural significance, to a selected section or position, or to a pre-marked/labeled segment boundary.
FIG. 4 illustrates similar user interface features in connection with a pause in vocal capture, wherein a current point (281B) in the presentation of lyrics 102A and its corresponding point (281C) in the performance timeline move forward or backward in correspondence with user gestures on the touchscreen of portable computing device 101A (recall FIG. 1). While paused, an expanded presentation of the performance timeline is presented in scrubber 103A. As before, the current points in the presentation of lyrics 102A and in timeline scrubber 103A move forward or backward in correspondence with the user's upward or downward touchscreen gestures. Forward and backward movements of the on-screen presentation features (e.g., lyrics 102A and the performance timeline) are temporally synchronized. User selection of lyrics may be employed to designate a vocal part for a subsequent join, as well as to seed media content (e.g., audio and/or video) for a collaboration request.
FIG. 5 illustrates scrubbing using timeline scrubber 103A, wherein a current point (281C) therein and corresponding points (281B, 281A) in the presentations of lyrics 102A and pitch cues 105A move forward or backward in correspondence with user gestures on the touchscreen of portable computing device 101A (recall FIG. 1). Touchscreen gestures provide synchronized movement within lyrics 102A, pitch cues 105A, and the performance timeline. In some embodiments, additional or alternative gesture expressions may be supported.
FIG. 6 illustrates time-indexed traversal of computer-readable encodings of pitch track 605 and lyrics track 602 data in correspondence with forward and backward user interface gestures expressed by a user on a touchscreen display against an illustrative audio signal envelope computed from an accompanying song and/or captured vocals. In general, MIDI, JSON, or other suitable in-memory data representation formats may be employed for pitch, lyrics, musical structure, and other information pertinent to a given performance or musical arrangement. Persons of skill in the art having benefit of the present disclosure will appreciate that any of a variety of data structure indexing techniques may be employed to facilitate visually synchronized presentation of a position within the performance timeline, e.g., using displayed portions of scrubber 103A, lyrics 102A, and pitch cues 105A.
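One such indexing technique is a binary search over event start times, sketched below, which finds the lyric line, pitch note, or section active at a scrubbed-to instant in O(log n). Sorted start times are assumed, as would be natural for MIDI- or JSON-derived tracks; names are illustrative.

import Foundation

// Index of the last event whose start time is <= t, or nil if t precedes all.
func indexOfEvent(at t: Double, startTimes: [Double]) -> Int? {
    var lo = 0, hi = startTimes.count
    while lo < hi {
        let mid = (lo + hi) / 2
        if startTimes[mid] <= t { lo = mid + 1 } else { hi = mid }
    }
    return lo > 0 ? lo - 1 : nil
}

let lyricStarts = [0.0, 2.4, 4.1, 7.9]
print(indexOfEvent(at: 5.0, startTimes: lyricStarts) ?? -1)  // 2
print(indexOfEvent(at: -1.0, startTimes: lyricStarts) ?? -1) // -1 (before the start)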
FIG. 7 presents certain illustrative variations on the scrubbing mechanism(s) described with reference to the preceding figures. Specifically, in one illustrated variation, scrubbing is instead (or additionally) supported based on leftward and rightward gestures in the pitch cue presentation portion (105A) of the touchscreen. As before, movement within the lyrics (102A) and traversal of the audio signal envelope presentation of the performance timeline (103A) are visually synchronized with the pitch-cue-based scrubbing. Vocal parts of individual user vocalists (e.g., lyrics 701.1, 702.2) may be demarked in the performance timeline, such as with alternate coloring or other on-screen symbologies. Similar symbologies may be employed in the pitch cue 105A and timeline scrubber 103A portions of the user interface to identify duet (part A, part B) or group parts performed or to be performed by individual vocalists. In some cases or embodiments, user interface facilities may be provided to advance/roll back along the performance timeline to, or to select, points of musical structural significance. Examples include musical section boundaries, the upcoming start of a next part A (or part B) of a duet, a particular musical part that has been assigned to the user vocalist as part of a collaboration request, etc. Once a musical arrangement is loaded, user interface and scrubbing mechanisms in accordance with some embodiments of the present invention(s) allow the user to advance/roll back to, or indeed select, arbitrary or demarked points, portions, or segments within the arrangement for vocal, video, and/or audiovisual capture, re-capture, or playback using the visually synchronized presentations of the performance timeline, lyrics, or pitch portions.
FIG. 8 illustrates use of a captured vocal performance as an audio seed to which a user adds video and ultimately updates the performance timeline to add or change flow or vocal part selections. FIG. 9 depicts an illustrative sequence that includes further multi-user collaboration aspects. For example, after a first user (user A) captures a vocal performance as an audio seed, a second user (user B) joins user A's performance and adds audio and/or video media segments. In the illustrated sequence, user B also adds vocal part designations, such as by demarking particular lyrics as part B of a duet. From there, any number of potential joiners may be invited (e.g., as part of an open call) to add further media content to user A's initial audio seed, with the accreted audio and video and in accord with user B's vocal part designations.
FIG. 10 depicts a similar sequence with multi-user collaboration, but in which video created or captured by a first user (user A) is supplied as the initial seed performance. A second user (user B) joins user A's video and adds an audio segment, here captured vocal audio. User A then invites users (e.g., user B and others) to add further audio, here lead (melody) vocals and two additional vocal harmony parts. The result is a video with multiple audio layers accreted as a collaboration.
FIG. 11 depicts certain exemplary special invite options, including a user's designation of a particular vocal part for a subsequent join, and a user's selection of lyrics to designate the vocal part for a subsequent join so as to seed media content (e.g., audio and/or video). In each case, the joiner is directed to sing, or more generally to supply audio for, the designated vocal part.
Freeform and collaborative creation of arrangements is also envisioned. For example, as illustrated in step 1 of FIG. 12, a user (user A) may perform and capture a freeform-mode performance, e.g., acoustic guitar audio with performance-synchronized video. User A's initial freeform capture provides the initial seed for further collaboration. Next (in one illustrative flow), a user (e.g., user A or another user B) may enter lyrics (step 2) to accompany the audiovisual performance. The timeline editing and scrubbing facilities described herein may be particularly useful for entering lyrics, manipulating them, and aligning the entered lyrics with desired points in the performance timeline. Next (in the illustrated flow), a user (user A, B, or another user C) may assign (step 3) particular lyric portions to vocalists (e.g., part A versus part B of a duet). More generally, larger numbers of vocal parts may be assigned in a group arrangement.
An advanced feature of the freeform and collaborative arrangement creation flow, illustrated in step 4 of FIG. 12 for at least some embodiments, is provision of a pitch line capture facility whereby an audio track is captured and used to compute a pitch track relative to karaoke-style scrolling of the evolving performance timeline. In general, any of a variety of pitch detection techniques may be applied to compute a pitch track from the captured audio. Both vocal audio and musical instrument audio (e.g., from a piano) are envisioned. In either case, the computed pitch track is added to the performance timeline. Note that user-generated arrangements need not be limited to lyrics and pitch tracks. As examples (see step 5+), the media segment capture and edit platform may be extended to allow a user (user A, B, C, or another user D) to designate things such as song sections ("chorus", "verse", etc.), harmony parts, section-based video or audio effects/filters, and the like. It is also noted that, while the ordered flow of FIG. 12 is illustrative, other embodiments may reorder steps, omit steps, or include additional steps as appropriate for a particular freeform collaboration and a particular audio or audiovisual work.
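As a concrete (if deliberately naive) example of such pitch detection, the sketch below runs frame-wise autocorrelation over captured audio to estimate a fundamental frequency, emitting 0 for unvoiced frames; run over successive frames, its outputs form a pitch track. A production pitch tracker would use a more robust method (e.g., YIN) with overlapping windows; the frequency range and voicing threshold are assumptions.

import Foundation

// Naive autocorrelation pitch detector for one audio frame.
func detectPitch(frame: [Float], sampleRate: Double,
                 fMin: Double = 80, fMax: Double = 1000) -> Double {
    let minLag = max(1, Int(sampleRate / fMax)), maxLag = Int(sampleRate / fMin)
    guard frame.count > maxLag else { return 0 }
    var bestLag = 0
    var bestCorr: Float = 0
    for lag in minLag...maxLag {
        var corr: Float = 0
        for i in 0..<(frame.count - lag) { corr += frame[i] * frame[i + lag] }
        if corr > bestCorr { bestCorr = corr; bestLag = lag }
    }
    var energy: Float = 0
    for s in frame { energy += s * s }
    // Treat weakly periodic frames as unvoiced (threshold is an assumption).
    return (bestLag > 0 && bestCorr > 0.3 * energy) ? sampleRate / Double(bestLag) : 0
}

// A 440 Hz test tone at 16 kHz should come out near 440.
let sr = 16000.0
let frame = (0..<1024).map { Float(sin(2.0 * .pi * 440.0 * Double($0) / sr)) }
print(detectPitch(frame: frame, sampleRate: sr))   // ~444 (quantized to 16000/36)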
Short seeds and other variants
Although much of the foregoing description demonstrates the flexibility of the non-linear segment capture and editing techniques in the context of a complete performance timeline, those skilled in the art having benefit of this disclosure will appreciate that collaboration seeds may, but need not, span a complete audio (or audiovisual) work. In some cases, the seed may be a full-length seed that spans most or all of a pre-existing audio (or audiovisual) work and mixes in the seeding user's captured media content for at least some portions of that work. In other cases, a short seed may be employed that spans less than all (and in some cases far less than all) of the audio (or audiovisual) work. For example (as shown in fig. 13), a verse, chorus, hook, phrase, or other limited "block" of an audio (or audiovisual) work may constitute the seed for subsequent additions. The seeding user may select a pre-designated portion of the audio or audiovisual work 1301 (here, a musical section). The resulting short seed 1311 then serves as the seed for multiple collaborations (here, collaborations #1 and #2). Regardless of its extent or scope, the seed or seed portion defines a collaboration request (or call) for others to join. Typically, such calls invite other users to join the full-length or short seed by singing along, singing a particular vocal or musical part, singing a harmony or other duet part, speaking, clapping, recording video, adding a video clip from a camera roll, and the like. In the short-seed example of FIG. 13, invitations 1321 and 1322 are illustrative. The resulting group performance, whether full-length or only a single block, may be posted, livestreamed, or otherwise disseminated (1341) in a social network.
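In data terms, a short seed of the sort shown in fig. 13 amounts to little more than an in/out point over the underlying work plus a description of the invited contribution. The following Python sketch is a hedged illustration under that assumption; Seed, CollaborationCall, and their fields are hypothetical names, not an actual API:

```python
# Hedged sketch of seed and collaboration-call records; all names are
# illustrative assumptions rather than an actual API.
from dataclasses import dataclass

@dataclass
class Seed:
    work_id: str            # identifies the pre-existing audio/audiovisual work
    start_sec: float        # 0.0 for a full-length seed
    end_sec: float          # full duration for a full-length seed; a short
                            # "block" (verse, chorus, hook...) otherwise
    seed_media_uri: str     # the seeding user's captured media content

@dataclass
class CollaborationCall:
    seed: Seed
    invited_contribution: str    # e.g., "harmony", "duet part B", "clap", "video"
    open_to_public: bool = True  # an open call, per invitations 1321/1322

def make_short_seed(work_id: str, start_sec: float, end_sec: float,
                    media_uri: str) -> Seed:
    """Build a seed spanning just one pre-designated section of the work."""
    assert 0.0 <= start_sec < end_sec
    return Seed(work_id, start_sec, end_sec, media_uri)
```

A single Seed may back multiple CollaborationCall records, mirroring collaborations #1 and #2 in the figure.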
The seeding user may select the seed or seed portion using scrubbing techniques that allow audiovisual content, optionally including pitch cues, waveform- or envelope-style performance timelines, lyrics, video, and/or other temporally synchronized content, to be traversed both forward and backward during recording, editing, and/or playback. In this way, recapture of selected performance portions, coordination of group parts, and overdubbing may be facilitated. Direct scrubbing to any point in the performance timeline, lyrics, pitch cues, and other temporally synchronized content lets the user move easily through a capture or audiovisual editing session. For selections or embodiments involving short seeds, scrubbing may be employed to set the start and end points that demarcate a particular seed portion or block. Likewise, in the case of full-length seeds, scrubbing may be employed to set start and end points that demarcate the portions of the performance timeline to which joiners are invited to contribute.
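In gesture terms, marking a seed with a scrubber reduces to mapping touch positions to timeline offsets and ordering/clamping the resulting in/out points. A minimal sketch, assuming a simple linear position-to-time mapping (an assumption for illustration, not a described implementation):

```python
# Minimal sketch: map scrubber gestures to in/out points on a timeline.
def scrub_to_time(x_px: float, widget_width_px: float, duration_sec: float) -> float:
    """Map a horizontal touch position to a time offset in the timeline."""
    frac = min(max(x_px / widget_width_px, 0.0), 1.0)  # clamp to widget bounds
    return frac * duration_sec

def mark_seed_bounds(in_x: float, out_x: float, width_px: float,
                     duration_sec: float) -> tuple:
    """Return (start_sec, end_sec) for a scrubbed seed portion or block."""
    t1 = scrub_to_time(in_x, width_px, duration_sec)
    t2 = scrub_to_time(out_x, width_px, duration_sec)
    return (min(t1, t2), max(t1, t2))   # tolerate backward scrubbing
```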
In some cases, such as in short-seed captures with duet guidance, the user vocalist may be guided by group-part information presented in the performance timeline, lyrics, pitch cues, and other temporally synchronized content. The scrubber allows the user vocalist to move conveniently forward and backward through the temporally synchronized content. In some cases, temporally synchronized video capture and/or playback is supported in connection with the scrubber as well. Note that while scrubbing may be provided for synchronized traversal of multiple media lanes (e.g., backing audio, vocals, lyrics, pitch cues, and/or group-part information), single-media scrubbing is also contemplated.
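One way to picture synchronized multi-lane scrubbing is a single playhead that every temporally synchronized lane follows. The sketch below assumes each lane exposes a seek() method; both the class and that method are hypothetical:

```python
# Sketch: one playhead driving synchronized traversal of multiple media
# "lanes" (backing audio, vocals, lyrics, pitch cues, group-part info).
class SyncedLanes:
    def __init__(self, lanes, duration_sec: float):
        self.lanes = lanes              # lane objects exposing seek(seconds)
        self.duration_sec = duration_sec
        self.playhead_sec = 0.0

    def scrub(self, delta_sec: float) -> None:
        """Move forward (positive) or backward (negative); all lanes follow."""
        self.playhead_sec = min(max(self.playhead_sec + delta_sec, 0.0),
                                self.duration_sec)
        for lane in self.lanes:
            lane.seek(self.playhead_sec)

# Single-media scrubbing is simply the degenerate case of a one-lane list.
```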
Scrubbing techniques need not be employed in all cases or embodiments. Portions of the performance timeline (often portions corresponding to musical sections) may be marked and labeled for user selection. The labeling/marking may come from human or automated sources. For example, particular portions may be marked or tagged by the user who originally uploaded the backing track or corresponding lyrics, or by a media content curator. Additionally or alternatively, particular portions may be designated or labeled by machine-learning systems trained to recognize sections and boundaries (e.g., from backing audio or vocal tracks or lyrics), or based on crowd-sourced data, such as where users tend to sing most often or loudest. These and other variations will be appreciated by those skilled in the art having benefit of this disclosure.
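For the crowd-sourced variant, the labeling could be as simple as aggregating per-frame vocal energy across many joiners' captures and thresholding the result. The following sketch illustrates that idea only; the frame size, normalization, and threshold are assumptions:

```python
# Illustrative sketch of crowd-sourced section marking: find timeline spans
# where users tend to sing most or loudest. Parameters are assumptions.
import numpy as np

def popular_sections(vocal_energy_tracks, frame_sec=0.5, threshold=0.6):
    """vocal_energy_tracks: list of per-user arrays of per-frame vocal energy,
    all on the same frame grid. Returns a list of (start_sec, end_sec) spans."""
    mean_energy = np.mean(np.vstack(vocal_energy_tracks), axis=0)
    peak = np.max(mean_energy)
    if peak > 0:
        mean_energy = mean_energy / peak          # normalize to [0, 1]
    hot = mean_energy > threshold                 # frames sung most/loudest
    spans, start = [], None
    for i, flag in enumerate(hot):
        if flag and start is None:
            start = i                             # span opens
        elif not flag and start is not None:
            spans.append((start * frame_sec, i * frame_sec))
            start = None                          # span closes
    if start is not None:
        spans.append((start * frame_sec, len(hot) * frame_sec))
    return spans
```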
Exemplary audiovisual processing flows, devices, and systems
Figs. 14 and 15 illustrate exemplary techniques for capturing, coordinating, and/or mixing audiovisual content from geographically dispersed performers. In particular, fig. 14 is a flow diagram illustrating real-time, continuous, score-coded pitch correction and harmony generation for a captured vocal performance, in accordance with some embodiments of the present invention. In the illustrated configuration, a user/vocalist sings along with accompaniment music, karaoke style. Vocals captured (251) from a microphone input 201 are continuously pitch-corrected (252) and harmonized (255) in real time for mixing (253) with the accompaniment track audibly rendered at one or more acoustic transducers 202.
Both the pitch correction and the added harmonies are chosen to correspond to a musical score 207, which, in the illustrated configuration, is wirelessly communicated (261), together with lyrics 208 and an audio encoding of the accompaniment music 209, to the device(s) (e.g., from content server 110 to handheld device 101, recall fig. 1, or to a set-top box device) on which vocal capture and pitch correction are to be performed. In some embodiments of the techniques described herein, the note (in a current scale or key) closest to that sounded by the user/vocalist is determined based on the score 207. Although this closest note may typically be the main pitch corresponding to the score-coded vocal melody, it need not be. Indeed, in some cases the user/vocalist may intend to sing a harmony, and the sounded notes may more closely approximate a harmony track.
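As a hedged illustration of that "closest note" determination, the sketch below snaps a detected vocal pitch to the nearest note of a score-specified key and scale. The equal-temperament conversions are standard; the scale representation (semitone degrees above a root) is an assumption for illustration only:

```python
# Sketch: snap a detected pitch to the closest note permitted by a
# score-specified key/scale. The scale encoding here is an assumption.
import math

A4_HZ = 440.0

def hz_to_midi(f_hz: float) -> float:
    return 69.0 + 12.0 * math.log2(f_hz / A4_HZ)

def midi_to_hz(midi: float) -> float:
    return A4_HZ * 2.0 ** ((midi - 69.0) / 12.0)

def snap_to_scale(f_hz: float, scale_degrees=(0, 2, 4, 5, 7, 9, 11),
                  key_root: int = 0) -> float:
    """Return the frequency of the nearest in-scale note (major by default)."""
    midi = hz_to_midi(f_hz)
    candidates = [12 * octave + key_root + degree
                  for octave in range(0, 11)       # spans the audible range
                  for degree in scale_degrees]
    nearest = min(candidates, key=lambda note: abs(note - midi))
    return midi_to_hz(nearest)
```

A harmony variant would simply snap to score-coded harmony note targets rather than the melody scale.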
In some embodiments, capture of vocal audio and performance-synchronized video may be performed using facilities of a television-type display and/or set-top box device. In other embodiments, a handheld device (e.g., handheld device 301) may itself support capture of both vocal audio and performance-synchronized video. Accordingly, fig. 15 illustrates a basic signal processing flow (350), in accordance with some embodiments, for a mobile phone-type handheld device 301 that captures vocal audio and performance-synchronized video, generates pitch-corrected and optionally harmonized vocals for audible rendering (locally and/or at a remote target device), and communicates with a content server or service platform 310.
Based on the description herein, persons of ordinary skill in the art will appreciate suitable allocations of signal processing techniques (sampling, filtering, decimation, etc.) and data representations to functional blocks of software (e.g., decoder(s) 352, digital-to-analog (D/A) converter 351, capture 353, 353A, and encoder 355) executable to provide the signal processing flow 350 illustrated in fig. 15. Likewise, with respect to fig. 14, its signal processing flow 250, and illustrative score-coded note targets (including harmony note targets), persons of ordinary skill in the art will appreciate suitable allocations of signal processing techniques and data representations to functional blocks and signal processing constructs (e.g., decoder(s) 258, capture 251, digital-to-analog (D/A) converter 256, mixers 253, 254, and encoder 257) implemented at least in part as software executable on a handheld or other portable computing device.
As persons of ordinary skill in the art will recognize, pitch detection and pitch correction have a rich technological history in the music and voice-coding arts. A wide variety of feature-picking, time-domain, and even frequency-domain techniques have been employed in the art and may be employed in some embodiments in accordance with the present invention. In some embodiments, the pitch detection method computes an average magnitude difference function (AMDF) and executes logic to pick a peak corresponding to an estimate of the pitch period. Building on such estimates, pitch synchronous overlapped add (PSOLA) techniques are used to facilitate resampling of a waveform to produce a pitch-shifted variant while reducing aperiodic artifacts of splicing. Implementations based on AMDF/PSOLA techniques are described in greater detail in commonly owned U.S. Patent No. 8,983,829, entitled "COORDINATING AND MIXING VOCALS CAPTURED FROM GEOGRAPHICALLY DISTRIBUTED PERFORMERS," naming Cook, Lazier, Lieber, and Kirk as inventors.
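By way of illustration of the AMDF step only (PSOLA resampling is omitted), the following sketch estimates a pitch period as the deepest AMDF valley, equivalently a peak of the inverse function; the frame size, search range, and voicing threshold are assumptions, not values from the referenced patent:

```python
# Illustrative AMDF-based pitch estimator; parameters are assumptions.
import numpy as np

def estimate_pitch_amdf(frame, sample_rate=44100, fmin=80.0, fmax=1000.0):
    """Return an estimated pitch (Hz) for one audio frame, or None if unvoiced."""
    frame = frame - np.mean(frame)                  # remove DC offset
    lag_min = int(sample_rate / fmax)               # shortest plausible period
    lag_max = min(int(sample_rate / fmin), len(frame) - 1)
    # Average magnitude difference for each candidate lag (pitch period).
    amdf = np.array([np.mean(np.abs(frame[:-lag] - frame[lag:]))
                     for lag in range(lag_min, lag_max)])
    best = int(np.argmin(amdf))                     # deepest valley ~ period
    # Crude voicing check: the valley must sit well below the AMDF mean.
    if amdf[best] > 0.5 * np.mean(amdf):
        return None
    return sample_rate / (lag_min + best)
```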
FIG. 16 illustrates features of a mobile device that may serve as a platform for execution of software implementations in accordance with some embodiments of the invention. More specifically, fig. 16 is a block diagram of a mobile device 400 generally consistent with commercially available versions of the iPhone™ mobile digital device. Although embodiments of the invention are certainly not limited to iPhone deployments or applications (or even to iPhone-type devices), the iPhone device platform, together with its rich complement of sensors, multimedia facilities, application programming interfaces, and wireless application delivery model, provides a highly capable platform on which to deploy certain implementations. Based on the description herein, persons of ordinary skill in the art will appreciate a wide range of additional mobile device platforms that may be suitable (now or hereafter) for a given implementation or deployment of the inventive techniques described herein.
Summarized briefly, mobile device 400 includes a display 402 that may be sensitive to haptic and/or tactile contact by a user. The touch-sensitive display 402 may support multi-touch features, processing multiple simultaneous touch points, including processing data related to the pressure, degree, and/or position of each touch point. Such processing facilitates gestures and interactions using multiple fingers, among other interactions. Of course, other touch-sensitive display technologies may also be used, e.g., a display on which contact is made using a stylus or other pointing device.
Typically, mobile device 400 presents a graphical user interface on the touch-sensitive display 402, providing the user with access to various system objects and conveying information to the user. In some implementations, the graphical user interface may include one or more display objects 404, 406. In the example shown, the display objects 404, 406 are graphical representations of system objects. Examples of system objects include device functions, applications, windows, files, alerts, events, or other identifiable system objects. In some embodiments of the present invention, applications, when executed, provide at least some of the digital acoustic functionality described herein.
In general, mobile device 400 supports network connectivity, including, for example, mobile radio and wireless internetworking functionality, enabling a user to travel with mobile device 400 and its associated network-enabled functions. In some cases, mobile device 400 may interact with other devices in its vicinity (e.g., via Wi-Fi, Bluetooth, etc.). For example, mobile device 400 may be configured to interact with peer devices or a base station for one or more devices, and as such may grant or deny network access to other wireless devices.
Mobile device 400 includes a variety of input/output (I/O) devices, sensors, and transducers. For example, a speaker 460 and a microphone 462 are typically included to facilitate audio, such as the audible rendering of vocal performances and accompaniment tracks and the capture of vocal performances that are pitch-corrected and mixed as described elsewhere herein. In some embodiments of the invention, the speaker 460 and microphone 462 may serve as suitable transducers for the techniques described herein. An external speaker port 464 may be included to facilitate hands-free voice functionality, such as speakerphone functions. An audio jack 466 may also be included for use with headphones and/or a microphone. In some embodiments, an external speaker and/or microphone may serve as transducers for the techniques described herein.
Other sensors may also be used or provided. A proximity sensor 468 may be included to facilitate detection of user positioning of mobile device 400. In some implementations, an ambient light sensor 470 may be utilized to facilitate adjusting the brightness of the touch-sensitive display 402. An accelerometer 472 may be utilized to detect movement of mobile device 400, as indicated by directional arrow 474. Accordingly, display objects and/or media may be rendered according to a detected orientation (e.g., portrait or landscape). In some implementations, mobile device 400 may include circuitry and sensors supporting location-determination capability, such as that provided by the Global Positioning System (GPS) or other positioning systems (e.g., systems using Wi-Fi access points, television signals, cellular grids, Uniform Resource Locators (URLs)), to facilitate geocoding as described herein. Mobile device 400 also includes a camera lens and imaging sensor 480. In some implementations, instances of the camera lens and sensor 480 are located on the front and back surfaces of mobile device 400. The cameras allow capture of still images and/or video for association with captured pitch-corrected vocals.
Mobile device 400 may also include one or more wireless communication subsystems, such as an 802.11b/g/n/ac communication device and/or a Bluetooth™ communication device 488. Other communication protocols may also be supported, including other 802.x communication protocols (e.g., WiMax, Wi-Fi, 3G), fourth- or fifth-generation protocols and modulations (4G-LTE, 5G), code division multiple access (CDMA), global system for mobile communications (GSM), Enhanced Data GSM Environment (EDGE), and so on. A port device 490 (e.g., a Universal Serial Bus (USB) port, a docking port, or some other wired port connection) may be included and used to establish a wired connection to other computing devices, such as other communication devices 400, network access devices, a personal computer, a printer, or other processing devices capable of receiving and/or transmitting data. Port device 490 may also allow mobile device 400 to synchronize with a host device using one or more protocols, such as TCP/IP, HTTP, UDP, and any other known protocol.
Fig. 17 illustrates respective instances (501 and 520) of portable computing devices, such as mobile device 400, programmed with vocal audio and video capture code, user interface code, pitch correction code, an audio rendering pipeline, and playback code in accordance with the functional descriptions herein. Device instance 501 is depicted operating in a vocal audio and performance-synchronized video capture mode, while device instance 520 operates in a presentation or playback mode for a mixed audiovisual performance. Also depicted is television-type display and/or set-top box equipment 520A operating in a presentation or playback mode, although such equipment may also operate as part of a vocal audio and performance-synchronized video capture facility, as described elsewhere herein. Each of the aforementioned devices communicates, via wireless data transfer and/or over intervening networks 504, with a server 512 or service platform that hosts storage and/or functionality described herein with regard to content servers 110, 210. Captured, pitch-corrected vocal performances with performance-synchronized video captured using the techniques described herein may (optionally) be streamed and audiovisually rendered at a laptop computer 511.
Other embodiments
While the present invention has been described with reference to various embodiments, it will be understood that these embodiments are illustrative and that the scope of the invention is not limited to them. Many variations, modifications, additions, and improvements are possible. For example, although pitch-corrected vocal performances captured in accordance with a karaoke-style interface have been described, other variations will be appreciated. Furthermore, while certain illustrative signal processing techniques have been described in the context of certain illustrative applications, persons of ordinary skill in the art will recognize that it is straightforward to modify the described techniques to accommodate other suitable signal processing techniques and effects.
Embodiments in accordance with the present invention may take the form of, and/or be provided as, a computer program product encoded in a machine-readable medium as instruction sequences and other functional constructs of software, which may in turn be executed in a computational system (such as an iPhone handset, a mobile or portable computing device, a media application platform, a set-top box, or a content server platform) to perform the methods described herein. In general, a machine-readable medium can include tangible articles that encode information in a form (e.g., as applications, source or object code, functionally descriptive information, etc.) readable by a machine (e.g., a computing facility of a computer, a mobile or portable computing device, a media device or streamer, etc.), as well as non-transitory storage incident to transmission of the information. A machine-readable medium may include, but is not limited to: magnetic storage media (e.g., disk and/or tape storage); optical storage media (e.g., CD-ROM, DVD, etc.); magneto-optical storage media; read-only memory (ROM); random access memory (RAM); erasable programmable memory (e.g., EPROM and EEPROM); flash memory; or other types of media suitable for storing electronic instructions, operation sequences, functionally descriptive information encodings, and the like.
In general, multiple instances may be provided for a component, operation, or structure described herein as a single instance. Boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention. In general, structures and functionality presented as separate components in the exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the invention.

Claims (34)

1. A system, comprising:
a first media capture device and a second media capture device communicatively coupled via respective network communication interfaces to perform multi-performer collaboration with respect to a baseline media encoding of an audio work;
the first media capture device providing a user interface by which a first user thereof selects a seed portion of the audio work, the first media capture device being configured to capture at least vocal audio of the first user performed against an audible rendering, on the first media capture device, of at least a portion of the audio work; and
the second media capture device being configured to: (i) receive, via its network communication interface, an indication of the seed portion selected by the first user at the first media capture device, and (ii) capture media content of the second user performed against an audible rendering, on the second media capture device, of the seed portion mixed with the captured vocal audio of the first user.
2. The system of claim 1,
wherein the user interface of the first media capture device further allows the first user to specify one or more types of media content to be captured from a performance of the second user against the audible rendering, on the second media capture device, of the seed portion mixed with the captured vocal audio of the first user.
3. The system of claim 2, wherein the specified one or more types of media content to be captured are selected from the group consisting of:
vocal audio, vocal harmony, or a vocal duet part;
rap, spoken word, clapping, or percussion; and
video.
4. The system of claim 1 or 2, wherein,
wherein the user interface of the first media capture device further allows the first user to post the seed portion as a collaboration request to other geographically distributed users and media capture devices, including the second user, for capture and addition of further vocal audio, video, or performance-synchronized audiovisual content.
5. The system of claim 1, further comprising:
a service platform communicatively coupled to the first and second media capture devices, the service platform configured to supply a media encoding of the multi-performer collaboration of at least the first and second users for audible or audiovisual rendering on at least a third, communicatively coupled, device, the media encoding being based on the audio work but temporally limited to the seed portion thereof selected by the first user.
6. The system of claim 1, 2 or 5, further comprising:
a media content scrubber on the first media capture device, via which the first user marks start and end points in a performance timeline to define, and thereby select, the seed portion.
7. The system of claim 6, wherein the media content scrubber presents to the first user temporally synchronized representations of two or more of:
an audio envelope of accompaniment audio and/or vocals;
lyrics;
one or more pitch tracks; and
duet or other group-part notations.
8. The system of claim 1, 2 or 5, further comprising:
a user interface on the first media capture device by which the first user selects the seed portion from among pre-designated or marked portions of the audio work.
9. The system of claim 8, wherein the pre-designated or marked portion of the audio work is provided by a service platform communicatively coupled with the first media capture device and the second media capture device, the pre-designated or marked portion having been designated or marked based on one or more of:
a musical structure encoding of the audio work;
a machine learning algorithm applied to accompaniment audio, vocal audio, or lyrics of, or corresponding to, the audio work;
crowd-sourced data; and
data provided by a user uploader of the audio work or by a third party administrator thereof.
10. The system of claim 1, 2 or 5,
wherein the baseline media encoding of the audio work also encodes synchronized video content.
11. The system of claim 1, 2 or 5,
wherein the first media capture device is further configured to capture performance-synchronized video content.
12. The system of claim 1, 2 or 5,
wherein the first and second media capture devices are mobile phone-type portable computing devices executing application software that, in at least one mode of operation thereof, provides a karaoke-style presentation of a performance timeline, including lyrics, on a multi-touch sensitive display thereof in temporal correspondence with an audible rendering of the audio work, and captures vocals and/or performance-synchronized video of the respective first or second user via onboard audio and video interfaces of the respective mobile phone-type portable computing device.
13. A method, comprising:
capturing media segments using a portable computing device in conjunction with a karaoke-style presentation, on a multi-touch sensitive display thereof, of a performance timeline including lyrics and a pitch track synchronized with a music track;
assigning a subset of the lyrics to a joiner in response to a gesture control on the multi-touch sensitive display; and
posting the performance timeline, with the designation of the subset of the lyrics, as a collaboration request for capture and addition of further vocal audio content by a joining remote user on a second, remote portable computing device configured for further media segment capture in conjunction with the performance timeline.
14. The method of claim 13,
wherein the portable computing device is configured with user interface constructs executable to provide: (i) start/stop control of the media segment capture and (ii) scrubbing interaction for temporal position control within the performance timeline.
15. The method of claim 14, further comprising:
adding at least one media segment to the performance timeline beginning at a scrubbed-to first position in the performance timeline, the first position being neither the start of the performance timeline nor the most recent stop or pause position within the performance timeline.
16. The method of claim 15, further comprising:
capturing, at the portable computing device, vocal audio beginning at the scrubbed-to first position in the performance timeline and in correspondence with a karaoke-style presentation of at least synchronized lyrics and a pitch track on the multi-touch sensitive display,
wherein the added at least one media segment includes the captured vocal audio.
17. The method of claim 15, wherein the added at least one media segment comprises one or more of:
video or still images;
video captured at the portable computing device beginning at the scrubbed-to first position in the performance timeline and in correspondence with a karaoke-style presentation of lyrics and a pitch track on the multi-touch sensitive display and a synchronized audible rendering of the music track; and
performance-synchronized audiovisual content captured at the portable computing device.
18. The method of claim 15, further comprising:
saving a performance timeline including the added at least one media segment to a service platform coupled to the network.
19. The method of claim 15, further comprising:
retrieving a previously saved version of the performance timeline from a network-coupled service platform.
20. The method of claim 13, 15 or 16,
wherein the performance timeline is posted for the joining remote user via a service platform coupled to the network.
21. The method of claim 13, 15 or 16,
wherein the designation of the subset of the lyrics is responsive to a first user gesture control on the multi-touch sensitive display selecting a particular vocal part for the joiner.
22. The method of claim 13, 15 or 16,
wherein the designation of the subset of the lyrics is responsive to a second user gesture control on the multi-touch sensitive display demarcating the subset of the lyrics to which the joiner's further media segment is to correspond.
23. A method, comprising:
capturing media segments using a first portable computing device in conjunction with a karaoke-style presentation, on a multi-touch sensitive display thereof, of a performance timeline including lyrics and a pitch track synchronized with a music track,
wherein the performance timeline includes a subset of the lyrics designated by a prior user on a second, remote computing device, the designation of the subset of the lyrics at least partially parameterizing a collaboration request for capture and addition of further vocal audio content, on the first portable computing device, by a user joining the performance timeline.
24. The method of claim 23,
wherein the first portable computing device is configured with user interface constructs executable to provide: (i) start/stop control of the media segment capture, and (ii) scrubbing interaction for temporal position control within the performance timeline.
25. The method of claim 24, further comprising:
capturing at least one vocal audio media segment beginning at a scrubbed-to first position in the performance timeline, the first position being neither the start of the performance timeline nor the most recent stop or pause position within the performance timeline.
26. The method of claim 23 or 25, further comprising:
updating the performance timeline to include the captured at least one vocal audio media segment; and
posting the updated performance timeline for the joining remote user via a service platform coupled to the network.
27. The method according to claim 23 or 25,
wherein the designation of the subset of the lyrics is selective for a particular vocal part.
28. The method according to claim 23 or 25,
wherein the designation of the subset of the lyrics demarcates the lyrics to which the further vocal audio content of the user joining the performance timeline corresponds.
29. A method, comprising:
using a portable computing device to capture media content in conjunction with a karaoke-style presentation of synchronized lyrics, a pitch track, and a music track;
capturing at least one audio segment using the portable computing device, the portable computing device configured with user interface constructs executable on its multi-touch sensitive display to provide: (i) start/stop control of media segment capture, and (ii) scrubbing interaction for temporal position control within a performance timeline;
inputting one or more segments of the lyrics and aligning the input lyric segments with the performance timeline in response to a first user gesture control on the multi-touch sensitive display;
in response to a second user gesture control on the multi-touch sensitive display, moving forward or backward in a visually synchronized presentation of the performance timeline on the multi-touch sensitive display; and
after the moving, capturing at least one audio segment and performing pitch detection on the captured audio segment to produce at least a portion of a pitch track.
30. The method of claim 29,
wherein the capture of the at least one audio segment is free-form, without scrolling lyrics or a scrolling pitch track.
31. The method of claim 30,
wherein the captured free-form audio segment includes performance-synchronized video.
32. The method of claim 30, wherein the captured free-form audio segment includes either or both of:
instrumental accompaniment audio; and
vocal audio.
33. The method of any of claims 29 to 32, further comprising:
in response to a third user gesture control on the multi-touch sensitive display, moving forward or backward in a visually synchronized presentation of the performance timeline on the multi-touch sensitive display; and
after the moving, assigning a subset of the lyrics to a first vocal part.
34. The method of any of claims 29 to 32, further comprising:
posting the performance timeline as a collaboration request for capturing and adding vocal audio content of one or more singers on a remote portable computing device.
CN201980056174.0A 2018-06-29 2019-07-01 Audio-visual collaboration system and method with seed/join mechanism Pending CN113039573A (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US201862692129P 2018-06-29 2018-06-29
US62/692,129 2018-06-29
US16/418,659 US10943574B2 (en) 2018-05-21 2019-05-21 Non-linear media segment capture and edit platform
US16/418,659 2019-05-21
PCT/US2019/040113 WO2020006556A1 (en) 2018-06-29 2019-07-01 Audiovisual collaboration system and method with seed/join mechanic

Publications (1)

Publication Number Publication Date
CN113039573A true CN113039573A (en) 2021-06-25

Family

ID=68985818

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980056174.0A Pending CN113039573A (en) 2018-06-29 2019-07-01 Audio-visual collaboration system and method with seed/join mechanism

Country Status (4)

Country Link
EP (1) EP3815031A4 (en)
CN (1) CN113039573A (en)
WO (1) WO2020006556A1 (en)
ZA (1) ZA202100481B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019226681A1 (en) * 2018-05-21 2019-11-28 Smule, Inc. Non-linear media segment capture and edit platform

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8160421B2 (en) * 2006-12-18 2012-04-17 Core Wireless Licensing S.A.R.L. Audio routing for audio-video recording
US9276761B2 (en) * 2009-03-04 2016-03-01 At&T Intellectual Property I, L.P. Method and apparatus for group media consumption
US20110126103A1 (en) * 2009-11-24 2011-05-26 Tunewiki Ltd. Method and system for a "karaoke collage"
KR20140044003A (en) * 2012-10-04 2014-04-14 에스케이플래닛 주식회사 System and method for providing user created contents playing service
US20140149861A1 (en) 2012-11-23 2014-05-29 Htc Corporation Method of displaying music lyrics and device using the same
KR101562041B1 (en) * 2013-05-03 2015-10-23 주식회사 인코렙 Method for Producing Media Content of Duet Mode, Media Content Producing Device Used Therein
US9911403B2 (en) * 2015-06-03 2018-03-06 Smule, Inc. Automated generation of coordinated audiovisual work based on content captured geographically distributed performers

Also Published As

Publication number Publication date
ZA202100481B (en) 2022-07-27
EP3815031A1 (en) 2021-05-05
EP3815031A4 (en) 2022-04-27
WO2020006556A9 (en) 2020-03-12
WO2020006556A1 (en) 2020-01-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination