WO2016073205A1 - Collaborative video upload method and apparatus - Google Patents

Info

Publication number
WO2016073205A1
Authority
WO
WIPO (PCT)
Prior art keywords
recording
recordings
received
metadata
event
Prior art date
Application number
PCT/US2015/056742
Other languages
French (fr)
Inventor
Brian J. CROMARTY
Original Assignee
Thomson Licensing
Priority date
Filing date
Publication date
Application filed by Thomson Licensing filed Critical Thomson Licensing
Publication of WO2016073205A1 publication Critical patent/WO2016073205A1/en

Classifications

    • G11B27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/28 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information signals recorded by the same method as the main recording
    • H04N21/25841 Management of client data involving the geographical location of the client
    • H04N21/2665 Gathering content from different sources, e.g. Internet and satellite
    • H04N21/2743 Video hosting of uploaded data from client
    • H04N21/8126 Monomedia components involving additional data, e.g. news, sports, stocks, weather forecasts
    • H04N21/854 Content authoring

Definitions

  • the present invention relates to digital audio and video processing and three-dimensional (3D) video processing from multiple sources.
  • the present invention automatically combines user generated video from multiple sources into an edited multi-view video or an edited 3D video.
  • the proposed method and apparatus addresses the problem of non-professionally recorded videos by providing a method and apparatus for combining such non-professionally recorded videos (as well as the audio) from multiple sources into a single multi-camera recording (audio and video) of an event. Each video is tagged with a date and timestamp as well as a location.
  • the proposed method may also use tools such as video recognition to determine that multiple recordings originated from the same event.
  • the proposed method then automatically combines the best or selected views of the uploaded recording of the event into a multi-camera or 3D recording (audio and video) of the event.
  • the term video camera is used interchangeably with cell phone since most cell phones have digital video cameras embedded.
  • the proposed method and apparatus is not limited to video cameras and/or cell phones but is directed to any device that may have audio or video recording capability, such as GoogleGlass™, and the term video camera is used herein to encompass and include all such devices.
  • each person can record (audio and video) the content normally, and is given the option to upload metadata about their media, the media itself, or both the media and the metadata for collaboration.
  • Each recorded event is automatically tagged with time and location (GPS).
  • the tags are metadata.
  • the server examines the metadata to determine an approximate location (GPS) and approximate time. This metadata (information) is automatically attached to the recorded media (content) used by mobile devices and is fairly accurate.
  • the server can then combine clips from multiple users to generate a combined multi-camera (multi-view) video.
  • the server then combines the best or selected view of the recordings of the event to create a multi-view or 3D composite recording.
  • a method and apparatus for generation of a composite multi-camera recording including performing event clustering on received content to determine if the received content is a recording of an event for which recordings have been previously received, performing content organization and analysis, selecting content from among the received content and the previously received recordings of the event, performing content editing on the selected content and creating the composite multi-camera recording using the edited content. Also described are a method and apparatus for automatic generation of a composite recording including receiving a recording of an event, the recording including metadata, determining a location of the recorded event, determining a service provider in response to the determined location and combining the recording with content provided by the service provider to generate the composite recording.
  • Fig. 1 is an example of video recordings uploaded to a server and combined by the server to generate a multi-camera (multi-view) video.
  • Fig. 2 shows GoogleGlass™, which is just one video camera source.
  • Fig. 3 is a flowchart of an exemplary method of the present invention from the perspective of the individual users (clients).
  • Fig. 4 is a flowchart of an exemplary method of the present invention from the perspective of the video server.
  • Fig. 5 is a flowchart of an exploded view of an exemplary implementation of 425 of Fig. 4.
  • Fig. 6 is a diagram showing view groups.
  • Fig. 7 is a diagram showing undesirable audio.
  • Fig. 8 is a block diagram of an exemplary server in accordance with the principles of the present invention.
  • Fig. 9 is a flowchart of the alternative embodiment of the proposed method.
  • the terms "processor" or "controller" should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, read only memory (ROM) for storing software, random access memory (RAM), and nonvolatile storage.
  • any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
  • any element expressed as a means for performing a specified function is intended to encompass any way of performing that function including, for example, a) a combination of circuit elements that performs that function or b) software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the function.
  • the disclosure as defined by such claims resides in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for. It is thus regarded that any means that can provide those functionalities are equivalent to those shown herein.
  • the proposed method and apparatus facilitates anonymous collaboration by providing a recording and its metadata to a server indicating the availability of a video at a particular time and place.
  • the proposed method and apparatus is anonymous since the users recording the content need not know each other in order to retrieve content that other users make public; the server uses time, location and video analysis (such as video recognition) to determine if the recordings are of the same event.
  • a server may determine that multiple recordings are available at the same time and place and therefore a multi-view (multi-camera) recording may be generated.
  • the server retrieves the recording from all available sources and generates a multi-view (multi-camera) or 3D recording using the best or selected recordings. While creating (generating) a composite recording, the views must be unique to be visually interesting to a viewer.
  • the user takes a video of (records) an event.
  • Metadata is collected about the recording including time, place and user identification and seat number if available. Many venues such as stadiums or arenas have seating plans available online.
  • the metadata is transmitted to a server. This metadata may be transmitted automatically, in response to a prompt to the user, or in response to a user initiated action.
  • a server collects metadata from a plurality of users and determines that there may be multiple videos of the same event. This is determined in response to the time, place, seat number and other information and/or video recognition. It can reasonably be assumed that if multiple users are taking video at the same place and the same time, that the video is likely of the same event.
  • the server then initiates upload of the multiple recordings of the same event. This may occur automatically, when a certain condition exists, such as when WiFi is connected, or in response to prompting a user for permission. Another option is to upload metadata without uploading the recording (audio and video) at the same time. If another person is also attending or attended the event and requests collaboration and is informed that other users are already collaborating, this user may also be prompted to collaborate.
  • the seat number for venues such as stadiums and arenas may be useful in determining viewing angles. Viewing angles may be used by the server in deciding which of the uploaded recordings to use in generation of the composite multi-camera (multi-view) recording. For example, the server may review the videos (recordings) from two seats that are fairly close to each other and decide which of the two is better to use, or the server may take the best frames from each recording on a frame by frame basis.
  • a frame may be blurry in a first video recording but not blurry in a second video recording.
  • the resulting video may then be combined with a video recorded from a person sitting about 90 degrees away, based on analysis of the seat locations. Users may be given the option to view multiple camera angles and select the camera angle of most interest. This selection information can be used by the server in determining camera angles for the generated composite multi-angle (multi-view) or 3D video.
  • the server will then compile (generate, create) a composite multi-view or 3D recording (audio and video) that highlights that person or object. For example, in a graduation ceremony, a person may be interested in the commencement speaker as well as their graduate(s) or their favorite player(s) and the ball in a sporting event. An individual user may have to search all of the recordings (not just the composite recording) to catch a glimpse of the person or object of interest to the user.
  • the proposed method provides the user the opportunity to select and highlight a person or object of interest through still images or a menu.
  • the server may analyze a submitted (uploaded) recording and determine an object or person of interest that is prominently featured in the uploaded recording.
  • the server uses the analyses (described below) to determine which footage includes the particular person or object of interest.
  • the server then generates a composite recording including footage of the person or object of interest.
  • This composite recording may be in addition to an already generated composite recording.
  • the server may crop or stabilize this second (alternative) composite recording to highlight or center the person or object of interest. If no footage is available for a particular time period highlighting the selected person or object (such as when a player is on the bench) then other general footage of the event may be inserted to maintain a timeline for this second (alternative) recording.
  • Video technology is everywhere and inevitable. Users record events using either an application, such as that of the present invention, a phone video camera, or any video camera.
  • When the app is resident on a digital recording device, at the conclusion of recording, the user may be prompted to "Collaborate?" If the user chooses to collaborate, the recording is uploaded to a server on the internet, or to a video service such as YouTube, for example, and the metadata of the recording and the YouTube information is sent to servers of the present invention.
  • the recording may be optionally uploaded when WiFi is available etc. There may be an option that metadata is sent without uploading the recording. If someone else at the event requests a collaboration, the user may be prompted to collaborate and informed that there are a number of other contributors currently collaborating.
  • Multiple video streams may be used to generate a composite and/or 3D video of an event. Live events may be streamed using this technology. Viewers may be able to watch a simulcast from people's phones and/or a few installed cameras.
  • the composite (combined) recording (audio and video) with time and place information (metadata) may be combined with other applications such as GoogleMaps to create a time and place record.
  • a user of the web service could search back to a school play on May 14, 2014 to see multiple videos recorded and stored. The user may have never attended the event, or may have been at the event.
  • the proposed method would create a searchable video history where a user knows the date, time and place of an event. The user may also choose to watch recordings in a particular place at different times, such as the sunset at Mallory Square in Key West. Many people record this event every day and a remote user could watch several sunsets days, weeks, or years apart.
  • Fig. 3 is a flowchart of an exemplary implementation of the proposed method from the perspective of the individual users (clients).
  • a user records an event using a digital video recording device.
  • the digital video recording device also automatically generates metadata for the recording.
  • a test is performed to determine if the recording device is preconfigured to upload the recording. If the recording device is preconfigured to upload the recording then at 320 a test is performed to determine if a WiFi connection is available in order to upload the recorded video. If a WiFi connection is not available in order to upload the recording then processing continues at 320. If a WiFi connection is available in order to upload the recording then at 325 the recording and automatically generated metadata are uploaded. Processing then ends.
  • Fig. 4 is a flowchart of an exemplary method of the present invention from the perspective of the server.
  • the server receives a recording and the associated automatically generated metadata from a user (client).
  • the server analyzes the metadata to determine if the received recording is of the same event for which other recording(s) have already been received.
  • the received recording and its associated metadata are stored with recordings and metadata of the same event. If no other recordings have been received yet, the received recording and its associated metadata are simply stored. There is a tacit assumption that multiple recordings of the same event will be received.
  • the server determines the angles/views of the received recordings. Metadata may include seat numbers of events at large venues. Seat numbers will aid in the determination of angles.
  • the server generates a single composite multi-camera/multi-view video using the uploaded recordings and their associated metadata.
  • the server downloads the generated composite multi-camera/multi-view recording to the uploaders.
  • the downloading may be streaming or may merely be a notification that the generated composite multi-camera/multi-view recording is available online, together with a url to access the posted generated composite multi-camera/multi-view recording.
  • the server posts the generated single composite multi-camera/multi-view recording online.
  • a user may specify a person or object of interest. That person or object of interest may or may not be in any or all of the uploaded recordings. If the user specifies a person or object of interest then the server may use any or all or any combination of the following forms of analysis to determine if any or all of the uploaded recordings include the person or object of interest. The server may then automatically create a single composite multi-camera/multi-view recording for the user focused on the person or object of interest.
  • Image Comparison after Normalization: In image comparison after normalization, each image is first normalized, which is a process that changes the range of pixel intensity values. The purpose of normalization is to bring the image, or other type of signal, into a consistent dynamic range to permit statistical image comparison. Once the images are normalized, a Manhattan, Chebyshev or Euclidean distance can be determined between the two images and a statistical comparison performed (a sketch of this comparison appears after this list).
  • a color histogram is a representation of the distribution of colors in an image. For digital images, a color histogram represents the number of pixels that have colors in each of a fixed list of color ranges, that span the image's color space, the set of all possible colors.
  • comparison of color histograms can be performed by color histogram intersection, color constant indexing, cumulative color histogram, quadratic distance, and color correlograms.
  • color information is faster to compute compared to other invariants. It has been shown in some cases that color can be an efficient method for identifying objects of known location and appearance.
  • Image Retrieval through Object Comparison: For any object in an image, interesting points on the object can be extracted to provide a "feature description" of the object. This description, extracted from a first image or training image, can then be used to identify the object when attempting to locate the object in a second image or test image. To perform reliable recognition, it is important that the features extracted from the training image be detectable even under changes in image scale, noise and illumination. Such points usually lie on high-contrast regions of the image, such as object edges.
  • a facial recognition system is a method for automatically identifying or verifying a person from a digital image or a video frame from a video source. One of the ways to do this is by comparing selected facial features from the image and a facial database. In image recognition, in two images a face from each image can be compared to determine if the same face is in each image. Thus, one can determine that the image is of the same scene if the images were gathered at the same time and place.
  • Some facial recognition algorithms identify facial features by extracting landmarks, or features, from an image of the subject's face. For example, an algorithm may analyze the relative position, size, and/or shape of the eyes, nose, cheekbones, and jaw. These features are then used to search for other images with matching features. Other algorithms normalize a gallery of face images and then compress the face data, only saving the data in the image that is useful for face recognition. A probe image is then compared with the face data.
  • One of the earliest successful systems is based on template matching techniques applied to a set of salient facial features, providing a sort of compressed face representation.
  • Recognition algorithms can be divided into two main approaches, geometric, which looks at distinguishing features, or photometric, which is a statistical approach that distills an image into values and compares the values with templates to eliminate variances.
  • Popular recognition algorithms include Principal Component Analysis using eigenfaces, Linear Discriminant Analysis, Elastic Bunch Graph Matching using the Fisherface algorithm, the Hidden Markov model, Multilinear Subspace Learning using tensor representation, and neuronally motivated dynamic link matching.
  • Eigenfaces is the name given to a set of eigenvectors when they are used in the computer vision problem of human face recognition.
  • Saliency Detection aims at detecting the "most important" image regions or objects that represent the scene. This can be done by identifying fixation points, detecting a dominant object in the image, or using center-surround contrasts of units modeled on known properties.
  • the proposed method therefore, automatically creates (generates) a composite multi-camera/multi-view recording from a number of (anonymous) contributors, thereby increasing the quality and experience of the recording of the event.
  • When more than one contributor records the event from essentially the same vantage point, it may be undesirable to provide each of these recordings with equal time in the composite recording. For example, in an extreme example, assume that there are four contributors each recording a concert from four consecutive seats in a balcony and another (fifth) contributor recording the concert from a front row seat. A composite recording generated using all five recordings in an equal combination would result in a composite recording that appeared to be recorded 80% from the balcony and 20% from the front row.
  • the portion of the recording from the balcony seats may be perceived by the viewer to be the same. It would be better for the balcony as a whole to account for 50% of the composite recording, with that portion split 12.5% among each of the balcony contributors, or to choose the best recording from among the balcony contributors.
  • the proposed method for automatic generation of a single composite multi-camera/multi-view recording of an event includes a server performing event clustering - either using the metadata or by some form of audio and/or image analysis - content organization, content selection, content editing and content creation.
  • Fig. 5 is a flowchart of an exploded view of an exemplary implementation of 425 of Fig. 4.
  • the recording may be analyzed by any of the above described forms of image analysis to determine if the recording is of an event for which other recordings have already been received. That is, event clustering is effectively performed at 410.
  • the events may be plays, music concerts, high school or college graduations, etc.
  • a user may have a recording and wish to generate a composite recording for any number of reasons such as their recording was incomplete, has sound defects etc.
  • the user may upload as much metadata as possible, such as time, location, event name etc.
  • users would have to search the metadata of other recordings and hope to find one or more recordings of the same event in order to complete or repair their recording.
  • the proposed method examines and characterizes user uploaded recordings for metadata, audio and video characteristics etc. to locate distinguishing audio and/or video features (certain music, facial features, etc.).
  • the server then creates a profile of the user submitted (uploaded) recording using the metadata and distinguishing features.
  • the server uses the profile to search other characterized recordings for possible matching recordings. If possible matching recordings are located, then the server attempts to synchronize the recordings using audio and/or video to determine if the recordings are indeed of the same event. If recordings of the same event are located, the profile of each recording is augmented and corrected (updated) using data from each recording, and markers may also be added. Markers are added to show common points.
  • the server organizes and analyzes the recordings to determine the best views and the best quality. Recordings may also be analyzed with respect to focal point (focus) of the recording.
  • the server also determines the physical location or perspective of each camera that recorded the event. The location determination can be made by contributors providing their section number and seat number for the concert. This is done using metadata as well as image analysis. Metadata may include GPS data.
  • each view group is then weighted according to a weighting formula (a sketch of one such weighting appears after this list).
  • the four contributors are put into one view group and the front row contributor is put into another view group.
  • each group may be weighted equally, so the front row recording is used in the composite approximately 50% of the time and the four contributors from the balcony are together used a total of 50% of the time in the composite recording.
  • Each of the balcony contributors may be used 12.5% of the time, or one or more of the balcony contributors may be discarded because of poor recording quality.
  • the balcony contributors' recordings may be used to replace out-of-focus frames of the other contributors of the same view group, or to replace frames where the object or person of interest was lost.
  • the server thus, weights or discards views that are substantially similar (within "X" feet or "X" seats of each other) retaining the recording having the best possible angle and/or with the best quality recording. For example, if two people are sitting next to each other in a concert and both are recording the concert, the server may discard one of the recordings as redundant or weight each recording differently when generating the composite recording. Thus, the composite recording is created with dissimilar views.
  • V4 and V5 are grouped together in a view group.
  • Each of the other contributions (V1, V2, V3, V6) is in its own view group.
  • a composite recording may use each of V1, V2, V3, V6 for 20% of the time and each of V4 and V5 for 10% of the time.
  • recordings are selected to be included in the automatically generated single composite multi-camera/multi-view recording.
  • content editing is performed using the selected recordings. Content editing edits the content for color, blur, fuzzy edges etc. With multiple recordings from a number of cameras, the color may not always be exactly the same so the color must be edited so that the colors do not change when moving from one recording to another. One recording may have had a great or even best angle but one or more faces might be blurred or have the common red eye. Such deficiencies are edited out during the content editing phase.
  • Content editing may also include audio editing.
  • Audio editing: Often when crowd-sourced recordings are generated, one audio stream is selected as the audio for all sources in the composite recording. This may not be advantageous if the audio in the selected stream has interference, background noise or sounds not related to the object of interest being recorded. For example, at a concert, people talking near the recording device may distract viewers from the music being played.
  • the event clustering phase described above also analyzes and synchronizes the audio streams of the uploaded (received) recordings. When the server determines that there is common audio in the multiple sources (uploaded recordings), the server assumes that this is the audio from or of an object of interest (e.g., performer). When there is unique audio on one of the audio streams, the server assumes that this is undesirable and not from or of the object of interest.
  • the server can assume that audio track is not used in the composite recording for at least the duration of the detected undesirable audio. Once the server has determined which audio is desirable and which audio is undesirable, the server can use the desirable audio from multiple sources to generate the composite recording (a sketch of this common/unique audio analysis appears after this list). Thus, spatial sound relationships can be generated, audio levels can be adjusted etc. such that an enhanced audio layer can be generated to be presented (synchronized) with the composite recording.
  • Fig. 8 is a block diagram of an exemplary embodiment of the proposed apparatus (server) for generation of a composite recording using crowd-sourced recordings of an event.
  • the proposed composite recording generation server includes a communications interface.
  • the communications interface is in bi-directional communication with the processor and interfaces with the user.
  • the communications interface can handle wired line or wireless communications.
  • the communications interface accepts (receives) the recordings to be analyzed.
  • the input may be by downloading or streaming depending on the format of the source.
  • the interface with the user is via any recording device or a display device having a keyboard and/or graphical user interface.
  • the generated composite recordings (files) can be exported via the communications interface.
  • the received recordings are forwarded to the processor, which stores the received content in a database (labeled "Content Storage") and performs analysis, organization, selection and editing as described above, and stores the resulting analysis, profiles, etc. (files) in a database of a storage system shown as "Content Storage".
  • the processor then creates the composite multi-camera recording. That is, the processor analyzes the metadata to determine if the metadata is for the event for which recordings have already been received and analyzes the recording associated with the metadata to determine if the recording is for the event for which recordings have already been received. The processor then stores the received content with recordings of the event which have already been received.
  • the processor creates a characterizing profile of the received recording, searches other characterized recordings for possible matching recordings using the characterizing profile of the received recording, synchronizes the received recording with the matching recordings, augments characterizing profiles of the recording and the matching recordings and adds markers to the received recording at common points between the received recording and the matching recordings.
  • the processor also analyzes the received recording and the matching recordings to determine recordings with the best views and best quality, analyzes the received recording and the matching recordings to determine focal points of the received recording, analyzes the received recording and the matching recordings to determine the physical camera location, places each recording into a view group based on camera location, and weights each view group.
  • Content selection further comprises the processor selecting recordings for inclusion in the generated composite multi-camera recording from the view groups with the best views or recordings from the view groups with the best quality.
  • Code to direct the processor is stored in the storage system.
  • the code to direct the processor may be in a single composite recording generation module or separate modules for content organization and analysis, content selection, content editing and creation of the composite multi-camera recording.
  • the storage system may include any type of memory including disks, optical disks, tapes, hard drives, CDs, DVDs, flash drives, cloud memory, core memory, any form of RAM or any equivalent type of memory or storage device.
  • the code similarly may be stored on any type of memory including disks, optical disks, tapes, hard drives, CDs, DVDs, flash drives, cloud memory, core memory, any form of RAM or any equivalent type of memory or storage device.
  • the code and the storage for the composite recording generation of the proposed method may be on separate storage devices or on a single storage device or on the same storage device in separate partitions.
  • the storage system of the proposed apparatus is a tangible and non-transitory computer readable medium.
  • a user may be in a venue that also provides video services.
  • a user may be in a Disney Park and record an event at the park.
  • the event may be a Disney planned event, such as an Independence Day celebration, or a family birthday or family reunion.
  • the user may upload his/her recording of the event including metadata.
  • the location of the recording may be determined using the metadata.
  • Disney for example, may have a plurality of service providers depending upon in which venue the recording was made. There may be a general or generic website to which to upload a recording and the general or generic website would then forward it to the one of several service providers within the Disney park system that would have pre-recorded content to add to the uploaded recording.
  • the service provider - in the above example the service provider is Disney - would use analysis of an audio component of the recording or analysis of a video component of the recording or a combination of the metadata, the audio component of the recording and the video component of the recording to determine the location of the recording.
  • Disney parks have many different venues.
  • the user's uploaded recording could then be combined on the front end and/or on the back end with generic recorded material supplied by the service provider, e.g., Disney to generate a composite recording.
  • the pre-recorded generic video of the venue supplied by the service provider enhances the user's (amateur) recording with professionally recorded content to give the illusion at least that the generated recording is a multi-camera perspective.
  • Fig. 9 is a flowchart of the alternative embodiment of the proposed method.
  • a recording with metadata is received.
  • a location is determined.
  • a service provider is determined responsive to the determined location.
  • the service provider combines the uploaded recording with pre-recorded professionally recorded content of the venue to generate a composite recording.
  • the composite video is transmitted back to the user.
  • the present invention may be implemented in various forms of hardware, software, firmware, special purpose processors, or a combination thereof.
  • Special purpose processors may include application specific integrated circuits (ASICs), reduced instruction set computers (RISCs) and/or field programmable gate arrays (FPGAs).
  • the present invention is implemented as a combination of hardware and software.
  • the software is preferably implemented as an application program tangibly embodied on a program storage device.
  • the application program may be uploaded to, and executed by, a machine comprising any suitable architecture.
  • the machine is implemented on a computer platform having hardware such as one or more central processing units (CPU), a random access memory (RAM), and input/output (I/O) interface(s).
  • the computer platform also includes an operating system and microinstruction code.
  • the various processes and functions described herein may either be part of the microinstruction code or part of the application program (or a combination thereof), which is executed via the operating system.
  • various other peripheral devices may be connected to the computer platform such as an additional data storage device and a printing device.
  • the elements shown in the figures may be implemented in various forms of hardware, software or combinations thereof. Preferably, these elements are implemented in a combination of hardware and software on one or more appropriately programmed general-purpose devices, which may include a processor, memory and input/output interfaces.
  • the phrase "coupled" is defined to mean directly connected to or indirectly connected with through one or more intermediate components. Such intermediate components may include both hardware and software based components.
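The image comparison after normalization described in the list above can be pictured with a short sketch. The snippet below is a minimal illustration, not part of the patent disclosure; it assumes the two frames are supplied as same-sized NumPy arrays, and the similarity threshold in the comment is arbitrary.

```python
import numpy as np

def normalize(img):
    """Rescale pixel intensities to [0, 1] so frames captured with
    different exposure settings can be compared statistically."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)

def image_distance(img_a, img_b, metric="euclidean"):
    """Distance between two normalized, same-sized images."""
    a, b = normalize(img_a).ravel(), normalize(img_b).ravel()
    diff = a - b
    if metric == "manhattan":
        return np.abs(diff).sum()
    if metric == "chebyshev":
        return np.abs(diff).max()
    return np.sqrt((diff ** 2).sum())  # Euclidean by default

# Example (threshold chosen arbitrarily for illustration):
# same_scene = image_distance(frame1, frame2) / frame1.size < 0.05
```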
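The balcony/front-row weighting example can likewise be sketched. This is only one plausible scheme consistent with the description above (equal weight per view group, split evenly inside each group); the seat-gap threshold and the contributor data layout are assumptions made for illustration.

```python
def group_views(contributors, max_seat_gap=2):
    """Cluster contributors into view groups: same section and seats
    within max_seat_gap of the previous member belong together."""
    ordered = sorted(contributors, key=lambda c: (c["section"], c["seat"]))
    groups, current = [], []
    for c in ordered:
        if current and (c["section"] != current[-1]["section"]
                        or c["seat"] - current[-1]["seat"] > max_seat_gap):
            groups.append(current)
            current = []
        current.append(c)
    if current:
        groups.append(current)
    return groups

def weight_contributors(groups):
    """Equal weight per view group, split evenly inside each group."""
    weights = {}
    for group in groups:
        share = 1.0 / len(groups) / len(group)
        for c in group:
            weights[c["id"]] = share
    return weights

# Four adjacent balcony seats plus one front-row seat: the front-row
# contributor gets 50% and each balcony contributor gets 12.5%.
contributors = [
    {"id": "balcony1", "section": "balcony", "seat": 11},
    {"id": "balcony2", "section": "balcony", "seat": 12},
    {"id": "balcony3", "section": "balcony", "seat": 13},
    {"id": "balcony4", "section": "balcony", "seat": 14},
    {"id": "front",    "section": "floor",   "seat": 1},
]
print(weight_contributors(group_views(contributors)))
```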
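The common-versus-unique audio test described in the audio editing items above can be roughed out as follows. The sketch assumes two or more mono streams that have already been time-aligned during event clustering and resampled to a common rate; the one-second window and the correlation threshold are arbitrary choices, not values from the disclosure.

```python
import numpy as np

def flag_unique_audio(streams, rate=44100, window_s=1.0, corr_thresh=0.5):
    """For each time-aligned mono stream (1-D NumPy array), flag windows
    whose audio correlates poorly with the average of the other streams.
    Such windows likely contain local noise (e.g. nearby talking) rather
    than audio of the object of interest, and can be excluded from the
    composite audio track. Requires at least two streams."""
    win = int(rate * window_s)
    n = min(len(s) for s in streams) // win
    flags = np.zeros((len(streams), n), dtype=bool)
    for w in range(n):
        sl = slice(w * win, (w + 1) * win)
        chunks = [s[sl] - s[sl].mean() for s in streams]
        for i, chunk in enumerate(chunks):
            others = np.mean([c for j, c in enumerate(chunks) if j != i], axis=0)
            denom = np.linalg.norm(chunk) * np.linalg.norm(others)
            corr = float(chunk @ others / denom) if denom else 0.0
            flags[i, w] = corr < corr_thresh  # poorly correlated -> "unique" audio
    return flags
```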

Abstract

A method and apparatus for generation of a composite multi-camera recording including performing event clustering on received content to determine if the received content is a recording of an event for which recordings have been previously received, performing content organization and analysis, selecting content from among the received content and the previously received recordings of the event, performing content editing on the selected content and creating the composite multi-camera recording using the edited content. Also described are a method and apparatus for automatic generation of a composite recording including receiving a recording of an event, the recording including metadata, determining a location of the recorded event, determining a service provider in response to the determined location and combining the recording with content provided by the service provider to generate the composite recording.

Description

COLLABORATIVE VIDEO UPLOAD METHOD AND APPARATUS
FIELD OF THE INVENTION
The present invention relates to digital audio and video processing and three-dimensional (3D) video processing from multiple sources. The present invention automatically combines user generated video from multiple sources into an edited multi-view video or an edited 3D video.
BACKGROUND OF THE INVENTION
This section is intended to introduce the reader to various aspects of art, which may be related to the present embodiments that are described below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it should be understood that these statements are to be read in this light.
Almost everyone carries a video camera or a phone with a digital camera in their pocket or purse. Most of these video cameras are high definition video cameras. People will soon be wearing video cameras such as GoogleGlass™. People will be saving all of the resulting video. Many people record events using their video cameras and upload them to YouTube or Facebook or the like. These videos are of limited interest and limited use because the video is recorded from one vantage point and is often poor quality despite the use of high definition video cameras. The video frames go in and out of focus or are poorly framed. The audio is also often interrupted by other ambient noise in the area or people talking. People view television and receive edited multi-view video. That is, the vantage point changes to give the video additional interest. Despite problems with nonprofessional audio and video recordation, many important events (youth sporting events, birthdays, anniversaries, etc.) in the life of an average person and their family are recorded using cell phone cameras.
SUMMARY OF THE INVENTION
The proposed method and apparatus addresses the problem of non-professionally recorded videos by providing a method and apparatus for combining such non-professionally recorded videos (as well as the audio) from multiple sources into a single multi-camera recording (audio and video) of an event. Each video is tagged with a date and timestamp as well as a location. The proposed method may also use tools such as video recognition to determine that multiple recordings originated from the same event. The proposed method then automatically combines the best or selected views of the uploaded recordings of the event into a multi-camera or 3D recording (audio and video) of the event.
As used herein, the term video camera is used interchangeably with cell phone since most cell phones have digital video cameras embedded. The proposed method and apparatus is not limited to video cameras and/or cell phones but is directed to any device that may have audio or video recording capability, such as GoogleGlass™, and the term video camera is used herein to encompass and include all such devices.
For example, if multiple people are filming an event, such as a high school play, each person can record (audio and video) the content normally, and is given the option to upload metadata about their media, the media itself, or both the media and the metadata for collaboration. Each recorded event is automatically tagged with time and location (GPS). The tags are metadata. The server examines the metadata to determine an approximate location (GPS) and approximate time. This metadata (information) is automatically attached to the recorded media (content) used by mobile devices and is fairly accurate. The server can then combine clips from multiple users to generate a combined multi-camera (multi-view) video. The server then combines the best or selected views of the recordings of the event to create a multi-view or 3D composite recording.
A method and apparatus for generation of a composite multi-camera recording including performing event clustering on received content to determine if the received content is a recording of an event for which recordings have been previously received, performing content organization and analysis, selecting content from among the received content and the previously received recordings of the event, performing content editing on the selected content and creating the composite multi-camera recording using the edited content. Also described are a method and apparatus for automatic generation of a composite recording including receiving a recording of an event, the recording including metadata, determining a location of the recorded event, determining a service provider in response to the determined location and combining the recording with content provided by the service provider to generate the composite recording.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is best understood from the following detailed description when read in conjunction with the accompanying drawings. The drawings include the following figures briefly described below:
Fig. 1 is an example of video recordings uploaded to a server and combined by the server to generate a multi-camera (multi-view) video.
Fig. 2 shows GoogleGlass™, which is just one video camera source.
Fig. 3 is a flowchart of an exemplary method of the present invention from the perspective of the individual users (clients).
Fig. 4 is a flowchart of an exemplary method of the present invention from the perspective of the video server.
Fig. 5 is a flowchart of an exploded view of an exemplary implementation of 425 of Fig. 4.
Fig. 6 is a diagram showing view groups.
Fig. 7 is a diagram showing undesirable audio.
Fig. 8 is a block diagram of an exemplary server in accordance with the principles of the present invention.
Fig. 9 is a flowchart of the alternative embodiment of the proposed method.
It should be understood that the drawing(s) are for purposes of illustrating the concepts of the disclosure and are not necessarily the only possible configuration for illustrating the disclosure.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present description illustrates the principles of the present disclosure. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the disclosure and are included within its scope.
All examples and conditional language recited herein are intended for educational purposes to aid the reader in understanding the principles of the disclosure and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions.
Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosure, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
Thus, for example, it will be appreciated by those skilled in the art that the block diagrams presented herein represent conceptual views of illustrative circuitry embodying the principles of the disclosure. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudocode, and the like represent various processes which may be substantially represented in computer readable media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
The functions of the various elements shown in the figures may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term "processor" or "controller" should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, read only memory (ROM) for storing software, random access memory (RAM), and nonvolatile storage.
Other hardware, conventional and/or custom, may also be included. Similarly, any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
In the claims hereof, any element expressed as a means for performing a specified function is intended to encompass any way of performing that function including, for example, a) a combination of circuit elements that performs that function or b) software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the function. The disclosure as defined by such claims resides in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for. It is thus regarded that any means that can provide those functionalities are equivalent to those shown herein.
The proposed method and apparatus facilitates anonymous collaboration by providing a recording and its metadata to a server indicating the availability of a video at a particular time and place. The proposed method and apparatus is anonymous since the users recording the content need not know each other in order to retrieve content that other users make public; the server uses time, location and video analysis (such as video recognition) to determine if the recordings are of the same event. A server may determine that multiple recordings are available at the same time and place and therefore a multi-view (multi-camera) recording may be generated. The server then retrieves the recordings from all available sources and generates a multi-view (multi-camera) or 3D recording using the best or selected recordings. While creating (generating) a composite recording, the views must be unique to be visually interesting to a viewer. Therefore, it is desirable to not use identical video feeds or very similar video feeds when generating the composite recording. The user takes a video of (records) an event. Metadata is collected about the recording including time, place and user identification and seat number if available. Many venues such as stadiums or arenas have seating plans available online. The metadata is transmitted to a server. This metadata may be transmitted automatically, in response to a prompt to the user, or in response to a user initiated action.
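As one way to picture the time-and-place matching described above, the following sketch clusters uploads whose GPS coordinates and timestamps fall within a chosen radius and time window. It is an illustration only; the 200 m radius, 30 minute window, field names and the use of the haversine formula are assumptions rather than values taken from the disclosure.

```python
import math
from datetime import timedelta

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS points, in metres."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def same_event(meta_a, meta_b, radius_m=200, window=timedelta(minutes=30)):
    """Heuristic event clustering from metadata alone: two uploads are
    candidates for the same event when they were recorded close together
    in both space and time. Audio/video analysis can then confirm the match."""
    close_in_space = haversine_m(meta_a["lat"], meta_a["lon"],
                                 meta_b["lat"], meta_b["lon"]) <= radius_m
    close_in_time = abs(meta_a["start"] - meta_b["start"]) <= window
    return close_in_space and close_in_time
```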
Users do not need to record an event using a particular application (app) and may decide to submit their recordings for combination later. This is helpful if a user determines their recording is missing some information or multiple users decide after an event to combine/link their recordings.
A server collects metadata from a plurality of users and determines that there may be multiple videos of the same event. This is determined in response to the time, place, seat number and other information and/or video recognition. It can reasonably be assumed that if multiple users are taking video at the same place and the same time, that the video is likely of the same event.
The server then initiates upload of the multiple recordings of the same event. This may occur automatically, when a certain condition exists, such as when WiFi is connected, or in response to prompting a user for permission. Another option is to upload metadata without uploading the recording (audio and video) at the same time. If another person is also attending or attended the event and requests collaboration and is informed that other users are already collaborating, this user may also be prompted to collaborate.
The seat number for venues such as stadiums and arenas may be useful in determining viewing angles. Viewing angles may be used by the server in deciding which of the uploaded recordings to use in generation of the composite multi-camera (multi-view) recording. For example, the server may review the videos (recordings) from two seats that are fairly close to each other and decide which of the two is better to use, or the server may take the best frames from each recording on a frame by frame basis. A frame may be blurry in a first video recording but not blurry in a second video recording. The resulting video may then be combined with a video recorded from a person sitting about 90 degrees away, based on analysis of the seat locations. Users may be given the option to view multiple camera angles and select the camera angle of most interest. This selection information can be used by the server in determining camera angles for the generated composite multi-angle (multi-view) or 3D video.
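One simple way to make the frame-by-frame choice between a blurry and a sharp frame is to compare a focus measure such as the variance of a Laplacian-filtered frame. The sketch below is illustrative only; the Laplacian kernel, the grayscale input and the "higher variance means sharper" heuristic are assumptions about one possible implementation, not a method prescribed by the disclosure.

```python
import numpy as np

LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=np.float64)

def sharpness(gray_frame):
    """Focus measure: variance of the Laplacian response. Blurry frames
    have weak edges, so their response variance is low."""
    f = gray_frame.astype(np.float64)
    resp = np.zeros_like(f)
    # straightforward 3x3 correlation, ignoring the 1-pixel border
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            k = LAPLACIAN[dy + 1, dx + 1]
            if k:
                resp[1:-1, 1:-1] += k * f[1 + dy:f.shape[0] - 1 + dy,
                                          1 + dx:f.shape[1] - 1 + dx]
    return resp[1:-1, 1:-1].var()

def pick_sharper(frame_a, frame_b):
    """Return the frame from whichever recording is less blurry."""
    return frame_a if sharpness(frame_a) >= sharpness(frame_b) else frame_b
```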
Users may also select a particular person or object of interest in the recording. The server will then compile (generate, create) a composite multi-view or 3D recording (audio and video) that highlights that person or object. For example, in a graduation ceremony, a person may be interested in the commencement speaker as well as their graduate(s) or their favorite player(s) and the ball in a sporting event. An individual user may have to search all of the recordings (not just the composite recording) to catch a glimpse of the person or object of interest to the user. The proposed method provides the user the opportunity to select and highlight a person or object of interest through still images or a menu. In the alternative, the server may analyze a submitted (uploaded) recording and determine an object or person of interest that is prominently featured in the uploaded recording. The server then uses the analyses (described below) to determine which footage includes the particular person or object of interest. The server then generates a composite recording including footage of the person or object of interest. This composite recording may be in addition to an already generated composite recording. During content editing described below, the server may crop or stabilize this second (alternative) composite recording to highlight or center the person or object of interest. If no footage is available for a particular time period highlighting the selected person or object (such as when a player is on the bench) then other general footage of the event may be inserted to maintain a timeline for this second (alternative) recording.
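One plausible way for a server to check whether a frame contains a user-selected person or object of interest is local-feature matching, in the spirit of the "feature description" analysis discussed later in this document. The sketch below uses OpenCV's ORB detector as a stand-in; the library choice, the match-count threshold and the descriptor-distance cutoff are assumptions made for illustration, not details from the disclosure.

```python
import cv2

def contains_object_of_interest(reference_gray, frame_gray,
                                min_matches=25, max_distance=40):
    """Return True when enough ORB keypoints from a reference image of the
    person/object of interest match keypoints found in the candidate frame."""
    orb = cv2.ORB_create()
    _, ref_desc = orb.detectAndCompute(reference_gray, None)
    _, frame_desc = orb.detectAndCompute(frame_gray, None)
    if ref_desc is None or frame_desc is None:
        return False
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(ref_desc, frame_desc)
    good = [m for m in matches if m.distance < max_distance]
    return len(good) >= min_matches
```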
Video recording technology is ubiquitous. Users record events using an application, such as that of the present invention, a phone video camera, or any other video camera. When the app is resident on a digital recording device, at the conclusion of recording, the user may be prompted to "Collaborate?" If the user chooses to collaborate, the recording is uploaded to a server on the internet or to a video service, such as Youtube, for example, and the metadata of the recording and the Youtube information are sent to servers of the present invention. The recording may optionally be uploaded only when WiFi is available, etc. There may be an option to send metadata without uploading the recording. If someone else at the event requests a collaboration, the user may be prompted to collaborate and informed that a number of other contributors are currently collaborating.
Multiple video streams may be used to generate a composite and/or 3D video of an event. Live events may be streamed using this technology. Viewers may be able to watch a simulcast from people's phones and/or a few installed cameras.
The composite (combined) recording (audio and video) with time and place information (metadata) may be combined with other applications such as GoogleMaps to create a time and place record. A user of the web service could search back to a school play on May 14, 2014 to see multiple videos recorded and stored. The user may never have attended the event, or may have been at the event. The proposed method would create a searchable video history where a user knows the date, time and place of an event. The user may also choose to watch recordings of a particular place at different times, such as the sunset at Mallory Square in Key West. Many people record this event every day, and a remote user could watch several sunsets recorded days, weeks, or years apart.
Fig. 3 is a flowchart of an exemplary implementation of the proposed method from the perspective of the individual users (clients). At 305 a user (client) records an event using a digital video recording device. The digital video recording device also automatically generates metadata for the recording. At 310 a test is performed to determine if the recording device is preconfigured to upload the recording. If the recording device is preconfigured to upload the recording then at 320 a test is performed to determine if a WiFi connection is available in order to upload the recorded video. If a WiFi connection is not available in order to upload the recording then processing continues at 320. If a WiFi connection is available in order to upload the recording then at 325 the recording and automatically generated metadata are uploaded. Processing then ends. If the recording device is not preconfigured to upload the recording then at 315 the user (client) is queried whether he/she wants to upload the recording. If the user (client) indicates that he/she wants to upload the recording then processing proceeds to 320. If the user (client) does not want to upload the recording then processing ends.

Fig. 4 is a flowchart of an exemplary method of the present invention from the perspective of the server. At 405 the server receives a recording and the associated automatically generated metadata from a user (client). At 410 the server analyzes the metadata to determine if the received recording is of the same event for which other recording(s) have already been received. At 415 the received recording and its associated metadata are stored with recordings and metadata of the same event. If no other recordings have been received yet, the received recording and its associated metadata are stored. There is a tacit assumption that multiple recordings of the same event will be received. Using the metadata, at 420 the server determines the angles/views of the received recordings. Metadata may include seat numbers of events at large venues. Seat numbers will aid in the determination of angles. At 425 the server generates a single composite multi-camera/multi-view video using the uploaded recordings and their associated metadata. At 430 the server downloads the generated composite multi-camera/multi-view recording to the uploaders. As used herein, the downloading may be streaming or may merely be a notification that the generated composite multi-camera/multi-view recording is available online together with a URL to access the posted generated composite multi-camera/multi-view recording. At 435 the server posts the generated single composite multi-camera/multi-view recording online.
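A minimal sketch of the client-side flow of Fig. 3 is shown below; the callables `ask_user`, `wifi_available` and `upload` are hypothetical placeholders for device-specific functionality.

```python
def client_upload_flow(recording, metadata, preconfigured_upload: bool,
                       ask_user, wifi_available, upload):
    """Mirror of the Fig. 3 decision flow: upload only when permitted
    and when a WiFi connection becomes available."""
    if not preconfigured_upload:
        if not ask_user("Collaborate? Upload this recording?"):  # 315
            return                     # user declined; processing ends
    while not wifi_available():        # 320: wait until WiFi is available
        pass                           # in practice, sleep or register a callback
    upload(recording, metadata)        # 325: upload recording and metadata
```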
Current systems upload recordings to a server and generate a composite recording even if there is only one recording of a particular event. The problem with this is that uploading and downloading recordings require a large amount of bandwidth, and the result may be no different than what the user started with. It may also be the case that, at 405, the server initially receives only metadata. The server then analyzes the metadata at 410 and, if multiple recordings of the same event have already been uploaded to (received by) the server, the server may request that this user upload his/her recording of the event. Assuming the user is amenable to uploading his/her recording, the server then receives the user's uploaded recording. In the alternative, the uploading may be automatically triggered according to the user's preset preferences.
In addition to the automatic generation of a single composite multi-camera/multi- view recording of an event using uploaded recordings, a user may specify a person or object of interest. That person or object of interest may or may not be in any or all of the uploaded recordings. If the user specifies a person or object of interest then the server may use any or all or any combination of the following forms of analysis to determine if any or all of the uploaded recordings include the person or object of interest. The server may then automatically create a single composite multi-camera/multi-view recording for the user focused on the person or object of interest.
1. Image Comparison after Normalization - In image comparison after normalization, each image is first normalized, which is a process that changes the range of pixel intensity values. The purpose of normalization is to bring the image, or other type of signal, into a consistent dynamic range to permit statistical image comparison. Once the images are normalized, a Manhattan, Chebyshev or Euclidean distance can be determined between the two images and a statistical comparison performed.
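A simple version of this comparison could be implemented as follows; this NumPy-based sketch, with its normalization to the [0, 1] range and its choice of distance metrics, is illustrative only.

```python
import numpy as np

def normalize(img: np.ndarray) -> np.ndarray:
    """Rescale pixel intensities to the [0, 1] range."""
    img = img.astype(np.float64)
    rng = img.max() - img.min()
    return (img - img.min()) / rng if rng > 0 else np.zeros_like(img)

def image_distance(a: np.ndarray, b: np.ndarray, metric: str = "euclidean") -> float:
    """Distance between two equally sized images after normalization."""
    diff = normalize(a) - normalize(b)
    if metric == "manhattan":
        return float(np.abs(diff).sum())
    if metric == "chebyshev":
        return float(np.abs(diff).max())
    return float(np.sqrt((diff ** 2).sum()))  # Euclidean
```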
2. Image Color Histograms - If the two images to be compared are color, color histogram representations or, alternatively, color correlograms can be generated and statistically compared. A color histogram is a representation of the distribution of colors in an image. For digital images, a color histogram represents the number of pixels that have colors in each of a fixed list of color ranges that span the image's color space, the set of all possible colors.
Statistical analysis of the color histograms can be performed by color histogram intersection, color constant indexing, cumulative color histogram, quadratic distance, and color correlograms. Although there are drawbacks of using histograms for indexing and classification, using color in a real-time system has several advantages. One is that color information is faster to compute compared to other invariants. It has been shown in some cases that color can be an efficient method for identifying objects of known location and appearance.
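A minimal sketch of color-histogram comparison using OpenCV is shown below; the bin count and the use of histogram intersection are illustrative choices.

```python
import cv2
import numpy as np

def color_histogram(img: np.ndarray, bins: int = 8) -> np.ndarray:
    """3-D BGR histogram, normalized so images of different sizes compare fairly."""
    hist = cv2.calcHist([img], [0, 1, 2], None, [bins] * 3,
                        [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def histogram_intersection(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """Higher values mean more similar color distributions."""
    return cv2.compareHist(color_histogram(img_a), color_histogram(img_b),
                           cv2.HISTCMP_INTERSECT)
```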
3. Image Retrieval through Object Comparison - For any object in an image, interesting points on the object can be extracted to provide a "feature description" of the object. This description, extracted from a first image or training image, can then be used to identify the object when attempting to locate the object in a second image or test image. To perform reliable recognition, it is important that the features extracted from the training image be detectable even under changes in image scale, noise and illumination. Such points usually lie on high-contrast regions of the image, such as object edges.
Another important characteristic of these features is that the relative positions between them in the original scene should not change from one image to another. For example, if only the four corners of a door were used as features, they would work regardless of the door's position; but if points in the frame were also used, the recognition would fail if the door is opened or closed. Similarly, features located in articulated or flexible objects would typically not work if any change in their internal geometry happens between two images in the set being processed. However, in practice SIFT detects and uses a much larger number of features from the images, which reduces the contribution of the errors caused by these local variations in the average error of all feature matching errors.
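A sketch of SIFT-based object comparison using OpenCV (which requires OpenCV 4.4+ or a contrib build for `SIFT_create`) might look like the following; the ratio-test threshold is a conventional value, not one specified by the invention.

```python
import cv2

def count_sift_matches(img_a, img_b, ratio: float = 0.75) -> int:
    """Count SIFT keypoint matches that survive Lowe's ratio test;
    a high count suggests the same object or scene appears in both images."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY), None)
    kp_b, des_b = sift.detectAndCompute(cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY), None)
    if des_a is None or des_b is None:
        return 0
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des_a, des_b, k=2)
    good = 0
    for pair in matches:
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good += 1
    return good
```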
4. Facial Recognition - A facial recognition system is a method for automatically identifying or verifying a person from a digital image or a video frame from a video source. One way to do this is by comparing selected facial features from the image with a facial database. In image recognition, a face from each of two images can be compared to determine if the same face appears in each image. Thus, one can determine that the images are of the same scene if the images were gathered at the same time and place.
Some facial recognition algorithms identify facial features by extracting landmarks, or features, from an image of the subject's face. For example, an algorithm may analyze the relative position, size, and/or shape of the eyes, nose, cheekbones, and jaw. These features are then used to search for other images with matching features. Other algorithms normalize a gallery of face images and then compress the face data, only saving the data in the image that is useful for face recognition. A probe image is then compared with the face data. One of the earliest successful systems is based on template matching techniques applied to a set of salient facial features, providing a sort of compressed face representation.
Recognition algorithms can be divided into two main approaches, geometric, which looks at distinguishing features, or photometric, which is a statistical approach that distills an image into values and compares the values with templates to eliminate variances.
Popular recognition algorithms include Principal Component Analysis using eigenfaces, Linear Discriminant Analysis, Elastic Bunch Graph Matching using the Fisherface algorithm, the Hidden Markov model, Multilinear Subspace Learning using tensor representation, and neuronally motivated dynamic link matching. Eigenfaces is the name given to a set of eigenvectors when they are used in the computer vision problem of human face recognition.
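As an illustration of the eigenface approach mentioned above, the following sketch projects aligned face images into a PCA subspace and compares them there. The use of scikit-learn and the component count are assumptions; `n_components` must not exceed the number of gallery faces.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_eigenfaces(face_vectors: np.ndarray, n_components: int = 50) -> PCA:
    """face_vectors: one flattened, aligned grayscale face per row."""
    return PCA(n_components=n_components, whiten=True).fit(face_vectors)

def face_distance(pca: PCA, probe: np.ndarray, gallery_face: np.ndarray) -> float:
    """Compare two faces in the compressed eigenface space;
    small distances suggest the same person."""
    p = pca.transform([probe])[0]
    g = pca.transform([gallery_face])[0]
    return float(np.linalg.norm(p - g))
```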
5. Saliency Detection - Saliency detection aims at detecting the "most important" image regions or objects that represent the scene. This can be done by identifying fixation points, detecting a dominant object in the image, or computing center-surround contrasts of units modeled on known properties.
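A minimal saliency-detection sketch using OpenCV's spectral-residual detector (available in contrib builds) is shown below; this is one possible detector, not necessarily the one contemplated.

```python
import cv2

def saliency_map(frame):
    """Spectral-residual static saliency; bright regions are the parts of
    the frame most likely to draw a viewer's attention."""
    detector = cv2.saliency.StaticSaliencySpectralResidual_create()
    ok, sal = detector.computeSaliency(frame)
    return (sal * 255).astype("uint8") if ok else None
```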
When viewing a professionally generated recording (audio and video) or program, camera views (angles) are constantly changing. This keeps the recording or program interesting and gives the viewer a more complete view of the event and environment of the setting. Changing views gives the viewer a more immersive and enjoyable experience compared to a single view of the event or person or object of interest in the event.
The proposed method, therefore, automatically creates (generates) a composite multi-camera/multi-view recording from a number of (anonymous) contributors, thereby increasing the quality and experience of the recording of the event. However, if more than one contributor records the event from essentially the same vantage point, it may be undesirable to provide each of these recordings with equal time in the composite recording. As an extreme example, assume that there are four contributors each recording a concert from four consecutive seats in a balcony and another (fifth) contributor recording the concert from a front row seat. A composite recording generated using all five recordings in an equal combination would appear to be recorded 80% from the balcony and 20% from the front row. The portions of the recording from the balcony seats may be perceived by the viewer to be the same. It would be better to allot 50% of the composite recording to the front row view and split the remaining 50% among the balcony contributors (12.5% each), or to choose the best recording from among the balcony contributors.
The proposed method for automatic generation of a single composite multi-camera/multi-view recording of an event includes a server performing event clustering - either using the metadata or by some form of audio and/or image analysis - content organization, content selection, content editing and content creation. Fig. 5 is a flowchart of an exploded view of an exemplary implementation of 425 of Fig. 4. At 410 of Fig. 4, in addition to metadata analysis, the recording (received recorded content) may be analyzed by any of the above described forms of image analysis to determine if the recording is of an event for which other recordings have already been received. That is, event clustering is effectively performed at 410. The events may be plays, music concerts, high school or college graduations, etc. A user (viewer, contributor) may have a recording and wish to generate a composite recording for any number of reasons, such as their recording being incomplete or having sound defects. The user may upload as much metadata as possible, such as time, location, event name, etc. Currently, users would have to search the metadata of other recordings and hope to find one or more recordings of the same event in order to complete or repair their recording. With the billions of recordings available on the Internet, it may be impossible to find a recording of the event for which they are looking. The proposed method examines and characterizes user uploaded recordings for metadata, audio and video characteristics, etc. to locate distinguishing audio and/or video features (certain music, facial features, etc.). The server then creates a profile of the user submitted (uploaded) recording using the metadata and distinguishing features. The server then uses the profile to search other characterized recordings for possible matching recordings. If possible matching recordings are located, the server attempts to synchronize the recordings using audio and/or video to determine if the recordings are indeed of the same event. If recordings of the same event are located, the profile of each recording is augmented and corrected (updated) using data from each recording, and markers may also be added. Markers are added to show common points.
At 505, the server organizes and analyzes the recordings to determine the best views and the best quality. Recordings may also be analyzed with respect to focal point (focus) of the recording. The server also determines the physical location or perspective of each camera that recorded the event. The location determination can be made by contributors providing their section number and seat number for the concert. This is done using metadata as well as image analysis. Metadata may include GPS data.
If two or more recordings are determined to have the same perspective or have been recorded from the same physical location, they are grouped into a view group. Each view group is then weighted according to a weighting formula. Using the above example of four contributors recording from the balcony and one contributor recording from the front row, the four balcony contributors are put into one view group and the front row contributor is put into another view group. When the composite recording is generated, each group may be weighted equally, so the front row recording is used in the composite approximately 50% of the time and the four contributors from the balcony are together used a total of 50% of the time in the composite recording. Each of the balcony contributors may be used 12.5% of the time, or one or more of the balcony contributors may be discarded because of poor recording quality. The balcony contributors' recordings (even those of poor quality) may be used to replace out-of-focus frames of the other contributors in the same view group or to replace frames where the object or person of interest was lost. The server thus weights or discards views that are substantially similar (within "X" feet or "X" seats of each other), retaining the recording having the best possible angle and/or the best quality. For example, if two people are sitting next to each other at a concert and both are recording the concert, the server may discard one of the recordings as redundant or weight each recording differently when generating the composite recording. Thus, the composite recording is created with dissimilar views.
Turning to Fig. 6, V4 and V5 are grouped together in a view group. Each of the other contributions (V1, V2, V3, V6) is in its own view group. Thus, a composite recording may use each of V1, V2, V3 and V6 for 20% of the time and each of V4 and V5 for 10% of the time.
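The weighting in this example can be expressed compactly. The sketch below assumes equal weight per view group, split evenly within each group, which reproduces the percentages described for Fig. 6; the function and group names are illustrative.

```python
def view_group_weights(view_groups: dict[str, list[str]]) -> dict[str, float]:
    """Weight view groups equally, then split each group's share evenly
    among its member recordings."""
    per_group = 1.0 / len(view_groups)
    weights = {}
    for members in view_groups.values():
        for member in members:
            weights[member] = per_group / len(members)
    return weights

# Fig. 6: V4 and V5 share one vantage point, the rest are distinct.
groups = {"g1": ["V1"], "g2": ["V2"], "g3": ["V3"], "g4": ["V4", "V5"], "g5": ["V6"]}
print(view_group_weights(groups))
# {'V1': 0.2, 'V2': 0.2, 'V3': 0.2, 'V4': 0.1, 'V5': 0.1, 'V6': 0.2}
```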
At 510, based on best angle (view), best quality or some other criteria, such as person or object of interest, recordings are selected to be included in the automatically generated single composite multi-camera/multi-view recording. At 515, using the selected recordings, content editing is performed. Content editing edits the content for color, blur, fuzzy edges etc. With multiple recordings from a number of cameras, the color may not always be exactly the same so the color must be edited so that the colors do not change when moving from one recording to another. One recording may have had a great or even best angle but one or more faces might be blurred or have the common red eye. Such deficiencies are edited out during the content editing phase.
Content editing may also include audio editing. Often when crowd-sourced recordings are generated, one audio stream is selected as the audio for all sources in the composite recording. This may not be advantageous if the audio in the selected stream has interference, background noise or sounds not related to the object of interest being recorded. For example, at a concert, people talking near the recording device may distract viewers from the music being played. The event clustering phase described above also analyzes and synchronizes the audio streams of the uploaded (received) recordings. When the server determines that there is common audio in the multiple sources (uploaded recordings), the server assumes that this is the audio from or of an object of interest (e.g., the performer). When there is unique audio on one of the audio streams, the server assumes that this is undesirable and not from or of the object of interest. Since most undesirable audio is probably proximate to a single recording device, the server can exclude that audio track from the composite recording for at least the duration of the detected undesirable audio. Once the server has determined which audio is desirable and which audio is undesirable, the server can use the desirable audio from multiple sources to generate the composite recording. Thus, spatial sound relationships can be generated, audio levels can be adjusted, etc., such that an enhanced audio layer can be generated to be presented (synchronized) with the composite recording.
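A simplified, energy-based sketch of detecting "unique" (likely unwanted) audio in one of several synchronized streams follows; a real system would more likely compare spectral features, and the window size and threshold here are illustrative assumptions.

```python
import numpy as np

def flag_unique_audio(tracks: np.ndarray, window: int, threshold: float = 3.0) -> np.ndarray:
    """tracks: array of shape (n_streams, n_samples), already time-synchronized.
    Returns a boolean array (n_streams, n_windows); True marks windows where a
    stream deviates strongly from the consensus and is likely local noise
    (e.g. people talking near one recording device)."""
    n_streams, n_samples = tracks.shape
    n_windows = n_samples // window
    # per-window RMS energy for each stream
    rms = np.sqrt(
        (tracks[:, :n_windows * window]
         .reshape(n_streams, n_windows, window) ** 2).mean(axis=2)
    )
    consensus = np.median(rms, axis=0)                       # typical energy per window
    spread = np.median(np.abs(rms - consensus), axis=0) + 1e-9
    return np.abs(rms - consensus) > threshold * spread
```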
Fig. 8 is a block diagram of an exemplary embodiment of the proposed apparatus (server) for generation of a composite recording using crowd-sourced recordings of an event. The proposed composite recording generation server includes a communications interface. The communications interface is in bi-directional communication with the processor and interfaces with the user. The communications interface can handle wired-line or wireless communications. The communications interface accepts (receives) the recordings to be analyzed. The input may be by downloading or streaming depending on the format of the source. The interface with the user is via any recording device or a display device having a keyboard and/or graphical user interface. The generated composite recordings (files) can be exported via the communications interface.

The received recordings, received via the communications interface, are forwarded to the processor, which stores the received content in a database (labeled "Content Storage"), performs analysis, organization, selection and editing as described above, and stores the resulting analysis, profiles, etc. (files) in the database of the storage system shown as "Content Storage". The processor then creates the composite multi-camera recording. That is, the processor analyzes the metadata to determine if the metadata is for an event for which recordings have already been received and analyzes the recording associated with the metadata to determine if the recording is for an event for which recordings have already been received. The processor then stores the received content with recordings of the event which have already been received. The processor creates a characterizing profile of the received recording, searches other characterized recordings for possible matching recordings using the characterizing profile of the received recording, synchronizes the received recording with the matching recordings, augments the characterizing profiles of the recording and the matching recordings, and adds markers to the received recording at common points between the received recording and the matching recordings. The processor also analyzes the received recording and the matching recordings to determine the recordings with the best views and best quality, analyzes the received recording and the matching recordings to determine focal points of the received recording, analyzes the received recording and the matching recordings to determine the physical location or perspective of each camera that recorded the event, groups recordings from the same perspective or the same physical camera location into view groups, and weights each view group. Content selection further comprises the processor selecting recordings for inclusion in the generated composite multi-camera recording from the view groups with the best views or recordings from the view groups with the best quality.

Code to direct the processor is stored in the storage system. The code to direct the processor may be in a single composite recording generation module or in separate modules for content organization and analysis, content selection, content editing and creation of the composite multi-camera recording. The storage system may include any type of memory including disks, optical disks, tapes, hard drives, CDs, DVDs, flash drives, cloud memory, core memory, any form of RAM or any equivalent type of memory or storage device.
The code similarly may be stored on any type of memory including disks, optical disks, tapes, hard drives, CDs, DVDs, flash drives, cloud memory, core memory, any form of RAM or any equivalent type of memory or storage device. The code and the storage for the composite recording generation of the proposed method may be on separate storage devices, on a single storage device, or on the same storage device in separate partitions. The storage system of the proposed apparatus is a tangible and non-transitory computer readable medium.
In an alternative embodiment, a user may be in a venue that also provides video services. For example, a user may be in a Disney park and record an event at the park. The event may be a Disney-planned event such as an Independence Day celebration, or a family birthday or family reunion. The user may upload his/her recording of the event including metadata. The location of the recording may be determined using the metadata. Disney, for example, may have a plurality of service providers depending upon the venue in which the recording was made. There may be a general or generic website to which to upload a recording, and the general or generic website would then forward it to one of several service providers within the Disney park system that would have pre-recorded content to add to the uploaded recording. If the metadata is insufficient to determine the location, then the service provider - in the above example the service provider is Disney - would use analysis of an audio component of the recording, analysis of a video component of the recording, or a combination of the metadata, the audio component of the recording and the video component of the recording to determine the location of the recording. Disney parks have many different venues. The user's uploaded recording could then be combined on the front end and/or on the back end with generic recorded material supplied by the service provider, e.g., Disney, to generate a composite recording. In this case, the pre-recorded generic video of the venue supplied by the service provider enhances the user's (amateur) recording with professionally recorded content to give at least the illusion that the generated recording has a multi-camera perspective. The user gets an enhanced recording and Disney gets free advertising each time the user shows that enhanced recording.
Fig. 9 is a flowchart of the alternative embodiment of the proposed method. At 905 a recording with metadata is received. At 910 a location is determined. At 915 a service provider is determined responsive to the determined location. At 920 the service provider combines the uploaded recording with pre-recorded professionally recorded content of the venue to generate a composite recording. At 925 the composite video is transmitted back to the user.
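A minimal sketch of the Fig. 9 flow is shown below; `providers`, `locate`, `combine` and `send_back` are hypothetical placeholders for the service-provider lookup, location analysis, compositing and delivery steps.

```python
def handle_venue_upload(recording, metadata, providers, locate, combine, send_back):
    """Sketch of the Fig. 9 flow. `providers` maps a venue identifier to a
    service-provider handle holding pre-recorded footage of that venue;
    `locate` falls back to audio/video analysis when metadata is insufficient."""
    venue = metadata.get("venue") or locate(recording, metadata)   # 910: determine location
    provider = providers[venue]                                    # 915: pick service provider
    composite = combine(recording, provider.pre_recorded_footage)  # 920: generate composite
    send_back(composite)                                           # 925: return to user
```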
It is to be understood that the present invention may be implemented in various forms of hardware, software, firmware, special purpose processors, or a combination thereof. Special purpose processors may include application specific integrated circuits (ASICs), reduced instruction set computers (RISCs) and/or field programmable gate arrays (FPGAs). Preferably, the present invention is implemented as a combination of hardware and software. Moreover, the software is preferably implemented as an application program tangibly embodied on a program storage device. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (CPU), a random access memory (RAM), and input/output (I/O) interface(s). The computer platform also includes an operating system and microinstruction code. The various processes and functions described herein may either be part of the microinstruction code or part of the application program (or a combination thereof), which is executed via the operating system. In addition, various other peripheral devices may be connected to the computer platform such as an additional data storage device and a printing device.
It should be understood that the elements shown in the figures may be implemented in various forms of hardware, software or combinations thereof. Preferably, these elements are implemented in a combination of hardware and software on one or more appropriately programmed general-purpose devices, which may include a processor, memory and input/output interfaces. Herein, the phrase "coupled" is defined to mean directly connected to or indirectly connected with through one or more intermediate components. Such intermediate components may include both hardware and software based components.
It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying figures are preferably implemented in software, the actual connections between the system components (or the process steps) may differ depending upon the manner in which the present invention is programmed. Given the teachings herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present invention.

Claims

CLAIMS:
1. A method for generation of a composite multi-camera recording, said method comprising:
performing event clustering on received content to determine if said received content is a recording of a same event for which previous recordings have been received;
performing content organization and analysis;
selecting content from among said received content and said previously received recording of the same event;
performing content editing; and
creating said composite multi-camera recording using said selected content.
2. The method according to claim 1, wherein said received content is metadata.
3. The method according to claim 2, wherein said metadata is analyzed to determine if said metadata is for an event for which recordings have already been received.
4. The method according to claim 1, wherein said received content is both metadata and a recording of said event associated with said metadata.
5. The method according to claim 4, further comprising:
analyzing said metadata to determine if said metadata is for said event for which recordings have previously been received;
analyzing said recording associated with said metadata to determine if said recording is for said event for which recordings have previously been received; and
storing said received content with recordings of said event which have previously been received.
6. The method according to claim 5, wherein said analyzing of said recording associated with said metadata is one or more of image comparison after normalization, image color histogram, image retrieval through object comparison, facial recognition and saliency detection.
7. The method according to claim 5, wherein said analyzing of said recording associated with said metadata includes analyzing audio and video characteristics to locate distinguishing audio and video features.
8. The method according to claim 5, further comprising:
creating a characterizing profile of the received recording;
searching other characterized recordings for possible matching recordings using said characterizing profile of the received recording;
synchronizing said received recording with said matching recordings; augmenting characterizing profiles of said recording and said matching recordings; and
adding markers to said received recording at common points between said received recording and said matching recordings.
9. The method according to claim 8, wherein said content organization and analysis further comprises:
analyzing said received recording and said matching recordings to determine recordings with best views and best quality;
analyzing said received recording and said matching recordings to determine focal points of said received recording;
analyzing said received recording and said matching recordings to determine physical location or perspective of each camera that recorded said event;
grouping said received recording and said matching recordings from said same perspective or said same physical camera location into a view group; and weighting each view group.
10. The method according to claim 9, wherein said physical location or said perspective analysis is performed using said metadata or image analysis or both said metadata and said image analysis.
11. The method according to claim 9, wherein said content selection further comprises selecting recordings for inclusion in said generated composite multi- camera recording from said view groups with said best views or recordings from said view groups with said best quality.
12. The method according to claim 11, wherein said content editing further comprises:
determining if there is a unique audio stream in said selected recordings; filtering out said unique audio stream;
generating spatial sound relationships;
adjusting audio levels; and
generating an enhanced audio layer for presentation with said generated composite multi-camera recording.
13. An apparatus for generation of a composite multi-camera recording, comprising:
a processor, said processor performing event clustering on received content, said event clustering determines if said received content is a recording of an event for which recordings have been previously received;
said processor, performing content organization and analysis; said processor, selecting content from among said received content and said previously received recording of said event;
said processor, performing content editing; and
said processor, creating said composite multi-camera recording using said edited content.
14. The apparatus according to claim 13, wherein said received content is metadata.
15. The apparatus according to claim 14, wherein said metadata is analyzed to determine if said metadata is for an event for which recordings have already been received.
16. The apparatus according to claim 13, wherein said received content is both metadata and a recording of said event associated with said metadata.
17. The apparatus according to claim 16, further comprising:
said processor, analyzing said metadata to determine if said metadata is for said event for which recordings have previously been received; said processor, analyzing said recording associated with said metadata to determine if said recording is for said event for which recordings have previously been received; and
said processor, storing said received content with recordings of said event which have previously been received.
18. The apparatus according to claim 17, wherein said analyzing of said recording associated with said metadata is one or more of image comparison after normalization, image color histogram, image retrieval through object comparison, facial recognition and saliency detection.
19. The apparatus according to claim 17, wherein said analyzing of said recording associated with said metadata includes analyzing audio and video characteristics to locate distinguishing audio and video features.
20. The apparatus according to claim 17, further comprising:
said processor, creating a characterizing profile of the received recording; said processor, searching other characterized recordings for possible matching recordings using said characterizing profile of the received recording; said processor, synchronizing said received recording with said matching recordings;
said processor, augmenting characterizing profiles of said recording and said matching recordings; and
said processor, adding markers to said received recording at common points between said received recording and said matching recordings.
21. The apparatus according to claim 20, wherein said content organization and analysis further comprises:
said processor, analyzing said received recording and said matching recordings to determine recordings with best views and best quality;
said processor, analyzing said received recording and said matching recordings to determine focal points of said received recording; said processor, analyzing said received recording and said matching recordings to determine physical location or perspective of each camera that recorded said event;
said processor, grouping said received recording and said matching recordings from said same perspective or said same physical camera location into a view group; and
said processor, weighting each view group.
22. The apparatus according to claim 21, wherein said physical location or said perspective analysis is performed using said metadata or image analysis or both said metadata and said image analysis.
23. The apparatus according to claim 21, wherein said content selection further comprises said processor selecting recordings for inclusion in said generated composite multi-camera recording from said view groups with said best views or recordings from said view groups with said best quality.
24. The apparatus according to claim 23, wherein said content editing further comprises:
determining if there is a unique audio stream in said selected recordings; filtering out said unique audio stream;
generating spatial sound relationships;
adjusting audio levels; and
generating an enhanced audio layer for presentation with said generated composite multi-camera recording.
PCT/US2015/056742 2014-11-07 2015-10-21 Collaborative video upload method and apparatus WO2016073205A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201462076704P 2014-11-07 2014-11-07
US62/076,704 2014-11-07

Publications (1)

Publication Number Publication Date
WO2016073205A1 true WO2016073205A1 (en) 2016-05-12

Family

ID=54397015

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/056742 WO2016073205A1 (en) 2014-11-07 2015-10-21 Collaborative video upload method and apparatus

Country Status (1)

Country Link
WO (1) WO2016073205A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115114963A (en) * 2021-09-24 2022-09-27 中国劳动关系学院 Intelligent streaming media video big data analysis method based on convolutional neural network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090087161A1 (en) * 2007-09-28 2009-04-02 Graceenote, Inc. Synthesizing a presentation of a multimedia event
EP2428903A1 (en) * 2010-09-08 2012-03-14 Nokia Corporation Method and apparatus for video synthesis
US20130188923A1 (en) * 2012-01-24 2013-07-25 Srsly, Inc. System and method for compiling and playing a multi-channel video
US20140133825A1 (en) * 2012-11-15 2014-05-15 International Business Machines Corporation Collectively aggregating digital recordings


Similar Documents

Publication Publication Date Title
WO2016073206A1 (en) Generating a composite recording
US8879788B2 (en) Video processing apparatus, method and system
US9081798B1 (en) Cloud-based photo management
US8612517B1 (en) Social based aggregation of related media content
KR102137207B1 (en) Electronic device, contorl method thereof and system
US10582149B1 (en) Preview streaming of video data
US8542982B2 (en) Image/video data editing apparatus and method for generating image or video soundtracks
US20160155475A1 (en) Method And System For Capturing Video From A Plurality Of Devices And Organizing Them For Editing, Viewing, And Dissemination Based On One Or More Criteria
US10541000B1 (en) User input-based video summarization
US20240171831A1 (en) System and Method for Algorithmic Editing of Video Content
US10958837B2 (en) Systems and methods for determining preferences for capture settings of an image capturing device
US10943127B2 (en) Media processing
Bano et al. ViComp: composition of user-generated videos
Wu et al. MoVieUp: Automatic mobile video mashup
US20230156245A1 (en) Systems and methods for processing and presenting media data to allow virtual engagement in events
Cricri et al. Multimodal extraction of events and of information about the recording activity in user generated videos
US20160100110A1 (en) Apparatus, Method And Computer Program Product For Scene Synthesis
CN110879944A (en) Anchor recommendation method, storage medium, equipment and system based on face similarity
US20170091205A1 (en) Methods and apparatus for information capture and presentation
WO2016073205A1 (en) Collaborative video upload method and apparatus
WO2013187796A1 (en) Method for automatically editing digital video files
CN104981753B (en) Method and apparatus for content manipulation
CN115917647A (en) Automatic non-linear editing style transfer
AU2012232990A1 (en) Image selection based on correspondence of multiple photo paths
TW201913562A (en) Device and method for generating panorama image

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15790391

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15790391

Country of ref document: EP

Kind code of ref document: A1