WO2012175783A1 - Video remixing system - Google Patents

Video remixing system

Info

Publication number
WO2012175783A1
WO2012175783A1 (PCT/FI2011/050599)
Authority
WO
WIPO (PCT)
Prior art keywords
video
depth
segment
segments
remix
Prior art date
Application number
PCT/FI2011/050599
Other languages
French (fr)
Inventor
Sujeet Mate
Igor D. Curcio
Kostadin Dabov
Original Assignee
Nokia Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Corporation filed Critical Nokia Corporation
Priority to EP11868268.1A priority Critical patent/EP2724343B1/en
Priority to US14/126,385 priority patent/US9396757B2/en
Priority to CN201180071774.8A priority patent/CN103635967B/en
Priority to PCT/FI2011/050599 priority patent/WO2012175783A1/en
Publication of WO2012175783A1 publication Critical patent/WO2012175783A1/en

Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B27/034 Electronic editing of digitised analogue information signals, e.g. audio or video signals on discs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106 Processing image signals
    • H04N13/161 Encoding, multiplexing or demultiplexing different image signal components

Definitions

  • Video remixing is an application where multiple video recordings are combined in order to obtain a video mix that contains some segments selected from the plurality of video recordings.
  • Video remixing is one of the basic manual video editing applications, for which various software products and services are already available.
  • automatic video remixing or editing systems which use multiple instances of user-generated or professional recordings to automatically generate a remix that combines content from the available source content.
  • Some automatic video remixing systems depend only on the recorded content, while others are capable of utilizing environmental context data that is recorded together with the video content.
  • the context data may be, for example, sensor data received from a compass, an accelerometer, or a gyroscope, or GPS location data.
  • Video remixing is computationally a demanding task, especially when multiple recordings, possibly encoded in different, non-compatible file formats, are used as source content. Obtaining a desired resultant video remix may be significantly delayed due to the bottlenecks of the video remixing system. Therefore, a more efficient video remixing system is needed.
  • a method for creating a video remix comprising: obtaining a plurality of source content in a processing device; determining a plurality of segments from the source content to be included in the video remix; determining editing processes required to transform the plurality of segments into form suitable for the video remix; allocating said editing processes to be executed in parallel in at least one processing device; and merging the plurality of segments received from said editing processes into the video remix.
  • the source content comprises at least one of video, audio and/or image
  • said editing processes comprise at least one of the following:
  • the method further comprises receiving a user request for creating a video remix, said user request including a request to create the video remix within a time period; determining an optimal allocation of the editing processes such that the editing processes are optimized according to available processing power of said at least one processing device and the video remix can be created within said time period; and allocating said editing processes to be executed in parallel in at least one processing device according to said optimal allocation.
  • the method further comprises obtaining depth maps for at least some frames of a source video; detecting a type of a video shot and/or an object obstructing a view in the source video based on the depth map; and indexing the source video according to the detected type of a video shot and/or the detected object obstructing a view.
  • the method further comprises analysing the depth map of a frame by dividing the depth map of the frame into at least two non-overlapping region-of-interests, one of them being a central region-of-interest, and calculating the depth of each region-of-interest as a weighted average value of the depth, wherein the weighting is based on reliability values of the depth map.
  • the method further comprises detecting the type of the video shot included in the source video to a close-up shot, a medium shot or a long shot by comparing the depth of the central region-of-interest to the depths of the remaining region-of-interests, the criteria for detecting the type of the video shot including at least the number of region-of-interests having a substantially similar depth to the depth of the central region-of-interest and residing within a predefined distance from the central region-of-interest.
  • the method further comprises detecting the object obstructing the view in the source video on the basis of a difference between an averaged depth for region-of-interests having depth substantially at the depth of expected location of obstructing objects and an averaged depth of the remaining region-of-interests.
  • an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to at least: obtain a plurality of source content; determine a plurality of segments from the source content to be included in a video remix; determine editing processes required to transform the plurality of segments into form suitable for the video remix; allocate said editing processes to be executed in parallel in at least one processing device; and merge the plurality of segments received from said editing processes into the video remix.
  • a computer program embodied on a non-transitory computer readable medium, the computer program comprising instructions causing, when executed on at least one processor, at least one apparatus to: obtain a plurality of source content; determine a plurality of segments from the source content to be included in a video remix; determine editing processes required to transform the plurality of segments into form suitable for the video remix; allocate said editing processes to be executed in parallel in at least one processing device; and merge the plurality of segments received from said editing processes into the video remix.
  • a system comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the system to at least: obtain a plurality of source content; determine a plurality of segments from the source content to be included in a video remix; determine editing processes required to transform the plurality of segments into form suitable for the video remix; allocate said editing processes to be executed in parallel in at least one processing device; and merge the plurality of segments received from said editing processes into the video remix.
  • Figs. 1a and 1b show a system and devices suitable to be used in an automatic video remixing service according to an embodiment
  • Fig. 2 shows a block chart of an implementation embodiment for the automatic video remixing service
  • Fig. 3 shows a partial re-encoding method of a video segment according to an embodiment
  • Fig. 4 shows a system for a time-interval demand based generation of a video remix according to an embodiment
  • Fig. 5 shows an example for positioning a number of non-overlapping regions of interest (ROIs) in the depth maps of the video frames
  • Fig. 6 shows a flow chart of an embodiment for detecting the type of video shots on the basis of the depth map of the recorded scene
  • Fig. 7 shows a flow chart of an embodiment for detecting objects that obstruct the view on the basis of the depth map of the recorded scene
  • ROIs non-overlapping regions of interest
  • a loud voice in a party may be an acoustic trigger for a video capture, or people turning suddenly to another direction may be an orientation trigger, received from a compass sensor of the portable device, for a video capture.
  • Spatially nearby portable devices may collaboratively identify an event, and at least locate the portable device with the best view of this event.
  • the devices recording the content may be disconnected from other devices, but share the recorded source content and the corresponding sensor data, which is pooled together in a file server or any such suitable mechanism for generating the automatic remix. Recordings of the attendants from such events, possibly together with various sensor information, provide a suitable framework for the present invention and its embodiments.
  • Figs. 1 a and 1 b show a system and devices suitable to be used in an automatic video remixing service according to an embodiment.
  • the different devices may be connected via a fixed network 210 such as the Internet or a local area network; or a mobile communication network 220 such as the Global System for Mobile communications (GSM) network, 3rd Generation (3G) network, 3.5th Generation (3.5G) network, 4th Generation (4G) network, Wireless Local Area Network (WLAN), Bluetooth®, or other contemporary and future networks.
  • GSM Global System for Mobile communications
  • 3G 3rd Generation
  • 3.5G 3.5th Generation
  • 4G 4th Generation
  • WLAN Wireless Local Area Network
  • the networks comprise network elements such as routers and switches to handle data (not shown), and communication interfaces such as the base stations 230 and 231 in order to provide access for the different devices to the network, and the base stations 230, 231 are themselves connected to the mobile network 220 via a fixed connection 276 or a wireless connection 277.
  • servers 240, 241 and 242 each connected to the mobile network 220, which servers may be arranged to operate as computing nodes (i.e. to form a cluster of computing nodes or a so-called server farm) for the automatic video remixing service.
  • Some of the above devices, for example the computers 240, 241, 242, may be arranged to make up a connection to the Internet with the communication elements residing in the fixed network 210.
  • end-user devices such as mobile phones and smart phones 251 , Internet access devices (Internet tablets) 250, personal computers 260 of various sizes and formats, televisions and other viewing devices 261 , video decoders and players 262, as well as video cameras 263 and other encoders.
  • These devices 250, 251 , 260, 261 , 262 and 263 can also be made of multiple parts.
  • the various devices may be connected to the networks 210 and 220 via communication connections such as a fixed connection 270, 271, 272 and 280 to the internet, a wireless connection 273 to the internet 210, a fixed connection 275 to the mobile network 220, and a wireless connection 278, 279 and 282 to the mobile network 220.
  • Fig. 1 b shows devices for automatic video remixing according to an example embodiment.
  • the server 240 contains memory 245, one or more processors 246, 247, and computer program code 248 residing in the memory 245 for implementing, for example, automatic video remixing.
  • the different servers 241 , 242, 290 may contain at least these elements for employing functionality relevant to each server.
  • the end-user device 251 contains memory 252, at least one processor 253 and 256, and computer program code 254 residing in the memory 252 for implementing, for example, gesture recognition.
  • the end-user device may also have one or more cameras 255 and 259 for capturing image data, for example stereo video.
  • the end-user device may also contain one, two or more microphones 257 and 258 for capturing sound.
  • the end-user device may also contain sensors for generating the depth information using any suitable technology.
  • the different end-user devices 250, 260 may contain at least these same elements for employing functionality relevant to each device.
  • the depth maps, i.e. depth information regarding the distance from the scene to a plane defined by the camera, may be obtained by interpreting video recordings from stereo (or multiple) cameras and utilised in the video remixing system.
  • the end-user device may also have a time-of-flight camera, whereby the depth map may be obtained from a time-of-flight camera or from a combination of stereo (or multiple) view depth map and a time-of-flight camera.
  • the end-user device may generate depth map for the captured content using any available and suitable mechanism.
  • the end user devices may also comprise a screen for viewing single- view, stereoscopic (2-view), or multiview (more-than-2-view) images.
  • the end-user devices may also be connected to video glasses 290, e.g. by means of a communication block 293 able to receive and/or transmit information.
  • the glasses may contain separate eye elements 291 and 292 for the left and right eye. These eye elements may either show a picture for viewing, or they may comprise a shutter functionality, e.g. to block every other picture in an alternating manner to provide the two views of a three-dimensional picture to the eyes, or they may comprise orthogonal polarization filters (compared to each other), which, when combined with similar polarization realized on the screen, provide the separate views to the eyes. Other arrangements for video glasses may also be used to provide stereoscopic viewing capability. Stereoscopic or multiview screens may also be autostereoscopic, i.e.
  • the screen may comprise or may be overlaid by an optics arrangement, which results into a different view being perceived by each eye.
  • Single-view, stereoscopic, and multiview screens may also be operationally connected to viewer tracking in such a manner that the displayed views depend on the viewer's position, distance, and/or direction of gaze relative to the screen.
  • parallelized processes of the automatic video remixing may be carried out in one or more processing devices; i.e. entirely in one user device like 250, 251 or 260, or in one server device 240, 241 , 242 or 290, or across multiple user devices 250, 251 , 260 or across multiple network devices 240, 241 , 242, 290, or across both user devices 250, 251 , 260 and network devices 240, 241 , 242, 290.
  • the elements of the automatic video remixing process may be implemented as a software component residing on one device or distributed across several devices, as mentioned above, for example so that the devices form a so-called cloud.
  • An embodiment relates to a method for performing parallel video cutting, re-encoding, and merging of video segments within an automatic video remixing service, i.e. an editing service.
  • the service is implemented in at least one, but preferably in a plurality of computing nodes (i.e. a cluster of computing nodes or a server farm), which are able to execute more than one process or thread in parallel.
  • the automatic video remixing service is supplied with one or more video recordings and information regarding suitable cutting points of desired segments from the video recordings. The information regarding the suitable cutting points of segments can be provided in various ways.
  • the cutting points may be obtained a priori via any suitable method (e.g., by content analysis of the source videos or even manually, from a human input) and then supplied to the video remixing service along with the one or more video recordings.
  • more cutting points may be utilized by the video remixing service by directly analyzing the one or more video recordings or specific contextual information associated with them.
  • the video remixing service may analyse the video recordings either without any additional information or by exploiting contextual information such as sensor (gyroscope, accelerometer, compass or other sensors) data recorded simultaneously with the source videos. Embodiments relating to such analysis will be described more in detail further below.
  • a list of desired segments is created and on the basis of the list, a job is created, which may comprise cutting the source videos into desired segments, decoding of at least one desired segment in case the source video is already encoded and/or video encoding of at least one desired segment such that it starts with an intra-coded frame.
  • the cutting and the re-encoding are done in such a manner that a cut segment is not fully re-encoded but only the frames that are in between the desired cutting location and the location of the following intra coded frame are encoded. If the desired cutting location is pointing to an intra coded frame, then re-encoding of the segment is not performed.
  • additional cutting points may be allocated in order to ensure that the maximum segment duration is smaller than a predefined threshold.
  • the additional cutting points may improve the parallelization of the cutting and the re-encoding.
  • the automatic video remixing service comprises a control unit 205 for determining the desired video remix and the segments to be included therein.
  • As the input data for the video remixing service there is provided a plurality of source videos 201, 202, 203, 204 (Video1 - Video4), which may, but need not necessarily, be encoded, for example, by any known video coding standard, such as MPEG-2, MPEG-4, H.264/AVC, etc.
  • the source videos may be originated from one or more end-user devices or they may be loaded from a computer or a server connected to a network.
  • control unit 205 may be provided with or be arranged to determine a list of desired segments to cut and subsequently to merge in the final video remix.
  • the items of the list of segments may preferably contain information about the source video, the starting time or the frame number of the segment to be cut and the duration of the segment, either in time or in number of frames.
  • the source videos may be more or less overlapping in the time domain. Therefore, at least for those overlapping periods priorities could be assigned to the items in the list of segments. According to an embodiment, this could be achieved by sorting the list by the duration of the segments in descending order. If the source videos are already encoded with a desired video encoder, the need for re-encoding is determined by the frame type of the first frame of the segment to be cut. If the first frame of the desired cutting location is an intra coded frame, then there is no need for any re-encoding of the segment.
  • the cutting and the re-encoding are carried out such that a cut segment is only partially re-encoded according to a principle that only the frames that are in between the desired cutting location and the location of the following intra coded frame are encoded.
  • a source video comprises at least the frames 300 - 328, the frames 300, 310, 320 and 326 being intra frames and the rest of the frames being predicted frames.
  • the segment to be cut in this example is the frames 304 - 322, i.e. the segment starts from a predicted frame and the first intra frame is the frame 310.
  • the frames 304, 306 and 308 are decoded and re-encoded such that the first frame 304 is encoded as an intra frame.
  • the remaining part of the segment, i.e. the frames 310 - 322, is included in the segment without any modifications.
  • a source video is not encoded or it is encoded but not with the desired video encoder, then all desired segments from said source video need to be re-encoded.
  • additional cutting points may be allocated in the segments in order to ensure that the maximum segment duration is smaller than a predefined threshold, Ts.
  • the threshold Ts can be set such that the minimum processing time would be equal to the encoding time of a segment with duration Ts. This typically leads to a relatively short time interval (e.g., 0.5 - 1 sec) for the duration Ts.
  • the value for the threshold Ts may be defined from the perspective of the optimal utilization of the processing power of the computing nodes.
  • Np the maximum number of processes that can be executed in parallel
  • Np = X * Y, where, for example, X is the number of computing nodes and Y is the number of processes each node can execute in parallel.
  • Ts is set so that the overall number of segments is not smaller than Np.
  • Each segment whose duration is greater than Ts is split into segments with durations shorter than or equal to Ts.
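As a rough, non-normative illustration of the above splitting rule, the following Python sketch (with hypothetical helper names choose_threshold and split_segment, not taken from the patent) picks a threshold Ts that yields at least Np segments and splits any longer segment into pieces no longer than Ts:

    # Illustrative sketch only: choose Ts so that the overall number of segments
    # is not smaller than Np, then split every segment longer than Ts.

    def choose_threshold(segment_durations, n_parallel, ts_min=0.5, ts_max=1.0, step=0.1):
        """Pick the largest Ts in [ts_min, ts_max] that still yields >= n_parallel segments."""
        ts = ts_max
        while ts > ts_min:
            count = sum(-(-d // ts) for d in segment_durations)  # ceil(d / ts) per segment
            if count >= n_parallel:
                return ts
            ts -= step
        return ts_min

    def split_segment(start, duration, ts):
        """Split one segment into consecutive pieces no longer than ts seconds."""
        pieces, offset = [], 0.0
        while offset < duration:
            length = min(ts, duration - offset)
            pieces.append((start + offset, length))
            offset += length
        return pieces

The 0.5 - 1 second default range used for Ts in the sketch simply mirrors the interval mentioned above; any other range could be substituted.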
  • the additional cutting points can be introduced at or close to estimated scene changes, wherein the existence of scene changes is estimated based on the sensor data.
  • the scene changes may be detected using the context sensor data (e.g., a gyroscope, an accelerometer, or a compass of the recording device), and additional cutting points may be introduced at or close to estimated scene changes.
  • control unit 205 creates a job that comprises at least one of the following editing processes: cutting the source videos into desired segments, video decoding of the desired segment (only in case the source video is already encoded) and/or video encoding of the desired segment so that it starts with an intra-coded frame.
  • the control unit 205 sends the obtained jobs to a job scheduler 206 controlling the parallel execution of the jobs in the computing nodes.
  • the job scheduler 206 distributes individual tasks (processes) for parallel execution in at least one processing device, but preferably in several nodes of the server farm 207.
  • the parallel execution may comprise any of the tasks of cutting, decoding and re-encoding.
  • the merging of the segments is also performed in parallel by a merging unit 208 by following a binary-tree path, where in each step each pair of consecutive segments is merged, until the final output video remix 209 has been created.
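The parallel execution and binary-tree merging described above could be sketched as follows; process_segment and merge_pair are placeholders standing in for the actual cutting/decoding/re-encoding and container-level concatenation, and are not part of the patent disclosure:

    # Illustrative sketch only: run the editing jobs in parallel, then merge the
    # results pairwise along a binary-tree path until a single output remains.
    from concurrent.futures import ProcessPoolExecutor

    def process_segment(job):
        # placeholder for cutting / decoding / partial re-encoding of one segment
        return job["output_path"]

    def merge_pair(left_path, right_path):
        # placeholder for concatenating two already-compatible video segments
        return left_path + "+" + right_path

    def build_remix(jobs, max_workers=8):
        with ProcessPoolExecutor(max_workers=max_workers) as pool:
            segments = list(pool.map(process_segment, jobs))     # parallel editing jobs
            while len(segments) > 1:                              # binary-tree merging
                left, right = segments[0::2], segments[1::2]
                merged = list(pool.map(merge_pair, left[:len(right)], right))
                if len(left) > len(right):                        # carry an odd segment over
                    merged.append(left[-1])
                segments = merged
        return segments[0]

With eight segments, for example, the loop performs three merge rounds (8 -> 4 -> 2 -> 1), each round running its pairwise merges in parallel.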
  • the control unit 205, the job scheduler 206 and the merging unit 208 may be implemented as computer program codes executed in at least one processing device; e.g. in an end-user device or in one or more computing nodes of the server farm.
  • TID time-interval demand
  • a TID based generation of a video remix may include a workload manager, which receives the jobs from the job scheduler and assigns video segment cutting, decoding and re-encoding jobs to multiple workers; in this context, a worker can be a CPU or a CPU core on a server machine or on a computing node.
  • the workload manager uses an algorithm to share the workload such that the total execution time for remix generation is minimized, preferably within the demanded time-interval (i.e., execution time ≤ TID).
  • Fig. 4 shows an exemplary illustration of a system for a time-interval demand (TID) based generation of a video remix.
  • a user 300 requesting a remix or a software agent 300 based on user preference/profile 302 may signal the TID 304 to the workload manager 306 that assigns the video segment cutting and re-encoding jobs to multiple workers.
  • the user requesting a remix or the software agent based on user preference/profile may analyze the current work load 308 on the server 310 (or a server farm) for calculating the best suited TID and subsequently signal the TID to the workload manager that assigns the video segment cutting, decoding and re-encoding jobs to multiple workers.
  • the user or the software agent may use a further set of input parameters to derive a TID value that is optimal for generating the remix with the smallest possible delay without overloading the server farm.
  • the further set of input parameters for determining the TID value may include one or more of the following:
  • - User preference for quick response time in receiving the video remixes. For example, whether the user is a premium customer of the service or using the best-effort free version, whereby the premium customer is provided with a shorter TID.
  • the workload manager, after receiving the TID value, analyzes, based on the jobs 312 received from the job scheduler, the video editing timeline and sequence information 314. Based on the video editing timeline and sequence information, if the creation of the requested video remix 316 from the obtained individual video segment lengths seems to need a longer execution time than the requested TID value, the individual video segments may be divided further into shorter segments to enable faster parallel processing.
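A minimal sketch of this decision, assuming a simple linear processing-time model (the model, thresholds and function names are illustrative assumptions, not the patent's actual scheduling algorithm):

    # Illustrative sketch only: if the estimated parallel execution time exceeds
    # the requested TID, split the longest segments further to expose more parallelism.

    def estimate_execution_time(durations, n_workers, secs_per_content_sec=0.8):
        """Greedy longest-processing-time estimate of the parallel makespan."""
        loads = [0.0] * n_workers
        for d in sorted(durations, reverse=True):
            loads[loads.index(min(loads))] += d * secs_per_content_sec
        return max(loads)

    def fit_to_tid(durations, n_workers, tid):
        durations = list(durations)
        while estimate_execution_time(durations, n_workers) > tid:
            longest = max(durations)
            if longest <= 0.5:                  # do not split below ~0.5 s pieces
                break
            durations.remove(longest)           # halve the longest segment
            durations += [longest / 2.0, longest / 2.0]
        return durations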
  • Regarding the server load information: it is obvious that for each configuration of servers or computing nodes available for generating a video remix, there will be a limit on the amount of processing of multiple video segments that can be carried out simultaneously and in parallel. Based on the limit value and measurements of the prevailing load on the servers or computing nodes, the server load information is gathered and provided to the software agent that determines the target TID.
  • the total time to obtain the video remix would be a summation of analysis time (TA), if any, for video editing timeline/sequence and TID.
  • TTVR = TID + TA.
  • the source video content and context analysis may be performed for individual source videos prior to receiving the video remix generation request. Also, content and context analysis required to be performed on the group of source videos constituting the input for generating the remix may be performed incrementally with addition of individual source videos to the group. This approach separates the generation of data required for making the decisions about the timeline from the generation of sequence of video segments to be included in the video remix. Consequently, the TA component becomes a very small portion of the TTVR value, thereby enabling the service to have an estimate of the TTVR value based on the previously mentioned TID derivation parameters.
  • the server farm 310 delivers the output video remix 316 which can subsequently be delivered to the end user in any suitable manner.
  • the type of the video shots included in the segments is typically classified into one of three categories: long shots, medium shots, and close-ups.
  • a close-up shows a fairly small part of the scene, such as a character's face, or depicts human characters from the breast upwards, in such a detail that it almost fills the screen.
  • a lower frame line typically passes through the body of a human character from the waist down to include the whole body.
  • the human character and surrounding scene occupy roughly equal areas in the frame.
  • Long shots show all or most of a fairly large subject (e.g. a person) and usually much of the surroundings. This category comprises also extreme long shots, where the camera is at its furthest distance from the subject, emphasising the background.
  • the automatic video remixing service that combines the source videos in order to obtain a single video remix may utilise information about video-shot types and obstructing objects in the source videos to decide from which source videos the segments shall be selected for the video remix. Accordingly, the detected video-shot type is used to specify which videos to use in the individual segments so that the following conditions are met:
  • the information about the video-shot types and obstructing objects can be obtained by a method comprising
  • a depth map provides depth information of a 2-D image, where the 2-D image represents an actual 3-D scene.
  • a standard representation of a depth map is a 2-D array whose indices represent spatial coordinates and whose range (element values) conveys information about the depth, i.e., the distance from the scene to a plane defined by the capturing device. It is herein assumed that the depth can be interpreted as absolute distance (e.g., in meters).
  • depth maps can be obtained using other well-established methods, such as by using time-of-flight (ultrasonic, infrared, or laser) cameras.
  • the depth maps are obtained by interpreting stereo (or multiple) camera video recordings.
  • the depth map is obtained from a time-of-flight camera or from a combination of stereo (multi) view depth map and a time-of-flight camera.
  • the method used for computing or estimating depth maps is not relevant for the embodiments herein, but it is assumed that the source videos are provided with depth maps of some or of all video frames from these recordings.
  • the depth maps can have a different resolution than the video frames, and either linear or non-linear quantization can be used to encode the depth values. Regardless of this quantization, it is assumed that the depth values can be interpreted in terms of absolute distance of the scene to the sensor plane of the image/video/depth acquisition device. Let us denote the spatial coordinates of a depth map as x and y and the depth information as Z(x,y). Furthermore, the reliability of the corresponding depth values, R(x,y), may optionally be provided as a 2-D array with the same size as the depth map. In addition, the maximum depth that can be detected is denoted with Zmax. In order to carry out the detection of the type of the video shot and the obstructing objects, the depth maps of the corresponding video frames are analysed.
  • ROIs regions of interest
  • Figure 5 gives an illustration of using 25 rectangular ROIs.
  • the spatial shape and size of the ROIs can be arbitrary, and it is not limited to rectangular shapes.
  • the only requirement for the selection of these ROIs is that there should be one ROI selected as a central ROI and at least one other ROI.
  • the depth within each ROI is extracted.
  • One method to accomplish this is to perform weighted averaging over each ROI:
  • Z_k = ( Σ_{(x,y) ∈ ROI(k)} W(x,y) · Z(x,y) ) / ( Σ_{(x,y) ∈ ROI(k)} W(x,y) ), where ROI(k) contains the spatial coordinates of the k-th ROI, and the reliability measures R(x,y) are used as the weights W(x,y) if they are available; otherwise the weights W(x,y) are set to unity (i.e., corresponding to plain averaging of the depth values).
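Assuming the depth map and the optional reliability map are available as 2-D arrays, the per-ROI depth described above could be computed, for example, as follows (a sketch, not the patent's implementation):

    # Illustrative sketch only: weighted average depth per ROI, using the
    # reliability map as weights when it is available.
    import numpy as np

    def roi_depths(depth, rois, reliability=None):
        """depth: 2-D array Z(x, y); rois: list of boolean masks, one mask per ROI."""
        weights = reliability if reliability is not None else np.ones_like(depth)
        return [float((weights[m] * depth[m]).sum() / weights[m].sum()) for m in rois]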
  • Figure 6 shows a possible implementation for detecting the type of video shots on the basis of the depth map of the recorded scene.
  • the depth values of all ROIs are obtained, for example in the manner described above.
  • If the depth values of all ROIs do not meet the criteria of a close-up shot, then it is examined (606) whether the depth values of all ROIs meet the criteria of a medium shot. Accordingly, if the criteria for a close-up are not met and at least N_medium ROIs (a predefined threshold, where N_medium < N_close-up) have a depth substantially similar to the depth of the central ROI, the video shot is detected as a medium shot; otherwise it is detected as a long shot.
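The decision logic could be sketched as below; the thresholds n_closeup and n_medium, the depth tolerance, and the omission of the spatial-proximity check are illustrative assumptions rather than values given in the patent:

    # Illustrative sketch only: classify the shot by counting ROIs whose depth is
    # close to the depth of the central ROI (spatial-proximity criterion omitted).

    def classify_shot(depths, central_index, n_closeup=12, n_medium=5, tolerance=0.5):
        central = depths[central_index]
        similar = sum(1 for i, z in enumerate(depths)
                      if i != central_index and abs(z - central) <= tolerance)
        if similar >= n_closeup:
            return "close-up"
        if similar >= n_medium:
            return "medium"
        return "long"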
  • Figure 7 shows a possible implementation for detecting objects that obstruct the view on the basis of the depth map of the recorded scene.
  • the depth values of all ROIs are obtained, for example in the manner described above.
  • the implementation relies on a prior knowledge of the expected location of obstructing objects. For example, when recording an event in a crowded area, obstructing objects are often the people who are between the camera and the scene of interest and these people occupy the lower portion of the scene. Therefore, based on this information or assumption about the video recording, the expected location of obstructing objects can be defined (702).
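A sketch of that comparison, with the ROI indices of the expected obstruction location and the depth-difference threshold chosen purely for illustration:

    # Illustrative sketch only: an obstruction is flagged when the ROIs at the
    # expected obstruction location are much closer to the camera than the rest.

    def obstruction_detected(depths, expected_indices, min_difference=2.0):
        expected = [depths[i] for i in expected_indices]
        others = [z for i, z in enumerate(depths) if i not in expected_indices]
        avg_expected = sum(expected) / len(expected)
        avg_others = sum(others) / len(others)
        return (avg_others - avg_expected) >= min_difference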
  • detecting objects that obstruct the view scene in video recordings may be based on a change in the video-shot type.
  • momentary changes in the video shot type are detected; it is observed whether there is a change in the video-shot type with a duration that is less than a predefined threshold.
  • the following cases are considered as cases of objects obstructing the view: if after a long shot, there appears a close-up or a medium shot with duration shorter than said predefined threshold, or if after a medium shot, there appears a close-up with duration shorter than said predefined threshold.
  • the above cases are considered to include the scenario when an object momentarily obstructs the view to the desired scene.
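Expressed as a small sketch over a sequence of already-classified shot runs (the duration threshold is an assumed example value, not from the patent):

    # Illustrative sketch only: flag a momentary change to a "tighter" shot type
    # (long -> medium/close-up, or medium -> close-up) shorter than max_duration.

    RANK = {"long": 0, "medium": 1, "close-up": 2}

    def momentary_obstructions(shot_runs, max_duration=2.0):
        """shot_runs: list of (shot_type, duration_in_seconds) in temporal order."""
        events = []
        for prev, cur in zip(shot_runs, shot_runs[1:]):
            if RANK[cur[0]] > RANK[prev[0]] and cur[1] < max_duration:
                events.append(cur)
        return events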
  • a person or vehicle passing in front of the camera can be such an obstructing object.
  • the detection and the indexing may be carried out for video segments of either fixed or variable length, thereby accommodating for changes in the video shot type or appearance of obstructing objects during the video recording.
  • the indexing of the video with the detected video-shot type and obstructing objects may be performed by assigning a timestamp (relative to the beginning of the video recording) to the detected events and transmitting this information as video metadata.
  • the depth map may be utilised in many further embodiments.
  • the depth map is used to filter out any content with objects whose distance is beyond predefined threshold(s). There may be a minimum distance to be exceeded or a maximum distance not to be exceeded.
  • the video segments with depth greater than the maximum distance or less than the minimum distance may be labeled as "too far content” or “too near content”, respectively. This labeling information may be utilised by different applications like multimedia search, multimedia tagging, automatic remixing, etc.
  • a plurality of end-user image/video capturing devices may be present at an event. For example, this can automatically be detected based on substantially similar location information (e.g., from GPS or any other positioning system) or via the presence of a common audio scene. Then the depth maps from the end-user devices may be used to determine the type of event. For example, if the depth maps of multiple end-user devices are static or change within a threshold for the temporal window under consideration, this may be used to determine that the event involves a static viewing area. A rapidly changing depth map, with changes above a predefined threshold, may be used to determine that the event is an event with free movement of the users. A depth map that is observed to change less than a predefined threshold may be used to determine that the event is an event with restricted movement of users.
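One possible way to express this classification (the thresholds and the frame-to-frame change measure are assumptions for illustration only):

    # Illustrative sketch only: classify the event from how much a device's depth
    # maps change over the temporal window under consideration.
    import numpy as np

    def classify_event(depth_maps_over_time, static_thr=0.1, restricted_thr=1.0):
        """depth_maps_over_time: list of at least two 2-D depth arrays from one device."""
        stack = np.stack(depth_maps_over_time)
        change = float(np.mean(np.abs(np.diff(stack, axis=0))))  # mean frame-to-frame change
        if change < static_thr:
            return "static viewing area"
        if change < restricted_thr:
            return "restricted movement of users"
        return "free movement of users"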
  • the depth map and orientation information from a plurality of end-user devices present at an event may be used to determine the relative position of the users at the event. If the orientation of at least two users is within a threshold and their depth maps have a pattern indicating similar object boundaries, the difference in their depth maps may be used to determine their relative position to each other and also with relation to the similar object pattern observed in the depth maps.
  • Objects of interest such as a face
  • the depth at the center of the detected object boundary may be compared with a predefined threshold in order to determine whether the object is too near or too far for being of interest to a wider audience, or whether it is an object of personal interest. If the same object boundary pattern is detected within a temporal window threshold by more than one end-user device at an event, the end-user devices being at an orientation value within a predefined threshold, the distance between the users can be approximated based on the difference in the depth map corresponding to the center of the object.
  • the video remix generation system using a cluster of computing nodes or a server farm in parallel may reduce the time to generate the video remix.
  • the video remix generation time does not increase in direct proportion to the duration of the video remix.
  • the video remix generation time can be controlled based on server load and/or available server hardware.
  • Providing customizable (e.g. based on payment profile) video remix time estimates as well as personalized video remix availability time estimates may improve the user experience.
  • Detecting video-shot types and detecting obstructing objects can be performed without computationally-expensive video content-analysis. Depending on the choice of ROIs, the complexity of the detection may be reduced in order to enable implementation on a resource-limited portable device.
  • the reliability of the detection of semantic information from content recorded at the events may be improved by exploiting the depth information.
  • a terminal device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the terminal device to carry out the features of an embodiment.
  • a network device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.
  • the various devices may be or may comprise encoders, decoders and transcoders, packetizers and depacketizers, and transmitters and receivers.

Abstract

A method and related apparatus for creating a video remix, the method comprising obtaining a plurality of source content in a processing device; determining a plurality of segments from the source content to be included in the video remix; determining editing processes required to transform the plurality of segments into form suitable for the video remix; allocating said editing processes to be executed in parallel in at least one processing device; and merging the plurality of segments received from said editing processes into the video remix.

Description

VIDEO REMIXING SYSTEM
Background
Video remixing is an application where multiple video recordings are combined in order to obtain a video mix that contains some segments selected from the plurality of video recordings. Video remixing, as such, is one of the basic manual video editing applications, for which various software products and services are already available. Furthermore, there exist automatic video remixing or editing systems, which use multiple instances of user-generated or professional recordings to automatically generate a remix that combines content from the available source content. Some automatic video remixing systems depend only on the recorded content, while others are capable of utilizing environmental context data that is recorded together with the video content. The context data may be, for example, sensor data received from a compass, an accelerometer, or a gyroscope, or GPS location data.
Video remixing is computationally a demanding task, especially when multiple recordings, possibly encoded in different, non-compatible file formats, are used as source content. Obtaining a desired resultant video remix may be significantly delayed due to the bottlenecks of the video remixing system. Therefore, a more efficient video remixing system is needed.
Summary
Now there has been invented an improved method and technical equipment implementing the method. Various aspects of the invention include a method, an apparatus, a system and a computer program, which are characterized by what is stated in the independent claims. Various embodiments of the invention are disclosed in the dependent claims.
According to a first aspect, there is provided a method for creating a video remix, the method comprising: obtaining a plurality of source content in a processing device; determining a plurality of segments from the source content to be included in the video remix; determining editing processes required to transform the plurality of segments into form suitable for the video remix; allocating said editing processes to be executed in parallel in at least one processing device; and merging the plurality of segments received from said editing processes into the video remix.
According to an embodiment, the source content comprises at least one of video, audio and/or image, and said editing processes comprise at least one of the following:
- cutting at least one source content into plurality of segments;
- decoding at least a part of a segment of a source content;
- encoding at least a part of a segment of a source content.

According to an embodiment, the method further comprises receiving a user request for creating a video remix, said user request including a request to create the video remix within a time period; determining an optimal allocation of the editing processes such that the editing processes are optimized according to available processing power of said at least one processing device and the video remix can be created within said time period; and allocating said editing processes to be executed in parallel in at least one processing device according to said optimal allocation.

According to an embodiment, the method further comprises obtaining depth maps for at least some frames of a source video; detecting a type of a video shot and/or an object obstructing a view in the source video based on the depth map; and indexing the source video according to the detected type of a video shot and/or the detected object obstructing a view.
According to an embodiment, the method further comprises analysing the depth map of a frame by dividing the depth map of the frame into at least two non-overlapping region-of-interests, one of them being a central region-of-interest, and calculating the depth of each region-of-interest as a weighted average value of the depth, wherein the weighting is based on reliability values of the depth map.

According to an embodiment, the method further comprises detecting the type of the video shot included in the source video to a close-up shot, a medium shot or a long shot by comparing the depth of the central region-of-interest to the depths of the remaining region-of-interests, the criteria for detecting the type of the video shot including at least the number of region-of-interests having a substantially similar depth to the depth of the central region-of-interest and residing within a predefined distance from the central region-of-interest.
According to an embodiment, the method further comprises detecting the object obstructing the view in the source video on the basis of a difference between an averaged depth for region-of-interests having depth substantially at the depth of expected location of obstructing objects and an averaged depth of the remaining region-of-interests.
According to a second aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to at least: obtain a plurality of source content; determine a plurality of segments from the source content to be included in a video remix; determine editing processes required to transform the plurality of segments into form suitable for the video remix; allocate said editing processes to be executed in parallel in at least one processing device; and merge the plurality of segments received from said editing processes into the video remix.
According to a third aspect, there is provided a computer program embodied on a non-transitory computer readable medium, the computer program comprising instructions causing, when executed on at least one processor, at least one apparatus to: obtain a plurality of source content; determine a plurality of segments from the source content to be included in a video remix; determine editing processes required to transform the plurality of segments into form suitable for the video remix; allocate said editing processes to be executed in parallel in at least one processing device; and merge the plurality of segments received from said editing processes into the video remix.
According to a fourth aspect, there is provided a system comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the system to at least: obtain a plurality of source content; determine a plurality of segments from the source content to be included in a video remix; determine editing processes required to transform the plurality of segments into form suitable for the video remix; allocate said editing processes to be executed in parallel in at least one processing device; and merge the plurality of segments received from said editing processes into the video remix. These and other aspects of the invention and the embodiments related thereto will become apparent in view of the detailed disclosure of the embodiments further below.
List of drawings
In the following, various embodiments of the invention will be described in more detail with reference to the appended drawings, in which
Figs. 1a and 1b show a system and devices suitable to be used in an automatic video remixing service according to an embodiment;
Fig. 2 shows a block chart of an implementation embodiment for the automatic video remixing service;
Fig. 3 shows a partial re-encoding method of a video segment according to an embodiment;
Fig. 4 shows a system for a time-interval demand based generation of a video remix according to an embodiment;
Fig. 5 shows an example for positioning a number of non-overlapping regions of interest (ROIs) in the depth maps of the video frames;
Fig. 6 shows a flow chart of an embodiment for detecting the type of video shots on the basis of the depth map of the recorded scene;
Fig. 7 shows a flow chart of an embodiment for detecting objects that obstruct the view on the basis of the depth map of the recorded scene.
Description of embodiments
As is generally known, many contemporary portable devices, such as mobile phones, cameras, and tablets, are provided with high quality cameras, which enable capturing high quality video files and still images. In addition to the above capabilities, such handheld electronic devices are nowadays equipped with multiple sensors that can assist different applications and services in contextualizing how the devices are used. Sensor (context) data and streams of such data can be recorded together with the video or image or other modality of recording (e.g. speech).
Usually, at events attended by a lot of people, such as live concerts, sport games, social events, there are many who record still images and videos using their portable devices. The above-mentioned sensors may even automatically trigger an image/video capture of an interesting moment, if detected by a sensor. For example, a loud voice in a party may be an acoustic trigger for a video capture, or people turning suddenly to another direction may be an orientation trigger, received from a compass sensor of the portable device, for a video capture. Spatially nearby portable devices may collaboratively identify an event, and at least locate the portable device with the best view of this event. The devices recording the content may be disconnected from other devices, but share the recorded source content and the corresponding sensor data, which is pooled together in a file server or any such suitable mechanism for generating the automatic remix. Recordings of the attendants from such events, possibly together with various sensor information, provide a suitable framework for the present invention and its embodiments.
Figs. 1a and 1b show a system and devices suitable to be used in an automatic video remixing service according to an embodiment. In Fig. 1a, the different devices may be connected via a fixed network 210 such as the Internet or a local area network, or a mobile communication network 220 such as the Global System for Mobile communications (GSM) network, 3rd Generation (3G) network, 3.5th Generation (3.5G) network, 4th Generation (4G) network, Wireless Local Area Network (WLAN), Bluetooth®, or other contemporary and future networks. Different networks are connected to each other by means of a communication interface 280. The networks comprise network elements such as routers and switches to handle data (not shown), and communication interfaces such as the base stations 230 and 231 in order to provide access for the different devices to the network; the base stations 230, 231 are themselves connected to the mobile network 220 via a fixed connection 276 or a wireless connection 277.
There may be a number of servers connected to the network; in the example of Fig. 1a, servers 240, 241 and 242 are shown, each connected to the mobile network 220, which servers may be arranged to operate as computing nodes (i.e. to form a cluster of computing nodes or a so-called server farm) for the automatic video remixing service. Some of the above devices, for example the computers 240, 241, 242, may be arranged to make up a connection to the Internet with the communication elements residing in the fixed network 210.
There are also a number of end-user devices such as mobile phones and smart phones 251, Internet access devices (Internet tablets) 250, personal computers 260 of various sizes and formats, televisions and other viewing devices 261, video decoders and players 262, as well as video cameras 263 and other encoders. These devices 250, 251, 260, 261, 262 and 263 can also be made of multiple parts. The various devices may be connected to the networks 210 and 220 via communication connections such as a fixed connection 270, 271, 272 and 280 to the internet, a wireless connection 273 to the internet 210, a fixed connection 275 to the mobile network 220, and a wireless connection 278, 279 and 282 to the mobile network 220. The connections 271-282 are implemented by means of communication interfaces at the respective ends of the communication connection. Fig. 1b shows devices for automatic video remixing according to an example embodiment. As shown in Fig. 1b, the server 240 contains memory 245, one or more processors 246, 247, and computer program code 248 residing in the memory 245 for implementing, for example, automatic video remixing. The different servers 241, 242, 290 may contain at least these elements for employing functionality relevant to each server.
Similarly, the end-user device 251 contains memory 252, at least one processor 253 and 256, and computer program code 254 residing in the memory 252 for implementing, for example, gesture recognition. The end-user device may also have one or more cameras 255 and 259 for capturing image data, for example stereo video. The end-user device may also contain one, two or more microphones 257 and 258 for capturing sound. The end-user device may also contain sensors for generating the depth information using any suitable technology. The different end-user devices 250, 260 may contain at least these same elements for employing functionality relevant to each device. In another embodiment of this invention, the depth maps (i.e. depth information regarding the distance from the scene to a plane defined by the camera) obtained by interpreting video recordings from stereo (or multiple) cameras may be utilised in the video remixing system. The end-user device may also have a time-of-flight camera, whereby the depth map may be obtained from a time-of-flight camera or from a combination of a stereo (or multiple) view depth map and a time-of-flight camera. The end-user device may generate a depth map for the captured content using any available and suitable mechanism. The end-user devices may also comprise a screen for viewing single-view, stereoscopic (2-view), or multiview (more-than-2-view) images. The end-user devices may also be connected to video glasses 290, e.g. by means of a communication block 293 able to receive and/or transmit information. The glasses may contain separate eye elements 291 and 292 for the left and right eye. These eye elements may either show a picture for viewing, or they may comprise a shutter functionality, e.g. to block every other picture in an alternating manner to provide the two views of a three-dimensional picture to the eyes, or they may comprise orthogonal polarization filters (compared to each other), which, when combined with similar polarization realized on the screen, provide the separate views to the eyes. Other arrangements for video glasses may also be used to provide stereoscopic viewing capability. Stereoscopic or multiview screens may also be autostereoscopic, i.e. the screen may comprise or may be overlaid by an optics arrangement which results in a different view being perceived by each eye. Single-view, stereoscopic, and multiview screens may also be operationally connected to viewer tracking in such a manner that the displayed views depend on the viewer's position, distance, and/or direction of gaze relative to the screen.
It needs to be understood that different embodiments allow different parts to be carried out in different elements. For example, parallelized processes of the automatic video remixing may be carried out in one or more processing devices; i.e. entirely in one user device like 250, 251 or 260, or in one server device 240, 241, 242 or 290, or across multiple user devices 250, 251, 260, or across multiple network devices 240, 241, 242, 290, or across both user devices 250, 251, 260 and network devices 240, 241, 242, 290. The elements of the automatic video remixing process may be implemented as a software component residing on one device or distributed across several devices, as mentioned above, for example so that the devices form a so-called cloud. An embodiment relates to a method for performing parallel video cutting, re-encoding, and merging of video segments within an automatic video remixing service, i.e. an editing service. The service is implemented in at least one, but preferably in a plurality of computing nodes (i.e. a cluster of computing nodes or a server farm), which are able to execute more than one process or thread in parallel. The automatic video remixing service is supplied with one or more video recordings and information regarding suitable cutting points of desired segments from the video recordings. The information regarding the suitable cutting points of segments can be provided in various ways. The cutting points may be obtained a priori via any suitable method (e.g., by content analysis of the source videos or even manually, from a human input) and then supplied to the video remixing service along with the one or more video recordings. In addition to that, more cutting points may be utilized by the video remixing service by directly analyzing the one or more video recordings or specific contextual information associated with them. The video remixing service may analyse the video recordings either without any additional information or by exploiting contextual information such as sensor (gyroscope, accelerometer, compass or other sensors) data recorded simultaneously with the source videos. Embodiments relating to such analysis will be described more in detail further below.
For carrying out the creation of the actual remix, a list of desired segments is created and on the basis of the list, a job is created, which may comprise cutting the source videos into desired segments, decoding of at least one desired segment in case the source video is already encoded and/or video encoding of at least one desired segment such that it starts with an intra-coded frame.
According to an embodiment, if the source videos are already encoded with a desired video encoder, the cutting and the re-encoding are done in such a manner that a cut segment is not fully re-encoded but only the frames that are in between the desired cutting location and the location of the following intra coded frame are encoded. If the desired cutting location is pointing to an intra coded frame, then re-encoding of the segment is not performed.
According to another embodiment, if at least one desired segment is to be totally re-encoded, then additional cutting points may be allocated in order to ensure that the maximum segment duration is smaller than a predefined threshold. The additional cutting points may improve the parallelization of the cutting and the re-encoding. When the necessary one or more jobs have been defined, they are sent to a job scheduler implemented in a computing node for parallel execution. After all jobs have finished, the merging of the segments may also be performed in parallel, for example by following a binary-tree path, where, in each step, pairs of consecutive segments are merged, and this is continued until the final video remix has been created.
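Purely as an illustration, the following Python sketch shows one way the binary-tree merging described above could be organized; the merge_pair() helper is a hypothetical placeholder for an actual container-level concatenation step and is not part of the described service.

    # Illustrative sketch only: pairwise (binary-tree) merging of finished segments.
    from concurrent.futures import ThreadPoolExecutor

    def merge_pair(a, b):
        # Placeholder: concatenate two segment files and return the resulting file name.
        return f"{a}+{b}"

    def binary_tree_merge(segments, executor):
        """Merge consecutive segments pairwise until a single remix remains."""
        level = list(segments)
        while len(level) > 1:
            pairs = [(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
            futures = [executor.submit(merge_pair, a, b) for a, b in pairs]
            merged = [f.result() for f in futures]
            if len(level) % 2:          # an odd last segment is carried to the next level unchanged
                merged.append(level[-1])
            level = merged
        return level[0]

    if __name__ == "__main__":
        with ThreadPoolExecutor(max_workers=4) as pool:
            print(binary_tree_merge(["s1", "s2", "s3", "s4", "s5"], pool))

Each level of the tree merges pairs of consecutive segments in parallel, so the number of merge rounds grows only logarithmically with the number of segments.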
The implementation of the parallel video cutting and re-encoding of video segments as described above is now illustrated in more detail by referring to Figure 2, which discloses an example of the implementation for the automatic video remixing service. The automatic video remixing service comprises a control unit 205 for determining the desired video remix and the segments to be included therein. As the input data for the video remixing service, there is provided a plurality of source videos 201, 202, 203, 204 (Video 1 - Video 4), which may, but need not, be encoded, for example, by any known video coding standard, such as MPEG-2, MPEG-4, H.264/AVC, etc. The source videos may originate from one or more end-user devices or they may be loaded from a computer or a server connected to a network. Additionally, the control unit 205 may be provided with or be arranged to determine a list of desired segments to cut and subsequently to merge into the final video remix. The items of the list of segments preferably contain information about the source video, the starting time or the frame number of the segment to be cut, and the duration of the segment, either in time or in number of frames.
As can be seen in Figure 2, the source videos may be more or less overlapping in the time domain. Therefore, at least for those overlapping periods, priorities could be assigned to the items in the list of segments. According to an embodiment, this could be achieved by sorting the list by the duration of the segments in descending order. If the source videos are already encoded with a desired video encoder, the need for re-encoding is determined by the frame type of the first frame of the segment to be cut. If the first frame of the desired cutting location is an intra coded frame, then there is no need for any re-encoding of the segment. If the first frame of the desired cutting location is a predicted frame, then the cutting and the re-encoding are carried out such that the cut segment is only partially re-encoded, according to the principle that only the frames between the desired cutting location and the location of the following intra coded frame are encoded.
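As a non-limiting illustration, the list of desired segments and the duration-based prioritisation mentioned above could be represented as follows; the field names are hypothetical and chosen only for this sketch.

    # Hypothetical representation of the list of desired segments.
    from dataclasses import dataclass

    @dataclass
    class SegmentItem:
        source_video: str   # identifier of the source video
        start: float        # start of the segment in seconds (could also be a frame number)
        duration: float     # duration in seconds (could also be a number of frames)

    segments = [
        SegmentItem("video1", 12.0, 4.5),
        SegmentItem("video3", 12.5, 9.0),
        SegmentItem("video2", 13.0, 2.0),
    ]

    # One possible prioritisation for overlapping periods: longest segments first.
    prioritised = sorted(segments, key=lambda s: s.duration, reverse=True)
    for item in prioritised:
        print(item)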
This is illustrated in Figure 3, wherein a source video comprises at least the frames 300 - 328, the frames 300, 310, 320 and 326 being intra frames and the rest of the frames being predicted frames. The segment to be cut in this example is the frames 304 - 322, i.e. the segment starts from a predicted frame and the first intra frame is the frame 310. Thus, only the frames 304, 306 and 308 are decoded and re-encoded such that the first frame 304 is encoded as an intra frame. The remaining part of the segment, i.e. the frames 310 - 322, is included in the segment without any modifications.
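The selection of the frames needing re-encoding can be sketched as follows, assuming the frame numbering and intra-frame positions of Figure 3; no actual codec is invoked, so this is only a schematic of the selection logic.

    # Sketch of the partial re-encoding decision for a cut inside an encoded source video.
    def frames_to_reencode(frames, cut_frame, intra_frames):
        """Return the frames that must be decoded and re-encoded for a cut at cut_frame."""
        if cut_frame in intra_frames:
            return []                                       # cut lands on an intra frame: nothing to re-encode
        next_intra = min((f for f in intra_frames if f > cut_frame), default=None)
        # Frames from the cut point up to (but excluding) the next intra frame are re-encoded,
        # the first of them as a new intra frame; the rest of the segment is copied as-is.
        return [f for f in frames if f >= cut_frame and (next_intra is None or f < next_intra)]

    segment_frames = list(range(304, 324, 2))               # frames 304 - 322 of Figure 3
    print(frames_to_reencode(segment_frames, 304, [300, 310, 320, 326]))   # -> [304, 306, 308]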
If a source video is not encoded or it is encoded but not with the desired video encoder, then all desired segments from said source video need to be re-encoded. According to an embodiment, additional cutting points may be allocated in the segments in order to ensure that the maximum segment duration is smaller than a predefined threshold, Ts. The threshold Ts can be set such that the minimum processing time would be equal to the encoding time of a segment with duration Ts. This typically leads to a relatively short time interval (e.g., 0.5 - 1 sec) for the duration Ts.
According to another embodiment, the value for the threshold Ts may be defined from the perspective of the optimal utilization of the processing power of the computing nodes. Let us denote the maximum number of processes that can be executed in parallel as Np; accordingly, for a cluster with X computing nodes, each having Y CPUs, Np = X*Y. In this case, Ts is set so that the overall number of segments is not smaller than Np. Each segment whose duration is greater than Ts is split into segments with durations shorter than or equal to Ts. According to an embodiment, if the source videos contain auxiliary information, such as sensor data preferably recorded simultaneously with the video and having timestamps synchronized with it, the additional cutting points can be introduced at or close to estimated scene changes, wherein the existence of scene changes is estimated based on the sensor data. For example, the scene changes may be detected using the context sensor data (e.g., a gyroscope, an accelerometer, or a compass of the recording device), and additional cutting points may be introduced at or close to the estimated scene changes.
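One possible, purely illustrative way to choose Ts so that the number of segments is not smaller than Np = X*Y is to set Ts to the total duration divided by Np and then split any longer segment, as in the following sketch (durations are assumed to be given in seconds).

    # Sketch of threshold-based splitting; the choice Ts = total / Np is one option, not prescribed above.
    def split_for_parallelism(durations, num_nodes, cpus_per_node):
        """Split segments so that their count is at least Np = X*Y and none exceeds Ts."""
        np_limit = num_nodes * cpus_per_node                # Np = X * Y
        ts = sum(durations) / np_limit                      # one possible choice of Ts
        pieces = []
        for d in durations:
            while d > ts:                                   # cut off Ts-long pieces until the remainder fits
                pieces.append(ts)
                d -= ts
            if d > 0:
                pieces.append(d)
        return ts, pieces

    ts, pieces = split_for_parallelism([10.0, 3.0, 7.5], num_nodes=2, cpus_per_node=4)
    print(ts, len(pieces), pieces)                          # at least Np = 8 pieces, none longer than Ts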
Following the priorities/order, for each segment, the control unit 205 creates a job that comprises at least one of the following editing processes: cutting the source videos into desired segments, video decoding of the desired segment (only in case the source video is already encoded) and/or video encoding of the desired segment so that it starts with an intra-coded frame.
The control unit 205 sends the obtained jobs to a job scheduler 206 controlling the execution of the jobs in parallel in the computing nodes. The job scheduler 206 distributes individual tasks (processes) for parallel execution in at least one processing device, but preferably in several nodes of the server farm 207. The parallel execution may comprise any of the tasks of cutting, decoding and re-encoding. After all jobs have finished, the merging of the segments is also performed by a merging unit 208 in parallel by following a binary-tree path, where, in each step, pairs of consecutive segments are merged, and this is continued until the final output video remix 209 has been created. The control unit 205, the job scheduler 206 and the merging unit 208 may be implemented as computer program code executed in at least one processing device, e.g. in an end-user device or in one or more computing nodes of the server farm.

In the automatic video remixing service described above, it would be beneficial to provide the customers with a time estimate for creating a video remix. It would also be beneficial to enable a customer, for example a priority customer, to request a video remix to be created within a certain period of time. According to an embodiment, these needs are addressed by a method for a time-interval demand (referred to as TID herein below) based generation of a video remix using the source videos and the context data corresponding to the source videos. A TID based generation of a video remix may include a workload manager, which receives the jobs from the job scheduler and assigns video segment cutting, decoding and re-encoding jobs to multiple workers; in this context, a worker can be a CPU or a CPU core on a server machine or on a computing node. The workload manager uses an algorithm to share the workload such that the total execution time for remix generation is minimized, preferably within the demanded time-interval (i.e., execution time < TID).
Figure 4 shows an exemplary illustration of a system for time-interval demand (TID) based generation of a video remix. In the system, a user 300 requesting a remix, or a software agent 300 based on a user preference/profile 302, may signal the TID 304 to the workload manager 306 that assigns the video segment cutting and re-encoding jobs to multiple workers. Alternatively, the user requesting a remix or the software agent based on the user preference/profile may analyze the current workload 308 on the server 310 (or a server farm) for calculating the best suited TID and subsequently signal the TID to the workload manager that assigns the video segment cutting, decoding and re-encoding jobs to multiple workers. In addition to the server load information, the user or the software agent may use a further set of input parameters to derive a TID value that is optimal for generating the remix with the smallest possible delay without overloading the server farm. The further set of input parameters for determining the TID value may include one or more of the following:
- User preference for quick response time in receiving the video remixes.
- User payment profile information. For example, whether the user is a premium customer of the service or using the best-effort free version, whereby the premium customer is provided with a shorter TID.
- User's current presence status. For example, if the user's status is observed to be "inactive" or "do not disturb", a longer TID for video remix generation may be sufficient.
The workload manager, after receiving the TID value, analyzes, based on the jobs 312 received from the job scheduler, the video editing timeline and sequence information 314. Based on the video editing timeline and sequence information, if the creation of the requested video remix 316 from the obtained individual video segment lengths seems to need a longer execution time than the requested TID value, the individual video segments may be divided further into shorter segments to enable faster parallel processing.
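A rough, hypothetical sketch of this further splitting is given below; it assumes that processing time is proportional to segment duration (speed_factor) and uses a simple greedy load estimate, neither of which is prescribed by the description above.

    # Sketch only: split the longest segment until the estimated parallel makespan fits the TID.
    def fit_into_tid(durations, num_workers, tid, speed_factor=1.0, analysis_time=0.0, min_len=0.5):
        segs = list(durations)
        while True:
            loads = [0.0] * num_workers
            for d in sorted(segs, reverse=True):          # greedy longest-first load estimate
                loads[loads.index(min(loads))] += d * speed_factor
            makespan = max(loads) + analysis_time         # rough TTVR estimate (TID part + TA)
            longest = max(segs)
            if makespan <= tid or longest <= min_len:     # fits, or cannot usefully split further
                return segs, makespan
            segs.remove(longest)
            segs += [longest / 2, longest / 2]            # divide further for faster parallel processing

    segs, ttvr = fit_into_tid([8.0, 6.0, 4.0], num_workers=4, tid=5.0)
    print(len(segs), round(ttvr, 2))                      # e.g. 6 segments, estimated completion of 5.0 s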
Regarding the server load information, it is obvious that for each configuration of servers or computing nodes available for generating a video remix, there will be a limit on the amount of processing of multiple video segments that can be carried out simultaneously and in parallel. Based on the limit value and measurements of the prevailing load on the servers or computing nodes, the server load information is gathered and provided to the software agent that determines the target TID.
The total time to obtain the video remix (TTVR) would be the sum of the analysis time (TA), if any, for the video editing timeline/sequence and the TID.
TTVR = TID + TA.
The source video content and context analysis may be performed for individual source videos prior to receiving the video remix generation request. Also, content and context analysis required to be performed on the group of source videos constituting the input for generating the remix may be performed incrementally with addition of individual source videos to the group. This approach separates the generation of data required for making the decisions about the timeline from the generation of sequence of video segments to be included in the video remix. Consequently, the TA component becomes a very small portion of the TTVR value, thereby enabling the service to have an estimate of the TTVR value based on the previously mentioned TID derivation parameters.
After the video remix has been generated, the server farm 310 produces the output video remix 316, which can subsequently be delivered to the end user in any suitable manner.
When performing automatic video remixing from multiple source videos, it would be beneficial to know the type of the video shots included in the segments. In cinematography, video shots are typically classified into one of three categories: long shots, medium shots, and close-ups.
A close-up shows a fairly small part of the scene, such as a character's face, or depicts human characters from the breast upwards, in such detail that it almost fills the screen. In a medium shot, the lower frame line typically passes through the body of a human character somewhere between the waist and a framing that includes the whole body. In a medium shot, the human character and the surrounding scene occupy roughly equal areas in the frame. Long shots show all or most of a fairly large subject (e.g. a person) and usually much of the surroundings. This category also comprises extreme long shots, where the camera is at its furthest distance from the subject, emphasising the background. This information enables proper switching between video segments with compatible views, such as between a long shot and a close-up, and avoids switching between non-compatible views, such as between two long shots. According to an embodiment, the automatic video remixing service that combines the source videos in order to obtain a single video remix may utilise information about video-shot types and obstructing objects in the source videos to decide from which source videos the segments shall be selected for the video remix. Accordingly, the detected video-shot type is used to specify which videos to use in the individual segments so that the following conditions are met:
- View switching from a close-up to another close-up or to a medium shot or to a long shot.
- View switching from a medium shot to a close-up or to a long shot.
- View switching from a long shot to a medium shot or to a close-up.
In addition to these rules, it is possible to use further, possibly user- specified, rules to select the allowed video-shot type. For example, switching from a close-up to another close-up can be disabled.
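The switching conditions listed above, together with optional user-specified restrictions, could for example be encoded as a simple transition table; the following sketch is illustrative only.

    # Sketch of the view-switching rules as a transition table; forbidding a switch between
    # two long shots follows from the conditions listed above.
    ALLOWED_SWITCHES = {
        "close-up": {"close-up", "medium", "long"},
        "medium":   {"close-up", "long"},
        "long":     {"close-up", "medium"},
    }

    def switch_allowed(current_shot, next_shot, user_rules=None):
        """Check whether a cut from current_shot to next_shot is permitted."""
        allowed = set(ALLOWED_SWITCHES[current_shot])
        if user_rules:                          # e.g. {"close-up": {"close-up"}} disables that switch
            allowed -= user_rules.get(current_shot, set())
        return next_shot in allowed

    print(switch_allowed("long", "long"))                                       # False
    print(switch_allowed("close-up", "close-up", {"close-up": {"close-up"}}))   # False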
According to an embodiment, the information about the video-shot types and obstructing objects can be obtained by a method comprising
- detecting the type of video shots (close-up, medium shot, or long shot) based on a depth map of the recorded scene;
- detecting objects that obstruct the view (i.e., objects that are not desired and impede the view of the recorded video) based on a depth map of the recorded scene; and
- indexing the corresponding video with the detected events mentioned above.
A depth map provides depth information of a 2-D image, where the 2-D image represents an actual 3-D scene. A standard representation of a depth map is a 2-D array whose indices represent spatial coordinates and whose range (element values) conveys information about the depth, i.e., the distance from the scene to a plane defined by the capturing device. It is herein assumed that the depth can be interpreted as an absolute distance (e.g., in meters).
There are several methods for computing or estimating depth maps, known as such. Many methods enable computing the depth maps in real time, which is required for TV broadcasting. As mentioned above, portable devices with digital stereo (or multiple) cameras and/or camcorders are able to perform depth map estimation. Furthermore, depth maps can be obtained using other well-established methods, such as by using time-of-flight (ultrasonic, infrared, or laser) cameras. According to an embodiment, the depth maps are obtained by interpreting stereo (or multiple) camera video recordings. According to another embodiment, the depth map is obtained from a time-of-flight camera or from a combination of a stereo (multi) view depth map and a time-of-flight camera. However, the method used for computing or estimating depth maps is not relevant for the embodiments herein; it is assumed that the source videos are provided with depth maps of some or all video frames from these recordings.
The depth maps can have a different resolution than the video frames, and either linear or non-linear quantization can be used to encode the depth values. Regardless of this quantization, it is assumed that the depth values can be interpreted in terms of the absolute distance of the scene to the sensor plane of the image/video/depth acquisition device. Let us denote the spatial coordinates of a depth map as x and y and the depth information as Z(x,y). Furthermore, the reliability of the corresponding depth values, R(x,y), may optionally be provided as a 2-D array with the same size as the depth map. In addition, the maximum depth that can be detected is denoted with Zmax. In order to carry out the detection of the type of the video shot and the obstructing objects, the depth maps of the corresponding video frames are analysed. This can be performed, for example, by positioning a certain amount of non-overlapping regions of interest (ROIs) in the depth maps of the video frames. Figure 5 gives an illustration of using 25 rectangular ROIs. However, the spatial shape and size of the ROIs can be arbitrary, and they are not limited to rectangular shapes. The only requirement for the selection of these ROIs is that there should be one ROI selected as a central ROI and at least one other ROI. Subsequently, the depth within each ROI is extracted. One method to accomplish this is to perform weighted averaging,
Z_k = ( Σ_{(x,y) ∈ ROI(k)} W(x,y) · Z(x,y) ) / ( Σ_{(x,y) ∈ ROI(k)} W(x,y) ),

where ROI(k) contains the spatial coordinates of the kth ROI, and the reliability measures R(x,y) are used as weights W(x,y) if they are available; otherwise the weights W(x,y) are assumed to be unity (i.e., corresponding to plain averaging of the depth values):

W(x,y) = R(x,y) if the reliability values are available, and W(x,y) = 1 otherwise.
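By way of illustration, the weighted averaging above could be computed as follows for rectangular ROIs; the NumPy-based implementation and the (top, bottom, left, right) ROI representation are assumptions made only for this sketch.

    # Sketch of the per-ROI depth extraction with optional reliability weighting.
    import numpy as np

    def roi_depth(Z, roi, R=None):
        """Weighted average of the depth values inside one ROI (weights = reliability, else 1)."""
        top, bottom, left, right = roi
        z = Z[top:bottom, left:right]
        w = R[top:bottom, left:right] if R is not None else np.ones_like(z)
        return float((w * z).sum() / w.sum())

    Z = np.random.uniform(0.5, 10.0, size=(120, 160))     # toy depth map in metres
    R = np.random.uniform(0.5, 1.0, size=Z.shape)         # toy reliability map
    central_roi = (48, 72, 64, 96)
    print(round(roi_depth(Z, central_roi, R), 2))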
Figure 6 shows a possible implementation for detecting the type of video shots on the basis of the depth map of the recorded scene. As a first step (600), the depth values of all ROIs are obtained, for example in the manner described above. Then it is examined (602) whether the depth values of all ROIs meet the criteria of a close-up shot. If a majority of the ROIs (defined by a certain percentage, Ncloseup, of all ROIs) have substantially similar depths that fall within a distance range around the depth of the central ROI (which should be different from Zmax), the distance range being predefined by a distance parameter Dcloseup, then a close-up is detected (604). If the depth values of all ROIs do not meet the criteria of a close-up shot, then it is examined (606) whether the depth values of all ROIs meet the criteria of a medium shot. Accordingly, if the criteria for a close-up are not met and at least Nmedium (a predefined threshold, where Nmedium < Ncloseup) percent of the ROIs have depths that belong to a distance range Dmedium around the depth of the central ROI, which should be different from Zmax, then a medium shot is detected (608). If the criteria for a close-up or a medium shot are not satisfied, then a long shot is detected (610). Finally, the source video is indexed according to the detected shot type (612).

Figure 7 shows a possible implementation for detecting objects that obstruct the view on the basis of the depth map of the recorded scene. Again, as a first step (700), the depth values of all ROIs are obtained, for example in the manner described above. The implementation relies on prior knowledge of the expected location of obstructing objects. For example, when recording an event in a crowded area, obstructing objects are often the people who are between the camera and the scene of interest, and these people occupy the lower portion of the scene. Therefore, based on this information or assumption about the video recording, the expected location of obstructing objects can be defined (702). Next, all the ROIs that fall within the expected location of obstructing objects are detected, and the depth of the detected ROIs is averaged (704). In a similar manner, the depth of the remaining ROIs is averaged (706). The average depth of all the ROIs that fall within the expected location of obstructing objects is compared to the average depth of all other ROIs (708). If the difference between said averaged depths is larger than a predefined threshold, Dobs, then an obstructing object is detected, and the source video is indexed to indicate an obstructing object (710). Naturally, video segments which contain objects that impede the view of the recorded video are less likely to be included in the automatic video remix.
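The threshold logic of Figures 6 and 7 could be sketched as follows; the numeric values of Ncloseup, Dcloseup, Nmedium, Dmedium and Dobs are illustrative, and the assumption that obstructing ROIs are closer to the camera than the remaining ROIs is an interpretation, not something stated explicitly above.

    # Sketch of the shot-type and obstruction decisions; all threshold values are illustrative.
    def classify_shot(roi_depths, central_depth, z_max,
                      n_closeup=0.8, d_closeup=0.5, n_medium=0.5, d_medium=2.0):
        def share_within(d_range):
            hits = [d for d in roi_depths if d < z_max and abs(d - central_depth) <= d_range]
            return len(hits) / len(roi_depths)
        if central_depth < z_max and share_within(d_closeup) >= n_closeup:
            return "close-up"
        if central_depth < z_max and share_within(d_medium) >= n_medium:
            return "medium"
        return "long"

    def obstruction_detected(obstruction_roi_depths, other_roi_depths, d_obs=1.5):
        """Obstruction if the expected-obstruction ROIs are much closer than the rest."""
        near = sum(obstruction_roi_depths) / len(obstruction_roi_depths)
        rest = sum(other_roi_depths) / len(other_roi_depths)
        return (rest - near) > d_obs

    print(classify_shot([1.1, 1.2, 0.9, 1.0], central_depth=1.0, z_max=20.0))   # -> "close-up"
    print(obstruction_detected([0.8, 0.9], [4.0, 5.0, 6.0]))                    # -> True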
According to another embodiment, detecting objects that obstruct the view of the scene in video recordings may be based on a change in the video-shot type. In this embodiment, momentary changes in the video-shot type are detected; it is observed whether there is a change in the video-shot type with a duration that is less than a predefined threshold. The following cases are considered as cases of objects obstructing the view: if, after a long shot, there appears a close-up or a medium shot with a duration shorter than said predefined threshold, or if, after a medium shot, there appears a close-up with a duration shorter than said predefined threshold. The above cases are considered to include the scenario where an object momentarily obstructs the view to the desired scene. For example, a person or a vehicle passing in front of the camera can be such an obstructing object. The detection and the indexing may be carried out for video segments of either fixed or variable length, thereby accommodating changes in the video-shot type or the appearance of obstructing objects during the video recording. The indexing of the video with the detected video-shot type and obstructing objects may be performed by assigning a timestamp (relative to the beginning of the video recording) to the detected events and transmitting this information as video metadata.
The depth map may be utilised in many further embodiments. According to an embodiment, the depth map is used to filter out any content with objects whose distance is beyond a predefined threshold (or thresholds). There may be a minimum distance to be exceeded or a maximum distance not to be exceeded. The video segments with depth greater than the maximum distance or less than the minimum distance may be labeled as "too far content" or "too near content", respectively. This labeling information may be utilised by different applications such as multimedia search, multimedia tagging, automatic remixing, etc.
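A minimal sketch of such distance-based labeling, with illustrative threshold values, could read:

    # Sketch only: label a segment by a representative depth value; thresholds are illustrative.
    def label_by_distance(segment_depth, min_distance=0.5, max_distance=30.0):
        if segment_depth > max_distance:
            return "too far content"
        if segment_depth < min_distance:
            return "too near content"
        return None

    print(label_by_distance(45.0))   # -> "too far content"
    print(label_by_distance(5.0))    # -> None (within the accepted range)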
According to another embodiment, a plurality of end-user image/video capturing devices may be present at an event. For example, this can automatically be detected based on the substantially similar location information (e.g., from GPS or any other positioning system) or via presence of a common audio scene. Then the depth maps from the end-user devices may be used to determine the type of event. For example, if the depth map of multiple end-user devices is static or changing within a threshold for a temporal window under consideration, this may be used to determine that the event involves a static viewing area. A rapidly changing depth map with changes above a predefined threshold may be used to determine that the event is an event with free movement of the users. A depth map that is observed to change less than a predefined threshold may be used to determine that the event is an event with restricted movement of users.
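For example, the event-type decision could be sketched as follows, using the spread of each device's average depth over a temporal window; the thresholds and the use of the median are illustrative assumptions.

    # Sketch of the event-type heuristic based on the temporal variation of per-device depth.
    import statistics

    def event_type(per_device_depth_series, static_thr=0.2, free_move_thr=2.0):
        """Classify the event from the spread of each device's depth over a time window."""
        spreads = [max(series) - min(series) for series in per_device_depth_series]
        typical = statistics.median(spreads)
        if typical <= static_thr:
            return "static viewing area"
        if typical >= free_move_thr:
            return "free movement of users"
        return "restricted movement of users"

    series = [
        [3.1, 3.0, 3.2, 3.1],    # device A: almost constant depth over the window
        [3.4, 3.3, 3.5, 3.4],    # device B
    ]
    print(event_type(series))    # -> "static viewing area"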
According to another embodiment, the depth map and orientation information from a plurality of end-user devices present at an event may be used to determine the relative position of the users at the event. If the orientation of at least two users is within a threshold and their depth maps have a pattern indicating similar object boundaries, the difference in their depth maps may be used to determine their relative position to each other and also in relation to the similar object pattern observed in the depth maps.
Objects of interest, such as a face, can be detected based on the fact that they display only a small change in depth value within the object boundary. The depth at the center of the detected object boundary may be compared with a predefined threshold in order to determine whether the object is too near or too far to be of interest to a wider audience, or whether it is an object of personal interest. If the same object boundary pattern is detected within a temporal window threshold by more than one end-user device at an event, the end-user devices being at an orientation value within a predefined threshold, the distance between the users can be approximated based on the difference in the depth map corresponding to the center of the object.
A skilled man appreciates that any of the embodiments described above may be implemented as a combination with one or more of the other embodiments, unless there is explicitly or implicitly stated that certain embodiments are only alternatives to each other.
The various embodiments may provide advantages over the state of the art. For example, the video remix generation system using a cluster of computing nodes or a server farm in parallel may reduce the time to generate the video remix. The video remix generation time does not increase in direct proportion to the duration of the video remix. The video remix generation time can be controlled based on server load and/or available server hardware. Providing customizable (e.g. based on payment profile) video remix time estimates as well as personalized video remix availability time estimates may improve the user experience. Detecting video-shot types and detecting obstructing objects can be performed without computationally expensive video content analysis. Depending on the choice of ROIs, the complexity of the detection may be reduced in order to enable implementation on a resource-limited portable device. The reliability of the detection of semantic information from content recorded at the events may be improved by exploiting the depth information.
The various embodiments of the invention can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention. For example, a terminal device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the terminal device to carry out the features of an embodiment. Yet further, a network device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment. The various devices may be or may comprise encoders, decoders and transcoders, packetizers and depacketizers, and transmitters and receivers.
It is obvious that the present invention is not limited solely to the above- presented embodiments, but it can be modified within the scope of the appended claims.

Claims:
1. A method for creating a video remix, the method comprising:
obtaining a plurality of source content in a processing device;
determining a plurality of segments from the source content to be included in the video remix;
determining editing processes required to transform the plurality of segments into form suitable for the video remix;
allocating said editing processes to be executed in parallel in at least one processing device; and
merging the plurality of segments received from said editing processes into the video remix.
2. A method according to claim 1, wherein the source content comprises at least one of video, audio and/or image, and said editing processes comprise at least one of the following:
- cutting at least one source content into plurality of segments;
- decoding at least a part of a segment of a source content;
- encoding at least a part of a segment of a source content.
3. A method according to claim 2, wherein
in response to a source video being encoded with a desired encoder and a cutting point of a segment locating at a predicted frame, the method further comprises:
decoding video frames only from said predicted frame to a predicted frame preceding next intra-coded frame of the segment or to an end of the segment if there is no subsequent intra-coded frame in the segment; and
encoding said decoded frames with said desired encoder such that the predicted frame locating at the cutting point of the segment is encoded as an intra-coded frame.
4. A method according to claim 2, wherein in response to a segment of a source content being decoded and re-encoded, the method further comprises:
allocating additional cutting points for said segment.
5. A method according to claim 4, the method further comprising:
allocating said additional cutting points for the segments such that the maximum segment duration is smaller than a predefined threshold, wherein the threshold is set to be equal to an encoding time of a segment with duration equal to the threshold.
6. A method according to claim 4, the method further comprising:
allocating said additional cutting points for the segments such that the maximum segment duration is smaller than a predefined threshold, wherein the threshold is optimized according to available processing power of said at least one processing device.
7. A method according to any of the claims 4 - 6, wherein, in response to a source content comprising auxiliary information enabling estimation of the existence of one or more scene changes, the method further comprises:
allocating said additional cutting points for the segments at or close to the estimated one or more scene changes.
8. A method according to claim 7, wherein
said auxiliary information comprises sensor data with timestamps synchronized with the source content.
9. A method according to any preceding claim, the method further comprising
receiving a user request for creating a video remix, said user request including a request to create the video remix within a time period;
determining an optimal allocation of the editing processes such that the editing processes are optimized according to available processing power of said at least one processing device and the video remix can be created within said time period; and
allocating said editing processes to be executed in parallel in at least one processing device according to said optimal allocation.
10. A method according to any of the claims 1 - 8, the method further comprising
receiving a user request for creating a video remix, said user request including a request to create the video remix within a time period, wherein the time period has been determined by a user device according to workload information from said at least one processing device.
11. A method according to claim 9 or 10, wherein the user request further includes a set of input parameters for determining the time period for generating the video remix, said set of input parameters further including one or more of the following:
- a user preference for a response time in receiving the video remix,
- a user customer profile information,
- user's current presence status.
12. A method according to any preceding claim, the method further comprising
obtaining depth maps for at least some frames of a source video;
detecting at least one of a type of a video shot and an object obstructing a view in the source video based on the depth map; and indexing the source video according to at least one of the detected type of a video shot and the detected object obstructing a view.
13. A method according to claim 12, the method further comprising analysing the depth map of a frame by
dividing the depth map of the frame into at least two non- overlapping region-of-interests, one of them being a central region-of- interest, and calculating the depth of each region-of-interest as a weighted average value of the depth, wherein the weighting is based on reliability values of the depth map.
14. A method according to claim 13, the method further comprising
detecting the type of the video shot included in the source video to a close-up shot, a medium shot or a long shot by comparing the depth of the central region-of-interest to the depths of the remaining region-of-interests, the criteria for detecting the type of the video shot including at least the number of region-of-interests having a substantially similar depth to the depth of the central region-of-interest and residing within a predefined distance from the central region-of- interest.
15. A method according to claim 13, the method further comprising
detecting the object obstructing the view in the source video on the basis of a difference between an averaged depth for region-of- interests having depth substantially at the depth of expected location of obstructing objects and an averaged depth of the remaining region-of- interests.
16. A method according to any of the claims 12 - 15, the method further comprising
performing said indexing by assigning, for the detected type of a video shot or the detected object obstructing a view, a timestamp relative with the beginning of the source video; and
transmitting information relating to said indexing as metadata for the source video.
17. A method according to any of the claims 12 - 16, wherein the depth map and orientation information from a plurality of user devices present at an event is used to determine the relative position of the users at the event.
18. A method according to any of the claims 12 - 17, wherein the depth map from a plurality of user devices present at an event is used to determine the type of the event.
19. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to at least:
obtain a plurality of source content;
determine a plurality of segments from the source content to be included in a video remix;
determine editing processes required to transform the plurality of segments into form suitable for the video remix;
allocate said editing processes to be executed in parallel in at least one processing device; and
merge the plurality of segments received from said editing processes into the video remix.
20. An apparatus according to claim 19, wherein the source content comprises at least one of video, audio and/or image, and said editing processes comprise at least one of the following:
- cutting at least one source content into plurality of segments;
- decoding at least a part of a segment of a source content; - encoding at least a part of a segment of a source content.
21. An apparatus according to claim 20, further comprising computer program code configured to, with the at least one processor, cause the apparatus to at least:
in response to a source video being encoded with a desired encoder and a cutting point of a segment locating at a predicted frame, decode video frames only from said predicted frame to a predicted frame preceding next intra-coded frame of the segment; and encode said decoded frames with said desired encoder such that the predicted frame locating at the cutting point of the segment is encoded as an intra-coded frame.
22. An apparatus according to claim 20, further comprising computer program code configured to, with the at least one processor, cause the apparatus to at least:
in response to a segment of a source content being decoded and re-encoded,
allocate additional cutting points for said segment.
23. An apparatus according to claim 22, further comprising computer program code configured to, with the at least one processor, cause the apparatus to at least:
allocate said additional cutting points for the segments such that the maximum segment duration is smaller than a predefined threshold, wherein the threshold is set to be equal to an encoding time of a segment with duration equal to the threshold.
24. An apparatus according to claim 22, further comprising computer program code configured to, with the at least one processor, cause the apparatus to at least:
allocate said additional cutting points for the segments such that the maximum segment duration is smaller than a predefined threshold, wherein the threshold is optimized according to available processing power of said at least one processing device.
25. An apparatus according to any of the claims 22 - 24, further comprising computer program code configured to, with the at least one processor, cause the apparatus to at least:
in response to a source content comprising auxiliary information enabling estimation of the existence of one or more scene changes,
allocate said additional cutting points for the segments at or close to the estimated one or more scene changes.
26. An apparatus according to claim 25, wherein said auxiliary information comprises sensor data with timestamps synchronized with the source content.
27. An apparatus according to any of the claims 19 - 26, further comprising computer program code configured to, with the at least one processor, cause the apparatus to at least:
receive a user request for creating a video remix, said user request including a request to create the video remix within a time period;
determine an optimal allocation of the editing processes such that the editing processes are optimized according to available processing power of said at least one processing device and the video remix can be created within said time period; and
allocate said editing processes to be executed in parallel in at least one processing device according to said optimal allocation.
28. An apparatus according to any of the claims 19 - 26, further comprising computer program code configured to, with the at least one processor, cause the apparatus to at least:
receive a user request for creating a video remix, said user request including a request to create the video remix within a time period, wherein the time period has been determined by a user device according to workload information from said at least one processing device.
29. An apparatus according to claim 27 or 28, wherein the user request further includes a set of input parameters for determining the time period for generating the video remix, said set of input parameters further including one or more of the following:
- a user preference for a response time in receiving the video remix,
- a user customer profile information,
- user's current presence status.
30. An apparatus according to any of the claims 19 - 29, further comprising computer program code configured to, with the at least one processor, cause the apparatus to at least:
obtain depth maps for at least some frames of a source video; detect at least one of a type of a video shot and an object obstructing a view in the source video based on the depth map; and index the source video according to at least one of the detected type of a video shot and the detected object obstructing a view.
31. An apparatus according to claim 30, further comprising computer program code configured to, with the at least one processor, cause the apparatus to at least:
divide the depth map of the frame into at least two non- overlapping region-of-interests, one of them being a central region-of- interest, and
calculate the depth of each region-of-interest as a weighted average value of the depth, wherein the weighting is based on reliability values of the depth map.
32. An apparatus according to claim 31, further comprising computer program code configured to, with the at least one processor, cause the apparatus to at least:
detect the type of the video shot included in the source video to a close-up shot, a medium shot or a long shot by comparing the depth of the central region-of-interest to the depths of the remaining region-of-interests, the criteria for detecting the type of the video shot including at least the number of region-of-interests having a substantially similar depth to the depth of the central region-of-interest and residing within a predefined distance from the central region-of- interest.
33. An apparatus according to claim 31, further comprising computer program code configured to, with the at least one processor, cause the apparatus to at least:
detect the object obstructing the view in the source video on the basis of a difference between an averaged depth for region-of- interests having depth substantially at the depth of expected location of obstructing objects and an averaged depth of the remaining region-of- interests.
34. An apparatus according to any of the claims 30 - 33, further comprising computer program code configured to, with the at least one processor, cause the apparatus to at least:
perform said indexing by assigning, for the detected type of a video shot or the detected object obstructing a view, a timestamp relative with the beginning of the source video; and
transmit information relating to said indexing as metadata for the source video.
35. An apparatus according to any of the claims 30 - 34, wherein the depth map and orientation information from a plurality of user devices present at an event is used to determine the relative position of the users at the event.
36. An apparatus according to any of the claims 30 - 35, wherein the depth map from a plurality of user devices present at an event is used to determine the type of the event.
37. A computer program embodied on a non-transitory computer readable medium, the computer program comprising instructions causing, when executed on at least one processor, at least one apparatus to:
obtain a plurality of source content;
determine a plurality of segments from the source content to be included in a video remix;
determine editing processes required to transform the plurality of segments into form suitable for the video remix;
allocate said editing processes to be executed in parallel in at least one processing device; and
merge the plurality of segments received from said editing processes into the video remix.
38. A system comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the system to at least:
obtain a plurality of source content; determine a plurality of segments from the source content to be included in a video remix;
determine editing processes required to transform the plurality of segments into form suitable for the video remix;
allocate said editing processes to be executed in parallel in at least one processing device; and
merge the plurality of segments received from said editing processes into the video remix.
39. An apparatus comprising :
means for obtaining a plurality of source content; means for determining a plurality of segments from the source content to be included in a video remix;
means for determining editing processes required to transform the plurality of segments into form suitable for the video remix;
means for allocating said editing processes to be executed in parallel in at least one processing device; and
means for merging the plurality of segments received from said editing processes into the video remix.
40. An apparatus according to claim 39, wherein the source content comprises at least one of video, audio and/or image, the apparatus further comprising :
means for cutting at least one source content into plurality of segments;
means for decoding at least a part of a segment of a source content; and
means for encoding at least a part of a segment of a source content.
41. An apparatus according to claim 40, further comprising: means for decoding, in response to a source video being encoded with a desired encoder and a cutting point of a segment locating at a predicted frame, video frames only from said predicted frame to a predicted frame preceding next intra-coded frame of the segment; and means for encoding said decoded frames with said desired encoder such that the predicted frame locating at the cutting point of the segment is encoded as an intra-coded frame.
42. An apparatus according to claim 40, further comprising means for allocating, in response to a segment of a source content being decoded and re-encoded, additional cutting points for said segment.
43. An apparatus according to any of the claims 39 - 42, further comprising
means for obtaining depth maps for at least some frames of a source video;
means for detecting at least one of a type of a video shot and an object obstructing a view in the source video based on the depth map; and
means for indexing the source video according to at least one of the detected type of a video shot and the detected object obstructing a view.
44. An apparatus according to claim 43, further comprising means for dividing the depth map of the frame into at least two non-overlapping region-of-interests, one of them being a central region-of-interest, and
means for calculating the depth of each region-of-interest as a weighted average value of the depth, wherein the weighting is based on reliability values of the depth map.
45. An apparatus according to claim 44, further comprising means for detecting the type of the video shot included in the source video to a close-up shot, a medium shot or a long shot by comparing the depth of the central region-of-interest to the depths of the remaining region-of-interests, the criteria for detecting the type of the video shot including at least the number of region-of-interests having a substantially similar depth to the depth of the central region- of-interest and residing within a predefined distance from the central region-of-interest.
PCT/FI2011/050599 2011-06-21 2011-06-21 Video remixing system WO2012175783A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP11868268.1A EP2724343B1 (en) 2011-06-21 2011-06-21 Video remixing system
US14/126,385 US9396757B2 (en) 2011-06-21 2011-06-21 Video remixing system
CN201180071774.8A CN103635967B (en) 2011-06-21 2011-06-21 Video remixes system
PCT/FI2011/050599 WO2012175783A1 (en) 2011-06-21 2011-06-21 Video remixing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/FI2011/050599 WO2012175783A1 (en) 2011-06-21 2011-06-21 Video remixing system

Publications (1)

Publication Number Publication Date
WO2012175783A1 true WO2012175783A1 (en) 2012-12-27

Family

ID=47422076

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2011/050599 WO2012175783A1 (en) 2011-06-21 2011-06-21 Video remixing system

Country Status (4)

Country Link
US (1) US9396757B2 (en)
EP (1) EP2724343B1 (en)
CN (1) CN103635967B (en)
WO (1) WO2012175783A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103929655A (en) * 2014-04-25 2014-07-16 网易传媒科技(北京)有限公司 Method and device for transcoding audio and video file
US9380328B2 (en) 2011-06-28 2016-06-28 Nokia Technologies Oy Video remixing system

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013118468A (en) * 2011-12-02 2013-06-13 Sony Corp Image processing device and image processing method
US9659595B2 (en) * 2012-05-31 2017-05-23 Nokia Technologies Oy Video remixing system
US9628702B2 (en) 2014-05-21 2017-04-18 Google Technology Holdings LLC Enhanced image capture
US9729784B2 (en) * 2014-05-21 2017-08-08 Google Technology Holdings LLC Enhanced image capture
US20150350481A1 (en) * 2014-05-27 2015-12-03 Thomson Licensing Methods and systems for media capture and formatting
US10192583B2 (en) 2014-10-10 2019-01-29 Samsung Electronics Co., Ltd. Video editing using contextual data and content discovery using clusters
US10032481B2 (en) * 2016-03-22 2018-07-24 Verizon Digital Media Services Inc. Speedy clipping
JP7118966B2 (en) * 2016-12-13 2022-08-16 ロヴィ ガイズ, インコーポレイテッド Systems and methods for minimizing obstruction of media assets by overlays by predicting the path of movement of an object of interest of the media asset and avoiding placement of overlays in the path of movement
CN109167934B (en) * 2018-09-03 2020-12-22 咪咕视讯科技有限公司 Video processing method and device and computer readable storage medium
CN111147779B (en) * 2019-12-31 2022-07-29 维沃移动通信有限公司 Video production method, electronic device, and medium
US20230011547A1 (en) * 2021-07-12 2023-01-12 Getac Technology Corporation Optimizing continuous media collection

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1416490A1 (en) * 2002-11-01 2004-05-06 Microsoft Corporation Systems and methods for automatically editing a video
WO2004081940A1 (en) * 2003-03-11 2004-09-23 Koninklijke Philips Electronics N.V. A method and apparatus for generating an output video sequence
US20050193421A1 (en) * 2004-02-26 2005-09-01 International Business Machines Corporation Method and apparatus for cooperative recording
US20060251382A1 (en) * 2005-05-09 2006-11-09 Microsoft Corporation System and method for automatic video editing using object recognition
EP2091046A1 (en) * 2008-02-15 2009-08-19 Thomson Licensing Presentation system and method for controlling the same
WO2010119181A1 (en) * 2009-04-16 2010-10-21 Valtion Teknillinen Tutkimuskeskus Video editing system

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7751683B1 (en) * 2000-11-10 2010-07-06 International Business Machines Corporation Scene change marking for thumbnail extraction
US7296231B2 (en) * 2001-08-09 2007-11-13 Eastman Kodak Company Video structuring by probabilistic merging of video segments
US20050141613A1 (en) 2002-03-21 2005-06-30 Koninklijke Philips Electronics N.V. Editing of encoded a/v sequences
US7864840B2 (en) * 2005-04-15 2011-01-04 Inlet Technologies, Inc. Scene-by-scene digital video processing
US20060253857A1 (en) 2005-05-04 2006-11-09 Ulead Systems, Inc. Method for processing a data stream by utilizing multi-processor
US20090196570A1 (en) * 2006-01-05 2009-08-06 Eyesopt Corporation System and methods for online collaborative video creation
EP2160734A4 (en) 2007-06-18 2010-08-25 Synergy Sports Technology Llc System and method for distributed and parallel video editing, tagging, and indexing
JP2010541415A (en) * 2007-09-28 2010-12-24 グレースノート インコーポレイテッド Compositing multimedia event presentations
JP2009260933A (en) 2008-03-17 2009-11-05 Toshiba Corp Video contents editing apparatus, program therefor, and video contents editing method
JP4582185B2 (en) 2008-04-22 2010-11-17 ソニー株式会社 Information processing apparatus and information processing method
US9240214B2 (en) 2008-12-04 2016-01-19 Nokia Technologies Oy Multiplexed data sharing
US8818172B2 (en) * 2009-04-14 2014-08-26 Avid Technology, Inc. Multi-user remote video editing
US8538135B2 (en) * 2009-12-09 2013-09-17 Deluxe 3D Llc Pulling keys from color segmented images
US8867901B2 (en) * 2010-02-05 2014-10-21 Theatrics. com LLC Mass participation movies
US8532469B2 (en) * 2011-06-10 2013-09-10 Morgan Fiumi Distributed digital video processing system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1416490A1 (en) * 2002-11-01 2004-05-06 Microsoft Corporation Systems and methods for automatically editing a video
WO2004081940A1 (en) * 2003-03-11 2004-09-23 Koninklijke Philips Electronics N.V. A method and apparatus for generating an output video sequence
US20050193421A1 (en) * 2004-02-26 2005-09-01 International Business Machines Corporation Method and apparatus for cooperative recording
US20060251382A1 (en) * 2005-05-09 2006-11-09 Microsoft Corporation System and method for automatic video editing using object recognition
EP2091046A1 (en) * 2008-02-15 2009-08-19 Thomson Licensing Presentation system and method for controlling the same
WO2010119181A1 (en) * 2009-04-16 2010-10-21 Valtion Teknillinen Tutkimuskeskus Video editing system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FOOTE, J. ET AL.: "Creating music videos using automatic media analysis", INT. CONF. ON MULTIMEDIA '02, 1 December 2002 (2002-12-01) - 6 December 2002 (2002-12-06), JUAN-LES-PINS, FRANCE, pages 553 - 560, XP001175057 *


Also Published As

Publication number Publication date
EP2724343A1 (en) 2014-04-30
CN103635967A (en) 2014-03-12
EP2724343B1 (en) 2020-05-13
US20140133837A1 (en) 2014-05-15
CN103635967B (en) 2016-11-02
US9396757B2 (en) 2016-07-19
EP2724343A4 (en) 2016-05-11

Similar Documents

Publication Publication Date Title
US9396757B2 (en) Video remixing system
US10762653B2 (en) Generation apparatus of virtual viewpoint image, generation method, and storage medium
KR102650850B1 (en) Video sound processing device, video sound processing method , and computer readable recording medium storing program
JP6948171B2 (en) Image processing equipment and image processing methods, programs
US10244167B2 (en) Apparatus and methods for image encoding using spatially weighted encoding quality parameters
US8879788B2 (en) Video processing apparatus, method and system
US10506248B2 (en) Foreground detection for video stabilization
US20150222815A1 (en) Aligning videos representing different viewpoints
US20160360267A1 (en) Process for increasing the quality of experience for users that watch on their terminals a high definition video stream
US8903130B1 (en) Virtual camera operator
EP2727344B1 (en) Frame encoding selection based on frame similarities and visual quality and interests
JP2018509030A (en) Events triggered by the depth of an object in the field of view of the imaging device
Yaqoob et al. Dynamic viewport selection-based prioritized bitrate adaptation for tile-based 360° video streaming
WO2018100928A1 (en) Image processing device and method
WO2016192467A1 (en) Method and device for playing videos
JP2003061038A (en) Video contents edit aid device and video contents video aid method
CN116614631B (en) Video processing method, device, equipment and medium
CN109561324B (en) Software defined video processing system and method
US11936839B1 (en) Systems and methods for predictive streaming of image data for spatial computing
US11825066B2 (en) Video reproduction apparatus, reproduction method, and program
CN111818300B (en) Data storage method, data query method, data storage device, data query device, computer equipment and storage medium
US20160111127A1 (en) Generating a Composite Video of an Event Having a Moving Point of Attraction
KR20220080696A (en) Depth estimation method, device, electronic equipment and computer readable storage medium
CN116264640A (en) Viewing angle switching method, device and system for free viewing angle video
Kaiser et al. Automatic Camera Selection for Format Agnostic Live Event Broadcast Production

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11868268

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2011868268

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 14126385

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE